Timing accuracy and precision for the way forward for computing
Meta is deploying Precision Time Protocol (PTP), a network timing protocol, to improve the accuracy and precision of timekeeping across our networks and in the many use cases that depend on it. PTP offers a level of timing accuracy and precision that will be foundational as we continue to build toward the metaverse and develop the increasingly complex networks and systems it will require.
To support the billions of people around the globe using our technologies, we need to be confident that every server — in every data center — knows and agrees what time it is, as accurately and precisely as possible. Features like messaging, videoconferencing, online gaming, and even updating or deleting content rely on precise, accurate timing among multiple servers and sometimes even across multiple data centers. The more servers there are between endpoints, the more important synchronization is. Having just one server out of sync with the rest can create noticeable delays and errors.
Network Time Protocol (NTP) has served us well, but it is reaching its limits as we work to improve our products and services and introduce new ones. PTP offers a level of accuracy and precision that NTP simply cannot achieve, and it will significantly reduce the chance of network delays and errors.
After a successful pilot, we’ve started extending PTP into all our data centers.
Why we need PTP
So, what is PTP? And what makes it so important?
It all comes down to accuracy, or how close a computer’s measurement of time is to the actual time, and precision, or how close different computers’ measurements of time are to one another.
In 2002, PTP was introduced as a method to precisely synchronize clocks in a distributed system. A reference computer on the network, known in PTP as a grandmaster clock, holds the current time and sends a time reference to any other computer on the network that asks what time it is. The current time gets sent to the computer via a network data packet (a process called sync messaging) that is used to update the computer’s clock. Essentially, one machine holds time for the other machines on the network.
However, because of network latency, that time is no longer accurate when it arrives at the receiving computer. Latency (also called delay) can happen for several reasons, including:
- The speed of a signal (electrical or optical) traveling across a medium (wire or fiber) is finite (often approximated with the speed of light).
- The conversion time in the transceivers used for those signals may vary based on the temperature of the transceiver.
- The quality of the network routers, switches, and network interfaces.
- The software/driver/firmware stack that must be executed to send or receive a network packet (the layers described by the Open Systems Interconnection, or OSI, model).
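To give a sense of scale for the first item in the list, here is a minimal Python sketch that estimates propagation delay over optical fiber. The refractive index of 1.47 is an assumed typical value, and the function name is illustrative. Real cables vary.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47  # assumption: typical value for optical fiber


def propagation_delay_ns(length_m, refractive_index=FIBER_REFRACTIVE_INDEX):
    """Estimate one-way propagation delay in nanoseconds.

    Light travels more slowly in a medium than in vacuum, by a factor
    equal to the medium's refractive index.
    """
    seconds = length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S
    return seconds * 1e9


# Roughly 5 ns per meter of fiber under the assumed refractive index.
print(round(propagation_delay_ns(1), 2))
print(round(propagation_delay_ns(100), 1))
```

Under these assumptions, a 100-meter fiber run adds several hundred nanoseconds, which is why cable length cannot be ignored when the goal is nanosecond-level precision.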
There is a famous saying: “If you can measure it, you can manage it.” Latency is inevitable and cannot be avoided, but we can compensate for it if we can measure it. If the latency is measured, then it can be added to the current time of the sync message at the client side. However, measuring latency between the time reference computer and the client computer is not a trivial task, because there is no global clock and each of these computers has its own clock.
To measure latency and the clock difference between the reference and client (also called offset), two assumptions must be made: consistency and symmetry. Consistency means that the latency a packet faces while traveling over the network stays the same from one packet to the next, and symmetry means that the latency going from the reference computer to the client computer equals the latency going back in the other direction (client computer to reference computer). Any imperfection in the consistency and symmetry will decrease precision in the client computer’s clock synchronization.
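The arithmetic behind these assumptions can be sketched in a few lines of Python. This is a minimal illustration of the standard two-way timestamp exchange, not Meta's implementation. The function name and the timestamp names t1 through t4 are chosen here for illustration.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Estimate client clock offset and one-way delay, in nanoseconds.

    t1: reference sends the sync message (reference clock)
    t2: client receives the sync message (client clock)
    t3: client sends a delay request (client clock)
    t4: reference receives the delay request (reference clock)

    Assumes the delay is the same in both directions (symmetry).
    """
    forward = t2 - t1   # equals delay + offset
    backward = t4 - t3  # equals delay - offset
    delay = (forward + backward) / 2
    offset = (forward - backward) / 2
    return offset, delay


# Symmetric path: client clock is 500 ns ahead, delay is 1000 ns each way.
print(offset_and_delay(0, 1500, 2000, 2500))  # (500.0, 1000.0)

# Asymmetric path: 1200 ns out, 800 ns back, same true offset of 500 ns.
# The estimate is off by half the asymmetry: (1200 - 800) / 2 = 200 ns.
print(offset_and_delay(0, 1700, 2000, 2300))  # (700.0, 1000.0)
```

The second call shows why symmetry matters: the client cannot distinguish a path asymmetry from a clock offset, so half of any asymmetry appears directly as synchronization error.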
To improve the precision of clock synchronization, it is necessary to maximize the consistency and symmetry in our network. This is where PTP comes in. PTP uses hardware timestamping and transparent clocks to improve consistency and symmetry, respectively.
PTP has already been heavily supported by the telecom industry as networks transition to 5G connectivity. PTP’s additional precision and accuracy will be vital as 5G brings higher than ever network bandwidth to devices all over the world. Even though the telecom industry has been using PTP for over a decade, the hyperscale data centers have been slow to adopt PTP — until now.
Compared with NTP, PTP allows hosts to be synchronized to one common source of time with much higher precision. Where NTP allows for precision within milliseconds, PTP allows for precision within nanoseconds.
How PTP outperforms NTP
Migrating our systems to PTP has taken years of engineering because of a fundamental difference in how NTP and PTP systems operate.
Systems that use NTP are asynchronous. They’re distributed systems with no global clock. They do their jobs independently, but they check in with one another to make sure they’re in sync. The problem with this is that, as a system grows, more and more of these check-ins are required. And the more check-ins that are done, the slower the network runs.
NTP is also prone to variance and latency because of how it keeps time. Depending on its implementation, NTP uses either a logical clock or physical clock method. A logical clock is an older method that times things as a sequence of steps — one after the other.
A physical clock is a newer method used in distributed databases where tasks are scheduled on a clock and ordered accordingly. Rather than a central, common clock, each node uses its own clock. To help ensure that all these clocks are in sync, engineers will deliberately add in a delay to compensate for network latency.
An easy way to think about NTP is to think of the clock in a microwave. A microwave keeps time on-device. If there’s a time shift, like the switch to daylight saving time, the clock needs to be adjusted manually and checked against some source of truth (e.g., another trusted clock).
PTP, on the other hand, works more like the clock on a smartphone. When daylight saving time begins or ends, or the phone moves to a new time zone, a smartphone’s clock updates its time on its own by cross-referencing the time over a network. In the same way that smartphone clocks can update themselves, PTP allows systems to be synchronized and rely on a single source of truth for timing.
Migrating from NTP to PTP
While PTP is more precise than NTP (measuring in nanoseconds versus milliseconds), it also places more demands on network hardware. As Meta’s engineers were working to implement PTP, we quickly found that off-the-shelf components weren’t designed to handle PTP at scale. One important component of PTP, the Server Clock, provides standard time information to other clocks across a network. Think of it like an icemaker, disseminating packets of time to all the other machines on the network like ice cubes.
There are also boundary clocks and transparent clocks sitting between the Server Clock and the various network nodes. Boundary clocks are like middle managers that communicate and sync with the Server Clock and provide time to the devices underneath. Imagine ice cubes being sent through a pipe, but the pipe is getting hotter. Boundary clocks are like a fridge holding onto the ice cubes. The problem is that if the ice is already a bit melted, the fridge only keeps it from melting further. Transparent clocks try to mitigate this by measuring how long each packet is delayed inside a network device and adjusting for that delay to improve synchronization. They’re like insulation on pipes.
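How a transparent clock adjusts for delay can be sketched as follows. This is a simplified Python model with hypothetical names (SyncPacket, correction_ns, and so on). It keeps the correction in whole nanoseconds, and it is not the code Meta open-sourced. The idea is that each switch records how long the packet sat inside it and adds that time to a correction value carried in the packet, so the client can subtract the variable queuing time.

```python
from dataclasses import dataclass


@dataclass
class SyncPacket:
    origin_timestamp_ns: int  # stamped by the Server Clock when sent
    correction_ns: int = 0    # total time spent inside switches so far


def pass_through_switch(packet, ingress_ns, egress_ns):
    """Model a transparent clock: add the packet's residence time.

    ingress_ns and egress_ns are read from the switch's own clock, so
    only their difference matters, not the switch's absolute time.
    """
    packet.correction_ns += egress_ns - ingress_ns
    return packet


def link_delay_ns(total_transit_ns, packet):
    """Remove switch residence time, leaving only the link delay."""
    return total_transit_ns - packet.correction_ns


# Two packets cross the same two switches but hit different queues.
busy = SyncPacket(origin_timestamp_ns=0)
pass_through_switch(busy, ingress_ns=100, egress_ns=400)    # 300 ns inside
pass_through_switch(busy, ingress_ns=500, egress_ns=1700)   # 1200 ns inside

quiet = SyncPacket(origin_timestamp_ns=0)
pass_through_switch(quiet, ingress_ns=100, egress_ns=150)   # 50 ns inside
pass_through_switch(quiet, ingress_ns=250, egress_ns=320)   # 70 ns inside

# Assume 300 ns of cable delay in total for both packets.
print(link_delay_ns(300 + 1500, busy))   # 300
print(link_delay_ns(300 + 120, quiet))   # 300
```

After the correction is subtracted, both packets report the same delay even though one waited far longer in switch queues, which restores the consistency that the offset calculation depends on.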
We decided to eliminate boundary clocks from the system altogether so that each machine would be speaking directly to the Server Clock. But how do you sync your machines to a single clock on a global scale?
Usually, you’d rely on GPS synchronization. You can say a data center in the United States is as precise as a data center in Ireland, for example, because you know the GPS is precise. But this is only true in theory, because there is no way to be in two places at the same time to verify it, and we wanted to actually verify it.
To create our own source of truth, we built our own Time Appliance, an open source device capable of supporting PTP at Meta’s scale. The Time Appliance consists of a GNSS receiver and a miniaturized atomic clock (MAC), and can keep accurate time, even if GNSS connectivity is lost. While building our Time Appliance, we also invented a Time Card, a PCIe card that can turn any commodity server into a Time Appliance. We then worked with the Open Compute Project to establish the Open Compute Time Appliance Project and open-sourced every aspect of the Open Time Server.
Benefits of PTP
It won’t be possible to move on to next-generation compute platforms and the metaverse without solving the tight synchronization requirements that PTP can address. PTP offers benefits for the products and services of the future that will drive the metaverse, and it also has important implications for today’s products and services.
Think about something as common as sending a message over Messenger. Thanks to network timing, someone can send a message to a friend on the other side of the world and have it appear in real time. This won’t happen if timing among servers isn’t correct. PTP will make everyday network transactions like these even faster. And it will help systems better detect and avoid network congestion.
Another example where PTP can help is reducing lag in gaming, a notorious pain point for anyone who has played games online. Lag happens because systems are out of sync. As cloud-based gaming becomes more commonplace, and the games themselves become more graphically intense, PTP’s ability to mitigate lag will make it an important piece of the future of gaming.
And for more business-minded people, the same goes for video conferencing and remote work. Everything from today’s video conference calls to the new possibilities with remote work and collaboration will stand to reap the benefits of PTP.
We believe that, among its other applications, PTP has the potential to enable synchronization of GPUs across data centers, which could open up unprecedented scale in AI capabilities that is difficult to achieve today. This level of accuracy will help ensure synchronization of not only the computers on our networks today but also the advanced systems that will be on our networks in the future.
Next steps for PTP
To help increase adoption of PTP, we’re open-sourcing all our PTP-related work (our Time Appliance and source code, client software, and transparent clock). Vendors who produce networking equipment need to introduce new equipment that supports PTP, and we want to help support them.
In the coming years, we believe PTP will become the standard for keeping time in computer networks, and it will be a foundational component of the technologies that will drive the metaverse.