Using short-lived certificates to protect TLS secrets
- Short-lived certificates (SLCs) are part of our latest efforts to further secure our Transport Layer Security (TLS) private keys on our edge networks.
- SLCs have a very short exposure compared to traditional certificates and lower the chances of a compromised private key being abused.
- Implementing SLCs has required us to address tradeoffs between operability and reliability, while satisfying the strict security requirements of our edge environment.
To create an optimal experience for the people who use our products and services all over the world, Meta runs a widely distributed network with many points of presence spread across different geographies. We call this the edge of our network, and it is also the entry point into Meta’s infrastructure for our end user requests.
On the edge, our Transport Layer Security (TLS) deployment helps ensure end-to-end security for our applications over the internet.
Now, we’ve introduced short-lived certificates (SLCs) as part of our latest efforts to secure our TLS private keys. As the name indicates, SLCs have a much shorter validity period than traditional certificates (on the order of days, as opposed to months or years). With a shorter validity period, the potential for abuse if the private key is compromised is lower. This approach also preserves the benefits of performing all sign operations at the edge, such as capacity and latency, and improves the reliability of the setup.
The evolution of Meta’s public-facing TLS infrastructure
Let’s go over the evolution of our TLS infrastructure over the years. When a client (an app or browser) wants to send a request to a dynamic application, it first needs to complete a TLS handshake with Proxygen, our L7 load balancer in the point of presence (PoP), to obtain a secure connection. Previously, Proxygen loaded both the certificate and the private key in memory to complete TLS handshakes. However, loading the private keys directly in Proxygen made us potentially vulnerable to Heartbleed-type attacks. To address this, we run an OffloadService on the edge in a separate, locked-down environment that is not accessible from the external internet. The OffloadService loads the private key in memory, and Proxygen offloads the TLS sign operations to it during TLS termination.
Certificate rotation still required a Proxygen service push, and this was starting to slow us down. We decided to use the OffloadService to enable certificate rotation at runtime. This architecture improved security by decoupling the load balancer from the TLS infrastructure, and it also improved reliability and operational agility.
The challenges of securing TLS private keys
We examined several methods for better securing our TLS private keys or reacting quickly if a key is compromised.
Protocol layer revocation
Classic revocation mechanisms such as certificate revocation lists (CRLs) and the Online Certificate Status Protocol (OCSP) offer ways to revoke certificates, but they have some drawbacks. Each setup of a secure channel may require access to a CRL distribution point (DP) or an OCSP responder. This adds latency and increases the number of failure scenarios for the system. Overall, a system that includes revocation checking is more complex and less reliable than one that does not.
Remote offload
Remote offload refers to the TLS certificate infrastructure for PoPs where private keys for high-value certificates would be kept in our datacenter regions. During a TLS handshake, the sign operation is completed by an RPC from the edge to the OffloadService running in our datacenter. As expected, this approach increases the latency of the TLS handshake. We deployed this to a few clusters for six months and it ran smoothly. However, these clusters are relatively well connected to their nearest datacenter. Remote offload is vulnerable to datacenter outages and backbone issues. As a result, we decided not to roll it out to the rest of the edge infrastructure.
Delegated credentials
To improve reliability and security tradeoffs, we also experimented with delegated credentials. But this needs further adoption from browsers to have the desired effect across our entire user base.
How we built SLCs
We have been charting a path towards reducing the lifetime of certificates. Years ago, we started with certificates with a years-long lifespan, then reduced it to a few months. Currently, our certificates are only valid for a few days. Given the step-function increase in security and the days-long lifespan of certificates, we have built robust automation to keep the certificate pipeline reliable.
We built a ConfigBuilder with reliability as our top priority, as certificates have only days’ worth of lifespan. ConfigBuilder picks the recently issued certificates, runs validation and testing, and updates the OffloadService configuration, which is picked up at runtime. OffloadService loads the new certificate-key pairs and starts serving them for user traffic right away. To eliminate transient sign errors during certificate rotation, OffloadService continues to perform sign operations with both the old and the new private keys for a short duration.
In our implementation of SLCs, we decided to limit the private key exposure to 10 days and rotate the certificates every day. Previously, with longer, 90-day certificates, we would request renewal 30-45 days before expiration. It wasn’t a big deal if certificate rotation was blocked for a few days. SLCs, on the other hand, require certificate issuance and rotation of all our public certificates every single day, which means our pipeline has to operate almost flawlessly.
Here are the main requirements of this system:
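The overlap between old and new keys during rotation can be illustrated with a minimal sketch. This is not OffloadService code: `SigningKeyRing`, its methods, and the key identifiers are hypothetical names, HMAC-SHA256 stands in for a real private-key signature so the example runs with the standard library alone, and the old key is retired on the next rotation rather than after a timed grace period.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SigningKeyRing:
    """Keeps the newest key plus the one it replaced (hypothetical structure)."""

    max_keys: int = 2
    _keys: Dict[str, bytes] = field(default_factory=dict)
    _order: List[str] = field(default_factory=list)  # oldest key_id first

    def rotate_in(self, key_id: str, secret: bytes) -> None:
        """Load a new key and retire the oldest once more than max_keys are held."""
        if key_id not in self._keys:
            self._order.append(key_id)
        self._keys[key_id] = secret
        while len(self._order) > self.max_keys:
            del self._keys[self._order.pop(0)]

    def sign(self, key_id: str, payload: bytes) -> bytes:
        """Sign with whichever loaded key the caller's certificate maps to."""
        secret = self._keys.get(key_id)
        if secret is None:
            raise KeyError(f"no key loaded for {key_id}")
        # Assumption: HMAC-SHA256 is a stdlib stand-in, NOT a real TLS signature.
        return hmac.new(secret, payload, hashlib.sha256).digest()
```

With two keys loaded, a load balancer instance that has not yet picked up the new certificate can still get valid signatures for the old one, which is the property the rotation overlap is meant to provide.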
- External dependencies: We use trusted and well-known third-party certificate authorities (CAs) to issue our certificates. Given that we need to rotate certs every day, we wanted to keep certificate issuance dependency out of the critical path. We decided to issue the certificates with a lifespan of a few months, but only distribute them 10 days before expiration. With this pipeline in place, in case of some unforeseen CA operational issues, the certificates are available well in advance before they need to be used.
- Locking down private keys: During certificate issuance, the private keys are stored in a locked-down secret store that cannot be accessed by the edge services. Ten days before expiration, the necessary secrets are moved to another store that is accessible only by the OffloadService on the edge.
- Backup certificate: If somehow our private keys are exposed, we need to do a fast rotation. As mentioned above, the future credentials that are issued well in advance can also be used as a backup.
- Backfill: We wanted to keep certificate validity details transparent to non-security teams. This required a major update in our certificate management service. (More details in the section on “Certificate issuance overhaul”).
- Reliability: Given the high volume of certificates and time sensitive rotation, we need high availability and reliability for the end-to-end flow. At this scale, we need to minimize manual intervention.
- Clock skew: Since the certificates are issued well in advance, clock skew should not be a problem for the ‘valid_from’ date. We assumed that 10 days of remaining validity would give us enough buffer in case of clock skew. During our careful deployment, we saw no noticeable increase in TLS handshake errors. After the deployment was complete, we analyzed the client data further to understand the effects of shorter cert validity.
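The timing described in the requirements above (issue months ahead, expose the key to the edge only for the final 10 days) reduces to simple date arithmetic. The sketch below is illustrative only: the function names and the store labels `"locked_down_store"` and `"edge_store"` are made up for this example, and only the 10-day figure comes from the text.

```python
from datetime import date, timedelta

EXPOSURE_DAYS = 10  # from the text: keys reach the edge 10 days before expiry


def edge_release_date(not_after: date, exposure_days: int = EXPOSURE_DAYS) -> date:
    """Day on which the key is moved to the store the edge can read."""
    return not_after - timedelta(days=exposure_days)


def key_location(today: date, not_after: date) -> str:
    """Where a certificate's private key should live on a given day.

    The returned labels are hypothetical names, not real system identifiers.
    """
    if today > not_after:
        return "expired"
    if today >= edge_release_date(not_after):
        return "edge_store"
    return "locked_down_store"
```

For a certificate expiring on April 11, 2023, `key_location(date(2023, 3, 1), date(2023, 4, 11))` yields `"locked_down_store"`, and from April 1 onward the result is `"edge_store"`. Certificates still in the locked-down store are the ones that could serve as backups for a fast rotation.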
Scaling our infrastructure for SLCs
Certificate issuance overhaul
Our internal system for managing all our public certificates has many internal users (systems and engineers) beyond our load balancers. It provides an easy interface for requesting certificates and supports automated renewals. We wanted to keep management of SLCs transparent to all our users. To set up a new site that needs a certificate, our management system bootstraps the corresponding SLC series without additional effort from the requestor. It also ensures that we always have a certificate available for any upcoming expiration date and backfills any missing certificate.
The issued certificates are then grouped together by their expiration date into “cert bundles.” For example, all certs expiring on April 11th will be added to a bundle named “expires_2023_04_11.”
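As a rough illustration of the grouping step, the sketch below buckets certificates by expiration date using the bundle naming pattern from the example above. The input shape (pairs of hostname and expiration date) and the function names are assumptions made for the example, not the management system's actual interface.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def bundle_name(expires: date) -> str:
    """Bundle name in the 'expires_YYYY_MM_DD' pattern used in the text."""
    return f"expires_{expires:%Y_%m_%d}"


def group_into_bundles(certs: Iterable[Tuple[str, date]]) -> Dict[str, List[str]]:
    """Group (hostname, expiration date) pairs into named cert bundles."""
    bundles: Dict[str, List[str]] = defaultdict(list)
    for hostname, expires in certs:
        bundles[bundle_name(expires)].append(hostname)
    return dict(bundles)
```

Keying the bundle on the expiration date means a single name identifies everything that must be rotated on a given day.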
The volume of certificates was a challenge on the distribution side as well. For longer-validity certs, we usually renew fewer than 10 certs at a time. With SLCs, we renew every cert every single day. This increases our load 10x. During daily SLC rotation, the OffloadService fetches the private key from the secret store to load it in memory. The simultaneous fetch requests (~100x) started to overload the secret store with a sudden spike. We improved caching and staggered the secret fetch requests in the OffloadService to prevent request spikes to the secret store. We use the prev-current-next model to handle any synchronization delays during certificate rotation.
Every day, ConfigBuilder picks a candidate cert bundle with an exposure of 10 days (for example, on April 1, the bundle “expires_2023_04_11” will be selected). Before distribution, all candidate certs need to be validated. ConfigBuilder canaries the new configuration to a limited set of OffloadService instances and, by extension, Proxygen instances.
Data from these canaries is used to establish the status of TLS handshakes. This test, however, is not exhaustive due to the breadth of our use cases. Some cases may not be detected, for example, certificates missing from the bundle or incompatible with the Proxygen version running on a different machine. To gain more confidence in the validity of our certificate pipeline, the cert bundle is pushed to a wider deployment (on the order of 100 machines). After baking for a day, the cert bundle is then rolled out to the rest of the edge fleet.
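One common way to stagger fetches and add caching is sketched below. It is an illustration under assumptions, not the OffloadService implementation: the one-hour window, the hash-based delay, and the names `fetch_delay_seconds` and `CachedSecretFetcher` are all hypothetical.

```python
import hashlib
from typing import Callable, Dict


def fetch_delay_seconds(host_id: str, window_seconds: int = 3600) -> int:
    """Spread hosts across a window by hashing the host id (assumed scheme)."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


class CachedSecretFetcher:
    """Fetches each secret from the backing store at most once per process."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # hypothetical callable that talks to the secret store
        self._cache: Dict[str, bytes] = {}

    def get(self, secret_name: str) -> bytes:
        # Only the first request for a name reaches the secret store.
        if secret_name not in self._cache:
            self._cache[secret_name] = self._fetch(secret_name)
        return self._cache[secret_name]
```

A hash-derived delay gives each host a stable offset, so the same host fetches at the same point in the window every day, while the fleet as a whole is spread roughly evenly across it.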
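The daily selection rule can be written down directly: the candidate is the bundle that expires 10 days from today. The sketch below also shows one possible reading of the prev-current-next model, in which a host keeps yesterday's, today's, and tomorrow's candidates loaded. That interpretation, and all function names, are assumptions rather than details confirmed by the text.

```python
from datetime import date, timedelta
from typing import Dict

EXPOSURE_DAYS = 10  # from the text: bundles are selected 10 days before expiry


def bundle_name(expires: date) -> str:
    """Bundle name in the 'expires_YYYY_MM_DD' pattern used in the text."""
    return f"expires_{expires:%Y_%m_%d}"


def candidate_bundle(today: date, exposure_days: int = EXPOSURE_DAYS) -> str:
    """Bundle to start distributing today."""
    return bundle_name(today + timedelta(days=exposure_days))


def prev_current_next(today: date) -> Dict[str, str]:
    """Assumed reading of prev-current-next: three adjacent daily candidates."""
    one_day = timedelta(days=1)
    return {
        "prev": candidate_bundle(today - one_day),
        "current": candidate_bundle(today),
        "next": candidate_bundle(today + one_day),
    }
```

Calling `candidate_bundle(date(2023, 4, 1))` returns `"expires_2023_04_11"`, matching the example in the text.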
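The staged rollout (canary, then roughly 100 machines with a one-day bake, then the rest of the fleet) can be modeled as a small decision function. This is a sketch under assumptions: the canary size and bake time, the error-rate tolerance, and the names `Stage`, `STAGES`, and `decide` are invented for illustration. Only the ~100-machine stage and its one-day bake come from the text.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    max_hosts: Optional[int]  # None means the whole fleet
    bake: timedelta           # minimum time to observe before advancing


STAGES = [
    Stage("canary", 10, timedelta(hours=1)),  # size and bake time are assumed
    Stage("wide", 100, timedelta(days=1)),    # ~100 machines, one-day bake
    Stage("fleet", None, timedelta(0)),
]


def decide(stage: Stage, elapsed: timedelta, error_rate: float,
           baseline_rate: float, tolerance: float = 0.001) -> str:
    """Return 'rollback', 'hold', or 'advance' for the current stage.

    error_rate and baseline_rate are TLS handshake error fractions; the
    tolerance is an assumed threshold, not a figure from the text.
    """
    if error_rate > baseline_rate + tolerance:
        return "rollback"
    if elapsed < stage.bake:
        return "hold"
    return "advance"
```

Checking the error rate before the bake timer means a bad bundle is pulled as soon as it is detected rather than after the full bake period.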
Future work for protecting TLS private keys
We have come a long way since the early days of our TLS infra and have dramatically improved our security posture by reducing the lifetime of certificates from the order of years to days. We continue to improve the reliability of this pipeline, to further harden secret management using hardware-based techniques, and to evaluate further reductions in certificate lifetime.
This effort would not have been possible without our amazing partners, especially Xiangyu Bu, Puneet Mehra, Anirudh Ramachandran, and Kai Yuan Thng.