How Facebook keeps its large-scale infrastructure hardware up and running

Facebook’s services run on fleets of servers in data centers around the world, all running the applications and delivering the performance our services need. That is why we need to ensure that our server hardware is reliable and that server hardware failures at our scale cause as little disruption as possible.

Hardware components themselves can fail for a number of reasons, including material degradation (e.g., the mechanical components of a spinning hard drive), a device being used beyond its useful life (e.g., NAND flash devices), environmental impact (e.g., corrosion due to humidity), and manufacturing defects.

In general, we always expect some level of hardware failure in our data centers, which is why we operate systems such as our cluster management system to minimize business disruption. In this article, we introduce four key methods we use to maintain a high level of hardware availability. We have built systems that can detect and remediate problems. We monitor and remediate hardware events without affecting application performance. We take proactive approaches to hardware repairs and use prediction methods for remediations. And we automate root cause analysis for hardware and system failures at scale so we can get to the bottom of problems quickly.

How we handle hardware remediation

We regularly run a tool called MachineChecker on each server to detect hardware and connectivity failures. Once MachineChecker creates an alert in a centralized alert-handling system, a tool called Facebook Auto-Remediation (FBAR) picks up the alert and executes customizable remediations to fix the error. To make sure there is still enough capacity for Facebook’s services, we can also set rate limits that restrict how many servers are being repaired at any one time.

If FBAR can’t bring a server back to a healthy state, the failure is passed to a tool called Cyborg. Cyborg can perform lower-level fixes such as firmware or kernel upgrades and reimaging. If the issue requires manual repair from a technician, the system creates a ticket in our repair ticketing system.
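To make the escalation flow above concrete, here is a minimal sketch in Python. All names (`Alert`, `RemediationPipeline`, the failure-type strings, and the placeholder fix logic) are hypothetical illustrations, not Facebook-internal APIs; the point is the staged escalation and the rate limit on how many servers are out for repair at once.

```python
# Illustrative sketch of the alert -> automated fix -> deeper automated fix ->
# manual-ticket escalation described above. All names are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    FIXED = auto()
    NEEDS_MANUAL_REPAIR = auto()
    RATE_LIMITED = auto()


@dataclass
class Alert:
    hostname: str
    failure_type: str  # e.g. "dimm_uncorrectable", "nic_link_down"


class RemediationPipeline:
    def __init__(self, max_concurrent_repairs: int):
        # Rate limit so repairs never remove too much serving capacity at once.
        self.max_concurrent_repairs = max_concurrent_repairs
        self.in_repair: set[str] = set()

    def handle(self, alert: Alert) -> Outcome:
        # Rate limit: don't take more servers out of production than allowed.
        if len(self.in_repair) >= self.max_concurrent_repairs:
            return Outcome.RATE_LIMITED  # alert is retried later

        # Step 1: automated software-level remediation (FBAR-like).
        if self._auto_remediate(alert):
            return Outcome.FIXED
        # Step 2: heavier automated fixes (Cyborg-like): firmware or
        # kernel upgrade, reimage.
        if self._firmware_or_reimage(alert):
            return Outcome.FIXED
        # Step 3: automation failed; open a ticket for a technician and
        # count the host against the concurrency budget until repaired.
        self.in_repair.add(alert.hostname)
        self._open_repair_ticket(alert)
        return Outcome.NEEDS_MANUAL_REPAIR

    def mark_repaired(self, hostname: str) -> None:
        self.in_repair.discard(hostname)

    def _auto_remediate(self, alert: Alert) -> bool:
        return alert.failure_type == "nic_link_down"  # placeholder logic

    def _firmware_or_reimage(self, alert: Alert) -> bool:
        return alert.failure_type == "stale_firmware"  # placeholder logic

    def _open_repair_ticket(self, alert: Alert) -> None:
        print(f"ticket: replace hardware on {alert.hostname}")


pipeline = RemediationPipeline(max_concurrent_repairs=100)
print(pipeline.handle(Alert("host123.example.com", "dimm_uncorrectable")))
# -> Outcome.NEEDS_MANUAL_REPAIR (a repair ticket is opened for a technician)
```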

We take a closer look at this process in our article “Hardware remediation at scale.”

The process of detecting and correcting hardware errors.

How we minimize the negative impact of error reporting on server performance

MachineChecker detects hardware errors by checking various server logs for error reports. When a hardware failure occurs, it is typically detected by the system (e.g., when a parity check fails), and an interrupt signal is sent to the CPU to handle and log the failure.

Since these interrupt signals are considered high priority, the CPU halts normal operation and handles the error, which has a negative impact on the server’s performance. For reporting correctable memory errors, for example, a traditional system management interrupt (SMI) would stall all CPU cores, while a corrected machine check interrupt (CMCI) would stall only one of the CPU cores and leave the rest available for normal operation.

Although the CPU stalls typically last only a few hundred milliseconds, they can disrupt services that are latency sensitive. At our scale, this means interrupts on a handful of machines can have a cascading adverse performance impact at the service level.

To minimize the performance impact of error reporting, we implemented a hybrid memory error reporting mechanism that uses both CMCI and SMI without sacrificing the accuracy of correctable memory error counts.
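As a rough illustration of the trade-off, here is a hypothetical sketch of a hybrid reporting policy that switches between the two mechanisms based on the observed correctable-error rate. The threshold, window size, and switching direction are assumptions made for illustration only; the actual mechanism is described in the linked article.

```python
# Minimal sketch of a hybrid reporting policy: below an error-rate threshold,
# use the accurate but costly SMI path; above it, switch to CMCI so only one
# core is stalled per interrupt. Threshold and policy are illustrative
# assumptions, not the production mechanism.
from collections import deque
import time


class HybridErrorReporter:
    def __init__(self, errors_per_minute_threshold: int = 10):
        self.threshold = errors_per_minute_threshold
        self.recent_errors: deque[float] = deque()
        self.mode = "SMI"

    def record_correctable_error(self) -> str:
        now = time.monotonic()
        self.recent_errors.append(now)
        # Keep a one-minute sliding window of correctable-error timestamps.
        while self.recent_errors and now - self.recent_errors[0] > 60:
            self.recent_errors.popleft()

        # High error rate: frequent SMIs would stall every core, so fall back
        # to CMCI (one stalled core). Low rate: SMI's whole-socket stall is
        # rare enough to be acceptable and keeps the error counts exact.
        self.mode = "CMCI" if len(self.recent_errors) > self.threshold else "SMI"
        return self.mode
```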

This is discussed in detail in our article “Optimizing Interrupt Handling Performance for Memory Failures in Large Data Centers”.

CPU stall time from memory error reporting: SMI vs. CMCI.

How we use machine learning to predict repairs

As we often introduce new hardware and software configurations into our data centers, we also need to create new rules for our auto-remediation system.

When the automated system cannot fix a hardware failure, the issue is assigned a ticket for manual repair. New hardware and software mean new types of potential failures that need to be addressed. But there can be a gap between when new hardware or software is deployed and when new remediation rules are introduced. During this gap, some repair tickets may be classified as “undiagnosed,” meaning the system has not suggested a repair action, or “misdiagnosed,” meaning the suggested repair action was not effective. This means more labor and system downtime while technicians have to diagnose the problem themselves.

To bridge this gap, we built a machine learning framework that learns from how failures have been fixed in the past and tries to predict what repairs are needed for current undiagnosed and misdiagnosed repair tickets. Based on the cost and benefit of incorrect and correct predictions, we assign a prediction confidence threshold to each repair action and optimize the order of the repair actions. For example, in some cases we would prefer to try a reboot or firmware upgrade first, since these kinds of repairs don’t require physical hardware repair and take less time to complete, so the algorithm should recommend them first. Machine learning not only allows us to predict how to fix an undiagnosed or misdiagnosed issue, it also allows us to prioritize the most important fixes.
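The thresholding and ordering step can be sketched as follows. The action names, thresholds, and cost values below are made up for illustration, and the per-action model scores are assumed to come from some upstream classifier; this is not Facebook’s production logic, just the shape of the decision it describes.

```python
# Hypothetical sketch: each repair action has its own confidence threshold
# (reflecting the cost of acting on a wrong prediction), and eligible actions
# are suggested cheapest first, so non-physical fixes like a reboot or
# firmware upgrade are tried before a hardware swap.
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairAction:
    name: str
    confidence_threshold: float  # minimum model score to recommend it
    cost: int                    # lower = cheaper/faster (no physical work)


ACTIONS = [
    RepairAction("reboot", confidence_threshold=0.30, cost=1),
    RepairAction("firmware_upgrade", confidence_threshold=0.40, cost=2),
    RepairAction("reimage", confidence_threshold=0.50, cost=3),
    RepairAction("swap_dimm", confidence_threshold=0.80, cost=10),
    RepairAction("swap_drive", confidence_threshold=0.80, cost=10),
]


def recommend(scores: dict[str, float]) -> list[str]:
    """Given model scores per action for one undiagnosed ticket, return the
    actions worth trying, cheapest first."""
    eligible = [a for a in ACTIONS if scores.get(a.name, 0.0) >= a.confidence_threshold]
    return [a.name for a in sorted(eligible, key=lambda a: a.cost)]


# Example: the model is fairly sure a DIMM swap would fix the issue, but a
# reboot also clears its (lower) threshold, so the cheap action comes first.
print(recommend({"reboot": 0.35, "swap_dimm": 0.85}))
# -> ['reboot', 'swap_dimm']
```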

For more information, see our article “Predicting remediations for hardware failures in large-scale data centers.”

The hardware failure remediation flow with repair predictions.

How we automated root cause analysis at the fleet level

In addition to server logs that record reboots, kernel panics, out-of-memory events, and so on, our production system also has software and tooling logs. But because of the scale and complexity of all of these, it is difficult to examine all the logs together to find correlations between them.

We implemented a scalable root cause analysis (RCA) tool that sorts through millions of log entries (each described by potentially hundreds of columns) to find easy-to-understand and actionable correlations.

By pre-aggregating the data with Scuba, a real-time in-memory database, we significantly improved the scalability of a traditional pattern mining algorithm, FP-Growth, for finding correlations in this RCA framework. We also added a set of filters on the reported correlations to make the results easier to interpret. We have deployed this analyzer widely within Facebook for RCA on hardware component failure rates, unexpected server reboots, and software failures.
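To give a feel for the pattern-mining step, here is a small sketch using pandas and the open-source mlxtend implementation of FP-Growth as stand-ins for the production pipeline. The log columns, the failure label, and the lift-style filter threshold are invented for the example; the real system works on pre-aggregated Scuba data at far larger scale.

```python
# Illustrative sketch of the RCA idea: pre-aggregate log entries into one-hot
# feature columns, mine frequent patterns with FP-Growth, and keep only
# patterns that are noticeably more common among failed hosts than fleet-wide.
# pandas + mlxtend are stand-ins; columns and thresholds are made up.
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# Each row is a host; boolean columns are (attribute=value) features derived
# from aggregated hardware/software logs, plus the outcome we are explaining.
logs = pd.DataFrame(
    {
        "firmware=v2.1": [True, True, True, False, False, False],
        "vendor=A": [True, True, False, False, True, False],
        "kernel=5.6": [True, True, True, True, False, False],
        "unexpected_reboot": [True, True, True, False, False, False],
    }
)

failed = logs[logs["unexpected_reboot"]].drop(columns=["unexpected_reboot"])
fleet = logs.drop(columns=["unexpected_reboot"])

# Frequent itemsets among the failed hosts vs. across the whole fleet.
failed_patterns = fpgrowth(failed, min_support=0.6, use_colnames=True)
fleet_patterns = fpgrowth(fleet, min_support=0.01, use_colnames=True)
fleet_support = {frozenset(s): sup for sup, s in fleet_patterns.itertuples(index=False)}

# Simple lift-style filter: report patterns at least 1.5x more common among
# failed hosts than in the fleet overall, so the output stays interpretable.
for support, itemset in failed_patterns.itertuples(index=False):
    baseline = fleet_support.get(frozenset(itemset), 1e-9)
    if support / baseline >= 1.5:
        print(sorted(itemset), f"failed-support={support:.2f}", f"fleet-support={baseline:.2f}")
```

On this toy data the filter surfaces “firmware=v2.1” as the attribute most strongly correlated with the unexpected reboots, which is the kind of actionable correlation the framework is designed to report.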

For more information, see our article “Fast dimensional analysis for root cause investigation in a large-scale service environment.”

The Fast Dimensional Analysis Framework.
