Automating root cause analysis for infrastructure systems
What the research is:
Facebook products run on a highly complex infrastructure system consisting of servers, networks, back-end services, and client-side software. Operating such systems with high performance, reliability, and efficiency requires real-time monitoring, proactive fault detection, and prompt diagnosis of production problems. While a number of research efforts and applications have addressed monitoring with state-of-the-art anomaly detection, diagnosing root causes remains a largely manual and time-consuming process. Modern software systems can be so complex that unit/integration tests and error logs alone are not enough for a human to trace a root cause. For example, triaging an alert would require manually examining a mixture of structured data (e.g., telemetry logging) and unstructured data (e.g., code changes, error messages).
Facebook’s Infrastructure Data Science team is developing a unified framework of algorithms, in the form of a Python library, to overcome such challenges (see Figure 1). In this blog post, we illustrate applications of root cause analysis (RCA) to large-scale infrastructure systems and discuss opportunities for applying statistics and data science to bring new automation to this domain.
Figure 1. RCA methods and applications for infrastructure problems
How it works:
I. Attributing ML performance degradation to dataset shift
Machine learning is an important part of Facebook products: it helps recommend content, connect people with new friends, and flag integrity violations. Feature shifts caused by corrupted training/inference data are a typical root cause of model performance degradation. We are investigating how a sudden change in model accuracy can be attributed to shifting data distributions. Machine learning models typically take complex features such as images, text, and high-dimensional embeddings as inputs. We apply statistical methods to perform change point detection on these high-dimensional features, and we build black-box attribution models, agnostic to the original deep learning models, that attribute model performance degradation to feature and label shifts. See Figure 2 for an example of detecting shifted high-dimensional embedding features between two model training datasets. The methodology can also be used to explain the accuracy degradation of an older model whose training data distribution differs from that of the inference dataset.
Figure 2. An example of a sudden, drastic dataset shift in high-dimensional embedding features. Two-dimensional projections of the embedding (using t-SNE) before and after the shift are visualized. This example, shown as an illustration with synthetic data, is similar to the shifts observed in production settings.
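As a rough illustration of testing for a shift between two sets of embeddings, the sketch below runs a two-sample permutation test on the distance between mean embedding vectors. This is a deliberately simple stand-in, not the method used in the library described here: the function names, the choice of test statistic, and the synthetic Gaussian data are all assumptions made for the example.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_distance(a, b):
    """Euclidean distance between the mean vectors of two samples."""
    return math.dist(mean_vector(a), mean_vector(b))


def permutation_shift_test(before, after, n_permutations=200, seed=0):
    """Return (observed_distance, p_value) for a two-sample permutation test."""
    rng = random.Random(seed)
    observed = mean_distance(before, after)
    pooled = list(before) + list(after)
    n_before = len(before)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling the pooled rows simulates "no shift" between the samples.
        rng.shuffle(pooled)
        if mean_distance(pooled[:n_before], pooled[n_before:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


def synthetic_embeddings(n, dim, offset, rng):
    """Hypothetical embeddings: independent Gaussians centered at `offset`."""
    return [[rng.gauss(offset, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(42)
baseline = synthetic_embeddings(100, 16, 0.0, rng)
same = synthetic_embeddings(100, 16, 0.0, rng)      # same distribution
shifted = synthetic_embeddings(100, 16, 1.0, rng)   # every dimension moved

_, p_same = permutation_shift_test(baseline, same)
_, p_shifted = permutation_shift_test(baseline, shifted)
print(f"p-value, same distribution: {p_same:.3f}")
print(f"p-value, shifted distribution: {p_shifted:.3f}")
```

A small p-value for the shifted sample suggests the two embedding sets were not drawn from the same distribution. A mean-based statistic only catches shifts in location; shifts in shape or correlation structure would need a richer statistic.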
II. Automatic diagnosis of key performance metric degradation
Infrastructure systems are monitored in real time, which generates a large amount of telemetry data. Diagnostic workflows typically start with drill-down data analysis, such as running analytical queries to find out which country, app, or device type shows the largest week-over-week drop in reliability. Such findings can point on-call engineers in the right direction for further investigation. We are experimenting with dynamic programming algorithms that can automatically traverse the space of these sub-dimensions. We are also trying to fit a predictive model on the metric and dimension dataset and identify dimensions of interest by looking at the feature importances. Such tools reduce the time spent on repetitive analytical tasks.
Another diagnostic task is to investigate which correlated telemetry metrics may have caused the key performance metric to degrade. For example, when the latency of a service increases, its owner may manually browse the telemetry metrics of (sometimes a large number of) dependent services. Simple automations, such as setting up anomaly detection for each metric, can produce noisy detections and false positives. A better approach, as shown in Figure 3, is to learn from historical data the temporal correlations between suspect metrics and the key performance metric, and to separate true causes from spuriously correlated anomalies.
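The drill-down step can be sketched as follows. This is a brute-force pass over single dimensions rather than the dynamic programming search mentioned above, and the row schema (`week`, `country`, `device`, `success`) and all data are hypothetical.

```python
from collections import defaultdict


def rank_segments(rows, dimensions, metric="success"):
    """Rank (dimension, value) segments by week-over-week drop in the metric mean.

    Each row is a dict with a "week" key ("previous" or "current"),
    one key per dimension, and a numeric metric value.
    """
    # totals[segment][week] holds [metric_sum, row_count].
    totals = defaultdict(lambda: {"previous": [0, 0], "current": [0, 0]})
    for row in rows:
        for dim in dimensions:
            bucket = totals[(dim, row[dim])][row["week"]]
            bucket[0] += row[metric]
            bucket[1] += 1

    ranked = []
    for segment, weeks in totals.items():
        (prev_sum, prev_n), (cur_sum, cur_n) = weeks["previous"], weeks["current"]
        if prev_n == 0 or cur_n == 0:
            continue  # segment is missing in one of the two weeks
        ranked.append((segment, prev_sum / prev_n - cur_sum / cur_n))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def make_rows(week, country, device, successes, total):
    """Build `total` hypothetical request rows, `successes` of which succeeded."""
    return [
        {"week": week, "country": country, "device": device,
         "success": 1 if i < successes else 0}
        for i in range(total)
    ]


rows = []
for country in ("US", "BR"):
    for device in ("ios", "android"):
        rows += make_rows("previous", country, device, 9, 10)
        # In the current week, only android requests degrade.
        rows += make_rows("current", country, device,
                          5 if device == "android" else 9, 10)

ranked = rank_segments(rows, dimensions=("country", "device"))
for segment, drop in ranked:
    print(segment, round(drop, 2))
```

In this synthetic data the `("device", "android")` segment ranks first because its success rate falls from 0.9 to 0.5, while each country segment only falls to 0.7. A production version would also need to weigh segment size and search combinations of dimensions.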
Figure 3. Methodology for assessing and ranking potential causal factors.
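One simple way to score candidate metrics against a key performance metric is a lagged Pearson correlation, sketched below. It is only a first approximation of the approach in Figure 3: the function names, the metric names, and the series are invented for the example, and a high correlation by itself does not establish causation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lagged_correlation(candidate, kpi, max_lag):
    """Return (lag, correlation) with the largest |correlation|.

    A lag of k compares candidate[t] with kpi[t + k], i.e. the candidate
    metric leading the key performance metric by k time steps.
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        r = pearson(candidate[: len(candidate) - lag], kpi[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


def rank_candidates(candidates, kpi, max_lag=5):
    """Rank candidate metrics by their strongest lagged correlation with the KPI."""
    scored = [
        (name, *best_lagged_correlation(series, kpi, max_lag))
        for name, series in candidates.items()
    ]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical series: queue depth spikes, and latency follows two steps later.
queue_depth = [1, 1, 1, 1, 1, 5, 9, 5, 1, 1, 1, 1, 1, 1, 1, 1]
latency = [12, 12] + [10 + 2 * q for q in queue_depth[:-2]]
unrelated = [3, 4] * 8

ranked = rank_candidates(
    {"queue_depth": queue_depth, "unrelated": unrelated}, latency
)
for name, lag, corr in ranked:
    print(name, lag, round(corr, 3))
```

Here `queue_depth` ranks first with a lag of 2, since latency was constructed as a delayed linear function of it. Learning such lags from historical incidents, rather than from a single window, is what helps distinguish a recurring precursor from a coincidental anomaly.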
III. Event ranking and isolation
Many production problems are caused by internal changes to the software/infrastructure systems. Examples include code changes, configuration changes, and the launch of A/B tests for new features that affect a subset of users.
Ongoing research aims to develop a model that isolates the changes that are potential root causes. As a first step, we use heuristic rules, such as ranking by the time elapsed between a code change and the production problem. There is an opportunity to adopt more signals, such as team, author, and code content, to further reduce the false positives and missed cases of simple heuristics. A machine learning-based ranking model can make effective use of such inputs. The limited amount of labeled data is an obstacle to learning such rules automatically. One possible solution is to explore a human-in-the-loop framework that iteratively collects feedback from subject matter experts and adaptively updates the ranking model (see Figure 4).
Figure 4. A human-in-the-loop framework for attributing blame to bad code changes.
At Facebook scale, numerous code, configuration, and experimentation changes land every day. Simply trying to rank all of them may not work. The ranking algorithm requires “prior” knowledge of the systems in order to narrow down the pool of suspect changes. For example, all back-end services can be represented as a graph whose edges indicate how likely the degradation of one node is to cause production problems for its neighbors. One way to build such a graph is to apply a deep neural network framework that models the dynamic dependencies among a large number of time series. Another possible direction is to apply causal graph inference models to discover the strength of the dependencies between nodes. With the help of such prior knowledge, bad changes can be isolated more effectively.
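The time-based heuristic described above can be sketched as a ranking of changes by how recently they landed before the incident. The record fields (`id`, `landed_at`), the 24-hour lookback window, and the change identifiers are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def rank_changes_by_recency(changes, incident_time, lookback=timedelta(hours=24)):
    """Rank changes that landed within `lookback` before the incident.

    Returns (change_id, gap) pairs, smallest gap first. Changes that landed
    after the incident or before the lookback window are dropped.
    """
    candidates = []
    for change in changes:
        gap = incident_time - change["landed_at"]
        if timedelta(0) <= gap <= lookback:
            candidates.append((change["id"], gap))
    return sorted(candidates, key=lambda item: item[1])


incident_time = datetime(2021, 1, 10, 12, 0)
changes = [
    {"id": "change-A", "landed_at": datetime(2021, 1, 10, 11, 30)},  # 30 min before
    {"id": "change-B", "landed_at": datetime(2021, 1, 9, 18, 0)},    # 18 h before
    {"id": "change-C", "landed_at": datetime(2021, 1, 8, 9, 0)},     # outside window
    {"id": "change-D", "landed_at": datetime(2021, 1, 10, 12, 30)},  # after incident
]

ranked = rank_changes_by_recency(changes, incident_time)
for change_id, gap in ranked:
    print(change_id, gap)
```

A learned ranking model would replace the single `gap` sort key with a score computed from several signals, and expert feedback on each ranked list could supply the labels needed to train it.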
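To make the idea of graph-based prior knowledge concrete, the sketch below scores every service by the most likely path of influence to a degraded service and keeps only changes made to services above a threshold. The edge weights, the product-of-weights scoring rule, and the service and change names are all assumptions; how such a graph is learned is outside the scope of this example.

```python
import heapq


def suspect_services(edges, degraded, min_score=0.1):
    """Score services by their strongest influence path to `degraded`.

    edges maps (cause, effect) to a weight in (0, 1]. A service's score is
    the maximum product of weights along any path from it to `degraded`.
    """
    incoming = {}
    for (cause, effect), weight in edges.items():
        incoming.setdefault(effect, []).append((cause, weight))

    best = {degraded: 1.0}
    heap = [(-1.0, degraded)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for cause, weight in incoming.get(node, []):
            candidate = score * weight
            if candidate >= min_score and candidate > best.get(cause, 0.0):
                best[cause] = candidate
                heapq.heappush(heap, (-candidate, cause))
    return best


def shortlist_changes(changes, scores):
    """Keep changes to suspect services, highest-scoring service first."""
    kept = [(cid, svc, scores[svc]) for cid, svc in changes if svc in scores]
    return sorted(kept, key=lambda item: item[2], reverse=True)


# Hypothetical dependency graph and change log.
edges = {
    ("api", "web"): 0.9,
    ("auth", "web"): 0.9,
    ("db", "api"): 0.8,
    ("cache", "api"): 0.5,
    ("storage", "db"): 0.5,
}
changes = [("D1", "storage"), ("D2", "db"), ("D3", "search"), ("D4", "auth")]

scores = suspect_services(edges, degraded="web", min_score=0.4)
shortlisted = shortlist_changes(changes, scores)
for change_id, service, score in shortlisted:
    print(change_id, service, round(score, 2))
```

With these weights, the change to `storage` is dropped because its best path score (0.36) falls below the threshold, and the change to `search` is dropped because that service has no path to `web`. The remaining shortlist could then be passed to a ranking step.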
Why it matters:
Operating an efficient and reliable infrastructure is important to the success of Facebook products. While production problems will inevitably arise, using data to quickly identify root causes can speed up resolution and minimize the damage of such events. The proposed framework of algorithms will enable automated diagnosis using a mixture of structured data (e.g., telemetry) and unstructured data (e.g., traces, code change events). The methods are being developed to be generally applicable across different types of infrastructure systems. The algorithms, written as a Python library, may also be useful to the wider data science and software engineering community. Root cause analysis is an emerging area of data science that sits at the intersection of established fields such as data mining, supervised learning, and time series analysis.