Minesweeper automates root cause analysis as a first line of defense against bugs

Root cause analysis (RCA) is an important part of troubleshooting. After all, you can’t solve a problem without getting to the heart of it. But RCA isn’t always easy, especially at a scale like Facebook’s. When billions of people use an app on a variety of platforms and devices, a single failure can cause several different problems, and multiple errors can occur at the same time.

There was a time when on-call engineers had to spend hours or even days manually browsing bug reports and looking for patterns to help them debug. Today they can use Minesweeper, a technique we developed to automate RCA that identifies the causes of failures based on their symptoms.

RCA with Minesweeper is fully automated, scalable, and grounded in formal statistical concepts. Our own evaluations of Minesweeper on real-world bug reports from Facebook apps have shown that it can perform RCA on tens of thousands of reports in minutes and identify the root cause of errors with an accuracy of 85 percent.

Since its inception, Minesweeper has been Facebook’s first line of defense against errors and has helped us prevent potentially widespread glitches from affecting people in our apps.

How does Minesweeper work?

Every time someone reports a bug through the Facebook app, the bug reporting system usually records a chronological trace of the actions (or “events”) that person took in the app before the bug occurred. The idea is to capture a snapshot of what could have caused the error, as in the following example:

Minesweeper searches these traces for distinctive patterns that could indicate the cause of an error. Traces that contain the defect (the test group) are compared to traces that do not (the control group). Minesweeper finds patterns of events that are statistically distinctive to the test group as opposed to the control group. These patterns are likely to be correlated with the fault and can thus point to its root cause.

Here is an example. Let’s say (hypothetically) that 10 people are using the Facebook app. Five of these people report a problem, and the app tracks eight possible events (a, b, …, h). We have a total of 10 traces over these events: five in the test group (T) from people who encountered the bug, and five in the control group (C) from the rest of the people, who did not.

When Minesweeper is applied to these traces, it starts by extracting sequential patterns from T and C. A sequential pattern is simply a chronological sequence of events that have occurred, but not necessarily consecutively (that is, there may be other events in between that were not significant).

We could visualize the data like this:

Events: { a, b, …, h }

Test group T        Control group C
t1: a b c d         t6: a b d
t2: a b c           t7: a c d
t3: b c             t8: a c
t4: e f g h         t9: f g
t5: e g             t10: e f h

An example of a pattern that the system can extract here is a c. For each pattern, the number of traces in which it appears is calculated for each group (T and C). In the example above, this pattern appears twice in T (t1 and t2), so its support in T is 2. In this case, its support in C is also 2 (t7 and t8). Note that the pattern space is combinatorial in nature, so it is critical to use algorithms that can search this space efficiently without exponential blowup.
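The support computation can be sketched in a few lines of Python. This is only an illustration on the toy traces from the example, not Minesweeper's actual code; the function names `contains_pattern` and `support`, and the encoding of each trace as a string of event letters, are assumptions made for this sketch.

```python
def contains_pattern(trace, pattern):
    """Return True if `pattern` occurs in `trace` in order, gaps allowed."""
    events = iter(trace)
    # `event in events` consumes the iterator up to the match,
    # so later pattern events can only match later trace events.
    return all(event in events for event in pattern)


def support(pattern, traces):
    """Number of traces that contain `pattern` as a sequential pattern."""
    return sum(contains_pattern(trace, pattern) for trace in traces)


# Toy traces from the example (assumed encoding: one letter per event).
TEST = ["abcd", "abc", "bc", "efgh", "eg"]
CONTROL = ["abd", "acd", "ac", "fg", "efh"]

print(support("ac", TEST), support("ac", CONTROL))
```

For the pattern a c, this reports a support of 2 in each group, matching the count given in the text.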

Once all patterns are extracted along with their supports in T and C, the system performs statistical isolation. For every pattern P, it calculates precision and recall from the pattern's supports.

Informally, precision describes how precisely P identifies that a particular trace is in the test group rather than the control group, and recall describes how much of the test group P can cover. For example, the pattern b has a precision of 0.75 because it occurs in a total of four traces, three of which are in the test group (i.e., it is 75 percent specific to the test group). It also has a recall of 0.6 because it occurs in three out of five traces in the test group (i.e., it covers 60 percent of the test group).

The harmonic mean of the two, 0.67, is the F1 score. In this way, the system calculates the F1 score of all patterns and returns the list of all patterns ordered by F1 score as shown in the following table.
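Written out, the definitions described informally above take the following form. The notation is an assumption made for this sketch: supp_T(P) and supp_C(P) stand for the supports of P in the test and control groups, and |T| for the number of test traces.

```latex
\begin{align*}
\mathrm{precision}(P) &= \frac{\mathrm{supp}_T(P)}{\mathrm{supp}_T(P) + \mathrm{supp}_C(P)}, \qquad
\mathrm{recall}(P) = \frac{\mathrm{supp}_T(P)}{|T|}, \\[4pt]
F_1(P) &= \frac{2 \cdot \mathrm{precision}(P) \cdot \mathrm{recall}(P)}{\mathrm{precision}(P) + \mathrm{recall}(P)}.
\end{align*}
% Worked example for the pattern b (supp_T = 3, supp_C = 1, |T| = 5):
\begin{align*}
\mathrm{precision}(b) = \frac{3}{3 + 1} = 0.75, \qquad
\mathrm{recall}(b) = \frac{3}{5} = 0.6, \qquad
F_1(b) = \frac{2 \cdot 0.75 \cdot 0.6}{0.75 + 0.6} = \frac{0.9}{1.35} \approx 0.67.
\end{align*}
```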

In this example, the result shows that the pattern b c ranks highest. In a real-world environment, an engineer debugging the reports can infer that the events b and c, appearing in this order, are suspicious and deserve scrutiny.
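The ranking step can be reproduced on the toy data with a brute-force sketch. It enumerates every ordered sub-pattern of up to three events, which is only feasible for tiny inputs and is not how Minesweeper explores the pattern space; the helper names and the `max_len` cutoff are assumptions made for illustration.

```python
from itertools import combinations

# Toy traces from the example (assumed encoding: one letter per event).
TEST = ["abcd", "abc", "bc", "efgh", "eg"]
CONTROL = ["abd", "acd", "ac", "fg", "efh"]


def subsequences(trace, max_len=3):
    """All ordered sub-patterns of `trace` with at most `max_len` events."""
    return {
        "".join(combo)
        for size in range(1, max_len + 1)
        for combo in combinations(trace, size)  # combinations keep trace order
    }


def rank_patterns(test, control, max_len=3):
    """Return (pattern, precision, recall, f1) tuples, best F1 first."""
    test_sets = [subsequences(t, max_len) for t in test]
    control_sets = [subsequences(t, max_len) for t in control]
    ranked = []
    # Only patterns seen in the test group are scored, so supp_t >= 1.
    for pattern in set().union(*test_sets):
        supp_t = sum(pattern in s for s in test_sets)
        supp_c = sum(pattern in s for s in control_sets)
        precision = supp_t / (supp_t + supp_c)
        recall = supp_t / len(test)
        f1 = 2 * precision * recall / (precision + recall)
        ranked.append((pattern, precision, recall, f1))
    # Sort by descending F1, breaking ties alphabetically.
    return sorted(ranked, key=lambda row: (-row[3], row[0]))


for pattern, precision, recall, f1 in rank_patterns(TEST, CONTROL)[:3]:
    print(f"{' '.join(pattern)}: precision={precision:.2f} "
          f"recall={recall:.2f} F1={f1:.2f}")
```

On these ten traces the top entry is b c, with a precision of 1.0, a recall of 0.6, and an F1 score of 0.75, followed by b at roughly 0.67.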

Automating RCA at scale

To make Minesweeper work at the scale and complexity of Facebook, we borrowed an idea from the data mining community, sequential pattern mining, to capture the order in which events appear in traces.

It’s not uncommon at Facebook to find tens of thousands of traces related to a bug. To cope with this, we use the PrefixSpan algorithm, known for its high efficiency in sequential pattern mining. To rank patterns according to how distinctive they are for the test group, and thus how useful they are for RCA, we used the statistical approach mentioned above, which is based on the precision and recall of patterns. Finally, to make Minesweeper practical from a usability standpoint, we addressed several human-centered challenges, such as avoiding “redundant” patterns that are similar in how they explain the root cause.
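To give a feel for how PrefixSpan avoids enumerating every possible pattern, here is a minimal sketch restricted to traces of single events. It is an illustration under stated assumptions, not the production implementation: the function name `prefixspan`, the `min_support` threshold, and the string encoding of traces are all hypothetical. The core idea is to grow a pattern one event at a time and recurse only into the suffixes ("projected" traces) that follow a match, pruning any extension that falls below the support threshold.

```python
from collections import defaultdict


def prefixspan(traces, min_support=2):
    """Return (pattern, support) pairs for every sequential pattern that
    occurs in at least `min_support` traces (single-event sequences only)."""
    results = []

    def grow(prefix, projected):
        # `projected` holds (trace_index, start) pairs: the suffix of each
        # trace that follows the earliest match of `prefix`.
        occurrences = defaultdict(dict)
        for idx, start in projected:
            trace = traces[idx]
            for pos in range(start, len(trace)):
                # Record only the first occurrence of each event per trace.
                occurrences[trace[pos]].setdefault(idx, pos + 1)
        for event in sorted(occurrences):
            positions = occurrences[event]
            if len(positions) >= min_support:  # prune infrequent extensions
                pattern = prefix + [event]
                results.append((tuple(pattern), len(positions)))
                grow(pattern, list(positions.items()))

    grow([], [(idx, 0) for idx in range(len(traces))])
    return results


# Test-group traces from the earlier toy example.
TEST = ["abcd", "abc", "bc", "efgh", "eg"]
for pattern, supp in prefixspan(TEST, min_support=2):
    print(" ".join(pattern), supp)
```

Because an extension is explored only when its prefix is already frequent, whole regions of the combinatorial pattern space are never visited. The supports produced this way would then feed the precision, recall, and F1 ranking described earlier.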

The following figure gives a general overview of the Minesweeper architecture.

The impact of Minesweeper

Minesweeper has played an important role in helping Facebook engineers analyze and diagnose regressions (sudden spikes in a group of crash or bug reports), providing insights in minutes that could previously have taken days to collect. Due to the complexity of Facebook’s apps and the product release cycle, multiple regressions often happen at the same time, especially after a new version is released. Thanks to Minesweeper, engineers working on regressions can analyze them right away, making it easier than ever to reduce and mitigate disruptions to Facebook services.

Further technical information can be found in our paper “Scalable Statistical Root Cause Analysis on App Telemetry.”
