Don’t be seduced by the allure: A guide for how (not) to use proxy metrics in experiments | by Analytics at Meta | Oct, 2022
The problem of unscalable ground truth measurement
At Meta we care deeply about developing tools for valid, reliable, and representative measurement that enable teams to make better decisions and improve our products.
While many things we’re interested in measuring can be observed directly based on logged events on Meta products (e.g., did a new design increase the number of people who donate to charity?), other outcomes of interest cannot be so easily observed.
Consider these example research questions:
- What impact, if any, does a new feature have on brand sentiment?
- Does a new AI technology change the prevalence of specific kinds of content viewed on our platforms?
- Does a new feature cause any changes in long term behavior?
All of these questions share something in common: the “ground truth” answer cannot be immediately or directly observed from logged data and must be measured in other ways.
For example, people’s attitudes are often best measured using a valid and reliable survey instrument that asks a random sample of people about their experience. The prevalence of specific kinds of content on our platforms can be measured by a team of professionally trained raters who actually look at a random sample of content and label it. And long term impact on behavior can be measured by, well, waiting the specified period of time and seeing what happens.
In other words, the best strategies for the measurement of important outcomes of interest often require using scarce, expensive resources, or a lot of time. This does not scale well. It is simply not possible to run a well-powered survey or have professional raters label content for every A/B test that teams across Meta would like to run, nor is it always feasible to just wait months to understand long term impact. And yet, teams would still like to understand their impact on outcomes typically measured using these kinds of ground truth methods.
The problem of unscalable ground truth. Whereas teams are interested in the effect of their experiment on the ground truth outcome of interest, it cannot always be measured.
Are there any solutions to this problem of unscalable ground truth measurement?
In our experience, the same alluring idea tends to emerge again and again from teams looking for an answer. Specifically, teams often want to use a predicted ground truth outcome from a machine learning model (i.e., a modeled variable) — trained on an observational sample of data — as a proxy for ground truth measurement of their outcome of interest to evaluate experiments.
We cheekily refer to this as the seductive allure of proxies, comparable to the idea in early alchemy that abundantly available base metals (e.g., lead) can be transformed into so-called noble metals (e.g., gold). This would be incredible! Can teams really use machine learning to turn the abundantly available logged data (i.e., “base metals”) into reliable insights about the unobserved outcomes they usually measure using scarce or expensive ground truth methods (i.e., “noble metals”)?
It’s important to manage expectations. While it is theoretically possible for proxies to provide some useful signal, it is also possible for proxies to lead teams to make worse decisions than they otherwise would have.
An 1860s painting by artist Jan Matejko depicting the scene of an alchemist turning base metals into gold, with onlookers showing varying degrees of excitement and skepticism.
We developed the experimentation with modeled variables playbook (xMVP) to help teams across Meta navigate this problem of unscalable ground truth.
The playbook aims to help teams confront and understand the assumptions and limitations involved when using proxies for ground truth measurement in the context of experimentation. It provides guidance on how to evaluate the “maturity” of their proxy for use in decision making, provides best practices about how (not) to use them, and ultimately aims to help teams decide whether or not a proxy can meet their measurement goals.
The xMVP consists of the following elements:
- Six steps that teams can walk through to evaluate the maturity of different aspects of their proxy.
- Four levels of maturity to be assigned at each step, summarizing how much the proxy can be trusted for valid and reliable decision making.
- Considerations and/or analyses at each step that can give teams confidence (or not) in the maturity of their proxy.
While a detailed description of the playbook is outside the scope of this blog post, here we briefly describe each of the steps.
The preliminary assessment amounts to making sure teams have already ruled out potentially better (unbiased) solutions for their measurement needs, including (a) acquiring the resources they need to measure their outcomes in experiments using ground truth methods, (b) using strategies like controlling for pre-treatment covariates/outcomes to help increase the statistical power of estimates from ground truth without introducing bias, or (c) relying on different outcomes that can be more readily measured, not as a proxy for a ground truth outcome of interest but as a defensible end-in-itself.
Only if none of these options are feasible — and the team still wants to pursue using a proxy to try to understand experimental effects on ground truth outcomes of interest — do we recommend continuing with an xMVP evaluation.
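To make option (b) of the preliminary assessment concrete, here is a minimal sketch of covariate adjustment on simulated data. It assumes a single pre-treatment measurement that is linearly related to the outcome. The function names, sample sizes, and effect size are all hypothetical choices for illustration, not part of the xMVP itself.

```python
import random
import statistics


def mean_diff(y, treated):
    """Difference in mean outcome between treated and control units."""
    t = [v for v, z in zip(y, treated) if z]
    c = [v for v, z in zip(y, treated) if not z]
    return statistics.fmean(t) - statistics.fmean(c)


def adjust_for_covariate(y, x):
    """Subtract the part of y that is linearly predicted by pre-treatment x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    theta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - theta * (a - mx) for a, b in zip(x, y)]


def simulate_once(seed, n=400, true_effect=0.2):
    """One simulated experiment; returns (raw, adjusted) effect estimates."""
    rng = random.Random(seed)
    treated = [rng.random() < 0.5 for _ in range(n)]
    x = [rng.gauss(0, 1) for _ in range(n)]  # assumed pre-treatment measurement
    y = [0.9 * xi + rng.gauss(0, 0.5) + true_effect * z
         for xi, z in zip(x, treated)]
    return mean_diff(y, treated), mean_diff(adjust_for_covariate(y, x), treated)


# Repeat the simulated experiment to compare the two estimators.
raw, adjusted = zip(*(simulate_once(seed) for seed in range(300)))
raw_mean, adj_mean = statistics.fmean(raw), statistics.fmean(adjusted)
raw_sd, adj_sd = statistics.stdev(raw), statistics.stdev(adjusted)
print(f"raw:      mean={raw_mean:.3f}  sd={raw_sd:.3f}")
print(f"adjusted: mean={adj_mean:.3f}  sd={adj_sd:.3f}")
```

In this simulation both estimators center on the same true effect, but the adjusted one varies less across repetitions. That is the sense in which the strategy adds statistical power without introducing bias.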
An important first step whenever using a machine learning model is to assess the quality of the ground truth data being used for training and evaluation. Therefore, the xMVP recommends that the maturity of ground truth data is assessed first and foremost, using the ground truth maturity framework.
Note: this step is an entire detailed 7-step ground truth maturity framework in itself, developed by our colleagues on the Core Data Science (CDS) and Demography & Survey Science (DSS) teams at Meta.
In order for a proxy to provide reliable estimates of the effect of an experiment on the ground truth outcome of interest, some (very strong) assumptions must be met.
While it can depend on what is being predicted, in practice it is usually not a question of whether these assumptions are violated (they almost certainly are), but just how badly they are violated and whether the biases introduced by those violations are tolerable.
Following Athey et al. (2019) and VanderWeele (2013), the xMVP frames the assumptions around three main concepts:
- Unconfoundedness: the relationship between the proxy and the ground truth outcome of interest must be entirely due to the causal effect of the proxy on ground truth.
Ask yourself: might the association between the proxy (including all the features used in the ML model to create the proxy) and ground truth be due to confounding?
- Exclusivity: the effect of the experiment on the ground truth outcome of interest must only go through the effect of the experiment on the proxy.
Ask yourself: might there be an effect of the experiment on ground truth not through the proxy?
- Comparability: the relationship between the proxy and the ground truth outcome of interest established in the observational sample must also hold in the experimental treatments (e.g., distributional monotonicity).
Ask yourself: might the experiment affect the proxy for a different set of people than those for whom the proxy affects ground truth?
If the answer to any of these “ask yourself” questions is yes, then a directionally consistent proxy cannot be guaranteed. In fact, it is the violation of these assumptions that leads to biased estimates of the treatment effect on ground truth and can lead to the “proxy paradox” — when the treatment effect on the proxy is actually in the opposite direction of the treatment effect on ground truth.
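The proxy paradox can be reproduced in a toy simulation. The sketch below assumes a violation of exclusivity: the treatment raises a logged behavior that feeds the proxy, but also has a direct negative effect on ground truth that does not pass through the proxy. All coefficients and variable names are hypothetical and chosen only to make the sign flip visible.

```python
import random
import statistics

rng = random.Random(0)
n = 20000
treated = [i % 2 == 0 for i in range(n)]

# Logged behavior (e.g., engagement) that the treatment increases.
engagement = [rng.gauss(0, 1) + 0.3 * z for z in treated]

# Ground truth depends on engagement, plus a direct negative effect of the
# treatment that does NOT go through engagement (the exclusivity violation).
truth = [0.5 * e - 0.4 * z + rng.gauss(0, 1)
         for e, z in zip(engagement, treated)]

# Proxy: a model that learned the observational relationship truth ~ 0.5 * engagement.
proxy = [0.5 * e for e in engagement]


def effect(outcome):
    """Difference in means between treated and control units."""
    t = [v for v, z in zip(outcome, treated) if z]
    c = [v for v, z in zip(outcome, treated) if not z]
    return statistics.fmean(t) - statistics.fmean(c)


proxy_effect = effect(proxy)
truth_effect = effect(truth)
print(f"effect on proxy:        {proxy_effect:+.3f}")
print(f"effect on ground truth: {truth_effect:+.3f}")
```

Here the proxy moves in the positive direction while ground truth moves in the negative direction, even though the proxy model captures the engagement relationship correctly.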
Given that the above assumptions are almost certainly violated, it is important to set the expectation that, in practice, having a proxy that can qualitatively detect the existence and/or sign of an experimental effect on the ground truth outcome of interest some tolerable percentage of the time (i.e., a consistent proxy) is usually the best a team can hope for.
Validating these assumptions is not an easy task, but including this step in the xMVP at least forces teams to be aware of them and understand their implications. The xMVP also provides some guidance and recommendations to help teams think through causal diagrams and “sign the bias” for a better understanding of how these assumptions come into play (Knox, Lucas, & Cho, 2022), as well as establish evidence of causal relationships where possible. The rest of the xMVP also includes steps to validate the performance of their proxy against ground truth as a practical check on assumptions (see xMVP steps 5 and 6).
Note: see Athey et al. (2019) for a discussion of the “surrogate index” which can help with exclusivity, and VanderWeele (2013) for further discussion of the assumptions required for at least a consistent proxy: one that can provide directional signal on the outcome of interest but without accurately capturing the effect size magnitude.
This step is relatively straightforward, and simply asks teams to explicitly consider whether the benefits of using a (very likely biased) proxy outweigh the risks.
- How bad is it if the proxy shows an experimental effect when no effect on ground truth exists (i.e., false positives)? How bad is it if the opposite effect in fact exists (i.e., the proxy paradox)?
- How bad is it if the proxy shows no experimental effect when an effect on ground truth does exist (i.e., false negatives)?
- To meet your measurement goals, what % of the time must the experimental effect on the proxy be directionally consistent with the experimental effect on ground truth?
Note that in Step 5 of the xMVP, teams will be able to get some signal on the extent of false positives, false negatives, true positives, and true negatives to help determine whether their proxy can meet the criteria specified here.
This step considers both the proper calculation of confidence intervals (CIs) and the bias of point estimates from using predictions from a machine learning model as a proxy in experimentation.
First, it is important to consider all the sources of error that are involved when calculating confidence intervals around estimates of the treatment effect on the proxy:
- User variance captures the uncertainty in what impact an experiment had on the proxy as measured by a single set of fixed predictions from an ML model. This variance is the typical error you see represented in A/B testing results (e.g., 95% CIs).
- Ruler variance captures the uncertainty from the fact the proxy is a prediction from an ML model that has been trained on a relatively small number of labels. This limited sampling of labels in training creates randomness in the proxy, such that if a different sample of labels had been used for training then it would likely result in a slightly different estimate of the treatment effect in the experiment. This source of error is often not represented in A/B testing results, but should be accounted for (e.g., by bootstrapping training data; see also, Knox et al., 2022).
Without considering both of these sources of error, it is likely that overly-narrow CIs are being used around treatment effect estimates. This, of course, can lead to an excess of false positives and threaten a team’s credibility.
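As a rough illustration of combining the two sources of error, the sketch below bootstraps a small labeled training set, refits a simple model each time, and recomputes the treatment effect on the proxy. It assumes a one-feature linear model and simulated data. Adding the two variances is an approximation that treats the sources as independent. None of the names or numbers come from the xMVP.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b


def predict(model, xs):
    a, b = model
    return [a + b * x for x in xs]


rng = random.Random(1)

# Small labeled sample used to train the proxy model (hypothetical).
train_x = [rng.gauss(0, 1) for _ in range(200)]
train_y = [0.5 * x + rng.gauss(0, 1) for x in train_x]

# Logged feature for experiment units; treatment shifts it slightly.
x_control = [rng.gauss(0.0, 1) for _ in range(2000)]
x_treated = [rng.gauss(0.2, 1) for _ in range(2000)]

# User variance: the usual A/B sampling error, with the model held fixed.
model = fit_line(train_x, train_y)
p_t, p_c = predict(model, x_treated), predict(model, x_control)
estimate = statistics.fmean(p_t) - statistics.fmean(p_c)
user_var = statistics.variance(p_t) / len(p_t) + statistics.variance(p_c) / len(p_c)

# Ruler variance: refit on bootstrap resamples of the labels and see how
# much the estimated effect on the proxy moves.
boot_estimates = []
for _ in range(300):
    idx = [rng.randrange(len(train_x)) for _ in range(len(train_x))]
    m = fit_line([train_x[i] for i in idx], [train_y[i] for i in idx])
    boot_estimates.append(statistics.fmean(predict(m, x_treated))
                          - statistics.fmean(predict(m, x_control)))
ruler_var = statistics.variance(boot_estimates)

naive_se = math.sqrt(user_var)
total_se = math.sqrt(user_var + ruler_var)
print(f"estimate={estimate:.3f}  naive SE={naive_se:.4f}  total SE={total_se:.4f}")
```

The interval built from `total_se` is wider than the one built from `naive_se`, which is the point made above: ignoring ruler variance makes CIs too narrow.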
Second, it is important not to mistake even very narrow CIs for unbiased estimates. Properly calculated CIs capture the statistical uncertainty in treatment effect estimates, but they say nothing about the presence or absence of bias (which is roughly proportional to the unexplained variance). That is, bias will necessarily be small if either (1) the proxy can explain most of the variation in the ground truth (i.e., the model makes near-perfect predictions), or (2) the proxy can explain most of the variation in treatment assignment.
The evaluation methods described in the next step of the xMVP are designed to help teams understand the direction and magnitude of bias.
Note: the estimation of “bias bounds” is a promising direction for this kind of research in the future (see Athey et al., 2019). However, given that in practice predictions from a machine learning model are generally able to explain only a small percentage of the variance in ground truth outcomes of interest, we expect any bias bounds to result in a lot of overlap with an estimated treatment effect of 0 (which highlights that proxies are generally not reliable for quantitative estimates of effect size).
This step evaluates the performance of the proxy against the ground truth outcome it is intended to represent. This validation step is critical, as it will help teams determine how useful the proxy can be expected to be in practice.
Importantly, this step requires having data available for the treatment effect on both (a) the ground-truth outcome of interest and (b) the proxy, ideally across many experiments that represent a random sample of the “class” of experiments teams are planning to evaluate using the proxy in the future.
The primary evaluation method we recommend is meta-analysis across many experiments, which allows teams to calculate metrics like:
- The probability of effect sharing by sign (i.e., sign consistency)
- The probability of effect sharing within some specified magnitude (i.e., effect size consistency)
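A naive version of these two metrics can be computed directly from paired point estimates. The sketch below uses made-up effect pairs, a hypothetical tolerance, and a Wilson interval to show how uncertain a consistency rate is when only a few experiments are available. It ignores sampling error in the individual estimates, so it is a starting point and not a substitute for a full meta-analysis.

```python
import math

# Hypothetical (ground truth effect, proxy effect) estimates from past experiments.
effects = [(0.8, 0.5), (-0.3, -0.1), (0.4, -0.2), (1.1, 0.9),
           (-0.6, -0.4), (0.2, 0.3), (-0.1, 0.2), (0.5, 0.1)]


def sign_consistent(pairs):
    """Count experiments where proxy and ground truth effects share a sign."""
    return sum(1 for truth, proxy in pairs if truth * proxy > 0)


def size_consistent(pairs, tolerance):
    """Count experiments where the two effects differ by at most `tolerance`."""
    return sum(1 for truth, proxy in pairs if abs(truth - proxy) <= tolerance)


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion k / n."""
    p = k / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


n = len(effects)
k_sign = sign_consistent(effects)
k_size = size_consistent(effects, tolerance=0.25)
low, high = wilson_interval(k_sign, n)
print(f"sign consistency: {k_sign}/{n} (95% interval {low:.2f} to {high:.2f})")
print(f"effect size consistency within 0.25: {k_size}/{n}")
```

With only eight experiments the interval around the sign consistency rate is wide, which is one reason a large and representative set of experiments matters for this step.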
This evaluation step will be invaluable for determining whether or not the proxy can meet a team’s measurement goals. Most importantly, it will directly inform just how often estimates of the treatment effect on the proxy are at least directionally consistent with treatment effects on ground truth. However, caution is warranted when interpreting these kinds of meta-analytic scatterplots of treatment effects on ground truth vs. proxy.
First, even in the absence of any true treatment effects (e.g., if all experiments were actually A/A tests), we’d still expect to see a correlation between ground truth and proxy treatment effects equal to their unit-level correlation, simply due to correlated sampling error (Cunningham & Kim, 2020). To help account for this, we also recommend using methods like multivariate shrinkage in meta-analytic evaluations, which provide improved posterior treatment effect estimates by leveraging information about the patterns of similarity across experiments while also effectively correcting for multiple comparisons (Urbut et al., 2019).
Second, recall that assumptions must be met for each experiment and decision making contexts are constantly changing (e.g., teams are often interested in evaluating a new treatment or the effect of a treatment on a new population, or both). It is thus important to ensure that the kinds of experiments teams include in their evaluation step are representative of the kinds of experiments (and populations) they plan to evaluate using the proxy in the future.
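The first caveat above is easy to check by simulation. The sketch below runs many A/A tests on units whose ground truth and proxy values are correlated at an assumed unit-level correlation of 0.6, then correlates the estimated "effects" across tests. All settings are hypothetical.

```python
import math
import random
import statistics


def aa_effects(rng, n_units, rho):
    """One A/A test: returns (ground truth diff, proxy diff) between two arms."""
    truth_sum = [0.0, 0.0]
    proxy_sum = [0.0, 0.0]
    for i in range(n_units):
        arm = i % 2  # units are i.i.d., so alternating assignment is random
        truth = rng.gauss(0, 1)
        proxy = rho * truth + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        truth_sum[arm] += truth
        proxy_sum[arm] += proxy
    per_arm = n_units / 2
    return ((truth_sum[1] - truth_sum[0]) / per_arm,
            (proxy_sum[1] - proxy_sum[0]) / per_arm)


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))


rng = random.Random(2)
unit_rho = 0.6
results = [aa_effects(rng, n_units=200, rho=unit_rho) for _ in range(1000)]
truth_diffs, proxy_diffs = zip(*results)
effect_corr = pearson(truth_diffs, proxy_diffs)
print(f"unit-level correlation: {unit_rho:.2f}")
print(f"correlation of A/A 'effects' across tests: {effect_corr:.2f}")
```

No experiment here has a true effect, yet the scatterplot of proxy vs. ground truth effects would still show a clear positive slope. A raw correlation across experiments is therefore not, on its own, evidence that the proxy tracks true treatment effects.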
This final step involves developing (and executing) a plan for validating any decisions made using a proxy.
If teams have followed the previous steps in this playbook and — after assessing the maturity of their proxy at each step — determined that they will use a proxy to help inform product launch decisions, then there should be a plan to continuously validate those decisions against ground truth.
One way in which we do this at Meta is by grouping multiple new feature launches, each evaluated in a separate A/B test, and keeping a small control set of users out of the launch for a reasonable but limited time (3–6 months). At the end of this period, we collect well-powered ground truth data to assess the effect of the collection of launches over this longer period of time. This allows teams another opportunity to compare the ground-truth treatment effect estimate against the proxy-based treatment effect estimates and evaluate their directional consistency. Given that proxies are not reliable for quantitative estimates of effect sizes, this exercise also allows continuous monitoring of the actual magnitude of effect sizes through planned, periodic evaluations against ground truth.
Whereas teams often want to use their proxy as a precise air quality monitor (i.e., something that can provide precise quantitative information about the size of a treatment effect on ground truth), in actuality proxies are more like fire alarms (i.e., alerting us to the potential existence and direction of an effect). And they tend to be error-prone fire alarms at that, sometimes triggered when they shouldn’t be and failing to trigger when they should. Knox et al. (2022) similarly refer to this kind of distinction as “causal estimation” (quantitative estimates of effect size) vs. “causal exploration” (qualitative evidence about the existence and direction of an effect).
At Meta we repeatedly emphasize the importance of this distinction: while proxies can sometimes be useful for causal exploration, relying on them for causal estimation is not warranted unless all steps of the xMVP (especially the step assessing whether the assumptions are met) have reached a fully operational level of maturity.
However alluring an idea it may be, “measurement alchemy” is generally not possible to the extent that teams want it to be: the problem of unscalable ground truth remains a problem, and we are unfortunately not able to turn lead into gold. Still, the experimentation with modeled variables playbook (xMVP) has helped teams confront and understand why, has given them tools to evaluate the maturity of their own proxy and whether it should be used at all, and has educated them about how proxies can be used responsibly to help inform (but not replace) ground truth decision making.
Authors: Ryan R, Carlos D, Yevgeniy G, Aude H, Alex D.
Special thanks to Tom C, Peter F and many other Meta colleagues across our Core Data Science (CDS), Demography and Survey Science (DSS), and UX Research teams for their contributions and feedback throughout development of the xMVP.
Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. National Bureau of Economic Research.
Cunningham, T., & Kim, J. (2020). Interpreting Experiments with Multiple Outcomes.
Knox, D., Lucas, C., & Cho, W. K. T. (2022). Testing causal theories with learned proxies. Annual Review of Political Science, 25, 419–441.
Urbut, S. M., Wang, G., Carbonetto, P., & Stephens, M. (2019). Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nature Genetics, 51(1), 187–195. [see also, the mashr R package]
VanderWeele, T. J. (2013). Surrogate measures and consistent surrogates. Biometrics, 69(3), 561–565.