Testing product changes with community outcomes

This project is a collaborative effort between the Facebook Core Data Science team, the Experimentation Platform team, and the Messenger team.

What the research is:

Experiments are ubiquitous in online services like Facebook, where the effects of product changes are explicitly tested and analyzed in randomized trials. Interference, sometimes referred to as network effects in the context of online social networks, threatens the validity of these randomized trials because its presence violates the stable unit treatment value assumption (SUTVA), which underpins the standard analysis of such experiments. Colloquially, interference means that a unit's response to an intervention depends not only on its own treatment but also on the treatments of other units. For example, imagine a grocery delivery marketplace testing a treatment that encourages users to order deliveries more often. This could reduce the supply of delivery drivers available to users in the control group, leading the experimenter to overestimate the effect of the treatment.

Figure 1. An illustrative cartoon showing potential interference between test and control units, and how cluster randomization contains that interference within clusters.

In our work, we propose a network experimentation framework that accounts for partial interference between experimental units through cluster randomization (Figure 1). The framework has been deployed widely at Facebook, is as easy to use as a traditional A/B test there, and has been used by many product teams to measure the impact of product changes. On the design side, we find that unbalanced clusters often outperform the balanced clusters widely used in previous research in terms of the bias-variance trade-off. On the analysis side, we introduce a cluster-based regression adjustment that greatly improves the precision of treatment effect estimates, and we test for interference as part of our estimation procedure. In addition, we show how logging which units are actually treated, so-called trigger logging, can be used to further reduce variance.

While interference is a widely acknowledged problem in online field experiments, there is limited evidence from real-world experiments documenting interference in online settings. By running many network experiments, we have found a number of experiments with clear and substantial SUTVA violations. Our paper details two of them, a Stories experiment using social graph clustering and a commuting zone experiment using geographic clustering, that show significant network effects and demonstrate the value of this experimentation framework.

How it works:

Design of network experiments

Network experiment design has two main components: treatment allocation and unit clustering. The treatment allocation component is shown in Figure 2, which should be read from left to right. The input is a clustering of experimental units, drawn as larger circles containing colored dots for units. A given clustering and its associated units constitute a universe, the population under consideration. Clusters of experimental units are deterministically hashed into universe segments based on the universe name, and segments are then allotted to experiments. Universe segments allow a universe to host multiple mutually exclusive experiments at any one time, a requirement for a production system shared by many product teams. After allotment to an experiment, segments are split into unit-randomized and/or cluster-randomized segments using a deterministic hash based on the experiment name. The final condition assignment deterministically hashes units or clusters into treatment conditions, depending on whether the segment was assigned to unit or cluster randomization. The output of this final hash is the treatment vector used for the experiment.
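To make the hashing stages concrete, here is a minimal Python sketch of deterministic, salted hash-based assignment. The function names, the segment count of 100, and the 50/50 splits are illustrative assumptions, not the production implementation:

import hashlib

def bucket(key: str, salt: str, num_buckets: int) -> int:
    """Deterministically hash a salted key into one of num_buckets buckets."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def assign(cluster_id: str, unit_id: str, universe: str, experiment: str):
    # 1) Hash clusters into universe segments (assume 100 segments).
    segment = bucket(cluster_id, universe, 100)
    # 2) Segments allotted to this experiment are split into unit- vs.
    #    cluster-randomized groups via a hash salted by the experiment name.
    cluster_randomized = bucket(str(segment), experiment, 2) == 0
    # 3) Final condition assignment hashes the cluster or the unit,
    #    depending on the segment's randomization type.
    key = cluster_id if cluster_randomized else unit_id
    condition = "test" if bucket(key, experiment + ":cond", 2) == 0 else "control"
    return segment, cluster_randomized, condition

Because every stage is a deterministic hash of stable identifiers, the same unit always lands in the same condition without any stored assignment table.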


Figure 2. Visualization of the randomization process for network experiments.

The other main component of network experiments is the clustering of experimental units. An ideal clustering captures all interference within clusters, so that there is no between-cluster interference, which removes the bias in our estimators. A naive approach that captures all interference is to group all units into one huge cluster. This is unacceptable, however, because a cluster-randomized experiment must also retain enough statistical power to detect treatment effects. A single cluster containing all units has no power, while a clustering that puts each unit in its own cluster, which is equivalent to unit randomization, has maximal power but captures no interference. This is fundamentally a bias-variance trade-off: capturing more interference yields less bias, while retaining statistical power requires smaller clusters. In our paper, we consider two prototypical clustering algorithms because of their scalable implementations: Louvain community detection and recursive balanced partitioning. We find that the unbalanced graph clusters produced by Louvain are typically superior in terms of the bias-variance trade-off for graph cluster randomization.
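As a minimal sketch of what generating such a graph clustering can look like, the snippet below runs NetworkX's built-in Louvain implementation (available in NetworkX 2.8 and later) on a toy interaction graph. The graph, edge weights, and resolution value are illustrative assumptions, not the clustering pipeline from the paper:

import networkx as nx

# Toy graph: edges weighted by interaction intensity between units,
# e.g., message counts between pairs of users.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 5.0), ("b", "c", 1.0), ("a", "c", 2.0),
    ("d", "e", 4.0), ("e", "f", 3.0), ("c", "d", 0.5),
])

# Louvain produces unbalanced clusters that follow real community structure.
# The resolution parameter trades cluster size (statistical power) against
# how much interference is captured within clusters (bias).
clusters = nx.community.louvain_communities(
    G, weight="weight", resolution=1.0, seed=42
)
print(clusters)  # e.g., [{'a', 'b', 'c'}, {'d', 'e', 'f'}]

The cluster sizes that come out are uneven by design; in our experience that unevenness is what buys the favorable bias-variance trade-off relative to forcing balanced partitions.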

Analysis of network experiments

We are mainly interested in the average treatment effect (ATE) of an intervention (a product change or a new feature): the average effect when the intervention is applied to all users. Many ATE estimators exist for cluster-randomized trials, ranging from cluster-level summary methods to mixed-effects models and generalized estimating equations. For ease of implementation at scale and for explainability, our framework uses the difference-in-means estimator, i.e., test_mean − control_mean. Details of the estimators and their variance estimates can be found in our paper. Here we briefly introduce our two methodological contributions for variance reduction: agnostic regression adjustment and trigger logging (logging the units that actually receive the intervention). Variance reduction is essential because cluster-randomized experiments typically have less power than unit-randomized ones. In our framework, we use the contrast between conditions on pre-treatment metrics as covariates for regression adjustment, and we show that the adjusted estimator is asymptotically unbiased with much smaller variance. Trigger logging, in turn, lets us estimate the ATE using only the units actually exposed in the experiment; under mild assumptions, we show that the ATE on the exposed units equals the ATE on all units assigned to the experiment. Figure 3 shows, for seven metrics in a Stories experiment, how point estimates and confidence intervals change when we perform an intent-to-treat (ITT) analysis on triggered clusters rather than triggered users, and when we forgo regression adjustment. The variance reduction from regression adjustment and trigger logging is substantial.
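The following Python sketch illustrates the general idea of a regression-adjusted difference-in-means on cluster-level data, in the spirit of CUPED-style covariate adjustment; it is a simplified stand-in under stated assumptions, not the exact estimator from the paper:

import numpy as np

def adjusted_diff_in_means(y_t, y_c, x_t, x_c):
    """Cluster-level difference-in-means with a simple regression adjustment.

    y_t, y_c: per-cluster outcome means in test and control, restricted to
              triggered clusters (i.e., using trigger logging).
    x_t, x_c: the same metric measured pre-experiment, used as a covariate.
    """
    y_t, y_c = np.asarray(y_t, float), np.asarray(y_c, float)
    x_t, x_c = np.asarray(x_t, float), np.asarray(x_c, float)
    y = np.concatenate([y_t, y_c])
    x = np.concatenate([x_t, x_c])
    # Regression coefficient of the outcome on the pre-treatment covariate.
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Residualize both arms on the covariate, then take the difference.
    adj_t = y_t - theta * (x_t - x.mean())
    adj_c = y_c - theta * (x_c - x.mean())
    return adj_t.mean() - adj_c.mean()

Because pre-treatment metrics are unaffected by the intervention, subtracting their explained variation shrinks the variance of the estimate without introducing asymptotic bias.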


Figure 3. Comparison of ATE estimates with scaled 95 percent confidence intervals, computed for triggered users and triggered clusters (ITT), with and without regression adjustment (RA), for the cluster test and control conditions in a Stories experiment.

Use case: commuting zone experiment

As an illustrative example, we describe a commuting zone experiment in this blog post. Commuting zones, shown in Figure 4, are a Facebook Data for Good product and can serve as a geographic clustering for network experiments at Facebook. For products like Jobs on Facebook (JoF), geographic clusters can be particularly suitable because individuals are likely to interact with employers near their own location. To demonstrate the value of network experimentation, we ran a mixed experiment, with unit-randomized and cluster-randomized conditions side by side, for a JoF product change that promotes jobs with few previous applications.


Figure 4. Facebook commuting zones in North America


Table 1. Results of the commuting zone experiment

Table 1 summarizes the results of this experiment. In the unit-randomized condition, applications to jobs with no previous applications rose by 71.8 percent. The cluster-randomized condition, however, showed that this estimate was biased upward: there we measured a 49.7 percent increase instead. This comparison benefited substantially from regression adjustment, which can shrink confidence intervals in commuting zone experiments by over 30 percent.

By randomizing this experiment at the commuting zone level, the team also confirmed that changes to the user experience that increase this metric can lead employers to post more jobs on the platform (the probability that an employer posts an additional job increased by 17 percent). Understanding the interactions between applicants and employers in a two-sided marketplace is important for the health of such a marketplace, and network experiments help us better understand these interactions.

Why it matters:

Experimentation under interference has been studied for many years across industries because of its practical importance. Our paper presents a practical framework for designing, implementing, and analyzing network experiments at scale. This framework lets us better predict what will happen when we launch a product or ship a product change in Facebook apps.

Our implementation of network experiments accommodates mixed experiments, cluster updates, and the need to support multiple concurrent experiments. The simple analysis method we present achieves considerable variance reduction through trigger logging and our novel cluster-based regression-adjusted estimator. We also introduce a cluster evaluation method that shows the bias-variance trade-off favoring unbalanced clusters and allows researchers to assess this trade-off for any clustering they wish to investigate. We hope that experimenters and practitioners find this framework useful in their applications, and that the findings in the paper spur future research on the design and analysis of experiments under interference.

Read the full paper:

Network experimentation at scale
