{"id":868917,"date":"2022-11-15T07:22:30","date_gmt":"2022-11-15T15:22:30","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=868917"},"modified":"2022-11-15T12:33:24","modified_gmt":"2022-11-15T20:33:24","slug":"deep-dive-into-variance-reduction","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/deep-dive-into-variance-reduction\/","title":{"rendered":"Deep Dive Into Variance Reduction"},"content":{"rendered":"\n
Variance Reduction (VR) is a popular topic that is frequently discussed in the context of A\/B testing. However, it requires a deeper understanding to maximize its value in an A\/B test.\u202f In this blog post, we will answer questions including: What does the \u201cvariance\u201d in VR refer to? \u202fWill VR make A\/B tests more trustworthy?\u202f How will VR impact the ability to detect true change in A\/B metrics? <\/p>\n\n\n\n
This blog post provides an overview of ExP\u2019s implementation of VR, a technique called CUPED (Controlled experiment Using Pre-Experiment Data). Other authors have contributed excellent explainers of CUPED\u2019s performance and its ubiquity as an industry-standard variance reduction technique [1][2]. We have covered in previous blog posts how ExP uses CUPED in the experiment lifecycle [3]. <\/p>\n\n\n\n
In this post, we share the foundations of VR in statistical theory and how it amplifies the power of an A\/B testing program without increasing the likelihood of making a wrong decision. [a]<\/a>[4]<\/p>\n\n\n\n [a]<\/a> Many of the elements covered quickly in this blog are covered in excellent detail in Causal Inference and Its Applications in Online Industry [4].<\/p>\n\n\n\n To understand where variance reduction fits in, let\u2019s start with a more fundamental question: What\u2019s our ideal case for analyzing an A\/B test? <\/em>We want to estimate the difference in two potential outcomes for a user: the outcome in a world where the treatment was applied, and the outcome in a world where the treatment was not applied \u2013 the counterfactual. <\/p>\n\n\n\n The fundamental challenge of causal inference is that we cannot observe those two worlds simultaneously, and so we must come up with a process for estimating the counterfactual difference. In A\/B testing, that process relies on applying treatments to different users. Different users are never perfect substitutes for one another because their outcomes are functions not only of the treatment assignment but also of many other factors that influence user behavior.<\/p>\n\n\n\n Causal inference is a set of scientific methods to estimate the counterfactual difference in potential outcomes between our two imagined worlds. Any process of estimating this counterfactual difference introduces uncertainty. <\/p>\n\n\n\n Statistical inference is the process of proposing and refining estimators of an average counterfactual difference to improve the estimators\u2019 core statistical properties: bias (whether the estimator\u2019s expected value equals the true effect), variance (how much its estimates fluctuate across repeated samples), and consistency (whether its estimates converge to the true effect as the sample grows). <\/p>\n\n\n\n In fact, that\u2019s what the \u201cvariance\u201d in variance reduction refers to: the variance of the estimator of the average treatment effect. 
Variance reduction (as in CUPED-VR) is not a reduction in variance of the underlying data<\/em>, such as when sample data is modified through outlier removal, capping, or log-transformation. \u202fInstead, variance reduction refers to a change in the estimator <\/em>that produces estimates of the\u202ftreatment effect with lower standard error. <\/p>\n\n\n\n Random assignment ensures that the difference between treatment and control populations is an unbiased estimator of the average treatment effect. However, we need to consider how much uncertainty our estimation process has introduced. <\/p>\n\n\n\n To do so, we use the known rate of convergence to the true population difference \u2013 called consistency <\/em>\u2013 to estimate the true variance of the average treatment effect using our sample. With the delta estimate from difference-in-means (\( \delta_{DiM}\)) and the sample variance estimate, we report an interval of estimates that is likely to contain the true population difference, called a confidence interval<\/em>:<\/p>\n\n\n\n \( \begin{aligned} Var(\delta_{DiM}) &=\frac{ \sigma_{Y^T}^2}{n^T} + \frac{ \sigma_{Y^C}^2}{n^C} \\ CI_{lb,ub}&= \delta_{DiM} \pm z_{\alpha\/2}\sqrt{Var(\delta_{DiM})} \\ \end{aligned} \) [b]<\/a><\/p>\n\n\n\n The difference-in-means estimator for the average treatment effect is unbiased, and the variance of the estimator shrinks at a known rate as the sample size grows. When we propose VR estimators, we\u2019ll need to describe their relationship to the bias, variance, and the consistent variance estimate of the difference-in-means estimator to understand if we\u2019re improving.<\/p>\n\n\n\n [b]<\/a> \( z_{\alpha\/2} \) is the standard normal quantile at your acceptable \( \alpha \), or false positive rate. 
For example, a 95% confidence interval uses 1.96 for \( z_{0.05\/2} \).<\/p>\n\n\n\n Statistical tests that use variance reduction rely on an additional strategy to reduce the variance of an estimator of the average treatment effect, which yields a power benefit similar to increasing the A\/B test sample size.<\/p>\n\n\n\n This is rooted in the insight that even with a single user in treatment and a single user in control, if the users are good substitutes for one another, we expect to obtain a treatment effect estimate that\u2019s closer to the true treatment effect than if the users are very different from one another. The assignment procedure can be modified to try to ensure \u201cbalanced\u201d treatment and control assignments. Re-randomization of assignments with checks to ensure baseline balance uses this idea [5].<\/p>\n\n\n\n In many online A\/B tests, we don\u2019t modify our assignment procedure. Instead, we perform a correction in the analysis phase with VR estimators. VR combines large-sample asymptotic properties of A\/B tests with the optimization of comparing similar users through statistical adjustment. Similarity is modeled using user characteristics known to be independent of the treatment assignment.<\/p>\n\n\n\n CUPED is one method of VR, with the following steps: (1) choose a pre-experiment covariate \( X \) that is correlated with the metric \( Y \) \u2013 typically the same metric computed over a pre-experiment period; (2) estimate the adjustment coefficient \( \theta = Cov(X,Y)\/Var(X) \); (3) construct the adjusted metric \( Y_{CUPED} = Y - \theta(X - \bar{X}) \); and (4) apply the difference-in-means estimator to the adjusted metric.<\/p>\n\n\n\n From simulating CUPED-VR\u2019s performance versus difference-in-means on repeated samples of the same data, we can observe the extent of variance reduction for the estimator (plot<\/em> below<\/em>). 
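A simulation along these lines can be sketched in a few lines of Python. This is our own minimal illustration, not ExP code: the data-generating process, the effect size of 2.5, and names like `theta` are invented for the example, and `theta` is estimated from the pooled sample as in the CUPED procedure above.

```python
import numpy as np

rng = np.random.default_rng(7)
true_effect = 2.5
n, n_trials = 500, 1_000

dim_estimates, cuped_estimates = [], []
for _ in range(n_trials):
    # Pre-experiment covariate X and a correlated in-experiment metric Y.
    x = rng.normal(10, 3, size=2 * n)
    assign = rng.permutation(np.repeat([0, 1], n))  # random assignment
    y = 5 + 0.8 * x + true_effect * assign + rng.normal(0, 2, size=2 * n)

    # Difference-in-means estimate on the raw metric.
    dim_estimates.append(y[assign == 1].mean() - y[assign == 0].mean())

    # CUPED: adjust Y using the pooled regression of Y on X, then take
    # the difference-in-means of the adjusted metric.
    theta = np.cov(x, y)[0, 1] / x.var(ddof=1)
    y_adj = y - theta * (x - x.mean())
    cuped_estimates.append(y_adj[assign == 1].mean() - y_adj[assign == 0].mean())

# Both sets of estimates center on the true effect, but the CUPED
# estimates have a visibly smaller spread across trials.
print("DiM   std across trials:", np.std(dim_estimates))
print("CUPED std across trials:", np.std(cuped_estimates))
```

Plotting the two lists of estimates as overlaid histograms reproduces the kind of comparison described next.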
In this plot of estimates, the CUPED-VR estimates that land closer to the true effect of 2.5 than the difference-in-means estimates from the same trial are shifted because, in those trials, CUPED-adjusted metrics accounted for chance imbalance in the pre-A\/B test period.<\/p>\n\n\n\n When the estimated coefficients are weighted by assignment probability, the CUPED-VR estimator is equivalent to another popular regression adjustment estimator for A\/B tests: ANCOVA2, or Lin\u2019s estimator [6][7] [Table 1]. <\/p>\n\n\n\n The CUPED-VR estimator has known analytic results [7] describing how its variance compares to the variance of the difference-in-means estimator:<\/p>\n\n\n\n\(\begin{aligned} Var(\delta_{VR}) &=\left(\frac{ \sigma_{Y^T}^2}{n^T} + \frac{ \sigma_{Y^C}^2}{n^C}\right) (1 - R^2) \\ Var(\delta_{DiM}) &=\frac{ \sigma_{Y^T}^2}{n^T} + \frac{ \sigma_{Y^C}^2}{n^C} \\ \end{aligned} \)\n\n\n\n The variance is reduced in proportion to the amount of variance explained by the linear model in treatment and control, or the total \( R^2 \). And, importantly, the estimator is still consistent: we don\u2019t sacrifice unbiasedness in favor of lower variance. This means that when we estimate the variance of our \( \delta_{VR} \), we can build narrower confidence intervals, with values that are closer to the \( \delta_{VR} \) but reflect the same level of confidence about the range. It also means that if the true treatment effect is non-zero, we are more likely to detect a statistically significant effect. Indeed, the ratio of raw variance to VR variance, \( \frac{1}{1-R^2} \), represents the factor by which traffic would need to be multiplied for the simple difference-in-means estimator to achieve the same precision as VR. The effectiveness of CUPED-VR is influenced by various attributes of the product, telemetry, experiment, and metric. At Microsoft, we see substantial differences in efficacy across different product surfaces and metric types. 
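The analytic relationship between \( R^2 \) and the variance of the two estimators can be made concrete with a short sketch. This is our own illustration of the formulas above; the variances, sample sizes, and \( R^2 \) value are invented:

```python
def var_dim(var_t: float, var_c: float, n_t: int, n_c: int) -> float:
    """Variance of the difference-in-means estimator."""
    return var_t / n_t + var_c / n_c

def var_vr(var_t: float, var_c: float, n_t: int, n_c: int, r2: float) -> float:
    """Variance of the CUPED-VR estimator: Var(DiM) * (1 - R^2)."""
    return var_dim(var_t, var_c, n_t, n_c) * (1.0 - r2)

# With R^2 = 0.5, the variance halves, so matching that precision without
# VR would require 1 / (1 - R^2) = 2x the traffic.
v_dim = var_dim(9.0, 9.0, 10_000, 10_000)
v_vr = var_vr(9.0, 9.0, 10_000, 10_000, 0.5)
print(v_dim, v_vr, v_dim / v_vr)  # -> 0.0018 0.0009 2.0
```

The ratio printed last is exactly the effective traffic multiplier discussed below the headings.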
Variance reduction is the use of alternative estimators, like CUPED, to improve on the difference-in-means estimator and effectively multiply the observed traffic in an A\/B test. <\/strong>Its variance-reducing properties are rooted in the foundations of design-based statistical inference, which makes it a trustworthy estimator at scale.<\/p>\n\n\n\n \u2013 Laura Cosgrove, Jen Townsend, and Jonathan Litz, Microsoft Experimentation Platform<\/em><\/p>\n\n\n\n
\n\n\n\nVariance is a Statistical Property of Estimators<\/h3>\n\n\n\n
The Difference-in-Means Estimator Provides Consistency in A\/B tests<\/h3>\n\n\n\n
CUPED-VR Outperforms the Difference-in-Means Estimator <\/h3>\n\n\n\n
CUPED-VR Procedure<\/h4>\n\n\n\n
\n\n\n\nMeasuring CUPED-VR Performance with Effective Traffic Multiplier<\/h4>\n\n\n\n
Decision-makers understand that having more traffic in an A\/B test for a given time period helps decrease time-to-decision or increase confidence in a decision if evaluating at a fixed time. And at ExP, we have found this to be an easy-to-interpret representation of VR’s efficacy for Microsoft experimenters. We surface it for each variance-reduced metric and refer to it as the \u201ceffective traffic multiplier\u201d.<\/p>\n\n\n\n
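In practice, the multiplier can be recovered from two standard errors that a scorecard already reports: the standard error of the unadjusted difference-in-means and the standard error of the variance-reduced estimate. A minimal sketch (the function name and the sample numbers are our own, not ExP's implementation):

```python
def effective_traffic_multiplier(se_dim: float, se_vr: float) -> float:
    """Effective traffic multiplier implied by two standard errors.

    Equals Var(DiM) / Var(VR) = 1 / (1 - R^2). Values near 1.0 mean
    variance reduction contributed little for this metric.
    """
    return (se_dim / se_vr) ** 2

# Hypothetical metric: VR shrank the standard error from 0.044 to 0.028,
# equivalent to roughly 2.5x as much traffic for the unadjusted estimator.
print(round(effective_traffic_multiplier(0.044, 0.028), 2))  # -> 2.47
```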
Based on a recent 12-week sample of week-long experiments, groups of VR metrics from two different surfaces for the same product have very different average performance. In one Microsoft product surface, VR is not effective for most metrics: a majority of metrics (>68%<\/strong>) <\/strong>have effective traffic multiplier <=1.05x<\/strong>. In contrast, another product surface sees substantial gain from VR methods: a majority of metrics (>55%<\/strong>) <\/strong>have effective traffic multiplier >1.2x. <\/strong><\/p>\n\n\n\nSummary<\/h2>\n\n\n\n
\n\n\n\nCUPED-VR and ANCOVA2 Comparison Table<\/a><\/h3>\n\n\n\n