{"id":1101960,"date":"2024-11-20T12:45:40","date_gmt":"2024-11-20T20:45:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1101960"},"modified":"2024-11-22T12:16:51","modified_gmt":"2024-11-22T20:16:51","slug":"external-validity-of-online-experiments-can-we-predict-the-future","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/external-validity-of-online-experiments-can-we-predict-the-future\/","title":{"rendered":"External Validity of Online Experiments: Can We Predict the Future?"},"content":{"rendered":"\n
> "It is difficult to make predictions, especially about the future."
>
> – Yogi Berra (perhaps apocryphal)
How well can experiments be used to predict the future? At Microsoft's Experimentation Platform (ExP), we pride ourselves on ensuring the trustworthiness of our experiments. We carefully check that all statistical tests are run correctly and that the assumptions underlying them are valid. But is this enough? All of this goes to the internal validity of our experiments. What about their external validity [1]?
Suppose an experiment on Norwegian users shows that a change to your website increases revenue by 5%, with a tiny p-value and a narrow confidence interval. How confident can you be that the treatment would increase revenue by 5% when shipped in Korea? Most data scientists would express reservations and urge that a separate experiment be run on Korean users. They would point out that Norwegian and Korean users likely have different preferences and behaviors, and that a change users in one country love may be hated in another. In other words, they would question the external validity of the experiment: to draw conclusions about a second population of users, you should run an experiment on that population.
### External Validity of the Future
However, at ExP we (along with every other online experimentation platform in the world) routinely assume the external validity of our results on a population that we never experimented on: users in the future. If we see a 5% revenue gain in an experiment one week, we assume that we will get revenue gains after we ship the feature, even though the future is different: user behavior may change over time, the type of users who use the product may change, other developers may ship features which interact with the first one, and so on. How much should we worry about external validity here?
It's a bit strong to say that we just "assume" this. We're of course well aware both that the future is different and that issues like "the winner's curse" lead to systematically overestimating treatment effects [2, 3]. We frequently validate our assumptions by running reverse experiments (where the treatment reverts a feature) or holdout flights (where a holdout group of users is never shipped one or more new features) [4, 5]. Our focus here is not on these sorts of checks, but rather on how often we should expect to have problems with external validity for users in the future.
### External Validity and Surprises
Suppose we've collected data for a week and calculated a treatment effect and a 3σ confidence interval for a metric. We're planning to collect a second week of data and combine it with the first week to get an even better estimate. How confident are you that the new estimate will lie within that original 3σ confidence interval? How often would you expect to be surprised? Would you expect the surprises to come from external validity problems? Before reading on, try to form a rough estimate for your expectation.
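Once you have your guess, here is a baseline for comparison: a minimal Monte Carlo sketch (hypothetical Python, not ExP code) assuming a constant true effect and two independent weeks of data with equal precision.

```python
# Minimal sketch, assuming a constant true effect and i.i.d. Gaussian noise
# with equal precision in both weeks (an idealized model, not the ExP setup).
import numpy as np

rng = np.random.default_rng(0)
n_sims = 1_000_000
sigma_week = 1.0   # standard error of a one-week estimate (arbitrary units)

week1 = rng.normal(0.0, sigma_week, n_sims)   # week-1 estimates of the effect
week2 = rng.normal(0.0, sigma_week, n_sims)   # week-2 estimates of the same effect
combined = (week1 + week2) / 2                # pooled two-week estimate

# "Surprise": the combined estimate falls outside week1 +/- 3 * sigma_week.
surprise_rate = np.mean(np.abs(combined - week1) > 3 * sigma_week)
print(f"Surprise rate: {surprise_rate:.1e}")  # roughly 2e-5 under these assumptions
```

Under these idealized assumptions, the combined estimate differs from the week-1 estimate by a Gaussian with standard deviation σ/√2, so crossing the 3σ boundary is about a 4.2σ event, roughly 2 in 100,000. The question is how far reality departs from that baseline.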
## Next-Day External Validity: What Will Tomorrow Bring?
We started by looking at one-day metric movements. Most ExP experiments generate a 7-day scorecard, which calculates treatment effects and their standard errors for all metrics relevant to the feature being experimented on. The scorecards typically also calculate those for each of the 7 individual days. We collected all 7-day scorecards generated by ExP over the course of a week. For every metric, for all 6 pairs of adjacent days, we compared the treatment effect estimates for those two days by calculating
\( z = \frac{\Delta_{t+1} - \Delta_t}{\sqrt{\sigma_t^2 + \sigma_{t+1}^2}} \)
Here \( \Delta_t \) is the observed treatment effect on day \( t \) and \( \sigma_t \) is its standard error. This gave us several million treatment-effect pairs, drawn from over a thousand online experiments.
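In code, the per-metric computation looks roughly like the sketch below (hypothetical Python; the column names and scorecard layout are illustrative, not the actual ExP schema).

```python
# Hypothetical sketch of the adjacent-day z computation; 'day', 'effect' and
# 'stderr' are illustrative column names, not the real scorecard schema.
import numpy as np
import pandas as pd

def adjacent_day_z(daily: pd.DataFrame) -> pd.Series:
    """z for each pair of adjacent days of one metric in one experiment.

    Expects one row per day with columns:
      day    - the date of the daily scorecard
      effect - the treatment effect Delta_t for that day
      stderr - its standard error sigma_t
    """
    daily = daily.sort_values("day")
    delta_diff = daily["effect"].diff()                                # Delta_{t+1} - Delta_t
    pooled_se = np.sqrt(daily["stderr"] ** 2 + daily["stderr"].shift() ** 2)
    return (delta_diff / pooled_se).dropna()                           # 6 values per 7-day scorecard

# Applied per (experiment, metric) group over all collected scorecards, e.g.:
# z_values = scorecards.groupby(["experiment", "metric"]).apply(adjacent_day_z)
```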
### Next-Day Deviations: A First Look
If the treatment effects from the two adjacent days are drawn independently from the same distribution, we should expect z to follow a unit Gaussian distribution. In practice, we might expect some positive correlation between the two days, which would shift the distribution toward smaller values of |z|. Comparing the distributions:
*Figure 1: Day-to-day differences of treatment effects follow the expected normal distribution for |z| < 3.*

At first sight, it looks pretty good! There's an extra spike in the observed distribution at z = 0, which corresponds to a large number of metrics that have exactly 0 treatment effect on both days. Most of those come from poorly designed metrics that are almost always 0 in both control and treatment. But other than that, the fit looks quite close.
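A plot along the lines of Figure 1 can be produced with a short sketch like the following (hypothetical Python; `z_values` is the array of z scores from the earlier sketch).

```python
# Hypothetical sketch: overlay the empirical density of the z values on the
# unit Gaussian density, as in Figure 1.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

def plot_z_density(z_values: np.ndarray) -> None:
    grid = np.linspace(-5, 5, 500)
    plt.hist(z_values, bins=200, range=(-5, 5), density=True, alpha=0.5, label="observed")
    plt.plot(grid, norm.pdf(grid), color="black", label="unit Gaussian")
    plt.xlabel("z")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```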
### Next-Day Deviations: A Second Look
Before we declare success and close up shop, let's switch to a cumulative distribution function and plot it on a log scale:
*Figure 2: Day-to-day differences of treatment effects are much more common than the normal distribution would predict for |z| > 3.*

Now we see that the match is pretty good for |z| < 3, but past that point we start to get large |z| values much more often than the Gaussian distribution would predict. As mentioned above, if there were positive correlations, we would have fewer large values of |z| than predicted by the unit Gaussian. But we have more. To show a few sample values from the graph above:
| z | Observed CDF | Unit Gaussian CDF |
|------|--------------|-------------------|
| 1.96 | 3.4% | 5.0% |
| 3 | 0.25% | 0.27% |
| 4 | 0.09% | 0.006% |
| 5 | 0.07% | 0.00006% |
| 10 | 0.03% | 2×10⁻²¹% |

*Table 1: Selected data points from the graph above, comparing the observed CDF with the CDF of a unit Gaussian distribution.*

Observing differences with |z| > 10 should essentially be impossible. In practice, it's not common, but it does happen much more than it should: 3 out of 10,000 times isn't a lot in absolute terms, but it's a factor of 1.5×10¹⁹ too high!
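The tail comparison in Table 1 can be reproduced with a few lines (hypothetical Python; `z_values` is again the array of z scores from the earlier sketch).

```python
# Hypothetical sketch: observed fraction of |z| above each threshold versus the
# two-sided tail probability of a unit Gaussian, as in Table 1.
import numpy as np
from scipy.stats import norm

def tail_comparison(z_values: np.ndarray, thresholds=(1.96, 3, 4, 5, 10)) -> None:
    for t in thresholds:
        observed = np.mean(np.abs(z_values) > t)
        expected = 2 * norm.sf(t)    # P(|Z| > t) for Z ~ N(0, 1)
        print(f"|z| > {t:>5}: observed {observed:.2e}, unit Gaussian {expected:.2e}")
```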
### Weekday and Weekend Effects
You might think that these large discrepancies come from comparing weekdays to weekends. For many business products, like Office or MS Teams, we would expect usage to be quite different on Friday and Saturday, for example. If we segment based on whether we're comparing two weekdays, two weekends, or a weekday to a weekend, we do find more large discrepancies when comparing a weekday to a weekend. But large discrepancies are found in all three categories: