At Microsoft, we continuously improve our products by developing new features for them. To facilitate data-driven decision-making in software development, product teams across Microsoft run tens of thousands of A/B tests each year. While the primary purpose of A/B testing is to rigorously evaluate customer satisfaction with new features and experiences, it also helps uncover anomalies, bugs, performance degradation, and user dissatisfaction. To catch these issues early, we rely on alerts: proactive notifications to experimenters when something unexpected has occurred in an A/B test. For example, Bing alone runs a large number of A/B tests every year, which raise hundreds of alerts (ref. Fig. 1). These alerts have helped identify and rectify significant issues.
In this blog post, we highlight the importance of timely and trustworthy alerts, illustrate the alerting methodologies, and summarize the typical alerting mechanisms and workflows on our experimentation platform.
Why are alerts important?
In 2020, one of our product features had a minor bug in which a subset of users were mistakenly assigned duplicate Device IDs (the randomization unit for A/B tests). The bug was resolved, and these users were then assigned new (and different) IDs.
A few days later, one of the A/B tests running on the product resulted in a Sample Ratio Mismatch (SRM) alert between treatment and control [1]. The team investigated and discovered that the treatment variant contained a larger-than-expected number of users because it erroneously read the old, corrupt Device IDs, whereas the control variant correctly read the new Device IDs and included the expected number of users. This resulted in a mismatch in user (device) counts between the treatment and control variants. The SRM alert helped the experimenters promptly identify the issues with the A/B test setup and the use of corrupt IDs, fix them, and restart the test quickly.
As this example shows, alerts help experimenters ensure their A/B tests’ success by detecting, in a timely manner, egregious changes in the feature being tested. At Microsoft, alerts fired during A/B tests regularly help experimenters catch issues before they negatively impact user experience or satisfaction. Issues ranging from incorrect ‘no-result’ search queries to software crashes in productivity tools, among other failures, get timely attention and notification. Similar practices are adopted by other companies in the industry: for example, Criteo has an automated system to raise alerts on key metrics such as clicks and click revenue [2], and Walmart Labs ran anomaly detection and alerting on visitor counts [3].
In the next section, we will share the main alert types at Microsoft ExP and present the alerting workflow in our system.
What are the main alerts used by Microsoft ExP?
Sample Ratio Mismatch (SRM)
SRM is an important issue in A/B testing [4] (read more here), and therefore we incorporate it in our alerting system. Before each experiment, we configure the ratio of counts of randomization units (e.g., users, devices) between the treatment and control groups, typically a 1:1 split. During the experiment, we compute the actual sample ratio and the corresponding p-value using a chi-squared test. An alert is fired whenever the observed ratio deviates from the configured value in a statistically significant way.
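For illustration, a minimal sketch of such a check might look like the following (the function name and the example counts are made up, and this is not the platform’s actual implementation):

```python
# A minimal SRM check sketch: chi-squared goodness-of-fit test of the observed
# treatment/control counts against the configured split. Illustrative only.
from scipy.stats import chisquare

def srm_p_value(treatment_count, control_count, expected_ratio=(1, 1)):
    total = treatment_count + control_count
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    _, p_value = chisquare([treatment_count, control_count], f_exp=expected)
    return p_value

# Example: 505,000 treatment users vs. 500,000 control users under a 1:1 split.
p = srm_p_value(505_000, 500_000)
print(f"SRM p-value: {p:.2e}")  # a p-value this small signals a likely SRM
```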
Metric Out of Range
Roughly speaking, the metric-out-of-range alert fires when a metric (e.g., click-through rate) is observed to be radically different in magnitude in treatment versus control, with a low enough p-value to be statistically significant [5]. To define this alert, we slightly modify the classic equivalence test procedure so that the null hypothesis is that the metric movement is within a pre-specified range (e.g., within ±1%) and the alternative is that the movement is out of range. In other words, we examine whether the metric movement is within the equivalence bounds (i.e., the “allowed” metric movement range) \(\begin{equation}\left[b_{A L}, b_{A U}\right]\end{equation}\), and the corresponding null and alternative hypotheses are:
\(\begin{equation}H_{0}: b_{A L} \leq \mu_{T}-\mu_{C} \leq b_{A U} \quad \text{versus} \quad H_{1}: \mu_{T}-\mu_{C} \lt b_{A L} \quad \text{or} \quad \mu_{T}-\mu_{C} \gt b_{A U}\end{equation}\)
Intuitively, if the metric moves drastically and out of the “normal” range, we reject \(\begin{equation}H_{0}\end{equation}\) and fire an alert. In practice, we rely heavily on domain knowledge, heuristics, and historical data to specify the equivalence bounds. For example, as pointed out in [5], we may not alert on two-millisecond page load time degradations, even if they are highly statistically significant. On the other hand, for key reliability metrics for system health, we might consider alerting even on a 0.1% change.
We conduct two one-sided tests (TOST) [6] to compute the corresponding p-values (ref. Fig. 2). Alerts will be fired if p-values fall below certain thresholds. \(\begin{equation}H_{0}\end{equation}\) can be decomposed into two one-sided hypotheses:
\(\begin{equation}H_{0 L}: \mu_{T}-\mu_{C} \geq b_{A L} \quad \text{and} \quad H_{0 U}: \mu_{T}-\mu_{C} \leq b_{A U}\end{equation}\)
First, we calculate two p-values for the two corresponding one-sided tests:
\(\begin{equation}p_{A L}=\text{Pr}\left(\frac{\left(m_{T}-m_{C}\right)-b_{A L}}{s} \lt z_{\alpha} \mid \mu_{T}-\mu_{C}=b_{A L}\right)\end{equation}\)
\(\begin{equation}p_{A U}=\text{Pr}\left(\frac{\left(m_{T}-m_{C}\right)-b_{A U}}{s} \gt z_{1-\alpha} \mid \mu_{T}-\mu_{C}=b_{A U}\right)\end{equation}\)
Second, we reject \(\begin{equation}H_{0}\end{equation}\) when we reject \(\begin{equation}H_{0L}\end{equation}\) or \(\begin{equation}H_{0U}\end{equation}\), i.e., when \(\begin{equation}p=\min \left(p_{A L}, p_{A U}\right)\end{equation}\) is small.
Remark: In practice, we often repeat the above procedure for relative movements and obtain:
\(\begin{equation}p_{R L}=\text{Pr}\left(\frac{\left(m_{T}-m_{C}\right)-b_{R L} \mu_{C}}{s} \lt z_{\alpha} \mid \mu_{T}-\mu_{C}=b_{R L} \mu_{C}\right)\end{equation}\)
\(\begin{equation}p_{R U}=\text{Pr}\left(\frac{\left(m_{T}-m_{C}\right)-b_{R U} \mu_{C}}{s} \gt z_{1-\alpha} \mid \mu_{T}-\mu_{C}=b_{R U} \mu_{C}\right)\end{equation}\)
and take the “aggregate” p-value to be \(\begin{equation}\min \left(p_{A L}, p_{A U}, p_{R L}, p_{R U}\right)\end{equation}\).
For an explanation of the terms \(\begin{equation}\mu_{T},\mu_{C},m_{T},m_{C},s_{T}^{2}, s_{C}^{2},N_{T},N_{C},s,b_{AL},b_{AU},b_{RL},b_{RU},p_{AL},p_{AU},p_{RL},p_{RU}\end{equation}\), refer to the glossary at the end.
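To make the procedure concrete, here is a minimal Python sketch of the computation described above, following the glossary definitions (e.g., \(p_{AL}=P(X \lt t_{AL})\)); this is illustrative rather than ExP’s production code, and it uses the control sample mean in place of the unknown population mean for the relative bounds:

```python
# A minimal sketch of the metric-out-of-range check; not ExP's production code.
# The control sample mean m_c stands in for the unknown population mean mu_C.
import math
from scipy.stats import norm

def out_of_range_p_value(m_t, m_c, s2_t, s2_c, n_t, n_c,
                         b_al, b_au, b_rl, b_ru):
    s = math.sqrt(s2_t / n_t + s2_c / n_c)     # standard error of m_T - m_C
    delta = m_t - m_c

    p_al = norm.cdf((delta - b_al) / s)        # absolute lower bound
    p_au = norm.sf((delta - b_au) / s)         # absolute upper bound
    p_rl = norm.cdf((delta - b_rl * m_c) / s)  # relative lower bound
    p_ru = norm.sf((delta - b_ru * m_c) / s)   # relative upper bound

    # Aggregate p-value: a small value means the movement is out of range.
    return min(p_al, p_au, p_rl, p_ru)

# Example: a click-rate metric dropping from 0.20 to 0.18 (a -10% relative
# move) with a +/-1% absolute and +/-5% relative band; the result is tiny,
# so an alert would fire.
p = out_of_range_p_value(m_t=0.18, m_c=0.20, s2_t=0.16, s2_c=0.16,
                         n_t=500_000, n_c=500_000,
                         b_al=-0.01, b_au=0.01, b_rl=-0.05, b_ru=0.05)
print(f"aggregate out-of-range p-value: {p:.2e}")
```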
Here is an example (ref. Fig. 3) of a metric-out-of-range alert configured on the “PageClickRate” metric on the Bing metric set. An alert gets fired (the experimenter gets notified) if this metric moves significantly in the negative direction by 5% or more.
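Expressed as a configuration, such an alert could look roughly like the sketch below; the field names and values are hypothetical and do not reflect the platform’s actual configuration schema:

```python
# Hypothetical alert configuration for the example above; these field names
# do not reflect ExP's real configuration format.
page_click_rate_alert = {
    "metric": "PageClickRate",
    "alert_type": "MetricOutOfRange",
    "relative_lower_bound": -0.05,  # fire on a drop of 5% or more
    "relative_upper_bound": None,   # no alert on movements in the positive direction
    "p_value_threshold": 0.001,     # assumed significance threshold
    "priority": "P1",               # assumed severity
}
```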
P-value adjustment in alerting
As pointed out in [5], “the naïve approach to alerting on any statistically significant negative metric changes will lead to an unacceptable number of false alerts and thus make the entire alerting system useless”, emphasizing the need for p-value adjustments for multiple testing. This is particularly important in practice because:
• There can be multiple A/B tests simultaneously running on the same product line.
• There can be multiple analyses (e.g., partial-day, 1-day, etc.) for an A/B test.
• There can be hundreds or even thousands of metrics in the analysis.
Depending on the user scenario, different adjustment methods can be chosen. Their technical details are out of the scope of this blog. For a recent review, see [7]. Common methods include the O’Brien & Fleming procedure [8] and the Benjamini & Hochberg false discovery rate control (for independent and dependent cases) [9] [10]. At Microsoft ExP we use the latter.
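For illustration, a Benjamini & Hochberg adjustment across a batch of per-metric p-values can be sketched as follows (the p-values and the 0.05 threshold are made-up inputs, and the production pipeline is more involved):

```python
# A minimal sketch of Benjamini & Hochberg false discovery rate control over
# the p-values of many metrics; only metrics that survive the adjustment
# would raise an alert. Inputs are illustrative.
from statsmodels.stats.multitest import multipletests

raw_p_values = [0.0001, 0.004, 0.019, 0.03, 0.21, 0.47, 0.74, 0.90]
reject, adjusted, _, _ = multipletests(raw_p_values, alpha=0.05, method="fdr_bh")

for p, p_adj, fire in zip(raw_p_values, adjusted, reject):
    print(f"raw p={p:.4f}  BH-adjusted p={p_adj:.4f}  alert={'yes' if fire else 'no'}")
```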
How does one get notified?
The experiment owners are usually notified of an alert on their experiment. The notification can be an email with all the details regarding the alert along with the metadata for that experiment. It may also contain a link for the experimenters to review and resolve the alert, or to suppress specific alerts in the future. This functionality is particularly useful when the system raises a false positive or raises an alert under a known scenario too often. How alerts are notified and how much time is allotted to investigate and fix the underlying issue also depend largely on the severity of the alert.
A priority-based system can be used to categorize alerts, where warnings raised on key metrics are prioritized higher than warnings raised on other metrics. At Microsoft ExP, we categorize alerts in the following way:
• P0: A P0 alert indicates that something has gone catastrophically wrong (most severe) and needs immediate attention.
• P1: A P1 alert indicates that something quite serious is happening (severe), and experimenters should investigate it as soon as possible.
• P2: A P2 alert indicates that something potentially wrong is happening (less severe), and experimenters should investigate it.
It is the experiment owners’ responsibility to determine an appropriate course of action based on an alert’s priority. How experimenters are notified can also depend on the severity of the alert. For less severe alerts like P2 alerts, an email notification can be sufficient. For more severe alerts like P0 and P1, it can also be beneficial to configure automated responses along with email notifications. These act as a safety net in situations where the experimenters miss taking timely action.
One such automated response is the auto-shutdown of experiments. At Microsoft ExP, we enable alert-based auto-shutdown for some of our experiments. This is done in scenarios where acting on the alert is time-sensitive, and not shutting down the experiment could have unwanted ramifications in the form of a sub-optimal user experience.
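As a rough sketch of how such a policy could look (the types, function, and shutdown logic below are hypothetical and not the platform’s actual API):

```python
# A hypothetical severity-based response policy: every alert triggers a
# notification, and P0 alerts on opted-in experiments also auto-shutdown.
# This is a sketch, not ExP's implementation.
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    priority: str  # "P0", "P1", or "P2"

@dataclass
class Experiment:
    name: str
    auto_shutdown_enabled: bool = False
    running: bool = True

def handle_alert(alert: Alert, experiment: Experiment) -> None:
    # All severities get a notification (stubbed here as a print).
    print(f"[notify] {alert.priority} alert on '{alert.metric}' "
          f"for experiment '{experiment.name}'")
    # Only the most severe alerts on opted-in experiments trigger auto-shutdown.
    if alert.priority == "P0" and experiment.auto_shutdown_enabled:
        experiment.running = False
        print(f"[auto-shutdown] experiment '{experiment.name}' stopped")

handle_alert(Alert(metric="PageLoadTime", priority="P0"),
             Experiment(name="new-ranker-test", auto_shutdown_enabled=True))
```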
Summary
At Microsoft ExP we strive to create a platform that enables different product teams at Microsoft to run and analyze trustworthy A/B tests. A key component of the experimentation platform is a robust alerting mechanism, which keeps experimenters informed of bugs, anomalies, and surprising results early on. In this blog post, we summarized how alerting is incorporated in our experimentation platform, including the most important alerts, how they are defined and analyzed, and the high-level alerting workflow.
-Ankita Agrawal and Jiannan Lu, Microsoft Experimentation Platform
Glossary
- Population Mean \(\begin{equation}\left(\mu_{T}, \mu_{C}\right)\end{equation}\): This is the true average value of the metric in the treatment and control populations.
- Sample Mean \(\begin{equation}\left(\boldsymbol{m}_{T}, \boldsymbol{m}_{C}\right)\end{equation}\): This is the average value of the metric of interest, computed from user telemetry after running the experiment for some days.
- Absolute Sample Delta \(\begin{equation}\left(m_{T} - m_{C}\right)\end{equation}\): The difference in sample means between the treatment and control groups.
- Sample Variance \(\begin{equation}\left(s_{T}^{2}, s_{C}^{2}\right)\end{equation}\): This is the sample variance of the metric in the treatment and control groups.
- Sample Count \(\begin{equation}\left(N_{T}, N_{C}\right)\end{equation}\): This is the number of randomization units (e.g., users) in the treatment and control groups, obtained from user telemetry.
- Standard error (s) of the absolute sample delta \(\begin{equation}\left(m_{T} - m_{C}\right)\end{equation}\): \(\begin{equation}s=\sqrt{\frac{s_{T}^{2}}{N_{T}}+\frac{s_{C}^{2}}{N_{C}}}\end{equation}\)
- Equivalence bounds \(\begin{equation}\left(b_{A L}, b_{A U}, b_{R L}, b_{R U}\right)\end{equation}\): These are the bounds set by the experimenter and can be understood as the “acceptable” range of metric movement. Any movement outside this range should fire an alert, which is the hypothesis we test. The absolute and relative bounds are denoted by:
- Absolute Lower Bound: \(\begin{equation}b_{A L}\end{equation}\)
- Absolute Upper Bound: \(\begin{equation}b_{A U}\end{equation}\)
- Relative Lower Bound: \(\begin{equation}b_{R L}\end{equation}\)
- Relative Upper Bound: \(\begin{equation}b_{R U}\end{equation}\)
- Test-Statistic \(\begin{equation}\left(t_{A L}, t_{A U}, t_{R L}, t_{R U}\right)\end{equation}\): A test statistic is a quantity calculated from sample data that measures the degree of agreement between the sample and the null hypothesis. The test statistics are denoted by:
- Absolute Lower \(\begin{equation}\boldsymbol{t}_{A L}=\frac{\left(m_{T}-m_{C}\right)-b_{A L}}{s}\end{equation}\)
- Absolute Upper \(\begin{equation}\boldsymbol{t}_{A U}=\frac{\left(m_{T}-m_{C}\right)-b_{A U}}{s}\end{equation}\)
- Relative Lower \(\begin{equation}\boldsymbol{t}_{R L}=\frac{\left(m_{T}-m_{C}\right)-b_{R L} \mu_{C}}{s}\end{equation}\)
- Relative Upper \(\begin{equation}\boldsymbol{t}_{R U}=\frac{\left(m_{T}-m_{C}\right)-b_{R U} \mu_{C}}{s}\end{equation}\)
- P-value \(\begin{equation}\left(p_{A L}, p_{A U}, p_{R L}, p_{R U}\right)\end{equation}\): This is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is true.
- Absolute: \(\begin{equation}\boldsymbol{p}_{A L}=P\left(X \lt t_{A L}\right), \; \boldsymbol{p}_{A U}=P\left(X \gt t_{A U}\right)\end{equation}\)
- Relative: \(\begin{equation}\boldsymbol{p}_{R L}=P\left(X \lt t_{R L}\right), \; \boldsymbol{p}_{R U}=P\left(X \gt t_{R U}\right)\end{equation}\)
References
[1] A. Fabijan, J. Gupchup, S. Gupta, J. Omhover, W. Qin, L. Vermeer and P. Dmitriev, “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[2] H. Hamel, “A/B Testing fast & secure, or how to improve ads iteratively, quickly, safely,” https://medium.com/criteo-engineering/a-b-testing-fast-secure-or-how-to-improve-ads-iteratively-quickly-safely-ab614e0d83fc, 2019.
[3] R. Esfandani, “Monitoring and alerting for A/B testing: Detecting problems in real time,” https://medium.com/walmartglobaltech/monitoring-and-alerting-for-a-b-testing-detecting-problems-in-real-time-4fe4f9b459b6, 2018.
[4] A. Fabijan, T. Blanarik, M. Caughron, K. Chen, R. Zhang, A. Gustafson, V. K. Budumuri and S. Hunt, “Diagnosing Sample Ratio Mismatch in A/B Testing,” https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/diagnosing-sample-ratio-mismatch-in-a-b-testing/, 2020.
[5] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu and N. Pohlmann, “Online Controlled Experiments at Large Scale,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013.
[6] D. Schuirmann, “A comparison of the Two One-Sided Tests Procedure and the Power Approach for assessing the equivalence of average bioavailability,” Journal of Pharmacokinetics and Biopharmaceutics, pp. 657-680, 1987.
[7] A. Farcomeni, “A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion,” Statistical Methods in Medical Research, vol. 17, pp. 347-388, 2008.
[8] P. O’Brien and T. Fleming, “A Multiple Testing Procedure for Clinical Trials,” Biometrics, vol. 35, no. 3, pp. 549-556, 1979.
[9] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society, Series B, vol. 57, pp. 289-300, 1995.
[10] Y. Benjamini and D. Yekutieli, “The control of the false discovery rate in multiple testing under dependency,” Annals of Statistics, vol. 29, pp. 1165-1188, 2001.