{"id":832888,"date":"2022-04-06T09:30:31","date_gmt":"2022-04-06T16:30:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=832888"},"modified":"2022-04-06T09:30:31","modified_gmt":"2022-04-06T16:30:31","slug":"stedii-properties-of-a-good-metric","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/stedii-properties-of-a-good-metric\/","title":{"rendered":"STEDII Properties of a Good Metric"},"content":{"rendered":"
When a product adopts an experimentation-driven culture, software development tends to shift from being a top-down decision to more of a democratized approach. Instead of arguing about what should be built, product leaders define goals for metrics to improve the product, and they empower their teams to invest in changes that will ultimately achieve those goals. This allows the organization to innovate faster by testing multiple ideas for improvement, failing fast, and iterating.<\/p>\n
One of the best ways to experiment with a software product is to run A\/B tests. For successful A\/B tests, it is very important to have the right metrics. But what makes a good metric? Is the company\u2019s stock price a good metric for a product team? Probably not. It is not sensitive to small changes in the product, and we cannot observe the counterfactual \u2013 that is, the stock price in the universe where the treatment is not present. Perhaps the company could conduct an extensive user survey for each change, and then measure the degree of satisfaction that their users have for the change. However, such a survey for each product change would annoy users, it would be very costly to scale, and it would not be reflective of the overall user population because many users won\u2019t respond to the survey. These examples demonstrate just how challenging it is to define a good A\/B metric.<\/p>\n
So how do we define a good metric? After running hundreds of thousands of A\/B tests at Microsoft, we have identified six key properties of a good A\/B metric: sensitivity, trustworthiness, efficiency, debuggability, interpretability and actionability, and inclusivity and fairness \u2013 STEDII for short.<\/p>\n
In this blog post, we will examine each of these properties more closely to understand what makes a good metric, and we will provide checks to test a metric against each property. We would like to emphasize, however, that these are general<\/em> properties of a good experimentation metric. They are necessary properties for most experiment metrics, but they may not be sufficient for use in every single case. In later blog posts, for instance, we will discuss Overall Evaluation Criteria (OEC) metrics that should have additional properties, such as being a proxy for the overall product, user, and business health.<\/p>\n STEDII Checklist for Creating Good Metrics<\/p><\/div>\n In a previous blog post<\/a>, we discussed the details of measuring and checking the sensitivity of the metric. We will briefly summarize those details here. A sensitive metric has a high chance of detecting an effect when there is one. Conversely, when a sensitive, well-powered metric has no stat-sig movement, we have high confidence that there is no treatment effect.<\/b> Let \\( H_1\\)\u00a0be the alternative hypothesis \u2013 i.e., there is a real treatment effect of a certain size. Then,<\/p>\n \\(Prob\\)[Detecting the treatment effect on the metric] = \\(Prob(H_1)Prob(\\text{p-value}<0.05|H_1)\\).<\/p>\n To measure the sensitivity of a metric, we can use a labeled corpus consisting of tests in which there is high confidence that a treatment effect exists. That confidence is built by examining many metric movements to check whether they align with each test\u2019s hypothesis, and by deep dives and offline analyses that add more evidence for the correctness of those hypotheses. A sensitive metric will have a high proportion of stat-sig changes when there is an empirically known impact from the corpus. In cases where there is no labeled corpus, a more sensitive metric would have a higher proportion of overall tests where it is stat-sig. 
This will include no more than 5% false positive stat-sig tests, in the event that all tests in the corpus have no impact on the metric.<\/p>\n There are two components of metric sensitivity [1]:<\/p>\n If the movement probability of a metric is low, the metric will rarely have a statistically significant movement, even though it may have very high statistical power<\/strong>. In cases of well-optimized products, high-level metrics (such as \u201cdays active\u201d and \u201csessions per user\u201d) can be very difficult to improve in a short period of time, but they can regress very easily. So those metrics are poor indicators of success, but they are good guardrails. Proxy metrics with a higher positive movement probability are better success metrics that we can aim to improve in an experiment. In other cases, we can improve metric design to measure more sensitive transformations of a metric, such as the \u201ccapped average of log of time spent in Teams,\u201d instead of a simple average.<\/p>\n Statistical power is the probability of detecting a stat-sig change if the alternative hypothesis is true<\/strong>. We refer to the smallest magnitude of metric change detectable with high probability (usually 80%) as the Minimum Detectable Effect (MDE). The smaller the MDE, the higher the statistical power. If we assume that Treatment and Control have the same size and the same population variance, the relative change (expressed as a fraction of the control mean) that we can detect (using the t-test) with 80% power is approximately \\(\\frac{4 \\times CV}{\\sqrt{\\text{sample size}}}\\)\u00a0[4]. Here, \\(CV\\) is the coefficient of variation, defined as the ratio of the standard deviation (\\(\\sigma\\)) of a metric to its mean (\\(\\mu\\)), \\(CV = \\frac{\\sigma}{\\mu}\\). 
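This rule of thumb can be sketched in Python. This is a minimal illustration with made-up numbers; `relative_mde` is a hypothetical helper for this post, not part of any ExP API:

```python
import math

def relative_mde(cv: float, sample_size: int) -> float:
    """Approximate relative Minimum Detectable Effect at 80% power and a
    5% significance level, assuming equally sized Treatment and Control
    groups: MDE ~= 4 * CV / sqrt(sample size per group)."""
    return 4 * cv / math.sqrt(sample_size)

# Hypothetical example: a metric with mean 5.0 and standard deviation 10.0
# has CV = 2.0. With 1,000,000 users per group, the smallest relative
# change detectable with 80% power is about 0.8%.
cv = 10.0 / 5.0
print(relative_mde(cv, 1_000_000))  # → 0.008
```

Note how the detectable change shrinks only with the square root of the sample size, which is why lowering a metric's \(CV\) by design (e.g., capping or log-transforming) is often cheaper than adding traffic.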
When designing a metric, we want to aim for a low \\(CV\\) in order to get more statistical power.<\/p>\n Checklist for Creating Sensitive Metrics<\/p><\/div>\n Metrics are not only used to make informed product decisions, but they also help determine the incentives and future investments of feature teams.<\/strong> An untrustworthy metric can send a product down a wrong path and away from its aspired goal. While the accurate estimation of the variance of a metric should be handled by the experimentation platform, the metric authors should focus on data quality, alignment with the goal and user experience, and generalization.<\/p>\n Our previous blog post on data quality<\/a> provides an in-depth guide; for completeness, we will summarize some of its key points here. When creating a metric, we should check for the following aspects related to data quality: missing data, invalid values, low join rates with other data sources, duplicate data, and delayed data<\/strong>. As mentioned in another past blog post<\/a> on validating the trustworthiness of a metric, we must check that the p-value distribution of a metric under multiple simulated AA tests is uniform. Moreover, we recommend regularly monitoring these aspects of key metrics through dashboards, anomaly detection methods, and AA tests in order to detect and resolve any regressions.<\/p>\n When a metric is aligned with a key product goal, it aggregates data from all observation units into a single number that is most pertinent for that goal. For example, a common goal for the end-to-end performance of a product is that a measure such as Page Load Time (PLT) should be satisfactory for the large majority of page load instances. The distribution of PLT is usually very skewed, with a long tail. Average PLT is a poor metric to track this goal; a metric like the 95<sup>th<\/sup> percentile or the 99<sup>th<\/sup> percentile of the PLT is more suitable. 
If that is hard to compute, then another option would be a metric that estimates the proportion of page loads where PLT exceeds a threshold. Further, PLT is only useful if it is measuring the latency of users\u2019 experiences when loading a page. For instance, some webpages load most of the content after the standard page load event<\/a>. In such cases, a good metric would measure the latency to the point where the page becomes functional for the end user.<\/p>\n Often, the goal will be clear (e.g., \u201cincrease user satisfaction\u201d), but it will be hard to determine a metric that actually aligns with that goal.<\/strong> In those cases, it is important to test candidate metrics on a corpus of past A\/B tests that we can confidently label as good or bad in terms of the desired goal [2]. In an upcoming blog post, we will discuss how to develop a trustworthy OEC that reflects user satisfaction.<\/p>\n A trustworthy metric provides an unbiased estimate for the entire population for which a product decision is being made. The most important factor to look for in this case is selection bias in data generation, collection, transformation, and metric definition and interpretation.<\/strong><\/p>\n An example of a bias in data generation is \u201capp ratings\u201d in the App Store. If an app only sends users with a positive outlook to the App Store, then the data generated from user reviews will be biased in a positive direction. A data collection bias can occur if we are unable to collect data from users who abandoned the product too quickly, or if the data collection is affected by the feature being tested. Data transformation can introduce a bias if we are incorrectly identifying spurious users (such as bots or test machines) and we are either removing legitimate users or including spurious ones [5]. 
Metric definition and interpretation can also introduce bias if the metric analyzes only a subset of the intent-to-treat users, or if it unintentionally puts more weight on certain users or actions.<\/p>\n We recommend being vigilant about selection bias end-to-end, and validating that a metric\u2019s value and movement are consistent with other tracked measurements and with our expectations of its movement in an A\/B test. For metrics that are particularly important, it may be a good idea to run A\/B tests where we are certain about the outcome, and then verify whether or not the metric movement aligns with it.<\/p>\n Checklist for Creating Trustworthy Metrics<\/p><\/div>\n As the <\/strong>Experimentation Flywheel<\/strong><\/a> turns, more and more A\/B tests are run regularly, and we will need to compute A\/B metrics for a large number of A\/B tests. Therefore, the time, complexity, and cost to compute the metrics should be manageable<\/strong>. We will examine these factors related to efficiency in more detail below.<\/p>\n Checklist for Creating Efficient Metrics<\/p><\/div>\n Debuggability is essential for experimenters to understand why a metric is moving in an A\/B test.<\/strong> If the metric regresses, metric debugging must help the experimenter narrow down the cause of the regression so that a bug can be fixed or the design of a feature can be changed. Equally important, if the metric is improving, metric debugging should help the experimenter understand why it improved. This will prevent them from falling prey to confirmation bias, and it will also help guide future investments. Let\u2019s discuss two common methods for making a metric debuggable: debug metrics and tools.<\/p>\n Debug metrics capture additional changes or treatment effects that help us better understand the movement of a more complex metric. They typically zoom in on a specific property of the complex metric in order to shed more light on the treatment impact. 
The reduced scope of the debug metrics usually makes them more sensitive [4].<\/p>\n There are three common ways to construct debug metrics:<\/p>\n Key guardrail metrics, such as \u201cperformance\u201d and \u201creliability,\u201d benefit from diagnostic and raw-data debugging tools that identify the cause of a regression. Diagnostic tools can help reproduce and diagnose an issue on a developer\u2019s machine that is running the treatment. For instance, by inspecting the data that a device is sending, Fiddler can help pinpoint regressions in performance or telemetry loss. We recommend that teams develop tools that can mine the A\/B test data to find instances of regressions in raw data, such as stack traces for crashes that are caused more often by treatment than by control.<\/p>\n Checklist for Creating Debuggable Metrics<\/p><\/div>\n In order to enable the best-informed product decisions, a good metric must be easy to understand and easy to act upon by all<\/em> team members \u2013 not just experts.<\/strong> Interpretability reinforces the trustworthiness and debuggability of the metric, and it also provides the right information for the proper usage of a metric in a product decision. In general, there are two key aspects of interpretability that we should be mindful of \u2013 clarity about the goal and direction of the metric, and caveats about its usage.<\/p>\n For a metric to be interpretable and actionable, it is essential for all team members to understand what the goal of the metric is, and why it is important.<\/strong> This context is provided by the name and description of the metric, as well as by the presentation of its results in relationship to other metrics. We recommend establishing a review process to ensure that at least one person who is not<\/em> involved with creating a metric can understand its goal and importance.<\/p>\n It should also be easy for all team members to understand when a metric movement is good or bad. 
We should try to design the metric in such a way that we can assign a \u201cgood-for-the-user\u201d or \u201cbad-for-the-user\u201d label to its movement in a specific A\/B test. At ExP, we use color coding to indicate the difference between a good and a bad movement. This is usually the first level of information that A\/B test owners consume in order to make a decision about the test.<\/p>\n If a movement in a metric can be interpreted as either good or bad, depending on the treatment tested, then it introduces an extra level of subjectivity into the process. It is best to try to avoid that subjectivity through better metric design. An example would be tracking the distribution of page views in a product across users, to understand how a treatment impacts heavy users (those with a large number of page views) and light users (those with a small number of page views). This distribution could be represented as a histogram that tracks the proportion of users falling into buckets labeled 1, [2,4), [4,8), and so on. However, this would be challenging to interpret properly. A loss of users in the [2,4) bucket could be good if those users moved to the [4,8) bucket, but it would be bad if they moved to the 1 bucket. It would be better to represent the cumulative distribution with buckets labeled 1, [2,\u221e), [4,\u221e), [8,\u221e), and so on. In this representation, a loss in each bucket has an unambiguous interpretation; a decrease in bucket 1 would always be good, while a decrease in any other bucket would always be bad. This property is also referred to as directionality.<\/p>\n Almost every metric has a blind spot, because it is aggregating a large number of measurements from all observation units into a single number. It is important to communicate the limitations of the metric, and most importantly, the cases when a metric should not<\/em> be used<\/strong>. Not all metrics will be able to test every kind of change. 
It is essential to know exactly what kinds of changes will be tested in an A\/B test, so that the metric can be properly designed to measure their effect. A good example is a revenue-estimate metric that is computed using the historical averages of revenue made per action type. This metric works well for testing changes where the revenue made per action type has not changed; otherwise, it will give a wrong estimate. On websites, the time to load a page is usually measured as the time difference between the first request sent by the client to load the page and the actual page load event. This estimates the amount of time that a user must wait before they see the page content. But if a treatment is still loading content after the page load event, or if a treatment is issuing the request to load a page in advance, then this metric breaks and is no longer valid.<\/p>\n Checklist for Creating Interpretable and Actionable Metrics<\/p><\/div>\n Although there may be a small set of OEC metrics that indicate the success of an experiment, experimenters should rely on a holistic set of metrics to make sure that a product decision is inclusive and fair [5]. Each metric in a metric set provides one data point that is used in conjunction with other data points in order to make a product decision. For inclusive and fair decision making, it is important to make sure that there is no unintended bias in our metrics. This is ensured by looking at three major factors: missing values, weights, and heterogeneity.<\/strong><\/p>\n We have already discussed selection bias issues in the Trustworthiness section, under the subheading labeled \u201cGeneralization.\u201d Selection bias does not exclude observation units randomly between treatment and control. Rather, it leads to the exclusion of observation units that share a common set of characteristics. 
For example, devices with low network bandwidth may not be able to load a page fast enough to send data before a user gets frustrated and abandons the product, and users who are neither very happy nor very unhappy with a product tend to respond less to surveys about user satisfaction. Metrics that overlook missing-value issues in certain segments of observation units are blind to the regression in product experience for those segments. <\/strong><\/p>\n Whenever possible, we should try to design metrics in a way that either avoids missing values or can impute them. For instance, a \u201cclicks per user\u201d metric can impute 0 values for users who were assigned to treatment or control but who did not show up in the product logs. But for performance metrics and ratio metrics, we cannot impute 0s for missing values. We ought to have data-quality metrics that alert us in case the proportion of missing values changes due to the treatment. For survey-based metrics with large numbers of missing values that cannot be imputed easily, we should have alternative proxy metrics that are based on data that can be observed from most observation units.<\/p>\n A metric can typically aggregate data from observation units in three different ways: as an average per observation unit (e.g., \u201cclicks per user\u201d), as a ratio over all activities (e.g., \u201cclicks per impression\u201d), or as the proportion of observation units satisfying a condition (e.g., \u201cproportion of users with a click\u201d).<\/p>\n Even if one of these is the main metric that aligns with the goal of the A\/B test, it is best to have multiple metrics that place different weights on observation units and activities, so that experimenters can get more insights into the distribution of a metric movement along key factors. <\/strong>This will ensure that we are more confident in making a good product decision. For a product with a mix of heavy and light users, a \u201cclicks per user\u201d metric generally increases with an increase in overall engagement with the product; but it could also increase due to an increase in engagement from heavy users, even when there is a drop in engagement from light users. 
Similarly, a \u201cclicks per impression\u201d metric generally increases when there is an overall increase in engagement with impressions; but it might also increase when there is more engagement with a popular page of the product, even when there is a decline in engagement with less popular pages. Lastly, a \u201cproportion of users with a click\u201d metric increases when non-engaged users become more engaged; but when already-engaged users engage even more with the product, it may not show an increase [1, 4].<\/p>\n In cases where we want to ensure an improvement in performance for the bottom 5% or 10% of the observation units, we should compute 95<sup>th<\/sup> and 90<sup>th<\/sup> percentile metrics or threshold metrics (as we discussed in the Trustworthiness section, under the subheading labeled \u201cAlignment With the Goal and User Experience\u201d).<\/p>\n A\/B metrics estimate the average treatment effect over all observation units, and they can sometimes be blind to a different impact of a treatment on a specific segment of users.<\/strong> Therefore, it is important to have good segments that allow for viewing a metric for a subpopulation. A good segment should have the following properties:<\/p>\n Common segments include market, country, pre-A\/B test activity level, device and platform, day of the week, and product-specific user personas. For more details, read our earlier blog post on Patterns of Trustworthy Experimentation: During Experiment Stage<\/a>.<\/p>\n
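To make these weighting differences concrete, here is a small Python sketch that computes all three aggregation schemes from the same data. The event log and its numbers are made up for illustration:

```python
# Hypothetical per-user log: (user_id, impressions, clicks).
log = [
    ("u1", 100, 30),  # heavy user
    ("u2", 90, 20),   # heavy user
    ("u3", 2, 1),     # light user
    ("u4", 3, 0),     # light user
]

# 1) Average per observation unit: "clicks per user".
#    Every user counts equally, regardless of activity level.
clicks_per_user = sum(c for _, _, c in log) / len(log)

# 2) Ratio over all activities: "clicks per impression".
#    Heavy users dominate because they contribute more impressions.
clicks_per_impression = sum(c for _, _, c in log) / sum(i for _, i, _ in log)

# 3) Proportion of units meeting a condition: "proportion of users with
#    a click". Insensitive to already-engaged users engaging even more.
users_with_click = sum(1 for _, _, c in log if c > 0) / len(log)

print(clicks_per_user)        # 12.75
print(clicks_per_impression)  # ~0.2615
print(users_with_click)       # 0.75
```

If the treatment doubled u1\u2019s clicks while u4 stopped clicking entirely, \u201cclicks per user\u201d and \u201cclicks per impression\u201d would both rise while \u201cproportion of users with a click\u201d would fall, which is exactly why reviewing all three together guards against a misleading single number.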
Sensitivity<\/h2>\n
Movement Probability<\/h3>\n
Statistical Power<\/h3>\n
Trustworthiness<\/h2>\n
Data Quality<\/h3>\n
Alignment with the Goal and User Experience<\/h3>\n
Generalization<\/h3>\n
Efficiency<\/h2>\n
Debuggability<\/h2>\n
Debug Metrics<\/h3>\n
Tools<\/h3>\n
Interpretability and Actionability<\/h2>\n
Clarity About the Goal and Direction of the Metric<\/h3>\n
Caveats About the Usage of a Metric<\/h3>\n
Inclusivity and Fairness<\/h2>\n
Missing Values<\/h3>\n
Weights<\/h3>\n
Heterogeneity<\/h3>\n