{"id":965463,"date":"2023-09-05T13:36:07","date_gmt":"2023-09-05T20:36:07","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=965463"},"modified":"2023-09-05T13:36:39","modified_gmt":"2023-09-05T20:36:39","slug":"experimentation-and-the-north-star-metric","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/experimentation-and-the-north-star-metric\/","title":{"rendered":"Experimentation and the North Star Metric"},"content":{"rendered":"\n
Ram Hariharan and Will Dubyak<\/em><\/p>\n\n\n\n We must measure user impact to continue enhancing the Copilot user experience. <\/p>\n\n\n\n This post addresses the application of A\/B testing and the North Star metric to this question. It uses an actual example to demonstrate test setup, interpretation of results, and sound decision making. It shows the power of these tools and highlights the hazards of too-rapid interpretation. <\/p>\n\n\n\n There are many technical improvements deriving from thoughtful application of A\/B experimentation, but this post should be viewed from the perspective of enhancing customer experience; our focus is always doing what we must to make customers more successful. Given the increasing embrace of Copilot across the range of our products, we see a tremendous opportunity to use experimentation to make Copilot more impactful on the end-to-end experience. <\/p>\n\n\n\n This post is not a recipe; there are volumes written about testing and metrics. Nor is it a comprehensive overview of the example use case. It is meant to illustrate how A\/B testing and metrics can be applied in real life and to show how misinterpretation or misuse can lead to weaker decision making. <\/p>\n\n\n\n Two key ideas: <\/p>\n\n\n\n Microsoft Power Automate is a low-code tool for creating flows that automate repetitive processes. Power Automate Copilot makes creating flows easy, saving user time and effort. Users simply describe the automated workflow they want in everyday language and Copilot transforms the words into a flow, jumpstarting the authoring process. For example, the text input \u201cSend me a notification when I get a high-importance email from my manager\u201d generates this flow: <\/p>\n\n\n\n This is a terrific opportunity to leverage natural language input; Power Automate was among Microsoft\u2019s first AI Copilot use cases released publicly. Data suggests workflows built with AI run far more often than those built manually. 
Users are more likely to use Copilot if we make it easier to reach, helping them automate more of their scenarios. This suggests a natural experiment.<\/p>\n\n\n\n Our research question has two parts:<\/p>\n\n\n\n The goal of this post is four-fold:<\/p>\n\n\n\n Some Building Blocks<\/u><\/em><\/p>\n\n\n\n This section provides basic definitions. When a process has many moving parts, there are many levers. But caution is required.<\/p>\n\n\n\n The Metric Framework<\/u><\/em><\/p>\n\n\n\n By \u201cMetric Framework\u201d we acknowledge that a complex flow has many measurable outcomes. North Star is our target metric. It is tied to the outcomes driving overall process efficacy; it is the one that best captures our success. But it is only one of three types:<\/p>\n\n\n\n Each metric is unique and valuable. \u201cLeading\u201d is most easily influenced: it\u2019s our main lever. We study all of these but focus on North Star and (to some extent) Lagging Metrics. They are slower to react, but more revealing of true impact and business value.<\/p>\n\n\n\n Experimentation + North Star<\/u><\/em><\/p>\n\n\n\n Our obligation is to proceed carefully.<\/p>\n\n\n\n Experimentation enables focus on what matters, while using scientific discipline to control what we can at intermediate steps. It helps explain the impact of early process modifications on the North Star Metric, which responds to change slowly. We move it by systematically altering individual parts of a process, one at a time. Our experimentation program selects the process changes that ultimately have the greatest impact on the North Star.<\/p>\n\n\n\n Experimentation in action<\/u><\/em><\/p>\n\n\n\n We rely on experimentation (not intuition or human judgement) to modify a process. We will simplify entry to Copilot to see if more users choose to try an AI-assisted flow. 
This is a leading metric; we will have near real-time feedback when a user chooses to try this out.<\/p>\n\n\n\n But that isn\u2019t enough; it doesn\u2019t matter if more people try Copilot if they don\u2019t use the output. We also explore this impact using User Save Rate as a proxy. (This tradeoff is routine in experimentation. While run rate is measurable, it is complex, and it does not move quickly in real time; we use save rate as a proxy because, presumably, flows are saved with the intent of running them later.)<\/p>\n\n\n\n We use sound statistical practices in selecting a sample for these experiments; this goes well beyond convenience samples or simply observing for a while. In an observational (non-experimental) scenario we don\u2019t have the insight we need to establish a causal connection between an action and changes in our North Star (or any other) metric. We can\u2019t know if systemic or environmental conditions also play a part in an outcome. Here is why.<\/p>\n\n\n\n Suppose we have a variable Y whose performance we\u2019d like to optimize, but which is also a function of many inputs. In our example Y is Save Rate, our North Star proxy. We hypothesize that a variable X impacts Y. If this is true, we theorize that we can control Y by manipulating X.<\/p>\n\n\n\n Suppose we test this hypothesis by manipulating X and observing how Y changes during some period. We are happy if Y improves, but what have we proved?<\/p>\n\n\n\n The answer is, unfortunately, very little. The strongest statement we can make is that there is an association between X and our North Star Y. Our test offers no assurance that environmental changes are not driving the observed change in Y. To make a stronger statement we must use a disciplined statistical experiment.<\/p>\n\n\n\n Our goal is to identify a modification (and associated Leading Metric) with a causal impact on our business: to say that, within a confidence level, we believe a change in X causes<\/u><\/em> a change in Y. 
We seek to demonstrate reasoned inference about ultimate changes in the North Star metric in response to manipulation of the hypothesized causal variable. Until we establish this connection, we are relying on hope, not science, to improve our feature.<\/p>\n\n\n\n A\/B experimentation is the most accepted way to establish causality between variables in a system. It lets researchers hold the environment constant while manipulating the X variable. In \u201cA\/B testing\u201d subjects in the sample are randomly separated into groups that do or do not receive the \u201ctreatment\u201d. Random assignment to these groups ensures that they are equivalent when the study begins, so we can infer that the treatment drives observed changes in the outcome.<\/p>\n\n\n\n We will test two hypotheses:<\/p>\n\n\n\n The figure below will help. There are 5 steps in the process. We will modify the way users opt in to Copilot by placing entry points on the home page. Our treatment group gets a new way of entering the funnel from the homepage; the control group sees the original entry mechanism. (Traditional entry points have been the \u201cCreate\u201d and \u201cMy Flows\u201d pages.) The treatment group automatically sees a Copilot-enabled designer; the control group must deliberately navigate to Copilot. Note that there is NO change in the process of building an AI-assisted flow; that process is accessible regardless of how a user enters. The test is for one specific question only: what happens if we make getting started more discoverable? Since evidence suggests AI-assisted flows are run more often, this might be a way to generate more usage.<\/p>\n\n\n\n The figure below represents the experimental results geometrically. 
The sample size was identical for the treatment and control groups; the breadth of the horizontal bar representing each step is proportional to a positive response.<\/p>\n\n\n\n Copilot assistance in flow construction has been in use in Power Automate since last fall, but there is evidence to suggest that some users are unaware of this functionality. This experiment tests the hypothesis that implementation of this Copilot AI call-to-action banner will help more users discover Copilot and enter the flow funnel, ultimately resulting in saving and running a flow.<\/p>\n\n\n\n While the actual data has been redacted and this funnel is a little dated, the results are striking.<\/p>\n\n\n\n What do we learn<\/u><\/em><\/p>\n\n\n\n This use case illustrates the care with which experimental results must be interpreted.<\/p>\n\n\n\n It emphasizes the key idea of using A\/B testing in conjunction with the North Star.<\/p>\n\n\n\n The dramatic improvement in the rate at which new users enter the experience, on its own, suggests that we should make this entry standard for all users. <\/p>\n\n\n\n But the decline in the treatment group\u2019s save rate in the experiment suggests otherwise. Fortunately, lessons learned in other experiments offer potential explanations.<\/p>\n\n\n\n Our working hypothesis is that new users who react to the entry banner are less likely to understand the designer process; users who understand AI-supported mechanisms for flow creation are more likely to save and run the flow. This is supported by data: despite much higher rates of entry from the new Homepage endpoint, 57% of created flows came from the original methods. The Homepage entry accounted for 43%; on the whole, users coming from the home page saved their flows at a 25% lower rate.<\/p>\n\n\n\n This suggests the next set of experiments for A\/B testing and product improvements!<\/p>\n\n\n\n Takeaways<\/u><\/em><\/p>\n\n\n\n First, A\/B testing is best regarded as an incremental process. 
We can rapidly gather insight, but we must be cautious about reacting to a specific outcome until we have studied the entire process.<\/p>\n\n\n\n Second, the interplay between leading metrics and the North Star is critical to success. Improvement at an intermediate step in a workflow (such as a significant increase in entry) is of no use until it leads to a corresponding improvement in the primary success metrics (such as save rate).<\/p>\n\n\n\n Finally, in experimentation we are constrained by what we can measure. Accordingly, we use Save Rate as a proxy for Run Rate. And we temper our response to some experimental results if the outcome is inconsistent with other indicators at our disposal (e.g., the fall in save rate does not match other evidence that AI-generated flows run at a much higher rate than manually built flows). We use each result as an opportunity to learn, and to plan our next experiment to continually improve the value customers derive from increasingly delightful experiences.<\/p>\n","protected":false},"excerpt":{"rendered":" Ram Hariharan and Will Dubyak We must measure user impact to continue enhancing Copilot User Experience. This post addresses application of A\/B testing and the North Star metric to this question. It uses an actual example to demonstrate test set up, interpretation of results, and sound decision making. 
It shows the power of these tools […]<\/p>\n","protected":false},"author":42414,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":804652,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-965463","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":804652,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/965463","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42414"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/965463\/revisions"}],"predecessor-version":[{"id":965487,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/965463\/revisions\/965487"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=965463"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=965463"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=965463"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=965463"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}For thousands of years mariners have understood the value of the 
North Star as a beacon to help navigate a journey. It is a trusted source of truth; reference to it is the basis of life-and-death navigational decisions. If they adhere to its message, they find their way home. They ignore it at their peril.<\/em><\/h3>\n\n\n\n
The use case<\/em> <\/h3>\n\n\n\n
North Star Metric: an example<\/h3>\n\n\n\n
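The funnel comparison in the post (57% of created flows from original entry points, 43% from the new Homepage entry, with Homepage users saving at roughly a 25% lower rate) can be sketched as simple proportion arithmetic. This is a minimal illustration only: the counts below are hypothetical placeholders, since the post's actual experimental data is redacted.

```python
# Sketch of the funnel arithmetic behind the observation that 57% of created
# flows came from the original entry points, 43% from the new Homepage entry,
# and Homepage users saved at a ~25% lower rate.
# NOTE: all counts here are hypothetical; the real data is redacted.

created = {"original_entry": 570, "homepage_entry": 430}  # flows created per entry point
saved = {"original_entry": 342, "homepage_entry": 194}    # of those, flows saved

total_created = sum(created.values())
for entry, n in created.items():
    share = n / total_created          # entry point's share of created flows
    save_rate = saved[entry] / n       # save rate for that entry point
    print(f"{entry}: {share:.0%} of created flows, save rate {save_rate:.1%}")

# Relative difference in save rate between the two entry points.
orig_rate = saved["original_entry"] / created["original_entry"]
home_rate = saved["homepage_entry"] / created["homepage_entry"]
relative_drop = 1 - home_rate / orig_rate
print(f"Homepage save rate is {relative_drop:.0%} lower than original entry")
```

With these placeholder counts the arithmetic reproduces the shape of the reported result (57%/43% split, roughly 25% lower save rate from the Homepage entry), which is the point: a higher-volume entry point can still dilute the downstream success metric.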
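The post's argument that random assignment lets us attribute an observed difference in the outcome Y (save rate) to the treatment X can be made concrete with a standard two-proportion z-test, a common way to compare treatment and control conversion rates in A/B testing. This is an illustrative sketch, not the team's actual analysis pipeline, and the counts are hypothetical placeholders since the real data is redacted.

```python
# Illustrative two-proportion z-test for comparing save rates between the
# treatment and control groups of an A/B experiment like the one described.
# All counts below are hypothetical; the post's actual data is redacted.
from math import sqrt, erfc

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Return (z, two_sided_p) for H0: p_a == p_b, using a pooled estimate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal survival function.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical: treatment saves 380 of 1000 flows; control saves 420 of 1000.
z, p = two_proportion_z(380, 1000, 420, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A negative z with a small p-value would indicate that the treatment group's save rate is credibly lower than the control group's, mirroring the post's caution that a banner which boosts entry can still depress the North Star proxy.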