Business Applications Applied AI Articles

Experimentation and the North Star Metric

Ram Hariharan and Will Dubyak

For thousands of years mariners have understood the value of the North Star as a beacon to help navigate a journey.  It is a trusted source of truth; reference to it is the basis of life and death navigational decisions.  If they adhere to its message, they find their way home.  They ignore it at their peril. 

We must measure user impact to continue enhancing Copilot User Experience.   

This post addresses the application of A/B testing and the North Star metric to this question. It uses an actual example to demonstrate test setup, interpretation of results, and sound decision making. It shows the power of these tools and highlights the hazards of too-rapid interpretation. 

There are many technical improvements deriving from thoughtful application of A/B Experimentation, but this paper should be viewed from the perspective of enhancing customer experience; our focus is always doing what we must to make customers more successful. Given the increasing embrace of Copilot across the range of our products, we see a tremendous opportunity to use experimentation to make Copilot more impactful on the end-to-end experience. 

This post is not a recipe; there are volumes written about testing and metrics. Nor is it a comprehensive overview of the example use case. It is meant to illustrate how A/B testing and metrics can be applied in real life, and to show how misinterpretation or misuse can lead to weaker decision making. 

Two key ideas: 

  1. You must have a well understood and impactful North Star metric to successfully develop your model. 
  2. You must remain flexible; your actions will not always have the expected effect. 

The use case 

Microsoft Power Automate is a low-code tool for creating flows that automate repetitive processes. Power Automate Copilot makes creating flows easy, saving user time and effort. Users simply describe the automated workflow they want in everyday language and Copilot transforms the words into a flow, jumpstarting the authoring process. For example, the text input “Send me a notification when I get a high importance email from my manager” generates this flow: 

[Figure: the flow generated from the example prompt]

This is a terrific opportunity to leverage natural language input; Power Automate was among Microsoft’s first AI Copilot use cases released publicly.  Data suggests workflows built with AI run far more often than those built manually.  Users are more likely to use Copilot if we make it easier, helping them automate more of their scenarios.  This suggests a natural experiment.

Our research question has two parts:

  • If we make Copilot entry easier, will it increase the use of AI Assistance?
  • If more users start using Copilot, will it reflect in greater adoption leading to measurable customer impact?

The goal of this post is four-fold:

  • Describe design and execution of an A/B test.
  • Understand interplay between different metrics in A/B testing.
  • Explain the selection of proxy metrics.
  • Illustrate guidelines to make good decisions with test results.

Some Building Blocks

This section provides basic definitions.  When a process has many moving parts, there are many levers.  But caution is required.

  • A/B testing: a statistical experiment in which a sample of users is drawn randomly and separated into two groups: Control and Treatment.  At the end we compare the actions of the two groups.  This is governed by a basic assumption (below).
  • Treatment Group: We want to make a change of some kind.  The users receiving the change are in the Treatment Group.  The change is called the Treatment.
  • Control Group: A group selected to be left untouched and similar to the Treatment Group.  If assumptions hold, we have a start at understanding treatment impact.
  • Metrics: Measurements of outcomes.  In a process there can be many, each with a different message.  It is critical to choose one that speaks to the question being asked; often the most relevant message is learned from a point far from the actual experiment. We use three types in our work (below).

The Metric Framework

By “Metric Framework” we acknowledge that there are many measurable outcomes in a complex flow. North Star is our target metric. It is related to the outcomes driving overall process efficacy; it is the one that best reflects our success. But it is only one of three types: 

  1. Leading Metric: for local impacts (such as modification of a single step in AI Copilot).  It is generally quite sensitive to feature changes; it can move quickly when change is induced, because the metric is generally quite close to the action.  In our running example, it is the decision to use AI for flow design. Measurement is fast.
  2. North Star: More revealing.  It does not move as fast as Leading Metric but speaks clearly to process end state. In our running example, we typically think of Flow Value per User as our North Star (we use save rate as a proxy); our hypothesis is that decisions to run a workflow indicate maker satisfaction with the results.
  3. Lagging Metric: The Lagging Metric is the slowest response to changes; it is often removed in time or sequence from the modified process.  In our example, a lagging metric would be closer to the business such as increased usage or higher revenue.

Each metric is unique and valuable. “Leading” is most easily influenced: it’s our main lever. We study all of these but focus on North Star and (to some extent) Lagging Metrics.  They are slower to react, but more revealing of true impact and business value.
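To make the distinction concrete, the sketch below computes a leading metric (Copilot entry rate) and our North Star proxy (save rate among entrants) from a small, hypothetical event log. The event names and schema are illustrative assumptions, not the actual telemetry.

# Python sketch: leading metric vs. North Star proxy from a hypothetical event log.
events = [
    {"user": "u1", "event": "copilot_entry"},
    {"user": "u1", "event": "flow_saved"},
    {"user": "u2", "event": "copilot_entry"},
    {"user": "u3", "event": "homepage_visit"},
]

def metric_rates(events):
    """Leading metric: share of users entering Copilot. North Star proxy: save rate among entrants."""
    users = {e["user"] for e in events}
    entered = {e["user"] for e in events if e["event"] == "copilot_entry"}
    saved = {e["user"] for e in events if e["event"] == "flow_saved"}
    leading = len(entered) / len(users)                    # sensitive, moves quickly
    north_star_proxy = len(saved) / max(len(entered), 1)   # slower, closer to real value
    return leading, north_star_proxy

print(metric_rates(events))   # (0.67, 0.5) for this toy log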

Experimentation + North Star

Our obligation is to proceed carefully.

Experimentation enables focus on what matters, while using scientific discipline to control what we can at intermediate steps. It helps explain the impact of early process modifications on the North Star metric, which responds to change slowly. We move it by systematically altering individual parts of a process, one at a time. Our experimentation program selects the processes that ultimately have the greatest impact on the North Star. 

Experimentation in action

We rely on experimentation (not intuition or human judgement) to modify a process. We will simplify entry to Copilot to see if more users choose to try AI-assisted flow creation. This is a leading metric; we will have near real-time feedback when a user chooses to try it out.

But that isn’t enough; it doesn’t matter if more people try Copilot if they don’t use the output. We also explore this impact using User Save Rate as a proxy. (This tradeoff is routine in experimentation. While run rate is measurable, it is complex, and it does not move quickly in real time; we use save rate as a proxy because, presumably, flows are saved with the intent of running them later.)

We use sound statistical practices in selecting a sample for these experiments; it goes well beyond convenience samples or simply observing for a while. In an observational (non-experimental) scenario we don’t have the insight we need to establish a causal connection between action and changes in our North Star (or any other) metric.    We can’t know if systemic or environmental conditions also play a part in an outcome. Here is why.

Suppose we have a variable Y whose performance we’d like to optimize, but which is also a function of many inputs.  In our example Y is Save Rate, our North Star proxy.   We hypothesize that a variable X impacts Y. If this is true, we theorize that we can control Y by manipulating X.

Suppose we test this hypothesis by manipulating X and observing how Y changes during some period.  We are happy if Y improves, but what have we proved?

The answer is, unfortunately, very little.  The strongest statement we can make is that there is an association between X and our North Star Y.  Our test offers no assurance that environmental changes are not driving the observed change in Y.   To make a stronger statement we must use a disciplined statistical experiment.

Our goal is to identify a modification (and associated Leading Metric) with a causal impact on our business: to say that, within a confidence level, we believe a change in X causes a change in Y.  We seek to demonstrate reasoned inference about ultimate changes in the NS metric in response to manipulation of the hypothesized causal variable.  Until we establish this connection, we are relying on hope, not science, to improve our feature.

A/B Experimentation is the most accepted way to establish causality between variables in a system.  It lets researchers hold the environment constant while manipulating the X variable.  In “A/B testing” subjects in the sample are randomly separated into groups that do/do not receive the “treatment”.  Random assignment to these categories ensures that groups are equivalent when the study begins, so we can infer that treatment drives observed changes in the outcome.

  • The key consideration in experiment design is making certain that systemic/environmental effects not under control of the researcher are averaged out across the sample; if experimental units are randomly assigned to treatment/control groups, with a large enough sample environmental influences will average out.  This gives reasonable assurance that variation between treatment and control groups is due to the treatment variable.
  • Experimentation allows us to separate an action from the environment.
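As a simple illustration of how such a comparison might be evaluated, here is a minimal sketch of a two-proportion z-test on hypothetical treatment and control save counts. The numbers are invented and the choice of test is an assumption for illustration, not a description of the team's actual analysis pipeline.

import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Compare two rates (e.g., save rate in treatment vs. control) with a pooled z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal approximation.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: treatment enters the funnel more often but saves at a lower rate.
z, p = two_proportion_z_test(success_a=450, n_a=2000, success_b=300, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")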

North Star Metric: an example

We will test two hypotheses:

  1. The new home page entry point gets more users to try the Copilot-powered experience.
  2. The new Copilot-powered method increases the flow save rate.

The figure below will help. There are 5 steps in the process. We will modify the way users opt in to Copilot by placing entry points on the home page. Our treatment group gets a new way of entering the funnel from the home page; the control group sees the original entry mechanism (traditional entry points have been the “Create” and “My Flows” pages). The treatment group automatically sees a Copilot-enabled designer; the control group must deliberately navigate to Copilot. Note that there is NO change in the process of building an AI-assisted flow; that process is accessible regardless of how a user enters. The test is for one specific question only: what happens if we make getting started more discoverable? Since evidence suggests AI-assisted flows are run more often, this might be a way to generate more usage.

The figure below represents the experimental results geometrically. The sample size was identical for the treatment and control groups; the width of the horizontal bar at each step is proportional to the positive response at that step.

[Figure: funnel chart of experimental results by step, for treatment and control]

Copilot assistance in flow construction has been available in Power Automate since last fall, but there is evidence to suggest that some users are unaware of this functionality. This experiment tests the hypothesis that implementation of this Copilot call-to-action banner will help more users discover Copilot, enter the flow funnel, and ultimately save and run a flow.

While the actual data has been redacted and this funnel is a little dated, the results are striking.

  • The enhanced entry option draws users at double the original rate. 
  • This pattern endures through the Query, Response, and Accept stages; the treatment group remains in the funnel at a significantly higher rate.
  • The pattern changes at the Designer step: the control group now proceeds from the previous step at a higher rate than the treatment group. The save rate is also higher in control than in treatment.

What do we learn

This use case illustrates the care with which experimental results must be interpreted.

  1. Our hypothesis: get more people into the experience by simplifying entry. The experiment suggests this is true: the entry rate was more than double that of the original method. This trend endures for the next several levels of the funnel.
  2. Things change at the designer. In this experiment, all advantages of the new entry method diminish: users, for whatever reason, do not make it through the designer, even though the designer is also enhanced.
  3. We seek to maximize the North Star, “Flow Value per User.” We do not move it directly (it behaves like a lagging metric). However, a key leading metric driving the North Star is the user flow save rate, which measures the rate at which users save flows. Moving the save rate moves the North Star.
  4. Save rate is actually lower in the new experience. 

This emphasizes the key idea of using A/B testing in conjunction with the North Star.

The dramatic improvement in the rate at which users enter the experience, on its own, suggests that we should make this entry standard for all users.  

But the decline in the treatment group save rate in the experiment suggests otherwise. Fortunately, lessons learned in other experiments offer potential explanations.

Our working hypothesis is that new users who react to the entry banner are less likely to understand the designer process; users who understand AI-supported mechanisms for flow creation are more likely to save and run the flow. This is supported by data: despite much higher rates of entry from the new home page entry point, 57% of created flows came from the original methods. Home page entry accounted for 43%; on the whole, users coming from the home page saved their flows at a 25% lower rate.

Which suggests the next set of experiments for A/B testing and product improvements!

Takeaways

First, A/B testing is best regarded as an incremental process.   We can rapidly gather insight, but we must be cautious about reacting to a specific outcome until we have studied the entire process.

Second, interplay between Leading metrics and the North Star is critical to success.   Improvement at an intermediate step in a workflow (such as significant increase in entry) is of no use until it leads to a corresponding improvement in primary success metrics (such as save rate).

Finally, in experimentation we are constrained by what we can measure. Accordingly, we use save rate as a proxy for run rate. And we temper our response to some experimental results if the outcome is inconsistent with other indicators at our disposal (for example, the fall in save rate does not match other evidence that AI-generated flows run at a much higher rate than manually built flows). We use each result as an opportunity to learn, and to plan our next experiment to continually improve the value customers derive from increasingly delightful experiences.

Prompt Engineering: Improving our Ability to Communicate with an LLM

By Zewei Xu, Senior Applied Scientist and Will Dubyak, Principal Program Manager

In March we announced Dynamics 365 Copilot and Copilot in Power Platform, announcements that generated curiosity about how we’ve been able to bring more context and specificity to generative AI models.

This post explains how we use retrieval augmented generation (RAG) to ground responses and use other prompt engineering to properly set context in the input to large language models (LLMs), making the use of natural language generation (NLG) technology easier and faster for users. It is a look at two components of our efforts to deliver NLG: prompt engineering, and knowledge grounding.

In early use, a key engineering tension has become increasingly apparent.

Pretrained NLG models are powerful, but in the absence of contextual information responses are necessarily antiseptic and generic. Provision of access to customer data is an option, but the need for data security and privacy precludes many sharing options at scale.

Our challenge is to balance these competing forces: enable access to the power of these models for contextually relevant and personalized text generation, while at the same time providing every privacy and security protection our users expect.

Our approach uses two methods.  The first involves additions to the user prompt to pass relevant information to the underlying NLG model. The second involves intervention in the data layer so that contextual information is available in a searchable format while remaining secure.

Note that because we use Azure OpenAI to call the Generative Pre-trained Transformer (GPT) models, all standard Azure protections (Trust your cloud | Microsoft Azure) apply and are excluded from explicit discussion here.

Prompt Engineering

The key idea behind prompt engineering is to provide enough information in the instructions to the AI model that the user gets exactly the hoped-for result.

The prompt is the primary mechanism for access to NLG capabilities. It is an enormously effective tool, but despite its flexibility there are expectations for how information is passed if user intent is to be reliably converted into the expected output. It’s obvious that prompts must be accurate and precise: otherwise, the model is left guessing. But there are other dimensions to prompt engineering that enable the secure access we require to generate useful insight.

We have five components in our prompt. Each is a necessary part of the pipeline. Order matters: we arrange the components in ascending order of importance to accommodate recency bias.

A sample prompt is attached as an appendix.

  1. System Instructions: We assume a certain set of basic rules to govern generation of content in response to the prompt. This information is of several types. There is general guidance such as avoiding subjective opinion, repetition, arguing with the user, or excessive procedural detail, and ending a chat thread that becomes controversial. There are also general rules to avoid being vague, wandering off topic, or providing images in the response.
  2. Grounding and RAI guidance: considerations discussed in the Grounding and RAI sections below.
  3. Historical Data: such as information from past conversations. Multiturn conversations will alter this practice.
  4. Agent Context input (optional; in this context, “Agent” refers to a customer service agent. Note that this is an example; a very similar process occurs in functional areas outside customer service.)
  5. Assistant Message: a reference to a part of the conversation history that serves as an example of the type of response desired. It can be written by a developer.
Case (conversation so far):
Agent: Hello, how may I help you? Customer: Hi, I am having an issue. Customer: How to force delete VM in Azure?

GPT-generated agent answer before prompt engineering:
Agent: Sure, I can help you with that. Please follow the below steps: 1. Open the Azure portal. 2. Select the virtual machine you want to delete. 3. Click on the “Delete” button. 4. Confirm the deletion by clicking on “Yes”. 5. Wait for the deletion process to complete. 6. Verify that the virtual machine has been deleted.

GPT-generated agent answer after prompt engineering:
You can force delete a VM in Azure through the Azure Portal, CLI, PowerShell, and REST API. To force delete a VM in the Azure Portal, navigate to your virtual machine, select Delete on the Overview page, select the checkbox for Apply force delete in the Delete virtual machine pane, and select Ok. To delete a VM using the Azure CLI, use the --force-deletion parameter for az vm delete. To force delete a VM using PowerShell, use the -ForceDeletion parameter for Remove-AzVm. To force delete a VM using the Azure REST API, use the forceDeletion parameter for Virtual Machines – Delete.
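To make the ordering concrete, here is a hedged sketch of how the five components might be assembled into a single prompt string, with the most important context placed last to accommodate recency bias. The section labels, helper name, and example content are illustrative placeholders, not the production template.

def build_prompt(system_instructions, grounding_chunks, history, agent_context, assistant_example, user_query):
    """Assemble prompt sections in ascending order of importance (recency bias)."""
    sections = [
        "## System instructions\n" + system_instructions,
        "## Grounding and RAI guidance\n" + "\n".join(grounding_chunks),
        "## Conversation history\n" + "\n".join(history),
    ]
    if agent_context:  # optional context supplied by the customer service agent
        sections.append("## Agent context\n" + agent_context)
    sections.append("## Assistant example\n" + assistant_example)
    sections.append("## User\n" + user_query)
    return "\n\n".join(sections)

prompt = build_prompt(
    system_instructions="Be concise. Do not speculate. Stay on topic.",
    grounding_chunks=["[KB chunk] How to force delete a VM in Azure: ..."],
    history=["Customer: Hi, I am having an issue."],
    agent_context=None,
    assistant_example="Cite the relevant KB article when answering.",
    user_query="How to force delete VM in Azure?",
)
print(prompt)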

Grounding

NLG models generate contextually relevant responses by accessing a customer’s specific data. Doing this comes down to answering one question: “How do we extract the right information from a customer’s specific data?” We call this technique “grounding”, and it refers to the ability of a model to generate correct answers to questions from customer data.

Grounding depends on the idea of Retrieval Augmented Generation, or RAG. RAG is a form of knowledge grounding; it is important because models must extract insight from different customers, and different kinds of data, almost always in great quantities. Formulating good answers requires a mechanism to make sense of the data. In Customer Service, for example, our solution is to decompose data into “chunks”, which are ranked by ranker models by relevance to different customer issues. When a customer sends a query, that query is used to call for grounding. The ranker model then extracts the most relevant chunks from user data sources (e.g., we might pick a few of the most relevant). Combined with historical data and agent input (optional), the extracted KB chunks are integrated into the LLM prompt template to generate the answer. Similar practices are followed in other areas.

Knowledge Base (KB) chunking is the process of creating smaller units of a document corpus based on paragraph separations. For our pipeline, we chunked KB articles at the paragraph level, with maximum and minimum size limits. We use Byte-pair encoding (https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt), which is the default tokenizer for the GPT series. After chunking, document embeddings are generated. At runtime, the customer query is used to generate a query embedding, and a cosine similarity score is calculated between the query embedding and each document chunk embedding. The top matched document chunks are then used as inputs to the prompt template for GPT to generate the answer. We use all-mpnet-base-v2 as our embedding model because of its superior performance compared to other sentence embedding models.

Customer knowledge bases can be frequently changed, updated, or expanded with new information or KBs. Dynamically updating and storing static embedding files is also not ideal; these operations can be expensive and hard to maintain at scale. In our pipeline, we adopted Azure Cognitive Search (ACS), which provides a scalable option for this indexing service, and we created an indexer based on ACS. In ACS, there are two levels of search: L1, keyword-based search (BM25), and L2, semantic search (Semantic search – Azure Cognitive Search | Microsoft Learn). We use L1 search for document chunk retrieval and are working on enabling semantic search in future pipeline updates. Another thread we are exploring is vector search, which is currently in private preview (Build next-generation, AI-powered applications on Microsoft Azure | Azure Blog). It provides both efficiency and scalability for document retrieval compared to other vector store solutions because it is built on an optimized vector database.

After document chunks retrieval, we integrate top ranked chunks into our GPT prompt template for answer generation.
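The sketch below illustrates this retrieval step under simplifying assumptions: it uses the all-mpnet-base-v2 sentence embedding model mentioned above but replaces Azure Cognitive Search with a plain in-memory cosine-similarity search, so it is a minimal stand-in for the production indexer rather than the actual pipeline.

# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

def chunk_by_paragraph(document, min_len=10, max_len=1000):
    """Split a KB article into paragraph chunks within a rough size band."""
    parts = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [p[:max_len] for p in parts if len(p) >= min_len]

def top_chunks(query, chunks, k=3):
    """Return the k chunks most similar to the query by cosine similarity."""
    chunk_emb = model.encode(chunks, normalize_embeddings=True)
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_emb @ query_emb          # cosine similarity on normalized vectors
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

kb_article = "Force deleting a VM skips graceful shutdown.\n\nIn the portal, select Delete on the Overview page and check Apply force delete."
print(top_chunks("How do I force delete a VM?", chunk_by_paragraph(kb_article)))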

As shown in Figure 1, we first preprocess customer data in different formats into structured text and save it as Dataverse entities. These entities are sent to the document featurization pipeline for chunking, text feature extraction, and embedding creation (for the vector search feature). The extracted features are then uploaded to ACS storage to create an indexer.


Figure 1: Indexing in Dataverse Copilot Pipeline

During real-time service (Figure 2), the customer query is generated from the historical conversation using Cognitive Services’ summarization function (Summarize text with the extractive summarization API – Azure Cognitive Services | Microsoft Learn). It is then sent to the ACS indexer to retrieve the most relevant KB chunks. After integrating the top relevant KB chunks, historical conversation data, and agent context input (optional) with our prompt template, we use GPT to generate answers for agents during conversations. 


Figure 2: Real-Time answer generation

RAI: Prompt Screening and Provenance Check

Our team is firmly committed to delivering services that comply fully with Microsoft Responsible AI (RAI) principles, described here: http://approjects.co.za/?big=en-us/ai/responsible-ai. They manifest in two distinct ways.

The first is at the point of the prompt. For extra protection (beyond Cognitive Services Content Moderation), each prompt is screened for anything that might return an NLG response considered prohibited content. Raw GPT answers might contain offensive, inappropriate, and sometimes hallucinated content that would reduce the quality of generated answers. We developed an RAI component that has been adopted across multiple BAP copilot features. Our approach is that a question that does not get asked (i.e., is not submitted to the model) cannot produce offensive or harmful answers. Therefore, each prompt is screened before execution. A prompt that fails screening is not executed, and the user receives a polite notification that they have asked a question that cannot be answered within the limits of our search policies.

In the provenance check component, we perform entity checks, including numeric, URL, date, and phone number checks. We also score answer relevance and filter out irrelevant answers with a threshold. 
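A hedged sketch of what such a provenance check could look like: extract URLs, phone numbers, and numeric values from the generated answer with simple regular expressions and flag any that do not appear in the grounding text. The patterns and the flag-everything-unmatched rule are illustrative assumptions, not the production RAI component.

import re

ENTITY_PATTERNS = {
    "url": r"https?://\S+",
    "phone": r"\+?\d[\d\- ]{7,}\d",
    "number": r"\b\d+(?:\.\d+)?\b",
}

def provenance_check(answer, grounding_text):
    """Return entities in the answer that are not supported by the grounding text."""
    unsupported = []
    for name, pattern in ENTITY_PATTERNS.items():
        for match in re.findall(pattern, answer):
            if match not in grounding_text:
                unsupported.append((name, match))
    return unsupported  # an empty list means every checked entity is grounded

grounding = "Use the --force-deletion parameter. Docs: https://learn.microsoft.com/azure"
answer = "Use --force-deletion, see https://learn.microsoft.com/azure or call 1-800-555-0100."
print(provenance_check(answer, grounding))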

Ongoing work:

In the current pipeline, we require users to provide file formats in metadata because we need this information to parse documents properly. This limits the number of file formats the pipeline can support. We plan to add automatic file format detection and potentially expand the file formats we support.

We are also working on improving the provenance check component by experimenting with Named Entity Recognition (NER) models. We will enhance the RAI component by using out-of-box or fine-tuned custom NER models to extract more entities, such as names, addresses, and product-related information, for provenance checks. 

A final thought

There is clearly an element of artistry to constructing a good prompt. The good news is that the required skills are not overwhelmingly difficult to acquire. We advise users to follow a mental framework of Ideate/Experiment/Fine-Tune; prompt generation can be learned by doing. Approaching it with much the same mindset as we do multiturn prompts is a path to success with an NLG model.

Appendix: A Sample Prompt

System message:
You are a customer service agent who helps users answer questions based on documents from
 
## On Safety:
e.g. be polite
e.g. output in JSON format
e.g. do not respond if the request contains harmful content…
 
## Important
e.g. do not greet the customer

 
AI Assistant message:

 
## Conversation

 
User message:

 
AI Assistant message:

 

Power Automate with Copilot: the back story

Authors: Will Dubyak, Chhaya Methani

With Satya’s copilot announcements at Ignite (Microsoft Ignite Opening) in the rear-view mirror, it’s a good time to talk more about the kind of work and creative thinking that made them possible. If you aren’t already familiar with the new ways to innovate with AI, such as the AI-based copilot that builds your flow in seconds, check out the Microsoft Power Automate blog post. The idea that a plain-language prompt can be used to generate a sophisticated automated workflow is powerful, and a glimpse into what the future holds with innovative large language models. But the path to this point was anything but easy and automatic.

As anyone with a background in AI/ML knows, the long pole in the execution tent for a good idea is training data. Training a model to generate a flow from a prompt assumes that we have many flows with associated prompts to show the model. 

We didn’t. So we needed to be creative.

Our solution took shape in two main dimensions. First, we devised a way to generate synthetic data for model training. We had many production flow skeletons that had been scrubbed of Personally Identifiable Information (PII), and we found ways to generate descriptions (or labels) for them to simulate the prompts a user might have written. Second, we used a method to generate Natural Language (NL) utterance-flow pairs that we knew to be empirically relevant based on historical patterns in our existing Microsoft Power Automate flow data.

A Power Automate flow is made up of a trigger that “activates” the flow and steps that perform actions upon that trigger. For example:

  • “When I get an email from my manager, send me a Teams notification”;
  • “Send me a message on Teams when a task is completed in Planner;”
  • “If I receive an email that contains the subject ‘invoice’ create an item on SharePoint”.

We trained the first version of the model using training data generated through manually written prompts for the flows. We use OpenAI Codex, the engine behind the GitHub Copilot tool, which generates executable code from a natural language prompt. Because large language models adapt well to new domains, we started achieving excellent results almost immediately.

The model works by pairing a workflow with a natural language description to use as training data.  The model – which we refer to internally as NL2Flow – learns the correspondence between the language and the flow and is later able to generate a new flow in response to a natural language prompt.  (Interestingly, we have learned that it is working in far more than English; there was intense interest among Japanese users immediately after Ignite; even though it’s not specifically trained in Japanese, it works surprisingly often!)  There are many working production flows available, but very few of them have a description we can use in model training and testing.  

Generating synthetic data

We augmented the data we had by generating synthetic (meaning “we created ourselves”) natural language query-flow pairs.

Note that this is a reverse of the NL2Flow models.  As a practical matter, this included fine tuning of a Codex model to generate new descriptions of an existing production flow, as well as inducing variation in the flow language by paraphrasing. The objective is not just a greater volume of training flows and descriptions, but also a broader selection of triggers and actions with which to generate flows. The team took two approaches:

  1. Reverse the original NL2Flow process and generate NL utterance for existing flows
  2. Use a context grammar to generate synthetic label/flow pairs

Flow2NL

The first effort was to use NLG (Natural Language Generation) to generate NL descriptions from anonymized production flows. 

The figure below indicates the process.  We input flow code to a fine-tuned Codex model and generated multiple natural language descriptions of flow activity.  For economy of effort, these descriptions were submitted to judges for human review; they selected the ones they thought most accurate.  On the first pass, 92% of data samples (flows) processed with this approach had agreement of 2 or more judges on at least one NL utterance that the model output.

As an example, consider this flow:

Flow Code:

triggeroutputs = await shared_office365.OnNewEmailV3();  // Trigger Function

outputs_forward_email = shared_office365.ForwardEmail_V2('message_id': triggeroutputs?['body']?['MessageId']) // Forward email function

The Flow2NL model generates the following paraphrased utterances, all of which result in the generation of the above flow.

  • Forward emails to a specific address
  • Forward email to another address
  • Forward emails from a specific address to a different address

Training the model with samples generated this way increases the robustness of the model to language variations. The flow chart below shows the flow of training data generated from Flow2NL pipeline, which is then used to train the NL2Flow model.

[Figure: Flow2NL pipeline for generating NL2Flow training data]

Context Grammar

As shown in Table 1, the extra data from Flow2NL helped in our efforts to produce good flow descriptions, but not as much as we needed. To achieve more diversity in flow descriptions we used a process called “Context Grammar” to vary them. We iterated over all possible functions (with their corresponding prompts) needed to “construct” a flow. We created a tool called DataGen that generates these combinations given a config file containing the following:

  1. The grammar defines groups of co-occurring functions and their order in the flow. The grammar includes code patterns as well as the corresponding NL prompts, needed to replicate a real flow.
  2. The list of all possible functions allowed in this group (both triggers and actions) and
  3. The NL prompts or “patterns” that correspond to these functions.

For example, consider the following config file describing the grammar structure to save attachments from an email. Please note that we only show iterations over one pattern (@SaveTo@) to keep it simple. The tool can expand multiple patterns recursively.

Code Pattern:

triggeroutputs = await shared_office365.OnNewEmailV3();  // Trigger Function

// For loop on email attachments

for (items_foreach in triggeroutputs?['body']?['attachments'])

{

       //Grammar pattern for set of possible functions allowed under this group

       @SaveTo@

}

Corresponding NL Prompts describing the above code (Note: there are many ways to describe a given code):

Save email attachments to @0@

Store every email attachment I receive to @0@

Pull attachments from outlook to @0@

In the above NL-flow pair, the parameters enclosed in @ will be sampled from the list mentioned in Steps 2 & 3. The same config describes the function values that @SaveTo@ can take. The corresponding NL part will be used to replace all occurrences of @0@.

Sampling from known patterns allows us to generate data inexpensively while still preserving relevant triggers and actions from our users.  We added additional samples for under-represented connectors.
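A hedged sketch of the kind of expansion a tool like DataGen might perform, using a toy config: each @-delimited slot in the code pattern is replaced by every allowed function, and the matching phrase is substituted into the NL patterns, yielding synthetic NL-flow training pairs. The config structure and connector names here are illustrative, not the actual DataGen format.

# Toy grammar config: one expandable slot with allowed (code fragment, NL phrase) pairs.
config = {
    "code_pattern": "triggeroutputs = await shared_office365.OnNewEmailV3();\nfor (a in attachments) { @SaveTo@ }",
    "nl_patterns": ["Save email attachments to @0@", "Pull attachments from outlook to @0@"],
    "slots": {
        "@SaveTo@": [
            ("shared_onedrive.CreateFile(a);", "OneDrive"),      # hypothetical connector call
            ("shared_sharepoint.CreateFile(a);", "SharePoint"),  # hypothetical connector call
        ],
    },
}

def expand(config):
    """Generate synthetic (NL utterance, flow code) pairs from the grammar config."""
    pairs = []
    for slot, options in config["slots"].items():
        for code_fragment, phrase in options:
            code = config["code_pattern"].replace(slot, code_fragment)
            for nl in config["nl_patterns"]:
                pairs.append((nl.replace("@0@", phrase), code))
    return pairs

for utterance, flow in expand(config):
    print(utterance, "->", flow.replace("\n", " "))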

Context Grammar enriched the training set for the NL2Flow model. See the results section for a detailed description of the impact of including both Flow2NL and Context Grammar.

Model Training & Results

Using the two approaches, we generated about 800 new training samples with Flow2NL, and about 3,000 new samples using the Context Grammar approach.  We ensured the distribution of generated flows across topics was about the same as in the production samples.

We created a test set for tracking improvements across models trained on different iterations of the data. We computed a custom similarity metric for determining flow similarity between the predicted and the ground-truth code. We do a fuzzy match to compute similarity by counting the number of correctly predicted API calls (triggers as well as actions) divided by the total number of predicted functions. For example, if the ground truth for a certain flow has 5 function calls, and the model predicted 6 functions of which 4 are correct, the similarity measure would be 4/6 ≈ 0.66.
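A minimal sketch of that fuzzy similarity computation, assuming flows are represented simply as lists of predicted and ground-truth function calls (the list representation is an assumption for illustration):

def flow_similarity(predicted_calls, ground_truth_calls):
    """Correctly predicted API calls (triggers and actions) divided by total predicted calls."""
    if not predicted_calls:
        return 0.0
    truth = set(ground_truth_calls)
    correct = sum(1 for call in predicted_calls if call in truth)
    return correct / len(predicted_calls)

# The example from the text: 6 predicted functions, 4 of them correct -> 4/6.
ground_truth = ["OnNewEmailV3", "CreateFile", "SendNotification", "Condition", "ForwardEmail"]
predicted = ["OnNewEmailV3", "CreateFile", "SendNotification", "Condition", "DeleteFile", "GetItems"]
print(round(flow_similarity(predicted, ground_truth), 2))  # 0.67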

Source: Relative improvement in similarity measure
  • Baseline Model + Flow2NL: 3.2%
  • Baseline + Context Grammar: 9.5%
  • Baseline + Flow2NL + Context Grammar: 15.9%
Table 1: Relative impact of different types of synthetic data on model performance

As the table shows, the largest improvement over the baseline comes when both Flow2NL and Context Grammar data are added to the training set. This shows how powerful it is to add synthetic samples strategically, improving the model where it is needed most.

We invite you to create a Power Automate flow today by following this link: Power Automate. Click “Create +” and select the option “You describe it, AI builds it”. Please leave feedback or ask a question on our forum, the Microsoft Power Automate Community on Power Platform Community! We are continuously improving the model and would love to hear from you!

[Screenshot: the “You describe it, AI builds it” option in Power Automate]

Customer Lifetime Value Predictive Model now uses Customer Profile Attributes

Authors: Kidus Asfaw, Sally Kellaway, and Radha Sayyaparaju

When crafting your marketing and sales strategies, personalizing your marketing campaigns, journeys, and content to the particular traits of your customer segments is an important way to drive engagement and purchases. One method for segmenting your audiences is to predict the potential value of your customers, isolate high and low predicted-value groups, and personalize campaigns to reward high-value customers and drive up the value of lower-value customers. Artificial intelligence (AI) can be used to predict the value of customers and create segments like these.

As with all AI, the more data an AI model has, the more accurate its predictions will be. The type of data that’s provided to an AI model can influence the accuracy of those predictions – e.g., transaction data can help predict transaction value in the future, which can be made more accurate by adding customer data. The more customer data you provide to a model, the more accurate and targeted the segments of “high” and “low” value customers will be. As a result, the personalization of marketing campaigns driven with those segments can be more targeted and ‘speak’ more directly to the individual traits of your customers – driving engagement and sales!

Dynamics 365 Customer Insights is introducing a new feature for the Customer Lifetime Value (CLV) out-of-box predictive model that allows you to optionally select customer profile attributes to include in the prediction. You can select from 18 commonly used attributes (in any combination) to include as an input to the model. These attributes will drive more personalized, relevant, and actionable model results for your business use cases. This blog post will share what customer profile attributes are, how they’re used in the CLV model, and what’s new in the overall experience of using the model.

What is the CLV model?

The customer lifetime value model predicts the potential value (revenue) that individual active customers will bring to your business in a future period of time that you define. This model can help you achieve various goals:

  • Identify high-value customers
    • Identify high-value customers for individual targeting
    • Create strategic customer segments based on customers’ potential value to run personalized campaigns with targeted sales, marketing, and support efforts
    • Recognize and reward high-value customers through loyalty or rewards programs
  • Identify characteristics of high-value customers
    • Guide marketing and sales strategies by optimizing factors that increase customer value (e.g., incentivize more frequent visits, or larger transaction totals with upsell campaigns)
  • Optimize sales or marketing strategy and allocate budget more accurately for customer outreach
  • Identify medium and lower value customers to define strategies to improve their value

What are customer profile attributes?

The fields listed in a customer profile (or any CDM-defined entity) are called attributes (read more about the CDM schema here: Common Data Model | Microsoft Docs). Customer Insights utilizes both standard entity definitions and its own defined entities, including the CustomerProfile entity. This entity includes dozens of attributes that define the qualities of that customer. Example customer attributes are a customer’s birth date, first name, last name, or gamertag.

What is the model doing with these attributes?

These customer attributes can be used as extra sources of information for personalizing customer lifetime value predictions. Personalizing these predictions helps tailor the model to your use case without requiring the development of fully custom AI models.

The profile attributes that you select during configuration are added to the model and considered alongside the factors featurized from the required data you provided. The model develops its predictions of lifetime value and shows all factors that influenced those predictions on the results page.

As illustrated in the figure below, the CLV model takes transaction activities data and featurizes it into a transactions table with one row per customer. Each row will have features such as average transaction value for the customer. Similarly, CLV uses other activities (in the figure below, retail product web reviews are used as an example) to augment the transactions table with additional features. We call the resulting table a featurized activities table. The new profile attributes feature allows the user to add customer attributes as an additional feature to the featurized activities table.

[Figure: featurization of transaction activities, other activities, and customer profile attributes for the CLV model]
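A hedged sketch of that featurization step using pandas; the column names and aggregations are assumptions chosen for illustration, not the product's actual feature set.

import pandas as pd

# Hypothetical transaction activities: one row per transaction.
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "amount": [20.0, 35.0, 12.5, 80.0, 15.0, 22.0],
})

# Featurize into one row per customer (the "featurized activities table").
features = transactions.groupby("customer_id")["amount"].agg(
    transaction_count="count", avg_transaction_value="mean", total_spend="sum"
).reset_index()

# Optionally join selected customer profile attributes as additional features.
profiles = pd.DataFrame({"customer_id": ["c1", "c2", "c3"], "loyalty_years": [2, 5, 1]})
features = features.merge(profiles, on="customer_id", how="left")
print(features)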

At the heart of training a CLV model is a random forest regression that uses the features generated above as predictors and the historical customer lifetime values (over a window equal to the future prediction window requested by the user) as the target. Hyperparameters are selected using cross-validation. Random forest implementations scale well with data size and can handle the presence of outliers and/or missing data. Adding profile attributes to this regression model allows the user to incorporate more of what they know about their customers into the model.
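A minimal sketch of that training step with scikit-learn: a random forest regression on a featurized table, with hyperparameters selected by cross-validated grid search. The synthetic data, feature meanings, and parameter grid are illustrative assumptions, not the model shipped in Customer Insights.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Stand-in featurized activities table (rows = customers) and historical CLV target.
X = rng.normal(size=(200, 4))                          # e.g., count, avg value, total spend, loyalty
y = 50 + 30 * X[:, 2] + rng.normal(scale=5, size=200)  # historical lifetime value over the window

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,                                 # hyperparameters selected via cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_, round(-search.best_score_, 2))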

How has the experience of configuring the model changed?

The high-level experience of configuring the CLV model does not change. For more information about configuring the CLV model, see the product documentation online. Adding customer profile attributes is an optional step that you can take after adding the required data (customer transaction history). Additional (optional) data includes Activity data and now Customer data.

Figure 2 – Additional data screen in the CLV configuration experience where you can map the Customer Profile Attributes

Customer profile attributes will be used in the model when you set up a mapping between the unified customer profile table and one or more of the 18 possible attributes that the model allows. To do so, the user must click on the second “Add data” button in the optional data entry step of model configuration. In the resulting right-hand pane, the user can map the fields in the unified customer profile to the available profile fields (e.g., Birth Date).

Figure 3 – Right side panel where you can map your Customer profile attributes

All customer profile attributes are considered optional. You can skip this feature entirely or choose to map as few as one attribute or all 18. The CLV model will consider all attributes for which a mapping is provided and decide whether they are influential in identifying your predicted high and low value customers.

What changes will I see in the CLV model predictions?

You’ll see the results of using customer profile attributes in the results screen (more information about the results experience is available in the product documentation). The model will decide which of the customer profile attributes are influential in your CLV predictions, and will show them alongside other influential factors in the results experience.

Figure 4 – Results screen for a CLV model that included Customer profile attributes

This is the easiest way to know if the mapped attributes were used in the model. The influential factors table shows the importance (for predictive ability) assigned by the model to each of the aggregated factors. Factors are aggregated because there are hundreds of similar factors, and it is simpler to understand and use these insights to inform other activities. For example, the model aggregates the importance of location-related fields under one explainability variable called ‘Location’.

Could adding Customer attributes make my model biased?

As with all AI features developed at Microsoft, we apply rigorous Responsible AI (RAI) principles to ensure they meet our standards for developing ethical solutions that you can trust. The Responsible AI standard that we operationalize is available online and includes further information about the principles (Fairness, Inclusiveness, Reliability & Safety, Transparency, Privacy & Security, and Accountability), as well as information about how Microsoft reviews and tests AI to ensure they meet our standards. These principles are considered during the design of the AI and the feature in the product, which are then reviewed by a committee to ensure that the standard is met and the AI is fit to ship.

All the out-of-box predictive models in Customer Insights are subject to this process. Because you bring your own data to these models, it’s important to understand these principles: bias in your data may make the predictions biased as well.

For example, Contoso Coffee is deciding where to host an event for the launch of a flagship, top-of-the-line espresso machine. They want to target high-value customers to invite to this event. If they include location as a customer profile attribute in this prediction, they might find that their highest value customers are all grouped into affluent suburbs. If they include occupation, they might find that the highest value customers are grouped into particular occupations.

There are a few layers of bias in this scenario which are augmented by the model if Contoso doesn’t add in as many customer attributes as possible:

  • Assuming that location or occupation are the only attributes that indicate a high value customer, and therefore only mapping those attributes
  • When Contoso reviews the results – selecting and acting on only a subset of profile attributes can be a type of bias, too!

Avoiding bias in your source data can be done by using tools like Fairlearn to profile the data you are ingesting into Customer Insights. When you are confident that your data isn’t biased, you can avoid bias in the way you configure the model, too. Avoiding bias when training your model is as easy as including as many customer attributes as you can map, to allow the model to consider them as influential factors. You can also avoid bias when you review your results by taking the model’s influential factors list into consideration and avoiding ‘cherry picking’ profile attributes because they serve a hypothesis you might have.

Consider Contoso Coffee again. They want to make sure invited guests can easily travel to the event, so they can use Location fairly, but they should also consider many other attributes to help make a fair decision and capture the best audience to invite to the event.

Why might my mapped attributes not be used?

If the attributes you mapped do not appear in the influential factors table, there are a few potential causes:

  1. An incorrect manual mapping of a customer profile attribute can cause issues. In this case, you will need to go back to the configuration, re-map the attribute, and re-run the model.
  2. The model didn’t find them influential for the predictions (compared to the transaction and activity data) and omitted them.
  3. There were data quality issues that made the attributes unusable:

We perform various data quality checks on the mapped attributes before using them in the model. For example, if most entries for an attribute field are missing, we consider that attribute unreliable for prediction and do not use it in the model. If there is a data quality issue, the attribute will not appear in the influential factors page and there will be an entry in the Input Data Usability Report (instructions on finding the Report are in the Customer Insights online documentation).

Figure 5 – Input data usability report where you can find errors about parameter recommendations, data quality and model execution issues impacting your predictions.

When is this feature available?

This feature is now available in the Customer Lifetime Value model, which is Generally Available within Customer Insights. The product documentation is also updated with information about this feature, to help get you started!

Ways to engage

Supply Chain News: Impact and Categories

Authors: Allie Giddings and Chhaya Methani

In our last blog post, we explained how news is surfaced in Supply Chain Insights and how it can be useful for having better risk visibility. Since then, we’ve made two major updates to Supply Chain Insights News. First, we added a tag describing whether the news article has an immediate, future, or positive impact on the supply chain. Second, we added tags for the category of news. These updates can help supply chain managers quickly see the most impactful and relevant news for them, as well as filter to relevant categories of news.

What are immediate, future, and positive impact articles?

Immediate impact articles are those with an expected negative effect on a supply chain in the near term. Here are two examples of articles with immediate impact:

  • More than 900 layoffs planned at plant in
  • suspends operations, adjusts production line to minimize impact

Future impact articles are those with an expected negative effect on a supply chain in the future. Here’s an example of a future impact:

  • lockdown will not have a big impact on  production

Positive impact articles are those with an expected positive effect on a supply chain. Here are two examples of articles with positive impact:

  • acquires Business
  • in talks with and to setup local semiconductor plants

What data does the model use to learn?

The model is trained on news articles from recent months, each with a label indicating whether the article has immediate, future, positive, or no impact. The key challenge here is to account for data imbalance among the impact classes. Immediate and future impact categories tend to make up roughly 3% of all news related to a company. While this justifies the need for an AI model to filter and surface these infrequent articles, it is also challenging to get enough articles in these categories to train the model effectively. If you pick articles at random, you could incur a high cost to get enough samples to train the model.

To help with the data imbalance, we used a combination of techniques to generate a representative sample set to be labelled by crowdsourcing. We bootstrapped the process with a few labelled articles and then added automation to find articles similar to the ones we labelled, by evaluating their contextual similarity in an embedding space where all articles are mapped.
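A hedged sketch of that bootstrapping idea: embed a handful of labelled seed articles, embed the unlabelled pool, and shortlist the pool articles closest to any seed for human labelling. The embedding model choice and the max-similarity shortlisting rule are assumptions made for illustration.

# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model choice

def shortlist_candidates(seed_articles, unlabeled_articles, k=2):
    """Return the unlabeled articles most similar to the labelled seed set."""
    seed_emb = model.encode(seed_articles, normalize_embeddings=True)
    pool_emb = model.encode(unlabeled_articles, normalize_embeddings=True)
    scores = (pool_emb @ seed_emb.T).max(axis=1)   # max cosine similarity to any seed
    order = np.argsort(scores)[::-1][:k]
    return [unlabeled_articles[i] for i in order]

seeds = ["Plant suspends operations after fire, disrupting shipments."]
pool = [
    "Factory halts production line for two weeks after flooding.",
    "Company wins design award for new espresso machine.",
    "Supplier announces 900 layoffs at assembly plant.",
]
print(shortlist_candidates(seeds, pool))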

How does the model learn?

For the impact model, we trained a multi-class classifier which selects one of four categories: 1) Immediate impact, 2) Future impact, 3) No impact, and 4) Positive impact. We used a combination of statistical NLP features along with contextual representations of the article text as features. These features are used by a deep neural network to train an impact classifier with four softmax classifiers, one for each class. By using this one-vs-all strategy, we get the best performance for each class. We selected a threshold for each category with the help of an ROC curve.
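A simplified sketch of the one-vs-all setup and ROC-based threshold selection, using scikit-learn logistic regression on random stand-in features in place of the deep classifier described above; the data, features, and threshold rule are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))   # stand-in for statistical + contextual text features
y = rng.choice(["immediate", "future", "no_impact", "positive"], size=400, p=[0.03, 0.03, 0.84, 0.10])

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
scores = clf.predict_proba(X)                     # one score column per class (one-vs-all)
Y = label_binarize(y, classes=clf.classes_)

thresholds = {}
for i, name in enumerate(clf.classes_):
    fpr, tpr, thr = roc_curve(Y[:, i], scores[:, i])
    thresholds[name] = thr[np.argmax(tpr - fpr)]  # pick the point nearest the ROC "elbow"
print(thresholds)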

What features does the model use?

As mentioned above, we used a combination of statistical, linguistic, and contextual features. For the statistical features, we used term frequency, text length, and similar signals. We used a sentiment classifier to capture the tone of the article. Additionally, we use contextual representations from deep models trained on long text, such as news articles, to capture the overall meaning and relevance of each article.

What are the categories?

The categories are related to supply chain topics and are shown in the list below.

  • Bankruptcy, acquisition, and collaboration: Contains information about mergers and acquisitions, bankruptcy, or new or reduced collaborations with other companies or suppliers.
  • Company: Contains information relevant to the company, such as change in leadership or important personnel, new investment areas, or awards.
  • Company financial: Contains information about the growth and financial outlook of an individual company.
  • Disruption and weather: Contains information about events causing direct supply disruption, such as factory fires, explosions, leaks, Suez Canal blockage, or natural disasters such as forest fires or hurricanes.
  • Health: Contains information about human and animal epidemics and pandemics, such as COVID-19, Ebola, or H1N1.
  • Industry financial: Contains information that focuses on the growth or financial outlook of an entire industry.
  • Industry supplier: Contains information about other suppliers in the same industry, such as top supplier lists or general supplier risk articles.
  • Infrastructure: Contains information about general infrastructure improvements that could benefit a specific supplier.
  • Politics and government: Contains information such as government investigations, government collaborations, discounts/deals, lobbying, litigation, or regulations.
  • Product: Contains information about new or old products of the company, such as new technologies used in existing products or removal of product lines.
  • Quality: Contains information about supplier quality or quality control issues.
  • Sustainability: Contains information such as new or existing sustainability efforts or environmental impacts.
  • Workforce: Contains information affecting employees, such as strikes or workplace conditions.

What data does the model use to learn?

The category model is trained on recent news articles collected using the Bing News API. Each of these articles can have multiple category labels associated with it. For example, an article could be about a workforce strike driven by local political matters and would thus belong to two categories. This adds complexity to the labeling process, since missing a dominant label for an article will confuse the classifier. It is important to assign the dominant categories as labels for an article. However, in our experience, judges tend to miss some categories when asked to choose all labels from the full list of categories. This impacted classifier performance adversely.

To make sure we get a relatively complete set of labels, we modified the labeling task: we asked the judges to select the dominant categories in the article from a subset of categories assigned by a rules-based classifier. A rules-based classifier is prone to mistakes, and these mistakes become good training examples when the judges assign negative labels to them. One issue with relying on a rules-based classifier is recall: the model cannot learn from articles that were never present in the dataset. Hence, it is important to add some randomly selected categories to the list of options presented to the judges. Following this approach, we produced the set of labels used to train the classifier.

The classifier's performance varied across categories. To improve the weaker categories, we decided to add more samples. However, it is challenging to extract the most useful samples for specific categories because of data imbalance: each category accounts for a relatively small number of articles overall. To collect more data, we followed an approach similar to the one described above for the impact classifier, shortlisting candidate articles by evaluating their similarity to a small representative set. This effectively improved model performance.

As a result of a combination of techniques, we were able to collect a sizable set for all the categories and added more samples to the categories where the model needed more examples to resolve ambiguities and have good performance.

How does the model learn?

The model learns by fine-tuning a RoBERTa model. We add a softmax layer for each of the classes and assign every category whose score is above a threshold determined empirically for that category. In this manner, we can assign multiple labels to an article that belongs to several categories. To evaluate performance, we considered each assigned category individually and computed the F1 score for each category.
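
As a rough sketch of what such multi-label fine-tuning can look like with the Hugging Face transformers library (the checkpoint name, label count, and thresholds are illustrative, and this sketch uses a per-label sigmoid rather than the exact head described above):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_CATEGORIES = 13  # one output per category in the list above

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=NUM_CATEGORIES,
    problem_type="multi_label_classification",  # independent score per label
)
model.eval()

def predict_categories(text, thresholds):
    """Return the indices of every category whose score clears its own threshold."""
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = torch.sigmoid(model(**inputs).logits)[0]
    return [i for i, score in enumerate(scores) if score >= thresholds[i]]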

What features does the model use?

The features are the articles' representations in a high-dimensional contextual embedding space; the softmax classification layers use these data points to learn the boundaries between classes.

Supply Chain News Model in Supply Chain Insights

In Supply Chain Insights, customers can see news that impacts their partners and collaborate with them to reduce the risks surfaced by that news. These news articles are selected by a machine learning model and tagged with their impact and category, so only the most important information is surfaced. We have presented this information to help you understand how impact and categories are determined, so you can act on risks to your supply chain more effectively.

Check out our documentation for more information on this news feature and Supply Chain Insights. Which categories are most relevant to you? Share your thoughts on our forum: http://aka.ms/sci-forum.  Please send any feedback on news article relevance to scinewsfeedback@microsoft.com.

The post Supply Chain News: Impact and Categories appeared first on Microsoft Research.

]]>
Accelerating and Scaling AI in Microsoft Dynamics 365 with MLOps http://approjects.co.za/?big=en-us/research/articles/accelerating-and-scaling-ai-in-microsoft-dynamics-365-with-mlops/ Fri, 03 Jun 2022 23:21:34 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=850045 Authors: Matthew Burruss and Shafiul “Jacky” Islam In an ideal world, machine learning would be a straight path from defining the business use-cases to operationalizing the model, but in reality, the model lifecycle is a continuous loop, as objectives are redefined, models are updated, and the world changes (i.e., concept/data drift). The process of automating, […]

The post Accelerating and Scaling AI in Microsoft Dynamics 365 with MLOps appeared first on Microsoft Research.

]]>
Authors:
Matthew Burruss and Shafiul “Jacky” Islam

In an ideal world, machine learning would be a straight path from defining the business use-cases to operationalizing the model, but in reality, the model lifecycle is a continuous loop, as objectives are redefined, models are updated, and the world changes (i.e., concept/data drift).

The process of automating, monitoring, developing, testing, and continuously delivering models into production is known as machine learning operations, or MLOps.

The Model Lifecycle
Figure 1. The goal of MLOps is to orchestrate and automate the model lifecycle as shown above. At a high-level, the models iterate from experimentation (in Blue) to development (in Orange) to operationalizing in production (in Green).

AI models are used in Microsoft Dynamics 365 to provide business insights and drive business outcomes for our customers, including product recommendations, churn risk, and sentiment analysis, and to power applications like Microsoft Dynamics 365 Customer Insights, Microsoft Dynamics 365 Supply Chain Insights, etc.

This document acts as a reference guide, describing common challenges facing MLOps systems and the approaches our team at Microsoft has used to address them.

Challenge 1. Scaling to Support Many Models, Infrastructures, and Apps

This section will describe the challenge of standardizing the model lifecycle to support many model types and runtimes. For example, data scientists may use several analytic solutions like Azure HDInsight (opens in new tab), Azure Synapse (opens in new tab), and Azure Machine Learning (opens in new tab) during their experimentation phase while eventually finding the best solution for deployment based on their specific functional requirements (e.g., batch and real-time inference) and nonfunctional requirements (e.g., model selection, scalability, GPU acceleration, and latency requirements).

However, each ecosystem may have different ways to register datasets, register models, provision resources, etc., causing teams to divert money and time into resource management.

Standardization of the platform ecosystems provides a consistent user experience for the development and release of a model even when the underlying technologies or model type may change. This accelerates the model lifecycle by reducing confounding variables like compute configuration while also enabling cross-team and cross-compute development.

We address this challenge by relying heavily on compute abstraction, model abstraction, and declarative testing of the AI model. We will see additional advantages of these design decisions in Challenge 3. Traceability and Monitoring Model Improvements.

Compute abstraction is achieved by providing infra-as-code through Azure Pipelines (opens in new tab), allowing customers to declaratively select and configure their compute target(s) for various tests. We deliberately hide the details of the compute deployment and treat the compute as a service to allow data scientists to focus on model development, while also enhancing the reproducibility of their experiments and decreasing costs across the organization.

Model abstraction is achieved by ensuring that models adhere to a common interface. While this abstraction can be high-level, such as a train and score interface, it enables a consistent development-to-release process and allows applications to be built around the models in a backwards and forwards-compatible way.
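
For example, a train-and-score contract along the following lines (a hedged sketch, not the team's actual interface) is enough for the release pipeline to treat every model the same way:

from abc import ABC, abstractmethod
from typing import Any, Dict
import pandas as pd

class ModelInterface(ABC):
    """Common contract that every onboarded model implements."""

    @abstractmethod
    def train(self, data: pd.DataFrame, config: Dict[str, Any]) -> None:
        """Fit the model; config carries hyperparameters and compute hints."""

    @abstractmethod
    def score(self, data: pd.DataFrame) -> pd.DataFrame:
        """Return one prediction row per input row."""

Because the pipeline only ever calls train and score, the underlying runtime (a Spark job, an Azure Machine Learning endpoint, and so on) can change without breaking the applications built on top of the model.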

Finally, declarative tests verify the behavior and performance of the model. We leverage build verification tests and performance verification tests (e.g., stress and load tests), which are orchestrated by Azure Pipelines (opens in new tab). These tests leverage the compute and model abstractions to surface consistent signals to developers and act as gates that judge whether a model is deployment-ready.

Applications and Models
Figure 2. A high-level view of the components of our MLOps platform as well as the applications it supports.

Overall, as shown in Figure 2, incorporating MLOps helps scale delivery of models to multiple Dynamics 365 applications while accelerating the AI model development lifecycle, allowing fast onboarding and quick model iteration. The models shown are a small subset of the models in Dynamics/Power applications overall; they represent the set onboarded to date to the MLOps platform discussed in this post.

Challenge 2. Reproducibility and Versioning

This section will describe the challenge of reproducibility and versioning. Randomness is inherent in many machine learning algorithms. One example is the random initialization of neuron weights in neural networks. However, reproducibility in the experimental design is important. For example, dataset transformations & feature engineering techniques like embeddings should be repeatable and reusable throughout the model lifecycle. Furthermore, it is important to capture the model runtime (model dependencies, python version, etc.) between iterations for change control.

To make the model lifecycle reproducible, we rely on Azure Repos (opens in new tab) to provide version control tools that store a permanent snapshot of the model and its dataset transformation steps in a Git repository. We leverage Azure Blob Storage (opens in new tab) to house derived artifacts like model weights or explainability outcomes, which are checkpointed whenever our testing platform runs (see Challenge 3. Traceability and Monitoring Model Improvements).

Once experiments are reproducible, different iterations of your model can be compared for improvements or regressions. This is where the challenge of versioning comes into play for models, configurations, and datasets. To version our model, we rely on Azure Artifacts (opens in new tab) and Docker images pushed to Azure Container Registry (opens in new tab) to provide consistent snapshots of our model, its configurations, and its runtime environment. We also leverage an internal Data API to supply versioned artifacts (e.g., pretrained models), and we use hashing for dataset versioning when augmenting data. We also allow users to organize data to their liking. We have found that tagging datasets with summary statistics and descriptions also helps with experimentation efforts and dataset readability.
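
As one possible sketch of content-based dataset hashing (the canonicalization and hash choice here are illustrative, not the exact internal scheme):

import hashlib
import pandas as pd

def dataset_version(df: pd.DataFrame) -> str:
    """Compute a stable content hash that can serve as a dataset version tag."""
    canonical = df.sort_index(axis=1)                               # fix column order
    canonical = canonical.sort_values(by=list(canonical.columns))   # fix row order
    payload = canonical.to_csv(index=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

Whenever the training data is augmented, the new hash becomes part of the experiment's metadata, so any run can be traced back to the exact data it saw.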

Challenge 3. Traceability and Monitoring Model Improvements

This section will describe the challenge of testing the model and evaluating its performance against a baseline or existing model in production. The MLOps platform should weigh the tradeoffs of time, money, and resources when determining the number and types of tests used to evaluate the model code (e.g., unit, integration, load, and stress tests). It is also important to ensure high quality and secure code through automatic linting and security scanning. For performance monitoring, it is important to practice reproducibility and ensure that metrics can be easily traced and compared against previous models. Traceability also ensures performance numbers, code changes, approvals, and new features can be tracked across the model’s lifecycle. In our platform, we track common metrics like memory consumption, execution times, etc. while allowing models to customize their own metrics like F1 score, accuracy, etc.

For testing of the AI model, we have developed a common set of standard DevOps tools to orchestrate model development, including a verification engine to verify model behavior and performance. We start with high-speed, low-cost unit tests followed by slower endpoint integration tests. These endpoint tests, for example, verify that the train endpoint works as intended on a small dataset. Finally, we run more heavyweight end-to-end integration tests on DEV instances of our production platform to gather performance numbers and verify end-to-end runtime behaviors.

These integration tests use declarative test cases to verify the behavior of the model, allowing model owners to define tests that check metrics, SHAP explainability results, model error codes, etc. The declarative tests also enable model developers to create consistent test cases for different model types like batch vs. real-time as well as different compute types like HDInsight vs. Azure Machine Learning.

For each test, metrics and outputs are tracked, providing a historical snapshot of the model. We leverage mlflow (opens in new tab) to log metrics, parameters, and tags. We persist these artifacts in CosmosDB (opens in new tab). Finally, we consume and visualize this data in PowerBI (opens in new tab), which becomes especially useful for quick model evaluation and comparison.
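
A minimal example of this kind of tracking with mlflow (the run name, tags, and metric names below are illustrative):

import mlflow

with mlflow.start_run(run_name="churn-model-nightly"):   # hypothetical run name
    mlflow.set_tag("compute_target", "azureml")           # where the test executed
    mlflow.log_param("model_version", "1.4.2")            # illustrative model version
    mlflow.log_metric("f1_score", 0.87)                   # model-specific metric
    mlflow.log_metric("peak_memory_mb", 2150.0)           # common platform metric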

Challenge 4. Model Packaging and Deployment

Once the model is tested and evaluated, the final challenge is packaging and deploying the model. First, after passing all the testing gates, a new version of the model is pushed to our PyPI and NuGet feed in Azure Artifacts (opens in new tab). How the model is then deployed largely depends on the compute target requirements. For example, batch jobs running on HDInsight through Spark consume a Conda environment whereas our real-time/batch models running on Azure Machine Learning pull the model’s Docker image from an Azure Container Registry (opens in new tab).

For operationalizing the deployment, it is best practice to automate as many parts of the process as possible while keeping a human in the loop to sign off on the deployment. Azure DevOps allows you to define gated releases (opens in new tab) such that a notification is sent to a group of approvers to manually review and approve the release. Once all the tests and approvals have passed, the model can be deployed to the customer-facing service. One important decision is when and how frequently to deploy a model. Common approaches include scheduled triggers, which may be required if data drift is frequent; for most of our cases, we perform manual deployments at ad-hoc intervals whenever a new feature or improvement is available.

Finally, it may be helpful to ease the model into production and evaluate its performance against an existing model to see if the AI feature improves the end-user experience. Common approaches include A/B testing and shadow-mode deployments. While such discussions are out of scope for this article, we encourage those interested to explore techniques for continuously evaluating a model in production.

Conclusions

Thank you for reading; we hope you have learned more about MLOps and can utilize these learnings to improve the continuous release of your models. If you want to talk more about this or join our team (we’re always looking for great people!), contact us at: BizAppsAI@microsoft.com. Happy training!

The post Accelerating and Scaling AI in Microsoft Dynamics 365 with MLOps appeared first on Microsoft Research.

]]>
Getting Deterministic Results from Spark’s randomSplit Function: A Deep Dive http://approjects.co.za/?big=en-us/research/articles/getting-deterministic-results-from-sparks-randomsplit-function/ Fri, 22 Apr 2022 18:49:40 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=838531 Tommy Guy and Kidus Asfaw We noticed an odd case of nondeterminism in Spark’s randomSplit function, which is often used to generate test/train data splits for Machine Learning training scripts. There are other posts, notably this one (opens in new tab) that diagnose the problem, but there are a few details to spell out. We also want […]

The post Getting Deterministic Results from Spark’s randomSplit Function: A Deep Dive appeared first on Microsoft Research.

]]>
Authors:

Tommy Guy and Kidus Asfaw

We noticed an odd case of nondeterminism in Spark’s randomSplit function, which is often used to generate test/train data splits for Machine Learning training scripts. There are other posts, notably this one (opens in new tab) that diagnose the problem, but there are a few details to spell out. We also want to suggest an alternative to randomSplit that will guarantee determinism.

The Problem

If you want to split a data set 80/20 in Spark, you call df.randomSplit([0.80, 0.20], seed) where seed is some integer used to reseed the random number generator. Reseeding a generator is a common way to force determinism. But in this case, it doesn’t work! In some cases (we’ll identify exactly which cases below), randomSplit will:

  • Leave some rows out of either split
  • Duplicate other rows into both splits
  • On two separate runs on the same data with the same seed, assign data differently.

This feels like a bit of a bait and switch. I feel like any function that accepts a seed is advertising that it should be deterministic: otherwise why bother with the seed at all?

Luckily, there is a way to force randomSplit to be deterministic, and it’s listed in several (opens in new tab) places (opens in new tab) online (opens in new tab). The trick is to cache the dataframe before invoking randomSplit. This seems straightforward, but it relies on a solid understanding of Spark internals to gain an intuition on when you should be careful. Ultimately, Spark tries hard to force determinism (and more recent Spark versions are even better at this) but they can’t provide 100% assurance that randomSplit will work deterministically. Below, I’m going to suggest a different way to randomly partition that will be deterministic no matter what.

Pseudorandomization: A Reminder

Just as a quick reminder, the way computers produce “random” numbers is actually pseudorandom: they start with some number and then iterate in a complicated but deterministic way to produce a stream of numbers that are uncorrelated with each other. In the example below, we assign random numbers to some names (the names and values are illustrative) and show that we can do this repeatably.

import numpy as np

np.random.seed(7)
names = ["Ana", "Bo", "Cat"]
print(dict(zip(names, np.random.rand(len(names)))))  # same numbers on every run with this seed

If we shuffle the names, we get different results even if we keep the seed.

np.random.seed(7)
shuffled_names = list(reversed(names))
print(dict(zip(shuffled_names, np.random.rand(len(shuffled_names)))))

Note that the random numbers are the same, but they now apply to different names.

So, the way to make a deterministic algorithm with a random number generator is to:

  1. Set the seed the same way.
  2. Invoke the random number generator the exact same number of times and use the sequence in the exact same way.

Another Reminder: Spark DataFrame definition vs execution

Spark makes a distinction between defining what to do and executing the defined compute. Some expressions on DataFrames are transformations that convert one DataFrame to a new DataFrame while others are actions that execute a sequence of transformations. There are many sources talking about this distinction online, but the original paper (opens in new tab) on Spark is still a really great intro. (Aside: the paper talks about Resilient Distributed Datasets, which are a foundational element that DataFrames use).

If you’ve worked in Spark long at all, you’ve seen this phenomenon. I can execute the following commands in a REPL and they succeed almost immediately no matter how big the data really is:

df = spark.read.parquet("/some/parquet/file/pattern*.parquet")

df = df.filter(df['amount'] > 4000).filter(df['month'] != 'jan')

df2 = spark.read.parquet("/someother/parquet/file/pattern*.parquet")

df3 = df.join(df2)

That’s because all I’ve done so far is define a set of computations. You can see the plan by trying

df3.explain()

But when we execute something like df3.count(), we issue an action. The set of transformations that create df3 execute on Spark workers, and it can take much longer to execute the statement because it blocks on the actual Spark action finishing.

In a normal python script, if you trace the program on a white board, you can basically track the system state line by line. But in a pyspark script, it’s much harder to trace when the “real” work (the actions) take place, or even when and how often they take place.

randomSplit([0.8, 0.2], seed) creates two DataFrames, and each results in an action

Ok, so now it’s time to look at the randomSplit function. The actual code (opens in new tab) is below:

(The randomSplit source code is shown as a screenshot in the original post.)

This is what it does:

  • Sort the data within each partition. This ensures that within a Spark partition, the random number generator in Sample will execute the same number of times and will use the random numbers in the same exact way.
  • Normalize the weights.
  • Issue a series of calls to Sample with different sliding windows and with the same seed. Those calls are totally independent, and each call returns a Dataframe.
  • Return a list of DataFrames: one per sample partition.

Sample is a transformation: it adds to the DAG of transformations but doesn't result in an action. In our example of an 80/20 split, the first call to Sample will use a random generator to assign a value between 0 and 1 to every row, and it will keep rows where the random value is < 0.8. The second call will regenerate random values for every row and keep rows where the random value is >= 0.8. This works if and only if the random values assigned to each row are exactly the same in both calls to Sample.

Each of the 2 DataFrames (one with 80% of data and one with 20%) corresponds to a set of transformations. They share the set of steps up to the sample transformation, but those shared steps will execute independently for each random split. This could extend all the way back to data reading, so data would literally be read from disk independently for the 80% sample and the 20% sample. Any other work that happens in the DataFrame before Sample will also run twice.

This all works just fine assuming every step in the upstream DAG deterministically maps data to partitions! If everything is deterministic upstream, then all data maps to the same partition every time the script runs, and that data is sorted the same way in randomSplit every time, and the random numbers generated use the same seed and used on the same data row every time. But if something upstream changes the mapping of data to partitions then some rows will end up on different partitions in the execution for the 80% sample than they end up in the 20% sample. To summarize:

  • If a non-deterministic process maps data to partitions, then the non-deterministic process could run independently per partition.
  • If the independent, non-deterministic transformation changes something that Spark uses to partition data, then some rows may map to partitions differently in each DAG execution.
  • That data is assigned different random numbers in the 80% sample and the 20% sample because the random numbers in Sample are consumed differently in the two samples. In fact, nearly all data is likely to get different random numbers, because any change to partitioning also changes the sorted order that the random assignment relies on.

What could cause the DataFrame input to randomSplit to be non-deterministic? Here are a few examples:

  • Changing data. If your data changes between reads, the two frames could start with different data. This could happen if you are, say, reading from a stream with frequent appends. The extra rows from the second action would end up somewhere.
  • Some UDFs (User Defined Functions) can be nondeterministic. A classic example would be a function that generates a UUID for every row, especially if you later use that field as a primary key.

There used to be a much more nefarious problem in Shuffle (opens in new tab) when used via df.repartition(n). Spark did round-robin partitioning, which meant rows were distributed across partitions in a way that depended on the order of data in the original partition. By now, you should see the problem with that approach! In fact, someone filed a bug (opens in new tab) pointing out the same sort of nondeterministic behavior we saw in randomSplit, and it was fixed. The source (opens in new tab) for round-robin shuffling now explicitly sorts to ensure rows are handled in a deterministic order.

A Few Workarounds

There are really two options, and they are documented elsewhere (opens in new tab) in more detail. They boil down to:

  1. Force the upstream DAG to only run once. This is what cache does: it persists the DataFrame to memory or disk. Subsequent reads hit the cache, so someNonDeterministicDataFrame.cache().randomSplit forces the DAG creating someNonDeterministicDataFrame to run once, saves results in cache, then forces all samples in randomSplit to read from the cache. The cache is deterministic by definition: it’s a fixed data set.
  2. Do something that deterministically forces data to partitions. Do this after the nondeterministic transformation, and be careful not to partition on something that is nondeterministic (like a guid you build in a UDF)!

Both workaround options require that you think globally to act locally. That breaks the encapsulation that is at the core of software engineering! You are left to either understand every step upstream in the DAG (likely by using the explain function) and hope it doesn't change, or add potentially expensive extra computation to guard against changes. Either way, you effectively need global knowledge of the pipeline and of any changes to it. For example, my team at Microsoft intentionally separates the problem of reading data from disk and producing DataFrames from the actual Machine Learning training and inference steps. We don't want you to think globally!

An Alternative Fix: Deterministic by Design Shuffle

randomSplit relies on DataFrame structure to produce deterministic results: consistent data-to-partition mapping and consistent ordering within partition (enforced in the method). Another approach is to deterministically use the data values to map to partitions. This is an approach that is commonly used in AB test initialization (I described it here (opens in new tab)) that has a few interesting properties:

  • The same input always maps to the same sample.
  • You can use a subset of columns to consistently hash all data that matches on that subset to the same sample. For instance, you could map all data from a userId to the same random split.
  • The algorithm is stateless: this is important for scale in AB testing but for our purposes it makes implementation easier.

The basic idea for a row is:

  1. Concatenate any columns you want to sample on into a new column.
  2. Add a salt to the new column (we’ll use seed), which allows us to produce different partitions at different times.
  3. Hash the column.
  4. Compute the modulus of the hash using some large modulus number (say 1000).
  5. Pick a set of modulus outputs for each split. For an 80/20 split, moduli 0–799 form the 80% split and 800–999 form the 20% split.

In pyspark:

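Here is a hedged sketch of that recipe (the original post shows its own implementation as a screenshot; the function name, column handling, and modulus below are illustrative):

from pyspark.sql import functions as F

def deterministic_split(df, key_cols, seed, weights=(0.8, 0.2), modulus=1000):
    """Split df deterministically based on a hash of key_cols plus a seed-derived salt."""
    # Steps 1-3: concatenate the key columns with a salt, then hash the result.
    salted = F.concat_ws("||", *[F.col(c).cast("string") for c in key_cols], F.lit(str(seed)))
    bucket = F.abs(F.xxhash64(salted)) % modulus          # step 4 (xxhash64 needs Spark 3.0+)

    # Step 5: assign bucket ranges to splits in proportion to the weights.
    cutoff = int(modulus * weights[0] / sum(weights))
    with_bucket = df.withColumn("_bucket", bucket)
    first = with_bucket.filter(F.col("_bucket") < cutoff).drop("_bucket")
    second = with_bucket.filter(F.col("_bucket") >= cutoff).drop("_bucket")
    return first, second

Because the bucket depends only on each row's own key values and the seed, a row always lands in the same split, no matter how the data is partitioned or how many times the DAG re-executes. For example, deterministic_split(df, ["userId"], seed=42) keeps all rows for a given userId in the same split.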

The post Getting Deterministic Results from Spark’s randomSplit Function: A Deep Dive appeared first on Microsoft Research.

]]>
Explainability http://approjects.co.za/?big=en-us/research/articles/explainability/ Tue, 01 Mar 2022 23:23:15 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=822961 ML models can be a “black box” that make decisions in ways that we don’t understand. Model Explainability is a feature that helps practitioners' probe how a model makes decisions so they can have more confidence in the results.

The post Explainability appeared first on Microsoft Research.

]]>
Authors:
Alejandro Gutierrez Munoz (opens in new tab), Tommy Guy (opens in new tab), Sally Kellaway

Trust and understanding of AI Models’ predictions through Customer Insights

AI models are becoming a normal part of many business operations, led by advancement in AI technologies and the democratization of AI. While AI is increasingly important in decision making, it can be challenging to understand what influences the outcomes of AI models. Critical details like the information used as input, the influence of missing data, and use of unintended or sensitive input variables can all have an impact on a model’s output. To use AI responsibly and to trust it enough to make decisions, we must have tools and processes in place to understand how the model is reaching its conclusions.

Microsoft Dynamics 365 Customer Insights goes beyond just a predicted outcome and provides additional information that helps better understand the model and its predictions. Using the latest AI technologies, Customer Insights surfaces the main factors that drive our predictions. In this blog post, we will talk about how Customer Insights’ out-of-the-box AI models enable enterprises to better understand and trust AI models, as well as what actions can be taken based on the additional model interpretability.


Figure 1: Explainability information on the results page of the Customer Lifetime Value Out of box model, designed to help you interpret model results.

What is model interpretability and why is it important?

AI models are sometimes described as black boxes that consume information and output a prediction – where the inner workings are unknown. This raises serious questions about our reliance on AI technology. Can the model’s prediction be trusted? Does the prediction make sense? AI model interpretability has emerged over the last few years as an area of research with the goal of providing insights into how AI models reach decisions.

AI models leverage information from the enterprise (data about customers, transactions, historic data, etc.) as inputs. We call these inputs features. Features are used by the model to determine the output. A way to achieve model interpretability is by using explainable AI, or model explainability, which are a set of techniques that describe which features influence a prediction. We’ll talk about two approaches: local explainability that describes how the model arrived at a single prediction (say a single customer’s churn score) and global explainability that describes which features are most useful to make all predictions. Before we describe how a model produces explainability output and how you should interpret it, we need to describe how we construct features from input data.

AI Feature Design with Interpretability in mind

AI models are trained using features, which are transformations of raw input data to make it easier for the model to use. These transformations are a standard part of the model development process.

For instance, input data may be a list of transactions with dollar amounts, but a feature might be the number of transactions in the last thirty days and the average transaction value. (Many features summarize more than one input row.) Before features are created, raw input data needs to be prepared and “cleaned”. In a future post, we’ll deep dive on data preparation and the role that model explainability plays in it.

To provide a more concrete example of what a feature is and how they might be important to the model’s prediction, take these two features that might help predict customer churn value: frequency of transactions and number of product types bought. In a coffee shop, frequency of transactions is likely a great predictor of continued patronage: the regulars who walk by every morning will likely continue to do so. But those regulars may always get the same thing: I always get a 12 oz black Americano and never get a mochaccino or a sandwich. That means that number of product types I buy isn’t a good predictor of my churn: I buy the same product, but I get it every morning.

Conversely, the bank down the road may observe that I rarely visit the branch to transact. However, I’ve got a mortgage, two bank accounts and a credit card with that bank. The bank’s churn predictions might rely on the number of products/services bought rather than frequency of buying a new product. Both models start with the same set of facts (frequency of transactions and number of product types) and predict the same thing (churn) but have learned to use different features to make accurate predictions. Model authors created a pair of features that might be useful, but the model ultimately decides how or whether to use those features based on the context.

Feature design also requires understandable names for the features. If a user doesn't know what a feature means, it's hard to interpret why the model thinks it's important! During feature construction, AI engineers work with Product Managers and Content Writers to create human-readable names for every feature. For example, a feature representing the average number of transactions for a customer in the last quarter could look something like 'avg_trans_last_3_months' in the data science experimentation environment. If we presented features like this to business users, it could be difficult for them to understand exactly what they mean.

Explainability via Game Theory

A main goal in model explainability is to understand the impact of including a feature in a model. For instance, one could train a model with all the features except one, then train a model with all features. The difference in accuracy of model predictions is a measure of the importance of the feature that was left out. If the model with the feature is much more accurate than the model without the feature, then the feature was very important.


Figure 2: The basic idea to compute explainability is to understand each feature’s contribution to the model’s performance by comparing performance of the whole model to performance without the feature. In reality, we use Shapley values to identify each feature’s contribution, including interactions, in one training cycle.

There are nuances related to feature interaction (e.g., including city name and zip code may be redundant: removing one won’t impact model performance but removing both would) but the basic idea remains the same: how much does including a feature contribute to model performance?

With hundreds of features, it’s too expensive to train a model leaving each feature out one by one. Instead, we use a concept called Shapley values (opens in new tab) to identify feature contributions from a single training cycle. Shapley values are a technique from game theory, where the goal is to understand the gains and costs of several actors working in a coalition. In Machine Learning, the “actors” are features, and the Shapley Value algorithm can estimate each feature’s contribution even when they interact with other features.

If you are looking for (much!) more detail about Shapley analysis, a good place to start is this GitHub repository: GitHub – slundberg/shap: A game theoretic approach to explain the output of any machine learning model. (opens in new tab)
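
For a concrete sense of what this looks like in practice, here is a minimal, generic example with the open-source shap package (the model and data are placeholders, not the Customer Insights pipeline):

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder tabular data standing in for customer features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)        # suited to tree-based models
shap_values = explainer.shap_values(X)       # local explanations: per-feature contribution per row

# Aggregating the per-row contributions gives a global view of feature importance.
shap.summary_plot(shap_values, X)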


Figure 3: Shap Contributions to model’s prediction

Other types of models, like deep learning neural networks, require novel methods to discover the features contributions. Customer Insights’ sentiment model uses a deep learning transformer model that uses thousands of features. To explain the impact of each feature we leverage a technique known as integrated gradients. (opens in new tab) Most deep learning models are implemented using neural networks, which learn by fine-tuning weights of the connections between the neurons in the network. Integrated gradients evaluate these connections to explain how different inputs influence the results. This lets us measure which words in a sentence have the highest contribution to the final sentiment score.
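
To illustrate the idea in a toy setting, here is a hedged sketch using the open-source Captum library on a tiny stand-in network (not the production sentiment model, which attributes scores to the words of a sentence through its embedding layers):

import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Tiny stand-in network: 4 numeric features -> scores for (negative, positive).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

inputs = torch.rand(1, 4)       # placeholder input features
baseline = torch.zeros(1, 4)    # "no signal" reference point

ig = IntegratedGradients(model)
attributions = ig.attribute(inputs, baselines=baseline, target=1)  # attribute the "positive" score
print(attributions)             # per-feature contribution to the positive score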


Figure 4: Model level explainability information generated for the Sentiment Analysis model.

 


Figure 5:  Record level explainability information generated by the Sentiment analysis model.

How to leverage the interpretability of a model

AI models output predictions for each record. A record is an instance or sample of the set we want to predict a score for; for example, for a churn model in Customer Insights, each customer is a record to score. Explainability is first computed at the record level (local explainability), meaning we compute the impact of each feature on the prediction for a single record. If we are interested in a particular set of records (e.g., a specific set of customer accounts we manage), or just a few examples to validate our intuitions about which features might matter to the model, looking at local explainability makes sense. When we are interested in the main features across all scored records, we aggregate the per-record impacts to get global explainability.


Figure 6:  Global explainability example from the Churn model.

Features can impact the score in a positive way or negative way. For instance, a high value on number of support interactions might make a customer more likely to churn by 13%, while more transactions per week might make the customer less likely to churn by 5%. In these cases, a high numerical value for each feature (support calls or transactions per week) has opposing effects on the churn outcome. Feature impact therefore needs to consider both magnitude (size of impact) and directionality (positive or negative).

 


Figure 7: Local explainability example for the Business-to-Business Churn model.

Acting on explainability information

Now that we have made the case for adding explainability as an important output of our AI models, the question is: what do I do with this information? For model creators, explainability is a powerful tool during feature design and model debugging, as it can highlight data issues introduced during ingestion, clean-up, transformations, etc. It also helps validate the behavior of the model early on: does the way the model makes predictions pass a "sniff test," where obviously important features are indeed important in the model? For consumers of AI models, it helps validate assumptions about what should matter to the model. It can also highlight trends and patterns in your customer base that are worth paying attention to and can inform next steps.

Explainability is an integral part of providing more transparency into AI models, how they work, and why they make a particular prediction. Transparency is one of the core principles of Responsible AI, which we will cover in more detail in a future blog post.

The post Explainability appeared first on Microsoft Research.

]]>
Supply Chain News for your Digital Twin http://approjects.co.za/?big=en-us/research/articles/supply-chain-news-for-your-digital-twin/ Fri, 28 Jan 2022 01:36:53 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=816136 Authors: Tommy Guy, Allie Giddings, Chhaya Methani Microsoft Dynamics 365 Supply Chain Insights is a new product that helps predict risks and manage your supply chain through three main functions. It increases visibility of risks affecting your supply chain. It provides analytics on your digital twin that can help you address risks. It enables collaboration […]

The post Supply Chain News for your Digital Twin appeared first on Microsoft Research.

]]>
Authors:
Tommy Guy, Allie Giddings, Chhaya Methani

Microsoft Dynamics 365 Supply Chain Insights is a new product that helps predict risks and manage your supply chain through three main functions.

  1. It increases visibility of risks affecting your supply chain.
  2. It provides analytics on your digital twin that can help you address risks.
  3. It enables collaboration with your suppliers to act on the visibility and analytics.

The AI in Supply Chain Insights is infused with the product’s main functions to accelerate risk identification and mitigation. In this blog post, we’ll focus on how surfacing relevant news articles about suppliers and warehouses helps increase the visibility of risks and empower you to collaborate with your partners to reduce the risk. We will cover how our model works and why you can trust the results, so you can have more confidence in your news feed in Supply Chain Insights.

What are relevant news articles for Supply Chain?

Smart news will filter news articles to those most relevant to suppliers in your supply chain and your own company. The model finds news related to supply chain topics such as:

  • Product News
  • Quality News
  • Workforce News
  • Infrastructure News
  • Politics and Government News
  • Location Health
  • Sustainability News
  • Disruption and Disaster News

By filtering to these topics, you can have a news feed that only shows important information affecting your supply chain. Here’s an example of how these articles are surfaced in Supply Chain Insights.

Dynamics 365 Supply Chain Insights (Preview)

Figure 1: Example of how articles are surfaced in Supply Chain Insights.

What does the model do?

The model uses data you have uploaded to Supply Chain Insights in the Vendors table, and your company name in the company profile section to query Bing News for articles. These articles from Bing News are passed to the model, where it scores each article on how relevant it is from 0 to 1 with higher scores being more relevant. Only articles with a score higher than 0.5 are shown in the news feed, and the articles are ordered by the relevance score from highest to lowest.

What data does the model use to learn?

The model is trained on news articles from recent months, each labelled as relevant or not relevant. The news articles are generated by querying the news API with a broad list of companies to get news covering a broad range of topics (listed above). The relevance labels are created using simple rules, since there are too many articles for humans to label. For example, we could have a simple rule that labels an article as relevant if it contains the words “supply chain.”

How does the model learn?

We use a Random Forest model, which is an ensemble of Decision Trees. Ensemble methods combine multiple individual models to make a prediction, which often makes them more robust to noise in the training data. Decision Tree models learn by grouping data into different sets based on feature values and predicting a confidence score based on the average of the data in each group. For example, you could split the data into two groups based on whether the source of an article is a spam or junk domain. Here's an example diagram of a simple Decision Tree model:


Figure 2: Decision Tree model

Note: This is not the actual model, but an illustrative example of what one tree could look like out of the many trees that together complete the Random Forest.
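
To make this more concrete, here is a hedged, simplified sketch of training such a relevance classifier with scikit-learn (the feature names and labels are illustrative placeholders, not the production feature set):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative features of the kind described in the next section.
features = pd.DataFrame({
    "company_name_matches":    [3, 0, 1, 2],
    "is_spam_domain":          [0, 1, 0, 0],
    "supply_chain_similarity": [0.82, 0.10, 0.55, 0.91],
})
labels = [1, 0, 0, 1]  # 1 = relevant, 0 = not relevant

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, labels)

# The positive-class probability serves as the relevance score; articles above 0.5 are shown.
relevance_scores = model.predict_proba(features)[:, 1]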

What features does the model use?

The model learns to deal with the ambiguity inherent in the news articles by using the following feature areas:

  • Company name matches in the article
  • Avoiding clickbait titles
  • Penalizing cases where the article content is not relevant to its title
  • Filtering out spammy and junk domains
  • Ensuring high-quality sources
  • Similarity to supply chain domains
  • Sentiment expressed in the article, which can be positive, negative, or neutral.

We use powerful, state-of-the-art language models developed by Microsoft Research. These pre-trained language models are used to create scores that tell how similar the title and content of an article are to different topics such as a natural disaster or port capacity issues.

How is the model evaluated?

The model is evaluated on test datasets with metrics. First, the model makes a prediction for every news article in the test dataset. These predictions are compared to the expected labels, and each news article becomes a false positive, true positive, false negative, or true negative.

  • A false positive is a news article that was predicted as relevant but should not have been predicted as relevant.
  • A true positive is a news article that was predicted as relevant and was supposed to be relevant.
  • A false negative is a news article that was predicted as not relevant but should have been predicted as relevant.
  • A true negative is a news article that was predicted as not relevant and was supposed to not be relevant.

Now that we know whether each article is a false positive, true positive, false negative, or true negative, we can calculate the precision and recall metrics to evaluate how good the model is. Precision is a measure of how accurate the model's positive predictions are, while recall is a measure of the model's coverage.

Precision = # of true positives/ (# of true positives + # of false positives)

Recall = # of true positives/ (# of true positives + # of false negatives)
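
In code, these metrics are straightforward to compute (scikit-learn's precision_score and recall_score compute the same metrics directly from predictions and labels):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Illustrative counts: 40 relevant articles correctly surfaced, 10 surfaced by mistake,
# and 20 relevant articles missed.
print(precision(tp=40, fp=10))  # 0.8
print(recall(tp=40, fn=20))     # ~0.67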

How can I trust the results?

We validate the model results on test datasets. These test datasets are not shown to the model during training, so we can be confident the model didn’t learn patterns specific to the test datasets. This ensures the model isn’t overfitting to a dataset and can generalize to more types of news articles. For more information about overfitting, check out our previous post, Intro to Machine Learning in Customer Insights. Test datasets are created by the development team and are a sample set of suppliers.

Supply Chain News Model in Supply Chain Insights

In Supply Chain Insights, customers can see news that impacts their partners and collaborate with them to reduce risk surfaced from the news. These news articles are selected by a machine learning model so only the most important information is surfaced, leaving the noise out. We presented information to help you understand how relevance is determined, so you can act on risks to your supply chain more effectively. Relevance is determined by a Random Forest model, which learns to predict a relevance score based on features and historical data. The model's performance is validated on test sets and reviewed by the team to ensure high quality.

Check out our documentation for more information on this news feature and Supply Chain Insights. Please send any feedback on news article relevance to scinewsfeedback@microsoft.com.

 

 

 

The post Supply Chain News for your Digital Twin appeared first on Microsoft Research.

]]>
Intro to Machine Learning in Customer Insights http://approjects.co.za/?big=en-us/research/articles/intro-to-machine-learning-in-customer-insights_/ Sat, 18 Dec 2021 00:07:14 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=806446 Authors: Tommy Guy, Sally Kellaway, Zachary Cook, Julie Koesmarno Microsoft Dynamics 365 Customer Insights accelerates time to value with Machine Learning-based predictions covering Product recommendations, Churn risk, Sentiment analysis and Customer lifetime value scenarios. These features were developed using vast data sets and advanced analytics to provide a comprehensive and timely understanding of customers. The […]

The post Intro to Machine Learning in Customer Insights appeared first on Microsoft Research.

]]>
Authors:
Tommy Guy (opens in new tab), Sally Kellaway, Zachary Cook, Julie Koesmarno

Microsoft Dynamics 365 Customer Insights (opens in new tab) accelerates time to value with Machine Learning-based predictions covering Product recommendations, Churn risk, Sentiment analysis and Customer lifetime value scenarios. These features were developed using vast data sets and advanced analytics to provide a comprehensive and timely understanding of customers. The complexity of these artificial intelligence features also introduces new questions about the scope and meaning of the results they produce. This article introduces some key concepts in Customer Insights’ Machine Learning capabilities. Our goal is to demystify Machine Learning and help Customer Insights users gain more confidence in the predictions they can generate with these features.

Machine Learning is a Generalization of Patterns in Data
For sellers and marketers, growing a business requires a wealth of domain knowledge. This expertise is necessary to identify campaigns and activities that will help drive revenue. One activity might be to identify products with a high purchase rate to recommend to customers in marketing campaigns. For instance, Cameron is a marketing specialist for Contoso Coffee. Cameron's experience may indicate that people who buy a new grinder are likely to also buy whole bean coffee. Cameron has also found that these customers are less likely to purchase ground coffee in the future.

Simple rules like this are called heuristics. Heuristics don’t scale well because each rule created requires the domain expert (in this case, Cameron) to encode their personal experience and direct observations from the data into the rule. Human-generated heuristics may encode biases (e.g., Cameron might have a subconscious preference for whole bean coffee and may look for confirming evidence) and may also suffer from the human expert’s limited breadth of experience (e.g., Cameron might have only worked for Contoso for 3 months).

Nevertheless, formulating good human-generated heuristics and good Machine Learning generated predictions share the same methods for reducing bias:

  1. Cameron should check their intuition by assessing whether the patterns exist across longer time periods.
  2. When Cameron recommends whole bean coffee, they should be able to explain why they made that recommendation.
  3. Cameron should continue to reassess and re-validate their intuition to make sure the heuristic still holds. For instance, a shift to WFH could change coffee consumption habits in ways that are hard to predict.

What are the components of a Machine Learning Prediction?
Like all Machine Learning products, the product recommendation model (opens in new tab) offered in Customer Insights improves on human-generated heuristic rules. It has a few key components:

  1. A business objective, which is the thing we are trying to optimize. If we are trying to predict future purchases by customers, the objective is to identify products or services that have the highest chance of being purchased.
  2. An algorithm, which is the ‘program’ that will consume input data and make a prediction. Machine Learning algorithms contain variables that can be optimized during training to produce the most accurate predictions. The values of variables that optimize the predictions are called a model. For example, every Customer Insights product recommendation model starts with the same algorithm, but the model is trained to give optimal recommendations based on the product set and the purchase patterns of the dataset that the user inputs. We train machine learning models by “replaying history” to see if the model accurately predicts what happened in the data and update it to make better predictions. A recommendation algorithm may train a model by taking every customer’s purchase history and using all previously purchased items to predict the last item that each customer purchased.
  3. Machine Learning algorithms have features that describe aspects of the data. As an example, the product recommendation algorithm may create features by categorizing items in our catalogue (ground vs whole bean coffee, or beans vs grinders vs espresso machines). Other features may be manufacturers, colors, etc. Features are important for Machine Learning algorithms because they represent facts about input data that can be used to generate the predictions. Some features are constructed by the algorithm (as opposed to just extracted directly from the data). For instance, the Customer Lifetime Value prediction model computes the time between purchases from the total list of transactions because the frequency of purchase may be an important predictor of lifetime value in some scenarios.


Figure 1- High level workflow from business objective to model predictions

How do Machine Learning Predictions differ from Human Generated Rules?
Different models have different components and details that can be tuned to help generate a wide array of business predictions. We’ll dive into these details in future blog posts and interviews. In this introduction, we’ll discuss two of the key differences between Machine Learning model and human generated rules.

  1. Machine Learning can consume vastly more data than a human can read and understand. This makes it impossible for a human to fully simulate every possible decision the model will make (e.g., it is impossible to list all possible categories of items that customers may have previously purchased for all of time). In fact, good models can generalize well to inputs that never occurred in the training data. In Contoso Coffee’s transaction history, no human has ever bought a specific coffee grinder and French press together, but our product recommendation algorithm should be able to generalize that the user may be interested in an electric kettle. However, if we cannot “check the model’s homework”, this can make it more difficult to know if we trust a Machine Learning model. Below, we discuss how Customer Insights validates our prediction models.
  2. Machine Learning algorithms are only successful if the dataset contains features that describe the incoming data. In the example above, Cameron used their past knowledge to recommend whole bean coffee instead of ground coffee. Cameron implicitly used features of those products: they understood that coffee grinders are only useful when the coffee isn’t already ground. Unless we include useful features like product category in our training data, the Machine Learning algorithm can’t use that ‘fact’ to learn the best model. Unless “ground vs whole” is a feature on bean products, a model couldn’t learn to suggest one over the other. Our predictions will only be as good as the data used to train the model.

Is My Model Trustworthy?
Customer Insights applies several techniques to help you trust a model in the process of making business decisions. This section describes some of the quality and performance testing that Machine Learning engineers conduct when they build a new model. Stay tuned for more details about them in subsequent posts.

Is my model “overfitting” to the training data?
Overfitting occurs when a Machine Learning model works well on its training data but produces poor results when it is applied to other data. Consider the 8 user purchase logs below. Three customers buy a kettle, and of those, two buy a French Press. It's reasonable to suggest a French Press to anyone who buys a kettle. But one customer (Purchase 5) bought a coffee machine and a kettle. If the model recommends a kettle for Purchase 1, it's likely overfitting based on one observation.


Figure 2- 8 purchase logs from Contoso Coffee transactions.

Customer Insights uses an automated procedure called cross validation (opens in new tab) to detect overfitting during training. During model training, the algorithm divides the training dataset into 10 subsets. Then, it trains 10 models that each use 9/10 of the data. Each model is used to predict the remaining 1/10 of the data, also called the holdout set. If the predictions on the holdout set are accurate, then we know we can trust the final model to work on new data in the future. If the algorithm overfits, then Machine Learning engineers will modify the algorithm or its features to generalize more effectively.
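
A hedged sketch of the same idea with scikit-learn (the estimator and data are placeholders for the actual Customer Insights models):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0)

# Train on 9/10 of the data and score the held-out 1/10, ten times over.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean(), scores.std())  # consistent scores across folds suggest the model generalizes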

How do you know what a model uses to make decisions?
Cameron can clearly explain why they recommend whole bean coffee to grinder buyers. It can be more challenging to understand what feature a Machine Learning model uses to make predictions.

Machine Learning engineers talk about feature importance when explaining how a model works. A feature’s importance is the amount of improvement in the model’s accuracy that we gain from including that one feature. So, a feature has no importance if removing it from the data set has no impact on the model’s decisions. A feature has high importance if removing it from the data makes the model significantly worse.
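
One standard way to estimate this kind of importance in practice is permutation importance, shown here as a generic sketch (not necessarily the exact method used by Customer Insights):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time; the drop in accuracy approximates that feature's importance.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")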


Figure 3- Feature importance table that is available with Business-to-Business Churn prediction. High impact features increase the risk of a business account churning, with the % impact listed as the importance of that one feature.

The Customer Insights Business-to-Business Churn model generates detailed information about the importance of the features used to generate its predictions. In this example, features like Customer Service support activities were important indicators of high churn, as were customer traits like which city the customer lives in. Stay tuned for a deeper dive into feature importance in future blog posts.

Machine Learning in Customer Insights
In Customer Insights, you can use Machine Learning to identify advanced patterns across comprehensive data sets. It's important to understand how those patterns emerge: algorithms examine historical data to identify patterns that can be used on new data in the future. Customer Insights prediction features are built with industry best practices to ensure that models are accurate and trustworthy. You can help improve the accuracy of the predictions you generate by providing training data with features that help the model identify trends and patterns for your business.

In future posts, we will share deep dives about how we create accurate and trustworthy Machine Learning models in Customer Insights and about how you can help improve the predictions you generate with those models.

The post Intro to Machine Learning in Customer Insights appeared first on Microsoft Research.

]]>