{"id":971043,"date":"2023-09-27T20:39:09","date_gmt":"2023-09-28T03:39:09","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=971043"},"modified":"2025-03-06T21:32:16","modified_gmt":"2025-03-07T05:32:16","slug":"how-to-evaluate-llms-a-complete-metric-framework","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/how-to-evaluate-llms-a-complete-metric-framework\/","title":{"rendered":"How to Evaluate LLMs: A Complete Metric Framework"},"content":{"rendered":"\n
<p>Over the past year, excitement around Large Language Models (LLMs) has skyrocketed. With ChatGPT and Bing Chat, we saw LLMs approach human-level performance in everything from standardized exams to generative art. However, many of these LLM-based features are new and carry many unknowns, and therefore require careful release to preserve privacy and social responsibility. While offline evaluation is suitable for the early development of features, it cannot assess how model changes benefit or degrade the user experience in production. In fact, multiple explorations of GPT-4's capabilities suggest that "<em>the machine learning community needs to move beyond classical benchmarking via structured datasets and tasks, and that the evaluation of the capabilities and cognitive abilities of those new models have become much closer in essence to the task of evaluating those of a human rather than those of a narrow AI model</em>" [1]. <strong>Measuring LLM performance on user traffic in real product scenarios is essential to evaluate these human-like abilities and to guarantee a safe and valuable experience for the end user.</strong> This applies beyond the initial deployment of a feature: continuous evaluation of features as they are developed provides early insight into regressions or negative user experiences while also informing design decisions.</p>

<p>At Microsoft, the Experimentation Platform (ExP) has worked closely with multiple teams to launch and evaluate LLM products over the past several months. We have learned and developed best practices for designing A/B tests and metrics that evaluate such features accurately and holistically. In this article, we share the standard set of metrics leveraged by these teams, focusing on estimating costs, assessing customer risk, and quantifying the added user value. These metrics can be computed directly for any feature that uses OpenAI models and logs their API responses.</p>

<h2>GPU Utilization</h2>

<p>To estimate the usage cost of an LLM, we measure its GPU utilization. The main unit we use for measurement is the <strong>token</strong>. Tokens are pieces of words used for natural language processing. For OpenAI models, 1 token is approximately 4 characters or 0.75 words of English text. Prompts passed to the LLM are tokenized (prompt tokens), and the LLM generates words that also get tokenized (completion tokens). LLMs output one token per iteration or forward pass, so the number of forward passes required for a response is equal to the number of completion tokens.</p>

<p><em>We use the following primary utilization metrics; please check the appendix for a full list.</em></p>

<h2>Responsible AI</h2>

<p>As LLMs get used at large scale, it is critical to measure and detect any Responsible AI (RAI) issues that arise. Azure OpenAI (AOAI) provides solutions to evaluate your LLM-based features and apps on multiple dimensions of quality, safety, and performance. Teams leverage those evaluation methods before, during, and after deployment to minimize negative user experiences and manage customer risk.</p>

<p>Moreover, the Azure OpenAI content filtering system captures and blocks some prompts and responses that have RAI issues. It also produces annotations and properties in the Azure OpenAI API that we use to compute the following metrics. The annotations can further be used to provide statistics for each filtering category (e.g., to what extent certain filtrations have happened).</p>

<h2>Performance Metrics</h2>

<p>As with any feature, measuring performance and latency is essential to ensure that the user is getting the intended value in a timely and frictionless manner. LLM interactions have multiple layers, hence tracking and measuring latency at each layer is critical. If there are any orchestrators or added components between the LLM and the final rendering of the content, we measure the latency of each component in the full workflow as well.</p>

<p>We use the following metrics to measure performance:</p>

<h2>Utility Metrics</h2>

<p>LLM features have the potential to significantly improve the user experience; however, they are expensive and can impact the performance of the product. Hence, it is critical to measure the user value they add in order to justify the added costs. While a product-level utility metric [2] functions as an Overall Evaluation Criterion (OEC) to evaluate any feature (LLM-based or otherwise), we also measure usage of and engagement with the LLM features directly to isolate their impact on user utility.</p>

<p>Below we share the categories of metrics we measure. For a full list of the metrics, check the appendix.</p>

<h3>User Engagement and Satisfaction</h3>

<p>In this category, we measure how often the user engages with the LLM features, the quality of those interactions, and how likely they are to use them in the future.</p>

<figure><figcaption>Some stages (e.g., editing the response) are not applicable to all scenarios (e.g., chat).</figcaption></figure>

<h3>Increase in Productivity for Collaboration Scenarios</h3>

<p>For scenarios where content can be created with AI and then consumed by users, we also recommend measuring any increase or improvement in productivity, on both the creation and the consumption side. Such metrics measure the value added beyond an individual user when the AI-generated content is used in a collaboration setting.</p>

<h2>Data Requirements</h2>

<p>To compute the metrics, the product needs to collect the required properties from the OpenAI API response. Moreover, we recommend collecting the end-user ID from the product's telemetry to pass to the API.</p>

<p>For an LLM feature that can modify a user's text directly, we add telemetry to differentiate user edits from machine (LLM) edits. Otherwise, it will be hard to measure the reduction in user-added characters or text when the LLM auto-completes the content.</p>

<h2>Running A/B Tests</h2>

<p>A/B testing is the gold standard for causally measuring the impact of any change to the product. As mentioned in the introduction, this is even more critical for LLM features, both at launch time and for subsequent improvements. The metrics shared above are then used to evaluate the changes and to trade off costs against user value.</p>

<p>As you embark on the journey of launching an LLM-powered feature and innovating further, we recommend running the following types of experiments at launch and post-launch.</p>

<h3>Launch an LLM Feature</h3>

<p>Ensure that the feature at launch is performant and reliable, increases productivity, and makes the right cost-versus-benefit tradeoffs.</p>

<h3>Post Launch</h3>

<p>Continue to innovate and optimize the feature to quickly address new customer needs through prompt optimization, newer models, and UX improvements.</p>

<h2>Summary</h2>

<p>LLMs can be a great tool for building features that add user value and increase satisfaction with the product. However, properly testing and evaluating them is critical to a safe release and to the value they add. In this blog post, we shared a complete metric framework to evaluate all aspects of LLM-based features: costs, performance, RAI aspects, and user utility. These metrics are applicable to any LLM and can be built directly from telemetry collected from AOAI models. We also described the experiment designs used at Microsoft to evaluate features at release time and continuously through any subsequent change.</p>

<h2>Acknowledgements</h2>

<p>Many thanks to our colleagues in Azure OpenAI, particularly Sanjay Ramanujan, for all their input on the API responses, as well as to ExP's experimentation partners for testing and using the metrics.</p>

<p><em>Widad Machmouchi, Somit Gupta</em> – Experimentation Platform</p>

<h2>References</h2>

<p>[1] S. Bubeck et al., "Sparks of Artificial General Intelligence: Early experiments with GPT-4," 2023. https://doi.org/10.48550/arXiv.2303.12712</p>

<p>[2] W. Machmouchi, A. H. Awadallah, I. Zitouni, and G. Buscher, "Beyond success rate: Utility as a search quality metric for online experiments," in Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2017. doi: 10.1145/3132847.3132850</p>
<h2>Appendix</h2>
<h3>GPU Utilization Metrics</h3>
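As a minimal sketch of how such utilization metrics can be aggregated from logged responses: the `usage` fields (`prompt_tokens`, `completion_tokens`, `total_tokens`) are part of the real OpenAI API response, while the surrounding log schema and metric names here are illustrative assumptions, not the exact metrics used by ExP.

```python
# Sketch: aggregating token-based utilization metrics from logged
# OpenAI API responses. The "usage" block is part of the real API
# response; the log records below are hypothetical examples.
from statistics import mean

def utilization_metrics(logged_responses):
    """Aggregate per-request token counts into simple utilization metrics."""
    prompt = [r["usage"]["prompt_tokens"] for r in logged_responses]
    completion = [r["usage"]["completion_tokens"] for r in logged_responses]
    return {
        "requests": len(logged_responses),
        "avg_prompt_tokens": mean(prompt),
        # One forward pass per completion token, so this also tracks
        # the average number of forward passes per response.
        "avg_completion_tokens": mean(completion),
        "total_tokens": sum(prompt) + sum(completion),
    }

logs = [
    {"usage": {"prompt_tokens": 120, "completion_tokens": 40, "total_tokens": 160}},
    {"usage": {"prompt_tokens": 80, "completion_tokens": 60, "total_tokens": 140}},
]
print(utilization_metrics(logs))
# → {'requests': 2, 'avg_prompt_tokens': 100, 'avg_completion_tokens': 50, 'total_tokens': 300}
```

In an A/B test, these aggregates would be computed per variant so that a prompt change that shortens completions (and thus reduces forward passes) shows up directly as a cost reduction.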
<h3>Utility Metrics</h3>
<h4>User Engagement and Satisfaction</h4>
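A minimal sketch of how engagement metrics of this kind can be computed from product telemetry. The event names (`suggestion_shown`, `suggestion_accepted`, `response_edited`) are hypothetical placeholders for a product's own telemetry schema, not a real API.

```python
# Sketch: per-feature engagement metrics from (hypothetical) telemetry
# events, recorded as (user_id, event_name) pairs.
from collections import Counter

def engagement_metrics(events):
    """Compute simple engagement rates over LLM-feature telemetry events."""
    counts = Counter(name for _, name in events)
    shown = counts["suggestion_shown"]
    return {
        # How many distinct users interacted with the feature at all.
        "active_users": len({user for user, _ in events}),
        # Fraction of shown suggestions the user kept.
        "acceptance_rate": counts["suggestion_accepted"] / shown if shown else 0.0,
        # Fraction of shown suggestions the user edited afterwards.
        "edit_rate": counts["response_edited"] / shown if shown else 0.0,
    }

events = [
    ("u1", "suggestion_shown"), ("u1", "suggestion_accepted"),
    ("u2", "suggestion_shown"), ("u2", "suggestion_shown"),
    ("u2", "suggestion_accepted"), ("u2", "response_edited"),
]
print(engagement_metrics(events))
```

Because some stages (e.g., editing the response) do not apply to all scenarios, a real implementation would compute only the rates whose events exist for the given feature.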
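To illustrate the A/B evaluation described above, here is a minimal sketch of comparing a per-user metric between control and treatment using Welch's t-statistic. The metric values are invented for illustration; a real experimentation platform would add proper p-values, confidence intervals, and variance-reduction techniques.

```python
# Sketch: comparing a per-user engagement metric between A/B variants
# with Welch's t-statistic (stdlib only). Sample values are illustrative.
from math import sqrt
from statistics import mean, variance

def welch_t(control, treatment):
    """Welch's t-statistic for the difference in means (treatment - control)."""
    n_c, n_t = len(control), len(treatment)
    se = sqrt(variance(control) / n_c + variance(treatment) / n_t)
    return (mean(treatment) - mean(control)) / se

# Hypothetical per-user acceptance rates in each variant.
control = [0.40, 0.35, 0.50, 0.45, 0.42]
treatment = [0.55, 0.60, 0.52, 0.58, 0.50]

delta = mean(treatment) - mean(control)
print(f"delta = {delta:.3f}, t = {welch_t(control, treatment):.2f}")
```

A large positive t-value would indicate that the treatment (e.g., a new prompt or model) genuinely improved the metric rather than the difference arising by chance.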