{"id":971043,"date":"2023-09-27T20:39:09","date_gmt":"2023-09-28T03:39:09","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=971043"},"modified":"2025-03-06T21:32:16","modified_gmt":"2025-03-07T05:32:16","slug":"how-to-evaluate-llms-a-complete-metric-framework","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/how-to-evaluate-llms-a-complete-metric-framework\/","title":{"rendered":"How to Evaluate LLMs: A Complete Metric Framework"},"content":{"rendered":"\n
<p>Over the past year, excitement around Large Language Models (LLMs) has skyrocketed. With ChatGPT and Bing Chat, we saw LLMs approach human-level performance in everything from standardized exams to generative art. However, many of these LLM-based features are new and carry many unknowns, and therefore require careful release to preserve privacy and social responsibility. While offline evaluation is suitable for the early development of features, it cannot assess how model changes benefit or degrade the user experience in production. In fact, multiple explorations of GPT-4's capabilities suggest that "<em>the machine learning community needs to move beyond classical benchmarking via structured datasets and tasks, and that the evaluation of the capabilities and cognitive abilities of those new models have become much closer in essence to the task of evaluating those of a human rather than those of a narrow AI model</em>" [1]. <strong>Measuring LLM performance on user traffic in real product scenarios is essential to evaluate these human-like abilities and to guarantee a safe and valuable experience for the end user.</strong> This applies beyond the initial deployment of a feature: continuous evaluation of features as they are developed provides early insight into regressions or negative user experiences while also informing design decisions.</p>

<p>At Microsoft, the Experimentation Platform (ExP) has worked closely with multiple teams to launch and evaluate LLM products over the past several months. We have learned and developed best practices for designing A/B tests and metrics that evaluate such features accurately and holistically. In this article, we share the standard set of metrics leveraged by these teams, focusing on estimating costs, assessing customer risk, and quantifying the added user value. These metrics can be computed directly for any feature that uses OpenAI models and logs their API responses.</p>

<h2>GPU Utilization</h2>

<p>To estimate the usage cost of an LLM, we measure its GPU utilization. The main unit we use for measurement is the <strong>token</strong>. Tokens are pieces of words used for natural language processing. For OpenAI models, 1 token is approximately 4 characters or 0.75 words of English text. Prompts passed to the LLM are tokenized (prompt tokens), and the LLM generates words that also get tokenized (completion tokens). LLMs output one token per iteration or forward pass, so the number of forward passes required for a response is equal to the number of completion tokens.</p>

<p><em>We use the following primary utilization metrics; please check the appendix for a full list.</em></p>

<h2>Responsible AI</h2>

<p>As LLMs get used at large scale, it is critical to measure and detect any Responsible AI (RAI) issues that arise. Azure OpenAI (AOAI) provides solutions to evaluate your LLM-based features and apps on multiple dimensions of quality, safety, and performance. Teams leverage those evaluation methods before, during, and after deployment to minimize negative user experiences and manage customer risk.</p>

<p>Moreover, the Azure OpenAI content filtering system captures and blocks some prompts and responses that have RAI issues. It also produces annotations and properties in the Azure OpenAI API that we use to compute the following metrics. The annotations can further be used to provide statistics for each filtering category (e.g., to what extent certain filtrations have happened).</p>

<h2>Performance Metrics</h2>

<p>As with any feature, measuring performance and latency is essential to ensure that the user is getting the intended value in a timely and frictionless manner. LLM interactions have multiple layers, hence tracking and measuring latency at each layer is critical. If there are any orchestrators or added components between the LLM and the final rendering of the content, we measure the latency of each component in the full workflow as well.</p>

<p>We use the following metrics to measure performance:</p>

<h2>Utility Metrics</h2>

<p>LLM features have the potential to significantly improve the user experience; however, they are expensive and can impact the performance of the product. Hence, it is critical to measure the user value they add in order to justify the added costs. While a product-level utility metric [2] functions as an Overall Evaluation Criterion (OEC) to evaluate any feature (LLM-based or otherwise), we also measure usage of and engagement with the LLM features directly to isolate their impact on user utility.</p>

<p>Below we share the categories of metrics we measure. For a full list of the metrics, check the appendix.</p>

<h3>User Engagement and Satisfaction</h3>

<p>In this category, we measure how often the user engages with the LLM features, the quality of those interactions, and how likely they are to use them in the future.</p>

<figure><figcaption>Some stages (e.g., editing the response) are not applicable to all scenarios (e.g., chat).</figcaption></figure>

<h3>Increase in Productivity for Collaboration Scenarios</h3>

<p>For scenarios where content can be created with AI and then consumed by users, we also recommend measuring any increase or improvement in productivity, on both the creation and the consumption side. Such metrics measure the value added beyond an individual user when the AI-generated content is used in a collaboration setting.</p>

<h2>Data Requirements</h2>

<p>To compute the metrics, the product needs to collect the required properties from the OpenAI API response. Moreover, we recommend collecting the end-user ID from the product's telemetry to pass to the API.</p>

<p>For an LLM feature that can modify a user's text directly, we add telemetry to differentiate user edits from machine (LLM) edits. Otherwise, it will be hard to measure the reduction in user-added characters or text when the LLM auto-completes the content.</p>

<h2>Running A/B Tests</h2>

<p>A/B testing is the gold standard for causally measuring the impact of any change to the product. As mentioned in the introduction, this is even more critical for LLM features, both at launch time and for subsequent improvements. The metrics shared above are then used to evaluate the changes and to trade off costs against user value.</p>

<p>As you embark on the journey of launching an LLM-powered feature and innovating further, we recommend running the following types of experiments at launch and post-launch.</p>

<h3>Launch an LLM Feature</h3>

<p>Ensure that the feature at launch is performant and reliable, increases productivity, and makes the right cost-versus-benefit tradeoffs.</p>

<h3>Post Launch</h3>

<p>Continue to innovate and optimize the feature to quickly address new customer needs through prompt optimization, newer models, and UX improvements.</p>

<h2>Summary</h2>

<p>LLMs can be a great tool for building features that add user value and increase satisfaction with the product. However, properly testing and evaluating them is critical to a safe release and to the value they add. In this blog post, we shared a complete metric framework to evaluate all aspects of LLM-based features: costs, performance, RAI aspects, and user utility. These metrics are applicable to any LLM and can be built directly from telemetry collected from AOAI models. We also described the experiment designs used at Microsoft to evaluate features at release time and continuously through any subsequent change.</p>

<h2>Acknowledgements</h2>

<p>Many thanks to our colleagues in Azure OpenAI, particularly Sanjay Ramanujan, for all their input on the API responses, as well as to ExP's experimentation partners for testing and using the metrics.</p>

<p><em>Widad Machmouchi, Somit Gupta</em> – Experimentation Platform</p>

<h2>References</h2>

<p>[1] S. Bubeck et al., "Sparks of Artificial General Intelligence: Early experiments with GPT-4," 2023. https://doi.org/10.48550/arXiv.2303.12712</p>

<p>[2] W. Machmouchi, A. H. Awadallah, I. Zitouni, and G. Buscher, "Beyond success rate: Utility as a search quality metric for online experiments," in Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2017. doi: 10.1145/3132847.3132850</p>
<h2>Appendix</h2>
<h3>GPU Utilization Metrics</h3>
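As a minimal sketch of how such utilization metrics can be aggregated from logged responses: the `usage` fields (`prompt_tokens`, `completion_tokens`, `total_tokens`) are part of the real OpenAI API response, while the surrounding log schema and metric names here are illustrative assumptions, not the exact metrics used by ExP.

```python
# Sketch: aggregating token-based utilization metrics from logged
# OpenAI API responses. The "usage" block is part of the real API
# response; the log records below are hypothetical examples.
from statistics import mean

def utilization_metrics(logged_responses):
    """Aggregate per-request token counts into simple utilization metrics."""
    prompt = [r["usage"]["prompt_tokens"] for r in logged_responses]
    completion = [r["usage"]["completion_tokens"] for r in logged_responses]
    return {
        "requests": len(logged_responses),
        "avg_prompt_tokens": mean(prompt),
        # One forward pass per completion token, so this also tracks
        # the average number of forward passes per response.
        "avg_completion_tokens": mean(completion),
        "total_tokens": sum(prompt) + sum(completion),
    }

logs = [
    {"usage": {"prompt_tokens": 120, "completion_tokens": 40, "total_tokens": 160}},
    {"usage": {"prompt_tokens": 80, "completion_tokens": 60, "total_tokens": 140}},
]
print(utilization_metrics(logs))
# → {'requests': 2, 'avg_prompt_tokens': 100, 'avg_completion_tokens': 50, 'total_tokens': 300}
```

In an A/B test, these aggregates would be computed per variant so that a prompt change that shortens completions (and thus reduces forward passes) shows up directly as a cost reduction.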
<h3>Utility Metrics</h3>
<h4>User Engagement and Satisfaction</h4>
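A minimal sketch of how engagement metrics of this kind can be computed from product telemetry. The event names (`suggestion_shown`, `suggestion_accepted`, `response_edited`) are hypothetical placeholders for a product's own telemetry schema, not a real API.

```python
# Sketch: per-feature engagement metrics from (hypothetical) telemetry
# events, recorded as (user_id, event_name) pairs.
from collections import Counter

def engagement_metrics(events):
    """Compute simple engagement rates over LLM-feature telemetry events."""
    counts = Counter(name for _, name in events)
    shown = counts["suggestion_shown"]
    return {
        # How many distinct users interacted with the feature at all.
        "active_users": len({user for user, _ in events}),
        # Fraction of shown suggestions the user kept.
        "acceptance_rate": counts["suggestion_accepted"] / shown if shown else 0.0,
        # Fraction of shown suggestions the user edited afterwards.
        "edit_rate": counts["response_edited"] / shown if shown else 0.0,
    }

events = [
    ("u1", "suggestion_shown"), ("u1", "suggestion_accepted"),
    ("u2", "suggestion_shown"), ("u2", "suggestion_shown"),
    ("u2", "suggestion_accepted"), ("u2", "response_edited"),
]
print(engagement_metrics(events))
```

Because some stages (e.g., editing the response) do not apply to all scenarios, a real implementation would compute only the rates whose events exist for the given feature.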
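To illustrate the A/B evaluation described above, here is a minimal sketch of comparing a per-user metric between control and treatment using Welch's t-statistic. The metric values are invented for illustration; a real experimentation platform would add proper p-values, confidence intervals, and variance-reduction techniques.

```python
# Sketch: comparing a per-user engagement metric between A/B variants
# with Welch's t-statistic (stdlib only). Sample values are illustrative.
from math import sqrt
from statistics import mean, variance

def welch_t(control, treatment):
    """Welch's t-statistic for the difference in means (treatment - control)."""
    n_c, n_t = len(control), len(treatment)
    se = sqrt(variance(control) / n_c + variance(treatment) / n_t)
    return (mean(treatment) - mean(control)) / se

# Hypothetical per-user acceptance rates in each variant.
control = [0.40, 0.35, 0.50, 0.45, 0.42]
treatment = [0.55, 0.60, 0.52, 0.58, 0.50]

delta = mean(treatment) - mean(control)
print(f"delta = {delta:.3f}, t = {welch_t(control, treatment):.2f}")
```

A large positive t-value would indicate that the treatment (e.g., a new prompt or model) genuinely improved the metric rather than the difference arising by chance.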