{"id":1138048,"date":"2025-05-12T09:00:00","date_gmt":"2025-05-12T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1138048"},"modified":"2025-05-20T13:43:52","modified_gmt":"2025-05-20T20:43:52","slug":"predicting-and-explaining-ai-model-performance-a-new-approach-to-evaluation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/predicting-and-explaining-ai-model-performance-a-new-approach-to-evaluation\/","title":{"rendered":"Predicting and explaining AI model performance: A new approach to evaluation"},"content":{"rendered":"\n
\"The<\/figure>\n\n\n\n

With support from the Accelerating Foundation Models Research (AFMR) grant program, a team of researchers from Microsoft and collaborating institutions has developed an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.

In the paper "General Scales Unlock AI Evaluation with Explanatory and Predictive Power," they introduce a methodology that goes beyond measuring overall accuracy. It assesses the knowledge and cognitive abilities a task requires and evaluates them against the model's capabilities.

ADeLe: An ability-based approach to task evaluation

The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for an AI model by applying measurement scales for 18 types of cognitive and knowledge-based abilities. This difficulty rating is based on a detailed rubric, originally developed for human tasks and shown to work reliably when applied by AI models.

By comparing what a task requires with what a model can do, ADeLe generates an ability profile that not only predicts performance but also explains why a model is likely to succeed or fail, linking outcomes to specific strengths or limitations.

The 18 scales reflect core cognitive abilities (e.g., attention, reasoning), knowledge areas (e.g., natural or social sciences), and other task-related factors (e.g., prevalence of the task on the internet). Each task is rated from 0 to 5 based on how much it draws on a given ability. For example, a simple math question might score 1 on formal knowledge, while one requiring advanced expertise could score 5. Figure 1 illustrates how the full process works, from rating task requirements to generating ability profiles.

\"Figure
Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems.<\/figcaption><\/figure>\n\n\n\n
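To make the demand-versus-ability comparison concrete, here is a minimal sketch in Python. The scale names, numeric values, and the simple "any shortfall means failure" rule are illustrative assumptions, not the paper's actual method:

```python
# Illustrative sketch only: the values and the comparison rule are assumptions,
# not the paper's rubric output or prediction procedure.

# Demand ratings (0-5) for a hypothetical task across a few of the 18 scales.
task_demands = {
    "attention": 2,
    "logical_reasoning": 4,
    "natural_sciences_knowledge": 5,
}

# A model's ability profile: the demand level it can typically handle per scale.
model_abilities = {
    "attention": 3,
    "logical_reasoning": 3,
    "natural_sciences_knowledge": 4,
}

def likely_to_succeed(demands, abilities):
    """Naive rule: expect failure if any required ability level
    exceeds what the model can handle on that scale."""
    shortfalls = {
        scale: level - abilities.get(scale, 0)
        for scale, level in demands.items()
        if level > abilities.get(scale, 0)
    }
    return len(shortfalls) == 0, shortfalls

ok, gaps = likely_to_succeed(task_demands, model_abilities)
print("predicted success:", ok, "| shortfalls:", gaps)
```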

To develop this system, the team analyzed 16,000 examples spanning 63 tasks drawn from 20 AI benchmarks, creating a unified measurement approach that works across a wide range of tasks. The paper details how ratings across the 18 general scales explain model success or failure and predict performance on new tasks in both familiar and unfamiliar settings.
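The prediction step can be pictured, under stated assumptions, as a supervised model that maps a task's 18 demand ratings to the probability that a given system answers correctly. The sketch below uses synthetic data and an off-the-shelf logistic regression; it does not reproduce the paper's actual assessor or training setup:

```python
# Sketch: predicting a model's success on new tasks from demand ratings.
# Synthetic data and logistic regression are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake training set: 1,000 annotated examples x 18 demand ratings (0-5),
# where success is more likely when total demand is low.
X_train = rng.integers(0, 6, size=(1000, 18))
y_train = (X_train.sum(axis=1) + rng.normal(0, 5, 1000) < 40).astype(int)

assessor = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict success probability for a new, unseen task's demand profile.
new_task = rng.integers(0, 6, size=(1, 18))
print("predicted probability of success:", assessor.predict_proba(new_task)[0, 1])
```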

Evaluation results

Using ADeLe, the team evaluated 20 popular AI benchmarks and uncovered three key findings: 1) current AI benchmarks have measurement limitations; 2) AI models show distinct patterns of strengths and weaknesses across different capabilities; and 3) ADeLe provides accurate predictions of whether AI systems will succeed or fail on a new task.

1. Revealing hidden flaws in AI testing methods

Many popular AI tests either don't measure what they claim to or cover only a limited range of difficulty levels. For example, the Civil Service Examination benchmark is meant to test logical reasoning, but it also requires other abilities, such as specialized knowledge and metacognition. Similarly, TimeQA, designed to test temporal reasoning, includes only medium-difficulty questions, missing both simple and complex challenges.

2. Creating detailed AI ability profiles

Using the 0–5 rating for each ability, the team created comprehensive ability profiles of 15 LLMs. For each of the 18 abilities measured, they plotted "subject characteristic curves" to show how a model's success rate changes with task difficulty.

They then calculated a score for each ability (the difficulty level at which a model has a 50% chance of success) and used these results to generate radial plots showing each model's strengths and weaknesses across the different scales and levels, illustrated in Figure 2.

\"Figure
Figure 2. Ability profiles for the 15 LLMs evaluated. <\/figcaption><\/figure>\n\n\n\n
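A rough sketch of how one such ability score could be computed: fit a logistic curve of success against difficulty for a single ability, then take the difficulty at which the fitted probability crosses 50%. The simulated data and use of scikit-learn are assumptions for illustration only:

```python
# Sketch: estimating one ability score as the difficulty level at which a
# model's success probability crosses 50%. Simulated data; the paper's
# curve-fitting details are not reproduced here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated results for one ability: task difficulty (0-5) and whether the
# model answered correctly (success fades as difficulty rises).
difficulty = rng.uniform(0, 5, 500)
success = (rng.random(500) < 1 / (1 + np.exp(difficulty - 2.5))).astype(int)

curve = LogisticRegression().fit(difficulty.reshape(-1, 1), success)

# The fitted probability is 50% where intercept + coef * difficulty = 0.
ability_score = -curve.intercept_[0] / curve.coef_[0, 0]
print(f"estimated ability score (difficulty at 50% success): {ability_score:.2f}")
```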

This analysis revealed the following: