{"id":1142208,"date":"2025-06-23T09:35:06","date_gmt":"2025-06-23T16:35:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1142208"},"modified":"2025-07-08T15:58:04","modified_gmt":"2025-07-08T22:58:04","slug":"learning-from-other-domains-to-advance-ai-evaluation-and-testing","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/learning-from-other-domains-to-advance-ai-evaluation-and-testing\/","title":{"rendered":"Learning from other domains to advance AI evaluation and testing"},"content":{"rendered":"\n
\"Illustrated<\/figure>\n\n\n\n

As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?  <\/p>\n\n\n\n

Recent research and reports from Microsoft<\/a>, the UK AI Security Institute (opens in new tab)<\/span><\/a>, The New York Times<\/em> (opens in new tab)<\/span><\/a>, and MIT Technology Review<\/em> (opens in new tab)<\/span><\/a> <\/em>have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report<\/em> (opens in new tab)<\/span><\/a> (2025) and the Singapore Consensus<\/em> (opens in new tab)<\/span><\/a> (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust. <\/p>\n\n\n\n

Today, we\u2019re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry<\/a><\/em>, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan<\/a> speaks with academic experts in genome editing<\/a>, cybersecurity<\/a>, pharmaceuticals<\/a>, and medical devices<\/a> to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.<\/p>\n\n\n\n

We\u2019re also sharing written case studies from experts, along with top-level lessons we\u2019re applying to AI. At the close of the podcast series, we\u2019ll offer Microsoft\u2019s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation. <\/p>\n\n\n\n

Lessons from eight case studies <\/h2>\n\n\n\n

Our research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft\u2019s Office of Responsible AI<\/a> gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI<\/em> (opens in new tab)<\/span><\/a>, <\/em>in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past. <\/p>\n\n\n\n

While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.  <\/p>\n\n\n\n

Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks, which must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices carry lasting weight: often reflecting the \u201cDNA\u201d of the historical moment in which they\u2019re made, as cybersecurity expert Stewart Baker described it, these choices are difficult to scale back or undo later. <\/p>\n\n\n\n

Strict, pre-deployment testing regimes\u2014such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals\u2014offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.  <\/p>\n\n\n\n

In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment\u2014such as cybersecurity and bank stress testing\u2014rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.  <\/p>\n\n\n\n

Moreover, in pharmaceuticals, where such interdependencies are also at play despite the emphasis on pre-deployment testing, experts highlighted a potential trade-off: heavy reliance on pre-market evaluation can come at the expense of post-market monitoring of downstream risks and efficacy. <\/p>\n\n\n\n

These variations in approaches across domains\u2014stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors\u2014also inform takeaways for AI.<\/p>\n\n\n\n

Applying risk evaluation and governance lessons to AI <\/h2>\n\n\n\n

While no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer instructive insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.  <\/p>\n\n\n\n

Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.  <\/p>\n\n\n\n

These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including: <\/p>\n\n\n\n

    \n
  1. Rigor <\/strong>in defining what is being examined and why it matters. This requires detailed specification of what is being measured<\/a> and understanding how the deployment context may affect outcomes.<\/li>\n\n\n\n
  2. Standardization <\/strong>of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency. <\/li>\n\n\n\n
  3. Interpretability <\/strong>of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results\u2014while remaining aware of their limitations. <\/li>\n<\/ol>\n\n\n\n

    Toward stronger foundations for AI testing <\/h2>\n\n\n\n

    Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability\u2014and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.  <\/p>\n\n\n\n

    One lesson from other general-purpose technologies is that this foundational work must be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance of system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop between evaluations of AI models and systems could not only accelerate progress on methodological challenges but also clarify which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at which points in the AI development and deployment lifecycle.<\/p>\n\n\n\n

    Acknowledgements <\/h2>\n\n\n\n

    We would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Ger\u00f3nimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.  <\/p>\n\n\n\n

    Case studies <\/h2>\n\n\n\n

    Civil aviation:<\/strong> Testing in Aircraft Design and Manufacturing<\/a><\/em>, by Paul Alp <\/p>\n\n\n\n

    Cybersecurity:<\/strong> Cybersecurity Standards and Testing\u2014Lessons for AI Safety and Security<\/a><\/em>, by Stewart Baker <\/p>\n\n\n\n

    Financial services (bank stress testing):<\/strong> The Evolving Use of Bank Stress Tests<\/a><\/em>, by Kathryn Judge <\/p>\n\n\n\n

    Genome editing:<\/strong> Governance of Genome Editing in Human Therapeutics and Agricultural Applications<\/a><\/em>, by Alta Charo and Andy Greenfield <\/p>\n\n\n\n

    Medical devices:<\/strong> Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance<\/a><\/em>, <\/em>by Mateo Aboy and Timo Minssen <\/p>\n\n\n\n

    Nanoscience:<\/strong> The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation<\/a><\/em>, by Jennifer Dionne <\/p>\n\n\n\n

    Nuclear energy:<\/strong> Testing in the Nuclear Industry<\/a><\/em>, by Pablo Cantero and Ger\u00f3nimo Poletto Antonacci <\/p>\n\n\n\n

    Pharmaceuticals:<\/strong> The History and Evolution of Testing in Pharmaceutical Regulation<\/a><\/em>, by Daniel Benamouzig and Daniel Carpenter<\/p>\n","protected":false},"excerpt":{"rendered":"

    As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the […]<\/p>\n","protected":false},"author":43868,"featured_media":1143007,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Amanda Craig Deckard","user_id":"43899"},{"type":"user_nicename","value":"Chad Atalla","user_id":"40249"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1142208","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Amanda Craig Deckard","user_id":43899,"display_name":"Amanda Craig Deckard","author_link":"Amanda Craig Deckard<\/a>","is_active":false,"last_first":"Craig Deckard, Amanda","people_section":0,"alias":"amcraig"},{"type":"user_nicename","value":"Chad Atalla","user_id":40249,"display_name":"Chad Atalla","author_link":"Chad Atalla<\/a>","is_active":false,"last_first":"Atalla, Chad","people_section":0,"alias":"chatalla"}],"msr_type":"Post","featured_image_thumbnail":"\"Illustrated","byline":"Amanda Craig Deckard<\/a> and Chad Atalla<\/a>","formattedDate":"June 23, 2025","formattedExcerpt":"As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1142208","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43868"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1142208"}],"version-history":[{"count":24,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1142208\/revisions"}],"predecessor-version":[{"id":1143008,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1142208\/revisions\/1143008"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1143007"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1142208"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1142208"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1142208"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1142208"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1142208"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1142208"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1142208"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1142208"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1142208"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1142208"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1142208"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}