{"id":1141125,"date":"2025-06-23T09:40:10","date_gmt":"2025-06-23T16:40:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-story&p=1141125"},"modified":"2025-12-15T14:58:19","modified_gmt":"2025-12-15T22:58:19","slug":"ai-testing-and-evaluation-learnings-from-science-and-industry","status":"publish","type":"msr-story","link":"https:\/\/www.microsoft.com\/en-us\/research\/story\/ai-testing-and-evaluation-learnings-from-science-and-industry\/","title":{"rendered":"AI Testing and Evaluation: Learnings from Science and Industry"},"content":{"rendered":"\n
<\/div><\/span>
\n
\n
<\/div>\n\n\n\n

AI Testing and Evaluation: Learnings from Science and Industry<\/h1>\n\n\n\n
<\/div>\n<\/div>\n<\/div><\/div>\n\n\n\n\n\n
\n
\n
<\/div>\n\n\n\n
\n
<\/div>\n\n\n\n

Discover how Microsoft is learning from other domains to advance evaluation and testing as a pillar of AI governance.<\/h2>\n\n\n\n
<\/div>\n\n\n\n
<\/div>\n\n\n\n

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains\u2014from genome editing to cybersecurity\u2014to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, <\/em>hosted by Microsoft Research\u2019s Kathleen Sullivan<\/a>, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.<\/p>\n\n\n\n


\n\n\n\n

Episodes<\/h2>\n\n\n\n

Introducing \u2018AI Testing and Evaluation: Learnings from Science and Industry\u2019<\/h3>\n\n\n\n

Amanda Craig Deckard | June 23, 2025<\/p>\n\n\n\n

In the introductory episode of this new series, host Kathleen Sullivan and Senior Director Amanda Craig Deckard explore Microsoft\u2019s efforts to draw on the experience of other domains to help advance the role of AI testing and evaluation as a governance tool.<\/p>\n\n\n\n

\"Illustrated<\/a><\/figure>\n\n\n\n
\n
\n
\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPodcast<\/span>\n\t\t\tAI Testing and Evaluation: Learnings from Science and Industry<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tBlog<\/span>\n\t\t\tLearning from other domains to advance AI evaluation and testing<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Guest<\/h4>\n\n\n\n
\"illustration<\/figure>\n\n\n\n

Amanda Craig Deckard<\/a><\/strong>
Amanda Craig Deckard is senior director of public policy in Microsoft\u2019s Office of Responsible AI, where she leads efforts to strengthen AI governance as a foundation for trust and innovation.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\n<\/div>\n<\/div>\n\n\n\n
\n

Episode 1 | AI Testing and Evaluation: Learnings from genome editing <\/h3>\n\n\n\n

Alta Charo, Daniel Kluttz | June 30, 2025<\/p>\n\n\n\n

Bioethics and law expert Alta Charo explores the value of regulating technologies at the application level and the role of coordinated oversight in genome editing, while Microsoft GM Daniel Kluttz reflects on Charo\u2019s points, drawing parallels to AI governance.<\/p>\n\n\n\n

\n
\n
\"Outline<\/figure>\n<\/div>\n\n\n\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPodcast<\/span>\n\t\t\tAI Testing and Evaluation: Learnings from genome editing<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Guests<\/h4>\n\n\n\n
\"Alta<\/figure>\n\n\n\n

Alta Charo (opens in new tab)<\/span><\/a><\/strong>
Alta Charo, the Warren P. Knowles Professor Emerita of Law and Bioethics, is a biotechnology policy and ethics consultant who has been at the forefront of the field for decades.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\"Daniel<\/figure>\n\n\n\n

Daniel Kluttz<\/strong> (opens in new tab)<\/span><\/a>
Daniel Kluttz is a partner general manager in Microsoft\u2019s Office of Responsible AI, where he leads the group\u2019s Sensitive Uses and Emerging Technologies program.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\n<\/div>\n\n\n\n
\n

Episode 2 | AI Testing and Evaluation: Learnings from pharmaceuticals and medical devices<\/h3>\n\n\n\n

Daniel Carpenter, Timo Minssen, Chad Atalla | July 7, 2025<\/p>\n\n\n\n

Professors Daniel Carpenter and Timo Minssen explore evolving pharma and medical device regulation, including the role of clinical trials, while Microsoft applied scientist Chad Atalla shares where AI governance stakeholders might find inspiration in these fields.<\/p>\n\n\n\n

\n
\n
\"Illustrated<\/a><\/figure>\n<\/div>\n\n\n\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPodcast<\/span>\n\t\t\tAI Testing and Evaluation: Learnings from pharmaceuticals and medical devices<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Guests<\/h4>\n\n\n\n
\"Daniel<\/figure>\n\n\n\n

Daniel Carpenter<\/strong> (opens in new tab)<\/span><\/a>
Daniel Carpenter is the Allie S. Freed Professor of Government and chair of the Department of Government at Harvard. His research spans social and political science, including pharmaceutical regulation.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\"Timo<\/figure>\n\n\n\n

Timo Minssen<\/strong> (opens in new tab)<\/span><\/a>
Timo Minssen is a law professor at the University of Copenhagen, where he leads the Center for Advanced Studies in Bioscience Innovation Law. He specializes in legal aspects of biomedical innovation.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\"Chad<\/figure>\n\n\n\n

Chad Atalla<\/strong><\/a>
Chad Atalla is a senior applied scientist in Microsoft Research New York City’s Sociotechnical Alignment Center, where they contribute to responsible AI research and practical responsible AI solutions.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\n<\/div>\n\n\n\n
\n

Episode 3 | AI Testing and Evaluation: Learnings from cybersecurity <\/h3>\n\n\n\n

Ciaran Martin, Tori Westerhoff | July 14, 2025<\/p>\n\n\n\n

Drawing on his previous work as the UK\u2019s cybersecurity chief, Professor Ciaran Martin explores differentiated standards and public-private partnerships in cybersecurity, and Microsoft\u2019s Tori Westerhoff examines the insights through an AI red-teaming lens.<\/p>\n\n\n\n

\n
\n
\"Illustrated<\/a><\/figure>\n<\/div>\n\n\n\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPodcast<\/span>\n\t\t\tAI Testing and Evaluation: Learnings from cybersecurity<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Guests<\/h4>\n\n\n\n
\"Illustration<\/a><\/figure>\n\n\n\n

Ciaran Martin (opens in new tab)<\/span><\/a><\/strong>
Ciaran Martin is a professor of practice in the management of public organizations at the University of Oxford. Previously, he was the founding chief executive of the UK\u2019s National Cyber Security Centre.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\"Illustration<\/a><\/figure>\n\n\n\n

Tori Westerhoff (opens in new tab)<\/span><\/a><\/strong>
A principal director on the Microsoft AI Red Team, Tori Westerhoff leads AI security and safety red team operations and dangerous capability testing, directly informing company leadership.<\/p>\n\n\n\n

<\/div>\n\n\n\n
\n<\/div>\n\n\n\n
\n

Episode 4 | AI Testing and Evaluation: Reflections<\/h3>\n\n\n\n

Amanda Craig Deckard | July 21, 2025<\/p>\n\n\n\n

In the series finale, Amanda Craig Deckard returns to examine what Microsoft has learned about testing as a governance tool. She also explores the roles of rigor, standardization, and interpretability in testing and what\u2019s next for Microsoft\u2019s AI governance work.<\/p>\n\n\n\n

\n
\n
\"Illustrated<\/a><\/figure>\n<\/div>\n\n\n\n
\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPodcast<\/span>\n\t\t\tAI Testing and Evaluation: Reflections<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n
\n\t
\n\t\t
\n\t\t\t\t\t\tPUBLICATION<\/span>\n\t\t\tLearning from other domains to advance AI evaluation and testing<\/span> <\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Guests<\/h4>\n\n\n\n
\"Illustrated<\/figure>\n\n\n\n

Amanda Craig Deckard<\/a><\/strong>
Amanda Craig Deckard is senior director of public policy in Microsoft\u2019s Office of Responsible AI, where she leads efforts to strengthen AI governance as a foundation for trust and innovation.<\/p>\n\n\n\n

<\/div>\n\n\n\n
<\/div>\n\n\n\n
\n<\/div>\n<\/div>\n\n\n\n
<\/div>\n<\/div>\n\n\n\n
\n
\n\t\n\t
\n\t\t
\n\t\t\t

Series contributors<\/strong>: Neeltje Berger, Tetiana Bukhinska, David Celis Garcia, Matt Corwine, Amanda Craig Deckard, Kristina Dodge, Chris Duryee, Milan Gandhi, Ann Griffin, Alyssa Hughes, Gretchen Huizinga, Matthew McGinley, Amanda Melfi, Joe Plummer, Brenda Potts, Kathleen Sullivan, Amber Tingle, Kathleen Toohill, Craig Tuschoff, Sarah Wang, Brian Wesolowski, and Katie Zoller.<\/em><\/p>\n\n\n\n

Series launched on June 23, 2025<\/em><\/p>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n

<\/div>\n\n\n\n
\n
\n

Other resources<\/h3>\n<\/div>\n\n\n\n
\n
\n
\n

Responsible AI at Microsoft<\/a><\/p>\n\n\n\n

Global Governance:
Goals and Lessons for AI (opens in new tab)<\/span><\/a><\/p>\n<\/div>\n\n\n\n

\n

Microsoft Research Podcast<\/a><\/p>\n\n\n\n

Learning from other domains to advance AI evaluation and testing<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

<\/div>\n<\/article>\n\n\n\n\n\n

Frequently asked questions<\/h2>\n\n\n\n\n\n

AI governance refers to the frameworks, policies, best practices, and tools that guide the responsible development, deployment, and use of AI. At Microsoft, this includes working to ensure the alignment of AI systems with our Responsible AI Standard<\/a>, which we continue to build on as new AI capabilities, risks, and regulatory requirements emerge. Read more about Microsoft\u2019s internal approach to AI governance in our 2025 Responsible AI Transparency Report (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n\n\n

AI evaluations are structured ways to test how AI models and systems perform and where they could go wrong. Because this is a rapidly evolving field, there\u2019s no single agreed way to categorize these tests. Different methods are used depending on what is being tested and when it is tested.<\/p>\n\n\n\n

The International AI Safety Report 2025 (opens in new tab)<\/span><\/a>\u2014the world\u2019s first comprehensive synthesis of research on the capabilities and risks of advanced AI systems\u2014defines AI evaluations as \u201csystematic assessments of an AI system\u2019s performance, capabilities, vulnerabilities or potential impacts. Evaluations can include benchmarking, red-teaming and audits and can be conducted both before and after model deployment.\u201d<\/p>\n\n\n\n\n\n

Many of the aims of evaluating generative AI models and systems resemble those of evaluating traditional software, such as assessing performance and reliability. However, there is growing recognition that evaluating generative AI is more challenging than evaluating traditional machine learning systems. This is because generative AI systems accept a wide range of inputs, produce diverse outputs, support numerous use cases, and can have impacts on people and society that range from mundane to consequential. We explore these challenges in Part 1 of our 2025 white paper, Learning from Other Domains to Advance AI Evaluation and Testing<\/a><\/em>.<\/p>\n\n\n\n\n\n

Microsoft recognizes that governance is not a blank slate. Many other domains have long histories of managing complex, impactful technologies in high-stakes settings. By engaging experts from these domains, Microsoft aims to learn from the strengths and shortfalls of established governance and public policy strategies, adapting insights to the unique challenges of AI. <\/p>\n\n\n\n

This cross-domain learning has helped further shape Microsoft\u2019s approach to AI governance and contributions to public policy discussions.<\/p>\n\n\n\n\n\n

<\/div>\n\n\n","protected":false},"excerpt":{"rendered":"

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains\u2014from genome editing to cybersecurity\u2014to investigate the role of testing and evaluation as a governance tool. AI […]<\/p>\n","protected":false},"featured_media":1141306,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1141125","msr-story","type-msr-story","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us"],"related-researchers":[{"type":"user_nicename","display_name":"Kathleen Sullivan","user_id":40949,"people_section":"Section name 0","alias":"kasull"},{"type":"user_nicename","display_name":"Amanda Craig Deckard","user_id":43899,"people_section":"Section name 
0","alias":"amcraig"}],"related-publications":[1147724],"related-downloads":[],"related-videos":[],"related-projects":[],"related-groups":[],"related-events":[],"related-posts":[1140810,1142130,1142208,1143099,1144381],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story\/1141125","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-story"}],"version-history":[{"count":57,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story\/1141125\/revisions"}],"predecessor-version":[{"id":1158595,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story\/1141125\/revisions\/1158595"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1141306"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1141125"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1141125"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1141125"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1141125"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}