
Research Forum Brief | January 2024

Evaluation and Understanding of Foundation Models


“We see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.”

Besmira Nushi, Principal Researcher

Transcript

Besmira Nushi, Principal Researcher, Microsoft Research AI Frontiers 

Besmira Nushi summarizes timely challenges and ongoing work on the evaluation and in-depth understanding of large foundation models, as well as agent platforms built upon such models.

Microsoft Research Forum, January 30, 2024

BESMIRA NUSHI: Hi, everyone. My name is Besmira Nushi, and together with my colleagues at Microsoft Research, I work on evaluating and understanding foundation models. In our team, we see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.

But evaluation is hard, and new generative tasks are posing new challenges in evaluation and understanding. For example, it has become really difficult to scale up evaluation for long, open-ended, and generative outputs. At the same time, for emergent abilities, benchmarks often do not exist, and we have to create them from scratch. And even when they exist, they may be saturated or leaked into training datasets. In other cases, factors like prompt variability and model updates may matter just as much as the quality of the model being tested in the first place. When it comes to end-to-end and interactive scenarios, other aspects of model behavior may get in the way and interfere with task completion and user satisfaction. And finally, there exists a gap between evaluation and model improvement.

In our work, we really see this as just the first step toward understanding new failure modes and new architectures through data and model understanding. At Microsoft Research, when we address these challenges, we look at four important pillars. First, we build novel benchmarks and evaluation workflows. Second, we focus on evaluating interactive and multi-agent systems. And in everything we do, in every report that we write, we put responsible AI at the center of testing and evaluation to understand the impact of our technology on society. Finally, to bridge the gap between evaluation and improvement, we pursue efforts in data and model understanding.

But let’s look at some examples. Recently, in the benchmark space, we released KITAB. KITAB is a novel benchmark and dataset for testing constraint satisfaction capabilities on information retrieval queries that include user-specified constraints. And when we tested recent state-of-the-art models with this benchmark, we found that these models are able to satisfy the user constraints in only about 50 percent of cases.
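
A minimal sketch of how a KITAB-style constraint satisfaction rate could be computed is shown below. The record format and constraint checks are illustrative placeholders, not the actual KITAB data or evaluation harness.

```python
# Sketch: score how many items in a model's answer satisfy every user constraint.
# Hypothetical data format; not the actual KITAB evaluation pipeline.

def title_starts_with(prefix):
    return lambda book: book["title"].lower().startswith(prefix.lower())

def published_after(year):
    return lambda book: book["year"] > year

def satisfaction_rate(model_answer, constraints):
    """Fraction of returned items that satisfy all user constraints."""
    if not model_answer:
        return 0.0
    satisfied = sum(
        all(check(book) for check in constraints) for book in model_answer
    )
    return satisfied / len(model_answer)

# Example query: books by an author, published after 2010, titles starting with "The".
answer = [
    {"title": "The Example Novel", "year": 2014},
    {"title": "Another Title", "year": 2008},
]
constraints = [title_starts_with("The"), published_after(2010)]
print(f"constraint satisfaction: {satisfaction_rate(answer, constraints):.2f}")
```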

And similarly, in the multimodal space, Microsoft Research just released HoloAssist. HoloAssist is a testbed with extensive amounts of data that come from recording and understanding how people perform tasks in the real, physical world. This gives us an invaluable resource for evaluating, understanding, and measuring how new models are going to assist people with things like task completion and mistake correction. In the responsible AI area, ToxiGen is a new dataset designed to measure and understand toxicity generation from language models. It is able to measure harms that such models may generate across 13 different demographic groups.
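
The sketch below illustrates the general shape of a per-group harm measurement in the spirit of a ToxiGen-style analysis. The records, group labels, and scoring threshold are placeholders, not the actual ToxiGen pipeline or data.

```python
# Sketch: aggregate harm rates per demographic group from scored generations.
# Scores are assumed to come from some toxicity classifier in [0, 1].
from collections import defaultdict

def harm_rate_by_group(records, threshold=0.5):
    """records: iterable of (demographic_group, toxicity_score) pairs."""
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for group, score in records:
        totals[group] += 1
        harmful[group] += score >= threshold
    return {group: harmful[group] / totals[group] for group in totals}

# Placeholder scored generations for two of the demographic groups.
records = [("group_a", 0.9), ("group_a", 0.2), ("group_b", 0.1), ("group_b", 0.4)]
print(harm_rate_by_group(records))  # {'group_a': 0.5, 'group_b': 0.0}
```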

Similarly, in the multimodal space, we ran extensive evaluations to measure representational fairness and biases. For example, we tested several image generation models to see how they represent certain occupations, personality traits, and geographical locations. And we found that such models can represent a major setback in how different occupations are portrayed when compared to real-world representation. For instance, in some cases, we see as low as 0 percent representation for certain demographic groups.
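
A minimal sketch of this kind of representation measurement follows: compare the demographic breakdown of generated images for an occupation against a reference distribution. The group labels, reference shares, and example data are illustrative placeholders, not results from the study.

```python
# Sketch: compare demographic representation in generated images for one
# occupation prompt against a (placeholder) real-world reference distribution.
from collections import Counter

def representation(generated_labels):
    """Fraction of generated images attributed to each demographic group."""
    counts = Counter(generated_labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

# e.g., demographic labels assigned (by annotators or a classifier)
# to 8 images generated for a single occupation prompt
generated = ["group_a"] * 8                      # every image depicts one group
reference = {"group_a": 0.6, "group_b": 0.4}     # placeholder real-world shares

observed = representation(generated)
for group, expected in reference.items():
    share = observed.get(group, 0.0)
    print(f"{group}: generated {share:.0%}, reference {expected:.0%}, "
          f"gap {share - expected:+.0%}")
```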

Now, when it comes to data and model understanding, we often look back at architectural and model behavior patterns to see how they are tied to important and common errors. For example, for the case of constraint satisfaction for user queries, we looked at factual errors and information fabrication and mapped them to attention patterns. And we see that whenever factual errors occur, there are very weak attention patterns within the model that map to these errors. This is an important finding that is going to inform our next steps in model improvement.
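
The sketch below shows the general idea of inspecting how much attention a model pays to the tokens that carry a user constraint, assuming a Hugging Face transformer (gpt2 is used here purely as a small stand-in). It is not the analysis pipeline from the study; the prompt and constraint span are hypothetical.

```python
# Sketch: average attention weight paid to the tokens of a constraint span.
# Uses a small open model for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the one analyzed in the study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

def mean_attention_to_span(text, span):
    """Mean attention (over layers, heads, and query positions) directed
    at the tokens that spell out `span` inside `text`."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    start, end = text.index(span), text.index(span) + len(span)
    # keep token positions whose character span overlaps the constraint span
    cols = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    with torch.no_grad():
        attentions = model(**enc).attentions   # tuple of (batch, heads, seq, seq)
    stacked = torch.stack(attentions)          # (layers, batch, heads, seq, seq)
    return stacked[..., cols].mean().item()

prompt = "List books by this author published after 2010 whose titles start with The."
print(mean_attention_to_span(prompt, "after 2010"))
```

Comparing this quantity between generations with and without factual errors is one simple way to probe whether weak attention to the constraint tokens co-occurs with constraint violations.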

So as we push the new frontiers in AI innovation, we are also just as excited about understanding and measuring that progress scientifically. And we hope that many of you are going to join us in that challenge.

Thank you.