Abstracts Archives - Microsoft Research

Abstracts: Zero-shot models in single-cell biology with Alex Lu
http://approjects.co.za/?big=en-us/research/podcast/abstracts-zero-shot-models-in-single-cell-biology-with-alex-lu/
Thu, 22 May 2025 15:58:00 +0000

The emergence of foundation models has sparked interest in applications to single-cell biology, but when tested in zero-shot settings, they underperform compared to simpler methods. Alex Lu shares insights on why more research on AI models is needed in biological applications.

The post Abstracts: Zero-shot models in single-cell biology with Alex Lu appeared first on Microsoft Research.

Illustrated headshot of Alex Lu.

Members of the research community at Microsoft work continuously to advance their respective fields. The Abstracts podcast brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

The success of foundation models like ChatGPT has sparked growing interest in scientific communities seeking to use AI for things like discovery in single-cell biology. In this episode, senior researcher Alex Lu joins host Gretchen Huizinga to talk about his work on a paper called Assessing the limits of zero-shot foundation models in single-cell biology, where researchers tested zero-shot performance of proposed single-cell foundation models. Results showed limited efficacy compared to older, simpler methods, and suggested the need for more rigorous evaluation and research. 

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. 

[MUSIC FADES]

On today’s episode, I’m talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called Assessing the Limits of Zero-Shot Foundation Models in Single-Cell Biology. Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts! 

ALEX LU: Yeah, I’m really excited to be joining you today. 

HUIZINGA: So let’s start with a little background of your work. In just a few sentences, tell us about your study and more importantly, why it matters. 

LU: Absolutely. And before I dive in, I want to give a shout out to the MSR research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she’s the lead author on the study. But basically, in this research, we study single-cell foundation models, which have recently rocked the world of biology, because they claim to be able to use AI to unlock understanding about single-cell biology. For a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells to discovering new drugs for cancer, biologists will conduct experiments where they measure how much of every gene is expressed inside just one single cell. So these experiments give us a powerful view into the cell’s internal state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes, so you get this really long chain of numbers that measures how much there is of each of those 20,000 genes. And deriving meaning from this really long chain of numbers is really difficult. Single-cell foundation models claim to be capable of unraveling deeper insights than ever before. So that’s the claim that these works have made. And in our recent paper, we showed that these models may actually not live up to these claims. Basically, we showed that single-cell foundation models perform worse, in settings that are fundamental to biological discovery, than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because they have implications for the toolkits that biologists use to understand their experiments. 
Our work suggests that single-cell foundation models may not be appropriate for practical use just yet, at least in the discovery applications that we cover. 

HUIZINGA: Well, let’s go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they’re being used in novel applications, which is what you’re interested in, like single-cell biology. So I’m curious, just sort of as a foundation, what other research has already been done in this area, and how does this study illuminate or build on it? 

LU: Absolutely. Okay, so we were the first to notice and document this issue in single-cell foundation models specifically. And this is because we proposed evaluation methods that, while common in other areas of AI, had yet to be commonly used to evaluate single-cell foundation models. We performed something called zero-shot evaluation on these models. Prior to our work, most works evaluated single-cell foundation models with fine-tuning. And the way to understand this is that single-cell foundation models are trained in a way that tries to expose these models to millions of single cells. But because you’re exposing them to a large amount of data, you can’t really rely upon this data being annotated or labeled in any particular fashion. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add on a second training phase. We call this the fine-tuning phase, where you have a smaller number of single cells, but now they are actually labeled for the specialized tasks that you want the model to perform. So most people typically evaluate the performance of single-cell models after they fine-tune these models. However, what we noticed is that evaluating these fine-tuned models has several problems. First, it might not actually align with how these models are going to be used by biologists. A critical distinction in biology is that we’re not just trying to interact with an agent that has access to knowledge through its pre-training; we’re trying to extend these models to discover new biology beyond that sphere of knowledge. And so in many cases, the point of using these models, the point of analysis, is to explore the data with the goal of potentially discovering something new about the single cells that the biologists worked with that they weren’t aware of before. So in these kinds of cases, it is really tough to fine-tune a model. 
There’s a bit of a chicken-and-egg problem going on. If you don’t know, for example, that there’s a new kind of cell in the data, you can’t really instruct the model to help us identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes impossible. The second issue is that evaluations on fine-tuned models can sometimes mislead us in our ability to understand how these models are working. So for example, the claim behind single-cell foundation model papers is that these models learn a foundation of biological knowledge by being exposed to millions of single cells in their first training phase, right? But it’s possible that when you fine-tune a model, any performance increases you see are simply because you’re using a massive model that is really sophisticated, really large. Even if there hadn’t been any meaningful exposure to cells at all, that model might do perfectly fine. So going back to our paper, what’s really different about it is that we propose zero-shot evaluation for these models. What that means is that we do not fine-tune the model at all, and instead we keep the model frozen during the analysis step. How we specialize it to a downstream task instead is that we extract the model’s internal embedding of single-cell data, which is essentially a numerical vector that contains the information the model is extracting and organizing from its input data. So it’s essentially how the model perceives single-cell data and how it organizes it in its own internal state. Basically, this is a better way for us to test the claim that single-cell foundation models are learning foundational biological insights. Because if they actually are learning these insights, those insights should be present in the model’s embedding space even before we fine-tune the model. 
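The zero-shot setup Lu describes, keeping the model frozen and working only with its internal embeddings, can be sketched in a few lines. This is a minimal illustration, not the models or code from the paper: the `embed` function below is a hypothetical stand-in (a fixed random projection) for a frozen foundation model's encoder, and the "cells" are synthetic.

```python
import numpy as np

# Hypothetical stand-in for a frozen single-cell foundation model's encoder.
# A real model would map a ~20,000-gene expression vector to an embedding;
# a fixed random projection keeps the example self-contained.
rng = np.random.default_rng(0)
PROJECTION = rng.normal(size=(20000, 64))

def embed(expression: np.ndarray) -> np.ndarray:
    """Zero-shot embedding: the model stays frozen, no fine-tuning step."""
    return expression @ PROJECTION

# Two synthetic "cell populations" standing in for real single-cell data.
cells_a = rng.normal(loc=0.0, size=(50, 20000))
cells_b = rng.normal(loc=1.0, size=(50, 20000))

# Extract embeddings without any task-specific training.
emb_a = embed(cells_a)
emb_b = embed(cells_b)

# Downstream analyses (clustering, neighbor search, visualization) would
# operate on these frozen embeddings rather than on fine-tuned outputs.
print(emb_a.shape, emb_b.shape)
```

In the study itself, `embed` would be a pretrained single-cell foundation model, and the resulting embeddings would feed the exploratory analyses discussed below.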

HUIZINGA: Well, let’s talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single-cell biology. How did you go about evaluating these models? 

LU: Yes, so let’s dive deeper into how zero-shot evaluations are conducted, okay? So the premise here is that if these models are truly learning foundational biological insights, then if we take the model’s internal representation of cells, cells that are biologically similar should be close in that internal representation, while cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single-cell foundation models, and importantly, we compared these models against older and reliable tools that biologists have used for exploratory analyses. These include simpler machine learning methods like scVI, statistical algorithms like Harmony, and even basic data pre-processing steps, like filtering your data down to a more robust subset of genes. So basically, we tested embeddings from our two single-cell foundation models against these baselines in a variety of settings. And we tested the hypothesis that biologically similar cells should be close under each of these distinct methods, across these datasets. 
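One simple way to operationalize "biologically similar cells should be close" is a nearest-neighbor label-agreement check on an embedding. This is an illustrative metric in the spirit of the evaluation Lu describes, not the paper's actual benchmark suite, and the clusters here are synthetic stand-ins for cell types:

```python
import numpy as np

def knn_label_agreement(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of cells whose nearest neighbor (excluding self) shares
    their cell-type label: a simple proxy for 'biologically similar
    cells are close' in an embedding space."""
    # Pairwise squared Euclidean distances between all cells.
    d = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)      # exclude self-matches
    nearest = d.argmin(axis=1)
    return float((labels[nearest] == labels).mean())

# Toy example: two well-separated clusters standing in for two cell types.
rng = np.random.default_rng(0)
emb = np.concatenate([
    rng.normal(0.0, 0.1, size=(30, 8)),
    rng.normal(5.0, 0.1, size=(30, 8)),
])
cell_types = np.array([0] * 30 + [1] * 30)

score = knn_label_agreement(emb, cell_types)
print(score)  # close to 1.0 when like cells cluster together
```

The same metric could be computed for embeddings from a foundation model and from baselines like scVI or Harmony, and the scores compared directly.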

HUIZINGA: Well, and as you did the testing, you obviously were aiming towards research findings, which is my favorite part of a research paper, so tell us what you did find and what you feel the most important takeaways of this paper are. 

LU: Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to older methods. To contextualize why that is such a surprising result: there is a lot of hype around these methods. So basically, I think that, yeah, it’s a very surprising result, given how hyped these models are and how people were already adopting them. But our results basically caution that they shouldn’t really be adopted for these purposes. 

HUIZINGA: Yeah, so this is serious real-world impact here in terms of if models are being adopted and adapted in these applications, how reliable are they, et cetera? So given that, who would you say benefits most from what you’ve discovered in this paper and why? 

LU: Okay, so two ways, right? So I think this has at least immediate implications for the way that we do discovery in biology. And as I’ve discussed, these experiments are used for cases that have practical impact: drug discovery applications, investigations into basic biology. But let’s also talk about the impact for methodologists, people who are trying to improve these single-cell foundation models, right? I think at base, they’re really exciting proposals. Because if you look at some of the prior, less sophisticated methods, they tended to be more bespoke. So the excitement of single-cell foundation models is that you have this general-purpose model that can be used for everything, and while they’re not living up to that purpose just yet, I think it’s important that we continue to bank on that vision, right? And if you look at our contributions in that area: single-cell foundation models are a really new proposal, so it makes sense that we may not know how to fully evaluate them just yet. So you can view our work as basically being a step towards more rigorous evaluation of these models. Now that we’ve done this experiment, I think methodologists know to use this as a signal on how to improve the models and whether they’re going in the right direction. And in fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper. So this essentially helps future computer scientists that are working on single-cell foundation models know how to train better models. 

HUIZINGA: That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology, and what foundation might this paper lay for future research agendas in the field? 

LU: Yeah, absolutely. So now that we’ve shown single-cell foundation models don’t necessarily perform well, I think the natural question on everyone’s mind is: how do we actually train single-cell foundation models that live up to that vision, that can help us discover new biology? So in the short term, yeah, we’re actively investigating many hypotheses in this area. For example, my colleagues Lorin Crawford and Ava Amini, who were co-authors on the paper, recently put out a preprint on how training data composition impacts model performance. And one of the surprising findings they had was that many of the training datasets people use to train single-cell foundation models are highly redundant, to the point that you can sample just a tiny fraction of the data and get basically the same performance. But you can also look forward to many other explorations in this area as we continue to develop this research. Zooming out into the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use, right? I mean, this is obvious for any AI method, but I think people have gotten too used to taking methods that work for natural vision or natural language, maybe in the consumer domain, and then extrapolating these methods to biology and expecting that they will work in the same way, right? For example, one reason why zero-shot evaluation was not routine practice for single-cell foundation models prior to our work, and we were the first to fully establish that as a practice for the field, was because I think people who have been working in AI for biology have been looking to these more mainstream AI domains to shape their work. 
And so with single-cell foundation models, many of these models are adapted from large language models in natural language processing, recycling the exact same architecture, the exact same code, basically just recycling the practices of that field. When you look at practices in more mainstream domains, zero-shot evaluation is definitely explored there, but it’s more of a niche rather than being considered central to model understanding. So again, because biology is different from mainstream language processing, it’s a scientific discipline, zero-shot evaluation becomes much more important, and you often have no choice but to use these models zero-shot. In other words, I think we need to be thinking carefully about what it is that makes training a model for biology different from training a model, for example, for consumer purposes. 

[MUSIC]

HUIZINGA: Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read it on the Genome Biology website. See you next time on Abstracts! 

[MUSIC FADES] 

Abstracts: Aurora with Megan Stanley and Wessel Bruinsma
http://approjects.co.za/?big=en-us/research/podcast/abstracts-aurora-with-megan-stanley-and-wessel-bruinsma/
Wed, 21 May 2025 15:22:51 +0000

A new Nature paper explores Aurora, an AI model that redefines weather prediction with application to other environmental domains such as tropical cyclones. Hear from senior researchers Megan Stanley and Wessel Bruinsma as they share their groundbreaking work.

The post Abstracts: Aurora with Megan Stanley and Wessel Bruinsma appeared first on Microsoft Research.

Abstracts podcast | Aurora with Megan Stanley and Wessel Bruinsma

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode of Abstracts, Microsoft senior researchers Megan Stanley and Wessel Bruinsma join host Amber Tingle to discuss their groundbreaking work on environmental forecasting. Their new Nature publication, “A Foundation Model for the Earth System,” features Aurora, an AI model that redefines weather prediction and extends its capabilities to other environmental domains such as tropical cyclones and ocean wave forecasting.


Learn more

A foundation model for the Earth system
Nature | May 2025

Introducing Aurora: The first large-scale foundation model of the atmosphere
Microsoft Research Blog | June 2024

Project Aurora: The first large-scale foundation model of the atmosphere
Video | September 2024

A Foundation Model for the Earth System
arXiv | November 2024

Aurora
Azure AI Foundry Labs

Transcript

[MUSIC]   

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

Our guests today are Megan Stanley and Wessel Bruinsma. They are both senior researchers within the Microsoft Research AI for Science initiative. They are also two of the coauthors on a new Nature publication called “A Foundation Model for the Earth System.”

This is such exciting work about environmental forecasting, so we’re happy to have the two of you join us today.  

Megan and Wessel, welcome. 

MEGAN STANLEY: Thank you. Thanks. Great to be here. 

WESSEL BRUINSMA: Thanks. 

TINGLE: Let’s jump right in. Wessel, share a bit about the problem your research addresses and why this work is so important. 

BRUINSMA: I think we’re all very much aware of the revolution that’s happening in the space of large language models, which have just become so strong. What’s perhaps less well known is that machine learning models have also started to revolutionize the field of weather prediction. Whereas traditional weather prediction models, based on physical laws, used to be the state of the art, these traditional models are now challenged and often even outperformed by AI models.  

This advancement is super impressive and really a big deal. Mostly because AI weather forecasting models are computationally much more efficient and can even be more accurate. What’s unfortunate though, about this big step forward, is that these developments are mostly limited to the setting of weather forecasting.  

Weather forecasting is very important, obviously, but there are many other important environmental forecasting problems out there, such as air pollution forecasting or ocean wave forecasting. We have developed a model, named Aurora, which really kicks the AI revolution in weather forecasting into the next gear by extending these advancements to other environmental forecasting fields, too. With Aurora, we’re now able to produce state-of-the-art air pollution forecasts using an AI approach. And that wasn’t possible before! 

TINGLE: Megan, how does this approach differ from or build on work that’s already been done in the atmospheric sciences? 

STANLEY: Current approaches have really focused training very specifically on weather forecasting models. And in contrast, with Aurora, what we’ve attempted to do is train a so-called foundation model for the Earth system. In the first step, we train Aurora on a vast body of Earth system data. This is our pretraining step.  

And when I say a vast body of data, I really do mean a lot. And the purpose of this pretraining is to let Aurora, kind of, learn some general-purpose representation of the dynamics that govern the Earth system. But then once we’ve pretrained Aurora, and this really is the crux of this, the reason why we’re doing this project, is after the model has been pretrained, it can leverage this learned general-purpose representation and efficiently adapt to new tasks, new domains, new variables. And this is called fine-tuning. 

The idea is that the model really uses the learned representation to perform this adaptation very efficiently, which basically means Aurora is a powerful, flexible model that can relatively cheaply be adapted to any environmental forecasting task.   
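The pretrain-then-fine-tune pattern Stanley describes can be sketched with a deliberately tiny stand-in: here, PCA plays the role of pretraining a general-purpose representation on a large unlabeled dataset, and a small least-squares head plays the role of cheap task-specific fine-tuning. None of this is Aurora's actual architecture or training procedure; it only illustrates the two-phase idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Pretraining (sketch): learn a general-purpose representation from a
# large unlabeled dataset. PCA via SVD stands in for Aurora's pretraining;
# the real model learns far richer Earth-system dynamics.
big_unlabeled = rng.normal(size=(5000, 32))
centered = big_unlabeled - big_unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
encoder = vt[:8].T                      # frozen after pretraining: 32 -> 8 dims

def represent(x: np.ndarray) -> np.ndarray:
    """Apply the frozen, pretrained representation."""
    return x @ encoder

# --- Fine-tuning (sketch): adapt cheaply to a new task with little data by
# fitting only a small head on top of the frozen representation.
task_x = rng.normal(size=(100, 32))
task_y = task_x @ rng.normal(size=32)   # synthetic forecasting target
feats = represent(task_x)
head, *_ = np.linalg.lstsq(feats, task_y, rcond=None)

preds = represent(task_x) @ head
print(preds.shape)
```

The design point is that the expensive phase runs once, while each new task (air pollution, ocean waves, cyclone tracks) only pays for the small adaptation step.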

TINGLE: Wessel, can you tell us about your methodology? How did you all conduct this research? 

BRUINSMA: Approaches so far have trained models primarily on one particular dataset. This one dataset is very large, which makes it possible to train very good models. But it does remain only one dataset, and that’s not very diverse. In the domain of environmental forecasting, we have really tried to push the limits of scaling to large data by training Aurora on not just this one large dataset, but on as many very large datasets as we could find. 

These datasets are a combination of estimates of the historical state of the world, forecasts by other models, climate simulations, and more. We’ve been able to show that training on not just more data but more diverse data helps the model achieve even better performance. Showing this is difficult because there is just so much data.  

In addition to scaling to more and more diverse data, we also increased the size of the model as much as we could. Here we found that bigger models, despite being slower to run, make more efficient use of computational resources. It’s cheaper to train a good big model than a good small model. The mantra of this project was to really keep it simple and to scale to simultaneously very large and, more importantly, diverse data and large model size. 

TINGLE: So, Megan, what were your major findings? And we know they’re major because they’re in Nature. [LAUGHS] 

STANLEY: Yeah, [LAUGHS] I guess they really are. So the main outcome of this project is we were actually able to train a single foundation model that achieves state-of-the-art performance in four different domains. Air pollution forecasting. For example, predicting particulate matter near the surface or ozone in the atmosphere. Ocean wave forecasting, which is critical for planning shipping routes.  

Tropical cyclone track forecasting, so that means being able to predict where a hurricane or a typhoon is expected to go, which is obviously incredibly important, and very high-resolution weather forecasting.  

And I’ve, kind of, named these forecasting domains as if they’re just items in a list, but in every single one, Aurora really pushed the limits of what is possible with AI models. And we’re really proud of that.  

But perhaps, kind of, you know, to my mind, the key takeaway here is that the foundation model approach actually works. So what we have shown is it’s possible to actually train some kind of general model, a foundation model, and then adapt it to a wide variety of environmental tasks. Now we definitely do not claim that Aurora is some kind of ultimate environmental forecasting model. We are sure that the model and the pretraining procedure can actually be improved. But, nevertheless, we’ve shown that this approach works for environmental forecasting. It really holds massive promise, and that’s incredibly cool. 

TINGLE: Wessel, what do you think will be the real-world impact of this work? 

BRUINSMA: Well, for applications that we mentioned, which are air pollution forecasting, ocean wave forecasting, tropical cyclone track forecasting, and very high-resolution weather forecasting, Aurora could today be deployed in real-time systems to produce near real-time forecasts. And, you know, in fact, it already is. You can view real-time weather forecasts by the high-resolution version of the model on the website of ECMWF (European Centre for Medium-Range Weather Forecasts). 

But what’s remarkable is that every one of these applications took a small team of engineers about four to eight weeks to fully execute. You should compare this to a typical development timeline for more traditional models, which can be on the order of multiple years. Using the pretraining and fine-tuning approach that we used for Aurora, we might see significantly accelerated development cycles for environmental forecasting problems. And that’s exciting. 

TINGLE: Megan, if our listeners only walk away from this conversation with one key talking point, what would you like that to be? What should we remember about this paper? 

STANLEY: The biggest takeaway is that the pretraining fine-tuning paradigm, it really works for environmental forecasting, right? So you can train a foundational model, it learns some kind of general-purpose representation of the Earth system dynamics, and this representation boosts performance in a wide variety of forecasting tasks. But we really want to emphasize that Aurora only scratches the surface of what’s actually possible. 

So there are many more applications to explore than the four we’ve mentioned. And undoubtedly, the model and pretraining procedure can actually be improved. So we’re really excited to see what the next few years will bring. 

TINGLE: Wessel, tell us more about those opportunities and unanswered questions. What’s next on the research agenda in environmental prediction? 

BRUINSMA: Well, Aurora has two main limitations. The first is that the model produces only deterministic predictions, by which I mean a single predicted value. For variables like temperature, this is mostly fine. But other variables, like precipitation, are inherently stochastic. For these variables, we really want to assign probabilities to different levels of precipitation rather than predicting only a single value. 

An extension of Aurora to allow this sort of prediction would be a great next step.  
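The deterministic-versus-probabilistic distinction can be made concrete with a Gaussian output head: instead of committing to one value, a model would predict a mean and a (log-)variance per grid point and train against a negative log-likelihood loss. This is a generic sketch of one common way to do probabilistic regression, not a description of how Aurora would actually be extended.

```python
import numpy as np

def gaussian_nll(mean, log_var, target):
    """Negative log-likelihood (up to a constant) of a Gaussian prediction:
    the kind of loss a probabilistic head could train against, rewarding
    well-calibrated uncertainty instead of a single point value."""
    return 0.5 * (log_var + (target - mean) ** 2 / np.exp(log_var))

# Deterministic view: one value per grid point.
point_forecast = np.array([2.0])            # e.g. mm/h of precipitation

# Probabilistic view: a distribution per grid point (mean, log-variance).
mean, log_var = np.array([2.0]), np.array([0.0])

observed = np.array([3.5])
print(float(gaussian_nll(mean, log_var, observed).sum()))
```

Under such a loss, a model that predicts a wider variance for genuinely uncertain variables like precipitation is penalized less for misses than one that claims false confidence.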

The second limitation is that Aurora depends on a procedure called assimilation. Assimilation attempts to create a starting point for the model from real-world observations, such as from weather stations and satellites. The model then takes the starting point and uses it to make predictions. Unfortunately, assimilation is super expensive, so it would be great if we could somehow circumvent the need for it. 

Finally, what we find really important is to make our advancements available to the community.

[MUSIC] 

TINGLE: Great. Megan and Wessel, thanks for joining us today on the Microsoft Research Podcast. 

BRUINSMA: Thanks for having us. 

STANLEY: Yeah, thank you. It’s been great. 

TINGLE: You can check out the Aurora model on Azure AI Foundry. You can read the entire paper, “A Foundation Model for the Earth System,” at aka.ms/abstracts. And you’ll certainly find it on the Nature website, too.  

Thank you so much for tuning in to Abstracts today. Until next time.  

[MUSIC FADES] 

Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv
http://approjects.co.za/?big=en-us/research/podcast/abstracts-heat-transfer-and-deep-learning-with-hongxia-hao-and-bing-lv/
Thu, 08 May 2025 16:00:00 +0000

Silicon has long borne the burden of heat transfer in electronics, but in a post-Moore’s Law world, researchers like Hongxia Hao and Bing Lv are using AI to discover and design next-generation materials that exceed the limits of silicon’s thermal conductivity.

The post Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv appeared first on Microsoft Research.

Illustrated headshots of Hongxia Hao (left) and Bing Lv (right).

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, senior researcher Hongxia Hao and physics professor Bing Lv join host Gretchen Huizinga to talk about how they are using deep learning techniques to probe the upper limits of heat transfer in inorganic crystals, discover novel materials with exceptional thermal conductivity, and rewrite the rulebook for designing high-efficiency electronics and sustainable energy.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers.

[MUSIC FADES] 

Today I’m talking to two researchers, Hongxia Hao, a senior researcher at Microsoft Research AI for Science, and Bing Lv, an associate professor in physics at the University of Texas at Dallas. Hongxia and Bing are co-authors of a paper called Probing the Limit of Heat Transfer in Inorganic Crystals with Deep Learning. I’m excited to learn more about this! Hongxia and Bing, it’s great to have you both on Abstracts!

HONGXIA HAO: Nice to be here.

BING LV: Nice to be here, too.

HUIZINGA: So Hongxia, let’s start with you and a brief overview of this paper. In just a few sentences. Tell us about the problem your research addresses and more importantly, why we should care about it.

HAO: Let me start with a very simple yet profound question: what’s the fastest that heat can travel through a solid material? This is not just an academic curiosity; it’s a question that touches the very foundation of how we build the technologies around us. From the moment you tap your smartphone to the moment your laptop is turned on and functioning, heat is always flowing. So we’re trying to answer a century-old mystery: the upper limit of heat transfer in solids. We care about this not just because it’s a fundamental problem in physics and materials science, but because solving it could really rewrite the rulebook for designing high-efficiency electronics, sustainable energy, etc. And nowadays, with cutting-edge nanometer chips and very fancy technologies, we are packing more computing power into smaller spaces, but the faster and denser we build, the harder it becomes to remove the heat. So in many ways, thermal bottlenecks, not just transistor density, are now the ceiling of Moore’s Law. And the stakes are enormous. We really wish to bring more thermal solutions by finding more high-thermal-conductivity material choices, from the perspective of materials discovery with the help of AI.

LV: So I think one of the biggest things, as Hongxia said, right? Thermal solutions will eventually become a bottleneck for all types of heterogeneous integration of materials. From this perspective, consider how people approached this previously: thermal was always the last problem to solve. But now people more and more realize that all these things have to be addressed upfront. This co-design becomes very important. So I think what we are doing right now, integrating AI to help explore the large space of materials and identify fundamentally what the limits of these materials will be, will become very important for society.

HUIZINGA: Hmm. Yeah. Hongxia, did you have anything to add to that?

HAO: Yes. Previously, many people explored these materials science questions through experiments, and in the past few decades people have seen a new trend of computational materials discovery. For example, we do fundamental solving of the Schrödinger equation using Density Functional Theory [DFT]. This actually brings us a lot of opportunities. The question here is, as the theory gets more and more developed, it’s too expensive for us to apply it at very large scale and study tons of materials. Think about this: the bottleneck now is not just about having a very good theory, it’s about scale. So this is where AI, specifically deep learning, comes into play.

HUIZINGA: Well, Hongxia, let’s stay with you for a minute and talk about methodology. How did you do this research and what was the methodology you employed?

HAO: So for this question, we built a pipeline that spans AI, quantum mechanics, and computational brute force with a blend of efficiency and accuracy. It begins with generating an enormous chemical and structural design space, inspired by Slack’s criteria. We focused first on simple crystals, as these are the systems most likely to have low anharmonicity, fewer phonon scattering events, and therefore potentially high thermal conductivities. But we didn’t stop here. We also included a huge pool of more complex and higher-energy structures to ensure diversity and avoid bias. For each candidate, we first ran a structure relaxation using MatterSim, which is a deep learning foundation model for materials science that we use to characterize the properties of materials, and we used that to screen for dynamic stability. About 200K structures passed this filter. Then came the real challenge: calculating the thermal conductivity. We solved this problem using the Boltzmann transport equation and the three-phonon scattering process. The twist here is that all of this was done not with traditional DFT solvers but with our deep learning model, MatterSim. It’s trained to predict energy, force, and stress, and we can get second- and third-order interatomic force constants directly from it, which guarantees the accuracy of the solution. Finally, to validate the model’s predictions, we performed full DFT-based calculations on the top candidates we found, some of which even include higher-order scattering mechanisms, electron-phonon coupling effects, etc. This rigorous validation gave us confidence in the speed-accuracy trade-offs and revealed a spectrum of materials that had either previously been overlooked or never before been conceived.
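The screening funnel Hongxia describes (relax each candidate, filter for dynamic stability, rank by predicted thermal conductivity, then send only the top candidates to expensive DFT validation) can be sketched roughly as below. Everything here is a hypothetical stand-in: the function names, the toy candidate list, and the illustrative kappa numbers are not MatterSim’s real API or the paper’s actual data; only the shape of the pipeline follows the conversation.

```python
# Sketch of the candidate-screening funnel described above.
# All functions are hypothetical stand-ins: in the real pipeline,
# relaxation and stability come from MatterSim, kappa comes from a
# Boltzmann transport equation (BTE) solver using MatterSim-derived
# force constants, and the final check is full DFT.

def screen(candidates, relax, is_dynamically_stable, predict_kappa, top_k=3):
    """Relax every structure, keep the dynamically stable ones,
    rank them by predicted lattice thermal conductivity (kappa),
    and return the top_k for expensive DFT validation."""
    relaxed = [relax(c) for c in candidates]
    stable = [c for c in relaxed if is_dynamically_stable(c)]
    ranked = sorted(stable, key=predict_kappa, reverse=True)
    return ranked[:top_k]

# Toy candidates with illustrative numbers only, so the sketch runs end to end.
candidates = [
    {"name": "diamond", "kappa": 2300},
    {"name": "BAs", "kappa": 1300},
    {"name": "MnV", "kappa": 600},
    {"name": "unstable-X", "kappa": 9000, "unstable": True},
]

top = screen(
    candidates,
    relax=lambda c: c,                                  # stand-in for MatterSim relaxation
    is_dynamically_stable=lambda c: not c.get("unstable"),
    predict_kappa=lambda c: c["kappa"],                 # stand-in for the BTE calculation
    top_k=2,
)
print([c["name"] for c in top])  # -> ['diamond', 'BAs']
```

The point of the funnel is ordering by cost: the cheap learned model prunes the ~230K-structure space so that full DFT is only spent on the short list at the end.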

HUIZINGA: So Bing, let’s talk about your research findings. How did things work out for you on this project and what did you find?

LV: I think one of the biggest things about this paper is that it creates a very large materials database, basically a smart database, which will eventually be made accessible to the public. I think that’s a big achievement, because people who need to look something up can actually go search the Microsoft database and find out, oh, this material does have this type of thermal property. The database contains about 230,000 materials. And one of the things we confirm concerns the highest-thermal-conductivity material: based on all the wisdom of the Slack criteria, diamond was predicted to have the highest thermal conductivity. We very solidly prove that diamond, at this stage, remains the highest-thermal-conductivity material. We also have a lot of new, exotic materials, some of which Hongxia can elaborate on a little bit more, having very exotic combinations of thermal and other properties, which could provide new insight for new physics, new materials, and new devices. All of this combined will have a very profound impact on society.

HUIZINGA: Yeah, Hongxia, go a little deeper on that because that was an interesting part of the paper when you talked about diamond still being the sort of “gold standard,” to mix metaphors! But you’ve also found some other materials that are remarkable compared to silicon.

HAO: Yeah, yeah. Within this search space, even though we didn’t find something higher than diamond, we did discover more than twenty new materials with thermal conductivity exceeding that of silicon. And silicon is the benchmark we want to compare against because it’s the backbone of modern electronics. More interesting, I think, is manganese vanadium. It shows some very interesting and surprising phenomena: it’s a metallic compound, but with very high lattice thermal conductivity. This is the first time it has been discovered through our search pipeline, and it’s something that could not easily have been discovered without the help of AI. And right now, I think Bing can explain more on this and show some interesting results.

HUIZINGA: Yeah, go ahead Bing.

LV: So this was actually very surprising to me as an experimentalist, because when Hongxia presented their theory work to me, this material, manganese vanadium, was discovered back in 1938, almost 100 years ago, but there are no more than twenty papers talking about it! A lot of them were on theory, not even on the experimental part. We actually did quite a bit of work on this. We are in the process of characterizing it and then moving forward to the thermal conductivity measurements. So hopefully that will add to the value of these things, showing, hey, AI does help to predict materials and can really generate new materials with very high thermal conductivity.

HUIZINGA: Yeah, so Bing, stay with you for a minute. I want you to talk about some kind of real-world applications of this. I know you alluded to a couple of things, but how is this work significant in that respect, and who might be most excited about it, aside from the two of you? [LAUGHS]

LV: So as I mentioned before, the first thing is this database. I believe it’s the first-ever large materials database regarding thermal conductivity. And it has, as I said, 230,000 materials with AI-predicted thermal conductivity. This will provide not only science but engineering with a vastly expanded catalog of candidate materials for the future roadmap of materials integration, and for all these bottlenecks we are talking about, the thermal solutions for semiconductors or even beyond semiconductor integration, people will actually have a database to look through. So these things will become very important, and I believe over the long term it will have a long-lasting impact on the research community and on society’s development.

HUIZINGA: Yeah. Hongxia, did you have anything to add to that one too?

HAO: Yeah, so this study reshapes how we think about limits. I like the saying that the only way to discover the limits of the possible is to go beyond them into the impossible. In this case, we tried, but we didn’t break the diamond limit. But we proved it even more rigorously than ever before. And in doing so, we also uncovered some uncharted peaks in the thermal conductivity landscape. This would not have happened without new AI capabilities for materials science. In the long run, I believe researchers could benefit from this AI-driven design and shift how they do materials research with AI.

HUIZINGA: Yeah, it’ll be interesting to see if anyone ever does break the diamond limit with the new tools that are available, but…

HAO: Yeah!

HUIZINGA: So this is the part of the Abstracts podcast where I like to ask for sort of a golden nugget, a one-sentence takeaway that listeners might get from this paper. If you had one, Hongxia, what would it be? And then I’ll ask Bing to maybe give his.

HAO: Yes. AI is no longer just a tool; it’s becoming a critical partner for us in scientific discovery. Our work proved that large-scale, data-driven science can now approach long-standing, fundamental questions with fresh eyes. When trained well and guided with physical intuition, models like MatterSim can really realize a full in-silico characterization of materials and not just simulate known materials, but really try to imagine what nature hasn’t yet revealed. Our work points to a path forward: not just incrementally better materials, but entirely new classes of high-performance compounds that we could never have guessed without AI.

HUIZINGA: Yeah. Bing, what’s your one takeaway?

LV: I want to add a few things on top of Hongxia’s comments, because I think Hongxia used some critical words I would like to emphasize. When we train AI well, when we guide AI well, it can become a very useful partner. So all in all, our human intellectual merit is still going to play a significantly important role, okay? We are generating this AI; we should really train it and use our human intellect to guide it to be useful for the advancement of human society. Now, with all these AI tools, I think it’s a very golden time. Experimentalists can work very closely with theorists like Hongxia, who has very good intellectual merit, and now, incorporating AI and combining all the pieces together, hopefully we can really accelerate materials discovery at a much faster pace than ever, which the whole society will eventually benefit from.

HUIZINGA: Yeah. Well, as we close, Bing, I want you to go a little further and talk about what’s next then, research wise. What are the open questions or outstanding challenges that remain in this field and what’s on your research agenda to address them?

LV: So first of all, this paper addresses primarily crystalline, ordered, inorganic bulk materials, under the conditions we targeted: ambient pressure and room temperature, because that’s normally how instruments work, right? But what about extreme conditions? We want to go to space, right? There we’ll have extreme conditions, sometimes very cold, sometimes very hot. We have some places with extremely high pressure, or conditions that are highly radioactive. Under those conditions, a whole new database could emerge. Can we do something beyond that? Another important thing is that this paper targets high thermal conductivity. What about extremely low thermal conductivity? Those will bring a very good challenge for theorists and for the machine learning approach. I think that’s something Hongxia is probably very excited to work on. I know she’s ambitious; she wants to do something beyond what we have achieved so far.

HUIZINGA: Yeah, so Hongxia, how would you encapsulate what your dream research is next?

HAO: Yeah, so besides all of these exciting research directions, on my end, another direction that is perhaps just as exciting is that we want to move from search to design. Right now, we are good at asking what exists by doing forward prediction and brute force. But with generative AI, we can start asking what should exist. In the future, we can combine forward prediction and backward generative design to really tackle these questions: if you want materials with desired properties, how would you design them?

[MUSIC]

HUIZINGA: Well, it sounds like there’s a full plate of research agenda goodness going forward in this field, both with human brains and AI. So, Hongxia Hao and Bing Lv, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/Abstracts, or you can read a pre-print of it on arXiv. See you next time on Abstracts!

[MUSIC FADES] 

The post Abstracts: Heat Transfer and Deep Learning with Hongxia Hao and Bing Lv appeared first on Microsoft Research.

]]>
Abstracts: Societal AI with Xing Xie http://approjects.co.za/?big=en-us/research/podcast/abstracts-societal-ai-with-xing-xie/ Mon, 05 May 2025 16:01:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1138012 New AI models aren’t just changing the world of research; they’re also poised to impact society. Xing Xie talks about Societal AI, a white paper that explores the changing landscape with an eye to future research and improved communication across disciplines.

The post Abstracts: Societal AI with Xing Xie appeared first on Microsoft Research.

]]>
Xing Xie illustrated headshot

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Partner Research Manager Xing Xie joins host Gretchen Huizinga to talk about his work on a white paper called Societal AI: Research Challenges and Opportunities. Part of a larger effort to understand the cultural impact of AI systems, this white paper is a result of a series of global conversations and collaborations on how AI systems interact with and influence human societies. 


Learn more:

Societal AI: Building human-centered AI systems
Microsoft Research Blog, May 2024

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot – or a podcast abstract – of their new and noteworthy papers. 

[MUSIC FADES]

I’m here today with Xing Xie, a partner research manager at Microsoft Research and co-author of a white paper called Societal AI: Research Challenges and Opportunities. This white paper is a result of a series of global conversations and collaborations on how AI systems interact with and impact human societies. Xing Xie, great to have you back on the podcast. Welcome to Abstracts! 

XING XIE: Thank you for having me. 

HUIZINGA: So let’s start with a brief overview of the background for this white paper on Societal AI. In just a few sentences, tell us how the idea came about and what key principles drove the work. 

XIE: The idea for this white paper emerged in response to the shift we are witnessing in the AI landscape. Particularly since the release of ChatGPT in late 2022, these models didn’t just change the pace of AI research, they began reshaping our society, education, economy, and yeah, even the way we understand ourselves. At Microsoft Research Asia, we felt a strong urgency to better understand these changes. Over the past 30 months, we have been actively exploring this frontier in partnership with experts from psychology, sociology, law, and philosophy. This white paper serves three main purposes. First, to document what we have learned. Second, to guide future research directions. And last, to open up an effective communication channel with collaborators across different disciplines. 

HUIZINGA: Research on responsible AI is a relatively new discipline and it’s profoundly multidisciplinary. So tell us about the work that you drew on as you convened this series of workshops and summer schools, research collaborations and interdisciplinary dialogues. What kinds of people did you bring to the table and for what reason? 

XIE: Yeah. Responsible AI actually has been evolving within Microsoft for like about a decade. But with the rise of large language models, the scope and urgency of these challenges have grown exponentially. That’s why we have leaned heavily on interdisciplinary collaboration. For instance, in the Value Compass Project, we worked with philosophers to frame human values in a scientifically actionable way, something essential for aligning AI behavior. In our AI evaluation efforts, we drew from psychometrics to create more principled ways of assessing these systems. And with the sociologists, we have examined how AI affects education and social systems. This joint effort has been central to the work we share in this white paper. 

HUIZINGA: So white papers differ from typical research papers in that they don’t rely on a particular research methodology per se, but you did set, as a backdrop for your work, ten questions for consideration. So how did you decide on these questions and how or by what means did you attempt to answer them? 

XIE: Rather than follow a traditional research methodology, we built this white paper around ten fundamental, foundational research questions. These came from extensive dialogue, not only with social scientists, but also computer scientists working at the technical front of AI. These questions span both directions. First, how AI impacts society, and second, how social science can help solve technical challenges like alignment and safety. They reflect a dynamic agenda that we hope to evolve continuously through real-world engagement and deeper collaboration. 

HUIZINGA: Can you elaborate on… a little bit more on the questions that you chose to investigate as a group or groups in this? 

XIE: Sure, I can use the Value Compass Project as one example. In that project, our main goal is to study how we can better align the values of AI models with our human values. Here, one fundamental question is how we define our own human values. There is actually a lot of debate and discussion on this. Fortunately, philosophy and sociology have studied this for years, for hundreds of years. They have defined frameworks such as the basic human values framework and moral foundations theory. We can borrow that expertise. Actually, we have worked with sociologists and philosophers to borrow this expertise and define a framework that could be usable for AI, and we have worked on developing some initial frameworks and evaluation methods for this.

HUIZINGA: So one thing that you just said was to frame philosophical issues in a scientifically actionable way. How hard was that? 

XIE: Yeah, it is actually not easy. I think that first of all, social scientists and AI researchers, we… usually we speak different languages. 

HUIZINGA: Right! 

XIE: Our research is at a very different pace. So at the very beginning, we had to find out the best way to talk to each other. So we have workshops, we have joint research projects, we have them visit us, and also we have supervised some joint interns. Those are all the ways we try to find common ground to work together. More specifically for this value framework, we have tried to understand the latest progress from their field and how to adapt it to an AI context. So, I mean, it’s not easy, but it’s an enjoyable and exciting journey!

HUIZINGA: Yeah, yeah, yeah. And I want to push in on one other question that I thought was really interesting, which you asked, which was how can we ensure AI systems are safe, reliable, controllable, especially as they become more autonomous? I think this is a big question for a lot of people. What kind of framework did you use to look at that? 

XIE: Yeah, there are many different aspects. Alignment definitely is one aspect. That means how we can make sure we have a way to truly and deeply embed our values into the AI model. Even after we define our values, we still need a way to make sure they are actually embedded. And evaluation, I think, is another topic. Even if this AI looks safe and seems to behave well, how can we evaluate that? How can we make sure it is actually doing the right thing? So we also have some collaboration with psychometrics people to define a more scientific evaluation framework for this purpose as well.

HUIZINGA: Yeah, I remember talking to you about your psychometrics in the previous podcast… 

XIE: Yeah! 

HUIZINGA: …you were on and that was fascinating to me. And I hope… at some point I would love to have a bigger conversation on where you are now with that because I know it’s an evolving field. 

XIE: It’s evolving! 

HUIZINGA: Yeah, amazing! Well, let’s get back to this paper. White papers aren’t designed to produce traditional research findings, as it were, but there are still many important outcomes. So what would you say the most important takeaways or contributions of this paper are? 

XIE: Yeah, the key takeaway, I believe, is AI is no longer just a technical tool. It’s becoming a social actor. 

HUIZINGA: Mmm. 

XIE: So it must be studied as a dynamic evolving system that intersects with human values, cognition, culture, and governance. So we argue that interdisciplinary collaboration is no longer optional. It’s essential. Social sciences offer tools to understand the complexity, bias, and trust, concepts that are critical for AI’s safe and equitable deployment. So the synergy between technical and social perspectives is what will help us move from reactive fixes to proactive design. 

HUIZINGA: Let’s talk a little bit about the impact that a paper like this can have. And it’s more of a thought leadership piece, but who would you say will benefit most from the work that you’ve done in this white paper and why? 

XIE: We hope this work speaks to both AI and social science communities. For AI researchers, this white paper provides frameworks and real-world examples, like value evaluation systems and cross-cultural model training that can inspire new directions. And for social scientists, it opens doors to new tools and collaborative methods for studying human behavior, cognition, and institutions. And beyond academia, we believe policymakers and industry leaders can also benefit as the paper outlines practical governance questions and highlights emerging risks that demand timely attention. 

HUIZINGA: Finally, Xing, what would you say the outstanding challenges are for Societal AI, as you framed it, and how does this paper lay a foundation for future research agendas? Specifically, what kinds of research agendas might you see coming out of this foundational paper? 

XIE: We believe this white paper is not a conclusion; it’s a starting point. While the ten research questions are a strong foundation, they also expose deeper challenges. For example, how do we build a truly interdisciplinary field? How can we reconcile the different timelines, methods, and cultures of AI and social science? And how do we nurture talent that can work fluently across both domains? We hope this white paper encourages others to take on these questions with us. Whether you are a researcher, student, policymaker, or technologist, there is a role for you in shaping AI that not only works but works for society. So yeah, I look forward to the conversation with everyone. 

[MUSIC]

HUIZINGA: Well, Xing Xie, it’s always fun to talk to you. Thanks for joining us today and to our listeners, thanks for tuning in. If you want to read this white paper, and I highly recommend that you do, you can find a link at aka.ms/Abstracts, or you can find a link in our show notes that will take you to the Microsoft Research website. See you next time on Abstracts!

[MUSIC FADES]

 

The post Abstracts: Societal AI with Xing Xie appeared first on Microsoft Research.

]]>
Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang http://approjects.co.za/?big=en-us/research/podcast/abstracts-neurips-2024-with-jindong-wang-and-steven-euijong-whang/ Fri, 13 Dec 2024 14:30:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1107435 Researcher Jindong Wang and Associate Professor Steven Euijong Whang explore the NeurIPS 2024 work ERBench. ERBench leverages relational databases to create LLM benchmarks that can verify model rationale via keywords in addition to checking answer correctness.

The post Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang appeared first on Microsoft Research.

]]>
Illustrated image of Jindong Wang and Steven Euijong Whang

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang (opens in new tab), a tenured associate professor at Korea Advanced Institute of Science and Technology (KAIST), join host Gretchen Huizinga to discuss the paper “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” a spotlight session at this year’s Conference on Neural Information Processing Systems (NeurIPS). ERBench leverages the integrity constraints of relational databases to create LLM benchmarks that can verify model rationale via keywords as well as check for answer correctness. This work is supported by the Microsoft Research initiative Accelerating Foundation Models Research, or AFMR.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are coauthors of a paper called “ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models,” and this paper is a spotlight at this year’s conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC, this week. Jindong and Steven, thanks for joining us on Abstracts!

JINDONG WANG: Thank you. Nice to be here.

STEVEN EUIJONG WHANG: It’s great to be here.

HUIZINGA: So, Jindong, I’ll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.

JINDONG WANG: OK, everybody knows that with the widespread usage of large language models, hallucination has become a crucial factor of concern. Hallucination occurs when models generate false or nonexistent information. In particular, factual hallucination greatly undermines the reliability of large language models. To correctly evaluate hallucination, evaluating the model’s rationale is also important. To date, when the paper, you know, was submitted, there were no works dealing with automatic rationale evaluation systematically, because most of them focused on manual evaluation or just used GPT-judge. ERBench is the first to generate a large language model evaluation benchmark utilizing relational databases. Relational databases are based on the relational data model, assuming a fixed schema. The fixed schema gives relational databases data integrity grounded in database design theories, so the integrity constraints in relational databases allow better evaluation of large language models. Functional dependencies allow automatic rationale evaluation using functional-dependency-inferred keywords, and foreign key constraints allow for easy generation of multi-hop questions, which are usually very complicated to generate with other techniques. So that’s basically what we want to do. In one sentence, we try to build an automatic benchmark for the evaluation of hallucination.

HUIZINGA: Steven, give us a quick overview of your research methodology and findings. How did you conduct your research, and what were your major takeaways?

STEVEN EUIJONG WHANG: Sure. So this was a collaboration between our group at KAIST, and Dr. Xing Xie’s group at MSRA (Microsoft Research Asia). KAIST is Korea Advanced Institute of Science and Technology. So we had the privilege to closely work with our LLM expert, Dr. Jindong Wang, here. We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for using Azure quota for our experiments. So we had some biweekly meetings for maybe over a year, and at some point, we figured that relational databases could be really important for LLM evaluation. I personally have a background in databases, which I studied at Stanford University as a PhD student. So relational databases have integrity constraints that can be used to better construct complex, in-depth questions and verify answers. So the first ingredient is functional dependencies. So these are constraints where, given a few attributes, you can determine another attribute. So I’ll just give an example because I think that helps the understanding. So suppose that you have, like, a movie table, and in a movie, you have the title of the movie, the year of production, and the director of the movie, and the length of the movie, and so on and so forth. So if you know the title and year of the movie, that pretty much identifies the movie, and you can actually determine the director of the movie, as well. So, for example, if you know that there’s a movie called Star Wars, which is a very popular movie produced in 1977, that determines the director. We know it’s George Lucas, right. So, basically, it’s like a function. It receives the Star Wars 1977 and determines, gives the output, George Lucas. So that’s the first ingredient. Now, the reason this is important is that we can use these functional dependencies to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values. 
For example, we may ask the LLM, is there a director of a movie called Star Wars produced in 1977? And the LLM can say yes. And it is the right answer, but we’d like to know if the LLM is knowing what it’s saying, right. And so we look at the rationale. That’s why looking at the rationale is important. We just can’t say it’s doing the correct thing. So if the LLM mentions George Lucas, bingo, that’s a great answer. However, if the LLM mentions some other director, like Steven Spielberg, that’s not a correct rationale. So that’s exactly what we’re trying to evaluate. Functional dependency is key to being able to do that kind of verification.

The second ingredient is foreign key constraints. So foreign key constraint is where one of the attributes in one table can intuitively link to another attribute of another table. So in our movie table, we had the director attribute. Now we may also have a separate table called the director table, and maybe we might have some more information about the director in that table, like the director name, the director’s age, all sorts of information about the director. So foreign key constraint basically requires that if there is some director mentioned in the movie table, it has to be one of the directors in the director table. So this basically links a table to another table. It’s very useful. So using this, what we can do is we can join the two tables, right. So now we can join the movie and director table and generate a bigger table. The reason this is useful is that we can also chain together functional dependencies that I just mentioned into longer functional dependencies. So what this enables is us to construct more complex questions, arbitrarily, that are multi-hop. So using these integrity constraints, we can basically convert any relational database into an LLM benchmark, and this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various prompt engineering techniques.
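The two ingredients Steven walks through can be condensed into a small sketch using his own Star Wars example. The table contents below are just the facts from the conversation (plus George Lucas’s birth year for the multi-hop step); the function names and in-memory tables are hypothetical illustrations, not ERBench’s actual implementation.

```python
# Sketch of the two integrity constraints described above.
# Tiny in-memory "tables"; ERBench's real machinery is not shown here.

movies = {("Star Wars", 1977): {"director": "George Lucas"}}   # movie table
directors = {"George Lucas": {"born": 1944}}                   # director table (FK target)

def rationale_correct(title, year, llm_rationale):
    """Functional dependency (title, year) -> director pinpoints the
    keyword the LLM's rationale must mention to count as correct."""
    keyword = movies[(title, year)]["director"]
    return keyword in llm_rationale

def multi_hop_fact(title, year):
    """Foreign key: movie.director links into the director table.
    Chaining the two functional dependencies yields a multi-hop
    question, e.g. 'When was the director of Star Wars (1977) born?'"""
    director = movies[(title, year)]["director"]
    return directors[director]["born"]

print(rationale_correct("Star Wars", 1977, "Yes; it was directed by George Lucas."))  # True
print(rationale_correct("Star Wars", 1977, "Yes; Steven Spielberg directed it."))     # False
print(multi_hop_fact("Star Wars", 1977))  # 1944
```

This is why both ingredients matter: the functional dependency separates a correct answer with a correct rationale from a correct answer with a wrong one, and the foreign key lets questions be chained to arbitrary depth as the joined tables grow.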

HUIZINGA: Well, I would ask you to, kind of, drill in on what you found: how does ERBench compare to other benchmark tests?

STEVEN EUIJONG WHANG: So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates, using single-hop, multi-hop, and multimodal questions, and we also performed prompt engineering and fine-tuning. What we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs, like Gemini, tend to be a bit more conservative and do not answer as many questions but hallucinate less as a result. So the key conclusion is that no LLM, like, totally subsumes the others in all aspects, which is the reason why we use multiple measures. And the key message we want to convey is that overall, ERBench is effective in evaluating any LLM’s thought process by pinpointing critical keywords within the rationale.

HUIZINGA: Well, Jindong, back to you. Research settings are one thing, but tell us how your work is significant in real-world settings, and who does this impact most and how?

JINDONG WANG: Relational databases, you know, they are everywhere across various domains. Anyone can easily get access to them from Google or from Kaggle or even create them targeting the domain or subject that one wants to test the model on. So taking into account that ERBench is the first work to utilize relational databases for generating large language model hallucination benchmarks, this work will lead to a new research direction of integrating database design theories and techniques, a long-studied field (you know, databases are very traditional, old, and classic, but they’re still operating right now), into the large language model field, a recently emerging area.

HUIZINGA: Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field. What do you propose to do about those, and what’s next on your research agenda?

STEVEN EUIJONG WHANG: Sure, so the big picture is that we basically proposed the first work to properly evaluate the rationale of LLMs, right. This is very important because LLMs are being used in our everyday lives, and everyone has the question: is the LLM suitable for my task? Can I benefit from the LLM? So it’s very important to verify whether the LLM knows what it’s saying. I just mentioned that we use functional dependencies to pinpoint critical keywords in the rationale, and we believe that’s just the first step. It’s very effective, by the way. You may have the question, is it enough to just look at, like, the George Lucas within the long rationale? It turns out that in 95% of the cases, it is actually effective; we did human studies and also used GPT-judge to verify that. But these are factual questions, and there could be various other questions that require long answers, right. Long rationales. So the important question is, can we also verify all the rest of the rationales, the complicated rationales, as well? In order to do that properly, we need a lot of technology. First we need to understand the rationales using NLP techniques, and we need to know whether the model is properly answering the question, and so on and so forth. So we believe there’s a lot of opportunity to expand from that. We basically proposed an initial work in this direction, but we believe that many more interesting challenges remain.

HUIZINGA: Well, Jindong Wang and Steven Whang, thanks for joining us today, and to our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find a link at aka.ms/abstracts.

[MUSIC]

You can also find it on arXiv and on the NeurIPS website. And if you’re at the NeurIPS conference this week, go to the poster session and talk to the authors! See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang appeared first on Microsoft Research.

Abstracts: NeurIPS 2024 with Weizhu Chen http://approjects.co.za/?big=en-us/research/podcast/abstracts-neurips-2024-with-weizhu-chen/ Sat, 07 Dec 2024 00:48:04 +0000 http://approjects.co.za/?big=en-us/research/?p=1107414 Next-token prediction trains a language model on all tokens in a sequence. VP Weizhu Chen discusses his team’s 2024 NeurIPS paper on how distinguishing between useful and “noisy” tokens in pretraining can improve token efficiency and model performance.

The post Abstracts: NeurIPS 2024 with Weizhu Chen appeared first on Microsoft Research.

Illustrated image of Weizhu Chen.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Weizhu Chen, vice president of Microsoft GenAI, joins host Amber Tingle to discuss the paper “Not All Tokens Are What You Need for Pretraining,” an oral presentation and a recipient of the Best Paper Runner-Up Award at this year’s Conference on Neural Information Processing Systems (NeurIPS). Based on an examination of model training at the token level, Chen and his coauthors present an alternate approach to model pretraining: instead of training language models to predict all tokens, they make a distinction between useful and “noisy” tokens. Doing so, the work shows, improves token efficiency and model performance.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

Our guest today is Weizhu Chen. He is vice president of Microsoft GenAI and coauthor of a paper called “Not All Tokens Are What You Need for Pretraining.” This paper is an oral presentation during the 38th annual Conference on Neural Information Processing Systems, also known as NeurIPS, which is happening this week in Vancouver. Weizhu, thank you for joining us today on Abstracts.

WEIZHU CHEN: Thank you for having me, Amber. 

TINGLE: So let’s start with a brief overview of your paper. In a couple sentences, tell us about the problem your research addresses and, more importantly, why the research community and beyond should know about this work. 

CHEN: So my team in Microsoft GenAI, we are working on model training. One of the things we do in pretraining, we realized the importance of the data. And we found that when we look at the data at the level of each token, some tokens are more important than others. That’s one. The other thing is that some tokens are very, very hard to predict during pretraining. So, for example, if the model sees the text of “Weizhu,” what’s the next token? It can be “Chen”; it can be any last name. So it’s very hard to predict. And if we try to force a language model to focus on these kinds of hard-to-predict tokens, it’s going to confuse the language model. There are so many different examples like this, like, for example, the serial number on your UPS package. So the focus of this paper is to try to identify which tokens are actually more important for the language model to learn, because the other tokens may just be noise, and how we can discriminate between them: which are good tokens and which are noise tokens. Basically, you try to understand the dynamics of the tokens.

TINGLE: How did you conduct this research? 

CHEN: Actually, we do a lot of work in model training, including pretraining and post-training. On the pretraining side, the most important thing to us is the data. We try to understand how we can leverage the existing data and how we can create much more data as well. Data is basically one of the most important things for building a better foundation model. So we try to understand how much more we can get from the data. And the important thing for the data is data filtering. In the previous literature, we do data filtering by, for example, building a classifier to classify, OK, this page is more important than that one, and this page is noise, because there’s so much noisy data on the web. So we just keep the best data to put into the pretraining corpus. Going further, we thought, OK, maybe that’s not fine-grained enough, so can we try to understand, even within a page we want to keep, that some tokens are more important than others? Maybe some tokens are just noise tokens, and if you put that data into pretraining, it’s going to hurt the model quality. So that was the motivation we were thinking about.

TINGLE: And what were your major findings? 

CHEN: Our major finding is basically that this works really well. It’s so important that we are able to select the best tokens from the corpus and then ask the model during pretraining to ignore the tokens we don’t want to get into the model itself. That is one. The second thing is that data is definitely the other very important thing: if you’re able to figure out a better way to build better data, you’re most likely able to build a much better foundation model. The third thing is that this is also connected to a lot of other existing work, like data synthesis, like distillation, like data filtering; a lot of things are really connected together. And you can associate this work with a lot of other work we are doing, like distillation. For example, for this work, we also build a reference model, as we call it, to try to identify that this token is more important than the other, and we try to understand the discrepancy between the reference model and the running model, that is, their predictions on each token. So you can think of it as, in some sense, trying to distill from the reference model into the existing model as well.
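The reference-model mechanism Chen describes can be sketched in a few lines: score each token by how much the running model's loss exceeds the reference model's loss, and train only on the highest-scoring fraction. This is a hedged illustration of the idea with toy loss values, not the paper's actual implementation:

```python
# Sketch of selecting "useful" tokens for pretraining: score each token
# by the gap between the training model's loss and a reference model's
# loss, then keep only the highest-scoring fraction. The loss values
# below are toy numbers, not real model outputs.

def select_tokens(train_losses, ref_losses, keep_ratio=0.5):
    """Return indices of tokens to train on: those where the running
    model lags the reference model the most (largest excess loss)."""
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    return sorted(ranked[:k])

# Token 2 looks like noise: even the reference model can't predict it,
# so the loss gap is small and it is dropped from the training loss.
train = [2.0, 3.5, 8.0, 1.0]
ref   = [0.5, 1.0, 7.9, 0.9]
print(select_tokens(train, ref, keep_ratio=0.5))  # [0, 1]
```

The key property of this scoring rule is that hard-to-predict noise tokens (like a serial number) have high loss under both models, so their excess loss stays small and they are excluded.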

TINGLE: Let’s talk a little bit about real-world impact. Who benefits most from this work? And how significant is this within your discipline and even downstream for people using applications? 

CHEN: This actually is very, very fundamental work because, as I shared a little bit before, if we build the data much better, we are able to build a much better foundation model. And if we’re able to build a better model, it can benefit so many different kinds of applications. This is also going to help us build much better small language models, and we can serve those models even on the edge side, on the client side, in coding scenarios. So we are going to see a huge impact from these kinds of foundation models if they are able to benefit from much better training data.

TINGLE: Are there any unanswered questions or unsolved problems in this area? What’s next on your research agenda? 

CHEN: Yeah, I think that is a very good question. Definitely there’s a lot about how to build better data that is unsolved yet in the literature, especially because when you do pretraining, the most important part is the data, but the data is very limited. How we can make better use of the existing, limited data is a big challenge, because we can increase the model by 10x, but it’s super hard to increase the data by 10x, especially when we want high-quality data. The other part is, even given the data, how can you identify, especially for this work, the importance of each token so as to build a much better model? I think all these things are very connected. To me, data is the oxygen. There are still so many things we are able to do with the data, whether building a small language model or a large model.

TINGLE: Data is oxygen—I love that! So other than that being a key takeaway, is there any other one thing that you’d like our listeners to walk away from this conversation knowing? 

CHEN: I would love to say: focus more on the data and focus more on how you can get more from the data; that is the very important thing. And the other thing is, we are working on something very exciting. Feel free to come join us if you are interested in this area.

[MUSIC] 

TINGLE: Well, Weizhu Chen, thank you for joining us today. We really appreciate it. 

CHEN: Thank you. Thank you for having me. 

TINGLE: And thanks to our listeners for tuning in. If you’d like to read the full paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts.

[MUSIC FADES] 

The post Abstracts: NeurIPS 2024 with Weizhu Chen appeared first on Microsoft Research.

Abstracts: NeurIPS 2024 with Pranjal Chitale http://approjects.co.za/?big=en-us/research/podcast/abstracts-neurips-2024-with-pranjal-chitale/ Fri, 06 Dec 2024 14:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1107426 Pranjal Chitale discusses the 2024 NeurIPS work CVQA. Spanning 31 languages and the cultures of 30 countries, this VQA benchmark was created with native speakers and cultural experts to evaluate model performance across diverse linguistic and cultural contexts.

The post Abstracts: NeurIPS 2024 with Pranjal Chitale appeared first on Microsoft Research.



In this episode, Research Fellow Pranjal Chitale joins host Gretchen Huizinga to discuss the paper “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). CVQA, which comprises questions and images representative of 31 languages and the cultures of 30 countries, was created in collaboration with native speakers and cultural experts to evaluate how well models perform across diverse linguistic and cultural contexts, an important step toward improving model inclusivity.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract— of their new and noteworthy papers.

[MUSIC FADES]

Today I’m talking to Pranjal Chitale, a research fellow at Microsoft Research India. Pranjal is coauthor of a paper called “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” and this paper is an oral presentation at this week’s 38th annual Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC. Pranjal, thanks for joining us today on Abstracts!

PRANJAL CHITALE: Hi, Gretchen. Thanks for having me.

HUIZINGA: So, Pranjal, give us an overview of this paper. In a couple sentences, what problem are you trying to solve, and why should people care about it?

CHITALE: So we are witnessing some exciting times as LLMs are rapidly evolving as tools for countless use cases. While most of these LLMs were initially leveraged for natural language processing tasks, they have now expanded across languages and modalities. However, a major gap lies in the availability of multimodal data for non-English languages. Therefore, most multimodal models might not have coverage for non-English languages altogether or might just heavily rely on translations of the associated text in English-centric datasets so as to support multiple languages. The drawback of this approach is that it often misses the cultural nuances of local languages. Another reason this is not optimal is that the images are mostly Western-centric and therefore not well reflective of the local culture of a lot of regions. This kind of bias can skew these models toward a Western perspective, raising concerns about the inclusivity and safety of the content they generate when serving a global population of multicultural and multilingual users. Therefore, for a truly inclusive AI ecosystem, models must demonstrate cultural understanding to ensure that the generated content is safe and respectful for diverse communities. Evaluating cultural awareness, though, is extremely challenging because how to define culture itself is an unsolved problem. In this work, we are trying to take a step toward having a proxy that can measure cultural understanding.

HUIZINGA: Well, talk about how you did this. What methodology did you use for this paper, and what were your major findings?

CHITALE: Now that we have defined our broader problem, it is important to decide the scope of our solution because, as we discussed, culture is an umbrella term. So we need to define a smaller scope for this problem. We chose visual question answering, which is a multimodal task, and it is one of the most critical multimodal tasks for the scope of this work. So recognizing the limitations of existing VQA benchmarks, which often rely on translations and lack cultural representation, we developed CVQA, the Culturally-diverse multilingual VQA benchmark. CVQA spans 30 countries and 31 languages and has over 10,000 culturally nuanced questions, which were crafted by native speakers and cultural experts. Our focus was on creating questions that require what we term cultural common sense to answer. For instance, with just the image, it is not possible to answer the question; you need some awareness of the local culture to be able to answer it. So these questions draw inspiration from knowledge of local culture. One important aspect of this dataset is that we include both local-language and English variants of the same question to allow robust testing of models across linguistic contexts. I would say the crux of this effort is that while most prior efforts may be small in terms of language coverage, perhaps language-group specific or country specific, we wanted this to be a much larger, global-scale collaborative effort. So this covers 31 languages across 30 countries. To build CVQA, we worked with qualified volunteers from diverse age groups and genders, ensuring that the questions authentically represented their cultures. The images we collected were ensured to be copyright free, grounded in culture, and safe for work, with strict guidelines to avoid images that reflect stereotypes or violate privacy.
And we also had 10 categories, covering topics ranging from daily life, sports, and cuisine to the history of the region, giving a holistic view of the culture of the region. Each question was crafted as a multiple-choice task with challenging answer options that required both the image and cultural knowledge to solve. We also employed a maker-checker approach to ensure quality and consistency.

HUIZINGA: So you’ve created the benchmark. You’ve tested it. What were your major findings?

CHITALE: Now that we have created a benchmark, the next step is to evaluate how multimodal models perform on it. So we benchmarked several state-of-the-art multimodal models, which include both open-source offerings like CLIP, BLIP, and LLaVA-1.5 and proprietary offerings like GPT-4o and Gemini 1.5 Flash. What we observed is that there is a huge gap in performance when we compare these proprietary offerings with the open-source models. GPT-4o was the highest-performing model, with 75.4% accuracy on English prompts and 74.3% accuracy on local prompts. However, the story is completely different when we go to open-source models. These open-source models significantly lag behind the proprietary models. And one key finding for these open-source models is that they perform even worse when prompted in the native language than when prompted in English. This potentially highlights that these models lack multilingual understanding capabilities, which may be because multilingual training data is pretty scarce.

HUIZINGA: Yeah.

CHITALE: So LLaVA-1.5 turned out to be the best open-source model. One thing to note: LLaVA-1.5 performs well across a large set of English VQA benchmarks, but when it comes to cultural understanding, it is a pretty weak model. Further, we also did some ablations to understand whether adding location-specific information to the textual prompts has an impact, but we identified that it does not result in any significant performance improvements. We also conducted a category-wise analysis. As we mentioned, there are 10 categories to which these images belong. What we observed is that certain categories, like people and everyday life, consistently saw higher accuracy across a large set of models. This is likely due to the abundance of human-activity data in training datasets. However, when it comes to niche categories like cooking and food or pop culture, which are much more challenging, especially in local languages, these models struggle. These are the kinds of highly diverse cultural contexts that need improvement.
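The English-versus-local-language comparison described above boils down to a per-prompt-language accuracy computation over the paired question variants. The sketch below uses invented predictions for four multiple-choice questions, not the paper's actual numbers:

```python
# Toy sketch of comparing accuracy on English prompts versus the
# local-language variants of the same multiple-choice questions.

def accuracy(predictions, answers):
    """Fraction of multiple-choice questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Hypothetical model outputs for four questions with options A-D.
answers      = ["B", "D", "A", "C"]
pred_english = ["B", "D", "A", "A"]   # 3/4 correct
pred_local   = ["B", "A", "C", "A"]   # 1/4 correct

gap = accuracy(pred_english, answers) - accuracy(pred_local, answers)
print(f"English: {accuracy(pred_english, answers):.2f}")  # 0.75
print(f"Local:   {accuracy(pred_local, answers):.2f}")    # 0.25
print(f"Gap:     {gap:.2f}")                              # 0.50
```

A large positive gap, as in this toy case, is the signature of the missing multilingual capability the paper identifies in open-source models.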

HUIZINGA: How’s this work going to make an impact outside the lab and in the real world?

CHITALE: CVQA is significant because it addresses a fundamental gap in how we evaluate vision-language and multimodal models today. While proprietary models are making impressive strides, open-source models, which are more accessible and easier to deploy, significantly lag behind in terms of cultural awareness and safety. CVQA fills this gap and provides a much-needed benchmark to help us identify these gaps in the first place. To fix them, we first need to identify the gaps, and whether we are progressing or not can be captured by this benchmark. For the real world, this benchmark does have some far-reaching implications. Models that understand culture are not just technically better; they create interactions that are far more engaging, natural, and safe for users from diverse backgrounds. This benchmark offers an entirely new axis for improvement: cultural awareness and linguistic diversity. By improving a model’s ability to handle culturally nuanced questions, CVQA ensures researchers and developers think beyond accuracy and also focus on cultural awareness and inclusivity before shipping these models into production.

HUIZINGA: Pranjal, what are the unanswered questions or unsolved problems in this field, and what do you plan to do about it?

CHITALE: So while CVQA makes some strides in addressing cultural and linguistic diversity, there is still much more to explore in this space. This dataset only covers 31 languages and cultures, which is just, like, a subset of the incredible diversity that exists globally. Many languages and cultures remain underrepresented, especially those that are endangered or have limited digital resources. So expanding CVQA to include more of these languages would be a natural next step. Secondly, CVQA focuses on single-turn question-answer pairs, but in reality, human interaction is often multi-turn and conversational in nature. A multi-turn version of CVQA could better simulate real-world use cases and challenge models to maintain cultural and contextual awareness over extended dialogues. Another interesting area is personalization. It would be very interesting if we could teach models to adapt to a user’s cultural background, preferences, or even regional nuances in real time. This remains a significant challenge, although this benchmark could help us move a step toward that broader goal.

[MUSIC]

HUIZINGA: Well, Pranjal Chitale, this is super important research, and thank you for joining us today. To our listeners, thanks for tuning in. If you’re interested in learning more about this paper, you can find it at aka.ms/abstracts. You can also find it on arXiv and on the NeurIPS website. And if you’re at NeurIPS, you can also go hear about it. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Pranjal Chitale appeared first on Microsoft Research.

Abstracts: NeurIPS 2024 with Dylan Foster http://approjects.co.za/?big=en-us/research/podcast/abstracts-neurips-2024-with-dylan-foster/ Fri, 06 Dec 2024 14:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1107372 Can existing algorithms designed for simple reinforcement learning problems be used to solve more complex RL problems? Researcher Dylan Foster discusses the modular approach he and his coauthors explored in their 2024 NeurIPS paper on RL under latent dynamics.

The post Abstracts: NeurIPS 2024 with Dylan Foster appeared first on Microsoft Research.

Illustrated image of Dylan Foster for the Abstracts series on the Microsoft Research Podcast.


In this episode, Principal Researcher Dylan Foster joins host Amber Tingle to discuss the paper “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity,” an oral presentation at this year’s Conference on Neural Information Processing Systems (NeurIPS). In the paper, Foster and his coauthors explore whether well-studied RL algorithms for simple problems can be leveraged to solve RL problems with high-dimensional observations and latent dynamics, part of larger efforts to identify algorithm design principles that can enable agents to learn quickly via trial and error in unfamiliar environments.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Our guest today is Dylan Foster. He is a principal researcher at Microsoft Research and coauthor of a paper called “Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity.” The work is among the oral presentations at this year’s Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver. Dylan, welcome and thank you for joining us on the podcast!

DYLAN FOSTER: Thanks for having me.

TINGLE: Let’s start with a brief overview of this paper. Tell us about the problem this work addresses and why the research community should know about it.

FOSTER: So this is a, kind of, a theoretical work on reinforcement learning, or RL. When I say reinforcement learning, broadly speaking, this is talking about the question of how can we design AI agents that are capable of, like, interacting with unknown environments and learning how to solve problems through trial and error. So this is part of some broader agenda we’ve been doing on, kind of, theoretical foundations of RL. And the key questions we’re looking at here are what are called, like, exploration and sample efficiency. So this just means we’re trying to understand, like, what are the algorithm design principles that can allow you to explore an unknown environment and learn as quickly as possible? What we’re doing in this paper is we’re, kind of, looking at, how can you most efficiently solve reinforcement learning problems where you’re faced with very high-dimensional observations, but the underlying dynamics of the system you’re interacting with are simple? So this is a setting that occurs in a lot of natural reinforcement learning and control problems, especially in the context of, like, say, embodied decision-making. So if you think about, say, games like Pong, you know, the state of the game, like, the state of, like, Pong, is extremely simple. It’s just, you know, what is the position and velocity of the ball, and, like, where are the paddles? But what we’d like to be able to do is learn to, you know, like, control or, like, solve games like this from raw pixels or, like, images kind of in the same way that a human would, like, just solve them from vision. So if you look at these types of problems, you know, we call these, like, RL with rich observations or RL with latent dynamics. You know, these are interesting because they, kind of, require you to explore the system, but they also require, you know, representation learning. 
Like, you want to be able to use neural nets to learn a mapping from, say, the images you see to the latent state of the system. This is a pretty interesting and nontrivial algorithmic problem. And, kind of, what we do in this work is we take a first step towards something like a unified understanding for how to solve these sorts of, like, rich-observation, or latent dynamics, RL problems.

TINGLE: So how did you go about developing this theoretical framework?

FOSTER: Yeah, so if you look at these sort of RL problems with latent dynamics, this is something that’s actually received a lot of investigation in theory. And a lot of this goes back to, kind of, early work from our lab from, like, 2016, 2017 or so. There’s some really interesting results here, but progress was largely on a, like, case-by-case basis, meaning, you know, there are many different ways that you can try to model the latent dynamics of your problem, and, you know, each of these somehow leads to a different algorithm, right. So, like, you know, you think very hard about this modeling assumption. You think about, what would an optimal algorithm look like? And you end up, you know, writing an entire paper about it. And there’s nothing wrong with that per se, but if you want to be able to iterate quickly and, kind of, try different modeling assumptions and see what works in practice, you know, this is not really tenable. It’s just too slow. And so the starting point for this work was to, kind of, try to take a different and more modular approach. So the idea is, you know, there are many, many different types of, sort of, systems or modeling assumptions for the dynamics that have been already studied extensively and have entire papers about them for the simpler setting in which you can directly see the state of the system. And so what we wanted to ask here is, is it possible to use these existing results in more of, like, a modular fashion? Like, if someone has already written a paper on how to optimally solve a particular type of MDP, or Markov decision process, can we just take their algorithm as is and perhaps plug it into some kind of meta-algorithm that can directly, kind of, combine this with representation learning and use it to solve the corresponding rich-observation, or latent dynamics, RL problem?

TINGLE: What were your major findings? What did you learn during this process?

FOSTER: We started by asking the question sort of exactly the way that I just posed it, right. Like, can we take existing algorithms and use them to solve rich-observation RL problems in a modular fashion? And this turned out to be really tricky. Like, there’s a lot of natural algorithms you might try that seem promising at first but don’t exactly work out. And what this, kind of, led us to and, sort of, the first main result in this paper is actually a negative result. So what we actually showed is most, sort of, well-studied types of systems or, like, MDPs that have been studied in, like, the prior literature on RL, even if they’re tractable when you’re able to directly see the state of the system, they can become statistically intractable once you add, sort of, high-dimensional observations to the picture. And statistically intractable here means the amount of interaction that you need, like the amount of, sort of, attempts to explore the system that you need, in order to learn a good decision-making policy becomes, like, very, very large, much larger than the corresponding, sort of, complexity if you were able to directly see the states of the system. You know, you could look at this and say, I guess we’re out of luck. You know, maybe there’s just no hope of solving these sorts of problems. But that’s perhaps a little too pessimistic. Really, the way you should interpret this result is just that you need more assumptions. And that’s precisely what the, sort of, second result we have in this paper is. So our second result shows that you can, sort of, bypass this impossibility result and, you know, achieve truly modular algorithms under a couple different types of additional assumptions.

TINGLE: Dylan, I’d like to know—and I’m sure our audience would, too—what this work means when it comes to real-world application. What impact will this have on the research community?

FOSTER: Yeah, so maybe I’ll answer that, um, with two different points. The first one is a broader point, which is, why is it important to understand this problem of exploration and sample efficiency in reinforcement learning? If you look at the, sort of, setting we study in this paper—you know, this, like, RL or decision-making with high-dimensional observations—on the empirical side, people have made a huge amount of progress on this problem through deep reinforcement learning. This was what kind of led to these amazing breakthroughs in solving games like Atari in the last decade. But if you look at these results, the gains are somehow more coming from the, like, inductive bias or the, like, generalization abilities of deep learning and not necessarily from the specific algorithms. So, like, current algorithms do not actually explore very deliberately, and so their sample efficiency is quite poor. Like, it’s hard to draw a one-to-one comparison, but you can argue they need, like, far more experience than a human would to solve these sorts of problems. So it’s not clear that we’re really anywhere near the ceiling of what can be achieved in terms of, like, how efficiently can you have, you know, an agent learn to solve new problems from trial and error. And I think better algorithms here could potentially be, like, transformative in a lot of different domains. To get into this specific work, I think there’s a couple of important takeaways for researchers. One is that by giving this impossibility result that shows that RL with latent dynamics is impossible without further assumptions, we’re kind of narrowing down the search space where other researchers can look for efficient algorithms. The second takeaway is, you know, we are showing that this problem becomes tractable when you make additional assumptions. But I view these more as, like, a proof of concept. 
Like, we’re kind of, showing for the first time that it is possible to do something nontrivial, but I think a lot more work and research will be required in order to like, you know, build on this and take this to something that can lead to, like, practical algorithms.

TINGLE: Well, Dylan Foster, thank you for joining us today to discuss your paper on reinforcement learning under latent dynamics. We certainly appreciate it.

FOSTER: Thanks a lot. Thanks for having me.

[MUSIC]

TINGLE: And to our listeners, thank you all for tuning in. If you’d like to read Dylan’s paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research, and we hope you’ll join us next time on Abstracts!

[MUSIC FADES]

The post Abstracts: NeurIPS 2024 with Dylan Foster appeared first on Microsoft Research.

]]>
Abstracts: November 14, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-november-14-2024/ Thu, 14 Nov 2024 15:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1101918 The efficient simulation of molecules has the potential to change how the world understands biological systems and designs new drugs and biomaterials. Tong Wang discusses AI2BMD, an AI-based system designed to simulate large biomolecules with speed and accuracy.

The post Abstracts: November 14, 2024 appeared first on Microsoft Research.

]]>
Outlined illustrations of Tong Wang and Bonnie Kruft for the Microsoft Research Podcast, Abstracts series.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Microsoft Senior Researcher Tong Wang joins guest host Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science, to discuss “Ab initio characterization of protein molecular dynamics with AI2BMD.” In the paper, which was published by the scientific journal Nature, Wang and his coauthors detail a system that leverages AI to advance the state of the art in simulating the behavior of large biomolecules. AI2BMD, which is generalizable across a wide range of proteins, has the potential to advance solutions to scientific problems and enhance biomedical research in drug discovery, protein design, and enzyme engineering.

Transcript

[MUSIC]

BONNIE KRUFT: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES] 

I’m Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science and your host for today. Joining me is Tong Wang, a senior researcher at Microsoft. Tong is the lead author of a paper called “Ab initio characterization of protein molecular dynamics with AI2BMD,” which has just been published by the top scientific journal Nature. Tong, thanks so much for joining us today on Abstracts!

TONG WANG: Thank you, Bonnie.

KRUFT: Microsoft Research is one of the earliest institutions to apply AI in biomolecular simulation research. Why did the AI for Science team choose this direction, and—with this work specifically, AI2BMD—what problem are you and your coauthors addressing, and why should people know about it?

WANG: So as Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and the wigglings of atoms.” Studying the mechanisms behind biological processes and developing biomaterials and drugs requires a computational approach that can accurately characterize the dynamic motions of biomolecules. When we review the computational research on biomolecular structure, we get two key messages. First, in recent years, predicting crystal, or static, protein structures with AI-powered methods has achieved great success and just won the Nobel Prize in Chemistry last month. However, characterizing the dynamic structures of proteins is more meaningful for the biology, drug, and medicine fields but is much more challenging. Second, molecular dynamics simulation, or MD, is one of the most widely used approaches to study protein dynamics, and it can be roughly divided into classical molecular dynamics simulation and quantum molecular dynamics simulation. Both approaches have been developed for more than half a century and have won Nobel Prizes. Classical MD is fast but less accurate, while quantum MD is very accurate but computationally prohibitive for protein studies. However, we need both accuracy and efficiency to detect the biomechanisms. Thus, applying AI in biomolecular simulation can become a third way to achieve both ab initio—or first principles—accuracy and high efficiency. In the winter of 2020, we foresaw that AI could make a difference in biomolecular simulations. Thus, we chose this direction.
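For listeners less familiar with MD, the classical simulation Tong contrasts with quantum methods boils down to repeatedly integrating Newton's equations under a force field. A minimal velocity-Verlet step in Python might look like this (an illustrative sketch only; in AI-driven MD such as AI2BMD, the hand-built force field is replaced by a learned potential with near ab initio accuracy):

```python
def velocity_verlet(pos, vel, force_fn, mass, dt, n_steps):
    """Minimal classical-MD integrator (velocity Verlet) for point particles
    in one dimension. force_fn maps positions to forces. Illustrative sketch,
    not code from the paper."""
    forces = force_fn(pos)
    for _ in range(n_steps):
        # Half-step velocity update, then full-step position update
        vel = [v + 0.5 * dt * f / mass for v, f in zip(vel, forces)]
        pos = [x + dt * v for x, v in zip(pos, vel)]
        # Recompute forces at the new positions, finish the velocity update
        forces = force_fn(pos)
        vel = [v + 0.5 * dt * f / mass for v, f in zip(vel, forces)]
    return pos, vel
```

With a simple harmonic force, this integrator conserves total energy closely over many steps, which is the basic correctness check for any MD integrator.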

KRUFT: It took four years from the idea to the launch of AI2BMD, and there were many important milestones along the way. First, talk about how your work builds on and/or differs from what’s been done previously in this field, and then give our audience a sense of the key moments and challenges along the AI2BMD research journey.

WANG: First, I’d like to say that applying AI in biomolecular simulation is a novel research field. For AI-powered MD simulation of large biomolecules, there was no existing dataset, no well-designed machine learning model for the interactions between atoms and molecules, no clear technical roadmap, and no mature AI-based simulation system. So we faced new challenges every day. Second, there were some other works exploring this area at the same time. I think a significant difference between AI2BMD and the other works is that they require generating new data and training the deep learning models for any new protein. So they take a protein-specific solution. In contrast, AI2BMD proposes a generalizable solution for a wide range of proteins. To achieve it, as you mentioned, there were some key milestones during the four-year journey. The first is that we proposed a generalizable protein fragmentation approach that divides proteins into the 20 commonly used kinds of dipeptides. Thus, we don’t need to generate data for various proteins. Instead, we only need to sample the conformational space of such dipeptides. So we built the protein unit dataset, which contains about 20 million samples with ab initio accuracy. Then we proposed ViSNet, a graph neural network for molecular geometry modeling, as the machine learning potential for AI2BMD. Furthermore, we designed the AI2BMD simulation system to efficiently leverage CPUs and GPUs at the same time, achieving a simulation speed hundreds of times faster than one year before and bringing the AI-driven simulation down to only ten to a hundred milliseconds per simulation step. Finally, we evaluated AI2BMD on energy, force, free energy, J coupling, and many other kinds of property calculations for tens of proteins and also applied AI2BMD in a drug development competition. All of this was done by a great team with science and engineering expertise and with great leadership and support from the AI for Science lab.
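The fragmentation idea Tong describes can be sketched at the sequence level: slide a window of length two over the chain so that only dipeptide units ever need to be sampled. (A deliberate simplification in Python; the real AI2BMD scheme fragments 3D protein structures rather than strings, and the function name here is ours.)

```python
def fragment_into_dipeptides(sequence):
    """Sketch of a fragmentation idea like the one described: break a protein
    sequence into overlapping dipeptide units so that a model trained only on
    dipeptide conformations can generalize to whole proteins. The actual
    AI2BMD approach operates on 3D structures; this string-level version is
    purely illustrative."""
    # Each window of two consecutive residues is one dipeptide fragment.
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]
```

Because the fragment vocabulary is fixed in advance, no new training data is needed when a new protein arrives, which is the generalizability Tong highlights.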

KRUFT: Tell us about how you conducted this research. What was your methodology?

WANG: Since we were exploring an interdisciplinary research topic, our team consists of experts and students with biology, chemistry, physics, math, computer science, and engineering backgrounds. Teamwork across different kinds of expertise was key to the AI2BMD research. Furthermore, we collaborated and consulted with many senior experts in the molecular dynamics simulation field, and they provided very insightful and constructive suggestions for our research. Another aspect of the methodology I’d like to emphasize is learning from negative results. Negative results happened most of the time during the study. What we did was constantly analyze the negative results and adjust our algorithm and model accordingly. There’s no perfect solution for a research topic, and we are always on the way.

KRUFT: AI2BMD got some upgrades this year, and as we mentioned at the top of the episode, the work around the latest system was published in the scientific journal Nature. So tell us, Tong—what is new about the latest AI2BMD system? 

WANG: Good question. We posted a preliminary version of the AI2BMD manuscript on bioRxiv last summer. I’d like to share three important upgrades from the past year and a half. The first is a simulation speed-up of hundreds of times for AI2BMD, which makes it one of the fastest AI-driven MD simulation systems and enables much longer simulations than before. The second is that AI2BMD was applied to many protein property calculations, such as enthalpy, heat capacity, folding free energy, pKa, and so on. Furthermore, we have been closely collaborating with the Global Health Drug Discovery Institute, GHDDI, a nonprofit research institute founded and supported by the Gates Foundation, to leverage AI2BMD and other AI capabilities to accelerate drug discovery processes.

KRUFT: What significance does AI2BMD hold for research in both biology and AI? And also, what impact does it have outside of the lab, in terms of societal and individual benefits?

WANG: Good question. For biology, AI2BMD provides a much more accurate approach than those used over the past several decades to simulate protein dynamic motions and study bioactivity. For AI, AI2BMD proves that AI can make a big difference in the study of dynamic protein structures, beyond AI for static protein structure prediction. Driven by AI2BMD and other works, I can foresee a coming age of AI-driven biomolecular simulation: providing binding free-energy calculations with quantum simulation accuracy for the complex of a drug and its target protein in drug discovery, detecting more flexible biomolecular conformational changes that molecular mechanics cannot capture, and opening more opportunities for enzyme engineering and vaccine and antibody design.

KRUFT: AI is having a profound influence on the speed and breadth of scientific discovery, and we’re excited to see more and more talented people joining us in this space. What do you want our audience to take away from this work, particularly those already working in the AI for Science space or looking to enter it?

WANG: Good question. I’d like to share three points from my research experience. The first is to aim high. Exploring a disruptive research topic is better than doing 10 incremental works. Over the years of research, our organization has always encouraged us to do big things. The second is persistence. I remember a computer scientist once said that about 90% of the time during research is failure and frustration. The rate is even higher when exploring a new research direction. In the AI2BMD study, when we suffered from research bottlenecks that could not be tackled for several months, when we received critical comments from reviewers, when some team members wanted to give up and leave, I always encouraged everyone to persist, and we made it. More importantly, the foundation of persistence is to ensure your research direction is meaningful and to constantly adjust your methodology based on failures and critical feedback. The third is real-world applications. Our aim is to leverage AI to advance science. Proposing scientific problems is the first step, then developing AI tools and evaluating them on benchmarks and, more importantly, examining their usefulness in real-world applications and further developing your AI algorithms. In this way, you can close the loop of AI for Science research.

KRUFT: And, finally, Tong, what unanswered questions or unsolved problems remain in this area, and what’s next on the agenda for the AI2BMD team?

WANG: Well, I think AI2BMD is a starting point for the coming age of AI-driven MD for biomolecules. There are lots of new scientific questions and challenges emerging in this new field. For example, how to expand the simulated molecules from proteins to other kinds of biomolecules; how to describe biochemical reactions during the simulations; how to further improve simulation efficiency and robustness; and how to apply it to more real-world scenarios. We warmly welcome people from both academia and industry to work together with us and join efforts to push the frontier of this new field forward.

[MUSIC]

KRUFT: Well, Tong, thank you for joining us today, and to our listeners, thanks for tuning in. If you want to read the full paper on AI2BMD, you can find a link at aka.ms/abstracts, or you can read it on the Nature website. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: November 14, 2024 appeared first on Microsoft Research.

]]>
Abstracts: November 5, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-november-5-2024/ Tue, 05 Nov 2024 19:30:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1099821 Researchers Chris Hawblitzel and Jay Lorch share how progress in programming languages and verification approaches are bringing bug-free software within reach. Their work on the Rust verification tool Verus won the Distinguished Artifact Award at SOSP ’24.

The post Abstracts: November 5, 2024 appeared first on Microsoft Research.

]]>
Outlined illustrations of Chris Hawblitzel and Jay Lorch for the Microsoft Research Podcast, Abstracts series.

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Microsoft senior principal researchers Chris Hawblitzel and Jay Lorch join host Amber Tingle to discuss “Verus: A Practical Foundation for Systems Verification,” which received the Distinguished Artifact Award at this year’s Symposium on Operating Systems Principles, or SOSP. In their research, Hawblitzel, Lorch, and their coauthors leverage advances in programming languages and formal verification with two aims. The first aim is to help make software verification more accessible for systems developers so they can demonstrate their code will behave as intended. The second aim is to provide the research community with sound groundwork to tackle the application of formal verification to large, complex systems. 

Transcript 

[MUSIC] 

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Our guests today are Chris Hawblitzel and Jay Lorch. They are both senior principal researchers at Microsoft and two of the coauthors on a paper called “Verus: A Practical Foundation for Systems Verification.” This work received the Distinguished Artifact Award at the 30th Symposium on Operating Systems Principles, also known as SOSP, which is happening right now in Austin, Texas. Chris and Jay, thank you for joining us today for Abstracts and congratulations!

JAY LORCH: Thank you for having us. 

CHRIS HAWBLITZEL: Glad to be here. 

TINGLE: Chris, let’s start with an overview. What problem does this research address, and why is Verus something that the broader research community should know about? 

HAWBLITZEL: So what we’re trying to address is a very simple problem where we’re trying to help developers write software that doesn’t have bugs in it. And we’re trying to provide a tool with Verus that will help developers show that their code actually behaves the way it’s supposed to; it obeys some sort of specification for what the program is supposed to do. 

TINGLE: How does this publication build on or differ from other research in this field, including your previous Verus-related work? 

HAWBLITZEL: So formal verification is a process where you write down what it is that you want your program to do in mathematical terms. So if you’re writing an algorithm to sort a list, for example, you might say that the output of this algorithm should be a new list that is a rearrangement of the elements of the old list, but now this rearrangement should be in sorted order. So you can write that down using standard mathematics. And now given that mathematical specification, the challenge is to prove that your piece of software written in a particular language, like Java or C# or Rust, actually generates an output that meets that mathematical specification. So this idea of using verification to prove that your software obeys some sort of specification, this has been around for a long time, so, you know, even Alan Turing talked about ways of doing this many, many decades ago. The challenge has always been that it’s really hard to develop these proofs for any large piece of software. It simply takes a long time for a human being to write down a proof of correctness of their software. And so what we’re trying to do is to build on earlier work in verification and recent developments in programming languages to try to make this as easy as possible and to try to make it as accessible to ordinary software developers as possible. So we’ve been using existing tools. There are automated theorem provers—one of them from Microsoft Research called Z3—where you give it a mathematical formula and ask it to prove that the formula is valid. We’re building on that. And we’re also taking a lot of inspiration from tools developed at Microsoft Research and elsewhere, like Dafny and F* and so on, that we’ve used in the past for our previous verification projects. And we’re trying to take ideas from those and make them accessible to developers who are using common programming languages. In this case, the Rust programming language is what we’re focusing on. 
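The sorting specification Chris describes can be written down formally. Here is a sketch in Lean 4 (our own rendering, assuming `List.Perm` from the standard library; Verus itself expresses such specifications in Rust, but the mathematical content is the same):

```lean
-- "Sorted" means each element is at most its successor.
def isSorted : List Int → Prop
  | [] => True
  | [_] => True
  | a :: b :: rest => a ≤ b ∧ isSorted (b :: rest)

-- A sort function meets the spec when, for every input, its output is
-- sorted and is a rearrangement (permutation) of that input.
def sortSpec (sort : List Int → List Int) : Prop :=
  ∀ xs : List Int, isSorted (sort xs) ∧ List.Perm (sort xs) xs
```

Verification then means proving, once and for all inputs, that a concrete implementation satisfies a spec like `sortSpec`, which is the kind of obligation an automated theorem prover such as Z3 helps discharge.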

TINGLE: Jay, could you describe your methodology for us and maybe share a bit about how you and your coauthors tested the robustness of Verus.

LORCH: So the question we really want to answer is, is Verus suitable for systems programming? So that means a variety of things. Is it amenable to a variety of kinds of software that you want to build as part of a system? Is it usable by developers? Can they produce compact proofs? And can they get timely feedback about those proofs? Can the verifier tell you quickly that your proof is correct or, if it’s wrong, that it’s wrong and guide you to fix it? So the main two methodological techniques we used were millibenchmarks and full systems. So the millibenchmarks are small pieces of programs that have been verified by other tools in the past, and we built them in Verus and compared to what other tools would do to find whether we could improve usability. And we found generally that we could verify the same things but with more compact proofs and proofs that would give much snappier feedback. The difference between one second and 10 seconds might not seem a lot, but when you’re writing code and working with the verifier, it’s much nicer to get immediate feedback about what is wrong with your proof so you can say, oh, what about this? And it can say, oh, well, I still see a problem there. And you could say, OK, let me fix that. As opposed to waiting 10, 20 seconds between each such query to the verifier. So the millibenchmarks helped us evaluate that. And for the macrobenchmarks, building entire systems, we built a couple of distributed systems that had been verified before—a key-value store and a node replication system—to show that you could do them more effectively and with less verification time. We also built some new systems, a verified OS page table, a memory allocator, and a persistent memory append-only log. 

TINGLE: Chris, the paper mentions that successfully verifying system software has required—you actually use the word heroic to describe the developer effort. Thinking of those heroes in the developer community and perhaps others, what real-world impact do you expect Verus to have? What kind of gains are we talking about here? 

HAWBLITZEL: Yeah, so I think, you know, traditionally verification or this formal software verification that we’re doing has been considered a little bit of a pie-in-the-sky research agenda. Something that people have applied to small research problems but has not necessarily had a real-world impact before. And so I think it’s just, you know, recently, in the last 10 or 15 years, that we started to see a change in this and started to see verified software actually deployed in practice. So on one of our previous projects, we worked on verifying the cryptographic primitives that people use when, say, they browse the web or something and their data is encrypted. So in these cryptographic primitives, there’s a very clear specification for exactly what bytes you’re supposed to produce when you encrypt some data. And the challenge is just writing software that actually performs those operations and does so efficiently. So in one of our previous projects that we worked on called HACL* and EverCrypt, we verified some of the most commonly used and efficient cryptographic primitives for things like encryption and hashing and so on. And these are things that are actually used on a day-to-day basis. So we, kind of, took from that experience that the tools that we’re building are getting ready for prime time here. We can actually verify software that is security critical, reliability critical, and is in use. So some of the things that Jay just mentioned, like verifying, you know, persistent memory storage systems and so on, those are the things that we’re looking at next for software that would really benefit from reliability and where we can formally prove that your data that’s written to disk is read correctly back from disk and not lost during a crash, for example. So that’s the kind of software that we’re looking to verify to try to have a real-world impact. 

LORCH: The way I see the real-world impact is that it’s going to enable Microsoft to deal with a couple of challenges that are severe and increasing in scale. So the first challenge is attackers, and the second challenge is the vast scale at which we operate. There’s a lot of hackers out there with a lot of resources that are trying to get through our defenses, and every bug that we have offers them purchase, and techniques like this, that can get rid of bugs, allow us to deal with that increasing attacker capability. The other challenge we have is scale. We have billions of customers. We have vast amounts of data and compute power. And when you have a bug that you’ve thoroughly tested but then you run it on millions of computers over decades, those rare bugs eventually crop up. So they become a problem, and traditional testing has a lot of difficulty finding those. And this technology, which enables us to reason about the infinite possibilities in a finite amount of time and observe all possible ways that the system can go wrong and make sure that it can deal with them, that enables us to deal with the vast scale that Microsoft operates on today.

HAWBLITZEL: Yeah, and I think this is an important point that differentiates us from testing. Traditionally, you find a bug when you see that bug happen in running software. With formal verification, we’re catching the bugs before you run the software at all. We’re trying to prove that on all possible inputs, on all possible executions of the software, these bugs will not happen, and it’s much cheaper to fix bugs before you’ve deployed the software that has bugs, before attackers have tried to exploit those bugs. 

TINGLE: So, Jay, ideally, what would you like our listeners and your fellow SOSP conference attendees to tell their colleagues about Verus? What’s the key takeaway here? 

LORCH: I think the key takeaway is that it is possible now to build software without bugs, to build systems code that is going to obey its specification on all possible inputs always. We have that technology. And this is possible now because a lot of technology has advanced to the point where we can use it. So for one thing, there’s advances in programming languages. People are moving from C to Rust. They’ve discovered that you can get the high performance that you want for systems code without having to sacrifice the ability to reason about ownership and lifetimes, concurrency. The other thing that we build on is advances in computer-aided theorem proving. So we can really make compact and quick-to-verify mathematical descriptions of all possible behaviors of a program and get fast answers that allow us to rapidly turn around proof challenges from developers. 

TINGLE: Well, finally, Chris, what are some of the open questions or future opportunities for formal software verification research, and what might you and your collaborators tackle next? I heard a few of the things earlier. 

HAWBLITZEL: Yes, I think despite, you know, the effort that we and many other researchers have put into trying to make these tools more accessible, trying to make them easier to use, there still is a lot of work to prove a piece of software correct, even with advanced state-of-the-art tools. And so we’re still going to keep trying to push to make that easier. Trying to figure out how to automate the process better. There’s a lot of interest right now in artificial intelligence for trying to help with this, especially if you think about artificial intelligence actually writing software. You ask it to write a piece of software to do a particular task, and it generates some C code or some Rust code or some Java code, and then you hope that that’s correct because it could have generated any sort of code that performs the right thing or does total nonsense. So it would be really great going forward if when we ask AI to develop software, we also expect it to create a proof that the software is correct and does what the user asked for. We’ve started working on some projects, and we found that the AI is not quite there yet for realistic code. It can do small examples this way. But I think this is still a very large challenge going forward that could have a large payoff in the future if we can get AI to develop software and prove that the software is correct. 

LORCH: Yeah, I see there’s a lot of potential synergy between AI and verification. Artificial intelligence can solve one of the key challenges of verification, namely making it easy for developers to write that code. And verification can solve one of the key challenges of AI, which is hallucinations, synthesizing code that is not correct, and Verus can verify that that code actually is correct. 

TINGLE: Well, Chris Hawblitzel and Jay Lorch, thank you so much for joining us today on the Microsoft Research Podcast to discuss your work on Verus. 

[MUSIC] 

HAWBLITZEL: Thanks for having us. 

LORCH: Thank you. 

TINGLE: And to our listeners, we appreciate you, too. If you’d like to learn more about Verus, you’ll find a link to the paper at aka.ms/abstracts or you can read it on the SOSP website. Thanks for tuning in. I’m Amber Tingle, and we hope you’ll join us again for Abstracts.

[MUSIC FADES] 

The post Abstracts: November 5, 2024 appeared first on Microsoft Research.

]]>