Research Forum Brief | January 2024

Research Forum Closing Remarks and Announcements

Ashley Llorens, VP and Distinguished Scientist at Microsoft Research, presents closing remarks and announcements at Microsoft Research Forum, January 30, 2024.

Presented by Ashley Llorens at Microsoft Research Forum, January 2024


“As the path from research to reality continues to speed up, we remain committed to openly sharing the latest, and coming together to make sense of where we are and where we’re headed.”

Ashley Llorens, VP and Distinguished Scientist, Microsoft

Transcript

Ashley Llorens, VP and Distinguished Scientist, Microsoft

Microsoft Research Forum, January 30, 2024

ASHLEY LLORENS: As AI accelerates, it’s more important than ever that we engage across disciplines, organizations, and geographies. Last fall, we issued a call for proposals for MSR’s first-ever AI & Society Fellows program. Through this investment, we aim to create deep interdisciplinary collaborations to help maximize the value of AI for people and society. Today we’re thrilled to announce our first cohort of fellows. Here’s more about the program and one of the research challenges we’ll be pursuing.

[Beginning of presentation on AI & Society Fellows program] 

HANNA WALLACH, Microsoft Research New York: AI is everywhere. Companies all over the tech industry are pivoting to AI-first strategies, and it kind of feels like literally everyone is talking about large language models at the moment. MSR is launching a new AI & Society Fellows program that’s intended to bring together people from within and outside of Microsoft.  

[Slide on research challenges the program focuses on, including intersection of creatives, AI, and society; regulatory innovation for AI drug development; and responsible AI in practice] 

DANIELA MASSICETI, Microsoft Research Cambridge UK, 2024 AI & Society principal investigator working on reducing the digital divide of generative AI in the Global South: Our team at Microsoft Research has been deeply studying how well multimodal models will work for blind and low-vision communities when we start to integrate these models into visual-assistive technologies. So what I’m really excited about with this fellowship with Ishtiaque is that it will allow us to extend our understanding to how well these multimodal models will work not only in a new part of the world—in this case, in Bangladesh, which is broadly considered a country in the Global South—but also to a new community—in this case, to artists and designers who are using these generative image models in their day-to-day work. 

SYED ISHTIAQUE AHMED, University of Toronto, 2024 AI & Society fellow: I’m excited because this project will help me to bring the benefits of artificial intelligence to my own people back in Bangladesh, some of the communities who have been historically marginalized. I’m also excited because this project will allow me to work with some of the finest researchers in the world in Microsoft Research.  

MASSICETI: Ishtiaque is already deeply embedded with this community of Bangladeshi artists and designers, having already led some really culturally sensitive ethnographic work with this community, and so I think this expertise and background that he brings will really help us drive this fellowship work forward and deepen our understanding of how these multimodal models will need to work in order to be truly useful to diverse communities in the Global South. 

ISHTIAQUE AHMED: The modern text-to-image tools like DALL-E or Midjourney, they work on a huge database of images which are mostly sourced from the Western world, so the output image looks like a Western image. So when a person from the Global South tries to use these models for producing an image in their context, these models do not actually produce good results. The objective of this project is to find out where exactly the system is broken and how we can come up with a better technology with the local people to fix the system so that these people can get the benefits of this artificial intelligence.  

MASSICETI: Microsoft Research is specifically well placed to work on this quest because it has the multidisciplinarity that is required to answer these complex questions that span sociotechnical and socioeconomic issues. In fact, researchers at the Microsoft Africa Research Institute and the Microsoft Research India lab are already leading the charge in this space with their current work looking at models like GPT-4 and how they work with low-resourced African and Asian languages. And finally, working with Microsoft, there is the real potential for your research to really shape and influence products and services that are then used by millions of people around the world. 

Learn about all of our 2024 AI & Society fellows and research challenges.

[End of presentation on AI & Society Fellows program] 

LLORENS: Foundation models are driving a fundamental shift in how research is done—in AI, across the sciences, and in just about every domain of application. Through our Accelerate Foundation Models Research program, we issue grants that make leading models, hosted through Microsoft Azure, accessible to academic research teams.

To date, our grants are supporting nearly 200 projects across 80 research institutions around the world. These projects include work in AI model innovation and evaluation, responsible AI, health, AI for scientific discovery, and more. You can learn more about the projects and researchers doing this important work at the link below.

https://aka.ms/afmr

Thanks for joining our first Microsoft Research Forum. As the path from research to reality continues to speed up, we remain committed to openly sharing the latest and coming together to make sense of where we are and where we’re headed. To learn more about the people, projects, and publications we’ve shared today, just ask our new research copilot.

See you next time.

Kahani: Visual Storytelling through Culturally Nuanced Images

Presented by Sameer Segal at Microsoft Research Forum, January 2024


“[Project Kahani is] trying to bring not only visually stunning images but also bring in cultural nuances to it. Past work has shown that diffusion models tend to stereotype and fail to understand local words, but they don’t provide ways to overcome these shortcomings without modifying the model or using fine-tuning.”

Sameer Segal, Principal Research Software Development Engineer

Transcript – Lightning Talk 5: Kahani: Visual Storytelling through Culturally Nuanced Images

Sameer Segal, Principal Research Software Development Engineer, Microsoft Research India 

Sameer Segal discusses Kahani, a research prototype that allows the user to create visually stunning and culturally nuanced images just by describing them in their local languages. 

Microsoft Research Forum, January 30, 2024 

SAMEER SEGAL: Hi, everyone. My name is Sameer Segal. I’m a principal research engineer at the Microsoft Research India lab. I’ve always been passionate about technology and societal impact. I was an entrepreneur for 10 years before I joined MSR (Microsoft Research), and it’s been absolutely wonderful the last couple of years that I’ve been here because I’m pursuing my passion at a whole new level of scale.

I’m also a father to a 6-year-old daughter, and like most parents with kids this age, you spend a lot of time making up stories—sometimes to teach important lessons like how to be kind and sometimes just for fun. In India, we have a great repertoire of folktales, but unfortunately, they’re not visually appealing to the kids of today. With all these recent advancements in generative AI like large language models and diffusion models, wouldn’t it be great if we could create visual stories? That’s what our Project Kahani is trying to do. It’s trying to bring not only visually stunning images but also bring in cultural nuances to it.  

Past work has shown that diffusion models tend to stereotype and fail to understand local words, but they don’t provide ways to overcome these shortcomings without modifying the model or using fine-tuning in significant ways. The other big problem is that to get that perfect image, you need to do a lot of prompting, and sometimes if you use tools like Adobe Photoshop or use fine-tuning, this puts it out of reach for laypeople. And that’s really sad because these models were meant to be a force of democratization.  

Our project started off at an internal hackathon a few months ago and has now evolved into a research project. Let me show you what we have built.  

I’m going to paste a prompt inspired by a story that my daughter and I recently read. It’s about Geetha, a girl who lives near a forest near BR Hills. And it’s about her unexpected friendship with a butterfly and a firefly. And we want to emphasize how important it is to be kind to your friends. So the system takes this instruction, and it tries to pick up the cultural nuances from this and generate a story. And then from there, it creates characters and scenes. And here is, you know, an example of what is done, right, so about Geetha, who meets a butterfly that’s stuck in a cobweb. If I’d like to add more to the story, I can make changes and just add new instructions. But if I’d like to add specific instructions, let’s say, on this particular slide … you know, in villages in India, we have something called a Nazar Battu, which wards off evil. So what I can do is I can pull up the scene and just make a little hole here to place the object that I want, and I am going to give the system a reference image. I am going to tell it that this is a Nazar Battu. And let’s see what it does with this. [PAUSE] There you have it. It was pretty easy to get a word that the model doesn’t really understand right where we wanted in the context of our story.  

Let me show you how this was happening. From my prompt, we were able to extract these cultural elements. The large language models are especially good here: they were able to understand that BR Hills is a place in southern India. And from these cultural nuances, we were able to create a specific prompt that was able to generate this character. And from this character, we were able to compose and create various scenes, right. Now it’s not perfect, but it’s a big step up from where we were with just the models. This work required us to do a series of benchmarking exercises where we tried out different prompts with names, visual descriptions, and definitions, and we would generate the image and compare that to a reference image that we got from a search engine. And GPT-4 [with] Vision was used as a judge to decide whether the image actually matched the reference image or not. 
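
For readers who want a concrete picture of this judging step, here is a minimal sketch of how a vision-capable model could compare a generated image with a reference image. The model name, prompt wording, and client setup are illustrative assumptions, not the exact pipeline used in Kahani.

```python
# Hypothetical sketch of model-as-judge image comparison, assuming the OpenAI
# Python client (v1.x) and base64-encoded image files. Model name and prompt
# are illustrative only.
import base64
from openai import OpenAI

client = OpenAI()


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def judge_match(generated_path: str, reference_path: str) -> str:
    """Ask a vision-capable model whether two images show the same concept."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Do these two images depict the same culturally specific "
                         "object or concept? Answer MATCH or NO MATCH, then give "
                         "one sentence of reasoning."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(generated_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(reference_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```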

We believe our work has tremendous potential. It can make local culture a lot more accessible, especially for image generation. And this can have application not just in storytelling and education but across domains.

Thank you.

Generative AI Meets Structural Biology: Equilibrium Distribution Prediction

Presented by Shuxin Zheng at Microsoft Research Forum, January 2024


“Understanding equilibrium distributions in molecular science is challenging but exciting. … By learning about the different states and the behavior of molecules, scientists can make breakthroughs in developing new drugs, creating advanced materials, and understanding biological processes.”

Shuxin Zheng, Principal Researcher

Transcript

Shuxin Zheng, Principal Researcher, Microsoft Research AI4Science 

Shuxin Zheng presents how his team uses generative AI to solve a long-standing challenge in structural biology and molecular science—predicting equilibrium distribution for molecular systems. 

Microsoft Research Forum, January 30, 2024

SHUXIN ZHENG: Hi, everyone. I’m Shuxin from Microsoft Research AI4Science. Thank you for joining this exciting discussion of our latest research, called Distributional Graphormer, which uses generative AI to solve a long-standing challenge in structural biology: the prediction of equilibrium distribution.

We begin by acknowledging the groundbreaking work in protein structure prediction. However, proteins are dynamic, constantly changing their conformation. This is where our research takes a pioneering step, focusing on the equilibrium distributions of these structures rather than a single static snapshot. 

Understanding equilibrium distributions in molecular science is challenging but exciting because it opens up new possibilities in diverse fields. By learning about the different states and the behavior of molecules, scientists can make breakthroughs in developing new drugs, creating advanced materials, and understanding biological processes.  

Our new approach, the Distributional Graphormer (DiG), brings generative AI technologies into thermodynamics, offering efficiency and accuracy in obtaining the equilibrium distribution for any molecular system, far beyond traditional methods like molecular dynamics simulation. It begins with any descriptor of a molecular system, for example, the sequence of amino acids of a protein, and from that descriptor predicts the system’s equilibrium distribution. 

Let’s dive into practical implications. Consider the case of B-Raf kinase, a protein linked to cancer. Traditional methods fail to capture its active and inactive states comprehensively. DiG, on the other hand, accurately samples these states, demonstrating its power in understanding the important dynamics. 

Let’s see a real-world application. The ability of DiG to predict a range of conformations of the main proteins of the SARS-CoV-2 virus provides insight that could revolutionize how we understand viral mutations and the development of drugs. DiG can also reveal the interaction between proteins and ligands and predict the binding free energy to aid in modern drug discovery. The transition pathway between conformations can be easily obtained with DiG by a fast interpolation in latent space.  
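
The latent-space interpolation idea itself is generic; the sketch below shows the basic pattern of encoding two end states, interpolating linearly between their latent codes, and decoding intermediate points. The encode and decode functions are placeholders standing in for a trained generative model such as DiG, so treat this as an illustration rather than the actual DiG interface.

```python
# Generic latent-space interpolation between two molecular conformations.
# `encode` and `decode` are placeholder callables for a trained generative
# model; they are assumptions for illustration, not a real DiG API.
import numpy as np


def transition_pathway(state_a, state_b, encode, decode, num_steps: int = 10):
    """Return a list of intermediate conformations between two end states."""
    z_a, z_b = encode(state_a), encode(state_b)
    pathway = []
    for t in np.linspace(0.0, 1.0, num_steps):
        z_t = (1.0 - t) * z_a + t * z_b   # linear interpolation in latent space
        pathway.append(decode(z_t))       # decode back to a 3D conformation
    return pathway
```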

Beyond protein systems, DiG can also predict equilibrium distributions for other molecular systems. For example, this figure shows the density of catalyst-adsorbate systems predicted by DiG compared with the results of DFT calculations.

In closing, DiG is a paradigm shift in molecular science—from structure prediction and molecular simulation to equilibrium distribution prediction with generative AI. Its potential applications are vast, touching upon areas from bioinformatics to materials discovery. I invite you to explore our new findings in the arXiv paper and engage with our interactive demo to witness the future of molecular science.

Thank you for your time.

Augmenting Human Cognition and Decision Making with AI

Presented by Jake Hofman at Microsoft Research Forum, January 2024


“How can we use AI to help people make better decisions, reason about information, be more productive, and ultimately even improve themselves?”

Jake Hofman, Senior Principal Researcher

Transcript

Jake Hofman, Senior Principal Researcher, Microsoft Research NYC 

Jake Hofman discusses recent research in building and evaluating AI tools for helping people make better decisions and improve their own capabilities. 

Microsoft Research Forum, January 30, 2024

JAKE HOFMAN: Hi, my name is Jake, and I’m excited to share some recent work that we’ve been up to at Microsoft Research New York City called augmenting human cognition and decision making with AI. And what I mean by that is very simple: how can we use AI to help people make better decisions, reason about information, be more productive, and ultimately even improve themselves?

And we’ve come up with a little sports analogy for thinking about the spectrum of ways that people might interact with AI tools. On the left, we have, sort of, the least desirable outcomes with an analogy to steroids, something that gives you a superhuman ability in the moment but can leave you worse off afterwards than you were, leading to long-term deskilling. An example being something like forgetting how to spell if you over-rely on spell check.  

In the middle, we have tools that act like a good pair of running sneakers. They give you a temporary boost in the moment, but there are no long-term consequences when you take them away. Here, you might think of something like saving time with cumbersome syntax using autocomplete. And on the right, we have perhaps the Holy Grail, a coach that not only helps you in the moment but helps you improve yourself in a long-lasting and sustainable way. And it may seem that these are discrete options, but in fact, we can make choices in how we design and use AI tools that can substantially impact how they affect people. And so I want to go through very quickly just two examples of studies we’ve done to think about the design and use of AI tools and how we can optimize them: first an LLM-based search study and a second one on an LLM-based tutor.  

So in the first study, we looked at a, sort of, sneaker scenario: how does LLM-based search affect decision making? We did this by asking people to research and choose between pairs of cars given certain criteria, and we randomized whether they had access to traditional search or LLM-based search. So some people saw the usual set of blue links, which was provided on the backend by the Bing Search API, while other people saw natural language responses generated by GPT-3.5. 

And here’s what we learned from this experiment. For routine tasks where the LLM provided accurate information, people were about twice as fast using the LLM-based search as they were using traditional search with comparable levels of accuracy. But when the LLM made a mistake—as it did here, indicated by the X over an incorrect number in the response—people basically didn’t notice, and they often made incorrect decisions themselves as a result. Thankfully, though, we found a simple fix. We added confidence-based highlighting similar to what you would see in a spelling or grammar check, and that greatly reduced overreliance on this incorrect information and improved people’s performance in the task, leaving all other measures unaffected. So this is one of those key design choices that can make a real difference. And experimentation was key for prototyping and validating it.
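
As a rough sketch of what confidence-based highlighting could look like, the snippet below marks low-confidence tokens in a generated answer using per-token log probabilities. The threshold and the way confidences are obtained are assumptions for illustration, not the study's actual implementation.

```python
# Hypothetical sketch: flag low-confidence tokens in an LLM answer so a reader
# can double-check them, much like a spelling or grammar check underline.
import math


def highlight_low_confidence(tokens, logprobs, threshold=0.6):
    """tokens: list of strings; logprobs: matching per-token log probabilities."""
    pieces = []
    for token, lp in zip(tokens, logprobs):
        prob = math.exp(lp)
        # Wrap uncertain tokens in markers that a UI could render as a highlight.
        pieces.append(f"[?{token}?]" if prob < threshold else token)
    return "".join(pieces)
```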

In our second study, we looked at more of a coach scenario for how LLM-based tutoring affects learning. So what we did is we randomized people into seeing different types of assistance at different times when they were practicing standardized math problems like the one you see here. Then we looked at their performance on a separate test where, very importantly, no one had any assistance so we could assess how much they themselves had learned. 

So in one condition—the answer-only condition—people tried a problem, and then they were just told whether they were right or wrong, and if they were wrong, they were shown the correct answer. In contrast, in a stock-LLM condition, people were given the explanation that vanilla GPT-4 out of the box provided. In this case, GPT-4 gives a correct but rather esoteric formula for the person to try to learn and memorize to solve the problem. And in a third and final condition, we had a customized LLM that was given a pre-prompt to emulate a human tutor, and it suggested more cognitively friendly strategies, which in this case involves choosing a value for an unspecified number in the problem to make it easier to solve. 

And the findings here are pretty straightforward. From this experiment, we saw that LLM explanations really boosted learning relative to seeing only answers, as shown by these two points on the right. But there were substantial benefits to having people use the tutor after having tried on their own first, as opposed to consulting the tutor before attempting the problem. We also saw directional evidence that the customized pre-prompt provides a small boost over the stock explanations. 

And so to wrap up, I hope these two studies have provided useful examples of just how much the choices we make in designing and deploying AI tools can matter and how important rigorous measurement and experimentation are to making sure that we maximize the benefits and minimize the risks of the tools that we build. So with that, I will wrap up. I have some links here to the papers I’ve discussed and looking forward to any comments and questions that folks might have.

Thank you.

Evaluation and Understanding of Foundation Models

Presented by Besmira Nushi at Microsoft Research Forum, January 2024


“We see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.”

Besmira Nushi, Principal Researcher

Transcript

Besmira Nushi, Principal Researcher, Microsoft Research AI Frontiers 

Besmira Nushi summarizes timely challenges and ongoing work on evaluating and in-depth understanding of large foundation models as well as agent platforms built upon such models. 

Microsoft Research Forum, January 30, 2024

BESMIRA NUSHI: Hi, everyone. My name is Besmira Nushi, and together with my colleagues at Microsoft Research, I work on evaluating and understanding foundation models. In our team, we see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.

But evaluation is hard, and new generative tasks are posing new challenges in evaluation and understanding. For example, it has become really difficult to scale up evaluation for long, open-ended, and generative outputs. At the same time, for emergent abilities, very often some benchmarks do not exist and often we have to create them from scratch. And even when they exist, they may be saturated or leaked into training datasets. In other cases, factors like prompt variability and model updates may be just as important as the quality of the model that is being tested in the first place. When it comes to end-to-end and interactive scenarios, other aspects of model behavior may get in the way and may interfere with task completion and user satisfaction. And finally, there exists a gap between evaluation and model improvement. 

In our work, we really see this as just the first step towards understanding new failure modes and new architectures through data and model understanding. So in Microsoft Research, when we address these challenges, we look at four important pillars. First, we build novel benchmarks and evaluation workflows. Second, we perform and put a focus on interactive and multi-agent systems evaluation. And in everything we do, in every report that we write, we put responsible AI at the center of testing and evaluation to understand the impact of our technology on society. Finally, to bridge the gap between evaluation and improvement, we pursue efforts in data and model understanding.  

But let’s look at some examples. Recently, in the benchmark space, we released KITAB. KITAB is a novel benchmark and dataset for testing constraint satisfaction capabilities for information retrieval queries that have certain user specifications in terms of constraints. And when we tested recent state-of-the-art models with this benchmark, we noticed that only in 50 percent of the cases these models are able to satisfy user constraints.
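
To give a flavor of what constraint-satisfaction scoring involves, here is a minimal sketch that checks a model's returned book list against simple query constraints. The field names and constraint types are illustrative assumptions; KITAB's actual constraints and metrics are defined in the benchmark itself.

```python
# Illustrative constraint checks for a query such as "books by author X
# published after 2010 whose title starts with 'S'". Field names and constraint
# types are assumptions, not KITAB's exact schema.
def satisfies(book: dict, constraints: dict) -> bool:
    if "published_after" in constraints and book["year"] <= constraints["published_after"]:
        return False
    if "title_starts_with" in constraints and not book["title"].startswith(constraints["title_starts_with"]):
        return False
    return True


def constraint_satisfaction_rate(returned_books: list, constraints: dict) -> float:
    """Fraction of returned items that satisfy all query constraints."""
    if not returned_books:
        return 0.0
    return sum(satisfies(b, constraints) for b in returned_books) / len(returned_books)
```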

And similarly, in the multimodal space, Microsoft Research just released HoloAssist. HoloAssist is a testbed with extensive amounts of data that come from recording and understanding how people perform tasks in the real and physical world. And this provides us with an invaluable amount of resources in terms of evaluation for understanding and measuring how the new models are going to assist people in things like task completion and mistake correction. In the responsible AI area, ToxiGen is a new dataset that is designed to measure and understand toxicity generation from language models. And it is able to measure harms that may be generated from such models across 13 different demographic groups. 

Similarly, in the multimodal space, we ran extensive evaluations to measure representational fairness and biases. For example, we tested several image generation models to see how they represent certain occupations, certain personality traits, and geographical locations. And we found that sometimes such models may present a major setback when it comes to representing different occupations if compared to real-world representation. For instance, in some cases, we see as low as 0 percent representation for certain demographic groups.  

Now when it comes to data and model understanding, often what we do is look back at architectural and model behavior patterns to see how they are tied to important and common errors in this space. For example, for the case of constraint satisfaction for user queries, we looked at factual errors and information fabrication and mapped them to important attention patterns. And we see that whenever factual errors occur, there are very weak attention patterns within the model that map to these errors. And this is an important finding that is going to inform our next steps in model improvement. 

So as we push the new frontiers in AI innovation, we are also just as excited about understanding and measuring that progress scientifically. And we hope that many of you are going to join us in that challenge.

Thank you.

Improving Reasoning in Language Models with LASER: Layer-Selective Rank Reduction

Presented by Dipendra Misra at Microsoft Research Forum, January 2024


“An LLM is trained on lots of data, often collected from the internet, and uses a model architecture, typically a transformer, to train the model, and they work remarkably well across a range of different tasks. And so one way perhaps we can build towards understanding [an] LLM is by performing interventions in the model and then seeing how that intervention reflects in [its performance].”

Dipendra Misra, Senior Researcher

Transcript

Dipendra Misra, Senior Researcher, Microsoft Research NYC and AI Frontiers

Dipendra Misra presents a surprising discovery: by merely replacing selected weight matrices in an LLM with a suitable low-rank approximation, you can significantly improve the performance of the LLM, at times by 20 to 30 percentage points.

Microsoft Research Forum, January 30, 2024 

DIPENDRA MISRA: Welcome, everyone. I’m Dipendra Misra, a researcher at Microsoft Research New York City and AI Frontiers, and I’m excited to be talking about our new method called LASER, which is Layer-Selective Rank Reduction, an approach for improving pretrained large language models. So large language models, or LLMs, have revolutionized machine learning, and yet there is so little we know about how they work. 

So in summary, an LLM is trained on lots of data, often collected from the internet, and uses a model architecture, typically a transformer, to train the model, and they work remarkably well across a range of different tasks. And so one way perhaps we can build towards understanding an LLM is by performing an intervention in the model and then seeing how that intervention reflects in the performance of the LLM. For example, we may find that performing a certain type of intervention may affect one type of task but not the other. And in this way, we may understand how the information about solving different tasks is stored inside the LLM. So with this motivation in mind, we introduce LASER, which is a type of intervention where we select one of the weight matrices of the LLM and replace it by its low-rank approximation. 

So in the bottom over here, we see our transformer architecture. If you’re not familiar with the details of it, that’s fine. What we need to know here is that the transformer architecture consists of repeated transformer blocks arranged in different layers, and each block has multiple weight matrices, which are shown here as squares. So, for example, here, to perform LASER, we select this weight matrix, which is highlighted in red, and it’s coming from layer No. 22, and we call it the \(W\) matrix here.

And to perform this low-rank approximation, we first use what’s called a singular value decomposition, which decomposes this matrix into three matrices called \(U\), \(Σ\), and \(V\). The \(Σ\) here contains the singular values of the matrix, arranged diagonally in decreasing order. So to perform the low-rank approximation, we throw away all the information in \(U\), \(Σ\), and \(V\) that is not in blue color, and then we multiply the remaining matrices, and we get the low-rank approximation, which is shown as \(W_{lr}\). And this is a very computationally efficient process and can be done easily with existing libraries.

So in summary, to perform a single LASER intervention, one has to make three choices. So first is which layer to select. Second is which type of weight matrix to edit. And third is how much approximation should be done. In our paper, we also study how these different LASER interventions can be composed across layers and applied simultaneously. So before discussing how to evaluate LASER, I want to mention that LASER also has the advantage of reducing the memory footprint of the model. And this is important because we are living in this age where the memory taken by LLMs is growing at an astonishing pace, and by reducing the memory footprint, we can allow more people to be able to use these LLMs and store them on device. 
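
To make this concrete, here is a minimal PyTorch sketch of a single LASER-style intervention. It assumes a GPT-J-style model whose blocks are exposed under model.transformer.h, and the layer index, matrix name, and rank fraction are illustrative choices rather than the settings used in the paper.

```python
# Minimal sketch of one LASER-style intervention, assuming a Hugging Face
# GPT-J-style model with torch.nn.Linear weight matrices. The module path and
# the example arguments are assumptions for illustration.
import torch


def low_rank_approximation(weight: torch.Tensor, rank_fraction: float) -> torch.Tensor:
    """Return the rank-reduced reconstruction of `weight` via SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    k = max(1, int(rank_fraction * S.numel()))       # number of singular values kept
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]  # multiply the truncated factors


def apply_laser(model, layer_idx: int, matrix_name: str, rank_fraction: float):
    """The three LASER choices: which layer, which weight matrix, how much approximation."""
    block = model.transformer.h[layer_idx]              # GPT-J-style block path (assumption)
    linear = dict(block.named_modules())[matrix_name]   # e.g. "mlp.fc_out" (assumption)
    with torch.no_grad():
        linear.weight.copy_(low_rank_approximation(linear.weight, rank_fraction))
    return model
```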

So for our first evaluation, we evaluate LASER on an existing GPT-J LLM and evaluate on the CounterFact question-answering dataset. The motivation for this is that the GPT-J LLM has its training data available publicly, which allows us to do interesting analysis with it, and the CounterFact question-answering dataset has paraphrases, which allows us to measure robustness to paraphrases. 

Now as I mentioned earlier, we are doing an intervention using LASER on the LLM, so one would expect that the model loss should go up as we are doing more approximation, meaning that the model is going to perform badly, right, because we are throwing [out] information from an LLM, which is trained on large amounts of data. But to our surprise, what we find [is] that if the right type of LASER intervention is performed, then the model loss doesn’t go up but actually goes down, meaning that we actually improve the pretrained LLM even more. 

So in this figure here, we show what happens when LASER is applied to the MLP matrices, and we see that if we apply LASER at the earlier layers, then the loss goes up. Here, the orange or yellow color shows that we’re doing less approximation, and black or blue means we are doing more approximation. So in the lower layers, we can see that the yellow has a lower loss, but the black has a higher loss. But if we apply LASER in the later layers, we see that the loss is actually decreasing as we do more approximation. And this is truly surprising.  

So does this hold more generally? So we find that, yes, this does hold across several tasks and in three different LLMs, namely RoBERTa, GPT-J, and Llama 2. And at times, we see surprising gains like 20 to 30 percentage points. For example, on this task of gender prediction using biographies, we see that the performance of GPT-J goes from 70.9 percent to 97.5 percent accuracy. And in our paper, we have more types of analysis. I’ll just briefly describe two of them quickly.

So one of them shows that if you apply LASER, then most of the gains that we get are from improvements on data points which are rarer in the training data. And we also find that the components that LASER removes from a weight matrix typically produce responses that are semantically plausible but incorrect. And so we can view LASER as a denoising process that is removing this erroneous information.  

So in conclusion, we present LASER, which is a new way of doing interventions in large language models, and we show a surprising result that performing LASER can increase the accuracy of these large language models while also reducing their memory footprint. And more details can be found in our paper, which is available on arXiv and will appear as a conference paper at the upcoming ICLR conference.

Thank you.

Keynote: Research in the Era of AI

Presented by Peter Lee at Microsoft Research Forum, January 2024


“As we work together as a research community in computer science, we are in this incredibly exciting stage, a stage of being disrupted personally as researchers … and a whole new vista of possibilities in front of us. And we are just incredibly excited within Microsoft Research to be living through this.”

Peter Lee, CVP, Microsoft Research & Incubations

Transcript

Peter Lee, Corporate Vice President, Microsoft Research and Incubations 

Peter Lee discusses how recent developments in AI have transformed the way Microsoft approaches research.  

Microsoft Research Forum, January 30, 2024

PETER LEE: Hi. I’m really pleased and excited to be here for this first Microsoft Research Forum, a series that we have here out of Microsoft Research to carry out some important conversations with the research and scientific community. 

This past year has been quite a memorable one. Just some incredible advances, particularly in AI, and I’ll spend a little bit of time talking about AI here to get us started. But before doing that, I thought I would try to at least share how I see what is happening in the broader context of scientific disruption. And to do that, I want to go all the way back to the 1700s and the emerging science of biology, the science of living things. Actually, in the 1700s, it was well understood by the end of that century that all living things were made up of cells—everything from trees and plants to bugs, animals, and human beings. But a fundamental scientific mystery that lingered for decades was, where do cells come from?

And a prevailing theory of that was this concept of cell crystallization. It has been understood in other areas that sometimes hard materials would crystallize into existence from fluid materials. And so the thought was that out of living fluids, under just the right conditions, cells would crystallize into existence. And a lot of biological research of the time was centered around that theory. And in fact, quite a few important and useful things came out of that line of research, research that even has impact medically today. Now of course, there was an alternative theory, which I think is credited to Robert Remak, that in fact cells get created through a process of cell division. And we know that this is true today. But it was really considered an alternative theory until Rudolf Virchow was actually able to witness the mitosis of cells, the division of cells, and in fact coined the aphorism that all cells come from other living cells. And this had a very significant impact on Virchow’s research and his research into what is now known as pathology.

Overnight, whole research legacies were rendered largely invalid because the whole concept of cell crystallization was then known to be invalid. But even the very foundational infrastructure of research at the time changed. In fact, after Virchow, to call yourself a researcher in biology, you had to have access to a new piece of research infrastructure called the microscope, and you had to be good at using it. And so while the researchers themselves of the time were not invalidated, they were disrupted in a really fundamental way. And of course, the discovery of mitosis really set biology research on the path ultimately to the discovery of DNA and the remarkable kinds of medical and biological advances we see in the field. 

Now I tell that story because when I think about that story—and I learned it first from the great biology researcher and medical scientist Sid Mukherjee at Columbia—I think about what we as computer scientists are going through today. We’ve now witnessed the incredible potential power of machine learning systems at scale and of specific architectures like neural transformers. And there are many possibilities, there are many challenges, and there are many mysteries. And furthermore, the infrastructure of what we do as computer science researchers, particularly in areas related to artificial intelligence, has changed in the same way that biology researchers need access to new infrastructure like microscopes. At least that was the case in the mid-1800s, when Virchow made his discovery.

Today, for a large segment of the kinds of research that we do, we now realize we need new types of infrastructure, infrastructure such as large datasets, access to large scale GPU compute, and even other training pipelines and foundations. And what we’re seeing is that this is affecting virtually everything that we do today. And so as we work together as a research community in computer science, we are in this incredibly exciting stage, a stage of being disrupted personally as researchers—many of us as researchers finding large parts of what we had been working on being changed, disrupted, or even invalidated—and a whole new vista of possibilities in front of us. And we are just incredibly excited within Microsoft Research to be living through this. There are difficult moments, to be sure, but also a sense of joy, a joy that comes from the realization that we are now living through something that is very special and very rare. 

So now what has this meant? And to do a little bit of a discussion about that, if you will permit me, I’d like to focus a little bit on the research, particularly in AI within Microsoft Research in this past year. We had the opportunity to do something unusual. And while I use the word opportunity, it was also a challenge. In our ongoing collaboration with OpenAI, when the new model that we now call GPT-4 was being made available for investigation and study within Microsoft Research—and this was toward the end of 2022—we, for various reasons, were required to do something that is exceptionally unusual for Microsoft Research, and that is to work in secret for a period of several weeks.  

And this is exceptionally unusual for Microsoft Research because almost all of the research we do at Microsoft is done in collaboration with external researchers, particularly at great universities all around the world. And so really for the first time, we were doing some core research in secret, and that remained secret until the release publicly of GPT-4 in March of 2023. That March of 2023 marked the time when we were allowed finally to speak publicly about GPT-4 in the wake of OpenAI’s public announcement of that model and allowed the publication of our initial findings on our own internal study and investigation of this model. And that has led to a paper that was tantalizingly titled “Sparks of Artificial General Intelligence,” or what is now oftentimes referred to as the Sparks paper.

That Sparks paper was really a turning point for many of us in the research community. It tried to show a series of example interactions with this new large language model that defied complete explanation in terms of the emergence of apparent cognitive capabilities. It was also a somewhat edgy or even controversial paper because of our then lack of ability to fully explain the core mechanisms about where these apparent capabilities were coming from. At the same time, it was a real chance to finally reach out and establish collaborations with many of you here today, applications and collaborations to understand, to what extent are these models able to do planning? What is the nature of causal reasoning and causal inference? Counterfactual reasoning? What is the interchange between fundamental reasoning abilities of these models versus world knowledge? To what extent are [they] commonsense reasoning? Decision making in controversial or morally charged circumstances? And fundamentally, what could this mean more broadly for us as people, for the communities we live in, and for societies?

The Sparks paper was just the beginning, and with many of you, we’ve had a series of important research advances that have been deepening our understanding in these and many other areas. And we’ve been trying to put these things into action as we have also worked with our own Microsoft product developers as well as researchers and product developers in other organizations, in other companies. We’ve really had to come to grips with the impact of AI technology on the world and on people. In our internal collaborations with our Bing product team, we devoted considerable effort in trying to understand the guardrails around responsible AI. And in fact, today at Microsoft, our responsible AI organization within Microsoft has hundreds of people really forming around understanding the impact not just of the potential harms and risks that AI technologies can bring when deployed widely at scale but the broader opportunities both for benefit as well as risks on society as a whole. 

So I’d like to say just a few more words about this whole concept of responsible AI. And in fact, I prefer to think of this as the area of AI and its impact on society or AI and society, for short. For us, in this new AI era, we started in a difficult space, because we devoted some of our best expertise across Microsoft but specifically Microsoft Research to building the guardrails and understanding the risks of our first integrations of GPT-4 into our products like Bing. And that devotion, in secret, ended up being noticeable to the research community after the release of ChatGPT, when we were in a somewhat difficult position in Microsoft Research of having to remain silent while the research community was starting to really delve into the research questions around AI safety of ChatGPT, in that case, and then later in GPT-4. What has happened since then, I think, is now a renaissance in our understanding, jointly with all of you, about the nature of AI risks and the whole debate around the future of AI and society—not only the future of work, which we care about a great deal in our own research here at Microsoft, but the impact on communities, on relationships, and societies in core areas like medicine and health care, in education, in finance, in law, and you name it. I’m extremely excited about what this has meant for our own research within Microsoft in an area that we call sociotechnical systems and specifically in something we call FATE: Fairness, Accountability, Transparency, and Ethics. 

There has never been a more exciting time and never been larger and more challenging research questions as well as bigger and more relevant opportunities for the impact of this research. And we can’t be more excited to be working with all of you on many of these things. Within Microsoft, this has also had a transformative effect. We have evolved from having a single organization for responsible AI to now deeply integrating responsible AI and, more broadly, AI and societal impact thinking into literally every engineering group across the company as well as areas in finance, security, and our legal departments. And so as we think about the future of AI and society, we just really look forward and we will be depending on our collaborations with all of you. 

Now it doesn’t just stop there, though. There are tremendous other areas. The actual costs of training, the necessity or not of scale, and the emergence of alternative models have become extremely important. And so another thread of research that has been exceptionally important in AI for us over the past year has been in the emerging area of open source and small language model development. And we’ve been very proud to have shared with the research community in open-source form a series of models called Phi. The Phi models are exceptionally interesting, because they’ve taken a new approach to the construction of training data to synthesize data to really focus on specific reasoning and problem-solving strategies as opposed to world knowledge. And this has led to a series of models starting with Phi-1, Phi-1.5, and now Phi-2 that have been devoted to open source in the hopes of encouraging and supporting the open-research community in developing greater understanding of the transparency and the limits of these types of models, to understand better the alignment issues, and to have further explication in the pretraining phase of areas related to AI safety and risk.

We’ve also been looking at platform and model orchestration. What will this world look like in the future, where there may be many models working together? And so we’ve been extremely proud of our work on AutoGen. AutoGen has provided, again in the open-source community, a way to very easily and rapidly get multiple AI systems collaborating together as independent agents to solve problems more easily—to, for example, have one model interact with a human being to solve problems, another model look over their shoulders to ensure that nothing goes wrong, and maybe even a third agent, which is a human being in the loop, doing various kinds of checks and balances.  
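
As a concrete illustration of this multi-agent pattern, here is a minimal AutoGen-style sketch with an assistant agent and a user proxy that keeps a human in the loop and can execute the assistant's code. Configuration details vary across AutoGen releases, so treat this as a sketch rather than the exact current API.

```python
# Minimal two-agent AutoGen sketch (pyautogen 0.2-era API); the model name,
# API key, and task message are placeholders.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_KEY"}]}

# Agent that plans and writes code or answers.
assistant = autogen.AssistantAgent(name="assistant", llm_config=llm_config)

# Proxy that represents the human in the loop and can run the assistant's code,
# providing a check on what the assistant proposes.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="ALWAYS",
    code_execution_config={"work_dir": "scratch"},
)

user_proxy.initiate_chat(
    assistant,
    message="Summarize the trade-offs of low-rank compression for language models.",
)
```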

We’ve also been studying how we can extend our ability to train these models for specific domains. Our work on Orca and Orca 2 has really helped shed more light on the nature of training data. And our work on the LLaVA model specialized to medical image generation in LLaVA-Med has shown real promise for a future of specialized models devoted to aspects of medicine.

Now while I’m talking about model specialization, this interplay between specialization versus generalization has been another major theme for Microsoft Research AI over the past year. The basic question is, do we need intense specialization? And nowhere has that question been more pertinent than the area of health care and medicine. Do we need to take AI models to med school in order for them to be useful in the medical domain? The question is still a mystery to us, and in fact, we released a series of prompting scripts that would automate the creation of chain-of-thought augmentation of prompts called promptbase and its application to medicine called Medprompt that shows that GPT-4 when suitably prompted still outperforms any existing specialist model. And so to date, we still have a mystery as to the role of specialization. And furthermore, we have some hints that specialization may lead to the loss of some cognitive function, and a really fun paper to read out of Microsoft Research and our Azure division is entitled “[Who’s] Harry Potter?” where we show that even a small amount of specialized training of large language models can get a large language model to forget everything it ever knew about Harry Potter. A humorous title, but it makes an important point in deepening understanding of the role of specialization.  
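
For a sense of what chain-of-thought prompt augmentation looks like in its simplest form, here is a minimal sketch. The exemplar format and the call_llm placeholder are assumptions for illustration; the actual promptbase/Medprompt scripts add dynamic exemplar selection and ensembling on top of this basic idea.

```python
# Minimal sketch of chain-of-thought prompt augmentation. `call_llm` and the
# exemplar store are placeholders; this is an illustration, not the promptbase scripts.
def build_cot_prompt(question: str, exemplars: list) -> str:
    """Prepend worked examples (question, step-by-step reasoning, answer) to a new question."""
    parts = []
    for ex in exemplars:
        parts.append(f"Question: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Answer: {ex['answer']}\n")
    parts.append(f"Question: {question}\nLet's think step by step.\nReasoning:")
    return "\n".join(parts)


def answer_with_cot(question: str, exemplars: list, call_llm) -> str:
    prompt = build_cot_prompt(question, exemplars)
    return call_llm(prompt)  # the model writes its reasoning, then its final answer
```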

All of these put together, of course, is just the tip of a very, very large iceberg, one that is growing at tremendous speed. And in fact, today we are seeing so much of AI research happening in the world, in social media, at the speed of conversation, to the point where even our top-tier AI researchers feel, at times, that their heads are spinning. But working together, providing openness, providing greater access, we definitely—looking back over this past year—can see that we’ve made tremendous, tremendous progress. And it’s not just the new discoveries that we jointly have made together but also in deepening understanding about how to care and mitigate against the downstream harms and risks of AI as we see them emerge as well as the broader societal impacts about what this will mean for the future of work, for the future of life, and the future of our relationships with technology.

Now as we think more about AI, it has just infected—and I’m using that word for a reason—almost all of the research that we do across Microsoft Research, whether it’s in security and privacy, whether it’s in sociotechnical systems, whether it’s in program analysis and verification. You name it, generative AI has been having an impact. I can’t tell you how surprised and tickled I was the first time I saw our program analysis research group using GPT-4 to help synthesize loop invariants for the purposes of furthering problem verification analysis. Just really cool. A little bit amusing; maybe even a little bit scary. But just amazing no matter how you look at it. 

Now when we take all that, I use the word infected because, of course, one area that has had a special place within Microsoft Research over the last few years has been in areas related to health care and medicine. And in fact, we saw early on the potential impact that GPT-4 would have in health care and medicine, and in fact, I myself coauthored—with Carey Goldberg, who is a science writer, and Zak Kohane, from Harvard Medical School—a book on the potential impact of GPT-4 on health care and medicine. And we had already in place a Microsoft Research lab called Health Futures that had been working on large language models such as BioGPT and PubMedBERT for the purposes of creating knowledge bases and supporting decisions in clinical settings, and much of that in collaboration with our partners at Nuance, the makers of really the most widely used medical transcription and note-taking technologies today. 

GPT-4 and the emergence of large language models have really changed everything and in fact have broadened beyond the narrow confines of health care and medicine to other parts of science, other natural sciences—the discovery of new materials, the discovery of catalysts to help climate science, the potential to take these sorts of models and make them multi-scale so that we can now start modeling weather patterns and predict weather events ahead of time.

And in recognition of all of that, we’ve created a new Microsoft Research lab called AI4Science, and we’re very proud that already we’re seeing some tremendous results from this. Working with collaborators at the Pacific Northwest National Laboratories, we very rapidly were able to synthesize the discovery of new electrolyte substances, combining sodium and lithium in ways that could be the foundation of a new generation of solid-state batteries that make a dramatically lowered use of what is oftentimes referred to as “white gold,” or lithium. And we’ve furthermore been able to work with the Global Health Drug Discovery [Institute], or GHDDI, in very rapidly discovering new inhibitors that may form the foundation of new drug treatments for diseases such as tuberculosis and coronaviruses. And so the future is just incredibly bright, and as we extend beyond language and medical images and other types of two-dimensional images to 3D-structure learning and the ability to make predictions about the structure of physical systems like molecules, we see just an incredibly bright future ahead of all of us. 

As I think about our future as fellow researchers, as scientists, I just see a tremendous reason to be optimistic. You know, we’re in an era where what we do has never mattered more than it matters right now. The things that we’re working on have tremendous technological power to really empower and reach every person on the planet and make their lives better in so many different ways. And it’s something that we can just do together and go directly from the laboratory into the real world.  

And so I really am hoping and I’m looking forward to us continuing in this spirit of collaboration, in the spirit of openness, to help ensure that that future is as vibrant and bright as we know it can be while at the same time being clear-eyed about the potential risks, risks that we don’t even understand or realize yet today. But together we can do what scientists have always done in the past, which is to ensure that we get as many of the benefits out of emerging new technologies while mitigating the downstream harms and risks. If we do that together, if we do it with the right spirit and attitude, I think the future is incredibly bright. I know I’m really looking forward to doing it with you.

Thank you again for joining us, and I hope you enjoy this first Microsoft Research Forum.

The post Keynote: Research in the Era of AI appeared first on Microsoft Research.

Panel Discussion: AI Frontiers http://approjects.co.za/?big=en-us/research/articles/panel-discussion-ai-frontiers/ Tue, 30 Jan 2024 00:33:42 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=999726 Hosted by Ashley Llorens, VP and Distinguished Scientist, Microsoft. AI researchers Sébastien Bubeck, Ahmed Awadallah, and Ece Kamar discuss frontiers in small language models and where AI research and capabilities are headed next.

The post Panel Discussion: AI Frontiers appeared first on Microsoft Research.

Hosted by Ashley Llorens, with Ece Kamar, Sébastien Bubeck, and Ahmed Awadallah at Microsoft Research Forum, January 2024

photo of Ece Kamar

“The sparks that we are seeing [are] really about having building blocks that give us the initial technologies … to get to those AI systems that have a memory, that have a history, that have a deep understanding of human concepts, and that can carry out tasks that are a lot broader, a lot more complex than what we can do today.”

Ece Kamar, Managing Director, AI Frontiers

Transcript

Ashley Llorens, VP and Distinguished Scientist, Microsoft
Ece Kamar, Managing Director, Microsoft Research AI Frontiers 
Sébastien Bubeck, VP, Microsoft GenAI 
Ahmed Awadallah, Senior Principal Research Manager, Microsoft Research AI Frontiers 

Microsoft AI researchers discuss frontiers in small language models and where AI research and capabilities are headed next. 

Microsoft Research Forum, January 30, 2024

I’m Ashley Llorens, with Microsoft Research. My team works across research and product to incubate emerging technologies and runs programs that connect our research at Microsoft to the broader research community. I sat down with research leaders Ece Kamar, Ahmed Awadallah, and Sébastien Bubeck to explore some of the most exciting new frontiers in AI. We discussed their aspirations for AI, the research directions they’re betting on to get us there, and how their team is working differently to meet this moment.

ASHLEY LLORENS: So let’s dive in. We’re experiencing an inflection point in human technology where machines, broadly speaking, are starting to exhibit the sparks of general intelligence, and it’s hard to avoid the enthusiasm, even if you wanted to. And I think it’s fair to say that there’s no shortage of that enthusiasm here among us. But as researchers, we’re also skeptics. You know, we go right in and try to understand the limitations of the technology as well as the capabilities, because it’s really those limitations that expose and define the frontiers that we want to push forward on. And so what I want to start with is to sketch those frontiers here with you a little bit. I’d like to hear about an aspiration you have for AI and why the technology cannot do that today. Then we’ll come back around to the research directions that you’re betting on to close those gaps. And, so, I don’t know. Ahmed, what do you think? What aspiration do you have for AI, and why can’t the tech do it today?

AHMED AWADALLAH: I have a lot of aspirations. I think … you just mentioned we saw the sparks of AGI, so naturally, we’re looking forward to actually seeing AGI. But beyond that, more realistically, one of the things I’m really looking forward to is having AI that can actually perceive and operate in the real world. We have made significant advances with language models. We are seeing a lot of advances with multimodality. It looks like an AI that can perceive and operate in the real world is not that far off from where we are. But there are a lot of challenges, as well. And I’m really excited to see how we can get to that.

LLORENS: What does that look like for you, when AI operates in the real world? What is it doing? 

AWADALLAH: To me, it means that, first, we go beyond language, and we are getting a lot into multimodal models right now that can perceive images and language. However, a big part of what we do is take actions in the world in different ways. We exhibit a lot of behavior as we do tasks, and it’s not clear that we can do that right now with AI. So imagine that we have an AI system that we can ask to do things on our behalf, both in the digital and in the physical world. Imagine that we have guarantees that it will accomplish these tasks in a way that aligns with our original intent.

LLORENS: Yeah, it’s compelling. Ece, what do you think?

ECE KAMAR: My dream for AI systems is that they become our helpers, companions, longer-term collaborators, rather than just something we prompt and it gives us an answer. And we are, actually, still quite far from having AI systems that can really help us through our lives for the different purposes that we have and that really understand our goals, intentions, and preferences. The sparks that we are seeing right now are really about having building blocks that give us the initial technologies to build on, to get to those AI systems that have a memory, that have a history, that have a deep understanding of human concepts, and that can carry out tasks that are a lot broader, a lot more complex than what we can do today. And our task right now is using these blocks to really imagine what those future systems are going to look like and to discover the new innovations that will push the capabilities forward, so that we can build systems that create a difference in our lives, not only systems that we want to play with or that do small tasks for us. Those are already changing how I work, by the way; these things are not minor. But they can really be a part of my daily life and help me with everything I do.

LLORENS: Seb, what do you think? 

SÉBASTIEN BUBECK: Yeah, my aspiration for AI, actually, has nothing to do with the technology itself. I hope that AI will illuminate how the human mind works. That’s really my real aspiration. You know, I think what’s going on in our minds and the way we reason is extremely mysterious. And anything that is mysterious, it looks kind of magical. We have no idea what are the basic elements for it. And with AI, we’re seeing that, at the very least, it’s mimicking the type of reasoning that’s going on in human beings. So I’m hoping that we’re going to be able to really uncover those building blocks of reasoning. That’s my dream for the next decade, I guess. 

LLORENS: How good of an analogy do you think, I’ll say, transformers or, you know, today’s machine learning models are for how we think and reason? 

BUBECK: It’s a terrible analogy. [LAUGHS] So it really … the transformer is absolutely not, in my mind, trying to mimic what the human brain is doing. It’s more like the emergent properties are similar. So, you know, it’s … the substrate is going to be obviously different. I mean, one is a machine and one is wetware, and the concrete algorithm that is running will be different. But it’s plausible that the emergent property will be similar. That’s what I’m hoping. 

LLORENS: No, yeah. Super interesting. And now I want to understand a little bit about the research directions that you are most excited about to get there. I don’t think you’re going to tell me about your neuroscience research. [LAUGHS] 

BUBECK: [LAUGHS] I wish. I wish. 

LLORENS: That’s an interesting place to start … 

KAMAR: Not yet. Maybe in the next episode. [LAUGHS] 

BUBECK: Exactly. 

LLORENS: But what are you betting on right now to get us closer to that? 

BUBECK: Yeah. No, it’s actually connected, the two things. So what we are experimenting with right now is the following. So to us, I think to all of us here, GPT-4 showed the sparks of AGI, early signs of humanlike reasoning. And we see this as a kind of proof of concept. OK, it means you can get this type of intelligence—quote, unquote—if you scale up a ton, if you have a very, very large neural network trained on a lot of data with a lot of compute for a very long time. OK, great. But exactly which one of those elements was needed? Is it the big data that’s necessary? Is it the large neural network? Is it a lot of compute? And what is a lot, by the way? What is large? You know, is 1 billion large? Is 10 billion large? Questions like this. So to me, this comes from a scientific inquiry perspective. But at the end of the day, it has enormous economic impact, because when you answer these questions, you can go and make everything smaller. And this is what we’ve been doing with the Phi series of models, trying to build those small language models. Again, we come at it from the scientific perspective, but it has very, very concrete impact for the future of Microsoft.

LLORENS: So I think Phi is on a lot of minds right now. Let’s actually stick with Phi for a minute. What is the secret? [LAUGHS] What’s enabling you to get to the reasoning capabilities that you’re demonstrating with models of that size?

BUBECK: Yes, yes, yeah. There is … 

LLORENS: What size is Phi, by the way? 

BUBECK: Yeah, so the latest, Phi-2, is 2.7 billion parameters. Phi-1.5 was 1.3 billion. So we have doubled the size. The secret is actually very simple. The secret is in the title of the first paper that we wrote in the Phi series, which is “Textbooks Are All You Need.” This is, of course, a play on the most famous paper of all time in machine learning, “Attention Is All You Need,” which introduced the attention mechanism for the transformer architecture. In “Textbooks Are All You Need,” what we say is that if you play with the data and you come up with data of “textbook quality”—the meaning of this is a little bit fuzzy, and this is where part of the secret lies—then we’re able to get a 1,000x gain in the total compute that you need to spend to reach a certain level in terms of benchmarks, intelligence, etc. So what is this mysterious textbook quality? Well, the way I want to put it is as follows. What matters, when you give text to these transformers to try to teach them a concept, is how much reasoning is going on in the text. What kinds of concepts can you extract if you are to predict the next word in that text? So what we want is text which is reasoning dense, and, you know, novels are not really reasoning dense. Sometimes you need to reason a little bit to understand how all the characters are related, why they are thinking or doing what they are doing. But where do you have really reasoning-dense text? Well, in textbooks. So this is the secret, basically.
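To make the idea of reasoning-dense, textbook-quality data more concrete, here is a minimal, hypothetical sketch of the kind of filtering step such a data pipeline might include. It is not the actual Phi data pipeline: the heuristic scorer, the threshold, and the field names are illustrative assumptions, and a real pipeline would more plausibly rely on a trained quality classifier or an LLM judge rather than keyword counting.

```python
# Illustrative sketch only: not the actual Phi data pipeline.
# The scoring heuristic and threshold below are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    source: str


def reasoning_density(doc: Document) -> float:
    """Toy heuristic standing in for a learned quality classifier.

    A real pipeline would likely use a trained classifier (or an LLM judge)
    to score how much step-by-step reasoning a passage supports.
    """
    markers = ("therefore", "because", "hence", "step")
    words = doc.text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(m) for m in markers)
    return hits / len(words)


def filter_corpus(docs: list[Document], threshold: float = 0.02) -> list[Document]:
    """Keep only passages whose estimated reasoning density clears a threshold."""
    return [d for d in docs if reasoning_density(d) >= threshold]


corpus = [
    Document("Because each term doubles, the sum is 2^n - 1; hence the bound follows.", "textbook"),
    Document("The sunset was beautiful and the sea was calm that evening.", "novel"),
]
print([d.source for d in filter_corpus(corpus)])  # expect: ['textbook']
```

The point of the sketch is only the shape of the operation: score every candidate passage for how much reasoning it supports, then keep the slice of the corpus that clears the bar.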

LLORENS: And, Ahmed, recently you and I have had conversations about a universe of different pretraining methods, textbook-like reasoning tokens being one, and then also the whole universe of post-training methods and how there’s a whole space to explore there. So maybe you can get into your research interests. Where are you pushing on that frontier? And what haven’t we talked about yet in terms of pretraining versus post-training?

AWADALLAH: Yeah, that’s a very good question. And, actually, it’s very interesting how many similar insights apply to what Sébastien was just describing. If you look at how we have been training models recently, we start with the pretraining stage, where we basically show the model a lot of text—the textbooks—and we have it learn to predict the next word. And with a lot of data and a very big model size, a lot of emergent properties were showing up that we didn’t really even try to teach the model. But there are also other stages of training—some people refer to them as post-training—where after we pretrain the model, we actually start teaching it specific skills, and that comes in the form of input-output samples or sometimes an input and two different outputs, where we are trying to teach the model that the first output is preferred to the second output. We can do that to teach the model a particular style or a skill set, or even for alignment, to teach it to act in a safer way.

But what we have found is that now that we have these large models as well—and they are actually very powerful engines that can enable us to create all sorts of data—we don’t have to wait for many of these properties to emerge with size. We can actually go back and create synthetic, tailored data to try to teach a smaller model a particular skill. We started with reasoning because reasoning is a pretty hard property, and we haven’t really seen reasoning emerge to the level we have in models like GPT-4 except after scaling to a very large model size and data size. So the question was, now that it has emerged in these models, can we create data that teaches a smaller model that particular skill? And we were not trying to teach the model any new knowledge, really. We were just trying to teach the small model how to behave, how to solve a task. So, for example, with a model like GPT-4, we are seeing that you can ask it to solve a task that requires breaking the task into steps and going step by step to solve it. We have never seen that with a small model, but what we have found is that you can actually use a powerful model to demonstrate the solution strategy to the small model, and you can demonstrate so many solution strategies for so many tasks. And the small models are actually able to learn that, and their reasoning ability is significantly improved as a result.
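As a rough illustration of the two kinds of post-training data described above, here is a small sketch of how such records might be assembled. It is not the Orca or Phi recipe: the strong_model placeholder, the prompt wording, and the field names (instruction/response for demonstrations, prompt/chosen/rejected for preference pairs) are assumptions chosen to mirror common open-source conventions, not any specific Microsoft pipeline.

```python
# Illustrative sketch only, not the Orca or Phi post-training recipe.
# `strong_model` stands in for a call to a capable teacher model (e.g., a GPT-4-class API);
# the prompt wording and record fields are assumptions for illustration.
import json


def strong_model(prompt: str) -> str:
    """Placeholder for a call to a teacher model."""
    return "Step 1: ...  Step 2: ...  Therefore, the answer is ..."


def make_demonstration(task: str) -> dict:
    """Explanation-style demonstration: ask the teacher to solve step by step,
    then save the (instruction, response) pair for fine-tuning a small model."""
    prompt = f"Solve the following task. Explain your reasoning step by step.\n\n{task}"
    return {"instruction": task, "response": strong_model(prompt)}


def make_preference_pair(task: str, better: str, worse: str) -> dict:
    """Preference record: one input with two outputs, the first preferred,
    used to teach style, safety, or alignment in a post-training stage."""
    return {"prompt": task, "chosen": better, "rejected": worse}


records = [
    make_demonstration("A train travels 60 km in 45 minutes. What is its average speed in km/h?"),
    make_preference_pair(
        "Summarize the safety guidance in one sentence.",
        "Wear protective gloves and goggles before handling the solvent.",
        "Just be careful, it will probably be fine.",
    ),
]

# Write the records out in the JSONL form commonly used for fine-tuning datasets.
with open("post_training_data.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```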

LLORENS: I find the word reasoning pretty loaded. 

AWADALLAH: It is.

LLORENS: I think a lot of people mean a lot of different things by reasoning. Actually, I found some clarity. I had a nice discussion with two of our colleagues, Emre Kiciman and Amit Sharma, and, you know, they wrote a recent paper on reasoning. Sometimes we mean symbolic-style reasoning; sometimes we mean more commonsense reasoning. You talked about, kind of, more symbolic-style reasoning tokens, perhaps. How do I think about the difference between those kinds of training data versus the world knowledge that I might want a model to reason about?

BUBECK: Yeah, very good question. So if you take the perspective that you start with a neural network which is a completely blank slate, you know, just purely random weights, then you need to teach it everything. So the high-level reasoning that we do as human beings, this is like, you know, step No. 10. There are many, many steps that you need to satisfy before that, including, as you said, commonsense reasoning. So, in fact, in our approach to the pretraining stage, we need to spend a lot of effort on commonsense reasoning. And there, the textbooks approach is perhaps a little bit weird because there’s no textbook to teach you commonsense reasoning. You know, you acquire commonsense reasoning by going outside, seeing nature, talking to people, interacting, etc. So you have to think a little bit outside the box to come up with textbooks that will teach commonsense reasoning. But this is, actually, a huge part of what we did. In fact, everything that we did for Phi-1.5 was focused on commonsense reasoning. And then when we got to Phi-2, we got a little bit closer to the Orca model, and we also tried to teach slightly higher-level reasoning, but we’re not there yet. There are still, you know, a few more layers. We’re not yet at step No. 10.

LLORENS: Yeah, fair enough. Ece, geek out with us a little bit now on research directions. I’m sure you have a lot of interest in everything we’ve just talked about. Anything you want to add from your perspective? 

KAMAR: There is, actually, a lot to add, because one of the biggest things that we are trying to do in our new organization is understand the connections between these different works that are going on, because our purpose is not exploring independent research directions and making progress on each. We have a very focused mission. Our mission is expanding the frontiers of AI capabilities, expanding the frontiers of what intelligence can be in these machines. And to be able to get there, we have to have a coordinated understanding of how Phi connects to Orca and how these two model families connect to other future-looking ideas that can push those boundaries forward. So think about this as, like, an intelligence pyramid. That’s how I have been, kind of, thinking about it in my mind.

At the base of it, we have the building blocks of these models, the base models. Phi is a beautiful example. And in the future, we are going to have other models. Phi is going to go and do other things, and other places can do other things. Phi and GPT-4 and these models are going to coexist in a model library. The layer above that is all of the work that the Orca team is doing with fine-tuning and specialization: taking a capability, taking a domain, taking some constraints and trying to see, I have these base models, but how do I make them work even better for the different domains and capabilities that I really care about, and how do I have more control over what those models generate for me? So that’s the second step of that intelligence pyramid that we are building. But then we have been doing some really interesting demonstrations and building in our teams to, kind of, look at how orchestration plays a role in that intelligence pyramid. Because when you think about it, the simplest way we can get things done with either the base models or the specialized models today is I just tell it to do something by prompting and it does something for me. But is that the end of the way we are going to be building with these models to be able to expand those frontiers? The answer is no. And in fact, one piece of work that our teams have been doing collectively is called AutoGen. And that library became very popular with the developer community—and we love seeing the responses we are getting. Correct me, Ahmed, I think we got to 15,000 stars in under a month on GitHub …

AWADALLAH: Yeah, we did.

KAMAR: … with this library, with this very experimental library. And we are learning a lot from the developer community about how they are using it. But what we are seeing is that when people want to expand those capability boundaries, when they want more robust execution, when they want to really overcome the brittleness of simply prompting the models, they actually go to orchestration, and in fact, they go to multi-agent orchestration. What we mean by multi-agent orchestration is this: imagine you have a complex task that you cannot reliably do by just prompting even the best model we have in our family. What you can do is something very similar to how humans work, actually. We take a complex problem, we divide it into smaller pieces, and then we assign those pieces to different people who have different capabilities. That’s exactly how the AutoGen framework works. It takes a complex task, divides it into smaller pieces, and assigns different pieces to different “agents,” which means intelligences that can prompt different models with different strategies and personas, and it gets them working together. And what we are seeing is that this very simple idea of multi-agent orchestration, on top of all of the great work that’s happening on the modeling side, is another layer in that intelligence pyramid that can really push the frontiers forward. So one of the things we are doing in our organization is really understanding these connections—how does Phi relate to Orca relate to AutoGen?—as we are building this pyramid. But there is something else we are betting on right now, which I believe is going to become very, very important as these systems become a part of the real world, as Ahmed was suggesting.
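For readers who want to see what this division of labor looks like in code, here is a minimal two-agent sketch in the spirit of the AutoGen quickstart. It assumes the pyautogen 0.2-style API (AssistantAgent, UserProxyAgent) and an OpenAI-compatible model configuration; the model name, API key placeholder, and example task are illustrative, and the library’s interface has continued to evolve, so the current documentation should be checked.

```python
# Minimal two-agent sketch in the spirit of the AutoGen quickstart.
# Assumes the pyautogen 0.2-style API and an OpenAI-compatible model config;
# the model name, key placeholder, and task are illustrative only.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# The assistant agent drafts plans and writes code with the configured model.
assistant = AssistantAgent(name="assistant", llm_config=llm_config)

# The user proxy executes the code the assistant produces and reports results back,
# so the two agents iterate until the task is done or the reply limit is hit.
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "coding", "use_docker": False},
    max_consecutive_auto_reply=5,
)

user_proxy.initiate_chat(
    assistant,
    message="Plot the year-to-date price change of two stocks and save it to a PNG.",
)
```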

So when we were doing the “sparks of AGI” work, there is actually something we say in the introduction when we are talking about intelligence, the core of intelligence. Any intelligent system needs to be learning: learning from its environment, learning from the interactions it is having. And this is not something we currently have even in the best models or the best AI systems in the world. They are static. They may be interacting with millions of people every day, getting feedback from them and seeing how people respond, but that does not make any of those systems better or more intelligent, or help them understand their users any better. So I feel like this is one of the areas that we have to push forward very strongly. How do we incorporate a learning feedback loop into this intelligence pyramid—every layer of it—in a transparent, understandable, and reliable way, so that the systems we are building are not only getting better because experts like Sébastien and Ahmed are putting a lot of time into data collection? Of course, that work needs to happen as well, along with coming up with new ideas to make the models better. But we are also creating a virtuous loop for our systems so that they get better over time.

The last research idea we are pushing forward is actually something very unifying across the stack I’m talking about. One of the biggest questions is, what does progress in AI look like today, right? We are doing all of this great work, but how are the capabilities of the AI systems, of all the models we are building, evolving as the models scale up and we have more data? So this is really becoming a question about evaluation and understanding. Think about it this way: we are doing a lot of agile work in a very fast-changing environment. What we need is headlights to be able to see where we are going and how much progress we have made. So this is why another area we are really pushing on as a research direction in our organization is not only relying on existing benchmarks and existing evaluation strategies, but really reinventing how we think about evaluation overall. We talked about this intelligence stack. How can the innovations in the intelligence stack enable researchers to come up with new approaches to understand the models and evaluate the models, such that we have a much better understanding of where we are and where we are headed as we build this intelligence pyramid?

LLORENS: A quick follow-up question on evaluation. This is one that I think a lot about. There’s the idea of benchmarks that try to maybe test the, you know, the generality of the intelligence of a model. And then there’s, all the way, the end-to-end evaluation in the context of use. And how much do we think about the end-to-end story there when we talk about evaluation? 

KAMAR: It’s a spectrum. I would also like to hear from Sébastien and Ahmed, but it is really a spectrum, and there are different questions that motivate the work on evaluation. When we ask a question like, what does the capability curve look like for AI models, we have to focus on the models themselves and understand how the models are progressing. But if you are asking a question of, I want to build reliable, capable AI systems of the future, and what does that curve look like, that requires a different way of thinking about evaluation, where we are not only evaluating the models but evaluating the whole stack. We are actually saying, OK, let’s think about prompting, let’s think about orchestration, and let’s understand the complementarity of the stack, looking into how the capabilities improve as we put the pieces together, to be able to light our way forward both in terms of how well we do on models and how well we do in building systems. We have to do the work on both. There is really no shortcut for that.

LLORENS: Microsoft Research is over 30 years old now. And suffice it to say, I think we’ve been going strong for over 30 years, but we’re in new territory. And I think we are organizing differently in some ways, you know, to meet the moment. Along those lines—you, kind of, alluded to this before—you’ve recently taken on a new leadership role.

KAMAR: With Sébastien and Ahmed, as well. 

LLORENS: Of course. So maybe you can say a little bit more about how we’re organizing differently, what this looks like from your perspective.

KAMAR: As you said, this is really about the moment that we are in right now. Of course, I haven’t been at Microsoft Research for the whole 30 years [LAUGHTER], but I’ve been here for at least half of it, and personally, for me, there has never been a moment as exciting as now to be an AI researcher, and to be an AI researcher inside Microsoft. Think about it. This is the company that is putting cutting-edge AI technologies in the hands of millions of people and doing it at an unbelievable speed, a speed that surprises me even though I have been an employee of this company for the last 13 years. Think about the speed of innovation that we are seeing here. Think about where the ambition level is in this company when it comes to doing great AI work.

Of course, by doing research inside Microsoft, we are also able to see where the gaps are. We are able to get a lot of feedback about what is working and what is not working. And that’s giving us a lot of really strong signals about where we need to push. And, in fact, these research directions we are talking about, they are not coming from thin air. This is really coming from working with different product groups, learning from their experiences, and trying things ourselves, as well. So these are all motivating us to rethink what AI research means in this new AI age. We are setting an ambition level as high as the current situation requires: we are going to be at the cutting edge of the AI world, we are going to be impacting real-world AI systems, and we are going to be pushing forward on this intelligence pyramid. That really requires that we coordinate ourselves very well around a well-defined mission and go after it with conviction, speed, and agility. So that’s what we are doing in our new organization, which is called AI Frontiers. This is a mission-focused AI lab, and our mission is expanding the frontiers of AI capabilities. We are doing it by being very focused on a number of key directions, which we kind of covered, but also by having the agility and the teamwork to always re-evaluate ourselves and ask the question: are these the most important problems to work on right now? As the world changes, should we rethink? Should we create new directions? Should we end some directions? This is, I think, one of the most important things about where we are in the AI world right now. We are not working on hypothetical ideas. Of course, we are dreaming big; we are taking risks. We are not only doing incremental things. But even for the ideas that are long-term and riskier, we are only going to learn if we are building those ideas, sharing them with the community, and learning from that feedback. So those are the building blocks of our new organization.

LLORENS: One of the things that’s exciting about doing research, I find, in an industrial environment like Microsoft is the ability to essentially reach the broader population by translating things into products, right. On the other hand, there is a big difference between what comes out at the end of a research pipeline, a research asset, you know, a model like Phi or Orca, and a thing that powers a product. One of the things I think we’ll do with AI Frontiers is provide more of a channel, a more coherent channel, for research artifacts like this into product. But can you talk a little bit about that? What is that difference? What goes into getting something from, you know, what we might put on GitHub to something we might give to our colleagues in Azure, for example?

BUBECK: I think the timelines have really shortened recently. Overall, research has accelerated so dramatically that the gap between a real product and something that comes out at the end of a research project is very small, I would say. And this really goes to Ece’s point about having an organization which is mission focused and about building things; this is, to me, the essence of what’s going on right now. We cannot have horizons which are 10 years into the future. The truth is, nobody knows where AI is going to be 10 years from now, so it’s meaningless to plan at the time horizons that we are used to in research. If you are in research from 10 years ago and you’re planning with a 10-year horizon, then, of course, there is going to be an immense gap between whatever you produce and, you know, a real product. This is not the case anymore. So even something like Phi, you know, could be in a product very soon.

AWADALLAH: Yeah. When I first joined Microsoft Research, actually, we would think about whether the research we were doing was two, three, or five years away, and we’d categorize research that way in terms of making it into product. That spectrum’s collapsing.

BUBECK: Completely. 

AWADALLAH: Things are happening so fast. Taking something from research results to a product still requires a lot of work. And that’s why I have been amazed by how fast we have been moving as a company, putting these things safely and reliably into the hands of our customers. However, that spectrum is not measured in years anymore. Things are moving very, very fast, and some of our findings make their way into impact in a matter of weeks or months.

KAMAR: And there’s one more point to make here, which is about doing AI Frontiers inside MSR. We are choosing to build a mission-focused organization that’s going really fast on some of these problems, getting our hands dirty, and working with different parties in the company. And at the same time, we are inside a very strong organization that has researchers studying many different problems at different time horizons, and sometimes pursuing directions that we may not be able to afford to pursue in this mission-focused organization. So one of the things we very much care about is also building bridges, not only within the company, not only with the academic world, but also with the different groups under the Microsoft Research umbrella, really benefiting from the riskier bets that, you know, the traditional MSR labs are taking, collaborating with them, and enabling all of us to try those ideas. So we are really hoping that by being inside this MSR family, we are gaining a lot and we are able to scale our ideas and experimentation a lot more.

LLORENS: You alluded to the, you know, the work it takes to go from a research artifact to something in a product, and part of that work pertains to responsible AI, as we might say inside Microsoft, or just AI safety more broadly. I think that’s true for translating something to product, but even for releasing something, you know, a paper with a GitHub artifact that we put out there. Let’s go back, let’s say, to the Orca work. How are you thinking about safety in the context of open sourcing something like Orca? What are the tests you’re running? And, you know, what does that frontier look like?

AWADALLAH: Yeah, that’s a very good question. And, actually, we put a lot of emphasis on safety even for research assets, and we put many of our research assets through a process as rigorous as we would for products before we are able to release them. And this is definitely the right thing to do. As you mentioned Orca: we did Orca fairly early on, and at that stage we weren’t yet sure what the process should be, so we, actually, never released it. Once we wrote the paper and found out that we had something interesting, we wanted to release it, because we wanted to share it with the research community and have them build on top of it, but we didn’t have a story for what it means to actually release it safely. So we took a step back, worked with the rest of the company, and came up with a very rigorous process. And before we are able to put anything out, it has to go through that process. That said, I think we are still learning how to evaluate, how to measure, and what it even means to measure safety. So it’s not like a checkbox where we figured it out, and that’s what we are doing, and we feel good about it, and we put it out there. There is a continuous effort from a very large number of teams throughout the company, in both products and research, to continually refine these processes, so that we advance our understanding of what the safe release of these models means and make sure that we have the right processes and systems so that everything we put out there goes through them.

LLORENS: And there are frontiers here that are super interesting. I think multimodality is a really interesting frontier relative to evaluation and safety. And earlier in the conversation we even started talking about AI in the real world that we interact with, maybe not just as a chatbot but as an agent of some kind that can take action in the real world. So it’s great to see us taking this so seriously at this phase, because I think it’s going to get even more complicated and even more important, you know, as we move forward. Why don’t we talk AI and society for a minute. One of the things that I find important for me, as I reflect on my own research, my own journey here, is remaining grounded in perspectives outside of this laboratory, outside of the spheres that we’re in. We get some of that at our dinner tables, right. I do have the opportunity, for me personally, to engage with communities, community organizations, even politicians. But I’m really interested in how you all stay grounded in perspectives outside of this world here in Microsoft Research. Ece, why don’t we start with you?

KAMAR: Yeah, talking about AI and society and responsible AI, one of the things that’s very important is that a significant portion of our organization, our researchers and engineers, have contributed significantly to the work that Microsoft has done in the responsible AI space over the last decade. And, in fact, one of the things I’m most proud of from my personal time in MSR is how much MSR contributed to where Microsoft is in doing AI responsibly. And that all happened because we, actually, got to see the realities of AI development and had the passion to drive innovation in building AI responsibly. Now I think this is an opportunity for us to do this at larger scales as we have more coordinated efforts in pushing the frontiers of AI in this new organization and in MSR more broadly. So there are a few ways we are doing this right now. And then I’ll come to your point about the community. One of the things that we very much care about is sharing our work with the academic community and with the developer community through open sourcing. So all of the work—Phi, Orca, AutoGen, and the other things we are going to be doing—we release it. And, in fact, what is so significant about the small-language-model space is that these models enable a lot of hands-on research work that may not be possible without them, because when you think about it, a lot of the other models with reasoning capabilities that may compare with Phi and Orca were much larger, and they were black boxes to the research community. Now that we are putting these models out there under an MIT License, we really welcome the academic community to take them, to look into how they are actually getting better at reasoning, and to ask the question of how. Ask the question of, how do we have better controls in Phi and Orca? How do we improve the training data such that we can mitigate some of the biases, reliability issues, and toxicity in it?

One of the things I personally very much believe is that there cannot be camps of stopping AI versus going as fast as possible. This is really about building AI responsibly and making sure that the innovation happening takes responsibility as a core part of it. So with that in mind, we think it is so important to enable the whole academic community with models, with architectures, with agent libraries, such that the innovation in how we make AI responsible comes from the whole world instead of just those who have access to such models.

BUBECK: And if I may, like, for the Phi model on Hugging Face, we are approaching a million downloads. So, you know, it’s very real. Like, this is really getting into the hands of, well, a million people, so … 

LLORENS: Yeah, for sure. 

AWADALLAH: Yeah, and to add to that, we are seeing this a lot with AutoGen, as well, because AutoGen, it’s not a model. You can use a lot of models with it. And it has created a big developer community, and we have been learning a ton from them, not just about how they are using it but about so many innovative ideas for how to use it to make applications safer or more reliable. Because the framework enables you to define different roles, people are coming up with very interesting ideas, such as adding a safeguard agent to make sure that whatever the team of agents is doing actually fits particular safety criteria, or adding other agents that try to make sure that the completion of the task aligns with the initial human intent. So we are going early with enabling the community to use what we are doing and open sourcing it. It is helping us collectively come up with better, safer ways of building these things.

KAMAR: And then on top of the work we are hopefully enabling in the academic community, there is also something about working inside a company like Microsoft and learning from real-world use cases. Responsible AI is really about the real world, and we want to make sure that we, over time, think about ways—possibly even collaborating with you, Ashley, and your team—of sharing our learnings about what the real world looks like and what the real-world considerations are with a much larger community, so that we can think about all of these considerations together and innovate together in building AI responsibly.

LLORENS: And the global research community—we talk a lot about that—is more expansive, I think, than it’s ever been, at least as it pertains to computing research and the number of different disciplines now involved in what we’ve considered computing research. On the one hand, there are the computer scientists who are playing with Phi right now, who are playing with AutoGen. On the other hand, there are legal scholars, policy researchers, and medical practitioners, and so the global research community is just more expansive than ever, and it’s been great to be able to use Microsoft as a platform to engage more broadly, as well. So, look, I’ve really had a lot of fun, you know, talking to you all on a daily basis, but today in particular. Thanks for a fascinating discussion.

KAMAR: Thank you, Ashley. 

BUBECK: Thanks, Ashley.  

AWADALLAH: Thank you.

The post Panel Discussion: AI Frontiers appeared first on Microsoft Research.
