Research Forum Brief | February 2025 Articles

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Microsoft Research Team — Tue, 25 Feb 2025 19:36:13 +0000

Presented by Gagan Bansal at Microsoft Research Forum, February 2025

“When we released AutoGen, one of the first things that the developers absolutely loved about it was its simplicity and the many pre-built agents and teams that it provided, such as the user proxy agent and the assistant agent, and the group chat between multiple agents. With the AutoGen AgentChat layer, we are maintaining these features and adding tons of more essential features such as streaming support, serialization, state management and memory for agents, and finally full type support for a better development experience.”
– Gagan Bansal, Senior Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Gagan Bansal, Senior Researcher, Microsoft Research AI Frontiers

This talk introduces a transformative update to the AutoGen framework that builds on user feedback and redefines modularity, stability, and flexibility to empower the next generation of agentic AI research and applications.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: The following talk invites us to follow the journey of AutoGen from a leading open-source framework for multi-agent applications to a complete redesign that lays the foundation for the future of agentic AI research and applications with the release of AutoGen 0.4. The framework’s new layered architecture provides flexibility and scalability and includes an ecosystem of extensions and applications, some created by the same team, such as Magentic-One, a team of generalist agents, and Studio, a low-code developer tool. AutoGen 0.4 is also a story about collaboration between MSR, partners within Microsoft, and a vibrant open-source community.

GAGAN BANSAL: Hi, I am Gagan Bansal and I am a researcher at Microsoft Research AI Frontiers. And today I’ll talk about some exciting technical updates to AutoGen, a leading open-source framework for agentic AI. And although I am presenting, this is joint work with many incredible colleagues and interns at Microsoft over the last year.

AutoGen is a leading open-source framework for multi-agent applications that we released in fall 2023. It enables developers and researchers to create intelligent applications using large language models, tool use, and multi-agent collaboration patterns. With AutoGen, our goal has been to lead the innovation in agentic AI research. When we first launched AutoGen in Fall 2023, it quickly became the leading open-source framework for agentic AI, and it continues to empower developers and researchers in many, many domains, including business process automation, marketing, finance, security, and others.

Since AutoGen’s launch, we’ve not just been maintaining it. We’ve been listening closely to feedback from developers and researchers, and in this rapidly evolving landscape of AI progress, their expectations were high. Users told us that they needed greater modularity and the ability to reuse agents seamlessly. They also asked for better support for debugging and scaling their agentic solutions. And finally, there were many asks to enhance the code quality and maturity of the platform.

Pursuing these needs required us to question our assumptions and even possibly reimagine the platform. So, in early 2024, we used these learnings to experiment with alternate architectures, and we ended up adopting an actor model for multi-agent orchestration. The actor model is a well-known programming model for concurrent programming and high-use systems. Here, actors are the computational building blocks that can exchange messages and also perform work. In Fall 2024, we announced a preview of this version and this new year, we’re thrilled to announce a full release. In summary, AutoGen v0.4 is our response to address our users’ feedback in this evolving landscape of AI research. AutoGen is now not just a framework, but it’s a whole ecosystem for agentic AI. It provides you with a framework that lets you build sophisticated agents and multi-agent applications, and it also provides you with developer tools and many well-defined applications.

Let me first tell you about the AutoGen framework. At the heart of this release is a layered architecture that is designed for flexibility and scalability. At the base is AutoGen Core. This layer implements the actor model for agents. Building on Core is AutoGen AgentChat. This layer provides a simple and easy-to-use API that is perfect for rapid prototyping. And building on Core and AgentChat is Extensions.

This layer provides advanced clients, agents and teams, and integrations with third party software. This layered architecture is nice because whether you are an advanced developer or a researcher prototyping new ideas, AutoGen provides you with the tools you need for your project’s stage of development. The Core implements an actor model for agentic AI. At the highest level, this implementation provides two key features.

The first is asynchronous message exchange between agents. It does so by providing a runtime, and then it also provides event-driven agents that perform computations in response to these messages. There are several implications of this design, and one of them is that it decouples how the messages are delivered between the agents from how the agents handle them. This naturally improves the modularity and scalability of agentic workflows built with AutoGen, especially for deployment.

The Core’s event-driven architecture provides several other benefits. For example, it provides affordances to observe and control agent behavior, which is crucial for responsible development of agentic technology. It also enables running multiple agents on different processes and even implementing them using different languages. Finally, it enables developers to implement a large class of multi-agent patterns, including static and dynamic workflows.
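As an illustration of the decoupling described above, here is a minimal actor-style sketch built on Python's asyncio. It is not AutoGen's API: the `Message`, `Agent`, `EchoAgent`, and `Runtime` names are hypothetical, and the single in-process queue stands in for whatever delivery mechanism a real runtime uses. The point is only that agents react to messages while the runtime alone decides how messages are delivered.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    recipient: str
    content: str


class Agent:
    """Event-driven actor: it only reacts to messages handed to it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received = []

    async def on_message(self, message: Message, runtime: "Runtime") -> None:
        self.received.append(message.content)


class EchoAgent(Agent):
    """Replies to every message it receives, except echoes themselves."""

    async def on_message(self, message: Message, runtime: "Runtime") -> None:
        await super().on_message(message, runtime)
        if not message.content.startswith("echo:"):
            reply = Message(self.name, message.sender, f"echo:{message.content}")
            await runtime.send(reply)


class Runtime:
    """Owns delivery, so agents never call each other directly."""

    def __init__(self) -> None:
        self.agents = {}
        self.queue = asyncio.Queue()

    def register(self, agent: Agent) -> None:
        self.agents[agent.name] = agent

    async def send(self, message: Message) -> None:
        await self.queue.put(message)

    async def run_until_idle(self) -> None:
        # Deliver queued messages one at a time until none remain.
        while not self.queue.empty():
            message = await self.queue.get()
            await self.agents[message.recipient].on_message(message, self)


async def demo():
    runtime = Runtime()
    alice, bob = Agent("alice"), EchoAgent("bob")
    runtime.register(alice)
    runtime.register(bob)
    await runtime.send(Message("alice", "bob", "hello"))
    await runtime.run_until_idle()
    return alice.received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because handlers only see messages and a `send` method, the queue could in principle be swapped for a cross-process transport without touching the agents, which is the kind of modularity the talk attributes to this design.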

When we released AutoGen, one of the first things that the developers absolutely loved about it was its simplicity and the many pre-built agents and teams that it provided, such as the user proxy agent and the assistant agent, and the group chat between multiple agents. With the AutoGen AgentChat layer, we are maintaining these features and adding tons of more essential features such as streaming support, serialization, state management and memory for agents, and finally full type support for a better development experience.

Please check out the link below for the migration guide. Finally, the Extensions layer provides advanced runtimes, tools, clients, and ecosystem integrations that continuously expand the framework’s capabilities. In addition to the framework, this new release also provides upgrades to essential developer tools and applications built using AutoGen. And here I’ll briefly mention two of them. In late 2023, we also released AutoGen Studio, which is a low-code tool for authoring multi-agent applications.

And we are excited to announce that with version 0.4, Studio has received massive upgrades. It now supports a drag-and-drop multi-agent builder. It supports real-time updates as agents solve tasks, flow visualizations and execution controls, so that the users remain in control, and component galleries so that the community can discover and build on each other’s work. We’ve always believed that the framework should enable state-of-the-art applications for solving complex tasks with agents, which is why we’ve been building applications with the framework ourselves and using that to guide the framework’s development.

Last year, we released Magentic-One, a state-of-the-art multi-agent team for solving file- and web-related tasks built using AutoGen. And now its developer API and general capabilities, such as sophisticated orchestrators and specialized agents such as the WebSurfer and the FileSurfer, are available in the AutoGen ecosystem. For us, this new ecosystem is only the beginning and sets the stage for future innovation in agentic AI.

Over the past two years, our team has made early progress in AI agents and we continue to deeply think about the changing landscape of current AI research and continue to invest in taking steps to help lead the innovation on agents. And by the way, we’re also working closely with our colleagues at Semantic Kernel to provide an enterprise-ready multi-agent runtime for AutoGen.

Thank you for attending Microsoft Research Forum. Please check out these links to learn more about AutoGen.

Blog AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Publication AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Publication Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Belief state transformers

Microsoft Research Team — Tue, 25 Feb 2025 19:33:38 +0000

Presented by John Langford at Microsoft Research Forum, February 2025

“That ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text.”
– John Langford, Partner Research Manager, Microsoft Research AI Frontiers

Transcript: Lightning Talk

Belief state transformers

John Langford, Partner Research Manager, Microsoft Research AI Frontiers

This talk showcases a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms’ efficiency and effectiveness.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: Transformer models have brought us a revolution in language modeling with their capability to generate impressive language with many emergent properties. At the same time, LLMs have a number of weaknesses, one being that they are not very good at evaluating their own output. Let’s hear how the new Belief State Transformer architecture unlocks new abilities by combining a standard GPT-style architecture of a forward encoder for token prediction with an additional backward encoder.

JOHN LANGFORD: I’m John Langford. I’d like to tell you about belief state transformers, which is a new paper we have on arXiv, and which is also accepted at ICLR [International Conference on Learning Representations]. There are many coauthors on this paper. I’d like to thank them, particularly Edward, who did much of the work here.

To start with, let’s talk about standard GPT-style transformers. In standard GPT-style transformers, you have a sequence of symbols which are going into a forward encoder, and then the forward encoder outputs some information to the output head, and then the output head predicts the final token. So, this is a straightforward approach and yet amazingly powerful. It’s kind of the key backbone behind GPT-4 and other language models.

For the purposes of research, though, we need to have something to think about, to complain about, and I’m going to complain about self-evaluation. Often these language models can’t be used to evaluate their own output too well, because the generation of the next token is done by exactly the mechanism you would use to evaluate it in that output head. So, this is kind of like grading yourself, and like grading yourself you can miss things that an independent grader would actually see pretty well.

Right, so a belief state transformer changes the architecture. And so, it’s taking two transformers and grafting them together. One of them is going to be the standard forward encoder on the prefix. And then we’re also going to have another transformer, which is a backward encoder on the suffix. These are both going to put out some information, which goes to the output head. And the output head is going to predict the next token and the previous token. So, it’s the next token of the prefix and the previous token of the suffix. Something to worry about with these transformers is the computation. So, these are transformers obviously doing more computation. But it turns out that this “more computation” is only in a constant factor of more computation.

And the key observation here is that in the forward encoder, just doing the attention, what you’re going to use in the GPT-style transformer, is already order N-squared [N²]. Every token looks at every previous token in order to figure out what information is necessary to predict the next token. In the belief state transformer, that happens twice. You have two different transformers, each with their own attention, and so you pay a factor of two.

And then, in addition, you’re going to pay because the number of times you evaluate the head, the output head, is order N-squared because there are order N-squared prefix/suffix pairs. So, there’s a constant-factor increase in computation, which is problematic, but it’s not like the end of the world. You can subsample or things like that. And what you get in return is order N-squared gradients rather than order N gradients.

In a standard GPT-style transformer, you only have order N gradients because you only have order N symbols, and you get one gradient per symbol. Here you get order N-squared gradients because you have order N-squared prefix/suffix pairs. That means there’s many more ways to get information out of a sequence. And that unlocks the possibility of learning new things that were previously unlearnable.
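To make the order N versus order N-squared comparison concrete, the short sketch below simply enumerates training targets for a toy sequence. The exact indexing is an assumption made for illustration (the prefix is `tokens[:i]`, the suffix is `tokens[j:]`, and the gap between them holds at least one token); the paper may define the pairs slightly differently, and the function names are hypothetical.

```python
def gpt_targets(tokens):
    """One next-token target per prefix: order N training signals."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(len(tokens))]


def prefix_suffix_targets(tokens):
    """One (next, previous) target pair per prefix/suffix split: order N^2 signals."""
    n = len(tokens)
    examples = []
    for i in range(n):                  # prefix is tokens[:i]
        for j in range(i + 1, n + 1):   # suffix is tokens[j:]; gap tokens[i:j] is non-empty
            prefix, suffix = tuple(tokens[:i]), tuple(tokens[j:])
            next_of_prefix = tokens[i]          # first token of the gap
            previous_of_suffix = tokens[j - 1]  # last token of the gap
            examples.append((prefix, suffix, next_of_prefix, previous_of_suffix))
    return examples


if __name__ == "__main__":
    for n in (4, 100):
        seq = list(range(n))
        print(n, len(gpt_targets(seq)), len(prefix_suffix_targets(seq)))
```

Under this indexing a sequence of length N yields N(N+1)/2 prefix/suffix pairs, so the count grows quadratically while the GPT-style count grows linearly.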

Okay, so now let’s go on to the belief state. Why are we talking about a belief state when we say belief state transformer? Well, it turns out you can prove a theorem. And this theorem says that the output of the forward encoder is a belief state for the prefix. So what that means is that the output of the forward encoder will converge to all the information necessary to predict the future. So that’s all symbols after the prefix. So, that ability to create a compact belief state is new with belief state transformers, something that previously we only really knew how to do with state space machines.

Okay, so let’s try this out. Looking at Tiny Stories. Tiny Stories is a dataset where you have a bunch of children’s stories, which are generated by GPT-4.

We’re going to feed a prefix and a suffix into our system, and it’s going to fill in the middle, which is what happens in blue. And then for a baseline, we’re going to compare the fill-in-the-middle approach to using GPT-style transformers. So the way the fill-in-the-middle approach works with GPT-style transformers is you take the prefix, and then you add the suffix, and then you just predict the tokens after that.

So that works reasonably well. This is very commonly used. And now if we have these two different approaches the question is how do we actually value these different approaches? Which one is better? So, the way we’re going to judge this is we’re going to ask GPT-4 which is better in various ways: syntax, style, and so forth. And then we’ll ask it for a summary judgment, which is a standard technique.

We looked at what it was doing, and it seemed very reasonable. And in doing this, we end up with the belief state transformer winning about a factor of three more often than the GPT-style transformer. So that’s huge. It’s so huge that you really want to understand why. And it seems like the key here is self-evaluation. So, under the hood, we’re actually running each of these, say 120 times, using a beam search. The code for that is on the right. So, given the beam search, you have several different possible completions. And now how do you choose which completion to actually use? Because you have to pick one of these. You’re trying to pick a completion. And for the GPT-style transformer, there’s only one way to really do this. The way is you take the next head, and you use it as a probability function, and you look at the probability of the sequence of tokens which is produced.

That works reasonably well. It actually does improve picking out a high-probability sequence of tokens versus a lower probability sequence of tokens. But it’s not as much as you get with the belief state transformer. And the reason why is the self-grading issue that I was talking about earlier. There’s many ways that a system could be blind to its own mistakes. With the belief state transformer, though, you have another option, because the next head can instead condition on the generated data and run over the suffix in order to value the generated data.
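The two selection strategies contrasted above can be sketched with a toy model. This is not the paper's implementation: `next_token_logprob` is a hypothetical callable standing in for a model's output head, the bigram table is made-up data, and a real belief state transformer would score the suffix using its forward and backward encoders rather than a single next-token function. The sketch only shows the difference between scoring a candidate by its own probability and scoring the known suffix conditioned on the candidate.

```python
import math


def sequence_logprob(next_token_logprob, context, tokens):
    """Sum of log p(token | everything before it), starting from `context`."""
    total = 0.0
    history = list(context)
    for token in tokens:
        total += next_token_logprob(history, token)
        history.append(token)
    return total


def pick_by_self_probability(next_token_logprob, prefix, candidates):
    """GPT-style selection: score each candidate by its own generation probability."""
    return max(candidates, key=lambda c: sequence_logprob(next_token_logprob, prefix, c))


def pick_by_suffix_probability(next_token_logprob, prefix, suffix, candidates):
    """Alternative: condition on the candidate and score the known suffix instead."""
    return max(
        candidates,
        key=lambda c: sequence_logprob(next_token_logprob, list(prefix) + list(c), suffix),
    )


# Made-up bigram probabilities, purely for illustration.
BIGRAM = {("the", "cat"): 0.6, ("the", "dog"): 0.4, ("cat", "sat"): 0.1, ("dog", "sat"): 0.9}


def toy_logprob(history, token):
    previous = history[-1] if history else None
    return math.log(BIGRAM.get((previous, token), 1e-6))


if __name__ == "__main__":
    prefix, suffix = ["the"], ["sat"]
    candidates = [["cat"], ["dog"]]
    print(pick_by_self_probability(toy_logprob, prefix, candidates))
    print(pick_by_suffix_probability(toy_logprob, prefix, suffix, candidates))
```

In this toy setup the first strategy prefers the candidate the model was most likely to generate, while the second prefers the candidate that makes the given suffix most likely, which is the sense in which the score is not the model grading its own generation.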

So, that ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text. All right, so just to summarize, we have this belief state transformer. This learns a compact belief state, which is a new thing in transformers. It gives us a way to have a simple set of values, which summarize all information we need to predict the future.

And this seems to provide a very strong form of self-evaluation, which is potentially very useful in many situations where you’re trying to use test-time compute, or even using test-time compute to further create training data. So, this is more in the paper. There’s some other things that you can do with this transformer that are kind of new.

I think the biggest question in my mind is what happens when you scale this up? And, of course, we’re working on that. That’s one of the great things about being in MSR [Microsoft Research]. They have some GPUs to scale this up to much larger datasets. So, stay tuned. And, thank you.

Research Lab AI Frontiers

Publication Learning to Achieve Goals with Belief State Transformers

OmniParser V2: Turning Any LLM into a Computer Use Agent

Microsoft Research Team — Wed, 12 Feb 2025 18:31:35 +0000

Yadong Lu, Senior Researcher; Thomas Dhome-Casanova, Software Engineer; Jianwei Yang, Principal Researcher; Ahmed Awadallah, Partner Research Manager

Get OmniParser V2 Code

OmniTool

Model checkpoints on HuggingFace

Graphical user interface (GUI) automation requires agents with the ability to understand and interact with user screens. However, using general-purpose LLMs as GUI agents faces several challenges: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. OmniParser closes this gap by ‘tokenizing’ UI screenshots from pixel space into structured elements that are interpretable by LLMs. This enables the LLMs to do retrieval-based next-action prediction given a set of parsed interactable elements.
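One way to picture the structured output described above is the following sketch. It does not reproduce OmniParser's actual output schema: the `ParsedElement` fields, the prompt format, and the helper names are all assumptions chosen for illustration. It shows how a list of parsed elements could be presented to an LLM as text and how the element the LLM selects could be mapped back to a screen location.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedElement:
    """Hypothetical record for one element detected in a screenshot."""
    element_id: int
    caption: str                              # functional description of the icon or text
    bbox: tuple                               # (x_min, y_min, x_max, y_max) in pixels
    interactable: bool


def describe_elements(elements):
    """Render interactable elements as text an LLM can choose from."""
    return "\n".join(f"[{e.element_id}] {e.caption}" for e in elements if e.interactable)


def click_point(elements, chosen_id):
    """Map the element id chosen by the LLM back to the center of its box."""
    for e in elements:
        if e.element_id == chosen_id and e.interactable:
            x_min, y_min, x_max, y_max = e.bbox
            return ((x_min + x_max) / 2, (y_min + y_max) / 2)
    raise ValueError(f"no interactable element with id {chosen_id}")


if __name__ == "__main__":
    screen = [
        ParsedElement(0, "Search box", (10, 10, 210, 40), True),
        ParsedElement(1, "Company logo", (0, 0, 50, 50), False),
        ParsedElement(2, "Submit button", (220, 10, 300, 40), True),
    ]
    print(describe_elements(screen))
    print(click_point(screen, 2))
```

With this framing the LLM only has to pick an id from a short list rather than predict pixel coordinates, which is the retrieval-style action prediction the paragraph refers to.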

OmniParser V2 takes this capability to the next level. Compared to its predecessor, it achieves higher accuracy in detecting smaller interactable elements and faster inference, making it a useful tool for GUI automation. In particular, OmniParser V2 is trained with a larger set of interactive element detection data and icon functional caption data. By decreasing the image size of the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version. Notably, OmniParser+GPT-4o achieves a state-of-the-art average accuracy of 39.6 on ScreenSpot Pro, a recently released grounding benchmark that features high-resolution screens and tiny target icons. This is a substantial improvement on GPT-4o’s original score of 0.8.

To enable faster experimentation with different agent settings, we created OmniTool, a dockerized Windows system that incorporates a suite of essential tools for agents. Out of the box, we enable OmniParser to be used with a variety of state-of-the-art LLMs: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic (Sonnet), combining the screen understanding, grounding, action planning, and execution steps.

Risks and Mitigations

To align with the Microsoft AI principles and Responsible AI practices, we conduct risk mitigation by training the icon caption model with Responsible AI data, which helps the model avoid, as much as possible, inferring sensitive attributes (e.g., race, religion, etc.) of individuals who happen to appear in icon images. At the same time, we encourage users to apply OmniParser only to screenshots that do not contain harmful content. For OmniTool, we conduct threat model analysis using the Microsoft Threat Modeling Tool. We provide a sandbox Docker container, safety guidance, and examples in our GitHub repository. And we advise a human to stay in the loop in order to minimize the risk.

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Microsoft Research Team — Tue, 05 Nov 2024 02:35:39 +0000

By Adam Fourney, Principal Researcher; Gagan Bansal, Senior Researcher; Hussein Mozannar, Senior Researcher; Victor Dibia, Principal Research Software Engineer; Saleema Amershi, Partner Research Manager

Contributors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives. We are also releasing an open-source implementation of Magentic-One on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.

The future of AI is agentic. AI systems are evolving from having conversations to getting things done, and this is where we expect much of AI’s value to shine. It’s the difference between generative AI recommending dinner options and agentic assistants that can autonomously place your order and arrange delivery. It’s the shift from summarizing research papers to actively searching for and organizing relevant studies in a comprehensive literature review.

Modern AI agents, capable of perceiving, reasoning, and acting on our behalf, are demonstrating remarkable performance in areas such as software engineering, data analysis, scientific research, and web navigation. Still, to fully realize the long-held vision of agentic systems that can enhance our productivity and transform our lives, we need advances in generalist agentic systems. These systems must reliably complete complex, multi-step tasks across a wide range of scenarios people encounter in their daily lives.

Introducing Magentic-One, a high-performing generalist agentic system designed to solve such tasks. Magentic-One employs a multi-agent architecture where a lead agent, the Orchestrator, directs four other agents to solve tasks. The Orchestrator plans, tracks progress, and re-plans to recover from errors, while directing specialized agents to perform tasks like operating a web browser, navigating local files, or writing and executing Python code.

Magentic-One achieves performance that is statistically competitive with the state of the art on multiple challenging agentic benchmarks, without requiring modifications to its core capabilities or architecture. Built on AutoGen, our popular open-source multi-agent framework, Magentic-One’s modular, multi-agent design offers numerous advantages over monolithic single-agent systems. By encapsulating distinct skills in separate agents, it simplifies development and reuse, similar to object-oriented programming. Magentic-One’s plug-and-play design further supports easy adaptation and extensibility by enabling agents to be added or removed without needing to rework the entire system. Single-agent systems, by contrast, often struggle with inflexible workflows.

We’re making Magentic-One open-source for researchers and developers. While Magentic-One shows strong generalist capabilities, it’s still far from human-level performance and can make mistakes. Moreover, as agentic systems grow more powerful, their risks, like taking undesirable actions or enabling malicious use cases, can also increase. While we’re still in the early days of modern agentic AI, we’re inviting the community to help tackle these open challenges and ensure our future agentic systems are both helpful and safe. To this end, we’re also releasing AutoGenBench, an agentic evaluation tool with built-in controls for repetition and isolation to rigorously test agentic benchmarks and tasks while minimizing undesirable side-effects.

Code on GitHub

Read the technical report

How it works

Magentic-One features an Orchestrator agent that implements two loops: an outer loop and an inner loop. The outer loop (lighter background with solid arrows) manages the task ledger (containing facts, guesses, and plan) and the inner loop (darker background with dotted arrows) manages the progress ledger (containing current progress, task assignment to agents).

Magentic-One is based on a multi-agent architecture where a lead Orchestrator agent is responsible for high-level planning, directing other agents, and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that it maintains. At each step of its plan, the Orchestrator creates a Progress Ledger where it self-reflects on task progress and checks whether the task is completed. If the task is not yet completed, it assigns one of Magentic-One’s other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress is not being made for enough steps, it can update the Task Ledger and create a new plan. This is illustrated in the figure above; the Orchestrator’s work is thus divided into an outer loop where it updates the Task Ledger and an inner loop where it updates the Progress Ledger.
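The two-loop control flow described above can be sketched roughly as follows. This is not Magentic-One's implementation: `make_plan` and `reflect` are hypothetical stand-ins for LLM calls, the agents are plain callables, and the `max_stalls` and `max_replans` thresholds are assumed values, since the text only says the Orchestrator re-plans when progress stalls "for enough steps".

```python
from dataclasses import dataclass, field


@dataclass
class TaskLedger:
    """Outer-loop state: facts, educated guesses, and the current plan."""
    facts: list = field(default_factory=list)
    guesses: list = field(default_factory=list)
    plan: list = field(default_factory=list)


@dataclass
class ProgressLedger:
    """Inner-loop state produced by one round of self-reflection."""
    task_complete: bool
    making_progress: bool
    next_agent: str = ""
    subtask: str = ""


def run_orchestrator(task, make_plan, reflect, agents, max_stalls=2, max_replans=2):
    """Return (completed, history) for a task driven by hypothetical callables."""
    history = []
    for _ in range(max_replans + 1):
        # Outer loop: create (or re-create) the Task Ledger.
        ledger = make_plan(task, history)
        stalls = 0
        while True:
            # Inner loop: reflect on progress and decide the next step.
            progress = reflect(task, ledger, history)
            if progress.task_complete:
                return True, history
            stalls = 0 if progress.making_progress else stalls + 1
            if stalls >= max_stalls:
                break  # stalled for enough steps: go back and re-plan
            result = agents[progress.next_agent](progress.subtask)
            history.append((progress.next_agent, result))
    return False, history


def demo():
    steps = [("WebSurfer", "find the report"), ("Coder", "summarize the findings")]

    def make_plan(task, history):
        return TaskLedger(facts=[task], plan=[subtask for _, subtask in steps])

    def reflect(task, ledger, history):
        if len(history) == len(steps):
            return ProgressLedger(task_complete=True, making_progress=True)
        agent, subtask = steps[len(history)]
        return ProgressLedger(False, True, agent, subtask)

    agents = {name: (lambda subtask, name=name: f"{name} finished: {subtask}") for name, _ in steps}
    return run_orchestrator("toy task", make_plan, reflect, agents)


if __name__ == "__main__":
    print(demo())
```

Keeping the plan in one structure and the per-step reflection in another mirrors the split between the Task Ledger and the Progress Ledger: the inner loop can run many times against a single plan, and only a sustained stall sends control back to planning.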

Magentic-One consists of the following agents:

Orchestrator: The lead agent responsible for task decomposition, planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed.
WebSurfer: An LLM-based agent proficient in commanding and managing the state of a Chromium-based web browser. For each request, the WebSurfer performs actions such as navigation (e.g., visiting URLs, performing searches), interacting with webpages (e.g., clicking, typing), and reading actions (e.g., summarizing, answering questions). It then reports on the new state of the webpage. The WebSurfer relies on the browser’s accessibility tree and set-of-marks prompting to perform its tasks.
FileSurfer: An LLM-based agent that commands a markdown-based file preview application to read local files. It can also perform common navigation tasks such as listing directory contents and navigating through them.
Coder: An LLM-based agent specialized in writing code, analyzing information collected from the other agents, and creating new artifacts.
ComputerTerminal: Provides access to a console shell for executing programs and installing new libraries.

Together, Magentic-One’s agents equip the Orchestrator with the tools and capabilities it needs to solve a wide range of open-ended problems and autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.

While the default multimodal LLM used for all agents is GPT-4o, Magentic-One is model-agnostic, allowing the integration of heterogeneous models to support different capabilities or meet different cost requirements. For example, different LLMs and SLMs or specialized versions can power different agents. For the Orchestrator, we recommend a strong reasoning model, like GPT-4o. In a different configuration, we also experimented with using OpenAI o1-preview for the Orchestrator’s outer loop and for the Coder, while other agents continued to use GPT-4o.

Evaluation

To rigorously evaluate Magentic-One’s performance, we introduce AutoGenBench, an open-source standalone tool for running agentic benchmarks that allows repetition and isolation, e.g., to control for variance of stochastic LLM calls and side-effects of agents taking actions in the world. AutoGenBench facilitates agentic evaluation and allows adding new benchmarks. Using AutoGenBench, we can evaluate Magentic-One on a variety of benchmarks. Our criterion for selecting benchmarks is that they should involve complex multi-step tasks, with at least some steps requiring planning and tool use, including using web browsers to act on real or simulated webpages. We consider three benchmarks in this work that satisfy this criterion: GAIA, AssistantBench, and WebArena.

In the figure below, we show the performance of Magentic-One on the three benchmarks and compare it with GPT-4 operating on its own and with the per-benchmark highest-performing open-source baseline and non-open-source, benchmark-specific baseline according to the public leaderboards as of October 21, 2024. Magentic-One (GPT-4o, o1) achieves statistically comparable performance to previous SOTA methods on both GAIA and AssistantBench and competitive performance on WebArena. Note that GAIA and AssistantBench have a hidden test set while WebArena does not, and thus WebArena results are self-reported. Together, these results establish Magentic-One as a strong generalist agentic system for completing complex tasks.

Evaluation results of Magentic-One on the GAIA, AssistantBench and WebArena. Error bars indicate 95% confidence intervals. Note that WebArena results are self-reported.

Risks and mitigations

Agentic systems like Magentic-One mark a significant shift in both the opportunities and risks associated with AI. Magentic-One interacts with a digital world designed for humans, taking actions that can change states and potentially lead to irreversible consequences. These inherent and undeniable risks were evident during our testing, where several emerging issues surfaced. For example, during development, a misconfiguration led agents to repeatedly attempt and fail to log into a WebArena website. This resulted in the account being temporarily suspended. The agents then tried to reset the account’s password. Even more concerning were cases in which agents, until explicitly stopped, attempted to recruit human assistance by posting on social media, emailing textbook authors, or even drafting a freedom of information request to a government entity. In each case, the agents were unsuccessful due to a lack of the required tools or accounts, or because human observers intervened.

Aligned with the Microsoft AI principles and Responsible AI practices, we worked to identify, measure, and mitigate potential risks before deploying Magentic-One. Specifically, we conducted red-teaming exercises to assess risks related to harmful content, jailbreaks, and prompt injection attacks, finding no increased risk from our design. Additionally, we provide cautionary notices and guidance for using Magentic-One safely, including examples and appropriate default settings. Users are advised to keep humans in the loop for monitoring, and ensure that all code execution examples, evaluations, and benchmarking tools are run in sandboxed Docker containers to minimize risks.

Recommendations and looking forward

We recommend using Magentic-One with models that have strong alignment, pre- and post-generation filtering, and closely monitored logs during and after execution. In our own use, we follow the principles of least privilege and maximum oversight. Minimizing risks associated with agentic AI will require new ideas and extensive research, as much work is still needed to understand these emerging risks and develop effective mitigations. We are committed to sharing our learnings with the community and evolving Magentic-One in line with the latest safety research.

As we look ahead, there are valuable opportunities to improve agentic AI, particularly in safety and Responsible AI research. Agents acting on the public web may be vulnerable to phishing, social engineering, and misinformation threats, much like human users. To counter these risks, an important direction is to equip agents with the ability to assess the reversibility of their actions—distinguishing between those that are easily reversible, those that require effort, and those that are irreversible. Actions like deleting files, sending emails, or filing forms are often difficult or impossible to undo. Systems should therefore be designed to pause and seek human input before proceeding with such high-risk actions.
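The reversibility idea above can be sketched as a small gate that pauses for human approval before actions that are hard to undo. This is only an illustration, not part of Magentic-One: the action names, the lookup table, and the `ask_human` callback are hypothetical, and asking for approval on effortful actions too is a design choice made here.

```python
from enum import Enum
from typing import Callable


class Reversibility(Enum):
    EASY = "easily reversible"
    EFFORTFUL = "reversible with effort"
    IRREVERSIBLE = "irreversible"


# Hypothetical lookup table; a real system would need a far richer assessment.
ACTION_REVERSIBILITY = {
    "open_page": Reversibility.EASY,
    "rename_file": Reversibility.EFFORTFUL,
    "delete_file": Reversibility.IRREVERSIBLE,
    "send_email": Reversibility.IRREVERSIBLE,
    "submit_form": Reversibility.IRREVERSIBLE,
}


def needs_human_approval(action: str) -> bool:
    """Unknown actions are treated as irreversible, the most cautious default."""
    level = ACTION_REVERSIBILITY.get(action, Reversibility.IRREVERSIBLE)
    return level is not Reversibility.EASY


def execute(action: str, do_it: Callable[[], str], ask_human: Callable[[str], bool]) -> str:
    """Run the action, pausing for human input first when it is hard to undo."""
    if needs_human_approval(action) and not ask_human(action):
        return f"skipped {action}: not approved"
    return do_it()


if __name__ == "__main__":
    deny_all = lambda action: False  # a human who refuses every request
    print(execute("open_page", lambda: "page opened", deny_all))
    print(execute("send_email", lambda: "email sent", deny_all))
```

Defaulting unknown actions to the irreversible class keeps the gate conservative when an agent attempts something outside the table.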

We invite the community to collaborate with us in ensuring that future agentic systems are both helpful and safe.

For further information, results and discussion, please see our technical report.

The post Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks appeared first on Microsoft Research.

OmniParser for pure vision-based GUI agent

Microsoft Research Team — Tue, 08 Oct 2024 22:31:18 +0000

By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager

Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains underexplored in real-world applications, particularly when it comes to acting as general agents across diverse operating systems and applications with only vision input. One of the primary limiting factors is the absence of a robust technique for screen parsing that is capable of 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen.

Meet OmniParser, a compact screen parsing module that can convert UI screenshots into structured elements. OmniParser can be used with a variety of models to create agents capable of taking actions on UIs. When used with GPT-4V, it significantly improves the agent’s capability to generate precisely grounded actions for interface regions.

An agent using OmniParser and GPT-4V achieved the best performance on the recently released WindowsAgentArena benchmark.

We are making OmniParser publicly available on GitHub, along with a report describing the training procedure to encourage research on creating agents that can act on different applications and environments.

Creating OmniParser

Curating Specialized Datasets–The development of OmniParser began with the creation of two datasets:

An interactable icon detection dataset, which was curated from popular web pages and annotated to highlight clickable and actionable regions.
An icon description dataset, designed to associate each UI element with its corresponding function. This dataset serves as a key component for training models to understand the semantics of detected elements.

Fine-Tuning Detection and Captioning Models–OmniParser leverages two complementary models:

A detection model, fine-tuned on the interactable icon dataset, which reliably identifies actionable regions within a screenshot.
A captioning model, trained on the icon description dataset, which extracts the functional semantics of the detected elements, generating contextually accurate descriptions of their intended actions.
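The two-model design above can be sketched as a pipeline: detect interactable regions first, then caption each region, and hand the resulting structured elements to a vision-language model. The sketch below is not OmniParser's actual code. `UIElement`, `parse_screen`, `to_prompt`, and the stub models are hypothetical names used only to show the data flow.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class UIElement:
    element_id: int
    box: Box
    description: str


def parse_screen(
    screenshot: object,
    detect: Callable[[object], List[Box]],
    caption: Callable[[object, Box], str],
) -> List[UIElement]:
    """Detect interactable regions, then describe the function of each one."""
    boxes = detect(screenshot)
    return [UIElement(i, box, caption(screenshot, box)) for i, box in enumerate(boxes)]


def to_prompt(elements: List[UIElement]) -> str:
    """Render the structured elements as text that a VLM could ground actions on."""
    return "\n".join(f"[{e.element_id}] {e.description} at {e.box}" for e in elements)


if __name__ == "__main__":
    # Hypothetical stubs standing in for the fine-tuned detection and captioning models.
    def fake_detect(image: object) -> List[Box]:
        return [(10, 10, 50, 30), (60, 10, 120, 30)]

    def fake_caption(image: object, box: Box) -> str:
        return "search button" if box[0] == 10 else "settings menu"

    print(to_prompt(parse_screen("screenshot.png", fake_detect, fake_caption)))
```

Keeping detection and captioning behind plain callables mirrors the idea that the parser is a separate module that can be paired with different downstream models.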

Benchmark performance

We demonstrate that with the parsed results, the performance of GPT-4V is greatly improved on the ScreenSpot benchmarks. On Mind2Web, OmniParser + GPT-4V achieves better performance than a GPT-4V agent that uses extra information extracted from HTML. And on the AITW benchmark, OmniParser outperforms GPT-4V augmented with a specialized Android icon detection model that is trained with view hierarchy. It also achieves the best performance on a new benchmark, WindowsAgentArena!

To further demonstrate that OmniParser is a plug-in choice for off-the-shelf vision language models, we show the ScreenSpot benchmark performance of OmniParser combined with recently announced vision language models: Phi-3.5-V and Llama-3.2-V. We hope OmniParser can serve as a general and easy-to-use tool that can parse general user screens across both PC and mobile platforms without any dependency on extra information such as HTML and view hierarchy in Android.

The post OmniParser for pure vision-based GUI agent appeared first on Microsoft Research.

Direct Nash Optimization: Teaching language models to self-improve with general preferences

Microsoft Research Team — Tue, 03 Sep 2024 19:07:10 +0000

Presented by Corby Rosset at Microsoft Research Forum, September 2024

“The traditional way to fine-tune an LLM for post-training … basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. … Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.”
– Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

Direct Nash Optimization: Teaching language models to self-improve with general preferences

Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers

This talk discusses teaching language models to self-improve using a preference oracle like GPT-4, framing it as a two-player game to find an optimal policy at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench.

Microsoft Research Forum, September 3, 2024

CORBY ROSSET: Hi, I’m Corby. I’m a scientist in Microsoft Research. Today, we’re going to be talking about Direct Nash Optimization, which is a technique to help language models self-improve.

We all know that there are two main ways to improve language models. One is to scale up the number of parameters or to scale up the amount of training data. Both of these approaches are costly even for the post-training techniques. The traditional way to fine-tune an LLM for post-training is using SFT. SFT basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. More advanced post-training techniques such as RLHF use a fixed reward model, which can be easily hacked or go stale during training and involves much more complex reinforcement learning, which can be unstable. Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.

Before we move on, we want to give a concrete example of what we mean by self-improving behavior. Here’s a simple geometry problem where a base model that was already SFTed makes a simple arithmetic error on the left-hand side. After our self-improving technique, the model is able to correct this mistake.

Here we give a simple overview of how Direct Nash Optimization works. One of the properties of generative LLMs is that you can sample multiple outputs from them. This is advantageous because what we can do is, given an input, we can take our language model and sample, in this case, two outputs—answer A and answer B—and we can have them scored or rated by a preference function oracle, which tells us which response is better. Then we can use a contrastive training mechanism, such as DPO or IPO or others to update the parameters of the language model to hopefully improve it. In the next iteration, timestep t+1, we repeat the process over again. The key insight of this technique is how we define reward. Typically, in the RLHF framework, we want to maximize the reward of a language model policy against some given external reward model. Here, we redefine “reward” as the expected win rate against your own behavior as judged by a preference function P. What this means is that for a given response y to an input x, the reward of that response is defined as the expected win rate against y primes sampled from the policy itself. Hence, rewards are maximized by responses that are preferred over other responses.
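In symbols, the reward described here can be written as \(r(x, y) = \mathbb{E}_{y' \sim \pi(\cdot \mid x)}[P(y \succ y' \mid x)]\), where \(\pi\) is the current policy and \(P\) is the preference function. The sketch below estimates that quantity by sampling and builds one preference pair for a contrastive update. It is a toy illustration rather than the authors' implementation: the `Sampler` and `Preference` signatures, the sample count `n`, and the length-based toy preference are all assumptions.

```python
import random
from typing import Callable, Tuple

# Hypothetical signatures: a sampler draws one response for a prompt, and a
# preference oracle returns P(a is preferred over b | prompt) in [0, 1].
Sampler = Callable[[str, random.Random], str]
Preference = Callable[[str, str, str], float]


def win_rate_reward(x: str, y: str, sample: Sampler, prefer: Preference,
                    n: int = 8, seed: int = 0) -> float:
    """Monte Carlo estimate of y's expected win rate against the policy's own samples."""
    rng = random.Random(seed)
    rivals = [sample(x, rng) for _ in range(n)]
    return sum(prefer(x, y, y_prime) for y_prime in rivals) / n


def build_preference_pair(x: str, sample: Sampler, prefer: Preference,
                          seed: int = 0) -> Tuple[str, str]:
    """Sample two on-policy answers and return them as (preferred, rejected)."""
    rng = random.Random(seed)
    a, b = sample(x, rng), sample(x, rng)
    return (a, b) if prefer(x, a, b) >= 0.5 else (b, a)


if __name__ == "__main__":
    # Toy policy and toy oracle: the oracle simply prefers longer answers.
    def toy_sample(x: str, rng: random.Random) -> str:
        return rng.choice(["4", "The answer is 4.", "2 + 2 = 4, so the answer is 4."])

    def toy_prefer(x: str, a: str, b: str) -> float:
        return 1.0 if len(a) > len(b) else 0.5 if len(a) == len(b) else 0.0

    print(win_rate_reward("What is 2 + 2?", "The answer is 4.", toy_sample, toy_prefer))
    print(build_preference_pair("What is 2 + 2?", toy_sample, toy_prefer))
```

In a real setup the sampler would be the language model at iteration t and the oracle a strong annotator. The pairs would then feed a contrastive loss before the process repeats at iteration t+1.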

When you start comparing the y primes, or the model’s own outputs to each other, this incentivizes a self-improving behavior because you’re basically competing against yourself. You can formulate this in a game theoretic manner where, in this game, you have a single player which is competing against itself, and the payoffs are given by the preference function. In this game, a Nash equilibrium is achieved by the best possible π* whose responses are preferred over any other competing policy in its class.

At a high level, Direct Nash Optimization has many advantages. Firstly, it optimizes towards a more general preference function directly rather than a point-wise reward model, which is limited in its expressibility since it can’t model intransitive preferences. Secondly, it is an iterative algorithm, meaning it is much simpler to implement. We use a contrastive update as the loss, which does not involve any policy gradients or heavy reinforcement learning machinery. We also sample on-policy outputs from the model and compare them to each other in a self-play framework. We use a powerful preference annotator, in this case GPT-4, to rank or judge the best response among them. This approach is also flexible since we can compare the responses to each other but also to outputs from a more powerful teacher such as GPT-4, which provides even bigger improvements. Most importantly, this algorithm is theoretically guaranteed to monotonically approach the Nash equilibrium, hence the name Direct Nash Optimization.
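As one concrete instance of a contrastive update, the sketch below computes the standard DPO loss for a single preferred/rejected pair from sequence log-probabilities. The talk names DPO and IPO only as examples, so this is not necessarily the exact loss used in DNO. The function name, argument names, and `beta` value are choices made for this illustration.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much the policy has raised the log-probability of
    the chosen response, relative to a frozen reference model, against how
    much it has raised that of the rejected response.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), also known as softplus(-z).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # The policy already favors the chosen response more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    # The policy favors the rejected response instead, so the loss is larger.
    print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

The loss needs only log-probabilities from the policy and a reference model, which is why no policy-gradient machinery is involved.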

If you implement this algorithm correctly, you will find state-of-the-art results on several benchmarks, including this one, which is AlpacaEval2. This benchmark basically measures how well language models follow instructions and align with human expectations. This benchmark computes a win rate of the language model’s outputs versus a powerful reference, in this case GPT-4, in a side-by-side comparison. The y-axis is the win rate, and the x-axis is the number of iterations of training. We see that the dark blue line, which is DNO, the vanilla implementation, outperforms two important baselines. The red line is SFT, and the orange and yellow lines are offline contrastive algorithms, such as DPO and KTO. Hence, we see that self-improving post-training is better than offline contrastive training and SFT. Notably, DNO is also able to outperform similar training techniques from other models, which were 10 times as large, namely the gray line, which was a 70 billion parameter Llama model. We are also encouraged to see that these results do not saturate, and with more training in the purple line over more iterations, we see even better results.

We hope this work inspires other researchers to continue to investigate self-improving post-training as an effective method for aligning language models with human expectations. Thank you for watching.

Research Lab AI Frontiers

Group Augmented Learning and Reasoning

The post Direct Nash Optimization: Teaching language models to self-improve with general preferences appeared first on Microsoft Research.

AutoGen Update: Complex Tasks and Agents

Microsoft Research Team — Tue, 04 Jun 2024 18:08:31 +0000

Presented by Adam Fourney at Microsoft Research Forum, June 2024

“Agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task.”
– Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen Update: Complex Tasks and Agents

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Adam Fourney discusses the effectiveness of using multiple agents, working together, to complete complex multi-step tasks. He will showcase their capability to outperform previous single-agent solutions on benchmarks like GAIA, utilizing customizable arrangements of agents that collaborate, reason, and utilize tools to achieve complex outcomes.

Microsoft Research Forum, June 4, 2024

ADAM FOURNEY: Hello, my name is Adam Fourney, and today, I’ll be presenting our work on completing complex tasks with agents. And though I’m presenting, I’m sharing the contributions of many individuals as listed below. All right, so let’s just dive in.

So in this presentation, I’ll share our goal, which is to reliably accomplish long-running complex tasks using large foundational models. I’ll explain the bet that we’re taking on using multi-agent workflows as the platform or the vehicle to get us there, and I’ll share a little bit about our progress in using a four-agent workflow to achieve state-of-the-art performance on a recent benchmark.

So what exactly is a complex task? Well, if we take a look at the following example from the GAIA benchmark for General AI Assistants, it reads, “How many nonindigenous crocodiles were found in Florida from the years 2000 through 2020?” Well, to solve this task, we might begin by performing a search and discovering that the U.S. Geological Survey maintains an online database for nonindigenous aquatic species. If we access that resource, we can form an appropriate query, and we’ll get back results for two separate species. If we open the collection reports for each of those species, we’ll find that in one instance, five crocodiles were encountered, and in the other, just a single crocodile was encountered, giving a total of six separate encounters during those years. So this is an example of a complex task, and it has certain characteristics of tasks of this nature, which is that it benefits strongly from planning, acting, observing, and reflecting over multiple steps, where those steps are doing more than just generating tokens. Maybe they’re executing code. Maybe they’re using tools or interacting with the environment. And the observations they’re doing … they’re adding information that was previously unavailable. So these are the types of tasks that we’re interested in here. And as I mentioned before, we’re betting on using multi-agent workflows as the vehicle to get us there.

So why multi-agents? Well, first of all, the whole setup feels very agentic from, sort of, a first-principles point of view. The agents are reasoning, they’re acting, and then they’re observing the outcomes of their actions. So this is very natural. But more generally, agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task. So to do all this, to build all this, we are producing a platform called AutoGen, which is open source and available on GitHub. And I encourage you to check this out at the link below.

All right, so now let’s talk about the progress we’ve been making using this approach. So if you recall that question about crocodiles from the beginning, that’s from the GAIA benchmark for General AI Assistants. And we put together four agents to work on these types of problems. It consists of a general assistant, a computer terminal that can run code or execute programs, a web surfer that can browse the internet, and an orchestrator to, sort of, organize and oversee their work. Now with that team of four agents, we were actually able to, in March, achieve the top results on the GAIA leaderboard for that benchmark by about 8 points. But what’s perhaps more exciting to us is that we are able to more than double the performance on the hardest set of questions, the Level 3 questions, which the authors of that work describe as questions for a perfect general assistant, requiring to take arbitrarily long sequences of actions, use any number of tools, and to access the world in general. So this is all very exciting, and I want to share a little bit more about what those agents are actually doing.

So this is the loop or the plan that they are following. So it begins with the question or the prompt, and then we produce a ledger, which is like a working memory that consists of given or verified facts; facts that we need to look up, for example, on the internet; facts that we need to derive, perhaps through computation; and educated guesses. Now these educated guesses turn out to be really important because they give the language models space to speculate in a constrained environment without some of the downstream negative effects of hallucination. So once we have that ledger, we assign the tasks to the independent agents, and then we go into this inner loop, where we ask first, are we done? If not, well, are we still making progress? As long as we’re making progress, we’ll go ahead and we’ll delegate the next step to the next agent. But if we’re not making progress, we’ll note that down. We might still delegate one other step, but if that stall occurs for three rounds, then we will actually go back, update the ledger, come up with a new set of assignments for the agents, and then start over.
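The loop described above can be sketched roughly as follows. This is a simplified reading of the talk, not AutoGen's actual code: the `Ledger` and `StepResult` types, the callback names, the `max_steps` cap, and the choice to count consecutive stalls are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ledger:
    """Working memory with the four kinds of entries named in the talk."""
    verified_facts: List[str] = field(default_factory=list)
    facts_to_look_up: List[str] = field(default_factory=list)
    facts_to_derive: List[str] = field(default_factory=list)
    educated_guesses: List[str] = field(default_factory=list)


@dataclass
class StepResult:
    done: bool
    made_progress: bool


def orchestrate(
    question: str,
    make_ledger: Callable[[str], Ledger],
    run_next_step: Callable[[Ledger], StepResult],
    update_ledger: Callable[[Ledger], Ledger],
    max_stalls: int = 3,
    max_steps: int = 50,  # safety cap added for this sketch
) -> bool:
    """Return True if the task finishes within max_steps, otherwise False."""
    ledger = make_ledger(question)
    stalls = 0
    for _ in range(max_steps):
        result = run_next_step(ledger)  # delegate the next step to an agent
        if result.done:
            return True
        if result.made_progress:
            stalls = 0
        else:
            stalls += 1  # note the stall, but keep delegating for now
        if stalls >= max_stalls:
            ledger = update_ledger(ledger)  # revise facts and assignments, then start over
            stalls = 0
    return False
```

Passing the agents in as callbacks keeps the control flow visible: the outer step rebuilds the ledger, while the inner loop only tracks completion and stalls.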

All right, so this is the configuration that’s been working well for us, and it’s all I have time to share with you today. But I mentioned our goal, our bet, and our progress, and I want to conclude by sharing our plans for the future. So already we’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.
All right, well, thank you for your attention, and thank you for attending the Microsoft Research Forum, and we look forward to you joining us next time.

Research Lab AI Frontiers

Talk What's new in AutoGen?

Project AutoGen

Publication AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Blog AutoGen: Enabling next-generation large language model applications

Download AutoGen

The post AutoGen Update: Complex Tasks and Agents appeared first on Microsoft Research.

Evaluation and Understanding of Foundation Models

Microsoft Research Team — Tue, 30 Jan 2024 13:18:54 +0000

Presented by Besmira Nushi at Microsoft Research Forum, January 2024

“We see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.”
– Besmira Nushi, Principal Researcher

Transcript

Besmira Nushi, Principal Researcher, Microsoft Research AI Frontiers

Besmira Nushi summarizes timely challenges and ongoing work on evaluating and in-depth understanding of large foundation models as well as agent platforms built upon such models.

Microsoft Research Forum, January 30, 2024

BESMIRA NUSHI: Hi, everyone. My name is Besmira Nushi, and together with my colleagues at Microsoft Research, I work on evaluating and understanding foundation models. In our team, we see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.

But evaluation is hard, and new generative tasks are posing new challenges in evaluation and understanding. For example, it has become really difficult to scale up evaluation for long, open-ended, and generative outputs. At the same time, for emergent abilities, very often some benchmarks do not exist and often we have to create them from scratch. And even when they exist, they may be saturated or leaked into training datasets. In other cases, factors like prompt variability and model updates may be just as important as the quality of the model that is being tested in the first place. When it comes to end-to-end and interactive scenarios, other aspects of model behavior may get in the way and may interfere with task completion and user satisfaction. And finally, there exists a gap between evaluation and model improvement.

In our work, we really see this as just the first step towards understanding new failure modes and new architectures through data and model understanding. So in Microsoft Research, when we address these challenges, we look at four important pillars. First, we build novel benchmarks and evaluation workflows. Second, we perform and put a focus on interactive and multi-agent systems evaluation. And in everything we do, in every report that we write, we put responsible AI at the center of testing and evaluation to understand the impact of our technology on society. Finally, to bridge the gap between evaluation and improvement, we pursue efforts in data and model understanding.

But let’s look at some examples. Recently, in the benchmark space, we released KITAB. KITAB is a novel benchmark and dataset for testing constraint satisfaction capabilities for information retrieval queries that have certain user specifications in terms of constraints. And when we tested recent state-of-the-art models with this benchmark, we noticed that only in 50 percent of the cases these models are able to satisfy user constraints.

And similarly, in the multimodal space, Microsoft Research just released HoloAssist. HoloAssist is a testbed with extensive amounts of data that come from recording and understanding how people perform tasks in the real and physical world. And this provides us with an invaluable amount of resources in terms of evaluation for understanding and measuring how the new models are going to assist people in things like task completion and mistake correction. In the responsible AI area, ToxiGen is a new dataset that is designed to measure and to understand toxicity generation from language models. And it is able to measure harms that may be generated from such models across 13 different demographic groups.

Similarly, in the multimodal space, we ran extensive evaluations to measure representational fairness and biases. For example, we tested several image generation models to see how they represent certain occupations, certain personality traits, and geographical locations. And we found that sometimes such models may present a major setback when it comes to representing different occupations if compared to real-world representation. For instance, in some cases, we see as low as 0 percent representation for certain demographic groups.

Now when it comes to data and model understanding, often what we do is that we look back at architectural and model behavior patterns to see how they are tied to important and common errors in the space. For example, for the case of constraint satisfaction for user queries, we looked at factual errors and information fabrication and mapped them to important attention patterns. And we see that whenever factual errors occur, there are very weak attention patterns within the model that map to these errors. And this is an important finding that is going to inform our next steps in model improvement.

So as we push the new frontiers in AI innovation, we are also just as excited about understanding and measuring that progress scientifically. And we hope that many of you are going to join us in that challenge.

Thank you.

Publication KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Blog HoloAssist: A multimodal dataset for next-gen AI copilots for the physical world

Publication ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

Publication Social Biases through the Text-to-Image Generation Lens

Publication Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models

Research Lab Microsoft Research AI for Science

The post Evaluation and Understanding of Foundation Models appeared first on Microsoft Research.

Improving Reasoning in Language Models with LASER: Layer-Selective Rank Reduction

Microsoft Research Team — Tue, 30 Jan 2024 13:14:41 +0000

Presented by Dipendra Misra at Microsoft Research Forum, January 2024

“An LLM is trained on lots of data, often collected from the internet, and uses a model architecture, typically a transformer, to train the model, and they work remarkably well across a range of different tasks. And so one way perhaps we can build towards understanding [an] LLM is by performing interventions in the model and then seeing how that intervention reflects in [its performance].”
– Dipendra Misra, Senior Researcher

Transcript

Dipendra Misra, Senior Researcher, Microsoft Research NYC and AI Frontiers

Dipendra Misra will present a surprising discovery that by merely replacing selected weight matrices in an LLM with their suitable low-rank approximation, you can significantly improve the performance of the LLM, at times by 20 to 30 percentage points.

Microsoft Research Forum, January 30, 2024

DIPENDRA MISRA: Welcome, everyone. I’m Dipendra Misra, a researcher at Microsoft Research New York City and AI Frontiers, and I’m excited to be talking about our new method called LASER, which is Layer-Selective Rank Reduction, an approach for improving pretrained large language models. So large language models, or LLMs, have revolutionized machine learning, and yet there is so little we know about how they work.

So in summary, an LLM is trained on lots of data, often collected from the internet, and uses a model architecture, typically a transformer, to train the model, and they work remarkably well across a range of different tasks. And so one way perhaps we can build towards understanding an LLM is by performing an intervention in the model and then seeing how that intervention reflects in the performance of the LLM. For example, we may find that performing a certain type of intervention may affect one type of task but not the other. And in this way, we may understand how the information about solving different tasks is stored inside the LLM. So with this motivation in mind, we introduce LASER, which is a type of intervention where we select one of the weight matrices of the LLM and replace it by its low-rank approximation.

So in the bottom over here, we see our transformer architecture. If you’re not familiar with the details of it, that’s fine. What we need to know here is that the transformer architecture consists of repeated transformer blocks arranged in different layers, and each block has multiple weight matrices, which are shown here in square. So, for example, here, to perform LASER, we select this weight matrix, which is highlighted in red, and it’s coming from layer No. 22, and we call it the \(W\) matrix here.

And to perform this low-rank approximation, we first use what’s called a singular value decomposition, which decomposes this matrix into three matrices called \(U\), \(Σ\), and \(V\). The \(Σ\) here contains the singular values of the matrix, arranged diagonally in decreasing order. So to perform the low-rank approximation, we throw away all the information in \(U\), \(Σ\), and \(V\) that is not in blue color, and then we multiply the remaining matrices, and we get the low-rank approximation, which is shown as \(W_{lr}\). And this is a very computationally efficient process and can be done easily with existing libraries.
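As an illustration of the step just described, here is a minimal NumPy sketch of a rank-\(k\) approximation through truncated SVD. It is not the LASER codebase. The function name and the small random matrix are stand-ins, and note that `numpy.linalg.svd` returns the transpose of \(V\) directly.

```python
import numpy as np


def low_rank_approximation(W: np.ndarray, rank: int) -> np.ndarray:
    """Keep the `rank` largest singular components of W and multiply them back."""
    # With full_matrices=False, U is (m, r), S is (r,), Vt is (r, n), r = min(m, n).
    # NumPy returns the singular values in S sorted in decreasing order.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Scale the kept columns of U by their singular values, then project back.
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((6, 4))  # toy stand-in for one weight matrix
    W_lr = low_rank_approximation(W, rank=2)
    print("rank of W_lr:", np.linalg.matrix_rank(W_lr))
    print("approximation error:", np.linalg.norm(W - W_lr))
```

Keeping every singular component reproduces the original matrix. Dropping the smallest ones gives the closest matrix of that rank in the Frobenius norm.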

So in summary, to perform a single LASER intervention, one has to make three choices. So first is which layer to select. Second is which type of weight matrix to edit. And third is how much approximation should be done. In our paper, we also study how these different LASER interventions can be composed across layers and applied simultaneously. So before discussing how to evaluate LASER, I want to mention that LASER also has the advantage of reducing the memory footprint of the model. And this is important because we are living in this age where the memory taken by LLMs is growing at an astonishing pace, and by reducing the memory footprint, we can allow more people to be able to use these LLMs and store them on device.
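To make the three choices and the memory argument concrete, the sketch below records one intervention and compares parameter counts. It assumes the reduced matrix is stored as two thin factors rather than as a dense matrix, which the talk does not spell out. The field names, the matrix shape, and the rank are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaserIntervention:
    layer: int        # which transformer layer to edit
    matrix_type: str  # which kind of weight matrix in that layer
    rank: int         # how many singular components to keep


def dense_params(rows: int, cols: int) -> int:
    """Number of entries in the original dense matrix."""
    return rows * cols


def factored_params(rows: int, cols: int, rank: int) -> int:
    """Entries needed for a (rows x rank) factor and a (rank x cols) factor."""
    return rank * (rows + cols)


if __name__ == "__main__":
    choice = LaserIntervention(layer=22, matrix_type="mlp_input", rank=40)
    rows, cols = 4096, 16384  # hypothetical shape of one MLP weight matrix
    before = dense_params(rows, cols)
    after = factored_params(rows, cols, choice.rank)
    print(choice)
    print(f"{before} -> {after} parameters ({before / after:.1f}x smaller)")
```

Under this storage assumption the factored form is smaller only when the rank is below rows * cols / (rows + cols).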

So for our first evaluation, we evaluate LASER on an existing GPT-J LLM and evaluate on the CounterFact question-answering dataset. The motivation for this is that the GPT-J LLM has its training data available publicly, which allows us to do interesting analysis with it, and the CounterFact question-answering dataset has paraphrases, which allows us to measure robustness to paraphrases.

Now as I mentioned earlier, we are doing intervention using LASER on the LLM, so one would expect that the model loss should go up as we are doing more approximation, meaning that the model is going to perform badly, right, because we are throwing [out] information from an LLM, which is trained on large amounts of data. But to our surprise, what we find [is] that if the right type of LASER intervention is performed, then the model loss doesn’t go up but actually goes down, meaning that we actually improve the pretrained LLM even more.

So in this figure here, we show what happens when the LASER is applied to the MLP matrices, and we see that if we apply LASER at the earlier layer, then the loss is going up. Here, the orange color or the yellow color shows that we’re doing less approximation, and black or in blue means we are doing more approximation. So in the lower layer, we can see that the yellow has a lower loss, but the black has a higher loss. But if you apply LASER in the later layers, we see that the loss is actually decreasing as we do more approximation. And this is truly surprising.

So does this hold more generally? So we find that, yes, this does hold across several tasks and in three different LLMs, namely RoBERTa, GPT-J, and Llama 2. And at times, we see surprising gains like 20 to 30 percentage points. For example, on this task of gender prediction using biographies, we see that the performance of GPT-J goes from 70.9 percent to 97.5 percent accuracy. And in our paper, we have more types of analysis. I’ll just briefly describe two of them quickly.

So one of them shows that if you apply LASER, then most of the gains that we get are from improvements on data points which are rarer in the training data. And we also find that the components that LASER removes from a weight matrix typically offer responses that are of the right semantic type but incorrect. And so we can view LASER as a denoising process which is removing this erroneous information.

So in conclusion, we present LASER, which is a new way of doing intervention in large language models, and we show a surprising result that performing LASER can both increase the accuracy of these large language models and reduce their memory footprint. And more details can be found in our paper, which is available on arXiv and will appear as a conference paper at the upcoming ICLR conference.

Thank you.

Publication The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

Podcast Abstracts: January 25, 2024


Panel Discussion: AI Frontiers

Microsoft Research Team — Tue, 30 Jan 2024 00:33:42 +0000

Hosted by Ashley Llorens, with Ece Kamar, Sébastien Bubeck, and Ahmed Awadallah at Microsoft Research Forum, January 2024

“The sparks that we are seeing [are] really about having building blocks that give us the initial technologies … to get to those AI systems that have a memory, that have a history, that have a deep understanding of human concepts, and that can carry out tasks that are a lot broader, a lot more complex than what we can do today.”
– Ece Kamar, Managing Director, AI Frontiers


Transcript

Ashley Llorens, VP and Distinguished Scientist, Microsoft
Ece Kamar, Managing Director, Microsoft Research AI Frontiers
Sébastien Bubeck, VP, Microsoft GenAI
Ahmed Awadallah, Senior Principal Research Manager, Microsoft Research AI Frontiers

Microsoft AI researchers discuss frontiers in small language models and where AI research and capabilities are headed next.

Microsoft Research Forum, January 30, 2024

I’m Ashley Llorens, with Microsoft Research. My team works across research and product to incubate emerging technologies and runs programs that connect our research at Microsoft to the broader research community. I sat down with research leaders Ece Kamar, Ahmed Awadallah, and Sébastien Bubeck to explore some of the most exciting new frontiers in AI. We discussed their aspirations for AI, the research directions they’re betting on to get us there, and how their team is working differently to meet this moment.

ASHLEY LLORENS: So let’s dive in. We’re experiencing an inflection point in human technology where machines, broadly speaking, are starting to exhibit the sparks of general intelligence, and it’s hard to avoid the enthusiasm. Even if you wanted to. And I think it’s fair to say that there’s no shortage of that enthusiasm here among us. But as researchers, we’re also skeptics. You know, we go right in and try to understand the limitations of the technology as well as the capabilities, because it’s really those limitations that expose and define the frontiers that we want to push forward on. And so what I want to start here with is to sketch those frontiers here with you a little bit. I’d like to hear about an aspiration you have for AI and why the technology cannot do that today. Then we’ll come back around to the research directions that you’re betting on to close those gaps. And, so, I don’t know. Ahmed, what do you think? What aspiration do you have for AI, and why can’t the tech do it today?

AHMED AWADALLAH: I have a lot of aspirations. I think … you just mentioned we saw the sparks of AGI, so naturally, we’re looking forward to actually seeing AGI. But beyond that, more realistically, I think two of the things I’m really looking forward to is having AI that can actually perceive and operate in the real world. We have made significant advances with language models. We are seeing a lot of advances with multimodality. It looks like an AI that can perceive and operate in the real world is not that far off from where we are. But there are a lot of challenges, as well. And I’m really excited to see how we can get to that.

LLORENS: What does that look like for you, when AI operates in the real world? What is it doing?

AWADALLAH: It looks … To me, it means that we go, first, we go beyond language, and we are getting a lot into multimodal models right now that can perceive images and languages. However, a big part of what we do is that we take actions in the world in different ways. We have a lot of behavior that we exhibit as we do tasks, and it’s not clear that we can do that right now with AI. So imagine that we have an AI system that we can ask to do things on our behalf, both in the digital and in the physical world. Imagine that we have guarantees that they will accomplish these tasks in a way that aligns with our original intent.

LLORENS: Yeah, it’s compelling. Ece, what do you think?

ECE KAMAR: My dream for the AI systems is that they become our helpers, companions, longer-term collaborators than just, like, prompting something and it gives me an answer. And we are, actually, still quite far from having AI systems that can really help us through our life for the different purposes that we have and also really understand our goals, intentions, and also preferences. So I think we have, right now, the sparks that we are seeing are really about having building blocks that give us the initial technologies to build on, to get to those AI systems that have a memory, that have a history, that have a deep understanding of human concepts, and that can carry out tasks that are a lot broader, a lot more complex than what we can do today. And our task right now is using these blocks to really imagine what those future systems are going to look like and discover those new innovations that will push the capabilities forward so that we can really build systems that create a difference in our lives, not only the systems that we want to play with or, you know, do small tasks for us—that are already changing how I work, by the way. These things are not minor, but they can really be a part of my daily life and help me with everything I do.

LLORENS: Seb, what do you think?

SÉBASTIEN BUBECK: Yeah, my aspiration for AI, actually, has nothing to do with the technology itself. I hope that AI will illuminate how the human mind works. That’s really my real aspiration. You know, I think what’s going on in our minds and the way we reason is extremely mysterious. And anything that is mysterious, it looks kind of magical. We have no idea what are the basic elements for it. And with AI, we’re seeing that, at the very least, it’s mimicking the type of reasoning that’s going on in human beings. So I’m hoping that we’re going to be able to really uncover those building blocks of reasoning. That’s my dream for the next decade, I guess.

LLORENS: How good of an analogy do you think, I’ll say, transformers or, you know, today’s machine learning models are for how we think and reason?

BUBECK: It’s a terrible analogy. [LAUGHS] So it really … the transformer is absolutely not, in my mind, trying to mimic what the human brain is doing. It’s more like the emergent properties are similar. So, you know, it’s … the substrate is going to be obviously different. I mean, one is a machine and one is wetware, and the concrete algorithm that is running will be different. But it’s plausible that the emergent property will be similar. That’s what I’m hoping.

LLORENS: No, yeah. Super interesting. And now I want to understand a little bit about the research directions that you are most excited about to get there. I don’t think you’re going to tell me about your neuroscience research. [LAUGHS]

BUBECK: [LAUGHS] I wish. I wish.

LLORENS: That’s an interesting place to start …

KAMAR: Not yet. Maybe in the next episode. [LAUGHS]

BUBECK: Exactly.

LLORENS: But what are you betting on right now to get us closer to that?

BUBECK: Yeah. No, it’s actually connected, the two things. So what we are experimenting with right now is the following. So to us, I think to all of us here, GPT-4 showed the sparks of AGI, early signs of humanlike reasoning. And to us, we see this as a, kind of, proof of concept. OK, it means you can get this type of intelligence—quote, unquote—if you scale up a ton, if you have a very, very large neural network trained on a lot of data with a lot of compute for a very long time. OK, great. But exactly which one of those elements was needed? Is it the big data that’s necessary? Is it the large neural network? Is it a lot of compute? And what is a lot, by the way? What is large? You know, is 1 billion large? Is 10 billion large? You know, questions like this. So to me, this comes from a scientific inquiry perspective. But at the end of the day, it has enormous economic impact, because when you answer these questions, you can make everything smaller. And this is what we’ve been doing with the Phi series of models, trying to build those small language models. Again, we come at it from the scientific perspective, but it has very, very concrete impact for the future of Microsoft.

LLORENS: So I think Phi is on a lot of minds right now. Let’s actually stick with Phi for a minute. What is the secret? [LAUGHS] What—let’s stick with that—what is the secret? What’s enabling you to get to the reasoning capabilities that you’re demonstrating with models of that size?

BUBECK: Yes, yes, yeah. There is …

LLORENS: What size is Phi, by the way?

BUBECK: Yeah, so the latest, Phi-2 (opens in new tab), is 2.7 billion parameters. Phi-1.5 (opens in new tab) was 1.3 billion. So we have doubled the size. So the secret is actually very simple. The secret is in the title of the first paper that we wrote in the Phi series, which is “Textbooks Are All You Need.” So “Textbooks Are All You Need,” this is, of course, a play on the most famous paper of all time in machine learning, “Attention Is All You Need,” that introduced the attention mechanism for the transformer architecture. So in “Textbooks Are All You Need,” what we say is if you play with the data and you come up with data which is of “textbook quality”—so the meaning of this is a little bit fuzzy, and this is where part of the secret lies—but if you come up with this textbook-quality data, we’re able to get a thousand x gains if you look at the total compute that you need to spend to reach a certain level in terms of benchmark, intelligence, etc. So now what is this textbook quality, this mysterious textbook quality? Well, the way I want to put it is as follows. What matters in text when you give text to this transformer to try to teach them a concept is how much reasoning is going on in the text? How, what kind of concept can you extract if you are to predict the next word in that text? So what we want is text which is reasoning dense, and, you know, like, novels, they are not really reasoning dense. Sometimes you need to reason a little bit to understand, OK, how all the characters are related, you know, why are they thinking or doing what they are doing. But where do you have really reasoning-dense text? Well, it’s in textbooks. So this is the secret, basically.

LLORENS: And, Ahmed, recently you and I have had conversations about a universe of different pretraining methods, textbook-like reasoning tokens, you know, being one, and then also the whole universe of, of post-training methods and how there’s a whole space to explore there. So maybe you can get into your research interests, you know, where are you pushing on that frontier? And, you know, what haven’t we talked about yet in terms of pretraining versus post-training?

AWADALLAH: Yeah, that’s a very good question. And, actually, it was very interesting that many, many similar insights would apply to what Sébastien was just describing. But if you look at how we have been training models recently, we start with the pretraining stage, where we basically show the model a lot of text—the textbooks—and we have them learning to predict the next word. And with a lot of data and a lot of size, the big size, a lot of emergent properties were showing up in some models that we didn’t really even try to teach the model. But we have also been seeing that there are other stages of training—some people refer to it as post-training—where after we pretrain the model, we actually start teaching it specific skills, and that comes in the form of input-output samples or sometimes an input and two different outputs, and we are trying to teach the model that the first output is preferred to the second output. We can do that to teach the model a particular style or a skillset or even for alignment, to teach it to act in a safer way.
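
The two kinds of post-training data mentioned here can be sketched as simple records. The pairwise logistic loss below is one common way to express "the first output is preferred to the second"; it is an assumption for illustration, not necessarily the objective used for the models discussed in this panel. All class, field, and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SupervisedSample:
    """An input-output sample: the model is taught to produce `output`."""
    prompt: str
    output: str


@dataclass
class PreferencePair:
    """An input with two outputs, where `preferred` should beat `rejected`."""
    prompt: str
    preferred: str
    rejected: str


def pairwise_preference_loss(score_preferred, score_rejected):
    """Logistic loss on the score margin: -log(sigmoid(preferred - rejected)).

    The scores stand in for whatever number a model assigns to each output.
    The loss shrinks as the preferred output is scored further above the other.
    """
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))


pair = PreferencePair(
    prompt="Explain the result briefly.",
    preferred="A short, careful answer.",
    rejected="A long, evasive answer.",
)

# Hypothetical scores before and after some training on this pair.
loss_before = pairwise_preference_loss(score_preferred=0.0, score_rejected=0.0)
loss_after = pairwise_preference_loss(score_preferred=2.0, score_rejected=0.0)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```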

But what we have found out is that now that we have these large models, as well—and they are actually very powerful engines that can enable us to create all sorts of data—many of these properties, we don’t have to wait for them to emerge with the size. We can, actually, go back and create synthetic tailored data to try to teach a smaller model that particular skill. We started with reasoning, as well, because reasoning is a pretty hard property, and we haven’t really seen reasoning emerging even to that level we have in models like GPT-4 right now, except after scaling to such a large size in both the model and the data. So the question was, now that it has emerged in these models, can we actually create data that teaches the model that particular skill? And we were not trying to teach the model any new knowledge, really. We were just trying to teach the small model how to behave, how to solve a task. So, for example, with a model like GPT-4, we are seeing that you can ask it to solve a task that requires breaking up a task into steps and going step by step into solving that task. We have never seen that with a small model, but what we have found out is that you can, actually, use a powerful model to demonstrate the solution strategy to the small model, and you can actually demonstrate so many solution strategies for so many tasks. And the small models are able, actually, to learn that, and the reasoning ability is significantly improved based on that.

LLORENS: I find the word reasoning pretty loaded.

AWADALLAH: It is.

LLORENS: I think a lot of people mean a lot of different things by reasoning. Actually, I found some clarity. I had a nice discussion with two of our colleagues, Emre Kiciman and Amit Sharma, and, you know, they wrote a recent paper on reasoning. Sometimes we mean symbolic-style reasoning; sometimes we mean more commonsense reasoning. You talked about, kind of, more symbolic-style-reasoning tokens perhaps, or how do I think about the difference between those kinds of training data versus world knowledge that I might want a model to reason about?

BUBECK: Yeah, very good question. So if you take the perspective that you start with a neural network, which is a completely blank slate, you know, just purely random weights, then you need to teach it everything. So going for the reasoning, the high-level reasoning that we do as human beings, this is like, you know, step No. 10. You have many, many steps that you need to satisfy before, including, as you said, commonsense reasoning. So, in fact, in our approach for the pretraining stage, we need to spend a lot of effort into the commonsense reasoning. And there, the textbooks approach is perhaps a little bit weird because there’s no textbook to teach you commonsense reasoning. You know, you acquire commonsense reasoning by going outside, you know, seeing nature, talking to people, you know, interacting, etc. So we … you have to think a little bit outside the box to come up with textbooks that will teach commonsense reasoning. But this is, actually, what we do, a big, a huge part of what we did. In fact, everything that we did for Phi-1.5 was focused on commonsense reasoning. And then when we got to Phi-2, we got a little bit closer to the Orca model, and we tried to teach also slightly higher-level reasoning, but we’re not there yet. There is still, you know, a few more layers. We’re not yet at step No. 10.

LLORENS: Yeah, fair enough. Ece, geek out with us a little bit now on research directions. I’m sure you have a lot of interest in everything we’ve just talked about. Anything you want to add from your perspective?

KAMAR: There is, actually, a lot to add, because one of the biggest things that we are trying to do in our new organization is understand the connections between these different works that are going on, because our purpose is not exploring independent research directions and making progress on each. But we have a very focused mission. Our focused mission is expanding the frontiers of AI capabilities, expanding the frontiers of what intelligence can be in these machines. And to be able to get there, we have to have a coordinated understanding of how Phi connects to Orca and how these two model families connect to other future-looking ideas that can push those boundaries forward. So think about this as, like, an intelligence pyramid. That’s how I have been, kind of, thinking about this in my mind.

At the base of it, we have the building blocks of these models, base models. Phi is a beautiful example. And in the future, we are going to have other models. Phi is going to go and do other things, and other places can do other things. Phi and GPT-4 and these models are going to coexist in a model library. The one layer above that is all of the work that the Orca team is doing with fine-tuning specialization. Taking a capability, taking a domain, taking some constraints and trying to see, like, I have these base models, but how do I make them work even better for the different domains and capabilities that I really, really care about and have more control over what those models generate for me. So that’s like the second step of that intelligence pyramid that we are building. But then we have been doing some really interesting demonstrations and building in our teams to, kind of, look at, like, how does orchestration play a role in that intelligence pyramid? Because when you think about it, the simplest way we can get things done with either the base models or the specialized models today is I just tell it to do something by prompting and it does something for me. But is that the end of the way we are going to be building with these models to be able to expand those frontiers? That answer is a no. And in fact, one piece of work that our teams have been doing collectively is called AutoGen (opens in new tab). And that library, which became very popular with the developer community—and we love seeing the responses we are getting. Correct me, Ahmed, I think we got to 15,000 stars under a month in GitHub …

AWADALLAH: Yeah, we did.

KAMAR: … with this library, with this very experimental library. And we are learning a lot from the developer community about how they are using it. But what we are seeing is that the kind of things people want to do with these models, when they want to expand those capability boundaries, when they want to have a more robust execution, when they want to really overcome the brittleness of the strategy of just prompting the models, they actually go to orchestration, and in fact, they go to multi-agent orchestration. So what we mean by multi-agent orchestration is this: imagine you have a complex task that you cannot reliably do by just prompting even the best model we have in our family. But what you can do is something very similar to how humans work actually. We take a complex problem. We divide it into smaller pieces and then assign smaller pieces to different people that have different capabilities. That’s exactly how the AutoGen framework works. It takes a complex task, divides it into smaller pieces, and assigns different pieces to different “agents,” which means intelligences that can prompt different models with different strategies and personas and get them working together. And what we are seeing is that this very simple idea of multi-agent orchestration, on top of all of the great work that’s happening on the modeling side, is another layer in that intelligence pyramid that can really push the frontiers forward. So one of the things we are doing in our organization is really understanding these connections—how does Phi relate to Orca relate to AutoGen?—as we are building this pyramid. But there is something else we are betting on right now, which I believe is going to become very, very important as these systems become a part of the real world, as Ahmed was suggesting.
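
The divide-and-assign idea described here can be sketched in a few lines of plain Python. This is not the AutoGen API: the `Agent` class, the `orchestrate` function, and the three example agents are hypothetical, and each `respond` function is a stand-in for prompting a model with a particular strategy or persona.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    """A named worker with one skill; `respond` stands in for a model call."""
    name: str
    skill: str
    respond: Callable[[str], str]


def orchestrate(
    subtasks: List[Tuple[str, str]], agents: List[Agent]
) -> List[Tuple[str, str]]:
    """Route each (skill, piece) subtask to the agent that has that skill."""
    by_skill = {agent.skill: agent for agent in agents}
    results = []
    for skill, piece in subtasks:
        agent = by_skill[skill]  # raises KeyError if no agent has the skill
        results.append((agent.name, agent.respond(piece)))
    return results


# Hypothetical agents; a real system would prompt different models here.
agents = [
    Agent("planner", "plan", lambda text: f"plan for: {text}"),
    Agent("coder", "code", lambda text: f"code for: {text}"),
    Agent("reviewer", "review", lambda text: f"review of: {text}"),
]

# A complex task, already divided into smaller pieces by required skill.
subtasks = [
    ("plan", "build a report"),
    ("code", "load the data"),
    ("review", "the draft report"),
]

results = orchestrate(subtasks, agents)
for name, output in results:
    print(f"{name}: {output}")
```

In this sketch the task is split by hand. Deciding how to split the task, and letting agents exchange messages rather than work in a fixed order, are the parts a real orchestration framework would add.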

So when we were doing the “sparks of AGI” work, there is actually something we say in the introduction when we are talking about intelligence, the core of intelligence. Any intelligence system needs to be learning, needs to be learning from their environment, needs to be learning from the interactions they are having. And this is not something we currently have even in the best models or even in the best AI systems we have in the world. They are static. They may be interacting with millions of people every day and getting feedback from them or seeing how people respond to it, but it does not make any of those systems better or more intelligent or understand their users any better. So I feel like this is one of the areas that we have to push forward very strongly. How do we incorporate a learning feedback loop into this intelligence pyramid—every layer of it—in a transparent, understandable, and reliable way so that the systems we are building are not only getting better because experts like Sébastien and Ahmed are putting a lot of time in data collection. And, of course, that work needs to happen, as well, and, you know, coming up with new ideas to make the models better. But we are, actually, creating this virtuous loop for our systems for them to get better in time.

The last research idea we are pushing forward is something, actually, very unifying across the stack I’m talking about. One of the biggest questions is, what does the progress in AI look like today, right? Like, we are doing all of this great work, but how are the capabilities of the AI systems, all the models we are building, evolving as the models scale up and we have more data? So this is really becoming a question about evaluation and understanding. So think about this as we are doing a lot of agile work in a very fast-changing environment. What we need is headlights to be able to see where we are going and how much progress we have made. So this is why another area we are really pushing for as a research direction in our organization is not only relying on existing benchmarks and existing evaluation strategies, but really reinventing how we think about evaluation overall. We talked about this intelligence stack. How can the innovations in the intelligence stack enable the researchers to come up with new approaches to understand the models, evaluate the models, such that we can have a much better understanding of where we are and where we are headed as we are building this intelligence pyramid?

LLORENS: A quick follow-up question on evaluation. This is one that I think a lot about. There’s the idea of benchmarks that try to maybe test the, you know, the generality of the intelligence of a model. And then there’s, all the way, the end-to-end evaluation in the context of use. And how much do we think about the end-to-end story there when we talk about evaluation?

KAMAR: It’s a spectrum. I would also like to hear from Sébastien and Ahmed, but it is really a spectrum, and there are different questions that motivate the work on evaluation. So when we ask a question like what does that capability curve look like for AI models, there we have to focus on the models themselves and understand how the models are progressing. But then if you are asking a question of, I want to build reliable, capable AI systems of the future—what does that curve look like? That requires a different way of thinking about the evaluation where we are not only evaluating the models, but we are evaluating the whole stack. We are actually saying, OK, let’s think about prompting. Let’s think about orchestration and understanding the complementarity of the stack and looking into how the capabilities improve as we put the pieces together and to be able to light our way forward, both in terms of understanding how well we do in models and how well we do in building systems. We have to do the work in both. There is really no shortcut for that.

LLORENS: Microsoft Research is over 30 now, over 30 years old. And suffice it to say, I think we’re, you know, we’ve been going strong for over 30 years, but we’re in new territory. And I think we are organizing differently in some ways, you know, to meet the moment. And along those lines—and you, kind of, alluded to this before—but you’ve recently taken on a new leadership role.

KAMAR: With Sébastien and Ahmed, as well.

LLORENS: Of course. So maybe you can say a little bit more about how we’re organizing differently, what this looks like from your perspective.

KAMAR: As you said, this is really about the moment that we are in right now. Of course, I haven’t been at Microsoft Research for the whole 30 years [LAUGHTER], but I’ve been here for at least half of it, and personally, for me, there has never been a moment as exciting as now to be an AI researcher and to be an AI researcher inside Microsoft. Think about it. This is the company that is building the cutting-edge AI technologies in the hands of millions of people and doing it at an unbelievable speed that surprises me, although I have been an employee of this company for the last 13 years. So think about the speed of innovation that we are seeing here. Think about where the ambition level is in this company when it comes to doing great AI work.

Of course, by doing research inside Microsoft, we are also able to see where the gaps are. We are able to get a lot of feedback about what is working and what is not working. And that’s giving us a lot of really strong signals about where we need to push. And, in fact, these research directions we are talking about, they are not coming from thin air. This is really coming from working with different product groups, learning from their experiences, trying things ourselves, as well. So these are all motivating us to rethink what AI research means in this new AI age. So if you are creating an ambition level that is as high as what the current situation requires us to be, which is we are going to be at the cutting edge of the AI world, we are going to be impacting the real-world AI systems, and we are going to be pushing forward in this intelligence pyramid. That really requires that we have to coordinate ourselves very well on a very well-defined mission and go with it with conviction and go with it with speed and agility. So that’s what we are doing in our new organization that’s called AI Frontiers. This is a mission-focused AI lab and our mission is expanding the frontiers of AI capabilities, and we are doing it by being very focused on a number of key directions, which we kind of covered, but also having the agility and the teamwork to always re-evaluate ourselves and ask the question of, are these the most important problems to work on right now? Or, as the world is changing, should we rethink? Should we create new directions? Should we end directions and build? This is, I think, one of the most important things about where we are in the AI world right now. We are not working on hypothetical ideas. Of course, we are dreaming big; we are taking risks. We are not only doing incremental things. But even for the ideas that are long-term and riskier, we are only going to learn if we are building those ideas, sharing them with the community, and learning from that feedback. So those are the building blocks of our new organization.

LLORENS: One of the things that’s exciting about doing research, I find, in an industrial environment like Microsoft is the ability to essentially affect the population through translating things into products, right. On the other hand, there is a big difference between what comes out at the end of a research pipeline, a research asset, you know, a model like Phi or Orca, and a thing that powers a product. One of the things I think we’ll do with AI Frontiers is provide more of a channel, a more coherent channel, of research artifacts like this into product. But can you talk a little bit about that? What is that difference? What goes into getting something from, you know, what we might put on GitHub to something we might give to our colleagues in Azure, for example?

BUBECK: I think the timelines are really shortened recently. Overall, research has accelerated so dramatically that the distance between a real product and something that comes at the end of a research, you know, project is, like, the gap is very small, I would say. And this is really, you know, to Ece’s point about having an organization which is mission focused and about building things, this is, to me, the essence of what’s going on right now. We cannot have horizons which are 10 years into the future. The truth is, nobody knows where AI is going to be 10 years from now, so it’s meaningless to plan at time horizons which are the usual time horizon that we are used to in research. If you are in research, you know, from 10 years ago and you’re planning with a 10-year horizon, then, of course, there is going to be an immense gap between whatever you produce and, you know, a real product. This is not the case anymore. So even something like Phi, you know, it could be in product very soon.

AWADALLAH: Yeah. When I first joined, actually, Microsoft Research, we would also think about the research that we’re doing right now is two, three, five years away, and we’d categorize research that way for making it into product. That spectrum’s collapsing.

BUBECK: Completely.

AWADALLAH: Things are happening so fast. Taking something from research results to a product is still a lot of work. And that’s why I have been amazed by how fast we have been moving as a company, putting these things safely and reliably into the hands of our customers. However, that spectrum is not in years anymore. Things are moving very, very fast and some of the findings that we find make their way into impact in a matter of weeks or months.

KAMAR: And there’s one more point to make here, which is doing AI Frontiers inside MSR. We are choosing to go, to be building a mission-focused organization that’s going really fast on some of these problems and get our hands dirty and work with different parties in the company. And at the same time, we are inside a very strong organization that has researchers studying many different problems at different time horizons and sometimes being able to, you know, go through on directions that we may not be able to afford by being in this mission-focused organization. So one of the things we very much care about is also building bridges, not only with the company, not only with the academic world, but also with the different groups inside the Microsoft Research umbrella and really benefit from the riskier bets that, you know, the traditional MSR labs are taking and collaborating with them and enabling all of us to try those ideas. So we are really hoping that by being inside this MSR family, we are gaining a lot and we are able to scale on our ideas and experimentation a lot more.

LLORENS: You alluded to the, you know, the work it takes to go from a research artifact to something in a product, and part of that work pertains to responsible AI, as we might say inside Microsoft, or just AI safety more broadly. I think that’s true for transitioning to translating something to product, but even to releasing something, you know, a paper, you know, with a GitHub, you know, artifact that we put out there. Let’s go back, let’s even say, to the Orca work. How are you thinking about safety in the context of open sourcing something like Orca? What are the tests you’re running? And, you know, what does that frontier look like?

AWADALLAH: Yeah, that’s a very good question. And, actually, we put a lot of emphasis on safety even on research assets and, actually, we put a lot of our research assets through a process as rigorous as we would products before we are able to release them. And this is definitely the right thing to do. And, as you mentioned Orca, we did Orca fairly early on, and we weren’t yet at this stage sure what the process should be, so we, actually, never released it, because … like, once we wrote the paper and found out that we had something interesting, we wanted to release it because we wanted to share it with the research community and we wanted the research community to be able to build on top of it, but we didn’t have a story for what does that mean in order to actually release it safely. So we took some time back and worked with the rest of the company and came up with a very rigorous process. And before we are able to put anything out, it had to go through that process. That said, I think we are still even learning how to evaluate and how to measure and what does it mean to measure safety. So it’s not like a checkbox where we figured it out, and that’s what we are doing, and we feel good about it, and we put it out there. There is a continuous effort from a very large number of teams throughout the company in both products and research to always refine these processes so that we make sure we advance our understanding of what safe release of these models is and also make sure that we have the right processes and systems to make sure everything we put out there goes through that process.

LLORENS: And there are frontiers here that are super interesting. I think multimodality is a really interesting frontier relative to evaluation and safety. And we started earlier in the conversation even talking about AI in the real world that we interact with maybe not even just as a chatbot, but as an agent of some kind that can take action in the real world. So it’s great to see us taking this so seriously at this phase, because I think it’s going to get even more complicated, you know, as we move forward and more important. Why don’t we talk AI and society for a minute. One of the things that I find important for me as I reflect on my own research, my own journey here, is remaining grounded by perspectives outside of this laboratory, outside of the spheres that we’re in. We get some of that at our dinner tables, right. I do have the opportunity, for me personally, to engage with communities, community organizations, even politicians. But I’m really interested in how you all stay grounded in perspectives outside of this world here in Microsoft Research. Ece, why don’t we start with you?

KAMAR: Yeah, one of the things, talking about AI and society and responsible AI, one of the things that’s very important is that a significant portion of our organization, our researchers and engineers, have significantly contributed to the work that Microsoft has done in the responsible AI space over the last decade. And, in fact, I’m … one of the things I’m most proud of in terms of my personal time in MSR is how much MSR contributed to where Microsoft is in doing AI responsibly. And that all happened because we, actually, got to see the realities of AI development and have the passion to drive innovation in terms of building AI responsibly. Now I think this is an opportunity for us to do this at larger scales as we have more coordinated efforts in terms of pushing the frontiers of AI in this new organization and MSR more broadly. So there are a few ways we are doing this right now. And then I’ll come to your point about the community. One of the things that we very much care about is sharing our work with the academic community and with the developer community through open sourcing. So all of the works—Phi, Orca, AutoGen, and the other things we are going to be doing—we release them. And, in fact, what is so significant about the small-language-model space is that they enable a lot of hands-on research work that may not be possible without this family of models, because when you think about it, a lot of the other models that have reasoning capabilities that may compare with Phi and Orca, they were much larger and they were black boxes to the research community. Now that we are putting these models out there under an MIT License, we really welcome the academic community to take these models, to look into how they are actually getting better in reasoning, and ask the question of how. Ask the question of, how do we have better controls in Phi and Orca? How do we improve the training data such that we can mitigate some of the biases, reliability issues, toxicity in it?

One of the things I personally very much believe in is that there cannot be any camps about stopping AI versus going as fast as possible. This is really about building AI responsibly and making sure that our innovations happening are also taking responsibility as a core part of that innovation. So with that in mind, we think it is so important to enable the whole academic community with models, with architectures, with agents, libraries such that the innovation in terms of how do we make AI responsible comes from the whole world instead of just the field that has access to such models.

BUBECK: And if I may, like, for the Phi model on Hugging Face, we are approaching a million downloads. So, you know, it’s very real. Like, this is really getting into the hands of, well, a million people, so …

LLORENS: Yeah, for sure.

AWADALLAH: Yeah, and to add to that, we are seeing this a lot with AutoGen, as well, because AutoGen, it’s not a model. You can use a lot of models with it. And it created a big developer community around it, and we have been learning a ton from them and not just in how they are using it, but actually in so many innovative ideas of even how to use it to make your applications safer or to make applications more reliable, because the framework enables you to define different roles, and people are coming up with very interesting ideas about maybe adding a safeguard agent in order to make sure that whatever the team of agents is doing actually fits the particular safety criteria or adding some other agents that are trying to make sure that the completion of the task aligns with the initial human intent. So we are going early with enabling the community to use what we are doing and open sourcing it. It is helping us collectively come up with better ways for building these things in a much better and safer way.

KAMAR: And then on top of the work we are hopefully enabling the academic community, there is also something about working inside a company like Microsoft and learning from real-world use cases. And responsible AI is really about real world, and we want to make sure that we, over time, think about ways—possibly even collaborating with you, Ashley, and your team—really, like, sharing our learnings about how the real world looks like, what the real-world considerations are, with a much larger community so that we can think about all of these considerations together and innovate together in terms of building AI responsibly.

LLORENS: And the global research community—we talk a lot about that—is more expansive, I think, than it’s ever been, at least as it pertains to computing research and the amount of different disciplines right now involved in what we’ve considered computing research. On the one hand, there are the computer scientists that are playing with Phi right now, that are playing with AutoGen. On the other hand, there’s legal scholars, there’s policy researchers, there’s medical practitioners, and so the global research community is just more expansive than ever, and it’s just been great to be able to use Microsoft as a platform to be able to engage more broadly, as well. So, look, I’ve had really a lot of fun, you know, talking to you all on a daily basis but today in particular. Thanks for a fascinating discussion.

KAMAR: Thank you, Ashley.

BUBECK: Thanks, Ashley.

AWADALLAH: Thank you.