
Research Forum Brief | June 2024

AutoGen Update: Complex Tasks and Agents


Adam Fourney

“Agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task.”

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen Update: Complex Tasks and Agents

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Adam Fourney discusses the effectiveness of multiple agents working together to complete complex multi-step tasks. He showcases their ability to outperform previous single-agent solutions on benchmarks like GAIA, using customizable arrangements of agents that collaborate, reason, and use tools to achieve complex outcomes.

Microsoft Research Forum, June 4, 2024

ADAM FOURNEY: Hello, my name is Adam Fourney, and today, I’ll be presenting our work on completing complex tasks with agents. And though I’m presenting, I’m sharing the contributions of many individuals as listed below. All right, so let’s just dive in.

So in this presentation, I’ll share our goal, which is to reliably accomplish long-running complex tasks using large foundational models. I’ll explain the bet that we’re taking on using multi-agent workflows as the platform or the vehicle to get us there, and I’ll share a little bit about our progress in using a four-agent workflow to achieve state-of-the-art performance on a recent benchmark.

So what exactly is a complex task? Well, if we take a look at the following example from the GAIA benchmark for General AI Assistants, it reads, “How many nonindigenous crocodiles were found in Florida from the years 2000 through 2020?” Well, to solve this task, we might begin by performing a search and discovering that the U.S. Geological Survey maintains an online database for nonindigenous aquatic species. If we access that resource, we can form an appropriate query, and we’ll get back results for two separate species. If we open the collection reports for each of those species, we’ll find that in one instance, five crocodiles were encountered, and in the other, just a single crocodile was encountered, giving a total of six separate encounters during those years. So this is an example of a complex task, and it shows the characteristics typical of tasks of this nature: it benefits strongly from planning, acting, observing, and reflecting over multiple steps, where those steps are doing more than just generating tokens. Maybe they’re executing code. Maybe they’re using tools or interacting with the environment. And the observations they make add information that was previously unavailable. So these are the types of tasks that we’re interested in here. And as I mentioned before, we’re betting on using multi-agent workflows as the vehicle to get us there.

So why multi-agents? Well, first of all, the whole setup feels very agentic from, sort of, a first-principles point of view. The agents are reasoning, they’re acting, and then they’re observing the outcomes of their actions. So this is very natural. But more generally, agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task. So to do all this, to build all this, we are producing a platform called AutoGen, which is open source and available on GitHub. And I encourage you to check this out at the link below.
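To make the “team of agents” idea concrete, here is a minimal sketch of a two-agent workflow using AutoGen’s Python API (the pyautogen package, circa v0.2). The model name, API key placeholder, and task are illustrative assumptions, not the setup from the talk.

```python
# pip install pyautogen
import autogen

# Illustrative model configuration; swap in your own model and key.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# One "role" on the team: an LLM-backed assistant that reasons and writes code.
assistant = autogen.AssistantAgent(name="assistant", llm_config=llm_config)

# A second role: a proxy agent that executes the assistant's code and
# reports the results back into the conversation.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # fully automated; no human in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The agents talk to one another until the task is complete.
user_proxy.initiate_chat(
    assistant,
    message="Count from 1 to 10 with a Python script and report the output.",
)
```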

All right, so now let’s talk about the progress we’ve been making using this approach. So if you recall that question about crocodiles from the beginning, that’s from the GAIA benchmark for General AI Assistants. And we put together four agents to work on these types of problems. The team consists of a general assistant, a computer terminal that can run code or execute programs, a web surfer that can browse the internet, and an orchestrator to, sort of, organize and oversee their work. Now with that team of four agents, we were actually able to, in March, achieve the top score on the GAIA leaderboard for that benchmark, leading by about 8 points. But what’s perhaps more exciting to us is that we were able to more than double the performance on the hardest set of questions, the Level 3 questions, which the authors of that work describe as questions for a perfect general assistant, requiring it to take arbitrarily long sequences of actions, to use any number of tools, and to access the world in general. So this is all very exciting, and I want to share a little bit more about what those agents are actually doing.
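For readers who want to try something similar, the sketch below approximates that four-agent team using AutoGen’s off-the-shelf pieces. The WebSurferAgent lives in AutoGen’s contrib module, and GroupChatManager is only a rough stand-in for the purpose-built orchestrator described next; the model configuration and exact agent set here are assumptions, not the configuration that produced the GAIA results.

```python
import autogen
from autogen.agentchat.contrib.web_surfer import WebSurferAgent  # contrib module

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# General assistant: reasons about the task and proposes next steps.
assistant = autogen.AssistantAgent("general_assistant", llm_config=llm_config)

# Computer terminal: runs code and executes programs on the team's behalf.
computer_terminal = autogen.UserProxyAgent(
    "computer_terminal",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "work", "use_docker": False},
)

# Web surfer: browses the internet (search, navigation, page summarization).
web_surfer = WebSurferAgent(
    "web_surfer",
    llm_config=llm_config,
    summarizer_llm_config=llm_config,
)

# Orchestrator stand-in: a group-chat manager that decides who acts next.
groupchat = autogen.GroupChat(
    agents=[assistant, computer_terminal, web_surfer], messages=[], max_round=20
)
orchestrator = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

computer_terminal.initiate_chat(
    orchestrator,
    message="How many nonindigenous crocodiles were found in Florida "
    "from the years 2000 through 2020?",
)
```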

So this is the loop or the plan that they are following. So it begins with the question or the prompt, and then we produce a ledger, which is like a working memory that consists of given or verified facts; facts that we need to look up, for example, on the internet; facts that we need to derive, perhaps through computation; and educated guesses. Now these educated guesses turn out to be really important because they give the language models space to speculate in a constrained environment without some of the downstream negative effects of hallucination. So once we have that ledger, we assign the tasks to the independent agents, and then we go into this inner loop, where we ask first, are we done? If not, well, are we still making progress? As long as we’re making progress, we’ll go ahead and we’ll delegate the next step to the next agent. But if we’re not making progress, we’ll note that down. We might still delegate one other step, but if that stall occurs for three rounds, then we will actually go back, update the ledger, come up with a new set of assignments for the agents, and then start over.
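In outline, that control flow might look like the following Python sketch. This is a hypothetical reconstruction from the description above, not the actual orchestrator code: the Ledger fields mirror the four categories mentioned, and plan, delegate, is_done, and made_progress are placeholders for the LLM-backed decisions the orchestrator makes.

```python
from dataclasses import dataclass, field

STALL_LIMIT = 3  # the talk describes replanning after three stalled rounds


@dataclass
class Ledger:
    """Working memory, with the four categories described in the talk."""
    verified_facts: list = field(default_factory=list)    # given/verified facts
    facts_to_lookup: list = field(default_factory=list)   # e.g., web lookups
    facts_to_derive: list = field(default_factory=list)   # e.g., computations
    educated_guesses: list = field(default_factory=list)  # constrained speculation


def run_task(prompt, agents, plan, delegate, is_done, made_progress):
    """Outer (re)planning loop wrapped around the inner delegation loop.

    plan, delegate, is_done, and made_progress are stand-ins for the
    orchestrator's LLM-backed decisions.
    """
    # Outer loop entry: build the ledger and assign tasks to the agents.
    ledger, assignments = plan(prompt, agents, previous=None)
    stalls = 0

    # Inner loop: are we done? If not, are we still making progress?
    while not is_done(ledger):
        if made_progress(ledger):
            stalls = 0
        else:
            stalls += 1
        if stalls >= STALL_LIMIT:
            # Stalled for three rounds: update the ledger, come up with a
            # new set of assignments, and start over.
            ledger, assignments = plan(prompt, agents, previous=ledger)
            stalls = 0
            continue
        # Otherwise, delegate the next step to the next agent and record
        # what it observed.
        observation = delegate(assignments, ledger)
        ledger.verified_facts.append(observation)

    return ledger.verified_facts[-1]  # the final answer, in this toy sketch
```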

All right, so this is the configuration that’s been working well for us, and it’s all I have time to share with you today. But I mentioned our goal, our bet, and our progress, and I want to conclude by sharing our plans for the future. So already we’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.
All right, well, thank you for your attention, and thank you for attending the Microsoft Research Forum, and we look forward to you joining us next time.