{"id":1121829,"date":"2025-02-11T14:21:47","date_gmt":"2025-02-11T22:21:47","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1121829"},"modified":"2025-02-18T15:14:29","modified_gmt":"2025-02-18T23:14:29","slug":"exact-improving-ai-agents-decision-making-via-test-time-compute-scaling","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/exact-improving-ai-agents-decision-making-via-test-time-compute-scaling\/","title":{"rendered":"ExACT: Improving AI agents\u2019 decision-making via test-time compute scaling"},"content":{"rendered":"\n
\"A<\/figure>\n\n\n\n

Autonomous AI agents are transforming the way we approach multi-step decision-making processes, streamlining tasks like web browsing, video editing, and file management. By applying advanced machine learning, they automate workflows, optimize performance, and reduce the need for human input.

However, these systems struggle in complex, dynamic environments. A key challenge lies in balancing *exploitation*, using known strategies for immediate gains, with *exploration*, which involves seeking new strategies that could yield long-term benefits. Additionally, they often have difficulty adapting to unpredictable changes in conditions and objectives, as well as generalizing knowledge across contexts, limiting their ability to transfer learned strategies between domains.

In response, we developed ExACT, an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies. ExACT combines two key techniques: Reflective-MCTS (R-MCTS) and Exploratory Learning.

R-MCTS builds on the traditional Monte Carlo Tree Search (MCTS) algorithm, introducing features like contrastive reflection and a multi-agent debate function. Through contrastive reflection, the agent refines its decision-making by comparing expected outcomes with actual results, allowing it to learn from both its successes and mistakes. The multi-agent debate function evaluates a given state from multiple contrasting perspectives, yielding a more balanced and reliable assessment.
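To make these two components more concrete, the following is a minimal Python sketch of what a debate-style value function and a contrastive-reflection memory could look like. The `llm` callable, the prompt wording, and the `reflection_memory` store are illustrative assumptions for this sketch, not the published ExACT implementation.

```python
from typing import Callable, Dict, List

# Any callable that sends a prompt to a language model and returns its text reply,
# e.g., a thin wrapper around an OpenAI or Azure OpenAI client (assumed helper).
LLM = Callable[[str], str]

# Lessons learned from past episodes, reused as context in later searches (assumed store).
reflection_memory: List[Dict[str, str]] = []


def _parse_score(reply: str, default: float = 0.5) -> float:
    """Pull the numeric score out of a 'SCORE: x' line, falling back to a neutral value."""
    for line in reply.splitlines():
        if line.strip().upper().startswith("SCORE:"):
            try:
                return float(line.split(":", 1)[1])
            except ValueError:
                pass
    return default


def debate_value(llm: LLM, state_description: str, n_debaters: int = 2) -> float:
    """Multi-agent debate: average evaluations argued from opposing perspectives."""
    scores = []
    for stance in ("Argue why this state is promising",
                   "Argue why this state is problematic"):
        for _ in range(n_debaters):
            reply = llm(f"{stance}.\nState: {state_description}\n"
                        "Finish with a single line: SCORE: <number between 0 and 1>")
            scores.append(_parse_score(reply))
    return sum(scores) / len(scores)


def contrastive_reflection(llm: LLM, task: str, expected: str, actual: str) -> str:
    """Contrastive reflection: compare the expected outcome with the actual result
    and store the resulting lesson for reuse."""
    lesson = llm(f"Task: {task}\nExpected outcome: {expected}\nActual outcome: {actual}\n"
                 "Explain the discrepancy and state one reusable lesson.")
    reflection_memory.append({"task": task, "lesson": lesson})
    return lesson
```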

Exploratory Learning trains agents to navigate environments effectively. Together, these techniques show strong computational scalability during both training and testing, as demonstrated on VisualWebArena, a benchmark for evaluating multimodal autonomous language agents (Figure 1).

\"Evaluation<\/a>
Figure 1. Evaluation demonstrates the compute scaling properties of GPT-4o during both training and testing. The assessment includes two scenarios: (1) applying the GPT-4o-based R-MCTS agent to all 234 tasks from the Classifieds category in VisualWebArena (left), and (2) testing fine-tuned GPT-4o on 169 previously unseen tasks from Classifieds without using search algorithms (right).

## How R-MCTS works

R-MCTS extends the classic MCTS algorithm by enabling real-time improvements in decision-making. As shown in Figure 2, an iterative feedback loop allows R-MCTS to learn from past experiences, avoid prior mistakes, and focus on more effective actions in similar contexts.

\"Overview<\/a>
Figure 2. Overview of the R-MCTS process in ExACT.
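For readers who want a sense of how the pieces in Figure 2 might fit together, below is a rough, hypothetical sketch of a single R-MCTS-style iteration: UCT-based selection, LLM-proposed expansion, state evaluation (for example with a debate-style value function like the one sketched earlier), and backpropagation. The node fields and function signatures are assumptions made for illustration, not the released implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    state: str                                   # textual description of the environment state
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float = 1.4) -> float:
        """Upper-confidence bound balancing exploitation and exploration."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def r_mcts_iteration(root: Node,
                     llm: Callable[[str], str],
                     env_step: Callable[[str, str], str],
                     evaluate: Callable[[str], float]) -> None:
    """One search iteration: select, expand, evaluate, backpropagate."""
    # 1. Selection: walk down the tree following the highest UCT score.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.uct())

    # 2. Expansion: ask the LLM for a next action and simulate it in the environment.
    action = llm(f"Given the state below, propose the next action.\nState: {node.state}")
    child = Node(state=env_step(node.state, action), parent=node)
    node.children.append(child)

    # 3. Evaluation: score the new state (e.g., with a debate-style value function).
    value = evaluate(child.state)

    # 4. Backpropagation: push the value estimate up to the root.
    while child is not None:
        child.visits += 1
        child.value_sum += value
        child = child.parent
```

In practice the iteration would be repeated many times per decision, the agent would then commit to the most-visited child of the root, and contrastive reflection would run once the outcome is known so its lessons can be injected into later prompts.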

### Evaluating R-MCTS

R-MCTS demonstrates state-of-the-art performance across all VisualWebArena environments, surpassing the previous best-performing method, Search Agent, with improvements ranging from 6% to 30% (Table 1). Additionally, as of January 2025, it holds the second position on the OSWorld leaderboard and demonstrates state-of-the-art performance in the blind test setting, where there is no prior access to the test environment (Table 2).

| Rank | Model | Score |
|------|-------|-------|
| 1 | GPT-4o + ExACT | 33.70 |
| 2 | GPT-4o + Search | 26.40 |
| 3 | GPT-4o + WebDreamer | 23.60 |
| 4 | GPT-4o + ICAL | 23.40 |
| 5 | GPT-4o | 19.78 |
| 6 | Llama-3-70B + Search | 16.70 |

Table 1. The VisualWebArena leaderboard highlights R-MCTS as achieving state-of-the-art performance as of December 2024.
| Rank | Model | Blind Test | Score |
|------|-------|------------|-------|
| 1 | learn-by-interact w/ Claude-3.5-sonnet | ✗ | 22.50 |
| 2 | ExACT w/ GPT-4o | ✔ | 16.60 |
| 3 | GPT-4 | ✔ | 12.24 |
| 4 | GPT-4o | ✔ | 11.36 |
| 5 | GPT-4 Vision (0409) | ✔ | 10.82 |
| 6 | learn-by-interact w/ Gemini-1.5-pro | ✔ | 10.30 |

Table 2. The OSWorld leaderboard for the category of A11y tree inputs shows that ExACT with GPT-4o ranks second overall and demonstrates state-of-the-art performance in the blind test setting, as of December 2024.

## How Exploratory Learning works

Exploratory Learning enables agents to dynamically search and adjust their computational resources during testing without depending on MCTS. In contrast to Imitation Learning, which trains models only on the optimal actions identified through search, Exploratory Learning teaches the agent to navigate its environment on its own: to evaluate states, explore different pathways, and backtrack efficiently from unpromising paths toward more favorable alternatives.

\"In<\/a>
Figure 3. In contrast to Imitation Learning, Exploratory Learning uses the entire search trajectory for training.
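As a rough illustration of this difference, the sketch below converts one search trajectory into fine-tuning examples in two ways: keeping only the best action found at each state (imitation-style) versus keeping the full exploration sequence, including detours and backtracking (exploratory-style). The record schema is an assumption made for this sketch, not ExACT's actual training format.

```python
from typing import Dict, List

# A search trajectory as a flat list of events. Each event records the observed state,
# the action taken, and whether it was an exploratory step, a backtrack, or the best
# action ultimately found (schema assumed for illustration).
Trajectory = List[Dict[str, str]]


def imitation_examples(trajectory: Trajectory) -> List[Dict[str, str]]:
    """Imitation Learning: train only on the optimal action found at each state."""
    return [
        {"prompt": step["state"], "completion": step["action"]}
        for step in trajectory
        if step["kind"] == "best"
    ]


def exploratory_examples(trajectory: Trajectory) -> List[Dict[str, str]]:
    """Exploratory Learning: train on the whole search so the model learns to
    evaluate states, try alternatives, and backtrack from unpromising paths."""
    examples = []
    history: List[str] = []
    for step in trajectory:
        prompt = step["state"] + "\nSearch so far:\n" + "\n".join(history)
        examples.append({"prompt": prompt, "completion": step["action"]})
        history.append(f'{step["kind"]}: {step["action"]}')
    return examples
```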

### Evaluating Exploratory Learning

We conducted experiments using GPT-4o fine-tuned with Exploratory Learning in the VisualWebArena environment. The results demonstrate the following key benefits: