{"id":1169137,"date":"2026-04-21T09:53:12","date_gmt":"2026-04-21T16:53:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1169137"},"modified":"2026-05-04T09:57:18","modified_gmt":"2026-05-04T16:57:18","slug":"the-art-of-building-verifiers-for-computer-use-agents","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/the-art-of-building-verifiers-for-computer-use-agents\/","title":{"rendered":"The Art of Building Verifiers for Computer Use Agents"},"content":{"rendered":"\n
By Corby Rosset, Pratyusha Sharma, Andrew Zhao, Miguel Gonzalez-Fernandez, Ahmed Awadallah<\/em><\/p>\n\n\n\n We share lessons learned from building a best-in-class verifier for computer use agent trajectories on the web, called the Universal Verifier. False positive rates drop to near zero (vs. \u226545% for WebVoyager, \u226522% for WebJudge), and agreement with humans matches human-human agreement. We open-source our Universal Verifier system along with CUAVerifierBench, a new set of CUA trajectories with both process and outcome human labels.<\/p>\n\n\n\n Here’s what we found:<\/p>\n\n\n\n Full paper is available here<\/span><\/a>, and code and data are available at https:\/\/github.com\/microsoft\/fara<\/span><\/a>.<\/p>\n\n\n\n Computer use agents \u2014 models that browse the web, click buttons, and fill forms \u2014 have gotten impressively capable. But progress on training and evaluating them is bottlenecked by a deceptively simple question: did the agent actually succeed?<\/em><\/p>\n\n\n\n This turns out to be much harder than it sounds. Unlike text generation, where you can compare an output to a reference, computer use trajectories are long, visually rich, and interact with environments the agent does not control, inviting new categories of errors like environment blockers, out-of-stock items, and logins. A task might be partially completed. Success might arrive through an unexpected path. Failures can be subtle \u2014 for example, mis-copying numbers from a table that appears only in a screenshot buried deep in a multi-step interaction. And the consequences of getting verification wrong compound: bad labels corrupt both your benchmarks and your training data.<\/p>\n\n\n\n We ran 96 experiments over several weeks to build what we call the Universal Verifier \u2014 a system designed to verify agent success and score its effort against a generated rubric. 
What we ended up with is less a single trick and more a set of learned design principles, each addressing a failure mode we discovered. This post walks through those principles, what we tried that didn’t work, and what surprised us.<\/p>\n\n\n\n Figure 1: Human expert vs. auto-research agent across successive verifier design iterations. The expert iterated over 32 experiments across three weeks; the auto-research agent completed comparable iterations in roughly one day.<\/em><\/p>\n\n\n\n The root of the pipeline is rubric generation, and flawed rubrics produce errors that cascade through everything downstream. We found four systematic failure modes \u2014 and rubric design alone accounted for roughly half of our total Cohen’s \u03ba gains. You can see how our rubrics evolved on WebTailBench here: https:\/\/microsoft.github.io\/fara\/docs\/webtailbench_rubric_comparison.html<\/span><\/a><\/p>\n\n\n\n This was the most insidious problem. LLM-generated rubrics frequently introduce requirements that were never stated in the task. For example, in Figure 2, given a multi-step task whose primary intent was to find a coffee shop near a hotel, our early rubric added criteria for the hotel’s price and address, neither of which the user requested. The agent completed the actual task but scored 2\/8 because it “failed” those phantom criteria. After fixing the rubric to match only what was asked, the same trajectory scored 16\/18 \u2014 a success.<\/p>\n\n\n\n Figure 2: One way we improved rubrics is by removing “phantom” criteria and focusing only on what the task required.<\/em><\/p>\n\n\n\n This matters because phantom criteria inflate the denominator. An agent that did exactly what was asked gets penalized for not doing things nobody wanted.<\/p>\n\n\n\n When rubric items aren’t logically independent, a single upstream error propagates into every downstream criterion, multiplying the penalty. 
We learned to ensure each criterion could be evaluated on its own, as demonstrated in Figure 3.<\/p>\n\n\n\n Figure 3: An example of error isolation in practice. For the task: “List all the members of the bands Nsync and BackStreet Boys. Find the net worth of the one with the longest last name.” The agent incorrectly identified “Timberlake” as the longest last name when “Kirkpatrick” is correct \u2014 but the error does not cascade to downstream criteria about reporting net worth.<\/em><\/p>\n\n\n\n Agents sometimes claim success even when the evidence contradicts it \u2014 they’ll confidently assert they found the right product when they did not. Worse, they sometimes fabricate results, such as stating that the shopping cart contains the product when it is empty. Initially, we generated and scored rubrics in one pass, but this rarely caught subtle hallucinations. So we separated rubric generation from scoring and decomposed rubric scoring into two stages: one with and one without screenshot evidence. Discrepancies between the two stages surface hallucinations that a single-pass scorer would miss. As we explain below, we handle screenshot evidence carefully so that no details are missed.<\/p>\n\n\n\n
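The two-stage comparison described above can be sketched roughly as follows. This is a minimal illustration, not the Universal Verifier's actual implementation: the `Criterion` structure, the `flag_discrepancies` and `total` helpers, and the per-criterion score dictionaries are all hypothetical stand-ins for whatever the two scoring stages produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One independently scorable rubric item (hypothetical structure)."""
    name: str
    max_points: int


def total(rubric, earned):
    """Return (points earned, points possible) over the whole rubric."""
    return sum(earned[c.name] for c in rubric), sum(c.max_points for c in rubric)


def flag_discrepancies(rubric, text_only, with_screenshots):
    """List criteria scored higher without screenshots than with them.

    A drop once screenshots are consulted suggests the agent's claim is not
    backed by what was on screen, so the item deserves a closer look.
    """
    return [
        c.name
        for c in rubric
        if text_only[c.name] > with_screenshots[c.name]
    ]


rubric = [
    Criterion("found the requested product", 2),
    Criterion("product is in the shopping cart", 2),
]
# Assumed outputs of the two scoring stages (made-up values for illustration).
text_only = {
    "found the requested product": 2,
    "product is in the shopping cart": 2,
}
with_screenshots = {
    "found the requested product": 2,
    "product is in the shopping cart": 0,
}

suspect = flag_discrepancies(rubric, text_only, with_screenshots)
print(suspect)
print(total(rubric, with_screenshots))
```

Because each criterion is scored on its own, the empty-cart discrepancy lowers only that one item, and the flagged name tells a reviewer where the agent's claim and the screenshots disagree.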
<\/figure>\n\n\n\n\n
Why is it so hard to tell whether the agent succeeded?<\/h2>\n\n\n\n
<\/figure>\n\n\n\nHow do you build a good rubric?<\/h2>\n\n\n\n
Phantom criteria<\/h3>\n\n\n\n
<\/figure>\n\n\n\nCascading criteria<\/h3>\n\n\n\n
<\/figure>\n\n\n\nHallucination detection<\/h3>\n\n\n\n
<\/figure>\n\n\n\n