<\/div>\n<\/div>\n\n\n\n
By Yadong Lu1<\/sup>, Lingrui Xu<\/em>2<\/sup>, Chao Huang2<\/sup>, Ahmed Awadallah1<\/sup><\/em>1<\/sup>Microsoft Research, 2<\/sup>The University of Hong Kong<\/em><\/p>\n\n\n\nInstead of solving web tasks by predicting where to click one action at a time, we give the model only a terminal, where it has full freedom to spawn browser sessions and explore websites by writing code. The final result is a reusable program that completes the task. We found this minimal harness to be surprisingly effective at solving web tasks.<\/p>\n\n\n\n
TL;DR<\/h3>\n\n\n\n\nExisting web agents often drive a persistent browser session one action at a time. We instead reduce the web-agent harness to a deliberately minimal terminal-based setup: three modules, roughly 1K lines of code, one agent loop, and no multi-agent orchestration. The agent emits bash commands and controls the browser by writing Playwright code, reaching SOTA results on Odysseys and Online-Mind2Web with a 100-step budget.<\/li>\n\n\n\n Because actions are expressed as code, the agent can naturally chain many web interactions within a single step and spawn multiple browser sessions, making execution far more efficient than predicting one primitive action at a time.<\/li>\n\n\n\n We show that the resulting script can be packaged as a reusable CLI with arguments. In a cost analysis, GPT-5.4 averages $2.37 per task, and each run yields a reusable RPA-style script. With our crafted tools, even a smaller model (Qwen3.5-9B) achieves strong performance on the hard split of Online-Mind2Web.<\/li>\n\n\n\n Once a task script is crafted, it can be shared and reused across platforms such as Codex, Claude Code, and OpenClaw.<\/li>\n<\/ol>\n\n\n\n<\/div>\n\n\n\n
Beyond step-by-step web interaction in a stateful browser<\/h2>\n\n\n\n The dominant paradigm for web agents today treats the browser session itself as the agent’s workspace. At each step, the model receives the current page state (as a screenshot or as page-state text) and predicts the next operation to apply to that same session. This operation may be a low-level action such as click, type, or scroll; a structured command such as selecting a DOM element; or, more recently, a short code snippet executed through a CLI tool call. All of these approaches share a common constraint: the agent is required to predict web actions one step at a time within a predefined interaction loop.<\/p>\n\n\n\n
This design was useful when LLM agents had limited ability to reason, code, and recover from errors. A carefully engineered harness helped bridge the gap between what the model could reliably produce and what real web tasks required. But as models become stronger, especially at writing and debugging code, the same harness becomes a bottleneck, constraining the agent to a narrow interaction loop instead of letting it solve the task more flexibly.<\/p>\n\n\n\n
Webwright builds upon this view. We separate the agent from the browser, and treat the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session, but the code and logs in the local workspace. The agent can write exploratory scripts, spawn fresh browser sessions, and freely decide when to capture screenshots, inspect failures, and iteratively refine its code, much like a human engineer developing a robotic process automation (RPA) script. This approach has two obvious advantages:<\/p>\n\n\n\n
First, Webwright enables robust and reusable interaction with web environments.<\/strong> Instead of relying on fragile pixel-level actions, a coding agent with a terminal and a local workspace can interact with the underlying structure of a webpage: querying elements, waiting for conditions, and handling dynamic behaviors such as lazy loading or re-rendering. This makes the agent far less sensitive to UI variations across sites and platforms. Moreover, the resulting scripts are reusable: once a workflow is encoded as a program, it can be rerun, adapted, and shared across tasks, rather than rediscovered from scratch each time.<\/p>\n\n\n\nSecond, Webwright allows for efficient composition of complex workflows.<\/strong> Rather than issuing one primitive action at a time, a coding agent can naturally express multi-step interactions (such as selecting a date or filling out an entire form) as a compact program. Loops, functions, and abstractions allow the agent to generalize across similar tasks (e.g., selecting different dates) without repeatedly predicting similar sequences of low-level steps. This significantly reduces the number of interaction rounds, improves execution speed, and mitigates the accumulation of errors from long action chains.<\/p>\n\n\n\nDespite the simplicity of this setup, we find that it is surprisingly effective at solving complex, and especially long-horizon, web tasks.<\/p>\n\n\n\n
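As a rough illustration of chaining several interactions within one step, the sketch below fills an entire form and submits it in a single function call. It assumes `page` behaves like a Playwright sync `Page` and uses only `wait_for_selector`, `fill`, and `click`. The function name `fill_and_submit`, the `RecordingPage` stand-in, and all selectors are hypothetical and are not taken from Webwright.

```python
from typing import Mapping, Protocol


class PageLike(Protocol):
    """The subset of Playwright's sync Page API that this sketch relies on."""

    def wait_for_selector(self, selector: str): ...
    def fill(self, selector: str, value: str): ...
    def click(self, selector: str): ...


def fill_and_submit(page: PageLike, fields: Mapping[str, str], submit_selector: str) -> int:
    """Fill every field, then submit the form, all within one agent step.

    Returns the number of fields that were filled.
    """
    filled = 0
    for selector, value in fields.items():
        # Wait for the element first, to tolerate lazy loading or re-rendering.
        page.wait_for_selector(selector)
        page.fill(selector, value)
        filled += 1
    page.click(submit_selector)
    return filled


class RecordingPage:
    """Hypothetical stand-in for a browser page, so the sketch runs without a browser."""

    def __init__(self):
        self.calls = []

    def wait_for_selector(self, selector):
        self.calls.append(("wait_for_selector", selector))

    def fill(self, selector, value):
        self.calls.append(("fill", selector, value))

    def click(self, selector):
        self.calls.append(("click", selector))
```

With a real browser, the same function could be passed a page created through Playwright's `sync_playwright()` context. The point of the sketch is only that one program replaces a sequence of separately predicted actions.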
<\/div>\n\n\n\n
Completing web tasks in a terminal<\/h2>\n\n\n\n Webwright implements this idea with a deliberately minimal harness. The system has three core components: a Runner, a Model Endpoint, and a terminal Environment. Each component is implemented as a single module: the Runner is about 150 lines of code, the model interface about 550 lines, and the Environment about 300 lines. There is no multi-agent orchestration or complex planning hierarchy, just a single agent loop. Given a user task, the Runner sends the current context to the model. The model returns an action, which is parsed into a thinking block and a shell command block. The command is then executed in the Environment, which manages a local workspace and returns observations such as terminal output, logs, screenshots, or error tracebacks. These observations are added back into the context, and the loop continues until the agent completes the task.<\/p>\n\n\n\n
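The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Webwright's actual code. The reply format (free-form thinking followed by one `<bash>...</bash>` block), the function names, and the stopping rule (a reply with no command ends the loop) are all hypothetical. The `model` argument is any callable that maps the context to a reply string.

```python
import re
import subprocess
import tempfile
from typing import Callable, List, Optional, Tuple

# Assumed reply format (not specified in the post): free-form thinking,
# followed by one <bash>...</bash> block holding the command to run.
COMMAND_RE = re.compile(r"<bash>(.*?)</bash>", re.DOTALL)


def parse_action(reply: str) -> Tuple[str, Optional[str]]:
    """Split a model reply into (thinking, command); command is None if absent."""
    match = COMMAND_RE.search(reply)
    if match is None:
        return reply.strip(), None
    return reply[: match.start()].strip(), match.group(1).strip()


def run_in_workspace(command: str, workspace: str, timeout: int = 60) -> str:
    """Run one shell command inside the workspace and return the observation text."""
    result = subprocess.run(
        command, shell=True, cwd=workspace,
        capture_output=True, text=True, timeout=timeout,
    )
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"


def agent_loop(
    task: str,
    model: Callable[[List[str]], str],
    workspace: str,
    max_steps: int = 100,  # mirrors the 100-step budget mentioned in the post
) -> List[str]:
    """Single agent loop: context -> model -> command -> observation -> context."""
    context = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = model(context)
        _thinking, command = parse_action(reply)
        context.append(f"MODEL: {reply}")
        if command is None:
            # Assumed stopping rule: a reply without a command ends the run.
            break
        context.append(f"OBSERVATION: {run_in_workspace(command, workspace)}")
    return context
```

A scripted stand-in for the model is enough to exercise the loop, for example `agent_loop("demo", lambda ctx: "<bash>echo hello</bash>", tempfile.mkdtemp(), max_steps=1)`.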
This minimal design is intentional. All intermediate code, logs, screenshots, and results are stored in the workspace, making each run easy to inspect. By keeping the harness small and avoiding unnecessary orchestration, Webwright is easier to debug, adapt, and build on top of.<\/p>\n\n\n\n <\/figure>\n\n\n\nFigure 1: Webwright architecture overview and the agent interaction loop.<\/em><\/p>\n\n\n\n<\/div>\n\n\n\n
What are the challenges we overcome?<\/h3>\n\n\n\n Premature “done” and context explosion are the two core issues. With open-ended bash actions, the model must self-report completion, and it often claims success without actually finishing. We therefore added a simple gate: before emitting done: true<\/code>, the agent must generate a self-reflection config, run a final script in a fresh folder with logs and screenshots, and pass its own self-reflection judgement, which outputs success or failure. If the check fails, the flag is dropped and the agent retries. We also found empirically that long coding trajectories quickly exceed context limits, so we compact the history every 20 steps into a single summary.<\/p>\n\n\n\n
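The two mechanisms above can be sketched as small helpers. This is an illustration under assumptions: the action is modeled as a plain dict with an optional `done` key, and `final_run_passed` and `summarize` are hypothetical callables standing in for the fresh-folder self-reflection check and the summarization call. Only the 20-step interval comes from the text.

```python
from typing import Callable, Dict, List

COMPACT_EVERY = 20  # from the post: history is compacted every 20 steps


def gate_done(action: Dict, final_run_passed: Callable[[], bool]) -> Dict:
    """Keep done=True only if the final fresh-folder run passes self-reflection."""
    if action.get("done") and not final_run_passed():
        gated = dict(action)
        gated["done"] = False  # drop the flag so the agent retries
        return gated
    return action


def maybe_compact(
    history: List[str],
    step: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Every COMPACT_EVERY steps, replace the history with a single summary entry."""
    if step > 0 and step % COMPACT_EVERY == 0:
        return [summarize(history)]
    return history
```

In a loop like the one described earlier, `gate_done` would run on each parsed action and `maybe_compact` once per step. Whether the most recent turns are kept verbatim alongside the summary is not stated in the text, so the sketch keeps only the summary.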