

OmniParser for pure vision-based GUI agent


By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager

Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains underexplored in real-world applications, particularly when it comes to acting as general agents across diverse operating systems and applications with only vision input. One of the primary limiting factors is the absence of a robust screen parsing technique capable of 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen.

Meet OmniParser, a compact screen parsing module that converts UI screenshots into structured elements. OmniParser can be used with a variety of models to create agents capable of taking actions on UIs. When used with GPT-4V, it significantly improves the agent's ability to generate precisely grounded actions for interface regions.
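To make the idea of structured elements concrete, here is a minimal Python sketch of the kind of output a screen parser like OmniParser produces and how an agent can ground an action in it. The names ParsedElement and parse_screen are illustrative placeholders, not the actual OmniParser API; see the GitHub repository for the real interface.

```python
# Illustrative sketch only: each screenshot becomes a list of labeled, captioned
# regions that a VLM such as GPT-4V can refer to by id instead of raw pixels.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ParsedElement:
    element_id: int                   # stable id the agent can point to
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2) pixel coordinates
    description: str                  # functional caption, e.g. "search button"

def parse_screen(screenshot_path: str) -> List[ParsedElement]:
    """Placeholder for the OmniParser step: screenshot in, structured elements out."""
    raise NotImplementedError

# With this representation, the agent's action can be grounded precisely:
# a response like {"action": "click", "element_id": 7} maps back to
# elements[7].bbox on the actual screen.
```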

An agent using OmniParser and GPT-4V achieved the best performance on the recently released WindowsAgentArena benchmark.

We are making OmniParser publicly available on GitHub, along with a report describing the training procedure, to encourage research on creating agents that can act across different applications and environments.

Creating OmniParser

Curating Specialized Datasets–The development of OmniParser began with the creation of two datasets (an illustrative annotation layout for each is sketched after the list):

  • An interactable icon detection dataset, which was curated from popular web pages and annotated to highlight clickable and actionable regions.
  • An icon description dataset, designed to associate each UI element with its corresponding function. This dataset serves as a key component for training models to understand the semantics of detected elements.
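The post does not spell out the annotation schemas, so the snippet below shows one plausible layout for entries in each dataset; the field names, paths, and values are hypothetical and may differ from the data actually used to train OmniParser.

```python
# Hypothetical annotation layouts for the two curated datasets.

# Interactable icon detection: a screenshot plus bounding boxes of clickable regions.
detection_example = {
    "image": "screenshots/example_page.png",
    "boxes": [
        {"bbox": [102, 34, 148, 70]},   # (x1, y1, x2, y2) of a clickable icon
        {"bbox": [560, 12, 620, 48]},
    ],
}

# Icon description: an element crop paired with its functional description.
caption_example = {
    "image": "icons/cart_icon.png",
    "caption": "Opens the shopping cart to review selected items.",
}
```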

Fine-Tuning Detection and Captioning Models–OmniParser leverages two complementary models, combined as in the sketch below:

  • A detection model, fine-tuned on the interactable icon dataset, which reliably identifies actionable regions within a screenshot.
  • A captioning model, trained on the icon description dataset, which extracts the functional semantics of the detected elements, generating contextually accurate descriptions of their intended actions.
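The sketch below illustrates how such a detection model and captioning model could be chained into a single parsing step. IconDetector and IconCaptioner are stand-in classes for this illustration, not the actual OmniParser implementation.

```python
# Two-stage parsing sketch: detect interactable regions, then caption each one
# with its functional semantics, and return a structured element list.
from typing import List, Tuple

class IconDetector:
    """Fine-tuned detection model; returns bounding boxes of actionable regions."""
    def detect(self, image) -> List[Tuple[int, int, int, int]]:
        raise NotImplementedError

class IconCaptioner:
    """Fine-tuned captioning model; describes what a cropped element does."""
    def caption(self, crop) -> str:
        raise NotImplementedError

def parse(image, detector: IconDetector, captioner: IconCaptioner) -> List[dict]:
    """Combine detection and captioning into one structured parse of the screen."""
    elements = []
    for idx, (x1, y1, x2, y2) in enumerate(detector.detect(image)):
        crop = image.crop((x1, y1, x2, y2))  # assumes a PIL.Image-style input
        elements.append({
            "id": idx,
            "bbox": (x1, y1, x2, y2),
            "description": captioner.caption(crop),
        })
    return elements
```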

Benchmark performance

We demonstrate that with the parsed results, the performance of GPT-4V improves substantially on the ScreenSpot benchmark. On Mind2Web, OmniParser + GPT-4V achieves better performance than a GPT-4V agent that uses extra information extracted from HTML. And on the AITW benchmark, OmniParser outperforms GPT-4V augmented with a specialized Android icon detection model trained with view hierarchy. It also achieves the best performance on the new WindowsAgentArena benchmark.

Figure: Bar charts showing OmniParser's average performance across benchmarks, and OmniParser as a plug-in for other vision-language models.

To further demonstrate that OmniParser is a plug-in choice for off-the-shelf vision-language models, we show its ScreenSpot benchmark performance when combined with the recently announced vision-language models Phi-3.5-V and Llama-3.2-V. We hope OmniParser can serve as a general, easy-to-use tool for parsing user screens across both PC and mobile platforms, without any dependency on extra information such as HTML or the Android view hierarchy.
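Because the parser's output is plain structured text derived only from the screenshot, swapping the downstream model amounts to changing one function call. The sketch below assumes hypothetical wrapper callables (call_gpt4v, call_phi35v) around the respective model endpoints; they are not part of OmniParser.

```python
# The same parsed elements can drive any vision-language model backend.
from typing import Callable, List

def choose_action(task: str, elements: List[dict],
                  vlm_complete: Callable[[str], str]) -> str:
    """Ask any VLM backend to pick an element id, given the structured parse."""
    listing = "\n".join(f"[{e['id']}] {e['description']}" for e in elements)
    prompt = (
        f"Task: {task}\n"
        f"Interactable elements on screen:\n{listing}\n"
        "Reply with the id of the element to act on."
    )
    return vlm_complete(prompt)

# The same call works regardless of the backend model, e.g. (hypothetical wrappers):
# choose_action("Add the item to the cart", elements, vlm_complete=call_gpt4v)
# choose_action("Add the item to the cart", elements, vlm_complete=call_phi35v)
```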