Yadong Lu, Senior Researcher; Thomas Dhome-Casanova, Software Engineer; Jianwei Yang, Principal Researcher; Ahmed Awadallah, Partner Research Manager
Graphical user interface (GUI) automation requires agents that can understand and interact with user screens. However, using general-purpose LLMs as GUI agents poses two challenges: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of the various elements in a screenshot and accurately associating an intended action with the corresponding region on the screen. OmniParser closes this gap by ‘tokenizing’ UI screenshots from pixel space into structured elements that are interpretable by LLMs. This enables an LLM to perform retrieval-based next-action prediction over a set of parsed interactable elements.
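To make this concrete, the sketch below shows what a ‘tokenized’ screen representation and a retrieval-based prompt could look like. The `UIElement` fields, the `parse_screenshot` wrapper, and the prompt format are illustrative assumptions for exposition, not OmniParser's actual API:

```python
# A minimal sketch of the parse-then-plan idea. `parse_screenshot` is a
# hypothetical wrapper; OmniParser's real entry points live in its repository.
from dataclasses import dataclass

@dataclass
class UIElement:
    element_id: int    # index the LLM can refer back to
    bbox: tuple        # (x1, y1, x2, y2) in pixel coordinates
    caption: str       # functional description of the icon/control
    interactable: bool # whether the detector flagged it as actionable

def parse_screenshot(image_path: str) -> list[UIElement]:
    """Hypothetical wrapper around OmniParser's detection + captioning models."""
    raise NotImplementedError

def build_prompt(task: str, elements: list[UIElement]) -> str:
    """Serialize parsed elements so the LLM selects an element ID, not raw pixels."""
    lines = [f"Task: {task}", "Interactable elements:"]
    for e in elements:
        if e.interactable:
            lines.append(f"  [{e.element_id}] {e.caption} at {e.bbox}")
    lines.append("Reply with the ID of the element to act on next.")
    return "\n".join(lines)
```

Because the LLM only has to pick an element ID from this structured list, grounding reduces to a retrieval problem rather than pixel-level coordinate prediction.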
OmniParser V2 takes this capability to the next level. Compared to its predecessor, it achieves higher accuracy in detecting smaller interactable elements and faster inference, making it a useful tool for GUI automation. In particular, OmniParser V2 is trained with a larger set of interactive element detection data and icon functional caption data. By decreasing the input image size of the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version. Notably, OmniParser + GPT-4o achieves a state-of-the-art average accuracy of 39.6 on the recently released grounding benchmark ScreenSpot Pro, which features high-resolution screens and tiny target icons. This is a substantial improvement over GPT-4o’s original score of 0.8.
To enable faster experimentation with different agent settings, we created OmniTool, a dockerized Windows system that incorporates a suite of essential tools for agents. Out of the box, OmniTool lets OmniParser be used with a variety of state-of-the-art LLMs: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic (Sonnet), combining the screen understanding, grounding, action planning, and execution steps.
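As a rough illustration of the kind of model dispatch this implies, the sketch below routes a parsed-screen prompt to either the OpenAI or Anthropic chat APIs. The `make_planner` function and the exact Sonnet model string are our assumptions; OmniTool's actual wiring lives in its repository:

```python
# Hypothetical planner factory: pick an LLM backend by name and return a
# function mapping a parsed-screen prompt to the model's next-action reply.
from typing import Callable

def make_planner(model: str) -> Callable[[str], str]:
    if model in {"gpt-4o", "o1", "o3-mini"}:
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        return lambda prompt: client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
    if model == "claude-3-5-sonnet-latest":  # assumed model alias
        import anthropic
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        return lambda prompt: client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        ).content[0].text
    raise ValueError(f"unsupported model: {model}")
```

Keeping the planner behind a single callable like this is what makes it cheap to swap models while reusing the same screen understanding, grounding, and execution machinery.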
Risks and Mitigations
To align with the Microsoft AI principles and Responsible AI practices, we mitigate risk by training the icon caption model with Responsible AI data, which helps the model avoid, as much as possible, inferring sensitive attributes (e.g., race, religion) of individuals who happen to appear in icon images. At the same time, we encourage users to apply OmniParser only to screenshots that do not contain harmful content. For OmniTool, we conducted a threat model analysis using the Microsoft Threat Modeling Tool. We provide a sandboxed Docker container, safety guidance, and examples in our GitHub repository, and we advise keeping a human in the loop to minimize risk.
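As one illustration of keeping a human in the loop, a minimal confirmation gate could require operator approval before any proposed action runs. This helper is our own example, not part of OmniTool:

```python
# Illustrative human-in-the-loop gate: every planned action is shown to an
# operator and only executed after explicit approval.
def confirm_and_execute(action: str, execute) -> bool:
    answer = input(f"Agent proposes: {action!r}. Execute? [y/N] ").strip().lower()
    if answer == "y":
        execute(action)  # `execute` is whatever callable performs the action
        return True
    print("Action skipped by operator.")
    return False
```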