{"id":1091139,"date":"2024-10-08T15:31:18","date_gmt":"2024-10-08T22:31:18","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1091139"},"modified":"2024-10-09T16:17:50","modified_gmt":"2024-10-09T23:17:50","slug":"omniparser-for-pure-vision-based-gui-agent","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/omniparser-for-pure-vision-based-gui-agent\/","title":{"rendered":"OmniParser for pure vision-based GUI agent"},"content":{"rendered":"\n
By Yadong Lu<\/a>, Senior Researcher; Jianwei Yang<\/a>, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah<\/a>, Partner Research Manager<\/p>\n\n\n\n Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains underexplored in real-world applications, particularly when it comes to acting as general agents across diverse operating systems and applications with only vision input. One of the primary limiting factors is the absence of a robust technique for screen parsing which is capable of 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associate the intended action with the corresponding region on the screen.<\/p>\n\n\n\n Meet OmniParser, a compact screen parsing module that can convert UI screenshots into structured elements. OmniParser can be used with a variety of models to create agents capable of taking actions on UIs. When used with GPT-4V, it significantly improves the agent capability to generate precisely grounded actions for interface regions.<\/p>\n\n\n\n An agent using OmniParser and GPT-4V achieved the best performance on the recently released WindowsAgentArena (opens in new tab)<\/span><\/a> benchmark.<\/p>\n\n\n\n We are making OmniParser publicly available on GitHub, along with a report describing the training procedure to encourage research on creating agents that can act on different applications and environments.<\/p>\n\n\n\n