{"id":1091139,"date":"2024-10-08T15:31:18","date_gmt":"2024-10-08T22:31:18","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1091139"},"modified":"2024-10-09T16:17:50","modified_gmt":"2024-10-09T23:17:50","slug":"omniparser-for-pure-vision-based-gui-agent","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/omniparser-for-pure-vision-based-gui-agent\/","title":{"rendered":"OmniParser for pure vision-based GUI agent"},"content":{"rendered":"\n<p>By <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/yadonglu\/\">Yadong Lu<\/a>, Senior Researcher; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/jianwyan\/\">Jianwei Yang<\/a>, Principal Researcher; Yelong Shen, Principal Research Manager; <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/hassanam\/\">Ahmed Awadallah<\/a>, Partner Research Manager<\/p>\n\n\n\n<p>Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains underexplored in real-world applications, particularly when it comes to acting as general agents across diverse operating systems and applications with only vision input. One of the primary limiting factors is the absence of a robust technique for screen parsing which is capable of 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associate the intended action with the corresponding region on the screen.<\/p>\n\n\n\n<p>Meet OmniParser, a compact screen parsing module that can convert UI screenshots into structured elements. OmniParser can be used with a variety of models to create agents capable of taking actions on UIs. When used with GPT-4V, it significantly improves the agent capability to generate precisely grounded actions for interface regions.<\/p>\n\n\n\n<p>An agent using OmniParser and GPT-4V achieved the best performance on the recently released <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/microsoft.github.io\/WindowsAgentArena\/\" target=\"_blank\" rel=\"noreferrer noopener\">WindowsAgentArena<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> benchmark.<\/p>\n\n\n\n<p>We are making OmniParser publicly available on GitHub, along with a report describing the training procedure to encourage research on creating agents that can act on different applications and environments.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/OmniParser\" target=\"_blank\" rel=\"noreferrer noopener\">Get OmniParser<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/microsoft.github.io\/OmniParser\/\" target=\"_blank\" rel=\"noreferrer noopener\">View the project<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/omniparser-for-pure-vision-based-gui-agent\/\">Read the report<\/a><\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"creating-omniparser\">Creating OmniParser<\/h2>\n\n\n\n<p><strong>Curating Specialized Datasets<\/strong>&#8211;The development of OmniParser began with the creation of two datasets:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An <strong>interactable icon detection dataset<\/strong>, which was curated from popular web pages and annotated to highlight clickable and actionable regions.<\/li>\n\n\n\n<li>An <strong>icon description dataset<\/strong>, designed to associate each UI element with its corresponding function. This dataset serves as a key component for training models to understand the semantics of detected elements.<\/li>\n<\/ul>\n\n\n\n<p><strong>Fine-Tuning Detection and Captioning Models<\/strong>&#8211;OmniParser leverages two complementary models:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A <strong>detection model<\/strong>, fine-tuned on the interactable icon dataset, which reliably identifies actionable regions within a screenshot.<\/li>\n\n\n\n<li>A <strong>captioning model<\/strong>, trained on the icon description dataset, which extracts the functional semantics of the detected elements, generating contextually accurate descriptions of their intended actions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"benchmark-performance\">Benchmark performance<\/h3>\n\n\n\n<p>We demonstrate that with the parsed results, the performance of GPT-4V is greatly improved on ScreenSpot benchmarks. On Mind2Web, OmniParser +GPT-4V achieves better performance compared to GPT-4V agent that uses extra information extracted from HTML. And on AITW benchmark, OmniParser outperforms GPT-4V augmented with specialized Android icon detection model that is trained with view hierarchy. It also achieves the best performance on a new benchmark <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/microsoft.github.io\/WindowsAgentArena\/\" target=\"_blank\" rel=\"noreferrer noopener\">WindowsAgentArena<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"934\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig1_1400px.png\" alt=\"OmniParser chart, bar chart showing average performance across benchmarks\" class=\"wp-image-1091286\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig1_1400px.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig1_1400px-300x200.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig1_1400px-1024x683.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig1_1400px-768x512.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig1_1400px-240x160.png 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"934\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig2_1400px.png\" alt=\"OmniParser bar chart showing plugin ready for other vision language models\" class=\"wp-image-1091283\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig2_1400px.png 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig2_1400px-300x200.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig2_1400px-1024x683.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig2_1400px-768x512.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/10\/OmniParser_performance_fig2_1400px-240x160.png 240w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>To further demonstrate OmniParser is a plugin choice for off-the-shelf vision language models, we show the ScreenSpot benchmark performance of OmniParser combined with recently announced vision language models: Phi-3.5-V and Llama-3.2-V. We hope OmniParser can serve as a general and easy-to-use tool that has the capability to parse general user screen across both PC and mobile platforms without any dependency on extra information such as HTML and view hierarchy in Android.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains [&hellip;]<\/p>\n","protected":false},"author":43611,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":992148,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1091139","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":992148,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1091139","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43611"}],"version-history":[{"count":9,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1091139\/revisions"}],"predecessor-version":[{"id":1129596,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1091139\/revisions\/1129596"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1091139"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1091139"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1091139"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1091139"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}