{"id":1048632,"date":"2024-06-19T14:06:14","date_gmt":"2024-06-19T21:06:14","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1048632"},"modified":"2024-08-29T14:19:45","modified_gmt":"2024-08-29T21:19:45","slug":"image-understanding-benchmark","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/image-understanding-benchmark\/","title":{"rendered":"A Dynamic Benchmark for Image Understanding"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

A Dynamic Benchmark for Image Understanding

We have created a procedurally generatable, synthetic dataset for testing spatial reasoning, visual prompting, object recognition, and detection.

A key question for understanding multimodal model performance is how well a model can understand images, in particular basic vs. detailed spatial understanding. These capabilities are needed for models to be used in real-world tasks, such as serving as an assistant in the physical world. The datasets are challenging, and because they are procedurally generated and non-public, strong results can't be due to memorization.
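To illustrate what "procedurally generated" can mean in this setting, here is a minimal sketch that samples a toy two-object scene and a left/right spatial question about it. Everything in it, the shape and color lists, the canvas size, the question template, is our own hypothetical illustration, not the benchmark's actual generator.

```python
# Minimal sketch of a procedural generator for spatial-reasoning VQA items.
# All names, parameters, and templates here are hypothetical illustrations,
# not the benchmark's actual code.
import itertools
import random

COLORS = ["red", "green", "blue"]
SHAPES = ["circle", "square", "triangle"]


def sample_item(seed=None, canvas=256, margin=32):
    """Sample a two-object scene and a left/right question about it."""
    rng = random.Random(seed)
    # Two distinct (color, shape) combinations keep the question unambiguous.
    (c1, s1), (c2, s2) = rng.sample(list(itertools.product(COLORS, SHAPES)), 2)
    a = {"color": c1, "shape": s1,
         "x": rng.randint(margin, canvas - margin),
         "y": rng.randint(margin, canvas - margin)}
    b = {"color": c2, "shape": s2,
         "x": rng.randint(margin, canvas - margin),
         "y": rng.randint(margin, canvas - margin)}
    while a["x"] == b["x"]:  # avoid ties so "left"/"right" is well defined
        b["x"] = rng.randint(margin, canvas - margin)
    question = (f"Is the {a['color']} {a['shape']} to the left or to the "
                f"right of the {b['color']} {b['shape']}?")
    answer = "left" if a["x"] < b["x"] else "right"
    return {"objects": [a, b], "question": question, "answer": answer}


if __name__ == "__main__":
    item = sample_item(seed=0)
    print(item["question"], "->", item["answer"])
```

Because every item comes from a seeded generator rather than a fixed public test set, the pool of test items is effectively unbounded, which is what rules out memorization as an explanation for good scores.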

The benchmark has four sub-tasks that test both high-level and detailed image understanding using a Visual Question Answering (VQA) approach. The figure below shows the four tasks. Each task has a single-object condition and a pair-of-objects condition. For each image, we show the question that can be presented as a prompt to a multimodal language model.

\"Examples<\/figure>\n","protected":false},"excerpt":{"rendered":"
