{"id":803533,"date":"2021-12-14T14:06:35","date_gmt":"2021-12-14T22:06:35","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=803533"},"modified":"2021-12-16T20:57:47","modified_gmt":"2021-12-17T04:57:47","slug":"azure-ai-milestone-new-foundation-model-florence-v1-0-pushing-vision-and-vision-language-state-of-the-art","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/azure-ai-milestone-new-foundation-model-florence-v1-0-pushing-vision-and-vision-language-state-of-the-art\/","title":{"rendered":"Azure AI milestone: New foundation model Florence v1.0 advances state of the art, topping popular computer vision leaderboards"},"content":{"rendered":"\n

The Project Florence Team

\"Animated
With the new computer vision foundation model Florence v1.0, the Project Florence team set the new state of the art on the popular leaderboards TextCaps Challenge 2021, nocaps, Kinetics-400\/Kinetics-600 action classification, and OK-VQA Leaderboard. <\/figcaption><\/figure>\n\n\n\n

Florence v1.0, along with recent milestones in Neural Text-to-Speech and question answering, is part of a larger Azure AI mission to provide relevant, meaningful AI solutions and services that work better for people because they better capture how people learn and work, with improved vision, knowledge understanding, and speech capabilities. At the center of these efforts is XYZ-code, a joint representation of three cognitive attributes: monolingual text (X), audio or visual sensory signals (Y), and multilingual (Z). For more information about these efforts, read the XYZ-code blog post.

Project Florence was launched by Microsoft Azure Cognitive Services in May 2020 to advance its large-scale multitask, multimodal computer vision services. Today, we're thrilled to announce an important milestone: Florence v1.0, a computer vision foundation model that successfully scales across a wide variety of vision and vision-language tasks.

Florence v1.0 demonstrates superior performance on challenging tasks such as zero-shot image classification, image/text retrieval, open-set object detection, and visual question answering. We've achieved new state of the art with large margins on a wide range of benchmarks. Supported by Florence v1.0, we've also achieved the new state of the art on multiple popular vision and vision-language leaderboards, including TextCaps Challenge 2021 and Kinetics-400/Kinetics-600 action classification. Florence v1.0 is currently being deployed in Azure Cognitive Services, helping to enhance its computer vision offerings.
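As a rough illustration of what zero-shot image classification means in this setting, the sketch below follows the standard dual-encoder recipe: class names become text prompts, the prompts and the image are embedded into a shared space, and the best-matching prompt determines the label. The encoders here are stand-ins (a random projection and a character hash), not Florence's actual pretrained encoders.

```python
import numpy as np

# Illustrative stand-ins for an image encoder and a language encoder; the real
# pretrained Florence encoders are not public, so these only demonstrate the
# zero-shot recipe of matching an image embedding against prompt embeddings.
rng = np.random.default_rng(0)
EMBED_DIM = 256

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in image encoder: flatten the pixels and randomly project them."""
    proj = rng.standard_normal((image.size, EMBED_DIM))
    v = image.reshape(-1) @ proj
    return v / np.linalg.norm(v)

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in text encoder: hash characters into the shared embedding space."""
    v = np.zeros(EMBED_DIM)
    for i, byte in enumerate(prompt.encode("utf-8")):
        v[(i * 31 + byte) % EMBED_DIM] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def zero_shot_classify(image: np.ndarray, class_names: list) -> str:
    """Score the image against one text prompt per class; no task-specific training."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_embs = np.stack([encode_text(p) for p in prompts])
    image_emb = encode_image(image)
    scores = text_embs @ image_emb  # cosine similarity, since embeddings are unit norm
    return class_names[int(np.argmax(scores))]

dummy_image = rng.random((32, 32, 3))  # placeholder RGB image
print(zero_shot_classify(dummy_image, ["dog", "cat", "bicycle"]))
```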

A holistic, people-centered approach to AI

Project Florence is part of ongoing efforts to develop AI that operates more like people do, a journey that has been challenging but exciting. We take a holistic and people-centered approach to learning and understanding by using multimodality. Our approach examines the relationship between three attributes of human cognition: monolingual text (X), audio or visual sensory cues (Y), and multilingual (Z). It brings them together under XYZ-code, a common representation to enable AI that can speak, hear, see, and understand better. The goal is to create pretrained basic AI models that learn common representations of different modalities and support a wide range of downstream AI tasks, with the ability to leverage additional external domain knowledge, so that the resulting AI systems interpret and interact with the world more like people do.

In helping to advance the ambitious goal of XYZ-code, the Project Florence team achieved its first milestone last year, attaining state-of-the-art performance on the nocaps benchmark. Compared with image descriptions provided by people, captions for the same images generated by the AI system were more detailed and precise. This capability is a key component of the Microsoft mission of inclusive and accessible technology.

\"From<\/a>
Florence v1.0 leverages data curation, unified learning, a Transformer architecture comprising an image encoder and a language encoder, and adaptation. It can be integrated into modern computer vision systems to power real-world vision and multimedia applications. Compared with existing image-text pretraining models, mainly limited to cross-modal shared representations for classification and retrieval (illustrated by the light-green adaptation module above), Florence expands the representation to support object detection, modalities beyond just RGB like image depth, and videos, respectively.<\/figcaption><\/figure><\/div>\n\n\n\n
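The pattern the diagram describes, a shared image-text backbone whose representation is reused by task-specific adaptation heads, can be sketched in a few lines. The toy PyTorch code below is an illustrative assumption about that layout, not the actual Florence architecture; module names, sizes, and the detection head are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderBackbone(nn.Module):
    """Toy image and language encoders projecting into a shared embedding space."""
    def __init__(self, embed_dim: int = 256, vocab_size: int = 30522):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text_encoder = nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor):
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(token_ids), dim=-1)
        return img, txt

class DetectionAdapter(nn.Module):
    """Adaptation head: predicts a fixed set of boxes from the shared image embedding."""
    def __init__(self, embed_dim: int = 256, num_queries: int = 10):
        super().__init__()
        self.num_queries = num_queries
        self.box_head = nn.Linear(embed_dim, num_queries * 4)  # (x, y, w, h) per query

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        return self.box_head(image_embedding).view(-1, self.num_queries, 4)

backbone = DualEncoderBackbone()
detector = DetectionAdapter()
images = torch.rand(2, 3, 64, 64)          # small dummy images
tokens = torch.randint(0, 30522, (2, 16))  # dummy token ids
img_emb, txt_emb = backbone(images, tokens)
retrieval_scores = img_emb @ txt_emb.T     # classification/retrieval from the shared space
boxes = detector(img_emb)                  # detection reuses the same backbone features
```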

Florence v1.0: From research to application

Project Florence's mission is to take the advancements being made in areas such as feature representation learning, transfer learning, and model architecture search and turn them into applications that can empower our partners and customers to achieve more with Azure Cognitive Services. Florence v1.0 and other AI breakthroughs achieved so far are being transferred to the cloud platform, helping to improve model quality for image captioning, tagging, and customized object detection.

The Florence image captioning model is available to customers via the computer vision offering of Azure Cognitive Services, which is part of Azure AI, and can enable developers to incorporate alt text more easily, helping them improve the accessibility of their own products and services. The Florence image captioning model is also being incorporated into Seeing AI, an app that identifies text, objects, and people in a user's surroundings, as well as into Microsoft Word, Outlook, and PowerPoint on various platforms.
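For developers, generating a caption to use as alt text is a single call to the Computer Vision service. The snippet below is a minimal sketch using the Azure Computer Vision Python SDK's describe_image operation; the endpoint, key, and image URL are placeholders to replace with your own resource's values, and the captions returned depend on the model version serving your resource.

```python
# Placeholder endpoint, key, and image URL; substitute your own Computer Vision
# resource values. Requires the azure-cognitiveservices-vision-computervision package.
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

endpoint = "https://<your-resource-name>.cognitiveservices.azure.com/"  # placeholder
key = "<your-computer-vision-key>"                                      # placeholder

client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(key))

# Ask the service to describe a remote image; each candidate caption includes a confidence score.
description = client.describe_image("https://example.com/photo.jpg", max_candidates=3)
for caption in description.captions:
    print(f"{caption.text} (confidence: {caption.confidence:.2f})")
```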
