{"id":689814,"date":"2020-09-22T21:43:29","date_gmt":"2020-09-23T04:43:29","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=689814"},"modified":"2022-08-24T10:56:02","modified_gmt":"2022-08-24T17:56:02","slug":"project-florence-vl","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/project-florence-vl\/","title":{"rendered":"Project Florence-VL"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\"Azure\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Project Florence-VL

Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels in order to grasp the key concepts needed for a better understanding of the world.

So, what is Florence-VL about?

One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with the ability to learn effectively from multimodal (or multi-channel) data, akin to the sights and sounds that humans gather through vision and hearing to make sense of the world around them. For example, computers could mimic this ability by retrieving the images most similar to a text query (or vice versa) and by describing the content of an image in natural language.
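To make the retrieval example concrete, here is a minimal sketch of text-to-image search by cosine similarity in a shared embedding space. The encoders below are hypothetical placeholders (random vectors stand in for trained models), not Florence-VL components.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # illustrative embedding size, not a Florence-VL setting

def encode_text(query: str) -> np.ndarray:
    # Placeholder: a real system would call a trained language encoder here.
    return rng.standard_normal(EMBED_DIM)

def encode_image(image_id: str) -> np.ndarray:
    # Placeholder: a real system would call a trained vision encoder here.
    return rng.standard_normal(EMBED_DIM)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search_images(query: str, image_ids: list[str], top_k: int = 3) -> list[str]:
    """Rank images by similarity between the query embedding and each image embedding."""
    q = encode_text(query)
    ranked = sorted(image_ids, key=lambda i: cosine_sim(q, encode_image(i)), reverse=True)
    return ranked[:top_k]

print(search_images("a dog catching a frisbee", ["img_001", "img_002", "img_003", "img_004"]))
```

A real system would swap in trained vision and language encoders and precompute the image embeddings offline so that only the text query is encoded at search time.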

Azure Florence-Vision and Language, Florence-VL for short, was launched to achieve this goal: we aim to build new foundation models for multimodal intelligence. Florence-VL, as part of Project Florence, has been funded by the Microsoft AI Cognitive Service team since 2020. Motivated by strong demand from real applications and by recent research progress in computer vision, natural language processing, and vision-language understanding, we strive to advance the state of the art in vision-language modeling and to develop the best computer vision technologies as part of our mission to empower everyone on the planet to achieve more.

Our journey starts with VL pre-training

Recently, Vision-Language Pre-training (VLP) has shown great progress toward learning general-purpose multimodal representations. The most representative approach is to train large transformer-based models on massive image-text pair data in a self-supervised manner, for example by predicting masked elements based on their context. The cross-modal representations of the pre-trained models can then be fine-tuned to adapt to various downstream vision-language tasks.
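As a rough illustration of that masked-prediction objective (not the actual Florence-VL models), the sketch below fuses text tokens with precomputed image-region features in a tiny transformer and trains it to recover masked words; every size, name, and layer count here is an assumption made for brevity.

```python
import torch
import torch.nn as nn

# Illustrative constants: vocabulary size, hidden width, region-feature size, and mask token id.
VOCAB_SIZE, HIDDEN, REGION_DIM, MASK_ID = 1000, 256, 2048, 0

class TinyVLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.region_proj = nn.Linear(REGION_DIM, HIDDEN)      # map visual features into the text space
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)         # predict the identity of masked tokens

    def forward(self, token_ids, region_feats):
        text = self.token_embed(token_ids)                    # (B, T, H)
        vision = self.region_proj(region_feats)               # (B, R, H)
        fused = self.encoder(torch.cat([text, vision], dim=1))
        return self.mlm_head(fused[:, : token_ids.size(1)])   # logits over text positions only

# Toy batch: 2 captions of 8 tokens each, with 4 detected regions per image.
tokens = torch.randint(1, VOCAB_SIZE, (2, 8))
regions = torch.randn(2, 4, REGION_DIM)
mask = torch.rand(tokens.shape) < 0.15                        # mask roughly 15% of the text tokens
mask[0, 0] = True                                             # guarantee at least one masked position
inputs = tokens.masked_fill(mask, MASK_ID)
labels = tokens.masked_fill(~mask, -100)                      # ignore unmasked positions in the loss

model = TinyVLP()
logits = model(inputs, regions)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1), ignore_index=-100
)
loss.backward()
print(f"masked-prediction loss: {loss.item():.3f}")
```

After pre-training at scale, the same encoder would be fine-tuned with a task-specific head on downstream vision-language tasks such as captioning or visual question answering.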

The web and social media host a huge number of images with accompanying text, and this text can be used as a "free" source of labels. There are also many videos whose audio tracks describe what happens in them, and this audio can be transcribed into text labels as well. Beyond removing the need for manual data labeling, VLP brings another important benefit: cross-modal knowledge distillation, where knowledge learned in one modality helps learning in another.

Along the journey, our team has developed a series of seminal works, including UNITER, OSCAR, VILLA, and VinVL. These models, when equipped with large-scale pre-training, have helped us build state-of-the-art techniques for challenging vision-language tasks. For example, with VIVO we achieved the first human parity on the novel object captioning (nocaps) task. By enhancing pre-training with scene text detected in images, our TAP model has also achieved No. 1 in the TextCaps Challenge 2021.

How to further modernize our Florence-VL efforts?

These successes are encouraging and have driven us to further modernize our Florence-VL efforts, as detailed below.