{"id":610734,"date":"2019-10-08T10:07:15","date_gmt":"2019-10-08T17:07:15","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=610734"},"modified":"2020-02-08T17:36:11","modified_gmt":"2020-02-09T01:36:11","slug":"expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture\/","title":{"rendered":"Expanding scene and language understanding with large-scale pre-training and a unified architecture"},"content":{"rendered":"

\"New<\/p>\n

Making sense of the world around us is a skill we as human beings begin to learn from an early age. Though there is still much to know about the process, we can see that people learn a lot, both directly and indirectly, from observing and interacting with their environments and other people in them: an uncle points to a shiny red piece of fruit and tells his nephew it’s an apple; a teacher reads a book about a hungry caterpillar that turns into a butterfly; a child observes her parents talking about the mail and the mail carrier who delivered it as they shuffle white envelopes with printed lettering and stamps back and forth. Even if the context around an object changes (a flower in a vase on the kitchen table, a flower planted in the ground in the backyard, a field of many flowers), children are able to make new associations and adjust old ones as information is gained and call on their implicit commonsense knowledge to figure out what they encounter. The more we interact with our physical environments, our screens, photographs, and books, the better we become at understanding and using language to explain the items that exist and the things that are happening in our surroundings.

For machines, on the other hand, scene understanding and language understanding are quite challenging to hone, especially with only weak supervision (essentially the indirect learning people are able to leverage so well). Without exact labels for all the components in a scene to learn from, machines struggle to gain a solid foundation on which to build other capabilities that require scene and language understanding. Collecting the necessary labels is usually expensive, and even good labels provide only a reasonable understanding of the scene, not the language.

The main question becomes, then, whether we can leverage the large number of image-text pairs available on the web to mimic the way people improve their scene and language understanding. Can we build a model that unifies machine capabilities so it performs well on both vision-language generation tasks and understanding tasks?

In our paper “Unified Vision-Language Pre-Training for Image Captioning and VQA,” we present a unified single-model encoder-decoder system capable of two disparate tasks: image captioning and visual question answering (VQA). Generating descriptions of scenes and answering natural language questions about them are good indicators of a system’s overall effectiveness at both scene understanding and language understanding. We believe the model, which we’re calling the Vision-Language Pre-training (VLP) model, is among the first to use data from both language and vision to show significant improvements across different downstream tasks. Our proposed model, which is open source on GitHub, was pre-trained using three million image-text pairs. If we can further take advantage of the vast amount of publicly available visual content with accompanying text (think large corpora of movies with subtitles and human conversations grounded in images and videos, such as comments under an image or video posted on social media), we see machine scene and language understanding reaching human parity.
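To make the idea of a single shared backbone serving both tasks concrete, here is a minimal PyTorch-style sketch. It is not the released VLP code (see the GitHub repository mentioned above); the module names, dimensions, answer vocabulary size, and pooling choice are illustrative assumptions only. The point it shows is the one described above: one transformer consumes image region features together with text tokens, and small task-specific heads reuse that backbone for caption generation and for VQA.

```python
# Illustrative sketch only: a single shared transformer with two task heads.
# All names and sizes are hypothetical, not taken from the released VLP code.
import torch
import torch.nn as nn


class UnifiedVisionLanguageModel(nn.Module):
    def __init__(self, vocab_size=30522, region_feat_dim=2048, hidden=768,
                 num_layers=12, num_heads=12, num_answers=3129):
        super().__init__()
        # Shared backbone: image regions and word tokens are projected into a
        # common embedding space and processed by one transformer stack.
        self.region_proj = nn.Linear(region_feat_dim, hidden)
        self.token_embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Two lightweight heads fine-tuned on top of the same backbone.
        self.caption_head = nn.Linear(hidden, vocab_size)   # predicts caption words
        self.vqa_head = nn.Linear(hidden, num_answers)      # classifies over frequent answers

    def forward(self, region_feats, token_ids, attn_mask=None):
        # region_feats: (batch, num_regions, region_feat_dim), e.g. object-detector outputs
        # token_ids:    (batch, seq_len) word-piece ids for the caption or question
        x = torch.cat([self.region_proj(region_feats),
                       self.token_embed(token_ids)], dim=1)
        h = self.backbone(x, mask=attn_mask)
        num_regions = region_feats.size(1)
        text_states = h[:, num_regions:, :]            # hidden states at text positions
        caption_logits = self.caption_head(text_states)
        vqa_logits = self.vqa_head(h[:, 0, :])         # pool one slot for classification (illustrative choice)
        return caption_logits, vqa_logits
```

In this sketch, fine-tuning for captioning would train the caption head (and backbone) with a word-prediction loss, while fine-tuning for VQA would train the answer-classification head on the same backbone.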

\"Microsoft<\/a>

Microsoft researchers have developed a unified encoder-decoder model for general vision-language pre-training that they fine-tuned for image captioning and visual question answering. With the vision-language pre-training, both training speed and overall accuracy have been significantly improved on the downstream tasks compared to random initialization or language-only pre-training.<\/p><\/div>\n

Improving on current models

Existing approaches to image captioning and VQA suffer from low-quality captions and limited reasoning capabilities. This is mainly due to three shortcomings:

1. They are not effective enough at leveraging context, which is a very important capability, especially when there are various objects, relationships, and concepts in a given scene. A model should be able to identify the important components in order to describe images accurately and to reason about them given a natural language question.
2. They don’t leverage large-scale training data for pre-training. Pre-training is crucial for learning universal representations of both language and vision that are practically useful for many downstream tasks, not just image captioning and VQA.
3. Their architecture is not designed to perform equally well on diverse sets of tasks in which both language-vision alignment (as is needed for VQA and information retrieval, for example) and language generation are performed using a single model. A minimal sketch of one way a single model can serve both kinds of task appears after this list.
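On that third point, one common way to let a single transformer handle both alignment-style understanding and left-to-right generation is to keep the network weights shared and switch only the self-attention mask per task. The sketch below illustrates that idea in the UniLM style; it is an assumption for exposition, and the exact masks used in the VLP code may differ.

```python
# Illustrative sketch: per-task self-attention masks over [image regions | text tokens].
# A bidirectional mask lets every position attend everywhere (understanding tasks such
# as VQA); a sequence-to-sequence mask lets text see all regions but only earlier text
# tokens (generation, i.e., captioning). True entries mark attention that is blocked,
# matching the convention of PyTorch's boolean attention masks.
import torch


def bidirectional_mask(num_regions: int, num_tokens: int) -> torch.Tensor:
    total = num_regions + num_tokens
    return torch.zeros(total, total, dtype=torch.bool)  # nothing is blocked


def seq2seq_mask(num_regions: int, num_tokens: int) -> torch.Tensor:
    total = num_regions + num_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Image regions do not attend to the text that is being generated.
    mask[:num_regions, num_regions:] = True
    # Text tokens attend to all regions, but only to themselves and earlier text tokens.
    mask[num_regions:, num_regions:] = torch.ones(num_tokens, num_tokens).triu(1).bool()
    return mask
```

In the earlier sketch, a mask like `seq2seq_mask(...)` could be passed as `attn_mask` when fine-tuning for captioning and `bidirectional_mask(...)` when fine-tuning for VQA, so the two tasks share one set of weights.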

VLP seeks to overcome the above limitations with an architecture that: