Expanding scene and language understanding with large-scale pre-training and a unified architecture

Microsoft Research Blog | Published October 8, 2019
Making sense of the world around us is a skill we begin to learn at an early age. Though much remains unknown about the process, we can see that people learn a great deal, both directly and indirectly, from observing and interacting with their environments and the people in them: an uncle points to a shiny red piece of fruit and tells his nephew it's an apple; a teacher reads a book about a hungry caterpillar that turns into a butterfly; a child watches her parents shuffle stamped white envelopes back and forth as they talk about the mail and the carrier who delivered it. Even when the context around an object changes (a flower in a vase on the kitchen table, a flower planted in the backyard, a field full of flowers), children make new associations, adjust old ones as information is gained, and draw on implicit commonsense knowledge to make sense of what they encounter. The more we interact with our physical environments, our screens, photographs, and books, the better we become at understanding and using language to describe the things that exist and the events that unfold in our surroundings.

For machines, on the other hand, scene understanding and language understanding are difficult to hone, especially with only weak supervision, the kind of indirect learning people are able to leverage so well. Without exact labels for every component of a scene to learn from, machines struggle to build the solid foundation needed for other capabilities that require scene and language understanding.
Collecting the necessary labels is usually expensive, and even good labels yield only a reasonable understanding of the scene, not of the language.

The main question, then, is whether we can leverage the large number of image-text pairs available on the web to mimic the way people improve their scene and language understanding. Can we build a model that unifies machine capabilities to perform well on both vision-language generation tasks and understanding tasks?

In our paper "Unified Vision-Language Pre-Training for Image Captioning and VQA," we present a unified single-model encoder-decoder system capable of two disparate tasks: image captioning and visual question answering (VQA). Generating descriptions of scenes and answering natural language questions about them are good indicators of a system's overall effectiveness at both scene understanding and language understanding. We believe the model, which we call the Vision-Language Pre-training (VLP) model, is among the first to use data from both language and vision to show significant improvements across different downstream tasks. Our proposed model, which is open source on GitHub, was pre-trained on three million image-text pairs. If we can further take advantage of the vast amount of publicly available visual data paired with text (think large corpora of movies with subtitles, or human conversations grounded in images and videos, such as comments under an image or video posted on social media), we believe machine scene and language understanding can move closer to human parity.
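One way a single model can serve both a generation task (captioning) and an understanding task (VQA) is by sharing one transformer and switching the self-attention mask per objective: a sequence-to-sequence mask for generation and a fully bidirectional mask for understanding. The sketch below, in NumPy, is illustrative only; the function names and the exact treatment of image regions versus text tokens are our assumptions for this example, not code from the VLP release.

```python
import numpy as np

def seq2seq_mask(n_img, n_txt):
    """Illustrative mask for a generation (captioning-style) objective:
    image regions attend to all image regions; each text token attends
    to all image regions and causally to earlier text tokens."""
    n = n_img + n_txt
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_img] = True            # every position can see the image regions
    for i in range(n_img, n):
        mask[i, n_img:i + 1] = True   # text positions: causal over text
    return mask

def bidirectional_mask(n_img, n_txt):
    """Illustrative mask for an understanding (VQA-style) objective:
    full attention among all image regions and text tokens."""
    n = n_img + n_txt
    return np.ones((n, n), dtype=bool)
```

The appeal of this design is that both objectives reuse the same parameters; only the mask passed to the shared transformer changes between pre-training tasks.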