{"id":610734,"date":"2019-10-08T10:07:15","date_gmt":"2019-10-08T17:07:15","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=610734"},"modified":"2020-02-08T17:36:11","modified_gmt":"2020-02-09T01:36:11","slug":"expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture\/","title":{"rendered":"Expanding scene and language understanding with large-scale pre-training and a unified architecture"},"content":{"rendered":"

\"New<\/p>\n

Making sense of the world around us is a skill we as human beings begin to learn from an early age. Though there is still much to know about the process, we can see that people learn a lot, both directly and indirectly, from observing and interacting with their environments and other people in them: an uncle points to a shiny red piece of fruit and tells his nephew it’s an apple; a teacher reads a book about a hungry caterpillar that turns into a butterfly; a child observes her parents talking about the mail and the mail carrier who delivered it as they shuffle white envelopes with printed lettering and stamps back and forth. Even if the context around an object changes (a flower in a vase on the kitchen table, a flower planted in the ground in the backyard, a field of many flowers), children are able to make new associations and adjust old ones as information is gained and call on their implicit commonsense knowledge to figure out what they encounter. The more we interact with our physical environments, our screens, photographs, and books, the better we become at understanding and using language to explain the items that exist and the things that are happening in our surroundings.

For machines, on the other hand, scene understanding and language understanding are quite challenging to hone, especially with only weak supervision (essentially the indirect learning people are able to leverage so well). Without exact labels for all the components in a scene to learn from, machines struggle to gain a solid foundation on which to build other capabilities that require scene and language understanding. Collecting the necessary labels is usually expensive, and even good labels provide only a reasonable understanding of the scene, not the language.

The main question becomes, then, whether we can leverage the large number of image-text pairs available on the web to mimic the way people improve their scene and language understanding. Can we build a model that unifies machine capabilities so it performs well on both vision-language generation tasks and understanding tasks?

In our paper “Unified Vision-Language Pre-Training for Image Captioning and VQA,” we present a unified single-model encoder-decoder system capable of two disparate tasks: image captioning and visual question answering (VQA). Generating descriptions of scenes and answering natural language questions about them are good indicators of a system’s overall effectiveness at both scene understanding and language understanding. We believe the model, which we’re calling the Vision-Language Pre-training (VLP) model, is among the first to use data from both language and vision to show significant improvements across different downstream tasks. Our proposed model, which is open source on GitHub, was pre-trained using three million image-text pairs. If we can further take advantage of the vast amount of publicly available visual content with accompanying text (think large corpora of movies with subtitles and human conversations grounded in images and videos, such as comments under an image or video posted on social media), we see machine scene and language understanding reaching human parity.
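To make the idea of a single shared backbone serving both tasks concrete, here is a minimal PyTorch-style sketch. It is not the released VLP code (see the GitHub repository mentioned above); the module names, dimensions, answer vocabulary size, and pooling choice are illustrative assumptions only. The point it shows is the one described above: one transformer consumes image region features together with text tokens, and small task-specific heads reuse that backbone for caption generation and for VQA.

```python
# Illustrative sketch only: a single shared transformer with two task heads.
# All names and sizes are hypothetical, not taken from the released VLP code.
import torch
import torch.nn as nn


class UnifiedVisionLanguageModel(nn.Module):
    def __init__(self, vocab_size=30522, region_feat_dim=2048, hidden=768,
                 num_layers=12, num_heads=12, num_answers=3129):
        super().__init__()
        # Shared backbone: image regions and word tokens are projected into a
        # common embedding space and processed by one transformer stack.
        self.region_proj = nn.Linear(region_feat_dim, hidden)
        self.token_embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Two lightweight heads fine-tuned on top of the same backbone.
        self.caption_head = nn.Linear(hidden, vocab_size)   # predicts caption words
        self.vqa_head = nn.Linear(hidden, num_answers)      # classifies over frequent answers

    def forward(self, region_feats, token_ids, attn_mask=None):
        # region_feats: (batch, num_regions, region_feat_dim), e.g. object-detector outputs
        # token_ids:    (batch, seq_len) word-piece ids for the caption or question
        x = torch.cat([self.region_proj(region_feats),
                       self.token_embed(token_ids)], dim=1)
        h = self.backbone(x, mask=attn_mask)
        num_regions = region_feats.size(1)
        text_states = h[:, num_regions:, :]            # hidden states at text positions
        caption_logits = self.caption_head(text_states)
        vqa_logits = self.vqa_head(h[:, 0, :])         # pool one slot for classification (illustrative choice)
        return caption_logits, vqa_logits
```

In this sketch, fine-tuning for captioning would train the caption head (and backbone) with a word-prediction loss, while fine-tuning for VQA would train the answer-classification head on the same backbone.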

\"Microsoft<\/a>

Microsoft researchers have developed a unified encoder-decoder model for general vision-language pre-training that they fine-tuned for image captioning and visual question answering. With the vision-language pre-training, both training speed and overall accuracy have been significantly improved on the downstream tasks compared to random initialization or language-only pre-training.<\/p><\/div>\n

Improving on current models

Existing approaches to image captioning and VQA suffer from low-quality captions and limited reasoning capabilities. This is mainly due to three shortcomings:

1. They are not effective enough at leveraging context, which is a very important capability, especially when there are various objects, relationships, and concepts in a given scene. A model should be able to identify the important components in order to describe images accurately and to reason about them given a natural language question.
2. They don’t leverage large-scale training data for pre-training. Pre-training is crucial for learning universal representations of both language and vision that are practically useful for many downstream tasks, not just image captioning and VQA.
3. Their architecture is not designed to perform equally well on diverse sets of tasks in which both language-vision alignment (as is needed for VQA and information retrieval, for example) and language generation are performed using a single model. A minimal sketch of one way a single model can serve both kinds of task appears after this list.
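On that third point, one common way to let a single transformer handle both alignment-style understanding and left-to-right generation is to keep the network weights shared and switch only the self-attention mask per task. The sketch below illustrates that idea in the UniLM style; it is an assumption for exposition, and the exact masks used in the VLP code may differ.

```python
# Illustrative sketch: per-task self-attention masks over [image regions | text tokens].
# A bidirectional mask lets every position attend everywhere (understanding tasks such
# as VQA); a sequence-to-sequence mask lets text see all regions but only earlier text
# tokens (generation, i.e., captioning). True entries mark attention that is blocked,
# matching the convention of PyTorch's boolean attention masks.
import torch


def bidirectional_mask(num_regions: int, num_tokens: int) -> torch.Tensor:
    total = num_regions + num_tokens
    return torch.zeros(total, total, dtype=torch.bool)  # nothing is blocked


def seq2seq_mask(num_regions: int, num_tokens: int) -> torch.Tensor:
    total = num_regions + num_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Image regions do not attend to the text that is being generated.
    mask[:num_regions, num_regions:] = True
    # Text tokens attend to all regions, but only to themselves and earlier text tokens.
    mask[num_regions:, num_regions:] = torch.ones(num_tokens, num_tokens).triu(1).bool()
    return mask
```

In the earlier sketch, a mask like `seq2seq_mask(...)` could be passed as `attn_mask` when fine-tuning for captioning and `bidirectional_mask(...)` when fine-tuning for VQA, so the two tasks share one set of weights.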

VLP seeks to overcome the above limitations with an architecture that: