{"id":657990,"date":"2020-05-15T09:03:10","date_gmt":"2020-05-15T16:03:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=657990"},"modified":"2020-05-19T09:10:09","modified_gmt":"2020-05-19T16:10:09","slug":"objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/","title":{"rendered":"Objects are the secret key to revealing the world between vision and language"},"content":{"rendered":"
Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels to grasp the key concepts needed for a better understanding of the world. One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with the ability to learn effectively from multi-modality (or multi-channel) data, much as the sights and sounds of vision and language help humans make sense of the world around them. For example, computers could mimic this ability by retrieving the images most similar to a text query (or vice versa) and by describing the content of an image in natural language.
Recently, vision-and-language pre-training (VLP) has shown great progress toward addressing this problem. The most representative approach is to train large Transformer-based models on massive image-text pair data in a self-supervised manner, such as by predicting masked elements based on their context. The cross-modal representations of the pre-trained models can then be fine-tuned to adapt to various downstream vision-and-language tasks. However, existing VLP methods simply concatenate image region features and text features as input to the model for pre-training and use self-attention to learn image-text semantic alignments in a brute-force yet implicit manner, leaving the model to figure out the cross-modal alignment from scratch.

In this blog post, we introduce Oscar (Object-Semantics Aligned Pre-training) to highlight our observation that objects can naturally be used as anchor points to ease the learning of semantic alignments between images and texts. This discovery leads to a novel VLP framework that achieves new state-of-the-art performance on six well-established vision-and-language tasks. Please check out our paper on this technology, "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks," and explore the code for more details.

Object tags as anchor points

Though the observed data varies among different channels (modalities), we hypothesize that important factors tend to be shared among multiple channels (for example, dogs can be described visually and verbally), capturing channel-invariant (or modality-invariant) factors at the semantic level. In vision-and-language tasks, salient objects in an image can mostly be detected by modern object detectors, and such objects are often mentioned in the paired text. For example, on the MS COCO dataset, the percentages of image-text pairs in which the image and its paired text share at least one, two, or three objects are 49.7%, 22.2%, and 12.9%, respectively.
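To make that statistic concrete, here is a minimal Python sketch (not the released Oscar code) of how such an overlap could be measured: for each image-caption pair, count how many of the image's detected object tags also appear as words in the caption. The `pairs` list, the tag strings, and the simple word-level matching are illustrative assumptions; the numbers quoted above come from the actual MS COCO images, captions, and detector output.

```python
# Minimal sketch, not the released Oscar code: estimate the fraction of
# image-caption pairs whose detected object tags also appear in the caption.
import re
from collections import Counter

def shared_object_percentages(pairs, ks=(1, 2, 3)):
    """pairs: list of (detected_tags, caption); returns {k: % of pairs sharing >= k objects}."""
    at_least = Counter()
    for tags, caption in pairs:
        caption_words = set(re.findall(r"[a-z]+", caption.lower()))
        shared = {t.lower() for t in tags if t.lower() in caption_words}
        # A pair that shares n objects counts toward every threshold k <= n.
        for k in range(1, len(shared) + 1):
            at_least[k] += 1
    total = len(pairs)
    return {k: 100.0 * at_least[k] / total for k in ks}

# Toy example; real measurements use MS COCO captions and an object detector's tags.
pairs = [
    (["dog", "couch", "person"], "A dog sleeping on a couch."),  # shares dog, couch
    (["car", "tree", "person"], "A car parked on the street."),  # shares car
]
print(shared_object_percentages(pairs))  # {1: 100.0, 2: 50.0, 3: 0.0}
```

Exact string matching understates the true overlap (it misses plurals and synonyms), so this sketch is only a rough way to reproduce the kind of measurement described above.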