{"id":657990,"date":"2020-05-15T09:03:10","date_gmt":"2020-05-15T16:03:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=657990"},"modified":"2020-05-19T09:10:09","modified_gmt":"2020-05-19T16:10:09","slug":"objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/","title":{"rendered":"Objects are the secret key to revealing the world between vision and language"},"content":{"rendered":"

\"\"<\/p>\n

Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse the information collected from multiple channels to grasp the key concepts needed for a better understanding of the world. One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with the ability to learn effectively from multimodal (or multi-channel) data, much as the sights and sounds of vision and language help humans make sense of the world around us. For example, computers could mimic this ability by searching for the images most similar to a text query (or vice versa) and by describing the content of an image in natural language.

Recently, vision-and-language pre-training (VLP) has shown great progress toward addressing this problem. The most representative approach is to train large Transformer-based models on massive amounts of image-text pairs in a self-supervised manner, for example by predicting masked elements from their context. The cross-modal representations of the pre-trained models can then be fine-tuned for various downstream vision-and-language tasks. However, existing VLP methods simply concatenate image region features and text features as input to the model for pre-training and rely on self-attention to learn image-text semantic alignments in a brute-force yet implicit manner, leaving the model to figure out the cross-modal alignment from scratch.
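To make that conventional setup concrete, here is a minimal, hypothetical sketch of how a typical VLP model might concatenate projected image region features with text token embeddings into a single sequence for self-attention. The class and parameter names (`SimpleVLPInput`, `region_dim`, and so on) are illustrative and do not correspond to any specific published model.

```python
import torch
import torch.nn as nn

class SimpleVLPInput(nn.Module):
    """Illustrative only: concatenate text token embeddings with projected
    image region features, as typical VLP models do before self-attention."""
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)  # map regions into the text embedding space
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids, region_feats):
        text = self.word_emb(token_ids)            # (B, T, hidden) text tokens
        regions = self.region_proj(region_feats)   # (B, R, hidden) image regions
        x = torch.cat([text, regions], dim=1)      # one sequence; alignment is left to self-attention
        return self.encoder(x)
```

In this baseline setup, nothing tells the model which regions correspond to which words; the alignment is learned implicitly, which is exactly the gap Oscar targets.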

In this blog post, we introduce Oscar (Object-Semantics Aligned Pre-training) to highlight our observation that objects can naturally be used as anchor points to ease the learning of semantic alignments between images and texts. This discovery leads to a novel VLP framework that sets new state-of-the-art performance on six well-established vision-and-language tasks. Please check out our paper on this technology, "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks," and explore the code for more details.

Oscar is one important piece of Microsoft's new AI at Scale initiative to enable next-generation AI capabilities at scale. In the accompanying AI blog post, learn more about how Oscar is being integrated with other technologies to create powerful AI that people can use in original and innovative ways.


Object tags as anchor points

Though the observed data varies across channels (modalities), we hypothesize that the important factors tend to be shared among them (for example, a dog can be described both visually and verbally) and thus capture channel-invariant (or modality-invariant) information at the semantic level. In vision-and-language tasks, most salient objects in an image can be detected by modern object detectors, and such objects are often mentioned in the paired text. For example, on the MS COCO dataset, the percentages of images whose paired text shares at least one, two, or three detected objects are 49.7%, 22.2%, and 12.9%, respectively.
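As an aside, this kind of overlap statistic could be approximated with a few lines of code. The snippet below is a hypothetical sketch using exact string matching between detected tags and caption words; the numbers quoted above come from MS COCO data and a real object detector, not from this toy procedure.

```python
# Hypothetical sketch of the overlap statistic described above: for each image,
# count how many detected object tags also appear in the paired caption, then
# report the fraction of pairs sharing at least k objects.
def shared_object_count(detected_tags, caption):
    caption_words = {w.strip(".,") for w in caption.lower().split()}
    return sum(1 for tag in set(detected_tags) if tag.lower() in caption_words)

# Toy image-text pairs standing in for real detector outputs and COCO captions.
pairs = [
    (["dog", "couch", "pillow"], "A dog sitting on a couch next to a pillow."),
    (["car", "tree"], "A person walking down the street."),
]
for k in (1, 2, 3):
    frac = sum(shared_object_count(tags, cap) >= k for tags, cap in pairs) / len(pairs)
    print(f"share >= {k} objects: {frac:.1%}")
```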

\"From

Figure 1: Illustration of the process by which Oscar represents an image-text pair in semantic space. (a) An example of an input image-text pair. (b) The object tags are used as anchor points to align image regions with the word embeddings of pre-trained language models. (c) The word semantic space is more representative than the image region feature space.

An example image-text pair is shown in Figure 1a. By utilizing a pre-trained object detector such as Faster R-CNN, the image can be represented as a set of visual region features, each of which is associated with an object tag. Accordingly, the sentence can be represented as a sequence of word embeddings using pre-trained language models such as BERT. Importantly, in Oscar we construct the representations of the object tags using their corresponding word embeddings from a pre-trained BERT.

As conceptually illustrated in Figure 1b, this explicitly couples images and sentences in a shared space, allowing objects to play the role of anchor points that align the semantics of vision and language. The word embedding space of BERT is semantically well structured after massive pure-text pre-training, which further provides a good initialization for the shared space. In this example, dog and couch are similar in the visual feature space because their regions overlap, but they are distinctive in the word embedding space, as illustrated in Figure 1c.
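The following sketch shows how such an input triple might be assembled, with object tags embedded through the same word-embedding table as the caption so that they live in the text semantic space. This is an illustrative reconstruction under assumptions (the class `OscarStyleInput`, the 2048-dimensional region features, and the random toy inputs are ours), not the released Oscar implementation.

```python
import torch
import torch.nn as nn

class OscarStyleInput(nn.Module):
    """Sketch only: pack (caption tokens, object tags, region features) into one
    sequence, embedding the tags with the *same* word-embedding table as the
    caption so that the tags act as anchor points in the text semantic space."""
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)  # shared by caption words and object tags
        self.region_proj = nn.Linear(region_dim, hidden)  # project visual features into the same space

    def forward(self, caption_ids, tag_ids, region_feats):
        words = self.word_emb(caption_ids)         # (B, T, hidden) caption tokens
        tags = self.word_emb(tag_ids)              # (B, K, hidden) object tags, same embedding table
        regions = self.region_proj(region_feats)   # (B, K, hidden) detector region features
        return torch.cat([words, tags, regions], dim=1)

# Toy usage with random ids/features standing in for a real tokenizer and detector.
model = OscarStyleInput()
caption_ids = torch.randint(0, 30522, (1, 12))
tag_ids = torch.randint(0, 30522, (1, 3))          # e.g. ids for "dog", "couch", ...
region_feats = torch.randn(1, 3, 2048)
print(model(caption_ids, tag_ids, region_feats).shape)  # torch.Size([1, 18, 768])
```

The key design choice illustrated here is that tags and caption words share one embedding table, so a detected "dog" tag starts out close to the word "dog" in the caption, giving the model an explicit bridge between the two modalities.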

Oscar learning pipeline

With object tags introduced as a new component, Oscar differs from existing VLP in two ways: