{"id":657576,"date":"2020-06-16T10:16:54","date_gmt":"2020-06-16T17:16:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=657576"},"modified":"2020-07-22T08:26:38","modified_gmt":"2020-07-22T15:26:38","slug":"learning-local-and-compositional-representations-for-zero-shot-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/learning-local-and-compositional-representations-for-zero-shot-learning\/","title":{"rendered":"Learning local and compositional representations for zero-shot learning"},"content":{"rendered":"\n


<p>In computer vision, one key property we expect of an intelligent model, agent, or algorithm is the ability to correctly recognize the type, or <em>class<\/em>, of the objects it encounters. This is critical in many real-world scenarios, from biomedicine, where a system might be tasked with distinguishing cancerous cells from healthy ones, to self-driving cars, where discriminating between pedestrians, other vehicles, and road signs is crucial to navigating roads safely.<\/p>\n\n\n\n

<p>Deep learning is one of the most significant tools behind state-of-the-art systems in computer vision, and its use has produced models that match or even exceed human-level performance on challenging real-world image classification tasks. Despite these successes, such models still have difficulty <em>generalizing<\/em>, that is, adapting to testing or deployment scenarios that don\u2019t closely resemble the tasks they were trained on. For example, a visual system trained under typical weather conditions in Northern California may fail to properly recognize pedestrians in Quebec because of differences in weather, clothing, demographics, and other features. Because it\u2019s difficult, if not impossible, to anticipate and collect all the data a model might encounter at deployment, there\u2019s a natural interest in evaluating classification performance when very few examples of the test classes are available, a setting captured by the framework of <em>few-shot learning<\/em>. <em>Zero-shot<\/em> learning (ZSL) goes a step further: no examples of the test classes are available during training. The model must instead rely on semantic information, such as attributes or text descriptions, associated with each class in order to correctly classify classes it has never seen.<\/p>\n\n\n\n\n\t
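<p>The zero-shot setting described above can be sketched in a few lines. This is a minimal illustration, not the method from the paper: the attribute table, its values, and the function name <code>classify_zero_shot<\/code> are all hypothetical, and the predicted attribute vector stands in for the output of a model trained only on seen classes.<\/p>

```python
import numpy as np

# Hypothetical class-attribute table (assumption, values made up).
# Columns: has_stripes, has_hooves, has_wings.
CLASS_ATTRIBUTES = {
    'horse': np.array([0.0, 1.0, 0.0]),  # seen during training
    'tiger': np.array([1.0, 0.0, 0.0]),  # seen during training
    'zebra': np.array([1.0, 1.0, 0.0]),  # unseen test class
    'eagle': np.array([0.0, 0.0, 1.0]),  # unseen test class
}


def classify_zero_shot(predicted_attributes, candidate_classes):
    # Pick the candidate class whose attribute vector has the highest
    # cosine similarity with the attributes predicted for the image.
    best_class, best_score = None, -np.inf
    for name in candidate_classes:
        target = CLASS_ATTRIBUTES[name]
        denom = np.linalg.norm(predicted_attributes) * np.linalg.norm(target)
        score = float(predicted_attributes @ target) / (denom + 1e-12)
        if score > best_score:
            best_class, best_score = name, score
    return best_class


# Stand-in for the output of an attribute predictor that was trained
# only on horses and tigers and is now shown a zebra.
predicted = np.array([0.9, 0.8, 0.1])
prediction = classify_zero_shot(predicted, ['zebra', 'eagle'])
```

<p>No zebra image is needed at training time in this sketch: the class is recognized only through its semantic description, which is the defining constraint of ZSL.<\/p>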


<p>Humans show a remarkable ability to adapt to unfamiliar situations. From a very young age, we\u2019re able to reason about new categories of objects by drawing on what we already know about related objects with similar attributes, parts, or properties. For example, upon seeing a zebra for the first time, a child might reason about it using her prior knowledge that stripes are a type of pattern and that a horse is an animal with similar characteristics and shape. This type of reasoning is intuitive and, we hypothesize, relies mainly on two key concepts: locality, loosely defined as dependence on local information, or small parts of the whole, and compositionality, loosely defined as combining simpler parts or other characteristics, such as color, to make sense of the new objects we encounter. In the paper \u201cLocality and Compositionality In Zero-Shot Learning,\u201d which was accepted to the eighth International Conference on Learning Representations (ICLR 2020), we demonstrate that representations that focus on compositionality and locality are better at zero-shot generalization. To put these notions into practice, we also introduce Class-Matching DIM (CMDIM), a variant of the popular unsupervised learning algorithm Deep InfoMax, which performs very strongly compared with a wide range of baselines.<\/p>\n\n\n\n

<figure><figcaption>Figure 1: The importance of locality and compositionality in contributing to good representations can be captured by how a child might come to understand what a zebra is from learned concepts and descriptions. If we come to identify a zebra as a striped horse, then stripes would be local information (a distinct part of the object), and the compositional aspect would be learning to combine knowledge we have about stripes with knowledge we have about a horse. This process is intuitive to humans and works very well in zero-shot learning.<\/figcaption><\/figure>\n\n\n\n

<h3>Exploring locality and compositionality<\/h3>\n\n\n\n

<p>In the field of representation learning, a locally aware representation can broadly be defined as one that retains local information. For example, in an image of a bird, relevant local information could be the beak, wings, feathers, tail, and so on, and a local representation might be one that encodes one or several of these parts, as well as their relative positions in the whole image. A representation is compositional if it can be expressed as a combination of representations of these important parts together with other important \u201cfacts\u201d about the image, such as color, background, and other environmental factors or even actions. However, it\u2019s difficult to determine whether a model is local or compositional without the help of human experts. To efficiently explore the role of these traits in learning good representations for zero-shot learning, we introduce proxies that rely on human annotations to measure these characteristics.<\/p>\n\n\n\n
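<p>One way to make the idea of a compositionality proxy concrete is sketched below. It is an illustrative assumption, not necessarily the measure used in the paper: each image representation is approximated as a sum of learned per-attribute embeddings, selected by human-annotated binary attributes, and the reconstruction error serves as the score. The function name <code>compositionality_error<\/code> and the synthetic data are hypothetical.<\/p>

```python
import numpy as np


def compositionality_error(representations, attributes):
    # Fit one embedding per attribute by least squares so that
    # attributes @ part_embeddings approximates the representations,
    # then return the mean squared residual. Lower values mean the
    # representations are closer to a sum of attribute parts.
    part_embeddings, *_ = np.linalg.lstsq(attributes, representations, rcond=None)
    reconstruction = attributes @ part_embeddings
    return float(np.mean((representations - reconstruction) ** 2))


rng = np.random.default_rng(0)

# Synthetic stand-in for human annotations: 200 images, 5 binary attributes.
attributes = rng.integers(0, 2, size=(200, 5)).astype(float)

# Representations built exactly as sums of per-attribute embeddings.
true_parts = rng.normal(size=(5, 16))
compositional = attributes @ true_parts

# Representations with no relationship to the attributes.
unstructured = rng.normal(size=(200, 16))

err_compositional = compositionality_error(compositional, attributes)
err_unstructured = compositionality_error(unstructured, attributes)
```

<p>In this toy setup the perfectly compositional representations are reconstructed almost exactly, while the unstructured ones leave a large residual. Real encoders fall somewhere in between, which is what makes such a score usable as a proxy.<\/p>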