{"id":657576,"date":"2020-06-16T10:16:54","date_gmt":"2020-06-16T17:16:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=657576"},"modified":"2020-07-22T08:26:38","modified_gmt":"2020-07-22T15:26:38","slug":"learning-local-and-compositional-representations-for-zero-shot-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/learning-local-and-compositional-representations-for-zero-shot-learning\/","title":{"rendered":"Learning local and compositional representations for zero-shot learning"},"content":{"rendered":"\n<p><\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_ICLR-5edfe7bb02533.gif\" alt=\"Graphic depicting how a model is trained to understand \u201czebra\u201d by learning the idea of \u201cstripes\u201d and \u201chorse\u201d\"\/><\/figure>\n\n\n\n<p>In computer vision, one key property we expect of an intelligent artificial model, agent, or algorithm is that it should be able to correctly recognize the type, or <em>class<\/em>, of objects it encounters. This is critical in numerous important real-world scenarios\u2014from biomedicine, where an intelligent system might be tasked with distinguishing between cancerous cells and healthy ones, to self-driving cars, where being able to discriminate between pedestrians, other vehicles, and road signs is crucial to successfully and safely navigating roads.<\/p>\n\n\n\n<p>Deep learning is one of the most significant tools for state-of-the-art systems in computer vision, and its use has resulted in models that reach or even exceed human-level performance in important and challenging real-world image classification tasks. Despite their successes, these models still have difficulty <em>generalizing<\/em>, or adapting to tasks in testing or deployment scenarios that don\u2019t closely resemble the tasks they were trained on. 
For example, a visual system trained under typical weather conditions in Northern California may fail to properly recognize pedestrians in Quebec because of differences in weather, clothes, demographics, and other features. As it\u2019s difficult to predict\u2014if not impossible to collect\u2014all the possible data that might be present at deployment, there\u2019s a natural interest in testing model classification performance under deployment scenarios in which very few examples of test classes are available, a scenario captured under the framework of <em>few-shot learning<\/em>. <em>Zero-shot<\/em> learning (ZSL) goes a step further: No examples of test classes are available when training. The model must instead rely on semantic information, such as attributes or text descriptions, associated with each class it encounters in training to correctly classify new classes.<\/p>\n\n\n\n
<p>Humans express a remarkable ability to adapt to unfamiliar situations. From a very young age, we\u2019re able to reason about new categories of objects by leveraging already existing information about related objects with similar attributes, parts, or properties. For example, upon being exposed to a zebra for the first time, a child might reason about it using her prior knowledge that stripes are a type of pattern and that a horse is an animal with similar characteristics and shape. This type of reasoning is intuitive and, we hypothesize, reliant mainly on two key concepts: locality, loosely defined as being dependent on local information, or small parts of the whole, and compositionality, arising from a combination of simpler parts or other characteristics, such as color, to determine the new objects we encounter. In the paper <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/locality-and-compositionality-in-zero-shot-learning\/\">\u201cLocality and Compositionality In Zero-Shot Learning,\u201d<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> which was accepted to the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/iclr.cc\/\">eighth International Conference on Learning Representations (ICLR2020)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we demonstrate that representations that focus on compositionality and locality are better at zero-shot generalization. 
Considering how to apply these notions in practice to improve zero-shot learning performance, we also introduce Class-Matching DIM (CMDIM), a variant of the popular unsupervised learning algorithm Deep InfoMax, which results in very strong performance compared to a wide range of baselines.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"946\" height=\"651\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-1.jpg\" alt=\"Figure 1: Two pieces of training data\u2014an image of a black-and-white striped pattern labeled \u201cstripes\u201d and an image of a horse labeled \u201chorse\u201d\u2014above testing data. The testing data is shown to consist of the sentence \u201cA \u2018zebra\u2019 is a striped horse,\u201d a piece of semantic information on the class \u201czebra,\u201d enclosed in a pink box and an image of a zebra. The two are associated with a plus sign, and an arrow leads from the two to the inference that the image is \u201czebra.\u201d\" class=\"wp-image-657579\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-1.jpg 946w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-1-300x206.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-1-768x529.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-1-800x550.jpg 800w\" sizes=\"auto, (max-width: 946px) 100vw, 946px\" \/><figcaption>Figure 1: The importance of locality and compositionality in contributing to good representations can be captured by how a child might come to understand what a zebra is from learned concepts and descriptions. 
If we come to identify a zebra as a striped horse, then stripes would be local information\u2014a distinct part of the object\u2014and the compositional aspect would be learning to combine knowledge we have about stripes with knowledge we have about a horse. This process is intuitive to humans and works very well in zero-shot learning.<\/figcaption><\/figure>\n\n\n\n<h3 id=\"exploring-locality-and-compositionality\">Exploring locality and compositionality<\/h3>\n\n\n\n<p>In the field of representation learning, a locally aware representation can broadly be defined as one that retains local information. For example, in an image of a bird, relevant local information could be the beak, wings, feathers, tail, and so on, and a local representation might be one that encodes one or some of these parts, as well as their relative position in the whole image. A representation is compositional if it can be expressed as a combination of representations of these important parts, but also other important \u201cfacts\u201d about the image, such as color, background, and other environmental factors or even actions. However, it\u2019s difficult to determine whether a model is local or compositional without the help of human experts. To efficiently explore the role of these traits in learning good representations for zero-shot learning, we introduce proxies reliant on human annotations to measure these characteristics.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>We use supervised parts classification as a proxy for locality: On top of a representation, we train a parts localization module that tries to predict where the important parts are in the image and measure the module\u2019s performance <em>without backpropagating through the encoder<\/em>. We then use the resulting classification F1 score as a proxy for locality. 
The core idea here is that if we\u2019re able to correctly identify where a part is located\u2014and where it\u2019s not\u2014the model must be encoding information on local structure.<\/li><li>For compositionality, we rely on the TRE ratio, a modification of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/pdf\/1902.07181.pdf\">tree reconstruction error (TRE)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The TRE ratio measures how a representation differs from a perfectly compositional one according to a simple linear model. Rather than simply consider the TRE, we considered the ratio of the TRE computed with the actual attributes and the TRE computed with random attributes. This normalization makes it easier to compare different families of models, some of which are inherently more decomposable according to any sets of attributes.<\/li><\/ul>\n\n\n\n<p>Using the above proxies, in addition to others, as a method of evaluation, we analyze locality and compositionality in encoders trained using a diverse set of representation learning methods:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>fully supervised classifiers trained from scratch (FC)<\/li><li>unsupervised generative methods: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1312.6114\">variational autoencoders (VAEs)\/beta-VAEs<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-representations-by-maximizing-mutual-information-across-views\/\">adversarial autoencoders (AAEs)<\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1406.2661\">generative adversarial networks (GANs)<\/a><\/li><li>mutual-information based methods: <a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-deep-representations-by-mutual-information-estimation-and-maximization\/\">Deep InfoMax (DIM)<\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-representations-by-maximizing-mutual-information-across-views\/\">Augmented Multiscale DIM (AMDIM)<\/a><\/li><\/ul>\n\n\n\n<p>In addition to existing methods, we created the mutual-information based method CMDIM, for which positive samples, or good examples, are drawn from the set of images of the same class. Analyzing these representation learning methods lets us evaluate how well each one \u201cscores\u201d with respect to locality and compositionality.<\/p>\n\n\n\n<h3 id=\"zero-shot-learning-from-scratch\">Zero-shot learning from scratch<\/h3>\n\n\n\n<p>To tie all of this to generalization, we evaluate each of these models on the downstream task of zero-shot learning. However, because state-of-the-art ZSL in computer vision relies heavily on pre-training on large-scale datasets like ImageNet, it\u2019s difficult to isolate the role that locality and compositionality play, as fundamental representation learning principles, in ZSL performance.<\/p>\n\n\n\n<p>As such, we introduce a stricter ZSL setting defined as <em>zero-shot learning from scratch<\/em> (ZSL-FS), where we don\u2019t use pre-trained models and instead rely on only the data in the training set to train an encoder. 
We use this setting for multiple reasons: It enables us to focus on the question of whether the representation learned by an encoder is robust to the ZSL setting, as well as extends the insights of our paper to settings in which pre-trained encoders don\u2019t exist or result in poor performance, such as in the field of medical imaging or audio signals.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"328\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-2-1024x328.png\" alt=\"Three separate scatter plots show the relationship between ZSL accuracy (on the y-axis) and the TRE ratio (on the x-axis) for three datasets (from left to right): CUB, AwA2, and SUN. On the right of the scatter plots is a key identifying the models and color associated with each: AAE (blue), AMDIM (orange), CMDIM p=1 (green), DIM (red), FC (purple), VAE (brown), and beta-VAE (pink). In each plot, a solid blue line extends diagonally from top to bottom between the plotted points, designating the inverse correlation between ZSL Accuracy and TRE Ratio. Lighter blue shading along each line indicates variance, with the largest variance being shown for the SUN dataset.\" class=\"wp-image-657588\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-2-1024x328.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-2-300x96.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-2-768x246.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-Fig-2.png 1431w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 2: There is a strong link between TRE ratio, which measures the compositionality of a representation, and zero-shot learning accuracy across encoders and datasets used in the study. 
The lower the TRE ratio\u2014that is, the more compositional the representation\u2014the better the accuracy. The relationship between the TRE ratio and ZSL accuracy was found to be more direct for CUB and AwA2, datasets for which attributes are strongly relevant to the image. The correlation is weaker for the SUN dataset, whose attributes carry less semantic meaning because per-instance attributes are averaged across classes. Each model was trained with encoders of varying sizes, as indicated by the multiple plot points for each.<\/figcaption><\/figure>\n\n\n\n<h3 id=\"the-results-locality-compositionality-and-improved-zsl-accuracy\">The results: Locality, compositionality, and improved ZSL accuracy<\/h3>\n\n\n\n<p>As shown in Figure 2 above, there is a very strong link between zero-shot learning accuracy and TRE ratio that holds across encoders and datasets. We used three datasets: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/www.vision.caltech.edu\/visipedia\/CUB-200-2011.html\">Caltech-UCSD Birds-200-2011 (CUB)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1707.00600\">Animals with Attributes 2 (AwA2)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/ieeexplore.ieee.org\/document\/6247998\">SUN Attribute<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Notably, the correlation is weaker for the SUN dataset, for which the attributes carry less semantic meaning (being the result of averaging per-instance attributes across classes).<\/p>\n\n\n\n<p>While the TRE ratio focuses on implicit compositionality, as measured by a simple linear model, we can also consider the case of an explicitly compositional model. 
Such a model is compositional by construction: It first learns part representations and then combines them. To investigate this case, we run a second set of experiments that compares a model averaging part representations (the parts are local patches of the image) with a model averaging predictions (an ensemble). The explicitly compositional model outperforms the non-compositional one across model families.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"988\" height=\"766\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-fig-3.png\" alt=\"A scatter plot shows the relationship between ZSL Accuracy (on the y-axis) and Parts F1 Score (on the x-axis) for the encoders trained in the study, each represented by a different color. Each encoder is plotted with and without a local loss, indicated by an \u201cx\u201d and a dot, respectively, with a line connecting the two to show change in Parts F1 Score. A dotted blue line extends diagonally from bottom to top between the plotted points, representing the interpolation.\" class=\"wp-image-657591\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-fig-3.png 988w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-fig-3-300x233.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/ICLR-fig-3-768x595.png 768w\" sizes=\"auto, (max-width: 988px) 100vw, 988px\" \/><figcaption>Figure 3: Parts F1 score for the models, trained on the CUB dataset with a DCGAN-based encoder, plotted against ZSL accuracy. There\u2019s a clear relationship between the two: Representations that have a good understanding of local information (as measured by the parts F1 score) perform better in zero-shot learning. 
The addition of a loss emphasizing locality increases parts F1 score for almost all models (it decreases the score for AAE). This improves generalization for all models except for the reconstruction-based methods, AAE, beta-VAE, and VAE.<\/figcaption><\/figure>\n\n\n\n<p>Concerning locality, there\u2019s also a clear relationship between parts F1 score and zero-shot learning accuracy. The better an encoder\u2019s understanding of local information is, indicated by a higher parts F1 score, the better its ZSL performance. This relationship breaks down for reconstruction-based models (AAEs and VAEs, in our case), which seem to focus on capturing pixel-level information rather than semantic information. We used a visualization technique based on mutual information heat maps to estimate where the encoder focuses. The technique revealed that AAEs and VAEs, contrary to the other families of models, have trouble finding semantically relevant parts of an image, such as wings or the contour of the bird, and instead focus on the whole image.<\/p>\n\n\n\n<p>In conclusion, these findings around the relationship between accuracy and locality and compositionality will hopefully provide researchers with a more principled approach to zero-shot learning, one that focuses on these concepts when designing new methods. In future work, we aim to investigate how locality and compositionality impact other zero-shot tasks, such as zero-shot semantic segmentation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In computer vision, one key property we expect of an intelligent artificial model, agent, or algorithm is that it should be able to correctly recognize the type, or class, of objects it encounters. 
This is critical in numerous important real-world scenarios\u2014from biomedicine, where an intelligent system might be tasked with distinguishing between cancerous cells and [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":667371,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-657576","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[650565],"related-researchers":[{"type":"guest","value":"tristan-sylvain","user_id":"657597","display_name":"Tristan  Sylvain","author_link":"<a href=\"https:\/\/mila.quebec\/en\/person\/sylvain-tristan\/\" aria-label=\"Visit the profile page for Tristan  Sylvain\">Tristan  Sylvain<\/a>","is_active":true,"last_first":"Sylvain, Tristan ","people_section":0,"alias":"tristan-sylvain"},{"type":"guest","value":"linda-petrini","user_id":"667353","display_name":"Linda  Petrini","author_link":"Linda  Petrini","is_active":true,"last_first":"Petrini, Linda ","people_section":0,"alias":"linda-petrini"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-960x540.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-1536x865.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-2048x1153.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/1400x788_NoLogo_Iclr_Still-01-1920x1080.png 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"https:\/\/mila.quebec\/en\/person\/sylvain-tristan\/\" title=\"Go to researcher profile for Tristan  Sylvain\" aria-label=\"Go to researcher profile for Tristan  Sylvain\" data-bi-type=\"byline author\" 
data-bi-cN=\"Tristan  Sylvain\">Tristan  Sylvain<\/a>, Devon Hjelm, and Linda  Petrini","formattedDate":"June 16, 2020","formattedExcerpt":"In computer vision, one key property we expect of an intelligent artificial model, agent, or algorithm is that it should be able to correctly recognize the type, or class, of objects it encounters. This is critical in numerous important real-world scenarios\u2014from biomedicine, where an intelligent&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/657576","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=657576"}],"version-history":[{"count":11,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/657576\/revisions"}],"predecessor-version":[{"id":677409,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/657576\/revisions\/677409"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/667371"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=657576"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=657576"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=657576"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\
/wp\/v2\/research-area?post=657576"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=657576"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=657576"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=657576"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=657576"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=657576"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=657576"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=657576"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}