# Deep InfoMax: Learning good representations through mutual information maximization

Microsoft Research Blog | May 1, 2019
As researchers continue to apply machine learning to more complex real-world problems, they'll need to rely less on algorithms that require annotation. This is not only because labels are expensive, but also because supervised learners trained only to predict annotations tend not to generalize beyond the structure in the data necessary for the given task. For instance, a neural network trained to classify images tends to do so based on texture that correlates with the class label rather than on shape or size, which may limit the suitability of the classifier in test settings. This issue, among others, is a core motivation for unsupervised learning of good representations.

Learning good representations without relying on annotations has been a long-standing challenge in machine learning. Our approach, which we call Deep InfoMax (DIM), does so by learning a predictive model of localized features of a deep neural network. The work is presented at the 2019 International Conference on Learning Representations (ICLR).

DIM (code on GitHub) is based on two learning principles: mutual information maximization, in the vein of the infomax optimization principle, and self-supervision, an important unsupervised learning method that relies on intrinsic properties of the data to provide its own annotation. DIM is flexible, simple to implement, and incorporates a task we call *self-prediction*.

DIM draws inspiration from the infomax principle, a guideline for learning good representations by maximizing the mutual information between the input and output of a neural network. In this setting, the mutual information is defined as the KL divergence between the joint distribution (all inputs paired with their corresponding outputs) and the product-of-marginals distribution (all possible input/output pairs).
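To make that definition concrete, here is a minimal NumPy sketch that computes the mutual information of a toy discrete joint distribution as the KL divergence between the joint and the product of its marginals. The distribution is invented purely for illustration; it is not from the paper.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
# Rows index x, columns index y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

# Marginals p(x) and p(y).
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Product-of-marginals distribution p(x) * p(y): every possible pairing,
# as if x and y were independent.
prod = np.outer(p_x, p_y)

# Mutual information I(X; Y) = KL(joint || product of marginals), in nats.
mi = np.sum(joint * np.log(joint / prod))
print(mi)  # positive, since x and y are correlated
```

When the joint equals the product of marginals (x and y independent), this quantity is exactly zero; the more the pairing matters, the larger it grows.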
### Mutual information estimation: Does this pair belong together?

While intuitive, the infomax principle has had limited success with deep networks, partly because mutual information is difficult to estimate when the input is high-dimensional and/or the representation is continuous.

The recently introduced Mutual Information Neural Estimator (MINE) trains a neural network to maximize a lower bound on the mutual information. MINE works by training a discriminator to distinguish samples from the joint distribution, also known as positive samples, from samples from the product-of-marginals distribution, also known as negative samples. For a neural network encoder, the discriminator in MINE is tasked with answering the following question: does this pair, the input and its output representation, belong together?

DIM borrows this idea from MINE, using the gradients from a discriminator to help train the encoder network. This is similar to how the generator learns in generative adversarial networks (GANs), except that the encoder makes the task easier for the discriminator, not harder. In addition, we don't rely on the KL-based discriminator from MINE, as it works less well in practice than discriminators that use the Jensen-Shannon divergence (JSD) or infoNCE, an estimator used by Contrastive Predictive Coding (CPC).

Unfortunately, training an encoder only to maximize the mutual information between the input and output will yield representations that contain trivial or "noisy" information from the input. For example, in the cat picture below, there are many locations, or patches, from which a neural network could extract information that would increase the mutual information during optimization.
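To illustrate the discriminator view, here is a small NumPy sketch of an infoNCE-style objective. It stands in for a real encoder with synthetic features: matched input/representation pairs sit on the diagonal of a score matrix, every mismatched pairing acts as a negative sample from the product of marginals, and a simple dot-product scorer plays the role of the discriminator. All names, shapes, and the scoring function are illustrative assumptions, not DIM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder outputs: a batch of B "inputs" x paired with
# representations y that are correlated with them (illustrative only).
B, D = 8, 16
x = rng.normal(size=(B, D))
y = x + 0.1 * rng.normal(size=(B, D))

# A dot-product "discriminator": scores[i, j] answers the question
# "does the pair (x_i, y_j) belong together?" Diagonal entries are
# positive samples (joint); off-diagonal entries are negative samples
# (product of marginals).
scores = x @ y.T

# infoNCE: each row is a B-way classification problem whose correct
# answer is the matching (diagonal) representation. Computed with a
# numerically stable log-softmax.
m = scores.max(axis=1, keepdims=True)
log_softmax = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
info_nce = -np.mean(np.diag(log_softmax))
print(info_nce)  # small when positives are easy to pick out
```

Minimizing this loss pushes matched pairs to score higher than mismatched ones; the bound on the mutual information it yields is capped at log B, which is one reason contrastive estimators benefit from many negative samples.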
### Learning shared information