# Deep InfoMax: Learning good representations through mutual information maximization

Microsoft Research Blog | May 1, 2019
As researchers continue to apply machine learning to more complex real-world problems, they'll need to rely less on algorithms that require annotation. This is not only because labels are expensive, but also because supervised learners trained only to predict annotations tend not to generalize beyond the structure in the data necessary for the given task. For instance, a neural network trained to classify images tends to do so based on texture that correlates with the class label rather than on shape or size, which may limit the suitability of the classifier in test settings. This issue, among others, is a core motivation for unsupervised learning of good representations.

Learning good representations without relying on annotations has been a long-standing challenge in machine learning. Our approach, which we call Deep InfoMax (DIM), does so by learning a predictive model of localized features of a deep neural network. The work is presented at the 2019 International Conference on Learning Representations (ICLR).

DIM (code on GitHub) is based on two learning principles: mutual information maximization, in the vein of the infomax optimization principle, and self-supervision, an important unsupervised learning method that relies on intrinsic properties of the data to provide its own annotation. DIM is flexible, simple to implement, and incorporates a task we call *self-prediction*.

DIM draws inspiration from the infomax principle, a guideline for learning good representations by maximizing the mutual information between the input and output of a neural network. In this setting, the mutual information is defined as the KL divergence between the joint distribution (all inputs paired with their corresponding outputs) and the product-of-marginals distribution (all possible input/output pairs).
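To make that definition concrete, here is a minimal NumPy sketch that computes the mutual information of a toy discrete joint distribution as the KL divergence between the joint and the product of its marginals. The distribution is invented purely for illustration; it is not from the paper.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables.
# Rows index x, columns index y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

# Marginals p(x) and p(y).
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Product-of-marginals distribution p(x) * p(y): every possible pairing,
# as if x and y were independent.
prod = np.outer(p_x, p_y)

# Mutual information I(X; Y) = KL(joint || product of marginals), in nats.
mi = np.sum(joint * np.log(joint / prod))
print(mi)  # positive, since x and y are correlated
```

When the joint equals the product of marginals (x and y independent), this quantity is exactly zero; the more the pairing matters, the larger it grows.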
### Mutual information estimation: Does this pair belong together?

While intuitive, the infomax principle has had limited success with deep networks, partly because mutual information is difficult to estimate when the input is high-dimensional and/or the representation is continuous.

The recently introduced Mutual Information Neural Estimator (MINE) trains a neural network to maximize a lower bound on the mutual information. MINE works by training a discriminator to distinguish samples from the joint distribution, also known as positive samples, from samples from the product-of-marginals distribution, also known as negative samples. For a neural network encoder, the discriminator in MINE is tasked with answering the following question: does this pair, the input and its output representation, belong together?

DIM borrows this idea from MINE, using the gradients from a discriminator to help train the encoder network. This is similar to how the generator learns in generative adversarial networks (GANs), except that the encoder makes the task easier for the discriminator, not harder. In addition, we don't rely on the KL-based discriminator from MINE, as it works less well in practice than discriminators that use the Jensen-Shannon divergence (JSD) or infoNCE, an estimator used by Contrastive Predictive Coding (CPC).

Unfortunately, training an encoder only to maximize the mutual information between the input and output will yield representations that contain trivial or "noisy" information from the input. For example, in the cat picture below, there are many locations, or patches, from which a neural network could extract information that would increase the mutual information during optimization.
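To illustrate the discriminator view, here is a small NumPy sketch of an infoNCE-style objective. It stands in for a real encoder with synthetic features: matched input/representation pairs sit on the diagonal of a score matrix, every mismatched pairing acts as a negative sample from the product of marginals, and a simple dot-product scorer plays the role of the discriminator. All names, shapes, and the scoring function are illustrative assumptions, not DIM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder outputs: a batch of B "inputs" x paired with
# representations y that are correlated with them (illustrative only).
B, D = 8, 16
x = rng.normal(size=(B, D))
y = x + 0.1 * rng.normal(size=(B, D))

# A dot-product "discriminator": scores[i, j] answers the question
# "does the pair (x_i, y_j) belong together?" Diagonal entries are
# positive samples (joint); off-diagonal entries are negative samples
# (product of marginals).
scores = x @ y.T

# infoNCE: each row is a B-way classification problem whose correct
# answer is the matching (diagonal) representation. Computed with a
# numerically stable log-softmax.
m = scores.max(axis=1, keepdims=True)
log_softmax = scores - m - np.log(np.exp(scores - m).sum(axis=1, keepdims=True))
info_nce = -np.mean(np.diag(log_softmax))
print(info_nce)  # small when positives are easy to pick out
```

Minimizing this loss pushes matched pairs to score higher than mismatched ones; the bound on the mutual information it yields is capped at log B, which is one reason contrastive estimators benefit from many negative samples.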
### Learning shared information