{"id":658404,"date":"2020-05-13T03:57:57","date_gmt":"2020-05-13T10:57:57","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=658404"},"modified":"2020-06-18T07:30:10","modified_gmt":"2020-06-18T14:30:10","slug":"diving-into-deep-infomax-with-dr-devon-hjelm","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/podcast\/diving-into-deep-infomax-with-dr-devon-hjelm\/","title":{"rendered":"Diving into Deep InfoMax with Dr. Devon Hjelm"},"content":{"rendered":"
Dr. Devon Hjelm (opens in new tab)<\/span><\/a> is a senior researcher at the Microsoft Research lab in Montreal (opens in new tab)<\/span><\/a>, and today, he joins me to dive deep into his research on Deep InfoMax (opens in new tab)<\/span><\/a>, a novel self-supervised learning approach to training AI models \u2013 and getting good representations \u2013 without human annotation. He also tells us how an interest in neural networks, first human and then machine, led to an inspiring career in deep learning research.<\/p>\n Devon\u00a0Hjelm:\u00a0The key thing that we walked away with, with Deep\u00a0InfoMax, was that we don\u2019t really care about estimating mutual information, we don\u2019t care about the number that corresponds to how dependent things are, we just want a model that understands whether or not there\u2019s more or less mutual information so that we can use that number as a learning signal to train the encoder.<\/p>\n Host:\u00a0<\/b>You\u2019re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I\u2019m your host, Gretchen Huizinga.<\/b><\/p>\n Host:\u00a0<\/b>Dr. Devon\u00a0<\/b>Hjelm<\/b>\u00a0is a\u00a0<\/b>senior\u00a0<\/b>researcher at the Microsoft Research lab in Montreal, and today, he joins me to dive deep into his research on Deep\u00a0<\/b>InfoMax<\/b>, a novel self-supervised learning approach to training AI models \u2013 and getting good representations \u2013 without human annotation. He also tells us how an interest in neural networks, first human and then machine, led to an inspiring career in deep learning research. That and much more on this episode of the Microsoft Research Podcast.<\/b><\/p>\n Host: Devon\u00a0<\/b>Hjelm<\/b>, welcome to the podcast.<\/b><\/p>\n Devon\u00a0Hjelm: Thank you. Glad to be here.<\/p>\n Host: So, you are a\u00a0<\/b>senior<\/b>\u00a0researcher who\u2019s deep into deep learning at the MSR Lab in Montreal.\u00a0<\/b>So I\u2019ve had several of your colleagues on the show over the last couple of years and we\u2019ve talked about different flavors and different approaches to machine learning\u00a0<\/b>and<\/b>\u00a0learning machines, but today I want to hear your take on what you<\/b>\u2019<\/b>re all up to there.\u00a0<\/b>What\u2019s the big goal of your lab, and has it changed over the past couple of years at all, or grown more nuanced given new discoveries and advances in the research?<\/b><\/p>\n Devon\u00a0Hjelm: Well, yeah, so, the lab is relatively new.\u00a0It\u2019s only been under\u00a0Microsoft or\u00a0MSR for like two or three years now, and\u00a0the lab is also\u00a0fairly diverse. It started from a background of like machine reading comprehension and language understanding, trying to build like tools based on language and knowledge graphs and stuff like that for people to moving to Montreal and just basically becoming part of the ecosystem there. Incorporating more deep learning, incorporating things like fairness and FATE. Its mission is very much still focused on empowering people through research and compute and stuff like that.<\/p>\n Host: Right. So how would you define sort of the big audacious goal of the work you\u2019re doing in Montreal?\u00a0<\/b><\/p>\n Devon\u00a0Hjelm: So, the team that I\u2019m part of, we\u2019re kind of like a deep learning camp, I guess. We\u2019re the people who really focus on using these very large deep neural networks. 
And so, the core idea that we're really focused on is, how do we use these big models to help empower people – give them interesting new, useful tools that improve their lives? Almost everything we're doing assumes that we're going to be using deep learning, or deep neural networks, to do this, because, over the last decade or so, we've seen a tremendous explosion of utility from models based on deep learning.

Host: Right.

Devon Hjelm: And we anticipate that will continue to be the case.

Host: Well, let's talk more specifically about what you're investigating personally and why you think it's important. Give us the Virtual Earth 3D snapshot of your research interests and what they bring to the broader field of machine learning. What gets Devon Hjelm up in the morning?

Devon Hjelm: When you're using a large-scale model to produce something useful for people in the world, the model is taking some data, usually complex and high-dimensional, that's coming from the real world, transforming it in some way, and then, from that transformation, producing utility. So, for one example, you can imagine a self-driving car. It's exposed to a camera video feed, and from that feed it builds an understanding of all the different objects it sees in its view – for instance, different cars, different people and so on. And from there, it makes a decision where to drive, so that it successfully navigates you down the road without any catastrophic accidents. So, the intermediate step in between is: what is the product of the processing of that big network that leads to the good performance? In the case of a self-driving car, you need a visual system that's able to identify what all the objects are, what they're doing, what their velocities might be, so I can make good decisions on whether I want to, you know, turn or go straight or slam on the brakes. I'm really interested in how we arrive at those good – what we call – representations of the world from high-dimensional data.

Host: Well, let's rewind for a minute, because where you've been has influenced where you are today. You did a postdoc under Yoshua Bengio, who's a bona fide Turing Award winner and one of the godfathers of deep learning. He was also one of the founders – or the founder – of the Montreal Institute for Learning Algorithms, or MILA. And I know you are still collaborating with Dr. Bengio today. But talk a little bit about what you were working on during your postdoc days, and how that work has evolved and informed what you're working on today.

Devon Hjelm: Yeah, so I've always been strongly influenced by Yoshua and also the general camp he's centered in, which is the whole deep learning camp. Yoshua has always been strongly involved with generative models and representation learning and unsupervised learning, so it was a natural fit for me to do a postdoc over there.
So, while I was there, I focused on generative adversarial networks, also called GANs, and this work naturally led into mutual information estimation, because there are a lot of parallels, or similarities, between how a generative adversarial network learns to generate data and how you might estimate mutual information.

Host: Mmm-hmm.

Devon Hjelm: And then this ultimately led into learning representations using mutual information estimation.

Host: All right. I want to go back a little bit, because you mentioned a "camp," and if I understand that, it's people getting together and saying, this is what I believe, this is my worldview of deep learning, as opposed to another worldview of deep learning. So, can you differentiate what the difference is there?

Devon Hjelm: I mean, ultimately, everybody is interested in the general problem space that I described initially, which is, how do you take complex real-world data and do useful things with it? How do you plan? How do you reason? Things like that. But a key component of this is, how do you process, or how do you perceive, the world? Until deep learning appeared, the field wasn't having tremendous success processing very high-dimensional data coming from vision or natural language, and when you look at the high-level view of what it means to process complex data and do useful things with it, different people focus on different parts of that. So, for instance, there's a whole field of people who basically take features given by very complex neural networks and figure out how to reason on top of those. But there are also people who believe that, however we perceive the world, perceptions should be packed into symbols that resemble formal logic, and that we need to be using those sorts of things if we really want to talk about solving these really hard problems. The deep learning camp sort of defaults to the idea: well, we're just going to throw the whole thing, end-to-end, at the problem – train the whole thing end-to-end with backpropagation, throwing as much data as possible at it. And you know, it's worked really, really well, and it continues to be one of the factors that drives us forward.

Host: Well, I think it's interesting, because it does affect, you know, your choice in research – what direction you're going and how you're going to run at the hill.

Devon Hjelm: Yeah, and one of the consequences of having to do things end-to-end is that it's extremely expensive. So, it's actually becoming more and more difficult. Back in the day, people were working with static image datasets that were small, like 32×32 pixels, and this has slowly expanded and exploded to very, very large datasets.
People are working with video now and, with that, if you want to do things end-to-end, the compute cost goes up, it becomes more difficult to run thousands of trials of similar models to see which ones work better, and it becomes harder for smaller research groups to do and more of a thing that's done by the very biggest players.

Host: Well, we've been – and by we, I mean you – have been working on AI for more than fifty years now…

Devon Hjelm: Right.

Host: …and there's been some amazing progress in the field, especially in deep learning when it comes to performing easily definable and discrete tasks, but when it comes to performing tasks in complex real-world situations, we are, as you say, still very far from solving AI. So, in broad strokes – and I want you to stay kind of high-level here – what's the big problem, and what's holding machines back IRL?

Devon Hjelm: The way that I see it, one of the biggest challenges we're facing right now, after the surge in deep learning, is generalization. This is the ability of a model, given that it's been trained in a certain way, to perform well in a different situation. And this is really important, because it's somewhere between very difficult and impossible to collect all the data you would need to resemble the test environment. For instance, imagine the self-driving car scenario. It's very expensive to try to train a visual system under all possible road conditions, at all times of day, at all locations on earth. And these models do have a tendency, if you are not careful, to totally fail when you present them with new combinations of data like that.

Host: Mmm-hmm.

Devon Hjelm: So, if I only train in, you know, northern California, and I transfer to Quebec in the middle of winter, there are things about those systems that will fail. And in addition to that, if we want these things to work with humans, who are notoriously good at expressing unique, hard-to-model behavior, our models have to be pretty good at generalizing to that behavior to actually be useful. Otherwise they will only be useful to a subset of the population, and that's not what we really strive for.

Host: Well, one of the thorniest long-standing challenges for ML researchers is learning good representations without annotation. And this is part of the expense problem, right – labeling data and so on. So, what's wrong, in your opinion, with the annotation model and the learning algorithms behind it, and what kinds of learning algorithms do you think we need to take us into a new ML future?

Devon Hjelm: There are a couple of different things, I suppose. If you're going to train a model in the standard supervised setting – suppose I'm given, like I said before, self-driving car data, and somebody annotates the position and the class of every single object in the visual scene, and I'm able to train a model on this end-to-end – you know, it could arrive at a pretty good representation that I might be able to do some planning on. But that annotation alone is very difficult to do.
But on top of that, it's difficult to say whether any particular annotation is useful for a general task. Take some scene that's happening, some video scene or something like that, and you're trying to describe what's going on – how would a human describe what's going on? The description might only capture a fraction of what is really going on, and a model trained that way might only be useful for certain tasks and not for others.

(music plays)

Host: When we're talking about learning good representations, that's one of the nuggets of what you're after, right? So, let's get specific and talk about how you're taking a run at this good-representation hill. Last year you presented a paper at ICLR that outlines an approach you call Deep InfoMax. So, tell us about Deep InfoMax, and start kind of "writ large." What is it, and what are the learning principles on which it's based? We'll get real specific and technical in a second here, but give us the big picture.

Devon Hjelm: Sure, sure. So, at the high level, it's a type of model that learns representations in an unsupervised way, that is, without labels that a human needs to define ahead of time. It's also what's being called a self-supervised model. This is a model where, instead of the task being designed by a human – in the sense that the labels or targets it's trying to predict come from something like the class "cat" or "dog" – it generates its own labels by basically playing around with the statistics of the data. There are two core themes behind Deep InfoMax. One is that you're given a bunch of data that has some structure, like patches: I can extract patches from images. I can present the model with the whole image or with parts of it, and I can basically ask the model, can you tell whether or not these things go together?

Host: Yeah.

Devon Hjelm: And it's just basically a two-way game, just yes or no.

Host: Okay.

Devon Hjelm: Does it go together? So, this is one part of it. The other part is the actual function that you use to train this thing. There has to be a number that the model outputs that tells you how well it's doing, and that is the mutual information estimation and maximization piece.

Host: Okay.
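To make the yes-or-no game concrete, here is a minimal sketch, in PyTorch, of how such paired training examples could be built from a batch of images. The patch size and the trick of drawing mismatched images from a shuffled batch are illustrative assumptions for this sketch, not details taken from the episode or the paper.

import torch

def make_pairs(images, patch=8):
    """Build (patch, image, label) examples for a yes/no pairing game.

    images: a batch of images with shape (B, C, H, W).
    Returns patches paired with images and a 1/0 label meaning
    "goes together" / "does not go together".
    """
    B, C, H, W = images.shape
    # Crop one random patch from each image.
    ys = torch.randint(0, H - patch + 1, (B,))
    xs = torch.randint(0, W - patch + 1, (B,))
    patches = torch.stack([
        images[i, :, ys[i]:ys[i] + patch, xs[i]:xs[i] + patch]
        for i in range(B)
    ])
    # Positives: each patch with the image it came from (label 1).
    # Negatives: the same patch with a shuffled image (label 0).
    # A real implementation would avoid the rare case where the
    # permutation pairs an image with itself.
    shuffled = images[torch.randperm(B)]
    pair_patches = torch.cat([patches, patches])
    pair_images = torch.cat([images, shuffled])
    labels = torch.cat([torch.ones(B), torch.zeros(B)])
    return pair_patches, pair_images, labels

A discriminator trained on these labels is playing exactly the two-way, yes-or-no game Devon describes.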
Devon Hjelm: When you present a model with, say, two different sets of pairs, and you ask it to differentiate between the two, this effectively forces the model to learn something about the mutual information between the things that go together, because you are encouraging it to understand the dependencies, or the relationships, among the things that go together. So, for instance, if I give you a bunch of different pictures of patches of the same cat, you have to understand a little bit of the structure of a cat. And so, these things are dependent, or related, in the sense that they all eventually compose the same thing: a cat.

Host: Mmm-hmm.

Devon Hjelm: But if I present other things and basically say, you should be able to tell that these things go together – as opposed to, say, patches that come from cats and dogs – it forces the model to learn that these things are related.

Host: Let's unpack Deep InfoMax technically on several levels. And start with a critical question that I think you borrowed from Sesame Street: can you tell which thing is not like the other? You're addressing this with the question, does this pair belong together? How do you do that by both borrowing from and diverging from the technical approaches in the Mutual Information Neural Estimator, as you call it, or MINE?

Devon Hjelm: So, MINE, at its core, is meant to estimate mutual information. And mutual information is this quantity that expresses how related two different random variables are, or how related two different sets of random variables are.

Host: Right.

Devon Hjelm: And it's an extremely important quantity, because being able to tell how related things are can help with prediction and all sorts of other important downstream tasks. But it's also a notoriously difficult quantity to estimate. If you have very high-dimensional data that's continuous, say images or language, there traditionally hasn't been any straightforward way to estimate it, because you have to do this infinite integral over distributions that you don't necessarily know. This is where neural networks come in. Neural networks, and in particular GANs, have this ability to estimate log ratios of probabilities…

Host: Mmm-hmm.

Devon Hjelm: …without actually needing to know the structure of the distribution. And what I mean by the structure of the distribution is, we don't know whether it's Gaussian or, you know, Poisson-distributed or whatever. But GANs estimate these log ratios and then use them to train their generator function.

Host: Mmm-hmm.

Devon Hjelm: So, if you look at the mutual information, it's just a divergence – a difference between two distributions. One is the joint distribution of two variables, and the other is the product of marginals. The joint distribution is basically the probability that the two things co-occur, and the product of marginals is the probability that they occur independently of each other. And the way that GANs do this estimation is, you just draw samples from the two distributions and train a discriminator. A discriminator is just a classifier. You present samples from one distribution, present samples from the other distribution, and you ask, does this belong to one distribution or the other?

Host: Okay.
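For reference, the divergence Devon is describing is the textbook definition of mutual information, and the classifier trick recovers the log ratio he mentions; this formulation is standard material rather than a quote from the episode:

I(X;Z) \;=\; D_{\mathrm{KL}}\!\big(p(x,z)\,\|\,p(x)\,p(z)\big) \;=\; \mathbb{E}_{p(x,z)}\!\left[\log \frac{p(x,z)}{p(x)\,p(z)}\right]

A discriminator D trained with balanced samples to output 1 on pairs drawn from the joint and 0 on pairs drawn from the product of marginals satisfies, at its optimum,

\log \frac{D^{*}(x,z)}{1 - D^{*}(x,z)} \;=\; \log \frac{p(x,z)}{p(x)\,p(z)}

so its logits estimate exactly the log density ratio that GAN-style training exploits, without ever writing down either distribution.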
Devon Hjelm: So, this is the technical thing. If, for instance, I'm trying to train a model that's able to distinguish between cats and dogs, I present it with cats and I tell it, hey, this is label zero; I present it with dogs, this is label one. And at the end of the day, if you train a standard deep network classifier, it's learning to estimate the log ratio of the probability of cat versus the probability of dog. So, what the mutual information estimator essentially comes down to is training a classifier between samples that go together and samples that don't go together. Deep InfoMax is very much based on our work on the Mutual Information Neural Estimator, or MINE. What Deep InfoMax basically does is take a full image and present it through a deep neural network. As the image gets processed, if you look at different layers of the network – this is a convolutional neural network, so different locations in the network have been processed from different patches of the image – you can think of the features at these different layers and locations as corresponding to parts of the input, right? So, what Deep InfoMax basically does is say, well, all those features go together, so I'm going to group them all together, present them to a classifier, and say, tell me that these go together – classify it as zero or one, whatever you call "together" – and then take combinations of those patch representations with images that came from somewhere else, put those together and say, these don't go together.

Host: Okay.

Devon Hjelm: And that process is actually very similar to what we did in mutual information neural estimation: the things that go together are really samples from the joint distribution, and the things that don't go together resemble samples from the product of marginals. So, when you train a classifier to distinguish between these two, you're training the model, in a similar way as in MINE, to interpret the dependencies among the things that go together – what makes them go together, why do they go together? And that is encoded in the idea of the joint distribution. So, when you do that, you really are estimating something like the mutual information. But the key thing that we walked away with, with Deep InfoMax, was that we don't really care about estimating mutual information, we don't care about the number that corresponds to how dependent things are – we just want a model that understands whether there's more or less mutual information, so that we can use that number as a learning signal to train the encoder.
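Here is a minimal sketch of that local-global "goes together" objective in PyTorch. The encoder sizes, the mean-pooled global feature, and the binary cross-entropy scoring are illustrative assumptions for the sketch, not the exact architecture and losses of the Deep InfoMax paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Tiny conv encoder exposing local feature maps and a global vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(128, 64)  # global representation

    def forward(self, x):
        local = self.conv(x)                    # (B, 128, H', W') local features
        glob = self.fc(local.mean(dim=(2, 3)))  # (B, 64) global feature
        return local, glob

class PairScorer(nn.Module):
    """Scores whether a local feature and a global feature 'go together'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128 + 64, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, local, glob):
        B, C, H, W = local.shape
        loc = local.permute(0, 2, 3, 1).reshape(B, H * W, C)      # (B, HW, 128)
        g = glob.unsqueeze(1).expand(-1, H * W, -1)               # (B, HW, 64)
        return self.net(torch.cat([loc, g], dim=-1)).squeeze(-1)  # (B, HW)

def dim_loss(images, encoder, scorer):
    local, glob = encoder(images)
    # Matched pairs: local features with the global feature of the same image.
    pos = scorer(local, glob)
    # Mismatched pairs: local features with a global feature rolled in
    # from a different image in the batch.
    neg = scorer(local, glob.roll(1, dims=0))
    # Classifier-style objective: matched toward 1, mismatched toward 0.
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))

Minimizing this loss trains the encoder so that a classifier can tell matched local-global pairs (samples from the joint) from mismatched ones (samples resembling the product of marginals) – the learning signal Devon describes, used for training rather than for reporting a mutual information number.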
Host: Well, let's talk a bit more about some of the problems that arise when you aim for, as you call it, "pure mutual information maximization." You've said in the past that that's not actually what we're aiming for here. So, what do you do with the issue of noisy information, and what do you really want to aim for here?

Devon Hjelm: I guess there are two different ways to answer that. One is that, when we do this Deep InfoMax-style learning on images, while it does resemble mutual information maximization, there's a caveat, in the sense that MINE is only an estimator. It's a lower bound to the mutual information. It can only learn the dependencies that it's capable of learning, given the capacity of the neural network and the number of samples from the world that it has received. So, the lower the capacity of the model, the less it will be able to learn. The fewer samples it's exposed to, the less it will be able to learn.
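The lower bound he mentions can be written out. MINE (Belghazi et al., 2018) optimizes the Donsker-Varadhan representation of the KL divergence, in which a critic network T_\theta can only ever certify as much mutual information as its capacity and its samples allow:

I(X;Z) \;\ge\; \mathbb{E}_{p(x,z)}\big[T_\theta(x,z)\big] \;-\; \log \mathbb{E}_{p(x)\,p(z)}\big[e^{T_\theta(x,z)}\big]

The bound holds for any T_\theta and is tight only for an optimal critic, which is why a small network, or too few samples, can only recover part of the true dependence.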
Devon Hjelm: But it has to learn something, so Deep InfoMax is built on this structural idea where you patch things up in different ways. This biases the model toward learning things that are expressible structurally. For instance, because I'm effectively playing a comparison game between different patches of the same image, the model needs to understand why those patches are related, and it maybe doesn't need to understand something more nuanced, like the texture of one patch compared to another – and the reason it doesn't need to understand that is that it's maybe not learnable. It's a much harder problem than just understanding whether the shape goes together, or the general color goes together, or something like that. So, the model will focus on the things that are easy to pick up. And a lot of the time, how we design these tasks – this way of breaking up the data in a particular way – matters more, when we apply it to mutual information neural estimation-style learning or Deep InfoMax, than the actual objective that we use.

Host: So, is there anything different, or noteworthy, about the technical aspects of Deep InfoMax that says, hey, stand up and take notice – this is a new approach to solving some of these problems?

Devon Hjelm: The main nuanced thing about the Deep InfoMax model is that, as I mentioned before, we were taking these local representations – these features that corresponded to a patch of the input.

Host: Mmm-hmm.

Devon Hjelm: And it's important in Deep InfoMax, if you do things that way, that those features actually correspond to a patch. What's interesting about a lot of the convolutional models used in the wild, like the very popular ResNet, is that, while they have a spatial extent as you progress through the network, very quickly those locations cover the whole input.

Host: Mmm.

Devon Hjelm: So even though the network has a spatial extent – there are locations that are spatial in the neural network – if you back-project and look at the input that a location has processed, it's actually processing the whole input. So that's not a different view of the input anymore. It's good and bad: it's nice that, at some point, the architecture mixes everything it can from the input to try to infer whether something belongs to a class, in the case of supervised learning. But in the case of self-supervised learning, if you want to really leverage these locations in the natural architecture, then you need to be a little more careful about how you apply architectures. If you have architectures whose receptive fields quickly expand over the full input, then you run into trouble. And so Deep InfoMax is particular in that it really tried to leverage the internal structure of the model, rather than just, say, purely manipulating the input data and designing losses on top of that.
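The receptive-field growth Devon refers to follows from standard convolution arithmetic and is easy to check. The layer configuration below is a made-up example for illustration, not a specific ResNet:

def receptive_field(layers):
    """Receptive field size of stacked conv layers.

    layers: list of (kernel_size, stride) pairs, first layer first.
    Uses the standard recurrence: rf += (k - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Four 3x3, stride-2 layers already see a 31-pixel-wide region of the input:
print(receptive_field([(3, 2)] * 4))  # -> 31

On a 32×32 image, the "local" features of even this small stack effectively see the whole input – exactly the failure mode he describes for naive choices of architecture.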
(music plays)

Host: At this point in the podcast, Devon, I always ask my guests what could possibly go wrong, so I'll ask you, too. And I do this because I want to address some of the elephants in the room where, you know, this is a powerful technology. Is there anything about your work that keeps you up at night, metaphorically, and if so, how are you addressing it?

Devon Hjelm: I think we all have to think about this, whether our work is more or less directly related to things like fairness, privacy and nefarious agents. Some people, particularly on the FATE team, focus more on the reasoning aspects of our models: how do they take data presented to them and produce results that are fair, or that preserve privacy, and so on? But the general trend we're seeing is that, for our reasoning, we're using more and more deep models that produce the features people do that reasoning on top of. And so, I'm very much interested in how the quality of the features of such a model impacts these – I don't know what you'd call them – moral metrics. For instance, is it possible that my model, if it's presented with the face of a person, also encodes their identity perfectly, or something like that? That my features either do or don't make it easy for someone to infer where those features came from, and identities…

Host: Right.

Devon Hjelm: …that people might want to keep hidden. So, yeah, in particular when you're talking about the mutual information stuff: we're working really hard on maximizing mutual information all over the place, trying to capture as much information about the data as possible. But you could also imagine using the same techniques to minimize mutual information. You basically flip the sign on some of the loss objective functions and say, okay, there are these properties that I really don't want in this representation – whatever the representation looks like, minimize them. And you can use the exact same objective functions to do that. It's a little trickier, because you are dealing with a min/max problem, but you can imagine doing things like that. So, it's another sub-part of the whole question of what a good representation is: sometimes you don't really want all the underlying factors in that representation.

Host: Right.

Devon Hjelm: You maybe want to actually hide things.

Host: Yeah, yeah, yeah. I want to drill in just a little bit on how you can control, at the outset, keeping a lid on the things that you could see going wrong with machines that act in the world like humans do and pass the Turing test on a grand scale.

Devon Hjelm: When we talk about whether a representation is good, or useful – which is, I guess, the core of what I'm focused on – it's important that, among the collection of things we use to evaluate our models, we keep in mind metrics that evaluate things like fairness and privacy. One thing we're seeing, as we progress in representation learning, is that it's not just one metric that really matters for whether a representation is going to be good for deployment on some complicated downstream task. It's not going to be just classification – there's a suite of things we have to evaluate models on, and that suite provides a better, more fine-grained story about whether the representation truly is going to be useful. And one of those dimensions of usefulness is things like privacy and fairness.

Host: Hmmm. Speaking of stories… tell us about yourself, Devon, and your path to machine learning research at MSR. How's the ML game better since you joined the team?

Devon Hjelm: I guess I've always been pretty interested in representation learning – understanding representations of the world, deriving them. I started in physics, which is about learning representations of the world that have to do with dynamics and all sorts of physical quantities. Then I got interested in languages, because I was interested in how people represent the world through their language – through words and their relationships and things like that. So, I went from physics to linguistics and did a stint there. I quickly realized that, at least where I was, they didn't have the tools necessary to really solve the types of problems I was interested in, which was understanding, from language, how humans represent the world. So, I started getting really interested in using computers to help solve these problems – models and things like that. I got involved with some people over at the CS department, and then joined the PhD program in computer science. At the University of New Mexico, probably one of the stronger groups focusing on modeling complex data and learning representations was on the neuroimaging side. There was a big research institute called the Mind Research Network. So, I talked to Vince Calhoun, who's still chief officer over there, and I said, hey, I'm interested in these deep neural networks, I think they might be useful for neuroscience. They were looking at brain imaging data like fMRI, EEG, and some other related datasets and modes. And he said, oh, okay, well, here's some public data we have available – try your model on it and see how it works. And I did, and it produced something interesting, and then he said, okay, well, I'll take you on as a grad student.
So, I lived over there for a couple of years, and I was kind of the black sheep, using deep neural networks while everybody else was using more linear models like ICA and PCA. So, me and my sort-of-unofficial advisor, Sergey Plis, were kind of the deep learning nerds, and we put out some nice papers that used deep learning and showed that it worked with fMRI data. And through that whole process, because the Mind Research Network was such a good research institution, with good connections and grant money, they were able to connect our small group to some really big names in deep learning, like Russ Salakhutdinov and Kyunghyun Cho. Russ was, at the time, a professor at the University of Toronto – he's since moved on to CMU – and Kyunghyun Cho was a postdoc at the time with Yoshua. This entire time, I'm pushing on the whole representation learning agenda, but in the context of neuroimaging, learning how to use new models. Even my work with generative models was all about learning good representations, because you can use the intermediate states of generative models as representations as well.

Host: Mmm-hmm.

Devon Hjelm: And through those connections, I was able to get more deep learning papers into NeurIPS, and through that connection I was able to reach out to Yoshua at the end of my PhD and connect up with them there.

Host: And so, you were at MILA for a while – the Montreal Institute for Learning Algorithms.

Devon Hjelm: Yeah. I'm still an adjunct professor there.

Host: Okay.

Devon Hjelm: So, I still co-supervise students and help them with research.

Host: What's one thing we don't know about you, Devon? Something interesting that may have impacted your career – something personal. Maybe it's a side quest or a personality trait, something to give us context about who you are, a little deeper in.

Devon Hjelm: Well, I played bass, you know, more or less professionally, for three years in grad school. I was playing…

Host: Were you in a band?

Devon Hjelm: Yeah – I'd have gigs three, sometimes four days a week, playing salsa.

Host: Are you kidding?

Devon Hjelm: No, I'm not. We would play everywhere, from casinos to dances to everything like that. I mean, I've played music my entire life, but I was pretty ambitious about being very, very good, and that group was pretty cool, because there were super-high-skilled salsa musicians from all over the world. One person was from a touring group in Cuba, and there was another guy who joined our group who was on some Grammy albums. So, I got to do that for a little while. It was really good! I was working as a musician on top of doing my PhD.

Host: Do you still play?

Devon Hjelm: I don't play bass anymore, because I got tired of spending all my Fridays and Saturdays playing the same music all the time and not being able to sit and enjoy watching someone else play.
Right now, I'm learning how to play the mandolin, because it's easier to play by yourself!

Host: Let's get some parting thoughts and shots in. As we close, I want to give you the chance to think ahead and dream about the future. Let's say you're wildly successful, Devon. What will the world look like at the end of your career? What will you have accomplished in your field, and what will we be able to do that we hadn't been able to do before?

Devon Hjelm: I can imagine all sorts of things we could do with models that we can't do today. For instance, presenting a model with a brand-new environment that it's able to navigate and explore on its own, with very little help from human experimenters, learning everything it needs to learn to do useful things. I firmly believe it's important that our ultimate goal for all of this AI effort is to arrive at models and algorithms and agents that are useful to human beings in the real world, so that they can do things they couldn't do before, more easily – to empower the more general population. On top of that, if I were wildly successful, it would be continuing an ongoing, exciting community of people working on really difficult problems because they are passionate about it. Everybody has their own ideas about what would be useful or good for people. And as long as I'm part of that – someone who gets to interact with that community and help build and shape those things – I think that's the best possible thing I can hope for. So, I'm just hoping that I can be part of that community and that it continues to thrive.

Host: Devon Hjelm, thank you for joining us today. It's been really fun!

Devon Hjelm: Thank you.

(music plays)

To learn more about Dr. Devon Hjelm, and the very latest in deep learning research, visit Microsoft.com/research.