Teaching computers to see with Dr. Gang Hua

Principal Researcher and Research Manager, Gang Hua. Photography courtesy of Maryatt Photography.

Episode 28, June 13, 2018

In technical terms, computer vision researchers “build algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.” In layman’s terms, they build machines that can see. And that’s exactly what Principal Researcher and Research Manager, Dr. Gang Hua, and the Computer Vision Technology team are doing. Because being able to see is really important for things like the personal robots, self-driving cars, and autonomous drones we’re seeing more and more in our daily lives.

Today, Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world.


Transcript

Gang Hua: If we look back ten, fifteen years ago, you see the computer vision community was more diverse. You saw all kinds of machine learning methods, all kinds of knowledge borrowed from physics, from optics, coming into this field to try to tackle the problem from multiple perspectives. As we are emphasizing diversity everywhere, I think the scientific community is going to be healthier if we have diverse perspectives.

(music plays)

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

In technical terms, computer vision researchers “build algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.” In layman’s terms, they build machines that can see. And that’s exactly what Principal Researcher and Research Manager, Dr. Gang Hua, and the Computer Vision Technology team, are doing. Because being able to see is really important for things like the personal robots, self-driving cars, and autonomous drones we’re seeing more and more in our daily lives.

Today, Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world. That and much more on this episode of the Microsoft Research Podcast.

Host: Gang Hua.

Gang Hua: Hi.

Host: Hello, welcome to the podcast. Great to have you here.

Gang Hua: Thanks for inviting me.

Host: You’re a Principal Researcher and the Research Manager at MSR and your focus is computer vision research.

Gang Hua: Mmm hmm.

Host: In broad strokes right now, what gets a computer vision researcher up in the morning? What’s the big goal?

Gang Hua: Yeah, computer vision is a relatively young research field. In general, you can think of this field as trying to build machines, to endow computers with the capability to see the world and interpret the world just like humans do. From a more technical point of view, the input to the computer is really just images and videos. You can think of them as a sequence of numbers, but what we want to extract from these images and videos, from these numbers, is some sort of structure of the world, or some semantic information. For example, I could say, this part of the image really corresponds to a cat, that part of the image corresponds to a car, this type of interpretation. So, that’s the goal of computer vision. For us humans, it looks to be a simple task to achieve, but in order to teach computers to do it… we really have made a lot of progress in the past ten years, but as a research field, this thing has been around for fifty years, and still there are a lot of problems to tackle and address.

Host: Yeah. In fact, you gave a talk about five years ago, where you said, and I paraphrase, “After thirty years of research why should we still care about face recognition research?” Tell us how you answered then, and now, where you think we are…

Gang Hua: So, I think the status quo five years ago… I would say, at that moment, if we captured a snapshot of how research in facial recognition had progressed since the beginning of computer vision, or of face recognition research, I would say we had achieved a lot, but more in controlled environments, where you could carefully control the lighting, the camera, the setting and all those kinds of things when you are framing the faces. At that moment, five years ago, when we moved towards more wild settings, like faces taken in uncontrolled environments, I would say there was a huge gap there in terms of recognition accuracy. But in the past five years, I would say the whole community has made a lot of progress leveraging the more advanced deep learning techniques. Even for the facial-recognition-in-the-wild scenario, we’ve made a lot of progress and really pushed these things to a stage where a lot of commercial applications become feasible.

Host: Ok, so deep learning has really enabled, even recently, some great advances in the field of computer vision and computer recognition of images.

Gang Hua: Right.

Host: So, that’s interesting, when you talk about the difference between a highly controlled situation, versus recognizing things in the wild, and I’ve had a couple of researchers on here who have said, yeah, where computers fail is when the data sets are not full enough… for example dog, dog, dog, 3-legged dog, mmm, is it still a dog?

Gang Hua: Sure.

Host: Right? So, what kinds of things do deep learning techniques give you that you didn’t have before in these recognition advances?

Gang Hua: Yeah, that’s a great question. From a research perspective, you know, the power of deep learning presents itself in several ways. The first is that it can conduct the learning in an end-to-end fashion and learn the right representation for a semantic pattern. For example, when we are talking about a dog, and we’re really looking through all kinds of pictures of a dog: say my input is a 64×64 image, and suppose each pixel has around two hundred and fifty values it can take. That’s a huge space, if you think about it combinatorially. But when we talk about dog as a pattern, actually, the pixels are all correlated a little bit, so the actual pattern for “dog” is going to reside in a much lower dimensional space. So, the power of deep learning is that I can conduct the learning in an end-to-end fashion that really learns the right numerical representation for “dog.” And because of the deep structure, we can come up with really complicated models which can digest a large amount of training data. That means if my training data covers all kinds of variations, all kinds of views of this pattern, eventually I can recognize it in a broader setting, because I have covered almost all of the space. OK. Another capability of deep learning is this kind of compositional behavior. Because of the layered, feed-forward structure and the layered representation there, when an image gets fed into the deep network, it starts by extracting some very low-level image primitives; then, gradually, the model can assemble all those primitives together and form higher and higher levels of semantic structure. So, in this sense, it captures all the small patterns, composes them into bigger patterns, and puts those together to represent the final pattern. That’s why it is very powerful, especially for visual recognition tasks.
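
To make the end-to-end idea concrete, here is a minimal sketch, assuming PyTorch, of the kind of small convolutional network being described in spirit: raw 64×64 pixels go in, early layers pick up low-level primitives, deeper layers compose them into larger patterns, and the loss on the label drives the learning of the representation itself. It is an illustration only, not the models used in this research; the architecture and numbers are arbitrary.

```python
# A minimal end-to-end learning sketch (assumes PyTorch is installed).
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level primitives (edges, blobs)
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # mid-level parts composed from primitives
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        h = self.features(x)                 # the learned representation
        return self.classifier(h.flatten(1))

# End-to-end training: the label loss updates every layer, so the
# representation for "dog" is learned rather than hand-designed.
model = TinyConvNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(8, 3, 64, 64)           # stand-in batch of 64x64 images
labels = torch.randint(0, 2, (8,))           # 0 = "not dog", 1 = "dog"
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```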

Host: Right, so, the broad umbrella of CVPR is computer vision and pattern recognition.

Gang Hua: Yes. Right.

Host: And a lot of that pattern recognition is what the techniques are really driving to.

Gang Hua: Sure, yeah. Computer vision, really, is trying to make sense out of pixels. If we talk about it in a really mechanical way, I feed in the image, and you either extract some numeric output or some symbolic output from it. The numeric output, for example, could be a 3-D point cloud which describes the structure of the scene or the shape of an object. It could also correspond to some semantic labels, like dog and cat, as I mentioned at the beginning.

Host: Right. So, we’ll get to labeling in a bit. A whole interesting part of the machine learning process is that it has to be fed labels as well as pixels, right?

Gang Hua: Sure, yeah.

(music plays)

Host: You have three main areas of interest in your computer vision research that we talked about. Video, faces, and arts and media. Let’s talk about each of those in turn and start with your current research in what you call “video understanding.”

Gang Hua: Yes. Video understanding: the title sort of explains itself. The input now becomes a video stream. Instead of a single image, we are reasoning about pixels and how they move. If we view computer vision reasoning about a single image as a spatial reasoning problem, now we are talking about a spatial-temporal reasoning problem, because video has a third dimension, the temporal dimension. And if you look into a lot of real-world problems, we’re talking about continuous video streams, whether it is a surveillance camera in a building or a traffic camera overseeing highways. You have this constant flow of frames coming in and the objects inside it are moving. So, you want to basically digest information out of it.

Host: When you talk about those kinds of cameras, it gives us massive amounts of video, you know, a constant stream from cameras and security in the 7-Eleven and things like that. What is your group trying to do on behalf of humans with those video streams?

Gang Hua: Sure. So, my team is building the foundational technology there, and one incubation project we are doing is to really analyze the traffic on roads. If you think about a city, when they set up all those traffic cameras, most of the video streams are actually wasted. But if you carefully think about it, these cameras could be made smart. Just think about the scenario where you want to more intelligently control the traffic lights. If in one direction I see a lot more traffic flow, instead of having a fixed schedule for turning the red and green lights on and off, I could say, OK, because this side has fewer cars, or even no cars at this moment, I will allow the other direction’s green light to stay on longer, so that the traffic can flow better. So that’s just one type of application there.
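
As a rough illustration of the kind of logic this enables (not the incubation project’s actual code), here is a toy Python sketch where per-direction vehicle counts, of the sort a detector might estimate from the camera feeds, stretch or shorten the green phase. The function name, counts, thresholds, and timings are invented for the example.

```python
# Toy adaptive-signal logic driven by camera-based vehicle counts.
def green_time(counts, direction, base=30, per_car=2, max_green=90):
    """Seconds of green for `direction`, scaled by how many cars are waiting."""
    waiting = counts.get(direction, 0)
    return min(base + per_car * waiting, max_green)

counts = {"north_south": 14, "east_west": 1}   # from a vehicle detector on each camera
for direction in counts:
    print(direction, green_time(counts, direction), "seconds of green")
```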

Host: Could you please get that out there?

Gang Hua: Sure!

Host: I mean—yeah, because how many of us have sat at a traffic light when it’s red and there’s no traffic coming the other way.

Gang Hua: Exactly.

Host: At all. It is like why can’t I go?

Gang Hua: Sure. Yeah, you could also think about some other applications. For example, if we accumulated videos across years, and citizens requested that we set up additional bicycle lanes, we could use the videos we have, analyze all the traffic data there, and then decide if it makes sense to set up a bike lane there, or whether setting one up would significantly affect the other traffic flows, and help cities make decisions like that.

Host: I think this is so brilliant, because a lot of times we make decisions based on, you know, our own ideas rather than data that says, you know, hey, this is where a bike lane would be terrific. This is where it would actually ruin everything for everybody, right?

Gang Hua: For sure, yeah. Cities sometimes leverage other types of sensors to do that. You hire a company to set up some special equipment on the roads. But it’s very cost-ineffective. Just think: all those cameras are already sitting there, and the video streams are already out there. Right? So.

Host: Yeah. That’s a fantastic explanation of what you can do with machine learning and video understanding.

Gang Hua: Right.

Host: Yeah. Another area you care about is faces, which kind of harkens back to the “why should we still care about facial recognition research?”

Gang Hua: Sure.

Host: But yeah. And this line of research has some really interesting applications. Talk about what’s happening with facial recognition research. Who’s doing it and what’s new?

Gang Hua: Yeah, so indeed, if we look back at how facial recognition technology has progressed in Microsoft: I think when I was at Live Labs Research, we set up the first facial recognition library, which could be leveraged by different product teams. Indeed, the first adopter was Xbox. They tried to use facial recognition technology for automatic user login at that moment. I think that was the first adoption. Over time, the center of facial recognition research sort of migrated to Microsoft Research Asia, where we still have a group of researchers I collaborate with, and we are continuously trying to push the state of the art. It has now become a synergistic effort where we have engineering teams helping us gather more data, and then we just train better models. Our recent research has actually focused more on a line of research we call identity-preserving face synthesis. Recently there’s been a big advancement in the deep learning community, which is using deep networks to build generative models that can model the distribution of images, so that you can draw from that distribution and basically synthesize an image. You build deep networks whose output is, indeed, an image. What we want to achieve is actually a step further. We want to synthesize faces while keeping the identity of those faces, since we don’t want our algorithms to just randomly sample a set of faces without any semantic information. Say you want to generate a face of Brad Pitt: I want to really generate a face that looks like Brad Pitt. If I want to generate a face similar to anybody I know, we just want to be able to achieve that.
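
A minimal sketch of the identity-preserving idea, assuming PyTorch and not the published Microsoft Research model: alongside the usual adversarial term that makes a synthesized face look real, the generator is penalized when a face-embedding network sees its output as a different identity than the reference photo. The `generator`, `discriminator`, and `face_embedder` networks here are assumed placeholders supplied by the caller.

```python
# Illustrative identity-preserving synthesis loss (assumes PyTorch).
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, face_embedder,
                   z, reference_face, identity_weight=10.0):
    fake = generator(z, reference_face)        # synthesized face image
    logits = discriminator(fake)
    # Adversarial term: the synthesized face should look real to the discriminator.
    adversarial = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # Identity term: embeddings of the synthesized and reference faces should agree,
    # so the output still looks like the same person.
    similarity = F.cosine_similarity(face_embedder(fake), face_embedder(reference_face))
    return adversarial + identity_weight * (1.0 - similarity.mean())
```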

Host: So, the identity preservation is the sort of outcome that you’re aiming for of the person that you’re trying to generate the face of?

Gang Hua: Right.

Host: You know, tangentially, I wonder if you get this technology going does it morph with you as you get older, and start to recognize you — or do you have to keep updating your face?

Gang Hua: Yeah that’s indeed a very good question. I would say, in general, we actually have some ongoing research trying to tackle that problem. I think for existing technology, yes, you need to update your face maybe from time to time. Especially if you’ve undergone a lot of changes. For example, somebody could have done some plastic surgery. That will basically break the current system.

Host: Wait a minute. That’s not you.

Gang Hua: Sure, no, not me at all. So, there are several ways you can think about it. Human faces actually don’t change much between, say, age 17 or 18, once you’ve grown up, all the way to maybe 50-ish. When people are first born, as kids, their faces actually change a lot, because the bones are growing, and basically the shape and skin can change a lot. But once people mature into the adult stage, the change is very slow. So, we actually have some research where we are training models of the aging process too, which will help establish better facial recognition systems across ages. This is actually a very good kind of technology which can get into the law enforcement domain. For example, some missing kids could have been kidnapped by somebody, but after many years, if you…

Host: They look different.

Gang Hua: Yeah, they look different; if the smart facial recognition algorithms can match the original photos…

Host: And say what they would look like at maybe 14 if they were kidnapped earlier or something?

Gang Hua: Yes, yes, exactly.

Host: Wow, that’s a great application of that. Well, let’s talk about the other area that you’re actively pursuing and that’s media and the arts. Tell us how research is overlapping with art and particularly with your work in deep artistic style transfer.

Gang Hua: Sure. If we look into people’s desires, right? First we need to eat, and we need to drink, and we need to sleep. OK? Then once all these needs are fulfilled, actually, we humans have a strong desire for arts…

Host: And creation.

Gang Hua: And the creation and things like that. So, this theme of research in computer vision links to a more artistic area, what we call media and arts: basically using computer vision technologies to give people good artistic enjoyment. The particular research project we have done in the past two years is a sequence of algorithms where we can render an image into any sort of artistic style you want, as long as you provide an example of that artistic style. For example, we can render an image in Van Gogh’s style.

Host: Van Gogh?

Gang Hua: Yeah, or any other painter’s painting style…

Host: Renoir, or Monet… or Picasso.

Gang Hua: Yeah, all of them. You can think of…

Host: Interesting. With pixels.

Gang Hua: With pixels, yeah. Those are all again, like, all done by deep networks and some deep learning technologies we designed.

Host: It sounds like you need a lot of disciplines to feed into this research. Where are you getting all of your talent from in terms of…?

Gang Hua: In a sense, I would say our goal on this is… you know, artworks are not necessarily accessible to everyone. Some of these artworks are really expensive. With this kind of digital technology, what we are trying to do is make these kinds of artworks accessible to common users.

Host: To democratize.

Gang Hua: Yeah democratize it. Yes, as you mentioned that, so.

Host: That’s good.

Gang Hua: Our algorithm allows us to build an explicit representation, like a numerical representation, for each kind of style. Then, if you want to create new styles, we can blend them. So, it’s like we are building an artistic style space where we can explore in between, to see how the visual effects evolve between, say, two painters, and even get a deeper understanding of how they composed their artistic styles. Yeah.
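
To illustrate what blending in such a style space could look like (the explicit representation in the actual research may differ), here is a small sketch where each painter’s style is a numerical vector and new in-between styles come from interpolating two of them. The `stylize` network that would render a photo with the blended style is hypothetical and left as a comment.

```python
# Illustrative "style space" blending.
import numpy as np

def blend_styles(style_a, style_b, alpha):
    """Return a style vector that is `alpha` of the way from style_a to style_b."""
    return (1.0 - alpha) * style_a + alpha * style_b

style_van_gogh = np.random.rand(128)    # stand-in learned style code for one painter
style_monet = np.random.rand(128)       # stand-in style code for another
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    mixed = blend_styles(style_van_gogh, style_monet, alpha)
    # rendered = stylize(photo, mixed)  # hypothetical stylization network
    print(alpha, mixed[:3])             # peek at the blended style code
```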

Host: What’s really interesting to me is that this is a really quantitative field — computer science, algorithms, and a lot of math and numbers. And then you’ve got art over here which is much more metaphysical. And yet, you’re bringing them together and it’s revealing the artistic side of the quantitative brain.

Gang Hua: Sure. I think to bring all these things together, the biggest tool we are leveraging indeed is statistics.

Host: Interesting.

Gang Hua: Like all kinds of machine learning algorithms, it’s really trying to capture the statistics of the pixels.

(music plays)

Host: Let’s get a little technic… We have been a little technical, but let’s get a little more technical. Some of your recently published work – and our listeners can find that on both the MSR website and your website – you talked about a new distributed ensemble approach to active learning. Tell us what’s different about that… what you propose, how it’s different, what does it promise?

Gang Hua: Yeah, that’s indeed a great question. When we are talking about active learning, we are referring to a process where we have some sort of human oracle involved in the learning process. In traditional active learning, I have a learning machine. This learning machine can intelligently pick some data samples and ask the human oracle to provide a little bit more input, for example, a label for an image. In this work, when we are talking about an ensemble machine, we are actually dealing with a more complicated problem. We are trying to put active learning into a crowd-sourcing environment. If you think about Amazon Mechanical Turk, nowadays it’s really one of the biggest platforms where people send their data and ask the crowd workers to label all of it, but in this process, if you are not careful, the labels collected for your data could be quite lousy, indeed. They may not even be usable. So, in this process, we actually try to achieve two goals. The first goal: we want to smartly distribute the data so that we can make the labeling most cost-effective, OK? The second: we want to assess the quality of all my crowd workers, so that maybe, even during the online process, I can purposely send my data to the good workers to label. So, that’s how our model works. We have a distributed ensemble model, where each crowd worker corresponds to one of these learning machines. And we do a statistical check across all the models, so that in the same process, we actually come out with a quality score for each of the crowd workers on the fly. Then we can use the model not only to select the samples, but also to send the data to the labelers with the highest quality. That way, as these labeling efforts progress, I can quickly come out with a good model.
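
A simplified sketch of that loop, under the assumption that each crowd worker is paired with a simple classifier: the ensemble picks the sample it is least sure about, routes it to the worker with the best running quality score, and updates that score from agreement with the ensemble majority. The published method is more sophisticated; this only shows the shape of the idea, and all names and numbers are invented.

```python
# Illustrative distributed-ensemble active-learning round.
import numpy as np

def ensemble_uncertainty(models, x):
    """Disagreement among the per-worker models on sample x (0 = full agreement)."""
    votes = np.array([m(x) for m in models])          # each model votes 0 or 1
    p = votes.mean()
    return min(p, 1.0 - p)

def active_learning_round(models, quality, unlabeled, get_label_from):
    # 1. Pick the sample the ensemble is least sure about.
    idx = max(range(len(unlabeled)),
              key=lambda i: ensemble_uncertainty(models, unlabeled[i]))
    x = unlabeled.pop(idx)
    # 2. Route it to the worker with the highest running quality score.
    worker = int(np.argmax(quality))
    label = get_label_from(worker, x)
    # 3. Update that worker's quality from agreement with the ensemble majority.
    majority = int(np.mean([m(x) for m in models]) >= 0.5)
    quality[worker] = 0.9 * quality[worker] + 0.1 * (1.0 if label == majority else 0.0)
    return x, label, worker

# Tiny demo with stand-in worker models (threshold rules on a scalar feature)
# and a stand-in oracle for the chosen worker's answer.
models = [lambda x, t=t: int(x > t) for t in (0.3, 0.5, 0.7)]
quality = np.full(3, 0.5)
pool = [0.10, 0.45, 0.55, 0.90]
print(active_learning_round(models, quality, pool,
                            get_label_from=lambda w, x: int(x > 0.5)))
```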

Host: That leads me to the human-in-the-loop issue and the necessity of the checks and balances between humans and machines. Aside from what you’re just talking about, how are you tackling other issues of quality control by using humans with your machines?

Gang Hua: I have been thinking about this problem for a while, mainly in the context of robotics. If you think about any intelligent system, I would say, only in a really closed-world setting can you have a system which runs fully autonomously. But whenever we hit the open world, current machine-learning-based intelligent systems are not necessarily good at dealing with all kinds of open-world cases, because there are corner cases which may not have been covered.

Host: And variables that you don’t think about, yeah.

Gang Hua: Exactly. So, one thing I have been thinking about is how we could really engage humans in that loop, to not only help the intelligent agent when it needs it, but also, at the same time, form some mechanism by which we can teach these agents to handle similar situations in the future. I will give you a very specific example. When I was at Stevens Institute of Technology, I had a project from NIH on what we called co-robots.

Host: What kind of robots?

Gang Hua: Co-robots. They’re actually wheelchair robots. The idea is that, as long as the user can move their neck, we have a head-mounted camera, and we use it to track the position of the head and let the user control the wheelchair robot. But we don’t want the user to control it all the time. So our goal is, say, in a home setting, we want these wheelchair robots to be able to carry the user and move largely autonomously inside the room. Whenever the user gives guidance, say, hey, I want to go to that room, then the wheelchair robot would mostly do autonomous navigation. But if the robot encounters a situation it does not know how to deal with, for example, how to move around something, then at that moment, the robot would proactively ask the human for control. The user would control the robot and deal with that situation, and maybe the next time the robot encounters a similar situation, it’s going to be able to deal with it on its own.
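
As a schematic sketch of that hand-off (not the actual NIH project code), here is a toy control loop: the robot acts on its own when its planner is confident, falls back to the head-tracking interface when it is not, and stores the human’s action as a demonstration to learn from later. The planner, the head-command reader, and the threshold are all stand-ins.

```python
# Toy human-in-the-loop control hand-off.
import random

CONFIDENCE_THRESHOLD = 0.7
demonstrations = []            # (situation, human action) pairs to learn from later

def plan(state, goal):
    """Stand-in planner: proposes an action and reports how confident it is."""
    return "move_forward", random.random()

def read_head_command():
    """Stand-in for the head-mounted-camera interface the user steers with."""
    return "turn_left"

def control_step(state, goal):
    action, confidence = plan(state, goal)
    if confidence >= CONFIDENCE_THRESHOLD:
        return action                                # robot handles the situation itself
    human_action = read_head_command()               # otherwise, hand control to the user
    demonstrations.append((state, human_action))     # remember it for next time
    return human_action

print(control_step(state="hallway_corner", goal="kitchen"))
```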

(music plays)

Host: What were you doing before you came here and how did you end up at Microsoft Research?

Gang Hua: This is my second term at Microsoft. As I mentioned, the first term was between 2006 and 2009, when I was in a lab called Live Labs. That’s my first term. During that tenure, I established the first face recognition library. Then I kind of got attracted by the external world a little bit. So, I went to Nokia Research, IBM Research, and I landed at Stevens Institute of Technology as a faculty member there, so…

Host: And that’s over in New Jersey, right?

Gang Hua: Yeah, that’s in New Jersey, on the East Coast. Then in 2015, I came back to Microsoft Research, but in the Beijing lab first. I transferred back here in 2017, because my family stayed here.

Host: So now you are here in Redmond after Beijing. How did that move happen?

Gang Hua: My family always stayed in Seattle. Microsoft Research’s Beijing lab is a great place. I would say I really enjoyed it. One of the unique things there is the super, super dynamic research intern program. Year-round, there are several hundred interns actually working in the lab, and they collaborate closely with their mentors. I think it’s a really dynamic environment. But because my family is in Seattle, I sort of explored a little bit, and then the Intelligent Group was setting up this computer vision team here. So that’s why I joined.

Host: You’re back in Seattle again.

Gang Hua: Yeah.

Host: So, I ask this question of all the researchers that come on the podcast and I’ll ask you too. Is there anything about your work that we should be concerned about? What I say is, anything that keeps you up at night?

Gang Hua: I would say, when we talk about the computer vision domain especially, I think privacy is potentially the largest concern. If you look across countries now, there are hundreds of millions of cameras set up everywhere, in the public domain or in buildings, and I would say, with the advancement of the technology, it is really not sci-fi to expect that cameras could now track people all the time. I mean, everything has two sides. On one hand, this could help us, for example, to better deal with criminals. But for ordinary citizens there are a lot of privacy concerns.

Host: So what kinds of things… and this is why I ask this question because it prompts people to think, ok, I have this power because of this great technology, what could possibly go wrong? So, what kinds of things can we be thinking about and instituting – or implementing – to not have that problem?

Gang Hua: Microsoft has big efforts on GDPR. And I think that’s great, because this is a mechanism to ensure everything we produce actually aligns with certain regulations. On the other hand, everything needs to strike a balance between usability and security or privacy. If you think about it, when you use some online services, your activities basically leave traces there. That’s how they are used to better serve you in the future. If you want more convenience, sometimes you need to give a little bit of information out. But you don’t want to give all your information out, right? I think the boundary is actually not black and white. We simply need to carefully control it, so that we get just the right amount of information to serve the customer better, but not unlimited information, or information that users are not comfortable giving up…

Host: Right, so it seems like there’s a trend towards permissions and agency of the user to say, “I’m comfortable with this. But I’m not comfortable with that.”

Gang Hua: Mmm hmmm. Right.

Host: As we finish up here, Gang, talk about what you see on the horizon for the next generation of computer vision researchers. What are the big unsolved problems that might prompt exciting breakthroughs, or just be the grind for the next 10 years?

Gang Hua: That’s a great question and also a very big question. There are big problems we actually should tackle. If you think about it, computer vision now leverages statistical machine learning a lot. We can train recognition models which achieve great results, but that process is still largely appearance-based. So, we need to bring some of the fundamentals of computer vision, such as 3-D geometry, into the perception process, OK? And there are other things, especially when we are talking about video understanding. It’s a holistic problem where you need to do spatial-temporal reasoning, and we need to be able to factor more cognitive concepts into this process, like causal inference. If something happened, what really caused it to happen? Machine learning techniques mostly deal with correlation in data, OK? Correlation and causality are two totally different concepts. So, I feel that also needs to happen. And there are some fundamental problems, like learning from small data and even learning from language, that we potentially need to address. Think about how we humans learn: we learn in two ways. We learn from experience, but there is another factor. We learn from language. For example, while we are talking with each other, indeed, simply through language, I have already learned a lot from you, for example…

Host: And I you.

Gang Hua: Sure. You know, that’s a very compact information flow. We are now centrally focused on deep learning. If we look back ten, fifteen years ago, you see the computer vision community was more diverse. You saw all kinds of machine learning methods, all kinds of knowledge borrowed from physics, from optics, coming into this field to try to tackle the problem from multiple perspectives. As we are emphasizing diversity everywhere, I think the scientific community is going to be healthier if we have diverse perspectives and tackle the problem from multiple angles.

(music plays)

Host: You know, that’s great advice. Because as the community welcomes new researchers, they want to have big thinkers, broad thinkers, divergent thinkers, to sort of push for the next big breakthrough.

Gang Hua: Yeah. Exactly.

Host: Gang Hua, thank you for coming in. It’s been really illuminating, and I’ve really enjoyed our conversation.

Gang Hua: Thank you very much.

To learn more about Dr. Gang Hua and the amazing advances in computer vision, visit Microsoft.com/research
