Episode 80, June 12, 2019
You may not know who Dr. Andrew Fitzgibbon is, but if you’ve watched a TV show or movie in the last two decades, you’ve probably seen some of his work. An expert in 3D computer vision and graphics, and head of the new All Data AI group at Microsoft Research Cambridge, Dr. Fitzgibbon was instrumental in the development of Boujou, an Emmy Award-winning 3D camera tracker that lets filmmakers place virtual props, like the floating candles in Hogwarts School for Witchcraft and Wizardry, into live-action footage. But that was just his warm-up act.
On today’s podcast, Dr. Fitzgibbon tells us what he’s been working on since the Emmys in 2002, including body- and hand-tracking for powerhouse Microsoft technologies like Kinect for Xbox 360 and HoloLens, explains how research on dolphins helped build mathematical models for the human hand, and reminds us, once again, that the “secret sauce” to most innovation is often just good, old-fashioned hard work.
Transcript
Andrew Fitzgibbon: I do believe that there will be a future where we find it weird that we used to carry a flat screen in our pocket and pull out this flat screen to look at it in order to do our digital work. I think that sometime soon, when I refit my office, instead of setting up a bunch of LCD panels, I’ll just put a large black curved piece of plywood in front of me, and I’ll wear a HoloLens, and all my documents will appear in the real world in front of me. I absolutely believe that the screen in the pocket, or the screen attached to the desk, is going to be as weird as the phone attached to the building.
Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.
Host: You may not know who Dr. Andrew Fitzgibbon is, but if you’ve watched a TV show or movie in the last two decades, you’ve probably seen some of his work. An expert in 3D computer vision and graphics, and head of the new All Data AI group at Microsoft Research Cambridge, Dr. Fitzgibbon was instrumental in the development of Boujou, an Emmy Award-winning 3D camera tracker that lets filmmakers place virtual props, like the floating candles in Hogwarts School for Witchcraft and Wizardry, into live-action footage. But that was just his warm-up act.
On today’s podcast, Dr. Fitzgibbon tells us what he’s been working on since the Emmys in 2002, including body- and hand-tracking for powerhouse Microsoft technologies like Kinect for Xbox 360 and HoloLens, explains how research on dolphins helped build mathematical models for the human hand, and reminds us, once again, that the “secret sauce” to most innovation is often just good, old-fashioned hard work. That and much more on this episode of the Microsoft Research Podcast.
Host: Andrew Fitzgibbon, welcome to the podcast.
Andrew Fitzgibbon: Hi, Gretchen, great to be here.
Host: So, I usually start my podcasts by way of introduction, but I want to go a little off script with you because you’re funny. You’ve said that you situate your research at the intersection of computer vision and computer graphics with excursions into neuroscience. And at a more subatomic level, you characterize your work as, at its core, extracting information about the world from photons. So, give us a little more context about these excursions and extractions. How, in general, would you describe what you do, Andrew? What gets you up in the morning?
Andrew Fitzgibbon: So, what I want to do is make computers help us change the world. And we can change the world in a bunch of ways that are useful to humans. I’d like to think that I work on the technologies that underpin the ways we can make computers make the world better for us. A lot of what we do in what we now call AI – AI, the combination of machine learning, computer vision, natural language processing – it’s all about taking information from the world, and that information comes through sensors somehow, whether the sensors are the pixels in a camera, or the sensors in a microphone, or even the sensors in your keyboard when you type a tweet. These are all sources of information from the real world that we would like to do something good with. And for me, something good might be improving the computer graphics in Harry Potter. And I consider that good because it makes lots of humans happy. And in some sense, everything we do in life is about making humans happy.
Host: Well, tell me a little more about the excursions into neuroscience. When you’re talking about computer vision and computer graphics, it might not seem a natural leap, but I think it kind of is. Can you unpack that a little?
Andrew Fitzgibbon: Yes, I started collaborating when I was in Oxford with some of the neuroscientists who worked there. And we were interested in a fairly simple question, which is roughly, can humans point at things? Now, you think, obviously humans can point at things, but we don’t have a good theory for how the brain integrates 3D information. So, I was interested in purely providing the neuroscientists with a sort of a mathematical backup. I was saying, no one thinks the brain does it by multiplying matrices and vectors, but if it did, you would see these sort of error patterns. And then we went to the real world and looked at people wearing virtual reality headsets and the kind of pointing mistakes they make. And those are different from the patterns that a computer would make using today’s technology. So, we still don’t know what the brain does, but we have more evidence that it’s not the same as what the computer does.
Host: Hmm. All right, so let’s talk a little more about extracting information about the world from photons. That’s a very granular way of describing the computer vision work that you do. Could you explain that, technically, a little better?
Andrew Fitzgibbon: So, regular listeners to your podcast will know that computer vision is one of those problems that’s sort of easy to state, you know, does this picture contain a cat or not? But has turned out to be incredibly hard. And even in this age of deep learning successes with computer vision, we still know that it’s incredibly hard. So, my abstraction of “gather information from photons” is just to kind of stand back and say, why are we so excited about computer vision? Why are we so excited about these capabilities? And one of the things that computer vision does is it allows us to acquire information from far away. And of course, one can hear information from far away. But it’s a capability that allows us to do things like recognize people far away or to drive a car. And I want to think about it at that level of abstraction, because I want to always understand that my end goal is to do something real. If we think always about the end goal, I think we do a much better job of making progress on the fundamental research. The alternative is to say my interest is in understanding deep neural networks. I absolutely love trying to understand deep neural networks, and I do so at a theoretical level. But also, I want to know what the practical consequence of that understanding will be.
Host: Well, now that we’ve situated your research, let’s situate you. Until this week, I would have introduced you as a partner scientist in HoloLens, but you’ve recently been tapped to lead a group at MSR Cambridge called All Data AI, or ADA, which turns out to be an acronym that references Ada Lovelace. Was that intentional at all, and why, if it was?
Andrew Fitzgibbon: So yes, it was intentional in the sense that Ada Lovelace is sometimes considered the world’s first computer programmer. Whether or not she programmed computers, she certainly was the first person to observe that the computational powers of the analytical engine, or the difference engine, could be applied to quantities that were not numbers. Now, of course since the dawn of computers, we’ve represented quantities in the computer, like strings of characters, like words. They’re all numbers. So, what’s new about saying that it operates on something beyond numbers? Well, the thing that’s new is not so much that the computer understands a sequence of numbers, but that it can understand the interconnections between them and that is represented, fundamentally, by a computer science concept called a graph. And a graph is something that we can use to represent a wide variety of computer science concepts.
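Editor's note: for readers who have not met the term, the graph mentioned above can be sketched in a few lines. This is only an illustration, not code from the ADA group; the node names and the `reachable` helper are made up for the example. The dictionary stores which items connect to which, and the function follows those connections.

```python
from collections import deque

# Hypothetical example data: each key is a node, and each value lists
# the nodes it is directly connected to.
GRAPH = {
    "hand": ["thumb", "index finger"],
    "thumb": ["thumb tip"],
    "index finger": ["index tip"],
    "thumb tip": [],
    "index tip": [],
}


def reachable(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen


if __name__ == "__main__":
    print(reachable(GRAPH, "hand"))
```

The values stored at each node could be anything (numbers, words, images); what the graph adds is the record of how they interconnect.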
Host: Okay. So why now with this group at MSR Cambridge? Has there been a confluence of things that exist now that didn’t before?
Andrew Fitzgibbon: I think, of course, we have all observed the advances in AI due to the advances in machine learning, due to the advances in deep learning. And I think now, there’s a golden opportunity to apply that to a broader range of areas. It’s also a fantastic opportunity, here and now, for us to think about a new generation of AI programming. Until today, AI programming has been the domain of high priests and priestesses who have PhDs in machine learning and who understand linear algebra. And yet, we believe that, actually, a lot of AI programming could be a lot simpler. So, we’re looking at how we might think about a third generation of AI programming. The first generation would have been the raw work of Hinton and LeCun who just wrote the code, you know, it was difficult code to write. The second generation is this set of tools which go by the name of TensorFlow and PyTorch, again, which lots of listeners will have used or heard of.
Host: Right.
Andrew Fitzgibbon: And these tools have been, again, fantastic for democratizing AI. But they make it hard for somebody who’s just a great computer programmer to understand.
Host: Hmm.
Andrew Fitzgibbon: In my view, they somewhat hide the beauty of the AI models that are underneath. Neural network models are actually relatively simple. They’re simple models with complex consequences, and the research world is increasing its understanding of these complex consequences, but it’s also nice to just see the code, unadorned, and see how simple these models are.
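Editor's note: as a rough illustration of how little code a neural network model needs when written "unadorned," here is a minimal sketch of a two-layer network's forward pass in plain Python, with no framework. It is not code from the interview. The function names and the hand-picked weights in `EXAMPLE_PARAMS` are assumptions made purely so the example runs. Training, which is where much of the complexity lives, is left out.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Keep positive values and replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def sigmoid(value):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def tiny_network(inputs, params):
    """Forward pass: dense layer, ReLU, dense layer, sigmoid."""
    hidden = relu(dense(inputs, params["w1"], params["b1"]))
    (logit,) = dense(hidden, params["w2"], params["b2"])
    return sigmoid(logit)


# Hypothetical weights, chosen by hand only for illustration (not trained).
EXAMPLE_PARAMS = {
    "w1": [[1.0, -1.0], [-1.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [[1.0, 1.0]],
    "b2": [-0.5],
}

if __name__ == "__main__":
    print(tiny_network([1.0, 0.0], EXAMPLE_PARAMS))
    print(tiny_network([0.0, 0.0], EXAMPLE_PARAMS))
```

The whole model is a few weighted sums and two simple nonlinear functions; the "complex consequences" come from how such pieces behave when there are millions of weights learned from data.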
(music plays)
Host: Well, let’s go back a ways, before we talk about more current research, because you have some, what I would call “greatest hits” in your earlier work. One is called Boujou, a camera tracker that actually won an Emmy. And it’s been used in computer graphics and live-action footage in, what I’ve heard is pretty much every movie made since it was released in 2002. Tell us more about Boujou. How and where did it come about, what does it do, and how does it work technically?
Andrew Fitzgibbon: Boujou was a great fun project. I moved from the University of Edinburgh, where I did my PhD, to Oxford University in the mid-nineties. And I was working there with some amazing people who were interested in the question of how a robot might navigate its way around some building or around some environment. So, we worked hard on the “how does a robot navigate?” problem. And we discovered that one of the things the robot has to do in order to know where it’s been in three dimensions, is to build a three-dimensional model of the world. And we worked hard on making a beautiful 3D model, because we figured this would be useful, maybe, for those industries where, even today, it’s common for example to sculpt a car out of clay before building the computer model. Certainly, in those days, if you were going to have an alien in the movie – I think Pitch Black was one of the first aliens we looked at – they would make the alien out of clay and then try to scan it into the computer. So, we thought we had a great product. We were going to use our robot navigator, spin the camera around the alien model, pull that into the computer, and then give people a computer model of the 3D object. It turned out nobody in the movie industry was interested.
Host: Dang it.
Andrew Fitzgibbon: You had to – yeah, yeah! You had to spray paint the model with a toothbrush to get texture on it. It didn’t work in little corners. It was terrible. And they could just do better by using, you know, existing artists to create the models. But somebody at one of the effects companies kind of thought through how we must be getting this model, and said, how do you know where the cameras are? And we said, you know, well, it works it out. That’s part of how we get the 3D model. We work out how the cameras are. So, it turns out, something you really need in movies that’s really hard to do is to figure out where the cameras are. Now, what does that mean? That means maybe I have a camera mounted in a boat. And the boat contains some hobbits and is sailing down a river towards, you know, some impressive mountain. It would be just great if the impressive mountain had two huge statues pointing at the hobbits, but it turns out that nobody built those 400-meter statues, so what we have to do is build some statues back in the studio – and we built some 40-centimeter statues back in the studio – and if only we knew what motion the camera had undergone while it rocked in that boat moving down the river, we’d be able to make a robot do that same motion and then we could superimpose the images.
Host: Hmm.
Andrew Fitzgibbon: So, in those days, the only way to find out where the camera was, was – you could imagine trying to use markers or trying to use GPS… None of these things work. So, you would have to just sort of manually position the camera for every single shot. It was incredibly expensive. So, somebody who had worked on that kind of footage realized that our algorithms were producing this as a kind of unwanted side effect. We flipped – I guess nowadays, we would say we pivoted – uh, there was a startup, and Boujou was launched. And I’m super proud of it because it was one of the first products which did kind of 3D computer vision. There were products available then that would do character recognition, that would do number plate recognition, but this really did 3D vision. And we had to go beyond the academic state-of-the-art to deliver it. And we learned about delivering computer vision to the real world. Some of what we learned was, you just have to type in lots of code. People would ask me, you know, what’s the secret sauce in Boujou? And I would say, well, you know, have you read all the papers on 3D structure and motion? They would say, yeah, I’m pretty sure of them. I would say, yeah, that’s the secret sauce. We implemented all of them. And, you know, it’s an attitude that helps us today. Sometimes in research, you’re assessed by how beautiful and clean your algorithm is. Sometimes in the real world, you have to implement all the dirty algorithms until you find the beautiful one. But with Boujou, it just worked, and we had something that was actually useful.
Host: You know, on that note, I was going to ask you a question further down in the podcast, but I’m going to bump it up because it’s sort of tied in. I imagine you said it about this. You once said, “If I had to nominate one key to success, it’s a focus on everything.” Is that what you’re talking about?
Andrew Fitzgibbon: That’s exactly what I’m talking about. Yeah, exactly. There is no silver bullet. You know, you really have to focus on building a real thing. And that just means something that other humans would be happy to use. So, remember, in Boujou, we didn’t focus on the right thing, but the real thing we wanted to build was this 3D modeler. And we knew we would make a sort of a, what’s called a Wizard of Oz demo. We’d say, if this thing worked, it would make you a model that looks like this. Would you like it? And then, when the humans agree that that’s what they would like, then you have a target. And then of course, you may have to pivot, or you may deliver something that’s only half as good and discover that hey, that’s still useful. Or you may achieve everything you thought you needed to achieve and discover that actually, it needs to be twice as good. But I really like having a concrete goal to aim for.
Host: Well that’s a beautiful segue into the next topic I want to talk about, because Microsoft’s Kinect technology has been dubbed a failure that became a great success, and the science behind it has a fascinating history. You were there early on, and I’d love you to share some stories about Kinect with us. What’s your particular perspective on this technology? How did it come about, how has it evolved, and how has it impacted other areas of research you’re involved in?
Andrew Fitzgibbon: I love the Kinect story, because, in some sense, it’s a classic example of when academic style researchers meet engineers who really want to change the world. So, at Microsoft Research, we were looking at whether we could make computer vision algorithms that would be able to follow the movements of the human body. And we were working with the academic research field and we were doing pretty well, and we had good results. So one day, the Xbox people, Alex Kipman, who was working in Xbox then, came to us and said, we’ve had this great idea for a video game where you’re going to like, uh, recognize the motion of a human in a camera, and then it’s going to control the games, and it’s going to be amazing. And we said, I’m glad you asked us that, because we’re actually the world experts on this, and I can tell you it’s not going to work. So, then they said, oh, yeah, it’s funny you say that, because look at this program we wrote. And they showed us their version of it, and their version was better than anything in the academic literature.
Host: Oh my gosh.
Andrew Fitzgibbon: Some genius programmers had put together an amazing demo of how it would work. What was amazing was, it was using an idea from the academic literature, but they had engineered it so well and made it so effective and really worked hard on it.
Host: Hmm.
Andrew Fitzgibbon: The reason that idea wasn’t very popular with academics at the time was that it’s pretty hard to take a single image from a camera, identify the human in the image, and then list off where the hands are, where the elbows are… But supposing you already had an image from, let’s say, 30 milliseconds ago because it’s a video sequence, and you already knew, in the image from 30 milliseconds ago, where everything was, then you could just simply say to yourself, well, they can’t have gone far; they were there 30 milliseconds ago, you know, and find where everything was. So, we knew how to do that in academia, but what no one had done is really, really worked hard on that, really kind of tried to moonshot that and make it really work. And our contribution was to observe that, okay, you’ve got a system that works 99 percent of the time, assuming it was right 30 milliseconds ago. There’s a calculation you can do which tells you that system will definitely fail after five minutes. And this is what they observed, and they knew that that would happen. And you could design, again, around that, you know, ran in sort of three-minute sections and then reset itself. But, ideally, you would have the system just not make mistakes like that. So, our first contribution was to say, what we need to do here is just basically have the system every couple of seconds, kind of reset itself. And we were using machine learning, and this was kind of an early instance of machine learning really being applied to one of these hard computer vision problems. So, we said to them, okay, we could maybe do it, maybe… But, you know, we would need real-world examples of this thing running in ten different living rooms across the planet in order to even know if we’re doing well, not to mind train our machine learning algorithm. And then the horrifying moment two weeks later when we were on a call and they said, yup, we’ve got ten people in living rooms across the planet! 
Our Japan people are finishing up there tomorrow, and then they’re moving over to China… so suddenly we realized, okay, these people are really serious. And then, when we needed to hire a Hollywood studio to generate training data, they hired a Hollywood studio to generate training data. So, there was just a huge amount of vision there which… we were saying this stuff, because it seemed right, but really, no one had done it to that level before, and the reason was, they understood what the machine learning was doing. They figured it works from examples. If you don’t have enough examples, no brainer, let’s get the examples. Whereas, academics would always be, I’m happy to spend a year building a better theory. Whereas with Kinect, the partnership with people who really wanted to get stuff done – maybe that’s where I’ve inherited some of this – allowed us to make really fantastic progress.
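Editor's note: the calculation mentioned above, about a tracker that depends on having been right 30 milliseconds ago, can be sketched as follows. This is a back-of-the-envelope illustration, not the team's actual analysis. It assumes that every frame succeeds independently with the same probability, and it takes the "99 percent" figure literally as a per-frame success rate; both are assumptions added here.

```python
FRAME_INTERVAL_S = 0.030  # from the interview: a new frame roughly every 30 ms


def survival_probability(per_frame_success, seconds):
    """Chance that a tracker relying on the previous frame never fails
    over the given duration (assumes independent, identical frames)."""
    frames = round(seconds / FRAME_INTERVAL_S)
    return per_frame_success ** frames


def expected_seconds_to_failure(per_frame_success):
    """Mean time until the first failed frame (geometric distribution)."""
    return FRAME_INTERVAL_S / (1.0 - per_frame_success)


if __name__ == "__main__":
    for p in (0.99, 0.999, 0.9999):
        print(p, survival_probability(p, 300), expected_seconds_to_failure(p))
```

Under these assumptions a 99 percent per-frame tracker is expected to lose track after about three seconds, and the chance of it surviving five minutes is effectively zero. Because errors compound frame after frame, a periodic reset that does not depend on the previous frame is what breaks the chain.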
(music plays)
Host: Microsoft’s HoloLens is another computer vision technology that you’ve had a lot to do with. So, give us the “Andrew Fitzgibbon take” on HoloLens and its journey from birth until now, with the release of HoloLens 2. What have you discovered about its capabilities over the years, and what do you think HoloLens has contributed to the computer vision research community?
Andrew Fitzgibbon: HoloLens is an amazing device. It came out of the same team that we worked with on Kinect, Alex Kipman’s team. And, at one level, HoloLens is exactly what Kipman said it was when they announced it first, three, maybe four years ago. He said, this is the future of the PC. And in one sense, I do believe that there will be a future where we find it weird that we used to carry a flat screen in our pocket and pull out this flat screen to look at it in order to do our digital work. I think that sometime soon, when I refit my office, instead of setting up a bunch of LCD panels, I’ll just put a large black curved piece of plywood in front of me, and I’ll wear a HoloLens, and all my documents will appear in the real world in front of me. I absolutely believe that the screen in the pocket, or the screen attached to the desk, is going to be as weird as the phone attached to the building. So that’s far future. That’s when HoloLens has the form factor of a small set of glasses. But towards that future, HoloLens today is amazingly valuable for real people doing real work. Because HoloLens lives in a 3D world, it has lots of 3D vision in it. One of the pieces that I’m incredibly impressed by in HoloLens because I didn’t work on it, does the job of figuring out where your head is in the 3D world. This is related to work I did, you know, on Boujou many years ago. But on HoloLens, it does it all the time, in real time, on an incredibly low-power device, so that’s a beautiful piece of technology that again, I think very few people other than Microsoft could have put together. My work on HoloLens stemmed from a piece of very blue skies research we did almost ten years ago now. We – that’s me and a friend called Tom Cashman, who arrived as an intern – we decided we wanted to learn about the 3D structure of stuff that moves. What’s stuff that moves? Well, the human body is something that moves, and we knew a bit about the 3D structure of the human body from Kinect. 
But we wanted to learn about this 3D structure just from still images. So, we had to think of something that’s kind of bendy and movey, but that is somehow not too bendy and not too movey. So, we realized that dolphins were the ideal thing to work on. So, we decided to write a paper called, What Shape are Dolphins? Now, we didn’t really care about dolphins. But we cared about bendy, movey 3D stuff. And we worked on that paper in order to build mathematical models of 3D. We thought, well, why did we want to know about 3D? Well, when you’re interacting with the virtual world, or the mixed reality world and the HoloLens, one of the important 3D objects is the human hand. If the system can look at your hands and fully accurately determine the position of every bone and knuckle in the hand, then you can use your hands in the virtual world to pick up virtual objects. And the virtual objects behave exactly as you would expect from real-world physics. So, the research on dolphins became research on the slightly more bendy object that is the human hand. And then we knew that this technology was useful for something. HoloLens was being developed. So, we thought, let’s see if we can deliver this dream of real-world physics to the HoloLens. And we were extremely happy to see announced recently at Mobile World Congress in Barcelona, HoloLens 2 with fully articulated hand-tracking.
Host: Let’s shift over and talk about ADA a little bit more deeply and talk about specific problems you want to tackle and what technical ground you hope to break in the projects you want to take on. What’s on your roadmap, Andrew?
Andrew Fitzgibbon: So, ADA is about permeating AI into the parts of the world where we almost don’t yet know we need it. Today, we still can’t ask questions of the internet like, find me all the ski chalets within a hundred meters of the slopes, right? That’s a hard question. Why is it a hard question? Because all the ski chalets have their own little website. It’s not necessarily aggregated. And the way to answer that question is very easy: you should just read and understand all the web pages in the world and be able to answer any questions about them. So, a computer that can really read and understand all the web pages in the world is clearly, in some sense, you know, sort of infinitely far away. So, the idea is that throughout computer science, there are areas where we can permeate, using AI and machine learning, to deliver systems that work better. So, for example, on the HoloLens, we had to take our hand-tracking code which we wrote as standard, you know, computer vision researcher code, and make it maybe five hundred times more efficient in order to run on that low-power device.
Host: Right.
Andrew Fitzgibbon: Now, if we could achieve that for general code, then maybe we could make, you know, five hundred times more efficient the code we run in datacenters around the world. Maybe we could achieve five hundred times as much and maybe we could save energy. If we had capabilities in user interfaces that allowed us to adapt to what the user is doing, but not be annoying – and this is a crucial combination – then of course, we would have much better capabilities, of course we would have happier users, happier humans. If we think about the challenge of maybe adapting user interfaces so that my user interface works better for me and yours works better for you, one of the things that we want to do is learn from a very small number of signals. So maybe, every time you switch on your computer, you know, you arrange the windows in a certain order. We already have systems that will try to learn your preferences and try to, uh, do that for you. But what we know today is that a lot of the time, the systems that do that are a bit annoying. Why are they a bit annoying? Because they don’t have the human-level understanding. Or this person’s in a hurry. This person’s using a different machine. They’re in a different context. They’re at home, not work. How could a system really understand, from small amounts of training data, what it should do and when it should do nothing? And these are areas which I think we’re sort of touching on today. There are whole areas of AI research that our team looks at, and they’re different questions. In all of them, again, there’s this tradeoff between a fundamental research angle and how are we actually going to demonstrate our AI to the world?
(music plays)
Host: So, Andrew, I’ve heard you use the phrase, change the world, or a better world. And, and these are the sort of end goals. So, there are a lot of things that could go right if you’re successful. But we have to talk, at least a little bit, about what could possibly go wrong. So, with all of the possibilities that you’re describing, they seem to pose new kinds of social protocols that we’ll have to develop and adapt with things like advanced 3D computer vision and head-mounted computers and cameras galore… Is there anything that keeps you up at night?
Andrew Fitzgibbon: There are things where I think I have some idea what the answer’s going to be, and then there are things where I hope I have some idea what the answer’s going to be! Let me do one of the first ones. When I describe a world where everybody’s glasses have the possibility of projecting 3D content into the world in front of them, you can immediately think, well, this is terrible. People are going to just, you know, spend their entire time looking at the content and not interacting with the other person in front of them. But I think we already have social protocols that solve that today. So, if you and I are talking and there’s a TV over to the right, maybe I’ll turn to look at the TV. It’s rude, but maybe you understand something interesting must have happened on the TV and we both turn and maybe it’s something we’ll discuss, and we’ll get back to chatting. We also have sort of protocols, it’s rude for me to take out my phone and look at it when I’m talking with you. But you understand if I tell you, well, I’m just going to look that up or I’m just going to figure out what’s happening in that meeting.
Host: Mm-hmm.
Andrew Fitzgibbon: So, I’m not worried, with something like HoloLens, that we won’t easily develop these protocols. We might need some technical solutions like, when there’s something in my 3D world, maybe it’s, you know, my email client is maybe sitting on the desk in front of me. You can see that it’s my email client, but of course you can’t read the content within it. So… But these are technical problems we can solve. The social protocols, I’m confident, there, we will develop. And I’m not concerned about that aspect. But you might well say, okay, but you should be concerned about maybe a world where the cameras on my device are revealing information that I wouldn’t like revealed to other people in the world. And I’m happy that we, the AI community, essentially, are talking about this and thinking about ways in which we can ensure the security of our data so that any person can have a concrete understanding about what information’s leaving the device, how well encrypted information is used, and that we can all have protocols for how we dispose of information when we don’t need it. Another exciting thing about the HoloLens is that, for example, in hand-tracking, none of the images that the cameras take leave the device. And today, we are looking at a number of ways in which I can securely send your information to the cloud, securely do machine learning on it and send the answer back, and then delete the information and guarantee that to the customer. So that’s one aspect where we, as Microsoft, have just devoted an awful lot of effort to thinking about security and privacy of that information. Another side of security is trust. We earn that trust obviously by, you know, having behaved well in the past, by having strong statements about what we’re going to keep and what we’re not going to keep. And also, by the strength of our research in aspects like secure computation and secure machine learning.
Host: It’s story time, and you’ve got a good one. Tell us what got a young Andrew Fitzgibbon interested in computer science, and what was your path to Microsoft Research?
Andrew Fitzgibbon: Gosh. It’s a long one… Um… I grew up in the eighties. I liked mathematics. I liked messing around with electronics, because we didn’t really have computers then. We had computers at school. One of my summer jobs was as a water taxi driver. And being a water taxi driver means you’re incredibly busy from about 9:00 a.m. to 11:00 a.m. when all the boats go out to the sea, and then you’re incredibly busy from about 3:00 p.m. to about 6:00 p.m. or maybe 7:00 when everyone comes back. So, you’ve got a huge amount of time sitting out in the sun, or maybe the rain, during the day where you’re just kind of killing time. And one of the things you can do is read. And of course, if, like many of your listeners, you love mathematics, you can play around with mathematics. But what was also great back then was, because you didn’t have a computer beside you, I would write programs that I would then try and type in the following day when I got to school. And that was just a great way to sort of, in a very casual way, learn about computer programming. So, I would scratch out a program on paper, line ten, do this, line twenty, do that, or later I would write in assembly language the little bits that needed doing. And it’s kind of therapeutic. It’s like solving mathematical puzzles, right? Mathematical puzzles are great, but they’re hard to invent for yourself. So, you might buy a book of mathematical puzzles and then solve them all after a few weeks. Whereas with a computer program, you’re of course continually inventing the puzzles for yourself.
Host: Right. So, from that to job as a water taxi driver, then what, then where?
Andrew Fitzgibbon: So, I did mathematics and computer science at university. I wanted to do physics, but the physics department told me it would be much too hard to do physics with computer science. And someone told me later that also, they weren’t that keen on people with blue hair. So, I ended up doing mathematics and computer science. The mathematics department were much more enlightened. And I was sitting in a topology lecture one day and thinking, what would this stuff be useful for? And I thought, oh, maybe it will be useful for recognizing the shapes of letters. And basically, I went and found myself a master’s course that had something to do with computer vision. And I went to Heriot-Watt University in Edinburgh, which was running a fantastic course on sort of – they called it knowledge-based systems. Nowadays, you would think of it as an introductory AI course. So, I finished my master’s, and then I got a job as a programmer in Edinburgh University. I got that job in 1989. But then, somehow by 1997, somewhat accidentally, I ended up with a PhD. I remember now why that happened. I was sitting in some meeting as the programmer, and they were talking about their research. And I said something like, well, has anybody tried just like rendering five hundred views of the thing and then using that? And someone said, oh, you should write a paper on that. And I thought it was a joke, and of course it turned out that that was my first paper. It was one of these things where, let’s use brute force instead of mathematics and see what happens. And from then on, you know, I hope I’ve managed to do elegant mathematics as well as brute force, but I think a practical mindset was there even from then.
Host: Well, sometimes you need tweezers and sometimes you need a hammer.
Andrew Fitzgibbon: Right, exactly, exactly.
Host: All right, so is that your accidental PhD? And then what?
Andrew Fitzgibbon: So, the PhD was part-time, with a research assistant job at Edinburgh. And that job was coming to an end, or at least I thought it was coming to an end because the contract was due to end at a certain date, and of course, like a very clever person, I forgot to ask anybody whether that was true, so I started looking for another job. I ended up in Oxford working with the great computer vision researcher of the UK, Andrew Zisserman, and a bunch of other amazing people. And there again, I learned the interaction of mathematics and code. I worked there. That led to Boujou, which we have mentioned earlier. And then, about mid-2000s, I moved to Microsoft Research.
Host: Right. Uh, you did a bit of a drive-by on a reference to blue hair. It was the eighties, but did you, indeed, have blue hair?
Andrew Fitzgibbon: Oh, yes, I indeed had blue hair, for much of my undergraduate career. It was varying degrees of blue, and then sometimes would turn into green when the bleach wore out.
Host: Was it ever Smurf blue?
Andrew Fitzgibbon: No, it was kind of a dark, electric blue.
Host: Oh.
Andrew Fitzgibbon: At least that was the aim.
Host: Well, at the end of every podcast, I give my guests a chance to say anything they want to our listeners. And sometimes it’s general advice or wisdom or inspiration. Other times, it’s specific challenges or open problems in the field. So, what would you like to say to emerging researchers?
Andrew Fitzgibbon: People sometimes ask, what problem should I research? And, you know, there’s sort of a simple, general answer: solve important problems that will change the world. But sometimes that’s too big, and you don’t really have an idea what the big, important problems are. A signal that I found valuable is, when you’re listening to a talk or reading a paper, find something that annoys you. Find something where you think, really? That can’t be the right way to do this. Then – and this is crucial – ask yourself really, why is this annoying me? And then find a real-world example where the thing that’s annoying you is going to go wrong. A real-world example just needs to be something that you know we should be able to do, none of the existing technologies can do it, and you’ve got an idea. And I think that has been a very valuable source of inspiration for the times when you don’t have a big idea. Sometimes you just have a big idea, and that’s great. Go for it.
Host: And don’t forget to focus on everything.
Andrew Fitzgibbon: Focus on everything. That’s – that’s the best idea!
Host: Andrew Fitzgibbon, thank you for joining us from Cambridge today.
Andrew Fitzgibbon: Thank you. It’s an honor to have been here.
(music plays)
To learn more about Dr. Andrew Fitzgibbon and the latest in 3D computer vision and All Data AI, visit Microsoft.com/research