Powerful new large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence—and a signal of accelerating progress to come.
In this new Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these new models—and the models that will come next—mean for our approach to creating, understanding, and deploying AI, its applications in areas such as health care and education, and its potential to benefit humanity.
The first episode features Sébastien Bubeck, who leads the Machine Learning Foundations group at Microsoft Research in Redmond. He and his collaborators conducted an extensive evaluation of GPT-4 while it was in development, and have published their findings in a paper that explores its capabilities and limitations—noting that it shows “sparks” of artificial general intelligence.
Learn more:
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 (Publication, March 2023)
- AI and Microsoft Research: Learn more about the breadth of AI research at Microsoft
- Unveiling Transformers with LEGO: a synthetic reasoning task (Publication, June 2022)
Transcript
Ashley Llorens: I’m Ashley Llorens with Microsoft Research. I spent the last 20 years working in AI and machine learning. But I’ve never felt more fortunate to work in the field than at this moment. Just this month, March 2023, OpenAI announced GPT-4, a powerful new large-scale AI model with dramatic improvements in reasoning, problem-solving, and much more. This model, and the models that will come after it, represent a phase change in the decades-long pursuit of artificial intelligence.
In this podcast series, I’ll share conversations with fellow researchers about our initial impressions of GPT-4, the nature of intelligence, and ultimately how innovations like these can have the greatest benefit for humanity.
Today I’m sitting down with Sébastien Bubeck, who leads the Machine Learning Foundations Group at Microsoft Research. In recent months, some of us at Microsoft had the extraordinary privilege of early access to GPT-4. We took the opportunity to dive deep into its remarkable reasoning, problem-solving, and the many other abilities that emerge from the massive scale of GPT-4.
Sébastien and his team took this opportunity to probe the model in new ways to gain insight into the nature of its intelligence. Sébastien and his collaborators have shared some of their observations in the new paper called “Sparks of Artificial General Intelligence: Early experiments with GPT-4.”
Welcome to AI Frontiers.
Sébastien, I’m excited for this discussion.
The place that I want to start is with what I call the AI moment. So, what do I mean by that? In my experience, everyone that’s picked up and played with the latest wave of large-scale AI models, whether it’s ChatGPT or the more powerful models coming after, has a moment.
They have a moment where they’re genuinely surprised by what the models are capable of, by the experience of the model, the apparent intelligence of the model. And in my observation, the intensity of the reaction is more or less universal. Although everyone comes at it from their own perspective, it triggers its own unique range of emotions, from awe to skepticism.
So now, I’d love from your perspective, the perspective of a machine learning theorist: what was that moment like for you?
Sébastien Bubeck: That’s a great question to start. So, when we started playing with the model, we did what I think anyone would do. We started to ask mathematical questions, mathematical puzzles. We asked it to give some poetry analysis. Peter Lee did one on Black Thought, which was very intriguing. But every time we were left wondering, okay, but maybe it’s out there on the internet. Maybe it’s just doing some kind of pattern matching and it’s finding a little bit of structure. But this is not real intelligence. It cannot be. How could it be real intelligence when it’s such simple components coming together? So, for me, I think the awestruck moment was one night when I woke up and I turned on my laptop and fired up the Playground.
And I have a three-year-old at home, my daughter, who is a huge fan of unicorns. And I was just wondering, you know what? Let’s ask GPT-4 if it can draw a unicorn. And in my professional life, I play a lot with LaTeX, this programming language for writing mathematics. And in LaTeX there is this sub-language called TikZ to draw images using code. And so I just asked it: can you draw a unicorn in TikZ? And it did it so beautifully. It was really amazing. You can render it and you can see the unicorn. And no, it wasn’t a perfect unicorn.
What was amazing is that it drew a unicorn, which was quite abstract. It was really the concept of a unicorn, all the bits and pieces of what makes a unicorn, the horn, the tail, the fur, et cetera. And this is what really struck me at that moment. First of all, there is no unicorn in TikZ online.
I mean, who would draw a unicorn in a mathematical language? This doesn’t make any sense. So, there is no unicorn online. I was pretty sure of that. And then we did further experiments to confirm that, and we’re sure that it really drew the unicorn by itself. But what really struck me is this getting at the concept of a unicorn: that there is a head, a horn, the legs, et cetera.
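To make the idea of drawing a concept from its parts concrete, here is a minimal sketch in Python with matplotlib rather than TikZ. It is only an illustration of composing a figure from primitive shapes; it is not GPT-4’s output, and the shapes and coordinates are arbitrary.

```python
# A toy composition of "unicorn parts" from geometric primitives, in the
# spirit of the TikZ exercise described above. Not GPT-4's actual output.
import matplotlib.pyplot as plt
from matplotlib.patches import Circle, Ellipse, Polygon

fig, ax = plt.subplots()
ax.add_patch(Ellipse((0.5, 0.45), 0.45, 0.25, color="lavender"))    # body
ax.add_patch(Circle((0.78, 0.62), 0.09, color="lavender"))          # head
ax.add_patch(Polygon([[0.82, 0.70], [0.86, 0.70], [0.84, 0.85]],
                     color="gold"))                                  # horn
for x in (0.35, 0.45, 0.58, 0.68):                                   # legs
    ax.plot([x, x], [0.35, 0.18], color="gray", linewidth=3)
ax.plot([0.28, 0.18], [0.48, 0.55], color="violet", linewidth=3)     # tail
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_aspect("equal")
ax.axis("off")
plt.show()
```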
This has been a longstanding challenge for AI research. This has always been the problem with all those AI systems that came before, like the convolutional neural networks that were trained on ImageNet and other image datasets and that can recognize whether there is a cat or a dog in the image, et cetera. Those neural networks were always hard to interpret, and it was not clear how exactly they were detecting whether there is a cat or a dog. In particular, they were susceptible to adversarial examples: small perturbations to the input that would completely change the output.
And it was understood that the big issue is that they didn’t really get the concept of a cat or dog. And then suddenly with GPT-4, it was kind of clear to me at that moment that it really understood something. It really understands what is a unicorn. So that was the moment for me.
Ashley Llorens: That’s fascinating. What did you feel in that moment? Does that change your concept of your field of study, your relationship to the field?
Sébastien Bubeck: It really changed a lot of things to me. So first of all, I never thought that I would live to see what I would call a real artificial intelligence. Of course, we’ve been talking about AI for many decades now. And the AI revolution in some sense has been happening for a decade already.
But I would argue that all the systems before were really this narrow intelligence, which does not really rise to the level of what I would call intelligence. Here, we’re really facing something which is much more general and really feels like intelligence. So, at that moment, I felt honestly lucky. I felt lucky that I had early access to this system, that I could be one of the first human beings to play with it.
And I saw that this is really going to change the world dramatically. And selfishly, (it) is going to change my field of study, as you were saying. Now suddenly we can start to attack: what is intelligence, really? We can start to approach this question, which seemed completely out of reach before.
So really deep down inside me, incredible excitement. That’s really what I felt. Then upon reflection, in the next few days, there was also some worry, of course. Clearly things are accelerating dramatically. Not only did I never think that I would live to see a real artificial intelligence, but the timeline that I had in mind ten years ago or 15 years ago when I was a Ph.D. student, I saw maybe by the end of the decade, the 2010s, maybe at that time, we will have a system that can play Go better than humans.
That was my target. And maybe 20 years after that, we will have systems that can do language. And maybe somewhere in between, we will have systems that can play multiplayer games like Starcraft II or Dota 2. All of those things got compressed into the 2010s.
And by the end of the 2010s, we had basically solved language, in a way, with GPT-3. And now we enter the 2020s, and suddenly there is something totally unexpected, something that was nowhere in the timeline I had imagined for my life and professional career: intelligence in our hands. So, it’s just changing everything, and with this compressed timeline, I do worry about where this is going.
There are still fundamental limitations that I’m sure we’re going to talk about. And it’s not clear whether the acceleration is going to keep going. But if it does keep going, it’s going to challenge a lot of things for us as human beings.
Ashley Llorens: As someone that’s been in the field for a while myself, I had a very similar reaction where I felt like I was interacting with a real intelligence, like something deserving of the name artificial intelligence—AI. What does that mean to you? What does it mean to have real intelligence?
Sébastien Bubeck: It’s a tough question, because, of course, intelligence has been studied for many decades. And psychologists have developed tests of your level of intelligence. But in a way, I feel intelligence is still something very mysterious. It’s kind of—we recognize it when we see it. But it’s very hard to define.
And what I’m hoping, with this system, what I want to argue, is that basically it was very hard before to study what intelligence is, because we had only one example of intelligence. What is this one example? I’m not necessarily talking about human beings, but more about natural intelligence. By that, I mean intelligence that happened on planet Earth through billions of years of evolution.
This is one type of intelligence. And this was the only example of intelligence that we had access to. And so all our theories were fine-tuned to that example of intelligence. Now, I feel that we have a new system which I believe rises to the level of being called an intelligent system. We suddenly have two examples which are very different.
GPT-4’s intelligence is comparable to human in some ways, but it’s also very, very different. It can both solve Olympiad-level mathematical problems and also make elementary school mistakes when adding two numbers. So, it’s clearly not human-like intelligence. It’s a different type of intelligence. And of course, because it came about through a very different process than natural evolution, you could argue that it came about through a process which you could call artificial evolution.
And so I’m hoping that now that we have those two different examples of intelligence, maybe we can start to make progress on defining it and understanding what it is. That was a long-winded answer to your question, but I don’t know how to put it differently.
Basically, the way for me to test intelligence is to really ask creative questions, difficult questions whose answers you do not find online through search. In a way, you could ask: is Bing, is Google, are search engines intelligent? They can answer tough questions. Are these intelligent systems? Of course not. Everybody would say no.
So, you have to distinguish: what is it that makes us say that GPT-4 is an intelligent system? Is it just the fact that it can answer many questions? No, it’s more that it can inspect its answers. It can explain itself. It can interact with you. You can have a discussion. This interaction is really of the essence of intelligence to me.
Ashley Llorens: It certainly is a provocative and unsolved kind of question: what is intelligence? And perhaps equally mysterious is how we actually measure intelligence, which is a challenge even for humans. I’m reminded of that with young kids in the school system, as I know you are or will be soon here as a father.
But you’ve had to think differently as you’ve tried to measure the intelligence of GPT-4. And you alluded to…I’d say the prevailing way that we’ve gone about measuring the intelligence of AI systems or intelligent systems is through this process of benchmarking, and you and your team have taken a very different approach.
Can you maybe contrast those?
Sébastien Bubeck: Of course, yeah. So maybe let me start with an example. So, we used GPT-4 to pass mock interviews for software engineer positions at Amazon, at Google, and at Meta. It passes all of those interviews very easily. Not only does it pass those interviews, but it also ranks in the very top of the human beings.
In fact, for the Amazon interview, not only did it pass all the questions, but it scored better than 100% of all the human users on that website. So, this is really incredible. And headlines would be, GPT-4 can be hired as a software engineer at Amazon. But this is a little bit misleading to view it that way because those tests, they were designed for human beings.
They make a lot of hidden assumptions about the person that they are interviewing. In particular, they will not test whether that person has a memory from one day to the next. Of course, human beings remember the next day what they did, unless there is some very terrible problem.
So all those benchmarks of intelligence face this issue: they were designed to test human beings. So, we have to find new ways to test intelligence when we’re talking about the intelligence of AI systems. That’s point number one. Point number two is that, so far, in the machine learning tradition, we have developed lots of benchmarks to test narrow AI systems.
This is how the machine learning community has made progress over the decades: by beating benchmarks, by having systems that keep improving, percentage by percentage, over those target benchmarks. Now, all of those become kind of irrelevant in the era of GPT-4, for two reasons. Number one is that with GPT-4, we don’t know exactly what data it was trained on, and in particular it might have seen all of these datasets.
So really you cannot separate the training data and the test data anymore. This is not a meaningful way to test something like GPT-4 because it might have seen everything. For example, Google came out with a suite of benchmarks, which they called BIG-bench, and in there they hid a secret code to check whether a model has seen the benchmark data, and of course GPT-4 knows this code.
So, it has seen all of BIG-bench. You just cannot benchmark it against BIG-bench. So, that’s problem number one for the classical ML benchmarks. Problem number two is that all those benchmarks are just too easy. It’s just too easy for GPT-4. It crushes all of them, hands down. Very, very easily.
In fact, it’s the same thing for the medical license exam or a multistate bar exam. All of those things it just passes very, very easily. And the reason we have to go beyond the classical ML benchmarks is that we really have to test the generative abilities, the interaction abilities. How is it able to interact with human beings? How is it able to interact with tools?
How creative can it be at a task? All of those questions are very hard to benchmark, in part because there is no single right solution. Now, of course, the ML community has grappled with this problem recently, because generative AI has been in the works for a few years now, but the answers are still very tentative.
Just to give you an example, imagine that you want to have a benchmark where you describe a movie and you want to write a movie review. Let’s say, for example, you want to tell the system, write a positive movie review about this movie. Okay. The problem is in your benchmark. In the data, you will have examples of those reviews. And then you ask your system to write its own review, which might be very different from what you have in your training data. So, the question is, is it better to write something different or is it worse? Do you have to match what was in the training data? Maybe GPT-4 is so good that it’s going to write something better than what the humans wrote.
And in fact, we have seen many, many times that the training data was crafted by humans and GPT-4 just does a better job at it. So, it gives better labels, if you want, than what the humans did. You cannot even compare it to humans anymore. So, this is a problem that we are facing as we are writing our paper, trying to assess GPT-4’s intelligence.
Ashley Llorens: Give me an example where the model is actually better than the humans.
Sébastien Bubeck: Sure. I mean, let me think of a good one. Coding: it is absolutely superhuman at coding. We already alluded to this, and this is going to have tremendous implications. But really, coding is incredible. So, for example, going back to the example of movie reviews, there is this IMDB dataset, which is very popular in machine learning, where you can ask many of the basic questions that you might want to ask.
But now, in the era of GPT-4, you can give it the IMDB dataset and just ask GPT-4: can you explore the dataset? And it’s going to come up with suggestions of data analysis ideas. Maybe it would say, maybe we want to do some clustering, maybe we want to cluster by movie director, and we would see which movies were the most popular and why.
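For instance, a data-analysis suggestion of that kind might boil down to something like the following pandas sketch. The file name and the column names (director, rating) are illustrative assumptions, not the actual IMDB schema, and the code is the kind of thing one might ask the model to write rather than anything from the conversation.

```python
# A minimal sketch of the kind of exploratory analysis described above.
# "movies.csv", "director", and "rating" are hypothetical names.
import pandas as pd

movies = pd.read_csv("movies.csv")

# Cluster the data by director and see whose movies are most popular.
by_director = (
    movies.groupby("director")["rating"]
          .agg(["count", "mean"])
          .sort_values("mean", ascending=False)
)
print(by_director.head(10))
```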
It can come up creatively with its own analysis. So that’s one aspect: coding and data analysis, where it can very easily be superhuman. I think its writing capabilities are also just astounding. For example, in the paper, we asked it many times to rewrite parts of what we wrote, and it rewrites them in a much more lyrical, poetic way.
You can ask for any kind of style that you want. I would say, at least to my novice eyes, that it’s at the level of some of the best authors out there. It has its own style, and this is really native. You don’t have to do anything.
Ashley Llorens: Yeah, it does remind me a little bit of the AlphaGo moment or maybe more specifically the AlphaZero moment, where all of a sudden, you kind of leave the human training data behind and you’re entering into a realm where it’s its only real competition. You talked about the evolution that we need to have of how we measure intelligence, from ways of measuring narrow or specialized intelligence to measuring more general kinds of intelligence.
And we’ve had these narrow benchmarks. You see a lot of this: kind of pass-the-bar-exam, these kinds of human intelligence measures. But what happens when all of those are also too easy? How do we think about measurement and assessment in that regime?
Sébastien Bubeck: So, of course, maybe it’s a good point to also bring up the limitations of the system. Right now, a very clear frontier that GPT-4 is not stepping over is to produce new knowledge, to discover new things: for example, let’s say in mathematics, to prove mathematical theorems that humans do not know how to prove.
Right now, the systems cannot do it. And this, I think, would be a very clean and clear demonstration, where there is just no ambiguity, once it can start to produce this new knowledge. Now, of course, whether it’s going to happen or not is an open question. I personally believe it’s plausible. I am not 100 percent sure it’s going to happen, but I believe it is plausible that it will happen.
But then there might be another question, which is what happens if the proof that it produces becomes inscrutable to human beings. Mathematics is not only this abstract thing, but it’s also a language between humans. Of course, at the end of the day, you can come back to the axioms, but that’s not the way we humans do mathematics.
So, what happens if, let’s say, GPT-5 proves the Riemann hypothesis and it is formally proved? Maybe it gives the proof in the LEAN language, which is a formalization of mathematics, and you can formally verify that the proof is correct. But no human being is able to understand the concepts that were introduced.
What does it mean? Is the Riemann hypothesis really proven? I guess it is proven, but is that really what we human beings wanted? So this kind of question might be on the horizon. And that I think ultimately might be the real test of intelligence.
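For readers unfamiliar with proof assistants, here is a trivial, machine-checkable statement and proof in Lean. It is only meant to illustrate what “formally verified” means; stating the Riemann hypothesis in this style, let alone proving it, is vastly harder, and nothing in this snippet comes from the conversation.

```lean
-- A trivial theorem and its proof, checkable by the Lean proof assistant.
-- "Formally proved" means the kernel accepts a term like this for the
-- statement in question, whether or not any human finds it illuminating.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```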
Ashley Llorens: Let’s stick with this category of the limitations of the model. And you kind of drew a line here in terms of producing new knowledge. You offered one example of that as proving mathematical theorems. What are some of the other limitations that you’ve discovered?
Sébastien Bubeck: So, GPT-4 is a large language model which was trained on the next-word-prediction objective function. So, what does that mean? It just means you give it a partial text and it tries to predict what the next word in that partial text is going to be. Once you want to generate content, you just keep doing that on the text that you’re producing. So, you’re producing words one by one. Now, it’s a question that I have been reflecting upon myself since I saw GPT-4: whether human beings are thinking like this. I mean, it doesn’t feel like it. It feels like we’re thinking a little bit more deeply.
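As a rough illustration of the word-by-word generation loop just described, here is a minimal sketch in Python. The next_token function is a hypothetical stand-in for the trained network; the rest only shows how each prediction is appended to the context and fed back in.

```python
# A minimal sketch of autoregressive next-word prediction. The model itself
# is abstracted away behind the hypothetical next_token function.

def next_token(context: list[str]) -> str:
    """Hypothetical stand-in for one forward pass of the trained model."""
    raise NotImplementedError("a real model would score candidate tokens here")

def generate(prompt: list[str], max_new_tokens: int = 50) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # predict one word from the whole context
        if token == "<end>":
            break
        tokens.append(token)         # the prediction becomes part of the context
    return tokens
```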
We’re thinking a little bit more in advance of what we want to say. But somehow, as I reflect, I’m not so sure, at least when I speak, verbally, orally, maybe I am just coming up every time with the next word. So, this is a very interesting aspect. But the key point is certainly when I’m doing mathematics, I think I am thinking a little bit more deeply.
And I’m not just trying to see what is the next step, but I’m trying to come up with a whole plan of what I want to achieve. And right now the system is not able to do this kind of long-term planning. And we can give a very simple experiment that shows this maybe. My favorite one is, let’s say you have a very simple arithmetic equality—three times seven plus 21 times 27 equals something.
So this is part of the prompt that you give to GPT-4. And now you just ask, okay, you’re allowed to modify one digit in this so that the end result is modified in a certain way. Which one do you choose? So, the way to solve this problem is that you have to think.
You have to try: okay, what would happen if I were to modify the first digit? What would happen if I were to modify the second digit? And GPT-4 is not able to do that. GPT-4 is not able to think ahead in this way. What it will say is just: I think if you modify the third digit, just randomly, it’s going to work. And it just tries and it fails. And the really funny aspect is that once it starts failing, this becomes part of its context, which in a way becomes part of its truth. So, the failure becomes part of its truth and then it will do anything to justify it.
It will keep making mistakes to keep justifying it. So, these are two aspects: the fact that it cannot really plan ahead, and the fact that once it makes mistakes, they just become part of its truth. These are very, very serious limitations, in particular for mathematics. This makes it a very uneven system once you approach mathematics.
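The kind of look-ahead the puzzle demands can be made explicit in a few lines of code: enumerate every single-digit modification, evaluate each one, and only then pick an answer. This is a minimal sketch; the target condition (make the result exceed 1000) is an arbitrary stand-in for whatever change is requested, not the exact condition from the conversation.

```python
# Explicitly "thinking ahead": try every one-digit edit of the expression and
# check its consequence before answering. The threshold below is illustrative.
expression = "3*7+21*27"   # evaluates to 588

def one_digit_edits(expr: str):
    """Yield every expression obtained by changing exactly one digit."""
    for i, ch in enumerate(expr):
        if ch.isdigit():
            for d in "0123456789":
                if d != ch:
                    yield expr[:i] + d + expr[i + 1:]

for edit in one_digit_edits(expression):
    try:
        value = eval(edit)
    except SyntaxError:    # skip edits like "01*27" with a leading zero
        continue
    if value > 1000:       # the requested change to the end result
        print(f"{edit} = {value}")
        break
```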
Ashley Llorens: You mentioned something that’s different about machine learning the way it’s conceptualized in this kind of generative AI regime, which is fundamentally different than what we’ve typically thought about as machine learning, where you’re optimizing an objective function with a fairly narrow objective versus when you’re trying to actually learn something about the structure of the data, albeit through this next word prediction or some other way.
What do you think about that learning mechanism? Are there any limitations of that?
Sébastien Bubeck: This is a very interesting question. Maybe I just want to backtrack for a second and just acknowledge that what happened there is kind of a miracle. Nobody, I think nobody in the world, perhaps, except OpenAI, expected that intelligence would emerge from this next word prediction framework just on a lot of data.
I mean, this is really crazy if you think about it. Now, the way I have justified it to myself recently is like this. So, I think it is agreed that deep learning is what powers the GPT-4 training. You have a big neural network that you’re training with gradient descent, just trying to fiddle with the parameters.
So, it is agreed that deep learning is this hammer: if you give it a dataset, it will be able to extract the latent structure of that dataset. So, for example, the first breakthrough in deep learning, a little bit more than ten years ago, was the AlexNet moment, where they trained a neural network to basically classify cats, dogs, cars, et cetera, in images.
And when you train this network, what happens is that you have these edge detectors that emerge in the first few layers of the neural network. And nothing in the objective function told you that it has to come up with edge detectors. This was an emergent property. Why? Because it makes sense: the structure of an image is to combine those edges to create geometric shapes.
Right now, I think what’s happening, and we have seen this more and more with the large language models, is that there are more and more emergent properties that appear as you scale up the size of the network and the size of the data. Now, what I believe is happening in the case of GPT-4 is that they gave it such a big dataset, so diverse and with so much complexity in it, that the only way to make sense of it, the only latent structure that unifies all of this data, is intelligence.
The only way to make sense of the data was for the system to become intelligent. This is kind of a crazy sentence. And I expect the next few years, maybe even the next few decades, will try to make sense of whether this sentence is correct or not. And hopefully, human beings are intelligent enough to make sense of that sentence.
I don’t know right now. I just feel like it’s a reasonable hypothesis that this is what happened there. And so in a way, you can say maybe there is no limitation to the next-word-prediction framework. So that’s one perspective. The other perspective is that, actually, the next-word-prediction, next-token framework is very limiting, at least at generation time.
At least once you start to generate new sentences, you should go a little bit beyond it if you want to have a planning aspect, if you want to be able to revisit mistakes that you made. So, there we believe that, at least at generation time, you need to have a slightly different system. But maybe in terms of training, in terms of coming up with intelligence in the first place, maybe this is a fine way to do it.
Ashley Llorens: And that inspires me to ask you a somewhat technical question. I think one aspect of our previous notion of intelligence, and maybe still the current notion for some, is this aspect of compression: the ability to take something complex and make it simple, thinking grounded in Occam’s razor, where we want to generate the simplest explanation of the data. Some of the things you’re saying, and some of the things we’re seeing in the model, kind of go against that intuition.
So talk to me a little bit about that.
Sébastien Bubeck: Absolutely. So, I think this is really well exemplified in the project that we did here at Microsoft Research a year ago, which we called LEGO. So let me tell you about this very briefly, because it will really get to the point of what you’re trying to say. So, let’s say you want to train an AI system that can solve middle school systems of linear equations.
So, maybe it’s x plus y equals z, 3x minus 2y equals 1, and so on: a few equations with a few variables. And you want to train a neural network that takes in this system of equations and outputs the answer for it. The classical perspective, the Occam’s razor perspective, would be to collect a dataset with lots of equations like this and train the system to solve those linear equations.
And there you go: this way, you have the same kind of distribution at training time and at test time. What this new paradigm of deep learning, and in particular of large language models, says instead is: even though your goal is to solve systems of linear equations for middle school students, don’t train only on middle school systems of equations.
Instead, collect a hugely diverse set of data; maybe do next-word prediction not only on the systems of linear equations, but also on all of Wikipedia. So, this is now a very concrete experiment. You have two neural networks: neural network A, trained on the equations, and neural network B, trained on the equations plus Wikipedia. And any kind of classical thinking would tell you that neural network B is going to do worse, because it has to do more things; it’s going to get more confused. It’s not the simplest way to solve the problem. But lo and behold, if you actually run the experiment for real, network B is much, much, much better than network A. Now I need to quantify this a little bit. Network A, if it was trained on systems of linear equations with three variables, is going to be fine on systems of linear equations with three variables.
But as soon as you ask it four variables or five variables, it’s not able to do it. It didn’t really get to the essence of what it means to solve linear equations, whereas network B not only solves systems of equations with three variables, but it also does four, it also does five, and so on.
Now the question is why? What’s going on? Why is it that making the thing more complicated, going against Occam’s razor, is a good idea? And the extremely naive perspective, which in fact some people have suggested because it is so mysterious, would be that maybe it just read the Wikipedia page on solving systems of linear equations.
But of course, that’s not what happened. And this is another aspect of this whole story, which is that anthropomorphization of the system is a big danger. But let’s not get into that right now. The point is that’s not at all the reason why it became good at solving systems of linear equations.
It’s rather that it had this very diverse data, which forced it to come up with unifying principles, more canonical components of intelligence. And then it’s able to compose these canonical components of intelligence to solve the task at hand.
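To make the experimental design concrete, here is a minimal sketch of how the two training corpora in such a comparison could be assembled. It is only an illustration of the setup described above, not the actual LEGO code; the corpus size, coefficient ranges, and the Wikipedia placeholder are all arbitrary.

```python
# A minimal sketch of the controlled comparison: the same synthetic
# equation-solving data, with or without broad, unrelated text mixed in.
import random

def random_system(num_vars: int = 3) -> str:
    """Format one random system of linear equations as a line of text."""
    names = "xyz"[:num_vars]
    lines = []
    for _ in range(num_vars):
        coeffs = [random.randint(-5, 5) for _ in names]
        rhs = random.randint(-20, 20)
        lhs = " + ".join(f"{c}*{v}" for c, v in zip(coeffs, names))
        lines.append(f"{lhs} = {rhs}")
    return " ; ".join(lines)

equations = [random_system() for _ in range(10_000)]

corpus_a = equations                          # network A: only the narrow task
wikipedia = ["<diverse article text here>"]   # placeholder for broad text
corpus_b = equations + wikipedia              # network B: task plus diversity

# The surprising finding described above: a model trained on corpus_b
# generalizes better, for example to systems with more variables.
```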
Ashley Llorens: I want to go back to something you said much earlier around natural evolution versus this notion of artificial evolution. And I think that starts to allude to where I think you want to take this field next, at least in terms of your study and your group. And that is, focusing on the aspect of emergence and how intelligence emerges.
So, what do you see as the way forward from this point, from your work with Lego that you just described for you and for the field?
Sébastien Bubeck: Yes, absolutely. So, I would argue that maybe we need a new name for machine learning, in a way. GPT-4 and GPT-3 and all those other large language models, in some ways, it’s not machine learning anymore. And by that I mean machine learning is all about how you teach a machine a very well-defined task: recognize cats and dogs, something like that. But here, that’s not what we’re doing. We’re not trying to teach it a narrow task. We’re trying to teach it everything. And we’re not trying to mimic how a human would learn. This is another point of confusion. Some people say, oh, but it’s learning language using more text than any human would ever see.
But that’s kind of missing the point. The point is we’re not trying to mimic human learning. And that’s why maybe learning is not the right word anymore. We’re really trying to mimic something which is more akin to evolution. We’re trying to mimic the experience of millions, billions of entities that interact with the world. In this case, the world is the data that humans produced.
So, it’s a very different style. And I believe the reason why all the tools that we have introduced in machine learning are kind of useless and almost irrelevant in light of GPT-4 is because it’s a new field. It’s something that needs new tools to be defined. So we hope to be at the forefront of that and we want to introduce those new tools.
And of course, we don’t know what it’s going to look like, but the avenue that we’re taking to try to study this is to try to understand emergence. Emergence, again, is this phenomenon that as you scale up the network and the data, suddenly new properties emerge at every scale. Google had this experiment where they scaled up their large language models from 8 billion to 60 billion to 500 billion parameters.
At 8 billion parameters, it’s able to understand language, and at 60 billion it’s able to do a little bit of arithmetic. Suddenly it’s able to translate between languages: before, it couldn’t translate, and at 60 billion parameters, suddenly it can translate. At 500 billion, suddenly it can explain jokes. Why can it suddenly explain jokes?
So, we really would like to understand this. And there is another field out there that has been grappling with emergence for a long time, trying to study very complex systems of particles interacting with each other and leading to some emergent behaviors.
What is this field? It’s physics. So, what we would like to propose is let’s study the physics of AI or the physics of AGI, because in a way, we are really seeing this general intelligence now. So, what would it mean to study the physics of AGI? What it would mean is, let’s try to borrow from the methodology that physicists have used for the last few centuries to make sense of reality.
And what were those tools? Well, one of them was to run very controlled experiments. If you look at a waterfall, you observe the water, which is flowing and going in all kinds of ways, and then you go look at it in the winter and it’s frozen. Good luck trying to make sense of the phases of water by just staring at the waterfall. GPT-4, or LaMDA, or the other large language models: these are all waterfalls. What we need are much smaller-scale, controlled experiments, where we know we have pure water that is not being tainted by the stones, by the algae. We need those controlled experiments to make sense of it. And LEGO is one example. So that’s one direction that we want to take. But in physics there is another direction that you can take, which is to build toy mathematical models of the real world.
You try to abstract away lots of things, and you’re left with a very simple mathematical equation that you can study. And then you have to go back to the real experiment and see whether the prediction from the toy mathematical model tells you something about it. So that’s another avenue that we want to take. And we made some progress recently, also with interns at Microsoft Research.
So, we have a paper which is called Learning Threshold Units. And here we’re really able to understand how the most basic element, I don’t want to say of intelligence, but the most basic element of reasoning, emerges in those neural networks. And what is this most basic element of reasoning? It’s a threshold unit. It’s something that takes some value as input,
and if the value is too small, it just turns it to zero. Even this emergence is already a very, very complicated phenomenon. We were able to understand the nonconvex dynamics at play and connect them to what is called the edge of stability, which is all very exciting. But the key point is that we have a toy mathematical model, and there, in essence, what we were able to show is that emergence is related to instability in training. This is very surprising, because usually in classical machine learning, instability is something that you do not want; you want to erase all the instabilities.
And somehow, through this physics of AI approach, where we have a toy mathematical model, we are able to say that the instability in training that everybody has been seeing for decades actually matters for learning and for emergence. So, this is the first step that we took.
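For readers wondering what a threshold unit is, here is a minimal sketch. It only shows what the unit computes (small inputs are mapped to zero); the emergence and edge-of-stability analysis in the paper is about how such units arise during training, which this snippet does not attempt to reproduce.

```python
# A single threshold unit: values below the threshold are turned to zero,
# values above it pass through.
import numpy as np

def threshold_unit(x: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    return np.where(x > threshold, x, 0.0)

print(threshold_unit(np.array([-2.0, -0.1, 0.5, 3.0])))  # [0.  0.  0.5 3. ]
```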
Ashley Llorens: I want to come back to this aspect of interaction and want to ask you if you see fundamental limitations with this whole methodology around certain kinds of interactions. So right now we’ve been talking mostly about these models interacting with information in information environments, with information that people produce, and then producing new information behind that.
The source of that information is actual humans. So, I want to know if you see any limitations or if this is an aspect of your study, how we make these models better at interacting with humans, understanding the person behind the information produced. And after you do that, I’m going to come back and we’ll ask the same question of the natural world in which we as humans reside.
Sébastien Bubeck: Absolutely. So, this is one of the emergent properties of GPT-4 to put it very simply, that not only can it interact with information, but it can actually interact with humans, too. You can communicate with it. You can discuss, and you’re going to have very interesting discussions. In fact, some of my most interesting discussions in the last few months were with GPT-4.
So this is surprising, not at all something we would have expected, but it’s there. Not only that, but it also has a theory of mind. GPT-4 is able to reason about what somebody is thinking, what somebody is thinking about what somebody else is thinking, and so on. So, it really has a very sophisticated theory of mind. There was recently a paper saying that ChatGPT is roughly at the level of a seven-year-old in terms of its theory of mind. GPT-4, I cannot really distinguish from an adult. Just to give you an anecdote, and I don’t know if I should say this, but one day in the last few months, I had an argument with my wife, and she was telling me something.
And I just didn’t understand what she wanted from me. So I just talked with GPT-4. I explained the situation: here is what’s going on; what should I be doing? And the answer was so detailed, so thoughtful. I mean, I’m really not making this up. This is absolutely real. I learned something from GPT-4 about human interaction with my wife.
This is as real as it gets. And so, I can’t see any limitation right now in terms of interaction. And not only can it interact with humans, but it can also interact with tools. And so, this is the premise in a way of the new Bing that was recently introduced, which is that this new model, you can tell it “hey, you know what, you have access to a search engine.”
“You can use Bing if there is some information that you’re missing and you need to find it out, please make a Bing search.” And somehow natively, this is again, an emergent property. It’s able to use a search engine and make searches when it needs to, which is really, really incredible. And not only can it use those tools which are well-known, but you can also make up tools.
You can say, hey, I invented some API. Here is what the API does. Now please solve problem XYZ for me using that API, and it’s able to do it natively. It’s able to understand your description, in natural language, of what the API that you built is doing, and it’s able to leverage its power and use it.
This is really incredible and opens so many directions.
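As an illustration of the made-up-API idea, here is a minimal sketch of how such a prompt could be structured. The chat helper and the weather_lookup tool are both hypothetical placeholders, not the real GPT-4 or Bing interfaces.

```python
# A minimal sketch of describing an invented tool in natural language and
# letting the model decide when to call it. Nothing here is a real API.

def chat(prompt: str) -> str:
    """Hypothetical stand-in for one call to a large language model."""
    raise NotImplementedError("wire this up to whatever model endpoint you use")

TOOL_DESCRIPTION = (
    "You can use a tool named weather_lookup. To call it, output a single "
    'line of the form: CALL weather_lookup(city="..."). It returns the '
    "current temperature for that city."
)

question = "Should I bring a coat to Seattle today?"
response = chat(f"{TOOL_DESCRIPTION}\n\nUser question: {question}")

# The surrounding code would parse any CALL line in `response`, run the real
# lookup, append the result to the conversation, and ask the model to continue.
```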
Ashley Llorens: We certainly see some super impressive capabilities like the new integration with Bing, for example. We also see some of those limitations come into play. Tell me about your exploration of those in this context.
Sébastien Bubeck: So, one keyword that didn’t come up yet, and which is going to drive the conversation forward, at least online and on Twitter, is hallucinations. Those models, GPT-4 included, still hallucinate a lot. And in a way, for good reason: it’s on a spectrum where on one end you have bad hallucination, completely making up facts which are contrary to the real facts in the real world.
But on the other end, you have creativity. When you create, when you generate new things, you are, in a way, hallucinating. It’s good hallucinations, but still these are hallucinations. So having a system which can both be creative, but does not hallucinate at all—it’s a very delicate balance. And GPT-4 did not solve that problem yet. It made a lot of progress, but it didn’t solve it yet.
So that’s still a big limitation, which the world is going to have to grapple with. And I think in the new Bing it’s very clearly explained that it is still making mistakes from time to time and that you need to double-check the results. I still think the rough contours of what GPT-4 says and what the new Bing says are really correct.
And it’s a very good first draft most of the time, and you can get started with that. But then, you need to do your research and it cannot be used for critical missions yet. Now what’s interesting is GPT-4 is also intelligent enough to look over itself. So, once it produced a transcript, you can ask another instance of GPT-4 to look over what the first instance did and to check whether there is any hallucination.
This works particularly well for what I would call in-context hallucination. So, what would be in-context hallucination? Let’s say you have a text that you’re asking it to summarize and maybe in the summary, it invents something that was not out there. Then the other instance of GPT-4 will immediately spot it. So that’s basically in-context hallucination.
We believe these can be fully solved soon. Then there is the open-world type of hallucination, when you ask about anything. For example, in our paper we ask: where is the McDonald’s at Sea-Tac, the airport in Seattle? And it responds: gate C2. And the answer is not C2; it’s B3. So this type of open-world hallucination is much more difficult to resolve.
And we don’t know yet exactly how to do that.
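The cross-checking idea for in-context hallucinations described above can be sketched as two calls to the model: one to produce the summary and one to audit it against the source. The chat helper below is a hypothetical stand-in, not the actual GPT-4 API.

```python
# A minimal sketch of using a second model instance to flag in-context
# hallucinations in a summary. The chat function is a hypothetical stand-in.

def chat(prompt: str) -> str:
    """Hypothetical stand-in for one call to a large language model."""
    raise NotImplementedError("wire this up to whatever model endpoint you use")

def summarize_with_check(source_text: str) -> tuple[str, str]:
    # First instance produces the summary.
    summary = chat(f"Summarize the following text:\n\n{source_text}")
    # Second instance sees only the source and the summary, and is asked to
    # flag anything in the summary that the source does not support.
    verdict = chat(
        "Here is a source text and a summary of it.\n\n"
        f"Source:\n{source_text}\n\nSummary:\n{summary}\n\n"
        "List any claims in the summary that are not supported by the source."
    )
    return summary, verdict
```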
Ashley Llorens: Do you see a difference between a hallucination and a factual error?
Sébastien Bubeck: I have to think about this one. I would say that no, I do not really see a difference between the hallucination and the factual error. In fact, I would go as far as saying that when it’s making arithmetic mistakes, which again, it still does, when it adds two numbers, you can also view it as some kind of hallucination.
And by that I mean it’s kind of a hallucination by omission. And let me explain what I mean. So, when it does an arithmetic calculation, you can actually ask it to print each step and that improves the accuracy. It does a little bit better if it has to go through all the steps and this makes sense from the next word prediction framework.
Now, what happens is that very often it will skip a step. It will kind of forget something. This can be viewed as a kind of hallucination: it just hallucinated that this step is not necessary and that it can move on to the next stage immediately. And so, this kind of factual error, or, in this case, reasoning error if you want, they are all related to the same concept of hallucination. There could be many ways to resolve those hallucinations.
Maybe we want to look inside the model a little bit more. Maybe we want to change the training pipeline a little bit. Maybe the reinforcement learning with human feedback can help. All of these are small patches, and I still want to make it clear to the audience that it’s an open academic problem whether any of those directions can eventually fix it, or whether it is a fatal error of large language models that will never be fixed.
We do not know the answer to that question.
Ashley Llorens: I want to come back to this notion of interaction with the natural world.
As human beings, we learn about the natural world through interaction with it. We start to develop intuitions about things like gravity, for example. And there is an argument, or debate, right now in the community as to how much of that knowledge of how to interact with the natural world is encoded in and learnable from language and the kinds of information inputs that we put into the model, versus how much actually needs to be implicitly or explicitly encoded in an architecture or learned through interaction with the world.
What do you see here? Do you see a fundamental limitation with this kind of architecture for that purpose?
Sébastien Bubeck: I do think that there is a fundamental limit. I mean, that there is a fundamental limitation in terms of the current structure of the pipeline. And I do believe it’s going to be a big limitation once you ask the system to discover new facts. So, what I think is the next stage of evolution for the systems would be to hook it up with a simulator of sorts so that the system at training time when it’s going through all of the web, is going through all of the data produced by humanity.
So, it realizes, oh, maybe I need more data of a certain type, then we want to give it access to a simulator so that it can produce its own data, it can run experiments which is really what babies are doing. Infants—they run experiments when they play with a ball, when they look at their hand in front of their face.
This is an experiment. So, we do need to give the system a way to do experiments. Now, the problem with this is that you get into a little bit of a dystopian discussion: do we really want to give these systems, which are superintelligent in some ways, access to simulators?
Aren’t we afraid that they will become superhuman in every way if some of the experiments that they can run are to run code, to access the internet? There are lots of questions about what could happen. And it’s not hard to imagine what could go wrong there.
Ashley Llorens: It’s a good segue into maybe a last question or topic to explore, which comes back to this phrase AGI, artificial general intelligence. In some ways, there’s kind of a lowercase version of that, where we talk about moving toward more generalizable kinds of intelligence. That’s the regime that we’ve been exploring. Then there’s a kind of capital-letter version of that, which is almost like a sacred cow or a kind of dogmatic pursuit within the AI community. So, what does that capital-letter phrase, AGI, mean to you? And maybe part B of that is: is our classic notion of AGI the right goal for us to be aiming for?
Sébastien Bubeck: Before interacting with GPT-4, to me, AGI was this unachievable dream. Some think it’s not even clear whether it’s doable. What does it even mean? And really by interacting with GPT-4, I suddenly had the realization that actually general intelligence, it’s something very concrete.
It’s able to understand any kind of topic that you bring up. It is going to be able to reason about any of the things that you want to discuss. It can bring up information, it can use tools, it can interact with humans, it can interact with an environment. This is general intelligence. Now, you’re totally right in calling it “lowercase” AGI.
Why is it not uppercase AGI? Because it’s still lacking some of the fundamental aspects, two of them, which are really, really important. One is memory. So, every new session with GPT-4 is a completely fresh tabula rasa session. It’s not remembering what you did yesterday with it. And it’s something which is emotionally hard to take because you kind of develop a relationship with the system.
As crazy as it sounds, that’s really what happens. And so you’re kind of disappointed that it doesn’t remember all the good times that you guys had together. So this is one aspect. The other one is the learning. Right now, you cannot teach it new concepts very easily. You can turn the big crank of retraining the model.
Sure, you can do that, but I’ll give you the example of using a new API. Tomorrow, you have to explain it again. So, of course, learning and memory, those two things are very, very related, as I just explained. So, this is one huge limitation to me.
If it had that, I think it would qualify as uppercase AGI. Now, not everybody would agree even with that, because many people would say, no, it needs to be embodied, to have real world experience. This becomes a philosophical question. Is it possible to have something that you would call a generally intelligent being that only lives in the digital world?
I don’t see any problem with that, honestly. I cannot see any issue with this. Now, there is another aspect, once you get into this philosophical territory, which is that right now the systems have no intrinsic motivation. All they want to do is to generate the next token. So, is that also an obstruction to having something which is a general intelligence?
Again, to me this becomes more philosophical than really technical, but maybe there is some technical aspect to it. Again, if you start to hook up the systems to simulators, to run their own experiments, then certainly maybe they will have some intrinsic motivation to just improve themselves. So maybe that’s one technical way to resolve the question. I don’t know.
Ashley Llorens: That’s interesting. And I think there’s a word for that in the community: agent. Or seeing “agentic” or goal-oriented behaviors. And that is really where you start to get into the need for serious sandboxing or alignment or other kinds of guardrails for a system that actually starts to exhibit goal-oriented behavior.
Sébastien Bubeck: Absolutely. Maybe one other point that I want to bring up about AGI, which I think is confusing a lot of people. Somehow when people hear general intelligence, they want something which is truly general that could grapple with any kind of environment. And not only that, but maybe that grapples with any kind of environment and does so in a sort of optimal way.
This universality and optimality, I think, are completely irrelevant to intelligence. Intelligence has nothing to do with universality or optimality. We as human beings are notoriously not universal. I mean, you change a little bit the condition of your environment, and you’re going to be very confused for a week. It’s going to take you months to adapt.
So, we are very, very far from universal and I think I don’t need to tell anybody that we’re very far from being optimal. The number of crazy decisions that we make every second is astounding. So, we’re not optimal in any way. So, I think it is not realistic to try to have an AGI that would be universal and optimal. And it’s not even desirable in any way, in my opinion. So that’s maybe not achievable and not even realistic, in my opinion.
Ashley Llorens: Is there an aspect of complementarity that we should be striving for in, say, a refreshed version of AGI or this kind of long-term goal for AI?
Sébastien Bubeck: Yeah, absolutely. But I don’t want to sit here in this podcast today and try to say how I view this question, because I think the community should really come together and discuss this in the coming weeks, months, years, and come together on where we want to go, where society wants to go, and so on.
I think it’s a terribly important question. And we should not dissociate our futuristic goals from the technical innovation that we’re trying to do day to day. We have to take both into account. But I imagine that this discussion will happen, and we will know a lot more a year from now, hopefully.
Ashley Llorens: Thanks, Sébastien. Just a really fun and fascinating discussion. Appreciate your time today.
Sébastien Bubeck: Yeah, thanks, Ashley. It was super fun.