Presented by John Langford at Microsoft Research Forum, February 2025

“That ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text.”
– John Langford, Partner Research Manager, Microsoft Research AI Frontiers
Transcript: Lightning Talk
Belief state transformers
John Langford, Partner Research Manager, Microsoft Research AI Frontiers
This talk showcases a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms’ efficiency and effectiveness.
Microsoft Research Forum, February 25, 2025
FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: Transformer models have brought us a revolution in language modeling with their capability to generate impressive language with many emergent properties. At the same time, LLMs have a number of weaknesses, one being that they are not very good at evaluating their own output. Let’s hear how the new Belief State Transformer architecture unlocks new abilities by combining a standard GPT-style architecture of a forward encoder for token prediction with an additional backward encoder.
JOHN LANGFORD: I’m John Langford. I’d like to tell you about belief state transformers, which is a new paper we have on arXiv and which has also been accepted at ICLR [International Conference on Learning Representations]. There are many coauthors on this paper. I’d like to thank them, particularly Edward, who did much of the work here.
To start with, let’s talk about standard GPT-style transformers. In standard GPT-style transformers, you have a sequence of symbols which are going into a forward encoder, and then the forward encoder outputs some information to the output head, and then the output head predicts the next token. So, this is a straightforward approach and yet amazingly powerful. It’s kind of the key backbone behind GPT-4 and other language models.
For the purposes of research, though, we need to have something to think about, to complain about, and I’m going to complain about self-evaluation. Often these language models can’t be used to evaluate their own output too well, because the generation of the next token is done by exactly the mechanism you would use to evaluate it in that output head. So, this is kind of like grading yourself, and like grading yourself you can miss things that an independent grader would actually see pretty well.
Right, so a belief state transformer changes the architecture. And so, it’s taking two transformers and grafting them together. One of them is going to be the standard forward encoder on the prefix. And then we’re also going to have another transformer, which is a backward encoder on the suffix. These are both going to put out some information, which goes to the output head. And the output head is going to predict the next token and the previous token. So, it’s the next token of the prefix and the previous token of the suffix. Something to worry about with these transformers is the computation. So, these are transformers obviously doing more computation. But it turns out that this “more computation” is only in a constant factor of more computation.
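For readers who want a concrete picture of the two grafted encoders just described, here is a minimal PyTorch-style sketch. The module names, sizes, and use of off-the-shelf encoder layers are illustrative assumptions on my part, not the paper’s actual implementation.

```python
# Minimal sketch of the belief state transformer architecture
# (hypothetical names and sizes; not the paper's code).
import torch
import torch.nn as nn

class BeliefStateTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Forward encoder reads the prefix; backward encoder reads the suffix.
        # (A real implementation would use causal/anti-causal masking; omitted here.)
        self.forward_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.backward_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # The output head sees both encodings and predicts two tokens:
        # the next token of the prefix and the previous token of the suffix.
        self.next_head = nn.Linear(2 * d_model, vocab_size)
        self.prev_head = nn.Linear(2 * d_model, vocab_size)

    def forward(self, prefix, suffix):
        # Last position of the forward encoder serves as the prefix encoding.
        f = self.forward_encoder(self.embed(prefix))[:, -1]
        # Suffix is reversed so the backward encoder reads it right to left.
        b = self.backward_encoder(self.embed(suffix.flip(dims=[1])))[:, -1]
        joint = torch.cat([f, b], dim=-1)
        return self.next_head(joint), self.prev_head(joint)
```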
And the key observation here is that in the forward encoder, just doing the attention, which is what you’re going to use in the GPT-style transformer, is already order N-squared [N²]. Every token looks at every previous token in order to figure out what information is necessary to predict the next token. In the belief state transformer, that happens twice. You have two different transformers, each with their own attention, and so you pay a factor of two.
And then, in addition, you’re going to pay because the number of times you evaluate the head, the output head, is order N-squared [N²], because there are order N-squared prefix/suffix pairs. So, there’s a constant-factor increase in computation, which is problematic, but it’s not like the end of the world. You can subsample or things like that. And what you get in return is order N-squared gradients rather than order N gradients.
In a standard GPT-style transformer, you only have order N gradients because you only have order N symbols, and you get one gradient per symbol. Here you get order N-squared gradients because you have order N-squared prefix/suffix pairs. That means there are many more ways to get information out of a sequence. And that unlocks the possibility of learning new things that were previously unlearnable.
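To make the order N-squared gradient signal concrete, here is a hedged sketch of a training loss over sampled prefix/suffix pairs. The sampling scheme and function names are my own assumptions; the paper’s actual training procedure may batch and weight the pairs differently.

```python
# Hypothetical sketch: each (prefix end i, suffix start j) split of a
# length-N sequence yields two gradients, from predicting x[i] (next token
# of the prefix) and x[j-1] (previous token of the suffix). Pairs are
# subsampled to keep the O(N^2) cost manageable, as mentioned above.
import random
import torch
import torch.nn.functional as F

def belief_state_loss(model, x, num_pairs=64):
    # x: (N,) tensor of token ids for one sequence, N >= 3.
    N = x.size(0)
    loss = 0.0
    for _ in range(num_pairs):
        i = random.randint(1, N - 2)       # prefix is x[:i], non-empty
        j = random.randint(i + 1, N - 1)   # suffix is x[j:], non-empty
        next_logits, prev_logits = model(x[:i].unsqueeze(0), x[j:].unsqueeze(0))
        loss = loss + F.cross_entropy(next_logits, x[i].unsqueeze(0))
        loss = loss + F.cross_entropy(prev_logits, x[j - 1].unsqueeze(0))
    return loss / num_pairs
```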
Okay, so now let’s go on to the belief state. Why are we talking about a belief state when we say belief state transformer? Well, it turns out you can prove a theorem. And this theorem says that the output of the forward encoder is a belief state for the prefix. So what that means is that the output of the forward encoder will converge to all the information necessary to predict the future. So that’s all symbols after the prefix. So, that ability to create a compact belief state is new with belief state transformers, something that previously we only really knew how to do with state space machines.
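Stated informally (this is a paraphrase of the property described above, not the paper’s exact theorem), the claim is that conditioning on the forward encoding of the prefix is as good as conditioning on the prefix itself:

```latex
% Informal paraphrase: at the optimum, the forward encoding f(x_{1:t})
% carries everything in the prefix that is relevant to the future.
\[
  P\bigl(x_{t+1}, x_{t+2}, \dots \mid x_{1:t}\bigr)
  \;=\;
  P\bigl(x_{t+1}, x_{t+2}, \dots \mid f(x_{1:t})\bigr)
\]
```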
Okay, so let’s try this out, looking at Tiny Stories. Tiny Stories is a dataset where you have a bunch of children’s stories, which are generated by GPT-4.
We’re going to feed a prefix and a suffix into our system, and it’s going to fill in the middle, which is what’s shown in blue. And then for a baseline, we’re going to compare the fill-in-the-middle approach to using GPT-style transformers. So the way the fill-in-the-middle approach works with GPT-style transformers is you take the prefix, and then you add the suffix, and then you just predict the tokens after that.
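As a small, hypothetical illustration of that baseline: the prompt can be as simple as the prefix followed by the suffix with a separator, after which the GPT-style model generates the middle. Real fill-in-the-middle setups typically use dedicated sentinel tokens; the names below are placeholders.

```python
# Hypothetical sketch of the fill-in-the-middle baseline described above:
# the GPT-style model sees prefix then suffix, and the middle is whatever
# it generates after that.
def fim_prompt(prefix_ids, suffix_ids, sep_id):
    return prefix_ids + [sep_id] + suffix_ids + [sep_id]

# middle_ids = gpt_model.generate(fim_prompt(prefix_ids, suffix_ids, SEP_ID))
```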
So that works reasonably well. This is very commonly used. And now, if we have these two different approaches, the question is how do we actually evaluate these different approaches? Which one is better? So, the way we’re going to judge this is we’re going to ask GPT-4 which is better in various ways: syntax, style, and so forth. And then we’ll ask it for a summary judgment, which is a standard technique.
We looked at what it was doing, and it seemed very reasonable. And in doing this, we end up with the belief state transformer winning about a factor of three more often than the GPT-style transformer. So that’s huge. It’s so huge that you really want to understand why. And it seems like the key here is self-evaluation. So, under the hood, we’re actually running each of these, say, 120 times, using a beam search. The code for that is on the right. So, given the beam search, you have several different possible completions. And now how do you choose which completion to actually use? Because you have to pick one of these. You’re trying to pick a completion. And for the GPT-style transformer, there’s only one way to really do this. The way is you take the next-token head, and you use it as a probability function, and you look at the probability of the sequence of tokens which is produced.
That works reasonably well. It actually does improve picking out a high-probability sequence of tokens versus a lower-probability sequence of tokens. But it’s not as much as you get with the belief state transformer. And the reason why is the self-grading issue that I was talking about earlier. There are many ways that a system could be blind to its own mistakes. With the belief state transformer, though, you have another option, because the next-token head can instead condition on the generated data and run over the suffix in order to value the generated data.
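Here is a hedged sketch of the two scoring rules just described, in the notation of the earlier architecture sketch. The function names, the generic causal-LM interface assumed for the GPT-style model, and the exact way the belief state transformer conditions on the generated suffix are my paraphrase of the talk, not the paper’s code.

```python
# Sketch of candidate scoring after beam search (illustrative only).
import torch
import torch.nn.functional as F

def gpt_score(gpt_model, prefix, candidate):
    # Self-grading: sum the log-probabilities the model assigns to the
    # tokens it generated itself.  Assumes gpt_model(ids) -> (1, T, vocab)
    # next-token logits.
    ids = torch.cat([prefix, candidate]).unsqueeze(0)
    logprobs = F.log_softmax(gpt_model(ids)[0, :-1], dim=-1)
    targets = ids[0, 1:]
    start = prefix.size(0) - 1  # positions that predict candidate tokens
    return logprobs[start:, :].gather(1, targets[start:].unsqueeze(1)).sum()

def bst_score(bst_model, prefix, candidate):
    # Conditioning on the generation: score each candidate token with the
    # next-token head while the backward encoder sees the rest of the
    # candidate as a suffix.  (The last token is skipped here so the
    # suffix stays non-empty in this sketch.)
    total = 0.0
    full = torch.cat([prefix, candidate])
    for k in range(candidate.size(0) - 1):
        pre = full[: prefix.size(0) + k].unsqueeze(0)
        suf = full[prefix.size(0) + k + 1 :].unsqueeze(0)
        next_logits, _ = bst_model(pre, suf)
        total = total + F.log_softmax(next_logits, dim=-1)[0, candidate[k]]
    return total
```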
So, that ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text. All right, so just to summarize, we have this belief state transformer. This learns a compact belief state, which is a new thing in transformers. It gives us a way to have a simple set of values, which summarize all information we need to predict the future.
And this seems to provide a very strong form of self-evaluation, which is potentially very useful in many situations where you’re trying to use test-time compute, or even using test-time compute to further create training data. So, there’s more in the paper. There are some other things that you can do with this transformer that are kind of new.
I think the biggest question in my mind is what happens when you scale this up? And, of course, we’re working on that. That’s one of the great things about being in MSR [Microsoft Research]. They have some GPUs to scale this up to much larger datasets. So, stay tuned. And, thank you.
Related resources
- Research Lab: AI Frontiers