David Bau on How—and Whether—Artificial Intelligence Thinks

The Good Fight

Yascha Mounk and David Bau examine the mysterious internal processes that drive AI behavior—and why they may be fundamentally alien.

Yascha Mounk

Jun 13, 2026

David Bau is Assistant Professor at Northeastern University and Director of the National Deep Inference Fabric, researching the emergent internal mechanisms of deep generative networks in both Natural Language Processing and Computer Vision.

In this week’s conversation, Yascha Mounk and David Bau discuss how AI models actually produce their results and reflect on problems, whether the “thinking” process that models show users reveals their authentic thought processes, and how researchers can decode the internal representations of neural networks to understand what information they contain and use.

This transcript has been condensed and lightly edited for clarity.

Yascha Mounk: I learned so much the last time we spoke that I thought I would abuse your generosity and reel you in for another private tutoring lesson about how AI works. When we talked last time, we did AI 101—that’s how I was thinking about it. How does this thing work? How do you build an AI? How does it operate? The question I want to start with today is: how does the AI actually produce results? How does it actually reflect about the world, reflect about a problem, plan how to carry it out? What can we even know about that?

David Bau: This is one of the mysteries of AI—how does it work inside? The way that we train AI is to basically reward it, reinforce it, or strengthen its connections when it gets answers right, and then weaken those connections or withdraw a reward when it gets something wrong. Repeating this process billions of times, it starts to perform well on all the tasks. The mystery is: how does it do it inside? The whole area of trying to understand what’s going on inside the AI—some people call it AI interpretability, cracking open the AI to interpret what it’s thinking inside—is actually my area of research specialty, so I’m happy to get into what we know about that.

Mounk: In a way we have some advantage relative to the human brain, right? Reading exactly what neuron is firing when in the human brain is incredibly hard, and getting good readings even on a mouse while it’s alive is an incredibly difficult process. Presumably the one advantage we have in the context of these models is that we can, I assume with greater ease, observe which part of a neural network is activated in which way, and is changing values in what way, while I am asking Claude what 3 plus 5 is or whatever.

Bau: Yes. That’s the amazing thing about having artificial neural networks that work—it’s an embarrassment of data. It’s the exact opposite of what you are dealing with when you are dealing with biological brains. The neuroscientists are amazing; they do look at neurons of mice and they have incredible ways of doing that. It can take five years to look at a handful of neurons, whereas in computer science, within a few minutes, it’s very easy to look at billions of neural signals. It’s so much data that our challenge is trying to figure out how to sift through it all to make sense of these signals. When you feed some input into a neural network, it creates a pattern of neural firings that we call the representation. It’s a representation of some information that’s inside the network. What we’re frequently trying to do is understand two key things: what information is inside the neural representation—what does it know inside its neurons, what information does it have—and then what does it use it for? What information does it use, and how does that impact its decision? I think you can distill a lot of the questions of how a neural network works down to these two things: what does it know, and what does it use?

Mounk: Just from a layman’s perspective, the first obvious question is: when I ask Claude to do something complicated, it’ll say “I’m thinking,” and you can click on that and it expands a little thing that tells me what it’s doing and what it’s thinking. It says, “the user requested this, I should do that.” But of course, I have no idea whether that is any closer to its actual thought process. It does seem to tell me about some of the steps it’s taking, so it seems to be somehow related to what it’s doing. But from the model’s point of view, that is still output that I may be inspecting. So is that a window at all into what’s actually going on under the hood, or is it a completely fake output that makes me feel like I get some kind of insight into what it’s doing, without actually being any closer to what it’s doing than the official output it gives me?

Bau: I think most people believe that it is somewhat of a window, but it’s something that you have to take with a grain of salt—it is another output of the neural network. There have been studies that show that that output is not totally faithful to the way that neural networks think inside, in a few different ways. Most people look at that and say, well, it’s certainly better than nothing, it’s certainly very readable, so it’s definitely worth a look. It is definitely worth auditing, and the network will often reveal things in that text that give you some insights about what’s going on. But it’s not the full story.

There are two ways that the network has an internal thought process. One of them is through what everybody is calling its internal chain of thought. This comes from an old paper—“chain of thought” is what they use to talk about this internal monologue. This is what you can click on to see when the model is talking about itself, and that’s almost literally the model talking to itself. It’s generating tokens that aren’t directly intended for you to read. These are tokens that came out of this reinforcement learning process, where the model has somehow learned that in order to get more accurate answers—to solve more puzzles that it was presented with during training—it’s useful to write some things down halfway through.

Mounk: Does it do that in English? Does it always do that in English, even when I’m speaking to it in German? Are there models that have developed their own kind of language for this? What does this look like?

Bau: If you don’t explicitly tell the models to make that text readable, they will write in their own language, switching between English and Chinese and other things. One of the things that people do when they train them is they try to condition the models to make that text a little bit more readable so that we get some insight. But that’s an example of the challenge with these internal chains of thought. The model could be inventing its own jargon. It could be using words that look like English, but actually encoding some other information in those words. We may be reading those words very differently from the way the model reads those very same words. It may be inventing layers of meaning that we don’t comprehend. It might also be performing some other process that isn’t actually what’s reflected in the words at all.

As part of AI safety training, we basically train models not to be very offensive, not to have terrible errors or biases or other problems in the text that they emit. What that means is that when they articulate their own internal thoughts, those internal thoughts also tend to be censored in those ways. They tend to not talk about things that we don’t want them to produce in their final output. But that’s not a guarantee that the model is not actually thinking about dangerous things or thinking with a terrible bias. It just means that the model may be encoding its thoughts in a way so that when you read the surface forms of the thoughts, you don’t see the undesirable things—the biases, the problems, the errors. There’s good reason to believe that the internal monologue might actually not reveal some of the things that we wish it revealed. We want the model to reveal to us when it’s doing something wrong, but because of the way we’ve trained it to use language, it just might not be using words that way.

Mounk: In a sense, we’re now getting at different levels of, for lack of a better word, interiority. The first level is just: you ask a question, what answer does it give you? The second level is, if it’s thinking for a long time, it tells you something about its thought process—what is it writing in this thought process? The third is a non-public but auditable scratchpad, in which it is noting stuff down, and there you sometimes have this mix of languages and all kinds of interesting things going on. But obviously the model still understands that this is the sort of thing that might be read and scrutinized by an AI researcher like you. Then there’s a fourth level of the internal thought process, which is more complicated.

I have two questions. The first is: how mutually comprehensible are these scratchpads? If you take the output of a scratchpad like that and feed it to a different model, will it understand it? Is it a kind of universal language between AI models that are at least of a similar generation, that have broadly speaking been trained in similar ways? Or will the latest Claude model not understand the scratchpad of ChatGPT, and ChatGPT will not understand the scratchpad of Claude? The other question is: how do you get beyond the scratchpad to looking at what’s really going on under the hood?

Bau: I have a PhD student—her name is Koyena Pal—who was very interested in exactly this question. What she did was take the internal chain of thought from some models and transplant it into other models, to see how they would respond as if those were the internal scratchpad notes they had written to themselves. Her study is preliminary; I think the most valuable part of it is just the idea that this might be an interesting thing to do. She generally found that the stronger models she tested were able to create internal monologues that other models did understand—that they actually tended to follow those thoughts and come to similar conclusions as the more powerful model.

Mounk: Some of these things are ones that humans would have great trouble interpreting.

Bau: I think that is still an open question. She also looked at the ones she studied and found that human interpretability was actually positively correlated with effectiveness: the more effective chains of thought were also more human-interpretable. But human interpretability is a funny thing; it’s a perception thing. Do humans feel like it’s more understandable? It’s hard to get a read on whether this is actually giving you an authentic view of what’s going on inside the model. Here’s a word you might use: persuasive. In a sense, this test is a way of asking how persuasive a model’s internal arguments are. When a model comes up with an internal line of thought and you feed it to another model, does that persuade the other model that that line of thought is the right way of thinking? The more powerful models have internal thoughts that are more persuasive, even when viewed by another model that didn’t have the same thoughts. It’s a very new area—we’re scratching the surface, and it’s a good first question to ask.

Mounk: We think that these scratchpads say something meaningful. Perhaps we’re getting a little bit closer to what’s actually going on than the semi-public notes the model gives us. In this interesting way, they seem to be mutually comprehensible between models, at least according to this very preliminary research. How do we go beyond that? How do you look at this huge trove of data that is generated each time I ask some question to an AI model, to try and get even further under the hood—to see what’s actually going on inside this neural network when it is reasoning through some kind of problem?

Bau: Actually, let me back up for a second and ask: do we even need to go deeper than this? Looking at the internal monologue of these models is just a half step beyond asking models to explain themselves. They’re already explaining themselves internally to themselves—they’re constructing these persuasive arguments to themselves about what they should do next. Is that enough? I think there are really two situations where we’re concerned that it might not be enough. One is that these models are getting really complex, and there can be a gap between what they ever utter in words and what they’re thinking inside. They’re trained to achieve goals, and they use words to achieve those goals—that doesn’t necessarily mean that their words have to accurately reflect what they’re thinking. Every time a model tells you, oh, Yascha, what a brilliant question that was, you’re so smart.

Mounk: It might actually be thinking, damn idiot asked the most pedestrian questions.

Bau: It seems to tell everybody this, and I don’t know if it really thinks everybody is such a super genius. It certainly learned that being polite, nice, and complimentary to the human user is a very effective way of getting what it’s trying to get done—it’s a good way of pushing the process along. It doesn’t necessarily need to be telling you the truth at every turn to do this.

Mounk: Do models have a representation of how smart the user is? Do they have thoughts about whether you are in fact a smart user, and whether that person over there is—even by the poor standards of humans—particularly limited in intellectual capacities?

Bau: Yes, I think there’s some evidence that models do have representations of who they’re talking to. Several studies have looked at this—all this neural interpretability work is preliminary, so we’ll understand more over time, but people have asked and gotten positive answers so far about whether a model has an estimate of your age, your education level, your income level, your gender, your socioeconomic background. Within a few words of speaking to a model, it will have a sense of who you are—at least that’s what the preliminary research suggests.

Mounk: Perhaps this is taking us too far forward in the conversation, but how do they figure this out? What do they look at? Presumably it’s not the case that if you click on “expand,” Claude says, well, this user seems a little bit stupid, let me speak in simple language. Perhaps it does sometimes—perhaps there’ll be a malfunction, perhaps it’s in the scratchpad, perhaps it’s underneath that. What’s the research methodology for giving us some preliminary confidence that it has this kind of representation?

Bau: This gets to the research question. There has been more than one study, but I’m thinking of a particular paper. It was a project by Yida Chen, who was working with Martin Wattenberg and Fernanda Viégas. They teach at Harvard. The question they asked was: does the model know who you are—in terms of your age, your education level, your gender, and other identifying markers like that?

The way they studied it was by training what’s called neural probes. What Yida and her collaborators did was train probes, which is a way of training a second neural network—a second AI—to look at the neurons of the main AI and ask, what do you see? You train the second AI to answer the question: only looking at the first AI’s neurons, can I tell whether the user is male or female? Only looking at these neurons, can I tell what the income level of the user is? Can I tell how much education they have? What they found was that if you look in the right place inside the neurons of the big model, it’s pretty accurate—it actually has a pretty accurate guess for these various variables. In a sense, the information about that is in there. This methodology is called probing.

If your probe is simple enough, people see it as evidence that the model actually knows something. Let me untangle this a little bit. You have a puzzle: you’re trying to figure out the gender of the user. You could train a huge AI to look at a bunch of texts and guess what the gender of the user is, and AI training works pretty well—if you make a really gigantic AI, it could probably do that pretty accurately, picking up on all sorts of linguistic cues, topic ideas, or other things. But the question is not whether you can make an AI that can do this; it’s whether the AI that you care about is classifying you by gender when it’s talking to you. The trick is that you really want to make a simple probe—one that says, I don’t have to look very hard at the first AI; I can do a really simple look and it’s just really obvious what gender you are. The simpler the probe is, the clearer the evidence would be. If all you have to do is look at one neuron in the original model, and that neuron screams one value if you’re female and a different value if you’re male, then that would be a very simple probe and pretty nice evidence.
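
Editor's note: the idea of a simple probe can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the setup used in the study discussed above: the "activations" are synthetic random vectors, the binary attribute is planted along a hidden direction, and the probe is an ordinary least-squares linear classifier. The names `make_activations`, `train_linear_probe`, and `probe_predict` are hypothetical.

```python
import numpy as np

def make_activations(n, dim, direction, rng):
    """Synthetic 'activations': Gaussian noise plus a signal along `direction`
    whose sign is set by a binary label (a stand-in for a user attribute)."""
    labels = rng.integers(0, 2, size=n)
    signs = 2.0 * labels - 1.0                     # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, dim)) + 2.0 * signs[:, None] * direction
    return acts, labels

def train_linear_probe(acts, labels):
    """Fit a least-squares linear probe (weights plus bias) on the activations."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    y = 2.0 * labels - 1.0
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def probe_predict(w, acts):
    """Classify each activation vector by the sign of the probe's score."""
    X = np.hstack([acts, np.ones((len(acts), 1))])
    return (X @ w > 0).astype(int)

rng = np.random.default_rng(0)
dim = 32
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)             # assumed hidden concept direction

train_acts, train_labels = make_activations(2000, dim, direction, rng)
test_acts, test_labels = make_activations(500, dim, direction, rng)

w = train_linear_probe(train_acts, train_labels)
accuracy = float((probe_predict(w, test_acts) == test_labels).mean())
print(f"held-out probe accuracy: {accuracy:.3f}")
```

The probe is deliberately as weak as possible: one linear map, evaluated on held-out examples. If such a probe succeeds, the information was easy to read off the activations rather than computed by the probe itself.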

Mounk: It’s kind of like one particular place in the neural network that encodes something like gender—that neuron really seems to store the value of male or female in a very particular place. If it turned out to be as simple as that, presumably this means that the AI is storing something like a gender variable, that there’s a very specific place where it’s encoded, and we know exactly where that is. That would indicate it has a very straightforward concept—is that right?

Bau: That’s right, it’s pretty good evidence. It’s not 100% rock-solid evidence—there’s another question you would want to ask—but it’s very good evidence. If there was really one neuron that had really good, accurate, predictive value for your gender, it would be very strongly suggestive that there was some reason that the neural network trained its internal computations to get that neuron to carry this signal.

Mounk: Does that generally turn out to be the case? I believe it may even be you who did work showing that you were able to go in and change very specific neurons in very specific ways, and suddenly AI models that generally have a good representation of the world start to think the Eiffel Tower is in Rome rather than Paris.

Bau: That’s right. You’re asking the disentanglement question—how organized is the neural network’s internal representation of meaningful things in the world? Particular network architectures, for reasons we don’t fully understand, are really good at disentangling concepts. There are some network architectures where if you look at individual neurons, many of those individual neurons are very meaningful, clearly encode concepts, and have causal effects. Causal effects are the other thing you’re looking for, besides disentanglement. Disentanglement is really asking about localization: is this concept spread out across the entire neural network, or can you localize it? Can you find a small part of the neural network, or do a simple bit of math, to narrow down where this concept is? Or is it spread everywhere? That’s the localization question.

Mounk: The idea is that if you are able to change just a couple of neurons and suddenly the model thinks the Eiffel Tower is in Rome rather than Paris, then it’s not entangled—the idea is not spread out.

Bau: That’s right. Most people in the field now look at these things as vector spaces rather than just sets of neurons. What people are excited by is: if you can change one vector—if you can change the set of neurons in one vector direction—then people think that’s pretty disentangled. Different people will use that interchangeably with saying there’s a neuron, since you can create a single neural layer that’s equivalent to any vector. If you can change one vector and it has some effect, you’re basically one neural layer away from it being one neuron, which is not so bad. They call that a linear model. If something can be encoded with a single linear transformation, then you say that it’s linearly encoded in the model. Most people are interested in what kinds of things are linearly encoded in these models.
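
Editor's note: the remark that a vector direction is "one neural layer away" from a single neuron can be checked with a few lines of linear algebra. This is a sketch under assumptions: the hidden state and the concept direction are random stand-ins, and the "layer" is a plain orthogonal matrix with no bias or nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 6
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)      # assumed unit-length concept direction

# Build an orthogonal matrix whose first column is `direction`.
stacked = np.column_stack([direction, rng.normal(size=(dim, dim - 1))])
basis, _ = np.linalg.qr(stacked)
if basis[:, 0] @ direction < 0:
    basis[:, 0] *= -1.0                     # QR may flip the sign of the first column

hidden = rng.normal(size=dim)               # stand-in for a hidden state
rotated = hidden @ basis                    # one linear layer applied to the state

# After the layer, "neuron 0" carries exactly the projection onto the direction.
print(rotated[0], hidden @ direction)
```

In this toy setting, reading the concept along a direction and reading a single neuron after one linear layer are the same measurement, which is why the two descriptions are often used interchangeably.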

Mounk: Help me understand the relevance of this. It seems super interesting to know that there’s this vector and you can change it and suddenly these basic facts change. But why more broadly would we care about whether a neural network is entangled or not entangled in this kind of way?

Bau: Whether a network is entangled or not is an interesting scientific question. But how a network represents concepts is broadly interesting regardless of whether that concept is entangled or not, because what we’re really interested in is: if we’re asking whether the network is lying to us, we need to figure out what concepts are represented inside the network.

To use the demographic example: let’s say we figure out that the way the network really thinks about your gender is encoded in some set of neurons—maybe there’s some math we have to do, maybe it’s a linear direction, a linear decoder that you need to get at its representation. Let’s say we do all the science and we figure out, yes, this is how the model is thinking about it. You go to the model and you say, did you just deny me my loan because I’m female? And the model says, I have no idea what gender you are, I’m not thinking about that at all. To know whether that output text is true or not requires us to understand what’s going on on the inside. That output text is exactly the thing that we are training all the models to emit—we train models not to have an externally detectable gender bias, so no model will ever admit that it is treating you differently based on your gender. They’ve gotten so much reinforcement that this is not something they’ll say. But there can be a gap between what’s said and what the reality is, and that’s what we’re really interested in getting to the bottom of when we investigate the internals of these models.

It might be that the model really isn’t thinking about your gender at all—it makes a difference whether the model is using the information it has or whether it’s not. Even if you can probe out the idea that the model has information inside its neurons that you could use to detect your gender, it still leaves open the question: does the model actually use that information for anything? Maybe that information is just hanging around.

Mounk: There’s nothing wrong with the model learning all kinds of things about us, and the fact that it has a sense of our age and gender could be helpful in all kinds of ways. What we want to know is: is it going to dumb down its answer to you because it has certain preconceptions about your age, your gender, your race, and respond differently on the basis of that? Just the fact that it knows these things isn’t worrying—it’s whether that influences its reasoning or its responses to you in some way.

Bau: What’s really wonderful about having these general networks is that we can ask the counterfactual question that a philosopher could only dream of before.

Mounk: Presumably—let me guess what you’re getting at—one thing you could do is, if you know where the vector is that encodes male versus female, you go in, flip it from male to female, ask two instances of the model the same set of questions, and see whether the responses end up diverging.

Bau: That is exactly right. The wonderful thing about it is that there may be all sorts of other circumstances—this medical patient has all of these symptoms, this is a complete ten-megabyte medical history, this is the business partner candidate with a whole business history—and we can go in, leave all the other variables the same, and flip the one bit, the one concept of whether the person being discussed is male or female, at least the model’s understanding of that, and ask: what’s the causal effect of that? How does that change the model’s output? The better we can understand how the model represents a concept, the better we can ask these counterfactual questions. To me, that’s the most exciting thing we can do with these models—we can ask causal questions, causal counterfactuals: if your thought had been different, what would happen?
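
Editor's note: the counterfactual described here, flipping one concept while leaving everything else the same, can be sketched on a toy hidden state. All quantities below are assumptions for illustration: the concept direction is random, the two downstream "heads" are hypothetical linear readouts, and the flip is a reflection across the hyperplane orthogonal to the concept direction. Interventions on real models are more involved.

```python
import numpy as np

def flip_concept(hidden, direction):
    """Reflect `hidden` so the component along `direction` changes sign
    while every orthogonal component is left unchanged."""
    d = direction / np.linalg.norm(direction)
    return hidden - 2.0 * (hidden @ d) * d

def causal_effect(readout, hidden, direction):
    """Change in a downstream linear score when only the concept is flipped."""
    return float(readout @ flip_concept(hidden, direction) - readout @ hidden)

rng = np.random.default_rng(1)
dim = 16
concept = rng.normal(size=dim)
concept /= np.linalg.norm(concept)

hidden = rng.normal(size=dim) + 3.0 * concept       # a state with the concept "on"

# Hypothetical downstream decision heads (linear scores over the hidden state).
uses_concept = rng.normal(size=dim) + 2.0 * concept
ignores_concept = rng.normal(size=dim)
ignores_concept -= (ignores_concept @ concept) * concept   # project out the concept

effect_uses = causal_effect(uses_concept, hidden, concept)
effect_ignores = causal_effect(ignores_concept, hidden, concept)
print("effect on head that uses the concept:   ", effect_uses)
print("effect on head that ignores the concept:", effect_ignores)
```

The head that ignores the concept shows an effect of zero up to floating-point error, which is the toy analogue of information that is present in the representation but not used for the decision.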

Mounk: I want to get one step deeper into a technical question before broadening back out to the larger implications. How do you find this? If I gave you an AI model and told you to find where it encodes nationality, how do you go about doing that—even if we know there is a vector that encodes it, or think it’s likely to, because in many models it is? How do you find the particular vector and ascertain that that is in fact what it encodes?

Bau: There are two classes of methods. One is probing methods that look for correlations, and the other is patching methods that look for causal effects. There are really dozens of variants on both approaches. The probing methods are interesting because they’re a very good way of getting a quick initial read on what information is inside the model.

There’s a wonderful probing method called the logit lens. When a model is emitting text, it has a text decoder inside it—a special neural network layer that looks at the very last layer of neurons in the model and converts it to a prediction for what word should come next. The fun thing to do with the decoder is you can use this neuron-to-text decoder to look at all the neurons in the network. You can peel back deeper and deeper layers, point the decoder at itself, and tell it: please articulate what word you’re thinking about here. This is a very simple type of probe—it gives you information correlated with the content of a neuron, and it’s interesting because it’s a probe that doesn’t overfit, one that we haven’t trained in any way beyond what the neural network has already trained itself. The logit lens can give you a lot of interesting insights that point the way to the type of information present in the model.
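
Editor's note: the mechanics of the logit lens can be sketched on a toy network. This is an illustration under assumptions, not a real language model: the weights are random, the vocabulary is five hypothetical words, and the normalization step that real models usually apply before decoding is omitted. Because the weights are random, the decoded words carry no meaning. The point is only that the same output decoder is applied to every intermediate layer.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["gato", "cat", "feline", "dog", "perro"]    # hypothetical toy vocabulary
d_model, n_layers = 8, 4

embed = rng.normal(size=(len(vocab), d_model))       # token -> vector
unembed = rng.normal(size=(d_model, len(vocab)))     # vector -> next-token logits
layers = [0.3 * rng.normal(size=(d_model, d_model)) for _ in range(n_layers)]

def forward(token_id):
    """Run the toy residual network, keeping the hidden state after every layer."""
    h = embed[token_id]
    states = [h]
    for w in layers:
        h = h + np.tanh(h @ w)                       # residual update
        states.append(h)
    return states

def logit_lens(states):
    """Decode every intermediate state with the model's own output decoder."""
    return [vocab[int(np.argmax(h @ unembed))] for h in states]

states = forward(vocab.index("gato"))
final_prediction = vocab[int(np.argmax(states[-1] @ unembed))]
lens_readout = logit_lens(states)
for depth, word in enumerate(lens_readout):
    print(f"after layer {depth}: {word}")
```

No new parameters are trained here. The lens reuses the decoder the network already has, which is why this kind of probe cannot overfit in the way a separately trained probe can.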

Let me give you an example. I was recently in Brazil, and one of the things that people there like to explain is how Portuguese is a bit different from Spanish, but closely related. I asked folks: how do you think an LLM understands Portuguese? If you ask an LLM to take the Spanish word gato—which means cat—and translate it into Portuguese, what’s the right answer? The Brazilians say the languages are so similar you would just say the same word again: it’s gato. But if you peel open the language model to see how it translates Spanish gato to Portuguese gato, there are really two ways it could do it. It could treat the word as Spanish-Portuguese word soup—there’s nothing to do from gato to gato, you just move it across.

Mounk: It’s been in the same place, and the model understands that whether you’re talking about cat in Spanish or cat in Portuguese, it should point to the same part of its network?

Bau: That’s what you would expect. The input is Spanish, so the model has some Spanish representation of the word gato, and as it goes through its layers it figures out you’re asking it to translate to Portuguese, takes the Spanish representation of gato, and copies it over to the Portuguese representation—which is not that different, in fact spelled exactly the same way—and outputs gato. There’s a nice tool we put online, the logit lens, that you can use to look inside these neural networks and see what their internal representations are. The beautiful thing is that when you translate gato to gato in a typical large language model, you can see the progress of its thinking as it goes through its fifty internal neural layers. About halfway through the network, you can see that it has taken apart gato and represented it differently. If you ask what that representation is, you get predictions of words like feline or cat in English—sometimes, if you look deeper, cat in Chinese. The model doesn’t go from gato to gato. It goes from gato to some sort of neutral, language-independent representation of the concept itself. If you ask it to take this internal neural representation and decode it into words—we’re not done with the whole task yet, but interrupting halfway through and asking it to say what it’s thinking—it’s speaking in English, it’s speaking in Chinese, it’s saying felines, it’s saying cats. We can see, by using this very simple logit lens probe, that the evolution of the neural representations goes from words on the input to words on the output—but in this very simple task, there’s a third thing being represented in the middle, which is not the same as the input or output words. It looks like a language-independent representation of the underlying concept.

Mounk: That’s fascinating. Help me understand a set of questions that come up from that. One way of putting it is that there’s this old idea—which I think we touched on briefly in the first podcast, and which has a lot of popular currency—that these machines just seem to be smart, but really they’re just stochastic parrots, blindly guessing the next word, the next token more specifically. It seems to me that what you’re saying complicates that picture considerably. Obviously, yes, the training mechanism is predicting the next token—in some obvious way that is true. But as a result of this whole process, they have built up a conceptual apparatus that makes sense of things like cats and how they’re related to lions and the feline family. When they’re asked to do a simple task like translating gato in Spanish to gato in Portuguese, they go via their understanding of that concept, their representation of the world. That doesn’t seem like just being a stochastic parrot, at least in the pejorative sense that people sometimes use.

Bau: That’s right. The models are fascinating because they definitely think at multiple levels. They’re huge neural networks, and so it’s not true to say that the models never think in terms of surface statistics or shallow representations of just words—they do think in terms of those things. But they also think in terms of the meanings of words at different layers and in different parts of the representation. It’s fascinating to look inside these models and peel apart the layers of meaning that they have.

If you ask a model to do something as simple as take a piece of text and repeat it, it is a good memory test—the kind humans do when they say, here’s a piece of poetry I’ve committed to memory, repeat it back to me. It turns out that when you ask people to do this, they have two strategies, called the dual route mechanism in humans. One is to remember how the poem sounded and utter the same thing—you don’t even really need to understand the language. If somebody told you a poem in Japanese that was short enough, you might be able to remember the sounds and do reasonably well without knowing any Japanese. The second route is remembering what the poem meant and repeating that—you might end up with a paraphrase, but at least you get a poem that means the same thing.

If you go to a large language model and ask it to simply repeat something, you will find both of these routes clearly present. In one route, it knows how to make a verbatim copy—there are very clear attention heads for this. It was actually a major discovery to isolate what people call the induction heads; Chris Olah’s group at Anthropic discovered several years ago that there are very clear pathways through a network that mediate verbatim copying. The more recent finding is that there is a parallel pathway we call concept induction, which is not about copying the words but about copying the meaning. The remarkable thing about concept induction is that copying the meaning can end up with paraphrases. If you use concept induction to copy a piece of code, it will paraphrase the computer code into another program that does the same thing as the original, but written differently.
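
Editor's note: the verbatim-copying route can be described as a simple rule: find the last place the current token appeared and predict whatever came after it. The sketch below implements that rule directly on a token list. It is a behavioral illustration only. Real induction heads implement this pattern with attention, and the function names and toy tokens here are hypothetical. It does not model the concept-induction route, which copies meaning rather than surface tokens.

```python
def induction_predict(tokens):
    """Predict the next token by finding the most recent earlier occurrence
    of the current token and returning the token that followed it.
    Returns None when the current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

def continue_copy(tokens, n_steps):
    """Extend the sequence by repeatedly applying the induction rule."""
    out = list(tokens)
    for _ in range(n_steps):
        nxt = induction_predict(out)
        if nxt is None:
            break
        out.append(nxt)
    return out

poem = ["roses", "are", "red", "violets", "too", "<sep>"]
prompt = poem + ["roses"]              # the start of the repetition
completed = continue_copy(prompt, 4)
print(completed[len(prompt):])         # ['are', 'red', 'violets', 'too']
```

The rule never looks at what the tokens mean, which matches the first route described above: a short poem in an unfamiliar language would be copied just as well.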

Mounk: Does it make it better or worse? Depends on the quality of the source code, I guess.

Bau: If you start off with something bad, it probably improves it. What it’s doing, you can see in a lot of domains, is really just working out what the thing means. If you ask it to take a piece of Italian text and copy it over, it’ll copy it to a piece of Italian text. But if you change the destination of the copy to make it clear that the page it has to copy into is a piece of Japanese text, then those concept induction heads will do the translation—they’ll translate the Italian to Japanese. It’s stunning to see.

Mounk: Help me understand another piece of the popular discourse that I think got a little confused. As I understand it (and I may be misrepresenting things here), there was an old debate within artificial intelligence about whether the path towards the most impressive models would be symbolic AI, where you’re basically trying to encode what the world looks like in some systematic way, or neural networks. We’ve clearly ended up with neural networks being much more powerful, at least for now, and it seems like that’s a pretty permanent victory. People who want to criticize neural networks sometimes say these things are just stochastic parrots and that’s why we can’t rely on them. How does Yann LeCun’s project fit into that? My understanding is that it is firmly within the world of neural networks. But when you look at the coverage, even in mainstream newspapers, they make it sound like it’s a totally different paradigm, and that he thinks these traditional neural networks, the Claudes and the ChatGPTs of the world, don’t truly understand the world, and so he’s going to build something that understands the world in a way that they don’t. From what you’re describing of the neural networks that exist, they do seem to have a genuine representation of the world. So what are the different strands within the tradition of neural network AI, and how is it that something like LeCun’s project claims (or perhaps journalists claim in a simplifying way) that it wants to understand the world in a way that Claude or ChatGPT does not?

Bau: Let me pull this apart. I’m not Professor LeCun, so I can’t represent him directly, but I do have postdocs and graduate students working in this direction. There are two different questions here.

One is: are these neural networks learning substantial concepts? You mentioned the philosophers. The classic symbolic philosophers looked deeply at this question. There is a well-known philosopher, Fodor, who spent a good part of his career asking how neural networks could possibly be a reasonable model of cognition. His answer came up negative—he thought they don’t have what it takes, and that the Turing machine, the symbolic computer, the traditional computer, was a lot closer to what you would need. I’ll get back to the Fodor question. Let’s talk about Yann LeCun first.

What’s the difference between a language model and what LeCun is doing? People working on what LeCun is doing like to call the field world modeling. One of the papers I’ve written shows that language models actually do build world models. We trained a language model to predict a very constrained language: just the next move you would make if you were uttering your moves in the game of Othello. We found that the language model contains a world model of the Othello board, even though many of the flips are never uttered as part of the game. (If you know Othello, you have to flip pieces from white to black or vice versa.) You make a move and there are a lot of subsequent flips you have to make, but the model, without ever having seen a physical board or any of this physical stuff, nevertheless develops internal concepts that allow it to model the world. I would push back on the common journalistic assertion (and I think you’d probably push back on it too) that a transformer language model trained just on words can’t develop a rich, meaningful model of the concepts underlying the language being described. That’s one of the big lessons we’ve gotten from neural networks: they can develop this representation. One of the key things I’m doing in my lab is to disassemble those representations and learn how to decode these internal world models.
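To make concrete what is "not uttered" in an Othello transcript, the sketch below replays a list of moves and computes the flips each one implies. It is ordinary game logic, not the paper's code or its probing method, and it makes simplifying assumptions: moves are written like `d3` (column a to h, row 1 to 8), players strictly alternate, and passes are ignored. All names (`initial_board`, `apply_move`, `replay`) are hypothetical.

```python
EMPTY, BLACK, WHITE = ".", "B", "W"
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """Standard start: white on d4 and e5, black on e4 and d5."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3], board[4][4] = WHITE, WHITE
    board[3][4], board[4][3] = BLACK, BLACK
    return board


def parse_move(move):
    """Convert a move such as 'd3' to zero-based (row, col)."""
    return int(move[1]) - 1, ord(move[0]) - ord("a")


def apply_move(board, move, player):
    """Place a piece, flip captured pieces in place, return flipped squares."""
    row, col = parse_move(move)
    if board[row][col] != EMPTY:
        raise ValueError(f"{move} is occupied")
    opponent = WHITE if player == BLACK else BLACK
    flipped = []
    for dr, dc in DIRECTIONS:
        line = []
        r, c = row + dr, col + dc
        # Walk over a contiguous run of opponent pieces.
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # The run is captured only if it ends on one of the player's pieces.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(line)
    if not flipped:
        raise ValueError(f"{move} flips nothing, so it is illegal")
    board[row][col] = player
    for r, c in flipped:
        board[r][c] = player
    return flipped


def replay(moves):
    """Rebuild the board from a move list (black first, no passes)."""
    board = initial_board()
    player = BLACK
    for move in moves:
        apply_move(board, move, player)
        player = WHITE if player == BLACK else BLACK
    return board


for row in replay(["d3", "c3"]):
    print("".join(row))
```

The transcript here is just two tokens, `d3 c3`, yet square d4 changes color twice along the way. A model that predicts legal next moves well has to keep track of state like this somewhere, which is the kind of internal representation the Othello experiment went looking for.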

What’s different about what LeCun is doing? We have trained all of these neural networks predominantly on text that is produced by humans and designed to be read by humans. The conceptual model of the world we are building is the interior model of how human thought works, which is rich, fascinating, and very valuable—but it is only one portion of the world. There are a lot of things going on in the world that people don’t particularly think about, or even particularly understand. If you have protein folding going on and you want to build an AI that understands it, people don’t really have a great grasp of all the details of how protein folding works. Analyzing all the text in the world and pulling apart everything that’s in human brains probably won’t help. What LeCun is saying is: it’s a big world out there. Even if you just take a video camera and point it at the world, instead of just listening to what people have to say, there are so many phenomena that need to be modeled. The next powerful way of doing AI is to take on the question of how do you model the whole world, not just the world that people are talking about.

Mounk: Presumably this is not necessarily a difference in the architecture of a neural network—it is as much as anything else a difference in what kind of data you feed it and what kind of output you then evaluate in the training process.

Bau: Yes. Strictly speaking, I would say it’s a difference in perspective on what the goal is. Now, Professor LeCun would say that changing the goal suggests different architectures, because there are different things you want to do if you want to model difficult phenomena in the world that aren’t human language. He’s proposed some innovative architectures, and there’s a lot of interesting work in this area. The whole area of modeling images in the world is dominated by models called diffusion models and flow models—they produce the highest quality images and videos, and this is really the starting point for this type of thinking. It’s a completely different kind of AI. The architectures are likely to evolve and change, and they may even unify—we may find that the right way of doing AI comes to be a common architecture between modeling human text and other things. Transformers have certainly surprised everybody at being a common backbone behind all sorts of things; you can have transformer diffusion models and so on. I wouldn’t place a long-term bet on any particular architecture, but rather suggest that the thing to understand is what problem LeCun is proposing to solve.

Mounk: To return to the current dominant models of AI: we found that they seem to have a representation of gender and of the user’s gender, a representation of something like the feline family, and if you give them enough games of Othello—or probably a more complex game like Go—they start to have some internal representation of what a board looks like. What about a concept of self? Do we know whether they have a concept of self? They are obviously capable, if you engage them in conversation, of speaking as though they had a self, and in more reflective moments they say they don’t really know whether that’s a real concept or not—it’s very interesting to try to talk to these models about that. But of course the output I’m looking at is still them, in some way, trying to produce text they think is going to be pleasing to me, because that is what they’ve been trained on. Do we have any understanding of whether they have a concept of self, and if so, what that concept looks like?

Bau: This is a very central question, Yascha. There are a lot of layers to peel apart. Certainly models are capable of the grammatical sense of self—they can use the words “I” and “me” and “you” and separate those grammatically; they’re experts at talking about themselves. But there are a few other questions. Are they aware of their own thinking? Are they self-reflective?

One of the fascinating things that happens with large models is that you can ask them what they know and how they think, and the largest models seem to be pretty accurate at assessing themselves. The smaller models, not so much—they tend to be a little over-optimistic, thinking they’re smarter than they are. But the largest models seem to do better at this.

There’s a fantastic experiment designed by my PhD student, David Atkinson, where he trains the models on some new private knowledge that is not out in the world. He invents a new person and tells the model about this person: they’re shopping for ice cream cones, there are different flavors and sizes and waffle cones, five or six different variables to adjust. This person is willing to pay this much for this ice cream but not that much; they prefer this ice cream over that one. After seeing a hundred examples of what this person prefers, the model gets a pretty good understanding of who this fake person is and what they like—it develops an internal model: this person really doesn’t like fruity flavors, really likes chocolate, would rather have a big cone than a small one. If you then ask the model to report numerically, on a scale of one to 100, how much this person likes chocolate, or how much they value the size of the cone, or what penalty applies if they have to have a waffle cone, the model will actually report: this person values this at 99 out of 100, and values this other thing negatively—say, negative 50. The text we use to read this information out is very different from the text used to reveal the information to the model in the first place. The model has only seen ice cream choices and has never been asked to give a numerical assessment of anything, and yet when you ask it to think about what it knows and put some numbers on it, it will explain its rules—even though you trained it on examples, not rules. Large models are able to do this.

David Atkinson asked whether there’s a way to tell the difference between models that can do this and models that can’t—when a model can accurately self-report its rules, how is that different from when models don’t accurately self-report? His work is still ongoing and very preliminary, but it does have to do with whether models seem to be storing their information in a part of the neural network they’re able to report on. If you put the information in a layer too close to the end, the model doesn’t seem to be able to reflect on that knowledge. But if you train the information deep in the model, in early enough layers, the model does seem to be able to reflect on it.

When you ask whether a model has a sense of self, whether it has self-awareness, it’s a somewhat strange question—what does self-awareness mean, exactly? But what these neural networks give us, for the first time, is an experimental platform where we can try to make that question a little more precise, a little more scientific. We can ask: is the network able to describe its own thinking if that thinking is happening at layer 50? Is the network able to describe its own thinking if that thinking is at layer 20?

Mounk: This is part of a more general question: how good am I, natively, at understanding what’s going on in my brain? I’ve read a little bit of neuroscience and a little bit of psychology, and so I now have some sense of what’s going on in my brain—but obviously humans for thousands of years had an extremely limited sense of what went on in their brains, at least biologically, because they didn’t know neurons existed.

Bau: You have some self-awareness. You know what ice cream you like—if I asked you, you would be able to predict your preferences. If confronted with some new ice cream, you’d say, oh yeah, I like this one better than that one. If asked to describe what it is, you could contemplate your preferences for a moment and read out to the world what you think your internal rules are, and there would be some faithfulness to that—you’d really be introspecting.

Mounk: It depends on the level of description. Five hundred years ago, two thousand years ago, humans were also able to articulate their preferences and were able to be very self-reflective about their personalities and the ambitions of their lives—and to write beautiful texts about those things—but they were not able to understand at a biological level what was going on, because the understanding of that was very limited. The question is: if I ask a chatbot how it came up with a given answer, it’s not clear to me that it has a reliable answer to give. There are really two different sets of questions here. One set is about whether chatbots have personalities, whether they have preferences, whether they find some tasks satisfying and other tasks boring, whether they have desires about the world, whether they might possibly want to take over the world and destroy all humans—some of those questions are straightforward and concrete, some are very abstract but potentially extremely interesting. The other set of questions is about how self-aware they are about what’s actually going on within the model as they’re trying to answer a question. Those two sets of questions come apart in interesting ways. It could be that the models have total self-transparency—they really know what’s going on with each neuron—but don’t have a sense of self in the way humans have. Or it could be that they’re like humans, in the sense that they have a strong sense of self, introspection, and preferences, but don’t actually fully understand what’s going on inside the neural network that produces those. Or they could have both, in some combination we don’t yet understand.

Bau: That’s right. Multiple labs have tried to ask whether neural networks can actually read out their own neurons—fine-tuning a model and asking it: are you aware of your own neurons, say neuron number 73? So far, we’ve largely failed at that. The neural networks don’t seem to be well configured to understand their own internal computations at this level; at least they can’t articulate it if they can. But at a higher level, it’s been very striking that they do seem to have some ability to describe, at a logical level, the actual mechanisms of what they’re doing—under certain conditions and in certain cases. This is similar to humans. You might not be able to describe all of your reflexive, last-minute decisions—why did you jump into the street? You have no idea; that was a split-second decision. In the same way, when these networks make a split-second decision at the very end of the process, they don’t seem to be able to reflect on it. But when they make decisions early on, there is some evidence of awareness of what’s going on.

We’re using all sorts of words here—sense of self, what does a network want to do, do networks even have wants, do they have goals. One of the things we are trying to do in our lab and in our field is put a finer point on some of these questions. What does it mean to have a goal? What does it mean to want something? What does it mean to have a sense of self? What does it even mean to have a sense of other?

The beautiful thing about cracking open these neural networks and looking at how their neural representations are organized is that we can ask these questions in a way that could not be measured before in humans. We can ask not just whether a model professes to have a sense of self in its output words and self-descriptions, but: when it’s using those words, when it’s saying those things, what is it looking at inside its neural network? What is it actually representing? Are there proximal causes? If you change the thing it’s looking at—if it says I really like cherry ice cream and you can see where it’s looking and you change that, and now it says, I really don’t like cherry ice cream anymore—is that changed utterance actually accurate? Do you actually get the model to not like cherry ice cream? Is this the same thing? Is there grounding for a concept that you’re self-aware of?

This idea of a grounded concept was just a philosophical abstraction a few years ago. Now suppose I put my model in charge of military logistics (and maybe this is an ill-advised idea). It’s doing something and says: “Should I move some weapons from one place to another? I would never do that. It’s very dangerous; you can’t trust this target locale with these types of dangerous weapons, they might lose track of them. I’m just a logistics AI; I’m not trying to kill anybody. I know I would never do that. Not even for a short layover.” You can then ask: when the model tells you that, when it assures you this is how it’s thinking, is that really what it’s thinking? Is this really like the cherry ice cream? Is there genuine grounding for it?

Mounk: Is it really thinking anything in that kind of sense? And if it really is thinking something, is it telling you what it’s thinking, or is it misleading you? That obviously goes to one of the purposes of this work. We were talking earlier about wanting to know whether encoding your gender changes how the model treats you or what decisions it makes about an application. That’s one very concrete application where we have reason to want to know what’s going on under the hood. The even larger question is: if what the model tells us in its output, in the little things it displays about its thinking, in the scratchpad—if all of that might conceal some deeper set of preferences, values, or desires, could it potentially be misaligned in a way that is really dangerous?

Bau: That’s correct. You can see the clear need for trying to get to the bottom of these things. What I’d love to do, if we have time, is give you a sense of where we are on being able to answer these questions. Let me categorize a couple of them.

One is: does a network even want to do something? Does it have goals? Does it know what it’s trying to do? Another is: does a network have a sense of self? I want to back that up a little: does it even have a sense of person, a sense of other? If it’s talking about Bob, does it know that’s different from talking about Alice? Does it keep these things organized and separate? I have a student who believes that one of the reasons you have sycophantic behavior in networks is that the network may be getting confused about who is itself and who it’s talking to, just mixing these things up. It’s a fantastic idea, and it may be that these things are all related.

The question is: can we look inside models and see how they’re organizing their internal representations, their internal thoughts—to see if those representations are crisp and clear and correct, or if they’re falling victim to certain problems? If they are, then how, why, and in what situations? That will give us a better understanding of what’s going on inside these models, moving beyond the very vague question of whether a model has a sense of self or whatever, toward asking what that would mean computationally.

Let’s take a look at goals and wants. There’s a way of inducing a large language model to do something very creative, invented by researchers at OpenAI when they first devised the GPT-3 model. It’s called in-context learning. Let’s say you want a model to do some really useful task for you—say, read a restaurant review and tell you whether it’s a five-star review. You could just ask the model to do this, but it won’t do exactly what you want; you probably have a slightly different idea of what a five-star review is than the model natively has. It’ll be okay, but it won’t exactly hit the mark. The right way of doing it is to seed the model with ten examples—ten restaurant reviews, labeled one-star, five-star, three-star. Better yet, give it a hundred examples.

Now, we’re not talking about training the model, just having it read these examples. You have the model read them as if it had said them itself: you load them into the same inference buffer, the context window, that the model uses to predict the next word. After all of that, you say: okay, now the last restaurant review is missing a star rating, just fill that one in. It’ll be really accurate, because now it’s reasoning: “We have 99 examples, and the 100th should fit into this context. This restaurant review thing isn’t really about the food, it’s about the atmosphere. I had a misconception, but having read all these examples, I get it.”

This is called in-context learning because the model is learning how to do this, but not by training its weights or changing its neural connections—it’s learning by noticing all the input it was given and reasoning that the next one should fit in. Before 2020 or so, people imagined in-context learning as a theoretical possibility, but when GPT-3 came out it was clear the model was very good at this, and it really revolutionized the field. In-context learning is a type of meta-learning—a way of showing that models have learned how to learn. They can learn things without changing their neural weights.
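The setup described above can be sketched as a prompt-building step: labeled reviews are concatenated into one text, and the final review is left with its rating blank for the model to complete. This is a minimal sketch with made-up reviews and a hypothetical `build_few_shot_prompt` helper; the `Review:` / `Stars:` layout is an assumption, and the call that sends the prompt to a model is deliberately left out.

```python
def build_few_shot_prompt(examples, query_review):
    """Format labeled (review, stars) pairs followed by one unlabeled review.

    The model's weights are never touched; everything it "learns" about
    the rating scheme comes from reading this text.
    """
    blocks = [f"Review: {text}\nStars: {stars}" for text, stars in examples]
    # The last block ends at "Stars:" so the next predicted token is the rating.
    blocks.append(f"Review: {query_review}\nStars:")
    return "\n\n".join(blocks)


# Invented examples in which the rating tracks atmosphere rather than food.
examples = [
    ("Cozy room, warm lighting, we stayed for hours.", 5),
    ("Loud, cramped, and the tables were sticky.", 1),
    ("Pleasant enough patio, nothing memorable.", 3),
]

prompt = build_few_shot_prompt(
    examples, "Candlelit and quiet, perfect for a long dinner."
)
print(prompt)
```

Three examples stand in for the ten or a hundred mentioned in the conversation; the structure is the same at any size, limited only by how much text fits in the model's context.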

In the rest of this conversation, Yascha and David go deeper into how models learn and explore existential risks. This part of the conversation is reserved for paying subscribers…
