Chapter 11

Hallucinations, Tools, and Working Memory

This chapter covers the first cluster of 'LLM psychology' phenomena. Hallucination is when a model confidently makes things up. It is rooted in the fact that 'who is X' questions in the training set are all answered confidently and correctly, so the model learns to answer in that same confident tone whether or not it actually knows. We look at two mitigations. First, probe the model to find the boundary of what it doesn't know, then add 'I don't know' examples to the training set so it learns to associate internal uncertainty with refusal. Second, give the model tools (e.g. web search): it emits special tokens to call a tool, the result is pasted into the context window, and it answers from that. This leads to the core distinction: the context window is the model's directly accessible 'working memory,' whereas knowledge in the parameters is only a 'vague recollection' of training, like something you read months ago versus something in front of you right now.

Now to the topic I like to call 'LLM psychology': what emergent cognitive effects does this training pipeline produce? The first is hallucination, where an LLM makes things up and fabricates information out of thin air. It is a big problem with LLM assistants, and it was especially severe in early models years ago. It has improved lately (mitigations below), but first we need to understand where it comes from.

Imagine the training set has conversations like these: 'Who is Tom Cruise?' answered with 'He's a famous American actor and producer ...'; 'Who is John Barrasso?' answered with 'He's a US senator ...'; 'Who is Genghis Khan?' answered with 'Genghis Khan was ...'. These are all reasonable. The problem is that whoever wrote these correct answers either already knew the person or looked them up online, and then wrote the answer in a confident tone.

So at test time, if you ask about a completely made-up name, say 'Who is Orson Kovats?' (someone I invented and who probably doesn't exist), the model won't honestly say 'I don't know.' Even if some features and activations inside the network actually 'know' that this person is unfamiliar, the model won't surface that, because statistically it imitates the training set, where 'who is X' questions are all answered confidently and correctly. So it adopts that style and gives its statistically most likely guess, which is essentially fabrication. Resample this question on an old model (e.g. Falcon 7B from a few years ago) and you get 'an American sci-fi author,' 'a fictional character from a 1950s TV show,' 'a former minor-league baseball player' ... a different answer every time, because the model simply doesn't know and is just sampling from probabilities.
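
The resampling behavior described above can be illustrated with a toy sketch. Nothing here comes from a real model: the two answer distributions, their probabilities, and the helper names `sample_answers` and `summarize` are all invented for illustration. The assumption is only that a name the model knows well yields a sharply peaked distribution, while an unknown name yields a nearly flat one over plausible-sounding answers.

```python
import random
from collections import Counter

# Hypothetical answer distributions; all probabilities are invented for illustration.
KNOWN_NAME = {
    "an American actor and producer": 0.97,
    "a US senator": 0.02,
    "a novelist": 0.01,
}
UNKNOWN_NAME = {
    "an American sci-fi author": 0.26,
    "a fictional character from a 1950s TV show": 0.25,
    "a former minor-league baseball player": 0.25,
    "a British playwright": 0.24,
}

def sample_answers(dist, n, seed=0):
    """Draw n answers from a distribution, the way repeated sampling would."""
    rng = random.Random(seed)
    return rng.choices(list(dist), weights=list(dist.values()), k=n)

def summarize(dist, n=200, seed=0):
    """Return (most common answer, number of distinct answers seen)."""
    counts = Counter(sample_answers(dist, n, seed))
    return counts.most_common(1)[0][0], len(counts)

if __name__ == "__main__":
    print("known name:  ", summarize(KNOWN_NAME))
    print("unknown name:", summarize(UNKNOWN_NAME))
```

Both cases produce text in the same confident format. The only visible difference is that the unknown name gives answers that keep changing across samples, which is the symptom described for the old model.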

⚠️ Root of hallucination: the model is a 'statistical token tumbler' that merely samples the next token in the sequence. It isn't online and won't verify anything on the fly. Asked about something it never saw in training, it doesn't stop to check; it continues the confident-answer format and produces an answer that sounds plausible and is stylistically consistent but may be pure fiction. What you and I experience as 'made-up factual knowledge' is the model just imitating the format of an answer.

Mitigation one: we clearly need examples in the dataset where the assistant's correct answer is 'I don't know this fact,' but only in cases where the model genuinely doesn't know. How do we find out what it knows? We can probe it empirically. Meta did this for the Llama 3 series (the paper calls it 'factuality'): take a random paragraph from a training document and use an LLM to generate specific factual questions about that paragraph. Since the information is right there in the context window, the LLM restates it quite accurately, yielding 'question, correct answer' pairs. Then interrogate the target model with these questions, say three to five times each, and use another LLM as a judge to automatically compare its answers with the correct one. No humans are needed.

If the model answers correctly several times, it probably knows, so we leave it alone. If it contradicts itself or gets the answer wrong, it doesn't actually know and is making things up. In that case we create a new conversation in the training set: the same question, but with the answer 'Sorry, I don't know' or 'I don't remember.' Do this across many documents and many questions, and the training set gives the model a chance to refuse based on its knowledge boundary. The elegant part is that the network very likely already contains some neuron representing 'uncertainty'; by default, that neuron's activation just isn't 'wired up' to the model actually saying 'I don't know' in words. Adding these examples lets the model learn the association: when the 'uncertainty neuron' lights up, it is allowed to say 'I don't remember.' This is a quite effective hallucination mitigation.
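
The probing procedure above can be sketched as a small loop. This is a minimal illustration, not Meta's actual pipeline: `build_refusal_examples`, `toy_model`, and `toy_judge` are hypothetical names, the model and judge are trivial stand-ins for LLM calls, the sample fact about a library is invented, and requiring every attempt to be correct is an assumed threshold.

```python
from typing import Callable, Dict, List

def build_refusal_examples(
    qa_pairs: List[Dict[str, str]],
    ask_model: Callable[[str], str],
    judge: Callable[[str, str], bool],
    attempts: int = 3,
    refusal: str = "Sorry, I don't know.",
) -> List[Dict[str, str]]:
    """Return new training conversations for questions the model fails on."""
    examples = []
    for pair in qa_pairs:
        # Interrogate the target model several times with the same question.
        answers = [ask_model(pair["question"]) for _ in range(attempts)]
        # Assumed rule: the model "knows" only if every attempt is judged correct.
        knows = all(judge(answer, pair["answer"]) for answer in answers)
        if not knows:
            examples.append({"user": pair["question"], "assistant": refusal})
    return examples

# Stand-ins for real LLM calls (hypothetical, for illustration only).
def toy_model(question: str) -> str:
    memory = {"Who wrote Pride and Prejudice?": "It was written by Jane Austen."}
    return memory.get(question, "It was founded in 1850.")  # confident guess

def toy_judge(model_answer: str, reference: str) -> bool:
    return reference.lower() in model_answer.lower()

if __name__ == "__main__":
    pairs = [
        {"question": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
        # Invented fact, standing in for a question generated from a document.
        {"question": "When was the Acme Library founded?", "answer": "1912"},
    ]
    for example in build_refusal_examples(pairs, toy_model, toy_judge):
        print(example)
```

Only the question the stand-in model gets wrong produces a refusal example; the question it answers correctly every time is left out of the new training data.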

But we can do better. Mitigation two: rather than only letting the model say 'I don't know,' give it a real chance to answer correctly by giving it tools. What do you and I do with a factual question we can't answer? We search, go online, find the answer, and then report it. The model can do the same: we design a mechanism that lets the model emit special tokens to call a tool. For example, introduce two new tokens, <SEARCH_START> and <SEARCH_END>. Instead of forcing an answer, the model emits <SEARCH_START>, then a query, then <SEARCH_END>.

When the program running inference sees <SEARCH_END>, it stops sampling the next token from the model and pauses generation. It takes the query, opens a session (e.g. to bing.com or Google), retrieves the text of the resulting web pages, perhaps wraps it in some special tokens, and pastes it back into the context. Once that text enters the context window, it feeds directly into the neural network: it is no longer a vague recollection but data the model can access directly. So when the model samples new tokens afterward, it can easily reference the freshly pasted content, answer from it, and cite sources. This is exactly why, when we earlier asked 'who is Orson Kovats,' an advanced model briefly 'flashed searching the web' and then gave a cited answer. Teaching the model to use tools correctly also relies on training: a few thousand demonstration conversations showing how to search are enough, and the model already has a decent understanding of what a web search is from pretraining.
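
The inference-side mechanism described above can be sketched as follows. This is a simplified illustration under stated assumptions: whole strings stand in for tokens, `<RESULT_START>`, `<RESULT_END>`, and `<EOS>` are hypothetical wrapper and stop tokens, and `toy_next_token` and `toy_search` are scripted stand-ins for a real model and a real search backend.

```python
from typing import Callable, List

SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"
RESULT_START, RESULT_END = "<RESULT_START>", "<RESULT_END>"  # hypothetical wrappers
EOS = "<EOS>"  # hypothetical end-of-sequence token

def generate(context: List[str],
             next_token: Callable[[List[str]], str],
             search: Callable[[str], str],
             max_steps: int = 50) -> List[str]:
    """Sample tokens; on SEARCH_END, run the search and paste the result in."""
    context = list(context)
    for _ in range(max_steps):
        token = next_token(context)
        context.append(token)
        if token == EOS:
            break
        if token == SEARCH_END and SEARCH_START in context:
            # Locate the most recent SEARCH_START and read the query after it.
            start = len(context) - 1 - context[::-1].index(SEARCH_START)
            query = " ".join(context[start + 1:-1])
            # Generation is paused here; the retrieved text enters the context.
            context.extend([RESULT_START, search(query), RESULT_END])
    return context

# Scripted stand-ins for a real model and search engine (illustration only).
def toy_next_token(context: List[str]) -> str:
    if SEARCH_START not in context:
        return SEARCH_START
    if SEARCH_END not in context:
        query = ["Orson", "Kovats"]
        sent = len(context) - 1 - context.index(SEARCH_START)
        return query[sent] if sent < len(query) else SEARCH_END
    if context[-1] == RESULT_END:
        # The "model" answers by reading the text just pasted into its context.
        return "Based on the search result: " + context[-2]
    return EOS

def toy_search(query: str) -> str:
    return f"No reliable sources mention {query}."

if __name__ == "__main__":
    for token in generate(["Who is Orson Kovats?"], toy_next_token, toy_search):
        print(token)
```

The key design point is that the search result is appended to the same context the model reads on its next step, so the answer is produced from text that is directly in front of the model rather than from its parameters.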

Two hallucination mitigations: (1) probe the knowledge boundary and add 'I don't know' examples; (2) use tools (web search) to bring external information into the context.

💡 The most important psychology point of this chapter: knowledge in the parameters vs. knowledge in the context. Knowledge in the network's billions of parameters is a 'vague recollection' of what it saw long ago in pretraining, like something you read a month ago: what you read often you remember well, and the rare material is hazy. Tokens in the context window are 'working memory': they feed directly into the network and are directly accessible to the model, like something in front of you right now that you have just re-read. So 'getting information into the context' ≈ 'looking the material up and refreshing your working memory.'

This distinction has practical implications. Ask ChatGPT to 'summarize chapter one of Pride and Prejudice' and it does reasonably well, because it has a fairly good 'recollection' of such a famous work (there is a huge amount of content about it online). But if you want the model to reproduce something specific accurately, it is better to give it the material directly: 'Please summarize chapter one of Pride and Prejudice; the text is attached below for reference: ...' and then paste in the whole chapter, set off with a delimiter. Because the content is in the context, the model can access it directly without having to recall it, and the summary is usually of significantly higher quality, just as yours would be if you re-read the chapter before summarizing it.
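
As a small illustration of this tip, the prompt can be assembled so the source text travels with the request. The helper name `build_summary_prompt` and the triple-quote delimiter are assumptions chosen for this sketch; any delimiter that does not occur in the text would do.

```python
def build_summary_prompt(task: str, source_text: str, delimiter: str = '"""') -> str:
    """Attach the source material to the request, set off by a delimiter."""
    return (
        f"{task}\n"
        "The text is attached below for reference:\n"
        f"{delimiter}\n{source_text.strip()}\n{delimiter}"
    )

if __name__ == "__main__":
    # Only the opening sentence is shown; in practice the whole chapter is pasted.
    chapter = (
        "It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife."
    )
    print(build_summary_prompt(
        "Please summarize chapter one of Pride and Prejudice.", chapter
    ))
```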

•Hallucination = the model confidently making things up; it is rooted in 'who is X' questions all being answered confidently and correctly in training, so the model learns to answer confidently whether it knows or not.
•The model is a 'statistical token tumbler'; by default it won't stop to verify, nor will it volunteer 'I don't know.'
•Mitigation one: probe the model to find its knowledge boundary and add 'I don't know' examples, so that it associates internal uncertainty with refusal.
•Mitigation two: give it tools. The model emits special tokens to call one (e.g. web search); the result is pasted into the context, and the model then answers from it and cites sources.
•Context window = working memory, directly accessible; knowledge in the parameters = a vague recollection of training.
•Practical tip: to have something reproduced accurately, put the material directly into the context rather than relying on the model's recollection.

📝 Chapter Quiz

Why do early models confidently fabricate things they don't actually know?

How is the first hallucination mitigation (getting the model to say 'I don't know') implemented?

When the model uses a web-search tool, where does the retrieved text go and what does it do?

What is the most accurate analogy for 'knowledge in parameters' vs. 'knowledge in context'?