LLM Full-Stack Learning
Chapter 10

Post-Training Data: Conversations

The core of post-training is swapping the training data from internet documents to conversations: multi-turn exchanges between a human and an assistant. We cannot program the assistant explicitly, the way we write code; we can only program it by example, using a dataset of conversations that implicitly teaches it how to respond. A protocol encodes each conversation into a one-dimensional token sequence, wrapping every turn's role and content in special tokens such as <|im_start|> and <|im_end|> that are newly introduced in post-training. We trace the approach back to OpenAI's 2022 InstructGPT and see how human labelers write ideal responses under labeling instructions that ask them to be helpful, truthful, and harmless, then look at how modern datasets (e.g. UltraChat) are largely synthesized with LLMs. The takeaway intuition: talking to ChatGPT ≈ asking 'what would a well-instructed human labeler say?'

In post-training, the object we reason about is the conversation. Conversations can be multi-turn; the simplest case is a back-and-forth between one human and one assistant. For example, the human asks 'what is 2 plus 2?' and the assistant should reply with something like '2 plus 2 is 4'; the human follows up with 'what if it were times instead of plus?' and the assistant responds again. A conversation can also show the assistant's 'personality', such as a friendly tone. And when the human asks for something we do not want the assistant to do, it gives what is called a 'refusal', politely saying 'I can't help with that.' In other words, our job now is to program in how the assistant should behave across these conversations.
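As a rough illustration of the object described above, a conversation can be held as an ordered list of turns. The field names `role` and `content` and the helper `roles_alternate` are assumptions made for this sketch, not something the chapter prescribes.

```python
# A conversation as plain data: an ordered list of turns, each with a role and content.
# The "role"/"content" field names are an illustrative convention (an assumption).
conversation = [
    {"role": "user", "content": "What is 2 plus 2?"},
    {"role": "assistant", "content": "2 plus 2 is 4."},
    {"role": "user", "content": "What if it were times instead of plus?"},
    {"role": "assistant", "content": "2 times 2 is 4."},
]

# A refusal is just another demonstration of desired behavior.
refusal_example = [
    {"role": "user", "content": "<a request we do not want the assistant to fulfil>"},
    {"role": "assistant", "content": "I can't help with that."},
]


def roles_alternate(turns):
    """Check that turns strictly alternate user, assistant, user, ... starting with the user."""
    expected = ["user", "assistant"]
    return all(turn["role"] == expected[i % 2] for i, turn in enumerate(turns))


print(roles_alternate(conversation))
print(roles_alternate(refusal_example))
```

Nothing here is specific to any one model; it only shows that a multi-turn exchange, including a refusal, is ordinary structured data before any tokenization happens.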

💡 Key intuition: because this is a neural network, we cannot program the assistant explicitly in code the way we would write an ordinary program. Everything is done by training on a dataset. So we are implicitly 'programming by example': we build many example conversations, and the model learns from them how it should respond. Every conversation you show it is a demonstration of desired behavior.

Where does this data come from? Originally it came from human labelers. We give a labeler some conversational context and ask them to write the 'ideal assistant response' for that situation. The model then trains on this data and learns to imitate those responses. Concretely: take the base model produced by pretraining (trained on internet documents), throw away the internet-document dataset, substitute a dataset of conversations, and continue training with exactly the same algorithm. The model adjusts very quickly and picks up the statistics of how an assistant responds to human queries.
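To make "exactly the same algorithm" concrete, here is a minimal sketch of how next-token training examples could be cut from a token sequence. The function name `next_token_pairs` and the token IDs are invented for illustration; a real training loop would feed such pairs to a neural network and a loss function, which this sketch leaves out.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix is asked to predict the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy token IDs, made up for illustration.
document_tokens = [101, 7, 42, 9]        # stands in for a tokenized internet document
conversation_tokens = [900, 1, 55, 901]  # stands in for a tokenized conversation

# The same function serves both stages; only the data passed in changes.
pretraining_examples = next_token_pairs(document_tokens)
post_training_examples = next_token_pairs(conversation_tokens)

for context, target in post_training_examples:
    print(context, "->", target)
```

The point of the sketch is that the objective does not know or care whether the tokens came from a web page or a conversation.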

📝 Recall the cost comparison from the last chapter: in practice, pretraining can run for about three months across many thousands of machines, while post-training is typically much shorter, for example a few hours. The reason is that the hand-constructed conversation dataset is far smaller than the text of the whole internet. The algorithm and everything else stay the same; we only swap the dataset for conversations.

Next question: everything in the model must become tokens, because token sequences are all it processes. So how do we turn a structured object like a conversation into a token sequence? We need to design an encoding. It is a bit like TCP/IP packets on the internet: there is a precise protocol for how information is laid out and structured, and everyone follows the same convention. LLMs are the same: we need rules for how a data structure like a conversation is encoded into tokens and decoded back again.

Take the format GPT-4o uses. Each turn begins with a special token <|im_start|> (im is said to stand for 'imaginary monologue'), followed by whose turn it is, e.g. user. Next comes an internal-monologue separator <|im_sep|>, then the tokens of the utterance itself (the content of the question), and finally <|im_end|> to close the turn. So the turn 'the user asks what is 2 plus 2' becomes the sequence 'special start token + role + separator + content tokens + special end token'. A whole two-turn conversation might end up as a one-dimensional sequence of about 49 tokens. Different LLMs use slightly different formats, and the landscape is still something of a Wild West, but the idea is the same.

⚠️ Note especially: things like <|im_start|> are not ordinary text but newly introduced special tokens. They were never trained on during pretraining; we create them specifically in the post-training stage and intersperse them with the text. Through them the model learns: 'a turn starts here, it belongs to this speaker (user or assistant), this is its content, and then the turn ends.' These new tokens are what carve unstructured text into a conversation with roles and boundaries.
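[Figure: a conversation flattened into a one-dimensional token sequence, with special tokens marking each turn's boundaries and role. Turn 1: <|im_start|> (turn start), user (role), "2 + 2 = ?" (what the user says), <|im_end|> (turn end). Turn 2: <|im_start|> (turn start), assistant (role), "Equals 4." (the assistant's reply), <|im_end|> (turn end). Legend: special tokens (newly introduced in post-training), role, content.]
How a 'user ↔ assistant' conversation is encoded by special tokens into a one-dimensional token sequence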
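The encoding described above can be sketched in a few lines. This is only an illustration under stated assumptions: it works on strings rather than token IDs (in a real tokenizer each marker would map to a single special token ID), the `render` and `parse` helpers are hypothetical names, and the exact layout of any production format may differ.

```python
import re

# Marker strings as they appear in the text above.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"


def render(turns):
    """Flatten a list of {"role", "content"} turns into one string."""
    return "".join(
        f"{IM_START}{turn['role']}{IM_SEP}{turn['content']}{IM_END}" for turn in turns
    )


def parse(text):
    """Recover the turns from a rendered string (the inverse of render)."""
    pattern = (
        re.escape(IM_START) + r"(.*?)" + re.escape(IM_SEP) + r"(.*?)" + re.escape(IM_END)
    )
    return [
        {"role": role, "content": content}
        for role, content in re.findall(pattern, text, flags=re.DOTALL)
    ]


turns = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
flat = render(turns)
print(flat)
print(parse(flat) == turns)
```

The round trip only holds when the content itself never contains the marker strings, which is one reason real systems treat these markers as dedicated tokens rather than plain text.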

The key point: once a conversation becomes a one-dimensional token sequence, everything we learned before carries over. The task is still 'predict the next token in the sequence', only now we train on conversations. Inference works the same way: when you type 'what if it were times?' in ChatGPT and hit enter, the server appends <|im_start|>assistant<|im_sep|> after your message and starts sampling from the model at that point: the first token, the second token, and so on, producing the assistant's reply. The reply need not match any single training example, but it carries the 'flavor' of the training data.
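The inference step described above might look roughly like the following. Everything named here is an assumption for illustration: `build_prompt`, `generate_reply` and `make_stub_sampler` are hypothetical helpers, the sketch works on strings instead of token IDs, and the stub sampler simply replays canned tokens where a trained model would sample from a probability distribution.

```python
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"


def build_prompt(turns):
    """Render the history, then open an assistant turn for the model to complete."""
    history = "".join(
        f"{IM_START}{turn['role']}{IM_SEP}{turn['content']}{IM_END}" for turn in turns
    )
    return history + f"{IM_START}assistant{IM_SEP}"


def generate_reply(sample_next, prompt, max_tokens=64):
    """Call sample_next(text_so_far) until it emits IM_END or the token budget runs out."""
    pieces = []
    for _ in range(max_tokens):
        token = sample_next(prompt + "".join(pieces))
        if token == IM_END:
            break
        pieces.append(token)
    return "".join(pieces)


def make_stub_sampler(canned_tokens):
    """Hypothetical stand-in for a trained model: replays a fixed list of tokens."""
    remaining = iter(canned_tokens)
    return lambda text_so_far: next(remaining, IM_END)


history = [
    {"role": "user", "content": "What is 2 plus 2?"},
    {"role": "assistant", "content": "2 plus 2 is 4."},
    {"role": "user", "content": "What if it were times instead of plus?"},
]
prompt = build_prompt(history)
sampler = make_stub_sampler(["2", " times", " 2", " is", " 4.", IM_END])
reply = generate_reply(sampler, prompt)
print(reply)
```

The end-of-turn token doubles as the stop signal: generation continues one token at a time until the model itself closes the assistant turn.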

The first systematic public description of this approach was OpenAI's 2022 InstructGPT paper (more precisely, the technique it describes). The paper (section 3.4) mentions that they hired human contractors through Upwork and Scale AI to construct conversations: labelers first came up with prompts ('give me five ideas to regain enthusiasm for my career', 'what are the top 10 sci-fi books I should read next', 'translate this sentence into Spanish', ...) and then wrote the ideal assistant response by hand. How do they know what counts as an 'ideal response'? From the 'labeling instructions' that the company (e.g. OpenAI) writes for its labelers.

At a high level, these labeling instructions ask the labeler to be helpful, truthful, and harmless: try to help, try to tell the truth, and decline questions we do not want the system to handle. In practice the instructions are not a few sentences but often a document hundreds of pages long that labelers have to study carefully. The model obviously cannot cover every possible future question during training, but given enough demonstration conversations (say, a hundred thousand), it gradually takes on this helpful, truthful, harmless assistant persona. All of it is programming by example.

Over the past two or three years the state of the art has advanced. It is now rare for humans to write every response from scratch, because we have LLMs and can use them to help generate this conversation data. More commonly, a labeler has an existing LLM produce an answer first and then edits and revises it. Modern SFT datasets (e.g. UltraChat) are largely synthetic (LLM-assisted), with some human editing mixed in. Such datasets now routinely contain millions of conversations spanning an enormous range of topics, forming sizable 'SFT mixtures'. But the essence is unchanged: it is still a pile of conversations, and we still train on them as before.
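The draft-then-edit workflow described above could be sketched as follows. This is a hedged illustration, not how any particular dataset was built: `build_sft_examples`, `draft_fn` and `edit_fn` are hypothetical names, and the toy stand-ins replace what would really be a call to an existing LLM and a human revision pass.

```python
def build_sft_examples(prompts, draft_fn, edit_fn):
    """For each prompt, get a draft, pass it through an edit step, and store the conversation."""
    dataset = []
    for prompt in prompts:
        draft = draft_fn(prompt)        # stands in for a call to an existing LLM
        final = edit_fn(prompt, draft)  # stands in for a labeler revising the draft
        dataset.append([
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": final},
        ])
    return dataset


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def toy_draft(prompt):
    return f"draft answer to: {prompt}"


def toy_edit(prompt, draft):
    return draft.replace("draft answer", "Reviewed answer")


examples = build_sft_examples(["Translate 'hello' into Spanish."], toy_draft, toy_edit)
print(examples[0])
```

The output has the same shape as a hand-written conversation, which is why training can proceed on it exactly as before.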

💡 Dispelling some of the 'magic': when you ask a question in ChatGPT and hit enter, what comes back is statistically aligned with the training set, and that training set ultimately traces back to humans writing responses by following labeling instructions. So you are not talking to some magical AI; it is more like asking 'what would a (well-instructed) human labeler say in this situation?' What you get is a statistical simulation of such a labeler. And that labeler usually is not a random person off the street: for specialized questions such as code, companies hire educated experts to write the answers, so you are simulating a fairly skilled person.

A concrete example: you ask 'recommend the top five landmarks to see in Paris.' What comes back is not the work of some AI that researched landmarks all over the world and ranked them with 'infinite intelligence'; it is a statistical simulation of a labeler OpenAI hired. If this exact question happens to be in the post-training dataset, your answer is very likely close to what that labeler wrote (they may have spent 20 minutes researching online and made a list). If the question is not in the dataset, the answer is more 'emergent': the model combines its vast pretraining knowledge about Paris, landmarks, and what people like to see with the response style learned in post-training, and simulates a reasonable list.

  • Post-training swaps the dataset from 'internet documents' to 'conversations' (human ↔ assistant, possibly multi-turn).
  • Because it is a neural network, the assistant is 'programmed by example' rather than explicitly in code: many conversation samples shape its behavior implicitly.
  • A protocol encodes each conversation into a one-dimensional token sequence; the special tokens that delimit each turn's role and content (e.g. <|im_start|>, <|im_end|>) are newly introduced in post-training.
  • InstructGPT (OpenAI, 2022) first described this approach publicly; human labelers write ideal responses following 'helpful, truthful, harmless' labeling instructions.
  • Modern datasets (e.g. UltraChat) are largely LLM-synthesized with some human editing, and can contain millions of conversations.
  • Talking to ChatGPT ≈ asking 'what would a well-instructed human labeler say?'; post-training takes only hours, far less than pretraining.

📝 Chapter quiz

From pretraining to post-training, what changes about the training data?

Why do we say the assistant is 'programmed by example'?

Which is correct about special tokens like <|im_start|>?

Per this chapter, what is the most apt analogy for talking to ChatGPT?