Post-Training Data: Conversations
The core of post-training is swapping the training data from 'internet documents' to 'conversations': multi-turn exchanges between a human and an assistant. We can't program the assistant explicitly the way we write code; we can only program it by example, through a dataset of conversations that implicitly teaches it how to respond. A protocol encodes each conversation into a one-dimensional token sequence, using special tokens such as <|im_start|> and <|im_end|> to frame each turn's role and content, and those special tokens are newly introduced in post-training. We trace the approach back to OpenAI's 2022 InstructGPT to see how human labelers write ideal responses under 'helpful, truthful, harmless' labeling instructions, then look at how modern datasets (e.g. UltraChat) are largely synthesized with LLMs. The takeaway intuition: talking to ChatGPT is roughly asking 'what would a well-instructed human labeler say?'
In post-training, the object we start thinking about is the conversation. Conversations can be multi-turn; the simplest case is a back-and-forth between a human and an assistant. For example, the human asks 'what is 2 plus 2?' and the assistant should reply with something like '2 plus 2 is 4'; the human follows up with 'what if it were times instead of plus?' and the assistant responds again. A conversation can also give the assistant a certain 'personality', such as a friendly tone. And when the human asks for something we don't want the assistant to do, it gives what is called a 'refusal', politely saying 'I can't help with that.' In other words, what we are doing now is programming how the assistant ought to behave across these conversations.
Where does this data come from? Originally it came from human labelers. We give a labeler some conversational context and ask them to write the 'ideal assistant response' for that situation. The model then trains on this data and learns to imitate those responses. Concretely: take the base model produced by pretraining (which was trained on internet documents), throw away the internet-document dataset, substitute a dataset of conversations, and continue training with exactly the same algorithm. The model adjusts very quickly and learns the statistics of how an assistant responds to human queries.
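To make 'exactly the same algorithm' concrete, here is a minimal sketch of how next-token training examples are built. The whitespace tokenizer, the sample texts, and the helper name `next_token_pairs` are illustrative assumptions, not part of any real training pipeline; the point is only that the procedure does not care whether the tokens came from a document or a conversation.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: predict each token from all tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy whitespace "tokenizer" (an assumption for illustration only).
document_tokens = "the cat sat on the mat".split()
conversation_tokens = "user: what is 2 plus 2 ? assistant: 2 plus 2 is 4 .".split()

# The same function produces training examples for both kinds of data.
for name, tokens in [("document", document_tokens), ("conversation", conversation_tokens)]:
    context, target = next_token_pairs(tokens)[-1]
    print(f"{name}: given {context!r}, predict {target!r}")
```

Only the dataset changes between the two stages; the objective of predicting the next token from its context stays the same.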
Next question: everything in the model must become tokens, because token sequences are all it processes. So how do we turn a structured object like a conversation into a token sequence? We need to design an encoding. This is a bit like TCP/IP packets on the internet: precise protocols specify how information is laid out and structured, and everyone follows the same convention. It is the same in LLMs: we need rules for how a data structure like a conversation is encoded into tokens and decoded back from them.
Take the format GPT-4o uses. Each turn begins with a special token <|im_start|> (im is short for 'imaginary monologue'), followed by whose turn it is, e.g. user; then a separator token <|im_sep|>; then the tokens of the utterance itself (the content of the question); and finally <|im_end|> to close the turn. So the turn 'the user asks what is 2 plus 2' becomes the sequence 'special start token + role + separator + content tokens + special end token.' A whole two-turn conversation might end up as a one-dimensional sequence of about 49 tokens. Different LLMs use slightly different formats, and it is a bit of a Wild West right now, but the idea is the same.
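The layout described above can be sketched at the text level as follows. This is a simplified illustration, not the actual implementation: the helper names `render_conversation` and `parse_conversation` are hypothetical, and a real tokenizer would map each special marker to a single token ID rather than treating it as a plain substring.

```python
import re

# Special markers named in the text; treated here as plain strings (an assumption).
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"


def render_conversation(messages):
    """Flatten a list of {role, content} turns into one string."""
    return "".join(
        f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages
    )


_TURN = re.compile(
    re.escape(IM_START) + r"(.*?)" + re.escape(IM_SEP) + r"(.*?)" + re.escape(IM_END),
    re.DOTALL,
)


def parse_conversation(text):
    """Recover the list of turns from the flattened string."""
    return [{"role": role, "content": content} for role, content in _TURN.findall(text)]


conversation = [
    {"role": "user", "content": "What is 2 plus 2?"},
    {"role": "assistant", "content": "2 plus 2 is 4."},
]
rendered = render_conversation(conversation)
print(rendered)
print(parse_conversation(rendered) == conversation)
```

The round trip matters: the same convention has to be usable both to build training sequences and to read a generated reply back out as a structured turn.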
The key point: once a conversation becomes a one-dimensional token sequence, everything we learned before carries over. It is still 'predict the next token in the sequence,' except that now we train on conversations. Inference works the same way: when you type 'what if it were times?' in ChatGPT and hit enter, the server appends <|im_start|>assistant<|im_sep|> after your message and starts sampling from the model right there: what is a good first token, a good second token, and so on, until the assistant's reply is complete. The reply need not be identical to any single training example, but it carries the 'flavor' of the training data.
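The inference step can be sketched like this. Everything here is a toy under stated assumptions: `build_generation_prompt`, `sample_reply`, and `make_toy_model` are hypothetical names, the 'model' is a stub that replays a canned list of tokens, and the special markers are handled as plain strings rather than token IDs.

```python
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"


def build_generation_prompt(messages):
    """Render the history, then open an assistant turn for the model to complete."""
    history = "".join(
        f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages
    )
    return history + f"{IM_START}assistant{IM_SEP}"


def sample_reply(prompt, next_token_fn, max_tokens=50):
    """Ask for one token at a time until the end marker or the token limit."""
    context, reply = prompt, []
    for _ in range(max_tokens):
        token = next_token_fn(context)
        if token == IM_END:
            break
        reply.append(token)
        context += token  # the growing sequence is fed back in
    return "".join(reply)


def make_toy_model(canned_tokens):
    """Stub standing in for a real model: replays a fixed token list."""
    remaining = iter(canned_tokens)
    return lambda context: next(remaining, IM_END)


history = [
    {"role": "user", "content": "What is 2 plus 2?"},
    {"role": "assistant", "content": "2 plus 2 is 4."},
    {"role": "user", "content": "What if it were times instead of plus?"},
]
prompt = build_generation_prompt(history)
reply = sample_reply(prompt, make_toy_model(["2", " times", " 2", " is", " 4", ".", IM_END]))
print(reply)
```

A real model would return a probability distribution over the vocabulary at each step and a token would be sampled from it; the stub only shows where the loop starts and how it stops.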
The first systematic public description of this approach was OpenAI's 2022 InstructGPT paper. The paper (section 3.4) mentions that they hired human contractors on Upwork and through Scale AI to construct conversations: labelers first came up with prompts ('give me five ideas to regain enthusiasm for my career,' 'what are the top 10 sci-fi books I should read next,' 'translate this sentence into Spanish,' and so on), then wrote out the ideal assistant response by hand. How do they know what the ideal response is? From the 'labeling instructions' that the company (e.g. OpenAI) writes for the labelers.
At a high level, these labeling instructions ask the labeler to be helpful, truthful, and harmless: try to help, try to tell the truth, and decline questions we don't want the system handling. In practice the instructions are not a few sentences but often a document hundreds of pages long that labelers have to study carefully. The model obviously can't cover every possible future question during training, but given enough demonstration conversations (say, a hundred thousand), it gradually takes on this helpful, truthful, harmless assistant persona. All of it is programming by example.
Over the past two or three years the state of the art has advanced. It is now rare to have humans write every response from scratch, because we have LLMs and can use them to help generate this conversation data. More commonly, a labeler has an existing LLM produce an answer first, then edits and revises it. Modern SFT datasets (e.g. UltraChat) are to a large extent synthetic (LLM-assisted), with some human editing mixed in. Such datasets now routinely contain millions of conversations spanning an enormous range of topics, forming sizable 'SFT mixtures.' But the essence is unchanged: it is still a pile of conversations, and we still train on them as before.
A concrete example: you ask 'recommend the top five landmarks to see in Paris.' What comes back is not some AI that actually researched landmarks all over the world and ranked them with 'infinite intelligence'; it is a statistical simulation of a labeler hired by OpenAI. If this exact question happens to be in the post-training dataset, the answer you see is very likely close to what that labeler wrote (they may have spent 20 minutes researching online and made a list). If this exact question is not in the dataset, the answer is more 'emergent': the model combines its vast pretraining knowledge about Paris, landmarks, and what people like to see with the response style learned in post-training to simulate a reasonable list.
- Post-training swaps the dataset from 'internet documents' to 'conversations' (human ↔ assistant, possibly multi-turn).
- Because it's a neural network, the assistant is 'programmed by example,' not explicitly in code: behavior is shaped implicitly by many conversation samples.
- Conversations are encoded into a one-dimensional token sequence by a protocol; the special tokens that frame each turn's role and content (e.g. <|im_start|>, <|im_end|>) are newly introduced in post-training.
- InstructGPT (OpenAI 2022) first described this publicly; human labelers write ideal responses following 'helpful, truthful, harmless' labeling instructions.
- Modern datasets (e.g. UltraChat) are largely LLM-synthesized plus a little human editing, and can contain millions of conversations.
- Talking to ChatGPT ≈ asking 'what would a well-instructed human labeler say'; post-training takes only hours, far shorter than pretraining.
📝 Chapter quiz
From pretraining to post-training, what changes about the training data?
Why do we say the assistant is 'programmed by example'?
What is true of special tokens like <|im_start|>?
According to this chapter, what is the most apt analogy for talking to ChatGPT?