Chapter 15

From Supervised Fine-Tuning to Reinforcement Learning

Training an LLM is a lot like sending a child to school. Pretraining corresponds to reading a textbook's expository text (background knowledge); supervised fine-tuning (SFT) corresponds to studying the expert's 'worked solutions' and imitating them; and a third stage, doing the 'practice problems,' corresponds to reinforcement learning (RL). This chapter pins down the fundamental limit of SFT. For the 'Emily buys apples' problem, four candidate solutions all arrive at $3 per apple, yet as a human labeler you don't actually know which one is best for the model. A solution serves two purposes: (1) reaching the right answer and (2) reading well for a human. The human-friendly presentation is not necessarily the token sequence by which the model reaches the answer most reliably and with the least compute (recall that compute per token is finite). Human and model 'cognition' differ: a step that feels trivial to you may be too big a leap for the model, and a jump you write down may rely on knowledge the model's parameters don't contain. The conclusion: rather than guessing the 'ideal solution' on the model's behalf, let it practice, trying many solutions and keeping the ones that reach the correct answer. That is exactly what reinforcement learning, the subject of the next chapter, does.

First, a recap of the whole training pipeline. Stage one is pretraining: training on internet documents yields a base model, which is essentially an internet-document simulator. It is interesting, and it is a lossy compression of the internet, but it is not directly useful, because we don't want to 'sample web pages'; we want to ask an AI questions and have it answer. So we move to the first step of post-training, supervised fine-tuning (SFT). The algorithm is identical to pretraining; only the dataset changes. Instead of web pages, we carefully curate millions of 'human and assistant' conversations across all kinds of topics, continue training on them, and get an assistant. These conversations ultimately come from human curation: humans write the prompts and the 'ideal responses' (nowadays often with help from LLM tools, but humans remain the final curators).
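The point that SFT reuses the pretraining algorithm and changes only the data can be sketched in a few lines. This is a minimal illustration, not how a real LLM is trained: the `BigramModel` class is a toy stand-in for the network, and the `<user>`, `<assistant>`, and `<end>` markers are made-up placeholders for whatever conversation format a real tokenizer uses.

```python
import math
from collections import Counter, defaultdict


class BigramModel:
    """Toy stand-in for an LLM: predicts the next token from the previous one."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(Counter)

    def train(self, tokens):
        """Update next-token counts; used unchanged for both stages."""
        for prev, nxt in zip(tokens[:-1], tokens[1:]):
            self.counts[prev][nxt] += 1

    def loss(self, tokens):
        """Average next-token negative log-likelihood (add-one smoothing)."""
        pairs = list(zip(tokens[:-1], tokens[1:]))
        total = 0.0
        for prev, nxt in pairs:
            seen = self.counts.get(prev, Counter())
            prob = (seen[nxt] + 1) / (sum(seen.values()) + self.vocab_size)
            total -= math.log(prob)
        return total / len(pairs)


# Hypothetical data: a "web page" and a conversation with made-up role markers.
web_doc = "the cat sat on the mat".split()
conversation = "<user> what is 2 + 2 ? <assistant> 2 + 2 is 4 . <end>".split()

model = BigramModel(vocab_size=len(set(web_doc) | set(conversation)))
model.train(web_doc)                      # stage 1: "pretraining"
loss_before_sft = model.loss(conversation)
model.train(conversation)                 # stage 2: "SFT", same method, new data
loss_after_sft = model.loss(conversation)
print(loss_before_sft > loss_after_sft)
```

Both stages call the same `train` method and are scored by the same next-token loss; the only thing that differs is which token sequence is passed in. After the second call, the conversation becomes more predictable to the model, which is the toy analogue of a base model turning into an assistant.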

A useful analogy is going to school. A textbook contains roughly three kinds of information, and they map neatly onto the three training stages. The first is 'exposition': long stretches of background knowledge and concept-building. Reading it is roughly equivalent to pretraining, since you are building a knowledge base and a general sense of the topic. The second is 'worked solutions': the expert (the author) gives you not just a problem but a fully demonstrated solution. That worked solution plays the role of the assistant's 'ideal response,' and reading it amounts to 'imitating the expert,' which roughly corresponds to the SFT model. The third kind, which this chapter sets up, is the 'practice problems.'

Practice problems are critical for learning because they force you to 'do it yourself': you get the problem statement but not the worked solution, and you see only the 'final answer' in the answer key at the back of the book. So you know the destination but must find the path yourself, trying many approaches and seeing which one most reliably gets you to that answer. Along the way you lean on two things: the background knowledge from pretraining, and a little imitation of expert solutions. This is the essence of reinforcement learning (RL): the model is given the prompt and the final answer, but no expert solution, and it must practice and try things itself. To understand why RL is needed, though, we first have to see a fundamental limit of SFT.

Back to the familiar problem: 'Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of each apple?' Imagine four candidate solutions, all of which correctly reach the answer, 3. Some set up a system of equations, some talk through the steps in plain English, and some jump almost straight to the answer. Now the key question: if I am the human labeler writing a conversation for the training set, which one should I actually put in? Honestly, I don't know.
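For reference, the arithmetic that every correct solution has to carry out, in whatever form it is presented, can be written as a single equation. The symbol $a$ for the price of one apple is introduced here for illustration and is not from the original text.

```latex
\begin{align*}
3a + 2 \cdot 2 &= 13 \\
3a &= 13 - 4 = 9 \\
a &= 9 / 3 = 3
\end{align*}
```

The candidate solutions differ only in how much of this chain they spell out and in what notation, not in the result.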

To see why, distinguish the two purposes a solution serves. The first is of course to 'reach the right answer': getting the final answer, 3, matters most. The second is to present the process 'nicely for the human,' because we assume the user wants to see intermediate steps and a clear explanation. These are separate things: one is 'presentation for the human,' the other is 'actually getting it right.' If for the moment we care only about 'reaching the correct answer,' which solution is optimal for the LLM?

The answer: as a human labeler, I simply can't judge. Recall what we covered earlier: each token gets only a finite, roughly fixed amount of compute, so the model can't make too big a leap in any single token. The solution with 'very few tokens, almost one-shot,' though short, might therefore demand at some token (for example, writing '9 ÷ 3 = 3' directly) that the model complete too much mental arithmetic at once, forcing it to skip steps and make mistakes. Maybe spreading the work out more, or setting up an equation first, would work better, but we just don't know. A step that is trivial to you and me may be too big a leap for the model, and knowledge I casually inject into a solution may simply not exist in the model's parameters, in which case it becomes a 'baffling sudden jump' that confuses the model.

💡 The core issue: human and model 'cognition' differ. The LLM may know far more math, physics, and chemistry than I, the labeler, do, so my solution might not make use of abilities it already has. Conversely, a step I write down as 'obvious' may actually be missing from its parameters, making it a leap the model cannot bridge. Ultimately, I am not the LLM, so I can't pick out the 'most economical, most reliable' token sequence for reaching the answer on its behalf. SFT solutions can 'initialize' the model and bring it into the rough vicinity of correct solutions, but the token sequence that suits the model best has to be discovered by the model itself, through trial and error.

SFT vs RL: SFT imitates human-written 'ideal solutions'; RL lets the model try many solutions itself and keeps the ones that reach the correct answer.

To summarize the tension: we are not good at creating token sequences for the model. The 'ideal solutions' written in the SFT stage are valuable. Like the worked examples that come before the practice problems, they initialize the model into a state where it 'can write solutions, set up equations, and talk through steps,' bringing it into the vicinity of correct solutions. But to truly 'dial it in,' we need to let the model practice on its own: give it the prompt, give it the (verifiable) final answer, but not the expert solution; let it try, see which paths reliably reach the answer, and reinforce those paths. This leads us into the next chapter: reinforcement learning.
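The 'try many solutions, keep the ones that reach the correct answer' step can be sketched as a simple filter. This is only an illustration under stated assumptions: `sample_solution` is a hypothetical stand-in that picks from hand-written attempts instead of sampling from a real model, the 'Answer: N' output format is invented for the example, and the reinforcement step itself (training on the kept attempts) is left to the next chapter.

```python
import random
import re

PROMPT = ("Emily buys 3 apples and 2 oranges. Each orange costs $2. "
          "The total cost of all the fruit is $13. "
          "What is the cost of each apple?")
CORRECT_ANSWER = 3  # the verifiable final answer from the "answer key"


def sample_solution(rng):
    """Hypothetical stand-in for sampling one solution from the model."""
    attempts = [
        "Oranges cost 2 * 2 = 4. Apples cost 13 - 4 = 9. "
        "Each apple is 9 / 3 = 3. Answer: 3",
        "Let a be the apple price. 3a + 4 = 13, so a = 3. Answer: 3",
        "13 / 3 is about 4. Answer: 4",
        "Total minus oranges is 13 - 2 = 11. 11 / 3 is about 4. Answer: 4",
    ]
    return rng.choice(attempts)


def extract_answer(solution):
    """Read the final integer after 'Answer:', or None if it is missing."""
    match = re.search(r"Answer:\s*(-?\d+)\s*$", solution)
    return int(match.group(1)) if match else None


def collect_correct(num_samples, seed=0):
    """Sample attempts and keep only those whose final answer is correct."""
    rng = random.Random(seed)
    attempts = [sample_solution(rng) for _ in range(num_samples)]
    kept = [s for s in attempts if extract_answer(s) == CORRECT_ANSWER]
    return attempts, kept


attempts, kept = collect_correct(num_samples=20)
print(f"kept {len(kept)} of {len(attempts)} attempts")
```

Note that the filter never inspects how a solution is written, only whether it ends at the right answer. That is the design choice the chapter argues for: the labeler supplies the prompt and the answer, and the model's own attempts determine which token sequences get kept.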

• Three stages mapped to school: pretraining = reading exposition (background knowledge); SFT = studying the expert's worked solutions (imitation); RL = doing the practice problems (practicing yourself).
• SFT = continued training on human-curated 'human–assistant' conversations; same algorithm as pretraining, just a different dataset; the model imitates expert-written ideal responses.
• Practice problems give you only the question and the final answer, not the solution; you must try things and discover the path to the answer yourself, which is exactly RL's idea.
• The fundamental limit of SFT: one problem can have many correct solutions, and the human labeler doesn't know which is best for the model.
• A solution serves two purposes: (1) reach the right answer; (2) present it nicely for the human. Human-friendly presentation ≠ the model's most compute-efficient, most reliable token sequence.
• Human and model cognition differ: a step trivial to you may be too big a leap for the model; knowledge you inject may not be in its parameters. Conclusion: let the model discover the token sequence that suits it.

📝 Chapter Quiz

In the school analogy, which part of a textbook does reinforcement learning (RL) correspond to?

For the 'Emily buys apples' problem there are four correct candidate solutions. What is SFT's fundamental difficulty?

The chapter says a solution serves 'two purposes.' Which two?

Why isn't 'human-friendly presentation' necessarily the model's optimal token sequence?