Chapter 16

Reinforcement Learning

At its core, reinforcement learning is simple: guess and check. Take a prompt and have the model generate many candidate solutions in parallel (in practice possibly thousands or even millions), check which ones reach the correct final answer, then train the model to do more of what the successful solutions did and less of what the failed ones did. The key point is that the reward is automatic: whether the final answer is correct is the signal, and no human has to judge the process. That is why it works so well in 'verifiable' domains such as math and code. The training sequences come from the model itself, not from human experts. The model practices in a 'playground' and discovers for itself which token sequences reliably lead to the answer: sequences that make no unexplained leaps and make full use of the model's own knowledge. This is where 'reasoning' or 'thinking' models come from. RL is newer and less standardized than SFT, and labs keep the details private; DeepSeek is one of the few to discuss it openly (next chapter).

In the last chapter we saw that a human labeler cannot pick the optimal solution on the model's behalf. The reinforcement learning answer is surprisingly simple: since we don't know which solution is best, let the model try many and see which ones work. We reuse the familiar problem: 'Emily buys 3 apples and 2 oranges ... what is the cost of each apple?' We already know the correct answer is $3.

The procedure is as follows. Take the prompt, run the model, and have it generate a solution. Inspect that solution to see whether it reached the correct answer, 3. That is one attempt. Then discard it and try again. Because the model is a stochastic system (each token is sampled from a probability distribution), every generation differs slightly and follows a slightly different path. Try a third time, and so on. In practice, for a single prompt you might sample thousands or even millions of independent solutions, some correct and some not.
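The sample-then-check step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sample_solution` is a hypothetical stand-in for a real model generation (it draws a final answer from an invented toy distribution), and `is_correct` compares only the final answer with the known value 3.

```python
import random

CORRECT_ANSWER = 3  # the known answer to the apple problem in the text

def sample_solution(rng: random.Random) -> dict:
    """Hypothetical stand-in for one stochastic model generation.

    A real system would sample tokens from an LLM. Here we only draw a
    final answer at random so that repeated attempts differ.
    """
    answer = rng.choice([2, 3, 3, 4, 5])  # invented toy distribution
    return {"text": f"... so each apple costs ${answer}.", "answer": answer}

def is_correct(solution: dict) -> bool:
    """Check only the final answer, not the intermediate steps."""
    return solution["answer"] == CORRECT_ANSWER

rng = random.Random(0)                                   # seeded for repeatability
attempts = [sample_solution(rng) for _ in range(15)]     # 15 independent attempts
labels = [is_correct(s) for s in attempts]               # True = "green", False = "red"
print(f"{sum(labels)} of {len(attempts)} attempts reached the correct answer")
```

Each attempt is generated independently, so the list of labels is simply a record of which guesses passed the check.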

Picture a cartoon diagram: one prompt, with many solutions fanning out in parallel below it. Some go well and reach the correct answer (green); others go astray and miss it (red). Say we generated 15 solutions and only 4 got it right. What we do next is encourage the solutions that lead to correct answers. Something went wrong somewhere in the red solutions, so those are not good paths. Things went well in the green ones, and we want the model to take such paths more often on similar prompts in the future.

How do we encourage them? By training the model on the successful sequences. Note that these training sequences no longer come from a human expert labeler. No human decided that this was the 'correct solution'; the model generated them itself. The model is practicing: it tried several solutions, 4 worked, and it now trains on the ones that worked, like a student looking at their own work and saying 'this approach went smoothly, so this is how I should solve problems like this.' The simplest variant is to pick the single 'best' of the 4 (perhaps the shortest, or the nicest in some other way) and train only on it. After one parameter update, the model becomes slightly more likely to take that path in similar situations.
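The simplest variant described above, keeping only one 'best' successful attempt, might look like the sketch below. The `Attempt` class, the placeholder solution texts, and the choice of 'shortest' as the tie-breaker are all illustrative assumptions; the text only says the best one might be the shortest or nicest in some way.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Attempt:
    text: str      # the full solution the model generated
    correct: bool  # result of checking its final answer

def pick_training_example(attempts: List[Attempt]) -> Optional[Attempt]:
    """Return the shortest correct attempt, or None if nothing succeeded."""
    winners = [a for a in attempts if a.correct]
    if not winners:
        return None  # no successful attempt, so nothing to train on
    return min(winners, key=lambda a: len(a.text))

# Placeholder texts; a real run would hold the model's actual generations.
attempts = [
    Attempt("Long step-by-step solution that double-checks itself ... answer: 3", True),
    Attempt("Solution with an arithmetic slip ... answer: 4", False),
    Attempt("Short direct solution ... answer: 3", True),
]

best = pick_training_example(attempts)
print(best.text if best else "no successful attempt")
```

The selected text would then be used as an ordinary training sequence for one parameter update. The update itself is omitted here because it depends on the training framework.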


Remember that all of this happens simultaneously across a huge, diverse set of prompts: tens of thousands of math problems, physics problems, and problems of every other kind, each with thousands of sampled solutions. As the process iterates round after round, the model discovers for itself which token sequences reliably lead to the correct answer. None of this comes from a human labeler. The model is playing in a playground: it knows the destination it is aiming for and works out the sequences that get it there. These sequences share a few traits: they make no unexplained leaps, they are statistically reliable, and they make full use of the knowledge the model already has.
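To see how repeated rounds shift a model toward what works, here is a deliberately tiny simulation. It is not how an LLM is trained: the 'model' is just a table of sampling weights over three hypothetical solution strategies per prompt, and the hidden success rates are invented numbers. It only illustrates the loop of sampling, checking, and reinforcing across several prompts.

```python
import random

# Hypothetical setup: each prompt has three solution "strategies", each with
# a hidden chance of reaching the right answer. All numbers are invented.
HIDDEN_SUCCESS_RATE = {
    "math problem": [0.9, 0.3, 0.05],
    "physics problem": [0.1, 0.8, 0.2],
}

def run_round(weights, rng, samples_per_prompt=100, step=0.1):
    """One guess-and-check round over every prompt; returns the success rate."""
    successes = 0
    total = 0
    for prompt, rates in HIDDEN_SUCCESS_RATE.items():
        w = weights[prompt]
        # Guess: sample strategies in proportion to the current weights.
        picks = rng.choices(range(len(rates)), weights=w, k=samples_per_prompt)
        for strategy in picks:
            total += 1
            if rng.random() < rates[strategy]:  # check: did this attempt succeed?
                w[strategy] += step             # reinforce what worked
                successes += 1
    return successes / total

rng = random.Random(0)
weights = {prompt: [1.0, 1.0, 1.0] for prompt in HIDDEN_SUCCESS_RATE}
history = [run_round(weights, rng) for _ in range(20)]
print(f"success rate: round 1 = {history[0]:.2f}, round 20 = {history[-1]:.2f}")
```

Because successful strategies are sampled more often in later rounds, the success rate tends to climb without anyone telling the 'model' which strategy is best.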

💡 In one line: reinforcement learning is 'guess and check'. Guess many solutions, check them, and do more of what worked. The key trick is that the reward is automatic: as long as we can tell whether the final answer is correct, no human is needed to judge the intermediate process. That is why RL works so well in 'verifiable' domains (math and code, where answers can be checked). The model no longer imitates humans but discovers good strategies of its own, which is exactly where 'reasoning' or 'thinking' models come from.
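What 'automatic reward' means in practice can be sketched with two small checkers, one for a math answer and one for code. Both are assumptions made for illustration: `math_reward` treats the last number in the text as the final answer (real systems often require a stricter answer format), and `code_reward` runs a candidate function against hypothetical test cases.

```python
import re

def math_reward(solution_text: str, correct_answer: float) -> float:
    """Return 1.0 if the last number in the text equals the answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text)
    if not numbers:
        return 0.0  # no final answer found
    return 1.0 if float(numbers[-1]) == correct_answer else 0.0

def code_reward(candidate, test_cases) -> float:
    """Return 1.0 if the candidate passes every (args, expected) case, else 0.0."""
    try:
        passed = all(candidate(*args) == expected for args, expected in test_cases)
    except Exception:
        return 0.0  # a crashing candidate counts as a failure
    return 1.0 if passed else 0.0

# Usage: neither checker looks at how the answer was reached.
print(math_reward("Working through the totals, each apple costs $3.", 3))
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))
```

Neither function inspects the reasoning, only the outcome, which is what lets the check run at scale without a human in the loop.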

Put back into the whole pipeline: the SFT model is still useful as a starting point, because it places the model in the general vicinity of correct solutions. It already knows how to write solutions, set up systems of equations, and talk through a process step by step. But the real 'dialing in' happens during reinforcement learning. Here we discover the solutions that actually work for the model and reach correct answers, encourage them, and the model keeps improving over time. In summary, training an LLM is much like teaching a child: first read all the exposition in the textbooks (pretraining) to build a knowledge base; then study every expert worked solution (SFT) to learn to imitate; finally work through the practice problems (RL) to get a model that can solve problems on its own.

📝 An important reality: RL is much newer than pretraining and SFT, and far from standardized. The core idea (trial-and-error learning) is extremely simple, but the devil is in the details: how to pick the 'best' solutions, how much to train on them, how to construct the prompt distribution, and how to set up the training run so that it actually works. There are many subtle mathematical knobs to get right. For this reason, labs such as OpenAI had long experimented with RL fine-tuning internally but rarely discussed it publicly. That is why DeepSeek's paper, which openly laid out the RL details, was such a big deal. We look at it in the next chapter.

•The core of RL: guess and check. Take a prompt and generate many candidate solutions in parallel (in practice thousands to millions).
•Check which solutions reach the correct final answer; train the model on the successful sequences so that it does more of what worked and less of what failed.
•The training sequences come from the model itself, not from human experts; it practices in a 'playground' and discovers the token sequences that work for it.
•The reward is automatic: the correctness of the final answer is the signal, with no human judging the process. That is why RL is highly effective in 'verifiable' domains such as math and code.
•The model no longer imitates humans but discovers its own good strategies (no unexplained leaps, statistically reliable, making full use of its own knowledge). That is what a 'reasoning' or 'thinking' model is.
•RL is newer and less standardized than SFT, with many details that labs keep private; DeepSeek is one of the few to lay them out openly (next chapter).

📝 Chapter quiz

What is the basic loop of reinforcement learning?

In RL, where do the 'successful solution' sequences used for training come from?

Why is RL especially suited to domains like math and code?

Which statement about the current state of RL is correct?
