LLM Full-Stack Learning
Chapter 17

DeepSeek-R1: The Emergence of Reasoning

The DeepSeek-R1 paper was the first to publicly show the full process of applying reinforcement learning (RL) to reasoning problems. The most striking finding is not that accuracy rises (that is expected) but a qualitative change: during RL training, the model's responses spontaneously grow longer. It learns to reason on its own: re-deriving results from different angles, double-checking its own work, and backtracking to fix mistakes (that is, producing 'chains of thought'). These 'cognitive strategies' are not taught explicitly; they emerge from optimization that rewards only getting the answer right, much like the scratch work a human does on a hard problem. This is what makes it a 'thinking model,' as opposed to earlier SFT models that merely imitate. You can see this thinking directly in models like DeepSeek-R1 and OpenAI's o-series, and accuracy on math and reasoning benchmarks jumps sharply with it. These reasoning models are available today (for example at chat.deepseek.com, on together.ai, and as OpenAI's o-series), but 'thinking' costs more compute at inference time and is slower.

The first two stages, pretraining and supervised fine-tuning (SFT), have been around for years, are very standard, and are done by every LLM provider. What is still genuinely 'early' and not yet standardized across the field is the last stage, RL. The reason, as the previous chapter said, is that the core idea is simple (trial-and-error learning), but a great many subtle details decide whether it actually works. Many companies (OpenAI, for example) had long experimented with RL fine-tuning internally but did not discuss it publicly; the work stayed inside the company.

That is exactly why the paper recently released by the Chinese company DeepSeek was a big deal: it talked very openly about applying RL to large language models and about how much this matters for a model's 'reasoning ability,' and it gave many of the details needed to reproduce the results. The paper reinvigorated public interest in training LLMs with RL. Let's use it to see what happens, and what it looks like, when RL is applied to a language model 'correctly.'

First, the quantitative result. The paper has a figure whose x-axis is the number of RL training steps (thousands of them) and whose y-axis is the model's accuracy at solving math problems. At first the model is not very good, but as it keeps trying, failing, and updating its parameters across a large set of such problems, accuracy keeps climbing: the model is 'discovering' how to solve math problems. That alone is a good result. More striking than the rise in accuracy, though, is a qualitative change in how the model gets there.

Another figure in the paper is interesting: as optimization proceeds, the model's average response length keeps growing. In other words, to reach higher accuracy the model spontaneously uses more and more tokens; it learns to write very, very long solutions. Note that no one asked it to. This is an emergent property of the optimization process. So why does it write such long solutions?

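[Figure: As RL training proceeds, the model spontaneously learns to write longer reasoning and to check itself. Early training: short direct answers, often wrong. Mid training: intermediate steps start to appear. Late training: long reasoning with self-checking and backtracking. Response length (in tokens) grows naturally over training; no one taught the model to do this, it is a by-product of RL optimizing for correct answers.]
Caption: As RL training proceeds, responses spontaneously lengthen: behaviors such as re-checking, re-deriving from new angles, and backtracking emerge, and accuracy rises with them.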
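The two curves described above (accuracy and average response length against training step) can be reproduced from any log of sampled responses. The following is a minimal sketch, not the paper's tooling: the `Rollout` record, its field names, and every number in `log` are hypothetical and exist only to show the shape of the computation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Rollout:
    """One sampled response during RL training (hypothetical record)."""
    step: int        # training step at which the response was sampled
    num_tokens: int  # length of the response in tokens
    correct: bool    # whether the final answer matched the reference


def summarize_by_step(rollouts: List[Rollout]) -> Dict[int, Tuple[float, float]]:
    """Return {step: (mean response length, accuracy)}, ordered by step."""
    by_step: Dict[int, List[Rollout]] = {}
    for r in rollouts:
        by_step.setdefault(r.step, []).append(r)
    return {
        step: (
            mean([r.num_tokens for r in group]),
            mean([1.0 if r.correct else 0.0 for r in group]),
        )
        for step, group in sorted(by_step.items())
    }


# Made-up numbers, only to illustrate the trend the chapter describes.
log = [
    Rollout(0, 120, False), Rollout(0, 80, False),
    Rollout(1000, 900, True), Rollout(1000, 1100, False),
    Rollout(8000, 6000, True), Rollout(8000, 8000, True),
]
summary = summarize_by_step(log)
for step, (avg_len, acc) in summary.items():
    print(f"step {step:>5}: mean length {avg_len:7.1f} tokens, accuracy {acc:.2f}")
```

Plotting the first element of each pair against the step gives the length curve, and the second gives the accuracy curve. Nothing in this bookkeeping rewards length; it only observes it.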

Look at these long solutions qualitatively and you find the model starting to do things like this: partway through, it suddenly writes 'wait, wait. That's an aha moment I can flag here. Let me re-evaluate this step by step to identify the correct sum.' What is the model doing? It is re-evaluating its steps. It has learned that, for accuracy, it works better to try many ideas, approach the problem from different angles, backtrack, and reframe, which is exactly what you and I do in our heads when solving a math problem. It reproduces the process that happens in your head, not what you finally copy onto the answer sheet. No human could hardcode this into an 'ideal assistant response,' because you would not know what to write. It simply turns out to work for the model and to improve its accuracy.

💡 Core insight: the model is 'discovering ways to think.' It learns what I like to call 'cognitive strategies': how to manipulate a problem, approach it from different angles, pull in analogies, try many things, and re-check a result from multiple perspectives. These are what we mean by 'chains of thought,' and they are an emergent property of the optimization rather than something explicitly taught. The only thing we gave the model was the correct answers; the thinking behaviors grew on their own, purely out of trying to get the problems right. That is remarkable.
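To make 'the only thing we gave it was the correct answers' concrete, here is a minimal sketch of an outcome-only reward. It assumes, for illustration, that the model is asked to put its final answer in `\boxed{...}` and that exact string match is good enough; the function names are hypothetical, and a real training setup would need more careful answer normalization.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last boxed answer in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def outcome_reward(response: str, reference: str) -> float:
    """Return 1.0 if the final boxed answer equals the reference, else 0.0.

    The reasoning text is never inspected. Re-checking or backtracking is
    rewarded only indirectly, by making the final answer right more often.
    """
    answer = extract_final_answer(response)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


short_wrong = "The answer is \\boxed{4}"
long_right = "3 + 2 = 6. Wait, let me re-check: 3 + 2 = 5. \\boxed{5}"
print(outcome_reward(short_wrong, "5"))
print(outcome_reward(long_right, "5"))
```

The second response earns the reward even though it starts with a mistake, because only the final answer is scored. That is the sense in which self-correction can emerge without anyone writing a rule for it.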

Give the familiar apple problem to a 'reasoning' or 'thinking' model (that is, one trained with RL) and you see two very different parts. First comes a 'thinking process,' in which it writes something like: 'Let me figure this out ... so each apple is $3. Wait a second, let me check my math again and try it from a different angle ... yep, that all checks out, I don't see any mistakes. Let me see if there's another way, maybe setting up an equation ... same answer, so each apple is definitely $3, I'm confident that's correct.' Only after that thinking ends does it start a fresh section, write up the final solution 'nicely' for the human, and box the answer. The first part is about correctness and the second about presentation. That 'thinking' part is what RL produces: it is what lengthens the response, raises accuracy, and is where the 'aha moments' appear.
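If you work with raw model output, the two parts described above can be separated programmatically. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, which is how DeepSeek-R1 output is commonly delimited; other models or hosting platforms may use a different convention, so treat the tag names as an assumption. `ParsedResponse` and `split_response` are hypothetical names.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    thinking: str  # scratch work: about correctness
    answer: str    # write-up for the human: about presentation


# Assumed delimiter for the reasoning trace; adjust for other models.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(raw: str) -> ParsedResponse:
    """Separate the reasoning trace from the final write-up.

    If no <think> block is found, the whole text is treated as the answer.
    """
    match = THINK_BLOCK.search(raw)
    if match is None:
        return ParsedResponse(thinking="", answer=raw.strip())
    return ParsedResponse(
        thinking=match.group(1).strip(),
        answer=raw[match.end():].strip(),
    )


raw = (
    "<think>So each apple is $3. Wait, let me check my math again... "
    "yep, that checks out.</think>\n"
    "Each apple costs $3."
)
parsed = split_response(raw)
print(len(parsed.thinking.split()), "words of thinking")
print("final answer:", parsed.answer)
```

Keeping the two parts separate is useful in practice: you can show only the answer to a user while still logging the thinking for debugging.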

Where can you use these 'thinking models' today? DeepSeek-R1 is an open-weights model that anyone can download, though it is large and you cannot run the full model on a MacBook. You can use it at chat.deepseek.com (remember to turn on the 'DeepThink' button, otherwise you are not talking to R1), or pick DeepSeek-R1 on an inference platform such as together.ai that hosts state-of-the-art models. OpenAI's 'o' series (o1, o3-mini, and so on) was, according to public statements from OpenAI employees, trained with RL techniques very similar to DeepSeek-R1's, so those are thinking models too. OpenAI's web interface, however, shows only summaries of the chains of thought rather than the full text, partly out of concern about 'distillation risk' (the fear that others will imitate its reasoning traces). Note that models like GPT-4o and GPT-4o mini should mostly be treated as SFT models; they do not really 'think.'

📝 Practical advice: thinking models are powerful, but 'thinking' costs more compute at inference time and is slower, because the model really does spend time generating that long chain of thought. For hard math, code, or problems that need deep reasoning, a thinking model is worth it. For a simple factual question, having the model 'think for 30 seconds' is overkill; an ordinary SFT model (such as GPT-4o) is often cheaper and faster there.
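The trade-off above can be put into rough numbers. This is a back-of-the-envelope sketch only: the token counts, the generation speed, and the price are all made-up assumptions, and it simplifies by using the same speed and price with and without thinking.

```python
from typing import Tuple


def generation_estimate(
    answer_tokens: int,
    thinking_tokens: int,
    tokens_per_second: float,
    price_per_million_tokens: float,
) -> Tuple[float, float]:
    """Return (seconds, cost) for generating the thinking plus answer tokens."""
    total_tokens = answer_tokens + thinking_tokens
    seconds = total_tokens / tokens_per_second
    cost = total_tokens * price_per_million_tokens / 1_000_000
    return seconds, cost


# Hypothetical figures: a 100-token answer, with or without 1,400 tokens
# of chain of thought, at 50 tokens/s and $2 per million output tokens.
direct = generation_estimate(100, 0, 50.0, 2.0)
with_thinking = generation_estimate(100, 1400, 50.0, 2.0)
print(f"direct:        {direct[0]:5.1f} s, ${direct[1]:.4f}")
print(f"with thinking: {with_thinking[0]:5.1f} s, ${with_thinking[1]:.4f}")
```

Under these assumed numbers the thinking response takes 30 seconds instead of 2 and costs 15 times as much, which is a reasonable price for a hard problem and a poor one for a lookup.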
  • The DeepSeek-R1 paper was the first to publicly show the full process of applying RL to reasoning problems, with reproduction details, and it reinvigorated public interest.
  • Quantitative: on math problems, accuracy keeps climbing as the number of RL training steps increases.
  • The more striking, qualitative finding: during training the responses 'spontaneously lengthen,' with the model trading more tokens for higher accuracy.
  • The model spontaneously learns to reason: re-checking its work, re-deriving from different angles, backtracking to fix mistakes. These are 'chains of thought,' an emergent property of optimization that no one explicitly taught.
  • We gave it only the correct answers; these 'cognitive strategies' grew purely out of trying to get the answer right, much like a human's scratch work on a hard problem.
  • These 'thinking models' (DeepSeek-R1, OpenAI's o-series) are available today, but 'thinking' costs more compute at inference time and is slower; an SFT model is the better deal for simple questions.

📝 Chapter quiz

What was the most 'striking' finding in the DeepSeek-R1 paper?

Where do the model's emergent behaviors (re-checking, re-deriving from new angles, backtracking to fix mistakes) come from?

Which grouping correctly classifies 'thinking models' versus ordinary models?

What practical advice does the chapter give about using thinking models?