Chapter 18

Lessons from AlphaGo

'Reinforcement learning is extremely powerful' is not a discovery unique to the LLM era. A classic example is the game of Go: DeepMind's AlphaGo. It was trained in two ways: supervised learning (imitating human expert games) and reinforcement learning (self-play, reward = winning). The RL version surpassed the supervised one and beat the world champion. Supervised learning, being imitation of humans, 'caps out' around human level and does not truly exceed the top players; RL is not constrained by human performance and can discover strategies unknown to humans. The most famous is 'Move 37': a move almost no human would play (estimated at about 1 in 10,000 probability) that proved brilliant in hindsight. The lesson for LLMs: SFT imitates humans (capped at the human ceiling), while RL may discover new reasoning strategies beyond imitation. So there is an open question: what is the LLM equivalent of 'Move 37'? Perhaps analogies humans would not think of, an entirely new thinking strategy, or even a model-invented 'thinking language' that is not English. The prerequisite is a huge, diverse set of problems to practice on. A key caveat: Go has a crisp win/lose reward, while open-ended language tasks are hard to reward, which foreshadows the next chapter's RLHF.

One last connection: 'reinforcement learning is an extremely powerful way of learning' is not new to the field of AI. One place its power was demonstrated long ago is the game of Go, with DeepMind's famous AlphaGo system. It learned to play against top human players (there is even a documentary about it). Open the paper behind AlphaGo and you will find a figure that is interesting and quite familiar to us. The difference is that we are now rediscovering it in the more open domain of arbitrary problem-solving, whereas back then it was demonstrated in the closed, specific domain of Go.

That figure's y-axis is the Elo rating in Go, with a reference line for Lee Sedol, an extremely strong human player. It compares the strength of two models: one trained by supervised learning, one by reinforcement learning. The supervised model is 'imitating human experts': take a huge number of expert games and try to imitate them, and strength does rise, but it 'caps out' and never reaches the level of the top players (like Lee Sedol). The reason is fundamental: you are only imitating humans, and pure imitation of humans cannot, by itself, surpass humans.
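To make the y-axis concrete: under the standard Elo model, a rating gap translates into an expected score for one player against another. The sketch below is only an illustration of how to read such a gap; the function name is hypothetical and the ratings in the usage note are made-up numbers, not values taken from the figure.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model.

    The expected score counts a win as 1, a draw as 0.5, and a loss as 0.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
```

For example, `elo_expected_score(3400, 3000)` gives the expected score of a player rated 400 points above the opponent, and two equally rated players each get 0.5.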

💡 The core contrast: supervised learning = imitating human experts, so its ceiling is 'human level' and it cannot truly exceed the top humans. Reinforcement learning is not constrained by human performance: in Go, it means the system plays the moves that 'empirically and statistically lead to winning.' This maps directly onto LLMs: SFT imitates humans (capped at the human ceiling), while RL can potentially discover new strategies beyond imitation.
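The contrast above can be sketched on a deliberately tiny toy problem. Everything here is an assumption made for illustration: a one-move "game" with three possible moves, made-up win probabilities, a made-up "expert" who rarely plays the best move, and a simple epsilon-greedy learner standing in for RL. It is not how AlphaGo was trained.

```python
import random

# Hypothetical toy "game": each move has a fixed probability of winning.
WIN_PROB = [0.2, 0.5, 0.9]          # move 2 is best
EXPERT_POLICY = [0.1, 0.85, 0.05]   # the "human expert" rarely plays move 2

def expected_win_rate(policy):
    """Win rate of a policy given as a probability for each move."""
    return sum(p * w for p, w in zip(policy, WIN_PROB))

def imitate(expert_policy, n_games, rng):
    """Supervised learning: copy the expert's move frequencies from sampled games."""
    counts = [0] * len(expert_policy)
    for _ in range(n_games):
        move = rng.choices(range(len(expert_policy)), weights=expert_policy)[0]
        counts[move] += 1
    return [c / n_games for c in counts]

def reinforce(n_games, rng, epsilon=0.1):
    """RL stand-in: try moves, track which ones actually win, favor those."""
    wins = [0] * len(WIN_PROB)
    plays = [0] * len(WIN_PROB)
    for _ in range(n_games):
        if rng.random() < epsilon or 0 in plays:
            move = rng.randrange(len(WIN_PROB))  # explore
        else:
            move = max(range(len(WIN_PROB)), key=lambda m: wins[m] / plays[m])
        plays[move] += 1
        wins[move] += rng.random() < WIN_PROB[move]  # reward = did we win?
    best = max(range(len(WIN_PROB)), key=lambda m: wins[m] / max(plays[m], 1))
    policy = [0.0] * len(WIN_PROB)
    policy[best] = 1.0  # final policy: always play the empirically best move
    return policy
```

In this toy setup the expert's win rate is 0.49, and the imitation policy can only converge toward that same number because its target is the expert's move distribution. The reward-driven learner has no such target, so nothing stops it from settling on the move the expert almost never plays.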

How was AlphaGo's reinforcement learning version trained? It plays against itself, using RL to generate 'rollouts.' This is essentially the diagram from the last chapter, except that here there is no prompt: Go is a fixed game, the system tries many different lines of play, and the games that ultimately win (rather than 'reach a specific answer') get reinforced. So the system gradually learns which action sequences empirically and statistically lead to victory. Because it is not constrained by human performance, RL can do significantly better, even surpassing top players like Lee Sedol, which is how AlphaGo beat the world champion. (That curve could probably have kept climbing; running RL costs money, so they stopped it at some point.)
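The self-play loop described above can be sketched on a game small enough to fit in a table. This is a minimal sketch under stated assumptions, not AlphaGo's method: the game is a hypothetical stone-taking game (take 1 to 3 stones, whoever takes the last stone wins), and the learner is plain tabular Monte Carlo averaging with epsilon-greedy exploration, whereas AlphaGo used deep neural networks and tree search. All function names are illustrative.

```python
import random

MAX_TAKE = 3  # toy game: take 1-3 stones; whoever takes the last stone wins

def legal_moves(pile):
    return list(range(1, min(MAX_TAKE, pile) + 1))

def train_self_play(episodes=20000, max_pile=10, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    value = {}   # (pile, move) -> running average outcome (+1 win, -1 loss)
    visits = {}  # (pile, move) -> how many times this move was played

    def choose(pile):
        moves = legal_moves(pile)
        if rng.random() < epsilon:
            return rng.choice(moves)  # explore: try something different
        return max(moves, key=lambda m: value.get((pile, m), 0.0))  # exploit

    for _ in range(episodes):
        pile = rng.randint(1, max_pile)  # random start so every pile is seen
        history = [[], []]               # moves made by player 0 and player 1
        player = 0
        while pile > 0:                  # one rollout: the policy plays itself
            move = choose(pile)
            history[player].append((pile, move))
            pile -= move
            if pile == 0:
                winner = player
            player = 1 - player
        # Reinforce: the winner's moves are pulled toward +1, the loser's toward -1.
        for p in (0, 1):
            outcome = 1.0 if p == winner else -1.0
            for key in history[p]:
                visits[key] = visits.get(key, 0) + 1
                old = value.get(key, 0.0)
                value[key] = old + (outcome - old) / visits[key]
    return value

def greedy_move(value, pile):
    """The move the trained table currently rates highest for this pile."""
    return max(legal_moves(pile), key=lambda m: value.get((pile, m), 0.0))
```

No human games appear anywhere in this loop. The only training signal is who won each rollout, which is the sense in which winning games are "reinforced." Calling `greedy_move(train_self_play(), 5)` shows what the table prefers with five stones left.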

The most telling example is the famous 'Move 37.' In a game against Lee Sedol, AlphaGo played a move that almost no human player would: it was estimated that a human had about a 1-in-10,000 chance of playing it. Yet in hindsight it was a brilliant move. In other words, in the process of reinforcement learning, AlphaGo discovered a way of playing that was unknown to humans. The move looked like a 'mistake' at the time, and commentators thought it was an error, but it came from a judgment, learned in training, that the move was likely to win. This is exactly RL's power: nothing prevents the system from veering off the distribution of how humans play.

📝 Carry all this over to language models and you get a captivating open question: what is the LLM equivalent of 'Move 37'? Since RL need not stay within the human distribution, what new strategies or 'aha'-style insights might it discover in language and reasoning that humans do not have? Perhaps analogies humans could not construct, perhaps an entirely new thinking strategy, or even, since the model is not forced to use English, a drift away from English toward a self-invented language better suited to 'thinking.' The prerequisite for all of this is a large and diverse enough distribution of problems (a 'game environment' built from all kinds of 'practice problems') on which the model can refine these strategies. This is exactly what frontier LLM research is doing right now.

⚠️ A key caveat: Go is a 'perfect' arena for RL because it has a crisp reward: win or lose, black and white, judged automatically. Likewise, math and code have verifiable answers. But what about open-ended language tasks? 'Is this poem good?' 'Is this joke funny?' 'How good is this summary?' These have no clean win/lose signal. When the reward is hard to define, the automated RL described above cannot be applied directly. How to set a reward for such open-ended tasks is exactly what the next chapter, 'reinforcement learning from human feedback (RLHF),' addresses.
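The caveat becomes concrete when you try to write the reward as a function. The sketch below is illustrative only: all four function names and their signatures are hypothetical, the math check assumes answers are plain numbers, and the code check assumes the candidate is a Python callable with known test cases.

```python
def go_reward(winner: str, player: str) -> float:
    """Go: the rules decide the winner, so the reward is automatic."""
    return 1.0 if winner == player else 0.0

def math_reward(model_answer: str, reference: str) -> float:
    """Math: compare the final answer with a known reference value."""
    try:
        return 1.0 if abs(float(model_answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        return 0.0  # an answer that is not a number counts as wrong

def code_reward(func, test_cases) -> float:
    """Code: run the candidate on test cases and return the fraction passed."""
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(test_cases)

def poem_reward(poem: str) -> float:
    """Open-ended text: there is no rule or reference answer to check against."""
    raise NotImplementedError("no automatic check for 'is this poem good?'")
```

The first three functions can score millions of rollouts with no human in the loop. The last one has no obvious body to write, which is the gap that a learned or human-provided reward signal has to fill.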

•AlphaGo (DeepMind) is the canonical RL success: it demonstrated in Go that reinforcement learning is extremely powerful.
•Two ways of training: supervised learning (imitating human expert games) vs reinforcement learning (self-play, reward = winning).
•Result: the RL version surpassed the supervised one and beat the world champion; supervised learning only imitates humans and caps out at human level.
•'Move 37': a move almost no human would play (about 1 in 10,000 probability) that was brilliant in hindsight. RL discovered a strategy beyond human knowledge.
•Lesson for LLMs: SFT imitates humans (capped at the human ceiling), while RL can discover new reasoning strategies beyond imitation; what the LLM 'Move 37' is remains an open question.
•Key caveat: Go has a crisp win/lose reward (math and code are verifiable too), but open-ended language tasks are hard to reward, which leads into the next chapter's RLHF.

📝 Chapter quiz

Of AlphaGo's two training approaches, why does the 'supervised learning' version cap out?

What property of RL does 'Move 37' illustrate?

Mapping AlphaGo's lesson onto LLMs, which statement is most accurate?

Why is Go a 'perfect' arena for RL while open-ended language tasks are harder?