Models Need Tokens to Think
Each token gets only a finite amount of computation (a fixed number of network layers), so the model cannot do unbounded reasoning within a single token. The practical conclusion: spreading reasoning across many tokens (writing the work out step by step) works far better than blurting out the answer. The 'Emily buys apples' math problem illustrates the contrast: stating the answer up front forces the model to cram all of the computation into one token, which makes errors likely, whereas laying out intermediate steps and approaching the answer gradually makes a far better training label. Corollary: for hard arithmetic, have the model call a tool (a code interpreter) rather than do 'mental math', since it is not good at reliable in-head computation. The token sequence is the model's scratchpad, its working memory for computation.
This section is about these models' 'native computational capabilities' in problem solving. When constructing conversation examples to train the model, we have to be very careful: there are quite a few 'sharp edges' here, and they are illuminating about how the models 'think'. Consider this prompt (suppose we want to put it in the training set to teach the model to solve simple math problems): 'Emily buys 3 apples and 2 oranges. Each orange costs $2. The total is $13. What is the cost of each apple?'
Imagine two candidate answers, both correct (both conclude that each apple costs $3). The one on the left states 'the answer is $3' right away, then adds a sentence of explanation. The one on the right writes the intermediate steps first ('the oranges cost $4 total, so 13 minus 4 is 9, and 9 divided by 3 is 3') and only then gives the answer. Both are correct, but one is a far better training label for the assistant and the other is quite bad. Can you see why? If you train on the bad one, your model could end up really bad at math.
The key thing to remember: in both training and inference, the model works on a one-dimensional token sequence from left to right. To generate each next token, all preceding tokens are fed into the network, which then outputs probabilities for the next token. The most important point is that this network has a finite number of layers of computation; a modern state-of-the-art network might have only about 100 layers. That is, going from 'the previous token sequence' to 'the probabilities for the next token' involves only a finite, roughly fixed amount of computation per token. Think of it as 'each token gets only a small, nearly fixed compute budget'.
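To make the 'fixed budget per token' idea concrete, here is a crude accounting sketch. It assumes a hypothetical network with `NUM_LAYERS = 100` (the rough figure mentioned above) and counts one 'layer application' per layer per token position. The function name and the counting scheme are illustrative assumptions; real networks differ in many details.

```python
NUM_LAYERS = 100  # assumption: the "about 100 layers" ballpark from the text


def layer_applications(num_tokens: int, num_layers: int = NUM_LAYERS) -> int:
    """Toy count of layer applications spent on `num_tokens` token positions.

    Every position passes through the same stack of layers exactly once,
    so the cost per token is fixed no matter how hard the problem is.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens * num_layers


# A single token always gets the same budget.
per_token = layer_applications(1)

# Writing 20 tokens of intermediate work spends 20 such budgets, and the
# results of that work stay visible in the context for later tokens.
with_scratchpad = layer_applications(20)

print(per_token, with_scratchpad)  # prints: 100 2000
```

In this toy model the only way to spend more computation on a problem is to emit more tokens, which is the point the paragraph above makes.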
Back to the apple problem. The 'answer-first' response on the left is bad for the following reason. Imagine the model emitting the tokens 'answer / is / $' from left to right. At the very next token, '3', it has to cram the entire problem's computation into a single token and get it right in one shot. Once '3' is out, the explanation tokens that follow are just post-hoc justification: the answer is already in the context, and the later tokens are not really computing anything. So this label trains the model to 'guess the answer in one token', which cannot work given the finite compute available per token.
The response on the right is far better because it distributes the computation. The model approaches the answer gradually from left to right, producing intermediate results first ('the oranges cost $4 total', '13 minus 4 is 9', '9 divided by 3'). No single step is expensive, and together they add up to the full solution. By the time the model nears the end, all the intermediate results are in its working memory (the context), so the final answer is much easier to determine. We are really 'teaching the model to spread out its reasoning', so that each token involves only a small, simple computation. The demo below contrasts solving the same problem 'in one shot' with solving it 'step by step', so you can see the difference in difficulty.
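The contrast between the two labels can be sketched by measuring how many tokens the model gets to write before it must commit to the answer. The two label strings below are paraphrases written for illustration, not the exact training labels, and whitespace splitting is only a stand-in for a real tokenizer.

```python
# Illustrative paraphrases of the two candidate labels (assumptions).
ANSWER_FIRST = (
    "The answer is $3. The oranges cost 2 * $2 = $4, "
    "so the apples cost $9, and $9 / 3 = $3."
)
STEP_BY_STEP = (
    "The oranges cost 2 * $2 = $4 in total. "
    "The apples cost $13 - $4 = $9. "
    "Each apple costs $9 / 3 = $3."
)


def tokens_before_answer(label: str, answer: str = "$3") -> int:
    """Return how many whitespace tokens precede the first mention of `answer`."""
    for index, token in enumerate(label.split()):
        # Strip trailing punctuation so "$3." matches "$3" but "$13" does not.
        if token.rstrip(".,") == answer:
            return index
    raise ValueError("answer not found in label")


print(tokens_before_answer(ANSWER_FIRST))  # prints: 3
print(tokens_before_answer(STEP_BY_STEP))  # prints: 25
```

In the answer-first label the model has written only three tokens of its own before it must produce '$3'. In the step-by-step label it has written 25, including the intermediate results $4 and $9.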
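🧮 Models need tokens to 'think'
Problem: Emily buys 3 apples and 2 oranges. Each orange costs $2 and the total is $13. How much does each apple cost? (Answer: $3 each.)
Better: spread the reasoning across many tokens, with each intermediate step (computing the oranges' total, subtracting, dividing) taking up a few tokens. The model only has to do a little computation at each step, and by the time it writes the final '3', the answer has almost been 'computed' by the preceding tokens.
Core intuition: the token sequence is the model's 'scratchpad'. Forcing it to skip steps causes errors; letting it write out the process raises accuracy significantly.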
In practice you usually don't have to worry about this explicitly, because companies like OpenAI have labelers dedicated to ensuring that answers are 'spread out'. So if you ask ChatGPT this problem, it goes slowly ('let's define variables, set up the equation') and produces a pile of intermediate results. Those intermediate results are not for you; they are for the model itself. If the model does not generate them for itself, it will not reach the correct answer. Conversely, you can 'be mean' to it and tell it to 'answer in a single token, immediately'. For an easy problem it might luck into the right answer, but scale up the numbers (e.g. 'Emily buys 23 apples and 177 oranges ...') and it cannot compute the result in one forward pass, so the answer comes out wrong. Tell it to 'solve normally, ignore the token limit' and, after laying out the intermediate steps, it gets the answer right again.
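The 'define variables, set up the equation' route can be written out as a short derivation for the original problem. The symbol $a$ (the price of one apple in dollars) is introduced here only for illustration.

```latex
\begin{align*}
3a + 2 \cdot 2 &= 13 \\
3a &= 13 - 4 = 9 \\
a &= 9 / 3 = 3
\end{align*}
```

Each line is a small step that depends only on the line before it, which is the kind of intermediate result the model writes into its context.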
- Each token gets only a finite, roughly fixed amount of computation (a finite number of layers); the model cannot do arbitrarily complex computation in a single token.
- So spread reasoning across many tokens: produce intermediate results and approach the answer gradually.
- An 'answer-first' label is bad: it forces all of the computation into one token, and the explanation that follows is just post-hoc justification.
- A 'step-by-step' label is far better: each step does a little simple computation, the steps add up to a solution, and all intermediate results are in working memory by the end.
- Hard arithmetic / counting: don't make the model do mental math; have it use a tool (a code interpreter), which is far more reliable.
- The token sequence is the model's scratchpad / working memory for computation. That is why 'models need tokens to think'.
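The tool-use point can be sketched as follows: instead of producing a large arithmetic result 'in its head', the model writes a small program and a code interpreter runs it. The function name `apple_price` and the larger numbers in the second call are hypothetical, chosen only for illustration; they are not taken from the problem in the text.

```python
from fractions import Fraction


def apple_price(n_apples: int, n_oranges: int, orange_price: int, total: int) -> Fraction:
    """Solve n_apples * price + n_oranges * orange_price == total for price."""
    if n_apples <= 0:
        raise ValueError("n_apples must be positive")
    oranges_total = n_oranges * orange_price  # step 1: cost of all oranges
    apples_total = total - oranges_total      # step 2: what is left for apples
    return Fraction(apples_total, n_apples)   # step 3: exact price per apple


print(apple_price(3, 2, 2, 13))       # original problem, prints: 3
print(apple_price(47, 219, 3, 1033))  # made-up larger instance, prints: 8
```

Using `Fraction` keeps the result exact even when the price is not a whole number, so the interpreter, not the model, carries the burden of getting the arithmetic right.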
📝 Chapter Quiz
Why can't a model do arbitrarily complex reasoning within a single token?
For the apple math problem, why is 'lay out the intermediate steps, then give the answer' a better training label?
Who are the intermediate steps that ChatGPT produces mainly for?
For hard arithmetic or counting, what does this chapter recommend?