LLM Full-Stack Learning
Chapter 6

Inference: Generating Text from the Model

Beyond training, the other key stage is inference: generating new data from the model. The procedure is simple. Give the network a prefix of tokens; it outputs a probability distribution over the next token; we 'flip a biased coin' to sample one token, append it to the sequence, and repeat. Because each step is a random draw, the same prefix can yield different outputs, and the generated text is a 'remix' of the training data: statistically similar, but not copied verbatim. When you chat with ChatGPT, everything you see is inference.

We have seen the network's internals and touched on training. Now for the other major stage of working with these networks: inference. In inference we generate new data from the model, which lets us see what patterns it has internalized in its parameters. Generating from the model is fairly straightforward: we start with some tokens that form the 'prefix', whatever you want the text to begin with.

Say we want to start with token 91. We feed it into the network, and the network returns a probability vector, one entry per token in the vocabulary, just as before. Then we 'flip a biased coin': we sample a token according to this probability distribution. Tokens the model assigned high probability are more likely to be drawn; that is what makes the coin biased. Say this time we sample token 860, a relatively likely continuation of 91. We append 860 to the sequence, ask 'what is the third token?', sample again, append again, and keep looping. The diagram below shows this generation loop.

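[Figure: Inference = predict one token at a time, append it to the context, and repeat. Current context (the token sequence so far) → neural network (outputs a probability distribution) → sample one token ('flip a biased coin') → append the new token to the context and go around again. Because every step flips a coin, the same opening can produce different text each time.]
The inference loop: prefix tokens → network gives probabilities → sample a token → append → predict the next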
💡 Key intuition: each sampling step is 'flipping a biased coin.' The network only gives probabilities; which token actually comes out depends on the draw. That is why the whole system is stochastic: the same prefix can head off in a different direction on each generation.
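The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `next_token_probs` is a hypothetical toy stand-in for the network (a real model would compute the distribution from its parameters), and the 8-token vocabulary is invented to keep the example self-contained.

```python
import random

VOCAB_SIZE = 8  # hypothetical tiny vocabulary, token ids 0..7


def next_token_probs(tokens):
    """Toy stand-in for the network: one probability per token id.

    Assumption for this sketch: the weights depend only on the last token.
    A real model conditions on the whole context.
    """
    last = tokens[-1]
    weights = [1.0 + ((last * 3 + t * 5) % 7) for t in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(prefix, num_new_tokens, rng):
    """Autoregressive sampling: predict, sample, append, repeat."""
    tokens = list(prefix)
    for _ in range(num_new_tokens):
        probs = next_token_probs(tokens)
        # The "biased coin": draw one token id, weighted by its probability.
        sampled = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(sampled)
    return tokens


if __name__ == "__main__":
    # Same prefix, different random streams: the continuations can differ.
    for seed in (0, 1, 2):
        print(generate([3], 6, random.Random(seed)))
```

Passing the random generator in explicitly makes the source of randomness visible: fix the seed and the output repeats, change it and the same prefix may continue differently.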

In the example above, we did not exactly reproduce the sequence from the training data. At one step we sampled token 13659 (the word 'article') instead of the original 3962, and from there the continuation diverged from the source. Remember: these systems are stochastic; we are sampling, we are flipping coins. Sometimes we get lucky and reproduce a small chunk of training text; sometimes we sample a token that takes the sequence somewhere no training document goes verbatim.

So what we get is a 'remix' of the training data. Each step can draw a slightly different token, and once that token enters the sequence it influences every later sample, so the sequence quickly drifts away from any single training document. Statistically, the generated text has properties similar to the training data but is not identical to it; it is more like text 'inspired by' the training data. As for why we would sample 'article': in a context like 'viewing a single ...', 'article' is a relatively likely word, and somewhere in the training documents it did follow such a context. We just happened to draw it at that step.
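To make the 'remix' idea concrete, here is a minimal sketch that swaps the neural network for a much simpler model: a table of which word follows which, counted from a tiny corpus. The three-sentence corpus is invented for this example and the bigram table is an assumption standing in for the network, not how an LLM works internally. The point it shows is the same, though: every step follows the training statistics, yet the full output often matches no training sentence.

```python
import random
from collections import Counter, defaultdict

# Hypothetical three-sentence "training set", invented for this sketch.
corpus = [
    "you are viewing a single article",
    "you are viewing a single post",
    "this article is a single page",
]

# Count which word follows which: a bigram table standing in for the network.
follows = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1


def generate(start, max_words, rng):
    """Sample a continuation one word at a time from the bigram counts."""
    words = [start]
    # Stop at the length cap or when the last word has no known follower.
    while len(words) < max_words and follows[words[-1]]:
        options = follows[words[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        words.append(nxt)
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        text = generate("you", 10, rng)
        print(text, "| verbatim in training set:", text in corpus)
```

Every adjacent word pair in the output was seen during 'training', but a sentence such as 'you are viewing a single page' can still appear even though no training sentence contains it.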

The demo below lets you try this yourself: given a context, the network gives a probability distribution over the next token, and you 'flip the coin' to sample. Sample a few times and notice how the same prefix can lead to different tokens. That is exactly where the randomness of inference comes from.

🎲 Predict the next token (flip a biased coin)

Context: Viewing a single

Next token   Probability
the          32%
a            21%
article      14%
single       11%
post         9%
viewing      7%
direction    6%

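The model 'flips a biased coin' to draw the next token according to these probabilities: high-probability tokens are more likely to be picked, but the result can differ on every draw.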
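If you cannot run the interactive demo, the same experiment can be approximated offline. The sketch below copies the seven probabilities shown for the context 'Viewing a single' and draws from them with Python's standard `random.choices`. The function name `sample_next_token` and the number of draws are arbitrary choices for this illustration.

```python
import random
from collections import Counter

# Next-token probabilities for the context "Viewing a single", copied from the demo.
probs = {
    "the": 0.32,
    "a": 0.21,
    "article": 0.14,
    "single": 0.11,
    "post": 0.09,
    "viewing": 0.07,
    "direction": 0.06,
}


def sample_next_token(rng):
    """One flip of the biased coin: draw a token weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(42)
draws = [sample_next_token(rng) for _ in range(10_000)]
counts = Counter(draws)

# Compare the model's probability with how often each token was actually drawn.
for token, p in probs.items():
    print(f"{token:<10} model: {p:.2f}  observed: {counts[token] / len(draws):.2f}")
```

Any single draw is unpredictable, but over many draws the observed frequencies should settle close to the probabilities the model assigned.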

📝 An important distinction: training and inference are two different things. In a common workflow, downloading and tokenizing the internet-scale text is a one-time preprocessing step; you then train many networks with different settings and sizes; finally you pick a set of parameters you are happy with, freeze it, and use it for inference. When you chat with a model on ChatGPT, that model was typically trained months earlier. Its parameters are frozen and no longer updated, so everything you see is just inference: you give it some tokens, and it completes the token sequence.
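One way to picture 'frozen parameters' in code is the sketch below. It rests on simplifying assumptions: the 'parameters' are a hand-written lookup table rather than learned weights, and `FROZEN_PARAMS` and `complete` are hypothetical names. Python's `types.MappingProxyType` provides a read-only view, which stands in for the fact that inference only reads the parameters and never writes to them.

```python
import random
from types import MappingProxyType

# Hypothetical "trained" parameters: next-token options and weights per token.
# The read-only view stands in for weights that were fixed after training.
FROZEN_PARAMS = MappingProxyType({
    "viewing": (("a", 0.9), ("the", 0.1)),
    "a": (("single", 1.0),),
    "single": (("article", 0.5), ("post", 0.5)),
})


def complete(prompt_tokens, max_new_tokens, rng):
    """Inference only: read the parameters, sample, append. Nothing is updated."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        choices = FROZEN_PARAMS.get(tokens[-1])
        if choices is None:  # no known continuation for this token
            break
        words, weights = zip(*choices)
        tokens.append(rng.choices(words, weights=weights, k=1)[0])
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    # Many separate "conversations" reuse the same unchanged parameters.
    for _ in range(3):
        print(complete(["viewing"], 3, rng))
```

However many completions are requested, the table is identical before and after; changing it would require going back to training, which is a separate stage.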
  • Inference = generating new data from the model to see what patterns it has internalized in its parameters.
  • The loop: give prefix tokens → network outputs next-token probabilities → sample one (flip a biased coin) → append → predict again.
  • The system is stochastic: the same prefix can yield different outputs because each step is a random draw.
  • Generated text is a 'remix' of the training data: statistically similar but not verbatim, more like text inspired by it.
  • Training and inference are separate stages; using ChatGPT is all inference, with parameters frozen and no longer updated.

📝 Chapter quiz

During inference, how do we get a concrete token from the network's probability distribution?

Why can the model produce different outputs each time from the same prefix?

What is the relationship between generated text and the training data?

When you chat with a model on ChatGPT, what is mainly happening under the hood?