Neural Network Inputs and Outputs
Once the corpus is tokenized, we have about 15 trillion tokens. Training the network comes down to modeling how tokens follow one another. Without opening the black box yet, we look at its input (a variable-length window of tokens), its output (a probability distribution over the whole vocabulary), and how training nudges it step by step.
After the tokenizer re-represents the corpus as token sequences, a dataset like FineWeb is no longer just about 44 TB of text: it is also a sequence of about 15 trillion tokens. Each token is an "atom" standing for a small chunk of text; its numeric ID carries no notion of magnitude and serves only as a unique identifier. Next comes the most compute-heavy step: training the neural network to model the statistical regularities of how these tokens follow one another in the sequence.
Here is how it works. We take a "window" of tokens from a fairly random place in the data. The tokens in that window are fed into the network as the context, which is the network's input, and the goal is to predict the token that comes immediately after. Leaving the internals aside for now, the key is to see the input and output clearly: the input is a sequence of tokens, and the output is a prediction of what the next token is.
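The windowing step described above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the function name `sample_window`, the toy token IDs, and the choice to pick the target position uniformly at random are all assumptions made for the example.

```python
import random

def sample_window(tokens, max_context, rng):
    """Pick a random position and return (context, target).

    context: up to max_context token IDs that end just before the target.
    target:  the token ID that actually comes next in the data.
    """
    # Any position can be the target, including position 0,
    # in which case the context is empty.
    target_pos = rng.randrange(len(tokens))
    start = max(0, target_pos - max_context)
    context = tokens[start:target_pos]
    return context, tokens[target_pos]

# Made-up token IDs standing in for a tokenized corpus.
toy_tokens = [91, 860, 287, 11579, 3962, 13, 91, 860, 287, 4]

rng = random.Random(0)
context, target = sample_window(toy_tokens, max_context=4, rng=rng)
print("context:", context, "-> target:", target)
```

The context is what the network sees; the target is held back and used later as the correct answer.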
What does the output look like? Because the vocabulary has 100,277 possible tokens, the network outputs exactly that many numbers, each one the probability that the corresponding token comes next. In other words, the output is a probability distribution over the entire vocabulary: the network's guess about what comes next. This point matters: the output is not "one word" but a probability assigned to all of the roughly 100,000 candidate tokens at once. Only the later inference step samples an actual token from this distribution.
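To make "one probability per token" concrete, here is a small sketch. The text does not say how the network turns its raw scores into probabilities; softmax is the usual choice, so it is assumed here. The five-token vocabulary and the score values are invented for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary; the real one has 100,277 entries.
vocab = ["the", "cat", "sat", "mat", "."]
logits = [2.0, 0.5, 1.0, -1.0, 0.0]  # made-up raw scores from the network

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
print("sum:", round(sum(probs), 6))
```

Every token in the vocabulary gets a number, and the numbers always add up to 1, however the raw scores are distributed.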
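🎲 Predicting the Next Token (flipping a biased coin)
The model draws the next token by "flipping a biased coin": tokens with higher probability are more likely to be picked, but the result can differ from one draw to the next.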
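The biased draw can be imitated with Python's standard library. The sketch below assumes a made-up distribution over five tokens and uses `random.choices` for weighted sampling; it only illustrates the idea of sampling from a distribution, not how any particular model implements inference.

```python
import random
from collections import Counter

vocab = ["the", "cat", "sat", "mat", "."]
probs = [0.50, 0.20, 0.15, 0.10, 0.05]  # made-up distribution, sums to 1

rng = random.Random(42)
# Draw 10,000 "next tokens" according to the probabilities.
draws = rng.choices(vocab, weights=probs, k=10_000)
counts = Counter(draws)

for token, p in zip(vocab, probs):
    observed = counts[token] / len(draws)
    print(f"{token!r}: drawn {observed:.3f} of the time, probability {p:.2f}")
```

Any single draw can land on a low-probability token, but over many draws the observed frequencies track the probabilities.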
Now for the length of the input. The window can hold anywhere from 0 tokens up to a maximum we choose, for example 8,000 tokens in practice; that upper bound is called the context length. In principle the window could be arbitrarily long, but processing very long sequences is computationally expensive, so we pick a value such as 4,000, 8,000, or 16,000 and crop anything beyond it. GPT-2 had a context length of just 1,024, whereas modern models reach into the hundreds of thousands. Longer windows cost more because the network has to weigh the relationship between every pair of tokens in the window, and the number of pairs grows with the square of the length. The demo below lets you try out the variable-length window and the context cap.
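A rough piece of arithmetic shows why length is expensive. The sketch below assumes cost is proportional to the number of token pairs, which is a simplification (a real network has other costs too), and it assumes cropping keeps the most recent tokens. Both function names are hypothetical.

```python
def crop_to_context(tokens, context_length):
    """Keep at most the last context_length tokens."""
    if context_length <= 0:
        return []
    return tokens[-context_length:]

def pair_count(n):
    """Number of token pairs in a window of n tokens (each token with each token)."""
    return n * n

base = pair_count(1024)
for n in (1024, 4000, 8000, 16000):
    pairs = pair_count(n)
    print(f"length {n:>6}: {pairs:>12,} pairs, {pairs / base:.1f}x the 1,024 case")

print(crop_to_context(list(range(10)), 4))  # [6, 7, 8, 9]
```

Doubling the window length quadruples the pair count, which is why the cap is chosen deliberately.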
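🪟 Context Window (Context Length)
When predicting the next token, the model can only "see" the most recent tokens; how many it can see is the context length. Drag the slider to change the window size.
The model uses the 4 highlighted tokens as context to predict the next token (?) at the end. A larger window gives the model more history to draw on, so predictions tend to be more accurate, but the computation also costs more. GPT-2 is capped at 1,024 tokens; modern models reach hundreds of thousands or even a million or more.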
How does training proceed? At the start, the network is randomly initialized: it is just a random transformation, so the probabilities it outputs early in training are essentially random as well. But because the window was sampled from real data, we know which token actually comes next; that token is the label. We then use a mathematical procedure to update the network so that the probability of the correct token goes up a little and the probabilities of the other tokens go down a little. After the update, feeding in the same context again yields a slightly higher probability for the correct answer. This nudging is not applied to one window at a time: we sample windows in batches and apply the same small push at every position in parallel. Training simply repeats this update until the network's predictions agree with the actual statistics of how tokens follow one another in the training data.
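One nudge can be sketched in a few lines. The text only says "a mathematical procedure"; the sketch assumes the common choice of gradient descent on the cross-entropy loss, and it treats the output scores themselves as the adjustable parameters. In a real network the same signal is propagated back into the weights instead. The score values, the label index, and the learning rate are made up.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nudge(logits, label, lr=0.5):
    """One gradient-descent step on the loss -log(p[label]).

    The gradient with respect to logit i is p[i] - 1 for the label
    and p[i] for every other token.
    """
    probs = softmax(logits)
    return [
        z - lr * (p - (1.0 if i == label else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

logits = [0.3, -0.2, 0.1, 0.0]  # stand-in for a freshly initialized network's scores
label = 2                       # index of the true next token from the data

before = softmax(logits)
after = softmax(nudge(logits, label))
print("p(correct) before:", round(before[label], 4))
print("p(correct) after: ", round(after[label], 4))
```

Because the probabilities must still sum to 1, whatever the correct token gains is taken from the combined share of the other tokens.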
Keep in mind that the nudge is never applied to a single position in isolation. It acts on many thousands of token positions at once. In practice we sample many windows per batch, work out at every position of every window which way the correct answer should push, and then combine all of those signals into a single update of the network. Because this parallelizes so well, it is feasible to train the network on roughly 15 trillion tokens, and the sheer number of positions is also why this step is so compute-heavy.
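Here is a toy version of batched updates, under strong simplifying assumptions: the "model" is a single shared score vector that ignores context entirely, the per-position signals are averaged (one common way to combine them), and the dataset is a made-up list in which three tokens appear with frequencies 60%, 30%, and 10%. It is meant only to show many positions feeding one update and the predictions drifting toward the data's statistics.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def batch_update(logits, labels, lr=0.1):
    """Average the per-position gradients, then apply a single update."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for label in labels:
        for i, p in enumerate(probs):
            grad[i] += (p - (1.0 if i == label else 0.0)) / len(labels)
    return [z - lr * g for z, g in zip(logits, grad)]

# Toy "dataset" of next-token labels: token 0 is 60%, token 1 is 30%, token 2 is 10%.
data = [0] * 60 + [1] * 30 + [2] * 10

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # start with no preference among the three tokens

for step in range(2000):
    batch = rng.sample(data, 20)  # 20 positions contribute to each update
    logits = batch_update(logits, batch)

print([round(p, 2) for p in softmax(logits)])
```

After enough updates the printed probabilities should sit close to the token frequencies in the data, with some jitter from the random batches.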
- Input: a variable-length window of 0 up to a maximum number of tokens (e.g. 8,000); that cap is the context length.
- Longer context costs more: GPT-2 had only 1,024 tokens, modern models reach hundreds of thousands, and cost grows rapidly with length.
- Output: a probability distribution over the whole vocabulary (100,277 tokens), one probability per possible next token, not a single word.
- Parameters start out randomly initialized, so early predictions are near-random and improve only as updates accumulate.
- Training = adjusting parameters so the correct next token's probability rises and the rest fall, done in parallel over batches of windows across a massive number of tokens.
📝 Chapter Quiz
When the network predicts the next token, what does it output?
What exactly does one training update do?
Which statement about context length is correct?
Why are the network's output probabilities almost random early in training?