Chapter 5

Inside the Neural Network: The Transformer

In the last chapter we treated the network as a black box and looked only at its inputs and outputs. Now we open the box. The input tokens are mixed with billions of parameters (weights) inside one giant mathematical expression; the parameters start out random, and training slowly adjusts them. We meet the Transformer architecture, see that its 'neurons' are extremely simple and memoryless, and understand why what really matters is just 'a fixed function from input to output.'

Recall the setup: the input is a sequence of tokens (maybe 4 tokens, maybe as many as about 8,000), and the output is a probability distribution over the next token. Once the input tokens enter the network, they are mixed with the network's parameters (also called weights) inside one giant mathematical expression. Modern networks typically have billions of these parameters. At the very start the parameters are set completely at random, so the network's early predictions are completely random too.
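To make "random parameters give random predictions" concrete, here is a minimal sketch. It is not a Transformer: it assumes a hypothetical 4-token vocabulary and a toy model whose only parameters are one weight per (previous token, next token) pair. The names `VOCAB`, `init_params`, and `predict` are invented for this illustration.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "mat"]  # hypothetical 4-token vocabulary

def init_params(seed):
    """One random weight per (previous token, next token) pair."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in VOCAB] for _ in VOCAB]

def predict(params, last_token_id):
    """Turn the weight row of the last token into next-token probabilities."""
    logits = params[last_token_id]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two different random initializations give two unrelated predictions
# for the same input token.
for seed in (0, 1):
    probs = predict(init_params(seed), VOCAB.index("cat"))
    print(f"seed {seed}:", [round(p, 3) for p in probs])
```

Each run with a fixed seed prints a valid probability distribution, but nothing about it reflects any data yet: the numbers depend only on the random draw.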

💡 Picture the parameters as a row of knobs on a DJ mixing board. For any token sequence you feed in, the current knob settings produce some set of predictions. Training the network essentially means twiddling these knobs until you find a setting whose predictions are consistent with the statistics of the training set.

So what does this 'giant mathematical expression' look like? A modern network's expression might have trillions of terms, but don't be intimidated: it is fundamentally simple. It mixes the inputs x (say x1, x2) with the weights (w0, w1, w2, ...), and that 'mixing' is nothing more than basic operations such as multiplication, addition, exponentiation, and division. Designing these expressions is the job of the research field called neural-network architecture, whose goal is to make them expressive, easy to optimize, and parallelizable. In the end, though, they are not complicated: they blend the inputs with the parameters to produce predictions.
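As a hedged illustration of such an expression, the sketch below mixes two inputs with three weights using only the four operations named above. The particular formula (a weighted sum passed through a logistic function) is an assumption chosen for the example, not the expression any specific network uses, and `tiny_expression` is a made-up name.

```python
import math

def tiny_expression(x1, x2, w0, w1, w2):
    """Mix two inputs with three weights using only *, +, exp, and /."""
    s = w0 + w1 * x1 + w2 * x2           # multiplication and addition
    return 1.0 / (1.0 + math.exp(-s))    # exponentiation and division

# Here s = 0.0 + 0.5 * 1.0 + (-0.25) * 2.0 = 0.0, so the result is 1 / (1 + 1).
print(tiny_expression(1.0, 2.0, w0=0.0, w1=0.5, w2=-0.25))  # 0.5
```

A real network chains an enormous number of terms like this one, but each individual step stays this elementary.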

The networks actually used in production have a specific structure called the Transformer. As a small example, one visualized Transformer has roughly 85,000 parameters (modern production models have tens of billions to trillions). Data enters at the top as a token sequence, flows all the way through the network, and reaches the output: the prediction of which token comes next, obtained by passing the logits through a softmax. Along the way it goes through a sequence of transformations that produce many intermediate values. The diagram below shows this data flow.
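The final step, turning logits into a probability distribution with a softmax, can be sketched in a few lines. The three logit values are made up for the example; a real model produces one logit per token in its vocabulary.

```python
import math

def softmax(logits):
    """Exponentiate each logit and divide by the sum, so the results add up to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                      # hypothetical raw scores for 3 tokens
probs = softmax(logits)
print([round(p, 3) for p in probs])           # approximately [0.665, 0.245, 0.09]
```

Larger logits get larger probabilities, and the outputs always sum to 1, which is what lets us read them as a prediction over the next token.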

Data flow in a Transformer: token embedding → attention → MLP → softmax prediction of the next token

Following the pipeline: first, each token is 'embedded' into a vector. That is, every possible token in the vocabulary corresponds to a vector inside the network, a so-called distributed representation. Those vectors then flow through a series of transformations: layer norms, matrix multiplications, softmaxes, and so on. They pass through the Transformer's attention block first, then the information flows into the multi-layer perceptron (MLP) block, and so on, layer by layer. The numbers in the diagram are the intermediate values of the expression; you can almost think of them as the firing rates of these synthetic neurons.
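The pipeline described above can be sketched end to end in plain Python. This is a toy under stated assumptions, not the visualized model: the sizes (`VOCAB_SIZE = 5`, `D_MODEL = 4`), the single attention head, the single layer, the ReLU nonlinearity, and the residual additions are all choices made for this illustration, and every weight is random.

```python
import math
import random

VOCAB_SIZE, D_MODEL = 5, 4            # hypothetical toy sizes
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# All parameters of the toy model, randomly initialized.
W_EMBED = rand_matrix(VOCAB_SIZE, D_MODEL)          # one vector per vocabulary token
W_Q, W_K, W_V = (rand_matrix(D_MODEL, D_MODEL) for _ in range(3))
W_MLP_IN = rand_matrix(D_MODEL, 2 * D_MODEL)
W_MLP_OUT = rand_matrix(2 * D_MODEL, D_MODEL)
W_UNEMBED = rand_matrix(D_MODEL, VOCAB_SIZE)

def matvec(v, m):
    """Row vector v (length n) times matrix m (n x k)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def attention(xs):
    """Single-head causal self-attention: position t only sees positions <= t."""
    normed = [layer_norm(x) for x in xs]
    qs = [matvec(x, W_Q) for x in normed]
    ks = [matvec(x, W_K) for x in normed]
    vs = [matvec(x, W_V) for x in normed]
    out = []
    for t, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(D_MODEL)
                  for s in range(t + 1)]
        weights = softmax(scores)
        mixed = [sum(w * vs[s][j] for s, w in enumerate(weights))
                 for j in range(D_MODEL)]
        out.append([x + m for x, m in zip(xs[t], mixed)])   # residual add (assumption)
    return out

def mlp(xs):
    out = []
    for x in xs:
        hidden = [max(0.0, h) for h in matvec(layer_norm(x), W_MLP_IN)]  # ReLU (assumption)
        out.append([a + b for a, b in zip(x, matvec(hidden, W_MLP_OUT))])
    return out

def next_token_probs(token_ids):
    xs = [W_EMBED[t] for t in token_ids]             # token embedding
    xs = mlp(attention(xs))                          # attention block, then MLP block
    logits = matvec(layer_norm(xs[-1]), W_UNEMBED)   # logits from the last position
    return softmax(logits)                           # next-token distribution

probs = next_token_probs([0, 3, 1])
print([round(p, 3) for p in probs])
```

A production model stacks many such attention and MLP blocks and uses far larger vectors, but the shape of the computation is the same: embed, transform, then softmax over the vocabulary.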

⚠️ Don't picture these 'neurons' as too much like the ones in your brain. Biological neurons are complex dynamical processes that have memory and evolve over time; the neurons here are extremely simple. The whole expression is a fixed mathematical function from input to output, with no memory: it is completely stateless. At most, treat 'synthetic brain tissue' as a loose analogy, not something to take literally.

The precise mathematical details of these transformations don't matter much, and we don't need to dig into them. What truly matters is just this: it is a mathematical function, parameterized by some fixed number of parameters (say 85,000), that transforms inputs into outputs. As we twiddle these parameters we get different predictions, and the goal of training is to find a good setting of the parameters so that the predictions line up with the patterns seen in the training set. That, in essence, is the Transformer.
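The "twiddle the parameters until the predictions match the training set" idea can be sketched literally. The example below is an assumption-laden toy: the data is a made-up repeating pattern of token ids, the model is a 3 x 3 table of weights rather than a Transformer, and the knobs are adjusted by random trial and error. Real training uses gradient-based optimization instead of random twiddling; the random search here only stands in for it to keep the code short.

```python
import math
import random

DATA = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]   # hypothetical token ids: 0 -> 1 -> 2 -> 0 ...
VOCAB_SIZE = 3

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def loss(params):
    """Average negative log-probability assigned to each actual next token."""
    total = 0.0
    for prev, nxt in zip(DATA, DATA[1:]):
        total -= math.log(softmax(params[prev])[nxt])
    return total / (len(DATA) - 1)

def train(steps=2000, seed=0):
    rng = random.Random(seed)
    # Start from completely random knob settings.
    params = [[rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
              for _ in range(VOCAB_SIZE)]
    initial = best = loss(params)
    for _ in range(steps):
        i, j = rng.randrange(VOCAB_SIZE), rng.randrange(VOCAB_SIZE)
        old = params[i][j]
        params[i][j] = old + rng.gauss(0.0, 0.3)   # twiddle one knob
        new = loss(params)
        if new < best:
            best = new                             # keep the improvement
        else:
            params[i][j] = old                     # undo the change
    return params, initial, best

params, initial_loss, final_loss = train()
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print("P(next | token 0):", [round(p, 2) for p in softmax(params[0])])
```

After training, the function itself is unchanged; only the parameter values differ, and the prediction after token 0 now favors token 1, matching the pattern in the data.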

•The input tokens are mixed with the network's parameters (weights) in one giant mathematical expression; modern networks have billions to trillions of parameters.
•Parameters start out random, so early predictions are random; training = twiddling knobs to find a setting consistent with the training-set statistics.
•The underlying math isn't scary: it is just a very large combination of basic operations (multiply, add, exponentiate, divide).
•The production network's structure is the Transformer: token embedding → attention → MLP → softmax over the next token.
•Its 'neurons' are extremely simple, memoryless, and fully stateless: a fixed input-to-output function, not biological neurons.
•What really matters: it is a function parameterized by a fixed set of parameters, and our job is to find good parameters.

📝 Chapter Quiz

What does the analogy 'parameters are like knobs on a DJ mixing board' mainly illustrate?

What is the rough order of the 'data flow' inside a Transformer?

Why shouldn't the network's 'neurons' be thought of as biological brain neurons?

Which statement about the network's underlying math is most accurate?