Chapter 7

Hands-On: Training GPT-2

This chapter ties training and inference together with one concrete example: OpenAI's GPT-2, released in 2019. It was the first time a recognizably modern stack came together: 1.6 billion parameters, a context length of 1,024 tokens, and about 100 billion training tokens, all small by today's standards. We look at the number researchers watch during training, the loss; at why the cost of training fell from about $40,000 in 2019 to roughly $100 to $600 today; and at why GPUs became a gold rush.

Let's ground the earlier concepts in a concrete model: OpenAI's GPT-2. GPT stands for 'Generative Pre-trained Transformer,' and GPT-2, released in 2019, is the second iteration of the series. When you use ChatGPT today, the model behind it is GPT-4, the fourth iteration. I like GPT-2 as an example because it is the first time a recognizably modern stack came together: every piece of GPT-2 is still recognizable by today's standards, and everything has simply gotten bigger.

A few numbers are worth remembering. GPT-2 is a Transformer neural network, the same kind of network used today. It has 1.6 billion parameters; modern Transformers are closer to hundreds of billions or even a trillion. Its maximum context length is 1,024 tokens, so when the model predicts the next token it can look back at no more than 1,024 tokens of history. Today's context lengths are closer to hundreds of thousands of tokens, or even a million, and more context generally means better predictions. GPT-2 was trained on about 100 billion tokens. The FineWeb dataset we saw earlier has 15 trillion tokens, so 100 billion is quite small by comparison.

What does a researcher actually watch during training? Every line that scrolls by on screen is one update to the model: the network's parameters are nudged slightly so that it gets better at predicting the next token in a sequence. Each update works on about 1 million tokens at once: we pull roughly 1 million tokens from the dataset and simultaneously improve the prediction of the token that comes next at each of those positions. Every step is one such update to the network.
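To make "one update" concrete, here is a minimal sketch of a single gradient-descent step on a toy next-token predictor. Everything in it is an assumption made for illustration: the three-token vocabulary, the table of logits standing in for the network, the four-example batch, and the learning rate. GPT-2 itself is a Transformer trained on about 1 million tokens per step, but the shape of the step is the same: measure the loss on a batch, nudge the parameters, and the loss on that batch drops.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def batch_loss(logits_table, batch):
    """Average negative log-probability of the correct next token."""
    total = 0.0
    for prev, nxt in batch:
        probs = softmax(logits_table[prev])
        total -= math.log(probs[nxt])
    return total / len(batch)

def update_step(logits_table, batch, lr=0.5):
    """One gradient-descent update over the whole batch (modifies the table in place)."""
    grads = [[0.0] * len(row) for row in logits_table]
    for prev, nxt in batch:
        probs = softmax(logits_table[prev])
        for k, p in enumerate(probs):
            # Gradient of the loss with respect to logit k: p_k - 1[k == nxt]
            grads[prev][k] += (p - (1.0 if k == nxt else 0.0)) / len(batch)
    for i, row in enumerate(grads):
        for k, g in enumerate(row):
            logits_table[i][k] -= lr * g

# Hypothetical toy data: (previous token id, next token id) pairs
batch = [(0, 1), (1, 2), (2, 0), (0, 1)]
# Hypothetical "network": one row of logits per previous token, starting uniform
params = [[0.0] * 3 for _ in range(3)]

loss_before = batch_loss(params, batch)
update_step(params, batch)
loss_after = batch_loss(params, batch)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The key design point is that the gradient is averaged over every example in the batch before the parameters move, which is what "improving the prediction on all of those tokens at once" means in practice.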

The number to watch closely is the loss. The loss is a single number that tells you how well the network is currently performing, and it is designed so that lower is better. As updates accumulate, you watch the loss go down, which corresponds to better and better next-token predictions. As a researcher, you stare at this number, sipping coffee and twiddling your thumbs, making sure that the loss keeps improving with each update and the network keeps getting better. The demo below shows how the loss curve falls as training steps accumulate.
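The text does not say how the loss is computed. The standard choice for next-token prediction is cross-entropy, the average negative log-probability the model assigns to the token that actually came next, and the sketch below assumes that definition. The probability tables are made up for illustration. The vocabulary size of 50,257 is GPT-2's tokenizer size, used here only to show roughly where an untrained model would start.

```python
import math

def next_token_loss(predicted_probs, targets):
    """Average negative log-probability assigned to each correct next token."""
    total = sum(math.log(probs[t]) for probs, t in zip(predicted_probs, targets))
    return -total / len(targets)

# A model that spreads probability evenly over V tokens scores ln(V) everywhere.
V = 50257  # GPT-2 tokenizer vocabulary size
uniform_loss = math.log(V)  # a little under 11

# Hypothetical predictions over a 4-token vocabulary at three positions
confident = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
    [0.02, 0.03, 0.05, 0.90],
]
unsure = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
targets = [0, 1, 3]  # index of the token that actually came next

loss_confident = next_token_loss(confident, targets)
loss_unsure = next_token_loss(unsure, targets)
print(f"confident: {loss_confident:.3f}, unsure: {loss_unsure:.3f}")
```

Putting more probability on the correct token lowers the loss, which is why a falling loss and better predictions are the same observation.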

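📉 Interactive demo (not included here): the training process, in which the loss falls and the samples go from gibberish to fluent text.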

💡 Early in training, say only 20 steps in, the samples the model periodically generates look like completely random gibberish: the parameters have been updated only 20 times, so the network is still nearly random. By about 1% of the way through, the generated text starts to show a little local coherence, and once all of the roughly 32,000 steps have run, the model produces fairly fluent English. The loss going down and the text getting better are two sides of the same thing.
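The milestones in this note can be turned into token counts with a little arithmetic. The sketch below uses only the chapter's own round figures (about 32,000 steps and about 1 million tokens per step), so the results are approximate. Note that the product, roughly 32 billion tokens, is smaller than the 100 billion quoted earlier for GPT-2; the text does not reconcile the two figures, so the step count may describe a different run.

```python
TOTAL_STEPS = 32_000         # "about 32,000 steps", from the text
TOKENS_PER_STEP = 1_000_000  # "about 1 million tokens", from the text

def tokens_seen(steps, tokens_per_step=TOKENS_PER_STEP):
    """Tokens processed after a given number of update steps."""
    return steps * tokens_per_step

steps_at_one_percent = TOTAL_STEPS // 100

print(f"after 20 steps:   {tokens_seen(20):,} tokens")
print(f"at 1% ({steps_at_one_percent} steps): {tokens_seen(steps_at_one_percent):,} tokens")
print(f"full run:         {tokens_seen(TOTAL_STEPS):,} tokens")
```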

Now about cost. Training GPT-2 in 2019 was estimated to cost about $40,000. Today you can do far better: roughly one day and about $600 to reproduce it, and that is without trying very hard. With real optimization effort you could push it down to around $100. Why has the cost dropped so much? First, datasets have gotten much better: the way we filter, extract, and prepare data is far more refined. The bigger difference, though, is hardware, because computers have gotten dramatically faster. The software that squeezes every bit of speed out of that hardware has also matured a lot, since everyone is focused on making these models run as fast as possible.
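The $600 figure is consistent with the rental numbers quoted later in this chapter: an 8-GPU node at roughly $3 per GPU per hour, running for about a day. The sketch below just multiplies those approximate figures; the function name is made up for illustration, and real prices and run times vary.

```python
GPUS_PER_NODE = 8          # one 8xH100 node, from the text
PRICE_PER_GPU_HOUR = 3.0   # USD, approximate on-demand price from the text
HOURS = 24                 # "roughly one day", from the text
COST_2019 = 40_000         # USD, estimated 2019 cost from the text

def rental_cost(gpus, price_per_gpu_hour, hours):
    """Total rental cost in USD for a fixed number of GPUs and hours."""
    return gpus * price_per_gpu_hour * hours

cost_today = rental_cost(GPUS_PER_NODE, PRICE_PER_GPU_HOUR, HOURS)
ratio = COST_2019 / cost_today

print(f"one day on one node: ${cost_today:,.0f}")
print(f"2019 estimate is about {ratio:.0f}x higher")
```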

This kind of training can't run on a laptop: the network is too large and there is too much data. It runs on computers in the cloud, typically an '8×H100 node,' a single machine packed with 8 H100 GPUs. You rent such a machine from a cloud provider (the speaker likes Lambda, but many companies offer this) at an on-demand price of roughly $3 per GPU per hour. GPUs are a perfect fit for training neural networks because the computation is enormously expensive yet highly parallel: many independent workers can run at the same time to carry out the matrix multiplications under the hood.
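Why is matrix multiplication so parallel? Each entry of the output depends only on one row of the first matrix and one column of the second, so no worker has to wait for another. The sketch below illustrates that independence by giving each output row to a separate worker. It is only an illustration: the function names and the tiny matrices are made up, and Python threads will not make this faster, whereas a GPU runs thousands of such independent computations in hardware.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(row, col):
    """Dot product of one row of A with one column of B."""
    return sum(a * b for a, b in zip(row, col))

def matmul_row(row, b_cols):
    """One worker's job: a full output row, computed from its own inputs only."""
    return [dot(row, col) for col in b_cols]

def parallel_matmul(A, B, workers=4):
    """Multiply A by B, handing each output row to an independent worker."""
    b_cols = list(zip(*B))  # columns of B
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so rows come back in place
        return list(pool.map(lambda row: matmul_row(row, b_cols), A))

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
C = parallel_matmul(A, B)
print(C)
```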

📝 One GPU becomes eight in a node, and many nodes stack into an entire data center. All the big tech companies desperately want GPUs to train these language models, which is fundamentally why Nvidia's market capitalization has climbed into the trillions of dollars and why the company 'exploded.' This is the gold rush: get enough GPUs and have them all collaborate to repeatedly predict the next token on a dataset like FineWeb. The more GPUs you have, the more tokens you can try to predict and improve on, the faster you iterate, and the bigger a network you can train.

• GPT-2 (OpenAI, 2019) is the first time a recognizably modern stack came together; GPT = Generative Pre-trained Transformer.
• Scale: 1.6B parameters, context length 1,024, ~100B tokens, all small by today's standards.
• Each training step simultaneously improves the next-token prediction on about 1 million tokens.
• Loss is a single number, and lower is better; researchers watch it fall, which corresponds to better predictions and more fluent text.
• Cost: ~$40k in 2019, ~$100–600 today, thanks to better data, faster hardware/GPUs, and better software.
• Training runs on cloud-rented 8×H100 nodes (~$3/GPU/hour); GPUs are the gold rush and the reason Nvidia exploded.

📝 Chapter Quiz

Why does the author especially like using GPT-2 as the teaching example?

What is the 'loss' in training, and how should you read it?

Why did GPT-2's training cost fall from about $40k in 2019 to roughly $100–600 today?

Why are GPUs especially suited to training neural networks, and how does that relate to Nvidia's 'explosion'?
