LLM Full-Stack Learning
Chapter 9

From Pretraining to Post-Training

A short bridging chapter. First, a recap: pretraining breaks internet documents into tokens and trains a neural network to predict the next token in each sequence, yielding a base model, which is a token-level internet-document simulator. What we want, though, is an assistant: something we can ask questions and get answers from. So we hand the base model to a second stage, post-training. Post-training is computationally far cheaper and shorter than pretraining (hours versus about three months) and uses essentially the same training algorithm; it just swaps the dataset from internet documents to conversations. Later chapters unpack conversation data, SFT, and RLHF.

Let's zoom out and take in the whole picture so far. Our goal is to train an LLM assistant like ChatGPT. Up to now we have covered only the first stage, pretraining: take internet documents, break them into tokens (small atomic chunks of text), and train a neural network to predict token sequences. The product of this stage is the base model, that is, the set of trained network parameters.

[Figure: the two sequential stages of building a large model. Pretraining: 15 trillion tokens of internet text; thousands of GPUs for about 3 months; yields the base model (a token simulator). Post-training: conversation data with SFT / RLHF; far less compute, roughly a few hours; yields the assistant (a model that converses).]
Two stages: pretraining yields a base model (internet simulator); post-training turns it into an assistant.

This base model is essentially a token-level internet-document simulator: it generates token sequences whose statistical properties resemble those of internet documents. We also saw that it is useful in some applications, and that with clever prompting it can even loosely play the part of an assistant. But that is not good enough. What we really want is an assistant: ask it a question and it gives an answer, rather than continuing a web document.
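To make "clever prompting" concrete, here is a minimal sketch of one common trick: dress the question up as a document that already looks like a Q&A transcript, so that the most plausible continuation is an answer. The `Human:` and `Assistant:` labels, the header sentence, and the function name `few_shot_prompt` are illustrative assumptions, not a format this chapter prescribes.

```python
def few_shot_prompt(examples, question):
    """Render Q&A pairs as a transcript-like document that ends mid-turn.

    A base model only continues text, so the prompt stops right where
    the assistant's next answer would begin.
    """
    lines = [
        "Below is a conversation between a human and a helpful AI assistant.",
        "",
    ]
    for q, a in examples:
        lines.append(f"Human: {q}")
        lines.append(f"Assistant: {a}")
        lines.append("")
    lines.append(f"Human: {question}")
    lines.append("Assistant:")  # left open for the model to continue
    return "\n".join(lines)


prompt = few_shot_prompt(
    [
        ("What is the capital of France?", "Paris."),
        ("How many legs does a spider have?", "Eight."),
    ],
    "Why is the sky blue?",
)
print(prompt)
```

A base model asked to continue this text is likely to write something answer-shaped after the final `Assistant:`, but nothing forces it to stop there or to stay in role, which is part of why prompting alone is not good enough.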

So we enter the second stage: post-training. We take the base model (that internet-document simulator) and hand it to a post-training process that refines it into an assistant that genuinely answers questions.

💡 Key intuition: post-training is computationally far cheaper and much shorter than pretraining. The heavy lifting (millions of dollars, entire data centers) happens almost entirely in pretraining, which may run for about three months; post-training needs a much smaller dataset and much less time, often just a few hours. It is cheap, yet still extremely important: it is what turns an LLM "simulator" into a genuinely usable assistant.
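For a rough sense of scale, the back-of-the-envelope calculation below compares the two durations. The specific inputs are assumptions: "about three months" is read as 90 days and "a few hours" as 3 hours. It compares wall-clock time only and says nothing about how many GPUs each stage uses.

```python
def wall_clock_ratio(pretraining_days, post_training_hours):
    """How many times longer pretraining runs than post-training, by wall clock."""
    return (pretraining_days * 24) / post_training_hours


# Assumed readings of the chapter's rough figures:
# "about three months" -> 90 days, "a few hours" -> 3 hours.
ratio = wall_clock_ratio(pretraining_days=90, post_training_hours=3)
print(f"pretraining runs about {ratio:.0f}x longer")  # prints "pretraining runs about 720x longer"
```

Under these assumptions the gap is already several hundred times in duration alone, before accounting for the smaller dataset and smaller compute budget of post-training.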

Perhaps surprisingly, post-training uses essentially the same training algorithm as pretraining. What changes is not how we train but what data we train on: we swap the dataset from internet documents to conversations. In other words, instead of having the model sample internet documents, we teach it to produce answers when faced with questions. The next chapter starts from this conversation data.
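The point that only the data changes can be sketched in a few lines. The toy below is an assumption-heavy stand-in for a real LLM: a bigram table instead of a neural network, whitespace-split words instead of real tokens, and made-up `<user>`, `<assistant>`, and `<end>` markers instead of a real conversation format. What it shares with the claim above is the structure: one `train` function, minimizing next-token cross-entropy, called twice with different data.

```python
import math
import random


def build_vocab(*texts):
    """Map every whitespace-separated word to an integer id."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}


def encode(text, vocab):
    return [vocab[w] for w in text.split()]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def train(logits, token_ids, steps, lr=0.5, seed=0):
    """The single training loop used by BOTH stages: SGD on next-token
    cross-entropy. logits[a][b] scores token b as the successor of token a."""
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(token_ids) - 1)
        prev, target = token_ids[i], token_ids[i + 1]
        probs = softmax(logits[prev])
        for j, p in enumerate(probs):
            # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
            logits[prev][j] -= lr * (p - (1.0 if j == target else 0.0))


def mean_loss(logits, token_ids):
    pairs = list(zip(token_ids, token_ids[1:]))
    return sum(-math.log(softmax(logits[a])[b]) for a, b in pairs) / len(pairs)


# Toy datasets (invented for illustration).
documents = "the cat sat on the mat . the dog sat on the rug ."
conversations = (
    "<user> what is 2+2 ? <assistant> 4 <end> "
    "<user> what is 3+3 ? <assistant> 6 <end>"
)

vocab = build_vocab(documents, conversations)
logits = [[0.0] * len(vocab) for _ in vocab]

# Stage 1, "pretraining": long run on document-like text.
train(logits, encode(documents, vocab), steps=3000)
loss_before = mean_loss(logits, encode(conversations, vocab))

# Stage 2, "post-training": same function, shorter run, conversation data.
train(logits, encode(conversations, vocab), steps=300)
loss_after = mean_loss(logits, encode(conversations, vocab))

print(f"conversation loss before post-training: {loss_before:.3f}")
print(f"conversation loss after post-training:  {loss_after:.3f}")
```

Before the second call, the conversation-side rows of the table are untouched, so the loss on conversations sits at the uniform-guess level; after a much shorter second run it drops. Real post-training differs enormously in scale and detail, but the shape is the same: keep the loop, swap the dataset.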

📝 Roadmap for the next few chapters: first, what post-training's conversation data looks like and how questions and answers are organized into multi-turn conversations; then supervised fine-tuning (SFT), which teaches the model to play the assistant using high-quality human-written conversations; and finally reinforcement learning from human feedback (RLHF), which further aligns the model's behavior with human preferences.
  • Recap: pretraining = break internet documents into tokens and predict sequences with a network, producing a base model (its parameters).
  • The base model is a token-level internet-document simulator, but we want a question-answering assistant.
  • So we hand the base model to a second stage, post-training, which refines it into an assistant.
  • Post-training is computationally far cheaper and shorter than pretraining: hours versus about three months, with a much smaller dataset.
  • The training algorithm is essentially unchanged; what changes is the dataset: from internet documents to conversations.
  • Coming chapters: conversation data → supervised fine-tuning (SFT) → reinforcement learning from human feedback (RLHF).

📝 Chapter quiz

Why do we still need post-training after obtaining a base model?

How does post-training's compute cost compare to pretraining's?

Going from pretraining to post-training, what mainly changes?