Chapter 1

Introduction: Taking ChatGPT Apart

Build a mental model of what LLMs are: where they shine, where they fail, and which sharp edges to watch for.

This series is a comprehensive, general-audience introduction to large language models such as ChatGPT. The goal is to give you a mental model of what happens behind the text box. You type something and press Enter: how are the words that come back generated, and what exactly are you talking to?

We'll walk through the entire pipeline for building such a system while keeping it accessible to general readers. The tool is magical and powerful in some respects, weak in others, and has many sharp edges to be aware of.

A 'mental model' isn't about memorizing parameters or formulas. It's an operational picture in your head: where the data comes from, how text is turned into symbols, what the network is predicting, and what training is adjusting. With that picture, when the model gives you an answer you can roughly infer why it said it: whether it is echoing patterns it saw in vast amounts of text, or confabulating a fact it doesn't actually know.

•Building an LLM proceeds in sequential stages: pretraining → post-training.
•The core of the whole process is 'predict the next token'.
•Understanding its capabilities and limits matters more than marveling at it.
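
To make 'predict the next token' concrete, here is a minimal sketch that uses simple counting instead of a neural network. The corpus, the use of whole words as tokens, and the function names are all assumptions made for illustration; a real LLM learns its predictions with a large network and looks at far more context than one previous token.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; whole words stand in for tokens in this sketch.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# "Training": count how often each token follows each other token.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_token_probs(prev):
    """Return a probability for every token seen right after `prev`."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def generate(start, steps):
    """Greedy generation: repeatedly append the most likely next token."""
    out = [start]
    for _ in range(steps):
        probs = next_token_probs(out[-1])
        if not probs:  # nothing ever followed this token in the corpus
            break
        out.append(max(probs, key=probs.get))
    return out

print(next_token_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(" ".join(generate("the", 3)))
```

The output of the predictor is a probability for each candidate token, and generating text is just that prediction step repeated in a loop.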

The two sequential stages of building an LLM

A useful intuition: the model is a lossy compression of the internet. We squeeze tens of terabytes of text into a fixed-size network. It can't store every word, only the statistical regularities, common phrasings, and frequently repeated knowledge. So it behaves more like a hazy recollection of the internet than a database you can query verbatim. It remembers common things well; rare details may be compressed away, and the model may fill the gap with a plausible-sounding version.
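
The lossy-compression intuition can be mimicked with a toy, under loud assumptions: the 'internet' below is invented, and the 'model' is a lookup table that keeps only the most frequent follower of each word. A real network does not work like this table and conditions on much more context, so treat this purely as an analogy for how a rare detail can be squeezed out and replaced by the common pattern.

```python
from collections import Counter, defaultdict

# Made-up "internet": one fact repeated many times, one rare fact stated once.
documents = (
    ["the capital of france is paris"] * 50
    + ["the capital of palau is ngerulmud"]
)

# "Training": count which word follows each word across all documents.
counts = defaultdict(Counter)
for doc in documents:
    words = doc.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

# "Compression": keep only the single most frequent follower per word.
memory = {prev: c.most_common(1)[0][0] for prev, c in counts.items()}

def complete(prompt):
    """Continue the prompt by one word using only the compressed memory."""
    last = prompt.split()[-1]
    return memory.get(last, "<unknown>")

original_words = sum(len(d.split()) for d in documents)
print(original_words, "words compressed into", len(memory), "entries")
print(complete("the capital of france is"))  # paris (common fact survives)
print(complete("the capital of palau is"))   # paris (rare fact was lost)
```

The second completion is fluent and plausible but wrong, which is the toy version of the model filling a gap with the pattern it saw most often.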

💡 Treat this course as an exercise in reverse engineering: we're not learning how to use ChatGPT, but working out how it is built, so that we know when to trust it and when to doubt it.

'Sharp edges' means that a model that seems clever in most situations can suddenly fail on tasks that look trivial. A classic example is counting letters: ask it how many times a certain letter appears in a word and it often gets it wrong. The reason isn't that it's dumb; it's that it never sees individual letters. As the later chapter on tokenization explains, it sees text chopped into multi-character tokens, not single characters. Knowing where the edges are lets you avoid them or rephrase the question.
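
The letter-counting problem is easier to see with a toy tokenizer. The vocabulary and the greedy longest-match rule below are hypothetical, chosen only for illustration; real tokenizers learn their vocabulary from data and may split the same word differently.

```python
# Hypothetical vocabulary of text chunks; real tokenizers learn theirs from data.
VOCAB = ["straw", "berry", "st", "raw", "er", "ry"]

def tokenize(text):
    """Greedy longest-match split; falls back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = max(
            (v for v in VOCAB if text.startswith(v, i)),
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens

def to_ids(tokens):
    """Map each distinct chunk to an integer ID, in order of first appearance."""
    ids = {}
    return [ids.setdefault(t, len(ids)) for t in tokens]

word = "strawberry"
tokens = tokenize(word)
print(tokens)           # ['straw', 'berry']: chunks, not letters
print(to_ids(tokens))   # [0, 1]: the kind of input a model receives
print(word.count("r"))  # 3: trivial when you can see the characters
```

With character access the count is a one-liner. A model that receives only the chunk IDs has no direct view of the letters inside each chunk.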

📝 Roadmap for this series. We follow the real order in which a model is built: pretraining data (download and process the internet) → tokenization (text becomes tokens) → neural network inputs and outputs → inference (making the model generate text) → post-training (becoming a conversational assistant) → reinforcement learning (learning to reason and stay aligned). Each chapter maps to one real stage of the pipeline.

📝 Chapter quiz

What are the two main stages of building an LLM, in order?

What is the main point of describing the model as a 'lossy compression' of the internet?

What is the root cause of the model often miscounting how many times a letter appears in a word?