Base Models: Internet Text Simulators
What you get after pretraining is a base model: a token-level simulator of internet documents, not an assistant. This section looks at what "releasing a model" actually means (code plus parameters), uses Meta's Llama 3.1 405B base as the running example, and explores its characteristic behaviors: verbatim regurgitation, confident hallucination, in-context (few-shot) learning from examples in the prompt, and even coaxing a base model into acting like an assistant with a carefully built conversation prompt. It closes with the view of the parameters as a lossy compression of the internet.
The product of the entire pretraining stage is the base model, that is, the trained set of parameters. A base model is not very useful on its own, because it is only a token-level internet text simulator: it produces "remixes" of web pages, as if dreaming up page text. What we actually want is an assistant, something we can ask a question and get an answer from. A base model does not do that; it simply continues text according to the statistics of its training documents, which are mostly web pages.
So what does "releasing a model" mean concretely? You need two things. The first is the code: usually some Python that spells out the sequence of operations the model performs, i.e. the forward pass of the Transformer from the previous chapter. This is ordinary computer code, typically only a few hundred lines, fairly readable and fairly standard. The second, and the truly valuable part, is the parameters: the actual numbers of the network. A model such as GPT-2 has about 1.5 billion of them, and they must be set to a correct (or at least very good) set of values, so a release ships those values, which are essentially one long list of numbers.
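The split between code and parameters can be sketched with a toy stand-in. This is a minimal illustration, not how any real release is packaged: `forward`, the two made-up parameter values, the JSON file format, and the 2-bytes-per-value storage assumption are all hypothetical choices made for the example.

```python
import json


def forward(params, x):
    """The 'code' half of a release: a fixed sequence of operations.

    Toy stand-in for a Transformer forward pass: a single linear map.
    """
    weight, bias = params
    return weight * x + bias


# The 'parameters' half: nothing but numbers, shipped next to the code.
params = [2.0, -1.0]                 # made-up values for illustration
shipped = json.dumps(params)         # what a release would write to disk
restored = json.loads(shipped)       # what a user would load back

# Same code + same numbers gives the same model behavior.
output = forward(restored, 3.0)


def checkpoint_size_gb(num_params, bytes_per_param=2):
    """Rough parameter-file size in GB, assuming 2 bytes (16 bits) per value."""
    return num_params * bytes_per_param / 1e9


# About 1.5 billion values, as quoted for GPT-2 in the text.
gpt2_size_gb = checkpoint_size_gb(1_500_000_000)
```

The code is tiny and generic, while the list of numbers is where all the training compute ended up, which is why the parameters are the valuable half of a release.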
GPT-2 is fairly old, so let's switch to a bigger, more modern example: Llama 3, trained and released by Meta. To recap, GPT-2 has about 1.5 billion parameters and was trained on roughly 100 billion tokens. Llama 3 is far larger: its biggest base model, Llama 3.1 405B, has 405 billion parameters and was trained on 15 trillion tokens, using the same recipe at a much larger scale. One thing to confirm first: this base model is not yet an assistant. Ask it "What is 2 + 2?" and it will not reply "It's 4. Anything else I can help with?", because the question is simply tokenized into a prefix, and all the model does next is compute a probability for each possible next token. In essence it is a very expensive autocomplete that continues the prefix according to the statistical patterns it saw in its training documents (web pages). It is also stochastic: each token is sampled from that distribution, so the same prefix usually produces a different continuation every time.
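The stochastic autocomplete step can be sketched as sampling from a next-token distribution. The probabilities below are made up for illustration and do not come from any real model; `sample_next_token` is a hypothetical helper name.

```python
import random

# Hypothetical next-token probabilities for the prefix "What is 2 + 2?"
# (made-up numbers; a real model assigns a probability to every vocabulary token).
NEXT_TOKEN_PROBS = {" 4": 0.55, " ?": 0.20, " four": 0.15, " 5": 0.10}


def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in NEXT_TOKEN_PROBS}
```

Likely tokens are drawn most often, but unlikely ones still appear from time to time, which is why two runs on the same prefix can diverge after only a few tokens.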
Even though a base model is not good enough for many applications on its own, it is very valuable: by learning to predict the next token, it has absorbed a great deal of knowledge about the world and compressed it into the network's parameters. In other words, those 405 billion parameters are a kind of compression of the internet, a bit like a zip file, except that the compression is lossy rather than lossless: what we get is a blurry gestalt of the internet that we can generate from. The right prompt can elicit the knowledge stored in the parameters. Prime the model with an opener such as "Here is my top-10 list of the landmarks to see in Paris:" and it will continue the list.
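A back-of-the-envelope calculation gives a feel for why the compression has to be lossy. The token and parameter counts are the ones quoted in the text; the 4 bytes of text per token and 2 bytes per parameter are rough assumptions made only for this sketch.

```python
TRAINING_TOKENS = 15_000_000_000_000   # 15 trillion tokens (from the text)
NUM_PARAMS = 405_000_000_000           # 405 billion parameters (from the text)

BYTES_PER_TOKEN = 4   # assumption: a token covers roughly 4 bytes of text
BYTES_PER_PARAM = 2   # assumption: each parameter is stored in 16 bits

# How many training tokens each parameter has to "account for".
tokens_per_param = TRAINING_TOKENS / NUM_PARAMS

# Rough size of the training text versus the size of the parameters.
text_bytes = TRAINING_TOKENS * BYTES_PER_TOKEN
param_bytes = NUM_PARAMS * BYTES_PER_PARAM
compression_ratio = text_bytes / param_bytes

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"text is about {compression_ratio:.0f}x larger than the parameters")
```

Under these assumptions the training text is tens of times larger than the parameters, so the network cannot store it verbatim and has to keep a blurred summary instead.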
Now for a few characteristic behaviors. The first is regurgitation: paste in the first sentence of Wikipedia's "Zebra" article and the model recites a long stretch of the entry almost word for word, purely from memory. This happens because high-quality sources such as Wikipedia tend to be sampled preferentially and repeatedly during training (the model may have seen the page around 10 times), much as you can recite a passage after reading it many times, except that the model memorizes far more efficiently. It eventually drifts away from the original because it cannot remember everything exactly. Regurgitation is usually not what we want.
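One simple way to quantify regurgitation is to count how many tokens of a continuation match the source before the first divergence. The sketch below uses whitespace-split words as stand-in tokens and placeholder sentences rather than real Wikipedia text or real model output; `verbatim_prefix_len` is a hypothetical helper.

```python
def verbatim_prefix_len(reference_tokens, generated_tokens):
    """Number of leading tokens that match exactly before the texts diverge."""
    matched = 0
    for ref, gen in zip(reference_tokens, generated_tokens):
        if ref != gen:
            break
        matched += 1
    return matched


# Placeholder texts: 'reference' plays the role of the source article,
# 'generated' the role of a model continuation that drifts partway through.
reference = "the quick brown fox jumps over the lazy dog".split()
generated = "the quick brown fox leaps over a sleepy dog".split()

match_len = verbatim_prefix_len(reference, generated)
print(f"{match_len} of {len(reference)} tokens reproduced verbatim before drifting")
```

A long verbatim prefix followed by a divergence is the signature described above: recitation from memory that eventually drifts.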
The second is hallucination. Llama 3.1's training data has a knowledge cutoff at the end of 2023, so the model never saw the result of the 2024 election. If you nevertheless prime it with tokens "from the future" (for example, "The Republican Trump ... president of the United States from 2017 ..."), it carries on and guesses from the knowledge in its parameters. One sample says the running mate was Mike Pence and the opponent was Hillary Clinton; resample, and the running mate becomes Ron DeSantis, running against Biden and Harris. Each sample is its own "parallel universe". That is hallucination: the model is just giving its best guess, probabilistically.
The third is in-context (few-shot) learning. Give the model a few-shot prompt: 10 pairs in the form "English word: Korean translation", followed by "teacher:" for it to complete with a few tokens. As it reads the context, the model picks up on the fly that there is an algorithmic pattern here and continues it, taking on the role of a translator and producing the correct translation. This ability to learn a pattern while reading the prompt is called in-context learning, and a prompt built from such a sequence of examples is called a few-shot prompt. The demo below lets you try the base model's behavior as a token autocomplete for yourself.
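A few-shot prompt of this shape is just a string of example lines ending in an unfinished one. The sketch below builds such a string; the word pairs are illustrative (and fewer than the 10 mentioned above), and `build_few_shot_prompt` is a hypothetical helper, not part of any library.

```python
def build_few_shot_prompt(examples, query):
    """Format 'word: translation' lines, then leave the query line open."""
    lines = [f"{english}: {korean}" for english, korean in examples]
    lines.append(f"{query}:")  # the model is expected to complete this line
    return "\n".join(lines)


# Illustrative English/Korean pairs.
examples = [
    ("butterfly", "나비"),
    ("ocean", "바다"),
    ("mountain", "산"),
    ("thunder", "천둥"),
    ("freedom", "자유"),
    ("umbrella", "우산"),
]

prompt = build_few_shot_prompt(examples, "teacher")
print(prompt)
```

The prompt contains no instruction to translate. The only signal is the repeated pattern, which the base model continues when it completes the final line.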
🔤 Tokenizer demo (real GPT-4 cl100k_base): an interactive widget that shows each token and its ID; changing letter case or adding spaces changes how the text is split.
- Base model = the output of pretraining = a token-level internet-document simulator, not yet an assistant.
- Releasing a model = code (a few hundred lines of forward pass) + parameters (the actual numbers, e.g. GPT-2's roughly 1.5 billion values).
- Example: Meta's Llama 3.1 405B base, with 405 billion parameters, trained on 15 trillion tokens.
- The parameters are a lossy compression of the internet; the knowledge you elicit is vague and probabilistic. Things that appear often are more reliable, but none of it should be trusted blindly.
- Characteristic behaviors: (1) regurgitation (high-quality sources are sampled repeatedly); (2) hallucination (confidently guessing about things it never saw).
- (3) In-context / few-shot learning: picking up a pattern from examples in the prompt; (4) a conversational few-shot prompt can turn a base model into an assistant.
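The last behavior in the list, turning a base model into an assistant with a conversational few-shot prompt, can be sketched as plain string construction. Everything here is an assumption made for illustration: the preamble wording, the `Human:` / `Assistant:` labels, and the idea of cutting the raw continuation where the model starts writing the next human turn.

```python
def build_chat_prompt(
    turns,
    question,
    preamble="The following is a conversation between a helpful AI assistant and a human.",
):
    """Make the prompt look like a document in which an assistant answers questions."""
    lines = [preamble, ""]
    for human, assistant in turns:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"Human: {question}")
    lines.append("Assistant:")  # left open for the base model to continue
    return "\n".join(lines)


def first_assistant_turn(completion):
    """Keep only the text before the model begins the next 'Human:' turn."""
    return completion.split("\nHuman:", 1)[0].strip()


# Made-up example turns that establish the pattern.
turns = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How many days are in a week?", "There are seven days in a week."),
]
prompt = build_chat_prompt(turns, "What is 2 + 2?")

# Hypothetical raw continuation from a base model, which keeps writing the dialogue.
raw_completion = " 2 + 2 equals 4.\nHuman: Thanks!\nAssistant: You're welcome."
answer = first_assistant_turn(raw_completion)
```

The model is still only continuing a document. The prompt makes the most likely continuation an assistant-style answer, and the truncation step discards the extra dialogue the model would otherwise go on to invent.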
📝 Chapter quiz
What is the most accurate description of a "base model"?
What two things does "releasing a model" typically include?
Which pair correctly matches "regurgitation" and "hallucination"?
What does "in-context (few-shot) learning" refer to?