LLM Full-Stack Learning
Chapter 3

Tokenization: How Text Becomes Tokens

Neural networks only accept a one-dimensional sequence drawn from a finite set of symbols. This chapter looks at why we don't feed raw bits, how we start from bytes instead, and how byte pair encoding (BPE) trades sequence length against vocabulary size, ending up at GPT-4's vocabulary of roughly 100,000 tokens.

Before feeding text into a neural network, we must decide how to represent it. These networks expect a one-dimensional sequence of symbols, and the set of possible symbols must be finite. So the question becomes: what are our symbols? Text is already a one-dimensional sequence (left to right, top to bottom). Inside a computer, UTF-8 encoding turns it into a long string of bits, with only two possible symbols: 0 and 1.

The catch: sequence length is a precious, finite resource inside a neural network. We don't want an extremely long sequence built from only two symbols. So we trade the number of symbols (the vocabulary size) against sequence length: we would rather have more symbols in exchange for a shorter sequence. The most naive compression is to pack every 8 consecutive bits into one byte. Eight bits have 2^8 = 256 possible combinations, so the sequence immediately shrinks to one eighth of its length, at the cost of growing the symbol set from 2 to 256 (the values 0–255).
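As a minimal sketch of the two views described above, the snippet below encodes a short string as UTF-8, expands it into individual bits, and then packs the bits back into bytes. The sample string and the helper name `pack_bits` are illustrative choices, not part of the original chapter.

```python
text = "hello world"

# UTF-8 turns the text into bytes; each byte is 8 bits.
data = text.encode("utf-8")

# View 1: one symbol per bit, vocabulary {0, 1}.
bits = [int(b) for byte in data for b in format(byte, "08b")]

# View 2: one symbol per byte, vocabulary {0, ..., 255}.
byte_ids = list(data)

def pack_bits(bits):
    """Group every 8 consecutive bits into one integer in 0..255."""
    return [
        int("".join(str(b) for b in bits[i:i + 8]), 2)
        for i in range(0, len(bits), 8)
    ]

print(len(bits))      # 88 symbols when each symbol is a bit
print(len(byte_ids))  # 11 symbols when each symbol is a byte
print(pack_bits(bits) == byte_ids)  # True: packing bits recovers the bytes
```

The same information is carried either way; only the number of distinct symbols and the length of the sequence change.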

💡 Key intuition: think of these values as 256 unique emoji, not as numbers. The same holds for token IDs in general: an ID such as 15339 is just a unique label, and its numeric magnitude means nothing. 15339 is not "bigger" or "more" than 15338. Picturing the text as a string of emoji is closer to the truth than picturing it as a string of numbers.
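One way to see that the integers are only labels is to relabel them arbitrarily and check that nothing is lost. The sketch below is an illustration under assumptions: it shuffles which integer names which byte value (the fixed seed and the names `relabel` and `inverse` are hypothetical) and then maps the sequence back.

```python
import random

byte_ids = list("hello world".encode("utf-8"))

# A hypothetical relabeling: shuffle which integer names which symbol.
rng = random.Random(0)
relabel = list(range(256))
rng.shuffle(relabel)
inverse = {new: old for old, new in enumerate(relabel)}

relabeled = [relabel[i] for i in byte_ids]
recovered = [inverse[i] for i in relabeled]

print(byte_ids)
print(relabeled)              # different integers, same pattern of repeats
print(recovered == byte_ids)  # True: the relabeling lost no information
```

What survives the relabeling is which positions hold the same symbol, and that is the only thing the IDs encode.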

For state-of-the-art models we want to shrink the sequence further in exchange for more symbols, and the way to do that is to run Byte Pair Encoding (BPE). The idea: find an adjacent pair of symbols that occurs very often in the sequence, for example the byte 116 followed by the byte 32. Merge that pair into a brand-new symbol by minting a new ID (256 is the first unused one, since bytes occupy 0–255), and rewrite every occurrence of the pair as the new symbol. This can be repeated as many times as we like. Each newly minted symbol makes the sequence a bit shorter and the vocabulary one symbol larger.
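The merge loop described above can be sketched in a few lines. This is a toy illustration under assumptions, not the procedure used to build any production tokenizer: the function names (`pair_counts`, `merge`, `train_bpe`) and the sample text are hypothetical, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter

def pair_counts(ids):
    """Count how often each adjacent pair of symbols occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both symbols of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Run up to `num_merges` rounds of BPE over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: symbols 0..255
    merges = {}                       # (left, right) -> newly minted ID
    for _ in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break  # fewer than two symbols left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + len(merges)          # next unused ID
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

text = "the cat sat on the mat that the rat ate "
ids, merges = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(merges)
```

Each round adds exactly one entry to `merges`, so the vocabulary size after training is 256 plus the number of merges performed.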


In practice a good vocabulary size is around 100,000 symbols. Specifically, GPT-4 uses 100,277 symbols (its tokenizer is called cl100k_base). The whole process of converting raw text into these symbols, which we call tokens, is tokenization. The figure below illustrates the "bits → bytes → BPE" path: each step to the right makes the sequence shorter and the vocabulary larger.

[Figure: Bits → bytes → BPE tokens: the trade-off between sequence length and vocabulary size. Bits: 2 symbols, extremely long sequence. Bytes: 256 symbols, 8 times shorter. BPE tokens: about 100,277 symbols, shortest sequence. Sequence lengths for the same text are schematic.]

Now try it hands-on with the real GPT-4 tokenizer. Type some text into the interactive tokenizer demo and watch it get split into tokens, each with an ID. Compare for yourself: "hello world" is two tokens, but "helloworld", a capitalized "Hello world", or an extra space in the middle all tokenize differently. The token count changes, and so do the individual tokens.
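The same comparison can be run from code. The sketch below assumes the third-party `tiktoken` package is installed and that it can fetch the cl100k_base encoding file on first use. The helper `show_tokens` is a hypothetical name, and it returns `None` when the package or the encoding file is unavailable.

```python
def show_tokens(text, encoding_name="cl100k_base"):
    """Return [(token_id, token_bytes), ...], or None if tiktoken is unavailable."""
    try:
        import tiktoken  # third-party package, assumed to be installed
        enc = tiktoken.get_encoding(encoding_name)
    except Exception:
        # Missing package, or the encoding file could not be loaded.
        return None
    return [(t, enc.decode_single_token_bytes(t)) for t in enc.encode(text)]

for sample in ["hello world", "helloworld", "Hello world", "hello  world"]:
    print(repr(sample), show_tokens(sample))
```

Printing the raw bytes of each token makes it visible that a leading space usually belongs to the token that follows it, which is why spacing changes the split.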

⚠️ Tokenization is sensitive to case and spaces, and this is one root cause of the odd behavior models sometimes show on spelling and character-counting tasks. The model does not see individual characters; it sees text chunked into tokens, and it struggles to "see inside" a token to count its letters. So next time it miscounts the letters in a word, don't be too surprised: it is not stupidity, it is a side effect of tokenization.
  • Neural networks want a one-dimensional sequence over a finite symbol set.
  • A raw-bit sequence (2 symbols) is too long; packing bits into bytes gives 256 symbols and a sequence 8 times shorter.
  • BPE repeatedly merges frequent symbol pairs to shorten the sequence further and grow the vocabulary.
  • The core trade-off: a larger vocabulary gives a shorter sequence, and a smaller vocabulary gives a longer one.
  • Treat token IDs as unique labels (like emoji), not as numbers with magnitude.
  • GPT-4's vocabulary has 100,277 tokens (cl100k_base); case and spaces both change the split.

📝 Chapter Quiz

Why don't we feed the raw bit (0/1) sequence of a text directly into the neural network?

What is the core operation of Byte Pair Encoding (BPE)?

Why should a token ID be thought of as a unique label or emoji rather than a number?

How does tokenization relate to the model occasionally failing at spelling or character-counting tasks?