Chapter 14

Jagged Intelligence

'Jagged intelligence' refers to the fact that models can be superhuman at some tasks yet fail at things a child finds trivial. Classic examples: 'is 9.11 bigger than 9.9?' (often answered wrongly) and 'how many r's in strawberry' (the model miscounts the letters). There are several roots: tokenization (the model sees tokens, which are chunks of text, not individual characters), finite computation per token, and quirks in the training data (for example, 9.11 reportedly reminds the model of Bible verse numbers). Practical takeaway: don't assume uniform competence across tasks; verify extra carefully near these 'sharp edges,' and use tools where appropriate.

Before wrapping up this 'LLM psychology' section, one more class of phenomena deserves mention: these models have 'jagged edges.' Some of them we have already explained, and they make sense (spelling, counting). Others will leave you scratching your head even if you understand deeply how the models work. People call this 'jagged intelligence': a model can solve complex math problems and answer PhD-level physics, chemistry, and biology questions (often better than I can), yet trip over some very simple ones.

First, two examples tied directly to tokenization and counting. The first is spelling tasks. The model sees tokens, not characters: its entire world is made of these little chunks of text, and it does not see individual letters the way our eyes do. So a character-level task such as 'print every third character of ubiquitous' often fails. In the tokenizer, ubiquitous is split into 3 tokens, and the model sees only those token IDs, so it cannot easily index into 'the third letter.' We can do it easily because the letters sit right there in our visual working memory.
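To make the gap between tokens and characters concrete, here is a minimal Python sketch. The three-piece split and the integer IDs below are made up for illustration; they are not the output of any real tokenizer. The point is only that the task is trivial once the characters are addressable, while a sequence of IDs carries no explicit letter positions.

```python
# Hypothetical token pieces and IDs, for illustration only
# (NOT the output of any real tokenizer).
TOY_VOCAB = {"ub": 101, "iqu": 102, "itous": 103}


def toy_encode(pieces):
    """Map each text piece to its made-up integer ID."""
    return [TOY_VOCAB[piece] for piece in pieces]


def every_third_char(word):
    """Return every third character, starting with the first."""
    return word[::3]


word = "ubiquitous"
pieces = ["ub", "iqu", "itous"]   # assumed split; the pieces join back to the word
token_ids = toy_encode(pieces)    # roughly what a model "sees": three opaque IDs

print(token_ids)                  # [101, 102, 103]
print(every_third_char(word))     # uqts
```

With character access the answer is a one-line slice; from the IDs alone, the letters inside each piece would have to be recalled rather than read off.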

The second is the famous 'how many r's in strawberry.' For a long time, state-of-the-art models insisted there were only 2 r's, which caused a stir: how can a model solve math-Olympiad problems yet miscount the r's in strawberry? The answer is the combination of the two points set up earlier. First, the model sees tokens, not characters. Second, the model is not good at counting: doing the count within a single token is, once again, asking one token to carry too much computation. The difficulty of seeing characters and the difficulty of counting stack up to produce this classic failure. Models now mostly get it right (possibly because the answer was deliberately 'hardcoded'), but the underlying mechanism is unchanged.
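One way to sidestep both difficulties is to hand the count to code, where each character is directly addressable. A minimal sketch follows; the helper name `count_letter` is our own and does not come from the source.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # 3
```

This is the kind of step a model can delegate when it is allowed to use a code tool instead of counting "in its head."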

Now an example that remains a head-scratcher even if you understand the mechanics: 'is 9.11 bigger than 9.9?' The model often gets it wrong, and will argue its case with a straight face. The behavior is also not very reproducible: sometimes the answer is right, sometimes wrong, and sometimes the model corrects itself midway and then flips back. The strange part is that it can handle Olympiad-grade problems yet fails at this grade-school question.

Some people have studied this in depth. Reportedly, when you inspect the activations inside the network and look at which features or neurons light up, a group of neurons usually associated with Bible verses turns on. In chapter-and-verse numbering, 9.11 comes after 9.9, so in that sense it is 'later' or 'greater.' The model seems to be 'reminded' that the input looks like verse markers, where 9.11 is 'greater,' and gets strongly distracted. Even as it tries to reason with math at the same time, it still arrives at the wrong answer. This is not fully understood to this day.
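The two readings of '9.11 vs 9.9' can be shown side by side. The sketch below only illustrates why the string is ambiguous between a decimal number and a chapter.verse style marker; it makes no claim about what actually happens inside the network. The function names are our own.

```python
from decimal import Decimal


def greater_as_numbers(a: str, b: str) -> bool:
    """True if a > b when both strings are read as decimal numbers."""
    return Decimal(a) > Decimal(b)


def later_as_sections(a: str, b: str) -> bool:
    """True if a comes after b when both are read as dot-separated
    section markers (e.g. chapter.verse), compared part by part."""
    def as_tuple(marker: str):
        return tuple(int(part) for part in marker.split("."))
    return as_tuple(a) > as_tuple(b)


print(greater_as_numbers("9.11", "9.9"))  # False: 9.11 < 9.90
print(later_as_sections("9.11", "9.9"))   # True: verse 11 follows verse 9
```

The same characters give opposite answers depending on which convention is applied, which is why pinning down the intended reading (here, decimal numbers) matters before trusting an answer.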

💡 The roots of 'jagged intelligence' boil down to a few: (1) tokenization: the model sees tokens, not individual characters, so spelling and other character-level tasks start at a disadvantage; (2) finite computation per token: tasks like counting, when crammed into a single token, are error-prone; (3) quirks in the training data: for example, 9.11 reportedly reminds the model of Bible chapter-and-verse numbers and distracts it. The 'peaks' and 'valleys' of competence are not evenly distributed.

⚠️ Practical takeaway: don't assume uniform competence across tasks. Treat the model as what it is: a remarkable but not fully trustworthy stochastic system. Verify extra carefully near the 'sharp edges' (spelling, counting, comparing number magnitudes, exact arithmetic), and use tools where appropriate (for example, ask it to 'use code'). Use it as a tool, rather than handing it a problem to 'just let rip on' and copy-pasting the results uncritically.
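As a rough sketch of "verify near the sharp edges," the snippet below recomputes a few answers in code and flags where a model's answers disagree. No real model is called: the `model_answers` values are hard-coded, hypothetical outputs, and the question labels and the `find_mismatches` helper are our own naming.

```python
from decimal import Decimal

# Ground truth computed by code for a few "sharp edge" questions.
GROUND_TRUTH = {
    "r's in strawberry": str("strawberry".count("r")),
    "is 9.11 > 9.9": str(Decimal("9.11") > Decimal("9.9")),
    "127 * 389": str(127 * 389),
}


def find_mismatches(model_answers):
    """Return the questions where the given answer differs from the computed one."""
    return [
        question
        for question, truth in GROUND_TRUTH.items()
        if model_answers.get(question) != truth
    ]


# Hypothetical model outputs, hard-coded for illustration only.
model_answers = {
    "r's in strawberry": "2",
    "is 9.11 > 9.9": "True",
    "127 * 389": "49403",
}

print(find_mismatches(model_answers))
```

In this made-up run the arithmetic answer passes while the letter count and the magnitude comparison are flagged, so those are the answers to recheck rather than copy over.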

•Jagged intelligence: a model can be superhuman at some tasks yet fail at things a child finds trivial.
•Classic examples: 'is 9.11 bigger than 9.9' (often answered wrongly) and 'how many r's in strawberry' (the model miscounts the letters).
•Root one: tokenization. The model sees tokens, not individual characters, so spelling and character-level tasks start at a disadvantage.
•Root two: finite computation per token. Tasks like counting, when crammed into a single token, are error-prone.
•Root three: training-data quirks. For example, 9.11 reportedly reminds the model of Bible chapter-and-verse numbers and distracts it.
•Practical: don't assume uniform competence; verify near the sharp edges, and use tools (e.g. 'use code') where appropriate.

📝 Chapter quiz

What does 'jagged intelligence' refer to?

Why does the model often miscount 'how many r's in strawberry'?

Given these 'sharp edges,' what is the chapter's practical advice?