LLM Full-Stack Learning
Chapter 2

Pretraining Data: Feeding the Internet to the Model

The first step of pretraining is to download and process the internet. This chapter traces the journey from Common Crawl's raw web pages, through layers of filtering, to a high-quality corpus like FineWeb: roughly 44TB and about 15 trillion tokens.

Pretraining has several sequential stages, and the very first step of the first stage is to download and process the internet. The goal is to gather a huge quantity of text from publicly available sources. We want it to be large in volume, high in quality, and very diverse, because we want as much knowledge as possible inside the model. Achieving 'large, high-quality, and diverse' all at once is hard, so this step runs through multiple processing stages.

Why do diversity and quality both matter? Diversity sets the breadth of knowledge: if the corpus were only sports news, the model would know nothing about medicine or programming. Quality sets whether the model learns from good examples or bad ones: feed it spammy marketing copy and machine-generated gibberish, and that is what it learns to imitate. We want text that is both broad and clean, which is exactly what all the filtering steps below are trying to achieve.

Hugging Face curated and published a representative dataset called FineWeb and documented in detail how it was built. Every major provider (OpenAI, Anthropic, Google, and so on) has some internal equivalent. Notably, even though the internet is enormous, a production-grade corpus like FineWeb ends up at only about 44TB on disk, which almost fits on a single ordinary hard drive. That is because we keep only plain text and filter it very aggressively.

💡 It is counterintuitive that 'the whole internet' ends up at around 44TB: you can buy a 1TB USB stick in an ordinary store, and a few dozen of them hold the entire 'world' the model ever sees. The trick is that we keep only plain text (no images, no video) and filter out the vast majority of web pages.
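
As a rough back-of-the-envelope check on the figures quoted above, the sketch below divides the approximate disk size by the approximate token count. It assumes decimal terabytes (10^12 bytes) and treats both numbers as round estimates; the on-disk size also depends on the storage format, so the ratio is only indicative.

```python
# Back-of-the-envelope arithmetic using the approximate figures quoted above.
# Assumption: 1 TB = 10**12 bytes (decimal terabytes).
DATASET_BYTES = 44 * 10**12    # ~44 TB on disk
DATASET_TOKENS = 15 * 10**12   # ~15 trillion tokens
USB_STICK_BYTES = 1 * 10**12   # one 1 TB USB stick

bytes_per_token = DATASET_BYTES / DATASET_TOKENS
sticks_needed = -(-DATASET_BYTES // USB_STICK_BYTES)  # ceiling division

print(f"{bytes_per_token:.2f} bytes per token")  # 2.93 bytes per token
print(f"{sticks_needed} one-terabyte sticks")    # 44 one-terabyte sticks
```

So each token corresponds to only a few bytes on disk, and the 'few dozen sticks' estimate works out to 44.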

Most of the data starts from Common Crawl, an organization that has been crawling the internet since 2007; as of 2024 it had indexed around 2.7 billion web pages. It works by starting from a few seed pages and following links outward, continuously indexing what it finds and accumulating a massive amount of data over time. But Common Crawl's raw data is very messy and has to pass through a full processing pipeline of layered filters.
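
The 'start from seeds and follow links' idea can be sketched as a graph traversal. This is a toy illustration, not Common Crawl's actual crawler: the link graph `TOY_WEB`, the `crawl` function, and the choice of breadth-first order are all assumptions made for the example, and a real crawler would fetch pages over HTTP and parse links out of the HTML.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl(seeds, link_graph, max_pages=1000):
    """Breadth-first traversal from the seed pages, visiting each URL once."""
    seen = set(seeds)      # URLs already queued, so cycles do not loop forever
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)  # "index" the page
        for linked in link_graph.get(url, []):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order

pages = crawl(["a.example"], TOY_WEB)
print(pages)  # ['a.example', 'b.example', 'c.example', 'd.example']
```

The `seen` set is what keeps the crawl from revisiting pages when links form a cycle, as they do between `a.example` and `c.example` here.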

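[Figure: The processing pipeline from raw Common Crawl pages to the FineWeb corpus. Common Crawl (billions of web pages) → URL filtering (drop malicious/spam sites) → text extraction (strip HTML/CSS) → language filtering (>65% English) → deduplication (remove duplicate documents) → PII removal (remove private information) → FineWeb dataset (≈ 44TB, about 15 trillion tokens)]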
  • URL filtering: domain blocklists drop unwanted sites, such as malware, spam, marketing, adult, and hate sites. This content is low-quality or harmful; keeping it only pollutes the model.
  • Text extraction: crawled pages are raw HTML (CSS, navigation bars, lists, and other markup); heuristics pull out the actual body text and discard the page code and navigation.
  • Language filtering: a language classifier guesses each page's language. FineWeb, for example, keeps only pages that are more than 65% English. This is a design choice that directly determines the model's multilingual ability.
  • Deduplication: remove duplicate and near-duplicate content so the same article is not learned over and over and given too much weight.
  • PII removal: detect and filter personally identifiable information, such as addresses and Social Security numbers (SSNs), to protect privacy.
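
Several of the steps above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration under stated assumptions, not FineWeb's actual implementation: the blocklist `BLOCKED_DOMAINS`, the tag-skipping extractor, the single SSN regex, and all function names are hypothetical. It handles only exact duplicates after whitespace and case normalization (near-duplicate detection needs fuzzier techniques), it drops a page outright when PII is found rather than redacting it, and it leaves out language filtering because that needs a trained classifier.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; real pipelines use large curated lists.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

def url_allowed(url):
    """URL filtering: reject pages whose host is on the blocklist."""
    return urlparse(url).hostname not in BLOCKED_DOMAINS

class _TextExtractor(HTMLParser):
    """Text extraction: collect visible text, skipping script/style/nav."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Simplified pattern: US Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_pii(text):
    return SSN_PATTERN.search(text) is not None

def fingerprint(text):
    """Deduplication key: hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def run_pipeline(pages):
    """pages: iterable of (url, html) pairs. Returns the texts that survive."""
    kept, seen = [], set()
    for url, html in pages:
        if not url_allowed(url):
            continue                      # URL filtering
        text = extract_text(html)         # text extraction
        if not text or contains_pii(text):
            continue                      # PII handling (drops the page)
        key = fingerprint(text)
        if key in seen:
            continue                      # exact-duplicate removal
        seen.add(key)
        kept.append(text)
    return kept

sample_pages = [
    ("https://news.example/tornado",
     "<html><head><style>p{color:red}</style></head><body>"
     "<nav>Home | About</nav><p>A tornado touched down in 2012.</p></body></html>"),
    ("https://spam.example/buy", "<p>Buy now!</p>"),
    ("https://mirror.example/tornado", "<p>A tornado  touched down in 2012.</p>"),
    ("https://forum.example/post", "<p>My SSN is 123-45-6789.</p>"),
]
print(run_pipeline(sample_pages))  # ['A tornado touched down in 2012.']
```

Of the four sample pages, one is blocked by its domain, one is a duplicate of a page already kept, and one contains an SSN, so only a single cleaned text remains. That mirrors, at toy scale, how aggressively the real pipeline shrinks the raw crawl.
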
💡 Language filtering is really a product decision. If you filter out all the Spanish, the model will have seen almost no Spanish data and will not be good at it. FineWeb focuses on English (the >65% threshold), so a model trained on it will be strong in English but not necessarily in other languages.
📝 Different companies make different choices about the language mix. Some deliberately keep large amounts of multilingual data to train models that work well across global markets; others, like FineWeb, concentrate on getting English right. There is no single right answer: it depends on who the product is meant to serve.
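
The threshold trade-off described above can be sketched as a one-line filter over classifier scores. Everything here is illustrative: `keep_english` is a hypothetical helper, and the scores in `docs` are made-up stand-ins for what a trained language classifier might output, not real measurements.

```python
ENGLISH_THRESHOLD = 0.65  # the >65% cutoff mentioned above

def keep_english(scored_docs, threshold=ENGLISH_THRESHOLD):
    """scored_docs: iterable of (text, english_score) pairs, where the score
    in [0, 1] is assumed to come from some language classifier."""
    return [text for text, score in scored_docs if score > threshold]

# Hypothetical classifier outputs, invented for illustration.
docs = [
    ("A tornado touched down near the town.", 0.98),
    ("El tornado tocó tierra cerca del pueblo.", 0.03),
    ("Menu: tacos, burritos, quesadillas", 0.55),
]

print(keep_english(docs))            # ['A tornado touched down near the town.']
print(len(keep_english(docs, 0.5)))  # 2
```

Lowering the threshold lets more mixed-language pages through, and raising it narrows the corpus toward one language, which is why the cutoff is a product choice as much as a technical one.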

After this whole pipeline we end up with roughly 44TB of plain text. Open FineWeb and you will see page after page of cleaned-up web text: a 2012 article about tornadoes, say, or a short medical explainer about the adrenal glands. Concatenating this web text gives a vast 'tapestry' of text, full of linguistic patterns. This enormous body of text is the starting point for the next step, training neural networks. We want the networks to internalize and model how this text flows and how one piece follows another.

📝 Chapter quiz

Which of the following is the raw starting point for most pretraining datasets?

In FineWeb's pipeline, what problem does the 'text extraction' step mainly solve?

Why is language filtering described as a 'product decision' rather than a purely technical step?

About how large is the final FineWeb dataset?