Deep Dive into Large Language Models

Based on Andrej Karpathy's "Deep Dive into LLMs": an interactive Chinese-language course that runs from pretraining to RLHF.

Course Chapters

01 Introduction: Taking ChatGPT Apart
Build a mental model of what LLMs are: where they shine, where they fail, and the sharp edges to watch for.

02 Pretraining Data: Feeding the Internet to the Model
The first step of pretraining is to download and process the internet. We trace the journey from Common Crawl's raw web pages, through successive layers of filtering, to a high-quality corpus like FineWeb, which is roughly 44 TB of text and about 15 trillion tokens.

03 Tokenization: How Text Becomes Tokens
Neural networks only accept a one-dimensional sequence drawn from a finite set of symbols. We look at why we can't feed raw bits directly, how we start from bytes instead, and how byte pair encoding (BPE) trades off sequence length against vocabulary size, ending up at GPT-4's vocabulary of roughly 100,000 tokens.

04 Neural Network Inputs and Outputs
Once the corpus is tokenized, we have about 15 trillion tokens. Training the network comes down to modeling how tokens follow one another. Without opening the black box yet, we look at its input (a variable-length window of tokens), its output (a probability distribution over the whole vocabulary), and how training nudges the network step by step.
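
As a taste of chapter 03, the trade-off between sequence length and vocabulary size can be sketched in a few lines of Python. This is a minimal toy version of byte pair encoding, not the tokenizer GPT-4 actually uses: the function names (`most_frequent_pair`, `merge`, `train_bpe`) and the sample string are made up for illustration, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if there is no pair."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Start from UTF-8 bytes (256 symbols) and apply up to `num_merges` merges."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # each merge adds one symbol to the vocabulary
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


# Hypothetical sample text, chosen only because it has obvious repeats.
sample = "aaabdaaabac"
ids, merges = train_bpe(sample, 3)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print("vocabulary size:", 256 + len(merges))
```

Each merge shortens the sequence and grows the vocabulary by one symbol, which is the trade-off that chapter 03 describes. A production tokenizer repeats this over a large corpus until the vocabulary reaches its target size.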