Understand how web-scale pre-training data is extracted, filtered, deduplicated, mixed, tokenized, and packed into training-ready shards, including decontamination, late-stage annealing, and synthetic-data tradeoffs.
Scaling laws gave us a target: if you want a capable dense model, you may need hundreds of billions or trillions of useful training tokens. Pre-training data pipelines decide what those tokens are before any fine-tuning or alignment happens. This chapter covers sourcing, cleaning, deduplication, filtering, and mixing at the scale where data quality becomes model behavior.
Imagine you are building a customer-support chatbot for an online marketplace. It needs to answer questions about return policies, shipping delays, and payment disputes. You can't write answers for every question in advance, so you decide to train an AI system on a large collection of support transcripts, product descriptions, and shipping documentation.
But where does that collection come from? You can't dump raw web pages into the model and expect useful behavior. Web-scale text includes ads, broken HTML, duplicate product listings, forum spam, private data, and benchmark questions. Training on unfiltered data would be like stocking a fulfillment center with unsorted returns, duplicate labels, and damaged boxes. Useful records are in there, but the line spends most of its time sorting waste unless the intake process is disciplined.
This article is about the engineering pipeline that turns raw data into clean training inventory. You'll learn how teams ingest large raw corpora, filter low-signal text, remove duplicates, scrub private information, and convert what's left into the integer tokens a model trains on.
Before we build the pipeline, recall three ideas from earlier in the curriculum:
If tokens or next-token prediction feel fuzzy, review Language Modeling & Next Tokens before continuing. Everything here assumes you understand why the model needs large amounts of varied text.
Let's start with a concrete number. Suppose you want to train a 70 billion parameter model. How many tokens should you feed it?
Researchers at DeepMind answered this with the Chinchilla scaling laws. Under a fixed compute budget, they found that model size and training-token count should grow together. A useful rule of thumb for dense decoder-only transformers is about 20 training tokens per parameter[1].
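For a 70B model:

$$70 \times 10^{9}\ \text{parameters} \times 20\ \text{tokens per parameter} = 1.4 \times 10^{12}\ \text{tokens} \approx 1.4\text{T tokens}$$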
That's roughly the entire text content of millions of books, web pages, and code repositories combined. It gives you a sense of why the pipeline needs industrial scale.
A companion formula for back-of-the-envelope planning is training FLOPs of about $C \approx 6ND$, where $N$ is a consistently counted dense model size and $D$ is the number of training tokens. It's an approximation, and details such as whether output-layer parameters are counted can move fitted optima, but it's useful for quick capacity estimates.[1][2]
Raw token counts are misleading. A smaller corpus of textbook-quality tokens can beat a much larger pile of noisy crawl text. Scaling data only helps when the extra tokens still add real information.
Meta reports pre-training Llama 3's 405B model on 15.6T tokens, about 38.5 tokens per parameter, and says that flagship configuration is approximately compute-optimal under its own fitted laws and training budget. Separately, Meta reports training its smaller models longer than compute-optimal because those models performed better at the same inference budget.[3] Treat the 20x ratio as a planning baseline, not a hard law.
```python
def dense_training_flops(parameters: float, tokens: float) -> float:
    return 6 * parameters * tokens

parameters = 70e9
target_tokens = 20 * parameters
llama_405b_ratio = 15.6e12 / 405e9

print(f"Chinchilla-style token target: {target_tokens / 1e12:.1f}T")
print(f"Dense training FLOP estimate: {dense_training_flops(parameters, target_tokens):.2e}")
print(f"Llama 3 405B reported ratio: {llama_405b_ratio:.1f} tokens/parameter")
```

```text
Chinchilla-style token target: 1.4T
Dense training FLOP estimate: 5.88e+23
Llama 3 405B reported ratio: 38.5 tokens/parameter
```

Think of the raw internet as an overloaded receiving dock. It contains valuable inventory, but it's mixed with duplicate labels, broken packaging, spam, and unsafe material. A pre-training data pipeline is the automated sorting line that extracts useful records and turns them into a training-ready mix.
Your job as the AI engineer isn't to write the text. It's to design a sorting system that processes large partitions repeatably, records each decision, and avoids throwing out high-value records.
The modern pipeline has three major phases:
Different data sources serve different roles in the training mixture. Web data gives breadth and freshness. Code is dense in syntax, structure, and exact correctness. Books and papers provide long-form coherent explanations. Carefully curated synthetic or textbook-style data can raise signal density further.
Phi-1 is a good example: its authors report strong coding benchmark performance from a 1.3B model trained on textbook-quality and synthetic data.[4]
| Source | What it contributes | Main curation question |
|---|---|---|
| Common Crawl snapshots | Broad web coverage at large scale | Which extraction, language, quality, and deduplication rules raise signal without collapsing yield? |
| Books and papers | Long-form explanations and sustained argument | Which licenses, provenance rules, and subject areas fit the intended model? |
| Wikipedia | Structured reference text across many topics and languages | Which languages and snapshots belong in the mixture? |
| Code (for example, public repositories) | Syntax, APIs, and examples where exact structure matters | Which licenses, repository-level rules, and file filters remove generated or low-value code? |
| Curated web (for example, technical forums) | Focused discussions and worked answers | How do you preserve useful niche material without replaying copied pages? |
| Synthetic data | Targeted examples for gaps that raw sources cover poorly | Does the generated slice preserve factual fidelity, diversity, and downstream quality? |
Note: Some papers disclose approximate mixtures while many don't. Llama 3 reports a final pre-training mix of roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code, and 8% multilingual data.[3] The broader pattern is that mixture weights are decisions, not raw-byte proportions.
Source quality is only half the decision. You also have to decide how often each source appears and whether that weighting stays fixed through the whole run.
Three knobs matter:
The wrong default is "sample in proportion to raw bytes." Raw crawl volume isn't the same thing as training value. The Pile is still useful as a reference because it made mixture design explicit instead of treating web text as one giant bucket.[5] Llama 3 shows a later-stage version of the same idea: even after assembling a huge corpus, training shifted toward higher-value slices during annealing.[3]
```python
reported_mix = {
    "general knowledge": 0.50,
    "math and reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}
budget = 1e12

assert abs(sum(reported_mix.values()) - 1.0) < 1e-9
for source, fraction in reported_mix.items():
    print(f"{source:<19}: {budget * fraction / 1e9:>3.0f}B tokens")
```

```text
general knowledge  : 500B tokens
math and reasoning : 250B tokens
code               : 170B tokens
multilingual       :  80B tokens
```

| Knob | What changes | Typical reason |
|---|---|---|
| Upsampling / downsampling | how often each source appears | code, math, books, or multilingual text may deserve more weight than raw size suggests |
| Curriculum schedule | what the model sees earlier vs later | keep early coverage broad, then spend late tokens on harder or cleaner data |
| Per-example weights | which records inside one source repeat more | keep rare but important domains from being drowned by generic text |
This is practical curriculum learning in pretraining. It usually isn't "easy lessons first, hard lessons later." More often it means changing source weights over time so the model sees a different mix at different stages.
Two common patterns:
Don't confuse curriculum with annealing alone. Annealing usually means late-stage learning-rate decay plus a higher-quality mix. Curriculum is broader: any deliberate schedule that changes what the model sees as training progresses.
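To make the distinction concrete, here is a minimal sketch of a curriculum expressed as a mixture schedule. It assumes a simple linear interpolation between a broad early mix and a cleaner late mix; the source names, the weights, and the `mixture_at` helper are illustrative assumptions, not a published recipe.

```python
def mixture_at(progress: float, start: dict[str, float], end: dict[str, float]) -> dict[str, float]:
    """Linearly interpolate source weights; progress runs from 0.0 to 1.0."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    return {
        source: (1 - progress) * start[source] + progress * end[source]
        for source in start
    }

# Hypothetical weights chosen only for illustration.
early_mix = {"web": 0.70, "code": 0.15, "math": 0.05, "books": 0.10}
late_mix = {"web": 0.40, "code": 0.25, "math": 0.20, "books": 0.15}

for progress in (0.0, 0.5, 1.0):
    weights = mixture_at(progress, early_mix, late_mix)
    # Interpolating two normalized mixes keeps the result normalized.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    summary = ", ".join(f"{name}={weight:.3f}" for name, weight in weights.items())
    print(f"progress {progress:.1f}: {summary}")
```

A real schedule is more likely to be piecewise (distinct stages with fixed weights) than smoothly interpolated, but the same idea applies: the sampler reads its weights from training progress instead of a constant.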
It also helps to know the public datasets that anchored a lot of later recipes:
| Dataset | What it emphasized | Why people still cite it |
|---|---|---|
| C4[6] | Aggressive heuristic cleanup of Common Crawl | Canonical example of turning noisy crawl text into a cleaner English web corpus |
| The Pile[5] | Deliberate source diversity across code, papers, books, forums, and web text | Useful contrast to pure web-crawl pipelines because mixture design is the core idea |
| FineWeb[7] | Web-scale extraction, filtering, and deduplication ablations over 96 Common Crawl snapshots | Good modern reference for how pipeline details measurably change downstream quality |
Before using expensive neural models to classify text, basic heuristic filters drop the lowest-signal material from web crawls. As seen in foundational datasets like C4 (Colossal Clean Crawled Corpus)[6], these simple rules act as a highly efficient first line of defense.
The snippet below demonstrates how these basic rules are implemented in practice. It takes a single document as input and returns True only if the document passes all checks.
```python
def load_bad_words() -> list[str]:
    # In production, this would come from a reviewed policy list.
    return ["toxic_word_1", "toxic_word_2"]

def quality_filter(doc: str) -> bool:
    """Illustrative web-text filter; thresholds require corpus-level validation."""
    words = doc.split()

    # Length filter: drop documents that are too short or likely concatenated.
    if len(words) < 50 or len(words) > 100_000:
        return False

    # Repetition filter: repeated n-grams are common in spam and boilerplate.
    for n in [2, 3, 4]:
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
        if len(ngrams) > 0:
            frac_duplicates = 1 - len(set(ngrams)) / len(ngrams)
            if frac_duplicates > 0.3:  # >30% repeated n-grams
                return False

    # Policy-list filter: check the ratio of reviewed blocklisted terms.
    bad_words = set(load_bad_words())
    bad_count = sum(1 for w in words if w.lower() in bad_words)
    if bad_count / len(words) > 0.01:
        return False

    # Alphabetic character ratio: drop documents that are mostly symbols/code.
    alpha_chars = sum(c.isalpha() for c in doc)
    if alpha_chars / len(doc) < 0.5:
        return False

    # Crude sentence check: splitting on '.' must yield at least 3 segments,
    # which means the document contains two or more periods.
    sentences = doc.split('.')
    if len(sentences) < 3:
        return False

    return True

procedure_doc = (
    "The marketplace return policy explains how damaged packages are reviewed. "
    "Customers upload photos, support agents verify the carrier scan history, "
    "and the system creates a prepaid label when the claim is eligible. "
    "This document contains specific procedural details, normal punctuation, "
    "and enough natural language context to be useful for model training. "
    "It avoids repeated boilerplate and gives support teams a clear workflow."
)

cases = {
    "procedure": (procedure_doc, True),
    "short": ("Return labels are available.", False),
    "repeated": ("sale sale sale sale sale sale sale sale sale sale " * 8, False),
    "policy-list": (procedure_doc + " toxic_word_1 toxic_word_2", False),
}

for name, (text, expected) in cases.items():
    kept = quality_filter(text)
    assert kept is expected
    print(f"{name:<11} -> {'keep' if kept else 'drop'}")
```

```text
procedure   -> keep
short       -> drop
repeated    -> drop
policy-list -> drop
```

The code intentionally uses simple thresholds so the gates are visible. C4 and FineWeb use related heuristic filtering ideas, but a real recipe must measure retention and downstream model quality on its own crawl slices before adopting thresholds.[6][7]
These heuristics usually run on general web text before the final data mix is assembled. Code corpora often use a parallel recipe, because symbol-heavy files that look low-quality to a web-text filter can still be very valuable for pre-training.
After heuristic filtering, modern pipelines use classifiers trained to distinguish high-quality educational text from low-quality web chatter:
- Language identification: a fastText classifier screens whether the document matches the intended training languages.
- Quality classification: FineWeb-Edu filters with a classifier trained on LLM-judged educational scores, DCLM trains a fastText classifier on instruction-formatted and ELI5 text and keeps only the top ~10% by score, and Meta reports separate Llama 3 classifiers for general quality, code, and reasoning signals.[7][8][3]

| Filter | Effect | Example |
|---|---|---|
| URL blocklist | Remove known low-quality or unsafe domains | Common in production pipelines |
| Language ID (fastText) | Keep target language(s) | CCNet[9], FineWeb[7] |
| Heuristic rules | Length, repetition, char ratios | C4[6], RedPajama[10], FineWeb[7] |
| Classifier | Quality score threshold | FineWeb-Edu[7], DCLM[8], Llama 3[3] |
| Perplexity filter | Remove gibberish | CCNet[9] |
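The URL blocklist row is the cheapest gate in the table, because it needs only the URL and no document text. A minimal sketch follows, assuming a small in-memory set of blocked domains; the domain names and the `is_blocked` helper are hypothetical, and a production list would be reviewed and versioned.

```python
from urllib.parse import urlparse

# Hypothetical blocklist entries used only for illustration.
BLOCKED_DOMAINS = {"spam-farm.example", "adult-site.example"}

def is_blocked(url: str) -> bool:
    """Return True when the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in BLOCKED_DOMAINS)

urls = [
    "https://docs.marketplace.example/returns",
    "https://cdn.spam-farm.example/page?id=7",
    "https://notspam-farm.example/",
]
for url in urls:
    print(f"{'drop' if is_blocked(url) else 'keep'}  {url}")
```

Matching on the full host or a dot-prefixed suffix avoids dropping unrelated domains that merely end with the same characters, such as the third URL above.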
Filtering decisions trade yield for the type of data retained. DCLM selected a fastText filter that retained its top 10% of documents after evaluating trained models, rather than assuming a score threshold was automatically better.[8]
```python
documents = [
    {"source": "proof", "tokens": 120, "score": 0.98},
    {"source": "api-doc", "tokens": 160, "score": 0.92},
    {"source": "forum", "tokens": 200, "score": 0.79},
    {"source": "news", "tokens": 240, "score": 0.63},
    {"source": "scrape", "tokens": 280, "score": 0.31},
]

threshold = 0.90
kept = [doc for doc in documents if doc["score"] >= threshold]
kept_tokens = sum(doc["tokens"] for doc in kept)
total_tokens = sum(doc["tokens"] for doc in documents)

print("kept sources:", ", ".join(doc["source"] for doc in kept))
print(f"token yield: {kept_tokens}/{total_tokens} ({kept_tokens / total_tokens:.0%})")
```

```text
kept sources: proof, api-doc
token yield: 280/1000 (28%)
```

Deduplication is a core quality step. The internet contains large amounts of duplicated content: boilerplate headers, repeated SEO text, mirrored sites, copied documentation, and reposted articles. Lee et al. found that, on the corpora and model sizes they tested, deduplication reduced emitted memorized text by about 10x and reached the same or better validation accuracy in fewer training steps.[11]
Exact deduplication takes a cryptographic hash, such as SHA-256, of each document and removes exact matches. It's fast but misses near-duplicates, such as a news article where only the timestamp or a single word was changed.
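A minimal sketch of exact-hash deduplication, using only Python's standard `hashlib`. The three sample documents and the `content_hash` helper are assumptions for illustration; the third document differs from the first only in its timestamp, so it survives the exact-match pass.

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 digest of the UTF-8 bytes; identical text gives an identical digest."""
    return hashlib.sha256(text.encode("utf8")).hexdigest()

documents = [
    ("a", "Carrier delay reported at 09:00."),
    ("b", "Carrier delay reported at 09:00."),  # byte-identical copy of "a"
    ("c", "Carrier delay reported at 09:05."),  # near-duplicate: timestamp changed
]

seen: set[str] = set()
kept: list[str] = []
for doc_id, text in documents:
    digest = content_hash(text)
    if digest in seen:
        continue  # exact duplicate of an earlier document
    seen.add(digest)
    kept.append(doc_id)

print(kept)  # ['a', 'c']
```

Whether to normalize whitespace or case before hashing is a recipe decision: normalization catches more copies but also changes what "exact" means.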
MinHash with Locality-Sensitive Hashing (LSH) is one widely used approach for fuzzy deduplication at scale.[12][11] FineWeb, for example, applied MinHash independently per crawl using word 5-grams and parameters intended to target documents with at least 75% similarity; its experiments favored per-crawl rather than global deduplication.[7]
The idea, by hand: Imagine you have three short documents:

- Doc A: "The return policy says damaged items need a prepaid return label."
- Doc B: "The return policy says damaged items require a prepaid return label."
- Doc C: "Carrier scans update delivery promises."

Doc A and Doc B share almost every word. If you break them into overlapping 2-word chunks (shingles), they look like this:

- Doc A: `["The return", "return policy", "policy says", "says damaged", ...]`
- Doc B: `["The return", "return policy", "policy says", "says damaged", ...]`

The overlap is high. Doc C shares zero shingles with A or B. MinHash turns each document into a small numeric "sketch" that estimates this overlap relationship. LSH then routes similar sketches into the same candidate set so we inspect likely duplicates rather than comparing every pair. Because LSH is approximate, a candidate still needs a final similarity decision.
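To see what the sketch is before handing it to a library, here is a small pure-Python MinHash. It is an illustration under stated assumptions, not the `datasketch` implementation: it simulates a family of hash functions by prefixing a seed to each shingle before SHA-256, and the helper names are hypothetical. The sentences are the same three used in the `datasketch` demo in this section.

```python
import hashlib

def word_shingles(text: str, size: int = 2) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def seeded_hash(shingle: str, seed: int) -> int:
    # Assumption: a seed prefix stands in for an independent hash function.
    digest = hashlib.sha256(f"{seed}:{shingle}".encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingle_set: set[str], num_hashes: int = 64) -> list[int]:
    # One slot per hash function: keep the smallest hash value over all shingles.
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]

def estimate_jaccard(left: list[int], right: list[int]) -> float:
    # The fraction of matching slots estimates the Jaccard similarity of the sets.
    return sum(a == b for a, b in zip(left, right)) / len(left)

doc_a = word_shingles("The return policy says damaged items need a prepaid return label.")
doc_b = word_shingles("The return policy says damaged items require a prepaid return label.")
doc_c = word_shingles("Carrier scans update delivery promises.")

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(f"exact A/B:     {len(doc_a & doc_b) / len(doc_a | doc_b):.2f}")
print(f"estimated A/B: {estimate_jaccard(sig_a, sig_b):.2f}")
print(f"estimated A/C: {estimate_jaccard(sig_a, sig_c):.2f}")
```

Two slots match exactly when the minimum over both documents' shingles falls on a shared shingle, which is why the match rate approximates Jaccard overlap. The estimate gets tighter as `num_hashes` grows, at the cost of a larger signature per document.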
Here is the full pipeline:
The code below shows how to build a basic MinHash and LSH pipeline using the datasketch library. For a tiny demo corpus, it builds signatures from bigram shingles, uses LSH only for candidate lookup, and verifies candidates with exact Jaccard overlap before dropping a document.
```python
import re

from datasketch import MinHash, MinHashLSH

def normalize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def shingles(text: str, shingle_size: int = 2) -> set[tuple[str, ...]]:
    tokens = normalize(text)
    if len(tokens) < shingle_size:
        raise ValueError("Document is too short for chosen shingle size.")
    return {
        tuple(tokens[i:i + shingle_size])
        for i in range(len(tokens) - shingle_size + 1)
    }

def create_minhash(items: set[tuple[str, ...]], num_perm: int = 128) -> MinHash:
    """Create MinHash signature from normalized word shingles."""
    m = MinHash(num_perm=num_perm)
    for item in items:
        shingle = " ".join(item)
        m.update(shingle.encode("utf8"))
    return m

def jaccard(left: set[tuple[str, ...]], right: set[tuple[str, ...]]) -> float:
    return len(left & right) / len(left | right)

threshold = 0.5
lsh = MinHashLSH(threshold=threshold, num_perm=128)
documents = [
    ("doc1", "The return policy says damaged items need a prepaid return label."),
    ("doc2", "The return policy says damaged items require a prepaid return label."),
    ("doc3", "Carrier scans update delivery promises."),
]

kept_shingles: dict[str, set[tuple[str, ...]]] = {}

for doc_id, text in documents:
    doc_shingles = shingles(text)
    mh = create_minhash(doc_shingles)
    candidate_scores = [
        (candidate, jaccard(doc_shingles, kept_shingles[candidate]))
        for candidate in sorted(lsh.query(mh))
    ]
    duplicate = next(
        ((candidate, score) for candidate, score in candidate_scores if score >= threshold),
        None,
    )
    if duplicate:
        print(f"{doc_id}: drop near duplicate of {duplicate[0]} ({duplicate[1]:.2f})")
    else:
        kept_shingles[doc_id] = doc_shingles
        lsh.insert(doc_id, mh)
        print(f"{doc_id}: keep")

print(f"Kept {len(kept_shingles)} out of {len(documents)} documents.")
```

```text
doc1: keep
doc2: drop near duplicate of doc1 (0.67)
doc3: keep
Kept 2 out of 3 documents.
```

Why this works: doc1 is inserted first. When doc2 arrives, LSH returns doc1 as a candidate; an exact Jaccard check then confirms sufficient overlap before the drop. doc3 has no confirmed neighbor, so it is kept. Production thresholds, shingle sizes, and whether deduplication operates per crawl or globally must be evaluated for the corpus; FineWeb's published recipe used 5-grams and a 75% target.[7]
| Method | Speed | Precision | Recall | Best For |
|---|---|---|---|---|
| Exact hash (SHA-256) | Very fast | Perfect | Low (exact only) | Removing byte-identical documents |
| MinHash + LSH | Fast candidate search | Tunable | Tunable | Large-corpus near-deduplication recipes[12][7] |
| Suffix array / suffix tree | Medium | Very High | Very High | Exact repeated spans and long substring deduplication |
| SimHash | Fast | Medium | Medium | Cheap approximate similarity when memory is extremely tight |
An important and often overlooked step is decontamination: ensuring that evaluation benchmarks don't leak into the pre-training data. If a model sees MMLU, HumanEval, or GSM8K examples during pre-training, its reported performance becomes misleading because it may be memorizing instead of generalizing.
The risk is substantial. Web crawls contain GitHub repositories with coding interview questions, educational websites with standardized test questions, and forums where users paste benchmark prompts. Because training corpora and evaluation sets often draw from the same public web, contamination is a recurring risk rather than a hypothetical edge case.
One common approach is n-gram overlap detection. For each document in the training corpus, check whether it contains spans from an evaluation dataset. The n-gram size, normalization, removal threshold, and code-specific rules must be recorded with the released corpus. GPT-3, for example, used a conservative 13-gram analysis; Llama 3 reports excluding benchmark training sets from its annealing data.[13][3]
| Check | What it catches | What it misses |
|---|---|---|
| Exact string or n-gram overlap | Copied benchmark prompts, answers, or code spans | Paraphrases and renamed identifiers |
| Fuzzy or task-specific similarity | Lightly edited versions worth inspection | Requires calibrated thresholds |
| Repository or source exclusion | Artifacts from a known evaluation source | Copies hosted elsewhere |
Exact-match filtering isn't enough. A coding problem with renamed variables or a question with shuffled answer choices can still leak. Pipelines may add fuzzy or task-specific checks, then document what they removed so benchmark-clean claims are auditable.
```python
import re

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

benchmark = "A refund is approved when the carrier confirms a damaged shipment."
candidates = {
    "direct-copy": "Internal guide: a refund is approved when the carrier confirms a damaged shipment.",
    "paraphrase": "Authorize reimbursement after shipping damage has been verified.",
}
protected = ngrams(benchmark, n=4)

for name, document in candidates.items():
    flagged = bool(protected & ngrams(document, n=4))
    print(f"{name:<11} -> {'flag' if flagged else 'not caught by exact n-grams'}")
```

```text
direct-copy -> flag
paraphrase  -> not caught by exact n-grams
```
After deduplication and decontamination, a pipeline may scrub personally identifiable information (PII) and unsafe content according to its data policy and legal obligations. A model trained on scraped identifiers can reproduce sensitive strings at inference time, creating privacy and compliance risk. FineWeb reports anonymizing email and public-IP addresses; Llama 3 reports removing domains likely to contain high volumes of PII or adult content.[7][3]
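As a rough illustration of identifier scrubbing in the spirit of the email and public-IP anonymization mentioned above, the sketch below uses two regular expressions plus Python's standard `ipaddress` module. The patterns, the placeholder strings, and the helper names are assumptions; they are not FineWeb's or Llama 3's actual rules, and regex-only scrubbing will miss many identifier formats.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_ip(match: re.Match) -> str:
    """Replace only syntactically valid, globally routable IPv4 addresses."""
    text = match.group(0)
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        return text  # looks like an IP but is not one, e.g. 999.1.1.1
    return "<IP_ADDRESS>" if address.is_global else text

def scrub_identifiers(text: str) -> str:
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub(scrub_ip, text)

sample = "Contact jane.doe@example.com from 8.8.8.8 or 192.168.1.10."
print(scrub_identifiers(sample))
```

Private-range addresses such as `192.168.1.10` are left in place here because they do not identify a host on the public internet; a stricter policy could replace them too.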
PII removal uses a layered approach:
Unsafe-content filtering is policy-specific and imperfect. FineWeb applies URL-level filtering for adult content but explicitly notes that harmful documents can remain. Llama 3 combines domain rules with dirty-word counting for adult websites that domain blocklists miss.[7][3] Whatever controls you choose, audit retained and removed samples across languages and source types instead of treating one score threshold as a universal safety boundary.
Tokenization converts raw text into integer IDs that the model can process. At the scale of trillions of tokens, tokenization itself requires significant parallel computation.
Selecting the right tokenizer involves trading off between vocabulary size, sequence length, and language support. A smaller vocabulary can force text to split into more subwords, increasing sequence length and the computational cost of attention. A larger vocabulary can improve compression when it matches the corpus, but it also requires a larger embedding matrix, which consumes memory.
Common subword-tokenizer algorithms use data-driven vocabulary construction:
| Tokenizer | Algorithm | Vocabulary | Used By |
|---|---|---|---|
| Byte-level BPE[14][15][3] | Bottom-up merging over bytes/subwords | 32K-128K+ | GPT-style models, Llama 3 |
| WordPiece[16] | Greedy subword vocabulary | ~30K | BERT |
| SentencePiece[17] | Framework for BPE or Unigram on raw text | 32K-256K | Llama 2, T5, multilingual models |
When you train a model family from scratch, the tokenizer should be trained on a representative sample of the pre-training data so it learns the most common subword patterns. Reusing a mature tokenizer can be a good engineering choice, but a mismatched tokenizer can waste context length, especially for multilingual or domain-heavy corpora. The code below uses the Hugging Face tokenizers library to initialize a byte-level Byte-Pair Encoding (BPE) tokenizer. For the demo, it trains from an in-memory sample; production pipelines train from a carefully mixed corpus sample and may target vocabularies such as 128,000 tokens.
```python
from tokenizers import Tokenizer, decoders, models, trainers, pre_tokenizers

# Initialize a Byte-Pair Encoding (BPE) tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on a representative sample of the corpus.
# Note: vocab_size=256 is smaller than the 256-byte alphabet plus the 3 special
# tokens, so this demo learns no merges; production runs use far larger targets.
trainer = trainers.BpeTrainer(
    vocab_size=256,
    special_tokens=["<s>", "</s>", "<pad>"],
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

training_sample = [
    "Customers can return damaged items with a prepaid return label.",
    "Carrier scans update delivery promises and support workflows.",
    "Los clientes pueden consultar retrasos de envio y politicas de pago.",
    "def refund_order(order_id): return create_support_ticket(order_id)",
]

tokenizer.train_from_iterator(training_sample, trainer=trainer)
encoded = tokenizer.encode("Return policy creates a support ticket.")
print(f"vocab size: {len(tokenizer.get_vocab())}, sample tokens: {len(encoded.ids)}")
```

```text
vocab size: 259, sample tokens: 40
```

In a byte-level setup, you often don't need an `<unk>` token because any UTF-8 byte sequence can be represented.
Vocabulary size is a real design decision. A larger vocabulary can improve compression when its extra tokens match the target text, but it also increases embedding-table size and memory footprint. A smaller vocabulary is lighter, but it can produce longer sequences. Meta reports that Llama 3 moved to a 128K-token vocabulary, starting from a tiktoken base plus 28K extra tokens, and improved English compression from 3.17 to 3.94 characters per token relative to Llama 2[3].
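The two sides of that tradeoff can be put into rough numbers. The sketch below assumes a hidden size of 4096, 16-bit weights, and a hypothetical corpus of one trillion characters; only the 3.17 and 3.94 characters-per-token figures come from the report cited above. It counts the input embedding table only, so an untied output layer would add a second table of the same size.

```python
def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one embedding table: one hidden-size vector per vocabulary entry."""
    return vocab_size * hidden_size

hidden_size = 4096  # assumed model width, for illustration only
for vocab_size in (32_000, 128_000):
    params = embedding_parameters(vocab_size, hidden_size)
    mebibytes = params * 2 / 2**20  # 2 bytes per parameter in 16-bit precision
    print(f"vocab {vocab_size:>7,}: {params / 1e6:.0f}M parameters, {mebibytes:.0f} MiB")

# Reported English compression before and after the vocabulary change.
chars_per_token_old, chars_per_token_new = 3.17, 3.94
corpus_chars = 1e12  # hypothetical corpus size in characters
tokens_old = corpus_chars / chars_per_token_old
tokens_new = corpus_chars / chars_per_token_new
print(f"tokens needed: {tokens_old / 1e9:.0f}B -> {tokens_new / 1e9:.0f}B "
      f"({1 - tokens_new / tokens_old:.1%} fewer)")
```

Under these assumptions the larger table costs a fixed amount of memory once, while the shorter sequences save compute on every training and inference pass over that text.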
Large corpus pipelines require distributed execution beyond demonstration scripts. DCLM reports starting with a 240T-token Common Crawl pool containing about 200B documents and 370TB of compressed text, and processing crawl data on hundreds of AWS CPU nodes.[8]
At that scale, data is partitioned for extraction and filtering, intermediate artifacts are stored durably, and every transformation needs enough metadata to reproduce which records reached training.
Deduplication changes the system shape: signatures can be computed independently, but candidate grouping needs records with related signatures to meet. Whether that step uses a global operation or a restricted partition, such as FineWeb's per-crawl recipe, changes both resource cost and retained data quality.[7]
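One way to picture how related signatures "meet" is LSH banding: each signature is cut into bands, and each band becomes a key that a distributed job could group on. The sketch below is a single-process illustration with made-up six-value signatures and a hypothetical `band_keys` helper; in a real pipeline the grouping would be a shuffle or group-by across workers, and signatures would be much longer.

```python
import hashlib
from collections import defaultdict

def band_keys(signature: list[int], rows_per_band: int) -> list[str]:
    """Turn one signature into one grouping key per band."""
    keys = []
    for start in range(0, len(signature), rows_per_band):
        band = signature[start:start + rows_per_band]
        payload = f"{start // rows_per_band}:{','.join(map(str, band))}".encode("utf8")
        keys.append(hashlib.sha256(payload).hexdigest()[:16])
    return keys

# Made-up signatures: doc1 and doc2 differ in a single position.
signatures = {
    "doc1": [3, 9, 4, 7, 1, 8],
    "doc2": [3, 9, 5, 7, 1, 8],
    "doc3": [6, 2, 0, 5, 9, 9],
}

buckets: dict[str, list[str]] = defaultdict(list)
for doc_id, signature in signatures.items():
    for key in band_keys(signature, rows_per_band=2):
        buckets[key].append(doc_id)  # stands in for a shuffle keyed on the band

candidate_groups = sorted({tuple(docs) for docs in buckets.values() if len(docs) > 1})
print(candidate_groups)  # [('doc1', 'doc2')]
```

Computing `band_keys` is independent per document, but building `buckets` requires every record with the same key to land in the same place, which is the step whose scope (per crawl or global) drives cost.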
Documents vary widely in length: an article might be thousands of tokens while a forum post is short. Padding each document to a fixed sequence length spends compute on padding tokens. Best-fit packing groups documents into fixed-length token blocks to reduce that waste.
```python
lengths = [7, 6, 5, 5, 3, 2]
capacity = 10

def best_fit_pack(items: list[int], block_size: int) -> list[list[int]]:
    bins: list[list[int]] = []
    for length in sorted(items, reverse=True):
        options = [bucket for bucket in bins if sum(bucket) + length <= block_size]
        if options:
            min(options, key=lambda bucket: block_size - sum(bucket) - length).append(length)
        else:
            bins.append([length])
    return bins

naive_slots = len(lengths) * capacity
packed = best_fit_pack(lengths, capacity)
packed_slots = len(packed) * capacity
tokens = sum(lengths)

print("packed blocks:", packed)
print(f"naive utilization: {tokens / naive_slots:.0%}")
print(f"packed utilization: {tokens / packed_slots:.0%}")
```

```text
packed blocks: [[7, 3], [6, 2], [5, 5]]
naive utilization: 47%
packed utilization: 93%
```

Packing creates another design decision: if multiple documents share one sequence, should tokens in one document attend to tokens from another? Some causal-language-model recipes concatenate documents with end-of-document separators and keep the ordinary causal mask. Others isolate documents with a block-diagonal mask so one document can't read tokens from an earlier document in the same packed block. The isolated version below prevents artificial cross-document context while retaining packing efficiency.
```python
document_ids = [0, 0, 0, 1, 1]

mask = [
    [
        int(previous <= current and document_ids[previous] == document_ids[current])
        for previous in range(len(document_ids))
    ]
    for current in range(len(document_ids))
]

for row in mask:
    print(" ".join(map(str, row)))
print("doc 1 token attends to doc 0:", bool(mask[3][2]))
```

```text
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
0 0 0 1 0
0 0 0 1 1
doc 1 token attends to doc 0: False
```

Data annealing is a late-stage recipe adjustment where the learning rate decays and the data mix shifts toward higher-value slices. It isn't a universal "final 10%" rule. Llama 3, for example, reports annealing over the final 40M tokens while upsampling very high-quality sources and excluding benchmark training sets from the annealing pool[3].
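The learning-rate half of annealing can be sketched as a schedule that holds a base rate and then decays over a final window. The linear-to-zero shape, the step counts, the base rate, and the `annealed_learning_rate` helper are all assumptions for illustration; real schedules also include warmup and often a longer decay before the annealing window.

```python
def annealed_learning_rate(step: int, total_steps: int, anneal_steps: int, base_lr: float) -> float:
    """Hold base_lr, then decay linearly to zero over the final anneal_steps."""
    anneal_start = total_steps - anneal_steps
    if step < anneal_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * (remaining / anneal_steps)

total_steps, anneal_steps, base_lr = 1000, 100, 3e-4  # hypothetical values
for step in (0, 900, 950, 1000):
    lr = annealed_learning_rate(step, total_steps, anneal_steps, base_lr)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

In a full recipe, the shift toward higher-value data would be scheduled over the same window, so the cleanest tokens arrive while updates are smallest.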
Recent model and dataset papers show that data quality and token yield must be measured together once a corpus is already large.
Phi-1 reported strong coding benchmark performance for a 1.3B model trained with textbook-quality data, worked examples, and synthetic exercises.[4] Its result motivates controlled curation experiments; it doesn't make one data recipe universal.
FineWeb scales the same idea to web-scale curation. The paper builds a 15T-token dataset from 96 Common Crawl snapshots and then extracts FineWeb-Edu, a 1.3T educational subset filtered with a classifier trained on LLM-judged educational scores. The main lesson is that pipeline design matters: per-crawl MinHash deduplication and additional filtering each improved downstream results over weaker baselines[7].
Three published open-corpus reports illustrate different tradeoffs after FineWeb. Each changes a different part of the yield-versus-quality problem.
| Corpus | Scale and scope | Key idea |
|---|---|---|
| DCLM-Baseline[8] | 3.8T-token set filtered from a 240T-token Common Crawl pool | A fastText filter retained its top 10% of documents; a 7B model trained on 2.6T tokens reached 64% 5-shot MMLU |
| Nemotron-CC[18] | 6.3T tokens: 4.4T globally deduplicated original tokens plus 1.9T synthetic tokens | Classifier ensembling and source-conditioned rephrasing target larger unique-token yield; its 1.1T high-quality subset improved MMLU by 5.6 points over DCLM in the reported 8B, 1T-token setup |
| FineWeb2[19] | 20TB and 5B documents across over 1,000 languages from almost 100 snapshots | Language-adapted filtering and dedup-informed rebalancing were evaluated across canary languages before scale-up |
These reports don't establish one universally best filter. DCLM optimized benchmark quality under its evaluation setup with aggressive retention. Nemotron-CC explicitly targeted a larger long-horizon token supply and reported both quality and quantity comparisons. FineWeb2 addresses the separate multilingual problem: thresholds, word segmentation, language identification, and deduplication behavior don't transfer cleanly from English to every language.[8][18][19]
This token-yield pressure is often called the data wall. Villalobos et al. estimate an effective stock of about 320T public human-text tokens after quality and repetition adjustments. Under their assumptions about continued dataset growth, they project full utilization between 2026 and 2032, with a median of 2028; an assumed 5x overtraining scenario shifts that projection one to two years earlier.[20] This is a scenario projection, not evidence that usable public text is already exhausted.
```python
effective_stock = 320e12
training_budgets = {
    "1T-token study": 1e12,
    "Llama 3 405B report": 15.6e12,
    "100T-token scenario": 100e12,
}

for label, tokens in training_budgets.items():
    print(f"{label:<22}: {tokens / effective_stock:>5.1%} of 320T reference stock")
print("Comparison only: it isn't a claim that stock has been consumed.")
```

```text
1T-token study        :  0.3% of 320T reference stock
Llama 3 405B report   :  4.9% of 320T reference stock
100T-token scenario   : 31.2% of 320T reference stock
Comparison only: it isn't a claim that stock has been consumed.
```

| Approach | Description | Expected Result |
|---|---|---|
| Raw Common Crawl | Minimal processing | Weakest baseline; high duplication and noise |
| Heuristic-filtered (C4-style) | Language ID + fixed quality rules | Better than raw crawl; still leaves many near-duplicates |
| Dedup + heuristic-filtered (FineWeb) | Per-crawl MinHash + additional filters | Strong public baseline; measurable downstream gains[7] |
| Classifier-filtered (DCLM-Baseline) | Model-based filtering keeps top ~10% of documents | Reported 64% 5-shot MMLU for its 7B model trained on 2.6T tokens[8] |
| Classifier ensemble + synthetic (Nemotron-CC) | Multiple classifiers plus synthetic rephrasing | Reported high-quality-subset improvement of +5.6 MMLU over DCLM in its 8B, 1T-token comparison[18] |
| Educationally curated (FineWeb-Edu / Phi-style) | Classifier or textbook-style curation plus synthetic data | Phi-1 reported strong coding benchmarks for a 1.3B model trained on textbook-quality data[4] |
Additional crawl volume and stronger curation must be compared empirically. FineWeb is a good example: in its ablations, extraction, per-crawl deduplication, and filtering choices measurably changed downstream results.[7]
Synthetic data can raise signal density, but it works best when it fills a specific gap rather than replacing the whole corpus.
Phi-1 is one clear example. Instead of trying to mimic the raw web distribution, the dataset deliberately emphasized textbook-style explanations, exercises, and synthetic examples.[4] That makes synthetic data useful when you want more reasoning traces, worked solutions, or domain-specific exemplars than the open web naturally provides.
One proposed response to token-yield pressure is synthetic expansion. Nemotron-CC generates variants from source documents, including question-answer pairs and distilled passages, and reports improvements in specific training comparisons.[18] Source conditioning gives the generated text provenance, but the authors state that they didn't verify factual accuracy or fidelity for rephrased data. Grounding therefore needs validation rather than assumption.
Synthetic text inherits teacher errors, style biases, and coverage gaps. Shumailov et al. study degradation when successive model generations train recursively on generated data, a failure mode often called model collapse.[21] This doesn't prohibit synthetic supplementation; it means mixture weight, provenance, factual fidelity, and diversity need evaluation.
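Mixture weight is easier to evaluate once you can state it as a number. The sketch below computes the synthetic share of a corpus from the Nemotron-CC token counts quoted earlier (4.4T original, 1.9T synthetic), then inverts the formula to find how many synthetic tokens fit under a chosen cap. The 10% cap and both function names are assumptions for illustration, and corpus share is not necessarily the sampling weight used during training.

```python
def synthetic_share(original_tokens, synthetic_tokens):
    """Fraction of the combined corpus that is synthetic."""
    return synthetic_tokens / (original_tokens + synthetic_tokens)


def max_synthetic_tokens(original_tokens, share_cap):
    """Largest synthetic token count that keeps the synthetic share <= share_cap."""
    if not 0.0 <= share_cap < 1.0:
        raise ValueError("share_cap must be in [0, 1)")
    # Solve s / (o + s) = cap for s.
    return original_tokens * share_cap / (1.0 - share_cap)


reported = synthetic_share(original_tokens=4.4e12, synthetic_tokens=1.9e12)
print(f"reported synthetic share: {reported:.1%}")  # 30.2%

# Hypothetical policy: cap synthetic text at 10% of the corpus.
budget = max_synthetic_tokens(original_tokens=4.4e12, share_cap=0.10)
print(f"synthetic budget under a 10% cap: {budget / 1e12:.2f}T tokens")  # 0.49T
```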
Don't conflate verified synthetic-data generation with post-training reinforcement learning. Pre-training pipelines more often use offline generation plus filtering or programmatic checks before tokens are admitted.
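As a rough illustration of offline admission, here's a sketch of a gate for generated Python snippets. The three checks (exact-duplicate hashing, a syntax parse, and a required function definition) and the function name `admit_synthetic_snippets` are assumptions chosen for the example, not a recipe from any cited report. A parse check only confirms that the text is valid Python; it says nothing about correctness or fidelity to a source.

```python
import ast
import hashlib


def admit_synthetic_snippets(candidates):
    """Return (admitted, rejections) for generated Python snippets."""
    admitted, rejections, seen = [], [], set()
    for text in candidates:
        # Check 1: drop exact duplicates of something already seen.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            rejections.append((text, "exact duplicate"))
            continue
        seen.add(digest)

        # Check 2: the snippet must at least be syntactically valid Python.
        try:
            tree = ast.parse(text)
        except (SyntaxError, ValueError):
            rejections.append((text, "does not parse"))
            continue

        # Check 3 (hypothetical policy): require at least one function definition.
        if not any(isinstance(node, ast.FunctionDef) for node in ast.walk(tree)):
            rejections.append((text, "no function definition"))
            continue

        admitted.append(text)
    return admitted, rejections


candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",  # repeated generation
    "def broken(:\n    pass\n",            # syntax error
    "x = 1\n",                             # valid, but no function
]
admitted, rejections = admit_synthetic_snippets(candidates)
print(f"admitted {len(admitted)} of {len(candidates)}")
for _, reason in rejections:
    print("rejected:", reason)
```

Keeping the rejection reasons makes the gate auditable: you can sample each bucket before deciding whether a check is too strict.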
Here are the most frequent errors teams make when building or reasoning about pre-training pipelines, with symptoms, causes, and fixes.
| Mistake | Symptom | Cause | Fix |
|---|---|---|---|
| Ignoring deduplication | Model repeats memorized boilerplate; token budget is spent on repeats | Duplicate text overweights copied passages | Evaluate dedup granularity; FineWeb selected per-crawl MinHash over global dedup[7] |
| Filtering non-English too aggressively | Model loses code or math reasoning strength | Code and math contain many universal symbols that language classifiers mislabel | Use separate recipes for code corpora and multilingual text |
| Overlooking benchmark contamination | Inflated evaluation scores | Evaluation examples leaked into training data via public sources | Specify exact-overlap and task-specific fuzzy rules; audit removals |
| Forgetting PII policy | Model outputs personal identifiers | Raw crawls contain user-generated content with personal data | Apply documented identifier/domain controls and audit a sample before training |
| Using an English-centric tokenizer for multilingual models | Poor compression and slow inference for non-English languages | Vocabulary was trained on mostly English text | Train the tokenizer on a representative multilingual sample; expect 32K-128K+ vocabularies |
| Treating synthetic data as automatically faithful | Rewrites introduce errors or narrow styles | Source-conditioned generation was admitted without validation | Measure fidelity, diversity, and downstream effects before changing mixture weight |
| Assuming one annealing schedule fits every model | Late-stage training gives inconsistent gains | Copied another team's annealing recipe without matching data mix | Treat annealing as a hyperparameter; test on a small model first |
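
The exact-overlap rule from the benchmark contamination row can be sketched in a few lines. This version lowercases text, splits it into word tokens, and flags any training document that shares a word n-gram with a benchmark item. The choice of `n = 8`, the tokenization, and the helper names are assumptions for illustration; real pipelines document their own n and add task-specific fuzzy rules on top.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams; empty if the text is shorter than n words."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n):
    """Union of all n-grams that appear in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def contamination_hits(document, index, n):
    """N-grams the document shares with the benchmark index."""
    return ngrams(document, n) & index


N = 8  # hypothetical window size
benchmark = ["What is the capital of France and which river runs through it?"]
index = build_benchmark_index(benchmark, N)

leaked = ("Quiz dump: what is the capital of France and which river "
          "runs through it? Answer: Paris, Seine.")
clean = "Our return policy allows refunds within thirty days of delivery for unused items."

for name, doc in [("leaked", leaked), ("clean", clean)]:
    hits = contamination_hits(doc, index, N)
    print(f"{name}: {len(hits)} overlapping {N}-grams")
```

Logging the matched n-grams, not just a yes/no flag, is what makes the removals auditable later.
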
| Tier | You should be able to defend |
|---|---|
| Foundational | Map the full pipeline from raw web crawl to tokenized training batches. |
| Foundational | Explain why raw token count isn't the same thing as useful training signal. |
| Intermediate | Distinguish exact hashing from MinHash/LSH near-deduplication. |
| Intermediate | Explain why source weighting and multilingual weighting should follow target capabilities instead of raw byte counts. |
| Advanced | Design multi-stage filtering with heuristics, classifiers, perplexity, PII scrub, and benchmark decontamination. |
| Advanced | Apply n-gram overlap and fuzzy matching to protect evaluation benchmarks. |
| Advanced | Explain late-stage data annealing, sequence packing, and why their recipes are model-specific. |
| Advanced | Identify when synthetic data helps and when it narrows the corpus too much. |
| Advanced | Contrast DCLM, Nemotron-CC, and FineWeb2 and explain the tradeoff each report evaluates. |
Question 1: You are building a pipeline for a 7B parameter model. Using the 20 tokens-per-parameter rule, how many tokens should you target? Answer: 7B x 20 = 140 billion tokens.
Question 2: A document in your crawl shares 85% of its 5-grams with a document already in the training set. Should you keep it? Why or why not? Answer: Treat it as a near-duplicate under a documented threshold, then retain one representative according to provenance and quality policy. Length alone isn't enough to decide which record is better.
Question 3: Your team is debating whether to spend $1 million on another 5T tokens of raw Common Crawl or on improving the classifier filter. What argument would you make? Answer: Demand a controlled comparison: measure token retention, domain coverage, and downstream performance for an improved filter against the added crawl. FineWeb and DCLM motivate the experiment, but neither proves the answer for this corpus.
Question 4: A junior engineer suggests training the tokenizer on English-only text for a model that must also handle Spanish customer support. What's the risk? Answer: Spanish text will be split into many more subword tokens, increasing sequence length, memory use, and inference cost for every Spanish query. The tokenizer should see a representative multilingual sample.
Question 5: Why is MinHash useful when exact hashing already exists? Answer: Exact hashing catches only byte-identical documents. MinHash sketches overlapping shingles, so it can find pages that are mostly the same after boilerplate, timestamps, whitespace, or a few words change.
Question 6: How would you protect a HumanEval-style benchmark from leakage? Answer: Start with exact string and n-gram checks against benchmark prompts and solutions, then add code-aware checks such as repository filtering and fuzzy matching for renamed variables or lightly edited solutions.
Question 7: What is data annealing? Answer: It's a late-stage recipe change where the learning rate falls and the data mix shifts toward higher-value slices. It isn't a universal final percentage; the right schedule depends on the model, data mix, and target capabilities.
Question 8: When does synthetic data help pre-training? Answer: It helps when it fills a concrete gap, such as worked code examples, reasoning traces, domain procedures, or rare languages. It becomes risky when one teacher model's style overwhelms broad human-written data, which can drift toward model collapse.
Question 9: What is the "data wall"? Name two pipeline techniques teams use to push past it. Answer: The data wall is a projection that training demand may approach the available stock of public human text. Villalobos et al. estimate an effective stock near 320T tokens and a median full-utilization scenario around 2028. Teams study stronger filtering, deduplicated replay, and validated source-conditioned synthetic data to change the tradeoff.
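
The contrast in Questions 2 and 5 can be checked with a small sketch. It builds word 5-gram shingles, compares an exact SHA-256 hash with exact Jaccard similarity, and then estimates that similarity from MinHash signatures. The salted BLAKE2b hash family, the 256 hash functions, and the sample texts are assumptions for illustration; production systems typically add LSH banding so they never compare all pairs. Note that "shares 85% of its 5-grams" is closer to containment, while MinHash estimates Jaccard resemblance.

```python
import hashlib
import re


def shingles(text, n=5):
    """Set of lowercase word n-gram shingles."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=256):
    """For each salted hash function, keep the minimum hash over all shingles."""
    if not shingle_set:
        raise ValueError("need at least one shingle")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


base = ("Returns are accepted within thirty days of delivery for unused items "
        "in original packaging. Refunds are issued to the original payment "
        "method after the warehouse inspects the parcel. Last updated ")
doc_a = base + "2024-01-05."
doc_b = base + "2024-03-18."  # same page, different timestamp

same_hash = hashlib.sha256(doc_a.encode()).digest() == hashlib.sha256(doc_b.encode()).digest()
exact = jaccard(shingles(doc_a), shingles(doc_b))
approx = estimate_jaccard(minhash_signature(shingles(doc_a)),
                          minhash_signature(shingles(doc_b)))

print(f"exact hashes match: {same_hash}")
print(f"exact Jaccard:      {exact:.2f}")
print(f"MinHash estimate:   {approx:.2f}")
```

Exact hashing treats the two pages as unrelated, while the signature comparison recovers that they are nearly the same document.
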
[1] Training Compute-Optimal Large Language Models
Hoffmann, J., et al. · 2022 · NeurIPS 2022
[2] Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., & Carmon, Y. · 2024
[3] The Llama 3 Herd of Models
Dubey, A., et al. · 2024 · arXiv preprint
[4] Textbooks Are All You Need
Gunasekar, S., et al. · 2023 · NeurIPS 2023
[5] The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Gao, L., et al. · 2020
[6] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Raffel, C., et al. · 2020 · JMLR
[7] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Penedo, G., et al. · 2024 · NeurIPS 2024
[8] DataComp-LM: In Search of the Next Generation of Training Sets for Language Models
Li, J., et al. · 2024
[9] CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Wenzek, G., et al. · 2019 · LREC 2020
[10] RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset
Together Computer · 2023 · GitHub
[11] Deduplicating Training Data Makes Language Models Better
Lee, K., Ippolito, D., et al. · 2022 · ACL 2022
[12] On the Resemblance and Containment of Documents
Broder, A. Z. · 1997
[13] Language Models are Few-Shot Learners
Brown, T., et al. · 2020 · NeurIPS 2020
[14] Neural Machine Translation of Rare Words with Subword Units
Sennrich, R., Haddow, B., & Birch, A. · 2016 · ACL 2016
[15] Language Models are Unsupervised Multitask Learners
Radford, A., et al. · 2019
[16] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, J., et al. · 2019 · NAACL 2019
[17] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Kudo, T. & Richardson, J. · 2018 · EMNLP 2018
[18] Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
Su, D., et al. · 2024
[19] FineWeb2: One Pipeline to Scale Them All - Adapting Pre-Training Data Processing to Every Language
Penedo, G., et al. · 2025
[20] Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data
Villalobos, P., et al. (Epoch AI) · 2022 · arXiv preprint
[21] AI models collapse when trained on recursively generated data
Shumailov, I., et al. · 2024 · Nature