Pre-training Data at Scale

Understand how web-scale pre-training data is extracted, filtered, deduplicated, mixed, tokenized, and packed into training-ready shards, including decontamination, late-stage annealing, and synthetic-data tradeoffs.

Scaling laws gave us a target: if you want a capable dense model, you may need hundreds of billions or trillions of useful training tokens. Pre-training data pipelines decide what those tokens are before any fine-tuning or alignment happens. This lesson covers sourcing, cleaning, deduplication, filtering, and mixing at the scale where data quality becomes model behavior.

A coding assistant for infrastructure teams needs public code, API docs, runbooks, incident writeups, and design notes, but you can't dump raw repositories into training. Web-scale text includes broken HTML, duplicate files, spam, private data, and benchmark questions. A useful pipeline turns that messy source lake into documented training data: extract text, filter low-signal records, remove duplicates, scrub sensitive material, and convert the retained corpus into packed token sequences.

From token budget to training inventory

Start the pipeline with three ideas from earlier in the curriculum:

Tokens: A language model doesn't read words. It reads tokens: reusable pieces such as ship, ping, punctuation marks, or byte sequences. The tokenizer converts text into integers, and the model learns to predict the next integer in a sequence.
Next-token prediction: During pre-training, the model sees a partial sentence and learns to guess what comes next. The more diverse and high-quality the sequences it sees, the broader its knowledge becomes.
Scaling laws: The previous chapter turned model size into a token target. Data pipelines turn that target into an engineering system.

If tokens or next-token prediction feel fuzzy, review Language Modeling & Next Tokens before continuing. Everything here assumes you understand why the model needs large amounts of varied text.

How much data is enough?

Start with a concrete number. Suppose you want to train a 70 billion parameter model. How many tokens should you feed it?

Hoffmann et al. studied this tradeoff in the Chinchilla scaling laws. Under their fixed-compute dense-transformer setup, model size and training-token count grew together. A useful planning rule from that study is about 20 training tokens per parameter^{[1]Reference 1Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}.

For a 70B model:

70 billion parameters × 20 tokens per parameter = 1.4 trillion tokens

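At a rough 100,000 tokens per book (an order-of-magnitude assumption), 1.4 trillion tokens is about 14 million books' worth of text, which is why the corpus has to draw on web pages and code repositories as well. That scale explains why the pipeline needs industrial machinery.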

A companion formula for back-of-the-envelope planning is training FLOPs of about $6ND$, where $N$ is the number of dense model parameters (counted consistently) and $D$ is the number of training tokens. It's an approximation, and details such as whether output-layer parameters are counted can move fitted optima, but it's useful for quick capacity estimates.^{[1]Reference 1Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}^{[2]Reference 2Resolving Discrepancies in Compute-Optimal Scaling of Language Modelshttps://arxiv.org/abs/2406.19146}

Raw token counts are misleading. A smaller corpus of textbook-quality tokens can beat a much larger pile of noisy crawl text. Scaling data only helps when the extra tokens still add real information.

Meta reports pre-training Llama 3's 405B model on 15.6T tokens, about 38.5 tokens per parameter, and says that flagship configuration is approximately compute-optimal under its own fitted laws and training budget. Separately, Meta reports training its smaller models longer than compute-optimal because those models performed better at the same inference budget.^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783} Treat the 20x ratio as a planning baseline, not a hard law.

token-target-and-flop-budget.py

def dense_training_flops(parameters: float, tokens: float) -> float:
    return 6 * parameters * tokens

parameters = 70e9
target_tokens = 20 * parameters
llama_405b_ratio = 15.6e12 / 405e9

print(f"Chinchilla-style token target: {target_tokens / 1e12:.1f}T")
print(f"Dense training FLOP estimate: {dense_training_flops(parameters, target_tokens):.2e}")
print(f"Llama 3 405B reported ratio: {llama_405b_ratio:.1f} tokens/parameter")

Output

Chinchilla-style token target: 1.4T
Dense training FLOP estimate: 5.88e+23
Llama 3 405B reported ratio: 38.5 tokens/parameter

The raw corpus and the clean training mix

The raw internet is a noisy source lake. It contains useful records, but they're mixed with duplicate pages, broken markup, spam, and unsafe material. A pre-training data pipeline is the repeatable processing system that extracts useful records and turns them into a training-ready mix.

Your job as the AI engineer isn't to write the text. It's to design a processing system that handles large partitions repeatably, records each decision, and avoids throwing out high-value records.

The modern pipeline has three major phases:

Ingestion: Pull raw data from web crawls, code hosts, and curated archives.
Cleaning: Filter, deduplicate (remove duplicate documents), decontaminate (strip out leaked evaluation-benchmark examples), and scrub for safety.
Preparation: Tokenize, shuffle, pack, and split into training shards.

Figure: Illustrative pre-training funnel for a 100-record batch. 18 records fail extraction (82 kept), 40 fail quality checks (42 kept), 11 are near-duplicates (31 kept), and 2 fail policy or evaluation checks, leaving 29 for tokenization and packing. Real retention rates must be measured for each source.
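The funnel can be tracked as a simple stage-by-stage ledger. The sketch below reuses the illustrative 100-record counts; the stage names and the `funnel_report` helper are hypothetical, and in a real pipeline the counts would come from per-stage logs rather than constants.

```python
# Illustrative counts copied from the 100-record funnel; not measured data.
stages = [
    ("raw", 100),
    ("extracted", 82),
    ("quality-filtered", 42),
    ("near-deduplicated", 31),
    ("policy and eval checked", 29),
]

def funnel_report(stages):
    """Return (stage, kept, dropped_at_this_stage, cumulative_retention) rows."""
    rows = []
    start = stages[0][1]
    previous = start
    for name, kept in stages:
        rows.append((name, kept, previous - kept, kept / start))
        previous = kept
    return rows

report = funnel_report(stages)
for name, kept, dropped, retention in report:
    print(f"{name:<24} kept={kept:>3} dropped={dropped:>2} retention={retention:.0%}")
```

Recording the drop count at every stage, not just the final yield, is what makes it possible to see which filter is responsible when a source's retention collapses.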

Where the text comes from

Different data sources serve different roles in the training mixture. Web data gives breadth and freshness. Code is dense in syntax, structure, and exact correctness. Books and papers provide long-form coherent explanations. Carefully curated synthetic or textbook-style data can raise signal density further.

Phi-1 is a good example: its authors report strong coding benchmark performance from a 1.3B model trained on textbook-quality and synthetic data.^{[4]Reference 4Textbooks Are All You Needhttps://arxiv.org/abs/2306.11644}

Source	What it contributes	Main curation question
Common Crawl snapshots	Broad web coverage at large scale	Which extraction, language, quality, and deduplication rules raise signal without collapsing yield?
Books and papers	Long-form explanations and sustained argument	Which licenses, provenance rules, and subject areas fit the intended model?
Wikipedia	Structured reference text across many topics and languages	Which languages and snapshots belong in the mixture?
Code (for example, public repositories)	Syntax, APIs, and examples where exact structure matters	Which licenses, repository-level rules, and file filters remove generated or low-value code?
Curated web (for example, technical forums)	Focused discussions and worked answers	How do you preserve useful niche material without replaying copied pages?
Synthetic data	Targeted examples for gaps that raw sources cover poorly	Does the generated slice preserve factual fidelity, diversity, and downstream quality?

Note: Some papers disclose approximate mixtures while many don't. Llama 3 reports a final pre-training mix of roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code, and 8% multilingual data.^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783} The broader pattern is that mixture weights are decisions, not raw-byte proportions.

Mixture weights, curriculum, and data scheduling

Source quality is only half the decision. You also have to decide how often each source appears and whether that weighting stays fixed through the whole run.

Three knobs matter:

Source weighting: web, code, books, papers, synthetic data
Schedule over time: fixed mix vs later-stage reweighting
Within-source weighting: whether some slices replay more often than others

The wrong default is "sample in proportion to raw bytes." Raw crawl volume isn't the same thing as training value. The Pile is still useful as a reference because it made mixture design explicit instead of treating web text as one giant bucket.^{[5]Reference 5The Pile: An 800GB Dataset of Diverse Text for Language Modeling.https://arxiv.org/abs/2101.00027} Llama 3 shows a later-stage version of the same idea: even after assembling a huge corpus, training shifted toward higher-value slices during annealing.^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}

reported-mixture-token-budget.py

reported_mix = {
    "general knowledge": 0.50,
    "math and reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}
budget = 1e12

assert abs(sum(reported_mix.values()) - 1.0) < 1e-9
for source, fraction in reported_mix.items():
    print(f"{source:<19}: {budget * fraction / 1e9:>3.0f}B tokens")

Output

general knowledge  : 500B tokens
math and reasoning : 250B tokens
code               : 170B tokens
multilingual       :  80B tokens

Knob	What changes	Typical reason
Upsampling / downsampling	how often each source appears	code, math, books, or multilingual text may deserve more weight than raw size suggests
Curriculum schedule	what the model sees earlier vs later	keep early coverage broad, then spend late tokens on harder or cleaner data
Per-example weights	which records inside one source repeat more	keep rare but important domains from being drowned by generic text
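Upsampling has a cost that a quick calculation makes visible: a source that receives more budget than it has unique tokens gets repeated. The sketch below is a minimal illustration, assuming hypothetical source sizes and weights (the numbers in `available_tokens` and `mixture_weights` are invented, not taken from any published corpus).

```python
# Hypothetical unique-token inventory per source (illustrative only).
available_tokens = {
    "web": 3.0e12,
    "code": 0.2e12,
    "books": 0.05e12,
}
# Hypothetical mixture weights chosen by the recipe, not raw-byte proportions.
mixture_weights = {"web": 0.70, "code": 0.20, "books": 0.10}
training_budget = 1.0e12  # total training tokens

def effective_epochs(available, weights, budget):
    """Number of passes over each source implied by the mixture weights."""
    return {
        source: budget * weights[source] / available[source]
        for source in weights
    }

epochs = effective_epochs(available_tokens, mixture_weights, training_budget)
for source, passes in epochs.items():
    print(f"{source:<6} {passes:.2f} passes")
```

With these made-up numbers the books slice would be seen about twice while most of the web slice is never sampled, which is the kind of tradeoff a mixture decision has to weigh.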

This is practical curriculum learning in pre-training. It usually isn't "easy lessons first, hard lessons later." More often it means changing source weights over time so the model sees a different mix at different stages.

Two common patterns:

Static high-quality mix: choose one weighted mixture and keep it stable through most of training
Late-stage curriculum: keep broad coverage early, then upweight cleaner or more target-relevant slices later^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}

Don't confuse curriculum with annealing alone. Annealing usually means late-stage learning-rate decay plus a higher-quality mix. Curriculum is broader: any deliberate schedule that changes what the model sees as training progresses.
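A late-stage curriculum can be expressed as a function from training step to mixture weights. The sketch below assumes a linear shift over the final 10% of steps; the `mixture_at` helper, the source names, and all weights are hypothetical, and published recipes such as Llama 3's do not necessarily use this shape.

```python
def mixture_at(step, total_steps, base_mix, late_mix, late_fraction=0.1):
    """Hold base_mix, then shift linearly to late_mix over the final late_fraction of steps."""
    late_steps = max(1, round(total_steps * late_fraction))
    late_start = total_steps - late_steps
    if step <= late_start:
        return dict(base_mix)
    progress = min(1.0, (step - late_start) / late_steps)
    return {
        source: (1 - progress) * base_mix[source] + progress * late_mix[source]
        for source in base_mix
    }

# Hypothetical weights; both mixes cover the same sources and each sums to 1.
base_mix = {"web": 0.70, "code": 0.20, "curated": 0.10}
late_mix = {"web": 0.40, "code": 0.30, "curated": 0.30}

for step in (0, 900, 950, 1000):
    mix = mixture_at(step, 1000, base_mix, late_mix)
    print(step, {source: round(weight, 2) for source, weight in mix.items()})
```

Because both endpoints sum to 1, every interpolated mix does too, so the data loader can sample from it directly at any step.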

It also helps to know the public datasets that anchored a lot of later recipes:

Dataset	What it emphasized	Why people still cite it
C4^{[6]Reference 6Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.https://arxiv.org/abs/1910.10683}	Aggressive heuristic cleanup of Common Crawl	Canonical example of turning noisy crawl text into a cleaner English web corpus
The Pile^{[5]Reference 5The Pile: An 800GB Dataset of Diverse Text for Language Modeling.https://arxiv.org/abs/2101.00027}	Deliberate source diversity across code, papers, books, forums, and web text	Useful contrast to pure web-crawl pipelines because mixture design is the core idea
FineWeb^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}	Web-scale extraction, filtering, and deduplication ablations over 96 Common Crawl snapshots	Good modern reference for how pipeline details measurably change downstream quality

First quality check: heuristic filtering

Before using expensive neural models to classify text, basic heuristic filters drop the lowest-signal material from web crawls. As seen in foundational datasets like C4 (Colossal Clean Crawled Corpus)^{[6]Reference 6Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.https://arxiv.org/abs/1910.10683}, these simple rules act as a highly efficient first line of defense.

The snippet below implements those basic rules. It takes a single document as input and returns True only if the document passes all checks.

first-quality-gate-heuristic-filtering.py

def load_bad_words() -> list[str]:
    # In production, this would come from a reviewed policy list.
    return ["toxic_word_1", "toxic_word_2"]

def quality_filter(doc: str) -> bool:
    """Illustrative web-text filter; thresholds require corpus-level validation."""
    words = doc.split()

    # Length filter: drop documents that are too short or likely concatenated.
    if len(words) < 50 or len(words) > 100_000:
        return False

    # Repetition filter: repeated n-grams are common in spam and boilerplate.
    for n in [2, 3, 4]:
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
        if len(ngrams) > 0:
            frac_duplicates = 1 - len(set(ngrams)) / len(ngrams)
            if frac_duplicates > 0.3:  # >30% repeated n-grams
                return False

    # Policy-list filter: check the ratio of reviewed blocklisted terms.
    # Strip surrounding punctuation so "term." and "term," still match.
    bad_words = set(load_bad_words())
    bad_count = sum(1 for w in words if w.lower().strip(".,;:!?\"'()") in bad_words)
    if bad_count / len(words) > 0.01:
        return False

    # Alphabetic character ratio: drop documents that are mostly symbols/code.
    alpha_chars = sum(c.isalpha() for c in doc)
    if alpha_chars / len(doc) < 0.5:
        return False

    # Sentence check: splitting on '.' must yield at least 3 parts (two or more periods).
    sentences = doc.split('.')
    if len(sentences) < 3:
        return False

    return True

procedure_doc = (
    "The key-rotation runbook explains how stale service-account keys are reviewed. "
    "Operators verify the audit signal, check the owning service, "
    "and create a rotation task when the account is eligible. "
    "This document contains specific procedural details, normal punctuation, "
    "and enough natural language context to be useful for model training. "
    "It avoids repeated boilerplate and gives support teams a clear workflow."
)

cases = {
    "procedure": (procedure_doc, True),
    "short": ("Rotation tasks are available.", False),
    "repeated": ("sale sale sale sale sale sale sale sale sale sale " * 8, False),
    "policy-list": (procedure_doc + " toxic_word_1 toxic_word_2", False),
}

for name, (text, expected) in cases.items():
    kept = quality_filter(text)
    assert kept is expected
    print(f"{name:<11} -> {'keep' if kept else 'drop'}")

Output

procedure   -> keep
short       -> drop
repeated    -> drop
policy-list -> drop

What this code does, step by step

Length check: Very short documents often carry little context; extremely long extractions may be concatenated pages or indexes.
Repetition check: High repeated n-gram fractions are a useful spam or boilerplate signal, but legitimate templates and structured documents can also repeat.
Policy-list check: A reviewed list can identify content a specific corpus policy excludes; its threshold is a policy choice, not a universal quality label.
Character check: A general-text recipe can discard symbol-heavy extractions, while a code corpus needs a different recipe.
Sentence check: A low count of sentence terminators can signal a malformed extraction, but the rule is unsuitable as a universal check because many languages and formats (lists, tables, code) don't end sentences with a period.

The code intentionally uses simple thresholds so the gates are visible. C4 and FineWeb use related heuristic filtering ideas, but a real recipe must measure retention and downstream model quality on its own crawl slices before adopting thresholds.^{[6]Reference 6Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.https://arxiv.org/abs/1910.10683}^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}

These heuristics usually run on general web text before the final data mix is assembled. Code corpora often use a parallel recipe, because symbol-heavy files that look low-quality to a web-text filter can still be high-signal pre-training data.

Second quality check: classifier-based filtering

After heuristic filtering, modern pipelines use classifiers trained to distinguish high-quality educational text from low-quality web chatter:

Language ID: A fastText classifier screens whether the document matches the intended training languages.
Quality Classifier: A classifier scores whether a document looks like reference-quality material rather than random crawl text. FineWeb-Edu uses an educational-quality classifier trained on LLM-judged scores, DCLM trains a fastText classifier on instruction-formatted and ELI5 text and keeps only the top ~10% by score, and Meta reports separate Llama 3 classifiers for general quality, code, and reasoning signals.^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}^{[8]Reference 8DataComp-LM: In Search of the Next Generation of Training Sets for Language Modelshttps://arxiv.org/abs/2406.11794}^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}
Perplexity filtering: A language model such as an n-gram model can score a document against a reference distribution. Unusually high or low scores can flag text for corpus-specific retention rules; CCNet is a published example of perplexity-based filtering.^{[9]Reference 9CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Datahttps://arxiv.org/abs/1911.00359}

Filter	Effect	Example
URL blocklist	Remove known low-quality or unsafe domains	Common in production pipelines
Language ID (fastText)	Keep target language(s)	CCNet^{[9]Reference 9CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Datahttps://arxiv.org/abs/1911.00359}, FineWeb^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}
Heuristic rules	Length, repetition, char ratios	C4^{[6]Reference 6Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.https://arxiv.org/abs/1910.10683}, RedPajama^{[10]Reference 10RedPajama: An Open Source Recipe to Reproduce LLaMA training datasethttps://github.com/togethercomputer/RedPajama-Data}, FineWeb^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}
Classifier	Quality score threshold	FineWeb-Edu^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}, DCLM^{[8]Reference 8DataComp-LM: In Search of the Next Generation of Training Sets for Language Modelshttps://arxiv.org/abs/2406.11794}, Llama 3^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}
Perplexity filter	Rank documents by fit to a reference distribution	CCNet^{[9]Reference 9CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Datahttps://arxiv.org/abs/1911.00359}
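Perplexity filtering can be illustrated with a deliberately tiny model. The sketch below uses a unigram model with add-one smoothing as a stand-in for the larger n-gram model a real pipeline would train on a reference corpus; the reference sentences, the candidate texts, and the helper names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def train_unigram(reference_texts):
    counts = Counter(token for text in reference_texts for token in tokenize(text))
    return counts, sum(counts.values())

def perplexity(text, counts, total):
    """Unigram perplexity with add-one smoothing; lower means closer to the reference."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("Cannot score an empty document.")
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens
    log_prob = sum(
        math.log((counts[token] + 1) / (total + vocab_size)) for token in tokens
    )
    return math.exp(-log_prob / len(tokens))

reference = [
    "Operators rotate stale service account keys after a security review.",
    "The runbook explains how the audit signal is checked before rotation.",
]
counts, total = train_unigram(reference)

in_domain = "The security review checks the audit signal before key rotation."
off_domain = "Win free prizes now click here cheap pills best deals online."

ppl_in = perplexity(in_domain, counts, total)
ppl_off = perplexity(off_domain, counts, total)
print(f"in-domain perplexity:  {ppl_in:.1f}")
print(f"off-domain perplexity: {ppl_off:.1f}")
```

The off-domain text scores higher because none of its tokens appear in the reference. A retention rule would then keep or drop documents by score range, and that range has to be validated on the actual corpus.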

Filtering decisions trade yield for the type of data retained. DCLM selected a fastText filter that retained its top 10% of documents after evaluating trained models, rather than assuming a score threshold was automatically better.^{[8]Reference 8DataComp-LM: In Search of the Next Generation of Training Sets for Language Modelshttps://arxiv.org/abs/2406.11794}

quality-score-threshold-yield.py

documents = [
    {"source": "proof", "tokens": 120, "score": 0.98},
    {"source": "api-doc", "tokens": 160, "score": 0.92},
    {"source": "forum", "tokens": 200, "score": 0.79},
    {"source": "news", "tokens": 240, "score": 0.63},
    {"source": "scrape", "tokens": 280, "score": 0.31},
]

threshold = 0.90
kept = [doc for doc in documents if doc["score"] >= threshold]
kept_tokens = sum(doc["tokens"] for doc in kept)
total_tokens = sum(doc["tokens"] for doc in documents)

print("kept sources:", ", ".join(doc["source"] for doc in kept))
print(f"token yield: {kept_tokens}/{total_tokens} ({kept_tokens / total_tokens:.0%})")

Output

kept sources: proof, api-doc
token yield: 280/1000 (28%)

Removing duplicates

Deduplication is a core quality step. The internet contains large amounts of duplicated content: boilerplate headers, repeated SEO text, mirrored sites, copied documentation, and reposted articles. Lee et al. found that, on the corpora and model sizes they tested, deduplication reduced emitted memorized text by about 10x and reached the same or better validation accuracy in fewer training steps.^{[11]Reference 11Deduplicating Training Data Makes Language Models Better.https://arxiv.org/abs/2107.06499}

Exact deduplication (hash-based)

Exact deduplication computes a hash of each document, for example SHA-256, and keeps only the first document seen for each hash value. It's fast but misses near-duplicates, such as a news article where only the timestamp or a single word was changed.
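A minimal sketch of hash-based exact deduplication, using only the standard library. The document IDs and texts are made up, and a production system would typically shard the seen-hash set across workers instead of holding it in one process.

```python
import hashlib

def content_hash(text):
    """SHA-256 digest of the UTF-8 bytes; any byte difference changes the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exact_dedup(documents):
    """Keep the first document for each distinct hash, preserving input order."""
    seen = set()
    kept = []
    for doc_id, text in documents:
        digest = content_hash(text)
        if digest in seen:
            continue  # byte-identical to an earlier document
        seen.add(digest)
        kept.append(doc_id)
    return kept

documents = [
    ("a", "Service restored at 10:42. Root cause: expired certificate."),
    ("b", "Service restored at 10:42. Root cause: expired certificate."),
    ("c", "Service restored at 10:43. Root cause: expired certificate."),
]
kept_ids = exact_dedup(documents)
print(kept_ids)  # ['a', 'c']
```

Document `c` differs from `a` by one character in the timestamp, so it survives. That gap is what near-deduplication is meant to close.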

MinHash + LSH (near-deduplication)

MinHash with Locality-Sensitive Hashing (LSH) is one widely used approach for fuzzy deduplication at scale.^{[12]Reference 12On the Resemblance and Containment of Documents.https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf}^{[11]Reference 11Deduplicating Training Data Makes Language Models Better.https://arxiv.org/abs/2107.06499} FineWeb, for example, applied MinHash independently per crawl using word 5-grams and parameters intended to target documents with at least 75% similarity; its experiments favored per-crawl rather than global deduplication.^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}

The idea, by hand: Imagine you have three short documents:

Doc A: "The key policy says stale accounts need security review."
Doc B: "The key policy says stale accounts require security review."
Doc C: "Audit logs record deployment rollback events."

Doc A and Doc B share almost every word. If you break them into overlapping 2-word chunks (shingles), they look like this:

Doc A shingles: ["The key", "key policy", "policy says", "says stale", ...]
Doc B shingles: ["The key", "key policy", "policy says", "says stale", ...]

The overlap is high. Doc C shares zero shingles with A or B. MinHash turns each document into a small numeric "sketch" that estimates this overlap relationship. LSH then routes similar sketches into the same candidate set so we inspect likely duplicates rather than comparing every pair. Because LSH is approximate, a candidate still needs a final similarity decision.
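To make the sketch idea concrete, here is a from-scratch MinHash estimate for the three toy documents. It is a teaching sketch under stated assumptions: seeded SHA-256 hashes stand in for random permutations, and the helper names and the choice of 128 hashes are illustrative, not taken from any library.

```python
import hashlib
import re

def bigrams(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def seeded_hash(seed, shingle):
    """Deterministic 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.sha256(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_sketch(shingle_set, num_hashes=128):
    """For each seed, keep the smallest hash value over the document's shingles."""
    return [min(seeded_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]

def estimated_jaccard(left, right):
    """Fraction of sketch positions where the two minima agree."""
    return sum(a == b for a, b in zip(left, right)) / len(left)

doc_a = bigrams("The key policy says stale accounts need security review.")
doc_b = bigrams("The key policy says stale accounts require security review.")
doc_c = bigrams("Audit logs record deployment rollback events.")

exact_ab = len(doc_a & doc_b) / len(doc_a | doc_b)
estimate_ab = estimated_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_b))
estimate_ac = estimated_jaccard(minhash_sketch(doc_a), minhash_sketch(doc_c))

print(f"exact Jaccard A/B:     {exact_ab:.2f}")
print(f"estimated Jaccard A/B: {estimate_ab:.2f}")
print(f"estimated Jaccard A/C: {estimate_ac:.2f}")
```

The exact A/B value is 0.60, and the estimate should land near it, with sampling noise that shrinks as `num_hashes` grows. The sketch has a fixed length regardless of document size, which is what makes comparison cheap at corpus scale.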

One candidate-then-verify pipeline looks like this:

Figure: Worked near-deduplication example for the three toy documents. A and B each have eight unique bigrams, with six shared out of ten in their union, so exact Jaccard similarity is 0.60. An illustrative six-value MinHash sketch agrees in four positions (an estimate of 4/6, about 0.67) and collides in one LSH band, so exact verification compares the pair and drops B at the 0.50 threshold. C has five bigrams, shares none with A or B, and is kept.

This example shows how to build a basic MinHash and LSH pipeline using the datasketch library. For a tiny demo corpus, it builds signatures from bigram shingles, uses LSH only for candidate lookup, and verifies candidates with exact Jaccard overlap before dropping a document.

minhash-lsh-near-deduplication.py

import re

from datasketch import MinHash, MinHashLSH

def normalize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def shingles(text: str, shingle_size: int = 2) -> set[tuple[str, ...]]:
    tokens = normalize(text)
    if len(tokens) < shingle_size:
        raise ValueError("Document is too short for chosen shingle size.")
    return {
        tuple(tokens[i:i + shingle_size])
        for i in range(len(tokens) - shingle_size + 1)
    }

def create_minhash(items: set[tuple[str, ...]], num_perm: int = 128) -> MinHash:
    """Create MinHash signature from normalized word shingles."""
    m = MinHash(num_perm=num_perm)
    for item in items:
        shingle = " ".join(item)
        m.update(shingle.encode("utf8"))
    return m

def jaccard(left: set[tuple[str, ...]], right: set[tuple[str, ...]]) -> float:
    return len(left & right) / len(left | right)

threshold = 0.5
lsh = MinHashLSH(threshold=threshold, num_perm=128)
documents = [
    ("doc1", "The key policy says stale accounts need security review."),
    ("doc2", "The key policy says stale accounts require security review."),
    ("doc3", "Audit logs record deployment rollback events."),
]

kept_shingles: dict[str, set[tuple[str, ...]]] = {}

for doc_id, text in documents:
    doc_shingles = shingles(text)
    mh = create_minhash(doc_shingles)
    candidate_scores = [
        (candidate, jaccard(doc_shingles, kept_shingles[candidate]))
        for candidate in sorted(lsh.query(mh))
    ]
    duplicate = next(
        ((candidate, score) for candidate, score in candidate_scores if score >= threshold),
        None,
    )
    if duplicate:
        print(f"{doc_id}: drop near duplicate of {duplicate[0]} ({duplicate[1]:.2f})")
    else:
        kept_shingles[doc_id] = doc_shingles
        lsh.insert(doc_id, mh)
        print(f"{doc_id}: keep")

print(f"Kept {len(kept_shingles)} out of {len(documents)} documents.")

Output

doc1: keep
doc2: drop near duplicate of doc1 (0.60)
doc3: keep
Kept 2 out of 3 documents.

Why this works: doc1 is inserted first. When doc2 arrives, LSH returns doc1 as a candidate; an exact Jaccard check then confirms sufficient overlap before the drop. doc3 has no confirmed neighbor, so it's kept. Published recipes make different final decisions after MinHash collisions. Lee et al. verified candidate pairs with actual Jaccard and edit similarity, while FineWeb used matching MinHash buckets and transitive clustering. Production thresholds, shingle sizes, verification rules, and whether deduplication operates per crawl or globally must be evaluated for the corpus.^{[11]Reference 11Deduplicating Training Data Makes Language Models Better.https://arxiv.org/abs/2107.06499}^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}
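How band and row counts shape candidate selection follows from the standard LSH banding formula: two documents with Jaccard similarity $s$ share at least one band with probability $1 - (1 - s^r)^b$ for $b$ bands of $r$ hashes each. The sketch below evaluates it for 14 bands of 8 hashes, the configuration FineWeb's write-up describes; treat those values as an input to experiment with, not a recommendation.

```python
def candidate_probability(similarity, bands, rows):
    """Chance that two documents collide in at least one LSH band."""
    return 1 - (1 - similarity ** rows) ** bands

# 14 bands of 8 hashes each (112 hashes total), as described for FineWeb.
bands, rows = 14, 8
for similarity in (0.5, 0.6, 0.75, 0.9):
    p = candidate_probability(similarity, bands, rows)
    print(f"Jaccard {similarity:.2f} -> candidate with probability {p:.3f}")
```

The curve is steep but not a step: pairs at 0.75 similarity become candidates roughly 77% of the time, some less similar pairs still collide, and some more similar pairs are missed. That is why the final decision rule after a collision matters.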

Method	Speed	Precision	Recall	Best For
Exact hash (SHA-256)	Very fast	Perfect	Low (exact only)	Removing byte-identical documents
MinHash + LSH	Fast candidate search	Tunable	Tunable	Large-corpus near-deduplication recipes^{[12]Reference 12On the Resemblance and Containment of Documents.https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf}^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}
Suffix array / suffix tree	Medium	Very High	Very High	Exact repeated spans and long substring deduplication
SimHash	Fast	Medium	Medium	Cheap approximate similarity when memory is extremely tight

Keeping benchmarks clean: decontamination

An important and often overlooked step is decontamination: keeping evaluation benchmarks out of the pre-training data. If a model sees MMLU, HumanEval, or GSM8K examples during pre-training, its reported performance becomes misleading because it may be memorizing instead of generalizing.

The risk is substantial. Web crawls contain GitHub repositories with coding interview questions, educational websites with standardized test questions, and forums where users paste benchmark prompts. Because training corpora and evaluation sets often draw from the same public web, contamination is a recurring risk rather than a hypothetical edge case.

N-gram overlap filtering

One common approach is n-gram overlap detection. For each document in the training corpus, check whether it contains spans from an evaluation dataset. The n-gram size, normalization, removal threshold, and code-specific rules must be recorded with the released corpus. GPT-3 attempted pre-training filtering with 13-gram overlaps, then reported that a filtering bug left some overlaps in place. Its post-hoc benchmark analysis used variable n-gram lengths capped at 13. Llama 3 reports excluding benchmark training sets from its annealing data.^{[13]Reference 13Language Models are Few-Shot Learners.https://arxiv.org/abs/2005.14165}^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}

Check	What it catches	What it misses
Exact string or n-gram overlap	Copied benchmark prompts, answers, or code spans	Paraphrases and renamed identifiers
Fuzzy or task-specific similarity	Lightly edited versions worth inspection	Requires calibrated thresholds
Repository or source exclusion	Artifacts from a known evaluation source	Copies hosted elsewhere

Exact-match filtering isn't enough. A coding problem with renamed variables or a question with shuffled answer choices can still leak. Pipelines may add fuzzy or task-specific checks, then document what they removed so benchmark-clean claims are auditable.

ngram-decontamination-screen.py

import re

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

benchmark = "A rotation is approved when security verifies a stale service account."
candidates = {
    "direct-copy": "Internal guide: a rotation is approved when security verifies a stale service account.",
    "paraphrase": "Allow key refresh after the risk signal has been checked.",
}
protected = ngrams(benchmark, n=4)

for name, document in candidates.items():
    flagged = bool(protected & ngrams(document, n=4))
    print(f"{name:<11} -> {'flag' if flagged else 'not caught by exact n-grams'}")

Output

direct-copy -> flag
paraphrase  -> not caught by exact n-grams

Figure: Benchmark decontamination flow. Exact overlap catches copied prompts, fuzzy or code-aware checks catch edited variants such as renamed solutions, and source rules block known benchmark repositories. No single detector catches every leak shape; only candidates that clear all detectors, such as unrelated support policy text, stay in the corpus.
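One possible code-aware check for the renamed-identifier case is to compare programs after replacing every identifier with a positional placeholder. The sketch below does this for Python source with the standard `ast` module. It is a hypothetical illustration, not the method of any pipeline cited here, and it only catches solutions whose structure is otherwise identical.

```python
import ast

class _Canonicalizer(ast.NodeTransformer):
    """Rename functions, arguments, and variables to v0, v1, ... in first-seen order."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        # Builtins such as max() are renamed too, consistently across programs.
        node.id = self._alias(node.id)
        return node

def canonical_form(source):
    """Structure-only fingerprint of Python source (identifiers abstracted away)."""
    return ast.dump(_Canonicalizer().visit(ast.parse(source)))

benchmark_solution = (
    "def add_all(values):\n"
    "    total = 0\n"
    "    for value in values:\n"
    "        total += value\n"
    "    return total\n"
)
renamed_copy = (
    "def sum_list(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
unrelated = "def largest(xs):\n    return max(xs)\n"

renamed_matches = canonical_form(renamed_copy) == canonical_form(benchmark_solution)
unrelated_matches = canonical_form(unrelated) == canonical_form(benchmark_solution)
print("renamed copy matches benchmark:", renamed_matches)
print("unrelated code matches benchmark:", unrelated_matches)
```

Very short or generic functions can share a structure by coincidence, so a check like this is better used to flag candidates for review than to drop them automatically.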

Safety and privacy scrub

After deduplication and decontamination, a pipeline may scrub personally identifiable information (PII) and unsafe content according to its data policy and legal obligations. A model trained on scraped identifiers can reproduce sensitive strings at inference time, creating privacy and compliance risk. FineWeb reports anonymizing email and public-IP addresses; Llama 3 reports removing domains likely to contain high volumes of PII or adult content.^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}

PII removal uses a layered approach:

Regex patterns: Rule-based detection can catch structured patterns such as email addresses or public IP addresses cheaply.
Entity or policy classifiers: Context-aware models can identify classes of content that rigid patterns miss, but require false-positive evaluation.
Source and domain rules: Known high-risk sources can be excluded before model training, as reported for Llama 3.^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}
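The regex layer can be sketched with the standard library alone. The example below masks email addresses and globally routable IPv4 addresses; the patterns are deliberately simple, the `<EMAIL>` and `<IP_ADDRESS>` placeholders are hypothetical choices, and a real policy would need broader patterns plus false-positive review.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def _mask_public_ip(match):
    """Mask only valid, globally routable addresses; leave private ranges and non-IPs."""
    text = match.group(0)
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        return text  # looked like an IP (for example a version string) but isn't one
    return "<IP_ADDRESS>" if address.is_global else text

def scrub(text):
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub(_mask_public_ip, text)

sample = "Contact ops@example.com from 8.8.8.8 or 10.0.0.5."
scrubbed = scrub(sample)
print(scrubbed)  # Contact <EMAIL> from <IP_ADDRESS> or 10.0.0.5.
```

Leaving private addresses such as `10.0.0.5` in place keeps networking documentation readable while still removing addresses that could identify a real host.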

Unsafe-content filtering is policy-specific and imperfect. FineWeb applies URL-level filtering for adult content but explicitly notes that harmful documents can remain. Llama 3 combines domain rules with dirty-word counting for adult websites that domain blocklists miss.^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783} Whatever controls you choose, audit retained and removed samples across languages and source types instead of treating one score threshold as a universal safety boundary.

Turning text into numbers: tokenization

Tokenization converts raw text into integer IDs that the model can process. At the scale of trillions of tokens, tokenization itself requires parallel computation across many workers.

Choosing a tokenizer

Selecting the right tokenizer involves trading off between vocabulary size, sequence length, and language support. A smaller vocabulary can force text to split into more subwords, increasing sequence length and the computational cost of attention. A larger vocabulary can improve compression when it matches the corpus, but it also requires a larger embedding matrix, which consumes memory.
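To make the memory side of that tradeoff concrete, here is a rough sizing sketch. The hidden size of 4096 and the 2-byte weights are assumptions chosen for illustration, and the helper name `embedding_params` is hypothetical.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one embedding table: one d_model vector per token ID."""
    return vocab_size * d_model


D_MODEL = 4096        # assumed hidden size, not tied to any specific model
BYTES_PER_PARAM = 2   # assumes bf16/fp16 weights

for vocab_size in (32_000, 128_000):
    params = embedding_params(vocab_size, D_MODEL)
    mib = params * BYTES_PER_PARAM / 2**20
    print(f"vocab {vocab_size:>7,}: {params / 1e6:6.1f}M params, {mib:7.1f} MiB")
```

If the output projection isn't tied to the input embedding, the same cost is paid a second time, which is one reason vocabulary growth matters more for small models than for large ones.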

Common subword-tokenizer algorithms use data-driven vocabulary construction:

Tokenizer	Algorithm	Vocabulary	Used By
Byte-level BPE^{[14]Reference 14Neural Machine Translation of Rare Words with Subword Units.https://arxiv.org/abs/1508.07909}^{[15]Reference 15Language Models are Unsupervised Multitask Learners.https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf}^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}	Start from byte-representable symbols, then add frequent merges	32K-128K+	GPT-style models, Llama 3
WordPiece^{[16]Reference 16BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.https://arxiv.org/abs/1810.04805}	Subword vocabulary with greedy longest-match encoding	~30K	BERT
SentencePiece^{[17]Reference 17SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.https://arxiv.org/abs/1808.06226}	Framework for BPE or Unigram on raw text	32K-256K	Llama 2, T5, multilingual models

Training a BPE tokenizer

When you train a model family from scratch, the tokenizer should be trained on a representative sample of the pre-training data so it learns the most common subword patterns. Reusing a mature tokenizer can be a good engineering choice, but a mismatched tokenizer can waste context length, especially for multilingual or domain-heavy corpora. The tokenizer demo uses the Hugging Face tokenizers library to initialize a byte-level Byte-Pair Encoding (BPE) tokenizer. For the demo, it trains from an in-memory sample; production pipelines train from a carefully mixed corpus sample and may target vocabularies such as 128,000 tokens.

training-a-bpe-tokenizer.py

from tokenizers import Tokenizer, decoders, models, trainers, pre_tokenizers

# Initialize a Byte-Pair Encoding (BPE) tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on a representative sample of the corpus
trainer = trainers.BpeTrainer(
    vocab_size=300,
    special_tokens=["<s>", "</s>", "<pad>"],
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

training_sample = [
    "Operators can rotate stale service-account keys after review.",
    "Audit logs record deployment rollback events and access workflows.",
    "Los equipos pueden consultar guías de despliegue y políticas de acceso.",
    "def rotate_key(account_id): return create_security_ticket(account_id)",
]

tokenizer.train_from_iterator(training_sample, trainer=trainer)
encoded = tokenizer.encode("Rotation policy creates a security ticket.")
decoded = tokenizer.decode(encoded.ids)
assert decoded == "Rotation policy creates a security ticket."
print(f"learned vocab size: {len(tokenizer.get_vocab())}")
print(f"sample tokens: {len(encoded.ids)}, round-trip: {decoded!r}")

Output

learned vocab size: 293
sample tokens: 30, round-trip: 'Rotation policy creates a security ticket.'

In a byte-level setup, you often don't need an <unk> token because any UTF-8 byte sequence can be represented. The requested vocab_size is a ceiling, not a guarantee: this tiny sample learns 293 entries because it runs out of repeated pairs worth merging before reaching 300.

Vocabulary size is a real design decision. A larger vocabulary can improve compression when its extra tokens match the target text, but it also increases embedding-table size and memory footprint. A smaller vocabulary is lighter, but it can produce longer sequences. Meta reports that Llama 3 moved to a 128K-token vocabulary, starting from a tiktoken base plus 28K extra tokens, and improved English compression from 3.17 to 3.94 characters per token relative to Llama 2^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}.
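The reported compression figures translate directly into sequence length. The sketch below applies the two characters-per-token rates to a hypothetical corpus of one trillion English characters; the corpus size and the `tokens_needed` helper are assumptions for illustration.

```python
def tokens_needed(num_chars: float, chars_per_token: float) -> float:
    """Approximate token count for a corpus at a given compression rate."""
    return num_chars / chars_per_token


CORPUS_CHARS = 1e12  # hypothetical corpus size in characters

old_tokens = tokens_needed(CORPUS_CHARS, 3.17)  # Llama 2 rate reported by Meta
new_tokens = tokens_needed(CORPUS_CHARS, 3.94)  # Llama 3 rate reported by Meta
reduction = 1 - new_tokens / old_tokens

print(f"tokens at 3.17 chars/token: {old_tokens / 1e9:.1f}B")
print(f"tokens at 3.94 chars/token: {new_tokens / 1e9:.1f}B")
print(f"fewer tokens for the same text: {reduction:.1%}")
```

Under these assumptions the same text needs roughly a fifth fewer tokens, which means more text fits in a fixed context window and a fixed token budget.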

Running the pipeline at scale

Large corpus pipelines require distributed execution beyond demonstration scripts. DCLM reports starting with a 240T-token Common Crawl pool containing about 200B documents and 370TB of compressed text, and processing crawl data on hundreds of AWS CPU nodes.^{[8]Reference 8DataComp-LM: In Search of the Next Generation of Training Sets for Language Modelshttps://arxiv.org/abs/2406.11794}

At that scale, data is partitioned for extraction and filtering, intermediate artifacts are stored durably, and every transformation needs enough metadata to reproduce which records reached training.

Deduplication changes the system shape: signatures can be computed independently, but candidate grouping needs records with related signatures to meet. Whether that step uses a global operation or a restricted partition, such as FineWeb's per-crawl recipe, changes both resource cost and retained data quality.^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}
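The sketch below illustrates why grouping scope matters. Band keys are computed per record with no coordination, but records only become candidates when they land in the same bucket, and adding the crawl to the key restricts which records can meet. The tiny signatures, crawl names, and helper functions are hypothetical; this isn't FineWeb's implementation.

```python
from collections import defaultdict

# Hypothetical precomputed MinHash signatures: (doc_id, crawl, signature).
records = [
    ("a", "crawl-2024-10", (3, 9, 4, 7)),
    ("b", "crawl-2024-10", (3, 9, 5, 1)),
    ("c", "crawl-2024-18", (3, 9, 4, 7)),
    ("d", "crawl-2024-18", (8, 2, 6, 0)),
]
ROWS_PER_BAND = 2


def band_keys(signature, crawl, per_crawl):
    """Yield one bucket key per LSH band; computed independently per record."""
    scope = crawl if per_crawl else "global"
    for start in range(0, len(signature), ROWS_PER_BAND):
        band = signature[start:start + ROWS_PER_BAND]
        yield (scope, start // ROWS_PER_BAND, band)


def candidate_groups(records, per_crawl):
    """Group records that share any bucket key (the step that needs a shuffle)."""
    buckets = defaultdict(set)
    for doc_id, crawl, signature in records:
        for key in band_keys(signature, crawl, per_crawl):
            buckets[key].add(doc_id)
    return sorted({tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1})


global_groups = candidate_groups(records, per_crawl=False)
per_crawl_groups = candidate_groups(records, per_crawl=True)
print("global candidates:   ", global_groups)
print("per-crawl candidates:", per_crawl_groups)
```

With the global scope, documents `a` and `c` meet even though they come from different crawls. With the per-crawl scope they never share a bucket, so both survive.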

Trillion-token scale challenges

Ingestion and filtering: Extraction, language ID, rules, and classifier scoring must stream through many documents without losing provenance.
Deduplication: Signatures are parallel-friendly; candidate grouping and representative selection need an explicit partitioning strategy.
Training order: Tokenized shards need randomized, documented sampling so one source doesn't unintentionally dominate a run segment.
Reproducibility: Engineers need to identify the source document, filter version, mixture weight, and shard for suspicious examples.
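One way to keep the reproducibility requirement checkable is to attach a small provenance record to every retained document and derive its shard deterministically. The field names, the `run-001` seed, and the shard count below are hypothetical; treat this as a sketch rather than a standard schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    doc_id: str
    source: str
    filter_version: str
    mixture_weight: float
    shard: int


def assign_shard(doc_id: str, num_shards: int, seed: str) -> int:
    """Deterministic pseudo-random shard: same inputs always give the same shard."""
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def make_record(doc_id, source, filter_version, mixture_weight,
                num_shards=1024, seed="run-001"):
    shard = assign_shard(doc_id, num_shards, seed)
    return ProvenanceRecord(doc_id, source, filter_version, mixture_weight, shard)


first = make_record("cc/2024-10/000123", "web", "filters-v3", 0.62)
again = make_record("cc/2024-10/000123", "web", "filters-v3", 0.62)
print(first)
print("reproducible:", first == again)
```

Hashing the seed together with the document ID spreads sources across shards without a stateful shuffle. Rerunning the pipeline with the same seed places every document in the same shard.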

Data packing and sequence efficiency

Documents vary widely in length: an article might be thousands of tokens while a forum post is short. Padding each document to a fixed sequence length spends compute on padding tokens. Best-fit packing groups documents into fixed-length token blocks to reduce that waste.

best-fit-packing-utilization.py

lengths = [7, 6, 5, 5, 3, 2]
capacity = 10

def best_fit_pack(items: list[int], block_size: int) -> list[list[int]]:
    bins: list[list[int]] = []
    for length in sorted(items, reverse=True):
        options = [bucket for bucket in bins if sum(bucket) + length <= block_size]
        if options:
            min(options, key=lambda bucket: block_size - sum(bucket) - length).append(length)
        else:
            bins.append([length])
    return bins

naive_slots = len(lengths) * capacity
packed = best_fit_pack(lengths, capacity)
packed_slots = len(packed) * capacity
tokens = sum(lengths)

print("packed blocks:", packed)
print(f"naive utilization: {tokens / naive_slots:.0%}")
print(f"packed utilization: {tokens / packed_slots:.0%}")

Output

packed blocks: [[7, 3], [6, 2], [5, 5]]
naive utilization: 47%
packed utilization: 93%

Packing creates another design decision: if multiple documents share one sequence, should tokens in one document attend to tokens from another? Some causal-language-model recipes concatenate documents with end-of-document separators and keep the ordinary causal mask. Others isolate documents with a block-diagonal mask so one document can't read tokens from an earlier document in the same packed block. The isolated version below prevents artificial cross-document context while retaining packing efficiency.

packed-document-attention-mask.py

document_ids = [0, 0, 0, 1, 1]

mask = [
    [
        int(previous <= current and document_ids[previous] == document_ids[current])
        for previous in range(len(document_ids))
    ]
    for current in range(len(document_ids))
]

for row in mask:
    print(" ".join(map(str, row)))
print("doc 1 token attends to doc 0:", bool(mask[3][2]))

Output

1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
0 0 0 1 0
0 0 0 1 1
doc 1 token attends to doc 0: False
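For contrast, the concatenate-and-chunk recipe can be sketched as follows. The end-of-document token ID `0`, the toy token IDs, and both helper names are assumptions for illustration. The second helper shows how per-token document IDs, the input a block-diagonal mask needs, can be recovered from the separators.

```python
EOD = 0  # hypothetical end-of-document token ID


def concat_and_chunk(documents, block_size):
    """Join documents with EOD separators, then cut fixed-length blocks."""
    stream = []
    for tokens in documents:
        stream.extend(tokens)
        stream.append(EOD)
    usable = len(stream) - len(stream) % block_size  # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


def document_ids(block):
    """Label each position with a document index local to the block."""
    ids, current = [], 0
    for token in block:
        ids.append(current)
        if token == EOD:
            current += 1  # the separator belongs to the document it closes
    return ids


docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = concat_and_chunk(docs, block_size=4)
for block in blocks:
    print(block, "->", document_ids(block))
```

This recipe wastes no padding, but the third document is split across two blocks, so its later tokens lose their earlier context. Best-fit packing avoids that split at the cost of some padding.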

Data annealing is a late-stage recipe adjustment where the learning rate decays and the data mix shifts toward higher-value slices. It isn't a universal "final 10%" rule. Llama 3, for example, reports annealing over the final 40M tokens while upsampling very high-quality sources and excluding benchmark training sets from the annealing pool^{[3]Reference 3The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783}.
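A schedule of this shape can be sketched in a few lines. The step counts, base learning rate, and mixture weights below are hypothetical. The pre-anneal learning rate is held constant for simplicity, whereas real runs usually follow a warmup and decay schedule before the annealing window.

```python
def annealed_lr(step, total_steps, anneal_steps, base_lr):
    """Hold base_lr, then decay linearly to zero over the final anneal_steps."""
    anneal_start = total_steps - anneal_steps
    if step < anneal_start:
        return base_lr
    return base_lr * (total_steps - step) / anneal_steps


def mixture_at(step, total_steps, anneal_steps, base_mix, anneal_mix):
    """Switch to the annealing mixture once the annealing window begins."""
    return anneal_mix if step >= total_steps - anneal_steps else base_mix


TOTAL, ANNEAL, BASE_LR = 1000, 100, 3e-4           # hypothetical run shape
base_mix = {"web": 0.80, "code": 0.15, "curated": 0.05}
anneal_mix = {"web": 0.40, "code": 0.25, "curated": 0.35}

for step in (0, 900, 950, 1000):
    lr = annealed_lr(step, TOTAL, ANNEAL, BASE_LR)
    mix = mixture_at(step, TOTAL, ANNEAL, base_mix, anneal_mix)
    print(f"step {step:>4}: lr={lr:.2e}, curated share={mix['curated']:.0%}")
```

Both the window length and the annealing mixture are knobs to tune per model, which is why copying another team's values rarely transfers cleanly.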

Quality and quantity must be measured together

Recent model and dataset papers show that data quality and token yield must be measured together once a corpus is already large.

The Phi lesson

Phi-1 reported strong coding benchmark performance for a 1.3B model trained with textbook-quality data, worked examples, and synthetic exercises.^{[4]Reference 4Textbooks Are All You Needhttps://arxiv.org/abs/2306.11644} Its result motivates controlled curation experiments; it doesn't make one data recipe universal.

FineWeb and FineWeb-Edu

FineWeb scales the same idea to web-scale curation. The paper builds a 15T-token dataset from 96 Common Crawl snapshots and then extracts FineWeb-Edu, a 1.3T educational subset filtered with a classifier trained on LLM-judged educational scores. The main lesson is that pipeline design matters: individual per-crawl MinHash deduplication and additional filtering each improved downstream results over weaker baselines^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}.

Modern open corpora

Three published open-corpus reports illustrate different tradeoffs after FineWeb. Each changes a different part of the yield-versus-quality problem.

Corpus	Scale and scope	Role
DCLM-Baseline^{[8]Reference 8DataComp-LM: In Search of the Next Generation of Training Sets for Language Modelshttps://arxiv.org/abs/2406.11794}	3.8T-token set filtered from a 240T-token Common Crawl pool	A `fastText` filter retained its top 10% of documents; a 7B model trained on 2.6T tokens reached 64% 5-shot MMLU
Nemotron-CC^{[18]Reference 18Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Datasethttps://arxiv.org/abs/2412.02595}	6.3T tokens: 4.4T globally deduplicated original tokens plus 1.9T synthetic tokens	Classifier ensembling and source-conditioned rephrasing target larger unique-token yield; its 1.1T high-quality subset improved MMLU by 5.6 points over DCLM in the reported 8B, 1T-token setup
FineWeb2^{[19]Reference 19FineWeb2: One Pipeline to Scale Them All - Adapting Pre-Training Data Processing to Every Languagehttps://arxiv.org/abs/2506.20920}	20TB and 5B documents across over 1,000 languages from almost 100 snapshots	Language-adapted filtering and dedup-informed rebalancing were evaluated across canary languages before scale-up

These reports don't establish one universally best filter. DCLM optimized benchmark quality under its evaluation setup with aggressive filtering that retains only a small fraction of documents. Nemotron-CC explicitly targeted a larger long-horizon token supply and reported both quality and quantity comparisons. FineWeb2 addresses the separate multilingual problem: thresholds, word segmentation, language identification, and deduplication behavior don't transfer cleanly from English to every language.^{[8]Reference 8DataComp-LM: In Search of the Next Generation of Training Sets for Language Modelshttps://arxiv.org/abs/2406.11794}^{[18]Reference 18Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Datasethttps://arxiv.org/abs/2412.02595}^{[19]Reference 19FineWeb2: One Pipeline to Scale Them All - Adapting Pre-Training Data Processing to Every Languagehttps://arxiv.org/abs/2506.20920}

The data wall

This token-yield pressure is often called the data wall. Villalobos et al. estimate an effective stock of about 320T public human-text tokens after quality and repetition adjustments. Under their assumptions about continued dataset growth, they project full utilization between 2026 and 2032, with a median of 2028; an assumed 5x overtraining scenario shifts that median one year earlier.^{[20]Reference 20Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Datahttps://arxiv.org/abs/2211.04325} This is a scenario projection, not evidence that usable public text is already exhausted.

projected-public-text-stock-context.py

effective_stock = 320e12
training_budgets = {
    "1T-token study": 1e12,
    "Llama 3 405B report": 15.6e12,
    "100T-token scenario": 100e12,
}

for label, tokens in training_budgets.items():
    print(f"{label:<22}: {tokens / effective_stock:>5.1%} of 320T reference stock")
print("Comparison only: it isn't a claim that stock has been consumed.")

Output

1T-token study        :  0.3% of 320T reference stock
Llama 3 405B report   :  4.9% of 320T reference stock
100T-token scenario   : 31.2% of 320T reference stock
Comparison only: it isn't a claim that stock has been consumed.

Figure: Threshold chart for the lesson's five-document example. At score thresholds 0.00, 0.50, 0.70, 0.90, and 0.95, token yield is 100, 72, 48, 28, and 12 percent, while the token-weighted mean retained classifier score is 0.661, 0.797, 0.881, 0.946, and 0.980. The selected 0.90 threshold keeps 280 of 1000 tokens. That score is still a filter proxy; proxy-model training and downstream evaluations decide whether the threshold improves the corpus.
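The yield and score columns in that chart come from a simple token-weighted sweep. The five `(tokens, score)` pairs below are back-solved so that they reproduce the chart's numbers. They are an assumption for illustration, not necessarily the lesson's original documents.

```python
# (token_count, classifier_score) pairs; values assumed, chosen to match the chart.
documents = [(280, 0.31), (240, 0.63), (200, 0.79), (160, 0.92), (120, 0.98)]


def sweep(documents, thresholds):
    """Return (threshold, token yield, token-weighted mean score) per threshold."""
    total = sum(tokens for tokens, _ in documents)
    rows = []
    for threshold in thresholds:
        kept = [(tokens, score) for tokens, score in documents if score >= threshold]
        kept_tokens = sum(tokens for tokens, _ in kept)
        mean_score = sum(tokens * score for tokens, score in kept) / kept_tokens
        rows.append((threshold, kept_tokens / total, mean_score))
    return rows


rows = sweep(documents, [0.00, 0.50, 0.70, 0.90, 0.95])
for threshold, token_yield, mean_score in rows:
    print(f"threshold {threshold:.2f}: yield {token_yield:.0%}, mean score {mean_score:.3f}")
```

Weighting by tokens rather than by documents matters here: a long low-scoring document costs more budget than a short one, so document-level retention rates can hide the real yield loss.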

Additional crawl volume and stronger curation must be compared empirically. FineWeb is a good example: in its ablations, extraction, per-crawl deduplication, and filtering choices measurably changed downstream results.^{[7]Reference 7The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scalehttps://arxiv.org/abs/2406.17557}

Synthetic data: supplement, not substitute

Synthetic data can raise signal density, but it works best when it fills a specific gap rather than replacing the whole corpus.

Where synthetic data helps

Phi-1 is one clear example. Instead of trying to mimic the raw web distribution, the dataset deliberately emphasized textbook-style explanations, exercises, and synthetic examples^{[4]Reference 4Textbooks Are All You Needhttps://arxiv.org/abs/2306.11644}. That makes synthetic data useful when you want more reasoning traces, worked solutions, or domain-specific exemplars than the open web naturally provides.

One proposed response to token-yield pressure is synthetic expansion. Nemotron-CC generates variants from source documents, including question-answer pairs and distilled passages, and reports improvements in specific training comparisons.^{[18]Reference 18Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Datasethttps://arxiv.org/abs/2412.02595} Source conditioning gives the generated text provenance, but the authors state that they didn't verify factual accuracy or fidelity for rephrased data. Grounding therefore needs validation rather than assumption.

Why provenance and diversity still matter

Synthetic text inherits teacher errors, style biases, and coverage gaps. Shumailov et al. study degradation when successive model generations train recursively on generated data, a failure mode often called model collapse.^{[21]Reference 21AI models collapse when trained on recursively generated datahttps://www.nature.com/articles/s41586-024-07566-y} This doesn't prohibit synthetic supplementation; it means mixture weight, provenance, factual fidelity, and diversity need evaluation.
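One way to make provenance actionable is to record how many generation steps separate a sample from a human-written source, then report the synthetic share of what is admitted. The `generation` field, the depth limit of 1, and the toy pool are hypothetical policy choices for illustration, not a recommendation from the cited papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    doc_id: str
    tokens: int
    generation: int  # 0 = human-written, 1 = generated from a human source, ...


MAX_GENERATION = 1  # hypothetical policy: no text generated from generated text


def admit(samples):
    """Keep samples whose provenance chain is short enough to audit."""
    return [s for s in samples if s.generation <= MAX_GENERATION]


def synthetic_share(samples):
    """Token-weighted fraction of admitted data that is model-generated."""
    total = sum(s.tokens for s in samples)
    synthetic = sum(s.tokens for s in samples if s.generation > 0)
    return synthetic / total


pool = [
    Sample("web-1", 700, 0),
    Sample("qa-from-web-1", 200, 1),
    Sample("rewrite-of-qa", 100, 2),  # recursively generated, rejected below
]
admitted = admit(pool)
share = synthetic_share(admitted)
print([s.doc_id for s in admitted], f"synthetic share: {share:.1%}")
```

Tracking the share alongside fidelity and diversity checks turns the mixture weight into something that can be ablated instead of assumed.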

Don't conflate verified synthetic-data generation with post-training reinforcement learning. Pre-training pipelines more often use offline generation plus filtering or programmatic checks before tokens are admitted.

Common mistakes and how to catch them

These are the most frequent errors teams make when building or reasoning about pre-training pipelines, with symptoms, causes, and fixes.

Mistake	Symptom	Cause	Fix
Ignoring deduplication	Model repeats memorized boilerplate; token budget spends on repeats	Duplicate text overweights copied passages	Evaluate dedup granularity; FineWeb selected per-crawl MinHash over global dedup
Filtering non-English too aggressively	Model loses code or math reasoning strength	Code and math contain many universal symbols that language classifiers mislabel	Use separate recipes for code corpora and multilingual text
Overlooking benchmark contamination	Inflated evaluation scores	Evaluation examples leaked into training data via public sources	Specify exact-overlap and task-specific fuzzy rules; audit removals
Forgetting PII policy	Model outputs personal identifiers	Raw crawls contain user-generated content with personal data	Apply documented identifier/domain controls and audit a sample before training
Using an English-centric tokenizer for multilingual models	Terrible compression and slow inference for non-English languages	Vocabulary was trained on mostly English text	Train the tokenizer on a representative multilingual sample; expect 32K-128K+ vocabularies
Treating synthetic data as automatically faithful	Rewrites introduce errors or narrow styles	Source-conditioned generation was admitted without validation	Measure fidelity, diversity, and downstream effects before changing mixture weight
Assuming one annealing schedule fits every model	Late-stage training gives inconsistent gains	Copied another team's annealing recipe without matching data mix	Treat annealing as a hyperparameter; test on a small model first

What to check before moving on

Tier	Defense target
Foundational	Map the full pipeline from raw web crawl to tokenized training batches.
Foundational	Explain why raw token count isn't the same thing as useful training signal.
Intermediate	Distinguish exact hashing from MinHash/LSH near-deduplication.
Intermediate	Explain why source weighting and multilingual weighting should follow target capabilities instead of raw byte counts.
Advanced	Design multi-stage filtering with heuristics, classifiers, perplexity, PII scrub, and benchmark decontamination.
Advanced	Apply n-gram overlap and fuzzy matching to protect evaluation benchmarks.
Advanced	Explain late-stage data annealing, sequence packing, and why their recipes are model-specific.
Advanced	Identify when synthetic data helps and when it narrows the corpus too much.
Advanced	Contrast DCLM, Nemotron-CC, and FineWeb2 and explain the tradeoff each report evaluates.

What to remember

Data sources: Filtered web text usually dominates volume, but code and curated corpora are often upsampled relative to their raw size.
Quality filtering: Published pipelines combine heuristics, classifiers, and sometimes perplexity filters in different orders; retention and quality must be evaluated.
Deduplication: Near-deduplication with MinHash/LSH is a common large-corpus approach. LSH produces candidates; a documented decision rule determines removals.
Tokenization: Tokenizer choice (byte-level BPE, WordPiece, or a SentencePiece-trained BPE or Unigram model) and vocabulary size affect compression rates and multilingual coverage.
Scale: DCLM reports a 240T-token source pool processed with hundreds of CPU nodes; production pipeline choices must preserve provenance and reproducibility at that scale.
Data packing: Best-fit packing, a bin-packing heuristic, improves training efficiency by reducing padding waste; the attention mask decides whether packed documents can see each other.
Data annealing: Late-stage annealing or high-quality upsampling can help, but the schedule is model-specific rather than a universal last-step recipe.
Synthetic data: Synthetic data is best used as a targeted supplement. Provenance, diversity, and validation matter as much as volume.
Quality vs. quantity: Filtering, deduplication, and new token supply form an empirical tradeoff, not a universal winner.
The data wall: Villalobos et al. estimate an effective public-text stock around 320T tokens and project utilization under stated scenarios; that isn't a claim of present exhaustion.

Practice drill

Build a corpus-readiness checklist for a trillion-token run:

Define source mix, sampling weights, language coverage, deduplication threshold, benchmark-decontamination rules, and PII scrub policy.
Add retention metrics by source and language so quality filters can be audited for false positives.
Record tokenizer compression rates for target languages and domains before freezing the vocabulary.
Specify shard metadata: source, license or policy class, filter version, dedupe cluster, tokenizer version, and split assignment.

The checklist should make the training corpus reproducible, auditable, and measurable before any expensive run starts.
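One checklist item, retention metrics by source and language, can be sketched as below. The filter decisions and the 25% audit floor are made-up values for illustration, and `retention_by_group` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical filter decisions: (source, language, retained?)
decisions = [
    ("web", "en", True), ("web", "en", False),
    ("web", "es", False), ("web", "es", False),
    ("code", "en", True), ("code", "en", True),
]
AUDIT_FLOOR = 0.25  # assumed threshold below which a group gets a manual audit


def retention_by_group(decisions):
    """Fraction of documents retained for each (source, language) group."""
    seen, kept = Counter(), Counter()
    for source, language, retained in decisions:
        seen[(source, language)] += 1
        kept[(source, language)] += int(retained)
    return {group: kept[group] / seen[group] for group in seen}


rates = retention_by_group(decisions)
flagged = sorted(group for group, rate in rates.items() if rate < AUDIT_FLOOR)
for group, rate in sorted(rates.items()):
    print(group, f"{rate:.0%}")
print("audit first:", flagged)
```

A group with near-zero retention isn't automatically a bug, but it is the first place to sample removed documents and look for filter false positives.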

Next Step

Continue to Build GPT from Scratch Lab

You now understand how large corpora are collected, filtered, deduplicated, tokenized, packed, and documented. The next step turns that data work into an actual language-model training run so you can see token blocks, next-token labels, validation loss, and checkpoints behave as one concrete pipeline.


References

Training Compute-Optimal Large Language Models.

Hoffmann, J., et al. · 2022 · NeurIPS 2022

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., & Carmon, Y. · 2024

The Llama 3 Herd of Models.

Dubey, A., et al. · 2024 · arXiv preprint

Textbooks Are All You Need

Gunasekar, S., et al. · 2023 · NeurIPS 2023

The Pile: An 800GB Dataset of Diverse Text for Language Modeling.

Gao et al. · 2020

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

Raffel, C., et al. · 2020 · JMLR

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Penedo, G., et al. · 2024 · NeurIPS 2024

DataComp-LM: In Search of the Next Generation of Training Sets for Language Models

Li, J., et al. · 2024

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Wenzek, G., et al. · 2019 · LREC 2020

RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset

Together Computer · 2023 · GitHub

Deduplicating Training Data Makes Language Models Better.

Lee, K., Ippolito, D., et al. · 2022 · ACL 2022

On the Resemblance and Containment of Documents.

Broder, A. Z. · 1997

Language Models are Few-Shot Learners.

Brown, T., et al. · 2020 · NeurIPS 2020

Neural Machine Translation of Rare Words with Subword Units.

Sennrich, R., Haddow, B., & Birch, A. · 2016 · ACL 2016

Language Models are Unsupervised Multitask Learners.

Radford, A., et al. · 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Devlin, J., et al. · 2019 · NAACL 2019

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.

Kudo, T. & Richardson, J. · 2018 · EMNLP 2018

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Su, D., et al. · 2024

FineWeb2: One Pipeline to Scale Them All - Adapting Pre-Training Data Processing to Every Language

Penedo, G., et al. · 2025

Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data

Villalobos, P., et al. (Epoch AI) · 2022 · arXiv preprint

AI models collapse when trained on recursively generated data

Shumailov, I., et al. · 2024 · Nature
