© 2026 LeetLLM. All rights reserved.


Pre-training Data at Scale

Understand how web-scale pre-training data is extracted, filtered, deduplicated, mixed, tokenized, and packed into training-ready shards, including decontamination, late-stage annealing, and synthetic-data tradeoffs.


Scaling laws gave us a target: if you want a capable dense model, you may need hundreds of billions or trillions of useful training tokens. Pre-training data pipelines decide what those tokens are before any fine-tuning or alignment happens. This chapter covers sourcing, cleaning, deduplication, filtering, and mixing at the scale where data quality becomes model behavior.

Imagine you are building a customer-support chatbot for an online marketplace. It needs to answer questions about return policies, shipping delays, and payment disputes. You can't write answers for every question in advance, so you decide to train an AI system on a large collection of support transcripts, product descriptions, and shipping documentation.

But where does that collection come from? You can't dump raw web pages into the model and expect useful behavior. Web-scale text includes ads, broken HTML, duplicate product listings, forum spam, private data, and benchmark questions. Training on unfiltered data would be like stocking a fulfillment center with unsorted returns, duplicate labels, and damaged boxes. Useful records are in there, but the line spends most of its time sorting waste unless the intake process is disciplined.

This article is about the engineering pipeline that turns raw data into clean training inventory. You'll learn how teams ingest large raw corpora, filter low-signal text, remove duplicates, scrub private information, and convert what's left into the integer tokens a model trains on.

From token budget to training inventory

Before we build the pipeline, recall three ideas from earlier in the curriculum:

  • Tokens: A language model doesn't read words. It reads tokens (short subword fragments like "ship", "ping", or the WordPiece-style continuation piece "##ing"). The tokenizer converts text into integers, and the model learns to predict the next integer in a sequence.
  • Next-token prediction: During pre-training, the model sees a partial sentence and learns to guess what comes next. The more diverse and high-quality the sequences it sees, the broader its knowledge becomes.
  • Scaling laws: The previous chapter turned model size into a token target. This chapter turns that target into an engineering system.

If tokens or next-token prediction feel fuzzy, review Language Modeling & Next Tokens before continuing. Everything here assumes you understand why the model needs large amounts of varied text.

How much data is enough?

Let's start with a concrete number. Suppose you want to train a 70 billion parameter model. How many tokens should you feed it?

Researchers at DeepMind answered this with the Chinchilla scaling laws. Under a fixed compute budget, they found that model size and training-token count should grow together. A useful rule of thumb for dense decoder-only transformers is about 20 training tokens per parameter[1].

For a 70B model:

  • 70 billion parameters x 20 tokens per parameter = 1.4 trillion tokens

That is more text than any hand-curated collection can supply. Reaching it means combining millions of books, web pages, and code repositories, which is why the pipeline needs industrial scale.

A companion formula for back-of-the-envelope planning is training FLOPs of about $6ND$, where $N$ is a consistently counted dense model size and $D$ is the number of training tokens. It's an approximation, and details such as whether output-layer parameters are counted can move fitted optima, but it's useful for quick capacity estimates.[1][2]

Raw token counts are misleading. A smaller corpus of textbook-quality tokens can beat a much larger pile of noisy crawl text. Scaling data only helps when the extra tokens still add real information.

Meta reports pre-training Llama 3's 405B model on 15.6T tokens, about 38.5 tokens per parameter, and says that flagship configuration is approximately compute-optimal under its own fitted laws and training budget. Separately, Meta reports training its smaller models longer than compute-optimal because those models performed better at the same inference budget.[3] Treat the 20x ratio as a planning baseline, not a hard law.

token-target-and-flop-budget.py

    def dense_training_flops(parameters: float, tokens: float) -> float:
        # Approximate training compute for a dense model: about 6 * N * D.
        return 6 * parameters * tokens

    parameters = 70e9
    target_tokens = 20 * parameters  # Chinchilla-style 20 tokens per parameter
    llama_405b_ratio = 15.6e12 / 405e9

    print(f"Chinchilla-style token target: {target_tokens / 1e12:.1f}T")
    print(f"Dense training FLOP estimate: {dense_training_flops(parameters, target_tokens):.2e}")
    print(f"Llama 3 405B reported ratio: {llama_405b_ratio:.1f} tokens/parameter")

Output

    Chinchilla-style token target: 1.4T
    Dense training FLOP estimate: 5.88e+23
    Llama 3 405B reported ratio: 38.5 tokens/parameter

The raw dock and the clean training mix

Think of the raw internet as an overloaded receiving dock. It contains valuable inventory, but it's mixed with duplicate labels, broken packaging, spam, and unsafe material. A pre-training data pipeline is the automated sorting line that extracts useful records and turns them into a training-ready mix.

Your job as the AI engineer isn't to write the text. It's to design a sorting system that processes large partitions repeatably, records each decision, and avoids throwing out high-value records.

The modern pipeline has three major phases:

  1. Ingestion: Pull raw data from web crawls, code hosts, and curated archives.
  2. Cleaning: Filter, deduplicate (remove duplicate documents), decontaminate (strip out leaked evaluation-benchmark examples), and scrub for safety.
  3. Preparation: Tokenize, shuffle, pack, and split into training shards.
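
The three phases can be sketched as a chain of small functions that each log what they decided. This is a toy illustration, not a production design: the `PipelineLog` class, the stage functions, the three-word minimum, and the whitespace "tokenizer" are all hypothetical stand-ins for much heavier components.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineLog:
    """Counts how many documents each stage kept or dropped (hypothetical helper)."""
    counts: dict[str, int] = field(default_factory=dict)

    def record(self, stage: str, outcome: str) -> None:
        key = f"{stage}:{outcome}"
        self.counts[key] = self.counts.get(key, 0) + 1


def ingest(raw_records: list[str]) -> list[str]:
    # Ingestion: strip surrounding whitespace; real extraction parses HTML or WARC files.
    return [record.strip() for record in raw_records]


def clean(docs: list[str], log: PipelineLog, min_words: int = 3) -> list[str]:
    # Cleaning: a length filter plus exact deduplication, logging every decision.
    seen: set[str] = set()
    kept = []
    for doc in docs:
        if len(doc.split()) < min_words:
            log.record("clean", "too_short")
        elif doc in seen:
            log.record("clean", "duplicate")
        else:
            seen.add(doc)
            kept.append(doc)
            log.record("clean", "kept")
    return kept


def prepare(docs: list[str]) -> list[list[str]]:
    # Preparation: a whitespace split stands in for a real subword tokenizer.
    return [doc.lower().split() for doc in docs]


raw = [
    "  Returns need a prepaid label.  ",
    "Returns need a prepaid label.",
    "Buy now!",
    "Carrier scans update delivery promises.",
]
log = PipelineLog()
shards = prepare(clean(ingest(raw), log))
print(log.counts)
print(len(shards), "documents ready for packing")
```

Keeping the log separate from the stage logic is one way to make every keep-or-drop decision auditable later.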

Figure: Pre-training data pipeline: raw web and curated sources flow through text extraction, filtering, deduplication, safety scrubbing, and tokenization before becoming packed training shards.
Trace the five colored stages from left to right. Notice that raw sources shrink while signal density rises.
Figure: Data curation pipeline: crawl data is filtered, deduplicated, decontaminated, and mixed into a higher-signal training corpus.
Compare the three quality gates at the bottom. Each gate removes a different kind of problem: low quality, duplicates, and unsafe content.

Where the text comes from

Different data sources serve different roles in the training mixture. Web data gives breadth and freshness. Code is dense in syntax, structure, and exact correctness. Books and papers provide long-form coherent explanations. Carefully curated synthetic or textbook-style data can raise signal density further.

Phi-1 is a good example: its authors report strong coding benchmark performance from a 1.3B model trained on textbook-quality and synthetic data.[4]

| Source | What it contributes | Main curation question |
|---|---|---|
| Common Crawl snapshots | Broad web coverage at large scale | Which extraction, language, quality, and deduplication rules raise signal without collapsing yield? |
| Books and papers | Long-form explanations and sustained argument | Which licenses, provenance rules, and subject areas fit the intended model? |
| Wikipedia | Structured reference text across many topics and languages | Which languages and snapshots belong in the mixture? |
| Code (for example, public repositories) | Syntax, APIs, and examples where exact structure matters | Which licenses, repository-level rules, and file filters remove generated or low-value code? |
| Curated web (for example, technical forums) | Focused discussions and worked answers | How do you preserve useful niche material without replaying copied pages? |
| Synthetic data | Targeted examples for gaps that raw sources cover poorly | Does the generated slice preserve factual fidelity, diversity, and downstream quality? |

Note: Some papers disclose approximate mixtures while many don't. Llama 3 reports a final pre-training mix of roughly 50% general knowledge, 25% mathematical and reasoning data, 17% code, and 8% multilingual data.[3] The broader pattern is that mixture weights are decisions, not raw-byte proportions.

Mixture weights, curriculum, and data scheduling

Source quality is only half the decision. You also have to decide how often each source appears and whether that weighting stays fixed through the whole run.

Three knobs matter:

  1. Source weighting: web, code, books, papers, synthetic data
  2. Schedule over time: fixed mix vs later-stage reweighting
  3. Within-source weighting: whether some slices replay more often than others

The wrong default is "sample in proportion to raw bytes." Raw crawl volume isn't the same thing as training value. The Pile is still useful as a reference because it made mixture design explicit instead of treating web text as one giant bucket.[5] Llama 3 shows a later-stage version of the same idea: even after assembling a huge corpus, training shifted toward higher-value slices during annealing.[3]

reported-mixture-token-budget.py

    reported_mix = {
        "general knowledge": 0.50,
        "math and reasoning": 0.25,
        "code": 0.17,
        "multilingual": 0.08,
    }
    budget = 1e12  # illustrative total token budget

    # The fractions must cover the whole budget.
    assert abs(sum(reported_mix.values()) - 1.0) < 1e-9
    for source, fraction in reported_mix.items():
        print(f"{source:<19}: {budget * fraction / 1e9:>3.0f}B tokens")

Output

    general knowledge  : 500B tokens
    math and reasoning : 250B tokens
    code               : 170B tokens
    multilingual       :  80B tokens

| Knob | What changes | Typical reason |
|---|---|---|
| Upsampling / downsampling | How often each source appears | Code, math, books, or multilingual text may deserve more weight than raw size suggests |
| Curriculum schedule | What the model sees earlier vs later | Keep early coverage broad, then spend late tokens on harder or cleaner data |
| Per-example weights | Which records inside one source repeat more | Keep rare but important domains from being drowned by generic text |
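
One way to see what upsampling implies is to compare each source's share of the budget with how many tokens that source actually has. The sketch below computes the implied number of passes (epochs) per source. All corpus sizes, weights, and the budget are made-up numbers chosen only for illustration.

```python
# Hypothetical corpus sizes (tokens available after cleaning) and target weights.
available_tokens = {"web": 800e9, "code": 60e9, "books": 20e9}
target_weights = {"web": 0.70, "code": 0.20, "books": 0.10}
training_budget = 500e9  # hypothetical total tokens the run will consume

assert abs(sum(target_weights.values()) - 1.0) < 1e-9

epochs = {}
for source, weight in target_weights.items():
    sampled = training_budget * weight
    # A value above 1 means the source is replayed (upsampled) during training.
    epochs[source] = sampled / available_tokens[source]
    print(f"{source:<6} sampled={sampled / 1e9:>4.0f}B  epochs={epochs[source]:.2f}")
```

Under these assumed numbers the web slice is seen less than once while the small books slice is replayed several times, which is the point where repetition and memorization risk need checking.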

This is practical curriculum learning in pretraining. It usually isn't "easy lessons first, hard lessons later." More often it means changing source weights over time so the model sees a different mix at different stages.

Two common patterns:

  • Static high-quality mix: choose one weighted mixture and keep it stable through most of training
  • Late-stage curriculum: keep broad coverage early, then upweight cleaner or more target-relevant slices later[3]

Don't confuse curriculum with annealing alone. Annealing usually means late-stage learning-rate decay plus a higher-quality mix. Curriculum is broader: any deliberate schedule that changes what the model sees as training progresses.
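
A late-stage curriculum can be sketched as a function from training step to source weights. This is a minimal illustration under stated assumptions: the source names, both weight sets, the linear interpolation, and the `anneal_steps` parameter are hypothetical and are not taken from any published recipe.

```python
def mixture_at(step: int, total_steps: int, anneal_steps: int = 1_000) -> dict[str, float]:
    """Return source weights for a step: a fixed base mix, then a linear shift."""
    base = {"web": 0.70, "code": 0.20, "curated": 0.10}    # hypothetical weights
    final = {"web": 0.40, "code": 0.25, "curated": 0.35}   # hypothetical weights
    anneal_start = total_steps - anneal_steps
    if step < anneal_start:
        return dict(base)
    # Linearly interpolate from the base mix to the final mix during the last steps.
    progress = (step - anneal_start) / max(1, anneal_steps)
    return {name: (1 - progress) * base[name] + progress * final[name] for name in base}


for step in (0, 9_000, 9_500, 10_000):
    weights = mixture_at(step, total_steps=10_000)
    print(step, {name: round(value, 3) for name, value in weights.items()})
```

Because both weight sets sum to 1, every interpolated mix also sums to 1, so the sampler never needs renormalizing.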

It also helps to know the public datasets that anchored a lot of later recipes:

| Dataset | What it emphasized | Why people still cite it |
|---|---|---|
| C4[6] | Aggressive heuristic cleanup of Common Crawl | Canonical example of turning noisy crawl text into a cleaner English web corpus |
| The Pile[5] | Deliberate source diversity across code, papers, books, forums, and web text | Useful contrast to pure web-crawl pipelines because mixture design is the core idea |
| FineWeb[7] | Web-scale extraction, filtering, and deduplication ablations over 96 Common Crawl snapshots | Good modern reference for how pipeline details measurably change downstream quality |

First quality gate: heuristic filtering

Before using expensive neural models to classify text, basic heuristic filters drop the lowest-signal material from web crawls. As seen in foundational datasets like C4 (Colossal Clean Crawled Corpus)[6], these simple rules act as a highly efficient first line of defense.

The snippet below demonstrates how these basic rules are implemented in practice. It takes a single document as input and returns True only if the document passes all checks.

first-quality-gate-heuristic-filtering.py

    def load_bad_words() -> list[str]:
        # In production, this would come from a reviewed policy list.
        return ["toxic_word_1", "toxic_word_2"]

    def quality_filter(doc: str) -> bool:
        """Illustrative web-text filter; thresholds require corpus-level validation."""
        words = doc.split()

        # Length filter: drop documents that are too short or likely concatenated.
        if len(words) < 50 or len(words) > 100_000:
            return False

        # Repetition filter: repeated n-grams are common in spam and boilerplate.
        for n in [2, 3, 4]:
            ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
            if len(ngrams) > 0:
                frac_duplicates = 1 - len(set(ngrams)) / len(ngrams)
                if frac_duplicates > 0.3:  # >30% repeated n-grams
                    return False

        # Policy-list filter: check the ratio of reviewed blocklisted terms.
        bad_words = set(load_bad_words())
        bad_count = sum(1 for w in words if w.lower() in bad_words)
        if bad_count / len(words) > 0.01:
            return False

        # Alphabetic character ratio: drop documents that are mostly symbols/code.
        alpha_chars = sum(c.isalpha() for c in doc)
        if alpha_chars / len(doc) < 0.5:
            return False

        # Sentence-ending punctuation check.
        sentences = doc.split('.')
        if len(sentences) < 3:
            return False

        return True

    procedure_doc = (
        "The marketplace return policy explains how damaged packages are reviewed. "
        "Customers upload photos, support agents verify the carrier scan history, "
        "and the system creates a prepaid label when the claim is eligible. "
        "This document contains specific procedural details, normal punctuation, "
        "and enough natural language context to be useful for model training. "
        "It avoids repeated boilerplate and gives support teams a clear workflow."
    )

    cases = {
        "procedure": (procedure_doc, True),
        "short": ("Return labels are available.", False),
        "repeated": ("sale sale sale sale sale sale sale sale sale sale " * 8, False),
        "policy-list": (procedure_doc + " toxic_word_1 toxic_word_2", False),
    }

    for name, (text, expected) in cases.items():
        kept = quality_filter(text)
        assert kept is expected
        print(f"{name:<11} -> {'keep' if kept else 'drop'}")

Output

    procedure   -> keep
    short       -> drop
    repeated    -> drop
    policy-list -> drop

What this code does, step by step

  • Length check: Very short documents often carry little context; extremely long extractions may be concatenated pages or indexes.
  • Repetition check: High repeated n-gram fractions are a useful spam or boilerplate signal, but legitimate templates and structured documents can also repeat.
  • Policy-list check: A reviewed list can identify content a specific corpus policy excludes; its threshold is a policy choice, not a universal quality label.
  • Character check: A general-text recipe can discard symbol-heavy extractions, while a code corpus needs a different recipe.
  • Sentence check: Few sentence terminators can signal a malformed prose extraction, but the rule is unsuitable as a universal language or format check.

The code intentionally uses simple thresholds so the gates are visible. C4 and FineWeb use related heuristic filtering ideas, but a real recipe must measure retention and downstream model quality on its own crawl slices before adopting thresholds.[6][7]
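
Measuring retention usually starts with knowing which rule removed each document. The sketch below tallies the first rule a document fails over a tiny sample. The rule names, the thresholds, and the simplified unique-word repetition check are assumptions made for illustration; they are not the thresholds of C4, FineWeb, or the filter above.

```python
from collections import Counter


def first_failed_rule(doc: str, min_words: int = 5, min_alpha: float = 0.5) -> str:
    """Return the name of the first rule a document fails, or 'kept'."""
    words = doc.split()
    if len(words) < min_words:
        return "too_short"
    # Simplified repetition signal: share of distinct words in the document.
    if len(set(words)) / len(words) < 0.5:
        return "repetitive"
    if sum(c.isalpha() for c in doc) / len(doc) < min_alpha:
        return "low_alpha"
    return "kept"


sample = [
    "Customers upload photos so agents can verify the damage claim.",
    "sale sale sale sale sale sale sale sale",
    "Track it.",
    "$$$ 100% !!! 24/7 >>> 50% ### 1-800 *** 2024",
]
report = Counter(first_failed_rule(doc) for doc in sample)
for reason, count in report.most_common():
    print(f"{reason:<10} {count / len(sample):.0%}")
```

A per-rule breakdown like this makes it easier to spot a single threshold that silently removes most of a valuable source.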

These heuristics usually run on general web text before the final data mix is assembled. Code corpora often use a parallel recipe, because symbol-heavy files that look low-quality to a web-text filter can still be very valuable for pre-training.

Second quality gate: classifier-based filtering

After heuristic filtering, modern pipelines apply learned filters, including classifiers trained to distinguish high-quality educational text from low-quality web chatter:

  1. Language ID: A fastText classifier screens whether the document matches the intended training languages.
  2. Quality Classifier: A classifier scores whether a document looks like reference-quality material rather than random crawl text. FineWeb-Edu uses an educational-quality classifier trained on LLM-judged scores, DCLM trains a fastText classifier on instruction-formatted and ELI5 text and keeps only the top ~10% by score, and Meta reports separate Llama 3 classifiers for general quality, code, and reasoning signals.[7][8][3]
  3. Perplexity filtering: A language model such as an n-gram model can score a document against a reference distribution. Unusually high or low scores can flag text for corpus-specific retention rules; CCNet is a published example of perplexity-based filtering.[9]

| Filter | Effect | Example |
|---|---|---|
| URL blocklist | Remove known low-quality or unsafe domains | Common in production pipelines |
| Language ID (fastText) | Keep target language(s) | CCNet[9], FineWeb[7] |
| Heuristic rules | Length, repetition, char ratios | C4[6], RedPajama[10], FineWeb[7] |
| Classifier | Quality score threshold | FineWeb-Edu[7], DCLM[8], Llama 3[3] |
| Perplexity filter | Remove gibberish | CCNet[9] |
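
To make the perplexity idea concrete, here is a minimal sketch that scores text against a tiny reference corpus. It uses a unigram model with add-one smoothing as the simplest possible stand-in for the n-gram models mentioned above; the reference sentence and both test strings are invented, and real pipelines train far larger models on far more reference text.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def train_unigram(reference: str) -> tuple[Counter, int]:
    tokens = tokenize(reference)
    return Counter(tokens), len(tokens)


def perplexity(text: str, counts: Counter, total: int) -> float:
    """Unigram perplexity with add-one smoothing over the reference vocabulary."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("Cannot score an empty document.")
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens
    log_prob = sum(
        math.log((counts[token] + 1) / (total + vocab_size)) for token in tokens
    )
    return math.exp(-log_prob / len(tokens))


reference = (
    "the return policy explains how the carrier reviews a damaged package "
    "and how the customer requests a refund label"
)
counts, total = train_unigram(reference)

in_domain = perplexity("the customer requests a refund", counts, total)
gibberish = perplexity("zxq vbn qwe rty plm", counts, total)
print(f"in-domain perplexity: {in_domain:.1f}")
print(f"gibberish perplexity: {gibberish:.1f}")
```

Text made of tokens the reference never saw scores a higher perplexity than in-domain text. Where to set the cutoff, and whether very low scores should also be flagged, remains a corpus-specific decision.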

Filtering decisions trade yield for the type of data retained. DCLM selected a fastText filter that retained its top 10% of documents after evaluating trained models, rather than assuming a score threshold was automatically better.[8]

quality-score-threshold-yield.py

    documents = [
        {"source": "proof", "tokens": 120, "score": 0.98},
        {"source": "api-doc", "tokens": 160, "score": 0.92},
        {"source": "forum", "tokens": 200, "score": 0.79},
        {"source": "news", "tokens": 240, "score": 0.63},
        {"source": "scrape", "tokens": 280, "score": 0.31},
    ]

    threshold = 0.90
    # Keep only documents at or above the score threshold, then measure token yield.
    kept = [doc for doc in documents if doc["score"] >= threshold]
    kept_tokens = sum(doc["tokens"] for doc in kept)
    total_tokens = sum(doc["tokens"] for doc in documents)

    print("kept sources:", ", ".join(doc["source"] for doc in kept))
    print(f"token yield: {kept_tokens}/{total_tokens} ({kept_tokens / total_tokens:.0%})")

Output

    kept sources: proof, api-doc
    token yield: 280/1000 (28%)
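
A fixed score cutoff is one option; keeping a top fraction by rank, as in the DCLM top-10% selection described above, is another. The sketch below shows the rank-based variant. The function name `keep_top_fraction` and the scores are hypothetical, and the real DCLM pipeline is not reproduced here.

```python
def keep_top_fraction(scored_docs: list[tuple[str, float]], fraction: float) -> list[str]:
    """Keep the highest-scoring fraction of documents (at least one)."""
    ranked = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    keep_count = max(1, round(len(ranked) * fraction))
    return [doc_id for doc_id, _ in ranked[:keep_count]]


# Hypothetical classifier scores for ten documents.
scores = [0.12, 0.95, 0.40, 0.33, 0.88, 0.05, 0.61, 0.27, 0.74, 0.50]
scored = [(f"doc{i}", score) for i, score in enumerate(scores)]

top_10_percent = keep_top_fraction(scored, fraction=0.10)
top_30_percent = keep_top_fraction(scored, fraction=0.30)
print(top_10_percent)
print(top_30_percent)
```

Rank-based selection fixes the yield and lets the score cutoff float, while a fixed threshold does the opposite.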

Removing duplicates

Deduplication is a core quality step. The internet contains large amounts of duplicated content: boilerplate headers, repeated SEO text, mirrored sites, copied documentation, and reposted articles. Lee et al. found that, on the corpora and model sizes they tested, deduplication reduced emitted memorized text by about 10x and reached the same or better validation accuracy in fewer training steps.[11]

Exact deduplication (hash-based)

This involves taking a cryptographic hash, such as SHA-256, of each document and removing exact matches. It's fast but misses near-duplicates, such as a news article where only the timestamp or a single word was changed.
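
A minimal sketch of hash-based exact deduplication follows. It keeps the first copy of each byte-identical document; the sample strings are invented, and a production system would typically normalize text and shard the hash set across machines.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    "Order shipped at 09:00.",
    "Order shipped at 09:00.",
    "Order shipped at 09:01.",  # near-duplicate: only the timestamp differs
]
unique = exact_dedup(docs)
print(f"kept {len(unique)} of {len(docs)}")
```

The third document survives even though it differs by one character, which is exactly the gap near-deduplication is meant to close.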

MinHash + LSH (near-deduplication)

MinHash with Locality-Sensitive Hashing (LSH) is one widely used approach for fuzzy deduplication at scale.[12][11] FineWeb, for example, applied MinHash independently per crawl using word 5-grams and parameters intended to target documents with at least 75% similarity; its experiments favored per-crawl rather than global deduplication.[7]

The idea, by hand: Imagine you have three short documents:

  • Doc A: "The return policy says damaged items need a prepaid return label."
  • Doc B: "The return policy says damaged items require a prepaid return label."
  • Doc C: "Carrier scans update delivery promises."

Doc A and Doc B share almost every word. If you break them into overlapping 2-word chunks (shingles), they look like this:

  • Doc A shingles: ["The return", "return policy", "policy says", "says damaged", ...]
  • Doc B shingles: ["The return", "return policy", "policy says", "says damaged", ...]

The overlap is high. Doc C shares zero shingles with A or B. MinHash turns each document into a small numeric "sketch" that estimates this overlap relationship. LSH then routes similar sketches into the same candidate set so we inspect likely duplicates rather than comparing every pair. Because LSH is approximate, a candidate still needs a final similarity decision.
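
To show what a MinHash sketch is without any library, the code below builds signatures by hand for Docs A, B, and C. It is a teaching sketch under stated assumptions: salting SHA-256 with a seed stands in for a family of random hash functions, and 256 hash functions is an arbitrary choice. Library implementations use faster hash families.

```python
import hashlib
import re


def shingles(text: str, size: int = 2) -> set[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def seeded_hash(seed: int, item: str) -> int:
    # Salting with the seed makes each seed behave like a different hash function.
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(items: set[str], num_hashes: int = 256) -> list[int]:
    # For each hash function, keep only the smallest hash value over all shingles.
    return [min(seeded_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots estimates the Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "The return policy says damaged items need a prepaid return label."
doc_b = "The return policy says damaged items require a prepaid return label."
doc_c = "Carrier scans update delivery promises."

sh_a, sh_b, sh_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
exact_ab = len(sh_a & sh_b) / len(sh_a | sh_b)
est_ab = estimate_jaccard(minhash_signature(sh_a), minhash_signature(sh_b))
est_ac = estimate_jaccard(minhash_signature(sh_a), minhash_signature(sh_c))
print(f"A vs B: exact={exact_ab:.2f} estimated={est_ab:.2f}")
print(f"A vs C: estimated={est_ac:.2f}")
```

The exact Jaccard similarity of A and B is 8 shared shingles out of 12 distinct ones, about 0.67. The estimate should land near that value, with error shrinking as the number of hash functions grows.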

Here is the full pipeline:

Figure: Near-deduplication pipeline: raw document becomes shingles, then a MinHash signature, then an LSH candidate lookup, followed by exact similarity check and keep-or-drop decision.
MinHash compresses each document into a similarity-preserving sketch. LSH narrows the search to likely matches, then an exact similarity check decides whether to keep the document or drop it as a near-duplicate.

The code below shows how to build a basic MinHash and LSH pipeline using the datasketch library. For a tiny demo corpus, it builds signatures from bigram shingles, uses LSH only for candidate lookup, and verifies candidates with exact Jaccard overlap before dropping a document.

minhash-lsh-near-deduplication.py

    import re

    from datasketch import MinHash, MinHashLSH

    def normalize(text: str) -> list[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    def shingles(text: str, shingle_size: int = 2) -> set[tuple[str, ...]]:
        tokens = normalize(text)
        if len(tokens) < shingle_size:
            raise ValueError("Document is too short for chosen shingle size.")
        return {
            tuple(tokens[i:i + shingle_size])
            for i in range(len(tokens) - shingle_size + 1)
        }

    def create_minhash(items: set[tuple[str, ...]], num_perm: int = 128) -> MinHash:
        """Create MinHash signature from normalized word shingles."""
        m = MinHash(num_perm=num_perm)
        for item in items:
            shingle = " ".join(item)
            m.update(shingle.encode("utf8"))
        return m

    def jaccard(left: set[tuple[str, ...]], right: set[tuple[str, ...]]) -> float:
        return len(left & right) / len(left | right)

    threshold = 0.5
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    documents = [
        ("doc1", "The return policy says damaged items need a prepaid return label."),
        ("doc2", "The return policy says damaged items require a prepaid return label."),
        ("doc3", "Carrier scans update delivery promises."),
    ]

    kept_shingles: dict[str, set[tuple[str, ...]]] = {}

    for doc_id, text in documents:
        doc_shingles = shingles(text)
        mh = create_minhash(doc_shingles)
        # LSH only proposes candidates; exact Jaccard makes the final decision.
        candidate_scores = [
            (candidate, jaccard(doc_shingles, kept_shingles[candidate]))
            for candidate in sorted(lsh.query(mh))
        ]
        duplicate = next(
            ((candidate, score) for candidate, score in candidate_scores if score >= threshold),
            None,
        )
        if duplicate:
            print(f"{doc_id}: drop near duplicate of {duplicate[0]} ({duplicate[1]:.2f})")
        else:
            kept_shingles[doc_id] = doc_shingles
            lsh.insert(doc_id, mh)
            print(f"{doc_id}: keep")

    print(f"Kept {len(kept_shingles)} out of {len(documents)} documents.")

Output

    doc1: keep
    doc2: drop near duplicate of doc1 (0.67)
    doc3: keep
    Kept 2 out of 3 documents.

Why this works: doc1 is inserted first. When doc2 arrives, LSH returns doc1 as a candidate; an exact Jaccard check then confirms sufficient overlap before the drop. doc3 has no confirmed neighbor, so it is kept. Production thresholds, shingle sizes, and whether deduplication operates per crawl or globally must be evaluated for the corpus; FineWeb's published recipe used 5-grams and a 75% target.[7]

| Method | Speed | Precision | Recall | Best For |
|---|---|---|---|---|
| Exact hash (SHA-256) | Very fast | Perfect | Low (exact only) | Removing byte-identical documents |
| MinHash + LSH | Fast candidate search | Tunable | Tunable | Large-corpus near-deduplication recipes[12][7] |
| Suffix array / suffix tree | Medium | Very High | Very High | Exact repeated spans and long substring deduplication |
| SimHash | Fast | Medium | Medium | Cheap approximate similarity when memory is extremely tight |

Keeping benchmarks clean: decontamination

An important and often overlooked step is decontamination: ensuring that evaluation benchmarks don't leak into the pre-training data. If a model sees MMLU, HumanEval, or GSM8K examples during pre-training, its reported performance becomes misleading because it may be memorizing instead of generalizing.

The risk is substantial. Web crawls contain GitHub repositories with coding interview questions, educational websites with standardized test questions, and forums where users paste benchmark prompts. Because training corpora and evaluation sets often draw from the same public web, contamination is a recurring risk rather than a hypothetical edge case.

N-gram overlap filtering

One common approach is n-gram overlap detection. For each document in the training corpus, check whether it contains spans from an evaluation dataset. The n-gram size, normalization, removal threshold, and code-specific rules must be recorded with the released corpus. GPT-3, for example, used a conservative 13-gram analysis; Llama 3 reports excluding benchmark training sets from its annealing data.[13][3]

| Check | What it catches | What it misses |
|---|---|---|
| Exact string or n-gram overlap | Copied benchmark prompts, answers, or code spans | Paraphrases and renamed identifiers |
| Fuzzy or task-specific similarity | Lightly edited versions worth inspection | Requires calibrated thresholds |
| Repository or source exclusion | Artifacts from a known evaluation source | Copies hosted elsewhere |

Exact-match filtering isn't enough. A coding problem with renamed variables or a question with shuffled answer choices can still leak. Pipelines may add fuzzy or task-specific checks, then document what they removed so benchmark-clean claims are auditable.

ngram-decontamination-screen.py

    import re

    def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    benchmark = "A refund is approved when the carrier confirms a damaged shipment."
    candidates = {
        "direct-copy": "Internal guide: a refund is approved when the carrier confirms a damaged shipment.",
        "paraphrase": "Authorize reimbursement after shipping damage has been verified.",
    }
    protected = ngrams(benchmark, n=4)

    for name, document in candidates.items():
        # Flag a document if it shares at least one 4-gram with the benchmark.
        flagged = bool(protected & ngrams(document, n=4))
        print(f"{name:<11} -> {'flag' if flagged else 'not caught by exact n-grams'}")

Output

    direct-copy -> flag
    paraphrase  -> not caught by exact n-grams
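
One simple fuzzy check is order-insensitive token overlap, which catches a benchmark item whose answer choices were shuffled. The sketch below is only one possible heuristic: the benchmark item, the candidates, and the 0.8 review threshold are invented, and a bag-of-words score like this still misses genuine paraphrases.

```python
import re


def token_set(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_jaccard(left: str, right: str) -> float:
    a, b = token_set(left), token_set(right)
    return len(a & b) / len(a | b)


benchmark = (
    "Which scan confirms delivery? (A) arrival scan (B) departure scan (C) proof of delivery"
)
candidates = {
    "shuffled-choices": (
        "Which scan confirms delivery? (A) proof of delivery (B) arrival scan (C) departure scan"
    ),
    "unrelated": "Carrier invoices are reconciled at the end of each month.",
}
review_threshold = 0.8  # hypothetical; calibrate on labeled contaminated/clean pairs

scores = {name: token_jaccard(benchmark, doc) for name, doc in candidates.items()}
for name, score in scores.items():
    verdict = "send to review" if score >= review_threshold else "pass"
    print(f"{name:<16} score={score:.2f} -> {verdict}")
```

Because the score ignores word order, it is best treated as a trigger for inspection, not as an automatic removal rule.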

Figure: Benchmark decontamination ladder: benchmark exemplars become exact n-gram filters, fuzzy matching, and repository rules, then suspicious training docs are dropped before final corpus release.
Decontamination is separate from ordinary deduplication. The point isn't only removing repeated text, but protecting evaluation from leaked benchmark content and near-duplicate variants.
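One way to extend the exact screen toward renamed identifiers is to normalize code before comparing n-grams. The sketch below is a minimal illustration, not a method taken from the cited reports: the `normalize_code` helper, the `ID` placeholder, the 4-gram size, and the 0.5 containment threshold are all assumptions chosen for the demo.

```python
import keyword
import re

# Identifiers, integer literals, or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")

def normalize_code(source: str) -> list[str]:
    """Tokenize code and replace every non-keyword identifier with a placeholder."""
    return [
        "ID" if re.match(r"[A-Za-z_]", tok) and not keyword.iskeyword(tok) else tok
        for tok in TOKEN_RE.findall(source)
    ]

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(benchmark: str, document: str, n: int = 4) -> float:
    """Share of the benchmark's normalized n-grams that also appear in the document."""
    protected = ngrams(normalize_code(benchmark), n)
    if not protected:
        return 0.0
    return len(protected & ngrams(normalize_code(document), n)) / len(protected)

benchmark_solution = "def add(a, b):\n    return a + b"
renamed_copy = "def total(x, y):\n    return x + y"
unrelated = "for item in rows:\n    print(item)"

THRESHOLD = 0.5  # assumed value; calibrate on labeled examples before use
for name, doc in {"renamed": renamed_copy, "unrelated": unrelated}.items():
    score = containment(benchmark_solution, doc)
    print(f"{name:<9} containment={score:.2f} flag={score >= THRESHOLD}")
```

Here the renamed copy reaches full containment while the unrelated snippet shares no normalized 4-grams. Normalizing this aggressively also raises false positives on short or generic code, so hits are better treated as candidates for inspection than as automatic removals.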

Safety and privacy scrub

After deduplication and decontamination, a pipeline may scrub personally identifiable information (PII) and unsafe content according to its data policy and legal obligations. A model trained on scraped identifiers can reproduce sensitive strings at inference time, creating privacy and compliance risk. FineWeb reports anonymizing email and public-IP addresses; Llama 3 reports removing domains likely to contain high volumes of PII or adult content.[7][3]

PII removal uses a layered approach:

  • Regex patterns: Rule-based detection can catch structured patterns such as email addresses or public IP addresses cheaply.
  • Entity or policy classifiers: Context-aware models can identify classes of content that rigid patterns miss, but require false-positive evaluation.
  • Source and domain rules: Known high-risk sources can be excluded before model training, as reported for Llama 3.[3]
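The regex layer can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the two patterns, the `[EMAIL]` and `[IP]` placeholders, and the `scrub` helper are illustrative and are not the rules FineWeb or Llama 3 used. Python's `ipaddress` module decides whether a matched address is publicly routable.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def _mask_public_ip(match: re.Match) -> str:
    text = match.group(0)
    try:
        address = ipaddress.ip_address(text)
    except ValueError:
        return text  # looks like an address but isn't valid, e.g. 999.1.1.1
    # Only mask globally routable addresses; keep private and loopback ranges.
    return "[IP]" if address.is_global else text

def scrub(text: str) -> str:
    """Replace email addresses and public IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub(_mask_public_ip, text)

sample = "Contact jane.doe@example.com from 8.8.8.8 or the LAN host 192.168.1.20."
print(scrub(sample))
```

Patterns like these are cheap but brittle: they miss obfuscated addresses and can hit version strings, which is why the classifier and source-rule layers still need their own evaluation.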

Unsafe-content filtering is policy-specific and imperfect. FineWeb applies URL-level filtering for adult content but explicitly notes that harmful documents can remain. Llama 3 combines domain rules with dirty-word counting for adult websites that domain blocklists miss.[7][3] Whatever controls you choose, audit retained and removed samples across languages and source types instead of treating one score threshold as a universal safety boundary.

Turning text into numbers: tokenization

Tokenization converts raw text into integer IDs that the model can process. At the scale of trillions of tokens, tokenization itself requires significant parallel computation.

Choosing a tokenizer

Selecting the right tokenizer involves trading off between vocabulary size, sequence length, and language support. A smaller vocabulary can force text to split into more subwords, increasing sequence length and the computational cost of attention. A larger vocabulary can improve compression when it matches the corpus, but it also requires a larger embedding matrix, which consumes memory.
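The memory side of that tradeoff is simple arithmetic: an embedding matrix holds one vector per vocabulary entry. The sketch below assumes a hidden size of 4096 and 16-bit parameters purely for illustration; neither number comes from a specific model in this article.

```python
def embedding_bytes(vocab_size: int, hidden_size: int, bytes_per_param: int = 2) -> int:
    """Bytes for one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size * bytes_per_param

HIDDEN_SIZE = 4096  # assumed model width for illustration

for vocab_size in (32_000, 128_000):
    params = vocab_size * HIDDEN_SIZE
    gib = embedding_bytes(vocab_size, HIDDEN_SIZE) / 2**30
    print(f"vocab {vocab_size:>7,}: {params / 1e6:6.1f}M params, {gib:.2f} GiB at 16-bit")
```

Under these assumptions the table grows fourfold when the vocabulary does. A model with an untied output projection carries a second matrix of the same shape, so the cost roughly doubles again.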

Common subword-tokenizer algorithms use data-driven vocabulary construction:

| Tokenizer | Algorithm | Typical vocabulary size | Used by |
| --- | --- | --- | --- |
| Byte-level BPE[14][15][3] | Bottom-up merging over bytes/subwords | 32K-128K+ | GPT-style models, Llama 3 |
| WordPiece[16] | Greedy subword vocabulary | ~30K | BERT |
| SentencePiece[17] | Framework for BPE or Unigram on raw text | 32K-256K | Llama 2, T5, multilingual models |

Training a BPE tokenizer

When you train a model family from scratch, the tokenizer should be trained on a representative sample of the pre-training data so it learns the most common subword patterns. Reusing a mature tokenizer can be a good engineering choice, but a mismatched tokenizer can waste context length, especially for multilingual or domain-heavy corpora. The code below uses the Hugging Face tokenizers library to initialize a byte-level Byte-Pair Encoding (BPE) tokenizer. For the demo, it trains from an in-memory sample; production pipelines train from a carefully mixed corpus sample and may target vocabularies such as 128,000 tokens.

training-a-bpe-tokenizer.py
```python
from tokenizers import Tokenizer, decoders, models, trainers, pre_tokenizers

# Initialize a Byte-Pair Encoding (BPE) tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on a representative sample of the corpus.
# Note: vocab_size=256 is smaller than the 256-symbol byte alphabet plus the
# 3 special tokens, so this toy run learns no merges. Raise vocab_size to see
# BPE merges shorten the encoded sequence.
trainer = trainers.BpeTrainer(
    vocab_size=256,
    special_tokens=["<s>", "</s>", "<pad>"],
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

training_sample = [
    "Customers can return damaged items with a prepaid return label.",
    "Carrier scans update delivery promises and support workflows.",
    "Los clientes pueden consultar retrasos de envio y politicas de pago.",
    "def refund_order(order_id): return create_support_ticket(order_id)",
]

tokenizer.train_from_iterator(training_sample, trainer=trainer)
encoded = tokenizer.encode("Return policy creates a support ticket.")
print(f"vocab size: {len(tokenizer.get_vocab())}, sample tokens: {len(encoded.ids)}")
```
Output
```text
vocab size: 259, sample tokens: 40
```

In a byte-level setup, you often don't need an <unk> token because any UTF-8 byte sequence can be represented.

Vocabulary size is a real design decision. A larger vocabulary can improve compression when its extra tokens match the target text, but it also increases embedding-table size and memory footprint. A smaller vocabulary is lighter, but it can produce longer sequences. Meta reports that Llama 3 moved to a 128K-token vocabulary that combines 100K tokens from the tiktoken tokenizer with 28K additional tokens for non-English support, and that compression on a sample of English data improved from 3.17 to 3.94 characters per token relative to the Llama 2 tokenizer.[3]
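To see what the reported compression change means for sequence length, the two characters-per-token rates can be applied to the same amount of text. The one-million-character sample size below is a hypothetical input; only the 3.17 and 3.94 rates come from the Llama 3 report.

```python
def tokens_for(characters: int, chars_per_token: float) -> float:
    """Estimated token count for a text of the given length."""
    return characters / chars_per_token

CHARACTERS = 1_000_000  # hypothetical English sample size

old = tokens_for(CHARACTERS, 3.17)  # Llama 2 tokenizer rate, as reported for Llama 3
new = tokens_for(CHARACTERS, 3.94)  # Llama 3 tokenizer rate, same report

print(f"old tokenizer: {old:,.0f} tokens")
print(f"new tokenizer: {new:,.0f} tokens")
print(f"token reduction: {1 - new / old:.1%}")
```

The reduction works out to a little under 20% fewer tokens for the same English text, which translates directly into more text per context window and per training step.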

Running the pipeline at scale

Large corpus pipelines require distributed execution beyond demonstration scripts. DCLM reports starting with a 240T-token Common Crawl pool containing about 200B documents and 370TB of compressed text, and processing crawl data on hundreds of AWS CPU nodes.[8]

At that scale, data is partitioned for extraction and filtering, intermediate artifacts are stored durably, and every transformation needs enough metadata to reproduce which records reached training.

Deduplication changes the system shape: signatures can be computed independently, but candidate grouping needs records with related signatures to meet. Whether that step uses a global operation or a restricted partition, such as FineWeb's per-crawl recipe, changes both resource cost and retained data quality.[7]
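The split between the parallel step and the grouping step can be sketched with a small MinHash plus LSH banding example. This is an illustrative sketch, not FineWeb's or DCLM's implementation: the signature length, the band layout, the 3-word shingles, and the BLAKE2-based `stable_hash` helper are all assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32            # assumed signature length
BANDS = 8                  # assumed banding: 8 bands of 4 rows each
ROWS = NUM_HASHES // BANDS

def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-gram shingles; very short texts fall back to a single shingle."""
    words = text.lower().split()
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return grams or {" ".join(words)}

def stable_hash(seed: int, value: str) -> int:
    """Seeded 64-bit hash that is identical on every worker and every run."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def signature(text: str) -> tuple[int, ...]:
    """MinHash signature: the minimum shingle hash under each seed."""
    items = shingles(text)
    return tuple(min(stable_hash(seed, s) for s in items) for seed in range(NUM_HASHES))

def candidate_pairs(signatures: dict[str, tuple[int, ...]]) -> set[tuple[str, str]]:
    """LSH banding: documents that share any whole band meet in one bucket."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            buckets[(band, sig[band * ROWS:(band + 1) * ROWS])].append(doc_id)
    pairs: set[tuple[str, str]] = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def jaccard(a: str, b: str) -> float:
    left, right = shingles(a), shingles(b)
    return len(left & right) / len(left | right)

docs = {
    "page-1": "customers can return damaged items with a prepaid return label within thirty days",
    "page-2": "customers can return damaged items with a prepaid return label within thirty days",
    "page-3": "carrier scans update delivery promises and support workflows every single night",
}
signatures = {doc_id: signature(text) for doc_id, text in docs.items()}  # parallel per document
for left, right in sorted(candidate_pairs(signatures)):                   # needs a shared view
    print(left, right, f"jaccard={jaccard(docs[left], docs[right]):.2f}")
```

Only `candidate_pairs` needs records from different partitions to meet, which is the step that a per-crawl recipe restricts. With this band layout, two documents with Jaccard similarity J become candidates with probability 1 - (1 - J^ROWS)^BANDS, so the banding parameters are part of the documented decision rule.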

Trillion-token scale challenges

  • Ingestion and filtering: Extraction, language ID, rules, and classifier scoring must stream through many documents without losing provenance.
  • Deduplication: Signatures are parallel-friendly; candidate grouping and representative selection need an explicit partitioning strategy.
  • Training order: Tokenized shards need randomized, documented sampling so one source doesn't unintentionally dominate a run segment.
  • Reproducibility: Engineers need to identify the source document, filter version, mixture weight, and shard for suspicious examples.
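The training-order and reproducibility points can be combined in one small sketch: each shard carries provenance fields, and the shuffle is seeded so a given epoch's order can be regenerated later. The `ShardRecord` fields, shard names, and seed format are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardRecord:
    shard_id: str
    source: str
    filter_version: str
    mixture_weight: float

def epoch_order(shards: list[ShardRecord], seed: int, epoch: int) -> list[ShardRecord]:
    """Shuffle shards reproducibly: the same (seed, epoch) always gives the same order."""
    rng = random.Random(f"{seed}-{epoch}")
    # Sort first so the result doesn't depend on the order shards were listed in.
    ordered = sorted(shards, key=lambda shard: shard.shard_id)
    rng.shuffle(ordered)
    return ordered

shards = [
    ShardRecord("shard-0000", "web-crawl", "filters-v3", 0.70),
    ShardRecord("shard-0001", "web-crawl", "filters-v3", 0.70),
    ShardRecord("shard-0002", "code", "filters-v2", 0.20),
    ShardRecord("shard-0003", "curated", "filters-v1", 0.10),
]

for shard in epoch_order(shards, seed=1234, epoch=0):
    print(shard.shard_id, shard.source, shard.filter_version)
```

Logging the seed and epoch alongside each checkpoint is enough to recover which shard, source, and filter version produced a suspicious batch.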

Data packing and sequence efficiency

Documents vary widely in length: an article might be thousands of tokens while a forum post is short. Padding each document to a fixed sequence length spends compute on padding tokens. Best-fit packing groups documents into fixed-length token blocks to reduce that waste.

best-fit-packing-utilization.py
```python
lengths = [7, 6, 5, 5, 3, 2]
capacity = 10

def best_fit_pack(items: list[int], block_size: int) -> list[list[int]]:
    bins: list[list[int]] = []
    # Place longest documents first (best-fit decreasing).
    for length in sorted(items, reverse=True):
        options = [bucket for bucket in bins if sum(bucket) + length <= block_size]
        if options:
            # Choose the block that would be left with the least free space.
            min(options, key=lambda bucket: block_size - sum(bucket) - length).append(length)
        else:
            bins.append([length])
    return bins

naive_slots = len(lengths) * capacity  # one padded block per document
packed = best_fit_pack(lengths, capacity)
packed_slots = len(packed) * capacity
tokens = sum(lengths)

print("packed blocks:", packed)
print(f"naive utilization: {tokens / naive_slots:.0%}")
print(f"packed utilization: {tokens / packed_slots:.0%}")
```
Output
```text
packed blocks: [[7, 3], [6, 2], [5, 5]]
naive utilization: 47%
packed utilization: 93%
```

Packing creates another design decision: if multiple documents share one sequence, should tokens in one document attend to tokens from another? Some causal-language-model recipes concatenate documents with end-of-document separators and keep the ordinary causal mask. Others isolate documents with a block-diagonal mask so one document can't read tokens from an earlier document in the same packed block. The isolated version below prevents artificial cross-document context while retaining packing efficiency.

packed-document-attention-mask.py
```python
document_ids = [0, 0, 0, 1, 1]

mask = [
    [
        int(previous <= current and document_ids[previous] == document_ids[current])
        for previous in range(len(document_ids))
    ]
    for current in range(len(document_ids))
]

for row in mask:
    print(" ".join(map(str, row)))
print("doc 1 token attends to doc 0:", bool(mask[3][2]))
```
Output
```text
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
0 0 0 1 0
0 0 0 1 1
doc 1 token attends to doc 0: False
```
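For contrast, the concatenate-with-separators recipe mentioned above can be sketched in a few lines. The `EOD` token ID of 0 and the choice to drop the trailing partial block are assumptions of this sketch; real recipes differ on both.

```python
EOD = 0  # assumed end-of-document token ID

def concat_and_chunk(documents: list[list[int]], block_size: int) -> list[list[int]]:
    """Join documents with EOD separators, then cut fixed-length blocks."""
    stream: list[int] = []
    for tokens in documents:
        stream.extend(tokens)
        stream.append(EOD)
    full_blocks = len(stream) // block_size  # any trailing partial block is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(full_blocks)]

documents = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
for block in concat_and_chunk(documents, block_size=4):
    print(block)
```

In this toy run every block is full, but the third document is split across the second and third blocks. That is the tradeoff against best-fit packing: no padding at all, at the cost of truncated documents and, under an ordinary causal mask, attention across document boundaries.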

Data annealing is a late-stage recipe adjustment where the learning rate decays and the data mix shifts toward higher-value slices. It isn't a universal "final 10%" rule. Llama 3, for example, reports annealing over the final 40M tokens while upsampling very high-quality sources and excluding benchmark training sets from the annealing pool[3].
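A schedule of this kind can be written as a function of the training step. The sketch below is one possible shape under stated assumptions: a constant learning rate before the anneal, a linear decay to zero, linear interpolation between two data mixes, and made-up weights and step counts. It is not the schedule of any specific model.

```python
def anneal(
    step: int,
    total_steps: int,
    anneal_steps: int,
    peak_lr: float,
    base_mix: dict[str, float],
    final_mix: dict[str, float],
) -> tuple[float, dict[str, float]]:
    """Return (learning_rate, data_mix) with a linear anneal over the last anneal_steps."""
    start = total_steps - anneal_steps
    if step < start:
        return peak_lr, dict(base_mix)
    progress = (step - start) / anneal_steps  # 0.0 at anneal start, 1.0 at the final step
    lr = peak_lr * (1.0 - progress)
    mix = {
        name: (1.0 - progress) * base_mix[name] + progress * final_mix[name]
        for name in base_mix
    }
    return lr, mix

base_mix = {"web": 0.8, "curated": 0.2}    # hypothetical main-phase weights
final_mix = {"web": 0.5, "curated": 0.5}   # hypothetical annealing-phase weights

for step in (0, 900, 950, 1000):
    lr, mix = anneal(step, total_steps=1000, anneal_steps=100, peak_lr=3e-4,
                     base_mix=base_mix, final_mix=final_mix)
    print(step, f"lr={lr:.2e}", {name: round(weight, 2) for name, weight in mix.items()})
```

Keeping the schedule as an explicit function makes `anneal_steps` and the final mix ordinary hyperparameters that can be swept on a small model before committing a large run to them.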

Quality beats quantity

Recent model and dataset papers show that data quality and token yield must be measured together once a corpus is already large.

The Phi lesson

Phi-1 reported strong coding benchmark performance for a 1.3B model trained with textbook-quality data, worked examples, and synthetic exercises.[4] Its result motivates controlled curation experiments; it doesn't make one data recipe universal.

FineWeb and FineWeb-Edu

FineWeb scales the same idea to web-scale curation. The paper builds a 15T-token dataset from 96 Common Crawl snapshots and then extracts FineWeb-Edu, a 1.3T educational subset filtered with a classifier trained on LLM-judged educational scores. The main lesson is that pipeline design matters: individual per-crawl MinHash deduplication and additional filtering each improved downstream results over weaker baselines[7].

Modern open corpora

Three published open-corpus reports illustrate different tradeoffs after FineWeb. Each changes a different part of the yield-versus-quality problem.

| Corpus | Scale and scope | Key idea |
| --- | --- | --- |
| DCLM-Baseline[8] | 3.8T-token set filtered from a 240T-token Common Crawl pool | A fastText filter retained its top 10% of documents; a 7B model trained on 2.6T tokens reached 64% 5-shot MMLU |
| Nemotron-CC[18] | 6.3T tokens: 4.4T globally deduplicated original tokens plus 1.9T synthetic tokens | Classifier ensembling and source-conditioned rephrasing target larger unique-token yield; its 1.1T high-quality subset improved MMLU by 5.6 points over DCLM in the reported 8B, 1T-token setup |
| FineWeb2[19] | 20TB and 5B documents across over 1,000 languages from almost 100 snapshots | Language-adapted filtering and dedup-informed rebalancing were evaluated across canary languages before scale-up |

These reports don't establish one universally best filter. DCLM optimized benchmark quality under its evaluation setup with aggressive retention. Nemotron-CC explicitly targeted a larger long-horizon token supply and reported both quality and quantity comparisons. FineWeb2 addresses the separate multilingual problem: thresholds, word segmentation, language identification, and deduplication behavior don't transfer cleanly from English to every language.[8][18][19]
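The token counts in the table can be turned into rough yield fractions, which makes the different targets easier to compare. The sketch only divides the reported figures; the `share` helper and the labels are illustrative.

```python
def share(part: float, whole: float) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole

reported_tokens = {
    "DCLM-Baseline kept from its pool": (3.8e12, 240e12),
    "Nemotron-CC synthetic share": (1.9e12, 6.3e12),
    "Nemotron-CC high-quality subset share": (1.1e12, 6.3e12),
}

for label, (part, whole) in reported_tokens.items():
    print(f"{label:<38}: {share(part, whole):.1%}")
```

DCLM-Baseline keeps under 2% of its pool's tokens even though its classifier retains the top 10% of the documents it scores, which suggests most of the reduction happens in the stages before the classifier. Document and token fractions aren't directly comparable, so this is a rough reading rather than an exact decomposition.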

The data wall

This token-yield pressure is often called the data wall. Villalobos et al. estimate an effective stock of about 320T public human-text tokens after quality and repetition adjustments. Under their assumptions about continued dataset growth, they project full utilization between 2026 and 2032, with a median of 2028; an assumed 5x overtraining scenario shifts that projection one to two years earlier.[20] This is a scenario projection, not evidence that usable public text is already exhausted.

projected-public-text-stock-context.py
```python
effective_stock = 320e12
training_budgets = {
    "1T-token study": 1e12,
    "Llama 3 405B report": 15.6e12,
    "100T-token scenario": 100e12,
}

for label, tokens in training_budgets.items():
    print(f"{label:<22}: {tokens / effective_stock:>5.1%} of 320T reference stock")
print("Comparison only: it isn't a claim that stock has been consumed.")
```
Output
```text
1T-token study        :  0.3% of 320T reference stock
Llama 3 405B report   :  4.9% of 320T reference stock
100T-token scenario   : 31.2% of 320T reference stock
Comparison only: it isn't a claim that stock has been consumed.
```

| Approach | Description | Expected result |
| --- | --- | --- |
| Raw Common Crawl | Minimal processing | Weakest baseline; high duplication and noise |
| Heuristic-filtered (C4-style) | Language ID + fixed quality rules | Better than raw crawl; still leaves many near-duplicates |
| Dedup + heuristic-filtered (FineWeb) | Per-crawl MinHash + additional filters | Strong public baseline; measurable downstream gains[7] |
| Classifier-filtered (DCLM-Baseline) | Model-based filtering keeps top ~10% of documents | Reported 64% 5-shot MMLU for its 7B model trained on 2.6T tokens[8] |
| Classifier ensemble + synthetic (Nemotron-CC) | Multiple classifiers plus synthetic rephrasing | Reported high-quality-subset improvement of +5.6 MMLU over DCLM in its 8B, 1T-token comparison[18] |
| Educationally curated (FineWeb-Edu / Phi-style) | Classifier or textbook-style curation plus synthetic data | Phi-1 reported strong coding benchmarks for a 1.3B model trained on textbook-quality data[4] |
Figure: Data curation experiment. Raw web data flows through filters into a curated mix, with token yield and measured model quality evaluated together.
Filtering reduces token yield. Ablation runs decide whether the retained mix improves model quality for the intended workload.

Additional crawl volume and stronger curation must be compared empirically. FineWeb is a good example: in its ablations, extraction, per-crawl deduplication, and filtering choices measurably changed downstream results.[7]

Synthetic data: supplement, not substitute

Synthetic data can raise signal density, but it works best when it fills a specific gap rather than replacing the whole corpus.

Where synthetic data helps

Phi-1 is one clear example. Instead of trying to mimic the raw web distribution, the dataset deliberately emphasized textbook-style explanations, exercises, and synthetic examples[4]. That makes synthetic data useful when you want more reasoning traces, worked solutions, or domain-specific exemplars than the open web naturally provides.

One proposed response to token-yield pressure is synthetic expansion. Nemotron-CC generates variants from source documents, including question-answer pairs and distilled passages, and reports improvements in specific training comparisons.[18] Source conditioning gives the generated text provenance, but the authors state that they didn't verify factual accuracy or fidelity for rephrased data. Grounding therefore needs validation rather than assumption.

Why provenance and diversity still matter

Synthetic text inherits teacher errors, style biases, and coverage gaps. Shumailov et al. study degradation when successive model generations train recursively on generated data, a failure mode often called model collapse.[21] This doesn't prohibit synthetic supplementation; it means mixture weight, provenance, factual fidelity, and diversity need evaluation.
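Diversity is the easiest of those properties to spot-check cheaply. One common proxy is the distinct-n ratio: unique n-grams divided by total n-grams across a sample. The sketch below uses whitespace tokenization and bigrams as simplifying assumptions, and the two sample sets are invented for illustration.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across a sample of texts."""
    total = 0
    unique: set[tuple[str, ...]] = set()
    for text in texts:
        words = text.lower().split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

templated = ["the answer is four", "the answer is nine", "the answer is ten"]
varied = [
    "carriers scan each parcel",
    "refunds need a damage report",
    "labels expire after a week",
]

print(f"templated distinct-2: {distinct_n(templated):.2f}")
print(f"varied distinct-2:    {distinct_n(varied):.2f}")
```

A low ratio on a synthetic slice hints that one template or teacher style dominates. It says nothing about factual fidelity, which needs its own checks against the source documents.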

Don't conflate verified synthetic-data generation with post-training reinforcement learning. Pre-training pipelines more often use offline generation plus filtering or programmatic checks before tokens are admitted.

Common mistakes and how to catch them

Here are the most frequent errors teams make when building or reasoning about pre-training pipelines, with symptoms, causes, and fixes.

| Mistake | Symptom | Cause | Fix |
| --- | --- | --- | --- |
| Ignoring deduplication | Model repeats memorized boilerplate; token budget spends on repeats | Duplicate text overweights copied passages | Evaluate dedup granularity; FineWeb selected per-crawl MinHash over global dedup |
| Filtering non-English too aggressively | Model loses code or math reasoning strength | Code and math contain many universal symbols that language classifiers mislabel | Use separate recipes for code corpora and multilingual text |
| Overlooking benchmark contamination | Inflated evaluation scores | Evaluation examples leaked into training data via public sources | Specify exact-overlap and task-specific fuzzy rules; audit removals |
| Forgetting PII policy | Model outputs personal identifiers | Raw crawls contain user-generated content with personal data | Apply documented identifier/domain controls and audit a sample before training |
| Using an English-centric tokenizer for multilingual models | Terrible compression and slow inference for non-English languages | Vocabulary was trained on mostly English text | Train the tokenizer on a representative multilingual sample; expect 32K-128K+ vocabularies |
| Treating synthetic data as automatically faithful | Rewrites introduce errors or narrow styles | Source-conditioned generation was admitted without validation | Measure fidelity, diversity, and downstream effects before changing mixture weight |
| Assuming one annealing schedule fits every model | Late-stage training gives inconsistent gains | Copied another team's annealing recipe without matching data mix | Treat annealing as a hyperparameter; test on a small model first |

What you should be able to defend

| Tier | You should be able to defend |
| --- | --- |
| Foundational | Map the full pipeline from raw web crawl to tokenized training batches. |
| Foundational | Explain why raw token count isn't the same thing as useful training signal. |
| Intermediate | Distinguish exact hashing from MinHash/LSH near-deduplication. |
| Intermediate | Explain why source weighting and multilingual weighting should follow target capabilities instead of raw byte counts. |
| Advanced | Design multi-stage filtering with heuristics, classifiers, perplexity, PII scrub, and benchmark decontamination. |
| Advanced | Apply n-gram overlap and fuzzy matching to protect evaluation benchmarks. |
| Advanced | Explain late-stage data annealing, sequence packing, and why their recipes are model-specific. |
| Advanced | Identify when synthetic data helps and when it narrows the corpus too much. |
| Advanced | Contrast DCLM, Nemotron-CC, and FineWeb2 and explain the tradeoff each report evaluates. |

What to remember

  • Data sources: Filtered web text usually dominates volume, but code and curated corpora are often upsampled relative to their raw size.
  • Quality filtering: Published pipelines combine heuristics, classifiers, and sometimes perplexity filters in different orders; retention and quality must be evaluated.
  • Deduplication: Near-deduplication with MinHash/LSH is a common large-corpus approach. LSH produces candidates; a documented decision rule determines removals.
  • Tokenization: The tokenizer algorithm (BPE, WordPiece, or Unigram, often via a framework such as SentencePiece) and the vocabulary size affect compression rates and multilingual capabilities.
  • Scale: DCLM reports a 240T-token source pool processed with hundreds of CPU nodes; production pipeline choices must preserve provenance and reproducibility at that scale.
  • Data packing: Best-fit packing, a bin-packing heuristic, improves training efficiency by reducing padding waste.
  • Data annealing: Late-stage annealing or high-quality upsampling can help, but the schedule is model-specific rather than a universal last-step recipe.
  • Synthetic data: Synthetic data is best used as a targeted supplement. Provenance, diversity, and validation matter as much as volume.
  • Quality vs. quantity: Filtering, deduplication, and new token supply form an empirical tradeoff, not a universal winner.
  • The data wall: Villalobos et al. estimate an effective public-text stock around 320T tokens and project utilization under stated scenarios; that isn't a claim of present exhaustion.

Practice questions with answer sketches

Question 1: You are building a pipeline for a 7B parameter model. Using the 20 tokens-per-parameter rule, how many tokens should you target? Answer: 7B x 20 = 140 billion tokens.

Question 2: A document in your crawl shares 85% of its 5-grams with a document already in the training set. Should you keep it? Why or why not? Answer: Treat it as a near-duplicate under a documented threshold, then retain one representative according to provenance and quality policy. Length alone isn't enough to decide which record is better.

Question 3: Your team is debating whether to spend $1 million on another 5T tokens of raw Common Crawl or on improving the classifier filter. What argument would you make? Answer: Demand a controlled comparison: measure token retention, domain coverage, and downstream performance for an improved filter against the added crawl. FineWeb and DCLM motivate the experiment, but neither proves the answer for this corpus.

Question 4: A junior engineer suggests training the tokenizer on English-only text for a model that must also handle Spanish customer support. What's the risk? Answer: Spanish text will be split into many more subword tokens, increasing sequence length, memory use, and inference cost for every Spanish query. The tokenizer should see a representative multilingual sample.

Question 5: Why is MinHash useful when exact hashing already exists? Answer: Exact hashing catches only byte-identical documents. MinHash sketches overlapping shingles, so it can find pages that are mostly the same after boilerplate, timestamps, whitespace, or a few words change.

Question 6: How would you protect a HumanEval-style benchmark from leakage? Answer: Start with exact string and n-gram checks against benchmark prompts and solutions, then add code-aware checks such as repository filtering and fuzzy matching for renamed variables or lightly edited solutions.

Question 7: What is data annealing? Answer: It's a late-stage recipe change where the learning rate falls and the data mix shifts toward higher-value slices. It isn't a universal final percentage; the right schedule depends on the model, data mix, and target capabilities.

Question 8: When does synthetic data help pre-training? Answer: It helps when it fills a concrete gap, such as worked code examples, reasoning traces, domain procedures, or rare languages. It becomes risky when one teacher model's style overwhelms broad human-written data, which can drift toward model collapse.

Question 9: What is the "data wall," and name two pipeline techniques teams use to push past it. Answer: The data wall is a projection that training demand may approach available public human text. Villalobos et al. estimate an effective stock near 320T tokens and a median full-utilization scenario around 2028. Teams study stronger filtering, deduplicated replay, and validated source-conditioned synthetic data to change the tradeoff.

Next Step
Continue to Build GPT from Scratch Lab

You now understand how large corpora are collected, filtered, deduplicated, tokenized, packed, and documented. The next step turns that data work into an actual language-model training run so you can see token blocks, next-token labels, validation loss, and checkpoints behave as one concrete pipeline.

References

[1] Training Compute-Optimal Large Language Models. Hoffmann, J., et al. · 2022 · NeurIPS 2022
[2] Resolving Discrepancies in Compute-Optimal Scaling of Language Models. Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., & Carmon, Y. · 2024
[3] The Llama 3 Herd of Models. Dubey, A., et al. · 2024 · arXiv preprint
[4] Textbooks Are All You Need. Gunasekar, S., et al. · 2023 · NeurIPS 2023
[5] The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Gao, L., et al. · 2020
[6] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel, C., et al. · 2020 · JMLR
[7] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. Penedo, G., et al. · 2024 · NeurIPS 2024
[8] DataComp-LM: In Search of the Next Generation of Training Sets for Language Models. Li, J., et al. · 2024
[9] CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. Wenzek, G., et al. · 2019 · LREC 2020
[10] RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset. Together Computer · 2023 · GitHub
[11] Deduplicating Training Data Makes Language Models Better. Lee, K., Ippolito, D., et al. · 2022 · NeurIPS 2022
[12] On the Resemblance and Containment of Documents. Broder, A. Z. · 1997
[13] Language Models are Few-Shot Learners. Brown, T., et al. · 2020 · NeurIPS 2020
[14] Neural Machine Translation of Rare Words with Subword Units. Sennrich, R., Haddow, B., & Birch, A. · 2016 · ACL 2016
[15] Language Models are Unsupervised Multitask Learners. Radford, A., et al. · 2019
[16] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin, J., et al. · 2019 · NAACL 2019
[17] SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Kudo, T. & Richardson, J. · 2018 · EMNLP 2018
[18] Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset. Su, D., et al. · 2024
[19] FineWeb2: One Pipeline to Scale Them All - Adapting Pre-Training Data Processing to Every Language. Penedo, G., et al. · 2025
[20] Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. Villalobos, P., et al. (Epoch AI) · 2022 · arXiv preprint
[21] AI models collapse when trained on recursively generated data. Shumailov, I., et al. · 2024 · Nature