
Continued Pretraining for Domain Shift

Learn when to keep the causal language-modeling objective and continue pretraining on domain text instead of jumping straight to SFT, and how to evaluate the trade-off against forgetting, cost, and downstream gain.

The scratch GPT lab trained a tiny model from raw text to checkpoint. Real teams usually start from a base model instead. Continued pretraining keeps the same next-token objective, but shifts the text distribution so the model spends more compute on your domain's language [1] (Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, https://aclanthology.org/2020.acl-main.740/).

Teams often confuse three different tools:

  • Retrieval-augmented generation (RAG) leaves the weights frozen and injects knowledge at inference time by retrieving documents into the prompt. Use it when facts change often or must be cited.
  • Supervised fine-tuning (SFT) changes behavior, format, and tone using curated examples such as prompt-response pairs. Use it when the model already knows the domain but answers in the wrong shape.
  • Continued pretraining (CPT) changes the weights with the same next-token objective so the model better fits domain terminology, document structure, and statistical patterns. Use it when raw domain text still confuses the base model.

Don't choose from labels alone. In Ovadia et al.'s knowledge-injection experiments, RAG outperformed unsupervised fine-tuning on MMLU and current-events questions, while repeated paraphrases helped fine-tuning on the new-fact task [2] (Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs, https://aclanthology.org/2024.emnlp-main.15/). Use that result for factual-update failures, not as a universal ranking. CPT earns an experiment when the model poorly fits the domain text distribution, not when you only need fresher facts or a different answer format.

Figure: Training ladder from base pretraining to continued pretraining, supervised fine-tuning, and preference tuning.
These stages solve different problems. Base pretraining teaches general language, continued pretraining shifts the model toward a domain's text distribution, SFT teaches interface behavior, and preference tuning decides which acceptable answers the model should prefer.

What changes and what stays the same

In continued pretraining:

  • the objective stays the same: predict the next token
  • the model architecture stays the same
  • the data distribution changes
  • the goal changes from general competence to domain adaptation

That differs from later SFT, where the model learns from curated prompt-response examples instead of unlabeled text.
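The contrast above can be made concrete by looking at which token positions contribute to the loss. The following is a minimal sketch, not taken from the article: it assumes the common (but not universal) SFT convention of scoring only response tokens, uses made-up token ids, and omits the usual one-position shift between inputs and targets. `IGNORE_INDEX`, `cpt_labels`, and `sft_labels` are hypothetical names.

```python
IGNORE_INDEX = -100  # marker for "do not score"; matches the default ignore_index of PyTorch's CrossEntropyLoss


def cpt_labels(token_ids: list[int]) -> list[int]:
    """Continued pretraining: every token in the raw document is a target."""
    return list(token_ids)


def sft_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
    """SFT under the assumed convention: only response tokens are targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def scored(labels: list[int]) -> int:
    """Count positions that would contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels)


document = [11, 12, 13, 14, 15, 16]            # hypothetical ids for unlabeled domain text
prompt, response = [21, 22, 23], [31, 32, 33]  # hypothetical ids for one curated pair

cpt = cpt_labels(document)
sft = sft_labels(prompt, response)
print(f"cpt scored {scored(cpt)} of {len(cpt)} positions")
print(f"sft scored {scored(sft)} of {len(sft)} positions")
```

The objective is next-token prediction in both cases; what differs is the data and, under this convention, which positions are scored.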

Diagnose the shift before spending training compute

A useful direct signal is held-out raw-text loss and its exponentiated form, perplexity: establish a base-model value on domain documents, then test whether CPT lowers it while a general-text control remains inside budget. A base model scoring worse on domain than general text is only a screening clue because corpora can have different inherent predictability; it doesn't by itself prove CPT will improve product tasks.

Fragmentation during tokenization is a weaker diagnostic. A fixed tokenizer may use more tokens for unfamiliar terminology, increasing context cost, but CPT doesn't change that tokenizer unless you deliberately redesign embeddings and retrain compatible weights. Use fertility as a corpus inspection signal, not a promise that continued pretraining will shorten tokenized documents.

measure_tokenizer_fertility.py

    import re

    import tiktoken

    encoder = tiktoken.get_encoding("gpt2")
    samples = {
        "general": "A developer changed the feature flag before deploy.",
        "catalog": "The release orchestration workflow reconciles failed checks.",
        "incident": "The sidecar restarted after the readiness probe failed.",
    }

    print("slice words tokens tokens_per_word")
    for name, text in samples.items():
        # Fertility = tokens per whitespace-level word for this slice.
        words = re.findall(r"\b[\w'-]+\b", text)
        tokens = encoder.encode(text)
        fertility = len(tokens) / len(words)
        print(f"{name:<9}{len(words):>5}{len(tokens):>8}{fertility:>17.2f}")

Tokenizer fertility diagnostic

    slice words tokens tokens_per_word
    general      8       9             1.12
    catalog      7      10             1.43
    incident     8      11             1.38

Don't split a validation corpus by shuffled token chunks. Near-duplicates, revisions of the same manual, or pages from the same source can land in both training and validation and make CPT look stronger than it really is. Assign a provenance or deduplication group to one split before tokenization.

group_domain_holdout.py

    import hashlib

    documents = [
        {"group": "manual-v1", "text": "scanner fault E17 means belt obstruction"},
        {"group": "manual-v1", "text": "scanner fault E18 means label obstruction"},
        {"group": "incident-runbooks", "text": "canary rollbacks require owner acknowledgement"},
        {"group": "incident-runbooks", "text": "destructive migrations require DBA approval"},
        {"group": "events-east", "text": "hub=EWR lane=42 retry=1"},
        {"group": "events-west", "text": "hub=OAK lane=11 retry=0"},
    ]

    def split_for_group(group: str) -> str:
        # Hash the group name so every document in a group lands in the same split.
        bucket = int(hashlib.sha256(group.encode()).hexdigest(), 16) % 4
        return "validation" if bucket == 0 else "train"

    splits = {"train": [], "validation": []}
    for doc in documents:
        splits[split_for_group(doc["group"])].append(doc)

    train_groups = {doc["group"] for doc in splits["train"]}
    validation_groups = {doc["group"] for doc in splits["validation"]}
    assert train_groups.isdisjoint(validation_groups)

    print(f"train_groups={sorted(train_groups)}")
    print(f"validation_groups={sorted(validation_groups)}")
    print("group leakage: none")

Grouped holdout split

    train_groups=['events-west', 'manual-v1']
    validation_groups=['events-east', 'incident-runbooks']
    group leakage: none

The two failure dynamics: forgetting and underfitting

Resuming training on a new distribution pulls the weights in two directions, and a good run balances them.

Catastrophic forgetting is loss of previously learned ability as parameters shift to absorb new data. Push too hard on domain text and broad validation quality can regress.

Underfitting is the opposite failure: train too gently and the domain leaves no real impression.

Two major controls for this balance are the learning-rate schedule and the data mixture; run length and corpus quality matter too.

Learning rate re-warming and re-decaying

A base checkpoint often finished its original cosine schedule at a very small learning rate. If you resume at that floor, adaptation may be inefficient. If you resume too aggressively, general-text loss may regress.

Ibrahim et al. (2024) study a related decoder-only continual-pretraining setting: updating a model with large new datasets after its original cosine schedule ended [3] (Simple and Scalable Strategies to Continually Pre-train Large Language Models, https://openreview.net/forum?id=DimPeeCxKO). For 405M models under English-to-English and English-to-German shifts, and a 10B-parameter model under the English-to-English shift, learning-rate re-warming, re-decaying, and replay matched retraining baselines on their reported losses and evaluation averages while spending less compute. Their experiment is evidence for testing this recipe, not permission to copy one peak learning rate into every domain run.

One subtlety from that work: re-warming can itself increase loss on old data. Sweep the peak and measure both lanes instead of assuming adaptation is free. The paper also explores schedules that aren't tied to one fixed token budget.

rewarm_redecay_schedule.py

    import math

    def rewarm_redecay(step: int, total_steps: int, warmup_steps: int, peak: float, floor: float) -> float:
        # Linear re-warm from the floor up to the peak.
        if step < warmup_steps:
            return floor + (peak - floor) * (step + 1) / warmup_steps
        # Cosine re-decay from the peak back down to the floor.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return floor + (peak - floor) * cosine

    total_steps = 1000
    warmup_steps = 50
    peak = 3e-5  # sweep this value; do not inherit it blindly
    floor = 3e-6

    for step in [0, 49, 50, 250, 999]:
        print(f"step={step:>3} lr={rewarm_redecay(step, total_steps, warmup_steps, peak, floor):.2e}")

Re-warm then re-decay schedule

    step=  0 lr=3.54e-06
    step= 49 lr=3.00e-05
    step= 50 lr=3.00e-05
    step=250 lr=2.71e-05
    step=999 lr=3.00e-06

Replay: keep prior-data signal in the mix

The second knob is replay: mix a fraction of previous or representative general-purpose data back into the incoming domain corpus. It provides training signal on broad text while the domain stream shifts the model, so it's a practical candidate for limiting regression.

How much replay? Treat it as a sweep, not a standard percentage. In Ibrahim et al.'s headline comparison, the chosen mixes use 5% replay for the SlimPajama update and 25% replay for the larger English-to-German shift [3]. Those values belong to those datasets and compute budgets. With a fixed token budget, replay also replaces some new-domain tokens, so it can reduce adaptation opportunity while controlling general regression.

compute_equivalent_replay.py

    total_tokens = 2_000_000

    print("replay_ratio domain_tokens replay_tokens total_tokens")
    for replay_ratio in [0.00, 0.05, 0.25]:
        # Replay tokens come out of the same fixed budget as domain tokens.
        replay_tokens = int(total_tokens * replay_ratio)
        domain_tokens = total_tokens - replay_tokens
        assert domain_tokens + replay_tokens == total_tokens
        print(f"{replay_ratio:>11.0%}{domain_tokens:>15,}{replay_tokens:>15,}{total_tokens:>14,}")

Compute-equivalent replay accounting

    replay_ratio domain_tokens replay_tokens total_tokens
             0%      2,000,000              0     2,000,000
             5%      1,900,000        100,000     2,000,000
            25%      1,500,000        500,000     2,000,000

Figure: Checkpoint tradeoff chart for continued pretraining showing domain gain rising early while general regression stays low at first and then worsens, with a balanced checkpoint selected before forgetting dominates.
A CPT sweep should vary coupled controls. Re-warm peak decides how aggressively weights move, and replay ratio decides how much broad-text training signal remains. Pick the best downstream/domain trade-off that stays within your general-text regression budget.
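One way to make the coupled sweep explicit is to enumerate the grid before launching anything. This is a sketch under stated assumptions: the candidate peak learning rates and the fixed token budget are hypothetical values, the 5% and 25% replay ratios echo the study-specific mixes mentioned earlier, and the run-naming scheme is invented for illustration.

```python
from itertools import product

peak_lrs = [1e-5, 3e-5, 1e-4]        # hypothetical candidates; choose a range suited to your checkpoint
replay_ratios = [0.00, 0.05, 0.25]   # 0% is a no-replay baseline
total_tokens = 2_000_000             # held fixed so every run is compute-equivalent

runs = [
    {
        "name": f"lr{peak:.0e}-replay{round(ratio * 100):02d}",
        "peak_lr": peak,
        "replay_ratio": ratio,
        "domain_tokens": total_tokens - int(total_tokens * ratio),
        "total_tokens": total_tokens,
    }
    for peak, ratio in product(peak_lrs, replay_ratios)
]

for run in runs:
    print(f"{run['name']:<18} domain_tokens={run['domain_tokens']:,}")
print(f"runs={len(runs)}")
```

Every run in the grid then gets the same two-lane evaluation, so the regression budget can be applied uniformly when picking a checkpoint.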

When continued pretraining is the right tool

Reach for continued pretraining when the domain has its own language that the base model under-serves:

  • internal incident policies with recurring entity patterns
  • incident event streams and service jargon
  • long runbook or compliance documents
  • dense technical manuals
  • codebases with domain-specific APIs and naming conventions

The trigger isn't "the business wants custom behavior." The model needs more exposure to the domain's text distribution before post-training behavior shaping makes sense.

Good signals

| Signal | Why it points to continued pretraining |
| --- | --- |
| Model misreads domain terminology | It lacks token-distribution familiarity, not response style alone |
| Long domain documents feel unnatural to the model | The base corpus underrepresented this text type |
| Raw completions are weak even before instruction formatting | The issue appears before chat behavior enters the picture |
| You have lots of domain text but few high-quality prompt-response labels | Continued pretraining can exploit unlabeled corpora |

Bad signals

| Signal | Better tool |
| --- | --- |
| Model knows the facts but answers in the wrong format | SFT |
| You need fresh, frequently changing, or citable facts | RAG |
| Model needs one task-specific classifier head | Supervised fine-tuning with a classifier head |
| Model is mostly correct but chooses the wrong safe vs. unsafe answer | Preference optimization |

Domain-adaptive vs task-adaptive pretraining

The 2020 "Don't Stop Pretraining" paper made this distinction explicit in masked-language-model experiments with RoBERTa [1]:

  • DAPT (domain-adaptive pretraining): keep training on large unlabeled domain text such as incident runbooks or service notes
  • TAPT (task-adaptive pretraining): continue on the task's own unlabeled inputs, even when the corpus is smaller

The decision remains useful for decoder-only LLM projects, but don't silently transfer RoBERTa's quantitative gains to a causal base model. You still have to measure whether more exposure to the target text distribution improves your model and downstream task.

A practical decision test

Suppose you're building an incident model for service exception handling. Test raw domain-text continuation and prompt-response behavior separately, then compare failure modes. If the model can't continue the underlying service log or incident note coherently, that points to continued pretraining. If raw continuation is competent but assistant behavior is weak, that points more directly to SFT.
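The decision test above can be written down as a tiny rule so the team agrees on it before looking at results. This is only a sketch: the pass-rate thresholds, the probe names, and the scores are hypothetical, and `recommend_stage` is an illustrative helper, not an established procedure.

```python
RAW_TEXT_MIN = 0.70    # hypothetical pass-rate thresholds; calibrate on your own probes
ASSISTANT_MIN = 0.70


def recommend_stage(raw_pass_rate: float, assistant_pass_rate: float) -> str:
    """Check raw-text continuation first, then prompt-response behavior."""
    if raw_pass_rate < RAW_TEXT_MIN:
        return "continued pretraining"
    if assistant_pass_rate < ASSISTANT_MIN:
        return "SFT"
    return "no training change indicated"


# (raw continuation pass rate, assistant behavior pass rate), made-up numbers
probes = {
    "garbled incident notes": (0.41, 0.38),
    "fluent notes, poor answers": (0.86, 0.52),
    "both lanes healthy": (0.88, 0.81),
}

for name, (raw, assistant) in probes.items():
    print(f"{name}: {recommend_stage(raw, assistant)}")
```

Raw continuation is checked first because weak assistant behavior is hard to interpret when the model can't yet continue the underlying domain text.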

Data for continued pretraining

The same discipline from large-scale pretraining still applies:

  • filter low-quality text
  • deduplicate aggressively
  • remove benchmarks and eval leakage
  • scrub PII and sensitive content
  • keep provenance and usage rights for every corpus slice

The corpus can be narrower and more targeted. Domain data can also be more sensitive than public pretraining text, so provenance, access control, and removal procedures are product requirements, not cleanup tasks.

Gate the corpus before tokenization

Keep a manifest that records whether a source may be trained on, whether it contains unresolved sensitive content, and whether it's reserved for evaluation. A high-quality domain document that fails one of these gates doesn't belong in the training stream.

gate_domain_manifest.py

    sources = [
        {"name": "public-manuals", "tokens": 800_000, "licensed": True, "pii_scrubbed": True, "eval_only": False},
        {"name": "support-notes", "tokens": 120_000, "licensed": True, "pii_scrubbed": False, "eval_only": False},
        {"name": "heldout-probes", "tokens": 25_000, "licensed": True, "pii_scrubbed": True, "eval_only": True},
        {"name": "vendor-export", "tokens": 300_000, "licensed": False, "pii_scrubbed": True, "eval_only": False},
    ]

    # A source must pass every gate to enter the training stream.
    accepted = [
        row for row in sources
        if row["licensed"] and row["pii_scrubbed"] and not row["eval_only"]
    ]
    rejected = [row["name"] for row in sources if row not in accepted]

    print(f"accepted={[row['name'] for row in accepted]}")
    print(f"training_tokens={sum(row['tokens'] for row in accepted):,}")
    print(f"rejected={rejected}")

Corpus manifest gate

    accepted=['public-manuals']
    training_tokens=800,000
    rejected=['support-notes', 'heldout-probes', 'vendor-export']

Keep evaluation text out of training

For a small exact-overlap gate, normalize text and hash it before building token blocks. Production pipelines also need near-duplicate detection, because formatting changes and partial copies will evade exact hashes.

remove_exact_eval_overlap.py

    import hashlib

    def fingerprint(text: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting changes hash the same.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    heldout = [
        "Fault E17: belt obstruction. Clear belt and retry.",
        "Returns above $500 require supervisor approval.",
    ]
    candidate_training = [
        "Scanner firmware notes for version 4.2.",
        " fault E17: BELT obstruction. clear belt and retry. ",
        "Lane timeout codes and remediation steps.",
    ]

    heldout_hashes = {fingerprint(text) for text in heldout}
    clean_training = [
        text for text in candidate_training
        if fingerprint(text) not in heldout_hashes
    ]

    print(f"removed={len(candidate_training) - len(clean_training)}")
    print(f"kept={len(clean_training)}")
    assert all(fingerprint(text) not in heldout_hashes for text in clean_training)

Exact evaluation decontamination

    removed=1
    kept=2
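For the near-duplicate case mentioned above, one simple option is word-shingle overlap. The sketch below is an illustration rather than the article's method: it compares every candidate against every held-out text using Jaccard similarity over word bigrams, with a hypothetical threshold of 0.5. Large corpora usually need a scalable approximation such as MinHash instead of all-pairs comparison.

```python
def shingles(text: str, size: int = 2) -> set[tuple[str, ...]]:
    """Lowercase, replace punctuation with spaces, and return the set of word n-grams."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    words = cleaned.split()
    return {tuple(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}


def jaccard(left: set, right: set) -> float:
    """Shared shingles divided by all distinct shingles."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


THRESHOLD = 0.5  # hypothetical cutoff; tune it on labeled duplicate pairs

heldout = ["Fault E17: belt obstruction. Clear belt and retry."]
candidate_training = [
    "Fault E17 - belt obstruction; clear the belt and retry",  # reworded copy, evades exact hashing
    "Lane timeout codes and remediation steps.",
]

heldout_shingles = [shingles(text) for text in heldout]
clean_training = [
    text for text in candidate_training
    if all(jaccard(shingles(text), reference) < THRESHOLD for reference in heldout_shingles)
]

print(f"removed={len(candidate_training) - len(clean_training)}")
print(f"kept={clean_training}")
```

The reworded first candidate differs from the held-out text in punctuation and one inserted word, so an exact hash would keep it while the shingle comparison flags it.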

Mixing strategy

Don't assume 100% domain text is always optimal. In practice, teams mix:

  • a high-quality domain slice
  • a smaller replay slice of general text

That replay is one guardrail against forgetting. The exact ratio is empirical: define candidate ratios, hold total training tokens fixed, and select with domain-gain and broad-regression metrics. If the model forgets too much general language while specializing, the run overshot.

BloombergGPT is a useful contrast, not replay evidence: it was trained from scratch on 51.27% financial and 48.73% public tokens, and reports strong financial performance while remaining competitive on general-purpose benchmarks [4] (BloombergGPT: A Large Language Model for Finance, https://arxiv.org/abs/2303.17564). It shows that corpus composition should be explicit. It doesn't identify the right CPT replay ratio for your checkpoint.

Pack blocks and preserve the mixture

CPT uses the same causal objective as base pretraining. A common loader recipe joins document token sequences with end-of-document markers and emits full blocks. The separator marks a boundary, but it doesn't prevent cross-document attention by itself. As the data-pipeline chapter explained, choose explicitly between an ordinary causal mask and a document-isolated block-diagonal mask. Small integer token sequences make separator placement inspectable.

pack_domain_token_blocks.py

    EOS = 0
    block_size = 6
    documents = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

    # Join documents into one stream, closing each with an end-of-document marker.
    stream = []
    for document in documents:
        stream.extend(document + [EOS])

    # Emit only full blocks.
    blocks = [
        stream[start:start + block_size]
        for start in range(0, len(stream) - block_size + 1, block_size)
    ]

    print(f"stream={stream}")
    print(f"blocks={blocks}")
    assert all(len(block) == block_size for block in blocks)
    assert EOS in blocks[0]

Packed CPT token blocks

    stream=[11, 12, 13, 0, 21, 22, 0, 31, 32, 33, 34, 0]
    blocks=[[11, 12, 13, 0, 21, 22], [0, 31, 32, 33, 34, 0]]

The slicing logic drops any incomplete final block instead of padding it. Here the 12-token stream divides evenly into two blocks, so nothing is lost. Production loaders need an explicit remainder policy.
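To see what the choice between an ordinary causal mask and a document-isolated mask means for a packed block, the sketch below builds both for the first block of the packing example. It is an illustration under assumptions: `document_ids` and `attention_mask` are hypothetical helpers, and each separator is treated as belonging to the document it closes.

```python
EOS = 0
block = [11, 12, 13, 0, 21, 22]  # first packed block from the example above


def document_ids(tokens: list[int], eos: int) -> list[int]:
    """Label each position with a document index; EOS closes its own document."""
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == eos:
            current += 1
    return ids


def attention_mask(tokens: list[int], eos: int, isolate_documents: bool) -> list[list[int]]:
    """mask[q][k] == 1 when query position q may attend to key position k."""
    doc = document_ids(tokens, eos)
    size = len(tokens)
    return [
        [int(k <= q and (not isolate_documents or doc[k] == doc[q])) for k in range(size)]
        for q in range(size)
    ]


causal = attention_mask(block, EOS, isolate_documents=False)
isolated = attention_mask(block, EOS, isolate_documents=True)

print(f"document_ids={document_ids(block, EOS)}")
print(f"causal row for token 21:   {causal[4]}")
print(f"isolated row for token 21: {isolated[4]}")
```

Under the ordinary causal mask, token 21 can attend to the previous document's tokens and its separator. Under the block-diagonal mask it sees only positions from its own document.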

Once domain and replay streams are packed, make mixture selection explicit and auditable. Here each twenty-block training window uses a seeded shuffle with the requested replay count.

build_replay_mixture.py

    import random

    def make_window(
        domain_blocks: list[str],
        replay_blocks: list[str],
        replay_ratio: float,
        size: int,
    ) -> list[str]:
        if not 0.0 <= replay_ratio <= 1.0:
            raise ValueError("replay_ratio must be between 0 and 1")
        replay_count = round(size * replay_ratio)
        domain_count = size - replay_count
        if len(domain_blocks) < domain_count or len(replay_blocks) < replay_count:
            raise ValueError("not enough packed blocks for requested window")
        chosen = domain_blocks[:domain_count] + replay_blocks[:replay_count]
        # Fixed seed keeps the interleaving reproducible and auditable.
        random.Random(7).shuffle(chosen)
        return chosen

    domain_blocks = [f"domain-{index}" for index in range(20)]
    replay_blocks = [f"general-{index}" for index in range(20)]
    window = make_window(domain_blocks, replay_blocks, replay_ratio=0.25, size=20)

    domain_count = sum(item.startswith("domain") for item in window)
    replay_count = sum(item.startswith("general") for item in window)
    print(f"domain_blocks={domain_count} replay_blocks={replay_count}")
    print(f"first_five={window[:5]}")
    assert (domain_count, replay_count) == (15, 5)

Deterministic replay mixture

    domain_blocks=15 replay_blocks=5
    first_five=['general-2', 'general-0', 'domain-11', 'general-3', 'domain-7']

Evaluation: domain gain without lying to yourself

Continued pretraining needs two evaluation lanes at the same time.

Lane 1: domain gain

Measure:

  • domain validation perplexity
  • retrieval or classification tasks in the domain
  • generation quality on held-out domain documents
  • downstream task lift after later SFT

Lane 2: general regression

Measure:

  • a small broad-language validation slice
  • a lightweight general benchmark set
  • free-form generations outside the target domain

If you only watch domain gain, you can accidentally produce a model that sounds like one incident runbook and forgot how to write broadly coherent language.

Evaluate loss in comparable token units. Perplexity is exp(mean negative log-likelihood), so aggregate token-level loss before exponentiating; don't average document perplexities and call the result a corpus metric.

report_domain_and_general_ppl.py

    import math

    base = {
        "domain": {"negative_log_likelihood": 840.0, "tokens": 240},
        "general": {"negative_log_likelihood": 540.0, "tokens": 200},
    }
    adapted = {
        "domain": {"negative_log_likelihood": 720.0, "tokens": 240},
        "general": {"negative_log_likelihood": 548.0, "tokens": 200},
    }

    def perplexity(metrics: dict[str, float]) -> float:
        # Summed token-level loss divided by token count, then exponentiated.
        return math.exp(metrics["negative_log_likelihood"] / metrics["tokens"])

    print("lane base_ppl adapted_ppl delta")
    for lane in ["domain", "general"]:
        base_ppl = perplexity(base[lane])
        adapted_ppl = perplexity(adapted[lane])
        print(f"{lane:<8}{base_ppl:>9.2f}{adapted_ppl:>13.2f}{adapted_ppl - base_ppl:>7.2f}")

Two-lane perplexity report

    lane base_ppl adapted_ppl delta
    domain      33.12        20.09 -13.03
    general     14.88        15.49   0.61
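The warning about averaging document perplexities is easy to check numerically. In this sketch the two documents and their loss totals are made up: one short, well-predicted document and one long, poorly predicted one.

```python
import math

# Hypothetical per-document totals: summed negative log-likelihood (nats) and token count.
documents = [
    {"negative_log_likelihood": 20.0, "tokens": 10},    # short, easy: 2.0 nats per token
    {"negative_log_likelihood": 760.0, "tokens": 190},  # long, hard: 4.0 nats per token
]

total_nll = sum(doc["negative_log_likelihood"] for doc in documents)
total_tokens = sum(doc["tokens"] for doc in documents)

# Corpus metric: aggregate token-level loss first, then exponentiate once.
corpus_ppl = math.exp(total_nll / total_tokens)

# Misleading alternative: exponentiate per document, then average with equal weight.
mean_doc_ppl = sum(
    math.exp(doc["negative_log_likelihood"] / doc["tokens"]) for doc in documents
) / len(documents)

print(f"corpus_ppl={corpus_ppl:.2f}")
print(f"mean_of_document_ppl={mean_doc_ppl:.2f}")
```

The two numbers disagree because the per-document average gives the 10-token document the same weight as the 190-token one, so it no longer describes loss per token over the corpus.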

Runnable checkpoint ledger

The simplest useful artifact is a checkpoint ledger. It doesn't train a model; it shows how to choose between checkpoints after a continued-pretraining sweep. Domain perplexity can improve while general text gets worse, so the chosen checkpoint needs to pass both lanes. Use general regression as a hard gate. Among survivors, rank downstream probe accuracy first and use domain perplexity as a tie-breaker. That keeps the policy visible instead of hiding trade-offs inside an arbitrary weighted score.

continued_pretraining_checkpoint_picker.py

    checkpoints = [
        {"name": "base", "domain_ppl": 42.0, "general_ppl": 19.2, "probe_acc": 0.62},
        {"name": "cpt-1k", "domain_ppl": 31.5, "general_ppl": 19.5, "probe_acc": 0.68},
        {"name": "cpt-4k", "domain_ppl": 27.9, "general_ppl": 20.1, "probe_acc": 0.72},
        {"name": "cpt-12k", "domain_ppl": 25.8, "general_ppl": 23.9, "probe_acc": 0.71},
    ]

    base = checkpoints[0]
    max_general_regression = 1.5

    print("checkpoint domain_gain general_regression probe_acc keep")
    best = None
    best_rank = None

    for row in checkpoints:
        domain_gain = base["domain_ppl"] - row["domain_ppl"]
        general_regression = row["general_ppl"] - base["general_ppl"]
        # Hard gate: general-text regression must stay inside the budget.
        keep = general_regression <= max_general_regression
        # Among survivors: probe accuracy first, lower domain perplexity as tie-breaker.
        rank = (row["probe_acc"], -row["domain_ppl"])

        if keep and (best_rank is None or rank > best_rank):
            best = row
            best_rank = rank

        print(
            f"{row['name']:<10}"
            f"{domain_gain:>11.1f}"
            f"{general_regression:>20.1f}"
            f"{row['probe_acc']:>11.2f}"
            f" {'yes' if keep else 'no'}"
        )

    print(f"chosen={best['name']}")
    print("reason=best downstream probe, then domain perplexity, inside general-regression budget")

Checkpoint trade-off ledger

    checkpoint domain_gain general_regression probe_acc keep
    base              0.0                 0.0       0.62 yes
    cpt-1k           10.5                 0.3       0.68 yes
    cpt-4k           14.1                 0.9       0.72 yes
    cpt-12k          16.2                 4.7       0.71 no
    chosen=cpt-4k
    reason=best downstream probe, then domain perplexity, inside general-regression budget

Stopping rules

Because continued pretraining keeps the same objective, it can feel deceptively safe. It isn't safe by default.

Good stopping cues:

  • domain validation loss flattens
  • downstream gains after a probe SFT stop improving
  • general regressions start to outweigh domain benefits

Bad stopping cues:

  • "we still have more domain text"
  • "loss is still going down a little"

More steps aren't a free lunch once the domain shift is already absorbed.
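The good stopping cues can be turned into an explicit check that runs at every evaluation point. This is a sketch with hypothetical thresholds (1% relative improvement, a patience of two evaluations, a 1.5-point general-perplexity budget) and made-up loss histories. `stop_reason` is an illustrative helper.

```python
from typing import Optional


def stop_reason(
    domain_losses: list[float],
    general_regression: float,
    *,
    min_rel_improvement: float = 0.01,  # hypothetical: under 1% counts as flat
    patience: int = 2,                  # hypothetical: consecutive flat evaluations required
    regression_budget: float = 1.5,     # hypothetical: allowed rise in general perplexity
) -> Optional[str]:
    """Return why the run should stop, or None to keep training."""
    if general_regression > regression_budget:
        return "general regression exceeds budget"
    recent = domain_losses[-(patience + 1):]
    if len(recent) == patience + 1:
        improvements = [(prev - cur) / prev for prev, cur in zip(recent, recent[1:])]
        if all(delta < min_rel_improvement for delta in improvements):
            return "domain validation loss has flattened"
    return None


print(stop_reason([3.50, 3.20, 3.05], general_regression=0.4))
print(stop_reason([3.50, 3.20, 3.05, 3.04, 3.035], general_regression=0.4))
print(stop_reason([3.50, 3.20, 3.05], general_regression=2.0))
```

In the second history the loss is still going down a little, which is exactly the case the threshold treats as flat. A fuller version would also track downstream probe lift, which this sketch leaves out.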

Where it fits relative to LoRA and SFT

Compare the choices by asking what you want to change.

| Goal | Best first tool |
| --- | --- |
| Inject fresh or citable facts without retraining | RAG |
| Teach new domain language patterns | Continued pretraining |
| Teach chat or task format | SFT |
| Run a behavior update without full-weight training | SFT with LoRA / QLoRA adapters |
| Choose between multiple acceptable responses | DPO or RLHF |

LoRA and QLoRA are parameter-efficient implementation choices; QLoRA also stores the frozen base model in quantized form [5] (QLoRA: Efficient Finetuning of Quantized Language Models, https://arxiv.org/abs/2305.14314). They don't determine what supervision teaches. An adapter can be trained with a next-token domain-text objective or with prompt-response SFT. First choose objective from the failure mode, then choose full-weight or parameter-efficient training from budget and deployment constraints.
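A quick count shows why adapters are a budget decision rather than an objective decision. The sketch assumes the standard LoRA form, in which a frozen weight matrix of shape d_out × d_in receives a trainable low-rank update built from two matrices of shapes d_out × r and r × d_in. The 4096-wide layer and the ranks are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the two low-rank factors of one adapter."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical projection size
full = full_params(d_in, d_out)

for rank in [8, 64]:
    adapter = lora_params(d_in, d_out, rank)
    print(f"rank={rank:>2} adapter={adapter:>9,} full={full:,} fraction={adapter / full:.2%}")
```

The count is the same whether the adapter is trained on raw domain text or on prompt-response pairs. Only the data and the scored positions change.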

A strong training stack often looks like:

  1. base model
  2. continued pretraining on domain corpus
  3. SFT on curated prompt-response data
  4. preference optimization if needed

Not every product needs every stage. Choose the stage that matches the failure you observe.

Common pitfalls

Using continued pretraining to fix assistant tone

  • Symptom: the model still formats answers badly after a long domain-text run.

  • Cause: the issue was interface behavior, not domain language exposure.

  • Fix: move to SFT sooner.

Over-specializing on one corpus

  • Symptom: domain completions improve, but the model becomes narrow or brittle elsewhere.

  • Cause: no replay mixture, or too many adaptation steps.

  • Fix: keep a general-text regression lane and stop earlier.

Skipping the downstream check

  • Symptom: domain perplexity improves, but the final task model barely benefits.

  • Cause: the adaptation run optimized text fit that did not transfer to the product task.

  • Fix: probe the adapted checkpoint with a small downstream SFT instead of judging only by perplexity.

Mastery check

Defend these points:

  • Continued pretraining keeps causal language-modeling loss while changing corpus distribution; SFT changes supervision format to prompt-response examples.
  • Use continued pretraining when raw domain text is weak, and use SFT when domain understanding is fine but interface behavior is wrong.
  • For checkpoints that ended at a low learning rate, test re-warming and re-decaying plus replay ratios; Ibrahim et al.'s 5% and 25% mixes are study-specific reference points, not defaults [3].
  • Reach for RAG when facts change or must be cited; reach for CPT only when raw domain text itself confuses the base model.
  • Track domain gain, downstream probe lift, and broad-language regression together before choosing a checkpoint.
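
One way to track the three signals together is to treat general regression as a hard gate and rank the survivors by downstream probe accuracy. The sketch below is one possible policy, not the only reasonable one; the `Checkpoint` fields, the threshold, and all metric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str
    domain_ppl: float   # lower is better
    general_ppl: float  # lower is better
    probe_acc: float    # higher is better


def select_checkpoint(checkpoints, base_general_ppl, max_general_increase):
    """Gate on general regression, then rank by probe accuracy.

    Lower domain perplexity is used only as a tie-breaker.
    Returns None if no checkpoint passes the gate.
    """
    passing = [
        c for c in checkpoints
        if c.general_ppl - base_general_ppl <= max_general_increase
    ]
    if not passing:
        return None
    return min(passing, key=lambda c: (-c.probe_acc, c.domain_ppl))


# Illustrative values only.
candidates = [
    Checkpoint("step-1000", domain_ppl=14.0, general_ppl=10.2, probe_acc=0.61),
    Checkpoint("step-2000", domain_ppl=12.5, general_ppl=10.8, probe_acc=0.66),
    Checkpoint("step-3000", domain_ppl=11.0, general_ppl=12.4, probe_acc=0.67),
]
best = select_checkpoint(candidates, base_general_ppl=10.0, max_general_increase=1.0)
print(best.name)  # step-2000: step-3000 has the best probe score but fails the gate
```

Note that the checkpoint with the lowest domain perplexity is rejected here, which is the behavior the last bullet asks you to defend.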

Evaluation rubric

  • Strong: separates RAG, CPT, and SFT by learning objective, then explains that LoRA changes parameterization rather than choosing the objective.
  • Strong: explains forgetting and underfitting as opposite failures controlled mainly by re-warm peak and replay ratio.
  • Strong: uses two evaluation lanes at once: domain gain and general regression.
  • Weak: chooses continued pretraining for changing facts or assistant tone problems that should start with RAG or SFT.
  • Weak: picks the final checkpoint only because domain perplexity kept falling.

Follow-up questions

Prompt: What is the core difference between continued pretraining and SFT?
Answer sketch: Continued pretraining feeds unlabeled or weakly structured domain text through the same next-token objective. SFT trains on prompt-response examples to teach answer format, task behavior, and interface style.

Prompt: When is continued pretraining a better first move than SFT? Can LoRA decide that?
Answer sketch: Choose CPT when the base model is weak on domain language itself; choose SFT when it understands the text but answers poorly. LoRA can't decide between them because it can parameterize either objective.

Prompt: A team wants the model to answer questions about this week's pricing rules. CPT, SFT, or RAG?
Answer sketch: RAG. The facts change often and should be citable, so retrieving them at inference beats baking them into weights. CPT is for absorbing the domain's language, not for chasing fast-moving facts.

Prompt: How do you limit general regression during continued pretraining?
Answer sketch: Sweep re-warm/re-decay schedules and replay ratios, then watch a general-text regression lane beside domain gain. Don't assume a paper's replay percentage transfers to your data.

Prompt: How do you know a continued-pretraining run went too far?
Answer sketch: Domain metrics improve, but broad validation regresses, generations become narrow, or downstream probe SFT stops improving. Choose an earlier checkpoint with better overall trade-off.
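
To make "re-warm and re-decay" concrete, the sketch below computes a learning rate that climbs from a low value to a new peak and then decays back down. Linear warmup and cosine decay are assumed shapes chosen for illustration; the function name, step counts, and learning rates are hypothetical, and the peak is exactly the kind of value you would sweep.

```python
import math


def rewarm_redecay_lr(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Linear re-warm from min_lr to peak_lr, then cosine re-decay to min_lr."""
    if step < warmup_steps:
        # Re-warm: climb linearly from the low rate the checkpoint ended at.
        return min_lr + (peak_lr - min_lr) * (step / warmup_steps)
    # Re-decay: cosine from peak_lr back down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print a few points along an illustrative 1,000-step run.
for s in (0, 50, 100, 550, 1000):
    print(s, rewarm_redecay_lr(s, total_steps=1000, warmup_steps=100,
                               peak_lr=3e-4, min_lr=3e-5))
```

A higher peak tends to help the model move toward the domain but raises the risk of general regression, which is why the schedule is swept together with the replay ratio rather than on its own.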

What to remember

  • Continued pretraining keeps the same causal LM objective and changes the corpus.
  • It's best when the problem is domain language weakness, not a missing fact (RAG) or a wrong format (SFT).
  • Balance forgetting against underfitting by evaluating learning-rate re-warming/re-decaying and replay mixtures.
  • Replay ratios are experimental choices; measure them under a fixed token budget and a general-regression gate.
  • You still need filtering, deduplication, provenance, and leakage control.
  • Domain gain must be tracked beside general regression, not instead of it.
  • The right next step after continued pretraining is usually SFT, not direct deployment.
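
Leakage control usually means splitting by document or source group before tokenization, so revisions of the same manual cannot land on both sides. The sketch below is a minimal example of that idea using a stable hash of a group identifier; the `group_id` field, the `assign_split` helper, and the sample records are hypothetical.

```python
import hashlib


def assign_split(group_id: str, val_fraction: float = 0.05) -> str:
    """Deterministically assign a whole document group to train or val."""
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "val" if bucket < val_fraction else "train"


# Illustrative records: two revisions of one manual share a group_id.
docs = [
    {"group_id": "manual-A", "revision": 1},
    {"group_id": "manual-A", "revision": 2},
    {"group_id": "manual-B", "revision": 1},
]

splits_per_group = {}
for doc in docs:
    split = assign_split(doc["group_id"])
    splits_per_group.setdefault(doc["group_id"], set()).add(split)

# Every group ends up in exactly one split.
print(splits_per_group)
```

Hashing the group rather than shuffling token blocks keeps the assignment stable across reruns and makes it easy to audit which sources are held out.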

Mastery Check


1. A CPT run resumes from a base checkpoint at its final tiny learning rate and uses 100% domain text. Domain perplexity barely changes. After you raise the re-warm peak, domain perplexity improves but general-text loss regresses. What should you sweep next?
2. A base model writes polite assistant answers, but it misreads service exception codes and produces incoherent raw continuations for service logs. Which first training stage targets the observed failure?
3. During a CPT sweep, domain perplexity keeps improving, but a probe SFT barely improves the downstream task and the general validation slice regresses. What does this pattern suggest?
4. A team wants to adapt a base model, but deployment constraints require QLoRA adapters. Which decision has QLoRA already made?
5. A checkpoint policy uses general regression as a hard gate: general perplexity may increase by at most 1.5 from the base value of 19.2. Among checkpoints that pass, choose the highest downstream probe accuracy, using lower domain perplexity only as a tie-breaker. Which checkpoint is selected?
6. A team builds a CPT validation set by shuffling token blocks after tokenization. Different revisions of the same manual and pages from the same source can land in both train and validation. What is the main problem and fix?
7. A support system must answer questions about this week's pricing rules. The rules change frequently, answers must cite the current document, and the base model already reads the policy language well. What should be the first tool?
8. A team has unlabeled technical manuals and wants a base model to better learn their terminology and document structure. Which setup is continued pretraining?
9. A CPT manifest contains four sources: public-manuals has 800,000 tokens and is licensed, PII-scrubbed, and not eval-only; support-notes has 120,000 tokens but unresolved PII; heldout-probes has 25,000 tokens and is eval-only; vendor-export has 300,000 tokens but is unlicensed. Which data passes the training gate?
10. Two held-out documents have total negative log-likelihood and token counts of 60 over 20 tokens and 160 over 40 tokens. What is the corpus perplexity?


Next Step
Continue to Synthetic Data Generation Pipelines for LLMs

Continued pretraining taught you how to move a base model toward the language of your domain before you ever label prompt-response pairs. The next chapter switches to the data engine that often feeds later SFT and preference runs: generating, filtering, decontaminating, and versioning synthetic training examples.

References

[1] Gururangan, S., Marasovic, A., Swayamdipta, S., et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020.

[2] Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024.

[3] Ibrahim, A., Therien, B., Gupta, K., et al. (2024). Simple and Scalable Strategies to Continually Pre-train Large Language Models. Transactions on Machine Learning Research. https://openreview.net/forum?id=DimPeeCxKO

[4] Wu, S., Irsoy, O., Lu, S., et al. (2023). BloombergGPT: A Large Language Model for Finance.

[5] Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS. https://arxiv.org/abs/2305.14314