Learn when to keep the causal language-modeling objective and continue pretraining on domain text instead of jumping straight to supervised fine-tuning (SFT), and how to weigh the downstream gain against forgetting and cost.
The scratch GPT lab trained a tiny model from raw text to checkpoint. Real teams usually start from a base model instead. Continued pretraining (CPT) keeps the same next-token objective but shifts the text distribution so the model spends more compute on your domain's language.[1]
Teams often confuse three different tools:
Don't choose from labels alone. In Ovadia et al.'s knowledge-injection experiments, retrieval-augmented generation (RAG) outperformed unsupervised fine-tuning on MMLU and current-events questions, while repeated paraphrases helped fine-tuning on the new-fact task.[2] Use that result for factual-update failures, not as a universal ranking. CPT earns an experiment when the model fits the domain text distribution poorly, not when you only need fresher facts or a different answer format.
In continued pretraining:
That differs from later SFT, where the model learns from curated prompt-response examples instead of unlabeled text.
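The difference in supervision can be made concrete by looking at which positions contribute to the loss. The sketch below is a minimal illustration, not a training recipe: the token ids are made up, `build_cpt_labels` and `build_sft_labels` are hypothetical helper names, and masking prompt tokens with `-100` follows a common convention (the default `ignore_index` of PyTorch's cross-entropy loss) that not every SFT setup uses.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss


def build_cpt_labels(token_ids: list[int]) -> list[int]:
    """Continued pretraining: every token of the raw document is a prediction target."""
    # The one-position shift between inputs and targets is left to the training loop.
    return list(token_ids)


def build_sft_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
    """SFT (one common variant): only response tokens are prediction targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


# Made-up token ids, for illustration only.
document = [101, 102, 103, 104, 105]
prompt, response = [201, 202, 203], [301, 302]

cpt_labels = build_cpt_labels(document)
sft_labels = build_sft_labels(prompt, response)

cpt_supervised = sum(label != IGNORE_INDEX for label in cpt_labels)
sft_supervised = sum(label != IGNORE_INDEX for label in sft_labels)

print(f"cpt_labels={cpt_labels} supervised={cpt_supervised}/{len(cpt_labels)}")
print(f"sft_labels={sft_labels} supervised={sft_supervised}/{len(sft_labels)}")
```

In the CPT case all five positions carry training signal; in this SFT variant only the two response positions do.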
A useful direct signal is held-out raw-text loss and its exponentiated form, perplexity: establish a base-model value on domain documents, then test whether CPT lowers it while a general-text control remains inside budget. A base model scoring worse on domain than general text is only a screening clue because corpora can have different inherent predictability; it doesn't by itself prove CPT will improve product tasks.
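Written out for a held-out corpus of $N$ tokens $x_1, \dots, x_N$ scored by a model $p_\theta$ (the symbols are notation introduced for this sketch, not taken from the cited papers):

```latex
\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t}),
\qquad
\mathrm{PPL} = \exp(\mathcal{L})
```

A lower $\mathcal{L}$ on held-out domain text after CPT is the direct signal; the same quantity on general text is the control.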
Fragmentation during tokenization is a weaker diagnostic. A fixed tokenizer may use more tokens for unfamiliar terminology, increasing context cost, but CPT doesn't change that tokenizer unless you deliberately redesign embeddings and retrain compatible weights. Use fertility as a corpus inspection signal, not a promise that continued pretraining will shorten tokenized documents.
```python
import re

import tiktoken

encoder = tiktoken.get_encoding("gpt2")
samples = {
    "general": "A developer changed the feature flag before deploy.",
    "catalog": "The release orchestration workflow reconciles failed checks.",
    "incident": "The sidecar restarted after the readiness probe failed.",
}

print("slice words tokens tokens_per_word")
for name, text in samples.items():
    words = re.findall(r"\b[\w'-]+\b", text)
    tokens = encoder.encode(text)
    fertility = len(tokens) / len(words)
    print(f"{name:<9}{len(words):>5}{len(tokens):>8}{fertility:>17.2f}")
```

```text
slice words tokens tokens_per_word
general      8       9             1.12
catalog      7      10             1.43
incident     8      11             1.38
```

Don't split a validation corpus by shuffled token chunks. Near-duplicates, revisions of the same manual, or pages from the same source can land in both training and validation and make CPT look stronger than it really is. Assign a provenance or deduplication group to one split before tokenization.
```python
import hashlib

documents = [
    {"group": "manual-v1", "text": "scanner fault E17 means belt obstruction"},
    {"group": "manual-v1", "text": "scanner fault E18 means label obstruction"},
    {"group": "incident-runbooks", "text": "canary rollbacks require owner acknowledgement"},
    {"group": "incident-runbooks", "text": "destructive migrations require DBA approval"},
    {"group": "events-east", "text": "hub=EWR lane=42 retry=1"},
    {"group": "events-west", "text": "hub=OAK lane=11 retry=0"},
]

def split_for_group(group: str) -> str:
    bucket = int(hashlib.sha256(group.encode()).hexdigest(), 16) % 4
    return "validation" if bucket == 0 else "train"

splits = {"train": [], "validation": []}
for doc in documents:
    splits[split_for_group(doc["group"])].append(doc)

train_groups = {doc["group"] for doc in splits["train"]}
validation_groups = {doc["group"] for doc in splits["validation"]}
assert train_groups.isdisjoint(validation_groups)

print(f"train_groups={sorted(train_groups)}")
print(f"validation_groups={sorted(validation_groups)}")
print("group leakage: none")
```

```text
train_groups=['events-west', 'manual-v1']
validation_groups=['events-east', 'incident-runbooks']
group leakage: none
```

Resuming training on a new distribution pulls the weights in two directions, and a good run balances them.
Catastrophic forgetting is loss of previously learned ability as parameters shift to absorb new data. Push too hard on domain text and broad validation quality can regress.
Underfitting is the opposite failure: train too gently and the domain leaves no real impression.
Two major controls for this balance are the learning-rate schedule and the data mixture; run length and corpus quality matter too.
A base checkpoint often finished its original cosine schedule at a very small learning rate. If you resume at that floor, adaptation may be inefficient. If you resume too aggressively, general-text loss may regress.
Ibrahim et al. (2024) study a related decoder-only continual-pretraining setting: updating a model with large new datasets after its original cosine schedule ended.[3] For 405M models under English-to-English and English-to-German shifts, and a 10B-parameter model under the English-to-English shift, learning-rate re-warming, re-decaying, and replay matched retraining baselines on their reported losses and evaluation averages while spending less compute. Their experiment is evidence for testing this recipe, not permission to copy one peak learning rate into every domain run.
One subtlety from that work: re-warming can itself increase loss on old data. Sweep the peak and measure both lanes instead of assuming adaptation is free. The paper also explores schedules that aren't tied to one fixed token budget.
```python
import math

def rewarm_redecay(step: int, total_steps: int, warmup_steps: int, peak: float, floor: float) -> float:
    # Linear re-warm from just above the floor up to the peak learning rate.
    if step < warmup_steps:
        return floor + (peak - floor) * (step + 1) / warmup_steps
    # Cosine re-decay from the peak back down to the floor.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cosine

total_steps = 1000
warmup_steps = 50
peak = 3e-5  # sweep this value; do not inherit it blindly
floor = 3e-6

for step in [0, 49, 50, 250, 999]:
    print(f"step={step:>3} lr={rewarm_redecay(step, total_steps, warmup_steps, peak, floor):.2e}")
```

```text
step=  0 lr=3.54e-06
step= 49 lr=3.00e-05
step= 50 lr=3.00e-05
step=250 lr=2.71e-05
step=999 lr=3.00e-06
```

The second knob is replay: mix a fraction of previous or representative general-purpose data back into the incoming domain corpus. It provides training signal on broad text while the domain stream shifts the model, so it's a practical candidate for limiting regression.
How much replay? Treat it as a sweep, not a standard percentage. In Ibrahim et al.'s headline comparison, the chosen mixes use 5% replay for the SlimPajama update and 25% replay for the larger English-to-German shift.[3] Those values belong to those datasets and compute budgets. With a fixed token budget, replay also replaces some new-domain tokens, so it can reduce adaptation opportunity while controlling general regression.
```python
total_tokens = 2_000_000

print("replay_ratio domain_tokens replay_tokens total_tokens")
for replay_ratio in [0.00, 0.05, 0.25]:
    replay_tokens = int(total_tokens * replay_ratio)
    domain_tokens = total_tokens - replay_tokens
    assert domain_tokens + replay_tokens == total_tokens
    print(f"{replay_ratio:>11.0%}{domain_tokens:>15,}{replay_tokens:>15,}{total_tokens:>14,}")
```

```text
replay_ratio domain_tokens replay_tokens total_tokens
         0%      2,000,000              0     2,000,000
         5%      1,900,000        100,000     2,000,000
        25%      1,500,000        500,000     2,000,000
```
Reach for continued pretraining when the domain has its own language that the base model under-serves:
The trigger isn't "the business wants custom behavior." The model needs more exposure to the domain's text distribution before post-training behavior shaping makes sense.
| Signal | Why it points to continued pretraining |
|---|---|
| Model misreads domain terminology | The gap is familiarity with the domain's token distribution, not response style alone |
| Long domain documents feel unnatural to the model | The base corpus underrepresented this text type |
| Raw completions are weak even before instruction formatting | The issue appears before chat behavior enters the picture |
| You have lots of domain text but few high-quality prompt-response labels | Continued pretraining can exploit unlabeled corpora |

Other signals point to a different tool:

| Signal | Better tool |
|---|---|
| Model knows the facts but answers in the wrong format | SFT |
| You need fresh, frequently changing, or citable facts | RAG |
| Model needs one task-specific classifier head | supervised fine-tuning with a classifier head |
| Model is mostly correct but chooses the wrong safe vs unsafe answer | preference optimization |
The 2020 "Don't Stop Pretraining" paper made this distinction explicit in masked-language-model experiments with RoBERTa:[1]
The decision remains useful for decoder-only LLM projects, but don't silently transfer RoBERTa's quantitative gains to a causal base model. You still have to measure whether more exposure to the target text distribution improves your model and downstream task.
Suppose you're building an incident model for service exception handling. Test raw domain-text continuation and prompt-response behavior separately, then compare failure modes. If the model can't continue the underlying service log or incident note coherently, that points to continued pretraining. If raw continuation is competent but assistant behavior is weak, that points more directly to SFT.
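One way to keep that comparison honest is to record both probe results and route on them explicitly. The sketch below is hypothetical throughout: the `ProbeResult` fields, the 0.7 pass threshold, and the `recommend_stage` helper are illustrative assumptions, and the scores would come from whatever graded continuation and prompt-response probes your team defines.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    raw_continuation_score: float  # graded quality of raw log or incident-note continuation, 0..1
    assistant_score: float         # graded quality of prompt-response behavior, 0..1


def recommend_stage(result: ProbeResult, passing: float = 0.7) -> str:
    """Route on the failure mode; the threshold is an assumption to tune per product."""
    if result.raw_continuation_score < passing:
        # The model cannot continue the domain text itself coherently.
        return "continued pretraining"
    if result.assistant_score < passing:
        # Raw continuation is competent, but assistant behavior is weak.
        return "SFT"
    return "no adaptation indicated by these probes"


# Invented probe scores, for illustration only.
examples = {
    "weak on raw logs": ProbeResult(0.45, 0.40),
    "fluent text, poor answers": ProbeResult(0.82, 0.50),
    "both lanes pass": ProbeResult(0.85, 0.80),
}
recommendations = {name: recommend_stage(result) for name, result in examples.items()}
for name, stage in recommendations.items():
    print(f"{name}: {stage}")
```

Checking raw continuation first mirrors the ordering in the text: a model that cannot fit the domain text is a CPT candidate even if its assistant behavior is also weak.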
The same discipline from large-scale pretraining still applies:
The corpus can be narrower and more targeted. Domain data can also be more sensitive than public pretraining text, so provenance, access control, and removal procedures are product requirements, not cleanup tasks.
Keep a manifest that records whether a source may be trained on, whether it contains unresolved sensitive content, and whether it's reserved for evaluation. A high-quality domain document that fails one of these gates doesn't belong in the training stream.
```python
sources = [
    {"name": "public-manuals", "tokens": 800_000, "licensed": True, "pii_scrubbed": True, "eval_only": False},
    {"name": "support-notes", "tokens": 120_000, "licensed": True, "pii_scrubbed": False, "eval_only": False},
    {"name": "heldout-probes", "tokens": 25_000, "licensed": True, "pii_scrubbed": True, "eval_only": True},
    {"name": "vendor-export", "tokens": 300_000, "licensed": False, "pii_scrubbed": True, "eval_only": False},
]

accepted = [
    row for row in sources
    if row["licensed"] and row["pii_scrubbed"] and not row["eval_only"]
]
rejected = [row["name"] for row in sources if row not in accepted]

print(f"accepted={[row['name'] for row in accepted]}")
print(f"training_tokens={sum(row['tokens'] for row in accepted):,}")
print(f"rejected={rejected}")
```

```text
accepted=['public-manuals']
training_tokens=800,000
rejected=['support-notes', 'heldout-probes', 'vendor-export']
```

For a small exact-overlap gate, normalize text and hash it before building token blocks. Production pipelines also need near-duplicate detection, because formatting changes and partial copies will evade exact hashes.
```python
import hashlib

def fingerprint(text: str) -> str:
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

heldout = [
    "Fault E17: belt obstruction. Clear belt and retry.",
    "Returns above $500 require supervisor approval.",
]
candidate_training = [
    "Scanner firmware notes for version 4.2.",
    " fault E17: BELT obstruction. clear belt and retry. ",
    "Lane timeout codes and remediation steps.",
]

heldout_hashes = {fingerprint(text) for text in heldout}
clean_training = [
    text for text in candidate_training
    if fingerprint(text) not in heldout_hashes
]

print(f"removed={len(candidate_training) - len(clean_training)}")
print(f"kept={len(clean_training)}")
assert all(fingerprint(text) not in heldout_hashes for text in clean_training)
```

```text
removed=1
kept=2
```

Don't assume 100% domain text is always optimal. In practice, teams mix:
That replay is one guardrail against forgetting. The exact ratio is empirical: define candidate ratios, hold total training tokens fixed, and select with domain-gain and broad-regression metrics. If the model forgets too much general language while specializing, the run overshot.
BloombergGPT is a useful contrast, not replay evidence: it was trained from scratch on 51.27% financial and 48.73% public tokens, and reports strong financial performance while remaining competitive on general-purpose benchmarks.[4] It shows that corpus composition should be explicit. It doesn't identify the right CPT replay ratio for your checkpoint.
CPT uses the same causal objective as base pretraining. A common loader recipe joins document token sequences with end-of-document markers and emits full blocks. The separator marks a boundary, but it doesn't prevent cross-document attention by itself. As the data-pipeline chapter explained, choose explicitly between an ordinary causal mask and a document-isolated block-diagonal mask. Small integer token sequences make separator placement inspectable.
```python
EOS = 0
block_size = 6
documents = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

stream = []
for document in documents:
    stream.extend(document + [EOS])

blocks = [
    stream[start:start + block_size]
    for start in range(0, len(stream) - block_size + 1, block_size)
]

print(f"stream={stream}")
print(f"blocks={blocks}")
assert all(len(block) == block_size for block in blocks)
assert EOS in blocks[0]
```

```text
stream=[11, 12, 13, 0, 21, 22, 0, 31, 32, 33, 34, 0]
blocks=[[11, 12, 13, 0, 21, 22], [0, 31, 32, 33, 34, 0]]
```

The slicing logic drops any incomplete final block instead of padding it; this 12-token stream happens to divide evenly into two blocks. Production loaders need an explicit remainder policy.
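To see what the two masking choices mean for a packed block, the sketch below builds both masks for the first block above. It is a minimal illustration under stated assumptions: `document_ids` and `attention_mask` are hypothetical helpers, token id `0` is the separator as in the packing example, and each separator is counted as part of the document it closes. Real training code would typically build these masks as tensors inside the attention implementation.

```python
EOS = 0


def document_ids(block: list[int]) -> list[int]:
    """Label each position with the index of the document it belongs to."""
    ids, current = [], 0
    for token in block:
        ids.append(current)
        if token == EOS:
            current += 1  # the next token starts a new document
    return ids


def attention_mask(block: list[int], isolate_documents: bool) -> list[list[int]]:
    """mask[query][key] is 1 when the query position may attend to the key position."""
    docs = document_ids(block)
    size = len(block)
    return [
        [
            int(key <= query and (not isolate_documents or docs[key] == docs[query]))
            for key in range(size)
        ]
        for query in range(size)
    ]


block = [11, 12, 13, 0, 21, 22]
causal = attention_mask(block, isolate_documents=False)
isolated = attention_mask(block, isolate_documents=True)

print(f"document_ids={document_ids(block)}")
print("ordinary causal mask:")
for row in causal:
    print(row)
print("document-isolated mask:")
for row in isolated:
    print(row)
```

Under the ordinary causal mask, token `21` can attend to every token of the preceding document; under the document-isolated mask it sees only itself, which is the block-diagonal structure.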
Once domain and replay streams are packed, make mixture selection explicit and auditable. Here each twenty-block training window uses a seeded shuffle with the requested replay count.
```python
import random

def make_window(domain_blocks: list[str], replay_blocks: list[str], replay_ratio: float, size: int) -> list[str]:
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be between 0 and 1")
    # Fix the per-window counts so the mixture is exact, not just expected.
    replay_count = round(size * replay_ratio)
    domain_count = size - replay_count
    if len(domain_blocks) < domain_count or len(replay_blocks) < replay_count:
        raise ValueError("not enough packed blocks for requested window")
    chosen = domain_blocks[:domain_count] + replay_blocks[:replay_count]
    # Seeded shuffle keeps the ordering reproducible and auditable.
    random.Random(7).shuffle(chosen)
    return chosen

domain_blocks = [f"domain-{index}" for index in range(20)]
replay_blocks = [f"general-{index}" for index in range(20)]
window = make_window(domain_blocks, replay_blocks, replay_ratio=0.25, size=20)

domain_count = sum(item.startswith("domain") for item in window)
replay_count = sum(item.startswith("general") for item in window)
print(f"domain_blocks={domain_count} replay_blocks={replay_count}")
print(f"first_five={window[:5]}")
assert (domain_count, replay_count) == (15, 5)
```

```text
domain_blocks=15 replay_blocks=5
first_five=['general-2', 'general-0', 'domain-11', 'general-3', 'domain-7']
```

Continued pretraining needs two evaluation lanes at the same time.
Measure:
Measure:
If you only watch domain gain, you can accidentally produce a model that sounds like one incident runbook and has forgotten how to write broadly coherent language.
Evaluate loss in comparable token units. Perplexity is exp(mean negative log-likelihood), so aggregate token-level loss before exponentiating; don't average document perplexities and call the result a corpus metric.
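The gap between the two aggregation orders is easy to demonstrate. The per-document numbers below are invented for illustration; only the arithmetic is the point.

```python
import math

# Invented per-document totals: summed negative log-likelihood and token count.
documents = [
    {"negative_log_likelihood": 30.0, "tokens": 10},    # short, hard document
    {"negative_log_likelihood": 400.0, "tokens": 200},  # long, easier document
]

total_nll = sum(doc["negative_log_likelihood"] for doc in documents)
total_tokens = sum(doc["tokens"] for doc in documents)

# Corpus perplexity: aggregate token-level loss first, exponentiate once.
corpus_ppl = math.exp(total_nll / total_tokens)

# Misleading alternative: exponentiate per document, then average.
mean_doc_ppl = sum(
    math.exp(doc["negative_log_likelihood"] / doc["tokens"]) for doc in documents
) / len(documents)

print(f"corpus_ppl={corpus_ppl:.2f}")
print(f"mean_of_document_ppl={mean_doc_ppl:.2f}")
```

In the second number the ten-token document carries the same weight as the two-hundred-token one, so the average of document perplexities overstates the corpus value here.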
```python
import math

base = {
    "domain": {"negative_log_likelihood": 840.0, "tokens": 240},
    "general": {"negative_log_likelihood": 540.0, "tokens": 200},
}
adapted = {
    "domain": {"negative_log_likelihood": 720.0, "tokens": 240},
    "general": {"negative_log_likelihood": 548.0, "tokens": 200},
}

def perplexity(metrics: dict[str, float]) -> float:
    return math.exp(metrics["negative_log_likelihood"] / metrics["tokens"])

print("lane base_ppl adapted_ppl delta")
for lane in ["domain", "general"]:
    base_ppl = perplexity(base[lane])
    adapted_ppl = perplexity(adapted[lane])
    print(f"{lane:<8}{base_ppl:>9.2f}{adapted_ppl:>13.2f}{adapted_ppl - base_ppl:>7.2f}")
```

```text
lane base_ppl adapted_ppl delta
domain      33.12        20.09 -13.03
general     14.88        15.49   0.61
```

The simplest useful artifact is a checkpoint ledger. It doesn't train a model; it shows how to choose between checkpoints after a continued-pretraining sweep. Domain perplexity can improve while general text gets worse, so the chosen checkpoint needs to pass both lanes. Use general regression as a hard gate. Among survivors, rank downstream probe accuracy first and use domain perplexity as a tie-breaker. That keeps the policy visible instead of hiding trade-offs inside an arbitrary weighted score.
```python
checkpoints = [
    {"name": "base", "domain_ppl": 42.0, "general_ppl": 19.2, "probe_acc": 0.62},
    {"name": "cpt-1k", "domain_ppl": 31.5, "general_ppl": 19.5, "probe_acc": 0.68},
    {"name": "cpt-4k", "domain_ppl": 27.9, "general_ppl": 20.1, "probe_acc": 0.72},
    {"name": "cpt-12k", "domain_ppl": 25.8, "general_ppl": 23.9, "probe_acc": 0.71},
]

base = checkpoints[0]
max_general_regression = 1.5

print("checkpoint domain_gain general_regression probe_acc keep")
best = None
best_rank = None

for row in checkpoints:
    domain_gain = base["domain_ppl"] - row["domain_ppl"]
    general_regression = row["general_ppl"] - base["general_ppl"]
    # Hard gate: general-text perplexity may not regress past the budget.
    keep = general_regression <= max_general_regression
    # Rank survivors by probe accuracy, then by lower domain perplexity.
    rank = (row["probe_acc"], -row["domain_ppl"])

    if keep and (best_rank is None or rank > best_rank):
        best = row
        best_rank = rank

    print(
        f"{row['name']:<10}"
        f"{domain_gain:>11.1f}"
        f"{general_regression:>20.1f}"
        f"{row['probe_acc']:>11.2f}"
        f" {'yes' if keep else 'no'}"
    )

print(f"chosen={best['name']}")
print("reason=best downstream probe, then domain perplexity, inside general-regression budget")
```

```text
checkpoint domain_gain general_regression probe_acc keep
base              0.0                 0.0       0.62 yes
cpt-1k           10.5                 0.3       0.68 yes
cpt-4k           14.1                 0.9       0.72 yes
cpt-12k          16.2                 4.7       0.71 no
chosen=cpt-4k
reason=best downstream probe, then domain perplexity, inside general-regression budget
```

Because continued pretraining keeps the same objective, it can feel deceptively safe. It isn't safe by default.
Good stopping cues:
Bad stopping cues:
More steps aren't a free lunch once the domain shift is already absorbed.
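A stop rule along these lines can be written down before the run starts. The sketch below is one hypothetical policy, not a recommendation: the `should_stop` helper, the evaluation history, and both thresholds (`min_domain_gain`, `max_general_regression`) are assumptions for illustration.

```python
def should_stop(
    history: list[dict[str, float]],
    base_general_loss: float,
    min_domain_gain: float = 0.01,
    max_general_regression: float = 0.05,
) -> tuple[bool, str]:
    """Check the latest evaluation against two assumed stop conditions."""
    latest = history[-1]
    # Condition 1: general-text loss has regressed past the allowed budget.
    if latest["general_loss"] - base_general_loss > max_general_regression:
        return True, "general-text loss regressed past budget"
    # Condition 2: domain loss improved by less than the minimum since the last evaluation.
    if len(history) >= 2:
        gain = history[-2]["domain_loss"] - latest["domain_loss"]
        if gain < min_domain_gain:
            return True, "domain loss has plateaued"
    return False, "keep training"


# Invented held-out losses (mean negative log-likelihood per token) at successive evaluations.
history = [
    {"step": 1000, "domain_loss": 3.20, "general_loss": 2.71},
    {"step": 2000, "domain_loss": 3.05, "general_loss": 2.72},
    {"step": 3000, "domain_loss": 3.045, "general_loss": 2.73},
]

decisions = [
    should_stop(history[: index + 1], base_general_loss=2.70)
    for index in range(len(history))
]
for row, (stop, reason) in zip(history, decisions):
    print(f"step={row['step']} stop={stop} reason={reason}")
```

With these invented numbers the rule keeps training through the second evaluation and stops at the third, where the domain gain falls below the assumed minimum while general loss is still inside budget.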
Compare the choices by asking what you want to change.
| Goal | Best first tool |
|---|---|
| Inject fresh or citable facts without retraining | RAG |
| Teach new domain language patterns | continued pretraining |
| Teach chat or task format | SFT |
| Run a behavior update without full-weight training | SFT with LoRA / QLoRA adapters |
| Choose between multiple acceptable responses | DPO or RLHF |
LoRA and QLoRA are parameter-efficient implementation choices; QLoRA also stores the frozen base model in quantized form.[5] They don't determine what supervision teaches. An adapter can be trained with a next-token domain-text objective or with prompt-response SFT. First choose objective from the failure mode, then choose full-weight or parameter-efficient training from budget and deployment constraints.
A strong training stack often looks like:
Not every product needs every stage. Choose the stage that matches the failure you observe.
Symptom: the model still formats answers badly after a long domain-text run.
Cause: the issue was interface behavior, not domain language exposure.
Fix: move to SFT sooner.
Symptom: domain completions improve, but the model becomes narrow or brittle elsewhere.
Cause: no replay mixture, or too many adaptation steps.
Fix: keep a general-text regression lane and stop earlier.
Symptom: domain perplexity improves, but the final task model barely benefits.
Cause: the adaptation run optimized text fit that did not transfer to the product task.
Fix: probe the adapted checkpoint with a small downstream SFT instead of judging only by perplexity.
Defend these points:
5% and 25% mixes are study-specific reference points, not defaults.[3]

| Prompt | Answer sketch |
|---|---|
| What is the core difference between continued pretraining and SFT? | Continued pretraining feeds unlabeled or weakly structured domain text through the same next-token objective. SFT trains on prompt-response examples to teach answer format, task behavior, and interface style. |
| When is continued pretraining a better first move than SFT? Can LoRA decide that? | Choose CPT when the base model is weak on domain language itself; choose SFT when it understands the text but answers poorly. LoRA can't decide between them because it can parameterize either objective. |
| A team wants the model to answer questions about this week's pricing rules. CPT, SFT, or RAG? | RAG. The facts change often and should be citable, so retrieving them at inference beats baking them into weights. CPT is for absorbing the domain's language, not for chasing fast-moving facts. |
| How do you limit general regression during continued pretraining? | Sweep re-warm/re-decay schedules and replay ratios, then watch a general-text regression lane beside domain gain. Don't assume a paper's replay percentage transfers to your data. |
| How do you know a continued-pretraining run went too far? | Domain metrics improve, but broad validation regresses, generations become narrow, or downstream probe SFT stops improving. Choose an earlier checkpoint with better overall trade-off. |
[1] Gururangan, S., Marasovic, A., Swayamdipta, S., et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020.
[2] Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024.
[3] Ibrahim, A., Therien, B., Gupta, K., et al. (2024). Simple and Scalable Strategies to Continually Pre-train Large Language Models. Transactions on Machine Learning Research.
[4] Wu, S., Irsoy, O., Lu, S., et al. (2023). BloombergGPT: A Large Language Model for Finance.
[5] Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS.