Continued Pretraining for Domain Shift

Learn when to keep the causal language-modeling objective and continue pretraining on domain text instead of jumping straight to SFT, and how to evaluate the trade-off against forgetting, cost, and downstream gain.

The scratch GPT lab trained a tiny model from raw text to checkpoint. Real teams usually start from a base model instead. Continued pretraining keeps the same next-token objective, but shifts the text distribution so the model spends more compute on your domain's language.[1]

Teams often confuse three different tools:

Retrieval-augmented generation (RAG) leaves the weights frozen and injects knowledge at inference time by retrieving documents into the prompt. Use it when facts change often or must be cited.
Supervised fine-tuning (SFT) changes behavior, format, and tone using curated examples such as prompt-response pairs. Use it when the model already knows the domain but answers in the wrong shape.
Continued pretraining (CPT) changes the weights with the same next-token objective so the model better fits domain terminology, document structure, and statistical patterns. Use it when raw domain text still confuses the base model.

Don't choose from labels alone. In Ovadia et al.'s knowledge-injection experiments, RAG outperformed unsupervised fine-tuning on MMLU and current-events questions, while repeated paraphrases helped fine-tuning on the new-fact task.[2] Use that result for factual-update failures, not as a universal ranking. CPT earns an experiment when the model poorly fits the domain text distribution, not when you only need fresher facts or a different answer format.

Figure: Training ladder from base pretraining to continued pretraining, supervised fine-tuning, and preference tuning. These stages solve different problems. Base pretraining teaches general language, continued pretraining shifts the model toward a domain's text distribution, SFT teaches interface behavior, and preference tuning decides which acceptable answers the model should prefer.

What changes and what stays the same

In continued pretraining:

the objective stays the same: predict the next token
the model architecture stays the same
the data distribution changes
the goal changes from general competence to domain adaptation

That differs from later SFT, where the model learns from curated prompt-response examples instead of unlabeled text.
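One way to see the difference at the loss level is to look at which token positions are supervised. The sketch below is an illustration under assumptions, not this course's training code: the token IDs are made up, the labels are shown before the one-position shift that causal LM training applies, and masking prompt tokens with -100 is one common SFT convention (it matches the default `ignore_index` of PyTorch's cross-entropy loss). Some SFT recipes also train on prompt tokens.

```python
IGNORE = -100  # assumed ignore value; matches PyTorch cross-entropy's default ignore_index


def cpt_labels(token_ids: list[int]) -> list[int]:
    """Continued pretraining: every position in the raw document is a target."""
    return list(token_ids)


def sft_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
    """SFT under the masking convention: only response positions are targets."""
    return [IGNORE] * len(prompt_ids) + list(response_ids)


document = [11, 12, 13, 14, 15, 16]  # hypothetical raw domain text
prompt, response = [21, 22, 23], [31, 32, 33]  # hypothetical prompt-response pair

cpt = cpt_labels(document)
sft = sft_labels(prompt, response)

cpt_supervised = sum(label != IGNORE for label in cpt)
sft_supervised = sum(label != IGNORE for label in sft)
print(f"cpt supervised positions: {cpt_supervised}/{len(cpt)}")
print(f"sft supervised positions: {sft_supervised}/{len(sft)}")
```

The objective function is the same cross-entropy in both cases. What changes is the data and which positions contribute to the loss.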

Diagnose the shift before spending training compute

A useful direct signal is held-out raw-text loss and its exponentiated form, perplexity: establish a base-model value on domain documents, then test whether CPT lowers it while a general-text control remains inside budget. A base model scoring worse on domain than general text is only a screening clue because corpora can have different inherent predictability; it doesn't by itself prove CPT will improve product tasks.

Fragmentation during tokenization is a weaker diagnostic. A fixed tokenizer may use more tokens for unfamiliar terminology, increasing context cost, but CPT doesn't change that tokenizer unless you deliberately redesign embeddings and retrain compatible weights. Use fertility as a corpus inspection signal, not a promise that continued pretraining will shorten tokenized documents.

measure_tokenizer_fertility.py

import re

import tiktoken

encoder = tiktoken.get_encoding("gpt2")
samples = {
    "general": "A developer changed the feature flag before deploy.",
    "catalog": "The release orchestration workflow reconciles failed checks.",
    "incident": "The sidecar restarted after the readiness probe failed.",
}

print("slice     words  tokens  tokens_per_word")
for name, text in samples.items():
    words = re.findall(r"\b[\w'-]+\b", text)
    tokens = encoder.encode(text)
    fertility = len(tokens) / len(words)
    print(f"{name:<9}{len(words):>5}{len(tokens):>8}{fertility:>17.2f}")

Tokenizer fertility diagnostic

slice     words  tokens  tokens_per_word
general      8       9             1.12
catalog      7      10             1.43
incident     8      11             1.38

Don't split a validation corpus by shuffled token chunks. Near-duplicates, revisions of the same manual, or pages from the same source can land in both training and validation and make CPT look stronger than it really is. Assign a provenance or deduplication group to one split before tokenization.

group_domain_holdout.py

import hashlib

documents = [
    {"group": "manual-v1", "text": "scanner fault E17 means belt obstruction"},
    {"group": "manual-v1", "text": "scanner fault E18 means label obstruction"},
    {"group": "incident-runbooks", "text": "canary rollbacks require owner acknowledgement"},
    {"group": "incident-runbooks", "text": "destructive migrations require DBA approval"},
    {"group": "events-east", "text": "hub=EWR lane=42 retry=1"},
    {"group": "events-west", "text": "hub=OAK lane=11 retry=0"},
]

def split_for_group(group: str) -> str:
    bucket = int(hashlib.sha256(group.encode()).hexdigest(), 16) % 4
    return "validation" if bucket == 0 else "train"

splits = {"train": [], "validation": []}
for doc in documents:
    splits[split_for_group(doc["group"])].append(doc)

train_groups = {doc["group"] for doc in splits["train"]}
validation_groups = {doc["group"] for doc in splits["validation"]}
assert train_groups.isdisjoint(validation_groups)

print(f"train_groups={sorted(train_groups)}")
print(f"validation_groups={sorted(validation_groups)}")
print("group leakage: none")

Grouped holdout split

train_groups=['events-west', 'manual-v1']
validation_groups=['events-east', 'incident-runbooks']
group leakage: none

The two failure dynamics: forgetting and underfitting

Resuming training on a new distribution pulls the weights in two directions, and a good run balances them.

Catastrophic forgetting is loss of previously learned ability as parameters shift to absorb new data. Push too hard on domain text and broad validation quality can regress.

Underfitting is the opposite failure: train too gently and the domain leaves no real impression.

Two major controls for this balance are the learning-rate schedule and the data mixture; run length and corpus quality matter too.

Learning rate re-warming and re-decaying

A base checkpoint often finished its original cosine schedule at a very small learning rate. If you resume at that floor, adaptation may be inefficient. If you resume too aggressively, general-text loss may regress.

Ibrahim et al. (2024) study a related decoder-only continual-pretraining setting: updating a model with large new datasets after its original cosine schedule ended.[3] For 405M models under English-to-English and English-to-German shifts, and a 10B-parameter model under the English-to-English shift, learning-rate re-warming, re-decaying, and replay matched retraining baselines on their reported losses and evaluation averages while spending less compute. Their experiment is evidence for testing this recipe, not permission to copy one peak learning rate into every domain run.

One subtlety from that work: re-warming can itself increase loss on old data. Sweep the peak and measure both lanes instead of assuming adaptation is free. The paper also explores schedules that aren't tied to one fixed token budget.

rewarm_redecay_schedule.py

import math

def rewarm_redecay(step: int, total_steps: int, warmup_steps: int, peak: float, floor: float) -> float:
    if step < warmup_steps:
        return floor + (peak - floor) * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cosine

total_steps = 1000
warmup_steps = 50
peak = 3e-5  # sweep this value; do not inherit it blindly
floor = 3e-6

for step in [0, 49, 50, 250, 999]:
    print(f"step={step:>3} lr={rewarm_redecay(step, total_steps, warmup_steps, peak, floor):.2e}")

Re-warm then re-decay schedule

step=  0 lr=3.54e-06
step= 49 lr=3.00e-05
step= 50 lr=3.00e-05
step=250 lr=2.71e-05
step=999 lr=3.00e-06
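The cosine schedule above needs `total_steps` in advance. A schedule that isn't tied to one fixed token budget can instead hold a constant rate for as long as data keeps arriving and anneal only before a checkpoint is released. The sketch below shows one possible shape under assumptions: the phase lengths, the `constant` value, the cosine cooldown, and the linear final anneal are illustrative choices, not the formula from the paper.

```python
import math


def plateau_schedule(step, warmup_steps, cooldown_steps, peak, constant, floor,
                     anneal_start=None, anneal_steps=1):
    """Warm up to peak, cool down to a constant rate, hold it, then optionally anneal."""
    if step < warmup_steps:
        return floor + (peak - floor) * (step + 1) / warmup_steps
    if anneal_start is not None and step >= anneal_start:
        # Final anneal: move linearly from the constant rate down to the floor.
        progress = min(1.0, (step - anneal_start + 1) / anneal_steps)
        return constant + (floor - constant) * progress
    cooled = step - warmup_steps
    if cooled < cooldown_steps:
        # Cosine cooldown from peak to the constant rate.
        cosine = 0.5 * (1.0 + math.cos(math.pi * cooled / cooldown_steps))
        return constant + (peak - constant) * cosine
    return constant  # plateau: no dependence on a total step count


# Hypothetical values; sweep them like any other schedule parameter.
warmup_steps, cooldown_steps = 50, 200
peak, constant, floor = 3e-5, 1e-5, 3e-6

for step in [0, 49, 50, 250, 5000]:
    lr = plateau_schedule(step, warmup_steps, cooldown_steps, peak, constant, floor)
    print(f"step={step:>4} lr={lr:.2e}")

final = plateau_schedule(5099, warmup_steps, cooldown_steps, peak, constant, floor,
                         anneal_start=5000, anneal_steps=100)
print(f"after anneal lr={final:.2e}")
```

The design point is that the plateau branch never reads a total step count, so the run can be extended without re-deriving the schedule.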

Replay: keep prior-data signal in the mix

The second knob is replay: mix a fraction of previous or representative general-purpose data back into the incoming domain corpus. It provides training signal on broad text while the domain stream shifts the model, so it's a practical candidate for limiting regression.

How much replay? Treat it as a sweep, not a standard percentage. In Ibrahim et al.'s headline comparison, the chosen mixes use 5% replay for the SlimPajama update and 25% replay for the larger English-to-German shift.[3] Those values belong to those datasets and compute budgets. With a fixed token budget, replay also replaces some new-domain tokens, so it can reduce adaptation opportunity while controlling general regression.

compute_equivalent_replay.py

total_tokens = 2_000_000

print("replay_ratio  domain_tokens  replay_tokens  total_tokens")
for replay_ratio in [0.00, 0.05, 0.25]:
    replay_tokens = int(total_tokens * replay_ratio)
    domain_tokens = total_tokens - replay_tokens
    assert domain_tokens + replay_tokens == total_tokens
    print(f"{replay_ratio:>11.0%}{domain_tokens:>15,}{replay_tokens:>15,}{total_tokens:>14,}")

Compute-equivalent replay accounting

replay_ratio  domain_tokens  replay_tokens  total_tokens
         0%      2,000,000              0     2,000,000
         5%      1,900,000        100,000     2,000,000
        25%      1,500,000        500,000     2,000,000

Figure: Checkpoint trade-off chart for continued pretraining. Domain gain rises early while general regression stays low at first and then worsens; a balanced checkpoint is selected before forgetting dominates. A CPT sweep should vary coupled controls. Re-warm peak decides how aggressively weights move, and replay ratio decides how much broad-text training signal remains. Pick the best downstream/domain trade-off that stays within your general-text regression budget.
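A coupled sweep over re-warm peak and replay ratio can be enumerated explicitly so every run shares the same token budget. This is a minimal sketch: the peak values are hypothetical candidates, the replay ratios echo the reference points quoted earlier rather than recommended defaults, and the run names are made up for illustration.

```python
from itertools import product

total_tokens = 2_000_000
peaks = [1e-5, 3e-5, 1e-4]          # hypothetical candidates, not recommendations
replay_ratios = [0.00, 0.05, 0.25]  # reference points from the text, not defaults

runs = []
for peak, replay_ratio in product(peaks, replay_ratios):
    replay_tokens = round(total_tokens * replay_ratio)
    runs.append({
        "name": f"peak{peak:.0e}-replay{replay_ratio:.0%}",
        "peak": peak,
        "replay_ratio": replay_ratio,
        "domain_tokens": total_tokens - replay_tokens,
        "replay_tokens": replay_tokens,
    })

for run in runs:
    # Every run spends the same budget, so differences come from the two controls.
    assert run["domain_tokens"] + run["replay_tokens"] == total_tokens
    print(f"{run['name']:<22}{run['domain_tokens']:>12,}{run['replay_tokens']:>10,}")
print(f"runs={len(runs)}")
```

Each run then gets scored in both evaluation lanes; the grid only fixes what is being compared.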

When continued pretraining is the right tool

Reach for continued pretraining when the domain has its own language that the base model under-serves:

internal incident policies with recurring entity patterns
incident event streams and service jargon
long runbook or compliance documents
dense technical manuals
codebases with domain-specific APIs and naming conventions

The trigger isn't "the business wants custom behavior." The model needs more exposure to the domain's text distribution before post-training behavior shaping makes sense.

Good signals

Signal	Why it points to continued pretraining
Model misreads domain terminology	It lacks token-distribution familiarity, not response style alone
Long domain documents feel unnatural to the model	The base corpus underrepresented this text type
Raw completions are weak even before instruction formatting	The issue appears before chat behavior enters the picture
You have lots of domain text but few high-quality prompt-response labels	Continued pretraining can exploit unlabeled corpora

Bad signals

Signal	Better tool
Model knows the facts but answers in the wrong format	SFT
You need fresh, frequently changing, or citable facts	RAG
Model needs one task-specific classifier head	supervised fine-tuning with a classifier head
Model is mostly correct but prefers the wrong response when choosing between safe and unsafe candidates	preference optimization

Domain-adaptive vs task-adaptive pretraining

The 2020 "Don't Stop Pretraining" paper made this distinction explicit in masked-language-model experiments with RoBERTa:[1]

DAPT (domain-adaptive pretraining): keep training on large unlabeled domain text such as incident runbooks or service notes
TAPT (task-adaptive pretraining): continue on the task's own unlabeled inputs, even when the corpus is smaller

The decision remains useful for decoder-only LLM projects, but don't silently transfer RoBERTa's quantitative gains to a causal base model. You still have to measure whether more exposure to the target text distribution improves your model and downstream task.

A practical decision test

Suppose you're building an incident model for service exception handling. Test raw domain-text continuation and prompt-response behavior separately, then compare failure modes. If the model can't continue the underlying service log or incident note coherently, that points to continued pretraining. If raw continuation is competent but assistant behavior is weak, that points more directly to SFT.
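The decision test can be written down as a small triage function so the policy is explicit. This is a sketch under assumptions: the three boolean probes and the function name `candidate_tools` are hypothetical, and real projects would replace each flag with a measured threshold.

```python
def candidate_tools(raw_continuation_ok: bool,
                    assistant_behavior_ok: bool,
                    needs_fresh_or_cited_facts: bool) -> list[str]:
    """Map observed failure modes to the first tools worth an experiment."""
    tools = []
    if needs_fresh_or_cited_facts:
        tools.append("RAG")
    if not raw_continuation_ok:
        # Domain text itself is weak: adapt the base model before shaping behavior.
        tools.append("continued pretraining")
    elif not assistant_behavior_ok:
        # Raw continuation is competent, so the gap is interface behavior.
        tools.append("SFT")
    return tools or ["no change indicated by these probes"]


print(candidate_tools(raw_continuation_ok=False, assistant_behavior_ok=False,
                      needs_fresh_or_cited_facts=False))
print(candidate_tools(raw_continuation_ok=True, assistant_behavior_ok=False,
                      needs_fresh_or_cited_facts=False))
print(candidate_tools(raw_continuation_ok=True, assistant_behavior_ok=True,
                      needs_fresh_or_cited_facts=True))
```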

Data for continued pretraining

The same discipline from large-scale pretraining still applies:

filter low-quality text
deduplicate aggressively
remove benchmarks and eval leakage
scrub PII and sensitive content
keep provenance and usage rights for every corpus slice

The corpus can be narrower and more targeted. Domain data can also be more sensitive than public pretraining text, so provenance, access control, and removal procedures are product requirements, not cleanup tasks.

Gate the corpus before tokenization

Keep a manifest that records whether a source may be trained on, whether it contains unresolved sensitive content, and whether it's reserved for evaluation. A high-quality domain document that fails one of these gates doesn't belong in the training stream.

gate_domain_manifest.py

sources = [
    {"name": "public-manuals", "tokens": 800_000, "licensed": True, "pii_scrubbed": True, "eval_only": False},
    {"name": "support-notes", "tokens": 120_000, "licensed": True, "pii_scrubbed": False, "eval_only": False},
    {"name": "heldout-probes", "tokens": 25_000, "licensed": True, "pii_scrubbed": True, "eval_only": True},
    {"name": "vendor-export", "tokens": 300_000, "licensed": False, "pii_scrubbed": True, "eval_only": False},
]

accepted = [
    row for row in sources
    if row["licensed"] and row["pii_scrubbed"] and not row["eval_only"]
]
rejected = [row["name"] for row in sources if row not in accepted]

print(f"accepted={[row['name'] for row in accepted]}")
print(f"training_tokens={sum(row['tokens'] for row in accepted):,}")
print(f"rejected={rejected}")

Corpus manifest gate

accepted=['public-manuals']
training_tokens=800,000
rejected=['support-notes', 'heldout-probes', 'vendor-export']

Keep evaluation text out of training

For a small exact-overlap gate, normalize text and hash it before building token blocks. Production pipelines also need near-duplicate detection, because formatting changes and partial copies will evade exact hashes.

remove_exact_eval_overlap.py

import hashlib

def fingerprint(text: str) -> str:
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

heldout = [
    "Fault E17: belt obstruction. Clear belt and retry.",
    "Returns above $500 require supervisor approval.",
]
candidate_training = [
    "Scanner firmware notes for version 4.2.",
    "  fault E17: BELT obstruction. clear belt and retry. ",
    "Lane timeout codes and remediation steps.",
]

heldout_hashes = {fingerprint(text) for text in heldout}
clean_training = [
    text for text in candidate_training
    if fingerprint(text) not in heldout_hashes
]

print(f"removed={len(candidate_training) - len(clean_training)}")
print(f"kept={len(clean_training)}")
assert all(fingerprint(text) not in heldout_hashes for text in clean_training)

Exact evaluation decontamination

removed=1
kept=2
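Exact hashes miss lightly edited copies. A minimal near-duplicate sketch compares word-shingle sets with Jaccard similarity. The shingle size, the 0.4 threshold, and the edited candidate sentence are assumptions for illustration; at corpus scale, teams commonly approximate this comparison with MinHash and locality-sensitive hashing instead of comparing every pair.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Lowercase, strip punctuation, and return overlapping word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(left: set, right: set) -> float:
    union = left | right
    return len(left & right) / len(union) if union else 0.0


heldout = "Fault E17: belt obstruction. Clear belt and retry."
candidates = [
    "Fault E17 - belt obstruction; clear the belt and retry.",  # edited copy
    "Lane timeout codes and remediation steps.",
]
threshold = 0.4  # hypothetical; tune against labeled duplicate pairs

heldout_shingles = shingles(heldout)
flagged = []
for text in candidates:
    score = jaccard(heldout_shingles, shingles(text))
    print(f"{score:.2f}  {text}")
    if score >= threshold:
        flagged.append(text)

print(f"flagged={len(flagged)}")
```

The edited copy changes punctuation and inserts a word, so its exact hash differs from the held-out text, yet it shares most of its shingles and is flagged.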

Mixing strategy

Don't assume 100% domain text is always optimal. In practice, teams mix:

a high-quality domain slice
a smaller replay slice of general text

That replay is one guardrail against forgetting. The exact ratio is empirical: define candidate ratios, hold total training tokens fixed, and select with domain-gain and broad-regression metrics. If the model forgets too much general language while specializing, the run overshot.

BloombergGPT is a useful contrast, not replay evidence: it was trained from scratch on 51.27% financial and 48.73% public tokens, and reports strong financial performance while remaining competitive on general-purpose benchmarks.[4] It shows that corpus composition should be explicit. It doesn't identify the right CPT replay ratio for your checkpoint.

Pack blocks and preserve the mixture

CPT uses the same causal objective as base pretraining. A common loader recipe joins document token sequences with end-of-document markers and emits full blocks. The separator marks a boundary, but it doesn't prevent cross-document attention by itself. As the data-pipeline chapter explained, choose explicitly between an ordinary causal mask and a document-isolated block-diagonal mask. Small integer token sequences make separator placement inspectable.

pack_domain_token_blocks.py

EOS = 0
block_size = 6
documents = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

stream = []
for document in documents:
    stream.extend(document + [EOS])

blocks = [
    stream[start:start + block_size]
    for start in range(0, len(stream) - block_size + 1, block_size)
]

print(f"stream={stream}")
print(f"blocks={blocks}")
assert all(len(block) == block_size for block in blocks)
assert EOS in blocks[0]

Packed CPT token blocks

stream=[11, 12, 13, 0, 21, 22, 0, 31, 32, 33, 34, 0]
blocks=[[11, 12, 13, 0, 21, 22], [0, 31, 32, 33, 34, 0]]

This small example drops an incomplete final block instead of padding it. Production loaders need an explicit remainder policy.
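To make the mask choice concrete, the sketch below builds both masks for the first packed block. It assumes a convention where 1 means the query position may attend to the key position and where each separator belongs to the document it terminates; real attention kernels usually take this information as document lengths or boolean tensors, not nested lists.

```python
EOS = 0
block = [11, 12, 13, 0, 21, 22]  # first packed block from the example above


def document_ids(tokens: list[int], eos: int = EOS) -> list[int]:
    """Number documents within a block; the separator stays with its own document."""
    ids, current = [], 0
    for token in tokens:
        ids.append(current)
        if token == eos:
            current += 1
    return ids


def attention_mask(tokens: list[int], isolate_documents: bool) -> list[list[int]]:
    """Causal mask, optionally restricted to keys from the same document."""
    ids = document_ids(tokens)
    size = len(tokens)
    return [
        [int(key <= query and (not isolate_documents or ids[key] == ids[query]))
         for key in range(size)]
        for query in range(size)
    ]


causal = attention_mask(block, isolate_documents=False)
isolated = attention_mask(block, isolate_documents=True)

print("token 21 under ordinary causal mask:  ", causal[4])
print("token 21 under document-isolated mask:", isolated[4])
```

Under the ordinary causal mask, the first token of the second document still sees the entire first document. Under the isolated mask, it sees only itself.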

Once domain and replay streams are packed, make mixture selection explicit and auditable. Here each twenty-block training window uses a seeded shuffle with the requested replay count.

build_replay_mixture.py

import random

def make_window(domain_blocks: list[str], replay_blocks: list[str], replay_ratio: float, size: int) -> list[str]:
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be between 0 and 1")
    replay_count = round(size * replay_ratio)
    domain_count = size - replay_count
    if len(domain_blocks) < domain_count or len(replay_blocks) < replay_count:
        raise ValueError("not enough packed blocks for requested window")
    chosen = domain_blocks[:domain_count] + replay_blocks[:replay_count]
    random.Random(7).shuffle(chosen)
    return chosen

domain_blocks = [f"domain-{index}" for index in range(20)]
replay_blocks = [f"general-{index}" for index in range(20)]
window = make_window(domain_blocks, replay_blocks, replay_ratio=0.25, size=20)

domain_count = sum(item.startswith("domain") for item in window)
replay_count = sum(item.startswith("general") for item in window)
print(f"domain_blocks={domain_count} replay_blocks={replay_count}")
print(f"first_five={window[:5]}")
assert (domain_count, replay_count) == (15, 5)

Deterministic replay mixture

domain_blocks=15 replay_blocks=5
first_five=['general-2', 'general-0', 'domain-11', 'general-3', 'domain-7']

Evaluation: domain gain without lying to yourself

Continued pretraining needs two evaluation lanes at the same time.

Lane 1: domain gain

Measure:

domain validation perplexity
retrieval or classification tasks in the domain
generation quality on held-out domain documents
downstream task lift after later SFT

Lane 2: general regression

Measure:

a small broad-language validation slice
a lightweight general benchmark set
free-form generations outside the target domain

If you only watch domain gain, you can accidentally produce a model that sounds like one incident runbook and has forgotten how to write broadly coherent language.

Evaluate loss in comparable token units. Perplexity is exp(mean negative log-likelihood), so aggregate token-level loss before exponentiating; don't average document perplexities and call the result a corpus metric.

report_domain_and_general_ppl.py

import math

base = {
    "domain": {"negative_log_likelihood": 840.0, "tokens": 240},
    "general": {"negative_log_likelihood": 540.0, "tokens": 200},
}
adapted = {
    "domain": {"negative_log_likelihood": 720.0, "tokens": 240},
    "general": {"negative_log_likelihood": 548.0, "tokens": 200},
}

def perplexity(metrics: dict[str, float]) -> float:
    return math.exp(metrics["negative_log_likelihood"] / metrics["tokens"])

print("lane     base_ppl  adapted_ppl  delta")
for lane in ["domain", "general"]:
    base_ppl = perplexity(base[lane])
    adapted_ppl = perplexity(adapted[lane])
    print(f"{lane:<8}{base_ppl:>9.2f}{adapted_ppl:>13.2f}{adapted_ppl - base_ppl:>7.2f}")

Two-lane perplexity report

lane     base_ppl  adapted_ppl  delta
domain      33.12        20.09 -13.03
general     14.88        15.49   0.61
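The warning about averaging document perplexities is easy to demonstrate. In this sketch the two documents and their loss totals are invented: one short, hard document and one long, easy one. Averaging per-document perplexities lets the short document dominate, while the corpus metric weights every token equally.

```python
import math

# Hypothetical per-document totals: summed negative log-likelihood and token count.
documents = [
    {"negative_log_likelihood": 40.0, "tokens": 10},    # short, hard document
    {"negative_log_likelihood": 990.0, "tokens": 990},  # long, easy document
]

total_nll = sum(doc["negative_log_likelihood"] for doc in documents)
total_tokens = sum(doc["tokens"] for doc in documents)

# Corpus perplexity: aggregate token-level loss first, then exponentiate once.
corpus_ppl = math.exp(total_nll / total_tokens)

# Misleading alternative: exponentiate per document, then average the results.
mean_doc_ppl = sum(
    math.exp(doc["negative_log_likelihood"] / doc["tokens"]) for doc in documents
) / len(documents)

print(f"corpus_ppl={corpus_ppl:.2f}")
print(f"mean_of_document_ppl={mean_doc_ppl:.2f}")
```

The two numbers answer different questions, so report the token-weighted one whenever you call the result a corpus metric.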

Runnable checkpoint ledger

The simplest useful artifact is a checkpoint ledger. It doesn't train a model; it shows how to choose between checkpoints after a continued-pretraining sweep. Domain perplexity can improve while general text gets worse, so the chosen checkpoint needs to pass both lanes. Use general regression as a hard gate. Among survivors, rank downstream probe accuracy first and use domain perplexity as a tie-breaker. That keeps the policy visible instead of hiding trade-offs inside an arbitrary weighted score.

continued_pretraining_checkpoint_picker.py

checkpoints = [
    {"name": "base", "domain_ppl": 42.0, "general_ppl": 19.2, "probe_acc": 0.62},
    {"name": "cpt-1k", "domain_ppl": 31.5, "general_ppl": 19.5, "probe_acc": 0.68},
    {"name": "cpt-4k", "domain_ppl": 27.9, "general_ppl": 20.1, "probe_acc": 0.72},
    {"name": "cpt-12k", "domain_ppl": 25.8, "general_ppl": 23.9, "probe_acc": 0.71},
]

base = checkpoints[0]
max_general_regression = 1.5

print("checkpoint  domain_gain  general_regression  probe_acc  keep")
best = None
best_rank = None

for row in checkpoints:
    domain_gain = base["domain_ppl"] - row["domain_ppl"]
    general_regression = row["general_ppl"] - base["general_ppl"]
    keep = general_regression <= max_general_regression
    rank = (row["probe_acc"], -row["domain_ppl"])

    if keep and (best_rank is None or rank > best_rank):
        best = row
        best_rank = rank

    print(
        f"{row['name']:<10}"
        f"{domain_gain:>11.1f}"
        f"{general_regression:>20.1f}"
        f"{row['probe_acc']:>11.2f}"
        f"  {'yes' if keep else 'no'}"
    )

print(f"chosen={best['name']}")
print("reason=best downstream probe, then domain perplexity, inside general-regression budget")

Checkpoint trade-off ledger

checkpoint  domain_gain  general_regression  probe_acc  keep
base              0.0                 0.0       0.62  yes
cpt-1k           10.5                 0.3       0.68  yes
cpt-4k           14.1                 0.9       0.72  yes
cpt-12k          16.2                 4.7       0.71  no
chosen=cpt-4k
reason=best downstream probe, then domain perplexity, inside general-regression budget

Stopping rules

Because continued pretraining keeps the same objective, it can feel deceptively safe. It isn't safe by default.

Good stopping cues:

domain validation loss flattens
downstream gains after a probe SFT stop improving
general regressions start to outweigh domain benefits

Bad stopping cues:

"we still have more domain text"
"loss is still going down a little"

More steps aren't a free lunch once the domain shift is already absorbed.
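The stopping cues can be turned into an explicit check that runs at every evaluation point. This is a minimal sketch: `patience`, `min_delta`, and the loss histories are hypothetical, and a fuller version would also track downstream probe results.

```python
def should_stop(domain_losses: list[float], general_regression: float,
                budget: float, patience: int = 2,
                min_delta: float = 0.01) -> tuple[bool, str]:
    """Stop when general regression breaks the budget or domain loss flattens."""
    if general_regression > budget:
        return True, "general regression exceeded budget"
    if len(domain_losses) > patience:
        best_before = min(domain_losses[:-patience])
        recent_best = min(domain_losses[-patience:])
        # "Still going down a little" does not count: require a real improvement.
        if best_before - recent_best < min_delta:
            return True, "domain validation loss flattened"
    return False, "keep training"


print(should_stop([3.5, 3.2, 3.0], general_regression=0.3, budget=1.5))
print(should_stop([3.5, 3.2, 3.1, 3.098, 3.097], general_regression=0.3, budget=1.5))
print(should_stop([3.5, 3.2], general_regression=2.0, budget=1.5))
```

Checking the regression budget first keeps the general-text gate hard: a run that is still improving on domain loss stops anyway once broad quality falls outside the budget.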

Where it fits relative to LoRA and SFT

Compare the choices by asking what you want to change.

Goal	Best first tool
Inject fresh or citable facts without retraining	RAG
Teach new domain language patterns	continued pretraining
Teach chat or task format	SFT
Run a behavior update without full-weight training	SFT with LoRA / QLoRA adapters
Choose between multiple acceptable responses	DPO or RLHF

LoRA and QLoRA are parameter-efficient implementation choices; QLoRA also stores the frozen base model in quantized form.[5] They don't determine what supervision teaches. An adapter can be trained with a next-token domain-text objective or with prompt-response SFT. First choose objective from the failure mode, then choose full-weight or parameter-efficient training from budget and deployment constraints.
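To see why adapters are a budget decision and not an objective decision, count trainable parameters for one weight matrix. In the LoRA formulation, a frozen matrix of shape d_out × d_in is augmented by a low-rank product of a d_out × r matrix and an r × d_in matrix. The dimensions and rank below are hypothetical.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Full-weight training updates every entry of the matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 4096  # hypothetical projection size
rank = 16            # hypothetical adapter rank

full = full_trainable_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full={full:,} lora={lora:,} fraction={lora / full:.4%}")
```

Nothing in this count depends on whether the training data is raw domain text or prompt-response pairs, which is why the objective has to be chosen separately.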

A strong training stack often looks like:

base model
continued pretraining on domain corpus
SFT on curated prompt-response data
preference optimization if needed

Not every product needs every stage. Choose the stage that matches the failure you observe.

Common pitfalls

Using continued pretraining to fix assistant tone

Symptom: the model still formats answers badly after a long domain-text run.
Cause: the issue was interface behavior, not domain language exposure.
Fix: move to SFT sooner.

Over-specializing on one corpus

Symptom: domain completions improve, but the model becomes narrow or brittle elsewhere.
Cause: no replay mixture, or too many adaptation steps.
Fix: keep a general-text regression lane and stop earlier.

Skipping the downstream check

Symptom: domain perplexity improves, but the final task model barely benefits.
Cause: the adaptation run optimized text fit that did not transfer to the product task.
Fix: probe the adapted checkpoint with a small downstream SFT instead of judging only by perplexity.

Mastery check

Defend these points:

Continued pretraining keeps causal language-modeling loss while changing corpus distribution; SFT changes supervision format to prompt-response examples.
Use continued pretraining when raw domain text is weak, and use SFT when domain understanding is fine but interface behavior is wrong.
For checkpoints that ended at a low learning rate, test re-warming and re-decaying plus replay ratios; Ibrahim et al.'s 5% and 25% mixes are study-specific reference points, not defaults.[3]
Reach for RAG when facts change or must be cited; reach for CPT only when raw domain text itself confuses the base model.
Track domain gain, downstream probe lift, and broad-language regression together before choosing a checkpoint.

Evaluation rubric

Strong: separates RAG, CPT, and SFT by learning objective, then explains that LoRA changes parameterization rather than choosing the objective.
Strong: explains forgetting and underfitting as opposite failures controlled mainly by re-warm peak and replay ratio.
Strong: uses two evaluation lanes at once: domain gain and general regression.
Weak: chooses continued pretraining for changing facts or assistant tone problems that should start with RAG or SFT.
Weak: picks the final checkpoint only because domain perplexity kept falling.

Follow-up questions

Prompt	Answer sketch
What is the core difference between continued pretraining and SFT?	Continued pretraining feeds unlabeled or weakly structured domain text through the same next-token objective. SFT trains on prompt-response examples to teach answer format, task behavior, and interface style.
When is continued pretraining a better first move than SFT? Can LoRA decide that?	Choose CPT when the base model is weak on domain language itself; choose SFT when it understands the text but answers poorly. LoRA can't decide between them because it can parameterize either objective.
A team wants the model to answer questions about this week's pricing rules. CPT, SFT, or RAG?	RAG. The facts change often and should be citable, so retrieving them at inference beats baking them into weights. CPT is for absorbing the domain's language, not for chasing fast-moving facts.
How do you limit general regression during continued pretraining?	Sweep re-warm/re-decay schedules and replay ratios, then watch a general-text regression lane beside domain gain. Don't assume a paper's replay percentage transfers to your data.
How do you know a continued-pretraining run went too far?	Domain metrics improve, but broad validation regresses, generations become narrow, or downstream probe SFT stops improving. Choose an earlier checkpoint with better overall trade-off.

What to remember

Continued pretraining keeps the same causal LM objective and changes the corpus.
It's best when the problem is domain language weakness, not a missing fact (RAG) or a wrong format (SFT).
Balance forgetting against underfitting by evaluating learning-rate re-warming/re-decaying and replay mixtures.
Replay ratios are experimental choices; measure them under a fixed token budget and a general-regression gate.
You still need filtering, deduplication, provenance, and leakage control.
Domain gain must be tracked beside general regression, not instead of it.
The right next step after continued pretraining is usually SFT, not direct deployment.

Next Step

Continue to Synthetic Data Generation Pipelines for LLMs

Continued pretraining taught you how to move a base model toward the language of your domain before you ever label prompt-response pairs. The next chapter switches to the data engine that often feeds later SFT and preference runs: generating, filtering, decontaminating, and versioning synthetic training examples.


References

[1] Gururangan, S., Marasovic, A., Swayamdipta, S., et al. (2020). Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL 2020. https://aclanthology.org/2020.acl-main.740/

[2] Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. (2024). Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs. EMNLP 2024. https://aclanthology.org/2024.emnlp-main.15/

[3] Ibrahim, A., Therien, B., Gupta, K., et al. (2024). Simple and Scalable Strategies to Continually Pre-train Large Language Models. Transactions on Machine Learning Research. https://openreview.net/forum?id=DimPeeCxKO

[4] Wu, S., Irsoy, O., Lu, S., et al. (2023). BloombergGPT: A Large Language Model for Finance. https://arxiv.org/abs/2303.17564

[5] Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS. https://arxiv.org/abs/2305.14314

Back to Topics

LearnAdvanced Training & AdaptationContinued Pretraining for Domain Shift

⚡HardFine-Tuning & Training

Continued Pretraining for Domain Shift

22 min read

Learning path

Step 99 of 158 in the full curriculum

Build GPT from Scratch Lab Synthetic Data Pipelines

Teams often confuse three different tools:

retrieval-augmented generation (RAG) leaves the weights frozen and injects knowledge at inference time by retrieving documents into the prompt. Use it when facts change often or must be cited.
Supervised Fine-Tuning (SFT) changes behavior, format, and tone using curated examples such as prompt-response pairs. Use it when the model already knows the domain but answers in the wrong shape.
Continued pretraining (CPT) changes the weights with the same next-token objective so the model better fits domain terminology, document structure, and statistical patterns. Use it when raw domain text still confuses the base model.

What changes and what stays the same

In continued pretraining:

the objective stays the same: predict the next token
the model architecture stays the same
the data distribution changes
the goal changes from general competence to domain adaptation

That differs from later SFT, where the model learns from curated prompt-response examples instead of unlabeled text.

Diagnose the shift before spending training compute

measure_tokenizer_fertility.py

import re

import tiktoken

encoder = tiktoken.get_encoding("gpt2")
samples = {
    "general": "A developer changed the feature flag before deploy.",
    "catalog": "The release orchestration workflow reconciles failed checks.",
    "incident": "The sidecar restarted after the readiness probe failed.",
}

print("slice     words  tokens  tokens_per_word")
for name, text in samples.items():
    words = re.findall(r"\b[\w'-]+\b", text)
    tokens = encoder.encode(text)
    fertility = len(tokens) / len(words)
    print(f"{name:<9}{len(words):>5}{len(tokens):>8}{fertility:>17.2f}")

Tokenizer fertility diagnostic

slice     words  tokens  tokens_per_word
general      8       9             1.12
catalog      7      10             1.43
incident     8      11             1.38
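
One way to read the table is relative to the general slice. The sketch below reuses the tokens-per-word values printed above; the 20% review threshold and the `fertility_ratios` helper are assumptions for illustration, not a rule from the source.

```python
# Tokens-per-word values copied from the diagnostic table above.
fertility = {"general": 1.12, "catalog": 1.43, "incident": 1.38}

# Hypothetical review threshold: flag slices that need at least 20% more
# tokens per word than the general slice. Tune this on your own corpus.
FLAG_RATIO = 1.20

def fertility_ratios(values: dict[str, float], baseline: str) -> dict[str, float]:
    """Return each slice's fertility divided by the baseline slice's fertility."""
    base = values[baseline]
    return {name: value / base for name, value in values.items() if name != baseline}

ratios = fertility_ratios(fertility, baseline="general")
flagged = sorted(name for name, ratio in ratios.items() if ratio >= FLAG_RATIO)

for name, ratio in ratios.items():
    print(f"{name:<9} ratio_vs_general={ratio:.2f}")
print(f"flagged={flagged}")
```

A high ratio is only a rough proxy for domain shift. Treat it as a prompt to inspect raw completions and held-out perplexity, not as proof that continued pretraining will help.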

group_domain_holdout.py

import hashlib

documents = [
    {"group": "manual-v1", "text": "scanner fault E17 means belt obstruction"},
    {"group": "manual-v1", "text": "scanner fault E18 means label obstruction"},
    {"group": "incident-runbooks", "text": "canary rollbacks require owner acknowledgement"},
    {"group": "incident-runbooks", "text": "destructive migrations require DBA approval"},
    {"group": "events-east", "text": "hub=EWR lane=42 retry=1"},
    {"group": "events-west", "text": "hub=OAK lane=11 retry=0"},
]

def split_for_group(group: str) -> str:
    bucket = int(hashlib.sha256(group.encode()).hexdigest(), 16) % 4
    return "validation" if bucket == 0 else "train"

splits = {"train": [], "validation": []}
for doc in documents:
    splits[split_for_group(doc["group"])].append(doc)

train_groups = {doc["group"] for doc in splits["train"]}
validation_groups = {doc["group"] for doc in splits["validation"]}
assert train_groups.isdisjoint(validation_groups)

print(f"train_groups={sorted(train_groups)}")
print(f"validation_groups={sorted(validation_groups)}")
print("group leakage: none")

Grouped holdout split

train_groups=['events-west', 'manual-v1']
validation_groups=['events-east', 'incident-runbooks']
group leakage: none

The two failure dynamics: forgetting and underfitting

Resuming training on a new distribution pulls the weights in two directions, and a good run balances them.

Catastrophic forgetting is loss of previously learned ability as parameters shift to absorb new data. Push too hard on domain text and broad validation quality can regress.

Underfitting is the opposite failure: train too gently and the domain leaves no real impression.

Two major controls for this balance are the learning-rate schedule and the data mixture; run length and corpus quality matter too.
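
The two failures can be told apart from perplexity deltas measured against the base checkpoint. This is a minimal sketch: the `diagnose_run` helper, its labels, and the thresholds in the usage lines are hypothetical and should be set per project.

```python
def diagnose_run(domain_ppl_drop: float, general_ppl_rise: float,
                 min_domain_drop: float, max_general_rise: float) -> str:
    """Label a run from perplexity changes relative to the base checkpoint."""
    forgot = general_ppl_rise > max_general_rise   # broad quality regressed too far
    learned = domain_ppl_drop >= min_domain_drop   # domain text left a real impression
    if forgot and learned:
        return "forgetting"
    if forgot:
        return "forgetting without domain gain"
    if not learned:
        return "underfitting"
    return "balanced"

# Illustrative numbers only.
print(diagnose_run(0.4, 0.1, min_domain_drop=2.0, max_general_rise=1.5))
print(diagnose_run(16.2, 4.7, min_domain_drop=2.0, max_general_rise=1.5))
print(diagnose_run(14.1, 0.9, min_domain_drop=2.0, max_general_rise=1.5))
```

An "underfitting" label suggests a higher peak learning rate or a longer run; a "forgetting" label suggests more replay, a lower peak, or an earlier checkpoint.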

Learning rate re-warming and re-decaying

A base checkpoint typically finishes pretraining at a low learning rate. Continuing at that rate tends to underfit the new domain, so the schedule here warms the rate back up to a swept peak and then decays it again to a floor.

rewarm_redecay_schedule.py

import math

def rewarm_redecay(step: int, total_steps: int, warmup_steps: int, peak: float, floor: float) -> float:
    if step < warmup_steps:
        return floor + (peak - floor) * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cosine

total_steps = 1000
warmup_steps = 50
peak = 3e-5  # sweep this value; do not inherit it blindly
floor = 3e-6

for step in [0, 49, 50, 250, 999]:
    print(f"step={step:>3} lr={rewarm_redecay(step, total_steps, warmup_steps, peak, floor):.2e}")

Re-warm then re-decay schedule

step=  0 lr=3.54e-06
step= 49 lr=3.00e-05
step= 50 lr=3.00e-05
step=250 lr=2.71e-05
step=999 lr=3.00e-06

Replay: keep prior-data signal in the mix

Replay mixes a slice of general text back into the domain stream to limit forgetting. Compare replay ratios under one fixed token budget, so every replay token displaces a domain token instead of adding compute.

compute_equivalent_replay.py

total_tokens = 2_000_000

print("replay_ratio  domain_tokens  replay_tokens  total_tokens")
for replay_ratio in [0.00, 0.05, 0.25]:
    replay_tokens = int(total_tokens * replay_ratio)
    domain_tokens = total_tokens - replay_tokens
    assert domain_tokens + replay_tokens == total_tokens
    print(f"{replay_ratio:>11.0%}{domain_tokens:>15,}{replay_tokens:>15,}{total_tokens:>14,}")

Compute-equivalent replay accounting

replay_ratio  domain_tokens  replay_tokens  total_tokens
         0%      2,000,000              0     2,000,000
         5%      1,900,000        100,000     2,000,000
        25%      1,500,000        500,000     2,000,000
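
Because the peak learning rate and the replay ratio interact, it can help to lay out both sweeps as one grid under the same token budget. The sketch below only enumerates run configurations; the peak values and the run-name format are hypothetical.

```python
from itertools import product

total_tokens = 2_000_000         # fixed budget, matching the accounting above
peak_lrs = [1e-5, 3e-5, 1e-4]    # hypothetical sweep values
replay_ratios = [0.00, 0.05, 0.25]

runs = []
for peak_lr, replay_ratio in product(peak_lrs, replay_ratios):
    replay_tokens = int(total_tokens * replay_ratio)
    runs.append({
        "name": f"lr{peak_lr:.0e}-replay{round(replay_ratio * 100):02d}",
        "peak_lr": peak_lr,
        "domain_tokens": total_tokens - replay_tokens,
        "replay_tokens": replay_tokens,
    })

# Every run spends the same number of tokens, so results stay comparable.
assert all(r["domain_tokens"] + r["replay_tokens"] == total_tokens for r in runs)

for run in runs:
    print(f"{run['name']:<18}{run['domain_tokens']:>11,}{run['replay_tokens']:>10,}")
```

Each run should then be scored on both domain gain and general regression before a setting is chosen.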

When continued pretraining is the right tool

Reach for continued pretraining when the domain has its own language that the base model under-serves:

internal incident policies with recurring entity patterns
incident event streams and service jargon
long runbook or compliance documents
dense technical manuals
codebases with domain-specific APIs and naming conventions

The trigger isn't "the business wants custom behavior." The model needs more exposure to the domain's text distribution before post-training behavior shaping makes sense.

Good signals

Signal	Why it points to continued pretraining
Model misreads domain terminology	It lacks token-distribution familiarity, not response style alone
Long domain documents feel unnatural to the model	The base corpus underrepresented this text type
Raw completions are weak even before instruction formatting	The issue appears before chat behavior enters the picture
You have lots of domain text but few high-quality prompt-response labels	Continued pretraining can exploit unlabeled corpora

Bad signals

Signal	Better tool
Model knows the facts but answers in the wrong format	SFT
You need fresh, frequently changing, or citable facts	RAG
Model needs one task-specific classifier head	supervised fine-tuning with a classifier head
Model is mostly correct but picks the wrong option when choosing between a safe and an unsafe answer	preference optimization

Domain-adaptive vs task-adaptive pretraining

DAPT (domain-adaptive pretraining): keep training on large unlabeled domain text such as incident runbooks or service notes
TAPT (task-adaptive pretraining): continue on the task's own unlabeled inputs, even when the corpus is smaller

A practical decision test
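
One way to turn the signal tables above into a quick check is a small lookup function. This is a sketch, not the lesson's official test: the `first_tool` helper, its flag names, and the priority order of the checks are assumptions.

```python
def first_tool(*, facts_change_often: bool, needs_citations: bool,
               raw_domain_completions_weak: bool, wrong_format_or_tone: bool,
               wrong_choice_among_acceptable: bool) -> str:
    """Map observed failures to a first tool, following the signal tables."""
    if facts_change_often or needs_citations:
        return "RAG"
    if raw_domain_completions_weak:
        return "continued pretraining"
    if wrong_format_or_tone:
        return "SFT"
    if wrong_choice_among_acceptable:
        return "preference optimization"
    return "no training change indicated"

# Example: polite answers, but raw continuations of service logs are weak.
print(first_tool(
    facts_change_often=False,
    needs_citations=False,
    raw_domain_completions_weak=True,
    wrong_format_or_tone=False,
    wrong_choice_among_acceptable=False,
))
```

The function returns only the first tool to try. A product with several failures may still need more than one stage.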

Data for continued pretraining

The same discipline from large-scale pretraining still applies:

filter low-quality text
deduplicate aggressively
remove benchmarks and eval leakage
scrub PII and sensitive content
keep provenance and usage rights for every corpus slice
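
Two of these steps, scrubbing and deduplication, can be sketched with the standard library alone. The email pattern below is a deliberately simple illustration, not production PII tooling, and the sample documents are invented.

```python
import hashlib
import re

# Simplified email pattern for illustration; real PII scrubbing needs far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def scrub(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL.sub("<EMAIL>", text)

def normalized_hash(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

raw_documents = [
    "Escalate lane faults to ops@example.com within one hour.",
    "escalate lane faults to   OPS@example.com within one hour.",
    "Clear the belt before restarting the scanner.",
]

seen = set()
clean_documents = []
for text in raw_documents:
    scrubbed = scrub(text)
    key = normalized_hash(scrubbed)
    if key in seen:          # exact duplicate after normalization
        continue
    seen.add(key)
    clean_documents.append(scrubbed)

print(f"kept={len(clean_documents)} of {len(raw_documents)}")
print(clean_documents[0])
```

Scrubbing runs before hashing here so that documents differing only in the scrubbed field collapse into one entry.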

Gate the corpus before tokenization

gate_domain_manifest.py

sources = [
    {"name": "public-manuals", "tokens": 800_000, "licensed": True, "pii_scrubbed": True, "eval_only": False},
    {"name": "support-notes", "tokens": 120_000, "licensed": True, "pii_scrubbed": False, "eval_only": False},
    {"name": "heldout-probes", "tokens": 25_000, "licensed": True, "pii_scrubbed": True, "eval_only": True},
    {"name": "vendor-export", "tokens": 300_000, "licensed": False, "pii_scrubbed": True, "eval_only": False},
]

accepted = [
    row for row in sources
    if row["licensed"] and row["pii_scrubbed"] and not row["eval_only"]
]
rejected = [row["name"] for row in sources if row not in accepted]

print(f"accepted={[row['name'] for row in accepted]}")
print(f"training_tokens={sum(row['tokens'] for row in accepted):,}")
print(f"rejected={rejected}")

Corpus manifest gate

accepted=['public-manuals']
training_tokens=800,000
rejected=['support-notes', 'heldout-probes', 'vendor-export']

Keep evaluation text out of training

remove_exact_eval_overlap.py

import hashlib

def fingerprint(text: str) -> str:
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

heldout = [
    "Fault E17: belt obstruction. Clear belt and retry.",
    "Returns above $500 require supervisor approval.",
]
candidate_training = [
    "Scanner firmware notes for version 4.2.",
    "  fault E17: BELT obstruction. clear belt and retry. ",
    "Lane timeout codes and remediation steps.",
]

heldout_hashes = {fingerprint(text) for text in heldout}
clean_training = [
    text for text in candidate_training
    if fingerprint(text) not in heldout_hashes
]

print(f"removed={len(candidate_training) - len(clean_training)}")
print(f"kept={len(clean_training)}")
assert all(fingerprint(text) not in heldout_hashes for text in clean_training)

Exact evaluation decontamination

removed=1
kept=2
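
The exact-match fingerprint above misses held-out text that is embedded inside a longer training document. A word n-gram overlap check is one common complement; the sketch below assumes a 4-word window chosen only for this toy data, and the sample strings are invented.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

N = 4  # hypothetical window; longer windows reduce false positives on real corpora

heldout = ["Fault E17: belt obstruction. Clear belt and retry."]
candidates = [
    "Operator note: fault E17: belt obstruction. Clear belt and retry. Logged at hub.",
    "Lane timeout codes and remediation steps.",
]

heldout_ngrams = set().union(*(word_ngrams(text, N) for text in heldout))

# Keep only candidates that share no n-gram with any held-out text.
kept = [
    text for text in candidates
    if word_ngrams(text, N).isdisjoint(heldout_ngrams)
]

print(f"removed={len(candidates) - len(kept)}")
print(f"kept={kept}")
```

The first candidate would pass the exact-hash check because of its extra words, but it shares several 4-grams with the held-out line and is dropped here.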

Mixing strategy

Don't assume 100% domain text is always optimal. In practice, teams mix:

a high-quality domain slice
a smaller replay slice of general text

Pack blocks and preserve the mixture

pack_domain_token_blocks.py

EOS = 0
block_size = 6
documents = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

stream = []
for document in documents:
    stream.extend(document + [EOS])

blocks = [
    stream[start:start + block_size]
    for start in range(0, len(stream) - block_size + 1, block_size)
]

print(f"stream={stream}")
print(f"blocks={blocks}")
assert all(len(block) == block_size for block in blocks)
assert EOS in blocks[0]

Packed CPT token blocks

stream=[11, 12, 13, 0, 21, 22, 0, 31, 32, 33, 34, 0]
blocks=[[11, 12, 13, 0, 21, 22], [0, 31, 32, 33, 34, 0]]

This small example drops an incomplete final block instead of padding it. Production loaders need an explicit remainder policy.
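
An explicit remainder policy could look like the following sketch. The `pack` helper, the two policy names, and the `PAD` id are hypothetical; a real loader would use the tokenizer's own padding id and mask padded positions out of the loss.

```python
EOS = 0
PAD = -1  # hypothetical padding id; real tokenizers define their own

def pack(stream: list[int], block_size: int, remainder: str = "drop") -> list[list[int]]:
    """Split a token stream into fixed-size blocks with a stated remainder policy."""
    if remainder not in {"drop", "pad"}:
        raise ValueError("remainder must be 'drop' or 'pad'")
    blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
    if blocks and len(blocks[-1]) < block_size:
        tail = blocks.pop()
        if remainder == "pad":
            blocks.append(tail + [PAD] * (block_size - len(tail)))
    return blocks

stream = [11, 12, 13, EOS, 21, 22, EOS, 31, 32, 33, 34, EOS, 41, 42, EOS]

dropped = pack(stream, block_size=6, remainder="drop")
padded = pack(stream, block_size=6, remainder="pad")
print(f"drop -> {len(dropped)} blocks")
print(f"pad  -> {len(padded)} blocks, last={padded[-1]}")
```

Dropping wastes a few tokens per shard; padding keeps them at the cost of positions that contribute no training signal.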

Once domain and replay streams are packed, make mixture selection explicit and auditable. Here each twenty-block training window uses a seeded shuffle with the requested replay count.

build_replay_mixture.py

import random

def make_window(domain_blocks: list[str], replay_blocks: list[str], replay_ratio: float, size: int) -> list[str]:
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be between 0 and 1")
    replay_count = round(size * replay_ratio)
    domain_count = size - replay_count
    if len(domain_blocks) < domain_count or len(replay_blocks) < replay_count:
        raise ValueError("not enough packed blocks for requested window")
    chosen = domain_blocks[:domain_count] + replay_blocks[:replay_count]
    random.Random(7).shuffle(chosen)
    return chosen

domain_blocks = [f"domain-{index}" for index in range(20)]
replay_blocks = [f"general-{index}" for index in range(20)]
window = make_window(domain_blocks, replay_blocks, replay_ratio=0.25, size=20)

domain_count = sum(item.startswith("domain") for item in window)
replay_count = sum(item.startswith("general") for item in window)
print(f"domain_blocks={domain_count} replay_blocks={replay_count}")
print(f"first_five={window[:5]}")
assert (domain_count, replay_count) == (15, 5)

Deterministic replay mixture

domain_blocks=15 replay_blocks=5
first_five=['general-2', 'general-0', 'domain-11', 'general-3', 'domain-7']

Evaluation: domain gain without lying to yourself

Continued pretraining needs two evaluation lanes at the same time.

Lane 1: domain gain

Measure:

domain validation perplexity
retrieval or classification tasks in the domain
generation quality on held-out domain documents
downstream task lift after later SFT

Lane 2: general regression

Measure:

a small broad-language validation slice
a lightweight general benchmark set
free-form generations outside the target domain

If you only watch domain gain, you can accidentally produce a model that sounds like one incident runbook and has forgotten how to write broadly coherent language.

report_domain_and_general_ppl.py

import math

base = {
    "domain": {"negative_log_likelihood": 840.0, "tokens": 240},
    "general": {"negative_log_likelihood": 540.0, "tokens": 200},
}
adapted = {
    "domain": {"negative_log_likelihood": 720.0, "tokens": 240},
    "general": {"negative_log_likelihood": 548.0, "tokens": 200},
}

def perplexity(metrics: dict[str, float]) -> float:
    return math.exp(metrics["negative_log_likelihood"] / metrics["tokens"])

print("lane     base_ppl  adapted_ppl  delta")
for lane in ["domain", "general"]:
    base_ppl = perplexity(base[lane])
    adapted_ppl = perplexity(adapted[lane])
    print(f"{lane:<8}{base_ppl:>9.2f}{adapted_ppl:>13.2f}{adapted_ppl - base_ppl:>7.2f}")

Two-lane perplexity report

lane     base_ppl  adapted_ppl  delta
domain      33.12        20.09 -13.03
general     14.88        15.49   0.61

Runnable checkpoint ledger

continued_pretraining_checkpoint_picker.py

checkpoints = [
    {"name": "base", "domain_ppl": 42.0, "general_ppl": 19.2, "probe_acc": 0.62},
    {"name": "cpt-1k", "domain_ppl": 31.5, "general_ppl": 19.5, "probe_acc": 0.68},
    {"name": "cpt-4k", "domain_ppl": 27.9, "general_ppl": 20.1, "probe_acc": 0.72},
    {"name": "cpt-12k", "domain_ppl": 25.8, "general_ppl": 23.9, "probe_acc": 0.71},
]

base = checkpoints[0]
max_general_regression = 1.5

print("checkpoint  domain_gain  general_regression  probe_acc  keep")
best = None
best_rank = None

for row in checkpoints:
    domain_gain = base["domain_ppl"] - row["domain_ppl"]
    general_regression = row["general_ppl"] - base["general_ppl"]
    keep = general_regression <= max_general_regression
    rank = (row["probe_acc"], -row["domain_ppl"])

    if keep and (best_rank is None or rank > best_rank):
        best = row
        best_rank = rank

    print(
        f"{row['name']:<10}"
        f"{domain_gain:>11.1f}"
        f"{general_regression:>20.1f}"
        f"{row['probe_acc']:>11.2f}"
        f"  {'yes' if keep else 'no'}"
    )

print(f"chosen={best['name']}")
print("reason=best downstream probe, then domain perplexity, inside general-regression budget")

Checkpoint trade-off ledger

checkpoint  domain_gain  general_regression  probe_acc  keep
base              0.0                 0.0       0.62  yes
cpt-1k           10.5                 0.3       0.68  yes
cpt-4k           14.1                 0.9       0.72  yes
cpt-12k          16.2                 4.7       0.71  no
chosen=cpt-4k
reason=best downstream probe, then domain perplexity, inside general-regression budget

Stopping rules

Because continued pretraining keeps the same objective, it can feel deceptively safe. It isn't safe by default.

Good stopping cues:

domain validation loss flattens
downstream gains after a probe SFT stop improving
general regressions start to outweigh domain benefits

Bad stopping cues:

"we still have more domain text"
"loss is still going down a little"

More steps aren't a free lunch once the domain shift is already absorbed.
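
Two of the good stopping cues can be checked mechanically at each evaluation point. The sketch below covers the flattening and general-regression cues only; the downstream probe cue still needs its own SFT run. The `should_stop` helper, `patience`, `min_delta`, and the budget value are hypothetical.

```python
def should_stop(domain_losses: list[float], general_regression: float,
                patience: int, min_delta: float,
                max_general_regression: float) -> tuple[bool, str]:
    """Return (stop, reason) from domain validation history and general regression."""
    if general_regression > max_general_regression:
        return True, "general regression exceeds budget"
    if len(domain_losses) > patience:
        best_before = min(domain_losses[:-patience])
        best_recent = min(domain_losses[-patience:])
        # Flattened: the last `patience` evals improved by less than min_delta.
        if best_before - best_recent < min_delta:
            return True, "domain validation loss flattened"
    return False, "keep training"

# Illustrative histories only.
print(should_stop([3.50, 3.20, 3.05, 3.04, 3.04, 3.035], 0.3,
                  patience=3, min_delta=0.02, max_general_regression=1.5))
print(should_stop([3.50, 3.20, 3.00, 2.80], 0.3,
                  patience=2, min_delta=0.02, max_general_regression=1.5))
print(should_stop([3.50, 3.20, 3.00, 2.80], 4.7,
                  patience=2, min_delta=0.02, max_general_regression=1.5))
```

The general-regression check comes first on purpose: a run that is still improving on domain text should still stop once it exceeds the regression budget.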

Where it fits relative to LoRA and SFT

Compare the choices by asking what you want to change.

Goal	Best first tool
Inject fresh or citable facts without retraining	RAG
Teach new domain language patterns	continued pretraining
Teach chat or task format	SFT
Run a behavior update without full-weight training	SFT with LoRA / QLoRA adapters
Choose between multiple acceptable responses	DPO or RLHF

A strong training stack often looks like:

base model
continued pretraining on domain corpus
SFT on curated prompt-response data
preference optimization if needed

Not every product needs every stage. Choose the stage that matches the failure you observe.

Common pitfalls

Using continued pretraining to fix assistant tone

Symptom: the model still formats answers badly after a long domain-text run.
Cause: the issue was interface behavior, not domain language exposure.
Fix: move to SFT sooner.

Over-specializing on one corpus

Symptom: domain completions improve, but the model becomes narrow or brittle elsewhere.
Cause: no replay mixture, or too many adaptation steps.
Fix: keep a general-text regression lane and stop earlier.

Skipping the downstream check

Symptom: domain perplexity improves, but the final task model barely benefits.
Cause: the adaptation run optimized text fit that did not transfer to the product task.
Fix: probe the adapted checkpoint with a small downstream SFT instead of judging only by perplexity.

Mastery check

Defend these points:

Continued pretraining keeps causal language-modeling loss while changing corpus distribution; SFT changes supervision format to prompt-response examples.
Use continued pretraining when raw domain text is weak, and use SFT when domain understanding is fine but interface behavior is wrong.
For checkpoints that ended at a low learning rate, test re-warming and re-decaying plus replay ratios; Ibrahim et al.'s 5% and 25% mixes are study-specific reference points, not defaults [3] (https://openreview.net/forum?id=DimPeeCxKO).
Reach for RAG when facts change or must be cited; reach for CPT only when raw domain text itself confuses the base model.
Track domain gain, downstream probe lift, and broad-language regression together before choosing a checkpoint.

Evaluation rubric

Strong: separates RAG, CPT, and SFT by what each one changes (prompt context at inference, next-token fit to domain text, or response behavior), then explains that LoRA changes parameterization rather than choosing the objective.
Strong: explains forgetting and underfitting as opposite failures controlled mainly by re-warm peak and replay ratio.
Strong: uses two evaluation lanes at once: domain gain and general regression.
Weak: chooses continued pretraining for changing facts or assistant tone problems that should start with RAG or SFT.
Weak: picks the final checkpoint only because domain perplexity kept falling.

Follow-up questions

Prompt	Answer sketch
What is the core difference between continued pretraining and SFT?	Continued pretraining feeds unlabeled or weakly structured domain text through the same next-token objective. SFT trains on prompt-response examples to teach answer format, task behavior, and interface style.
When is continued pretraining a better first move than SFT? Can LoRA decide that?	Choose CPT when the base model is weak on domain language itself; choose SFT when it understands the text but answers poorly. LoRA can't decide between them because it can parameterize either objective.
A team wants the model to answer questions about this week's pricing rules. CPT, SFT, or RAG?	RAG. The facts change often and should be citable, so retrieving them at inference beats baking them into weights. CPT is for absorbing the domain's language, not for chasing fast-moving facts.
How do you limit general regression during continued pretraining?	Sweep re-warm/re-decay schedules and replay ratios, then watch a general-text regression lane beside domain gain. Don't assume a paper's replay percentage transfers to your data.
How do you know a continued-pretraining run went too far?	Domain metrics improve, but broad validation regresses, generations become narrow, or downstream probe SFT stops improving. Choose an earlier checkpoint with better overall trade-off.

What to remember

Continued pretraining keeps the same causal LM objective and changes the corpus.
It's best when the problem is domain language weakness, not a missing fact (RAG) or a wrong format (SFT).
Balance forgetting against underfitting by evaluating learning-rate re-warming/re-decaying and replay mixtures.
Replay ratios are experimental choices; measure them under a fixed token budget and a general-regression gate.
You still need filtering, deduplication, provenance, and leakage control.
Domain gain must be tracked beside general regression, not instead of it.
The right next step after continued pretraining is usually SFT, not direct deployment.


References

Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.

Gururangan, S., Marasovic, A., Swayamdipta, S., et al. · 2020 · ACL 2020

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs

Ovadia, O., Brief, M., Mishaeli, M., & Elisha, O. · 2024 · EMNLP 2024

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Ibrahim, A., Therien, B., Gupta, K., et al. · 2024 · Transactions on Machine Learning Research

BloombergGPT: A Large Language Model for Finance

Wu, S., Irsoy, O., Lu, S., et al. · 2023

QLoRA: Efficient Finetuning of Quantized Language Models.

Dettmers, T., et al. · 2023 · NeurIPS

