Supervised Fine-Tuning Pipeline

Run supervised fine-tuning as a real training system: choose the learning objective before the update surface, verify response-token loss and packing, track the real batch budget, save resumable checkpoints, and export the checkpoint that wins on held-out behavior.

The synthetic-data chapter built and verified candidate rows for post-training. Supervised fine-tuning (SFT) turns accepted demonstrations, whether written by humans or generated and checked, into gradient updates on desired responses. A good SFT run doesn't start with a trainer call. It starts with a behavior target, a leakage-resistant split, a loss mask, a batch budget, checkpoints that can resume, and an evaluation rule for exporting the best artifact.^{[1]Reference 1TRL Documentation: SFT Trainer.https://huggingface.co/docs/trl/sft_trainer}^{[2]Reference 2Fine-Tune Your First LLM.https://meta-pytorch.org/torchtune/stable/tutorials/first_finetune_tutorial.html}

This lesson turns SFT into an operational training system for access-policy assistant replies. First choose the learning objective. Then choose whether that SFT objective updates every weight or a small adapter. Keeping those two decisions separate prevents expensive experiments that answer the wrong question.

Figure: SFT pipeline showing accepted rows, masked prompt labels, parameterization, held-out evaluation, and export. An SFT run has one supervised objective but multiple possible update surfaces. Prompt tokens stay visible as context while -100 labels remove them from loss; keep that mask, the split, parameterization, checkpoint state, and held-out evaluation aligned before trusting an export.

What an SFT run must decide

Every serious SFT job has to answer the same questions:

What is the behavior target?
Is SFT the right objective, or is the missing ingredient still domain pretraining?
Are we updating all weights or only adapters?
What examples and tokens contribute to training loss?
What is the true example and supervised-token batch budget?
Which held-out metric decides the best checkpoint?
Can this stay on one GPU, or do we need Fully Sharded Data Parallel (FSDP) or Zero Redundancy Optimizer (ZeRO)?

Once you phrase the job that way, the trainer isn't the system. It's only one component inside the system.

Separate objective from parameterization

An objective tells the model what signal to learn from. A parameterization tells the trainer which weights may move. They aren't interchangeable.

Failure you observe	Objective to investigate	Why
Base model has weak exposure to domain language in unlabeled corpora	continued pretraining	next-token training on domain text adapts the language distribution before behavior training^{[3]Reference 3Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.https://aclanthology.org/2020.acl-main.740/}
Model can read an access policy but doesn't follow the desired escalation procedure	SFT	prompt-completion demonstrations directly teach that response behavior^{[1]Reference 1TRL Documentation: SFT Trainer.https://huggingface.co/docs/trl/sft_trainer}
Several acceptable answers need ranking by preference	preference training after establishing an SFT baseline	comparisons express relative preference rather than a single target answer^{[4]Reference 4Direct Preference Optimization: Your Language Model is Secretly a Reward Model.https://arxiv.org/abs/2305.18290}

After choosing SFT, make a second decision:

SFT parameterization	What moves during the same supervised objective	When to test it
Full fine-tuning	all trainable model weights	when memory permits and adapters may be too restrictive
LoRA	small low-rank adapter matrices; base weights stay frozen	when iteration speed, memory, or many task variants matter^{[5]Reference 5LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685}
QLoRA	LoRA adapters while the frozen base is stored in 4-bit form	when the base model doesn't fit comfortably at higher precision^{[6]Reference 6QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314}

LoRA and QLoRA aren't alternatives to SFT. They can be the parameterization used for an SFT run. Continued pretraining is a different objective; it can also be parameter-efficient in an appropriate setup, but it still doesn't become SFT.

Full weights or adapters

Full fine-tuning

Full fine-tuning is the simplest conceptual update: every eligible parameter may change to reduce supervised response loss. Test it when:

the model fits with its gradients and optimizer state
one merged checkpoint is operationally simpler than serving adapters
an adapter baseline underfits your held-out behavior metric

LoRA

LoRA is still SFT when trained on prompt-response rows. Use it when:

iteration speed or training memory limits matter
you need separate task or tenant variants over one base model
you want a strong adapter baseline before paying for full updates

LoRA freezes the base model weights and trains low-rank adapter matrices inside selected layers, which reduces trainable parameters and memory compared with full fine-tuning.^{[5]Reference 5LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685}
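
To make the size difference concrete, the sketch below counts trainable parameters for a single weight matrix. The 4096 x 4096 shape and rank 16 are illustrative assumptions, not values from any particular model; the only fact used is that LoRA replaces a full update of a d_out x d_in matrix with the product of a d_out x r matrix and an r x d_in matrix.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable values when the whole weight matrix may move."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection shape and adapter rank, chosen only for illustration.
d_out = d_in = 4096
rank = 16

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full_matrix_params={full}")
print(f"lora_adapter_params={lora}")
print(f"trainable_fraction={lora / full:.4f}")
```

The fraction shrinks as the matrix grows and rises with the rank, which is why rank and target layers belong in the run record next to the learning rate.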

QLoRA

Use QLoRA when:

LoRA is the right algorithmic choice
the frozen base model itself is too large to keep in BF16/FP16 memory
you still want adapter training instead of full-weight updates

QLoRA keeps the adapter-training idea but backpropagates through a frozen 4-bit quantized base model into LoRA adapters, which is why it can make larger models fit in less memory.^{[6]Reference 6QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314} The comparison that matters is held-out behavior under an honest memory and compute budget. A cheap run that fails the task isn't a win.
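
A rough, weight-only estimate shows why 4-bit storage changes what fits. This is a sketch under stated assumptions: the 7-billion-parameter count is hypothetical, and the estimate ignores quantization constants, adapter weights, activations, gradients, and optimizer state, so real memory use is higher.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Storage for the frozen base weights only, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


num_params = 7e9  # hypothetical model size

bf16_gib = weight_gib(num_params, 16)
four_bit_gib = weight_gib(num_params, 4)

print(f"bf16_base_weights_gib={bf16_gib:.1f}")
print(f"four_bit_base_weights_gib={four_bit_gib:.1f}")
print(f"ratio={bf16_gib / four_bit_gib:.1f}")
```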

Reject broken demonstrations before training

For this pipeline, store a prompt, a desired completion, and a group key for leakage-resistant splitting on every SFT row. A missing completion isn't harmless missing metadata: it's a training example with no answer to teach.

validate_sft_rows.py

rows = [
    {"thread_id": "T102", "prompt": "Stale key", "completion": "Request rotation evidence."},
    {"thread_id": "B550", "prompt": "Expired key", "completion": ""},
    {"prompt": "Privileged role change", "completion": "Escalate to the access owner."},
]

required = {"thread_id", "prompt", "completion"}
accepted, rejected = [], []
for index, row in enumerate(rows):
    missing = sorted(required - row.keys())
    if missing:
        rejected.append((index, f"missing {missing}"))
    elif not row["completion"].strip():
        rejected.append((index, "empty completion"))
    else:
        accepted.append(row)

print("accepted_threads=", [row["thread_id"] for row in accepted])
print("rejected=", rejected)

Row contract output

accepted_threads= ['T102']
rejected= [(1, 'empty completion'), (2, "missing ['thread_id']")]

Split by the unit that could leak

Before formatting rows, reserve evaluation examples that training can't imitate through near duplicates. For a policy assistant, several messages from one access incident or one policy document often share facts and phrasing. Randomly splitting individual messages can put one part of the same case in training and another in evaluation.

Group by the smallest deployment unit that should be unseen at evaluation time: incident thread, policy document, tenant, or time window. The tiny check below keeps every message from an incident on one side of the split.

grouped_sft_split.py

# Rows use the same thread_id and completion keys as the row contract;
# prompts are omitted to keep the example short.
rows = [
    {"thread_id": "T102", "turn": 1, "completion": "Request key-rotation evidence."},
    {"thread_id": "T102", "turn": 2, "completion": "Approve access after rotation proof."},
    {"thread_id": "R550", "turn": 1, "completion": "Escalate privileged-role changes."},
    {"thread_id": "P771", "turn": 1, "completion": "Cite the session-timeout policy."},
]

# Hold out whole incident threads, never individual turns.
eval_cases = {"R550"}
train = [row for row in rows if row["thread_id"] not in eval_cases]
evaluation = [row for row in rows if row["thread_id"] in eval_cases]

# No thread may appear on both sides of the split.
train_cases = {row["thread_id"] for row in train}
held_out_cases = {row["thread_id"] for row in evaluation}
assert train_cases.isdisjoint(held_out_cases)

print("train_cases=", sorted(train_cases))
print("eval_cases=", sorted(held_out_cases))
print("case_overlap=", train_cases & held_out_cases)

Grouped split output

train_cases= ['P771', 'T102']
eval_cases= ['R550']
case_overlap= set()
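
Hand-picking held-out cases works for four rows but not for a real dataset. One common approach, sketched here as an assumption rather than the lesson's prescribed method, is to hash the group key so that every row from a case lands on the same side and the assignment stays stable when new rows arrive. The function name, salt, and 20 percent evaluation share are hypothetical.

```python
import hashlib


def assign_split(group_key: str, eval_percent: int = 20, salt: str = "sft-split-v1") -> str:
    """Map a group key to 'train' or 'eval' using a stable hash bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "eval" if bucket < eval_percent else "train"


# One entry per row; repeated keys stand for several turns of the same case.
case_ids = ["T102", "T102", "R550", "P771"]
split_by_case = {case_id: assign_split(case_id) for case_id in case_ids}

for case_id, side in sorted(split_by_case.items()):
    print(f"{case_id} -> {side}")
```

Because the side depends only on the key and the salt, changing the salt is a deliberate, versioned way to draw a new split.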

Data path: template, tokenize, label, pack

An SFT run usually performs these steps:

read structured examples
apply the model-specific chat template
run tokenization
label desired response tokens and mask the prompt/context tokens
pack short examples with sequence-boundary handling, or pad them into batches
feed the trainer

The instruction-tuning lesson already covered chat-template mechanics. The operational lesson here is that you version this whole path together. If you change the template or masking logic, you changed the training distribution and therefore the experiment.^{[7]Reference 7Transformers Documentation: Writing a chat template.https://huggingface.co/docs/transformers/main/en/chat_templating_writing}
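
One lightweight way to version the path as a unit is to fingerprint the settings that define it. The sketch below is an illustration, not a feature of any library: the field names and version strings are hypothetical, and a real run would also hash the template text itself.

```python
import hashlib
import json


def data_path_fingerprint(config: dict) -> str:
    """Short stable hash of the data-path settings, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical settings that together define the training distribution.
baseline = {
    "chat_template_version": "policy-template-v2",
    "tokenizer_version": "policy-sft-tokenizer-v3",
    "loss_mask": "response_only",
    "packing_strategy": "bfd",
    "max_length": 1024,
}
changed_mask = {**baseline, "loss_mask": "full_sequence"}

print("baseline=", data_path_fingerprint(baseline))
print("changed_mask=", data_path_fingerprint(changed_mask))
print("same_experiment=", data_path_fingerprint(baseline) == data_path_fingerprint(changed_mask))
```

Logging the fingerprint with every checkpoint makes it obvious when two runs that look comparable were actually trained on different formatting or masking.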

The label mask is easy to get wrong because the input still contains the system instruction, user request, and formatting control tokens. For response-only SFT, the desired answer tokens (including the turn terminator when the model must learn when to stop) contribute to loss. Prefix tokens provide context, not answer targets. The tiny example uses -100, the ignored label value used by PyTorch cross-entropy.

sft_loss_mask.py

IGNORE_INDEX = -100

tokens = [
    ("prefix", "<system>"),
    ("prefix", "Follow access policy."),
    ("prefix", "<user>"),
    ("prefix", "Key T102 is stale."),
    ("prefix", "<assistant>"),
    ("target", "Ask for rotation evidence within 24 hours."),
    ("target", "<end_of_turn>"),
]

labels = [
    token if span == "target" else IGNORE_INDEX
    for span, token in tokens
]

supervised = [
    token
    for (_, token), label in zip(tokens, labels)
    if label != IGNORE_INDEX
]
print("supervised_tokens=", supervised)
print("masked_positions=", sum(label == IGNORE_INDEX for label in labels))

Loss mask output

supervised_tokens= ['Ask for rotation evidence within 24 hours.', '<end_of_turn>']
masked_positions= 5

Don't confuse the label mask with the attention mask. The label mask decides which positions contribute to loss. The attention mask decides which earlier positions a token can read. Prompt tokens should remain visible as context even when their labels are -100; packed examples also need attention boundaries so one example can't read another.

Why bother masking easy prefix tokens? If a long repeated instruction is easy to predict, its small losses can dominate the average and hide poor answer learning:

response_only_loss.py

prefix_losses = [0.03, 0.04, 0.02, 0.05, 0.03]
answer_losses = [2.20, 1.80]

full_sequence_loss = sum(prefix_losses + answer_losses) / len(prefix_losses + answer_losses)
response_only_loss = sum(answer_losses) / len(answer_losses)

print(f"full_sequence_loss={full_sequence_loss:.3f}")
print(f"response_only_loss={response_only_loss:.3f}")
assert response_only_loss > full_sequence_loss

Response-only loss output

full_sequence_loss=0.596
response_only_loss=2.000

You rarely build this mask by hand. In Hugging Face TRL's SFTTrainer, prompt-completion datasets use completion-only loss by default. Conversational datasets can set assistant_only_loss=True; their chat template must expose assistant spans through generation markers. Current TRL can substitute supported training templates for known model families, but a custom template still needs verification.^{[1]Reference 1TRL Documentation: SFT Trainer.https://huggingface.co/docs/trl/sft_trainer} Inspect a tokenized example before launching a long run: it should reveal exactly which answer and stop tokens receive labels.
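
A small inspection helper can make that check routine. This is a sketch with toy token IDs, assuming the Hugging Face convention that labels are a copy of input_ids with ignored positions set to -100; the function name and report keys are hypothetical.

```python
IGNORE_INDEX = -100


def inspect_labels(token_strings, input_ids, labels, stop_token):
    """Summarize which positions are supervised in one tokenized example."""
    if not (len(token_strings) == len(input_ids) == len(labels)):
        raise ValueError("token_strings, input_ids, and labels must have equal length")
    supervised = [
        token for token, label in zip(token_strings, labels) if label != IGNORE_INDEX
    ]
    # A supervised label should equal the input ID at the same position.
    mismatches = [
        index
        for index, (token_id, label) in enumerate(zip(input_ids, labels))
        if label != IGNORE_INDEX and label != token_id
    ]
    return {
        "supervised_tokens": supervised,
        "label_id_mismatches": mismatches,
        "ends_with_stop": bool(supervised) and supervised[-1] == stop_token,
    }


# Toy example: IDs are made up, not from a real tokenizer.
token_strings = ["<user>", "stale", "key", "<assistant>", "request", "evidence", "<end_of_turn>"]
input_ids = [1, 17, 23, 2, 41, 52, 3]
labels = [IGNORE_INDEX] * 4 + [41, 52, 3]

report = inspect_labels(token_strings, input_ids, labels, stop_token="<end_of_turn>")
print(report)
```

With a real tokenizer, the token strings would come from decoding each ID in a collated batch, so the check covers the template, truncation, and collator together.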

Mask inspection must happen after truncation as well. If the prompt fills max_length, the run can keep a row whose entire answer has disappeared:

reject_truncated_completion.py

IGNORE_INDEX = -100

prompt_tokens = ["<system>", "policy", "<user>", "stale_key", "<assistant>"]
answer_tokens = ["request_evidence", "<end_of_turn>"]
max_length = 5

kept_tokens = (prompt_tokens + answer_tokens)[:max_length]
labels = [
    token if position >= len(prompt_tokens) else IGNORE_INDEX
    for position, token in enumerate(kept_tokens)
]
has_answer_label = any(label != IGNORE_INDEX for label in labels)

print("kept_tokens=", kept_tokens)
print("has_answer_label=", has_answer_label)
print("decision=", "keep" if has_answer_label else "reject_or_increase_max_length")

Truncation guard output

kept_tokens= ['<system>', 'policy', '<user>', 'stale_key', '<assistant>']
has_answer_label= False
decision= reject_or_increase_max_length

Pack short examples while preserving their boundaries

Many SFT examples are short replies, far below the maximum sequence length. Padding each one to a fixed length wastes compute on padding tokens. Packing places several examples in a nearly full training block so more positions carry real tokens.

For independent instruction examples, a reply for case R550 shouldn't gain context from case T102 only because the loader put them in one block. A boundary-aware packed attention path records each segment, resets positions at each sequence boundary, and prevents that cross-example attention:

packed_boundary_check.py

segments = [
    ["T102 prompt", "T102 response", "<eos>"],
    ["R550 prompt", "R550 response", "<eos>"],
]

# Flatten the packed block while remembering which example each token came from.
flat = [(sequence_id, token) for sequence_id, row in enumerate(segments) for token in row]
# Positions restart at 0 for every packed example.
position_ids = [position for row in segments for position, _ in enumerate(row)]

def may_attend(query_index, key_index):
    # Causal attention, restricted to tokens from the same packed example.
    query_sequence, _ = flat[query_index]
    key_sequence, _ = flat[key_index]
    return key_index <= query_index and query_sequence == key_sequence

t102_response = 1  # index of "T102 response" in the flattened block
r550_response = 4  # index of "R550 response" in the flattened block
assert position_ids == [0, 1, 2, 0, 1, 2]
assert not may_attend(r550_response, t102_response)
print("position_ids=", position_ids)
print("R550_response_sees_T102_response=", may_attend(r550_response, t102_response))

Packed boundary output

position_ids= [0, 1, 2, 0, 1, 2]
R550_response_sees_T102_response= False

Reset position IDs are boundary metadata, not a universal attention mask. The attention implementation still has to honor those boundaries when it computes attention.

Current TRL exposes packing=True and eval_packing. Its default best-fit-decreasing (bfd) strategy packs intact examples efficiently and truncates overflow sequences. bfd_split preserves overflow tokens by splitting long sequences before packing. wrapped is the aggressive stream-and-split option, and it may mix unrelated examples. With bfd, TRL enables padding-free processing. That flattened path relies on FlashAttention 2 or 3 to honor sequence boundaries; without a compatible attention implementation, adjacent samples may contaminate each other.^{[1]Reference 1TRL Documentation: SFT Trainer.https://huggingface.co/docs/trl/sft_trainer} The right check isn't "never pack." It's: choose an SFT-appropriate strategy, confirm the supported attention path, and inspect a packed batch after library upgrades.
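
The best-fit-decreasing idea itself is simple bin packing. The sketch below illustrates the general heuristic on example lengths only; it is not TRL's implementation, and unlike the bfd behavior described above it rejects over-length rows instead of truncating them. The lengths and block size are made up.

```python
def best_fit_decreasing(lengths, capacity):
    """Pack whole examples into blocks: longest first, tightest fitting block wins."""
    bins = []  # each bin tracks its remaining space and the lengths placed in it
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"example of length {length} exceeds block size {capacity}")
        candidates = [block for block in bins if block["free"] >= length]
        if candidates:
            target = min(candidates, key=lambda block: block["free"])
        else:
            target = {"free": capacity, "items": []}
            bins.append(target)
        target["items"].append(length)
        target["free"] -= length
    return [block["items"] for block in bins]


lengths = [700, 300, 512, 200, 512, 100, 724]  # hypothetical token counts
capacity = 1024

packed = best_fit_decreasing(lengths, capacity)
real_tokens = sum(lengths)
padded_utilization = real_tokens / (len(lengths) * capacity)
packed_utilization = real_tokens / (len(packed) * capacity)

print("packed_blocks=", packed)
print(f"padded_utilization={padded_utilization:.3f}")
print(f"packed_utilization={packed_utilization:.3f}")
```

Every example stays intact inside one block, so the boundary metadata from the previous check is still enough to keep examples from attending to each other.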

Hyperparameters are experiments, not promises

SFT begins from a pretrained checkpoint, so a useful configuration starts conservatively and earns expansion through held-out results. A library default isn't evidence that one value fits your dataset or parameterization.

Knob	Baseline to test	What decides whether it survives
Learning rate	TRL `SFTConfig` currently defaults to `2e-5`; its PEFT guidance suggests about `1e-4` for adapters^{[1]Reference 1TRL Documentation: SFT Trainer.https://huggingface.co/docs/trl/sft_trainer}	a short sweep on the actual update surface and held-out task metrics
Passes over data	start with a small budget and evaluate during the run	widening train/eval gap or declining behavior metrics
Warmup and schedule	set explicitly in the run config	early instability and sweep results
Weight decay and clipping	log them with the run, even when disabled	stability and evaluation evidence, not inherited habit

Don't turn 2e-5 or 1e-4 into a rule. Record parameterization, trainable-parameter count, sequence length, effective token budget, data snapshot, and evaluation result beside every candidate configuration.

Effective batch size is also a token budget

Training discussions get sloppy fast here. The batch that sits on one device isn't the same thing as the batch the optimizer sees before one update. For equal-length examples, the usual rule is:

examples_per_update = per_device_batch * gradient_accumulation_steps * world_size

Packed response examples aren't equal length. This formula reports examples per optimizer update, but the number of supervised answer tokens can still change from update to update. Track both.

sft_effective_batch.py

per_device_batch = 4
grad_accum = 8
world_size = 4
examples_per_update = per_device_batch * grad_accum * world_size

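# Supervised (non-masked) tokens in each of the 8 accumulated micro-batches on one rank.
supervised_tokens_per_microbatch_on_one_rank = [180, 212, 164, 220, 198, 204, 175, 207]
# Approximation: assumes every rank carries about the same count as this rank.
# In a real run, sum the measured counts across all ranks.
supervised_tokens_per_update = sum(supervised_tokens_per_microbatch_on_one_rank) * world_size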

print(f"examples_per_update={examples_per_update}")
print(f"supervised_tokens_per_update={supervised_tokens_per_update}")

Effective batch output

examples_per_update=128
supervised_tokens_per_update=6240

Always calculate changed configurations instead of reasoning from one changed knob:

compare_update_sizes.py

def examples_per_update(per_device_batch, accumulation, world_size):
    return per_device_batch * accumulation * world_size

run_a = examples_per_update(4, 8, 4)
run_b = examples_per_update(2, 16, 8)

print("run_a_examples_per_update=", run_a)
print("run_b_examples_per_update=", run_b)
print("ratio_b_over_a=", run_b / run_a)
assert run_b == 2 * run_a

Run comparison output

run_a_examples_per_update= 128
run_b_examples_per_update= 256
ratio_b_over_a= 2.0

Two habits follow from this:

when someone reports batch size, ask for accumulation, world size, packing, and supervised tokens too
compare learning-rate sweeps with similar token budgets or explain why they differ

Warmup, clipping, and weight decay

These knobs aren't decoration.

Warmup

Warmup limits the size of early updates while the run is most sensitive to its initial configuration. For LLM SFT, jumping immediately to the peak learning rate can make the first steps unnecessarily noisy.
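
A linear warmup is the simplest version of this idea. The sketch below covers only the ramp and assumes a constant rate afterwards; the decay schedule that usually follows is omitted, and the peak rate and step counts are illustrative.

```python
def linear_warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Learning rate for a 0-indexed optimizer step during and after linear warmup."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


peak_lr = 1e-4     # illustrative value
warmup_steps = 10  # illustrative value

for step in (0, 4, 9, 50):
    print(f"step={step} lr={linear_warmup_lr(step, peak_lr, warmup_steps):.2e}")
```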

Gradient clipping

Clip gradients when spikes would destabilize the run. Clipping doesn't fix bad data, but it can stop one pathological batch from damaging the checkpoint.
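
Clipping by global norm rescales the whole gradient when its length exceeds a threshold, so the direction is kept and only the step size shrinks. The pure-Python sketch below shows the arithmetic on a flat list of values; in PyTorch the corresponding utility is torch.nn.utils.clip_grad_norm_, and the threshold here is illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Return (possibly rescaled gradients, original global L2 norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


spiky = [3.0, 4.0]  # global norm 5.0
clipped, norm_before = clip_by_global_norm(spiky, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(f"norm_before={norm_before:.2f}")
print(f"norm_after={norm_after:.2f}")
```

Logging the pre-clip norm is as useful as the clipping itself: a run that clips on most steps is telling you something about the data or the learning rate.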

Weight decay

Weight decay shrinks the trainable weights toward zero by a small factor on every update, which discourages large parameter values. Treat it as part of the recipe, not an afterthought copy-pasted from another config.
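
The sketch below shows decoupled weight decay on a single scalar weight. It is a simplification: a plain gradient step stands in for the Adam update so the decay term is easy to see, and the learning rate and decay values are illustrative.

```python
def decoupled_decay_step(weight: float, grad: float, lr: float, weight_decay: float) -> float:
    """Shrink the weight by lr * weight_decay, then apply a plain gradient step."""
    decayed = weight * (1.0 - lr * weight_decay)
    return decayed - lr * grad


weight = 1.0
# With a zero gradient, only the decay term moves the weight.
after_decay_only = decoupled_decay_step(weight, grad=0.0, lr=0.1, weight_decay=0.01)
# With decay disabled, the update is the gradient step alone.
after_grad_only = decoupled_decay_step(weight, grad=0.5, lr=0.1, weight_decay=0.0)

print(f"after_decay_only={after_decay_only:.4f}")
print(f"after_grad_only={after_grad_only:.4f}")
```

Because the shrink factor depends on the learning rate, a change to either value changes the effective regularization, which is one reason to log both with the run.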

Don't memorize one numeric recipe. Know what each control is trying to prevent.

What a resumable run must save

A real SFT resume bundle isn't only model_state_dict.

At minimum, retain:

model weights
optimizer state
scheduler state
global step / epoch
RNG state and sampler/data position when repeatable continuation matters
tokenizer + chat-template version
training config, data manifest, and evaluation manifest
best metric so far

Why this matters:

optimizer state controls the next update
scheduler state controls learning rate at resume
sampler state prevents quietly repeating or skipping shuffled training rows
tokenizer/template version controls what the model thinks each token means

Exact bit-for-bit replay may still depend on hardware, kernels, and determinism settings. The resume bundle should at least prevent avoidable changes to data order, learning-rate state, and model selection. torchtune and DeepSpeed both document resume flows explicitly instead of treating checkpointing as a trivial file write.^{[8]Reference 8Checkpointing in torchtune.https://meta-pytorch.org/torchtune/stable/deep_dives/checkpointer.html}^{[9]Reference 9Universal Checkpointing with DeepSpeed: A Practical Guide.https://www.deepspeed.ai/tutorials/universal-checkpointing/}
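
The data-position part of that bundle can be illustrated with a seeded shuffle. This is a minimal sketch, not any library's sampler: the seed-mixing formula, row count, and batch size are assumptions chosen for the example.

```python
import random


def epoch_order(num_rows: int, seed: int, epoch: int) -> list:
    """Deterministic shuffled row order for one epoch."""
    order = list(range(num_rows))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def to_batches(order: list, batch_size: int) -> list:
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


num_rows, batch_size = 10, 2

# Uninterrupted run: every batch of epoch 1 in order.
uninterrupted = to_batches(epoch_order(num_rows, seed=17, epoch=1), batch_size)

# Interrupted run: the checkpoint recorded where the sampler stopped.
sampler_state = {"seed": 17, "epoch": 1, "batches_consumed": 3}
rebuilt = to_batches(
    epoch_order(num_rows, sampler_state["seed"], sampler_state["epoch"]), batch_size
)
resumed = rebuilt[sampler_state["batches_consumed"]:]

print("remaining_batches_match=", resumed == uninterrupted[3:])
print("rows_seen_twice=", set(sum(uninterrupted[:3], [])) & set(sum(resumed, [])))
```

Without the saved seed, epoch, and consumed count, a restart would reshuffle from scratch and silently repeat some rows while skipping others.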

Figure: Resumable SFT run bundle showing model weights, optimizer and scheduler state, sampler position, formatting versions, data and evaluation manifests, and best metric. A resume bundle is more than weights. Repeatable continuation needs model and optimizer state, data position, formatting versions, evaluation definition, and the record of which checkpoint currently wins.

A minimal manifest should make missing resume state obvious:

resume_bundle_manifest_check.py

required = {
    "model_state",
    "optimizer_state",
    "scheduler_state",
    "global_step",
    "rng_state",
    "sampler_state",
    "tokenizer_version",
    "chat_template_version",
    "data_manifest",
    "eval_manifest",
    "best_metric",
}

resume_bundle = {
    "model_state": "weights/step_600.safetensors",
    "optimizer_state": "optimizer/step_600.pt",
    "scheduler_state": "scheduler/step_600.pt",
    "global_step": 600,
    "rng_state": "rng/step_600.pt",
    "sampler_state": {"epoch": 1, "batches_consumed": 120},
    "tokenizer_version": "policy-sft-tokenizer-v3",
    "chat_template_version": "llama3-support-v2",
    "data_manifest": "data/sft_manifest_2026-05-20.json",
    "eval_manifest": "eval/access_policy_behavior_v4.json",
    "best_metric": {"name": "support_resolution_accuracy", "value": 0.78},
}

missing = sorted(required - resume_bundle.keys())
print("resume_ready=", not missing)
print("best_metric=", resume_bundle["best_metric"])

Checkpoint manifest output

resume_ready= True
best_metric= {'name': 'support_resolution_accuracy', 'value': 0.78}

Choosing the best checkpoint

The best checkpoint is the one that optimizes the product metric you care about on held-out data.

Bad rules:

lowest train loss
latest step
biggest file

Better evidence:

highest exact or structured task success where the behavior is verifiable
best human preference rate on a blinded held-out set when quality is subjective
judge-assisted scoring only after calibrating the judge against human or verifiable decisions
lowest validation loss only when that truly matches the task
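
Calibrating a judge can start with two numbers on a labeled sample: how often it agrees with humans, and how often it passes a reply that humans failed. The labels below are invented for illustration, and what counts as acceptable agreement is a project decision, not something this sketch settles.

```python
def agreement_rate(judge, human):
    """Fraction of examples where the judge and the human label match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def false_pass_rate(judge, human):
    """Among human-failed examples, the fraction the judge passed."""
    judged_on_fails = [j for j, h in zip(judge, human) if h == "fail"]
    if not judged_on_fails:
        return 0.0
    return sum(j == "pass" for j in judged_on_fails) / len(judged_on_fails)


# Hypothetical labels on eight held-out replies.
human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass"]

print(f"agreement={agreement_rate(judge, human):.2f}")
print(f"false_pass_rate={false_pass_rate(judge, human):.2f}")
```

For policy-sensitive replies the false-pass rate usually matters more than raw agreement, because it measures how often a violation would slip through checkpoint selection.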

If the task is structured access-policy replies, check policy compliance and schema validity before style preference. Loss is useful for training diagnostics, but it doesn't define deployment quality.

select_sft_checkpoint.py

checkpoints = [
    {"step": 200, "val_loss": 1.91, "policy_pass_rate": 0.93, "format_pass_rate": 0.99},
    {"step": 400, "val_loss": 1.77, "policy_pass_rate": 0.91, "format_pass_rate": 1.00},
    {"step": 600, "val_loss": 1.79, "policy_pass_rate": 0.97, "format_pass_rate": 0.98},
]

eligible = [row for row in checkpoints if row["format_pass_rate"] >= 0.98]
best = max(eligible, key=lambda row: (row["policy_pass_rate"], -row["val_loss"]))

print("best_step=", best["step"])
print("best_policy_pass_rate=", best["policy_pass_rate"])
print("lowest_loss_step=", min(checkpoints, key=lambda row: row["val_loss"])["step"])

Checkpoint selection output

best_step= 600
best_policy_pass_rate= 0.97
lowest_loss_step= 400

Single GPU first, then scale out

A good SFT rollout usually starts with the smallest honest setup:

one GPU
tiny eval slice
frequent checkpoints
verified template/masking
one clean metric

Only then do you widen the job.

Signs that one GPU is still fine

the model and activations fit
throughput is acceptable
you're still debugging correctness
labels are the main uncertainty, not compute budget

Signs that you need FSDP or ZeRO

full-model weights and optimizer state don't fit
activation memory causes out-of-memory failures at useful context lengths
activation checkpointing and gradient accumulation still leave the job too slow
you need larger world size to hit the schedule

That bridge matters. FSDP and ZeRO aren't "advanced decorations" on top of a stable recipe. They let the same recipe continue when memory stops fitting on one device by sharding model states across workers.^{[10]Reference 10FullyShardedDataParallelhttps://docs.pytorch.org/docs/stable/fsdp.html}^{[11]Reference 11ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.https://arxiv.org/abs/1910.02054}
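
The effect of sharding model states can be estimated with the ZeRO paper's accounting for mixed-precision Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state. The sketch below applies that accounting per stage. It covers model states only, so activations, buffers, and fragmentation come on top, and the 7-billion-parameter model and 8 workers are hypothetical.

```python
def zero_bytes_per_param(stage: int, world_size: int) -> float:
    """Per-device model-state bytes per parameter under ZeRO stages 0 to 3."""
    weights, grads, optimizer = 2.0, 2.0, 12.0
    if stage >= 1:
        optimizer /= world_size  # stage 1 shards optimizer state
    if stage >= 2:
        grads /= world_size      # stage 2 also shards gradients
    if stage >= 3:
        weights /= world_size    # stage 3 also shards the weights
    return weights + grads + optimizer


num_params = 7e9  # hypothetical model size
world_size = 8    # hypothetical number of workers

for stage in range(4):
    per_param = zero_bytes_per_param(stage, world_size)
    gib = num_params * per_param / 2**30
    print(f"stage={stage} bytes_per_param={per_param:.2f} model_state_gib={gib:.1f}")
```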

Minimal operational skeleton

The exact library can vary, but the shape is stable:

sft_pipeline_shape.txt

dataset -> template -> tokenizer -> collator/mask -> trainer
       -> periodic eval -> checkpoint save -> best-checkpoint export

TRL, torchtune, and the Alignment Handbook all expose variations of this same loop.^{[1]Reference 1TRL Documentation: SFT Trainer.https://huggingface.co/docs/trl/sft_trainer}^{[12]Reference 12alignment-handbook: Robust recipes to align language models with human and AI preferences.https://github.com/huggingface/alignment-handbook}^{[13]Reference 13Welcome to the torchtune Documentation.https://meta-pytorch.org/torchtune/stable/index.html}

For a TRL prompt-completion dataset, a configuration skeleton makes the decisions visible. Configure model_with_lora_adapters with a supported FlashAttention 2 or 3 implementation before enabling bfd packing:

trl_sft_config_fragment.py

from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="runs/access-policy-sft-lora",
    learning_rate=1e-4,              # adapter baseline to evaluate, not a universal rule
    max_length=1024,
    packing=True,
    packing_strategy="bfd",          # keep examples intact; use a compatible attention path
    completion_only_loss=True,       # train on the completion field, not the prompt
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
)

trainer = SFTTrainer(
    model=model_with_lora_adapters,
    args=args,
    train_dataset=train_rows,        # {"prompt": ..., "completion": ...}
    eval_dataset=held_out_rows,
    processing_class=tokenizer,
)

This fragment deliberately leaves model loading, adapter construction, and the data manifest outside the snippet. They must be versioned inputs to a real run, not invisible defaults.

Common pitfalls

Confusing the objective with the update surface

Symptom: A team says "we tried LoRA instead of SFT," or switches to full updates without changing data or evaluation.
Cause: LoRA and full fine-tuning describe which parameters move; SFT describes the supervised loss objective.
Fix: Write down the failed behavior and desired objective first. Then compare adapter and full-weight SFT only if update capacity or memory is the actual uncertainty.

Evaluating examples that share a case with training

Symptom: Held-out score is excellent, but the model fails on new tenants, policies, or access incidents.
Cause: Rows were randomly split while near-duplicate turns from the same incident or document appeared on both sides.
Fix: Group the split by incident thread, policy source, tenant, or later time window, then re-run selection.

Reporting micro-batch as if it were the real batch

Symptom: Two runs both claim "batch size 4," but convergence looks different and nobody can reconcile the results.
Cause: The reported number ignored accumulation or world size, so the optimizer was seeing different effective batches.
Fix: Publish per_device_batch, gradient_accumulation_steps, world_size, packed/padded mode, examples per update, and supervised tokens per update.

Saving weights without resume state

Symptom: Resume starts, but the learning rate jumps, checkpoint selection resets, or the run diverges from prior curves.
Cause: The checkpoint kept weights but dropped optimizer state, scheduler position, or best-metric record.
Fix: Treat resume state as part of the checkpoint contract: weights, optimizer, scheduler, global step, RNG/sampler state, template/tokenizer version, data and eval manifests, and best metric.

Using a packing path that crosses example boundaries

Symptom: Loss falls smoothly, but the model learns transitions that never occur in real conversations.
Cause: A packing strategy or attention implementation let tokens from one independent example condition another.
Fix: For TRL SFT, prefer bfd with a supported FlashAttention 2 or 3 implementation and inspect packed batches. Use bfd_split deliberately when preserving overflow matters; don't silently substitute wrapped or an unsupported attention path.

Selecting the best checkpoint by train loss

Symptom: Exported checkpoint has the prettiest training curve but underperforms on held-out access-policy tasks.
Cause: Train loss rewards fitting seen data, not solving the downstream task the product cares about.
Fix: Export the checkpoint that wins on held-out verifiable metrics or calibrated human-preference evaluation that matches deployment.

Evaluation rubric

Check that you can explain and operate these parts of an SFT run:

How dataset, template, tokenizer, collator, loss mask, optimizer, checkpointing, and evaluation fit into one training loop.
Why a response-only objective masks prompt tokens and includes the answer termination behavior.
Why a grouped held-out split is stronger than a row-random split for related access-policy conversations.
How TRL's bfd, bfd_split, and wrapped strategies differ, and why the padding-free attention path still has to honor example boundaries.
How to treat documented learning-rate defaults as sweep starting points rather than guarantees.
How to compute examples and supervised answer tokens per optimizer update.
How to choose an objective first, then compare full-weight, LoRA, or QLoRA SFT parameterizations.
What a resumable checkpoint has to save beyond model weights.
Why the best checkpoint is selected by held-out verifiable or calibrated behavior metrics rather than training loss.
When to stay on one GPU for correctness and when FSDP or ZeRO becomes necessary.

What to remember

SFT is a training system, not a trainer call alone.
Choose the learning objective before choosing full updates, LoRA, or QLoRA as its parameterization.
Hold out whole groups such as incident threads or policy sources, not randomly sampled individual messages.
For response-only SFT, supervise answer and termination tokens; in TRL that means completion_only_loss or assistant_only_loss.
Packing saves compute only when strategy and attention behavior preserve the boundaries between independent examples.
Library defaults begin a sweep; they don't certify your learning rate or training duration.
Track supervised tokens per update in addition to examples, accumulation, and world size.
Resumable checkpoints need optimizer, scheduler, sampler, formatting, data, evaluation, and best-metric state, not weights alone.
The best checkpoint is chosen by held-out verifiable or calibrated behavior evidence, not by train loss.
Start on one GPU for correctness, then widen to FSDP / ZeRO when memory or schedule demands it.

Next Step

Continue to Distributed Training: FSDP & ZeRO

You defined the SFT recipe and the checkpointing, batch, and evaluation rules that make a fine-tuning run legitimate. Next you'll keep that exact recipe alive once model states, activations, and communication no longer fit comfortably on one GPU.

References

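[1] TRL Documentation: SFT Trainer. Hugging Face · 2026. https://huggingface.co/docs/trl/sft_trainer
[2] Fine-Tune Your First LLM. PyTorch Contributors · 2026. https://meta-pytorch.org/torchtune/stable/tutorials/first_finetune_tutorial.html
[3] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Gururangan, S., Marasovic, A., Swayamdipta, S., et al. · 2020 · ACL 2020. https://aclanthology.org/2020.acl-main.740/
[4] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov, R., et al. · 2023. https://arxiv.org/abs/2305.18290
[5] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR. https://arxiv.org/abs/2106.09685
[6] QLoRA: Efficient Finetuning of Quantized Language Models. Dettmers, T., et al. · 2023 · NeurIPS. https://arxiv.org/abs/2305.14314
[7] Transformers Documentation: Writing a chat template. Hugging Face · 2026. https://huggingface.co/docs/transformers/main/en/chat_templating_writing
[8] Checkpointing in torchtune. PyTorch Contributors · 2026. https://meta-pytorch.org/torchtune/stable/deep_dives/checkpointer.html
[9] Universal Checkpointing with DeepSpeed: A Practical Guide. DeepSpeed Team · 2026. https://www.deepspeed.ai/tutorials/universal-checkpointing/
[10] FullyShardedDataParallel. PyTorch Contributors · 2026. https://docs.pytorch.org/docs/stable/fsdp.html
[11] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Rajbhandari, S., et al. · 2020 · SC 2020. https://arxiv.org/abs/1910.02054
[12] alignment-handbook: Robust recipes to align language models with human and AI preferences. Hugging Face · 2025. https://github.com/huggingface/alignment-handbook
[13] Welcome to the torchtune Documentation. PyTorch Contributors · 2026. https://meta-pytorch.org/torchtune/stable/index.html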

What an SFT run must decide

Every serious SFT job has to answer the same questions:

What is the behavior target?
Is SFT the right objective, or is the missing ingredient still domain pretraining?
Are we updating all weights or only adapters?
What examples and tokens contribute to training loss?
What is the true example and supervised-token batch budget?
Which held-out metric decides the best checkpoint?
Can this stay on one GPU, or do we need Fully Sharded Data Parallel (FSDP) or Zero Redundancy Optimizer (ZeRO)?

Once you phrase the job that way, the trainer isn't the system. It's only one component inside the system.

Separate objective from parameterization

An objective tells the model what signal to learn from. A parameterization tells the trainer which weights may move. They aren't interchangeable.

Failure you observe	Objective to investigate	Why
Base model has weak exposure to domain language in unlabeled corpora	continued pretraining	next-token training on domain text adapts the language distribution before behavior training [3]
Model can read an access policy but doesn't follow the desired escalation procedure	SFT	prompt-completion demonstrations directly teach that response behavior [1]
Several acceptable answers need ranking by preference	preference training after establishing an SFT baseline	comparisons express relative preference rather than a single target answer [4]

After choosing SFT, make a second decision:

SFT parameterization	What moves during the same supervised objective	When to test it
Full fine-tuning	all trainable model weights	when memory permits and adapters may be too restrictive
LoRA	small low-rank adapter matrices; base weights stay frozen	when iteration speed, memory, or many task variants matter [5]
QLoRA	LoRA adapters while the frozen base is stored in 4-bit form	when the base model doesn't fit comfortably at higher precision [6]

Full weights or adapters

Full fine-tuning

Full fine-tuning is the simplest conceptual update: every eligible parameter may change to reduce supervised response loss. Test it when:

the model fits with its gradients and optimizer state
one merged checkpoint is operationally simpler than serving adapters
an adapter baseline underfits your held-out behavior metric

LoRA

LoRA is still SFT when trained on prompt-response rows. Use it when:

iteration speed or training memory limits matter
you need separate task or tenant variants over one base model
you want a strong adapter baseline before paying for full updates

QLoRA

Use QLoRA when all of these hold:

LoRA is the right algorithmic choice
the frozen base model itself is too large to keep in BF16/FP16 memory
you still want adapter training instead of full-weight updates
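
To make the trade-off between these update surfaces concrete, here is a rough arithmetic sketch. The model shape, the rank, and the choice of two adapted square projections per layer are all hypothetical values picked for illustration; the only fixed fact is that LoRA adds two low-rank factors with `rank * (d_in + d_out)` parameters beside each adapted `d_out x d_in` weight. The memory figures cover frozen base weights only and ignore quantization constants, activations, and optimizer state.

```python
def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)

# Hypothetical model shape, chosen only to make the arithmetic concrete.
base_params = 7_000_000_000
hidden_size = 4096
num_layers = 32
adapted_matrices_per_layer = 2   # assumption: two square hidden x hidden projections
rank = 16

trainable = (
    num_layers
    * adapted_matrices_per_layer
    * lora_params_per_matrix(hidden_size, hidden_size, rank)
)

GB = 1e9
bf16_weights_gb = base_params * 2 / GB        # 2 bytes per parameter
four_bit_weights_gb = base_params * 0.5 / GB  # 4 bits per parameter, constants ignored

print(f"lora_trainable_params={trainable}")
print(f"trainable_fraction={trainable / base_params:.4%}")
print(f"frozen_base_bf16_gb={bf16_weights_gb:.1f}")
print(f"frozen_base_4bit_gb={four_bit_weights_gb:.1f}")
```

Numbers like these only tell you whether a parameterization is feasible. Whether the adapter has enough capacity is still decided by the held-out behavior metric.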

Reject broken demonstrations before training

validate_sft_rows.py

rows = [
    {"thread_id": "T102", "prompt": "Stale key", "completion": "Request rotation evidence."},
    {"thread_id": "B550", "prompt": "Expired key", "completion": ""},
    {"prompt": "Wrong size", "completion": "Offer an exchange label."},
]

required = {"thread_id", "prompt", "completion"}
accepted, rejected = [], []
for index, row in enumerate(rows):
    missing = sorted(required - row.keys())
    if missing:
        rejected.append((index, f"missing {missing}"))
    elif not row["completion"].strip():
        rejected.append((index, "empty completion"))
    else:
        accepted.append(row)

print("accepted_threads=", [row["thread_id"] for row in accepted])
print("rejected=", rejected)

Row contract output

accepted_threads= ['T102']
rejected= [(1, 'empty completion'), (2, "missing ['thread_id']")]

Split by the unit that could leak

grouped_sft_split.py

rows = [
    {"case_id": "T102", "turn": 1, "answer": "Request key-rotation evidence."},
    {"case_id": "T102", "turn": 2, "answer": "Approve access after rotation proof."},
    {"case_id": "R550", "turn": 1, "answer": "Escalate privileged-role changes."},
    {"case_id": "P771", "turn": 1, "answer": "Cite the session-timeout policy."},
]

eval_cases = {"R550"}
train = [row for row in rows if row["case_id"] not in eval_cases]
evaluation = [row for row in rows if row["case_id"] in eval_cases]

train_cases = {row["case_id"] for row in train}
held_out_cases = {row["case_id"] for row in evaluation}
assert train_cases.isdisjoint(held_out_cases)

print("train_cases=", sorted(train_cases))
print("eval_cases=", sorted(held_out_cases))
print("case_overlap=", train_cases & held_out_cases)

Grouped split output

train_cases= ['P771', 'T102']
eval_cases= ['R550']
case_overlap= set()
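
The split above uses a hand-picked evaluation set. When new cases keep arriving, one common alternative is to derive the side from a stable hash of the group ID so a case never migrates between runs. The sketch below assumes `case_id` is the leak unit; the salt string and the 25 percent evaluation share are hypothetical choices.

```python
import hashlib

def split_for(group_id: str, eval_percent: int = 25, salt: str = "sft-split-v1") -> str:
    """Map a group to 'train' or 'eval' from a stable hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{group_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "eval" if bucket < eval_percent else "train"

rows = [
    {"case_id": "T102", "turn": 1},
    {"case_id": "T102", "turn": 2},
    {"case_id": "R550", "turn": 1},
    {"case_id": "P771", "turn": 1},
]

# Every row inherits the side of its group, so turns of one case never separate.
assignment = {row["case_id"]: split_for(row["case_id"]) for row in rows}
train = [row for row in rows if assignment[row["case_id"]] == "train"]
evaluation = [row for row in rows if assignment[row["case_id"]] == "eval"]

train_cases = {row["case_id"] for row in train}
eval_cases = {row["case_id"] for row in evaluation}
assert train_cases.isdisjoint(eval_cases)
assert len(train) + len(evaluation) == len(rows)
print("assignment=", assignment)
```

With only a handful of groups, a hash split can leave one side empty, so check the resulting counts. Changing the salt reshuffles every group and should be treated as a new split version.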

Data path: template, tokenize, label, pack

An SFT run usually performs these steps:

read structured examples
apply the model-specific chat template
run tokenization
label desired response tokens and mask the prompt/context tokens
pack short examples with sequence-boundary handling, or pad them into batches
feed the trainer

sft_loss_mask.py

IGNORE_INDEX = -100

tokens = [
    ("prefix", "<system>"),
    ("prefix", "Follow access policy."),
    ("prefix", "<user>"),
    ("prefix", "Key T102 is stale."),
    ("prefix", "<assistant>"),
    ("target", "Ask for rotation evidence within 24 hours."),
    ("target", "<end_of_turn>"),
]

labels = [
    token if span == "target" else IGNORE_INDEX
    for span, token in tokens
]

supervised = [
    token
    for (_, token), label in zip(tokens, labels)
    if label != IGNORE_INDEX
]
print("supervised_tokens=", supervised)
print("masked_positions=", sum(label == IGNORE_INDEX for label in labels))

Loss mask output

supervised_tokens= ['Ask for rotation evidence within 24 hours.', '<end_of_turn>']
masked_positions= 5
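
The example above assumes you already know which tokens are prefix and which are target. One way to find that boundary is to render the prompt alone (with the assistant header appended) and the full conversation, then confirm the first is an exact token prefix of the second. The sketch below uses a hypothetical `render` template and a whitespace `toy_tokenize` as stand-ins; a real run would use the model's own chat template and tokenizer, where the prefix check matters more because subword merges can shift the boundary.

```python
IGNORE_INDEX = -100

def toy_tokenize(text: str) -> list[str]:
    # Stand-in for a real tokenizer: whitespace split only.
    return text.split()

def render(messages: list[dict], add_generation_prompt: bool = False) -> str:
    # Hypothetical template; real templates are model-specific.
    parts = [f"<{m['role']}> {m['content']} <end_of_turn>" for m in messages]
    if add_generation_prompt:
        parts.append("<assistant>")
    return " ".join(parts)

prompt_messages = [
    {"role": "system", "content": "Follow access policy."},
    {"role": "user", "content": "Key T102 is stale."},
]
answer = {"role": "assistant", "content": "Ask for rotation evidence."}

prompt_tokens = toy_tokenize(render(prompt_messages, add_generation_prompt=True))
full_tokens = toy_tokenize(render(prompt_messages + [answer]))

# The mask boundary is only trustworthy if the prompt rendering is an exact prefix.
assert full_tokens[: len(prompt_tokens)] == prompt_tokens

labels = [IGNORE_INDEX] * len(prompt_tokens) + full_tokens[len(prompt_tokens):]
supervised = [label for label in labels if label != IGNORE_INDEX]
print("supervised_tokens=", supervised)
# supervised_tokens= ['Ask', 'for', 'rotation', 'evidence.', '<end_of_turn>']
```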

Why bother masking easy prefix tokens? If a long repeated instruction is easy to predict, its small losses can dominate the average and hide poor answer learning:

response_only_loss.py

prefix_losses = [0.03, 0.04, 0.02, 0.05, 0.03]
answer_losses = [2.20, 1.80]

full_sequence_loss = sum(prefix_losses + answer_losses) / len(prefix_losses + answer_losses)
response_only_loss = sum(answer_losses) / len(answer_losses)

print(f"full_sequence_loss={full_sequence_loss:.3f}")
print(f"response_only_loss={response_only_loss:.3f}")
assert response_only_loss > full_sequence_loss

Response-only loss output

full_sequence_loss=0.596
response_only_loss=2.000

Mask inspection must happen after truncation as well. If the prompt fills max_length, the run can keep a row whose entire answer has disappeared:

reject_truncated_completion.py

IGNORE_INDEX = -100

prompt_tokens = ["<system>", "policy", "<user>", "stale_key", "<assistant>"]
answer_tokens = ["request_evidence", "<end_of_turn>"]
max_length = 5

kept_tokens = (prompt_tokens + answer_tokens)[:max_length]
labels = [
    token if position >= len(prompt_tokens) else IGNORE_INDEX
    for position, token in enumerate(kept_tokens)
]
has_answer_label = any(label != IGNORE_INDEX for label in labels)

print("kept_tokens=", kept_tokens)
print("has_answer_label=", has_answer_label)
print("decision=", "keep" if has_answer_label else "reject_or_increase_max_length")

Truncation guard output

kept_tokens= ['<system>', 'policy', '<user>', 'stale_key', '<assistant>']
has_answer_label= False
decision= reject_or_increase_max_length

Pack short examples while preserving their boundaries

packed_boundary_check.py

segments = [
    ["A102 prompt", "A102 response", "<eos>"],
    ["B550 prompt", "B550 response", "<eos>"],
]

flat = [(sequence_id, token) for sequence_id, row in enumerate(segments) for token in row]
position_ids = [position for row in segments for position, _ in enumerate(row)]

def may_attend(query_index, key_index):
    query_sequence, _ = flat[query_index]
    key_sequence, _ = flat[key_index]
    return key_index <= query_index and query_sequence == key_sequence

b_response = 4
assert position_ids == [0, 1, 2, 0, 1, 2]
assert not may_attend(b_response, 1)
print("position_ids=", position_ids)
print("B550_response_sees_A102_response=", may_attend(b_response, 1))

Packed boundary output

position_ids= [0, 1, 2, 0, 1, 2]
B550_response_sees_A102_response= False

Reset position IDs are boundary metadata, not a universal attention mask. The attention implementation still has to honor those boundaries when it computes attention.
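
To see what honoring the boundaries looks like, the sketch below builds the block-diagonal causal mask implied by a list of packed sequence lengths. It is a plain-Python illustration, not how any particular attention kernel is implemented; the function name and the cumulative-offset representation are assumptions made for this example.

```python
from itertools import accumulate

def block_causal_mask(sequence_lengths):
    """Return mask[q][k] == True when query q may attend to key k."""
    boundaries = [0, *accumulate(sequence_lengths)]  # cumulative start offsets
    total = boundaries[-1]
    sequence_of = [0] * total
    for sequence_id, (start, end) in enumerate(zip(boundaries, boundaries[1:])):
        for position in range(start, end):
            sequence_of[position] = sequence_id
    # Causal within a sequence, never across sequences.
    return [
        [k <= q and sequence_of[q] == sequence_of[k] for k in range(total)]
        for q in range(total)
    ]

mask = block_causal_mask([3, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# ...x..
# ...xx.
# ...xxx
```

A plain causal mask over the same six positions would fill the lower-left block with `x`, which is exactly the leak from the first example into the second.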

Hyperparameters are experiments, not promises

Knob	Baseline to test	What decides whether it survives
Learning rate	TRL `SFTConfig` currently defaults to `2e-5`; its PEFT guidance suggests about `1e-4` for adapters [1]	a short sweep on the actual update surface and held-out task metrics
Passes over data	start with a small budget and evaluate during the run	widening train/eval gap or declining behavior metrics
Warmup and schedule	set explicitly in the run config	early instability and sweep results
Weight decay and clipping	log them with the run, even when disabled	stability and evaluation evidence, not inherited habit

Effective batch size is also a token budget

Training discussions get sloppy fast here. The batch that sits on one device isn't the same thing as the batch the optimizer sees before one update. For equal-length examples, the usual rule is:

examples_per_update = per_device_batch * gradient_accumulation_steps * world_size

Packed response examples aren't equal length. This formula reports examples per optimizer update, but the number of supervised answer tokens can still change from update to update. Track both.

sft_effective_batch.py

per_device_batch = 4
grad_accum = 8
world_size = 4
examples_per_update = per_device_batch * grad_accum * world_size

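# One rank's supervised-token counts, one entry per accumulation microbatch.
supervised_tokens_per_microbatch_on_one_rank = [180, 212, 164, 220, 198, 204, 175, 207]
# Estimate: assumes every rank sees a similar count; sum real per-rank counts in production logs.
supervised_tokens_per_update = sum(supervised_tokens_per_microbatch_on_one_rank) * world_size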

print(f"examples_per_update={examples_per_update}")
print(f"supervised_tokens_per_update={supervised_tokens_per_update}")

Effective batch output

examples_per_update=128
supervised_tokens_per_update=6240
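
Uneven supervised-token counts also change what "the loss for this update" means. Averaging each microbatch's mean loss gives every microbatch equal weight, while dividing the summed loss by the total supervised tokens gives every token equal weight. The sketch below uses hypothetical numbers to show the gap; which normalization your trainer applies under gradient accumulation depends on the library and version, so treat this as something to verify rather than assume.

```python
# Hypothetical accumulated microbatches with very different answer lengths.
microbatches = [
    {"supervised_tokens": 40, "loss_sum": 100.0},   # per-token mean 2.5
    {"supervised_tokens": 160, "loss_sum": 160.0},  # per-token mean 1.0
]

# Equal weight per microbatch: short-answer microbatches are over-represented.
mean_of_means = sum(
    m["loss_sum"] / m["supervised_tokens"] for m in microbatches
) / len(microbatches)

# Equal weight per supervised token across the whole update.
token_weighted = (
    sum(m["loss_sum"] for m in microbatches)
    / sum(m["supervised_tokens"] for m in microbatches)
)

print(f"mean_of_microbatch_means={mean_of_means:.3f}")
print(f"token_weighted_mean={token_weighted:.3f}")
# mean_of_microbatch_means=1.750
# token_weighted_mean=1.300
```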

When a configuration changes, recompute the full product instead of reasoning from the one knob that moved:

compare_update_sizes.py

def examples_per_update(per_device_batch, accumulation, world_size):
    return per_device_batch * accumulation * world_size

run_a = examples_per_update(4, 8, 4)
run_b = examples_per_update(2, 16, 8)

print("run_a_examples_per_update=", run_a)
print("run_b_examples_per_update=", run_b)
print("ratio_b_over_a=", run_b / run_a)
assert run_b == 2 * run_a

Run comparison output

run_a_examples_per_update= 128
run_b_examples_per_update= 256
ratio_b_over_a= 2.0

Two habits follow from this:

when someone reports batch size, ask for accumulation, world size, packing, and supervised tokens too
compare learning-rate sweeps with similar token budgets or explain why they differ

Warmup, clipping, and weight decay

These knobs aren't decoration.

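Warmup

Warmup ramps the learning rate up from a small value over the first optimizer steps, so early updates computed from noisy gradients and cold optimizer statistics don't destabilize the run. Set its length explicitly in the run config.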

Gradient clipping

Clip gradients when spikes would destabilize the run. Clipping doesn't fix bad data, but it can stop one pathological batch from damaging the checkpoint.

Weight decay

Weight decay pulls trainable weights toward zero on each update, which acts as a regularizer. Treat it as part of the recipe, not an afterthought copy-pasted from another config.

Don't memorize one numeric recipe. Know what each control is trying to prevent.
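
As a reminder of what each control does mechanically, here is a minimal plain-Python sketch. The linear-warmup-then-cosine shape, the step counts, and the decay coefficient are assumptions chosen for illustration, and the decay step shown is the decoupled form used by AdamW-style optimizers. A real run would rely on the optimizer and scheduler classes from its training library.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads: list[float], max_norm: float):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

def decay_weight(weight: float, lr: float, weight_decay: float) -> float:
    """Decoupled weight decay: shrink the weight independently of its gradient."""
    return weight * (1.0 - lr * weight_decay)

schedule = [lr_at(s, total_steps=100, peak_lr=1e-4, warmup_steps=10) for s in range(100)]
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)

print(f"lr_first={schedule[0]:.2e} lr_peak={max(schedule):.2e} lr_last={schedule[-1]:.2e}")
print(f"norm_before={norm_before:.1f} clipped={[round(g, 3) for g in clipped]}")
print(f"decayed_weight={decay_weight(1.0, lr=1e-4, weight_decay=0.1):.6f}")
```

Logging the learning rate and the pre-clip gradient norm at every step makes it much easier to tell an unstable recipe from a bad batch.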

What a resumable run must save

A real SFT resume bundle isn't only model_state_dict.

At minimum, retain:

model weights
optimizer state
scheduler state
global step / epoch
RNG state and sampler/data position when repeatable continuation matters
tokenizer + chat-template version
training config, data manifest, and evaluation manifest
best metric so far

Why this matters:

optimizer state controls the next update
scheduler state controls learning rate at resume
sampler state prevents quietly repeating or skipping shuffled training rows
tokenizer/template version controls what the model thinks each token means
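
The sampler point is easy to test in isolation. The sketch below assumes a sampler whose shuffle is a pure function of a seed and the epoch number, and whose saved state is just the number of batches already consumed. Both helper functions are hypothetical, not a library API. It checks that resuming neither repeats nor skips rows.

```python
import random

def epoch_order(num_rows: int, seed: int, epoch: int) -> list[int]:
    """Deterministic shuffle: the same (seed, epoch) always gives the same order."""
    order = list(range(num_rows))
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order

def batches(order: list[int], batch_size: int, start_batch: int = 0):
    """Yield batches, optionally skipping those consumed before a checkpoint."""
    for index in range(start_batch * batch_size, len(order), batch_size):
        yield order[index : index + batch_size]

num_rows, batch_size, seed, epoch = 12, 3, 1234, 1
uninterrupted = list(batches(epoch_order(num_rows, seed, epoch), batch_size))

# Simulate a crash after two batches, then resume from saved sampler state.
sampler_state = {"seed": seed, "epoch": epoch, "batches_consumed": 2}
resumed_order = epoch_order(num_rows, sampler_state["seed"], sampler_state["epoch"])
resumed = list(
    batches(resumed_order, batch_size, start_batch=sampler_state["batches_consumed"])
)

# The resumed run must continue exactly where the first one stopped.
assert uninterrupted[:2] + resumed == uninterrupted
seen = [row for batch in uninterrupted[:2] + resumed for row in batch]
assert sorted(seen) == list(range(num_rows))
print("resumed_batches=", len(resumed))
# resumed_batches= 2
```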

A minimal manifest should make missing resume state obvious:

resume_bundle_manifest_check.py

required = {
    "model_state",
    "optimizer_state",
    "scheduler_state",
    "global_step",
    "rng_state",
    "sampler_state",
    "tokenizer_version",
    "chat_template_version",
    "data_manifest",
    "eval_manifest",
    "best_metric",
}

resume_bundle = {
    "model_state": "weights/step_600.safetensors",
    "optimizer_state": "optimizer/step_600.pt",
    "scheduler_state": "scheduler/step_600.pt",
    "global_step": 600,
    "rng_state": "rng/step_600.pt",
    "sampler_state": {"epoch": 1, "batches_consumed": 120},
    "tokenizer_version": "policy-sft-tokenizer-v3",
    "chat_template_version": "llama3-support-v2",
    "data_manifest": "data/sft_manifest_2026-05-20.json",
    "eval_manifest": "eval/access_policy_behavior_v4.json",
    "best_metric": {"name": "support_resolution_accuracy", "value": 0.78},
}

missing = sorted(required - resume_bundle.keys())
print("resume_ready=", not missing)
print("best_metric=", resume_bundle["best_metric"])

Checkpoint manifest output

resume_ready= True
best_metric= {'name': 'support_resolution_accuracy', 'value': 0.78}

Choosing the best checkpoint

The best checkpoint is the one that optimizes the product metric you care about on held-out data.

Bad rules:

lowest train loss
latest step
biggest file

Better evidence:

highest exact or structured task success where the behavior is verifiable
best human preference rate on a blinded held-out set when quality is subjective
judge-assisted scoring only after calibrating the judge against human or verifiable decisions
lowest validation loss only when that truly matches the task

select_sft_checkpoint.py

checkpoints = [
    {"step": 200, "val_loss": 1.91, "policy_pass_rate": 0.93, "format_pass_rate": 0.99},
    {"step": 400, "val_loss": 1.77, "policy_pass_rate": 0.91, "format_pass_rate": 1.00},
    {"step": 600, "val_loss": 1.79, "policy_pass_rate": 0.97, "format_pass_rate": 0.98},
]

eligible = [row for row in checkpoints if row["format_pass_rate"] >= 0.98]
best = max(eligible, key=lambda row: (row["policy_pass_rate"], -row["val_loss"]))

print("best_step=", best["step"])
print("best_policy_pass_rate=", best["policy_pass_rate"])
print("lowest_loss_step=", min(checkpoints, key=lambda row: row["val_loss"])["step"])

Checkpoint selection output

best_step= 600
best_policy_pass_rate= 0.97
lowest_loss_step= 400
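
Judge-assisted scoring appears in the evidence list only with a calibration condition. A minimal version of that check compares judge verdicts with human verdicts on the same held-out items before the judge is allowed to rank checkpoints. The labels and the 0.90 agreement threshold below are hypothetical.

```python
# Hypothetical pass/fail verdicts on the same ten held-out items.
human = [True, True, False, True, False, False, True, True, False, True]
judge = [True, True, False, False, False, True, True, True, False, True]

MIN_AGREEMENT = 0.90  # assumption: set this from your own risk tolerance

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
false_pass = sum(j and not h for h, j in zip(human, judge))  # judge passes a human fail
false_fail = sum(h and not j for h, j in zip(human, judge))  # judge fails a human pass
judge_usable = agreement >= MIN_AGREEMENT

print(f"agreement={agreement:.2f}")
print(f"false_pass={false_pass} false_fail={false_fail}")
print(f"judge_usable={judge_usable}")
# agreement=0.80
# false_pass=1 false_fail=1
# judge_usable=False
```

False passes deserve separate attention for a policy assistant, because a judge that approves unsafe replies inflates exactly the metric used for export.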

Single GPU first, then scale out

A good SFT rollout usually starts with the smallest honest setup:

one GPU
tiny eval slice
frequent checkpoints
verified template/masking
one clean metric

Only then do you widen the job.

Signs that one GPU is still fine

the model and activations fit
throughput is acceptable
you're still debugging correctness
labels are the main uncertainty, not compute budget

Signs that you need FSDP or ZeRO

full-model weights and optimizer state don't fit
activation memory exhausts the device at useful context lengths
activation checkpointing and gradient accumulation still leave the job too slow
you need larger world size to hit the schedule
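
A back-of-envelope estimate helps with the first sign. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes for 16-bit weights, 2 for 16-bit gradients, and 12 for FP32 master weights, momentum, and variance. The 7B parameter count and the 8-GPU world size are hypothetical, and activations, buffers, and fragmentation are deliberately left out.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # 16-bit weights + 16-bit grads + FP32 master/momentum/variance

def model_state_gb(params: int, world_size: int = 1, fully_sharded: bool = False) -> float:
    """Per-GPU model-state memory in GB (1e9 bytes), excluding activations."""
    total_bytes = params * BYTES_PER_PARAM
    per_gpu_bytes = total_bytes / world_size if fully_sharded else total_bytes
    return per_gpu_bytes / 1e9

params = 7_000_000_000  # hypothetical full fine-tuning target

replicated = model_state_gb(params)
sharded = model_state_gb(params, world_size=8, fully_sharded=True)

print(f"replicated_model_state_gb={replicated:.1f}")
print(f"fully_sharded_over_8_gpus_gb={sharded:.1f}")
# replicated_model_state_gb=112.0
# fully_sharded_over_8_gpus_gb=14.0
```

If the replicated figure already exceeds device memory before activations are counted, full-weight SFT on one GPU is off the table and the choice is between adapters and sharding.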

Minimal operational skeleton

The exact library can vary, but the shape is stable:

sft_pipeline_shape.txt

dataset -> template -> tokenizer -> collator/mask -> trainer
       -> periodic eval -> checkpoint save -> best-checkpoint export

trl_sft_config_fragment.py

from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="runs/access-policy-sft-lora",
    learning_rate=1e-4,              # adapter baseline to evaluate, not a universal rule
    max_length=1024,
    packing=True,
    packing_strategy="bfd",          # keep examples intact; use a compatible attention path
    completion_only_loss=True,       # train on the completion field, not the prompt
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
)

trainer = SFTTrainer(
    model=model_with_lora_adapters,
    args=args,
    train_dataset=train_rows,        # {"prompt": ..., "completion": ...}
    eval_dataset=held_out_rows,
    processing_class=tokenizer,
)

This fragment deliberately leaves model loading, adapter construction, and the data manifest outside the snippet. They must be versioned inputs to a real run, not invisible defaults.

Common pitfalls

Confusing the objective with the update surface

Symptom: A team says "we tried LoRA instead of SFT," or switches to full updates without changing data or evaluation.
Cause: LoRA and full fine-tuning describe which parameters move; SFT describes the supervised loss objective.
Fix: Write down the failed behavior and desired objective first. Then compare adapter and full-weight SFT only if update capacity or memory is the actual uncertainty.

Evaluating examples that share a case with training

Symptom: Held-out score is excellent, but the model fails on new tenants, policies, or access incidents.
Cause: Rows were randomly split while near-duplicate turns from the same incident or document appeared on both sides.
Fix: Group the split by incident thread, policy source, tenant, or later time window, then re-run selection.

Reporting micro-batch as if it were the real batch

Symptom: Two runs both claim "batch size 4," but convergence looks different and nobody can reconcile the results.
Cause: The reported number ignored accumulation or world size, so the optimizer was seeing different effective batches.
Fix: Publish per_device_batch, gradient_accumulation_steps, world_size, packed/padded mode, examples per update, and supervised tokens per update.

Saving weights without resume state

Symptom: Resume starts, but the learning rate jumps, checkpoint selection resets, or the run diverges from prior curves.
Cause: The checkpoint kept weights but dropped optimizer state, scheduler position, or best-metric record.
Fix: Treat resume state as part of the checkpoint contract: weights, optimizer, scheduler, global step, RNG/sampler state, template/tokenizer version, data and eval manifests, and best metric.

Using a packing path that crosses example boundaries

Symptom: Loss falls smoothly, but the model learns transitions that never occur in real conversations.
Cause: A packing strategy or attention implementation let tokens from one independent example condition another.
Fix: For TRL SFT, prefer bfd with a supported FlashAttention 2 or 3 implementation and inspect packed batches. Use bfd_split deliberately when preserving overflow matters; don't silently substitute wrapped or an unsupported attention path.

Selecting the best checkpoint by train loss

Symptom: Exported checkpoint has the prettiest curve but underperforms on held-out support tasks.
Cause: Train loss rewards fitting seen data, not solving the downstream task the product cares about.
Fix: Export the checkpoint that wins on held-out verifiable metrics or calibrated human-preference evaluation that matches deployment.

Evaluation rubric

Check that you can explain and operate these parts of an SFT run:

How dataset, template, tokenizer, collator, loss mask, optimizer, checkpointing, and evaluation fit into one training loop.
Why a response-only objective masks prompt tokens and includes the answer termination behavior.
Why a grouped held-out split is stronger than a row-random split for related support conversations.
How TRL's bfd, bfd_split, and wrapped strategies differ, and why the padding-free attention path still has to honor example boundaries.
How to treat documented learning-rate defaults as sweep starting points rather than guarantees.
How to compute examples and supervised answer tokens per optimizer update.
How to choose an objective first, then compare full-weight, LoRA, or QLoRA SFT parameterizations.
What a resumable checkpoint has to save beyond model weights.
Why the best checkpoint is selected by held-out verifiable or calibrated behavior metrics rather than training loss.
When to stay on one GPU for correctness and when FSDP or ZeRO becomes necessary.

Practice checkpoints


Your policy assistant answers in the right style but still confuses tenant-specific access rules. What do you diagnose before deciding between another adapter run and a different objective?

Run A uses per_device_batch=4, gradient_accumulation_steps=8, world_size=4. Run B uses per_device_batch=2, gradient_accumulation_steps=16, world_size=8. Are they comparable, and which checkpoint do you export if one has lower train loss but the other has better held-out task accuracy?

Packing made throughput jump and loss fall. Later you learn your attention path let one independent example condition another, and your saved checkpoint contains weights only. Name both failures.
