Run supervised fine-tuning as a real training system: choose the learning objective before the update surface, verify response-token loss and packing, track the real batch budget, save resumable checkpoints, and choose the export checkpoint by held-out behavior.
The synthetic-data chapter built and verified candidate rows for post-training. Supervised fine-tuning (SFT) turns accepted demonstrations, whether written by humans or generated and checked, into gradient updates on desired responses. A good SFT run doesn't start with a trainer call. It starts with a behavior target, a leakage-resistant split, a loss mask, a batch budget, checkpoints that can resume, and an evaluation rule for exporting the best artifact.[1][2]
Turn SFT into an operational training system for access-policy assistant replies. First choose the learning objective. Then choose whether that SFT objective updates every weight or a small adapter. Keeping those two decisions separate prevents expensive experiments that answer the wrong question.
Every serious SFT job has to answer the same questions:
Once you phrase the job that way, the trainer isn't the system. It's only one component inside the system.
An objective tells the model what signal to learn from. A parameterization tells the trainer which weights may move. They aren't interchangeable.
| Failure you observe | Objective to investigate | Why |
|---|---|---|
| Base model has weak exposure to domain language in unlabeled corpora | continued pretraining | next-token training on domain text adapts the language distribution before behavior training[3] |
| Model can read an access policy but doesn't follow the desired escalation procedure | SFT | prompt-completion demonstrations directly teach that response behavior[1] |
| Several acceptable answers need ranking by preference | preference training after establishing an SFT baseline | comparisons express relative preference rather than a single target answer[4] |
After choosing SFT, make a second decision:
| SFT parameterization | What moves during the same supervised objective | When to test it |
|---|---|---|
| Full fine-tuning | all trainable model weights | when memory permits and adapters may be too restrictive |
| LoRA | small low-rank adapter matrices; base weights stay frozen | when iteration speed, memory, or many task variants matter[5] |
| QLoRA | LoRA adapters while the frozen base is stored in 4-bit form | when the base model doesn't fit comfortably at higher precision[6] |
LoRA and QLoRA aren't alternatives to SFT. They can be the parameterization used for an SFT run. Continued pretraining is a different objective; it can also be parameter-efficient in an appropriate setup, but it still doesn't become SFT.
Full fine-tuning is the simplest conceptual update: every eligible parameter may change to reduce supervised response loss. Test it when:
LoRA is still SFT when trained on prompt-response rows. Use it when:
LoRA freezes the base model weights and trains low-rank adapter matrices inside selected layers, which reduces trainable parameters and memory compared with full fine-tuning.[5]
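To make that reduction concrete, here is a minimal arithmetic sketch for a single weight matrix. The 4096 x 4096 shape and rank 8 are assumed values chosen for illustration, not recommendations, and `lora_trainable_params` is a hypothetical helper name.

```python
def lora_trainable_params(d_out, d_in, rank):
    # LoRA adds two small matrices: B (d_out x rank) and A (rank x d_in).
    # The original d_out x d_in matrix stays frozen.
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8  # assumed shapes for illustration
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)

print(f"full_matrix_params={full}")
print(f"lora_adapter_params={adapter}")
print(f"trainable_fraction={adapter / full:.4%}")
```

The fraction shrinks as the matrix grows, because adapter size scales with the sum of the dimensions while the full matrix scales with their product.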
Use QLoRA when:
QLoRA keeps the adapter-training idea but backpropagates through a frozen 4-bit quantized base model into LoRA adapters, which is why it can make larger models fit in less memory.[6] The comparison that matters is held-out behavior under an honest memory and compute budget. A cheap run that fails the task isn't a win.
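A rough way to see the memory effect is to compare storage for the frozen base weights alone. The sketch below assumes a 7-billion-parameter model and deliberately ignores quantization constants, adapters, activations, and optimizer state, so treat the numbers as an order-of-magnitude estimate rather than a measured footprint.

```python
def weight_storage_gib(num_params, bits_per_param):
    # Storage for the frozen base weights only.
    return num_params * bits_per_param / 8 / 2**30

num_params = 7_000_000_000  # assumed model size for illustration
sixteen_bit_gib = weight_storage_gib(num_params, 16)
four_bit_gib = weight_storage_gib(num_params, 4)

print(f"16-bit base weights ~ {sixteen_bit_gib:.1f} GiB")
print(f"4-bit base weights  ~ {four_bit_gib:.1f} GiB")
```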
For this pipeline, store a prompt, a desired completion, and a group key for leakage-resistant splitting on every SFT row. A missing completion isn't harmless missing metadata: it's a training example with no answer to teach.
```python
rows = [
    {"thread_id": "T102", "prompt": "Stale key", "completion": "Request rotation evidence."},
    {"thread_id": "B550", "prompt": "Expired key", "completion": ""},
    {"prompt": "Wrong size", "completion": "Offer an exchange label."},
]

required = {"thread_id", "prompt", "completion"}
accepted, rejected = [], []
for index, row in enumerate(rows):
    missing = sorted(required - row.keys())
    if missing:
        rejected.append((index, f"missing {missing}"))
    elif not row["completion"].strip():
        rejected.append((index, "empty completion"))
    else:
        accepted.append(row)

print("accepted_threads=", [row["thread_id"] for row in accepted])
print("rejected=", rejected)
```

```text
accepted_threads= ['T102']
rejected= [(1, 'empty completion'), (2, "missing ['thread_id']")]
```

Before formatting rows, reserve evaluation examples that training can't imitate through near duplicates. For a policy assistant, several messages from one access incident or one policy document often share facts and phrasing. Randomly splitting individual messages can put one part of the same case in training and another in evaluation.
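A small sketch of that failure, assuming a naive row-level split that alternates messages between train and evaluation. The rows and the `leaked_cases` helper are hypothetical; the point is only that a row-level split can place both halves of one case on opposite sides.

```python
messages = [
    {"case_id": "T102", "turn": 1},
    {"case_id": "T102", "turn": 2},
    {"case_id": "R550", "turn": 1},
    {"case_id": "R550", "turn": 2},
]

# Naive row-level split: even-indexed rows train, odd-indexed rows evaluate.
row_train = messages[0::2]
row_eval = messages[1::2]

def leaked_cases(train, evaluation):
    # Cases that appear on both sides of the split.
    train_ids = {row["case_id"] for row in train}
    eval_ids = {row["case_id"] for row in evaluation}
    return sorted(train_ids & eval_ids)

row_leaks = leaked_cases(row_train, row_eval)
print("row_level_leaks=", row_leaks)
```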
Group by the smallest deployment unit that should be unseen at evaluation time: incident thread, policy document, tenant, or time window. The tiny check below keeps every message from an incident on one side of the split.
```python
rows = [
    {"case_id": "T102", "turn": 1, "answer": "Request key-rotation evidence."},
    {"case_id": "T102", "turn": 2, "answer": "Approve access after rotation proof."},
    {"case_id": "R550", "turn": 1, "answer": "Escalate privileged-role changes."},
    {"case_id": "P771", "turn": 1, "answer": "Cite the session-timeout policy."},
]

eval_cases = {"R550"}
train = [row for row in rows if row["case_id"] not in eval_cases]
evaluation = [row for row in rows if row["case_id"] in eval_cases]

train_cases = {row["case_id"] for row in train}
held_out_cases = {row["case_id"] for row in evaluation}
assert train_cases.isdisjoint(held_out_cases)

print("train_cases=", sorted(train_cases))
print("eval_cases=", sorted(held_out_cases))
print("case_overlap=", train_cases & held_out_cases)
```

```text
train_cases= ['P771', 'T102']
eval_cases= ['R550']
case_overlap= set()
```

An SFT run usually performs these steps:
The instruction-tuning lesson already covered chat-template mechanics. The operational lesson here is that you version this whole path together. If you change the template or masking logic, you changed the training distribution and therefore the experiment.[7]
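One way to version the path together is to hash the pieces that define the training distribution into a single identifier. This is a minimal sketch: the config keys, the template string, and `pipeline_fingerprint` are hypothetical names, and a real run would include whatever else shapes its inputs.

```python
import hashlib
import json

def pipeline_fingerprint(config):
    # Canonical JSON (sorted keys) so key order does not change the hash.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

base = {
    "chat_template": "<system>{system}<user>{user}<assistant>{answer}<end_of_turn>",
    "tokenizer_version": "policy-sft-tokenizer-v3",
    "loss_mask": "response_only",
    "max_length": 1024,
}
changed = {**base, "loss_mask": "full_sequence"}

print("same_config_same_id=", pipeline_fingerprint(base) == pipeline_fingerprint(dict(base)))
print("mask_change_new_id=", pipeline_fingerprint(base) != pipeline_fingerprint(changed))
```

Storing this identifier next to each checkpoint makes it obvious when two runs that look comparable were in fact trained on different formatted data.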
The label mask is easy to get wrong because the input still contains the system instruction, user request, and formatting control tokens. For response-only SFT, the desired answer tokens (including the turn terminator when the model must learn when to stop) contribute to loss. Prefix tokens provide context, not answer targets. The tiny example uses -100, the default ignore_index of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to the loss.
```python
IGNORE_INDEX = -100

tokens = [
    ("prefix", "<system>"),
    ("prefix", "Follow access policy."),
    ("prefix", "<user>"),
    ("prefix", "Key T102 is stale."),
    ("prefix", "<assistant>"),
    ("target", "Ask for rotation evidence within 24 hours."),
    ("target", "<end_of_turn>"),
]

labels = [
    token if span == "target" else IGNORE_INDEX
    for span, token in tokens
]

supervised = [
    token
    for (_, token), label in zip(tokens, labels)
    if label != IGNORE_INDEX
]
print("supervised_tokens=", supervised)
print("masked_positions=", sum(label == IGNORE_INDEX for label in labels))
```

```text
supervised_tokens= ['Ask for rotation evidence within 24 hours.', '<end_of_turn>']
masked_positions= 5
```

Don't confuse the label mask with the attention mask. The label mask decides which positions contribute to loss. The attention mask decides which earlier positions a token can read. Prompt tokens should remain visible as context even when their labels are -100; packed examples also need attention boundaries so one example can't read another.
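The two masks can be placed side by side in a toy padded example. This sketch uses the common padding-style attention mask (1 for real tokens, 0 for padding) and string tokens instead of ids; the `<pad>` token and the `is_answer` flags are assumptions made for the illustration.

```python
IGNORE_INDEX = -100
PAD = "<pad>"

tokens = ["<user>", "Key T102 is stale.", "<assistant>", "Ask for evidence.", "<end_of_turn>", PAD, PAD]
is_answer = [False, False, False, True, True, False, False]

# Attention mask: 1 for real tokens (prompt and answer), 0 for padding.
attention_mask = [0 if token == PAD else 1 for token in tokens]
# Label mask: only answer tokens keep a label; prompt and padding are ignored.
labels = [token if answer else IGNORE_INDEX for token, answer in zip(tokens, is_answer)]

# Prompt tokens are readable as context but never scored.
visible_but_unsupervised = [
    token
    for token, attend, label in zip(tokens, attention_mask, labels)
    if attend == 1 and label == IGNORE_INDEX
]
print("attention_mask=", attention_mask)
print("visible_but_unsupervised=", visible_but_unsupervised)
```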
Why bother masking easy prefix tokens? If a long repeated instruction is easy to predict, its small losses can dominate the average and hide poor answer learning:
```python
prefix_losses = [0.03, 0.04, 0.02, 0.05, 0.03]
answer_losses = [2.20, 1.80]

full_sequence_loss = sum(prefix_losses + answer_losses) / len(prefix_losses + answer_losses)
response_only_loss = sum(answer_losses) / len(answer_losses)

print(f"full_sequence_loss={full_sequence_loss:.3f}")
print(f"response_only_loss={response_only_loss:.3f}")
assert response_only_loss > full_sequence_loss
```

```text
full_sequence_loss=0.596
response_only_loss=2.000
```

You rarely build this mask by hand. In Hugging Face TRL's SFTTrainer, prompt-completion datasets use completion-only loss by default. Conversational datasets can set assistant_only_loss=True; their chat template must expose assistant spans through generation markers. Current TRL can substitute supported training templates for known model families, but a custom template still needs verification.[1] Inspect a tokenized example before launching a long run: it should reveal exactly which answer and stop tokens receive labels.
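A library-independent way to do that inspection is to locate the contiguous runs of supervised labels and look at the ids inside them. The sketch below uses hypothetical token ids and a hypothetical `supervised_spans` helper; in a real run you would feed it one row from the trainer's collated batch and decode the ids with your tokenizer.

```python
IGNORE_INDEX = -100

def supervised_spans(labels):
    """Return (start, end) pairs, end exclusive, for runs of supervised labels."""
    spans, start = [], None
    for index, label in enumerate(labels):
        if label != IGNORE_INDEX and start is None:
            start = index
        elif label == IGNORE_INDEX and start is not None:
            spans.append((start, index))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

# Hypothetical ids: 7 masked prompt tokens, then 3 answer tokens and a stop token (id 4).
input_ids = [1, 50, 51, 52, 2, 60, 3, 70, 71, 72, 4]
labels = [IGNORE_INDEX] * 7 + [70, 71, 72, 4]

spans = supervised_spans(labels)
print("supervised_spans=", spans)
for start, end in spans:
    print("supervised_ids=", input_ids[start:end])
```

If the last supervised id is not the stop token, or a span covers prompt positions, the template or masking configuration needs another look before training.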
Mask inspection must happen after truncation as well. If the prompt fills max_length, the run can keep a row whose entire answer has disappeared:
```python
IGNORE_INDEX = -100

prompt_tokens = ["<system>", "policy", "<user>", "stale_key", "<assistant>"]
answer_tokens = ["request_evidence", "<end_of_turn>"]
max_length = 5

kept_tokens = (prompt_tokens + answer_tokens)[:max_length]
labels = [
    token if position >= len(prompt_tokens) else IGNORE_INDEX
    for position, token in enumerate(kept_tokens)
]
has_answer_label = any(label != IGNORE_INDEX for label in labels)

print("kept_tokens=", kept_tokens)
print("has_answer_label=", has_answer_label)
print("decision=", "keep" if has_answer_label else "reject_or_increase_max_length")
```

```text
kept_tokens= ['<system>', 'policy', '<user>', 'stale_key', '<assistant>']
has_answer_label= False
decision= reject_or_increase_max_length
```

Most SFT datasets are short replies. Padding each one to a fixed length wastes compute on padding tokens. Packing places several examples in a nearly full training block so more positions carry real tokens.
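A quick utilization estimate shows the size of the waste. The example lengths and the 1024-token block size are assumed values, and the packed figure is a best case that counts only the minimum number of blocks needed to hold the tokens.

```python
lengths = [180, 212, 164, 220, 198, 204]  # assumed token counts per example
block_size = 1024

# Padding: every example occupies a full block of block_size positions.
real_tokens = sum(lengths)
padded_positions = len(lengths) * block_size
padded_utilization = real_tokens / padded_positions

# Packing, best case: the fewest blocks that can hold all real tokens.
blocks_needed = -(-real_tokens // block_size)  # ceiling division
packed_utilization = real_tokens / (blocks_needed * block_size)

print(f"padded_utilization={padded_utilization:.1%}")
print(f"packed_utilization={packed_utilization:.1%}")
```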
For independent instruction examples, a reply for case R550 shouldn't gain context from case T102 only because the loader put them in one block. A boundary-aware packed attention path records each segment, resets positions at each sequence boundary, and prevents that cross-example attention:
```python
segments = [
    ["T102 prompt", "T102 response", "<eos>"],
    ["R550 prompt", "R550 response", "<eos>"],
]

# Flatten the packed block while remembering which example each token came from.
flat = [(sequence_id, token) for sequence_id, row in enumerate(segments) for token in row]
# Positions restart at 0 for every packed example.
position_ids = [position for row in segments for position, _ in enumerate(row)]

def may_attend(query_index, key_index):
    query_sequence, _ = flat[query_index]
    key_sequence, _ = flat[key_index]
    # Causal (no future keys) and restricted to the same packed example.
    return key_index <= query_index and query_sequence == key_sequence

r550_response = 4
assert position_ids == [0, 1, 2, 0, 1, 2]
assert not may_attend(r550_response, 1)
print("position_ids=", position_ids)
print("R550_response_sees_T102_response=", may_attend(r550_response, 1))
```

```text
position_ids= [0, 1, 2, 0, 1, 2]
R550_response_sees_T102_response= False
```

Reset position IDs are boundary metadata, not a universal attention mask. The attention implementation still has to honor those boundaries when it computes attention.
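What "honoring the boundaries" means can be drawn as a dense mask: causal within each example and zero across examples. This is an illustration only. Padding-free attention kernels typically work from sequence boundaries rather than materializing a matrix like this, and `build_mask` is a hypothetical helper.

```python
sequence_ids = [0, 0, 0, 1, 1, 1]  # two packed examples of three tokens each

def build_mask(sequence_ids):
    # mask[query][key] is 1 when the query token may read the key token:
    # the key is not in the future and belongs to the same packed example.
    size = len(sequence_ids)
    return [
        [int(key <= query and sequence_ids[key] == sequence_ids[query]) for key in range(size)]
        for query in range(size)
    ]

mask = build_mask(sequence_ids)
for row in mask:
    print(row)
```

The printed matrix is block-diagonal: the lower-left block, where tokens of the second example would read the first, is all zeros.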
Current TRL exposes packing=True and eval_packing. Its default best-fit-decreasing (bfd) strategy packs intact examples efficiently and truncates overflow sequences. bfd_split preserves overflow tokens by splitting long sequences before packing. wrapped is the aggressive stream-and-split option, and it may mix unrelated examples. With bfd, TRL enables padding-free processing. That flattened path relies on FlashAttention 2 or 3 to honor sequence boundaries; without a compatible attention implementation, adjacent samples may contaminate each other.[1] The right check isn't "never pack." It's: choose an SFT-appropriate strategy, confirm the supported attention path, and inspect a packed batch after library upgrades.
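For intuition about what a best-fit-decreasing strategy does, here is a small generic bin-packing sketch over example lengths. It is not TRL's implementation: the function name, the example lengths, and the choice to truncate over-long examples to the block size are assumptions made for the illustration.

```python
def best_fit_decreasing(lengths, capacity):
    bins = []  # each bin is a list of example lengths sharing one block
    for length in sorted(lengths, reverse=True):
        length = min(length, capacity)  # sketch: truncate overflow to the block size
        candidates = [b for b in bins if sum(b) + length <= capacity]
        if candidates:
            # Best fit: choose the fullest bin that still has room.
            max(candidates, key=sum).append(length)
        else:
            bins.append([length])
    return bins

bins = best_fit_decreasing([700, 500, 300, 300, 200, 1200], capacity=1024)
utilization = sum(map(sum, bins)) / (len(bins) * 1024)

print("bins=", bins)
print(f"utilization={utilization:.1%}")
```

Every example stays intact inside one block, which is the property that distinguishes this family of strategies from stream-and-split packing.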
SFT begins from a pretrained checkpoint, so a useful configuration starts conservatively and earns expansion through held-out results. A library default isn't evidence that one value fits your dataset or parameterization.
| Knob | Baseline to test | What decides whether it survives |
|---|---|---|
| Learning rate | TRL SFTConfig currently defaults to 2e-5; its PEFT guidance suggests about 1e-4 for adapters[1] | a short sweep on the actual update surface and held-out task metrics |
| Passes over data | start with a small budget and evaluate during the run | widening train/eval gap or declining behavior metrics |
| Warmup and schedule | set explicitly in the run config | early instability and sweep results |
| Weight decay and clipping | log them with the run, even when disabled | stability and evaluation evidence, not inherited habit |
Don't turn 2e-5 or 1e-4 into a rule. Record parameterization, trainable-parameter count, sequence length, effective token budget, data snapshot, and evaluation result beside every candidate configuration.
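One lightweight way to keep those fields together is a frozen record serialized as a JSON line per candidate. The `RunRecord` class and every value in it are hypothetical placeholders, not results from a real run.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunRecord:
    parameterization: str
    trainable_params: int
    learning_rate: float
    max_length: int
    supervised_tokens_per_update: int
    data_snapshot: str
    eval_metric: str
    eval_value: float

# Placeholder values for illustration only.
record = RunRecord(
    parameterization="lora",
    trainable_params=4_194_304,
    learning_rate=1e-4,
    max_length=1024,
    supervised_tokens_per_update=6240,
    data_snapshot="data/sft_manifest_2026-05-20.json",
    eval_metric="policy_pass_rate",
    eval_value=0.97,
)

line = json.dumps(asdict(record), sort_keys=True)
print(line)
```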
Training discussions get sloppy fast here. The batch that sits on one device isn't the same thing as the batch the optimizer sees before one update. For equal-length examples, the usual rule is:
```python
examples_per_update = per_device_batch * gradient_accumulation_steps * world_size
```

Packed response examples aren't equal length. This formula reports examples per optimizer update, but the number of supervised answer tokens can still change from update to update. Track both.

```python
per_device_batch = 4
grad_accum = 8
world_size = 4
examples_per_update = per_device_batch * grad_accum * world_size

# One entry per accumulation step on a single rank.
supervised_tokens_per_microbatch_on_one_rank = [180, 212, 164, 220, 198, 204, 175, 207]
# Estimate: assumes the other ranks see similar counts; a real run should sum across ranks.
supervised_tokens_per_update = sum(supervised_tokens_per_microbatch_on_one_rank) * world_size

print(f"examples_per_update={examples_per_update}")
print(f"supervised_tokens_per_update={supervised_tokens_per_update}")
```

```text
examples_per_update=128
supervised_tokens_per_update=6240
```

Always calculate changed configurations instead of reasoning from one changed knob:
```python
def examples_per_update(per_device_batch, accumulation, world_size):
    return per_device_batch * accumulation * world_size

run_a = examples_per_update(4, 8, 4)
run_b = examples_per_update(2, 16, 8)

print("run_a_examples_per_update=", run_a)
print("run_b_examples_per_update=", run_b)
print("ratio_b_over_a=", run_b / run_a)
assert run_b == 2 * run_a
```

```text
run_a_examples_per_update= 128
run_b_examples_per_update= 256
ratio_b_over_a= 2.0
```

Two habits follow from this:
These knobs aren't decoration.
Warmup limits the size of early updates while the run is most sensitive to its initial configuration. For LLM SFT, jumping immediately to the peak learning rate can make the first steps unnecessarily noisy.
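A linear warmup can be sketched in a few lines. The peak rate and step count are assumed values, and this version simply holds the peak afterwards, whereas real schedules usually decay it.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    # Linear ramp up to peak_lr over warmup_steps updates, then hold.
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

peak_lr, warmup_steps = 2e-5, 4  # assumed values for illustration
schedule = [warmup_lr(step, peak_lr, warmup_steps) for step in range(6)]
print([f"{lr:.1e}" for lr in schedule])
```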
Clip gradients when spikes would destabilize the run. Clipping doesn't fix bad data, but it can stop one pathological batch from damaging the checkpoint.
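Global-norm clipping rescales the whole gradient when its norm exceeds a threshold, so direction is preserved and only magnitude changes. The pure-Python sketch below shows the idea on a toy gradient; in PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`, and the threshold of 1.0 here is an assumed value.

```python
import math

def clip_by_global_norm(grads, max_norm):
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds the threshold.
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [g * scale for g in grads], total_norm

spike = [3.0, 4.0, 12.0]  # toy gradient with norm 13
clipped, norm_before = clip_by_global_norm(spike, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(f"norm_before={norm_before:.1f}")
print(f"norm_after={norm_after:.1f}")
```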
Weight decay regularizes the update path. It should be treated as part of the recipe, not an afterthought copy-pasted from another config.
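As a toy illustration, decoupled weight decay (the AdamW-style form) shrinks each weight toward zero separately from the gradient step. The sketch applies it to a plain gradient step on one scalar weight; the learning rate and decay values are assumptions chosen to make the effect visible.

```python
def decoupled_decay_step(weight, grad, lr, weight_decay):
    # Gradient step plus a separate shrink proportional to the weight itself.
    return weight - lr * grad - lr * weight_decay * weight

weight = 1.0
for _ in range(3):
    # Zero gradient isolates the decay term: the weight still shrinks each step.
    weight = decoupled_decay_step(weight, grad=0.0, lr=0.1, weight_decay=0.1)

print(f"weight_after_three_zero_grad_steps={weight:.6f}")
```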
Don't memorize one numeric recipe. Know what each control is trying to prevent.
A real SFT resume bundle isn't only model_state_dict.
At minimum, retain:
Why this matters:
Exact bit-for-bit replay may still depend on hardware, kernels, and determinism settings. The resume bundle should at least prevent avoidable changes to data order, learning-rate state, and model selection. torchtune and DeepSpeed both document resume flows explicitly instead of treating checkpointing as a trivial file write.[8][9]
A minimal manifest should make missing resume state obvious:
```python
required = {
    "model_state",
    "optimizer_state",
    "scheduler_state",
    "global_step",
    "rng_state",
    "sampler_state",
    "tokenizer_version",
    "chat_template_version",
    "data_manifest",
    "eval_manifest",
    "best_metric",
}

resume_bundle = {
    "model_state": "weights/step_600.safetensors",
    "optimizer_state": "optimizer/step_600.pt",
    "scheduler_state": "scheduler/step_600.pt",
    "global_step": 600,
    "rng_state": "rng/step_600.pt",
    "sampler_state": {"epoch": 1, "batches_consumed": 120},
    "tokenizer_version": "policy-sft-tokenizer-v3",
    "chat_template_version": "llama3-support-v2",
    "data_manifest": "data/sft_manifest_2026-05-20.json",
    "eval_manifest": "eval/access_policy_behavior_v4.json",
    "best_metric": {"name": "support_resolution_accuracy", "value": 0.78},
}

missing = sorted(required - resume_bundle.keys())
print("resume_ready=", not missing)
print("best_metric=", resume_bundle["best_metric"])
```

```text
resume_ready= True
best_metric= {'name': 'support_resolution_accuracy', 'value': 0.78}
```

The best checkpoint is the one that optimizes the product metric you care about on held-out data.
Bad rules:
Better evidence:
If the task is structured access-policy replies, check policy compliance and schema validity before style preference. Loss is useful for training diagnostics, but it doesn't define deployment quality.
```python
checkpoints = [
    {"step": 200, "val_loss": 1.91, "policy_pass_rate": 0.93, "format_pass_rate": 0.99},
    {"step": 400, "val_loss": 1.77, "policy_pass_rate": 0.91, "format_pass_rate": 1.00},
    {"step": 600, "val_loss": 1.79, "policy_pass_rate": 0.97, "format_pass_rate": 0.98},
]

eligible = [row for row in checkpoints if row["format_pass_rate"] >= 0.98]
best = max(eligible, key=lambda row: (row["policy_pass_rate"], -row["val_loss"]))

print("best_step=", best["step"])
print("best_policy_pass_rate=", best["policy_pass_rate"])
print("lowest_loss_step=", min(checkpoints, key=lambda row: row["val_loss"])["step"])
```

```text
best_step= 600
best_policy_pass_rate= 0.97
lowest_loss_step= 400
```

A good SFT rollout usually starts with the smallest honest setup:
Only then do you widen the job.
That bridge matters. FSDP and ZeRO aren't "advanced decorations" on top of a stable recipe. They let the same recipe continue when memory stops fitting on one device by sharding model states across workers.[10][11]
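A back-of-envelope estimate shows why sharding matters for full fine-tuning. The sketch uses the mixed-precision Adam accounting from the ZeRO paper (16 bytes of model state per parameter) and assumes a 7.5-billion-parameter model on 8 devices with all model states sharded; it ignores activations, buffers, and fragmentation, and the helper name is hypothetical.

```python
def model_state_gib_per_device(num_params, world_size, shard):
    # Per parameter: 2 bytes fp16 weights + 2 bytes fp16 gradients
    # + 12 bytes fp32 optimizer states (master weights, momentum, variance).
    bytes_per_param = 2 + 2 + 12
    total_bytes = num_params * bytes_per_param
    per_device = total_bytes / world_size if shard else total_bytes
    return per_device / 2**30

num_params, world_size = 7_500_000_000, 8  # assumed values for illustration
unsharded = model_state_gib_per_device(num_params, world_size, shard=False)
sharded = model_state_gib_per_device(num_params, world_size, shard=True)

print(f"unsharded_model_states ~ {unsharded:.1f} GiB per device")
print(f"fully_sharded_model_states ~ {sharded:.1f} GiB per device")
```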
The exact library can vary, but the shape is stable:
```text
dataset -> template -> tokenizer -> collator/mask -> trainer
        -> periodic eval -> checkpoint save -> best-checkpoint export
```

TRL, torchtune, and the Alignment Handbook all expose variations of this same loop.[1][12][13]
For a TRL prompt-completion dataset, a configuration skeleton makes the decisions visible. Configure model_with_lora_adapters with a supported FlashAttention 2 or 3 implementation before enabling bfd packing:
```python
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="runs/access-policy-sft-lora",
    learning_rate=1e-4,  # adapter baseline to evaluate, not a universal rule
    max_length=1024,
    packing=True,
    packing_strategy="bfd",  # keep examples intact; use a compatible attention path
    completion_only_loss=True,  # train on the completion field, not the prompt
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
)

trainer = SFTTrainer(
    model=model_with_lora_adapters,
    args=args,
    train_dataset=train_rows,  # {"prompt": ..., "completion": ...}
    eval_dataset=held_out_rows,
    processing_class=tokenizer,
)
```

This fragment deliberately leaves model loading, adapter construction, and the data manifest outside the snippet. They must be versioned inputs to a real run, not invisible defaults.
per_device_batch, gradient_accumulation_steps, world_size, packed/padded mode, examples per update, and supervised tokens per update.
bfd with a supported FlashAttention 2 or 3 implementation and inspect packed batches. Use bfd_split deliberately when preserving overflow matters; don't silently substitute wrapped or an unsupported attention path.
Check that you can explain and operate these parts of an SFT run:
bfd, bfd_split, and wrapped strategies differ, and why the padding-free attention path still has to honor example boundaries.
completion_only_loss or assistant_only_loss.
[1] TRL Documentation: SFT Trainer. Hugging Face, 2026.
[2] Fine-Tune Your First LLM. PyTorch Contributors, 2026.
[3] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Gururangan, S., Marasovic, A., Swayamdipta, S., et al., 2020. ACL 2020.
[4] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov, R., et al., 2023.
[5] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al., 2021. ICLR.
[6] QLoRA: Efficient Finetuning of Quantized Language Models. Dettmers, T., et al., 2023. NeurIPS.
[7] Transformers Documentation: Writing a chat template. Hugging Face, 2026.
[8] Checkpointing in torchtune. PyTorch Contributors, 2026.
[9] Universal Checkpointing with DeepSpeed: A Practical Guide. DeepSpeed Team, 2026.
[10] FullyShardedDataParallel. PyTorch Contributors, 2026.
[11] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Rajbhandari, S., et al., 2020. SC 2020.
[12] alignment-handbook: Robust recipes to align language models with human and AI preferences. Hugging Face, 2025.
[13] Welcome to the torchtune Documentation. PyTorch Contributors, 2026.