Supervised Fine-Tuning Pipeline

Run supervised fine-tuning as a real training system: choose the learning objective before the update surface, verify response-token loss and packing, track the real batch budget, save resumable checkpoints, and export on held-out behavior.

The synthetic-data chapter built and verified candidate rows for post-training. Supervised fine-tuning (SFT) turns accepted demonstrations, whether written by humans or generated and checked, into gradient updates on desired responses. A good SFT run doesn't start with a trainer call. It starts with a behavior target, a leakage-resistant split, a loss mask, a batch budget, checkpoints that can resume, and an evaluation rule for exporting the best artifact. [1] (TRL Documentation: SFT Trainer, https://huggingface.co/docs/trl/sft_trainer) [2] (Fine-Tune Your First LLM, https://meta-pytorch.org/torchtune/stable/tutorials/first_finetune_tutorial.html)

Turn SFT into an operational training system for access-policy assistant replies. First choose the learning objective. Then choose whether that SFT objective updates every weight or a small adapter. Keeping those two decisions separate prevents expensive experiments that answer the wrong question.

Figure: SFT pipeline showing accepted rows, masked prompt labels, parameterization, held-out evaluation, and export.
An SFT run has one supervised objective but multiple possible update surfaces. Prompt tokens stay visible as context while -100 labels remove them from loss; keep that mask, the split, parameterization, checkpoint state, and held-out evaluation aligned before trusting an export.

What an SFT run must decide

Every serious SFT job has to answer the same questions:

  1. What is the behavior target?
  2. Is SFT the right objective, or is the missing ingredient still domain pretraining?
  3. Are we updating all weights or only adapters?
  4. What examples and tokens contribute to training loss?
  5. What is the true example and supervised-token batch budget?
  6. Which held-out metric decides the best checkpoint?
  7. Can this stay on one GPU, or do we need Fully Sharded Data Parallel (FSDP) or Zero Redundancy Optimizer (ZeRO)?

Once you phrase the job that way, the trainer isn't the system. It's only one component inside the system.

Separate objective from parameterization

An objective tells the model what signal to learn from. A parameterization tells the trainer which weights may move. They aren't interchangeable.

| Failure you observe | Objective to investigate | Why |
| --- | --- | --- |
| Base model has weak exposure to domain language in unlabeled corpora | continued pretraining | next-token training on domain text adapts the language distribution before behavior training [3] (Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, https://aclanthology.org/2020.acl-main.740/) |
| Model can read an access policy but doesn't follow the desired escalation procedure | SFT | prompt-completion demonstrations directly teach that response behavior [1] |
| Several acceptable answers need ranking by preference | preference training after establishing an SFT baseline | comparisons express relative preference rather than a single target answer [4] (Direct Preference Optimization: Your Language Model is Secretly a Reward Model, https://arxiv.org/abs/2305.18290) |

After choosing SFT, make a second decision:

| SFT parameterization | What moves during the same supervised objective | When to test it |
| --- | --- | --- |
| Full fine-tuning | all trainable model weights | when memory permits and adapters may be too restrictive |
| LoRA | small low-rank adapter matrices; base weights stay frozen | when iteration speed, memory, or many task variants matter [5] (LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685) |
| QLoRA | LoRA adapters while the frozen base is stored in 4-bit form | when the base model doesn't fit comfortably at higher precision [6] (QLoRA: Efficient Finetuning of Quantized Language Models, https://arxiv.org/abs/2305.14314) |

LoRA and QLoRA aren't alternatives to SFT. They can be the parameterization used for an SFT run. Continued pretraining is a different objective; it can also be parameter-efficient in an appropriate setup, but it still doesn't become SFT.

Full weights or adapters

Full fine-tuning

Full fine-tuning is the simplest conceptual update: every eligible parameter may change to reduce supervised response loss. Test it when:

  • the model fits with its gradients and optimizer state
  • one merged checkpoint is operationally simpler than serving adapters
  • an adapter baseline underfits your held-out behavior metric

LoRA

LoRA is still SFT when trained on prompt-response rows. Use it when:

  • iteration speed or training memory limits matter
  • you need separate task or tenant variants over one base model
  • you want a strong adapter baseline before paying for full updates

LoRA freezes the base model weights and trains low-rank adapter matrices inside selected layers, which reduces trainable parameters and memory compared with full fine-tuning. [5]
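To make that reduction concrete, here is a rough parameter count for a single weight matrix. It is a sketch under assumptions: the 4096 x 4096 layer shape and rank 16 are illustrative values rather than numbers from any particular model, and `lora_param_count` is a hypothetical helper.

```python
def full_param_count(d_out, d_in):
    # Full fine-tuning may update every entry of the weight matrix.
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    # LoRA trains two small matrices: B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)


# Illustrative shape and rank (assumptions, not taken from a specific model).
d_out, d_in, rank = 4096, 4096, 16
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)

print("full_params=", full)
print("lora_params=", lora)
print(f"trainable_fraction={lora / full:.4f}")
```

The count covers one adapted matrix only. A real run also depends on which layers receive adapters, so report the trainable-parameter total your library prints for the actual model.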

QLoRA

Use QLoRA when:

  • LoRA is the right algorithmic choice
  • the frozen base model itself is too large to keep in BF16/FP16 memory
  • you still want adapter training instead of full-weight updates

QLoRA keeps the adapter-training idea but backpropagates through a frozen 4-bit quantized base model into LoRA adapters, which is why it can make larger models fit in less memory. [6] The comparison that matters is held-out behavior under an honest memory and compute budget. A cheap run that fails the task isn't a win.

Reject broken demonstrations before training

For this pipeline, store a prompt, a desired completion, and a group key for leakage-resistant splitting on every SFT row. A missing completion isn't harmless missing metadata: it's a training example with no answer to teach.

validate_sft_rows.py
    rows = [
        {"thread_id": "T102", "prompt": "Stale key", "completion": "Request rotation evidence."},
        {"thread_id": "B550", "prompt": "Expired key", "completion": ""},
        {"prompt": "Wrong size", "completion": "Offer an exchange label."},
    ]

    required = {"thread_id", "prompt", "completion"}
    accepted, rejected = [], []
    for index, row in enumerate(rows):
        missing = sorted(required - row.keys())
        if missing:
            rejected.append((index, f"missing {missing}"))
        elif not row["completion"].strip():
            rejected.append((index, "empty completion"))
        else:
            accepted.append(row)

    print("accepted_threads=", [row["thread_id"] for row in accepted])
    print("rejected=", rejected)

Row contract output

    accepted_threads= ['T102']
    rejected= [(1, 'empty completion'), (2, "missing ['thread_id']")]

Split by the unit that could leak

Before formatting rows, reserve evaluation examples that training can't imitate through near duplicates. For a policy assistant, several messages from one access incident or one policy document often share facts and phrasing. Randomly splitting individual messages can put one part of the same case in training and another in evaluation.

Group by the smallest deployment unit that should be unseen at evaluation time: incident thread, policy document, tenant, or time window. The tiny check below keeps every message from an incident on one side of the split.

grouped_sft_split.py
    rows = [
        {"case_id": "T102", "turn": 1, "answer": "Request key-rotation evidence."},
        {"case_id": "T102", "turn": 2, "answer": "Approve access after rotation proof."},
        {"case_id": "R550", "turn": 1, "answer": "Escalate privileged-role changes."},
        {"case_id": "P771", "turn": 1, "answer": "Cite the session-timeout policy."},
    ]

    eval_cases = {"R550"}
    train = [row for row in rows if row["case_id"] not in eval_cases]
    evaluation = [row for row in rows if row["case_id"] in eval_cases]

    train_cases = {row["case_id"] for row in train}
    held_out_cases = {row["case_id"] for row in evaluation}
    assert train_cases.isdisjoint(held_out_cases)

    print("train_cases=", sorted(train_cases))
    print("eval_cases=", sorted(held_out_cases))
    print("case_overlap=", train_cases & held_out_cases)

Grouped split output

    train_cases= ['P771', 'T102']
    eval_cases= ['R550']
    case_overlap= set()
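A hand-picked evaluation set works for four rows but not for a growing dataset. One common approach, shown here as a sketch rather than as part of the pipeline above, is to derive the split from a hash of the group key so that a case always lands on the same side, even when new rows arrive. The `assign_split` helper, the salt string, and the 20% evaluation fraction are all assumptions for illustration.

```python
import hashlib


def assign_split(group_key, eval_fraction=0.2, salt="sft-split-v1"):
    # Hash the salted group key so the assignment is stable across runs and machines.
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the digest to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "eval" if bucket < eval_fraction else "train"


case_ids = ["T102", "T102", "R550", "P771"]
splits = [assign_split(case_id) for case_id in case_ids]

# Every row from one case receives the same assignment.
assert splits[0] == splits[1]
print(dict(zip(case_ids, splits)))
```

Changing the salt reshuffles every assignment, so treat the salt as part of the data manifest. With few groups, the realized evaluation share can differ noticeably from `eval_fraction`.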

Data path: template, tokenize, label, pack

An SFT run usually performs these steps:

  1. read structured examples
  2. apply the model-specific chat template
  3. run tokenization
  4. label desired response tokens and mask the prompt/context tokens
  5. pack short examples with sequence-boundary handling, or pad them into batches
  6. feed the trainer

The instruction-tuning lesson already covered chat-template mechanics. The operational lesson here is that you version this whole path together. If you change the template or masking logic, you changed the training distribution and therefore the experiment. [7] (Transformers Documentation: Writing a chat template, https://huggingface.co/docs/transformers/main/en/chat_templating_writing)
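One way to version the path together is to fingerprint the settings that shape the training distribution. The sketch below is an assumption about how such a fingerprint could look: the field names, the template strings, and the `data_path_fingerprint` helper are hypothetical, not a library feature.

```python
import hashlib
import json


def data_path_fingerprint(chat_template, mask_policy, max_length, packing_strategy):
    # Serialize the settings deterministically so equal configs hash equally.
    payload = json.dumps(
        {
            "chat_template": chat_template,
            "mask_policy": mask_policy,
            "max_length": max_length,
            "packing_strategy": packing_strategy,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = data_path_fingerprint(
    "<user>{prompt}<assistant>{completion}", "completion_only", 1024, "bfd"
)
# A one-character template change (an added newline) is a different experiment.
changed = data_path_fingerprint(
    "<user>{prompt}\n<assistant>{completion}", "completion_only", 1024, "bfd"
)

print("base=", base)
print("changed=", changed)
print("same_experiment=", base == changed)
```

Storing the fingerprint beside each checkpoint makes it obvious when two runs that look comparable were trained on differently formatted data.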

The label mask is easy to get wrong because the input still contains the system instruction, user request, and formatting control tokens. For response-only SFT, the desired answer tokens (including the turn terminator when the model must learn when to stop) contribute to loss. Prefix tokens provide context, not answer targets. The tiny example uses -100, the ignored label value used by PyTorch cross-entropy.

sft_loss_mask.py
    IGNORE_INDEX = -100

    tokens = [
        ("prefix", "<system>"),
        ("prefix", "Follow access policy."),
        ("prefix", "<user>"),
        ("prefix", "Key T102 is stale."),
        ("prefix", "<assistant>"),
        ("target", "Ask for rotation evidence within 24 hours."),
        ("target", "<end_of_turn>"),
    ]

    labels = [
        token if span == "target" else IGNORE_INDEX
        for span, token in tokens
    ]

    supervised = [
        token
        for (_, token), label in zip(tokens, labels)
        if label != IGNORE_INDEX
    ]
    print("supervised_tokens=", supervised)
    print("masked_positions=", sum(label == IGNORE_INDEX for label in labels))

Loss mask output

    supervised_tokens= ['Ask for rotation evidence within 24 hours.', '<end_of_turn>']
    masked_positions= 5

Don't confuse the label mask with the attention mask. The label mask decides which positions contribute to loss. The attention mask decides which earlier positions a token can read. Prompt tokens should remain visible as context even when their labels are -100; packed examples also need attention boundaries so one example can't read another.

Why bother masking easy prefix tokens? If a long repeated instruction is easy to predict, its small losses can dominate the average and hide poor answer learning:

response_only_loss.py
    prefix_losses = [0.03, 0.04, 0.02, 0.05, 0.03]
    answer_losses = [2.20, 1.80]

    full_sequence_loss = sum(prefix_losses + answer_losses) / len(prefix_losses + answer_losses)
    response_only_loss = sum(answer_losses) / len(answer_losses)

    print(f"full_sequence_loss={full_sequence_loss:.3f}")
    print(f"response_only_loss={response_only_loss:.3f}")
    assert response_only_loss > full_sequence_loss

Response-only loss output

    full_sequence_loss=0.596
    response_only_loss=2.000

You rarely build this mask by hand. In Hugging Face TRL's SFTTrainer, prompt-completion datasets use completion-only loss by default. Conversational datasets can set assistant_only_loss=True; their chat template must expose assistant spans through generation markers. Current TRL can substitute supported training templates for known model families, but a custom template still needs verification. [1] Inspect a tokenized example before launching a long run: it should reveal exactly which answer and stop tokens receive labels.
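A minimal inspection helper might look like the following. It is a sketch that assumes you already have parallel lists of token strings and labels for one example; the token strings and label IDs here are made up, and `describe_labels` is a hypothetical name. With a Hugging Face tokenizer, the token strings would typically come from `convert_ids_to_tokens`.

```python
IGNORE_INDEX = -100


def describe_labels(tokens, labels):
    # One line per position: index, whether it contributes to loss, and the token.
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    lines = []
    for position, (token, label) in enumerate(zip(tokens, labels)):
        status = "masked" if label == IGNORE_INDEX else "trained"
        lines.append(f"{position:>3} {status:<7} {token}")
    return lines


# Made-up tokens and label IDs for illustration only.
tokens = ["<user>", "Key T102 is stale.", "<assistant>", "Ask for evidence.", "<end_of_turn>"]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 301, 2]

report = describe_labels(tokens, labels)
print("\n".join(report))

trained = [token for token, label in zip(tokens, labels) if label != IGNORE_INDEX]
# The stop token should be among the trained positions if the model must learn to stop.
assert trained[-1] == "<end_of_turn>"
```

Reading a handful of these reports by eye catches the two most common mask bugs: prompt tokens marked as trained, and a stop token that never receives a label.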

Mask inspection must happen after truncation as well. If the prompt fills max_length, the run can keep a row whose entire answer has disappeared:

reject_truncated_completion.py
    IGNORE_INDEX = -100

    prompt_tokens = ["<system>", "policy", "<user>", "stale_key", "<assistant>"]
    answer_tokens = ["request_evidence", "<end_of_turn>"]
    max_length = 5

    kept_tokens = (prompt_tokens + answer_tokens)[:max_length]
    labels = [
        token if position >= len(prompt_tokens) else IGNORE_INDEX
        for position, token in enumerate(kept_tokens)
    ]
    has_answer_label = any(label != IGNORE_INDEX for label in labels)

    print("kept_tokens=", kept_tokens)
    print("has_answer_label=", has_answer_label)
    print("decision=", "keep" if has_answer_label else "reject_or_increase_max_length")

Truncation guard output

    kept_tokens= ['<system>', 'policy', '<user>', 'stale_key', '<assistant>']
    has_answer_label= False
    decision= reject_or_increase_max_length

Pack short examples while preserving their boundaries

Most SFT datasets are short replies. Padding each one to a fixed length wastes compute on padding tokens. Packing places several examples in a nearly full training block so more positions carry real tokens.

For independent instruction examples, a reply for case B550 shouldn't gain context from case A102 only because the loader put them in one block. A boundary-aware packed attention path records each segment, resets positions at each sequence boundary, and prevents that cross-example attention:

packed_boundary_check.py
    segments = [
        ["A102 prompt", "A102 response", "<eos>"],
        ["B550 prompt", "B550 response", "<eos>"],
    ]

    flat = [(sequence_id, token) for sequence_id, row in enumerate(segments) for token in row]
    position_ids = [position for row in segments for position, _ in enumerate(row)]

    def may_attend(query_index, key_index):
        query_sequence, _ = flat[query_index]
        key_sequence, _ = flat[key_index]
        return key_index <= query_index and query_sequence == key_sequence

    b_response = 4
    assert position_ids == [0, 1, 2, 0, 1, 2]
    assert not may_attend(b_response, 1)
    print("position_ids=", position_ids)
    print("B550_response_sees_A102_response=", may_attend(b_response, 1))

Packed boundary output

    position_ids= [0, 1, 2, 0, 1, 2]
    B550_response_sees_A102_response= False

Reset position IDs are boundary metadata, not a universal attention mask. The attention implementation still has to honor those boundaries when it computes attention.
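Variable-length attention kernels typically consume boundaries as cumulative sequence offsets (often named `cu_seqlens`) rather than as position IDs. The sketch below derives those offsets from reset position IDs. It assumes every segment starts at position 0 and that positions never reset inside a segment; it is an illustration of the bookkeeping, not the code any particular library runs.

```python
def cumulative_sequence_lengths(position_ids):
    # Each position 0 marks the start of a new packed segment.
    if not position_ids:
        return [0]
    if position_ids[0] != 0:
        raise ValueError("packed block must start at position 0")
    starts = [index for index, position in enumerate(position_ids) if position == 0]
    # Offsets are the segment starts plus the total length as a closing boundary.
    return starts + [len(position_ids)]


position_ids = [0, 1, 2, 0, 1, 2]
cu_seqlens = cumulative_sequence_lengths(position_ids)
segment_lengths = [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]

print("cu_seqlens=", cu_seqlens)
print("segment_lengths=", segment_lengths)
```

If the offsets recovered from a real packed batch don't match the example lengths you expect, the boundary metadata is already wrong before attention runs.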

Current TRL exposes packing=True and eval_packing. Its default best-fit-decreasing (bfd) strategy packs intact examples efficiently and truncates overflow sequences. bfd_split preserves overflow tokens by splitting long sequences before packing. wrapped is the aggressive stream-and-split option, and it may mix unrelated examples. With bfd, TRL enables padding-free processing. That flattened path relies on FlashAttention 2 or 3 to honor sequence boundaries; without a compatible attention implementation, adjacent samples may contaminate each other. [1] The right check isn't "never pack." It's: choose an SFT-appropriate strategy, confirm the supported attention path, and inspect a packed batch after library upgrades.
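To see why best-fit-decreasing helps, the sketch below packs a few example lengths into fixed blocks and compares slot utilization against padding every example to the block size. This is a textbook best-fit-decreasing heuristic written for illustration, not TRL's implementation; the lengths and the 1024-token block size are assumptions.

```python
def best_fit_decreasing(lengths, block_size):
    bins = []  # each bin tracks remaining free slots and the lengths it holds
    for length in sorted(lengths, reverse=True):
        if length > block_size:
            raise ValueError("sequence exceeds block size; truncate or split it first")
        candidates = [b for b in bins if b["free"] >= length]
        if candidates:
            # Best fit: the bin that would be left with the least free space.
            target = min(candidates, key=lambda b: b["free"])
        else:
            target = {"free": block_size, "items": []}
            bins.append(target)
        target["items"].append(length)
        target["free"] -= length
    return [b["items"] for b in bins]


lengths = [700, 300, 512, 400, 120, 900]
block_size = 1024
packed = best_fit_decreasing(lengths, block_size)

real_tokens = sum(lengths)
padded_utilization = real_tokens / (len(lengths) * block_size)
packed_utilization = real_tokens / (len(packed) * block_size)

print("packed_blocks=", packed)
print(f"padded_utilization={padded_utilization:.3f}")
print(f"packed_utilization={packed_utilization:.3f}")
```

Examples stay intact inside each block, so the efficiency gain only holds if the attention path keeps those examples from reading each other.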

Hyperparameters are experiments, not promises

SFT begins from a pretrained checkpoint, so a useful configuration starts conservatively and earns expansion through held-out results. A library default isn't evidence that one value fits your dataset or parameterization.

| Knob | Baseline to test | What decides whether it survives |
| --- | --- | --- |
| Learning rate | TRL SFTConfig currently defaults to 2e-5; its PEFT guidance suggests about 1e-4 for adapters [1] | a short sweep on the actual update surface and held-out task metrics |
| Passes over data | start with a small budget and evaluate during the run | widening train/eval gap or declining behavior metrics |
| Warmup and schedule | set explicitly in the run config | early instability and sweep results |
| Weight decay and clipping | log them with the run, even when disabled | stability and evaluation evidence, not inherited habit |

Don't turn 2e-5 or 1e-4 into a rule. Record parameterization, trainable-parameter count, sequence length, effective token budget, data snapshot, and evaluation result beside every candidate configuration.

Effective batch size is also a token budget

Training discussions get sloppy fast here. The batch that sits on one device isn't the same thing as the batch the optimizer sees before one update. For equal-length examples, the usual rule is:

    examples_per_update = per_device_batch * gradient_accumulation_steps * world_size

Packed response examples aren't equal length. This formula reports examples per optimizer update, but the number of supervised answer tokens can still change from update to update. Track both.

sft_effective_batch.py
    per_device_batch = 4
    grad_accum = 8
    world_size = 4
    examples_per_update = per_device_batch * grad_accum * world_size

    # One entry per accumulation step on a single rank.
    supervised_tokens_per_microbatch_on_one_rank = [180, 212, 164, 220, 198, 204, 175, 207]
    # Approximation: assumes every rank carries a similar supervised-token count.
    supervised_tokens_per_update = sum(supervised_tokens_per_microbatch_on_one_rank) * world_size

    print(f"examples_per_update={examples_per_update}")
    print(f"supervised_tokens_per_update={supervised_tokens_per_update}")

Effective batch output

    examples_per_update=128
    supervised_tokens_per_update=6240

Always calculate changed configurations instead of reasoning from one changed knob:

compare_update_sizes.py
    def examples_per_update(per_device_batch, accumulation, world_size):
        return per_device_batch * accumulation * world_size

    run_a = examples_per_update(4, 8, 4)
    run_b = examples_per_update(2, 16, 8)

    print("run_a_examples_per_update=", run_a)
    print("run_b_examples_per_update=", run_b)
    print("ratio_b_over_a=", run_b / run_a)
    assert run_b == 2 * run_a

Run comparison output

    run_a_examples_per_update= 128
    run_b_examples_per_update= 256
    ratio_b_over_a= 2.0

Two habits follow from this:

  • when someone reports batch size, ask for accumulation, world size, packing, and supervised tokens too
  • compare learning-rate sweeps with similar token budgets or explain why they differ
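Supervised-token counts also affect how the loss itself is averaged. When accumulated microbatches carry different numbers of supervised tokens, averaging each microbatch's mean loss gives tokens in small microbatches more weight than tokens in large ones. The numbers below are made up to show the gap; whether a given trainer normalizes by total supervised tokens across accumulation steps depends on the library and version, so treat that as something to verify rather than assume.

```python
# Made-up per-microbatch loss sums and supervised-token counts.
microbatches = [
    {"loss_sum": 4.0, "supervised_tokens": 2},
    {"loss_sum": 4.0, "supervised_tokens": 8},
]

# Average of per-microbatch means: each microbatch counts equally.
mean_of_means = sum(
    m["loss_sum"] / m["supervised_tokens"] for m in microbatches
) / len(microbatches)

# Token-weighted mean: each supervised token counts equally.
token_weighted = sum(m["loss_sum"] for m in microbatches) / sum(
    m["supervised_tokens"] for m in microbatches
)

print(f"mean_of_means={mean_of_means:.3f}")
print(f"token_weighted={token_weighted:.3f}")
```

The two values agree only when every microbatch holds the same number of supervised tokens, which packed response-only data rarely guarantees.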

Warmup, clipping, and weight decay

These knobs aren't decoration.

Warmup

Warmup limits the size of early updates while the run is most sensitive to its initial configuration. For LLM SFT, jumping immediately to the peak learning rate can make the first steps unnecessarily noisy.
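A linear warmup can be sketched in a few lines. The peak rate and step count below are placeholders, `warmup_lr` is a hypothetical helper, and the decay schedule that usually follows warmup is left out on purpose.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    # Ramp linearly from peak_lr / warmup_steps up to peak_lr, then hold.
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


peak_lr = 2e-5      # placeholder value, not a recommendation
warmup_steps = 4    # placeholder value

for step in range(6):
    print(f"step={step} lr={warmup_lr(step, peak_lr, warmup_steps):.2e}")
```

On resume, the schedule must continue from the saved global step. Restarting it at step 0 would repeat the warmup ramp in the middle of training.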

Gradient clipping

Clip gradients when spikes would destabilize the run. Clipping doesn't fix bad data, but it can stop one pathological batch from damaging the checkpoint.
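Global-norm clipping rescales the whole gradient vector when its norm exceeds a threshold, which preserves the update direction. The pure-Python sketch below illustrates the arithmetic on a flat list of gradient values; in PyTorch the usual utility is `torch.nn.utils.clip_grad_norm_`. The threshold of 1.0 is a placeholder.

```python
import math


def clip_by_global_norm(grads, max_norm):
    # Compute the L2 norm over all gradient values together.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    # Scale every value by the same factor so the direction is unchanged.
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(f"norm_before={norm_before:.3f}")
print(f"norm_after={norm_after:.3f}")
```

Logging the pre-clip norm is worth the extra line: a run that clips on nearly every step is signaling a data or learning-rate problem that clipping only hides.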

Weight decay

Weight decay regularizes the update path. It should be treated as part of the recipe, not an afterthought copy-pasted from another config.

Don't memorize one numeric recipe. Know what each control is trying to prevent.

What a resumable run must save

A real SFT resume bundle isn't only model_state_dict.

At minimum, retain:

  • model weights
  • optimizer state
  • scheduler state
  • global step / epoch
  • RNG state and sampler/data position when repeatable continuation matters
  • tokenizer + chat-template version
  • training config, data manifest, and evaluation manifest
  • best metric so far

Why this matters:

  • optimizer state controls the next update
  • scheduler state controls learning rate at resume
  • sampler state prevents quietly repeating or skipping shuffled training rows
  • tokenizer/template version fixes how text maps to token IDs and where turn boundaries fall, so resumed batches match the ones seen before the interruption

Exact bit-for-bit replay may still depend on hardware, kernels, and determinism settings. The resume bundle should at least prevent avoidable changes to data order, learning-rate state, and model selection. torchtune and DeepSpeed both document resume flows explicitly instead of treating checkpointing as a trivial file write. [8] (Checkpointing in torchtune, https://meta-pytorch.org/torchtune/stable/deep_dives/checkpointer.html) [9] (Universal Checkpointing with DeepSpeed: A Practical Guide, https://www.deepspeed.ai/tutorials/universal-checkpointing/)

Figure: Resumable SFT run bundle showing model weights, optimizer and scheduler state, sampler position, formatting versions, data and evaluation manifests, and best metric.
A resume bundle is more than weights. Repeatable continuation needs model and optimizer state, data position, formatting versions, evaluation definition, and the record of which checkpoint currently wins.

A minimal manifest should make missing resume state obvious:

resume_bundle_manifest_check.py
    required = {
        "model_state",
        "optimizer_state",
        "scheduler_state",
        "global_step",
        "rng_state",
        "sampler_state",
        "tokenizer_version",
        "chat_template_version",
        "data_manifest",
        "eval_manifest",
        "best_metric",
    }

    resume_bundle = {
        "model_state": "weights/step_600.safetensors",
        "optimizer_state": "optimizer/step_600.pt",
        "scheduler_state": "scheduler/step_600.pt",
        "global_step": 600,
        "rng_state": "rng/step_600.pt",
        "sampler_state": {"epoch": 1, "batches_consumed": 120},
        "tokenizer_version": "policy-sft-tokenizer-v3",
        "chat_template_version": "llama3-support-v2",
        "data_manifest": "data/sft_manifest_2026-05-20.json",
        "eval_manifest": "eval/access_policy_behavior_v4.json",
        "best_metric": {"name": "support_resolution_accuracy", "value": 0.78},
    }

    missing = sorted(required - resume_bundle.keys())
    print("resume_ready=", not missing)
    print("best_metric=", resume_bundle["best_metric"])

Checkpoint manifest output

    resume_ready= True
    best_metric= {'name': 'support_resolution_accuracy', 'value': 0.78}
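The sampler entry deserves a concrete check, because a resume that reshuffles or restarts the epoch silently repeats some rows and skips others. The sketch below assumes a simple design in which each epoch's order is derived from a seed and the epoch number, and the saved state records how many batches were consumed. The helper names and the seed value are hypothetical; real data loaders expose this state in their own ways.

```python
import random


def epoch_batches(num_rows, batch_size, seed, epoch):
    # Derive the shuffle from (seed, epoch) so an epoch's order is reproducible.
    rng = random.Random(f"{seed}-{epoch}")
    order = list(range(num_rows))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, num_rows, batch_size)]


def resume_batches(num_rows, batch_size, seed, sampler_state):
    # Rebuild the same epoch order, then skip what was already consumed.
    batches = epoch_batches(num_rows, batch_size, seed, sampler_state["epoch"])
    return batches[sampler_state["batches_consumed"]:]


full_epoch = epoch_batches(num_rows=10, batch_size=2, seed=17, epoch=1)
state = {"epoch": 1, "batches_consumed": 3}
remaining = resume_batches(10, 2, 17, state)

seen_before = [row for batch in full_epoch[:3] for row in batch]
seen_after = [row for batch in remaining for row in batch]

print("rows_repeated=", sorted(set(seen_before) & set(seen_after)))
print("rows_skipped=", sorted(set(range(10)) - set(seen_before) - set(seen_after)))
```

Both lists come out empty here because the resumed run replays the same order and starts exactly where the interrupted one stopped.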

Choosing the best checkpoint

The best checkpoint is the one that optimizes the product metric you care about on held-out data.

Bad rules:

  • lowest train loss
  • latest step
  • biggest file

Better evidence:

  • highest exact or structured task success where the behavior is verifiable
  • best human preference rate on a blinded held-out set when quality is subjective
  • judge-assisted scoring only after calibrating the judge against human or verifiable decisions
  • lowest validation loss only when that truly matches the task

If the task is structured access-policy replies, check policy compliance and schema validity before style preference. Loss is useful for training diagnostics, but it doesn't define deployment quality.

select_sft_checkpoint.py
    checkpoints = [
        {"step": 200, "val_loss": 1.91, "policy_pass_rate": 0.93, "format_pass_rate": 0.99},
        {"step": 400, "val_loss": 1.77, "policy_pass_rate": 0.91, "format_pass_rate": 1.00},
        {"step": 600, "val_loss": 1.79, "policy_pass_rate": 0.97, "format_pass_rate": 0.98},
    ]

    eligible = [row for row in checkpoints if row["format_pass_rate"] >= 0.98]
    best = max(eligible, key=lambda row: (row["policy_pass_rate"], -row["val_loss"]))

    print("best_step=", best["step"])
    print("best_policy_pass_rate=", best["policy_pass_rate"])
    print("lowest_loss_step=", min(checkpoints, key=lambda row: row["val_loss"])["step"])

Checkpoint selection output

    best_step= 600
    best_policy_pass_rate= 0.97
    lowest_loss_step= 400

Single GPU first, then scale out

A good SFT rollout usually starts with the smallest honest setup:

  1. one GPU
  2. tiny eval slice
  3. frequent checkpoints
  4. verified template/masking
  5. one clean metric

Only then do you widen the job.

Signs that one GPU is still fine

  • the model and activations fit
  • throughput is acceptable
  • you're still debugging correctness
  • labels are the main uncertainty, not compute budget

Signs that you need FSDP or ZeRO

  • full-model weights and optimizer state don't fit
  • activation memory causes out-of-memory failures at useful context lengths
  • activation checkpointing and gradient accumulation still leave the job too slow
  • you need larger world size to hit the schedule

That bridge matters. FSDP and ZeRO aren't "advanced decorations" on top of a stable recipe. They let the same recipe continue when memory stops fitting on one device by sharding model states across workers. [10] (FullyShardedDataParallel, https://docs.pytorch.org/docs/stable/fsdp.html) [11] (ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, https://arxiv.org/abs/1910.02054)
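A back-of-the-envelope estimate shows what sharding buys. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: about 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer state. It covers full fine-tuning model states only and ignores activations, buffers, and fragmentation. The 7-billion-parameter size and 8 devices are illustrative assumptions.

```python
def model_state_bytes_per_device(num_params, world_size, stage):
    # Bytes per parameter under mixed-precision Adam (ZeRO paper accounting).
    weights = 2 * num_params
    grads = 2 * num_params
    optimizer = 12 * num_params
    # Each ZeRO stage shards one more category across devices.
    if stage >= 1:
        optimizer /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        weights /= world_size
    return weights + grads + optimizer


num_params = 7_000_000_000  # illustrative model size
world_size = 8              # illustrative device count

for stage in (0, 1, 2, 3):
    gigabytes = model_state_bytes_per_device(num_params, world_size, stage) / 1e9
    print(f"stage={stage} model_state_gb_per_device={gigabytes:.2f}")
```

Fully sharded FSDP is roughly comparable to stage 3 in this accounting. Adapter runs look very different, since frozen base weights carry no gradients or optimizer state, which is one reason to establish whether a single GPU suffices before scaling out.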

Minimal operational skeleton

The exact library can vary, but the shape is stable:

sft_pipeline_shape.txt
    dataset -> template -> tokenizer -> collator/mask -> trainer
      -> periodic eval -> checkpoint save -> best-checkpoint export

TRL, torchtune, and the Alignment Handbook all expose variations of this same loop. [1] [12] (alignment-handbook: Robust recipes to align language models with human and AI preferences, https://github.com/huggingface/alignment-handbook) [13] (Welcome to the torchtune Documentation, https://meta-pytorch.org/torchtune/stable/index.html)

For a TRL prompt-completion dataset, a configuration skeleton makes the decisions visible. Configure model_with_lora_adapters with a supported FlashAttention 2 or 3 implementation before enabling bfd packing:

trl_sft_config_fragment.py
    from trl import SFTConfig, SFTTrainer

    args = SFTConfig(
        output_dir="runs/access-policy-sft-lora",
        learning_rate=1e-4,  # adapter baseline to evaluate, not a universal rule
        max_length=1024,
        packing=True,
        packing_strategy="bfd",  # keep examples intact; use a compatible attention path
        completion_only_loss=True,  # train on the completion field, not the prompt
        eval_strategy="steps",
        eval_steps=100,
        save_steps=100,
    )

    trainer = SFTTrainer(
        model=model_with_lora_adapters,
        args=args,
        train_dataset=train_rows,  # {"prompt": ..., "completion": ...}
        eval_dataset=held_out_rows,
        processing_class=tokenizer,
    )

This fragment deliberately leaves model loading, adapter construction, and the data manifest outside the snippet. They must be versioned inputs to a real run, not invisible defaults.

Common pitfalls

Confusing the objective with the update surface

  • Symptom: A team says "we tried LoRA instead of SFT," or switches to full updates without changing data or evaluation.
  • Cause: LoRA and full fine-tuning describe which parameters move; SFT describes the supervised loss objective.
  • Fix: Write down the failed behavior and desired objective first. Then compare adapter and full-weight SFT only if update capacity or memory is the actual uncertainty.

Evaluating examples that share a case with training

  • Symptom: Held-out score is excellent, but the model fails on new tenants, policies, or access incidents.
  • Cause: Rows were randomly split while near-duplicate turns from the same incident or document appeared on both sides.
  • Fix: Group the split by incident thread, policy source, tenant, or later time window, then re-run selection.
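A minimal sketch of that grouped split, assuming each row carries a `thread_id` field that names its incident thread. The helper names, the hash-bucket approach, and the choice to raise on a missing key are illustrative assumptions, not a prescribed method.

```python
import hashlib


def assign_split(group_id: str, eval_fraction: float = 0.2) -> str:
    """Deterministically map a whole group to 'train' or 'eval'.

    Hashing the group key (not the row) keeps every row of one incident
    thread on the same side of the split and stays stable across runs.
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


def grouped_split(rows, group_key="thread_id", eval_fraction=0.2):
    """Split rows so that no group appears on both sides."""
    train, held_out = [], []
    for row in rows:
        group_id = row.get(group_key)
        if not group_id:
            # A row without a group key cannot be placed safely; fail loudly
            # (an assumption: a real pipeline might quarantine it instead).
            raise ValueError(f"row is missing {group_key!r}: {row}")
        side = assign_split(str(group_id), eval_fraction)
        (held_out if side == "eval" else train).append(row)
    return train, held_out


# Hypothetical rows; two of them share the same incident thread.
rows = [
    {"thread_id": "A102", "prompt": "p1", "completion": "c1"},
    {"thread_id": "A102", "prompt": "p2", "completion": "c2"},
    {"thread_id": "C310", "prompt": "p3", "completion": "c3"},
    {"thread_id": "D777", "prompt": "p4", "completion": "c4"},
]
train_rows, held_out_rows = grouped_split(rows)
train_groups = {r["thread_id"] for r in train_rows}
eval_groups = {r["thread_id"] for r in held_out_rows}
print("shared groups:", train_groups & eval_groups)
```

With only a handful of groups the realized eval fraction can be far from the target, so it is worth checking the split sizes before trusting the held-out score.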

Reporting micro-batch as if it were the real batch

  • Symptom: Two runs both claim "batch size 4," but convergence looks different and nobody can reconcile the results.
  • Cause: The reported number ignored accumulation or world size, so the optimizer was seeing different effective batches.
  • Fix: Publish per_device_batch, gradient_accumulation_steps, world_size, packed/padded mode, examples per update, and supervised tokens per update.
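The arithmetic behind that report can be sketched as follows. The `UpdateBudget` class and the two example runs are hypothetical, and the token total assumes every rank sees a similar number of supervised tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateBudget:
    per_device_batch: int
    gradient_accumulation_steps: int
    world_size: int
    # Supervised (unmasked) tokens summed over ONE rank's accumulated micro-batches.
    supervised_tokens_per_rank: int

    @property
    def examples_per_update(self) -> int:
        # Rows that contribute to a single optimizer step (unpacked rows).
        return self.per_device_batch * self.gradient_accumulation_steps * self.world_size

    @property
    def supervised_tokens_per_update(self) -> int:
        # Assumption: ranks are roughly balanced. A real run should sum the
        # measured count across ranks instead of multiplying.
        return self.supervised_tokens_per_rank * self.world_size


# Two hypothetical runs with different micro-batch settings.
run_a = UpdateBudget(4, 8, 2, supervised_tokens_per_rank=6000)
run_b = UpdateBudget(8, 4, 2, supervised_tokens_per_rank=4500)

for name, run in (("A", run_a), ("B", run_b)):
    print(name, run.examples_per_update, run.supervised_tokens_per_update)
```

Both hypothetical runs see the same number of examples per update while differing in supervised tokens per update, which is why both numbers belong in the report. With packing enabled, `per_device_batch` counts packed sequences, so examples per update has to be counted from the data rather than derived from this product.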

Saving weights without resume state

  • Symptom: Resume starts, but the learning rate jumps, checkpoint selection resets, or the run diverges from prior curves.
  • Cause: The checkpoint kept weights but dropped optimizer state, scheduler position, or best-metric record.
  • Fix: Treat resume state as part of the checkpoint contract: weights, optimizer, scheduler, global step, RNG/sampler state, template/tokenizer version, data and eval manifests, and best metric.
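One way to enforce that contract is a check that runs before a checkpoint is accepted as resumable. This is a sketch only: the field names and the dictionary layout are illustrative assumptions, not the format of any particular library.

```python
# Hypothetical field names mirroring the contract described above.
REQUIRED_RESUME_FIELDS = (
    "model_weights",
    "optimizer_state",
    "scheduler_state",
    "global_step",
    "rng_state",
    "sampler_state",
    "template_version",
    "tokenizer_version",
    "data_manifest",
    "eval_manifest",
    "best_metric",
)


def missing_resume_fields(checkpoint: dict) -> list[str]:
    """Return the contract fields that are absent or None in a checkpoint."""
    return [key for key in REQUIRED_RESUME_FIELDS if checkpoint.get(key) is None]


# A weights-only save fails the contract even though it loads fine.
weights_only = {"model_weights": {"layer.weight": [0.1, 0.2]}, "global_step": 400}
missing = missing_resume_fields(weights_only)
print("not resumable, missing:", missing)
```

Failing the save, or at least flagging the checkpoint as export-only, is a cheaper place to catch the gap than a resumed run that quietly diverges.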

Using a packing path that crosses example boundaries

  • Symptom: Loss falls smoothly, but the model learns transitions that never occur in real conversations.
  • Cause: A packing strategy or attention implementation let tokens from one independent example condition another.
  • Fix: For TRL SFT, prefer bfd with a supported FlashAttention 2 or 3 implementation and inspect packed batches. Use bfd_split deliberately when preserving overflow matters; don't silently substitute wrapped or an unsupported attention path.
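The boundary rule itself can be illustrated with a small conceptual check. This is not how TRL or FlashAttention implement packing; it is a hypothetical dense-mask sketch of what "inspect packed batches" is looking for, with `example_ids` marking which original example each packed token came from.

```python
def boundary_safe_mask(example_ids):
    """Causal mask where token i may attend to token j only if j <= i
    and both tokens belong to the same packed example."""
    n = len(example_ids)
    return [
        [j <= i and example_ids[i] == example_ids[j] for j in range(n)]
        for i in range(n)
    ]


def cross_example_leaks(mask, example_ids):
    """Return (query, key) positions where attention crosses examples."""
    return [
        (i, j)
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
        if allowed and example_ids[i] != example_ids[j]
    ]


# One packed block holding two independent examples (3 tokens + 2 tokens).
example_ids = [0, 0, 0, 1, 1]

safe = boundary_safe_mask(example_ids)
# A plain causal mask ignores example boundaries inside the block.
plain_causal = [[j <= i for j in range(5)] for i in range(5)]

safe_leaks = cross_example_leaks(safe, example_ids)
causal_leaks = cross_example_leaks(plain_causal, example_ids)
print("safe mask leaks:", safe_leaks)
print("plain causal leaks:", causal_leaks)
```

Any non-empty leak list on a real packed batch means tokens from one example can condition another, which is the failure described in this pitfall.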

Selecting the best checkpoint by train loss

  • Symptom: Exported checkpoint has the prettiest curve but underperforms on held-out support tasks.
  • Cause: Train loss rewards fitting seen data, not solving the downstream task the product cares about.
  • Fix: Export the checkpoint that wins on held-out verifiable metrics or calibrated human-preference evaluation that matches deployment.
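A selection rule of this kind can be written down explicitly so it is versioned with the run. The sketch below assumes a hypothetical gate on `format_pass_rate`, a primary held-out metric `policy_pass_rate`, and `val_loss` as a tie-breaker only; the thresholds and results are made-up illustration values.

```python
def select_checkpoint(results, min_format_pass_rate=0.98):
    """Pick the export checkpoint from held-out evaluation results.

    1. Drop checkpoints that fail the format gate.
    2. Prefer the highest policy_pass_rate.
    3. Break ties with the lower val_loss.
    Train loss is intentionally not part of the rule.
    """
    eligible = [r for r in results if r["format_pass_rate"] >= min_format_pass_rate]
    if not eligible:
        raise ValueError("no checkpoint passes the format gate")
    return max(eligible, key=lambda r: (r["policy_pass_rate"], -r["val_loss"]))


# Hypothetical held-out results for three saved steps.
results = [
    {"step": 100, "format_pass_rate": 0.97, "policy_pass_rate": 0.99, "val_loss": 1.60},
    {"step": 300, "format_pass_rate": 0.99, "policy_pass_rate": 0.94, "val_loss": 1.70},
    {"step": 500, "format_pass_rate": 0.99, "policy_pass_rate": 0.94, "val_loss": 1.65},
]
best = select_checkpoint(results)
print("export step:", best["step"])
```

Step 100 has the best policy score and the lowest loss but fails the gate, and steps 300 and 500 tie on the primary metric, so the tie-breaker decides.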

Evaluation rubric

Check that you can explain and operate these parts of an SFT run:

  • How dataset, template, tokenizer, collator, loss mask, optimizer, checkpointing, and evaluation fit into one training loop.
  • Why a response-only objective masks prompt tokens and includes the answer termination behavior.
  • Why a grouped held-out split is stronger than a row-random split for related support conversations.
  • How TRL's bfd, bfd_split, and wrapped strategies differ, and why the padding-free attention path still has to honor example boundaries.
  • How to treat documented learning-rate defaults as sweep starting points rather than guarantees.
  • How to compute examples and supervised answer tokens per optimizer update.
  • How to choose an objective first, then compare full-weight, LoRA, or QLoRA SFT parameterizations.
  • What a resumable checkpoint has to save beyond model weights.
  • Why the best checkpoint is selected by held-out verifiable or calibrated behavior metrics rather than training loss.
  • When to stay on one GPU for correctness and when FSDP or ZeRO becomes necessary.
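The response-only masking item in this rubric can be sketched in a few lines. The token IDs and the `completion_mask` input are hypothetical; the value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, and the one-position shift for next-token prediction is assumed to happen inside the model's loss computation.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def response_only_labels(input_ids, completion_mask):
    """Copy token ids into labels for completion tokens; mask everything else."""
    if len(input_ids) != len(completion_mask):
        raise ValueError("input_ids and completion_mask must have the same length")
    return [
        token if is_completion else IGNORE_INDEX
        for token, is_completion in zip(input_ids, completion_mask)
    ]


def supervised_token_count(labels):
    """Number of tokens that actually contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels)


# Hypothetical templated row: 5 prompt tokens, then 2 answer tokens and
# an end-of-turn token (id 2 here) that must stay supervised.
input_ids = [101, 7, 8, 9, 102, 55, 56, 2]
completion_mask = [0, 0, 0, 0, 0, 1, 1, 1]

labels = response_only_labels(input_ids, completion_mask)
print(labels, supervised_token_count(labels))
```

Keeping the end-of-turn token inside the supervised span is what teaches the model to stop, and the same count feeds the supervised-tokens-per-update figure.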

What to remember

  • SFT is a training system, not a trainer call alone.
  • Choose the learning objective before choosing full updates, LoRA, or QLoRA as its parameterization.
  • Hold out whole groups such as incident threads or policy sources, not randomly sampled rows from related conversations.
  • For response-only SFT, supervise answer and termination tokens; in TRL that means completion_only_loss or assistant_only_loss.
  • Packing saves compute only when strategy and attention behavior preserve the boundaries between independent examples.
  • Library defaults begin a sweep; they don't certify your learning rate or training duration.
  • Track supervised tokens per update in addition to examples, accumulation, and world size.
  • Resumable checkpoints need optimizer, scheduler, sampler, formatting, data, evaluation, and best-metric state, not weights alone.
  • The best checkpoint is chosen by held-out verifiable or calibrated behavior evidence, not by train loss.
  • Start on one GPU for correctness, then widen to FSDP / ZeRO when memory or schedule demands it.

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. A platform team has prompt-completion rows that show the required access escalation step, but the model often skips it. A teammate says, "Let's try LoRA instead of SFT." Which reply keeps the pipeline decisions straight?
2. Your team has prompt-response demonstrations for access triage. LoRA SFT is the desired update style, but the frozen base model does not fit comfortably in BF16 or FP16 memory. What should you test first under the same supervised objective?
3. A support dataset contains two valid turns from thread A102, a B550 row with an empty completion, an otherwise valid row with no thread_id, and valid rows from other threads. Which preprocessing rule preserves both the row contract and an honest evaluation split?
4. After templating a prompt-completion row, the tokens are prefix <system>, prefix policy, prefix <user>, prefix stale_key, prefix <assistant>, target request_evidence, and target <end_of_turn>. With response-only SFT, what should the masking behavior be?
5. After a library upgrade, a TRL job uses bfd packing on a padding-free path. Inspection shows an R550 response token can attend to a T102 response token in the same block. What correction is required before training continues?
6. Run A uses per_device_batch=3, gradient_accumulation_steps=12, world_size=2, and 5,200 supervised tokens across one rank's accumulated microbatches. Run B uses 6, 6, 2, and 4,300 respectively. Which budget statement is correct?
7. A run resumes from saved model weights, but its learning rate restarts and shuffled rows are repeated. Which checkpoint contract would prevent these avoidable changes and preserve model selection?
8. An adapter SFT pilot starts from a pretrained checkpoint. A teammate wants to lock the PEFT guidance learning rate near 1e-4, skip warmup and clipping logs, and compare runs that have very different supervised-token budgets. What is the sound plan?
9. An export rule requires format_pass_rate >= 0.98, then maximizes policy_pass_rate and uses lower val_loss only as a tie-breaker. Results are step 200: 0.99, 0.93, 1.91; step 400: 1.00, 0.91, 1.77; step 600: 0.98, 0.97, 1.79. Which step is exported?
10. One SFT job still has unverified template and loss-mask behavior, and the model fits on one GPU with acceptable throughput. Another job has a verified data path, but full-model weights and optimizer state do not fit at the useful context length. Which rollout choice matches the training-system rules?


Next Step
Continue to Distributed Training: FSDP & ZeRO

You defined the SFT recipe and the checkpointing, batch, and evaluation rules that make a fine-tuning run legitimate. Next you'll keep that exact recipe alive once model states, activations, and communication no longer fit comfortably on one GPU.

References

[1] TRL Documentation: SFT Trainer. Hugging Face, 2026. https://huggingface.co/docs/trl/sft_trainer
[2] Fine-Tune Your First LLM. PyTorch Contributors, 2026.
[3] Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Gururangan, S., Marasovic, A., Swayamdipta, S., et al., 2020. ACL 2020.
[4] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafailov, R., et al., 2023.
[5] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al., 2021. ICLR.
[6] QLoRA: Efficient Finetuning of Quantized Language Models. Dettmers, T., et al., 2023. NeurIPS.
[7] Transformers Documentation: Writing a chat template. Hugging Face, 2026.
[8] Checkpointing in torchtune. PyTorch Contributors, 2026.
[9] Universal Checkpointing with DeepSpeed: A Practical Guide. DeepSpeed Team, 2026.
[10] FullyShardedDataParallel. PyTorch Contributors, 2026. https://docs.pytorch.org/docs/stable/fsdp.html
[11] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Rajbhandari, S., et al., 2020. SC 2020. https://arxiv.org/abs/1910.02054
[12] alignment-handbook: Robust recipes to align language models with human and AI preferences. Hugging Face, 2025. https://github.com/huggingface/alignment-handbook
[13] Welcome to the torchtune Documentation. PyTorch Contributors, 2026. https://meta-pytorch.org/torchtune/stable/index.html