LeetLLM

Knowledge Distillation for LLMs

Understand the main forms of knowledge distillation for LLMs, from logit matching and response-based supervision to on-policy KD. Learn when distillation helps, where student capacity becomes the bottleneck, and how to implement a correct teacher-student training loop.


RLVR trained a policy against checked outcomes where a verifier exists. Knowledge distillation asks a different deployment question: after you have a useful large language model (LLM) teacher, how do you transfer selected behavior into a smaller model that's cheaper to serve?

Knowledge distillation trains a smaller student model to imitate useful behavior from a larger or otherwise more capable teacher. Imagine a return-policy chatbot: a large model performs well on an evaluated question set, but running it for every user costs too much. A smaller student is viable only if it retains enough answer quality under a latency and serving-cost budget.

The transfer signal depends on teacher access. In white-box settings, the student can match softened token probabilities or internal features. In black-box settings, it can fine-tune on selected teacher-written answers, solution traces, or synthetic corpora. Each channel carries different information and different failure modes. The goal isn't to assume the student inherits the teacher's abilities; it's to transfer useful behavior and verify the resulting quality-cost tradeoff.

Knowledge distillation flow: a teacher supplies transfer signal to a smaller student, followed by independent evaluation gates.
A teacher supplies additional supervision to a smaller student. Trusted labels or checks still matter because teacher outputs can be wrong.

The flow below shows the white-box KD case. Response distillation uses selected teacher text instead of the teacher-probability branch.

Diagram showing Input tokens, Large teacher 70B-class model, Smaller student 7B-class model, and Teacher probabilities.

Why soft labels teach more than hard labels

In machine learning, the core idea (introduced by Hinton et al. in 2015) is to train a student model to mimic a teacher model's behavior, rather than only the ground truth labels. The student learns from the teacher's full probability distribution, which contains richer information than simple one-hot labels:[1]

$$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{distill}} + (1 - \alpha) \cdot \mathcal{L}_{\text{task}}$$

Where $\mathcal{L}_{\text{distill}}$ matches teacher behavior, $\mathcal{L}_{\text{task}}$ matches ground-truth labels, and $\alpha \in [0, 1]$ controls the tradeoff. This combined loss lets the student learn from both the teacher's soft predictions and the true labels.
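The weighting can be checked on a single prediction. The sketch below is illustrative only: the three distributions are made-up values, `combined_loss` is a hypothetical helper, and it assumes a KL term for the distillation loss and cross-entropy against a one-hot label for the task loss.

```python
import math

def kl_divergence(target: list[float], predicted: list[float]) -> float:
    """KL(target || predicted) for one categorical distribution."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

def cross_entropy(target: list[float], predicted: list[float]) -> float:
    """Cross-entropy H(target, predicted) for one categorical distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def combined_loss(distill: float, task: float, alpha: float) -> float:
    """L = alpha * L_distill + (1 - alpha) * L_task."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * distill + (1 - alpha) * task

# Made-up distributions over four answer choices (assumptions for illustration).
teacher_probs = [0.70, 0.20, 0.07, 0.03]  # soft target from a teacher
hard_label = [1.0, 0.0, 0.0, 0.0]         # one-hot ground truth
student_probs = [0.50, 0.30, 0.10, 0.10]  # current student prediction

distill_term = kl_divergence(teacher_probs, student_probs)
task_term = cross_entropy(hard_label, student_probs)

for alpha in (0.0, 0.5, 1.0):
    loss = combined_loss(distill_term, task_term, alpha)
    print(f"alpha={alpha}: loss={loss:.4f}")
```

At `alpha=0.0` the student sees only the hard label, and at `alpha=1.0` only the teacher distribution. Values in between are a tuning choice to validate on held-out data.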

A concrete example

Suppose a customer asks, "How long do I have to return an electronic item?" The teacher's output distribution contains dark knowledge: relationships between classes that hard labels don't capture.

The hard label only says "30 days" is correct. In this simplified answer-choice example, the soft label also records that the teacher ranks "14 days" above "1 year." That extra signal is useful only if the teacher's ranking is itself useful; it is not proof that the student learned a general return-policy rule.

Soft-label distillation example comparing a hard target with a teacher ranking over 30 days, 14 days, 90 days, and 1 year. Soft-label distillation example comparing a hard target with a teacher ranking over 30 days, 14 days, 90 days, and 1 year.
Soft labels preserve the teacher's ranking over alternatives. A validation set still has to establish whether that ranking helps.

The probabilities above describe one simplified prediction decision. For a causal LLM, logit distillation applies this idea at each predicted token position. Raising the temperature exposes more of the teacher's ranking, but it can also flatten the distribution until very little preference signal remains.

temperature-softening.py

```python
import math

logits = {"30": 4.0, "14": 2.0, "90": 1.0, "1": 0.7}

def softened_probabilities(temperature: float) -> dict[str, float]:
    scaled = {token: math.exp(logit / temperature) for token, logit in logits.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}

for temperature in (1.0, 4.0, 40.0):
    probs = softened_probabilities(temperature)
    rounded = {token: round(probability, 3) for token, probability in probs.items()}
    top_gap = probs["30"] - probs["14"]
    print(f"T={temperature:g} probabilities:", rounded, "top_gap:", round(top_gap, 3))
```

Output

```text
T=1 probabilities: {'30': 0.818, '14': 0.111, '90': 0.041, '1': 0.03} top_gap: 0.708
T=4 probabilities: {'30': 0.397, '14': 0.241, '90': 0.188, '1': 0.174} top_gap: 0.156
T=40 probabilities: {'30': 0.263, '14': 0.25, '90': 0.244, '1': 0.242} top_gap: 0.013
```

When distillation beats training from scratch

Distillation is most useful when you already have a strong teacher, legal access to its signal, and a clear smaller deployment target. It does not win automatically over training from scratch. The Gemma 2 report gives a controlled example: its authors train the 2B and 9B models with token-probability distillation, and a 2B ablation trained for 500B tokens scores 67.7 when distilled from a 7B teacher versus 60.3 from scratch on their three-benchmark average.[2]

Different recipes expose different supervision channels: Orca trains on explanation traces, phi-1.5 uses curated synthetic textbook-like data, and Gemma 2 uses teacher token probabilities for small models.[3][4][2] They motivate careful data and signal selection; they do not establish one universally best recipe.

Matching the teacher's probabilities: logit distillation

This method requires access to the teacher model's logits (the raw, unnormalized scores output by the final layer of the network before the softmax function). We minimize the KL Divergence (Kullback-Leibler divergence), a mathematical measure of how one probability distribution differs from another, to align the student's probability distribution with the teacher's.

First, we apply temperature scaling (dividing logits by a temperature $T > 1$ before softmax to soften the probability distribution) to both models' logits to get softened probability distributions:

$$q_i = \frac{\exp(z_{t,i} / T)}{\sum_j \exp(z_{t,j} / T)}, \quad p_i = \frac{\exp(z_{s,i} / T)}{\sum_j \exp(z_{s,j} / T)}$$

Where $z_t$ are the teacher's logits, $z_s$ are the student's logits, and $T$ is the temperature. Then we compute the KL divergence loss:

$$\mathcal{L}_{\text{KL}} = T^2 \sum_i q_i \log \frac{q_i}{p_i}$$

Reading the formula

  • $q_i$ is the teacher's softened probability for token $i$ (the target we want the student to match)
  • $p_i$ is the student's softened probability for token $i$ (what the student currently predicts)
  • $T$ is the temperature (higher $T$ creates a softer, more uniform distribution)
  • The $T^2$ scaling factor compensates for gradient scaling: as temperature increases, gradients from soft targets scale down by approximately $1/T^2$. Multiplying by $T^2$ keeps the relative contribution from soft targets roughly stable as you tune temperature. Think of it like turning up the volume on a quiet signal so it can compete with the loud one.
  • KL divergence measures how much information is lost when using the student's distribution $p$ to approximate the teacher's distribution $q$
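The $T^2$ point can be checked numerically. This sketch relies on the standard softmax result that the gradient of $\sum_i q_i \log (q_i / p_i)$ with respect to the student logits is $(p - q)/T$. The logit values and the helper name `kl_grad_norm` are assumptions chosen for illustration.

```python
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_grad_norm(teacher_logits, student_logits, temperature, scale_by_t2):
    """L2 norm of d KL(q || p) / d student_logits, optionally times T^2."""
    q = softmax(teacher_logits, temperature)
    p = softmax(student_logits, temperature)
    grad = [(p_i - q_i) / temperature for p_i, q_i in zip(p, q)]
    if scale_by_t2:
        grad = [g * temperature ** 2 for g in grad]
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical logits over four tokens.
teacher_logits = [4.0, 2.0, 1.0, 0.7]
student_logits = [2.0, 2.5, 0.5, 1.0]

unscaled, scaled = {}, {}
for temperature in (1.0, 2.0, 4.0, 8.0):
    unscaled[temperature] = kl_grad_norm(teacher_logits, student_logits, temperature, False)
    scaled[temperature] = kl_grad_norm(teacher_logits, student_logits, temperature, True)
    print(f"T={temperature:g} unscaled={unscaled[temperature]:.4f} scaled={scaled[temperature]:.4f}")
```

Without the factor, the gradient norm collapses as $T$ grows. With it, the norm stays within the same order of magnitude across these temperatures, which is the "roughly stable" behavior described above.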

For causal LMs, two implementation details matter. First, next-token training requires a one-token shift: logits at position $t$ train against the label at position $t+1$. Second, direct logit KD assumes teacher and student use the same token-to-id output mapping. Equal vocabulary sizes alone are insufficient: token id 42 must denote the same token for both models. If output spaces differ, plain token-level KL no longer lines up cleanly and you usually fall back to response distillation or design an explicit mapping. The snippet below handles the shift, masks ignored positions, and fails fast on a reordered vocabulary.

Common mistake: Running logit distillation without verifying tokenizer alignment. Two models can have the same vocabulary size and different token-id mappings. Compare the complete output mapping, not only vocab_size, before training.

reading-the-formula.py

```python
import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    student_vocabulary: tuple[str, ...],
    teacher_vocabulary: tuple[str, ...],
    temperature: float = 3.0,
    alpha: float = 0.5,
    ignore_index: int = -100,
) -> torch.Tensor:
    """
    Computes the weighted sum of Knowledge Distillation (KD) loss and Cross-Entropy loss.
    """
    if student_vocabulary != teacher_vocabulary:
        raise ValueError(
            "Logit KD requires identical token-to-id mappings. "
            "Use response KD or design an explicit mapping when output spaces differ."
        )
    if student_logits.size(-1) != teacher_logits.size(-1) or student_logits.size(-1) != len(student_vocabulary):
        raise ValueError("Logit tensors and vocabulary dimensions must agree.")

    # Causal LMs predict token t+1 from positions up to t.
    shift_student = student_logits[..., :-1, :].contiguous()
    shift_teacher = teacher_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    vocab_size = shift_student.size(-1)
    flat_student = shift_student.reshape(-1, vocab_size)
    flat_teacher = shift_teacher.reshape(-1, vocab_size)
    flat_labels = shift_labels.reshape(-1)

    valid_mask = flat_labels != ignore_index
    if not valid_mask.any():
        return student_logits.sum() * 0

    student_valid = flat_student[valid_mask]
    teacher_valid = flat_teacher[valid_mask]
    labels_valid = flat_labels[valid_mask]

    # Soft loss: KL divergence between softened distributions
    soft_teacher = F.softmax(teacher_valid.detach() / temperature, dim=-1)
    soft_student = F.log_softmax(student_valid / temperature, dim=-1)

    # KLDivLoss expects log-probabilities for the input (student)
    # and standard probabilities for the target (teacher)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    soft_loss *= temperature ** 2  # Scale by T² for gradient magnitude

    # Hard loss: standard cross-entropy with ground truth
    hard_loss = F.cross_entropy(student_valid, labels_valid)

    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(7)
batch, seq_len, vocab = 2, 5, 8
student_logits = torch.nn.Parameter(torch.randn(batch, seq_len, vocab))
teacher_logits = torch.randn(batch, seq_len, vocab)
labels = torch.tensor([
    [0, 1, 2, 3, 4],
    [0, 2, -100, 5, 6],
])
vocabulary = tuple(f"token_{index}" for index in range(vocab))

loss = distillation_loss(
    student_logits,
    teacher_logits,
    labels,
    vocabulary,
    vocabulary,
    temperature=3.0,
    alpha=0.6,
)
loss.backward()

loss_is_scalar = loss.ndim == 0
loss_is_finite = bool(torch.isfinite(loss))
grad_exists = student_logits.grad is not None
grad_is_finite = bool(torch.isfinite(student_logits.grad).all()) if grad_exists else False
mismatch_failed = False

try:
    distillation_loss(student_logits, teacher_logits, labels, vocabulary, tuple(reversed(vocabulary)))
except ValueError as exc:
    mismatch_failed = "token-to-id mappings" in str(exc)

print("loss:", round(float(loss), 4))
print("grad norm:", round(float(student_logits.grad.norm()), 4))
print("loss_is_scalar:", loss_is_scalar)
print("loss_is_finite:", loss_is_finite)
print("grad_is_finite:", grad_is_finite)
print("mismatch failed:", mismatch_failed)
```

Output

```text
loss: 1.4959
grad norm: 0.1891
loss_is_scalar: True
loss_is_finite: True
grad_is_finite: True
mismatch failed: True
```
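The tokenizer-alignment warning above can also be turned into a report that names the offending tokens. This is a minimal sketch: `mapping_mismatches` is a hypothetical helper and the two toy vocabularies are invented. With Hugging Face tokenizers, a token-to-id dict of this shape is typically available from `tokenizer.get_vocab()`, but that wiring is left out here.

```python
def mapping_mismatches(student_vocab: dict[str, int], teacher_vocab: dict[str, int]) -> list[str]:
    """Return tokens whose id differs between models or that exist on only one side."""
    tokens = sorted(set(student_vocab) | set(teacher_vocab))
    return [token for token in tokens if student_vocab.get(token) != teacher_vocab.get(token)]

# Toy vocabularies (assumptions): same size, but two ids are swapped.
student_vocab = {"<pad>": 0, "return": 1, "policy": 2, "days": 3}
teacher_vocab = {"<pad>": 0, "policy": 1, "return": 2, "days": 3}

same_size = len(student_vocab) == len(teacher_vocab)
mismatches = mapping_mismatches(student_vocab, teacher_vocab)

print("same size:", same_size)
print("mismatched tokens:", mismatches)
print("safe for logit KD:", not mismatches)
```

Here the size check passes while the mapping check fails, which is exactly the case a `vocab_size` comparison would miss.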

When you only have text: response distillation

When you don't have access to the teacher's weights or logits, but can obtain generated text, response distillation is the available KD channel. In this approach, the teacher generates text responses for prompts, and the student fine-tunes on selected (prompt, response) pairs.

This is Supervised Fine-Tuning (SFT) on teacher-generated targets, not ground truth. Because an API commonly provides text rather than full token probabilities, the training channel is less detailed than direct logit access. A teacher can provide worked solutions, decomposed subproblems, critiques, or multiple candidate answers, but these outputs should be filtered with task-specific checks where possible. For a warehouse-routing assistant, a generated route is useful training data only after checks for capacity, delivery-window, and policy constraints.
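As a sketch of how a selected (prompt, response) pair might become a training example: one common convention, assumed here rather than taken from any specific recipe, masks the prompt tokens so that only the teacher's response contributes to the loss. The token ids and the helper `build_sft_example` are hypothetical; `-100` matches the default `ignore_index` of PyTorch's cross-entropy.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_sft_example(prompt_ids: list[int], response_ids: list[int]) -> dict[str, list[int]]:
    """Concatenate prompt and teacher response; supervise only the response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}

# Hypothetical token ids for a prompt and a teacher-written answer.
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22, 2])

supervised_positions = sum(label != IGNORE_INDEX for label in example["labels"])
print("input_ids:", example["input_ids"])
print("labels:", example["labels"])
print("supervised positions:", supervised_positions)
```

The next-token shift is still applied later by the loss function, so `labels` stays aligned with `input_ids` position by position.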

Examples and adjacent synthetic-data recipes

  • Alpaca[5]: Fine-tuned LLaMA 7B on 52K instruction-following examples generated by text-davinci-003.
  • Orca[3]: Learned from GPT-4 explanation traces plus guidance from ChatGPT, moving beyond shallow answer imitation.
  • phi-1.5[4]: Not classic KD, but an adjacent synthetic-data recipe built from textbook-like generated data; it does not compare KD loss choices.
  • Distilling Step-by-Step[6]: Uses generated rationales as an additional supervised target and evaluates whether smaller students improve on the studied tasks.
  • DeepSeek-R1-Distill[7]: Fine-tunes Qwen2.5- and Llama-based students (1.5B to 70B) for two to three epochs on the paper's roughly 800K-example SFT collection. The paper reports strong results for these distilled models on selected reasoning benchmarks; it does not prove that every teacher behavior transfers through text.

Response distillation and synthetic-data training overlap when a stronger model generates the selected targets. Calling a dataset "distillation" should not bypass evaluation: generated traces can be incorrect, stylistically misleading, contaminated, or out of scope for the intended student.

teacher-output-gate.py

```python
generated = [
    {"prompt": "sealed electronics return", "teacher": "30 days", "verified": "30 days"},
    {"prompt": "final-sale item return", "teacher": "30 days", "verified": "not eligible"},
    {"prompt": "defective item warranty", "teacher": "warranty process", "verified": "warranty process"},
]

accepted = [
    example for example in generated
    if example["teacher"] == example["verified"]
]
rejected = [
    example["prompt"] for example in generated
    if example["teacher"] != example["verified"]
]

print("generated:", len(generated))
print("accepted:", len(accepted))
print("rejected prompts:", rejected)
print("teacher text is trusted label:", len(rejected) == 0)
```

Output

```text
generated: 3
accepted: 2
rejected prompts: ['final-sale item return']
teacher text is trusted label: False
```

Aligning internal layers: feature distillation

Logit distillation matches output distributions. With white-box access, a training objective can also match selected student hidden states to selected teacher hidden states through a learned projection.

$$\mathcal{L}_{\text{feature}} = \sum_l \|f_l^{\text{teacher}} - g(f_l^{\text{student}})\|^2$$

Where $f_l^{\text{teacher}}$ and $f_l^{\text{student}}$ are hidden states at layer $l$, and $g(\cdot)$ projects student features into the teacher feature space before comparison.

Since the student commonly has different hidden dimensions, $g$ maps the student's representation into the teacher comparison space. Feature matching introduces extra decisions: which layers correspond, how the projection is trained, and whether its additional compute improves held-out outcomes. Hidden-state access is a richer interface, not a guarantee of a better student.
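To make the feature loss concrete, here is a dependency-free sketch that treats $g$ as a fixed linear map, one matrix per matched layer. All sizes and values are toy assumptions. In real training the projection would be a learned module updated together with the student, and the layer pairing would be a design decision.

```python
Vector = list[float]
Matrix = list[list[float]]

def project(weights: Matrix, vector: Vector) -> Vector:
    """Apply g(f) = W f, where W has shape (teacher_dim, student_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def feature_loss(teacher_states: list[Vector], student_states: list[Vector], projections: list[Matrix]) -> float:
    """Sum over matched layers of ||f_teacher - g(f_student)||^2."""
    return sum(
        squared_distance(teacher, project(weights, student))
        for teacher, student, weights in zip(teacher_states, student_states, projections)
    )

# Toy setup: teacher hidden size 3, student hidden size 2, two matched layers.
teacher_states = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student_states = [[1.0, -1.0], [0.5, 0.0]]
projections = [
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],  # layer 1: maps the student state onto the teacher state
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],  # layer 2: leaves a gap in the second coordinate
]

loss = feature_loss(teacher_states, student_states, projections)
print("feature loss:", loss)
```

Only the second layer contributes here, because the first projection already reproduces the teacher state.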

| Method | Teacher signal | Main advantage | Main constraint |
| --- | --- | --- | --- |
| Response KD | Selected text outputs | Works without white-box access | Teacher errors become SFT targets unless filtered |
| Logit KD | Token probabilities | Preserves distribution information | Requires aligned output space or an explicit mapping |
| Feature KD | Selected hidden states | Exposes intermediate representations | Needs layer/projection design and more storage or compute |
| On-policy KD | Teacher scores on student samples | Visits prefixes the student actually produces | Requires online sampling and teacher evaluation |
Distillation method selector showing response, logit, and feature signals from different teacher access levels, while fixed versus on-policy sampling is a separate choice.
Teacher access chooses the transferable signal. Fixed versus on-policy sampling separately chooses which prefixes receive teacher supervision.

Forward versus reverse KL

When minimizing KL divergence for language generation, direction matters. Classical distillation commonly minimizes Forward KL (teacher || student), which penalizes a student for missing probability mass that the teacher assigns to continuations. When a small student cannot model the teacher distribution well, this pressure may be costly.

Reverse KL (student || teacher) places more pressure on probability mass the student assigns where the teacher assigns little. MiniLLM reports improvements over its studied standard-KD baselines using reverse KL with an on-policy optimization algorithm in instruction-following experiments.[8] GKD evaluates multiple divergences and explicitly reports that the best divergence depends on the task and diversity-performance tradeoff.[9]

| Direction | Formula | Behavior | Common fit |
| --- | --- | --- | --- |
| Forward KL | $D_{KL}(P_{teacher} \Vert P_{student})$ | Mean-seeking, covers more of the teacher distribution | Classic KD when broad coverage matters |
| Reverse KL | $D_{KL}(P_{student} \Vert P_{teacher})$ | Penalizes student mass in teacher-low-probability regions | Candidate objective to evaluate for generation |

No divergence is the default winner for every task. Measure task quality, diversity, calibration, and failure rates under the actual decoding setup.

kl-direction-diagnostic.py

```python
import math

teacher = {"safe": 0.58, "alternate": 0.40, "bad": 0.02}
students = {
    "covers_teacher": {"safe": 0.54, "alternate": 0.36, "bad": 0.10},
    "adds_bad_mass": {"safe": 0.40, "alternate": 0.35, "bad": 0.25},
}

def kl(left: dict[str, float], right: dict[str, float]) -> float:
    return sum(prob * math.log(prob / right[token]) for token, prob in left.items())

for name, student in students.items():
    forward = kl(teacher, student)
    reverse = kl(student, teacher)
    print(name, "forward:", round(forward, 3), "reverse:", round(reverse, 3))

print("choose objective from evaluation, not slogan")
```

Output

```text
covers_teacher forward: 0.051 reverse: 0.084
adds_bad_mass forward: 0.218 reverse: 0.436
choose objective from evaluation, not slogan
```

Off-policy versus on-policy distillation

Off-policy (standard) distillation trains the student on teacher outputs for a static dataset. During inference, the student generates its own tokens, causing distribution shift. Errors compound as the student drifts from training data (exposure bias).

On-policy methods like Generalized Knowledge Distillation (GKD) sample sequences from the student, then compare student and teacher token distributions on the prefixes the student produced. GKD can mix fixed outputs and student-generated outputs through a student-data fraction $\lambda$; it does not require a natural-language critique.[9] The tradeoff is computational: both student sampling and teacher scoring run during training. It is useful when fixed teacher data misses prefixes that the deployed student commonly enters, but the benefit must be measured per task.

on-policy-prefix-coverage.py

```python
fixed_teacher_prefixes = {
    "return sealed electronics",
    "return unopened clothing",
}
student_generated_prefixes = {
    "return sealed electronics",
    "return opened final-sale electronics",
    "return item without receipt",
}

unseen_in_fixed_data = student_generated_prefixes - fixed_teacher_prefixes
teacher_scored_prefixes = fixed_teacher_prefixes | student_generated_prefixes

print("fixed prefixes:", len(fixed_teacher_prefixes))
print("student prefixes needing new teacher scores:", sorted(unseen_in_fixed_data))
print("scored after on-policy collection:", len(teacher_scored_prefixes))
```

Output

```text
fixed prefixes: 2
student prefixes needing new teacher scores: ['return item without receipt', 'return opened final-sale electronics']
scored after on-policy collection: 4
```
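The student-data fraction described above can be sketched as a per-batch coin flip: with probability λ the batch comes from student samples, otherwise from the fixed dataset. This is a simplified illustration, not the GKD implementation. The names `pick_source` and `plan_batches` are hypothetical, and the sampling and teacher-scoring steps are omitted.

```python
import random

def pick_source(rng: random.Random, student_fraction: float) -> str:
    """Choose where the next training batch comes from."""
    return "student_sample" if rng.random() < student_fraction else "fixed_data"

def plan_batches(num_batches: int, student_fraction: float, seed: int = 0) -> list[str]:
    """Draw a data source for each batch using a seeded generator."""
    if not 0.0 <= student_fraction <= 1.0:
        raise ValueError("student_fraction must be in [0, 1]")
    rng = random.Random(seed)
    return [pick_source(rng, student_fraction) for _ in range(num_batches)]

shares = {}
for student_fraction in (0.0, 0.5, 1.0):
    plan = plan_batches(1000, student_fraction)
    shares[student_fraction] = plan.count("student_sample") / len(plan)
    print(f"lambda={student_fraction}: student-sampled share={shares[student_fraction]:.2f}")
```

A fraction of 0 reduces to fixed-data distillation and a fraction of 1 is fully on-policy. Every student-sampled batch adds a generation pass and a teacher scoring pass, which is the compute cost to weigh against the coverage gain.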

Temperature: expose rankings without trusting them

The temperature parameter $T$ controls how much we smooth the teacher distribution used in a logit loss. It changes which relative token preferences are visible to the loss; it does not guarantee that those preferences are useful or that the student can represent them.

Mathematically, the softmax function converts raw logits into probabilities using an exponential function. When one logit is significantly larger than the rest, the exponential makes it dominate the entire probability mass, crushing the others to near-zero. Dividing all logits by a temperature $T > 1$ before applying the exponential shrinks the gaps between the largest logit and the smaller ones. This prevents the top choice from monopolizing the probability space and allows the relative scores of the "incorrect" choices to emerge.

A worked example

Imagine the teacher sees the prompt "Return policy for electronics?" and produces these raw logits for the next token:

| Token | Raw logit |
| --- | --- |
| 30 | 4.0 |
| 14 | 2.0 |
| 90 | 1.0 |
| 1 | 0.7 |

At $T = 1$, the probabilities are sharp: 30 gets about 0.82, 14 gets 0.11, 90 gets 0.04, and 1 gets 0.03. The student sees only a weak signal that 90 is a more reasonable guess than 1.

At $T = 4$, after dividing each logit by 4 and applying softmax, the probabilities spread out: 30 gets about 0.40, 14 gets 0.24, 90 gets 0.19, and 1 gets 0.17. The distillation loss now exposes more of the teacher's ranking among alternatives.

Key insight: Temperature changes the training target, not the trusted answer. If the teacher ranks an invalid option highly, a softened loss makes that error easier to copy. Use labels or checks and held-out evaluation alongside KD.

Diagram showing the effect of raising temperature: T = 1 (sharp) 30 days: 0.82, 14 days: 0.11, 90 days: 0.04, 1 year: 0.03; T = 4 (soft) 30 days: 0.40, 14 days: 0.24, 90 days: 0.19, 1 year: 0.17.
  • Low Temperature ($T \approx 1$): The distribution is sharp. The model is very confident in its top choice (30). The relationships between incorrect choices (14 vs 90) are hard to see because their probabilities are small (0.11 and 0.04 in the example above).
  • High Temperature ($T > 1$): The distribution flattens. Probability mass spreads to alternatives, making the teacher's relative ranking more visible to the loss.
  • Too High ($T \gg 10$): The distribution becomes nearly uniform. The ranking signal is lost, and the student learns little from the teacher distribution.

In practice, tune temperature rather than assuming a universal constant. The right value depends on the teacher distribution, soft-loss weight, task, and evaluation metrics. Inspect probability spreads as in the executable example above, then compare held-out quality.
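One way to put a number on "sharp" versus "nearly uniform" is normalized entropy: the entropy of the softened distribution divided by its maximum, $\log$ of the number of choices. The sketch below reuses the worked-example logits. The metric is an assumed diagnostic added for illustration, not a replacement for held-out evaluation.

```python
import math

logits = [4.0, 2.0, 1.0, 0.7]  # raw logits from the worked example

def softened(values: list[float], temperature: float) -> list[float]:
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs: list[float]) -> float:
    """Entropy divided by log(num_choices): near 0 is one-hot, 1.0 is uniform."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

flatness = {
    temperature: normalized_entropy(softened(logits, temperature))
    for temperature in (1.0, 4.0, 40.0)
}
for temperature, value in flatness.items():
    print(f"T={temperature:g} normalized entropy={value:.3f}")
```

A value very close to 1.0 suggests the soft target carries almost no ranking information at that temperature.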

A practical distillation training loop

A typical distillation training loop involves a frozen teacher model and a trainable student model. We forward the same input through both models and compute the combined loss. The local example below uses tiny PyTorch models so you can test the mechanics without downloading a real teacher.

a-practical-distillation-training-loop.py

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.output(self.embedding(input_ids))

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    temperature = 3.0
    alpha = 0.7

    shift_student = student_logits[:, :-1, :]
    shift_teacher = teacher_logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    student_flat = shift_student.reshape(-1, shift_student.size(-1))
    teacher_flat = shift_teacher.reshape(-1, shift_teacher.size(-1))
    labels_flat = shift_labels.reshape(-1)

    soft_teacher = F.softmax(teacher_flat / temperature, dim=-1)
    soft_student = F.log_softmax(student_flat / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    hard_loss = F.cross_entropy(student_flat, labels_flat)
    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(0)
vocab_size = 12
teacher = TinyLM(vocab_size=vocab_size, hidden_size=16)
student = TinyLM(vocab_size=vocab_size, hidden_size=6)
teacher.requires_grad_(False)
teacher.eval()

input_ids = torch.tensor([
    [1, 2, 3, 4, 5],
    [1, 3, 5, 7, 9],
])
labels = input_ids.clone()
optimizer = torch.optim.AdamW(student.parameters(), lr=0.05)

with torch.no_grad():
    teacher_logits = teacher(input_ids)

before = kd_loss(student(input_ids), teacher_logits, labels)
optimizer.zero_grad()
before.backward()
has_grad = any(parameter.grad is not None for parameter in student.parameters())
optimizer.step()

after = kd_loss(student(input_ids), teacher_logits, labels)

print("before:", round(float(before), 4))
print("after:", round(float(after), 4))
print("has_grad:", has_grad)
print("after_is_finite:", bool(torch.isfinite(after)))
print("improved:", bool(after < before))
```

Output

```text
before: 1.1513
after: 0.9793
has_grad: True
after_is_finite: True
improved: True
```

In production, the same pattern usually lives inside a framework trainer. When direct token-probability distillation is required, the teacher and student need aligned vocabularies, for example a larger Gemma teacher and a smaller Gemma student. Teams may pre-compute some teacher signal offline to avoid running the teacher inside every student update. Dense next-token logits across a long corpus are costly to store, so a design may consider top-k logits, teacher responses, or online scoring, then measure the quality effect of compression. The payload-only estimate below ignores metadata and storage-format overhead, so treat it as a lower-bound sizing exercise.

logit-cache-budget.py
```python
tokens = 50_000_000
vocab_size = 32_000
bytes_per_logit = 2  # bf16
top_k = 64
bytes_per_topk_item = 2 + 4  # bf16 value plus int32 token id

dense_bytes = tokens * vocab_size * bytes_per_logit
topk_bytes = tokens * top_k * bytes_per_topk_item
gib = 1024 ** 3

print("dense cache GiB:", round(dense_bytes / gib, 1))
print("top-k cache GiB:", round(topk_bytes / gib, 1))
print("storage reduction:", round(dense_bytes / topk_bytes, 1), "x")
print("quality must still be evaluated:", True)
```

Output

```text
dense cache GiB: 2980.2
top-k cache GiB: 17.9
storage reduction: 166.7 x
quality must still be evaluated: True
```
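
A first, cheap step toward measuring what top-k compression discards is to check how much teacher probability mass the kept entries cover at each position. The sketch below is stdlib-only and uses made-up logits; `topk_mass` is a hypothetical helper name, and coverage is only a proxy that does not replace evaluating the trained student.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    # Subtract the max for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def topk_mass(logits: list[float], k: int) -> float:
    # Fraction of teacher probability covered by the k most likely tokens.
    probs = sorted(softmax(logits), reverse=True)
    return sum(probs[:k])

# Illustrative logits only: one peaked row and one flat row over 8 tokens.
peaked = [8.0, 5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0] * 8

for name, row in [("peaked", peaked), ("flat", flat)]:
    print(name, "top-2 mass:", round(topk_mass(row, 2), 3))
```

A peaked row keeps almost all of its mass under a small k, while a flat row loses most of it, so logging coverage per position alongside the storage savings can show where truncation is likely to matter.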

If an online-distillation pilot is feasible, compare it with an offline baseline before committing to large-scale data generation. The result can expose whether fresh teacher scoring is worth its compute cost for this task.

Distilled models in practice

When the teacher is only available through generated text, response distillation becomes the default. When you control both models and can inspect logits or hidden states, white-box distillation exposes a richer training signal. Recent systems use both patterns depending on what the teacher exposes.

What changes from system to system isn't the core idea, but the supervision channel: raw logits, hidden states, instruction-response pairs, rationale traces, or synthetic corpora generated by a stronger model.

| Student / recipe | Teacher | Signal transferred | Why it matters |
| --- | --- | --- | --- |
| Alpaca 7B[5] | text-davinci-003 | 52K generated instruction-response examples | The repository reports preliminary instruction-following evaluation and clear non-commercial dataset terms. |
| Orca 13B[3] | GPT-4 + ChatGPT | Explanation traces and task instructions | Evaluates a richer generated-trace training recipe, rather than logit KD. |
| phi-1.5[4] | Existing LLMs + curated synthetic data | Textbook-like synthetic corpora | Adjacent synthetic-data recipe, not a direct teacher-distribution KD comparison. |
| Gemma 2 2B / 9B[2] | Larger Gemma teachers | Token-probability distillation during pretraining | Reports a controlled 2B distilled-versus-from-scratch ablation. |
| DeepSeek-R1-Distill 1.5B-70B[7] | DeepSeek-R1 | Roughly 800K selected SFT examples | Reports results from text-target fine-tuning; transfer remains benchmark-scoped. |

Building a distillation dataset

When using response-based distillation, the selected dataset is one of the main controllable inputs, alongside student capacity and training budget. Build a generation and selection pipeline that can reject incorrect, duplicate, contaminated, or irrelevant examples.

Seed-Expand-Filter pipeline

One structured way to generate teacher data is a Seed-Expand-Filter pipeline. The pipeline does not ensure quality by itself; each filter needs a measurable contract and a separate evaluation split.

  1. Seed: Start with a small set of high-quality, human-written prompts (e.g., 100 examples about warehouse operations).
  2. Expand: Ask the teacher model to generate new, diverse variations of these prompts.
  3. Generate: Have the teacher answer these new prompts, often with rationales or decomposed steps when richer supervision helps.
  4. Filter: Use checks, deduplication, safety screening, or reviewed scoring rules to reject unsuitable generations.
Seed-expand-filter distillation data pipeline: human seed prompts, teacher prompt expansion, teacher response generation, measurable filtering, and selected student targets.
For response distillation, selection is part of model design. Each stage should increase coverage or reject measurable failure modes before examples reach the student.

This is close to the spirit of Alpaca's Self-Instruct-style pipeline and Orca's richer explanation-trace data generation, even though real systems add deduplication, safety filters, and task balancing.[5][3]

This logic can be implemented by creating a small wrapper around a teacher client. The example below focuses on two easy-to-miss requirements: deduplicate prompts before paying for generation, and verify teacher answers before they become student targets.

select-teacher-responses.py
```python
from collections.abc import Callable

class DistillationDataGenerator:
    def __init__(
        self,
        teacher_generate: Callable[[str], str],
        verify_response: Callable[[str, str], bool],
    ):
        self.teacher_generate = teacher_generate
        self.verify_response = verify_response

    def generate_dataset(self, prompts: list[str]) -> list[dict[str, str]]:
        selected: list[dict[str, str]] = []
        seen: set[str] = set()
        for prompt in prompts:
            # Normalize case and whitespace so duplicates are skipped before generation.
            normalized = " ".join(prompt.lower().split())
            if normalized in seen:
                continue
            seen.add(normalized)
            response = self.teacher_generate(normalized).strip()
            # Only verified responses become student targets.
            if self.verify_response(normalized, response):
                selected.append({"prompt": normalized, "response": response})
        return selected

def fake_teacher(prompt: str) -> str:
    return "30 days"

trusted_answers = {
    "sealed electronics return": "30 days",
    "final-sale electronics return": "not eligible",
}

def verify_response(prompt: str, response: str) -> bool:
    # .get rejects prompts without a trusted answer instead of raising KeyError.
    return trusted_answers.get(prompt) == response

generator = DistillationDataGenerator(fake_teacher, verify_response)
examples = generator.generate_dataset([
    "sealed electronics return",
    " Sealed electronics return ",
    "final-sale electronics return",
])

print("selected prompts:", [example["prompt"] for example in examples])
print("selected count:", len(examples))
print("bad response retained:", any("final-sale" in example["prompt"] for example in examples))
```

Output

```text
selected prompts: ['sealed electronics return']
selected count: 1
bad response retained: False
```

Keep training data out of evaluation

Teacher generation can quietly contaminate a benchmark when prompts, reference solutions, or close rewrites enter the student training set. At minimum, block exact normalized overlap before training. For real releases, extend the gate with near-duplicate and reference-solution checks.

held-out-contamination-gate.py
```python
def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().replace("?", "").split())

candidate_training_prompts = [
    "Compute shipping refund for a late parcel",
    "Return sealed electronics within 30 days",
    "Can a FINAL-SALE item be returned?",
]
held_out_prompts = [
    "can a final-sale item be returned",
    "Estimate delivery window for a remote zip code",
]

held_out_keys = {normalize(prompt) for prompt in held_out_prompts}
accepted = [
    prompt for prompt in candidate_training_prompts
    if normalize(prompt) not in held_out_keys
]
blocked = [
    prompt for prompt in candidate_training_prompts
    if normalize(prompt) in held_out_keys
]

print("accepted training prompts:", len(accepted))
print("blocked overlap:", blocked)
print("held-out exact overlap after gate:", any(normalize(p) in held_out_keys for p in accepted))
```

Output

```text
accepted training prompts: 2
blocked overlap: ['Can a FINAL-SALE item be returned?']
held-out exact overlap after gate: False
```
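
Exact matching misses light rewrites of a held-out prompt. As one possible extension, the sketch below flags near duplicates with word-level Jaccard similarity. The `jaccard` helper, the toy prompts, and the 0.6 threshold are illustrative assumptions rather than values from any cited system; a real gate would tune the threshold on labeled pairs and likely add n-gram or embedding checks.

```python
def tokens(prompt: str) -> set[str]:
    # Lowercase, drop question marks, and split on whitespace.
    return set(prompt.lower().replace("?", "").split())

def jaccard(left: str, right: str) -> float:
    # Shared words divided by all distinct words across both prompts.
    a, b = tokens(left), tokens(right)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

held_out = ["can a final-sale item be returned"]
candidates = [
    "Can a final-sale item ever be returned?",
    "Compute shipping refund for a late parcel",
]
threshold = 0.6  # assumed cutoff for this toy example

flagged = [
    prompt for prompt in candidates
    if any(jaccard(prompt, reference) >= threshold for reference in held_out)
]
print("near-duplicate candidates:", flagged)
```

The first candidate differs from the held-out prompt by a single word, so an exact-match gate would accept it while this check flags it for review.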

Limitations and when not to distill

Distillation does not make capacity, context-window, or data-coverage constraints disappear. A student may beat its teacher on a narrow checked metric after filtering or task-specific training, while regressing on other behavior. Treat the teacher and student as separate artifacts to evaluate.

Before investing in a distillation pipeline, define which behaviors matter and how they will be tested.

| Behavior | Regression risk to test | Useful held-out gate |
| --- | --- | --- |
| Domain answers | Generated targets can repeat teacher errors | Checked answer accuracy and abstention rate |
| Instruction following | Narrow traces can miss new constraints | Fresh constraint-following prompts |
| Multi-step solutions | Final answers can hide invalid steps | Step checks where available plus final-answer accuracy |
| Long-context use | Student architecture or context limit may differ | Retrieval and long-context slices at deployment length |
| Safety and policy behavior | Filtered corpus may omit refusals or edge cases | Safety-policy evaluation separate from task benchmark |

Legal and ethical considerations

Beyond technical constraints, distillation raises licensing and evaluation questions. Because the student is trained on the teacher's outputs, the provenance of that training data and the terms attached to it matter.

  • Provider terms matter: the Stanford Alpaca release was research-only and non-commercial, and the repo points to both the underlying LLaMA restrictions and the dataset's CC BY-NC 4.0 terms.[5]
  • Restrictions must be reviewed: before generating a corpus or shipping a student, review the teacher access terms, base-student license, generated-data license, and permitted use of outputs. Do not infer permission from technical access.
  • Imitation isn't capability proof: a student may reproduce style or familiar output patterns while failing new checked tasks. Held-out evaluation, not resemblance, establishes value.

Cost-quality tradeoff

Deciding whether to distill, and which method to use, comes down to measured quality and economics. Richer teacher access can enable different losses; it does not rank final models without evaluation.

| Approach | Required access | Main training cost | Release gate |
| --- | --- | --- | --- |
| Use teacher directly | Teacher inference | No student training | Baseline quality, latency, and cost |
| Response KD | Generated outputs and permitted use | Generation plus SFT | Output filtering and held-out task quality |
| Logit KD | Aligned teacher token probabilities | Teacher scoring or cache storage | Task quality plus cache/online cost |
| Feature KD | Hidden states and layer mapping | Extra projections and state transfer | Ablation against simpler KD baseline |

Also, don't stop at distillation loss. A student can match teacher probabilities on training batches and still regress on held-out generation quality, long-context behavior, or latency targets. Measure task metrics, pairwise win rate, and real serving cost together.

deployment-gate.py
```python
teacher = {"checked_accuracy": 0.94, "policy_error_rate": 0.01, "latency_ms": 180}
student = {"checked_accuracy": 0.92, "policy_error_rate": 0.04, "latency_ms": 42}
requirements = {"checked_accuracy": 0.90, "max_policy_error_rate": 0.02, "max_latency_ms": 60}

checks = {
    "quality": student["checked_accuracy"] >= requirements["checked_accuracy"],
    "policy": student["policy_error_rate"] <= requirements["max_policy_error_rate"],
    "latency": student["latency_ms"] <= requirements["max_latency_ms"],
}

print("student faster:", student["latency_ms"] < teacher["latency_ms"])
print("release checks:", checks)
print("deploy student:", all(checks.values()))
```

Output

```text
student faster: True
release checks: {'quality': True, 'policy': False, 'latency': True}
deploy student: False
```
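
Pairwise win rate can be computed from per-prompt verdicts in a few lines. The sketch below assumes a hypothetical list of judge verdicts and shows two possible tie conventions; the verdict labels and the choice of convention are assumptions for illustration, and whichever convention you pick should be fixed before comparing runs.

```python
from collections import Counter

# Hypothetical judge verdicts for eight student-versus-teacher comparisons.
verdicts = ["student", "teacher", "tie", "student", "teacher", "teacher", "student", "teacher"]

counts = Counter(verdicts)
decided = counts["student"] + counts["teacher"]

# Two possible conventions: drop ties, or count each tie as half a win.
win_rate_no_ties = counts["student"] / decided if decided else 0.0
win_rate_half_ties = (counts["student"] + 0.5 * counts["tie"]) / len(verdicts)

print("student wins:", counts["student"], "teacher wins:", counts["teacher"], "ties:", counts["tie"])
print("win rate without ties:", round(win_rate_no_ties, 3))
print("win rate with ties as half:", round(win_rate_half_ties, 3))
```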

Practice checkpoints

How does temperature affect distillation? Temperature controls how soft the teacher distribution becomes. A higher $T$ spreads probability mass across alternatives, exposing more of the teacher ranking to the loss. If $T$ is too high, the distribution becomes nearly uniform and loses ranking signal; any setting still needs held-out validation.
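
The effect is easy to see numerically. The following stdlib-only sketch applies a temperature-scaled softmax to one row of made-up teacher logits and reports the entropy at each setting; the logits and the three temperatures are illustrative assumptions, not recommended values.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def entropy(probs: list[float]) -> float:
    # Shannon entropy in nats; higher means a flatter distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [6.0, 3.0, 1.0, 0.0]  # illustrative values only

for temperature in (1.0, 3.0, 100.0):
    probs = softmax_with_temperature(teacher_logits, temperature)
    print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

The ordering of tokens is the same at every temperature; what changes is how much probability the alternatives receive, from almost none at $T = 1$ to nearly equal shares at $T = 100$.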

Can you distill from a proprietary API model? If its permitted interface and terms allow generated outputs for training, the available KD channel is response distillation. You generate candidate prompt-response pairs or checked solution traces and fine-tune the student on selected text. You don't get token probabilities, so filtering, use rights, and held-out evaluation are central.

When does distillation fail to transfer capability? It fails on a target behavior when student capacity, context length, architecture, or training coverage cannot reproduce that behavior at the required threshold. Define slices such as long-context tasks, policy edge cases, and checked multi-step solutions so the failure is visible.

How do you decide whether to deploy a distilled student? Set quality, policy, latency, and cost gates before training. Lower latency is not enough if a student fails a policy or accuracy gate, as the executable deployment example shows.

Can you directly distill between different tokenizers? Not with plain token-level KL unless token-to-id output mappings agree. Same vocabulary size is not enough. If output spaces differ, use response distillation or design and validate an explicit mapping.
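
A mapping check can run before any training step. In the sketch below, plain dictionaries stand in for whatever token-to-id export your tokenizers provide, and `mapping_mismatches` is a hypothetical helper; the toy vocabularies have the same size but two swapped ids.

```python
def mapping_mismatches(
    teacher_vocab: dict[str, int],
    student_vocab: dict[str, int],
) -> list[str]:
    # Collect every token whose id differs or that only one side defines.
    all_tokens = sorted(set(teacher_vocab) | set(student_vocab))
    return [
        token for token in all_tokens
        if teacher_vocab.get(token) != student_vocab.get(token)
    ]

# Toy vocabularies: same size, but two ids are swapped.
teacher_vocab = {"<pad>": 0, "return": 1, "refund": 2, "parcel": 3}
student_vocab = {"<pad>": 0, "return": 2, "refund": 1, "parcel": 3}

mismatches = mapping_mismatches(teacher_vocab, student_vocab)
print("same size:", len(teacher_vocab) == len(student_vocab))
print("mismatched tokens:", mismatches)
print("safe for token-level KL:", not mismatches)
```

With swapped ids, token-level KL would silently compare the teacher's probability for one token against the student's probability for another, so an empty mismatch list is a reasonable precondition for logit KD.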

Check your understanding

Before moving on, try to answer these questions without looking back at the article.

  1. Why does a soft label teach more than a hard label?
    Hint: Think about what the student learns about the near-miss answers, not only the correct one.

  2. Why might $T = 1$ provide less ranking signal than a higher temperature?
    Hint: Consider what a sharp teacher distribution exposes about alternatives.

  3. Why do causal language models need a one-token shift when computing distillation loss?
    Hint: Remember that a causal LM predicts the next token from all previous tokens.

  4. Why should forward versus reverse KL be chosen through evaluation?
    Hint: Think about mode coverage, low-teacher-probability mass, and task-dependent diversity requirements.

Solution sketches
  1. A hard label only tells the student which answer is selected. A soft label exposes the teacher's full ranking over alternatives. That extra signal can help, but a trusted check still has to catch bad teacher rankings.

  2. If the teacher distribution is sharp at $T = 1$, the top token receives most of the probability mass and alternatives contribute little signal. Raising temperature can expose their relative ranking, until excessive smoothing removes useful separation.

  3. In a causal LM, the logits produced at position $t$ predict the token at position $t+1$. If you don't shift the labels by one, you are training the model to predict the current token from itself, which is trivial and wrong.

  4. Forward KL penalizes missing teacher mass; reverse KL penalizes student mass where the teacher is low. Their quality and diversity tradeoffs vary by task and decoding setup, so evaluate both where the objective choice matters.
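
The asymmetry in the last sketch can be checked on a toy example. The distributions below are made up to make the contrast visible and say nothing about a real model: the teacher has two strong modes and one near-zero token, one student covers both modes but leaks mass onto the rare token, and the other commits to a single mode.

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    # KL(p || q) in nats; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over three tokens.
teacher = [0.4995, 0.4995, 0.001]    # two strong modes, one rare token
student_cover = [0.40, 0.40, 0.20]   # covers both modes, leaks mass to the rare token
student_single = [0.98, 0.01, 0.01]  # commits to one mode

for name, student in [("cover", student_cover), ("single", student_single)]:
    forward = kl(teacher, student)  # KL(teacher || student)
    reverse = kl(student, teacher)  # KL(student || teacher)
    print(name, "forward:", round(forward, 3), "reverse:", round(reverse, 3))
```

On these numbers, forward KL prefers the covering student because the single-mode student misses half of the teacher mass, while reverse KL prefers the single-mode student because the covering student puts mass where the teacher has almost none. Which preference helps depends on the task, which is why the choice is left to evaluation.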

What to remember

  1. Core concept: Distillation transfers teacher signal, not guaranteed truth. Keep trusted labels, checks, and held-out evaluation in the design.
  2. Methods: Response KD uses selected generated text. Logit KD uses aligned token probabilities. Feature KD adds hidden-state comparisons. On-policy KD scores prefixes produced by the student.
  3. Temperature: Softening can expose more of a teacher ranking; too much removes separation. With the classical soft-loss formulation, scale KL by $T^2$ to preserve gradient magnitude.
  4. Loss function: For causal-LM logit KD, shift positions correctly, ignore masked positions, and verify identical token-to-id output mappings. Divergence direction is an evaluated design choice.
  5. On-policy versus off-policy: Static teacher targets may miss prefixes the student produces. GKD adds student-generated sequences and teacher token-distribution scoring at additional compute cost.
  6. Data and release gates: Deduplicate and check generated targets, protect evaluation splits, and release only if quality, policy, latency, and cost gates pass.
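
For reference, the classical combined objective behind items 3 and 4 can be written compactly. The notation here is an assumption of this sketch: $z^{(t)}$ and $z^{(s)}$ are teacher and student logits at one shifted position, $y$ is the next-token label, $\sigma$ is softmax, $T$ is the temperature, and $\alpha$ weights the soft term, matching the `alpha` and `temperature` variables in the `kd_loss` example earlier in this chapter.

```latex
\mathcal{L}_{\text{KD}}
  = \alpha \, T^{2} \, \mathrm{KL}\!\left(
      \sigma\!\left(z^{(t)} / T\right) \,\middle\|\, \sigma\!\left(z^{(s)} / T\right)
    \right)
  + (1 - \alpha) \, \mathrm{CE}\!\left(y, \sigma\!\left(z^{(s)}\right)\right)
```

The gradient of the soft term with respect to the student logits shrinks roughly as $1/T^2$, which is why Hinton et al. multiply it by $T^2$: the balance between the soft and hard terms then stays comparable when $T$ changes.[1]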

Common pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| Validation loss barely changes as you raise temperature. | The softened teacher distribution may be too flat or the soft-loss weight may be ineffective. | Inspect teacher probabilities and tune temperature and loss weight on held-out tasks. |
| Student looks strong on training prompts but weak on held-out tasks. | Distillation corpus is too narrow, repetitive, or too close to evaluation data. | Broaden prompt coverage, filter duplicates, and keep a separate held-out evaluation slice. |
| Student predicts current token instead of next token during logit KD. | Causal LM loss forgot the one-token shift. | Shift logits at position $t$ against labels at position $t+1$ before KL or cross-entropy. |
| Student copies teacher hallucinations and policy mistakes. | Distillation blindly transferred bad teacher outputs. | Filter teacher generations, add task loss, and evaluate against trusted labels or reward checks. |
| KL loss runs but student quality stays random. | Teacher and student token-to-id mappings do not align, even if sizes match. | Compare mappings exactly, use response distillation, or design an explicit output-space mapping. |
| Tiny student misses required checked behaviors. | Student capacity, context, or data coverage is insufficient for this release target. | Narrow task scope, increase student size, or revise training and evaluation design. |
| Offline metrics look great but production quality collapses. | Distillation and evaluation data leaked into each other. | Split generation, tuning, and evaluation sets cleanly before training starts. |

What you should be able to defend

By the end of this chapter, you should be able to:

  • Explain why soft labels carry more information than hard labels.
  • Choose between response, logit, feature, and on-policy distillation based on available teacher signal and measurable costs.
  • Implement causal-LM logit distillation with one-token label shifting and vocabulary checks.
  • Explain why forward KL and reverse KL behave differently for generative models.
  • Design a generated-data pipeline with checks, deduplication, held-out protection, and licensing clarity.
  • Decide whether a distilled student passes quality, policy, latency, and cost release gates.
Next Step
Continue to Model Merging and Weight Interpolation

Distillation trains a new student from teacher signal. Model merging asks whether compatible checkpoints can be combined into one candidate without another gradient-training run.

References

  1. Distilling the Knowledge in a Neural Network. Hinton, G., Vinyals, O., & Dean, J. · 2015
  2. Gemma 2: Improving Open Language Models at a Practical Size. Gemma Team, Google DeepMind · 2024
  3. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Mukherjee, S., et al. · 2023
  4. Textbooks Are All You Need II: phi-1.5 technical report. Li, Y., et al. · 2023
  5. Stanford Alpaca: An Instruction-following LLaMA Model. Taori, R., et al. · 2023 · GitHub
  6. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data. Hsieh, C., et al. · 2023
  7. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI · 2025
  8. MiniLLM: On-Policy Distillation of Large Language Models. Gu, Y., et al. · 2024
  9. On-Policy Distillation for Language Models. Agarwal, R., et al. · 2024