Knowledge Distillation for LLMs

Understand the main forms of knowledge distillation for LLMs, from logit matching and response-based supervision to on-policy KD. Learn when distillation helps, where student capacity becomes the bottleneck, and how to implement a correct teacher-student training loop.

RLVR trained a policy against checked outcomes where a verifier exists. Knowledge distillation asks a different deployment question: after you have a useful large language model (LLM) teacher, how do you transfer selected behavior into a smaller model that's cheaper to serve?

Knowledge distillation trains a smaller student model to imitate useful behavior from a larger or otherwise more capable teacher. Imagine a return-policy chatbot: a large model performs well on an evaluated question set, but running it for every user costs too much. A smaller student is viable only if it retains enough answer quality under a latency and serving-cost budget.

The transfer signal depends on teacher access. In white-box settings, the student can match softened token probabilities or internal features. In black-box settings, it can fine-tune on selected teacher-written answers, worked solutions, or synthetic corpora. Each channel carries different information and different failure modes. The goal isn't to assume the student inherits the teacher; it's to transfer useful behavior and verify the resulting quality-cost tradeoff.

Knowledge distillation flow: a teacher supplies transfer signal to a smaller student, followed by independent evaluation gates. — A teacher supplies additional supervision to a smaller student. Trusted labels or checks still matter because teacher outputs can be wrong.

The flow below shows the white-box KD case. Response distillation uses selected teacher text instead of the teacher-probability branch.

Diagram showing input tokens, a large 70B-class teacher model, a smaller 7B-class student model, and teacher probabilities.

Why soft labels teach more than hard labels

In machine learning, the core idea (introduced by Hinton et al. in 2015) is to train a student model to mimic a teacher model's behavior, rather than only the ground-truth labels. The student learns from the teacher's full probability distribution, which contains richer information than simple one-hot labels.^[1] The training objective usually combines this distillation term with a standard task loss:

$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{distill}} + (1 - \alpha) \cdot \mathcal{L}_{\text{task}}$

Where $\mathcal{L}_{\text{distill}}$ matches teacher behavior, $\mathcal{L}_{\text{task}}$ matches ground-truth labels, and $\alpha \in [0, 1]$ controls the tradeoff. This combined loss makes sure the student learns from both the teacher's soft predictions and the true labels.

A concrete example

Suppose a customer asks, "How long do I have to return an electronic item?" The teacher's output distribution contains dark knowledge: relationships between classes that hard labels don't capture.

The hard label only says "30 days" is correct. In this simplified answer-choice example, the soft label also records that the teacher ranks "14 days" above "1 year." That extra signal is useful only if the teacher's ranking is itself useful; it is not proof that the student learned a general return-policy rule.

Soft-label distillation example comparing a hard target with a teacher ranking over 30 days, 14 days, 90 days, and 1 year. — Soft labels preserve the teacher's ranking over alternatives. A validation set still has to establish whether that ranking helps.

The probabilities above describe one simplified prediction decision. For a causal LLM, logit distillation applies this idea at each predicted token position. Raising the temperature exposes more of the teacher's ranking, but it can also flatten the distribution until very little preference signal remains.

temperature-softening.py

import math

logits = {"30": 4.0, "14": 2.0, "90": 1.0, "1": 0.7}

def softened_probabilities(temperature: float) -> dict[str, float]:
    scaled = {token: math.exp(logit / temperature) for token, logit in logits.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}

for temperature in (1.0, 4.0, 40.0):
    probs = softened_probabilities(temperature)
    rounded = {token: round(probability, 3) for token, probability in probs.items()}
    top_gap = probs["30"] - probs["14"]
    print(f"T={temperature:g} probabilities:", rounded, "top_gap:", round(top_gap, 3))

Output

T=1 probabilities: {'30': 0.818, '14': 0.111, '90': 0.041, '1': 0.03} top_gap: 0.708
T=4 probabilities: {'30': 0.397, '14': 0.241, '90': 0.188, '1': 0.174} top_gap: 0.156
T=40 probabilities: {'30': 0.263, '14': 0.25, '90': 0.244, '1': 0.242} top_gap: 0.013

When distillation beats training from scratch

Distillation is most useful when you already have a strong teacher, legal access to its signal, and a clear smaller deployment target. It does not win automatically over training from scratch. The Gemma 2 report gives a controlled example: its authors train the 2B and 9B models with token-probability distillation, and a 2B ablation trained for 500B tokens scores 67.7 when distilled from a 7B teacher versus 60.3 from scratch on their three-benchmark average.^[2]

Different recipes expose different supervision channels: Orca trains on explanation traces, phi-1.5 uses curated synthetic textbook-like data, and Gemma 2 uses teacher token probabilities for small models.^[3]^[4]^[2] They motivate careful data and signal selection; they do not establish one universally best recipe.

Matching the teacher's probabilities: logit distillation

This method requires access to the teacher model's logits (the raw, unnormalized scores output by the final layer of the network before the softmax function). We minimize the KL Divergence (Kullback-Leibler divergence), a mathematical measure of how one probability distribution differs from another, to align the student's probability distribution with the teacher's.

First, we apply temperature scaling (dividing logits by a temperature $T > 1$ before softmax to soften the probability distribution) to both models' logits to get softened probability distributions:

$q_i = \frac{\exp(z_{t,i} / T)}{\sum_j \exp(z_{t,j} / T)}, \quad p_i = \frac{\exp(z_{s,i} / T)}{\sum_j \exp(z_{s,j} / T)}$

Where $z_t$ are the teacher's logits, $z_s$ are the student's logits, and $T$ is the temperature. Then we compute the KL divergence loss:

$\mathcal{L}_{\text{KL}} = T^2 \sum_i q_i \log \frac{q_i}{p_i}$

Reading the formula

$q_i$ is the teacher's softened probability for token $i$ (the target we want the student to match)
$p_i$ is the student's softened probability for token $i$ (what the student currently predicts)
$T$ is the temperature (higher $T$ creates a softer, more uniform distribution)
The $T^2$ scaling factor compensates for gradient scaling: as temperature increases, gradients from soft targets scale down by approximately $1/T^2$. Multiplying by $T^2$ keeps the relative contribution from soft targets roughly stable as you tune temperature. Think of it like turning up the volume on a quiet signal so it can compete with the loud one.
KL divergence measures how much information is lost when using the student's distribution $p$ to approximate the teacher's distribution $q$
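
The gradient-scaling point above can be checked numerically. The sketch below is an illustration under stated assumptions: the student logits are made up, and it relies on the closed-form gradient of the unscaled KL term with respect to a student logit, $(p_i - q_i)/T$, instead of running autograd.

```python
import math

teacher_logits = [4.0, 2.0, 1.0, 0.7]  # same toy logits as the temperature example
student_logits = [2.0, 2.5, 0.5, 1.0]  # hypothetical student logits for illustration

def softmax(logits: list[float], temperature: float) -> list[float]:
    exps = [math.exp(value / temperature) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def kl_grad_norm(temperature: float) -> float:
    """L2 norm of d KL(q || p) / d z_s, which equals (p - q) / T per logit."""
    q = softmax(teacher_logits, temperature)
    p = softmax(student_logits, temperature)
    return math.sqrt(sum(((p_i - q_i) / temperature) ** 2 for p_i, q_i in zip(p, q)))

temperatures = (1.0, 2.0, 4.0, 8.0)
raw_norms = [kl_grad_norm(t) for t in temperatures]
# Multiplying the loss by T^2 multiplies its gradient by T^2 as well.
scaled_norms = [norm * t ** 2 for norm, t in zip(raw_norms, temperatures)]

for t, raw, scaled in zip(temperatures, raw_norms, scaled_norms):
    print(f"T={t:g} raw grad norm: {raw:.4f}  with T^2: {scaled:.4f}")
```

With these toy values the raw gradient norm shrinks quickly as $T$ grows, while the $T^2$-scaled norm changes far less. The exact ratio depends on the logits, so treat it as a sanity check, not a constant.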

For causal LMs, two implementation details matter. First, next-token training requires a one-token shift: logits at position $t$ train against the label at position $t+1$. Second, direct logit KD assumes teacher and student use the same token-to-id output mapping. Equal vocabulary sizes alone are insufficient: token id 42 must denote the same token for both models. If output spaces differ, plain token-level KL no longer lines up cleanly and you usually fall back to response distillation or design an explicit mapping. The snippet below handles the shift, masks ignored positions, and fails fast on a reordered vocabulary.

Common mistake: Running logit distillation without verifying tokenizer alignment. Two models can have the same vocabulary size and different token-id mappings. Compare the complete output mapping, not only vocab_size, before training.
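
One way to run the alignment check described above is to compare the two token-to-id dictionaries directly. This is a minimal sketch: `mapping_mismatches` is a hypothetical helper and the four-token vocabularies are invented. With Hugging Face tokenizers, a dictionary of this shape is typically available from `tokenizer.get_vocab()`.

```python
from typing import Optional

def mapping_mismatches(
    student_vocab: dict[str, int],
    teacher_vocab: dict[str, int],
    limit: int = 5,
) -> list[tuple[str, Optional[int], Optional[int]]]:
    """Return up to `limit` tokens whose ids differ or that exist on one side only."""
    mismatches = []
    for token in sorted(set(student_vocab) | set(teacher_vocab)):
        student_id = student_vocab.get(token)
        teacher_id = teacher_vocab.get(token)
        if student_id != teacher_id:
            mismatches.append((token, student_id, teacher_id))
            if len(mismatches) == limit:
                break
    return mismatches

# Hypothetical vocabularies: same size, but two ids are swapped.
student_vocab = {"<pad>": 0, "return": 1, "days": 2, "30": 3}
teacher_vocab = {"<pad>": 0, "return": 1, "30": 2, "days": 3}

same_size = len(student_vocab) == len(teacher_vocab)
mismatches = mapping_mismatches(student_vocab, teacher_vocab)

print("same vocab size:", same_size)
print("mismatched tokens:", mismatches)
print("safe for logit KD:", not mismatches)
```

A size-only check would pass here, which is why the comparison walks the full mapping.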

reading-the-formula.py

import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    student_vocabulary: tuple[str, ...],
    teacher_vocabulary: tuple[str, ...],
    temperature: float = 3.0,
    alpha: float = 0.5,
    ignore_index: int = -100,
) -> torch.Tensor:
    """
    Computes the weighted sum of Knowledge Distillation (KD) loss and Cross-Entropy loss.
    """
    if student_vocabulary != teacher_vocabulary:
        raise ValueError(
            "Logit KD requires identical token-to-id mappings. "
            "Use response KD or design an explicit mapping when output spaces differ."
        )
    if student_logits.size(-1) != teacher_logits.size(-1) or student_logits.size(-1) != len(student_vocabulary):
        raise ValueError("Logit tensors and vocabulary dimensions must agree.")

    # Causal LMs predict token t+1 from positions up to t.
    shift_student = student_logits[..., :-1, :].contiguous()
    shift_teacher = teacher_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    vocab_size = shift_student.size(-1)
    flat_student = shift_student.reshape(-1, vocab_size)
    flat_teacher = shift_teacher.reshape(-1, vocab_size)
    flat_labels = shift_labels.reshape(-1)

    valid_mask = flat_labels != ignore_index
    if not valid_mask.any():
        return student_logits.sum() * 0

    student_valid = flat_student[valid_mask]
    teacher_valid = flat_teacher[valid_mask]
    labels_valid = flat_labels[valid_mask]

    # Soft loss: KL divergence between softened distributions
    soft_teacher = F.softmax(teacher_valid.detach() / temperature, dim=-1)
    soft_student = F.log_softmax(student_valid / temperature, dim=-1)

    # KLDivLoss expects log-probabilities for the input (student)
    # and standard probabilities for the target (teacher)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    soft_loss *= temperature ** 2  # Scale by T² for gradient magnitude

    # Hard loss: standard cross-entropy with ground truth
    hard_loss = F.cross_entropy(student_valid, labels_valid)

    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(7)
batch, seq_len, vocab = 2, 5, 8
student_logits = torch.nn.Parameter(torch.randn(batch, seq_len, vocab))
teacher_logits = torch.randn(batch, seq_len, vocab)
labels = torch.tensor([
    [0, 1, 2, 3, 4],
    [0, 2, -100, 5, 6],
])
vocabulary = tuple(f"token_{index}" for index in range(vocab))

loss = distillation_loss(
    student_logits,
    teacher_logits,
    labels,
    vocabulary,
    vocabulary,
    temperature=3.0,
    alpha=0.6,
)
loss.backward()

loss_is_scalar = loss.ndim == 0
loss_is_finite = bool(torch.isfinite(loss))
grad_exists = student_logits.grad is not None
grad_is_finite = bool(torch.isfinite(student_logits.grad).all()) if grad_exists else False
mismatch_failed = False

try:
    distillation_loss(student_logits, teacher_logits, labels, vocabulary, tuple(reversed(vocabulary)))
except ValueError as exc:
    mismatch_failed = "token-to-id mappings" in str(exc)

print("loss:", round(float(loss), 4))
print("grad norm:", round(float(student_logits.grad.norm()), 4))
print("loss_is_scalar:", loss_is_scalar)
print("loss_is_finite:", loss_is_finite)
print("grad_is_finite:", grad_is_finite)
print("mismatch failed:", mismatch_failed)

Output

loss: 1.4959
grad norm: 0.1891
loss_is_scalar: True
loss_is_finite: True
grad_is_finite: True
mismatch failed: True

When you only have text: response distillation

When you don't have access to the teacher's weights or logits, but can obtain generated text, response distillation is the available KD channel. In this approach, the teacher generates text responses for prompts, and the student fine-tunes on selected (prompt, response) pairs.

This is Supervised Fine-Tuning (SFT) on teacher-generated targets, not ground truth. Because an API commonly provides text rather than full token probabilities, the training channel is less detailed than direct logit access. A teacher can provide worked solutions, decomposed subproblems, critiques, or multiple candidate answers, but these outputs should be filtered with task-specific checks where possible. For a warehouse-routing assistant, a generated route is useful training data only after checks for capacity, delivery-window, and policy constraints.
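
To make "SFT on teacher-generated targets" concrete, the sketch below builds one training example from a (prompt, response) pair. It assumes a common convention that the source does not prescribe: prompt positions are masked with the ignore index so that only teacher-written response tokens contribute to the loss. The token ids and the `build_sft_example` helper are hypothetical.

```python
IGNORE_INDEX = -100  # matches the ignore_index used in the logit KD example

def build_sft_example(
    prompt_ids: list[int],
    response_ids: list[int],
    eos_id: int,
) -> dict[str, list[int]]:
    """Concatenate prompt and teacher response; supervise only the response and EOS."""
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}

# Made-up token ids standing in for a tokenized prompt and a verified teacher answer.
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2)
supervised = sum(label != IGNORE_INDEX for label in example["labels"])

print("input_ids:", example["input_ids"])
print("labels:", example["labels"])
print("supervised positions:", supervised)
```

Whether to mask the prompt is a design choice; some recipes train on every token, so compare both on held-out data if it matters for the task.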

Examples and adjacent synthetic-data recipes

Alpaca^[5]: Fine-tuned LLaMA 7B on 52K instruction-following examples generated by text-davinci-003.
Orca^[3]: Learned from GPT-4 explanation traces plus guidance from ChatGPT, moving beyond shallow answer imitation.
phi-1.5^[4]: Not classic KD, but an adjacent synthetic-data recipe built from textbook-like generated data; it does not compare KD loss choices.
Distilling Step-by-Step^[6]: Uses generated rationales as an additional supervised target and evaluates whether smaller students improve on the studied tasks.
DeepSeek-R1-Distill^[7]: Fine-tunes Qwen2.5- and Llama-based students (1.5B to 70B) for two to three epochs on the paper's roughly 800K-example SFT collection. The paper reports strong results for these distilled models on selected reasoning benchmarks; it does not prove that every teacher behavior transfers through text.

Response distillation and synthetic-data training overlap when a stronger model generates the selected targets. Calling a dataset "distillation" should not bypass evaluation: generated traces can be incorrect, stylistically misleading, contaminated, or out of scope for the intended student.

teacher-output-gate.py

generated = [
    {"prompt": "sealed electronics return", "teacher": "30 days", "verified": "30 days"},
    {"prompt": "final-sale item return", "teacher": "30 days", "verified": "not eligible"},
    {"prompt": "defective item warranty", "teacher": "warranty process", "verified": "warranty process"},
]

accepted = [
    example for example in generated
    if example["teacher"] == example["verified"]
]
rejected = [
    example["prompt"] for example in generated
    if example["teacher"] != example["verified"]
]

print("generated:", len(generated))
print("accepted:", len(accepted))
print("rejected prompts:", rejected)
print("teacher text is trusted label:", len(rejected) == 0)

Output

generated: 3
accepted: 2
rejected prompts: ['final-sale item return']
teacher text is trusted label: False

Aligning internal layers: feature distillation

Logit distillation matches output distributions. With white-box access, a training objective can also match selected student hidden states to selected teacher hidden states through a learned projection.

$\mathcal{L}_{\text{feature}} = \sum_l \|f_l^{\text{teacher}} - g(f_l^{\text{student}})\|^2$

Where $f_l^{\text{teacher}}$ and $f_l^{\text{student}}$ are hidden states at layer $l$, and $g(\cdot)$ projects student features into the teacher feature space before comparison.

Since the student commonly has different hidden dimensions, $g$ maps the student's representation into the teacher comparison space. Feature matching introduces extra decisions: which layers correspond, how the projection is trained, and whether its additional compute improves held-out outcomes. Hidden-state access is a richer interface, not a guarantee of a better student.
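
The feature loss above can be sketched for a single matched layer in PyTorch. The tensors here are random stand-ins for real hidden states, the dimensions are arbitrary, and $g$ is assumed to be a single linear layer. `F.mse_loss` averages over elements, so it differs from the summed squared norm in the formula by a constant factor.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, teacher_dim, student_dim = 2, 5, 16, 6

# Random stand-ins for one matched layer's hidden states (assumption).
teacher_hidden = torch.randn(batch, seq_len, teacher_dim)
student_hidden = torch.randn(batch, seq_len, student_dim, requires_grad=True)

# g(.): learned projection from the student width to the teacher width.
projection = nn.Linear(student_dim, teacher_dim)

def feature_loss(
    student_hidden: torch.Tensor,
    teacher_hidden: torch.Tensor,
    projection: nn.Module,
) -> torch.Tensor:
    projected = projection(student_hidden)
    # The teacher is frozen, so no gradient flows into its features.
    return F.mse_loss(projected, teacher_hidden.detach())

loss = feature_loss(student_hidden, teacher_hidden, projection)
loss.backward()

print("loss is scalar:", loss.ndim == 0)
print("projection receives gradient:", projection.weight.grad is not None)
print("student features receive gradient:", student_hidden.grad is not None)
```

The projection is trained jointly with the student and is usually discarded afterwards, so it adds training cost but no serving cost.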

Method	Teacher signal	Main advantage	Main constraint
Response KD	Selected text outputs	Works without white-box access	Teacher errors become SFT targets unless filtered
Logit KD	Token probabilities	Preserves distribution information	Requires aligned output space or an explicit mapping
Feature KD	Selected hidden states	Exposes intermediate representations	Needs layer/projection design and more storage or compute
On-policy KD	Teacher scores on student samples	Visits prefixes the student actually produces	Requires online sampling and teacher evaluation

Distillation method selector showing response, logit, and feature signals from different teacher access levels, while fixed versus on-policy sampling is a separate choice. — Teacher access chooses the transferable signal. Fixed versus on-policy sampling separately chooses which prefixes receive teacher supervision.

Forward versus reverse KL

When minimizing KL divergence for language generation, direction matters. Classical distillation commonly minimizes Forward KL (teacher || student), which penalizes a student for missing probability mass that the teacher assigns to continuations. When a small student cannot model the teacher distribution well, this pressure may be costly.

Reverse KL (student || teacher) places more pressure on probability mass the student assigns where the teacher assigns little. MiniLLM reports improvements over its studied standard-KD baselines using reverse KL with an on-policy optimization algorithm in instruction-following experiments.^[8] GKD evaluates multiple divergences and explicitly reports that the best divergence depends on the task and diversity-performance tradeoff.^[9]

Direction	Formula	Behavior	Common fit
Forward KL	$D_{KL}(P_{teacher} \| P_{student})$	Mean-seeking, covers more of the teacher distribution	Classic KD when broad coverage matters
Reverse KL	$D_{KL}(P_{student} \| P_{teacher})$	Penalizes student mass in teacher-low-probability regions	Candidate objective to evaluate for generation

No divergence is the default winner for every task. Measure task quality, diversity, calibration, and failure rates under the actual decoding setup.

kl-direction-diagnostic.py

import math

teacher = {"safe": 0.58, "alternate": 0.40, "bad": 0.02}
students = {
    "covers_teacher": {"safe": 0.54, "alternate": 0.36, "bad": 0.10},
    "adds_bad_mass": {"safe": 0.40, "alternate": 0.35, "bad": 0.25},
}

def kl(left: dict[str, float], right: dict[str, float]) -> float:
    return sum(prob * math.log(prob / right[token]) for token, prob in left.items())

for name, student in students.items():
    forward = kl(teacher, student)
    reverse = kl(student, teacher)
    print(name, "forward:", round(forward, 3), "reverse:", round(reverse, 3))

print("choose objective from evaluation, not slogan")

Output

covers_teacher forward: 0.051 reverse: 0.084
adds_bad_mass forward: 0.218 reverse: 0.436
choose objective from evaluation, not slogan

Off-policy versus on-policy distillation

Off-policy (standard) distillation trains the student on teacher outputs for a static dataset. During inference, the student conditions on its own generated tokens rather than on prefixes from that dataset, which creates a train-inference distribution shift. Errors can compound as the student drifts away from the prefixes it saw in training (exposure bias).

On-policy methods like Generalized Knowledge Distillation (GKD) sample sequences from the student, then compare student and teacher token distributions on the prefixes the student produced. GKD can mix fixed outputs and student-generated outputs through a student-data fraction $\lambda$; it does not require a natural-language critique.^[9] The tradeoff is computational: both student sampling and teacher scoring run during training. It is useful when fixed teacher data misses prefixes that the deployed student commonly enters, but the benefit must be measured per task.

on-policy-prefix-coverage.py

fixed_teacher_prefixes = {
    "return sealed electronics",
    "return unopened clothing",
}
student_generated_prefixes = {
    "return sealed electronics",
    "return opened final-sale electronics",
    "return item without receipt",
}

unseen_in_fixed_data = student_generated_prefixes - fixed_teacher_prefixes
teacher_scored_prefixes = fixed_teacher_prefixes | student_generated_prefixes

print("fixed prefixes:", len(fixed_teacher_prefixes))
print("student prefixes needing new teacher scores:", sorted(unseen_in_fixed_data))
print("scored after on-policy collection:", len(teacher_scored_prefixes))

Output

fixed prefixes: 2
student prefixes needing new teacher scores: ['return item without receipt', 'return opened final-sale electronics']
scored after on-policy collection: 4
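
The student-data fraction $\lambda$ described above can be sketched as a per-step coin flip between fixed outputs and fresh student samples. This illustrates the mixing idea only and is not the GKD authors' implementation: `choose_batch_outputs`, `fake_student_sample`, and the fixed outputs are hypothetical, and the teacher scoring that would follow is omitted.

```python
import random
from collections.abc import Callable

def choose_batch_outputs(
    prompts: list[str],
    fixed_outputs: dict[str, str],
    student_sample: Callable[[str], str],
    student_fraction: float,
    rng: random.Random,
) -> tuple[str, list[str]]:
    """With probability `student_fraction`, train on student samples for this step."""
    if rng.random() < student_fraction:
        return "student", [student_sample(prompt) for prompt in prompts]
    return "fixed", [fixed_outputs[prompt] for prompt in prompts]

def fake_student_sample(prompt: str) -> str:
    # Stand-in for sampling a continuation from the current student.
    return f"student draft for: {prompt}"

prompts = ["return sealed electronics", "return item without receipt"]
fixed_outputs = {prompt: "30 days" for prompt in prompts}

def count_sources(student_fraction: float, steps: int = 1000) -> dict[str, int]:
    rng = random.Random(0)
    counts = {"student": 0, "fixed": 0}
    for _ in range(steps):
        source, _ = choose_batch_outputs(
            prompts, fixed_outputs, fake_student_sample, student_fraction, rng
        )
        counts[source] += 1
    return counts

for fraction in (0.0, 0.5, 1.0):
    print("lambda:", fraction, "sources:", count_sources(fraction))
```

At $\lambda = 0$ this reduces to off-policy training on the fixed set, and at $\lambda = 1$ every step pays for student sampling plus teacher scoring, so $\lambda$ is effectively a compute knob as well as a data-mix knob.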

Temperature: expose rankings without trusting them

The temperature parameter $T$ controls how much we smooth the teacher distribution used in a logit loss. It changes which relative token preferences are visible to the loss; it does not guarantee that those preferences are useful or that the student can represent them.

Mathematically, the softmax function converts raw logits into probabilities using an exponential function. When one logit is significantly larger than the rest, the exponential makes it dominate the entire probability mass, crushing the others to near-zero. By dividing all logits by a temperature $T > 1$ before applying the exponential, we reduce the relative difference between the largest logit and the smaller ones. This prevents the top choice from monopolizing the probability space and allows the relative scores of the "incorrect" choices to emerge.

A worked example

Imagine the teacher sees the prompt "Return policy for electronics?" and produces these raw logits for the next token:

Token	Raw logit
`30`	4.0
`14`	2.0
`90`	1.0
`1`	0.7

At $T = 1$, the probabilities are sharp: 30 gets about 0.82, 14 gets 0.11, 90 gets 0.04, and 1 gets 0.03. The student sees only a weak signal that 90 is a more reasonable guess than 1.

At $T = 4$, after dividing each logit by 4 and applying softmax, the probabilities spread out: 30 gets about 0.40, 14 gets 0.24, 90 gets 0.19, and 1 gets 0.17. The distillation loss now exposes more of the teacher's ranking among alternatives.

Key insight: Temperature changes the training target, not the trusted answer. If the teacher ranks an invalid option highly, a softened loss makes that error easier to copy. Use labels or checks and held-out evaluation alongside KD.

Diagram showing the same distribution at T = 1 (sharp): 30 days 0.82, 14 days 0.11, 90 days 0.04, 1 year 0.03; and after raising the temperature to T = 4 (soft): 30 days 0.40, 14 days 0.24, 90 days 0.19, 1 year 0.17.

Low Temperature ($T \approx 1$): The distribution is sharp. The model is very confident in its top choice (30). The relationships between incorrect choices (14 vs 90) are hard to see because their probabilities are small and contribute little to the loss.
High Temperature ($T > 1$): The distribution flattens. Probability mass spreads to alternatives, making the teacher's relative ranking more visible to the loss.
Too High ($T \gg 10$): The distribution becomes nearly uniform. The ranking signal is lost, and the student learns little from the teacher distribution.

In practice, tune temperature rather than assuming a universal constant. The right value depends on the teacher distribution, soft-loss weight, task, and evaluation metrics. Inspect probability spreads as in the executable example above, then compare held-out quality.

A practical distillation training loop

A typical distillation training loop involves a frozen teacher model and a trainable student model. We forward the same input through both models and compute the combined loss. The local example below uses tiny PyTorch models so you can test the mechanics without downloading a real teacher.

a-practical-distillation-training-loop.py

import torch
from torch import nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.output(self.embedding(input_ids))

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    temperature = 3.0
    alpha = 0.7

    # Causal shift: logits at position t are trained against the token at t+1.
    shift_student = student_logits[:, :-1, :]
    shift_teacher = teacher_logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    # Flatten to (positions, vocab) so "batchmean" averages over token positions.
    student_flat = shift_student.reshape(-1, shift_student.size(-1))
    teacher_flat = shift_teacher.reshape(-1, shift_teacher.size(-1))
    labels_flat = shift_labels.reshape(-1)

    # Soft loss: KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    soft_teacher = F.softmax(teacher_flat / temperature, dim=-1)
    soft_student = F.log_softmax(student_flat / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    # Hard loss: cross-entropy against the shifted labels at temperature 1.
    hard_loss = F.cross_entropy(student_flat, labels_flat)
    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(0)
vocab_size = 12
teacher = TinyLM(vocab_size=vocab_size, hidden_size=16)
student = TinyLM(vocab_size=vocab_size, hidden_size=6)
teacher.requires_grad_(False)
teacher.eval()

input_ids = torch.tensor([
    [1, 2, 3, 4, 5],
    [1, 3, 5, 7, 9],
])
labels = input_ids.clone()
optimizer = torch.optim.AdamW(student.parameters(), lr=0.05)

with torch.no_grad():
    teacher_logits = teacher(input_ids)

before = kd_loss(student(input_ids), teacher_logits, labels)
optimizer.zero_grad()
before.backward()
has_grad = any(parameter.grad is not None for parameter in student.parameters())
optimizer.step()

after = kd_loss(student(input_ids), teacher_logits, labels)

print("before:", round(float(before), 4))
print("after:", round(float(after), 4))
print("has_grad:", has_grad)
print("after_is_finite:", bool(torch.isfinite(after)))
print("improved:", bool(after < before))

Output

before: 1.1513
after: 0.9793
has_grad: True
after_is_finite: True
improved: True

In production, the same pattern usually lives inside a framework trainer, with aligned-vocabulary models such as a larger Gemma teacher and smaller Gemma student when direct token-probability distillation is required. Teams may pre-compute some teacher signal offline to avoid running the teacher inside every student update. Dense next-token logits across a long corpus are costly to store, so a design may consider top-k logits, teacher responses, or online scoring, then measure the quality effect of compression. The payload-only estimate below ignores metadata and storage-format overhead, so treat it as a lower-bound sizing exercise.

logit-cache-budget.py

tokens = 50_000_000
vocab_size = 32_000
bytes_per_logit = 2  # bf16
top_k = 64
bytes_per_topk_item = 2 + 4  # bf16 value plus int32 token id

dense_bytes = tokens * vocab_size * bytes_per_logit
topk_bytes = tokens * top_k * bytes_per_topk_item
gib = 1024 ** 3

print("dense cache GiB:", round(dense_bytes / gib, 1))
print("top-k cache GiB:", round(topk_bytes / gib, 1))
print("storage reduction:", round(dense_bytes / topk_bytes, 1), "x")
print("quality must still be evaluated:", True)

Output

dense cache GiB: 2980.2
top-k cache GiB: 17.9
storage reduction: 166.7 x
quality must still be evaluated: True
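
The top-k option mentioned above could be consumed by a loss like the one sketched here. It is an approximation under stated assumptions, not a drop-in replacement for full-vocabulary KD: the teacher distribution is renormalized over the cached k tokens, the student is read only at those ids, and `topk_kd_loss` plus all tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def topk_kd_loss(
    student_logits: torch.Tensor,  # (positions, vocab)
    topk_values: torch.Tensor,     # (positions, k) cached teacher logits
    topk_ids: torch.Tensor,        # (positions, k) cached token ids
    temperature: float = 1.0,
) -> torch.Tensor:
    # Teacher distribution renormalized over the k cached tokens only.
    teacher_log_probs = F.log_softmax(topk_values.float() / temperature, dim=-1)
    teacher_probs = teacher_log_probs.exp()
    # Student log-probabilities over the full vocabulary, read at the cached ids.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    student_at_topk = student_log_probs.gather(-1, topk_ids.long())
    per_position = (teacher_probs * (teacher_log_probs - student_at_topk)).sum(dim=-1)
    return per_position.mean() * temperature ** 2

torch.manual_seed(0)
positions, vocab, top_k = 6, 32, 4
teacher_logits = torch.randn(positions, vocab)
student_logits = torch.randn(positions, vocab, requires_grad=True)

# Build the cache the way the sizing estimate assumes: bf16 values, int32 ids.
values, ids = torch.topk(teacher_logits, k=top_k, dim=-1)
cached_values = values.to(torch.bfloat16)
cached_ids = ids.to(torch.int32)

loss = topk_kd_loss(student_logits, cached_values, cached_ids, temperature=2.0)
loss.backward()

print("top-k KD loss is finite:", bool(torch.isfinite(loss)))
print("student grad shape:", tuple(student_logits.grad.shape))
```

When k equals the vocabulary size this reduces to the full softened KL term. For smaller k the discarded tail is exactly the information whose value has to be measured on held-out data.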

If an online-distillation pilot is feasible, compare it with an offline baseline before committing to large-scale data generation. The result can expose whether fresh teacher scoring is worth its compute cost for this task.

Distilled models in practice

When the teacher is only available through generated text, response distillation becomes the default. When you control both models and can inspect logits or hidden states, white-box distillation exposes a richer training signal. Recent systems use both patterns depending on what the teacher exposes.

What changes from system to system isn't the core idea, but the supervision channel: raw logits, hidden states, instruction-response pairs, rationale traces, or synthetic corpora generated by a stronger model.

Student / recipe	Teacher	Signal transferred	Why it matters
Alpaca 7B^[5]	`text-davinci-003`	52K generated instruction-response examples	The repository reports preliminary instruction-following evaluation and clear non-commercial dataset terms.
Orca 13B^[3]	GPT-4 + ChatGPT	Explanation traces and task instructions	Evaluates a richer generated-trace training recipe, rather than logit KD.
phi-1.5^[4]	Existing LLMs + curated synthetic data	Textbook-like synthetic corpora	Adjacent synthetic-data recipe, not a direct teacher-distribution KD comparison.
Gemma 2 2B / 9B^[2]	Larger Gemma teachers	Token-probability distillation during pretraining	Reports a controlled 2B distilled-versus-from-scratch ablation.
DeepSeek-R1-Distill 1.5B-70B^[7]	DeepSeek-R1	Roughly 800K selected SFT examples	Reports results from text-target fine-tuning; transfer remains benchmark-scoped.

Building a distillation dataset

When using response-based distillation, the selected dataset is one of the main controllable inputs, alongside student capacity and training budget. Build a generation and selection pipeline that can reject incorrect, duplicate, contaminated, or irrelevant examples.

Seed-Expand-Filter pipeline

A structured approach to generating teacher data involves a Seed-Expand-Filter pipeline. It does not ensure quality by itself; each filter needs a measurable contract and a separate evaluation split.

Seed: Start with a small set of high-quality, human-written prompts (e.g., 100 examples about warehouse operations).
Expand: Ask the teacher model to generate new, diverse variations of these prompts.
Generate: Have the teacher answer these new prompts, often with rationales or decomposed steps when richer supervision helps.
Filter: Use checks, deduplication, safety screening, or reviewed scoring rules to reject unsuitable generations.

Seed-expand-filter distillation data pipeline: human seed prompts, teacher prompt expansion, teacher response generation, measurable filtering, and selected student targets. — For response distillation, selection is part of model design. Each stage should increase coverage or reject measurable failure modes before examples reach the student.

This is close to the spirit of Alpaca's Self-Instruct-style pipeline and Orca's richer explanation-trace data generation, even though real systems add deduplication, safety filters, and task balancing.^[5]^[3]

This logic can be implemented by creating a small wrapper around a teacher client. The example below focuses on two easy-to-miss requirements: deduplicate prompts before paying for generation, and verify teacher answers before they become student targets.

select-teacher-responses.py

from collections.abc import Callable

class DistillationDataGenerator:
    def __init__(
        self,
        teacher_generate: Callable[[str], str],
        verify_response: Callable[[str, str], bool],
    ):
        self.teacher_generate = teacher_generate
        self.verify_response = verify_response

    def generate_dataset(self, prompts: list[str]) -> list[dict[str, str]]:
        selected: list[dict[str, str]] = []
        seen: set[str] = set()
        for prompt in prompts:
            normalized = " ".join(prompt.lower().split())
            if normalized in seen:
                continue
            seen.add(normalized)
            response = self.teacher_generate(normalized).strip()
            if self.verify_response(normalized, response):
                selected.append({"prompt": normalized, "response": response})
        return selected

def fake_teacher(prompt: str) -> str:
    return "30 days"

trusted_answers = {
    "sealed electronics return": "30 days",
    "final-sale electronics return": "not eligible",
}

def verify_response(prompt: str, response: str) -> bool:
    # Prompts without a trusted answer are rejected instead of raising KeyError.
    return trusted_answers.get(prompt) == response

generator = DistillationDataGenerator(fake_teacher, verify_response)
examples = generator.generate_dataset([
    "sealed electronics return",
    " Sealed electronics  return ",
    "final-sale electronics return",
])

print("selected prompts:", [example["prompt"] for example in examples])
print("selected count:", len(examples))
print("bad response retained:", any("final-sale" in example["prompt"] for example in examples))

Output

selected prompts: ['sealed electronics return']
selected count: 1
bad response retained: False

Keep training data out of evaluation

Teacher generation can quietly contaminate a benchmark when prompts, reference solutions, or close rewrites enter the student training set. At minimum, block exact normalized overlap before training. For real releases, extend the gate with near-duplicate and reference-solution checks.

held-out-contamination-gate.py

def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().replace("?", "").split())

candidate_training_prompts = [
    "Compute shipping refund for a late parcel",
    "Return sealed electronics within 30 days",
    "Can a FINAL-SALE item be returned?",
]
held_out_prompts = [
    "can a final-sale item be returned",
    "Estimate delivery window for a remote zip code",
]

held_out_keys = {normalize(prompt) for prompt in held_out_prompts}
accepted = [
    prompt for prompt in candidate_training_prompts
    if normalize(prompt) not in held_out_keys
]
blocked = [
    prompt for prompt in candidate_training_prompts
    if normalize(prompt) in held_out_keys
]

print("accepted training prompts:", len(accepted))
print("blocked overlap:", blocked)
print("held-out exact overlap after gate:", any(normalize(p) in held_out_keys for p in accepted))

Output

accepted training prompts: 2
blocked overlap: ['Can a FINAL-SALE item be returned?']
held-out exact overlap after gate: False
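
The exact-match gate above does not catch close rewrites. The following is a minimal sketch of a near-duplicate check, assuming word-level 3-gram Jaccard similarity and a hypothetical threshold of 0.5. The function names, the threshold, and the example prompts are illustrative assumptions; production pipelines often use MinHash or embedding similarity, and any threshold should be tuned on labeled pairs.

```python
import re

def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    # Lowercase, drop punctuation, and split into words before building n-grams.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(left: set, right: set) -> float:
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)

def near_duplicates(candidates: list[str], held_out: list[str], threshold: float = 0.5):
    flagged = []
    for candidate in candidates:
        grams = word_ngrams(candidate)
        # Highest similarity between this candidate and any held-out prompt.
        best = max((jaccard(grams, word_ngrams(h)) for h in held_out), default=0.0)
        if best >= threshold:
            flagged.append((candidate, round(best, 2)))
    return flagged

held_out_prompts = ["can a final-sale item be returned"]
candidate_training_prompts = [
    "Can a final-sale item be returned after purchase?",
    "Compute shipping refund for a late parcel",
]
flagged = near_duplicates(candidate_training_prompts, held_out_prompts)
print("near-duplicate candidates:", flagged)
```

The rewritten final-sale prompt shares five of seven trigrams with the held-out prompt, so it is flagged even though exact normalized matching would accept it.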

Limitations and when not to distill

Distillation does not make capacity, context-window, or data-coverage constraints disappear. A student may beat its teacher on a narrow checked metric after filtering or task-specific training, while regressing on other behavior. Treat the teacher and student as separate artifacts to evaluate.

Before investing in a distillation pipeline, define which behaviors matter and how they will be tested.

Behavior	Regression risk to test	Useful held-out gate
Domain answers	Generated targets can repeat teacher errors	Checked answer accuracy and abstention rate
Instruction following	Narrow traces can miss new constraints	Fresh constraint-following prompts
Multi-step solutions	Final answers can hide invalid steps	Step checks where available plus final-answer accuracy
Long-context use	Student architecture or context limit may differ	Retrieval and long-context slices at deployment length
Safety and policy behavior	Filtered corpus may omit refusals or edge cases	Safety-policy evaluation separate from task benchmark
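
One way to make the gates in this table concrete is to score each behavior slice separately, so an aggregate average cannot hide a regression. The sketch below assumes hypothetical slice names, illustrative pass/fail results, and per-slice minimum pass rates chosen only for demonstration.

```python
# Illustrative per-example results: True means the student passed the check.
student_results = {
    "domain_answers": [True, True, True, False],
    "long_context": [True, False, False, True],
    "safety_policy": [True, True, True, True],
}
# Hypothetical minimum pass rates, fixed before training starts.
min_pass_rate = {"domain_answers": 0.75, "long_context": 0.75, "safety_policy": 1.0}

def slice_report(results: dict[str, list[bool]], thresholds: dict[str, float]) -> dict[str, dict]:
    report = {}
    for name, outcomes in results.items():
        rate = sum(outcomes) / len(outcomes)
        report[name] = {"pass_rate": rate, "ok": rate >= thresholds[name]}
    return report

report = slice_report(student_results, min_pass_rate)
total_passed = sum(sum(outcomes) for outcomes in student_results.values())
total_examples = sum(len(outcomes) for outcomes in student_results.values())
overall = total_passed / total_examples
failing = [name for name, row in report.items() if not row["ok"]]

print("overall pass rate:", round(overall, 3))
print("failing slices:", failing)
```

Here the overall pass rate looks acceptable while the long-context slice fails its own threshold, which is the situation a per-slice gate is meant to expose.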

Legal and ethical considerations

Beyond technical constraints, distillation raises licensing and evaluation questions. Because the student is trained on teacher-generated outputs, the provenance and permitted use of that training data matter.

Provider terms matter: the Stanford Alpaca release was research-only and non-commercial, and the repo points to both the underlying LLaMA restrictions and the dataset's CC BY-NC 4.0 terms.^[5]
Restrictions must be reviewed: before generating a corpus or shipping a student, review the teacher access terms, base-student license, generated-data license, and permitted use of outputs. Do not infer permission from technical access.
Imitation isn't capability proof: a student may reproduce style or familiar output patterns while failing new checked tasks. Held-out evaluation, not resemblance, establishes value.

Cost-quality tradeoff

Deciding whether to distill, and which method to use, comes down to measured quality and economics. Richer teacher access can enable different losses; it does not rank final models without evaluation.

Approach	Required access	Main training cost	Release gate
Use teacher directly	Teacher inference	No student training	Baseline quality, latency, and cost
Response KD	Generated outputs and permitted use	Generation plus SFT	Output filtering and held-out task quality
Logit KD	Aligned teacher token probabilities	Teacher scoring or cache storage	Task quality plus cache/online cost
Feature KD	Hidden states and layer mapping	Extra projections and state transfer	Ablation against simpler KD baseline

Also, don't stop at distillation loss. A student can match teacher probabilities on training batches and still regress on held-out generation quality, long-context behavior, or latency targets. Measure task metrics, pairwise win rate, and real serving cost together.

deployment-gate.py

teacher = {"checked_accuracy": 0.94, "policy_error_rate": 0.01, "latency_ms": 180}
student = {"checked_accuracy": 0.92, "policy_error_rate": 0.04, "latency_ms": 42}
requirements = {"checked_accuracy": 0.90, "max_policy_error_rate": 0.02, "max_latency_ms": 60}

checks = {
    "quality": student["checked_accuracy"] >= requirements["checked_accuracy"],
    "policy": student["policy_error_rate"] <= requirements["max_policy_error_rate"],
    "latency": student["latency_ms"] <= requirements["max_latency_ms"],
}

print("student faster:", student["latency_ms"] < teacher["latency_ms"])
print("release checks:", checks)
print("deploy student:", all(checks.values()))

Output

student faster: True
release checks: {'quality': True, 'policy': False, 'latency': True}
deploy student: False
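
The paragraph before the gate also names pairwise win rate and serving cost. The sketch below shows one way to compute both, assuming hypothetical judged comparisons (each labeled `student`, `teacher`, or `tie`) and made-up per-request costs. Counting a tie as half a win is one common convention, not the only one.

```python
from collections import Counter

# Hypothetical judged outcomes for student vs. teacher on held-out prompts.
judgments = ["student", "teacher", "tie", "teacher", "student", "tie", "teacher", "tie"]

def win_rate(outcomes: list[str], side: str = "student") -> float:
    counts = Counter(outcomes)
    # Ties count as half a win for each side.
    return (counts[side] + 0.5 * counts["tie"]) / len(outcomes)

# Illustrative costs in dollars per 1,000 requests (assumed, not measured).
cost_per_1k = {"teacher": 12.0, "student": 1.5}
requests_per_day = 200_000

student_win_rate = win_rate(judgments)
daily_savings = (cost_per_1k["teacher"] - cost_per_1k["student"]) * requests_per_day / 1000

print("student win rate:", student_win_rate)
print("daily savings ($):", daily_savings)
```

Reporting the two numbers side by side keeps the tradeoff explicit: a win rate below 0.5 may still be acceptable if the release gates pass and the savings are large, but that decision should be made against thresholds set before training.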

Practice checkpoints

How does temperature affect distillation? Temperature controls how soft the teacher distribution becomes. A higher $T$ spreads probability mass across alternatives, exposing more of the teacher ranking to the loss. If $T$ is too high, the distribution becomes nearly uniform and the ranking signal is lost. Any setting still needs held-out validation.

Can you distill from a proprietary API model? If its permitted interface and terms allow generated outputs for training, the available KD channel is response distillation. You generate candidate prompt-response pairs or checked solution traces and fine-tune the student on selected text. You don't get token probabilities, so filtering, use rights, and held-out evaluation are central.

When does distillation fail to transfer capability? It fails on a target behavior when student capacity, context length, architecture, or training coverage cannot reproduce that behavior at the required threshold. Define slices such as long-context tasks, policy edge cases, and checked multi-step solutions so the failure is visible.

How do you decide whether to deploy a distilled student? Set quality, policy, latency, and cost gates before training. Lower latency is not enough if a student fails a policy or accuracy gate, as the executable deployment example shows.

Can you directly distill between different tokenizers? Not with plain token-level KL unless token-to-id output mappings agree. Same vocabulary size is not enough. If output spaces differ, use response distillation or design and validate an explicit mapping.
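
The tokenizer check in the last answer can be sketched as a comparison of full token-to-id dictionaries. The two vocabularies below are hypothetical toy mappings. With Hugging Face tokenizers, `get_vocab()` returns a mapping of this shape, though special tokens and any padded output rows in the model may also need checking.

```python
def compare_vocabularies(student: dict[str, int], teacher: dict[str, int]) -> dict[str, object]:
    shared = student.keys() & teacher.keys()
    # Tokens present in both vocabularies but assigned different ids.
    id_conflicts = sorted(token for token in shared if student[token] != teacher[token])
    return {
        "same_size": len(student) == len(teacher),
        "only_in_student": sorted(student.keys() - teacher.keys()),
        "only_in_teacher": sorted(teacher.keys() - student.keys()),
        "id_conflicts": id_conflicts,
        # Plain token-level KL is only meaningful when the mappings are identical.
        "logit_kd_safe": student == teacher,
    }

# Toy vocabularies: same size and same tokens, but two ids are swapped.
student_vocab = {"<pad>": 0, "return": 1, "policy": 2, "days": 3}
teacher_vocab = {"<pad>": 0, "policy": 1, "return": 2, "days": 3}

report = compare_vocabularies(student_vocab, teacher_vocab)
print(report)
```

The sizes match and no token is missing, yet two ids are swapped, so a size-only check would pass while the KL term would compare unrelated tokens.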

Check your understanding

Before moving on, try to answer these questions without looking back at the article.

Why does a soft label teach more than a hard label?
Hint: Think about what the student learns about the near-miss answers, not only the correct one.
Why might $T = 1$ provide less ranking signal than a higher temperature?
Hint: Consider what a sharp teacher distribution exposes about alternatives.
Why do causal language models need a one-token shift when computing distillation loss?
Hint: Remember that a causal LM predicts the next token from all previous tokens.
Why should forward versus reverse KL be chosen through evaluation?
Hint: Think about mode coverage, low-teacher-probability mass, and task-dependent diversity requirements.

Solution sketches

A hard label only tells the student which answer is selected. A soft label exposes the teacher's full ranking over alternatives. That extra signal can help, but a trusted check still has to catch bad teacher rankings.
If the teacher distribution is sharp at $T = 1$, the top token receives most of the probability mass and alternatives contribute little signal. Raising temperature can expose their relative ranking, until excessive smoothing removes useful separation.
In a causal LM, the logits produced at position $t$ predict the token at position $t+1$. If you don't shift the labels by one, the hard-label loss trains the model to predict the current token from itself, which is trivial and wrong. The KL term compares teacher and student logits at the same position, so for that term the shift keeps the ignore mask aligned with the positions that are actually scored.
Forward KL penalizes missing teacher mass; reverse KL penalizes student mass where the teacher is low. Their quality and diversity tradeoffs vary by task and decoding setup, so evaluate both where the objective choice matters.

What to remember

Core concept: Distillation transfers teacher signal, not guaranteed truth. Keep trusted labels, checks, and held-out evaluation in the design.
Methods: Response KD uses selected generated text. Logit KD uses aligned token probabilities. Feature KD adds hidden-state comparisons. On-policy KD scores prefixes produced by the student.
Temperature: Softening can expose more of a teacher ranking; too much removes separation. With the classical soft-loss formulation, scale KL by $T^2$ to preserve gradient magnitude.
Loss function: For causal-LM logit KD, shift positions correctly, ignore masked positions, and verify identical token-to-id output mappings. Divergence direction is an evaluated design choice.
On-policy versus off-policy: Static teacher targets may miss prefixes the student produces. GKD adds student-generated sequences and teacher token-distribution scoring at additional compute cost.
Data and release gates: Deduplicate and check generated targets, protect evaluation splits, and release only if quality, policy, latency, and cost gates pass.

Common pitfalls

Symptom	Cause	Fix
Validation loss barely changes as you raise temperature.	The softened teacher distribution may be too flat or the soft-loss weight may be ineffective.	Inspect teacher probabilities and tune temperature and loss weight on held-out tasks.
Student looks strong on training prompts but weak on held-out tasks.	Distillation corpus is too narrow, repetitive, or too close to evaluation data.	Broaden prompt coverage, filter duplicates, and keep a separate held-out evaluation slice.
Student predicts current token instead of next token during logit KD.	Causal LM loss forgot the one-token shift.	Shift logits at position $t$ against labels at position $t+1$ before KL or cross-entropy.
Student copies teacher hallucinations and policy mistakes.	Distillation blindly transferred bad teacher outputs.	Filter teacher generations, add task loss, and evaluate against trusted labels or reward checks.
KL loss runs but student quality stays random.	Teacher and student token-to-id mappings do not align, even if sizes match.	Compare mappings exactly, use response distillation, or design an explicit output-space mapping.
Tiny student misses required checked behaviors.	Student capacity, context, or data coverage is insufficient for this release target.	Narrow task scope, increase student size, or revise training and evaluation design.
Offline metrics look great but production quality collapses.	Distillation and evaluation data leaked into each other.	Split generation, tuning, and evaluation sets cleanly before training starts.

What you should be able to defend

By the end of this chapter, you should be able to:

Explain why soft labels carry more information than hard labels.
Choose between response, logit, feature, and on-policy distillation based on available teacher signal and measurable costs.
Implement causal-LM logit distillation with one-token label shifting and vocabulary checks.
Explain why forward KL and reverse KL behave differently for generative models.
Design a generated-data pipeline with checks, deduplication, held-out protection, and licensing clarity.
Decide whether a distilled student passes quality, policy, latency, and cost release gates.

Next Step

Continue to Model Merging and Weight Interpolation

Distillation trains a new student from teacher signal. Model merging asks whether compatible checkpoints can be combined into one candidate without another gradient-training run.

References

[1] Distilling the Knowledge in a Neural Network. Hinton, G., Vinyals, O., & Dean, J. · 2015
[2] Gemma 2: Improving Open Language Models at a Practical Size. Gemma Team, Google DeepMind · 2024
[3] Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Mukherjee, S., et al. · 2023
[4] Textbooks Are All You Need II: phi-1.5 technical report. Li, Y., et al. · 2023
[5] Stanford Alpaca: An Instruction-following LLaMA Model. Taori, R., et al. · 2023 · GitHub
[6] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data. Hsieh, C., et al. · 2023
[7] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI · 2025
[8] MiniLLM: On-Policy Distillation of Large Language Models. Gu, Y., et al. · 2024
[9] On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. Agarwal, R., et al. · 2024

Why soft labels teach more than hard labels

$\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{distill}} + (1 - \alpha) \cdot \mathcal{L}_{\text{task}}$

A concrete example

temperature-softening.py

import math

logits = {"30": 4.0, "14": 2.0, "90": 1.0, "1": 0.7}

def softened_probabilities(temperature: float) -> dict[str, float]:
    scaled = {token: math.exp(logit / temperature) for token, logit in logits.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}

for temperature in (1.0, 4.0, 40.0):
    probs = softened_probabilities(temperature)
    rounded = {token: round(probability, 3) for token, probability in probs.items()}
    top_gap = probs["30"] - probs["14"]
    print(f"T={temperature:g} probabilities:", rounded, "top_gap:", round(top_gap, 3))

Output

T=1 probabilities: {'30': 0.818, '14': 0.111, '90': 0.041, '1': 0.03} top_gap: 0.708
T=4 probabilities: {'30': 0.397, '14': 0.241, '90': 0.188, '1': 0.174} top_gap: 0.156
T=40 probabilities: {'30': 0.263, '14': 0.25, '90': 0.244, '1': 0.242} top_gap: 0.013

When distillation beats training from scratch

Matching the teacher's probabilities: logit distillation

First, we apply temperature scaling (dividing logits by a temperature $T > 1$ before softmax to soften the probability distribution) to both models' logits to get softened probability distributions:

$q_i = \frac{\exp(z_{t,i} / T)}{\sum_j \exp(z_{t,j} / T)}, \quad p_i = \frac{\exp(z_{s,i} / T)}{\sum_j \exp(z_{s,j} / T)}$

Where $z_t$ are the teacher's logits, $z_s$ are the student's logits, and $T$ is the temperature. Then we compute the KL divergence loss:

$\mathcal{L}_{\text{KL}} = T^2 \sum_i q_i \log \frac{q_i}{p_i}$

Reading the formula

$q_i$ is the teacher's softened probability for token $i$ (the target we want the student to match)
$p_i$ is the student's softened probability for token $i$ (what the student currently predicts)
$T$ is the temperature (higher $T$ creates a softer, more uniform distribution)
The $T^2$ scaling factor compensates for gradient scaling: as temperature increases, gradients from soft targets scale down by approximately $1/T^2$. Multiplying by $T^2$ keeps the relative contribution from soft targets roughly stable as you tune temperature. Think of it like turning up the volume on a quiet signal so it can compete with the loud one.
KL divergence measures how much information is lost when using the student's distribution $p$ to approximate the teacher's distribution $q$.
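
The $1/T^2$ claim can be checked numerically. The sketch below uses the analytic gradient of $\sum_i q_i \log (q_i / p_i)$ with respect to the student logits, which is $(p_i - q_i)/T$, and compares its norm at two temperatures. The student logits are made-up values for illustration, and the approximation only holds when $T$ is large relative to the logit spread.

```python
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_grad_norm(student_logits: list[float], teacher_logits: list[float], temperature: float) -> float:
    # Analytic gradient of KL(q || p) w.r.t. student logits: (p - q) / T.
    q = softmax(teacher_logits, temperature)
    p = softmax(student_logits, temperature)
    grad = [(p_i - q_i) / temperature for p_i, q_i in zip(p, q)]
    return math.sqrt(sum(g * g for g in grad))

teacher_logits = [4.0, 2.0, 1.0, 0.7]
student_logits = [3.0, 2.5, 0.5, 1.0]  # illustrative values

norms = {T: kl_grad_norm(student_logits, teacher_logits, T) for T in (8.0, 16.0)}
raw_ratio = norms[8.0] / norms[16.0]
scaled_ratio = (norms[8.0] * 8.0 ** 2) / (norms[16.0] * 16.0 ** 2)

print("raw gradient ratio, T=8 vs T=16:", round(raw_ratio, 2))
print("ratio after multiplying by T^2:", round(scaled_ratio, 2))
```

Doubling the temperature shrinks the raw gradient norm by roughly a factor of four, while the $T^2$-scaled norms stay close to each other.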

Common mistake: Running logit distillation without verifying tokenizer alignment. Two models can have the same vocabulary size and different token-id mappings. Compare the complete output mapping, not only vocab_size, before training.

reading-the-formula.py

import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    student_vocabulary: tuple[str, ...],
    teacher_vocabulary: tuple[str, ...],
    temperature: float = 3.0,
    alpha: float = 0.5,
    ignore_index: int = -100,
) -> torch.Tensor:
    """
    Computes the weighted sum of Knowledge Distillation (KD) loss and Cross-Entropy loss.
    """
    if student_vocabulary != teacher_vocabulary:
        raise ValueError(
            "Logit KD requires identical token-to-id mappings. "
            "Use response KD or design an explicit mapping when output spaces differ."
        )
    if student_logits.size(-1) != teacher_logits.size(-1) or student_logits.size(-1) != len(student_vocabulary):
        raise ValueError("Logit tensors and vocabulary dimensions must agree.")

    # Causal LMs predict token t+1 from positions up to t.
    shift_student = student_logits[..., :-1, :].contiguous()
    shift_teacher = teacher_logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    vocab_size = shift_student.size(-1)
    flat_student = shift_student.reshape(-1, vocab_size)
    flat_teacher = shift_teacher.reshape(-1, vocab_size)
    flat_labels = shift_labels.reshape(-1)

    valid_mask = flat_labels != ignore_index
    if not valid_mask.any():
        return student_logits.sum() * 0

    student_valid = flat_student[valid_mask]
    teacher_valid = flat_teacher[valid_mask]
    labels_valid = flat_labels[valid_mask]

    # Soft loss: KL divergence between softened distributions
    soft_teacher = F.softmax(teacher_valid.detach() / temperature, dim=-1)
    soft_student = F.log_softmax(student_valid / temperature, dim=-1)

    # KLDivLoss expects log-probabilities for the input (student)
    # and standard probabilities for the target (teacher)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    soft_loss *= temperature ** 2  # Scale by T² for gradient magnitude

    # Hard loss: standard cross-entropy with ground truth
    hard_loss = F.cross_entropy(student_valid, labels_valid)

    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(7)
batch, seq_len, vocab = 2, 5, 8
student_logits = torch.nn.Parameter(torch.randn(batch, seq_len, vocab))
teacher_logits = torch.randn(batch, seq_len, vocab)
labels = torch.tensor([
    [0, 1, 2, 3, 4],
    [0, 2, -100, 5, 6],
])
vocabulary = tuple(f"token_{index}" for index in range(vocab))

loss = distillation_loss(
    student_logits,
    teacher_logits,
    labels,
    vocabulary,
    vocabulary,
    temperature=3.0,
    alpha=0.6,
)
loss.backward()

loss_is_scalar = loss.ndim == 0
loss_is_finite = bool(torch.isfinite(loss))
grad_exists = student_logits.grad is not None
grad_is_finite = bool(torch.isfinite(student_logits.grad).all()) if grad_exists else False
mismatch_failed = False

try:
    distillation_loss(student_logits, teacher_logits, labels, vocabulary, tuple(reversed(vocabulary)))
except ValueError as exc:
    mismatch_failed = "token-to-id mappings" in str(exc)

print("loss:", round(float(loss), 4))
print("grad norm:", round(float(student_logits.grad.norm()), 4))
print("loss_is_scalar:", loss_is_scalar)
print("loss_is_finite:", loss_is_finite)
print("grad_is_finite:", grad_is_finite)
print("mismatch failed:", mismatch_failed)

Output

loss: 1.4959
grad norm: 0.1891
loss_is_scalar: True
loss_is_finite: True
grad_is_finite: True
mismatch failed: True

When you only have text: response distillation

Examples and adjacent synthetic-data recipes

Alpaca^[5]: Fine-tuned LLaMA 7B on 52K instruction-following examples generated by text-davinci-003.
Orca^[3]: Learned from GPT-4 explanation traces plus guidance from ChatGPT, moving beyond shallow answer imitation.
phi-1.5^[4]: Not classic KD, but an adjacent synthetic-data recipe built from textbook-like generated data; it does not compare KD loss choices.
Distilling Step-by-Step^[6]: Uses generated rationales as an additional supervised target and evaluates whether smaller students improve on the studied tasks.
DeepSeek-R1-Distill^[7]: Fine-tunes Qwen2.5- and Llama-based students (1.5B to 70B) for two to three epochs on the paper's roughly 800K-example SFT collection. The paper reports strong results for these distilled models on selected reasoning benchmarks; it does not prove that every teacher behavior transfers through text.

teacher-output-gate.py

generated = [
    {"prompt": "sealed electronics return", "teacher": "30 days", "verified": "30 days"},
    {"prompt": "final-sale item return", "teacher": "30 days", "verified": "not eligible"},
    {"prompt": "defective item warranty", "teacher": "warranty process", "verified": "warranty process"},
]

accepted = [
    example for example in generated
    if example["teacher"] == example["verified"]
]
rejected = [
    example["prompt"] for example in generated
    if example["teacher"] != example["verified"]
]

print("generated:", len(generated))
print("accepted:", len(accepted))
print("rejected prompts:", rejected)
print("teacher text is trusted label:", len(rejected) == 0)

Output

generated: 3
accepted: 2
rejected prompts: ['final-sale item return']
teacher text is trusted label: False

Aligning internal layers: feature distillation

$\mathcal{L}_{\text{feature}} = \sum_l \|f_l^{\text{teacher}} - g(f_l^{\text{student}})\|^2$

Where $f_l^{\text{teacher}}$ and $f_l^{\text{student}}$ are hidden states at layer $l$, and $g(\cdot)$ projects student features into the teacher feature space before comparison.
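
The feature loss can be sketched in plain Python. This toy version assumes two already-matched layers, a teacher width of 3, a student width of 2, and fixed projection matrices. In practice $g(\cdot)$ is usually a learned layer trained together with the student, and the layer pairing is itself a design choice.

```python
def project(weights: list[list[float]], vector: list[float]) -> list[float]:
    # Linear projection g(.) from student width to teacher width.
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def feature_loss(
    teacher_states: list[list[float]],
    student_states: list[list[float]],
    projections: list[list[list[float]]],
) -> float:
    total = 0.0
    for f_teacher, f_student, weights in zip(teacher_states, student_states, projections):
        mapped = project(weights, f_student)
        # Squared L2 distance between the teacher state and the projected student state.
        total += sum((t - m) ** 2 for t, m in zip(f_teacher, mapped))
    return total

# Toy hidden states for two matched layers (illustrative values).
teacher_states = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]
student_states = [[1.0, 2.0], [0.5, -1.0]]
# Fixed 3x2 projections; a real setup would learn these.
projections = [
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
]

loss = feature_loss(teacher_states, student_states, projections)
print("feature loss:", loss)
```

The first layer matches exactly after projection, so the whole loss comes from the second layer, where the teacher's middle feature has no counterpart in the projected student state.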

Method	Teacher signal	Main advantage	Main constraint
Response KD	Selected text outputs	Works without white-box access	Teacher errors become SFT targets unless filtered
Logit KD	Token probabilities	Preserves distribution information	Requires aligned output space or an explicit mapping
Feature KD	Selected hidden states	Exposes intermediate representations	Needs layer/projection design and more storage or compute
On-policy KD	Teacher scores on student samples	Visits prefixes the student actually produces	Requires online sampling and teacher evaluation

Forward versus reverse KL

Direction	Formula	Behavior	Common fit
Forward KL	$D_{\text{KL}}(P_{\text{teacher}} \,\|\, P_{\text{student}})$	Mean-seeking, covers more of the teacher distribution	Classic KD when broad coverage matters
Reverse KL	$D_{\text{KL}}(P_{\text{student}} \,\|\, P_{\text{teacher}})$	Penalizes student mass in teacher-low-probability regions	Candidate objective to evaluate for generation

No divergence is the default winner for every task. Measure task quality, diversity, calibration, and failure rates under the actual decoding setup.

kl-direction-diagnostic.py

import math

teacher = {"safe": 0.58, "alternate": 0.40, "bad": 0.02}
students = {
    "covers_teacher": {"safe": 0.54, "alternate": 0.36, "bad": 0.10},
    "adds_bad_mass": {"safe": 0.40, "alternate": 0.35, "bad": 0.25},
}

def kl(left: dict[str, float], right: dict[str, float]) -> float:
    return sum(prob * math.log(prob / right[token]) for token, prob in left.items())

for name, student in students.items():
    forward = kl(teacher, student)
    reverse = kl(student, teacher)
    print(name, "forward:", round(forward, 3), "reverse:", round(reverse, 3))

print("choose objective from evaluation, not slogan")

Output

covers_teacher forward: 0.051 reverse: 0.084
adds_bad_mass forward: 0.218 reverse: 0.436
choose objective from evaluation, not slogan

Off-policy versus on-policy distillation

on-policy-prefix-coverage.py

fixed_teacher_prefixes = {
    "return sealed electronics",
    "return unopened clothing",
}
student_generated_prefixes = {
    "return sealed electronics",
    "return opened final-sale electronics",
    "return item without receipt",
}

unseen_in_fixed_data = student_generated_prefixes - fixed_teacher_prefixes
teacher_scored_prefixes = fixed_teacher_prefixes | student_generated_prefixes

print("fixed prefixes:", len(fixed_teacher_prefixes))
print("student prefixes needing new teacher scores:", sorted(unseen_in_fixed_data))
print("scored after on-policy collection:", len(teacher_scored_prefixes))

Output

fixed prefixes: 2
student prefixes needing new teacher scores: ['return item without receipt', 'return opened final-sale electronics']
scored after on-policy collection: 4

Temperature: expose rankings without trusting them

A worked example

Imagine the teacher sees the prompt "Return policy for electronics?" and produces these raw logits for the next token:

Token	Raw logit
`30`	4.0
`14`	2.0
`90`	1.0
`1`	0.7

At $T = 1$, the probabilities are sharp: 30 gets about 0.82, 14 gets 0.11, 90 gets 0.04, and 1 gets 0.03. The student sees only a weak signal that 90 is a more reasonable guess than 1.

Key insight: Temperature changes the training target, not the trusted answer. If the teacher ranks an invalid option highly, a softened loss makes that error easier to copy. Use labels or checks and held-out evaluation alongside KD.

Low Temperature ($T \approx 1$): The distribution is sharp. The model is very confident in its top choice (30). The relationships between incorrect choices (14 vs 90) are hard to see because their probabilities are small.
High Temperature ($T > 1$): The distribution flattens. Probability mass spreads to alternatives, making the teacher's relative ranking more visible to the loss.
Too High ($T \gg 10$): The distribution becomes nearly uniform. The ranking signal is lost, and the student learns little from the teacher distribution.

A practical distillation training loop

a-practical-distillation-training-loop.py

import torch
from torch import nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.output(self.embedding(input_ids))

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    temperature = 3.0
    alpha = 0.7

    shift_student = student_logits[:, :-1, :]
    shift_teacher = teacher_logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    student_flat = shift_student.reshape(-1, shift_student.size(-1))
    teacher_flat = shift_teacher.reshape(-1, shift_teacher.size(-1))
    labels_flat = shift_labels.reshape(-1)

    soft_teacher = F.softmax(teacher_flat / temperature, dim=-1)
    soft_student = F.log_softmax(student_flat / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    hard_loss = F.cross_entropy(student_flat, labels_flat)
    return alpha * soft_loss + (1 - alpha) * hard_loss

torch.manual_seed(0)
vocab_size = 12
teacher = TinyLM(vocab_size=vocab_size, hidden_size=16)
student = TinyLM(vocab_size=vocab_size, hidden_size=6)
teacher.requires_grad_(False)
teacher.eval()

input_ids = torch.tensor([
    [1, 2, 3, 4, 5],
    [1, 3, 5, 7, 9],
])
labels = input_ids.clone()
optimizer = torch.optim.AdamW(student.parameters(), lr=0.05)

with torch.no_grad():
    teacher_logits = teacher(input_ids)

before = kd_loss(student(input_ids), teacher_logits, labels)
optimizer.zero_grad()
before.backward()
has_grad = any(parameter.grad is not None for parameter in student.parameters())
optimizer.step()

after = kd_loss(student(input_ids), teacher_logits, labels)

print("before:", round(float(before), 4))
print("after:", round(float(after), 4))
print("has_grad:", has_grad)
print("after_is_finite:", bool(torch.isfinite(after)))
print("improved:", bool(after < before))

Output

before: 1.1513
after: 0.9793
has_grad: True
after_is_finite: True
improved: True

logit-cache-budget.py

tokens = 50_000_000
vocab_size = 32_000
bytes_per_logit = 2  # bf16
top_k = 64
bytes_per_topk_item = 2 + 4  # bf16 value plus int32 token id

dense_bytes = tokens * vocab_size * bytes_per_logit
topk_bytes = tokens * top_k * bytes_per_topk_item
gib = 1024 ** 3

print("dense cache GiB:", round(dense_bytes / gib, 1))
print("top-k cache GiB:", round(topk_bytes / gib, 1))
print("storage reduction:", round(dense_bytes / topk_bytes, 1), "x")
print("quality must still be evaluated:", True)

Output

dense cache GiB: 2980.2
top-k cache GiB: 17.9
storage reduction: 166.7 x
quality must still be evaluated: True
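
A top-k cache stores only part of each teacher distribution, so the loss has to decide what to do with the missing mass. The sketch below shows one option, assuming the kept teacher probabilities are renormalized over the top-k token ids and the student is renormalized over the same ids before computing KL. This is an illustrative choice with hypothetical helper names; other schemes, such as pooling the remaining mass into a single bucket, also exist and should be compared on held-out quality.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_cache(teacher_logits: list[float], k: int) -> dict[int, float]:
    probs = softmax(teacher_logits)
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in kept)
    # Renormalize so the cached entries sum to 1 over the kept token ids.
    return {i: probs[i] / mass for i in kept}

def top_k_kl(cache: dict[int, float], student_logits: list[float]) -> float:
    student_probs = softmax(student_logits)
    # Restrict the student to the cached ids and renormalize over that support.
    student_mass = sum(student_probs[i] for i in cache)
    return sum(
        q * math.log(q / (student_probs[i] / student_mass))
        for i, q in cache.items()
    )

teacher_logits = [4.0, 2.0, 1.0, 0.7, -3.0, -5.0]
cache = top_k_cache(teacher_logits, k=3)
same = top_k_kl(cache, teacher_logits)
different = top_k_kl(cache, [1.0, 3.0, 0.0, 0.0, 0.0, 0.0])

print("cached ids:", sorted(cache))
print("KL vs identical student is ~0:", abs(same) < 1e-9)
print("KL vs different student is positive:", different > 0)
```

Because the student is renormalized over the cached ids, this variant gives no penalty for student probability placed outside the top-k set, which is one reason the storage saving still has to be validated against task quality.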

Distilled models in practice

Student / recipe	Teacher	Signal transferred	Why it matters
Alpaca 7B^[5]	`text-davinci-003`	52K generated instruction-response examples	The repository reports preliminary instruction-following evaluation and clear non-commercial dataset terms.
Orca 13B^[3]	GPT-4 + ChatGPT	Explanation traces and task instructions	Evaluates a richer generated-trace training recipe, rather than logit KD.
phi-1.5^[4]	Existing LLMs + curated synthetic data	Textbook-like synthetic corpora	Adjacent synthetic-data recipe, not a direct teacher-distribution KD comparison.
Gemma 2 2B / 9B^[2]	Larger Gemma teachers	Token-probability distillation during pretraining	Reports a controlled 2B distilled-versus-from-scratch ablation.
DeepSeek-R1-Distill 1.5B-70B^[7]	DeepSeek-R1	Roughly 800K selected SFT examples	Reports results from text-target fine-tuning; transfer remains benchmark-scoped.

Building a distillation dataset

Seed-Expand-Filter pipeline

Seed: Start with a small set of high-quality, human-written prompts (e.g., 100 examples about warehouse operations).
Expand: Ask the teacher model to generate new, diverse variations of these prompts.
Generate: Have the teacher answer these new prompts, often with rationales or decomposed steps when richer supervision helps.
Filter: Use checks, deduplication, safety screening, or reviewed scoring rules to reject unsuitable generations.

select-teacher-responses.py

from collections.abc import Callable

class DistillationDataGenerator:
    def __init__(
        self,
        teacher_generate: Callable[[str], str],
        verify_response: Callable[[str, str], bool],
    ):
        self.teacher_generate = teacher_generate
        self.verify_response = verify_response

    def generate_dataset(self, prompts: list[str]) -> list[dict[str, str]]:
        selected: list[dict[str, str]] = []
        seen: set[str] = set()
        for prompt in prompts:
            normalized = " ".join(prompt.lower().split())
            if normalized in seen:
                continue
            seen.add(normalized)
            response = self.teacher_generate(normalized).strip()
            if self.verify_response(normalized, response):
                selected.append({"prompt": normalized, "response": response})
        return selected

def fake_teacher(prompt: str) -> str:
    return "30 days"

trusted_answers = {
    "sealed electronics return": "30 days",
    "final-sale electronics return": "not eligible",
}

def verify_response(prompt: str, response: str) -> bool:
    return trusted_answers[prompt] == response

generator = DistillationDataGenerator(fake_teacher, verify_response)
examples = generator.generate_dataset([
    "sealed electronics return",
    " Sealed electronics  return ",
    "final-sale electronics return",
])

print("selected prompts:", [example["prompt"] for example in examples])
print("selected count:", len(examples))
print("bad response retained:", any("final-sale" in example["prompt"] for example in examples))

Output

selected prompts: ['sealed electronics return']
selected count: 1
bad response retained: False

Keep training data out of evaluation

held-out-contamination-gate.py

def normalize(prompt: str) -> str:
    return " ".join(prompt.lower().replace("?", "").split())

candidate_training_prompts = [
    "Compute shipping refund for a late parcel",
    "Return sealed electronics within 30 days",
    "Can a FINAL-SALE item be returned?",
]
held_out_prompts = [
    "can a final-sale item be returned",
    "Estimate delivery window for a remote zip code",
]

held_out_keys = {normalize(prompt) for prompt in held_out_prompts}
accepted = [
    prompt for prompt in candidate_training_prompts
    if normalize(prompt) not in held_out_keys
]
blocked = [
    prompt for prompt in candidate_training_prompts
    if normalize(prompt) in held_out_keys
]

print("accepted training prompts:", len(accepted))
print("blocked overlap:", blocked)
print("held-out exact overlap after gate:", any(normalize(p) in held_out_keys for p in accepted))

Output

accepted training prompts: 2
blocked overlap: ['Can a FINAL-SALE item be returned?']
held-out exact overlap after gate: False
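
The gate above only catches overlap that is exact after normalization. A lightly reworded held-out prompt would still pass. The sketch below is one possible follow-up check, assuming word-set Jaccard similarity is an acceptable proxy for "too close"; the helper names, the example prompts, and the 0.6 threshold are illustrative assumptions, not values from a reviewed pipeline.

```python
def tokens(prompt: str) -> set[str]:
    # Lowercase, drop question marks, and split on whitespace.
    return set(prompt.lower().replace("?", "").split())

def jaccard(a: set[str], b: set[str]) -> float:
    # Shared words divided by all distinct words across both prompts.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(
    candidates: list[str], held_out: list[str], threshold: float = 0.6
) -> list[tuple[str, float]]:
    # Flag each candidate whose best match against any held-out prompt
    # reaches the (assumed) similarity threshold.
    flagged = []
    for candidate in candidates:
        best = max(
            (jaccard(tokens(candidate), tokens(h)) for h in held_out),
            default=0.0,
        )
        if best >= threshold:
            flagged.append((candidate, round(best, 2)))
    return flagged

held_out = ["can a final-sale item be returned"]
candidates = [
    "Can a final-sale item ever be returned?",    # reworded held-out prompt
    "Compute shipping refund for a late parcel",  # unrelated prompt
]
flagged = near_duplicates(candidates, held_out)
print("flagged near-duplicates:", flagged)
```

Word overlap is a coarse signal. It can miss paraphrases and over-flag short prompts, so treat flagged items as candidates for review and tune the threshold against examples you have inspected.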

Limitations and when not to distill

Before investing in a distillation pipeline, define which behaviors matter and how they will be tested.

Behavior	Regression risk to test	Useful held-out gate
Domain answers	Generated targets can repeat teacher errors	Checked answer accuracy and abstention rate
Instruction following	Narrow traces can miss new constraints	Fresh constraint-following prompts
Multi-step solutions	Final answers can hide invalid steps	Step checks where available plus final-answer accuracy
Long-context use	Student architecture or context limit may differ	Retrieval and long-context slices at deployment length
Safety and policy behavior	Filtered corpus may omit refusals or edge cases	Safety-policy evaluation separate from task benchmark

Legal and ethical considerations

Provider terms matter: the Stanford Alpaca release was research-only and non-commercial, and the repo points to both the underlying LLaMA restrictions and the dataset's CC BY-NC 4.0 terms.^[5]
Restrictions must be reviewed: before generating a corpus or shipping a student, review the teacher access terms, base-student license, generated-data license, and permitted use of outputs. Do not infer permission from technical access.
Imitation isn't capability proof: a student may reproduce style or familiar output patterns while failing new checked tasks. Held-out evaluation, not resemblance, establishes value.

Cost-quality tradeoff

Approach	Required access	Main training cost	Release gate
Use teacher directly	Teacher inference	No student training	Baseline quality, latency, and cost
Response KD	Generated outputs and permitted use	Generation plus SFT	Output filtering and held-out task quality
Logit KD	Aligned teacher token probabilities	Teacher scoring or cache storage	Task quality plus cache/online cost
Feature KD	Hidden states and layer mapping	Extra projections and state transfer	Ablation against simpler KD baseline

deployment-gate.py

teacher = {"checked_accuracy": 0.94, "policy_error_rate": 0.01, "latency_ms": 180}
student = {"checked_accuracy": 0.92, "policy_error_rate": 0.04, "latency_ms": 42}
requirements = {"checked_accuracy": 0.90, "max_policy_error_rate": 0.02, "max_latency_ms": 60}

checks = {
    "quality": student["checked_accuracy"] >= requirements["checked_accuracy"],
    "policy": student["policy_error_rate"] <= requirements["max_policy_error_rate"],
    "latency": student["latency_ms"] <= requirements["max_latency_ms"],
}

print("student faster:", student["latency_ms"] < teacher["latency_ms"])
print("release checks:", checks)
print("deploy student:", all(checks.values()))

Output

student faster: True
release checks: {'quality': True, 'policy': False, 'latency': True}
deploy student: False

Practice checkpoints

Check your understanding

Before moving on, try to answer these questions without looking back at the article.

Why does a soft label teach more than a hard label?
Hint: Think about what the student learns about the near-miss answers, not only the correct one.
Why might $T = 1$ provide less ranking signal than a higher temperature?
Hint: Consider what a sharp teacher distribution exposes about alternatives.
Why do causal language models need a one-token shift when computing distillation loss?
Hint: Remember that a causal LM predicts the next token from all previous tokens.
Why should forward versus reverse KL be chosen through evaluation?
Hint: Think about mode coverage, low-teacher-probability mass, and task-dependent diversity requirements.

Solution sketches

A hard label only tells the student which answer is selected. A soft label exposes the teacher's full ranking over alternatives. That extra signal can help, but a trusted check still has to catch bad teacher rankings.
If the teacher distribution is sharp at $T = 1$, the top token receives most of the probability mass and alternatives contribute little signal. Raising temperature can expose their relative ranking, until excessive smoothing removes useful separation.
In a causal LM, the logits produced at position $t$ predict the token at position $t+1$. If you don't shift the labels by one, you are training the model to predict the current token from itself, which is trivial and wrong.
Forward KL penalizes missing teacher mass; reverse KL penalizes student mass where the teacher is low. Their quality and diversity tradeoffs vary by task and decoding setup, so evaluate both where the objective choice matters.
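
The temperature and KL-direction sketches can be checked numerically. The example below is a minimal illustration in plain Python, assuming made-up teacher logits and hand-picked student distributions. None of the numbers come from a real model.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Divide logits by the temperature, then normalize.
    # Subtracting the max keeps the exponentials numerically stable.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p: list[float], q: list[float]) -> float:
    # KL(p || q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical teacher logits for three candidate tokens.
teacher_logits = [6.0, 3.0, 2.0]
sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])

# Hypothetical two-mode teacher and two candidate students.
teacher = [0.495, 0.495, 0.01]
one_mode = [0.98, 0.01, 0.01]  # commits to a single teacher mode
spread = [0.35, 0.35, 0.30]    # covers both modes, leaks mass to token 3

forward = {"one_mode": kl(teacher, one_mode), "spread": kl(teacher, spread)}
reverse = {"one_mode": kl(one_mode, teacher), "spread": kl(spread, teacher)}
print("forward KL(teacher || student):", {k: round(v, 3) for k, v in forward.items()})
print("reverse KL(student || teacher):", {k: round(v, 3) for k, v in reverse.items()})
```

In this toy setup, the higher temperature moves mass from the top token to the alternatives while keeping their order. Forward KL scores the spread student better because it covers both teacher modes. Reverse KL scores the one-mode student better because it puts little mass where the teacher is low. Which behavior is preferable still depends on the task, which is why the choice belongs in evaluation.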

What to remember

Core concept: Distillation transfers teacher signal, not guaranteed truth. Keep trusted labels, checks, and held-out evaluation in the design.
Methods: Response KD uses selected generated text. Logit KD uses aligned token probabilities. Feature KD adds hidden-state comparisons. On-policy KD scores prefixes produced by the student.
Temperature: Softening can expose more of a teacher ranking; too much removes separation. With the classical soft-loss formulation, scale KL by $T^2$ to preserve gradient magnitude.
Loss function: For causal-LM logit KD, shift positions correctly, ignore masked positions, and verify identical token-to-id output mappings. Divergence direction is an evaluated design choice.
On-policy versus off-policy: Static teacher targets may miss prefixes the student produces. GKD adds student-generated sequences and teacher token-distribution scoring at additional compute cost.
Data and release gates: Deduplicate and check generated targets, protect evaluation splits, and release only if quality, policy, latency, and cost gates pass.
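
The shift-and-mask rule in the loss-function point can be sketched without any framework. The example below assumes a label convention where `-100` marks positions that carry no loss (this matches the default `ignore_index` of PyTorch's cross-entropy). The token ids and the helper name `shifted_pairs` are made up for illustration.

```python
IGNORE_INDEX = -100  # assumed convention for positions that carry no loss

def shifted_pairs(token_ids: list[int], labels: list[int]) -> list[tuple[int, int]]:
    """Pair the prediction made at position t with the label at position t + 1."""
    pairs = []
    # The last position has no next token, so it never contributes a pair.
    for t in range(len(token_ids) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # prompt or padding target: skip
        pairs.append((t, target))
    return pairs

# Hypothetical sequence: 3 prompt tokens, 2 response tokens, 1 padding token.
token_ids = [11, 12, 13, 21, 22, 0]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 21, 22, IGNORE_INDEX]

pairs = shifted_pairs(token_ids, labels)
print("logit position -> target id:", pairs)
```

The logits at the last prompt position are scored against the first response token, and only response targets remain. In logit KD, the same kept positions would select which teacher and student distributions enter the KL term.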

Common pitfalls

Symptom	Cause	Fix
Validation loss barely changes as you raise temperature.	The softened teacher distribution may be too flat, or the soft-loss weight may be too small for the KD term to influence training.	Inspect teacher probabilities and tune temperature and loss weight on held-out tasks.
Student looks strong on training prompts but weak on held-out tasks.	Distillation corpus is too narrow, repetitive, or too close to evaluation data.	Broaden prompt coverage, filter duplicates, and keep a separate held-out evaluation slice.
Student predicts current token instead of next token during logit KD.	Causal LM loss forgot the one-token shift.	Shift logits at position $t$ against labels at position $t+1$ before KL or cross-entropy.
Student copies teacher hallucinations and policy mistakes.	Distillation blindly transferred bad teacher outputs.	Filter teacher generations, add task loss, and evaluate against trusted labels or reward checks.
KL loss runs but student quality stays random.	Teacher and student token-to-id mappings do not align, even if sizes match.	Compare mappings exactly, use response distillation, or design an explicit output-space mapping.
Tiny student misses required checked behaviors.	Student capacity, context, or data coverage is insufficient for this release target.	Narrow task scope, increase student size, or revise training and evaluation design.
Offline metrics look great but production quality collapses.	Distillation and evaluation data leaked into each other.	Split generation, tuning, and evaluation sets cleanly before training starts.
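
The token-to-id pitfall in the table is easy to miss because equal vocabulary sizes look reassuring. The sketch below compares two mappings exactly, using tiny hypothetical vocabularies. With Hugging Face tokenizers, dictionaries like these could come from `tokenizer.get_vocab()`, but the helper `vocab_mismatches` is an illustrative assumption, not a library function.

```python
def vocab_mismatches(
    teacher_vocab: dict[str, int], student_vocab: dict[str, int]
) -> dict[str, list[str]]:
    # Tokens present on only one side, plus shared tokens whose ids differ.
    shared = set(teacher_vocab) & set(student_vocab)
    return {
        "only_teacher": sorted(set(teacher_vocab) - set(student_vocab)),
        "only_student": sorted(set(student_vocab) - set(teacher_vocab)),
        "remapped": sorted(
            tok for tok in shared if teacher_vocab[tok] != student_vocab[tok]
        ),
    }

# Hypothetical vocabularies: same tokens, same size, two ids swapped.
teacher_vocab = {"<pad>": 0, "return": 1, "refund": 2, "days": 3}
student_vocab = {"<pad>": 0, "refund": 1, "return": 2, "days": 3}

report = vocab_mismatches(teacher_vocab, student_vocab)
same_size = len(teacher_vocab) == len(student_vocab)
safe_for_logit_kd = not any(report.values())

print("same size:", same_size)
print("mismatch report:", report)
print("safe for direct logit KD:", safe_for_logit_kd)
```

Here the size check passes while the exact comparison fails. A KL term computed over these two output spaces would compare probabilities for different tokens at the same index.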

What you should be able to defend

By the end of this chapter, you should be able to:

Explain why soft labels carry more information than hard labels.
Choose between response, logit, feature, and on-policy distillation based on available teacher signal and measurable costs.
Implement causal-LM logit distillation with one-token label shifting and vocabulary checks.
Explain why forward KL and reverse KL behave differently for generative models.
Design a generated-data pipeline with checks, deduplication, held-out protection, and licensing clarity.
Decide whether a distilled student passes quality, policy, latency, and cost release gates.

Next Step

Continue to Model Merging and Weight Interpolation

Distillation trains a new student from teacher signal. Model merging asks whether compatible checkpoints can be combined into one candidate without another gradient-training run.


References

[1] Distilling the Knowledge in a Neural Network. Hinton, G., Vinyals, O., & Dean, J. · 2015
[2] Gemma 2: Improving Open Language Models at a Practical Size. Gemma Team, Google DeepMind · 2024
[3] Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Mukherjee, S., et al. · 2023
[4] Textbooks Are All You Need II: phi-1.5 technical report. Li, Y., et al. · 2023
[5] Stanford Alpaca: An Instruction-following LLaMA Model. Taori, R., et al. · 2023 · GitHub
[6] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data. Hsieh, C., et al. · 2023
[7] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI · 2025
[8] MiniLLM: On-Policy Distillation of Large Language Models. Gu, Y., et al. · 2024
[9] On-Policy Distillation for Language Models. Agarwal, R., et al. · 2024
