Distill large teachers into compact SLMs using MobileLLM architectures and Phi-style data recipes. Compile and run them on-device with MLC LLM, ONNX Runtime, Core ML, and ExecuTorch while respecting power, thermal, and strict privacy constraints.
Local deployment taught you how to size and package a model for fixed hardware. Edge deployment tightens the same problem until battery, heat, privacy, and app packaging become first-class model requirements.
A field technician on the night shift scans a damaged pump and needs the exact maintenance policy and vendor exception rules right now. The handheld scanner has spotty signal, the policy contains pricing tiers that the facility considers confidential, and company policy forbids sending facility or incident data to a cloud service for this route. A cloud-only assistant isn't an option. If deterministic lookup is insufficient and a generative path is justified, a small specialized model must run on the device itself.
That's the real job for Small Language Model (SLM) specialization and edge deployment: produce a model small enough to run locally and evaluate it under memory, battery, heat, privacy, and app-packaging limits. Here, an SLM means a model selected for a constrained, usually narrow workload rather than a universal parameter-count threshold.
NVIDIA Research's 2025 position paper argues for heterogeneous agent systems that use SLMs for routine narrow invocations and larger models when complexity requires them.[1] Treat that as an architecture hypothesis to evaluate, not proof that every local route should use a generative model: retrieval, classifiers, or a human lane may be better for some jobs.
Rather than starting with "what is the smallest model that fits in 4 GB?", start edge work with "what job must this device do offline, and what latency, power, and heat budget must it hold during a sustained test?"
For the inspection handheld example, the following numbers are candidate launch bars, not measured device results:
| Job on device | Latency target | Privacy need | Typical model class |
|---|---|---|---|
| Basic intent routing ("is this a fault or escalation?") | < 60 ms | High | 200M-500M classifier or tiny LLM |
| Field policy Q&A with citations | < 250 ms time to first token (TTFT), 25 t/s decode | Very high | 1B-4B instruct SLM |
| Damaged-item photo + text decision support | < 400 ms end-to-end | High | 3B vision-language SLM or two-stage pipeline |
| Facility policy lookup while offline | Instant | High | 350M-1B retrieval-augmented SLM |
Notice that no 70B-class model appears in the local shortlist. Once the job is defined, test the smallest model or deterministic path that can clear the launch bar.
```python
# Candidate measurements per job (illustrative numbers, not device results).
jobs = [
    {"name": "intent", "privacy_local": True, "p95_ms": 42, "quality": 0.98},
    {"name": "policy_qa", "privacy_local": True, "p95_ms": 228, "quality": 0.93},
    {"name": "photo_support", "privacy_local": True, "p95_ms": 470, "quality": 0.91},
]
limits = {"max_p95_ms": 250, "min_quality": 0.92}

# A job is approved only if it stays local and clears both bars.
approved = [
    job["name"]
    for job in jobs
    if job["privacy_local"]
    and job["p95_ms"] <= limits["max_p95_ms"]
    and job["quality"] >= limits["min_quality"]
]
print(f"approved local routes: {approved}")
```

Output:

```
approved local routes: ['intent', 'policy_qa']
```

The original knowledge-distillation paper showed that a student can learn from the soft probability distribution of a teacher rather than hard one-hot labels.[2] For LLMs the same idea scales up dramatically: when teacher logits are available, the teacher provides a dense "dark knowledge" signal in the shape of its next-token distribution.
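Instead of training the student only with standard cross-entropy on hard labels, we add a KL-divergence term that matches the teacher's temperature-softened distribution. A common form of the objective is:

$$
\mathcal{L} = \mathcal{L}_{\text{CE}}\big(y, \operatorname{softmax}(z_s)\big) + \lambda \, T^2 \, \mathrm{KL}\big(\operatorname{softmax}(z_t / T) \,\|\, \operatorname{softmax}(z_s / T)\big)
$$

Where $z_s$ and $z_t$ are the student and teacher logits, $y$ is the hard label, $T$ is the softmax temperature, and $\lambda$ weights the soft term. The $T^2$ factor keeps the gradient scale of the soft term comparable as $T$ changes.[2] Raising $T$ flattens the teacher distribution so that lower-ranked tokens carry more signal: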
```python
import math

def softmax(values, temperature):
    # Divide by temperature, then subtract the max for numerical stability.
    scaled = [value / temperature for value in values]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]

teacher_logits = [4.0, 2.0, 1.0]
for temperature in (1.0, 3.0):
    probs = softmax(teacher_logits, temperature)
    print(f"temperature={temperature:.0f}: {[round(p, 3) for p in probs]}")
```

Output:

```
temperature=1: [0.844, 0.114, 0.042]
temperature=3: [0.532, 0.273, 0.196]
```

Modern LLM distillation can add several practical upgrades on top of the basic KL objective, such as hidden-state alignment, ranked candidates, hard negatives, and on-policy corrections drawn from the student's own failures.
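As a concrete instance of the basic KL objective, the sketch below computes the hard cross-entropy term and the temperature-scaled KL term for a single token position. The function name `kd_loss`, the `soft_weight` parameter, and the toy logits are assumptions made for illustration; real training would compute this over batches of token positions in a tensor library.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax with max subtraction for stability.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, soft_weight=0.8):
    # Hard term: cross-entropy of the student against the one-hot label (T = 1).
    hard = -math.log(softmax(student_logits)[label])
    # Soft term: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so its gradient scale stays comparable across temperatures.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    soft = temperature ** 2 * kl
    return hard, soft, hard + soft_weight * soft

# Toy logits over a three-token vocabulary (hypothetical values).
teacher = [4.0, 2.0, 1.0]
student = [2.5, 2.0, 0.5]
hard, soft, total = kd_loss(student, teacher, label=0)
print(f"hard={hard:.4f} soft={soft:.4f} total={total:.4f}")
```

The soft term is zero only when the student's softened distribution equals the teacher's, so it keeps supplying signal even after the student already ranks the correct token first.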
When full teacher logits are unavailable, teams can train with permitted teacher responses, ranked candidates, critique labels, or hard-negative rewrites. This isn't the same as logit matching, and it needs its own evaluation.
The Phi-3 technical report presents one compact-model result from carefully selected training data; it doesn't establish that every distilled 3.8B model beats every larger model.[4] The deployment lesson is narrower: data recipe and evaluation can matter alongside parameter count.
For sub-billion parameter models, raw parameter count isn't the only lever. MobileLLM reports results for its studied 125M and 350M variants showing that architecture choices affect quality per parameter.[5]
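The key MobileLLM ideas that transfer to production SLMs are deep-and-thin layer stacks, shared input and output embeddings, grouped-query attention (GQA), and block-wise weight sharing.[5] GQA also shrinks the KV cache, as this estimate for a hypothetical 24-layer model shows: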
```python
def kv_mib(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    # Leading factor of 2 covers keys and values; the result is in MiB.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 2**20

# Multi-head attention keeps one KV head per query head; GQA shares KV heads.
mha = kv_mib(layers=24, kv_heads=16, head_dim=64, tokens=4096)
gqa = kv_mib(layers=24, kv_heads=4, head_dim=64, tokens=4096)
print(f"MHA KV cache: {mha:.1f} MiB")
print(f"GQA KV cache: {gqa:.1f} MiB")
print(f"cache reduction for this architecture: {mha / gqa:.1f}x")
```

Output:

```
MHA KV cache: 384.0 MiB
GQA KV cache: 96.0 MiB
cache reduction for this architecture: 4.0x
```
Microsoft's Phi-3 report describes a heavily filtered data recipe and a 3.8B model intended for constrained deployment scenarios.[4] Whether a particular artifact fits a target device still depends on runtime, quantization, context, and measured behavior.
Both lines of work teach the same lesson: at the edge, architecture and data quality are first-class citizens alongside size.
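Whether an artifact fits a device can be roughed out before any benchmark. The sketch below adds quantized weight size, a KV-cache estimate, and a fixed runtime overhead, then compares the sum with a memory budget. The layer shape, the 0.5 GiB overhead, and the 4 GiB budget are hypothetical planning inputs, and the weight estimate ignores quantization scales and other format metadata; it does not describe any specific released model.

```python
def weight_gib(params_billion, bits_per_weight):
    # Raw weight storage only; real formats add scales and metadata.
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    # Factor of 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 2**30

def fits(params_billion, bits, budget_gib, overhead_gib=0.5, **kv_shape):
    total = weight_gib(params_billion, bits) + kv_cache_gib(**kv_shape) + overhead_gib
    return total, total <= budget_gib

# Hypothetical shape for a 3.8B-parameter candidate.
shape = {"layers": 32, "kv_heads": 8, "head_dim": 96, "tokens": 4096}
for bits in (16, 8, 4):
    total, ok = fits(3.8, bits, budget_gib=4.0, **shape)
    print(f"{bits}-bit: {total:.2f} GiB, fits 4 GiB budget: {ok}")
```

An estimate like this only prunes the shortlist. Measured peak memory on the target device still decides whether the artifact ships.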
Once you have weights, you still have to run them on the target's silicon. Candidate runtime families include the options below; supported operators, packaging, and acceleration paths must be checked against the runtime release and device:
| Runtime | Packaging or export path | Backends to verify | Integration question |
|---|---|---|---|
| MLC LLM (Apache TVM)[6] | Compiled deployment artifacts | Metal, Vulkan, WebGPU, or documented target path | Does compiled artifact support each device family? |
| ONNX Runtime Mobile[7] | ONNX model with execution providers | Platform execution provider and CPU fallback | Which operators partition or fall back? |
| Core ML / ExecuTorch[8][9] | Core ML model or ExecuTorch program | Documented Apple or vendor delegate path | Does conversion preserve required operators? |
| llama.cpp (GGUF)[10] | GGUF artifact | Build-specific CPU/GPU backend | Does selected build meet local latency and memory bars? |
For a rugged inspection handheld, a runtime that silently falls back to CPU may meet a short functional test yet miss a sustained battery or latency target. Measure actual operator placement and long-run behavior on every supported device family.
```python
# Illustrative benchmark rows: one accelerated path, one CPU fallback.
results = [
    {"runtime": "compiled", "accelerator": True, "p95_ms": 94, "ten_min_tps": 19},
    {"runtime": "fallback", "accelerator": False, "p95_ms": 182, "ten_min_tps": 9},
]
required_tps = 15
max_p95_ms = 120

# Qualify only runtimes that use the accelerator and hold both bars.
qualified = [
    result["runtime"]
    for result in results
    if result["accelerator"]
    and result["p95_ms"] <= max_p95_ms
    and result["ten_min_tps"] >= required_tps
]
print(f"qualified runtimes: {qualified}")
```

Output:

```
qualified runtimes: ['compiled']
```

Burst speed isn't the field metric. Under sustained decode, a device may throttle as heat builds up. Measure sustained decode throughput, tokens per joule, memory growth, and temperature on the actual device SKU.
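Energy per token is one sustained-run quantity worth computing alongside throughput. The sketch below estimates tokens per joule from evenly spaced power readings. The sample values, the run names, and the one-second sampling interval are hypothetical, and the energy is a simple rectangle-rule sum rather than a calibrated measurement.

```python
def tokens_per_joule(tokens_generated, power_samples_watts, sample_interval_s):
    # Energy in joules is approximated as sum(power) * interval for evenly spaced samples.
    energy_joules = sum(power_samples_watts) * sample_interval_s
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return tokens_generated / energy_joules

# Hypothetical four-second windows from two runs of the same prompt set.
runs = {
    "accelerated": {"tokens": 80, "watts": [3.0, 3.2, 3.1, 2.9]},
    "cpu_fallback": {"tokens": 36, "watts": [4.5, 4.6, 4.4, 4.5]},
}
for name, run in runs.items():
    tpj = tokens_per_joule(run["tokens"], run["watts"], sample_interval_s=1.0)
    print(f"{name}: {tpj:.2f} tokens/J")
```

Comparing runs on tokens per joule exposes a fallback path that looks acceptable on latency alone but costs far more battery per answer.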
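Local inference removes the cloud prompt path, but privacy still depends on the surrounding product design. You have to prove that telemetry, sync, crash reporting, and model-update paths do not carry prompts or model outputs off the device.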
For field-service maintenance this matters when the policy the model answers contains negotiated vendor rates or vendor SLAs that should never appear in a cloud prompt log.
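One way to keep confidential text out of logs is to allowlist telemetry fields instead of trying to blocklist sensitive ones. The sketch below is a minimal illustration; the field names and the `scrub_event` helper are hypothetical and would need to match the product's actual telemetry schema and data-handling policy.

```python
# Only these fields may leave the device (hypothetical schema).
ALLOWED_TELEMETRY_FIELDS = {"model_version", "route", "latency_ms", "tokens_out", "abstained"}

def scrub_event(event):
    # Drop every field that is not explicitly allowed, including any
    # field added later that nobody remembered to review.
    return {key: value for key, value in event.items() if key in ALLOWED_TELEMETRY_FIELDS}

raw_event = {
    "model_version": "policy-student-1.1",
    "route": "policy_qa",
    "latency_ms": 212,
    "tokens_out": 64,
    "abstained": False,
    "prompt": "What is the vendor exception rate for pump P-7?",
    "answer": "Tier 2 pricing applies...",
}
safe_event = scrub_event(raw_event)
print(sorted(safe_event))
```

An allowlist fails closed: a new field carrying prompt text is dropped by default until someone approves it.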
Apple documents an on-device Foundation Models framework for integrating supported language-model tasks into apps.[11] That demonstrates a platform path, not evidence that it satisfies this scanner's policy, quality, runtime, or privacy requirements.
```python
requests = [
    {"intent": "public_faq", "sensitive": False, "cloud_allowed": True},
    {"intent": "facility_procedure", "sensitive": True, "cloud_allowed": False},
]

for request in requests:
    route = "local_only" if request["sensitive"] or not request["cloud_allowed"] else "local_or_cloud"
    print(f"{request['intent']}: {route}")
```

Output:

```
public_faq: local_or_cloud
facility_procedure: local_only
```
If any of these privacy checks fails, the deployment fails, even if the model "runs."
Some products can evaluate a small on-device model as a gate between local handling and permitted escalation.
This is one deployment interpretation of NVIDIA Research's heterogeneous agent argument: test an SLM for routine steps and reserve stronger paths for cases that require them.[1] Privacy classification must happen before any cloud escalation.
```python
queries = [
    {"name": "routine_public", "confidence": 0.94, "sensitive": False},
    {"name": "hard_public", "confidence": 0.41, "sensitive": False},
    {"name": "hard_private", "confidence": 0.38, "sensitive": True},
]

for query in queries:
    if query["confidence"] >= 0.80:
        route = "local_answer"
    elif query["sensitive"]:
        route = "human_review_local_only"
    else:
        route = "cloud_allowed_after_redaction"
    print(f"{query['name']}: {route}")
```

Output:

```
routine_public: local_answer
hard_public: cloud_allowed_after_redaction
hard_private: human_review_local_only
```
One cloud distillation step feeds the downstream runtime paths and edge devices. Once the specialized weights exist, one checkpoint may feed MLC for a tuned iOS path, ONNX Runtime for cross-platform graph execution, or GGUF for a llama.cpp prototype path. Each path still needs its own benchmark on the target device family.
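Because each export path needs its own benchmark per device family, it can help to enumerate the required runs explicitly. The sketch below builds that list and reports which runs are still missing. The runtime keys and device-family names are hypothetical placeholders, not a statement of what each runtime supports.

```python
# Planned export paths per runtime (hypothetical device families).
planned_paths = {
    "mlc": ["ios_handheld"],
    "onnxruntime": ["ios_handheld", "android_scanner_gen1", "android_scanner_gen2"],
    "llama.cpp": ["android_scanner_gen2"],
}

# Benchmarks already recorded as (runtime, device_family) pairs.
completed = {("mlc", "ios_handheld"), ("onnxruntime", "android_scanner_gen2")}

required = [(runtime, device) for runtime, devices in planned_paths.items() for device in devices]
pending = [pair for pair in required if pair not in completed]

print(f"required benchmarks: {len(required)}")
for runtime, device in pending:
    print(f"pending: {runtime} on {device}")
```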
Suppose an approved teacher answers maintenance policy questions with citations and you want a smaller student that runs on the scanner. This is a planning example; the actual teacher, student size, and training-data access require policy and evaluation review.
Step 1. Create a permitted distillation dataset from synthetic policy scenarios and any approved, redacted production-like cases. If the teacher exposes logits or hidden states and storing them is allowed, record those signals; otherwise train from allowed outputs and critiques without pretending they are equivalent to logit distillation.
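A minimal sketch of Step 1, assuming each example arrives already checked for redaction: the record stores teacher logits only when they exist and storage is permitted, and it labels the supervision type so that response-only data is never mistaken for logit distillation later. The `build_record` helper and its field names are hypothetical.

```python
def build_record(prompt, teacher_output, redacted, teacher_logits=None, logits_storage_allowed=False):
    # Refuse anything that has not passed redaction review.
    if not redacted:
        raise ValueError("example must be redacted before entering the dataset")
    record = {"prompt": prompt, "target": teacher_output, "supervision": "response_only"}
    # Keep logits only when the teacher exposes them and policy allows storing them.
    if teacher_logits is not None and logits_storage_allowed:
        record["teacher_logits"] = teacher_logits
        record["supervision"] = "logits"
    return record

with_logits = build_record(
    "Seal replacement interval for pump class A?",
    "Replace every 500 hours [MP-12].",
    redacted=True,
    teacher_logits=[[4.0, 2.0, 1.0]],
    logits_storage_allowed=True,
)
response_only = build_record(
    "Seal replacement interval for pump class A?",
    "Replace every 500 hours [MP-12].",
    redacted=True,
    teacher_logits=[[4.0, 2.0, 1.0]],
    logits_storage_allowed=False,
)
print(with_logits["supervision"], response_only["supervision"])
```

Tagging the supervision type per record lets the training and evaluation code branch on it instead of assuming one signal for the whole dataset.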
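Step 2. Train the student with a combined loss:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{hard}} + \lambda_{\text{soft}} \, \mathcal{L}_{\text{soft}} + \lambda_{\text{hidden}} \, \mathcal{L}_{\text{hidden}},
\qquad
\mathcal{L}_{\text{soft}} = T^2 \, \mathrm{KL}\big(\operatorname{softmax}(z_t / T) \,\|\, \operatorname{softmax}(z_s / T)\big)
$$

Here $\mathcal{L}_{\text{hard}}$ is cross-entropy on the hard labels, $\mathcal{L}_{\text{hidden}}$ is the mean squared error between student and teacher hidden states, $z_s$ and $z_t$ are the student and teacher logits, and $T$ is the temperature (for example 2.0; tune this value during training). The sketch that follows uses $\lambda_{\text{soft}} = 0.8$ and $\lambda_{\text{hidden}} = 0.2$.

A minimal sketch of the loss computation looks like this. It uses tiny linear layers instead of real LLM blocks, but the three signals are the same: hard-label cross-entropy, soft-logit KL, and hidden-state alignment.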
```python
import torch
import torch.nn.functional as F

torch.manual_seed(7)

batch = 3
hidden_dim = 5
vocab = 4
temperature = 2.0

inputs = torch.randn(batch, hidden_dim)
labels = torch.tensor([0, 2, 1])

# Tiny linear layers stand in for the teacher and student networks.
teacher_head = torch.nn.Linear(hidden_dim, vocab, bias=False)
student_head = torch.nn.Linear(hidden_dim, vocab, bias=False)
teacher_proj = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
student_proj = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)

# The teacher is frozen: no gradients flow through its outputs.
with torch.no_grad():
    teacher_logits = teacher_head(inputs)
    teacher_hidden = teacher_proj(inputs)

student_logits = student_head(inputs)
student_hidden = student_proj(inputs)

# Signal 1: cross-entropy against the hard labels.
hard_loss = F.cross_entropy(student_logits, labels)

# Signal 2: KL(teacher || student) on softened distributions, scaled by T^2.
soft_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * (temperature * temperature)

# Signal 3: align student hidden states with the teacher's.
hidden_loss = F.mse_loss(student_hidden, teacher_hidden)

total_loss = hard_loss + 0.8 * soft_loss + 0.2 * hidden_loss

print(f"hard_loss={hard_loss.item():.4f}")
print(f"soft_loss={soft_loss.item():.4f}")
print(f"hidden_loss={hidden_loss.item():.4f}")
print(f"total_loss={total_loss.item():.4f}")
```

Output:

```
hard_loss=1.5157
soft_loss=0.3425
hidden_loss=0.5703
total_loss=1.9038
```

After training, don't assume the student has "recovered" the teacher. Measure it. A launchable result is a student that meets the golden-set policy-accuracy bar, preserves abstention behavior, fits the memory budget, and stays above the sustained decode floor on the target NPU.
Step 3. Quantize the student with an artifact format supported by the device runtime and re-evaluate on the same golden policy set. If an exported artifact misses the allowed quality or latency regression threshold, reject it and investigate quantization choice, training recipe, or model size before retraining or deployment.
```python
# Unquantized baseline and two quantized exports (illustrative numbers).
baseline = {"quality": 0.94, "p95_ms": 132}
exports = [
    {"name": "int8", "quality": 0.938, "p95_ms": 104},
    {"name": "int4", "quality": 0.887, "p95_ms": 71},
]
max_quality_drop = 0.02
latency_gate_ms = 120

# Approve an export only if quality holds and latency clears the gate.
approved = [
    item["name"]
    for item in exports
    if baseline["quality"] - item["quality"] <= max_quality_drop
    and item["p95_ms"] <= latency_gate_ms
]
print(f"approved exports: {approved}")
```

Output:

```
approved exports: ['int8']
```

A low-bit SLM can pass a short burst test and still fail the field workload. In a sustained run, a device may reduce performance after heat builds up, and decode speed can fall below the product floor in the middle of an answer. Measure that risk rather than assuming a burst result holds.
Possible mitigations to evaluate include:
```python
# Throughput samples taken over a sustained run (illustrative numbers).
samples_tps = [23.0, 22.4, 21.7, 18.3, 16.8]
minimum_sustained_tps = 18.0
burst_tps = samples_tps[0]
floor_tps = min(samples_tps)

print(f"burst throughput: {burst_tps:.1f} t/s")
print(f"sustained floor: {floor_tps:.1f} t/s")
print(f"thermal/runtime gate passes: {floor_tps >= minimum_sustained_tps}")
```

Output:

```
burst throughput: 23.0 t/s
sustained floor: 16.8 t/s
thermal/runtime gate passes: False
```

An on-device eval harness should load a versioned golden set from the app bundle, run inference with the production prompt template and runtime flags, and check answers, citations, abstentions, and latency. Run it after model updates and block releases on regression.
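The harness described above can be sketched in a few lines. The golden-set schema, the `stub_model` function standing in for the real runtime call, and the 250 ms latency bar are all assumptions for illustration; a production harness would load the JSON from the app bundle and call the actual on-device runtime with the production prompt template.

```python
import json
import time

# Hypothetical golden set; in an app this JSON would ship in the bundle.
GOLDEN_SET_JSON = """
[
  {"id": "pump-seal-01", "question": "seal replacement interval",
   "expected_answer": "500 hours", "expected_citation": "MP-12", "expect_abstain": false},
  {"id": "unknown-part-01", "question": "warranty for part XQ-9",
   "expected_answer": null, "expected_citation": null, "expect_abstain": true}
]
"""

def stub_model(question):
    # Stand-in for the on-device runtime; answer=None models an abstention.
    known = {"seal replacement interval": {"answer": "Replace every 500 hours.", "citation": "MP-12"}}
    return known.get(question, {"answer": None, "citation": None})

def run_eval(model_fn, golden_set, max_latency_ms):
    failures = []
    for case in golden_set:
        start = time.perf_counter()
        result = model_fn(case["question"])
        latency_ms = (time.perf_counter() - start) * 1000.0
        abstained = result["answer"] is None
        if abstained != case["expect_abstain"]:
            failures.append((case["id"], "abstention"))
        elif not abstained:
            if case["expected_answer"] not in result["answer"]:
                failures.append((case["id"], "answer"))
            if result["citation"] != case["expected_citation"]:
                failures.append((case["id"], "citation"))
        if latency_ms > max_latency_ms:
            failures.append((case["id"], "latency"))
    return failures

golden_set = json.loads(GOLDEN_SET_JSON)
failures = run_eval(stub_model, golden_set, max_latency_ms=250.0)
release_blocked = bool(failures)
print(f"failures: {failures}")
print(f"release blocked: {release_blocked}")
```

Reporting failures per case and per check type makes a regression after a model update easy to triage: a citation failure points at retrieval or quantization, while an abstention failure points at the training recipe.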
These cited model families illustrate how to form a shortlist, not which models are newest or best. Verify release documentation, licensing, runtime support, and artifact availability before benchmarking.[5][12][13][14]
| Model family | Size | Reported focus | What to benchmark | Candidate job to test |
|---|---|---|---|---|
| MobileLLM (Meta) | 125M / 350M | Architecture search for sub-500M | Sustained latency and accuracy on narrow routes | Ultra-low power intent routers |
| Phi-4 mini / mini-reasoning (Microsoft) | 3.8B | Strong reasoning, math, and function-calling for size | Policy accuracy, tool-call format, and thermal floor | Policy Q&A, lightweight agents, tool routing |
| Gemma 4 E2B / E4B (Google) | Effective 2B / 4B classes | Multimodal edge stack with selective activation | Modality mix, runtime support, and memory footprint | Field assistants with text, image, audio, or video inputs |
| Qwen3 1.7B / 4B (Qwen Team) | 1.7B-4B | Multilingual, coding, tool use, thinking/non-thinking modes | Language mix, structured-output reliability, and thinking-mode budget | Multilingual field workflows and code-like structured tasks |
Many of these families can run fully locally through one or more edge runtimes, but support isn't automatic. The checkpoint, tokenizer, quantization format, custom ops, and accelerator path all have to work together on the target device.
Use a controlled comparison for the scanner fleet: compile supported candidates, run the same versioned golden set on supported device SKUs, and measure a sustained workload while collecting available power and temperature signals. Select from measured results, not model-family reputation.
A fleet may evaluate separate checkpoints for routing and policy answering rather than force one model to serve both jobs. The same evaluation harness can compare each candidate against its own route requirements.
```python
# Each candidate is judged against the bars for its own route.
candidates = [
    {"name": "tiny_gate", "route": "intent", "quality": 0.98, "floor_tps": 33, "privacy_ok": True},
    {"name": "policy_student", "route": "policy", "quality": 0.93, "floor_tps": 19, "privacy_ok": True},
    {"name": "hot_policy", "route": "policy", "quality": 0.95, "floor_tps": 11, "privacy_ok": True},
]
bars = {"intent": {"quality": 0.97, "floor_tps": 30}, "policy": {"quality": 0.92, "floor_tps": 15}}

approved = [
    item["name"]
    for item in candidates
    if item["privacy_ok"]
    and item["quality"] >= bars[item["route"]]["quality"]
    and item["floor_tps"] >= bars[item["route"]]["floor_tps"]
]
print(f"approved candidates: {approved}")
```

Output:

```
approved candidates: ['tiny_gate', 'policy_student']
```

When the checklist is complete, the maintenance policy assistant on the scanner is a real production system, not a research demo.
Choose an edge job, pick a viable student class, and explain why quality, runtime, thermal, and privacy checks all have to pass before rollout.
[…] 250 ms time to first token. Would you start with a tiny gate, a 1B-4B local model plus retrieval, or a cloud-first path? Why?
[…] 4-bit export it starts giving fluent answers with wrong citations. What would you measure or change before increasing model size?
[…] 20 seconds, then drops below product floor after 6 minutes. Which signals tell you this is a thermal problem instead of a prompt-quality problem?
Symptom: A 1B model feels slower than expected and drains the scanner battery.
Cause: The runtime falls back to CPU or moves tensors inefficiently even though the checkpoint is small.
Fix: Measure accelerator placement, sustained decode, tokens per joule, memory growth, and temperature on the actual device SKU.
Symptom: The model passes on a flagship phone but misses latency targets on older handheld devices.
Cause: The same GGUF, ONNX, Core ML, or PTE artifact can hit different kernels, memory limits, and thermal behavior across hardware.
Fix: Build a device matrix and require the golden eval plus sustained thermal benchmark for every supported device family.
Symptom: The student mimics teacher tone but fails abstention, citations, and hard policy cases.
Cause: Training uses only teacher-written final answers and skips soft logits, intermediate features, ranked candidates, hard negatives, or on-policy corrections.
Fix: Add richer supervision where available and keep a hard-case set from the student's own failures.
Symptom: A demo looks fast for ten seconds, then field answers slow down midway through a shift.
Cause: Burst tests hide sustained heat, battery, allocator, and repeated-prompt behavior.
Fix: Run a ten-minute workload with the production prompt template, output cap, retrieval path, and telemetry enabled.
Symptom: The app launches after an over-the-air update, but answers regress and rollback is manual.
Cause: The model artifact is treated like a static file instead of a versioned production dependency.
Fix: Ship manifests, checksums, min runtime versions, previous artifacts, smoke evals, and automatic rollback.
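The last fix can be sketched as a small activation check: verify the checksum, compare the runtime version, require a passing smoke eval, and otherwise stay on the previous artifact. The manifest fields, version strings, and the `choose_artifact` helper are hypothetical; only `hashlib.sha256` is a real library call.

```python
import hashlib

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

def choose_artifact(candidate, previous, artifact_bytes, runtime_version, smoke_eval_passed):
    # Return (version_to_activate, reason); any failed check keeps the previous artifact.
    if sha256_hex(artifact_bytes) != candidate["sha256"]:
        return previous["version"], "checksum mismatch"
    if runtime_version < tuple(candidate["min_runtime"]):
        return previous["version"], "runtime too old"
    if not smoke_eval_passed:
        return previous["version"], "smoke eval failed"
    return candidate["version"], "ok"

# Hypothetical artifacts and manifests.
new_bytes = b"hypothetical model weights v1.1"
previous = {"version": "policy-student-1.0"}
candidate = {
    "version": "policy-student-1.1",
    "sha256": sha256_hex(new_bytes),
    "min_runtime": [1, 4],
}

print(choose_artifact(candidate, previous, new_bytes, (1, 5), smoke_eval_passed=True))
print(choose_artifact(candidate, previous, b"corrupted download", (1, 5), smoke_eval_passed=True))
print(choose_artifact(candidate, previous, new_bytes, (1, 5), smoke_eval_passed=False))
```

Keeping the previous artifact on disk until the new one passes every check is what makes the rollback automatic rather than a manual reinstall.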
SLM quality at the edge depends on data recipe, architecture, compression, and route-specific evaluation, not parameter count alone. A smaller candidate should ship only if it clears the same task bar as larger candidates.
Runtime and hardware constraints (power, heat, memory bandwidth) are part of model selection. Choose the model after measuring sustained tokens per joule on the actual scanner hardware.
Privacy is a product requirement that local inference can satisfy only if telemetry, sync, crash reporting, and model-update paths are controlled too. Any permitted cloud path must follow the product's data-handling policy.
Field-service and inspection apps often need a hybrid split: a fast, private SLM on the device that knows when to escalate. The local model acts as both accelerator and privacy boundary.
Local deployment taught you how to size and package a model for fixed hardware. Edge deployment pushes the same ideas onto a battery-powered, thermally constrained, privacy-first device that the technician carries. If you can ship a reliable SLM policy assistant on a handheld scanner, you have met one of the hardest sets of production constraints in LLM engineering.
[1] Belcak, P., Heinrich, G., Diao, S., et al. (NVIDIA). Small Language Models are the Future of Agentic AI. 2025.
[2] Hinton, G., Vinyals, O., & Dean, J. Distilling the Knowledge in a Neural Network. 2015.
[3] Gu, Y., et al. MiniLLM: On-Policy Distillation of Large Language Models. 2024.
[4] Abdin, M., et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint, 2024.
[5] Liu, Z., Zhao, C., Iandola, F., et al. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. ICML 2024.
[6] MLC AI Team. MLC LLM: Universal LLM Deployment Engine for On-Device Inference. 2024.
[7] Microsoft. Deploy ONNX Runtime Mobile. 2026.
[8] Apple. Core ML. 2026.
[9] PyTorch. ExecuTorch Documentation. 2026.
[10] Gerganov, G. llama.cpp: Inference of LLaMA model in pure C/C++. 2023.
[11] Apple Machine Learning Research. Updates to Apple's On-Device and Server Foundation Language Models. 2025.
[12] Microsoft. Phi-4-mini-instruct Model Card. 2025.
[13] Gemma Team, Google DeepMind. Gemma 4 Model Card. 2026.
[14] Qwen Team. Qwen3-1.7B Model Card. 2025.