LeetLLM

SLM Specialization & Edge Deployment

Distill large teachers into compact SLMs using MobileLLM architectures and Phi-style data recipes. Compile and run them on-device with MLC LLM, ONNX Runtime, Core ML, and ExecuTorch while respecting power, thermal, and strict privacy constraints.

Local deployment taught you how to size and package a model for fixed hardware. Edge deployment tightens the same problem until battery, heat, privacy, and app packaging become first-class model requirements.

A field technician on the night shift scans a damaged pump and needs the exact maintenance policy and vendor exception rules right now. The handheld scanner has spotty signal, the policy contains pricing tiers that the facility considers confidential, and company policy forbids sending facility or incident data to a cloud service for this route. A cloud-only assistant isn't an option. If deterministic lookup is insufficient and a generative path is justified, a small specialized model must run on the device itself.

That's the real job for Small Language Model (SLM) specialization and edge deployment: produce a model small enough to run locally and evaluate it under memory, battery, heat, privacy, and app-packaging limits. Here, an SLM means a model selected for a constrained, usually narrow workload rather than a universal parameter-count threshold.

NVIDIA Research's 2025 position paper argues for heterogeneous agent systems that use SLMs for routine narrow invocations and larger models when complexity requires them [1: Small Language Models are the Future of Agentic AI, https://arxiv.org/abs/2506.02153]. Treat that as an architecture hypothesis to evaluate, not proof that every local route should use a generative model: retrieval, classifiers, or a human lane may be better for some jobs.

Figure: Edge SLM deployment stack showing device constraints feeding teacher distillation, compact model selection, runtime compilation, release gates, and a signed on-device artifact bundle.
The edge stack starts with the device job, then chooses the teacher recipe, smallest passing model class, runtime target, and release bundle that can survive field checks.

Start from the constraint, not the model size

Rather than starting with "what is the smallest model that fits in 4 GB?", start edge work with "what job must this device do offline, and what latency, power, and heat budget must it hold during a sustained test?"

For the inspection handheld example, the following numbers are candidate launch bars, not measured device results:

| Job on device | Latency target | Privacy need | Typical model class |
| --- | --- | --- | --- |
| Basic intent routing ("is this a fault or escalation?") | < 60 ms | High | 200M-500M classifier or tiny LLM |
| Field policy Q&A with citations | < 250 ms time to first token (TTFT), 25 t/s decode | Very high | 1B-4B instruct SLM |
| Damaged-item photo + text decision support | < 400 ms end-to-end | High | 3B vision-language SLM or two-stage pipeline |
| Facility policy lookup while offline | Instant | High | 350M-1B retrieval-augmented SLM |

Figure: Device constraint matrix for edge SLM deployment comparing intent routing, policy question answering, multimodal decision support, and offline carrier lookup by latency, privacy, model class, and device pressure.
Start from the offline job, not from download size. Different device jobs push model class upward for different reasons.
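
The policy Q&A row combines a first-token target with a decode rate, so the time a technician waits for a complete answer depends on answer length. A minimal sketch of that arithmetic, assuming steady decode after the first token and hypothetical answer lengths (`response_time_ms` is an illustrative helper, not part of any runtime):

```python
def response_time_ms(ttft_ms: float, output_tokens: int, decode_tps: float) -> float:
    """Estimate time until the last token: first-token wait plus steady decode."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    # The first token arrives at TTFT; the rest stream at the decode rate.
    remaining_tokens = max(output_tokens - 1, 0)
    return ttft_ms + 1000.0 * remaining_tokens / decode_tps


# Candidate launch bar from the policy Q&A row: 250 ms TTFT, 25 t/s decode.
# The answer lengths below are hypothetical.
for tokens in (20, 80, 200):
    total_ms = response_time_ms(ttft_ms=250, output_tokens=tokens, decode_tps=25)
    print(f"{tokens:>3} tokens -> {total_ms / 1000:.2f} s")
```

Under these assumptions a 20-token answer finishes in about one second, while a 200-token answer takes roughly eight, which is one reason an output-length cap belongs in the launch bar alongside TTFT.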

Notice that no 70B-class model appears in the local shortlist. Once the job is defined, test the smallest model or deterministic path that can clear the launch bar.

edge-job-admission.py
jobs = [
    {"name": "intent", "privacy_local": True, "p95_ms": 42, "quality": 0.98},
    {"name": "policy_qa", "privacy_local": True, "p95_ms": 228, "quality": 0.93},
    {"name": "photo_support", "privacy_local": True, "p95_ms": 470, "quality": 0.91},
]
limits = {"max_p95_ms": 250, "min_quality": 0.92}

approved = [
    job["name"]
    for job in jobs
    if job["privacy_local"]
    and job["p95_ms"] <= limits["max_p95_ms"]
    and job["quality"] >= limits["min_quality"]
]
print(f"approved local routes: {approved}")

Output
approved local routes: ['intent', 'policy_qa']

Distillation: moving knowledge without moving the data

The original knowledge-distillation paper showed that a student can learn from the soft probability distribution of a teacher rather than hard one-hot labels [2: Distilling the Knowledge in a Neural Network, https://arxiv.org/abs/1503.02531]. For LLMs the same idea scales up dramatically: when teacher logits are available, the teacher provides a dense "dark knowledge" signal in the shape of its next-token distribution.

The distillation objective

Instead of training the student only with standard cross-entropy on hard labels, we add a softmax-based term that matches the teacher's softened distribution:

$$
\begin{aligned}
\mathcal{L}_{\text{distill}} =&\; \alpha \cdot \text{KL}\left( \text{softmax}\left(\frac{z_T}{\tau}\right) \parallel \text{softmax}\left(\frac{z_S}{\tau}\right) \right) \\
&+ (1-\alpha) \cdot \text{CE}(y, z_S)
\end{aligned}
$$

Where:

  • $z_T, z_S$ are the logits from teacher and student
  • $\tau$ is a temperature hyperparameter that softens the distribution and must be tuned for the training setup
  • $\alpha$ balances the distillation loss against the hard-label loss
  • KL is Kullback-Leibler divergence
soft-teacher-target.py
import math

def softmax(values, temperature):
    shifted = [value / temperature for value in values]
    weights = [math.exp(value) for value in shifted]
    total = sum(weights)
    return [weight / total for weight in weights]

teacher_logits = [4.0, 2.0, 1.0]
for temperature in (1.0, 3.0):
    probs = softmax(teacher_logits, temperature)
    print(f"temperature={temperature:.0f}: {[round(p, 3) for p in probs]}")

Output
temperature=1: [0.844, 0.114, 0.042]
temperature=3: [0.532, 0.273, 0.196]
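
The softened targets are only half of the objective. As a rough illustration, the sketch below evaluates the full loss from the formula above for a single token position in plain Python. The logits, `alpha`, and `tau` values are made up for the example, and `distill_loss` is a hypothetical helper rather than a library function.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher_logits, student_logits, label, alpha, tau):
    # Soft term: teacher and student are both softened with the same temperature.
    soft = kl_divergence(softmax(teacher_logits, tau), softmax(student_logits, tau))
    # Hard term: ordinary cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard, soft, hard


# Illustrative values only: three-token vocabulary, label index 0.
loss, soft, hard = distill_loss(
    teacher_logits=[4.0, 2.0, 1.0],
    student_logits=[2.0, 2.0, 1.0],
    label=0,
    alpha=0.5,
    tau=2.0,
)
print(f"soft KL term: {soft:.4f}")
print(f"hard CE term: {hard:.4f}")
print(f"combined loss: {loss:.4f}")
```

This sketch follows the formula as written. The PyTorch sketch in the walk-through section additionally multiplies the soft term by the squared temperature, a common convention for keeping its gradient scale comparable to the hard term.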

Modern LLM distillation can add several practical upgrades on top of the basic KL objective:

  • Hidden-state alignment: match intermediate layer activations or attention maps between teacher and student (feature distillation).
  • On-policy distillation: the student generates its own trajectories (MiniLLM style); the teacher then provides labels or distributions on those self-generated sequences, reducing distribution shift [3: MiniLLM: On-Policy Distillation of Large Language Models, https://arxiv.org/abs/2306.08543].
  • Synthetic data recipes: teacher-generated and filtered examples can shape student capabilities; Phi-3 reports a heavily filtered data recipe for a compact model [4: Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, https://arxiv.org/abs/2404.14219].

When full teacher logits are unavailable, teams can train with permitted teacher responses, ranked candidates, critique labels, or hard-negative rewrites. This isn't the same as logit matching, and it needs its own evaluation.
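
One way to use ranked candidates is to expand each teacher ranking into chosen/rejected pairs for preference-style training. The sketch below assumes a best-first ranking and a hypothetical record layout (`prompt`, `chosen`, `rejected`); neither comes from a specific library, and the example strings are invented.

```python
from itertools import combinations


def preference_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (chosen, rejected) training records."""
    pairs = []
    # combinations() keeps input order, so the first element of each pair
    # always ranks higher than the second.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical teacher ranking for one synthetic policy question.
ranked = [
    "Replace the seal per section 7.2 and log the vendor exception.",
    "Replace the seal.",
    "Restart the pump and continue the shift.",
]
pairs = preference_pairs("Pump seal is leaking. What does policy require?", ranked)
print(f"{len(pairs)} preference pairs from {len(ranked)} ranked responses")
```

Because this signal carries only relative order, not probabilities, it should be scored against the golden set separately from any logit-distilled student.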

Figure: Edge SLM distillation flow showing teacher signals, supervision bundle, compact student, and launch gate with short support callouts.
When teacher distributions are available, distillation can transfer more than final answers. Otherwise, richer response-level supervision still needs independent evaluation.

The Phi-3 technical report presents one compact-model result from carefully selected training data; it doesn't establish that every distilled 3.8B model beats every larger model [4]. The deployment lesson is narrower: data recipe and evaluation can matter alongside parameter count.

Architecture choices that matter at small scale

For sub-billion parameter models, raw parameter count isn't the only lever. MobileLLM reports results for its studied 125M and 350M variants showing that architecture choices affect quality per parameter [5: MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, https://arxiv.org/abs/2402.14905].

The key MobileLLM ideas that transfer to production SLMs are:

  • Deep-and-thin designs were beneficial in the MobileLLM study; validate that choice for a new model and runtime.
  • Embedding sharing or block-wise weight sharing (the LS variant) can reduce parameters; latency still needs device measurement.
  • Adopt grouped-query attention, which shares key-value heads across query groups, from the start so the KV cache stays small during decode.
kv-head-device-budget.py
def kv_mib(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 2**20

mha = kv_mib(layers=24, kv_heads=16, head_dim=64, tokens=4096)
gqa = kv_mib(layers=24, kv_heads=4, head_dim=64, tokens=4096)
print(f"MHA KV cache: {mha:.1f} MiB")
print(f"GQA KV cache: {gqa:.1f} MiB")
print(f"cache reduction for this architecture: {mha / gqa:.1f}x")

Output
MHA KV cache: 384.0 MiB
GQA KV cache: 96.0 MiB
cache reduction for this architecture: 4.0x
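
Embedding sharing can be sized with the same kind of arithmetic. The sketch below counts parameters in the input embedding table and the output projection, with and without tying them. The vocabulary size, hidden width, and 125M budget are illustrative assumptions, not figures quoted from the MobileLLM paper.

```python
def embedding_params(vocab_size: int, hidden_dim: int, tied: bool) -> int:
    """Parameters in the input embedding plus the output projection."""
    table = vocab_size * hidden_dim
    # With tying, one table serves both roles; otherwise there are two.
    return table if tied else 2 * table


# Illustrative assumptions for a sub-billion model.
vocab_size, hidden_dim = 32_000, 576
total_budget = 125_000_000

untied = embedding_params(vocab_size, hidden_dim, tied=False)
tied = embedding_params(vocab_size, hidden_dim, tied=True)
saved = untied - tied

print(f"untied embeddings: {untied / 1e6:.1f}M parameters")
print(f"tied embeddings:   {tied / 1e6:.1f}M parameters")
print(f"saved: {saved / 1e6:.1f}M ({saved / total_budget:.1%} of the budget)")
```

At this scale the embedding tables are a large share of the model, which is why sharing them frees parameters for depth. The saving is in parameter count and storage; decode latency still needs device measurement.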
Figure: Small language model design split into quality levers and bandwidth levers, with a final evaluation rule for edge deployment.
Small-model levers help in different ways. Pair one lever that lifts task skill with one lever that relieves memory or KV-cache pressure, then rerun the same eval after export.

Microsoft's Phi-3 report describes a heavily filtered data recipe and a 3.8B model intended for constrained deployment scenarios [4]. Whether a particular artifact fits a target device still depends on runtime, quantization, context, and measured behavior.
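
A first-order fit check is weight storage alone. The sketch below estimates it for a 3.8B-parameter model at three bit widths. It deliberately ignores quantization scales and zero points, the KV cache, activations, and runtime overhead, so treat the result as a lower bound rather than a measured footprint.

```python
def weight_gib(params: float, bits_per_weight: float) -> float:
    """Lower-bound weight storage in GiB, ignoring quantization metadata."""
    return params * bits_per_weight / 8 / 2**30


params = 3.8e9  # parameter count of the 3.8B model class discussed above
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gib(params, bits):.2f} GiB")
```

The KV cache for the chosen context length and the runtime's working memory are added on top of this figure.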

Both lines of work teach the same lesson: at the edge, architecture and data quality are first-class citizens alongside size.

On-device runtimes: the real deployment surface

Once you have weights, you still have to run them on the target's silicon. Candidate runtime families include the options below; supported operators, packaging, and acceleration paths must be checked against the runtime release and device:

| Runtime | Packaging or export path | Backends to verify | Integration question |
| --- | --- | --- | --- |
| MLC LLM (Apache TVM) [6] | Compiled deployment artifacts | Metal, Vulkan, WebGPU, or documented target path | Does compiled artifact support each device family? |
| ONNX Runtime Mobile [7] | ONNX model with execution providers | Platform execution provider and CPU fallback | Which operators partition or fall back? |
| Core ML / ExecuTorch [8][9] | Core ML model or ExecuTorch program | Documented Apple or vendor delegate path | Does conversion preserve required operators? |
| llama.cpp (GGUF) [10] | GGUF artifact | Build-specific CPU/GPU backend | Does selected build meet local latency and memory bars? |

References: [6] MLC LLM: Universal LLM Deployment Engine for On-Device Inference, https://llm.mlc.ai/; [7] Deploy ONNX Runtime Mobile, https://onnxruntime.ai/docs/tutorials/mobile/; [8] Core ML, https://developer.apple.com/documentation/CoreML; [9] ExecuTorch Documentation, https://docs.pytorch.org/executorch/stable/; [10] llama.cpp: Inference of LLaMA model in pure C/C++, https://github.com/ggml-org/llama.cpp

Figure: Runtime candidate paths for edge SLM deployment showing MLC LLM, ONNX Runtime, native bundles, and GGUF alongside device qualification checks.
Each runtime family packages and executes the model differently. Treat every path as a candidate until operator placement, sustained decode, and fallback behavior pass on the target hardware.

For a rugged inspection handheld, a runtime that silently falls back to CPU may meet a short functional test yet miss a sustained battery or latency target. Measure actual operator placement and long-run behavior on every supported device family.

runtime-device-qualification.py
results = [
    {"runtime": "compiled", "accelerator": True, "p95_ms": 94, "ten_min_tps": 19},
    {"runtime": "fallback", "accelerator": False, "p95_ms": 182, "ten_min_tps": 9},
]
required_tps = 15
max_p95_ms = 120

qualified = [
    result["runtime"]
    for result in results
    if result["accelerator"]
    and result["p95_ms"] <= max_p95_ms
    and result["ten_min_tps"] >= required_tps
]
print(f"qualified runtimes: {qualified}")

Output
qualified runtimes: ['compiled']

Power, thermal, and privacy constraints are first-class requirements

Burst speed isn't the field metric. Under sustained decode, a device may throttle as heat builds up. Measure:

  • Average power draw during decode (mW)
  • Skin temperature rise over 5 minutes
  • Tokens per joule (the key efficiency metric for a battery-powered device)
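
Tokens per joule follows directly from the first measurement in the list: energy is average power times time. A minimal sketch with made-up measurements (the token count, power draw, and duration are hypothetical, not device results):

```python
def tokens_per_joule(tokens: int, avg_power_mw: float, seconds: float) -> float:
    """Decode efficiency: generated tokens divided by energy consumed."""
    joules = avg_power_mw / 1000.0 * seconds  # W * s = J
    if joules <= 0:
        raise ValueError("energy must be positive")
    return tokens / joules


# Hypothetical sustained run: 6,000 tokens over 300 s at 2,500 mW average draw.
efficiency = tokens_per_joule(tokens=6000, avg_power_mw=2500, seconds=300)
print(f"{efficiency:.1f} tokens per joule")
```

The same number can be read as tokens per second divided by watts (here 20 t/s at 2.5 W), which makes it easy to compare artifacts that trade speed against power.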

Local inference removes the cloud prompt path, but privacy still depends on the surrounding product design. You have to prove that:

  • All weights, tokenizer, and prompt templates are bundled inside the signed app container.
  • No network calls are made for inference (you can still phone home for telemetry or model updates, but inference itself stays local).
  • Logs and telemetry exclude prompt content or follow an explicitly approved sync policy.

For field-service maintenance this matters when the policy the model answers contains negotiated vendor rates or vendor SLAs that should never appear in a cloud prompt log.
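
One way to make the telemetry rule checkable is an allowlist: only named metric fields ever leave the device, so prompt text is excluded by construction rather than by filtering. The field names below are hypothetical and would come from the product's approved sync policy.

```python
# Hypothetical allowlist; real field names come from the approved sync policy.
ALLOWED_TELEMETRY_FIELDS = frozenset(
    {"model_version", "ttft_ms", "tokens_per_second", "thumbs_up"}
)


def build_telemetry_event(raw_event: dict) -> dict:
    """Keep only allowlisted metric fields; everything else is dropped."""
    return {
        key: value
        for key, value in raw_event.items()
        if key in ALLOWED_TELEMETRY_FIELDS
    }


raw = {
    "model_version": "policy-student-1.2.0",
    "ttft_ms": 212,
    "tokens_per_second": 21.5,
    "thumbs_up": True,
    "prompt": "What is the negotiated rate for vendor X?",  # must never upload
    "response": "Section 7.2 lists the rate as ...",  # must never upload
}
event = build_telemetry_event(raw)
print(sorted(event))
```

An allowlist fails closed: a new field added to the raw event stays local until someone explicitly approves it.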

Apple documents an on-device Foundation Models framework for integrating supported language-model tasks into apps [11: Updates to Apple's On-Device and Server Foundation Language Models, https://machinelearning.apple.com/research/apple-foundation-models-2025-updates]. That demonstrates a platform path, not evidence that it satisfies this scanner's policy, quality, runtime, or privacy requirements.

privacy-routing-gate.py
requests = [
    {"intent": "public_faq", "sensitive": False, "cloud_allowed": True},
    {"intent": "facility_procedure", "sensitive": True, "cloud_allowed": False},
]

for request in requests:
    route = "local_only" if request["sensitive"] or not request["cloud_allowed"] else "local_or_cloud"
    print(f"{request['intent']}: {route}")

Output
public_faq: local_or_cloud
facility_procedure: local_only
Figure: Edge SLM launch gates showing sustained decode drop from burst to ten-minute floor and privacy checks for bundled weights, local prompts, redacted telemetry, and blocked cloud fallback for sensitive intents.
Edge launch checks aren't only model-quality checks. Sustained decode, tokens per joule, telemetry redaction, and cloud-fallback rules all need explicit pass/fail checks.

A practical edge deployment checklist

  1. Define the exact job, the maximum acceptable latency, and the maximum acceptable quality drop versus the cloud teacher.
  2. Choose or distill an SLM that passes your golden eval set at the target size.
  3. Quantize the weights to 4-bit or 8-bit, quantize the KV cache as well where the runtime supports it, and re-run the eval.
  4. Compile or export for the target runtime and device family.
  5. Measure sustained decode speed, power, and temperature on the actual hardware for 10 minutes.
  6. Package the model artifact with a checksum, a tiny smoke eval, and a rollback path inside the mobile app.
  7. Define an abstention and escalation policy, then test it against private and ambiguous cases.

If any step fails, the deployment fails, even if the model "runs."
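
Step 6 can be sketched as a fail-closed gate: the artifact loads only if its checksum matches the manifest and every smoke case passes. The artifact bytes, smoke cases, and `answer_fn` below are stand-ins. A real app would hash the model file in chunks and call the production runtime.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def release_gate(artifact: bytes, expected_sha256: str, smoke_cases, answer_fn):
    """Return (passed, reason). Any failed check blocks the artifact."""
    if sha256_hex(artifact) != expected_sha256:
        return False, "checksum mismatch"
    for question, expected in smoke_cases:
        if answer_fn(question) != expected:
            return False, f"smoke eval failed: {question}"
    return True, "ok"


# Stand-in artifact and smoke set, for illustration only.
artifact = b"hypothetical-model-bytes"
manifest_sha256 = sha256_hex(artifact)
smoke_cases = [("pump seal leak", "policy-7.2"), ("unknown part", "ABSTAIN")]
known_answers = dict(smoke_cases)


def answer_fn(question: str) -> str:
    # Stub for the on-device model: abstain on anything it does not know.
    return known_answers.get(question, "ABSTAIN")


print(release_gate(artifact, manifest_sha256, smoke_cases, answer_fn))
print(release_gate(artifact + b"corrupted", manifest_sha256, smoke_cases, answer_fn))
```

The second call simulates a corrupted download and is rejected before any inference runs.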

A hybrid-routing candidate with a local gate

Some products can evaluate a small on-device model as a gate between local handling and permitted escalation:

  • The 350M local model handles routine maintenance-policy questions instantly and privately when confidence is high.
  • When confidence is low or the query needs a stronger path, the router escalates only if privacy policy permits the data leaving the device.
  • A policy-approved, redacted cloud answer is logged, and the pair is later used to improve the next distillation round.

This is one deployment interpretation of NVIDIA Research's heterogeneous agent argument: test an SLM for routine steps and reserve stronger paths for cases that require them [1]. Privacy classification must happen before any cloud escalation.

hybrid-route-policy.py
queries = [
    {"name": "routine_public", "confidence": 0.94, "sensitive": False},
    {"name": "hard_public", "confidence": 0.41, "sensitive": False},
    {"name": "hard_private", "confidence": 0.38, "sensitive": True},
]

for query in queries:
    if query["confidence"] >= 0.80:
        route = "local_answer"
    elif query["sensitive"]:
        route = "human_review_local_only"
    else:
        route = "cloud_allowed_after_redaction"
    print(f"{query['name']}: {route}")

Output
routine_public: local_answer
hard_public: cloud_allowed_after_redaction
hard_private: human_review_local_only

Complete edge specialization pipeline

Figure: Edge specialization pipeline from teacher traces to student training, export, device benchmark, and rollout with quality, runtime, thermal, and privacy gates.
One checkpoint isn't a release. Every edge model still crosses export, device, thermal, and privacy gates before rollout.

One cloud distillation step feeds the downstream runtime paths and edge devices. Once the specialized weights exist, one checkpoint may feed MLC for a tuned iOS path, ONNX Runtime for cross-platform graph execution, or GGUF for a llama.cpp prototype path. Each path still needs its own benchmark on the target device family.

A concrete maintenance policy distillation walk-through

Suppose an approved teacher answers maintenance policy questions with citations and you want a smaller student that runs on the scanner. This is a planning example; the actual teacher, student size, and training-data access require policy and evaluation review.

Step 1. Create a permitted distillation dataset from synthetic policy scenarios and any approved, redacted production-like cases. If the teacher exposes logits or hidden states and storing them is allowed, record those signals; otherwise train from allowed outputs and critiques without pretending they are equivalent to logit distillation.

Step 2. Train the student with a combined loss:

  • Standard cross-entropy on the hard label (the final answer the teacher chose).
  • KL divergence between teacher and student logits (the sketch uses temperature 2.0; tune this value during training).
  • If available and approved, hidden-state L2 loss on selected layers.

A minimal training loop sketch looks like this. It uses tiny linear layers instead of real LLM blocks, but the three signals are the same: hard-label cross-entropy, soft-logit KL, and hidden-state alignment.

a-concrete-maintenance-policy-distillation.py
import torch
import torch.nn.functional as F

torch.manual_seed(7)

batch = 3
hidden_dim = 5
vocab = 4
temperature = 2.0

inputs = torch.randn(batch, hidden_dim)
labels = torch.tensor([0, 2, 1])

teacher_head = torch.nn.Linear(hidden_dim, vocab, bias=False)
student_head = torch.nn.Linear(hidden_dim, vocab, bias=False)
teacher_proj = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
student_proj = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)

with torch.no_grad():
    teacher_logits = teacher_head(inputs)
    teacher_hidden = teacher_proj(inputs)

student_logits = student_head(inputs)
student_hidden = student_proj(inputs)

hard_loss = F.cross_entropy(student_logits, labels)

soft_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * (temperature * temperature)

hidden_loss = F.mse_loss(student_hidden, teacher_hidden)

total_loss = hard_loss + 0.8 * soft_loss + 0.2 * hidden_loss

print(f"hard_loss={hard_loss.item():.4f}")
print(f"soft_loss={soft_loss.item():.4f}")
print(f"hidden_loss={hidden_loss.item():.4f}")
print(f"total_loss={total_loss.item():.4f}")

Output
hard_loss=1.5157
soft_loss=0.3425
hidden_loss=0.5703
total_loss=1.9038

After training, don't assume the student has "recovered" the teacher. Measure it. A launchable result is a student that meets the golden-set policy-accuracy bar, preserves abstention behavior, fits the memory budget, and stays above the sustained decode floor on the target NPU.

Step 3. Quantize the student with an artifact format supported by the device runtime and re-evaluate on the same golden policy set. If an exported artifact misses the allowed quality or latency regression threshold, reject it and investigate quantization choice, training recipe, or model size before retraining or deployment.

exported-student-gate.py
baseline = {"quality": 0.94, "p95_ms": 132}
exports = [
    {"name": "int8", "quality": 0.938, "p95_ms": 104},
    {"name": "int4", "quality": 0.887, "p95_ms": 71},
]
max_quality_drop = 0.02
latency_gate_ms = 120

approved = [
    item["name"]
    for item in exports
    if baseline["quality"] - item["quality"] <= max_quality_drop
    and item["p95_ms"] <= latency_gate_ms
]
print(f"approved exports: {approved}")

Output
approved exports: ['int8']

Thermal throttling isn't theoretical

A low-bit SLM can pass a short burst test and still fail the field workload. In a sustained run, a device may reduce performance after heat builds up, and decode speed can fall below the product floor in the middle of an answer. Measure that risk rather than assuming a burst result holds.

Possible mitigations to evaluate include:

  • Cap the maximum output length the app will request from the local model.
  • Route deterministic lookups outside generation when a fixed answer is sufficient.
  • Test a smaller or differently quantized artifact on the same golden set.
  • Adjust sustained-use policy only after measuring latency, power, and thermal effects.
sustained-throughput-gate.py
samples_tps = [23.0, 22.4, 21.7, 18.3, 16.8]
minimum_sustained_tps = 18.0
burst_tps = samples_tps[0]
floor_tps = min(samples_tps)

print(f"burst throughput: {burst_tps:.1f} t/s")
print(f"sustained floor: {floor_tps:.1f} t/s")
print(f"thermal/runtime gate passes: {floor_tps >= minimum_sustained_tps}")

Output
burst throughput: 23.0 t/s
sustained floor: 16.8 t/s
thermal/runtime gate passes: False

An on-device eval harness should load a versioned golden set from the app bundle, run inference with the production prompt template and runtime flags, and check answers, citations, abstentions, and latency. Run it after model updates and block releases on regression.
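
A minimal sketch of such a harness is shown below. `GoldenCase`, `run_harness`, and the stubbed `stub_infer` are hypothetical names. A real harness would load the cases from the bundled, versioned golden set and call the production runtime with its actual prompt template and flags.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_answer: Optional[str]  # None means the model should abstain
    expected_citation: Optional[str]


def run_harness(cases, infer: Callable[[str], dict], max_latency_ms: float):
    """Return (question, reason) failures; an empty list means the gate passes."""
    failures = []
    for case in cases:
        start = time.perf_counter()
        result = infer(case.question)
        latency_ms = (time.perf_counter() - start) * 1000.0
        if result.get("answer") != case.expected_answer:
            failures.append((case.question, "answer"))
        elif result.get("citation") != case.expected_citation:
            failures.append((case.question, "citation"))
        elif latency_ms > max_latency_ms:
            failures.append((case.question, "latency"))
    return failures


# Stub standing in for the on-device model.
ANSWERS = {"pump seal leak": {"answer": "replace seal", "citation": "policy-7.2"}}


def stub_infer(question: str) -> dict:
    # Unknown questions abstain: no answer and no citation.
    return ANSWERS.get(question, {"answer": None, "citation": None})


cases = [
    GoldenCase("pump seal leak", "replace seal", "policy-7.2"),
    GoldenCase("unlisted vendor discount", None, None),
]
failures = run_harness(cases, stub_infer, max_latency_ms=250)
print("release blocked" if failures else "release allowed", failures)
```

Exact string matching is used here only to keep the sketch short; answer and citation scoring rules belong to the golden set's own specification.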

Example SLM candidate families

These cited model families illustrate how to form a shortlist, not which models are newest or best. Verify release documentation, licensing, runtime support, and artifact availability before benchmarking [5][12][13][14].

References: [5] MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, https://arxiv.org/abs/2402.14905; [12] Phi-4-mini-instruct Model Card, https://huggingface.co/microsoft/Phi-4-mini-instruct; [13] Gemma 4 Model Card, https://ai.google.dev/gemma/docs/core/model_card_4; [14] Qwen3-1.7B Model Card, https://huggingface.co/Qwen/Qwen3-1.7B

| Model family | Size | Reported focus | What to benchmark | Candidate job to test |
| --- | --- | --- | --- | --- |
| MobileLLM (Meta) | 125M / 350M | Architecture search for sub-500M | Sustained latency and accuracy on narrow routes | Ultra-low power intent routers |
| Phi-4 mini / mini-reasoning (Microsoft) | 3.8B | Strong reasoning, math, and function-calling for size | Policy accuracy, tool-call format, and thermal floor | Policy Q&A, lightweight agents, tool routing |
| Gemma 4 E2B / E4B (Google) | Effective 2B / 4B classes | Multimodal edge stack with selective activation | Modality mix, runtime support, and memory footprint | Field assistants with text, image, audio, or video inputs |
| Qwen3 1.7B / 4B (Qwen Team) | 1.7B-4B | Multilingual, coding, tool use, thinking/non-thinking modes | Language mix, structured-output reliability, and thinking-mode budget | Multilingual field workflows and code-like structured tasks |

Figure: Edge SLM shortlist showing MobileLLM, Phi, Gemma, and Qwen candidate families alongside one fixed same-device evaluation harness.
These families cover different candidate scopes, not ranked positions. Run one fixed harness across supported artifacts on the target devices.

Many of these families can run fully locally through one or more edge runtimes, but support isn't automatic. The checkpoint, tokenizer, quantization format, custom ops, and accelerator path all have to work together on the target device.

Use a controlled comparison for the scanner fleet: compile supported candidates, run the same versioned golden set on supported device SKUs, and measure a sustained workload while collecting available power and temperature signals. Select from measured results, not model-family reputation.

A fleet may evaluate separate checkpoints for routing and policy answering rather than force one model to serve both jobs. The same evaluation harness can compare each candidate against its own route requirements.

device-shortlist-scorecard.py
candidates = [
    {"name": "tiny_gate", "route": "intent", "quality": 0.98, "floor_tps": 33, "privacy_ok": True},
    {"name": "policy_student", "route": "policy", "quality": 0.93, "floor_tps": 19, "privacy_ok": True},
    {"name": "hot_policy", "route": "policy", "quality": 0.95, "floor_tps": 11, "privacy_ok": True},
]
bars = {"intent": {"quality": 0.97, "floor_tps": 30}, "policy": {"quality": 0.92, "floor_tps": 15}}

approved = [
    item["name"]
    for item in candidates
    if item["privacy_ok"]
    and item["quality"] >= bars[item["route"]]["quality"]
    and item["floor_tps"] >= bars[item["route"]]["floor_tps"]
]
print(f"approved candidates: {approved}")

Output
approved candidates: ['tiny_gate', 'policy_student']

Expanded production checklist for edge SLM rollouts

  • Golden eval set lives inside the app bundle and runs automatically after every model update.
  • Model artifact includes a manifest with sha256, min runtime version, and supported NPU families.
  • On-device telemetry records TTFT, tokens per second, power draw (where the OS exposes it), and user "was this answer useful?" thumbs without uploading the prompt text.
  • Rollback: the app keeps approved previous model versions and switches automatically within the product's recovery target if an over-the-air update fails smoke evaluation.
  • Legal review: the model card for the SLM version that ships on devices must list training data summary, known limitations, and the exact red-team cases that were run on the final checkpoint.
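The manifest item in the checklist can be sketched as a load-time check. This is a minimal illustration, not a real runtime API: the manifest field names, the `verify_artifact` helper, the version tuples, and the NPU family labels are all hypothetical.

```python
import hashlib


def verify_artifact(artifact_bytes, manifest, runtime_version, npu_family):
    """Return the reasons an artifact must not be loaded (empty list means OK)."""
    problems = []
    # Recompute the checksum from the bytes actually on disk.
    if hashlib.sha256(artifact_bytes).hexdigest() != manifest["sha256"]:
        problems.append("checksum mismatch")
    # Tuples compare element by element, so (1, 3, 9) < (1, 4, 0).
    if runtime_version < tuple(manifest["min_runtime_version"]):
        problems.append("runtime too old")
    if npu_family not in manifest["supported_npu_families"]:
        problems.append("unsupported NPU family")
    return problems


artifact = b"stand-in model bytes"  # hypothetical payload, not a real model
manifest = {
    "sha256": hashlib.sha256(artifact).hexdigest(),
    "min_runtime_version": [1, 4, 0],            # assumed field name
    "supported_npu_families": ["npu_a", "npu_b"],  # assumed labels
}

ok = verify_artifact(artifact, manifest, (1, 5, 2), "npu_a")
bad = verify_artifact(artifact + b"x", manifest, (1, 3, 9), "npu_c")
print(ok)   # []
print(bad)  # ['checksum mismatch', 'runtime too old', 'unsupported NPU family']
```

Returning every failed reason, rather than stopping at the first, gives telemetry a complete picture of why a device refused an update.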

When the checklist is complete, the maintenance policy assistant on the scanner is a real production system, not a research demo.

Mastery check

Choose an edge job, pick a viable student class, and explain why quality, runtime, thermal, and privacy checks all have to pass before rollout.

Evaluation rubric

  • Can explain why edge projects start from device job, privacy class, and sustained thermal budget instead of parameter count alone.
  • Can describe what distillation adds beyond final-answer copying: soft logits, feature targets, and corrections on the student's own failures.
  • Can compare MLC LLM, ONNX Runtime, Core ML, ExecuTorch, and llama.cpp by export model, backend reach, and CPU-fallback risk.
  • Can defend a launch rule that uses the same golden set before and after quantization on target hardware.
  • Can design a hybrid router that separates low confidence from privacy permission.
  • Can name artifact safety requirements: manifest, checksum, smoke eval, rollback, and device-family benchmark matrix.

Follow-up questions

  1. A scanner must answer maintenance policy questions offline, cite the right paragraph, and stay under 250 ms time to first token. Would you start with a tiny gate, a 1B-4B local model plus retrieval, or a cloud-first path? Why?
  2. Your student's loss looks good, but after 4-bit export it starts giving fluent answers with wrong citations. What would you measure or change before increasing model size?
  3. A model is fast for 20 seconds, then drops below product floor after 6 minutes. Which signals tell you this is a thermal problem instead of a prompt-quality problem?
  4. A local router sees low confidence on a private facility procedure question. When is cloud escalation allowed, and what other classifier must be checked first?
  5. Same checkpoint passes on one flagship phone and fails on two older handheld SKUs. What runtime and device-qualification checks should block rollout?

Common pitfalls

Assuming small means fast and cool

  • Symptom: A 1B model feels slower than expected and drains the scanner battery.

  • Cause: The runtime falls back to CPU or moves tensors inefficiently even though the checkpoint is small.

  • Fix: Measure accelerator placement, sustained decode, tokens per joule, memory growth, and temperature on the actual device SKU.

Shipping one artifact to every phone

  • Symptom: The model passes on a flagship phone but misses latency targets on older handheld devices.

  • Cause: The same GGUF, ONNX, Core ML, or PTE artifact can hit different kernels, memory limits, and thermal behavior across hardware.

  • Fix: Build a device matrix and require the golden eval plus sustained thermal benchmark for every supported device family.

Treating distillation as final-answer copying

  • Symptom: The student mimics teacher tone but fails abstention, citations, and hard policy cases.

  • Cause: Training uses only teacher-written final answers and skips soft logits, intermediate features, ranked candidates, hard negatives, or on-policy corrections.

  • Fix: Add richer supervision where available and keep a hard-case set from the student's own failures.
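To see what soft logits carry beyond a final answer, here is a minimal sketch of temperature-scaled softmax in the style of Hinton et al. The logits [4.0, 2.0, 1.0] are illustrative, and `soft_targets` is a hypothetical helper name rather than part of any library.

```python
import math


def soft_targets(logits, temperature):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 2.0, 1.0]  # illustrative teacher scores for three tokens
t1 = soft_targets(teacher_logits, 1.0)
t3 = soft_targets(teacher_logits, 3.0)
print([round(p, 3) for p in t1])  # [0.844, 0.114, 0.042]
print([round(p, 3) for p in t3])  # [0.532, 0.273, 0.196]
```

At temperature 3 the second and third options keep visible probability mass, so the student learns how the teacher ranks the alternatives. A one-hot final answer discards that ranking.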

Ignoring thermal throttling

  • Symptom: A demo looks fast for ten seconds, then field answers slow down midway through a shift.

  • Cause: Burst tests hide sustained heat, battery, allocator, and repeated-prompt behavior.

  • Fix: Run a ten-minute workload with the production prompt template, output cap, retrieval path, and telemetry enabled.
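One way to separate a burst result from a sustained result is to log throughput once per minute and gate on the minimum, not the first sample. The sketch below assumes made-up samples and a 15 tokens/s floor; `sustained_report` is a hypothetical helper.

```python
def sustained_report(tps_per_minute, floor_tps):
    """Summarize a sustained run from one tokens/s sample per minute."""
    first_breach_minute = next(
        (minute for minute, tps in enumerate(tps_per_minute, start=1) if tps < floor_tps),
        None,  # None means the floor was never breached
    )
    sustained_floor = min(tps_per_minute)
    return {
        "burst_tps": tps_per_minute[0],
        "sustained_floor_tps": sustained_floor,
        "first_breach_minute": first_breach_minute,
        "passes": sustained_floor >= floor_tps,
    }


# Hypothetical ten-minute run: fast at first, then throttling as the device heats up.
samples = [24, 24, 23, 21, 19, 16, 14, 13, 13, 12]
report = sustained_report(samples, floor_tps=15)
print(report)
```

With these samples the burst figure is 24 tokens/s, but the run first breaches the floor in minute 7 and fails the gate. Pairing each sample with a temperature reading, where the OS exposes one, helps attribute the drop to heat rather than to prompt changes.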

Forgetting artifact safety checks

  • Symptom: The app launches after an over-the-air update, but answers regress and rollback is manual.

  • Cause: The model artifact is treated like a static file instead of a versioned production dependency.

  • Fix: Ship manifests, checksums, min runtime versions, previous artifacts, smoke evals, and automatic rollback.
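The automatic part of that fix can be sketched as a gate that promotes an update only when every check passes. The check names and the `release_decision` helper are hypothetical, and a real release system would also record which previous artifact it restores.

```python
def release_decision(checks):
    """Promote only if every check passed; otherwise roll back and report why."""
    failed = [name for name, passed in checks.items() if not passed]
    action = "rollback" if failed else "promote"
    return action, failed


# Hypothetical results after an over-the-air update. The app launching is
# not sufficient evidence that the model artifact is healthy.
checks = {
    "app_launches": True,
    "manifest_checksum": True,
    "min_runtime_version": True,
    "smoke_eval": False,
    "sustained_throughput": True,
}
action, failed = release_decision(checks)
print(action, failed)  # rollback ['smoke_eval']
```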

Edge deployment checklist

SLM quality at the edge depends on data recipe, architecture, compression, and route-specific evaluation, not parameter count alone. A smaller candidate should ship only if it clears the same task bar as larger candidates.

Runtime and hardware constraints (power, heat, memory bandwidth) are part of model selection. Choose the model after measuring sustained tokens per joule on the actual scanner hardware.
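Memory is one hardware constraint that can be estimated before any device run. The sketch below sizes the KV cache using the shape from this lesson's mastery check (24 layers, head_dim 64, 4096-token context, 2-byte values). It assumes a single sequence and separate key and value tensors per layer; `kv_cache_mib` is a hypothetical helper.

```python
def kv_cache_mib(layers, kv_heads, head_dim, context_tokens, bytes_per_value):
    """KV-cache size in MiB for one sequence at full context."""
    # The factor of 2 covers storing both keys and values.
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / (1024 ** 2)


full_heads = kv_cache_mib(layers=24, kv_heads=16, head_dim=64, context_tokens=4096, bytes_per_value=2)
grouped = kv_cache_mib(layers=24, kv_heads=4, head_dim=64, context_tokens=4096, bytes_per_value=2)
print(full_heads, grouped)  # 384.0 96.0
```

Cutting KV heads from 16 to 4 shrinks the cache by 4x. This estimate covers memory only, so quality and sustained throughput still have to be measured on the device.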

Privacy is a product requirement that local inference can satisfy only if telemetry, sync, crash reporting, and model-update paths are controlled too. Any permitted cloud path must follow the product's data-handling policy.

Field-service and inspection apps often need a hybrid split: a fast, private SLM on the device that knows when to escalate. The local model acts as both accelerator and privacy boundary.
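A minimal sketch of that split keeps privacy permission and confidence as two separate inputs. The 0.80 threshold and the three example requests follow this lesson's mastery check. The route labels, and the choice to abstain locally on low-confidence sensitive questions, are assumptions for illustration.

```python
def route(confidence, sensitive, threshold=0.80):
    """Pick a path; the privacy flag is decided independently of confidence."""
    cloud_allowed = not sensitive  # sensitive content never leaves the device
    if confidence >= threshold:
        return "local_answer"
    if cloud_allowed:
        return "cloud_after_redaction"
    return "local_abstain"  # assumed fallback: abstain or hand off on device


requests = {
    "routine_public": (0.94, False),
    "hard_public": (0.41, False),
    "hard_private": (0.38, True),
}
decisions = {name: route(conf, sens) for name, (conf, sens) in requests.items()}
print(decisions)
```

Low confidence alone never unlocks the cloud path. It only matters once the privacy classifier has already permitted escalation.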

Local deployment taught you how to size and package a model for fixed hardware. Edge deployment pushes the same ideas onto a battery-powered, thermally constrained, privacy-first device that the technician carries. If you can ship a reliable SLM policy assistant on a handheld scanner, you can work within one of the hardest sets of production constraints in LLM engineering.

Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. An inspection handheld must answer maintenance-policy questions offline. Facility and incident data cannot leave the device, TTFT must be under 250 ms, quality must clear the policy-QA bar, and the app must survive a 10-minute sustained run. Which selection plan matches those constraints?
2. The same scanner needs two offline features: classify whether a scan is a fault or a vendor escalation in under 60 ms, and answer maintenance policy questions with citations under 250 ms TTFT. Which architecture split is most defensible?
3. In a distillation run, the teacher logits for three next-token options are [4.0, 2.0, 1.0]. At temperature 1 the soft target is about [0.844, 0.114, 0.042]; at temperature 3 it is about [0.532, 0.273, 0.196]. What useful signal does the higher-temperature target add compared with a one-hot final answer?
4. A team may store teacher-written answers, ranked candidates, critique labels, and hard-negative rewrites, but policy forbids storing teacher logits or hidden states. They also have a set of student-generated failures. Which distillation plan follows from those constraints?
5. An SLM candidate has 24 layers, head_dim 64, a 4096-token context, and 2-byte KV values. With 16 KV heads its KV cache is 384 MiB; with grouped-query attention using 4 KV heads it is 96 MiB. What conclusion is valid for edge deployment?
6. A scanner runtime qualification test has a max p95 latency of 120 ms and a minimum 10-minute throughput floor of 15 t/s. Candidate A uses the accelerator, has p95 94 ms, and sustains 19 t/s. Candidate B silently falls back to CPU, has p95 182 ms, and sustains 9 t/s. Which runtime should pass?
7. A baseline student scores quality 0.94 but has p95 latency 132 ms. Exports must have quality drop <= 0.02 and p95 latency <= 120 ms. The int8 export scores quality 0.938 and p95 104 ms; the int4 export scores quality 0.887 and p95 71 ms. Which export should be approved?
8. A scanner app executes model tokens on device. It also has telemetry, crash reports, sync jobs, and OTA diagnostics. Site policy text is sensitive and cloud prompts are forbidden. Which privacy gate must pass before launch?
9. A local router uses confidence >= 0.80 for local answers. For lower-confidence questions, public data may go to a cloud path only after redaction, but sensitive private facility procedures must never leave the device. How should it route: routine_public(conf 0.94, not sensitive), hard_public(conf 0.41, not sensitive), hard_private(conf 0.38, sensitive)?
10. An over-the-air model update still lets the app launch, but its manifest checksum fails on one artifact, two older scanner SKUs lack the declared minimum runtime version, the bundled smoke eval breaks abstention, and a ten-minute test shows lower throughput. What should the release system do?


Next Step
Continue to Speculative Decoding

You now know how to create and run a specialized small model on edge hardware. Speculative decoding is a serving-time acceleration technique that can reduce decode latency when a faster draft path predicts tokens the target model accepts.

References

Small Language Models are the Future of Agentic AI

Belcak, P., Heinrich, G., Diao, S., et al. (NVIDIA) · 2025

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., & Dean, J. · 2015

MiniLLM: On-Policy Distillation of Large Language Models

Gu, Y., et al. · 2024

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin, M., et al. · 2024 · arXiv preprint

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Liu, Z., Zhao, C., Iandola, F., et al. · 2024 · ICML 2024

MLC LLM: Universal LLM Deployment Engine for On-Device Inference

MLC AI Team · 2024

Deploy ONNX Runtime Mobile

Microsoft · 2026

Core ML

Apple · 2026

ExecuTorch Documentation

PyTorch · 2026

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G. · 2023

Updates to Apple's On-Device and Server Foundation Language Models

Apple Machine Learning Research · 2025

Phi-4-mini-instruct Model Card

Microsoft · 2025

Gemma 4 Model Card

Gemma Team, Google DeepMind · 2026

Qwen3-1.7B Model Card

Qwen Team · 2025