LeetLLM
LearnTracksPracticeBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Tracks
  • Practice
  • Blog
  • RSS

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 158 articles completed

🛠️Computing Foundations0/9
Git, Shell, Linux for AIDocker for Reproducible AIPython for AI EngineeringNumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models: Images & TextReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnInference & Production ScaleLocal LLM Deployment
🚀HardInference Optimization

Local LLM Deployment

Plan local LLM deployment with model size, quantization, pruning and sparsity trade-offs, Docker packaging, runtime choice, and hardware budgets.

18 min read
Learning path
Step 135 of 158 in the full curriculum
Model Quantization: GPTQ, AWQ & GGUFSLM Specialization & Edge Deployment

Quantization shrinks the model artifact. Local large language model (LLM) deployment turns that smaller artifact into a service with hardware limits, runtime choices, health checks, eval gates, and rollback.

Local LLM deployment isn't "download a model and hope." It's a serving plan.

Sometimes local deployment is the right answer: source code or incident data can't leave your boundary, latency needs to stay inside an on-prem network, the workload is steady enough to justify hardware, or the product needs offline operation. Other times, a hosted API is cheaper and safer. The engineering task is to measure the trade-off.

Local LLM deployment flow from route evaluation through model choice, runtime fit, and smoke plus rollback proof. Local LLM deployment flow from route evaluation through model choice, runtime fit, and smoke plus rollback proof.
Local deployment starts with route requirements and ends with operational proof. Model choice is only one part of service design.

Start from the job

Pick the smallest local model that passes the job's evaluation, not the largest model your GPU can barely load.

For a developer or incident-assistant workflow, define the job:

JobEvaluation targetLikely model class
Incident label routingHigh precision on triage labelsSmall local classifier or Gemma 4 E2B/E4B
Runbook Q&AFaithful cited answersGemma 4 12B with RAG
Sensitive source-code reviewStrong reasoning with private dataLarger local model or secure hosted lane
Offline field assistantLow latency and simple toolsQuantized local model

The model choice follows the evaluation. If Gemma 4 12B handles incident-status questions with high accuracy, don't deploy Qwen3.6-35B-A3B for that route just because it fits somewhere.[1][2]

Hardware budget

A local model consumes memory in several buckets:

  1. Model weights.
  2. KV cache for active requests.
  3. Runtime buffers.
  4. Activations during prefill.
  5. Fragmentation and safety margin.

Gemma 4 12B may fit a local workstation route, while Qwen3.6-35B-A3B can require high-memory accelerators once reserve, KV cache, and runtime buffers are counted.[1][2] Context length changes the math because KV cache grows with sequence length and concurrency.

Use two rough formulas before trusting any fit claim:

weight GB≈parameters×bits per weight8×109\text{weight GB} \approx \frac{\text{parameters} \times \text{bits per weight}}{8 \times 10^9}weight GB≈8×109parameters×bits per weight​

KV GB≈2×layers×KV heads109×head dim×tokens×concurrent sequences×bytes per value\begin{aligned} \text{KV GB} \approx {} & \frac{2 \times \text{layers} \times \text{KV heads}}{10^9} \\ & \times \text{head dim} \times \text{tokens} \\ & \times \text{concurrent sequences} \times \text{bytes per value} \end{aligned}KV GB≈​1092×layers×KV heads​×head dim×tokens×concurrent sequences×bytes per value​

The second formula is architecture-specific. Grouped-query and multi-query attention reduce KV heads, while longer context and more concurrent requests push the number up quickly.

Local LLM memory budget chart for a Gemma 4 12B route showing artifact size, runtime overhead, KV cache, and safety margin. Local LLM memory budget chart for a Gemma 4 12B route showing artifact size, runtime overhead, KV cache, and safety margin.
Weights are the first budget line, not the whole budget. Long context and concurrency can make KV cache dominate local serving.
hardware-budget.py
1def weight_gb(params_b: float, bits_per_weight: float) -> float: 2 return params_b * bits_per_weight / 8 3 4def kv_cache_gb( 5 layers: int, 6 kv_heads: int, 7 head_dim: int, 8 tokens: int, 9 concurrent_sequences: int, 10 bytes_per_value: int = 2, 11) -> float: 12 bytes_total = 2 * layers * kv_heads * head_dim * tokens * concurrent_sequences * bytes_per_value 13 return bytes_total / 1_000_000_000 14 15gemma4_12b_artifact_gb = 7.6 16kv_16k_x4_estimate_gb = 5.0 17qwen36_total_params_b = 35 18qwen36_bf16_weights_gb = weight_gb(params_b=qwen36_total_params_b, bits_per_weight=16) 19qwen36_kv_8k_x4_gb = kv_cache_gb( 20 layers=40, 21 kv_heads=2, 22 head_dim=256, 23 tokens=8192, 24 concurrent_sequences=4, 25) 26 27print(f"Gemma 4 12B local artifact: {gemma4_12b_artifact_gb:.1f} GB") 28print(f"16K context x 4 KV budget estimate: {kv_16k_x4_estimate_gb:.1f} GB") 29print(f"combined before buffers: {gemma4_12b_artifact_gb + kv_16k_x4_estimate_gb:.1f} GB") 30print(f"Qwen3.6-35B-A3B BF16 total weights: {qwen36_bf16_weights_gb:.1f} GB") 31print(f"Qwen3.6-35B-A3B 8K x 4 KV cache: {qwen36_kv_8k_x4_gb:.1f} GB")
Output
1Gemma 4 12B local artifact: 7.6 GB 216K context x 4 KV budget estimate: 5.0 GB 3combined before buffers: 12.6 GB 4Qwen3.6-35B-A3B BF16 total weights: 70.0 GB 5Qwen3.6-35B-A3B 8K x 4 KV cache: 2.7 GB

Fit needs a reserve policy, not a total under device capacity alone:

memory-reserve-gate.py
1device_gb = 24.0 2weights_gb = 4.4 # packed artifact plus metadata 3kv_gb = 8.6 4runtime_gb = 2.0 5reserve_fraction = 0.20 6 7usable_gb = device_gb * (1 - reserve_fraction) 8planned_gb = weights_gb + kv_gb + runtime_gb 9print(f"usable after reserve: {usable_gb:.1f} GB") 10print(f"planned allocation: {planned_gb:.1f} GB") 11print(f"admit route: {planned_gb <= usable_gb}")
Output
1usable after reserve: 19.2 GB 2planned allocation: 15.0 GB 3admit route: True

Ask these questions before buying or reserving hardware:

QuestionWhy it matters
How many concurrent users?Determines KV-cache pressure
How long are prompts?Drives prefill and cache size
How long are outputs?Drives decode time
What latency target?Determines runtime and GPU class
What traffic shape?Bursty traffic may favor hosted APIs
What failure path?Local hardware needs fallback

Local tends to be cheaper when utilization is high enough. A GPU sitting idle for most of the day is a fixed cost.

Where memory lives, and when bandwidth limits speed

Fit is only half the question. The other half is how fast the runtime serves your workload. During low-batch autoregressive decode, a runtime commonly streams much of the active weight set while generating one token at a time. When that weight traffic dominates, decode is memory-bandwidth-limited. Longer prompts, larger batches, cache traffic, offload, and different kernels can move the bottleneck, so treat this as a diagnostic model rather than a guarantee.

A useful upper-bound diagnostic for tokens per second during a weight-traffic-dominated decode step is:

max tokens/sec≈memory bandwidth (GB/s)bytes read per token (GB)\text{max tokens/sec} \approx \frac{\text{memory bandwidth (GB/s)}}{\text{bytes read per token (GB)}}max tokens/sec≈bytes read per token (GB)memory bandwidth (GB/s)​

Bytes read per token can be close to the active weight footprint in this regime, but measured throughput is lower when other traffic and kernel overhead matter. Reducing weight bytes may improve decode throughput; it doesn't promise a fixed speedup.

decode-bandwidth-upper-bound.py
1def weight_streaming_ceiling(memory_bandwidth_gbs: float, active_weight_gb: float) -> float: 2 return memory_bandwidth_gbs / active_weight_gb 3 4bandwidth_gbs = 400.0 # example measurement or hardware specification 5artifact_gb = 7.6 # Gemma 4 12B Ollama artifact size 6ceiling = weight_streaming_ceiling(bandwidth_gbs, artifact_gb) 7measured_tps = 31.0 8 9print(f"weight-streaming ceiling: {ceiling:.1f} tokens/sec") 10print(f"measured fraction of ceiling: {measured_tps / ceiling:.0%}") 11print("Use the gap to investigate cache traffic, kernels, and scheduling.")
Output
1weight-streaming ceiling: 52.6 tokens/sec 2measured fraction of ceiling: 59% 3Use the gap to investigate cache traffic, kernels, and scheduling.

Three hardware paths exist, and they trade capacity against bandwidth:

PathMemory modelWhat to measure
Discrete GPUDedicated video memory (VRAM)Whether artifact, KV cache, and reserve fit fully on device; benchmark if offload is needed
Apple SiliconUnified memory available to CPU and GPUAvailable memory after system use, selected runtime, quantized artifact, context, and measured throughput
CPU onlySystem RAMMemory capacity, vectorized runtime support, acceptable latency, and power

On Apple Silicon, MLX uses the unified-memory architecture so arrays can be operated on across CPU and GPU without copying between separate CPU and GPU pools.[3] That can make a large-memory machine a useful local candidate, but a model still needs room for artifact metadata, KV cache, runtime allocations, and the operating system. Benchmark the exact artifact and context requirement.

A sizing worksheet is more reliable than a fixed "what fits" table:

Candidate machineArtifact + cache conditionDecision
24 GB VRAMFits artifact, target KV cache, runtime, and reserve fully in VRAMBenchmark fully resident path
48 GB acceleratorLarger artifact fits only after reserve is accounted forBenchmark target concurrency before approval
Unified-memory systemArtifact fits total shared budgetBenchmark while normal system load is present
CPU/system RAM pathArtifact fits RAM but misses latency gateReject or reserve for asynchronous work
hardware-candidate-admission.py
1candidates = [ 2 {"name": "24 GB VRAM", "capacity": 24, "measured_p95_ms": 72}, 3 {"name": "64 GB unified", "capacity": 64, "measured_p95_ms": 118}, 4] 5need_gb = 19.0 6latency_gate_ms = 100 7 8approved = [ 9 item["name"] 10 for item in candidates 11 if need_gb <= item["capacity"] * 0.8 12 and item["measured_p95_ms"] <= latency_gate_ms 13] 14print(f"approved hardware: {approved}")
Output
1approved hardware: ['24 GB VRAM']

Quantization, pruning, and sparsity

Quantization stores selected weights with fewer bits. GGUF, GPTQ, and AWQ are common deployment terms you saw in the quantization chapter. A supported low-bit artifact can reduce weight storage and weight traffic; quality and end-to-end latency still need evaluation.

Pruning and sparsity attack a different part of the problem. Pruning removes weights or structures. Sparsity means many values are zero or skipped. The lottery-ticket line of work showed why sparse subnetworks are scientifically interesting,[4] but production speedups depend on runtime support. A sparse checkpoint that the runtime treats like a dense matrix may save little or nothing at inference time.

Use this rule:

Compression leverPractical starting point?Main caution
Weight quantizationYesMeasure quality on your task
KV-cache quantizationSometimesWatch long-context quality
PruningLess commonNeeds runtime support
Structured sparsityHardware-dependentSpeedup isn't automatic

For local deployment, a supported weight-quantized artifact is commonly the first compression candidate to test. Pruning and sparsity are worth tracking, but they don't create inference speed without a runtime path that uses their structure.

quantized-route-quality-gate.py
1artifacts = [ 2 {"name": "FP16", "memory_gb": 16.0, "quality": 0.932}, 3 {"name": "Q4", "memory_gb": 4.6, "quality": 0.925}, 4 {"name": "Q2", "memory_gb": 2.9, "quality": 0.861}, 5] 6budget_gb = 8.0 7min_quality = 0.90 8 9approved = [ 10 artifact["name"] 11 for artifact in artifacts 12 if artifact["memory_gb"] <= budget_gb and artifact["quality"] >= min_quality 13] 14print(f"approved artifacts: {approved}")
Output
1approved artifacts: ['Q4']
sparse-kernel-requirement.py
1dense_ms = 42.0 2zero_fraction = 0.50 3runtime_uses_sparse_kernel = False 4estimated_ms = dense_ms * (1 - zero_fraction) if runtime_uses_sparse_kernel else dense_ms 5 6print(f"runtime skips sparse work: {runtime_uses_sparse_kernel}") 7print(f"estimated decode time: {estimated_ms:.1f} ms")
Output
1runtime skips sparse work: False 2estimated decode time: 42.0 ms

Runtime choice

Common local runtimes solve different jobs:

RuntimeCandidate use case to verify
OllamaSingle-command local serving, model library, GGUF import, partial OpenAI API compatibility
llama.cppCPU, Apple Silicon, GGUF, embedded and desktop use, CPU/GPU layer offload
LM StudioDesktop GUI to download, chat, and serve GGUF or MLX models
MLX / mlx-lmApple Silicon-native inference and experimentation using unified memory
vLLMServer-side throughput, batching, OpenAI-compatible serving
TensorRT-LLMNVIDIA GPU serving with its supported optimized runtime paths

Ollama documents compatibility with parts of the OpenAI API and GGUF imports for local integrations.[5][6] llama.cpp supports GGUF workflows and a server path, with device/offload options depending on build and hardware.[7] LM Studio documents local model serving and supported runtimes for its releases.[8] MLX targets Apple Silicon's unified-memory architecture.[3] vLLM and TensorRT-LLM are serving candidates when GPU throughput and batching are requirements.[9][10] Runtime support and performance move with versions, so pin the runtime, inspect supported artifact formats, and benchmark the chosen pair.

The runtime should match the operational target. A developer laptop assistant and a production incident-review service shouldn't use the same acceptance bar.

runtime-capability-filter.py
1requirements = {"openai_api", "batching", "metrics", "gpu_serving"} 2runtimes = { 3 "desktop-runtime": {"openai_api", "single_user"}, 4 "server-runtime": {"openai_api", "batching", "metrics", "gpu_serving"}, 5 "embedded-runtime": {"gguf", "cpu_offload"}, 6} 7 8candidates = [ 9 name for name, features in runtimes.items() 10 if requirements <= features 11] 12print(f"runtime candidates to benchmark: {candidates}")
Output
1runtime candidates to benchmark: ['server-runtime']

Containers and repeatability

Containerization closes the gap between "works on my machine" and "can be deployed again." Docker gives you a repeatable package for the runtime, model mount, tokenizer files, server flags, health checks, and dependency versions.[11]

A minimal local serving plan should include:

  1. Container image or pinned runtime install.
  2. Model artifact location and checksum.
  3. Startup command with context length and quantization settings.
  4. Health endpoint.
  5. Metrics endpoint.
  6. Eval smoke test.
  7. Rollback path to previous model version.

For example, a local rollback-runbook assistant should run the same golden questions after every model or quantization change. If the model starts missing paging requirements or inventing runbook windows, the deployment fails even if the server starts.

deployment-artifact-manifest.py
1import hashlib 2 3artifact_bytes = b"incident-assistant-q4-release-17" 4manifest = { 5 "model_version": "incident-q4-r17", 6 "sha256": hashlib.sha256(artifact_bytes).hexdigest()[:12], 7 "context_tokens": 8192, 8 "previous_model": "incident-q4-r16", 9} 10print(manifest) 11print(f"rollback available: {bool(manifest['previous_model'])}")
Output
1{'model_version': 'incident-q4-r17', 'sha256': 'be45a76923bd', 'context_tokens': 8192, 'previous_model': 'incident-q4-r16'} 2rollback available: True

Local deployment decision

Choose local for explicit requirements: privacy, offline use, latency control, customization, or high steady utilization. Choose hosted when you need higher available capability, elastic traffic handling, or low operational burden.

Many mature systems use both: a local model handles sensitive or routine workflows, while a hosted model handles rare high-complexity cases through a gateway that records the trade-off.

Local deployment is a product commitment. Treat the model like a service: version it, containerize it, monitor it, evaluate it, and keep a fallback ready.

local-hosted-cost-crossover.py
1monthly_local_fixed_usd = 720 2hosted_cost_per_1k_requests_usd = 1.80 3expected_requests = 500_000 4 5hosted_monthly_usd = expected_requests / 1_000 * hosted_cost_per_1k_requests_usd 6choice = "local candidate" if monthly_local_fixed_usd < hosted_monthly_usd else "hosted candidate" 7print(f"hosted estimate: ${hosted_monthly_usd:.0f}/month") 8print(f"local fixed estimate: ${monthly_local_fixed_usd}/month") 9print(f"cost-only candidate: {choice}") 10print("Privacy, quality, reliability, and operations still require separate gates.")
Output
1hosted estimate: $900/month 2local fixed estimate: $720/month 3cost-only candidate: local candidate 4Privacy, quality, reliability, and operations still require separate gates.

Work a sizing check

Use rough math before you buy hardware. Suppose you want Gemma 4 12B for an on-prem runbook assistant; these are illustrative planning numbers:

ItemRough estimate
Gemma 4 12B Ollama artifactabout 7.6 GB
Runtime buffers and tokenizer assets1-3 GB
KV cache for long chatsdepends on context and concurrency
Safety margin20-30 percent

That can fit on modest hardware for a single route if the KV and runtime budget stays resident. Now change the requirement to Qwen3.6-35B-A3B in BF16, long context, and 20 concurrent users. The full checkpoint, KV cache, and interconnect requirements change the project shape. A credible answer may be quantization, a smaller model, fewer concurrent local users, sharding, or a hosted fallback for the rare hard cases.

What to check before moving on

Check that you can:

  • Choose a local model and quantization level from explicit latency, privacy, and quality targets.
  • Explain when local deployment is about data control or offline operation rather than lower cost.
  • Build a repeatable serving plan with pinned runtime, model checksum, container, health check, smoke eval, and rollback.
  • Estimate weight memory and KV-cache memory before claiming a model fits.
  • Explain why local decode speed is bounded by memory bandwidth, and estimate tokens per second from bandwidth and model size.
  • Compare discrete GPU VRAM, Apple Silicon unified memory, and CPU paths, then build a reserve-aware fit worksheet for the candidate machine.
  • Decide whether Ollama, llama.cpp, LM Studio, MLX, vLLM, or TensorRT-LLM matches the deployment target.

Production questions

Why is a local model not automatically cheaper than a hosted API?

Local serving adds hardware, power, maintenance, observability, capacity planning, and staff time. It wins when privacy, latency control, offline operation, or high steady utilization justify those costs. For bursty traffic, hosted APIs often stay cheaper because you only pay when requests arrive.

Where do pruning and sparsity fit relative to quantization?

Quantization stores each weight with fewer bits. Pruning and sparsity remove or skip weights and activations. Sparse models need runtime and hardware support to turn that structure into speed, so they are less plug-and-play than standard quantized checkpoints.

Common pitfalls

Choosing the largest local model first

  • Symptom: The route starts with Qwen3.6-35B-A3B before anyone defines the quality target.

  • Cause: Model size is treated as the goal instead of a cost paid only when smaller models fail.

  • Fix: Start with the workflow eval, try the smallest model that passes, then escalate only for measured failure cases.

Counting weights and forgetting KV cache

  • Symptom: A quantized checkpoint fits on paper but crashes or thrashes once context length and concurrency increase.

  • Cause: The sizing plan counts model weights but omits KV cache, runtime buffers, activations, fragmentation, and margin.

  • Fix: Budget weights, KV cache, buffers, and 20-30 percent headroom before buying hardware or setting num_ctx.

Deploying without a rollback path

  • Symptom: The local model starts, but a bad quantization or prompt-template change breaks real incident answers.

  • Cause: Deployment only checks process health and lacks a previous model artifact, routing fallback, and smoke eval.

  • Fix: Keep the old artifact, checksum both versions, route through a gateway, and require golden-case evals before traffic.

Testing only startup

  • Symptom: The local server starts, health check passes, and the first live engineer gets a wrong rollback answer.

  • Cause: Startup checks prove the process is alive. They don't prove the tokenizer, quantization, prompt template, context length, or policy retrieval path still produce correct answers.

  • Fix: Add a smoke eval to deployment. Ask five golden questions covering rollback status, paging windows, missing deploy IDs, source-sensitive abstention, and citation quality. The deploy passes when the local model gives acceptable answers, citations, and latency on those cases.

smoke-eval-rollout-gate.py
1cases = [ 2 {"name": "rollback_runbook", "correct": True, "has_citation": True, "latency_ms": 88}, 3 {"name": "missing_deploy_id", "correct": True, "has_citation": True, "latency_ms": 103}, 4 {"name": "abstain", "correct": True, "has_citation": True, "latency_ms": 75}, 5] 6latency_gate_ms = 120 7passes = all( 8 case["correct"] and case["has_citation"] and case["latency_ms"] <= latency_gate_ms 9 for case in cases 10) 11print(f"rollout approved: {passes}") 12print(f"fallback required: {not passes}")
Output
1rollout approved: True 2fallback required: False

Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.A platform team wants to deploy a local model for incident-status questions. Gemma 4 12B passes the route's accuracy and latency eval, while Qwen3.6-35B-A3B also fits if concurrency is reduced. What should the serving plan do first?
2.A Gemma 4 12B local artifact is 7.6 GB, the KV cache estimate is 5.0 GB, runtime buffers are 2.0 GB, and a 24 GB GPU keeps a 20% reserve. What should the memory gate conclude?
3.A model's KV cache is 2.15 GB at 8K tokens and 2 concurrent sequences. Its architecture and cache precision stay fixed. What is the rough KV-cache size at 16K tokens and 4 concurrent sequences?
4.A local route needs 19 GB after artifact, KV cache, and runtime are counted. The policy allows using at most 80% of memory and requires p95 latency <=100 ms. Machine A has 24 GB VRAM and measures 72 ms. Machine B has 64 GB unified memory and measures 118 ms under normal system load. Which machine passes?
5.A Gemma 4 12B low-batch decode path is suspected to be weight-bandwidth limited. The device bandwidth is 400 GB/s and the local artifact footprint is 7.6 GB. A benchmark measures 31 tokens/sec. How should you interpret this?
6.Two artifacts pass the task quality check. Artifact A is a Q4 artifact supported by the selected runtime and uses much less weight memory. Artifact B is a 50% sparse checkpoint, but the runtime still executes dense kernels. Which conclusion is justified?
7.A production incident-review service needs an OpenAI-compatible endpoint, shared GPU serving, batching, metrics, and throughput testing. Which runtime direction should the team evaluate first?
8.A local rollback-runbook model is being repackaged after a quantization and prompt-template change. Which rollout plan gives repeatable deployment and protects users from a bad model change?
9.A platform system has steady incident-status requests that use sensitive source data and pass a local model's eval. Rare non-sensitive architecture questions arrive in bursts and need stronger reasoning. Which serving design matches these constraints?

9 questions remaining.

Next Step
Continue to SLM Specialization & Edge Deployment

Local deployment shows how to size and package a model for on-prem hardware; the next step specializes further into small models, distillation, mobile-first architectures, and on-device runtimes that respect power, thermal, and privacy limits.

PreviousModel Quantization: GPTQ, AWQ & GGUF
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

gemma4

Ollama · 2026

Qwen3.6-35B-A3B

Qwen Team · 2026

MLX: An array framework for Apple silicon

Apple (ml-explore) · 2026

Linear Mode Connectivity and the Lottery Ticket Hypothesis.

Frankle, J., Dziugaite, G. K., Roy, D. M., & Carlin, M. · 2020 · ICML 2020

OpenAI compatibility - Ollama

Ollama · 2026

Importing a Model - Ollama

Ollama · 2026

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G. · 2023

LM Studio - Local AI on your computer

LM Studio · 2026

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

TensorRT-LLM: A High-Performance Inference Framework for LLMs.

NVIDIA · 2024

Docker Documentation.

Docker Inc. · 2026 · Official documentation