Plan local LLM deployment with model size, quantization, pruning and sparsity trade-offs, Docker packaging, runtime choice, and hardware budgets.
Quantization shrinks the model artifact. Local large language model (LLM) deployment turns that smaller artifact into a service with hardware limits, runtime choices, health checks, eval gates, and rollback.
Local LLM deployment isn't "download a model and hope." It's a serving plan.
Sometimes local deployment is the right answer: source code or incident data can't leave your boundary, latency needs to stay inside an on-prem network, the workload is steady enough to justify hardware, or the product needs offline operation. Other times, a hosted API is cheaper and safer. The engineering task is to measure the trade-off.
Pick the smallest local model that passes the job's evaluation, not the largest model your GPU can barely load.
For a developer or incident-assistant workflow, define the job:
| Job | Evaluation target | Likely model class |
|---|---|---|
| Incident label routing | High precision on triage labels | Small local classifier or Gemma 4 E2B/E4B |
| Runbook Q&A | Faithful cited answers | Gemma 4 12B with RAG |
| Sensitive source-code review | Strong reasoning with private data | Larger local model or secure hosted lane |
| Offline field assistant | Low latency and simple tools | Quantized local model |
The model choice follows the evaluation. If Gemma 4 12B handles incident-status questions with high accuracy, don't deploy Qwen3.6-35B-A3B for that route just because it fits somewhere.[1][2]
A local model consumes memory in several buckets:
Gemma 4 12B may fit a local workstation route, while Qwen3.6-35B-A3B can require high-memory accelerators once reserve, KV cache, and runtime buffers are counted.[1][2] Context length changes the math because KV cache grows with sequence length and concurrency.
Use two rough formulas before trusting any fit claim:
The second formula is architecture-specific. Grouped-query and multi-query attention reduce KV heads, while longer context and more concurrent requests push the number up quickly.
1def weight_gb(params_b: float, bits_per_weight: float) -> float:
2 return params_b * bits_per_weight / 8
3
4def kv_cache_gb(
5 layers: int,
6 kv_heads: int,
7 head_dim: int,
8 tokens: int,
9 concurrent_sequences: int,
10 bytes_per_value: int = 2,
11) -> float:
12 bytes_total = 2 * layers * kv_heads * head_dim * tokens * concurrent_sequences * bytes_per_value
13 return bytes_total / 1_000_000_000
14
15gemma4_12b_artifact_gb = 7.6
16kv_16k_x4_estimate_gb = 5.0
17qwen36_total_params_b = 35
18qwen36_bf16_weights_gb = weight_gb(params_b=qwen36_total_params_b, bits_per_weight=16)
19qwen36_kv_8k_x4_gb = kv_cache_gb(
20 layers=40,
21 kv_heads=2,
22 head_dim=256,
23 tokens=8192,
24 concurrent_sequences=4,
25)
26
27print(f"Gemma 4 12B local artifact: {gemma4_12b_artifact_gb:.1f} GB")
28print(f"16K context x 4 KV budget estimate: {kv_16k_x4_estimate_gb:.1f} GB")
29print(f"combined before buffers: {gemma4_12b_artifact_gb + kv_16k_x4_estimate_gb:.1f} GB")
30print(f"Qwen3.6-35B-A3B BF16 total weights: {qwen36_bf16_weights_gb:.1f} GB")
31print(f"Qwen3.6-35B-A3B 8K x 4 KV cache: {qwen36_kv_8k_x4_gb:.1f} GB")1Gemma 4 12B local artifact: 7.6 GB
216K context x 4 KV budget estimate: 5.0 GB
3combined before buffers: 12.6 GB
4Qwen3.6-35B-A3B BF16 total weights: 70.0 GB
5Qwen3.6-35B-A3B 8K x 4 KV cache: 2.7 GBFit needs a reserve policy, not a total under device capacity alone:
1device_gb = 24.0
2weights_gb = 4.4 # packed artifact plus metadata
3kv_gb = 8.6
4runtime_gb = 2.0
5reserve_fraction = 0.20
6
7usable_gb = device_gb * (1 - reserve_fraction)
8planned_gb = weights_gb + kv_gb + runtime_gb
9print(f"usable after reserve: {usable_gb:.1f} GB")
10print(f"planned allocation: {planned_gb:.1f} GB")
11print(f"admit route: {planned_gb <= usable_gb}")1usable after reserve: 19.2 GB
2planned allocation: 15.0 GB
3admit route: TrueAsk these questions before buying or reserving hardware:
| Question | Why it matters |
|---|---|
| How many concurrent users? | Determines KV-cache pressure |
| How long are prompts? | Drives prefill and cache size |
| How long are outputs? | Drives decode time |
| What latency target? | Determines runtime and GPU class |
| What traffic shape? | Bursty traffic may favor hosted APIs |
| What failure path? | Local hardware needs fallback |
Local tends to be cheaper when utilization is high enough. A GPU sitting idle for most of the day is a fixed cost.
Fit is only half the question. The other half is how fast the runtime serves your workload. During low-batch autoregressive decode, a runtime commonly streams much of the active weight set while generating one token at a time. When that weight traffic dominates, decode is memory-bandwidth-limited. Longer prompts, larger batches, cache traffic, offload, and different kernels can move the bottleneck, so treat this as a diagnostic model rather than a guarantee.
A useful upper-bound diagnostic for tokens per second during a weight-traffic-dominated decode step is:
Bytes read per token can be close to the active weight footprint in this regime, but measured throughput is lower when other traffic and kernel overhead matter. Reducing weight bytes may improve decode throughput; it doesn't promise a fixed speedup.
1def weight_streaming_ceiling(memory_bandwidth_gbs: float, active_weight_gb: float) -> float:
2 return memory_bandwidth_gbs / active_weight_gb
3
4bandwidth_gbs = 400.0 # example measurement or hardware specification
5artifact_gb = 7.6 # Gemma 4 12B Ollama artifact size
6ceiling = weight_streaming_ceiling(bandwidth_gbs, artifact_gb)
7measured_tps = 31.0
8
9print(f"weight-streaming ceiling: {ceiling:.1f} tokens/sec")
10print(f"measured fraction of ceiling: {measured_tps / ceiling:.0%}")
11print("Use the gap to investigate cache traffic, kernels, and scheduling.")1weight-streaming ceiling: 52.6 tokens/sec
2measured fraction of ceiling: 59%
3Use the gap to investigate cache traffic, kernels, and scheduling.Three hardware paths exist, and they trade capacity against bandwidth:
| Path | Memory model | What to measure |
|---|---|---|
| Discrete GPU | Dedicated video memory (VRAM) | Whether artifact, KV cache, and reserve fit fully on device; benchmark if offload is needed |
| Apple Silicon | Unified memory available to CPU and GPU | Available memory after system use, selected runtime, quantized artifact, context, and measured throughput |
| CPU only | System RAM | Memory capacity, vectorized runtime support, acceptable latency, and power |
On Apple Silicon, MLX uses the unified-memory architecture so arrays can be operated on across CPU and GPU without copying between separate CPU and GPU pools.[3] That can make a large-memory machine a useful local candidate, but a model still needs room for artifact metadata, KV cache, runtime allocations, and the operating system. Benchmark the exact artifact and context requirement.
A sizing worksheet is more reliable than a fixed "what fits" table:
| Candidate machine | Artifact + cache condition | Decision |
|---|---|---|
| 24 GB VRAM | Fits artifact, target KV cache, runtime, and reserve fully in VRAM | Benchmark fully resident path |
| 48 GB accelerator | Larger artifact fits only after reserve is accounted for | Benchmark target concurrency before approval |
| Unified-memory system | Artifact fits total shared budget | Benchmark while normal system load is present |
| CPU/system RAM path | Artifact fits RAM but misses latency gate | Reject or reserve for asynchronous work |
1candidates = [
2 {"name": "24 GB VRAM", "capacity": 24, "measured_p95_ms": 72},
3 {"name": "64 GB unified", "capacity": 64, "measured_p95_ms": 118},
4]
5need_gb = 19.0
6latency_gate_ms = 100
7
8approved = [
9 item["name"]
10 for item in candidates
11 if need_gb <= item["capacity"] * 0.8
12 and item["measured_p95_ms"] <= latency_gate_ms
13]
14print(f"approved hardware: {approved}")1approved hardware: ['24 GB VRAM']Quantization stores selected weights with fewer bits. GGUF, GPTQ, and AWQ are common deployment terms you saw in the quantization chapter. A supported low-bit artifact can reduce weight storage and weight traffic; quality and end-to-end latency still need evaluation.
Pruning and sparsity attack a different part of the problem. Pruning removes weights or structures. Sparsity means many values are zero or skipped. The lottery-ticket line of work showed why sparse subnetworks are scientifically interesting,[4] but production speedups depend on runtime support. A sparse checkpoint that the runtime treats like a dense matrix may save little or nothing at inference time.
Use this rule:
| Compression lever | Practical starting point? | Main caution |
|---|---|---|
| Weight quantization | Yes | Measure quality on your task |
| KV-cache quantization | Sometimes | Watch long-context quality |
| Pruning | Less common | Needs runtime support |
| Structured sparsity | Hardware-dependent | Speedup isn't automatic |
For local deployment, a supported weight-quantized artifact is commonly the first compression candidate to test. Pruning and sparsity are worth tracking, but they don't create inference speed without a runtime path that uses their structure.
1artifacts = [
2 {"name": "FP16", "memory_gb": 16.0, "quality": 0.932},
3 {"name": "Q4", "memory_gb": 4.6, "quality": 0.925},
4 {"name": "Q2", "memory_gb": 2.9, "quality": 0.861},
5]
6budget_gb = 8.0
7min_quality = 0.90
8
9approved = [
10 artifact["name"]
11 for artifact in artifacts
12 if artifact["memory_gb"] <= budget_gb and artifact["quality"] >= min_quality
13]
14print(f"approved artifacts: {approved}")1approved artifacts: ['Q4']1dense_ms = 42.0
2zero_fraction = 0.50
3runtime_uses_sparse_kernel = False
4estimated_ms = dense_ms * (1 - zero_fraction) if runtime_uses_sparse_kernel else dense_ms
5
6print(f"runtime skips sparse work: {runtime_uses_sparse_kernel}")
7print(f"estimated decode time: {estimated_ms:.1f} ms")1runtime skips sparse work: False
2estimated decode time: 42.0 msCommon local runtimes solve different jobs:
| Runtime | Candidate use case to verify |
|---|---|
| Ollama | Single-command local serving, model library, GGUF import, partial OpenAI API compatibility |
| llama.cpp | CPU, Apple Silicon, GGUF, embedded and desktop use, CPU/GPU layer offload |
| LM Studio | Desktop GUI to download, chat, and serve GGUF or MLX models |
| MLX / mlx-lm | Apple Silicon-native inference and experimentation using unified memory |
| vLLM | Server-side throughput, batching, OpenAI-compatible serving |
| TensorRT-LLM | NVIDIA GPU serving with its supported optimized runtime paths |
Ollama documents compatibility with parts of the OpenAI API and GGUF imports for local integrations.[5][6] llama.cpp supports GGUF workflows and a server path, with device/offload options depending on build and hardware.[7] LM Studio documents local model serving and supported runtimes for its releases.[8] MLX targets Apple Silicon's unified-memory architecture.[3] vLLM and TensorRT-LLM are serving candidates when GPU throughput and batching are requirements.[9][10] Runtime support and performance move with versions, so pin the runtime, inspect supported artifact formats, and benchmark the chosen pair.
The runtime should match the operational target. A developer laptop assistant and a production incident-review service shouldn't use the same acceptance bar.
1requirements = {"openai_api", "batching", "metrics", "gpu_serving"}
2runtimes = {
3 "desktop-runtime": {"openai_api", "single_user"},
4 "server-runtime": {"openai_api", "batching", "metrics", "gpu_serving"},
5 "embedded-runtime": {"gguf", "cpu_offload"},
6}
7
8candidates = [
9 name for name, features in runtimes.items()
10 if requirements <= features
11]
12print(f"runtime candidates to benchmark: {candidates}")1runtime candidates to benchmark: ['server-runtime']Containerization closes the gap between "works on my machine" and "can be deployed again." Docker gives you a repeatable package for the runtime, model mount, tokenizer files, server flags, health checks, and dependency versions.[11]
A minimal local serving plan should include:
For example, a local rollback-runbook assistant should run the same golden questions after every model or quantization change. If the model starts missing paging requirements or inventing runbook windows, the deployment fails even if the server starts.
1import hashlib
2
3artifact_bytes = b"incident-assistant-q4-release-17"
4manifest = {
5 "model_version": "incident-q4-r17",
6 "sha256": hashlib.sha256(artifact_bytes).hexdigest()[:12],
7 "context_tokens": 8192,
8 "previous_model": "incident-q4-r16",
9}
10print(manifest)
11print(f"rollback available: {bool(manifest['previous_model'])}")1{'model_version': 'incident-q4-r17', 'sha256': 'be45a76923bd', 'context_tokens': 8192, 'previous_model': 'incident-q4-r16'}
2rollback available: TrueChoose local for explicit requirements: privacy, offline use, latency control, customization, or high steady utilization. Choose hosted when you need higher available capability, elastic traffic handling, or low operational burden.
Many mature systems use both: a local model handles sensitive or routine workflows, while a hosted model handles rare high-complexity cases through a gateway that records the trade-off.
Local deployment is a product commitment. Treat the model like a service: version it, containerize it, monitor it, evaluate it, and keep a fallback ready.
1monthly_local_fixed_usd = 720
2hosted_cost_per_1k_requests_usd = 1.80
3expected_requests = 500_000
4
5hosted_monthly_usd = expected_requests / 1_000 * hosted_cost_per_1k_requests_usd
6choice = "local candidate" if monthly_local_fixed_usd < hosted_monthly_usd else "hosted candidate"
7print(f"hosted estimate: ${hosted_monthly_usd:.0f}/month")
8print(f"local fixed estimate: ${monthly_local_fixed_usd}/month")
9print(f"cost-only candidate: {choice}")
10print("Privacy, quality, reliability, and operations still require separate gates.")1hosted estimate: $900/month
2local fixed estimate: $720/month
3cost-only candidate: local candidate
4Privacy, quality, reliability, and operations still require separate gates.Use rough math before you buy hardware. Suppose you want Gemma 4 12B for an on-prem runbook assistant; these are illustrative planning numbers:
| Item | Rough estimate |
|---|---|
| Gemma 4 12B Ollama artifact | about 7.6 GB |
| Runtime buffers and tokenizer assets | 1-3 GB |
| KV cache for long chats | depends on context and concurrency |
| Safety margin | 20-30 percent |
That can fit on modest hardware for a single route if the KV and runtime budget stays resident. Now change the requirement to Qwen3.6-35B-A3B in BF16, long context, and 20 concurrent users. The full checkpoint, KV cache, and interconnect requirements change the project shape. A credible answer may be quantization, a smaller model, fewer concurrent local users, sharding, or a hosted fallback for the rare hard cases.
Check that you can:
Local serving adds hardware, power, maintenance, observability, capacity planning, and staff time. It wins when privacy, latency control, offline operation, or high steady utilization justify those costs. For bursty traffic, hosted APIs often stay cheaper because you only pay when requests arrive.
Quantization stores each weight with fewer bits. Pruning and sparsity remove or skip weights and activations. Sparse models need runtime and hardware support to turn that structure into speed, so they are less plug-and-play than standard quantized checkpoints.
Symptom: The route starts with Qwen3.6-35B-A3B before anyone defines the quality target.
Cause: Model size is treated as the goal instead of a cost paid only when smaller models fail.
Fix: Start with the workflow eval, try the smallest model that passes, then escalate only for measured failure cases.
Symptom: A quantized checkpoint fits on paper but crashes or thrashes once context length and concurrency increase.
Cause: The sizing plan counts model weights but omits KV cache, runtime buffers, activations, fragmentation, and margin.
Fix: Budget weights, KV cache, buffers, and 20-30 percent headroom before buying hardware or setting num_ctx.
Symptom: The local model starts, but a bad quantization or prompt-template change breaks real incident answers.
Cause: Deployment only checks process health and lacks a previous model artifact, routing fallback, and smoke eval.
Fix: Keep the old artifact, checksum both versions, route through a gateway, and require golden-case evals before traffic.
Symptom: The local server starts, health check passes, and the first live engineer gets a wrong rollback answer.
Cause: Startup checks prove the process is alive. They don't prove the tokenizer, quantization, prompt template, context length, or policy retrieval path still produce correct answers.
Fix: Add a smoke eval to deployment. Ask five golden questions covering rollback status, paging windows, missing deploy IDs, source-sensitive abstention, and citation quality. The deploy passes when the local model gives acceptable answers, citations, and latency on those cases.
1cases = [
2 {"name": "rollback_runbook", "correct": True, "has_citation": True, "latency_ms": 88},
3 {"name": "missing_deploy_id", "correct": True, "has_citation": True, "latency_ms": 103},
4 {"name": "abstain", "correct": True, "has_citation": True, "latency_ms": 75},
5]
6latency_gate_ms = 120
7passes = all(
8 case["correct"] and case["has_citation"] and case["latency_ms"] <= latency_gate_ms
9 for case in cases
10)
11print(f"rollout approved: {passes}")
12print(f"fallback required: {not passes}")1rollout approved: True
2fallback required: FalseAnswer every question, then check your score. Score above 75% to mark this lesson complete.
9 questions remaining.
gemma4
Ollama · 2026
Qwen3.6-35B-A3B
Qwen Team · 2026
MLX: An array framework for Apple silicon
Apple (ml-explore) · 2026
Linear Mode Connectivity and the Lottery Ticket Hypothesis.
Frankle, J., Dziugaite, G. K., Roy, D. M., & Carlin, M. · 2020 · ICML 2020
OpenAI compatibility - Ollama
Ollama · 2026
Importing a Model - Ollama
Ollama · 2026
llama.cpp: Inference of LLaMA model in pure C/C++
Gerganov, G. · 2023
LM Studio - Local AI on your computer
LM Studio · 2026
Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
TensorRT-LLM: A High-Performance Inference Framework for LLMs.
NVIDIA · 2024
Docker Documentation.
Docker Inc. · 2026 · Official documentation