Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.
Advanced MLOps and DevOps for AI gave you the release machinery: promotion gates, lineage, rollout policy, and rollback triggers. This chapter turns that machinery toward the inference fleet. GPU serving and autoscaling decide whether a promoted model can survive real traffic without burning budget or missing latency goals.
Imagine your merchant-support chat app goes viral. At 8 AM, ten customers are asking about order delays. By 9 AM, there are a thousand. If you try serving your AI model on a normal web server, it'll quickly grind to a halt. That's because serving a Large Language Model (LLM) isn't like serving a website. A web server fetches pre-computed data and forgets the user immediately. An LLM has to process every prompt and retain attention state as generation proceeds. That short-term memory is a central capacity constraint alongside compute, memory bandwidth, and queueing delay.
Standard web servers don't maintain growing state for each active user. An LLM does. To generate the next token, the model must read the attention keys and values it already computed for every previous token. This state is the key-value (KV) cache, and managing its memory footprint is one of the central challenges of production serving.
Engineers designing serving systems must optimize for two conflicting metrics:
The core difficulty is that the Key-Value (KV) cache (the stored attention computations for past tokens) grows dynamically with every generated token. Poor memory management leads to fragmentation, which forces smaller batch sizes and lower throughput.
To see why GPU serving is so different from normal web scaling, picture a fulfillment center during a return-label surge:
A common mistake is to scale because the sorter is busy. Scale instead when the queue at the dock grows too long. A busy sorter means you're using the hardware. A long queue means customers are waiting, and that's when you need more capacity.
Before diving into specific frameworks, let's look at one common production GPU serving architecture. It consists of several layers:
Infrastructure Layer: Kubernetes (EKS, GKE, or AKS) manages the GPU nodes. Unlike CPU nodes, GPU capacity is expensive and sometimes scarce, so a warm pool must be justified against latency targets and measured demand.
Control Layer: In a KEDA-based setup, KEDA (Kubernetes Event-Driven Autoscaling) can activate a workload from zero and configure a Horizontal Pod Autoscaler (HPA) to scale running replicas from metrics. Karpenter (AWS) or Cluster Autoscaler can then provision GPU nodes when new pods can't be scheduled on existing capacity.[1][2]
Serving Layer: The actual inference engines (vLLM, SGLang, TensorRT-LLM) that load models into VRAM and handle token generation. At datacenter scale, a control plane such as NVIDIA Dynamo can orchestrate disaggregated prefill/decode pools across many of these engine replicas.[3]
Metrics: CPU and RAM metrics alone don't describe LLM-serving capacity. NVIDIA's DCGM (Data Center GPU Manager) exposes GPU-specific telemetry, while user-facing scaling signals usually come from the serving layer: queue depth, KV cache usage, TTFT, and inter-token latency.[4]
The rest of this section zooms into the serving engine: batching policy, KV-cache allocation, framework choice, and request lifecycle.
Early serving systems used static batching: waiting for requests to arrive, padding them to the same length, and processing them together. It's inefficient because requests finish at different times. If one request generates 100 tokens and another generates 1000, the GPU must continue processing the batch until the longest request finishes, leaving slots idle for the shorter requests.
Modern frameworks use continuous batching (also known as iteration-level scheduling), pioneered by Orca[5]. The scheduler operates at the granularity of a single token generation step. As soon as a request finishes, a new request from the queue can be inserted into its slot in the next iteration.
The small simulation below uses generation lengths for four order-support replies. A static batch doesn't refill its two slots until both original replies finish; a continuous scheduler admits the next queued reply as soon as a slot opens.
```python
from collections import deque

def run_batching(lengths: list[int], slots: int, continuous: bool) -> tuple[int, int]:
    waiting = deque(lengths)
    active: list[int] = []
    token_steps = 0
    empty_slot_steps = 0

    while waiting or active:
        # Static batching refills only when the whole batch has drained;
        # continuous batching refills any free slot before every step.
        if not active or continuous:
            while waiting and len(active) < slots:
                active.append(waiting.popleft())

        token_steps += 1
        empty_slot_steps += slots - len(active)
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]

    return token_steps, empty_slot_steps

reply_lengths = [8, 2, 5, 3]
for label, continuous in [("static", False), ("continuous", True)]:
    steps, empty = run_batching(reply_lengths, slots=2, continuous=continuous)
    print(f"{label:10} token steps={steps:2} empty-slot steps={empty:2}")
```

```
static     token steps=13 empty-slot steps= 8
continuous token steps=10 empty-slot steps= 2
```

With the same replies and slots, the continuous scheduler finishes sooner because it wastes fewer slot-iterations.
One widely used open-source serving engine is vLLM, which introduced PagedAttention[6]. Before vLLM, inference systems pre-allocated a single contiguous block of GPU memory sized to the maximum sequence length a request might ever use (often 4k-32k tokens). Because the actual length is revealed token-by-token during decoding, this approach created two problems:
Profiling across real workloads showed that prior systems typically used only 20–38% of the KV cache memory they had reserved, wasting 62–80%.[6] The result: the effective batch size stayed small, and throughput suffered even when the GPU had spare VRAM.
PagedAttention treats the KV cache like virtual memory in an operating system (OS). It breaks the KV cache into fixed-size blocks (pages) that can be stored in non-contiguous memory.
PagedAttention introduces two key abstractions:
A Block Table maintains the mapping between logical and physical blocks. When a new token is generated, vLLM checks if the current physical block has space. If not, it allocates a new physical block from a pre-allocated pool and updates the mapping. The PagedAttention paper reports much tighter memory use and higher throughput because batches can pack around real sequence lengths instead of worst-case reservations.[6]
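The block-table bookkeeping described above can be sketched in a few lines. This is a toy model, not vLLM's implementation: the `BlockTable` class, the block ids, and the four-token block size are all hypothetical and chosen so the allocation is easy to trace.

```python
class BlockTable:
    """Toy logical-to-physical block mapping for one sequence (illustrative only)."""

    def __init__(self, free_blocks: list[int], tokens_per_block: int = 16):
        self.free_blocks = free_blocks            # shared pool of physical block ids
        self.tokens_per_block = tokens_per_block
        self.physical_blocks: list[int] = []      # list index = logical block number
        self.num_tokens = 0

    def append_token(self) -> int:
        """Record one generated token and return the physical block that holds it."""
        # A new physical block is needed only when the current one is full
        # (or when the sequence has no block yet).
        if self.num_tokens % self.tokens_per_block == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the scheduler must queue or preempt")
            self.physical_blocks.append(self.free_blocks.pop())
        self.num_tokens += 1
        return self.physical_blocks[-1]

pool = [7, 3, 9, 1]                   # hypothetical ids, deliberately non-contiguous
table = BlockTable(pool, tokens_per_block=4)
for _ in range(6):
    table.append_token()

print("logical -> physical:", dict(enumerate(table.physical_blocks)))
print("free pool:", pool)
```

Six tokens with four tokens per block occupy two physical blocks, and those blocks don't need to be adjacent in memory. The second block is only half full, which is the only slack this sequence holds.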
PagedAttention gives the runtime fine-grained control over KV blocks. In current vLLM deployments, request-to-request reuse of a shared prompt prefix usually comes from automatic prefix caching, which reuses previously computed KV blocks when a new request shares the same prefix.[7]
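To make block-level prefix reuse concrete, here is a minimal sketch that assumes each full block is keyed by a hash chained over everything before it. The block size, the `block_keys` and `admit` helpers, and the token ids are hypothetical; real engines track reference counts, eviction, and partial blocks, which this sketch ignores.

```python
import hashlib

TOKENS_PER_BLOCK = 4  # hypothetical; small so the example is easy to trace

def block_keys(token_ids: list[int]) -> list[str]:
    """Return one chained hash per full block, so a key identifies the whole prefix."""
    keys: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % TOKENS_PER_BLOCK
    for start in range(0, full, TOKENS_PER_BLOCK):
        chunk = token_ids[start:start + TOKENS_PER_BLOCK]
        parent = hashlib.sha256(f"{parent}|{chunk}".encode()).hexdigest()
        keys.append(parent)
    return keys

cache: set[str] = set()  # stands in for the pool of already-computed KV blocks

def admit(token_ids: list[int]) -> int:
    """Return how many leading prompt tokens can reuse cached blocks."""
    reused = 0
    keys = block_keys(token_ids)
    for key in keys:
        if key not in cache:
            break                      # reuse stops at the first block that differs
        reused += TOKENS_PER_BLOCK
    cache.update(keys)
    return reused

system_prompt = list(range(100, 110))            # 10 shared tokens
first = admit(system_prompt + [1, 2, 3])         # cold cache
second = admit(system_prompt + [4, 5, 6, 7])     # shares the first two full blocks
print(f"reused tokens: first={first} second={second}")
```

The second request shares ten tokens with the first but reuses only eight, because reuse happens at whole-block granularity and the third block mixes shared and request-specific tokens.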
Here is the allocation arithmetic for four conversations. The fixed reservation policy reserves each request for its maximum allowed length, while paging reserves blocks only for tokens currently stored:
```python
import math

cached_tokens = [320, 75, 900, 250]
max_tokens_per_request = 2048
tokens_per_block = 16

# Fixed policy: every request reserves blocks for its maximum length.
fixed_blocks = len(cached_tokens) * math.ceil(max_tokens_per_request / tokens_per_block)
# Paged policy: blocks are allocated only for tokens already cached.
paged_blocks = sum(math.ceil(tokens / tokens_per_block) for tokens in cached_tokens)
used_tokens = sum(cached_tokens)

print(f"tokens currently cached: {used_tokens}")
print(f"fixed reservation blocks: {fixed_blocks}")
print(f"paged allocation blocks: {paged_blocks}")
print(f"reserved tokens avoided: {(fixed_blocks - paged_blocks) * tokens_per_block}")
```

```
tokens currently cached: 1545
fixed reservation blocks: 512
paged allocation blocks: 98
reserved tokens avoided: 6624
```

Paging doesn't make attention state free. It stops short conversations from holding blocks for a maximum length they haven't used.
The overall request lifecycle coordinates these components. Client requests enter an API server queue, are packed by a continuous batching scheduler, and finally executed by a model engine using PagedAttention and, when enabled, prefix caching on the GPU cluster:
Choosing the right serving engine depends on your specific constraints:
| Framework | Key Feature | Best For | Pros | Cons |
|---|---|---|---|---|
| vLLM | PagedAttention | General production serving | High throughput, easy to use, active community | Whether it loses latency to tuned TensorRT-LLM depends on model, kernels, and benchmark |
| TensorRT-LLM[8][9] | TensorRT engine build + fused kernels | Max performance on NVIDIA | Very low latency, FP8 support, AWQ/GPTQ quantization | More build and deployment complexity, tighter NVIDIA coupling |
| TGI (Text Generation Inference)[10] | HF-first serving stack | Existing Hugging Face deployments | Mature launcher, streaming, Prometheus metrics, tensor parallelism, continuous batching | Official docs now describe TGI as maintenance mode and recommend newer optimized engines for most new work |
| SGLang | RadixAttention | Complex agent workflows, high-prefix-reuse traffic | Automatic KV-cache reuse across prompts via a radix tree[11] | Requires workload-specific benchmarking against other engines |
| llama.cpp[12] | GGUF quantization | Local/edge deployment | Broad GGUF quantization (INT4-INT8), runs on CPU/Mac/Windows with broad hardware compatibility | Lower throughput than GPU-optimized frameworks |
The TGI documentation describes TGI as maintenance mode and directs users toward newer optimized engines such as vLLM and SGLang for new deployments.[10] These engines are the data plane that loads weights and generates tokens. At datacenter scale, a separate control plane such as NVIDIA Dynamo can sit above them to provide disaggregated prefill/decode scheduling and KV-aware routing across engine replicas (covered later under disaggregation).[3]
To see how this works in practice, here's an example of initializing vLLM with settings you would benchmark before deployment. The configuration below loads a large model across multiple GPUs using tensor parallelism and sets a high memory-utilization budget for weights, KV blocks, and runtime buffers. It also uses chunked prefill so long prompts can share scheduling budget with decode traffic.[13] It takes the model identifier and hardware constraints as inputs, and outputs an initialized engine ready to accept inference requests:
```python
from vllm import LLM, SamplingParams

# Production configuration for vLLM
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,          # Split across 4 GPUs
    gpu_memory_utilization=0.90,     # Let vLLM claim ~90% of VRAM for weights, KV, and runtime buffers
    max_model_len=8192,              # Enforce context limit
    enable_chunked_prefill=True,     # vLLM V1 enables this by default; keep it explicit in reviewed configs
    enable_prefix_caching=True,      # Reuse KV blocks when requests share a prefix
    max_num_batched_tokens=16384,    # Main TTFT/TPOT trade-off knob for chunked prefill
)

# Sampling parameters control generation
params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

outputs = llm.generate(["Summarize the delivery delay for order A-1842."], params)
```

All of these knobs are workload-dependent. Tune them against TTFT and TPOT SLOs, not just raw tokens/sec.
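To see why the per-step token budget matters for chunked prefill, here is a toy scheduler step. It assumes decode requests are served first at one token each and that leftover budget is spent on prefill chunks in queue order. The `schedule_step` function and all numbers are hypothetical and far simpler than a real engine's scheduler.

```python
def schedule_step(
    decoding: int, prefill_remaining: list[int], token_budget: int
) -> tuple[int, list[int]]:
    """Return (decode tokens scheduled, prefill chunk size per waiting prompt)."""
    decode_tokens = min(decoding, token_budget)   # each decoding request needs one token
    budget = token_budget - decode_tokens
    chunks: list[int] = []
    for remaining in prefill_remaining:
        chunk = min(remaining, budget)            # a long prompt is cut to fit the budget
        chunks.append(chunk)
        budget -= chunk
    return decode_tokens, chunks

prefills = [3000, 500]   # hypothetical prompt tokens still waiting for prefill
step = 0
while any(prefills):
    step += 1
    decode_tokens, chunks = schedule_step(
        decoding=48, prefill_remaining=prefills, token_budget=2048
    )
    prefills = [left - chunk for left, chunk in zip(prefills, chunks)]
    print(f"step {step}: decode={decode_tokens} prefill_chunks={chunks} left={prefills}")
```

In this sketch a larger budget finishes prefill in fewer steps, which helps TTFT, while each step carries more prefill work alongside decode, which tends to raise TPOT. That is the trade-off to benchmark rather than assume.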
Selecting the right GPU depends on model size, quantization, and expected traffic. The dominant factors are VRAM (Video Random Access Memory) capacity (to fit the model + KV cache) and Memory Bandwidth (to serve tokens fast).
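To estimate memory requirements for capacity, use this formula:

$$
M_{\text{total}} \approx \left(M_{\text{weights}} + M_{\text{KV}}\right) \times f_{\text{overhead}},
\qquad
M_{\text{KV}} = 2 \times L \times H_{\text{kv}} \times d_{\text{head}} \times S \times B \times b_{\text{kv}}
$$

Where:

- $M_{\text{weights}}$ is the parameter count multiplied by the bytes per weight
- $2$ accounts for storing both keys and values
- $L$ is the number of layers
- $H_{\text{kv}}$ is the number of KV heads, not the total attention head count
- $d_{\text{head}}$ is the dimension of each head
- $S$ is the context length in tokens per request
- $B$ is the number of concurrent requests
- $b_{\text{kv}}$ is the bytes per KV cache element
- $f_{\text{overhead}}$ is a headroom multiplier for runtime buffers and fragmentation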
That distinction matters. The older shorthand using Hidden Dim assumes full multi-head attention. Many modern decoder models use grouped-query attention (GQA) or multi-query attention (MQA), so the number of KV heads is much smaller than the total attention head count. If you ignore that, you'll often overestimate KV cache size by a wide margin.
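A quick calculation shows the size of that gap. The shape below is an illustrative 70B-class configuration (80 layers, 64 attention heads, 8 KV heads, head dimension 128) assumed for the example. Check the model's own config before using these numbers for sizing.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, kv_bytes: int = 2) -> int:
    """Bytes of KV cache one token adds: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * kv_bytes

# Illustrative shape; the head counts are assumptions for this example.
layers, attention_heads, kv_heads, head_dim = 80, 64, 8, 128

mha_style = kv_bytes_per_token(layers, attention_heads, head_dim)  # hidden-dim shorthand
gqa = kv_bytes_per_token(layers, kv_heads, head_dim)               # actual KV heads

print(f"hidden-dim shorthand: {mha_style / 1e6:.2f} MB per token")
print(f"GQA with 8 KV heads:  {gqa / 1e6:.2f} MB per token")
print(f"overestimate factor:  {mha_style / gqa:.0f}x")
```

With these assumed values the shorthand overstates the cache by the ratio of attention heads to KV heads, which is 8x here.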
Here's a practical Python function that estimates the required GPU memory based on model size and cache expectations. It takes the model size plus the architecture terms that control KV growth (num_layers, num_kv_heads, head_dim, context length, and concurrency), then returns a dictionary detailing the number of specific GPU models needed. This helps engineers plan capacity before deploying:
```python
from collections.abc import Mapping
import math

def estimate_gpu_requirements(
    model_params_b: float,      # Billions of parameters
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    target_concurrency: int,
    weight_bytes: int = 2,      # BF16/FP16=2, INT8=1, FP8=1
    kv_bytes: int = 2,          # KV cache often stays in BF16/FP16
    overhead_factor: float = 1.15,
) -> Mapping[str, object]:
    # 1. Model weights. model_params_b is already in billions, so
    #    model_params_b * bytes gives an approximate size in decimal GB.
    weight_memory_gb = model_params_b * weight_bytes

    # 2. KV cache. For GQA/MQA models, num_kv_heads is smaller than
    #    the total attention head count.
    kv_memory_bytes = (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * context_len
        * target_concurrency
        * kv_bytes
    )
    kv_memory_gb = kv_memory_bytes / 1e9

    total_gb = (weight_memory_gb + kv_memory_gb) * overhead_factor

    # Divisors leave a little per-GPU headroom below the nominal capacity.
    gpu_options: dict[str, Mapping[str, int]] = {
        "L4_24GB": {"mem": 24, "needed": max(1, math.ceil(total_gb / 22))},
        "A100_40GB": {"mem": 40, "needed": max(1, math.ceil(total_gb / 38))},
        "A100_80GB": {"mem": 80, "needed": max(1, math.ceil(total_gb / 76))},
        "H100_80GB": {"mem": 80, "needed": max(1, math.ceil(total_gb / 76))},
        "H200_141GB": {"mem": 141, "needed": max(1, math.ceil(total_gb / 134))},
    }
    return {
        "weights_gb": weight_memory_gb,
        "kv_cache_gb": kv_memory_gb,
        "total_with_overhead_gb": total_gb,
        "gpu_options": gpu_options,
    }

def show_case(name: str, result: Mapping[str, object]) -> None:
    gpu_options = result["gpu_options"]
    print(name)
    print(f"  weights: {result['weights_gb']:.1f} GB")
    print(f"  kv cache: {result['kv_cache_gb']:.1f} GB")
    print(f"  total with overhead: {result['total_with_overhead_gb']:.1f} GB")
    print(f"  H100_80GB needed: {gpu_options['H100_80GB']['needed']}")

chat_8b = estimate_gpu_requirements(
    model_params_b=8,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    context_len=4096,
    target_concurrency=8,
)

llama_70b = estimate_gpu_requirements(
    model_params_b=70,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    context_len=8192,
    target_concurrency=4,
)

show_case("8B BF16, 4k context, 8 active requests", chat_8b)
show_case("70B BF16, 8k context, 4 active requests", llama_70b)
```

```
8B BF16, 4k context, 8 active requests
  weights: 16.0 GB
  kv cache: 4.3 GB
  total with overhead: 23.3 GB
  H100_80GB needed: 1
70B BF16, 8k context, 4 active requests
  weights: 140.0 GB
  kv cache: 10.7 GB
  total with overhead: 173.3 GB
  H100_80GB needed: 3
```

Notice how the 70B BF16 case needs more than the theoretical two-GPU weight fit once you add four active 8k contexts and operational headroom. The table below is a minimum-fit reference; production sizing starts from measured prompt and concurrency distributions.
As a reference, the following table gives order-of-magnitude sizing for common dense decoder deployments. The KV numbers assume GQA-style architectures and FP16/BF16 KV cache. Full multi-head attention or larger KV precision pushes the cache higher.
| Model Size | Precision | Weights Memory | Est. KV Cache (per 1k cached tokens, 1 active request) | Minimum GPU Suggestion |
|---|---|---|---|---|
| 7B dense model | FP16/BF16 | ~14 GB | ~0.10-0.13 GB | 1x 24GB GPU (A10G, L4) |
| 32B dense model | FP16/BF16 | ~64 GB | ~0.15-0.30 GB | 1x 80GB GPU |
| 70B dense model | INT8 weights + BF16 KV | ~70 GB | ~0.25-0.35 GB | 1x 80GB only for short contexts and tight concurrency limits |
| 70B dense model | FP16/BF16 | ~140 GB | ~0.25-0.35 GB | 2x 80GB GPUs via TP |
That 70B INT8 row is a tight fit. In practice, many teams still use two GPUs so they have room for longer prompts, prefix caching, and higher concurrency.
For cost-efficient autoscaling of smaller models, Multi-Instance GPU (MIG) allows you to partition a single GPU into hardware-isolated slices. Ampere- and Hopper-class parts such as A100, H100, and H200 can expose up to seven instances on supported SKUs.[14] Instead of dedicating an entire large GPU to a single 7B model, you can run multiple replicas on isolated slices with dedicated compute and memory resources.
For low-batch decode of large decoder models, weight reads commonly make the system memory-bandwidth bound. Batching, cache traffic, kernels, and multi-GPU communication can change which limit dominates.
The theoretical maximum throughput ($T_{\max}$) for a batch size of 1 is roughly (a simplified model derived from memory-bandwidth analyses such as Pope et al.)[15]:

$$
T_{\max} \approx \frac{\text{BW}_{\text{HBM}}}{M_{\text{weights}}}
$$

In this simplified batch-size-one model, the GPU streams the weight footprint from HBM for each generated token. There is little weight reuse across sequential steps (the KV cache reuses activations, not the static weights). Memory bandwidth therefore gives a useful ceiling:

$$
\text{tokens/sec} \le \frac{\text{HBM bandwidth (bytes/s)}}{\text{parameter count} \times \text{bytes per parameter}}
$$
This is the reciprocal of the time required to load the entire model from HBM once. An H100 SXM (3.35 TB/s HBM bandwidth) serving a 7B dense model in FP16 (≈14 GB of weights) yields an upper bound of roughly 239 tokens/sec for a single sequence.[16] Real systems achieve lower throughput because:
This bound explains why quantization can improve decode throughput: halving only the weight-footprint term (FP16 to INT8, for example) doubles this simplified ceiling. Actual speedup is smaller or different when kernel support, KV-cache traffic, compute, or communication dominate. It also shows why large models may require tensor parallelism: the denominator grows while per-GPU bandwidth stays fixed.
The next calculation is deliberately a ceiling, not a benchmark. It isolates weight traffic so you can see what quantization changes before adding runtime overhead:
```python
bandwidth_gb_s = 3350  # H100 SXM HBM bandwidth in GB/s
weight_footprints_gb = {
    "7B BF16": 14,
    "7B INT8 weights": 7,
    "70B BF16": 140,
}

for name, weights_gb in weight_footprints_gb.items():
    # Ceiling = bandwidth divided by the bytes streamed per generated token.
    ceiling_tps = bandwidth_gb_s / weights_gb
    print(f"{name:16} weight-only ceiling = {ceiling_tps:6.1f} token/s")
```

```
7B BF16          weight-only ceiling =  239.3 token/s
7B INT8 weights  weight-only ceiling =  478.6 token/s
70B BF16         weight-only ceiling =   23.9 token/s
```

For models that don't fit on a single GPU (for example, a 70B dense model in BF16), we use Tensor Parallelism (TP). TP splits the individual weight matrices across multiple GPUs so that the computation is distributed evenly. This technique was pioneered in the Megatron-LM training system[17] and adapted for inference serving.
TP operates by slicing the matrix computations, meaning all participating GPUs must communicate their partial results at each layer before the model can proceed to the next layer. This constant, high-volume communication requires immense bandwidth.
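The slicing can be illustrated with a column-parallel split in plain Python. This is one simple variant, assumed here for illustration: each simulated device holds a slice of the weight columns and computes its share of the output, then the partial results are concatenated. The concatenation stands in for the inter-GPU communication step. The helper names are hypothetical, and real systems use collective operations over NVLink or a similar fabric.

```python
def matvec(x: list[float], w: list[list[float]]) -> list[float]:
    """Compute y = x @ w for a row vector x and a matrix w stored as rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Give each simulated device an equal slice of the weight columns."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]

shards = split_columns(w, parts=2)                  # one shard per simulated GPU
partials = [matvec(x, shard) for shard in shards]   # computed independently
gathered = [value for part in partials for value in part]  # the communication step

print("partials:", partials)
print("gathered:", gathered)
print("matches single-device result:", gathered == matvec(x, w))
```

Each device holds only half the weights, but no device has the full output until the gather completes. That exchange repeats at every layer, which is why TP is usually kept inside a node with fast interconnect.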
The following diagram illustrates the structural difference between intra-node Tensor Parallelism (where computation is split and synchronized at every layer) and inter-node Pipeline Parallelism (where computation is chained sequentially across nodes):
Communication frequency is not total cost: tensor-parallel messages may travel over faster links, while pipeline stages can create bubbles. This toy count makes the first screening question explicit:
```python
layers = 80
pipeline_stages = 4
decode_steps = 32

# TP synchronizes at every layer; PP hands off once per stage boundary.
tp_sync_events = layers * decode_steps
pp_boundary_events = (pipeline_stages - 1) * decode_steps

print(f"tensor-parallel sync events: {tp_sync_events}")
print(f"pipeline boundary transfers: {pp_boundary_events}")
print("Measure bytes, fabric speed, and pipeline bubbles before choosing.")
```

```
tensor-parallel sync events: 2560
pipeline boundary transfers: 96
Measure bytes, fabric speed, and pipeline bubbles before choosing.
```

CPU-based autoscaling alone is insufficient for LLM serving because CPU isn't usually the scarce inference resource. CPU can still reveal overloaded gateways or tokenizers. GPU utilization can also be misleading: a GPU might be busy with a healthy batch, or it might be memory-bound while compute duty cycle looks modest.
The most reliable metrics for autoscaling come from two sources: NVIDIA's DCGM (Data Center GPU Manager) for hardware-level visibility, and the serving framework itself for application-level signals. In vLLM, for example, the Prometheus endpoint exposes metrics such as vllm:num_requests_waiting, vllm:num_requests_running, vllm:kv_cache_usage_perc, vllm:time_to_first_token_seconds, and vllm:inter_token_latency_seconds.[4]
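As a rough illustration of consuming those signals, the sketch below pulls gauge values out of a Prometheus text-format payload. The sample payload, its help text, and the `model_name` label value are made up for the example, and the parser assumes one sample per metric with no trailing timestamp. A production setup would normally let Prometheus scrape the endpoint instead.

```python
# Hypothetical scrape output in Prometheus text exposition format.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
vllm:num_requests_running{model_name="demo"} 8.0
vllm:kv_cache_usage_perc{model_name="demo"} 0.71
"""

def read_gauges(text: str, wanted: set[str]) -> dict[str, float]:
    """Extract the value of each wanted metric, ignoring labels and comments."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                                   # skip HELP/TYPE lines
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]        # drop the {label="..."} part
        if name in wanted:
            values[name] = float(value)
    return values

gauges = read_gauges(
    SAMPLE,
    {"vllm:num_requests_waiting", "vllm:kv_cache_usage_perc"},
)
print(gauges)
```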
| Metric | Source | Why It Matters | Example Signal |
|---|---|---|---|
| Request Queue Depth | Serving framework or gateway | Backlog of waiting requests | Sustained upward trend |
| KV Cache Utilization | Serving framework | Memory pressure; near saturation means admission gets tight | Sustained high watermark |
| GPU Duty Cycle | DCGM | Useful supporting telemetry, but easy to misread alone | Corroborate with app metrics |
| Time To First Token (TTFT) | Serving framework | User-facing latency SLA | SLO breach |
| Time Per Output Token (TPOT) | Serving framework | Streaming perceived speed | SLO breach |
DCGM exposes hardware-level GPU metrics such as temperature, power, clock rates, and utilization. GPU utilization seems like an obvious scaling signal, but it's often misleading for LLM inference. The headline utilization figure reports the share of time in which at least one kernel was executing, so a GPU can show 100% "utilization" while its kernels are stalled waiting on memory. The number alone doesn't tell you whether a replica can admit more requests.
That's why application-level metrics (queue depth, KV cache utilization, TTFT, TPOT) are more reliable scaling signals. They directly measure capacity constraints rather than hardware activity. The exact thresholds are workload-specific, so treat any numbers you see in dashboards or sample code as starting points, not universal defaults.
One control-loop design works as follows: the serving framework exposes metrics via Prometheus. KEDA can activate a scaled workload from zero and configure the generated HPA for running replica scaling. If the scheduler can't place new GPU pods, the node autoscaler layer (often Karpenter or Cluster Autoscaler) provisions nodes that satisfy pod requirements.[1][2]
This two-layer approach separates desired pod count from hardware supply. Ready pods may scale quickly if spare GPU capacity exists; new nodes and model loading can take far longer. The useful warm buffer is a measured tradeoff between idle GPU cost and the latency damage of a cold burst.
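The replica-count half of that loop can be approximated with the ratio rule Kubernetes documents for the HPA: desired replicas equal the current count scaled by the ratio of observed metric to target, rounded up, with a default tolerance of 10% around the target. The function name, the bounds, and the metric values below are hypothetical. The sketch also leaves out stabilization windows and scaling policies.

```python
import math

def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,      # e.g. average waiting requests per replica
    target_metric: float,       # the per-replica target you benchmarked
    min_replicas: int = 1,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA ratio rule, clamped to the configured bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

for label, metric in [("burst", 12.0), ("near target", 5.2), ("quiet", 1.0)]:
    print(f"{label:12} 4 -> {hpa_desired_replicas(4, metric, target_metric=5.0)} replicas")
```

The result is only a desired count. Whether those pods become ready quickly still depends on spare GPU capacity and model load time.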
The autoscaling control loop continuously monitors queue depth, KV cache utilization, and latency metrics to make scaling decisions. As the diagram below shows, the metric path and the node-provisioning path are separate:
In a real Kubernetes deployment, KEDA and HPA compute the desired replica count for you. The following Python snippet is just a mental model for the control logic. It takes the current metrics and replica count as inputs, then returns the next target replica count. The thresholds are illustrative only:
```python
import time

class GPUAutoscaler:
    def __init__(self, min_replicas: int = 1, max_replicas: int = 20, cooldown_s: int = 300):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.cooldown_s = cooldown_s
        self.last_scale_time = 0.0

    def recommend(self, metrics: dict, current_replicas: int) -> int:
        # Hold the current count while the cooldown from the last change is active.
        now = time.time()
        if now - self.last_scale_time < self.cooldown_s:
            return current_replicas

        # Any single pressure signal is enough to add capacity.
        scale_up = (
            metrics["num_requests_waiting"] > 10
            or metrics["kv_cache_utilization"] > 0.85
            or metrics["ttft_p95_ms"] > 1500
        )

        # Scale down only when every signal agrees the fleet is idle.
        lightly_loaded = (
            metrics["num_requests_waiting"] == 0
            and metrics["num_requests_running"] <= max(1, current_replicas // 2)
            and metrics["kv_cache_utilization"] < 0.20
        )

        if scale_up:
            self.last_scale_time = now
            return min(self.max_replicas, current_replicas + 1)
        if lightly_loaded:
            self.last_scale_time = now
            return max(self.min_replicas, current_replicas - 1)
        return current_replicas

# cooldown_s=0 so the three scenarios can run back to back.
autoscaler = GPUAutoscaler(min_replicas=1, max_replicas=20, cooldown_s=0)
scenarios = [
    (
        "spike",
        {
            "num_requests_waiting": 42,
            "num_requests_running": 8,
            "kv_cache_utilization": 0.72,
            "ttft_p95_ms": 1200,
        },
        4,
    ),
    (
        "cache pressure",
        {
            "num_requests_waiting": 2,
            "num_requests_running": 7,
            "kv_cache_utilization": 0.91,
            "ttft_p95_ms": 900,
        },
        5,
    ),
    (
        "quiet",
        {
            "num_requests_waiting": 0,
            "num_requests_running": 1,
            "kv_cache_utilization": 0.12,
            "ttft_p95_ms": 450,
        },
        5,
    ),
]

for label, metrics, replicas in scenarios:
    print(f"{label}: {replicas} -> {autoscaler.recommend(metrics, replicas)} replicas")
```

```
spike: 4 -> 5 replicas
cache pressure: 5 -> 6 replicas
quiet: 5 -> 4 replicas
```

The autoscaler determines a target; it doesn't erase startup time. This capacity calculation separates the eventual replica count from the traffic a warm buffer can accept immediately:
```python
import math

def capacity_plan(concurrent_requests: int, requests_per_replica: int, warm_replicas: int) -> dict[str, int]:
    target_replicas = math.ceil(concurrent_requests / requests_per_replica)
    # Only replicas that are already warm can admit traffic right away.
    immediately_admitted = min(concurrent_requests, warm_replicas * requests_per_replica)
    return {
        "target_replicas": target_replicas,
        "replicas_to_start": max(0, target_replicas - warm_replicas),
        "requests_waiting_for_cold_capacity": concurrent_requests - immediately_admitted,
    }

for warm in [1, 3, 20]:
    plan = capacity_plan(concurrent_requests=100, requests_per_replica=5, warm_replicas=warm)
    print(f"warm={warm:2}: {plan}")
```

```
warm= 1: {'target_replicas': 20, 'replicas_to_start': 19, 'requests_waiting_for_cold_capacity': 95}
warm= 3: {'target_replicas': 20, 'replicas_to_start': 17, 'requests_waiting_for_cold_capacity': 85}
warm=20: {'target_replicas': 20, 'replicas_to_start': 0, 'requests_waiting_for_cold_capacity': 0}
```

Let's walk through a concrete scenario. You're hosting a Llama-3-8B model for your merchant-support chat app. Traffic is nearly zero at night, but at 9 AM it jumps to 100 concurrent users. How do you scale without blowing the budget or the user experience?
Step 1: pick the right metric. Scaling only on CPU percentage misses engine capacity. A GPU worker can show little CPU pressure while its weights and KV blocks constrain admission. Use request queue depth, queue age, KV cache utilization, and latency SLOs; retain CPU signals for gateway or preprocessing failures.
Step 2: set a threshold. Through benchmarking, you determine that one GPU can comfortably handle about 5 concurrent requests while keeping TPOT under 50 ms per token. That's your target capacity per replica.
Step 3: do the math. With 100 concurrent requests and 5 requests per GPU, you need 100 / 5 = 20 replicas, each on its own GPU. If you're starting from one warm instance at 8:59 AM, the autoscaler must add 19 more replicas quickly.
Step 4: solve the cold-start gap. Even if a replica target changes promptly, node provisioning plus model loading can outlast the traffic's latency budget. Measure that ready time on your platform. If the morning rush is predictable, start warming nodes far enough ahead of the observed ready time, or keep a benchmarked buffer pool that absorbs its first wave.
Scaling only when latency spikes is late. By the time TTFT breaches your SLO, users are already frustrated. Scale on queue depth trends, not only on lagging latency indicators.
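One way to act on a trend rather than a single reading is to fit a slope to recent queue-depth samples and project it across your measured ready time. The sketch below uses an ordinary least-squares slope. The sample window, the 120-second ready time, and the threshold of 10 waiting requests are all hypothetical.

```python
def queue_slope(samples: list[tuple[float, int]]) -> float:
    """Least-squares slope of queue depth over time, in requests per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_q = sum(q for _, q in samples) / n
    numerator = sum((t - mean_t) * (q - mean_q) for t, q in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator if denominator else 0.0

# (seconds since window start, waiting requests); hypothetical samples
window = [(0, 2), (15, 3), (30, 5), (45, 8), (60, 12)]
ready_time_s = 120       # hypothetical measured time for a new replica to serve
queue_threshold = 10     # hypothetical tolerated backlog

slope = queue_slope(window)
projected = window[-1][1] + slope * ready_time_s
scale_now = projected > queue_threshold

print(f"slope={slope:.3f} req/s projected_depth={projected:.1f} scale_now={scale_now}")
```

A linear projection is crude, and bursty traffic may need a shorter window or a different model. The point is to make the decision before TTFT has already breached.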
Cold starts in GPU serving environments are significantly more disruptive than in traditional microservices. While a typical web container might start in seconds, a GPU workload often has to provision a node, pull a large image, fetch model weights, and then warm up the runtime before it can accept traffic.
Here's the difference between a warm GPU (already loaded and serving) and a cold GPU (starting from scratch):
The cold start has four distinct phases, each with its own mitigation strategy. The ranges below are illustrative planning inputs, not platform guarantees; record your own p50 and p95 timings:
| Phase | Illustrative Duration | Mitigation Strategy |
|---|---|---|
| Node provisioning | 30-180s | Use warm pools, reserved capacity, or faster node images |
| Container pull | 10-90s | Use lazy pulling or container streaming so startup doesn't wait for the full image |
| Model weight loading | 30-180s | Keep weights on local NVMe or a warm shared cache close to the node |
| Runtime warmup | 5-30s | Pre-build optimized engines where possible and run readiness warmups |
Container streaming lets the container start executing before the entire image is downloaded. Only the required layers are fetched on demand, which cuts startup time for large inference images.
Model caching keeps frequently used model weights on high-speed local NVMe storage rather than pulling from remote object storage. For multi-node setups, shared network volumes like Amazon FSx for Lustre can serve weights at high speed to multiple nodes.
Some platforms also offer snapshot or restore features for a preloaded container or VM. Those can help, but they're more vendor-specific than warm pools plus local weight caches.
To mitigate cold start latency impact on users, engineers implement several proactive strategies:
You can turn measured phase durations into a readiness decision. In this example, predictive scaling four minutes before a known carrier-status surge is sufficient for the measured path, while scaling two minutes before it is not:
```python
phase_seconds = {
    "node provisioning": 95,
    "container pull": 22,
    "weight loading": 78,
    "runtime warmup": 12,
}
measured_ready_s = sum(phase_seconds.values())

for lead_time_s in [120, 240]:
    # Positive margin means the replica is ready before the surge arrives.
    spare_s = lead_time_s - measured_ready_s
    status = "ready before surge" if spare_s >= 0 else "surge sees cold queue"
    print(f"lead={lead_time_s:3}s ready={measured_ready_s:3}s margin={spare_s:4}s: {status}")
```

```
lead=120s ready=207s margin= -87s: surge sees cold queue
lead=240s ready=207s margin=  33s: ready before surge
```

Running top-end GPUs 24/7 is expensive, and idle replicas burn budget fast. To optimize costs without breaking Service Level Agreements (SLAs):
Cloud providers offer spot instances at discounts, but they come with preemption risk. Inference can be easier to retry than long-running training, but reasoning runs and long generations may exceed a provider's warning window. Use measured request durations, a drain deadline, and retry behavior before assigning traffic to spot capacity.
In retail, seasonal spikes such as holiday rushes can be candidates for spot overflow capacity when a stable baseline pool meets the promised SLO and interrupted work can drain or retry cleanly.
On SIGTERM, remove the replica from new admission, let completions that fit the remaining drain budget finish, and retry or fail over requests that cannot finish before termination.

The drain policy must compare remaining work with a deadline rather than assume all requests finish. This example reserves twenty seconds for shutdown and transfer after a 120-second termination warning:
```python
warning_seconds = 120
shutdown_margin_seconds = 20
finish_budget_seconds = warning_seconds - shutdown_margin_seconds
active_requests = {
    "order-status": 12,
    "refund-summary": 84,
    "bulk-claims-reasoning": 170,
}

for request, remaining_seconds in active_requests.items():
    # Requests that cannot finish inside the budget move to the stable pool.
    action = "finish during drain" if remaining_seconds <= finish_budget_seconds else "retry on stable pool"
    print(f"{request:23} remaining={remaining_seconds:3}s -> {action}")
```

```
order-status            remaining= 12s -> finish during drain
refund-summary          remaining= 84s -> finish during drain
bulk-claims-reasoning   remaining=170s -> retry on stable pool
```

A common (and expensive) mistake is scaling down too aggressively. If you terminate a GPU node because traffic dipped for 30 seconds, you'll pay the cold start penalty again when traffic returns a minute later. This "thrashing" can actually increase costs while degrading user experience.
In e-commerce logistics, this is like closing a return-processing lane because the conveyor cleared for 30 seconds during a lull, only to have the post-holiday rush hit again before the lane can reopen.
Best practices for scale-down:
When deciding how to deploy GPU inference, teams face a build-vs-buy decision:
| Approach | Providers | Pros | Cons |
|---|---|---|---|
| Serverless GPUs | Modal, RunPod, Replicate, Together AI | Less platform work, built-in autoscaling, usage-based billing | Less control, potentially higher unit cost, cold starts on infrequent traffic |
| Specialized GPU cloud | CoreWeave, Lambda Cloud | Fast access to newer GPUs, strong price/performance, more control over instances and storage topology | More infrastructure work than serverless, portability can be weaker |
| Managed K8s (GPU) | EKS/GKE/AKS + Karpenter | Full control, spot instance support, custom metrics | Complex to set up and maintain, requires ML platform expertise |
Serverless platforms like Modal and RunPod abstract away the Kubernetes layer entirely. They typically handle autoscaling and much of the instance lifecycle for you. This is ideal for teams without dedicated ML infrastructure engineers or for workloads with highly variable traffic.
Specialized GPU clouds sit in the middle. You usually get raw instances, storage, or managed Kubernetes primitives without the full DIY burden of the hyperscalers. Managed Kubernetes still gives you the most control over the full stack (vLLM versions, custom schedulers, quantization methods) and can be cheaper once utilization is high and predictable enough to pay back the platform engineering overhead.
Even experienced engineers trip over the same patterns when moving from web serving to GPU inference. Here are the frequent failures to catch during design review, with their symptoms, root causes, and fixes.
Symptom: Your GPU node count oscillates wildly. Cloud bills spike, but user latency doesn't improve. The autoscaler logs show scale-up events followed by scale-down events within minutes.
Cause: The autoscaler reacts to short traffic blips instead of sustained trends. A GPU node that just finished loading weights gets terminated before it serves enough requests to justify its startup cost.
Fix: Choose a cooldown from traffic and startup measurements rather than a universal duration. Scale down gradually, and require queue depth near zero, low KV cache usage, and low running requests before removing capacity.
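A scale-down gate along these lines could look like the sketch below. The thresholds, the `ReplicaSignals` fields, and the 300-second lull are assumptions for illustration; the 207-second node-ready time echoes the earlier example output. Derive the real values from your own traces.

```python
from dataclasses import dataclass

@dataclass
class ReplicaSignals:
    queue_depth: int
    kv_cache_usage: float            # fraction of KV blocks in use, 0.0 to 1.0
    running_requests: int
    seconds_since_last_scale: float

def may_scale_down(s, cooldown_s, max_queue_depth=0, max_kv_usage=0.2, max_running=2):
    """Permit removing one replica only when the cooldown has passed and every signal is quiet."""
    return (
        s.seconds_since_last_scale >= cooldown_s
        and s.queue_depth <= max_queue_depth
        and s.kv_cache_usage <= max_kv_usage
        and s.running_requests <= max_running
    )

# Hypothetical measurements: cooldown covers both node-ready time and a typical lull.
node_ready_s = 207
typical_lull_s = 300
cooldown_s = max(node_ready_s, typical_lull_s)

brief_dip = ReplicaSignals(queue_depth=0, kv_cache_usage=0.1, running_requests=1,
                           seconds_since_last_scale=30)
long_quiet = ReplicaSignals(queue_depth=0, kv_cache_usage=0.1, running_requests=1,
                            seconds_since_last_scale=900)

print(may_scale_down(brief_dip, cooldown_s))   # False: still inside the cooldown
print(may_scale_down(long_quiet, cooldown_s))  # True: quiet for longer than the cooldown
```

Calling the gate once per evaluation interval and removing at most one replica per positive result gives the gradual scale-down described above.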
Symptom: Your dashboard shows an average response time of 800 ms, but support tickets complain about 30-second waits. The p95 or p99 latency is an order of magnitude worse than the mean.
Cause: Average latency hides the users with long prompts or the batches where one straggler request keeps the GPU occupied. Static batching makes this worse, but even continuous batching can suffer if a single request with a 4,000-token prompt monopolizes a slot.
Fix: Monitor TTFT and TPOT at p95 or p99, not the mean. Set SLOs on tail latency. If long prompts are common, enable chunked prefill so they don't starve shorter decode requests.
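A small synthetic sample shows how far the mean can sit from the tail. The numbers below are invented for illustration, and `nearest_rank` is a simple helper written for this sketch, not a monitoring-library function.

```python
import math
from statistics import mean

def nearest_rank(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic TTFT samples in milliseconds: 94 fast requests and 6 stragglers.
ttft_ms = [500] * 94 + [30_000] * 6

avg = mean(ttft_ms)
p50 = nearest_rank(ttft_ms, 50)
p95 = nearest_rank(ttft_ms, 95)
p99 = nearest_rank(ttft_ms, 99)

print(f"mean={avg:.0f} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

Here the mean lands at 2,270 ms, which describes neither the typical 500 ms request nor the 30-second stragglers that p95 and p99 expose.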
Symptom: Requests keep arriving even after the GPU workers are full. The service eventually times out randomly, and retry storms make the spike worse.
Cause: The system accepts more work than the serving engine can schedule. Without a visible queue, admission policy, timeout budget, and retry contract, overload becomes invisible until users see failures.
Fix: Put a queue or gateway in front of the inference engine. Track queue depth, queue age, and rejection rate. Return controlled overload responses before the GPU fleet collapses.
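The sketch below illustrates one possible admission layer: a bounded queue that rejects when full and drops requests that waited past their queue-age budget. `AdmissionQueue`, its limits, and the injected clock are hypothetical names for this example, not part of any serving engine.

```python
import time
from collections import deque

class AdmissionQueue:
    """Bounded queue that rejects early so overload stays visible."""

    def __init__(self, max_depth, max_age_s, clock=time.monotonic):
        self.max_depth = max_depth
        self.max_age_s = max_age_s
        self.clock = clock
        self.items = deque()   # (enqueue_time, request_id)
        self.rejected = 0

    def offer(self, request_id):
        """Return True if admitted; on False the caller should send a controlled overload response."""
        if len(self.items) >= self.max_depth:
            self.rejected += 1
            return False
        self.items.append((self.clock(), request_id))
        return True

    def take(self):
        """Return the next request still within its age budget; expired ones count as rejected."""
        now = self.clock()
        while self.items:
            enqueued, request_id = self.items.popleft()
            if now - enqueued <= self.max_age_s:
                return request_id
            self.rejected += 1
        return None

    def oldest_age_s(self):
        """Queue age of the oldest waiting request, a useful autoscaling signal."""
        return self.clock() - self.items[0][0] if self.items else 0.0

# Deterministic demo with a fake clock.
now = [0.0]
q = AdmissionQueue(max_depth=2, max_age_s=5.0, clock=lambda: now[0])
first = q.offer("a")
now[0] = 4.0
second = q.offer("b")
third = q.offer("c")     # queue already holds two requests, so this is rejected
now[0] = 6.0
served = q.take()        # "a" waited 6 s and is dropped; "b" waited 2 s and is served
print(first, second, third, served, q.rejected)  # True True False b 2
```

Exporting `len(q.items)`, `oldest_age_s()`, and `rejected` as metrics gives the queue depth, queue age, and rejection rate that the fix calls for.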
Symptom: A batch analytics job slows down interactive customer support, or one tenant's long prompts make all tenants miss their TTFT target.
Cause: The scheduler sees all requests as identical even though some workloads are interactive, some are batch, and some have stricter contractual latency limits.
Fix: Separate traffic classes. Use priority queues, per-tenant rate limits, maximum prompt and generation budgets, and different pools when workload shapes are too different to share one scheduler fairly.
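As a minimal sketch of class-aware scheduling, the queue below serves interactive work before batch work and enforces a per-class prompt budget. The class names, priorities, and token budgets are hypothetical values chosen for the example.

```python
import heapq
import itertools

PRIORITY = {"interactive": 0, "batch": 1}                     # lower number is served first
MAX_PROMPT_TOKENS = {"interactive": 4_000, "batch": 32_000}   # hypothetical budgets

class ClassAwareQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker keeps FIFO order within a class

    def submit(self, traffic_class, request_id, prompt_tokens):
        """Admit the request unless it exceeds its class's prompt budget."""
        if prompt_tokens > MAX_PROMPT_TOKENS[traffic_class]:
            return False
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._seq), request_id))
        return True

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = ClassAwareQueue()
q.submit("batch", "nightly-report", 20_000)
q.submit("interactive", "chat-1", 800)
q.submit("interactive", "chat-2", 1_200)
oversized = q.submit("interactive", "chat-huge", 9_000)

order = [q.next_request() for _ in range(3)]
print(order)      # ['chat-1', 'chat-2', 'nightly-report']
print(oversized)  # False
```

Strict priority like this can starve batch work under sustained interactive load, which is one reason to move very different workload shapes onto separate pools.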
Symptom: A model that passed a short-prompt load test starts rejecting work or swapping under real chat history. GPU memory looks fine at startup and then collapses as conversations lengthen.
Cause: The capacity plan counted weights but not enough KV cache. Long prompts, high concurrency, prefix-cache headroom, and larger generation limits all consume blocks during the request lifetime.
Fix: Size with realistic prompt and output distributions, not only maximum model weights. Load test with long conversations, watch vllm:kv_cache_usage_perc, and apply admission limits before the cache pool hits saturation.
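A back-of-the-envelope calculation makes the gap between short and long conversations concrete. The model shape and the 40 GiB KV budget below are hypothetical; substitute your model's configuration and the headroom you measure after weights and activations are loaded.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes stored per token; the factor 2 covers separate key and value tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_concurrent_sequences(kv_budget_bytes, tokens_per_sequence, bytes_per_token):
    """Upper bound on sequences that fit, ignoring block rounding and prefix sharing."""
    return kv_budget_bytes // (tokens_per_sequence * bytes_per_token)

# Hypothetical model shape with 16-bit KV entries.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
kv_budget = 40 * 1024**3   # assumed 40 GiB left for the KV cache

short_chat = max_concurrent_sequences(kv_budget, tokens_per_sequence=512, bytes_per_token=per_token)
long_chat = max_concurrent_sequences(kv_budget, tokens_per_sequence=8192, bytes_per_token=per_token)

print(per_token // 1024, "KiB per token")      # 128 KiB per token
print(short_chat, "sequences at 512 tokens")   # 640
print(long_chat, "sequences at 8192 tokens")   # 40
```

Under these assumptions the same GPU holds 640 short conversations but only 40 long ones, which is why a short-prompt load test can pass while real chat history exhausts the cache.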
Symptom: A team enables PagedAttention and expects repeated system prompts to become free, then sees no TTFT improvement on new replicas.
Cause: PagedAttention is an allocation strategy for KV blocks. Prefix caching is a reuse strategy for shared prompt prefixes. One reduces fragmentation; the other avoids recomputing prefix KV state.
Fix: Use both when the workload benefits from both. Treat PagedAttention as baseline memory management and prefix caching as a workload-specific optimization that needs stable shared prefixes and warm caches.
Symptom: You scale from 2 to 10 GPUs during a traffic spike, but the new nodes serve requests slower than the old ones. TTFT actually increases right after scaling.
Cause: The original nodes have been running for hours and have accumulated prefix cache hits (for example, a shared system prompt that every request includes). The brand-new nodes start with empty caches. They must recompute the full prefill from scratch, so their first tokens take much longer.
Fix: Warm new nodes with a few synthetic requests that populate the common prefix before adding them to the load balancer rotation. Alternatively, use vLLM's prefix caching and ensure new nodes receive "seed" traffic to warm their KV cache before taking full production load.
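A readiness warmup could be sketched as follows. `warm_replica` and `FakeReplica` are hypothetical: the fake stands in for a real client so the example runs anywhere, and using "the second probe is faster than the first" as the warm signal is an assumption. A real check might read prefix-cache hit metrics instead.

```python
def warm_replica(send_prompt, shared_prefixes, probes_per_prefix=2):
    """Send synthetic requests so the replica computes and caches each shared prefix.

    send_prompt is a hypothetical callable (prompt -> TTFT in ms) wrapping your client.
    Returns True when the last probe for every prefix was faster than the first.
    """
    warmed = True
    for prefix in shared_prefixes:
        ttfts = [send_prompt(prefix + " ping") for _ in range(probes_per_prefix)]
        warmed = warmed and ttfts[-1] < ttfts[0]
    return warmed

class FakeReplica:
    """Stand-in for a real replica: a prompt is slow until its prefix has been seen once."""

    def __init__(self):
        self.cached = set()

    def send(self, prompt):
        prefix = prompt.rsplit(" ", 1)[0]
        hit = prefix in self.cached
        self.cached.add(prefix)
        return 120.0 if hit else 900.0   # made-up TTFT values in ms

replica = FakeReplica()
ready = warm_replica(replica.send, ["You are a merchant-support assistant."])
print("add to load balancer" if ready else "keep out of rotation")
```

Wiring a check like this into the readiness probe keeps a cold replica out of rotation until the common prefix is cached.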
Prefill and decode stress GPUs differently. Prefill processes many prompt tokens in parallel and tends to be compute-heavy. Decode produces one token per active request and tends to be memory-bandwidth-heavy. If you run both phases on the same worker pool, a few long prompts can delay short streaming responses, and decode traffic can leave tensor cores underused.
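Prefill-decode disaggregation, introduced for production serving by Splitwise[18], splits the serving fleet into two pools: prefill workers that process prompts and produce the KV cache, and decode workers that receive that cache and generate tokens.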
This is not free. You now need cache transfer, placement logic, and backpressure between pools. But it can help high-traffic systems where long prompts and streaming generations compete for the same GPU budget. The autoscaling signals also become more specific: prefill workers scale on prompt-token backlog and TTFT, while decode workers scale on active sequences, KV-cache utilization, and TPOT.
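To make the pool-specific signals concrete, the sketch below sizes each pool from its own inputs. Both functions and every number in the demo are hypothetical; they show the shape of the calculation, not a tuned policy or any framework's built-in autoscaler.

```python
import math

def desired_prefill_replicas(backlog_prompt_tokens, tokens_per_replica_per_s, ttft_budget_s):
    """Replicas needed to clear the prompt-token backlog within the TTFT budget."""
    capacity_per_replica = tokens_per_replica_per_s * ttft_budget_s
    return max(1, math.ceil(backlog_prompt_tokens / capacity_per_replica))

def desired_decode_replicas(active_sequences, sequences_per_replica,
                            current_replicas, kv_usage, target_kv_usage):
    """Take the larger of the sequence-count need and the KV-utilization need."""
    by_sequences = math.ceil(active_sequences / sequences_per_replica)
    by_kv = math.ceil(current_replicas * kv_usage / target_kv_usage)
    return max(1, by_sequences, by_kv)

# Hypothetical measurements from one evaluation interval.
prefill = desired_prefill_replicas(
    backlog_prompt_tokens=600_000, tokens_per_replica_per_s=20_000, ttft_budget_s=1.5
)
decode = desired_decode_replicas(
    active_sequences=900, sequences_per_replica=128,
    current_replicas=6, kv_usage=0.9, target_kv_usage=0.7,
)
print(prefill, decode)  # 20 8
```

In this made-up interval the prefill pool needs far more capacity than the decode pool, a split that a single mixed-pool signal would blur.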
NVIDIA Dynamo is one documented open-source control-plane example. Its documentation describes disaggregated prefill and decode deployments across backends including vLLM, SGLang, and TensorRT-LLM, with KV-aware routing and KV-transfer mechanisms for split deployments.[3] This makes it a useful implementation to study, but the operational win still depends on measured prompt mix, transfer cost, and cache-hit behavior.
Don't start here. First tune continuous batching, chunked prefill, prefix caching, and queue policy. Disaggregation is a later optimization when one mixed pool can no longer hit both TTFT and TPOT targets.
Use measurements, not architecture fashion, to decide whether to split pools. The following gate recommends investigation only when a tuned mixed pool breaches both service objectives under a long-prompt-heavy trace:
```python
workload_trials = [
    {"name": "short support chat", "long_prompt_share": 0.08, "ttft_p95_ms": 820, "tpot_p95_ms": 44},
    {"name": "policy-document surge", "long_prompt_share": 0.61, "ttft_p95_ms": 2450, "tpot_p95_ms": 93},
]
ttft_slo_ms = 1500
tpot_slo_ms = 60

for trial in workload_trials:
    both_breach = trial["ttft_p95_ms"] > ttft_slo_ms and trial["tpot_p95_ms"] > tpot_slo_ms
    action = "benchmark split pools" if both_breach else "keep tuning mixed pool"
    print(f"{trial['name']:23} long-prompts={trial['long_prompt_share']:.0%} -> {action}")
```

```text
short support chat      long-prompts=8% -> keep tuning mixed pool
policy-document surge   long-prompts=61% -> benchmark split pools
```

Speculative decoding speeds up generation by using a smaller "draft" model to predict multiple tokens ahead, then verifying them with the larger "target" model. Leviathan et al. report roughly 2-3x acceleration in their evaluated settings, not a universal serving guarantee.[19]
The speedup comes from amortizing target-model work across several draft tokens. The draft model's predictions are treated as hypotheses; the target model evaluates them in one verification pass and keeps the valid prefix. When the draft model is reasonably accurate for the traffic pattern, this reduces the number of expensive target forward passes. It still pays draft-model compute and rejection overhead, so you measure it against your own workload instead of assuming it always helps.
For example, you might pair a small draft model with a 70B target model to accelerate decode-heavy traffic. The draft model generates candidate tokens quickly; the target model then verifies all candidates in a single forward pass, accepting all correct predictions up to the first mismatch.
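The accept-up-to-first-mismatch rule can be sketched with toy models. This is a simplified greedy variant for illustration: the published method uses a sampling-based acceptance test to preserve the target distribution, and `draft_next` and `target_next` are hypothetical stand-ins for real models.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative step. Returns (new_tokens, accepted_draft_count)."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks every position. A real engine does this in one
    #    batched forward pass; this loop only reproduces the outcome.
    new_tokens = []
    for i in range(k):
        expected = target_next(prefix + proposal[:i])
        if proposal[i] != expected:
            new_tokens.append(expected)   # replace the first mismatch with the target's token
            return new_tokens, i
        new_tokens.append(proposal[i])

    # 3. All k accepted: the verification pass also yields one extra target token.
    new_tokens.append(target_next(prefix + proposal))
    return new_tokens, k

# Toy "models": each returns the next token of a fixed sequence by position.
TARGET = ["your", "order", "ships", "on", "monday", "<eos>"]
DRAFT = ["your", "order", "ships", "in", "monday", "<eos>"]

def target_next(tokens):
    return TARGET[len(tokens)]

def draft_next(tokens):
    return DRAFT[len(tokens)]

first = speculative_step([], draft_next, target_next, k=4)
second = speculative_step(first[0], draft_next, target_next, k=1)
print(first)   # (['your', 'order', 'ships', 'on'], 3)
print(second)  # (['monday', '<eos>'], 1)
```

In the first step, three draft tokens are accepted and the mismatch is corrected, so one target verification yields four tokens. The gain shrinks as the draft model's acceptance rate falls.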
GPU availability varies by region and time. During peak demand (like major AI product launches), entire regions can run out of H100 capacity. A resilient serving architecture should be able to "overflow" traffic to a secondary region when the primary is at capacity or experiencing issues.
This requires:
Regional failover is particularly important for spot instance workloads. If a regional spot fleet is reclaimed, your tested recovery-time objective determines whether secondary on-demand capacity can receive new requests before client timeout budgets expire. Long in-flight generations still need retry or resumption semantics.
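A minimal overflow router might look like the sketch below. The `Region` fields, region names, and queue limits are hypothetical, and a real router would also account for model availability, data residency, and cross-region latency.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    healthy: bool
    queue_depth: int
    max_queue_depth: int

def pick_region(regions):
    """Return the first region, in preference order, that is healthy and below its queue limit."""
    for region in regions:
        if region.healthy and region.queue_depth < region.max_queue_depth:
            return region.name
    return None   # every region is saturated: shed load instead of queueing blindly

normal = [Region("primary", True, 12, 100), Region("secondary", True, 0, 50)]
saturated = [Region("primary", True, 100, 100), Region("secondary", True, 3, 50)]
reclaimed = [Region("primary", False, 0, 100), Region("secondary", True, 50, 50)]

print(pick_region(normal))     # primary
print(pick_region(saturated))  # secondary
print(pick_region(reclaimed))  # None
```

The `None` case matters as much as the overflow case: when both regions are full, a controlled rejection is safer than queueing requests that will time out.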
Instead of deploying a separate 70B model for every customer fine-tune, use a shared base model and hot-swap LoRA (Low-Rank Adaptation)[20] adapters. Systems like S-LoRA[21] show how to batch requests across different adapters while keeping the base model weights shared in GPU memory. vLLM also supports per-request LoRA serving, with explicit warnings around runtime adapter loading in untrusted environments.[22]
To implement multi-tenancy, we can dynamically load LoRA adapters per request. The class below shows how a single base model instance can serve requests for different tenants by applying the corresponding tenant's adapter on the fly. A production environment would use vLLM's AsyncLLMEngine to handle concurrent requests without blocking; the synchronous LLM class is shown here for conceptual clarity. The serve method takes an incoming request carrying a tenant ID header, looks up that tenant's registered adapter, and returns text generated by the shared base weights with the adapter applied at inference time:
```python
from collections.abc import Mapping
from typing import Protocol

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest


class TenantRequest(Protocol):
    headers: Mapping[str, str]
    prompt: str


class MultiTenantServer:
    """
    Serves multiple fine-tunes on a single GPU using a shared base model.
    """

    def __init__(self, base_model_path: str):
        self.engine = LLM(model=base_model_path, enable_lora=True)
        # Adapter registry is fixed at startup; tenants cannot point at arbitrary paths.
        self.adapters = {
            "customer_A": {
                "name": "adapter_A",
                "id": 101,
                "path": "path/to/adapter_A",
            },
            # Additional tenants...
        }

    def serve(self, request: TenantRequest) -> str:
        tenant_id = request.headers.get("X-Tenant-ID")
        adapter = self.adapters.get(tenant_id)

        sampling_params = SamplingParams(temperature=0.7)

        # Unknown or missing tenants fall back to the base model (lora_request stays None).
        lora_request = None
        if adapter:
            lora_request = LoRARequest(
                adapter["name"],
                adapter["id"],
                adapter["path"],
            )

        outputs = self.engine.generate(
            [request.prompt],
            sampling_params,
            lora_request=lora_request,
        )
        return outputs[0].outputs[0].text
```

In production, prefer the async server path over a synchronous wrapper like this, and don't expose arbitrary adapter loading to untrusted tenants.[22]
At this point, you should be able to explain a serving design from first principles, not only name tools:
How do you handle cold starts for GPU instances in production?
Cold starts are dominated by node provisioning, image pulls, weight loading, and runtime warmup. Keep a small warm pool for critical traffic, pre-warm ahead of predictable spikes, store weights close to the node, and run readiness warmups before sending real users to a replica. Snapshot or restore features can help on specific platforms, but warm capacity and local weight caches are the common starting point.
When would you choose TensorRT-LLM over vLLM?
Choose TensorRT-LLM when measured latency or throughput on NVIDIA hardware justifies the build and deployment complexity. It fits tightly controlled fleets where the model, GPU type, precision mode, and engine build process are stable. vLLM can be easier to operate for general serving and rapid iteration; benchmark either claim on your workload.
How does continuous batching improve throughput compared with static batching?
Static batching waits for the longest request in the batch. Continuous batching repacks active work at token-step boundaries, so a finished request frees its slot for queued work immediately. That matters because decode produces one token at a time and request lengths vary widely.
Which metrics matter most for scaling LLM inference clusters?
Start with request queue depth and queue age, KV cache utilization, TTFT, and TPOT. Hardware metrics such as GPU duty cycle, memory bandwidth, temperature, and power are useful supporting signals, but they don't tell you whether users are waiting or whether the scheduler has enough KV blocks left.
The best way to internalize GPU serving concepts is to simulate the decisions yourself. Each lab below builds on the article's examples.
Set up a dummy inference service and configure an autoscaler to scale based on queue depth. Use a tool like KEDA with a Prometheus metric source, or simulate the control loop in Python using the GPUAutoscaler class from this article. Feed it a synthetic traffic trace (flat at night, spike at 9 AM) and plot the replica count over time. Compare a cooldown shorter than your measured node-ready time with one longer than a typical lull.
Measure how long it takes to load a 7B model versus a 70B model into GPU memory on your hardware (or using cloud instance startup logs). Calculate the "cost of a cold start" in dollars: multiply measured loading time by hourly GPU cost. Compare that cost and queue impact with keeping one ready replica through a known low-traffic period.
A user reports that the first token takes 5 seconds, but subsequent tokens arrive quickly. Which part of your stack is likely at fault: the autoscaler or the model engine? Write down your reasoning, then check it against the cold-start phases and the prefill-vs-decode discussion in this article.
You started with a merchant chat app that went viral and discovered why LLM serving isn't web serving. KV-cache capacity constrains active conversations; low-batch decode is often limited by memory bandwidth; and autoscaling needs queue, cache, and latency signals in addition to infrastructure telemetry.
A Kubernetes pattern using KEDA/HPA, a GPU node autoscaler, and a serving engine separates desired replicas from hardware provisioning. Continuous batching and PagedAttention improve GPU use, but a reliable fleet still needs measured capacity, a tuned cooldown, and a warmup policy justified by its latency target and cost.
After this chapter, you can explain why a GPU serving lane needs a different control loop than a web server, why new nodes sometimes run slower than old ones, and how to keep a thousand-user spike from overrunning your inference budget.
[1] Scaling Deployments, StatefulSets & Custom Resources. KEDA · 2026
[2] Concepts. Karpenter · 2026
[3] NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models. NVIDIA · 2025
[4] Metrics. vLLM · 2026
[5] Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al. · 2022 · OSDI 2022
[6] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
[7] Automatic Prefix Caching. vLLM · 2026
[8] NVIDIA TensorRT-LLM Documentation. NVIDIA · 2026
[9] TensorRT-LLM Quantization. NVIDIA · 2026
[10] Text Generation Inference. Hugging Face · 2026
[11] SGLang: Efficient Execution of Structured Language Model Programs. Zheng, L., et al. · 2023
[12] llama.cpp: Inference of LLaMA model in pure C/C++. Gerganov, G. · 2023
[13] Optimization and Tuning. vLLM · 2026
[14] Supported GPUs. NVIDIA · 2026
[15] Efficiently Scaling Transformer Inference. Pope, R., et al. · 2023 · arXiv preprint
[16] H100 GPU. NVIDIA · 2026
[17] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Shoeybi, M., et al. · 2019
[18] Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023
[19] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., et al. · 2022
[20] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR
[21] S-LoRA: Serving Thousands of Concurrent LoRA Adapters. Sheng, Y., et al. · 2023 · arXiv preprint
[22] LoRA Adapters. vLLM · 2026