LeetLLM
© 2026 LeetLLM. All rights reserved.


GPU Serving & Autoscaling

Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.


Advanced MLOps and DevOps for AI gave you the release machinery: promotion gates, lineage, rollout policy, and rollback triggers. This chapter turns that machinery toward the inference fleet. GPU serving and autoscaling decide whether a promoted model can survive real traffic without burning budget or missing latency goals.

Imagine your merchant-support chat app goes viral. At 8 AM, ten customers are asking about order delays. By 9 AM, there are a thousand. If you try serving your AI model on a normal web server, it'll quickly grind to a halt. That's because serving a Large Language Model (LLM) isn't like serving a website. A web server fetches pre-computed data and forgets the user immediately. An LLM has to process every prompt and retain attention state as generation proceeds. That short-term memory is a central capacity constraint alongside compute, memory bandwidth, and queueing delay.

Standard web servers don't maintain growing state for each active user. An LLM does. To generate the next token, the model must access computed attention states from previous tokens. In LLMs, this state is the KV cache, and managing its memory footprint is one of the central challenges of production serving.

Engineers designing serving systems must optimize for two conflicting metrics:

  1. Time To First Token (TTFT): The latency before the user sees the first word. This is dominated by the prefill phase (processing the input prompt) and is often compute-bound.
  2. Time Per Output Token (TPOT), also called Time Between Tokens (TBT): The latency between subsequent tokens, which determines the perceived generation speed. It is set by the decode phase and is usually memory-bandwidth bound for large decoder models.
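Both metrics can be derived from the arrival times of streamed tokens. The sketch below is a minimal illustration, assuming you already record one timestamp per token on the client side; the function name and the sample trace are hypothetical.

```python
def latency_metrics(request_sent_s: float, token_times_s: list[float]) -> dict[str, float]:
    """Derive TTFT and mean TPOT from token arrival timestamps (in seconds)."""
    if not token_times_s:
        raise ValueError("need at least one generated token")
    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_times_s[0] - request_sent_s
    # TPOT: average gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot}

# Hypothetical trace: request sent at t=0.0, first token after 0.5 s,
# then one token every 0.25 s.
trace = [0.5, 0.75, 1.0, 1.25, 1.5]
metrics = latency_metrics(0.0, trace)
print(f"TTFT = {metrics['ttft_s']:.2f} s, mean TPOT = {metrics['tpot_s']:.2f} s")
```

In practice you would track percentiles of both values across many requests, since a mean hides tail latency.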

The core difficulty is that the Key-Value (KV) cache (the stored attention computations for past tokens) grows dynamically with every generated token. Poor memory management leads to fragmentation, which forces smaller batch sizes and lower throughput.

The fulfillment-center analogy

To see why GPU serving is so different from normal web scaling, picture a fulfillment center during a return-label surge:

  • The GPU is the automated sorting lane. It's specialized, expensive, and you can't replace it with normal web servers.
  • The model weights are the routing map loaded into the sorter. Loading it takes time, and the lane can't process parcels until the map is resident.
  • The KV cache is the set of active parcel traces on the belt. If a customer conversation started ten turns ago, its trace stays allocated until generation finishes. The trace pool fills up fast during a traffic spike.
  • Autoscaling is opening another sorting lane. Bringing a new lane online takes time, and it can't help while the routing map is still loading.

A common mistake is scaling because the sorter is busy. Scale instead when the queue at the dock is too long. A busy sorter means you're using the hardware. A long queue means customers are waiting, and that's when you need more capacity.

The GPU serving stack

Before diving into specific frameworks, let's look at one common production GPU serving architecture. It consists of several layers:

[Figure: GPU serving stack showing request path, metrics path, pod scaling, and node provisioning]
The serving stack separates the data path that handles user requests from the control path that turns queue, cache, and latency pressure into pod and node scaling decisions.

Infrastructure Layer: Kubernetes (EKS, GKE, or AKS) manages the GPU nodes. Unlike CPU nodes, GPU capacity is expensive and sometimes scarce, so a warm pool must be justified against latency targets and measured demand.

Control Layer: In a KEDA-based setup, KEDA (Kubernetes Event-Driven Autoscaling) can activate a workload from zero and configure a Horizontal Pod Autoscaler (HPA) to scale running replicas from metrics. Karpenter (AWS) or Cluster Autoscaler can then provision GPU nodes when new pods can't be scheduled on existing capacity.[1][2]

Serving Layer: The actual inference engines (vLLM, SGLang, TensorRT-LLM) that load models into VRAM and handle token generation. At datacenter scale, a control plane such as NVIDIA Dynamo can orchestrate disaggregated prefill/decode pools across many of these engine replicas.[3]

Metrics: CPU and RAM metrics alone don't describe LLM-serving capacity. NVIDIA's DCGM (Data Center GPU Manager) exposes GPU-specific telemetry, while user-facing scaling signals usually come from the serving layer: queue depth, KV cache usage, TTFT, and inter-token latency.[4]

Serving frameworks

The rest of this section zooms into the serving engine: batching policy, KV-cache allocation, framework choice, and request lifecycle.

The evolution of batching

Early serving systems used static batching: waiting for N requests to arrive, padding them to the same length, and processing them together. It's inefficient because requests finish at different times. If one request generates 100 tokens and another generates 1000, the GPU must continue processing the batch until the longest request finishes, leaving slots idle for the shorter requests.

Modern frameworks use continuous batching (also known as iteration-level scheduling), pioneered by Orca[5]. The scheduler operates at the granularity of a single token generation step. As soon as a request finishes, a new request from the queue can be inserted into its slot in the next iteration.

[Figure: Static batching leaves finished slots idle, while continuous batching inserts queued requests at the next token step]
Continuous batching keeps the GPU useful by admitting queued requests at token-step boundaries instead of waiting for the longest request in a static batch.

The small simulation below uses generation lengths for four order-support replies. A static batch doesn't refill its two slots until both original replies finish; a continuous scheduler admits the next queued reply as soon as a slot opens.

continuous-batching-slot-steps.py

```python
from collections import deque

def run_batching(lengths: list[int], slots: int, continuous: bool) -> tuple[int, int]:
    waiting = deque(lengths)
    active: list[int] = []
    token_steps = 0
    empty_slot_steps = 0

    while waiting or active:
        # Static batching refills only when the whole batch has drained;
        # continuous batching refills free slots before every token step.
        if not active or continuous:
            while waiting and len(active) < slots:
                active.append(waiting.popleft())

        token_steps += 1
        empty_slot_steps += slots - len(active)
        # Each active reply emits one token; finished replies leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]

    return token_steps, empty_slot_steps

reply_lengths = [8, 2, 5, 3]
for label, continuous in [("static", False), ("continuous", True)]:
    steps, empty = run_batching(reply_lengths, slots=2, continuous=continuous)
    print(f"{label:10} token steps={steps:2} empty-slot steps={empty:2}")
```

Output

```text
static     token steps=13 empty-slot steps= 8
continuous token steps=10 empty-slot steps= 2
```

With the same replies and slots, the continuous scheduler finishes sooner because it wastes fewer slot-iterations.

vLLM and PagedAttention

One widely used open-source serving engine is vLLM, which introduced PagedAttention[6]. Before vLLM, inference systems pre-allocated a single contiguous block of GPU memory sized to the maximum sequence length a request might ever use (often 4k-32k tokens). Because the actual length is revealed token-by-token during decoding, this approach created two problems:

  • Internal fragmentation: most of the pre-allocated chunk sat empty for short requests.
  • External fragmentation: free memory became scattered and couldn't satisfy new large contiguous allocations.

Profiling across real workloads showed that prior systems typically used only 20–38% of the KV cache memory they had reserved, wasting 62–80%.[6] The result: the effective batch size stayed small, and throughput suffered even when the GPU had spare VRAM.

PagedAttention treats the KV cache like virtual memory in an operating system (OS). It breaks the KV cache into fixed-size blocks (pages) that can be stored in non-contiguous memory.

Logical vs. physical blocks

PagedAttention introduces two key abstractions:

  • Logical KV blocks: The view of the KV cache from the model's perspective. It appears as a contiguous sequence of tokens.
  • Physical KV blocks: The actual fixed-size chunks of memory allocated in HBM (High Bandwidth Memory, the GPU's ultra-fast RAM) to store these tokens.

A Block Table maintains the mapping between logical and physical blocks. When a new token is generated, vLLM checks if the current physical block has space. If not, it allocates a new physical block from a pre-allocated pool and updates the mapping. The PagedAttention paper reports much tighter memory use and higher throughput because batches can pack around real sequence lengths instead of worst-case reservations.[6]

[Figure: PagedAttention maps logical KV blocks to non-contiguous physical blocks through a block table]
PagedAttention makes one request look contiguous to the model while the runtime stores its KV blocks in non-contiguous physical GPU memory.
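The block-table idea can be sketched in a few lines. This is a toy model of the mapping described above, not vLLM's implementation: the `BlockTable` class, its method names, and the tiny block size are all illustrative assumptions.

```python
class BlockTable:
    """Toy per-request block table: logical block index -> physical block id."""

    def __init__(self, free_blocks: list[int], block_size: int) -> None:
        self.free_blocks = free_blocks      # shared pool of physical block ids
        self.block_size = block_size        # tokens stored per block
        self.physical_ids: list[int] = []   # position i holds logical block i
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV block pool exhausted")
            self.physical_ids.append(self.free_blocks.pop())
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple[int, int]:
        """Return (physical block id, offset inside that block) for a token."""
        logical_block, offset = divmod(token_index, self.block_size)
        return self.physical_ids[logical_block], offset

pool = [7, 3, 9, 1]            # deliberately non-contiguous physical ids
table = BlockTable(pool, block_size=4)
for _ in range(6):
    table.append_token()

print(table.physical_ids)      # two blocks are enough for six tokens
print(table.locate(5))         # token 5 sits in the second logical block
```

The request sees token positions 0 to 5 as one contiguous sequence, while the pool hands out whichever physical blocks happen to be free.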

PagedAttention gives the runtime fine-grained control over KV blocks. In current vLLM deployments, request-to-request reuse of a shared prompt prefix usually comes from automatic prefix caching, which reuses previously computed KV blocks when a new request shares the same prefix.[7]
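One way to picture block-level prefix reuse is hash chaining: each full block is identified by its own tokens plus the identity of the block before it, so two requests share a cached block only if everything up to that block matches. The sketch below is a simplified illustration under that assumption; the hash function, block size, and helper names are hypothetical and do not mirror vLLM internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; kept small so the example is easy to trace

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes: list[str] = []
    parent = ""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes

def count_reused_blocks(cache: set[str], token_ids: list[int]) -> int:
    """Count leading blocks already cached, then register this request's blocks."""
    hashes = block_hashes(token_ids)
    reused = 0
    for digest in hashes:
        if digest not in cache:
            break
        reused += 1
    cache.update(hashes)
    return reused

cache: set[str] = set()
system_prompt = list(range(100, 108))   # 8 shared tokens = 2 full blocks
first = count_reused_blocks(cache, system_prompt + [1, 2, 3, 4, 5])
second = count_reused_blocks(cache, system_prompt + [9, 8, 7, 6, 5])
print(f"first request reused {first} blocks, second reused {second}")
```

The second request skips prefill work for the two shared system-prompt blocks and only computes its own suffix.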

Here is the allocation arithmetic for four conversations. The fixed reservation policy reserves each request for its maximum allowed length, while paging reserves blocks only for tokens currently stored:

paged-kv-block-accounting.py

```python
import math

cached_tokens = [320, 75, 900, 250]
max_tokens_per_request = 2048
tokens_per_block = 16

# Fixed reservation: every request holds blocks for its maximum length.
fixed_blocks = len(cached_tokens) * math.ceil(max_tokens_per_request / tokens_per_block)
# Paging: each request holds only the blocks its cached tokens occupy.
paged_blocks = sum(math.ceil(tokens / tokens_per_block) for tokens in cached_tokens)
used_tokens = sum(cached_tokens)

print(f"tokens currently cached: {used_tokens}")
print(f"fixed reservation blocks: {fixed_blocks}")
print(f"paged allocation blocks: {paged_blocks}")
print(f"reserved tokens avoided: {(fixed_blocks - paged_blocks) * tokens_per_block}")
```

Output

```text
tokens currently cached: 1545
fixed reservation blocks: 512
paged allocation blocks: 98
reserved tokens avoided: 6624
```

Paging doesn't make attention state free. It stops short conversations from holding blocks for a maximum length they haven't used.

The overall request lifecycle coordinates these components. Client requests enter an API server queue, are packed by a continuous batching scheduler, and finally executed by a model engine using PagedAttention and, when enabled, prefix caching on the GPU cluster:

[Figure: LLM request path highlighting queue wait before GPU work, TTFT at prefill, and TPOT during decode streaming]
The request lifecycle exposes three separate latency surfaces: queue wait, prefill to first token, and decode time between streamed tokens.

Framework comparison

Choosing the right serving engine depends on your specific constraints:

| Framework | Key Feature | Best For | Pros | Cons |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention | General production serving | High throughput, easy to use, active community | Whether it loses latency to tuned TensorRT-LLM depends on model, kernels, and benchmark |
| TensorRT-LLM[8][9] | TensorRT engine build + fused kernels | Max performance on NVIDIA | Very low latency, FP8 support, AWQ/GPTQ quantization | More build and deployment complexity, tighter NVIDIA coupling |
| TGI (Text Generation Inference)[10] | HF-first serving stack | Existing Hugging Face deployments | Mature launcher, streaming, Prometheus metrics, tensor parallelism, continuous batching | Official docs now describe TGI as maintenance mode and recommend newer optimized engines for most new work |
| SGLang | RadixAttention | Complex agent workflows, high-prefix-reuse traffic | Automatic KV-cache reuse across prompts via a radix tree[11] | Requires workload-specific benchmarking against other engines |
| llama.cpp[12] | GGUF quantization | Local/edge deployment | Broad GGUF quantization (INT4-INT8), runs on CPU/Mac/Windows with broad hardware compatibility | Lower throughput than GPU-optimized frameworks |

The TGI documentation describes TGI as maintenance mode and directs users toward newer optimized engines such as vLLM and SGLang for new deployments.[10] These engines are the data plane that loads weights and generates tokens. At datacenter scale, a separate control plane such as NVIDIA Dynamo can sit above them to provide disaggregated prefill/decode scheduling and KV-aware routing across engine replicas (covered later under disaggregation).[3]

To see how this works in practice, here's an example of initializing vLLM with settings you would benchmark before deployment. The configuration below loads a large model across multiple GPUs using tensor parallelism and sets a high memory-utilization budget for weights, KV blocks, and runtime buffers. It also uses chunked prefill so long prompts can share scheduling budget with decode traffic.[13] It takes the model identifier and hardware constraints as inputs, and outputs an initialized engine ready to accept inference requests:

framework-comparison.py

```python
from vllm import LLM, SamplingParams

# Production configuration for vLLM
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,         # Split across 4 GPUs
    gpu_memory_utilization=0.90,    # Let vLLM claim ~90% of VRAM for weights, KV, and runtime buffers
    max_model_len=8192,             # Enforce context limit
    enable_chunked_prefill=True,    # vLLM V1 enables this by default; keep it explicit in reviewed configs
    enable_prefix_caching=True,     # Reuse KV blocks when requests share a prefix
    max_num_batched_tokens=16384,   # Main TTFT/TPOT trade-off knob for chunked prefill
)

# Sampling parameters control generation
params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)

outputs = llm.generate(["Summarize the delivery delay for order A-1842."], params)
```

All of these knobs are workload-dependent. Tune them against TTFT and TPOT SLOs, not just raw tokens/sec.
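To build intuition for the token-budget knob, the sketch below models one scheduling policy in simplified form: running decode requests are scheduled first at one token each, and whatever budget remains goes to a chunk of the pending prompt. The function name and the load numbers are hypothetical, and a real scheduler juggles many prompts at once.

```python
def prefill_steps(prompt_tokens: int, decode_requests: int, token_budget: int) -> list[int]:
    """Return the prefill chunk size scheduled at each step until the prompt is done.

    Assumes each running decode request consumes one token of budget per step
    and is scheduled before any prefill work.
    """
    room = token_budget - decode_requests
    if room <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks: list[int] = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(room, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# Hypothetical load: a 6,000-token prompt arrives while 48 requests are decoding.
for budget in (2048, 8192):
    chunks = prefill_steps(prompt_tokens=6000, decode_requests=48, token_budget=budget)
    print(f"budget={budget:5} -> {len(chunks)} steps, chunks={chunks}")
```

In this simplified model, a smaller budget spreads the prompt over more steps, which delays that prompt's first token but keeps each step short for the requests that are already streaming. A larger budget does the opposite.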

GPU infrastructure

Sizing and selection

Selecting the right GPU depends on model size, quantization, and expected traffic. The dominant factors are VRAM (Video Random Access Memory) capacity (to fit the model + KV cache) and Memory Bandwidth (to serve tokens fast).

To estimate memory requirements for capacity, use this formula:

$$\text{Memory} \approx \text{Weights} + \text{KV Cache} + \text{Activation Overhead}$$

Where:

  • $\text{Weights} \approx \text{Params} \times \text{Precision (bytes)}$ (e.g., 70B $\times$ 2 bytes = 140 GB)
  • $\text{KV Cache} \approx 2 \times \text{Layers} \times \text{KV Heads} \times \text{Head Dim} \times \text{Seq Len} \times \text{Concurrency} \times \text{Precision}$ (the factor of 2 accounts for both the key and value tensors)

That distinction matters. The older shorthand using Hidden Dim assumes full multi-head attention. Many modern decoder models use grouped-query attention (GQA) or multi-query attention (MQA), so the number of KV heads is much smaller than the total attention head count. If you ignore that, you'll often overestimate KV cache size by a wide margin.
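A quick calculation shows the size of that gap. The shape below (80 layers, 64 attention heads, 8 KV heads, head dimension 128, 2-byte KV entries) is an assumed 70B-class configuration chosen for illustration; substitute the values from your model's config.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, kv_bytes: int = 2) -> int:
    # The factor of 2 covers the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * kv_bytes

# Assumed 70B-class shape; not taken from any specific model card.
layers, attention_heads, kv_heads, head_dim = 80, 64, 8, 128

mha = kv_bytes_per_token(layers, attention_heads, head_dim)  # every head stores K and V
gqa = kv_bytes_per_token(layers, kv_heads, head_dim)         # only the KV heads store K and V

print(f"full MHA: {mha * 1000 / 1e9:.2f} GB per 1k cached tokens")
print(f"GQA:      {gqa * 1000 / 1e9:.2f} GB per 1k cached tokens")
print(f"overestimate if GQA is ignored: {mha // gqa}x")
```

With these assumed numbers, using the attention head count instead of the KV head count inflates the estimate by the ratio of the two, which is 8x here.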

Here's a practical Python function that estimates the required GPU memory from model size and cache expectations. It takes the model size plus the architecture terms that control KV growth (num_layers, num_kv_heads, head_dim, context length, and concurrency), then returns a dictionary with the weight, KV-cache, and total memory estimates and the number of GPUs needed for several GPU types. This helps engineers plan capacity before deploying:

sizing-and-selection.py

```python
from collections.abc import Mapping
import math

def estimate_gpu_requirements(
    model_params_b: float,  # Billions of parameters
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    target_concurrency: int,
    weight_bytes: int = 2,  # BF16/FP16=2, INT8=1, FP8=1
    kv_bytes: int = 2,      # KV cache often stays in BF16/FP16
    overhead_factor: float = 1.15,
) -> Mapping[str, object]:
    # 1. Model weights. model_params_b is already in billions, so
    #    model_params_b * bytes gives an approximate size in decimal GB.
    weight_memory_gb = model_params_b * weight_bytes

    # 2. KV cache. For GQA/MQA models, num_kv_heads is smaller than
    #    the total attention head count.
    kv_memory_bytes = (
        2
        * num_layers
        * num_kv_heads
        * head_dim
        * context_len
        * target_concurrency
        * kv_bytes
    )
    kv_memory_gb = kv_memory_bytes / 1e9

    total_gb = (weight_memory_gb + kv_memory_gb) * overhead_factor

    # Divisors sit slightly below nominal capacity to leave per-GPU headroom.
    gpu_options: dict[str, Mapping[str, int]] = {
        "L4_24GB": {"mem": 24, "needed": max(1, int(math.ceil(total_gb / 22)))},
        "A100_40GB": {"mem": 40, "needed": max(1, int(math.ceil(total_gb / 38)))},
        "A100_80GB": {"mem": 80, "needed": max(1, int(math.ceil(total_gb / 76)))},
        "H100_80GB": {"mem": 80, "needed": max(1, int(math.ceil(total_gb / 76)))},
        "H200_141GB": {"mem": 141, "needed": max(1, int(math.ceil(total_gb / 134)))},
    }
    return {
        "weights_gb": weight_memory_gb,
        "kv_cache_gb": kv_memory_gb,
        "total_with_overhead_gb": total_gb,
        "gpu_options": gpu_options,
    }

def show_case(name: str, result: Mapping[str, object]) -> None:
    gpu_options = result["gpu_options"]
    print(name)
    print(f"  weights: {result['weights_gb']:.1f} GB")
    print(f"  kv cache: {result['kv_cache_gb']:.1f} GB")
    print(f"  total with overhead: {result['total_with_overhead_gb']:.1f} GB")
    print(f"  H100_80GB needed: {gpu_options['H100_80GB']['needed']}")

chat_8b = estimate_gpu_requirements(
    model_params_b=8,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    context_len=4096,
    target_concurrency=8,
)

llama_70b = estimate_gpu_requirements(
    model_params_b=70,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    context_len=8192,
    target_concurrency=4,
)

show_case("8B BF16, 4k context, 8 active requests", chat_8b)
show_case("70B BF16, 8k context, 4 active requests", llama_70b)
```

Output

```text
8B BF16, 4k context, 8 active requests
  weights: 16.0 GB
  kv cache: 4.3 GB
  total with overhead: 23.3 GB
  H100_80GB needed: 1
70B BF16, 8k context, 4 active requests
  weights: 140.0 GB
  kv cache: 10.7 GB
  total with overhead: 173.3 GB
  H100_80GB needed: 3
```

Notice how the 70B BF16 case needs more than the theoretical two-GPU weight fit once you add four active 8k contexts and operational headroom. The table below is a minimum-fit reference; production sizing starts from measured prompt and concurrency distributions.

As a reference, the following table gives order-of-magnitude sizing for common dense decoder deployments. The KV numbers assume GQA-style architectures and FP16/BF16 KV cache. Full multi-head attention or larger KV precision pushes the cache higher.

| Model Size | Precision | Weights Memory | Est. KV Cache (per 1k cached tokens, 1 active request) | Minimum GPU Suggestion |
| --- | --- | --- | --- | --- |
| 7B dense model | FP16/BF16 | ~14 GB | ~0.10-0.13 GB | 1x 24GB GPU (A10G, L4) |
| 32B dense model | FP16/BF16 | ~64 GB | ~0.15-0.30 GB | 1x 80GB GPU |
| 70B dense model | INT8 weights + BF16 KV | ~70 GB | ~0.25-0.35 GB | 1x 80GB only for short contexts and tight concurrency limits |
| 70B dense model | FP16/BF16 | ~140 GB | ~0.25-0.35 GB | 2x 80GB GPUs via TP |

That 70B INT8 row is a tight fit. In practice, many teams still use two GPUs so they have room for longer prompts, prefix caching, and higher concurrency.

Multi-Instance GPU (MIG)

For cost-efficient autoscaling of smaller models, Multi-Instance GPU (MIG) allows you to partition a single GPU into hardware-isolated slices. Ampere- and Hopper-class parts such as A100, H100, and H200 can expose up to seven instances on supported SKUs.[14] Instead of dedicating an entire large GPU to a single 7B model, you can run multiple replicas on isolated slices with dedicated compute and memory resources.

Throughput & bandwidth constraints

For low-batch decode of large decoder models, weight reads commonly make the system memory-bandwidth bound. Batching, cache traffic, kernels, and multi-GPU communication can change which limit dominates.

The theoretical maximum throughput ($TPS_{max}$) for a batch size of 1 is roughly (a simplified model derived from memory-bandwidth analyses such as Pope et al.)[15]:

$$TPS_{max} \approx \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Size (GB)}}$$

Reading the formula

In this simplified batch-size-one model, the GPU streams the weight footprint from HBM for each generated token. There is little weight reuse across sequential steps (the KV cache reuses activations, not the static weights). Memory bandwidth therefore gives a useful ceiling:

$$TPS_{max} \approx \frac{\text{Memory Bandwidth (GB/s)}}{\text{Model Size (GB)}}$$

This is the reciprocal of the time required to load the entire model from HBM once. An H100 SXM (3.35 TB/s HBM bandwidth) serving a 7B dense model in FP16 (≈14 GB of weights) yields an upper bound of roughly 239 tokens/sec for a single sequence.[16] Real systems achieve lower throughput because:

  • KV cache reads/writes, activations, and sampling also consume memory bandwidth
  • Kernel launches, attention computation, and scheduler overhead add time to every step
  • The achieved bandwidth of fused kernels is rarely 100% of the theoretical peak
[Figure: Bandwidth-bound decode chart showing the theoretical H100 token-rate ceiling falling as model weight footprint grows]
The simple bandwidth bound shows why smaller weight footprints and more per-GPU memory bandwidth matter so much during single-token decode.

This bound explains why quantization can improve decode throughput: halving only the weight-footprint term (FP16 to INT8, for example) doubles this simplified ceiling. Actual speedup is smaller or different when kernel support, KV-cache traffic, compute, or communication dominate. It also shows why large models may require tensor parallelism: the denominator grows while per-GPU bandwidth stays fixed.

The next calculation is deliberately a ceiling, not a benchmark. It isolates weight traffic so you can see what quantization changes before adding runtime overhead:

decode-bandwidth-ceiling.py

```python
bandwidth_gb_s = 3350  # H100 SXM HBM bandwidth in GB/s
weight_footprints_gb = {
    "7B BF16": 14,
    "7B INT8 weights": 7,
    "70B BF16": 140,
}

for name, weights_gb in weight_footprints_gb.items():
    # Ceiling = how many times per second the weights can be streamed once.
    ceiling_tps = bandwidth_gb_s / weights_gb
    print(f"{name:16} weight-only ceiling = {ceiling_tps:6.1f} token/s")
```

Output

```text
7B BF16          weight-only ceiling =  239.3 token/s
7B INT8 weights  weight-only ceiling =  478.6 token/s
70B BF16         weight-only ceiling =   23.9 token/s
```
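The same reasoning can be extended one step to show why batching changes the picture. The sketch below assumes each decode step reads the weights once for the whole batch plus every active sequence's KV cache; that is a deliberate simplification, and the 0.5 GB of KV per sequence is an assumed value, not a measurement.

```python
def decode_ceiling_tps(
    bandwidth_gb_s: float,
    weights_gb: float,
    kv_gb_per_seq: float,
    batch: int,
) -> float:
    """Aggregate token/s ceiling when one step streams weights once plus all KV caches."""
    step_seconds = (weights_gb + batch * kv_gb_per_seq) / bandwidth_gb_s
    return batch / step_seconds

# Assumed inputs: H100-class bandwidth, 7B BF16 weights, ~0.5 GB of KV per sequence.
for batch in (1, 8, 64):
    tps = decode_ceiling_tps(3350, weights_gb=14, kv_gb_per_seq=0.5, batch=batch)
    print(f"batch={batch:3} aggregate ceiling = {tps:7.1f} token/s, per sequence = {tps / batch:6.1f}")
```

Under these assumptions the aggregate ceiling rises with batch size because the weight read is shared, while the per-sequence rate falls as KV traffic grows. Real kernels eventually hit compute limits that this model ignores.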

Tensor parallelism

For models that don't fit on a single GPU (for example, a 70B dense model in BF16), we use Tensor Parallelism (TP). TP splits the individual weight matrices (e.g., $W_Q, W_K, W_V$) across multiple GPUs so that the computation is distributed evenly. This technique was pioneered in the Megatron-LM training system[17] and adapted for inference serving.

TP slices the matrix computations, so all participating GPUs must exchange their partial results at each layer before the model can proceed to the next one. This frequent synchronization needs a high-bandwidth, low-latency interconnect.

  • Intra-node: TP works best when GPUs share a very fast fabric such as NVLink (NVIDIA high-bandwidth interconnect) or NVSwitch (NVIDIA multi-GPU switch).
  • Inter-node: TP can extend across RDMA (Remote Direct Memory Access) or InfiniBand, but the communication tax rises quickly. For multi-node serving, teams often keep TP within a node when they can, and use Pipeline Parallelism (PP) or independent replicas across nodes instead. PP only communicates activations at stage boundaries, but it introduces pipeline bubbles and can worsen tail latency.

The following diagram illustrates the structural difference between intra-node Tensor Parallelism (where computation is split and synchronized at every layer) and inter-node Pipeline Parallelism (where computation is chained sequentially across nodes):

[Figure: Tensor parallelism synchronizes split layer shards inside one node, while pipeline parallelism sends activations between staged nodes]
Tensor parallelism pays communication on every layer, so it works best on fast intra-node links; pipeline parallelism moves activations between staged parts of the model.

Communication frequency is not total cost: tensor-parallel messages may travel over faster links, while pipeline stages can create bubbles. This toy count makes the first screening question explicit:

parallelism-communication-check.py

```python
layers = 80
pipeline_stages = 4
decode_steps = 32

# TP synchronizes at every layer; PP only hands off at stage boundaries.
tp_sync_events = layers * decode_steps
pp_boundary_events = (pipeline_stages - 1) * decode_steps

print(f"tensor-parallel sync events: {tp_sync_events}")
print(f"pipeline boundary transfers: {pp_boundary_events}")
print("Measure bytes, fabric speed, and pipeline bubbles before choosing.")
```

Output

```text
tensor-parallel sync events: 2560
pipeline boundary transfers: 96
Measure bytes, fabric speed, and pipeline bubbles before choosing.
```
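As a follow-up to the event count, a rough byte estimate helps show what each synchronization carries. The sketch assumes a Megatron-style layout with two all-reduces per transformer layer and a ring all-reduce in which each GPU sends about 2(n-1)/n times the payload. The hidden size of 8192 and the function name are illustrative assumptions.

```python
def tp_bytes_per_gpu_per_step(
    num_layers: int,
    hidden_size: int,
    batch_tokens: int,
    tp_degree: int,
    act_bytes: int = 2,              # BF16 activations
    all_reduces_per_layer: int = 2,  # assumed: one after attention, one after the MLP
) -> float:
    """Approximate bytes each GPU sends per decode step under ring all-reduce."""
    payload = batch_tokens * hidden_size * act_bytes
    ring_factor = 2 * (tp_degree - 1) / tp_degree
    return num_layers * all_reduces_per_layer * payload * ring_factor

# Assumed 70B-class shape: 80 layers, hidden size 8192, TP across 4 GPUs.
for batch_tokens in (1, 64):
    sent = tp_bytes_per_gpu_per_step(80, 8192, batch_tokens, tp_degree=4)
    print(f"batch_tokens={batch_tokens:3} -> {sent / 1e6:.2f} MB sent per GPU per step")
```

At small decode batches the payloads are tiny, so under these assumptions the cost is dominated by the number of synchronizations and link latency rather than raw bandwidth. That is one reason fast intra-node fabrics matter for TP.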

Autoscaling strategies

CPU-based autoscaling alone is insufficient for LLM serving because CPU isn't usually the scarce inference resource. CPU can still reveal overloaded gateways or tokenizers. GPU utilization can also be misleading: a GPU might be busy with a healthy batch, or it might be memory-bound while compute duty cycle looks modest.

Key metrics

The most reliable metrics for autoscaling come from two sources: NVIDIA's DCGM (Data Center GPU Manager) for hardware-level visibility, and the serving framework itself for application-level signals. In vLLM, for example, the Prometheus endpoint exposes metrics such as vllm:num_requests_waiting, vllm:num_requests_running, vllm:kv_cache_usage_perc, vllm:time_to_first_token_seconds, and vllm:inter_token_latency_seconds.[4]

| Metric | Source | Why It Matters | Example Signal |
| --- | --- | --- | --- |
| Request Queue Depth | Serving framework or gateway | Backlog of waiting requests | Sustained upward trend |
| KV Cache Utilization | Serving framework | Memory pressure; near saturation means admission gets tight | Sustained high watermark |
| GPU Duty Cycle | DCGM | Useful supporting telemetry, but easy to misread alone | Corroborate with app metrics |
| Time To First Token (TTFT) | Serving framework | User-facing latency SLA | SLO breach |
| Time Per Output Token (TPOT) | Serving framework | Streaming perceived speed | SLO breach |

DCGM exposes hardware-level GPU metrics such as temperature, power, clock rates, and utilization. GPU utilization looks like an obvious scaling signal, but it's often misleading for LLM inference. The utilization counter reports how much of the sampling window had a kernel running, not how much of the chip's compute that kernel used. A GPU can therefore show 100% "utilization" while its cores are mostly stalled on memory reads during decode, and the same reading can't tell a healthy full batch from a saturated one.

That's why application-level metrics (queue depth, KV cache utilization, TTFT, TPOT) are more reliable scaling signals. They directly measure capacity constraints rather than hardware activity. The exact thresholds are workload-specific, so treat any numbers you see in dashboards or sample code as starting points, not universal defaults.

KEDA and Karpenter integration

One control-loop design works as follows: the serving framework exposes metrics via Prometheus. KEDA can activate a scaled workload from zero and configure the generated HPA for running replica scaling. If the scheduler can't place new GPU pods, the node autoscaler layer (often Karpenter or Cluster Autoscaler) provisions nodes that satisfy pod requirements.[1][2]

This two-layer approach separates desired pod count from hardware supply. Ready pods may scale quickly if spare GPU capacity exists; new nodes and model loading can take far longer. The useful warm buffer is a measured tradeoff between idle GPU cost and the latency damage of a cold burst.

The autoscaling control loop continuously monitors queue depth, KV cache utilization, and latency metrics to make scaling decisions. As the diagram below shows, the metric path and the node-provisioning path are separate:

[Figure: Autoscaling control loop chart showing queue, KV, latency, and pending-pod pressure feeding KEDA, HPA, the scheduler, and node autoscaler.]
Application metrics decide desired replicas first; pending GPU pods then tell the node autoscaler that the cluster needs more hardware.

In a real Kubernetes deployment, KEDA and HPA compute the desired replica count for you. The following Python snippet is just a mental model for the control logic. It takes the current metrics and replica count as inputs, then returns the next target replica count. The thresholds are illustrative only:

keda-and-karpenter-integration.py

```python
import time

class GPUAutoscaler:
    def __init__(self, min_replicas: int = 1, max_replicas: int = 20, cooldown_s: int = 300):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.cooldown_s = cooldown_s
        self.last_scale_time = 0.0

    def recommend(self, metrics: dict, current_replicas: int) -> int:
        # Hold the current size while the cooldown from the last change is active.
        now = time.time()
        if now - self.last_scale_time < self.cooldown_s:
            return current_replicas

        # Any single pressure signal is enough to add capacity.
        scale_up = (
            metrics["num_requests_waiting"] > 10
            or metrics["kv_cache_utilization"] > 0.85
            or metrics["ttft_p95_ms"] > 1500
        )

        # Removing capacity requires every signal to be quiet.
        lightly_loaded = (
            metrics["num_requests_waiting"] == 0
            and metrics["num_requests_running"] <= max(1, current_replicas // 2)
            and metrics["kv_cache_utilization"] < 0.20
        )

        if scale_up:
            self.last_scale_time = now
            return min(self.max_replicas, current_replicas + 1)
        if lightly_loaded:
            self.last_scale_time = now
            return max(self.min_replicas, current_replicas - 1)
        return current_replicas

# cooldown_s=0 so the three scenarios below are evaluated independently.
autoscaler = GPUAutoscaler(min_replicas=1, max_replicas=20, cooldown_s=0)
scenarios = [
    (
        "spike",
        {
            "num_requests_waiting": 42,
            "num_requests_running": 8,
            "kv_cache_utilization": 0.72,
            "ttft_p95_ms": 1200,
        },
        4,
    ),
    (
        "cache pressure",
        {
            "num_requests_waiting": 2,
            "num_requests_running": 7,
            "kv_cache_utilization": 0.91,
            "ttft_p95_ms": 900,
        },
        5,
    ),
    (
        "quiet",
        {
            "num_requests_waiting": 0,
            "num_requests_running": 1,
            "kv_cache_utilization": 0.12,
            "ttft_p95_ms": 450,
        },
        5,
    ),
]

for label, metrics, replicas in scenarios:
    print(f"{label}: {replicas} -> {autoscaler.recommend(metrics, replicas)} replicas")
```

Output

```
spike: 4 -> 5 replicas
cache pressure: 5 -> 6 replicas
quiet: 5 -> 4 replicas
```

The autoscaler determines a target; it doesn't erase startup time. This capacity calculation separates the eventual replica count from the traffic a warm buffer can accept immediately:

capacity-warm-buffer.py

```python
import math

def capacity_plan(concurrent_requests: int, requests_per_replica: int, warm_replicas: int) -> dict[str, int]:
    # Replicas needed once everything has started.
    target_replicas = math.ceil(concurrent_requests / requests_per_replica)
    # Requests the already-warm replicas can take right now.
    immediately_admitted = min(concurrent_requests, warm_replicas * requests_per_replica)
    return {
        "target_replicas": target_replicas,
        "replicas_to_start": max(0, target_replicas - warm_replicas),
        "requests_waiting_for_cold_capacity": concurrent_requests - immediately_admitted,
    }

for warm in [1, 3, 20]:
    plan = capacity_plan(concurrent_requests=100, requests_per_replica=5, warm_replicas=warm)
    print(f"warm={warm:2}: {plan}")
```

Output

```
warm= 1: {'target_replicas': 20, 'replicas_to_start': 19, 'requests_waiting_for_cold_capacity': 95}
warm= 3: {'target_replicas': 20, 'replicas_to_start': 17, 'requests_waiting_for_cold_capacity': 85}
warm=20: {'target_replicas': 20, 'replicas_to_start': 0, 'requests_waiting_for_cold_capacity': 0}
```

A worked example: scaling from zero

Let's walk through a concrete scenario. You're hosting a Llama-3-8B model for your merchant-support chat app. Traffic is nearly zero at night, but at 9 AM it jumps to 100 concurrent users. How do you scale without blowing the budget or the user experience?

Step 1: pick the right metric. Scaling only on CPU percentage misses engine capacity. A GPU worker can show little CPU pressure while its weights and KV blocks constrain admission. Use request queue depth, queue age, KV cache utilization, and latency SLOs; retain CPU signals for gateway or preprocessing failures.

Step 2: set a threshold. Through benchmarking, you determine that one GPU can comfortably handle about 5 concurrent requests while keeping TPOT under 50 ms per token. That's your target capacity per replica.

Step 3: do the math. With 100 concurrent requests and 5 requests per GPU, you need 100 / 5 = 20 GPUs. If you're starting from one warm instance at 8:59 AM, the autoscaler must add 19 more replicas quickly.

Step 4: solve the cold-start gap. Even if a replica target changes promptly, node provisioning plus model loading can outlast the traffic's latency budget. Measure that ready time on your platform. If the morning rush is predictable, start warming nodes far enough ahead of the observed ready time, or keep a benchmarked buffer pool that absorbs the first wave.

Scaling only when latency spikes is late. By the time TTFT breaches your SLO, users are already frustrated. Scale on queue depth trends, not only on lagging latency indicators.
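One way to act on a queue-depth trend rather than a single reading is to fit a slope over a short window of samples. The sketch below is a minimal illustration, not a KEDA feature: the function names, the 15-second sampling interval, and both thresholds are assumptions chosen for the example.

```python
from statistics import mean

def queue_slope(samples: list[float], interval_s: float) -> float:
    """Least-squares slope of queue depth, in requests per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    times = [i * interval_s for i in range(n)]
    t_bar, q_bar = mean(times), mean(samples)
    num = sum((t - t_bar) * (q - q_bar) for t, q in zip(times, samples))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den

def should_scale_early(samples: list[float], interval_s: float,
                       slope_threshold: float = 0.05, min_depth: int = 3) -> bool:
    # Hypothetical thresholds: require a real backlog and a sustained climb.
    return samples[-1] >= min_depth and queue_slope(samples, interval_s) > slope_threshold

rising = [0, 1, 1, 3, 4, 6, 9]  # backlog building steadily
noisy = [2, 0, 3, 1, 2, 0, 3]   # short blips around a flat level

print("rising:", should_scale_early(rising, interval_s=15))
print("noisy: ", should_scale_early(noisy, interval_s=15))
```

The rising trace triggers an early scale-up, while the noisy trace does not, even though both end with a non-empty queue. Tune the window length and thresholds against your own traffic.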

Handling cold starts

Cold starts in GPU serving environments are significantly more disruptive than in traditional microservices. While a typical web container might start in seconds, a GPU workload often has to provision a node, pull a large image, fetch model weights, and then warm up the runtime before it can accept traffic.

Here's the difference between a warm GPU (already loaded and serving) and a cold GPU (starting from scratch):

[Figure: Warm GPU serving immediately compared with cold GPU startup phases for node provisioning, container pull, weight loading, runtime warmup, and readiness.]
A cold GPU replica must become schedulable, pull enough image data, load weights, and pass warmup before it can reduce the user-facing queue.

The cold start has four distinct phases, each with its own mitigation strategy. The ranges below are illustrative planning inputs, not platform guarantees; record your own p50 and p95 timings:

| Phase | Illustrative Duration | Mitigation Strategy |
| --- | --- | --- |
| Node provisioning | 30-180s | Use warm pools, reserved capacity, or faster node images |
| Container pull | 10-90s | Use lazy pulling or container streaming so startup doesn't wait for the full image |
| Model weight loading | 30-180s | Keep weights on local NVMe or a warm shared cache close to the node |
| Runtime warmup | 5-30s | Pre-build optimized engines where possible and run readiness warmups |

Container streaming lets the container start executing before the entire image is downloaded. Only the data the workload actually reads is fetched on demand, which cuts startup time for large inference images.

Model caching keeps frequently used model weights on high-speed local NVMe storage rather than pulling from remote object storage. For multi-node setups, shared network volumes like Amazon FSx for Lustre can serve weights at high speed to multiple nodes.

Some platforms also offer snapshot or restore features for a preloaded container or VM. Those can help, but they're more vendor-specific than warm pools plus local weight caches.

To mitigate cold start latency impact on users, engineers implement several proactive strategies:

  • Over-provisioning: Keep a small headroom buffer to absorb sudden traffic spikes while new instances spin up.
  • Predictive scaling: Use historical traffic patterns to scale up the cluster before anticipated usage spikes (e.g., scaling up just before the morning rush hour).
  • Minimum warm pool: For critical workloads, retain enough measured warm capacity to protect the initial-burst TTFT target while cold replicas load.

You can turn measured phase durations into a readiness decision. In this example, predictive scaling four minutes before a known carrier-status surge is sufficient for the measured path, while scaling two minutes before it is not:

cold-start-readiness.py

```python
phase_seconds = {
    "node provisioning": 95,
    "container pull": 22,
    "weight loading": 78,
    "runtime warmup": 12,
}
# Total time from scale decision to a replica that can take traffic.
measured_ready_s = sum(phase_seconds.values())

for lead_time_s in [120, 240]:
    spare_s = lead_time_s - measured_ready_s
    status = "ready before surge" if spare_s >= 0 else "surge sees cold queue"
    print(f"lead={lead_time_s:3}s ready={measured_ready_s:3}s margin={spare_s:4}s: {status}")
```

Output

```
lead=120s ready=207s margin= -87s: surge sees cold queue
lead=240s ready=207s margin=  33s: ready before surge
```

Cost optimization

Running top-end GPUs 24/7 is expensive, and idle replicas burn budget fast. The techniques below cut cost without breaking Service Level Agreements (SLAs).

Spot instances and fault tolerance

Cloud providers offer spot instances at a discount, but they come with preemption risk. Inference can be easier to retry than long-running training, but reasoning runs and long generations may exceed a provider's warning window. Before assigning traffic to spot capacity, measure request durations and define a drain deadline and retry behavior.

In retail, seasonal spikes such as holiday rushes can be candidates for spot overflow capacity when a stable baseline pool meets the promised SLO and interrupted work can drain or retry cleanly.

  • Graceful shutdown: On SIGTERM, remove the replica from new admission, let completions that fit the remaining drain budget finish, and retry or fail over requests that cannot finish before termination.
  • Mixed pools: A common policy retains on-demand or reserved capacity for baseline SLO traffic and uses an autoscaling spot pool only for retryable overflow.

The drain policy must compare remaining work with a deadline rather than assume all requests finish. This example reserves twenty seconds for shutdown and transfer after a 120-second termination warning:

spot-drain-deadline.py

```python
warning_seconds = 120
shutdown_margin_seconds = 20
# Time left for in-flight generations after reserving the shutdown margin.
finish_budget_seconds = warning_seconds - shutdown_margin_seconds
active_requests = {
    "order-status": 12,
    "refund-summary": 84,
    "bulk-claims-reasoning": 170,
}

for request, remaining_seconds in active_requests.items():
    action = "finish during drain" if remaining_seconds <= finish_budget_seconds else "retry on stable pool"
    print(f"{request:23} remaining={remaining_seconds:3}s -> {action}")
```

Output

```
order-status            remaining= 12s -> finish during drain
refund-summary          remaining= 84s -> finish during drain
bulk-claims-reasoning   remaining=170s -> retry on stable pool
```
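The "remove the replica from new admission" step of a graceful shutdown can be sketched with Python's standard signal module. DrainGate and its method names are hypothetical, and a real server would also need to fail its readiness probe and finish or hand off in-flight work. The example calls the handler directly instead of sending a real SIGTERM so it stays safe to run anywhere.

```python
import signal
import threading

class DrainGate:
    """Tracks whether this replica still admits new requests (hypothetical helper)."""

    def __init__(self) -> None:
        self._draining = threading.Event()

    def begin_drain(self, signum=None, frame=None) -> None:
        # Signature matches what signal.signal passes to a handler.
        self._draining.set()

    def admit(self) -> bool:
        return not self._draining.is_set()

    def install(self) -> None:
        # Register for SIGTERM; must be called from the process's main thread.
        signal.signal(signal.SIGTERM, self.begin_drain)

gate = DrainGate()
before = gate.admit()
gate.begin_drain()  # what the SIGTERM handler would do after gate.install()
after = gate.admit()
print(f"admit before drain: {before}, after drain: {after}")
```

The request path checks gate.admit() before accepting work, so already-running generations continue while new ones are routed elsewhere.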

Scale-down cooldown & thrashing

A common (and expensive) mistake is scaling down too aggressively. If you terminate a GPU node because traffic dipped for 30 seconds, you'll pay the cold start penalty again when traffic returns a minute later. This "thrashing" can actually increase costs while degrading user experience.

In e-commerce logistics, this is like closing a return-processing lane because the conveyor cleared for 30 seconds during a lull, only to have the post-holiday rush hit again before the lane can reopen.

Best practices for scale-down:

  • Cooldown period: Start with a cooldown longer than short observed lulls and tune it against measured cold-start cost, idle spend, and recurring traffic patterns.
  • Drain before terminate: Give the autoscaler time to drain in-flight requests before removing a replica. Don't kill a GPU mid-generation.
  • Scale down gradually: Remove one replica at a time and wait to see if queue depth remains low before removing more.
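The three practices above can be combined into a small stabilizer that sits between the raw recommendation and the applied replica count. This is a sketch under assumed names (ScaleDownStabilizer, a window counted in evaluation ticks). It is similar in spirit to the scale-down stabilization window in the Kubernetes HPA, but it is not that implementation.

```python
from collections import deque

class ScaleDownStabilizer:
    """Allow a one-replica scale-down only after a full window of low recommendations."""

    def __init__(self, window: int):
        self.recent = deque(maxlen=window)

    def apply(self, current: int, recommended: int) -> int:
        self.recent.append(recommended)
        if recommended >= current:
            return recommended  # scale-ups (or no change) pass through immediately
        if len(self.recent) < self.recent.maxlen:
            return current  # not enough history yet
        if max(self.recent) < current:
            self.recent.clear()  # require a fresh window before the next removal
            return current - 1
        return current

stabilizer = ScaleDownStabilizer(window=3)
replicas = 5
history = []
for recommended in [3, 5, 3, 3, 3, 3, 3, 3]:
    replicas = stabilizer.apply(replicas, recommended)
    history.append(replicas)
print(history)
```

The brief return to a recommendation of 5 resets the countdown, and each removal waits for a new full window of low readings, so the fleet shrinks from 5 to 4 to 3 rather than dropping at once.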

Deployment options

When deciding how to deploy GPU inference, teams face a build-vs-buy decision:

| Approach | Providers | Pros | Cons |
| --- | --- | --- | --- |
| Serverless GPUs | Modal, RunPod, Replicate, Together AI | Less platform work, built-in autoscaling, usage-based billing | Less control, potentially higher unit cost, cold starts on infrequent traffic |
| Specialized GPU cloud | CoreWeave, Lambda Cloud | Fast access to newer GPUs, strong price/performance, more control over instances and storage topology | More infrastructure work than serverless, portability can be weaker |
| Managed K8s (GPU) | EKS/GKE/AKS + Karpenter | Full control, spot instance support, custom metrics | Complex to set up and maintain, requires ML platform expertise |

Serverless platforms like Modal and RunPod abstract away the Kubernetes layer entirely. They typically handle autoscaling and much of the instance lifecycle for you. This is ideal for teams without dedicated ML infrastructure engineers or for workloads with highly variable traffic.

Specialized GPU clouds sit in the middle. You usually get raw instances, storage, or managed Kubernetes primitives without the full DIY burden of the hyperscalers. Managed Kubernetes still gives you the most control over the full stack (vLLM versions, custom schedulers, quantization methods) and can be cheaper once utilization is high and predictable enough to pay back the platform engineering overhead.

Common pitfalls

Even experienced engineers trip over the same patterns when moving from web serving to GPU inference. Here are the frequent failures to catch during design review, with their symptoms, root causes, and fixes.

Flapping: scaling up and down too fast

Symptom: Your GPU node count oscillates wildly. Cloud bills spike, but user latency doesn't improve. The autoscaler logs show scale-up events followed by scale-down events within minutes.

Cause: The autoscaler reacts to short traffic blips instead of sustained trends. A GPU node that just finished loading weights gets terminated before it serves enough requests to justify its startup cost.

Fix: Choose a cooldown from traffic and startup measurements rather than a universal duration. Scale down gradually, and require queue depth near zero, low KV cache usage, and low running requests before removing capacity.

Ignoring tail latency

Symptom: Your dashboard shows an average response time of 800 ms, but support tickets complain about 30-second waits. The p95 or p99 latency is an order of magnitude worse than the mean.

Cause: Average latency hides the users with long prompts or the batches where one straggler request keeps the GPU occupied. Static batching makes this worse, but even continuous batching can suffer if a single request with a 4,000-token prompt monopolizes a slot.

Fix: Monitor TTFT and TPOT at p95 or p99, not the mean. Set SLOs on tail latency. If long prompts are common, enable chunked prefill so they don't starve shorter decode requests.
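A small synthetic sample shows how an 800 ms average can coexist with multi-second waits. The latency values are invented for illustration; the percentile helper uses statistics.quantiles from the standard library with the inclusive method.

```python
from statistics import mean, quantiles

# Synthetic latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = [300] * 90 + [2000] * 7 + [13000] * 3

def percentile(samples: list[float], p: int) -> float:
    """Return the p-th percentile (1-99) using inclusive interpolation."""
    return quantiles(samples, n=100, method="inclusive")[p - 1]

print(f"mean={mean(latencies_ms):.0f} ms")
for p in (50, 95, 99):
    print(f"p{p}={percentile(latencies_ms, p):.0f} ms")
```

Here the mean is 800 ms, the median is 300 ms, p95 is 2,000 ms, and p99 is 13,000 ms. A dashboard that shows only the mean hides the requests that generate support tickets.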

No explicit queue or backpressure

Symptom: Requests keep arriving even after the GPU workers are full. The service eventually times out randomly, and retry storms make the spike worse.

Cause: The system accepts more work than the serving engine can schedule. Without a visible queue, admission policy, timeout budget, and retry contract, overload becomes invisible until users see failures.

Fix: Put a queue or gateway in front of the inference engine. Track queue depth, queue age, and rejection rate. Return controlled overload responses before the GPU fleet collapses.
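A minimal sketch of an explicit admission queue follows. AdmissionQueue, its limits, and the 202/429 status codes are illustrative assumptions, and timestamps are passed in explicitly so the behavior is easy to test. A production gateway would add per-tenant limits and export these counters as metrics.

```python
from collections import deque
from typing import Optional

class AdmissionQueue:
    """Bounded FIFO queue that rejects work instead of accepting unbounded backlog."""

    def __init__(self, max_depth: int, max_age_s: float):
        self.max_depth = max_depth
        self.max_age_s = max_age_s
        self._items = deque()  # (enqueue_time, request_id)
        self.rejected = 0

    def offer(self, request_id: str, now: float) -> int:
        """Return an HTTP-style status: 202 if queued, 429 if the queue is full."""
        if len(self._items) >= self.max_depth:
            self.rejected += 1
            return 429
        self._items.append((now, request_id))
        return 202

    def take(self, now: float) -> Optional[str]:
        """Pop the oldest request that is still inside its age budget."""
        while self._items:
            enqueued_at, request_id = self._items.popleft()
            if now - enqueued_at <= self.max_age_s:
                return request_id
            self.rejected += 1  # expired while waiting; the client has likely timed out
        return None

queue = AdmissionQueue(max_depth=2, max_age_s=5.0)
statuses = [queue.offer(f"r{i}", now=0.0) for i in range(3)]
first = queue.take(now=1.0)
second = queue.take(now=10.0)
print(statuses, first, second, queue.rejected)
```

The third request is refused at the door, and a request that waited past its age budget is dropped rather than sent to a GPU after its client has given up.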

Treating every request as equal priority

Symptom: A batch analytics job slows down interactive customer support, or one tenant's long prompts make all tenants miss their TTFT target.

Cause: The scheduler sees all requests as identical even though some workloads are interactive, some are batch, and some have stricter contractual latency limits.

Fix: Separate traffic classes. Use priority queues, per-tenant rate limits, maximum prompt and generation budgets, and different pools when workload shapes are too different to share one scheduler fairly.
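One simple way to separate traffic classes is a priority queue keyed by class, with a sequence counter to keep first-in, first-out order inside each class. The class names and priority values below are hypothetical. Strict priority like this can starve batch work under sustained interactive load, which is one reason separate pools are sometimes the better answer.

```python
import heapq
import itertools
from typing import Optional

PRIORITY = {"interactive": 0, "batch": 1}  # hypothetical traffic classes; lower runs first

class PriorityAdmission:
    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, traffic_class: str, request_id: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._seq), request_id))

    def next_request(self) -> Optional[str]:
        return heapq.heappop(self._heap)[2] if self._heap else None

admission = PriorityAdmission()
for traffic_class, request_id in [
    ("batch", "b1"), ("batch", "b2"), ("interactive", "i1"),
    ("batch", "b3"), ("interactive", "i2"),
]:
    admission.submit(traffic_class, request_id)

order = []
while (request_id := admission.next_request()) is not None:
    order.append(request_id)
print(order)
```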

Underestimating long-context KV growth

Symptom: A model that passed a short-prompt load test starts rejecting work or swapping under real chat history. GPU memory looks fine at startup and then collapses as conversations lengthen.

Cause: The capacity plan counted weights but not enough KV cache. Long prompts, high concurrency, prefix-cache headroom, and larger generation limits all consume blocks during the request lifetime.

Fix: Size with realistic prompt and output distributions, not only maximum model weights. Load test with long conversations, watch vllm:kv_cache_usage_perc, and apply admission limits before the cache pool hits saturation.
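A back-of-the-envelope calculation shows why short-prompt load tests mislead. The sketch assumes a Llama-3-8B-style shape (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and a hypothetical 40 GiB of GPU memory left for KV blocks after weights. Substitute your own model config and measured budget.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor 2 covers the separate key and value tensors stored in every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_sequences(kv_budget_gib: int, tokens_per_sequence: int, bytes_per_token: int) -> int:
    return (kv_budget_gib * 1024**3) // (tokens_per_sequence * bytes_per_token)

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # assumed model shape
print(f"KV per token: {per_token // 1024} KiB")

for tokens in (512, 8192):
    fit = max_concurrent_sequences(kv_budget_gib=40, tokens_per_sequence=tokens, bytes_per_token=per_token)
    print(f"{tokens:5} tokens per conversation -> {fit} concurrent sequences")
```

Under these assumptions each token costs 128 KiB of KV cache, so the same budget holds 640 conversations at 512 tokens but only 40 at 8,192 tokens.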

Confusing PagedAttention with prefix caching

Symptom: A team enables PagedAttention and expects repeated system prompts to become free, then sees no TTFT improvement on new replicas.

Cause: PagedAttention is an allocation strategy for KV blocks. Prefix caching is a reuse strategy for shared prompt prefixes. One reduces fragmentation; the other avoids recomputing prefix KV state.

Fix: Use both when the workload benefits from both. Treat PagedAttention as baseline memory management and prefix caching as a workload-specific optimization that needs stable shared prefixes and warm caches.
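The distinction can be made concrete with a toy block pool. This is a deliberately simplified model, not vLLM's implementation: ToyKVPool, the 4-token block size, and keying full blocks by their token prefix are all assumptions made for illustration. Block-wise allocation is always on; prefix reuse is a separate switch.

```python
BLOCK = 4  # tokens per KV block (tiny, for illustration)

class ToyKVPool:
    def __init__(self) -> None:
        self.prefix_blocks = {}  # token prefix of a full block -> physical block id
        self.next_block = 0
        self.computed = 0        # blocks whose KV state had to be computed

    def admit(self, tokens: list, reuse_prefix: bool) -> list:
        """Return the block table (physical block ids) for one request."""
        table = []
        for start in range(0, len(tokens), BLOCK):
            end = start + BLOCK
            # Only full blocks are shareable; key them by everything up to the block end.
            key = tuple(tokens[:end]) if end <= len(tokens) else None
            if reuse_prefix and key in self.prefix_blocks:
                table.append(self.prefix_blocks[key])
                continue
            block_id = self.next_block
            self.next_block += 1
            self.computed += 1
            if reuse_prefix and key is not None:
                self.prefix_blocks[key] = block_id
            table.append(block_id)
        return table

system_prompt = list(range(1, 9))  # 8 shared tokens = 2 full blocks
request_a = system_prompt + [100, 101, 102]
request_b = system_prompt + [200, 201, 202]

paged_only = ToyKVPool()
paged_only.admit(request_a, reuse_prefix=False)
paged_only.admit(request_b, reuse_prefix=False)

with_prefix = ToyKVPool()
table_a = with_prefix.admit(request_a, reuse_prefix=True)
table_b = with_prefix.admit(request_b, reuse_prefix=True)

print("blocks computed, paging only:", paged_only.computed)
print("blocks computed, with prefix reuse:", with_prefix.computed)
print("block tables:", table_a, table_b)
```

Paging alone still computes all six blocks; it only controls where they live. With prefix reuse, the second request points at the two shared blocks and computes one new block, which is where the TTFT saving comes from.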

Cache blindness during scale-up

Symptom: You scale from 2 to 10 GPUs during a traffic spike, but the new nodes serve requests slower than the old ones. TTFT actually increases right after scaling.

Cause: The original nodes have been running for hours and have accumulated prefix cache hits (for example, a shared system prompt that every request includes). The brand-new nodes start with empty caches. They must recompute the full prefill from scratch, so their first tokens take much longer.

Fix: Warm new nodes with a few synthetic requests that populate the common prefix before adding them to the load balancer rotation. Alternatively, use vLLM's prefix caching and ensure new nodes receive "seed" traffic to warm their KV cache before taking full production load.

Advanced techniques

Prefill-decode disaggregation

Prefill and decode stress GPUs differently. Prefill processes many prompt tokens in parallel and tends to be compute-heavy. Decode produces one token per active request and tends to be memory-bandwidth-heavy. If you run both phases on the same worker pool, a few long prompts can delay short streaming responses, and decode traffic can leave tensor cores underused.

Prefill-decode disaggregation, introduced for production serving by Splitwise[18], splits the serving fleet into two pools:

  • Prefill workers ingest prompts, build the initial KV cache, and hand off cache state.
  • Decode workers continue token generation for active requests with tighter scheduling around TPOT.

This is not free. You now need cache transfer, placement logic, and backpressure between pools. But it can help high-traffic systems where long prompts and streaming generations compete for the same GPU budget. The autoscaling signals also become more specific: prefill workers scale on prompt-token backlog and TTFT, while decode workers scale on active sequences, KV-cache utilization, and TPOT.

NVIDIA Dynamo is one documented open-source control-plane example. Its documentation describes disaggregated prefill and decode deployments across backends including vLLM, SGLang, and TensorRT-LLM, with KV-aware routing and KV-transfer mechanisms for split deployments.[3] This makes it a useful implementation to study, but the operational win still depends on measured prompt mix, transfer cost, and cache-hit behavior.

Don't start here. First tune continuous batching, chunked prefill, prefix caching, and queue policy. Disaggregation is a later optimization when one mixed pool can no longer hit both TTFT and TPOT targets.

Use measurements, not architecture fashion, to decide whether to split pools. The following gate recommends investigation only when a tuned mixed pool breaches both service objectives under a long-prompt-heavy trace:

disaggregation-trigger.py

```python
workload_trials = [
    {"name": "short support chat", "long_prompt_share": 0.08, "ttft_p95_ms": 820, "tpot_p95_ms": 44},
    {"name": "policy-document surge", "long_prompt_share": 0.61, "ttft_p95_ms": 2450, "tpot_p95_ms": 93},
]
ttft_slo_ms = 1500
tpot_slo_ms = 60

for trial in workload_trials:
    # Only a trace that misses both objectives justifies benchmarking split pools.
    both_breach = trial["ttft_p95_ms"] > ttft_slo_ms and trial["tpot_p95_ms"] > tpot_slo_ms
    action = "benchmark split pools" if both_breach else "keep tuning mixed pool"
    print(f"{trial['name']:23} long-prompts={trial['long_prompt_share']:.0%} -> {action}")
```

Output

```
short support chat      long-prompts=8% -> keep tuning mixed pool
policy-document surge   long-prompts=61% -> benchmark split pools
```

Speculative decoding

Speculative decoding speeds up generation by using a smaller "draft" model to predict multiple tokens ahead, then verifying them with the larger "target" model. Leviathan et al. report roughly 2-3x acceleration in their evaluated settings, not a universal serving guarantee.[19]

The speedup comes from amortizing target-model work across several draft tokens. The draft model's predictions are treated as hypotheses; the target model evaluates them in one verification pass and keeps the valid prefix. When the draft model is reasonably accurate for the traffic pattern, this reduces the number of expensive target forward passes. It still pays draft-model compute and rejection overhead, so measure it against your own workload instead of assuming it always helps.

For example, you might pair a small draft model with a 70B target model to accelerate decode-heavy traffic. The draft model generates candidate tokens quickly; the target model then verifies all candidates in a single forward pass, accepting all correct predictions up to the first mismatch.
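The accept-until-mismatch rule can be sketched for the greedy case, where a draft token is kept only if it equals the target model's own choice. This is a simplification: the sampling version in Leviathan et al. uses a probabilistic accept/reject step instead of exact equality. The function names are hypothetical, and the expected-tokens formula assumes each draft token is accepted independently with the same probability.

```python
def accept_draft(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Greedy verification of one draft window.

    target_tokens[i] is the target model's choice at position i given the
    draft prefix before it, so it has one more entry than draft_tokens.
    """
    emitted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_tokens[i]:
            emitted.append(target_tokens[i])  # target's token replaces the mismatch
            return emitted
        emitted.append(draft)
    emitted.append(target_tokens[len(draft_tokens)])  # bonus token when every draft matches
    return emitted

def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens per target pass for an acceptance rate below 1."""
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)

print(accept_draft([5, 6, 7], [5, 6, 9, 4]))  # third draft token is rejected
print(accept_draft([5, 6, 7], [5, 6, 7, 4]))  # all drafts accepted, plus one bonus token
for rate in (0.5, 0.8):
    print(f"acceptance={rate}: {expected_tokens_per_pass(rate, draft_len=4):.2f} tokens per target pass")
```

Every target pass emits at least one token, so a poor draft model wastes draft compute without making the output wrong. The payoff depends on the acceptance rate for your traffic.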

Multi-region failover

GPU availability varies by region and time. During peak demand (like major AI product launches), entire regions can run out of H100 capacity. A resilient serving architecture should be able to "overflow" traffic to a secondary region when the primary is at capacity or experiencing issues.

This requires:

  • Global load balancing: Route requests to the nearest healthy region
  • Model sync: Keep model weights and LoRA adapters synchronized across regions
  • Failover logic: Detect regional capacity constraints and automatically shift traffic

Regional failover is particularly important for spot instance workloads. If a regional spot fleet is reclaimed, your tested recovery-time objective determines whether secondary on-demand capacity can receive new requests before client timeout budgets expire. Long in-flight generations still need retry or resumption semantics.
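The routing decision behind overflow can be sketched as "nearest healthy region with free capacity". Region names, latencies, and the free_slots field are invented for illustration. A real global load balancer would work from health checks and capacity signals reported by each region.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Region:
    name: str
    latency_ms: float  # measured client-to-region latency
    healthy: bool
    free_slots: int    # admission capacity the region reports as available

def pick_region(regions: list[Region], min_free_slots: int = 1) -> Optional[str]:
    candidates = [r for r in regions if r.healthy and r.free_slots >= min_free_slots]
    if not candidates:
        return None  # caller should queue or return a controlled overload response
    return min(candidates, key=lambda r: r.latency_ms).name

regions = [
    Region("primary", latency_ms=20, healthy=True, free_slots=0),    # at capacity
    Region("secondary", latency_ms=65, healthy=True, free_slots=12),
    Region("tertiary", latency_ms=40, healthy=False, free_slots=30), # failing health checks
]
print(pick_region(regions))
```

With the primary full and the tertiary unhealthy, new requests overflow to the secondary even though it is farther away.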

Multi-tenancy with LoRA

Instead of deploying a separate 70B model for every customer fine-tune, use a shared base model and hot-swap LoRA (Low-Rank Adaptation)[20] adapters. Systems like S-LoRA[21] show how to batch requests across different adapters while keeping the base model weights shared in GPU memory. vLLM also supports per-request LoRA serving, with explicit warnings around runtime adapter loading in untrusted environments.[22]

To implement multi-tenancy, you can select a LoRA adapter per request. The class below shows how a single base model instance can serve different tenants by applying each tenant's adapter on the fly. A production deployment would use vLLM's AsyncLLMEngine to handle concurrent requests without blocking; the synchronous LLM class is shown here for conceptual clarity. The serve method takes an incoming request carrying a tenant ID header, looks up that tenant's adapter, and returns text generated by the shared base model with the adapter applied. Requests with a missing or unknown tenant ID fall back to the base model:

multi-tenancy-with-lora.py

```python
from collections.abc import Mapping
from typing import Protocol

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

class TenantRequest(Protocol):
    headers: Mapping[str, str]
    prompt: str

class MultiTenantServer:
    """
    Serves multiple fine-tunes on a single GPU using a shared base model.
    """
    def __init__(self, base_model_path: str):
        self.engine = LLM(model=base_model_path, enable_lora=True)
        self.adapters = {
            "customer_A": {
                "name": "adapter_A",
                "id": 101,
                "path": "path/to/adapter_A",
            },
            # Additional tenants...
        }

    def serve(self, request: TenantRequest) -> str:
        tenant_id = request.headers.get("X-Tenant-ID")
        adapter = self.adapters.get(tenant_id)

        sampling_params = SamplingParams(temperature=0.7)

        # No adapter registered for this tenant: serve from the base model.
        lora_request = None
        if adapter:
            lora_request = LoRARequest(
                adapter["name"],
                adapter["id"],
                adapter["path"],
            )

        outputs = self.engine.generate(
            [request.prompt],
            sampling_params,
            lora_request=lora_request,
        )
        return outputs[0].outputs[0].text
```

In production, prefer the async server path over a synchronous wrapper like this, and don't expose arbitrary adapter loading to untrusted tenants.[22]

Evaluation rubric

At this point, you should be able to explain a serving design from first principles, not only name tools:

  • Compare vLLM, TensorRT-LLM, TGI, SGLang, and llama.cpp in terms of throughput, latency, ecosystem fit, hardware coupling, and operational complexity.
  • Explain how continuous batching avoids the straggler waste of static batching.
  • Describe how PagedAttention maps logical KV blocks to physical GPU blocks and why that reduces fragmentation.
  • Choose scaling signals from queue depth, KV cache utilization, TTFT, and TPOT instead of CPU percent alone.
  • Recommend MIG for smaller models only when the target GPU SKU supports it and the slice size still fits weights plus KV cache.
  • Plan cold-start mitigation with warm pools, predictive scaling, cached weights, and readiness warmups.
  • Explain why quantization reduces both VRAM pressure and the decode bandwidth denominator.
  • Use LoRA adapters for multi-tenant fine-tunes without turning adapter loading into an untrusted file-access path.
  • Say when to stay on one mixed pool versus move to prefill-decode disaggregation (Splitwise-style, operated by frameworks like NVIDIA Dynamo) once one pool cannot hit both TTFT and TPOT.

Follow-up questions

How do you handle cold starts for GPU instances in production?

Cold starts are dominated by node provisioning, image pulls, weight loading, and runtime warmup. Keep a small warm pool for critical traffic, pre-warm ahead of predictable spikes, store weights close to the node, and run readiness warmups before sending real users to a replica. Snapshot or restore features can help on specific platforms, but warm capacity and local weight caches are the common starting point.

When would you choose TensorRT-LLM over vLLM?

Choose TensorRT-LLM when measured latency or throughput on NVIDIA hardware justifies the build and deployment complexity. It fits tightly controlled fleets where the model, GPU type, precision mode, and engine build process are stable. vLLM can be easier to operate for general serving and rapid iteration; benchmark either claim on your workload.

How does continuous batching improve throughput compared with static batching?

Static batching waits for the longest request in the batch. Continuous batching repacks active work at token-step boundaries, so a finished request frees its slot for queued work immediately. That matters because decode produces one token at a time and request lengths vary widely.
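A step-count simulation makes the difference visible. It is a simplified model: each request needs a fixed number of decode steps, prefill is ignored, and a freed slot is refilled at the next step. The request lengths and batch size are invented for illustration.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    # Each batch occupies the GPU until its longest request finishes.
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    queue = list(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while queue or active:
        # Refill free slots from the queue before every decode step.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

output_lengths = [10, 200, 15, 12, 180, 8, 20, 9]  # decode steps per request
print("static:    ", static_batching_steps(output_lengths, batch_size=4))
print("continuous:", continuous_batching_steps(output_lengths, batch_size=4))
```

In this trace static batching needs 380 steps because each batch waits for its straggler, while continuous batching finishes in 200 steps by starting queued requests as soon as short ones complete.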

Which metrics matter most for scaling LLM inference clusters?

Start with request queue depth and queue age, KV cache utilization, TTFT, and TPOT. Hardware metrics such as GPU duty cycle, memory bandwidth, temperature, and power are useful supporting signals, but they don't tell you whether users are waiting or whether the scheduler has enough KV blocks left.

Try it: three practice labs

The best way to internalize GPU serving concepts is to simulate the decisions yourself. Each lab below builds on the article's examples.

Lab A: the architect

Set up a dummy inference service and configure an autoscaler to scale based on queue depth. Use a tool like KEDA with a Prometheus metric source, or simulate the control loop in Python using the GPUAutoscaler class from this article. Feed it a synthetic traffic trace (flat at night, spike at 9 AM) and plot the replica count over time. Compare a cooldown shorter than your measured node-ready time with one longer than a typical lull.

Lab B: the optimizer

Measure how long it takes to load a 7B model versus a 70B model into GPU memory on your hardware (or using cloud instance startup logs). Calculate the "cost of a cold start" in dollars: multiply measured loading time by hourly GPU cost. Compare that cost and queue impact with keeping one ready replica through a known low-traffic period.

Lab C: the interviewer

A user reports that the first token takes 5 seconds, but subsequent tokens arrive quickly. Which part of your stack is likely at fault: the autoscaler or the model engine? Write down your reasoning, then check it against the cold-start phases and the prefill-vs-decode discussion in this article.

Key takeaways

You started with a merchant chat app that went viral and discovered why LLM serving isn't web serving. KV-cache capacity constrains active conversations; low-batch decode is often limited by memory bandwidth; and autoscaling needs queue, cache, and latency signals in addition to infrastructure telemetry.

A Kubernetes pattern using KEDA/HPA, a GPU node autoscaler, and a serving engine separates desired replicas from hardware provisioning. Continuous batching and PagedAttention improve GPU use, but a reliable fleet still needs measured capacity, a tuned cooldown, and a warmup policy justified by its latency target and cost.

After this chapter, you can explain why a GPU serving lane needs a different control loop than a web server, why new nodes sometimes run slower than old ones, and how to keep a thousand-user spike from overrunning your inference budget.

Next Step
Continue to A/B Testing for LLMs

Once your serving stack can handle traffic spikes, you need to know whether the model changes you ship actually help users. A/B testing gives you the measurement framework to answer that with statistical confidence, rather than gut feeling.

References

Scaling Deployments, StatefulSets & Custom Resources. KEDA · 2026

Concepts. Karpenter · 2026

NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models. NVIDIA · 2025

Metrics. vLLM · 2026

Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al. · 2022 · OSDI 2022

Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023

Automatic Prefix Caching. vLLM · 2026

NVIDIA TensorRT-LLM Documentation. NVIDIA · 2026

TensorRT-LLM Quantization. NVIDIA · 2026

Text Generation Inference. Hugging Face · 2026

SGLang: Efficient Execution of Structured Language Model Programs. Zheng, L., et al. · 2023

llama.cpp: Inference of LLaMA model in pure C/C++. Gerganov, G. · 2023

Optimization and Tuning. vLLM · 2026

Supported GPUs. NVIDIA · 2026

Efficiently Scaling Transformer Inference. Pope, R., et al. · 2023 · arXiv preprint

H100 GPU. NVIDIA · 2026

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Shoeybi, M., et al. · 2019

Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023

Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., et al. · 2022

LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR

S-LoRA: Serving Thousands of Concurrent LoRA Adapters. Sheng, Y., et al. · 2023 · arXiv preprint

LoRA Adapters. vLLM · 2026