LeetLLM
© 2026 LeetLLM. All rights reserved.


Multi-Tenant LLM Platform

Design a shared LLM platform with tenant-scoped state, quota enforcement, adapter routing, KV accounting, and measured GPU utilization.


Code completion gave you a single high-frequency product surface: one developer, one editor context, one low-latency serving path. A multi-tenant large language model (LLM) platform generalizes that serving path into shared infrastructure where many tenants, adapters, quotas, and privacy boundaries coexist on the same fleet.

A multi-tenant LLM platform shares expensive serving infrastructure while enforcing tenant-scoped state, scheduler policy, and measurable latency objectives. This design chapter covers routing, quotas, batching, data boundaries, and cost control.

Imagine you run an AI logistics platform. One hundred online merchants use it to answer customer questions, track packages, and draft return labels. Each merchant requires authorization boundaries around customer data, prompts, adapters, and usage records. Your job is to share GPU capacity while keeping every stateful path scoped to the authorized tenant.

This is the multi-tenant LLM serving problem. In this article we will follow one concrete request through a shared platform and see, at each layer, how it enforces tenant scopes, schedules shared work, and measures latency.

Before we start, recall three ideas from earlier in the curriculum. First, an LLM generates text one token at a time. Second, to avoid recomputing the entire prompt for every new token, the model stores intermediate results in a structure called the KV cache. Third, batching runs multiple requests together so the GPU loads the model weights once and amortizes the cost across many users. We will build on all three.

Figure: Multi-tenant LLM platform: requests from multiple tenants flow through a gateway into shared GPU pools with rate limits, fair scheduling, and per-tenant billing.
Every request carries tenant identity through the gateway, scheduler, shared model pool, and billing path so isolation and fairness stay attached to the work.

Why shared capacity needs hard boundaries

Consider a design scenario with a dense 72-billion-parameter model stored in FP16 (16-bit floating point). Weight storage alone is about 144 GB in decimal units. One NVIDIA H100 SXM configuration has 80 GB of HBM3 memory.[1] In this scenario, one copy of the model weights exceeds one such GPU's memory before accounting for KV cache or any other serving overhead.

If each of one hundred merchants had a separate copy of those weights, weight memory alone would be 14.4 TB. Shared base weights can avoid that duplication, but they do not automatically isolate prompts, retrieved documents, adapters, caches, or billing state.

Make the scenario calculation runnable before discussing schedulers:

shared-weight-capacity.py
    def weight_storage_gb(parameters_billions: int, bytes_per_parameter: int) -> float:
        return parameters_billions * bytes_per_parameter

    base_weight_gb = weight_storage_gb(parameters_billions=72, bytes_per_parameter=2)
    per_tenant_weight_tb = base_weight_gb * 100 / 1000

    assert base_weight_gb == 144
    assert per_tenant_weight_tb == 14.4
    print("one_fp16_weight_copy_gb:", base_weight_gb)
    print("one_hundred_copies_tb:", per_tenant_weight_tb)
Output
    one_fp16_weight_copy_gb: 144
    one_hundred_copies_tb: 14.4

Sharing introduces three concrete engineering tensions:

  1. Compute contention. All merchants want the GPU's CUDA cores at the same time during peak shopping hours.
  2. Memory contention. Every active conversation consumes KV cache memory. A merchant with a long support transcript can evict another merchant's conversation if limits aren't enforced.
  3. Weight customization. Merchants want different behaviors. One needs customer-service answers; another needs terse warehouse commands. Loading a full model copy per customization wastes capacity, so we need scoped lightweight adapters or separate pools where required.

The rest of the article solves these three tensions in order.

How we pack requests together: continuous batching

When a GPU processes a batch of requests, it loads the model weights once and reuses them for every request in the batch. The simplest approach is static batching: collect eight requests, run them together, and wait until every single one finishes before starting a new batch. This is easy to implement but wasteful. If Merchant A's tracking query generates only 10 output tokens while Merchant B's returns analysis generates 500 tokens, Merchant A's batch slot sits idle while Merchant B finishes the remaining 490 tokens.
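To put a number on that waste, the sketch below counts idle slot-steps in a static batch. It assumes one decode slot per request and one token per slot per step; the function name `idle_slot_steps` is illustrative and not part of any runtime.

```python
def idle_slot_steps(output_lengths: list[int]) -> int:
    """Count slot-steps spent waiting in a static batch (one token per slot per step)."""
    batch_steps = max(output_lengths)  # a static batch runs until its longest request ends
    return sum(batch_steps - length for length in output_lengths)

lengths = [10, 500]  # Merchant A and Merchant B output tokens from the example
wasted = idle_slot_steps(lengths)
utilization = sum(lengths) / (max(lengths) * len(lengths))

print("idle_slot_steps:", wasted)
print("slot_utilization:", round(utilization, 2))
```

For the two-merchant example this reports 490 idle slot-steps and roughly 51% slot utilization. Real batches are larger and more varied, so treat the figure as an illustration of the mechanism rather than a throughput estimate.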

Continuous batching (also called in-flight batching, described in the Orca paper[2]) replaces completed requests with queued work at iteration boundaries. When Merchant A reaches its EOS (End of Sequence) token, the scheduler can admit another request without waiting for Merchant B to finish.

The throughput gain depends on prompt lengths, decode lengths, admission policy, and scheduler overhead. The useful principle is that a finished request need not occupy a decode slot.

In a multi-tenant environment, the scheduler also has to respect priority and fairness. A high-tier merchant may have a tighter latency objective. A tenant-aware continuous batcher therefore balances throughput (packing as many tokens as possible) against measured latency objectives for prioritized tenants. We will see how it does that in the rate-limiting and preemption sections below.

Think of a shared shuttle bus. Static batching is a charter bus that waits until every passenger reaches their destination before returning to the depot. Continuous batching lets an empty seat be offered to the queue at the next scheduled stop.

This miniature schedule keeps a long request active while replacing a completed short request:

continuous-batch-slots.py
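    from collections import deque

    # Remaining decode tokens per active request, plus a queue of waiting work.
    active = {"merchant-a": 1, "merchant-b": 3}
    waiting = deque([("merchant-c", 2)])

    # One decode iteration: every active request emits one token.
    for tenant in list(active):
        active[tenant] -= 1
        if active[tenant] == 0:
            # A finished request frees its slot at the iteration boundary.
            del active[tenant]
            if waiting:  # admit queued work only when something is waiting
                admitted, remaining_tokens = waiting.popleft()
                active[admitted] = remaining_tokens

    assert active == {"merchant-b": 2, "merchant-c": 2}
    print("active_after_iteration:", active)
Output
    active_after_iteration: {'merchant-b': 2, 'merchant-c': 2}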

How we customize behavior without duplicating weights: LoRA adapters

Merchants may need different behaviors without separate full model copies. LoRA (Low-Rank Adaptation[3]) learns low-rank matrices next to selected original weight layers. During inference, the base model remains fixed and the selected adapter contributes an additional projection.
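In the notation of the LoRA paper[3], a frozen weight matrix W_0 of shape d × k is paired with two trainable matrices, B of shape d × r and A of shape r × k, where the rank r is much smaller than d and k. The adapted layer output can be written as:

```latex
h = W_0 x + \frac{\alpha}{r} B A x
```

Here α is a scaling hyperparameter. Only A and B differ between adapters, which is why the base weights W_0 can stay shared across tenants.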

Adapter size is not one universal number: it depends on rank, target modules, model dimensions, and dtype. It is typically much smaller than a full base copy, but a platform must measure its chosen adapter footprint and decide how many adapters can remain resident alongside KV-cache budgets.
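A rough footprint estimate makes that dependence concrete. The sketch below assumes each adapted module of shape (d_in, d_out) adds rank × (d_in + d_out) parameters; the 80-layer model, the hidden size of 8192, and the choice of two target projections per layer are hypothetical inputs, not measurements from any specific model.

```python
def lora_adapter_bytes(rank: int, target_shapes: list[tuple[int, int]],
                       num_layers: int, dtype_bytes: int = 2) -> int:
    """Bytes for one adapter: each (d_in, d_out) module adds rank * (d_in + d_out) params."""
    params_per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes)
    return params_per_layer * num_layers * dtype_bytes

# Hypothetical model: 80 layers, adapters on two 8192x8192 projections per layer, FP16.
shapes = [(8192, 8192), (8192, 8192)]
r16 = lora_adapter_bytes(rank=16, target_shapes=shapes, num_layers=80)
r64 = lora_adapter_bytes(rank=64, target_shapes=shapes, num_layers=80)

print("rank_16_mib:", r16 / 2**20)
print("rank_64_mib:", r64 / 2**20)
```

Under these assumptions a rank-16 adapter takes 80 MiB and a rank-64 adapter takes 320 MiB, so the number of adapters that can stay resident scales inversely with rank and with the number of target modules.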

The S-LoRA (Serving Thousands of Concurrent LoRA Adapters) system[4] studies serving many concurrent adapters while keeping base weights shared. The hard part isn't only caching adapters; it is executing requests with different adapters in shared serving steps. Multi-LoRA engines need runtime support that maps each request to its adapter while preserving base-model sharing.[5]

Punica introduces Segmented Gather Matrix-Vector multiplication (SGMV) for batched LoRA serving and evaluates mixed-adapter overhead.[6] Production engines such as vLLM expose LoRA serving configuration including enabled adapters, resident adapter limits, and maximum supported rank.[5] Benchmark your actual adapter mix and hardware before treating the paper result as a fleet capacity plan.

Analogy: the shared fulfillment line

Imagine a fulfillment line that uses one fixed conveyor but selects an approved merchant rule card at each station. The conveyor (the base model) stays shared. The rule card (the adapter) is small, but the station must still verify which merchant is authorized to use it.

The platform stores adapters in an object-storage registry such as S3 (Simple Storage Service) and loads them into GPU memory on demand. Frequently used adapters stay in an LRU cache on the GPU; rarely used ones are evicted to host RAM or disk.

Figure: LoRA adapter routing for multi-tenant serving: tenant requests resolve small adapter weights from a registry, hot adapters stay in GPU cache, and all tenants share one frozen base model.
LoRA adapter routing keeps the base model shared while tenant-specific adapters, cache policy, and billing context remain tenant-scoped.

The conceptual lifecycle below shows how the adapter manager takes a request, resolves the correct adapter, and runs inference. In practice, specialized kernels (like the ones in S-LoRA) batch the adapter computation alongside the base forward pass, so selecting a different adapter per request adds little overhead once the adapter is resident:

lora-adapter-routing.py
    from collections import OrderedDict
    from dataclasses import dataclass

    @dataclass
    class Request:
        tenant_id: str
        requested_adapter: str
        prompt: str

    @dataclass
    class Response:
        text: str

    @dataclass
    class AdapterWeights:
        adapter_id: str

    class AdapterStore:
        def download(self, adapter_id: str) -> AdapterWeights:
            print(f"Loading adapter {adapter_id} into GPU cache")
            return AdapterWeights(adapter_id)

    class BaseModel:
        def generate(self, request: Request, adapter: AdapterWeights) -> Response:
            token_count = len(request.prompt.split())
            return Response(
                f"tenant={request.tenant_id} adapter={adapter.adapter_id} "
                f"prompt_tokens={token_count}"
            )

    class LoRAAdapterManager:
        """Routes only authorized adapters on a shared base model."""
        def __init__(self, max_hot_adapters: int = 2):
            self.base_model = BaseModel()
            self.adapter_cache: OrderedDict[str, AdapterWeights] = OrderedDict()
            self.adapter_store = AdapterStore()
            self.max_hot_adapters = max_hot_adapters
            self.authorized_adapter = {
                "merchant-a": "returns-v2",
                "merchant-b": "warehouse-v7",
                "merchant-c": "fraud-v1",
            }

        def serve(self, request: Request) -> Response:
            if self.authorized_adapter.get(request.tenant_id) != request.requested_adapter:
                raise PermissionError("adapter is not authorized for tenant")
            adapter_id = f"{request.tenant_id}/{request.requested_adapter}"

            if adapter_id not in self.adapter_cache:
                if len(self.adapter_cache) >= self.max_hot_adapters:
                    evicted_id, _ = self.adapter_cache.popitem(last=False)
                    print(f"Evicting adapter {evicted_id}")
                adapter_weights = self.adapter_store.download(adapter_id)
                self.adapter_cache[adapter_id] = adapter_weights

            self.adapter_cache.move_to_end(adapter_id)
            return self.base_model.generate(request, self.adapter_cache[adapter_id])

    manager = LoRAAdapterManager(max_hot_adapters=2)
    for request in [
        Request("merchant-a", "returns-v2", "draft a return label"),
        Request("merchant-b", "warehouse-v7", "summarize the package status"),
        Request("merchant-c", "fraud-v1", "review this payment dispute quickly"),
    ]:
        response = manager.serve(request)

    print(response.text)
    print("hot adapters:", list(manager.adapter_cache))
    try:
        manager.serve(Request("merchant-a", "fraud-v1", "use another policy"))
    except PermissionError as error:
        print("blocked:", error)
Output
    Loading adapter merchant-a/returns-v2 into GPU cache
    Loading adapter merchant-b/warehouse-v7 into GPU cache
    Evicting adapter merchant-a/returns-v2
    Loading adapter merchant-c/fraud-v1 into GPU cache
    tenant=merchant-c adapter=merchant-c/fraud-v1 prompt_tokens=5
    hot adapters: ['merchant-b/warehouse-v7', 'merchant-c/fraud-v1']
    blocked: adapter is not authorized for tenant

Adapter loading from object storage to GPU is fast for small adapters but still measurable. If more adapters are in active rotation than the cache can hold (for example, Merchant A and Merchant B alternating on a cache with one free slot), the cache thrashes and latency spikes. Profile your actual adapter reuse patterns before assuming LRU is sufficient. Some platforms pin high-tier adapters permanently and only evict best-effort ones.
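The pinning policy mentioned above can be sketched as an LRU cache that skips pinned entries when choosing a victim. This is a minimal illustration: the class name `PinnedLRUCache`, the adapter ids, and the string placeholder for weights are all hypothetical.

```python
from collections import OrderedDict
from typing import Optional

class PinnedLRUCache:
    """LRU cache of adapter ids where pinned ids are never chosen for eviction."""

    def __init__(self, capacity: int, pinned: set[str]):
        self.capacity = capacity
        self.pinned = pinned
        self.entries: OrderedDict[str, str] = OrderedDict()

    def touch(self, adapter_id: str) -> Optional[str]:
        """Mark adapter_id as used; return the id evicted to make room, if any."""
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            # The least recently used unpinned entry is the victim.
            evicted = next((k for k in self.entries if k not in self.pinned), None)
            if evicted is None:
                raise RuntimeError("cache is full of pinned adapters")
            del self.entries[evicted]
        self.entries[adapter_id] = "weights"  # placeholder for loaded adapter weights
        return evicted

cache = PinnedLRUCache(capacity=2, pinned={"enterprise/returns-v2"})
cache.touch("enterprise/returns-v2")
cache.touch("starter/faq-v1")
evicted = cache.touch("starter/faq-v2")

print("evicted:", evicted)
print("resident:", list(cache.entries))
```

Even though the pinned adapter is the oldest entry, the best-effort adapter is evicted instead. The trade-off is that pinned adapters permanently reduce the capacity left for everyone else, so the pinned set needs its own budget.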

How we keep conversations separate: KV cache isolation

Every token the model generates relies on the KV cache, which stores intermediate key and value vectors from earlier tokens. Without it, autoregressive decoding repeats work. In a multi-tenant system, KV allocation is also sensitive state: the runtime must not attach Merchant A's live blocks or cache entries to Merchant B's request.

The memory cost in concrete numbers

KV-cache memory per request grows with context length. For a decoder using fixed-width KV heads, a useful estimate is:

$$
\begin{aligned}
\text{KV memory per request} = {} & 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \\
& \times \text{seq\_len} \times \text{dtype\_bytes}
\end{aligned}
$$

The factor of 2 counts keys and values. The other terms are model depth, KV heads, head dimension, request context length, and bytes per element. Grouped-query attention (GQA) stores fewer KV heads than attention heads, which is why the n_kv_heads term matters.

Before you memorize the symbols, calculate a concrete scenario. Suppose a model has 80 layers, 8 KV heads, head dimension 128, a 4,000-token context, and FP16 (2 bytes per element):

$$
2 \times 80 \times 8 \times 128 \times 4{,}000 \times 2 = 1{,}310{,}720{,}000 \text{ bytes} \approx 1.22 \text{ GiB per request}
$$

If the same model family used 16 KV heads instead of 8, that doubles to about 2.44 GiB. At high concurrency, this memory can become the admission constraint before compute throughput does.

kv-cache-budget.py
    def kv_gib(layers: int, kv_heads: int, head_dim: int, tokens: int, dtype_bytes: int = 2) -> float:
        bytes_used = 2 * layers * kv_heads * head_dim * tokens * dtype_bytes
        return bytes_used / (1024 ** 3)

    request_gib = kv_gib(layers=80, kv_heads=8, head_dim=128, tokens=4_000)
    double_heads_gib = kv_gib(layers=80, kv_heads=16, head_dim=128, tokens=4_000)

    assert round(request_gib, 2) == 1.22
    assert round(double_heads_gib, 2) == 2.44
    print("eight_kv_heads_gib:", round(request_gib, 2))
    print("sixteen_kv_heads_gib:", round(double_heads_gib, 2))
Output
    eight_kv_heads_gib: 1.22
    sixteen_kv_heads_gib: 2.44
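The same formula can be inverted to estimate how many full-length requests fit in a given KV budget, which is what turns memory into an admission constraint. The 40 GiB budget below is a hypothetical figure for the memory left after weights and activations, not a property of any particular GPU.

```python
def kv_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int, dtype_bytes: int = 2) -> int:
    """KV-cache bytes for one request: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

def max_concurrent_requests(kv_budget_gib: float, per_request_bytes: int) -> int:
    """Whole requests that fit in the budget, ignoring block fragmentation."""
    return int(kv_budget_gib * 1024**3 // per_request_bytes)

# Hypothetical budget: 40 GiB of accelerator memory reserved for KV cache.
with_8_heads = max_concurrent_requests(40.0, kv_bytes(80, 8, 128, 4_000))
with_16_heads = max_concurrent_requests(40.0, kv_bytes(80, 16, 128, 4_000))

print("max_requests_8_kv_heads:", with_8_heads)
print("max_requests_16_kv_heads:", with_16_heads)
```

With these inputs the budget holds 32 requests at 8 KV heads and 16 at 16 KV heads. Real requests rarely all reach the maximum context, so this is a worst-case bound rather than a typical concurrency figure.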

PagedAttention: paging for GPUs

PagedAttention (introduced in the vLLM paper[7]) treats KV storage in blocks rather than reserving one long contiguous chunk for each request. A block table maps each request's logical sequence to physical blocks. The example figure uses 16-token blocks to make the mapping visible; production block sizes are runtime configuration and performance choices.

In that illustration, Merchant A's 47-token conversation occupies three physical blocks and Merchant B's 31-token conversation occupies two. PagedAttention improves memory allocation efficiency; it does not by itself enforce tenant authorization. The serving layer must associate request ownership with block tables, invalidate released references, and implement any required clearing policy before reallocation.

Figure: PagedAttention KV-cache diagram where Merchant A and Merchant B use separate logical sequences and block tables, while platform policy must enforce access to physical GPU KV blocks.
PagedAttention packs KV memory into blocks; the platform must separately bind block-table access to each tenant and clear or invalidate released state.
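The block counts in the illustration, and the ownership check the serving layer has to add on top, can be sketched with a toy registry. `BlockTableRegistry` and its methods are hypothetical names for this example; they do not correspond to any runtime's API.

```python
import math

BLOCK_TOKENS = 16  # matches the illustration; real block sizes are runtime configuration

class BlockTableRegistry:
    """Toy allocator: maps each request to physical block ids and checks tenant ownership."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables: dict[str, tuple[str, list[int]]] = {}

    def allocate(self, tenant: str, request_id: str, tokens: int) -> list[int]:
        needed = math.ceil(tokens / BLOCK_TOKENS)
        if needed > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop(0) for _ in range(needed)]
        self.tables[request_id] = (tenant, blocks)
        return blocks

    def lookup(self, tenant: str, request_id: str) -> list[int]:
        owner, blocks = self.tables[request_id]
        if owner != tenant:
            raise PermissionError("block table belongs to another tenant")
        return blocks

    def release(self, tenant: str, request_id: str) -> None:
        blocks = self.lookup(tenant, request_id)
        del self.tables[request_id]  # invalidate the reference before reuse
        self.free.extend(blocks)     # a real runtime would apply its clearing policy here

registry = BlockTableRegistry(num_blocks=8)
blocks_a = registry.allocate("merchant-a", "req-a", tokens=47)
blocks_b = registry.allocate("merchant-b", "req-b", tokens=31)
print("merchant_a_blocks:", blocks_a)
print("merchant_b_blocks:", blocks_b)
try:
    registry.lookup("merchant-b", "req-a")
except PermissionError as error:
    print("blocked:", error)
```

The 47-token and 31-token conversations round up to three and two blocks. The ownership check lives in the registry, not in the paging mechanism, which mirrors the point that paging alone does not enforce authorization.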

Prefix caching and the cross-tenant leak risk

Multi-tenant traffic often repeats the same system prompt, tool schema, or long retrieved prefix. Runtimes can cache those KV blocks and skip recomputing the shared prefix on later requests. SGLang introduced RadixAttention for this pattern, while vLLM's automatic prefix caching uses hashed KV blocks rather than a radix tree.[8][9]

Prefix caching mainly lowers TTFT (Time to First Token) because it eliminates repeated prefill work. It doesn't make decode itself cheaper.

The critical rule is isolation. Reuse prefixes only inside an authorized cache namespace, such as one tenant or an explicitly public shared prompt. vLLM documents an optional cache salt intended to isolate cache reuse across trust groups and mitigate timing-based probing; platform routing must also keep adapter and model compatibility consistent.[9]

Enabling private-prefix reuse without a tenant or trust-group namespace is an isolation bug, not only a performance mistake. Scope private cache reuse by tenant or authorized trust group and compatible model, tokenizer, and adapter configuration.

tenant-scoped-prefix-cache.py
    def cache_key(trust_group: str, model: str, adapter: str, tokenizer: str, prefix: str) -> tuple[str, ...]:
        return trust_group, model, adapter, tokenizer, prefix

    prompt = "You are the returns assistant."
    cache = {
        cache_key("tenant:merchant-a", "base-v3", "returns-v2", "tok-v3", prompt): "kv-7"
    }

    same_tenant = cache_key("tenant:merchant-a", "base-v3", "returns-v2", "tok-v3", prompt)
    other_tenant = cache_key("tenant:merchant-b", "base-v3", "returns-v2", "tok-v3", prompt)

    assert cache.get(same_tenant) == "kv-7"
    assert cache.get(other_tenant) is None
    print("authorized_hit:", same_tenant in cache)
    print("cross_tenant_hit:", other_tenant in cache)
Output
    authorized_hit: True
    cross_tenant_hit: False
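Block-level prefix caches typically key each block on its own tokens plus everything before it, so a namespace salt at the start of the chain separates all downstream blocks. The sketch below illustrates that general idea with SHA-256 and a tiny block size; it is not vLLM's implementation, and `block_hashes` and the namespace strings are assumptions made for this example.

```python
import hashlib

BLOCK_TOKENS = 4  # tiny block size so the example stays readable

def block_hashes(token_ids: list[int], namespace: str) -> list[str]:
    """Hash each full block together with its parent hash; the namespace seeds the chain."""
    hashes: list[str] = []
    parent = namespace  # different namespaces start different chains
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS  # partial tail block is not cached
    for start in range(0, full, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        digest = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes

first = block_hashes([1, 2, 3, 4, 5, 6, 7, 8], "tenant:merchant-a")
second = block_hashes([1, 2, 3, 4, 9, 9, 9, 9], "tenant:merchant-a")
other = block_hashes([1, 2, 3, 4, 5, 6, 7, 8], "tenant:merchant-b")

print("same_tenant_shared_first_block:", first[0] == second[0])
print("same_tenant_shared_second_block:", first[1] == second[1])
print("cross_tenant_shared_first_block:", first[0] == other[0])
```

Within one namespace, requests reuse blocks up to the point where their tokens diverge. Across namespaces, identical tokens produce different keys, so there is no shared entry to hit or to probe by timing.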

Chunked prefill for multi-tenant fairness

The prefill phase processes the input prompt to build the initial KV cache. It's compute-intensive and produces at most the first output token. The decode phase generates the remaining output tokens one at a time. It's limited by memory bandwidth more than by compute.

Without scheduling controls, a long prefill from one merchant can delay decode operations from others. Chunked prefill (studied in Sarathi-Serve[10]) breaks long prompts into bounded chunks so decode work from other requests can be scheduled between prefill chunks.

The chunk size is a tuning knob, not a fixed constant. It is controlled by the runtime's per-step token budget (max_num_batched_tokens in vLLM). Smaller budgets lower inter-token latency for in-flight decodes; larger budgets improve TTFT and prefill throughput. Current vLLM docs show chunked prefill enabled by default in V1, example low-latency settings like 2,048 tokens, and throughput-oriented settings above 8,192 tokens, so tune it to your latency target instead of copying one number blindly.[10][11]

In multi-tenant settings, chunked prefill is one useful noisy neighbor control. It still needs admission limits and a tenant-aware scheduler; chunking alone does not guarantee fair service.
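The per-step token budget described above can be sketched as a simple planner that reserves one token per in-flight decode and gives the rest of each step to the long prefill. This simplified model assumes the number of active decodes stays constant and that only one prefill is pending; `plan_steps` is a hypothetical helper, not a runtime API.

```python
def plan_steps(prompt_tokens: int, active_decodes: int, token_budget: int) -> list[dict]:
    """Split one long prefill into per-step chunks, reserving one token per active decode."""
    if token_budget <= active_decodes:
        raise ValueError("budget must leave room for prefill tokens")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, token_budget - active_decodes)
        steps.append({"decode_tokens": active_decodes, "prefill_tokens": chunk})
        remaining -= chunk
    return steps

steps = plan_steps(prompt_tokens=5_000, active_decodes=48, token_budget=2_048)
print("steps:", len(steps))
print("prefill_chunks:", [step["prefill_tokens"] for step in steps])
```

With a 2,048-token budget and 48 decodes, the 5,000-token prompt takes three steps, and every one of those steps still advances all 48 decodes by one token. A smaller budget would spread the prefill over more steps, which lowers the latency of each step but delays the long request's first token.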

Tenant-aware preemption

When admitted work approaches its KV budget, the scheduler may reject, queue, or preempt requests according to the published service policy. A priority tier can permit an enterprise request to displace best-effort work, but the choice must be metered and observable rather than hidden.

Modern runtimes often prefer recomputation over CPU swap because host-memory transfers can cost more than rebuilding the evicted prefix. Swap is still useful when recomputation isn't supported or would discard too much work.[12]

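The conceptual scheduler below demonstrates the decision logic. It sorts running requests by priority, then by KV footprint, then by tokens already generated. If a new request has higher priority than the lowest-priority running request, the scheduler preempts the victim and schedules the newcomer. The sketch evicts a single victim and does not re-check that the freed memory covers the new request; a production scheduler would keep evicting, or reject the request, until the budget is satisfied: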

tenant-aware-preemption.py
    from dataclasses import dataclass
    from typing import Protocol

    class GPUAllocator(Protocol):
        def get_num_free_blocks(self) -> int: ...

    @dataclass
    class Tenant:
        name: str
        priority: int  # Larger number = higher priority

    @dataclass
    class Request:
        tenant: Tenant
        estimated_kv_memory: int
        tokens_generated: int
        can_recompute: bool = True

    class TenantAwareScheduler:
        def __init__(self, gpu_allocator: GPUAllocator, block_size_mb: int):
            self.running_requests: list[Request] = []
            self.gpu_allocator = gpu_allocator
            self.block_size_mb = block_size_mb

        def available_kv_memory(self) -> int:
            return self.gpu_allocator.get_num_free_blocks() * self.block_size_mb

        def evict_for_recompute(self, request: Request) -> None:
            print(
                f"Evicting KV cache for tenant={request.tenant.name} "
                f"priority={request.tenant.priority}; "
                "the request will be recomputed if resumed."
            )

        def swap_to_cpu(self, request: Request) -> None:
            print(
                f"Swapping KV cache for tenant={request.tenant.name} "
                f"priority={request.tenant.priority} "
                "to host memory."
            )

        def schedule(self, request: Request) -> None:
            self.running_requests.append(request)
            print(
                f"Scheduling request tenant={request.tenant.name} "
                f"priority={request.tenant.priority}."
            )

        def preempt(self, request: Request) -> None:
            if request.can_recompute:
                self.evict_for_recompute(request)
            else:
                self.swap_to_cpu(request)

        def preempt_if_needed(self, new_request: Request) -> None:
            if self.available_kv_memory() >= new_request.estimated_kv_memory:
                self.schedule(new_request)
                return

            candidates = sorted(
                self.running_requests,
                key=lambda r: (
                    r.tenant.priority,       # Lowest priority first
                    -r.estimated_kv_memory,  # Free the biggest KV footprint first
                    r.tokens_generated,      # Prefer to kill work that has done less decode
                ),
            )

            if not candidates:
                print("No running requests to preempt.")
                return

            victim = candidates[0]
            if new_request.tenant.priority > victim.tenant.priority:
                self.preempt(victim)
                self.running_requests.remove(victim)
                self.schedule(new_request)
            else:
                print("Cannot preempt a higher or equal priority request.")

    class FakeAllocator:
        def __init__(self, free_blocks: int):
            self.free_blocks = free_blocks

        def get_num_free_blocks(self) -> int:
            return self.free_blocks

    scheduler = TenantAwareScheduler(FakeAllocator(free_blocks=4), block_size_mb=16)
    scheduler.running_requests = [
        Request(Tenant("starter", priority=1), estimated_kv_memory=96, tokens_generated=8),
        Request(Tenant("business", priority=2), estimated_kv_memory=80, tokens_generated=120),
    ]

    incoming = Request(Tenant("enterprise", priority=4), estimated_kv_memory=80, tokens_generated=0)
    scheduler.preempt_if_needed(incoming)
    print("running tenants:", [request.tenant.name for request in scheduler.running_requests])
Output
    Evicting KV cache for tenant=starter priority=1; the request will be recomputed if resumed.
    Scheduling request tenant=enterprise priority=4.
    running tenants: ['business', 'enterprise']

Hard per-tenant limits

Scheduling alone isn't enough. The platform also enforces hard quotas on context sizes and concurrent requests based on the merchant's tier. The illustration below summarizes the isolation spectrum from shared pools to dedicated hardware:

Figure: Tenant isolation strategies: shared pool with noisy-neighbor risk, namespace isolation with separate queues and KV accounting, and dedicated GPU pools with stronger runtime boundaries.
Isolation is a spectrum. Shared pools are cheapest, namespace isolation adds per-tenant control, and dedicated GPU pools give the strongest boundary for regulated workloads.

Example admission policy (numbers are scenario inputs, not universal tiers):

| Tenant Tier | Max Concurrent Requests | Max Context Length (tokens) | KV Cache Budget |
|---|---|---|---|
| Enterprise | 50 | 8K | 256 GB |
| Business | 20 | 4K | 80 GB |
| Starter | 5 | 2K | 20 GB |

Apply those limits before scheduling GPU work:

admit-under-tenant-kv-budget.py
    TIERS = {
        "enterprise": {"max_context": 8_000, "kv_gib": 256.0},
        "starter": {"max_context": 2_000, "kv_gib": 20.0},
    }

    def admit(tier: str, context_tokens: int, projected_kv_gib: float) -> str:
        policy = TIERS[tier]
        if context_tokens > policy["max_context"]:
            return "REJECT_CONTEXT_LIMIT"
        if projected_kv_gib > policy["kv_gib"]:
            return "REJECT_KV_BUDGET"
        return "ADMIT"

    assert admit("starter", 2_400, 3.0) == "REJECT_CONTEXT_LIMIT"
    assert admit("starter", 1_900, 22.0) == "REJECT_KV_BUDGET"
    assert admit("enterprise", 7_500, 180.0) == "ADMIT"
    print("starter_long_prompt:", admit("starter", 2_400, 3.0))
    print("enterprise_request:", admit("enterprise", 7_500, 180.0))
Output
    starter_long_prompt: REJECT_CONTEXT_LIMIT
    enterprise_request: ADMIT
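The concurrent-request column of the example policy needs state that is acquired at admission and released at completion. A minimal single-process sketch, assuming a hypothetical `ConcurrencyGate` class and the limits from the example table, might look like this; a multi-node gateway would need a shared counter instead.

```python
from collections import Counter

# Limits taken from the example admission policy; they are scenario inputs.
MAX_CONCURRENT = {"enterprise": 50, "business": 20, "starter": 5}

class ConcurrencyGate:
    """Tracks in-flight requests per tenant (single process, not thread-safe)."""

    def __init__(self) -> None:
        self.in_flight: Counter = Counter()

    def try_acquire(self, tenant_id: str, tier: str) -> bool:
        if self.in_flight[tenant_id] >= MAX_CONCURRENT[tier]:
            return False
        self.in_flight[tenant_id] += 1
        return True

    def release(self, tenant_id: str) -> None:
        # Call when a request finishes, fails, or is cancelled.
        if self.in_flight[tenant_id] > 0:
            self.in_flight[tenant_id] -= 1

gate = ConcurrencyGate()
results = [gate.try_acquire("merchant-a", "starter") for _ in range(6)]
print("six_attempts:", results)

gate.release("merchant-a")
print("after_release:", gate.try_acquire("merchant-a", "starter"))
```

The sixth attempt is rejected until one in-flight request releases its slot. Releasing on every exit path, including errors and client disconnects, matters as much as the check itself, because a leaked slot permanently lowers the tenant's effective limit.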

How we prevent one merchant from overwhelming the rest: rate limiting and fair queues

Rate limiting sits at the gateway, before a request ever reaches the GPU. It enforces two distinct budgets:

  • Requests per minute (RPM): Controls burst traffic to protect the API gateway from connection exhaustion.
  • Tokens per minute (TPM): Controls sustained throughput to protect GPU compute capacity.

The difference matters. A merchant sending one request with a 64K prompt consumes far more GPU time than a merchant sending one hundred requests with 100-token prompts, even though the first merchant sends far fewer requests. An RPM limit alone would admit the 64K prompt and let it monopolize the KV cache.

Distributed sliding-window enforcement

For RPM, a distributed sliding-window limiter using Redis with a Lua script gives consistent enforcement across all gateway nodes. A local in-memory limiter isn't enough because requests are load-balanced across many gateway instances.

The Lua script below removes entries older than the window, counts the remaining requests, and either allows the new request or rejects it. The key detail is using a unique sorted-set member (a request ID with a timestamp) instead of the raw timestamp alone. If two requests land in the same clock tick and you use the timestamp as both score and member, Redis collapses them into one entry and undercounts traffic:

distributed-sliding-window-enforcement.lua
```lua
-- Redis Lua Script for RPM Sliding-Window Limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2]) -- e.g., 60_000
local now_ms = tonumber(ARGV[3])
local member = ARGV[4] -- unique request id, e.g. "1713468123456:req-9f3c"

-- Remove timestamped entries older than the window
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window_ms)

-- Count current requests
local count = redis.call('ZCARD', key)

if count < limit then
  redis.call('ZADD', key, now_ms, member)
  redis.call('PEXPIRE', key, window_ms)
  return 1 -- Allowed
else
  return 0 -- Rejected
end
```
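To see why the unique member matters, the sketch below imitates a sorted set with a plain Python dictionary. It is not Redis, only a stand-in that shares one property with it: adding a member that already exists updates its score instead of creating a second entry. The `zadd` helper and the request ids are hypothetical names for illustration.

```python
# Minimal stand-in for a Redis sorted set: member -> score.
# Like Redis, adding an existing member only updates its score.
def zadd(zset: dict[str, int], score: int, member: str) -> None:
    zset[member] = score

now_ms = 1_713_468_123_456  # two requests arrive in the same millisecond

# Timestamp used as both score and member: the second add overwrites the first.
by_timestamp: dict[str, int] = {}
zadd(by_timestamp, now_ms, str(now_ms))
zadd(by_timestamp, now_ms, str(now_ms))

# Unique member per request (timestamp plus a hypothetical request id).
by_request_id: dict[str, int] = {}
zadd(by_request_id, now_ms, f"{now_ms}:req-a")
zadd(by_request_id, now_ms, f"{now_ms}:req-b")

print("counted_with_timestamp_member:", len(by_timestamp))
print("counted_with_unique_member:", len(by_request_id))
```

The first set holds one entry for two requests, so a limiter built on it would undercount. The second set holds two.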

TPM is trickier because you don't know the final output length at admission time. In practice, reserve a budget based on prompt tokens plus max_output_tokens, then reconcile the counter with actual usage when the stream finishes.

For high-throughput services, strictly synchronized Redis limits can become a bottleneck. A platform may choose bounded burst allowance or approximate local counters, but that weakens strict limit semantics and must be documented and measured.

The sketch below shows that reserve-then-reconcile flow: hold prompt plus maximum allowed output before execution, then release the unused output capacity after the stream completes:

reserve-and-reconcile-token-budget.py
```python
class TokenBudget:
    def __init__(self, remaining: int):
        self.remaining = remaining

    def reserve(self, prompt_tokens: int, max_output_tokens: int) -> int:
        reservation = prompt_tokens + max_output_tokens
        if reservation > self.remaining:
            raise ValueError("TPM budget exceeded")
        self.remaining -= reservation
        return reservation

    def reconcile(self, reservation: int, prompt_tokens: int, output_tokens: int) -> None:
        self.remaining += reservation - (prompt_tokens + output_tokens)

budget = TokenBudget(remaining=1_000)
held = budget.reserve(prompt_tokens=300, max_output_tokens=400)
budget.reconcile(held, prompt_tokens=300, output_tokens=120)

assert budget.remaining == 580
print("tokens_remaining_after_actual_usage:", budget.remaining)
```
Output
```text
tokens_remaining_after_actual_usage: 580
```

Fairness inside the scheduler

RPM and TPM protect the gateway edge, but they don't fully solve scheduler fairness inside the serving engine. A merchant with one 64K prompt can consume far more GPU time than dozens of merchants sending short chat turns.

Inside the runtime, keep per-tenant queues and charge a virtual token budget for every admitted prefill chunk and every decode step. Then schedule by priority tier plus virtual finish time, not raw request count. That gives each merchant forward progress while still letting higher-SLA traffic buy more share.
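One way to picture virtual finish time is a small self-clocked fair queue. The sketch below is an illustration under stated assumptions, not a description of any particular serving engine: the `FairScheduler` class, the tier weights, and the chunk sizes are all hypothetical. Each chunk gets a finish tag equal to its start tag plus its token cost divided by the tier weight, and the scheduler always runs the chunk with the smallest finish tag.

```python
import heapq

# Hypothetical tier weights: a larger weight buys a larger share of tokens.
WEIGHTS = {"enterprise": 4.0, "starter": 1.0}

class FairScheduler:
    def __init__(self) -> None:
        self.virtual_time = 0.0
        self.last_finish: dict[str, float] = {}
        self.heap: list[tuple[float, int, str, int]] = []
        self.seq = 0  # tie-breaker for chunks with equal finish tags

    def enqueue(self, tenant: str, tier: str, tokens: int) -> None:
        # A tenant's new chunk starts after its own backlog, never in the past.
        start = max(self.virtual_time, self.last_finish.get(tenant, 0.0))
        finish = start + tokens / WEIGHTS[tier]
        self.last_finish[tenant] = finish
        heapq.heappush(self.heap, (finish, self.seq, tenant, tokens))
        self.seq += 1

    def next_chunk(self) -> tuple[str, int]:
        finish, _, tenant, tokens = heapq.heappop(self.heap)
        self.virtual_time = finish  # the clock follows the chunk in service
        return tenant, tokens

sched = FairScheduler()
for _ in range(4):  # one long prompt split into four prefill chunks
    sched.enqueue("long", "starter", 2_048)
sched.enqueue("chat-a", "starter", 128)
sched.enqueue("chat-b", "enterprise", 128)

first_three = [sched.next_chunk()[0] for _ in range(3)]
sched.enqueue("chat-a", "starter", 128)  # a new short turn arrives mid-prefill
after_new_turn = sched.next_chunk()[0]

print("first_three:", first_three)
print("after_new_turn:", after_new_turn)
```

The short chat turns run ahead of the long prompt's backlog even though they were enqueued later, and a new short turn can slip in between the long prompt's prefill chunks. The long prompt still makes forward progress because its chunks keep their place in virtual time.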

Do not collapse rate limiting and quota management into one counter. Rate limits bound bursts and protect the system, while quotas cap total usage and protect the budget. A merchant can stay under their RPM limit and still burn through their monthly token quota in one afternoon.
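A minimal sketch of keeping the two counters separate, assuming a hypothetical `TenantLimits` record with made-up limits. The merchant below never sends more than 10 requests per minute against a limit of 60, yet still runs out of monthly quota within six minutes.

```python
from dataclasses import dataclass

@dataclass
class TenantLimits:
    rpm_limit: int              # burst protection, resets every minute
    monthly_token_quota: int    # budget protection, resets every month
    requests_this_minute: int = 0
    tokens_this_month: int = 0

    def start_new_minute(self) -> None:
        self.requests_this_minute = 0

    def check(self, tokens: int) -> str:
        if self.requests_this_minute + 1 > self.rpm_limit:
            return "REJECT_RATE_LIMIT"
        if self.tokens_this_month + tokens > self.monthly_token_quota:
            return "REJECT_QUOTA"
        self.requests_this_minute += 1
        self.tokens_this_month += tokens
        return "ADMIT"

limits = TenantLimits(rpm_limit=60, monthly_token_quota=1_000_000)
outcomes = []
for minute in range(6):
    limits.start_new_minute()
    for _ in range(10):  # well under the RPM limit
        outcomes.append(limits.check(tokens=20_000))

print("admitted:", outcomes.count("ADMIT"))
print("rejected_for_quota:", outcomes.count("REJECT_QUOTA"))
print("rejected_for_rate:", outcomes.count("REJECT_RATE_LIMIT"))
```

Distinct rejection reasons also let the gateway tell the merchant whether to retry in a few seconds or to upgrade their plan.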

How we keep data private: the isolation stack

Every state-bearing layer needs an authorization boundary and a testable release policy. A cross-tenant retrieval, adapter, cache, or KV access is a security incident even if the other layers behaved correctly.

The RAG relevance vs. authorization gap

Many multi-tenant platforms augment LLMs with retrieval (RAG). A vector database finds the most relevant documents for a query. In multi-tenancy, "relevant" doesn't mean "allowed."

Imagine Merchant X searches for "best shipping carrier rates." The vector DB might find a highly relevant internal contract that belongs to Merchant Y, because both merchants ship packages and the embeddings overlap. Without a hard filter, the LLM could summarize Merchant Y's confidential rates and return them to Merchant X.

The fix is authorization filtering in the retrieval operation itself. An application should pass authorized scope into the database query, and tests should fail if another tenant's result can cross that boundary:

tenant-filtered-retrieval.py
```python
documents = [
    {"tenant": "merchant-a", "text": "FastShip discount tier A", "score": 0.88},
    {"tenant": "merchant-b", "text": "FastShip confidential tier B", "score": 0.99},
]

def authorized_search(tenant: str, top_k: int) -> list[str]:
    allowed = [doc for doc in documents if doc["tenant"] == tenant]
    ranked = sorted(allowed, key=lambda doc: doc["score"], reverse=True)
    return [doc["text"] for doc in ranked[:top_k]]

results = authorized_search("merchant-a", top_k=1)
assert results == ["FastShip discount tier A"]
assert all("tier B" not in text for text in results)
print("authorized_results:", results)
```
Output
```text
authorized_results: ['FastShip discount tier A']
```

The authorization predicate is applied before candidate results leave the data layer. Post-filtering after a broad top-K query can pull foreign identifiers, scores, or content into application memory, and it can also return nothing when every top-K slot was taken by another tenant's documents.
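The difference between the two orderings is easy to reproduce with a toy corpus. This sketch uses an in-memory list in place of a vector database, and the `corpus`, `post_filtered`, and `pre_filtered` names are hypothetical. It only illustrates the ordering of the filter and the top-K cut.

```python
corpus = [
    {"tenant": "merchant-b", "text": "B contract 1", "score": 0.99},
    {"tenant": "merchant-b", "text": "B contract 2", "score": 0.97},
    {"tenant": "merchant-a", "text": "A rate card", "score": 0.88},
]

def by_score(doc: dict) -> float:
    return doc["score"]

def post_filtered(tenant: str, top_k: int) -> list[str]:
    # Broad top-K first: foreign rows are already in application memory here.
    top = sorted(corpus, key=by_score, reverse=True)[:top_k]
    return [doc["text"] for doc in top if doc["tenant"] == tenant]

def pre_filtered(tenant: str, top_k: int) -> list[str]:
    # Tenant predicate first, then ranking over authorized rows only.
    allowed = [doc for doc in corpus if doc["tenant"] == tenant]
    top = sorted(allowed, key=by_score, reverse=True)[:top_k]
    return [doc["text"] for doc in top]

print("post_filtered:", post_filtered("merchant-a", top_k=2))
print("pre_filtered:", pre_filtered("merchant-a", top_k=2))
```

With `top_k=2`, the post-filtered search returns an empty list because both slots went to another tenant's documents, while the pre-filtered search still finds the authorized rate card.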

Model and prompt isolation

  • LoRA adapter isolation. Authorize adapter IDs against the tenant before loading encrypted artifacts. If adapters contain sensitive tenant tuning, include residency and release rules in the data contract.
  • Prompt isolation. Keep raw prompts out of plaintext logs and authenticate/encrypt gateway-to-worker transport. The exact mTLS (mutual Transport Layer Security) and memory-boundary policy depends on the deployment threat model.
  • Harder runtime boundaries. For workloads whose threat model rules out shared workers, use dedicated node pools, MIG (Multi-Instance GPU) partitions, or VM boundaries as evaluated controls.[13] Kubernetes placement selects hardware; it doesn't create isolation by itself.

State sanitization

When a shared worker serves more than one tenant, state lifecycle rules matter:

  • KV reference lifecycle. Remove request access to released KV blocks when a sequence completes. If the threat model requires cleared memory before cross-tenant reuse, implement and verify that clearing policy rather than assuming the allocator provides it.
  • Batch construction policy. Multi-tenant batching is fine on a shared base model, but every row in the batch must keep its own tenant ID, adapter handle, KV block table, and metering context. For regulated workloads, the simpler answer is dedicated pools or MIG / VM boundaries instead of trying to harden every shared-kernel path.

PII masking as a tiered control

For regulated workloads that can tolerate redaction, requests can pass through a lightweight PII masking service (e.g., Presidio[14]) before they hit the model router. This reduces the chance that the LLM ever sees raw credit card numbers or Social Security numbers, while still letting downstream systems map placeholders back to original values when needed.

[Figure: PII and tenant-isolation stack: encrypted request, gateway auth, PII masking sidecar, tenant-filtered router, mTLS to GPU worker, and tenant-scoped KV access.]
PII masking is one control in a larger isolation stack; tenant authorization still has to hold at retrieval, routing, batching, and KV allocation.
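The placeholder round trip can be sketched with two regular expressions. This is a deliberately simple illustration and does not use the Presidio API. The patterns, the placeholder format, and the `mask` and `unmask` helpers are assumptions, and a real detector needs validation (for example Luhn checks) and much broader coverage.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b"),
}

def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with placeholders and return the reverse mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def replace(match: re.Match, label: str = label) -> str:
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(replace, text)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values; only trusted downstream systems should call this."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

raw = "Refund card 4111 1111 1111 1111 for SSN 123-45-6789."
masked, mapping = mask(raw)
print("masked:", masked)
print("restored_matches_raw:", unmask(masked, mapping) == raw)
```

The mapping is itself sensitive tenant data, so it should be stored and authorized under the same tenant boundary as the request it came from.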

How we attribute cost: per-tenant metering and chargeback

A shared fleet only stays profitable if you can answer one question per tenant: what did this merchant actually cost us, and what should we charge? Token counts alone are a weak proxy because two requests with the same token counts can consume very different GPU time depending on prompt-vs-output split, batch occupancy, preemptions, and cache hits.

A defensible metering record attaches to every request and carries: tenant_id, model and adapter version, prompt tokens, output tokens, cache-hit tokens, queue wait, prefill time, decode time, KV blocks held, preemption count, and GPU worker type. Cache-hit tokens matter because reused prefix blocks skip prefill compute. A public billing policy may choose a distinct cached-input rate, as current provider pricing documents illustrate.[15][16]

For internal cost allocation rather than customer billing, the honest unit is GPU-time, not tokens. A reasonable per-request cost estimate looks like:

$$\text{cost}_{\text{req}} \approx \text{gpu\_seconds} \times \text{node\_hourly\_rate} / 3600 + \text{adapter\_residency} + \text{storage}$$

where gpu_seconds is the request's share of busy GPU time (prefill plus its decode steps, divided by batch occupancy so shared steps are split across co-batched tenants). Charge customers on the simpler token-and-tier dimensions, but reconcile against measured GPU-time so you can spot tenants whose traffic shape (long prompts, low batchability, adapter thrash) costs far more than their token bill suggests.
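The "divided by batch occupancy" rule can be sketched as an even split of each step's duration across the requests that shared it. The step log below is invented for illustration, and an even split is an assumption: a platform could instead weight shares by tokens processed in the step.

```python
# Each step: (duration in seconds, request ids that were co-batched in that step).
steps = [
    (0.30, ["req-a"]),                    # req-a prefill runs alone
    (0.02, ["req-a", "req-b"]),           # shared decode step
    (0.02, ["req-a", "req-b", "req-c"]),  # shared decode step
    (0.02, ["req-b", "req-c"]),           # req-a has finished
]

def gpu_seconds_by_request(step_log: list[tuple[float, list[str]]]) -> dict[str, float]:
    shares: dict[str, float] = {}
    for duration, batch in step_log:
        per_request = duration / len(batch)  # split the step evenly
        for request_id in batch:
            shares[request_id] = shares.get(request_id, 0.0) + per_request
    return shares

shares = gpu_seconds_by_request(steps)
total_busy = sum(duration for duration, _ in steps)

print("gpu_seconds:", {rid: round(value, 4) for rid, value in shares.items()})
print("shares_sum_to_busy_time:", abs(sum(shares.values()) - total_busy) < 1e-9)
```

Because the shares always sum to the busy time of the worker, per-tenant costs computed this way add up to the fleet's busy GPU cost instead of double-counting shared steps.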

meter-shared-gpu-cost.py
```python
records = [
    {"tenant": "merchant-a", "gpu_seconds": 0.40, "cache_hit_tokens": 800},
    {"tenant": "merchant-b", "gpu_seconds": 1.25, "cache_hit_tokens": 0},
]
NODE_HOURLY_RATE = 8.00

def compute_dollars(record: dict[str, float]) -> float:
    return record["gpu_seconds"] * NODE_HOURLY_RATE / 3600

costs = {record["tenant"]: compute_dollars(record) for record in records}
assert costs["merchant-b"] > costs["merchant-a"]
print("metered_gpu_cost_usd:", {tenant: round(value, 6) for tenant, value in costs.items()})
```
Output
```text
metered_gpu_cost_usd: {'merchant-a': 0.000889, 'merchant-b': 0.002778}
```

How the system grows: scaling, canary, and fault tolerance

A production platform should handle spikes, deploy new models with release gates, and degrade predictably during failures.

Auto-scaling on queue depth

CPU utilization alone doesn't describe LLM serving pressure. Queue depth, admitted-token backlog, inference latency, and KV-cache utilization provide signals for capacity and admission decisions. When new workers cannot become ready in time, shed or downgrade eligible best-effort work instead of silently violating every tenant's objective.
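A minimal sketch of turning those signals into a decision. The `PoolSignals` fields, the thresholds, and the decision labels are hypothetical, and a production autoscaler would also smooth the signals over time rather than react to a single sample.

```python
from dataclasses import dataclass

@dataclass
class PoolSignals:
    queued_tokens: int           # admitted-token backlog
    tokens_per_second: float     # measured throughput of ready workers
    kv_utilization: float        # fraction of KV blocks in use, 0.0 to 1.0
    worker_ready_seconds: float  # time for a new worker to load weights

# Hypothetical objectives for illustration.
MAX_DRAIN_SECONDS = 10.0
MAX_KV_UTILIZATION = 0.90

def decide(signals: PoolSignals) -> str:
    if signals.tokens_per_second > 0:
        drain_seconds = signals.queued_tokens / signals.tokens_per_second
    else:
        drain_seconds = float("inf")
    overloaded = (
        drain_seconds > MAX_DRAIN_SECONDS
        or signals.kv_utilization > MAX_KV_UTILIZATION
    )
    if not overloaded:
        return "HOLD"
    if signals.worker_ready_seconds <= MAX_DRAIN_SECONDS:
        return "SCALE_UP"
    # New capacity arrives too late, so also shed eligible best-effort work.
    return "SCALE_UP_AND_SHED_BEST_EFFORT"

print(decide(PoolSignals(20_000, 5_000.0, 0.50, 120.0)))
print(decide(PoolSignals(100_000, 5_000.0, 0.50, 5.0)))
print(decide(PoolSignals(100_000, 5_000.0, 0.95, 120.0)))
```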

Cold-starting GPU workers includes making model weights available before the pool can serve requests. Depending on cost and latency objectives, a platform may keep warm capacity, forecast predictable demand, or reject low-priority excess load while workers start.

Model versioning and canary rollouts

Models and LoRA adapters require versioned rollout because a new artifact can degrade generation quality or latency.

In one canary rollout policy, when a merchant deploys a new adapter version (for example, v2), the router sends a controlled slice of eligible traffic to it while the rest continues on v1. Monitor quality evaluation, latency, error rates, and safety signals over a predefined observation window.

If gates pass, the router can increase exposure. If a gate fails, route new eligible requests back to v1; active streams and adapter residency still need explicit handling, so a routing change is not a blanket zero-downtime promise.
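The routing half of this policy can be sketched with a deterministic hash bucket, so the same request id always lands on the same version, plus a gate check that either widens exposure or sends new traffic back to v1. The gate names, ceilings, and doubling schedule are assumptions for illustration. As noted above, this covers only new requests, not active streams or adapter residency.

```python
import hashlib

def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def route(request_id: str, canary_percent: int) -> str:
    return "v2" if bucket(request_id) < canary_percent else "v1"

# Hypothetical gate ceilings; a real policy would also include quality and safety evals.
GATES = {"p95_latency_ms": 900.0, "error_rate": 0.01}

def next_canary_percent(current: int, observed: dict[str, float]) -> int:
    if any(observed[name] > ceiling for name, ceiling in GATES.items()):
        return 0  # gate failed: route new eligible requests back to v1
    return min(100, current * 2)

request_ids = [f"req-{i}" for i in range(1_000)]
on_canary = sum(route(rid, canary_percent=5) == "v2" for rid in request_ids)
print("requests_on_v2_at_5_percent:", on_canary)
print("after_pass:", next_canary_percent(5, {"p95_latency_ms": 700.0, "error_rate": 0.002}))
print("after_fail:", next_canary_percent(5, {"p95_latency_ms": 1_200.0, "error_rate": 0.002}))
```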

For base model updates, the process is more complex. Unlike lightweight adapters, base models require spinning up entirely new GPU worker pools. The platform routes shadow traffic (duplicate asynchronous requests) to the new base model cluster to validate correctness and measure throughput before exposing it to real merchant traffic. Once validated, the gateway shifts live traffic to the new cluster and gracefully drains the old one.

Fault tolerance

System reliability relies on handling GPU failures and model crashes gracefully:

| Strategy | Trigger Condition | Action Taken | Architectural Impact |
| --- | --- | --- | --- |
| Dead Letter Queues (DLQ) | Repeated CUDA out-of-memory (OOM) or crashes | Move a repeatedly failing request to DLQ after a bounded retry count | Stops that request from causing an unbounded retry loop |
| Circuit Breaking | Model/adapter error rate crosses a configured threshold | Fast-fail new requests for that specific adapter | Limits repeated work while operators investigate |
| Active Health Checks | Missed node heartbeats (e.g., stuck kernel) | Mark unhealthy, stop new routing, drain or fail in-flight work according to policy | Removes a suspected worker from new admission |
| Zone Redundancy | Entire Availability Zone failure | Shift eligible traffic to healthy zones, subject to spare capacity | Reduces zone-failure impact when capacity is available |

These mechanisms need a control plane or equivalent coordination layer to track worker readiness, orchestrate model deployments, and update routing. Readiness checks reduce routing to unavailable workers; they do not prove model quality or prevent every runtime failure.
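As one concrete instance of the circuit-breaking strategy, the sketch below tracks a rolling window of outcomes for a single adapter and fast-fails new requests once the error rate crosses a threshold. The class name, window size, threshold, and cooldown are hypothetical. Time is passed in explicitly so the behavior is easy to test.

```python
from collections import deque
from typing import Optional

class AdapterCircuitBreaker:
    def __init__(self, window: int = 20, max_error_rate: float = 0.5,
                 cooldown_s: float = 30.0) -> None:
        self.results: deque = deque(maxlen=window)  # True = success
        self.max_error_rate = max_error_rate
        self.cooldown_s = cooldown_s
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_s:
            # Cooldown over: close the breaker and start a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False  # fast-fail while the breaker is open

    def record(self, ok: bool, now: float) -> None:
        self.results.append(ok)
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        if window_full and error_rate >= self.max_error_rate:
            self.opened_at = now

breaker = AdapterCircuitBreaker(window=4, max_error_rate=0.5, cooldown_s=30.0)
for t, ok in enumerate([True, False, True, False]):
    breaker.record(ok, now=float(t))

print("allowed_at_10s:", breaker.allow(now=10.0))
print("allowed_at_33s:", breaker.allow(now=33.0))
```

Keeping one breaker per adapter means a broken adapter version fast-fails on its own while other tenants on the same base model keep serving.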

Try it yourself

Here are three practice levels you can build to internalize the concepts:

Level 1: Tenant-aware API wrapper. Implement a simple FastAPI endpoint that accepts a tenant_id header, forwards the body to an OpenAI-compatible API, and logs input tokens, output tokens, and latency per tenant. Add a basic in-memory RPM limiter that rejects requests when a tenant exceeds 10 requests per minute.

Level 2: LoRA adapter swap measurement. Use vLLM or LoRAX to serve two different LoRA adapters on one base model. Send alternating requests for Adapter A and Adapter B and measure the latency of each swap. Does the first request after a swap take longer than subsequent ones? Can you warm both adapters simultaneously if GPU memory allows?

Level 3: Isolation contract tests. Build a local router with tenant-filtered retrieval, tenant-scoped prefix-cache keys, and adapter authorization. Write adversarial tests that submit the same query or prefix from two tenants and assert that no foreign document, cache hit, or adapter route is exposed.

Mastery check

Suppose your platform serves three merchant tiers on one GPU fleet. Enterprise merchants have tighter latency objectives and stronger privacy boundaries, business merchants need adapter customization, and starter merchants need low cost. Explain how you would route requests, isolate retrieval and KV state, enforce fairness, and roll out a new adapter version without causing a cross-tenant leak or a fleet-wide latency spike.

Key concepts

  1. Share weights, isolate state. The base model, scheduler, and worker pool can be shared, but tenant identity must stay attached to retrieval, adapters, KV blocks, and billing.
  2. KV memory is the central bottleneck. Large prompts and long chats compete for VRAM before raw FLOPS become the main limit.
  3. Continuous batching needs fairness policy. Good schedulers keep the GPU full while still honoring tenant quotas, priorities, and latency goals.
  4. LoRA reduces customization weight overhead. Tenant-specific behavior can come from measured adapter footprints rather than separate full model copies.
  5. Authorization must happen inside retrieval. Relevant documents are not enough; every RAG query must enforce tenant filters at the database layer.

Evaluation rubric

You are in good shape if you can:

  • explain why a shared base model is economically necessary and why KV memory still has to be budgeted per tenant
  • describe how continuous batching, chunked prefill, preemption, and hard KV limits work together
  • justify when shared pools are enough and when dedicated pools, MIG slices, or VMs are required
  • design cost attribution that combines tokens, cache hits, queue time, and GPU-seconds instead of relying on token count alone
  • explain how retrieval filters, prefix-cache namespaces, adapter routing, and block scrubbing prevent cross-tenant leaks

Common pitfalls

Symptom: Merchant B sees wording or policy hints that belong to Merchant A. Cause: KV state, prefix-cache entries, adapter buffers, or other attention-side state was reused without full invalidation. Fix: Invalidate block tables and tenant-scoped caches after every request. For stricter workloads, scrub memory deterministically or move the tenant to dedicated infrastructure.

Symptom: Capacity plan asks for far more GPUs than the traffic actually needs. Cause: The plan assumes one merchant maps to one slice of compute and ignores batching, burstiness, and shared adapters. Fix: Benchmark node throughput on the real model, scheduler, and context mix. Size the fleet from measured queue depth, TTFT, and KV pressure instead of a hand-wavy merchant-to-GPU ratio.

Symptom: One long-context tenant causes everybody else's latency to spike. Cause: That tenant monopolizes KV memory because the platform lacks hard context and KV budget limits. Fix: Enforce per-tier context caps and KV budgets at admission time. Preempt or reject oversized requests before they occupy shared blocks.

Symptom: Retrieved answers are relevant but unauthorized. Cause: Retrieval was filtered after vector search instead of inside the database query. Fix: Push tenant_id into the retrieval predicate itself. Never let foreign document IDs or scores leave the vector store.

Symptom: Pricing looks fair by tokens, but some tenants are still unprofitable. Cause: Token counts hide expensive traffic shapes such as long prefills, low batchability, repeated preemption, or adapter cache thrash. Fix: Reconcile customer billing against measured GPU-seconds, cache hits, queue time, and adapter residency so internal cost tracks actual fleet burn.

Next Step
Continue to LLM-Powered Search Engine

You will design a search and answer system that combines retrieval, ranking, synthesis, and citation, while applying the tenant-scoped retrieval and context controls introduced here.

References

[1] H100 GPU. NVIDIA, 2026.
[2] Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al., 2022. OSDI 2022.
[3] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al., 2021. ICLR.
[4] S-LoRA: Serving Thousands of Concurrent LoRA Adapters. Sheng, Y., et al., 2023. arXiv preprint.
[5] LoRA Adapters. vLLM, 2026.
[6] Punica: Multi-Tenant LoRA Serving. Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., & Krishnamurthy, A., 2023.
[7] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al., 2023. SOSP 2023.
[8] SGLang: Efficient Execution of Structured Language Model Programs. Zheng, L., Yin, L., Xie, Z., et al., 2023. arXiv:2312.07104.
[9] Automatic Prefix Caching. vLLM, 2026.
[10] Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Agrawal, A., et al., 2023. arXiv preprint.
[11] Optimization and Tuning. vLLM, 2026.
[12] vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. vLLM Team, 2024.
[13] Supported GPUs. NVIDIA, 2026.
[14] Presidio: Data Protection and De-identification SDK. Microsoft Presidio, 2023. GitHub.
[15] Prompt caching. OpenAI, 2026.
[16] Prompt caching. Anthropic, 2026. Official documentation.