Design a shared LLM platform with tenant-scoped state, quota enforcement, adapter routing, KV accounting, and measured GPU utilization.
Code completion gave you a single high-frequency product surface: one developer, one editor context, one low-latency serving path. A multi-tenant large language model (LLM) platform generalizes that serving path into shared infrastructure where many tenants, adapters, quotas, and privacy boundaries coexist on the same fleet.
A multi-tenant LLM platform shares expensive serving infrastructure while enforcing tenant-scoped state, scheduler policy, and measurable latency objectives. This design chapter covers routing, quotas, batching, data boundaries, and cost control.
Imagine you run an AI logistics platform. One hundred online merchants use it to answer customer questions, track packages, and draft return labels. Each merchant requires authorization boundaries around customer data, prompts, adapters, and usage records. Your job is to share GPU capacity while keeping every stateful path scoped to the authorized tenant.
This is the multi-tenant LLM serving problem. In this article we will follow one concrete request through a shared platform and see, at each layer, how it enforces tenant scopes, schedules shared work, and measures latency.
Before we start, recall three ideas from earlier in the curriculum. First, an LLM generates text one token at a time. Second, to avoid recomputing the entire prompt for every new token, the model stores intermediate results in a structure called the KV cache. Third, batching runs multiple requests together so the GPU loads the model weights once and amortizes the cost across many users. We will build on all three.
Consider a design scenario with a dense 72-billion-parameter model stored in FP16 (16-bit floating point). Weight storage alone is about 144 GB in decimal units. One NVIDIA H100 SXM configuration has 80 GB of HBM3 memory.[1] In this scenario, one copy of the model weights exceeds one such GPU's memory before adding any serving overhead.
If each of one hundred merchants had a separate copy of those weights, weight memory alone would be 14.4 TB. Shared base weights can avoid that duplication, but they do not automatically isolate prompts, retrieved documents, adapters, caches, or billing state.
Make the scenario calculation runnable before discussing schedulers:
```python
def weight_storage_gb(parameters_billions: int, bytes_per_parameter: int) -> float:
    # Billions of parameters times bytes per parameter gives decimal gigabytes.
    return parameters_billions * bytes_per_parameter

base_weight_gb = weight_storage_gb(parameters_billions=72, bytes_per_parameter=2)
per_tenant_weight_tb = base_weight_gb * 100 / 1000

assert base_weight_gb == 144
assert per_tenant_weight_tb == 14.4
print("one_fp16_weight_copy_gb:", base_weight_gb)
print("one_hundred_copies_tb:", per_tenant_weight_tb)
```

```text
one_fp16_weight_copy_gb: 144
one_hundred_copies_tb: 14.4
```

Sharing introduces three concrete engineering tensions:
The rest of the article solves these three tensions in order.
When a GPU processes a batch of requests, it loads the model weights once and reuses them for every request in the batch. The simplest approach is static batching: collect eight requests, run them together, and wait until every single one finishes before starting a new batch. This is easy to implement but wasteful. If Merchant A's tracking query generates only 10 output tokens while Merchant B's returns analysis generates 500 tokens, Merchant A's batch slot sits idle while Merchant B finishes the remaining 490 tokens.
Continuous batching (also called in-flight batching, described in the Orca paper[2]) replaces completed requests with queued work at iteration boundaries. When Merchant A reaches its EOS (End of Sequence) token, the scheduler can admit another request without waiting for Merchant B to finish.
The throughput gain depends on prompt lengths, decode lengths, admission policy, and scheduler overhead. The useful principle is that a finished request need not occupy a decode slot.
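To put a number on the static-batching waste in the two-merchant example, the minimal sketch below counts idle decode-slot steps. It assumes one decode slot per request and one token per step; `idle_slot_steps` is a hypothetical helper introduced only for illustration.

```python
def idle_slot_steps(output_lengths: list[int]) -> int:
    """Decode-slot steps a static batch wastes waiting for its longest request."""
    longest = max(output_lengths)
    return sum(longest - length for length in output_lengths)

# Scenario from the text: a 10-token and a 500-token output in one static batch.
wasted = idle_slot_steps([10, 500])
total_slot_steps = 2 * 500  # two slots held for 500 steps each

print("idle_slot_steps:", wasted)
print("idle_fraction:", wasted / total_slot_steps)
```

In this toy case almost half of the reserved slot-steps do no work. Continuous batching targets exactly those steps by offering the freed slot to queued requests.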
In a multi-tenant environment, the scheduler also has to respect priority and fairness. A high-tier merchant may have a tighter latency objective. A tenant-aware continuous batcher therefore balances throughput (packing as many tokens as possible) against measured latency objectives for prioritized tenants. We will see how it does that in the rate-limiting and preemption sections below.
Think of a shared shuttle bus. Static batching is a charter bus that waits until every passenger reaches their destination before returning to the depot. Continuous batching lets an empty seat be offered to the queue at the next scheduled stop.
This miniature schedule keeps a long request active while replacing a completed short request:
```python
from collections import deque

# Remaining decode tokens per active request, plus one queued request.
active = {"merchant-a": 1, "merchant-b": 3}
waiting = deque([("merchant-c", 2)])

# One scheduler iteration: every active request decodes one token.
for tenant in list(active):
    active[tenant] -= 1
    if active[tenant] == 0:
        # The finished request frees its slot, which goes to queued work.
        del active[tenant]
        admitted, remaining_tokens = waiting.popleft()
        active[admitted] = remaining_tokens

assert active == {"merchant-b": 2, "merchant-c": 2}
print("active_after_iteration:", active)
```

```text
active_after_iteration: {'merchant-b': 2, 'merchant-c': 2}
```

Merchants may need different behaviors without separate full model copies. LoRA (Low-Rank Adaptation[3]) learns low-rank matrices next to selected original weight layers. During inference, the base model remains fixed and the selected adapter contributes an additional projection.
Adapter size is not one universal number: it depends on rank, target modules, model dimensions, and dtype. It is typically much smaller than a full base copy, but a platform must measure its chosen adapter footprint and decide how many adapters can remain resident alongside KV-cache budgets.
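As a rough way to measure an adapter footprint, the sketch below counts LoRA parameters from rank and layer shapes. Each adapted matrix adds two low-rank factors, A (rank × d_in) and B (d_out × rank). The dimensions used here (hidden size 8,192, 80 layers, four square target projections per layer, rank 16) are illustrative assumptions, not the architecture of any specific model in this article.

```python
def lora_adapter_bytes(
    rank: int,
    d_in: int,
    d_out: int,
    target_modules_per_layer: int,
    layers: int,
    dtype_bytes: int = 2,
) -> int:
    # Each adapted matrix adds A (rank x d_in) and B (d_out x rank).
    params_per_module = rank * (d_in + d_out)
    return params_per_module * target_modules_per_layer * layers * dtype_bytes

# Hypothetical shapes chosen only to make the arithmetic concrete.
adapter_bytes = lora_adapter_bytes(
    rank=16, d_in=8192, d_out=8192, target_modules_per_layer=4, layers=80
)
adapter_mib = adapter_bytes / 1024**2

print("adapter_mib:", adapter_mib)
```

Under these assumptions one FP16 adapter is 160 MiB, and doubling the rank doubles the footprint. Real adapters vary with target modules and non-square projections, so measure the artifacts you actually deploy.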
The S-LoRA (Serving Thousands of Concurrent LoRA Adapters) system[4] studies serving many concurrent adapters while keeping base weights shared. The hard part isn't only caching adapters; it is executing requests with different adapters in shared serving steps. Multi-LoRA engines need runtime support that maps each request to its adapter while preserving base-model sharing.[5]
Punica introduces Segmented Gather Matrix-Vector multiplication (SGMV) for batched LoRA serving and evaluates mixed-adapter overhead.[6] Production engines such as vLLM expose LoRA serving configuration including enabled adapters, resident adapter limits, and maximum supported rank.[5] Benchmark your actual adapter mix and hardware before treating the paper result as a fleet capacity plan.
Analogy: the shared fulfillment line
Imagine a fulfillment line that uses one fixed conveyor but selects an approved merchant rule card at each station. The conveyor (the base model) stays shared. The rule card (the adapter) is small, but the station must still verify which merchant is authorized to use it.
The platform stores adapters in an object-storage registry such as S3 (Simple Storage Service) and loads them into GPU memory on demand. Frequently used adapters stay in an LRU cache on the GPU; rarely used ones are evicted to host RAM or disk.
The conceptual lifecycle below shows how the adapter manager takes a request, resolves the correct adapter, and runs inference. In practice, advanced kernels (like the ones in S-LoRA) fuse the adapter addition directly into the base forward pass, so the per-request adapter overhead is small once the adapter is resident, though not zero:
```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Request:
    tenant_id: str
    requested_adapter: str
    prompt: str

@dataclass
class Response:
    text: str

@dataclass
class AdapterWeights:
    adapter_id: str

class AdapterStore:
    def download(self, adapter_id: str) -> AdapterWeights:
        print(f"Loading adapter {adapter_id} into GPU cache")
        return AdapterWeights(adapter_id)

class BaseModel:
    def generate(self, request: Request, adapter: AdapterWeights) -> Response:
        token_count = len(request.prompt.split())
        return Response(
            f"tenant={request.tenant_id} adapter={adapter.adapter_id} "
            f"prompt_tokens={token_count}"
        )

class LoRAAdapterManager:
    """Routes only authorized adapters on a shared base model."""
    def __init__(self, max_hot_adapters: int = 2):
        self.base_model = BaseModel()
        self.adapter_cache: OrderedDict[str, AdapterWeights] = OrderedDict()
        self.adapter_store = AdapterStore()
        self.max_hot_adapters = max_hot_adapters
        self.authorized_adapter = {
            "merchant-a": "returns-v2",
            "merchant-b": "warehouse-v7",
            "merchant-c": "fraud-v1",
        }

    def serve(self, request: Request) -> Response:
        # Authorize before touching the cache or the adapter store.
        if self.authorized_adapter.get(request.tenant_id) != request.requested_adapter:
            raise PermissionError("adapter is not authorized for tenant")
        adapter_id = f"{request.tenant_id}/{request.requested_adapter}"

        if adapter_id not in self.adapter_cache:
            if len(self.adapter_cache) >= self.max_hot_adapters:
                # Evict the least recently used adapter.
                evicted_id, _ = self.adapter_cache.popitem(last=False)
                print(f"Evicting adapter {evicted_id}")
            adapter_weights = self.adapter_store.download(adapter_id)
            self.adapter_cache[adapter_id] = adapter_weights

        self.adapter_cache.move_to_end(adapter_id)
        return self.base_model.generate(request, self.adapter_cache[adapter_id])

manager = LoRAAdapterManager(max_hot_adapters=2)
for request in [
    Request("merchant-a", "returns-v2", "draft a return label"),
    Request("merchant-b", "warehouse-v7", "summarize the package status"),
    Request("merchant-c", "fraud-v1", "review this payment dispute quickly"),
]:
    response = manager.serve(request)

print(response.text)
print("hot adapters:", list(manager.adapter_cache))
try:
    manager.serve(Request("merchant-a", "fraud-v1", "use another policy"))
except PermissionError as error:
    print("blocked:", error)
```

```text
Loading adapter merchant-a/returns-v2 into GPU cache
Loading adapter merchant-b/warehouse-v7 into GPU cache
Evicting adapter merchant-a/returns-v2
Loading adapter merchant-c/fraud-v1 into GPU cache
tenant=merchant-c adapter=merchant-c/fraud-v1 prompt_tokens=5
hot adapters: ['merchant-b/warehouse-v7', 'merchant-c/fraud-v1']
blocked: adapter is not authorized for tenant
```

Adapter loading from object storage to GPU is fast for small adapters but still measurable. If Merchant A and Merchant B alternate on every request, the cache thrashes and latency spikes. Profile your actual adapter reuse patterns before assuming LRU is sufficient. Some platforms pin high-tier adapters permanently and only evict best-effort ones.
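The pinning idea can be sketched as an LRU cache that skips pinned entries when choosing a victim. `PinnedLRU` and the adapter names below are hypothetical; the string `"weights"` stands in for real adapter tensors.

```python
from collections import OrderedDict
from typing import Optional

class PinnedLRU:
    def __init__(self, capacity: int, pinned: set[str]):
        self.capacity = capacity
        self.pinned = pinned
        self.entries: OrderedDict[str, str] = OrderedDict()

    def touch(self, adapter_id: str) -> Optional[str]:
        """Mark adapter_id as used; return the evicted adapter id, if any."""
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            # Oldest entry first; skip anything that is pinned.
            evicted = next((key for key in self.entries if key not in self.pinned), None)
            if evicted is None:
                raise RuntimeError("cache is full of pinned adapters")
            del self.entries[evicted]
        self.entries[adapter_id] = "weights"  # placeholder for loaded tensors
        return evicted

cache = PinnedLRU(capacity=2, pinned={"enterprise/returns-v2"})
evictions = [
    cache.touch(adapter_id)
    for adapter_id in [
        "enterprise/returns-v2",
        "starter/a-v1",
        "starter/b-v1",
        "enterprise/returns-v2",
    ]
]
print("evictions:", evictions)
print("resident:", list(cache.entries))
```

The pinned enterprise adapter is the oldest entry when the cache fills, yet the best-effort adapter is evicted instead. Pinning trades away capacity for everyone else, so the number of pinned adapters needs its own budget.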
Every token the model generates relies on the KV cache, which stores intermediate key and value vectors from earlier tokens. Without it, autoregressive decoding repeats work. In a multi-tenant system, KV allocation is also sensitive state: the runtime must not attach Merchant A's live blocks or cache entries to Merchant B's request.
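KV-cache memory per request grows with context length. For a decoder using fixed-width KV heads, a useful estimate is:

$$
\text{KV bytes} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times n_{\text{tokens}} \times \text{bytes}_{\text{elem}}
$$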
The factor of 2 counts keys and values. The other terms are model depth, KV heads, head dimension, request context length, and bytes per element. Grouped-query attention (GQA) stores fewer KV heads than attention heads, which is why the n_kv_heads term matters.
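Before you memorize the symbols, calculate a concrete scenario. Suppose a model has 80 layers, 8 KV heads, head dimension 128, a 4,000-token context, and FP16 (2 bytes per element):

$$
2 \times 80 \times 8 \times 128 \times 4{,}000 \times 2 = 1{,}310{,}720{,}000 \text{ bytes} \approx 1.22 \text{ GiB}
$$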
If the same model family used 16 KV heads instead of 8, that doubles to about 2.44 GiB. At high concurrency, this memory can become the admission constraint before compute throughput does.
```python
def kv_gib(layers: int, kv_heads: int, head_dim: int, tokens: int, dtype_bytes: int = 2) -> float:
    # The leading 2 counts keys and values.
    bytes_used = 2 * layers * kv_heads * head_dim * tokens * dtype_bytes
    return bytes_used / (1024 ** 3)

request_gib = kv_gib(layers=80, kv_heads=8, head_dim=128, tokens=4_000)
double_heads_gib = kv_gib(layers=80, kv_heads=16, head_dim=128, tokens=4_000)

assert round(request_gib, 2) == 1.22
assert round(double_heads_gib, 2) == 2.44
print("eight_kv_heads_gib:", round(request_gib, 2))
print("sixteen_kv_heads_gib:", round(double_heads_gib, 2))
```

```text
eight_kv_heads_gib: 1.22
sixteen_kv_heads_gib: 2.44
```

PagedAttention (introduced in the vLLM paper[7]) treats KV storage in blocks rather than reserving one long contiguous chunk for each request. A block table maps each request's logical sequence to physical blocks. The example figure uses 16-token blocks to make the mapping visible; production block sizes are runtime configuration and performance choices.
In that illustration, Merchant A's 47-token conversation occupies three physical blocks and Merchant B's 31-token conversation occupies two. PagedAttention improves memory allocation efficiency; it does not by itself enforce tenant authorization. The serving layer must associate request ownership with block tables, invalidate released references, and implement any required clearing policy before reallocation.
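The block counts above, and the ownership check that PagedAttention leaves to the serving layer, can be sketched with a toy allocator. `BlockPool` is a hypothetical class written for this example and is not vLLM's block manager; the 16-token block size is the illustrative value from the text.

```python
import math

BLOCK_TOKENS = 16  # illustrative block size, not a production default

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.owner: dict[int, str] = {}         # physical block -> tenant
        self.tables: dict[str, list[int]] = {}  # request id -> block table

    def allocate(self, request_id: str, tenant: str, tokens: int) -> list[int]:
        needed = math.ceil(tokens / BLOCK_TOKENS)
        if needed > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop(0) for _ in range(needed)]
        for block in blocks:
            self.owner[block] = tenant
        self.tables[request_id] = blocks
        return blocks

    def read(self, request_id: str, tenant: str) -> list[int]:
        blocks = self.tables[request_id]
        # Refuse to hand a block table to any tenant that does not own it.
        if any(self.owner[block] != tenant for block in blocks):
            raise PermissionError("block table belongs to another tenant")
        return blocks

    def release(self, request_id: str) -> None:
        # Drop the table and ownership before blocks return to the free list.
        for block in self.tables.pop(request_id):
            del self.owner[block]
            self.free.append(block)

pool = BlockPool(num_blocks=8)
blocks_a = pool.allocate("req-a", "merchant-a", tokens=47)
blocks_b = pool.allocate("req-b", "merchant-b", tokens=31)
print("merchant_a_blocks:", blocks_a)
print("merchant_b_blocks:", blocks_b)
try:
    pool.read("req-a", "merchant-b")
except PermissionError as error:
    print("blocked:", error)
pool.release("req-a")
```

This sketch only invalidates references on release. Whether released blocks must also be cleared before reuse is a policy decision that depends on the workload's isolation requirements.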
Multi-tenant traffic often repeats the same system prompt, tool schema, or long retrieved prefix. Runtimes can cache those KV blocks and skip recomputing the shared prefix on later requests. SGLang introduced RadixAttention for this pattern, while vLLM's automatic prefix caching uses hashed KV blocks rather than a radix tree.[8][9]
Prefix caching mainly lowers TTFT (Time to First Token) because it eliminates repeated prefill work. It doesn't make decode itself cheaper.
The critical rule is isolation. Reuse prefixes only inside an authorized cache namespace, such as one tenant or an explicitly public shared prompt. vLLM documents an optional cache salt intended to isolate cache reuse across trust groups and mitigate timing-based probing; platform routing must also keep adapter and model compatibility consistent.[9]
Enabling private-prefix reuse without a tenant or trust-group namespace is an isolation bug, not only a performance mistake. Scope private cache reuse by tenant or authorized trust group and compatible model, tokenizer, and adapter configuration.
```python
def cache_key(trust_group: str, model: str, adapter: str, tokenizer: str, prefix: str) -> tuple[str, ...]:
    # Every field that affects KV contents or authorization is part of the key.
    return trust_group, model, adapter, tokenizer, prefix

prompt = "You are the returns assistant."
cache = {
    cache_key("tenant:merchant-a", "base-v3", "returns-v2", "tok-v3", prompt): "kv-7"
}

same_tenant = cache_key("tenant:merchant-a", "base-v3", "returns-v2", "tok-v3", prompt)
other_tenant = cache_key("tenant:merchant-b", "base-v3", "returns-v2", "tok-v3", prompt)

assert cache.get(same_tenant) == "kv-7"
assert cache.get(other_tenant) is None
print("authorized_hit:", same_tenant in cache)
print("cross_tenant_hit:", other_tenant in cache)
```

```text
authorized_hit: True
cross_tenant_hit: False
```

The prefill phase processes the input prompt to build the initial KV cache. It is compute-intensive and happens before any output is streamed. The decode phase generates output tokens one at a time. It is usually bound by memory bandwidth rather than compute.
Without scheduling controls, a long prefill from one merchant can delay decode operations from others. Chunked prefill (studied in Sarathi-Serve[10]) breaks long prompts into bounded chunks so decode work from other requests can be scheduled between prefill chunks.
The chunk size is a tuning knob, not a fixed constant. It is controlled by the runtime's per-step token budget (max_num_batched_tokens in vLLM). Smaller budgets lower inter-token latency for in-flight decodes; larger budgets improve TTFT and prefill throughput. Current vLLM docs show chunked prefill enabled by default in V1, example low-latency settings like 2,048 tokens, and throughput-oriented settings above 8,192 tokens, so tune it to your latency target instead of copying one number blindly.[10][11]
In multi-tenant settings, chunked prefill is one useful noisy-neighbor control. It still needs admission limits and a tenant-aware scheduler; chunking alone does not guarantee fair service.
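A small planning sketch shows how a per-step token budget bounds the prefill chunk. It assumes that each in-flight decode consumes one token of the budget before any prefill is scheduled, and that the number of decoding requests stays constant; `plan_steps` and the traffic numbers are hypothetical, not a runtime API.

```python
def plan_steps(prompt_tokens: int, decode_requests: int, token_budget: int) -> list[dict[str, int]]:
    """Split one long prefill into per-step chunks that leave room for decodes."""
    if token_budget <= decode_requests:
        raise ValueError("token budget must exceed the number of decoding requests")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are counted first; the prefill chunk gets what is left.
        chunk = min(remaining, token_budget - decode_requests)
        steps.append({"decode_tokens": decode_requests, "prefill_tokens": chunk})
        remaining -= chunk
    return steps

steps = plan_steps(prompt_tokens=5_000, decode_requests=48, token_budget=2_048)
for index, step in enumerate(steps):
    print(index, step)
```

With a 2,048-token budget, the 5,000-token prompt is spread over three steps and every in-flight decode advances in each of them. A smaller budget would add steps, which delays the long prompt's first token but shortens each step for the other tenants.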
When admitted work approaches its KV budget, the scheduler may reject, queue, or preempt requests according to the published service policy. A priority tier can permit an enterprise request to displace best-effort work, but the choice must be metered and observable rather than hidden.
Modern runtimes often prefer recomputation over CPU swap because host-memory transfers can cost more than rebuilding the evicted prefix. Swap is still useful when recomputation isn't supported or would discard too much work.[12]
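The conceptual scheduler below demonstrates the decision logic. It sorts running requests by priority, then by KV footprint, then by tokens already generated. If a new request has higher priority than the lowest-priority running request, the scheduler preempts the victim and schedules the newcomer. The sketch preempts at most one victim and does not re-check free KV memory afterward; a production scheduler would verify that the freed blocks actually cover the new request: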
```python
from dataclasses import dataclass
from typing import Protocol

class GPUAllocator(Protocol):
    def get_num_free_blocks(self) -> int: ...

@dataclass
class Tenant:
    name: str
    priority: int  # Larger number = higher priority

@dataclass
class Request:
    tenant: Tenant
    estimated_kv_memory: int
    tokens_generated: int
    can_recompute: bool = True

class TenantAwareScheduler:
    def __init__(self, gpu_allocator: GPUAllocator, block_size_mb: int):
        self.running_requests: list[Request] = []
        self.gpu_allocator = gpu_allocator
        self.block_size_mb = block_size_mb

    def available_kv_memory(self) -> int:
        return self.gpu_allocator.get_num_free_blocks() * self.block_size_mb

    def evict_for_recompute(self, request: Request) -> None:
        print(
            f"Evicting KV cache for tenant={request.tenant.name} "
            f"priority={request.tenant.priority}; "
            "the request will be recomputed if resumed."
        )

    def swap_to_cpu(self, request: Request) -> None:
        print(
            f"Swapping KV cache for tenant={request.tenant.name} "
            f"priority={request.tenant.priority} "
            "to host memory."
        )

    def schedule(self, request: Request) -> None:
        self.running_requests.append(request)
        print(
            f"Scheduling request tenant={request.tenant.name} "
            f"priority={request.tenant.priority}."
        )

    def preempt(self, request: Request) -> None:
        if request.can_recompute:
            self.evict_for_recompute(request)
        else:
            self.swap_to_cpu(request)

    def preempt_if_needed(self, new_request: Request) -> None:
        if self.available_kv_memory() >= new_request.estimated_kv_memory:
            self.schedule(new_request)
            return

        candidates = sorted(
            self.running_requests,
            key=lambda r: (
                r.tenant.priority,       # Lowest priority first
                -r.estimated_kv_memory,  # Free the biggest KV footprint first
                r.tokens_generated,      # Prefer to kill work that has done less decode
            ),
        )

        if not candidates:
            print("No running requests to preempt.")
            return

        victim = candidates[0]
        if new_request.tenant.priority > victim.tenant.priority:
            self.preempt(victim)
            self.running_requests.remove(victim)
            self.schedule(new_request)
        else:
            print("Cannot preempt a higher or equal priority request.")

class FakeAllocator:
    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks

    def get_num_free_blocks(self) -> int:
        return self.free_blocks

scheduler = TenantAwareScheduler(FakeAllocator(free_blocks=4), block_size_mb=16)
scheduler.running_requests = [
    Request(Tenant("starter", priority=1), estimated_kv_memory=96, tokens_generated=8),
    Request(Tenant("business", priority=2), estimated_kv_memory=80, tokens_generated=120),
]

incoming = Request(Tenant("enterprise", priority=4), estimated_kv_memory=80, tokens_generated=0)
scheduler.preempt_if_needed(incoming)
print("running tenants:", [request.tenant.name for request in scheduler.running_requests])
```

```text
Evicting KV cache for tenant=starter priority=1; the request will be recomputed if resumed.
Scheduling request tenant=enterprise priority=4.
running tenants: ['business', 'enterprise']
```

Scheduling alone isn't enough. The platform also enforces hard quotas on context sizes and concurrent requests based on the merchant's tier. The illustration below summarizes the isolation spectrum from shared pools to dedicated hardware:
Example admission policy (numbers are scenario inputs, not universal tiers):
| Tenant Tier | Max Concurrent Requests | Max Context Length | KV Cache Budget |
|---|---|---|---|
| Enterprise | 50 | 8K | 256 GB |
| Business | 20 | 4K | 80 GB |
| Starter | 5 | 2K | 20 GB |
Apply those limits before scheduling GPU work:
```python
TIERS = {
    "enterprise": {"max_context": 8_000, "kv_gib": 256.0},
    "starter": {"max_context": 2_000, "kv_gib": 20.0},
}

def admit(tier: str, context_tokens: int, projected_kv_gib: float) -> str:
    # Check the cheap per-request limit first, then the tier's KV budget.
    policy = TIERS[tier]
    if context_tokens > policy["max_context"]:
        return "REJECT_CONTEXT_LIMIT"
    if projected_kv_gib > policy["kv_gib"]:
        return "REJECT_KV_BUDGET"
    return "ADMIT"

assert admit("starter", 2_400, 3.0) == "REJECT_CONTEXT_LIMIT"
assert admit("starter", 1_900, 22.0) == "REJECT_KV_BUDGET"
assert admit("enterprise", 7_500, 180.0) == "ADMIT"
print("starter_long_prompt:", admit("starter", 2_400, 3.0))
print("enterprise_request:", admit("enterprise", 7_500, 180.0))
```

```text
starter_long_prompt: REJECT_CONTEXT_LIMIT
enterprise_request: ADMIT
```

Rate limiting sits at the gateway, before a request ever reaches the GPU. It enforces two distinct budgets: requests per minute (RPM) and tokens per minute (TPM).
The difference matters. A merchant sending one request with a 64K-token prompt consumes far more GPU time than a merchant sending one hundred requests with 100-token prompts, even though the first merchant sends fewer requests. An RPM limit alone would admit the 64K prompt and let it monopolize the KV cache.
For RPM, a distributed sliding-window limiter using Redis with a Lua script gives consistent enforcement across all gateway nodes. A local in-memory limiter isn't enough because requests are load-balanced across many gateway instances.
The Lua script below removes entries older than the window, counts the remaining requests, and either allows the new request or rejects it. The key detail is using a unique sorted-set member (a request ID with a timestamp) instead of the raw timestamp alone. If two requests land in the same clock tick and you use the timestamp as both score and member, Redis collapses them into one entry and undercounts traffic:
```lua
-- Redis Lua Script for RPM Sliding-Window Limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2]) -- e.g., 60000
local now_ms = tonumber(ARGV[3])
local member = ARGV[4] -- unique request id, e.g. "1713468123456:req-9f3c"

-- Remove timestamped entries older than the window
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window_ms)

-- Count current requests
local count = redis.call('ZCARD', key)

if count < limit then
  redis.call('ZADD', key, now_ms, member)
  redis.call('PEXPIRE', key, window_ms)
  return 1 -- Allowed
else
  return 0 -- Rejected
end
```

TPM is trickier because you don't know the final output length at admission time. In practice, reserve a budget based on prompt tokens plus max_output_tokens, then reconcile the counter with actual usage when the stream finishes.
For high-throughput services, strictly synchronized Redis limits can become a bottleneck. A platform may choose bounded burst allowance or approximate local counters, but that weakens strict limit semantics and must be documented and measured.
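One way to reduce round trips to the shared store is to lease small batches of request permits to each gateway. The sketch below is an in-memory illustration of that idea under simplifying assumptions: `CentralQuota` stands in for the shared store and is not a Redis client, the window reset is not modeled, and the lease size is an arbitrary value.

```python
class CentralQuota:
    """Stand-in for the shared counter; hypothetical, not a Redis client."""
    def __init__(self, limit: int):
        self.remaining = limit

    def lease(self, size: int) -> int:
        granted = min(size, self.remaining)
        self.remaining -= granted
        return granted

class GatewayLimiter:
    def __init__(self, central: CentralQuota, lease_size: int):
        self.central = central
        self.lease_size = lease_size
        self.local_permits = 0

    def allow(self) -> bool:
        if self.local_permits == 0:
            # One round trip per lease instead of one per request.
            self.local_permits = self.central.lease(self.lease_size)
        if self.local_permits == 0:
            return False
        self.local_permits -= 1
        return True

central = CentralQuota(limit=10)
gateway_a = GatewayLimiter(central, lease_size=4)
gateway_b = GatewayLimiter(central, lease_size=4)

allowed_a = sum(gateway_a.allow() for _ in range(3))   # uses 3 of a 4-permit lease
allowed_b = sum(gateway_b.allow() for _ in range(12))  # gets only what is left

print("allowed_a:", allowed_a)
print("allowed_b:", allowed_b)
print("stranded_on_gateway_a:", gateway_a.local_permits)
```

The tenant is admitted nine times against a limit of ten because one permit is stranded on gateway A. That is the kind of semantic change the text warns about: the limit is never exceeded in this scheme, but it is no longer exact, and the gap grows with lease size and gateway count.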
The sketch below applies that reserve-then-reconcile pattern. It holds prompt plus maximum allowed output before execution, then releases the unused output capacity after the stream completes:
```python
class TokenBudget:
    def __init__(self, remaining: int):
        self.remaining = remaining

    def reserve(self, prompt_tokens: int, max_output_tokens: int) -> int:
        # Hold the worst case up front so concurrent requests cannot overspend.
        reservation = prompt_tokens + max_output_tokens
        if reservation > self.remaining:
            raise ValueError("TPM budget exceeded")
        self.remaining -= reservation
        return reservation

    def reconcile(self, reservation: int, prompt_tokens: int, output_tokens: int) -> None:
        # Return the part of the reservation that was not actually used.
        self.remaining += reservation - (prompt_tokens + output_tokens)

budget = TokenBudget(remaining=1_000)
held = budget.reserve(prompt_tokens=300, max_output_tokens=400)
budget.reconcile(held, prompt_tokens=300, output_tokens=120)

assert budget.remaining == 580
print("tokens_remaining_after_actual_usage:", budget.remaining)
```

```text
tokens_remaining_after_actual_usage: 580
```

RPM and TPM protect the gateway edge, but they don't fully solve scheduler fairness inside the serving engine. A merchant with one 64K prompt can consume far more GPU time than dozens of merchants sending short chat turns.
Inside the runtime, keep per-tenant queues and charge a virtual token budget for every admitted prefill chunk and every decode step. Then schedule by priority tier plus virtual finish time, not raw request count. That gives each merchant forward progress while still letting higher-SLA traffic buy more share.
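The virtual-finish-time idea can be sketched with a heap. This is a simplified weighted-fair-queuing illustration, not a production scheduler: the tier weights are hypothetical, tiers are modeled as weights rather than strict priorities, and there is no global virtual clock, so a tenant returning from idle would need its start time clamped to the current virtual time.

```python
import heapq

WEIGHTS = {"enterprise": 4.0, "starter": 1.0}  # hypothetical tier weights

class FairQueue:
    def __init__(self) -> None:
        self.virtual_finish: dict[str, float] = {}
        self.heap: list[tuple[float, int, str, int]] = []
        self.sequence = 0  # tie-breaker that keeps submission order stable

    def submit(self, tenant: str, tier: str, tokens: int) -> None:
        # Charging tokens / weight makes higher tiers accrue virtual time more slowly.
        start = self.virtual_finish.get(tenant, 0.0)
        finish = start + tokens / WEIGHTS[tier]
        self.virtual_finish[tenant] = finish
        heapq.heappush(self.heap, (finish, self.sequence, tenant, tokens))
        self.sequence += 1

    def next_work(self) -> tuple[str, int]:
        _, _, tenant, tokens = heapq.heappop(self.heap)
        return tenant, tokens

queue = FairQueue()
for _ in range(3):
    queue.submit("merchant-big", "starter", 2_048)   # three prefill chunks
queue.submit("merchant-small", "starter", 64)        # one short chat turn
queue.submit("merchant-ent", "enterprise", 2_048)    # same size, higher weight

order = [queue.next_work()[0] for _ in range(5)]
print("service_order:", order)
```

The short request and the enterprise chunk are served before the large tenant's queued chunks even though they arrived later, and the large tenant still makes progress afterward.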
Do not collapse rate limiting and quota management into one counter. Rate limits prevent bursts and protect the system, while quotas cap total usage and protect the budget. A merchant can stay under their RPM limit and still burn through their monthly token quota in one afternoon.
Every state-bearing layer needs an authorization boundary and a testable release policy. A cross-tenant retrieval, adapter, cache, or KV access is a security incident even if the other layers behaved correctly.
Many multi-tenant platforms augment LLMs with retrieval (RAG). A vector database finds the most relevant documents for a query. In multi-tenancy, "relevant" doesn't mean "allowed."
Imagine Merchant X searches for "best shipping carrier rates." The vector DB might find a highly relevant internal contract that belongs to Merchant Y, because both merchants ship packages and the embeddings overlap. Without a hard filter, the LLM could summarize Merchant Y's confidential rates and return them to Merchant X.
The fix is authorization filtering in the retrieval operation itself. An application should pass authorized scope into the database query, and tests should fail if another tenant's result can cross that boundary:
```python
documents = [
    {"tenant": "merchant-a", "text": "FastShip discount tier A", "score": 0.88},
    {"tenant": "merchant-b", "text": "FastShip confidential tier B", "score": 0.99},
]

def authorized_search(tenant: str, top_k: int) -> list[str]:
    # Filter by tenant before ranking so foreign documents never become candidates.
    allowed = [doc for doc in documents if doc["tenant"] == tenant]
    ranked = sorted(allowed, key=lambda doc: doc["score"], reverse=True)
    return [doc["text"] for doc in ranked[:top_k]]

results = authorized_search("merchant-a", top_k=1)
assert results == ["FastShip discount tier A"]
assert all("tier B" not in text for text in results)
print("authorized_results:", results)
```

```text
authorized_results: ['FastShip discount tier A']
```

The authorization predicate is applied before candidate results leave the data layer. Post-filtering after a broad top-K query can return foreign identifiers, scores, or content to application memory and may also erase every authorized candidate.
When a shared worker serves more than one tenant, state lifecycle rules matter:
For regulated workloads that can tolerate redaction, requests can pass through a lightweight PII masking service (e.g., Presidio[14]) before they hit the model router. This reduces the chance that the LLM ever sees raw credit card numbers or Social Security numbers, while still letting downstream systems map placeholders back to original values when needed.
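The placeholder idea can be illustrated with plain regular expressions. This sketch does not use Presidio's API, and the two patterns are deliberately narrow assumptions (16-digit grouped card numbers and `NNN-NN-NNNN` identifiers); a real deployment needs a maintained recognizer set and validation.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
PATTERNS = {
    "CARD": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with placeholders and return the reverse mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(replace, text)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

original = "Refund card 4111 1111 1111 1111 for SSN 123-45-6789"
masked, mapping = mask(original)
print("masked:", masked)
print("restored_matches_original:", unmask(masked, mapping) == original)
```

The mapping is itself sensitive tenant state, so it should live in the same authorization scope as the request and be discarded under the same release policy.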
A shared fleet only stays profitable if you can answer one question per tenant: what did this merchant actually cost us, and what should we charge? Token counts alone are a weak proxy because two requests with the same token counts can consume very different GPU time depending on prompt-vs-output split, batch occupancy, preemptions, and cache hits.
A defensible metering record attaches to every request and carries: tenant_id, model and adapter version, prompt tokens, output tokens, cache-hit tokens, queue wait, prefill time, decode time, KV blocks held, preemption count, and GPU worker type. Cache-hit tokens matter because reused prefix blocks skip prefill compute. A public billing policy may choose a distinct cached-input rate, as current provider pricing documents illustrate.[15][16]
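The fields listed above map naturally onto an immutable record type. The sketch below is one possible shape; the field names, millisecond units, and the validation rule are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class MeteringRecord:
    tenant_id: str
    model_version: str
    adapter_version: str
    prompt_tokens: int
    output_tokens: int
    cache_hit_tokens: int
    queue_wait_ms: float
    prefill_ms: float
    decode_ms: float
    kv_blocks_held: int
    preemption_count: int
    gpu_worker_type: str

    def __post_init__(self) -> None:
        # Cached tokens are a subset of the prompt, never more than it.
        if self.cache_hit_tokens > self.prompt_tokens:
            raise ValueError("cache-hit tokens cannot exceed prompt tokens")

    @property
    def computed_prompt_tokens(self) -> int:
        """Prompt tokens that actually required prefill compute."""
        return self.prompt_tokens - self.cache_hit_tokens

record = MeteringRecord(
    tenant_id="merchant-a",
    model_version="base-v3",
    adapter_version="returns-v2",
    prompt_tokens=1_200,
    output_tokens=150,
    cache_hit_tokens=800,
    queue_wait_ms=12.0,
    prefill_ms=85.0,
    decode_ms=900.0,
    kv_blocks_held=85,
    preemption_count=0,
    gpu_worker_type="h100-80gb",
)
print("computed_prompt_tokens:", record.computed_prompt_tokens)
print("fields:", sorted(asdict(record)))
```

Freezing the record keeps billing inputs from being edited after the fact; corrections can then be modeled as new records rather than in-place changes.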
For internal cost allocation rather than customer billing, the honest unit is GPU-time, not tokens. A reasonable per-request cost estimate looks like:

$$
\text{request cost} \approx \text{gpu\_seconds} \times \frac{\text{node hourly rate}}{3600}
$$
where gpu_seconds is the request's share of busy GPU time (prefill plus its decode steps, divided by batch occupancy so shared steps are split across co-batched tenants). Charge customers on the simpler token-and-tier dimensions, but reconcile against measured GPU-time so you can spot tenants whose traffic shape (long prompts, low batchability, adapter thrash) costs far more than their token bill suggests.
```python
records = [
    {"tenant": "merchant-a", "gpu_seconds": 0.40, "cache_hit_tokens": 800},
    {"tenant": "merchant-b", "gpu_seconds": 1.25, "cache_hit_tokens": 0},
]
NODE_HOURLY_RATE = 8.00

def compute_dollars(record: dict[str, float]) -> float:
    # Convert the hourly node rate to a per-second rate.
    return record["gpu_seconds"] * NODE_HOURLY_RATE / 3600

costs = {record["tenant"]: compute_dollars(record) for record in records}
assert costs["merchant-b"] > costs["merchant-a"]
print("metered_gpu_cost_usd:", {tenant: round(value, 6) for tenant, value in costs.items()})
```

```text
metered_gpu_cost_usd: {'merchant-a': 0.000889, 'merchant-b': 0.002778}
```

A production platform should handle spikes, deploy new models with release gates, and degrade predictably during failures.
CPU utilization alone doesn't describe LLM serving pressure. Queue depth, admitted-token backlog, inference latency, and KV-cache utilization provide signals for capacity and admission decisions. When new workers cannot become ready in time, shed or downgrade eligible best-effort work instead of silently violating every tenant's objective.
Cold-starting GPU workers includes making model weights available before the pool can serve requests. Depending on cost and latency objectives, a platform may keep warm capacity, forecast predictable demand, or reject low-priority excess load while workers start.
Models and LoRA adapters require versioned rollout because a new artifact can degrade generation quality or latency.
In one canary rollout policy, when a merchant deploys a new adapter version (for example, v2), the router sends a controlled slice of eligible traffic to it while the rest continues on v1. Monitor quality evaluation, latency, error rates, and safety signals over a predefined observation window.
If gates pass, the router can increase exposure. If a gate fails, route new eligible requests back to v1; active streams and adapter residency still need explicit handling, so a routing change is not a blanket zero-downtime promise.
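A canary router along these lines can be sketched with a stable hash bucket and a gate check. The metric names and thresholds are hypothetical, and the sketch covers only routing of new requests; it does not handle active streams or adapter residency.

```python
import hashlib

def canary_bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100

def gates_pass(metrics: dict[str, float], thresholds: dict[str, float]) -> bool:
    # Every monitored metric must stay at or below its configured ceiling.
    return all(metrics[name] <= limit for name, limit in thresholds.items())

def route(request_id: str, canary_percent: int, gates_passing: bool) -> str:
    # A failed gate sends all new requests back to the stable version.
    if gates_passing and canary_bucket(request_id) < canary_percent:
        return "v2"
    return "v1"

thresholds = {"p95_latency_ms": 1_000.0, "error_rate": 0.02}  # assumed gates
healthy = gates_pass({"p95_latency_ms": 850.0, "error_rate": 0.01}, thresholds)
degraded = gates_pass({"p95_latency_ms": 1_400.0, "error_rate": 0.01}, thresholds)

print("healthy_full_exposure:", route("req-1", 100, healthy))
print("degraded_full_exposure:", route("req-1", 100, degraded))
print("zero_exposure:", route("req-1", 0, healthy))
```

Hashing the request id keeps the split reproducible across gateway nodes. Hashing a conversation or user id instead would keep one session on a single adapter version, which is often preferable for quality evaluation.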
For base model updates, the process is more complex. Unlike lightweight adapters, base models require spinning up entirely new GPU worker pools. The platform routes shadow traffic (duplicate asynchronous requests) to the new base model cluster to validate correctness and measure throughput before exposing it to real merchant traffic. Once validated, the gateway shifts live traffic to the new cluster and gracefully drains the old one.
Reliability depends on handling GPU failures and model crashes gracefully:
| Strategy | Trigger Condition | Action Taken | Architectural Impact |
|---|---|---|---|
| Dead Letter Queues (DLQ) | Repeated CUDA out-of-memory (OOM) or crashes | Move a repeatedly failing request to DLQ after a bounded retry count | Stops that request from causing an unbounded retry loop |
| Circuit Breaking | Model/adapter error rate crosses a configured threshold | Fast-fail new requests for that specific adapter | Limits repeated work while operators investigate |
| Active Health Checks | Missed node heartbeats (e.g., stuck kernel) | Mark unhealthy, stop new routing, drain or fail in-flight work according to policy | Removes a suspected worker from new admission |
| Zone Redundancy | Entire Availability Zone failure | Shift eligible traffic to healthy zones, subject to spare capacity | Reduces zone-failure impact when capacity is available |
These mechanisms need a control plane or equivalent coordination layer to track worker readiness, orchestrate model deployments, and update routing. Readiness checks reduce routing to unavailable workers; they do not prove model quality or prevent every runtime failure.
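As one illustration of the circuit-breaking row in the table, the sketch below tracks a rolling error rate per adapter and fast-fails new requests once it crosses a threshold. The class name `AdapterCircuitBreaker`, the default threshold, window size, and cooldown are hypothetical values chosen for the example, and a production breaker would usually add a half-open probing state.

```python
import time
from collections import deque


class AdapterCircuitBreaker:
    """Per-adapter breaker: opens when the recent error rate crosses a threshold."""

    def __init__(self, error_threshold: float = 0.5, min_requests: int = 20,
                 window: int = 100, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.error_threshold = error_threshold
        self.min_requests = min_requests
        self.window = window
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._outcomes = {}   # adapter_id -> deque of bools (True means error)
        self._opened_at = {}  # adapter_id -> time the breaker opened

    def allow(self, adapter_id: str) -> bool:
        """Return False while the breaker for this adapter is open."""
        opened = self._opened_at.get(adapter_id)
        if opened is None:
            return True
        if self._clock() - opened >= self.cooldown_s:
            # Cooldown elapsed: close the breaker and start with fresh history.
            del self._opened_at[adapter_id]
            self._outcomes.pop(adapter_id, None)
            return True
        return False

    def record(self, adapter_id: str, error: bool) -> None:
        """Record one request outcome and open the breaker if the rate is too high."""
        outcomes = self._outcomes.setdefault(adapter_id, deque(maxlen=self.window))
        outcomes.append(error)
        if len(outcomes) >= self.min_requests:
            rate = sum(outcomes) / len(outcomes)
            if rate >= self.error_threshold:
                self._opened_at[adapter_id] = self._clock()
```

Keeping state per adapter means a failing adapter is fast-failed without blocking other adapters that share the same base model.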
Here are three practice levels you can build to internalize the concepts:
Level 1: Tenant-aware API wrapper. Implement a simple FastAPI endpoint that accepts a tenant_id header, forwards the body to an OpenAI-compatible API, and logs input tokens, output tokens, and latency per tenant. Add a basic in-memory RPM limiter that rejects requests when a tenant exceeds 10 requests per minute.
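For the limiter part of Level 1, a sliding-window counter is one reasonable starting point. The sketch below is framework-independent and assumes a single process; the class name `TenantRpmLimiter` and the injectable `clock` parameter are illustrative choices, not part of any library.

```python
import time
from collections import defaultdict, deque


class TenantRpmLimiter:
    """Sliding-window limiter: at most `limit_per_minute` admits per tenant per window."""

    def __init__(self, limit_per_minute: int = 10, window_s: float = 60.0,
                 clock=time.monotonic):
        self.limit = limit_per_minute
        self.window_s = window_s
        self._clock = clock
        self._hits = defaultdict(deque)  # tenant_id -> timestamps of admitted requests

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        hits = self._hits[tenant_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # rejected requests do not consume quota
        hits.append(now)
        return True
```

In the endpoint handler, a `False` result would typically map to an HTTP 429 response. Because the state lives in process memory, this only holds for one worker; several gateway replicas would need a shared store.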
Level 2: LoRA adapter swap measurement. Use vLLM or LoRAX to serve two different LoRA adapters on one base model. Send alternating requests for Adapter A and Adapter B and measure the latency of each swap. Does the first request after a swap take longer than subsequent ones? Can you warm both adapters simultaneously if GPU memory allows?
Level 3: Isolation contract tests. Build a local router with tenant-filtered retrieval, tenant-scoped prefix-cache keys, and adapter authorization. Write adversarial tests that submit the same query or prefix from two tenants and assert that no foreign document, cache hit, or adapter route is exposed.
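One way to start Level 3 is to make the tenant part of the cache key and then assert that identical prefixes from two tenants never collide. The sketch below is a toy stand-in for a real prefix cache: `prefix_cache_key`, `PrefixCache`, and the opaque `kv_handle` value are hypothetical names for this exercise.

```python
import hashlib
from typing import Optional


def prefix_cache_key(tenant_id: str, adapter_id: str, token_ids: list[int]) -> str:
    """Derive a cache key that depends on tenant, adapter, and the token prefix."""
    h = hashlib.sha256()
    for part in (tenant_id, adapter_id):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from hashing identically.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    for token in token_ids:
        h.update(token.to_bytes(8, "big"))
    return h.hexdigest()


class PrefixCache:
    """Toy in-memory cache mapping scoped keys to opaque KV handles."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, tenant_id: str, adapter_id: str,
            token_ids: list[int], kv_handle: str) -> None:
        self._store[prefix_cache_key(tenant_id, adapter_id, token_ids)] = kv_handle

    def get(self, tenant_id: str, adapter_id: str,
            token_ids: list[int]) -> Optional[str]:
        return self._store.get(prefix_cache_key(tenant_id, adapter_id, token_ids))


def test_same_prefix_from_two_tenants_does_not_hit() -> None:
    cache = PrefixCache()
    prefix = [101, 7592, 2088]
    cache.put("merchant-a", "adapter-1", prefix, "kv-block-17")
    assert cache.get("merchant-a", "adapter-1", prefix) == "kv-block-17"
    assert cache.get("merchant-b", "adapter-1", prefix) is None
```

The same pattern extends to retrieval and adapter authorization: submit identical input as two tenants and assert that nothing belonging to the other tenant comes back.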
Suppose your platform serves three merchant tiers on one GPU fleet. Enterprise merchants have tighter latency objectives and stronger privacy boundaries, business merchants need adapter customization, and starter merchants need low cost. Explain how you would route requests, isolate retrieval and KV state, enforce fairness, and roll out a new adapter version without causing a cross-tenant leak or a fleet-wide latency spike.
You are in good shape if you can:
Symptom: Merchant B sees wording or policy hints that belong to Merchant A. Cause: KV state, prefix-cache entries, adapter buffers, or other attention-side state was reused without full invalidation. Fix: Invalidate block tables and tenant-scoped caches after every request. For stricter workloads, scrub memory deterministically or move the tenant to dedicated infrastructure.
Symptom: Capacity plan asks for far more GPUs than the traffic actually needs. Cause: The plan assumes one merchant maps to one slice of compute and ignores batching, burstiness, and shared adapters. Fix: Benchmark node throughput on the real model, scheduler, and context mix. Size the fleet from measured queue depth, TTFT, and KV pressure instead of a hand-wavy merchant-to-GPU ratio.
Symptom: One long-context tenant causes everybody else's latency to spike. Cause: That tenant monopolizes KV memory because the platform lacks hard context and KV budget limits. Fix: Enforce per-tier context caps and KV budgets at admission time. Preempt or reject oversized requests before they occupy shared blocks.
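The admission-time fix can be sketched as a check that runs before any shared KV block is allocated. This is a simplified model under stated assumptions: budgets are counted in tokens, each request reserves its worst case (prompt plus `max_new_tokens`), and the tier names and limits are made up for illustration. The `kv_bytes_per_token` helper uses the standard per-token KV size for a transformer with separate key and value tensors per layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    max_context_tokens: int  # hard cap on prompt + generated tokens per request
    kv_budget_tokens: int    # total KV tokens one tenant may hold at once


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


class KvAdmission:
    """Admit or reject requests before they occupy shared KV blocks."""

    def __init__(self, tiers: dict[str, TierLimits]) -> None:
        self._tiers = tiers
        self._in_use: dict[str, int] = {}  # tenant_id -> reserved KV tokens

    def try_admit(self, tenant_id: str, tier: str,
                  prompt_tokens: int, max_new_tokens: int) -> tuple[bool, str]:
        limits = self._tiers[tier]
        needed = prompt_tokens + max_new_tokens  # worst-case reservation
        if needed > limits.max_context_tokens:
            return False, "context_cap"
        used = self._in_use.get(tenant_id, 0)
        if used + needed > limits.kv_budget_tokens:
            return False, "kv_budget"
        self._in_use[tenant_id] = used + needed
        return True, "admitted"

    def release(self, tenant_id: str, tokens: int) -> None:
        """Return a finished request's reservation to the tenant's budget."""
        self._in_use[tenant_id] = max(0, self._in_use.get(tenant_id, 0) - tokens)
```

Multiplying a token budget by `kv_bytes_per_token` for the deployed model converts it into bytes of GPU memory, which is the quantity the scheduler ultimately has to protect.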
Symptom: Retrieved answers are relevant but unauthorized. Cause: Retrieval was filtered after vector search instead of inside the database query. Fix: Push tenant_id into the retrieval predicate itself. Never let foreign document IDs or scores leave the vector store.
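The difference between filtering inside the query and filtering afterwards can be shown with a small in-memory stand-in for a vector store. The `Doc` type and `search` function below are hypothetical; a real vector database would express the same constraint through its own metadata filter.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    embedding: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(docs: list[Doc], tenant_id: str,
           query: tuple[float, ...], k: int = 3) -> list[str]:
    """Apply the tenant predicate before scoring, so foreign docs are never ranked."""
    candidates = (d for d in docs if d.tenant_id == tenant_id)
    scored = sorted(((cosine(d.embedding, query), d.doc_id) for d in candidates),
                    reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

Because the predicate runs before scoring, foreign document IDs and scores never enter the result set, and the top-k slots cannot be consumed by another tenant's documents.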
Symptom: Pricing looks fair by tokens, but some tenants are still unprofitable. Cause: Token counts hide expensive traffic shapes such as long prefills, low batchability, repeated preemption, or adapter cache thrash. Fix: Reconcile customer billing against measured GPU-seconds, cache hits, queue time, and adapter residency so internal cost tracks actual fleet burn.
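A reconciliation pass along these lines compares what each tenant was billed against what its traffic cost in GPU time. The sketch below is deliberately reduced to two inputs per tenant; `TenantUsage`, `margin_report`, and the price and cost parameters are illustrative assumptions, and a fuller version would also fold in cache hits, queue time, and adapter residency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantUsage:
    tenant_id: str
    billed_tokens: int   # tokens the customer was charged for
    gpu_seconds: float   # measured GPU time attributed to this tenant


def margin_report(usages: list[TenantUsage], price_per_1k_tokens: float,
                  cost_per_gpu_second: float) -> dict[str, dict[str, float]]:
    """Compare token-based revenue with measured GPU cost for each tenant."""
    report: dict[str, dict[str, float]] = {}
    for usage in usages:
        revenue = usage.billed_tokens / 1000 * price_per_1k_tokens
        cost = usage.gpu_seconds * cost_per_gpu_second
        report[usage.tenant_id] = {
            "revenue": revenue,
            "cost": cost,
            "margin": revenue - cost,
        }
    return report


def unprofitable(report: dict[str, dict[str, float]]) -> list[str]:
    """Tenants whose measured cost exceeds their token revenue."""
    return sorted(t for t, row in report.items() if row["margin"] < 0)
```

Two tenants with identical token counts can land on opposite sides of zero margin when one of them has long prefills or poor batchability, which is the gap token-only pricing hides.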
H100 GPU
NVIDIA · 2026
Orca: A Distributed Serving System for Transformer-Based Generative Models
Yu, G.-I., et al. · 2022 · OSDI 2022
LoRA: Low-Rank Adaptation of Large Language Models
Hu, E. J., et al. · 2021 · ICLR
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Sheng, Y., et al. · 2023 · arXiv preprint
LoRA Adapters
vLLM · 2026
Punica: Multi-Tenant LoRA Serving
Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., & Krishnamurthy, A. · 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, W., et al. · 2023 · SOSP 2023
SGLang: Efficient Execution of Structured Language Model Programs
Zheng, L., Yin, L., Xie, Z., et al. · 2023 · arXiv:2312.07104
Automatic Prefix Caching
vLLM · 2026
Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Agrawal, A., et al. · 2023 · arXiv preprint
Optimization and Tuning
vLLM · 2026
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
vLLM Team · 2024
Supported GPUs
NVIDIA · 2026
Presidio: Data Protection and De-identification SDK
Microsoft Presidio · 2023 · GitHub
Prompt caching
OpenAI · 2026
Prompt caching
Anthropic · 2026 · Official documentation