Multi-Tenant LLM Platform

Design a shared LLM platform with tenant-scoped state, quota enforcement, adapter routing, KV accounting, and measured GPU utilization.

Code completion gave you a single high-frequency product surface: one developer, one editor context, one low-latency serving path. A multi-tenant large language model (LLM) platform generalizes that serving path into shared infrastructure where many tenants, adapters, quotas, and privacy boundaries coexist on the same GPU fleet.

A multi-tenant LLM platform shares expensive serving infrastructure while enforcing tenant-scoped state, scheduler policy, and measurable latency objectives. This design chapter covers routing, quotas, batching, data boundaries, and cost control.

A shared AI platform for developer workspaces serves one hundred teams. They use it to answer internal-doc questions, summarize incidents, and draft migration notes. Each workspace requires authorization boundaries around private docs, prompts, adapters, and usage records. Your job is to share GPU capacity while keeping every stateful path scoped to the authorized tenant.

This is the multi-tenant LLM serving problem. The sections below follow one concrete request through the shared platform, layer by layer, so you can see how the platform enforces tenant scopes, schedules shared work, and measures latency.

Start with three ideas from earlier in the curriculum. First, an LLM generates text one token at a time. Second, to avoid recomputing the entire prompt on every single token, the model stores intermediate results in a structure called the KV cache. Third, batching lets the GPU apply resident model weights across multiple requests in shared serving steps, amortizing work across users. The design below builds on all three.

Figure: Multi-tenant serving path. Tenant-tagged requests move through the gateway, fair scheduler, shared GPU workers, and per-tenant metering, while shared base weights stay separate from tenant-scoped state. Shared fleets are cheap only when tenant identity stays attached to routing, state, and billing through every shared step.

Why shared capacity needs hard boundaries

Consider a design scenario with a dense 72-billion-parameter model stored in FP16 (16-bit floating point). Weight storage alone is about 144 GB in decimal units. One NVIDIA H100 SXM configuration has 80 GB of HBM3 memory.^{[1]Reference 1H100 GPUhttps://www.nvidia.com/en-us/data-center/h100/} In this scenario, one copy of the model weights exceeds one such GPU's memory before adding KV cache or serving overhead.

If each of one hundred tenants had a separate copy of those weights, weight memory alone would be 14.4 TB. Shared base weights can avoid that duplication, but they don't automatically isolate prompts, retrieved documents, adapters, caches, or billing state.

Make the scenario calculation runnable before discussing schedulers:

shared-weight-capacity.py

def weight_storage_gb(parameters_billions: int, bytes_per_parameter: int) -> int:
    # Billions of parameters x bytes per parameter = decimal gigabytes (1 GB = 1e9 bytes).
    return parameters_billions * bytes_per_parameter

base_weight_gb = weight_storage_gb(parameters_billions=72, bytes_per_parameter=2)
per_tenant_weight_tb = base_weight_gb * 100 / 1000

assert base_weight_gb == 144
assert per_tenant_weight_tb == 14.4
print("one_fp16_weight_copy_gb:", base_weight_gb)
print("one_hundred_copies_tb:", per_tenant_weight_tb)

Output

one_fp16_weight_copy_gb: 144
one_hundred_copies_tb: 14.4

Sharing introduces three concrete engineering tensions:

Compute contention. All tenants want the GPU's compute units during deploy windows, incidents, and review bursts.
Memory contention. Every active conversation consumes KV cache memory. A tenant with a long design-review transcript can evict another tenant's chat if limits aren't enforced.
Weight customization. Tenants want different behaviors. One needs terse API-reference answers; another needs incident-triage summaries. Loading a full model copy per customization wastes capacity, so we need scoped lightweight adapters or separate pools where required.

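The rest of the article addresses all three tensions: batching and scheduling for compute, LoRA adapters for customization, and KV-cache accounting and quotas for memory.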

How we pack requests together: continuous batching

When a GPU processes a batch, one serving step applies the resident model weights across multiple requests. The simplest approach is static batching: collect eight requests, run them together, and wait until every single one finishes before starting a new batch. This is easy to implement but wasteful. If Tenant A's API-reference answer generates only 10 output tokens while Tenant B's incident retrospective generates 500 tokens, Tenant A's slot sits idle while Tenant B generates its remaining 490 tokens.

Continuous batching (also called in-flight batching, described in the Orca paper^{[2]Reference 2Orca: A Distributed Serving System for Transformer-Based Generative Models.https://www.usenix.org/conference/osdi22/presentation/yu}) replaces completed requests with queued work at iteration boundaries. When Tenant A reaches its EOS (End of Sequence) token, the scheduler can admit another request without waiting for Tenant B to finish.

The throughput gain depends on prompt lengths, decode lengths, admission policy, and scheduler overhead. The useful principle is that a finished request need not occupy a decode slot.
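To put a number on the idle slot in the 10-token versus 500-token example, the following sketch counts decode slot-steps under static batching. It is a simplified model, not a benchmark: it assumes one token per slot per step and ignores prefill and scheduler overhead, and `static_slot_utilization` is an illustrative name rather than a runtime API.

```python
def static_slot_utilization(output_lengths: list[int]) -> float:
    """Fraction of decode slot-steps doing useful work under static batching.

    Assumes one token per slot per step and that the batch keeps every slot
    reserved until its longest request finishes.
    """
    steps = max(output_lengths)                 # batch runs until the longest request ends
    total_slot_steps = steps * len(output_lengths)
    useful_slot_steps = sum(output_lengths)     # each generated token uses one slot-step
    return useful_slot_steps / total_slot_steps


utilization = static_slot_utilization([10, 500])
idle_steps_for_short_slot = 500 - 10

print("static_utilization:", utilization)
print("idle_steps_for_short_slot:", idle_steps_for_short_slot)
```

In this toy case about half of the slot-steps are idle. Continuous batching can offer those 490 steps to queued requests, so the real gain depends on how uneven the output lengths are.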

In a multi-tenant environment, the scheduler also has to respect priority and fairness. A high-tier tenant may have tighter latency SLOs. A tenant-aware continuous batcher therefore balances throughput (packing as many tokens as possible) against measured latency objectives for prioritized tenants. The rate-limiting and preemption sections below show how that works.

Static batching is a charter bus that waits until every passenger reaches their destination before returning to the depot. Continuous batching lets an empty seat be offered to the queue at the next scheduled stop.

This miniature schedule keeps a long request active while replacing a completed short request:

continuous-batch-slots.py

from collections import deque

active = {"tenant-a": 1, "tenant-b": 3}
waiting = deque([("tenant-c", 2)])

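# One decode iteration: every active request emits one token.
for tenant in list(active):
    active[tenant] -= 1
    if active[tenant] == 0:
        # A finished request frees its slot at the iteration boundary.
        del active[tenant]
        if waiting:  # Admit queued work only if something is waiting.
            admitted, remaining_tokens = waiting.popleft()
            active[admitted] = remaining_tokens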

assert active == {"tenant-b": 2, "tenant-c": 2}
print("active_after_iteration:", active)

Output

active_after_iteration: {'tenant-b': 2, 'tenant-c': 2}

How we customize behavior without duplicating weights: LoRA adapters

Tenants may need different behaviors without separate full model copies. LoRA (Low-Rank Adaptation^{[3]Reference 3LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685}) learns low-rank matrices next to selected original weight layers. During inference, the base model remains fixed and the selected adapter contributes an additional projection.
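The "additional projection" can be written as h = W x + (alpha / r) * B (A x), following the LoRA paper. The sketch below uses tiny hand-written matrices and plain Python lists to show that form; the values, shapes, and helper names (`matvec`, `lora_forward`) are illustrative assumptions, not part of any serving runtime.

```python
Matrix = list[list[float]]


def matvec(matrix: Matrix, vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: list[float],
                 alpha: float, rank: int) -> list[float]:
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)                  # shared, frozen base weights
    update = matvec(B, matvec(A, x))     # down-project with A, up-project with B
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # d_out=2, d_in=3
A = [[1.0, 1.0, 1.0]]                    # rank=1, d_in=3
B_zero = [[0.0], [0.0]]                  # zero B leaves the base output unchanged
B_tuned = [[0.5], [-0.5]]                # a hypothetical trained adapter
x = [1.0, 2.0, 3.0]

untouched = lora_forward(W, A, B_zero, x, alpha=2.0, rank=1)
adapted = lora_forward(W, A, B_tuned, x, alpha=2.0, rank=1)
print("base_only:", untouched)
print("with_adapter:", adapted)
```

Because W never changes, two tenants can share it and differ only in their A and B matrices.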

Adapter size isn't one universal number: it depends on rank, target modules, model dimensions, and dtype. It's typically much smaller than a full base copy, but a platform must measure its chosen adapter footprint and decide how many adapters can remain resident alongside KV-cache budgets.
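A rough way to see how rank, target modules, and dtype drive adapter size: each adapted projection of shape (d_in, d_out) adds rank x (d_in + d_out) parameters. The sketch below applies that count to a hypothetical configuration. The hidden size of 8,192, the choice of query and value projections as targets, and rank 16 are assumptions for illustration, not properties of the 72B scenario model.

```python
def lora_param_count(rank: int, module_shapes: list[tuple[int, int]], layers: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank), per target module, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return per_layer * layers


HIDDEN = 8_192        # assumed model width
KV_DIM = 8 * 128      # 8 KV heads x head dimension 128
targets = [
    (HIDDEN, HIDDEN),  # query projection
    (HIDDEN, KV_DIM),  # value projection (narrower under grouped-query attention)
]

params = lora_param_count(rank=16, module_shapes=targets, layers=80)
adapter_mib = params * 2 / (1024 ** 2)   # FP16 = 2 bytes per parameter

print("adapter_parameters:", params)
print("adapter_fp16_mib:", adapter_mib)
```

Under these assumptions one adapter is tens of mebibytes, which is small next to a 144 GB base copy but still adds up when many adapters must stay resident.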

The S-LoRA (Serving Thousands of Concurrent LoRA Adapters) system^{[4]Reference 4S-LoRA: Serving Thousands of Concurrent LoRA Adapters.https://arxiv.org/abs/2311.03285} studies serving many concurrent adapters while keeping base weights shared. Adapter caching is only part of the problem; the runtime also has to execute requests with different adapters in shared serving steps. Multi-LoRA engines need runtime support that maps each request to its adapter while preserving base-model sharing.^{[5]Reference 5LoRA Adaptershttps://docs.vllm.ai/en/stable/features/lora/}

Punica introduces Segmented Gather Matrix-Vector multiplication (SGMV) for batched LoRA serving and evaluates mixed-adapter overhead.^{[6]Reference 6Punica: Multi-Tenant LoRA Servinghttps://arxiv.org/abs/2310.18547} Production engines such as vLLM expose LoRA serving configuration including enabled adapters, resident adapter limits, and maximum supported rank.^{[5]Reference 5LoRA Adaptershttps://docs.vllm.ai/en/stable/features/lora/} Benchmark your actual adapter mix and hardware before treating the paper result as a fleet capacity plan.
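The bookkeeping idea behind segment-based batching can be sketched without a GPU: rows in a mixed batch are grouped so that rows sharing an adapter form one segment, and each segment can then apply its adapter once. The code below is only that grouping step in plain Python. It is not Punica's kernel, and `adapter_segments` is a hypothetical helper.

```python
from itertools import groupby


def adapter_segments(batch_adapters: list[str]) -> list[tuple[str, list[int]]]:
    """Group batch row indices by adapter so each adapter's rows form one segment."""
    # sorted() is stable, so rows that share an adapter keep their arrival order.
    order = sorted(range(len(batch_adapters)), key=lambda i: batch_adapters[i])
    return [
        (adapter, list(rows))
        for adapter, rows in groupby(order, key=lambda i: batch_adapters[i])
    ]


batch = ["docs-v2", "incident-v7", "docs-v2", "code-review-v1", "incident-v7"]
segments = adapter_segments(batch)
for adapter, rows in segments:
    print(adapter, rows)
```

The base-model projection still runs over all five rows together; only the low-rank update is applied per segment.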

Analogy: a shared compiler backend

Imagine one compiler backend that loads a small approved plugin for each workspace. The backend (the base model) stays shared. The plugin (the adapter) is small, but the runtime must still verify which tenant is authorized to use it.

The platform stores adapters in an object-storage registry such as S3 (Simple Storage Service) and loads them into GPU memory on demand. Frequently used adapters stay in an LRU cache on the GPU; rarely used ones are evicted to host RAM or disk.

Figure: LoRA adapter routing for multi-tenant serving. Tenant identity selects an authorized adapter, hot adapters stay in GPU cache, and one frozen base model stays shared. Adapter choice, cache residency, and billing remain tenant-scoped.

The conceptual lifecycle below shows how the adapter manager takes a request, resolves the correct adapter, and runs inference. Multi-LoRA runtimes use adapter-aware kernels and memory management so requests with different adapters can share base-model serving steps. Adapter lookup, residency, and mixed-adapter execution still have measurable cost, so benchmark the actual mix:

lora-adapter-routing.py

from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Request:
    tenant_id: str
    requested_adapter: str
    prompt: str

@dataclass
class Response:
    text: str

@dataclass
class AdapterWeights:
    adapter_id: str

class AdapterStore:
    def download(self, adapter_id: str) -> AdapterWeights:
        print(f"Loading adapter {adapter_id} into GPU cache")
        return AdapterWeights(adapter_id)

class BaseModel:
    def generate(self, request: Request, adapter: AdapterWeights) -> Response:
        token_count = len(request.prompt.split())
        return Response(
            f"tenant={request.tenant_id} adapter={adapter.adapter_id} "
            f"prompt_tokens={token_count}"
        )

class LoRAAdapterManager:
    """Routes only authorized adapters on a shared base model."""
    def __init__(self, max_hot_adapters: int = 2):
        self.base_model = BaseModel()
        self.adapter_cache: OrderedDict[str, AdapterWeights] = OrderedDict()
        self.adapter_store = AdapterStore()
        self.max_hot_adapters = max_hot_adapters
        self.authorized_adapter = {
            "tenant-a": "docs-v2",
            "tenant-b": "incident-v7",
            "tenant-c": "code-review-v1",
        }

    def serve(self, request: Request) -> Response:
        if self.authorized_adapter.get(request.tenant_id) != request.requested_adapter:
            raise PermissionError("adapter is not authorized for tenant")
        adapter_id = f"{request.tenant_id}/{request.requested_adapter}"

        if adapter_id not in self.adapter_cache:
            if len(self.adapter_cache) >= self.max_hot_adapters:
                evicted_id, _ = self.adapter_cache.popitem(last=False)
                print(f"Evicting adapter {evicted_id}")
            adapter_weights = self.adapter_store.download(adapter_id)
            self.adapter_cache[adapter_id] = adapter_weights

        self.adapter_cache.move_to_end(adapter_id)
        return self.base_model.generate(request, self.adapter_cache[adapter_id])

manager = LoRAAdapterManager(max_hot_adapters=2)
for request in [
    Request("tenant-a", "docs-v2", "draft a migration note"),
    Request("tenant-b", "incident-v7", "summarize the outage timeline"),
    Request("tenant-c", "code-review-v1", "review this auth diff quickly"),
]:
    response = manager.serve(request)

print(response.text)
print("hot adapters:", list(manager.adapter_cache))
try:
    manager.serve(Request("tenant-a", "code-review-v1", "use another policy"))
except PermissionError as error:
    print("blocked:", error)

Output

Loading adapter tenant-a/docs-v2 into GPU cache
Loading adapter tenant-b/incident-v7 into GPU cache
Evicting adapter tenant-a/docs-v2
Loading adapter tenant-c/code-review-v1 into GPU cache
tenant=tenant-c adapter=tenant-c/code-review-v1 prompt_tokens=5
hot adapters: ['tenant-b/incident-v7', 'tenant-c/code-review-v1']
blocked: adapter is not authorized for tenant

Adapters are much smaller than full base weights, but loading one from object storage into GPU memory still takes measurable time. If the set of adapters in active rotation exceeds the hot-cache capacity (for example, three tenants taking turns with two resident slots, as in the run above), the adapter cache thrashes and latency spikes. Profile your actual adapter reuse patterns before assuming LRU is sufficient. Some platforms pin high-tier adapters permanently and only evict best-effort ones.
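The pinning idea can be sketched as an LRU cache that skips pinned entries when choosing a victim. `PinnedLRU` below is a hypothetical class, and the workload (three adapters cycling through two hot slots) is chosen to make thrashing visible; a real platform would measure load counts from production traces.

```python
from collections import OrderedDict


class PinnedLRU:
    """LRU adapter cache that never evicts pinned adapter IDs."""

    def __init__(self, capacity: int, pinned: set[str]):
        self.capacity = capacity
        self.pinned = pinned
        self.entries: OrderedDict[str, str] = OrderedDict()
        self.loads = 0  # counts loads from the (simulated) adapter store

    def get(self, adapter_id: str) -> str:
        if adapter_id in self.entries:
            self.entries.move_to_end(adapter_id)   # mark as most recently used
            return self.entries[adapter_id]
        if len(self.entries) >= self.capacity:
            # Oldest entry that is not pinned becomes the victim.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                raise RuntimeError("all resident adapters are pinned")
            del self.entries[victim]
        self.loads += 1
        self.entries[adapter_id] = f"weights:{adapter_id}"
        return self.entries[adapter_id]


workload = ["a", "b", "c"] * 3   # three adapters cycling through two slots

plain = PinnedLRU(capacity=2, pinned=set())
pinned = PinnedLRU(capacity=2, pinned={"a"})
for adapter_id in workload:
    plain.get(adapter_id)
    pinned.get(adapter_id)

print("plain_lru_loads:", plain.loads)
print("pinned_lru_loads:", pinned.loads)
```

Pinning protects the chosen adapter but shrinks the space left for everyone else, so the remaining adapters still thrash in this workload.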

How we keep conversations separate: KV cache isolation

Every token the model generates relies on the KV cache, which stores intermediate key and value vectors from earlier tokens. Without it, autoregressive decoding repeats work. In a multi-tenant system, KV allocation is also sensitive state: the runtime must not attach Tenant A's live blocks or cache entries to Tenant B's request.

The memory cost in concrete numbers

KV-cache memory per request grows with context length. For a decoder using fixed-width KV heads, a useful estimate is:

\begin{aligned} \text{KV memory per request} = &2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \\ &\times \text{seq\_len} \times \text{dtype\_bytes} \end{aligned}

The factor of 2 counts keys and values. The other terms are model depth, KV heads, head dimension, request context length, and bytes per element. Grouped-query attention (GQA) stores fewer KV heads than attention heads, which is why the n_kv_heads term matters.

Before you memorize the symbols, calculate a concrete scenario. Suppose a model has 80 layers, 8 KV heads, head dimension 128, a 4,000-token context, and FP16 (2 bytes per element):

$2 \times 80 \times 8 \times 128 \times 4{,}000 \times 2 = 1{,}310{,}720{,}000 \text{ bytes} \approx 1.22 \text{ GiB per request}$

If the same model family used 16 KV heads instead of 8, that doubles to about 2.44 GiB. At high concurrency, this memory can become the admission constraint before compute throughput does.

kv-cache-budget.py

def kv_gib(layers: int, kv_heads: int, head_dim: int, tokens: int, dtype_bytes: int = 2) -> float:
    bytes_used = 2 * layers * kv_heads * head_dim * tokens * dtype_bytes
    return bytes_used / (1024 ** 3)

request_gib = kv_gib(layers=80, kv_heads=8, head_dim=128, tokens=4_000)
double_heads_gib = kv_gib(layers=80, kv_heads=16, head_dim=128, tokens=4_000)

assert round(request_gib, 2) == 1.22
assert round(double_heads_gib, 2) == 2.44
print("eight_kv_heads_gib:", round(request_gib, 2))
print("sixteen_kv_heads_gib:", round(double_heads_gib, 2))

Output

eight_kv_heads_gib: 1.22
sixteen_kv_heads_gib: 2.44
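The same formula can be turned around to estimate how many full-context requests fit in a KV pool. The sketch below assumes a hypothetical 256 GiB pool and that every request uses the full 4,000-token context, which is a worst-case simplification; `max_requests_in_budget` is an illustrative helper.

```python
def max_requests_in_budget(budget_gib: int, layers: int, kv_heads: int,
                           head_dim: int, tokens: int, dtype_bytes: int = 2) -> int:
    """Whole number of full-context requests whose KV cache fits in the pool."""
    bytes_per_request = 2 * layers * kv_heads * head_dim * tokens * dtype_bytes
    return budget_gib * 1024 ** 3 // bytes_per_request


fit_8_heads = max_requests_in_budget(256, layers=80, kv_heads=8, head_dim=128, tokens=4_000)
fit_16_heads = max_requests_in_budget(256, layers=80, kv_heads=16, head_dim=128, tokens=4_000)

print("requests_with_8_kv_heads:", fit_8_heads)
print("requests_with_16_kv_heads:", fit_16_heads)
```

Doubling the KV heads halves the admissible concurrency for the same pool, independent of how much compute is available.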

PagedAttention: paging for GPUs

PagedAttention (introduced in the vLLM paper^{[7]Reference 7Efficient Memory Management for Large Language Model Serving with PagedAttentionhttps://arxiv.org/abs/2309.06180}) treats KV storage in blocks rather than reserving one long contiguous chunk for each request. A block table maps each request's logical sequence to physical blocks. The example figure uses 16-token blocks to make the mapping visible; production block sizes are runtime configuration and performance choices.

In that illustration, Tenant A's 47-token conversation occupies three physical blocks and Tenant B's 31-token conversation occupies two. PagedAttention improves memory allocation efficiency; it doesn't by itself enforce tenant authorization. The serving layer must associate request ownership with block tables, invalidate released references, and implement any required clearing policy before reallocation.

Figure: PagedAttention KV-cache trace. Logical tenant sequences map through block tables into reused physical KV pages. PagedAttention packs KV blocks efficiently, but tenant-scoped table access and released-page handling still belong to platform policy.
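The block-table mapping and the ownership check that the serving layer adds on top can be sketched in a few lines. `BlockAllocator` is a hypothetical toy with 16-token blocks, matching the illustration. It shows where a tenant check would sit and does not model a real runtime's allocator or any clearing policy.

```python
import math

BLOCK_TOKENS = 16


class BlockAllocator:
    """Toy block table that records which tenant owns each request's blocks."""

    def __init__(self, num_blocks: int):
        self.free: list[int] = list(range(num_blocks))
        self.tables: dict[str, tuple[str, list[int]]] = {}  # request -> (tenant, blocks)

    def allocate(self, tenant: str, request_id: str, tokens: int) -> list[int]:
        needed = math.ceil(tokens / BLOCK_TOKENS)
        if needed > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop(0) for _ in range(needed)]
        self.tables[request_id] = (tenant, blocks)
        return blocks

    def read(self, tenant: str, request_id: str) -> list[int]:
        owner, blocks = self.tables[request_id]
        if owner != tenant:
            raise PermissionError("block table belongs to another tenant")
        return blocks

    def release(self, request_id: str) -> None:
        # Drop the table entry first so no stale reference survives the release.
        _, blocks = self.tables.pop(request_id)
        self.free.extend(blocks)


allocator = BlockAllocator(num_blocks=8)
blocks_a = allocator.allocate("tenant-a", "req-a", tokens=47)
blocks_b = allocator.allocate("tenant-b", "req-b", tokens=31)
try:
    allocator.read("tenant-b", "req-a")
    cross_tenant_read = "allowed"
except PermissionError:
    cross_tenant_read = "blocked"
allocator.release("req-a")

print("tenant_a_blocks:", blocks_a)
print("tenant_b_blocks:", blocks_b)
print("cross_tenant_read:", cross_tenant_read)
```

Whether released blocks are also zeroed before reuse is a separate policy decision that this sketch leaves out.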

Prefix caching and the cross-tenant leak risk

Multi-tenant traffic often repeats the same system prompt, tool schema, or long retrieved prefix. Runtimes can cache those KV blocks and skip recomputing the shared prefix on later requests. SGLang introduced RadixAttention for this pattern, while vLLM's automatic prefix caching uses hashed KV blocks rather than a radix tree.^{[8]Reference 8SGLang: Efficient Execution of Structured Language Model Programshttps://arxiv.org/abs/2312.07104}^{[9]Reference 9Automatic Prefix Cachinghttps://docs.vllm.ai/en/latest/features/automatic_prefix_caching/}

Prefix caching mainly lowers TTFT (Time to First Token) because it eliminates repeated prefill work. It doesn't make decode itself cheaper.

The critical rule is isolation. Reuse prefixes only inside an authorized cache namespace, such as one tenant or an explicitly public shared prompt. vLLM's prefix-caching design docs describe an optional cache salt intended to isolate cache reuse across trust groups and mitigate timing-based probing; platform routing must also keep adapter and model compatibility consistent.^{[9]Reference 9Automatic Prefix Cachinghttps://docs.vllm.ai/en/latest/features/automatic_prefix_caching/}

Enabling private-prefix reuse without a tenant or trust-group namespace is an isolation bug first, even if the original change was meant to improve performance. Scope private cache reuse by tenant or authorized trust group and compatible model, tokenizer, and adapter configuration.

tenant-scoped-prefix-cache.py

def cache_key(trust_group: str, model: str, adapter: str, tokenizer: str, prefix: str) -> tuple[str, ...]:
    return trust_group, model, adapter, tokenizer, prefix

prompt = "You are the internal docs assistant."
cache = {
    cache_key("tenant:tenant-a", "base-v3", "docs-v2", "tok-v3", prompt): "kv-7"
}

same_tenant = cache_key("tenant:tenant-a", "base-v3", "docs-v2", "tok-v3", prompt)
other_tenant = cache_key("tenant:tenant-b", "base-v3", "docs-v2", "tok-v3", prompt)

assert cache.get(same_tenant) == "kv-7"
assert cache.get(other_tenant) is None
print("authorized_hit:", same_tenant in cache)
print("cross_tenant_hit:", other_tenant in cache)

Output

authorized_hit: True
cross_tenant_hit: False
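The tuple key above treats the prefix as one string. Block-level prefix caches usually hash fixed-size token blocks instead, chaining each block's hash to its parent so a block only matches when the entire preceding prefix matches. The sketch below shows one way a per-trust-group salt could seed that chain. It is a hypothetical scheme for illustration and is not vLLM's actual hashing code.

```python
import hashlib


def block_hashes(token_ids: list[int], block_size: int, salt: str) -> list[str]:
    """Chained hashes for each full block; the salt seeds the chain."""
    hashes: list[str] = []
    parent = salt
    full_length = len(token_ids) - len(token_ids) % block_size  # skip the partial tail
    for start in range(0, full_length, block_size):
        block = token_ids[start:start + block_size]
        payload = f"{parent}|{','.join(map(str, block))}".encode("utf-8")
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


shared_prefix = list(range(8))
first = block_hashes(shared_prefix + [50, 51], block_size=4, salt="tenant:tenant-a")
second = block_hashes(shared_prefix + [99, 100], block_size=4, salt="tenant:tenant-a")
foreign = block_hashes(shared_prefix + [50, 51], block_size=4, salt="tenant:tenant-b")

print("same_tenant_blocks_reusable:", first == second)
print("cross_tenant_blocks_reusable:", bool(set(first) & set(foreign)))
```

Within one salt the two requests share both full prefix blocks; across salts no block hash matches even though the tokens are identical.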

Chunked prefill for multi-tenant fairness

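The prefill phase processes the input prompt to build the initial KV cache and ends by producing the first output token. It's typically compute-bound because the whole prompt is processed in parallel. The decode phase then generates the remaining output tokens one at a time. It's typically memory-bandwidth-bound: each step rereads the weights and the growing KV cache while doing comparatively little arithmetic per token.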

Without scheduling controls, a long prefill from one tenant can delay decode operations from others. Chunked prefill (studied in Sarathi-Serve^{[10]Reference 10Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.https://arxiv.org/abs/2308.16369}) breaks long prompts into bounded chunks so decode work from other requests can be scheduled between prefill chunks.

The chunk size is a tuning knob, not a fixed constant. It's controlled by the runtime's per-step token budget (max_num_batched_tokens in vLLM). Smaller budgets lower inter-token latency for in-flight decodes; larger budgets improve TTFT and prefill throughput. Current vLLM docs show chunked prefill enabled whenever possible in V1, example low-latency settings like 2,048 tokens, and throughput-oriented settings above 8,192 tokens, so tune it to your latency target instead of copying one number blindly.^{[10]Reference 10Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.https://arxiv.org/abs/2308.16369}^{[11]Reference 11Optimization and Tuning.https://docs.vllm.ai/en/latest/configuration/optimization.html}

In multi-tenant settings, chunked prefill is one useful noisy neighbor control. It still needs admission limits and a tenant-aware scheduler; chunking alone doesn't guarantee fair service.
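A minimal planning sketch shows how a per-step token budget bounds the work any one prompt can claim. It assumes decode requests are served first at one token each and that prefill fills whatever budget remains. That ordering and the `plan_steps` helper are assumptions for illustration; real schedulers add fairness and admission logic on top.

```python
def plan_steps(prefill_tokens: int, decode_requests: int, token_budget: int) -> list[dict[str, int]]:
    """Split one long prefill into chunks that share each step with in-flight decodes."""
    if decode_requests >= token_budget:
        raise ValueError("token budget leaves no room for prefill")
    steps: list[dict[str, int]] = []
    remaining = prefill_tokens
    while remaining > 0:
        # Decodes take one token each; prefill gets the rest of the step's budget.
        chunk = min(remaining, token_budget - decode_requests)
        steps.append({"decode_tokens": decode_requests, "prefill_tokens": chunk})
        remaining -= chunk
    return steps


steps = plan_steps(prefill_tokens=5_000, decode_requests=48, token_budget=2_048)
for index, step in enumerate(steps):
    print(index, step)
```

The 48 in-flight decodes advance on every step, and the 5,000-token prompt waits three steps for its first token instead of stalling everyone for one large step.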

Tenant-aware preemption

When admitted work approaches its KV budget, the scheduler may reject, queue, or preempt requests according to the published service policy. A priority tier can permit an enterprise request to displace best-effort work, but the choice must be metered and observable rather than hidden.

Current vLLM V1 docs default to RECOMPUTE rather than SWAP because recomputation has lower overhead in that architecture. A platform should still choose a recovery mode based on runtime support and measured cost.^{[11]Reference 11Optimization and Tuning.https://docs.vllm.ai/en/latest/configuration/optimization.html}

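The conceptual scheduler below demonstrates the decision logic. If free KV memory already covers the new request, it is scheduled directly. Otherwise the scheduler sorts running requests by priority (lowest first), then by KV footprint (largest first), then by tokens already generated (fewest first), and collects victims with strictly lower priority than the newcomer until enough memory would be released. If that isn't possible, the newcomer isn't admitted and nothing is preempted: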

tenant-aware-preemption.py

from dataclasses import dataclass
from typing import Protocol

class GPUAllocator(Protocol):
    def get_num_free_blocks(self) -> int: ...

@dataclass
class Tenant:
    name: str
    priority: int  # Larger number = higher priority

@dataclass
class Request:
    tenant: Tenant
    estimated_kv_memory: int  # In MB, the same unit as block_size_mb
    tokens_generated: int
    can_recompute: bool = True

class TenantAwareScheduler:
    def __init__(self, gpu_allocator: GPUAllocator, block_size_mb: int):
        self.running_requests: list[Request] = []
        self.gpu_allocator = gpu_allocator
        self.block_size_mb = block_size_mb

    def available_kv_memory(self) -> int:
        return self.gpu_allocator.get_num_free_blocks() * self.block_size_mb

    def evict_for_recompute(self, request: Request) -> None:
        print(
            f"Evicting KV cache for tenant={request.tenant.name} "
            f"priority={request.tenant.priority}; "
            "the request will be recomputed if resumed."
        )

    def swap_to_cpu(self, request: Request) -> None:
        print(
            f"Swapping KV cache for tenant={request.tenant.name} "
            f"priority={request.tenant.priority} "
            "to host memory."
        )

    def schedule(self, request: Request) -> None:
        self.running_requests.append(request)
        print(
            f"Scheduling request tenant={request.tenant.name} "
            f"priority={request.tenant.priority}."
        )

    def preempt(self, request: Request) -> None:
        if request.can_recompute:
            self.evict_for_recompute(request)
        else:
            self.swap_to_cpu(request)

    def preempt_if_needed(self, new_request: Request) -> None:
        free_memory_mb = self.available_kv_memory()
        if free_memory_mb >= new_request.estimated_kv_memory:
            self.schedule(new_request)
            return

        candidates = sorted(
            self.running_requests,
            key=lambda r: (
                r.tenant.priority,       # Lowest priority first
                -r.estimated_kv_memory,    # Free the biggest KV footprint first
                r.tokens_generated,      # Prefer to kill work that has done less decode
            ),
        )

        victims = []
        for candidate in candidates:
            if new_request.tenant.priority <= candidate.tenant.priority:
                continue
            victims.append(candidate)
            free_memory_mb += candidate.estimated_kv_memory
            if free_memory_mb >= new_request.estimated_kv_memory:
                break

        if free_memory_mb < new_request.estimated_kv_memory:
            print("Cannot free enough KV memory without preempting higher or equal priority requests.")
            return

        for victim in victims:
            self.preempt(victim)
            self.running_requests.remove(victim)
        self.schedule(new_request)

class FakeAllocator:
    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks

    def get_num_free_blocks(self) -> int:
        return self.free_blocks

scheduler = TenantAwareScheduler(FakeAllocator(free_blocks=4), block_size_mb=16)
scheduler.running_requests = [
    Request(Tenant("starter", priority=1), estimated_kv_memory=96, tokens_generated=8),
    Request(Tenant("business", priority=2), estimated_kv_memory=80, tokens_generated=120),
]

incoming = Request(Tenant("enterprise", priority=4), estimated_kv_memory=80, tokens_generated=0)
scheduler.preempt_if_needed(incoming)
print("running tenants:", [request.tenant.name for request in scheduler.running_requests])

Output

Evicting KV cache for tenant=starter priority=1; the request will be recomputed if resumed.
Scheduling request tenant=enterprise priority=4.
running tenants: ['business', 'enterprise']

The miniature loop accounts for released KV memory before admitting the newcomer. A production scheduler must make release and allocation atomic so concurrent scheduling decisions can't over-admit work.
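One way to picture the atomicity requirement is a budget whose check and update happen inside a single critical section. The sketch below uses a thread lock for that. `KVBudget` is a hypothetical class, and a production scheduler would more likely serialize decisions in one scheduling loop or use its runtime's own allocator.

```python
import threading


class KVBudget:
    """Toy KV accounting where check-and-update happens under one lock."""

    def __init__(self, free_mb: int):
        self.free_mb = free_mb
        self._lock = threading.Lock()

    def reserve(self, needed_mb: int, released_mb: int = 0) -> bool:
        # released_mb is memory a preempted victim would hand back. It is
        # credited only if the admission succeeds, in the same critical section.
        with self._lock:
            available = self.free_mb + released_mb
            if available < needed_mb:
                return False
            self.free_mb = available - needed_mb
            return True


budget = KVBudget(free_mb=160)
results: list[bool] = []


def worker() -> None:
    results.append(budget.reserve(needed_mb=80))


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

admitted = results.count(True)
swap_ok = budget.reserve(needed_mb=80, released_mb=96)
print("admitted_of_8:", admitted)
print("swap_ok:", swap_ok, "free_mb:", budget.free_mb)
```

Eight concurrent attempts against 160 MB admit exactly two 80 MB requests, and the later preempt-and-admit step never exposes the released memory to a competing request.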

Hard per-tenant limits

Scheduling alone isn't enough. The platform also enforces hard quotas on context sizes and concurrent requests based on the tenant's tier. The illustration below summarizes the isolation spectrum from shared pools to stronger runtime boundaries:

Figure: Tenant isolation ladder from shared pool to namespace isolation to dedicated runtime boundary. Isolation is a cost-vs-risk dial: shared pools are cheapest, and dedicated runtime boundaries remove more shared paths for sensitive workloads.

Example admission policy (numbers are scenario inputs, not universal tiers):

Tenant Tier	Max Concurrent Requests	Max Context Length (tokens)	KV Cache Budget (GiB)
Enterprise	50	8,000	256
Business	20	4,000	80
Starter	5	2,000	20

Apply those limits before scheduling GPU work:

admit-under-tenant-kv-budget.py

TIERS = {
    "enterprise": {"max_concurrent": 50, "max_context": 8_000, "kv_gib": 256.0},
    "business": {"max_concurrent": 20, "max_context": 4_000, "kv_gib": 80.0},
    "starter": {"max_concurrent": 5, "max_context": 2_000, "kv_gib": 20.0},
}

def admit(tier: str, active_requests: int, context_tokens: int, projected_kv_gib: float) -> str:
    policy = TIERS[tier]
    if active_requests >= policy["max_concurrent"]:
        return "REJECT_CONCURRENCY_LIMIT"
    if context_tokens > policy["max_context"]:
        return "REJECT_CONTEXT_LIMIT"
    if projected_kv_gib > policy["kv_gib"]:
        return "REJECT_KV_BUDGET"
    return "ADMIT"

assert admit("starter", 5, 1_000, 3.0) == "REJECT_CONCURRENCY_LIMIT"
assert admit("starter", 2, 2_400, 3.0) == "REJECT_CONTEXT_LIMIT"
assert admit("starter", 2, 1_900, 22.0) == "REJECT_KV_BUDGET"
assert admit("enterprise", 12, 7_500, 180.0) == "ADMIT"
print("starter_at_capacity:", admit("starter", 5, 1_000, 3.0))
print("starter_long_prompt:", admit("starter", 2, 2_400, 3.0))
print("enterprise_request:", admit("enterprise", 12, 7_500, 180.0))

Output

starter_at_capacity: REJECT_CONCURRENCY_LIMIT
starter_long_prompt: REJECT_CONTEXT_LIMIT
enterprise_request: ADMIT

How we prevent one tenant from overwhelming the rest: rate limiting and fair queues

Rate limiting sits at the gateway, before a request ever reaches the GPU. It enforces two distinct budgets:

Requests per minute (RPM): Controls burst traffic to protect the API gateway from connection exhaustion.
Tokens per minute (TPM): Controls sustained throughput to protect GPU compute capacity.

The difference matters. A tenant sending one request with a 64K prompt consumes far more GPU time than a tenant sending one hundred requests with 100-token prompts, even though the first tenant sends fewer requests. An RPM-only limiter would admit the 64K prompt and let it monopolize the KV cache.

Distributed sliding-window enforcement

For RPM, a distributed sliding-window limiter using Redis with a Lua script gives consistent enforcement across all gateway nodes. A local in-memory limiter isn't enough because requests are load-balanced across many gateway instances.

The Lua script below removes entries older than the window, counts the remaining requests, and either allows the new request or rejects it. The key detail is using a unique sorted-set member (a request ID with a timestamp) instead of the raw timestamp alone. If two requests land in the same clock tick and you use the timestamp as both score and member, Redis collapses them into one entry and undercounts traffic:

distributed-sliding-window-enforcement.lua

-- Redis Lua Script for RPM Sliding-Window Limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2]) -- e.g., 60000 for a one-minute window
local now_ms = tonumber(ARGV[3])
local member = ARGV[4] -- unique request id, e.g. "1713468123456:req-9f3c"

-- Remove timestamped entries older than the window
redis.call('ZREMRANGEBYSCORE', key, 0, now_ms - window_ms)

-- Count current requests
local count = redis.call('ZCARD', key)

if count < limit then
    redis.call('ZADD', key, now_ms, member)
    redis.call('PEXPIRE', key, window_ms)
    return 1 -- Allowed
else
    return 0 -- Rejected
end
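The member-collision point is easy to reproduce off Redis. The sketch below models the sorted set as a Python dictionary from member to score, which has the same "one entry per member" behavior, and replays three requests that land in the same millisecond. It is an in-memory illustration of the script's logic, not a Redis client.

```python
def allow(window: dict[str, int], limit: int, window_ms: int, now_ms: int, member: str) -> bool:
    """In-memory model of the sliding-window script; dict keys act as sorted-set members."""
    # Drop entries whose score falls at or before the start of the window.
    for old in [m for m, score in window.items() if score <= now_ms - window_ms]:
        del window[old]
    if len(window) < limit:
        window[member] = now_ms   # re-adding an existing member only updates its score
        return True
    return False


now_ms = 1_713_468_123_456

unique_window: dict[str, int] = {}
unique_members = [
    allow(unique_window, 2, 60_000, now_ms, f"{now_ms}:req-{i}") for i in range(3)
]

collapsed_window: dict[str, int] = {}
timestamp_members = [
    allow(collapsed_window, 2, 60_000, now_ms, str(now_ms)) for _ in range(3)
]

print("unique_members:", unique_members)
print("timestamp_as_member:", timestamp_members)
```

With unique members the third request is rejected at a limit of two. With the timestamp as the member, all three pass because the set never grows beyond one entry.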

TPM is trickier because you don't know the final output length at admission time. In practice, reserve a budget based on prompt tokens plus max_output_tokens, then reconcile the counter with actual usage when the stream finishes.

For high-throughput services, strictly synchronized Redis limits can become a bottleneck. A platform may choose bounded burst allowance or approximate local counters, but that weakens strict limit semantics and must be documented and measured.

The miniature budget below walks through that reserve-then-reconcile flow: it holds prompt plus maximum allowed output tokens before execution, then returns the unused output tokens after the stream completes:

reserve-and-reconcile-token-budget.py

class TokenBudget:
    def __init__(self, remaining: int):
        self.remaining = remaining

    def reserve(self, prompt_tokens: int, max_output_tokens: int) -> int:
        reservation = prompt_tokens + max_output_tokens
        if reservation > self.remaining:
            raise ValueError("TPM budget exceeded")
        self.remaining -= reservation
        return reservation

    def reconcile(self, reservation: int, prompt_tokens: int, output_tokens: int) -> None:
        self.remaining += reservation - (prompt_tokens + output_tokens)

budget = TokenBudget(remaining=1_000)
held = budget.reserve(prompt_tokens=300, max_output_tokens=400)
budget.reconcile(held, prompt_tokens=300, output_tokens=120)

assert budget.remaining == 580
print("tokens_remaining_after_actual_usage:", budget.remaining)

Output

tokens_remaining_after_actual_usage: 580

Fairness inside the scheduler

RPM and TPM protect the gateway edge, but they don't fully solve scheduler fairness inside the serving engine. A tenant with one 64K prompt can consume far more GPU time than dozens of tenants sending short chat turns.

Inside the runtime, keep per-tenant queues and charge a virtual token budget for every admitted prefill chunk and every decode step. Then schedule by priority tier plus virtual finish time, not raw request count. That gives each tenant forward progress while still letting higher-SLA traffic buy more share.
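A small sketch of virtual-finish-time ordering makes the policy concrete. Each unit of work is charged its token cost divided by the tenant's weight, and the scheduler serves the smallest finish time first. The weights, tenant names, and the assumption that all work is queued at the same instant are simplifications for illustration; `fair_order` is a hypothetical helper.

```python
import heapq


def fair_order(work: list[tuple[str, int]], weights: dict[str, float]) -> list[tuple[str, int]]:
    """Order (tenant, tokens) work items by weighted virtual finish time."""
    last_finish: dict[str, float] = {}
    heap: list[tuple[float, int, str, int]] = []
    for seq, (tenant, tokens) in enumerate(work):
        # A tenant's next item starts where its previous item virtually finished.
        finish = last_finish.get(tenant, 0.0) + tokens / weights[tenant]
        last_finish[tenant] = finish
        heapq.heappush(heap, (finish, seq, tenant, tokens))  # seq breaks ties by arrival
    order: list[tuple[str, int]] = []
    while heap:
        _, _, tenant, tokens = heapq.heappop(heap)
        order.append((tenant, tokens))
    return order


weights = {"starter-bulk": 1.0, "enterprise-bulk": 4.0, "starter-chat": 1.0}
work = [
    ("starter-bulk", 2_048), ("starter-bulk", 2_048),        # two prefill chunks
    ("enterprise-bulk", 2_048), ("enterprise-bulk", 2_048),  # same cost, higher weight
    ("starter-chat", 100), ("starter-chat", 100),            # short chat turns
]

order = fair_order(work, weights)
print([tenant for tenant, _ in order])
```

Short chat turns run first even though they arrived last, the higher-weight tenant's chunks come next, and the low-weight bulk tenant still makes progress afterward instead of being starved.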

Don't collapse rate limiting and quota management into one counter. Rate limits cap bursts and protect the system, while quotas cap total usage and protect the budget. A tenant can stay under its RPM limit and still burn through its monthly token quota in one afternoon.
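Keeping the two counters separate might look like the sketch below, where a tenant stays well under its request rate and still exhausts a monthly token quota. The limits, the token sizes, and the `TenantUsage` class are hypothetical values chosen to make the effect visible in a few iterations.

```python
from dataclasses import dataclass


@dataclass
class TenantUsage:
    rpm_limit: int
    monthly_token_quota: int
    requests_this_minute: int = 0
    tokens_this_month: int = 0

    def start_new_minute(self) -> None:
        self.requests_this_minute = 0   # the rate window resets; the quota does not

    def check(self, tokens: int) -> str:
        if self.requests_this_minute >= self.rpm_limit:
            return "REJECT_RATE_LIMIT"
        if self.tokens_this_month + tokens > self.monthly_token_quota:
            return "REJECT_QUOTA"
        self.requests_this_minute += 1
        self.tokens_this_month += tokens
        return "ADMIT"


usage = TenantUsage(rpm_limit=10, monthly_token_quota=1_000_000)
results: list[str] = []
for minute in range(5):
    usage.start_new_minute()
    for _ in range(5):                      # 5 requests per minute, half the RPM limit
        results.append(usage.check(tokens=50_000))

print("admitted:", results.count("ADMIT"))
print("rate_limited:", results.count("REJECT_RATE_LIMIT"))
print("quota_rejected:", results.count("REJECT_QUOTA"))
```

The rate limiter never fires, yet the quota is gone after four minutes, which is why the two signals need separate counters and separate error codes.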

How we keep data private: the isolation stack

Every state-bearing layer needs an authorization boundary and a testable release policy. A cross-tenant retrieval, adapter, cache, or KV access is a security incident even if the other layers behaved correctly.

The RAG relevance vs. authorization gap

Many multi-tenant platforms augment LLMs with retrieval-augmented generation (RAG). A vector database finds the most relevant documents for a query. In multi-tenancy, "relevant" doesn't mean "allowed."

Tenant X searches for "incident-retention exception policy." The vector DB might find a highly relevant private postmortem that belongs to Tenant Y, because both teams use the same reliability vocabulary. Without a hard filter, the LLM could summarize Tenant Y's confidential incident details and return them to Tenant X.

Use authorization filtering in the retrieval operation itself. An application should pass authorized scope into the database query, and tests should fail if another tenant's result can cross that boundary:

tenant-filtered-retrieval.py

documents = [
    {"tenant": "tenant-a", "text": "Retention exception policy A", "score": 0.88},
    {"tenant": "tenant-b", "text": "Private postmortem details B", "score": 0.99},
]

def authorized_search(tenant: str, top_k: int) -> list[str]:
    allowed = [doc for doc in documents if doc["tenant"] == tenant]
    ranked = sorted(allowed, key=lambda doc: doc["score"], reverse=True)
    return [doc["text"] for doc in ranked[:top_k]]

results = authorized_search("tenant-a", top_k=1)
assert results == ["Retention exception policy A"]
assert all("postmortem details B" not in text for text in results)
print("authorized_results:", results)

Output

authorized_results: ['Retention exception policy A']

The authorization predicate is applied before candidate results leave the data layer. Post-filtering after a broad top-K query can return foreign identifiers, scores, or content to application memory and may also erase every authorized candidate.
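The post-filtering failure mode can be reproduced with a small in-memory example. The `corpus` list and both function names are illustrative stand-ins for a vector store query, not a real database API.

```python
corpus = [
    {"tenant": "tenant-b", "text": "Private postmortem details B", "score": 0.99},
    {"tenant": "tenant-b", "text": "Private retention runbook B", "score": 0.95},
    {"tenant": "tenant-a", "text": "Retention exception policy A", "score": 0.88},
]

def post_filtered_search(tenant: str, top_k: int) -> list[str]:
    # Anti-pattern: rank across every tenant, then filter in application code.
    candidates = sorted(corpus, key=lambda doc: doc["score"], reverse=True)[:top_k]
    return [doc["text"] for doc in candidates if doc["tenant"] == tenant]

def pre_filtered_search(tenant: str, top_k: int) -> list[str]:
    # Apply the tenant predicate first, then rank only authorized documents.
    allowed = [doc for doc in corpus if doc["tenant"] == tenant]
    ranked = sorted(allowed, key=lambda doc: doc["score"], reverse=True)[:top_k]
    return [doc["text"] for doc in ranked]

# Tenant B's documents fill the top-2 slots, so post-filtering leaves tenant A with nothing.
assert post_filtered_search("tenant-a", top_k=2) == []
assert pre_filtered_search("tenant-a", top_k=2) == ["Retention exception policy A"]
print("post_filtered:", post_filtered_search("tenant-a", top_k=2))
print("pre_filtered:", pre_filtered_search("tenant-a", top_k=2))
```

In the post-filtered path, Tenant B's records were also loaded into `candidates` inside the application process before being discarded, which is exactly the exposure the in-query predicate avoids.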

Model and prompt isolation

LoRA adapter isolation. Authorize adapter IDs against the tenant before loading encrypted artifacts. If adapters contain sensitive tenant tuning, include residency and release rules in the data contract.
Prompt isolation. Keep raw prompts out of plaintext logs and authenticate/encrypt gateway-to-worker transport. The exact mTLS (mutual Transport Layer Security) and memory-boundary policy depends on the deployment threat model.
Harder runtime boundaries. For workloads whose threat model rules out shared workers, use dedicated node pools, MIG (Multi-Instance GPU) partitions, or VM boundaries as evaluated controls.[12] Kubernetes placement selects hardware; it doesn't create isolation by itself.
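The adapter rule above (authorize the adapter ID against the tenant before any artifact is loaded) might look like the sketch below. The `ADAPTER_OWNERS` registry, the adapter IDs, and the returned storage key format are all hypothetical.

```python
# Hypothetical ownership registry; in practice this would live in the control plane.
ADAPTER_OWNERS = {
    "support-tone-v1": "tenant-a",
    "legal-summaries-v3": "tenant-b",
}

class AdapterNotAuthorized(PermissionError):
    pass

def resolve_adapter(tenant: str, adapter_id: str) -> str:
    """Return a storage key only after the tenant is confirmed as the adapter's owner."""
    owner = ADAPTER_OWNERS.get(adapter_id)
    if owner != tenant:
        # Unknown and foreign adapters raise the same error, so callers
        # cannot probe which adapter IDs exist for other tenants.
        raise AdapterNotAuthorized(f"adapter not available: {adapter_id}")
    return f"adapters/{tenant}/{adapter_id}"      # illustrative artifact location

assert resolve_adapter("tenant-a", "support-tone-v1") == "adapters/tenant-a/support-tone-v1"

try:
    resolve_adapter("tenant-a", "legal-summaries-v3")
    foreign_adapter_blocked = False
except AdapterNotAuthorized:
    foreign_adapter_blocked = True

assert foreign_adapter_blocked
print("foreign_adapter_blocked:", foreign_adapter_blocked)
```

The check runs before any decrypt or load step, so a failed authorization never touches the artifact.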

State sanitization

When a shared worker serves more than one tenant, state lifecycle rules matter:

KV reference lifecycle. Remove request access to released KV blocks when a sequence completes. If the threat model requires cleared memory before cross-tenant reuse, implement and verify that clearing policy rather than assuming the allocator provides it.
Batch construction policy. Multi-tenant batching is fine on a shared base model, but every row in the batch must keep its own tenant ID, adapter handle, KV block table, and metering context. For regulated workloads, the simpler answer is dedicated pools or MIG / VM boundaries instead of trying to harden every shared-kernel path.
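The KV lifecycle rule can be modeled as bookkeeping around a block pool. This sketch tracks block IDs only; `_clear` is a stand-in for zeroing device memory, and `KVBlockPool` with its fields is a hypothetical name rather than any serving engine's allocator.

```python
class KVBlockPool:
    """Hypothetical bookkeeping model of KV block ownership (no real GPU memory)."""

    def __init__(self, num_blocks: int, clear_on_cross_tenant_reuse: bool = True):
        self.free = list(range(num_blocks))
        self.block_tables = {}        # request_id -> (tenant, [block ids])
        self.last_owner = {}          # block id -> tenant that last wrote it
        self.cleared = []             # audit trail of cleared blocks
        self.clear_on_cross_tenant_reuse = clear_on_cross_tenant_reuse

    def allocate(self, request_id: str, tenant: str, count: int) -> list[int]:
        if count > len(self.free):
            raise MemoryError("KV budget exceeded")
        blocks = [self.free.pop() for _ in range(count)]
        for block in blocks:
            previous = self.last_owner.get(block)
            if self.clear_on_cross_tenant_reuse and previous not in (None, tenant):
                self._clear(block)    # explicit policy, not an allocator assumption
            self.last_owner[block] = tenant
        self.block_tables[request_id] = (tenant, blocks)
        return blocks

    def release(self, request_id: str) -> None:
        # Dropping the block table revokes the request's access to its blocks.
        _, blocks = self.block_tables.pop(request_id)
        self.free.extend(blocks)

    def _clear(self, block: int) -> None:
        self.cleared.append(block)    # stand-in for overwriting device memory

pool = KVBlockPool(num_blocks=2)
pool.allocate("req-1", "tenant-a", count=2)
pool.release("req-1")
pool.allocate("req-2", "tenant-b", count=2)   # reuses blocks last written by tenant-a

assert "req-1" not in pool.block_tables
assert sorted(pool.cleared) == [0, 1]
print("blocks_cleared_before_cross_tenant_reuse:", sorted(pool.cleared))
```

The `cleared` list gives a test something to assert on, which is the point of the rule: the clearing policy is verified rather than assumed.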

PII masking as a tiered control

For regulated workloads that can tolerate redaction, requests can pass through a lightweight PII masking service (e.g., Presidio[13]) before they hit the model router. This reduces the chance that the LLM ever sees raw credit card numbers or Social Security numbers, while still letting downstream systems map placeholders back to original values when needed.
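To make the placeholder round trip concrete, here is a regex-only sketch. It does not use Presidio's API; the two patterns, the placeholder format, and the function names are simplifying assumptions, and real detection needs far broader coverage.

```python
import re

# Deliberately narrow illustrative patterns; not a complete PII detector.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with numbered placeholders and return the reverse mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def replace(match: re.Match, label: str = label) -> str:
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(replace, text)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore original values; only a trusted downstream system should hold the mapping."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

original = "Refund card 4111 1111 1111 1111 for SSN 123-45-6789."
masked, mapping = mask(original)

assert masked == "Refund card <CARD_2> for SSN <SSN_1>."
assert unmask(masked, mapping) == original
print("masked_prompt:", masked)
```

The mapping stays outside the model path, so the LLM only ever sees the placeholders while an authorized caller can still restore the values.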

Tenant A request trace where auth, PII masking, tenant-filtered retrieval, and GPU worker state stay on one identity thread, while sensitive values are replaced with placeholders and runtime state remains isolated per tenant. — Masking lowers exposure, but tenant identity still has to survive retrieval, routing, adapter choice, and KV ownership.

How we attribute cost: per-tenant metering and chargeback

A shared fleet only stays profitable if you can answer one question per tenant: what did this workspace cost us, and what should we charge? Token counts alone are a weak proxy because two requests with the same token counts can consume very different GPU time depending on prompt-vs-output split, batch occupancy, preemptions, and cache hits.

A defensible metering record attaches to every request and carries: tenant_id, model and adapter version, prompt tokens, output tokens, cache-hit tokens, queue wait, prefill time, decode time, KV blocks held, preemption count, and GPU worker type. Cache-hit tokens matter because reused prefix blocks skip prefill compute. A public billing policy may choose a distinct cached-input rate, as current provider pricing documents illustrate.[14][15]
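The field list above maps naturally onto an immutable record type. The sketch below is one possible shape; the field names, the millisecond units, and the validation rule are assumptions rather than a fixed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MeteringRecord:
    """Hypothetical per-request metering record; units are an assumption."""
    tenant_id: str
    model_version: str
    adapter_version: Optional[str]
    prompt_tokens: int
    output_tokens: int
    cache_hit_tokens: int
    queue_wait_ms: float
    prefill_ms: float
    decode_ms: float
    kv_blocks_held: int
    preemption_count: int
    gpu_worker_type: str

    def __post_init__(self) -> None:
        # Cached tokens are a subset of the prompt, so they can never exceed it.
        if not 0 <= self.cache_hit_tokens <= self.prompt_tokens:
            raise ValueError("cache_hit_tokens must be between 0 and prompt_tokens")

    @property
    def uncached_prompt_tokens(self) -> int:
        # Prompt tokens that actually required prefill compute.
        return self.prompt_tokens - self.cache_hit_tokens

record = MeteringRecord(
    tenant_id="tenant-a", model_version="base-2026-01", adapter_version="v2",
    prompt_tokens=1_200, output_tokens=150, cache_hit_tokens=800,
    queue_wait_ms=12.0, prefill_ms=35.0, decode_ms=900.0,
    kv_blocks_held=85, preemption_count=0, gpu_worker_type="example-gpu",
)

assert record.uncached_prompt_tokens == 400
print("uncached_prompt_tokens:", record.uncached_prompt_tokens)
```

Freezing the record keeps billing inputs from being mutated after the request completes, and separating cached from uncached prompt tokens supports a distinct cached-input rate if the billing policy uses one.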

For internal cost allocation rather than customer billing, the honest unit is GPU-time, not tokens. A reasonable per-request cost estimate looks like:

$\text{cost}_{\text{req}} \approx \text{gpu\_seconds} \times \text{node\_hourly\_rate} / 3600 + \text{adapter\_residency} + \text{storage}$

where gpu_seconds is the request's share of busy GPU time (prefill plus its decode steps, divided by batch occupancy so shared steps are split across co-batched tenants). Charge customers on the simpler token-and-tier dimensions, but reconcile against measured GPU-time so you can spot tenants whose traffic shape (long prompts, low batchability, adapter thrash) costs far more than their token bill suggests.

meter-shared-gpu-cost.py

records = [
    {"tenant": "tenant-a", "gpu_seconds": 0.40, "cache_hit_tokens": 800},
    {"tenant": "tenant-b", "gpu_seconds": 1.25, "cache_hit_tokens": 0},
]
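# Illustrative hourly price (USD) for the capacity that gpu_seconds is measured against.
NODE_HOURLY_RATE = 8.00

def compute_dollars(record: dict) -> float:
    # Convert the request's share of busy GPU time into dollars at the hourly rate.
    return record["gpu_seconds"] * NODE_HOURLY_RATE / 3600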

costs = {record["tenant"]: compute_dollars(record) for record in records}
assert costs["tenant-b"] > costs["tenant-a"]
print("metered_gpu_cost_usd:", {tenant: round(value, 6) for tenant, value in costs.items()})

Output

metered_gpu_cost_usd: {'tenant-a': 0.000889, 'tenant-b': 0.002778}
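The `gpu_seconds` inputs above have to come from somewhere. One simple attribution, assuming each serving step's wall time is split equally across the rows in its batch, is sketched below; the step durations and the equal-split rule are illustrative, and a production meter might weight rows by tokens processed instead.

```python
def attribute_gpu_seconds(steps: list[tuple[float, list[str]]]) -> dict[str, float]:
    """Split each step's duration equally across its co-batched rows.

    Each step is (step_seconds, [tenant of each row in the batch]).
    """
    totals: dict[str, float] = {}
    for step_seconds, rows in steps:
        share = step_seconds / len(rows)          # divide by batch occupancy
        for tenant in rows:
            totals[tenant] = totals.get(tenant, 0.0) + share
    return totals

steps = [
    (0.30, ["tenant-b"]),                                       # long prefill, batch of one
    (0.02, ["tenant-a", "tenant-b"]),                           # shared decode step
    (0.02, ["tenant-a", "tenant-a", "tenant-b", "tenant-b"]),   # fuller decode step
]
shares = attribute_gpu_seconds(steps)

assert abs(shares["tenant-a"] - 0.02) < 1e-9
assert abs(shares["tenant-b"] - 0.32) < 1e-9
assert abs(sum(shares.values()) - 0.34) < 1e-9    # attribution conserves busy time
print("gpu_seconds_by_tenant:", {t: round(s, 3) for t, s in shares.items()})
```

Tenant B's unbatched prefill dominates its bill even though both tenants share the decode steps evenly, which is the traffic-shape effect that token counts alone hide.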

How the system grows: scaling, canary, and fault tolerance

A production platform should handle spikes, deploy new models with release gates, and degrade predictably during failures.

Auto-scaling on queue depth

CPU utilization alone doesn't describe LLM serving pressure. Queue depth, admitted-token backlog, inference latency, and KV-cache utilization provide signals for capacity and admission decisions. When new workers can't become ready in time, shed or downgrade eligible best-effort work instead of silently violating every tenant's objective.

Cold-starting GPU workers includes making model weights available before the pool can serve requests. Depending on cost and latency objectives, a platform may keep warm capacity, forecast predictable demand, or reject low-priority excess load while workers start.
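A scaling decision built on these signals could be sketched as follows. The thresholds, the `FleetSignals` fields, and the rule that sheds best-effort work when workers cannot start within the backlog objective are all assumptions chosen to illustrate the reasoning, not tuned values.

```python
from dataclasses import dataclass

@dataclass
class FleetSignals:
    admitted_token_backlog: int      # tokens admitted but not yet processed
    kv_cache_utilization: float      # fraction of KV blocks in use, 0.0 to 1.0
    ready_workers: int
    worker_startup_seconds: float    # time to load weights and pass readiness

def plan_capacity(
    signals: FleetSignals,
    tokens_per_worker_per_second: float,
    max_backlog_seconds: float,
    kv_high_watermark: float = 0.90,
) -> dict[str, object]:
    capacity = signals.ready_workers * tokens_per_worker_per_second
    backlog_seconds = (
        signals.admitted_token_backlog / capacity if capacity > 0 else float("inf")
    )
    scale_up = (
        backlog_seconds > max_backlog_seconds
        or signals.kv_cache_utilization > kv_high_watermark
    )
    # New workers only help if they become ready before the objective is already missed.
    shed_best_effort = scale_up and signals.worker_startup_seconds > max_backlog_seconds
    return {
        "backlog_seconds": backlog_seconds,
        "scale_up": scale_up,
        "shed_best_effort": shed_best_effort,
    }

spike = FleetSignals(
    admitted_token_backlog=600_000, kv_cache_utilization=0.70,
    ready_workers=4, worker_startup_seconds=120.0,
)
decision = plan_capacity(spike, tokens_per_worker_per_second=5_000, max_backlog_seconds=10.0)

assert decision == {"backlog_seconds": 30.0, "scale_up": True, "shed_best_effort": True}
print("capacity_decision:", decision)
```

Here the backlog would take 30 seconds to drain, the objective is 10 seconds, and a new worker needs 120 seconds to start, so the sketch both requests capacity and sheds eligible best-effort work in the meantime.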

Model versioning and canary rollouts

Models and LoRA adapters require versioned rollout because a new artifact can degrade generation quality or latency.

In one canary rollout policy, when a tenant deploys a new adapter version (for example, v2), the router sends a controlled slice of eligible traffic to it while the rest continues on v1. Monitor quality evaluation, latency, error rates, and safety signals over a predefined observation window.

If gates pass, the router can increase exposure. If a gate fails, route new eligible requests back to v1; active streams and adapter residency still need explicit handling, so a routing change isn't a blanket zero-downtime promise.
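The routing half of that policy can be sketched with a deterministic hash, so the same key always lands on the same version and a failed gate sends all new requests back to v1. Hashing on a session key and the function names are assumptions; active streams and adapter residency are outside this sketch.

```python
import hashlib

def canary_bucket(routing_key: str) -> float:
    """Map a key to a stable value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def choose_adapter_version(routing_key: str, canary_fraction: float, gates_passing: bool) -> str:
    if not gates_passing:
        return "v1"                       # rollback: new requests return to the stable version
    return "v2" if canary_bucket(routing_key) < canary_fraction else "v1"

session_keys = [f"tenant-a/session-{i}" for i in range(10_000)]   # hypothetical routing keys
canary_share = sum(
    choose_adapter_version(key, canary_fraction=0.10, gates_passing=True) == "v2"
    for key in session_keys
) / len(session_keys)

# Roughly a tenth of sessions reach v2, and a failed gate removes all of them.
assert 0.08 < canary_share < 0.12
assert all(choose_adapter_version(k, 0.10, gates_passing=False) == "v1" for k in session_keys)
print("canary_share:", round(canary_share, 3))
```

Increasing exposure is then a matter of raising `canary_fraction`; sessions already on v2 stay there because their bucket value does not change.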

For base model updates, the process is more complex. Unlike lightweight adapters, base models require spinning up entirely new GPU worker pools. The platform routes shadow traffic (duplicate asynchronous requests) to the new base model cluster to validate correctness and measure throughput before exposing it to real tenant traffic. Once validated, the gateway shifts live traffic to the new cluster and gracefully drains the old one.

Fault tolerance

Reliability depends on handling GPU failures and model crashes gracefully:

Strategy	Trigger Condition	Action Taken	Architectural Impact
Dead Letter Queues (DLQ)	Repeated CUDA out-of-memory (OOM) or crashes	Move a repeatedly failing request to DLQ after a bounded retry count	Stops that request from causing an unbounded retry loop
Circuit Breaking	Model/adapter error rate crosses a configured threshold	Fast-fail new requests for that specific adapter	Limits repeated work while operators investigate
Active Health Checks	Missed node heartbeats (e.g., stuck kernel)	Mark unhealthy, stop new routing, drain or fail in-flight work according to policy	Removes a suspected worker from new admission
Zone Redundancy	Entire Availability Zone failure	Shift eligible traffic to healthy zones, subject to spare capacity	Reduces zone-failure impact when capacity is available

These mechanisms need a control plane or equivalent coordination layer to track worker readiness, orchestrate model deployments, and update routing. Readiness checks reduce routing to unavailable workers; they don't prove model quality or prevent every runtime failure.
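As one example from the table, the circuit-breaking row could be sketched per adapter like this. The threshold, minimum sample size, cooldown, and class name are illustrative assumptions; timestamps are passed in explicitly to keep the example deterministic.

```python
class AdapterCircuitBreaker:
    """Hypothetical breaker: opens when the error rate crosses a threshold."""

    def __init__(self, error_threshold: float, min_requests: int, cooldown_seconds: float):
        self.error_threshold = error_threshold
        self.min_requests = min_requests
        self.cooldown_seconds = cooldown_seconds
        self.successes = 0
        self.failures = 0
        self.opened_at = None            # None means the breaker is closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: reset counters and let traffic probe the adapter again.
            self.opened_at = None
            self.successes = self.failures = 0
            return True
        return False                     # fast-fail while the breaker is open

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_requests and self.failures / total >= self.error_threshold:
            self.opened_at = now

# One breaker per adapter, so a bad adapter does not block the tenant's other traffic.
breakers = {"tenant-a/adapter-v2": AdapterCircuitBreaker(0.5, min_requests=4, cooldown_seconds=30.0)}
breaker = breakers["tenant-a/adapter-v2"]
for t, ok in enumerate([True, False, False, False]):
    breaker.record(ok, now=float(t))

assert breaker.allow(now=4.0) is False     # 3 of 4 failed, breaker opened at t=3
assert breaker.allow(now=33.0) is True     # cooldown elapsed, probing resumes
print("breaker_reopened_for_probe:", breaker.opened_at is None)
```

Scoping the breaker to a single adapter key matches the table's intent: requests for that adapter fail fast while other adapters and the shared base model keep serving.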

Try it yourself

Build three practice levels to internalize the concepts:

Level 1: tenant-aware API wrapper. Implement a simple FastAPI endpoint that accepts a tenant_id header, forwards the body to an OpenAI-compatible API, and logs input tokens, output tokens, and latency per tenant. Add a basic in-memory RPM limiter that rejects requests when a tenant exceeds 10 requests per minute.
Level 2: LoRA adapter swap measurement. Use vLLM or LoRAX to serve two different LoRA adapters on one base model. Send alternating requests for Adapter A and Adapter B and measure the latency of each swap. Does the first request after a swap take longer than subsequent ones? Can you warm both adapters simultaneously if GPU memory allows?
Level 3: isolation contract tests. Build a local router with tenant-filtered retrieval, tenant-scoped prefix-cache keys, and adapter authorization. Write adversarial tests that submit the same query or prefix from two tenants and assert that no foreign document, cache hit, or adapter route is exposed.

Mastery check

Suppose your platform serves three tenant tiers on one GPU fleet. Enterprise tenants have tighter latency objectives and stronger privacy boundaries, business tenants need adapter customization, and starter tenants need low cost. Explain how you would route requests, isolate retrieval and KV state, enforce fairness, and roll out a new adapter version without causing a cross-tenant leak or a fleet-wide latency spike.

Key concepts

Share weights, isolate state. The base model, scheduler, and worker pool can be shared, but tenant identity must stay attached to retrieval, adapters, KV blocks, and billing.
KV memory is the central bottleneck. Large prompts and long chats compete for VRAM before raw FLOPS become the main limit.
Continuous batching needs fairness policy. Good schedulers keep the GPU full while still honoring tenant quotas, priorities, and latency goals.
LoRA reduces the weight overhead of customization. Tenant-specific behavior can come from small adapters with measured memory footprints rather than separate full model copies.
Authorization must happen inside retrieval. Relevance isn't authorization; every RAG query must enforce tenant filters at the database layer.

What strong answers show

You're in good shape if you can:

explain why a shared base model is economically necessary and why KV memory still has to be budgeted per tenant
describe how continuous batching, chunked prefill, preemption, and hard KV limits work together
justify when shared pools are enough and when dedicated pools, MIG slices, or VMs are required
design cost attribution that combines tokens, cache hits, queue time, and GPU-seconds instead of relying on token count alone
explain how retrieval filters, prefix-cache namespaces, adapter routing, and block scrubbing prevent cross-tenant leaks

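Follow-up questions

How do you handle noisy neighbors during an incident or deploy burst?
How should the system degrade gracefully when load exceeds safe capacity?
Why are base-model rollbacks harder than adapter rollbacks?
What trace fields do you need when one tenant reports a sudden latency spike?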

Common pitfalls

Symptom: Tenant B sees wording or policy hints that belong to Tenant A.
Cause: KV state, prefix-cache entries, adapter buffers, or other attention-side state crossed a tenant boundary because ownership or namespace validation failed.
Fix: Revoke released request block-table references, keep reusable prefix entries inside authorized tenant or trust-group namespaces, and scrub memory when the threat model requires it. For stricter workloads, move the tenant to dedicated infrastructure.
Symptom: Capacity plan asks for far more GPUs than the traffic needs.
Cause: The plan assumes one tenant maps to one slice of compute and ignores batching, burstiness, and shared adapters.
Fix: Benchmark node throughput on the real model, scheduler, and context mix. Size the fleet from measured queue depth, TTFT, and KV pressure instead of a hand-wavy tenant-to-GPU ratio.
Symptom: One long-context tenant causes everybody else's latency to spike.
Cause: That tenant monopolizes KV memory because the platform lacks hard context and KV budget limits.
Fix: Enforce per-tier context caps and KV budgets at admission time. Preempt or reject oversized requests before they occupy shared blocks.
Symptom: Retrieved answers are relevant but unauthorized.
Cause: Retrieval was filtered after vector search instead of inside the database query.
Fix: Push tenant_id into the retrieval predicate itself. Never let foreign document IDs or scores leave the vector store.
Symptom: Pricing looks fair by tokens, but some tenants are still unprofitable.
Cause: Token counts hide expensive traffic shapes such as long prefills, low batchability, repeated preemption, or adapter cache thrash.
Fix: Reconcile customer billing against measured GPU-seconds, cache hits, queue time, and adapter residency so internal cost tracks actual fleet burn.

Next Step

Continue to LLM-Powered Search Engine

You'll design a search and answer system that combines retrieval, ranking, synthesis, and citation, while applying the tenant-scoped retrieval and context controls introduced here.


References

[1] NVIDIA. H100 GPU. 2026.
[2] Yu, G.-I., et al. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
[3] Hu, E. J., et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR. 2021.
[4] Sheng, Y., et al. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv preprint. 2023.
[5] vLLM. LoRA Adapters. 2026.
[6] Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., & Krishnamurthy, A. Punica: Multi-Tenant LoRA Serving. 2023.
[7] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
[8] Zheng, L., Yin, L., Xie, Z., et al. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104. 2023.
[9] vLLM. Automatic Prefix Caching. 2026.
[10] Agrawal, A., et al. Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv preprint. 2023.
[11] vLLM. Optimization and Tuning. 2026.
[12] NVIDIA. Supported GPUs. 2026. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-gpus.html
[13] Microsoft Presidio. Presidio: Data Protection and De-identification SDK. GitHub. 2023. https://github.com/microsoft/presidio
[14] OpenAI. Prompt caching. 2026. https://developers.openai.com/api/docs/guides/prompt-caching
[15] Anthropic. Prompt caching. Official documentation. 2026. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

