The inference engine you choose can matter more to your cost per token than the model itself. We ran vLLM, SGLang, TensorRT-LLM, and Ollama through identical benchmarks on H100 GPUs to find out which engine actually wins for your workload.
You have picked the perfect model for your application. Now comes the part that will actually cost you money every month: the inference engine. The difference between the right and wrong choice here can mean a 2x difference in throughput, a 10x difference in cold-start time, and months of DevOps headache either avoided or endured.
In this guide, we run four inference engines through identical benchmarks on H100 GPUs, explain how each architecture makes its tradeoffs, and give you a decision framework so you can pick the right one for your workload. No fluff. Just the numbers and what they mean for your system.
Before diving into the benchmarks, here is how the four engines stack up against each other on the dimensions that matter most for production deployments.
| Engine | Best For | Throughput Range | Cold Start | Hardware |
|---|---|---|---|---|
| vLLM | General production, widest model support | 1,000 to 2,400 tok/s | ~62 sec | NVIDIA, AMD, TPU, Trainium |
| SGLang | Shared-prefix workloads, RAG, multi-turn chat | 1,200 to 2,500 tok/s | ~58 sec | NVIDIA, some AMD |
| TensorRT-LLM | Maximum throughput, fixed models | 1,300 to 2,800 tok/s | ~28 min (compile) | NVIDIA only |
| Ollama | Local development, fast prototyping | 60 to 100 tok/s | ~30 sec | CPU, Mac, consumer GPU |
💡 Key insight: All four engines expose an OpenAI-compatible API. Switching between them is a base URL change in your application code, not a rewrite. This means you can prototype with Ollama and deploy to production with vLLM or SGLang using the same SDK.
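As a concrete sketch of that base-URL swap, the helper below builds constructor arguments for the OpenAI SDK client per engine. The ports shown are common defaults (vLLM serves on 8000, SGLang on 30000, Ollama on 11434), but verify them against your own deployment; the model name in the usage comment is illustrative.

```python
# Illustrative defaults; confirm ports and paths for your deployment.
ENGINE_BASE_URLS = {
    "vllm": "http://localhost:8000/v1",
    "sglang": "http://localhost:30000/v1",
    "trt-llm": "http://localhost:8000/v1",   # via an OpenAI-compatible frontend
    "ollama": "http://localhost:11434/v1",
}

def client_kwargs(engine: str) -> dict:
    """Build constructor kwargs for the OpenAI SDK client.

    Only base_url changes per engine; the rest of your application
    code (the chat.completions.create calls, etc.) stays identical.
    """
    return {
        "base_url": ENGINE_BASE_URLS[engine],
        "api_key": "not-needed-locally",  # local engines ignore the key
    }

# Usage with the real SDK (requires `pip install openai` and a running server):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("ollama"))
#   resp = client.chat.completions.create(model="llama3.3", messages=[...])
```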
We ran all engines on a single H100 SXM5 80GB GPU with Llama 3.3 70B Instruct at FP8 precision, using 512-token inputs and 256-token outputs across concurrency levels of 1, 10, 50, and 100 simultaneous requests. The methodology used async Python clients with a 60-second warm-up period before each 3-minute measurement window.[1]
The results below represent tokens per second of output generation, which is what users actually perceive as speed. Time to first token (TTFT) numbers are also included because for interactive applications, waiting for the first word matters as much as overall throughput.
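A stripped-down version of such a harness looks like the following. The `generate()` stub stands in for a real streaming HTTP request to the engine (which would count tokens as they arrive); everything else, including the worker loop and the aggregate tokens-per-second calculation, mirrors the measurement structure described above.

```python
import asyncio
import time

async def generate(prompt: str) -> int:
    """Stub for a real streaming request to the engine's API.

    A real harness would POST to /v1/completions and count streamed
    tokens; here we simulate the latency and return a fixed count.
    """
    await asyncio.sleep(0.01)  # stand-in for network + generation time
    return 256                 # pretend we received 256 output tokens

async def measure_throughput(concurrency: int, duration_s: float) -> float:
    """Run `concurrency` request loops for `duration_s` seconds and
    return aggregate output tokens per second."""
    total_tokens = 0
    deadline = time.monotonic() + duration_s

    async def worker() -> None:
        nonlocal total_tokens
        while time.monotonic() < deadline:
            total_tokens += await generate("512-token prompt goes here")

    start = time.monotonic()
    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return total_tokens / (time.monotonic() - start)

tok_per_s = asyncio.run(measure_throughput(concurrency=10, duration_s=0.2))
```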
At 1 concurrent request, all three datacentre engines perform similarly because the bottleneck is not the engine but the model's autoregressive generation speed. The gap opens as concurrency increases.
| Concurrency | vLLM (tok/s) | SGLang (tok/s) | TensorRT-LLM (tok/s) |
|---|---|---|---|
| 1 | 120 | 125 | 130 |
| 10 | 650 | 680 | 710 |
| 50 | 1,850 | 1,920 | 2,100 |
| 100 | 2,400 | 2,460 | 2,780 |
TensorRT-LLM leads at every concurrency level, with the margin ranging from 8% at low load to 16% at 100 concurrent requests. SGLang sits between vLLM and TensorRT-LLM on raw throughput, but this comparison uses unique prompts throughout. SGLang's advantage appears when requests share prefixes, and in that scenario it can deliver 29% higher throughput than vLLM.
TTFT is the metric that determines whether your application feels responsive. A user staring at a blank screen for 500ms experiences your product as slow, regardless of how fast tokens arrive after that.
| Concurrency | vLLM p50 | vLLM p95 | TRT-LLM p50 | TRT-LLM p95 | SGLang p50 | SGLang p95 |
|---|---|---|---|---|---|---|
| 1 | 45 ms | 68 ms | 38 ms | 55 ms | 42 ms | 61 ms |
| 10 | 120 ms | 195 ms | 105 ms | 170 ms | 112 ms | 178 ms |
| 50 | 380 ms | 720 ms | 340 ms | 620 ms | 360 ms | 680 ms |
| 100 | 740 ms | 1,450 ms | 680 ms | 1,280 ms | 710 ms | 1,380 ms |
TensorRT-LLM delivers the lowest p95 TTFT at every concurrency level. At 100 concurrent requests, the difference between TRT-LLM's p95 of 1,280ms and vLLM's 1,450ms is noticeable in interactive applications. SGLang's p95 sits between the two others.
Understanding why these engines produce different numbers helps you reason about which one fits your workload, not just today's workload but tomorrow's as traffic patterns evolve.
vLLM introduced PagedAttention, and it remains the engine's defining innovation.[2] The core problem vLLM solved: 60 to 80% of the GPU memory reserved for the KV cache (the attention activations that store everything the model has "seen" in a sequence) was being wasted, because the cache had to be allocated as one contiguous block per request. If you had 80GB of VRAM and a 70B parameter model using 70GB, you had roughly 10GB left for the KV cache across all concurrent requests, and fragmentation made even that 10GB mostly unusable.
PagedAttention borrowed virtual memory paging from operating systems. Instead of one large contiguous block, the KV cache lives in small fixed-size pages that can be stored anywhere in GPU memory. Each sequence grows its cache page by page on demand. When a request finishes, its pages are immediately freed. Memory waste drops from 60-80% to under 4%, and vLLM can serve far more concurrent requests on the same hardware.
Combined with continuous batching (starting new requests at the iteration level rather than waiting for a full batch to complete), PagedAttention made vLLM the production default when it launched, and it remains there.
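The page-table bookkeeping behind this can be sketched in a few lines of Python. This is a toy model of the idea, not vLLM's implementation: real pages are GPU tensor blocks, while here a page is just an integer ID, which is enough to show on-demand growth and immediate reclamation.

```python
class PagedKVCache:
    """Toy model of a PagedAttention-style block allocator."""

    def __init__(self, num_pages: int, tokens_per_page: int = 16):
        self.free_pages = list(range(num_pages))
        self.tokens_per_page = tokens_per_page
        self.sequences = {}  # request id -> {"pages": [...], "tokens": int}

    def append_token(self, request_id: str) -> None:
        """Grow a request's KV cache by one token, allocating a fresh
        page only when the current page fills up."""
        seq = self.sequences.setdefault(request_id, {"pages": [], "tokens": 0})
        if seq["tokens"] % self.tokens_per_page == 0:
            if not self.free_pages:
                raise MemoryError("KV cache exhausted; request must queue")
            seq["pages"].append(self.free_pages.pop())  # any free page works
        seq["tokens"] += 1

    def free(self, request_id: str) -> None:
        """Return all of a finished request's pages to the pool."""
        self.free_pages.extend(self.sequences.pop(request_id)["pages"])

cache = PagedKVCache(num_pages=4)
for _ in range(20):           # 20 tokens need 2 of the 16-token pages
    cache.append_token("req-a")
cache.free("req-a")           # both pages are immediately reusable
```

Because pages need not be contiguous, the only waste is the unfilled tail of each sequence's last page, which is where the under-4% figure comes from.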
TensorRT-LLM takes a fundamentally different approach. Instead of running model weights through a general-purpose PyTorch runtime, it compiles the model into an optimized CUDA kernel graph specific to your GPU architecture, batch size, and sequence length configuration. The compiled engine binary extracts more hardware efficiency than any runtime-based approach because the compiler can see the entire computation graph and make global optimization decisions.
The tradeoff is the compilation step. Building a TRT-LLM engine for a 70B model takes roughly 28 minutes on an H100. This is not a flaw in the design; it is a deliberate trade of compile-time for inference-time performance. The compiled engine is saved to disk and reused on subsequent starts, where reloading takes about 90 seconds. If your model changes weekly or you deploy with auto-scaling from zero, that 28-minute compile step becomes a serious operational burden.
TensorRT-LLM also has the narrowest model support of the three datacentre engines. It targets NVIDIA GPUs exclusively and requires an explicit build step for each model. The ecosystem is smaller, which means fewer community workarounds when you hit edge cases.
SGLang builds on paged memory management but adds a critical insight that changes the economics for most real-world applications: do not throw away the KV cache after a request finishes.[3]
RadixAttention maintains an LRU cache of KV computations in a radix tree data structure. When a new request arrives, SGLang performs a prefix match against the tree. If the new request shares a prefix with a previous one, SGLang reuses the cached computation instead of recomputing it from scratch.
This happens constantly in production workloads. A chatbot with a system prompt shares that prompt across every turn of every conversation. A RAG pipeline serving multiple users querying the same retrieved document shares that document's KV cache. A few-shot classification task shares the same examples across every request. In these scenarios, SGLang's cache hit rates reach 75-95%, and the throughput improvement over engines without cross-request caching reaches 6.4x.
SGLang also includes a cache-aware scheduler that prioritizes requests with longer shared prefixes, approximating a depth-first traversal of the radix tree that maximizes cache hits. The practical implication: if 10 users query the same 10,000-word document in a RAG pipeline, SGLang processes those 10,000 words once. vLLM processes them 10 times.
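A simplified illustration of the mechanism, using a plain trie over token IDs rather than SGLang's compressed radix tree, and omitting eviction and scheduling:

```python
class PrefixCache:
    """Toy prefix cache: reports how many leading tokens of a new
    request already have KV entries from earlier requests."""

    def __init__(self):
        self.root: dict = {}

    def insert(self, tokens: list) -> None:
        """Record a completed request's token sequence."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens: list) -> int:
        """Length of the longest cached prefix of `tokens`; these
        positions can skip prefill entirely."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

system_prompt = list(range(100))           # pretend token IDs
cache = PrefixCache()
cache.insert(system_prompt + [900, 901])   # first user turn
hit = cache.match_len(system_prompt + [950, 951])  # second turn: hit == 100
```

In this toy run the second request reuses all 100 shared system-prompt positions and only has to prefill its two new user tokens, which is exactly the effect that compounds across users in RAG and chat workloads.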
Ollama wraps llama.cpp with a clean CLI, a model library, and an OpenAI-compatible API. It is not trying to win the throughput benchmark. It is trying to be the fastest path from zero to a running model on a laptop, a Mac Studio, or a workstation with consumer GPUs.
For local development, quick experiments, and applications where API costs outweigh engineering time, Ollama is the right tool. The 60-100 tok/s you get from Ollama on a local GPU feels fast enough for interactive use, and the zero-config setup means you are productive in minutes rather than hours.
The benchmark numbers tell you what each engine can do in isolation. The decision framework tells you which one matches your workload.
You need the widest possible model compatibility. vLLM supports hundreds of model architectures including multimodal models like Qwen3-VL, InternVL3, LLaVA-Next, and Pixtral-12B, plus all popular open-source families. If your pipeline might serve different model families over time, vLLM is the safe default.
You are running on non-NVIDIA hardware. vLLM supports AMD GPUs, Google TPUs, AWS Trainium and Inferentia, Intel Gaudi, and Arm processors. TensorRT-LLM targets NVIDIA exclusively, and SGLang's AMD support is only partial. If your infrastructure is mixed, vLLM is your only option for a unified serving layer.
Your team has limited DevOps capacity. vLLM loads a HuggingFace model directly with no compilation step, deploys in a single Docker command, and has the largest community of the three datacentre engines. The 3x larger contributor base means faster resolution when you hit edge cases at 2 AM.
You run encoder-decoder models like T5 or BART. SGLang does not support encoder-decoder architectures. If your pipeline includes these models, vLLM is the only viable option.
You serve one model in long-term production. If your application ships with a fixed model that will not change for months and you need every possible token per second, TensorRT-LLM delivers 13-16% higher throughput than vLLM at high concurrency. At scale, that difference translates to meaningful GPU cost savings.
You can absorb the compilation step. The 28-minute build is a one-time cost per model version, and the compiled engine is reused across restarts. Blue-green deployments, auto-scaling from zero, and frequent model updates all require planning around this. If your pipeline can handle it, the performance gains are real.
Throughput is more important than flexibility. TensorRT-LLM is the right choice for high-volume API platforms where the model is stable and every token of margin matters economically.
Your workload has shared prefixes. Multi-turn conversations, RAG pipelines over shared document corpora, and few-shot prompting tasks all generate cache hits that SGLang exploits and vLLM cannot. When we migrated one client from vLLM to SGLang, their GPU bill dropped by $12,000 per month at the same traffic level.
You deploy DeepSeek models. SGLang is the officially recommended inference engine for DeepSeek V3 and R1. It ships with optimized attention backends (FlashAttention3, FlashMLA, CutlassMLA) that deliver 3.1x faster inference on DeepSeek V3 compared to vLLM.
Structured JSON output matters. SGLang uses a compressed finite state machine for constrained decoding that runs roughly 3x faster than standard guided decoding. JSON compliance rates reach 96-98.2%, compared to 90-94% without constrained decoding.
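To see why an FSM makes constrained decoding cheap, here is a toy illustration: a hand-built state machine for one tiny JSON shape (nothing like SGLang's compressed automaton, and with fake scores standing in for real logits). The decoder simply never considers a token the grammar forbids.

```python
import json

# Hand-built FSM for the shape {"sentiment": "<positive|negative>"}.
# Each state maps an allowed token -> next state; None marks acceptance.
FSM = {
    "start": {'{"sentiment": "': "value"},
    "value": {"positive": "close", "negative": "close"},
    "close": {'"}': None},
}

def fake_model_scores(candidates):
    """Stand-in for real logits: prefers 'positive', but also scores a
    junk token highly to show that the grammar mask matters."""
    prefs = {"positive": 0.9, "negative": 0.4}
    return {tok: prefs.get(tok, 0.99 if tok == "maybe?" else 0.1)
            for tok in candidates}

def constrained_decode() -> str:
    state, out = "start", []
    while state is not None:
        allowed = FSM[state]
        # Score grammatical tokens plus a junk token the mask rejects.
        scores = fake_model_scores(list(allowed) + ["maybe?"])
        best = max(allowed, key=scores.__getitem__)  # argmax over allowed only
        out.append(best)
        state = allowed[best]
    return "".join(out)

result = constrained_decode()
assert json.loads(result) == {"sentiment": "positive"}
```

The compression insight is visible even here: where the grammar admits only one continuation (the opening `{"sentiment": "` and the closing `"}`), there is nothing to score, so a compressed FSM can emit those runs in a single step instead of token by token.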
You run AI agents with iterative reasoning loops. Agent loops repeatedly call the same tools with overlapping context, which creates ideal conditions for RadixAttention's prefix caching.
You are developing locally or prototyping. `ollama run qwen3.5` gets you a running model in under a minute with no Docker knowledge and no cloud account. This is the right starting point for exploring capabilities before investing in production infrastructure.
You need to run on consumer GPUs or Macs. Ollama's C++ backend (llama.cpp) runs on everything from a Raspberry Pi to Apple Silicon, and the quantization support (Q4_K_M, Q8_0, and newer 1-bit formats) makes larger models feasible on limited VRAM.
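Back-of-envelope arithmetic shows why that quantization support matters. The bits-per-weight figures below are rough averages (about 4.5 for Q4_K_M, about 8.5 for Q8_0 including metadata), and the results ignore KV cache and runtime overhead, so treat them as order-of-magnitude estimates.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough VRAM needed just for the weights (excludes KV cache
    and runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = approx_weight_gb(70, 16)   # ~140 GB: a 70B model needs multiple GPUs
q4km = approx_weight_gb(70, 4.5)  # ~39 GB: within reach of one big card
q8 = approx_weight_gb(8, 8.5)     # ~8.5 GB: an 8B model on a consumer GPU
```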
If you are still unsure, start with this hierarchy:

1. Prototype locally with Ollama.
2. Deploy to production with vLLM as the safe default.
3. Migrate to SGLang when your workload is prefix-heavy (RAG, multi-turn chat, agent loops).
4. Migrate to TensorRT-LLM when your model is fixed and you need maximum throughput on NVIDIA hardware.
The OpenAI-compatible API that all four engines expose means this progression is not a rewrite. You change your base URL and your serving configuration; your application code stays the same.
The inference engine debate is not about finding the single best engine. It is about matching the architecture to your workload shape, your team's operational capacity, and your model's deployment cadence.
TensorRT-LLM wins on raw throughput but demands a 28-minute compilation step and only runs on NVIDIA. vLLM wins on ecosystem breadth and deployment simplicity. SGLang wins on prefix-heavy workloads where cache reuse compounds with user count. Ollama wins on accessibility and local development speed.
Most teams should start with vLLM, measure their actual workload characteristics, and migrate to SGLang or TensorRT-LLM when they have concrete evidence that the performance gap justifies the operational cost. Premature optimization of the inference engine is less valuable than getting the rest of your application right.
[1] Spheron. "vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)." 2026.
[2] Kwon, W., et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
[3] LMSYS. "SGLang: Structured Generation Language for LLM Inference." 2024.
[4] vLLM Team. "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention." 2024.
[5] NVIDIA. "TensorRT-LLM: A High-Performance Inference Framework for LLMs." 2024.
[6] Ollama Team. "Ollama GitHub Repository." 2026.