LeetLLM
LearnTracksPracticeBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Tracks
  • Practice
  • Blog
  • RSS

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

Blog
InferencevLLMSGLangTensorRT-LLM+3

vLLM vs SGLang vs TensorRT-LLM vs Ollama: Choosing an Inference Engine in 2026

Raw throughput is only half the inference-engine decision. Read an H100 benchmark snapshot, reason about KV-cache pressure, and choose between vLLM, SGLang, TensorRT-LLM, and Ollama.

LeetLLM TeamApril 1, 2026Updated June 12, 202610 min read

You just deployed an internal engineering assistant. You picked a strong model. On launch day, fifty engineers ask it to summarize traces and explain recent incidents. The first ten get instant replies. The rest stare at a loading spinner. Your GPU dashboard shows only 60% utilization.

The model isn't the bottleneck. The inference engine is.

Choosing between vLLM, SGLang, TensorRT-LLM, and Ollama isn't about finding the highest bar in one chart. It's about matching a runtime to your workload shape, hardware, and tolerance for operational work.

Read benchmark shape, understand KV-cache pressure, then choose the simplest engine that fits the measured bottleneck.

What engines compete over

When a large language model generates text, it predicts one token at a time. To predict token 10, it needs information from tokens 1 through 9. Recomputing that history at every step would waste most GPU work, so serving systems store intermediate attention state in a KV cache.

Inference usually has two phases. Prefill processes the prompt and writes the first KV-cache entries. Decode loops one output token at a time while reading that cached state. TTFT, time to first token, is mostly prefill and scheduling delay. TPOT, time per output token, is mostly decode throughput.

Memory is the pressure point. Every request has a different context length. If the engine reserves one giant block per request, short requests leave memory empty while other users wait. Good serving runtimes are mostly KV-cache allocators, schedulers, and kernel stacks wrapped around model weights.

Decision rule: Don't ask "which engine is best?" Ask "which bottleneck dominates this workload: local setup, broad production serving, shared-prefix reuse, or stable NVIDIA throughput?"

Benchmark snapshot

Spheron's March 2026 benchmark is useful because it states the workload. It ran vLLM, SGLang, and TensorRT-LLM on 1x H100 SXM5 80 GB with Llama 3.3 70B Instruct, 8-bit floating point (FP8) precision, 512 input tokens, 256 output tokens, and concurrency levels of 1, 10, 50, and 100.[1]Reference 1vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/

Those details matter because inference-engine results are workload-shaped. The "winner" can change if you switch hardware, model family, prompt sharing pattern, or latency target.

Spheron also notes that its vLLM row predates vLLM's MRV2 path. That means the published vLLM numbers are a point-in-time configuration, not a permanent ceiling for the project.[1]Reference 1vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/

Inference throughput shape where vLLM, SGLang, and TensorRT-LLM stay close at low load, then TensorRT-LLM pulls ahead only under high concurrency. Inference throughput shape where vLLM, SGLang, and TensorRT-LLM stay close at low load, then TensorRT-LLM pulls ahead only under high concurrency.
The benchmark is most useful for reading shape, not rank alone: the early race is tight, then TensorRT-LLM's lead widens only after concurrency becomes large.

Practical read: TensorRT-LLM leads on raw throughput in this setup, SGLang sits between TensorRT-LLM and vLLM, and vLLM remains close enough that operational simplicity still matters.

Production check: If prompt length, cache hit rate, hardware, or model churn differs from this setup, rerun the comparison before changing runtimes.

Throughput wasn't the only spread. On the same benchmark, time to first token (TTFT) p50 at 10 concurrent requests was 120 ms for vLLM, 112 ms for SGLang, and 105 ms for TensorRT-LLM. Cold start was about 62 seconds for vLLM, 58 seconds for SGLang, and 28 minutes for the compiled TensorRT-LLM path.[1]Reference 1vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/ That compile tax matters if you rotate models often. TensorRT-LLM also has a PyTorch-oriented path that lowers setup friction, but then you're no longer comparing the exact runtime path that produced the peak-throughput numbers.[1]Reference 1vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/[2]Reference 2NVIDIA TensorRT-LLM Documentation.https://docs.nvidia.com/tensorrt-llm/

Why PagedAttention matters

vLLM's original breakthrough was PagedAttention. It attacks KV-cache fragmentation by allocating cache in fixed-size blocks instead of one large contiguous segment per request.[3]Reference 3Efficient Memory Management for Large Language Model Serving with PagedAttentionhttps://arxiv.org/abs/2309.06180

Use concrete numbers. A GPU has 24 GB of VRAM. Model weights consume 18 GB, leaving 6 GB for KV cache and active requests. If contiguous reservation takes 1 GB per request slot, you get six slots. If most requests use only 400 MB, 600 MB per slot sits empty.

PagedAttention breaks the cache into blocks. A 400 MB request gets roughly 400 MB of blocks, so the same 6 GB budget can hold about fifteen average requests instead of six worst-case reservations. The toy multiplier is 15 / 6 = 2.5x.

That arithmetic is simplified, but it explains why cache allocation affects user-visible throughput. More memory headroom means the scheduler can keep more sequences active instead of leaving GPU capacity idle.

PagedAttention turns a fixed KV-cache budget from a few oversized slots into many right-sized blocks for active requests. PagedAttention turns a fixed KV-cache budget from a few oversized slots into many right-sized blocks for active requests.
Contiguous reservation wastes memory on requests that never use their worst-case cache budget. Fixed-size KV blocks turn that headroom into more active sessions.

Engine choices

vLLM: broad production default

vLLM is the balanced starting point for many production teams. It has broad model coverage, OpenAI-compatible serving patterns, a large user base, and PagedAttention as its core memory-management idea.[4]Reference 4vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttentionhttps://github.com/vllm-project/vllm[3]Reference 3Efficient Memory Management for Large Language Model Serving with PagedAttentionhttps://arxiv.org/abs/2309.06180

Current vLLM docs cover automatic prefix caching, so shared-prefix reuse isn't exclusive to SGLang.[5]Reference 5Automatic Prefix Cachinghttps://docs.vllm.ai/en/latest/features/automatic_prefix_caching/ That matters when you read older comparisons: a benchmark row can lag active runtime changes.

Use vLLM when you want broad model coverage, predictable deployment, a large ecosystem, and no runtime specialization before measurements justify it.

SGLang: prefix-reuse specialist

SGLang's paper describes a runtime for structured language-model programs and introduces RadixAttention for KV-cache reuse across shared token prefixes.[6]Reference 6SGLang: Efficient Execution of Structured Language Model Programshttps://arxiv.org/abs/2312.07104 That design is most relevant when requests repeat large prompt regions: multi-turn chat with a long system prompt, retrieval over repeated context, few-shot prompts with shared exemplars, or agent loops that revisit similar state.

Picture an internal assistant where every request starts with the same 500-token system prompt that explains schema, permissions, and tool format. Only the final incident ID changes. That workload is exactly where prefix reuse can beat raw unique-prompt throughput.

The Spheron benchmark used unique prompts, so it doesn't fully capture prefix-heavy workloads. That's why reading only the headline throughput bar chart isn't enough.

Use SGLang when shared prefixes dominate the request mix, structured generation matters, and your traffic pattern rewards cache reuse.

TensorRT-LLM: NVIDIA throughput path

TensorRT-LLM is the NVIDIA-specialized path. It targets high serving performance through optimized kernels, engine artifacts, and deployment-time specialization.[2]Reference 2NVIDIA TensorRT-LLM Documentation.https://docs.nvidia.com/tensorrt-llm/

That's why it often wins peak-throughput benchmarks.

One important serving feature is in-flight batching. Instead of waiting for every request in a batch to finish before admitting new work, the runtime can remove completed sequences and insert fresh requests while generation is still running. NVIDIA's docs describe this as a way to reduce latency and use GPUs better.[7]Reference 7Paged Attention, IFB, and Request Scheduling.https://nvidia.github.io/TensorRT-LLM/features/paged-attention-ifb-scheduler.html

Artifact management is the cost. The compiled path can deliver strong peak numbers on stable NVIDIA deployments, but it adds build time, rollback complexity, and configuration discipline. Spheron's measured cold-start difference makes that cost visible: seconds for vLLM and SGLang, about 28 minutes for the compiled TensorRT-LLM path in its benchmark.[1]Reference 1vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/

Use TensorRT-LLM when you serve a stable model for a long time, run NVIDIA-only infrastructure, and have enough traffic to justify runtime optimization work.

If your model mix changes constantly, or you care more about flexibility than absolute tokens per second, the benchmark lead may not be worth the operational cost.

Ollama: local-development runtime

Ollama isn't trying to win the H100 datacenter benchmark. It's trying to make local LLM usage predictable.

Ollama provides one local daemon, published model tags, simple CLI flows, and a clean local API.

Ollama also exposes OpenAI-compatible local endpoints now, which makes it easier to reuse client code during prototyping even if you later move the serving layer elsewhere.[8]Reference 8OpenAI compatibility - Ollamahttps://docs.ollama.com/api/openai-compatibility

Ollama can run local models through backends such as llama.cpp, and its import path supports GGUF (GPT-Generated Unified Format) model files. That makes it useful on Macs, laptops, and consumer GPUs where portability matters more than cluster scheduling.[9]Reference 9Ollama GitHub Repositoryhttps://github.com/ollama/ollama[10]Reference 10llama.cpp: Inference of LLaMA model in pure C/C++https://github.com/ggml-org/llama.cpp[11]Reference 11Importing a Model - Ollamahttps://docs.ollama.com/import

Use Ollama when you're prototyping locally, targeting a workstation or laptop, or optimizing for the shortest path from "download model" to "send request."

Quick comparison

EngineUse it for
vLLMBroad production default for stable APIs and mixed open models; benchmark rows can lag current paths
SGLangPrefix-heavy agent and chat runtime for reused system prompts, RAG contexts, and agent loops
TensorRT-LLMNVIDIA peak-throughput path for stable high-traffic deployments; budget for build artifacts and compile time
OllamaLocal development runtime for laptops, prototypes, and internal demos, not large multi-tenant serving

Two related names sit in different layers. Text Generation Inference (TGI) is now in Hugging Face maintenance mode. Hugging Face says it accepts minor fixes and docs improvements while recommending vLLM, SGLang, llama.cpp, and MLX (Apple's machine-learning framework) going forward.[12]Reference 12Text Generation Inference.https://huggingface.co/docs/text-generation-inference/index NVIDIA Dynamo isn't a single-node engine choice. It coordinates engines such as TensorRT-LLM, vLLM, and SGLang across nodes, including disaggregated serving that separates prefill and decode work onto different GPU groups.[13]Reference 13NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Modelshttps://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/

How to choose

Benchmark tables are only useful if they change a deployment decision. Start with vLLM for server serving, use Ollama for local iteration, and specialize only when measurements point elsewhere.

Runtime routing map where local iteration, broad serving, shared-prefix reuse, and stable NVIDIA load point to different inference engines. Runtime routing map where local iteration, broad serving, shared-prefix reuse, and stable NVIDIA load point to different inference engines.
Measure first. vLLM is the default branch; leave it only when local iteration speed, prefix reuse, or stable NVIDIA scale dominates enough to justify a specialized runtime.

Benchmark checklist before switching

Before replacing a serving runtime, run a benchmark that looks like production traffic. Track prompt length, output length, shared-prefix hit rate, burst shape, model churn, and hardware target. Then collect four metrics: TTFT, TPOT, generated tokens per second at target concurrency, and operational cost from deployment time through rollback complexity.

Benchmark two workloads, not one. Use a unique-prompt workload to test raw serving efficiency, then use a shared-prefix workload to test chat, retrieval, and agent traffic. If those results point at different engines, route traffic by workload instead of forcing one runtime to serve everything.

Review prompt: Before approving a runtime change, ask: "What production measurement would make us reverse this decision?"

Three mistakes teams make when choosing an engine

Avoid three traps: compiling TensorRT-LLM before a prototype needs it, treating aggregate tokens per second as user latency, and reading one benchmark as universal truth. Start locally with Ollama, stage with vLLM, track TTFT and TPOT separately, and benchmark both unique-prompt and shared-prefix workloads before migrating.

Final takeaways

A 60% GPU utilization figure can still mean users are waiting if KV-cache pressure, prefill delay, or scheduler shape is wrong. PagedAttention turns wasted memory into extra active slots. Prefix reuse can change the winner. TensorRT-LLM's throughput lead matters most when model and hardware stay stable.

The next step is serving architecture: batching, quantization, routing, and autoscaling. Once you know why engine choice matters, those tuning decisions stop looking like random flags and start looking like workload controls.

PreviousDeepSeek V4 and the US AI Lab SqueezeNext50 Essential LLM Engineering Concepts for 2026
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026)

Spheron · 2026

NVIDIA TensorRT-LLM Documentation.

NVIDIA · 2026

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon, W., et al. · 2023 · SOSP 2023

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

vLLM Team · 2024

Automatic Prefix Caching

vLLM · 2026

SGLang: Efficient Execution of Structured Language Model Programs

Zheng, L., Yin, L., Xie, Z., et al. · 2023 · arXiv:2312.07104

Paged Attention, IFB, and Request Scheduling.

NVIDIA · 2026

OpenAI compatibility - Ollama

Ollama · 2026

Ollama GitHub Repository

Ollama Team · 2026

llama.cpp: Inference of LLaMA model in pure C/C++

Gerganov, G. · 2023

Importing a Model - Ollama

Ollama · 2026

Text Generation Inference.

Hugging Face · 2026

NVIDIA Dynamo: A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models

NVIDIA · 2025