Tags: Inference · vLLM · SGLang · TensorRT-LLM · Ollama · Production AI · Benchmarks

vLLM vs SGLang vs TensorRT-LLM vs Ollama: The 2026 Inference Engine Showdown

The inference engine you choose determines your real cost per token as much as the model itself. We ran vLLM, SGLang, TensorRT-LLM, and Ollama through identical benchmarks on H100 GPUs to find out which engine actually wins for your workload.

LeetLLM Team · April 1, 2026 · 45 min read


You have picked the perfect model for your application. Now comes the part that will actually cost you money every month: the inference engine. The difference between the right and wrong choice can mean a 2x gap in throughput, a 10x gap in cold-start time, and months of DevOps headaches avoided or incurred.

In this guide, we run four inference engines through identical benchmarks on H100 GPUs, explain how each architecture makes its tradeoffs, and give you a decision framework so you can pick the right one for your workload. No fluff. Just the numbers and what they mean for your system.

The Four Engines at a Glance

Before diving into the benchmarks, here is how the four engines stack up against each other on the dimensions that matter most for production deployments.

| Engine | Best For | Throughput Range | Cold Start | Hardware |
|---|---|---|---|---|
| vLLM | General production, widest model support | 1,000 to 2,400 tok/s | ~62 sec | NVIDIA, AMD, TPU, Trainium |
| SGLang | Shared-prefix workloads, RAG, multi-turn chat | 1,200 to 2,500 tok/s | ~58 sec | NVIDIA, some AMD |
| TensorRT-LLM | Maximum throughput, fixed models | 1,300 to 2,800 tok/s | ~28 min (compile) | NVIDIA only |
| Ollama | Local development, fast prototyping | 60 to 100 tok/s | ~30 sec | CPU, Mac, consumer GPU |

💡 Key insight: All four engines expose an OpenAI-compatible API. Switching between them is a base URL change in your application code, not a rewrite. This means you can prototype with Ollama and deploy to production with vLLM or SGLang using the same SDK.
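Because all four engines speak the same wire protocol, the swap really is a configuration change. A minimal sketch using only Python's standard library; the ports below are typical defaults for each engine and the model name is illustrative, so adjust both for your deployment:

```python
# Minimal sketch: the same OpenAI-style chat-completions payload works
# against any of the four engines; only the base URL changes. Ports are
# common defaults (vLLM 8000, SGLang 30000, Ollama 11434), not guarantees
# about your setup, and the model name is an illustrative assumption.
import json
import urllib.request

ENGINE_BASE_URLS = {
    "vllm": "http://localhost:8000/v1",
    "sglang": "http://localhost:30000/v1",
    "trt-llm": "http://localhost:8000/v1",   # e.g. behind an OpenAI-compatible server
    "ollama": "http://localhost:11434/v1",
}

def chat_request(engine: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the chosen engine."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{ENGINE_BASE_URLS[engine]}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Switching engines is a one-argument change; the payload is untouched.
req = chat_request("vllm", "meta-llama/Llama-3.3-70B-Instruct", "Hello")
```

The same `chat_request` call with `"ollama"` or `"sglang"` as the first argument produces an identical payload aimed at a different port, which is exactly why the Ollama-to-production progression described later requires no application rewrite.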

[Figure: Throughput comparison of vLLM, SGLang, and TensorRT-LLM on H100 GPUs across concurrency levels (1, 10, 50, and 100 concurrent requests) with Llama 3.3 70B at FP8 precision.]

The Benchmark Setup

We ran all engines on a single H100 SXM5 80GB GPU with Llama 3.3 70B Instruct at FP8 precision, using 512-token inputs and 256-token outputs across concurrency levels of 1, 10, 50, and 100 simultaneous requests. The methodology used async Python clients with a 60-second warm-up period before each 3-minute measurement window.[1]

The results below represent tokens per second of output generation, which is what users actually perceive as speed. Time to first token (TTFT) numbers are also included because for interactive applications, waiting for the first word matters as much as overall throughput.
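For readers reproducing this kind of measurement, the arithmetic behind the two headline metrics is straightforward. A minimal sketch with synthetic inputs; a real harness would record one TTFT and output-token count per request from the async clients described above:

```python
# Sketch of the metric computation: output tokens per second over the
# measurement window, plus p50/p95 TTFT via a nearest-rank percentile.
# The sample data at the bottom is synthetic, for illustration only.

def percentile(values, pct):
    """Nearest-rank percentile (pct in [0, 100]) over a non-empty list."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[k]

def summarize(ttfts_ms, output_tokens, window_seconds):
    """Aggregate per-request samples into the benchmark's headline metrics."""
    return {
        "throughput_tok_s": sum(output_tokens) / window_seconds,
        "ttft_p50_ms": percentile(ttfts_ms, 50),
        "ttft_p95_ms": percentile(ttfts_ms, 95),
    }

# Synthetic example: 100 requests, 256 output tokens each, 180 s window.
stats = summarize([40 + i for i in range(100)], [256] * 100, 180.0)
```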

Benchmark Results: Throughput

At 1 concurrent request, all three datacentre engines perform similarly because the bottleneck is not the engine but the model's autoregressive generation speed. The gap opens as concurrency increases.

| Concurrency | vLLM (tok/s) | SGLang (tok/s) | TensorRT-LLM (tok/s) |
|---|---|---|---|
| 1 | 120 | 125 | 130 |
| 10 | 650 | 680 | 710 |
| 50 | 1,850 | 1,920 | 2,100 |
| 100 | 2,400 | 2,460 | 2,780 |

TensorRT-LLM leads at every concurrency level, with the margin ranging from 8% at low load to 16% at 100 concurrent requests. SGLang sits between vLLM and TensorRT-LLM on raw throughput, but this comparison uses unique prompts throughout. SGLang's advantage appears when requests share prefixes, and in that scenario it can deliver 29% higher throughput than vLLM.

Benchmark Results: Time to First Token

TTFT is the metric that determines whether your application feels responsive. A user staring at a blank screen for 500ms experiences your product as slow, regardless of how fast tokens arrive after that.

| Concurrency | vLLM p50 | vLLM p95 | TRT-LLM p50 | TRT-LLM p95 | SGLang p50 | SGLang p95 |
|---|---|---|---|---|---|---|
| 1 | 45 ms | 68 ms | 38 ms | 55 ms | 42 ms | 61 ms |
| 10 | 120 ms | 195 ms | 105 ms | 170 ms | 112 ms | 178 ms |
| 50 | 380 ms | 720 ms | 340 ms | 620 ms | 360 ms | 680 ms |
| 100 | 740 ms | 1,450 ms | 680 ms | 1,280 ms | 710 ms | 1,380 ms |

TensorRT-LLM delivers the lowest p95 TTFT at every concurrency level. At 100 concurrent requests, the difference between TRT-LLM's p95 of 1,280ms and vLLM's 1,450ms is noticeable in interactive applications. SGLang's p95 sits between the two others.

The Architecture Behind the Numbers

Understanding why these engines produce different numbers helps you reason about which one fits your workload, not just today's workload but tomorrow's as traffic patterns evolve.

vLLM: PagedAttention and the Memory Management Revolution

vLLM introduced PagedAttention, and it remains the engine's defining innovation.[2] The core problem vLLM solved: GPU memory for the KV cache (the attention activations that store everything the model has "seen" in a sequence) was being wasted at 60 to 80% because it had to be allocated as one contiguous block per request. If you had 80GB of VRAM and a 70B parameter model using 70GB, you had roughly 10GB left for the KV cache across all concurrent requests, and fragmentation made even that 10GB mostly unusable.

PagedAttention borrowed virtual memory paging from operating systems. Instead of one large contiguous block, the KV cache lives in small fixed-size pages that can be stored anywhere in GPU memory. Each sequence grows its cache page by page on demand. When a request finishes, its pages are immediately freed. Memory waste drops from 60-80% to under 4%, and vLLM can serve far more concurrent requests on the same hardware.
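The memory numbers above are easy to sanity-check. A back-of-the-envelope sketch, assuming a Llama-3.3-70B-like architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) at FP8; these architectural figures are illustrative assumptions, not quoted specs:

```python
# Back-of-the-envelope KV cache sizing under assumed architecture numbers
# (80 layers, 8 KV heads via GQA, head dim 128, FP8 = 1 byte per value).
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_val=1):
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_val

per_token = kv_bytes_per_token()            # 163,840 bytes = 160 KiB/token
free_vram = 10 * 1024**3                    # ~10 GiB left after the weights
max_cached_tokens = free_vram // per_token  # budget if zero waste
```

With contiguous allocation wasting 60 to 80% of that budget, the usable token count collapses to a fraction of `max_cached_tokens`; paging at under 4% waste recovers nearly all of it, which is where vLLM's concurrency gains come from.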

Combined with continuous batching (starting new requests at the iteration level rather than waiting for a full batch to complete), PagedAttention made vLLM the production default when it launched, and it remains there.

TensorRT-LLM: Compilation as Competitive Advantage

TensorRT-LLM takes a fundamentally different approach. Instead of running model weights through a general-purpose PyTorch runtime, it compiles the model into an optimized CUDA kernel graph specific to your GPU architecture, batch size, and sequence length configuration. The compiled engine binary extracts more hardware efficiency than any runtime-based approach because the compiler can see the entire computation graph and make global optimization decisions.

The tradeoff is the compilation step. Building a TRT-LLM engine for a 70B model takes roughly 28 minutes on an H100. This is not a flaw in the design; it is a deliberate trade of compile-time for inference-time performance. The compiled engine is saved to disk and reused on subsequent starts, where reloading takes about 90 seconds. If your model changes weekly or you deploy with auto-scaling from zero, that 28-minute compile step becomes a serious operational burden.
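Whether the compile step pays for itself is a quick calculation. A rough sketch under loudly assumed numbers: an H100 at $2.50 per hour, a saturated high-concurrency workload, and the roughly 16% throughput edge over vLLM from the benchmarks above; every figure here is illustrative:

```python
# Rough amortization of the one-time compile step. All prices and
# utilization figures are assumptions for illustration, not quotes.
H100_PER_HOUR = 2.50
COMPILE_HOURS = 28 / 60   # the ~28-minute engine build
SPEEDUP = 1.16            # TRT-LLM vs vLLM at 100 concurrent requests

# Serving the same token volume needs 1/SPEEDUP as many GPU-hours.
gpu_hours_saved_per_day = 24 * (1 - 1 / SPEEDUP)
dollars_saved_per_day = gpu_hours_saved_per_day * H100_PER_HOUR
compile_cost = COMPILE_HOURS * H100_PER_HOUR
breakeven_days = compile_cost / dollars_saved_per_day
```

Under these assumptions the compile cost amortizes within hours on a steady, saturated workload; the real cost of the build step is operational, in auto-scaling from zero and frequent model updates, rather than the GPU-minutes it consumes.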

TensorRT-LLM also has the narrowest model support of the three datacentre engines. It targets NVIDIA GPUs exclusively and requires an explicit build step for each model. The ecosystem is smaller, which means fewer community workarounds when you hit edge cases.

SGLang: RadixAttention and the Cache-Reuse Insight

SGLang builds on paged memory management but adds a critical insight that changes the economics for most real-world applications: do not throw away the KV cache after a request finishes.[3]

RadixAttention maintains an LRU cache of KV computations in a radix tree data structure. When a new request arrives, SGLang performs a prefix match against the tree. If the new request shares a prefix with a previous one, SGLang reuses the cached computation instead of recomputing it from scratch.

This happens constantly in production workloads. A chatbot with a system prompt shares that prompt across every turn of every conversation. A RAG pipeline serving multiple users querying the same retrieved document shares that document's KV cache. A few-shot classification task shares the same examples across every request. In these scenarios, SGLang's cache hit rates reach 75-95%, and the throughput improvement over engines without cross-request caching reaches 6.4x.

SGLang also includes a cache-aware scheduler that prioritizes requests with longer shared prefixes, approximating a depth-first traversal of the radix tree that maximizes cache hits. The practical implication: if 10 users query the same 10,000-word document in a RAG pipeline, SGLang processes those 10,000 words once. vLLM processes them 10 times.
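The prefix-matching idea at the heart of this is easy to model. A toy sketch of the lookup structure; real RadixAttention stores KV pages at radix-tree nodes, compresses shared edges, and evicts with an LRU policy, none of which is modeled here:

```python
# Toy sketch of the RadixAttention idea: store token sequences in a
# prefix tree and, for each new request, count how many leading tokens
# are already cached and would not need recomputation.
class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a served sequence so later requests can reuse its prefix."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = PrefixCache()
cache.insert(["sys", "doc", "q1"])           # first RAG request
hit = cache.match_len(["sys", "doc", "q2"])  # next one reuses 2 of 3 tokens
```

Scale the token strings up to a 10,000-word shared document and the `match_len` hit is the computation SGLang skips and vLLM repeats.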

Ollama: Accessibility Over Performance

Ollama wraps llama.cpp with a clean CLI, a model library, and an OpenAI-compatible API. It is not trying to win the throughput benchmark. It is trying to be the fastest path from zero to a running model on a laptop, a Mac Studio, or a workstation with consumer GPUs.

For local development, quick experiments, and applications where API costs outweigh engineering time, Ollama is the right tool. The 60-100 tok/s you get from Ollama on a local GPU feels fast enough for interactive use, and the zero-config setup means you are productive in minutes rather than hours.

When to Use Each Engine

The benchmark numbers tell you what each engine can do in isolation. The decision framework tells you which one matches your workload.

[Figure: Decision tree showing when to use each inference engine: vLLM for broad model support, TensorRT-LLM for maximum throughput, SGLang for shared-prefix workloads, and Ollama for local development.]

Choose vLLM When

You need the widest possible model compatibility. vLLM supports hundreds of model architectures including multimodal models like Qwen3-VL, InternVL3, LLaVA-Next, and Pixtral-12B, plus all popular open-source families. If your pipeline might serve different model families over time, vLLM is the safe default.

You are running on non-NVIDIA hardware. vLLM supports AMD GPUs, Google TPUs, AWS Trainium and Inferentia, Intel Gaudi, and Arm processors. SGLang and TensorRT-LLM target NVIDIA exclusively. If your infrastructure is mixed, vLLM is your only option for a unified serving layer.

Your team has limited DevOps capacity. vLLM loads a HuggingFace model directly with no compilation step, deploys in a single Docker command, and has the largest community of the three datacentre engines. The 3x larger contributor base means faster resolution when you hit edge cases at 2 AM.

You run encoder-decoder models like T5 or BART. SGLang does not support encoder-decoder architectures. If your pipeline includes these models, vLLM is the only viable option.

Choose TensorRT-LLM When

You serve one model in long-term production. If your application ships with a fixed model that will not change for months and you need every possible token per second, TensorRT-LLM delivers 13-16% higher throughput than vLLM at high concurrency. At scale, that difference translates to meaningful GPU cost savings.

You can absorb the compilation step. The 28-minute build is a one-time cost per model version, and the compiled engine is reused across restarts. Blue-green deployments, auto-scaling from zero, and frequent model updates all require planning around this. If your pipeline can handle it, the performance gains are real.

Throughput is more important than flexibility. TensorRT-LLM is the right choice for high-volume API platforms where the model is stable and every token of margin matters economically.

Choose SGLang When

Your workload has shared prefixes. Multi-turn conversations, RAG pipelines over shared document corpora, and few-shot prompting tasks all generate cache hits that SGLang exploits and vLLM cannot. In one client migration from vLLM to SGLang, GPU bills dropped by $12,000 per month at the same traffic level.

You deploy DeepSeek models. SGLang is the officially recommended inference engine for DeepSeek V3 and R1. It ships with optimized attention backends (FlashAttention3, FlashMLA, CutlassMLA) that deliver 3.1x faster inference on DeepSeek V3 compared to vLLM.

Structured JSON output matters. SGLang uses a compressed finite state machine for constrained decoding that runs roughly 3x faster than standard guided decoding. JSON compliance rates reach 96-98.2%, compared to 90-94% without constrained decoding.

You run AI agents with iterative reasoning loops. Agent loops repeatedly call the same tools with overlapping context, which creates ideal conditions for RadixAttention's prefix caching.

Choose Ollama When

You are developing locally or prototyping. ollama run qwen3.5 gets you a running model in under a minute with no Docker knowledge and no cloud account. This is the right starting point for exploring capabilities before investing in production infrastructure.

You need to run on consumer GPUs or Macs. Ollama's C++ backend (llama.cpp) runs on everything from a Raspberry Pi to Apple Silicon, and the quantization support (Q4_K_M, Q8_0, and newer 1-bit formats) makes larger models feasible on limited VRAM.

The Decision Framework in Plain Terms

If you are still unsure, start with this hierarchy:

  1. Prototype with Ollama to validate the use case and measure whether local inference gives you enough throughput.
  2. Move to vLLM when you need to scale or serve through an API. It is the default for a reason.
  3. Add SGLang when you measure TTFT on shared-prefix workloads and find that cache misses are hurting user experience.
  4. Compile with TensorRT-LLM when you have a stable model, a team that can manage the build pipeline, and traffic volumes where the throughput difference justifies the operational complexity.

The OpenAI-compatible API that all four engines expose means this progression is not a rewrite. You change your base URL and your serving configuration; your application code stays the same.

Key Takeaways

The inference engine debate is not about finding the single best engine. It is about matching the architecture to your workload shape, your team's operational capacity, and your model's deployment cadence.

TensorRT-LLM wins on raw throughput but demands a 28-minute compilation step and only runs on NVIDIA. vLLM wins on ecosystem breadth and deployment simplicity. SGLang wins on prefix-heavy workloads where cache reuse compounds with user count. Ollama wins on accessibility and local development speed.

Most teams should start with vLLM, measure their actual workload characteristics, and migrate to SGLang or TensorRT-LLM when they have concrete evidence that the performance gap justifies the operational cost. Premature optimization of the inference engine is less valuable than getting the rest of your application right.

References

  1. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. vLLM Team · 2024.
  2. SGLang: Structured Generation Language for LLM Inference. LMSYS · 2024.
  3. TensorRT-LLM: A High-Performance Inference Framework for LLMs. NVIDIA · 2024.
  4. Ollama GitHub Repository. Ollama Team · 2026.
  5. Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · SOSP 2023.
  6. vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks (2026). Spheron · 2026.
