๐Ÿ“ Context Windows๐Ÿ“œ Long Context๐Ÿ“Š Benchmarks๐Ÿ—๏ธ Infrastructure๐ŸŠ Deep Dive

The Million-Token Era: What 1M Context Windows Actually Change

Every frontier model now ships with a 1M+ token context window. But what fits in a million tokens? What breaks? And how do you know if the model actually uses all that context? We break it all down.

LeetLLM Team · March 14, 2026 · 18 min read

You've probably seen the announcements. Claude Opus 4.6 and Sonnet 4.6 now offer 1 million tokens of context at no extra cost.[1] GPT-5.4 supports 1M tokens with a cost tier.[2] Gemini 3.1 Pro charges double above 200K tokens.[3] And Grok 4.20 is pushing the envelope with a 2 million token context window.[4] Even open-source models like Qwen3.5 now support 262,144 tokens. A year ago, 128K tokens felt generous. Today, a million is the new baseline for frontier models.

But here's the question nobody seems to be asking: what does any of this actually mean? How much text fits in a million tokens? Does the model actually use all that context, or does information decay the further back you go? And if you're building production systems, should you even care?

This post answers all of that. We'll make the numbers tangible, explore what goes right (and wrong) with massive context windows, and dig into the benchmarks that separate marketing claims from real capability. Let's go.

How big is a million tokens, really?

Token counts are abstract. Nobody thinks in tokens. So let's translate. (If you're not sure what a token even is, our deep dive on tokenization covers BPE, WordPiece, and SentencePiece in detail.)

A rough rule of thumb: 1 token ≈ 0.75 English words ≈ 4 characters. That means 1 million tokens is roughly 750,000 words. To put that in perspective:

[Figure: Visual scale of what fits inside a 1M token context window, from emails to the entire Lord of the Rings trilogy and the King James Bible.]
| What you're fitting | Approximate size | Fits in 1M tokens? |
| --- | --- | --- |
| A typical email | 200-500 tokens | ✅ ~2,000 emails |
| A single Slack thread (50 messages) | ~5,000 tokens | ✅ ~200 threads |
| The Great Gatsby | ~72,000 tokens | ✅ 13 copies |
| A PhD dissertation | ~80,000 tokens | ✅ ~12 dissertations |
| A 400-page legal contract | ~120,000 tokens | ✅ ~8 full contracts |
| 10 hours of meeting transcripts | ~150,000 tokens | ✅ ~66 hours total |
| The Lord of the Rings (trilogy) | ~576,000 tokens | ✅ ~1.7 copies |
| React.js source code (entire repo) | ~600,000 tokens | ✅ Fits comfortably |
| King James Bible | ~783,000 tokens | ✅ ~1.3 copies |
| Harry Potter (all 7 books) | ~1,100,000 tokens | ❌ Just barely exceeds |
| Linux kernel (core, no drivers) | ~5,000,000 tokens | ❌ Way too large |

💡 Key insight: 1M tokens isn't infinite. It's roughly the size of a large novel, or a medium-sized codebase. Enough to hold an entire project's source code in memory, but not enough for a mature open-source project like the Linux kernel. The practical question isn't "can I fit everything?" but "can I fit enough?"

For engineers, the most exciting comparison might be codebases. A medium-sized production service (50-100 files, 20K lines of code) translates to roughly 100,000-200,000 tokens. That means you could load 5-10 complete microservices into a single context window. That is a fundamentally different capability than what we had two years ago.

โš ๏ธ Important caveat: language matters. These estimates assume English text. Tokenizers like BPE were predominantly trained on English corpora, so non-Latin scripts use significantly more tokens per word. Chinese, Japanese, Korean, Arabic, and Hindi text can consume 1.5-3x more tokens for the same semantic content. A 1M context window that holds a 750K-word English document might only hold 300K-400K words of Chinese text. If you're building multilingual systems, always measure your actual token consumption. Don't assume the English ratios hold. For a deeper understanding of why this happens, see our article on tokenization algorithms.

The pricing picture (March 2026)

Context window size is only half the story. The economics of using that context matter just as much.

| Model | Context window | Input price (per 1M tokens) | Long-context premium? |
| --- | --- | --- | --- |
| Claude Opus 4.6 | 1M | $5 | ❌ No premium (as of March 13, 2026) |
| Claude Sonnet 4.6 | 1M | $3 | ❌ No premium |
| GPT-5.4 | 1M | Available at cost | Cost scales with usage |
| Gemini 3.1 Pro | 1M | $2 (≤200K), $4 (>200K) | ⚠️ 2x above 200K tokens |
| Grok 4.20 | 2M | $2 | ❌ No premium |
| Qwen3.5 (open) | 262K | Self-hosted | N/A (your GPU bill) |

🎯 Production tip: Anthropic dropping the long-context premium is a bigger deal than it sounds. Previously, filling a 900K-token prompt with Opus could cost significantly more per token than a 9K prompt. Now the per-token rate is flat across the entire window. This changes the economics of "just include everything" strategies from prohibitively expensive to genuinely viable.
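The flat-vs-tiered difference is easy to sketch with the article's March 2026 figures (the tier interpretation below — higher rate only on tokens above the threshold — is one common reading; some providers instead reprice the whole prompt once it crosses the threshold, so check the actual billing docs):

```python
def flat_input_cost(tokens: int, rate_per_m: float) -> float:
    """Flat per-token pricing, e.g. the article's $5/M Opus 4.6 figure."""
    return tokens / 1_000_000 * rate_per_m

def tiered_input_cost(tokens: int, base: float = 2.0, premium: float = 4.0,
                      threshold: int = 200_000) -> float:
    """Two-tier pricing like the article's Gemini 3.1 Pro figures:
    $2/M up to 200K tokens, $4/M on every token above that."""
    below = min(tokens, threshold)
    above = max(tokens - threshold, 0)
    return below / 1_000_000 * base + above / 1_000_000 * premium

prompt = 900_000
print(f"Flat $5/M:  ${flat_input_cost(prompt, 5.0):.2f}")   # $4.50
print(f"Tiered:     ${tiered_input_cost(prompt):.2f}")       # $3.20
```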

What actually gets better with longer context

Bigger context windows unlock use cases that were genuinely impossible before. Here are the most impactful:

Whole-codebase reasoning

Instead of carefully selecting which files to include in your prompt, you can load an entire project. The model sees the full dependency graph, all the configuration files, the test suite, and the README. When you ask "why does this API endpoint return 500 when I pass an empty array?", the model doesn't need to guess which files matter. It can trace the call path from router to handler to database layer in a single pass.

🔬 Research insight: Anthropic reports that Cognition's Devin agent saw a measurable quality improvement in code reviews after switching to 1M context with Opus 4.6, because full diffs no longer needed to be chunked across multiple passes.[1]

Long-running agent memory

Agentic systems that run for hours accumulate massive context: tool calls, observations, intermediate reasoning, error logs. Before 1M context, agents needed compaction (summarizing earlier parts of the conversation to free up space). Compaction loses detail. With 1M tokens, agents can hold 5-10x more history before needing to compress, and some conversations never need compaction at all.
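The "compact late" idea can be sketched as a budget check before each turn (a toy outline, with our own names; `summarize` stands in for a real LLM summarization call, and real agents track exact token counts rather than the character heuristic used here):

```python
def maybe_compact(history: list[str], budget_tokens: int,
                  summarize=lambda msgs: f"[summary of {len(msgs)} messages]") -> list[str]:
    """Late compaction: keep full history until the estimated token count
    nears the budget, then summarize (lossily) the oldest half."""
    est = sum(len(m) // 4 for m in history)  # ~4 chars/token heuristic
    if est < budget_tokens * 0.9:            # still under 90% of the window
        return history                        # keep everything, lose nothing
    half = len(history) // 2
    return [summarize(history[:half])] + history[half:]
```

With a 1M-token budget, the threshold simply fires 5-10x later than it did at 128K, which is exactly why many sessions never hit compaction at all.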

Document-scale analysis

Legal teams can load five complete contracts into a single prompt and ask "what are the inconsistencies in the termination clauses across these agreements?" Medical researchers can load dozens of papers and ask for a synthesis. Financial analysts can include an entire year of quarterly reports plus earnings call transcripts.


Few-shot learning at scale

With 1M tokens, you can include hundreds of examples in your prompt. Instead of 5-10 few-shot examples, you can provide 200+. For tasks like classification, entity extraction, or structured output generation, more examples consistently improve accuracy. The context window was previously the limiting factor; now your budget is.
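A many-shot prompt builder is mostly a packing loop against the token budget (a sketch under the ~4 chars/token assumption from earlier; function names are ours):

```python
def pack_examples(examples: list[str], budget_tokens: int) -> list[str]:
    """Greedily include as many few-shot examples as fit in the budget.
    With a 1M window, the budget — not the window — is the limit."""
    packed, used = [], 0
    for ex in examples:
        cost = len(ex) // 4          # rough chars-to-tokens estimate
        if used + cost > budget_tokens:
            break
        packed.append(ex)
        used += cost
    return packed
```

For classification-style tasks, it's worth sweeping the example count (10, 50, 200, ...) against held-out accuracy rather than assuming more is always better.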

What gets worse (the hidden costs)

More context isn't free, even when the pricing says it is. There are real trade-offs that most announcements gloss over.

The "Lost in the Middle" problem

One of the most important findings in long-context research: models don't pay equal attention to all parts of a long prompt. Information placed in the middle of a very long context is significantly harder for the model to recall than information at the beginning or end.[5]

[Figure: U-shaped retrieval accuracy curve showing the Lost in the Middle effect, where models recall information at the beginning and end of context better than the middle.]

This creates a U-shaped retrieval curve. If you bury a critical fact on page 200 of a 400-page prompt, the model is more likely to miss it than if you put it on page 1 or page 400. The practical implication: order matters, even with a 1M context window.

โš ๏ธ Common mistake: Assuming "more context = better answers." If you dump an entire codebase into the prompt without structuring it, the model may perform worse than if you carefully selected the 5 most relevant files. Long context is a capability, not a strategy.

Latency explosion

Filling a 1M token context window is slow. Self-attention in the standard Transformer architecture has O(n²) complexity with respect to sequence length.[6] That means doubling the context doesn't just double the compute; it roughly quadruples the attention computation. Modern implementations use optimizations like FlashAttention[7] and KV caching to mitigate this, but the fundamental scaling challenge remains.

| Prompt size | Approximate TTFT (Time to First Token) |
| --- | --- |
| 10K tokens | < 1 second |
| 100K tokens | 2-5 seconds |
| 500K tokens | 15-30 seconds |
| 1M tokens | 30-60+ seconds |

For interactive chat, making a user wait 45 seconds for a response isn't acceptable. For batch processing or agentic workflows where latency is less critical, it's a worthwhile trade-off. Choose your architecture accordingly.
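The quadratic term is worth internalizing when budgeting prefill time (an asymptotic sketch only — real TTFT also has large linear and constant terms, and serving stacks vary widely):

```python
def relative_attention_cost(n_tokens: int, baseline: int = 10_000) -> float:
    """Self-attention compute grows as O(n^2) in sequence length, so this
    returns the attention-compute multiple vs. a baseline prompt size.
    (FlashAttention cuts memory traffic, not this asymptotic cost.)"""
    return (n_tokens / baseline) ** 2

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> ~{relative_attention_cost(n):,.0f}x attention compute")
```

A 100x longer prompt costing ~10,000x the attention compute is why the TTFT table above grows so much faster than linearly.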

Cost multiplication

Even with flat per-token pricing, the sheer volume adds up. A single 1M-token prompt to Claude Opus 4.6 at $5 per million input tokens costs $5 just for the input. If the model generates a 4K-token response at $25 per million output tokens, that's another $0.10. Now imagine an agent that makes 20 such calls in a session: that's $100+ in API costs for a single conversation.


🎯 Production tip: Don't default to filling the full context window. Use it strategically. Pre-filter with embeddings or keyword search, then include only the most relevant content. A 200K prompt that's 90% relevant will outperform a 1M prompt that's 20% relevant, and cost 80% less.
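The pre-filter step is a straightforward top-K over embedding similarity (a self-contained sketch; the toy 3-d vectors stand in for real embedding-model output, and in production you'd use a vector store rather than sorting in memory):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def prefilter(query_vec: list[float],
              docs: list[tuple[list[float], str]],
              top_k: int = 3) -> list[str]:
    """Keep only the top-K most similar documents before stuffing context."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]
```

Filter first, then spend the big window on the survivors — that's the "RAG first, then context" row in the decision table later in this post.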

Context rot and attention decay

Even models that score well on benchmarks show degraded reasoning quality as context grows. Anthropic calls this "context rot." The model doesn't suddenly forget everything, but its ability to reason across distant parts of the context weakens. Cross-referencing a detail from page 3 with a detail from page 350 is harder than cross-referencing two details on the same page, regardless of how large the context window is.[8]

How to measure: the benchmarks that matter

When a provider claims "1M context window," the natural question is: does the model actually use all those tokens effectively? Three families of benchmarks try to answer this question, each testing a different dimension.

Needle in a Haystack (NIAH)

The simplest and most intuitive benchmark. You insert a specific fact (the "needle") at a random position within a large block of irrelevant text (the "haystack"), then ask the model to retrieve that fact.[9]

[Figure: Needle in a Haystack benchmark heatmap showing retrieval accuracy across context lengths and needle placement depths.]

How it works

  1. Generate a haystack of distractor text (usually essays or Wikipedia paragraphs) at a target length (e.g., 500K tokens)
  2. Insert a short fact at a specific depth (e.g., 25%, 50%, 75% through the text)
  3. Ask the model to retrieve the fact
  4. Repeat across multiple depths and haystack sizes
  5. Plot retrieval success rate as a heatmap (x-axis: depth, y-axis: context length)
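Steps 1-2 above can be sketched as a small prompt builder (a toy harness with our own names; it uses the ~4 chars/token heuristic rather than a real tokenizer, and a production NIAH run would send the result to a model and score the answer):

```python
def build_niah_prompt(haystack_paragraphs: list[str], needle: str,
                      depth: float, target_tokens: int) -> str:
    """Truncate distractor text to roughly target_tokens and splice the
    needle in at a fractional depth (0.0 = start, 1.0 = end)."""
    text = " ".join(haystack_paragraphs)[: target_tokens * 4]
    cut = int(len(text) * depth)
    # snap to the nearest sentence boundary after the cut point
    boundary = text.find(". ", cut)
    boundary = cut if boundary == -1 else boundary + 2
    return text[:boundary] + needle + " " + text[boundary:]

haystack = ["Filler sentence about nothing."] * 60
prompt = build_niah_prompt(haystack, "The secret number is 7134.",
                           depth=0.5, target_tokens=300)
```

Sweeping `depth` over {0.1, ..., 0.9} and `target_tokens` up to the window size gives you the two axes of the heatmap in step 5.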

What it tells you

NIAH tests recall: can the model find a specific piece of information regardless of where it sits? A perfect score means the model has functional access to the entire context window. Most modern frontier models score near-perfect (>99%) on single-needle NIAH across their full stated context length.

What it misses

NIAH is deliberately simple. It tests memorization, not reasoning. A model that scores 100% on NIAH might still fail to synthesize information from different parts of a long context. Retrieving a phone number from page 200 is easy. Comparing two contradictory statements on page 50 and page 350 is hard. NIAH doesn't test the latter.

🔬 Research insight: Single-needle NIAH has become almost too easy for frontier models. Multi-needle variants (hiding 8-10 facts and asking for all of them) are now the standard for discriminating between models. This is closer to real-world use cases where you need to gather multiple pieces of information from a large document.

RULER: What's the Real Context Size?

RULER, introduced in the paper "RULER: What's the Real Context Size of Your Long-Context Language Models?", expands on NIAH by testing four categories of capability across varying context lengths.[10]

| Category | What it tests | Example task |
| --- | --- | --- |
| Retrieval | Finding specific information | Multi-key NIAH (retrieve multiple needles) |
| Multi-hop tracing | Following chains of references | "X is Y. Y is Z. What is X?" across 500K tokens |
| Aggregation | Counting or summarizing patterns across the full context | "How many times does entity X appear?" |
| Question answering | Answering questions requiring multi-document reasoning | Synthesizing information across documents |
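The multi-hop tracing row is easy to make concrete with a toy task generator (our own sketch in the spirit of RULER's variable-tracing tasks, not the benchmark's actual code; `VAR_i` names are invented):

```python
import random

def make_multihop_task(hops: int, seed: int = 0):
    """Build a chain 'X1 is X2. X2 is X3. ...' with shuffled facts, and
    ask what X1 ultimately resolves to. In a real RULER-style run, the
    facts would be scattered through hundreds of thousands of tokens
    of distractor text."""
    rng = random.Random(seed)
    names = rng.sample([f"VAR_{i}" for i in range(1000)], hops + 1)
    facts = [f"{a} is {b}." for a, b in zip(names, names[1:])]
    rng.shuffle(facts)  # so fact order gives nothing away
    question = f"Following the chain, what is {names[0]}?"
    return " ".join(facts), question, names[-1]
```

Scoring is then exact-match on the returned answer, repeated across context lengths to find where accuracy starts to drop.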

Why RULER matters

RULER revealed a critical gap in the industry: many models that claim large context windows show significant performance degradation well before reaching their stated limit. A model might have a 128K token window but only effectively use 64K tokens before accuracy starts dropping. RULER measures the "effective context length," not just the "maximum context length."

💡 Key insight: The gap between stated and effective context length is the single most important thing to understand about context window claims. A model with a 200K effective context is more useful than a model with a 1M stated context but 300K effective context, at least for tasks that require reasoning across the full window.

MRCR: Multi-Round Coreference Resolution

Anthropic's internal benchmark that specifically tests how well a model maintains entity tracking across a long, multi-turn conversation.[1] The name stands for "Multi-Round Coreference Resolution."

How it works

MRCR creates a conversation where entities (people, objects, concepts) are introduced, referenced, and modified across many turns. The model must track which pronouns refer to which entities, even when the conversation spans hundreds of thousands of tokens. The v2 variant (MRCR v2) uses 8 needles, making it significantly harder.

Real-world relevance

This benchmark is closest to what long-running agents experience. An agent that has been debugging a complex issue for 50 turns needs to remember that "the service" from turn 3 refers to the authentication microservice, not the payments service introduced in turn 30. MRCR directly tests this capability.

[Figure: MRCR v2 (8-needle) benchmark comparison showing long context retrieval performance across five models at 256K and 1M tokens.]

The results are striking, and they reveal a gap between models that looks nothing like the near-perfect NIAH scores:

| Model | 256K tokens | 1M tokens | Drop |
| --- | --- | --- | --- |
| Claude Opus 4.6 | 91.9% | 78.3% | -13.6 |
| Claude Sonnet 4.6 | 90.6% | 65.1% | -25.5 |
| GPT-5.4 | 79.3% | 36.6% | -42.7 |
| Gemini 3.1 Pro | 59.1% | 25.9% | -33.2 |
| Claude Sonnet 4.5 | 10.8% | 18.5% | +7.7 |

Several things jump out. Claude Opus 4.6 retains almost 80% accuracy at the full 1M window, far ahead of the competition. GPT-5.4 drops to 36.6% at 1M despite a strong 79.3% at 256K, suggesting it struggles with genuine long-range entity tracking. And Gemini 3.1 Pro's 25.9% at 1M tells a very different story than its "1M context window" marketing might suggest. It's an especially revealing contrast: a model can technically accept 1M tokens while functionally losing track of entities well before that.

โš ๏ธ Common mistake: Treating all context window benchmarks as equivalent. NIAH tests recall. RULER tests multi-skill effectiveness. MRCR tests sustained entity tracking. A model can ace one and fail another. Always check which benchmark is being cited.

The technique stack: how models handle long context

Supporting a 1M token context window isn't just about training on longer sequences. There's a deep stack of architectural innovations that make it work.

Positional encoding for long sequences

Standard absolute positional encodings break down beyond the training length. Two techniques dominate:

RoPE (Rotary Position Embeddings) encodes relative position using rotation matrices, which naturally extends to longer sequences.[11] Most modern models (including GPT-5.4, Gemini, Qwen) use RoPE or a variant.

ALiBi (Attention with Linear Biases) adds a simple linear penalty to attention scores based on distance, allowing the model to extrapolate to longer sequences than it was trained on.[12]

Both allow models trained on shorter contexts to extend to longer ones, though performance typically degrades without fine-tuning on longer data. Positional interpolation techniques (like YaRN and NTK-aware scaling) address this by scaling the position encodings to fit more positions into the same range.[13][14][15]
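Linear positional interpolation is simple enough to show on a single vector (a toy single-head RoPE sketch under our own simplifications — real implementations work on batched query/key tensors, and YaRN/NTK-aware variants rescale per-frequency rather than uniformly):

```python
import math

def rope_rotate(x: list[float], position: int,
                theta: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Apply RoPE to an even-length vector. scale < 1 implements linear
    positional interpolation: position * scale squeezes a longer sequence
    into the position range the model was trained on."""
    pos = position * scale
    out = []
    for i in range(0, len(x), 2):
        freq = theta ** (-i / len(x))      # per-pair rotation frequency
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s,   # rotate each (x[i], x[i+1]) pair
                x[i] * s + x[i + 1] * c]
    return out
```

With `scale=0.25`, token 400 gets the same rotation as token 100 did at training time — which is exactly why interpolation extends the usable range but blurs fine-grained position distinctions until the model is fine-tuned on long data.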

KV cache optimization

In autoregressive generation, the Key-Value (KV) cache stores computed attention states for all previous tokens. At 1M tokens, this cache can consume 50-100+ GB of GPU memory. Several optimizations make this tractable:

  • GQA (Grouped-Query Attention): Shares KV heads across multiple query heads, reducing cache size by 4-8x with minimal quality loss.
  • Paged Attention: Manages the KV cache like virtual memory pages, eliminating fragmentation and enabling near-100% memory utilization.
  • KV Cache Compression: Evicts or quantizes the least-important keys and values, trading a small accuracy hit for major memory savings.
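The memory pressure these techniques fight is easy to compute from first principles (the model config below is illustrative, not any specific released model — actual cache sizes depend heavily on architecture and precision):

```python
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elt: int = 2) -> float:
    """KV cache size: keys + values for every layer and every cached token.
    bytes_per_elt=2 assumes an fp16/bf16 cache; quantized caches use 1."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = K and V
    return elems * bytes_per_elt / 1e9

# Hypothetical 70B-class config at a 1M-token context, fp16 cache:
full_mha = kv_cache_gb(1_000_000, 80, 64, 128)  # 64 KV heads (no GQA)
gqa      = kv_cache_gb(1_000_000, 80, 8, 128)   # GQA with 8 KV heads
print(f"MHA: {full_mha:.0f} GB, GQA: {gqa:.0f} GB ({full_mha / gqa:.0f}x smaller)")
```

The 8x reduction from GQA alone is the difference between "needs a whole GPU cluster just for cache" and "fits on a node" — and quantizing the cache to 8-bit halves it again.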

FlashAttention and IO-aware computing

FlashAttention reformulates the attention computation to minimize reads and writes between GPU HBM (High Bandwidth Memory) and SRAM (on-chip fast memory).[7] This doesn't change the O(n²) computational complexity, but it dramatically reduces the wall-clock time by avoiding memory bottlenecks. Without FlashAttention, 1M token context windows would be impractical on current hardware.

When to use (and when to avoid) full context

Here's a practical decision framework for production systems:

| Scenario | Strategy | Why |
| --- | --- | --- |
| Single-document analysis (legal, medical) | ✅ Full context | The document is your context. No retrieval needed. |
| Codebase Q&A | ✅ Full context (if it fits) | Models reason better with full dependency graphs visible. |
| Long-running agents | ✅ Full context + late compaction | Avoid lossy summarization as long as possible. |
| Searching across many documents | ⚠️ RAG first, then context | Use embeddings to retrieve top-K, then stuff into context. 1M tokens of irrelevant text hurts accuracy. |
| Real-time chat (latency-sensitive) | ❌ Avoid filling the window | 30-60s TTFT is unacceptable for interactive use. Use sliding windows. |
| High-volume batch processing | ⚠️ Cost-benefit analysis | At $5 per 1M input tokens, processing 10,000 documents gets expensive fast. |

💡 Key insight: Long context and RAG aren't competing strategies. They're complementary. Use RAG to filter down to the most relevant content, then use the large context window to include all the relevant content without truncation. The combination outperforms either approach alone.

When designing your architecture, always default to the smallest context window that reliably solves your problem. You can always scale up when you hit a ceiling, but starting with a massive context window will needlessly inflate your latency and API costs.

Key takeaways

  • 1M tokens ≈ 750K words ≈ 2,500 pages. Enough for an entire codebase, but not for everything. Think "project-scale," not "enterprise-scale."
  • Stated context length ≠ effective context length. Always check benchmark results (especially RULER and MRCR) to understand how much of the window the model actually uses well.
  • The "Lost in the Middle" problem is real. Information placement matters. Put critical content at the beginning or end, not buried in the middle.
  • Pricing is shifting. Anthropic's flat-rate pricing across the full window is a strong signal. Expect other providers to follow. But filling 1M tokens still costs real money per request.
  • Latency is the hidden cost. Long prompts mean slow Time to First Token. Design your architecture around this: batch processing is fine, real-time chat isn't.
  • Long context + RAG > either alone. Use retrieval to filter, then context to reason. Don't dump everything into the prompt and hope for the best.

What comes next

The context window race isn't slowing down. Grok 4.20's 2M-token window suggests we'll see 5-10M within a year. But the real frontier isn't window size, it's effective utilization. The models that win won't be the ones with the largest context, but the ones that can reason, cross-reference, and synthesize across every token they receive.

As the "Lost in the Middle" research shows, we're still far from perfect recall across even current window sizes. Architectural innovations (sparse attention, hierarchical context representations, retrieval-augmented memory) will likely matter more than raw window expansion. Keep watching the benchmarks, not the press releases.

Ultimately, context length will become an invisible implementation detail rather than a headline feature. Just as developers today don't worry about the RAM limits of individual functions unless they hit extreme edge cases, future engineers will treat context windows as functionally unbounded. Until then, you have to measure, validate, and build defensively.

References

[1] Anthropic (2026). "1M context is now generally available for Opus 4.6 and Sonnet 4.6."
[2] OpenAI (2026). "Introducing GPT-5.4."
[3] Google (2026). "Gemini API Pricing."
[4] xAI (2026). "xAI Models and Pricing."
[5] Liu, N. F., et al. (2023). "Lost in the Middle: How Language Models Use Long Contexts."
[6] Vaswani, A., et al. (2017). "Attention Is All You Need."
[7] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.
[8] Li, T., et al. (2024). "Long-context LLMs Struggle with Long In-context Learning."
[9] Kamradt, G. (2023). "Needle In A Haystack: Pressure Testing LLMs."
[10] Hsieh, C.-Y., et al. (2024). "RULER: What's the Real Context Size of Your Long-Context Language Models?"
[11] Su, J., et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding."
[12] Press, O., Smith, N. A., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization." ICLR 2022.
[13] Chen, S., et al. (2023). "Extending Context Window of Large Language Models via Positional Interpolation."
[14] Peng, B., et al. (2023). "YaRN: Efficient Context Window Extension of Large Language Models."
[15] bloc97 (2023). "NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation."
