Every frontier model now ships with a 1M+ token context window. But what fits in a million tokens? What breaks? And how do you know if the model actually uses all that context? We break it all down.
You've probably seen the announcements. Claude Opus 4.6 and Sonnet 4.6 now offer 1 million tokens of context at no extra cost.[1] GPT-5.4 supports 1M tokens with a cost tier.[2] Gemini 3.1 Pro charges double above 200K tokens.[3] And Grok 4.20 is pushing the envelope with a 2 million token context window.[4] Even open-source models like Qwen3.5 now support 262,144 tokens. A year ago, 128K tokens felt generous. Today, a million is the new baseline for frontier models.
But here's the question nobody seems to be asking: what does any of this actually mean? How much text fits in a million tokens? Does the model actually use all that context, or does information decay the further back you go? And if you're building production systems, should you even care?
This post answers all of that. We'll make the numbers tangible, explore what goes right (and wrong) with massive context windows, and dig into the benchmarks that separate marketing claims from real capability. Let's go.
Token counts are abstract. Nobody thinks in tokens. So let's translate. (If you're not sure what a token even is, our deep dive on tokenization covers BPE, WordPiece, and SentencePiece in detail.)
A rough rule of thumb: 1 token ≈ 0.75 English words ≈ 4 characters. That means 1 million tokens is roughly 750,000 words. To put that in perspective:
| What you're fitting | Approximate size | Fits in 1M tokens? |
|---|---|---|
| A typical email | 200-500 tokens | ✅ ~2,000 emails |
| A single Slack thread (50 messages) | ~5,000 tokens | ✅ ~200 threads |
| The Great Gatsby | ~72,000 tokens | ✅ ~13 copies |
| A PhD dissertation | ~80,000 tokens | ✅ ~12 dissertations |
| A 400-page legal contract | ~120,000 tokens | ✅ ~8 full contracts |
| 10 hours of meeting transcripts | ~150,000 tokens | ✅ ~66 hours total |
| The Lord of the Rings (trilogy) | ~576,000 tokens | ✅ ~1.7 copies |
| React.js source code (entire repo) | ~600,000 tokens | ✅ Fits comfortably |
| King James Bible | ~783,000 tokens | ✅ ~1.3 copies |
| Harry Potter (all 7 books) | ~1,100,000 tokens | ❌ Just barely exceeds |
| Linux kernel (core, no drivers) | ~5,000,000 tokens | ❌ Way too large |
💡 Key insight: 1M tokens isn't infinite. It's roughly the size of a long book series or a medium-sized codebase. Enough to hold an entire project's source code in memory, but not enough for a mature open-source project like the Linux kernel. The practical question isn't "can I fit everything?" but "can I fit enough?"
For engineers, the most exciting comparison might be codebases. A medium-sized production service (50-100 files, 20K lines of code) translates to roughly 100,000-200,000 tokens. That means you could load 5-10 complete microservices into a single context window. That is a fundamentally different capability than what we had two years ago.
⚠️ Important caveat: language matters. These estimates assume English text. Most production tokenizers were trained on predominantly English corpora, so non-Latin scripts use significantly more tokens per word. Chinese, Japanese, Korean, Arabic, and Hindi text can consume 1.5-3x more tokens for the same semantic content. A 1M context window that holds a 750K-word English document might only hold 300K-400K words of Chinese text. If you're building multilingual systems, always measure your actual token consumption. Don't assume the English ratios hold. For a deeper understanding of why this happens, see our article on tokenization algorithms.
Context window size is only half the story. The economics of using that context matter just as much.
| Model | Context window | Input price (per 1M tokens) | Long-context premium? |
|---|---|---|---|
| Claude Opus 4.6 | 1M | $5 | ✅ No premium (as of March 13, 2026) |
| Claude Sonnet 4.6 | 1M | $3 | ✅ No premium |
| GPT-5.4 | 1M | Tiered | ⚠️ Separate cost tier for 1M context |
| Gemini 3.1 Pro | 1M | $2 (≤200K), $4 (>200K) | ⚠️ 2x above 200K tokens |
| Grok 4.20 | 2M | $2 | ✅ No premium |
| Qwen3.5 (open) | 262K | Self-hosted | N/A (your GPU bill) |
🎯 Production tip: Anthropic dropping the long-context premium is a bigger deal than it sounds. Previously, filling a 900K-token prompt with Opus could cost significantly more per token than a 9K prompt. Now the per-token rate is flat across the entire window. This changes the economics of "just include everything" strategies from prohibitively expensive to genuinely viable.
Bigger context windows unlock use cases that were genuinely impossible before. Here are the most impactful:
Instead of carefully selecting which files to include in your prompt, you can load an entire project. The model sees the full dependency graph, all the configuration files, the test suite, and the README. When you ask "why does this API endpoint return 500 when I pass an empty array?", the model doesn't need to guess which files matter. It can trace the call path from router to handler to database layer in a single pass.
🔬 Research insight: Anthropic reports that Cognition's Devin agent saw a measurable quality improvement in code reviews after switching to 1M context with Opus 4.6, because full diffs no longer needed to be chunked across multiple passes.[1]
Agentic systems that run for hours accumulate massive context: tool calls, observations, intermediate reasoning, error logs. Before 1M context, agents needed compaction (summarizing earlier parts of the conversation to free up space). Compaction loses detail. With 1M tokens, agents can hold 5-10x more history before needing to compress, and some conversations never need compaction at all.
Legal teams can load five complete contracts into a single prompt and ask "what are the inconsistencies in the termination clauses across these agreements?" Medical researchers can load dozens of papers and ask for a synthesis. Financial analysts can include an entire year of quarterly reports plus earnings call transcripts.
With 1M tokens, you can include hundreds of examples in your prompt. Instead of 5-10 few-shot examples, you can provide 200+. For tasks like classification, entity extraction, or structured output generation, more examples consistently improve accuracy. The context window was previously the limiting factor; now your budget is.
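Packing examples until you hit a budget is the core mechanic. Here's a minimal sketch; `build_manyshot_prompt` is a hypothetical helper, the `Input:`/`Output:` format is illustrative, and the 4-chars-per-token estimate should be swapped for a real tokenizer in production:

```python
def build_manyshot_prompt(instruction: str, examples: list[tuple[str, str]],
                          query: str, token_budget: int = 900_000) -> str:
    """Pack as many few-shot examples as the token budget allows.

    Uses the rough 4-chars-per-token heuristic; `examples` is a list of
    (input, label) pairs, ordered by priority if you have one.
    """
    parts = [instruction]
    used = len(instruction) // 4
    for text, label in examples:
        block = f"Input: {text}\nOutput: {label}"
        cost = len(block) // 4
        if used + cost > token_budget:
            break  # budget exhausted: stop adding examples
        parts.append(block)
        used += cost
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```

With a 900K-token budget and ~300-token examples, this packs roughly 3,000 shots; in practice you'd stop well before that, once accuracy plateaus for your task.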
More context isn't free, even when the pricing says it is. There are real trade-offs that most announcements gloss over.
One of the most important findings in long-context research: models don't pay equal attention to all parts of a long prompt. Information placed in the middle of a very long context is significantly harder for the model to recall than information at the beginning or end.[5]
This creates a U-shaped retrieval curve. If you bury a critical fact on page 200 of a 400-page prompt, the model is more likely to miss it than if you put it on page 1 or page 400. The practical implication: order matters, even with a 1M context window.
⚠️ Common mistake: Assuming "more context = better answers." If you dump an entire codebase into the prompt without structuring it, the model may perform worse than if you carefully selected the 5 most relevant files. Long context is a capability, not a strategy.
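One practical mitigation is to exploit the U-shape directly: if you have documents ranked by relevance, alternate them so the strongest candidates land at the edges of the prompt and the weakest sink to the middle. A minimal sketch (the `order_for_recall` helper is hypothetical):

```python
def order_for_recall(docs_by_relevance: list[str]) -> list[str]:
    """Arrange documents so the most relevant ones sit at the edges.

    Given docs sorted most-to-least relevant, alternate placement:
    rank 1 goes to the front, rank 2 to the back, rank 3 to the front,
    and so on -- pushing the least relevant material into the middle,
    where the U-shaped retrieval curve says misses are most likely.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five docs ranked a > b > c > d > e, this yields the order a, c, e, d, b: the two best at the extremes, the worst buried in the middle.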
Filling a 1M token context window is slow. Self-attention in the standard Transformer architecture has O(n²) complexity with respect to sequence length.[6] That means doubling the context doesn't just double the compute; it roughly quadruples the attention computation. Modern implementations use optimizations like FlashAttention[7] and KV caching to mitigate this, but the fundamental scaling challenge remains.
| Prompt size | Approximate TTFT (Time to First Token) |
|---|---|
| 10K tokens | < 1 second |
| 100K tokens | 2-5 seconds |
| 500K tokens | 15-30 seconds |
| 1M tokens | 30-60+ seconds |
For interactive chat, making a user wait 45 seconds for a response isn't acceptable. For batch processing or agentic workflows where latency is less critical, it's a worthwhile trade-off. Choose your architecture accordingly.
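The quadratic scaling behind those numbers is easy to sketch. This toy calculator models only the O(n²) attention term; real systems also do O(n) feed-forward work and benefit from heavy kernel optimizations, so treat it as intuition, not a latency predictor:

```python
def relative_prefill_cost(n_tokens: int, base_tokens: int = 10_000) -> float:
    """Relative attention cost vs. a baseline prompt, under O(n^2) scaling.

    An upper-bound intuition pump: going from `base_tokens` to `n_tokens`
    multiplies attention compute by the square of the length ratio.
    """
    return (n_tokens / base_tokens) ** 2

# 10K -> 1M tokens is 100x more tokens,
# but ~10,000x more attention compute:
relative_prefill_cost(1_000_000)
```

That 100x-tokens-to-10,000x-compute gap is why TTFT grows so much faster than prompt size in the table above.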
Even with flat per-token pricing, the sheer volume adds up. A single 1M-token prompt to Claude Opus 4.6 at $5 per million input tokens costs $5 just for the input. If the model generates a 4K-token response at $25 per million output tokens, that's another $0.10. Now imagine an agent that makes 20 such calls in a session: that's $100+ in API costs for a single conversation.
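The arithmetic above generalizes to a one-liner; the prices used here are the illustrative rates from the pricing table, not guaranteed figures:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of a single API call at flat per-token rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 1M-token prompt with a 4K-token response at the Opus 4.6 rates
# quoted above ($5/M input, $25/M output): $5.00 + $0.10 = $5.10
single = call_cost(1_000_000, 4_000, 5.0, 25.0)

# An agent making 20 such calls in one session: ~$102
session = 20 * single
```

Wiring a calculator like this into your agent's budget checks is a cheap way to catch runaway context growth before the invoice does.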
🎯 Production tip: Don't default to filling the full context window. Use it strategically. Pre-filter with embeddings or keyword search, then include only the most relevant content. A 200K prompt that's 90% relevant will outperform a 1M prompt that's 20% relevant, and cost 80% less.
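As a sketch of that filter-then-stuff pipeline, here's a pure-Python stand-in that scores documents by word overlap with the query; in production you'd score with embeddings and a vector index, but the shape of the pipeline is the same:

```python
def prefilter(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Rank documents by word overlap with the query and keep the top-k.

    A deliberately naive stand-in for embedding similarity: the point is
    the pipeline shape (filter first, then put only the survivors into
    the context window), not the scoring function.
    """
    query_words = set(query.lower().split())

    def score(doc: str) -> int:
        return len(query_words & set(doc.lower().split()))

    ranked = sorted(documents, key=score, reverse=True)
    return ranked[:top_k]
```

Even this crude overlap score will keep a weather report out of a prompt about payment retries; a real embedding model just makes the ranking sharper.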
Even models that score well on benchmarks show degraded reasoning quality as context grows. Anthropic calls this "context rot." The model doesn't suddenly forget everything, but its ability to reason across distant parts of the context weakens. Cross-referencing a detail from page 3 with a detail from page 350 is harder than cross-referencing two details on the same page, regardless of how large the context window is.[8]
When a provider claims "1M context window," the natural question is: does the model actually use all those tokens effectively? Three families of benchmarks try to answer this question, each testing a different dimension.
The simplest and most intuitive benchmark. You insert a specific fact (the "needle") at a random position within a large block of irrelevant text (the "haystack"), then ask the model to retrieve that fact.[9]
NIAH tests recall: can the model find a specific piece of information regardless of where it sits? A perfect score means the model has functional access to the entire context window. Most modern frontier models score near-perfect (>99%) on single-needle NIAH across their full stated context length.
NIAH is deliberately simple. It tests memorization, not reasoning. A model that scores 100% on NIAH might still fail to synthesize information from different parts of a long context. Retrieving a phone number from page 200 is easy. Comparing two contradictory statements on page 50 and page 350 is hard. NIAH doesn't test the latter.
🔬 Research insight: Single-needle NIAH has become almost too easy for frontier models. Multi-needle variants (hiding 8-10 facts and asking for all of them) are now the standard for discriminating between models. This is closer to real-world use cases where you need to gather multiple pieces of information from a large document.
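A minimal single-needle harness is only a few lines. This sketch inserts the needle at a relative depth and sweeps depths from 0% to 100%; real harnesses also sweep total context length and insert at sentence boundaries, and the needle text here is purely illustrative:

```python
def build_niah_prompt(haystack: str, needle: str, depth: float) -> str:
    """Insert a needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(len(haystack) * depth)
    return haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]

# Sweep the depth axis to map where recall degrades:
depths = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
needle = "The secret launch code is 7-4-1-9."
prompts = [build_niah_prompt("filler text " * 1000, needle, d) for d in depths]
# Each prompt then gets sent to the model with a question like
# "What is the secret launch code?" and scored on exact recall.
```

Plotting recall against depth (and against total length) is exactly how the U-shaped "lost in the middle" curves are produced.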
RULER expands on NIAH by testing four categories of capabilities across varying context lengths.[10]
| Category | What it tests | Example task |
|---|---|---|
| Retrieval | Finding specific information | Multi-key NIAH (retrieve multiple needles) |
| Multi-hop tracing | Following chains of references | "X is Y. Y is Z. What is X?" across 500K tokens |
| Aggregation | Counting or summarizing patterns across the full context | "How many times does entity X appear?" |
| Question answering | Answering questions requiring multi-document reasoning | Synthesizing information across documents |
RULER revealed a critical gap in the industry: many models that claim large context windows show significant performance degradation well before reaching their stated limit. A model might have a 128K token window but only effectively use 64K tokens before accuracy starts dropping. RULER measures the "effective context length," not just the "maximum context length."
💡 Key insight: The gap between stated and effective context length is the single most important thing to understand about context window claims. A model with a 200K effective context is more useful than a model with a 1M stated context but 300K effective context, at least for tasks that require reasoning across the full window.
MRCR, short for "Multi-Round Coreference Resolution," is a benchmark Anthropic uses to test how well a model maintains entity tracking across a long, multi-turn conversation.[1]
MRCR creates a conversation where entities (people, objects, concepts) are introduced, referenced, and modified across many turns. The model must track which pronouns refer to which entities, even when the conversation spans hundreds of thousands of tokens. The v2 variant (MRCR v2) uses 8 needles, making it significantly harder.
This benchmark is closest to what long-running agents experience. An agent that has been debugging a complex issue for 50 turns needs to remember that "the service" from turn 3 refers to the authentication microservice, not the payments service introduced in turn 30. MRCR directly tests this capability.
The results are striking, and reveal a massive gap between models that looks nothing like the near-perfect NIAH scores:
| Model | 256K tokens | 1M tokens | Drop |
|---|---|---|---|
| Claude Opus 4.6 | 91.9% | 78.3% | -13.6 |
| Claude Sonnet 4.6 | 90.6% | 65.1% | -25.5 |
| GPT-5.4 | 79.3% | 36.6% | -42.7 |
| Gemini 3.1 Pro | 59.1% | 25.9% | -33.2 |
| Claude Sonnet 4.5 | 10.8% | 18.5% | +7.7 |
Several things jump out. Claude Opus 4.6 retains almost 80% accuracy at the full 1M window, far ahead of the competition. GPT-5.4 drops to 36.6% at 1M despite a strong 79.3% at 256K, suggesting it struggles with genuine long-range entity tracking. And Gemini 3.1 Pro's 25.9% at 1M tells a very different story than its "1M context window" marketing might suggest. It's an especially revealing contrast: a model can technically accept 1M tokens while functionally losing track of entities well before that.
⚠️ Common mistake: Treating all context window benchmarks as equivalent. NIAH tests recall. RULER tests multi-skill effectiveness. MRCR tests sustained entity tracking. A model can ace one and fail another. Always check which benchmark is being cited.
Supporting a 1M token context window isn't just about training on longer sequences. There's a deep stack of architectural innovations that make it work.
Standard absolute positional encodings break down beyond the training length. Two techniques dominate:
RoPE (Rotary Position Embeddings) encodes relative position using rotation matrices, which naturally extends to longer sequences.[11] Most modern models (including GPT-5.4, Gemini, Qwen) use RoPE or a variant.
ALiBi (Attention with Linear Biases) adds a simple linear penalty to attention scores based on distance, allowing the model to extrapolate to longer sequences than it was trained on.[12]
Both allow models trained on shorter contexts to extend to longer ones, though performance typically degrades without fine-tuning on longer data. Positional interpolation techniques (like YaRN and NTK-aware scaling) address this by scaling the position encodings to fit more positions into the same range.[13][14][15]
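The core of RoPE, and of positional interpolation on top of it, fits in a short sketch. This is a didactic pure-Python version operating on a single vector, not an optimized kernel, and the 128K trained length and 1M target are illustrative numbers:

```python
import math

def rope_rotate(vec: list[float], position: float, base: float = 10000.0) -> list[float]:
    """Apply a rotary position embedding to one head-dimension vector.

    Consecutive pairs (x0, x1), (x2, x3), ... are rotated by an angle
    proportional to `position`, each pair at a different frequency.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * cos_t - x1 * sin_t, x0 * sin_t + x1 * cos_t]
    return out

def interpolated_rope(vec, position, trained_len=128_000, target_len=1_000_000):
    """Positional interpolation: squeeze positions back into the trained
    range by linear scaling, so a 1M-token position lands where the
    model actually saw positions during training."""
    scale = trained_len / target_len
    return rope_rotate(vec, position * scale)
```

Note that the rotation preserves vector norms (it only changes angles), which is part of why RoPE composes cleanly with attention; interpolation trades some angular resolution between nearby positions for coverage of the longer range.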
In autoregressive generation, the Key-Value (KV) cache stores computed attention states for all previous tokens. At 1M tokens, this cache can consume 50-100+ GB of GPU memory. Several optimizations make this tractable:
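A back-of-the-envelope estimator makes that memory pressure concrete. The model shape below (32 layers, 8 KV heads via grouped-query attention, head dimension 128) is hypothetical:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys + values, for every layer and KV head.

    bytes_per_value: 2 for fp16/bf16, 1 for an fp8/int8-quantized cache.
    """
    # Factor of 2 = one key vector plus one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# A hypothetical GQA model: 32 layers, 8 KV heads, head_dim 128,
# with an 8-bit cache, at a 1M-token context:
gb = kv_cache_bytes(1_000_000, 32, 8, 128, bytes_per_value=1) / 1e9  # ~65.5 GB
```

This is why grouped-query attention (fewer KV heads) and cache quantization matter so much at long context: each halving of KV heads or bytes-per-value halves the cache, and at 1M tokens those halvings are the difference between fitting on one GPU and needing several.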
FlashAttention reformulates the attention computation to minimize reads and writes between GPU HBM (High Bandwidth Memory) and SRAM (on-chip fast memory).[7] This doesn't change the O(n²) computational complexity, but it dramatically reduces the wall-clock time by avoiding memory bottlenecks. Without FlashAttention, 1M token context windows would be impractical on current hardware.
Here's a practical decision framework for production systems:
| Scenario | Strategy | Why |
|---|---|---|
| Single-document analysis (legal, medical) | ✅ Full context | The document is your context. No retrieval needed. |
| Codebase Q&A | ✅ Full context (if it fits) | Models reason better with full dependency graphs visible. |
| Long-running agents | ✅ Full context + late compaction | Avoid lossy summarization as long as possible. |
| Searching across many documents | ⚠️ RAG first, then context | Use embeddings to retrieve top-K, then stuff into context. 1M tokens of irrelevant text hurts accuracy. |
| Real-time chat (latency-sensitive) | ❌ Avoid filling the window | 30-60s TTFT is unacceptable for interactive use. Use sliding windows. |
| High-volume batch processing | ⚠️ Cost-benefit analysis | At $5 per 1M input tokens, processing 10,000 documents gets expensive fast. |
💡 Key insight: Long context and RAG aren't competing strategies. They're complementary. Use RAG to filter down to the most relevant content, then use the large context window to include all the relevant content without truncation. The combination outperforms either approach alone.
When designing your architecture, always default to the smallest context window that reliably solves your problem. You can always scale up when you hit a ceiling, but starting with a massive context window will needlessly inflate your latency and API costs.
1M tokens ≈ 750K words ≈ 2,500 pages. Enough for an entire codebase, but not for everything. Think "project-scale," not "enterprise-scale."
Stated context length โ effective context length. Always check benchmark results (especially RULER and MRCR) to understand how much of the window the model actually uses well.
The "Lost in the Middle" problem is real. Information placement matters. Put critical content at the beginning or end, not buried in the middle.
Pricing is shifting. Anthropic's flat-rate pricing across the full window is a strong signal. Expect other providers to follow. But filling 1M tokens still costs real money per request.
Latency is the hidden cost. Long prompts mean slow Time to First Token. Design your architecture around this: batch processing is fine, real-time chat isn't.
Long context + RAG > either alone. Use retrieval to filter, then context to reason. Don't dump everything into the prompt and hope for the best.
The context window race isn't slowing down. Grok 4.20's 2M-token window suggests we'll see 5-10M within a year. But the real frontier isn't window size, it's effective utilization. The models that win won't be the ones with the largest context, but the ones that can reason, cross-reference, and synthesize across every token they receive.
As the "Lost in the Middle" research shows, we're still far from perfect recall across even current window sizes. Architectural innovations (sparse attention, hierarchical context representations, retrieval-augmented memory) will likely matter more than raw window expansion. Keep watching the benchmarks, not the press releases.
Ultimately, context length will become an invisible implementation detail rather than a headline feature. Just as developers today don't worry about the RAM limits of individual functions unless they hit extreme edge cases, future engineers will treat context windows as functionally unbounded. Until then, you have to measure, validate, and build defensively.
[1] 1M context is now generally available for Opus 4.6 and Sonnet 4.6
Anthropic · 2026
[2] Introducing GPT-5.4
OpenAI · 2026
[3] Gemini API Pricing
Google · 2026
[4] xAI Models and Pricing
xAI · 2026
[5] Lost in the Middle: How Language Models Use Long Contexts
Liu, N. F., et al. · 2023
[6] Attention Is All You Need
Vaswani, A., et al. · 2017
[7] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · NeurIPS 2022
[8] Long-context LLMs Struggle with Long In-context Learning
Li, T., et al. · 2024
[9] Needle In A Haystack: Pressure Testing LLMs
Kamradt, G. · 2023
[10] RULER: What's the Real Context Size of Your Long-Context Language Models?
Hsieh, C.-Y., et al. · 2024
[11] RoFormer: Enhanced Transformer with Rotary Position Embedding
Su, J., et al. · 2021
[12] Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization
Press, O., Smith, N. A., & Lewis, M. · ICLR 2022
[13] Extending Context Window of Large Language Models via Positional Interpolation
Chen, S., et al. · 2023
[14] YaRN: Efficient Context Window Extension of Large Language Models
Peng, B., et al. · 2023
[15] NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation
bloc97 · 2023