Understand how reasoning models trade extra inference compute for better answers, and what that means for search, verifiers, KV cache pressure, and routing.
State Space Models showed one way to keep decode state from growing with context length by compressing history into recurrent state. Reasoning models move in the opposite direction on purpose: spend more inference compute when the task is hard enough to justify it.
Think about two dispatch planners handling the same delayed-truck problem. The first planner accepts the first reroute that looks plausible. The second planner checks carrier capacity, hub cutoffs, refund risk, and customer priority before committing. When those checks catch a real mistake, the second plan performs better, not because the planner has different data, but because it spends more compute on the decision.
The same idea now shapes how frontier large language models (LLMs) are built and evaluated. Classic scaling work focused on train-time compute: more parameters, more data, and more pretraining FLOPs (floating-point operations).[1] Work from 2024 onward made a second axis impossible to ignore: on hard reasoning tasks, you can often get better answers by spending more compute during generation itself.[2][3] This is test-time compute scaling. Reasoning-model APIs from OpenAI and open-weight systems such as DeepSeek-R1 made this shift visible, and Snell et al. showed that, on tasks where a smaller model already has a non-trivial chance of success, extra inference-time compute can beat a much larger single-pass model.[3][4][5][2]
Provider controls are model-specific and they change. OpenAI documents reasoning.effort; Google's Gemini docs use thinkingLevel for Gemini 3 models and thinkingBudget for Gemini 2.5 models; Anthropic's current Claude Opus 4.7 docs use adaptive thinking with an effort parameter, while older Claude models use manual extended thinking with budget_tokens.[4][6][7] The durable skill is deciding how much compute a request deserves, because more thinking is not always better.[8]
In this article you'll learn why some problems need deliberate reasoning instead of fast pattern matching, how test-time compute scaling works, and what it means for production systems. You'll need a basic mental model of how transformers predict the next token (covered earlier in the preparation path). By the end, you'll be able to choose between single-pass, best-of-N, and guided-search strategies, and you'll know why routing and token budgets are as important as the algorithms themselves.
Psychologist Daniel Kahneman's framework is a useful starting point. System 1 is fast, instinctive, and pattern-driven. When you read the word "strawberry" and immediately know it's a fruit, that's System 1. System 2 is slow, deliberate, and step-by-step. When you count how many times the letter "r" appears in "strawberry," you have to switch to System 2 because your gut reaction ("two?") is often wrong. The correct answer is three, but you only get there by checking each letter deliberately.
The System 1/System 2 distinction is an analogy for serving behavior, not an architectural taxonomy. A low-budget single-pass call may answer from strong patterns; a model or surrounding system with more budget can check intermediate work, sample alternatives, or run tools. On tasks that require counting, lookahead, or multi-step deduction, that extra verification path can matter.
Reasoning-oriented models and systems are designed to make deliberate behavior easier to buy at inference time. Depending on the model and wrapper, they may spend non-visible reasoning tokens, sample alternatives, run an external verifier, or revise a candidate. Think of the difference between a routing system that accepts the first carrier option and one that checks cutoff times, capacity, destination risk, and refund impact before committing.
This shift from pure pattern matching to deliberate reasoning is what makes test-time compute scaling possible.
Here's a concrete example. Ask a model: "How many times does the letter 'r' appear in the word 'strawberry'?" A rushed answer can be "two," because the third "r" in "strawberry" is easy to miss without a check.
The problem isn't that the model lacks knowledge. It's that the task requires deliberate step-by-step verification, not fast association. Pretraining optimizes next-token prediction, which rewards fluent continuations more directly than checked final answers. On problems like complex math, code debugging, or logistics planning, the same dynamic appears: the model produces a plausible-looking answer that collapses under scrutiny.
Test-time compute scaling can address this failure mode by spending budget before committing. Instead of one shot, the model or the system around it can explore multiple approaches, verify intermediate steps, and select a candidate. The extra compute only helps when exploration or verification changes the result for the better.
Here's a useful analogy: think of a fulfillment network. Train-time compute is like months spent tuning forecasts, routes, and warehouse procedures before peak season. Test-time compute is the effort spent during a live incident: trying alternate routes, checking cutoff times, and verifying the plan before committing.
Classic scaling emphasized the pre-season tuning. Evaluations on reasoning tasks now show that inference-time allocation can sometimes compete with moving to a larger single-pass model.
Traditional scaling focuses on train-time compute: increasing model parameters $N$ or training tokens $D$. The relationship between compute and loss follows a power law:

$$L(C) \approx a \, C^{-\alpha}$$

Empirical scaling studies fit loss $L$ with an approximate power law of compute $C$, where $a$ and $\alpha$ are fitted constants and training compute is often estimated as roughly $C \approx 6ND$ (6 × model parameters × training tokens).[1] Within the measured regime, increasing training compute reduces loss predictably enough to guide training decisions.
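To make the 6 × parameters × tokens estimate concrete, here is a minimal sketch. The model sizes and token counts are hypothetical values chosen only for illustration, and `training_flops` is a helper name introduced here, not something from the cited work.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# Hypothetical configurations (assumptions for illustration only).
configs = [
    ("1B params, 20B tokens", 1e9, 20e9),
    ("7B params, 140B tokens", 7e9, 140e9),
    ("70B params, 1.4T tokens", 70e9, 1.4e12),
]

for label, n_params, n_tokens in configs:
    print(f"{label}: ~{training_flops(n_params, n_tokens):.2e} FLOPs")
```

Because the estimate is a product, a 10× increase in both parameters and tokens costs roughly 100× the training compute. This is why train-time scaling decisions are made months ahead rather than per request.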
Test-time compute scaling adds a second axis: inference compute. A practical mental model is that answer quality depends on both training compute and inference compute.
In a measured operating range, more inference compute can improve accuracy with diminishing returns. It can also plateau or reduce accuracy when a trace overthinks, sampled candidates are correlated, or a verifier selects the wrong branch. That extra compute can come from longer reasoning traces, repeated sampling, revision loops, or explicit search with a verifier.[2][8]
Snell et al.[2] don't claim one universal law for every model and every benchmark. Their more useful result is operational: if you allocate inference compute well, test-time scaling is strong enough that it can outperform simply buying a larger model and sampling once.
This represents a fundamental shift in how we approach problem-solving with AI models:
```text
Traditional approach: Train bigger → Answer once → Done
Reasoning approach: Train reasoner → Think longer → Search for best answer
```

```python
budgets = [0, 128, 512, 2048]
verified_accuracy = [0.71, 0.78, 0.83, 0.80]
latency_ms = [220, 310, 610, 1840]
latency_limit_ms = 1000

# Keep only the budgets that meet the latency limit.
eligible = [
    (accuracy, budget, latency)
    for budget, accuracy, latency in zip(budgets, verified_accuracy, latency_ms)
    if latency <= latency_limit_ms
]
# Tuples compare by accuracy first, so max() picks the most accurate eligible budget.
accuracy, budget, latency = max(eligible)
print(f"chosen_budget={budget} accuracy={accuracy:.0%} latency_ms={latency}")
print(f"max_budget_is_best={verified_accuracy[-1] == max(verified_accuracy)}")
```

```text
chosen_budget=512 accuracy=83% latency_ms=610
max_budget_is_best=False
```

This evaluation table encodes the production question: choose the best measured quality under the service-level objective, rather than assuming the longest trace is best.
At a high level, reasoning-oriented training and inference policies aim to allocate additional tokens or branches before returning an answer. The visible behavior can include decomposition, checks, or revision, but those behaviors are capabilities to evaluate rather than guarantees of every response.
Reasoning models often generate long scratchpads or intermediate traces before producing a final answer. Sometimes that trace is exposed, sometimes it's summarized, and sometimes it's hidden entirely by provider policy. What matters isn't whether the user sees every token, but whether the model is allowed to spend additional inference-time compute before answering.
This built-in reasoning process differs from prompted chain-of-thought (CoT)[9] in several key ways:
Current reasoning APIs make this concrete. OpenAI's reasoning docs describe reasoning tokens as non-visible output tokens that still consume context budget, and they recommend giving the task, constraints, and desired output format while treating reasoning effort as a tuning knob.[4]
The thinking budget is now a primary control surface, even though providers expose it with model-specific knobs. OpenAI's reasoning.effort sets an effort level for reasoning models.[4] Gemini 3 models use thinkingLevel, while Gemini 2.5 models retain a numeric thinkingBudget.[6] Claude Opus 4.7 uses adaptive thinking plus effort; Anthropic documents budget_tokens for older Claude extended-thinking models.[7] Provider prompting guidance also differs by model, so start with a clear task and constraints, tune the supported knob, and evaluate instead of assuming a chain-of-thought prompt helps.
The important point is that test-time compute can mean either a longer single trace or multiple sampled traces. OpenAI's o1 launch post made this visible at the benchmark level: on AIME 2024 (a math competition benchmark), reported accuracy improved when the system moved from a single sample to consensus over 64 samples and then to learned reranking over 1000 samples.[3] Treat this as one reported evaluation result, not a guarantee for other tasks or selection rules.
Test-time compute is an umbrella term, not a single algorithm. Some systems sample many complete answers and pick a winner. Some iteratively critique and revise one candidate. Others run explicit search over partial reasoning states using a verifier or reward model. These patterns are branches of the same idea: spend extra compute to explore, score, and refine before returning an answer.
Not every reasoning model literally runs beam search or a PRM at inference time. The shared pattern is optional inference compute allocation. A good routing policy keeps easy requests cheap and assigns more tokens, branches, or verification only where evaluation shows a payoff.
Some hosted reasoning APIs report non-visible reasoning tokens in usage accounting while returning only the final visible answer. OpenAI's reasoning-token documentation is one concrete example.[4] Do not generalize this into one universal architecture: an open-weight deployment, explicit search wrapper, or provider with summarized thinking can expose and account for intermediate work differently.
For such APIs, the non-visible tokens matter for production because they consume context window space, increase key-value (KV) cache pressure, and may count toward billing even though users never see them.[4]
```python
visible_tokens = 180
reasoning_tokens = 1220
output_price_per_million = 10.0

# Non-visible reasoning tokens are billed alongside the visible answer.
billed_output_tokens = visible_tokens + reasoning_tokens
cost = billed_output_tokens / 1_000_000 * output_price_per_million
print(f"visible={visible_tokens} billed_output={billed_output_tokens}")
print(f"visible_fraction={visible_tokens / billed_output_tokens:.1%} output_cost=${cost:.4f}")
```

```text
visible=180 billed_output=1400
visible_fraction=12.9% output_cost=$0.0140
```

Use the provider's usage schema and pricing table when implementing this calculation; the example demonstrates why billing and capacity dashboards cannot count visible text alone.
Several strategies allocate compute at inference time, each with different characteristics:
This is the simplest strategy: generate 16 candidate recovery plans for a delayed order and keep the one with the best verifier score. Each attempt is independent, and more attempts raise your ceiling as long as your verifier or selection rule can reliably identify the best one.
The conceptual interface sketch below demonstrates this approach. It takes a generative model and a prompt as inputs, alongside the number of desired completions ($N$). It returns the single best response by either scoring them with a reward model or falling back to self-consistency (majority vote).
Cost note: This method is easily parallelizable (all N attempts can run simultaneously), making it simple to implement but potentially expensive since you pay for all N completions.
```python
from collections import Counter


def extract_answer(completion: str) -> str:
    # Treat the last non-empty line as the final answer.
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def best_of_n(
    model,
    prompt: str,
    n: int = 16,
    reward_model=None
) -> str:
    """Generate N responses and return the best-scored candidate completion.

    Cost: O(N) forward passes, embarrassingly parallel.
    Best for: Problems with verifiable correctness (math, code).
    """
    candidates = [model.generate(prompt) for _ in range(n)]

    if reward_model:
        scores = [reward_model.score(prompt, c) for c in candidates]
        return candidates[scores.index(max(scores))]
    else:
        # Self-consistency: majority vote on final answer
        answers = [extract_answer(c) for c in candidates]
        answer_counts = Counter(answers)
        best_answer = answer_counts.most_common(1)[0][0]
        return next(c for c in candidates if extract_answer(c) == best_answer)
```

With an oracle verifier and independent samples, the chance of generating at least one correct answer in $N$ tries follows:

$$P(\text{at least one correct}) = 1 - (1 - p)^N$$

Where $p$ is the base probability of the model generating a correct answer on a single attempt. Real systems do worse than this idealized formula because samples are correlated and verifiers make mistakes, but the equation explains why best-of-N works at all. OpenAI's o1 launch post reports the same pattern on AIME 2024: o1 improved from 74% with one sample to 83% with 64-sample consensus and 93% when reranking 1000 samples with a learned scorer.[3]
```python
def success_probability(p: float, n: int) -> float:
    # Chance that at least one of n independent attempts is correct.
    return 1 - (1 - p) ** n


base_hit_rate = 0.25
tokens_per_attempt = 800

for n in [1, 4, 16, 64]:
    success = success_probability(base_hit_rate, n)
    output_tokens = n * tokens_per_attempt
    print(f"N={n:>2} success={success:5.1%} output_tokens={output_tokens:>5}")
```

```text
N= 1 success=25.0% output_tokens=  800
N= 4 success=68.4% output_tokens= 3200
N=16 success=99.0% output_tokens=12800
N=64 success=100.0% output_tokens=51200
```

The toy calculation shows why best-of-N is attractive and dangerous. The idealized success curve rises quickly, but cost rises linearly with every sampled completion. Real systems also plateau earlier because samples share model biases and verifiers make mistakes.
```python
def selected_success_probability(candidate_success: float, selector_recall: float) -> float:
    # A correct candidate only helps if the selector also picks it.
    return candidate_success * selector_recall


base_hit_rate = 0.25
selector_recall = 0.80
for n in [1, 4, 16]:
    oracle_success = 1 - (1 - base_hit_rate) ** n
    deployed_success = selected_success_probability(oracle_success, selector_recall)
    print(f"N={n:>2} oracle={oracle_success:5.1%} with_selector={deployed_success:5.1%}")
```

```text
N= 1 oracle=25.0% with_selector=20.0%
N= 4 oracle=68.4% with_selector=54.7%
N=16 oracle=99.0% with_selector=79.2%
```

This toy selector is deliberately simple: it shows that generating a correct candidate and selecting it are separate failure surfaces.
This is like drafting a carrier-recovery plan, checking it against policy and capacity, then revising the weak steps before committing to the customer. Unlike best-of-N (where each attempt is independent), each revision builds on the previous one.
The conceptual interface below takes an initial prompt and a model. It generates an initial response and then enters a loop, asking the model to critique its own output. If errors are found, it uses the critique to generate an improved version, returning the final refined text. The error detection ("no errors found") is intentionally simplified; production systems use a trained verifier model or structured critique format.
```python
def iterative_refinement(model, prompt: str, max_rounds: int = 5) -> str:
    """Generate, critique, and refine until convergence.

    Cost: O(rounds) sequential passes, each building on previous.
    Best for: Open-ended tasks (writing, analysis, planning).
    """
    response = model.generate(prompt)

    for _ in range(max_rounds):
        critique = model.generate(
            f"Find errors or improvements in this response:\n"
            f"Question: {prompt}\nResponse: {response}"
        )

        if "no errors found" in critique.lower():
            break

        response = model.generate(
            f"Question: {prompt}\n"
            f"Previous response: {response}\n"
            f"Critique: {critique}\n"
            f"Provide an improved response:"
        )

    return response
```

```python
verified_scores = [0.61, 0.76, 0.78, 0.781, 0.781]
minimum_gain = 0.01
used_rounds = 1

# Stop revising once a round improves the verified score by less than minimum_gain.
for previous, current in zip(verified_scores, verified_scores[1:]):
    if current - previous < minimum_gain:
        break
    used_rounds += 1

print(f"used_rounds={used_rounds} selected_score={verified_scores[used_rounds - 1]:.3f}")
print(f"skipped_rounds={len(verified_scores) - used_rounds}")
```

```text
used_rounds=3 selected_score=0.780
skipped_rounds=2
```

This is the most sophisticated strategy, like a fulfillment planner who considers multiple recovery paths, evaluates each action, and prunes bad branches early. Instead of scoring just the final answer ("did you choose refund or reship?"), a process reward model scores each intermediate step ("is this policy check correct?").
Here's how a simplified beam search using a PRM operates. It accepts a model and a PRM, taking a prompt as the initial state. At each step, it generates multiple possible next steps (the beam width), scores them with the PRM, and keeps only the highest-scoring paths. The function outputs the best completed reasoning chain. This is a conceptual interface sketch; it assumes model.generate_step, prm.score_step, and is_final_answer exist.
```python
def beam_search_with_prm(
    model,
    prm,
    prompt: str,
    beam_width: int = 4,
    branch_factor: int = 4,
    max_steps: int = 20
):
    """Guided tree search using per-step reward scores.

    Cost: O(beam_width × branch_factor × max_steps) forward passes.
    Best for: Multi-step mathematical or logical reasoning.
    """
    beams = [{"steps": [], "score": 0.0, "text": prompt}]

    for _ in range(max_steps):
        candidates = []
        for beam in beams:
            # Sample several candidate next steps from the current partial trace
            new_steps = [
                model.generate_step(beam["text"])
                for _ in range(branch_factor)
            ]

            for new_step in new_steps:
                # Score each step with the process reward model
                step_score = prm.score_step(
                    prompt, beam["steps"], new_step
                )
                candidates.append({
                    "steps": beam["steps"] + [new_step],
                    "score": beam["score"] + step_score,
                    "text": beam["text"] + "\n" + new_step,
                })

        # Keep top-k beams
        beams = sorted(candidates, key=lambda x: x["score"], reverse=True)[:beam_width]

        # Return the highest-scoring beam that has reached a final answer,
        # so an unfinished trace is never returned in place of a finished one
        finished = [b for b in beams if is_final_answer(b["steps"][-1])]
        if finished:
            return finished[0]

    # No beam finished within max_steps: fall back to the best partial trace
    return beams[0]
```

Beam search is easiest to explain, but it has a well-known weakness for reasoning: beams can collapse onto near-duplicate traces. Stronger systems inject diversity, allow backtracking, or use Monte Carlo Tree Search-style expansion policies. The hard part isn't building a tree. It's having a verifier good enough to prune bad branches early.
The choice of reward model fundamentally affects search efficiency:
Think of it as final delivery audit versus checkpoint audit. An ORM only checks whether the final plan works: "delivered on time" or "missed SLA." You have no idea where the plan broke. A PRM checks each intermediate decision: "route chosen correctly," "hub cutoff valid," "carrier capacity wrong here." The per-step feedback is much more expensive to provide, and its own mistakes can discard a good path. When it is reliable enough, it can catch errors early and avoid wasting time on low-quality paths.
Outcome Reward Models (ORMs) evaluate the final generated output as a single, complete result. They act as a binary judge at the end of the generation process, determining only whether the final answer matches the expected solution.
Process Reward Models (PRMs), in contrast, evaluate the validity of each intermediate step in a chain of thought. By providing step-by-step guidance, they allow search algorithms to quickly abandon incorrect reasoning paths before wasting compute on them.
Here is a conceptual side-by-side comparison of how each model scores the same problem:
```python
# ORM: can only evaluate after the full solution is generated.
# It returns a single score for the complete answer.
orm_score = orm.score(
    question="What is 847 × 293?",
    full_solution="847 × 293 = 248,171"
)  # returns 1.0 (correct) or 0.0 (incorrect)

# PRM: evaluates each step independently as the solution is built.
# This lets the search algorithm prune bad paths before they grow.
prm_scores = prm.score_steps(
    question="What is 847 × 293?",
    steps=[
        "First, I'll break this into 847 × 300 - 847 × 7",  # step score: 0.95
        "847 × 300 = 254,100",                               # step score: 0.98
        "847 × 7 = 5,929",                                   # step score: 0.97
        "254,100 - 5,929 = 248,171",                         # step score: 0.99
    ]
)
```

Snell et al.[2] reported that compute-optimal test-time scaling can be more than 4× as efficient as a best-of-N baseline in their math-reasoning evaluation. Lightman et al.[10] reported that process supervision outperformed outcome supervision on MATH. These results motivate testing PRM-guided pruning on verifiable workloads; they do not establish that every PRM or task benefits.
That said, not every reasoning system uses a learned PRM or ORM. DeepSeek-R1-Zero trained with rule-based accuracy and format rewards rather than a neural reward model, which is one reason you should treat test-time compute as a family of techniques, not a single stack.[5]
```python
branch_lengths = [8, 8, 8, 8]
prune_at_step = [None, 2, 3, 1]

# An ORM needs every branch completed; a PRM can stop a branch at its pruning step.
orm_scored_steps = sum(branch_lengths)
prm_scored_steps = sum(
    full_length if stop is None else stop
    for full_length, stop in zip(branch_lengths, prune_at_step)
)
print(f"orm_steps={orm_scored_steps} prm_steps={prm_scored_steps}")
print(f"saved_steps={orm_scored_steps - prm_scored_steps}")
```

```text
orm_steps=32 prm_steps=14
saved_steps=18
```

This is the upside case in which the process scorer prunes the right paths. A deployment evaluation must also count false pruning of ultimately correct branches.
DeepSeek's contribution is easiest to understand as two related results. DeepSeek-R1-Zero showed that large-scale RL with verifiable rewards can induce strong reasoning behaviors without a supervised cold start.[5] DeepSeek-R1 then added cold-start data and additional post-training to make those behaviors more stable and readable.[5]
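The final DeepSeek-R1 pipeline isn't "pure RL" end to end. The paper explicitly describes two supervised fine-tuning (SFT) stages and two RL stages:[5]

1. Cold-start SFT on a small set of long chain-of-thought examples.
2. Reasoning-oriented RL.
3. Rejection sampling from the RL checkpoint, followed by SFT on reasoning and non-reasoning data.
4. A final RL stage covering all scenarios.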
That sequence matters because people often summarize R1 as "pure RL." Only R1-Zero fits that description.[5]
R1-Zero displayed behaviors such as self-verification, backtracking, and longer rollouts, but the paper also reports readability issues and occasional language mixing.[5] That detail matters. It means verifiable rewards can induce sophisticated reasoning behavior, but raw RL rollouts still need cleanup before they become a general-purpose product.
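As a rough illustration of what "rule-based accuracy and format rewards" can look like, here is a minimal sketch. The regular expressions, the `<think>` tag convention, the `\boxed{...}` answer convention, and the exact-match comparison are assumptions made for this example; they are not taken from DeepSeek's implementation, and the function names are hypothetical.

```python
import re

# Assumed conventions for this sketch: reasoning inside <think> tags,
# final answer inside \boxed{...}.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in <think> tags and followed by an answer."""
    return 1.0 if THINK_PATTERN.search(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the last boxed answer matches the reference string exactly."""
    answers = BOXED_PATTERN.findall(completion)
    return 1.0 if answers and answers[-1].strip() == reference.strip() else 0.0


good = "<think>847 * 300 - 847 * 7 = 254100 - 5929</think>\nThe answer is \\boxed{248171}."
bad_format = "The answer is \\boxed{248171}."
wrong = "<think>847 * 293 is about 250000</think>\n\\boxed{250000}"

for name, text in [("good", good), ("bad_format", bad_format), ("wrong", wrong)]:
    print(name, format_reward(text), accuracy_reward(text, "248171"))
```

Both checks are deterministic string rules, so they need no learned reward model. The trade-off is that they only apply where the answer can be checked mechanically.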
Not all problems benefit equally from extended reasoning, and on some tasks extra thinking actively hurts. Ghosal et al. studied this directly: across reasoning models, accuracy often rises with a little more thinking and then falls as the trace grows, an inverted-U rather than a monotonic climb.[8] They attribute the apparent early gains partly to higher output variance instead of genuinely better reasoning, and they recommend parallel sampling (best-of-N) over endlessly extending one trace. An Anthropic study found the same inverse-scaling effect on several tasks, where longer reasoning amplified distractions and errors rather than fixing them.[11] The lesson for production is that thinking budget is a real tuning parameter with a sweet spot, not a slider you turn to maximum.
The optimal test-time compute allocation depends on problem difficulty:
| Problem Type | Candidate strategy to evaluate | Example |
|---|---|---|
| Factual recall | Single pass | "What is the damaged-item return window?" |
| Simple reasoning | Short reasoning budget or brief scratchpad | "Can an order shipped yesterday arrive by Friday?" |
| Multi-step math | Longer scratchpad or best-of-N | Warehouse capacity or carrier-cost calculation |
| Complex code | Deliberation plus search or repair loops | Multi-file debugging and repair |
| Open-ended analysis | Sequential revision | Logistics rollout or support-policy analysis |
A task family may have a compute budget threshold where test-time scaling becomes more efficient than using a larger single-pass model. Below or above that point, the preferred path must be established with quality, latency, and cost measurements:
There's a compute budget above which a small model that searches through multiple solutions can outperform a large model answering in one shot. This isn't universal across all tasks, but it appears clearly on hard reasoning problems where the smaller model already has a non-zero hit rate.
Snell et al.[2] observed that in FLOPs-matched evaluations, a smaller model with additional test-time compute can outperform a model roughly 14× larger answering in a single pass on problems where the small model already achieves non-trivial success rates. This makes longer thinking a candidate for compensating for model size on similar evaluated tasks, not a general replacement for larger models.
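A back-of-the-envelope sketch of what FLOPs matching means at inference time. It assumes the common approximation of about 2 FLOPs per parameter per generated token, ignores attention and pretraining costs, and uses hypothetical model sizes; it is a simplification for intuition, not the accounting used in the cited paper.

```python
def forward_flops(n_params: float, n_tokens: int) -> float:
    """Rough decode cost: about 2 FLOPs per parameter per generated token."""
    return 2.0 * n_params * n_tokens


def matched_samples(small_params: float, large_params: float,
                    small_tokens: int, large_tokens: int) -> int:
    """How many small-model samples fit in the FLOP budget of one large-model pass."""
    budget = forward_flops(large_params, large_tokens)
    return int(budget // forward_flops(small_params, small_tokens))


# Hypothetical sizes: the large model has 14x the parameters of the small one.
small, large = 3e9, 42e9

n = matched_samples(small, large, small_tokens=800, large_tokens=800)
print(f"samples_within_budget={n}")

# If each small-model sample reasons twice as long, half as many samples fit.
n_long = matched_samples(small, large, small_tokens=1600, large_tokens=800)
print(f"samples_with_longer_traces={n_long}")
```

Under these assumptions the small model can afford 14 equal-length samples, or 7 samples that are twice as long. Whether that budget beats the larger model still depends on the small model's hit rate and on the quality of the selector.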
To handle varying levels of difficulty in production without exploding costs, systems often implement a routing layer. This architecture evaluates an incoming task's complexity and directs it to the appropriate model and test-time strategy, balancing latency, quality, and computational cost.
```python
routes = [
    {"name": "single-pass", "quality": 0.78, "latency_ms": 240, "cost": 0.002},
    {"name": "bounded-reasoning", "quality": 0.86, "latency_ms": 780, "cost": 0.010},
    {"name": "guided-search", "quality": 0.90, "latency_ms": 2600, "cost": 0.045},
]
latency_slo_ms = 1000
minimum_quality = 0.84

# Keep routes that meet both the latency SLO and the quality floor.
eligible = [
    route for route in routes
    if route["latency_ms"] <= latency_slo_ms and route["quality"] >= minimum_quality
]
# Among eligible routes, pick the cheapest.
chosen = min(eligible, key=lambda route: route["cost"])
print(f"route={chosen['name']} quality={chosen['quality']:.0%}")
print(f"latency_ms={chosen['latency_ms']} cost=${chosen['cost']:.3f}")
```

```text
route=bounded-reasoning quality=86%
latency_ms=780 cost=$0.010
```

Deploying reasoning models in a real-world environment requires careful planning. While the capability gains are substantial, they come with operational challenges that engineering teams must proactively address.
Reasoning workloads stress inference engines differently from ordinary chat because they often generate far more intermediate tokens before the visible answer arrives.
For every generated token, the server appends a Key vector and a Value vector for every layer:

$$\text{KV bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv}} \times d_{\text{head}} \times b$$

Where $n_{\text{layers}}$ is the number of layers, $n_{\text{kv}}$ is the number of KV heads, $d_{\text{head}}$ is the head dimension, and $b$ is bytes per value; the factor of 2 covers the Key and the Value. The important scaling fact is simple: KV cache memory grows linearly with sequence length. A rollout that spends 10,000 tokens thinking creates roughly 10,000 tokens' worth of KV state, which can collapse batch size long before raw FLOPs become the bottleneck.
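A minimal sketch of that per-token accounting: two vectors (Key and Value) per layer, each of size KV heads × head dimension, times bytes per value. The model configuration below is hypothetical and chosen only to give round numbers.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Key plus Value storage for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

rollout_tokens = 10_000
rollout_gib = per_token * rollout_tokens / 1024 ** 3

print(f"kv_per_token_kib={per_token / 1024:.0f}")
print(f"kv_for_rollout_gib={rollout_gib:.2f}")
```

With these assumed dimensions, each token costs 128 KiB of KV state, so a single 10,000-token rollout holds about 1.22 GiB for as long as it is decoding.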
PagedAttention stores KV cache in fixed-size blocks to reduce fragmentation and make scheduling practical at long context lengths.[12] When multiple rollouts share the same prompt or partial trace, prefix-sharing runtimes can reuse that cached prefix instead of duplicating it across every branch. SGLang's RadixAttention is a good example.[13] This matters a lot for best-of-N, self-consistency, and tree search, where many candidates share the same long prompt and diverge only near the leaves.
```python
prompt_tokens = 4000
continuation_tokens = 1000
branches = 8
kv_bytes_per_token = 128 * 1024

# Without sharing, every branch stores its own copy of the prompt's KV state.
without_sharing = branches * (prompt_tokens + continuation_tokens)
# With prefix sharing, the prompt is cached once and only continuations are per-branch.
with_sharing = prompt_tokens + branches * continuation_tokens
saved_gib = (without_sharing - with_sharing) * kv_bytes_per_token / 1024 ** 3
print(f"tokens_without_sharing={without_sharing:,}")
print(f"tokens_with_sharing={with_sharing:,} saved_kv_gib={saved_gib:.2f}")
```

```text
tokens_without_sharing=40,000
tokens_with_sharing=12,000 saved_kv_gib=3.42
```

Test-time compute distorts user-facing latency metrics. For a provider or wrapper that generates non-visible reasoning before revealing the answer, TTFT (time to first token) can increase substantially.[4] ITL (inter-token latency) can still be fine once the answer starts streaming. In other words, a reasoning system can feel slow even when decode throughput is healthy.
```python
base_ttft_ms = 180
decode_tokens_per_second = 80
budgets = [0, 128, 512, 1024]
ttft_slo_ms = 3000

for budget in budgets:
    # Hidden reasoning tokens are decoded before the first visible token appears.
    ttft_ms = base_ttft_ms + budget / decode_tokens_per_second * 1000
    status = "fits" if ttft_ms <= ttft_slo_ms else "reject"
    print(f"budget={budget:>4} ttft_ms={ttft_ms:>7.0f} {status}")
```

```text
budget=   0 ttft_ms=    180 fits
budget= 128 ttft_ms=   1780 fits
budget= 512 ttft_ms=   6580 reject
budget=1024 ttft_ms=  12980 reject
```

Reasoning models introduce significant latency, but the exact numbers depend on hardware, batching policy, and provider implementation:
| Strategy | User-visible behavior | Best fit |
|---|---|---|
| Single-pass generation | Low TTFT, short answers | Chat, extraction, classification |
| Reasoning or search-heavy generation | Higher TTFT, variable token budget, sometimes hidden intermediate work | Math, code, planning, verification |
For strict interactive latency budgets, a long reasoning path may be unacceptable unless measured gains justify it. Candidates for larger budgets include:
Extended reasoning is expensive even when the final answer is short. Hosted APIs may charge directly for hidden reasoning tokens. OpenAI's reasoning docs, for example, note that these tokens are billed as output tokens even though they aren't returned verbatim in the API response.[4] Open-weight deployments pay through longer wall-clock time, lower throughput, and higher KV cache residency. Either way, longer rollouts reduce how many concurrent requests the same GPU can serve.
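The concurrency effect can be sketched with simple arithmetic. The KV budget, per-token KV size, and request lengths below are hypothetical, and the calculation ignores weights, activations, and fragmentation; it only shows how the request count that fits in a fixed KV budget shrinks as rollouts grow.

```python
def max_concurrent_requests(kv_budget_bytes: int, tokens_per_request: int,
                            kv_bytes_per_token: int) -> int:
    """Number of requests whose full KV state fits in the budget at once."""
    return kv_budget_bytes // (tokens_per_request * kv_bytes_per_token)


# Hypothetical numbers: 40 GiB reserved for KV cache, 128 KiB of KV state per token.
kv_budget = 40 * 1024 ** 3
kv_per_token = 128 * 1024

for tokens in [2_000, 6_000, 12_000]:
    concurrent = max_concurrent_requests(kv_budget, tokens, kv_per_token)
    print(f"tokens_per_request={tokens:>6,} max_concurrent={concurrent}")
```

Under these assumptions, moving from 2,000-token requests to 12,000-token rollouts cuts concurrency from 163 to 27 on the same hardware.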
Production systems usually combine three controls:
Without those controls, ambiguous or unsolvable prompts can burn a large amount of inference budget without producing a better answer.
Distillation is like having a senior logistics planner write detailed recovery traces, then training a smaller model to solve similar incidents the same way. The smaller model won't match the source model everywhere, but it can inherit much of the solution style while being much cheaper to serve.
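One way to picture the data side of distillation is a filter that keeps only teacher traces whose final answer can be verified, then formats them as supervised examples. This is a minimal sketch under stated assumptions: the `Trace` type, the exact-match check, and the sample questions are hypothetical, and a real pipeline would use its own verifier and data format.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    prompt: str
    reasoning: str
    answer: str


def filter_verified(traces: list[Trace], references: dict[str, str]) -> list[Trace]:
    """Keep teacher traces whose final answer matches the known reference."""
    return [t for t in traces if references.get(t.prompt) == t.answer]


def to_training_example(trace: Trace) -> dict[str, str]:
    """Format a kept trace as a prompt/completion pair for supervised fine-tuning."""
    return {
        "prompt": trace.prompt,
        "completion": f"{trace.reasoning}\nAnswer: {trace.answer}",
    }


# Hypothetical reference answers and teacher outputs.
references = {
    "Pallets needed for 1,200 units at 48 per pallet?": "25",
    "What is 847 x 293?": "248171",
}
teacher_traces = [
    Trace("Pallets needed for 1,200 units at 48 per pallet?", "1200 / 48 = 25", "25"),
    Trace("Pallets needed for 1,200 units at 48 per pallet?", "1200 / 48 = 24", "24"),
    Trace("What is 847 x 293?", "847 * 300 - 847 * 7 = 254100 - 5929", "248171"),
]

kept = filter_verified(teacher_traces, references)
dataset = [to_training_example(t) for t in kept]
print(f"kept={len(kept)} of {len(teacher_traces)}")
```

Filtering before training keeps the student from imitating confident but wrong traces, at the cost of discarding prompts the teacher never solves.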
DeepSeek-R1 showed that reasoning capabilities can be distilled into much smaller models:
| Distilled Model | Source | AIME 2024 (pass@1: accuracy on first attempt) |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | R1 → Qwen2.5-Math-1.5B | 28.9% |
| DeepSeek-R1-Distill-Qwen-7B | R1 → Qwen2.5-Math-7B | 55.5% |
| DeepSeek-R1-Distill-Qwen-32B | R1 → Qwen2.5-32B | 72.6% |
| DeepSeek-R1 (full) | - | 79.8% |
In DeepSeek's reported AIME evaluation, the 7B and 32B distilled models retain substantial benchmark accuracy at lower parameter counts than R1. Parameter count is only a proxy, though. Real deployment cost still depends on active parameters, quantization, batch size, and engine choice. Distillation can make reasoning cheaper, but it doesn't remove the need for routing and token budgets.
```python
requests = [
    {"kind": "extract", "verifiable": True, "predicted_gain": 0.00},
    {"kind": "capacity_math", "verifiable": True, "predicted_gain": 0.09},
    {"kind": "creative_copy", "verifiable": False, "predicted_gain": 0.03},
]
minimum_gain = 0.05

for request in requests:
    # Route to reasoning only when the answer is checkable and the measured gain is large enough.
    use_reasoning = request["verifiable"] and request["predicted_gain"] >= minimum_gain
    route = "reasoning" if use_reasoning else "single-pass"
    print(f"{request['kind']}: {route}")
```

```text
extract: single-pass
capacity_math: reasoning
creative_copy: single-pass
```

Provider controls are model-specific (OpenAI reasoning.effort, Gemini thinkingLevel or thinkingBudget, Claude adaptive effort or legacy budget_tokens), and more thinking is not always better because accuracy can follow an inverted-U.

Evaluate a reasoning path when the task needs multi-step deduction and you have a way to check whether the answer is good. Math, formal verification, code debugging, and structured planning are promising candidates. Start with a single-pass baseline for summarization, translation, extraction, and conversational UX, then promote only workloads where measured correctness gains justify latency and cost.
Reasoning workloads generate longer rollouts or multiple branches, which increases KV-cache pressure because each new token adds per-layer K/V state. KV-cache memory grows linearly with sequence length, so long reasoning traces reduce batch size and throughput. Engines like vLLM use paging to manage that memory, and prefix-sharing runtimes can reuse the shared prefix across best-of-N or search branches instead of duplicating it.
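The linear growth described above follows from a standard back-of-the-envelope formula: each token stores one key and one value vector per layer per KV head. Here is a minimal sketch of that estimate. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen for round numbers, and the estimate ignores paging overhead, cache quantization, and attention variants that compress K/V.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size for one sequence of the given length."""
    # The factor 2 covers the separate key and value tensors.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token

# Hypothetical model shape, not taken from any specific model card.
config = dict(layers=32, kv_heads=8, head_dim=128)

for tokens in (1_024, 8_192, 32_768):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.3f} GiB")
```

With this shape, each token costs 128 KiB of cache, so a 32K-token reasoning trace holds about 4 GiB for a single sequence. That is memory the engine can't use to batch other requests.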
Common failure modes include loops, error propagation, overthinking, and weak self-verification. If the same model generates and judges, its blind spots are correlated, so it may confidently approve a bad trace. PRMs and task-specific checkers help by scoring intermediate steps, while explicit token budgets and early stopping terminate unproductive branches earlier.
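Of the failure modes above, loops are the easiest to catch mechanically. The sketch below flags a trace when the same step text recurs several times. It is a deliberately crude heuristic, assuming the trace is already split into step strings; the `looks_like_loop` helper and its `max_repeats` threshold are illustrative names, and a production check would likely need fuzzier matching.

```python
from collections import Counter

def looks_like_loop(trace_steps: list[str], max_repeats: int = 3) -> bool:
    """Flag a trace when any normalized step appears at least max_repeats times."""
    # Lowercase and collapse whitespace so trivial variations still match.
    normalized = [" ".join(step.lower().split()) for step in trace_steps]
    counts = Counter(normalized)
    return any(count >= max_repeats for count in counts.values())

trace = [
    "Check hub cutoff for route A",
    "Route A misses the cutoff, try route B",
    "check hub cutoff for route A",
    "Route A misses the cutoff, try route B",
    "Check hub  cutoff for route A",
]
print(looks_like_loop(trace))
```

A check like this only catches verbatim repetition. It complements, rather than replaces, a verifier that is independent of the generating model.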
Distillation uses traces or verifier-filtered answers from a stronger reasoning system to train a smaller student. A student can learn useful solution patterns, as the DeepSeek results illustrate, but the gain must be measured against a comparable baseline. The catch is that the student may imitate output format more easily than the teacher's actual search policy, so good benchmark scores don't always mean the smaller model learned the same underlying reasoning process.
Start by separating tasks into three buckets: easy pattern matching, medium tasks with occasional reasoning upside, and hard tasks with clear verification signals. Route the first bucket to a fast single-pass model, route the second to a bounded reasoning budget or brief best-of-N, and reserve verifier-guided search for the third. Then instrument latency, total generated tokens, and verified win rate so the router can move requests down when extra thinking stops paying off.
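The three-bucket policy above can be sketched as a small routing function plus a demotion check. This is one possible shape under stated assumptions: the `bucket` label is assumed to come from an upstream classifier, the route names are placeholders, and sending hard tasks without a checker to the bounded route is a choice made here, not something the policy above prescribes.

```python
from dataclasses import dataclass

@dataclass
class Request:
    kind: str
    bucket: str       # "easy", "medium", or "hard" (assumed upstream label)
    verifiable: bool  # True when a checker can score the answer

def choose_route(request: Request) -> str:
    if request.bucket == "easy":
        return "single_pass"
    if request.bucket == "hard" and request.verifiable:
        return "verifier_guided_search"
    # Medium tasks, and hard tasks without a checker, get a capped budget.
    return "bounded_reasoning"

def should_demote(verified_win_rate: float, baseline_win_rate: float,
                  min_gain: float = 0.05) -> bool:
    """Move a workload down a tier when extra thinking stops paying off."""
    return verified_win_rate - baseline_win_rate < min_gain

for req in (
    Request("extract_fields", "easy", True),
    Request("reroute_plan", "medium", False),
    Request("capacity_math", "hard", True),
):
    print(f"{req.kind}: {choose_route(req)}")

print(should_demote(verified_win_rate=0.62, baseline_win_rate=0.60))
```

The demotion check is where the instrumentation feeds back: if the measured win rate over the single-pass baseline falls below the threshold, the workload goes back to the cheaper route.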
Even experienced engineers misapply reasoning models. Here are frequent pitfalls, with symptoms to watch for and concrete fixes.
Symptom: A production task takes 30 seconds to complete and costs five times more than the standard model, but the output quality is identical.
Cause: You're using a reasoning model for a task that doesn't show measurable benefit from extra deduction. Summarization, translation, simple extraction, and sentiment analysis are often strong single-pass baselines. Extra thinking time may add latency and cost without an accuracy gain.
Fix: Route easy tasks to fast single-pass models. Reserve reasoning models for math, code, planning, verification, and other tasks where you can actually verify whether the answer is correct.
Symptom: You assume a reasoning model is just a standard model with a hidden system prompt like "You are a careful thinker..."
Cause: The API hides the reasoning tokens, so it's tempting to imagine they're produced by a wrapper around a normal model. That mental model misses the training difference.
Fix: Reasoning models may be post-trained, often with RL or distillation, to use extra inference compute productively. Treat this as a difference in model behavior and serving accounting. It isn't a prompt hack, and it isn't necessarily a new transformer architecture.
Symptom: You assume a visible "think step by step" trace, a hidden reasoning trace, best-of-N, PRM search, and RL with verifiable rewards are interchangeable.
Cause: They all spend extra inference or training budget around reasoning, but they solve different problems.
Fix: Name the mechanism. Prompted CoT changes the prompt. Hidden reasoning changes model behavior and serving accounting. Best-of-N changes sampling. PRM search changes branch selection. Verifiable-reward RL changes post-training.
Symptom: A reasoning rollout looks cheap because the final answer is short, but batch size collapses under load.
Cause: Hidden and candidate tokens still create K/V state across layers.
Fix: Track total generated tokens, not just visible tokens. Use paging, prefix sharing, routing, and hard budgets before scaling reasoning traffic.
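To see why visible tokens understate the load, it helps to count the tokens whose K/V state must stay resident during a best-of-N request. The sketch below is an idealized count with hypothetical sizes; real runtimes share prefixes at block or page granularity, so actual savings will differ.

```python
def cached_tokens(prompt_tokens: int, branch_tokens: int,
                  n_branches: int, share_prefix: bool) -> int:
    """Tokens whose K/V state is resident for one best-of-N request."""
    if share_prefix:
        # The prompt is stored once and every branch reuses it.
        return prompt_tokens + n_branches * branch_tokens
    # Without sharing, every branch carries its own copy of the prompt.
    return n_branches * (prompt_tokens + branch_tokens)

prompt_tokens = 2_000
branch_tokens = 1_500  # hidden reasoning plus answer, per candidate
n_branches = 8
visible_tokens = 100   # what the caller sees from the winning branch

duplicated = cached_tokens(prompt_tokens, branch_tokens, n_branches, share_prefix=False)
shared = cached_tokens(prompt_tokens, branch_tokens, n_branches, share_prefix=True)

print(f"visible tokens:          {visible_tokens}")
print(f"cached without sharing:  {duplicated}")
print(f"cached with sharing:     {shared}")
```

In this example the request returns 100 visible tokens but holds 14,000 to 28,000 tokens of cache, which is the number that determines how many requests fit in a batch.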
Symptom: Token usage spikes on certain prompts, the model spins in circles repeating the same deduction, or accuracy drops on simple prompts when the budget is set high.
Cause: No reasoning budget was set, or the budget was set too high. On ambiguous or simple prompts, more thinking can amplify distractions and errors rather than fix them, so accuracy follows an inverted-U as the trace grows.[8][11]
Fix: Set a reasoning budget tuned to the task instead of maximizing it, and add a hard cap. Implement early stopping when verifier confidence plateaus. For simple prompts, lower the effort level. Monitor per-request token distributions and alert on outliers.
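The hard cap and plateau-based early stopping from the fix above can be sketched as a single loop. This is a simulation under assumptions: `step_scores` stands in for verifier confidence after each reasoning step, every step is assumed to cost the same number of tokens, and `patience` and `min_delta` are illustrative tuning knobs rather than settings from any particular system.

```python
def run_with_budget(step_scores: list[float], tokens_per_step: int,
                    max_tokens: int, patience: int = 2,
                    min_delta: float = 0.01) -> dict:
    """Stop on a hard token cap or when verifier confidence stops improving."""
    best = float("-inf")
    stale = 0  # consecutive steps without a meaningful improvement
    used = 0
    for step, score in enumerate(step_scores, start=1):
        if used + tokens_per_step > max_tokens:
            return {"stop": "hard_cap", "steps": step - 1, "tokens": used}
        used += tokens_per_step
        if score > best + min_delta:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return {"stop": "plateau", "steps": step, "tokens": used}
    return {"stop": "finished", "steps": len(step_scores), "tokens": used}

scores = [0.40, 0.55, 0.70, 0.702, 0.701, 0.90]
print(run_with_budget(scores, tokens_per_step=200, max_tokens=2_000))
print(run_with_budget(scores, tokens_per_step=200, max_tokens=500))
```

The first call stops at step 5 because confidence has been flat for two steps. Note that it never reaches the later jump to 0.90, which is the usual trade-off with patience-based stopping. The second call hits the hard cap after two steps.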
Before you move on, try this short audit to make the trade-offs concrete.
The Efficiency Audit
Consider these five production tasks. For each one, predict whether it benefits from test-time compute scaling, then check your reasoning against the explanation.
| Task | Your prediction | Why it does or doesn't benefit |
|---|---|---|
| Creative writing (marketing copy) | | Single-pass fluency is usually sufficient; revision helps but often isn't worth the latency cost |
| Multi-step warehouse capacity calculation | | Hard reasoning with verifiable arithmetic; measure best-of-N or guided search against baseline |
| Sentiment analysis of support tickets | | Strong single-pass baseline; only promote it if evaluation finds a gain |
| Debugging a 500-error across three microservices | | Multi-step code reasoning with executable checks; evaluate repair-loop gain |
| Translation of product descriptions | | Pattern-matching with strong base-model performance; extra reasoning adds little |
The warehouse calculation and microservice debugging are the strongest candidates because they provide verification signals. The core idea is that test-time compute is worth testing when the task requires verifiable, multi-step reasoning, not merely because a task sounds difficult.
Test-time compute adds a second scaling axis: spending more compute at inference can outperform moving to a larger single-pass model on hard reasoning tasks.[2]
Test-time compute is an umbrella, not one algorithm: longer scratchpads, repeated sampling, revision loops, and explicit search all fit under the same idea.
Good verifiers and compute-optimal policies matter: Snell et al. report more than 4× better test-time compute efficiency than a best-of-N baseline in their evaluation, and PRMs can support earlier pruning because they score intermediate steps instead of waiting for the end.[2][10]
DeepSeek-R1-Zero and DeepSeek-R1 aren't the same result: R1-Zero showed emergent reasoning from pure RL, while DeepSeek-R1 added cold-start data plus additional SFT and RL stages to make that behavior readable and broadly usable.[5]
Thinking is now a dial, and the sweet spot is not always the maximum: current providers expose model-specific effort or thinking controls, and accuracy can follow an inverted-U as the trace grows, so tune the supported knob rather than maxing it.[4][6][7][8]
Serving bottlenecks matter as much as algorithms: KV cache memory grows linearly with long reasoning traces, so prefix sharing, routing, and token budgets are core production tools.[12][13]
You should now be able to explain why a single-pass model can miss "How many r's in strawberry," why a smaller model with traces, search, or distillation can recover much of a larger model's math performance, and why your production stack needs a routing layer before it needs a bigger GPU.
Scaling Laws for Neural Language Models
Kaplan et al. · 2020
Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters
Snell, C., et al. · 2024 · arXiv preprint
Learning to reason with LLMs
OpenAI · 2024
Reasoning models
OpenAI · 2026
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI · 2025
Gemini thinking
Google · 2025
Building with extended thinking
Anthropic · 2025
Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models
Ghosal, S. S., Chakraborty, S., Reddy, S., et al. · 2025
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, J., et al. · 2022 · NeurIPS
Let's Verify Step by Step
Lightman, H., et al. · 2023 · ICLR
Inverse Scaling in Test-Time Compute
Anthropic · 2025
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, W., et al. · 2023 · SOSP
SGLang: Efficient Execution of Structured Language Model Programs
Zheng, L., Yin, L., Xie, Z., et al. · 2023 · arXiv:2312.07104