LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnSystem Design CapstonesReasoning & Test-Time Compute
🏗️HardSystem Design

Reasoning & Test-Time Compute

Design a production reasoning agent that routes by difficulty, evaluates candidate work, requires evidence before release, and survives serving bottlenecks like key-value (KV) cache growth.

43 min read
Learning path
Step 151 of 155 in the full curriculum
Real-Time Voice AI AgentAI Lab Coding Interview: Python Systems

Reasoning & Test-Time Compute

The Real-Time Voice AI Agent chapter focused on split-second streaming, interruption, and media state. This final capstone studies the opposite tradeoff: when a task is hard enough, the system may spend extra inference compute before it releases an answer or irreversible effect.

A reasoning agent can spend extra test-time compute on planning, checking, and tool use before its final answer. This design chapter explains when that checked work yields measured value and how to bound it with evidence contracts, evaluators, and budgets.

Imagine you manage a small fulfillment center. A customer sends a message: "I ordered a laptop, a monitor stand, and a cable on Monday with express shipping. The stand just arrived, but the laptop and cable haven't. The tracking number for the laptop says it's at the regional hub. Can I return the stand if I don't like it, and when will the rest arrive?"

A human support agent would not shout the first guess that comes to mind. They would check the order split, verify the tracking status, read the return policy, and only then compose an answer. A standard large language model (LLM) generates each continuation left to right; by itself, one answer attempt does not revisit earlier choices or consult missing evidence. A controller can still ask it to revise, call tools, or generate competing attempts. On tricky problems, the product question is whether that additional work is justified and verifiable.

Reasoning models such as OpenAI's earlier o-series reasoning models, including o1 [1], and DeepSeek-R1 [2] spend additional compute before returning an answer. Do not conflate that native internal reasoning with a product controller that explicitly samples candidates, calls tools, scores partial states, or performs tree search: those are separable designs with different evidence and serving contracts. This article builds the controller around those contracts, routing easy questions to bounded answers and hard, verifiable questions to deeper work.

Fast thinking and slow thinking

Fast/slow-thinking terminology gives us a useful analogy. Fast thinking is automatic and intuitive. When you read "2 + 2 =", you instantly know the answer. Slow thinking is deliberate and sequential. When you multiply 17 by 34 in your head, you work through steps, catch yourself if you miscarry a digit, and restart if the intermediate result looks wrong. In a product, however, enforceable controls are external candidate state, tool evidence, evaluator behavior, and release gates, not a promise about hidden mental steps.

Single-pass decoding looks more like fast thinking. At every step the model samples the next token conditioned on everything so far; without a controller or tool it can emit a plausible answer without validating it. A reasoning-enabled product can instead allocate a bounded check, use an exact tool when one exists, and release only a result supported by that check.

The useful shift is to treat some inference requests as a budgeted search and verification problem, not just one left-to-right decoding loop. A controller may explore candidate actions, score partial solutions, consult external evidence, and spend extra compute only when evaluation justifies it.

The extra compute is called test-time compute: the floating-point operations (FLOPs) and tokens you spend during inference, after training is done. In a FLOPs-matched study, Snell et al. show that a smaller model plus the right inference-time strategy can beat a 14x larger model on easy-to-intermediate prompts where the smaller model already has a non-trivial chance of succeeding [3]. The same paper warns that the gain reverses on the hardest prompts and when you serve many inference tokens per query, so test-time compute is not a free substitute for a stronger base model. The lever you control is no longer just model size; it is measured inference budget.

Budget ladder comparing fixed, escalate, and adaptive test-time compute policies for easy, medium, and hard tasks. Budget ladder comparing fixed, escalate, and adaptive test-time compute policies for easy, medium, and hard tasks.
The same request classes can stay cheap or get deeper search depending on the budget policy.

The illustration above shows three budget strategies. A fixed budget spends the same tokens on every request. An escalation rule adds more compute only after a weak first attempt. An adaptive budget uses a difficulty router to match the strategy to the task before the expensive search begins. The rest of this article builds that adaptive system from the ground up.

The simplest form of thinking longer: one deliberate path

Before we build a tree-search engine, let's look at the cheapest way to add reasoning. You generate one candidate solution, but you structure the output so the model shows its checkpoints. This is the modern successor to the classic "Chain of Thought" prompt, where you explicitly asked the model to "think step by step" [4].

On reasoning-trained APIs, direct task instructions are generally preferred over requests to reveal step-by-step thinking, because supported models already perform internal reasoning [5]. Your job is to request only outward artifacts the application needs: tool evidence, concise checked milestones, and a final answer. Hidden scratchpad text is not a proof or an audit log.

One common budget knob on OpenAI reasoning models is a reasoning.effort setting [6]. Supported values vary by model family, but the operational lesson is the same. Lower effort favors latency and fewer reasoning tokens, while higher effort permits a larger internal reasoning budget before answering. This is the cheapest form of test-time compute control: raise effort only when evals show a measurable quality gain that justifies the extra latency and cost, and keep the lowest useful setting for fast, deterministic tasks. Treat it as the single-path equivalent of the difficulty router you build later in this article.

Here's a lightweight parser for the interface you want the model to satisfy. It turns a compact XML response into visible checkpoints plus a final answer:

the-simplest-form-of-thinking-longer-one.py
1from dataclasses import asdict, dataclass 2import json 3import re 4 5@dataclass 6class ReasoningStep: 7 kind: str 8 content: str 9 10@dataclass 11class ReasoningTrace: 12 steps: list[ReasoningStep] 13 final_answer: str 14 15class SinglePathReasoner: 16 RESPONSE_TEMPLATE = """ 17 Answer using supplied tool observations only. 18 19 Return XML with this structure: 20 <response> 21 <checkpoint>major milestone only</checkpoint> 22 <checkpoint>major milestone only</checkpoint> 23 <verification_summary>brief check against supplied evidence</verification_summary> 24 <final_answer>final answer</final_answer> 25 </response> 26 27 Keep checkpoints concise. Do not dump a full scratchpad. 28 """ 29 30 def parse(self, content: str) -> ReasoningTrace: 31 checkpoints = re.findall(r"<checkpoint>(.*?)</checkpoint>", content, flags=re.DOTALL) 32 steps = [ReasoningStep("checkpoint", cp.strip()) for cp in checkpoints if cp.strip()] 33 verification = self._extract_tag(content, "verification_summary") 34 if verification: 35 steps.append(ReasoningStep("verification", verification)) 36 return ReasoningTrace( 37 steps=steps, 38 final_answer=self._extract_tag(content, "final_answer"), 39 ) 40 41 def _extract_tag(self, content: str, tag: str) -> str: 42 match = re.search(rf"<{tag}>(.*?)</{tag}>", content, flags=re.DOTALL) 43 return match.group(1).strip() if match else "" 44 45model_response = """ 46<response> 47 <checkpoint>Order API reports the stand shipped separately.</checkpoint> 48 <checkpoint>Carrier API reports laptop ETA as tomorrow.</checkpoint> 49 <verification_summary>Return-policy tool permits an opened stand within 30 days.</verification_summary> 50 <final_answer>You can return the stand; the current laptop ETA is tomorrow.</final_answer> 51</response> 52""" 53 54trace = SinglePathReasoner().parse(model_response) 55print(json.dumps({ 56 "steps": [asdict(step) for step in trace.steps], 57 "final_answer": trace.final_answer, 58}, indent=2))
Output
1{ 2 "steps": [ 3 { 4 "kind": "checkpoint", 5 "content": "Order API reports the stand shipped separately." 6 }, 7 { 8 "kind": "checkpoint", 9 "content": "Carrier API reports laptop ETA as tomorrow." 10 }, 11 { 12 "kind": "verification", 13 "content": "Return-policy tool permits an opened stand within 30 days." 14 } 15 ], 16 "final_answer": "You can return the stand; the current laptop ETA is tomorrow." 17}

Notice what this code does. The RESPONSE_TEMPLATE constrains output to major evidence-backed milestones. It does not make the claims true by itself; downstream code still needs to require tool results for external facts. Hidden token-by-token deliberation stays inside the model runtime, while the application receives a compact interface it can validate.

For our fulfillment-center problem, a single-path trace might look like this:

  • Checkpoint 1: Order tool reports that the stand shipped separately.
  • Checkpoint 2: Carrier tool reports the current laptop ETA as tomorrow.
  • Verification summary: Return-policy tool permits this return within its stated window.
  • Final answer: You can return the stand. The current laptop ETA is tomorrow.

A single bounded path is the cheapest candidate strategy when tool evidence or an exact check can validate the result. But what if multiple plans remain plausible?

releasing-only-evidence-backed-claims.py
1import json 2 3def release_answer(tool_results: dict[str, str | None]) -> dict[str, object]: 4 missing = [key for key, value in tool_results.items() if value is None] 5 if missing: 6 return {"release": False, "missing_evidence": missing, "next_step": "request_tool_data"} 7 return { 8 "release": True, 9 "answer": ( 10 f"Return status: {tool_results['return_policy']}. " 11 f"Current laptop ETA: {tool_results['laptop_eta']}." 12 ), 13 } 14 15attempts = [ 16 {"return_policy": "allowed within 30 days", "laptop_eta": None}, 17 {"return_policy": "allowed within 30 days", "laptop_eta": "tomorrow"}, 18] 19 20print(json.dumps([release_answer(attempt) for attempt in attempts], indent=2))
Output
1[ 2 { 3 "release": false, 4 "missing_evidence": [ 5 "laptop_eta" 6 ], 7 "next_step": "request_tool_data" 8 }, 9 { 10 "release": true, 11 "answer": "Return status: allowed within 30 days. Current laptop ETA: tomorrow." 12 } 13]

Asking several agents: Best-of-N sampling

A simple way to pursue better reliability is to generate multiple candidate answers and rank them. Think of it like asking five support agents to solve the same ticket independently, then requiring a policy check or a validated reviewer before selecting a response.

This is called Best-of-N sampling. You run the single-path generator N times with a diversity-producing configuration, rank candidates with the most reliable evaluator available, and release an eligible winner. Temperature is one knob: low temperature tends to collapse diversity, while a higher value can produce different plans. Neither setting guarantees diversity or correctness.

Here's a deterministic version of the selector. In production, the candidates come from separate model calls; the selection rule stays the same:

asking-several-agents-best-of-n-sampling.py
1from dataclasses import asdict, dataclass 2import json 3import math 4 5@dataclass 6class CandidateTrace: 7 candidate_id: str 8 answer: str 9 verifier_score: float 10 11class BestOfNSampler: 12 @staticmethod 13 def probability_at_least_one_success(base_success_rate: float, n: int) -> float: 14 return 1 - math.pow(1 - base_success_rate, n) 15 16 def choose(self, traces: list[CandidateTrace]) -> CandidateTrace: 17 return max(traces, key=lambda trace: trace.verifier_score) 18 19traces = [ 20 CandidateTrace("A", "Refund stand; laptop ETA unknown.", 0.61), 21 CandidateTrace("B", "Return allowed; laptop and cable due tomorrow.", 0.93), 22 CandidateTrace("C", "Wait for all items before returning anything.", 0.28), 23 CandidateTrace("D", "Open carrier claim immediately.", 0.44), 24 CandidateTrace("E", "Return denied because order is partial.", 0.19), 25] 26 27sampler = BestOfNSampler() 28winner = sampler.choose(traces) 29 30print(json.dumps({ 31 "n": len(traces), 32 "base_success_rate": 0.30, 33 "at_least_one_success": round( 34 sampler.probability_at_least_one_success(0.30, len(traces)), 35 3, 36 ), 37 "selected": asdict(winner), 38}, indent=2))
Output
1{ 2 "n": 5, 3 "base_success_rate": 0.3, 4 "at_least_one_success": 0.832, 5 "selected": { 6 "candidate_id": "B", 7 "answer": "Return allowed; laptop and cable due tomorrow.", 8 "verifier_score": 0.93 9 } 10}

Why this helps: a worked example

Suppose the base model has a 30% chance of producing a correct routing plan for a complex multi-stop delivery problem. With one sample, your success rate is 30%. With five independent samples, the probability that at least one is correct is:

P(at least one success)=1−(1−0.30)5=1−0.168=0.83P(\text{at least one success}) = 1 - (1 - 0.30)^5 = 1 - 0.168 = 0.83P(at least one success)=1−(1−0.30)5=1−0.168=0.83

That's 83% for the event that a correct candidate exists under the independence assumption. It is not an 83% success rate for the returned answer: selection succeeds only when a checker or verifier recognizes the correct candidate. Extra sampling is wasted if candidates are correlated or the verifier cannot identify the useful one.

Worse, a reward model is only a proxy for correctness, so pushing N too high can backfire. Gao et al. show that as you optimize harder against a learned reward model, true (gold) quality eventually drops even while the proxy score keeps climbing, a Goodhart-style effect they call reward-model overoptimization [7]. In practice you cap N, ensemble or regularize the reward model, and watch a held-out gold metric rather than trusting the verifier score alone.

When an exact checker exists, it should outrank a persuasive learned score. The example below selects a tested route rather than the answer with the highest style score:

prefer-exact-checks-over-proxy-scores.py
1import json 2 3candidates = [ 4 {"id": "A", "route": ["hub", "bridge", "customer"], "proxy_score": 0.96}, 5 {"id": "B", "route": ["hub", "tunnel", "customer"], "proxy_score": 0.82}, 6] 7blocked_edges = {("hub", "bridge")} 8 9def passes_constraints(route: list[str]) -> bool: 10 edges = set(zip(route, route[1:])) 11 return not bool(edges & blocked_edges) 12 13checked = [ 14 {**candidate, "exact_check": passes_constraints(candidate["route"])} 15 for candidate in candidates 16] 17eligible = [candidate for candidate in checked if candidate["exact_check"]] 18winner = max(eligible, key=lambda candidate: candidate["proxy_score"]) 19 20print(json.dumps({"checked": checked, "selected": winner["id"]}, indent=2))
Output
1{ 2 "checked": [ 3 { 4 "id": "A", 5 "route": [ 6 "hub", 7 "bridge", 8 "customer" 9 ], 10 "proxy_score": 0.96, 11 "exact_check": false 12 }, 13 { 14 "id": "B", 15 "route": [ 16 "hub", 17 "tunnel", 18 "customer" 19 ], 20 "proxy_score": 0.82, 21 "exact_check": true 22 } 23 ], 24 "selected": "B" 25}

Self-consistency: Best-of-N without a reward model

When the answer is a comparable string (a number, a label, a parse), you can sometimes skip a learned reward model. Self-consistency samples several traces and returns the most frequent answer [8]. It can improve selected reasoning benchmarks when correct answers concentrate more than incorrect ones, but majority vote is not verification: correlated wrong answers still win. Prefer an exact checker whenever available; use voting when endpoints are comparable and empirical evaluation supports it.

Where beam search fits

Beam search is the classic decoding baseline: keep the top-kkk most likely partial continuations at each step. It is easy to batch, but on tasks requiring distinct hypotheses its high-probability branches can become near-duplicates rather than useful alternatives. Evaluate it as a baseline or pruning mechanism against sampling and search on the task distribution you actually serve.

When one path isn't enough: exploring a tree

Some problems force you to backtrack. Imagine a warehouse robot that must pack orders into vans while respecting weight limits, fragile-item rules, and delivery time windows. If the robot assigns the heavy pallet to Van A and later discovers Van A can't cross the bridge, it needs to undo that decision and try Van C instead.

Tree of Thoughts (ToT), proposed by Yao et al. [9], treats reasoning as a tree search. At each step, the model generates multiple possible next actions, scores them, and explores the most promising branches while pruning the weak ones. It's similar to playing chess: you consider several moves, evaluate the board, explore the promising lines, and abandon the bad ones.

Tree search diagram showing early rejection of a warehouse route that fails a constraint and continued expansion of candidates that remain eligible. Tree search diagram showing early rejection of a warehouse route that fails a constraint and continued expansion of candidates that remain eligible.
Tree search spends tokens only on branches that survive partial-state scoring.

The diagram shows a search tree for our warehouse problem. From the initial state, three van assignments are possible. Thought B is pruned immediately because it violates a weight limit. Thought A looks promising, so it expands into two sub-assignments. A.2 turns out to break a time-window constraint and gets pruned. A.1 and C.1 both reach valid solutions.

To implement this, we maintain a search beam that keeps only candidate branches active. At each depth, the system generates multiple next steps, scores them with a task-specific evaluator, and prunes paths that fall below its threshold. That evaluator may be an exact constraint checker, a learned reward model, or a combination:

when-one-path-isnt-enough-exploring-a-tree.py
1from dataclasses import dataclass 2import json 3 4EXPANSIONS = { 5 "start": ["Assign pallet to Van A", "Assign pallet to Van B", "Assign pallet to Van C"], 6 "Assign pallet to Van A": ["A.1 move fragile boxes last", "A.2 use bridge route"], 7 "Assign pallet to Van C": ["C.1 keep express items together", "C.2 split cable to late route"], 8} 9 10SCORES = { 11 "Assign pallet to Van A": 0.74, 12 "Assign pallet to Van B": 0.12, 13 "Assign pallet to Van C": 0.81, 14 "A.1 move fragile boxes last": 0.70, 15 "A.2 use bridge route": 0.18, 16 "C.1 keep express items together": 0.96, 17 "C.2 split cable to late route": 0.39, 18} 19 20@dataclass 21class ThoughtNode: 22 content: str 23 depth: int 24 score: float 25 parent: "ThoughtNode | None" = None 26 27 def trace(self) -> str: 28 if self.parent is None: 29 return self.content 30 return f"{self.parent.trace()}\n{self.content}" 31 32class TreeOfThoughts: 33 def __init__(self, beam_width: int = 2, min_score: float = 0.35): 34 self.beam_width = beam_width 35 self.min_score = min_score 36 37 def search(self, max_depth: int = 2) -> dict[str, object]: 38 root = ThoughtNode("start", depth=0, score=1.0) 39 beam = [root] 40 rounds = [] 41 42 for depth in range(max_depth): 43 candidates = [] 44 for node in beam: 45 for thought in EXPANSIONS.get(node.content, []): 46 candidates.append(ThoughtNode(thought, depth + 1, SCORES[thought], node)) 47 48 survivors = [node for node in candidates if node.score >= self.min_score] 49 survivors.sort(key=lambda node: node.score, reverse=True) 50 beam = survivors[:self.beam_width] 51 rounds.append({ 52 "depth": depth + 1, 53 "kept": [node.content for node in beam], 54 "pruned": [ 55 node.content for node in candidates 56 if node.score < self.min_score 57 ], 58 }) 59 60 best = max(beam, key=lambda node: node.score) 61 return {"winner": best.trace().split("\n"), "rounds": rounds} 62 63print(json.dumps(TreeOfThoughts().search(), indent=2))
Output
1{ 2 "winner": [ 3 "start", 4 "Assign pallet to Van C", 5 "C.1 keep express items together" 6 ], 7 "rounds": [ 8 { 9 "depth": 1, 10 "kept": [ 11 "Assign pallet to Van C", 12 "Assign pallet to Van A" 13 ], 14 "pruned": [ 15 "Assign pallet to Van B" 16 ] 17 }, 18 { 19 "depth": 2, 20 "kept": [ 21 "C.1 keep express items together", 22 "A.1 move fragile boxes last" 23 ], 24 "pruned": [ 25 "A.2 use bridge route" 26 ] 27 } 28 ] 29}

This is a beam-search realization of Tree of Thoughts: simple to batch, simple to reason about, and often the first version teams ship. The algorithm starts with the problem, expands each active node into branching_factor children, evaluates every child, and keeps only the beam_width highest-ranked candidates for the next round.

When you want true MCTS

Tree of Thoughts is a general framework. If you need adaptive revisits instead of level-by-level pruning, you can swap the beam for Monte Carlo Tree Search. In LLM settings, a common variant is PUCT (Predictor + Upper Confidence bounds applied to Trees), the AlphaZero-style selection rule that combines a value estimate with the model's next-step prior [10]:

PUCT(s,a)=Q(s,a)+cpuct⋅P(s,a)⋅N(s)1+N(s,a)\text{PUCT}(s, a) = Q(s, a) + c_{\text{puct}} \cdot P(s, a) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}PUCT(s,a)=Q(s,a)+cpuct​⋅P(s,a)⋅1+N(s,a)N(s)​​

Where:

  • Q(s,a)Q(s, a)Q(s,a) is the estimated downstream value for taking action aaa in state sss (from rollouts, a value model, or another evaluator)
  • P(s,a)P(s, a)P(s,a) is the policy prior for the next thought
  • N(s)N(s)N(s) is the total number of visits to the parent node
  • N(s,a)N(s, a)N(s,a) is the number of visits to the child node
  • cpuctc_{\text{puct}}cpuct​ is the exploration constant that balances exploration vs exploitation

The first term exploits high-value branches. The second term favors branches that either have a strong model prior or have not been explored much yet. Beam search is usually easier to run on GPUs because you can batch every frontier expansion together. MCTS can become attractive when evaluator calls are expensive and you want to revisit promising nodes instead of expanding every branch evenly.

Who judges the plans? Verifiers and reward models

Generating many candidate plans is useless if you cannot identify eligible results. Use exact checkers for objective constraints wherever possible. When exact checking is unavailable and a learned verifier scores candidate traces, two common forms are:

An Outcome Reward Model (ORM) scores only the final answer. It's like an audit that checks whether the final shipment plan is valid without inspecting the intermediate routing steps. An ORM is cheap to train because it needs only one label per chain, but it's sparse: if the answer is wrong, you get no signal about where the reasoning broke down.

A Process Reward Model (PRM) scores individual reasoning steps, not just the final answer [11]. When step quality is labelable and the PRM is validated on the served domain, it can help prune low-scored branches before paying for complete traces. It remains a learned proxy, not proof that a step is correct.

Think of the PRM like a fulfillment auditor who estimates risk at each routing step. When a candidate assigns the heavy pallet to Van A, a validated PRM may give that step a low score. If an exact bridge-rule checker is available, use it as the authoritative rejection; otherwise the score is a pruning signal whose false positives and negatives need evaluation.

How a PRM works

A PRM is typically a language model with a scalar reward head, trained to predict the correctness of a step sts_tst​ given the prompt and the previous steps s1...t−1s_{1...t-1}s1...t−1​. Lightman et al. released PRM800K, a dataset with 800,000 step-level human feedback labels, because step supervision is the bottleneck for building these verifiers well [11].

P(correct∣context,s1...t)=σ(W⋅ht)P(\text{correct} \mid \text{context}, s_{1...t}) = \sigma(W \cdot h_t)P(correct∣context,s1...t​)=σ(W⋅ht​)

We take the hidden representation of the current step (hth_tht​), map it through a learned linear layer (WWW), and squash the result with a sigmoid (σ\sigmaσ). That gives a score between 0.0 and 1.0; treating it as a calibrated probability requires separate calibration and held-out validation.

Training data is collected via three main paths:

  1. Human feedback: Experts label individual steps as Positive, Negative, or Neutral.
  2. Monte Carlo estimation: Roll out NNN completions from a step. If MMM of NNN lead to the correct answer, the step score is roughly M/NM/NM/N.
  3. Teacher-model supervision: Use a stronger verifier or frontier model to pre-label steps, then audit those labels before trusting them at scale.

Here's a practical scoring interface. In production the scoring call is a model with a reward head; this runnable version uses deterministic rules so you can see how weak steps pull down the whole trace:

how-a-prm-works.py
1from dataclasses import asdict, dataclass 2import json 3 4@dataclass 5class StepScore: 6 step: str 7 score: float 8 9class ProcessRewardModel: 10 def score_step(self, step: str) -> float: 11 lowered = step.lower() 12 if "bridge restriction" in lowered or "over weight" in lowered: 13 return 0.15 14 if "checked" in lowered or "within limit" in lowered: 15 return 0.92 16 return 0.65 17 18 def score_full_trace(self, steps: list[str]) -> dict[str, object]: 19 scores = [StepScore(step, self.score_step(step)) for step in steps] 20 trace_score = min(score.score for score in scores) if scores else 0.0 21 return { 22 "step_scores": [asdict(score) for score in scores], 23 "trace_score": trace_score, 24 "decision": "prune" if trace_score < 0.35 else "keep", 25 } 26 27trace = [ 28 "Checked package weights against each van limit.", 29 "Assign pallet to Van A despite bridge restriction.", 30 "Promise same-day delivery for every item.", 31] 32 33print(json.dumps(ProcessRewardModel().score_full_trace(trace), indent=2))
Output
1{ 2 "step_scores": [ 3 { 4 "step": "Checked package weights against each van limit.", 5 "score": 0.92 6 }, 7 { 8 "step": "Assign pallet to Van A despite bridge restriction.", 9 "score": 0.15 10 }, 11 { 12 "step": "Promise same-day delivery for every item.", 13 "score": 0.65 14 } 15 ], 16 "trace_score": 0.15, 17 "decision": "prune" 18}

The score_full_trace method uses the minimum score as a conservative pruning heuristic. In production, set thresholds against held-out task outcomes and prefer exact checks for constraints such as weight limits or policy eligibility.

ORM vs PRM at a glance

FeatureOutcome Reward Model (ORM)Process Reward Model (PRM)
ScoringOnly scores the final answerScores every intermediate step
Feedback signalSparse (1 signal per chain)Dense (1 signal per step)
Search utilityOnly helps select final outputGuides search; enables early pruning
Training costLower (less labeling effort)Higher (requires step-level labels)
Outcome reward model waits until final answer while a validated process reward model can flag a low-scored intermediate branch earlier. Outcome reward model waits until final answer while a validated process reward model can flag a low-scored intermediate branch earlier.
A validated PRM can prune low-scored intermediate steps earlier; exact checks remain stronger evidence where available.

Lightman et al. (2023) [11] found that process supervision outperformed outcome supervision in their challenging math setting. Carrying that result into routing, support policy, or code tasks requires domain-specific data and validation.

Training reasoning models with RL

Modern reasoning models don't rely only on external search controllers. DeepSeek-R1 shows a complementary path where reinforcement learning makes the base policy itself better at producing long, structured reasoning traces [2].

Group Relative Policy Optimization (GRPO) is one important algorithm in that line of work, and DeepSeek-R1 made it especially visible in open-model practice [2]. Compared with PPO (Proximal Policy Optimization), GRPO removes the need for a separate critic model:

  1. Sample a group of outputs from the policy for each problem
  2. Compute rewards for each output (using task rewards, exact verifiers where available, or validated reward models)
  3. Calculate the relative advantage within the group (baseline = group mean)
  4. Update the policy to favor higher-reward outputs

This cuts RL training overhead because you don't need a separate critic network. It can make the base model more sample-efficient at test time, but it doesn't remove the need for routing, budget limits, or verifier design in production.

Putting it together: a production reasoning agent

So far we've looked at individual techniques. A real production system wires them into a pipeline that classifies the incoming problem, picks a strategy, runs the search, verifies the results, and streams progress back to the user.

What the system must handle

Before drawing boxes, let's list the concrete responsibilities:

  • Complex reasoning: Accept problems that need multi-step logic (delivery routing, refund eligibility checks, inventory allocation).
  • Candidate state: Maintain external candidate actions, tool observations, and evaluator results needed for search and audit, without assuming access to a model's hidden scratchpad.
  • Adaptive compute: Scale inference compute proportional to problem difficulty. A "What is your return policy?" question shouldn't trigger a 20-second tree search.
  • Streaming: Emit status updates or safe progress summaries while the solver is working, without exposing unverified conclusions as completed checks.
  • Budget control: Enforce configurable reasoning budgets in tokens, time, branches, and cost; product tiers may change caps but must not weaken evidence requirements.

On the non-functional side, declare measurable objectives: concurrency for a target workload, quality lift over single pass on representative verifiable tasks, p95 latency, and per-request cost by route. A 3x or 10x budget is an experiment parameter, not a production guarantee.

Architecture overview

The diagram below shows the full pipeline. Requests enter through a gateway, get classified by a difficulty router, and then flow through either a fast path, a medium path (Best-of-N), or a deep path (Tree of Thoughts). Branching paths should use the best available evaluator and reuse key-value (KV) cache prefixes only where the serving runtime safely supports compatible prefix sharing.

Production reasoning agent architecture with gateway, difficulty router, fast path, Best-of-N path, tree-search path, verifier, cache layer, and streaming aggregator. Production reasoning agent architecture with gateway, difficulty router, fast path, Best-of-N path, tree-search path, verifier, cache layer, and streaming aggregator.
Production reasoning agents are routing and verification systems, not one giant prompt.

The request flow below shows the sequence in more detail:

Reasoning request flow timeline from user through gateway, router, solver, KV cache, verifier, aggregator, progress events, and final answer. Reasoning request flow timeline from user through gateway, router, solver, KV cache, verifier, aggregator, progress events, and final answer.
The expensive loop is generate, reuse cache, verify, prune, and stream safe progress.

Notice the loop in the request flow. The solver generates candidates, reuses shared KV prefixes, scores with the verifier, and streams progress events back. This loop is where the extra test-time compute lives.

The difficulty router

Not every problem needs extended thinking. Simple factual queries waste compute if treated as reasoning tasks. The router evaluates input complexity and assigns a strategy plus token budget. Here's a minimal implementation:

the-difficulty-router.py
1from dataclasses import asdict, dataclass 2import json 3 4@dataclass 5class Route: 6 difficulty: str 7 strategy: str 8 max_tokens: int 9 budget_multiplier: int 10 11class DifficultyRouter: 12 CONFIGS = { 13 "trivial": Route("trivial", "single_pass", 500, 1), 14 "easy": Route("easy", "single_path", 2_000, 2), 15 "medium": Route("medium", "best_of_3", 4_000, 5), 16 "hard": Route("hard", "tree_search", 8_000, 10), 17 "extreme": Route("extreme", "tree_search", 16_000, 20), 18 } 19 20 def classify(self, problem: str) -> Route: 21 text = problem.lower() 22 if "prove" in text or "counterexample" in text or "cyclic graph" in text: 23 return self.CONFIGS["hard"] 24 if "multiple constraints" in text or "debug" in text: 25 return self.CONFIGS["medium"] 26 if "policy" in text or "return" in text: 27 return self.CONFIGS["easy"] 28 return self.CONFIGS["trivial"] 29 30router = DifficultyRouter() 31problems = [ 32 "What is your return policy?", 33 "Debug this routing function with multiple constraints.", 34 "Find a counterexample for this cyclic graph algorithm.", 35] 36 37print(json.dumps([asdict(router.classify(problem)) for problem in problems], indent=2))
Output
1[ 2 { 3 "difficulty": "easy", 4 "strategy": "single_path", 5 "max_tokens": 2000, 6 "budget_multiplier": 2 7 }, 8 { 9 "difficulty": "medium", 10 "strategy": "best_of_3", 11 "max_tokens": 4000, 12 "budget_multiplier": 5 13 }, 14 { 15 "difficulty": "hard", 16 "strategy": "tree_search", 17 "max_tokens": 8000, 18 "budget_multiplier": 10 19 } 20]

The router should be cheap relative to solving work. Its job is to avoid expensive search on requests that do not benefit while escalating tasks whose risk and verifiability justify more work. Misclassification is not harmless: under-routing can release an incorrect answer, while over-routing can increase cost, latency, and exposure to verifier overoptimization.

Compute-optimal scheduling

The key insight from Snell et al. [3] is that no single test-time strategy dominates every prompt. The best move depends on problem difficulty and on whether the base model already has a plausible path to the correct answer. A production scheduler should start cheap and escalate only when an exact check or a validated evaluation signal says the first attempt is weak:

compute-optimal-scheduling.py
1import json 2 3class ComputeOptimalScheduler: 4 def choose_strategy( 5 self, evaluation_score: float, verifiable: bool, high_impact: bool = False 6 ) -> dict[str, object]: 7 if high_impact and not verifiable: 8 return {"strategy": "tool_or_human_review", "reason": "high-impact claim lacks verifier"} 9 if not verifiable: 10 return {"strategy": "single_pass", "reason": "creative output; no objective ranking"} 11 if evaluation_score >= 0.90: 12 return {"strategy": "single_path", "reason": "first trace passed verifier"} 13 if evaluation_score >= 0.60: 14 return {"strategy": "best_of_3", "reason": "plausible trace needs reranking"} 15 return {"strategy": "tree_search", "reason": "weak trace needs backtracking"} 16 17requests = [ 18 {"task": "Return-policy FAQ", "evaluation_score": 0.94, "verifiable": True}, 19 {"task": "Ambiguous route plan", "evaluation_score": 0.72, "verifiable": True}, 20 {"task": "Cyclic graph bug", "evaluation_score": 0.41, "verifiable": True}, 21 {"task": "Brand voice poem", "evaluation_score": 0.52, "verifiable": False}, 22 {"task": "Approve refund without policy record", "evaluation_score": 0.82, "verifiable": False, "high_impact": True}, 23] 24 25scheduler = ComputeOptimalScheduler() 26for request in requests: 27 request.update(scheduler.choose_strategy(request["evaluation_score"], request["verifiable"], request.get("high_impact", False))) 28 29print(json.dumps(requests, indent=2))
Output
1[ 2 { 3 "task": "Return-policy FAQ", 4 "evaluation_score": 0.94, 5 "verifiable": true, 6 "strategy": "single_path", 7 "reason": "first trace passed verifier" 8 }, 9 { 10 "task": "Ambiguous route plan", 11 "evaluation_score": 0.72, 12 "verifiable": true, 13 "strategy": "best_of_3", 14 "reason": "plausible trace needs reranking" 15 }, 16 { 17 "task": "Cyclic graph bug", 18 "evaluation_score": 0.41, 19 "verifiable": true, 20 "strategy": "tree_search", 21 "reason": "weak trace needs backtracking" 22 }, 23 { 24 "task": "Brand voice poem", 25 "evaluation_score": 0.52, 26 "verifiable": false, 27 "strategy": "single_pass", 28 "reason": "creative output; no objective ranking" 29 }, 30 { 31 "task": "Approve refund without policy record", 32 "evaluation_score": 0.82, 33 "verifiable": false, 34 "high_impact": true, 35 "strategy": "tool_or_human_review", 36 "reason": "high-impact claim lacks verifier" 37 } 38]

This creates an adaptive system that spends small budgets on easy, checkable problems and escalates only when expected quality lift is worth extra latency and KV-cache pressure. It does not treat "unverifiable" as permission to make high-impact decisions.

enforcing-branch-stop-budgets.py
1import json 2 3def stop_reason( 4 tokens_used: int, 5 token_cap: int, 6 elapsed_ms: int, 7 wall_clock_cap_ms: int, 8 repeated_state: bool, 9 score_history: list[float], 10) -> str: 11 if repeated_state: 12 return "repeated_state" 13 if tokens_used >= token_cap: 14 return "token_cap" 15 if elapsed_ms >= wall_clock_cap_ms: 16 return "wall_clock_cap" 17 if len(score_history) >= 3 and max(score_history[-3:]) <= max(score_history[:-3], default=-1): 18 return "no_recent_score_improvement" 19 return "continue" 20 21branches = [ 22 {"tokens_used": 1200, "token_cap": 4000, "elapsed_ms": 800, "wall_clock_cap_ms": 3000, 23 "repeated_state": True, "score_history": [0.54, 0.55]}, 24 {"tokens_used": 2500, "token_cap": 4000, "elapsed_ms": 1800, "wall_clock_cap_ms": 3000, 25 "repeated_state": False, "score_history": [0.72, 0.70, 0.71, 0.69]}, 26 {"tokens_used": 900, "token_cap": 4000, "elapsed_ms": 500, "wall_clock_cap_ms": 3000, 27 "repeated_state": False, "score_history": [0.40, 0.55]}, 28] 29 30print(json.dumps([stop_reason(**branch) for branch in branches], indent=2))
Output
1[ 2 "repeated_state", 3 "no_recent_score_improvement", 4 "continue" 5]

The serving bottleneck nobody talks about

Long reasoning traces can stress memory as well as floating-point operations (FLOPs). Every live generated token adds a new key and value vector for every transformer layer, so key-value (KV) cache pressure grows linearly with live context length and multiplies across unshared branches.

KV cache pressure

A useful back-of-the-envelope formula is:

KV bytes per token≈2⋅L⋅Hkv⋅dhead⋅bytes per element\text{KV bytes per token} \approx 2 \cdot L \cdot H_{kv} \cdot d_{\text{head}} \cdot \text{bytes per element}KV bytes per token≈2⋅L⋅Hkv​⋅dhead​⋅bytes per element

Where LLL is the number of layers, HkvH_{kv}Hkv​ is the number of KV heads (not always the full attention-head count if the model uses grouped-query attention (GQA)), and dheadd_{\text{head}}dhead​ is the head dimension. For an 80-layer model with 8 KV heads, head dimension 128, and bfloat16 (BF16) activations:

2⋅80⋅8⋅128⋅2=327,680 bytes per token≈320 KB/token2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 = 327{,}680 \text{ bytes per token} \approx 320 \text{ KB/token}2⋅80⋅8⋅128⋅2=327,680 bytes per token≈320 KB/token

At 16k live tokens, that's roughly 5.0 GiB of KV cache for one branch. Without any prefix reuse, a tree search with eight active 16k-token branches would need about 40 GiB of KV cache before you count model weights or temporary activations.

Prefix sharing and paged KV storage

Best-of-N and tree search create many branches that share the same prompt and often the same early reasoning prefix. If you duplicate that prefix for every branch, memory usage grows with the number and length of branches. Systems like vLLM use PagedAttention to manage KV memory in blocks, which cuts fragmentation and makes long contexts much easier to serve [12].

PagedAttention and radix-style prefix caches solve related but different problems. PagedAttention makes allocation cheaper. RadixAttention-style caches, as used in SGLang, index shared prefixes in a tree so sibling branches can reuse the same cached prefix instead of copying it branch by branch [13]. In a reasoning agent you usually want both: block-level memory management plus prefix-aware reuse.

That changes the capacity math materially. If eight branches share a 12k-token prefix and each branch adds only a 4k-token unique suffix, the effective footprint is closer to one 12k prefix plus eight 4k suffixes, or about 14 GiB, not the 40 GiB worst case above. The exact savings depend on how quickly branches diverge and on whether your runtime can share prefixes at token-level or block-level granularity.

KV prefix reuse concept showing many reasoning branches sharing one cached prefix and paying only for unique suffixes. KV prefix reuse concept showing many reasoning branches sharing one cached prefix and paying only for unique suffixes.
Compatible prefix reuse reduces branch KV footprint by sharing prompt and early-reasoning blocks, then allocating each branch's new suffix.

This is one reason test-time search is an infrastructure problem, not just a prompting trick. A search policy may be sound on paper while its branch width, trace length, or lack of compatible prefix reuse exhausts the serving memory budget before useful search completes.

TTFT vs inter-token latency

Extended reasoning changes the latency contract. If the product emits no status until the answer is ready, a long hidden search phase inflates time to first user-visible token (TTFT). If it emits a quick progress event, TTFT can look healthy while time-to-final-answer and total budget still fail. Measure both.

That is why production systems can stream progress events instead of raw scratchpads. Safe events include selected strategy, budget usage, branch counts, or summaries of externally confirmed checks. Treat learned verifier scores as internal diagnostics unless they are calibrated and presented with an appropriate contract.

A sample capacity plan

The table below uses the 320 KB/token estimate to show how quickly memory usage grows as you add longer traces or more live branches:

WorkloadLive BranchesContext per BranchApprox KV CacheOperational Implication
Single CoT18k tokens2.5 GiBUsually easy to colocate with the base model
Best-of-448k tokens10.0 GiBParallel sampling is practical, but batching headroom shrinks
Tree Search816k tokens40.0 GiBUsually needs dedicated search workers and aggressive prefix reuse

Those are worst-case branch-multiplication numbers. Strong prefix reuse can cut the effective KV footprint substantially.

Streaming safe updates

Users need feedback while the system is spending extra inference budget, but that does not mean you should dump raw internal scratchpad text or expose an uncalibrated reward score as confidence. A safer pattern is to stream strategy selection, budget usage, and externally grounded summaries while the solver keeps search details private:

streaming-safe-updates.py
1from dataclasses import asdict, dataclass 2import json 3 4@dataclass 5class StreamEvent: 6 type: str 7 content: str 8 score: float | None = None 9 10class ReasoningSolver: 11 def solve_stream(self) -> list[StreamEvent]: 12 return [ 13 StreamEvent("progress", "checked carrier status", 0.82), 14 StreamEvent("progress", "verified return policy", 0.91), 15 StreamEvent("final", "Return allowed; remaining items arrive tomorrow.", 0.88), 16 ] 17 18def stream_reasoning(config: dict[str, object], solver: ReasoningSolver) -> list[dict[str, object]]: 19 events = [ 20 {"type": "status", "content": "Analyzing problem difficulty..."}, 21 { 22 "type": "config", 23 "content": f"Using {config['strategy']} with budget {config['max_tokens']} tokens", 24 }, 25 ] 26 27 for event in solver.solve_stream(): 28 if event.type == "progress": 29 events.append(asdict(event)) 30 elif event.type == "final": 31 events.append({ 32 "type": "final_answer", 33 "content": event.content, 34 }) 35 return events 36 37config = {"strategy": "best_of_3", "max_tokens": 4_000} 38print(json.dumps(stream_reasoning(config, ReasoningSolver()), indent=2))
Output
1[ 2 { 3 "type": "status", 4 "content": "Analyzing problem difficulty..." 5 }, 6 { 7 "type": "config", 8 "content": "Using best_of_3 with budget 4000 tokens" 9 }, 10 { 11 "type": "progress", 12 "content": "checked carrier status", 13 "score": 0.82 14 }, 15 { 16 "type": "progress", 17 "content": "verified return policy", 18 "score": 0.91 19 }, 20 { 21 "type": "final_answer", 22 "content": "Return allowed; remaining items arrive tomorrow." 23 } 24]

When reasoning is the wrong tool

Not every problem benefits from test-time scaling. Applying tree search or Best-of-N to the wrong use case results in unnecessary latency and wasted resources. You should bypass extended reasoning for:

  • Simple source-of-truth queries (e.g., "What is your return policy?"). Query the policy source directly; extra internal reasoning will not recover a missing or current fact.
  • Retrieval failures. If the problem requires external facts or tools, longer internal reasoning is the wrong fix. You need retrieval, search, or tool use.
  • Open-ended marketing copy (e.g., "Write three brand voice options for a product page"). There is no single objective answer unless the application defines a rubric and validates it, so deep PRM-driven search is usually unjustified.
  • Latency-sensitive conversational turns. Live products need measured response objectives; a deep search path that exceeds that objective should acknowledge, defer, or avoid the search.
  • Cost-sensitive batch processing. When running millions of documents through a classification pipeline, a large inference multiplier can erase the value of a small quality gain. Escalate only request cohorts whose measured lift pays for the extra cost.

The guiding heuristic is simple: allocate substantial test-time compute when evaluation shows gain for the request class and evidence or verification can keep the answer trustworthy. Creative revision may still use a small candidate budget, but it is not proof-oriented reasoning.

Overthinking: when more thinking lowers accuracy

Longer reasoning is not monotonically better. Empirical 2025 work finds that accuracy often plateaus and can decline once a trace runs past a problem-specific sweet spot: extra steps add error accumulation, and a model can talk itself out of a correct answer it already found [14]. This overthinking failure mode is also a cost and latency problem, because a model can burn thousands of reasoning tokens on a trivial question. The defenses are the same controls you already have: a router that keeps trivial tasks on the fast path, a capped reasoning_effort or token budget, and a stop rule that exits once exact results are sufficient or validated evaluation stops improving. Treat reasoning budget as a tuned hyperparameter, not a dial you turn to maximum.

Why not just use a bigger model?

It's tempting to think that deploying a bigger frontier model should dominate every reasoning workload. In practice, the trade-off is more subtle:

ApproachCost PatternControl
Bigger model (Training / model switch)Higher steady-state inference costFixed after deployment; every request pays for the larger model
Test-time compute (Inference)Higher per-request latency and token costDynamic per problem; easy to scale up or down

Snell et al. make the practical point: on prompts where a smaller model already has some chance of success, extra inference-time compute can beat simply switching to a much larger model at matched compute [3]. But if the base model has near-zero task competence, reranking or sampling is unlikely to manufacture the missing capability. Test-time compute amplifies evaluated candidate quality; it does not supply missing knowledge or tools.

Common pitfalls

"Higher temperature will reveal a great answer if I sample enough"

Symptom: Best-of-N produces more varied traces, but quality barely improves and the chosen winner still feels random. Cause: Diversity alone is not enough. Without a verifier that tracks correctness, more samples mostly add noise. Fix: Increase temperature only when you also have a PRM, ORM, or exact checker that can reliably separate strong traces from fluent wrong ones.

"Tree search is always better than single-path reasoning"

Symptom: Easy requests suddenly become slower and more expensive even though accuracy barely moves. Cause: Search budget was applied to every request instead of only the hard, verifiable ones. Fix: Route by difficulty. Keep trivial and easy requests on the fast path, then escalate only when the first trace scores weakly.

"More thinking always improves the answer"

Symptom: The system finds a decent first answer, then keeps exploring until it talks itself into a worse one. Cause: Overthinking. Past a task-specific sweet spot, extra search compounds errors instead of fixing them. Fix: Stop when exact checks pass or a validated evaluator plateaus, cap reasoning effort, and keep the best eligible fallback that can survive later noisy branches.

"One budget policy can serve every request"

Symptom: Cheap FAQ traffic burns deep-search budget while hard debugging tasks run out of depth and tokens too early. Cause: Compute was allocated uniformly instead of matching task difficulty and business value. Fix: Separate routing, branch limits, token caps, and wall-clock caps by request class. Budget policy is part of product design, not only model tuning.

"Users need raw chain-of-thought to trust the system"

Symptom: Streams become long, unstable, and hard to parse, while downstream tools cannot tell what is final. Cause: Internal scratchpad text was exposed instead of a stable progress interface. Fix: Stream strategy, budget usage, and short evidence-backed summaries. Keep hidden search traces and uncalibrated evaluator scores inside the runtime.

"Reasoning quality comes only from prompts"

Symptom: Prompt tweaks plateau, but the team still has no reliable gain on hard tasks. Cause: Model capability, verifier quality, cache reuse, and stopping logic were collapsed into one prompt-design problem. Fix: Treat base model, search controller, verifier, and serving runtime as separate levers. Improve whichever layer is bottlenecking quality or cost.

"Stop criteria are an optional cleanup detail"

Symptom: Branches revisit the same semantic state until they exhaust the token budget or hit wall-clock limits. Cause: The controller has no hard policy for repeated states, budget exhaustion, or diminishing verifier returns. Fix: Enforce token, depth, wall-clock, and repeated-state limits per branch, and return the highest-ranked surviving trace that satisfies required checks when the budget ends.

Follow-up questions

Use these checkpoints to test your routing instinct. Decide first, then compare against the answer sketch.

Mastery checklist

Use this checklist to judge whether your final design is defensible:

  • Choose a routing strategy for trivial, medium, and hard verifiable tasks, and explain why each class gets that budget.
  • Defend when a task needs a single path, Best-of-N, tree search, or a bigger base model instead of more search.
  • Pick the right judge for the job: exact checker, ORM, or PRM, and explain what failure appears if you choose the wrong one.
  • Estimate rough KV-cache pressure for one branch, many branches, and a shared-prefix layout.
  • Define hard stop criteria for depth, tokens, wall-clock time, and repeated semantic states.
  • Describe what the user sees while the solver works, and why that stream should expose progress summaries instead of raw scratchpads.

What this completes

You now understand when allocating additional inference work can outperform a fixed single attempt, and when it cannot. You can build a difficulty router, run Best-of-N sampling, explore a Tree of Thoughts, and score partial traces with a Process Reward Model while still preferring exact evidence when available. You also know the serving constraints (KV cache growth, prefix sharing, and final-answer latency) that determine whether a search design is affordable.

This is why the roadmap ends here. Earlier chapters taught the serving stack, retrieval layer, agent loop, evaluation harness, and safety controls one piece at a time. This capstone asks you to connect them: a reasoning agent is useful only when search policy, evidence contract, evaluator quality, KV-cache budget, streaming behavior, and fallback plan all work together.

Key takeaways:

  1. Test-time compute buys additional attempts or checks; keep it only where held-out evaluation shows quality gains worth its cost and latency.
  2. A strategy ladder covers most systems: single path with tuned reasoning effort -> self-consistency or Best-of-N -> Tree of Thoughts / MCTS. More compute helps only up to a problem-specific sweet spot; past it, overthinking degrades accuracy.
  3. Process Reward Models (PRM) can score individual steps and enable earlier pruning, but exact checkers and calibrated evaluation remain authoritative where available.
  4. Compute-optimal scheduling uses quick attempts first and only escalates for hard problems.
  5. KV-cache design is a first-class constraint. Prefix sharing and paged KV storage decide whether search is affordable.
  6. Streaming should focus on grounded progress updates and verified summaries, not raw internal scratchpads or uncalibrated confidence.

This capstone closes the system-design arc by asking you to connect the algorithm, verifier, cache budget, streaming behavior, and fallback plan into one defensible system design. Your final deliverable is a proof-of-skill checklist.

Next Step
Continue to AI Lab Coding Interview: Python Systems

You will turn the curriculum into interview execution: practical Python systems, staged requirements, concurrency invariants, and production-shaped tests.

PreviousReal-Time Voice AI Agent
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

o1 Preview Model

OpenAI · 2024

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI · 2025

Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.

Snell, C., et al. · 2024 · arXiv preprint

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.

Wei, J., et al. · 2022 · NeurIPS

Reasoning models

OpenAI · 2026

GPT-5.5 Model

OpenAI · 2026

Scaling Laws for Reward Model Overoptimization

Gao, L., Schulman, J., & Hilton, J. · 2023

Self-Consistency Improves Chain of Thought Reasoning in Language Models.

Wang, X., et al. · 2023 · ICLR 2023

Tree of Thoughts: Deliberate Problem Solving with Large Language Models.

Yao, S., et al. · 2023 · NeurIPS

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.

Silver, D., et al. · 2017 · arXiv preprint

Let's Verify Step by Step.

Lightman, H., et al. · 2023 · ICLR

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon, W., et al. · 2023 · SOSP 2023

SGLang: Efficient Execution of Structured Language Model Programs

Zheng, L., Yin, L., Xie, Z., et al. · 2024 · NeurIPS 2024

Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models

Ghosal, S. S., Chakraborty, S., Reddy, S., et al. · 2025