Design a production reasoning agent that routes by difficulty, evaluates candidate work, requires evidence before release, and survives serving bottlenecks like key-value (KV) cache growth.
The Real-Time Voice AI Agent chapter focused on split-second streaming, interruption, and media state. This final capstone studies the opposite tradeoff: when a task is hard enough, the system may spend extra inference compute before it releases an answer or irreversible effect.
A reasoning agent can spend extra test-time compute on planning, checking, and tool use before its final answer. This design chapter explains when that checked work yields measured value and how to bound it with evidence contracts, evaluators, and budgets.
Imagine you manage a small fulfillment center. A customer sends a message: "I ordered a laptop, a monitor stand, and a cable on Monday with express shipping. The stand just arrived, but the laptop and cable haven't. The tracking number for the laptop says it's at the regional hub. Can I return the stand if I don't like it, and when will the rest arrive?"
A human support agent would not shout the first guess that comes to mind. They would check the order split, verify the tracking status, read the return policy, and only then compose an answer. A standard large language model (LLM) generates each continuation left to right; by itself, one answer attempt does not revisit earlier choices or consult missing evidence. A controller can still ask it to revise, call tools, or generate competing attempts. On tricky problems, the product question is whether that additional work is justified and verifiable.
Reasoning models such as OpenAI's earlier o-series reasoning models, including o1 [1], and DeepSeek-R1 [2] spend additional compute before returning an answer. Do not conflate that native internal reasoning with a product controller that explicitly samples candidates, calls tools, scores partial states, or performs tree search: those are separable designs with different evidence and serving contracts. This article builds the controller around those contracts, routing easy questions to bounded answers and hard, verifiable questions to deeper work.
Fast/slow-thinking terminology gives us a useful analogy. Fast thinking is automatic and intuitive. When you read "2 + 2 =", you instantly know the answer. Slow thinking is deliberate and sequential. When you multiply 17 by 34 in your head, you work through steps, catch yourself if you miscarry a digit, and restart if the intermediate result looks wrong. In a product, however, enforceable controls are external candidate state, tool evidence, evaluator behavior, and release gates, not a promise about hidden mental steps.
Single-pass decoding looks more like fast thinking. At every step the model samples the next token conditioned on everything so far; without a controller or tool it can emit a plausible answer without validating it. A reasoning-enabled product can instead allocate a bounded check, use an exact tool when one exists, and release only a result supported by that check.
The useful shift is to treat some inference requests as a budgeted search and verification problem, not just one left-to-right decoding loop. A controller may explore candidate actions, score partial solutions, consult external evidence, and spend extra compute only when evaluation justifies it.
The extra compute is called test-time compute: the floating-point operations (FLOPs) and tokens you spend during inference, after training is done. In a FLOPs-matched study, Snell et al. show that a smaller model plus the right inference-time strategy can beat a 14x larger model on easy-to-intermediate prompts where the smaller model already has a non-trivial chance of succeeding [3]. The same paper warns that the gain reverses on the hardest prompts and when you serve many inference tokens per query, so test-time compute is not a free substitute for a stronger base model. The lever you control is no longer just model size; it is measured inference budget.
The illustration above shows three budget strategies. A fixed budget spends the same tokens on every request. An escalation rule adds more compute only after a weak first attempt. An adaptive budget uses a difficulty router to match the strategy to the task before the expensive search begins. The rest of this article builds that adaptive system from the ground up.
Before we build a tree-search engine, let's look at the cheapest way to add reasoning. You generate one candidate solution, but you structure the output so the model shows its checkpoints. This is the modern successor to the classic "Chain of Thought" prompt, where you explicitly asked the model to "think step by step" [4].
On reasoning-trained APIs, direct task instructions are generally preferred over requests to reveal step-by-step thinking, because supported models already perform internal reasoning [5]. Your job is to request only outward artifacts the application needs: tool evidence, concise checked milestones, and a final answer. Hidden scratchpad text is not a proof or an audit log.
One common budget knob on OpenAI reasoning models is a reasoning.effort setting [6]. Supported values vary by model family, but the operational lesson is the same. Lower effort favors latency and fewer reasoning tokens, while higher effort permits a larger internal reasoning budget before answering. This is the cheapest form of test-time compute control: raise effort only when evals show a measurable quality gain that justifies the extra latency and cost, and keep the lowest useful setting for fast, deterministic tasks. Treat it as the single-path equivalent of the difficulty router you build later in this article.
Here's a lightweight parser for the interface you want the model to satisfy. It turns a compact XML response into visible checkpoints plus a final answer:
1from dataclasses import asdict, dataclass
2import json
3import re
4
5@dataclass
6class ReasoningStep:
7 kind: str
8 content: str
9
10@dataclass
11class ReasoningTrace:
12 steps: list[ReasoningStep]
13 final_answer: str
14
15class SinglePathReasoner:
16 RESPONSE_TEMPLATE = """
17 Answer using supplied tool observations only.
18
19 Return XML with this structure:
20 <response>
21 <checkpoint>major milestone only</checkpoint>
22 <checkpoint>major milestone only</checkpoint>
23 <verification_summary>brief check against supplied evidence</verification_summary>
24 <final_answer>final answer</final_answer>
25 </response>
26
27 Keep checkpoints concise. Do not dump a full scratchpad.
28 """
29
30 def parse(self, content: str) -> ReasoningTrace:
31 checkpoints = re.findall(r"<checkpoint>(.*?)</checkpoint>", content, flags=re.DOTALL)
32 steps = [ReasoningStep("checkpoint", cp.strip()) for cp in checkpoints if cp.strip()]
33 verification = self._extract_tag(content, "verification_summary")
34 if verification:
35 steps.append(ReasoningStep("verification", verification))
36 return ReasoningTrace(
37 steps=steps,
38 final_answer=self._extract_tag(content, "final_answer"),
39 )
40
41 def _extract_tag(self, content: str, tag: str) -> str:
42 match = re.search(rf"<{tag}>(.*?)</{tag}>", content, flags=re.DOTALL)
43 return match.group(1).strip() if match else ""
44
45model_response = """
46<response>
47 <checkpoint>Order API reports the stand shipped separately.</checkpoint>
48 <checkpoint>Carrier API reports laptop ETA as tomorrow.</checkpoint>
49 <verification_summary>Return-policy tool permits an opened stand within 30 days.</verification_summary>
50 <final_answer>You can return the stand; the current laptop ETA is tomorrow.</final_answer>
51</response>
52"""
53
54trace = SinglePathReasoner().parse(model_response)
55print(json.dumps({
56 "steps": [asdict(step) for step in trace.steps],
57 "final_answer": trace.final_answer,
58}, indent=2))1{
2 "steps": [
3 {
4 "kind": "checkpoint",
5 "content": "Order API reports the stand shipped separately."
6 },
7 {
8 "kind": "checkpoint",
9 "content": "Carrier API reports laptop ETA as tomorrow."
10 },
11 {
12 "kind": "verification",
13 "content": "Return-policy tool permits an opened stand within 30 days."
14 }
15 ],
16 "final_answer": "You can return the stand; the current laptop ETA is tomorrow."
17}Notice what this code does. The RESPONSE_TEMPLATE constrains output to major evidence-backed milestones. It does not make the claims true by itself; downstream code still needs to require tool results for external facts. Hidden token-by-token deliberation stays inside the model runtime, while the application receives a compact interface it can validate.
For our fulfillment-center problem, a single-path trace might look like this:
A single bounded path is the cheapest candidate strategy when tool evidence or an exact check can validate the result. But what if multiple plans remain plausible?
1import json
2
3def release_answer(tool_results: dict[str, str | None]) -> dict[str, object]:
4 missing = [key for key, value in tool_results.items() if value is None]
5 if missing:
6 return {"release": False, "missing_evidence": missing, "next_step": "request_tool_data"}
7 return {
8 "release": True,
9 "answer": (
10 f"Return status: {tool_results['return_policy']}. "
11 f"Current laptop ETA: {tool_results['laptop_eta']}."
12 ),
13 }
14
15attempts = [
16 {"return_policy": "allowed within 30 days", "laptop_eta": None},
17 {"return_policy": "allowed within 30 days", "laptop_eta": "tomorrow"},
18]
19
20print(json.dumps([release_answer(attempt) for attempt in attempts], indent=2))1[
2 {
3 "release": false,
4 "missing_evidence": [
5 "laptop_eta"
6 ],
7 "next_step": "request_tool_data"
8 },
9 {
10 "release": true,
11 "answer": "Return status: allowed within 30 days. Current laptop ETA: tomorrow."
12 }
13]A simple way to pursue better reliability is to generate multiple candidate answers and rank them. Think of it like asking five support agents to solve the same ticket independently, then requiring a policy check or a validated reviewer before selecting a response.
This is called Best-of-N sampling. You run the single-path generator N times with a diversity-producing configuration, rank candidates with the most reliable evaluator available, and release an eligible winner. Temperature is one knob: low temperature tends to collapse diversity, while a higher value can produce different plans. Neither setting guarantees diversity or correctness.
Here's a deterministic version of the selector. In production, the candidates come from separate model calls; the selection rule stays the same:
1from dataclasses import asdict, dataclass
2import json
3import math
4
5@dataclass
6class CandidateTrace:
7 candidate_id: str
8 answer: str
9 verifier_score: float
10
11class BestOfNSampler:
12 @staticmethod
13 def probability_at_least_one_success(base_success_rate: float, n: int) -> float:
14 return 1 - math.pow(1 - base_success_rate, n)
15
16 def choose(self, traces: list[CandidateTrace]) -> CandidateTrace:
17 return max(traces, key=lambda trace: trace.verifier_score)
18
19traces = [
20 CandidateTrace("A", "Refund stand; laptop ETA unknown.", 0.61),
21 CandidateTrace("B", "Return allowed; laptop and cable due tomorrow.", 0.93),
22 CandidateTrace("C", "Wait for all items before returning anything.", 0.28),
23 CandidateTrace("D", "Open carrier claim immediately.", 0.44),
24 CandidateTrace("E", "Return denied because order is partial.", 0.19),
25]
26
27sampler = BestOfNSampler()
28winner = sampler.choose(traces)
29
30print(json.dumps({
31 "n": len(traces),
32 "base_success_rate": 0.30,
33 "at_least_one_success": round(
34 sampler.probability_at_least_one_success(0.30, len(traces)),
35 3,
36 ),
37 "selected": asdict(winner),
38}, indent=2))1{
2 "n": 5,
3 "base_success_rate": 0.3,
4 "at_least_one_success": 0.832,
5 "selected": {
6 "candidate_id": "B",
7 "answer": "Return allowed; laptop and cable due tomorrow.",
8 "verifier_score": 0.93
9 }
10}Suppose the base model has a 30% chance of producing a correct routing plan for a complex multi-stop delivery problem. With one sample, your success rate is 30%. With five independent samples, the probability that at least one is correct is:
That's 83% for the event that a correct candidate exists under the independence assumption. It is not an 83% success rate for the returned answer: selection succeeds only when a checker or verifier recognizes the correct candidate. Extra sampling is wasted if candidates are correlated or the verifier cannot identify the useful one.
Worse, a reward model is only a proxy for correctness, so pushing N too high can backfire. Gao et al. show that as you optimize harder against a learned reward model, true (gold) quality eventually drops even while the proxy score keeps climbing, a Goodhart-style effect they call reward-model overoptimization [7]. In practice you cap N, ensemble or regularize the reward model, and watch a held-out gold metric rather than trusting the verifier score alone.
When an exact checker exists, it should outrank a persuasive learned score. The example below selects a tested route rather than the answer with the highest style score:
1import json
2
3candidates = [
4 {"id": "A", "route": ["hub", "bridge", "customer"], "proxy_score": 0.96},
5 {"id": "B", "route": ["hub", "tunnel", "customer"], "proxy_score": 0.82},
6]
7blocked_edges = {("hub", "bridge")}
8
9def passes_constraints(route: list[str]) -> bool:
10 edges = set(zip(route, route[1:]))
11 return not bool(edges & blocked_edges)
12
13checked = [
14 {**candidate, "exact_check": passes_constraints(candidate["route"])}
15 for candidate in candidates
16]
17eligible = [candidate for candidate in checked if candidate["exact_check"]]
18winner = max(eligible, key=lambda candidate: candidate["proxy_score"])
19
20print(json.dumps({"checked": checked, "selected": winner["id"]}, indent=2))1{
2 "checked": [
3 {
4 "id": "A",
5 "route": [
6 "hub",
7 "bridge",
8 "customer"
9 ],
10 "proxy_score": 0.96,
11 "exact_check": false
12 },
13 {
14 "id": "B",
15 "route": [
16 "hub",
17 "tunnel",
18 "customer"
19 ],
20 "proxy_score": 0.82,
21 "exact_check": true
22 }
23 ],
24 "selected": "B"
25}When the answer is a comparable string (a number, a label, a parse), you can sometimes skip a learned reward model. Self-consistency samples several traces and returns the most frequent answer [8]. It can improve selected reasoning benchmarks when correct answers concentrate more than incorrect ones, but majority vote is not verification: correlated wrong answers still win. Prefer an exact checker whenever available; use voting when endpoints are comparable and empirical evaluation supports it.
Beam search is the classic decoding baseline: keep the top- most likely partial continuations at each step. It is easy to batch, but on tasks requiring distinct hypotheses its high-probability branches can become near-duplicates rather than useful alternatives. Evaluate it as a baseline or pruning mechanism against sampling and search on the task distribution you actually serve.
Some problems force you to backtrack. Imagine a warehouse robot that must pack orders into vans while respecting weight limits, fragile-item rules, and delivery time windows. If the robot assigns the heavy pallet to Van A and later discovers Van A can't cross the bridge, it needs to undo that decision and try Van C instead.
Tree of Thoughts (ToT), proposed by Yao et al. [9], treats reasoning as a tree search. At each step, the model generates multiple possible next actions, scores them, and explores the most promising branches while pruning the weak ones. It's similar to playing chess: you consider several moves, evaluate the board, explore the promising lines, and abandon the bad ones.
The diagram shows a search tree for our warehouse problem. From the initial state, three van assignments are possible. Thought B is pruned immediately because it violates a weight limit. Thought A looks promising, so it expands into two sub-assignments. A.2 turns out to break a time-window constraint and gets pruned. A.1 and C.1 both reach valid solutions.
To implement this, we maintain a search beam that keeps only candidate branches active. At each depth, the system generates multiple next steps, scores them with a task-specific evaluator, and prunes paths that fall below its threshold. That evaluator may be an exact constraint checker, a learned reward model, or a combination:
1from dataclasses import dataclass
2import json
3
4EXPANSIONS = {
5 "start": ["Assign pallet to Van A", "Assign pallet to Van B", "Assign pallet to Van C"],
6 "Assign pallet to Van A": ["A.1 move fragile boxes last", "A.2 use bridge route"],
7 "Assign pallet to Van C": ["C.1 keep express items together", "C.2 split cable to late route"],
8}
9
10SCORES = {
11 "Assign pallet to Van A": 0.74,
12 "Assign pallet to Van B": 0.12,
13 "Assign pallet to Van C": 0.81,
14 "A.1 move fragile boxes last": 0.70,
15 "A.2 use bridge route": 0.18,
16 "C.1 keep express items together": 0.96,
17 "C.2 split cable to late route": 0.39,
18}
19
20@dataclass
21class ThoughtNode:
22 content: str
23 depth: int
24 score: float
25 parent: "ThoughtNode | None" = None
26
27 def trace(self) -> str:
28 if self.parent is None:
29 return self.content
30 return f"{self.parent.trace()}\n{self.content}"
31
32class TreeOfThoughts:
33 def __init__(self, beam_width: int = 2, min_score: float = 0.35):
34 self.beam_width = beam_width
35 self.min_score = min_score
36
37 def search(self, max_depth: int = 2) -> dict[str, object]:
38 root = ThoughtNode("start", depth=0, score=1.0)
39 beam = [root]
40 rounds = []
41
42 for depth in range(max_depth):
43 candidates = []
44 for node in beam:
45 for thought in EXPANSIONS.get(node.content, []):
46 candidates.append(ThoughtNode(thought, depth + 1, SCORES[thought], node))
47
48 survivors = [node for node in candidates if node.score >= self.min_score]
49 survivors.sort(key=lambda node: node.score, reverse=True)
50 beam = survivors[:self.beam_width]
51 rounds.append({
52 "depth": depth + 1,
53 "kept": [node.content for node in beam],
54 "pruned": [
55 node.content for node in candidates
56 if node.score < self.min_score
57 ],
58 })
59
60 best = max(beam, key=lambda node: node.score)
61 return {"winner": best.trace().split("\n"), "rounds": rounds}
62
63print(json.dumps(TreeOfThoughts().search(), indent=2))1{
2 "winner": [
3 "start",
4 "Assign pallet to Van C",
5 "C.1 keep express items together"
6 ],
7 "rounds": [
8 {
9 "depth": 1,
10 "kept": [
11 "Assign pallet to Van C",
12 "Assign pallet to Van A"
13 ],
14 "pruned": [
15 "Assign pallet to Van B"
16 ]
17 },
18 {
19 "depth": 2,
20 "kept": [
21 "C.1 keep express items together",
22 "A.1 move fragile boxes last"
23 ],
24 "pruned": [
25 "A.2 use bridge route"
26 ]
27 }
28 ]
29}This is a beam-search realization of Tree of Thoughts: simple to batch, simple to reason about, and often the first version teams ship. The algorithm starts with the problem, expands each active node into branching_factor children, evaluates every child, and keeps only the beam_width highest-ranked candidates for the next round.
Tree of Thoughts is a general framework. If you need adaptive revisits instead of level-by-level pruning, you can swap the beam for Monte Carlo Tree Search. In LLM settings, a common variant is PUCT (Predictor + Upper Confidence bounds applied to Trees), the AlphaZero-style selection rule that combines a value estimate with the model's next-step prior [10]:
Where:
The first term exploits high-value branches. The second term favors branches that either have a strong model prior or have not been explored much yet. Beam search is usually easier to run on GPUs because you can batch every frontier expansion together. MCTS can become attractive when evaluator calls are expensive and you want to revisit promising nodes instead of expanding every branch evenly.
Generating many candidate plans is useless if you cannot identify eligible results. Use exact checkers for objective constraints wherever possible. When exact checking is unavailable and a learned verifier scores candidate traces, two common forms are:
An Outcome Reward Model (ORM) scores only the final answer. It's like an audit that checks whether the final shipment plan is valid without inspecting the intermediate routing steps. An ORM is cheap to train because it needs only one label per chain, but it's sparse: if the answer is wrong, you get no signal about where the reasoning broke down.
A Process Reward Model (PRM) scores individual reasoning steps, not just the final answer [11]. When step quality is labelable and the PRM is validated on the served domain, it can help prune low-scored branches before paying for complete traces. It remains a learned proxy, not proof that a step is correct.
Think of the PRM like a fulfillment auditor who estimates risk at each routing step. When a candidate assigns the heavy pallet to Van A, a validated PRM may give that step a low score. If an exact bridge-rule checker is available, use it as the authoritative rejection; otherwise the score is a pruning signal whose false positives and negatives need evaluation.
A PRM is typically a language model with a scalar reward head, trained to predict the correctness of a step given the prompt and the previous steps . Lightman et al. released PRM800K, a dataset with 800,000 step-level human feedback labels, because step supervision is the bottleneck for building these verifiers well [11].
We take the hidden representation of the current step (), map it through a learned linear layer (), and squash the result with a sigmoid (). That gives a score between 0.0 and 1.0; treating it as a calibrated probability requires separate calibration and held-out validation.
Training data is collected via three main paths:
Here's a practical scoring interface. In production the scoring call is a model with a reward head; this runnable version uses deterministic rules so you can see how weak steps pull down the whole trace:
1from dataclasses import asdict, dataclass
2import json
3
4@dataclass
5class StepScore:
6 step: str
7 score: float
8
9class ProcessRewardModel:
10 def score_step(self, step: str) -> float:
11 lowered = step.lower()
12 if "bridge restriction" in lowered or "over weight" in lowered:
13 return 0.15
14 if "checked" in lowered or "within limit" in lowered:
15 return 0.92
16 return 0.65
17
18 def score_full_trace(self, steps: list[str]) -> dict[str, object]:
19 scores = [StepScore(step, self.score_step(step)) for step in steps]
20 trace_score = min(score.score for score in scores) if scores else 0.0
21 return {
22 "step_scores": [asdict(score) for score in scores],
23 "trace_score": trace_score,
24 "decision": "prune" if trace_score < 0.35 else "keep",
25 }
26
27trace = [
28 "Checked package weights against each van limit.",
29 "Assign pallet to Van A despite bridge restriction.",
30 "Promise same-day delivery for every item.",
31]
32
33print(json.dumps(ProcessRewardModel().score_full_trace(trace), indent=2))1{
2 "step_scores": [
3 {
4 "step": "Checked package weights against each van limit.",
5 "score": 0.92
6 },
7 {
8 "step": "Assign pallet to Van A despite bridge restriction.",
9 "score": 0.15
10 },
11 {
12 "step": "Promise same-day delivery for every item.",
13 "score": 0.65
14 }
15 ],
16 "trace_score": 0.15,
17 "decision": "prune"
18}The score_full_trace method uses the minimum score as a conservative pruning heuristic. In production, set thresholds against held-out task outcomes and prefer exact checks for constraints such as weight limits or policy eligibility.
| Feature | Outcome Reward Model (ORM) | Process Reward Model (PRM) |
|---|---|---|
| Scoring | Only scores the final answer | Scores every intermediate step |
| Feedback signal | Sparse (1 signal per chain) | Dense (1 signal per step) |
| Search utility | Only helps select final output | Guides search; enables early pruning |
| Training cost | Lower (less labeling effort) | Higher (requires step-level labels) |
Lightman et al. (2023) [11] found that process supervision outperformed outcome supervision in their challenging math setting. Carrying that result into routing, support policy, or code tasks requires domain-specific data and validation.
Modern reasoning models don't rely only on external search controllers. DeepSeek-R1 shows a complementary path where reinforcement learning makes the base policy itself better at producing long, structured reasoning traces [2].
Group Relative Policy Optimization (GRPO) is one important algorithm in that line of work, and DeepSeek-R1 made it especially visible in open-model practice [2]. Compared with PPO (Proximal Policy Optimization), GRPO removes the need for a separate critic model:
This cuts RL training overhead because you don't need a separate critic network. It can make the base model more sample-efficient at test time, but it doesn't remove the need for routing, budget limits, or verifier design in production.
So far we've looked at individual techniques. A real production system wires them into a pipeline that classifies the incoming problem, picks a strategy, runs the search, verifies the results, and streams progress back to the user.
Before drawing boxes, let's list the concrete responsibilities:
On the non-functional side, declare measurable objectives: concurrency for a target workload, quality lift over single pass on representative verifiable tasks, p95 latency, and per-request cost by route. A 3x or 10x budget is an experiment parameter, not a production guarantee.
The diagram below shows the full pipeline. Requests enter through a gateway, get classified by a difficulty router, and then flow through either a fast path, a medium path (Best-of-N), or a deep path (Tree of Thoughts). Branching paths should use the best available evaluator and reuse key-value (KV) cache prefixes only where the serving runtime safely supports compatible prefix sharing.
The request flow below shows the sequence in more detail:
Notice the loop in the request flow. The solver generates candidates, reuses shared KV prefixes, scores with the verifier, and streams progress events back. This loop is where the extra test-time compute lives.
Not every problem needs extended thinking. Simple factual queries waste compute if treated as reasoning tasks. The router evaluates input complexity and assigns a strategy plus token budget. Here's a minimal implementation:
1from dataclasses import asdict, dataclass
2import json
3
4@dataclass
5class Route:
6 difficulty: str
7 strategy: str
8 max_tokens: int
9 budget_multiplier: int
10
11class DifficultyRouter:
12 CONFIGS = {
13 "trivial": Route("trivial", "single_pass", 500, 1),
14 "easy": Route("easy", "single_path", 2_000, 2),
15 "medium": Route("medium", "best_of_3", 4_000, 5),
16 "hard": Route("hard", "tree_search", 8_000, 10),
17 "extreme": Route("extreme", "tree_search", 16_000, 20),
18 }
19
20 def classify(self, problem: str) -> Route:
21 text = problem.lower()
22 if "prove" in text or "counterexample" in text or "cyclic graph" in text:
23 return self.CONFIGS["hard"]
24 if "multiple constraints" in text or "debug" in text:
25 return self.CONFIGS["medium"]
26 if "policy" in text or "return" in text:
27 return self.CONFIGS["easy"]
28 return self.CONFIGS["trivial"]
29
30router = DifficultyRouter()
31problems = [
32 "What is your return policy?",
33 "Debug this routing function with multiple constraints.",
34 "Find a counterexample for this cyclic graph algorithm.",
35]
36
37print(json.dumps([asdict(router.classify(problem)) for problem in problems], indent=2))1[
2 {
3 "difficulty": "easy",
4 "strategy": "single_path",
5 "max_tokens": 2000,
6 "budget_multiplier": 2
7 },
8 {
9 "difficulty": "medium",
10 "strategy": "best_of_3",
11 "max_tokens": 4000,
12 "budget_multiplier": 5
13 },
14 {
15 "difficulty": "hard",
16 "strategy": "tree_search",
17 "max_tokens": 8000,
18 "budget_multiplier": 10
19 }
20]The router should be cheap relative to solving work. Its job is to avoid expensive search on requests that do not benefit while escalating tasks whose risk and verifiability justify more work. Misclassification is not harmless: under-routing can release an incorrect answer, while over-routing can increase cost, latency, and exposure to verifier overoptimization.
The key insight from Snell et al. [3] is that no single test-time strategy dominates every prompt. The best move depends on problem difficulty and on whether the base model already has a plausible path to the correct answer. A production scheduler should start cheap and escalate only when an exact check or a validated evaluation signal says the first attempt is weak:
1import json
2
3class ComputeOptimalScheduler:
4 def choose_strategy(
5 self, evaluation_score: float, verifiable: bool, high_impact: bool = False
6 ) -> dict[str, object]:
7 if high_impact and not verifiable:
8 return {"strategy": "tool_or_human_review", "reason": "high-impact claim lacks verifier"}
9 if not verifiable:
10 return {"strategy": "single_pass", "reason": "creative output; no objective ranking"}
11 if evaluation_score >= 0.90:
12 return {"strategy": "single_path", "reason": "first trace passed verifier"}
13 if evaluation_score >= 0.60:
14 return {"strategy": "best_of_3", "reason": "plausible trace needs reranking"}
15 return {"strategy": "tree_search", "reason": "weak trace needs backtracking"}
16
17requests = [
18 {"task": "Return-policy FAQ", "evaluation_score": 0.94, "verifiable": True},
19 {"task": "Ambiguous route plan", "evaluation_score": 0.72, "verifiable": True},
20 {"task": "Cyclic graph bug", "evaluation_score": 0.41, "verifiable": True},
21 {"task": "Brand voice poem", "evaluation_score": 0.52, "verifiable": False},
22 {"task": "Approve refund without policy record", "evaluation_score": 0.82, "verifiable": False, "high_impact": True},
23]
24
25scheduler = ComputeOptimalScheduler()
26for request in requests:
27 request.update(scheduler.choose_strategy(request["evaluation_score"], request["verifiable"], request.get("high_impact", False)))
28
29print(json.dumps(requests, indent=2))1[
2 {
3 "task": "Return-policy FAQ",
4 "evaluation_score": 0.94,
5 "verifiable": true,
6 "strategy": "single_path",
7 "reason": "first trace passed verifier"
8 },
9 {
10 "task": "Ambiguous route plan",
11 "evaluation_score": 0.72,
12 "verifiable": true,
13 "strategy": "best_of_3",
14 "reason": "plausible trace needs reranking"
15 },
16 {
17 "task": "Cyclic graph bug",
18 "evaluation_score": 0.41,
19 "verifiable": true,
20 "strategy": "tree_search",
21 "reason": "weak trace needs backtracking"
22 },
23 {
24 "task": "Brand voice poem",
25 "evaluation_score": 0.52,
26 "verifiable": false,
27 "strategy": "single_pass",
28 "reason": "creative output; no objective ranking"
29 },
30 {
31 "task": "Approve refund without policy record",
32 "evaluation_score": 0.82,
33 "verifiable": false,
34 "high_impact": true,
35 "strategy": "tool_or_human_review",
36 "reason": "high-impact claim lacks verifier"
37 }
38]This creates an adaptive system that spends small budgets on easy, checkable problems and escalates only when expected quality lift is worth extra latency and KV-cache pressure. It does not treat "unverifiable" as permission to make high-impact decisions.
1import json
2
3def stop_reason(
4 tokens_used: int,
5 token_cap: int,
6 elapsed_ms: int,
7 wall_clock_cap_ms: int,
8 repeated_state: bool,
9 score_history: list[float],
10) -> str:
11 if repeated_state:
12 return "repeated_state"
13 if tokens_used >= token_cap:
14 return "token_cap"
15 if elapsed_ms >= wall_clock_cap_ms:
16 return "wall_clock_cap"
17 if len(score_history) >= 3 and max(score_history[-3:]) <= max(score_history[:-3], default=-1):
18 return "no_recent_score_improvement"
19 return "continue"
20
21branches = [
22 {"tokens_used": 1200, "token_cap": 4000, "elapsed_ms": 800, "wall_clock_cap_ms": 3000,
23 "repeated_state": True, "score_history": [0.54, 0.55]},
24 {"tokens_used": 2500, "token_cap": 4000, "elapsed_ms": 1800, "wall_clock_cap_ms": 3000,
25 "repeated_state": False, "score_history": [0.72, 0.70, 0.71, 0.69]},
26 {"tokens_used": 900, "token_cap": 4000, "elapsed_ms": 500, "wall_clock_cap_ms": 3000,
27 "repeated_state": False, "score_history": [0.40, 0.55]},
28]
29
30print(json.dumps([stop_reason(**branch) for branch in branches], indent=2))1[
2 "repeated_state",
3 "no_recent_score_improvement",
4 "continue"
5]Long reasoning traces can stress memory as well as floating-point operations (FLOPs). Every live generated token adds a new key and value vector for every transformer layer, so key-value (KV) cache pressure grows linearly with live context length and multiplies across unshared branches.
A useful back-of-the-envelope formula is:
Where is the number of layers, is the number of KV heads (not always the full attention-head count if the model uses grouped-query attention (GQA)), and is the head dimension. For an 80-layer model with 8 KV heads, head dimension 128, and bfloat16 (BF16) activations:
At 16k live tokens, that's roughly 5.0 GiB of KV cache for one branch. Without any prefix reuse, a tree search with eight active 16k-token branches would need about 40 GiB of KV cache before you count model weights or temporary activations.
Best-of-N and tree search create many branches that share the same prompt and often the same early reasoning prefix. If you duplicate that prefix for every branch, memory usage grows with the number and length of branches. Systems like vLLM use PagedAttention to manage KV memory in blocks, which cuts fragmentation and makes long contexts much easier to serve [12].
PagedAttention and radix-style prefix caches solve related but different problems. PagedAttention makes allocation cheaper. RadixAttention-style caches, as used in SGLang, index shared prefixes in a tree so sibling branches can reuse the same cached prefix instead of copying it branch by branch [13]. In a reasoning agent you usually want both: block-level memory management plus prefix-aware reuse.
That changes the capacity math materially. If eight branches share a 12k-token prefix and each branch adds only a 4k-token unique suffix, the effective footprint is closer to one 12k prefix plus eight 4k suffixes, or about 14 GiB, not the 40 GiB worst case above. The exact savings depend on how quickly branches diverge and on whether your runtime can share prefixes at token-level or block-level granularity.
This is one reason test-time search is an infrastructure problem, not just a prompting trick. A search policy may be sound on paper while its branch width, trace length, or lack of compatible prefix reuse exhausts the serving memory budget before useful search completes.
Extended reasoning changes the latency contract. If the product emits no status until the answer is ready, a long hidden search phase inflates time to first user-visible token (TTFT). If it emits a quick progress event, TTFT can look healthy while time-to-final-answer and total budget still fail. Measure both.
That is why production systems can stream progress events instead of raw scratchpads. Safe events include selected strategy, budget usage, branch counts, or summaries of externally confirmed checks. Treat learned verifier scores as internal diagnostics unless they are calibrated and presented with an appropriate contract.
The table below uses the 320 KB/token estimate to show how quickly memory usage grows as you add longer traces or more live branches:
| Workload | Live Branches | Context per Branch | Approx KV Cache | Operational Implication |
|---|---|---|---|---|
| Single CoT | 1 | 8k tokens | 2.5 GiB | Usually easy to colocate with the base model |
| Best-of-4 | 4 | 8k tokens | 10.0 GiB | Parallel sampling is practical, but batching headroom shrinks |
| Tree Search | 8 | 16k tokens | 40.0 GiB | Usually needs dedicated search workers and aggressive prefix reuse |
Those are worst-case branch-multiplication numbers. Strong prefix reuse can cut the effective KV footprint substantially.
Users need feedback while the system is spending extra inference budget, but that does not mean you should dump raw internal scratchpad text or expose an uncalibrated reward score as confidence. A safer pattern is to stream strategy selection, budget usage, and externally grounded summaries while the solver keeps search details private:
1from dataclasses import asdict, dataclass
2import json
3
4@dataclass
5class StreamEvent:
6 type: str
7 content: str
8 score: float | None = None
9
10class ReasoningSolver:
11 def solve_stream(self) -> list[StreamEvent]:
12 return [
13 StreamEvent("progress", "checked carrier status", 0.82),
14 StreamEvent("progress", "verified return policy", 0.91),
15 StreamEvent("final", "Return allowed; remaining items arrive tomorrow.", 0.88),
16 ]
17
18def stream_reasoning(config: dict[str, object], solver: ReasoningSolver) -> list[dict[str, object]]:
19 events = [
20 {"type": "status", "content": "Analyzing problem difficulty..."},
21 {
22 "type": "config",
23 "content": f"Using {config['strategy']} with budget {config['max_tokens']} tokens",
24 },
25 ]
26
27 for event in solver.solve_stream():
28 if event.type == "progress":
29 events.append(asdict(event))
30 elif event.type == "final":
31 events.append({
32 "type": "final_answer",
33 "content": event.content,
34 })
35 return events
36
37config = {"strategy": "best_of_3", "max_tokens": 4_000}
38print(json.dumps(stream_reasoning(config, ReasoningSolver()), indent=2))1[
2 {
3 "type": "status",
4 "content": "Analyzing problem difficulty..."
5 },
6 {
7 "type": "config",
8 "content": "Using best_of_3 with budget 4000 tokens"
9 },
10 {
11 "type": "progress",
12 "content": "checked carrier status",
13 "score": 0.82
14 },
15 {
16 "type": "progress",
17 "content": "verified return policy",
18 "score": 0.91
19 },
20 {
21 "type": "final_answer",
22 "content": "Return allowed; remaining items arrive tomorrow."
23 }
24]Not every problem benefits from test-time scaling. Applying tree search or Best-of-N to the wrong use case results in unnecessary latency and wasted resources. You should bypass extended reasoning for:
The guiding heuristic is simple: allocate substantial test-time compute when evaluation shows gain for the request class and evidence or verification can keep the answer trustworthy. Creative revision may still use a small candidate budget, but it is not proof-oriented reasoning.
Longer reasoning is not monotonically better. Empirical 2025 work finds that accuracy often plateaus and can decline once a trace runs past a problem-specific sweet spot: extra steps add error accumulation, and a model can talk itself out of a correct answer it already found [14]. This overthinking failure mode is also a cost and latency problem, because a model can burn thousands of reasoning tokens on a trivial question. The defenses are the same controls you already have: a router that keeps trivial tasks on the fast path, a capped reasoning_effort or token budget, and a stop rule that exits once exact results are sufficient or validated evaluation stops improving. Treat reasoning budget as a tuned hyperparameter, not a dial you turn to maximum.
It's tempting to think that deploying a bigger frontier model should dominate every reasoning workload. In practice, the trade-off is more subtle:
| Approach | Cost Pattern | Control |
|---|---|---|
| Bigger model (Training / model switch) | Higher steady-state inference cost | Fixed after deployment; every request pays for the larger model |
| Test-time compute (Inference) | Higher per-request latency and token cost | Dynamic per problem; easy to scale up or down |
Snell et al. make the practical point: on prompts where a smaller model already has some chance of success, extra inference-time compute can beat simply switching to a much larger model at matched compute [3]. But if the base model has near-zero task competence, reranking or sampling is unlikely to manufacture the missing capability. Test-time compute amplifies evaluated candidate quality; it does not supply missing knowledge or tools.
Symptom: Best-of-N produces more varied traces, but quality barely improves and the chosen winner still feels random. Cause: Diversity alone is not enough. Without a verifier that tracks correctness, more samples mostly add noise. Fix: Increase temperature only when you also have a PRM, ORM, or exact checker that can reliably separate strong traces from fluent wrong ones.
Symptom: Easy requests suddenly become slower and more expensive even though accuracy barely moves. Cause: Search budget was applied to every request instead of only the hard, verifiable ones. Fix: Route by difficulty. Keep trivial and easy requests on the fast path, then escalate only when the first trace scores weakly.
Symptom: The system finds a decent first answer, then keeps exploring until it talks itself into a worse one. Cause: Overthinking. Past a task-specific sweet spot, extra search compounds errors instead of fixing them. Fix: Stop when exact checks pass or a validated evaluator plateaus, cap reasoning effort, and keep the best eligible fallback that can survive later noisy branches.
Symptom: Cheap FAQ traffic burns deep-search budget while hard debugging tasks run out of depth and tokens too early. Cause: Compute was allocated uniformly instead of matching task difficulty and business value. Fix: Separate routing, branch limits, token caps, and wall-clock caps by request class. Budget policy is part of product design, not only model tuning.
Symptom: Streams become long, unstable, and hard to parse, while downstream tools cannot tell what is final. Cause: Internal scratchpad text was exposed instead of a stable progress interface. Fix: Stream strategy, budget usage, and short evidence-backed summaries. Keep hidden search traces and uncalibrated evaluator scores inside the runtime.
Symptom: Prompt tweaks plateau, but the team still has no reliable gain on hard tasks. Cause: Model capability, verifier quality, cache reuse, and stopping logic were collapsed into one prompt-design problem. Fix: Treat base model, search controller, verifier, and serving runtime as separate levers. Improve whichever layer is bottlenecking quality or cost.
Symptom: Branches revisit the same semantic state until they exhaust the token budget or hit wall-clock limits. Cause: The controller has no hard policy for repeated states, budget exhaustion, or diminishing verifier returns. Fix: Enforce token, depth, wall-clock, and repeated-state limits per branch, and return the highest-ranked surviving trace that satisfies required checks when the budget ends.
Use these checkpoints to test your routing instinct. Decide first, then compare against the answer sketch.
Use this checklist to judge whether your final design is defensible:
You now understand when allocating additional inference work can outperform a fixed single attempt, and when it cannot. You can build a difficulty router, run Best-of-N sampling, explore a Tree of Thoughts, and score partial traces with a Process Reward Model while still preferring exact evidence when available. You also know the serving constraints (KV cache growth, prefix sharing, and final-answer latency) that determine whether a search design is affordable.
This is why the roadmap ends here. Earlier chapters taught the serving stack, retrieval layer, agent loop, evaluation harness, and safety controls one piece at a time. This capstone asks you to connect them: a reasoning agent is useful only when search policy, evidence contract, evaluator quality, KV-cache budget, streaming behavior, and fallback plan all work together.
Key takeaways:
This capstone closes the system-design arc by asking you to connect the algorithm, verifier, cache budget, streaming behavior, and fallback plan into one defensible system design. Your final deliverable is a proof-of-skill checklist.
o1 Preview Model
OpenAI · 2024
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI · 2025
Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.
Snell, C., et al. · 2024 · arXiv preprint
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Wei, J., et al. · 2022 · NeurIPS
Reasoning models
OpenAI · 2026
GPT-5.5 Model
OpenAI · 2026
Scaling Laws for Reward Model Overoptimization
Gao, L., Schulman, J., & Hilton, J. · 2023
Self-Consistency Improves Chain of Thought Reasoning in Language Models.
Wang, X., et al. · 2023 · ICLR 2023
Tree of Thoughts: Deliberate Problem Solving with Large Language Models.
Yao, S., et al. · 2023 · NeurIPS
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.
Silver, D., et al. · 2017 · arXiv preprint
Let's Verify Step by Step.
Lightman, H., et al. · 2023 · ICLR
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, W., et al. · 2023 · SOSP 2023
SGLang: Efficient Execution of Structured Language Model Programs
Zheng, L., Yin, L., Xie, Z., et al. · 2024 · NeurIPS 2024
Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models
Ghosal, S. S., Chakraborty, S., Reddy, S., et al. · 2025