Reasoning & Test-Time Compute

Design a production reasoning agent that routes by difficulty, evaluates candidate work, requires evidence before release, and survives serving bottlenecks like key-value (KV) cache growth.

Real-time voice agents optimize for split-second streaming, interruption, and media state. This final capstone studies the opposite tradeoff: when a task is hard enough, the system may spend extra inference compute before it releases an answer or irreversible effect.

A reasoning agent can spend extra test-time compute on planning, checking, and tool use before its final answer. This design chapter explains when that checked work yields measured value and how to bound it with evidence contracts, evaluators, and budgets.

A release owner asks an AI deployment platform: "The canary is at 10%, smoke tests passed, but one latency panel spiked after a config flag changed. Should we roll back, hold, or continue the rollout?"

A human SRE wouldn't shout the first guess that comes to mind. They would check the rollout state, verify the smoke-test run, read the latency panel, inspect the config diff, and only then recommend hold, promote, or rollback. A standard large language model (LLM) generates each continuation left to right. It can't rewrite tokens already emitted, though later tokens can state a correction. External verification still requires a controller to call tools, request a revision, or generate competing attempts. On tricky problems, the product question is whether that additional work is justified and verifiable.

Reasoning models such as OpenAI's earlier o-series, including o1 [1] (o1 Preview Model, https://developers.openai.com/api/docs/models/o1-preview), and DeepSeek-R1 [2] (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948) spend additional compute before returning an answer. Don't conflate that native internal reasoning with a product controller that explicitly samples candidates, calls tools, scores partial states, or performs tree search: those are separable designs with different evidence and serving contracts. The controller below routes easy questions to bounded answers and hard, verifiable questions to deeper work.

Fast thinking and slow thinking

Fast/slow-thinking terminology gives us a useful analogy. Fast thinking is automatic and intuitive. When you read "2 + 2 =", you instantly know the answer. Slow thinking is deliberate and sequential. When you multiply 17 by 34 in your head, you work through steps, catch yourself if you miscarry a digit, and restart if the intermediate result looks wrong. In a product, however, enforceable controls are external candidate state, tool evidence, evaluator behavior, and release gates, not a promise about hidden mental steps.

Single-pass decoding looks more like fast thinking. At every step the model samples the next token conditioned on everything so far; without a controller or tool it can emit a plausible answer without validating it. A reasoning-enabled product can instead allocate a bounded check, use an exact tool when one exists, and release only a result supported by that check.

The useful shift is to treat some inference requests as a budgeted search and verification problem, rather than one left-to-right decoding loop. A controller may explore candidate actions, score partial solutions, consult external evidence, and spend extra compute only when evaluation justifies it.

The extra compute is called test-time compute: the floating-point operations (FLOPs) and tokens you spend during inference, after training is done. In a FLOPs-matched study, Snell et al. show that a smaller model plus the right inference-time strategy can beat a 14x larger model on easy-to-intermediate prompts where the smaller model already has a non-trivial chance of succeeding [3] (Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters, https://arxiv.org/abs/2408.03314). The same paper warns that the gain reverses on the hardest prompts and when you serve many inference tokens per query, so test-time compute isn't a free substitute for a stronger base model. The lever you control is no longer just model size; it's measured inference budget.

Shared task list where fixed budget stays flat, escalation grows after weak first traces, and adaptive routing allocates deepest search only to hardest requests. — The point isn't bigger budgets everywhere. It's sharper routing, so easy requests stay cheap while hard ones earn deeper search.

The illustration above shows three budget strategies. A fixed budget spends the same tokens on every request. An escalation rule adds more compute only after a weak first attempt. An adaptive budget uses a difficulty router to match the strategy to the task before the expensive search begins. The remaining sections build that adaptive system from the ground up.
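The three strategies can be compared with simple arithmetic. The sketch below is illustrative only: the workload mix, the token budgets, and the assumption that a cheap first attempt is weak exactly on non-easy requests are all hypothetical values chosen to make the comparison visible, not measurements.

```python
# Hypothetical workload: difficulty labels and token budgets are assumptions.
requests = ["easy", "easy", "easy", "medium", "hard"]

FIXED_BUDGET = 4_000                       # same spend for every request
BASE_BUDGET, ESCALATED_BUDGET = 1_000, 6_000
ADAPTIVE_BUDGET = {"easy": 1_000, "medium": 4_000, "hard": 8_000}


def fixed(difficulty: str) -> int:
    return FIXED_BUDGET


def escalation(difficulty: str) -> int:
    # Assumption: the cheap first attempt is weak only on non-easy requests,
    # which then pay for the first attempt plus an escalated retry.
    first_attempt_weak = difficulty != "easy"
    return BASE_BUDGET + (ESCALATED_BUDGET if first_attempt_weak else 0)


def adaptive(difficulty: str) -> int:
    # A router picks the budget before any expensive work starts.
    return ADAPTIVE_BUDGET[difficulty]


strategies = [("fixed", fixed), ("escalation", escalation), ("adaptive", adaptive)]
totals = {name: sum(fn(d) for d in requests) for name, fn in strategies}
print(totals)  # {'fixed': 20000, 'escalation': 17000, 'adaptive': 15000}
```

Under these made-up numbers, adaptive routing spends the least overall while still giving the hard request the largest budget. The ordering depends entirely on the workload mix and on how accurate the router is.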

The simplest form of extra compute: one deliberate path

Start with the cheapest inference control before building a tree-search engine: allocate one candidate a bounded reasoning budget. On supported reasoning APIs, reasoning.effort is the budget control. A structured response format is only the outward interface for checked milestones and a final answer; XML doesn't make the model think longer. Classic Chain of Thought prompts asked a model to "think step by step" [4] (Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/abs/2201.11903), but production controls should separate hidden reasoning budget from evidence the application can validate.

On reasoning-trained APIs, direct task instructions are generally preferred over requests to reveal step-by-step thinking, because supported models already perform internal reasoning [5] (Reasoning models, https://developers.openai.com/api/docs/guides/reasoning). Your job is to request only outward artifacts the application needs: tool evidence, concise checked milestones, and a final answer. Hidden scratchpad text isn't a proof or an audit log.

One common budget knob on OpenAI reasoning models is a reasoning.effort setting [6] (GPT-5.5 Model, https://developers.openai.com/api/docs/models/gpt-5.5). Supported values vary by model family, but the operational lesson is the same. Lower effort favors latency and fewer reasoning tokens, while higher effort permits more internal reasoning before answering. This is the cheapest form of test-time compute control: raise effort only when evals show a measurable quality gain that justifies extra latency and cost, and keep the lowest useful setting for fast, deterministic tasks. Treat it as the single-path equivalent of the difficulty router you build below.
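The "raise effort only when evals justify it" rule can be written as a small selection function. This is a sketch under stated assumptions: the level names, accuracy figures, latency figures, and the `min_gain` and `max_latency_s` thresholds are hypothetical, and `choose_effort` is an illustrative helper, not part of any vendor API.

```python
# Hypothetical eval results: (effort level, task accuracy, p95 latency in seconds).
# The level names and numbers are illustrative, not measurements.
EVAL_RESULTS = [
    ("low", 0.81, 1.2),
    ("medium", 0.88, 3.5),
    ("high", 0.89, 9.0),
]


def choose_effort(results, min_gain=0.03, max_latency_s=5.0):
    """Start at the lowest effort; step up only when the measured gain justifies it."""
    chosen_level, chosen_accuracy, _ = results[0]
    for level, accuracy, latency in results[1:]:
        if latency > max_latency_s:
            break  # higher levels are assumed to be slower still
        if accuracy - chosen_accuracy >= min_gain:
            chosen_level, chosen_accuracy = level, accuracy
    return chosen_level


print(choose_effort(EVAL_RESULTS))  # medium
```

With these numbers, moving from low to medium buys seven points of accuracy within the latency cap, while high adds one point at more than twice the latency, so the function stays at medium.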

Here's a lightweight parser for the interface you want the model to satisfy. It turns a compact XML response into visible checkpoints plus a final answer:

the-simplest-form-of-thinking-longer-one.py

from dataclasses import asdict, dataclass
import json
import xml.etree.ElementTree as ET

@dataclass
class ReasoningStep:
    kind: str
    content: str

@dataclass
class ReasoningTrace:
    steps: list[ReasoningStep]
    final_answer: str

class SinglePathReasoner:
    RESPONSE_TEMPLATE = """
    Answer using supplied tool observations only.

    Return XML with this structure:
    <response>
      <checkpoint>major milestone only</checkpoint>
      <checkpoint>major milestone only</checkpoint>
      <verification_summary>brief check against supplied evidence</verification_summary>
      <final_answer>final answer</final_answer>
    </response>

    Keep checkpoints concise. Do not dump a full scratchpad.
    """

    def parse(self, content: str) -> ReasoningTrace:
        root = ET.fromstring(content)
        checkpoints = [node.text.strip() for node in root.findall("checkpoint") if node.text]
        steps = [ReasoningStep("checkpoint", cp) for cp in checkpoints if cp]
        verification = self._extract_text(root, "verification_summary")
        if verification:
            steps.append(ReasoningStep("verification", verification))
        return ReasoningTrace(
            steps=steps,
            final_answer=self._extract_text(root, "final_answer"),
        )

    def _extract_text(self, root: ET.Element, tag: str) -> str:
        text = root.findtext(tag, default="")
        return text.strip()

model_response = """
<response>
  <checkpoint>Deploy API reports canary still at 10%.</checkpoint>
  <checkpoint>Smoke-test tool reports the release passed.</checkpoint>
  <verification_summary>Metrics tool shows latency above watch threshold but below rollback threshold.</verification_summary>
  <final_answer>Hold the canary at 10%, keep rollback armed, and inspect the config flag before promotion.</final_answer>
</response>
"""

trace = SinglePathReasoner().parse(model_response)
print(json.dumps({
    "steps": [asdict(step) for step in trace.steps],
    "final_answer": trace.final_answer,
}, indent=2))

Output

{
  "steps": [
    {
      "kind": "checkpoint",
      "content": "Deploy API reports canary still at 10%."
    },
    {
      "kind": "checkpoint",
      "content": "Smoke-test tool reports the release passed."
    },
    {
      "kind": "verification",
      "content": "Metrics tool shows latency above watch threshold but below rollback threshold."
    }
  ],
  "final_answer": "Hold the canary at 10%, keep rollback armed, and inspect the config flag before promotion."
}

Notice what this code does. The RESPONSE_TEMPLATE constrains output to major evidence-backed milestones. It doesn't make the claims true by itself; downstream code still needs to require tool results for external facts. Hidden token-by-token deliberation stays inside the model runtime, while the application receives a compact interface it can validate.

For our canary-release problem, a single-path trace might look like this:

Checkpoint 1: Deploy tool reports that the canary is still at 10%.
Checkpoint 2: Smoke-test tool reports that the release passed.
Verification summary: Metrics tool shows elevated latency below the rollback threshold.
Final answer: Hold the canary, keep rollback armed, and inspect the changed config flag before promotion.

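A single bounded path is the cheapest candidate strategy when tool evidence or an exact check can validate the result. The release gate below enforces that evidence requirement: it withholds the answer while any required tool result is missing. The next section covers the case where multiple plans remain plausible.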

releasing-only-evidence-backed-claims.py

import json

def release_answer(tool_results: dict[str, str | None]) -> dict[str, object]:
    missing = [key for key, value in tool_results.items() if value is None]
    if missing:
        return {"release": False, "missing_evidence": missing, "next_step": "request_tool_data"}
    return {
        "release": True,
        "answer": (
            f"Canary action: {tool_results['canary_policy']}. "
            f"Latency status: {tool_results['latency_status']}."
        ),
    }

attempts = [
    {"canary_policy": "hold at 10%", "latency_status": None},
    {"canary_policy": "hold at 10%", "latency_status": "below rollback threshold"},
]

print(json.dumps([release_answer(attempt) for attempt in attempts], indent=2))

Output

[
  {
    "release": false,
    "missing_evidence": [
      "latency_status"
    ],
    "next_step": "request_tool_data"
  },
  {
    "release": true,
    "answer": "Canary action: hold at 10%. Latency status: below rollback threshold."
  }
]

Asking several agents: Best-of-N sampling

Generate multiple candidate answers and rank them when a single path isn't reliable enough. For an incident-analysis workflow, that means several independent candidates enter a policy check or validated reviewer before the system selects a response.

This is called Best-of-N sampling. You run the single-path generator N times with a diversity-producing configuration, rank candidates with the most reliable evaluator available, and release an eligible winner. Temperature is one knob: low temperature tends to collapse diversity, while a higher value can produce different plans. Neither setting guarantees diversity or correctness.

Here's a deterministic version of the selector. In production, the candidates come from separate model calls; the selection rule stays the same:

asking-several-agents-best-of-n-sampling.py

from dataclasses import asdict, dataclass
import json
import math

@dataclass
class CandidateTrace:
    candidate_id: str
    answer: str
    verifier_score: float

class BestOfNSampler:
    @staticmethod
    def probability_at_least_one_success(base_success_rate: float, n: int) -> float:
        return 1 - math.pow(1 - base_success_rate, n)

    def choose(self, traces: list[CandidateTrace]) -> CandidateTrace:
        return max(traces, key=lambda trace: trace.verifier_score)

traces = [
    CandidateTrace("A", "Rollback immediately; latency panel is unknown.", 0.61),
    CandidateTrace("B", "Hold canary; tests passed and latency is below rollback threshold.", 0.93),
    CandidateTrace("C", "Promote to 100% because smoke tests passed.", 0.28),
    CandidateTrace("D", "Disable all traffic before checking metrics.", 0.44),
    CandidateTrace("E", "Ignore the spike because rollout is partial.", 0.19),
]

sampler = BestOfNSampler()
winner = sampler.choose(traces)

print(json.dumps({
    "n": len(traces),
    "base_success_rate": 0.30,
    "at_least_one_success": round(
        sampler.probability_at_least_one_success(0.30, len(traces)),
        3,
    ),
    "selected": asdict(winner),
}, indent=2))

Output

{
  "n": 5,
  "base_success_rate": 0.3,
  "at_least_one_success": 0.832,
  "selected": {
    "candidate_id": "B",
    "answer": "Hold canary; tests passed and latency is below rollback threshold.",
    "verifier_score": 0.93
  }
}

Why this helps: a worked example

Suppose the base model has a 30% chance of producing a correct mitigation plan for a complex canary incident. With one sample, your success rate is 30%. With five independent samples, the probability that at least one is correct is:

P(\text{at least one success}) = 1 - (1 - 0.30)^5 \approx 1 - 0.168 = 0.832

That's 83% for the event that a correct candidate exists under the independence assumption. It's not an 83% success rate for the returned answer: selection succeeds only when a checker or verifier recognizes the correct candidate. Extra sampling is wasted if candidates are correlated or the verifier can't identify the useful one.

Worse, a reward model is only a proxy for correctness, so pushing N too high can backfire. Gao et al. show that as you optimize harder against a learned reward model, true (gold) quality eventually drops even while the proxy score keeps climbing, a Goodhart-style effect they call reward-model overoptimization [7] (Scaling Laws for Reward Model Overoptimization, https://arxiv.org/abs/2210.10760). In practice you cap N, ensemble or regularize the reward model, and watch a held-out gold metric rather than trusting the verifier score alone.
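The gap between "a correct candidate exists" and "the released candidate is correct" can be made concrete with a small simulation. This is a sketch, not a model of any real verifier: it assumes candidates are independent, and that a hypothetical verifier labels each candidate independently and is right with probability `verifier_accuracy`.

```python
import random


def simulate(p_correct=0.30, n=5, verifier_accuracy=0.8, trials=20_000, seed=0):
    """Return (rate a correct candidate exists, rate the selected one is correct)."""
    rng = random.Random(seed)
    exists = selected_ok = 0
    for _ in range(trials):
        correct = [rng.random() < p_correct for _ in range(n)]
        # Assumption: the verifier's verdict matches the truth with
        # probability verifier_accuracy, independently per candidate.
        accepted = [c if rng.random() < verifier_accuracy else not c for c in correct]
        exists += any(correct)
        # Release the first accepted candidate; fall back to candidate 0.
        pick = accepted.index(True) if True in accepted else 0
        selected_ok += correct[pick]
    return exists / trials, selected_ok / trials


exists_rate, selected_rate = simulate()
print(f"correct candidate exists:   {exists_rate:.3f}")
print(f"selected candidate correct: {selected_rate:.3f}")
```

The first rate should land near the analytic 0.832. The second can never exceed it, and it falls as the assumed verifier gets noisier; with a perfect verifier the two rates coincide.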

When an exact checker exists, it should outrank a persuasive learned score. This selector chooses a tested dependency path rather than the answer with the highest style score:

prefer-exact-checks-over-proxy-scores.py

import json

candidates = [
    {"id": "A", "path": ["api", "legacy_adapter", "database"], "proxy_score": 0.96},
    {"id": "B", "path": ["api", "feature_flag", "database"], "proxy_score": 0.82},
]
blocked_edges = {("api", "legacy_adapter")}

def passes_constraints(path: list[str]) -> bool:
    edges = set(zip(path, path[1:]))
    return not bool(edges & blocked_edges)

checked = [
    {**candidate, "exact_check": passes_constraints(candidate["path"])}
    for candidate in candidates
]
eligible = [candidate for candidate in checked if candidate["exact_check"]]
winner = max(eligible, key=lambda candidate: candidate["proxy_score"])

print(json.dumps({"checked": checked, "selected": winner["id"]}, indent=2))

Output

{
  "checked": [
    {
      "id": "A",
      "path": [
        "api",
        "legacy_adapter",
        "database"
      ],
      "proxy_score": 0.96,
      "exact_check": false
    },
    {
      "id": "B",
      "path": [
        "api",
        "feature_flag",
        "database"
      ],
      "proxy_score": 0.82,
      "exact_check": true
    }
  ],
  "selected": "B"
}

Self-consistency: Best-of-N without a reward model

When the answer is a comparable string (a number, a label, a parse), you can sometimes skip a learned reward model. Self-consistency samples several traces and returns the most frequent answer [8] (Self-Consistency Improves Chain of Thought Reasoning in Language Models, https://arxiv.org/abs/2203.11171). It can improve selected reasoning benchmarks when correct answers concentrate more than incorrect ones, but majority vote isn't verification: correlated wrong answers still win. Prefer an exact checker whenever available; use voting when endpoints are comparable and empirical evaluation supports it.
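A minimal sketch of the voting step follows. The sampled answers are hypothetical, and the normalization (strip and lowercase) is an assumption about what makes two endpoints "comparable" for this task.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent normalized answer and its vote share."""
    normalized = [answer.strip().lower() for answer in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical sampled endpoints for the canary question.
mostly_right = ["hold", "Hold", "rollback", "hold ", "promote"]
correlated_wrong = ["promote", "promote", "promote", "hold", "hold"]

print(self_consistency(mostly_right))      # ('hold', 0.6)
print(self_consistency(correlated_wrong))  # ('promote', 0.6)
```

Both runs return a winner with the same 60% vote share, which is the point: the vote share measures agreement, not correctness, so a shared failure mode wins just as confidently as a shared correct answer.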

Where beam search fits

Beam search is the classic decoding baseline: keep the top-$k$ most likely partial continuations at each step. It's easy to batch, but on tasks requiring distinct hypotheses its high-probability branches can become near-duplicates rather than useful alternatives. Evaluate it as a baseline or pruning mechanism against sampling and search on the task distribution you serve.

When one path isn't enough: exploring a tree

Some problems force you to backtrack. Imagine a dependency resolver that must choose package versions while respecting API compatibility, Python version floors, and security advisories. If the resolver pins parser-v2 and later discovers that parser-v2 conflicts with the deployed runtime, it needs to undo that decision and try another branch.

Tree of Thoughts (ToT), proposed by Yao et al. [9] (Tree of Thoughts: Deliberate Problem Solving with Large Language Models, https://arxiv.org/abs/2305.10601), treats reasoning as a tree search. At each step, the model generates multiple possible next actions, scores them, and explores the most promising branches while pruning the weak ones. It's similar to playing chess: you consider several moves, evaluate the board, explore the promising lines, and abandon the bad ones.

Search tree where branch B is pruned early and only surviving branches expand into deeper candidates. — Prune weak partial states early, then spend deeper search only on branches that still look valid.

The dependency-resolution problem forms a search tree. From the initial state, three version choices are possible. Thought B is pruned immediately because it violates a runtime constraint. Thought A looks promising, so it expands into two sub-choices. A.2 turns out to break an API-compatibility constraint and gets pruned. A.1 and C.1 both reach valid solutions.

To implement this, we maintain a search beam that keeps only the highest-scoring candidate branches active. At each depth, the system generates multiple next steps, scores them with a task-specific evaluator, and prunes paths that fall below its threshold. That evaluator may be an exact constraint checker, a learned reward model, or a combination:

when-one-path-isnt-enough-exploring-a-tree.py

from dataclasses import dataclass
import json

EXPANSIONS = {
    "start": ["Pin parser v2", "Pin parser nightly", "Keep parser v1"],
    "Pin parser v2": ["A.1 add compatibility shim", "A.2 call removed API"],
    "Keep parser v1": ["C.1 patch sanitizer", "C.2 defer security fix"],
}

SCORES = {
    "Pin parser v2": 0.74,
    "Pin parser nightly": 0.12,
    "Keep parser v1": 0.81,
    "A.1 add compatibility shim": 0.70,
    "A.2 call removed API": 0.18,
    "C.1 patch sanitizer": 0.96,
    "C.2 defer security fix": 0.39,
}

@dataclass
class ThoughtNode:
    content: str
    depth: int
    score: float
    parent: "ThoughtNode | None" = None

    def trace(self) -> str:
        if self.parent is None:
            return self.content
        return f"{self.parent.trace()}\n{self.content}"

class TreeOfThoughts:
    def __init__(self, beam_width: int = 2, min_score: float = 0.35):
        self.beam_width = beam_width
        self.min_score = min_score

    def search(self, max_depth: int = 2) -> dict[str, object]:
        root = ThoughtNode("start", depth=0, score=1.0)
        beam = [root]
        rounds = []

        for depth in range(max_depth):
            candidates = []
            for node in beam:
                for thought in EXPANSIONS.get(node.content, []):
                    candidates.append(ThoughtNode(thought, depth + 1, SCORES[thought], node))

            survivors = [node for node in candidates if node.score >= self.min_score]
            survivors.sort(key=lambda node: node.score, reverse=True)
            beam = survivors[:self.beam_width]
            rounds.append({
                "depth": depth + 1,
                "kept": [node.content for node in beam],
                "pruned": [
                    node.content for node in candidates
                    if node.score < self.min_score
                ],
            })

        best = max(beam, key=lambda node: node.score)
        return {"winner": best.trace().split("\n"), "rounds": rounds}

print(json.dumps(TreeOfThoughts().search(), indent=2))

Output

{
  "winner": [
    "start",
    "Keep parser v1",
    "C.1 patch sanitizer"
  ],
  "rounds": [
    {
      "depth": 1,
      "kept": [
        "Keep parser v1",
        "Pin parser v2"
      ],
      "pruned": [
        "Pin parser nightly"
      ]
    },
    {
      "depth": 2,
      "kept": [
        "C.1 patch sanitizer",
        "A.1 add compatibility shim"
      ],
      "pruned": [
        "A.2 call removed API"
      ]
    }
  ]
}

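This is a beam-search realization of Tree of Thoughts: simple to batch, simple to reason about, and often the first version teams ship. The algorithm starts with the problem, expands each active node into its candidate children (a fixed EXPANSIONS table here; model-generated thoughts in production), evaluates every child, drops any child scoring below min_score, and keeps only the beam_width highest-ranked survivors for the next round. Note that C.2 clears the threshold with a score of 0.39 but loses its beam slot to higher-scoring candidates, so it appears in neither the kept nor the pruned list.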

When you want true MCTS

Tree of Thoughts is a general framework. If you need adaptive revisits instead of level-by-level pruning, you can swap the beam for Monte Carlo Tree Search. In LLM settings, a common variant is PUCT (Predictor + Upper Confidence bounds applied to Trees), the AlphaZero-style selection rule that combines a value estimate with the model's next-step prior [10] (Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, https://arxiv.org/abs/1712.01815):

\text{PUCT}(s, a) = Q(s, a) + c_{\text{puct}} \cdot P(s, a) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}

Where:

$Q(s, a)$ is the estimated downstream value for taking action $a$ in state $s$ (from rollouts, a value model, or another evaluator)
$P(s, a)$ is the policy prior for the next thought
$N(s)$ is the total number of visits to the parent node
$N(s, a)$ is the number of visits to the child node
$c_{\text{puct}}$ is the exploration constant that balances exploration vs exploitation

The first term exploits high-value branches. The second term favors branches that either have a strong model prior or have not been explored much yet. Beam search is usually easier to run on GPUs because you can batch every frontier expansion together. MCTS can become attractive when evaluator calls are expensive and you want to revisit promising nodes instead of expanding every branch evenly.
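The selection rule above translates directly into code. The sketch below is a minimal illustration: the `Child` class, the visit counts, the priors, and `c_puct = 1.5` are hypothetical, $N(s)$ is taken as the sum of child visits, and $Q$ defaults to 0.0 for unvisited children, which is one convention among several.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    action: str
    prior: float            # P(s, a): policy prior for this next thought
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # sum of evaluator values seen through this child

    @property
    def q(self) -> float:
        # Q(s, a): mean value; 0.0 before the first visit (an assumption).
        return self.value_sum / self.visits if self.visits else 0.0


def puct(child: Child, parent_visits: int, c_puct: float) -> float:
    exploration = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q + exploration


def select(children: list[Child], c_puct: float = 1.5) -> Child:
    parent_visits = sum(child.visits for child in children)  # N(s)
    return max(children, key=lambda child: puct(child, parent_visits, c_puct))


children = [
    Child("Keep parser v1", prior=0.5, visits=8, value_sum=6.4),
    Child("Pin parser v2", prior=0.4, visits=2, value_sum=1.4),
    Child("Pin parser nightly", prior=0.1, visits=0),
]

print(select(children).action)              # Pin parser v2
print(select(children, c_puct=0.0).action)  # Keep parser v1
```

"Keep parser v1" has the higher mean value (0.8 versus 0.7), but it has already been visited eight times, so the exploration term sends the next visit to "Pin parser v2". Setting `c_puct` to zero removes exploration and falls back to the best mean value.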

Who judges the plans? Verifiers and reward models

Generating many candidate plans is useless if you can't identify eligible results. Use exact checkers for objective constraints wherever possible. When exact checking is unavailable and a learned verifier scores candidate traces, two common forms are:

An Outcome Reward Model (ORM) scores only the final answer. It's like an audit that checks whether the final dependency plan is valid without inspecting the intermediate version choices. An ORM is cheap to train because it needs only one label per chain, but it's sparse: if the answer is wrong, you get no signal about where the reasoning broke down.

A Process Reward Model (PRM) scores individual reasoning steps, not the final answer alone [11] (Let's Verify Step by Step, https://arxiv.org/abs/2305.20050). When step quality is labelable and the PRM is validated on the served domain, it can help prune low-scored branches before paying for complete traces. It remains a learned proxy, not proof that a step is correct.

The PRM estimates risk at each dependency step. When a candidate calls a removed API, a validated PRM may give that step a low score. If an exact compatibility checker is available, use it as the authoritative rejection; otherwise the score is a pruning signal whose false positives and negatives need evaluation.

How a PRM works

A PRM is typically a language model with a scalar reward head, trained to predict the correctness of a step $s_t$ given the prompt and the previous steps $s_{1...t-1}$. Lightman et al. released PRM800K, a dataset with 800,000 step-level human feedback labels, because step supervision is the bottleneck for building these verifiers well [11] (Let's Verify Step by Step, https://arxiv.org/abs/2305.20050).

P(\text{correct} \mid \text{context}, s_{1...t}) = \sigma(W \cdot h_t)

We take the hidden representation of the current step ($h_t$), map it through a learned linear layer ($W$), and squash the result with a sigmoid ($\sigma$). That gives a score between 0.0 and 1.0; treating it as a calibrated probability requires separate calibration and held-out validation.
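As a toy illustration of the head alone, the sketch below computes $\sigma(W \cdot h_t)$ for a three-dimensional vector. The weights and hidden values are made up; a real reward head sits on top of a language model's hidden state with thousands of dimensions.

```python
import math


def reward_head(hidden: list[float], weights: list[float]) -> float:
    """Compute sigmoid(W . h_t) for one step representation."""
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    logit = sum(w * h for w, h in zip(weights, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values chosen so the logit is 0.5 - 0.5 + 0.5 = 0.5.
h_t = [0.5, -1.0, 2.0]
W = [1.0, 0.5, 0.25]

score = reward_head(h_t, W)
print(round(score, 3))  # 0.622
```

The output is bounded in (0, 1), but nothing in this computation makes it a calibrated probability; that property has to be checked against held-out labels.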

Training data is collected via three main paths:

Human feedback: Experts label individual steps as Positive, Negative, or Neutral.
Monte Carlo estimation: Roll out $N$ completions from a step. If $M$ of $N$ lead to the correct answer, the step score is roughly $M/N$.
Teacher-model supervision: Use a stronger verifier or frontier model to pre-label steps, then audit those labels before trusting them at scale.
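The Monte Carlo path in the list above reduces to counting successful rollouts. In this sketch the rollout outcomes are hypothetical booleans; in practice each one would come from completing the trace from that step and checking the final answer.

```python
def monte_carlo_step_label(rollout_outcomes: list[bool]) -> float:
    """Estimate a step's score as M / N: the fraction of rollouts that succeed."""
    if not rollout_outcomes:
        raise ValueError("need at least one rollout to estimate a label")
    return sum(rollout_outcomes) / len(rollout_outcomes)


# Hypothetical outcomes of N = 4 rollouts continued from each step.
rollouts_by_step = {
    "Checked parser version against runtime compatibility.": [True, True, True, False],
    "Call removed API from the parser adapter.": [False, False, True, False],
}

labels = {
    step: monte_carlo_step_label(outcomes)
    for step, outcomes in rollouts_by_step.items()
}
for step, label in labels.items():
    print(f"{label:.2f}  {step}")
```

With only four rollouts per step the estimate is coarse (it can only take the values 0, 0.25, 0.5, 0.75, or 1), so the number of rollouts is itself a labeling-cost tradeoff.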

Here's a practical scoring interface. In production the scoring call is a model with a reward head; this runnable version uses deterministic rules so you can see how weak steps pull down the whole trace:

how-a-prm-works.py

from dataclasses import asdict, dataclass
import json

@dataclass
class StepScore:
    step: str
    score: float

class ProcessRewardModel:
    def score_step(self, step: str) -> float:
        lowered = step.lower()
        if "removed api" in lowered or "runtime conflict" in lowered:
            return 0.15
        if "checked" in lowered or "compatible" in lowered:
            return 0.92
        return 0.65

    def score_full_trace(self, steps: list[str]) -> dict[str, object]:
        scores = [StepScore(step, self.score_step(step)) for step in steps]
        trace_score = min(score.score for score in scores) if scores else 0.0
        return {
            "step_scores": [asdict(score) for score in scores],
            "trace_score": trace_score,
            "decision": "prune" if trace_score < 0.35 else "keep",
        }

trace = [
    "Checked parser version against runtime compatibility.",
    "Call removed API from the parser adapter.",
    "Recommend hold until compatibility tests pass.",
]

print(json.dumps(ProcessRewardModel().score_full_trace(trace), indent=2))

Output

{
  "step_scores": [
    {
      "step": "Checked parser version against runtime compatibility.",
      "score": 0.92
    },
    {
      "step": "Call removed API from the parser adapter.",
      "score": 0.15
    },
    {
      "step": "Recommend hold until compatibility tests pass.",
      "score": 0.65
    }
  ],
  "trace_score": 0.15,
  "decision": "prune"
}

The score_full_trace method uses the minimum score as a conservative pruning heuristic. In production, set thresholds against held-out task outcomes and prefer exact checks for constraints such as runtime compatibility or policy eligibility.
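To see why the choice of aggregator matters, the sketch below applies three common options to the step scores from the trace above, using the same 0.35 pruning threshold. The mean and product aggregators are shown for comparison only; the example code in this section uses the minimum.

```python
import math

step_scores = [0.92, 0.15, 0.65]  # step scores from the example trace above
THRESHOLD = 0.35                  # same pruning threshold as the example

aggregates = {
    "min": min(step_scores),
    "mean": sum(step_scores) / len(step_scores),
    "product": math.prod(step_scores),
}
decisions = {
    name: "prune" if value < THRESHOLD else "keep"
    for name, value in aggregates.items()
}

for name, value in aggregates.items():
    print(f"{name:8s} {value:.3f} -> {decisions[name]}")
```

The mean (about 0.573) keeps the trace because two acceptable steps dilute the bad one. The minimum and the product both prune it, but the product also shrinks as traces get longer, so a fixed threshold on it penalizes length as well as weak steps.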

ORM vs PRM at a glance

Feature	Outcome Reward Model (ORM)	Process Reward Model (PRM)
Scoring	Only scores the final answer	Scores every intermediate step
Feedback signal	Sparse (1 signal per chain)	Dense (1 signal per step)
Search utility	Only helps select final output	Guides search; enables early pruning
Training cost	Lower (less labeling effort)	Higher (requires step-level labels)

ORM waits for final score while PRM prunes a bad step early. — PRMs help when search needs earlier stop signals, but exact checkers still beat learned scores whenever a deterministic rule exists.

Lightman et al. (2023) [11] (Let's Verify Step by Step, https://arxiv.org/abs/2305.20050) found that process supervision outperformed outcome supervision in their challenging math setting. Carrying that result into release-policy, planning, or code tasks requires domain-specific data and validation.

Training reasoning models with RL

Modern reasoning models don't rely only on external search controllers. DeepSeek-R1 shows a complementary path where reinforcement learning makes the base policy itself better at producing long, structured reasoning traces [2] (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948).

Group Relative Policy Optimization (GRPO) is one important algorithm in that line of work, and DeepSeek-R1 made it especially visible in open-model practice [2] (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948). Compared with PPO (Proximal Policy Optimization), GRPO removes the need for a separate critic model:

Sample a group of outputs from the policy for each problem
Compute rewards for each output (using task rewards, exact verifiers where available, or validated reward models)
Calculate the relative advantage within the group (baseline = group mean)
Update the policy to favor higher-reward outputs

This cuts RL training overhead because there is no critic network to train or keep in memory. The resulting policy can be more sample-efficient at test time, but it doesn't remove the need for routing, budget limits, or verifier design in production.
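The group-relative advantage step in the GRPO list above can be sketched in a few lines. The rewards are hypothetical pass/fail results from an exact verifier. Dividing by the group standard deviation is a commonly described normalization and is treated as an assumption here, as is the small `eps` guard; the policy-update step itself is not shown.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sample relative to its own group (baseline = group mean)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group
    # eps keeps the division defined when every reward in the group is equal.
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four sampled outputs on one problem.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]

# A group where every sample earns the same reward carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The baseline comes from the sampled group itself, which is what replaces the critic. The second call shows a practical consequence: problems that are always solved or never solved in a group contribute zero advantage.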

Putting it together: a production reasoning agent

So far we've looked at individual techniques. A real production system wires them into a pipeline that classifies the incoming problem, picks a strategy, runs the search, verifies the results, and streams progress back to the user.

What the system must handle

Before drawing boxes, list the concrete responsibilities:

Complex reasoning: Accept problems that need multi-step logic (canary incident triage, dependency resolution, multi-file debugging).
Candidate state: Maintain external candidate actions, tool observations, and evaluator results needed for search and audit, without assuming access to a model's hidden scratchpad.
Adaptive compute: Scale inference compute proportional to problem difficulty. A "What is the API key retention policy?" question shouldn't trigger a 20-second tree search.
Streaming: Emit status updates or safe progress summaries while the solver is working, without exposing unverified conclusions as completed checks.
Budget control: Enforce configurable reasoning budgets in tokens, time, branches, and cost; product tiers may change caps but must not weaken evidence requirements.

On the non-functional side, declare measurable objectives: concurrency for a target workload, quality lift over single pass on representative verifiable tasks, p95 latency, and per-request cost by route. A 3x or 10x budget is an experiment parameter, not a production guarantee.
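The budget-control responsibility listed above can be sketched as a small accounting object that the solver consults before each expansion. `ReasoningBudget`, its field names, and the per-branch spend figures are hypothetical; a real controller would also need wall-clock timing and provider-specific cost accounting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReasoningBudget:
    max_tokens: int
    max_branches: int
    max_seconds: float
    max_cost_usd: float
    tokens: int = 0
    branches: int = 0
    seconds: float = 0.0
    cost_usd: float = 0.0

    def charge(self, tokens: int, seconds: float, cost_usd: float) -> None:
        """Record the spend of one expanded branch."""
        self.tokens += tokens
        self.branches += 1
        self.seconds += seconds
        self.cost_usd += cost_usd

    def exceeded(self) -> Optional[str]:
        """Return the first exhausted dimension, or None while within budget."""
        limits = [
            ("tokens", self.tokens, self.max_tokens),
            ("branches", self.branches, self.max_branches),
            ("seconds", self.seconds, self.max_seconds),
            ("cost_usd", self.cost_usd, self.max_cost_usd),
        ]
        for name, used, cap in limits:
            if used >= cap:
                return name
        return None


budget = ReasoningBudget(max_tokens=8_000, max_branches=6, max_seconds=20.0, max_cost_usd=0.50)
expanded = 0
while budget.exceeded() is None:
    # Hypothetical spend for one branch expansion.
    budget.charge(tokens=1_000, seconds=2.0, cost_usd=0.04)
    expanded += 1

print(expanded, budget.exceeded())  # 6 branches
```

Here the branch cap binds first, after six expansions. Keeping the caps in one object makes it straightforward for a product tier to swap the numbers while the evidence gate stays a separate, unchanged check.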

Architecture overview

This diagram shows the full pipeline. Requests enter through a gateway, get classified by a difficulty router, and then flow through either a fast path, a medium path (Best-of-N), or a deep path (Tree of Thoughts). Branching paths should use the best available evaluator and reuse key-value (KV) cache prefixes only where the serving runtime safely supports compatible prefix sharing.

Figure: Reasoning agent routes work into fast, medium, or deep budget lanes before verification. Production reasoning agents route by difficulty, then spend more budget only when cheaper traces stop being reliable.

The request flow below shows the sequence in more detail:

Figure: Reasoning request trace where one routed prompt fans into several checked branches that share a prefix cache before one verified answer survives. Test-time compute stays affordable when branches share one cached stem, weak paths get pruned early, and only grounded progress leaves the loop.

Notice the loop in the request flow. The solver generates candidates, reuses shared KV prefixes, scores with the verifier, and streams progress events back. This loop is where the extra test-time compute lives.
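
The loop described above can be sketched as a generator that emits progress events while it works. This is a minimal illustration under stated assumptions: `solve_loop`, `generate`, and `passes_check` are hypothetical names, the "model" is a lookup table, and KV prefix reuse is left to the serving runtime and not modeled here.

```python
from typing import Callable, Iterator

def solve_loop(
    generate: Callable[[int], str],
    passes_check: Callable[[str], bool],
    max_candidates: int,
) -> Iterator[dict[str, object]]:
    """Yield one progress event per candidate, then exactly one final event."""
    for attempt in range(1, max_candidates + 1):
        candidate = generate(attempt)
        ok = passes_check(candidate)
        # Progress events report counts and check results, not candidate text.
        yield {"type": "progress", "attempt": attempt, "check_passed": ok}
        if ok:
            yield {"type": "final", "answer": candidate}
            return
    # Budget ended without a verified candidate: say so instead of guessing.
    yield {"type": "final", "answer": None, "reason": "no_verified_candidate"}

# Toy stand-ins: the "model" proposes rollout actions, the "checker" accepts one.
proposals = {1: "promote", 2: "rollback", 3: "hold"}
events = list(solve_loop(proposals.__getitem__, lambda c: c == "hold", max_candidates=3))
for event in events:
    print(event)
```

When no candidate passes, the loop returns an explicit empty result so the caller can fall back to review instead of releasing an unchecked answer.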

The difficulty router

Not every problem needs extended thinking. Simple factual queries waste compute if treated as reasoning tasks. The router evaluates input complexity and assigns a strategy plus token budget. Here's a minimal implementation that uses keyword rules as a stand-in for a real classifier:

the-difficulty-router.py

from dataclasses import asdict, dataclass
import json

@dataclass
class Route:
    difficulty: str
    strategy: str
    max_tokens: int
    budget_multiplier: int

class DifficultyRouter:
    CONFIGS = {
        "trivial": Route("trivial", "single_pass", 500, 1),
        "easy": Route("easy", "single_path", 2_000, 2),
        "medium": Route("medium", "best_of_3", 4_000, 5),
        "hard": Route("hard", "tree_search", 8_000, 10),
        "extreme": Route("extreme", "tree_search", 16_000, 20),
    }

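    def classify(self, problem: str) -> Route:
        # Keyword rules stand in for a real classifier. Rules run from the most
        # expensive route down and the first match wins; no rule selects "extreme".
        # Substring matching is crude: "prove" also matches "improve".
        text = problem.lower()
        if "prove" in text or "counterexample" in text or "cyclic graph" in text:
            return self.CONFIGS["hard"]
        if "multiple constraints" in text or "debug" in text:
            return self.CONFIGS["medium"]
        if "policy" in text or "retention" in text:
            return self.CONFIGS["easy"]
        return self.CONFIGS["trivial"]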

router = DifficultyRouter()
problems = [
    "What is the API key retention policy?",
    "Debug this dependency resolver with multiple constraints.",
    "Find a counterexample for this cyclic graph algorithm.",
]

print(json.dumps([asdict(router.classify(problem)) for problem in problems], indent=2))

Output

[
  {
    "difficulty": "easy",
    "strategy": "single_path",
    "max_tokens": 2000,
    "budget_multiplier": 2
  },
  {
    "difficulty": "medium",
    "strategy": "best_of_3",
    "max_tokens": 4000,
    "budget_multiplier": 5
  },
  {
    "difficulty": "hard",
    "strategy": "tree_search",
    "max_tokens": 8000,
    "budget_multiplier": 10
  }
]

The router should be cheap relative to solving work. Its job is to avoid expensive search on requests that don't benefit while escalating tasks whose risk and verifiability justify more work. Misclassification isn't harmless: under-routing can release an incorrect answer, while over-routing can increase cost, latency, and exposure to verifier overoptimization.
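
Because the two misclassification directions have different costs, it helps to track them separately. The sketch below assumes you have a labeled sample of requests with a predicted and a reference difficulty; `routing_errors` and the sample data are hypothetical.

```python
from collections import Counter

ORDER = ["trivial", "easy", "medium", "hard", "extreme"]

def routing_errors(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """pairs holds (predicted, labeled) difficulty for each request."""
    if not pairs:
        raise ValueError("need at least one labeled request")
    rank = {name: i for i, name in enumerate(ORDER)}
    counts = Counter()
    for predicted, labeled in pairs:
        if rank[predicted] < rank[labeled]:
            counts["under_routed"] += 1   # risk: a wrong answer gets released
        elif rank[predicted] > rank[labeled]:
            counts["over_routed"] += 1    # risk: wasted cost and latency
        else:
            counts["correct"] += 1
    return {key: counts[key] / len(pairs)
            for key in ("under_routed", "over_routed", "correct")}

sample = [("easy", "easy"), ("trivial", "hard"), ("hard", "medium"), ("medium", "medium")]
print(routing_errors(sample))
```

A single accuracy number would hide which direction the router errs in, which is the detail that decides whether to tighten or loosen escalation.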

Compute-optimal scheduling

Snell et al. [3] show that no single test-time strategy dominates every prompt. The best move depends on problem difficulty and on whether the base model already has a plausible path to the correct answer. A production scheduler should start cheap and escalate only when an exact check or a validated evaluation signal says the first attempt is weak:

compute-optimal-scheduling.py

import json

class ComputeOptimalScheduler:
    def choose_strategy(
        self, evaluation_score: float, verifiable: bool, high_impact: bool = False
    ) -> dict[str, object]:
        if high_impact and not verifiable:
            return {"strategy": "tool_or_human_review", "reason": "high-impact claim lacks verifier"}
        if not verifiable:
            return {"strategy": "single_pass", "reason": "creative output; no objective ranking"}
        if evaluation_score >= 0.90:
            return {"strategy": "single_path", "reason": "first trace passed verifier"}
        if evaluation_score >= 0.60:
            return {"strategy": "best_of_3", "reason": "plausible trace needs reranking"}
        return {"strategy": "tree_search", "reason": "weak trace needs backtracking"}

requests = [
    {"task": "API-key policy FAQ", "evaluation_score": 0.94, "verifiable": True},
    {"task": "Ambiguous migration plan", "evaluation_score": 0.72, "verifiable": True},
    {"task": "Cyclic graph bug", "evaluation_score": 0.41, "verifiable": True},
    {"task": "Brand voice poem", "evaluation_score": 0.52, "verifiable": False},
    {"task": "Approve production write without policy record", "evaluation_score": 0.82, "verifiable": False, "high_impact": True},
]

scheduler = ComputeOptimalScheduler()
for request in requests:
    request.update(scheduler.choose_strategy(request["evaluation_score"], request["verifiable"], request.get("high_impact", False)))

print(json.dumps(requests, indent=2))

Output

[
  {
    "task": "API-key policy FAQ",
    "evaluation_score": 0.94,
    "verifiable": true,
    "strategy": "single_path",
    "reason": "first trace passed verifier"
  },
  {
    "task": "Ambiguous migration plan",
    "evaluation_score": 0.72,
    "verifiable": true,
    "strategy": "best_of_3",
    "reason": "plausible trace needs reranking"
  },
  {
    "task": "Cyclic graph bug",
    "evaluation_score": 0.41,
    "verifiable": true,
    "strategy": "tree_search",
    "reason": "weak trace needs backtracking"
  },
  {
    "task": "Brand voice poem",
    "evaluation_score": 0.52,
    "verifiable": false,
    "strategy": "single_pass",
    "reason": "creative output; no objective ranking"
  },
  {
    "task": "Approve production write without policy record",
    "evaluation_score": 0.82,
    "verifiable": false,
    "high_impact": true,
    "strategy": "tool_or_human_review",
    "reason": "high-impact claim lacks verifier"
  }
]

This creates an adaptive system that spends small budgets on easy, checkable problems and escalates only when expected quality lift is worth extra latency and KV-cache pressure. It doesn't treat "unverifiable" as permission to make high-impact decisions.

Escalation also needs hard stop rules for each branch. The check below ends a branch on a repeated state, an exhausted token or wall-clock cap, or three consecutive scores that fail to beat the earlier best:

enforcing-branch-stop-budgets.py

import json

def stop_reason(
    tokens_used: int,
    token_cap: int,
    elapsed_ms: int,
    wall_clock_cap_ms: int,
    repeated_state: bool,
    score_history: list[float],
) -> str:
    if repeated_state:
        return "repeated_state"
    if tokens_used >= token_cap:
        return "token_cap"
    if elapsed_ms >= wall_clock_cap_ms:
        return "wall_clock_cap"
    if len(score_history) >= 3 and max(score_history[-3:]) <= max(score_history[:-3], default=-1):
        return "no_recent_score_improvement"
    return "continue"

branches = [
    {"tokens_used": 1200, "token_cap": 4000, "elapsed_ms": 800, "wall_clock_cap_ms": 3000,
     "repeated_state": True, "score_history": [0.54, 0.55]},
    {"tokens_used": 2500, "token_cap": 4000, "elapsed_ms": 1800, "wall_clock_cap_ms": 3000,
     "repeated_state": False, "score_history": [0.72, 0.70, 0.71, 0.69]},
    {"tokens_used": 900, "token_cap": 4000, "elapsed_ms": 500, "wall_clock_cap_ms": 3000,
     "repeated_state": False, "score_history": [0.40, 0.55]},
]

print(json.dumps([stop_reason(**branch) for branch in branches], indent=2))

Output

[
  "repeated_state",
  "no_recent_score_improvement",
  "continue"
]

The serving bottleneck nobody talks about

Long reasoning traces can stress memory as well as floating-point operations (FLOPs). Every live token, whether it comes from the prompt or is generated, adds a key vector and a value vector per KV head in every transformer layer, so key-value (KV) cache pressure grows linearly with live context length and multiplies across unshared branches.

KV cache pressure

A useful back-of-the-envelope formula is:

$$\text{KV bytes per token} \approx 2 \cdot L \cdot H_{kv} \cdot d_{\text{head}} \cdot \text{bytes per element}$$

Where $L$ is the number of layers, $H_{kv}$ is the number of KV heads (not always the full attention-head count if the model uses grouped-query attention (GQA)), and $d_{\text{head}}$ is the head dimension. For an 80-layer model with 8 KV heads, head dimension 128, and bfloat16 (BF16) activations:

$$2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 = 327{,}680 \text{ bytes per token} = 320 \text{ KiB/token}$$

At 16k live tokens, that's roughly 5.0 GiB of KV cache for one branch. Without any prefix reuse, a tree search with eight active 16k-token branches would need about 40 GiB of KV cache before you count model weights or temporary activations.
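
The arithmetic above can be reproduced with a few lines of code. This is a sketch of the same back-of-the-envelope estimate, not a measurement of any runtime; the function names are hypothetical, and "16k" is taken to mean 16,384 tokens.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    # Factor 2 covers one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_element

def kv_gib(tokens: int, per_token_bytes: int, branches: int = 1) -> float:
    # Worst case: every branch stores its full context with no sharing.
    return tokens * per_token_bytes * branches / 2**30

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_element=2)
print(per_token)                                  # 327680
print(kv_gib(16 * 1024, per_token))               # 5.0
print(kv_gib(16 * 1024, per_token, branches=8))   # 40.0
```

Swapping in your own layer count, KV-head count, and cache dtype gives a first estimate before you profile the actual runtime.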

Prefix sharing and paged KV storage

Best-of-N and tree search create many branches that share the same prompt and often the same early reasoning prefix. If you duplicate that prefix for every branch, memory usage grows with the number and length of branches. Systems like vLLM use PagedAttention to manage KV memory in blocks, which cuts fragmentation and makes long contexts much easier to serve [12].

PagedAttention and radix-style prefix caches solve related but different problems. PagedAttention allocates KV memory in fixed-size blocks, which reduces fragmentation. RadixAttention-style caches, as used in SGLang, index shared prefixes in a radix tree so sibling branches can reuse the same cached prefix instead of copying it branch by branch [13]. In a reasoning agent you usually want both: block-level memory management plus prefix-aware reuse.

That changes the capacity math materially. If eight branches share a 12k-token prefix and each branch adds only a 4k-token unique suffix, the effective footprint is closer to one 12k prefix plus eight 4k suffixes, or about 14 GiB, not the 40 GiB worst case above. The exact savings depend on how quickly branches diverge and on whether your runtime can share prefixes at token-level or block-level granularity.
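
The shared-prefix comparison can be checked the same way. The sketch below assumes ideal token-level sharing with no block-rounding overhead, which real runtimes may not achieve; the function names are hypothetical, and the per-token constant is the 327,680-byte estimate from the previous section.

```python
KV_BYTES_PER_TOKEN = 327_680  # 80 layers, 8 KV heads, head dim 128, 2-byte elements

def duplicated_gib(prefix_tokens: int, suffix_tokens: int, branches: int) -> float:
    # Every branch stores its own copy of the prefix plus its suffix.
    return branches * (prefix_tokens + suffix_tokens) * KV_BYTES_PER_TOKEN / 2**30

def shared_prefix_gib(prefix_tokens: int, suffix_tokens: int, branches: int) -> float:
    # The prefix is stored once; each branch pays only for its own suffix.
    live_tokens = prefix_tokens + branches * suffix_tokens
    return live_tokens * KV_BYTES_PER_TOKEN / 2**30

prefix, suffix, width = 12 * 1024, 4 * 1024, 8
print(duplicated_gib(prefix, suffix, width))     # 40.0
print(shared_prefix_gib(prefix, suffix, width))  # 13.75
```

The gap between the two numbers shrinks as branches diverge earlier, because a shorter shared prefix leaves more tokens in per-branch suffixes.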

Figure: KV cache reuse trace where one shared prefix fans into many branches, each branch pays only for its own suffix, and losing suffixes get freed after pruning. Prefix reuse turns branch memory from repeated stem copies into one shared stem plus branch suffixes, which matters as soon as search width grows.

This is one reason test-time search is an infrastructure problem, not a prompting trick alone. A search policy may be sound on paper while its branch width, trace length, or lack of compatible prefix reuse exhausts the serving memory budget before useful search completes.

First update vs final-answer latency

Extended reasoning changes the latency contract. At the model-serving layer, a long hidden search phase can inflate time to first token (TTFT). At the product layer, a quick status event can make first-update latency look healthy while time-to-final-answer and total budget still fail. Measure all three separately.

That's why production systems can stream progress events instead of raw scratchpads. Safe events include selected strategy, budget usage, branch counts, or summaries of externally confirmed checks. Treat learned verifier scores as internal diagnostics unless they are calibrated and presented with an appropriate contract.
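
One way to keep the three latency measures separate is to record each against its own objective. This is a minimal sketch; `LatencyRecord`, `violations`, and the objective values are hypothetical and would come from your own service-level targets.

```python
from dataclasses import dataclass

@dataclass
class LatencyRecord:
    # All timestamps are milliseconds since the request arrived.
    first_update_ms: float   # first status event reached the client
    first_token_ms: float    # first model token was produced (TTFT)
    final_answer_ms: float   # released answer reached the client

def violations(record: LatencyRecord, objectives: dict[str, float]) -> list[str]:
    """Return each metric whose measured value exceeds its objective."""
    measured = {
        "first_update_ms": record.first_update_ms,
        "first_token_ms": record.first_token_ms,
        "final_answer_ms": record.final_answer_ms,
    }
    return [name for name, limit in objectives.items() if measured[name] > limit]

objectives = {"first_update_ms": 500, "first_token_ms": 2_000, "final_answer_ms": 8_000}
record = LatencyRecord(first_update_ms=120, first_token_ms=6_400, final_answer_ms=9_100)
print(violations(record, objectives))
```

In this example the quick status event meets its objective while the other two measures miss theirs, which is the situation a single "first response" metric would hide.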

A sample capacity plan

This table uses the 320 KiB/token estimate to show how quickly memory usage grows as you add longer traces or more live branches:

Workload	Live Branches	Context per Branch	Approx KV Cache	Operational Implication
Single CoT	1	8k tokens	2.5 GiB	Usually easy to colocate with the base model
Best-of-4	4	8k tokens	10.0 GiB	Parallel sampling is practical, but batching headroom shrinks
Tree Search	8	16k tokens	40.0 GiB	Usually needs dedicated search workers and aggressive prefix reuse

Those are worst-case branch-multiplication numbers. Strong prefix reuse can cut the effective KV footprint substantially.

Streaming safe updates

Users need feedback while the system is spending extra inference budget, but that doesn't mean you should dump raw internal scratchpad text or expose an uncalibrated reward score as confidence. A safer pattern is to stream strategy selection, budget usage, and externally grounded summaries while the solver keeps search details private:

streaming-safe-updates.py

from dataclasses import dataclass
import json

@dataclass
class StreamEvent:
    type: str
    content: str
    internal_score: float | None = None

class ReasoningSolver:
    def solve_stream(self) -> list[StreamEvent]:
        return [
            StreamEvent("progress", "checked canary status", 0.82),
            StreamEvent("progress", "verified rollback threshold", 0.91),
            StreamEvent("final", "Hold canary at 10% and inspect the config flag.", 0.88),
        ]

def stream_reasoning(config: dict[str, object], solver: ReasoningSolver) -> list[dict[str, object]]:
    events = [
        {"type": "status", "content": "Analyzing problem difficulty..."},
        {
            "type": "config",
            "content": f"Using {config['strategy']} with budget {config['max_tokens']} tokens",
        },
    ]

    for event in solver.solve_stream():
        if event.type == "progress":
            events.append({
                "type": "progress",
                "content": event.content,
            })
        elif event.type == "final":
            events.append({
                "type": "final_answer",
                "content": event.content,
            })
    return events

config = {"strategy": "best_of_3", "max_tokens": 4_000}
print(json.dumps(stream_reasoning(config, ReasoningSolver()), indent=2))

Output

[
  {
    "type": "status",
    "content": "Analyzing problem difficulty..."
  },
  {
    "type": "config",
    "content": "Using best_of_3 with budget 4000 tokens"
  },
  {
    "type": "progress",
    "content": "checked canary status"
  },
  {
    "type": "progress",
    "content": "verified rollback threshold"
  },
  {
    "type": "final_answer",
    "content": "Hold canary at 10% and inspect the config flag."
  }
]

The solver still keeps internal_score for routing and diagnostics, but stream_reasoning strips that field before the event reaches the client. The user sees confirmed work, not an uncalibrated proxy.

When reasoning is the wrong tool

Not every problem benefits from test-time scaling. Applying tree search or Best-of-N to the wrong use case results in unnecessary latency and wasted resources. You should bypass extended reasoning for:

Simple source-of-truth queries (e.g., "What is the API key retention policy?"). Query the policy source directly; extra internal reasoning can't supply a fact the model lacks or confirm that a remembered fact is still current.
Retrieval failures. If the problem requires external facts or tools, longer internal reasoning is the wrong fix. You need retrieval, search, or tool use.
Open-ended marketing copy (e.g., "Write three brand voice options for a product page"). There is no single objective answer unless the application defines a rubric and validates it, so deep PRM-driven search is usually unjustified.
Latency-sensitive conversational turns. Live products need measured response objectives; when a deep search would exceed that objective, the system should acknowledge the request and defer the search, or skip it.
Cost-sensitive batch processing. When running millions of documents through a classification pipeline, a large inference multiplier can erase the value of a small quality gain. Escalate only request cohorts whose measured lift pays for the extra cost.

Allocate substantial test-time compute when evaluation shows gain for the request class and evidence or verification can keep the answer trustworthy. Creative revision may still use a small candidate budget, but it isn't proof-oriented reasoning.
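
For the cost-sensitive case, the break-even test is simple arithmetic. The sketch below is an assumption-laden simplification: it treats every correct answer as worth a fixed amount and ignores latency cost, and `escalation_pays` plus all numbers are hypothetical.

```python
def escalation_pays(
    base_accuracy: float,
    escalated_accuracy: float,
    value_per_correct_usd: float,
    base_cost_usd: float,
    cost_multiplier: float,
) -> bool:
    """True when the expected value of the lift exceeds the extra spend per request."""
    lift_value = (escalated_accuracy - base_accuracy) * value_per_correct_usd
    extra_cost = base_cost_usd * (cost_multiplier - 1)
    return lift_value > extra_cost

# Bulk classification: one point of lift, tiny value per document, 5x cost.
print(escalation_pays(0.96, 0.97, 0.01, 0.002, 5))
# Incident triage: large lift, high value per correct call, 10x cost.
print(escalation_pays(0.70, 0.85, 5.00, 0.02, 10))
```

Running the check per request cohort, with accuracies taken from held-out evaluation, keeps escalation tied to measured lift instead of intuition.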

Overthinking: when more thinking lowers accuracy

Longer reasoning isn't monotonically better. Empirical 2025 work finds that accuracy often plateaus and can decline once a trace runs past a problem-specific sweet spot: extra steps add error accumulation, and a model can talk itself out of a correct answer it already found [14]. This overthinking failure mode is also a cost and latency problem, because a model can burn thousands of reasoning tokens on a trivial question. The defenses are the same controls you already have: a router that keeps trivial tasks on the fast path, a capped reasoning_effort or token budget, and a stop rule that exits once exact results are sufficient or validated evaluation stops improving. Treat reasoning budget as a tuned hyperparameter, not a dial you turn to maximum.
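
The stop rule can be paired with a best-so-far fallback so that later, weaker candidates never replace an earlier good one. This is a sketch under assumptions: `run_with_fallback` and `patience` are hypothetical names, scores are assumed to come from a validated evaluator, and every candidate is assumed to have already met the required evidence checks.

```python
from typing import Optional

def run_with_fallback(
    steps: list[tuple[str, float, bool]], patience: int = 2
) -> dict[str, object]:
    """steps holds (candidate, validated_score, exact_check_passed) in generation order."""
    best: Optional[tuple[str, float]] = None
    stale = 0          # consecutive candidates that failed to beat the best score
    used = 0
    reason = "budget_exhausted"
    for candidate, score, exact_ok in steps:
        used += 1
        if exact_ok:
            # An exact check is sufficient evidence: stop immediately.
            return {"answer": candidate, "stopped": "exact_check_passed", "steps_used": used}
        if best is None or score > best[1]:
            best, stale = (candidate, score), 0
        else:
            stale += 1
            if stale >= patience:
                reason = "score_plateau"
                break
    return {"answer": best[0] if best else None, "stopped": reason, "steps_used": used}

trace = [("hold", 0.81, False), ("promote", 0.64, False),
         ("rollback", 0.60, False), ("promote", 0.90, False)]
print(run_with_fallback(trace))
```

With `patience=2` the run stops after the third candidate and returns the first one, so the fourth candidate is never generated or paid for.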

Why a bigger model isn't enough

It's tempting to think that deploying a bigger frontier model should dominate every reasoning workload. In practice, the trade-off is more subtle:

Approach	Cost Pattern	Control
Bigger model (Training / model switch)	Higher steady-state inference cost	Fixed after deployment; every request pays for the larger model
Test-time compute (Inference)	Higher per-request latency and token cost	Dynamic per problem; easy to scale up or down

Snell et al. make the practical point: on prompts where a smaller model already has some chance of success, extra inference-time compute can beat switching to a much larger model at matched compute [3]. But if the base model has near-zero task competence, reranking or sampling is unlikely to manufacture the missing capability. Test-time compute amplifies evaluated candidate quality; it doesn't supply missing knowledge or tools.

Common pitfalls

"Higher temperature will reveal a great answer if I sample enough"

Symptom: Best-of-N produces more varied traces, but quality barely improves and the chosen winner still feels random.
Cause: Diversity alone isn't enough. Without a verifier that tracks correctness, more samples mostly add noise.
Fix: Increase temperature only when you also have a PRM, ORM, or exact checker that can reliably separate strong traces from fluent wrong ones.

"Tree search is always better than single-path reasoning"

Symptom: Easy requests suddenly become slower and more expensive even though accuracy barely moves.
Cause: Search budget was applied to every request instead of only the hard, verifiable ones.
Fix: Route by difficulty. Keep trivial and easy requests on the fast path, then escalate only when the first trace scores weakly.

"More thinking always improves the answer"

Symptom: The system finds a decent first answer, then keeps exploring until it talks itself into a worse one.
Cause: Overthinking. Past a task-specific sweet spot, extra search compounds errors instead of fixing them.
Fix: Stop when exact checks pass or a validated evaluator plateaus, cap reasoning effort, and keep the best eligible candidate as a fallback so later noisy branches can't displace it.

"One budget policy can serve every request"

Symptom: Cheap FAQ traffic burns deep-search budget while hard debugging tasks run out of depth and tokens too early.
Cause: Compute was allocated uniformly instead of matching task difficulty and business value.
Fix: Separate routing, branch limits, token caps, and wall-clock caps by request class. Budget policy is part of product design, not model tuning alone.

"Users need raw chain-of-thought to trust the system"

Symptom: Streams become long, unstable, and hard to parse, while downstream tools can't tell what is final.
Cause: Internal scratchpad text was exposed instead of a stable progress interface.
Fix: Stream strategy, budget usage, and short evidence-backed summaries. Keep hidden search traces and uncalibrated evaluator scores inside the runtime.

"Reasoning quality comes only from prompts"

Symptom: Prompt tweaks plateau, but the team still has no reliable gain on hard tasks.
Cause: Model capability, verifier quality, cache reuse, and stopping logic were collapsed into one prompt-design problem.
Fix: Treat base model, search controller, verifier, and serving runtime as separate levers. Improve whichever layer is bottlenecking quality or cost.

"Stop criteria are an optional cleanup detail"

Symptom: Branches revisit the same semantic state until they exhaust the token budget or hit wall-clock limits.
Cause: The controller has no hard policy for repeated states, budget exhaustion, or diminishing verifier returns.
Fix: Enforce token, depth, wall-clock, and repeated-state limits per branch, and return the highest-ranked surviving trace that satisfies required checks when the budget ends.
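
A repeated-state limit needs some way to recognize a revisited state. The sketch below uses text normalization plus hashing, which only catches near-verbatim repeats; detecting truly semantic repeats would need something stronger, such as embedding similarity. `state_key` and `RepeatedStateDetector` are hypothetical names.

```python
import hashlib
import re

def state_key(text: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so trivial
    # rewording of the same step maps to the same key.
    normalized = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class RepeatedStateDetector:
    """Tracks the step keys seen so far on one branch."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_repeat(self, step_text: str) -> bool:
        key = state_key(step_text)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False

detector = RepeatedStateDetector()
print(detector.is_repeat("Check the config diff."))    # False
print(detector.is_repeat("check the  config diff"))    # True
print(detector.is_repeat("Read the latency panel"))    # False
```

Keeping one detector per branch makes its result a natural input for the `repeated_state` flag in the stop-rule code shown earlier.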

Follow-up questions

Use these checkpoints to test your routing instinct. Decide first, then compare against the answer sketch.

Final design checklist

Use this checklist to judge whether your final design is defensible:

Choose a routing strategy for trivial, medium, and hard verifiable tasks, and explain why each class gets that budget.
Defend when a task needs a single path, Best-of-N, tree search, or a bigger base model instead of more search.
Pick the right judge for the job: exact checker, ORM, or PRM, and explain what failure appears if you choose the wrong one.
Check that arithmetic and scheduling constraints are satisfiable before generating branches; reject or clarify impossible inputs instead of optimizing a fabricated solution.
Estimate rough KV-cache pressure for one branch, many branches, and a shared-prefix layout.
Define hard stop criteria for depth, tokens, wall-clock time, and repeated semantic states.
Describe what the user sees while the solver works, and why that stream should expose progress summaries instead of raw scratchpads.

What this completes

Additional inference work can outperform a fixed single attempt only when the task has enough uncertainty and the search budget is controlled. A practical design combines difficulty routing, Best-of-N sampling, Tree of Thoughts, partial-trace scoring with a Process Reward Model, and a bias toward exact evidence when available. KV-cache growth, prefix sharing, and final-answer latency determine whether that search design is affordable.

The roadmap ends here because earlier chapters taught the serving stack, retrieval layer, agent loop, evaluation harness, and safety controls one piece at a time. This capstone asks you to connect them: a reasoning agent is useful only when search policy, evidence contract, evaluator quality, KV-cache budget, streaming behavior, and fallback plan all work together.

Hold onto these design decisions:

Test-time compute buys additional attempts or checks; keep it only where held-out evaluation shows quality gains worth its cost and latency.
A strategy ladder covers most systems: single path with tuned reasoning effort -> self-consistency or Best-of-N -> Tree of Thoughts / MCTS. More compute helps only up to a problem-specific sweet spot; past it, overthinking degrades accuracy.
Process Reward Models (PRM) can score individual steps and enable earlier pruning, but exact checkers and calibrated evaluation remain authoritative where available.
Compute-optimal scheduling tries cheap attempts first and escalates only for hard, verifiable problems.
KV-cache design is a first-class constraint. Prefix sharing and paged KV storage decide whether search is affordable.
Streaming should focus on grounded progress updates and verified summaries, not raw internal scratchpads or uncalibrated confidence.

Your final deliverable is a proof-of-skill checklist that ties the algorithm, verifier, cache budget, streaming behavior, and fallback plan into one defensible system design.

Next Step

Continue to AI Lab Coding Interview: Python Systems

You'll turn the curriculum into interview execution: practical Python systems, staged requirements, concurrency invariants, and production-shaped tests.


References

[1] OpenAI. o1 Preview Model. 2024. https://developers.openai.com/api/docs/models/o1-preview

[2] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025. https://arxiv.org/abs/2501.12948

[3] Snell, C., et al. Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint, 2024. https://arxiv.org/abs/2408.03314

[4] Wei, J., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022.

[5] OpenAI. Reasoning models. 2026.

[6] OpenAI. GPT-5.5 Model. 2026.

[7] Gao, L., Schulman, J., & Hilton, J. Scaling Laws for Reward Model Overoptimization. 2023.

[8] Wang, X., et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023, 2023.

[9] Yao, S., et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS, 2023.

[10] Silver, D., et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint, 2017.

[11] Lightman, H., et al. Let's Verify Step by Step. ICLR, 2023.

[12] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023, 2023. https://arxiv.org/abs/2309.06180

[13] Zheng, L., Yin, L., Xie, Z., et al. SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024, 2024. https://arxiv.org/abs/2312.07104

[14] Ghosal, S. S., Chakraborty, S., Reddy, S., et al. Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models. 2025. https://arxiv.org/abs/2506.04210

Follow-up questions

Release owner asks, "Should we hold the canary if smoke tests passed but p95 latency is near the watch threshold?" Deploy state and metrics are both already in structured tools. Should router choose single path or Best-of-3 first?

Multi-step math: "A test suite has 40 cases. 15% are flaky. How many flaky cases can run in the first shard if the shard takes exactly one third of all cases and flaky cases must run first?" Should test-time compute help?

Your first trace on a dependency-resolution task falls in the "plausible but weak" band of an evaluator validated for that task family, and exact checking is not available. Do you retry with single path, move to Best-of-3, or jump straight to tree search?

Code debugging: "This routing algorithm fails on cyclic graphs. Find and fix the bug." What search path and verifier should you pair?

Dependency parsing extracts package versions correctly 98% of the time, but some lockfile diffs contain ambiguous prerelease tags. Should you send every request to deep reasoning?

You have one 12k-token shared prefix and six active branches, each adding a 4k-token suffix. Using the article's rough 320 KB/token estimate, what KV-cache footprint should you plan for, and why does prefix reuse matter?

Product team asks for a brand-voice apology email and wants three candidate phrasings to choose from. Should you use deep reasoning?
