Reasoning & Test-Time Compute

Understand how reasoning models trade extra inference compute for better answers, and what that means for search, verifiers, KV cache pressure, and routing.

State Space Models showed one way to keep decode state from growing with context length by compressing history into recurrent state. Reasoning models move in the opposite direction on purpose: they spend more inference compute when the task is hard enough to justify it.

Think about two incident agents handling the same failed deployment. The first accepts the first rollback plan that looks plausible. The second checks failing tests, deploy diff, error budget, database migration state, and rollback blast radius before committing. When those checks catch a real mistake, the second plan performs better, not because the agent has different data, but because it spends more compute on the decision.

The same idea now shapes how frontier large language models (LLMs) are built and evaluated. Classic scaling work focused on train-time compute: more parameters, more data, and more pretraining FLOPs (floating-point operations).^{[1]Reference 1Scaling Laws for Neural Language Modelshttps://arxiv.org/abs/2001.08361} Work from 2024 onward made a second axis impossible to ignore: on hard reasoning tasks, you can often get better answers by spending more compute during generation itself.^{[2]Reference 2Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.https://arxiv.org/abs/2408.03314}^{[3]Reference 3Learning to reason with LLMshttps://openai.com/index/learning-to-reason-with-llms/} This is test-time compute scaling. Reasoning-model APIs from OpenAI and open-weight systems such as DeepSeek-R1 made this shift visible, and Snell et al. showed that, on tasks where a smaller model already has a non-trivial chance of success, extra inference-time compute can beat a much larger single-pass model.^{[3]Reference 3Learning to reason with LLMshttps://openai.com/index/learning-to-reason-with-llms/}^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning}^{[5]Reference 5DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}^{[2]Reference 2Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.https://arxiv.org/abs/2408.03314}

Provider controls are model-specific and they change. OpenAI documents reasoning.effort; Google's Gemini docs distinguish Interactions API generation_config.thinking_level from GenerateContent thinkingConfig.thinkingLevel for Gemini 3 and thinkingConfig.thinkingBudget for Gemini 2.5; Anthropic's current Claude Opus 4.8 and 4.7 docs use adaptive thinking with an effort parameter, while older Claude models use manual extended thinking with budget_tokens.^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning}^{[6]Reference 6Gemini thinkinghttps://ai.google.dev/gemini-api/docs/thinking}^{[7]Reference 7Building with extended thinkinghttps://docs.claude.com/en/docs/build-with-claude/extended-thinking} The durable skill is deciding how much compute a request deserves, because more thinking isn't always better.^{[8]Reference 8Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Modelshttps://arxiv.org/abs/2506.04210}

Some problems need deliberate reasoning instead of fast pattern matching, and test-time compute scaling turns that extra work into a production design choice. You'll need a basic mental model of how transformers predict the next token (covered earlier in the preparation path). The practical goal is to choose between single-pass, best-of-N, and guided-search strategies, then explain why routing and token budgets matter as much as the algorithms.

Figure: Inference budget router showing easy requests sent to single pass, uncertain requests to best-of-N sampling, and hard verifiable requests to guided search, alongside a quality frontier that rises with more test-time compute. Caption: Test-time compute is a routing choice: easy requests stay cheap, uncertain requests buy samples, and hard verifiable requests may justify guided search.

System 1 and System 2 thinking

Psychologist Daniel Kahneman's framework is a useful starting point. System 1 is fast, instinctive, and pattern-driven. When you read the word "strawberry" and immediately know it's a fruit, that's System 1. System 2 is slow, deliberate, and step-by-step. When you count how many times the letter "r" appears in "strawberry," you have to switch to System 2 because your gut reaction ("two?") is often wrong. The correct answer is three, but you only get there by checking each letter deliberately.

The System 1/System 2 distinction is an analogy for serving behavior, not an architectural taxonomy. A low-budget single-pass call may answer from strong patterns; a model or surrounding system with more budget can check intermediate work, sample alternatives, or run tools. On tasks that require counting, lookahead, or multi-step deduction, that extra verification path can matter.

Reasoning-oriented models and systems are designed to make deliberate behavior easier to buy at inference time. Depending on the model and wrapper, they may spend non-visible reasoning tokens, sample alternatives, run an external verifier, or revise a candidate. Compare an agent that accepts the first patch with one that reproduces the failing test, checks the diff, verifies side effects, and only then commits.

This shift from pure pattern matching to deliberate reasoning is what makes test-time compute scaling possible.

Why low-budget generation can fail on hard problems

Here's a concrete example. Ask a model: "How many times does the letter 'r' appear in the word 'strawberry'?" A rushed answer can be "two," because the third "r" in "strawberry" is easy to miss without a check.

The problem isn't that the model lacks knowledge. It's that the task requires deliberate step-by-step verification, not fast association. Pretraining optimizes next-token prediction, which rewards fluent continuations more directly than checked final answers. On problems like complex math, code debugging, or incident triage, the same dynamic appears: the model produces a plausible-looking answer that collapses under scrutiny.

Test-time compute scaling can address this failure mode by spending budget before committing. Instead of one shot, the model or the system around it can explore multiple approaches, verify intermediate steps, and select a candidate. The extra compute only helps when exploration or verification changes the result for the better.
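A minimal sketch of "check before committing", assuming the surrounding system can run a small deterministic tool. The helper names `verified_letter_count` and `commit_answer` are illustrative and not part of any library: the first walks the word one character at a time, and the second only keeps the fast guess when the check agrees.

```python
def verified_letter_count(word: str, letter: str) -> int:
    """Count occurrences by inspecting every character, case-insensitively."""
    target = letter.lower()
    return sum(1 for ch in word.lower() if ch == target)


def commit_answer(fast_guess: int, word: str, letter: str) -> int:
    """Return the fast guess only if the deterministic check agrees with it."""
    checked = verified_letter_count(word, letter)
    return fast_guess if fast_guess == checked else checked


# A rushed guess of 2 is replaced by the checked count.
print(commit_answer(fast_guess=2, word="strawberry", letter="r"))
```

The check costs extra work on every call, which is the trade the rest of this lesson is about: it only pays off when the fast guess is wrong often enough to matter.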

The two axes of compute scaling

The operational distinction is simple: train-time compute is spent before deployment to improve the model. Test-time compute is spent during a live request: trying alternate fixes, checking evidence, and verifying the answer before committing.

Classic scaling emphasized train-time compute. Evaluations on reasoning tasks now show that inference-time allocation can sometimes compete with moving to a larger single-pass model.

Traditional scaling focuses on train-time compute, increasing model parameters $N$ or training tokens $D$ . The relationship between compute and loss follows a power law:

$L_{\text{train}}(C) \propto C^{-\alpha}, \quad C \approx 6ND$

Empirical scaling studies fit loss with an approximate power law of compute $C$ , where training compute is often estimated as roughly 6 × model parameters $N$ × training tokens $D$ .^{[1]Reference 1Scaling Laws for Neural Language Modelshttps://arxiv.org/abs/2001.08361} Within the measured regime, increasing training compute reduces loss predictably enough to guide training decisions.
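As a worked example of the two expressions above, the sketch below estimates training compute with the 6 × N × D rule of thumb and shows what a power law implies when compute grows tenfold. The model size, token count, and the exponent `alpha = 0.05` are illustrative assumptions, not fitted values from any paper.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def relative_loss(compute_ratio: float, alpha: float) -> float:
    """Loss multiplier implied by L proportional to C^(-alpha)
    when compute is scaled by compute_ratio."""
    return compute_ratio ** (-alpha)


# Hypothetical model: 7e9 parameters trained on 1e12 tokens.
flops = training_flops(params=7e9, tokens=1e12)
print(f"train_flops={flops:.2e}")

# Illustrative exponent only; real exponents come from fitting measured runs.
print(f"loss_multiplier_at_10x_compute={relative_loss(10.0, alpha=0.05):.3f}")
```

With a small exponent, ten times more compute shaves only a modest fraction off the loss, which is part of why a second axis of compute is attractive.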

Test-time compute scaling adds a second axis, inference compute $C_{\text{infer}}$ . Treat accuracy as a function of inference compute, task shape, policy, and verifier quality:

$\text{accuracy} = g(C_{\text{infer}}, \text{task}, \text{policy}, \text{verifier})$

In a measured operating range, more inference compute can improve accuracy with diminishing returns. It can also plateau or reduce accuracy when a trace overthinks, sampled candidates are correlated, or a verifier selects the wrong branch. That extra compute can come from longer reasoning traces, repeated sampling, revision loops, or explicit search with a verifier.^{[2]Reference 2Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.https://arxiv.org/abs/2408.03314}^{[8]Reference 8Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Modelshttps://arxiv.org/abs/2506.04210}

Snell et al.^{[2]Reference 2Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.https://arxiv.org/abs/2408.03314} don't claim one universal law for every model and every benchmark. Their more useful result is operational: when inference compute is allocated well, and the smaller model already has a non-trivial chance of solving the task, test-time scaling can outperform buying a much larger model and sampling once.

This changes how systems spend compute on hard prompts:

text

Traditional approach:     Train bigger → Answer once → Done
Reasoning approach:       Train reasoner → Think longer → Search for best answer

measured-budget-sweet-spot.py

budgets = [0, 128, 512, 2048]
verified_accuracy = [0.71, 0.78, 0.83, 0.80]
latency_ms = [220, 310, 610, 1840]
latency_limit_ms = 1000

eligible = [
    (accuracy, budget, latency)
    for budget, accuracy, latency in zip(budgets, verified_accuracy, latency_ms)
    if latency <= latency_limit_ms
]
accuracy, budget, latency = max(eligible)
print(f"chosen_budget={budget} accuracy={accuracy:.0%} latency_ms={latency}")
print(f"max_budget_is_best={verified_accuracy[-1] == max(verified_accuracy)}")

Output

chosen_budget=512 accuracy=83% latency_ms=610
max_budget_is_best=False

This evaluation table encodes the production question: choose the best measured quality under the service-level objective, rather than assuming the longest trace is best.

How reasoning models work

Reasoning-oriented training and inference policies allocate additional tokens or branches before returning an answer. The visible behavior can include decomposition, checks, or revision, but those behaviors are capabilities to evaluate rather than guarantees of every response.

Extended chain-of-thought

Reasoning models often generate long scratchpads or intermediate traces around a final answer. Sometimes that trace is exposed, sometimes it's summarized, and sometimes it's hidden entirely by provider policy. Some current APIs can interleave non-visible reasoning with visible output or tool calls. What matters isn't whether the user sees every token, but whether the model is allowed to spend additional inference-time compute while solving the task.

This built-in reasoning process differs from prompted chain-of-thought (CoT)^{[9]Reference 9Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.https://arxiv.org/abs/2201.11903} in several key ways:

Usually learned during post-training via Reinforcement Learning (RL), distillation, or both, rather than relying on prompt wording alone.
Potentially variable-length: the API or serving policy can permit different budgets per request.
Can include self-correction: a trace may backtrack, recognize dead ends, and revise earlier steps.
Often hidden or summarized: providers may not expose the raw reasoning tokens directly.

Current reasoning APIs make this concrete. OpenAI's reasoning docs describe reasoning tokens as non-visible output tokens that still consume context budget and count as billed output tokens. The docs also describe interleaved thinking for some current models, where visible output or tool calls can appear between reasoning steps.^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning}

The thinking budget is now a primary control surface, even though providers expose it with model-specific knobs. OpenAI's reasoning.effort sets an effort level for reasoning models.^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning} Gemini's Interactions API uses generation_config.thinking_level; GenerateContent uses thinkingConfig.thinkingLevel for Gemini 3 and thinkingConfig.thinkingBudget for Gemini 2.5, with SDK-specific casing such as Python thinking_config.thinking_level and thinking_budget.^{[6]Reference 6Gemini thinkinghttps://ai.google.dev/gemini-api/docs/thinking} Claude Opus 4.8 and 4.7 use adaptive thinking plus effort; Anthropic documents budget_tokens for older Claude extended-thinking models.^{[7]Reference 7Building with extended thinkinghttps://docs.claude.com/en/docs/build-with-claude/extended-thinking} Provider prompting guidance also differs by model, so start with a clear task and constraints, tune the supported knob, and evaluate instead of assuming a chain-of-thought prompt helps.

Test-time compute can mean either a longer single trace or multiple sampled traces. OpenAI's o1 launch post made this visible at the benchmark level: on AIME 2024 (a math competition benchmark), reported accuracy improved when the system moved from a single sample to consensus over 64 samples and then to learned reranking over 1000 samples.^{[3]Reference 3Learning to reason with LLMshttps://openai.com/index/learning-to-reason-with-llms/} Treat this as one reported evaluation result, not a guarantee for other tasks or selection rules.

Deliberation and search at inference time

Test-time compute is an umbrella term, not a single algorithm. Some systems sample many complete answers and pick a winner. Some iteratively critique and revise one candidate. Others run explicit search over partial reasoning states using a verifier or reward model. All of these patterns branch from the same idea: spend extra compute to explore, score, and refine before returning an answer.

Figure: Reasoning search flow showing one prompt branching into candidate paths, a verifier scoring and pruning them, and one selected answer returning. Caption: Reasoning delays commitment: extend, sample, revise, score, then choose the path worth trusting.

Not every reasoning model literally runs beam search or a process reward model (PRM) at inference time. The shared pattern is optional inference compute allocation. A good routing policy keeps easy requests cheap and assigns more tokens, branches, or verification only where evaluation shows a payoff.
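A routing policy of this kind can be sketched as a small function. Everything here is an assumption for illustration: the `Request` fields, the difficulty estimate (which would come from some classifier or heuristic), and the thresholds, which would need to be tuned against measured quality and latency.

```python
from dataclasses import dataclass


@dataclass
class Request:
    difficulty: float       # estimated difficulty in [0, 1] (hypothetical signal)
    verifiable: bool        # True if a checker exists (tests, exact answer)
    latency_budget_ms: int  # how long the caller is willing to wait


def route(req: Request) -> str:
    """Pick an inference strategy. Thresholds are illustrative, not tuned."""
    if req.difficulty < 0.3:
        return "single_pass"
    # Guided search only when there is something to verify and time to spend.
    if req.verifiable and req.difficulty >= 0.7 and req.latency_budget_ms >= 5000:
        return "guided_search"
    return "best_of_n"


examples = [
    Request(difficulty=0.1, verifiable=False, latency_budget_ms=800),
    Request(difficulty=0.5, verifiable=True, latency_budget_ms=2000),
    Request(difficulty=0.9, verifiable=True, latency_budget_ms=10000),
    Request(difficulty=0.9, verifiable=True, latency_budget_ms=1000),
]
for req in examples:
    print(f"difficulty={req.difficulty} -> {route(req)}")
```

The last example shows the latency budget overriding difficulty: a hard, verifiable request with a tight deadline falls back to sampling instead of search.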

Provider-reported hidden reasoning

Some hosted reasoning APIs report non-visible reasoning tokens in usage accounting while returning only a final visible answer. Others can interleave visible output or tool calls with non-visible work. OpenAI's reasoning-token documentation describes both patterns.^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning} Don't generalize this into one universal architecture: an open-weight deployment, explicit search wrapper, or provider with summarized thinking can expose and account for intermediate work differently.

Figure: Provider-reported reasoning budget example showing non-visible work, visible output, and usage accounting; timing can be hidden-only or interleaved. Caption: For APIs that report non-visible reasoning tokens, that work still affects output budget and context use. Latency and temporary serving state depend on whether the implementation runs hidden work before or between visible outputs.

For such APIs, non-visible tokens matter because they consume context window space and may count toward billing even though users never see them.^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning} In transformer runtimes that materialize those tokens autoregressively, they also add decode work and temporary key-value (KV) cache state while generation is in progress. Exact cache behavior is provider- and engine-specific.

hidden-token-accounting.py

visible_tokens = 180
reasoning_tokens = 1220
output_price_per_million = 10.0

billed_output_tokens = visible_tokens + reasoning_tokens
cost = billed_output_tokens / 1_000_000 * output_price_per_million
print(f"visible={visible_tokens} billed_output={billed_output_tokens}")
print(f"visible_fraction={visible_tokens / billed_output_tokens:.1%} output_cost=${cost:.4f}")

Output

visible=180 billed_output=1400
visible_fraction=12.9% output_cost=$0.0140

Use the provider's usage schema and pricing table when implementing this calculation; the example demonstrates why billing and capacity dashboards can't count visible text alone.
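To make the KV cache point concrete, the sketch below estimates per-sequence cache size for a standard transformer that stores keys and values for every layer and token. The model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 600-token prompt are hypothetical, and the estimate ignores paging, quantization, and any engine-specific cache reuse.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Keys plus values for every layer and token of one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical serving configuration and request shape.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
prompt_tokens = 600
visible_tokens = 180
reasoning_tokens = 1220

without_reasoning = kv_cache_bytes(prompt_tokens + visible_tokens, LAYERS, KV_HEADS, HEAD_DIM)
with_reasoning = kv_cache_bytes(
    prompt_tokens + visible_tokens + reasoning_tokens, LAYERS, KV_HEADS, HEAD_DIM
)

print(f"per_token_kib={kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM) / 1024:.0f}")
print(f"without_reasoning_mib={without_reasoning / 2**20:.1f}")
print(f"with_reasoning_mib={with_reasoning / 2**20:.1f}")
```

Under these assumptions each token holds 128 KiB of cache, so the same visible answer occupies roughly 250 MiB instead of 97.5 MiB while the reasoning tokens are live.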

Test-time compute strategies

Inference-time compute usually appears in a few patterns:

1. Best-of-N sampling

The simplest pattern is to generate 16 candidate fixes for a failed test and keep the one with the best verifier score. Each candidate is sampled separately, and more attempts raise your ceiling as long as they add useful diversity and your verifier or selection rule can reliably identify the best one. Shared model biases make the attempts correlated, not truly independent.

Best-of-N samples a generative model $N$ times on the same prompt, then returns one winner by reward-model score or self-consistency (majority vote on extracted final answers).

Cost note: This method is easily parallelizable (all N attempts can run simultaneously), making it simple to implement but potentially expensive since you pay for all N completions.

1-best-of-n-sampling.py

from collections import Counter

def extract_answer(completion: str) -> str:
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def best_of_n(
    model,
    prompt: str,
    n: int = 16,
    reward_model=None
) -> str:
    """Generate N responses and return the best-scored candidate completion.

    Cost: O(N) full generations, embarrassingly parallel.
    Best for: Problems with verifiable correctness (math, code).
    """
    candidates = [model.generate(prompt) for _ in range(n)]

    if reward_model:
        scores = [reward_model.score(prompt, c) for c in candidates]
        return candidates[scores.index(max(scores))]
    else:
        # Self-consistency: majority vote on final answer
        answers = [extract_answer(c) for c in candidates]
        answer_counts = Counter(answers)
        best_answer = answer_counts.most_common(1)[0][0]
        return next(c for c in candidates if extract_answer(c) == best_answer)
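The cost note above separates two quantities that are easy to conflate: parallel sampling shortens wall-clock time but not the bill. The sketch below uses made-up per-sample token counts and latencies, and it ignores scoring time and batching effects, which a real deployment would need to measure.

```python
from typing import List, Tuple


def best_of_n_cost(
    sample_tokens: List[int],
    sample_latency_ms: List[int],
    parallel: bool,
) -> Tuple[int, int]:
    """Return (billed_output_tokens, wall_clock_ms) for one best-of-N request."""
    billed = sum(sample_tokens)  # every sampled completion is paid for
    # Parallel runs wait for the slowest sample; sequential runs wait for all.
    wall_clock = max(sample_latency_ms) if parallel else sum(sample_latency_ms)
    return billed, wall_clock


# Hypothetical measurements for N = 4 samples.
tokens = [700, 820, 760, 910]
latency_ms = [1400, 1650, 1500, 1800]

print("parallel:  ", best_of_n_cost(tokens, latency_ms, parallel=True))
print("sequential:", best_of_n_cost(tokens, latency_ms, parallel=False))
```

Both modes bill the same 3190 output tokens; only the waiting time differs.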

Scaling behavior

With an oracle verifier and independent samples, the chance of generating at least one correct answer in $N$ tries follows:

$P_{\text{success}} = 1 - (1 - p)^N$

Where $p$ is the base probability of the model generating a correct answer on a single attempt. Real systems do worse than this idealized formula because samples are correlated and verifiers make mistakes, but the equation explains why best-of-N works at all. OpenAI's o1 launch post reports the same pattern on AIME 2024: o1 improved from 74% with one sample to 83% with 64-sample consensus and 93% when reranking 1000 samples with a learned scorer.^{[3]Reference 3Learning to reason with LLMshttps://openai.com/index/learning-to-reason-with-llms/}

best_of_n_budget.py

def success_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

base_hit_rate = 0.25
tokens_per_attempt = 800

for n in [1, 4, 16, 64]:
    success = success_probability(base_hit_rate, n)
    output_tokens = n * tokens_per_attempt
    print(f"N={n:>2} success={success:5.1%} output_tokens={output_tokens:>5}")

Output

N= 1 success=25.0% output_tokens=  800
N= 4 success=68.4% output_tokens= 3200
N=16 success=99.0% output_tokens=12800
N=64 success=100.0% output_tokens=51200

The toy calculation shows why best-of-N is attractive and dangerous. The idealized success curve rises quickly, but cost rises linearly with every sampled completion. Real systems also plateau earlier because samples share model biases and verifiers make mistakes.
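Inverting the idealized formula gives a quick budget estimate: the smallest $N$ with $1 - (1 - p)^N \ge t$ is $N = \lceil \ln(1 - t) / \ln(1 - p) \rceil$. The sketch below computes that under the same assumptions as the formula (independent samples and an oracle verifier), so treat the result as a lower bound on what a real system needs.

```python
import math


def samples_needed(p: float, target: float) -> int:
    """Smallest N with 1 - (1 - p)**N >= target, assuming independent
    samples and a perfect verifier."""
    if not (0.0 < p < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for p in [0.05, 0.25, 0.50]:
    print(f"base_hit_rate={p:.2f} samples_for_90_percent={samples_needed(p, 0.90)}")
```

The required sample count grows sharply as the base hit rate falls, which is why best-of-N is a poor fit for tasks the model almost never solves in one attempt.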

best-of-n-with-selection-errors.py

def selected_success_probability(candidate_success: float, selector_recall: float, n: int) -> float:
    return candidate_success if n == 1 else candidate_success * selector_recall

base_hit_rate = 0.25
selector_recall = 0.80
for n in [1, 4, 16]:
    oracle_success = 1 - (1 - base_hit_rate) ** n
    deployed_success = selected_success_probability(oracle_success, selector_recall, n)
    print(f"N={n:>2} oracle={oracle_success:5.1%} with_selector={deployed_success:5.1%}")

Output

N= 1 oracle=25.0% with_selector=25.0%
N= 4 oracle=68.4% with_selector=54.7%
N=16 oracle=99.0% with_selector=79.2%

This toy selector is deliberately simple. With one candidate, there's nothing to select. Once multiple candidates exist, generating a correct candidate and selecting it become separate failure surfaces.

Figure: Best-of-N success probability curves showing fast early gains and diminishing returns for different base hit rates. Caption: Best-of-N helps most when the base model already has a meaningful hit rate and selection is reliable. If either condition fails, more samples mostly buy more repeated errors.
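Correlated samples can be illustrated with a deliberately simple mixture model, which is an assumption of this sketch and not a result from the cited papers. With probability `shared_failure` the prompt triggers a mistake every sample shares, so no number of samples helps. Otherwise samples are independent, with the per-sample hit rate rescaled so the single-sample success rate stays at `p`.

```python
def correlated_success(p: float, shared_failure: float, n: int) -> float:
    """Oracle-verifier success for N samples under a shared-failure mixture.

    p:              marginal single-sample success rate
    shared_failure: probability that all samples fail together (must be <= 1 - p)
    """
    if shared_failure > 1.0 - p:
        raise ValueError("shared_failure cannot exceed 1 - p")
    if shared_failure == 1.0:
        return 0.0
    conditional_p = p / (1.0 - shared_failure)
    return (1.0 - shared_failure) * (1.0 - (1.0 - conditional_p) ** n)


for n in [1, 4, 16, 64]:
    independent = correlated_success(0.25, 0.0, n)
    correlated = correlated_success(0.25, 0.5, n)
    print(f"N={n:>2} independent={independent:.3f} correlated={correlated:.3f}")
```

In this toy model the correlated curve can never exceed `1 - shared_failure`, no matter how many samples are drawn, while the independent curve keeps climbing toward 1.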

2. Sequential revision

This is like drafting a rollback plan, checking it against logs and migration state, then revising the weak steps before committing. Unlike best-of-N (where each attempt is sampled separately from the same prompt), each revision builds on the previous one.

Sequential revision starts with one draft, then loops: critique the draft, and if the critique finds issues, regenerate an improved version. The sketch stops when the critique reports "no errors found". Production systems usually replace that string match with a trained verifier or structured critique schema.

2-sequential-revision.py

def iterative_refinement(model, prompt: str, max_rounds: int = 5) -> str:
    """Generate, critique, and refine until convergence.

    Cost: O(rounds) sequential passes, each building on previous.
    Best for: Open-ended tasks (writing, analysis, planning).
    """
    response = model.generate(prompt)

    for _ in range(max_rounds):
        critique = model.generate(
            f"Find errors or improvements in this response:\n"
            f"Question: {prompt}\nResponse: {response}"
        )

        if "no errors found" in critique.lower():
            break

        response = model.generate(
            f"Question: {prompt}\n"
            f"Previous response: {response}\n"
            f"Critique: {critique}\n"
            f"Provide an improved response:"
        )

    return response

revision-early-stop.py

verified_scores = [0.61, 0.76, 0.78, 0.781, 0.781]
minimum_gain = 0.01
used_rounds = 1

for previous, current in zip(verified_scores, verified_scores[1:]):
    if current - previous < minimum_gain:
        break
    used_rounds += 1

print(f"used_rounds={used_rounds} selected_score={verified_scores[used_rounds - 1]:.3f}")
print(f"skipped_rounds={len(verified_scores) - used_rounds}")

Output

used_rounds=3 selected_score=0.780
skipped_rounds=2

3. Tree search with process reward models

Tree search is heavier: a repair agent considers multiple patch paths, evaluates each action, and prunes bad branches early. Instead of scoring just the final answer ("did the fix pass?"), a process reward model (PRM) scores each intermediate step ("was the failing test reproduced?" or "does this diff touch the right module?").

A simplified beam search using a PRM accepts a model and a PRM, taking a prompt as the initial state. At each step, it generates multiple possible next steps (the beam width), scores them with the PRM, and keeps only the highest-scoring paths. The function outputs the best completed reasoning chain. This is a conceptual interface sketch; it assumes model.generate_step, prm.score_step, and is_final_answer exist.

3-tree-search-with-process-reward-models.py

def beam_search_with_prm(
    model,
    prm,
    prompt: str,
    beam_width: int = 4,
    branch_factor: int = 4,
    max_steps: int = 20
):
    """Guided tree search using per-step reward scores.

    Cost: O(beam_width × branch_factor × max_steps) forward passes.
    Best for: Multi-step mathematical or logical reasoning.
    """
    beams = [{"steps": [], "score": 0.0, "text": prompt}]

    for step in range(max_steps):
        candidates = []
        for beam in beams:
            # Generate next reasoning step
            new_steps = [
                model.generate_step(beam["text"])
                for _ in range(branch_factor)
            ]

            for new_step in new_steps:
                # Score each step with the process reward model
                step_score = prm.score_step(
                    prompt, beam["steps"], new_step
                )
                candidates.append({
                    "steps": beam["steps"] + [new_step],
                    "score": beam["score"] + step_score,
                    "text": beam["text"] + "\n" + new_step,
                })

        # Keep top-k beams
        beams = sorted(candidates, key=lambda x: x["score"], reverse=True)[:beam_width]

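        # Stop once a kept beam reaches a final answer. Beams are sorted by
        # score, so the first finished beam is the highest-scoring finished one.
        finished = [b for b in beams if is_final_answer(b["steps"][-1])]
        if finished:
            return finished[0]

    # No beam finished within max_steps: return the best partial trace.
    return beams[0]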

Beam search is easiest to explain, but it has a well-known weakness for reasoning: beams can collapse onto near-duplicate traces. Stronger systems inject diversity, allow backtracking, or use Monte Carlo Tree Search-style expansion policies. Building the tree is straightforward compared with building a verifier good enough to prune bad branches early.
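One crude way to inject diversity is to drop near-duplicate candidates before keeping the top beams. The sketch below is an assumption-laden illustration: it treats two traces as duplicates only if they match after lowercasing and whitespace collapsing, whereas a real system might compare embeddings or extracted intermediate results. The candidate dictionaries follow the same `text` and `score` shape used in the beam search listing.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different traces compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def prune_with_diversity(candidates: List[Dict], beam_width: int) -> List[Dict]:
    """Keep the top-scoring candidates while skipping normalized duplicates."""
    kept: List[Dict] = []
    seen = set()
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        key = normalize(cand["text"])
        if key in seen:
            continue  # a higher-scoring copy of this trace is already kept
        seen.add(key)
        kept.append(cand)
        if len(kept) == beam_width:
            break
    return kept


candidates = [
    {"text": "Reproduce the failing test", "score": 0.92},
    {"text": "reproduce  the failing test ", "score": 0.91},
    {"text": "Inspect the deploy diff", "score": 0.80},
    {"text": "Roll back immediately", "score": 0.40},
]
for beam in prune_with_diversity(candidates, beam_width=2):
    print(beam["score"], beam["text"])
```

Without the filter, both kept beams would be the same step written twice, and the search would spend its width on one idea.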

Process reward models vs. outcome reward models

The choice of reward model changes search efficiency:

Figure: Outcome reward model versus process reward model comparison: ORM scores only final answers, while a reliable PRM can score steps and prune low-scoring paths early. Caption: ORMs grade completed answers. A sufficiently reliable PRM can score intermediate steps and prune low-scoring branches before they burn a full rollout.

A final CI audit and a step-by-step audit answer different questions. An ORM only checks whether the final answer works: "tests passed" or "tests failed." It can't tell you where the plan broke. A PRM checks each intermediate decision: "failure reproduced," "right module selected," "unsafe migration path rejected." The per-step feedback is much more expensive to provide, and its own mistakes can discard a good path. When it's reliable enough, it can catch errors early and avoid wasting time on low-quality paths.

Outcome reward models (ORMs)

Outcome Reward Models (ORMs) evaluate a completed trajectory rather than its intermediate steps. The outcome signal can be binary correctness, a probability or scalar score, or a preference-derived comparison. Exact-answer math makes binary grading convenient, but that isn't part of the ORM definition.

Score completed trajectories: The model receives an outcome signal after the answer or trajectory is complete, such as exact correctness, a scalar reward, or a preference comparison.
Simpler supervision boundary: Labels attach to completed outputs rather than every reasoning step. Verifiable tasks can supply these labels automatically; subjective tasks may need learned or human preference signals.
Search limitation: They can't guide intermediate reasoning steps. A generator might make a fatal error in step 1 but continue generating 100 more steps before the ORM finally rejects it.
Delayed feedback: The system must generate complete solutions before any evaluation can happen, heavily limiting the usefulness of tree search.

How PRMs work

Process Reward Models (PRMs), in contrast, evaluate the validity of each intermediate step in a chain of thought. By providing step-by-step guidance, they allow search algorithms to quickly abandon incorrect reasoning paths before wasting compute on them.

Granular evaluation: They score each reasoning step conditioned on the problem and preceding steps, identifying where a path first becomes weak or invalid.
Early pruning: They can prune low-scoring paths before completion, improving search efficiency when their intermediate scores predict final correctness.
Complex data requirements: They need step-level supervision from some source. Lightman et al. trained their strongest PRM with human step labels, while methods such as Math-Shepherd derive process targets automatically from rollout outcomes.^{[10]Reference 10Let's Verify Step by Step.https://arxiv.org/abs/2305.20050}^{[11]Reference 11Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotationshttps://arxiv.org/abs/2312.08935}
Denser guidance: A process score gives a search controller earlier evidence for keeping or abandoning a branch, but it's not proof that a kept step is correct.

This side-by-side comparison shows how each model scores the same problem:

how-prms-work.py

# ORM: can only evaluate after the full solution is generated.
# It returns a single score for the complete answer.
orm_score = orm.score(
    question="What is 847 × 293?",
    full_solution="847 × 293 = 248,171"
)  # returns 1.0 (correct) or 0.0 (incorrect)

# PRM: evaluates each step given the question and preceding steps.
# This lets the search algorithm prune bad paths before they grow.
prm_scores = prm.score_steps(
    question="What is 847 × 293?",
    steps=[
        "First, I'll break this into 847 × 300 - 847 × 7",  # step score: 0.95
        "847 × 300 = 254,100",                               # step score: 0.98
        "847 × 7 = 5,929",                                   # step score: 0.97
        "254,100 - 5,929 = 248,171",                         # step score: 0.99
    ]
)
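A search controller usually needs one number per path, so the per-step scores have to be combined somehow. The sketch below compares three simple aggregation rules (product, minimum, and last step) on the illustrative step scores from the listing above and on a second, made-up path with one weak middle step. Which rule works best is an empirical question for a given PRM and task; nothing here is prescribed by the cited papers.

```python
import math
from typing import List


def aggregate(step_scores: List[float], method: str) -> float:
    """Combine per-step PRM scores into a single path score."""
    if method == "product":
        return math.prod(step_scores)  # penalizes every uncertain step
    if method == "min":
        return min(step_scores)        # path is as strong as its weakest step
    if method == "last":
        return step_scores[-1]         # trusts the final step's score alone
    raise ValueError(f"unknown method: {method}")


clean_path = [0.95, 0.98, 0.97, 0.99]
weak_middle_path = [0.95, 0.30, 0.97, 0.99]  # hypothetical path with one bad step

for method in ["product", "min", "last"]:
    clean = aggregate(clean_path, method)
    weak = aggregate(weak_middle_path, method)
    print(f"{method:>7}: clean={clean:.3f} weak_middle={weak:.3f}")
```

The last-step rule scores both paths identically here, while product and minimum both expose the weak step, which is the kind of difference worth checking in an evaluation before wiring a PRM into pruning.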

Snell et al.^{[2]Reference 2Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.https://arxiv.org/abs/2408.03314} reported that compute-optimal test-time scaling can be more than 4× as efficient as a best-of-N baseline in their math-reasoning evaluation. Lightman et al.^{[10]Reference 10Let's Verify Step by Step.https://arxiv.org/abs/2305.20050} reported that process supervision outperformed outcome supervision on MATH. These results motivate testing PRM-guided pruning on verifiable workloads; they don't establish that every PRM or task benefits.

Not every reasoning system uses a learned PRM or ORM. DeepSeek-R1-Zero trained with rule-based accuracy and format rewards rather than a neural reward model, which is one reason you should treat test-time compute as a family of techniques, not a single stack.^{[5]Reference 5DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}
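A rule-based reward of this kind can be sketched with two checks: one for format and one for accuracy. The regular expressions, the `\boxed{...}` answer convention, and the 0.5 format weight below are assumptions made for illustration; the paper describes accuracy and format rewards but this is not its implementation.

```python
import re

# Assumed format rule: reasoning inside <think>...</think>, then a visible answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)
# Assumed answer convention: the final answer appears as \boxed{...}.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed think-then-answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the last boxed answer exactly matches the reference string."""
    matches = BOXED_PATTERN.findall(completion)
    return 1.0 if matches and matches[-1].strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Weighted sum of both checks; the weight is a hypothetical choice."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


good = "<think>2 + 2 is 4.</think>\nThe answer is \\boxed{4}"
bad = "The answer is \\boxed{5}"
print(total_reward(good, "4"), total_reward(bad, "4"))
```

Because both checks are deterministic rules, there is no learned scorer for the policy to exploit, although exact string matching would need a more forgiving comparison for answers with equivalent forms.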

prm-pruning-budget.py

branch_lengths = [8, 8, 8, 8]
prune_at_step = [None, 2, 3, 1]

orm_scored_steps = sum(branch_lengths)
prm_scored_steps = sum(
    full_length if stop is None else stop
    for full_length, stop in zip(branch_lengths, prune_at_step)
)
print(f"orm_steps={orm_scored_steps} prm_steps={prm_scored_steps}")
print(f"saved_steps={orm_scored_steps - prm_scored_steps}")

Output

orm_steps=32 prm_steps=14
saved_steps=18

This is the upside case in which the process scorer prunes the right paths. A deployment evaluation must also count false pruning of branches that would have ended correct.
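Counting false pruning needs labels for what each branch would have produced if it had run to completion, which in practice means running full rollouts offline on an evaluation set. The sketch below uses made-up labels in a hypothetical `would_be_correct` field to report savings and losses side by side.

```python
# Hypothetical offline evaluation: each branch was also run to completion
# so we know whether pruning it discarded a correct answer.
branches = [
    {"length": 8, "pruned_at": None, "would_be_correct": True},
    {"length": 8, "pruned_at": 2, "would_be_correct": False},
    {"length": 8, "pruned_at": 3, "would_be_correct": True},   # false prune
    {"length": 8, "pruned_at": 1, "would_be_correct": False},
]

pruned = [b for b in branches if b["pruned_at"] is not None]

saved_steps = sum(b["length"] - b["pruned_at"] for b in pruned)
false_prunes = sum(1 for b in pruned if b["would_be_correct"])
surviving_correct = sum(
    1 for b in branches if b["pruned_at"] is None and b["would_be_correct"]
)

print(f"saved_steps={saved_steps} false_prunes={false_prunes}")
print(f"surviving_correct={surviving_correct}")
```

Here the request still succeeds because one correct branch survives, but the same pruner would turn a solvable request into a failure if the only correct branch were the one it cut.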

DeepSeek-R1-Zero and DeepSeek-R1

DeepSeek's contribution is easiest to understand as two related results. DeepSeek-R1-Zero showed that large-scale RL with verifiable rewards can induce strong reasoning behaviors without a supervised cold start.^{[5]Reference 5DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} DeepSeek-R1 then added cold-start data and additional post-training to make those behaviors more stable and readable.^{[5]Reference 5DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948}

Training pipeline

The final DeepSeek-R1 pipeline isn't "pure RL" end to end. The paper explicitly describes two supervised fine-tuning (SFT) stages and two RL stages:

Base model: Start with DeepSeek-V3-Base^{[5]}, a 671B Mixture-of-Experts (MoE) model with 37B active parameters per token.
Cold-start Supervised Fine-Tuning (SFT): Collect thousands of long CoT examples and fine-tune the base model. This helps avoid the readability and language-mixing issues reported for R1-Zero.
Reasoning-oriented RL: Run GRPO (Group Relative Policy Optimization)-based RL on reasoning tasks with verifiable, rule-based rewards, similar in spirit to R1-Zero.^{[5]}
Rejection sampling + second SFT: Use the RL checkpoint to generate high-quality reasoning traces, mix them with supervised data from DeepSeek-V3 for non-reasoning domains, and retrain the base model.
Final RL for all scenarios: Run another RL stage across a broader prompt mix to produce DeepSeek-R1.

That sequence matters because people often summarize R1 as "pure RL." Only R1-Zero fits that description.^{[5]}
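The reasoning-oriented RL stage can be illustrated with a small sketch of rule-based rewards feeding a group-relative advantage, where each sampled completion's reward is normalized against the other samples for the same prompt. The regular expression, the reward weights, the exact-match answer check, the use of population standard deviation, and the sample completions are all assumptions made for illustration; this is not DeepSeek's implementation.

```python
import re
import statistics

# Assumed format rule: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*(?P<answer>.+)$", re.DOTALL)


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score one completion with a format check plus an exact-match accuracy check."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # wrong format earns nothing in this sketch
    format_reward = 0.1  # assumed weight
    accuracy_reward = 1.0 if match.group("answer").strip() == reference_answer else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one prompt's sample group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(reward - mean) / std for reward in rewards]


# Four hypothetical samples for the prompt "What is 12 * 4?"
group = [
    "<think>12 * 4 = 48</think> 48",
    "<think>12 * 4 = 46</think> 46",
    "48",  # correct answer, but missing the reasoning tags
    "<think>4 * 12 = 48</think> 48",
]
rewards = [rule_based_reward(completion, "48") for completion in group]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.2f}")
```

Because the baseline comes from the group itself, no separate learned value model or neural reward model is needed in this sketch; the checks are plain rules.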

Emergent reasoning

R1-Zero displayed behaviors such as self-verification, backtracking, and longer rollouts, but the paper also reports readability issues and occasional language mixing.^{[5]} That detail matters. It means verifiable rewards can induce useful reasoning behavior, but raw RL rollouts still need cleanup before they become a general-purpose product.

Compute-optimal inference: when to think harder

Not all problems benefit equally from extended reasoning, and on some tasks extra thinking actively hurts. Ghosal et al. studied this directly: across reasoning models, accuracy often rises with a little more thinking and then falls as the trace grows, an inverted-U rather than a monotonic climb.^{[8]} They attribute the apparent early gains partly to higher output variance rather than genuinely better reasoning, and they recommend parallel sampling (best-of-N) over endlessly extending one trace. An Anthropic study found the same inverse-scaling effect on several tasks, where longer reasoning amplified distractions and errors rather than fixing them.^{[12]} The lesson for production is that thinking budget is a real tuning parameter with a sweet spot, not a slider you turn to maximum.
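One way to act on the inverted-U is to sweep budgets offline and pick the smallest one that lands near the peak. The sketch below assumes a hypothetical table of measured accuracy per thinking budget and an arbitrary `tolerance`; neither comes from the cited studies.

```python
# Hypothetical evaluation results: thinking budget (tokens) -> measured accuracy.
measured = {0: 0.61, 256: 0.70, 1024: 0.73, 4096: 0.74, 16384: 0.66}
tolerance = 0.02  # assumed: accept budgets within two points of the best accuracy

best_accuracy = max(measured.values())
peak_budget = max(measured, key=measured.get)

# Smallest budget whose accuracy is within tolerance of the peak.
chosen_budget = min(
    budget for budget, accuracy in measured.items()
    if accuracy >= best_accuracy - tolerance
)

# Budgets past the peak that score worse than the peak: more thinking hurt here.
past_peak = [
    budget for budget, accuracy in measured.items()
    if budget > peak_budget and accuracy < best_accuracy
]
print(f"peak_budget={peak_budget} chosen_budget={chosen_budget}")
print(f"budgets_where_more_thinking_hurt={past_peak}")
```

Choosing the cheapest near-peak budget instead of the peak itself trades a point of accuracy for a quarter of the tokens in this made-up sweep; the right tolerance depends on your own measurements.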

The optimal test-time compute allocation depends on problem difficulty:

Problem Type	Candidate strategy to evaluate	Example
Factual recall	Single pass	"What is the CLI flag for dry-run deploys?"
Simple reasoning	Short reasoning budget or brief scratchpad	"Can this migration run after the schema lock is released?"
Multi-step math	Longer scratchpad or best-of-N	GPU capacity or request-rate calculation
Complex code	Deliberation plus search or repair loops	Multi-file debugging and repair
Open-ended analysis	Sequential revision	Incident review or rollout-risk analysis

The cross-over point

A task family may have a measured compute budget where a smaller model plus a test-time policy beats a larger single-pass model:

$Q_{\text{small+policy}}(C^*) > Q_{\text{large,single}}$

Here, $Q$ is measured quality and $C^*$ is one evaluated inference budget. Finding a crossover at $C^*$ doesn't mean every larger budget keeps helping. Quality can plateau or fall, and the preferred path can change with task difficulty, latency limits, verifier quality, and how many requests you expect to serve.

Snell et al.^{[2]} observed that in FLOPs-matched evaluations, a smaller model with additional test-time compute can outperform a model roughly 14× larger answering in a single pass on problems where the small model already achieves non-trivial success rates. This makes longer thinking a candidate for compensating for model size on similar evaluated tasks, not a general replacement for larger models.
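A back-of-envelope version of that comparison helps show why the small model's starting success rate matters. This simplified sketch matches inference FLOPs only, using the common approximation of about 2 FLOPs per parameter per generated token for a dense decoder. The parameter counts, answer length, and per-sample success rate are hypothetical, and the coverage figure is an upper bound that assumes independent samples and a verifier that always picks a correct sample when one exists.

```python
small_params = 3_000_000_000        # hypothetical small model
large_params = 14 * small_params    # a model 14x larger, as in the comparison above
tokens_per_answer = 500             # assumed average generation length
per_sample_success = 0.20           # assumed small-model success rate on this task family


def decode_flops(params, tokens):
    """Approximate forward-pass FLOPs: about 2 * params per generated token."""
    return 2 * params * tokens


large_single_pass = decode_flops(large_params, tokens_per_answer)
small_one_sample = decode_flops(small_params, tokens_per_answer)
matched_samples = large_single_pass // small_one_sample

# Chance that at least one of N independent samples is correct.
coverage = 1 - (1 - per_sample_success) ** matched_samples
print(f"matched_samples={matched_samples} coverage_upper_bound={coverage:.3f}")

# With a near-zero starting success rate, the same budget buys very little.
weak_coverage = 1 - (1 - 0.01) ** matched_samples
print(f"coverage_if_success_rate_is_1_percent={weak_coverage:.3f}")
```

The second print is the point of the exercise: the same 14 samples barely help when the per-sample success rate is tiny, which is why the result is scoped to problems the small model can already sometimes solve.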

Figure: Compute allocation router choosing between fast single-pass generation, short reasoning budgets, and guided search based on task signals. A production router should decide whether the request earns the expensive path. Difficulty, stakes, latency budget, and verifier availability matter as much as the model name.

Production systems often use a routing layer to handle varying difficulty without exploding cost. The router evaluates task complexity and directs the request to the appropriate model and test-time path, balancing latency, quality, and computational cost.

measured-routing-policy.py

# Candidate paths with quality, latency, and cost taken from an offline evaluation.
routes = [
    {"name": "single-pass", "quality": 0.78, "latency_ms": 240, "cost": 0.002},
    {"name": "bounded-reasoning", "quality": 0.86, "latency_ms": 780, "cost": 0.010},
    {"name": "guided-search", "quality": 0.90, "latency_ms": 2600, "cost": 0.045},
]
latency_slo_ms = 1000
minimum_quality = 0.84

# Keep only routes that satisfy both the latency SLO and the quality floor.
eligible = [
    route for route in routes
    if route["latency_ms"] <= latency_slo_ms and route["quality"] >= minimum_quality
]
# Among eligible routes, pay for the cheapest one.
chosen = min(eligible, key=lambda route: route["cost"])
print(f"route={chosen['name']} quality={chosen['quality']:.0%}")
print(f"latency_ms={chosen['latency_ms']} cost=${chosen['cost']:.3f}")

Output

route=bounded-reasoning quality=86%
latency_ms=780 cost=$0.010

Production considerations

Reasoning traffic changes capacity planning. Longer traces and more branches raise cache residency, first-token delay, and cost even when the visible answer stays short.

KV cache pressure and prefix sharing

Reasoning workloads stress inference engines differently from ordinary chat because they can generate far more intermediate tokens before or between visible outputs.

For every generated token, the server appends a Key vector and a Value vector for each KV head in every layer:

$\text{KV bytes per token} = 2 \times L \times n_{kv} \times d_h \times b$

Where $L$ is the number of layers, $n_{kv}$ is the number of KV heads, $d_h$ is the head dimension, and $b$ is bytes per value. The important scaling fact is simple: KV cache memory grows linearly with sequence length. A rollout that spends 10,000 tokens thinking creates roughly 10,000 tokens' worth of KV state, which can collapse batch size long before raw FLOPs become the bottleneck.
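Plugging numbers into the formula shows how quickly one long rollout eats memory. The model shape below (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the 40 GiB KV budget are assumptions chosen for illustration, not the configuration of any specific deployment.

```python
layers = 32          # L
kv_heads = 8         # n_kv (grouped-query attention, assumed)
head_dim = 128       # d_h
bytes_per_value = 2  # b, for example 16-bit K/V entries

# The leading 2 counts one Key and one Value vector per head per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value

thinking_tokens = 10_000
rollout_gib = thinking_tokens * kv_bytes_per_token / 1024 ** 3

kv_budget_gib = 40   # assumed memory left for KV cache after weights
max_concurrent_rollouts = int(kv_budget_gib // rollout_gib)

print(f"kv_kib_per_token={kv_bytes_per_token // 1024}")
print(f"rollout_gib={rollout_gib:.2f} max_concurrent_rollouts={max_concurrent_rollouts}")
```

Under these assumptions a single 10,000-token rollout holds about 1.22 GiB of KV state, so only a few dozen can be resident at once regardless of how much compute is idle.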

PagedAttention stores KV cache in fixed-size blocks to reduce fragmentation and make scheduling practical at long context lengths.^{[13]} When multiple rollouts share the same prompt or partial trace, prefix-sharing runtimes can reuse that cached prefix instead of duplicating it across every branch. SGLang's RadixAttention is a good example.^{[14]} This matters a lot for best-of-N, self-consistency, and tree search, where many candidates share the same long prompt and diverge only near the leaves.

prefix-sharing-cache-accounting.py

prompt_tokens = 4000
continuation_tokens = 1000
branches = 8
kv_bytes_per_token = 128 * 1024  # 128 KiB of K/V state per token (model-dependent)

# Without sharing, every branch stores its own copy of the prompt's KV state.
without_sharing = branches * (prompt_tokens + continuation_tokens)
# With sharing, the prompt is cached once and only continuations are per-branch.
with_sharing = prompt_tokens + branches * continuation_tokens
saved_gib = (without_sharing - with_sharing) * kv_bytes_per_token / 1024 ** 3
print(f"tokens_without_sharing={without_sharing:,}")
print(f"tokens_with_sharing={with_sharing:,} saved_kv_gib={saved_gib:.2f}")

Output

tokens_without_sharing=40,000
tokens_with_sharing=12,000 saved_kv_gib=3.42
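The paging half of the story can be sketched the same way. The comparison below contrasts reserving a contiguous maximum-length slot per request with allocating fixed-size blocks on demand. The block size of 16 tokens, the 16,384-token reservation, and the live sequence lengths are assumptions for illustration, not defaults taken from any particular engine.

```python
import math

block_size_tokens = 16        # assumed block size
max_sequence_tokens = 16_384  # assumed per-request reservation without paging
actual_lengths = [1_200, 9_700, 350, 4_001]  # hypothetical live sequences

used = sum(actual_lengths)

# Contiguous preallocation reserves the maximum length for every request.
reserved_contiguous = max_sequence_tokens * len(actual_lengths)

# Paged allocation rounds each sequence up to whole blocks, so waste is
# bounded by less than one block per sequence.
blocks_needed = [math.ceil(length / block_size_tokens) for length in actual_lengths]
reserved_paged = sum(blocks_needed) * block_size_tokens

print(f"contiguous_utilization={used / reserved_contiguous:.1%}")
print(f"paged_utilization={used / reserved_paged:.1%} wasted_tokens={reserved_paged - used}")
```

Reasoning traffic makes this gap wider because rollout lengths vary so much: a maximum sized for the longest trace is mostly empty for the typical one.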

TTFT vs. inter-token latency

Test-time compute distorts user-facing latency metrics. For a provider or wrapper that generates non-visible reasoning before revealing the answer, TTFT (time to first token) can increase substantially.^{[4]} An interleaved mode can also add pauses between visible chunks or tool calls. ITL (inter-token latency) can still be fine once answer text starts streaming. A reasoning system can feel slow even when decode throughput is healthy.

ttft-budget-slo.py

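base_ttft_ms = 180  # prefill and scheduling delay before any decode
decode_tokens_per_second = 80
budgets = [0, 128, 512, 1024]  # hidden reasoning tokens before the first visible token
ttft_slo_ms = 3000

for budget in budgets:
    # Hidden tokens decode sequentially, so they add directly to TTFT.
    ttft_ms = base_ttft_ms + budget / decode_tokens_per_second * 1000
    status = "fits" if ttft_ms <= ttft_slo_ms else "reject"
    print(f"budget={budget:>4} ttft_ms={ttft_ms:>7.0f} {status}")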

Output

budget=   0 ttft_ms=    180 fits
budget= 128 ttft_ms=   1780 fits
budget= 512 ttft_ms=   6580 reject
budget=1024 ttft_ms=  12980 reject
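The same arithmetic can be inverted to find the largest hidden-reasoning budget that still meets the SLO. This sketch reuses the numbers from the listing above and keeps its simplifying assumption that hidden tokens decode at a steady rate before the first visible token.

```python
base_ttft_ms = 180
decode_tokens_per_second = 80
ttft_slo_ms = 3000

# Time left for hidden reasoning once the base delay is paid.
headroom_ms = ttft_slo_ms - base_ttft_ms

# Whole tokens that fit in the headroom at the assumed decode rate.
max_hidden_tokens = headroom_ms * decode_tokens_per_second // 1000

ttft_at_max = base_ttft_ms + max_hidden_tokens / decode_tokens_per_second * 1000
print(f"max_hidden_tokens={max_hidden_tokens} ttft_ms={ttft_at_max:.1f}")
```

With these inputs the cap is 225 tokens, which explains why the 512-token and 1,024-token budgets above are rejected.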

Latency vs. quality trade-off

Reasoning models add latency, but the exact numbers depend on hardware, batching policy, and provider implementation:

Strategy	User-visible behavior	Best fit
Single-pass generation	Low TTFT, short answers	Chat, extraction, classification
Reasoning or search-heavy generation	Higher TTFT, variable token budget, sometimes hidden intermediate work	Math, code, planning, verification

For strict interactive latency budgets, a long reasoning path may be unacceptable unless measured gains justify it. Candidates for larger budgets include:

Batch processing (code review, data analysis)
High-stakes analyst workflows (incident review, safety analysis, compliance analysis)
Asynchronous workflows (software engineering, research)

Cost implications

Extended reasoning is expensive even when the final answer is short. Hosted APIs may charge directly for hidden reasoning tokens. OpenAI's reasoning docs, for example, note that these tokens are billed as output tokens even though they aren't returned verbatim in the API response.^{[4]} Open-weight deployments pay through longer wall-clock time, lower throughput, and higher KV cache residency. Either way, longer rollouts reduce how many concurrent requests the same GPU can serve.
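A quick calculation shows how far the bill can drift from what the user sees. The price and token counts below are hypothetical placeholders; substitute the figures from your provider's usage reports.

```python
# Hypothetical price and token counts, for illustration only.
price_per_million_output_tokens = 10.00  # USD, assumed
visible_output_tokens = 200
hidden_reasoning_tokens = 3_000

# Hidden reasoning is billed alongside the visible answer as output.
billed_output_tokens = visible_output_tokens + hidden_reasoning_tokens

cost_if_visible_only = visible_output_tokens * price_per_million_output_tokens / 1_000_000
cost_billed = billed_output_tokens * price_per_million_output_tokens / 1_000_000
multiplier = billed_output_tokens / visible_output_tokens

print(f"billed_output_tokens={billed_output_tokens}")
print(f"cost_if_visible_only=${cost_if_visible_only:.4f} cost_billed=${cost_billed:.4f}")
print(f"billing_multiplier={multiplier:.0f}x")
```

Estimating cost from visible answer length alone would be off by the multiplier printed here, so budget from reported usage rather than from response text.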

Production systems usually combine three controls:

Routing: send only hard or high-stakes tasks to expensive reasoning paths
Token budgets: cap how long a rollout may think before the marginal gain stops being worth it
Early stopping: terminate search when verifier scores stop improving

Without those controls, ambiguous or unsolvable prompts can burn a large amount of inference budget without producing a better answer.
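The token-budget and early-stopping controls can be combined in one loop. This is a minimal sketch under stated assumptions: the verifier scores are a made-up sequence, and `patience` and `min_delta` are hypothetical tuning parameters rather than settings from any library.

```python
def run_with_controls(step_scores, tokens_per_step, token_budget, patience, min_delta):
    """Stop when the token budget is exhausted or the best score stalls for `patience` steps."""
    best_score = float("-inf")
    stale_steps = 0
    tokens_used = 0
    steps_done = 0
    for score in step_scores:
        if tokens_used + tokens_per_step > token_budget:
            return {"stopped": "token_budget", "steps": steps_done,
                    "tokens": tokens_used, "best": best_score}
        tokens_used += tokens_per_step
        steps_done += 1
        if score > best_score + min_delta:
            best_score = score  # meaningful improvement resets the stall counter
            stale_steps = 0
        else:
            stale_steps += 1
        if stale_steps >= patience:
            return {"stopped": "plateau", "steps": steps_done,
                    "tokens": tokens_used, "best": best_score}
    return {"stopped": "finished", "steps": steps_done,
            "tokens": tokens_used, "best": best_score}


# Hypothetical verifier scores after each search step.
scores = [0.42, 0.55, 0.61, 0.61, 0.60, 0.62, 0.61, 0.90]

plateau_run = run_with_controls(scores, tokens_per_step=200, token_budget=4000,
                                patience=3, min_delta=0.02)
capped_run = run_with_controls(scores, tokens_per_step=200, token_budget=500,
                               patience=3, min_delta=0.02)
print(plateau_run)
print(capped_run)
```

Note that the plateau rule stops before the late jump to 0.90 in this made-up sequence. Early stopping saves tokens on stalled searches but can also cut a trace that was about to improve, so `patience` should be tuned against measured outcomes.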

Distillation: making reasoning affordable

Distillation is like having a senior incident engineer write detailed recovery traces, then training a smaller model to solve similar outages the same way. The smaller model won't match the source model everywhere, but it can inherit much of the solution style while being much cheaper to serve.

DeepSeek-R1 showed that reasoning capabilities can be distilled into much smaller models:

Distilled Model	Source	AIME 2024 (pass@1: accuracy on first attempt)
DeepSeek-R1-Distill-Qwen-1.5B	R1 → Qwen2.5-Math-1.5B	28.9%
DeepSeek-R1-Distill-Qwen-7B	R1 → Qwen2.5-Math-7B	55.5%
DeepSeek-R1-Distill-Qwen-32B	R1 → Qwen2.5-32B	72.6%
DeepSeek-R1 (full)	-	79.8%

In DeepSeek's reported AIME evaluation, the 7B and 32B distilled models retain substantial benchmark accuracy at lower parameter counts than R1. Parameter count is only a proxy, though. Real deployment cost still depends on active parameters, quantization, batch size, and engine choice. Distillation can make reasoning cheaper, but it doesn't remove the need for routing and token budgets.
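The data-collection side of distillation can be sketched as verifier-filtered sampling: generate traces from a teacher, keep only those whose answers check out, and use the survivors as fine-tuning examples. Everything below is a toy stand-in. `teacher_sample` fakes a teacher with a random error rate, `verify` is an exact arithmetic check, and the record layout is an assumption; none of it reflects DeepSeek's actual data pipeline.

```python
import random


def teacher_sample(problem, rng):
    """Stand-in for a teacher model: returns a (trace, answer) pair that is sometimes wrong."""
    a, b = problem
    answer = a * b if rng.random() < 0.7 else a * b + rng.choice([-1, 1])
    return f"{a} * {b} = {answer}", answer


def verify(problem, answer):
    """Task-specific checker: exact arithmetic in this toy setting."""
    a, b = problem
    return answer == a * b


def build_sft_dataset(problems, samples_per_problem, seed=0):
    """Keep at most one verified teacher trace per problem."""
    rng = random.Random(seed)
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace, answer = teacher_sample(problem, rng)
            if verify(problem, answer):
                dataset.append({
                    "problem": problem,
                    "prompt": f"Compute {problem[0]} * {problem[1]}",
                    "completion": trace,
                    "answer": answer,
                })
                break  # one clean trace is enough for this sketch
    return dataset


problems = [(12, 4), (7, 9), (15, 15), (23, 3)]
dataset = build_sft_dataset(problems, samples_per_problem=4)
print(f"kept {len(dataset)} of {len(problems)} problems")
```

The filter guarantees that kept answers are correct, but it says nothing about whether the trace that produced them is sound, which is one reason a student's benchmark score should still be checked against a comparable baseline.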

serving-budget-gate.py

# predicted_gain: estimated accuracy lift from the reasoning path on this task type.
requests = [
    {"kind": "extract", "verifiable": True, "predicted_gain": 0.00},
    {"kind": "capacity_math", "verifiable": True, "predicted_gain": 0.09},
    {"kind": "creative_copy", "verifiable": False, "predicted_gain": 0.03},
]
minimum_gain = 0.05

for request in requests:
    # Pay for reasoning only when the answer can be checked and the lift clears the bar.
    use_reasoning = request["verifiable"] and request["predicted_gain"] >= minimum_gain
    route = "reasoning" if use_reasoning else "single-pass"
    print(f"{request['kind']}: {route}")

Output

extract: single-pass
capacity_math: reasoning
creative_copy: single-pass

Mastery check

What strong answers show

Scaling view: train-time compute improves the model before deployment, while test-time compute spends extra budget during generation.
Path view: longer traces, best-of-N, sequential revision, and verifier-guided tree search are different ways to buy more deliberation.
Verifier view: ORMs score final answers, while reliable PRMs let a search policy prune low-scoring branches early.
Systems view: when an API uses and reports non-visible reasoning tokens, they still affect context budget and billing. In transformer serving paths that materialize them, they also add decode work and temporary KV state.
Routing view: reasoning paths are candidates when the task is hard, verifiable, and important enough to justify measured latency and cost.
Budget view: provider/model knobs differ (reasoning.effort, Gemini Interactions generation_config.thinking_level, Gemini GenerateContent thinkingConfig.thinkingLevel or thinkingConfig.thinkingBudget, Claude adaptive effort or legacy budget_tokens), and more thinking isn't always better because accuracy can follow an inverted-U.
DeepSeek view: R1-Zero showed pure-RL reasoning behavior; R1 added cold-start data, SFT, and additional RL to make the behavior more usable.

Follow-up questions

When should you choose a reasoning model over a standard model?

Evaluate a reasoning path when the task needs multi-step deduction and you have a way to check whether the answer is good. Math, formal verification, code debugging, and structured planning are promising candidates. Start with a single-pass baseline for summarization, translation, extraction, and conversational UX, then promote only workloads where measured correctness gains justify latency and cost.

How does test-time compute interact with KV-cache optimization?

Reasoning workloads generate longer rollouts or multiple branches. In transformer serving paths that materialize those tokens, each new token adds per-layer K/V state. KV-cache memory grows linearly with sequence length, so long reasoning traces reduce batch size and throughput. Engines like vLLM use paging to manage that memory, and prefix-sharing runtimes can reuse the shared prefix across best-of-N or search branches instead of duplicating it.

What are the failure modes of extended reasoning?

Common failure modes include loops, error propagation, overthinking, and weak self-verification. If the same model generates and judges, its blind spots are correlated, so it may confidently approve a bad trace. PRMs, task-specific checkers, explicit token budgets, and early stopping help by scoring intermediate steps and terminating unproductive branches earlier.
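Loops are the easiest of these failure modes to detect mechanically. One rough heuristic is to measure how many n-grams in a trace repeat earlier ones. In this sketch, whitespace-split words stand in for model tokens, and the n-gram size and the 0.5 alert threshold are assumptions to tune on real traces.

```python
from collections import Counter


def repeated_ngram_ratio(tokens, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram in the same trace."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


# Whitespace-split words stand in for model tokens in this sketch.
healthy = "check the diff then run the failing test and compare the error budget".split()
looping = ("wait let me recheck the sum " * 6).split()

loop_threshold = 0.5  # assumed alert level
for name, trace in [("healthy", healthy), ("looping", looping)]:
    ratio = repeated_ngram_ratio(trace)
    flag = "possible loop" if ratio > loop_threshold else "ok"
    print(f"{name}: repeated_ngram_ratio={ratio:.2f} {flag}")
```

A check like this catches verbatim spinning only. Paraphrased loops and confident-but-wrong traces still need a verifier or a hard budget.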

How can you distill reasoning into a smaller model?

Distillation uses traces or verifier-filtered answers from a stronger reasoning system to train a smaller student. A student can learn useful solution patterns, as the DeepSeek results illustrate, but the gain must be measured against a comparable baseline. A student may imitate output format more easily than the teacher's actual search policy, so good benchmark scores don't always mean the smaller model learned the same underlying reasoning process.

How would you set a first-pass router for mixed workloads?

Start by separating tasks into three buckets: easy pattern matching, medium tasks with occasional reasoning upside, and hard tasks with clear verification signals. Route the first bucket to a fast single-pass model, route the second to a bounded reasoning budget or brief best-of-N, and reserve verifier-guided search for the third. Then instrument latency, total generated tokens, and verified win rate so the router can move requests down when extra thinking stops paying off.

When reasoning deployments break down

Even experienced engineers misapply reasoning models. These failure modes pair symptoms with causes and fixes.

The intelligence tax

Symptom: A production task takes 30 seconds to complete and costs five times more than the standard model, but the output quality is identical.
Cause: You're using a reasoning model for a task that doesn't show measurable benefit from extra deduction. Summarization, translation, simple extraction, and sentiment analysis are often strong single-pass baselines. Extra thinking time may add latency and cost without an accuracy gain.
Fix: Route easy tasks to fast single-pass models. Reserve reasoning models for math, code, planning, verification, and other tasks where you can verify whether the answer is correct.

Confusing prompting with architecture

Symptom: You assume a reasoning model is just a standard model with a hidden system prompt like "You are a careful thinker..."
Cause: The API hides the reasoning tokens, so it's tempting to imagine they're produced by a wrapper around a normal model. That mental model misses the training difference.
Fix: Reasoning models may be post-trained, often with RL or distillation, to use extra inference compute productively. Treat this as different model behavior and serving accounting. It isn't a prompt hack, and it isn't necessarily a new transformer architecture either.

Treating visible chain-of-thought as the whole system

Symptom: You assume a visible "think step by step" trace, a hidden reasoning trace, best-of-N, PRM search, and RL with verifiable rewards are interchangeable.
Cause: They all spend extra inference or training budget around reasoning, but they solve different problems.
Fix: Name the mechanism before comparing systems. Prompted CoT changes the prompt; hidden reasoning changes model behavior and serving accounting; best-of-N changes sampling; PRM search changes branch selection; and verifiable-reward RL changes post-training.

Ignoring KV cache pressure

Symptom: A reasoning rollout looks cheap because the final answer is short, but batch size collapses under load.
Cause: In transformer serving paths that materialize them, hidden and candidate tokens create K/V state across layers.
Fix: Track total generated tokens, including hidden reasoning tokens. Use paging, prefix sharing, routing, and hard budgets before scaling reasoning traffic.

Infinite loops and overthinking

Symptom: Token usage spikes on certain prompts, the model spins in circles repeating the same deduction, or accuracy drops on simple prompts when the budget is set high.
Cause: No reasoning budget was set, or the budget was set too high. On ambiguous or simple prompts, more thinking can amplify distractions and errors rather than fix them, so accuracy follows an inverted-U as the trace grows.^{[8]}^{[12]}
Fix: Set a reasoning budget tuned to the task instead of maximizing it, and add a hard cap. Implement early stopping when verifier confidence plateaus. For simple prompts, lower the effort level. Monitor per-request token distributions and alert on outliers.
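For the monitoring step, a robust outlier rule is often enough to start with. The sketch below flags requests whose total generated tokens exceed the median plus a multiple of the median absolute deviation (MAD). The token counts and the multiplier `k` are hypothetical, and MAD is one reasonable choice among several, not a prescribed method.

```python
import statistics


def token_outliers(token_counts, k=5.0):
    """Return (index, count) pairs above median + k * MAD."""
    median = statistics.median(token_counts)
    deviations = [abs(count - median) for count in token_counts]
    mad = statistics.median(deviations)
    if mad == 0:
        return []  # no spread to measure against; fall back to a hard cap instead
    threshold = median + k * mad
    return [(index, count) for index, count in enumerate(token_counts) if count > threshold]


# Hypothetical total generated tokens per request, hidden reasoning included.
token_counts = [820, 760, 905, 1010, 880, 790, 15400, 940, 860, 12050]
flagged = token_outliers(token_counts)
print(f"flagged_requests={flagged}")
```

Median and MAD are used instead of mean and standard deviation because a few runaway rollouts would otherwise inflate the baseline and hide themselves.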

Practice

Try this short audit to pin down the trade-offs.

The efficiency audit

Consider these five production tasks. For each one, predict whether it benefits from test-time compute scaling, then check your reasoning against the explanation.

Task	Your prediction	Why it does or doesn't benefit
Creative writing (marketing copy)		Single-pass fluency is usually sufficient; revision helps but often isn't worth the latency cost
Multi-step GPU capacity calculation		Hard reasoning with verifiable arithmetic; measure best-of-N or guided search against baseline
Sentiment analysis of feedback snippets		Strong single-pass baseline; only promote it if evaluation finds a gain
Debugging a 500-error across three microservices		Multi-step code reasoning with executable checks; evaluate repair-loop gain
Translation of short UI strings		Pattern-matching with strong base-model performance; extra reasoning adds little

The GPU calculation and microservice debugging are the strongest candidates because they provide verification signals. Test-time compute is worth testing when the task requires verifiable, multi-step reasoning, not because the task sounds difficult.

What to remember and where to go next

Test-time compute adds a second scaling axis: spending more compute at inference can outperform moving to a larger single-pass model on hard reasoning tasks.^{[2]}
Test-time compute is an umbrella, not one algorithm: longer scratchpads, repeated sampling, revision loops, and explicit search all fit under the same idea.
Good verifiers and compute-optimal policies matter: Snell et al. report more than 4× better test-time compute efficiency than a best-of-N baseline in their evaluation, and PRMs can support earlier pruning because they score intermediate steps instead of waiting for the end.^{[2]}^{[10]}
DeepSeek-R1-Zero and DeepSeek-R1 aren't the same result: R1-Zero showed emergent reasoning from pure RL, while DeepSeek-R1 added cold-start data plus additional SFT and RL stages to make that behavior readable and broadly usable.^{[5]}
Thinking is now a dial, and the sweet spot isn't always the maximum: current providers expose model- and API-specific effort or thinking controls, and accuracy can follow an inverted-U as the trace grows, so tune the supported knob rather than maxing it.^{[4]}^{[6]}^{[7]}^{[8]}
Serving bottlenecks matter as much as algorithms: KV cache memory grows linearly with long reasoning traces, so prefix sharing, routing, and token budgets are core production tools.^{[13]}^{[14]}

Explain why a single-pass model can miss "How many r's in strawberry," why a smaller model with traces, search, or distillation can recover much of a larger model's math performance, and why a production stack needs a routing layer before it needs a bigger GPU.

Next Step

Continue to Advanced MLOps and DevOps for AI Systems

You now understand why reasoning workloads need careful routing, token budgets, verification, and serving signals. The next chapter turns those signals into operational practice: GitOps, feature stores, shadow traffic, canaries, lineage, and automated rollback for production AI systems.


References

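[1] Scaling Laws for Neural Language Models. Kaplan et al. · 2020 · https://arxiv.org/abs/2001.08361

[2] Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. Snell, C., et al. · 2024 · arXiv preprint · https://arxiv.org/abs/2408.03314

[3] Learning to reason with LLMs. OpenAI · 2024 · https://openai.com/index/learning-to-reason-with-llms/

[4] Reasoning models. OpenAI · 2026 · https://developers.openai.com/api/docs/guides/reasoning

[5] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI · 2025 · https://arxiv.org/abs/2501.12948

[6] Gemini thinking. Google · 2025 · https://ai.google.dev/gemini-api/docs/thinking

[7] Building with extended thinking. Anthropic · 2025 · https://docs.claude.com/en/docs/build-with-claude/extended-thinking

[8] Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models. Ghosal, S. S., Chakraborty, S., Reddy, S., et al. · 2025 · https://arxiv.org/abs/2506.04210

[9] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei, J., et al. · 2022 · NeurIPS

[10] Let's Verify Step by Step. Lightman, H., et al. · 2023 · ICLR · https://arxiv.org/abs/2305.20050

[11] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. Wang, P., et al. · 2023

[12] Inverse Scaling in Test-Time Compute. Anthropic · 2025 · https://alignment.anthropic.com/2025/inverse-scaling/

[13] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP · https://arxiv.org/abs/2309.06180

[14] SGLang: Efficient Execution of Structured Language Model Programs. Zheng, L., Yin, L., Xie, Z., et al. · 2023 · https://arxiv.org/abs/2312.07104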
