© 2026 LeetLLM. All rights reserved.


Reasoning & Test-Time Compute

Understand how reasoning models trade extra inference compute for better answers, and what that means for search, verifiers, KV cache pressure, and routing.


State Space Models showed one way to keep decode state from growing with context length by compressing history into recurrent state. Reasoning models move in the opposite direction on purpose: spend more inference compute when the task is hard enough to justify it.

Think about two dispatch planners handling the same delayed-truck problem. The first planner accepts the first reroute that looks plausible. The second planner checks carrier capacity, hub cutoffs, refund risk, and customer priority before committing. When those checks catch a real mistake, the second plan performs better, not because the planner has different data, but because it spends more compute on the decision.

The same idea now shapes how frontier large language models (LLMs) are built and evaluated. Classic scaling work focused on train-time compute: more parameters, more data, and more pretraining FLOPs (floating-point operations).[1] Work from 2024 onward made a second axis impossible to ignore: on hard reasoning tasks, you can often get better answers by spending more compute during generation itself.[2][3] This is test-time compute scaling. Reasoning-model APIs from OpenAI and open-weight systems such as DeepSeek-R1 made this shift visible, and Snell et al. showed that, on tasks where a smaller model already has a non-trivial chance of success, extra inference-time compute can beat a much larger single-pass model.[3][4][5][2]

Provider controls are model-specific and they change. OpenAI documents reasoning.effort; Google's Gemini docs use thinkingLevel for Gemini 3 models and thinkingBudget for Gemini 2.5 models; Anthropic's current Claude Opus 4.7 docs use adaptive thinking with an effort parameter, while older Claude models use manual extended thinking with budget_tokens.[4][6][7] The durable skill is deciding how much compute a request deserves, because more thinking is not always better.[8]

In this article you'll learn why some problems need deliberate reasoning instead of fast pattern matching, how test-time compute scaling works, and what it means for production systems. You'll need a basic mental model of how transformers predict the next token (covered earlier in the preparation path). By the end, you'll be able to choose between single-pass, best-of-N, and guided-search strategies, and you'll know why routing and token budgets are as important as the algorithms themselves.

Figure: Test-time compute scaling frontier with single-pass, best-of-N, and guided search curves.
Training compute improves the model before deployment. Test-time compute spends extra inference budget per request through longer traces, more samples, revision, or verifier-guided search.

System 1 and System 2 thinking

Psychologist Daniel Kahneman's framework is a useful starting point. System 1 is fast, instinctive, and pattern-driven. When you read the word "strawberry" and immediately know it's a fruit, that's System 1. System 2 is slow, deliberate, and step-by-step. When you count how many times the letter "r" appears in "strawberry," you have to switch to System 2 because your gut reaction ("two?") is often wrong. The correct answer is three, but you only get there by checking each letter deliberately.

The System 1/System 2 distinction is an analogy for serving behavior, not an architectural taxonomy. A low-budget single-pass call may answer from strong patterns; a model or surrounding system with more budget can check intermediate work, sample alternatives, or run tools. On tasks that require counting, lookahead, or multi-step deduction, that extra verification path can matter.

Reasoning-oriented models and systems are designed to make deliberate behavior easier to buy at inference time. Depending on the model and wrapper, they may spend non-visible reasoning tokens, sample alternatives, run an external verifier, or revise a candidate. Think of the difference between a routing system that accepts the first carrier option and one that checks cutoff times, capacity, destination risk, and refund impact before committing.

This shift from pure pattern matching to deliberate reasoning is what makes test-time compute scaling possible.

Why low-budget generation can fail on hard problems

Here's a concrete example. Ask a model: "How many times does the letter 'r' appear in the word 'strawberry'?" A rushed answer can be "two," because the third "r" in "strawberry" is easy to miss without a check.
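The deliberate check is easy to make concrete. The sketch below walks the word one character at a time and records each matching position, which is the kind of deterministic verification a tool call could run. The helper name `count_letter` is illustrative and not taken from any library.

```python
def count_letter(word: str, letter: str) -> tuple[int, list[int]]:
    """Count occurrences of `letter` by inspecting every character in turn."""
    # 1-based positions make the result easy to check by eye.
    positions = [i for i, ch in enumerate(word, start=1) if ch == letter]
    return len(positions), positions

count, positions = count_letter("strawberry", "r")
print(f"count={count} positions={positions}")
```

Recording the positions, not just the total, is what turns a guess into something you can audit.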

The problem isn't that the model lacks knowledge. It's that the task requires deliberate step-by-step verification, not fast association. Pretraining optimizes next-token prediction, which rewards fluent continuations more directly than checked final answers. On problems like complex math, code debugging, or logistics planning, the same dynamic appears: the model produces a plausible-looking answer that collapses under scrutiny.

Test-time compute scaling can address this failure mode by spending budget before committing. Instead of one shot, the model or the system around it can explore multiple approaches, verify intermediate steps, and select a candidate. The extra compute only helps when exploration or verification changes the result for the better.

The two axes of compute scaling

Here's a useful analogy: think of a fulfillment network. Train-time compute is like months spent tuning forecasts, routes, and warehouse procedures before peak season. Test-time compute is the effort spent during a live incident: trying alternate routes, checking cutoff times, and verifying the plan before committing.

Classic scaling emphasized the pre-season tuning. Evaluations on reasoning tasks now show that inference-time allocation can sometimes compete with moving to a larger single-pass model.

Traditional scaling focuses on train-time compute, increasing model parameters $N$ or training tokens $D$. The relationship between compute and loss follows a power law:

$$L_{\text{train}}(C) \propto C^{-\alpha}, \quad C = 6ND$$

Empirical scaling studies fit loss as an approximate power law in compute $C$, where training compute is often estimated as roughly 6 × model parameters $N$ × training tokens $D$.[1] Within the measured regime, increasing training compute reduces loss predictably enough to guide training decisions.
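To get a feel for the magnitudes, here is a minimal sketch of the 6 × N × D estimate. The parameter and token counts are hypothetical values picked for illustration, not figures for any real model.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the rule of thumb C = 6 * N * D."""
    return 6 * n_params * n_tokens

# Hypothetical (parameters, training tokens) pairs, for illustration only.
for n_params, n_tokens in [(1e9, 2e10), (7e9, 1.4e11), (7e10, 1.4e12)]:
    flops = training_flops(n_params, n_tokens)
    print(f"N={n_params:.0e} D={n_tokens:.1e} C={flops:.2e} FLOPs")
```

Scaling both $N$ and $D$ by 10× multiplies the estimate by 100×, which is why train-time scaling gets expensive quickly.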

Test-time compute scaling adds a second axis, inference compute $C_{\text{infer}}$. A practical mental model is:

$$\text{accuracy} = g(C_{\text{infer}}, \text{task}, \text{policy}, \text{verifier})$$

In a measured operating range, more inference compute can improve accuracy with diminishing returns. It can also plateau or reduce accuracy when a trace overthinks, sampled candidates are correlated, or a verifier selects the wrong branch. That extra compute can come from longer reasoning traces, repeated sampling, revision loops, or explicit search with a verifier.[2][8]

Snell et al.[2] don't claim one universal law for every model and every benchmark. Their more useful result is operational: if you allocate inference compute well, test-time scaling is strong enough that it can outperform simply buying a larger model and sampling once.

This represents a fundamental shift in how we approach problem-solving with AI models:

Traditional approach: Train bigger → Answer once → Done
Reasoning approach: Train reasoner → Think longer → Search for best answer
measured-budget-sweet-spot.py
budgets = [0, 128, 512, 2048]
verified_accuracy = [0.71, 0.78, 0.83, 0.80]
latency_ms = [220, 310, 610, 1840]
latency_limit_ms = 1000

eligible = [
    (accuracy, budget, latency)
    for budget, accuracy, latency in zip(budgets, verified_accuracy, latency_ms)
    if latency <= latency_limit_ms
]
accuracy, budget, latency = max(eligible)
print(f"chosen_budget={budget} accuracy={accuracy:.0%} latency_ms={latency}")
print(f"max_budget_is_best={verified_accuracy[-1] == max(verified_accuracy)}")
Output
chosen_budget=512 accuracy=83% latency_ms=610
max_budget_is_best=False

This evaluation table encodes the production question: choose the best measured quality under the service-level objective, rather than assuming the longest trace is best.

How reasoning models work

At a high level, reasoning-oriented training and inference policies aim to allocate additional tokens or branches before returning an answer. The visible behavior can include decomposition, checks, or revision, but those behaviors are capabilities to evaluate rather than guarantees of every response.

Extended chain-of-thought

Reasoning models often generate long scratchpads or intermediate traces before producing a final answer. Sometimes that trace is exposed, sometimes it's summarized, and sometimes it's hidden entirely by provider policy. What matters isn't whether the user sees every token, but whether the model is allowed to spend additional inference-time compute before answering.

This built-in reasoning process differs from prompted chain-of-thought (CoT)[9] in several key ways:

  1. Usually learned during post-training via reinforcement learning (RL), distillation, or both, rather than relying on prompt wording alone.
  2. Potentially variable-length: the API or serving policy can permit a different budget for each request.
  3. Can include self-correction: a trace may backtrack, recognize dead ends, and revise earlier steps.
  4. Often hidden or summarized: providers may not expose the raw reasoning tokens directly.

Current reasoning APIs make this concrete. OpenAI's reasoning docs describe reasoning tokens as non-visible output tokens that still consume context budget, and they recommend giving the task, constraints, and desired output format while treating reasoning effort as a tuning knob.[4]
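Because non-visible reasoning tokens share the context window with the prompt and the visible answer, it helps to check headroom before raising the effort setting. The sketch below is a simple budget calculation; the window size and token counts are hypothetical, and real limits come from the provider's model documentation.

```python
def remaining_context(context_window: int, prompt_tokens: int,
                      reasoning_tokens: int, visible_tokens: int) -> int:
    """Tokens left in the window after prompt, hidden reasoning, and visible output."""
    return context_window - (prompt_tokens + reasoning_tokens + visible_tokens)

# Hypothetical numbers chosen for illustration.
window = 128_000
for reasoning in [0, 8_000, 32_000, 64_000]:
    left = remaining_context(window, prompt_tokens=90_000,
                             reasoning_tokens=reasoning, visible_tokens=2_000)
    print(f"reasoning={reasoning:>6} remaining={left:>7} fits={left >= 0}")
```

A long prompt plus a generous reasoning budget can exhaust the window before the visible answer is written, so the two budgets have to be planned together.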

The thinking budget is now a primary control surface, even though providers expose it with model-specific knobs. OpenAI's reasoning.effort sets an effort level for reasoning models.[4] Gemini 3 models use thinkingLevel, while Gemini 2.5 models retain a numeric thinkingBudget.[6] Claude Opus 4.7 uses adaptive thinking plus effort; Anthropic documents budget_tokens for older Claude extended-thinking models.[7] Provider prompting guidance also differs by model, so start with a clear task and constraints, tune the supported knob, and evaluate instead of assuming a chain-of-thought prompt helps.

The important point is that test-time compute can mean either a longer single trace or multiple sampled traces. OpenAI's o1 launch post made this visible at the benchmark level: on AIME 2024 (a math competition benchmark), reported accuracy improved when the system moved from a single sample to consensus over 64 samples and then to learned reranking over 1000 samples.[3] Treat this as one reported evaluation result, not a guarantee for other tasks or selection rules.

Deliberation and search at inference time

Test-time compute is an umbrella term, not a single algorithm. Some systems sample many complete answers and pick a winner. Some iteratively critique and revise one candidate. Others run explicit search over partial reasoning states using a verifier or reward model. The diagram below shows these patterns as branches of the same idea: spend extra compute to explore, score, and refine before returning an answer.

Figure: Reasoning search flow showing one prompt branching into candidate paths, a verifier scoring and pruning them, and one selected answer returning.
Reasoning delays commitment: extend, sample, revise, score, then choose the path worth trusting.

Not every reasoning model literally runs beam search or a PRM at inference time. The shared pattern is optional inference compute allocation. A good routing policy keeps easy requests cheap and assigns more tokens, branches, or verification only where evaluation shows a payoff.
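A routing policy of this kind can be sketched as a small lookup from an estimated difficulty score to a compute budget. Everything here is an assumption for illustration: the `Budget` fields, the tier thresholds, and the idea that a difficulty score in [0, 1] is available from some upstream classifier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    reasoning_tokens: int
    samples: int
    use_verifier: bool

# Hypothetical tiers; thresholds should come from offline evaluation, not intuition.
TIERS = [
    (0.3, Budget(reasoning_tokens=0, samples=1, use_verifier=False)),
    (0.7, Budget(reasoning_tokens=512, samples=1, use_verifier=False)),
    (1.0, Budget(reasoning_tokens=2048, samples=4, use_verifier=True)),
]

def route(difficulty: float) -> Budget:
    """Map a difficulty estimate in [0, 1] to the cheapest tier that covers it."""
    for upper_bound, budget in TIERS:
        if difficulty <= upper_bound:
            return budget
    return TIERS[-1][1]  # clamp out-of-range scores to the largest tier

for score in [0.1, 0.5, 0.9]:
    print(score, route(score))
```

Keeping the tiers in data rather than in branching code makes it easier to retune thresholds after each evaluation run.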

Provider-reported hidden reasoning

Some hosted reasoning APIs report non-visible reasoning tokens in usage accounting while returning only the final visible answer. OpenAI's reasoning-token documentation is one concrete example.[4] Do not generalize this into one universal architecture: an open-weight deployment, explicit search wrapper, or provider with summarized thinking can expose and account for intermediate work differently.

Figure: Provider-reported hidden reasoning budget example showing non-visible work, visible answer, and serving resources consumed before output.
For APIs that report non-visible reasoning tokens, that work still affects output budget, context use, latency, and serving memory.

For such APIs, the non-visible tokens matter for production because they consume context window space, increase key-value (KV) cache pressure, and may count toward billing even though users never see them.[4]

hidden-token-accounting.py
visible_tokens = 180
reasoning_tokens = 1220
output_price_per_million = 10.0

billed_output_tokens = visible_tokens + reasoning_tokens
cost = billed_output_tokens / 1_000_000 * output_price_per_million
print(f"visible={visible_tokens} billed_output={billed_output_tokens}")
print(f"visible_fraction={visible_tokens / billed_output_tokens:.1%} output_cost=${cost:.4f}")
Output
visible=180 billed_output=1400
visible_fraction=12.9% output_cost=$0.0140

Use the provider's usage schema and pricing table when implementing this calculation; the example demonstrates why billing and capacity dashboards cannot count visible text alone.
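The same hidden tokens also occupy KV cache on the serving side. A rough estimate multiplies keys and values across layers, KV heads, head dimension, and cached tokens. The model shape below is hypothetical, and the sketch counts only generated tokens, ignoring the prompt and any cache compression or paging.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache entries.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
for label, tokens in [("visible only", 180), ("visible + reasoning", 1400)]:
    mib = kv_cache_bytes(tokens, **shape) / 2**20
    print(f"{label:<20} tokens={tokens:>5} kv_cache={mib:6.1f} MiB")
```

Per request the difference looks small, but it is multiplied by every concurrent sequence in the batch, which is where the pressure on serving memory comes from.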

Test-time compute strategies

Several strategies allocate compute at inference time, each with different characteristics:

1. Best-of-N sampling

This is the simplest strategy: generate 16 candidate recovery plans for a delayed order and keep the one with the best verifier score. Each attempt is independent, and more attempts raise your ceiling as long as your verifier or selection rule can reliably identify the best one.

The conceptual interface sketch below demonstrates this approach. It takes a generative model and a prompt as inputs, alongside the number of desired completions ($N$). It returns the single best response by either scoring them with a reward model or falling back to self-consistency (majority vote).

Cost note: This method is easily parallelizable (all N attempts can run simultaneously), making it simple to implement but potentially expensive since you pay for all N completions.

1-best-of-n-sampling.py
from collections import Counter

def extract_answer(completion: str) -> str:
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def best_of_n(
    model,
    prompt: str,
    n: int = 16,
    reward_model=None
) -> str:
    """Generate N responses and return the best-scored candidate completion.

    Cost: O(N) forward passes, embarrassingly parallel.
    Best for: Problems with verifiable correctness (math, code).
    """
    candidates = [model.generate(prompt) for _ in range(n)]

    if reward_model:
        scores = [reward_model.score(prompt, c) for c in candidates]
        return candidates[scores.index(max(scores))]
    else:
        # Self-consistency: majority vote on final answer
        answers = [extract_answer(c) for c in candidates]
        answer_counts = Counter(answers)
        best_answer = answer_counts.most_common(1)[0][0]
        return next(c for c in candidates if extract_answer(c) == best_answer)
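Since the N attempts are independent, they can be issued concurrently instead of in a loop. The sketch below uses a thread pool from the standard library, which suits I/O-bound calls to a hosted model. The `fake_generate` function is a stand-in for illustration; a real one would call your model endpoint.

```python
from concurrent.futures import ThreadPoolExecutor

def sample_in_parallel(generate, prompt: str, n: int, max_workers: int = 8) -> list[str]:
    """Run n independent generate(prompt) calls concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate, prompt) for _ in range(n)]
        # result() blocks until each call finishes and re-raises any exception.
        return [future.result() for future in futures]

# Stand-in generator (hypothetical); replace with a real model call.
def fake_generate(prompt: str) -> str:
    return f"plan for: {prompt}"

candidates = sample_in_parallel(fake_generate, "delayed order 1042", n=4)
print(len(candidates), candidates[0])
```

Parallelism reduces wall-clock latency toward that of the slowest single sample, but the token bill is still the sum over all N completions.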

Scaling behavior

With an oracle verifier and independent samples, the chance of generating at least one correct answer in $N$ tries follows:

$$P_{\text{success}} = 1 - (1 - p)^N$$

Where $p$ is the base probability of the model generating a correct answer on a single attempt. Real systems do worse than this idealized formula because samples are correlated and verifiers make mistakes, but the equation explains why best-of-N works at all. OpenAI's o1 launch post reports the same pattern on AIME 2024: o1 improved from 74% with one sample to 83% with 64-sample consensus and 93% when reranking 1000 samples with a learned scorer.[3]

best_of_n_budget.py
def success_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

base_hit_rate = 0.25
tokens_per_attempt = 800

for n in [1, 4, 16, 64]:
    success = success_probability(base_hit_rate, n)
    output_tokens = n * tokens_per_attempt
    print(f"N={n:>2} success={success:5.1%} output_tokens={output_tokens:>5}")
Output
N= 1 success=25.0% output_tokens=  800
N= 4 success=68.4% output_tokens= 3200
N=16 success=99.0% output_tokens=12800
N=64 success=100.0% output_tokens=51200

The toy calculation shows why best-of-N is attractive and dangerous. The idealized success curve rises quickly, but cost rises linearly with every sampled completion. Real systems also plateau earlier because samples share model biases and verifiers make mistakes.
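The idealized formula can also be inverted to ask how many samples a target success rate would require: $N \ge \log(1 - \text{target}) / \log(1 - p)$. The sketch below applies that bound under the same assumptions as above (independent samples, oracle verifier), so treat the result as a floor on the real requirement.

```python
import math

def samples_needed(p: float, target: float) -> int:
    """Smallest N with 1 - (1 - p)**N >= target, for independent samples and an oracle verifier."""
    if not (0 < p < 1 and 0 < target < 1):
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p))

for p in [0.05, 0.25, 0.50]:
    n90 = samples_needed(p, 0.90)
    n99 = samples_needed(p, 0.99)
    print(f"p={p:.2f} N_for_90%={n90} N_for_99%={n99}")
```

Low base hit rates push the required N up sharply, which is one reason best-of-N is a poor fit when the model rarely gets the task right on its own.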

best-of-n-with-selection-errors.py
def selected_success_probability(candidate_success: float, selector_recall: float) -> float:
    return candidate_success * selector_recall

base_hit_rate = 0.25
selector_recall = 0.80
for n in [1, 4, 16]:
    oracle_success = 1 - (1 - base_hit_rate) ** n
    deployed_success = selected_success_probability(oracle_success, selector_recall)
    print(f"N={n:>2} oracle={oracle_success:5.1%} with_selector={deployed_success:5.1%}")
Output
N= 1 oracle=25.0% with_selector=20.0%
N= 4 oracle=68.4% with_selector=54.7%
N=16 oracle=99.0% with_selector=79.2%

This toy selector is deliberately simple: it shows that generating a correct candidate and selecting it are separate failure surfaces.

Figure: Best-of-N success probability curves showing fast early gains and diminishing returns for different base hit rates.
Best-of-N helps most when the base model already has a meaningful hit rate and selection is reliable. If either condition fails, more samples mostly buy more repeated errors.
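When no verifier is available and selection falls back to majority vote, the base hit rate matters even more. The sketch below computes the exact vote accuracy under two simplifying assumptions that are not from the source: answers are binary (correct or one wrong alternative) and samples are independent. Real tasks usually have many distinct wrong answers, so a plurality can win below 50%, but the direction of the effect is the same.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent binary votes are correct), for odd n."""
    if n % 2 == 0:
        raise ValueError("use an odd n so ties cannot occur")
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for p in [0.4, 0.6]:
    row = " ".join(f"N={n}:{majority_vote_accuracy(p, n):.3f}" for n in [1, 5, 21])
    print(f"p={p} {row}")
```

Under these assumptions, voting amplifies whichever answer the model already favors: above 50% it converges toward correct, below 50% more samples make the result worse.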

2. Sequential revision

This is like drafting a carrier-recovery plan, checking it against policy and capacity, then revising the weak steps before committing to the customer. Unlike best-of-N (where each attempt is independent), each revision builds on the previous one.

The conceptual interface below takes an initial prompt and a model. It generates an initial response and then enters a loop, asking the model to critique its own output. If errors are found, it uses the critique to generate an improved version, returning the final refined text. The error detection ("no errors found") is intentionally simplified; production systems use a trained verifier model or structured critique format.

2-sequential-revision.py
def iterative_refinement(model, prompt: str, max_rounds: int = 5) -> str:
    """Generate, critique, and refine until convergence.

    Cost: O(rounds) sequential passes, each building on previous.
    Best for: Open-ended tasks (writing, analysis, planning).
    """
    response = model.generate(prompt)

    for _ in range(max_rounds):
        critique = model.generate(
            f"Find errors or improvements in this response:\n"
            f"Question: {prompt}\nResponse: {response}"
        )

        if "no errors found" in critique.lower():
            break

        response = model.generate(
            f"Question: {prompt}\n"
            f"Previous response: {response}\n"
            f"Critique: {critique}\n"
            f"Provide an improved response:"
        )

    return response

revision-early-stop.py
verified_scores = [0.61, 0.76, 0.78, 0.781, 0.781]
minimum_gain = 0.01
used_rounds = 1

for previous, current in zip(verified_scores, verified_scores[1:]):
    if current - previous < minimum_gain:
        break
    used_rounds += 1

print(f"used_rounds={used_rounds} selected_score={verified_scores[used_rounds - 1]:.3f}")
print(f"skipped_rounds={len(verified_scores) - used_rounds}")
Output
used_rounds=3 selected_score=0.780
skipped_rounds=2

3. Tree search with process reward models

This is the most sophisticated strategy, like a fulfillment planner who considers multiple recovery paths, evaluates each action, and prunes bad branches early. Instead of scoring just the final answer ("did you choose refund or reship?"), a process reward model scores each intermediate step ("is this policy check correct?").

Here's how a simplified beam search using a PRM operates. It accepts a model and a PRM, taking a prompt as the initial state. At each step, it generates multiple possible next steps (the beam width), scores them with the PRM, and keeps only the highest-scoring paths. The function outputs the best completed reasoning chain. This is a conceptual interface sketch; it assumes model.generate_step, prm.score_step, and is_final_answer exist.

3-tree-search-with-process-reward-models.py
def beam_search_with_prm(
    model,
    prm,
    prompt: str,
    beam_width: int = 4,
    branch_factor: int = 4,
    max_steps: int = 20
):
    """Guided tree search using per-step reward scores.

    Cost: O(beam_width × branch_factor × max_steps) forward passes.
    Best for: Multi-step mathematical or logical reasoning.
    """
    beams = [{"steps": [], "score": 0.0, "text": prompt}]

    for _ in range(max_steps):
        candidates = []
        for beam in beams:
            # Generate next reasoning step
            new_steps = [
                model.generate_step(beam["text"])
                for _ in range(branch_factor)
            ]

            for new_step in new_steps:
                # Score each step with the process reward model
                step_score = prm.score_step(
                    prompt, beam["steps"], new_step
                )
                candidates.append({
                    "steps": beam["steps"] + [new_step],
                    "score": beam["score"] + step_score,
                    "text": beam["text"] + "\n" + new_step,
                })

        # Keep top-k beams
        beams = sorted(candidates, key=lambda x: x["score"], reverse=True)[:beam_width]

        # Stop at the first completed chain. `beams` is sorted by score, so
        # finished[0] is the best completed beam, which may not be beams[0].
        finished = [b for b in beams if is_final_answer(b["steps"][-1])]
        if finished:
            return finished[0]

    # No beam finished within max_steps: fall back to the best partial chain.
    return beams[0]

Beam search is easiest to explain, but it has a well-known weakness for reasoning: beams can collapse onto near-duplicate traces. Stronger systems inject diversity, allow backtracking, or use Monte Carlo Tree Search-style expansion policies. The hard part isn't building a tree. It's having a verifier good enough to prune bad branches early.
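One lightweight way to limit beam collapse is to drop near-duplicate candidates before taking the top k. The sketch below is an assumption-laden illustration, not a method from the cited papers: it treats two candidates as duplicates when their step sequences match after lowercasing and whitespace normalization, and it reuses the `steps` and `score` dictionary keys from the beam search sketch.

```python
def normalize(step: str) -> str:
    """Collapse case and whitespace so trivially different steps compare equal."""
    return " ".join(step.lower().split())

def top_k_diverse(candidates: list[dict], k: int) -> list[dict]:
    """Keep the k best-scoring candidates, skipping any whose normalized
    step sequence matches one that was already kept."""
    kept, seen = [], set()
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        key = tuple(normalize(s) for s in cand["steps"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(cand)
        if len(kept) == k:
            break
    return kept

candidates = [
    {"steps": ["Check hub cutoff"], "score": 0.92},
    {"steps": ["check  hub cutoff"], "score": 0.91},
    {"steps": ["Check carrier capacity"], "score": 0.80},
]
print([c["steps"][-1] for c in top_k_diverse(candidates, k=2)])
```

Exact-match deduplication only catches trivial repeats; paraphrased duplicates would need an embedding or edit-distance comparison, which costs more per step.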

Process reward models vs. outcome reward models

The choice of reward model fundamentally affects search efficiency:

Figure: Outcome reward model versus process reward model comparison: ORM scores only final answers, while a reliable PRM can score steps and prune low-scoring paths early.
ORMs grade completed answers. A sufficiently reliable PRM can score intermediate steps and prune low-scoring branches before they burn a full rollout.

Think of it as final delivery audit versus checkpoint audit. An ORM only checks whether the final plan works: "delivered on time" or "missed SLA." You have no idea where the plan broke. A PRM checks each intermediate decision: "route chosen correctly," "hub cutoff valid," "carrier capacity wrong here." The per-step feedback is much more expensive to provide, and its own mistakes can discard a good path. When it is reliable enough, it can catch errors early and avoid wasting time on low-quality paths.

Outcome reward models (ORMs)

Outcome Reward Models (ORMs) evaluate the final generated output as a single, complete result. They act as a binary judge at the end of the generation process, determining only whether the final answer matches the expected solution.

  • Score only the final answer: The model receives a pass/fail grade based entirely on the conclusion (e.g., whether the final number is correct).
  • Simpler training: They're easier to train because they only need (question, answer, correctness) labels, which are readily available in many datasets.
  • Search limitation: They cannot guide intermediate reasoning steps. A generator might make a fatal error in step 1 but continue generating 100 more steps before the ORM finally rejects it.
  • Delayed feedback: The system must generate complete solutions before any evaluation can happen, heavily limiting the usefulness of tree search.

How PRMs work

Process Reward Models (PRMs), in contrast, evaluate the validity of each intermediate step in a chain of thought. By providing step-by-step guidance, they allow search algorithms to quickly abandon incorrect reasoning paths before wasting compute on them.

  • Granular evaluation: They score each reasoning step independently, identifying exactly where logic breaks down.
  • Early pruning: They can prune low-scoring paths before completion, improving search efficiency when their intermediate scores predict final correctness.
  • Complex data requirements: They require step-level annotations for training, which are notoriously expensive and difficult for humans to collect at scale.
  • Denser guidance: A process score gives a search controller earlier evidence for keeping or abandoning a branch, but it is not proof that a kept step is correct.

Here is a conceptual side-by-side comparison of how each model scores the same problem:

how-prms-work.py
# ORM: can only evaluate after the full solution is generated.
# It returns a single score for the complete answer.
orm_score = orm.score(
    question="What is 847 × 293?",
    full_solution="847 × 293 = 248,171"
)  # returns 1.0 (correct) or 0.0 (incorrect)

# PRM: evaluates each step independently as the solution is built.
# This lets the search algorithm prune bad paths before they grow.
prm_scores = prm.score_steps(
    question="What is 847 × 293?",
    steps=[
        "First, I'll break this into 847 × 300 - 847 × 7",  # step score: 0.95
        "847 × 300 = 254,100",                              # step score: 0.98
        "847 × 7 = 5,929",                                  # step score: 0.97
        "254,100 - 5,929 = 248,171",                        # step score: 0.99
    ]
)

Snell et al.[2] reported that compute-optimal test-time scaling can be more than 4× as efficient as a best-of-N baseline in their math-reasoning evaluation. Lightman et al.[10] reported that process supervision outperformed outcome supervision on MATH. These results motivate testing PRM-guided pruning on verifiable workloads; they do not establish that every PRM or task benefits.

That said, not every reasoning system uses a learned PRM or ORM. DeepSeek-R1-Zero trained with rule-based accuracy and format rewards rather than a neural reward model, which is one reason you should treat test-time compute as a family of techniques, not a single stack.[5]

prm-pruning-budget.py
```python
branch_lengths = [8, 8, 8, 8]
prune_at_step = [None, 2, 3, 1]

# An ORM must wait for every branch to finish before it can score anything.
orm_scored_steps = sum(branch_lengths)
# A PRM stops paying for a branch at the step where it is pruned.
prm_scored_steps = sum(
    full_length if stop is None else stop
    for full_length, stop in zip(branch_lengths, prune_at_step)
)
print(f"orm_steps={orm_scored_steps} prm_steps={prm_scored_steps}")
print(f"saved_steps={orm_scored_steps - prm_scored_steps}")
```
Output
```
orm_steps=32 prm_steps=14
saved_steps=18
```

This is the upside case in which the process scorer prunes the right paths. A deployment evaluation must also count false pruning of ultimately correct branches.
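One way to make that accounting concrete is sketched below. It assumes you have replayed pruned branches offline to learn whether each would have ended in a correct answer. The branch records, field names, and the `pruning_report` helper are hypothetical, and the values are illustrative only.

```python
# Hypothetical offline replay: each record says where PRM-guided search would
# have pruned the branch (None = kept to the end) and whether the branch would
# have reached a correct answer if allowed to finish. Values are illustrative.
branches = [
    {"length": 8, "prune_at": None, "would_be_correct": True},
    {"length": 8, "prune_at": 2, "would_be_correct": False},
    {"length": 8, "prune_at": 3, "would_be_correct": True},  # a false prune
    {"length": 8, "prune_at": 1, "would_be_correct": False},
]


def pruning_report(branches):
    """Summarize compute saved by pruning and correct branches lost to it."""
    full_steps = sum(b["length"] for b in branches)
    scored_steps = sum(
        b["length"] if b["prune_at"] is None else b["prune_at"] for b in branches
    )
    pruned = [b for b in branches if b["prune_at"] is not None]
    false_prunes = sum(1 for b in pruned if b["would_be_correct"])
    return {
        "saved_steps": full_steps - scored_steps,
        "pruned": len(pruned),
        "false_prunes": false_prunes,
        "false_prune_rate": false_prunes / len(pruned) if pruned else 0.0,
    }


report = pruning_report(branches)
print(report)
```

Reporting saved steps next to the false-prune rate keeps the efficiency gain and the quality risk in the same view.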

DeepSeek-R1-Zero and DeepSeek-R1

DeepSeek's contribution is easiest to understand as two related results. DeepSeek-R1-Zero showed that large-scale RL with verifiable rewards can induce strong reasoning behaviors without a supervised cold start.[5] DeepSeek-R1 then added cold-start data and additional post-training to make those behaviors more stable and readable.[5]

Training pipeline

The final DeepSeek-R1 pipeline isn't "pure RL" end to end. The paper explicitly describes two supervised fine-tuning (SFT) stages and two RL stages:

  1. Base model: Start with DeepSeek-V3-Base[5], a 671B Mixture-of-Experts (MoE) model with 37B active parameters per token.
  2. Cold-start Supervised Fine-Tuning (SFT): Collect thousands of long CoT examples and fine-tune the base model. This helps avoid the readability and language-mixing issues reported for R1-Zero.
  3. Reasoning-oriented RL: Run GRPO (Group Relative Policy Optimization)-based RL on reasoning tasks with verifiable, rule-based rewards, similar in spirit to R1-Zero.[5]
  4. Rejection sampling + second SFT: Use the RL checkpoint to generate high-quality reasoning traces, mix them with supervised data from DeepSeek-V3 for non-reasoning domains, and fine-tune DeepSeek-V3-Base again on the combined set.
  5. Final RL for all scenarios: Run another RL stage across a broader prompt mix to produce DeepSeek-R1.

That sequence matters because people often summarize R1 as "pure RL." Only R1-Zero fits that description.[5]
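The reward and advantage computation behind the reasoning-oriented RL stage can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation. The think/answer tag template follows the format described for R1-Zero, but the reward weights, the exact-match accuracy check, the use of population standard deviation, and the `eps` term are assumptions made for the example. The policy update itself (clipped probability ratio and KL penalty) is omitted.

```python
import re
from statistics import mean, pstdev

# Completions must put reasoning in <think> tags and the result in <answer> tags.
THINK_ANSWER = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def rule_based_reward(completion: str, reference: str) -> float:
    """Format reward plus accuracy reward; the weights are hypothetical."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # wrong format: no reward at all in this sketch
    format_reward = 0.5
    accuracy_reward = 1.0 if match.group(1).strip() == reference else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (the GRPO idea)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt ("What is 12 * 12?"), four sampled completions.
completions = [
    "<think>12 * 12 = 144</think><answer>144</answer>",
    "<think>12 * 12 = 124</think><answer>124</answer>",
    "The answer is 144",
    "<think>144</think><answer>144</answer>",
]
rewards = [rule_based_reward(c, "144") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 2) for a in advantages])
```

Because advantages are computed relative to the group mean, no learned value model or neural reward model is needed: completions that beat their siblings get positive advantages, and the rest get negative ones.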

Emergent reasoning

R1-Zero displayed behaviors such as self-verification, backtracking, and longer rollouts, but the paper also reports readability issues and occasional language mixing.[5] That detail matters. It means verifiable rewards can induce sophisticated reasoning behavior, but raw RL rollouts still need cleanup before they become a general-purpose product.

Compute-optimal inference: when to think harder

Not all problems benefit equally from extended reasoning, and on some tasks extra thinking actively hurts. Ghosal et al. studied this directly: across reasoning models, accuracy often rises with a little more thinking and then falls as the trace grows, an inverted-U rather than a monotonic climb.[8] They attribute the apparent early gains partly to higher output variance instead of genuinely better reasoning, and they recommend parallel sampling (best-of-N) over endlessly extending one trace. An Anthropic study found the same inverse-scaling effect on several tasks, where longer reasoning amplified distractions and errors rather than fixing them.[11] The lesson for production is that thinking budget is a real tuning parameter with a sweet spot, not a slider you turn to maximum.
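A simple way to act on the inverted-U is to sweep the thinking budget on a held-out evaluation set and pick the smallest budget that stays close to the peak. The sketch below assumes such a sweep already exists. The accuracy numbers, the `pick_budget` helper, and the tolerance value are hypothetical.

```python
# Hypothetical sweep: thinking budget in tokens -> measured accuracy.
# The shape mimics an inverted-U; the numbers are not from any paper.
sweep = {0: 0.62, 256: 0.71, 1024: 0.76, 4096: 0.75, 16384: 0.69}


def pick_budget(sweep, tolerance=0.01):
    """Return the smallest budget whose accuracy is within tolerance of the peak."""
    best = max(sweep.values())
    return min(budget for budget, acc in sweep.items() if acc >= best - tolerance)


chosen_budget = pick_budget(sweep)
print(f"chosen_budget={chosen_budget}")
```

Preferring the smallest near-peak budget, rather than the single best score, avoids paying for thinking tokens that only move accuracy within measurement noise.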

The optimal test-time compute allocation depends on problem difficulty:

| Problem type | Candidate strategy to evaluate | Example |
| --- | --- | --- |
| Factual recall | Single pass | "What is the damaged-item return window?" |
| Simple reasoning | Short reasoning budget or brief scratchpad | "Can an order shipped yesterday arrive by Friday?" |
| Multi-step math | Longer scratchpad or best-of-N | Warehouse capacity or carrier-cost calculation |
| Complex code | Deliberation plus search or repair loops | Multi-file debugging and repair |
| Open-ended analysis | Sequential revision | Logistics rollout or support-policy analysis |

The cross-over point

A task family may have a compute-budget threshold above which test-time scaling becomes more efficient than using a larger single-pass model. Whether a given workload sits above or below that threshold has to be established with quality, latency, and cost measurements:

$$
L_{\text{small+search}}(C) < L_{\text{large,single}} \quad \text{for some } C \ge C^*
$$

There's a compute budget $C^*$ above which a small model that searches through multiple solutions can outperform a large model answering in one shot. This isn't universal across all tasks, but it appears clearly on hard reasoning problems where the smaller model already has a non-zero hit rate.

Snell et al.[2] observed that in FLOPs-matched evaluations, a smaller model with additional test-time compute can outperform a model roughly 14× larger answering in a single pass on problems where the small model already achieves non-trivial success rates. This makes longer thinking a candidate for compensating for model size on similar evaluated tasks, not a general replacement for larger models.
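The role of the non-zero hit rate shows up in a toy calculation. The sketch below assumes independent samples and a perfect verifier that always picks a correct candidate when one exists, which real verifiers do not guarantee. It counts samples rather than matching FLOPs, and the accuracy values and helper names are hypothetical.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1 - (1 - p) ** n


def crossover_samples(p_small: float, acc_large: float, max_n: int = 1024):
    """Smallest n where the small model with search beats the large model, else None."""
    if p_small <= 0:
        return None  # no amount of sampling helps a model that never succeeds
    for n in range(1, max_n + 1):
        if best_of_n_success(p_small, n) > acc_large:
            return n
    return None


# Hypothetical numbers: small model solves 20% per sample, large model 60% single-pass.
n_star = crossover_samples(p_small=0.20, acc_large=0.60)
print(f"crossover at n={n_star}, success={best_of_n_success(0.20, n_star):.3f}")
print(crossover_samples(p_small=0.0, acc_large=0.60))
```

With an imperfect verifier the crossover moves to a larger sample count or disappears, which is why the threshold has to be measured per task family.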

Figure: Compute allocation router choosing between fast single-pass generation, short reasoning budgets, and guided search based on task signals.
A production router should decide whether the request earns the expensive path. Difficulty, stakes, latency budget, and verifier availability matter as much as the model name.

To handle varying levels of difficulty in production without exploding costs, systems often implement a routing layer. This architecture evaluates an incoming task's complexity and directs it to the appropriate model and test-time strategy, balancing latency, quality, and computational cost.

measured-routing-policy.py
```python
routes = [
    {"name": "single-pass", "quality": 0.78, "latency_ms": 240, "cost": 0.002},
    {"name": "bounded-reasoning", "quality": 0.86, "latency_ms": 780, "cost": 0.010},
    {"name": "guided-search", "quality": 0.90, "latency_ms": 2600, "cost": 0.045},
]
latency_slo_ms = 1000
minimum_quality = 0.84

# Keep only routes that satisfy both the latency SLO and the quality floor.
eligible = [
    route for route in routes
    if route["latency_ms"] <= latency_slo_ms and route["quality"] >= minimum_quality
]
# Among eligible routes, pick the cheapest one.
chosen = min(eligible, key=lambda route: route["cost"])
print(f"route={chosen['name']} quality={chosen['quality']:.0%}")
print(f"latency_ms={chosen['latency_ms']} cost=${chosen['cost']:.3f}")
```
Output
```
route=bounded-reasoning quality=86%
latency_ms=780 cost=$0.010
```

Production considerations

Deploying reasoning models brings operational costs alongside the capability gains: more KV cache pressure, slower first tokens, and higher spend per request. Engineering teams need to plan for each of these before scaling reasoning traffic.

KV cache pressure and prefix sharing

Reasoning workloads stress inference engines differently from ordinary chat because they often generate far more intermediate tokens before the visible answer arrives.

For every token it processes, the server stores a Key vector and a Value vector per KV head in every layer:

$$
\text{KV bytes per token} = 2 \times L \times n_{kv} \times d_h \times b
$$

Where $L$ is the number of layers, $n_{kv}$ is the number of KV heads, $d_h$ is the head dimension, and $b$ is bytes per value. The important scaling fact is simple: KV cache memory grows linearly with sequence length. A rollout that spends 10,000 tokens thinking creates roughly 10,000 tokens' worth of KV state, which can collapse batch size long before raw FLOPs become the bottleneck.
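As a worked instance of the formula, the sketch below plugs in a hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, and 2 bytes per value (16-bit storage). These values are assumptions chosen for round numbers, not the specification of any particular model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """2 (one K and one V) x layers x KV heads x head dim x bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model configuration.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2
)
thinking_tokens = 10_000
trace_gib = per_token * thinking_tokens / 1024 ** 3

print(f"kv_bytes_per_token={per_token:,} ({per_token // 1024} KiB)")
print(f"kv_for_{thinking_tokens:,}_tokens={trace_gib:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a single 10,000-token reasoning trace holds about 1.22 GiB of KV state for as long as the request is live.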

PagedAttention stores KV cache in fixed-size blocks to reduce fragmentation and make scheduling practical at long context lengths.[12] When multiple rollouts share the same prompt or partial trace, prefix-sharing runtimes can reuse that cached prefix instead of duplicating it across every branch. SGLang's RadixAttention is a good example.[13] This matters a lot for best-of-N, self-consistency, and tree search, where many candidates share the same long prompt and diverge only near the leaves.

prefix-sharing-cache-accounting.py
```python
prompt_tokens = 4000
continuation_tokens = 1000
branches = 8
kv_bytes_per_token = 128 * 1024

# Without sharing, every branch stores its own copy of the prompt's KV state.
without_sharing = branches * (prompt_tokens + continuation_tokens)
# With sharing, the prompt is cached once and only continuations are per-branch.
with_sharing = prompt_tokens + branches * continuation_tokens
saved_gib = (without_sharing - with_sharing) * kv_bytes_per_token / 1024 ** 3
print(f"tokens_without_sharing={without_sharing:,}")
print(f"tokens_with_sharing={with_sharing:,} saved_kv_gib={saved_gib:.2f}")
```
Output
```
tokens_without_sharing=40,000
tokens_with_sharing=12,000 saved_kv_gib=3.42
```

TTFT vs. inter-token latency

Test-time compute distorts user-facing latency metrics. For a provider or wrapper that generates non-visible reasoning before revealing the answer, TTFT (time to first token) can increase substantially.[4] ITL (inter-token latency) can still be fine once the answer starts streaming. In other words, a reasoning system can feel slow even when decode throughput is healthy.

ttft-budget-slo.py
```python
base_ttft_ms = 180
decode_tokens_per_second = 80
budgets = [0, 128, 512, 1024]
ttft_slo_ms = 3000

for budget in budgets:
    # Hidden reasoning tokens are decoded before the first visible token.
    ttft_ms = base_ttft_ms + budget / decode_tokens_per_second * 1000
    status = "fits" if ttft_ms <= ttft_slo_ms else "reject"
    print(f"budget={budget:>4} ttft_ms={ttft_ms:>7.0f} {status}")
```
Output
```
budget=   0 ttft_ms=    180 fits
budget= 128 ttft_ms=   1780 fits
budget= 512 ttft_ms=   6580 reject
budget=1024 ttft_ms=  12980 reject
```

Latency vs. quality trade-off

Reasoning models introduce significant latency, but the exact numbers depend on hardware, batching policy, and provider implementation:

| Strategy | User-visible behavior | Best fit |
| --- | --- | --- |
| Single-pass generation | Low TTFT, short answers | Chat, extraction, classification |
| Reasoning or search-heavy generation | Higher TTFT, variable token budget, sometimes hidden intermediate work | Math, code, planning, verification |

For strict interactive latency budgets, a long reasoning path may be unacceptable unless measured gains justify it. Candidates for larger budgets include:

  • Batch processing (code review, data analysis)
  • High-stakes analyst workflows (chargeback review, fraud review, compliance analysis)
  • Asynchronous workflows (software engineering, research)

Cost implications

Extended reasoning is expensive even when the final answer is short. Hosted APIs may charge directly for hidden reasoning tokens. OpenAI's reasoning docs, for example, note that these tokens are billed as output tokens even though they aren't returned verbatim in the API response.[4] Open-weight deployments pay through longer wall-clock time, lower throughput, and higher KV cache residency. Either way, longer rollouts reduce how many concurrent requests the same GPU can serve.
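The billing effect is easy to underestimate, so here is a small accounting sketch. It assumes hidden reasoning tokens are billed at the output-token rate, as described above. The price per million tokens and the token counts are hypothetical placeholders, not any provider's actual rates.

```python
def output_cost(visible_tokens, reasoning_tokens, usd_per_million_output):
    """Return (billed output tokens, cost in USD) for one response."""
    billed = visible_tokens + reasoning_tokens
    return billed, billed * usd_per_million_output / 1_000_000


# Hypothetical request: a short visible answer after a long hidden trace.
billed_tokens, cost_usd = output_cost(
    visible_tokens=150, reasoning_tokens=3000, usd_per_million_output=10.0
)
_, visible_only_usd = output_cost(150, 0, 10.0)

print(f"billed_output_tokens={billed_tokens} cost=${cost_usd:.4f}")
print(f"cost_if_only_visible_tokens_counted=${visible_only_usd:.4f}")
```

In this example the visible answer accounts for under 5% of the billed output, which is why cost dashboards should track total generated tokens rather than response length.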

Production systems usually combine three controls:

  1. Routing: send only hard or high-stakes tasks to expensive reasoning paths
  2. Token budgets: cap how long a rollout may think before the marginal gain stops being worth it
  3. Early stopping: terminate search when verifier scores stop improving

Without those controls, ambiguous or unsolvable prompts can burn a large amount of inference budget without producing a better answer.
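The second and third controls can be combined in one loop, sketched below. It assumes each search round reports how many tokens it used and the best verifier score so far. The `run_with_controls` helper, the `patience` and `min_delta` parameters, and the sample rounds are hypothetical.

```python
def run_with_controls(rounds, token_budget, patience=2, min_delta=0.01):
    """Consume (tokens, verifier_score) rounds until a budget or plateau stop.

    Returns (best_score, tokens_spent, stop_reason).
    """
    best = float("-inf")
    stale_rounds = 0
    spent = 0
    for tokens, score in rounds:
        if spent + tokens > token_budget:
            return best, spent, "token_budget"  # hard cap: do not start this round
        spent += tokens
        improved = score > best + min_delta
        best = max(best, score)
        stale_rounds = 0 if improved else stale_rounds + 1
        if stale_rounds >= patience:
            return best, spent, "plateau"  # verifier scores stopped improving
    return best, spent, "completed"


# Hypothetical search rounds: (tokens used, verifier score).
rounds = [(400, 0.55), (400, 0.70), (400, 0.705), (400, 0.70), (400, 0.72)]
plateau_result = run_with_controls(rounds, token_budget=4000)
capped_result = run_with_controls(rounds, token_budget=1000)
print(plateau_result)
print(capped_result)
```

The plateau rule stops after two rounds without a meaningful gain even though budget remains, while the hard cap protects against prompts whose scores keep creeping up slowly.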

Distillation: making reasoning affordable

Distillation is like having a senior logistics planner write detailed recovery traces, then training a smaller model to solve similar incidents the same way. The smaller model won't match the source model everywhere, but it can inherit much of the solution style while being much cheaper to serve.

DeepSeek-R1 showed that reasoning capabilities can be distilled into much smaller models:

| Distilled model | Source | AIME 2024 (pass@1: accuracy on first attempt) |
| --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | R1 → Qwen2.5-Math-1.5B | 28.9% |
| DeepSeek-R1-Distill-Qwen-7B | R1 → Qwen2.5-Math-7B | 55.5% |
| DeepSeek-R1-Distill-Qwen-32B | R1 → Qwen2.5-32B | 72.6% |
| DeepSeek-R1 (full) | - | 79.8% |
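To put the rows on one scale, the short calculation below expresses each distilled model's score as a fraction of the full R1 score. It uses only the numbers in the table and says nothing about other benchmarks or about serving cost.

```python
# AIME 2024 pass@1 scores (percent), copied from the table above.
aime_pass_at_1 = {
    "DeepSeek-R1-Distill-Qwen-1.5B": 28.9,
    "DeepSeek-R1-Distill-Qwen-7B": 55.5,
    "DeepSeek-R1-Distill-Qwen-32B": 72.6,
}
r1_score = 79.8

# Fraction of the full model's score that each distilled model reaches.
retained = {name: score / r1_score for name, score in aime_pass_at_1.items()}
for name, fraction in retained.items():
    print(f"{name}: {fraction:.0%} of R1's AIME 2024 pass@1")
```

On this single benchmark the 1.5B, 7B, and 32B students reach roughly 36%, 70%, and 91% of the R1 score respectively.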

In DeepSeek's reported AIME evaluation, the 7B and 32B distilled models retain substantial benchmark accuracy at lower parameter counts than R1. Parameter count is only a proxy, though. Real deployment cost still depends on active parameters, quantization, batch size, and engine choice. Distillation can make reasoning cheaper, but it doesn't remove the need for routing and token budgets.

serving-budget-gate.py
```python
requests = [
    {"kind": "extract", "verifiable": True, "predicted_gain": 0.00},
    {"kind": "capacity_math", "verifiable": True, "predicted_gain": 0.09},
    {"kind": "creative_copy", "verifiable": False, "predicted_gain": 0.03},
]
minimum_gain = 0.05

for request in requests:
    # Pay for reasoning only when the result can be checked and the gain is large enough.
    use_reasoning = request["verifiable"] and request["predicted_gain"] >= minimum_gain
    route = "reasoning" if use_reasoning else "single-pass"
    print(f"{request['kind']}: {route}")
```
Output
```
extract: single-pass
capacity_math: reasoning
creative_copy: single-pass
```

Mastery check

Evaluation rubric

  • Scaling view: train-time compute improves the model before deployment, while test-time compute spends extra budget during generation.
  • Strategy view: longer traces, best-of-N, sequential revision, and verifier-guided tree search are different ways to buy more deliberation.
  • Verifier view: ORMs score final answers, while reliable PRMs let a search policy prune low-scoring branches early.
  • Systems view: when an API uses and reports non-visible reasoning tokens, they still affect context budget, KV cache, TTFT, throughput, and billing.
  • Routing view: reasoning paths are candidates when the task is hard, verifiable, and important enough to justify measured latency and cost.
  • Budget view: provider/model knobs differ (OpenAI reasoning.effort, Gemini thinkingLevel or thinkingBudget, Claude adaptive effort or legacy budget_tokens), and more thinking is not always better because accuracy can follow an inverted-U.
  • DeepSeek view: R1-Zero showed pure-RL reasoning behavior; R1 added cold-start data, SFT, and additional RL to make the behavior more usable.

Follow-up questions

When should you choose a reasoning model over a standard model?

Evaluate a reasoning path when the task needs multi-step deduction and you have a way to check whether the answer is good. Math, formal verification, code debugging, and structured planning are promising candidates. Start with a single-pass baseline for summarization, translation, extraction, and conversational UX, then promote only workloads where measured correctness gains justify latency and cost.

How does test-time compute interact with KV-cache optimization?

Reasoning workloads generate longer rollouts or multiple branches, which increases KV-cache pressure because each new token adds per-layer K/V state. KV-cache memory grows linearly with sequence length, so long reasoning traces reduce batch size and throughput. Engines like vLLM use paging to manage that memory, and prefix-sharing runtimes can reuse the shared prefix across best-of-N or search branches instead of duplicating it.

What are the failure modes of extended reasoning?

Common failure modes include loops, error propagation, overthinking, and weak self-verification. If the same model generates and judges, its blind spots are correlated, so it may confidently approve a bad trace. PRMs, task-specific checkers, explicit token budgets, and early stopping help by scoring intermediate steps and terminating unproductive branches earlier.

How can you distill reasoning into a smaller model?

Distillation uses traces or verifier-filtered answers from a stronger reasoning system to train a smaller student. A student can learn useful solution patterns, as the DeepSeek results illustrate, but the gain must be measured against a comparable baseline. The catch is that the student may imitate output format more easily than the teacher's actual search policy, so good benchmark scores don't always mean the smaller model learned the same underlying reasoning process.

How would you set a first-pass router for mixed workloads?

Start by separating tasks into three buckets: easy pattern matching, medium tasks with occasional reasoning upside, and hard tasks with clear verification signals. Route the first bucket to a fast single-pass model, route the second to a bounded reasoning budget or brief best-of-N, and reserve verifier-guided search for the third. Then instrument latency, total generated tokens, and verified win rate so the router can move requests down when extra thinking stops paying off.
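The three-bucket policy and the "move requests down" rule can be sketched as follows. The difficulty labels, route names, the `verified_win_rate` metric (how often the expensive route's verified answer beat the next cheaper route), and the 5% threshold are all hypothetical choices for illustration.

```python
# Routes ordered from cheapest to most expensive.
ROUTES = ["single-pass", "bounded-reasoning", "verifier-guided-search"]


def initial_route(difficulty: str, has_verifier: bool) -> str:
    """First-pass bucket assignment based on coarse task signals."""
    if difficulty == "easy":
        return ROUTES[0]
    if difficulty == "hard" and has_verifier:
        return ROUTES[2]
    return ROUTES[1]  # medium tasks, or hard tasks with nothing to verify against


def demote_if_unproductive(route: str, verified_win_rate: float,
                           min_win_rate: float = 0.05) -> str:
    """Move one step cheaper when measured wins over the cheaper route are too rare."""
    index = ROUTES.index(route)
    if index > 0 and verified_win_rate < min_win_rate:
        return ROUTES[index - 1]
    return route


route = initial_route("hard", has_verifier=True)
print(route)
print(demote_if_unproductive(route, verified_win_rate=0.02))
print(demote_if_unproductive(route, verified_win_rate=0.12))
```

Keeping the demotion rule separate from the initial assignment lets the bucket definitions stay static while measurements adjust traffic over time.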

Common pitfalls

Even experienced engineers misapply reasoning models. Here are frequent pitfalls, with symptoms to watch for and concrete fixes.

The intelligence tax

Symptom: A production task takes 30 seconds to complete and costs five times more than the standard model, but the output quality is identical.

Cause: You're using a reasoning model for a task that doesn't show measurable benefit from extra deduction. Summarization, translation, simple extraction, and sentiment analysis are often strong single-pass baselines. Extra thinking time may add latency and cost without an accuracy gain.

Fix: Route easy tasks to fast single-pass models. Reserve reasoning models for math, code, planning, verification, and other tasks where you can actually verify whether the answer is correct.

Confusing prompting with architecture

Symptom: You assume a reasoning model is just a standard model with a hidden system prompt like "You are a careful thinker..."

Cause: The API hides the reasoning tokens, so it's tempting to imagine they're produced by a wrapper around a normal model. That mental model misses the training difference.

Fix: Reasoning models may be post-trained, often with RL or distillation, to use extra inference compute productively. Treat this as a change in model behavior and serving accounting. It is not a prompt hack, and it is not necessarily a new transformer architecture.

Treating visible chain-of-thought as the whole system

Symptom: You assume a visible "think step by step" trace, a hidden reasoning trace, best-of-N, PRM search, and RL with verifiable rewards are interchangeable.

Cause: They all spend extra inference or training budget around reasoning, but they solve different problems.

Fix: Name the mechanism. Prompted CoT changes the prompt. Hidden reasoning changes model behavior and serving accounting. Best-of-N changes sampling. PRM search changes branch selection. Verifiable-reward RL changes post-training.

Ignoring KV cache pressure

Symptom: A reasoning rollout looks cheap because the final answer is short, but batch size collapses under load.

Cause: Hidden and candidate tokens still create K/V state across layers.

Fix: Track total generated tokens, not just visible tokens. Use paging, prefix sharing, routing, and hard budgets before scaling reasoning traffic.

Infinite loops and overthinking

Symptom: Token usage spikes on certain prompts, the model spins in circles repeating the same deduction, or accuracy drops on simple prompts when the budget is set high.

Cause: No reasoning budget was set, or the budget was set too high. On ambiguous or simple prompts, more thinking can amplify distractions and errors rather than fix them, so accuracy follows an inverted-U as the trace grows.[8][11]

Fix: Set a reasoning budget tuned to the task instead of maximizing it, and add a hard cap. Implement early stopping when verifier confidence plateaus. For simple prompts, lower the effort level. Monitor per-request token distributions and alert on outliers.
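Two cheap monitors for these symptoms are sketched below: a repeated-line ratio as a rough loop signal, and a median-based outlier check on per-request token counts. Both helpers and their thresholds are hypothetical heuristics, and they assume you can inspect the trace text, which hosted APIs that hide reasoning may not allow.

```python
from collections import Counter
from statistics import median


def repeated_line_ratio(trace: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line (case-insensitive)."""
    lines = [line.strip().lower() for line in trace.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(lines)


def token_outliers(token_counts, multiplier=3.0):
    """Indices of requests whose token count exceeds multiplier x the median."""
    typical = median(token_counts)
    return [i for i, tokens in enumerate(token_counts) if tokens > multiplier * typical]


# Hypothetical looping trace and per-request token counts.
trace = "\n".join([
    "Check x = 2.", "So y = 4.", "Wait, check x = 2.",
    "Check x = 2.", "So y = 4.", "Check x = 2.", "So y = 4.",
])
loop_ratio = repeated_line_ratio(trace)
outliers = token_outliers([900, 1100, 1000, 950, 12000])
print(f"loop_ratio={loop_ratio:.2f} outlier_requests={outliers}")
```

Neither signal proves the model is stuck, but a high repeat ratio or a token count far above the median is a reasonable trigger for cutting the rollout or flagging the prompt for review.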

Practice

Before you move on, try this short audit to make the trade-offs concrete.

The Efficiency Audit

Consider these five production tasks. For each one, predict whether it benefits from test-time compute scaling, then check your reasoning against the explanation.

| Task | Your prediction | Why it does or doesn't benefit |
| --- | --- | --- |
| Creative writing (marketing copy) | | Single-pass fluency is usually sufficient; revision helps but often isn't worth the latency cost |
| Multi-step warehouse capacity calculation | | Hard reasoning with verifiable arithmetic; measure best-of-N or guided search against baseline |
| Sentiment analysis of support tickets | | Strong single-pass baseline; only promote it if evaluation finds a gain |
| Debugging a 500-error across three microservices | | Multi-step code reasoning with executable checks; evaluate repair-loop gain |
| Translation of product descriptions | | Pattern-matching with strong base-model performance; extra reasoning adds little |

The warehouse calculation and microservice debugging are the strongest candidates because they provide verification signals. The core idea is that test-time compute is worth testing when the task requires verifiable, multi-step reasoning, not merely because a task sounds difficult.

What to remember and where to go next

  1. Test-time compute adds a second scaling axis: spending more compute at inference can outperform moving to a larger single-pass model on hard reasoning tasks.[2]

  2. Test-time compute is an umbrella, not one algorithm: longer scratchpads, repeated sampling, revision loops, and explicit search all fit under the same idea.

  3. Good verifiers and compute-optimal policies matter: Snell et al. report more than 4× better test-time compute efficiency than a best-of-N baseline in their evaluation, and PRMs can support earlier pruning because they score intermediate steps instead of waiting for the end.[2][10]

  4. DeepSeek-R1-Zero and DeepSeek-R1 aren't the same result: R1-Zero showed emergent reasoning from pure RL, while DeepSeek-R1 added cold-start data plus additional SFT and RL stages to make that behavior readable and broadly usable.[5]

  5. Thinking is now a dial, and the sweet spot is not always the maximum: current providers expose model-specific effort or thinking controls, and accuracy can follow an inverted-U as the trace grows, so tune the supported knob rather than maxing it.[4][6][7][8]

  6. Serving bottlenecks matter as much as algorithms: KV cache memory grows linearly with long reasoning traces, so prefix sharing, routing, and token budgets are core production tools.[12][13]

You should now be able to explain why a single-pass model can miss "How many r's in strawberry," why a smaller model with traces, search, or distillation can recover much of a larger model's math performance, and why your production stack needs a routing layer before it needs a bigger GPU.

Next Step
Continue to Advanced MLOps and DevOps for AI Systems

You now understand why reasoning workloads need careful routing, token budgets, verification, and serving signals. The next chapter turns those signals into operational practice: GitOps, feature stores, shadow traffic, canaries, lineage, and automated rollback for production AI systems.

References

[1] Kaplan et al. Scaling Laws for Neural Language Models. 2020.

[2] Snell, C., et al. Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. 2024. arXiv preprint.

[3] OpenAI. Learning to reason with LLMs. 2024.

[4] OpenAI. Reasoning models. 2026.

[5] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.

[6] Google. Gemini thinking. 2025.

[7] Anthropic. Building with extended thinking. 2025.

[8] Ghosal, S. S., Chakraborty, S., Reddy, S., et al. Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models. 2025.

[9] Wei, J., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2022. NeurIPS.

[10] Lightman, H., et al. Let's Verify Step by Step. 2023. ICLR.

[11] Anthropic. Inverse Scaling in Test-Time Compute. 2025.

[12] Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. 2023. SOSP 2023.

[13] Zheng, L., Yin, L., Xie, Z., et al. SGLang: Efficient Execution of Structured Language Model Programs. 2023. arXiv:2312.07104.