LeetLLM

CoT, ToT & Self-Consistency Prompting

Build and evaluate reasoning controllers: single traces, answer voting, and bounded tree search for multi-step LLM decisions.


The previous lesson measured whether compressed embeddings still retrieved the right policy evidence. Retrieval is not the end of the task. An assistant may now have the right facts and still choose the wrong action because it skips a condition, mishandles arithmetic, or commits too early to one plan.

This lesson turns extra inference work into an engineering decision. You will build a small support-resolution controller for a delayed parcel:

  • Chain-of-Thought (CoT): request one decomposed candidate decision.
  • Self-Consistency: sample several candidates, normalize final actions, and vote.
  • Tree-of-Thoughts (ToT): expand and prune branches when a decision needs backtracking.

The goal is not to collect long rationales. The goal is to improve measurable decision accuracy under a cost, latency, and safety budget.

Start with a failure you can audit

Suppose an order has been delayed for six days. A replacement is in stock, the carrier scan confirms a delivery exception, and policy permits expedited replacement once a parcel has been delayed for at least five days. A direct response may still overlook one condition and offer a refund or escalation unnecessarily.

A useful outward artifact is a short decision record: facts used, checks applied, and final action. It is smaller than an open-ended rationale, easy to score, and safe to compare against deterministic policy logic.

decision-record-contract.py
from dataclasses import dataclass

@dataclass(frozen=True)
class ParcelCase:
    delayed_days: int
    exception_scan: bool
    replacement_in_stock: bool
    expedite_after_days: int = 5

def decision_record(case: ParcelCase) -> dict[str, object]:
    checks = {
        "delay_threshold_met": case.delayed_days >= case.expedite_after_days,
        "carrier_exception_confirmed": case.exception_scan,
        "replacement_available": case.replacement_in_stock,
    }
    action = "reship_expedited" if all(checks.values()) else "manual_review"
    return {"checks": checks, "final_action": action}

record = decision_record(
    ParcelCase(delayed_days=6, exception_scan=True, replacement_in_stock=True)
)
for name, passed in record["checks"].items():
    print(f"{name}: {passed}")
print(f"final_action: {record['final_action']}")
Output
delay_threshold_met: True
carrier_exception_confirmed: True
replacement_available: True
final_action: reship_expedited

This code is not an LLM. It is the oracle that your prompt variants must match. Before increasing model compute, define an output contract and a scorer.

[Figure: Chain-of-thought prompting comparison showing a warehouse inventory problem where direct prompting skips a step and chain-of-thought tracks intermediate state before the final answer.]
A reasoning trace can expose a skipped intermediate step, but the verified final result remains the outcome you evaluate.

Chain-of-Thought: one decomposed candidate

Wei et al. introduced few-shot CoT by placing worked intermediate steps in prompt examples. Their experiments found gains on arithmetic, commonsense, and symbolic reasoning benchmarks for sufficiently large models, including a strong GSM8K result with PaLM 540B.[1] Kojima et al. later showed that the zero-shot trigger "Let's think step by step" could improve several benchmark reasoning tasks without worked examples.[2]

Those papers establish techniques to evaluate, not a production law. On your endpoint and task, a visible scratchpad may help, do nothing, or add cost. If an API offers a native reasoning control, evaluate that option as another strategy rather than assuming that extra visible prose helps.

For a customer-facing workflow, do not ask the model to reveal unrestricted inner reasoning. Ask it to produce reviewable artifacts:

single-trace-prompt.txt
Use only the supplied parcel facts and policy rules.
Return:
1. required_checks: each policy predicate with pass/fail
2. final_action: one allowed action enum
3. customer_message: one sentence

Facts:
- delayed_days: 6
- carrier_exception_confirmed: true
- replacement_in_stock: true

Policy:
- expedited replacement is allowed when delayed_days >= 5,
  an exception is confirmed, and replacement stock exists.

The checks are useful because a missed predicate becomes observable. They are not proof of faithful hidden cognition. Turpin et al. showed that CoT explanations can rationalize outputs influenced by hidden biasing features without mentioning those features.[3] Log inputs, actions, validations, and outcomes; do not treat eloquent reasoning text as an audit guarantee.
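One way to act on that logging advice is to record only what the runtime can verify, rather than the model's prose. The sketch below is illustrative: the `AuditRecord` fields, the `build_audit_record` helper, and the two validations are assumptions made for this example, not part of any library.

```python
import json
from dataclasses import asdict, dataclass, field

ALLOWED_ACTIONS = {"reship_expedited", "manual_review"}

# Hypothetical record shape: every field is something the runtime can verify.
@dataclass(frozen=True)
class AuditRecord:
    case_id: str
    inputs: dict[str, object]          # facts supplied to the model
    proposed_action: str               # action parsed from the model output
    validations: dict[str, bool] = field(default_factory=dict)
    outcome: str = "pending"           # what the runtime actually did

def build_audit_record(
    case_id: str, inputs: dict[str, object], proposed_action: str
) -> AuditRecord:
    validations = {
        "action_in_allowlist": proposed_action in ALLOWED_ACTIONS,
        "delay_threshold_met": int(inputs["delayed_days"]) >= 5,
    }
    # Fall back to manual review whenever any validation fails.
    outcome = proposed_action if all(validations.values()) else "manual_review"
    return AuditRecord(case_id, inputs, proposed_action, validations, outcome)

record = build_audit_record(
    "A102", {"delayed_days": 6, "exception_scan": True}, "reship_expedited"
)
print(json.dumps(asdict(record), sort_keys=True))
```

The record contains no rationale text at all. If a trace is stored for debugging, it can sit beside this record, but the audit trail rests on the validated fields.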

score-candidate-decisions.py
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    checks: set[str]
    final_action: str

required_checks = {
    "delay_threshold_met",
    "carrier_exception_confirmed",
    "replacement_available",
}
expected_action = "reship_expedited"
candidates = [
    Candidate("direct", {"delay_threshold_met"}, "manual_review"),
    Candidate("structured_trace", required_checks, "reship_expedited"),
]

for candidate in candidates:
    coverage = len(candidate.checks & required_checks) / len(required_checks)
    action_ok = candidate.final_action == expected_action
    print(f"{candidate.name}: coverage={coverage:.0%}, action_ok={action_ok}")
Output
direct: coverage=33%, action_ok=False
structured_trace: coverage=100%, action_ok=True

Here, the trace is useful because it meets the scored contract. It does not win merely because it contains more words.

Zero-shot or few-shot?

Zero-shot CoT supplies an instruction and lets the model choose its intermediate format. Few-shot CoT gives one or more solved examples, so it can teach both decomposition and the final answer shape. Native structured-output enforcement is better when available; examples are still useful when the prompt must communicate task-specific checks.

[Figure: Side-by-side comparison of zero-shot reasoning instructions and few-shot parcel decisions that teach required checks and a final action field.]
Few-shot examples spend prompt tokens to teach the contract you will score.
build-few-shot-decision-prompt.py
example = """Example:
facts: delayed_days=2, exception_scan=true, replacement_in_stock=true
required_checks: delay_threshold_met=false, carrier_exception_confirmed=true, replacement_available=true
final_action: manual_review"""

case = "facts: delayed_days=6, exception_scan=true, replacement_in_stock=true"
zero_shot = f"Evaluate parcel policy step by step.\n{case}\nfinal_action:"
few_shot = f"{example}\n\nNow evaluate:\n{case}\nrequired_checks:\nfinal_action:"

print(f"zero_shot_has_example: {'Example:' in zero_shot}")
print(f"few_shot_has_example: {'Example:' in few_shot}")
print(f"few_shot_requests_checks: {'required_checks:' in few_shot}")
Output
zero_shot_has_example: False
few_shot_has_example: True
few_shot_requests_checks: True

Self-Consistency: sample answers, then vote

One structured trace can fail because generation takes an unlucky path. Self-Consistency replaces reliance on one path with several sampled paths and chooses the most consistent final answer.[4] The vote operates on extracted answers, not on whose rationale sounds best.

On the original benchmark setting, Wang et al. reported a 17.9 percentage-point GSM8K improvement over CoT prompting for PaLM 540B.[4] That number is evidence for the method on those benchmarks. It is not your expected support-resolution gain. Measure your own cases and sample cost.
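To build intuition for why a benchmark gain does not transfer automatically, it helps to look at an idealized model of voting. The sketch below assumes a two-way decision, an odd number of samples, and samples that are independent and each correct with probability `p`. Those are simplifying assumptions: samples drawn from one model on one prompt are usually correlated, so real gains can be smaller.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct."""
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )

for n in (1, 3, 5, 9):
    strong = majority_accuracy(0.70, n)
    weak = majority_accuracy(0.45, n)
    print(f"n={n}: per_sample_0.70 -> {strong:.3f}, per_sample_0.45 -> {weak:.3f}")
```

Under these assumptions, voting amplifies whatever the single sample already tends to do. A per-sample accuracy above one half improves with more samples, while one below one half gets worse, which is another reason to measure single-trace accuracy on your own cases first.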

[Figure: Direct prompting, chain-of-thought, and self-consistency flow comparison showing how structured checks and sampled paths change a final action decision.]
Self-consistency spends several model calls and aggregates the normalized decision field.

Normalize before counting

Model outputs rarely use exactly one spelling. Your controller should map harmless variants to one allowed action and reject unknown outputs before voting.

canonicalize-and-vote.py
from collections import Counter

ALIASES = {
    "expedited reship": "reship_expedited",
    "reship_expedited": "reship_expedited",
    "send expedited replacement": "reship_expedited",
    "manual review": "manual_review",
}

def canonicalize(text: str) -> str | None:
    normalized = text.strip().lower().replace("-", " ")
    return ALIASES.get(normalized)

samples = [
    "Expedited reship",
    "reship_expedited",
    "Send expedited replacement",
    "manual review",
    "issue credit immediately",
]
votes = Counter(action for text in samples if (action := canonicalize(text)))
winner = votes.most_common(1)[0][0] if votes else "manual_review"

print(f"accepted_samples: {sum(votes.values())}/{len(samples)}")
print(f"votes: {dict(votes)}")
print(f"winner: {winner}")
Output
accepted_samples: 4/5
votes: {'reship_expedited': 3, 'manual_review': 1}
winner: reship_expedited

A winner is not always confident enough

A 2 to 2 tie, zero accepted outputs, or a narrow plurality with many rejected outputs should not silently become a customer action. Add an abstention rule. The winning share must use all sampled outputs as its denominator, not only the strings that parsed successfully.

abstain-on-weak-votes.py
from collections import Counter

def decide(votes: list[str], total_samples: int, minimum_share: float = 0.6) -> str:
    if not votes:
        return "manual_review"
    counts = Counter(votes)
    winner, count = counts.most_common(1)[0]
    share = count / total_samples
    tied = len(counts) > 1 and counts.most_common(2)[0][1] == counts.most_common(2)[1][1]
    if tied or share < minimum_share:
        return "manual_review"
    return winner

strong = ["reship_expedited", "reship_expedited", "reship_expedited", "refund"]
split = ["reship_expedited", "reship_expedited", "refund", "refund"]
mostly_rejected = ["reship_expedited", "reship_expedited"]

print(f"strong_vote: {decide(strong, total_samples=4)}")
print(f"split_vote: {decide(split, total_samples=4)}")
print(f"mostly_rejected_vote: {decide(mostly_rejected, total_samples=5)}")
print(f"no_valid_votes: {decide([], total_samples=5)}")
Output
strong_vote: reship_expedited
split_vote: manual_review
mostly_rejected_vote: manual_review
no_valid_votes: manual_review

Measure gains against call cost

Use a held-out fixture set before calling the strategy ready. The code below represents five sampled final actions returned for each fixture; the controller compares first-sample accuracy with five-sample voting accuracy.

measure-voting-gain.py
from collections import Counter

fixtures = {
    "late_with_stock": {
        "expected": "reship",
        "samples": ["review", "reship", "reship", "reship", "refund"],
    },
    "late_no_stock": {
        "expected": "refund",
        "samples": ["refund", "refund", "review", "refund", "review"],
    },
    "not_late": {
        "expected": "review",
        "samples": ["reship", "review", "review", "review", "reship"],
    },
}

single_correct = 0
vote_correct = 0
for item in fixtures.values():
    winner = Counter(item["samples"]).most_common(1)[0][0]
    single_correct += item["samples"][0] == item["expected"]
    vote_correct += winner == item["expected"]

total = len(fixtures)
print(f"single_trace_accuracy: {single_correct / total:.0%}")
print(f"vote_5_accuracy: {vote_correct / total:.0%}")
print(f"model_calls: single={total}, vote_5={total * 5}")
Output
single_trace_accuracy: 33%
vote_5_accuracy: 100%
model_calls: single=3, vote_5=15

This fixture is intentionally small and deterministic: it tests controller logic. A real release decision needs representative labeled cases, real model samples, token counts, latency, refusal rates, and cost.
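A quick way to see why three fixtures cannot support a release decision is to put an interval around the measured accuracy. The sketch below uses the Wilson score interval with `z = 1.96` (roughly 95% coverage); the `wilson_interval` helper and the 94-of-100 comparison row are illustrative assumptions, not results from a real evaluation.

```python
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; returns (low, high)."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# 3 of 3 mirrors the toy fixture set; 94 of 100 is a hypothetical larger run.
for correct, total in [(3, 3), (94, 100)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: accuracy={correct / total:.0%}, interval=({low:.2f}, {high:.2f})")
```

A perfect score on three cases is still compatible with a true accuracy well below the release threshold, which is why the fixture set needs to grow before the vote is trusted.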

Tree-of-Thoughts: search when branches can dead-end

Voting helps when independent paths tend to converge on the same short answer. It does not deliberately revisit earlier choices. Tree-of-Thoughts (ToT) represents partial solutions as search states, generates alternatives, evaluates the states, and preserves only branches worth extending.[5]

Yao et al. evaluated ToT on tasks built for planning and search. On Game of 24, their GPT-4 ToT setup solved 74% of tasks while their CoT baseline solved 4%.[5] The lesson is narrower than "use trees everywhere": if a problem has verifiable partial states and meaningful backtracking, search can rescue a bad early move.

Search states in a support plan

For a delayed parcel, consider a workflow whose final recommendation must be supported by two observations: a confirmed carrier exception and a replacement inventory check. A controller can expand legal steps instead of letting a model invent a final action before evidence is present.

expand-support-plan-states.py
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    scan_checked: bool = False
    stock_checked: bool = False
    final_action: str | None = None

def expand(state: State) -> list[tuple[str, State]]:
    next_states: list[tuple[str, State]] = []
    if not state.scan_checked:
        next_states.append(("check_carrier_exception", State(True, state.stock_checked)))
    if not state.stock_checked:
        next_states.append(("check_replacement_stock", State(state.scan_checked, True)))
    if state.scan_checked and state.stock_checked and state.final_action is None:
        next_states.append(("offer_expedited_reship", State(True, True, "reship")))
    return next_states

frontier = [State()]
for depth in range(3):
    generated = [item for state in frontier for item in expand(state)]
    print(f"depth_{depth + 1}: {[action for action, _ in generated]}")
    frontier = list(dict.fromkeys(state for _, state in generated))
Output
depth_1: ['check_carrier_exception', 'check_replacement_stock']
depth_2: ['check_replacement_stock', 'check_carrier_exception']
depth_3: ['offer_expedited_reship']

The two legal evidence-gathering orders converge on the same verified state before the final action. In the next lesson, tool calls will populate that state from real APIs.

A fully runnable search example

Game of 24 is useful because the evaluator is exact: arithmetic either reaches 24 using each input once or it does not. The solver below explores partial equations with breadth-first search and returns a verified solution.

breadth-first-game-of-24.py
from fractions import Fraction
from itertools import combinations

def combine(left: tuple[Fraction, str], right: tuple[Fraction, str]) -> list[tuple[Fraction, str]]:
    a, a_expr = left
    b, b_expr = right
    outcomes = [
        (a + b, f"({a_expr} + {b_expr})"),
        (a - b, f"({a_expr} - {b_expr})"),
        (b - a, f"({b_expr} - {a_expr})"),
        (a * b, f"({a_expr} * {b_expr})"),
    ]
    if b:
        outcomes.append((a / b, f"({a_expr} / {b_expr})"))
    if a:
        outcomes.append((b / a, f"({b_expr} / {a_expr})"))
    return outcomes

def solve_24(numbers: list[int]) -> str | None:
    frontier = [[(Fraction(number), str(number)) for number in numbers]]
    while frontier:
        state = frontier.pop(0)
        if len(state) == 1 and state[0][0] == 24:
            return state[0][1]
        for i, j in combinations(range(len(state)), 2):
            remainder = [item for k, item in enumerate(state) if k not in (i, j)]
            frontier.extend([remainder + [result] for result in combine(state[i], state[j])])
    return None

solution = solve_24([4, 5, 6, 7])
print(f"solution_found: {solution is not None}")
print(f"expression: {solution}")
Output
solution_found: True
expression: ((6 - 4) * (5 + 7))
[Figure: Tree-of-Thoughts search over Game of 24 branches, with illustrative heuristic scores deciding whether to prune, backtrack, or continue.]
ToT adds value only if branch evaluation can preserve promising states and discard bad ones.
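Because the Game of 24 evaluator is exact, a proposed expression can be checked independently of whatever produced it, including a model. The sketch below is one possible checker built on Python's standard `ast` module; the `verify_24` name and its rejection rules are assumptions for illustration.

```python
import ast
from fractions import Fraction

def verify_24(expression: str, numbers: list[int]) -> bool:
    """Check that the expression uses each input exactly once and equals 24."""
    used: list[int] = []

    def evaluate(node: ast.AST) -> Fraction:
        # Integer literals are the only leaves allowed.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            used.append(node.value)
            return Fraction(node.value)
        if isinstance(node, ast.BinOp):
            left, right = evaluate(node.left), evaluate(node.right)
            if isinstance(node.op, ast.Add):
                return left + right
            if isinstance(node.op, ast.Sub):
                return left - right
            if isinstance(node.op, ast.Mult):
                return left * right
            if isinstance(node.op, ast.Div):
                return left / right
        raise ValueError("unsupported expression")

    try:
        tree = ast.parse(expression, mode="eval")
        value = evaluate(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return value == 24 and sorted(used) == sorted(numbers)

inputs = [4, 5, 6, 7]
for candidate in ["((6 - 4) * (5 + 7))", "(4 * 6)", "(4 * 6) + (5 - 5)", "24"]:
    print(f"{candidate!r}: valid={verify_24(candidate, inputs)}")
```

Exact fractions avoid floating-point surprises on division, and the input-usage check rejects expressions that reach 24 by skipping or repeating numbers.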

Pruning is a source of failure

An LLM evaluator is not an arithmetic oracle. If it scores an apparently simple but dead branch above a non-obvious solvable branch, an aggressive beam can remove the answer before expansion.

beam-pruning-risk.py
branches = [
    {"move": "6 * 4 = 24 first", "solvable": False, "weak_score": 0.95, "exact_score": 0.0},
    {"move": "5 + 7 = 12 first", "solvable": True, "weak_score": 0.40, "exact_score": 1.0},
    {"move": "7 - 5 = 2 first", "solvable": False, "weak_score": 0.35, "exact_score": 0.0},
]

def keep_one(score_name: str) -> dict[str, object]:
    return max(branches, key=lambda branch: branch[score_name])

weak_choice = keep_one("weak_score")
exact_choice = keep_one("exact_score")
print(f"weak_evaluator_keeps_solution: {weak_choice['solvable']}")
print(f"exact_evaluator_keeps_solution: {exact_choice['solvable']}")
print(f"risk: beam_width_1 can prune the valid branch")
Output
weak_evaluator_keeps_solution: False
exact_evaluator_keeps_solution: True
risk: beam_width_1 can prune the valid branch

The production implications are concrete:

  • Keep ToT for tasks with real branch structure, not ordinary classification.
  • Prefer deterministic validators when a partial state can be checked in code.
  • Measure solver recall at each beam width, not only final successes.
  • Cap expansions and latency before an open-ended search reaches users.
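Measuring recall at each beam width can be as simple as asking, for every labeled case, whether any solvable branch survives the cut. In the sketch below the first case reuses the weak scores from the pruning example above; the other two cases and the `beam_recall` helper are invented for illustration.

```python
# Each case lists (weak_score, solvable) pairs for its first-level branches.
cases = {
    "4_5_6_7": [(0.95, False), (0.40, True), (0.35, False)],
    "hypothetical_b": [(0.80, True), (0.60, False), (0.20, False)],
    "hypothetical_c": [(0.70, False), (0.65, False), (0.30, True)],
}

def beam_recall(width: int) -> float:
    """Share of cases where at least one solvable branch survives pruning."""
    kept = 0
    for branches in cases.values():
        survivors = sorted(branches, key=lambda branch: branch[0], reverse=True)[:width]
        kept += any(solvable for _, solvable in survivors)
    return kept / len(cases)

for width in (1, 2, 3):
    print(f"beam_width={width}: recall={beam_recall(width):.0%}")
```

Recall at the pruning step is an upper bound on final success: a branch that is cut here cannot be recovered later, no matter how good the remaining search is.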

Choose compute with a release gate

Direct prompting, one trace, voting, and tree search are not maturity levels. They are candidates with different accuracy and serving cost. Start with the cheapest candidate, then promote a more expensive strategy only when held-out results require it.

[Figure: Reasoning strategy tradeoffs across direct prompting, few-shot chain-of-thought, self-consistency, and Tree-of-Thoughts.]
Reasoning strategy is a release-gated serving decision, not a default prompt decoration.
reasoning-release-gate.py
results = [
    {"strategy": "direct", "accuracy": 0.76, "p95_ms": 190, "calls": 1},
    {"strategy": "single_trace", "accuracy": 0.84, "p95_ms": 360, "calls": 1},
    {"strategy": "vote_5", "accuracy": 0.94, "p95_ms": 740, "calls": 5},
    {"strategy": "tree_search", "accuracy": 0.96, "p95_ms": 1840, "calls": 14},
]

minimum_accuracy = 0.90
latency_budget_ms = 900
eligible = [
    row for row in results
    if row["accuracy"] >= minimum_accuracy and row["p95_ms"] <= latency_budget_ms
]
selected = min(eligible, key=lambda row: (row["calls"], row["p95_ms"]))

for row in results:
    print(f"{row['strategy']}: accuracy={row['accuracy']:.0%}, p95_ms={row['p95_ms']}, calls={row['calls']}")
print(f"selected: {selected['strategy']}")
Output
direct: accuracy=76%, p95_ms=190, calls=1
single_trace: accuracy=84%, p95_ms=360, calls=1
vote_5: accuracy=94%, p95_ms=740, calls=5
tree_search: accuracy=96%, p95_ms=1840, calls=14
selected: vote_5

These numbers are example evaluation results, not a benchmark claim. In your system, keep a table with:

| Metric | Why it matters |
| --- | --- |
| Action accuracy or task success | Extra reasoning must change correct outcomes |
| Unsafe-action and abstention rates | Reliability includes knowing when not to act |
| Input, output, and reasoning tokens | Sampling and search multiply spend |
| p50 and p95 latency | Long tails can make support interactions unusable |
| Parse and schema failures | A correct thought is useless if the runtime cannot consume its action |
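The rows of that table can be computed from per-request run records. The sketch below is illustrative: the record fields, the five sample records, and the nearest-rank `percentile` helper are assumptions, and a real pipeline would read these values from logs.

```python
from math import ceil

# Hypothetical run records for one strategy on a labeled fixture set.
runs = [
    {"expected": "reship", "action": "reship", "parsed": True, "abstained": False, "tokens": 410, "latency_ms": 320},
    {"expected": "refund", "action": "refund", "parsed": True, "abstained": False, "tokens": 390, "latency_ms": 290},
    {"expected": "review", "action": "reship", "parsed": True, "abstained": False, "tokens": 450, "latency_ms": 610},
    {"expected": "reship", "action": None, "parsed": False, "abstained": True, "tokens": 520, "latency_ms": 880},
    {"expected": "refund", "action": "refund", "parsed": True, "abstained": False, "tokens": 400, "latency_ms": 300},
]

def percentile(values: list[int], q: float) -> int:
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

total = len(runs)
latencies = [run["latency_ms"] for run in runs]
summary = {
    "accuracy": sum(run["action"] == run["expected"] for run in runs) / total,
    "abstention_rate": sum(run["abstained"] for run in runs) / total,
    "parse_failure_rate": sum(not run["parsed"] for run in runs) / total,
    "total_tokens": sum(run["tokens"] for run in runs),
    "p50_ms": percentile(latencies, 50),
    "p95_ms": percentile(latencies, 95),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

Unsafe-action rate is omitted here because it needs a policy-specific definition of which wrong actions are harmful; add it once that definition exists for your workflow.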

Where reasoning ends and tools begin

All runnable experiments above operate on provided facts or deterministic state. A real delayed-parcel request requires current carrier scans and inventory. Reasoning alone cannot obtain those observations.

ReAct interleaves reasoning traces and task-specific actions so new observations can update the next decision.[6] For production systems, the important handoff is not a saved inner monologue. It is a validated action request, a controlled execution result, and a bounded next decision:

reasoning-to-tool-handoff.txt
Need: carrier exception is not present in supplied facts.
Next action request: get_carrier_scan(order_id="A102")
Runtime responsibility: validate authorization, execute call, log result.
Next decision: apply policy only after observation is returned.
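The runtime responsibility in that handoff can be sketched as a validation step that runs before any call is executed. Everything below is hypothetical: the `ToolRequest` shape, the `TOOL_SCHEMAS` registry, and the authorization rule are assumptions for illustration, and no real carrier API is called.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolRequest:
    name: str
    arguments: dict[str, object]

# Hypothetical registry: tool name -> required argument names and types.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {"get_carrier_scan": {"order_id": str}}

def validate_request(request: ToolRequest, allowed_orders: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the call may proceed."""
    schema = TOOL_SCHEMAS.get(request.name)
    if schema is None:
        return [f"unknown tool: {request.name}"]
    errors: list[str] = []
    for arg, expected_type in schema.items():
        if arg not in request.arguments:
            errors.append(f"missing argument: {arg}")
        elif not isinstance(request.arguments[arg], expected_type):
            errors.append(f"wrong type for argument: {arg}")
    for extra in sorted(set(request.arguments) - set(schema)):
        errors.append(f"unexpected argument: {extra}")
    # Authorization is checked against the session, never against model text.
    order_id = request.arguments.get("order_id")
    if isinstance(order_id, str) and order_id not in allowed_orders:
        errors.append(f"not authorized for order: {order_id}")
    return errors

session_orders = {"A102"}
requests = [
    ToolRequest("get_carrier_scan", {"order_id": "A102"}),
    ToolRequest("get_carrier_scan", {"order_id": "B999"}),
    ToolRequest("issue_refund", {"order_id": "A102"}),
]
for request in requests:
    problems = validate_request(request, session_orders)
    print(f"{request.name}({request.arguments}): {'ok' if not problems else problems}")
```

Only a request that returns no problems would be executed and logged; anything else goes back to the controller as a structured error rather than as free text.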

The next lesson implements that action boundary with typed function calls, schemas, errors, and safe execution.

What to remember

  • Define the scorer first. A decision record lets you test whether extra inference work improves outcomes.
  • One trace is one candidate. CoT can reveal missed steps, but a plausible rationale is not a faithful audit log.[3]
  • Vote on normalized outcomes. Self-consistency is useful only when its measured gain beats its sample cost.[4]
  • Search only with branch structure. ToT needs meaningful states, evaluators, pruning limits, and failure measurements.[5]
  • Promote strategies through evals. Direct, trace, vote, and search should compete under quality and latency gates.

Mastery check

Key concepts

  • Chain-of-Thought as one structured candidate trace
  • Zero-shot CoT versus few-shot contract examples
  • Reasoning trace faithfulness limits
  • Self-Consistency with canonicalization and abstention
  • Tree-of-Thoughts as bounded, evaluator-guided search
  • Beam pruning failure modes
  • Inference-time cost, latency, and release gates
  • ReAct as the handoff from decisions to observations

Evaluation rubric

  • Foundational: Defines an expected action and the checks required to justify it.
  • Intermediate: Implements final-answer normalization, majority voting, and an abstention threshold.
  • Intermediate: Compares direct, trace, and voting strategies on labeled fixtures with cost accounting.
  • Advanced: Implements a searchable state space and explains how evaluator errors interact with pruning.
  • Advanced: Designs a release gate using task success, safety, token cost, and tail latency.

Common pitfalls

  • Scoring prose instead of outcomes: Require final actions and policy-check fields that an evaluator can validate.
  • Logging unrestricted rationale as evidence: Trace text can be unfaithful. Log retrieved inputs, validated actions, observations, and results.
  • Voting on raw strings: Normalize final decisions into allowed enums before counting.
  • Treating every plurality as safe: Add abstention for ties, weak majorities, and rejected outputs.
  • Using ToT for a lookup: Search cost is wasted when no branch can backtrack or improve the answer.
  • Pruning with an unmeasured evaluator: A confident score can remove the only valid branch. Evaluate beam recall and cap the budget.

Practice extension

Create twenty delayed-parcel fixtures with policy-grounded final actions. Collect direct, structured-trace, and five-sample candidate outputs from one model endpoint. Canonicalize the final action, abstain on weak votes, and produce a comparison table with accuracy, unsafe actions, abstentions, tokens, and p95 latency. Add search only for cases where choosing an action requires sequencing multiple checks or backing out of a failed plan.

Next Step
Continue to Function Calling & Tool Use

You can now choose and evaluate a reasoning budget over supplied facts; next you will convert missing facts and intended actions into validated tool calls.

References

[1] Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.

[2] Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners.

[3] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.

[4] Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.

[5] Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.

[6] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.