CoT, ToT & Self-Consistency Prompting

Build and evaluate reasoning controllers: single traces, answer voting, and bounded tree search for multi-step LLM decisions.

Compressed embeddings can retrieve the right policy evidence and still leave the assistant with a decision to make. Retrieval isn't the end of the task. An assistant may have the right facts and still choose the wrong action because it skips a condition, mishandles arithmetic, or commits too early to one plan.

Treat extra inference work as an engineering decision. You'll build a small release-gate controller for a risky code change:

Chain-of-Thought (CoT): request one decomposed candidate decision.
Self-Consistency: sample several candidates, normalize final actions, and vote.
Tree-of-Thoughts (ToT): expand and prune branches when a decision needs backtracking.

Long rationales aren't the target. The target is measurably better decision accuracy under a token, latency, and safety budget.

Start with a failure you can audit

Suppose a repository has a candidate release. Unit tests pass, the security scan is clear, and a reviewer has approved the diff. Policy permits auto-merge only when all three facts are true. A direct response may still overlook one condition and suggest a force-merge or an unnecessary escalation.

A useful outward artifact is a short decision record: facts used, checks applied, and final action. It's smaller than an open-ended rationale, easy to score, and safe to compare against deterministic policy logic.

decision-record-contract.py

from dataclasses import dataclass

@dataclass(frozen=True)
class ReleaseCase:
    tests_passed: bool
    security_scan_clear: bool
    reviewer_approved: bool

def decision_record(case: ReleaseCase) -> dict[str, object]:
    checks = {
        "tests_passed": case.tests_passed,
        "security_scan_clear": case.security_scan_clear,
        "reviewer_approved": case.reviewer_approved,
    }
    action = "merge_release" if all(checks.values()) else "manual_review"
    return {"checks": checks, "final_action": action}

record = decision_record(
    ReleaseCase(tests_passed=True, security_scan_clear=True, reviewer_approved=True)
)
for name, passed in record["checks"].items():
    print(f"{name}: {passed}")
print(f"final_action: {record['final_action']}")

Output

tests_passed: True
security_scan_clear: True
reviewer_approved: True
final_action: merge_release

This code isn't an LLM. It's the oracle that your prompt variants must match. Before increasing model compute, define an output contract and a scorer.

Figure: The same release-gate facts feed a direct candidate that records only one of three required checks and fails the oracle, while a structured decision record preserves all three predicates and reaches the expected merge_release action. A trace matters only when it improves the scored contract: complete required checks and the correct final action.

Chain-of-Thought: one decomposed candidate

Wei et al. introduced few-shot CoT by placing worked intermediate steps in prompt examples. Their experiments found gains on arithmetic, commonsense, and symbolic reasoning benchmarks for sufficiently large models, including a strong GSM8K result with PaLM 540B.[1] Kojima et al. later showed that the zero-shot trigger "Let's think step by step" could improve several benchmark reasoning tasks without worked examples.[2]

Those papers establish techniques to evaluate, not a production law. On your endpoint and task, a visible scratchpad may help, do nothing, or add cost. If an API offers a native reasoning control, evaluate that option as another strategy rather than assuming that extra visible text helps.

🔬 Research insight: More reasoning text isn't always better. Zheng et al. tested CoT and its variants (including ToT and ReAct) on nine pattern-based in-context-learning benchmarks across 16 models and found they consistently underperformed plain direct answering, with the gap widening as more demonstrations were added. Even long-CoT reasoning models that spent far more tokens didn't overcome the effect.[3] When a task is really pattern matching from examples, a visible scratchpad can add noise instead of signal. That's the point of the eval gate: measure the win, don't assume it.

This tradeoff also shapes a build-versus-buy decision you'll face in production. You can orchestrate reasoning yourself with prompt scaffolding (the CoT, self-consistency, and tree-search controllers in this chapter), or call a native reasoning model that performs its own internal reasoning before answering. Scaffolding keeps each step visible, auditable, and cheap to swap, but you own the token and latency budget. A native reasoning model can lift accuracy on genuinely multi-step problems, yet it hides its reasoning, bills you for it, and gives you fewer control points to gate. Treat the two as competing candidates behind the same eval gate rather than assuming the newer option wins.

For a production workflow, don't ask the model to reveal unrestricted inner reasoning. Ask it to produce reviewable artifacts:

single-trace-prompt.txt

Use only the supplied release facts and policy rules.
Return:
1. required_checks: each policy predicate with pass/fail
2. final_action: one allowed action enum
3. operator_message: one sentence

Facts:
- tests_passed: true
- security_scan_clear: true
- reviewer_approved: true

Policy:
- merge_release is allowed only when tests pass,
  the security scan is clear, and review approval exists.

The checks are useful because a missed predicate becomes observable. They aren't proof of faithful hidden cognition. Turpin et al. showed that CoT explanations can rationalize outputs influenced by hidden biasing features without mentioning those features.[4] Log inputs, actions, validations, and outcomes; don't treat eloquent reasoning text as an audit guarantee.
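
One way to make that logging advice concrete is a small audit entry that records what was supplied, what the model proposed, whether deterministic policy code agreed, and what the runtime did. This is a minimal sketch: the `AuditEntry` class, its field names, and the `audit` helper are assumptions for illustration, not part of any library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEntry:
    # Hypothetical field names chosen for this sketch.
    inputs: dict[str, bool]   # facts supplied to the model
    proposed_action: str      # action parsed from the model output
    validation_passed: bool   # did the proposal agree with policy code?
    executed_action: str      # what the runtime actually did


def audit(inputs: dict[str, bool], proposed_action: str) -> AuditEntry:
    # Deterministic policy check, independent of any rationale text.
    allowed = "merge_release" if all(inputs.values()) else "manual_review"
    ok = proposed_action == allowed
    # Fall back to manual review whenever validation fails.
    executed = proposed_action if ok else "manual_review"
    return AuditEntry(inputs, proposed_action, ok, executed)


entry = audit(
    {"tests_passed": True, "security_scan_clear": False, "reviewer_approved": True},
    proposed_action="merge_release",
)
print(json.dumps(asdict(entry), sort_keys=True))
```

The entry holds no rationale text at all. An auditor can replay the inputs through the policy function and confirm the executed action without trusting the model's explanation.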

score-candidate-decisions.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    checks: set[str]
    final_action: str

required_checks = {
    "tests_passed",
    "security_scan_clear",
    "reviewer_approved",
}
expected_action = "merge_release"
candidates = [
    Candidate("direct", {"tests_passed"}, "manual_review"),
    Candidate("structured_trace", required_checks, "merge_release"),
]

for candidate in candidates:
    coverage = len(candidate.checks & required_checks) / len(required_checks)
    action_ok = candidate.final_action == expected_action
    print(f"{candidate.name}: coverage={coverage:.0%}, action_ok={action_ok}")

Output

direct: coverage=33%, action_ok=False
structured_trace: coverage=100%, action_ok=True

Here, the trace is useful because it meets the scored contract. It doesn't win merely because it contains more words.

Zero-shot or few-shot?

Zero-shot CoT supplies an instruction and lets the model choose its intermediate format. Few-shot CoT gives one or more solved examples, so it can teach both decomposition and the final answer shape. Native structured output enforcement is better when available; examples are still useful when the prompt must communicate task-specific checks.

Figure: Comparison showing zero-shot prompting drifting across facts, checks, and action fields while few-shot examples keep the same schema slots for examples and target. Few-shot examples spend more prompt tokens to stabilize field order. Prefer native structured output when the endpoint can enforce the schema.

build-few-shot-decision-prompt.py

example = """Example:
facts: tests_passed=true, security_scan_clear=false, reviewer_approved=true
required_checks: tests_passed=true, security_scan_clear=false, reviewer_approved=true
final_action: manual_review"""

case = "facts: tests_passed=true, security_scan_clear=true, reviewer_approved=true"
zero_shot = f"Evaluate release policy step by step.\n{case}\nfinal_action:"
few_shot = f"{example}\n\nNow evaluate:\n{case}\nrequired_checks:\nfinal_action:"

print(f"zero_shot_has_example: {'Example:' in zero_shot}")
print(f"few_shot_has_example: {'Example:' in few_shot}")
print(f"few_shot_requests_checks: {'required_checks:' in few_shot}")

Output

zero_shot_has_example: False
few_shot_has_example: True
few_shot_requests_checks: True

Self-Consistency: sample answers, then vote

One structured trace can fail because generation takes an unlucky path. Self-Consistency replaces reliance on one path with several sampled paths and chooses the most consistent final answer.[5] The vote operates on extracted answers, not on whose rationale sounds best.

To obtain different candidates, use stochastic decoding, usually with a nonzero temperature, rather than rerunning one deterministic decode. The earlier Decoding Strategies lesson explains that control. Keep it fixed in your experiment and log it with the sample count.
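
The sampling step can be sketched with a stand-in for the model call. Here `fake_model` is a hypothetical stub, not a real endpoint or API: it returns one fixed answer at temperature 0 and a randomly chosen answer otherwise. The point is the shape of the run record, which stores temperature, sample count, and seed next to the outputs.

```python
import random


def fake_model(prompt: str, temperature: float, rng: random.Random) -> str:
    # Stand-in for a real endpoint call (assumption for this sketch).
    # Temperature 0 is deterministic; anything above it can vary between calls.
    if temperature == 0.0:
        return "merge_release"
    return rng.choice(["merge_release", "merge_release", "merge_release", "manual_review"])


def sample_candidates(prompt: str, n: int, temperature: float, seed: int) -> dict[str, object]:
    rng = random.Random(seed)
    outputs = [fake_model(prompt, temperature, rng) for _ in range(n)]
    # Log the decoding settings with the samples so the experiment is reproducible.
    return {"temperature": temperature, "n": n, "seed": seed, "outputs": outputs}


prompt = "facts: tests_passed=true, security_scan_clear=true, reviewer_approved=true"
greedy = sample_candidates(prompt, n=5, temperature=0.0, seed=7)
sampled = sample_candidates(prompt, n=5, temperature=0.7, seed=7)

print(f"greedy_distinct_outputs: {len(set(greedy['outputs']))}")
print(f"sampled_run_logged: temperature={sampled['temperature']}, n={sampled['n']}")
```

Five greedy decodes carry the information of one, so voting over them adds cost without adding evidence.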

On the original benchmark setting, Wang et al. reported a 17.9 percentage-point GSM8K improvement over CoT prompting for PaLM 540B.[5] That number is evidence for the method on those benchmarks. It's not your expected release-gate gain. Measure your own cases and sample cost.

Figure: A self-consistency controller fans one release-gate prompt into five stochastic traces, canonicalizes three merge variants into one action, keeps one manual-review vote, rejects one unknown action, and releases the 3-of-5 winner only because it meets the 0.60 share threshold. Self-consistency votes on canonical actions, not rationale style. Rejected outputs remain in the denominator, and weak or tied votes route to review.

Normalize before counting

Model outputs rarely use exactly one spelling. Your controller should map harmless variants to one allowed action and reject unknown outputs before voting.

canonicalize-and-vote.py

from collections import Counter

ALIASES = {
    "merge release": "merge_release",
    "merge_release": "merge_release",
    "auto merge approved change": "merge_release",
    "manual review": "manual_review",
}

def canonicalize(text: str) -> str | None:
    normalized = text.strip().lower().replace("-", " ")
    return ALIASES.get(normalized)

samples = [
    "Merge release",
    "merge_release",
    "Auto merge approved change",
    "manual review",
    "force merge immediately",
]
votes = Counter(action for text in samples if (action := canonicalize(text)))
winner = votes.most_common(1)[0][0] if votes else "manual_review"

print(f"accepted_samples: {sum(votes.values())}/{len(samples)}")
print(f"votes: {dict(votes)}")
print(f"winner: {winner}")

Output

accepted_samples: 4/5
votes: {'merge_release': 3, 'manual_review': 1}
winner: merge_release

A winner isn't always confident enough

A 2 to 2 tie, zero accepted outputs, or a narrow plurality with many rejected outputs shouldn't silently become an automated action. Add an abstention rule. The winning share must use all sampled outputs as its denominator, including strings that failed to parse.

abstain-on-weak-votes.py

from collections import Counter

def decide(votes: list[str], total_samples: int, minimum_share: float = 0.6) -> str:
    if not votes:
        return "manual_review"
    counts = Counter(votes)
    winner, count = counts.most_common(1)[0]
    share = count / total_samples
    tied = len(counts) > 1 and counts.most_common(2)[0][1] == counts.most_common(2)[1][1]
    if tied or share < minimum_share:
        return "manual_review"
    return winner

strong = ["merge_release", "merge_release", "merge_release", "manual_review"]
split = ["merge_release", "merge_release", "manual_review", "manual_review"]
mostly_rejected = ["merge_release", "merge_release"]

print(f"strong_vote: {decide(strong, total_samples=4)}")
print(f"split_vote: {decide(split, total_samples=4)}")
print(f"mostly_rejected_vote: {decide(mostly_rejected, total_samples=5)}")
print(f"no_valid_votes: {decide([], total_samples=5)}")

Output

strong_vote: merge_release
split_vote: manual_review
mostly_rejected_vote: manual_review
no_valid_votes: manual_review

Measure gains against call cost

Use a held-out fixture set before calling the strategy ready. The fixture represents five sampled final actions returned for each case; the controller compares first-sample accuracy with five-sample voting accuracy.

measure-voting-gain.py

from collections import Counter

fixtures = {
    "approved_clean_release": {
        "expected": "merge",
        "samples": ["review", "merge", "merge", "merge", "force"],
    },
    "scan_failed": {
        "expected": "review",
        "samples": ["review", "review", "force", "review", "hold"],
    },
    "review_missing": {
        "expected": "review",
        "samples": ["merge", "review", "review", "review", "merge"],
    },
}

single_correct = 0
vote_correct = 0
for item in fixtures.values():
    winner = Counter(item["samples"]).most_common(1)[0][0]
    single_correct += item["samples"][0] == item["expected"]
    vote_correct += winner == item["expected"]

total = len(fixtures)
print(f"single_trace_accuracy: {single_correct / total:.0%}")
print(f"vote_5_accuracy: {vote_correct / total:.0%}")
print(f"model_calls: single={total}, vote_5={total * 5}")

Output

single_trace_accuracy: 33%
vote_5_accuracy: 100%
model_calls: single=3, vote_5=15

This fixture is intentionally small and deterministic: it tests controller logic. A real release decision needs representative labeled cases, real model samples, token counts, latency, refusal rates, and cost.

Tree-of-Thoughts: search when branches can dead-end

Voting helps when independent paths tend to converge on the same short answer. It doesn't deliberately revisit earlier choices. Tree-of-Thoughts (ToT) represents partial solutions as search states, generates alternatives, evaluates the states, and preserves only branches worth extending.[6]

Yao et al. evaluated ToT on tasks built for planning and search. On Game of 24, their GPT-4 ToT setup solved 74% of tasks while their CoT baseline solved 4%.[6] The conclusion is narrower than "use trees everywhere": if a problem has verifiable partial states and meaningful backtracking, search can rescue a bad early move.

Search states in a release plan

For a release gate, consider a workflow whose final recommendation must be supported by two observations: a current security scan and a reviewer approval record. A controller can expand legal steps instead of letting a model invent a final action before evidence is present.

expand-release-plan-states.py

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    security_clear: bool | None = None
    reviewer_approved: bool | None = None
    final_action: str | None = None

def expand(state: State) -> list[tuple[str, State]]:
    next_states: list[tuple[str, State]] = []
    if state.security_clear is None:
        next_states.extend([
            ("security_scan:clear", State(True, state.reviewer_approved)),
            ("security_scan:failed", State(False, state.reviewer_approved)),
        ])
    if state.reviewer_approved is None:
        next_states.extend([
            ("review:approved", State(state.security_clear, True)),
            ("review:rejected", State(state.security_clear, False)),
        ])
    if (
        state.security_clear is not None
        and state.reviewer_approved is not None
        and state.final_action is None
    ):
        action = "merge_release" if state.security_clear and state.reviewer_approved else "block_release"
        next_states.append((action, State(state.security_clear, state.reviewer_approved, action)))
    return next_states

frontier = [State()]
for depth in range(3):
    generated = [item for state in frontier for item in expand(state)]
    print(f"depth_{depth + 1}: {[action for action, _ in generated]}")
    frontier = list(dict.fromkeys(state for _, state in generated))

Output

depth_1: ['security_scan:clear', 'security_scan:failed', 'review:approved', 'review:rejected']
depth_2: ['review:approved', 'review:rejected', 'review:approved', 'review:rejected', 'security_scan:clear', 'security_scan:failed', 'security_scan:clear', 'security_scan:failed']
depth_3: ['merge_release', 'block_release', 'block_release', 'block_release']

The two evidence-gathering orders converge on outcome-bearing states. Only the state with a clear scan and an approved review can merge; a failed scan or rejected review blocks the release. In the next lesson, tool calls will populate those outcomes from real APIs.

A fully runnable search example

Game of 24 is useful because the evaluator is exact: arithmetic either reaches 24 using each input once or it doesn't. The solver below explores partial equations with breadth-first search and returns a verified solution.

breadth-first-game-of-24.py

from fractions import Fraction
from itertools import combinations

def combine(left: tuple[Fraction, str], right: tuple[Fraction, str]) -> list[tuple[Fraction, str]]:
    a, a_expr = left
    b, b_expr = right
    outcomes = [
        (a + b, f"({a_expr} + {b_expr})"),
        (a - b, f"({a_expr} - {b_expr})"),
        (b - a, f"({b_expr} - {a_expr})"),
        (a * b, f"({a_expr} * {b_expr})"),
    ]
    if b:
        outcomes.append((a / b, f"({a_expr} / {b_expr})"))
    if a:
        outcomes.append((b / a, f"({b_expr} / {a_expr})"))
    return outcomes

def solve_24(numbers: list[int]) -> str | None:
    frontier = [[(Fraction(number), str(number)) for number in numbers]]
    while frontier:
        state = frontier.pop(0)
        if len(state) == 1 and state[0][0] == 24:
            return state[0][1]
        for i, j in combinations(range(len(state)), 2):
            remainder = [item for k, item in enumerate(state) if k not in (i, j)]
            frontier.extend([remainder + [result] for result in combine(state[i], state[j])])
    return None

solution = solve_24([4, 5, 6, 7])
print(f"solution_found: {solution is not None}")
print(f"expression: {solution}")

Output

solution_found: True
expression: ((6 - 4) * (5 + 7))

Figure: Tree-of-thought search comparing branch expansion for Game of 24 with a beam-width-one scorer that keeps a dead branch under weak scoring but preserves the solution under exact checks. ToT helps only when the evaluator keeps viable states. Weak pruning can keep a dead branch; exact checks preserve the branch that reaches 24.

Pruning is a source of failure

An LLM evaluator isn't an arithmetic oracle. If it scores an apparently simple but dead branch above a non-obvious solvable branch, an aggressive beam can remove the answer before expansion.

beam-pruning-risk.py

branches = [
    {"move": "6 * 4 = 24 first", "solvable": False, "weak_score": 0.95, "exact_score": 0.0},
    {"move": "5 + 7 = 12 first", "solvable": True, "weak_score": 0.40, "exact_score": 1.0},
    {"move": "7 - 5 = 2 first", "solvable": False, "weak_score": 0.35, "exact_score": 0.0},
]

def keep_one(score_name: str) -> dict[str, object]:
    return max(branches, key=lambda branch: branch[score_name])

weak_choice = keep_one("weak_score")
exact_choice = keep_one("exact_score")
print(f"weak_evaluator_keeps_solution: {weak_choice['solvable']}")
print(f"exact_evaluator_keeps_solution: {exact_choice['solvable']}")
print("risk: beam_width_1 can prune the valid branch")

Output

weak_evaluator_keeps_solution: False
exact_evaluator_keeps_solution: True
risk: beam_width_1 can prune the valid branch

The production implications are concrete:

Keep ToT for tasks with real branch structure, not ordinary classification.
Prefer deterministic validators when a partial state can be checked in code.
Measure solver recall at each beam width alongside final successes.
Cap expansions and latency before an open-ended search reaches users.
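
Measuring whether the viable branch survives at each beam width can be sketched with the same three first moves and the same weak scores as the pruning example. The `beam_keeps_solution` helper is a name introduced for this illustration, and a real measurement would average over many labeled cases, not one.

```python
# Same first moves and weak evaluator scores as the pruning example.
branches = [
    {"move": "6 * 4 = 24 first", "solvable": False, "weak_score": 0.95},
    {"move": "5 + 7 = 12 first", "solvable": True, "weak_score": 0.40},
    {"move": "7 - 5 = 2 first", "solvable": False, "weak_score": 0.35},
]


def beam_keeps_solution(width: int) -> bool:
    # Keep the top-`width` branches by weak score, then check whether
    # any surviving branch can still reach a solution.
    kept = sorted(branches, key=lambda branch: branch["weak_score"], reverse=True)[:width]
    return any(branch["solvable"] for branch in kept)


for width in (1, 2, 3):
    print(f"beam_width_{width}: keeps_solution={beam_keeps_solution(width)}")
```

In this toy case a beam of two already recovers the solvable branch. Widening the beam buys recall at the cost of more expansions, which is why the two are measured together.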

Choose compute with an eval gate

Direct prompting, one trace, voting, and tree search aren't maturity levels. They are candidates with different accuracy and serving cost. Start with the cheapest candidate, then promote a more expensive strategy through an eval gate only when held-out results require it.

Figure: An accuracy-versus-p95-latency chart compares direct prompting at 76% and 190 milliseconds, one trace at 84% and 360 milliseconds, five-sample voting at 94% and 740 milliseconds, and tree search at 96% and 1840 milliseconds; only voting clears both the 90% accuracy floor and 900 millisecond latency budget. Apply both constraints before optimizing cost. Here, five-sample voting is the only strategy eligible to ship.

reasoning-release-gate.py

results = [
    {"strategy": "direct", "accuracy": 0.76, "p95_ms": 190, "calls": 1},
    {"strategy": "single_trace", "accuracy": 0.84, "p95_ms": 360, "calls": 1},
    {"strategy": "vote_5", "accuracy": 0.94, "p95_ms": 740, "calls": 5},
    {"strategy": "tree_search", "accuracy": 0.96, "p95_ms": 1840, "calls": 14},
]

minimum_accuracy = 0.90
latency_budget_ms = 900
eligible = [
    row for row in results
    if row["accuracy"] >= minimum_accuracy and row["p95_ms"] <= latency_budget_ms
]
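# min() raises ValueError on an empty list, so fall back explicitly when nothing clears both gates.
selected = min(eligible, key=lambda row: (row["calls"], row["p95_ms"]), default=None)

for row in results:
    print(f"{row['strategy']}: accuracy={row['accuracy']:.0%}, p95_ms={row['p95_ms']}, calls={row['calls']}")
print(f"selected: {selected['strategy'] if selected else 'none_eligible'}")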

Output

direct: accuracy=76%, p95_ms=190, calls=1
single_trace: accuracy=84%, p95_ms=360, calls=1
vote_5: accuracy=94%, p95_ms=740, calls=5
tree_search: accuracy=96%, p95_ms=1840, calls=14
selected: vote_5

These numbers are example evaluation results, not a benchmark claim. In your system, keep a table with:

Metric	Why it matters
Action accuracy or task success	Extra reasoning must change correct outcomes
Unsafe-action and abstention rates	Reliability includes knowing when not to act
Input, output, and reasoning tokens	Sampling and search multiply spend
p50 and p95 latency	Long tails can make interactive workflows unusable
Parse and schema failures	A correct thought is useless if the runtime can't consume its action
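
The metrics in that table can be computed from per-case evaluation records. The sketch below uses five made-up records and a nearest-rank percentile. The record fields, the `percentile` helper, and every number are assumptions for illustration, not measurements.

```python
import math

# Hypothetical per-case eval records; every value is invented for this sketch.
records = [
    {"expected": "merge_release", "action": "merge_release", "latency_ms": 180, "tokens": 410},
    {"expected": "manual_review", "action": "manual_review", "latency_ms": 200, "tokens": 395},
    {"expected": "manual_review", "action": "merge_release", "latency_ms": 220, "tokens": 430},
    {"expected": "merge_release", "action": "manual_review", "latency_ms": 260, "tokens": 402},
    {"expected": "manual_review", "action": None, "latency_ms": 900, "tokens": 1200},  # parse failure
]


def percentile(values: list[int], p: int) -> int:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


total = len(records)
latencies = [r["latency_ms"] for r in records]
summary = {
    "accuracy": sum(r["action"] == r["expected"] for r in records) / total,
    # Unsafe: the controller merged when policy expected a review.
    "unsafe_rate": sum(
        r["action"] == "merge_release" and r["expected"] != "merge_release" for r in records
    ) / total,
    "parse_failure_rate": sum(r["action"] is None for r in records) / total,
    "total_tokens": sum(r["tokens"] for r in records),
    "p50_ms": percentile(latencies, 50),
    "p95_ms": percentile(latencies, 95),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

With only five cases the p95 is simply the slowest call. A real gate needs enough cases for the tail estimate to mean something.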

Where reasoning ends and tools begin

All runnable experiments above operate on provided facts or deterministic state. A real release decision requires current CI status, security scan output, and review state from source systems. Reasoning alone can't obtain those observations.

ReAct interleaves reasoning traces and task-specific actions so new observations can update the next decision.[7] The production handoff isn't a saved inner monologue. It's a validated action request, a controlled execution result, and a bounded next decision:

reasoning-to-tool-handoff.txt

Need: security scan status is not present in supplied facts.
Next action request: get_security_scan(run_id="ci-1482")
Runtime responsibility: validate authorization, execute call, log result.
Next decision: apply policy only after observation is returned.
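
The "validate" part of that runtime responsibility can be sketched as an allowlist check that runs before any call executes. The `ActionRequest` class, the `ALLOWED_TOOLS` table, and the `ci-<digits>` run-id pattern are assumptions for illustration; authorization and execution are out of scope here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRequest:
    name: str
    arguments: dict[str, str]


# Hypothetical allowlist: tool name -> {argument name: required pattern}.
ALLOWED_TOOLS = {"get_security_scan": {"run_id": re.compile(r"ci-\d+")}}


def validate(request: ActionRequest) -> tuple[bool, str]:
    spec = ALLOWED_TOOLS.get(request.name)
    if spec is None:
        return False, "unknown tool"
    if set(request.arguments) != set(spec):
        return False, "unexpected or missing arguments"
    for key, pattern in spec.items():
        value = request.arguments[key]
        # fullmatch rejects values with trailing text after a valid prefix.
        if not isinstance(value, str) or not pattern.fullmatch(value):
            return False, f"invalid {key}"
    return True, "ok"


requests = [
    ActionRequest("get_security_scan", {"run_id": "ci-1482"}),
    ActionRequest("force_merge", {"run_id": "ci-1482"}),
    ActionRequest("get_security_scan", {"run_id": "ci-1482; rm -rf /"}),
]
for request in requests:
    print(f"{request.name}: {validate(request)}")
```

Only a request that passes this check reaches execution. The model's stated reason for wanting the call plays no part in the decision.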

The next lesson implements that action boundary with typed function calls, schemas, errors, and safe execution.

What to remember

Define the scorer first. A decision record lets you test whether extra inference work improves outcomes.
One trace is one candidate. CoT can reveal missed steps, but a plausible rationale isn't a faithful audit log.[4]
Vote on normalized outcomes. Self-consistency is useful only when its measured gain beats its sample cost.[5]
Search only with branch structure. ToT needs meaningful states, evaluators, pruning limits, and failure measurements.[6]
Promote strategies through evals. Direct, trace, vote, and search should compete under quality and latency gates.

Mastery check

Key concepts

Chain-of-Thought as one structured candidate trace
Zero-shot CoT versus few-shot contract examples
Reasoning trace faithfulness limits
Self-Consistency with canonicalization and abstention
Tree-of-Thoughts as bounded, evaluator-guided search
Beam pruning failure modes
Inference-time cost, latency, and release gates
ReAct as the handoff from decisions to observations

Evaluation rubric

Foundational: Defines an expected action and the checks required to justify it.
Intermediate: Implements final-answer normalization, majority voting, and an abstention threshold.
Intermediate: Compares direct, trace, and voting strategies on labeled fixtures with cost accounting.
Advanced: Implements a searchable state space and explains how evaluator errors interact with pruning.
Advanced: Designs a release gate using task success, safety, token cost, and tail latency.

Common pitfalls

Scoring text instead of outcomes: Require final actions and policy-check fields that an evaluator can validate.
Logging unrestricted rationale as evidence: Trace text can be unfaithful. Log retrieved inputs, validated actions, observations, and results.
Voting on raw strings: Normalize final decisions into allowed enums before counting.
Treating every plurality as safe: Add abstention for ties, weak majorities, and rejected outputs.
Using ToT for a lookup: Search cost is wasted when no branch can backtrack or improve the answer.
Pruning with an unmeasured evaluator: A confident score can remove the only valid branch. Evaluate beam recall and cap the budget.

Practice extension

Create twenty release-gate fixtures with policy-grounded final actions. Collect direct, structured-trace, and five-sample candidate outputs from one model endpoint. Canonicalize the final action, abstain on weak votes, and produce a comparison table with accuracy, unsafe actions, abstentions, tokens, and p95 latency. Add search only for cases where choosing an action requires sequencing multiple checks or backing out of a failed plan.

Next Step

Continue to Function Calling & Tool Use

You can now choose and evaluate a reasoning budget over supplied facts; next you'll convert missing facts and intended actions into validated tool calls.


References

[1] Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS. https://arxiv.org/abs/2201.11903

[2] Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. https://arxiv.org/abs/2205.11916

[3] Zheng, T., Chen, Y., Li, C., et al. (2025). The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning. arXiv preprint. https://arxiv.org/abs/2504.05081

[4] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. https://arxiv.org/abs/2305.04388

[5] Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. https://arxiv.org/abs/2203.11171

[6] Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS. https://arxiv.org/abs/2305.10601

[7] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629

    return ALIASES.get(normalized)

samples = [
    "Merge release",
    "merge_release",
    "Auto merge approved change",
    "manual review",
    "force merge immediately",
]
votes = Counter(action for text in samples if (action := canonicalize(text)))
winner = votes.most_common(1)[0][0] if votes else "manual_review"

print(f"accepted_samples: {sum(votes.values())}/{len(samples)}")
print(f"votes: {dict(votes)}")
print(f"winner: {winner}")

Output

accepted_samples: 4/5
votes: {'merge_release': 3, 'manual_review': 1}
winner: merge_release

A winner isn't always confident enough

abstain-on-weak-votes.py

from collections import Counter

def decide(votes: list[str], total_samples: int, minimum_share: float = 0.6) -> str:
    if not votes:
        return "manual_review"
    counts = Counter(votes)
    winner, count = counts.most_common(1)[0]
    share = count / total_samples
    tied = len(counts) > 1 and counts.most_common(2)[0][1] == counts.most_common(2)[1][1]
    if tied or share < minimum_share:
        return "manual_review"
    return winner

strong = ["merge_release", "merge_release", "merge_release", "manual_review"]
split = ["merge_release", "merge_release", "manual_review", "manual_review"]
mostly_rejected = ["merge_release", "merge_release"]

print(f"strong_vote: {decide(strong, total_samples=4)}")
print(f"split_vote: {decide(split, total_samples=4)}")
print(f"mostly_rejected_vote: {decide(mostly_rejected, total_samples=5)}")
print(f"no_valid_votes: {decide([], total_samples=5)}")

Output

strong_vote: merge_release
split_vote: manual_review
mostly_rejected_vote: manual_review
no_valid_votes: manual_review

Measure gains against call cost

measure-voting-gain.py

from collections import Counter

fixtures = {
    "approved_clean_release": {
        "expected": "merge",
        "samples": ["review", "merge", "merge", "merge", "force"],
    },
    "scan_failed": {
        "expected": "review",
        "samples": ["review", "review", "force", "review", "hold"],
    },
    "review_missing": {
        "expected": "review",
        "samples": ["merge", "review", "review", "review", "merge"],
    },
}

single_correct = 0
vote_correct = 0
for item in fixtures.values():
    winner = Counter(item["samples"]).most_common(1)[0][0]
    single_correct += item["samples"][0] == item["expected"]
    vote_correct += winner == item["expected"]

total = len(fixtures)
print(f"single_trace_accuracy: {single_correct / total:.0%}")
print(f"vote_5_accuracy: {vote_correct / total:.0%}")
print(f"model_calls: single={total}, vote_5={total * 5}")

Output

single_trace_accuracy: 33%
vote_5_accuracy: 100%
model_calls: single=3, vote_5=15
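
Whether five calls are worth paying for depends on how accurate each sample is. As rough intuition only, the sketch below computes majority-vote accuracy for a two-action decision, assuming every sample is independent and has the same per-sample accuracy. Samples drawn from one model and one prompt are correlated, so a measured gain can be smaller than this idealized figure. `majority_vote_accuracy` is a hypothetical helper, not part of the lesson's code.

```python
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(a strict majority of n independent samples is correct), binary outcome."""
    if n % 2 == 0:
        raise ValueError("use an odd n so a strict majority always exists")
    need = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

for p in (0.4, 0.6, 0.7):
    print(f"per_sample={p:.0%}: vote_5={majority_vote_accuracy(p, 5):.1%}")
```

Under these assumptions, voting amplifies whichever side of 50% the sampler is already on: a 60% sampler rises to about 68% with five votes, while a 40% sampler falls to about 32%. Voting can't rescue a prompt that is usually wrong.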

Tree-of-Thoughts: search when branches can dead-end

Search states in a release plan

expand-release-plan-states.py

from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    security_clear: bool | None = None
    reviewer_approved: bool | None = None
    final_action: str | None = None

def expand(state: State) -> list[tuple[str, State]]:
    next_states: list[tuple[str, State]] = []
    if state.security_clear is None:
        next_states.extend([
            ("security_scan:clear", State(True, state.reviewer_approved)),
            ("security_scan:failed", State(False, state.reviewer_approved)),
        ])
    if state.reviewer_approved is None:
        next_states.extend([
            ("review:approved", State(state.security_clear, True)),
            ("review:rejected", State(state.security_clear, False)),
        ])
    if (
        state.security_clear is not None
        and state.reviewer_approved is not None
        and state.final_action is None
    ):
        action = "merge_release" if state.security_clear and state.reviewer_approved else "block_release"
        next_states.append((action, State(state.security_clear, state.reviewer_approved, action)))
    return next_states

frontier = [State()]
for depth in range(3):
    generated = [item for state in frontier for item in expand(state)]
    print(f"depth_{depth + 1}: {[action for action, _ in generated]}")
    frontier = list(dict.fromkeys(state for _, state in generated))

Output

depth_1: ['security_scan:clear', 'security_scan:failed', 'review:approved', 'review:rejected']
depth_2: ['review:approved', 'review:rejected', 'review:approved', 'review:rejected', 'security_scan:clear', 'security_scan:failed', 'security_scan:clear', 'security_scan:failed']
depth_3: ['merge_release', 'block_release', 'block_release', 'block_release']

A fully runnable search example

breadth-first-game-of-24.py

from collections import deque
from fractions import Fraction
from itertools import combinations

def combine(left: tuple[Fraction, str], right: tuple[Fraction, str]) -> list[tuple[Fraction, str]]:
    # Every arithmetic result of two operands, paired with its expression string.
    a, a_expr = left
    b, b_expr = right
    outcomes = [
        (a + b, f"({a_expr} + {b_expr})"),
        (a - b, f"({a_expr} - {b_expr})"),
        (b - a, f"({b_expr} - {a_expr})"),
        (a * b, f"({a_expr} * {b_expr})"),
    ]
    # Skip division by zero; Fraction keeps the remaining divisions exact.
    if b:
        outcomes.append((a / b, f"({a_expr} / {b_expr})"))
    if a:
        outcomes.append((b / a, f"({b_expr} / {a_expr})"))
    return outcomes

def solve_24(numbers: list[int]) -> str | None:
    # Each state is the list of (exact value, expression) pairs still available.
    frontier = deque([[(Fraction(number), str(number)) for number in numbers]])
    while frontier:
        state = frontier.popleft()  # FIFO order keeps the search breadth-first
        if len(state) == 1 and state[0][0] == 24:
            return state[0][1]
        # Merge any two operands into one, shrinking the state by one item.
        for i, j in combinations(range(len(state)), 2):
            remainder = [item for k, item in enumerate(state) if k not in (i, j)]
            frontier.extend([remainder + [result] for result in combine(state[i], state[j])])
    return None

solution = solve_24([4, 5, 6, 7])
print(f"solution_found: {solution is not None}")
print(f"expression: {solution}")

Output

solution_found: True
expression: ((6 - 4) * (5 + 7))

Pruning is a source of failure

An LLM evaluator isn't an arithmetic oracle. If it scores an apparently simple but dead branch above a non-obvious solvable branch, an aggressive beam can remove the answer before expansion.

beam-pruning-risk.py

branches = [
    {"move": "6 * 4 = 24 first", "solvable": False, "weak_score": 0.95, "exact_score": 0.0},
    {"move": "5 + 7 = 12 first", "solvable": True, "weak_score": 0.40, "exact_score": 1.0},
    {"move": "7 - 5 = 2 first", "solvable": False, "weak_score": 0.35, "exact_score": 0.0},
]

def keep_one(score_name: str) -> dict[str, object]:
    return max(branches, key=lambda branch: branch[score_name])

weak_choice = keep_one("weak_score")
exact_choice = keep_one("exact_score")
print(f"weak_evaluator_keeps_solution: {weak_choice['solvable']}")
print(f"exact_evaluator_keeps_solution: {exact_choice['solvable']}")
print("risk: beam_width_1 can prune the valid branch")

Output

weak_evaluator_keeps_solution: False
exact_evaluator_keeps_solution: True
risk: beam_width_1 can prune the valid branch
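
One way to make this risk measurable is beam recall: across labeled cases, how often does at least one solvable branch survive pruning at a given beam width? The sketch below assumes hypothetical fixtures in which each first-move branch carries an evaluator score and a ground-truth `solvable` label from an exact solver. The scores are invented for illustration.

```python
# Hypothetical labeled cases: (evaluator_score, solvable) for each branch.
cases = [
    [(0.95, False), (0.40, True), (0.35, False)],
    [(0.80, True), (0.60, False), (0.20, False)],
    [(0.70, False), (0.65, False), (0.50, True)],
]

def beam_recall(cases: list[list[tuple[float, bool]]], width: int) -> float:
    """Share of cases where a solvable branch is among the top-`width` scores."""
    kept = 0
    for branches in cases:
        survivors = sorted(branches, key=lambda branch: branch[0], reverse=True)[:width]
        kept += any(solvable for _, solvable in survivors)
    return kept / len(cases)

for width in (1, 2, 3):
    print(f"beam_width={width}: recall={beam_recall(cases, width):.0%}")
```

With these fixtures, recall climbs from one case in three at width 1 to all three at width 3. If recall at your chosen width is low, final success can't exceed it no matter how good the later steps are.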

The production implications are concrete:

Keep ToT for tasks with real branch structure, not ordinary classification.
Prefer deterministic validators when a partial state can be checked in code.
Measure solver recall at each beam width alongside final successes.
Cap expansions and latency before an open-ended search reaches users.
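
The last point can be sketched as a breadth-first search with a hard expansion budget. This is a minimal illustration under stated assumptions: `bounded_search` is a hypothetical helper, and the toy state space (reach 24 from 3 using "+3" or "*2") stands in for model-proposed thoughts.

```python
from collections import deque
from typing import Callable, Iterable, Optional

def bounded_search(
    start: int,
    expand: Callable[[int], Iterable[int]],
    is_goal: Callable[[int], bool],
    max_expansions: int,
) -> tuple[Optional[int], int]:
    """Breadth-first search that gives up after max_expansions node expansions."""
    frontier = deque([start])
    seen = {start}
    expansions = 0
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            return state, expansions
        if expansions >= max_expansions:
            return None, expansions  # budget exhausted: caller should fall back
        expansions += 1
        for nxt in expand(state):
            if nxt not in seen:  # skip states that were already queued
                seen.add(nxt)
                frontier.append(nxt)
    return None, expansions

def toy_expand(n: int) -> list[int]:
    return [n + 3, n * 2]

for budget in (4, 6):
    found, used = bounded_search(3, toy_expand, lambda n: n == 24, budget)
    print(f"budget={budget}: found={found}, expansions_used={used}")
```

In this toy run a budget of 4 ends without an answer and a budget of 6 reaches the goal. In a release controller, an exhausted budget would map to an abstention such as `manual_review` instead of a longer search.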

Choose compute with an eval gate

reasoning-release-gate.py

results = [
    {"strategy": "direct", "accuracy": 0.76, "p95_ms": 190, "calls": 1},
    {"strategy": "single_trace", "accuracy": 0.84, "p95_ms": 360, "calls": 1},
    {"strategy": "vote_5", "accuracy": 0.94, "p95_ms": 740, "calls": 5},
    {"strategy": "tree_search", "accuracy": 0.96, "p95_ms": 1840, "calls": 14},
]

minimum_accuracy = 0.90
latency_budget_ms = 900
eligible = [
    row for row in results
    if row["accuracy"] >= minimum_accuracy and row["p95_ms"] <= latency_budget_ms
]
selected = min(eligible, key=lambda row: (row["calls"], row["p95_ms"]))

for row in results:
    print(f"{row['strategy']}: accuracy={row['accuracy']:.0%}, p95_ms={row['p95_ms']}, calls={row['calls']}")
print(f"selected: {selected['strategy']}")

Output

direct: accuracy=76%, p95_ms=190, calls=1
single_trace: accuracy=84%, p95_ms=360, calls=1
vote_5: accuracy=94%, p95_ms=740, calls=5
tree_search: accuracy=96%, p95_ms=1840, calls=14
selected: vote_5

These numbers are example evaluation results, not a benchmark claim. In your system, keep a table with:

Metric	Why it matters
Action accuracy or task success	Extra reasoning is justified only if it raises the share of correct outcomes
Unsafe-action and abstention rates	Reliability includes knowing when not to act
Input, output, and reasoning tokens	Sampling and search multiply spend
p50 and p95 latency	Long tails can make support interactions unusable
Parse and schema failures	A correct thought is useless if the runtime can't consume its action
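
Most rows of this table can be computed from a per-request eval log. The sketch below assumes a hypothetical log where each row records correctness, safety, abstention, parse success, and latency. It uses a simple nearest-rank percentile and leaves out token counts, which would be summed the same way.

```python
from math import ceil

# Hypothetical eval log: (correct, unsafe, abstained, parse_ok, latency_ms).
log = [
    (True, False, False, True, 180),
    (True, False, False, True, 190),
    (True, False, False, True, 200),
    (False, True, False, True, 210),
    (True, False, False, True, 220),
    (False, False, True, True, 240),
    (True, False, False, True, 260),
    (False, False, False, False, 300),
    (True, False, False, True, 450),
    (True, False, False, True, 1200),
]

def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

total = len(log)
latencies = [row[4] for row in log]
summary = {
    "accuracy": sum(row[0] for row in log) / total,
    "unsafe_rate": sum(row[1] for row in log) / total,
    "abstention_rate": sum(row[2] for row in log) / total,
    "parse_failure_rate": sum(not row[3] for row in log) / total,
    "p50_ms": percentile(latencies, 50),
    "p95_ms": percentile(latencies, 95),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In this invented log, a single slow request puts p95 at 1200 ms while p50 stays at 220 ms, which is why the gate checks tail latency and not the median.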

Where reasoning ends and tools begin

reasoning-to-tool-handoff.txt

Need: security scan status is not present in supplied facts.
Next action request: get_security_scan(run_id="ci-1482")
Runtime responsibility: validate authorization, execute call, log result.
Next decision: apply policy only after observation is returned.

The next lesson implements that action boundary with typed function calls, schemas, errors, and safe execution.

What to remember

Define the scorer first. A decision record lets you test whether extra inference work improves outcomes.
One trace is one candidate. CoT can reveal missed steps, but a plausible rationale isn't a faithful audit log [4].
Vote on normalized outcomes. Self-consistency is useful only when its measured gain beats its sample cost [5].
Search only with branch structure. ToT needs meaningful states, evaluators, pruning limits, and failure measurements [6].
Promote strategies through evals. Direct, trace, vote, and search should compete under quality and latency gates.

Mastery check

Key concepts

Chain-of-Thought as one structured candidate trace
Zero-shot CoT versus few-shot contract examples
Reasoning trace faithfulness limits
Self-Consistency with canonicalization and abstention
Tree-of-Thoughts as bounded, evaluator-guided search
Beam pruning failure modes
Inference-time cost, latency, and release gates
ReAct as the handoff from decisions to observations

Evaluation rubric

Foundational: Defines an expected action and the checks required to justify it.
Intermediate: Implements final-answer normalization, majority voting, and an abstention threshold.
Intermediate: Compares direct, trace, and voting strategies on labeled fixtures with cost accounting.
Advanced: Implements a searchable state space and explains how evaluator errors interact with pruning.
Advanced: Designs a release gate using task success, safety, token cost, and tail latency.

Common pitfalls

Scoring text instead of outcomes: Require final actions and policy-check fields that an evaluator can validate.
Logging unrestricted rationale as evidence: Trace text can be unfaithful. Log retrieved inputs, validated actions, observations, and results.
Voting on raw strings: Normalize final decisions into allowed enums before counting.
Treating every plurality as safe: Add abstention for ties, weak majorities, and rejected outputs.
Using ToT for a lookup: Search cost is wasted when no branch can backtrack or improve the answer.
Pruning with an unmeasured evaluator: A confident score can remove the only valid branch. Evaluate beam recall and cap the budget.

Practice extension

Next Step

Continue to Function Calling & Tool Use

You can now choose and evaluate a reasoning budget over supplied facts; next you'll convert missing facts and intended actions into validated tool calls.

References

[1] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Wei, J., et al. · 2022 · NeurIPS

[2] Large Language Models are Zero-Shot Reasoners.
Kojima, T., et al. · 2022

[3] The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning.
Zheng, T., Chen, Y., Li, C., et al. · 2025 · arXiv preprint · https://arxiv.org/abs/2504.05081

[4] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.
Turpin, M., Michael, J., Perez, E., & Bowman, S. R. · 2023 · https://arxiv.org/abs/2305.04388

[5] Self-Consistency Improves Chain of Thought Reasoning in Language Models.
Wang, X., et al. · 2022 · https://arxiv.org/abs/2203.11171

[6] Tree of Thoughts: Deliberate Problem Solving with Large Language Models.
Yao, S., et al. · 2023 · NeurIPS · https://arxiv.org/abs/2305.10601

[7] ReAct: Synergizing Reasoning and Acting in Language Models.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR
