Build and evaluate reasoning controllers: single traces, answer voting, and bounded tree search for multi-step LLM decisions.
The previous lesson measured whether compressed still retrieved the right policy evidence. Retrieval is not the end of the task. An assistant may now have the right facts and still choose the wrong action because it skips a condition, mishandles arithmetic, or commits too early to one plan.
This lesson turns extra inference work into an engineering decision. You will build a small support-resolution controller for a delayed parcel.
The goal is not to collect long rationales. The goal is to improve measurable decision accuracy under a cost, latency, and safety budget.
Suppose an order has been delayed for six days. A replacement is in stock, the carrier scan confirms a delivery exception, and policy permits expedited replacement after five delayed days. A direct response may still overlook one condition and offer a refund or escalation unnecessarily.
A useful outward artifact is a short decision record: facts used, checks applied, and final action. It is smaller than an open-ended rationale, easy to score, and safe to compare against deterministic policy logic.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ParcelCase:
        delayed_days: int
        exception_scan: bool
        replacement_in_stock: bool
        expedite_after_days: int = 5

    def decision_record(case: ParcelCase) -> dict[str, object]:
        # Each policy predicate is evaluated explicitly so a miss is visible.
        checks = {
            "delay_threshold_met": case.delayed_days >= case.expedite_after_days,
            "carrier_exception_confirmed": case.exception_scan,
            "replacement_available": case.replacement_in_stock,
        }
        # Any failed check falls back to manual review.
        action = "reship_expedited" if all(checks.values()) else "manual_review"
        return {"checks": checks, "final_action": action}

    record = decision_record(
        ParcelCase(delayed_days=6, exception_scan=True, replacement_in_stock=True)
    )
    for name, passed in record["checks"].items():
        print(f"{name}: {passed}")
    print(f"final_action: {record['final_action']}")

Output:

    delay_threshold_met: True
    carrier_exception_confirmed: True
    replacement_available: True
    final_action: reship_expedited

This code is not an LLM. It is the oracle that your prompt variants must match. Before increasing model compute, define an output contract and a scorer.
Wei et al. introduced few-shot CoT by placing worked intermediate steps in prompt examples. Their experiments found gains on arithmetic, commonsense, and symbolic reasoning benchmarks for sufficiently large models, including a strong GSM8K result with PaLM 540B.[1] Kojima et al. later showed that the zero-shot trigger "Let's think step by step" could improve several benchmark reasoning tasks without worked examples.[2]
Those papers establish techniques to evaluate, not a production law. On your endpoint and task, a visible scratchpad may help, do nothing, or add cost. If an API offers a native reasoning control, evaluate that option as another strategy rather than assuming that extra visible prose helps.
For a customer-facing workflow, do not ask the model to reveal unrestricted inner reasoning. Ask it to produce reviewable artifacts:

    Use only the supplied parcel facts and policy rules.
    Return:
    1. required_checks: each policy predicate with pass/fail
    2. final_action: one allowed action enum
    3. customer_message: one sentence

    Facts:
    - delayed_days: 6
    - carrier_exception_confirmed: true
    - replacement_in_stock: true

    Policy:
    - expedited replacement is allowed when delayed_days >= 5,
      an exception is confirmed, and replacement stock exists.

The checks are useful because a missed predicate becomes observable. They are not proof of faithful hidden cognition. Turpin et al. showed that CoT explanations can rationalize outputs influenced by hidden biasing features without mentioning those features.[3] Log inputs, actions, validations, and outcomes; do not treat eloquent reasoning text as an audit guarantee.
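The contract in the prompt above can be checked mechanically before any action reaches a customer. The sketch below assumes the model is asked to return its artifact as JSON, which the prompt itself does not specify; `ALLOWED_ACTIONS`, `REQUIRED_CHECKS`, and `validate_artifact` are hypothetical names introduced only for illustration.

```python
import json

# Assumed contract values, taken from the examples in this lesson.
ALLOWED_ACTIONS = {"reship_expedited", "manual_review"}
REQUIRED_CHECKS = {
    "delay_threshold_met",
    "carrier_exception_confirmed",
    "replacement_available",
}


def validate_artifact(raw: str) -> list[str]:
    """Return contract violations; an empty list means the artifact is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems: list[str] = []
    checks = data.get("required_checks")
    if not isinstance(checks, dict):
        problems.append("required_checks missing or not an object")
    else:
        missing = REQUIRED_CHECKS - set(checks)
        if missing:
            problems.append(f"missing checks: {sorted(missing)}")
        if not all(isinstance(value, bool) for value in checks.values()):
            problems.append("check values must be booleans")

    action = data.get("final_action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        problems.append("final_action is not an allowed action")

    message = data.get("customer_message")
    if not isinstance(message, str) or not message.strip():
        problems.append("customer_message missing or empty")
    return problems


good = json.dumps({
    "required_checks": {name: True for name in REQUIRED_CHECKS},
    "final_action": "reship_expedited",
    "customer_message": "We are sending an expedited replacement today.",
})
bad = json.dumps({
    "required_checks": {"delay_threshold_met": True},
    "final_action": "issue_credit",
    "customer_message": "",
})

print(f"good: {validate_artifact(good)}")
print(f"bad: {validate_artifact(bad)}")
```

Returning a list of violations rather than a single boolean keeps schema failures countable by type, which is useful when parse and schema failures are tracked as their own metric.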

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Candidate:
        name: str
        checks: set[str]
        final_action: str

    required_checks = {
        "delay_threshold_met",
        "carrier_exception_confirmed",
        "replacement_available",
    }
    expected_action = "reship_expedited"
    candidates = [
        Candidate("direct", {"delay_threshold_met"}, "manual_review"),
        Candidate("structured_trace", required_checks, "reship_expedited"),
    ]

    for candidate in candidates:
        # Coverage is the fraction of required predicates the candidate reported.
        coverage = len(candidate.checks & required_checks) / len(required_checks)
        action_ok = candidate.final_action == expected_action
        print(f"{candidate.name}: coverage={coverage:.0%}, action_ok={action_ok}")

Output:

    direct: coverage=33%, action_ok=False
    structured_trace: coverage=100%, action_ok=True

Here, the trace is useful because it meets the scored contract. It does not win merely because it contains more words.
Zero-shot CoT supplies an instruction and lets the model choose its intermediate format. Few-shot CoT gives one or more solved examples, so it can teach both decomposition and the final answer shape. Native structured-output enforcement is better when available; examples are still useful when the prompt must communicate task-specific checks.

    example = """Example:
    facts: delayed_days=2, exception_scan=true, replacement_in_stock=true
    required_checks: delay_threshold_met=false, carrier_exception_confirmed=true, replacement_available=true
    final_action: manual_review"""

    case = "facts: delayed_days=6, exception_scan=true, replacement_in_stock=true"
    # Zero-shot: an instruction only. Few-shot: a solved example plus the answer shape.
    zero_shot = f"Evaluate parcel policy step by step.\n{case}\nfinal_action:"
    few_shot = f"{example}\n\nNow evaluate:\n{case}\nrequired_checks:\nfinal_action:"

    print(f"zero_shot_has_example: {'Example:' in zero_shot}")
    print(f"few_shot_has_example: {'Example:' in few_shot}")
    print(f"few_shot_requests_checks: {'required_checks:' in few_shot}")

Output:

    zero_shot_has_example: False
    few_shot_has_example: True
    few_shot_requests_checks: True

One structured trace can fail because generation takes an unlucky path. Self-Consistency replaces reliance on one path with several sampled paths and chooses the most consistent final answer.[4] The vote operates on extracted answers, not on whose rationale sounds best.
On the original benchmark setting, Wang et al. reported a 17.9 percentage-point GSM8K improvement over CoT prompting for PaLM 540B.[4] That number is evidence for the method on those benchmarks. It is not your expected support-resolution gain. Measure your own cases and sample cost.
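Before paying for samples, it can help to estimate how much a majority vote could gain at all. The sketch below is a simplification, not the method from the paper: it assumes a binary right-or-wrong answer, a fixed per-sample accuracy `p`, and fully independent samples. Samples drawn from one model are correlated, so read the result as an optimistic bound. The function name `majority_correct_probability` is hypothetical.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes each sample is correct with probability p, independently of the
    others. This is the binomial upper tail starting at n // 2 + 1 successes.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so a strict majority always exists")
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


for p in (0.45, 0.60, 0.80):
    row = ", ".join(
        f"n={n}: {majority_correct_probability(p, n):.3f}" for n in (1, 5, 15)
    )
    print(f"p={p:.2f} -> {row}")
```

Under these assumptions, voting amplifies a per-sample accuracy above 50% and makes one below 50% worse, which is one reason to measure single-sample accuracy on your own cases first.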
Model outputs rarely use exactly one spelling. Your controller should map harmless variants to one allowed action and reject unknown outputs before voting.

    from collections import Counter

    ALIASES = {
        "expedited reship": "reship_expedited",
        "reship_expedited": "reship_expedited",
        "send expedited replacement": "reship_expedited",
        "manual review": "manual_review",
    }

    def canonicalize(text: str) -> str | None:
        # Normalize case, whitespace, and hyphens; unknown strings map to None.
        normalized = text.strip().lower().replace("-", " ")
        return ALIASES.get(normalized)

    samples = [
        "Expedited reship",
        "reship_expedited",
        "Send expedited replacement",
        "manual review",
        "issue credit immediately",
    ]
    # Only outputs that canonicalize to an allowed action are counted as votes.
    votes = Counter(action for text in samples if (action := canonicalize(text)))
    winner = votes.most_common(1)[0][0] if votes else "manual_review"

    print(f"accepted_samples: {sum(votes.values())}/{len(samples)}")
    print(f"votes: {dict(votes)}")
    print(f"winner: {winner}")

Output:

    accepted_samples: 4/5
    votes: {'reship_expedited': 3, 'manual_review': 1}
    winner: reship_expedited

A 2 to 2 tie, zero accepted outputs, or a narrow plurality with many rejected outputs should not silently become a customer action. Add an abstention rule. The winning share must use all sampled outputs as its denominator, not only the strings that parsed successfully.

    from collections import Counter

    def decide(votes: list[str], total_samples: int, minimum_share: float = 0.6) -> str:
        if not votes:
            return "manual_review"
        counts = Counter(votes)
        winner, count = counts.most_common(1)[0]
        # The denominator is every sampled output, including rejected ones.
        share = count / total_samples
        tied = len(counts) > 1 and counts.most_common(2)[0][1] == counts.most_common(2)[1][1]
        if tied or share < minimum_share:
            return "manual_review"
        return winner

    strong = ["reship_expedited", "reship_expedited", "reship_expedited", "refund"]
    split = ["reship_expedited", "reship_expedited", "refund", "refund"]
    mostly_rejected = ["reship_expedited", "reship_expedited"]

    print(f"strong_vote: {decide(strong, total_samples=4)}")
    print(f"split_vote: {decide(split, total_samples=4)}")
    print(f"mostly_rejected_vote: {decide(mostly_rejected, total_samples=5)}")
    print(f"no_valid_votes: {decide([], total_samples=5)}")

Output:

    strong_vote: reship_expedited
    split_vote: manual_review
    mostly_rejected_vote: manual_review
    no_valid_votes: manual_review

Use a held-out fixture set before calling the strategy ready. The code below represents five sampled final actions returned for each fixture; the controller compares first-sample accuracy with five-sample voting accuracy.

    from collections import Counter

    fixtures = {
        "late_with_stock": {
            "expected": "reship",
            "samples": ["review", "reship", "reship", "reship", "refund"],
        },
        "late_no_stock": {
            "expected": "refund",
            "samples": ["refund", "refund", "review", "refund", "review"],
        },
        "not_late": {
            "expected": "review",
            "samples": ["reship", "review", "review", "review", "reship"],
        },
    }

    single_correct = 0
    vote_correct = 0
    for item in fixtures.values():
        winner = Counter(item["samples"]).most_common(1)[0][0]
        # The first sample stands in for a single-trace run.
        single_correct += item["samples"][0] == item["expected"]
        vote_correct += winner == item["expected"]

    total = len(fixtures)
    print(f"single_trace_accuracy: {single_correct / total:.0%}")
    print(f"vote_5_accuracy: {vote_correct / total:.0%}")
    print(f"model_calls: single={total}, vote_5={total * 5}")

Output:

    single_trace_accuracy: 33%
    vote_5_accuracy: 100%
    model_calls: single=3, vote_5=15

This fixture is intentionally small and deterministic: it tests controller logic. A real release decision needs representative labeled cases, real model samples, token counts, latency, refusal rates, and cost.
Voting helps when independent paths tend to converge on the same short answer. It does not deliberately revisit earlier choices. Tree-of-Thoughts (ToT) represents partial solutions as search states, generates alternatives, evaluates the states, and preserves only branches worth extending.[5]
Yao et al. evaluated ToT on tasks built for planning and search. On Game of 24, their GPT-4 ToT setup solved 74% of tasks while their CoT baseline solved 4%.[5] The lesson is narrower than "use trees everywhere": if a problem has verifiable partial states and meaningful backtracking, search can rescue a bad early move.
For a delayed parcel, consider a workflow whose final recommendation must be supported by two observations: a confirmed carrier exception and a replacement inventory check. A controller can expand legal steps instead of letting a model invent a final action before evidence is present.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class State:
        scan_checked: bool = False
        stock_checked: bool = False
        final_action: str | None = None

    def expand(state: State) -> list[tuple[str, State]]:
        next_states: list[tuple[str, State]] = []
        if not state.scan_checked:
            next_states.append(("check_carrier_exception", State(True, state.stock_checked)))
        if not state.stock_checked:
            next_states.append(("check_replacement_stock", State(state.scan_checked, True)))
        # The final action is legal only after both observations are present.
        if state.scan_checked and state.stock_checked and state.final_action is None:
            next_states.append(("offer_expedited_reship", State(True, True, "reship")))
        return next_states

    frontier = [State()]
    for depth in range(3):
        generated = [item for state in frontier for item in expand(state)]
        print(f"depth_{depth + 1}: {[action for action, _ in generated]}")
        # Deduplicate states while preserving order.
        frontier = list(dict.fromkeys(state for _, state in generated))

Output:

    depth_1: ['check_carrier_exception', 'check_replacement_stock']
    depth_2: ['check_replacement_stock', 'check_carrier_exception']
    depth_3: ['offer_expedited_reship']

The two legal evidence-gathering orders converge on the same verified state before the final action. In the next lesson, tool calls will populate that state from real APIs.
Game of 24 is useful because the evaluator is exact: arithmetic either reaches 24 using each input once or it does not. The solver below explores partial equations with breadth-first search and returns a verified solution.

    from fractions import Fraction
    from itertools import combinations

    def combine(left: tuple[Fraction, str], right: tuple[Fraction, str]) -> list[tuple[Fraction, str]]:
        a, a_expr = left
        b, b_expr = right
        outcomes = [
            (a + b, f"({a_expr} + {b_expr})"),
            (a - b, f"({a_expr} - {b_expr})"),
            (b - a, f"({b_expr} - {a_expr})"),
            (a * b, f"({a_expr} * {b_expr})"),
        ]
        # Division is added only when the divisor is non-zero.
        if b:
            outcomes.append((a / b, f"({a_expr} / {b_expr})"))
        if a:
            outcomes.append((b / a, f"({b_expr} / {a_expr})"))
        return outcomes

    def solve_24(numbers: list[int]) -> str | None:
        # Each state is a list of (exact value, expression) pairs still available.
        frontier = [[(Fraction(number), str(number)) for number in numbers]]
        while frontier:
            state = frontier.pop(0)
            if len(state) == 1 and state[0][0] == 24:
                return state[0][1]
            for i, j in combinations(range(len(state)), 2):
                remainder = [item for k, item in enumerate(state) if k not in (i, j)]
                frontier.extend([remainder + [result] for result in combine(state[i], state[j])])
        return None

    solution = solve_24([4, 5, 6, 7])
    print(f"solution_found: {solution is not None}")
    print(f"expression: {solution}")

Output:

    solution_found: True
    expression: ((6 - 4) * (5 + 7))

An LLM evaluator is not an arithmetic oracle. If it scores an apparently simple but dead branch above a non-obvious solvable branch, an aggressive beam can remove the answer before expansion.

    branches = [
        {"move": "6 * 4 = 24 first", "solvable": False, "weak_score": 0.95, "exact_score": 0.0},
        {"move": "5 + 7 = 12 first", "solvable": True, "weak_score": 0.40, "exact_score": 1.0},
        {"move": "7 - 5 = 2 first", "solvable": False, "weak_score": 0.35, "exact_score": 0.0},
    ]

    def keep_one(score_name: str) -> dict[str, object]:
        # A beam of width 1 keeps only the single highest-scoring branch.
        return max(branches, key=lambda branch: branch[score_name])

    weak_choice = keep_one("weak_score")
    exact_choice = keep_one("exact_score")
    print(f"weak_evaluator_keeps_solution: {weak_choice['solvable']}")
    print(f"exact_evaluator_keeps_solution: {exact_choice['solvable']}")
    print("risk: beam_width_1 can prune the valid branch")

Output:

    weak_evaluator_keeps_solution: False
    exact_evaluator_keeps_solution: True
    risk: beam_width_1 can prune the valid branch

The production implications are concrete:
Direct prompting, one trace, voting, and tree search are not maturity levels. They are candidates with different accuracy and serving cost. Start with the cheapest candidate, then promote a more expensive strategy only when held-out results require it.

    results = [
        {"strategy": "direct", "accuracy": 0.76, "p95_ms": 190, "calls": 1},
        {"strategy": "single_trace", "accuracy": 0.84, "p95_ms": 360, "calls": 1},
        {"strategy": "vote_5", "accuracy": 0.94, "p95_ms": 740, "calls": 5},
        {"strategy": "tree_search", "accuracy": 0.96, "p95_ms": 1840, "calls": 14},
    ]

    minimum_accuracy = 0.90
    latency_budget_ms = 900
    # Filter by the accuracy floor and latency budget, then pick the cheapest.
    eligible = [
        row for row in results
        if row["accuracy"] >= minimum_accuracy and row["p95_ms"] <= latency_budget_ms
    ]
    selected = min(eligible, key=lambda row: (row["calls"], row["p95_ms"]))

    for row in results:
        print(f"{row['strategy']}: accuracy={row['accuracy']:.0%}, p95_ms={row['p95_ms']}, calls={row['calls']}")
    print(f"selected: {selected['strategy']}")

Output:

    direct: accuracy=76%, p95_ms=190, calls=1
    single_trace: accuracy=84%, p95_ms=360, calls=1
    vote_5: accuracy=94%, p95_ms=740, calls=5
    tree_search: accuracy=96%, p95_ms=1840, calls=14
    selected: vote_5

These numbers are example evaluation results, not a benchmark claim. In your system, keep a table with:
| Metric | Why it matters |
|---|---|
| Action accuracy or task success | Extra reasoning must change correct outcomes |
| Unsafe-action and abstention rates | Reliability includes knowing when not to act |
| Input, output, and reasoning tokens | Sampling and search multiply spend |
| p50 and p95 latency | Long tails can make support interactions unusable |
| Parse and schema failures | A correct thought is useless if the runtime cannot consume its action |
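The metrics in the table could be computed from per-request logs roughly as follows. Everything here is illustrative: the `runs` records are made-up values, the field names are assumptions rather than a real logging schema, and "unsafe" is defined narrowly as a wrong action that was not an abstention.

```python
from statistics import median, quantiles

ABSTAIN = "manual_review"

# Hypothetical logged runs; action=None marks output that failed to parse.
runs = [
    {"expected": "reship_expedited", "action": "reship_expedited", "latency_ms": 310, "tokens": 420},
    {"expected": "reship_expedited", "action": "manual_review", "latency_ms": 295, "tokens": 390},
    {"expected": "manual_review", "action": "reship_expedited", "latency_ms": 350, "tokens": 450},
    {"expected": "refund", "action": None, "latency_ms": 900, "tokens": 610},
    {"expected": "refund", "action": "refund", "latency_ms": 330, "tokens": 400},
]


def summarize(records: list[dict]) -> dict[str, float]:
    total = len(records)
    latencies = [r["latency_ms"] for r in records]

    def rate(predicate) -> float:
        return sum(1 for r in records if predicate(r)) / total

    return {
        "accuracy": rate(lambda r: r["action"] == r["expected"]),
        # Unsafe: the controller acted, and the action was wrong.
        "unsafe_rate": rate(
            lambda r: r["action"] not in (None, ABSTAIN) and r["action"] != r["expected"]
        ),
        "abstention_rate": rate(lambda r: r["action"] == ABSTAIN),
        "parse_failure_rate": rate(lambda r: r["action"] is None),
        "total_tokens": sum(r["tokens"] for r in records),
        "p50_ms": median(latencies),
        # The last of 19 cut points is the 95th percentile; it is interpolated
        # and unreliable on a sample this small.
        "p95_ms": quantiles(latencies, n=20, method="inclusive")[-1],
    }


summary = summarize(runs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Keeping unsafe actions and abstentions as separate rates matters because a controller can raise accuracy on the cases it answers simply by abstaining more often.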
All runnable experiments above operate on provided facts or deterministic state. A real delayed-parcel request requires current carrier scans and inventory. Reasoning alone cannot obtain those observations.
ReAct interleaves reasoning traces and task-specific actions so new observations can update the next decision.[6] For production systems, the important handoff is not a saved inner monologue. It is a validated action request, a controlled execution result, and a bounded next decision:

    Need: carrier exception is not present in supplied facts.
    Next action request: get_carrier_scan(order_id="A102")
    Runtime responsibility: validate authorization, execute call, log result.
    Next decision: apply policy only after observation is returned.

The next lesson implements that action boundary with typed function calls, schemas, errors, and safe execution.
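As a rough preview of the validation step only, the sketch below checks a requested action against an allowlist before anything is executed. The registry contents, the `get_inventory` tool, and the names `ActionRequest` and `validate_request` are hypothetical; only `get_carrier_scan(order_id=...)` comes from the handoff above.

```python
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> required argument names and types.
ALLOWED_TOOLS = {
    "get_carrier_scan": {"order_id": str},
    "get_inventory": {"sku": str},
}


@dataclass(frozen=True)
class ActionRequest:
    tool: str
    arguments: dict[str, object]


def validate_request(request: ActionRequest) -> list[str]:
    """Return reasons to reject the request; an empty list means it may run."""
    spec = ALLOWED_TOOLS.get(request.tool)
    if spec is None:
        return [f"unknown tool: {request.tool}"]
    problems = []
    for name, expected_type in spec.items():
        if not isinstance(request.arguments.get(name), expected_type):
            problems.append(f"argument {name} must be {expected_type.__name__}")
    extra = set(request.arguments) - set(spec)
    if extra:
        problems.append(f"unexpected arguments: {sorted(extra)}")
    return problems


requests = [
    ActionRequest("get_carrier_scan", {"order_id": "A102"}),
    ActionRequest("get_carrier_scan", {"order_id": 102, "force": True}),
    ActionRequest("issue_refund", {"order_id": "A102"}),
]
for request in requests:
    print(f"{request.tool}: {validate_request(request) or 'ok'}")
```

The model only proposes a request. Whether it runs is decided by code that the model cannot rewrite.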
Create twenty delayed-parcel fixtures with policy-grounded final actions. Collect direct, structured-trace, and five-sample candidate outputs from one model endpoint. Canonicalize the final action, abstain on weak votes, and produce a comparison table with accuracy, unsafe actions, abstentions, tokens, and p95 latency. Add search only for cases where choosing an action requires sequencing multiple checks or backing out of a failed plan.
[1] Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
[2] Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners.
[3] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.
[4] Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models.
[5] Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
[6] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.