Master Chain-of-Thought prompting, Self-Consistency, and Tree-of-Thought strategies. Learn when to trade inference compute for reasoning accuracy.
When you ask an AI like ChatGPT a question, you are "prompting" it. Most people just ask for an answer directly, treating the AI like a smart search engine. But for complex problems (like math, logic puzzles, or coding) this often leads to wrong answers.
Advanced prompting changes the game by asking the model to think before it speaks. By using techniques like Chain-of-Thought (CoT), we can force the AI to show its work, dramatically improving its ability to solve hard problems without needing to retrain the model.
Standard prompting asks a model for an answer directly. This often fails on complex tasks because the model attempts to map Input → Output in a single step (a single forward pass of the neural network), without allocating "compute time" to the intermediate steps.
Chain-of-Thought (CoT) changes this by prompting the model to generate intermediate reasoning steps before the final answer.
💡 Key insight: Generating tokens is thinking. By forcing the model to output reasoning steps, you effectively give it more computational cycles (and working memory) to solve the problem.
Here is an example of standard prompting, where the user asks the model directly for an answer. This approach often results in the model failing to properly compute intermediate steps:
Standard prompting:
```text
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: 11
```
By contrast, Chain-of-Thought prompting forces the model to articulate its reasoning before arriving at the final conclusion. This step-by-step approach significantly improves accuracy:
Chain-of-Thought prompting:
```text
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans of 3 balls each, so he bought 2 * 3 = 6 balls.
The total number of balls is 5 + 6 = 11.
The answer is 11.
```
The most robust way to invoke CoT is by providing few-shot exemplars that demonstrate the reasoning pattern:
```text
Q: [Example Problem 1]
A: Let's think step by step. [Reasoning steps...] Therefore, the answer is [X].

Q: [Example Problem 2]
A: Let's think step by step. [Reasoning steps...] Therefore, the answer is [Y].

Q: [Actual Question]
A:
```
⚠️ Common mistake: applying CoT to small models. CoT generally yields significant gains only for models larger than ~100B parameters. Smaller models (like early 7B variants) often produce incoherent reasoning chains that hallucinate logic[2].
Surprisingly, you don't always need examples. Appending a simple trigger phrase like "Let's think step by step" can activate reasoning capabilities in zero-shot settings[3].
```text
Q: A juggler has 16 balls. Half are golf balls, and half of the golf
balls are blue. How many blue golf balls are there?

A: Let's think step by step.
```
This works because the pre-training data contains many examples of step-by-step solutions (textbooks, tutorials) that are conditioned on similar explanatory phrases.
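As a sketch, zero-shot CoT is often run in two stages: one call to elicit the reasoning chain, and a second call to extract a clean final answer (the approach used by Kojima et al.[3]). Here `model.generate` is a stand-in for whatever client library you actually use:

```python
def zero_shot_cot(model, question: str, trigger: str = "Let's think step by step.") -> str:
    """Two-stage zero-shot CoT: first elicit reasoning, then extract the answer."""
    # Stage 1: append the trigger phrase to elicit a reasoning chain.
    reasoning_prompt = f"Q: {question}\nA: {trigger}"
    reasoning = model.generate(reasoning_prompt)

    # Stage 2: feed the reasoning back and ask for just the final answer.
    extract_prompt = f"{reasoning_prompt}{reasoning}\nTherefore, the answer is"
    return model.generate(extract_prompt).strip()
```

The second call is what makes the output parseable: the model completes the sentence "Therefore, the answer is", so the answer lands in a predictable position.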
| Trigger Phrase | Use Case |
|---|---|
| "Let's think step by step" | General reasoning tasks |
| "Let's work this out in a step-by-step way to be sure we have the right answer" | Math & Logic |
| "Let's break this down" | Multi-part complex queries |
| "Are you sure? Walk me through it" | Error correction |
While zero-shot CoT is incredibly easy to implement, it provides less control over the format of the output. The model might reason correctly but bury the final answer deep in a paragraph, making it hard to parse programmatically. For production systems where you need strict adherence to a schema (like JSON output), few-shot CoT is generally preferred because the examples serve as both a reasoning guide and a formatting template.
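A few-shot CoT prompt can be assembled programmatically from (question, reasoning, answer) exemplars, so that each example doubles as both a reasoning guide and a formatting template. A minimal sketch (the exemplar structure and phrasing are illustrative, not a fixed standard):

```python
from typing import List, Tuple

def build_few_shot_cot_prompt(
    exemplars: List[Tuple[str, str, str]],
    question: str,
) -> str:
    """Assemble a few-shot CoT prompt from (question, reasoning, answer) exemplars."""
    parts = []
    for q, reasoning, answer in exemplars:
        # Each exemplar fixes both the reasoning style and the answer format.
        parts.append(
            f"Q: {q}\nA: Let's think step by step. {reasoning} "
            f"Therefore, the answer is {answer}."
        )
    # End with the actual question, leaving the reasoning for the model.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)
```

Because every exemplar ends with "Therefore, the answer is [X].", the model's completion tends to follow the same pattern, which is what makes the output easy to parse downstream.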
Think of this like asking five different friends to solve a riddle. They might each have a different thought process, but if three of them arrive at the exact same answer, you can be much more confident in that result than if you just asked one person.
In standard greedy decoding (temperature=0), a model always outputs the single most likely token sequence. However, reasoning is often non-deterministic; there are multiple valid paths to a correct answer.
Self-Consistency[4] exploits this by sampling multiple diverse reasoning paths and taking a majority vote.
The following Python code demonstrates how to implement the Self-Consistency algorithm. It takes a language model, a prompt, and temperature settings as inputs. It generates multiple independent reasoning paths, extracts the final answer from each, and returns the most robust conclusion through majority voting.
```python
from collections import Counter
import re

def extract_final_answer(text: str) -> str:
    """Parses the text to find the final answer after 'The answer is'."""
    # Simple regex to catch common patterns. In production, this needs
    # to be robust to variations.
    match = re.search(r"The answer is (.*?)(?:\.|$)", text, re.IGNORECASE)
    return match.group(1).strip() if match else "UNKNOWN"

def self_consistency(
    model,
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7
) -> str:
    """
    Implements Self-Consistency by sampling multiple reasoning paths
    and taking a majority vote.
    """
    answers = []

    for _ in range(n_samples):
        # Generate with high temperature to encourage diverse reasoning paths.
        # This forces the model to explore different latent thought trajectories.
        response = model.generate(prompt, temperature=temperature, max_tokens=512)

        # Extract the structured final answer from the reasoning chain
        answer = extract_final_answer(response)
        answers.append(answer)

    # Majority vote: the answer that appears most often is likely correct
    # even if individual reasoning chains differ. This marginalizes out
    # stochastic errors in the reasoning process.
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common
```
Imagine playing Chess. You don't just pick the first move that looks good; you simulate several possible moves, think about the opponent's reply to each, and "backtrack" if a path looks bad.
CoT treats reasoning as a linear sequence. If the model makes a mistake early in the chain, the error propagates. Tree-of-Thought (ToT)[5] generalizes this by exploring a branching tree of possibilities, allowing the model to look ahead and backtrack, just like in Chess.
Here is a conceptual implementation of Tree-of-Thought using Breadth-First Search (BFS). This function takes a problem description and search parameters (breadth and depth) as inputs. At each step, it evaluates candidate thoughts to prune unpromising branches, ultimately returning the most likely solution path.
```python
from typing import List, Tuple, Optional

# Abstract interfaces for the domain-specific logic
def generate_thought(model, problem: str, state: str = "") -> str:
    """Generates a single reasoning step (thought)."""
    pass

def evaluate_thought_validity(model, problem: str, thought: str) -> float:
    """Returns a score (0.0-1.0) for the thought's promise."""
    pass

def is_solved(thought: str) -> bool:
    """Checks if the thought represents a complete solution."""
    pass

def tree_of_thought(model, problem: str, breadth: int = 3, depth: int = 3) -> Optional[str]:
    """
    Implements a Breadth-First Search (BFS) Tree-of-Thought solver.

    Args:
        breadth (b): How many branches to explore at each step.
        depth (d): Maximum depth of the reasoning tree.
    """
    # Start with the initial problem state
    current_thoughts = [problem]

    for step in range(depth):
        scored_thoughts: List[Tuple[str, float]] = []

        # 1. Expansion step:
        # generate 'breadth' new thoughts from EACH current thought.
        next_candidates = []
        for thought in current_thoughts:
            if is_solved(thought):
                return thought

            # Generate multiple potential next steps
            for _ in range(breadth):
                new_thought = generate_thought(model, problem, state=thought)
                next_candidates.append(new_thought)

        # 2. Evaluation step: score all new candidates.
        for candidate in next_candidates:
            # "System 2" check: is this partial solution promising?
            score = evaluate_thought_validity(model, problem, candidate)
            scored_thoughts.append((candidate, score))

        # 3. Selection step (beam-search logic):
        # prune the tree, keeping only the top-k most promising thoughts
        # across all branches. This prevents exponential explosion.
        top_thoughts = sorted(scored_thoughts, key=lambda x: x[1], reverse=True)[:breadth]
        current_thoughts = [t[0] for t in top_thoughts]

        if not current_thoughts:
            break

    # Return best found path if not fully solved
    return current_thoughts[0] if current_thoughts else None
```
💡 Key insight: Standard CoT is linear (a single reasoning branch), making it cheap and fast. ToT explores a tree with up to `b^d` nodes, where `b` is the branching factor and `d` is the depth. This can be significantly more expensive in terms of token usage and latency, so reserve it for problems that require backtracking.
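To make the cost difference concrete, here is a back-of-the-envelope sketch. An exhaustive tree generates `b + b² + ... + b^d` thoughts, while the BFS variant with beam pruning (as in the solver above, where only `breadth` thoughts survive each level) caps the per-level cost:

```python
def full_tree_nodes(b: int, d: int) -> int:
    """Total thoughts generated by an exhaustive ToT: b + b^2 + ... + b^d."""
    return sum(b ** level for level in range(1, d + 1))

def beam_tot_generations(b: int, d: int) -> int:
    """
    Generations with beam pruning: the root expands b ways, then each of
    the b surviving thoughts expands b ways per remaining level.
    """
    return b + (d - 1) * b * b

# For comparison, plain CoT generates exactly 1 chain regardless of depth.
```

With `b = 3, d = 3`, the exhaustive tree costs 39 thought-generations and the pruned beam costs 21, versus a single chain for CoT: the pruning keeps growth linear in depth rather than exponential.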
Instead of parsing free-form text with regexes, prompt the model to output valid JSON. This is crucial for pipeline integration.
The prompt below illustrates how to enforce a JSON schema in the output. By providing clear instructions and a schema definition as input, the model knows exactly what fields to extract and what data types to return as its output.
```text
Extract the entities from the text. Return ONLY a valid JSON object.
Schema:
{
  "people": [{"name": string, "role": string}],
  "dates": [string]
}

Text: "On July 20, 1969, Neil Armstrong walked on the moon."
```
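On the consuming side, the reply should be validated rather than trusted. A minimal sketch of a parser (the fence-stripping handles the common case where a model wraps its JSON in a markdown code fence despite the instructions):

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Parse a model reply that should be a JSON object, tolerating markdown fences."""
    text = reply.strip()
    # Strip a markdown code fence if the model added one despite instructions.
    if text.startswith("```"):
        text = text.split("\n", 1)[1]      # drop the opening fence line
        text = text.rsplit("```", 1)[0]    # drop the closing fence
    data = json.loads(text)                # raises json.JSONDecodeError on bad output
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data
```

In production you would typically go further and validate the parsed object against the schema (e.g. with a pydantic model), retrying the prompt on failure.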
Establishing a persona constrains the solution space and sets tone.
This example shows how to set up a system prompt that defines a specific engineering persona. It provides instructions to the model to focus strictly on certain types of code issues, ensuring the output matches the required review criteria.
```python
SYSTEM_PROMPT = """
You are a Senior Backend Engineer reviewing code.
Focus ONLY on:
1. Security vulnerabilities (SQLi, XSS)
2. Performance bottlenecks (N+1 queries)
3. Race conditions

Do NOT comment on style or formatting.
"""
```
ReAct[6] extends Chain-of-Thought by allowing the model to perform actions (like calling tools) based on its reasoning. Instead of just Thought → Conclusion, the cycle becomes Thought → Action → Observation → Thought.
This is the foundational architecture for LLM Agents:
For example, the model might reason that it needs fresh data and emit an action like `search_tool("Bitcoin price")`; the tool's result is fed back as an observation, and reasoning continues from there. By externalizing both reasoning (CoT) and knowledge (tools), ReAct overcomes the limitations of the model's static weights.
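The Thought → Action → Observation cycle can be sketched as a simple loop. This is a minimal illustration, not a production agent: the `Action: tool("arg")` line format, the `Final Answer:` marker, and the `search_tool` name are all assumptions chosen for the example:

```python
import re

def react_loop(model, tools: dict, question: str, max_steps: int = 5) -> str:
    """Minimal ReAct loop: alternate Thought -> Action -> Observation until a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model emits a Thought plus either an Action or a Final Answer.
        step = model.generate(transcript)
        transcript += step + "\n"

        final = re.search(r"Final Answer:\s*(.*)", step)
        if final:
            return final.group(1).strip()

        # Expecting a line like: Action: search_tool("Bitcoin price")
        action = re.search(r'Action:\s*(\w+)\("(.*)"\)', step)
        if action:
            tool_name, arg = action.groups()
            # Call the tool and feed the result back as an Observation.
            observation = tools[tool_name](arg)
            transcript += f"Observation: {observation}\n"
    return "No answer found"
```

The key design point is that the observation is appended to the transcript before the next generation, so the model's subsequent reasoning is grounded in live tool output rather than its static weights.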
Engineers often jump to fine-tuning too early. Prompting should be the first line of defense. Fine-tuning excels at learning new syntax or style, while prompting excels at using existing general knowledge for reasoning.
| Feature | Advanced Prompting (CoT/ToT) | Fine-Tuning |
|---|---|---|
| Data Requirement | Zero or Few-shot (1-5 examples) | High (100s - 1000s of examples) |
| Iteration Speed | Seconds (edit text) | Hours/Days (retrain model) |
| Generalization | High (retains general knowledge) | Low (risk of catastrophic forgetting) |
| Cost | High inference cost (more tokens) | High training cost, Low inference cost |
| Specific Syntax | Hard to enforce perfectly | Excellent for learning new formats |
⚠️ Warning: Advanced prompting is not a magic fix for all model limitations.
"CoT always improves performance." For simple tasks (e.g., sentiment analysis, factual lookups), CoT adds latency and cost without improving accuracy. In some cases, it can even degrade performance by introducing "reasoning hallucinations" where the model overthinks a simple prompt.
"Small models benefit from CoT." Models under ~10B parameters generally struggle with CoT. They often generate the "reasoning" style but the content is incoherent or logically flawed. CoT is an emergent property of scale.
"Faithful Reasoning." Just because a model outputs a correct chain of thought doesn't mean that's how it solved the problem. Conversely, models can hallucinate a plausible-sounding reasoning chain to justify an incorrect answer ("faithfulness" problem).
[1] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei, J., et al. · 2022 · NeurIPS
[2] Emergent Abilities of Large Language Models. Wei, J., et al. · 2022 · TMLR
[3] Large Language Models are Zero-Shot Reasoners. Kojima, T., et al. · 2022 · NeurIPS
[4] Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang, X., et al. · 2023 · ICLR
[5] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Yao, S., et al. · 2023 · NeurIPS
[6] ReAct: Synergizing Reasoning and Acting in Language Models. Yao, S., et al. · 2023 · ICLR