✍️ Medium · Prompt Engineering

CoT, ToT & Self-Consistency Prompting

Master Chain-of-Thought prompting, Self-Consistency, and Tree-of-Thought strategies. Learn when to trade inference compute for reasoning accuracy.

25 min read · Google, OpenAI, Anthropic +2 · 10 key concepts

When you ask an AI like ChatGPT a question, you are "prompting" it. Most people just ask for an answer directly, treating the AI like a smart search engine. But for complex problems (like math, logic puzzles, or coding) this often leads to wrong answers.

Advanced prompting changes the game by asking the model to think before it speaks. By using techniques like Chain-of-Thought (CoT), we can force the AI to show its work, dramatically improving its ability to solve hard problems without needing to retrain the model.

Chain-of-Thought (CoT) Prompting

The core idea

Standard prompting asks a model for an answer directly. This often fails on complex tasks because the model attempts to map Input → Output in a single step (a single forward pass of the neural network), without allocating "compute time" to the intermediate steps.

Chain-of-Thought (CoT) changes this by prompting the model to generate intermediate reasoning steps before the final answer.

Chain-of-thought prompting: a math word problem where the model reasons through intermediate steps before arriving at the final answer.

💡 Key insight: Generating tokens is thinking. By forcing the model to output reasoning steps, you effectively give it more computational cycles (and working memory) to solve the problem.

Comparison: standard vs. CoT

Here is an example of standard prompting, where the user asks the model directly for an answer. This approach often results in the model failing to properly compute intermediate steps:

Standard prompting:

text
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
   How many tennis balls does he have now?
A: 11

By contrast, Chain-of-Thought prompting forces the model to articulate its reasoning before arriving at the final conclusion. This step-by-step approach significantly improves accuracy:

Chain-of-Thought prompting:

text
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
   How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans of 3 balls each,
   so he bought 2 * 3 = 6 balls.
   The total number of balls is 5 + 6 = 11.
   The answer is 11.

Architectural view

Understanding the flow of information helps visualize why CoT is effective.


Why it works

  1. Decomposition: It forces the model to break complex problems into manageable sub-steps[1].
  2. Working Memory: The generated text acts as an external scratchpad. It stores previously generated results in the Key-Value (KV) cache, allowing the model to attend to this context for subsequent tokens.
  3. Error Localization: Wrong reasoning steps are visible and debuggable.
  4. Compute Scaling: It allows the model to spend more test-time compute on harder problems.

Few-shot CoT

The most robust way to invoke CoT is by providing few-shot exemplars that demonstrate the reasoning pattern:

Side-by-side comparison of zero-shot vs few-shot prompting. Zero-shot provides only a task description; few-shot includes example input-output pairs before the query.
text
Q: [Example Problem 1]
A: Let's think step by step. [Reasoning steps...] Therefore, the answer is [X].

Q: [Example Problem 2]
A: Let's think step by step. [Reasoning steps...] Therefore, the answer is [Y].

Q: [Actual Question]
A:
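The template above can be assembled programmatically. A minimal sketch, assuming exemplars are supplied as (question, worked reasoning) pairs — the function name and format are illustrative, not from any library:

```python
from typing import List, Tuple

def build_few_shot_cot_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    """Assemble a few-shot CoT prompt from (question, worked reasoning) pairs."""
    parts = []
    for q, reasoning in exemplars:
        parts.append(f"Q: {q}\nA: Let's think step by step. {reasoning}")
    # The trailing "A:" cues the model to continue with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Because the exemplars end with "Therefore, the answer is [X].", the model tends to copy that closing format, which makes the final answer easy to extract later.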

⚠️ Common mistake: CoT generally only yields significant gains for models larger than ~100B parameters. Smaller models (like early 7B variants) often produce incoherent reasoning chains that hallucinate logic[2].

Zero-Shot CoT

The simplest technique

Surprisingly, you don't always need examples. Appending a simple trigger phrase like "Let's think step by step" can activate reasoning capabilities in zero-shot settings[3].

text
Q: A juggler has 16 balls. Half are golf balls, and half of the golf
   balls are blue. How many blue golf balls are there?

A: Let's think step by step.

This works because the pre-training data contains many examples of step-by-step solutions (textbooks, tutorials) that are conditioned on similar explanatory phrases.

Trigger phrase effectiveness

Trigger Phrase | Use Case
"Let's think step by step" | General reasoning tasks
"Let's work this out in a step-by-step way to be sure we have the right answer" | Math & logic
"Let's break this down" | Multi-part complex queries
"Are you sure? Walk me through it" | Error correction
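These trigger phrases can be wired into a small helper that appends the right one to a raw question. A hypothetical sketch — the TRIGGERS mapping and task labels are illustrative, not a standard API:

```python
# Hypothetical mapping from task type to trigger phrase.
TRIGGERS = {
    "general": "Let's think step by step.",
    "math": "Let's work this out in a step-by-step way to be sure we have the right answer.",
    "multi_part": "Let's break this down.",
    "verify": "Are you sure? Walk me through it.",
}

def zero_shot_cot(question: str, task: str = "general") -> str:
    """Wrap a raw question with a zero-shot CoT trigger for the given task type."""
    return f"Q: {question}\n\nA: {TRIGGERS[task]}"
```

The "A:" prefix before the trigger matters: it places the phrase at the start of the model's turn, so generation continues the reasoning rather than restating the question.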

While zero-shot CoT is incredibly easy to implement, it provides less control over the format of the output. The model might reason correctly but bury the final answer deep in a paragraph, making it hard to parse programmatically. For production systems where you need strict adherence to a schema (like JSON output), few-shot CoT is generally preferred because the examples serve as both a reasoning guide and a formatting template.

Self-Consistency

The algorithm

Think of this like asking five different friends to solve a riddle. They might each have a different thought process, but if three of them arrive at the exact same answer, you can be much more confident in that result than if you just asked one person.

In standard greedy decoding (temperature=0), a model always outputs the single most likely token sequence. However, reasoning is often non-deterministic; there are multiple valid paths to a correct answer.

Self-Consistency[4] exploits this by sampling multiple diverse reasoning paths and taking a majority vote.

Implementation logic

  1. Sample: Generate N different responses using a non-zero temperature (e.g., T = 0.7).
  2. Extract: Parse the final answer from each response.
  3. Vote: Select the answer that appears most frequently (majority voting).

The following Python code demonstrates how to implement the Self-Consistency algorithm. It takes a language model, a prompt, and temperature settings as inputs. It generates multiple independent reasoning paths, extracts the final answer from each, and returns the most robust conclusion through majority voting.

python
from collections import Counter
import re

def extract_final_answer(text: str) -> str:
    """Parses the text to find the final answer after 'The answer is'."""
    # Simple regex to catch common patterns. In production, this needs
    # to be robust to variations.
    match = re.search(r"The answer is (.*?)(?:\.|$)", text, re.IGNORECASE)
    return match.group(1).strip() if match else "UNKNOWN"

def self_consistency(
    model,
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> str:
    """
    Implements Self-Consistency by sampling multiple reasoning paths
    and taking a majority vote.
    """
    answers = []

    for _ in range(n_samples):
        # Generate with high temperature to encourage diverse reasoning paths.
        # This forces the model to explore different latent thought trajectories.
        response = model.generate(prompt, temperature=temperature, max_tokens=512)

        # Extract the structured final answer from the reasoning chain
        answer = extract_final_answer(response)
        answers.append(answer)

    # Majority vote: the answer that appears most often is likely correct
    # even if individual reasoning chains differ. This marginalizes out
    # stochastic errors in the reasoning process.
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common
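The voting step can be seen in isolation with hand-written final answers standing in for sampled model outputs — here four chains converge on 11 and one makes an arithmetic slip:

```python
from collections import Counter

# Five hand-written final answers standing in for sampled model outputs:
# four chains converge on 11, one makes an arithmetic slip.
sampled_answers = ["11", "10", "11", "11", "11"]

winner, votes = Counter(sampled_answers).most_common(1)[0]
print(winner, votes)  # → 11 4
```

The single wrong chain is outvoted; with real samples the same logic marginalizes out uncorrelated reasoning errors.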

Why it works

  • Diversity: Different reasoning paths might make different mistakes, but they are unlikely to make the exact same mistake. Correct reasoning paths tend to converge on the same answer.
  • Marginalization: It effectively marginalizes over the reasoning process to find the most probable output.

Tree-of-Thought (ToT)

Beyond linear chains

Imagine playing Chess. You don't just pick the first move that looks good; you simulate several possible moves, think about the opponent's reply to each, and "backtrack" if a path looks bad.

CoT treats reasoning as a linear sequence. If the model makes a mistake early in the chain, the error propagates. Tree-of-Thought (ToT)[5] generalizes this by exploring a branching tree of possibilities, allowing the model to look ahead and backtrack, just like in Chess.


Algorithm

  1. Decomposition: Break the problem into intermediate steps (thoughts).
  2. Generation: Generate multiple candidate thoughts for the next step.
  3. Evaluation: Use the LLM to score each candidate (e.g., "Is this step possible? Rate 1-10").
  4. Search: Use Breadth-First Search (BFS) or Depth-First Search (DFS) to traverse the tree.

Here is a conceptual implementation of Tree-of-Thought using Breadth-First Search (BFS). This function takes a problem description and search parameters (breadth and depth) as inputs. At each step, it evaluates candidate thoughts to prune unpromising branches, ultimately returning the most likely solution path.

python
from typing import List, Tuple, Optional

# Abstract interfaces for the domain-specific logic
def generate_thought(model, problem: str, state: str = "") -> str:
    """Generates a single reasoning step (thought)."""
    pass

def evaluate_thought_validity(model, problem: str, thought: str) -> float:
    """Returns a score (0.0-1.0) for the thought's promise."""
    pass

def is_solved(thought: str) -> bool:
    """Checks if the thought represents a complete solution."""
    pass

def tree_of_thought(model, problem: str, breadth: int = 3, depth: int = 3) -> Optional[str]:
    """
    Implements a Breadth-First Search (BFS) Tree-of-Thought solver.

    Args:
        breadth (b): How many branches to explore at each step.
        depth (d): Maximum depth of the reasoning tree.
    """
    # Start with the initial problem state
    current_thoughts = [problem]

    for step in range(depth):
        scored_thoughts: List[Tuple[str, float]] = []

        # 1. Expansion step
        # Generate 'breadth' new thoughts from EACH current thought
        next_candidates = []
        for thought in current_thoughts:
            if is_solved(thought):
                return thought

            # Generate multiple potential next steps
            for _ in range(breadth):
                new_thought = generate_thought(model, problem, state=thought)
                next_candidates.append(new_thought)

        # 2. Evaluation step
        # Score all new candidates
        for candidate in next_candidates:
            # "System 2" check: is this partial solution promising?
            score = evaluate_thought_validity(model, problem, candidate)
            scored_thoughts.append((candidate, score))

        # 3. Selection step (beam search logic)
        # Prune the tree: keep only the top-k most promising thoughts across
        # all branches. This prevents exponential explosion.
        top_thoughts = sorted(scored_thoughts, key=lambda x: x[1], reverse=True)[:breadth]
        current_thoughts = [t[0] for t in top_thoughts]

        if not current_thoughts:
            break

    # Return best found path if not fully solved
    return current_thoughts[0] if current_thoughts else None
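The abstract evaluate_thought_validity stub could be filled in by asking the model itself for a numeric rating, as step 3 of the algorithm suggests. A sketch under that assumption — ConstModel is a toy stand-in for a real LLM client, and the 1-10 scale and prompt wording are illustrative:

```python
class ConstModel:
    """Hypothetical stand-in for an LLM client that always returns one reply."""
    def __init__(self, reply: str):
        self.reply = reply

    def generate(self, prompt: str, temperature: float = 0.0, max_tokens: int = 4) -> str:
        return self.reply

def evaluate_thought_validity(model, problem: str, thought: str) -> float:
    """One way to fill in the scoring stub: ask the model for a 1-10 rating."""
    prompt = (
        f"Problem: {problem}\n"
        f"Partial solution: {thought}\n"
        "On a scale of 1 to 10, how promising is this partial solution? "
        "Reply with a single integer."
    )
    reply = model.generate(prompt, temperature=0.0, max_tokens=4)
    try:
        return int(reply.strip()) / 10.0
    except ValueError:
        return 0.0  # An unparseable rating counts as unpromising

score = evaluate_thought_validity(
    ConstModel("7"), "Game of 24 with 4 9 10 13", "13 - 9 = 4 (left: 4 4 10)"
)
```

Using temperature 0 for the evaluator is a common choice: generation wants diversity, but scoring wants consistency.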

When to use ToT

  • Logic Puzzles: Games like "Game of 24", Crosswords, or Sudoku.
  • Creative Writing: Planning a plot where future events constrain past choices.
  • Complex Coding: Architectural decisions where one choice invalidates another.

💡 Key insight: Standard CoT is linear (a single chain), making it cheap and fast. ToT explores a tree structure with complexity O(b^d), where b is the branching factor and d is the depth. This can be significantly more expensive in terms of token usage and latency, so reserve it for problems that require backtracking.
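The cost gap is easy to check with back-of-envelope arithmetic, assuming one generation per thought and a full tree with no pruning:

```python
def cot_generations(depth: int) -> int:
    # A linear chain produces one thought per step.
    return depth

def tot_generations(b: int, d: int) -> int:
    # A full tree without pruning produces b + b^2 + ... + b^d thoughts.
    return sum(b ** i for i in range(1, d + 1))

print(cot_generations(3))    # → 3
print(tot_generations(3, 3)) # → 39
```

Even at modest settings (b=3, d=3), ToT generates 39 thoughts where CoT generates 3 — beam-style pruning, as in the code above, is what keeps real implementations affordable.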

Structural prompting patterns

JSON mode

Instead of parsing free-form text with regexes, prompt the model to output valid JSON. This is crucial for pipeline integration.

The prompt below illustrates how to enforce a JSON schema in the output. By providing clear instructions and a schema definition as input, the model knows exactly what fields to extract and what data types to return as its output.

text
Extract the entities from the text. Return ONLY a valid JSON object.
Schema:
{
  "people": [{"name": string, "role": string}],
  "dates": [string]
}

Text: "On July 20, 1969, Neil Armstrong walked on the moon."
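On the consuming side, the reply still needs validation, since models sometimes wrap JSON in a Markdown code fence despite "ONLY" instructions. A defensive parsing sketch — parse_json_reply is illustrative, not a library function:

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_json_reply(raw: str) -> dict:
    """Parse a model's JSON reply, tolerating an optional Markdown code fence."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip("`")
        # Drop an optional language tag such as "json" on the first line
        if "\n" in cleaned:
            cleaned = cleaned.split("\n", 1)[1]
    return json.loads(cleaned)

reply = '{"people": [{"name": "Neil Armstrong", "role": "astronaut"}], "dates": ["July 20, 1969"]}'
data = parse_json_reply(reply)
```

json.loads raises on malformed output, which gives the pipeline a clean signal to retry the request rather than silently passing garbage downstream.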

System prompts & personas

Establishing a persona constrains the solution space and sets tone.

This example shows how to set up a system prompt that defines a specific engineering persona. It provides instructions to the model to focus strictly on certain types of code issues, ensuring the output matches the required review criteria.

python
SYSTEM_PROMPT = """
You are a Senior Backend Engineer reviewing code.
Focus ONLY on:
1. Security vulnerabilities (SQLi, XSS)
2. Performance bottlenecks (N+1 queries)
3. Race conditions

Do NOT comment on style or formatting.
"""
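A persona prompt like this is typically sent as the system message in a chat-style request. A minimal sketch of assembling that request — the message-list shape follows the common chat-completion convention, though exact payloads vary by provider:

```python
def build_review_request(system_prompt: str, code: str) -> list:
    """Assemble a chat-style message list; the exact payload shape varies by provider."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Review this code:\n\n{code}"},
    ]

messages = build_review_request(
    "You are a Senior Backend Engineer reviewing code.",
    'def get_user(user_id):\n    return db.query(f"SELECT * FROM users WHERE id={user_id}")',
)
```

Keeping the persona in the system slot rather than the user turn matters: most chat models weight system instructions more heavily and are less likely to let later user text override them.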

ReAct (reasoning + acting)

ReAct[6] extends Chain-of-Thought by allowing the model to perform actions (like calling tools) based on its reasoning. Instead of just Thought → Conclusion, the cycle becomes Thought → Action → Observation → Thought.

This is the foundational architecture for LLM Agents:

  1. Thought: "I need to find the current price of Bitcoin."
  2. Action: search_tool("Bitcoin price")
  3. Observation: "$65,000"
  4. Conclusion: "Bitcoin is currently $65,000."

By externalizing both reasoning (CoT) and knowledge (Tools), ReAct overcomes the limitations of the model's static weights.
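A minimal ReAct loop has to parse the model's Thought and Action lines before it can dispatch a tool call. An illustrative sketch — the Thought:/Action: format follows the example above, and parse_react_turn is not a library function:

```python
import re

def parse_react_turn(model_output: str):
    """Extract the Thought and the Action (tool name, argument) from one ReAct turn."""
    thought = re.search(r"Thought:\s*(.*)", model_output)
    action = re.search(r"Action:\s*(\w+)\((.*?)\)", model_output)
    return (
        thought.group(1).strip() if thought else None,
        (action.group(1), action.group(2).strip('"')) if action else None,
    )

turn = 'Thought: I need to find the current price of Bitcoin.\nAction: search_tool("Bitcoin price")'
thought, action = parse_react_turn(turn)
```

In a full agent, the tool name would be looked up in a registry, executed, and its result appended back to the prompt as an Observation before the next generation step.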

When prompting beats fine-tuning

Engineers often jump to fine-tuning too early. Prompting should be the first line of defense. Fine-tuning excels at learning new syntax or style, while prompting excels at using existing general knowledge for reasoning.

Feature | Advanced Prompting (CoT/ToT) | Fine-Tuning
Data Requirement | Zero or few-shot (1-5 examples) | High (100s-1000s of examples)
Iteration Speed | Seconds (edit text) | Hours/days (retrain model)
Generalization | High (retains general knowledge) | Low (risk of catastrophic forgetting)
Cost | High inference cost (more tokens) | High training cost, low inference cost
Specific Syntax | Hard to enforce perfectly | Excellent for learning new formats

Common misconceptions

⚠️ Warning: Advanced prompting is not a magic fix for all model limitations.

  1. "CoT always improves performance." For simple tasks (e.g., sentiment analysis, factual lookups), CoT adds latency and cost without improving accuracy. In some cases, it can even degrade performance by introducing "reasoning hallucinations" where the model overthinks a simple prompt.
  2. "Small models benefit from CoT." Models under ~10B parameters generally struggle with CoT. They often generate the "reasoning" style but the content is incoherent or logically flawed. CoT is an emergent property of scale.
  3. "Faithful Reasoning." Just because a model outputs a correct chain of thought doesn't mean that's how it solved the problem. Conversely, models can hallucinate a plausible-sounding reasoning chain to justify an incorrect answer (the "faithfulness" problem).

Key takeaways

  • Thinking takes tokens. You cannot expect a model to solve a hard problem in one token. CoT buys the model "time to think."
  • Prompting is programming. Advanced techniques like ToT and Self-Consistency treat the LLM as a module in a larger software system, not just a text box.
  • Self-Consistency trades compute for accuracy. By paying 10x the inference cost (sampling 10 times), you can significantly boost reliability on reasoning tasks.
  • Use the right tool. Zero-shot CoT for general queries, few-shot CoT for specific formats, ToT for puzzles/planning.
Evaluation Rubric
  1. Demonstrates CoT with concrete reasoning examples
  2. Compares zero-shot ("let's think step by step") vs few-shot CoT
  3. Explains Self-Consistency's sampling and majority vote
  4. Shows Tree-of-Thought's search-based approach
  5. Identifies tasks where CoT helps most (math, logic, planning)
Common Pitfalls
  • Applying CoT to simple factual lookups where it adds latency without value
  • Using greedy decoding (temperature=0) with Self-Consistency, defeating its purpose
  • Assuming CoT works on small (<10B) models which often hallucinate logic
  • Using Tree-of-Thought for linear problems that don't require backtracking

Key Concepts Tested
  • Chain-of-Thought (CoT) prompting
  • Zero-shot CoT vs few-shot CoT
  • Self-Consistency: sample and vote
  • Tree-of-Thought: branching exploration
  • Inference-time compute scaling
  • Reasoning extraction
  • Majority voting
  • Breadth-First Search (BFS)
  • System 2 thinking
  • Prompt chaining
References

  1. Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
  2. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.
  3. Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners.
  4. Wang, X., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models.
  5. Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
  6. Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.
