Understand Sutton's Bitter Lesson, why general methods that use computation consistently outperform human-engineered heuristics, and how this principle shapes every modern AI architecture decision.
In the history of artificial intelligence, researchers have repeatedly tried to build human knowledge directly into computers: grammar rules for language, strategic principles for chess. It works well at first. But eventually, a simpler approach that just uses massive computing power to learn from data always wins. This phenomenon is known as "The Bitter Lesson."
In March 2019, Rich Sutton, one of the fathers of reinforcement learning (the branch of AI where agents learn by trial and error), published a short essay that would become one of the most cited and debated pieces in AI history [1]. "The Bitter Lesson" makes a simple but profound claim: the biggest lesson that can be read from 70 years of AI research is that general methods that use computation are ultimately the most effective, and by a large margin.
Sutton's argument can be distilled into three claims:
💡 Key insight: Analogy: Power tools vs. hand tools. Imagine two woodworkers: one spends years mastering hand-carving techniques (a "heuristic" or rule-of-thumb approach), creating beautiful but slow work. The other buys a CNC machine and learns to operate it. At first, the hand-carver produces better furniture. But the CNC operator keeps buying faster machines, and within a few years they're producing better furniture, orders of magnitude faster. The Bitter Lesson says AI follows this same pattern: cleverly hand-crafted solutions lose to general methods that scale with compute.
1. AI researchers have repeatedly tried to build human knowledge into their agents, through hand-crafted features, domain-specific heuristics, and expert-designed representations.
2. These approaches work well initially but eventually hit a ceiling, because human-engineered knowledge is complex, brittle, and hard to scale.
3. General methods that use computation, specifically search and learning, consistently win in the long run, because they scale with Moore's Law while human engineering effort does not.
The two methods Sutton identifies as truly scalable are:

- **Search**: systematically exploring many possible solutions, as in game-tree search.
- **Learning**: extracting statistical regularities from data, as in deep neural networks.
Both convert compute into performance. Neither requires encoding human domain knowledge. And both have, historically, displaced every hand-engineered alternative.
The power of the Bitter Lesson comes from how consistently the pattern appears across every subfield of AI. Let's trace the evidence.
In the 1970s-90s, chess AI was dominated by programs with hand-crafted evaluation functions: piece-square tables, pawn structure analysis, king safety metrics, and endgame databases, all designed by chess experts. These programs were impressive but fragile. Each new improvement required painstaking human engineering.
Then came Deep Blue (1997), which used massive hardware parallelism to search ~200 million positions per second. It still used hand-tuned evaluation, but compute was doing most of the work. The final nail came with AlphaZero (2017) [2], which learned to play chess entirely from self-play using zero human knowledge beyond the rules. Starting from random play, it surpassed every hand-engineered chess program within hours of training on a TPU (Tensor Processing Unit) cluster.
The expert knowledge accumulated over decades of chess programming was rendered obsolete by a general algorithm (MCTS + neural network evaluation) powered by compute.
Go was considered the frontier of game AI due to its enormous branching factor (roughly 250 legal moves per position, versus about 35 in chess). For decades, the strongest Go programs used pattern databases hand-designed by Go experts: joseki dictionaries, life-and-death solvers, influence maps.
These programs barely played at amateur level.
AlphaGo (2016) combined deep neural networks with MCTS, initially trained on human expert games and then refined through self-play. AlphaGo Zero (2017) removed even the human games entirely: pure self-play from scratch. It became the strongest Go player in history, playing moves that human experts described as "alien" and "beautiful."
AlphaGo's move 37 in Game 2 against Lee Sedol, a shoulder hit on the fifth line that no human player would consider, was later recognized as genuinely creative. A hand-engineered system would never have discovered it.
The history of automatic speech recognition (ASR) follows the same arc: decades of hand-crafted acoustic features and hidden Markov models, then hybrid neural-HMM systems, and finally end-to-end neural models like Whisper, trained on massive weakly supervised audio.
Every hand-designed component (mel-frequency cepstral coefficients, tri-phone models, n-gram language models) was eventually replaced by a general neural network doing the same job, but better, because it could use more compute and data.
The vision trajectory is equally stark: hand-designed features such as SIFT and HOG dominated until AlexNet's 2012 ImageNet breakthrough, after which learned convolutional features took over, and by 2020 Vision Transformers were displacing even the convolution itself.
The progression from hand-designed features → learned features (CNNs) → even more general learned features (Transformers) is the Bitter Lesson in action, compressed into a decade.
NLP provides perhaps the most dramatic example: hand-built grammars and statistical pipelines with dozens of specialized components were swept aside by large pretrained language models like GPT-3.
The entire NLP community's decades of work on parse trees, semantic role labeling, discourse analysis, and grammar engineering was subsumed by "predict the next token, but with more compute."
The Bitter Lesson isn't just an observation. There's a structural reason why general methods eventually dominate:
💡 Key insight: Analogy: Language immersion vs. dictionary. Learning a language from a dictionary and grammar book (hand-engineering) gives you decent results fast, but you'll always sound wooden and miss subtle idioms. Immersion, simply listening to millions of conversations (learning from data), is painfully slow at first but eventually produces fluency no textbook can match. The more time (compute) you spend immersed, the better you get. That's why learning scales and engineering doesn't.
Human engineering effort scales linearly (at best) with researcher-hours. You can hire more linguists, but each one still has to think through problems manually. Compute, by contrast, scales exponentially. Moore's Law (and its modern equivalents in GPU throughput) doubles available compute roughly every 18-24 months.
```text
Human engineering: performance ∝ log(researcher_hours)
General methods:   performance ∝ compute^α   (where α > 0)
```
Any method that converts compute into performance will eventually surpass any method that relies on human effort, given enough time. The Bitter Lesson says this crossover has happened repeatedly, and it always will.
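These two growth curves make the crossover concrete. The toy simulation below (all constants are arbitrary and purely illustrative, not fitted to anything) pits a logarithmic engineering curve against a power law riding on exponentially growing compute, and finds the year the general method overtakes:

```python
import math

# Toy model of the crossover (all constants arbitrary and illustrative):
# engineering effort grows linearly with time, compute doubles every 2 years.
def engineered_performance(years: float) -> float:
    researcher_hours = 1000 * years          # linear hiring of experts
    return math.log(1 + researcher_hours)    # diminishing returns on effort

def general_performance(years: float, alpha: float = 0.3) -> float:
    compute = 2 ** (years / 2)               # Moore's Law-style doubling
    return 0.1 * compute ** alpha            # performance ∝ compute^alpha

# Find the first year the general method overtakes hand-engineering
crossover = next(y for y in range(1, 200)
                 if general_performance(y) > engineered_performance(y))
print(f"General method overtakes hand-engineering after ~{crossover} years")
```

Change the constants however you like; as long as one curve is logarithmic and the other is a power of exponentially growing compute, a crossover always exists.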
Real-world domains are vastly more complex than our ability to understand them. A language has millions of statistical regularities that no linguist can fully enumerate. A visual scene has billions of pixel-level relationships that no feature engineer can capture. The only way to model this complexity is to learn it from data, and learning from data requires compute.
We can demonstrate the Bitter Lesson with a simple Python simulation. The code below defines a game agent evaluating a board state. It takes a board state (with an explicit value and a hidden complexity factor) as input. We implement two approaches: a heuristic function that applies rigid rules, and a general search function that simulates outcomes. The script outputs the evaluation scores, demonstrating how search uncovers hidden complexity while heuristics hit a hard ceiling.
```python
import random

# A simplified state evaluation problem
class GameState:
    def __init__(self, value: int, complexity: int = 0):
        self.value = value
        self.complexity = complexity  # Hidden factors human rules might miss

    def simulate_step(self) -> float:
        """Simulate one step forward in the game (requires compute)."""
        # In a real game, this evolves the state based on rules
        noise = random.gauss(0, 1)
        return float(self.value + noise + (self.complexity * 0.1))

# 1. The Heuristic Approach: Human-coded rules
# Fast, cheap, but capped by the engineer's understanding
def heuristic_evaluate(state: GameState) -> float:
    score = 0.0
    # Rule 1: High value is good
    if state.value > 10:
        score += 1.0
    # Rule 2: Even numbers are preferred (arbitrary human intuition)
    if state.value % 2 == 0:
        score += 0.5
    # The heuristic ignores 'complexity' because the human didn't understand it
    return score

# 2. The General Approach: Search / Simulation
# Slow, expensive, but accuracy scales with compute budget
def search_evaluate(state: GameState, compute_budget: int) -> float:
    total_outcome = 0.0
    # Spend compute to simulate the future
    for _ in range(compute_budget):
        total_outcome += state.simulate_step()
    # The result converges to the truth as compute increases
    return total_outcome / compute_budget

# Demonstration
state = GameState(value=12, complexity=50)  # A state with hidden complexity

print(f"Heuristic Score: {heuristic_evaluate(state):.2f}")
# Output: 1.50 (fixed, fast, but inaccurate)

print(f"Search (Low Compute): {search_evaluate(state, compute_budget=10):.2f}")
# Output: noisy, e.g. ~16.50

print(f"Search (High Compute): {search_evaluate(state, compute_budget=10000):.2f}")
# Output: ~17.00 (accurate: captures the hidden complexity)
```
In this example, the heuristic_evaluate function is static. No matter how much GPU power you buy, it won't get better than its hard-coded logic. The search_evaluate function starts worse (it's noisy and slow), but as you increase compute_budget, it discovers the ground truth, including the hidden complexity term that the human engineer missed.
The Python simulation illustrates a fundamental truth about system design:
| Characteristic | Heuristic Approach | General Search Approach |
|---|---|---|
| Development Cost | High (Requires human expert time) | Low (Simple rules, relies on simulation) |
| Execution Speed | Fast (constant time, O(1)) | Slow (scales with compute budget, O(n)) |
| Accuracy Ceiling | Capped by human understanding | Approaches ground truth as compute increases |
| Adaptability | Brittle (Fails if hidden variables change) | Robust (Simulates actual outcomes) |
When compute is expensive and limited, the heuristic wins because it is fast. As compute becomes cheap and abundant, the search approach inevitably surpasses it.
Hand-engineered systems tend to be brittle: they work well on the cases the engineer anticipated but fail on unexpected inputs. Learned representations, by contrast, generalize across the distribution of training data. A hand-designed NER system might recognize "New York" and "Paris" but miss "Bengaluru" because no engineer thought to add it. A learned system that has seen enough text will handle all of them.
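The failure mode is easy to see in miniature. The sketch below uses a hypothetical hand-coded city list (the list and the sentence are made-up examples, not a real NER system):

```python
# Toy illustration of brittleness: the matcher can only ever recognize
# what its author enumerated (city list and sentence are hypothetical).
HAND_CODED_CITIES = {"New York", "Paris", "London"}

def rule_based_ner(text: str) -> list[str]:
    """Return the hand-listed cities that appear in the text."""
    return [city for city in HAND_CODED_CITIES if city in text]

text = "Flights from Paris to Bengaluru are delayed."
print(rule_based_ner(text))   # ['Paris'] -- 'Bengaluru' is invisible to the rules
```

A learned system that has seen enough text handles the unseen name for free; the rule set has to be patched by hand every time the world produces a case the engineer did not anticipate.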
The Bitter Lesson identifies learning and search as the two methods that scale arbitrarily with compute. Modern AI systems have split these into two distinct phases of computation, both of which follow the Bitter Lesson.
During training, compute is spent on learning (optimizing weights via gradient descent). Throwing more FLOPs (Floating Point Operations) at a larger dataset with a larger model predictably reduces the loss.
This phase builds the model's intuitive "System 1" capabilities. A language model learns to predict the next token with high accuracy. However, this relies entirely on the model's forward pass, meaning the computation per token is fixed. A complex reasoning problem gets the exact same amount of compute as a simple greeting.
At inference time, compute is spent on search (exploring possible outputs). This represents "System 2" thinking. Instead of immediately returning the first predicted token, the system can generate multiple possible reasoning paths, evaluate them, and select the best one.
Models like OpenAI's o1 and DeepSeek-R1 use techniques like Monte Carlo Tree Search (MCTS) or process reward models to explore solutions. As Sutton noted with AlphaZero, search scales: if you give the model 10 times more inference compute to think before answering, it will find significantly better solutions.
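The simplest version of this idea is best-of-N sampling. The sketch below uses stand-in functions for the model's sampler and the reward scorer (both are toy placeholders, not the actual o1 or R1 machinery), but it shows how more inference compute buys a better selected answer:

```python
import random

random.seed(0)  # reproducible toy run

# Stand-in for a model proposing one candidate answer / reasoning path.
# In a real system this would be an LLM sampling a chain of thought.
def sample_candidate() -> float:
    return random.gauss(0.5, 0.2)   # scalar "quality" of the sampled path

# Stand-in for a reward model / verifier scoring a candidate.
def score(candidate: float) -> float:
    return candidate

def best_of_n(n: int) -> float:
    """Spend n model calls of inference compute, keep the best-scoring answer."""
    candidates = [sample_candidate() for _ in range(n)]
    return max(candidates, key=score)

print(f"Best of 1:   {best_of_n(1):.3f}")    # one forward pass, take what you get
print(f"Best of 100: {best_of_n(100):.3f}")  # 100x compute, markedly better pick
```

The expected quality of the best of N samples rises with N: the same conversion of compute into performance that drove AlphaZero's tree search, just applied to token sequences.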
By scaling both train-time learning and inference-time search, engineers apply the Bitter Lesson at every stage of the pipeline, avoiding hand-crafted reasoning modules in favor of raw computation.
The Bitter Lesson is powerful, but it's not the whole story. Several important nuances deserve attention:
While general methods eventually win, inductive biases serve as efficient initialization. Convolutions encode translation invariance, which is genuinely true of images. Positional encodings capture the sequential nature of language. These biases don't contradict the Bitter Lesson; they accelerate learning within a fixed compute budget. The lesson predicts they'll be replaced, but doesn't say they're useless now.
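Translation equivariance, the bias that convolutions hard-code, can be checked in a few lines of pure Python (a 1-D toy convolution, not a real vision model):

```python
# Translation equivariance, the inductive bias convolutions encode:
# shifting the input shifts the feature map by the same amount.
def conv1d(signal: list[float], kernel: list[float]) -> list[float]:
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, -1.0]                  # a tiny edge detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0, 0]    # same pattern, moved right by 2

out_a = conv1d(signal, kernel)
out_b = conv1d(shifted, kernel)
# The responses match up to the same 2-position shift
assert out_a[:-2] == out_b[2:]
print(out_a)
print(out_b)
```

Because the same kernel slides over every position, the network never has to re-learn a pattern for each location; that structural assumption is genuinely true of images, which is why the bias helps at fixed compute.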
Not every problem has internet-scale data. In medical imaging, robotics, or rare language NLP, data is scarce and compute alone can't compensate. In these domains, carefully designed architectures and data augmentation (forms of human knowledge) still provide critical advantages. The Bitter Lesson implicitly assumes abundant data, which isn't always available.
Perhaps the strongest counter-argument: we may not want fully general methods for safety-critical decisions. A learned reward model might "hack" its objective in unexpected ways. A rule-based safety filter, while brittle, is interpretable and verifiable. The alignment community argues that some problems require human-designed constraints, not just more compute.
The Bitter Lesson describes long-term trends, but engineers work within fixed budgets. If you have 8 GPUs and a deadline, using domain knowledge to constrain your search space is pragmatic, not foolish. The lesson applies to research trajectories stretching over decades, not to individual project decisions.
The Bitter Lesson isn't just philosophical. It has concrete implications for how we design AI systems today:
Design architectures that benefit from more compute and data, even if they seem "wasteful" at current scales. The Transformer replaced the LSTM not because it was more elegant, but because its parallelizable attention mechanism could use GPU clusters more effectively. Choose the architecture that scales.
When you're tempted to add a hand-designed component (a custom tokenizer heuristic, a rule-based post-processor, a domain-specific feature), ask: "Would a general learned component work if I gave it enough data?" If yes, the Bitter Lesson says invest in data collection and compute rather than engineering.
The empirical scaling laws discovered by Kaplan et al. [6] and Hoffmann et al. [7] are the Bitter Lesson expressed mathematically. They demonstrate that the cross-entropy loss of a language model scales as a predictable power law with the compute budget C:

L(C) = (C_c / C)^α

where C_c is a constant and α is the scaling exponent. This equation shows that model performance improves steadily and predictably simply by increasing compute, exactly the behavior Sutton highlighted. The scaling laws tell you how much compute to throw at a problem; the Bitter Lesson tells you why this brute-force approach succeeds where elegant hand-engineering fails.
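To build intuition for the power law, the snippet below evaluates it with illustrative constants (α ≈ 0.05 is in the range Kaplan et al. report for compute scaling; the C_c value here is arbitrary, not taken from the paper):

```python
# The compute scaling law L(C) = (C_c / C)^alpha, with illustrative constants.
# alpha ~= 0.05 is in the range Kaplan et al. report; c_c here is arbitrary.
def loss(compute: float, c_c: float = 1e7, alpha: float = 0.05) -> float:
    return (c_c / compute) ** alpha

# Every 100x of compute multiplies the loss by the same factor (10^-0.1, ~0.79)
for c in [1e18, 1e20, 1e22, 1e24]:
    print(f"compute = {c:.0e} FLOPs -> loss = {loss(c):.3f}")
```

The key property is the constant multiplicative improvement per order of magnitude of compute: there is no ceiling in the formula itself, only a price.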
Modern reasoning models (o1, o3, DeepSeek-R1) apply the Bitter Lesson to inference: instead of engineering complex reasoning pipelines, they spend more compute at test time through learned search strategies. This directly parallels AlphaZero's MCTS, converting compute into intelligence through general search, not hand-designed heuristics.
If you develop a technique that works by encoding domain knowledge (a specialized attention mask, a domain-specific loss function, a hand-tuned decoding strategy), expect it to be obsoleted by a simpler, more general approach once sufficient compute is available. This doesn't mean the technique is bad. It means you should design it to be removable.
For engineers deciding between general and specialized approaches, consider this decision matrix:
| Factor | Favor General Methods | Favor Domain Heuristics |
|---|---|---|
| Data availability | Abundant (>10M examples) | Scarce (<10K examples) |
| Compute budget | Large or growing | Fixed and small |
| Time horizon | Long-term product | Short-term prototype |
| Domain complexity | High (language, vision) | Low (clear rules exist) |
| Safety requirements | Standard | Safety-critical |
| Architecture lifespan | Years | Months |
This matrix highlights that the Bitter Lesson is a statement about asymptotic behavior. It tells us where the field is heading over the long term. However, day-to-day engineering requires balancing these long-term trends against immediate constraints.
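As a toy illustration, the matrix above can be collapsed into a scoring function. The thresholds and weights below are made up for demonstration, not prescriptive engineering guidance:

```python
# Toy encoding of the decision matrix (thresholds and weights are illustrative).
def recommend_approach(examples: int, compute_growing: bool,
                       long_term: bool, safety_critical: bool) -> str:
    score = 0
    score += 1 if examples > 10_000_000 else -1   # data availability
    score += 1 if compute_growing else -1         # compute budget trajectory
    score += 1 if long_term else -1               # time horizon
    score += -2 if safety_critical else 0         # safety tips toward rules
    return "general method" if score > 0 else "domain heuristics"

print(recommend_approach(50_000_000, True, True, False))   # general method
print(recommend_approach(5_000, False, False, False))      # domain heuristics
```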
If you are building a prototype with zero training data, you have no choice but to use heuristics. You might write a complex prompt with a strict set of extraction rules. But as your product matures and you collect a million user interactions, the Bitter Lesson suggests you should transition away from those complex prompts and instead fine-tune a model on that data.
The most successful AI teams design their systems to smoothly bridge this gap. They use heuristics to bootstrap their initial data flywheel, but they build their infrastructure to eventually replace those same heuristics with learning and search once enough data and compute are available.
The Bitter Lesson is a meta-principle: across 70 years and every subfield of AI, general methods that use computation have displaced hand-engineered approaches.
Search and learning are the two scalable methods. Both convert compute into performance without requiring human domain knowledge.
The pattern is accelerating: the Transformer's dominance across language, vision, audio, and biology is the Bitter Lesson operating in real-time.
Nuance matters: inductive biases, sample efficiency, safety constraints, and fixed compute budgets all create situations where human-designed structure still provides value.
For engineers, the practical implication is clear: design systems that benefit from scale (more data, more compute, more parameters) because that's where the long-term wins are.
Know when you're fighting the lesson: if your system relies on clever tricks that only work at your current scale, start planning for the general alternative.
[1] The Bitter Lesson.
Sutton, R. S. · 2019
[2] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.
Silver, D., et al. · 2017 · arXiv preprint
[3] Whisper: Robust Speech Recognition via Large-Scale Weak Supervision.
Radford, A., et al. · 2022
[4] Language Models are Few-Shot Learners.
Brown, T., et al. · 2020 · NeurIPS 2020
[5] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Dosovitskiy, A., et al. · 2020 · ICLR 2021
[6] Scaling Laws for Neural Language Models.
Kaplan, J., et al. · 2020
[7] Training Compute-Optimal Large Language Models.
Hoffmann, J., et al. · 2022 · NeurIPS 2022