Use Sutton's Bitter Lesson to compare rules, learning, and search through a measured support-ticket routing lab.
You just built a conventional ML product path: time-safe features, deployed predictive models, and monitoring gates that preserve held-out evidence and an audit trail. Now comes a harder design question: when your system is wrong, should you write another rule, learn from more examples, or spend more compute searching among possible answers?
Rich Sutton's 2019 essay The Bitter Lesson gives a demanding answer. Across long stretches of AI history, he argues, methods that can use increasing computation eventually outperform systems built around expert-written knowledge. He names two broadly scalable methods: learning, which improves a model from experience, and search, which explores choices before acting.[1]
This lesson won't treat "more compute" as magic. You'll build a tiny ShopFlow ticket router, see when rules are useful, measure where they fail, and decide what additional budget can honestly buy.
Suppose Alex writes ticket #48291: "My FastShip parcel still hasn't arrived. I need my money back." A support system must choose a route: shipping investigation, refund review, or account help.
A rule can say `if "refund" in text: route_to("refund")`. That works on wording the author anticipated. A learner can fit patterns from labeled tickets such as `money back`, `damaged return`, and `delivery delayed`. A search step can generate several possible actions and check each one against policy evidence before committing.
Those are different ways to spend effort:
| Approach | What a human specifies | What can improve later | Failure to watch |
|---|---|---|---|
| Rules | Keywords and branches | More rules | New phrasing escapes the rulebook |
| Learning | Data, objective, evaluation | More clean labels and training compute | Bad labels or leakage teach the wrong behavior |
| Search | Candidates and a checker | More candidate or verification budget | A weak checker rewards the wrong answer |
Sutton's thesis concerns the long-run direction of research, not a ban on rules. A refund cap or a required human approval can be exactly the right safety boundary. The warning is about making a growing rulebook carry the core intelligence of a messy task.
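As a quick check of the one-line rule quoted earlier, the sketch below runs it against Alex's ticket #48291. The function name `one_line_rule` and the `"unrouted"` fallback label are assumptions made for this illustration, not part of ShopFlow.

```python
# Hypothetical one-keyword rule, mirroring the inline example in the text.
def one_line_rule(text: str) -> str:
    if "refund" in text.lower():
        return "refund"
    return "unrouted"  # assumed fallback label for this sketch


ticket_48291 = "My FastShip parcel still hasn't arrived. I need my money back."

# Alex asks for money back but never uses the word "refund".
print(one_line_rule(ticket_48291))
print(one_line_rule("Please refund my order"))
```

The rule fires on the anticipated wording and stays silent on Alex's phrasing, which is the gap the rest of the lab measures.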
Start with a router that a developer can ship in an afternoon. Three keywords cover obvious cases.
```python
cases = [
    ("tracking says delivered", "shipping"),
    ("where is my package", "shipping"),
    ("refund my order", "refund"),
    ("money back for cracked mug", "refund"),
    ("password reset please", "account"),
    ("can't log in", "account"),
]


def rule_route(text: str) -> str:
    text = text.lower()
    rules = {"tracking": "shipping", "refund": "refund", "password": "account"}
    # Dicts keep insertion order, so the first matching keyword wins.
    for keyword, route in rules.items():
        if keyword in text:
            return route
    return "manual_review"


predictions = [(text, expected, rule_route(text)) for text, expected in cases]
correct = sum(expected == predicted for _, expected, predicted in predictions)
misses = [text for text, expected, predicted in predictions if expected != predicted]

print(f"correct={correct}/{len(cases)}")
print("misses:", misses)
assert correct == 3
```

```
correct=3/6
misses: ['where is my package', 'money back for cracked mug', "can't log in"]
```

The router isn't foolish. It catches exact, high-signal phrases quickly. Its limit is visible: compute won't make those three conditions recognize `money back` or `log in`. A human must anticipate and encode every expansion.
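One way to read the 3/6 result is to separate how often the rules fire from how often they are right when they do. The sketch below reuses the six cases and three keywords; `coverage` and `precision_when_fired` are names chosen here for illustration.

```python
cases = [
    ("tracking says delivered", "shipping"),
    ("where is my package", "shipping"),
    ("refund my order", "refund"),
    ("money back for cracked mug", "refund"),
    ("password reset please", "account"),
    ("can't log in", "account"),
]
RULES = {"tracking": "shipping", "refund": "refund", "password": "account"}


def rule_route(text: str) -> str:
    lowered = text.lower()
    for keyword, route in RULES.items():
        if keyword in lowered:
            return route
    return "manual_review"


# Keep only the tickets where some rule fired (no fallback to manual review).
fired = [
    (expected, rule_route(text))
    for text, expected in cases
    if rule_route(text) != "manual_review"
]
coverage = len(fired) / len(cases)
precision_when_fired = sum(expected == predicted for expected, predicted in fired) / len(fired)

print(f"coverage={coverage:.2f} precision_when_fired={precision_when_fired:.2f}")
```

On this tiny set the rules are right every time they fire but fire on only half the tickets. The misses are abstentions to manual review, not wrong routes.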
Now add rules that fix today's misses, then test tomorrow's wording.
```python
old_cases = [
    ("tracking says delivered", "shipping"),
    ("where is my package", "shipping"),
    ("refund my order", "refund"),
    ("money back for cracked mug", "refund"),
    ("password reset please", "account"),
    ("can't log in", "account"),
]
new_cases = [
    ("parcel hasn't moved", "shipping"),
    ("reverse the charge for this return", "refund"),
    ("sign-in fails again", "account"),
]

# Three original keywords plus three patches for yesterday's misses.
expanded_rules = {
    "tracking": "shipping",
    "where is": "shipping",
    "refund": "refund",
    "money back": "refund",
    "password": "account",
    "log in": "account",
}


def route(text: str) -> str:
    lowered = text.lower()
    for keyword, label in expanded_rules.items():
        if keyword in lowered:
            return label
    return "manual_review"


old_score = sum(route(text) == label for text, label in old_cases)
new_score = sum(route(text) == label for text, label in new_cases)
print(
    f"rules={len(expanded_rules)} "
    f"old_set={old_score}/{len(old_cases)} new_set={new_score}/{len(new_cases)}"
)
print("new misses:", [text for text, label in new_cases if route(text) != label])
assert old_score == 6 and new_score == 0
```

```
rules=6 old_set=6/6 new_set=0/3
new misses: ["parcel hasn't moved", 'reverse the charge for this return', 'sign-in fails again']
```

The second result is the treadmill Sutton warns about. More engineer-hours repair yesterday's errors but don't automatically create a method that adapts to tomorrow's distribution.
A learned system needs a representation. Before tokenization becomes a full lesson, use the smallest representation possible: lowercase words. A ticket becomes a list of lowercase terms, and training counts how often each term appeared with each route.
```python
import re
from collections import Counter

training = [
    ("late parcel tracking update", "shipping"),
    ("courier delivery delayed", "shipping"),
    ("refund damaged item", "refund"),
    ("money back return", "refund"),
    ("password login account", "account"),
    ("sign in reset", "account"),
]


def tokens(text: str) -> list[str]:
    # Lowercase, then keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


# One word counter per route label.
by_label: dict[str, Counter] = {}
for text, label in training:
    by_label.setdefault(label, Counter()).update(tokens(text))

print("refund evidence:", sorted(by_label["refund"]))
print("shipping evidence:", sorted(by_label["shipping"]))
assert by_label["refund"]["return"] == 1
```

```
refund evidence: ['back', 'damaged', 'item', 'money', 'refund', 'return']
shipping evidence: ['courier', 'delayed', 'delivery', 'late', 'parcel', 'tracking', 'update']
```

That representation is primitive, but its source of knowledge is different from the rulebook. No engineer decided that `return` or `courier` should indicate a route. Labeled examples supplied that association.
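The same training data can be read from the other direction: for each term, which routes has it appeared with? This is a small sketch over the six training tickets; `routes_by_term` and `ambiguous` are names introduced here.

```python
import re
from collections import defaultdict

training = [
    ("late parcel tracking update", "shipping"),
    ("courier delivery delayed", "shipping"),
    ("refund damaged item", "refund"),
    ("money back return", "refund"),
    ("password login account", "account"),
    ("sign in reset", "account"),
]

# Inverted view: term -> set of routes it was observed with.
routes_by_term: dict[str, set[str]] = defaultdict(set)
for text, label in training:
    for term in re.findall(r"[a-z]+", text.lower()):
        routes_by_term[term].add(label)

# Terms that appear under more than one route carry weaker evidence.
ambiguous = sorted(term for term, routes in routes_by_term.items() if len(routes) > 1)

print("routes for 'return':", sorted(routes_by_term["return"]))
print("terms seen with more than one route:", ambiguous)
```

In this toy set no term is shared between routes, which is part of why results on it look so clean. Real ticket data would likely contain shared terms.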
Use those counts as a tiny learned classifier: score each route by summing, over the words in a new ticket, how often each word appeared with that route in training.
```python
import re
from collections import Counter, defaultdict

training = [
    ("late parcel tracking update", "shipping"),
    ("courier delivery delayed", "shipping"),
    ("refund damaged item", "refund"),
    ("money back return", "refund"),
    ("password login account", "account"),
    ("sign in reset", "account"),
]
held_out = [
    ("late delivery tracking", "shipping"),
    ("return damaged item", "refund"),
    ("reset password", "account"),
    ("money back damaged", "refund"),
]


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())


counts = defaultdict(Counter)
for text, label in training:
    counts[label].update(tokenize(text))


def predict(text: str) -> str:
    words = tokenize(text)
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    # Sorting the labels first makes ties resolve alphabetically.
    return max(sorted(scores), key=scores.get)


results = [(text, label, predict(text)) for text, label in held_out]
correct = sum(expected == predicted for _, expected, predicted in results)
print(f"held_out={correct}/{len(held_out)}")
print("predictions:", [predicted for _, _, predicted in results])
assert correct == 4
```

```
held_out=4/4
predictions: ['shipping', 'refund', 'account', 'refund']
```

Don't confuse this tiny word-overlap model with an LLM. Its role is to make the principle observable: a general learning procedure can absorb new labeled examples without a developer adding a condition for each phrase.
```python
import re
from collections import Counter, defaultdict

base = [
    ("tracking parcel", "shipping"),
    ("refund damaged", "refund"),
    ("password reset", "account"),
]
ticket = "exchange cracked mug"


def train_and_predict(rows: list[tuple[str, str]], text: str) -> str:
    # Retrain from scratch on the given rows, then score the ticket.
    counts = defaultdict(Counter)
    for example, label in rows:
        counts[label].update(re.findall(r"[a-z]+", example.lower()))
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    return max(sorted(scores), key=scores.get)


before = train_and_predict(base, ticket)
after = train_and_predict(base + [("exchange cracked mug return", "refund")], ticket)
print(f"before_correction={before}")
print(f"after_correction={after}")
assert before == "account" and after == "refund"
```

```
before_correction=account
after_correction=refund
```

The correction alone doesn't prove generalization. You'd still evaluate on untouched tickets, just as you did for datasets and training loops. It does show a useful operational property: product feedback can become evidence for the next training run instead of another permanent branch.
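In the correction example, the `before` prediction is `account` only because all three routes score zero and the tie falls to the alphabetically first label. A hedged variant, sketched below, abstains instead. The `manual_review` fallback and the name `predict_or_abstain` are assumptions for illustration, not part of the lesson's router.

```python
import re
from collections import Counter, defaultdict


def predict_or_abstain(rows: list[tuple[str, str]], text: str) -> str:
    counts: dict[str, Counter] = defaultdict(Counter)
    for example, label in rows:
        counts[label].update(re.findall(r"[a-z]+", example.lower()))
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    best = max(scores.values(), default=0)
    winners = [label for label, score in scores.items() if score == best]
    # Abstain when no training word matches or when routes tie for first place.
    if best == 0 or len(winners) > 1:
        return "manual_review"
    return winners[0]


base = [
    ("tracking parcel", "shipping"),
    ("refund damaged", "refund"),
    ("password reset", "account"),
]
ticket = "exchange cracked mug"

print(predict_or_abstain(base, ticket))
print(predict_or_abstain(base + [("exchange cracked mug return", "refund")], ticket))
```

With this variant, a ticket the model has no evidence for goes to a person, and the labeled correction still moves it to `refund` on the next training run.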
Learning isn't the only scalable method in Sutton's essay. Search spends computation at decision time. For a support agent, search might mean proposing multiple actions, retrieving evidence for each, and sending the best supported route forward.
This next miniature isn't a language model. It's a transparent candidate-and-verifier loop that lets you see why additional inference budget helps only when better candidates and a meaningful checker exist.
```python
ticket = "FastShip parcel for order A10234 has no scan for three days"
# Each candidate is (route, proposed action, supporting evidence).
candidates = [
    ("shipping", "Wait.", []),
    ("refund", "Issue refund now.", []),
    ("shipping", "Open carrier trace.", ["FastShip"]),
    ("shipping", "Open carrier trace; escalate after 48 hours without a scan.", ["FastShip", "48 hours"]),
]


def verifier(candidate: tuple[str, str, list[str]]) -> int:
    route, _, evidence = candidate
    # Reject off-route actions; otherwise score by amount of evidence.
    if route != "shipping":
        return -1
    return len(evidence)


# A budget of k means only the first k candidates are examined.
for budget in (1, 2, 4):
    selected = max(candidates[:budget], key=verifier)
    print(f"budget={budget} route={selected[0]} evidence={len(selected[2])} action={selected[1]}")

assert verifier(max(candidates, key=verifier)) == 2
```

```
budget=1 route=shipping evidence=0 action=Wait.
budget=2 route=shipping evidence=0 action=Wait.
budget=4 route=shipping evidence=2 action=Open carrier trace; escalate after 48 hours without a scan.
```

The first candidate is cheap and weak. With four candidates, the verifier can choose an action tied to carrier and escalation evidence. If every candidate were wrong, or if the verifier rewarded fast refunds without policy support, extra search would amplify error instead.
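To illustrate that last failure mode, the sketch below swaps in a deliberately flawed checker that rewards refunds regardless of evidence. `weak_verifier` is hypothetical and exists only to show the effect; the candidates are the same four as in the search example.

```python
Candidate = tuple[str, str, list[str]]

candidates: list[Candidate] = [
    ("shipping", "Wait.", []),
    ("refund", "Issue refund now.", []),
    ("shipping", "Open carrier trace.", ["FastShip"]),
    ("shipping", "Open carrier trace; escalate after 48 hours without a scan.", ["FastShip", "48 hours"]),
]


def weak_verifier(candidate: Candidate) -> int:
    route, _, _ = candidate
    # Flawed on purpose: prefers refunds and ignores evidence entirely.
    return 1 if route == "refund" else 0


# Same budget semantics as before: examine only the first k candidates.
selected_by_budget = {
    budget: max(candidates[:budget], key=weak_verifier)
    for budget in (1, 2, 4)
}
for budget, (route, action, evidence) in selected_by_budget.items():
    print(f"budget={budget} route={route} evidence={len(evidence)} action={action}")
```

As soon as the budget is large enough to include the unsupported refund, the weak checker selects it, and further budget never recovers the evidence-backed action.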
Snell and collaborators study this question for LLM reasoning: their experiments find that test-time strategies depend on prompt difficulty, and an adaptive allocation can use inference computation more efficiently than a fixed best-of-N baseline on their evaluated tasks.[2] That result supports a conditional claim, not "thinking longer always works."
Language models give learning an unusually clean objective: predict the next token. Kaplan and collaborators measured smooth empirical power-law relationships between cross-entropy loss and model size, dataset size, and training computation for the Transformer language models they studied.[3] That's evidence that a general learning objective can turn additional scale into predictable improvement within an experimental regime.
For a dense decoder-only Transformer, engineers often estimate training compute with:

$$C \approx 6ND$$

Here, $C$ is training floating-point operations (FLOPs), $N$ is the number of model parameters, and $D$ is the number of training tokens. The factor 6 is a planning approximation for forward and backward passes, not a promise about hardware throughput or model quality.
```python
def dense_training_flops(parameters: int, tokens: int) -> int:
    # Planning approximation: 6 FLOPs per parameter per training token.
    return 6 * parameters * tokens


plans = [
    ("small study", 1_000_000_000, 20_000_000_000),
    ("larger run", 7_000_000_000, 140_000_000_000),
]

for name, parameters, tokens in plans:
    flops = dense_training_flops(parameters, tokens)
    print(f"{name}: {flops / 1e21:.2f} zettaFLOPs")

assert dense_training_flops(7_000_000_000, 140_000_000_000) == 5_880_000_000_000_000_000_000
```

```
small study: 0.12 zettaFLOPs
larger run: 5.88 zettaFLOPs
```

More compute isn't the same as better allocation. Hoffmann and collaborators trained more than 400 models and reported that, under their compute-optimal fits, parameter count and training-token count should scale together: doubling one calls for roughly doubling the other.[4]
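Under the 6 × parameters × tokens approximation, scaling both quantities by the same factor has a simple consequence for budget planning. The sketch below assumes each grows with the square root of the compute multiplier, which is one reading of "scale together"; it is not Hoffmann et al.'s fitted procedure, and `scale_together` is a name introduced here.

```python
import math


def dense_training_flops(parameters: float, tokens: float) -> float:
    return 6 * parameters * tokens


def scale_together(parameters: float, tokens: float, compute_multiplier: float) -> tuple[float, float]:
    # Assumption: split extra compute evenly, so each factor grows by sqrt(multiplier).
    factor = math.sqrt(compute_multiplier)
    return parameters * factor, tokens * factor


base_n, base_d = 1e9, 20e9
new_n, new_d = scale_together(base_n, base_d, compute_multiplier=4)
ratio = dense_training_flops(new_n, new_d) / dense_training_flops(base_n, base_d)

print(f"N: {base_n / 1e9:.0f}B -> {new_n / 1e9:.0f}B")
print(f"D: {base_d / 1e9:.0f}B -> {new_d / 1e9:.0f}B")
print(f"compute ratio: {ratio:.1f}x")
```

Doubling both parameters and tokens costs about four times the training FLOPs, so "double the model" is really a request for a fourfold budget if the data is to keep pace.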
Use the FLOPs approximation to expose the tradeoff under one fixed budget. The following calculation doesn't select the best model; it only tells you how many tokens each model size can afford at that budget.
```python
# Fix the budget at the "larger run" from the previous example.
budget = 6 * 7_000_000_000 * 140_000_000_000
model_sizes = [1_000_000_000, 7_000_000_000, 14_000_000_000]

for parameters in model_sizes:
    # Invert the approximation: tokens = budget / (6 * parameters).
    affordable_tokens = budget // (6 * parameters)
    print(f"N={parameters / 1e9:.0f}B -> D={affordable_tokens / 1e9:.0f}B tokens")

assert budget // (6 * 14_000_000_000) == 70_000_000_000
```

```
N=1B -> D=980B tokens
N=7B -> D=140B tokens
N=14B -> D=70B tokens
```

That calculation matters because "make the model bigger" can consume the budget needed to expose it to sufficient data. A research scientist doesn't ask only how much compute is available. They ask how to allocate it, what evaluation will reveal, and which failure mode an extra unit of compute addresses.
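One hedged way to compare the three options is tokens per parameter, a ratio computed here directly from the same budget and model sizes. The sketch makes no claim about which ratio is best; `tokens_per_parameter` is a name introduced for illustration.

```python
budget = 6 * 7_000_000_000 * 140_000_000_000
model_sizes = [1_000_000_000, 7_000_000_000, 14_000_000_000]

tokens_per_parameter: dict[int, float] = {}
for parameters in model_sizes:
    affordable_tokens = budget // (6 * parameters)
    # How many training tokens each parameter "sees" at this budget.
    tokens_per_parameter[parameters] = affordable_tokens / parameters
    print(f"N={parameters / 1e9:.0f}B -> {tokens_per_parameter[parameters]:.0f} tokens per parameter")
```

Going from 7B to 14B parameters at a fixed budget cuts the data per parameter by a factor of four, because the affordable tokens halve while the parameters double.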
Sutton's essay supports a broad historical pattern. A careful reader should also notice where the evidence stops.
| Domain | What the source supports | What you shouldn't claim from it |
|---|---|---|
| Chess and Go | Sutton describes delayed success from scaled search and learning; AlphaZero learned chess, shogi, and Go from self-play with only game rules and reached superhuman play within 24 hours.[1][5] | That search eliminates every useful prior or safety rule |
| Speech and vision | Sutton points to statistical/deep-learning methods replacing increasingly elaborate human feature engineering.[1] | That modern architectures contain no inductive biases |
| Language models | Scaling studies measure improvement of Transformer language models as model size, tokens, and compute change.[3][4] | That LLM pretraining proves Sutton's full claim about agents learning from experience |
That last distinction is important. Internet text contains human-written knowledge. Training on more of it can be a powerful general procedure while still relying on human-produced data. Treat the Bitter Lesson as a research heuristic: prefer methods that can absorb more evidence and compute, then demand measurement.
Rules are valuable when they encode a contractual boundary instead of trying to recognize every possible user intent. For ShopFlow, the model can suggest a refund route, but policy can require review for a high-value order or missing evidence.
```python
def release_action(proposed_route: str, order_total: int, evidence: set[str]) -> str:
    # Gate 1: high-value refunds always go to a human.
    if proposed_route == "refund" and order_total > 100:
        return "human_review_high_value"
    # Gate 2: refunds need damage evidence on file.
    if proposed_route == "refund" and "damaged_item" not in evidence:
        return "request_evidence"
    return proposed_route


examples = [
    ("refund", 24, {"damaged_item"}),
    ("refund", 240, {"damaged_item"}),
    ("refund", 24, set()),
]
decisions = [release_action(*example) for example in examples]
print("decisions:", decisions)
assert decisions == ["refund", "human_review_high_value", "request_evidence"]
```

```
decisions: ['refund', 'human_review_high_value', 'request_evidence']
```

The rule here isn't pretending to understand language. It protects an action with business and safety consequences. Learning handles messy phrasing; the gate controls what the learned component may execute.
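The division of labor can be sketched end to end: a learned component proposes a route, and the rule gate decides what is released. The stand-in `propose_route` below is a word-overlap scorer trained on a subset of the earlier training rows, and `handle_ticket` is a name introduced for this sketch; `release_action` follows the gate shown above.

```python
import re
from collections import Counter, defaultdict

TRAINING = [
    ("late parcel tracking update", "shipping"),
    ("refund damaged item", "refund"),
    ("money back return", "refund"),
    ("password login account", "account"),
]

counts: dict[str, Counter] = defaultdict(Counter)
for text, label in TRAINING:
    counts[label].update(re.findall(r"[a-z]+", text.lower()))


def propose_route(text: str) -> str:
    # Learned part: pick the route whose training words overlap most.
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    return max(sorted(scores), key=scores.get)


def release_action(proposed_route: str, order_total: int, evidence: set[str]) -> str:
    # Rule part: policy gates on what may actually be executed.
    if proposed_route == "refund" and order_total > 100:
        return "human_review_high_value"
    if proposed_route == "refund" and "damaged_item" not in evidence:
        return "request_evidence"
    return proposed_route


def handle_ticket(text: str, order_total: int, evidence: set[str]) -> str:
    return release_action(propose_route(text), order_total, evidence)


print(handle_ticket("money back for damaged mug", 240, {"damaged_item"}))
print(handle_ticket("money back for damaged mug", 24, {"damaged_item"}))
print(handle_ticket("tracking update please", 240, set()))
```

The same wording yields a different released action depending on order value, while the shipping ticket passes through untouched because the gates only constrain refunds.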
Finally, leave a receipt. A system that consumes more compute but can't report its evaluation score, inference budget, policy gates, and escalation rate isn't ready for a production experiment.
```python
import json

receipt = {
    "component": "ticket_router_v2",
    "held_out_accuracy": 0.92,
    "training_examples": 10_000,
    "candidate_budget": 4,
    "policy_gates": ["refund_over_100_requires_review", "refund_requires_evidence"],
    "monitor": "manual_escalation_rate",
}

# sort_keys gives a stable, diff-friendly record.
print(json.dumps(receipt, indent=2, sort_keys=True))
assert receipt["candidate_budget"] == 4
assert receipt["held_out_accuracy"] >= 0.90
```

```
{
  "candidate_budget": 4,
  "component": "ticket_router_v2",
  "held_out_accuracy": 0.92,
  "monitor": "manual_escalation_rate",
  "policy_gates": [
    "refund_over_100_requires_review",
    "refund_requires_evidence"
  ],
  "training_examples": 10000
}
```

The Bitter Lesson isn't "throw compute at every problem." It's a test for your design:
Language models begin by converting raw text into discrete pieces. Tokenizers are a perfect next place to ask the same question: which reusable units can be learned from data, and what mistakes are introduced at that boundary?
Symptom: You enlarge a model or sample more candidates while labels leak or contradict one another. Cause: General methods can amplify the data and objectives they're given. Fix: Keep the dataset-quality and held-out evaluation discipline from the earlier data and production-ML lessons.
Symptom: You let a learned router auto-approve high-value refunds without a policy boundary. Cause: You've confused scalable capability with unconstrained execution. Fix: Use rules for auditable gates and learning for messy recognition.
Symptom: Cost and latency rise, but accepted answers don't improve. Cause: Candidate generation or verification is too weak for added budget to help. Fix: Evaluate budget levels on held-out tasks and stop spending where gains vanish.
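The fix for the last symptom can be sketched as a budget sweep that records the gain from each extra candidate. It reuses the four shipping candidates and the evidence-counting verifier from the search example; `best_score` and `gains` are names introduced here, and a real sweep would average over many held-out tickets, not one.

```python
Candidate = tuple[str, str, list[str]]

candidates: list[Candidate] = [
    ("shipping", "Wait.", []),
    ("refund", "Issue refund now.", []),
    ("shipping", "Open carrier trace.", ["FastShip"]),
    ("shipping", "Open carrier trace; escalate after 48 hours without a scan.", ["FastShip", "48 hours"]),
]


def verifier(candidate: Candidate) -> int:
    route, _, evidence = candidate
    return -1 if route != "shipping" else len(evidence)


budgets = [1, 2, 3, 4]
# Best verifier score reachable when only the first b candidates are examined.
best_score = {b: max(verifier(c) for c in candidates[:b]) for b in budgets}
# Marginal gain of each budget step over the previous one.
gains = {b: best_score[b] - best_score[prev] for prev, b in zip(budgets, budgets[1:])}

for b in budgets:
    print(f"budget={b} best_score={best_score[b]} gain={gains.get(b, 0)}")
```

Here the step from one candidate to two buys nothing, yet later steps still help, so a flat spot at one budget level is not by itself proof that gains have vanished. Measure the whole range you are willing to pay for before choosing where to stop.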
[1] Sutton, R. S. (2019). The Bitter Lesson.

[2] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint.

[3] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.

[4] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.

[5] Silver, D., et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint.