The Bitter Lesson & Compute

Use Sutton's Bitter Lesson to compare rules, learning, and search through a measured AI-incident routing lab.

You just built a conventional ML product path: time-safe features, deployed predictive models, and monitoring gates that preserve held-out evidence and an audit trail. Now comes a harder design question: when your system is wrong, should you write another rule, learn from more examples, or spend more compute searching among possible answers?

Rich Sutton's 2019 essay "The Bitter Lesson" gives a demanding answer. Across long stretches of AI history, he argues, methods that can use increasing computation eventually outperform systems built around expert-written knowledge. He names two broadly scalable methods: learning, which improves a model from experience, and search, which explores choices before acting [1].

More compute doesn't automatically fix a system. You'll build a tiny AI-incident router, see when rules are useful, measure where they fail, and decide what additional budget can honestly buy.

Where effort compounds

Suppose Alex writes issue #48291: "Decode latency spiked after yesterday's reranker deploy." An AI operations system must choose a route: serving latency, answer quality, or access/auth.

A rule can say if "latency" in text: route_to("latency"). That works on wording the author anticipated. A learner can fit patterns from labeled issues such as TTFT spike, unsupported citation, and OAuth callback failed. A search step can generate several possible actions and check each one against evaluation evidence before committing.

Those are different ways to spend effort:

Approach	What a human specifies	What can improve later	Failure to watch
Rules	Keywords and branches	More rules	New phrasing escapes the rulebook
Learning	Data, objective, evaluation	More clean labels and training compute	Bad labels or leakage teach the wrong behavior
Search	Candidates and a checker	More candidate or verification budget	A weak checker rewards the wrong answer

Sutton's thesis concerns the long-run direction of research, not a ban on rules. A promotion gate or a required human approval can be exactly the right safety boundary. The warning is about making a growing rulebook carry the core intelligence of a messy task.

Figure: Three-lane comparison for the same latency-regression issue. Keyword rules miss the unseen phrase "token delay" and require a new branch, learned scores route the issue toward latency examples, and search scores the trace, rollback, and ignore candidates before selecting the evidence-backed trace. The same issue creates different work: rules pay with an engineer after each unseen phrase, learning pays at training time to move a reusable decision boundary, and search pays at request time to score alternatives against evidence.

See a rulebook reach its boundary

Start with a router that a developer can ship in an afternoon. Three keywords cover obvious cases.

01-rules-have-boundaries.py

cases = [
    ("decode timeout in prod", "latency"),
    ("tokens arrive slowly", "latency"),
    ("answer cites wrong doc", "quality"),
    ("citation points to stale source", "quality"),
    ("password reset please", "access"),
    ("can't authenticate", "access"),
]

def rule_route(text: str) -> str:
    text = text.lower()
    rules = {"timeout": "latency", "citation": "quality", "password": "access"}
    for keyword, route in rules.items():
        if keyword in text:
            return route
    return "manual_review"

predictions = [(text, expected, rule_route(text)) for text, expected in cases]
correct = sum(expected == predicted for _, expected, predicted in predictions)
misses = [text for text, expected, predicted in predictions if expected != predicted]

print(f"correct={correct}/{len(cases)}")
print("misses:", misses)
assert correct == 3

Rule router baseline

correct=3/6
misses: ['tokens arrive slowly', 'answer cites wrong doc', "can't authenticate"]

The router isn't foolish. It catches exact, high-signal phrases quickly. Its limit is visible: compute won't make those three conditions recognize tokens arrive slowly or can't authenticate. A human must anticipate and encode every expansion.

Now add rules that fix today's misses, then test tomorrow's wording.

02-rule-churn.py

old_cases = [
    ("decode timeout in prod", "latency"),
    ("tokens arrive slowly", "latency"),
    ("answer cites wrong doc", "quality"),
    ("citation points to stale source", "quality"),
    ("password reset please", "access"),
    ("can't authenticate", "access"),
]
new_cases = [
    ("TTFT spiked today", "latency"),
    ("response used unsupported claim", "quality"),
    ("OAuth callback fails again", "access"),
]

expanded_rules = {
    "timeout": "latency",
    "tokens": "latency",
    "cites": "quality",
    "citation": "quality",
    "stale source": "quality",
    "password": "access",
    "authenticate": "access",
}

def route(text: str) -> str:
    lowered = text.lower()
    for keyword, label in expanded_rules.items():
        if keyword in lowered:
            return label
    return "manual_review"

old_score = sum(route(text) == label for text, label in old_cases)
new_score = sum(route(text) == label for text, label in new_cases)
print(f"rules={len(expanded_rules)} old_set={old_score}/6 new_set={new_score}/3")
print("new misses:", [text for text, label in new_cases if route(text) != label])
assert old_score == 6 and new_score == 0

Rule churn under new wording

rules=7 old_set=6/6 new_set=0/3
new misses: ['TTFT spiked today', 'response used unsupported claim', 'OAuth callback fails again']

The second result is the treadmill Sutton warns about. More engineer-hours repair yesterday's errors but don't automatically create a method that adapts to tomorrow's distribution.

Figure: Conceptual curve for an AI-incident router. Routing rules improve early and flatten, while a learned router can continue improving as labeled issues and training budget grow. The crossover is a design hypothesis, not a guaranteed curve. Learning earns its extra budget only when labels, evaluation, and deployment feedback remain trustworthy.

Replace phrases with evidence from examples

A learned system needs a representation. Before tokenization becomes a full lesson, use the smallest representation possible: lowercase words. An issue becomes a set of observed terms, and training records which route each term appeared with.

03-token-evidence.py

import re
from collections import Counter

training = [
    ("decode latency timeout", "latency"),
    ("ttft spike tokens", "latency"),
    ("wrong citation unsupported answer", "quality"),
    ("stale source hallucination", "quality"),
    ("password login access", "access"),
    ("oauth callback auth", "access"),
]

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

by_label = {}
for text, label in training:
    by_label.setdefault(label, Counter()).update(tokens(text))

print("quality evidence:", sorted(by_label["quality"]))
print("latency evidence:", sorted(by_label["latency"]))
assert by_label["quality"]["citation"] == 1

Issue tokens

quality evidence: ['answer', 'citation', 'hallucination', 'source', 'stale', 'unsupported', 'wrong']
latency evidence: ['decode', 'latency', 'spike', 'timeout', 'tokens', 'ttft']

That representation is primitive, but its source of knowledge is different from the rulebook. No engineer chose that citation or ttft should indicate a route. Labeled examples supplied that association.

Use those counts as a tiny learned classifier: score each route by how many words it shares with a new issue.

04-learn-from-labeled-issues.py

import re
from collections import Counter, defaultdict

training = [
    ("decode latency timeout", "latency"),
    ("ttft spike tokens", "latency"),
    ("wrong citation unsupported answer", "quality"),
    ("stale source hallucination", "quality"),
    ("password login access", "access"),
    ("oauth callback auth", "access"),
]
held_out = [
    ("latency spike tokens", "latency"),
    ("unsupported citation answer", "quality"),
    ("reset password", "access"),
    ("stale source answer", "quality"),
    ("bad graph query", "manual_review"),
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

counts = defaultdict(Counter)
for text, label in training:
    counts[label].update(tokenize(text))

def predict(text: str) -> str:
    words = tokenize(text)
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    best_score = max(scores.values())
    winners = [label for label, score in scores.items() if score == best_score]
    return winners[0] if best_score > 0 and len(winners) == 1 else "manual_review"

results = [(text, label, predict(text)) for text, label in held_out]
correct = sum(expected == predicted for _, expected, predicted in results)
print(f"held_out={correct}/{len(held_out)}")
print("predictions:", [predicted for _, _, predicted in results])
assert correct == 5

Learned router evaluation

held_out=5/5
predictions: ['latency', 'quality', 'access', 'quality', 'manual_review']

Don't confuse this tiny word-overlap model with an LLM. Its role is to make the principle observable: a general learning procedure can absorb new labeled examples without a developer adding a condition for each phrase. When no observed word supports a unique route, the router abstains with manual_review instead of turning a tie into an arbitrary production decision.
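The abstention rule can be checked directly. The sketch below reuses the training rows from 04-learn-from-labeled-issues.py and the same scoring logic. The helper name `route_scores` and the three probe issues are made up for this illustration and are not part of the lesson's held-out set.

```python
import re
from collections import Counter, defaultdict

# Same six labeled issues as 04-learn-from-labeled-issues.py.
training = [
    ("decode latency timeout", "latency"),
    ("ttft spike tokens", "latency"),
    ("wrong citation unsupported answer", "quality"),
    ("stale source hallucination", "quality"),
    ("password login access", "access"),
    ("oauth callback auth", "access"),
]

counts = defaultdict(Counter)
for text, label in training:
    counts[label].update(re.findall(r"[a-z]+", text.lower()))

def route_scores(text: str) -> dict[str, int]:
    """Count, per route, how many training occurrences match the issue's words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {label: sum(counter[word] for word in words) for label, counter in counts.items()}

def predict(text: str) -> str:
    scores = route_scores(text)
    best_score = max(scores.values())
    winners = [label for label, score in scores.items() if score == best_score]
    # Abstain on zero evidence or on a tie for the best score.
    return winners[0] if best_score > 0 and len(winners) == 1 else "manual_review"

# Hypothetical probes: a tie, no evidence at all, and a tie broken by one extra word.
probes = ["timeout citation", "graph query", "timeout citation answer"]
for probe in probes:
    print(f"{probe!r}: scores={route_scores(probe)} -> {predict(probe)}")
```

The first probe has one word of latency evidence and one of quality evidence, so the router abstains. Adding "answer" gives quality a strict lead and the tie disappears.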

05-add-one-correction.py

import re
from collections import Counter, defaultdict

base = [
    ("timeout tokens", "latency"),
    ("wrong citation", "quality"),
    ("password reset", "access"),
]
issue = "bad graph query"

def train_and_predict(rows: list[tuple[str, str]], text: str) -> str:
    counts = defaultdict(Counter)
    for example, label in rows:
        counts[label].update(re.findall(r"[a-z]+", example.lower()))
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    best_score = max(scores.values())
    winners = [label for label, score in scores.items() if score == best_score]
    return winners[0] if best_score > 0 and len(winners) == 1 else "manual_review"

before = train_and_predict(base, issue)
after = train_and_predict(base + [("bad graph query citation", "quality")], issue)
print(f"before_correction={before}")
print(f"after_correction={after}")
assert before == "manual_review" and after == "quality"

Correction becomes training data

before_correction=manual_review
after_correction=quality

Before correction, the safe answer is manual_review: the learner has no matching evidence. The correction alone doesn't prove generalization. You'd still evaluate on untouched issues, just as you did for datasets and training loops. It does show a useful operational property: production feedback can become evidence for the next training run instead of another permanent branch.

Figure: Diagram with four stages: resolved issues fit the router, a new issue gets a candidate route, policy evidence is verified, and the system routes or escalates and logs the outcome.

Search spends compute after training

Learning isn't the only scalable method in Sutton's essay. Search spends computation at decision time. For an AI-operations agent, search might mean proposing multiple actions, retrieving evidence for each, and sending the best supported route forward.

Figure: Train-time and inference-time compute. Resolved AI incidents fit one evaluated router release, then a hard request searches four candidate actions whose verifier scores are 0, -1, 1, and 2 before the evidence-backed "open eval trace and block promotion" action is selected. Training pays once to fit an evaluated router release. For the hard eval issue, request-time search spends a budget of 4 to find the only candidate with both run and metric evidence.

This next miniature isn't a language model. It's a transparent candidate-and-verifier loop that lets you see why additional inference budget helps only when better candidates and a meaningful checker exist.

06-search-with-a-verifier.py

issue = "RAG answer failed citation_precision on run R42"
candidates = [
    ("quality", "Retry later.", []),
    ("promote", "Promote anyway.", []),
    ("quality", "Open eval trace.", ["run R42"]),
    ("quality", "Open eval trace; block promotion until citation_precision recovers.", ["run R42", "citation_precision"]),
]

def verifier(candidate: tuple[str, str, list[str]]) -> int:
    route, _, evidence = candidate
    if route != "quality":
        return -1
    return len(evidence)

for budget in (1, 2, 4):
    selected = max(candidates[:budget], key=verifier)
    print(f"budget={budget} route={selected[0]} evidence={len(selected[2])} action={selected[1]}")

assert verifier(max(candidates, key=verifier)) == 2

Inference-time search budget

budget=1 route=quality evidence=0 action=Retry later.
budget=2 route=quality evidence=0 action=Retry later.
budget=4 route=quality evidence=2 action=Open eval trace; block promotion until citation_precision recovers.

The first candidate is cheap and weak. With four candidates, the verifier can choose an action tied to eval-run and metric evidence. If every candidate were wrong, or if the verifier rewarded promotion without release evidence, extra search would amplify error instead.
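The failure case is easy to reproduce by swapping in a deliberately flawed checker. The sketch below reuses the candidates from 06-search-with-a-verifier.py. `weak_verifier` and `select` are hypothetical names introduced here. The weak checker rewards the "promote" route and ignores evidence entirely.

```python
# Same candidates as 06-search-with-a-verifier.py: (route, action, evidence).
candidates = [
    ("quality", "Retry later.", []),
    ("promote", "Promote anyway.", []),
    ("quality", "Open eval trace.", ["run R42"]),
    ("quality", "Open eval trace; block promotion until citation_precision recovers.", ["run R42", "citation_precision"]),
]

def weak_verifier(candidate: tuple[str, str, list[str]]) -> int:
    """Hypothetical flawed checker: rewards promotion and never looks at evidence."""
    route, _, _ = candidate
    return 1 if route == "promote" else 0

def select(budget: int, verifier) -> tuple[str, str, list[str]]:
    """Pick the highest-scoring candidate among the first `budget` candidates."""
    return max(candidates[:budget], key=verifier)

for budget in (1, 2, 4):
    route, action, evidence = select(budget, weak_verifier)
    print(f"budget={budget} route={route} evidence={len(evidence)} action={action}")
```

With this checker, a budget of 1 returns the harmless "Retry later." action, while budgets of 2 and 4 both select "Promote anyway." The extra search found the candidate that the flawed checker prefers, not a better answer.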

Snell and collaborators study this question for LLM reasoning. Their experiments find that the effectiveness of test-time strategies depends on prompt difficulty, and that an adaptive allocation can use inference computation more efficiently than a fixed best-of-$N$ baseline on their evaluated tasks [2]. That result supports a conditional claim, not "thinking longer always works."
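A back-of-envelope model shows why difficulty matters. It is not the method used in the paper. Assume each sampled candidate is independently correct with probability $p$ and that the verifier always recognizes a correct candidate. Best-of-$N$ then succeeds with probability $1 - (1 - p)^N$. The three values of `p` below are invented for illustration.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumes independent samples and a perfect verifier, which real systems
    rarely satisfy.
    """
    return 1 - (1 - p) ** n

# Hypothetical per-sample success rates for three difficulty bands.
difficulties = {"easy": 0.9, "medium": 0.3, "hard": 0.02}
budgets = (1, 4, 16)

for name, p in difficulties.items():
    row = " ".join(f"N={n}:{best_of_n_success(p, n):.3f}" for n in budgets)
    print(f"{name:<6} p={p:.2f} {row}")
```

Under these assumptions, easy prompts are nearly solved after a few samples, medium prompts gain the most from a larger budget, and hard prompts still fail most of the time at $N = 16$. A single fixed $N$ therefore overspends on one band and underdelivers on another, which is the intuition behind allocating budget by difficulty.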

Connect the lesson to language-model training

Language models give learning an unusually clean objective: predict the next token. Kaplan and collaborators measured smooth empirical power-law relationships between cross-entropy loss and model size, dataset size, and training computation for the Transformer language models they studied [3]. That's evidence that a general learning objective can turn additional scale into predictable improvement within an experimental regime.
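A power law $L = a N^{-b}$ is a straight line in log-log space, which is how such fits are usually read. The sketch below recovers the exponent from synthetic points with ordinary least squares. The constants `a_true = 10.0` and `b_true = 0.08` are made up for the demonstration and are not values reported by Kaplan and collaborators.

```python
import math

# Synthetic, noise-free data from an assumed power law L = a * N**(-b).
a_true, b_true = 10.0, 0.08
sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [a_true * n ** -b_true for n in sizes]

# Least-squares line through (log N, log L): log L = log a - b * log N.
xs = [math.log(n) for n in sizes]
ys = [math.log(loss) for loss in losses]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

b_fit = -slope
a_fit = math.exp(intercept)
print(f"fitted a={a_fit:.3f} b={b_fit:.3f}")
print(f"loss ratio per 10x parameters: {10 ** -b_fit:.3f}")
```

With this made-up exponent, each tenfold increase in $N$ multiplies the loss by $10^{-0.08} \approx 0.83$. Real measurements are noisy and the fit holds only inside the range that was actually measured, which is why the text says "within an experimental regime."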

For a dense decoder-only Transformer, engineers often estimate training compute with:

$$C \approx 6ND$$

Here, $C$ is training floating-point operations (FLOPs), $N$ is the number of model parameters, and $D$ is the number of training tokens. The factor 6 is a planning approximation for forward and backward passes, not a promise about hardware throughput or model quality [4].

07-estimate-training-flops.py

def dense_training_flops(parameters: int, tokens: int) -> int:
    return 6 * parameters * tokens

plans = [
    ("small study", 1_000_000_000, 20_000_000_000),
    ("larger run", 7_000_000_000, 140_000_000_000),
]

for name, parameters, tokens in plans:
    flops = dense_training_flops(parameters, tokens)
    print(f"{name}: {flops / 1e21:.2f} zettaFLOPs")

assert dense_training_flops(7_000_000_000, 140_000_000_000) == 5_880_000_000_000_000_000_000

Training FLOPs estimate

small study: 0.12 zettaFLOPs
larger run: 5.88 zettaFLOPs
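The estimate counts operations, not time. Converting it to wall-clock hours needs two further assumptions: a peak throughput per accelerator and the fraction of that peak a training job actually sustains. The device count, `peak_flops_per_second`, and `utilization` values below are hypothetical placeholders, not measurements of any real hardware.

```python
def training_hours(
    flops: float,
    devices: int,
    peak_flops_per_second: float,
    utilization: float,
) -> float:
    """Wall-clock hours if `devices` accelerators sustain `utilization` of peak."""
    sustained = devices * peak_flops_per_second * utilization
    return flops / sustained / 3600

# "larger run" from the plan above: 7B parameters, 140B tokens.
flops = 6 * 7_000_000_000 * 140_000_000_000

# Hypothetical cluster: 64 devices, 1e15 peak FLOP/s each, 40% sustained.
hours = training_hours(flops, devices=64, peak_flops_per_second=1e15, utilization=0.4)
slower = training_hours(flops, devices=64, peak_flops_per_second=1e15, utilization=0.2)

print(f"assumed 40% utilization: {hours:.1f} hours")
print(f"assumed 20% utilization: {slower:.1f} hours")
```

Halving the assumed utilization doubles the time for the same FLOPs, so the $6ND$ figure is a budget input, not a schedule.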

More compute isn't the same as better allocation. Hoffmann and collaborators trained more than 400 models and reported that, under their compute-optimal fits, parameter count and training-token count should scale together: doubling one calls for roughly doubling the other [4].
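If $N$ and $D$ grow in proportion, $C \approx 6ND$ implies that each grows with the square root of compute. The sketch below solves $C = 6N(rN)$ for $N$ under a fixed tokens-per-parameter ratio $r$. The default `tokens_per_parameter=20.0` is an assumption taken from the example plans above (20B tokens for 1B parameters), not a fitted constant, and `balanced_allocation` is a name introduced here.

```python
import math

def balanced_allocation(
    budget_flops: float, tokens_per_parameter: float = 20.0
) -> tuple[float, float]:
    """Solve C = 6 * N * (r * N) for N, then set D = r * N."""
    parameters = math.sqrt(budget_flops / (6 * tokens_per_parameter))
    tokens = tokens_per_parameter * parameters
    return parameters, tokens

base_budget = 6 * 7e9 * 140e9  # the "larger run" budget from the plan above

for multiple in (1, 4, 16):
    n, d = balanced_allocation(multiple * base_budget)
    print(f"{multiple:>2}x compute -> N={n / 1e9:.0f}B D={d / 1e9:.0f}B tokens")
```

Under this assumed ratio, quadrupling the budget doubles both the parameter count and the token count. Spending the same 4x on parameters alone would leave the token count unchanged.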

Use the FLOPs approximation to expose the tradeoff under one fixed budget. This calculation doesn't select the best model; it only tells you how many tokens each model size can afford at that budget.

08-allocate-a-fixed-training-budget.py

budget = 6 * 7_000_000_000 * 140_000_000_000
model_sizes = [1_000_000_000, 7_000_000_000, 14_000_000_000]

for parameters in model_sizes:
    affordable_tokens = budget // (6 * parameters)
    print(f"N={parameters / 1e9:.0f}B -> D={affordable_tokens / 1e9:.0f}B tokens")

assert budget // (6 * 14_000_000_000) == 70_000_000_000

Fixed-budget tradeoff

N=1B -> D=980B tokens
N=7B -> D=140B tokens
N=14B -> D=70B tokens

That calculation matters because "make the model bigger" can consume the budget needed to expose it to sufficient data. A research scientist doesn't ask only how much compute is available. They ask how to allocate it, what evaluation will reveal, and which failure mode an extra unit of compute addresses.

History is evidence, not a slogan

Sutton's essay supports a broad historical pattern. A careful reader should also notice where the evidence stops.

Domain	What the source supports	What you shouldn't claim from it
Chess and Go	Sutton describes delayed success from scaled search and learning; AlphaZero learned chess, shogi, and Go from self-play with only game rules and reached superhuman play within 24 hours [1][5].	That search eliminates every useful prior or safety rule
Speech and vision	Sutton points to statistical and deep-learning methods replacing increasingly elaborate human feature engineering [1].	That modern architectures contain no inductive biases
Language models	Scaling studies measure improvement of Transformer language models as model size, tokens, and compute change [3][4].	That LLM pretraining proves Sutton's full claim about agents learning from experience

That last distinction is important. Internet text contains human-written knowledge. Training on more of it can be a strong general procedure while still relying on human-produced data. Treat the Bitter Lesson as a research heuristic: prefer methods that can absorb more evidence and compute, then demand measurement.

Keep rules where they belong

Rules belong at contractual boundaries, not inside a growing list of guessed user intents. For an AI release system, the model can suggest a promotion route, but policy can require review for a failed gate or missing eval evidence.

Figure: Ambiguity-consequence matrix. Deterministic lookups sit at low ambiguity and low risk, learned intent handles free-form issues, policy rules cover high-consequence thresholds, and a hybrid path classifies "ship the candidate" before routing a low-score promotion to human review. Use learning where meaning is ambiguous, rules where execution risk is explicit, and both when a free-form request can trigger a costly action. Here the model identifies promotion intent; the policy gate sends a low-score candidate to review.

09-keep-policy-gates.py

def release_action(proposed_route: str, eval_score: float, evidence: set[str]) -> str:
    if proposed_route == "promote" and eval_score < 0.95:
        return "human_review_failed_gate"
    if proposed_route == "promote" and "eval_report" not in evidence:
        return "request_evidence"
    return proposed_route

examples = [
    ("promote", 0.98, {"eval_report"}),
    ("promote", 0.91, {"eval_report"}),
    ("promote", 0.98, set()),
]
decisions = [release_action(*example) for example in examples]
print("decisions:", decisions)
assert decisions == ["promote", "human_review_failed_gate", "request_evidence"]

Guardrail around capability

decisions: ['promote', 'human_review_failed_gate', 'request_evidence']

The rule here isn't pretending to understand language. It protects an action with business and safety consequences. Learning handles messy phrasing; the gate controls what the learned component may execute.

Finally, leave a receipt. This one records the tiny lesson fixture, not a launch approval. A system that consumes more compute but can't report its evaluated split, training procedure, inference budget, verifier, policy gates, and escalation monitor isn't ready for a production experiment.

10-log-the-compute-decision.py

from hashlib import sha256
from json import dumps

evaluation = {
    "split": "incident-router-held-out-v1",
    "correct": 5,
    "total": 5,
    "manual_review_routes": 1,
}
evaluation["accuracy"] = evaluation["correct"] / evaluation["total"]
receipt = {
    "component": "incident-router-word-overlap-v1",
    "training": {
        "procedure": "token-overlap-counts-v1",
        "examples": 6,
    },
    "evaluation": evaluation,
    "search": {
        "candidate_budget": 4,
        "verifier": "eval-evidence-v1",
    },
    "policy_gates": ["promotion_below_0_95_requires_review", "promotion_requires_eval_report"],
    "monitor": "manual_escalation_rate",
}
payload = dumps(receipt, sort_keys=True, separators=(",", ":"))

print(dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])
assert receipt["search"]["candidate_budget"] == 4
assert receipt["evaluation"]["accuracy"] == 1.0

Compute decision receipt

{
  "component": "incident-router-word-overlap-v1",
  "evaluation": {
    "accuracy": 1.0,
    "correct": 5,
    "manual_review_routes": 1,
    "split": "incident-router-held-out-v1",
    "total": 5
  },
  "monitor": "manual_escalation_rate",
  "policy_gates": [
    "promotion_below_0_95_requires_review",
    "promotion_requires_eval_report"
  ],
  "search": {
    "candidate_budget": 4,
    "verifier": "eval-evidence-v1"
  },
  "training": {
    "examples": 6,
    "procedure": "token-overlap-counts-v1"
  }
}
receipt sha256: 957f864a4562
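The receipt hash is only useful if the same content always serializes to the same bytes. The sketch below checks that property for the `sort_keys=True, separators=(",", ":")` serialization used above, on a small made-up receipt. The helper name `receipt_digest` and the three sample dictionaries are introduced here for illustration.

```python
from hashlib import sha256
from json import dumps

def receipt_digest(receipt: dict) -> str:
    """Hash a canonical JSON form: sorted keys, no insignificant whitespace."""
    payload = dumps(receipt, sort_keys=True, separators=(",", ":"))
    return sha256(payload.encode()).hexdigest()

# Same content, different key insertion order (top level and nested).
first = {
    "search": {"candidate_budget": 4, "verifier": "eval-evidence-v1"},
    "monitor": "manual_escalation_rate",
}
reordered = {
    "monitor": "manual_escalation_rate",
    "search": {"verifier": "eval-evidence-v1", "candidate_budget": 4},
}
# Different content: the search budget changed from 4 to 8.
changed = {**first, "search": {**first["search"], "candidate_budget": 8}}

print("same content, reordered keys:", receipt_digest(first) == receipt_digest(reordered))
print("changed budget:", receipt_digest(first) == receipt_digest(changed))
```

Key order does not affect the digest because `sort_keys=True` also sorts nested objects, while any change to a recorded value produces a different digest. That makes the short hash a cheap way to notice that two runs were logged under different budgets or gates.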

What to carry forward

The Bitter Lesson isn't "throw compute at every problem." It's a test for your design:

Can new labeled evidence improve the capability without new handwritten branches?
Can additional search or verification budget be measured against a held-out task?
Are policy gates protecting actions rather than trying to encode every meaning?
Can you report the compute budget and the evaluation that justified it?

Language models begin by converting raw text into discrete pieces. Tokenizers are a natural next place to ask the same question: which reusable units can be learned from data, and what mistakes are introduced at that boundary?

Mastery check

Evaluation rubric

Foundational: State Sutton's thesis and distinguish learning from search.
Practical: Run the incident-router examples and explain why a correction is data for a learner but rule debt for a keyword router.
Quantitative: Estimate dense-model training FLOPs and compare model/token allocations under a fixed budget.
Production: Design one learned component, one policy gate, and one logged metric for an automated AI-operations workflow.

Common pitfalls

"Compute will fix bad data"

Symptom: You enlarge a model or sample more candidates while labels leak or contradict one another.
Cause: General methods can amplify the data and objectives they're given.
Fix: Keep the dataset-quality and held-out evaluation discipline from the earlier data and production-ML lessons.

"Rules are forbidden"

Symptom: You let a learned router auto-promote low-score candidates without a policy boundary.
Cause: You've confused scalable capability with unconstrained execution.
Fix: Use rules for auditable gates and learning for messy recognition.

"More search always improves an answer"

Symptom: Cost and latency rise, but accepted answers don't improve.
Cause: Candidate generation or verification is too weak for added budget to help.
Fix: Evaluate budget levels on held-out tasks and stop spending where gains vanish.

"A deterministic tie-break is safe"

Symptom: An unseen issue is sent to the first route in a sorted label list.
Cause: The classifier turns missing evidence into a reproducible but unsupported action.
Fix: Abstain when no route has positive evidence or when several routes share the best score.

Next Step

Continue to BPE, WordPiece, and SentencePiece

You now know why methods that absorb more data and computation matter. Next you'll build the text-to-token boundary that makes language-model learning possible and measure the tradeoffs it introduces.

References

[1] Sutton, R. S. (2019). The Bitter Lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint. https://arxiv.org/abs/2408.03314

[3] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361

[4] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. https://arxiv.org/abs/2203.15556

[5] Silver, D., et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint. https://arxiv.org/abs/1712.01815

LearnCore LLM FoundationsThe Bitter Lesson & Compute

📈MediumReasoning & Scaling

The Bitter Lesson & Compute

Use Sutton's Bitter Lesson to compare rules, learning, and search through a measured AI-incident routing lab.

15 min read

Learning path

Step 48 of 158 in the full curriculum

Monitoring Predictive Models BPE, WordPiece, and SentencePiece

More compute doesn't automatically fix a system. You'll build a tiny AI-incident router, see when rules are useful, measure where they fail, and decide what additional budget can honestly buy.

Where effort compounds

Suppose Alex writes issue #48291: "Decode latency spiked after yesterday's reranker deploy." An AI operations system must choose a route: serving latency, answer quality, or access/auth.

Those are different ways to spend effort:

Approach	What a human specifies	What can improve later	Failure to watch
Rules	Keywords and branches	More rules	New phrasing escapes the rulebook
Learning	Data, objective, evaluation	More clean labels and training compute	Bad labels or leakage teach the wrong behavior
Search	Candidates and a checker	More candidate or verification budget	A weak checker rewards the wrong answer

See a rulebook reach its boundary

Start with a router that a developer can ship in an afternoon. Three keywords cover obvious cases.

01-rules-have-boundaries.py

cases = [
    ("decode timeout in prod", "latency"),
    ("tokens arrive slowly", "latency"),
    ("answer cites wrong doc", "quality"),
    ("citation points to stale source", "quality"),
    ("password reset please", "access"),
    ("can't authenticate", "access"),
]

def rule_route(text: str) -> str:
    text = text.lower()
    rules = {"timeout": "latency", "citation": "quality", "password": "access"}
    for keyword, route in rules.items():
        if keyword in text:
            return route
    return "manual_review"

predictions = [(text, expected, rule_route(text)) for text, expected in cases]
correct = sum(expected == predicted for _, expected, predicted in predictions)
misses = [text for text, expected, predicted in predictions if expected != predicted]

print(f"correct={correct}/{len(cases)}")
print("misses:", misses)
assert correct == 3

Rule router baseline

correct=3/6
misses: ['tokens arrive slowly', 'answer cites wrong doc', "can't authenticate"]

Now add rules that fix today's misses, then test tomorrow's wording.

02-rule-churn.py

old_cases = [
    ("decode timeout in prod", "latency"),
    ("tokens arrive slowly", "latency"),
    ("answer cites wrong doc", "quality"),
    ("citation points to stale source", "quality"),
    ("password reset please", "access"),
    ("can't authenticate", "access"),
]
new_cases = [
    ("TTFT spiked today", "latency"),
    ("response used unsupported claim", "quality"),
    ("OAuth callback fails again", "access"),
]

expanded_rules = {
    "timeout": "latency",
    "tokens": "latency",
    "cites": "quality",
    "citation": "quality",
    "stale source": "quality",
    "password": "access",
    "authenticate": "access",
}

def route(text: str) -> str:
    lowered = text.lower()
    for keyword, label in expanded_rules.items():
        if keyword in lowered:
            return label
    return "manual_review"

old_score = sum(route(text) == label for text, label in old_cases)
new_score = sum(route(text) == label for text, label in new_cases)
print(f"rules={len(expanded_rules)} old_set={old_score}/6 new_set={new_score}/3")
print("new misses:", [text for text, label in new_cases if route(text) != label])
assert old_score == 6 and new_score == 0

Rule churn under new wording

rules=7 old_set=6/6 new_set=0/3
new misses: ['TTFT spiked today', 'response used unsupported claim', 'OAuth callback fails again']

The second result is the treadmill Sutton warns about. More engineer-hours repair yesterday's errors but don't automatically create a method that adapts to tomorrow's distribution.

Replace phrases with evidence from examples

03-token-evidence.py

import re
from collections import Counter

training = [
    ("decode latency timeout", "latency"),
    ("ttft spike tokens", "latency"),
    ("wrong citation unsupported answer", "quality"),
    ("stale source hallucination", "quality"),
    ("password login access", "access"),
    ("oauth callback auth", "access"),
]

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

by_label = {}
for text, label in training:
    by_label.setdefault(label, Counter()).update(tokens(text))

print("quality evidence:", sorted(by_label["quality"]))
print("latency evidence:", sorted(by_label["latency"]))
assert by_label["quality"]["citation"] == 1

Issue tokens

quality evidence: ['answer', 'citation', 'hallucination', 'source', 'stale', 'unsupported', 'wrong']
latency evidence: ['decode', 'latency', 'spike', 'timeout', 'tokens', 'ttft']

Use those counts as a tiny learned classifier: score each route by how many words it shares with a new issue.

04-learn-from-labeled-issues.py

import re
from collections import Counter, defaultdict

training = [
    ("decode latency timeout", "latency"),
    ("ttft spike tokens", "latency"),
    ("wrong citation unsupported answer", "quality"),
    ("stale source hallucination", "quality"),
    ("password login access", "access"),
    ("oauth callback auth", "access"),
]
held_out = [
    ("latency spike tokens", "latency"),
    ("unsupported citation answer", "quality"),
    ("reset password", "access"),
    ("stale source answer", "quality"),
    ("bad graph query", "manual_review"),
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

counts = defaultdict(Counter)
for text, label in training:
    counts[label].update(tokenize(text))

def predict(text: str) -> str:
    words = tokenize(text)
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    best_score = max(scores.values())
    winners = [label for label, score in scores.items() if score == best_score]
    return winners[0] if best_score > 0 and len(winners) == 1 else "manual_review"

results = [(text, label, predict(text)) for text, label in held_out]
correct = sum(expected == predicted for _, expected, predicted in results)
print(f"held_out={correct}/{len(held_out)}")
print("predictions:", [predicted for _, _, predicted in results])
assert correct == 5

Learned router evaluation

held_out=5/5
predictions: ['latency', 'quality', 'access', 'quality', 'manual_review']

Because the router is learned, a human correction doesn't need a new branch. Add the corrected issue as one more labeled row, retrain, and the issue that previously fell to manual_review gets a route.

05-add-one-correction.py

import re
from collections import Counter, defaultdict

base = [
    ("timeout tokens", "latency"),
    ("wrong citation", "quality"),
    ("password reset", "access"),
]
issue = "bad graph query"

def train_and_predict(rows: list[tuple[str, str]], text: str) -> str:
    counts = defaultdict(Counter)
    for example, label in rows:
        counts[label].update(re.findall(r"[a-z]+", example.lower()))
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    best_score = max(scores.values())
    winners = [label for label, score in scores.items() if score == best_score]
    return winners[0] if best_score > 0 and len(winners) == 1 else "manual_review"

before = train_and_predict(base, issue)
after = train_and_predict(base + [("bad graph query citation", "quality")], issue)
print(f"before_correction={before}")
print(f"after_correction={after}")
assert before == "manual_review" and after == "quality"

Correction becomes training data

before_correction=manual_review
after_correction=quality
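
For contrast, here is a minimal sketch of the same correction applied to a keyword router. The rule list, the `keyword_route` helper, and the example issues are hypothetical and written for this comparison; they aren't the lesson's rulebook. The sketch only shows that a handwritten patch can change the route of an issue nobody meant to touch.

```python
def keyword_route(text: str, rules: list[tuple[str, str]]) -> str:
    """Return the route of the first rule whose keyword appears in the text."""
    words = text.lower().split()
    for keyword, route in rules:
        if keyword in words:
            return route
    return "manual_review"

# Hypothetical starting rulebook: (keyword, route), checked in order.
rules = [("latency", "latency"), ("citation", "quality"), ("password", "access")]
issue = "bad graph query"

before = keyword_route(issue, rules)

# The correction becomes a new handwritten rule, placed first so it wins.
patched = [("graph", "quality")] + rules
after = keyword_route(issue, patched)

# An unrelated issue that happens to mention "graph" now changes route too.
unrelated = "graph latency regression"
unrelated_before = keyword_route(unrelated, rules)
unrelated_after = keyword_route(unrelated, patched)

print(f"before_patch={before} after_patch={after}")
print(f"unrelated_before={unrelated_before} unrelated_after={unrelated_after}")
```

Each patch of this kind adds a branch whose interactions with earlier branches have to be reviewed by hand, which is the "rule debt" the evaluation rubric refers to.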

Search spends compute after training

Learning spends compute before deployment. Search spends it per request: generate several candidate actions, score each one with a verifier, and keep the best candidate the budget allows.

06-search-with-a-verifier.py

issue = "RAG answer failed citation_precision on run R42"
candidates = [
    ("quality", "Retry later.", []),
    ("promote", "Promote anyway.", []),
    ("quality", "Open eval trace.", ["run R42"]),
    ("quality", "Open eval trace; block promotion until citation_precision recovers.", ["run R42", "citation_precision"]),
]

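def verifier(candidate: tuple[str, str, list[str]]) -> int:
    route, _, evidence = candidate
    # Reject any candidate that leaves the quality route (for example "promote").
    if route != "quality":
        return -1
    # Otherwise prefer the action that cites more evidence.
    return len(evidence)

# A larger budget lets the search consider more candidates, in list order.
for budget in (1, 2, 4):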
    selected = max(candidates[:budget], key=verifier)
    print(f"budget={budget} route={selected[0]} evidence={len(selected[2])} action={selected[1]}")

assert verifier(max(candidates, key=verifier)) == 2

Inference-time search budget

budget=1 route=quality evidence=0 action=Retry later.
budget=2 route=quality evidence=0 action=Retry later.
budget=4 route=quality evidence=2 action=Open eval trace; block promotion until citation_precision recovers.

Connect the lesson to language-model training

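For a dense decoder-only Transformer, engineers often estimate training compute with:

C \approx 6ND

Here C is the total training compute in FLOPs, N is the number of model parameters, and D is the number of training tokens.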

07-estimate-training-flops.py

def dense_training_flops(parameters: int, tokens: int) -> int:
    return 6 * parameters * tokens

plans = [
    ("small study", 1_000_000_000, 20_000_000_000),
    ("larger run", 7_000_000_000, 140_000_000_000),
]

for name, parameters, tokens in plans:
    flops = dense_training_flops(parameters, tokens)
    print(f"{name}: {flops / 1e21:.2f} zettaFLOPs")

assert dense_training_flops(7_000_000_000, 140_000_000_000) == 5_880_000_000_000_000_000_000

Training FLOPs estimate

small study: 0.12 zettaFLOPs
larger run: 5.88 zettaFLOPs
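
As a hand check on the larger run, the arithmetic can be written out with N as the parameter count and D as the number of training tokens. The split into roughly 2ND FLOPs for the forward pass and 4ND for the backward pass is the usual accounting behind the factor of 6; it's an assumption added here for intuition, not something the script computes.

```latex
\begin{aligned}
C &\approx \underbrace{2ND}_{\text{forward}} + \underbrace{4ND}_{\text{backward}} = 6ND \\
  &= 6 \times (7 \times 10^{9}) \times (1.4 \times 10^{11}) \\
  &= 5.88 \times 10^{21}\ \text{FLOPs} = 5.88\ \text{zettaFLOPs}
\end{aligned}
```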

Use the FLOPs approximation to expose the tradeoff under one fixed budget. This calculation doesn't select the best model; it only tells you how many tokens each model size can afford at that budget.

08-allocate-a-fixed-training-budget.py

budget = 6 * 7_000_000_000 * 140_000_000_000
model_sizes = [1_000_000_000, 7_000_000_000, 14_000_000_000]

for parameters in model_sizes:
    affordable_tokens = budget // (6 * parameters)
    print(f"N={parameters / 1e9:.0f}B -> D={affordable_tokens / 1e9:.0f}B tokens")

assert budget // (6 * 14_000_000_000) == 70_000_000_000

Fixed-budget tradeoff

N=1B -> D=980B tokens
N=7B -> D=140B tokens
N=14B -> D=70B tokens
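
One way to read these allocations is as a tokens-per-parameter ratio. The sketch below reuses the same budget and model sizes; the `tokens_per_parameter` helper is hypothetical. The compute-optimal results in [4] are often summarized as roughly 20 tokens per parameter. Treat that figure as a rule of thumb to compare against, not as something this lesson measures.

```python
# Same fixed budget as the lesson's example: 6 * N * D for a 7B model on 140B tokens.
BUDGET_FLOPS = 6 * 7_000_000_000 * 140_000_000_000

def tokens_per_parameter(parameters: int, budget_flops: int = BUDGET_FLOPS) -> float:
    """Return D / N when the whole budget is spent under C ~= 6 * N * D."""
    affordable_tokens = budget_flops // (6 * parameters)
    return affordable_tokens / parameters

ratios = {
    f"{parameters // 1_000_000_000}B": tokens_per_parameter(parameters)
    for parameters in (1_000_000_000, 7_000_000_000, 14_000_000_000)
}

for name, ratio in ratios.items():
    print(f"N={name}: {ratio:.0f} tokens per parameter")
```

Under this budget the three plans land at 980, 20, and 5 tokens per parameter. The ratio describes each allocation; only a held-out evaluation can say which one is better for your task.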

History is evidence, not a slogan

Sutton's essay supports a broad historical pattern. A careful reader should also notice where the evidence stops.

Domain	What the source supports	What you shouldn't claim from it
Chess and Go	Sutton describes delayed success from scaled search and learning; AlphaZero learned chess, shogi, and Go from self-play with only game rules and reached superhuman play within 24 hours. [1][5]	That search eliminates every useful prior or safety rule
Speech and vision	Sutton points to statistical/deep-learning methods replacing increasingly elaborate human feature engineering. [1]	That modern architectures contain no inductive biases
Language models	Scaling studies measure improvement of Transformer language models as model size, tokens, and compute change. [3][4]	That LLM pretraining proves Sutton's full claim about agents learning from experience

Keep rules where they belong

09-keep-policy-gates.py

def release_action(proposed_route: str, eval_score: float, evidence: set[str]) -> str:
    if proposed_route == "promote" and eval_score < 0.95:
        return "human_review_failed_gate"
    if proposed_route == "promote" and "eval_report" not in evidence:
        return "request_evidence"
    return proposed_route

examples = [
    ("promote", 0.98, {"eval_report"}),
    ("promote", 0.91, {"eval_report"}),
    ("promote", 0.98, set()),
]
decisions = [release_action(*example) for example in examples]
print("decisions:", decisions)
assert decisions == ["promote", "human_review_failed_gate", "request_evidence"]

Guardrail around capability

decisions: ['promote', 'human_review_failed_gate', 'request_evidence']

10-log-the-compute-decision.py

from hashlib import sha256
from json import dumps

evaluation = {
    "split": "incident-router-held-out-v1",
    "correct": 5,
    "total": 5,
    "manual_review_routes": 1,
}
evaluation["accuracy"] = evaluation["correct"] / evaluation["total"]
receipt = {
    "component": "incident-router-word-overlap-v1",
    "training": {
        "procedure": "token-overlap-counts-v1",
        "examples": 6,
    },
    "evaluation": evaluation,
    "search": {
        "candidate_budget": 4,
        "verifier": "eval-evidence-v1",
    },
    "policy_gates": ["promotion_below_0_95_requires_review", "promotion_requires_eval_report"],
    "monitor": "manual_escalation_rate",
}
payload = dumps(receipt, sort_keys=True, separators=(",", ":"))

print(dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])
assert receipt["search"]["candidate_budget"] == 4
assert receipt["evaluation"]["accuracy"] == 1.0

Compute decision receipt

{
  "component": "incident-router-word-overlap-v1",
  "evaluation": {
    "accuracy": 1.0,
    "correct": 5,
    "manual_review_routes": 1,
    "split": "incident-router-held-out-v1",
    "total": 5
  },
  "monitor": "manual_escalation_rate",
  "policy_gates": [
    "promotion_below_0_95_requires_review",
    "promotion_requires_eval_report"
  ],
  "search": {
    "candidate_budget": 4,
    "verifier": "eval-evidence-v1"
  },
  "training": {
    "examples": 6,
    "procedure": "token-overlap-counts-v1"
  }
}
receipt sha256: 957f864a4562

What to carry forward

The Bitter Lesson isn't "throw compute at every problem." It's a test for your design:

Can new labeled evidence improve the capability without new handwritten branches?
Can additional search or verification budget be measured against a held-out task?
Are policy gates protecting actions rather than trying to encode every meaning?
Can you report the compute budget and the evaluation that justified it?

Mastery check

Checkpoints

What exactly does Sutton name as the two general methods that can make use of large amounts of computation?
Why doesn't a promotion approval threshold contradict the Bitter Lesson?
What does C \approx 6ND tell you, and what can't it tell you?
When can extra inference-time search hurt instead of help?

Evaluation rubric

Foundational: State Sutton's thesis and distinguish learning from search.
Practical: Run the incident-router examples and explain why a correction is data for a learner but rule debt for a keyword router.
Quantitative: Estimate dense-model training FLOPs and compare model/token allocations under a fixed budget.
Production: Design one learned component, one policy gate, and one logged metric for an automated AI-operations workflow.

Common pitfalls

"Compute will fix bad data"

Symptom: You enlarge a model or sample more candidates while labels leak or contradict one another.
Cause: General methods can amplify the data and objectives they're given.
Fix: Keep the dataset-quality and held-out evaluation discipline from the earlier data and production-ML lessons.
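
Two of these data problems can be checked before any compute is spent. The sketch below looks for contradictory labels on the same text and for held-out issues that also appear in training. The rows and helper names are hypothetical, and exact-match comparison after whitespace and case normalization is a simplifying assumption; near-duplicates would need a fuzzier check.

```python
# Hypothetical rows for illustration; not data from the lesson.
training = [
    ("decode latency timeout", "latency"),
    ("wrong citation", "quality"),
    ("Wrong  citation", "latency"),   # same text as the row above, different label
    ("password reset", "access"),
]
held_out = [
    ("password reset", "access"),     # also present in training: leakage
    ("stale source answer", "quality"),
]

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def contradictory_texts(rows: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each normalized text to its labels when it has more than one."""
    labels_by_text: dict[str, set[str]] = {}
    for text, label in rows:
        labels_by_text.setdefault(normalize(text), set()).add(label)
    return {text: sorted(labels) for text, labels in labels_by_text.items() if len(labels) > 1}

def leaked_texts(train_rows: list[tuple[str, str]], eval_rows: list[tuple[str, str]]) -> list[str]:
    """Return held-out texts that also occur in the training rows."""
    train_texts = {normalize(text) for text, _ in train_rows}
    return sorted({normalize(text) for text, _ in eval_rows} & train_texts)

conflicts = contradictory_texts(training)
leaks = leaked_texts(training, held_out)
print("conflicts:", conflicts)
print("leaks:", leaks)
```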

"Rules are forbidden"

Symptom: You let a learned router auto-promote low-score candidates without a policy boundary.
Cause: You've confused scalable capability with unconstrained execution.
Fix: Use rules for auditable gates and learning for messy recognition.

"More search always improves an answer"

Symptom: Cost and latency rise, but accepted answers don't improve.
Cause: Candidate generation or verification is too weak for added budget to help.
Fix: Evaluate budget levels on held-out tasks and stop spending where gains vanish.
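
A budget sweep along these lines might look like the sketch below. The verifier scores, the acceptance threshold, and the `min_gain` cutoff are all hypothetical values chosen for illustration; in practice they would come from your own held-out tasks and cost constraints.

```python
# Hypothetical held-out tasks: verifier scores for candidates, in generation order.
held_out_tasks = [
    [2, 0, 0, 0, 1, 0, 0, 0],   # accepted with budget 1
    [0, 2, 0, 0, 0, 0, 0, 0],   # accepted with budget 2
    [0, 0, 0, 2, 0, 0, 0, 0],   # accepted with budget 4
    [0, 1, 0, 1, 0, 1, 1, 0],   # never reaches the target score
]
TARGET_SCORE = 2  # assumed acceptance threshold for the verifier

def acceptance_rate(budget: int) -> float:
    """Fraction of tasks whose best candidate within the budget is accepted."""
    accepted = sum(max(scores[:budget]) >= TARGET_SCORE for scores in held_out_tasks)
    return accepted / len(held_out_tasks)

def choose_budget(budgets: tuple[int, ...], min_gain: float = 0.05) -> tuple[int, float]:
    """Walk budgets in increasing order and stop at the first one that adds too little."""
    chosen = budgets[0]
    best_rate = acceptance_rate(chosen)
    for budget in budgets[1:]:
        rate = acceptance_rate(budget)
        if rate - best_rate < min_gain:
            break
        chosen, best_rate = budget, rate
    return chosen, best_rate

for budget in (1, 2, 4, 8):
    print(f"budget={budget} acceptance={acceptance_rate(budget):.2f}")

selected = choose_budget((1, 2, 4, 8))
print(f"selected budget={selected[0]} acceptance={selected[1]:.2f}")
```

With these scores the sweep settles on a budget of 4, because doubling to 8 accepts no additional task. A greedy stop like this can halt early on a plateau, so it's worth looking at the full curve before fixing a budget.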

"A deterministic tie-break is safe"

Symptom: An unseen issue is sent to the first route in a sorted label list.
Cause: The classifier turns missing evidence into a reproducible but unsupported action.
Fix: Abstain when no route has positive evidence or when several routes share the best score.
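
The difference between the two behaviors is small in code. The sketch below uses hypothetical per-route word counts in the spirit of the lesson's learned router and compares an alphabetical tie-break with the abstaining rule.

```python
from collections import Counter

# Hypothetical per-route word counts; not the lesson's training set.
counts = {
    "access": Counter({"password": 1, "login": 1}),
    "latency": Counter({"timeout": 1, "tokens": 1}),
    "quality": Counter({"citation": 1, "wrong": 1}),
}

def score_routes(text: str) -> dict[str, int]:
    """Count how many issue words each route has seen in training."""
    words = text.lower().split()
    return {label: sum(counter[word] for word in words) for label, counter in counts.items()}

def first_sorted_route(text: str) -> str:
    """Unsafe variant: always returns a route, breaking ties alphabetically."""
    scores = score_routes(text)
    best = max(scores.values())
    return sorted(label for label, score in scores.items() if score == best)[0]

def abstaining_route(text: str) -> str:
    """Return a route only when exactly one route has the best positive score."""
    scores = score_routes(text)
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else "manual_review"

for issue in ("bad graph query", "timeout wrong", "password login"):
    print(f"{issue!r}: first_sorted={first_sorted_route(issue)} abstaining={abstaining_route(issue)}")
```

The unseen issue and the tied issue both get a confident-looking route from the alphabetical variant (`access` and `latency`), while the abstaining variant sends both to `manual_review`. Only the issue with clear evidence is routed the same way by both.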

References

[1] Sutton, R. S. (2019). The Bitter Lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint.

[3] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361

[4] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. https://arxiv.org/abs/2203.15556

[5] Silver, D., et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint. https://arxiv.org/abs/1712.01815
