The Bitter Lesson & Compute

Use Sutton's Bitter Lesson to compare rules, learning, and search through a measured support-ticket routing lab.


You just built a conventional ML product path: time-safe features, deployed predictive models, and monitoring gates that preserve held-out evidence and an audit trail. Now comes a harder design question: when your system is wrong, should you write another rule, learn from more examples, or spend more compute searching among possible answers?

Rich Sutton's 2019 essay The Bitter Lesson gives a demanding answer. Across long stretches of AI history, he argues, methods that can use increasing computation eventually outperform systems built around expert-written knowledge. He names two broadly scalable methods: learning, which improves a model from experience, and search, which explores choices before acting.[1]

This lesson won't treat "more compute" as magic. You'll build a tiny ShopFlow ticket router, see when rules are useful, measure where they fail, and decide what additional budget can honestly buy.

Where effort compounds

Suppose Alex writes ticket #48291: "My FastShip parcel still hasn't arrived. I need my money back." A support system must choose a route: shipping investigation, refund review, or account help.

A rule can say if "refund" in text: route_to("refund"). That works on wording the author anticipated. A learner can fit patterns from labeled tickets such as money back, damaged return, and delivery delayed. A search step can generate several possible actions and check each one against policy evidence before committing.

Those are different ways to spend effort:

| Approach | What a human specifies | What can improve later | Failure to watch |
| --- | --- | --- | --- |
| Rules | Keywords and branches | More rules | New phrasing escapes the rulebook |
| Learning | Data, objective, evaluation | More clean labels and training compute | Bad labels or leakage teach the wrong behavior |
| Search | Candidates and a checker | More candidate or verification budget | A weak checker rewards the wrong answer |

Sutton's thesis concerns the long-run direction of research, not a ban on rules. A refund cap or a required human approval can be exactly the right safety boundary. The warning is about making a growing rulebook carry the core intelligence of a messy task.

[Figure: Support routing design showing keyword rules, learning from labeled tickets, and search with verification as three ways to spend engineering and compute budget.]
Rules handle known cases. Learning and search are attractive because additional labeled examples or verification budget can improve the method without adding one branch per new phrase.

See a rulebook reach its boundary

Start with a router that a developer can ship in an afternoon. Three keywords cover obvious cases.

01-rules-have-boundaries.py
    cases = [
        ("tracking says delivered", "shipping"),
        ("where is my package", "shipping"),
        ("refund my order", "refund"),
        ("money back for cracked mug", "refund"),
        ("password reset please", "account"),
        ("can't log in", "account"),
    ]

    def rule_route(text: str) -> str:
        text = text.lower()
        rules = {"tracking": "shipping", "refund": "refund", "password": "account"}
        for keyword, route in rules.items():
            if keyword in text:
                return route
        return "manual_review"

    predictions = [(text, expected, rule_route(text)) for text, expected in cases]
    correct = sum(expected == predicted for _, expected, predicted in predictions)
    misses = [text for text, expected, predicted in predictions if expected != predicted]

    print(f"correct={correct}/{len(cases)}")
    print("misses:", misses)
    assert correct == 3

Rule router baseline

    correct=3/6
    misses: ['where is my package', 'money back for cracked mug', "can't log in"]

The router isn't foolish. It catches exact, high-signal phrases quickly. Its limit is visible: compute won't make those three conditions recognize money back or log in. A human must anticipate and encode every expansion.

Now add rules that fix today's misses, then test tomorrow's wording.

02-rule-churn.py
    old_cases = [
        ("tracking says delivered", "shipping"),
        ("where is my package", "shipping"),
        ("refund my order", "refund"),
        ("money back for cracked mug", "refund"),
        ("password reset please", "account"),
        ("can't log in", "account"),
    ]
    new_cases = [
        ("parcel hasn't moved", "shipping"),
        ("reverse the charge for this return", "refund"),
        ("sign-in fails again", "account"),
    ]

    expanded_rules = {
        "tracking": "shipping",
        "where is": "shipping",
        "refund": "refund",
        "money back": "refund",
        "password": "account",
        "log in": "account",
    }

    def route(text: str) -> str:
        lowered = text.lower()
        for keyword, label in expanded_rules.items():
            if keyword in lowered:
                return label
        return "manual_review"

    old_score = sum(route(text) == label for text, label in old_cases)
    new_score = sum(route(text) == label for text, label in new_cases)
    print(f"rules={len(expanded_rules)} old_set={old_score}/6 new_set={new_score}/3")
    print("new misses:", [text for text, label in new_cases if route(text) != label])
    assert old_score == 6 and new_score == 0

Rule churn under new wording

    rules=6 old_set=6/6 new_set=0/3
    new misses: ["parcel hasn't moved", 'reverse the charge for this return', 'sign-in fails again']

The second result is the treadmill Sutton warns about. More engineer-hours repair yesterday's errors but don't automatically create a method that adapts to tomorrow's distribution.

[Figure: Conceptual curve for a ShopFlow support router: routing rules improve early and flatten, while a learned router can continue improving as labeled tickets and training budget grow.]
The crossover is a design hypothesis, not a guaranteed curve. Learning earns its extra budget only when labels, evaluation, and deployment feedback remain trustworthy.

Replace phrases with evidence from examples

A learned system needs a representation. Before tokenization becomes a full lesson, use the smallest representation possible: lowercase words. A ticket becomes a set of observed terms, and training records which route each term appeared with.

03-token-evidence.py
    import re
    from collections import Counter

    training = [
        ("late parcel tracking update", "shipping"),
        ("courier delivery delayed", "shipping"),
        ("refund damaged item", "refund"),
        ("money back return", "refund"),
        ("password login account", "account"),
        ("sign in reset", "account"),
    ]

    def tokens(text: str) -> list[str]:
        return re.findall(r"[a-z]+", text.lower())

    by_label = {}
    for text, label in training:
        by_label.setdefault(label, Counter()).update(tokens(text))

    print("refund evidence:", sorted(by_label["refund"]))
    print("shipping evidence:", sorted(by_label["shipping"]))
    assert by_label["refund"]["return"] == 1

Ticket tokens

    refund evidence: ['back', 'damaged', 'item', 'money', 'refund', 'return']
    shipping evidence: ['courier', 'delayed', 'delivery', 'late', 'parcel', 'tracking', 'update']

That representation is primitive, but its source of knowledge is different from the rulebook. No engineer chose that return or courier should indicate a route. Labeled examples supplied that association.

Use those counts as a tiny learned classifier: score each route by how many words it shares with a new ticket.

04-learn-from-labeled-tickets.py
    import re
    from collections import Counter, defaultdict

    training = [
        ("late parcel tracking update", "shipping"),
        ("courier delivery delayed", "shipping"),
        ("refund damaged item", "refund"),
        ("money back return", "refund"),
        ("password login account", "account"),
        ("sign in reset", "account"),
    ]
    held_out = [
        ("late delivery tracking", "shipping"),
        ("return damaged item", "refund"),
        ("reset password", "account"),
        ("money back damaged", "refund"),
    ]

    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z]+", text.lower())

    counts = defaultdict(Counter)
    for text, label in training:
        counts[label].update(tokenize(text))

    def predict(text: str) -> str:
        words = tokenize(text)
        scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
        return max(sorted(scores), key=scores.get)

    results = [(text, label, predict(text)) for text, label in held_out]
    correct = sum(expected == predicted for _, expected, predicted in results)
    print(f"held_out={correct}/{len(held_out)}")
    print("predictions:", [predicted for _, _, predicted in results])
    assert correct == 4

Learned router evaluation

    held_out=4/4
    predictions: ['shipping', 'refund', 'account', 'refund']

Don't confuse this tiny word-overlap model with an LLM. Its role is to make the principle observable: a general learning procedure can absorb new labeled examples without a developer adding a condition for each phrase.

05-add-one-correction.py
    import re
    from collections import Counter, defaultdict

    base = [
        ("tracking parcel", "shipping"),
        ("refund damaged", "refund"),
        ("password reset", "account"),
    ]
    ticket = "exchange cracked mug"

    def train_and_predict(rows: list[tuple[str, str]], text: str) -> str:
        counts = defaultdict(Counter)
        for example, label in rows:
            counts[label].update(re.findall(r"[a-z]+", example.lower()))
        words = re.findall(r"[a-z]+", text.lower())
        scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
        return max(sorted(scores), key=scores.get)

    before = train_and_predict(base, ticket)
    after = train_and_predict(base + [("exchange cracked mug return", "refund")], ticket)
    print(f"before_correction={before}")
    print(f"after_correction={after}")
    assert before == "account" and after == "refund"

Correction becomes training data

    before_correction=account
    after_correction=refund

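The correction alone doesn't prove generalization. The "before" prediction of account is itself an artifact: none of the ticket's words appear in the base data, so every route scores zero and the alphabetical tie-break returns the first label. You'd still evaluate on untouched tickets, just as you did for datasets and training loops. The example does show a useful operational property: product feedback can become evidence for the next training run instead of another permanent branch.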
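One way to check whether a correction helped beyond the single ticket is to score the router on untouched tickets before and after adding it. The sketch below reuses the word-overlap idea from the examples above; the `untouched` tickets and the `fit`, `predict`, and `accuracy` helpers are illustrative assumptions, not ShopFlow data or a fixed API.

```python
import re
from collections import Counter, defaultdict

def fit(rows):
    """Count how often each lowercase word appears under each label."""
    counts = defaultdict(Counter)
    for example, label in rows:
        counts[label].update(re.findall(r"[a-z]+", example.lower()))
    return counts

def predict(counts, text):
    """Pick the label with the most word overlap; ties resolve alphabetically."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {label: sum(counter[word] for word in words) for label, counter in counts.items()}
    return max(sorted(scores), key=scores.get)

def accuracy(counts, rows):
    """Fraction of rows whose predicted label matches the expected label."""
    return sum(predict(counts, text) == label for text, label in rows) / len(rows)

base = [
    ("tracking parcel", "shipping"),
    ("refund damaged", "refund"),
    ("password reset", "account"),
]
correction = [("exchange cracked mug return", "refund")]

# Hypothetical tickets that were never used for training or for the correction.
untouched = [
    ("mug arrived cracked", "refund"),
    ("parcel tracking stuck", "shipping"),
    ("reset my password", "account"),
    ("return this item", "refund"),
]

before = accuracy(fit(base), untouched)
after = accuracy(fit(base + correction), untouched)
print(f"untouched accuracy before={before:.2f} after={after:.2f}")
```

On this toy set the score moves from 0.50 to 1.00, but four hand-picked tickets are far too few to support a real claim. The point is the procedure: the comparison runs on rows the correction never touched.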

[Figure: Diagram with four stages: "Resolved tickets fit router", "New ticket candidate route", "Verify policy evidence", and "Route or escalate, log outcome".]

Search spends compute after training

Learning isn't the only scalable method in Sutton's essay. Search spends computation at decision time. For a support agent, search might mean proposing multiple actions, retrieving evidence for each, and sending the best supported route forward.

[Figure: ShopFlow support-routing pipeline separating train-time learning from labeled tickets and inference-time search over candidate routes checked against policy evidence.]
Training spends compute to fit weights from past tickets. Search spends compute per new request when several plausible actions need evidence or escalation.

This next miniature isn't a language model. It's a transparent candidate-and-verifier loop that lets you see why additional inference budget helps only when better candidates and a meaningful checker exist.

06-search-with-a-verifier.py
    ticket = "FastShip parcel for order A10234 has no scan for three days"
    candidates = [
        ("shipping", "Wait.", []),
        ("refund", "Issue refund now.", []),
        ("shipping", "Open carrier trace.", ["FastShip"]),
        ("shipping", "Open carrier trace; escalate after 48 hours without a scan.", ["FastShip", "48 hours"]),
    ]

    def verifier(candidate: tuple[str, str, list[str]]) -> int:
        route, _, evidence = candidate
        if route != "shipping":
            return -1
        return len(evidence)

    for budget in (1, 2, 4):
        selected = max(candidates[:budget], key=verifier)
        print(f"budget={budget} route={selected[0]} evidence={len(selected[2])} action={selected[1]}")

    assert verifier(max(candidates, key=verifier)) == 2

Inference-time search budget

    budget=1 route=shipping evidence=0 action=Wait.
    budget=2 route=shipping evidence=0 action=Wait.
    budget=4 route=shipping evidence=2 action=Open carrier trace; escalate after 48 hours without a scan.

The first candidate is cheap and weak. With four candidates, the verifier can choose an action tied to carrier and escalation evidence. If every candidate were wrong, or if the verifier rewarded fast refunds without policy support, extra search would amplify error instead.
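The failure case described above can be made concrete with a second, deliberately weak checker. In this sketch `speed_verifier` is a hypothetical stand-in for a checker that rewards short, fast-sounding actions instead of policy evidence, and the candidate list is invented for illustration.

```python
candidates = [
    ("shipping", "Open carrier trace.", ["FastShip"]),
    ("refund", "Issue refund now.", []),
    ("refund", "Issue refund and close ticket.", []),
]

def policy_verifier(candidate):
    """Same idea as the verifier above: reject non-shipping routes, count evidence."""
    route, _, evidence = candidate
    return -1 if route != "shipping" else len(evidence)

def speed_verifier(candidate):
    """Hypothetical weak checker: prefers the shortest action text, ignores evidence."""
    _, action, _ = candidate
    return -len(action)

selected = {}
for name, verifier in (("policy", policy_verifier), ("speed", speed_verifier)):
    for budget in (1, 3):
        route, action, _ = max(candidates[:budget], key=verifier)
        selected[(name, budget)] = route
        print(f"{name} budget={budget} route={route} action={action}")
```

With the policy checker, a larger budget keeps the supported shipping action. With the speed checker, the same larger budget surfaces an unsupported refund, so the extra candidates made the outcome worse.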

Snell and collaborators study this question for LLM reasoning: their experiments find that the effectiveness of different test-time strategies depends on prompt difficulty, and an adaptive allocation can use inference computation more efficiently than a fixed best-of-N baseline on their evaluated tasks.[2] That result supports a conditional claim, not "thinking longer always works."

Connect the lesson to language-model training

Language models give learning an unusually clean objective: predict the next token. Kaplan and collaborators measured smooth empirical power-law relationships between cross-entropy loss and model size, dataset size, and training computation for the Transformer language models they studied.[3] That's evidence that a general learning objective can turn additional scale into predictable improvement within an experimental regime.
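A power-law relationship is a straight line in log-log space, which is how such trends are usually fitted. The sketch below shows that fitting step with an ordinary least-squares line on synthetic points; the coefficient 5.0 and exponent 0.05 are made-up values chosen for illustration, not numbers from any scaling study.

```python
import math

# Synthetic (compute, loss) points generated from loss = 5.0 * compute**-0.05.
# They illustrate the fitting step only; they are not measurements.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]

# In log space the power law becomes: log(loss) = log(a) - b * log(compute).
xs = [math.log(c) for c in compute]
ys = [math.log(value) for value in loss]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Ordinary least-squares slope and intercept for a single predictor.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

exponent = -slope
coefficient = math.exp(intercept)
print(f"fitted exponent={exponent:.3f} coefficient={coefficient:.2f}")

# Extrapolating past the fitted range is a hypothesis to test, not a guarantee.
predicted = coefficient * 1e22 ** -exponent
print(f"extrapolated loss at 1e22 FLOPs={predicted:.3f}")
```

Because the points are generated from an exact power law, the fit recovers the generating values. Real measurements are noisy, and a fit only describes the range and setup in which it was measured.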

For a dense decoder-only Transformer, engineers often estimate training compute with:

$$C \approx 6ND$$

Here, $C$ is training floating-point operations (FLOPs), $N$ is the number of model parameters, and $D$ is the number of training tokens. The factor 6 is a planning approximation for forward and backward passes, not a promise about hardware throughput or model quality.
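The factor 6 is commonly motivated by a per-token count: roughly 2 FLOPs per parameter in the forward pass (one multiply and one add) and about twice that in the backward pass. A sketch of that accounting, which ignores attention-over-context and embedding terms:

```latex
C_{\text{forward}} \approx 2ND, \qquad
C_{\text{backward}} \approx 4ND, \qquad
C \approx C_{\text{forward}} + C_{\text{backward}} = 6ND
```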

07-estimate-training-flops.py
    def dense_training_flops(parameters: int, tokens: int) -> int:
        return 6 * parameters * tokens

    plans = [
        ("small study", 1_000_000_000, 20_000_000_000),
        ("larger run", 7_000_000_000, 140_000_000_000),
    ]

    for name, parameters, tokens in plans:
        flops = dense_training_flops(parameters, tokens)
        print(f"{name}: {flops / 1e21:.2f} zettaFLOPs")

    assert dense_training_flops(7_000_000_000, 140_000_000_000) == 5_880_000_000_000_000_000_000

Training FLOPs estimate

    small study: 0.12 zettaFLOPs
    larger run: 5.88 zettaFLOPs
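A FLOPs total says nothing about wall-clock time until you assume a sustained throughput. The sketch below converts the "larger run" estimate into days; the accelerator count, the peak rate of 1e15 FLOPs per second, and the 40% utilization are placeholder assumptions, not figures for any real hardware.

```python
def training_days(total_flops: float, gpus: int, peak_flops_per_gpu: float, utilization: float) -> float:
    """Convert a FLOPs estimate into wall-clock days under an assumed sustained throughput."""
    sustained = gpus * peak_flops_per_gpu * utilization  # FLOPs per second across the cluster
    return total_flops / sustained / 86_400  # 86,400 seconds per day

# All hardware numbers below are hypothetical and exist only to show the arithmetic.
flops = 6 * 7_000_000_000 * 140_000_000_000  # the "larger run" from the estimate above
days = training_days(flops, gpus=64, peak_flops_per_gpu=1e15, utilization=0.4)
print(f"{days:.1f} days on 64 hypothetical accelerators")
```

Halving the assumed utilization doubles the estimate, which is why the same FLOPs budget can translate into very different schedules.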

More compute isn't the same as better allocation. Hoffmann and collaborators trained more than 400 models and reported that, under their compute-optimal fits, parameter count and training-token count should scale together: doubling one calls for roughly doubling the other.[4]

Use the FLOPs approximation to expose the tradeoff under one fixed budget. The following calculation doesn't select the best model; it only tells you how many tokens each model size can afford at that budget.

08-allocate-a-fixed-training-budget.py
    budget = 6 * 7_000_000_000 * 140_000_000_000
    model_sizes = [1_000_000_000, 7_000_000_000, 14_000_000_000]

    for parameters in model_sizes:
        affordable_tokens = budget // (6 * parameters)
        print(f"N={parameters / 1e9:.0f}B -> D={affordable_tokens / 1e9:.0f}B tokens")

    assert budget // (6 * 14_000_000_000) == 70_000_000_000

Fixed-budget tradeoff

    N=1B -> D=980B tokens
    N=7B -> D=140B tokens
    N=14B -> D=70B tokens

That calculation matters because "make the model bigger" can consume the budget needed to expose it to sufficient data. A research scientist doesn't ask only how much compute is available. They ask how to allocate it, what evaluation will reveal, and which failure mode an extra unit of compute addresses.
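If you fix a tokens-per-parameter ratio $r$, so that $D = rN$, the approximation $C \approx 6ND$ can be solved for both quantities: $N = \sqrt{C / (6r)}$. The sketch below does that arithmetic. The ratio of 20 is taken from the example plans earlier in this lesson (1B/20B and 7B/140B) and should be treated as a tunable assumption, not as a fitted constant.

```python
import math

def allocate(budget_flops: float, tokens_per_parameter: float) -> tuple[float, float]:
    """Solve C = 6*N*D with D = r*N, which gives N = sqrt(C / (6*r))."""
    parameters = math.sqrt(budget_flops / (6 * tokens_per_parameter))
    return parameters, tokens_per_parameter * parameters

budget = 6 * 7_000_000_000 * 140_000_000_000
parameters, tokens = allocate(budget, tokens_per_parameter=20)
print(f"N={parameters / 1e9:.1f}B D={tokens / 1e9:.0f}B")

# Under a fixed ratio, doubling the budget scales both N and D by sqrt(2).
bigger_n, bigger_d = allocate(2 * budget, tokens_per_parameter=20)
print(f"N={bigger_n / 1e9:.1f}B D={bigger_d / 1e9:.0f}B")
```

The first call recovers the 7B / 140B plan. The second shows that extra budget is split between model size and data, not poured into one of them.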

History is evidence, not a slogan

Sutton's essay supports a broad historical pattern. A careful reader should also notice where the evidence stops.

| Domain | What the source supports | What you shouldn't claim from it |
| --- | --- | --- |
| Chess and Go | Sutton describes delayed success from scaled search and learning; AlphaZero learned chess, shogi, and Go from self-play with only game rules and reached superhuman play within 24 hours.[1][5] | That search eliminates every useful prior or safety rule |
| Speech and vision | Sutton points to statistical/deep-learning methods replacing increasingly elaborate human feature engineering.[1] | That modern architectures contain no inductive biases |
| Language models | Scaling studies measure improvement of Transformer language models as model size, tokens, and compute change.[3][4] | That LLM pretraining proves Sutton's full claim about agents learning from experience |

That last distinction is important. Internet text contains human-written knowledge. Training on more of it can be a powerful general procedure while still relying on human-produced data. Treat the Bitter Lesson as a research heuristic: prefer methods that can absorb more evidence and compute, then demand measurement.

Keep rules where they belong

Rules are valuable when they encode a contractual boundary instead of trying to recognize every possible user intent. For ShopFlow, the model can suggest a refund route, but policy can require review for a high-value order or missing evidence.

[Figure: Decision guide for a support router: prefer learning when labeled tickets grow, use temporary rules under short-term constraints, and plan a measured handoff while retaining policy gates.]
Heuristics can bootstrap a system or enforce policy boundaries. They become a liability when every new wording demands another capability rule.

09-keep-policy-gates.py
    def release_action(proposed_route: str, order_total: int, evidence: set[str]) -> str:
        if proposed_route == "refund" and order_total > 100:
            return "human_review_high_value"
        if proposed_route == "refund" and "damaged_item" not in evidence:
            return "request_evidence"
        return proposed_route

    examples = [
        ("refund", 24, {"damaged_item"}),
        ("refund", 240, {"damaged_item"}),
        ("refund", 24, set()),
    ]
    decisions = [release_action(*example) for example in examples]
    print("decisions:", decisions)
    assert decisions == ["refund", "human_review_high_value", "request_evidence"]

Guardrail around capability

    decisions: ['refund', 'human_review_high_value', 'request_evidence']

The rule here isn't pretending to understand language. It protects an action with business and safety consequences. Learning handles messy phrasing; the gate controls what the learned component may execute.

Finally, leave a receipt. A system that consumes more compute but can't report its evaluation score, inference budget, policy gates, and escalation rate isn't ready for a production experiment.

10-log-the-compute-decision.py
    import json

    receipt = {
        "component": "ticket_router_v2",
        "held_out_accuracy": 0.92,
        "training_examples": 10_000,
        "candidate_budget": 4,
        "policy_gates": ["refund_over_100_requires_review", "refund_requires_evidence"],
        "monitor": "manual_escalation_rate",
    }

    print(json.dumps(receipt, indent=2, sort_keys=True))
    assert receipt["candidate_budget"] == 4
    assert receipt["held_out_accuracy"] >= 0.90

Compute decision receipt

    {
      "candidate_budget": 4,
      "component": "ticket_router_v2",
      "held_out_accuracy": 0.92,
      "monitor": "manual_escalation_rate",
      "policy_gates": [
        "refund_over_100_requires_review",
        "refund_requires_evidence"
      ],
      "training_examples": 10000
    }
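A receipt is only useful if someone checks that it is complete before an experiment starts. The sketch below is one possible pre-launch check; `REQUIRED_FIELDS`, `missing_receipt_fields`, and the `ticket_router_v3` entry are hypothetical names introduced here, mirroring the four items the paragraph above asks a system to report.

```python
# Hypothetical minimum: evaluation score, inference budget, policy gates, and a monitored rate.
REQUIRED_FIELDS = {"held_out_accuracy", "candidate_budget", "policy_gates", "monitor"}

def missing_receipt_fields(receipt: dict) -> list[str]:
    """Return required keys that are absent or empty, in sorted order."""
    return sorted(
        field
        for field in REQUIRED_FIELDS
        if field not in receipt or receipt[field] in (None, "", [])
    )

complete = {
    "component": "ticket_router_v2",
    "held_out_accuracy": 0.92,
    "training_examples": 10_000,
    "candidate_budget": 4,
    "policy_gates": ["refund_over_100_requires_review", "refund_requires_evidence"],
    "monitor": "manual_escalation_rate",
}
incomplete = {"component": "ticket_router_v3", "held_out_accuracy": 0.95}

print("complete receipt missing:", missing_receipt_fields(complete))
print("incomplete receipt missing:", missing_receipt_fields(incomplete))
```

A higher accuracy number on the second receipt doesn't make it launch-ready: it can't say what budget produced that score or which gates constrain the router.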

What to carry forward

The Bitter Lesson isn't "throw compute at every problem." It's a test for your design:

  1. Can new labeled evidence improve the capability without new handwritten branches?
  2. Can additional search or verification budget be measured against a held-out task?
  3. Are policy gates protecting actions rather than trying to encode every meaning?
  4. Can you report the compute budget and the evaluation that justified it?

Language models begin by converting raw text into discrete pieces. Tokenizers are a perfect next place to ask the same question: which reusable units can be learned from data, and what mistakes are introduced at that boundary?

Mastery check

Evaluation rubric

  • Foundational: State Sutton's thesis and distinguish learning from search.
  • Practical: Run the ticket-router examples and explain why a correction is data for a learner but rule debt for a keyword router.
  • Quantitative: Estimate dense-model training FLOPs and compare model/token allocations under a fixed budget.
  • Production: Design one learned component, one policy gate, and one logged metric for an automated support workflow.

Common pitfalls

"Compute will fix bad data"

Symptom: You enlarge a model or sample more candidates while labels leak or contradict one another. Cause: General methods can amplify the data and objectives they're given. Fix: Keep the dataset-quality and held-out evaluation discipline from the earlier data and production-ML lessons.
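Two of the cheapest data checks can run before any training budget is spent. This sketch flags identical ticket text carrying different labels and text that appears in both the training and held-out splits; the helper names and example rows are illustrative assumptions. It only catches exact duplicates after normalization, so time-based or near-duplicate leakage would need separate checks.

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())

def contradictory_labels(rows):
    """Return normalized texts that were labeled with more than one route."""
    labels = defaultdict(set)
    for text, label in rows:
        labels[normalize(text)].add(label)
    return {text: sorted(found) for text, found in labels.items() if len(found) > 1}

def leaked_texts(train_rows, held_out_rows):
    """Return normalized texts present in both splits."""
    train_texts = {normalize(text) for text, _ in train_rows}
    held_out_texts = {normalize(text) for text, _ in held_out_rows}
    return sorted(train_texts & held_out_texts)

# Hypothetical rows: the second ticket repeats the first with a conflicting label.
train = [
    ("Refund damaged item", "refund"),
    ("refund  damaged item", "shipping"),
    ("password reset", "account"),
]
held_out = [("Password reset", "account"), ("late parcel", "shipping")]

conflicts = contradictory_labels(train)
leaks = leaked_texts(train, held_out)
print("conflicts:", conflicts)
print("leaks:", leaks)
```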

"Rules are forbidden"

Symptom: You let a learned router auto-approve high-value refunds without a policy boundary. Cause: You've confused scalable capability with unconstrained execution. Fix: Use rules for auditable gates and learning for messy recognition.

"More search always improves an answer"

Symptom: Cost and latency rise, but accepted answers don't improve. Cause: Candidate generation or verification is too weak for added budget to help. Fix: Evaluate budget levels on held-out tasks and stop spending where gains vanish.
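The fix above can be sketched as a small budget sweep. Each hypothetical held-out task lists candidates as `(route, verifier_score)` pairs in generation order, and the sweep stops at the first budget whose accuracy gain falls below `min_gain`. The task data, the threshold, and the function names are assumptions for illustration; stopping at the first plateau is a simplification that would miss gains appearing only at much larger budgets.

```python
# Hypothetical held-out tasks: (candidates as (route, verifier_score), expected route).
tasks = [
    ([("refund", 0), ("shipping", 2)], "shipping"),
    ([("shipping", 1)], "shipping"),
    ([("account", 0), ("refund", 1), ("refund", 2)], "refund"),
    ([("account", 1), ("account", 1), ("account", 0), ("account", 2)], "shipping"),  # no good candidate
]

def accuracy_at(budget: int, rows) -> float:
    """Select the top-scored candidate among the first `budget` and compare to the label."""
    hits = 0
    for candidates, expected in rows:
        route, _ = max(candidates[:budget], key=lambda candidate: candidate[1])
        hits += route == expected
    return hits / len(rows)

def smallest_useful_budget(budgets, rows, min_gain: float = 0.01):
    """Walk budgets in increasing order and stop once the gain drops below min_gain."""
    chosen = budgets[0]
    best = accuracy_at(chosen, rows)
    for budget in budgets[1:]:
        score = accuracy_at(budget, rows)
        if score - best < min_gain:
            break
        chosen, best = budget, score
    return chosen, best

for budget in (1, 2, 4, 8):
    print(f"budget={budget} accuracy={accuracy_at(budget, tasks):.2f}")
chosen, best = smallest_useful_budget([1, 2, 4, 8], tasks)
print(f"chosen_budget={chosen} accuracy={best:.2f}")
```

Accuracy rises from budget 1 to budget 2 and then stays flat, because the last task has no correct candidate at any budget. Past that point, extra candidates add cost and latency without changing accepted answers.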

Next Step
Continue to BPE, WordPiece, and SentencePiece

You now know why methods that absorb more data and computation matter. Next you'll build the text-to-token boundary that makes language-model learning possible and measure the tradeoffs it introduces.

References

[1] Sutton, R. S. (2019). The Bitter Lesson.

[2] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint.

[3] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.

[4] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.

[5] Silver, D., et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint.