
The Bitter Lesson & Compute

Understand Sutton's Bitter Lesson, why general methods that use computation consistently outperform human-engineered heuristics, and how this principle shapes every modern AI architecture decision.


In the history of artificial intelligence, researchers have repeatedly tried to build human knowledge directly into computers, teaching them grammar rules for language or strategic heuristics for chess. This works well at first. But eventually, a simpler approach that just uses massive computing power to learn from data always wins. This phenomenon is known as "The Bitter Lesson."

In March 2019, Rich Sutton, one of the fathers of reinforcement learning (the branch of AI where agents learn by trial and error), published a short essay that would become one of the most cited and debated pieces in AI history [1]. "The Bitter Lesson" makes a simple but profound claim: the biggest lesson that can be read from 70 years of AI research is that general methods that use computation are ultimately the most effective, and by a large margin.

The Core Thesis

Sutton's argument can be distilled into three claims:

💡 Key insight: Analogy: Power tools vs. hand tools. Imagine two woodworkers: one spends years mastering hand-carving techniques (a "heuristic" or rule-of-thumb approach), creating beautiful but slow work. The other buys a CNC machine and learns to operate it. At first, the hand-carver produces better furniture. But the CNC operator keeps buying faster machines, and within a few years, they're producing furniture that's better and orders of magnitude faster. The Bitter Lesson says AI follows this same pattern: cleverly hand-crafted solutions lose to general methods that scale with compute.

  1. AI researchers have repeatedly tried to build human knowledge into their agents, through hand-crafted features, domain-specific heuristics, and expert-designed representations.

  2. These approaches work well initially but eventually hit a ceiling, because human-engineered knowledge is complex, brittle, and hard to scale.

  3. General methods that use computation, specifically search and learning, consistently win in the long run, because they scale with Moore's Law while human engineering effort does not.

The two methods Sutton identifies as truly scalable are:

  • Search: exploring large spaces of possibilities at inference time (e.g., Monte Carlo Tree Search (MCTS), beam search, test-time compute)
  • Learning: optimizing parameters from data at training time (e.g., gradient descent, reinforcement learning)

Both convert compute into performance. Neither requires encoding human domain knowledge. And both have, historically, displaced every hand-engineered alternative.

The Bitter Lesson visualization: hand-crafted features vs learned representations across chess, Go, vision, and NLP, showing compute-based approaches consistently winning.

Historical Evidence: The Pattern Repeats

The power of the Bitter Lesson comes from how consistently the pattern appears across every subfield of AI. Let's trace the evidence.

Chess: From Grandmaster Heuristics to Brute-Force Search

In the 1970s-90s, chess AI was dominated by programs with hand-crafted evaluation functions: piece-square tables, pawn structure analysis, king safety metrics, and endgame databases, all designed by chess experts. These programs were impressive but fragile. Each new improvement required painstaking human engineering.

Then came Deep Blue (1997), which used massive hardware parallelism to search ~200 million positions per second. It still used hand-tuned evaluation, but compute was doing most of the work. The final nail came with AlphaZero (2017) [2], which learned to play chess entirely from self-play using zero human knowledge beyond the rules. Starting from random play, it surpassed every hand-engineered chess program within hours of training on a TPU (Tensor Processing Unit) cluster.

The expert knowledge accumulated over decades of chess programming was rendered obsolete by a general algorithm (MCTS + neural network evaluation) powered by compute.

Go: The Impossible Problem Solved by Learning

Go was considered the frontier of game AI due to its enormous branching factor (b ≈ 250 vs. chess's b ≈ 35). For decades, the strongest Go programs used pattern databases hand-designed by Go experts: joseki dictionaries, life-and-death solvers, influence maps.

These programs barely played at amateur level.

AlphaGo (2016) combined deep neural networks with MCTS, initially trained on human expert games and then refined through self-play. AlphaGo Zero (2017) removed even the human games entirely: pure self-play from scratch. It became the strongest Go player in history, playing moves that human experts described as "alien" and "beautiful."

AlphaGo's move 37 in Game 2 against Lee Sedol, a shoulder hit on the fifth line that no human player would consider, was later recognized as genuinely creative. A hand-engineered system would never have discovered it.

Speech Recognition: From Phoneme Rules to End-to-End Learning

The history of automatic speech recognition (ASR) follows the same arc:

  • 1970s-2000s: Hidden Markov Models (HMMs) with hand-designed phoneme models, pronunciation dictionaries, and language model hierarchies. Building a production ASR system required teams of linguists and phoneticians.
  • 2012-2016: Deep neural network acoustic models replaced HMM features, but the pipeline (feature extraction → acoustic model → language model → decoder) remained hand-engineered.
  • 2017+: End-to-end models like CTC (Connectionist Temporal Classification), Listen-Attend-Spell, and eventually Whisper (2022) [3] collapsed the entire pipeline into a single neural network trained on massive data. Whisper was trained on 680,000 hours of web audio. Its "knowledge" of phonetics, accents, and languages was learned entirely from compute and data.

Every hand-designed component (mel-frequency cepstral coefficients, tri-phone models, n-gram language models) was eventually replaced by a general neural network doing the same job, but better, because it could use more compute and data.

Computer Vision: From SIFT to Transformers

The vision trajectory is equally stark:

  • 2000s: Hand-designed feature extractors (SIFT, HOG, SURF) fed into SVMs (Support Vector Machines) or random forests. Each feature was carefully engineered to capture edges, gradients, or keypoints.
  • 2012: AlexNet (a deep Convolutional Neural Network or CNN trained on ImageNet) demolished hand-crafted features at ILSVRC (ImageNet Large Scale Visual Recognition Challenge), dropping error rates from ~26% to ~16%. The features were learned, not designed.
  • 2020: Vision Transformers (ViTs) [4] removed even the convolutional inductive bias, treating images as sequences of patches and learning everything from data. When trained with sufficient compute and data, ViTs matched or exceeded CNNs, despite having less domain-specific structure.

The progression from hand-designed features → learned features (CNNs) → even more general learned features (Transformers) is the Bitter Lesson in action, compressed into a decade.

Natural Language Processing: From Parse Trees to Language Models

NLP provides perhaps the most dramatic example:

  • 1990s-2000s: NLP systems were built on linguistic engineering: part-of-speech taggers, dependency parsers, named entity recognizers (NER), coreference resolution systems, each hand-designed with linguistic theory. Building a question-answering system required composing dozens of these modules.
  • 2013: Word2Vec showed that simple neural methods (skip-gram, Continuous Bag-of-Words or CBOW) could learn rich semantic representations from raw text, with no linguistic annotation required.
  • 2018: BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) demonstrated that pre-training on large text corpora produced representations that transferred to virtually every NLP task, making most hand-designed NLP pipelines obsolete.
  • 2020: GPT-3 [5] showed that simply scaling up a language model (175B parameters, 300B tokens) produced a system capable of few-shot learning across an enormous range of tasks (translation, arithmetic, coding, reasoning) without any task-specific architecture or fine-tuning.

The entire NLP community's decades of work on parse trees, semantic role labeling, discourse analysis, and grammar engineering was subsumed by "predict the next token, but with more compute."

Across chess, Go, speech, vision, and NLP, hand-engineered methods are replaced by compute-scaled search and learning.

Why General Methods Win

The Bitter Lesson isn't just an observation. There's a structural reason why general methods eventually dominate:

💡 Key insight: Analogy: Language immersion vs. dictionary. Learning a language from a dictionary and grammar book (hand-engineering) gives you decent results fast, but you'll always sound wooden and miss subtle idioms. Immersion, simply listening to millions of conversations (learning from data), is painfully slow at first but eventually produces fluency no textbook can match. The more time (compute) you spend immersed, the better you get. That's why learning scales and engineering doesn't.

The Scaling Argument

Human engineering effort scales linearly (at best) with researcher-hours. You can hire more linguists, but each one still has to think through problems manually. Compute, by contrast, scales exponentially. Moore's Law (and its modern equivalents in GPU throughput) doubles available compute roughly every 18-24 months.

text
Human engineering: performance ∝ log(researcher_hours)
General methods:   performance ∝ compute^α   (where α > 0)

Any method that converts compute into performance will eventually surpass any method that relies on human effort, given enough time. The Bitter Lesson says this crossover has happened repeatedly, and it always will.
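The crossover is easy to tabulate. In the sketch below, the constants (engineering productivity, the scaling exponent, and the hiring/compute growth schedules) are illustrative assumptions chosen to make the curves visible, not measured values:

```python
import math

A = 10.0      # assumed productivity constant for human engineering
ALPHA = 0.35  # assumed scaling exponent for general methods

def engineered_performance(researcher_hours: float) -> float:
    # Grows only logarithmically with effort
    return A * math.log10(researcher_hours)

def general_performance(compute: float) -> float:
    # Grows as a power law with compute
    return compute ** ALPHA

# Effort grows linearly (hiring); compute doubles every ~2 years
for year in range(0, 21, 4):
    hours = 2_000 * (year + 1)
    compute = 1e3 * (2 ** (year / 2))
    print(f"year {year:2d}: engineered={engineered_performance(hours):6.1f}  "
          f"general={general_performance(compute):6.1f}")
```

At year 0 the engineered curve is far ahead; by year 20 the power-law curve has overtaken it by a wide margin, which is exactly the crossover the lesson describes.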

The Complexity Argument

Real-world domains are vastly more complex than our ability to understand them. A language has millions of statistical regularities that no linguist can fully enumerate. A visual scene has billions of pixel-level relationships that no feature engineer can capture. The only way to model this complexity is to learn it from data, and learning from data requires compute.

Code: Heuristic vs. Compute in Action

We can demonstrate the Bitter Lesson with a simple Python simulation. The code below defines a game agent evaluating a board state. It takes a board state (with an explicit value and a hidden complexity factor) as input. We implement two approaches: a heuristic function that applies rigid rules, and a general search function that simulates outcomes. The script outputs the evaluation scores, demonstrating how search uncovers hidden complexity while heuristics hit a hard ceiling.

python
import random

# A simplified state evaluation problem
class GameState:
    def __init__(self, value: int, complexity: int = 0):
        self.value = value
        self.complexity = complexity  # Hidden factors human rules might miss

    def simulate_step(self) -> float:
        """Simulate one step forward in the game (requires compute)."""
        # In a real game, this evolves the state based on rules
        noise = random.gauss(0, 1)
        return float(self.value + noise + (self.complexity * 0.1))


# 1. The Heuristic Approach: human-coded rules
# Fast, cheap, but capped by the engineer's understanding
def heuristic_evaluate(state: GameState) -> float:
    score = 0.0
    # Rule 1: high value is good
    if state.value > 10:
        score += 1.0
    # Rule 2: even numbers are preferred (arbitrary human intuition)
    if state.value % 2 == 0:
        score += 0.5
    # The heuristic ignores 'complexity' because the human didn't understand it
    return score


# 2. The General Approach: search / simulation
# Slow, expensive, but accuracy scales with compute budget
def search_evaluate(state: GameState, compute_budget: int) -> float:
    total_outcome = 0.0
    # Spend compute to simulate the future
    for _ in range(compute_budget):
        total_outcome += state.simulate_step()
    # The result converges to the truth as compute increases
    return total_outcome / compute_budget


# Demonstration
state = GameState(value=12, complexity=50)  # A state with hidden complexity

print(f"Heuristic Score: {heuristic_evaluate(state):.2f}")
# Output: 1.50 (fixed, fast, but inaccurate)

print(f"Search (Low Compute): {search_evaluate(state, compute_budget=10):.2f}")
# Output: ~17 (a noisy estimate of the true value)

print(f"Search (High Compute): {search_evaluate(state, compute_budget=10000):.2f}")
# Output: ~17.00 (accurate: captures the hidden complexity)

In this example, the heuristic_evaluate function is static. No matter how much GPU power you buy, it won't get better than its hard-coded logic. The search_evaluate function starts worse (it's noisy and slow), but as you increase compute_budget, it discovers the ground truth, including the hidden complexity term that the human engineer missed.

Why the Simulation Works

The Python simulation illustrates a fundamental truth about system design:

| Characteristic | Heuristic Approach | General Search Approach |
| --- | --- | --- |
| Development cost | High (requires human expert time) | Low (simple rules, relies on simulation) |
| Execution speed | Fast (constant time, O(1)) | Slow (scales with budget, O(N)) |
| Accuracy ceiling | Capped by human understanding | Approaches ground truth as compute increases |
| Adaptability | Brittle (fails if hidden variables change) | Robust (simulates actual outcomes) |

When compute is expensive and limited, the heuristic wins because it is fast. As compute becomes cheap and abundant, the search approach inevitably surpasses it.


The Generalization Argument

Hand-engineered systems tend to be brittle: they work well on the cases the engineer anticipated but fail on unexpected inputs. Learned representations, by contrast, generalize across the distribution of training data. A hand-designed NER system might recognize "New York" and "Paris" but miss "Bengaluru" because no engineer thought to add it. A learned system that has seen enough text will handle all of them.
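The brittleness is easy to reproduce. A minimal sketch of such a rule-based entity matcher (the city list and function name are hypothetical, just for illustration):

```python
# A hand-engineered "NER" system: an explicit list the engineer wrote down
KNOWN_CITIES = {"New York", "Paris", "London", "Tokyo"}

def rule_based_ner(text: str) -> list[str]:
    """Return known city names found in the text (exact, case-sensitive match)."""
    return [city for city in KNOWN_CITIES if city in text]

print(sorted(rule_based_ner("Flights from Paris to New York")))
# ['New York', 'Paris'] -- the anticipated cases work
print(rule_based_ner("Our office is in Bengaluru"))
# [] -- silently misses anything the engineer didn't enumerate
```

A learned tagger trained on enough text has no such enumeration to exhaust; its coverage tracks the data distribution rather than one engineer's foresight.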

Train-Time vs. Inference-Time Compute

The Bitter Lesson identifies learning and search as the two methods that scale arbitrarily with compute. Modern AI systems have split these into two distinct phases of computation, both of which follow the Bitter Lesson.

Train-Time Compute: Learning the Foundation

During training, compute is spent on learning (optimizing weights via gradient descent). Throwing more FLOPs (Floating Point Operations) at a larger dataset with a larger model predictably reduces the loss.

This phase builds the model's intuitive "System 1" capabilities. A language model learns to predict the next token with high accuracy. However, this relies entirely on the model's forward pass, meaning the computation per token is fixed. A complex reasoning problem gets the exact same amount of compute as a simple greeting.
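The fixed per-token cost can be made concrete with the common rule of thumb that a dense transformer's forward pass costs roughly 2 FLOPs per parameter per token. The model size below is an illustrative assumption:

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Rough forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens

N_PARAMS = 7_000_000_000  # a hypothetical 7B-parameter dense model

# Same 20-token budget, wildly different difficulty -- identical cost
trivial = forward_flops(N_PARAMS, n_tokens=20)  # casual small talk
hard = forward_flops(N_PARAMS, n_tokens=20)     # a tricky math question
print(trivial == hard)  # True: cost depends only on token count, not difficulty
```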

Inference-Time Compute: Search and Reasoning

At inference time, compute is spent on search (exploring possible outputs). This represents "System 2" thinking. Instead of immediately returning the first predicted token, the system can generate multiple possible reasoning paths, evaluate them, and select the best one.

Models like OpenAI's o1 and DeepSeek-R1 use techniques like Monte Carlo Tree Search (MCTS) or process reward models to explore solutions. As Sutton noted with AlphaZero, search scales: if you give the model 10 times more inference compute to think before answering, it will find significantly better solutions.
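The simplest form of inference-time search is best-of-N sampling: generate several candidates and keep the one a scorer ranks highest. In this toy sketch, the generator and scorer are stand-ins for a language model and a reward model (all names and the target value are illustrative assumptions):

```python
import random

def generate_candidate(rng: random.Random) -> float:
    """Stand-in for sampling one reasoning path from a model."""
    return rng.gauss(0, 1)

def score(candidate: float) -> float:
    """Stand-in for a reward/verifier model; higher is better."""
    return -abs(candidate - 1.0)  # the "ideal answer" here is 1.0

def best_of_n(n: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=score)

# More inference compute (larger n) yields answers closer to the target
for n in (1, 10, 100):
    print(f"n={n:3d}: best candidate = {best_of_n(n):.3f}")
```

With a fixed seed, the n=100 run considers a superset of the n=1 run's candidates, so its best answer is never worse: spending more search compute monotonically helps.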


By scaling both train-time learning and inference-time search, engineers apply the Bitter Lesson at every stage of the pipeline, avoiding hand-crafted reasoning modules in favor of raw computation.

The Counter-Arguments: When Heuristics Still Matter

The Bitter Lesson is powerful, but it's not the whole story. Several important nuances deserve attention:

Inductive Biases as Efficient Priors

While general methods eventually win, inductive biases serve as efficient initialization. Convolutions encode translation invariance, which is genuinely true of images. Positional encodings capture the sequential nature of language. These biases don't contradict the Bitter Lesson; they accelerate learning within a fixed compute budget. The lesson predicts they'll be replaced, but doesn't say they're useless now.

Sample Efficiency

Not every problem has internet-scale data. In medical imaging, robotics, or rare language NLP, data is scarce and compute alone can't compensate. In these domains, carefully designed architectures and data augmentation (forms of human knowledge) still provide critical advantages. The Bitter Lesson implicitly assumes abundant data, which isn't always available.

Safety and Alignment

Perhaps the strongest counter-argument: we may not want fully general methods for safety-critical decisions. A learned reward model might "hack" its objective in unexpected ways. A rule-based safety filter, while brittle, is interpretable and verifiable. The alignment community argues that some problems require human-designed constraints, not just more compute.

Compute as a Fixed Budget

The Bitter Lesson describes long-term trends, but engineers work within fixed budgets. If you have 8 GPUs and a deadline, using domain knowledge to constrain your search space is pragmatic, not foolish. The lesson applies to research trajectories stretching over decades, not to individual project decisions.

Implications for Modern AI Engineering

The Bitter Lesson isn't just philosophical. It has concrete implications for how we design AI systems today:

1. Build for Scale, Not Cleverness

Design architectures that benefit from more compute and data, even if they seem "wasteful" at current scales. The Transformer replaced the LSTM not because it was more elegant, but because its parallelizable attention mechanism could use GPU clusters more effectively. Choose the architecture that scales.

2. Prefer Learning over Engineering

When you're tempted to add a hand-designed component (a custom tokenizer heuristic, a rule-based post-processor, a domain-specific feature), ask: "Would a general learned component work if I gave it enough data?" If yes, the Bitter Lesson says invest in data collection and compute rather than engineering.

3. Scaling Laws Are the Bitter Lesson, Quantified

The empirical scaling laws discovered by Kaplan et al. [6] and Hoffmann et al. [7] are the Bitter Lesson expressed mathematically. They demonstrate that the cross-entropy loss L of a language model scales as a predictable power law with the compute budget C:

L(C) ≈ (C_c / C)^α

where C_c is a constant and α is the scaling exponent. This equation shows that model performance improves steadily and predictably simply by increasing compute, exactly the behavior Sutton highlighted. The scaling laws tell you how much compute to throw at a problem; the Bitter Lesson tells you why this brute-force approach succeeds where elegant hand-engineering fails.
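Plugging illustrative numbers into the power law shows how predictable the gains are. The values of C_c and α below are assumptions for demonstration, not fitted constants from either paper:

```python
def loss(compute: float, c_c: float = 1e9, alpha: float = 0.05) -> float:
    """Power-law loss curve: L(C) = (C_c / C) ** alpha."""
    return (c_c / compute) ** alpha

# Each 10x increase in compute multiplies the loss by a fixed factor
for c in (1e20, 1e21, 1e22):
    print(f"C = {c:.0e}: L = {loss(c):.4f}")

# That factor is exactly 10 ** -alpha, regardless of where you are on the curve
print(round(loss(1e21) / loss(1e20), 4))
```

The constant ratio between successive points is what makes scaling laws useful for planning: you can forecast the loss of a 10x-larger run before paying for it.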

4. Test-Time Compute Is Search Reborn

Modern reasoning models (o1, o3, DeepSeek-R1) apply the Bitter Lesson to inference: instead of engineering complex reasoning pipelines, they spend more compute at test time through learned search strategies. This directly parallels AlphaZero's MCTS, converting compute into intelligence through general search, not hand-designed heuristics.

5. Your Clever Trick Is Temporary

If you develop a technique that works by encoding domain knowledge (a specialized attention mask, a domain-specific loss function, a hand-tuned decoding strategy), expect it to be obsoleted by a simpler, more general approach once sufficient compute is available. This doesn't mean the technique is bad. It means you should design it to be removable.

A Practical Framework

For engineers deciding between general and specialized approaches, consider this decision matrix:

| Factor | Favor General Methods | Favor Domain Heuristics |
| --- | --- | --- |
| Data availability | Abundant (>10M examples) | Scarce (<10K examples) |
| Compute budget | Large or growing | Fixed and small |
| Time horizon | Long-term product | Short-term prototype |
| Domain complexity | High (language, vision) | Low (clear rules exist) |
| Safety requirements | Standard | Safety-critical |
| Architecture lifespan | Years | Months |

This matrix highlights that the Bitter Lesson is a statement about asymptotic behavior. It tells us where the field is heading over the long term. However, day-to-day engineering requires balancing these long-term trends against immediate constraints.

If you are building a prototype with zero training data, you have no choice but to use heuristics. You might write a complex prompt with a strict set of extraction rules. But as your product matures and you collect a million user interactions, the Bitter Lesson suggests you should transition away from those complex prompts and instead fine-tune a model on that data.

The most successful AI teams design their systems to smoothly bridge this gap. They use heuristics to bootstrap their initial data flywheel, but they build their infrastructure to eventually replace those same heuristics with learning and search once enough data and compute are available.

Key Takeaways

  1. The Bitter Lesson is a meta-principle: across 70 years and every subfield of AI, general methods that use computation have displaced hand-engineered approaches.

  2. Search and learning are the two scalable methods. Both convert compute into performance without requiring human domain knowledge.

  3. The pattern is accelerating: the Transformer's dominance across language, vision, audio, and biology is the Bitter Lesson operating in real-time.

  4. Nuance matters: inductive biases, sample efficiency, safety constraints, and fixed compute budgets all create situations where human-designed structure still provides value.

  5. For engineers, the practical implication is clear: design systems that benefit from scale (more data, more compute, more parameters) because that's where the long-term wins are.

  6. Know when you're fighting the lesson: if your system relies on clever tricks that only work at your current scale, start planning for the general alternative.

Evaluation Rubric
  1. State the core thesis of the Bitter Lesson in one sentence
  2. Give three historical examples where compute-driven methods displaced hand-engineered approaches
  3. Explain why search and learning are the two methods that scale with compute
  4. Describe how the Bitter Lesson predicted the success of Transformers over LSTMs with attention tricks
  5. Articulate the counter-arguments: inductive biases, sample efficiency, and safety constraints
  6. Connect the Bitter Lesson to modern scaling laws (Kaplan, Chinchilla) and why they work
  7. Explain the engineering implications: how to design systems that benefit from future compute increases
Common Pitfalls
  • Interpreting the Bitter Lesson as 'engineering doesn't matter': it says general methods win, not that implementation quality is irrelevant
  • Ignoring the time horizon: heuristics often win in the short term before compute catches up
  • Conflating 'general methods' with 'no structure at all': Transformers have inductive biases, they're just more general than RNNs
  • Assuming the lesson means we should only scale models and never innovate on architecture
  • Forgetting that the Bitter Lesson is about research trends, not individual project decisions with fixed compute budgets

Key Concepts Tested
  • Sutton's Bitter Lesson thesis and historical evidence
  • Compute use vs. domain-specific heuristics
  • Search and learning as the two scalable methods
  • Historical case studies: chess, Go, speech, vision, NLP
  • Implications for modern LLM architecture choices
  • The counter-arguments and nuance: when heuristics still matter
  • Connection to scaling laws and emergent capabilities
  • Engineering implications: build for scale, not cleverness
References

[1] Sutton, R. S. (2019). The Bitter Lesson.
[2] Silver, D., et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint.
[3] Radford, A., et al. (2022). Whisper: Robust Speech Recognition via Large-Scale Weak Supervision.
[4] Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
[5] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
[6] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
[7] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.
