LeetLLM

Scaling Laws & Compute-Optimal Training

Learn the empirical power laws governing LLM performance, from Kaplan's parameter-heavy frontier through Chinchilla-optimal ratios to modern inference-aware training strategies.


After decoding, we move from "how a trained model emits tokens" to "how a lab decides what model to train in the first place." Scaling laws describe how model loss changes as you add parameters, tokens, and compute. This chapter explains the practical question behind the math: how to spend a fixed training budget without overbuilding the wrong part of the system.

Imagine your team has a fixed budget to build an AI model that answers customer support questions. You face a simple-sounding choice: do you build a huge model and train it on a modest amount of data, or build a smaller model and train it on much more data? Get this balance wrong, and you can spend millions of dollars on a model that underperforms a smaller, better-trained alternative.

This is the problem scaling laws address. They are empirical relationships that predict how a model's performance improves as you increase its size, its training data, or both. Kaplan et al. at OpenAI [1] quantified influential scaling curves, and the Chinchilla paper [2] later found a more balanced allocation of model size and data for compute-optimal dense training. These fits help plan expensive runs, but each fit applies only within its objective, data, architecture, and measurement assumptions.

Before we go further, let's agree on three basic terms we'll use throughout this chapter:

  • Parameters (N): The learnable numbers inside a model (weights and biases). Think of them as the capacity of the model, like the number of shelves in a warehouse.
  • Training tokens (D): The individual words or subword pieces the model sees during training. Think of them as the experience the model gains, like the total number of orders a fulfillment center processes.
  • Compute (C): The total number of floating-point operations (FLOPs) needed for training. For a standard dense transformer, a useful rule of thumb is $C \approx 6ND$ (approximately $2N$ FLOPs per token for the forward pass and $4N$ for backpropagation). This approximation originates in Kaplan et al. and is refined in the Chinchilla work.[1][2]
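As a rough illustration of the $C \approx 6ND$ rule of thumb, the sketch below splits the estimate into forward and backward FLOPs and converts the total into wall-clock time. The function names, the sustained throughput of 2e14 FLOP/s per accelerator, and the accelerator count are hypothetical values chosen for this example, not figures from the cited papers.

```python
def training_flops_breakdown(parameters, tokens):
    """Split the 6ND planning estimate into forward (2ND) and backward (4ND) FLOPs."""
    forward = 2 * parameters * tokens
    backward = 4 * parameters * tokens
    return {"forward": forward, "backward": backward, "total": forward + backward}


def accelerator_days(total_flops, sustained_flops_per_second, accelerators):
    """Convert a FLOP budget into days, given an assumed sustained throughput."""
    seconds = total_flops / (sustained_flops_per_second * accelerators)
    return seconds / 86_400


# 1B parameters on 20B tokens; hardware numbers below are assumptions.
breakdown = training_flops_breakdown(1e9, 20e9)
days = accelerator_days(
    breakdown["total"],
    sustained_flops_per_second=2e14,  # hypothetical sustained rate, not a peak spec
    accelerators=8,                   # hypothetical cluster size
)

print(f"Forward FLOPs:  {breakdown['forward']:.2e}")
print(f"Backward FLOPs: {breakdown['backward']:.2e}")
print(f"Total FLOPs:    {breakdown['total']:.2e}")
print(f"Estimated wall-clock: {days:.2f} days on 8 accelerators")
```

The wall-clock figure is only as good as the sustained-throughput assumption; measured utilization on the actual training stack should replace it before any real planning.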
Figure: Scaling-law planning lens showing objective shift from bigger models to balanced data to inference-aware economics.
Scaling laws are planning tools, not a command to make every model bigger. The objective decides whether you spend more on parameters, data, or serving efficiency.

Why scaling laws matter

Frontier pre-training runs consume large compute budgets. In the early days of deep learning, architecture design and hyperparameter tuning drove many performance improvements. Today, scale is one of the main levers. But scaling isn't as simple as making everything bigger. If you misallocate your compute budget between model size and data volume, you can spend a large budget on a model that underperforms a smaller, better-trained counterpart.

Scaling laws address this problem by providing predictive models that guide resource allocation. They map the relationship between the scale of your inputs (parameters, tokens, and compute) and the expected performance of the output model. In pre-training studies, that performance is usually loss, which measures how well the model predicts the next token. By studying these relationships, engineers can estimate how a model will perform before committing to a costly, months-long training run.

These empirical laws allow teams to:

  • Forecast performance before committing to a full training run.
  • Optimize compute allocation between parameters and training tokens.
  • Make economic trade-offs between training cost and inference cost in production environments.
  • Extrapolate from small-scale proxy experiments to production-scale models, reducing the risk of expensive surprises at scale.

They are planning tools, not guarantees. A good scaling study reduces uncertainty before a large run, but the final decision still has to account for data quality, architecture, hardware efficiency, post-training, and deployment cost.

Building intuition with numbers

Before we meet any formulas, let's see the shape of scaling with a tiny concrete example. Suppose you train a 1-billion-parameter model on 10 billion tokens and measure its cross-entropy loss. Then you run three follow-up experiments:

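| Experiment | Parameters | Tokens | What changed | Approximate loss |
| --- | --- | --- | --- | --- |
| Baseline | 1B | 10B | Nothing | 2.80 |
| Double model size | 2B | 10B | 2x parameters | 2.60 |
| Double data | 1B | 20B | 2x tokens | 2.55 |
| Double both | 2B | 20B | 2x each | 2.35 |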

These numbers are synthetic, but they capture a real pattern: doubling parameters or data lowers loss, but doubling both together lowers it more than either alone. The improvement is smooth and predictable, but it follows diminishing returns. A 10x increase in parameters doesn't cut loss by 10x; it cuts it by a smaller, fixed multiplier.

This predictable curve is a power law. In everyday language, an idealized power-law term becomes a straight line when you plot its logarithm against log(parameters) or log(data). That line estimates the scaling trend. It lets engineers answer questions like: "If I have 10x more compute, how much better will my model get?" without training the model first. Later, we'll add an asymptotic floor and fit parameters and data together.
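The straight-line property described above is what makes the exponent easy to estimate. The sketch below fits a line to log(loss term) versus log(parameters) with ordinary least squares and then extrapolates. The data points are synthetic and noise-free, generated from an assumed term `5.0 * N ** -0.08`; the helper name `fit_power_law` is introduced here for illustration only.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x); returns (a, alpha)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    count = len(xs)
    mean_x = sum(log_x) / count
    mean_y = sum(log_y) / count
    covariance = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    variance = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points from an assumed idealized term; not measured losses.
sizes = [1e8, 3e8, 1e9, 3e9, 1e10]
loss_terms = [5.0 * size ** -0.08 for size in sizes]

a, alpha = fit_power_law(sizes, loss_terms)
extrapolated = a * 1e11 ** -alpha

print(f"fitted a={a:.3f}, alpha={alpha:.3f}")
print(f"extrapolated loss term at 1e11 parameters: {extrapolated:.3f}")
```

With real measurements the points scatter around the line, and a pure power-law fit ignores any irreducible floor, so an extrapolation like this one should be treated as a trend estimate rather than a forecast of final loss.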

Figure: Power-law chart showing loss dropping smoothly as compute grows while incremental improvement per 3x compute step shrinks.
Scaling gives smooth, forecastable improvement, not linear payoff. The loss curve keeps falling, but each extra block of compute buys a smaller absolute reduction.
power-law-diminishing-returns.py
```python
# Parameter-scaling exponent quoted for Kaplan et al.'s fit.
alpha_n = 0.076

for parameter_multiplier in [2, 10, 100]:
    # An idealized power-law term scales as multiplier ** (-alpha).
    scaled_term = parameter_multiplier ** (-alpha_n)
    reduction = 1 - scaled_term
    print(
        f"{parameter_multiplier:>3}x parameters -> "
        f"{scaled_term:.3f}x parameter-limited loss term "
        f"({reduction:.1%} reduction)"
    )
```
Output
```
  2x parameters -> 0.949x parameter-limited loss term (5.1% reduction)
 10x parameters -> 0.839x parameter-limited loss term (16.1% reduction)
100x parameters -> 0.705x parameter-limited loss term (29.5% reduction)
```

Kaplan scaling laws (2020)

OpenAI's 2020 work [1] established three power-law relationships for transformer language models. Each describes a different bottleneck regime for loss $L$:

The three power laws

1. Scaling with parameters (model size $N$)

$$L(N) \propto N^{-\alpha_N}$$

Reading the formula: Loss decreases as a power law when you make the model bigger. Double the parameters, and loss drops by a fixed multiplier. The exponent $\alpha_N \approx 0.076$ means the improvement is smooth and predictable, but diminishing returns are steep: a 10x increase in parameters only reduces the loss by about 16%.

2. Scaling with data (dataset size $D$ in tokens)

$$L(D) \propto D^{-\alpha_D}$$

Reading the formula: The same pattern holds for data. More training tokens means lower loss, following a predictable power curve with $\alpha_D \approx 0.095$.

3. Scaling with compute-optimal training compute ($C_{\min}$)

$$L(C_{\min}) \propto C_{\min}^{-\alpha_C}$$

Reading the formula: Kaplan's compute curve is written in terms of optimally allocated training compute, not arbitrary training runs. Using the standard dense-transformer budgeting rule $C \approx 6ND$, the fitted exponent is $\alpha_C \approx 0.050$. The factor of 6 comes from a rough FLOP accounting: about $2N$ per token for the forward pass and about $4N$ per token for backpropagation. It's a planning heuristic, not an exact profiler. Real training runs also depend on attention kernels, optimizer overhead, activation recomputation, and architecture choices such as MoE sparsity.

Kaplan's influential result wasn't that $\alpha_N$ was larger than $\alpha_D$; it wasn't. The operational takeaway came from the compute-optimal frontier: for a fixed compute budget, the fitted optimum favored much larger models and relatively little data.
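To make the three exponents comparable, the sketch below applies each one to an idealized single-term power law and asks two questions: what does 10x of the resource buy, and how large a multiplier would halve the term? The exponent values are the ones quoted in this section; the helper names are introduced here for illustration, and the calculation ignores any irreducible loss floor.

```python
# Exponents as quoted in this section for Kaplan et al.'s fits.
exponents = {
    "parameters (alpha_N)": 0.076,
    "data (alpha_D)": 0.095,
    "compute (alpha_C)": 0.050,
}


def term_after_scaling(multiplier, alpha):
    """Factor applied to an idealized power-law term when the resource grows by `multiplier`."""
    return multiplier ** (-alpha)


def multiplier_to_halve(alpha):
    """Resource multiplier needed to cut the idealized term in half."""
    return 2 ** (1 / alpha)


for name, alpha in exponents.items():
    ten_x = term_after_scaling(10, alpha)
    halve = multiplier_to_halve(alpha)
    print(
        f"{name:<22} 10x -> {ten_x:.3f}x term "
        f"({1 - ten_x:.1%} reduction); halving needs ~{halve:.2e}x"
    )
```

The halving multipliers are the clearest view of how shallow these curves are: under the quoted parameter exponent, halving the parameter-limited term takes roughly four orders of magnitude more parameters.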

Kaplan's compute-optimal frontier favored larger models

Kaplan's team concluded that, under their fitted scaling regime, you should spend most new compute on model size and only a smaller fraction on additional data. Think of it like this: upgrading a tiny fulfillment center into a large automated hub improves throughput more than sending more orders through the same tiny center.

This led to the initial strategy of training very large models on relatively modest data budgets, an approach exemplified by GPT-3 (175B parameters on roughly 300B tokens, a 1.7:1 ratio) [3].

The compute-optimal frontier

When compute is the binding constraint, Kaplan suggested allocating the budget such that:

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

Where these exponents come from: Kaplan didn't get these exponents by comparing $\alpha_N$ and $\alpha_D$ directly. The paper fits joint loss surfaces, including one over parameters and data,

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

and a companion fit over parameters and training steps. Solving for the best allocation under the transformer training-cost approximation $C \approx 6ND$, with batch size and step count together making up $D$, produced the strongly parameter-heavy split above. Chinchilla later re-fit the compute-optimal frontier and got the much more balanced $0.50$/$0.50$ result.

What this means: under Kaplan's fit, if you get 10x more compute, you spend most of it on a bigger model (about 5.4x the parameters, from $10^{0.73}$) and only a little on more data (about 1.9x the tokens, from $10^{0.27}$).

allocation-exponents.py
```python
# (parameter exponent, token exponent) for each fitted compute-optimal frontier.
frontiers = {
    "Kaplan": (0.73, 0.27),
    "Chinchilla": (0.50, 0.50),
}

for compute_multiplier in [10, 100]:
    print(f"{compute_multiplier}x training compute")
    for name, (parameter_exp, token_exp) in frontiers.items():
        # Growth in N and D implied by N ∝ C^a and D ∝ C^b.
        parameters = compute_multiplier ** parameter_exp
        tokens = compute_multiplier ** token_exp
        print(f"  {name:<10} N={parameters:5.2f}x, D={tokens:5.2f}x")
```
Output
```
10x training compute
  Kaplan     N= 5.37x, D= 1.86x
  Chinchilla N= 3.16x, D= 3.16x
100x training compute
  Kaplan     N=28.84x, D= 3.47x
  Chinchilla N=10.00x, D=10.00x
```

the-compute-optimal-frontier.py
```python
def dense_training_flops(parameters, tokens):
    # Planning approximation C ≈ 6ND for a dense transformer.
    return 6 * parameters * tokens

gpt3_style = dense_training_flops(175e9, 300e9)
small_data_rich = dense_training_flops(70e9, 1.4e12)

print(f"GPT-3-style training FLOPs: {gpt3_style:.2e}")
print(f"70B / 1.4T-token training FLOPs: {small_data_rich:.2e}")
```
Output
```
GPT-3-style training FLOPs: 3.15e+23
70B / 1.4T-token training FLOPs: 5.88e+23
```

The Chinchilla reallocation (2022)

Imagine two fulfillment teams get the same budget. Team A builds a large automated hub but can only process a thin sample of orders. Team B builds a modest hub but processes a much richer, higher-quality order history. Which system performs better? DeepMind ran this experiment with AI models, and the results changed how teams thought about model size.

Hoffmann et al. at DeepMind [2] challenged the Kaplan prescription by training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens. Their central finding was that the Kaplan allocation had underestimated the value of data.

Why did two careful studies disagree so sharply? Later analysis traced the gap to methodology rather than a new law of transformers. Porian et al. [4] reproduce Kaplan-style scaling experiments and identify three contributors: omitting final decoding-layer computation, an excessively long fixed warmup for small models, and optimizer hyperparameters that weren't tuned as a function of scale. Correcting those factors produces close agreement with the Chinchilla allocation; their experiments also find that tailored learning-rate decay isn't the central explanation. The lesson is that a scaling exponent is only as trustworthy as its compute accounting and fitting procedure.

Chinchilla-optimal training

The revised scaling law for jointly optimal model size and data volume is:

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

In plain terms: Chinchilla's correction is that model size and data should grow at equal rates. If you double compute, make the model $\sqrt{2}\times$ bigger and use $\sqrt{2}\times$ more data. The practical rule: train on roughly 20 tokens per parameter ($D \approx 20N$).

That 20:1 ratio is a fitted result for the dense-transformer setup Hoffmann et al. studied, not a universal constant.[2]
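A fixed tokens-per-parameter ratio turns the $C \approx 6ND$ approximation into a closed-form sizing rule. The short derivation below writes the ratio as $r$, a symbol introduced here for illustration; with $r = 20$ it is the Chinchilla-style rule of thumb, and any other fitted ratio can be substituted.

```latex
\begin{aligned}
C &\approx 6ND, \qquad D = rN \\
C &\approx 6rN^{2} \\
N_{\text{opt}} &\approx \sqrt{\frac{C}{6r}}, \qquad
D_{\text{opt}} = rN_{\text{opt}} \approx \sqrt{\frac{rC}{6}}
\end{aligned}
```

For example, $C = 10^{24}$ FLOPs and $r = 20$ give $N_{\text{opt}} \approx \sqrt{10^{24} / 120} \approx 9.1 \times 10^{10}$ parameters and $D_{\text{opt}} \approx 1.8 \times 10^{12}$ tokens. Both quantities scale as $C^{0.5}$, which matches the equal exponents in the fitted frontier.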

compute-budget-to-chinchilla-size.py
```python
def chinchilla_style_size(training_flops, tokens_per_parameter=20):
    # From C ≈ 6ND with D = ratio * N: N = sqrt(C / (6 * ratio)).
    parameters = (training_flops / (6 * tokens_per_parameter)) ** 0.5
    tokens = tokens_per_parameter * parameters
    return parameters, tokens

parameters, tokens = chinchilla_style_size(1e24)
print(f"Parameters at 1e24 FLOPs: {parameters / 1e9:.1f}B")
print(f"Tokens at 20:1: {tokens / 1e12:.2f}T")
```
Output
```
Parameters at 1e24 FLOPs: 91.3B
Tokens at 20:1: 1.83T
```
Figure: Compute-optimal frontier comparison showing Kaplan growing parameters much faster than data, while Chinchilla grows parameters and data at equal rates.
Kaplan and Chinchilla differ in how new compute is allocated. Kaplan's fitted frontier grew model size much faster than data, while Chinchilla's fitted frontier grew parameters and tokens together.

Chinchilla vs. Gopher: a concrete example

DeepMind demonstrated this dramatically with Gopher and Chinchilla:

| Model | Parameters | Training tokens | Ratio (D/N) | Compute (FLOPs) |
| --- | --- | --- | --- | --- |
| Gopher | 280B | 300B | 1.07:1 | 5.76 × 10²³ |
| Chinchilla | ~70B | 1.4T | 20:1 | 5.76 × 10²³ |

The FLOP column is the matched training budget reported by Hoffmann et al. It won't equal $6ND$ exactly if you plug in the displayed parameter and token counts: $6ND$ is a planning approximation, the paper uses architecture-aware FLOP accounting, and the displayed counts are rounded.[2]

With the same reported training budget, Chinchilla outperformed Gopher across the evaluation suite DeepMind reported. The 4x smaller model trained on about 4.7x more data delivered better downstream accuracy while also being cheaper to fine-tune and serve.[2]
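To see how far the $6ND$ approximation drifts from the reported budget, the sketch below plugs the rounded parameter and token counts from the table into the rule of thumb and compares each result with the listed 5.76e23 FLOPs. The helper names are introduced here for illustration; the gaps reflect rounding and the approximation itself, not an error in the paper.

```python
REPORTED_FLOPS = 5.76e23  # matched training budget as listed in the table above

# (parameters, training tokens), rounded as displayed in the table.
MODELS = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def dense_training_flops(parameters, tokens):
    """Planning approximation C ≈ 6ND."""
    return 6 * parameters * tokens


def relative_gap(name):
    """Signed gap between the 6ND estimate and the reported budget."""
    parameters, tokens = MODELS[name]
    return dense_training_flops(parameters, tokens) / REPORTED_FLOPS - 1


for name, (parameters, tokens) in MODELS.items():
    estimate = dense_training_flops(parameters, tokens)
    print(f"{name:<10} 6ND = {estimate:.2e} FLOPs ({relative_gap(name):+.1%} vs reported)")
```

A gap of a few percent to low double digits is typical for this kind of back-of-the-envelope estimate, which is why the rule works for planning but not for auditing a reported budget.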

Later dense-model token ratios

Published model reports make it possible to observe the token-to-parameter ratio actually used. That observation doesn't by itself reveal which objective selected the configuration.

Meta reports that the Llama 3 405B flagship was pre-trained on 15.6T text tokens.[5] That yields approximately $15.6\text{T} / 405\text{B} \approx 38.5$ tokens per parameter, above Chinchilla's roughly 20:1 fitted rule. Importantly, the Llama 3 paper says its 405B model size is approximately compute-optimal under scaling laws fitted on Meta's data and its $3.8 \times 10^{25}$ FLOP training budget.[5] It doesn't establish that lifetime inference cost caused the ratio. The inference-aware objective in the next section is a separate way to evaluate that trade-off.
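For a side-by-side view, the sketch below computes tokens per parameter for the models whose sizes and token counts are quoted in this chapter. The helper name is introduced here for illustration, and the inputs are the rounded figures from the text, so the ratios are approximate.

```python
# (parameters, training tokens) as quoted in this chapter.
reported = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 3 405B": (405e9, 15.6e12),
}


def tokens_per_parameter(parameters, tokens):
    """Observed D/N ratio; says nothing about which objective produced it."""
    return tokens / parameters


for name, (parameters, tokens) in reported.items():
    ratio = tokens_per_parameter(parameters, tokens)
    print(f"{name:<13} {ratio:5.1f} tokens per parameter")
```

The ratios are observations only: they describe what each team trained on, not why that configuration was chosen.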

chinchilla-style-budget.py
```python
def chinchilla_tokens(parameters, tokens_per_parameter=20):
    # Token budget implied by a fixed tokens-per-parameter ratio.
    return parameters * tokens_per_parameter

def dense_training_flops(parameters, tokens):
    # Planning approximation C ≈ 6ND.
    return 6 * parameters * tokens

parameters = 10e9
tokens = chinchilla_tokens(parameters)
flops = dense_training_flops(parameters, tokens)

print(f"Chinchilla-style tokens for 10B parameters: {tokens:.2e}")
print(f"Dense training FLOPs: {flops:.2e}")
```
Output
```
Chinchilla-style tokens for 10B parameters: 2.00e+11
Dense training FLOPs: 1.20e+22
```

Beyond Chinchilla: inference-aware scaling

Chinchilla tells you how to train the model with the lowest pre-training loss for a fixed training-compute budget, but that's only half the story. Think of it like choosing warehouse equipment: the Chinchilla view says "pick the machine with the best performance-per-dollar purchase price." But if the fastest sorter needs expensive maintenance and constant supervision, while a smaller sorter runs cheaply every night, the smaller sorter might cost less over its lifetime. The same logic applies to AI models: a smaller, well-trained model is cheaper to serve every day.

To visualize this trade-off, it helps to separate the objective functions. Kaplan and Chinchilla both ask: "Given a fixed pre-training compute budget, how should I split it between parameters and tokens?" Inference-aware scaling asks a different question: "Given a target quality and expected demand, which model minimizes total lifetime cost?"

Figure: Decision diagram separating fixed training-compute objectives from lifetime-cost objectives, routing Kaplan, Chinchilla, and inference-aware scaling to different model-size choices.
Before sizing a model, name the objective. Training-loss optimization and lifetime-cost optimization are different problems, so they can choose different model and token budgets.

Chinchilla-optimal minimizes training loss for a fixed training compute budget. If an operator instead optimizes total deployment-lifetime cost, the objective includes training cost plus inference cost over the model's deployment lifetime.

The inference cost problem

A model that serves enough traffic can accumulate inference FLOPs that rival or exceed the original training cost. Sardana et al. [6] extend a Chinchilla-style objective to account for deployment demand.

Under their fitted objective and demand assumptions, Sardana et al. report that researchers expecting roughly 1B requests should prefer smaller models trained for longer. In their experiments, models from 150M to 6B parameters continue improving as token ratios rise; the 150M model is tested up to 10,000 tokens per parameter, while larger models are tested to lower maxima because of compute limits.[6]

$$\min_{N, D} \; C_{\text{train}}(N, D) + C_{\text{inference}}(N, T_{\text{served}}) \quad \text{subject to} \quad L(N, D) \le \ell$$

Reading the formula: Training costs scale with model size times data ($6ND$ FLOPs). Inference costs scale with model size times total processed inference tokens ($2N \cdot T_{\text{served}}$) for a dense decoder-only model. Once $T_{\text{served}}$ gets large enough, a smaller model that's trained longer can match the target quality while costing less over its lifetime. In symbols:

$$C_{\text{train}} \approx 6ND, \quad C_{\text{inference}} \approx 2N \cdot T_{\text{served}}$$

Here $T_{\text{served}}$ represents total input plus output tokens processed across inference requests during the model's deployment lifetime. The proxy ignores differences in hardware utilization, latency, and attention cost, so production sizing still needs measurements from the serving stack.
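The objective above can be explored numerically once a loss model is chosen. The sketch below is one possible setup, not the procedure from Sardana et al.: it assumes a Chinchilla-style parametric loss `E + A / N**alpha + B / D**beta` with the constants published by Hoffmann et al., picks a hypothetical loss target of 2.0, and grid-searches model size. For each candidate size, the loss constraint pins the minimum training tokens, so only $N$ needs to be searched.

```python
# Assumption: parametric loss L = E + A / N**ALPHA + B / D**BETA with the
# constants published by Hoffmann et al., used here purely for illustration.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def tokens_needed(parameters, target_loss):
    """Training tokens needed for this model size to hit target_loss, or None."""
    gap = target_loss - E - A / parameters ** ALPHA
    if gap <= 0:
        return None  # model too small: no amount of data reaches the target
    return (B / gap) ** (1 / BETA)


def lifetime_flops(parameters, tokens, served_tokens):
    """Training (6ND) plus inference (2N * T_served) FLOP proxy."""
    return 6 * parameters * tokens + 2 * parameters * served_tokens


def cheapest_model(target_loss, served_tokens, grid_points=400):
    """Grid-search N on a log scale from 1e9 to 1e12 parameters."""
    best = None
    for i in range(grid_points):
        parameters = 10 ** (9 + 3 * i / (grid_points - 1))
        tokens = tokens_needed(parameters, target_loss)
        if tokens is None:
            continue
        cost = lifetime_flops(parameters, tokens, served_tokens)
        if best is None or cost < best[0]:
            best = (cost, parameters, tokens)
    return best


# Hypothetical loss target and demand levels.
for served_tokens in [0, 1e12, 1e13]:
    cost, parameters, tokens = cheapest_model(2.0, served_tokens)
    print(
        f"T_served={served_tokens:.0e}: N={parameters / 1e9:.1f}B, "
        f"D={tokens / 1e12:.2f}T, lifetime FLOPs={cost:.2e}"
    )
```

Under these assumptions, raising expected demand moves the cheapest candidate toward fewer parameters and more training tokens. The specific sizes depend entirely on the assumed loss fit and target, so they shouldn't be read as recommendations.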

Concrete break-even example

Suppose you target a fixed pre-training loss ℓ\ellℓ that two candidate models can achieve. The numbers below are an arithmetic illustration, not two measured models from Sardana et al.; a real comparison first has to show that both candidates hit the quality target.

  • Model A (Chinchilla-style): 70B parameters trained on 1.4T tokens. Training cost ≈ $6 \times 70\text{B} \times 1.4\text{T} \approx 5.88 \times 10^{23}$ FLOPs. Inference cost per token ≈ $2 \times 70\text{B} = 140\text{B}$ FLOPs/token.
  • Model B (inference-aware, smaller + over-trained): 30B parameters trained on ~4T tokens (heavier over-training to match quality). Training cost ≈ $6 \times 30\text{B} \times 4\text{T} \approx 7.2 \times 10^{23}$ FLOPs (about 22% higher upfront). Inference cost per token ≈ $2 \times 30\text{B} = 60\text{B}$ FLOPs/token, less than half of Model A.

The break-even point occurs when Model B's extra training FLOPs are recovered by its lower inference FLOPs. Under these equal-quality assumptions, Model A is cheaper below about 1.65T served tokens and Model B is cheaper above that point. A deployment decision must also replace this FLOP proxy with measured quality, latency, utilization, and hardware cost.
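The break-even point has a closed form under the same FLOP proxy. Setting the two lifetime costs equal and solving for served tokens gives the expression below; the subscripts $A$ and $B$ and the symbol $T^{*}$ are notation introduced here for the two candidates and the break-even volume.

```latex
\begin{aligned}
6N_A D_A + 2N_A T^{*} &= 6N_B D_B + 2N_B T^{*} \\
T^{*} &= \frac{6\,(N_B D_B - N_A D_A)}{2\,(N_A - N_B)}
       = \frac{3\,(N_B D_B - N_A D_A)}{N_A - N_B}
\end{aligned}
```

With $N_A = 70\text{B}$, $D_A = 1.4\text{T}$, $N_B = 30\text{B}$, and $D_B = 4\text{T}$, the numerator is $3 \times (1.2 \times 10^{23} - 9.8 \times 10^{22}) = 6.6 \times 10^{22}$ and the denominator is $4 \times 10^{10}$, so $T^{*} \approx 1.65 \times 10^{12}$ served tokens.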

Figure: Assumed equal-quality break-even example comparing a 70B model and a 30B longer-trained model; under the illustration inputs, the smaller model becomes cheaper after roughly 1.65 trillion served tokens.
Under the example's equal-quality assumption, the smaller model starts with higher training FLOPs but crosses below the larger model after about 1.65T served tokens.
concrete-break-even-example.py
```python
def train_flops(parameters, tokens):
    # Training proxy: C_train ≈ 6ND.
    return 6 * parameters * tokens

def inference_flops(parameters, served_tokens):
    # Serving proxy: C_inference ≈ 2N * T_served.
    return 2 * parameters * served_tokens

model_a_train = train_flops(70e9, 1.4e12)
model_b_train = train_flops(30e9, 4e12)
extra_train = model_b_train - model_a_train
# Inference FLOPs saved per served token by the smaller model.
savings_per_token = 2 * (70e9 - 30e9)
break_even_tokens = extra_train / savings_per_token

print(f"Model A training FLOPs: {model_a_train:.2e}")
print(f"Model B training FLOPs: {model_b_train:.2e}")
print(f"Break-even served tokens: {break_even_tokens:.2e}")
```
Output
```
Model A training FLOPs: 5.88e+23
Model B training FLOPs: 7.20e+23
Break-even served tokens: 1.65e+12
```
lifetime-cost-above-and-below-break-even.py
```python
def total_flops(parameters, training_tokens, served_tokens):
    # Lifetime proxy: training (6ND) plus inference (2N * T_served).
    return 6 * parameters * training_tokens + 2 * parameters * served_tokens

# name -> (parameters, training tokens)
models = {
    "70B / 1.4T": (70e9, 1.4e12),
    "30B / 4.0T": (30e9, 4.0e12),
}

# One demand level below the break-even point and one above it.
for demand in [1e12, 3e12]:
    costs = {
        name: total_flops(parameters, tokens, demand)
        for name, (parameters, tokens) in models.items()
    }
    cheaper = min(costs, key=costs.get)
    print(f"{demand / 1e12:.0f}T served tokens -> cheaper candidate: {cheaper}")
```
Output
```
1T served tokens -> cheaper candidate: 70B / 1.4T
3T served tokens -> cheaper candidate: 30B / 4.0T
```

What the inference-aware objective predicts

Under the Sardana et al. objective, increasing expected demand shifts the fitted optimum toward fewer parameters and more pre-training tokens while holding the modeled loss target fixed. That is a prediction under an explicit quality model and demand estimate, not evidence that any particular released model was selected using that objective.

For a given quality target, compare candidates according to the question you can actually evaluate:

| Planning input | What to evaluate |
| --- | --- |
| Training loss is the objective and deployment demand is excluded | Use a Chinchilla-style baseline fitted to your data and architecture. |
| Large forecasted lifetime demand | Benchmark a smaller, longer-trained candidate against the quality target and lifetime serving cost. |
| Candidate ratios far outside fitted data | Collect proxy evidence in that regime instead of extrapolating the original fit without checks. |

The exact optimum depends on the loss target, architecture, data distribution, inference forecast, and serving stack. Don't infer a training objective from a released model's token-to-parameter ratio alone.

Public-text supply as a planning constraint

Both Chinchilla and inference-aware scaling require access to more useful training tokens as their recommended data budget grows. The supply of public human-generated text is finite, but a projection about that supply isn't evidence that it has already been exhausted.

Villalobos et al. [7] estimate an effective public-human-text stock of about 320T tokens after quality and multi-epoch adjustments. Under continued dataset-growth trends, they project full utilization between 2026 and 2032, with a median year of 2028; their assumed 5x over-training scenario shifts that projection one or two years earlier. These are scenario-dependent forecasts, not a statement that usable public text is exhausted today.

This projected constraint changes how you read every scaling law in this chapter. The $B/D^\beta$ term can only be extrapolated confidently while the data regime remains comparable to the fit. When fresh high-quality text is scarce, practitioners may evaluate better filtering and deduplication, controlled multi-epoch reuse, transfer from other domains, and synthetic data; each changes the assumptions behind the fitted curve.

projected-public-text-context.py
```python
effective_stock = 320e12  # Villalobos et al. scenario estimate
llama3_405b_tokens = 15.6e12
# Token budget a 20:1 rule would assign to a 405B-parameter model.
chinchilla_style_tokens = 20 * 405e9

print(f"Llama 3 405B reported tokens: {llama3_405b_tokens / 1e12:.1f}T")
print(f"Chinchilla-style 405B tokens: {chinchilla_style_tokens / 1e12:.1f}T")
print(
    "Reported Llama 3 dataset / projected effective stock: "
    f"{llama3_405b_tokens / effective_stock:.1%}"
)
print("Comparison is dataset size, not unique-text consumption.")
```
Output
```
Llama 3 405B reported tokens: 15.6T
Chinchilla-style 405B tokens: 8.1T
Reported Llama 3 dataset / projected effective stock: 4.9%
Comparison is dataset size, not unique-text consumption.
```

A new scaling axis: test-time compute

For most of this chapter, "scale" meant training compute split between parameters and tokens. A third allocation question is how much compute the model spends at inference time while answering a single query.

Instead of producing one answer in a fixed number of forward passes, a model can generate longer responses, sample multiple candidates, or search with a verifier. Snell et al. [8] evaluate revision and verifier-search strategies with PaLM 2-S* on MATH. In their FLOP-matched comparison, adaptively allocated test-time compute with a smaller model can outperform a roughly 14x larger model on problems where the smaller model already attains non-trivial success; on the hardest problems or under higher inference workloads, additional pre-training is more effective.
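A toy calculation helps explain why sampling more candidates pays off only when the model already has some chance of success. The sketch below is not the method evaluated by Snell et al.; it assumes independent samples and a perfect verifier that always recognizes a correct candidate, and the per-sample success rates are hypothetical.

```python
def coverage(single_sample_success, candidates):
    """Probability that at least one of `candidates` independent samples is correct."""
    return 1 - (1 - single_sample_success) ** candidates


# Hypothetical per-sample success rates for an easier and a much harder problem.
for label, success_rate in [("non-trivial success", 0.30), ("very hard problem", 0.01)]:
    for candidates in [1, 4, 16]:
        print(
            f"{label:<20} p={success_rate:.2f}, {candidates:>2} candidate(s) -> "
            f"{coverage(success_rate, candidates):.1%} chance of a correct candidate"
        )
```

When the per-sample success rate is close to zero, sixteen candidates still leave the model far from reliable. That is consistent with the observation above that additional pre-training is the more effective lever on the hardest problems. Real verifiers are imperfect and samples are correlated, so measured gains are smaller than this idealized bound.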

DeepSeek-R1 provides a related post-training example: its R1-Zero model applies reinforcement learning without supervised fine-tuning before RL, while the final R1 pipeline integrates cold-start data, supervised fine-tuning, and reinforcement-learning stages.[9] It demonstrates that post-training can change reasoning behavior; it doesn't show that pre-training data or compute no longer matters.

This matters for sizing decisions in two ways. First, it adds a knob: for a quality target, you can trade a bigger or longer-trained model against a smaller model that spends more compute per query. Second, it interacts with the inference-aware objective from earlier, because additional generated candidates or longer responses multiply $T_{\text{served}}$, raising the serving term.

The small calculation below isolates generated-token cost for sampled candidates. A real request also includes prompt tokens and any verifier work.

sampled-candidates-raise-generation-cost.py
def generation_flops(parameters, output_tokens_per_candidate, candidates):
    return 2 * parameters * output_tokens_per_candidate * candidates

parameters = 8e9
output_tokens_per_candidate = 512
for candidates in [1, 4, 16]:
    flops = generation_flops(parameters, output_tokens_per_candidate, candidates)
    print(f"{candidates:>2} candidate(s): {flops:.2e} generation FLOPs per request")

Output

 1 candidate(s): 8.19e+12 generation FLOPs per request
 4 candidate(s): 3.28e+13 generation FLOPs per request
16 candidate(s): 1.31e+14 generation FLOPs per request

Common pitfalls

Even experienced engineers trip over scaling laws. Here are five frequent mistakes, with their symptoms, causes, and fixes.

Mistake 1: treating the 20:1 rule as universal

Symptom: You read the Chinchilla paper, memorize "20 tokens per parameter," and apply it to every model you build. A 1B-parameter model trained on 20B tokens underperforms on your messy carrier-claim corpus, and you can't figure out why.

Cause: The 20:1 ratio is a fitted result for dense transformers on general web text. It shifts with data quality, tokenizer design, optimizer regime, and architecture. It's a starting point, not a physical constant.

Fix: Treat 20:1 as a baseline, then run small proxy experiments (100M to 1B parameters) on your actual data to fit your own exponents. If your data is cleaner or more repetitive, the optimal ratio may be lower; if it's noisier or more diverse, it may be higher.

Mistake 2: treating $C \approx 6ND$ as exact accounting

Symptom: Your spreadsheet says two candidate runs have the same training FLOPs, but the real cluster bills differ materially.

Cause: $6ND$ is a dense-transformer rule of thumb. It ignores attention-kernel details, optimizer state, activation checkpointing, sequence length, hardware utilization, distributed communication, data pipeline stalls, and sparse-routing behavior.

Fix: Use $6ND$ for first-pass sizing, then replace it with measured hardware FLOPs, wall-clock throughput, and dollars per token from your actual stack before committing budget.
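To see why identical $6ND$ totals can produce different bills, the sketch below converts a FLOP estimate into GPU-hours under an assumed utilization (the fraction of peak throughput a run sustains). The function name, peak throughput, utilization values, and model size are hypothetical placeholders; substitute numbers measured on your own stack.

```python
def training_gpu_hours(
    parameters: float,
    tokens: float,
    peak_flops_per_gpu: float,
    utilization: float,
) -> float:
    """GPU-hours implied by the 6ND rule of thumb at a given utilization."""
    total_flops = 6 * parameters * tokens
    achieved_flops_per_gpu = peak_flops_per_gpu * utilization
    return total_flops / achieved_flops_per_gpu / 3600


# Hypothetical values: 7B parameters, 140B tokens, 1e15 peak FLOP/s per GPU.
N, D, PEAK = 7e9, 140e9, 1e15
for utilization in (0.30, 0.45):  # assumed values for two candidate setups
    hours = training_gpu_hours(N, D, PEAK, utilization)
    print(f"utilization {utilization:.0%}: {hours:,.0f} GPU-hours")
```

Both rows share the same $6ND$ total. Only the assumed utilization differs, and the GPU-hours differ by a factor of 1.5.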

Mistake 3: confusing loss with real-world capability

Symptom: Your scaling study predicts excellent cross-entropy loss, but the deployed model hallucinates facts, ignores instructions, or fails safety checks.

Cause: Scaling laws predict pre-training loss (how well the model compresses the training distribution), not downstream task performance, alignment quality, or reasoning ability. A model can have great loss and still be useless in production.

Fix: Budget separately for the post-training pipeline. Techniques such as instruction tuning and RLHF [10] can improve instruction behavior and preference alignment that a pre-training loss curve doesn't measure. Don't skip evaluation or post-training because your loss curve looks good.

Mistake 4: ignoring inference costs when sizing a model

Symptom: You train a Chinchilla-optimal 70B model, deploy it to production, and discover that your serving bill exceeds your training budget within three months.

Cause: Chinchilla optimizes training loss for a fixed training budget, not total lifetime cost. To reach the same quality, a smaller model trained longer costs more upfront in training compute but saves money on each inference call.

Fix: Before you commit to a model size, estimate your expected inference volume. Compare total cost = $6ND$ (train) + $2N \cdot T_{\text{served}}$ (inference) across candidates that meet a measured quality target. At sufficiently high demand, a smaller model trained for longer can be cheaper under this proxy; verify the crossover on your serving stack.
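A minimal sketch of that comparison follows. It assumes two candidates that have already been shown to meet the same quality target; the 30B configuration and its token count are made up for illustration, and the helper names are hypothetical. Setting the two lifetime totals equal and solving for $T_{\text{served}}$ gives the crossover volume.

```python
def lifetime_flops(parameters: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference FLOPs under the 6ND + 2N*T_served proxy."""
    return 6 * parameters * train_tokens + 2 * parameters * served_tokens


def crossover_served_tokens(big: tuple[float, float], small: tuple[float, float]) -> float:
    """Served-token volume at which both candidates cost the same.

    Each candidate is (parameters, train_tokens), with big[0] > small[0].
    Returns 0.0 when the smaller model is already no more expensive to train.
    """
    n_big, d_big = big
    n_small, d_small = small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saving_per_served_token = 2 * (n_big - n_small)
    return max(extra_training / saving_per_served_token, 0.0)


# Hypothetical candidates assumed to reach the same measured quality target.
big = (70e9, 1.4e12)    # Chinchilla-style 70B
small = (30e9, 4.0e12)  # smaller model trained longer (token count is made up)

t_star = crossover_served_tokens(big, small)
print(f"Crossover at about {t_star:.2e} served tokens")
```

Below the crossover the larger model is cheaper over its lifetime under this proxy; above it the smaller model wins. The proxy ignores prompt processing, batching efficiency, and hardware pricing, so treat the result as a first estimate.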

Mistake 5: extrapolating scaling laws far beyond the training regime

Symptom: Your proxy experiments span models from 100M to 1B parameters. You fit a beautiful straight line, extrapolate it to predict the loss of a 100B model, and the actual result is way off.

Cause: Power-law fits work well inside the range where they were measured. Outside that range, new bottlenecks appear: optimizer instability, numerical precision issues, data exhaustion, or hardware communication overhead that didn't exist at small scale.

Fix: Validate with at least one mid-scale experiment before committing to the full target scale. If your proxy range is 100M to 1B, run a 10B validation before you trust the curve at 100B. Treat extrapolation as a hypothesis, not a guarantee.

Scaling law breakdowns and limitations

Scaling laws are empirical fits, not physical laws. Exponents shift with architecture, data quality, tokenizer, and optimizer regime, so one fitted curve shouldn't be treated as a universal law.

Emergent abilities

Wei et al. [11] highlighted benchmark tasks where performance appears to stay near zero and then jump at larger scales, creating the impression of a phase transition in capability. Schaeffer et al. [12] later argued that many of these jumps are measurement artifacts: when you replace exact-match thresholds with continuous metrics, the same capabilities often return to smoother scaling curves.

The point isn't that emergence is fake or guaranteed. The useful lesson is measurement design: thresholded metrics can make smooth changes look abrupt, so teams should inspect continuous metrics before treating a capability jump as a new law of scale.
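The thresholding effect can be reproduced with a toy calculation. The sketch below assumes an answer counts as correct only if every one of its tokens is right, and that tokens are correct independently with the same probability. Both are simplifications, and the accuracy values are hypothetical rather than taken from either paper.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token in the answer is right (independence assumed)."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies that improve smoothly with scale.
ANSWER_LENGTH = 20
for accuracy in (0.70, 0.80, 0.90, 0.95, 0.99):
    em = exact_match_rate(accuracy, ANSWER_LENGTH)
    print(f"per-token accuracy {accuracy:.2f} -> exact match {em:.3f}")
```

Per-token accuracy rises steadily, yet exact match stays near zero for the first two settings and then climbs quickly. The underlying improvement is the same in both views; only the metric changes.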

Task-specific ceilings

Scaling laws predict pre-training loss, not downstream task performance. A model that achieves excellent perplexity may still fail at:

  • Factual accuracy (hallucinations)
  • Instruction following
  • Safety alignment
  • Specific domain tasks

Because scaling laws only measure the model's ability to compress and predict the training distribution, teams must allocate separate compute budgets for the post-training pipeline. Techniques such as instruction tuning and RLHF [10] are used to improve instruction behavior and preference alignment, which must be evaluated separately from pre-training loss.

Architecture sensitivity

Scaling laws derived for dense transformers don't directly transfer to:

  • Mixture-of-Experts (MoE): MoE models selectively activate subsets of parameters for each input, decoupling total parameters from compute per token. This makes it difficult to apply standard dense scaling laws, as active parameters and total parameters scale differently.
  • State Space Models (SSMs): SSMs process sequences recurrently rather than using quadratic attention. These alternative architectures have their own scaling exponents and require independent empirical fitting.
  • Hybrid architectures: Models that combine attention with SSMs or other mechanisms must have their scaling behavior re-characterized from scratch.

For these architectures, dense-transformer scaling laws are useful context, not a substitute for new measurements. Each architecture needs empirical scaling studies to determine its own parameter and data allocation.
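The gap between total and active parameters in an MoE model can be made concrete with a small sketch. The `MoEConfig` class and every number in it are hypothetical and do not describe any released model. The FLOP estimate reuses the 2-FLOPs-per-parameter-per-token proxy from the serving term and ignores router overhead.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    shared_params: float      # attention, embeddings, and other always-on weights
    params_per_expert: float  # weights in one expert, summed over all MoE layers
    num_experts: int
    experts_per_token: int    # top-k routing

    @property
    def total_params(self) -> float:
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self) -> float:
        return self.shared_params + self.experts_per_token * self.params_per_expert


# Hypothetical configuration for illustration only.
cfg = MoEConfig(shared_params=2e9, params_per_expert=1e9, num_experts=8, experts_per_token=2)
print(f"total parameters:  {cfg.total_params / 1e9:.0f}B")
print(f"active parameters: {cfg.active_params / 1e9:.0f}B")
print(f"approx. forward FLOPs per token: {2 * cfg.active_params:.1e}")
```

Here memory scales with the 10B total while per-token compute scales with the 4B active count, which is why a single "N" no longer describes the model.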

From theory to practice: running a scaling study

When planning a large training run, teams conduct scaling studies (small-scale experiments that predict large-scale performance) before committing resources. The standard pipeline trains proxy models, fits exponents, extrapolates to the target frontier, validates at a medium scale, and only then commits to the full run.

The practical loop is:

  1. Train proxy models, usually across several parameter counts and token budgets.
  2. Fit scaling exponents for the loss surface.
  3. Extrapolate to the target frontier: model size, token budget, and expected loss.
  4. Run a mid-scale validation experiment to check the extrapolation.
  5. Commit to the full training run only after the mid-scale result lands near the fitted curve.
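The gate in the last step can be written down explicitly so the decision is not made by eye. The sketch below is one possible check; the function name, the 2% relative tolerance, and the loss values are hypothetical choices, not a standard.

```python
def passes_validation_gate(
    predicted_loss: float,
    observed_loss: float,
    relative_tolerance: float = 0.02,
) -> bool:
    """True when the mid-scale run lands within tolerance of the fitted curve."""
    relative_error = abs(observed_loss - predicted_loss) / predicted_loss
    return relative_error <= relative_tolerance


# Hypothetical mid-scale results.
print(passes_validation_gate(predicted_loss=2.30, observed_loss=2.32))  # True
print(passes_validation_gate(predicted_loss=2.30, observed_loss=2.45))  # False
```

A sensible tolerance depends on run-to-run noise in your setup, so it is worth estimating that noise from repeated proxy runs before fixing the threshold.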

The µ-Transfer approach

To estimate the performance of a large model before committing to its full run, engineers use proxy models. A complementary tool is µ-Transfer (Tensor Programs V) [13], which uses the Maximal Update Parametrization (µP) so selected hyperparameters tuned on a smaller proxy can be transferred to a larger target model.

Yang et al. demonstrate transfer for hyperparameters such as learning rate and learning-rate schedule, and discuss transfer across scale dimensions such as width, batch size, sequence length, and training time with caveats. They also report cases where transfer across depth is fragile. µ-Transfer is a tool for reducing tuning cost, not permission to copy every setting without validation.

Used alongside a scaling study, the workflow is:

  1. Train smaller proxy models across multiple data budgets to sample the loss surface.
  2. In a µP setup, tune hyperparameters covered by the transfer method at small scale and transfer candidate settings to larger models.
  3. Fit the power-law scaling exponents ($\alpha$, $\beta$, and the constants) using the observed losses from proxy runs.
  4. Extrapolate the fitted curve to the target scale to predict final loss.
  5. Validate with a single medium-scale experiment before committing to the full run.

The µ-Transfer paper shows that zero-shot hyperparameter transfer can work across large changes in model scale in its tested setups. A medium-scale validation still matters because transferability, loss fits, data mixture, and hardware behavior are separate failure modes.

Fitting the parametric loss function

One common parametric fit for model size and data, used in Chinchilla-style scaling studies, is:[2]

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Reading the formula: Loss equals three fitted terms: (1) $E$, the asymptotic floor in this model of the measured regime, (2) $A/N^\alpha$, the fitted penalty associated with finite parameters, and (3) $B/D^\beta$, the fitted penalty associated with finite training data. Making the model bigger reduces term 2; more data reduces term 3. Treat $E$ as an extrapolated fit parameter, not proof that you have measured the irreducible entropy of language.

To see how this works in practice, consider a tiny synthetic dataset of proxy runs:

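| Parameters (N) | Tokens (D) | Observed loss |
| --- | --- | --- |
| 100M | 1B | 2.894 |
| 100M | 5B | 2.745 |
| 100M | 20B | 2.634 |
| 500M | 1B | 2.811 |
| 500M | 5B | 2.662 |
| 500M | 20B | 2.551 |
| 1B | 5B | 2.629 |
| 1B | 20B | 2.518 |
| 1B | 100B | 2.407 |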

Notice the pattern: fixing N and increasing D lowers loss; fixing D and increasing N also lowers loss. The parametric formula above captures both effects at once.

[Figure: Scaling study illustration with proxy-run scatter plot, four-step fit workflow, predicted 70B target loss, and a mid-scale validation gate before full training.]
A good scaling study doesn't jump from tiny runs straight to a frontier bet. Measure proxy runs, fit the loss law, forecast the target, then earn trust with one larger validation run.

This Python script fits the scaling constants with SciPy's curve_fit and then predicts loss at a larger target scale:

fitting-the-parametric-loss-function.py
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(
    X: tuple[np.ndarray, np.ndarray],
    E: float,
    A: float,
    alpha: float,
    B: float,
    beta: float,
) -> np.ndarray:
    N, D = X
    return E + A / (N ** alpha) + B / (D ** beta)

# Synthetic proxy-run measurements for illustration only.
# Each row: (parameters, tokens, observed_loss)
experiments = np.array([
    [1e8, 1e9, 2.894],
    [1e8, 5e9, 2.745],
    [1e8, 2e10, 2.634],
    [5e8, 1e9, 2.811],
    [5e8, 5e9, 2.662],
    [5e8, 2e10, 2.551],
    [1e9, 5e9, 2.629],
    [1e9, 2e10, 2.518],
    [1e9, 1e11, 2.407],
])

N_data = experiments[:, 0]
D_data = experiments[:, 1]
L_data = experiments[:, 2]

popt, pcov = curve_fit(
    scaling_law, (N_data, D_data), L_data,
    p0=[1.0, 2.5, 0.07, 7.0, 0.09],
    bounds=([0, 0, 0, 0, 0], [5, 1e6, 1, 1e6, 1]),
    maxfev=20_000,
)

E_fit, A_fit, alpha_fit, B_fit, beta_fit = popt
print(f"Fitted asymptotic E = {E_fit:.3f}")
print(f"Parameter scaling: A={A_fit:.1f}, alpha={alpha_fit:.4f}")
print(f"Data scaling: B={B_fit:.1f}, beta={beta_fit:.4f}")

# Extrapolate: predict loss for a 70B model on 1.4T tokens
predicted = scaling_law((70e9, 1.4e12), *popt)
print(f"\nPredicted loss for 70B / 1.4T tokens: {predicted:.3f}")

Output

Fitted asymptotic E = 1.096
Parameter scaling: A=2.8, alpha=0.0703
Data scaling: B=7.8, beta=0.0980

Predicted loss for 70B / 1.4T tokens: 2.088
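Once the constants are fitted, the same parametric form implies a compute-optimal split. Minimizing $A/N^\alpha + B/D^\beta$ subject to $C = 6ND$ gives $N_{\text{opt}} = G\,(C/6)^{\beta/(\alpha+\beta)}$ with $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$, and $D_{\text{opt}} = C / (6 N_{\text{opt}})$. The sketch below applies that closed form to the rounded values printed above. The function name is hypothetical, and because the inputs come from synthetic data, the output illustrates the mechanics only.

```python
def compute_optimal_allocation(
    compute_flops: float, A: float, alpha: float, B: float, beta: float
) -> tuple[float, float]:
    """Minimize A/N**alpha + B/D**beta subject to 6*N*D = compute_flops."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (compute_flops / 6.0) ** (beta / (alpha + beta))
    d_opt = compute_flops / (6.0 * n_opt)
    return n_opt, d_opt


# Rounded values from the synthetic fit; illustrative only.
A, alpha, B, beta = 2.8, 0.0703, 7.8, 0.0980
n_opt, d_opt = compute_optimal_allocation(1e24, A, alpha, B, beta)
print(f"N_opt ~ {n_opt:.2e} parameters, D_opt ~ {d_opt:.2e} tokens")
print(f"tokens per parameter ~ {d_opt / n_opt:.0f}")
```

The resulting ratio is not expected to match the 20:1 rule of thumb, since it depends entirely on the fitted exponents. That sensitivity is the reason to fit on your own data and validate before trusting the allocation.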

Follow-up questions

If you have $10^{24}$ training FLOPs, what Chinchilla-style size should you start from?

Start with $C \approx 6ND$ and the rule of thumb $D \approx 20N$. Substituting gives $10^{24} \approx 6N(20N) = 120N^2$, so $N \approx \sqrt{10^{24}/120} \approx 91\text{B}$ parameters and $D \approx 1.8\text{T}$ tokens. That is a training-loss starting point. If the model will serve heavy traffic, test smaller models trained longer.
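The same arithmetic as a small helper, in case you want to sweep budgets or ratios. The function name and the `tokens_per_param` argument are illustrative choices; the default of 20 is the rule of thumb used in the answer.

```python
import math


def chinchilla_style_start(
    compute_flops: float, tokens_per_param: float = 20.0
) -> tuple[float, float]:
    """Solve C = 6 * N * D with D = tokens_per_param * N."""
    parameters = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * parameters
    return parameters, tokens


n, d = chinchilla_style_start(1e24)
print(f"N ~ {n / 1e9:.0f}B parameters, D ~ {d / 1e12:.1f}T tokens")
# prints: N ~ 91B parameters, D ~ 1.8T tokens
```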

Your product will likely see roughly 1B requests after launch. Do you still default to the Chinchilla frontier?

Not blindly. Chinchilla minimizes pre-training loss for a fixed training-compute budget. Under Sardana et al.'s fitted quality and demand objective, roughly 1B expected requests shifts the predicted optimum toward smaller models trained for longer.[6] Start with a Chinchilla-style baseline, then compare it with a smaller, longer-trained candidate using measured quality and total lifetime cost.

Why do modern dense models often train beyond 20 tokens per parameter?

There can be several reasons: a scaling fit on different data, a different loss or downstream target, or an inference-aware objective. Llama 3's 405B model is a concrete observed ratio above 20:1: 405B parameters on 15.6T tokens, or about 38.5:1.[5] Meta reports that configuration as approximately compute-optimal under its own data and training-budget scaling laws, so its ratio isn't evidence by itself for an inference-cost explanation.

How do scaling laws change for mixture-of-experts models?

Dense scaling laws use parameter count as a rough proxy for both capacity and compute. MoE breaks that shortcut. Total parameters affect memory and routing complexity; active parameters drive compute per token. You need separate scaling fits that track active parameters, total parameters, router behavior, and data budget.

Your proxy runs cover 100M to 1B parameters, but a 10B validation run misses the fitted curve. What should you do next?

Treat the miss as evidence that your fit stopped generalizing. Don't launch the full run because the small-scale line looked clean. Re-fit with the 10B point included, inspect new bottlenecks such as optimizer instability or data exhaustion, and only then decide whether the frontier still supports the full target scale.

What you should be able to defend

| Tier | You should be able to defend |
| --- | --- |
| Foundational | Explain parameters, tokens, compute, and why scaling laws are empirical power-law fits. |
| Foundational | Read Kaplan's single-variable loss curves and name what each bottleneck means. |
| Intermediate | Explain why Kaplan's fitted frontier favored parameters faster than data. |
| Intermediate | Describe how Chinchilla revised the allocation toward roughly 20 tokens per parameter for dense transformers. |
| Advanced | Use $C \approx 6ND$ to estimate tokens, parameters, and training FLOPs while stating its limits. |
| Advanced | Explain why inference-aware scaling can favor smaller models trained longer. |
| Advanced | Diagnose over-training, under-training, extrapolation risk, and dense-versus-MoE parameter ambiguity. |
| Advanced | Design a small scaling study with proxy runs, fitted exponents, and a mid-scale validation check. |

Key takeaways

  • Scaling laws are power laws that quantify how loss decreases with model size, data, and compute. They're empirical, not theoretical.
  • Chinchilla revised the allocation: in Hoffmann et al.'s dense-transformer training-loss setup, roughly 20 tokens per parameter was a better compute-optimal rule than parameter-heavy GPT-3-era ratios.
  • Lifetime-cost objectives change the calculation: under Sardana et al.'s fitted inference-aware objective, high demand can favor smaller models trained for far more tokens; that prediction still requires quality and cost validation.
  • Scaling laws have blind spots: they predict pre-training loss but not emergent capabilities, alignment quality, or task-specific performance.
  • Public-text supply is a forecasted constraint: Villalobos et al. estimate about 320T effective tokens and project full utilization under stated growth assumptions; treat this as a planning scenario, not a confirmed exhaustion event.
  • Test-time compute is another allocation axis: evaluated inference-time strategies can lift quality in some regimes, but extra candidates or response tokens also raise serving cost.
  • Scaling studies are essential: fit power-law parameters on small proxy models before committing to expensive large-scale training runs.
Next Step
Continue to Pre-training Data at Scale

Scaling laws give you the compute and token budget; the data pipeline decides whether those tokens are clean, diverse, deduplicated, and useful enough to justify the spend.

References

[1] Kaplan et al. (2020). Scaling Laws for Neural Language Models.

[2] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.

[3] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.

[4] Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., & Carmon, Y. (2024). Resolving Discrepancies in Compute-Optimal Scaling of Language Models.

[5] Dubey, A., et al. (2024). The Llama 3 Herd of Models. arXiv preprint.

[6] Sardana & Frankle (2024). Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws.

[7] Villalobos, P., et al. (Epoch AI) (2022). Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. arXiv preprint.

[8] Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. arXiv preprint.

[9] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

[10] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.

[11] Wei, J., et al. (2022). Emergent Abilities of Large Language Models. TMLR.

[12] Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023.

[13] Yang, G., et al. (2021). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. NeurIPS 2021.