Learn the empirical power laws governing LLM performance, from Kaplan's parameter-heavy frontier through Chinchilla-optimal ratios to modern inference-aware training strategies.
After decoding, we move from "how a trained model emits tokens" to "how a lab decides what model to train in the first place." Scaling laws describe how model loss changes as you add parameters, tokens, and compute. This chapter explains the practical question behind the math: how to spend a fixed training budget without overbuilding the wrong part of the system.
Imagine your team has a fixed budget to build an AI model that answers customer support questions. You face a simple-sounding choice: do you build a huge model and train it on a modest amount of data, or build a smaller model and train it on much more data? Get this balance wrong, and you can spend millions of dollars on a model that underperforms a smaller, better-trained alternative.
This is the problem scaling laws address. They are empirical relationships that predict how a model's performance improves as you increase its size, its training data, or both. Kaplan et al. at OpenAI [1] quantified influential scaling curves, and the Chinchilla paper [2] later found a more balanced allocation of model size and data for compute-optimal dense training. These fits help plan expensive runs, but each fit applies only within its objective, data, architecture, and measurement assumptions.
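Before we go further, let's agree on three basic terms we'll use throughout this chapter: parameters ($N$), the learned weights of the model; tokens ($D$), the units of text the model is trained on; and compute ($C$), the floating-point operations (FLOPs) spent on training.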
Frontier pre-training runs consume large compute budgets. In the early days of deep learning, architecture design and hyperparameter tuning drove many performance improvements. Today, scale is one of the main levers. But scaling isn't as simple as making everything bigger. If you misallocate your compute budget between model size and data volume, you can spend a large budget on a model that underperforms a smaller, better-trained counterpart.
Scaling laws address this problem by providing predictive models that guide resource allocation. They map the relationship between the scale of your inputs (parameters, tokens, and compute) and the expected performance of the output model. In pre-training studies, that performance is usually loss, which measures how well the model predicts the next token. By studying these relationships, engineers can estimate how a model will perform before committing to a costly, months-long training run.
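These empirical laws allow teams to forecast loss before a large run and to decide how to split a fixed budget between model size and data.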
They are planning tools, not guarantees. A good scaling study reduces uncertainty before a large run, but the final decision still has to account for data quality, architecture, hardware efficiency, post-training, and deployment cost.
Before we meet any formulas, let's see the shape of scaling with a tiny concrete example. Suppose you train a 1-billion-parameter model on 10 billion tokens and measure its cross-entropy loss. Then you run three follow-up experiments:
| Experiment | Parameters | Tokens | What changed | Approximate loss |
|---|---|---|---|---|
| Baseline | 1B | 10B | Nothing | 2.80 |
| Double model size | 2B | 10B | 2x parameters | 2.60 |
| Double data | 1B | 20B | 2x tokens | 2.55 |
| Double both | 2B | 20B | 2x each | 2.35 |
These numbers are synthetic, but they capture a real pattern: doubling parameters or data lowers loss, but doubling both together lowers it more than either alone. The improvement is smooth and predictable, but it follows diminishing returns. A 10x increase in parameters doesn't cut loss by 10x; it cuts it by a smaller, fixed multiplier.
This predictable curve is a power law. In everyday language, an idealized power-law term becomes a straight line when you plot its logarithm against log(parameters) or log(data). That line estimates the scaling trend. It lets engineers answer questions like: "If I have 10x more compute, how much better will my model get?" without training the model first. Later, we'll add an asymptotic floor and fit parameters and data together.
```python
alpha_n = 0.076  # parameter-scaling exponent reported by Kaplan et al. [1]

for parameter_multiplier in [2, 10, 100]:
    # Multiplier applied to the parameter-limited loss term when N grows.
    scaled_term = parameter_multiplier ** (-alpha_n)
    reduction = 1 - scaled_term
    print(
        f"{parameter_multiplier:>3}x parameters -> "
        f"{scaled_term:.3f}x parameter-limited loss term "
        f"({reduction:.1%} reduction)"
    )
```

```text
  2x parameters -> 0.949x parameter-limited loss term (5.1% reduction)
 10x parameters -> 0.839x parameter-limited loss term (16.1% reduction)
100x parameters -> 0.705x parameter-limited loss term (29.5% reduction)
```

OpenAI's 2020 work [1] established three power-law relationships for transformer language models. Each describes a different bottleneck regime for loss $L$:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076$$

Reading the formula: Loss decreases as a power law when you make the model bigger. Double the parameters, and loss drops by a fixed multiplier. The exponent $\alpha_N \approx 0.076$ means the improvement is smooth and predictable, but diminishing returns are steep: a 10x increase in parameters reduces the parameter-limited loss by only about 16%.
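$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095$$

Reading the formula: The same pattern holds for data. More training tokens means lower loss, following a predictable power curve with $\alpha_D \approx 0.095$.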
$$L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050$$

Reading the formula: Kaplan's compute curve is written in terms of optimally allocated training compute, not arbitrary training runs. Using the standard dense-transformer budgeting rule $C \approx 6ND$, the fitted exponent is $\alpha_C^{\min} \approx 0.050$. The factor of 6 comes from a rough FLOP accounting: about $2N$ FLOPs per token for the forward pass and about $4N$ FLOPs per token for backpropagation. It's a planning heuristic, not an exact profiler. Real training runs also depend on attention kernels, optimizer overhead, activation recomputation, and architecture choices such as MoE sparsity.
Kaplan's influential result wasn't that $\alpha_N$ was larger than $\alpha_D$; it wasn't. The operational takeaway came from the compute-optimal frontier: for a fixed compute budget, the fitted optimum favored much larger models and relatively little data.
Kaplan's team concluded that, under their fitted scaling regime, you should spend most new compute on model size and only a smaller fraction on additional data. Think of it like this: upgrading a tiny fulfillment center into a large automated hub improves throughput more than sending more orders through the same tiny center.
This led to the initial strategy of training very large models on relatively modest data budgets, an approach exemplified by GPT-3 (175B parameters on roughly 300B tokens, a 1.7:1 ratio) [3].
When compute is the binding constraint, Kaplan suggested allocating the budget such that:

$$N_{\text{opt}} \propto C^{0.73}, \qquad D_{\text{opt}} \propto C^{0.27}$$
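Where these exponents come from: Kaplan didn't get these exponents by comparing $\alpha_N$ and $\alpha_D$ directly. He first fit a joint loss surface,

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

and then solved for the best allocation under the transformer training-cost approximation $C \approx 6ND$. That optimization produced the strongly parameter-heavy split above. Chinchilla later re-fit the compute-optimal frontier and got the much more balanced $N_{\text{opt}} \propto C^{0.5}$ / $D_{\text{opt}} \propto C^{0.5}$ result.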
What this means: under Kaplan's fit, 10x more compute should go mostly into a bigger model (about 5.4x bigger) and only a little into more data (about 1.9x more). This led to the strategy of building very large models on modest datasets.
```python
# Fitted compute-optimal exponents: (parameter exponent, token exponent).
frontiers = {
    "Kaplan": (0.73, 0.27),
    "Chinchilla": (0.50, 0.50),
}

for compute_multiplier in [10, 100]:
    print(f"{compute_multiplier}x training compute")
    for name, (parameter_exp, token_exp) in frontiers.items():
        parameters = compute_multiplier ** parameter_exp
        tokens = compute_multiplier ** token_exp
        print(f"  {name:<10} N={parameters:5.2f}x, D={tokens:5.2f}x")
```

```text
10x training compute
  Kaplan     N= 5.37x, D= 1.86x
  Chinchilla N= 3.16x, D= 3.16x
100x training compute
  Kaplan     N=28.84x, D= 3.47x
  Chinchilla N=10.00x, D=10.00x
```

```python
def dense_training_flops(parameters, tokens):
    # Planning approximation for dense transformers: C ≈ 6ND.
    return 6 * parameters * tokens

gpt3_style = dense_training_flops(175e9, 300e9)
small_data_rich = dense_training_flops(70e9, 1.4e12)

print(f"GPT-3-style training FLOPs: {gpt3_style:.2e}")
print(f"70B / 1.4T-token training FLOPs: {small_data_rich:.2e}")
```

```text
GPT-3-style training FLOPs: 3.15e+23
70B / 1.4T-token training FLOPs: 5.88e+23
```

Imagine two fulfillment teams get the same budget. Team A builds a large automated hub but can only process a thin sample of orders. Team B builds a modest hub but processes a much richer, higher-quality order history. Which system performs better? DeepMind ran this experiment with AI models, and the results changed how teams thought about model size.
Hoffmann et al. at DeepMind [2] challenged the Kaplan prescription by training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens. Their central finding was that the Kaplan allocation had underestimated the value of data.
Why did two careful studies disagree so sharply? Later analysis traced the gap to methodology rather than a new law of transformers. Porian et al. [4] reproduce Kaplan-style scaling experiments and identify three contributors: omitting final decoding-layer computation, an excessively long fixed warmup for small models, and optimizer hyperparameters that weren't tuned as a function of scale. Correcting those factors produces close agreement with the Chinchilla allocation; their experiments also find that tailored learning-rate decay isn't the central explanation. The lesson is that a scaling exponent is only as trustworthy as its compute accounting and fitting procedure.
The revised scaling law for jointly optimal model size and data volume is:

$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$

In plain terms: Chinchilla's correction is that model size and data should grow at equal rates. If you double compute, grow both the model and the dataset by about $\sqrt{2} \approx 1.4$x each. The practical rule:

$$D \approx 20N$$
That 20:1 ratio is a fitted result for the dense-transformer setup Hoffmann et al. studied, not a universal constant.[2]
```python
def chinchilla_style_size(training_flops, tokens_per_parameter=20):
    # Solve C = 6 * N * D with D = tokens_per_parameter * N for N.
    parameters = (training_flops / (6 * tokens_per_parameter)) ** 0.5
    tokens = tokens_per_parameter * parameters
    return parameters, tokens

parameters, tokens = chinchilla_style_size(1e24)
print(f"Parameters at 1e24 FLOPs: {parameters / 1e9:.1f}B")
print(f"Tokens at 20:1: {tokens / 1e12:.2f}T")
```

```text
Parameters at 1e24 FLOPs: 91.3B
Tokens at 20:1: 1.83T
```
DeepMind demonstrated this dramatically with Gopher and Chinchilla:
| Model | Parameters | Training Tokens | Ratio (D/N) | Compute (FLOPs) |
|---|---|---|---|---|
| Gopher | 280B | 300B | 1.07:1 | 5.76 × 10²³ |
| Chinchilla | ~70B | 1.4T | 20:1 | 5.76 × 10²³ |
The FLOP column is the matched training budget reported by Hoffmann et al. It won't equal $6ND$ exactly if you plug in the displayed parameter and token counts: $C \approx 6ND$ is a planning approximation, the paper uses architecture-aware FLOP accounting, and the displayed counts are rounded.[2]
With the same reported training budget, Chinchilla outperformed Gopher across the evaluation suite DeepMind reported. The 4x smaller model trained on about 4.7x more data delivered better downstream accuracy while also being cheaper to fine-tune and serve.[2]
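As a quick check on the table, the sketch below applies the $C \approx 6ND$ planning rule to the rounded counts shown and compares each estimate with the reported budget. The helper name `dense_training_flops_estimate` and the relative-gap printout are illustrative choices, not part of Hoffmann et al.'s accounting.

```python
def dense_training_flops_estimate(parameters: float, tokens: float) -> float:
    """Rough dense-transformer training cost: about 6 FLOPs per parameter per token."""
    return 6 * parameters * tokens

REPORTED_BUDGET = 5.76e23  # matched training budget from the table above

# Rounded parameter and token counts from the table above.
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

estimates = {}
for name, (parameters, tokens) in models.items():
    estimate = dense_training_flops_estimate(parameters, tokens)
    estimates[name] = estimate
    gap = estimate / REPORTED_BUDGET - 1
    print(f"{name:<10} 6ND estimate: {estimate:.2e} FLOPs ({gap:+.1%} vs reported)")
```

Both estimates land within roughly 13% of the reported figure, which is about the agreement to expect from a planning heuristic applied to rounded counts.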
Published model reports make it possible to observe the token-to-parameter ratio actually used. That observation doesn't by itself reveal which objective selected the configuration.
Meta reports that the Llama 3 405B flagship was pre-trained on 15.6T text tokens.[5] That yields approximately $15.6\text{T} / 405\text{B} \approx 38.5$ tokens per parameter, above Chinchilla's roughly 20:1 fitted rule. Importantly, the Llama 3 paper says its 405B model size is approximately compute-optimal under scaling laws fitted on Meta's data and its $3.8 \times 10^{25}$ FLOP training budget.[5] It doesn't establish that lifetime inference cost caused the ratio. The inference-aware objective in the next section is a separate way to evaluate that trade-off.
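The ratio arithmetic is easy to reproduce. The snippet below is a minimal sketch using only the two reported counts; the constant names are illustrative, and the FLOP figure is the $6ND$ planning proxy, not Meta's own accounting.

```python
LLAMA3_PARAMETERS = 405e9  # reported flagship size
LLAMA3_TOKENS = 15.6e12    # reported pre-training tokens
CHINCHILLA_RATIO = 20      # fitted rule of thumb, not a constant

tokens_per_parameter = LLAMA3_TOKENS / LLAMA3_PARAMETERS
ratio_vs_chinchilla = tokens_per_parameter / CHINCHILLA_RATIO
approx_training_flops = 6 * LLAMA3_PARAMETERS * LLAMA3_TOKENS  # 6ND planning proxy

print(f"Tokens per parameter: {tokens_per_parameter:.1f}")
print(f"Multiple of the 20:1 rule: {ratio_vs_chinchilla:.2f}x")
print(f"6ND training estimate: {approx_training_flops:.2e} FLOPs")
```

The observed ratio is a little under twice the 20:1 rule. As the text notes, that number alone doesn't say which objective produced it.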
```python
def chinchilla_tokens(parameters, tokens_per_parameter=20):
    return parameters * tokens_per_parameter

def dense_training_flops(parameters, tokens):
    # Planning approximation for dense transformers: C ≈ 6ND.
    return 6 * parameters * tokens

parameters = 10e9
tokens = chinchilla_tokens(parameters)
flops = dense_training_flops(parameters, tokens)

print(f"Chinchilla-style tokens for 10B parameters: {tokens:.2e}")
print(f"Dense training FLOPs: {flops:.2e}")
```

```text
Chinchilla-style tokens for 10B parameters: 2.00e+11
Dense training FLOPs: 1.20e+22
```

Chinchilla tells you how to train the model with the lowest pre-training loss for a fixed training-compute budget, but that's only half the story. Think of it like choosing warehouse equipment: the Chinchilla view says "pick the machine with the best performance-per-dollar purchase price." But if the fastest sorter needs expensive maintenance and constant supervision, while a smaller sorter runs cheaply every night, the smaller sorter might cost less over its lifetime. The same logic applies to AI models: a smaller, well-trained model is cheaper to serve every day.
To visualize this trade-off, it helps to separate the objective functions. Kaplan and Chinchilla both ask: "Given a fixed pre-training compute budget, how should I split it between parameters and tokens?" Inference-aware scaling asks a different question: "Given a target quality and expected demand, which model minimizes total lifetime cost?"
Chinchilla-optimal minimizes training loss for a fixed training compute budget. If an operator instead optimizes total deployment-lifetime cost, the objective includes training cost plus inference cost over the model's deployment lifetime.
A model that serves enough traffic can accumulate inference FLOPs that rival or exceed the original training cost. Sardana et al. [6] extend a Chinchilla-style objective to account for deployment demand.
Under their fitted objective and demand assumptions, Sardana et al. report that researchers expecting roughly 1B requests should prefer smaller models trained for longer. In their experiments, models from 150M to 6B parameters continue improving as token ratios rise; the 150M model is tested up to 10,000 tokens per parameter, while larger models are tested to lower maxima because of compute limits.[6]
$$C_{\text{total}} \approx 6ND_{\text{train}} + 2ND_{\text{inf}}$$

Reading the formula: Training costs scale with model size times data ($6ND_{\text{train}}$ FLOPs). Inference costs scale with model size times total processed inference tokens ($2ND_{\text{inf}}$) for a dense decoder-only model. Once $D_{\text{inf}}$ gets large enough, a smaller model that's trained longer can match the target quality while costing less over its lifetime.

Here $N$ is the parameter count, $D_{\text{train}}$ is the number of training tokens, and $D_{\text{inf}}$ represents total input plus output tokens processed across inference requests during the model's deployment lifetime. The proxy ignores differences in hardware utilization, latency, and attention cost, so production sizing still needs measurements from the serving stack.
Suppose you target a fixed pre-training loss that two candidate models can achieve: Model A has 70B parameters trained on 1.4T tokens, and Model B has 30B parameters trained on 4T tokens. The numbers below are an arithmetic illustration, not two measured models from Sardana et al.; a real comparison first has to show that both candidates hit the quality target.
The break-even point occurs when Model B's extra training FLOPs are recovered by its lower inference FLOPs. Under these equal-quality assumptions, Model A is cheaper below about 1.65T served tokens and Model B is cheaper above that point. A deployment decision must also replace this FLOP proxy with measured quality, latency, utilization, and hardware cost.
```python
def train_flops(parameters, tokens):
    return 6 * parameters * tokens

def inference_flops(parameters, served_tokens):
    return 2 * parameters * served_tokens

model_a_train = train_flops(70e9, 1.4e12)
model_b_train = train_flops(30e9, 4e12)
extra_train = model_b_train - model_a_train
# Inference FLOPs saved per served token by the smaller model.
savings_per_token = 2 * (70e9 - 30e9)
break_even_tokens = extra_train / savings_per_token

print(f"Model A training FLOPs: {model_a_train:.2e}")
print(f"Model B training FLOPs: {model_b_train:.2e}")
print(f"Break-even served tokens: {break_even_tokens:.2e}")
```

```text
Model A training FLOPs: 5.88e+23
Model B training FLOPs: 7.20e+23
Break-even served tokens: 1.65e+12
```

```python
def total_flops(parameters, training_tokens, served_tokens):
    # Lifetime proxy: training cost plus inference cost.
    return 6 * parameters * training_tokens + 2 * parameters * served_tokens

models = {
    "70B / 1.4T": (70e9, 1.4e12),
    "30B / 4.0T": (30e9, 4.0e12),
}

for demand in [1e12, 3e12]:
    costs = {
        name: total_flops(parameters, tokens, demand)
        for name, (parameters, tokens) in models.items()
    }
    cheaper = min(costs, key=costs.get)
    print(f"{demand / 1e12:.0f}T served tokens -> cheaper candidate: {cheaper}")
```

```text
1T served tokens -> cheaper candidate: 70B / 1.4T
3T served tokens -> cheaper candidate: 30B / 4.0T
```

Under the Sardana et al. objective, increasing expected demand shifts the fitted optimum toward fewer parameters and more pre-training tokens while holding the modeled loss target fixed. That is a prediction under an explicit quality model and demand estimate, not evidence that any particular released model was selected using that objective.
For a given quality target, compare candidates according to the question you can actually evaluate:
| Planning input | What to evaluate |
|---|---|
| Training loss is the objective and deployment demand is excluded | Use a Chinchilla-style baseline fitted to your data and architecture. |
| Large forecasted lifetime demand | Benchmark a smaller, longer-trained candidate against the quality target and lifetime serving cost. |
| Candidate ratios far outside fitted data | Collect proxy evidence in that regime instead of extrapolating the original fit without checks. |
The exact optimum depends on the loss target, architecture, data distribution, inference forecast, and serving stack. Don't infer a training objective from a released model's token-to-parameter ratio alone.
Both Chinchilla and inference-aware scaling require access to more useful training tokens as their recommended data budget grows. The supply of public human-generated text is finite, but a projection about that supply isn't evidence that it has already been exhausted.
Villalobos et al. [7] estimate an effective public-human-text stock of about 320T tokens after quality and multi-epoch adjustments. Under continued dataset-growth trends, they project full utilization between 2026 and 2032, with a median year of 2028; their assumed 5x over-training scenario shifts that projection one or two years earlier. These are scenario-dependent forecasts, not a statement that usable public text is exhausted today.
This projected constraint changes how you read every scaling law in this chapter. The data term can only be extrapolated confidently while the data regime remains comparable to the fit. When fresh high-quality text is scarce, practitioners may evaluate better filtering and deduplication, controlled multi-epoch reuse, transfer from other domains, and synthetic data; each changes the assumptions behind the fitted curve.
```python
effective_stock = 320e12  # Villalobos et al. scenario estimate
llama3_405b_tokens = 15.6e12
chinchilla_style_tokens = 20 * 405e9

print(f"Llama 3 405B reported tokens: {llama3_405b_tokens / 1e12:.1f}T")
print(f"Chinchilla-style 405B tokens: {chinchilla_style_tokens / 1e12:.1f}T")
print(
    "Reported Llama 3 dataset / projected effective stock: "
    f"{llama3_405b_tokens / effective_stock:.1%}"
)
print("Comparison is dataset size, not unique-text consumption.")
```

```text
Llama 3 405B reported tokens: 15.6T
Chinchilla-style 405B tokens: 8.1T
Reported Llama 3 dataset / projected effective stock: 4.9%
Comparison is dataset size, not unique-text consumption.
```

For most of this chapter, "scale" meant training compute split between parameters and tokens. A third allocation question is how much compute the model spends at inference time while answering a single query.
Instead of producing one answer in a fixed number of forward passes, a model can generate longer responses, sample multiple candidates, or search with a verifier. Snell et al. [8] evaluate revision and verifier-search strategies with PaLM 2-S* on MATH. In their FLOP-matched comparison, adaptively allocated test-time compute with a smaller model can outperform a roughly 14x larger model on problems where the smaller model already attains non-trivial success; on the hardest problems or under higher inference workloads, additional pre-training is more effective.
DeepSeek-R1 provides a related post-training example: its R1-Zero model applies reinforcement learning without supervised fine-tuning before RL, while the final R1 pipeline integrates cold-start data, supervised fine-tuning, and reinforcement-learning stages.[9] It demonstrates that post-training can change reasoning behavior; it doesn't show that pre-training data or compute no longer matters.
This matters for sizing decisions in two ways. First, it adds a knob: for a quality target, you can trade a bigger or longer-trained model against a smaller model that spends more compute per query. Second, it interacts with the inference-aware objective from earlier, because additional generated candidates or longer responses multiply the served-token count $D_{\text{inf}}$, raising the serving term.
The small calculation below isolates generated-token cost for sampled candidates. A real request also includes prompt tokens and any verifier work.
```python
def generation_flops(parameters, output_tokens_per_candidate, candidates):
    # About 2 FLOPs per parameter per generated token, per candidate.
    return 2 * parameters * output_tokens_per_candidate * candidates

parameters = 8e9
output_tokens_per_candidate = 512
for candidates in [1, 4, 16]:
    flops = generation_flops(parameters, output_tokens_per_candidate, candidates)
    print(f"{candidates:>2} candidate(s): {flops:.2e} generation FLOPs per request")
```

```text
 1 candidate(s): 8.19e+12 generation FLOPs per request
 4 candidate(s): 3.28e+13 generation FLOPs per request
16 candidate(s): 1.31e+14 generation FLOPs per request
```

Even experienced engineers trip over scaling laws. Here are five frequent mistakes, with their symptoms, causes, and fixes.
Symptom: You read the Chinchilla paper, memorize "20 tokens per parameter," and apply it to every model you build. A 1B-parameter model trained on 20B tokens underperforms on your messy carrier-claim corpus, and you can't figure out why.
Cause: The 20:1 ratio is a fitted result for dense transformers on general web text. It shifts with data quality, tokenizer design, optimizer regime, and architecture. It's a starting point, not a physical constant.
Fix: Treat 20:1 as a baseline, then run small proxy experiments (100M to 1B parameters) on your actual data to fit your own exponents. If your data is cleaner or more repetitive, the optimal ratio may be lower; if it's noisier or more diverse, it may be higher.
Symptom: Your spreadsheet says two candidate runs have the same training FLOPs, but the real cluster bills differ materially.
Cause: $C \approx 6ND$ is a dense-transformer rule of thumb. It ignores attention-kernel details, optimizer state, activation checkpointing, sequence length, hardware utilization, distributed communication, data pipeline stalls, and sparse-routing behavior.
Fix: Use $6ND$ for first-pass sizing, then replace it with measured hardware FLOPs, wall-clock throughput, and dollars per token from your actual stack before committing budget.
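To make the gap between spreadsheet FLOPs and cluster bills concrete, here is a minimal sketch that converts one FLOP budget into GPU-hours at two utilization levels. The accelerator peak, both utilization figures, and the run labels are hypothetical placeholders; substitute measurements from your own stack.

```python
def gpu_hours(training_flops: float, peak_flops_per_gpu: float, utilization: float) -> float:
    """Convert a FLOP budget into GPU-hours at a given achieved utilization."""
    achieved_flops_per_second = peak_flops_per_gpu * utilization
    return training_flops / achieved_flops_per_second / 3600

training_flops = 6 * 10e9 * 200e9  # 10B parameters, 200B tokens (6ND proxy)
PEAK_FLOPS_PER_GPU = 1e15          # hypothetical accelerator peak, in FLOP/s

# Two hypothetical runs with identical spreadsheet FLOPs but different utilization.
for label, utilization in [("well-tuned stack", 0.45), ("stalled data pipeline", 0.30)]:
    hours = gpu_hours(training_flops, PEAK_FLOPS_PER_GPU, utilization)
    print(f"{label:<22} utilization={utilization:.0%} -> {hours:,.0f} GPU-hours")
```

Under these assumed numbers the second run needs 1.5x the GPU-hours for the same nominal FLOPs, which is why the fix asks for measured throughput before committing budget.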
Symptom: Your scaling study predicts excellent cross-entropy loss, but the deployed model hallucinates facts, ignores instructions, or fails safety checks.
Cause: Scaling laws predict pre-training loss (how well the model compresses the training distribution), not downstream task performance, alignment quality, or reasoning ability. A model can have great loss and still be useless in production.
Fix: Budget separately for the post-training pipeline. Techniques such as instruction tuning and RLHF [10] can improve instruction behavior and preference alignment that a pre-training loss curve doesn't measure. Don't skip evaluation or post-training because your loss curve looks good.
Symptom: You train a Chinchilla-optimal 70B model, deploy it to production, and discover that your serving bill exceeds your training budget within three months.
Cause: Chinchilla optimizes training loss, not total lifetime cost. A smaller model trained longer costs more upfront in training compute but saves money on each inference call.
Fix: Before you commit to a model size, estimate your expected inference volume. Compare total cost = $6ND_{\text{train}}$ (train) + $2ND_{\text{inf}}$ (inference) across candidates that meet a measured quality target. At sufficiently high demand, a smaller model trained for longer can be cheaper under this proxy; verify the crossover on your serving stack.
Symptom: Your proxy experiments span models from 100M to 1B parameters. You fit a beautiful straight line, extrapolate it to predict the loss of a 100B model, and the actual result is way off.
Cause: Power-law fits work well inside the range where they were measured. Outside that range, new bottlenecks appear: optimizer instability, numerical precision issues, data exhaustion, or hardware communication overhead that didn't exist at small scale.
Fix: Validate with at least one mid-scale experiment before committing to the full target scale. If your proxy range is 100M to 1B, run a 10B validation before you trust the curve at 100B. Treat extrapolation as a hypothesis, not a guarantee.
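One simple way extrapolation can fail is a loss floor that the proxy range never reveals. The sketch below is a synthetic illustration: `true_loss` and its constants are made up for the example, and the "naive" fit is a straight line in log-log space that ignores the floor. It doesn't model the other bottlenecks listed above, such as optimizer instability.

```python
import numpy as np

def true_loss(parameters):
    """Synthetic ground truth: a floor plus a power-law term (illustrative constants)."""
    return 1.7 + 400.0 / parameters ** 0.34

# Proxy runs between 100M and 1B parameters.
proxy_sizes = np.array([1e8, 2e8, 5e8, 1e9])
proxy_losses = true_loss(proxy_sizes)

# Naive fit: a straight line in log-log space, which has no asymptotic floor.
slope, intercept = np.polyfit(np.log(proxy_sizes), np.log(proxy_losses), deg=1)

def naive_prediction(parameters):
    return float(np.exp(intercept + slope * np.log(parameters)))

for size in [1e9, 1e10, 1e11]:
    predicted = naive_prediction(size)
    actual = float(true_loss(size))
    print(f"N={size:.0e}: predicted {predicted:.3f}, actual {actual:.3f}, "
          f"error {predicted - actual:+.3f}")
```

The line matches the proxy range closely and then grows steadily more optimistic as the target moves away from it. A mid-scale validation point exposes that drift before the full run.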
Scaling laws are empirical fits, not physical laws. Exponents shift with architecture, data quality, tokenizer, and optimizer regime, so one fitted curve shouldn't be treated as a universal law.
Wei et al. [11] highlighted benchmark tasks where performance appears to stay near zero and then jump at larger scales, creating the impression of a phase transition in capability. Schaeffer et al. [12] later argued that many of these jumps are measurement artifacts: when you replace exact-match thresholds with continuous metrics, the same capabilities often return to smoother scaling curves.
The point isn't that emergence is fake or guaranteed. The useful lesson is measurement design: thresholded metrics can make smooth changes look abrupt, so teams should inspect continuous metrics before treating a capability jump as a new law of scale.
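A toy calculation shows how a thresholded metric can manufacture a jump. It is a sketch under a strong simplifying assumption: each token of an answer is correct independently, so exact match on a fixed-length answer is the per-token accuracy raised to the answer length. The accuracy values and the answer length are hypothetical.

```python
ANSWER_LENGTH = 10  # hypothetical number of tokens that must all be correct

# Hypothetical per-token accuracies for five successively larger models.
per_token_accuracy = [0.50, 0.65, 0.80, 0.90, 0.97]

exact_match = []
for accuracy in per_token_accuracy:
    # Independence assumption: all-correct probability is accuracy ** length.
    score = accuracy ** ANSWER_LENGTH
    exact_match.append(score)
    print(f"per-token accuracy {accuracy:.2f} -> exact match {score:.3f}")
```

The continuous metric rises steadily, while exact match sits near zero for the first models and then climbs sharply. The underlying capability is the same; only the measurement differs.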
Scaling laws predict pre-training loss, not downstream task performance. A model that achieves excellent perplexity may still fail at downstream requirements such as factual accuracy, instruction following, and safety checks.
Because scaling laws only measure the model's ability to compress and predict the training distribution, teams must allocate separate compute budgets for the post-training pipeline. Techniques such as instruction tuning and RLHF [10] are used to improve instruction behavior and preference alignment, which must be evaluated separately from pre-training loss.
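Scaling laws derived for dense transformers don't directly transfer to other architectures, such as sparse Mixture-of-Experts (MoE) models, where total and active parameter counts diverge.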
For these architectures, dense-transformer scaling laws are useful context, not a substitute for new measurements. Each architecture needs empirical scaling studies to determine its own parameter and data allocation.
When planning a large training run, teams conduct scaling studies (small-scale experiments that predict large-scale performance) before committing resources. The practical loop has four steps: train proxy models and fit exponents, extrapolate to the target frontier, validate at a medium scale, and only then commit to the full run.
To estimate the performance of a large model before committing to its full run, engineers use proxy models. A complementary tool is µ-Transfer (Tensor Programs V) [13], which uses the Maximal Update Parametrization (µP) so selected hyperparameters tuned on a smaller proxy can be transferred to a larger target model.
Yang et al. demonstrate transfer for hyperparameters such as learning rate and learning-rate schedule, and discuss transfer across scale dimensions such as width, batch size, sequence length, and training time with caveats. They also report cases where transfer across depth is fragile. It is a tool for reducing tuning cost, not permission to copy every setting without validation.
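Used alongside a scaling study, the workflow is to tune hyperparameters on a small proxy under µP, transfer them to the target configuration, and confirm both the transferred settings and the loss fit with a medium-scale validation run before the full run.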
The µ-Transfer paper shows that zero-shot hyperparameter transfer can work across large changes in model scale in its tested setups. A medium-scale validation still matters because transferability, loss fits, data mixture, and hardware behavior are separate failure modes.
One common parametric fit for model size and data, used in Chinchilla-style scaling studies, is:[2]

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Reading the formula: Loss equals three fitted terms: (1) $E$, the asymptotic floor in this model of the measured regime, (2) $A / N^{\alpha}$, the fitted penalty associated with finite parameters, and (3) $B / D^{\beta}$, the fitted penalty associated with finite training data. Making the model bigger reduces term 2; more data reduces term 3. Treat $E$ as an extrapolated fit parameter, not proof that you have measured the irreducible entropy of language.
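Once the constants are fitted, the form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ can be minimized in closed form under the $C \approx 6ND$ constraint, which gives $N_{\text{opt}} \propto C^{\beta/(\alpha+\beta)}$ and $D_{\text{opt}} \propto C^{\alpha/(\alpha+\beta)}$. The sketch below implements that solution. The constants are hypothetical, chosen with equal exponents so the split comes out balanced; they are not published fit values, and the function names are illustrative.

```python
def compute_optimal_split(compute, A, alpha, B, beta):
    """Minimize A / N**alpha + B / D**beta subject to compute = 6 * N * D."""
    scale = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    parameters = scale * (compute / 6) ** (beta / (alpha + beta))
    tokens = compute / (6 * parameters)
    return parameters, tokens

def fitted_loss(parameters, tokens, E, A, alpha, B, beta):
    return E + A / parameters ** alpha + B / tokens ** beta

# Hypothetical constants for illustration only (equal exponents give a 0.5 / 0.5 split).
E, A, alpha, B, beta = 1.7, 400.0, 0.3, 1000.0, 0.3

for compute in [1e22, 1e24]:
    n_opt, d_opt = compute_optimal_split(compute, A, alpha, B, beta)
    loss = fitted_loss(n_opt, d_opt, E, A, alpha, B, beta)
    print(f"C={compute:.0e}: N={n_opt:.2e}, D={d_opt:.2e}, "
          f"D/N={d_opt / n_opt:.1f}, fitted loss={loss:.3f}")
```

With equal exponents the tokens-per-parameter ratio stays constant as compute grows. With unequal exponents the ratio drifts with the budget, which is one reason a fixed ratio should be treated as a property of a particular fit.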
To see how this works in practice, consider a tiny synthetic dataset of proxy runs:
| Parameters (N) | Tokens (D) | Observed Loss |
|---|---|---|
| 100M | 1B | 2.894 |
| 100M | 5B | 2.745 |
| 100M | 20B | 2.634 |
| 500M | 1B | 2.811 |
| 500M | 5B | 2.662 |
| 500M | 20B | 2.551 |
| 1B | 5B | 2.629 |
| 1B | 20B | 2.518 |
| 1B | 100B | 2.407 |
Notice the pattern: fixing N and increasing D lowers loss; fixing D and increasing N also lowers loss. The parametric formula above captures both effects at once.
This Python script fits the scaling constants with SciPy's curve_fit and then predicts loss at a larger target scale:
```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(
    X: tuple[np.ndarray, np.ndarray],
    E: float,
    A: float,
    alpha: float,
    B: float,
    beta: float,
) -> np.ndarray:
    # Parametric form: L(N, D) = E + A / N^alpha + B / D^beta
    N, D = X
    return E + A / (N ** alpha) + B / (D ** beta)

# Synthetic proxy-run measurements for illustration only.
# Each row: (parameters, tokens, observed_loss)
experiments = np.array([
    [1e8, 1e9, 2.894],
    [1e8, 5e9, 2.745],
    [1e8, 2e10, 2.634],
    [5e8, 1e9, 2.811],
    [5e8, 5e9, 2.662],
    [5e8, 2e10, 2.551],
    [1e9, 5e9, 2.629],
    [1e9, 2e10, 2.518],
    [1e9, 1e11, 2.407],
])

N_data = experiments[:, 0]
D_data = experiments[:, 1]
L_data = experiments[:, 2]

popt, pcov = curve_fit(
    scaling_law, (N_data, D_data), L_data,
    p0=[1.0, 2.5, 0.07, 7.0, 0.09],
    bounds=([0, 0, 0, 0, 0], [5, 1e6, 1, 1e6, 1]),
    maxfev=20_000,
)

E_fit, A_fit, alpha_fit, B_fit, beta_fit = popt
print(f"Fitted asymptotic E = {E_fit:.3f}")
print(f"Parameter scaling: A={A_fit:.1f}, alpha={alpha_fit:.4f}")
print(f"Data scaling: B={B_fit:.1f}, beta={beta_fit:.4f}")

# Extrapolate: predict loss for a 70B model on 1.4T tokens
predicted = scaling_law((70e9, 1.4e12), *popt)
print(f"\nPredicted loss for 70B / 1.4T tokens: {predicted:.3f}")
```

```text
Fitted asymptotic E = 1.096
Parameter scaling: A=2.8, alpha=0.0703
Data scaling: B=7.8, beta=0.0980

Predicted loss for 70B / 1.4T tokens: 2.088
```

Start with $C \approx 6ND$ and the rule of thumb $D \approx 20N$. Substituting gives $C \approx 120N^2$, so $N \approx \sqrt{C / 120}$ parameters and $D \approx 20N$ tokens. That is a training-loss starting point. If the model will serve heavy traffic, test smaller models trained longer.
Not blindly. Chinchilla minimizes pre-training loss for a fixed training-compute budget. Under Sardana et al.'s fitted quality and demand objective, roughly 1B expected requests shifts the predicted optimum toward smaller models trained for longer.[6] Start with a Chinchilla-style baseline, then compare it with a smaller, longer-trained candidate using measured quality and total lifetime cost.
There can be several reasons: a scaling fit on different data, a different loss or downstream target, or an inference-aware objective. Llama 3's 405B model is a concrete observed ratio above 20:1: 405B parameters on 15.6T tokens, or about 38.5:1.[5] Meta reports that configuration as approximately compute-optimal under its own data and training-budget scaling laws, so its ratio isn't evidence by itself for an inference-cost explanation.
Dense scaling laws use parameter count as a rough proxy for both capacity and compute. MoE breaks that shortcut. Total parameters affect memory and routing complexity; active parameters drive compute per token. You need separate scaling fits that track active parameters, total parameters, router behavior, and data budget.
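The total-versus-active distinction is easy to see with a toy count. This is a minimal sketch assuming a hypothetical layout (shared weights plus equally sized experts, with a fixed number of experts routed per token); it ignores router parameters and load-balancing overhead, and it reuses the dense $6ND$ proxy with active parameters in place of $N$.

```python
def moe_parameter_counts(shared, per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simple MoE layout."""
    total = shared + per_expert * num_experts
    active = shared + per_expert * experts_per_token
    return total, active

# Hypothetical layout: 2B shared parameters, 8 experts of 1B each, 2 experts per token.
total, active = moe_parameter_counts(2e9, 1e9, num_experts=8, experts_per_token=2)
tokens = 200e9

print(f"Total parameters:  {total / 1e9:.0f}B (drives memory footprint)")
print(f"Active parameters: {active / 1e9:.0f}B (drives per-token compute)")
print(f"6 * active * D: {6 * active * tokens:.2e} FLOPs")
print(f"6 * total * D:  {6 * total * tokens:.2e} FLOPs (overstates sparse compute)")
```

Plugging total parameters into a dense rule overstates training compute here by 2.5x, which is why an MoE scaling study has to track both counts.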
Treat the miss as evidence that your fit stopped generalizing. Don't launch the full run because the small-scale line looked clean. Re-fit with the 10B point included, inspect new bottlenecks such as optimizer instability or data exhaustion, and only then decide whether the frontier still supports the full target scale.
| Tier | You should be able to defend |
|---|---|
| Foundational | Explain parameters, tokens, compute, and why scaling laws are empirical power-law fits. |
| Foundational | Read Kaplan's single-variable loss curves and name what each bottleneck means. |
| Intermediate | Explain why Kaplan's fitted frontier favored parameters faster than data. |
| Intermediate | Describe how Chinchilla revised the allocation toward roughly 20 tokens per parameter for dense transformers. |
| Advanced | Use $C \approx 6ND$ to estimate tokens, parameters, and training FLOPs while stating its limits. |
| Advanced | Explain why inference-aware scaling can favor smaller models trained longer. |
| Advanced | Diagnose over-training, under-training, extrapolation risk, and dense-versus-MoE parameter ambiguity. |
| Advanced | Design a small scaling study with proxy runs, fitted exponents, and a mid-scale validation check. |
[1] Scaling Laws for Neural Language Models. Kaplan, J., et al. · 2020
[2] Training Compute-Optimal Large Language Models. Hoffmann, J., et al. · 2022 · NeurIPS 2022
[3] Language Models are Few-Shot Learners. Brown, T., et al. · 2020 · NeurIPS 2020
[4] Resolving Discrepancies in Compute-Optimal Scaling of Language Models. Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., & Carmon, Y. · 2024
[5] The Llama 3 Herd of Models. Dubey, A., et al. · 2024 · arXiv preprint
[6] Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Sardana, N., et al. · 2024
[7] Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. Villalobos, P., et al. (Epoch AI) · 2022 · arXiv preprint
[8] Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. Snell, C., et al. · 2024 · arXiv preprint
[9] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI · 2025
[10] Training Language Models to Follow Instructions with Human Feedback (InstructGPT). Ouyang, L., et al. · 2022 · NeurIPS 2022
[11] Emergent Abilities of Large Language Models. Wei, J., et al. · 2022 · TMLR
[12] Are Emergent Abilities of Large Language Models a Mirage? Schaeffer, R., et al. · 2023 · NeurIPS 2023
[13] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. Yang, G., et al. · 2022 · NeurIPS 2022