Scaling Laws & Compute-Optimal Training

Learn the empirical power laws governing LLM performance, from Kaplan's parameter-heavy frontier through Chinchilla-optimal ratios to modern inference-aware training strategies.

After decoding, we move from "how a trained model emits tokens" to "how a lab decides what model to train in the first place." Scaling laws describe how model loss changes as you add parameters, tokens, and compute. The practical question behind the math is how to spend a fixed training budget without overbuilding the wrong part of the system.

A fixed training budget for a code-assistant model creates a simple-sounding choice: build a huge model and train it on a modest amount of source, issue, and documentation data, or build a smaller model and train it on much more data. Get this balance wrong, and you can spend millions of dollars on a model that underperforms a smaller, better-trained alternative.

This is the problem scaling laws address. They are empirical relationships that predict how a model's performance improves as you increase its size, its training data, or both. Kaplan et al. at OpenAI ^{[1]Reference 1Scaling Laws for Neural Language Modelshttps://arxiv.org/abs/2001.08361} quantified influential transformer scaling curves, and the Chinchilla paper ^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556} later found a more balanced allocation of model size and data for compute-optimal dense training. These fits help plan expensive runs, but each fit applies only within its objective, data, architecture, and measurement assumptions.

Before going further, define three basic terms used throughout:

Parameters (N): The learnable numbers inside a model (weights and biases). They set how much structure the model can store and compose.
Training tokens (D): The individual words or subword pieces the model sees during training. They provide the examples the model learns from.
Compute (C): The total number of floating-point operations (FLOPs) needed for training. For a standard dense transformer, a useful rule of thumb is $C \approx 6ND$ (approximately 2N FLOPs per token for the forward pass and 4N for backpropagation). This approximation originates in Kaplan et al. and is refined in the Chinchilla work.^{[1]Reference 1Scaling Laws for Neural Language Modelshttps://arxiv.org/abs/2001.08361}^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}
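The $6ND$ rule of thumb can be unpacked into its forward and backward parts. The sketch below only restates that accounting for a hypothetical 1B-parameter model trained on 10B tokens; the function name is illustrative, and real runs deviate from this estimate.

```python
def dense_training_flops_breakdown(parameters, tokens):
    """Split the 6ND rule of thumb into forward and backward FLOPs."""
    forward = 2 * parameters * tokens   # ~2N FLOPs per token, forward pass
    backward = 4 * parameters * tokens  # ~4N FLOPs per token, backpropagation
    return forward, backward, forward + backward


# Hypothetical run: 1B parameters, 10B tokens.
forward, backward, total = dense_training_flops_breakdown(1e9, 10e9)
print(f"forward:  {forward:.2e} FLOPs")
print(f"backward: {backward:.2e} FLOPs")
print(f"total:    {total:.2e} FLOPs (6ND)")
```

Two thirds of the estimated budget sits in the backward pass, which is why inference later in this lesson is costed at roughly $2N$ FLOPs per token.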

Log-log allocation plot for a 100-fold dense-training compute increase. The fixed-compute curve N times D equals 100 marks Kaplan at 28.84-fold parameters and 3.47-fold tokens and Chinchilla at 10-fold parameters and 10-fold tokens. A separate lifetime-cost objective adds served-token compute and points toward smaller models trained on more data as demand rises. — At 100x dense training compute, Kaplan and Chinchilla choose different points on the same N times D constraint. Lifetime optimization changes the objective by adding future served-token cost, so expected demand can shift the preferred search direction.

Why scaling laws matter

Frontier pre-training runs consume large compute budgets. In the early days of deep learning, architecture design and hyperparameter tuning drove many performance improvements. Today, scale is one of the main levers. But scaling isn't as simple as making everything bigger. If you misallocate your compute budget between model size and data volume, you can spend a large budget on a model that underperforms a smaller, better-trained counterpart.

Scaling laws address this problem by providing predictive models that guide resource allocation. They map the relationship between the scale of your inputs (parameters, tokens, and compute) and the expected performance of the output model. In pre-training studies, that performance is usually cross-entropy loss, which measures how well the model predicts the next token. By studying these relationships, engineers can estimate how a model will perform before committing to a costly, months-long training run.

These empirical laws allow teams to:

Forecast performance before committing to a full training run.
Optimize compute allocation between parameters and training tokens.
Make economic trade-offs between training cost and inference cost in production environments.
Extrapolate from small-scale proxy experiments to production-scale models, reducing the risk of expensive surprises at scale.

They are planning tools, not guarantees. A good scaling study reduces uncertainty before a large run, but the final decision still has to account for data quality, architecture, hardware efficiency, post-training, and deployment cost.

Building intuition with numbers

Before any formulas, a tiny concrete example shows the shape of scaling. Suppose you train a 1-billion-parameter model on 10 billion tokens and measure its cross-entropy loss. Then you run three follow-up experiments:

Experiment	Parameters	Tokens	What changed	Approximate loss
Baseline	1B	10B	Nothing	2.80
Double model size	2B	10B	2x parameters	2.60
Double data	1B	20B	2x tokens	2.55
Double both	2B	20B	2x each	2.35

These numbers are synthetic, but they capture a real pattern: doubling parameters or data lowers loss, but doubling both together lowers it more than either alone. The improvement is smooth and predictable, but it follows diminishing returns. A 10x increase in parameters doesn't cut loss by 10x; it cuts it by a smaller, fixed multiplier.

This predictable curve is a power law. In everyday language, an idealized power-law term becomes a straight line when you plot its logarithm against log(parameters) or log(data). That line estimates the scaling trend. It lets engineers answer questions like: "If I have 10x more compute, how much better will my model get?" without training the model first. Later sections add an asymptotic floor and fit parameters and data together.
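The "straight line on a log-log plot" idea can be checked with a small fit. This is a minimal sketch on synthetic, noise-free data: the scale factor 5.0 and the list of model sizes are made up for illustration, and the helper `fit_power_law` is a hypothetical name, not part of any library.

```python
import math

# Synthetic proxy runs: loss = 5.0 * N^(-0.076). The 5.0 scale is made up;
# 0.076 is the parameter exponent quoted in this lesson.
TRUE_SCALE, TRUE_EXPONENT = 5.0, 0.076
sizes = [1e8, 3e8, 1e9, 3e9, 1e10]
losses = [TRUE_SCALE * n ** (-TRUE_EXPONENT) for n in sizes]


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (scale, exponent)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y)) / sum(
        (a - mean_x) ** 2 for a in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


scale, exponent = fit_power_law(sizes, losses)
print(f"fitted scale: {scale:.3f}, fitted exponent: {exponent:.3f}")
```

On clean power-law data the slope of the line recovers the exponent. Real measurements add noise and a loss floor, so a real fit needs more care than this.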

Two exact charts for the synthetic compute law L of C equals 2 plus C to the negative 0.05. Across equal 3.16-fold compute steps from 1 to 100, loss falls from 3.000 to 2.794 while each absolute drop shrinks from 0.056 to 0.047. — For the displayed law, every 3.16x compute step multiplies the power-law term by the same factor, but the absolute loss reduction shrinks from 0.056 to 0.047. One hundred times more compute reduces the scaling term by about 20.6%, not by 100x.
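The figure's numbers follow directly from the displayed synthetic law. A short sketch, assuming five compute levels spaced by a factor of $10^{0.5} \approx 3.16$:

```python
def synthetic_loss(compute):
    """Synthetic law from the figure: L(C) = 2 + C^(-0.05)."""
    return 2.0 + compute ** -0.05


# Five compute levels, each about 3.16x (10^0.5) the previous one.
compute_levels = [10 ** (k / 2) for k in range(5)]
losses = [synthetic_loss(c) for c in compute_levels]
drops = [before - after for before, after in zip(losses, losses[1:])]

for compute, loss in zip(compute_levels, losses):
    print(f"C = {compute:6.2f} -> loss {loss:.3f}")
print("absolute drops:", ", ".join(f"{d:.3f}" for d in drops))
```

Each step multiplies the power-law term by the same factor, so the absolute drops shrink even though the relative change in that term stays constant.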

power-law-diminishing-returns.py

alpha_n = 0.076

for parameter_multiplier in [2, 10, 100]:
    scaled_term = parameter_multiplier ** (-alpha_n)
    reduction = 1 - scaled_term
    print(
        f"{parameter_multiplier:>3}x parameters -> "
        f"{scaled_term:.3f}x parameter-limited loss term "
        f"({reduction:.1%} reduction)"
    )

Output

  2x parameters -> 0.949x parameter-limited loss term (5.1% reduction)
 10x parameters -> 0.839x parameter-limited loss term (16.1% reduction)
100x parameters -> 0.705x parameter-limited loss term (29.5% reduction)

Kaplan scaling laws (2020)

OpenAI's 2020 work ^{[1]Reference 1Scaling Laws for Neural Language Modelshttps://arxiv.org/abs/2001.08361} established three power-law relationships for transformer language models. Each describes a different bottleneck regime for loss $L$ :

The three power laws

1. Scaling with parameters (model size $N$)

$$L(N) \propto N^{-\alpha_N}$$

Reading the formula: Loss decreases as a power law when you make the model bigger. Double the parameters, and loss drops by a fixed multiplier. The exponent $\alpha_N \approx 0.076$ means the improvement is smooth and predictable, but diminishing returns are steep: a 10x increase in parameters only reduces the loss by about 16%.

2. Scaling with data (dataset size $D$ in tokens)

$$L(D) \propto D^{-\alpha_D}$$

Reading the formula: The same pattern holds for data. More training tokens means lower loss, following a predictable power curve with $\alpha_D \approx 0.095$.

3. Scaling with compute-optimal training compute ($C_{\min}$)

$$L(C_{\min}) \propto C_{\min}^{-\alpha_C}$$

Reading the formula: Kaplan's compute curve is written in terms of optimally allocated training compute, not arbitrary training runs. Using the standard dense-transformer budgeting rule $C \approx 6ND$, the fitted exponent is $\alpha_C \approx 0.050$. The factor of 6 comes from a rough FLOP accounting: about $2N$ per token for the forward pass and about $4N$ per token for backpropagation. It's a planning heuristic, not an exact profiler. Real training runs also depend on attention kernels, optimizer overhead, activation recomputation, and architecture choices such as MoE sparsity.

Kaplan's influential result wasn't that $\alpha_N$ was larger than $\alpha_D$ ; it wasn't. The operational takeaway came from the compute-optimal frontier: for a fixed compute budget, the fitted optimum favored much larger models and relatively little data.

Kaplan's compute-optimal frontier favored larger models

Kaplan's team concluded that, under their fitted scaling regime, you should spend most new compute on model size and only a smaller fraction on additional data. In that fitted regime, adding parameters moved the compute-optimal frontier more than adding the same relative amount of data.

This led to the initial strategy of training very large models on relatively modest data budgets. GPT-3 made that pattern visible: 175B parameters on roughly 300B tokens, a 1.7:1 ratio.^{[3]Reference 3Language Models are Few-Shot Learners.https://arxiv.org/abs/2005.14165}

The compute-optimal frontier

When compute is the binding constraint, Kaplan suggested allocating the budget such that:

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

Where these exponents come from: Kaplan et al. didn't get these exponents by comparing $\alpha_N$ and $\alpha_D$ directly. They fit a joint loss surface,

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

and separately analyzed the compute-efficient frontier under the transformer training-cost approximation $C \approx 6ND$. The strongly parameter-heavy split above comes from that frontier analysis, which also models training steps and batch size; minimizing the $L(N, D)$ surface alone under $C = 6ND$ would give the milder $N_{\text{opt}} \propto C^{\alpha_D / (\alpha_N + \alpha_D)} \approx C^{0.56}$. Chinchilla later re-fit the compute-optimal frontier and got the much more balanced $0.50$ / $0.50$ result.

Kaplan-style scaling: If you get 10x more compute, spend most of it on a bigger model (about 5.4x bigger) and only a little more data (about 1.9x more). This led to the strategy of building very large models on modest datasets.

allocation-exponents.py

frontiers = {
    "Kaplan": (0.73, 0.27),
    "Chinchilla": (0.50, 0.50),
}

for compute_multiplier in [10, 100]:
    print(f"{compute_multiplier}x training compute")
    for name, (parameter_exp, token_exp) in frontiers.items():
        parameters = compute_multiplier ** parameter_exp
        tokens = compute_multiplier ** token_exp
        print(f"  {name:<10} N={parameters:5.2f}x, D={tokens:5.2f}x")

Output

10x training compute
  Kaplan     N= 5.37x, D= 1.86x
  Chinchilla N= 3.16x, D= 3.16x
100x training compute
  Kaplan     N=28.84x, D= 3.47x
  Chinchilla N=10.00x, D=10.00x

the-compute-optimal-frontier.py

def dense_training_flops(parameters, tokens):
    return 6 * parameters * tokens

gpt3_style = dense_training_flops(175e9, 300e9)
small_data_rich = dense_training_flops(70e9, 1.4e12)

print(f"GPT-3-style training FLOPs: {gpt3_style:.2e}")
print(f"70B / 1.4T-token training FLOPs: {small_data_rich:.2e}")

Output

GPT-3-style training FLOPs: 3.15e+23
70B / 1.4T-token training FLOPs: 5.88e+23

The Chinchilla reallocation (2022)

Two training teams get the same compute budget. Team A trains a much larger model on a thin token budget. Team B trains a smaller model on a much richer token budget. Which system performs better? DeepMind ran this experiment with AI models, and the results changed how teams thought about model size.

Hoffmann et al. at DeepMind ^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556} challenged the Kaplan prescription by training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens. Their central finding was that the Kaplan allocation had underestimated the value of data.

Why did two careful studies disagree so sharply? Later analysis traced the gap to methodology rather than a new law of transformers. Porian et al. ^{[4]Reference 4Resolving Discrepancies in Compute-Optimal Scaling of Language Modelshttps://arxiv.org/abs/2406.19146} reproduce Kaplan-style scaling experiments and identify three contributors: omitting final decoding-layer computation, an excessively long fixed warmup for small models, and optimizer hyperparameters that weren't tuned as a function of scale. Correcting those factors produces close agreement with the Chinchilla allocation; their experiments also find that tailored learning-rate decay isn't the central explanation. A scaling exponent is only as trustworthy as its compute accounting and fitting procedure.

Chinchilla-optimal training

The revised scaling law for jointly optimal model size and data volume is:

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

In plain terms: Chinchilla's correction is that model size and data should grow at equal rates. If you double compute, make the model $\sqrt{2}\times$ bigger and use $\sqrt{2}\times$ more data. The practical rule from the fit is roughly 20 training tokens per parameter ($D \approx 20N$).

That 20:1 ratio is a fitted result for the dense-transformer setup Hoffmann et al. studied, not a universal constant.^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}

compute-budget-to-chinchilla-size.py

def chinchilla_style_size(training_flops, tokens_per_parameter=20):
    parameters = (training_flops / (6 * tokens_per_parameter)) ** 0.5
    tokens = tokens_per_parameter * parameters
    return parameters, tokens

parameters, tokens = chinchilla_style_size(1e24)
print(f"Parameters at 1e24 FLOPs: {parameters / 1e9:.1f}B")
print(f"Tokens at 20:1: {tokens / 1e12:.2f}T")

Output

Parameters at 1e24 FLOPs: 91.3B
Tokens at 20:1: 1.83T

Frontier chart comparing Kaplan and Chinchilla allocation from 1-fold to 100-fold compute. At 100-fold compute, Kaplan reaches 28.84-fold parameters and 3.47-fold tokens, while Chinchilla reaches 10-fold parameters and 10-fold tokens. A companion bar chart shows reported or fitted token-to-parameter ratios of 1.7 for GPT-3, 20 for Chinchilla, and 38.5 for Llama 3 405B. — At 100x compute, Kaplan's fitted frontier allocates 28.84x to parameters and 3.47x to tokens; Chinchilla allocates 10x to each. The ratio bars provide context only: released-model ratios don't reveal the objective that selected them.

Chinchilla vs. Gopher: a concrete example

DeepMind demonstrated this dramatically with Gopher and Chinchilla:

Model	Parameters	Training Tokens	Ratio (D/N)	Compute (FLOPs)
Gopher	280B	300B	1.07:1	5.76 × 10²³
Chinchilla	~70B	1.4T	20:1	5.76 × 10²³

The FLOP column is the matched training budget reported by Hoffmann et al. It won't equal $6ND$ exactly if you plug in the displayed parameter and token counts: $6ND$ is a planning approximation, the paper uses architecture-aware FLOP accounting, and the displayed counts are rounded.^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}

With the same reported training budget, Chinchilla outperformed Gopher across the evaluation suite DeepMind reported. The 4x smaller model trained on about 4.7x more data delivered better downstream accuracy while also being cheaper to fine-tune and serve.^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}
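The gap between the $6ND$ approximation and the reported budget can be seen by plugging in the rounded table values. This is a sketch of that arithmetic only; `six_nd` is an illustrative helper name, and the paper's own FLOP accounting is not reproduced here.

```python
REPORTED_FLOPS = 5.76e23  # matched budget from the table above

# (parameters, training tokens) as rounded in the table
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def six_nd(parameters, tokens):
    """Dense-transformer planning estimate of training FLOPs."""
    return 6 * parameters * tokens


for name, (parameters, tokens) in models.items():
    estimate = six_nd(parameters, tokens)
    print(
        f"{name:<10} D/N = {tokens / parameters:5.2f}, "
        f"6ND = {estimate:.2e}, "
        f"6ND / reported = {estimate / REPORTED_FLOPS:.3f}"
    )
```

With these rounded inputs the Gopher estimate lands about 12.5% below the reported figure and the Chinchilla estimate about 2% above it, which is the kind of mismatch to expect from a planning heuristic.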

Later dense-model token ratios

Published model reports make it possible to observe the token-to-parameter ratio used. That observation doesn't by itself reveal which objective selected the configuration.

Meta reports that the Llama 3 405B flagship was pre-trained on 15.6T text tokens.^{[5]Reference 5The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783} That yields approximately $15.6T / 405B \approx 38.5$ tokens per parameter, above Chinchilla's roughly 20:1 fitted rule. Importantly, the Llama 3 paper says its 405B model size is approximately compute-optimal under scaling laws fitted on Meta's data and its $3.8 \times 10^{25}$ FLOP training budget.^{[5]Reference 5The Llama 3 Herd of Models.https://arxiv.org/abs/2407.21783} It doesn't establish that lifetime inference cost caused the ratio. The inference-aware objective in the next section is a separate way to evaluate that trade-off.
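As a quick consistency check on those reported figures, the sketch below recomputes the token-to-parameter ratio and compares the $6ND$ estimate with the reported training budget. The variable names are illustrative; the inputs are the numbers quoted above.

```python
parameters = 405e9        # Llama 3 405B
tokens = 15.6e12          # reported pre-training tokens
reported_budget = 3.8e25  # reported training FLOPs

tokens_per_parameter = tokens / parameters
six_nd_estimate = 6 * parameters * tokens

print(f"Tokens per parameter: {tokens_per_parameter:.1f}")
print(f"6ND estimate: {six_nd_estimate:.2e} FLOPs")
print(f"Estimate / reported budget: {six_nd_estimate / reported_budget:.3f}")
```

Here the rule of thumb lands within about 1% of the reported budget. That agreement says nothing about which objective selected the configuration.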

chinchilla-style-budget.py

def chinchilla_tokens(parameters, tokens_per_parameter=20):
    return parameters * tokens_per_parameter

def dense_training_flops(parameters, tokens):
    return 6 * parameters * tokens

parameters = 10e9
tokens = chinchilla_tokens(parameters)
flops = dense_training_flops(parameters, tokens)

print(f"Chinchilla-style tokens for 10B parameters: {tokens:.2e}")
print(f"Dense training FLOPs: {flops:.2e}")

Output

Chinchilla-style tokens for 10B parameters: 2.00e+11
Dense training FLOPs: 1.20e+22

Beyond Chinchilla: inference-aware scaling

Chinchilla tells you how to train the model with the lowest pre-training loss for a fixed training-compute budget, but that's only half the story. If two models meet the same quality bar, the smaller one may be cheaper to serve on every request. The relevant economic objective becomes lifetime cost, not pre-training loss at the end of the run alone.

Kaplan and Chinchilla both ask: "Given a fixed pre-training compute budget, how should I split it between parameters and tokens?" Inference-aware scaling changes the objective: "Given a target quality and expected demand, which model minimizes total lifetime cost?" That total is training cost plus inference cost over the model's deployment lifetime.

The inference cost problem

A model that serves enough traffic can accumulate inference FLOPs that rival or exceed the original training cost. Sardana et al. ^{[6]Reference 6Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Lawshttps://arxiv.org/abs/2401.00448} extend a Chinchilla-style objective to account for deployment demand.

Under their fitted objective and demand assumptions, Sardana et al. report that researchers expecting roughly 1B requests should prefer smaller models trained for longer. They validate the analysis with 47 trained models and report continued improvement at token-to-parameter ratios as high as 10,000 in the measured regime.^{[6]Reference 6Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Lawshttps://arxiv.org/abs/2401.00448}

$$\min_{N, D} \; C_{\text{train}}(N, D) + C_{\text{inference}}(N, T_{\text{served}}) \quad \text{subject to } \quad L(N, D) \le \ell$$

Reading the formula: Training costs scale with model size times data ($6ND$ FLOPs). Inference costs scale with model size times total processed inference tokens ($2N \cdot T_{\text{served}}$) for a dense decoder-only model. Once $T_{\text{served}}$ gets large enough, a smaller model that's trained longer can match the target quality while costing less over its lifetime. Under this FLOP proxy:

$$C_{\text{train}} \approx 6ND, \quad C_{\text{inference}} \approx 2N \cdot T_{\text{served}}$$

Here $T_{\text{served}}$ represents total input plus output tokens processed across inference requests during the model's deployment lifetime. The proxy ignores differences in hardware utilization, latency, and attention cost, so production sizing still needs measurements from the serving stack.

Concrete break-even example

Suppose you target a fixed pre-training loss $\ell$ that two candidate models can achieve. The numbers below are an arithmetic illustration, not two measured models from Sardana et al.; a real comparison first has to show that both candidates hit the quality target.

Model A (Chinchilla-style): 70B parameters trained on 1.4T tokens. Training cost ≈ $6 \times 70\text{B} \times 1.4\text{T} \approx 5.88 \times 10^{23}$ FLOPs. Inference cost per token ≈ $2 \times 70\text{B} = 140$ B FLOPs/token.
Model B (inference-aware, smaller + over-trained): 30B parameters trained on ~4T tokens (heavier over-training to match quality). Training cost ≈ $6 \times 30\text{B} \times 4\text{T} \approx 7.2 \times 10^{23}$ FLOPs (about 22% higher upfront). Inference cost per token ≈ $2 \times 30\text{B} = 60$ B FLOPs/token, less than half of Model A.

The break-even point occurs when Model B's extra training FLOPs are recovered by its lower inference FLOPs. Under these equal-quality assumptions, Model A is cheaper below about 1.65T served tokens and Model B is cheaper above that point. A deployment decision must also replace this FLOP proxy with measured quality, latency, utilization, and hardware cost.
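The break-even figure follows from setting the two lifetime-cost proxies equal. Writing $N_A, D_A$ and $N_B, D_B$ for the parameters and training tokens of the two candidates (labels introduced here only for this derivation) and $T^{*}$ for the break-even number of served tokens:

$$6 N_A D_A + 2 N_A T^{*} = 6 N_B D_B + 2 N_B T^{*} \quad \Longrightarrow \quad T^{*} = \frac{3\,(N_B D_B - N_A D_A)}{N_A - N_B}$$

$$T^{*} = \frac{3\,(1.2 \times 10^{23} - 9.8 \times 10^{22})}{4 \times 10^{10}} = 1.65 \times 10^{12} \text{ tokens}$$

The expression is only meaningful when the smaller model costs more to train ($N_B D_B > N_A D_A$ with $N_B < N_A$); otherwise the smaller model is cheaper at every demand level under this proxy.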

Assumed equal-quality lifetime-FLOP comparison between a 70B model trained on 1.4 trillion tokens and a 30B model trained on 4 trillion tokens. Their cost lines cross at 1.65 trillion served tokens. At that point both total 8.19 times 10 to the 23 FLOPs: the 70B model splits into 5.88 training plus 2.31 inference, while the 30B model splits into 7.20 training plus 0.99 inference. — At the 1.65T-token break-even, both candidates total 8.19 × 10²³ FLOPs under the proxy. Their cost composition differs: the 70B model pays less to train and more to serve; the 30B model pays more to train and less to serve.

concrete-break-even-example.py

def train_flops(parameters, tokens):
    return 6 * parameters * tokens

def inference_flops(parameters, served_tokens):
    return 2 * parameters * served_tokens

model_a_train = train_flops(70e9, 1.4e12)
model_b_train = train_flops(30e9, 4e12)
extra_train = model_b_train - model_a_train
savings_per_token = 2 * (70e9 - 30e9)
break_even_tokens = extra_train / savings_per_token

print(f"Model A training FLOPs: {model_a_train:.2e}")
print(f"Model B training FLOPs: {model_b_train:.2e}")
print(f"Break-even served tokens: {break_even_tokens:.2e}")

Output

Model A training FLOPs: 5.88e+23
Model B training FLOPs: 7.20e+23
Break-even served tokens: 1.65e+12

lifetime-cost-above-and-below-break-even.py

def total_flops(parameters, training_tokens, served_tokens):
    return 6 * parameters * training_tokens + 2 * parameters * served_tokens

models = {
    "70B / 1.4T": (70e9, 1.4e12),
    "30B / 4.0T": (30e9, 4.0e12),
}

for demand in [1e12, 3e12]:
    costs = {
        name: total_flops(parameters, tokens, demand)
        for name, (parameters, tokens) in models.items()
    }
    cheaper = min(costs, key=costs.get)
    print(f"{demand / 1e12:.0f}T served tokens -> cheaper candidate: {cheaper}")

Output

1T served tokens -> cheaper candidate: 70B / 1.4T
3T served tokens -> cheaper candidate: 30B / 4.0T

What the inference-aware objective predicts

Under the Sardana et al. objective, increasing expected demand shifts the fitted optimum toward fewer parameters and more pre-training tokens while holding the modeled loss target fixed. That's a prediction under an explicit quality model and demand estimate, not evidence that any particular released model was selected using that objective.
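The direction of that shift can be illustrated with a small grid search. This is a sketch under stated assumptions, not the procedure from Sardana et al.: it borrows the parametric loss fit reported by Hoffmann et al. ($L = E + A/N^\alpha + B/D^\beta$ with $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$) as a stand-in quality model, uses the $6ND + 2N \cdot T_{\text{served}}$ FLOP proxy, and picks a hypothetical target loss of 2.1. The function names are illustrative.

```python
# Parametric loss fit reported by Hoffmann et al.: L = E + A/N^alpha + B/D^beta.
# Used here only as a stand-in quality model; refit on your own data.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def tokens_needed(parameters, target_loss):
    """Training tokens required for `parameters` to reach `target_loss`."""
    gap = target_loss - E - A / parameters**ALPHA
    if gap <= 0:
        return None  # this size cannot reach the target under the fit
    return (B / gap) ** (1 / BETA)


def cheapest_candidate(target_loss, served_tokens):
    """Grid-search model sizes for the lowest 6ND + 2N*T_served proxy."""
    best = None
    for step in range(201):  # 1B to 100B parameters, log-spaced
        parameters = 1e9 * 10 ** (step / 100)
        tokens = tokens_needed(parameters, target_loss)
        if tokens is None:
            continue
        cost = 6 * parameters * tokens + 2 * parameters * served_tokens
        if best is None or cost < best[0]:
            best = (cost, parameters, tokens)
    return best


TARGET_LOSS = 2.1  # hypothetical quality bar
for served in [0, 1e12, 1e13]:
    cost, parameters, tokens = cheapest_candidate(TARGET_LOSS, served)
    print(
        f"{served / 1e12:>4.0f}T served -> N = {parameters / 1e9:.1f}B, "
        f"D = {tokens / 1e12:.2f}T, D/N = {tokens / parameters:.0f}"
    )
```

Under these assumptions the cheapest candidate gets smaller and is trained on more tokens as served demand grows, while the modeled loss stays at the target. The specific sizes depend entirely on the borrowed fit and should not be read as recommendations.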

For a given quality target, compare candidates according to the question you can evaluate:

Planning input	What to evaluate
Training loss is the objective and deployment demand is excluded	Use a Chinchilla-style baseline fitted to your data and architecture.
Large forecasted lifetime demand	Benchmark a smaller, longer-trained candidate against the quality target and lifetime serving cost.
Candidate ratios far outside fitted data	Collect proxy evidence in that regime instead of extrapolating the original fit without checks.

The exact optimum depends on the loss target, architecture, data distribution, inference forecast, and serving stack. Don't infer a training objective from a released model's token-to-parameter ratio alone.

Public-text supply as a planning constraint

Both Chinchilla and inference-aware scaling require access to more useful training tokens as their recommended data budget grows. The supply of public human-generated text is finite, but a projection about that supply isn't evidence that it has already been exhausted.

Villalobos et al. ^{[7]Reference 7Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Datahttps://arxiv.org/abs/2211.04325} estimate an effective public-human-text stock of about 320T tokens after quality and multi-epoch adjustments. Under continued dataset-growth trends, they project full utilization between 2026 and 2032, with a median year of 2028; their assumed 5x over-training scenario shifts that median one year earlier. These are scenario-dependent forecasts. The projection doesn't establish that full utilization has already occurred.

This projected constraint changes how you read every scaling law here. In a Chinchilla-style parametric fit, $L(N, D) = E + A/N^\alpha + B/D^\beta$, the data term $B/D^\beta$ can only be extrapolated confidently while the data regime remains comparable to the fit. When fresh high-quality text is scarce, practitioners may evaluate better filtering and deduplication, controlled multi-epoch reuse, transfer from other domains, and synthetic data; each changes the assumptions behind the fitted curve.

projected-public-text-context.py

effective_stock = 320e12  # Villalobos et al. scenario estimate
llama3_405b_tokens = 15.6e12
chinchilla_style_tokens = 20 * 405e9

print(f"Llama 3 405B reported tokens: {llama3_405b_tokens / 1e12:.1f}T")
print(f"Chinchilla-style 405B tokens: {chinchilla_style_tokens / 1e12:.1f}T")
print(
    "Reported Llama 3 dataset / projected effective stock: "
    f"{llama3_405b_tokens / effective_stock:.1%}"
)
print("Comparison is dataset size, not unique-text consumption.")

Output

Llama 3 405B reported tokens: 15.6T
Chinchilla-style 405B tokens: 8.1T
Reported Llama 3 dataset / projected effective stock: 4.9%
Comparison is dataset size, not unique-text consumption.

A new scaling axis: test-time compute

So far, "scale" meant training compute split between parameters and tokens. A third allocation question is how much compute the model spends at inference time while answering a single query.

Instead of producing one answer in a fixed number of forward passes, a model can generate longer responses, sample multiple candidates, or search with a verifier. Snell et al. ^{[8]Reference 8Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.https://arxiv.org/abs/2408.03314} evaluate revision and verifier-search strategies with PaLM 2-S* on MATH. In their FLOP-matched comparison, adaptively allocated test-time compute with a smaller model can outperform a roughly 14x larger model on problems where the smaller model already attains non-trivial success; on the hardest problems or under higher inference workloads, additional pre-training is more effective.

DeepSeek-R1 provides a related post-training example: its R1-Zero model applies reinforcement learning without supervised fine-tuning before RL, while the final R1 pipeline integrates cold-start data, supervised fine-tuning, and reinforcement-learning stages.^{[9]Reference 9DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learninghttps://arxiv.org/abs/2501.12948} It demonstrates that post-training can change reasoning behavior; it doesn't show that pre-training data or compute no longer matters.

This matters for sizing decisions in two ways. First, it adds a knob: for a quality target, you can trade a bigger or longer-trained model against a smaller model that spends more compute per query. Second, it interacts with the inference-aware objective from earlier, because additional generated candidates or longer responses multiply $T_{\text{served}}$ , raising the serving term.

The small calculation below isolates generated-token cost for sampled candidates. A real request also includes prompt tokens and any verifier work.

sampled-candidates-raise-generation-cost.py

def generation_flops(parameters, output_tokens_per_candidate, candidates):
    return 2 * parameters * output_tokens_per_candidate * candidates

parameters = 8e9
output_tokens_per_candidate = 512
for candidates in [1, 4, 16]:
    flops = generation_flops(parameters, output_tokens_per_candidate, candidates)
    print(f"{candidates:>2} candidate(s): {flops:.2e} generation FLOPs per request")

Output

 1 candidate(s): 8.19e+12 generation FLOPs per request
 4 candidate(s): 3.28e+13 generation FLOPs per request
16 candidate(s): 1.31e+14 generation FLOPs per request

Common failures and fixes

Even experienced engineers trip over scaling laws. These five mistakes are frequent enough to check during review.

Mistake 1: treating the 20:1 rule as universal

Symptom: You read the Chinchilla paper, memorize "20 tokens per parameter," and apply it to every model you build. A 1B-parameter model trained on 20B tokens underperforms on your messy internal-code corpus, and you can't figure out why.
Cause: The 20:1 ratio is a fitted result for dense transformers on general web text. It shifts with data quality, tokenizer design, optimizer regime, and architecture. It's a starting point, not a physical constant.
Fix: Treat 20:1 as a baseline, then run small proxy experiments (100M to 1B parameters) on your actual data to fit your own exponents. Data quality, repetition, tokenizer, optimizer, and architecture can move the fitted optimum in either direction.

Mistake 2: treating $C \approx 6ND$ as exact accounting

Symptom: Your spreadsheet says two candidate runs have the same training FLOPs, but the real cluster bills differ materially.
Cause: $6ND$ is a dense-transformer rule of thumb. It ignores attention-kernel details, optimizer state, activation checkpointing, sequence length, hardware utilization, distributed communication, data pipeline stalls, and sparse-routing behavior.
Fix: Use $6ND$ for first-pass sizing, then replace it with measured hardware FLOPs, wall-clock throughput, and dollars per token from your actual stack before committing budget.

Mistake 3: confusing loss with deployed capability

Symptom: Your scaling study predicts excellent cross-entropy loss, but the deployed model hallucinates facts, ignores instructions, or fails safety checks.
Cause: Scaling laws predict pre-training loss (how well the model compresses the training distribution), not downstream task performance, alignment quality, or reasoning ability. A model can have great loss and still be useless in production.
Fix: Budget separately for the post-training pipeline. Techniques such as instruction tuning and RLHF ^{[10]Reference 10Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155} can improve instruction behavior and preference alignment that a pre-training loss curve doesn't measure. Don't skip evaluation or post-training because your loss curve looks good.

Mistake 4: ignoring inference costs when sizing a model

Symptom: You train a Chinchilla-optimal 70B model, deploy it to production, and discover that your serving bill exceeds your training budget within three months.
Cause: Chinchilla optimizes training loss, not total lifetime cost. A smaller model trained longer costs more upfront in training compute but saves money on each inference call.
Fix: Before you commit to a model size, estimate your expected inference volume. Compare total cost = $6ND$ (train) + $2N \cdot T_{\text{served}}$ (inference) across candidates that meet a measured quality target. At sufficiently high demand, a smaller model trained for longer can be cheaper under this proxy; verify the crossover on your serving stack.

Mistake 5: extrapolating scaling laws far beyond the training regime

Symptom: Your proxy experiments span models from 100M to 1B parameters. You fit a beautiful straight line, extrapolate it to predict the loss of a 100B model, and the actual result is way off.
Cause: Power-law fits work well inside the range where they were measured. Outside that range, new bottlenecks appear: optimizer instability, numerical precision issues, data exhaustion, or hardware communication overhead that didn't exist at small scale.
Fix: Validate with at least one mid-scale experiment before committing to the full target scale. If your proxy range is 100M to 1B, run a 10B validation before you trust the curve at 100B. Treat extrapolation as a hypothesis, not a guarantee.
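
The sketch below shows one way an in-range fit can miss out of range, and why a mid-scale check catches it early. The "true" loss curve is a synthetic assumption with a loss floor, chosen only for illustration; the straight-line fit in log-log space deliberately ignores that floor.

```python
import numpy as np


def true_loss(n_params):
    """Synthetic ground truth with a loss floor (illustrative assumption)."""
    return 1.8 + 40.0 / n_params ** 0.2


# Proxy runs spanning 100M to 1B parameters.
proxy_n = np.array([1e8, 2e8, 5e8, 1e9])
proxy_loss = true_loss(proxy_n)

# Fit a straight line in log-log space: log L = intercept + slope * log N.
slope, intercept = np.polyfit(np.log(proxy_n), np.log(proxy_loss), deg=1)


def predict(n_params: float) -> float:
    """Extrapolate the pure power-law fit to a new model size."""
    return float(np.exp(intercept + slope * np.log(n_params)))


def relative_miss(n_params: float) -> float:
    """Fraction by which the extrapolation underestimates the synthetic loss."""
    actual = float(true_loss(n_params))
    return (actual - predict(n_params)) / actual


miss_10b = relative_miss(1e10)   # mid-scale validation point
miss_100b = relative_miss(1e11)  # full target scale
print(f"Miss at 10B:  {miss_10b:.1%}")
print(f"Miss at 100B: {miss_100b:.1%}")
```

In this toy setup the miss is already visible at 10B and is larger at 100B, so the cheaper validation run flags the problem before the expensive one is launched. Real misses can come from the other bottlenecks listed above, which no curve fit will reveal.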

Scaling law breakdowns and limitations

Scaling laws are empirical fits, not physical laws. Exponents shift with architecture, data quality, tokenizer, and optimizer regime, so one fitted curve shouldn't be treated as a universal law.

Emergent abilities

Wei et al. ^{[11]Reference 11Emergent Abilities of Large Language Models.https://arxiv.org/abs/2206.07682} highlighted benchmark tasks where performance appears to stay near zero and then jump at larger scales, creating the impression of a phase transition in capability. Schaeffer et al. ^{[12]Reference 12Are Emergent Abilities of Large Language Models a Mirage?https://arxiv.org/abs/2304.15004} later argued that many of these jumps are measurement artifacts: when you replace exact-match thresholds with continuous metrics, the same capabilities often return to smoother scaling curves.

The point isn't that emergence is fake or guaranteed. The useful lesson is measurement design: thresholded metrics can make smooth changes look abrupt, so teams should inspect continuous metrics before treating a capability jump as a new law of scale.
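
A toy calculation makes the measurement effect concrete. It assumes hypothetical per-token accuracies that improve smoothly with scale, and it assumes token errors are independent, which is a simplification rather than a claim about real models.

```python
import numpy as np

# Hypothetical per-token accuracies for ten models of increasing scale.
per_token_acc = np.linspace(0.50, 0.95, 10)

# Exact match on a 10-token answer requires every token to be right.
# Assuming independent per-token errors (a simplification), that is p ** 10.
answer_len = 10
exact_match = per_token_acc ** answer_len

for p, em in zip(per_token_acc, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The continuous metric rises in equal steps, while the thresholded metric sits near zero for the first several models and then climbs steeply. Both columns describe the same underlying improvement.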

Task-specific ceilings

Scaling laws predict pre-training loss, not downstream task performance. A model that achieves excellent perplexity may still fail at:

Factual accuracy (hallucinations)
Instruction following
Safety alignment
Specific domain tasks

Because scaling laws only measure the model's ability to compress and predict the training distribution, teams must allocate separate compute budgets for the post-training pipeline. Techniques such as instruction tuning and RLHF ^{[10]Reference 10Training Language Models to Follow Instructions with Human Feedback (InstructGPT).https://arxiv.org/abs/2203.02155} are used to improve instruction behavior and preference alignment, which must be evaluated separately from pre-training loss.

Architecture sensitivity

Scaling laws derived for dense transformers don't directly transfer to:

Mixture-of-Experts (MoE): MoE models activate only a subset of parameters for each input, decoupling total parameters from compute per token. Standard dense scaling laws are hard to apply directly because active parameters and total parameters scale differently.
State Space Models (SSMs): SSMs process sequences recurrently rather than using quadratic attention. These alternative architectures have their own scaling exponents and require independent empirical fitting.
Hybrid architectures: Models that combine attention with SSMs or other mechanisms must have their scaling behavior re-characterized from scratch.

For these architectures, dense-transformer scaling laws are useful context, not a substitute for new measurements. Each architecture needs empirical scaling studies to determine its own parameter and data allocation.
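
To make the MoE ambiguity concrete, the sketch below counts total and active parameters for a hypothetical top-2-of-8 configuration. All sizes are invented for illustration, and applying the dense $6N$ rule to active parameters is itself an approximation that ignores router cost and load imbalance.

```python
def moe_param_counts(
    shared_params: float,
    params_per_expert: float,
    n_experts: int,
    top_k: int,
) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts for a simple MoE.

    shared_params: attention, embeddings, and other always-used weights.
    params_per_expert: one expert's weights summed across all layers.
    """
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical configuration: 8 experts, 2 routed per token.
total, active = moe_param_counts(
    shared_params=2e9, params_per_expert=5.5e9, n_experts=8, top_k=2
)

print(f"Total parameters:  {total / 1e9:.0f}B")
print(f"Active per token:  {active / 1e9:.0f}B")
# Dense rule of thumb applied to ACTIVE parameters (an approximation).
print(f"Approx. training FLOPs per token: {6.0 * active:.2e}")
```

Which of the two counts plays the role of $N$ in a scaling fit is exactly the question a dense-transformer law cannot answer, so it has to be settled by measurement on the architecture in question.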

From theory to practice: running a scaling study

When planning a large training run, teams conduct scaling studies (small-scale experiments that predict large-scale performance) before committing resources. The practical loop is:

Train proxy models, usually across several parameter counts and token budgets.
Fit scaling exponents for the loss surface.
Extrapolate to the target frontier: model size, token budget, and expected loss.
Run a mid-scale validation experiment to check the extrapolation.
Commit to the full training run only after the mid-scale result lands near the fitted curve.

The µ-Transfer approach

To estimate the performance of a large model before committing to its full run, engineers use proxy models. A complementary tool is µ-Transfer (Tensor Programs V) ^{[13]Reference 13Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transferhttps://arxiv.org/abs/2203.03466}, which uses the Maximal Update Parametrization (µP) so selected hyperparameters tuned on a smaller proxy can be transferred to a larger target model.

Yang et al. show that many optimal hyperparameters remain stable as model size changes in µP, and they verify transfer on Transformer and ResNet experiments. Their reported examples transfer from 13M parameters to BERT-large and from 40M parameters to a 6.7B GPT-3 model while reducing tuning cost. This is evidence for the tested setups, not permission to copy every setting without validation.

Used alongside a scaling study, the workflow is:

Train smaller proxy models across multiple data budgets to sample the loss surface.
In a µP setup, tune hyperparameters covered by the transfer method at small scale and transfer candidate settings to larger models.
Fit the power-law scaling exponents ( $\alpha$ , $\beta$ , and the constants) using the observed losses from proxy runs.
Extrapolate the fitted curve to the target scale to predict final loss.
Validate with a single medium-scale experiment before committing to the full run.

The µ-Transfer paper shows that zero-shot hyperparameter transfer can work across large changes in model scale in its tested setups. A medium-scale validation still matters because transferability, loss fits, data mixture, and hardware behavior are separate failure modes.

Fitting the parametric loss function

One common parametric fit for model size and data, used in Chinchilla-style scaling studies, is:^{[2]Reference 2Training Compute-Optimal Large Language Models.https://arxiv.org/abs/2203.15556}

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Reading the formula: Loss equals three fitted terms: (1) $E$ , the asymptotic floor in this model of the measured regime, (2) $A/N^\alpha$ , the fitted penalty associated with finite parameters, and (3) $B/D^\beta$ , the fitted penalty associated with finite training data. Making the model bigger reduces term 2; more data reduces term 3. Treat $E$ as an extrapolated fit parameter, not proof that you have measured the irreducible entropy of language.
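
Combined with the $C \approx 6ND$ approximation, this fit also implies a compute-optimal split between parameters and tokens. The worked sketch below minimizes the fitted loss at fixed compute; the shorthand $a$, $b$, and $G$ is introduced here for convenience, and the result is only as trustworthy as the fitted constants.

```latex
% Minimize L(N, D) subject to the compute constraint 6ND = C.
% Substitute D = C / (6N):
L(N) = E + A N^{-\alpha} + B \left(\frac{6N}{C}\right)^{\beta}

% Set the derivative with respect to N to zero:
\frac{dL}{dN} = -\alpha A N^{-\alpha - 1}
              + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta - 1} = 0
\quad\Longrightarrow\quad
\frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}}

% Solve for the optimum:
N_{\text{opt}} = G \left(\frac{C}{6}\right)^{a}, \qquad
D_{\text{opt}} = G^{-1} \left(\frac{C}{6}\right)^{b}

% with
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}}, \qquad
a = \frac{\beta}{\alpha + \beta}, \qquad
b = \frac{\alpha}{\alpha + \beta}
```

When $\alpha$ and $\beta$ are close, $a$ and $b$ are both near 0.5, so parameters and tokens grow at similar rates as compute increases. That is the balanced allocation associated with Chinchilla-style fits.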

To see how this works in practice, consider a tiny synthetic dataset of proxy runs:

Parameters (N)	Tokens (D)	Observed Loss
100M	1B	2.894
100M	5B	2.745
100M	20B	2.634
500M	1B	2.811
500M	5B	2.662
500M	20B	2.551
1B	5B	2.629
1B	20B	2.518
1B	100B	2.407

Notice the pattern: fixing N and increasing D lowers loss; fixing D and increasing N also lowers loss. The parametric formula above captures both effects at once.
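
The additive form makes a stronger prediction than "both effects lower loss": the gain from extra data should not depend on model size, and the gain from extra parameters should not depend on the token budget. The short check below tests that on the synthetic table above, which was constructed to be this clean; real measurements will carry noise.

```python
# Synthetic proxy-run losses copied from the table above: (N, D) -> loss.
loss = {
    (1e8, 1e9): 2.894, (1e8, 5e9): 2.745, (1e8, 2e10): 2.634,
    (5e8, 1e9): 2.811, (5e8, 5e9): 2.662, (5e8, 2e10): 2.551,
    (1e9, 5e9): 2.629, (1e9, 2e10): 2.518, (1e9, 1e11): 2.407,
}

# Loss drop from 5B -> 20B tokens, measured at each model size.
data_gains = [round(loss[(n, 5e9)] - loss[(n, 2e10)], 3) for n in (1e8, 5e8, 1e9)]

# Loss drop from 100M -> 500M parameters, measured at each token budget.
param_gains = [round(loss[(1e8, d)] - loss[(5e8, d)], 3) for d in (1e9, 5e9, 2e10)]

print("Gain from 5B -> 20B tokens at N = 100M, 500M, 1B:", data_gains)
print("Gain from 100M -> 500M params at D = 1B, 5B, 20B:", param_gains)
```

If the gains in either list differed materially across rows in a real study, that would be a sign that a separable $A/N^\alpha + B/D^\beta$ form is missing an interaction between model size and data.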

Figure: Nine synthetic proxy-run measurements (scatter plot) paired with the fitted loss surface $L = E + A/N^{\alpha} + B/D^{\beta}$ (heatmap). The heatmap reproduces the measured cells and extrapolates to 70B parameters and 1.4T tokens. The starred 70B / 1.4T cell is a forecast of 2.088, not a measurement, so a mid-scale validation check still matters.

This Python script fits the scaling constants with SciPy's curve_fit and then predicts loss at a larger target scale:

fitting-the-parametric-loss-function.py

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(
    X: tuple[np.ndarray, np.ndarray],
    E: float,
    A: float,
    alpha: float,
    B: float,
    beta: float,
) -> np.ndarray:
    N, D = X
    return E + A / (N ** alpha) + B / (D ** beta)

# Synthetic proxy-run measurements for illustration only.
# Each row: (parameters, tokens, observed_loss)
experiments = np.array([
    [1e8,  1e9,   2.894],
    [1e8,  5e9,   2.745],
    [1e8,  2e10,  2.634],
    [5e8,  1e9,   2.811],
    [5e8,  5e9,   2.662],
    [5e8,  2e10,  2.551],
    [1e9,  5e9,   2.629],
    [1e9,  2e10,  2.518],
    [1e9,  1e11,  2.407],
])

N_data = experiments[:, 0]
D_data = experiments[:, 1]
L_data = experiments[:, 2]

popt, pcov = curve_fit(
    scaling_law, (N_data, D_data), L_data,
    p0=[1.0, 2.5, 0.07, 7.0, 0.09],
    bounds=([0, 0, 0, 0, 0], [5, 1e6, 1, 1e6, 1]),
    maxfev=20_000,
)

E_fit, A_fit, alpha_fit, B_fit, beta_fit = popt
print(f"Fitted asymptotic E = {E_fit:.3f}")
print(f"Parameter scaling: A={A_fit:.1f}, alpha={alpha_fit:.4f}")
print(f"Data scaling:      B={B_fit:.1f}, beta={beta_fit:.4f}")

# Extrapolate: predict loss for a 70B model on 1.4T tokens
predicted = scaling_law((70e9, 1.4e12), *popt)
print(f"\nPredicted loss for 70B / 1.4T tokens: {predicted:.3f}")

Output

Fitted asymptotic E = 1.096
Parameter scaling: A=2.8, alpha=0.0703
Data scaling:      B=7.8, beta=0.0980

Predicted loss for 70B / 1.4T tokens: 2.088

What to check before moving on

Tier	Defense target
Foundational	Explain parameters, tokens, compute, and why scaling laws are empirical power-law fits.
Foundational	Read Kaplan's single-variable loss curves and name what each bottleneck means.
Intermediate	Explain why Kaplan's fitted frontier favored parameters faster than data.
Intermediate	Describe how Chinchilla revised the allocation toward roughly 20 tokens per parameter for dense transformers.
Advanced	Use $C \approx 6ND$ to estimate tokens, parameters, and training FLOPs while stating its limits.
Advanced	Explain why inference-aware scaling can favor smaller models trained longer.
Advanced	Diagnose over-training, under-training, extrapolation risk, and dense-versus-MoE parameter ambiguity.
Advanced	Design a small scaling study with proxy runs, fitted exponents, and a mid-scale validation check.

What to remember

Scaling laws are power laws that quantify how loss decreases with model size, data, and compute. They're empirical, not theoretical.
Chinchilla revised the allocation: in Hoffmann et al.'s dense-transformer training-loss setup, roughly 20 tokens per parameter was a better compute-optimal rule than parameter-heavy GPT-3-era ratios.
Lifetime-cost objectives change the calculation: under Sardana et al.'s fitted inference-aware objective, high demand can favor smaller models trained for far more tokens; that prediction still requires quality and cost validation.
Scaling laws have blind spots: they predict pre-training loss but not emergent capabilities, alignment quality, or task-specific performance.
Public-text supply is a forecasted constraint: Villalobos et al. estimate about 320T effective tokens and project full utilization under stated growth assumptions; treat this as a planning scenario, not a confirmed exhaustion event.
Test-time compute is another allocation axis: evaluated inference-time strategies can lift quality in some regimes, but extra candidates or response tokens also raise serving cost.
Scaling studies are essential: fit power-law parameters on small proxy models before committing to expensive large-scale training runs.

Next Step

Continue to Pre-training Data at Scale

Scaling laws give you the compute and token budget; the data pipeline decides whether those tokens are clean, diverse, deduplicated, and useful enough to justify the spend.


References

[1] Scaling Laws for Neural Language Models. Kaplan et al. · 2020

[2] Training Compute-Optimal Large Language Models. Hoffmann, J., et al. · 2022 · NeurIPS 2022

[3] Language Models are Few-Shot Learners. Brown, T., et al. · 2020 · NeurIPS 2020

[4] Resolving Discrepancies in Compute-Optimal Scaling of Language Models. Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., & Carmon, Y. · 2024

[5] The Llama 3 Herd of Models. Dubey, A., et al. · 2024 · arXiv preprint

[6] Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Sardana & Frankle · 2024

[7] Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. Villalobos, P., et al. (Epoch AI) · 2022 · arXiv preprint

[8] Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters. Snell, C., et al. · 2024 · arXiv preprint

[9] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI · 2025

[10] Training Language Models to Follow Instructions with Human Feedback (InstructGPT). Ouyang, L., et al. · 2022 · NeurIPS 2022

[11] Emergent Abilities of Large Language Models. Wei, J., et al. · 2022 · TMLR

[12] Are Emergent Abilities of Large Language Models a Mirage? Schaeffer, R., et al. · 2023 · NeurIPS 2023

[13] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. Yang, G., et al. · 2022 · NeurIPS 2022
