Mixed Precision Training

Measure how FP16 and BF16 affect training range, update precision, memory, and release evidence before enabling faster low-precision compute.

A training-speed change is an experiment, not an automatic improvement. A faster run isn't better if it corrupts updates, overflows, or lowers the declared held-out metric.

Suppose the deploy assistant now needs a small trace-evidence classifier that recognizes whether an incident summary is supported by logs. You want to fine-tune it faster, so the proposed run changes its precision policy from FP32 to FP16 or BF16. Build the run evidence you would need before choosing that policy.

Figure: Two log-scale comparisons of FP32, FP16, and BF16. Upper plot (representable magnitude): FP32 and BF16 span roughly 1.2e-38 to 3.4e38, while FP16 has a 6.1e-5 normal floor, a roughly 6.0e-8 subnormal floor, and a 6.6e4 maximum, so a 1.2e-8 gradient underflows and a 1e5 activation overflows only in FP16. Lower plot (spacing near 1.0): a 1e-4 update survives FP32 spacing but rounds away in both FP16 and BF16, which motivates low-precision compute with FP32 update storage. BF16 protects the two range examples, but only FP32 preserves the small parameter update.

Make precision an explicit experiment parameter

Mixed precision training runs selected expensive operations in a compact floating-point format while retaining higher precision where training is fragile. Current PyTorch AMP examples create the model and optimizer in default precision, then let autocast choose an operation-specific dtype inside the forward region.[1]

Start as you would in an experiment tracker: state the artifact under test and its guardrails before measuring candidates.

precision-contract.py

from dataclasses import dataclass
import torch

@dataclass(frozen=True)
class PrecisionContract:
    experiment: str
    artifact: str
    baseline: str
    candidates: tuple[str, ...]
    data_fingerprint: str
    seed_policy: str
    eval_suite: str
    hardware_profile: str
    required_supported_evidence_f1: float
    permitted_nonfinite_steps: int

contract = PrecisionContract(
    experiment="trace-evidence-encoder-precision",
    artifact="trace-evidence-classifier-v2",
    baseline="fp32",
    candidates=("fp16_unscaled", "fp16_scaled", "bf16"),
    data_fingerprint="trace-events@sha256:fixture-7",
    seed_policy="seeds=11,17,23",
    eval_suite="supported-evidence@sha256:suite-4",
    hardware_profile="a100-80gb-single-gpu",
    required_supported_evidence_f1=0.92,
    permitted_nonfinite_steps=0,
)

print(f"experiment={contract.experiment}")
print(f"artifact={contract.artifact}")
print(f"baseline={contract.baseline}")
print(f"candidates={','.join(contract.candidates)}")
print(f"data_fingerprint={contract.data_fingerprint}")
print(f"seed_policy={contract.seed_policy}")
print(f"eval_suite={contract.eval_suite}")
print(f"hardware_profile={contract.hardware_profile}")
print(f"metric_gate=supported_evidence_f1>={contract.required_supported_evidence_f1:.2f}")
print(f"nonfinite_steps_gate={contract.permitted_nonfinite_steps}")

Output

experiment=trace-evidence-encoder-precision
artifact=trace-evidence-classifier-v2
baseline=fp32
candidates=fp16_unscaled,fp16_scaled,bf16
data_fingerprint=trace-events@sha256:fixture-7
seed_policy=seeds=11,17,23
eval_suite=supported-evidence@sha256:suite-4
hardware_profile=a100-80gb-single-gpu
metric_gate=supported_evidence_f1>=0.92
nonfinite_steps_gate=0

The job isn't to declare BF16 good or FP16 bad in the abstract. It's to understand what each format can lose, then measure the acceptable candidates under the same validation and hardware conditions.

A floating-point number has two limits

A floating-point value is similar to scientific notation: a sign, a scale, and significant digits. Its exponent controls range, meaning how tiny or large a magnitude it can represent. Its fraction (often called the mantissa in training discussions) controls resolution, meaning how close two neighboring values can be.

Format	Bits	Exponent bits	Fraction bits	Main training consequence
FP32	32	8	23	Wide range and fine update resolution, with higher storage cost
FP16	16	5	10	Compact, but small gradients can underflow and large values can overflow
BF16	16	8	7	Compact with FP32-like range, but coarser nearby resolution

The important correction is easy to miss: BF16 improves range relative to FP16; it doesn't improve nearby resolution. BF16 has fewer fraction bits than FP16. That's why BF16 compute still normally updates FP32 parameters.
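The bit counts in the table are enough to derive each format's limits by hand. The sketch below assumes the usual IEEE-style layout (exponent bias of $2^{e-1}-1$, with the top exponent code reserved for Inf and NaN), which holds for FP32, FP16, and BF16. `FormatLimits` and `limits_from_bits` are illustrative names, not PyTorch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatLimits:
    epsilon_at_1: float        # gap between 1.0 and the next larger value
    smallest_normal: float     # smallest positive normal magnitude
    smallest_subnormal: float  # smallest positive magnitude of any kind
    largest_finite: float      # largest magnitude before Inf


def limits_from_bits(exponent_bits: int, fraction_bits: int) -> FormatLimits:
    """Derive range and resolution limits from the two bit counts."""
    bias = 2 ** (exponent_bits - 1) - 1
    epsilon = 2.0 ** -fraction_bits
    return FormatLimits(
        epsilon_at_1=epsilon,
        smallest_normal=2.0 ** (1 - bias),
        smallest_subnormal=2.0 ** (1 - bias - fraction_bits),
        largest_finite=(2.0 - epsilon) * 2.0 ** bias,
    )


for label, e_bits, f_bits in (("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)):
    lim = limits_from_bits(e_bits, f_bits)
    print(
        f"{label}: eps={lim.epsilon_at_1:.1e}, "
        f"min_normal={lim.smallest_normal:.1e}, "
        f"max={lim.largest_finite:.1e}"
    )
```

Exponent bits appear only in the range terms and fraction bits set the epsilon, which is the range-versus-resolution split in arithmetic form.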

PyTorch exposes the exact format limits with torch.finfo. Run this on CPU; no accelerator is required to inspect the number system.

format-limits.py

formats = (
    ("FP32", torch.float32),
    ("FP16", torch.float16),
    ("BF16", torch.bfloat16),
)

print("format  epsilon_at_1  smallest_normal  largest_finite")
for label, dtype in formats:
    info = torch.finfo(dtype)
    print(f"{label:<6}  {info.eps:>12.1e}  {info.tiny:>15.1e}  {info.max:>14.1e}")

print(f"bf16_min_normal_matches_fp32={torch.finfo(torch.bfloat16).tiny == torch.finfo(torch.float32).tiny}")
print(f"bf16_resolution_coarser_than_fp16={torch.finfo(torch.bfloat16).eps > torch.finfo(torch.float16).eps}")

Output

format  epsilon_at_1  smallest_normal  largest_finite
FP32         1.2e-07          1.2e-38         3.4e+38
FP16         9.8e-04          6.1e-05         6.6e+04
BF16         7.8e-03          1.2e-38         3.4e+38
bf16_min_normal_matches_fp32=True
bf16_resolution_coarser_than_fp16=True

epsilon_at_1 is the gap between 1.0 and the next larger representable value. smallest_normal and largest_finite describe range. FP16 gives finer resolution than BF16 near 1.0 (about eight times finer, from its three extra fraction bits), but far less range.

Tiny updates need an FP32 home

Assume the classifier has a weight at 1.0. One optimizer step wants to subtract 0.0001. This is much larger than the smallest BF16 magnitude, but smaller than the spacing between BF16 values near 1.0.

tiny-parameter-update.py

weight = torch.tensor([1.0], dtype=torch.float32)
update = torch.tensor([1.0e-4], dtype=torch.float32)

for label, dtype in formats:
    before = weight.to(dtype)
    after = before - update.to(dtype)
    changed = bool(after.item() != before.item())
    print(f"{label}: stored_after_step={after.item():.8f}, update_survived={changed}")

print("lesson=BF16 protects range; FP32 protects small accumulated updates")

Output

FP32: stored_after_step=0.99989998, update_survived=True
FP16: stored_after_step=1.00000000, update_survived=False
BF16: stored_after_step=1.00000000, update_survived=False
lesson=BF16 protects range; FP32 protects small accumulated updates

Both 16-bit parameter values lose this update. The original mixed-precision recipe used an FP32 master copy of the weights so small updates accumulate instead of disappearing.[2] Current PyTorch AMP gets the same protection in its ordinary pattern by creating parameters in default precision and autocasting eligible forward operations rather than converting parameter storage before the optimizer step.[1]
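A single lost update looks harmless; the damage shows when many small steps are supposed to add up. The following is a minimal standard-library sketch, not PyTorch code: it emulates FP16 and FP32 storage by round-tripping Python floats through `struct` (format codes `e` and `f`). The helper names are hypothetical, and the 1000-step count is an arbitrary illustration.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def apply_updates(round_fn, steps: int = 1000, update: float = 1.0e-4) -> float:
    """Subtract the same small update repeatedly, rounding storage after every step."""
    weight = round_fn(1.0)
    step = round_fn(update)
    for _ in range(steps):
        weight = round_fn(weight - step)
    return weight


fp16_weight = apply_updates(round_fp16)
fp32_weight = apply_updates(round_fp32)

print(f"fp16_storage_after_1000_steps={fp16_weight:.6f}")
print(f"fp32_storage_after_1000_steps={fp32_weight:.6f}")
```

With FP16 storage every step rounds back to 1.0, so the weight never moves. With FP32 storage the same steps accumulate to roughly 0.9. BF16 storage would stall in the same way because its spacing near 1.0 is coarser than FP16's.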

Range decides whether gradients exist at all

Resolution is one issue. Range is another. Compare an extremely small gradient and a very large activation-like value when stored in FP16 and BF16.

gradient-range.py

values = torch.tensor([1.2e-8, 1.0e5], dtype=torch.float32)

for label, dtype in (("FP16", torch.float16), ("BF16", torch.bfloat16)):
    cast = values.to(dtype)
    print(
        f"{label}: small={cast[0].item():.2e}, "
        f"large={cast[1].item():.2e}, "
        f"all_finite={bool(torch.isfinite(cast).all())}"
    )

print("fp16_loses_small_and_large=True")
print("bf16_keeps_range_in_this_example=True")

Output

FP16: small=0.00e+00, large=inf, all_finite=False
BF16: small=1.20e-08, large=9.98e+04, all_finite=True
fp16_loses_small_and_large=True
bf16_keeps_range_in_this_example=True

FP16's smallest positive normal number is roughly $6.1 \times 10^{-5}$ , and its subnormal floor is about $6.0 \times 10^{-8}$ . A true gradient of $1.2 \times 10^{-8}$ becomes zero in FP16. At the other end, 100000 is beyond FP16's largest finite value of 65504, so it becomes Inf.

BF16 keeps the 8-bit exponent width of FP32, giving it a similar range and making these two magnitudes representable, although rounded. The BF16 training study documents that wider range as its main stability advantage over FP16.[3]
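Because BF16 shares FP32's sign and exponent layout, a BF16 value can be viewed as the top 16 bits of an FP32 encoding. The sketch below uses that view to reproduce the two casts without PyTorch. It is a standard-library approximation with hypothetical helper names (`to_fp16`, `to_bf16`), handles ordinary finite inputs only, and assumes round-to-nearest-even.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE binary16. struct raises OverflowError past the FP16 range; map that to Inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 by keeping the top 16 bits of its float32 encoding.

    NaN and magnitudes at the very top of the float32 range are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 fraction bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.2e-8, 1.0e5):
    print(f"value={value:.2e}: fp16={to_fp16(value):.2e}, bf16={to_bf16(value):.2e}")
```

The FP16 casts collapse to zero and Inf, while the BF16 casts stay finite but land on the nearest value that seven fraction bits allow.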

FP16 uses loss scaling to rescue small gradients

For FP16, loss scaling moves gradient magnitudes into a representable interval during backpropagation. Multiply loss by a scale $S$ ; the chain rule multiplies each gradient by $S$ too. After backward, divide gradients by $S$ in FP32 before applying the optimizer step. The intended update has not changed.
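Written out, assuming plain SGD with learning rate $\eta$ and writing $g = \partial L / \partial w$ for the true gradient (both symbols are introduced here for illustration), the scale cancels exactly in real arithmetic:

$$
\tilde{g} = \frac{\partial (S \cdot L)}{\partial w} = S \cdot \frac{\partial L}{\partial w} = S\,g,
\qquad
w \leftarrow w - \eta \cdot \frac{\tilde{g}}{S} = w - \eta\,g .
$$

Only the intermediate FP16 representation of $\tilde{g}$ differs from the unscaled run, which is the point: $S\,g$ can be representable where $g$ is not.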

For a true gradient of $1.2 \times 10^{-8}$ :

Operation	Value	FP16 outcome
Cast unscaled gradient	$1.2 \times 10^{-8}$	Rounds to zero
Multiply by $S=1024$ during backward	$1.23 \times 10^{-5}$	Representable as an FP16 subnormal
Convert to FP32 and divide by $S$	approximately $1.2 \times 10^{-8}$	Ready for FP32 update

loss-scaling-rescue.py

true_grad = torch.tensor([1.2e-8], dtype=torch.float32)
scale = 1024.0

plain_fp16 = true_grad.to(torch.float16)
scaled_fp16 = (true_grad * scale).to(torch.float16)
recovered_fp32 = scaled_fp16.to(torch.float32) / scale

print(f"plain_underflowed={plain_fp16.item() == 0.0}")
print(f"scaled_visible={scaled_fp16.item() > 0.0}")
print(f"recovered_grad={recovered_fp32.item():.2e}")
print(f"recovery_relative_error={abs(recovered_fp32.item() - true_grad.item()) / true_grad.item():.3%}")

Output

plain_underflowed=True
scaled_visible=True
recovered_grad=1.20e-08
recovery_relative_error=0.077%

The recovery isn't exact because the scaled gradient, about $1.23 \times 10^{-5}$, still sits below FP16's smallest normal number and is stored as a subnormal with reduced precision. A larger scale would move it into the normal range, at the cost of less headroom before overflow.

Figure: Log-magnitude plot of loss scaling. Multiplying gradients by 1024 shifts every value up by 3.01 decades: a 1.2e-8 tiny gradient moves into FP16 range at 1.23e-5, a 2e-2 quiet gradient stays finite at 20.5, and a 1e2 spike becomes 1.024e5, above the FP16 maximum of 65504. The same scale rescues the tiny gradient and overflows the spike. Dynamic scaling unscales and applies only the finite batch at scale 1024, then skips the overflowed batch and halves the next scale to 512.

Scaling too far causes overflow

A fixed scale that saves the smallest gradient may overflow a larger gradient in the same step. Dynamic scaling therefore has two outcomes: apply a finite, descaled update, or skip an overflowed step and reduce the scale.

overflow-backoff.py

def scaled_step_status(gradients: torch.Tensor, scale: float) -> tuple[str, float]:
    scaled = (gradients * scale).to(torch.float16)
    if not bool(torch.isfinite(scaled).all()):
        return "SKIP_OVERFLOW", scale / 2
    return "APPLY_DESCALED_UPDATE", scale

quiet_step = torch.tensor([1.2e-8, 2.0e-2], dtype=torch.float32)
spiky_step = torch.tensor([1.2e-8, 1.0e2], dtype=torch.float32)

quiet_status, quiet_next_scale = scaled_step_status(quiet_step, 1024.0)
spiky_status, spiky_next_scale = scaled_step_status(spiky_step, 1024.0)

print(f"quiet_step={quiet_status}, next_scale={quiet_next_scale:.0f}")
print(f"spiky_step={spiky_status}, next_scale={spiky_next_scale:.0f}")
print("invariant=never_apply_nonfinite_gradients")

Output

quiet_step=APPLY_DESCALED_UPDATE, next_scale=1024
spiky_step=SKIP_OVERFLOW, next_scale=512
invariant=never_apply_nonfinite_gradients

In current PyTorch, torch.amp.GradScaler performs this scale, unscale, finite-check, skip, and update control flow for FP16 training. PyTorch also documents that if you inspect or clip gradients, you must call scaler.unscale_(optimizer) before clipping so thresholds apply to true gradient magnitudes.[1]

Loss scaling isn't a general extension of FP16 range. It rescues small backward gradients that would underflow, but it can't make a forward activation above 65504 representable. PyTorch also warns that GradScaler may reduce its scale below 1 for overflow-prone models, so don't assume the scale always grows or stays above 1.[1]

Compute low, update high

Loss scaling protects FP16 gradients from range failure. It doesn't make 16-bit parameter storage appropriate for tiny updates. Preserve the FP32 update path separately.

fp32-update-path.py

step = torch.tensor([1.0e-4], dtype=torch.float32)
fp16_parameter = torch.tensor([1.0], dtype=torch.float16)
fp32_parameter = torch.tensor([1.0], dtype=torch.float32)

fp16_after = fp16_parameter - step.to(torch.float16)
fp32_after = fp32_parameter - step

print(f"fp16_parameter_changed={fp16_after.item() != fp16_parameter.item()}")
print(f"fp32_parameter_changed={fp32_after.item() != fp32_parameter.item()}")
print(f"fp32_after={fp32_after.item():.8f}")
print("policy=low_precision_compute_with_fp32_update_state")

Output

fp16_parameter_changed=False
fp32_parameter_changed=True
fp32_after=0.99989998
policy=low_precision_compute_with_fp32_update_state

The original paper describes copying FP32 master weights into a low-precision compute copy.[2] With ordinary AMP, PyTorch parameters remain FP32, autocast selects lower precision for eligible compute, and the optimizer updates the FP32 parameters directly.[1]

This is the CUDA shape you would use for a real fine-tuning run. It isn't marked executable here because it needs an accelerator and a model workload:

cuda-amp-training-shape.py

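# Assumes model, optimizer, criterion, and dataloader are already defined,
# with the model's FP32 parameters resident on the GPU.
dtype = torch.bfloat16  # compare against torch.float16 in a controlled run
use_scaler = dtype == torch.float16  # loss scaling targets FP16 underflow only
scaler = torch.amp.GradScaler("cuda", enabled=use_scaler)

for batch, target in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # Only the forward pass and loss run under autocast; backward stays outside.
    with torch.autocast(device_type="cuda", dtype=dtype):
        logits = model(batch.cuda())
        loss = criterion(logits, target.cuda())

    if scaler.is_enabled():
        scaler.scale(loss).backward()
        # Unscale before clipping so max_norm applies to true gradient magnitudes.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)  # skips the update if gradients are non-finite
        scaler.update()  # adjusts the scale for the next iteration
    else:
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()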

For BF16, skipping GradScaler is a common policy because BF16's exponent range avoids the FP16 failure that loss scaling targets. It isn't a guarantee that every BF16 training job is stable. Bad data, unstable losses, overly large learning rates, or sensitive kernels can still produce non-finite values.
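Since a BF16 run has no scaler to reject bad steps, the run itself has to count them for the `permitted_nonfinite_steps` gate. The following is a minimal sketch of that counter in plain Python. The recorded gradient values are invented for illustration, and `count_nonfinite_steps` is a hypothetical helper; a real run would check tensors with `torch.isfinite` instead.

```python
import math


def count_nonfinite_steps(steps: list[list[float]]) -> int:
    """Count steps whose gradients contain at least one Inf or NaN."""
    return sum(
        1 for gradients in steps
        if not all(math.isfinite(g) for g in gradients)
    )


# Illustrative per-step gradient samples, not measurements.
recorded_steps = [
    [0.01, -0.2],
    [float("inf"), 0.3],
    [0.0, float("nan")],
    [1.0e-8, 2.0e-3],
]

permitted_nonfinite_steps = 0
nonfinite_steps = count_nonfinite_steps(recorded_steps)
print(f"nonfinite_steps={nonfinite_steps}")
print(f"stability_gate_passed={nonfinite_steps <= permitted_nonfinite_steps}")
```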

Memory savings need accounting, not slogans

When autocast runs an eligible operation in BF16 or FP16, its saved low-precision activation values use two bytes rather than four. Other operations stay in or return FP32 for numerical safety. Total training memory doesn't necessarily halve because FP32 parameters and optimizer moments may remain unchanged.

The next accounting exercise uses a deliberately small inventory: 100 million parameters, their gradients, two Adam moment buffers, and 800 million stored activation values. To make the arithmetic visible, assume those inventoried activations were saved in low precision for the AMP candidate. A real profile may retain FP32 values for some operations. This is a budget calculation, not a measured GPU profile.

memory-budget.py

def gib(values: int, bytes_per_value: int) -> float:
    return values * bytes_per_value / (1024 ** 3)

parameter_values = 100_000_000
activation_values = 800_000_000

fp32_budget = {
    "parameters": gib(parameter_values, 4),
    "gradients": gib(parameter_values, 4),
    "adam_moments": gib(parameter_values * 2, 4),
    "activations": gib(activation_values, 4),
}
amp_budget = {
    **{name: value for name, value in fp32_budget.items() if name != "activations"},
    "activations": gib(activation_values, 2),
}

print(f"fp32_total_gib={sum(fp32_budget.values()):.2f}")
print(f"amp_total_gib={sum(amp_budget.values()):.2f}")
print(f"activation_saving_gib={fp32_budget['activations'] - amp_budget['activations']:.2f}")
print(f"total_reduction={(1 - sum(amp_budget.values()) / sum(fp32_budget.values())):.1%}")
print("lesson=half_size_activations_do_not_imply_half_total_memory")

Output

fp32_total_gib=4.47
amp_total_gib=2.98
activation_saving_gib=1.49
total_reduction=33.3%
lesson=half_size_activations_do_not_imply_half_total_memory

Figure: Component memory accounting for 100 million parameters, 100 million gradients, two Adam moment buffers, and 800 million saved activations. FP32 totals 4.47 GiB, while the AMP candidate totals 2.98 GiB because only activations halve, from 2.98 to 1.49 GiB: a one-third reduction of the full inventory, not one half. Separate bars for communicating 100 million gradients show 0.37 GiB for FP32 reduction and 0.19 GiB for BF16 reduction, a 50 percent payload reduction, which is why `reduce_dtype` must be measured independently from the compute dtype.

For large models, sharding methods such as ZeRO and Fully Sharded Data Parallel (FSDP) address parameters, gradients, and optimizer-state memory that activation casting alone doesn't remove.[4][5]
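To see what sharding changes in the same budget, the sketch below divides the model-state part of the inventory across workers, following the three ZeRO stages (optimizer state, then gradients, then parameters). It is back-of-the-envelope arithmetic under stated assumptions: FP32 parameters, FP32 gradients, and two FP32 Adam moments as in the inventory above, eight workers chosen arbitrarily, and no activations, communication buffers, or fragmentation. `per_worker_state_gib` is a hypothetical helper.

```python
def per_worker_state_gib(parameter_values: int, workers: int, stage: int) -> float:
    """Model-state GiB held by one worker under a ZeRO-style sharding stage (0 = none)."""
    params, grads, moments = 4.0, 4.0, 8.0  # bytes per parameter value, all FP32
    if stage >= 1:
        moments /= workers  # stage 1 shards optimizer state
    if stage >= 2:
        grads /= workers    # stage 2 also shards gradients
    if stage >= 3:
        params /= workers   # stage 3 also shards parameters
    return parameter_values * (params + grads + moments) / (1024 ** 3)


for stage in range(4):
    state = per_worker_state_gib(100_000_000, workers=8, stage=stage)
    print(f"stage={stage}: model_state_per_worker_gib={state:.2f}")
```

None of these terms is touched by casting activations, which is why sharding and precision policy are accounted for separately.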

Distributed jobs add a communication dtype

When workers exchange gradients, the network payload has its own precision policy. Current PyTorch FSDP MixedPrecision configuration exposes param_dtype for forward and backward computation and reduce_dtype for gradient reduction; the two fields may differ.[6]

communication-budget.py

gradient_values = 100_000_000
fp32_reduce_gib = gib(gradient_values, 4)
bf16_reduce_gib = gib(gradient_values, 2)

print(f"gradient_payload_fp32_gib={fp32_reduce_gib:.2f}")
print(f"gradient_payload_bf16_gib={bf16_reduce_gib:.2f}")
print(f"payload_reduction={(1 - bf16_reduce_gib / fp32_reduce_gib):.0%}")
print("warning=compute_dtype_does_not_prove_reduce_dtype")

Output

gradient_payload_fp32_gib=0.37
gradient_payload_bf16_gib=0.19
payload_reduction=50%
warning=compute_dtype_does_not_prove_reduce_dtype

A job can use BF16 for matrix computations and still communicate FP32 gradient payloads. Therefore a trustworthy run record separates compute_dtype, update_storage_dtype, and reduce_dtype rather than logging a single mixed_precision=true flag.
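One way to make that separation hard to skip is to give the run record three required dtype fields instead of a boolean. The sketch below is a hypothetical schema in plain Python, not a PyTorch or tracker API; the field names follow the paragraph above, and the allowed dtype labels are assumptions.

```python
from dataclasses import asdict, dataclass

ALLOWED_DTYPES = ("fp32", "fp16", "bf16")


@dataclass(frozen=True)
class PrecisionPolicy:
    compute_dtype: str
    update_storage_dtype: str
    reduce_dtype: str
    loss_scaling: bool

    def __post_init__(self) -> None:
        # Reject vague labels such as "mixed" so every field names a concrete dtype.
        for name in ("compute_dtype", "update_storage_dtype", "reduce_dtype"):
            value = getattr(self, name)
            if value not in ALLOWED_DTYPES:
                raise ValueError(f"{name}={value!r} is not one of {ALLOWED_DTYPES}")


def gradient_payload_gib(policy: PrecisionPolicy, gradient_values: int) -> float:
    """Payload size follows reduce_dtype, regardless of compute_dtype."""
    bytes_per_value = 4 if policy.reduce_dtype == "fp32" else 2
    return gradient_values * bytes_per_value / (1024 ** 3)


bf16_compute_fp32_reduce = PrecisionPolicy("bf16", "fp32", "fp32", loss_scaling=False)
bf16_compute_bf16_reduce = PrecisionPolicy("bf16", "fp32", "bf16", loss_scaling=False)

for policy in (bf16_compute_fp32_reduce, bf16_compute_bf16_reduce):
    payload = gradient_payload_gib(policy, 100_000_000)
    print(f"{asdict(policy)} payload_gib={payload:.2f}")
```

Both records would have logged as `mixed_precision=true`, yet their gradient payloads differ by a factor of two.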

Decide from runs, not format preference

The final cell brings the lesson back to experiment tracking. The numbers below are illustrative recorded outcomes, not benchmark claims. They show the review rule you should apply after running the same classifier, data fingerprint, seed policy, held-out supported_evidence_f1 evaluation, and target-GPU profile for every precision configuration. One extra BF16 run changes hardware on purpose so the comparison filter has something to reject.

precision-run-decision.py

@dataclass(frozen=True)
class PrecisionRun:
    run_id: str
    policy: str
    artifact: str
    data_fingerprint: str
    seed_policy: str
    eval_suite: str
    hardware_profile: str
    supported_evidence_f1: float
    nonfinite_steps: int
    peak_memory_gib: float
    examples_per_second: int
    evidence: str

def fixture_run(
    run_id: str,
    policy: str,
    supported_evidence_f1: float,
    nonfinite_steps: int,
    peak_memory_gib: float,
    examples_per_second: int,
    *,
    hardware_profile: str = contract.hardware_profile,
) -> PrecisionRun:
    return PrecisionRun(
        run_id=run_id,
        policy=policy,
        artifact=contract.artifact,
        data_fingerprint=contract.data_fingerprint,
        seed_policy=contract.seed_policy,
        eval_suite=contract.eval_suite,
        hardware_profile=hardware_profile,
        supported_evidence_f1=supported_evidence_f1,
        nonfinite_steps=nonfinite_steps,
        peak_memory_gib=peak_memory_gib,
        examples_per_second=examples_per_second,
        evidence="illustrative_fixture",
    )

runs = (
    fixture_run("run_fp32", "fp32", 0.93, 0, 4.47, 800),
    fixture_run("run_fp16_plain", "fp16_unscaled", 0.88, 3, 2.98, 1240),
    fixture_run("run_fp16_scaled", "fp16_scaled", 0.93, 0, 2.98, 1190),
    fixture_run("run_bf16", "bf16", 0.93, 0, 2.98, 1310),
    fixture_run("run_bf16_other_hardware", "bf16", 0.93, 0, 2.98, 1770, hardware_profile="h100-80gb-single-gpu"),
)

def comparable_to_contract(run: PrecisionRun) -> bool:
    return (
        run.artifact == contract.artifact
        and run.data_fingerprint == contract.data_fingerprint
        and run.seed_policy == contract.seed_policy
        and run.eval_suite == contract.eval_suite
        and run.hardware_profile == contract.hardware_profile
    )

def passes_gates(run: PrecisionRun) -> bool:
    return (
        run.supported_evidence_f1 >= contract.required_supported_evidence_f1
        and run.nonfinite_steps <= contract.permitted_nonfinite_steps
    )

comparable_candidates = [run for run in runs if run.policy != contract.baseline and comparable_to_contract(run)]
gate_eligible = [run.run_id for run in comparable_candidates if passes_gates(run)]
rejected_gates = [run.run_id for run in comparable_candidates if not passes_gates(run)]
excluded_noncomparable = [run.run_id for run in runs if not comparable_to_contract(run)]

print(f"gate_eligible_runs={','.join(gate_eligible)}")
print(f"rejected_gate_runs={','.join(rejected_gates)}")
print(f"excluded_noncomparable_runs={','.join(excluded_noncomparable)}")
print("decision=BLOCKED_FIXTURE_ONLY_RUN_MEASURED_PROFILE")
print("next_metrics=examples_per_second,peak_memory,supported_evidence_f1,nonfinite_steps")

Output

gate_eligible_runs=run_fp16_scaled,run_bf16
rejected_gate_runs=run_fp16_plain
excluded_noncomparable_runs=run_bf16_other_hardware
decision=BLOCKED_FIXTURE_ONLY_RUN_MEASURED_PROFILE
next_metrics=examples_per_second,peak_memory,supported_evidence_f1,nonfinite_steps

The right result isn't "BF16 wins because it's modern." Both scaled FP16 and BF16 pass this small fixture, and both require real measurements under the declared contract. The faster run_bf16_other_hardware value can't rank against them because its accelerator changed. BF16 is often simpler to operate because it commonly avoids loss scaling, but only a comparable controlled run can justify promotion.
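Once measured runs replace the fixtures, ranking is a comparison against the baseline on the same hardware. The following is a minimal sketch of that last step. It reuses the illustrative fixture throughputs from the table of runs above, so the ratios it prints are not benchmark results, and `speedup_over_baseline` is a hypothetical helper.

```python
def speedup_over_baseline(throughput: dict[str, int], baseline: str = "fp32") -> dict[str, float]:
    """Express each candidate's examples per second as a multiple of the baseline's."""
    base = throughput[baseline]
    return {
        policy: rate / base
        for policy, rate in throughput.items()
        if policy != baseline
    }


# Gate-eligible, comparable runs only (illustrative fixture values, same hardware profile).
fixture_throughput = {"fp32": 800, "fp16_scaled": 1190, "bf16": 1310}

speedups = speedup_over_baseline(fixture_throughput)
ranked = sorted(speedups.items(), key=lambda item: item[1], reverse=True)
for policy, ratio in ranked:
    print(f"{policy}: speedup_vs_fp32={ratio:.2f}x")
```

Runs that failed a gate or changed hardware never enter the dictionary, so they cannot win on speed alone.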

FP8 is a later optimization, not a default answer

FP8 halves the bytes per value again, but its even narrower range and resolution require managed scaling recipes. FP8 isn't one layout: the FP8 formats paper specifies complementary E4M3 and E5M2 encodings for deep-learning workloads.[7] NVIDIA Transformer Engine 2.16.0 documents a hybrid recipe that uses E4M3 during the forward pass and E5M2 during the backward pass, plus delayed, current, and block-scaling recipes for supported accelerators.[8]

That's enough orientation here. Don't add FP8 to a training proposal until BF16 or scaled FP16 is measured, quality checks exist, and the team can operate the scaling policy. Precision work should reduce measured cost without creating unexplained convergence risk.

Mastery check

Mastery outcomes

Skill	Evidence you can produce
Explain range versus resolution	Read `torch.finfo` and predict whether a tiny magnitude, large value, or tiny update will survive each format
Debug FP16 training	Explain loss scaling, place unscaling before clipping, and reject non-finite steps
Budget memory honestly	Separate activation savings from FP32 parameters, gradients, and optimizer state
Review distributed precision	Record compute, update-storage, and reduction dtypes separately
Compare candidate runs	Hold artifact, data, seeds, eval suite, and hardware fixed before ranking memory, throughput, or held-out quality

Evaluation rubric

Foundational: Reads torch.finfo results and explains why FP16 and BF16 fail in different ways.
Foundational: Shows why a small update to an FP16 or BF16 stored parameter can disappear while FP32 preserves it.
Intermediate: Explains loss scaling without claiming it changes the true optimization objective.
Intermediate: Reads the AMP CUDA shape and places unscaling before gradient clipping.
Intermediate: Computes a memory budget without promising that mixed precision halves total training memory.
Advanced: Records compute, update-storage, and reduce dtypes separately in a distributed experiment.
Advanced: Refuses to promote BF16 or scaled FP16 until comparable target-hardware measurements satisfy the declared held-out metric and numerical-stability gates.

Common pitfalls

BF16 is mistaken for an FP32 optimizer replacement

Symptom: A BF16-only parameter update stops improving loss even though gradients are finite.
Cause: Wide range was confused with fine resolution near current weights.
Fix: Keep FP32 update state under ordinary AMP and log the storage policy.

FP16 silently loses gradients

Symptom: Training appears stable but supported_evidence_f1 lags the FP32 baseline.
Cause: Small unscaled FP16 gradients underflow to zero.
Fix: Use GradScaler for FP16, track non-finite or skipped steps, and compare the declared held-out metric against the same baseline.

Gradient clipping sees scaled values

Symptom: Clipping behaves erratically or training diverges under FP16 AMP.
Cause: The run clips gradients before scaler.unscale_(optimizer).
Fix: Unscale first, then clip, then let the scaler perform or skip the optimizer step.

Memory claims omit optimizer state

Symptom: "Half-memory" planning fails when the job is scheduled.
Cause: Only activation dtype changed while FP32 parameters and Adam moments remain large.
Fix: Log a component-level memory profile or accounting budget, not a dtype slogan.

Distributed bandwidth remains high

Symptom: BF16 compute is enabled, but cross-worker traffic is still a bottleneck.
Cause: Reduction payloads remain FP32.
Fix: Inspect and record reduce_dtype separately, then measure held-out metric and communication changes before promotion.

Next Step

Continue to Distributed Training: FSDP & ZeRO

You can now choose FP16, BF16, and update-state precision from measured stability and memory evidence. Distributed training keeps those dtype decisions explicit while model states, gradients, and communication are split across workers.

References

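[1] PyTorch Contributors. Automatic Mixed Precision package - torch.amp. 2026. https://docs.pytorch.org/docs/stable/amp.html

[2] Micikevicius, P., et al. Mixed Precision Training. 2018. https://arxiv.org/abs/1710.03740

[3] Kalamkar, D., et al. A Study of BFLOAT16 for Deep Learning Training. 2019. https://arxiv.org/abs/1905.12322

[4] Rajbhandari, S., et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. https://arxiv.org/abs/1910.02054

[5] Zhao, Y., et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. VLDB 2023. https://arxiv.org/abs/2304.11277

[6] PyTorch Contributors. FullyShardedDataParallel. 2026. https://docs.pytorch.org/docs/stable/fsdp.html

[7] Micikevicius, P., et al. FP8 Formats for Deep Learning. 2022. https://arxiv.org/abs/2209.05433

[8] NVIDIA. Using FP8 and FP4 with Transformer Engine. 2026. https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html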

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Back to Topics

LearnAdvanced Training & AdaptationMixed Precision Training

⚡HardFine-Tuning & Training

Mixed Precision Training

Measure how FP16 and BF16 affect training range, update precision, memory, and release evidence before enabling faster low-precision compute.

20 min read

Learning path

Step 102 of 158 in the full curriculum

Supervised Fine-Tuning Pipeline Distributed Training: FSDP & ZeRO

A training-speed change is an experiment, not an automatic improvement. A faster run isn't better if it corrupts updates, overflows, or lowers the declared held-out metric.

Make precision an explicit experiment parameter

Start as you would in an experiment tracker: state the artifact under test and its guardrails before measuring candidates.

precision-contract.py

from dataclasses import dataclass
import torch

@dataclass(frozen=True)
class PrecisionContract:
    experiment: str
    artifact: str
    baseline: str
    candidates: tuple[str, ...]
    data_fingerprint: str
    seed_policy: str
    eval_suite: str
    hardware_profile: str
    required_supported_evidence_f1: float
    permitted_nonfinite_steps: int

contract = PrecisionContract(
    experiment="trace-evidence-encoder-precision",
    artifact="trace-evidence-classifier-v2",
    baseline="fp32",
    candidates=("fp16_unscaled", "fp16_scaled", "bf16"),
    data_fingerprint="trace-events@sha256:fixture-7",
    seed_policy="seeds=11,17,23",
    eval_suite="supported-evidence@sha256:suite-4",
    hardware_profile="a100-80gb-single-gpu",
    required_supported_evidence_f1=0.92,
    permitted_nonfinite_steps=0,
)

print(f"experiment={contract.experiment}")
print(f"artifact={contract.artifact}")
print(f"baseline={contract.baseline}")
print(f"candidates={','.join(contract.candidates)}")
print(f"data_fingerprint={contract.data_fingerprint}")
print(f"seed_policy={contract.seed_policy}")
print(f"eval_suite={contract.eval_suite}")
print(f"hardware_profile={contract.hardware_profile}")
print(f"metric_gate=supported_evidence_f1>={contract.required_supported_evidence_f1:.2f}")
print(f"nonfinite_steps_gate={contract.permitted_nonfinite_steps}")

Output

experiment=trace-evidence-encoder-precision
artifact=trace-evidence-classifier-v2
baseline=fp32
candidates=fp16_unscaled,fp16_scaled,bf16
data_fingerprint=trace-events@sha256:fixture-7
seed_policy=seeds=11,17,23
eval_suite=supported-evidence@sha256:suite-4
hardware_profile=a100-80gb-single-gpu
metric_gate=supported_evidence_f1>=0.92
nonfinite_steps_gate=0

The job isn't to declare BF16 good or FP16 bad in the abstract. It's to understand what each format can lose, then measure the acceptable candidates under the same validation and hardware conditions.

A floating-point number has two limits

Format	Bits	Exponent bits	Fraction bits	Main training consequence
FP32	32	8	23	Wide range and fine update resolution, with higher storage cost
FP16	16	5	10	Compact, but small gradients can underflow and large values can overflow
BF16	16	8	7	Compact with FP32-like range, but coarser nearby resolution

PyTorch exposes the exact format limits with torch.finfo. Run this on CPU; no accelerator is required to inspect the number system.

format-limits.py

formats = (
    ("FP32", torch.float32),
    ("FP16", torch.float16),
    ("BF16", torch.bfloat16),
)

print("format  epsilon_at_1  smallest_normal  largest_finite")
for label, dtype in formats:
    info = torch.finfo(dtype)
    print(f"{label:<6}  {info.eps:>12.1e}  {info.tiny:>15.1e}  {info.max:>14.1e}")

print(f"bf16_min_normal_matches_fp32={torch.finfo(torch.bfloat16).tiny == torch.finfo(torch.float32).tiny}")
print(f"bf16_resolution_coarser_than_fp16={torch.finfo(torch.bfloat16).eps > torch.finfo(torch.float16).eps}")

Output

format  epsilon_at_1  smallest_normal  largest_finite
FP32         1.2e-07          1.2e-38         3.4e+38
FP16         9.8e-04          6.1e-05         6.6e+04
BF16         7.8e-03          1.2e-38         3.4e+38
bf16_min_normal_matches_fp32=True
bf16_resolution_coarser_than_fp16=True
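
The printed limits follow directly from the exponent and fraction widths in the table. The sketch below recomputes them in plain Python, assuming the usual IEEE-style layout: an exponent bias of 2^(e-1) - 1, with the top exponent code reserved for infinities and NaN. The helper name `limits_from_bits` is illustrative, not a PyTorch API.

```python
def limits_from_bits(exponent_bits: int, fraction_bits: int) -> tuple[float, float, float]:
    """Return (epsilon_at_1, smallest_normal, largest_finite) for an IEEE-style layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Gap between 1.0 and the next representable value above it.
    epsilon_at_1 = 2.0 ** -fraction_bits
    # Lowest exponent that still carries an implicit leading 1.
    smallest_normal = 2.0 ** (1 - bias)
    # All-ones fraction at the highest finite exponent.
    largest_finite = (2.0 - epsilon_at_1) * 2.0 ** bias
    return epsilon_at_1, smallest_normal, largest_finite


# (exponent bits, fraction bits) copied from the format table.
LAYOUTS = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}

for label, (exponent_bits, fraction_bits) in LAYOUTS.items():
    eps, tiny, largest = limits_from_bits(exponent_bits, fraction_bits)
    print(f"{label:<6}  {eps:>12.1e}  {tiny:>15.1e}  {largest:>14.1e}")
```

Reading the formulas side by side makes the trade explicit: BF16 spends its bits on the exponent, which sets the smallest and largest magnitudes, while FP16 spends them on the fraction, which sets epsilon.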

Tiny updates need an FP32 home

tiny-parameter-update.py

weight = torch.tensor([1.0], dtype=torch.float32)
update = torch.tensor([1.0e-4], dtype=torch.float32)

for label, dtype in formats:
    before = weight.to(dtype)
    after = before - update.to(dtype)
    changed = bool(after.item() != before.item())
    print(f"{label}: stored_after_step={after.item():.8f}, update_survived={changed}")

print("lesson=BF16 protects range; FP32 protects small accumulated updates")

Output

FP32: stored_after_step=0.99989998, update_survived=True
FP16: stored_after_step=1.00000000, update_survived=False
BF16: stored_after_step=1.00000000, update_survived=False
lesson=BF16 protects range; FP32 protects small accumulated updates
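
The fraction width alone explains why the 1e-4 update vanishes in both 16-bit formats. Just below 1.0, neighboring values are 2^-(f+1) apart for f fraction bits, so under round-to-nearest a subtracted update must exceed half that gap to move the stored weight. This is a minimal sketch with an illustrative helper name, and it ignores the small rounding of the update itself:

```python
FRACTION_BITS = {"FP32": 23, "FP16": 10, "BF16": 7}


def min_surviving_decrement(fraction_bits: int) -> float:
    """Approximate size a subtracted update must exceed to change a stored 1.0."""
    gap_below_one = 2.0 ** -(fraction_bits + 1)  # spacing of values in [0.5, 1.0)
    # Round-to-nearest returns 1.0 for any decrement at or below half a gap.
    return gap_below_one / 2


update = 1.0e-4
for label, bits in FRACTION_BITS.items():
    threshold = min_surviving_decrement(bits)
    print(f"{label}: threshold={threshold:.2e}, update_survives={update > threshold}")
```

The threshold scales with the weight's magnitude, so the same update can survive on a small weight and disappear on a large one.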

Range decides whether gradients exist at all

Resolution is one issue. Range is another. Compare an extremely small gradient and a very large activation-like value when stored in FP16 and BF16.

gradient-range.py

values = torch.tensor([1.2e-8, 1.0e5], dtype=torch.float32)

for label, dtype in (("FP16", torch.float16), ("BF16", torch.bfloat16)):
    cast = values.to(dtype)
    print(
        f"{label}: small={cast[0].item():.2e}, "
        f"large={cast[1].item():.2e}, "
        f"all_finite={bool(torch.isfinite(cast).all())}"
    )

print("fp16_loses_small_and_large=True")
print("bf16_keeps_range_in_this_example=True")

Output

FP16: small=0.00e+00, large=inf, all_finite=False
BF16: small=1.20e-08, large=9.98e+04, all_finite=True
fp16_loses_small_and_large=True
bf16_keeps_range_in_this_example=True
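
The same check can be turned into a small audit that counts how many values in a sample would be lost to FP16 range limits. The sketch below uses only the standard library's half-precision codec in `struct`, so it runs without PyTorch. The function names and the sample values are illustrative assumptions.

```python
import math
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest FP16 value via struct's "e" format."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except (OverflowError, struct.error):
        # struct raises where a tensor cast would produce infinity.
        return math.copysign(math.inf, value)


def fp16_range_audit(values: list[float]) -> dict[str, int]:
    """Count values that FP16 would flush to zero, overflow, or keep."""
    counts = {"flushed_to_zero": 0, "overflowed": 0, "kept": 0}
    for value in values:
        cast = to_fp16(value)
        if math.isinf(cast):
            counts["overflowed"] += 1
        elif cast == 0.0 and value != 0.0:
            counts["flushed_to_zero"] += 1
        else:
            counts["kept"] += 1
    return counts


print(fp16_range_audit([1.2e-8, 3.0e-6, 2.0e-2, 1.0e5]))
```

In a real run the interesting number is the flushed fraction of actual gradient values, which only a measured profile can supply.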

FP16 uses loss scaling to rescue small gradients

For a true gradient of $1.2 \times 10^{-8}$:

Operation	Value	FP16 outcome
Cast unscaled gradient	$1.2 \times 10^{-8}$	Rounds to zero
Multiply by $S=1024$ during backward	$1.23 \times 10^{-5}$	Representable as an FP16 subnormal
Convert to FP32 and divide by $S$	approximately $1.2 \times 10^{-8}$	Ready for FP32 update

loss-scaling-rescue.py

true_grad = torch.tensor([1.2e-8], dtype=torch.float32)
scale = 1024.0

plain_fp16 = true_grad.to(torch.float16)
scaled_fp16 = (true_grad * scale).to(torch.float16)
recovered_fp32 = scaled_fp16.to(torch.float32) / scale

print(f"plain_underflowed={plain_fp16.item() == 0.0}")
print(f"scaled_visible={scaled_fp16.item() > 0.0}")
print(f"recovered_grad={recovered_fp32.item():.2e}")
print(f"recovery_relative_error={abs(recovered_fp32.item() - true_grad.item()) / true_grad.item():.3%}")

Output

plain_underflowed=True
scaled_visible=True
recovered_grad=1.20e-08
recovery_relative_error=0.077%
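
Loss scaling leaves the optimization objective unchanged because the factor cancels before the update is applied. Writing $L$ for the loss, $w$ for a weight, and $S$ for the scale (symbols chosen here for illustration), the chain rule gives:

$$
\frac{\partial (S \cdot L)}{\partial w} = S \cdot \frac{\partial L}{\partial w},
\qquad
\frac{1}{S} \cdot \frac{\partial (S \cdot L)}{\partial w} = \frac{\partial L}{\partial w}.
$$

With $S = 1024$ and $\partial L / \partial w = 1.2 \times 10^{-8}$, the value that passes through FP16 is about $1.23 \times 10^{-5}$, and the division by $S$ happens after conversion to FP32. Up to the rounding error measured above, the optimizer sees the same gradient it would have seen without scaling.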

Scaling too far causes overflow

overflow-backoff.py

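def scaled_step_status(gradients: torch.Tensor, scale: float) -> tuple[str, float]:
    """Return the step decision and the scale to use on the next step."""
    scaled = (gradients * scale).to(torch.float16)
    # Any inf or NaN after scaling means the step is skipped and the scale is halved.
    if not bool(torch.isfinite(scaled).all()):
        return "SKIP_OVERFLOW", scale / 2
    return "APPLY_DESCALED_UPDATE", scale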

quiet_step = torch.tensor([1.2e-8, 2.0e-2], dtype=torch.float32)
spiky_step = torch.tensor([1.2e-8, 1.0e2], dtype=torch.float32)

quiet_status, quiet_next_scale = scaled_step_status(quiet_step, 1024.0)
spiky_status, spiky_next_scale = scaled_step_status(spiky_step, 1024.0)

print(f"quiet_step={quiet_status}, next_scale={quiet_next_scale:.0f}")
print(f"spiky_step={spiky_status}, next_scale={spiky_next_scale:.0f}")
print("invariant=never_apply_nonfinite_gradients")

Output

quiet_step=APPLY_DESCALED_UPDATE, next_scale=1024
spiky_step=SKIP_OVERFLOW, next_scale=512
invariant=never_apply_nonfinite_gradients
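
Backing off is only half of a dynamic policy. A scale that can only shrink would eventually stop rescuing small gradients, so dynamic scalers also grow the scale after a run of clean steps. The toy state machine below sketches that idea in plain Python. The class is hypothetical; its `growth_factor`, `backoff_factor`, and `growth_interval` fields echo the parameter names of `torch.amp.GradScaler`, with a deliberately tiny interval so the behavior is visible in a few steps.

```python
from dataclasses import dataclass


@dataclass
class ToyLossScaler:
    """Illustrative dynamic loss scale; not the PyTorch implementation."""

    scale: float = 1024.0
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 3  # clean steps required before growing (tiny for the demo)
    clean_steps: int = 0

    def update(self, found_nonfinite: bool) -> bool:
        """Adjust the scale and return True if the optimizer step should be applied."""
        if found_nonfinite:
            self.scale *= self.backoff_factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self.clean_steps = 0
        return True


demo = ToyLossScaler()
for step_index, overflowed in enumerate([False, False, False, True, False]):
    applied = demo.update(overflowed)
    print(f"step={step_index}, applied={applied}, next_scale={demo.scale:.0f}")
```

Skipped steps are expected occasionally under this policy. A run that skips often is a signal to inspect the gradients, which is why the contract gates on non-finite steps.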

Compute low, update high

Loss scaling protects FP16 gradients from range failure. It doesn't make 16-bit parameter storage appropriate for tiny updates. Preserve the FP32 update path separately.

fp32-update-path.py

step = torch.tensor([1.0e-4], dtype=torch.float32)
fp16_parameter = torch.tensor([1.0], dtype=torch.float16)
fp32_parameter = torch.tensor([1.0], dtype=torch.float32)

fp16_after = fp16_parameter - step.to(torch.float16)
fp32_after = fp32_parameter - step

print(f"fp16_parameter_changed={fp16_after.item() != fp16_parameter.item()}")
print(f"fp32_parameter_changed={fp32_after.item() != fp32_parameter.item()}")
print(f"fp32_after={fp32_after.item():.8f}")
print("policy=low_precision_compute_with_fp32_update_state")

Output

fp16_parameter_changed=False
fp32_parameter_changed=True
fp32_after=0.99989998
policy=low_precision_compute_with_fp32_update_state
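
A single step hides how much is at stake, so the sketch below repeats the same update 100 times. It emulates FP16 and FP32 rounding with the standard library's `struct` codec instead of tensors. The helper name and the step count are assumptions made for illustration.

```python
import struct


def round_to(fmt: str, value: float) -> float:
    """Round a Python float to FP16 ("e") or FP32 ("f") via the struct codec."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, value))[0]


step = round_to("f", 1.0e-4)
fp16_only = 1.0    # weight stored and updated in FP16
fp32_master = 1.0  # weight stored and updated in FP32

for _ in range(100):
    # Each FP16 result is rounded back to FP16, so the small step never lands.
    fp16_only = round_to("e", fp16_only - round_to("e", step))
    fp32_master = round_to("f", fp32_master - step)

# A low-precision forward pass would read a rounded copy of the FP32 master weight.
fp16_compute_copy = round_to("e", fp32_master)

print(f"fp16_only_after_100_steps={fp16_only:.6f}")
print(f"fp32_master_after_100_steps={fp32_master:.6f}")
print(f"fp16_copy_of_master={fp16_compute_copy:.6f}")
```

The FP16 copy of the master weight does move, because the accumulated change is large enough to cross an FP16 gap. What 16-bit storage cannot preserve is the individual small steps.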

This is the CUDA shape you would use for a real fine-tuning run. It isn't marked executable here because it needs an accelerator and a model workload:

cuda-amp-training-shape.py

dtype = torch.bfloat16  # compare against torch.float16 in a controlled run
use_scaler = dtype == torch.float16  # loss scaling addresses FP16 range; BF16 has FP32-like range
scaler = torch.amp.GradScaler("cuda", enabled=use_scaler)

for batch, target in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # Assumes the model was created in default precision; autocast picks per-operation dtypes here.
    with torch.autocast(device_type="cuda", dtype=dtype):
        logits = model(batch.cuda())
        loss = criterion(logits, target.cuda())

    if scaler.is_enabled():
        scaler.scale(loss).backward()
        # Unscale before clipping so max_norm applies to true gradient magnitudes.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # step() skips the update when gradients are non-finite; update() then adjusts the scale.
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
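
The order of unscaling and clipping in the FP16 branch matters numerically. The sketch below works through it in plain Python with hypothetical gradient values. The clipping helper follows the usual global-norm rule of multiplying by min(1, max_norm / (norm + 1e-6)) and is a stand-in written for this example, not the PyTorch function.

```python
import math


def global_norm(grads: list[float]) -> float:
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Scale all gradients down together if their global norm exceeds max_norm."""
    coefficient = min(1.0, max_norm / (global_norm(grads) + 1e-6))
    return [g * coefficient for g in grads]


scale = 1024.0
true_grads = [0.3, 0.4]  # hypothetical gradients with norm 0.5, below max_norm
scaled_grads = [g * scale for g in true_grads]

# Correct order: unscale first, then clip. The norm is under 1.0, so nothing changes.
unscale_then_clip = clip_by_global_norm([g / scale for g in scaled_grads], max_norm=1.0)

# Wrong order: clipping sees a norm of about 512 and shrinks the gradients before unscaling.
clip_then_unscale = [g / scale for g in clip_by_global_norm(scaled_grads, max_norm=1.0)]

print(f"unscale_then_clip_norm={global_norm(unscale_then_clip):.6f}")
print(f"clip_then_unscale_norm={global_norm(clip_then_unscale):.6f}")
```

In this example the wrong order shrinks the effective gradient by roughly the scale-dependent factor of 512, and that factor changes whenever the scaler changes its scale.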

Memory savings need accounting, not slogans

memory-budget.py

def gib(values: int, bytes_per_value: int) -> float:
    return values * bytes_per_value / (1024 ** 3)

parameter_values = 100_000_000
activation_values = 800_000_000

fp32_budget = {
    "parameters": gib(parameter_values, 4),
    "gradients": gib(parameter_values, 4),
    "adam_moments": gib(parameter_values * 2, 4),
    "activations": gib(activation_values, 4),
}
amp_budget = {
    **{name: value for name, value in fp32_budget.items() if name != "activations"},
    "activations": gib(activation_values, 2),
}

print(f"fp32_total_gib={sum(fp32_budget.values()):.2f}")
print(f"amp_total_gib={sum(amp_budget.values()):.2f}")
print(f"activation_saving_gib={fp32_budget['activations'] - amp_budget['activations']:.2f}")
print(f"total_reduction={(1 - sum(amp_budget.values()) / sum(fp32_budget.values())):.1%}")
print("lesson=half_size_activations_do_not_imply_half_total_memory")

Output

fp32_total_gib=4.47
amp_total_gib=2.98
activation_saving_gib=1.49
total_reduction=33.3%
lesson=half_size_activations_do_not_imply_half_total_memory
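
The 33.3% figure depends on the ratio of activations to fixed FP32 state. The sketch below generalizes the same accounting, assuming as above that only activations shrink to 2 bytes while weights, gradients, and both Adam moments stay FP32. The activation counts other than 800 million are hypothetical.

```python
BYTES_FP32 = 4
BYTES_16BIT = 2
# FP32 weight + FP32 gradient + two FP32 Adam moments per parameter.
FIXED_BYTES_PER_PARAMETER = 4 + 4 + 8


def amp_total_reduction(parameter_values: int, activation_values: int) -> float:
    """Fraction of total training memory saved when only activations become 16-bit."""
    fixed = parameter_values * FIXED_BYTES_PER_PARAMETER
    fp32_total = fixed + activation_values * BYTES_FP32
    amp_total = fixed + activation_values * BYTES_16BIT
    return 1 - amp_total / fp32_total


for activation_values in (100_000_000, 800_000_000, 3_200_000_000):
    reduction = amp_total_reduction(100_000_000, activation_values)
    print(f"activation_values={activation_values:>13,}  total_reduction={reduction:.1%}")
```

Under these assumptions the saving approaches 50% only as activations dominate, and it is small for short sequences or small batches where optimizer state is most of the footprint.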

Distributed jobs add a communication dtype

communication-budget.py

gradient_values = 100_000_000
fp32_reduce_gib = gib(gradient_values, 4)
bf16_reduce_gib = gib(gradient_values, 2)

print(f"gradient_payload_fp32_gib={fp32_reduce_gib:.2f}")
print(f"gradient_payload_bf16_gib={bf16_reduce_gib:.2f}")
print(f"payload_reduction={(1 - bf16_reduce_gib / fp32_reduce_gib):.0%}")
print("warning=compute_dtype_does_not_prove_reduce_dtype")

Output

gradient_payload_fp32_gib=0.37
gradient_payload_bf16_gib=0.19
payload_reduction=50%
warning=compute_dtype_does_not_prove_reduce_dtype
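
One way to act on that warning is to record the three dtypes as separate fields and to treat a missing reduction dtype as a finding instead of inferring it from the compute dtype. The record type, its field names, and the finding labels below are hypothetical. A real job would populate them from its distributed wrapper's configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrecisionPolicyRecord:
    """Hypothetical log record that keeps the three dtypes separate."""

    compute_dtype: str
    update_storage_dtype: str
    reduce_dtype: Optional[str]  # None means "not recorded", never "same as compute"


def review_policy(record: PrecisionPolicyRecord) -> list[str]:
    """Return review findings; an empty list means nothing was flagged."""
    findings = []
    if record.reduce_dtype is None:
        findings.append("reduce_dtype_unrecorded")
    if record.update_storage_dtype != "fp32":
        findings.append("update_storage_not_fp32")
    return findings


print(review_policy(PrecisionPolicyRecord("bf16", "fp32", None)))
print(review_policy(PrecisionPolicyRecord("bf16", "fp32", "fp32")))
```

An empty result does not say the policy is good. It only says the record is complete enough to compare against measured communication volume.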

Decide from runs, not format preference

precision-run-decision.py

@dataclass(frozen=True)
class PrecisionRun:
    run_id: str
    policy: str
    artifact: str
    data_fingerprint: str
    seed_policy: str
    eval_suite: str
    hardware_profile: str
    supported_evidence_f1: float
    nonfinite_steps: int
    peak_memory_gib: float
    examples_per_second: int
    evidence: str

def fixture_run(
    run_id: str,
    policy: str,
    supported_evidence_f1: float,
    nonfinite_steps: int,
    peak_memory_gib: float,
    examples_per_second: int,
    *,
    hardware_profile: str = contract.hardware_profile,
) -> PrecisionRun:
    return PrecisionRun(
        run_id=run_id,
        policy=policy,
        artifact=contract.artifact,
        data_fingerprint=contract.data_fingerprint,
        seed_policy=contract.seed_policy,
        eval_suite=contract.eval_suite,
        hardware_profile=hardware_profile,
        supported_evidence_f1=supported_evidence_f1,
        nonfinite_steps=nonfinite_steps,
        peak_memory_gib=peak_memory_gib,
        examples_per_second=examples_per_second,
        evidence="illustrative_fixture",
    )

runs = (
    fixture_run("run_fp32", "fp32", 0.93, 0, 4.47, 800),
    fixture_run("run_fp16_plain", "fp16_unscaled", 0.88, 3, 2.98, 1240),
    fixture_run("run_fp16_scaled", "fp16_scaled", 0.93, 0, 2.98, 1190),
    fixture_run("run_bf16", "bf16", 0.93, 0, 2.98, 1310),
    fixture_run("run_bf16_other_hardware", "bf16", 0.93, 0, 2.98, 1770, hardware_profile="h100-80gb-single-gpu"),
)

def comparable_to_contract(run: PrecisionRun) -> bool:
    return (
        run.artifact == contract.artifact
        and run.data_fingerprint == contract.data_fingerprint
        and run.seed_policy == contract.seed_policy
        and run.eval_suite == contract.eval_suite
        and run.hardware_profile == contract.hardware_profile
    )

def passes_gates(run: PrecisionRun) -> bool:
    return (
        run.supported_evidence_f1 >= contract.required_supported_evidence_f1
        and run.nonfinite_steps <= contract.permitted_nonfinite_steps
    )

comparable_candidates = [run for run in runs if run.policy != contract.baseline and comparable_to_contract(run)]
gate_eligible = [run.run_id for run in comparable_candidates if passes_gates(run)]
rejected_gates = [run.run_id for run in comparable_candidates if not passes_gates(run)]
excluded_noncomparable = [run.run_id for run in runs if not comparable_to_contract(run)]

print(f"gate_eligible_runs={','.join(gate_eligible)}")
print(f"rejected_gate_runs={','.join(rejected_gates)}")
print(f"excluded_noncomparable_runs={','.join(excluded_noncomparable)}")
print("decision=BLOCKED_FIXTURE_ONLY_RUN_MEASURED_PROFILE")
print("next_metrics=examples_per_second,peak_memory,supported_evidence_f1,nonfinite_steps")

Output

gate_eligible_runs=run_fp16_scaled,run_bf16
rejected_gate_runs=run_fp16_plain
excluded_noncomparable_runs=run_bf16_other_hardware
decision=BLOCKED_FIXTURE_ONLY_RUN_MEASURED_PROFILE
next_metrics=examples_per_second,peak_memory,supported_evidence_f1,nonfinite_steps
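
The contract lists three seeds, but each fixture carries a single F1 value. When measured runs replace the fixtures, one option is to apply the metric gate to the worst seed, not the mean. That choice is an assumption of this sketch, not something the contract states, and the per-seed scores below are placeholders, not measurements.

```python
REQUIRED_F1 = 0.92  # same threshold as the contract's metric gate

# Hypothetical per-seed scores keyed by seed; placeholders only.
per_seed_f1 = {
    "run_fp16_scaled": {11: 0.93, 17: 0.94, 23: 0.92},
    "run_bf16": {11: 0.94, 17: 0.91, 23: 0.94},
}


def passes_on_every_seed(scores: dict[int, float], required: float) -> bool:
    """Gate on the weakest seed so one bad run cannot hide behind the mean."""
    return min(scores.values()) >= required


for run_id, scores in per_seed_f1.items():
    mean_f1 = sum(scores.values()) / len(scores)
    print(
        f"{run_id}: mean_f1={mean_f1:.3f}, "
        f"worst_seed_f1={min(scores.values()):.2f}, "
        f"passes_every_seed={passes_on_every_seed(scores, REQUIRED_F1)}"
    )
```

In this made-up data both runs have a mean at or above the gate, yet one of them fails on a single seed, which is the kind of instability a mean-only report would miss.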

FP8 is a later optimization, not a default answer

Mastery check

Mastery outcomes

Skill	Evidence you can produce
Explain range versus resolution	Read `torch.finfo` and predict whether a tiny magnitude, large value, or tiny update will survive each format
Debug FP16 training	Explain loss scaling, place unscaling before clipping, and reject non-finite steps
Budget memory honestly	Separate activation savings from FP32 parameters, gradients, and optimizer state
Review distributed precision	Record compute, update-storage, and reduction dtypes separately
Compare candidate runs	Hold artifact, data, seeds, eval suite, and hardware fixed before ranking memory, throughput, or held-out quality

Evaluation rubric

Foundational: Reads torch.finfo results and explains why FP16 and BF16 fail in different ways.
Foundational: Shows why a small update to an FP16 or BF16 stored parameter can disappear while FP32 preserves it.
Intermediate: Explains loss scaling without claiming it changes the true optimization objective.
Intermediate: Reads the AMP CUDA shape and places unscaling before gradient clipping.
Intermediate: Computes a memory budget without promising that mixed precision halves total training memory.
Advanced: Records compute, update-storage, and reduce dtypes separately in a distributed experiment.
Advanced: Refuses to promote BF16 or scaled FP16 until comparable target-hardware measurements satisfy the declared held-out metric and numerical-stability gates.

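Follow-up questions

BF16 represents the tiny gradient in gradient-range.py; why did it still lose the parameter update in tiny-parameter-update.py?

Why can't an FP16 run choose a huge loss scale once and keep it forever?

Your BF16 run uses less activation memory but total memory barely moves. What should you inspect?

Both fp16_scaled and bf16 pass the fixture. Why is the decision still blocked?

Why can't run_bf16_other_hardware win based on its higher examples_per_second value?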

Common pitfalls

BF16 is mistaken for an FP32 optimizer replacement

Symptom: A BF16-only parameter update stops improving loss even though gradients are finite.
Cause: Wide range was confused with fine resolution near current weights.
Fix: Keep FP32 update state under ordinary AMP and log the storage policy.

FP16 silently loses gradients

Symptom: Training appears stable but supported_evidence_f1 lags the FP32 baseline.
Cause: Small unscaled FP16 gradients underflow to zero.
Fix: Use GradScaler for FP16, track non-finite or skipped steps, and compare the declared held-out metric against the same baseline.

Gradient clipping sees scaled values

Symptom: Clipping behaves erratically or training diverges under FP16 AMP.
Cause: The run clips gradients before scaler.unscale_(optimizer).
Fix: Unscale first, then clip, then let the scaler perform or skip the optimizer step.

Memory claims omit optimizer state

Symptom: "Half-memory" planning fails when the job is scheduled.
Cause: Only activation dtype changed while FP32 parameters and Adam moments remain large.
Fix: Log a component-level memory profile or accounting budget, not a dtype slogan.

Distributed bandwidth remains high

Symptom: BF16 compute is enabled, but cross-worker traffic is still a bottleneck.
Cause: Reduction payloads remain FP32.
Fix: Inspect and record reduce_dtype separately, then measure held-out metric and communication changes before promotion.

Next Step

Continue to Distributed Training: FSDP & ZeRO

References

PyTorch Contributors (2026). Automatic Mixed Precision package - torch.amp.

Micikevicius, P., et al. (2018). Mixed Precision Training.

Kalamkar, D., et al. (2019). A Study of BFLOAT16 for Deep Learning Training.

Rajbhandari, S., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020.

Zhao, Y., et al. (2023). PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. VLDB 2023.

PyTorch Contributors (2026). FullyShardedDataParallel.

Micikevicius, P., et al. (2022). FP8 Formats for Deep Learning.

NVIDIA (2026). Using FP8 and FP4 with Transformer Engine.

Mixed Precision Training

Make precision an explicit experiment parameter

A floating-point number has two limits

If BF16 has FP32-like range, why don't we store model updates only in BF16?

Tiny updates need an FP32 home

Range decides whether gradients exist at all

FP16 uses loss scaling to rescue small gradients

Scaling too far causes overflow

Compute low, update high

Memory savings need accounting, not slogans

Distributed jobs add a communication dtype

Decide from runs, not format preference

FP8 is a later optimization, not a default answer

Mastery check

Mastery outcomes

Evaluation rubric

Follow-up questions

BF16 represents the tiny gradient in gradient-range.py; why did it still lose the parameter update in tiny-parameter-update.py?

Why can't an FP16 run choose a huge loss scale once and keep it forever?

Your BF16 run uses less activation memory but total memory barely moves. What should you inspect?

Both fp16_scaled and bf16 pass the fixture. Why is the decision still blocked?

Why can't run_bf16_other_hardware win based on its higher examples_per_second value?

Common pitfalls

BF16 is mistaken for an FP32 optimizer replacement

FP16 silently loses gradients

Gradient clipping sees scaled values

Memory claims omit optimizer state

Distributed bandwidth remains high

Mastery Check

Discussion
