Distributed Training: FSDP & ZeRO

Understand ZeRO stages, current FSDP1 vs FSDP2 guidance, and when native PyTorch or DeepSpeed is the right choice for large-model training.

Single-device SFT works until full-parameter training outgrows one accelerator. The limit is memory: one GPU may not hold the model states and activations that a training step needs. Fully Sharded Data Parallel (FSDP) and ZeRO (Zero Redundancy Optimizer) reduce repeated model-state storage across data-parallel ranks, at the price of making communication part of the training cost.

Start with the memory contract. One useful mixed-precision sizing convention, used in the ZeRO analysis, assumes low-precision parameters and gradients plus an FP32 master parameter copy and FP32 Adam moments.^{[1]Reference 1ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.https://arxiv.org/abs/1910.02054} Under that specific recipe, each parameter costs 16 bytes of model-state memory before activations, temporary buffers, or allocator overhead:

2 bytes for the BF16 (bfloat16, short for Brain Floating Point) / FP16 (16-bit half precision) weight
2 bytes for the BF16 / FP16 gradient
4 bytes for the FP32 (32-bit single precision) master copy of the weight
4 bytes for Adam's first moment (momentum)
4 bytes for Adam's second moment (variance)

For a 1B model, that's 16 GB of model states alone. Change the optimizer, parameter dtype, master-weight policy, or offload policy and the byte count changes. Always write down the recipe before using a bytes-per-parameter number.

Scale this recipe to 7B parameters and it reaches 112 GB of model states. At 70B parameters, it exceeds 1 TB before activations. If you need full-parameter training at that scale, replicated data parallelism alone won't make it fit.
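
The 16-byte figure belongs to one recipe only. A minimal sketch of the bookkeeping follows. The second recipe (one FP32 momentum buffer instead of two Adam moments) is an illustrative assumption for comparison, not something this lesson prescribes, and the dictionary and function names are hypothetical.

```python
# Bytes of model state per parameter, broken down by component.
MIXED_PRECISION_ADAM = {
    "low_precision_weight": 2,
    "low_precision_gradient": 2,
    "fp32_master_weight": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}

# Hypothetical alternative for comparison: a single FP32 momentum buffer.
MIXED_PRECISION_MOMENTUM_SGD = {
    "low_precision_weight": 2,
    "low_precision_gradient": 2,
    "fp32_master_weight": 4,
    "momentum": 4,
}


def bytes_per_parameter(recipe: dict) -> int:
    return sum(recipe.values())


def model_state_gb(n_params: float, recipe: dict) -> float:
    # Decimal GB (10**9 bytes); excludes activations, buffers, and allocator overhead.
    return n_params * bytes_per_parameter(recipe) / 10**9


for name, recipe in [("adam", MIXED_PRECISION_ADAM),
                     ("momentum_sgd", MIXED_PRECISION_MOMENTUM_SGD)]:
    for n_params in (1e9, 7e9, 70e9):
        print(f"{name} {n_params / 1e9:.0f}B: {model_state_gb(n_params, recipe):.0f} GB")
```

Keeping the recipe as data makes the assumption visible: changing one entry changes every downstream sizing number.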

The first figure tracks only model states. Activations and transient buffers stay outside the columns because sharding model states doesn't remove those costs.

Memory layout comparison where DDP replicates model states, ZeRO-2 shards gradients and optimizer state, and ZeRO-3 shards all states. — Compare the three columns. P, G, and O stand for parameters, gradients, and optimizer state. DDP puts a full copy of everything on every GPU. ZeRO-2 splits gradients and optimizer state. ZeRO-3 splits all three, so each GPU keeps only a thin shard.

The memory wall in numbers

For a 70B parameter model under the FP16/BF16 parameter-and-gradient plus FP32-master-and-Adam accounting convention, the model-state math looks like this:

Weights: $70 \times 10^9 \times 2 \text{ bytes} \approx$ 140 GB
Gradients: $70 \times 10^9 \times 2 \text{ bytes} \approx$ 140 GB
Adam state + FP32 master weights: $70 \times 10^9 \times (4 + 4 + 4) \text{ bytes} \approx$ 840 GB

That's about 1.12 TB of model states before activations, temporary buffers, or allocator overhead. It's an assumption-bound calculation, not a promise about every optimizer implementation.

The arithmetic is small enough to turn into a quick check. Use decimal GB so the computed numbers match the figures above.

the-memory-wall-in-numbers.py

GB = 10**9

def state_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    return params_billions * 1_000_000_000 * bytes_per_param / GB

recipe_bytes = {
    "low_precision_parameters": 2,
    "low_precision_gradients": 2,
    "fp32_master_and_adam": 12,
}
weights = state_memory_gb(70, recipe_bytes["low_precision_parameters"])
gradients = state_memory_gb(70, recipe_bytes["low_precision_gradients"])
optimizer_and_master = state_memory_gb(70, recipe_bytes["fp32_master_and_adam"])
total_model_states = weights + gradients + optimizer_and_master
zero3_per_gpu = total_model_states / 256

print("bytes_per_parameter=", sum(recipe_bytes.values()))
print(f"weights={weights:.0f} GB")
print(f"optimizer_and_master={optimizer_and_master:.0f} GB")
print(f"total_model_states={total_model_states:.0f} GB")
print(f"ZeRO-3 model states per GPU on 256 GPUs: {zero3_per_gpu:.2f} GB")

Model-state memory output

bytes_per_parameter= 16
weights=140 GB
optimizer_and_master=840 GB
total_model_states=1120 GB
ZeRO-3 model states per GPU on 256 GPUs: 4.38 GB

70B model-state memory tally under a 16-byte mixed-precision Adam assumption, compared with one 80 GB GPU and ZeRO-3 sharding. — Under this mixed-precision Adam accounting recipe, model state isn't just weights: gradients, FP32 master weights, and optimizer moments bring a 70B model to 1.12 TB before activations.

Why DDP hits a limit

First recall how ordinary Distributed Data Parallel (DDP) works. DDP places a full copy of the model on every GPU, splits the training batch across them, and synchronizes gradients with an all-reduce (a collective operation that sums values across all GPUs and returns the result to every GPU) after each backward pass. It works well for models that fit on one GPU, because adding more GPUs improves throughput without changing the per-GPU memory requirement.

The problem appears when the model itself is too large for one GPU. DDP replicates the full model states on every rank (process/GPU), so adding more GPUs improves throughput but doesn't reduce the per-GPU memory footprint. FSDP and ZeRO attack that specific problem by sharding model states across the data-parallel group.

ZeRO changes that trade-off by partitioning progressively more model state at each stage.^{[1]Reference 1ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.https://arxiv.org/abs/1910.02054} ZeRO-1 shards optimizer states, ZeRO-2 also shards gradients, and only ZeRO-3 shards parameters. Under ZeRO-3, each GPU stores only a fraction of the parameters and gathers the needed shards for computation.

DDP: Replicate everything. Simpler communication pattern, much higher memory use.
ZeRO: Shard model states. Lower memory use, more communication.

ZeRO stages: removing redundancy one layer at a time

ZeRO (Zero Redundancy Optimizer) is easiest to understand as an incremental evolution. Each stage removes one more redundant copy of the model states.

ZeRO stage 1: shard optimizer states

Each GPU keeps optimizer state for only $1/N$ of the parameters. Parameters and gradients are still replicated.

Memory savings: For the 70B example, optimizer state (including FP32 master weights) drops from 840 GB per GPU to roughly $840/N$ GB per GPU.
Communication: Similar aggregate gradient-synchronization volume to the replicated baseline; optimizer ownership changes.

ZeRO stage 2: shard optimizer + gradients

Stage 2 also shards gradients, so each GPU owns only the gradient shard that matches its optimizer shard.

Memory savings: Optimizer states and gradients both scale down by roughly $1/N$ .
Communication: Gradient sync is typically implemented with reduce-scatter (a collective that sums values across GPUs and returns each GPU's shard) instead of all-reduce. The aggregate volume stays in the same ballpark as DDP, and each GPU receives its partitioned shard.

ZeRO stage 3: shard everything

Stage 3 shards parameters as well. No rank owns a full persistent copy of the model states anymore.

Memory savings: Parameters, gradients, and optimizer states are all partitioned. For the 70B example on 256 GPUs, the model-state footprint falls to about 4.4 GB per GPU before activations.
Communication: Parameters now have to be materialized on demand for computation.
1. Forward: all-gather (collect shards from all GPUs to rebuild the full tensor) the current layer or FSDP unit, run compute, then reshard.
2. Backward: all-gather that unit again, compute gradients, then reduce-scatter gradient shards.
Trade-off: Per-GPU model-state memory falls roughly as $1/N$ with the data-parallel group size $N$, but communication becomes a first-class bottleneck.

ZeRO memory trade-off: ZeRO doesn't reduce the aggregate model-state bytes; it eliminates persistent redundant copies. A 70B model still needs roughly 1.12 TB of model-state memory in aggregate, but ZeRO-3 partitions its steady-state storage across GPUs. Temporary materialized units and communication buffers still add to each rank's peak. You pay in network bandwidth for what you save in persistent per-GPU state.
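
A small sketch of that trade-off, under stated assumptions: the persistent shard shrinks with the group size while the aggregate stays fixed. The 80-block split used for the transient unit is a hypothetical layout chosen for illustration, and the "rough peak" ignores communication buffers and activations.

```python
TOTAL_MODEL_STATE_GB = 1120.0      # 70B parameters x 16 bytes per parameter
ASSUMED_UNIT_GB = 140.0 / 80       # hypothetical: 80 equal blocks of low-precision parameters


def zero3_persistent_gb(total_gb: float, world_size: int) -> float:
    # Steady-state shard held by each rank.
    return total_gb / world_size


for world_size in (8, 64, 256):
    shard_gb = zero3_persistent_gb(TOTAL_MODEL_STATE_GB, world_size)
    aggregate_gb = shard_gb * world_size          # unchanged by sharding
    rough_peak_gb = shard_gb + ASSUMED_UNIT_GB    # one unit materialized at a time
    print(f"N={world_size:>3}: shard={shard_gb:7.2f} GB  "
          f"aggregate={aggregate_gb:.0f} GB  rough_peak={rough_peak_gb:.2f} GB")
```

At large group sizes the transient unit can be a noticeable share of the per-rank peak, which is one reason unit granularity matters.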

Memory per GPU (70B model, 256 GPUs)

This table assumes BF16/FP16 weights and gradients with FP32 master weights and FP32 Adam moments. The totals are for model states only. Activation memory gets its own row without numbers because it depends heavily on sequence length, hidden size, micro-batch size, checkpointing, and attention implementation.

Component	DDP	ZeRO-1	ZeRO-2	ZeRO-3
Weights	140 GB	140 GB	140 GB	0.55 GB
Gradients	140 GB	140 GB	0.55 GB	0.55 GB
Optimizer + FP32 master weights	840 GB	3.3 GB	3.3 GB	3.3 GB
Activations	Extra and workload-dependent	Extra and workload-dependent	Extra and workload-dependent	Extra and workload-dependent
Model states / GPU	1120 GB	283 GB	144 GB	4.4 GB

ZeRO-3 or FULL_SHARD is often paired with activation checkpointing (recomputing activations during the backward pass to save memory, also called gradient checkpointing). Sharding fixes model-state memory, but activations can still dominate the actual training footprint.

The same table should be generated from a recipe rather than copied into a sizing document:

zero_stage_memory.py

state_gb = {"parameters": 140, "gradients": 140, "optimizer_and_master": 840}
world_size = 256

def per_rank_gb(stage):
    sharded = {
        0: set(),
        1: {"optimizer_and_master"},
        2: {"optimizer_and_master", "gradients"},
        3: {"optimizer_and_master", "gradients", "parameters"},
    }[stage]
    return sum(value / world_size if name in sharded else value for name, value in state_gb.items())

for stage in [0, 1, 2, 3]:
    print(f"ZeRO-{stage}: {per_rank_gb(stage):.2f} GB/model-state rank")

ZeRO stage memory output

ZeRO-0: 1120.00 GB/model-state rank
ZeRO-1: 283.28 GB/model-state rank
ZeRO-2: 143.83 GB/model-state rank
ZeRO-3: 4.38 GB/model-state rank

Communication overhead analysis

ZeRO-3 buys memory by turning parameter access into communication. That makes the interconnect part of your training system design, not an afterthought: the memory savings are only usable if the GPU interconnect can keep up with the per-unit all-gather and reduce-scatter traffic.

Communication primitives you need to know

Primitive	What it does	Where it shows up
All-reduce	Sum across ranks, then give the full result back to every rank	Classic DDP gradient sync
Reduce-scatter	Sum across ranks, but return only each rank's shard	ZeRO-2/3 gradient sync
All-gather	Collect shards from all ranks to rebuild the full tensor	ZeRO-3 and FSDP parameter materialization

Those names stop blurring once you track one tiny tensor through each collective. Read each panel left to right: same four ranks, different "after" state.

Comparison of all-reduce, reduce-scatter, and all-gather showing whether each rank ends with a full tensor or one shard. — Focus on one question: after the collective, does each rank keep the full tensor or only a shard? That distinction is what makes DDP, ZeRO, and FSDP feel different in practice.

The operations become concrete on a two-rank gradient vector. All-reduce leaves the summed vector on both ranks; reduce-scatter leaves one summed slice per rank; all-gather reconstructs a parameter vector from shards.

collective_primitives.py

rank_gradients = [[1, 2], [10, 20]]
summed = [sum(values) for values in zip(*rank_gradients)]

all_reduce_result = [summed.copy(), summed.copy()]
reduce_scatter_result = [[summed[0]], [summed[1]]]

parameter_shards = [["p0"], ["p1"]]
full_parameters = [token for shard in parameter_shards for token in shard]
all_gather_result = [full_parameters.copy(), full_parameters.copy()]

print("all_reduce=", all_reduce_result)
print("reduce_scatter=", reduce_scatter_result)
print("all_gather=", all_gather_result)

Collectives output

all_reduce= [[11, 22], [11, 22]]
reduce_scatter= [[11], [22]]
all_gather= [['p0', 'p1'], ['p0', 'p1']]

The 1.5x heuristic

People often summarize ZeRO-3 communication as "about 1.5x DDP." That's a useful heuristic, but it's only a heuristic.

For a model with $P$ parameter bytes:

DDP does one gradient all-reduce per step, which is commonly modeled as about $2P$ bytes per rank.
ZeRO-3 / FULL_SHARD does roughly two parameter all-gathers plus one gradient reduce-scatter, which is commonly modeled as about $3P$ bytes per rank.

That yields the familiar $3P / 2P = 1.5\times$ rule of thumb. The harder part in practice isn't only byte volume, but how that volume is scheduled. DDP tends to use a smaller number of large collectives. ZeRO-3 and FSDP break communication into many layer-level or unit-level operations. If those collectives are small and your network has high latency, throughput can fall off quickly.

ZeRO-3 is most attractive when memory is the hard constraint. If the model already fits and the network is weak, DDP or ZeRO-2 style sharding can be faster.

communication_heuristic.py

parameter_gb = 140  # GB of low-precision parameters in the 70B example
ddp_modeled_gb = 2 * parameter_gb
full_shard_modeled_gb = 3 * parameter_gb

print("DDP_modeled_traffic_GB=", ddp_modeled_gb)
print("FULL_SHARD_modeled_traffic_GB=", full_shard_modeled_gb)
print("byte_ratio=", full_shard_modeled_gb / ddp_modeled_gb)
print("warning=latency_and_overlap_still_determine_step_time")

Communication heuristic output

DDP_modeled_traffic_GB= 280
FULL_SHARD_modeled_traffic_GB= 420
byte_ratio= 1.5
warning=latency_and_overlap_still_determine_step_time
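
The byte ratio alone hides the per-collective latency term. The sketch below uses the standard ring cost model, in which all-gather and reduce-scatter each move $(N-1)/N$ of the payload per rank and all-reduce is the two combined, plus a simple no-overlap latency model. The link bandwidth, latencies, and collective counts are hypothetical placeholders, not measurements.

```python
def ring_bytes_per_rank(collective: str, payload_gb: float, world_size: int) -> float:
    # Ring cost model: all-gather and reduce-scatter each move (N-1)/N of the payload
    # per rank; all-reduce is a reduce-scatter followed by an all-gather.
    factor = (world_size - 1) / world_size
    passes = {"all_gather": 1, "reduce_scatter": 1, "all_reduce": 2}[collective]
    return passes * factor * payload_gb


def step_comm_seconds(volume_gb, n_collectives, bandwidth_gb_s, latency_s):
    # No-overlap model: a fixed cost per collective plus volume / bandwidth.
    return n_collectives * latency_s + volume_gb / bandwidth_gb_s


P, N = 140.0, 256   # GB of low-precision parameters, data-parallel ranks
ddp = ring_bytes_per_rank("all_reduce", P, N)
full_shard = (2 * ring_bytes_per_rank("all_gather", P, N)
              + ring_bytes_per_rank("reduce_scatter", P, N))
print(f"byte_ratio={full_shard / ddp:.2f}")

# Hypothetical schedule: 25 gradient buckets for DDP, 80 blocks x 3 collectives
# for full sharding, on an assumed 50 GB/s link.
for latency_s in (20e-6, 5e-3):
    t_ddp = step_comm_seconds(ddp, 25, 50.0, latency_s)
    t_full = step_comm_seconds(full_shard, 240, 50.0, latency_s)
    print(f"latency={latency_s:.0e}s time_ratio={t_full / t_ddp:.2f}")
```

In this model the time ratio drifts above 1.5 as per-collective latency grows, because full sharding issues many more collectives per step. Real runtimes overlap communication with compute, so treat this as a way to reason about direction, not a throughput prediction.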

FSDP (Fully Sharded Data Parallel)

FSDP is PyTorch's native sharded data-parallel stack.^{[2]Reference 2PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.https://arxiv.org/abs/2304.11277} The classic API is FullyShardedDataParallel (often called FSDP1). Current PyTorch exposes a DTensor-based API via torch.distributed.fsdp.fully_shard (often called FSDP2), and its tutorial marks FSDP1 deprecated while giving a migration guide. Treat FSDP1 as an existing-code path and start new designs from fully_shard.^{[3]Reference 3torch.distributed.fsdp.fully_shardhttps://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html}^{[4]Reference 4Getting Started with Fully Sharded Data Parallel (FSDP2)https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html}

The main reason people choose FSDP is operational simplicity. It stays close to PyTorch's native distributed stack, native profilers, and native checkpointing tools. PyTorch's general torch.compile guidance is to compile the highest-level train or evaluation step, or the top-level module. If a distributed wrapper such as FSDP causes issues, compile the inner module instead. If you're compiling FSDP1, current docs also require use_orig_params=True.^{[5]Reference 5Where to apply torch.compile?https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.where_to_apply_compile.html}^{[6]Reference 6FullyShardedDataParallelhttps://docs.pytorch.org/docs/stable/fsdp.html}

FSDP solves model-state memory. It doesn't solve activation memory. Long sequences and large micro-batches still push activations up, which is why FSDP is often paired with activation checkpointing.

How full sharding materializes one unit

For FULL_SHARD, or for an FSDP2 configuration that reshards after forward, one transformer-block-sized communication unit follows this lifecycle:

Group: Parameters are grouped at transformer-block granularity.
Shard: Each unit's parameters are sharded across the process group.
Execution:
- Before forward pass of a unit: All-gather full parameters.
- After forward pass: Discard full parameters (keep only shard).
- Before backward pass: All-gather full parameters again.
- After backward pass: Reduce-scatter gradients so each rank keeps only its gradient shard. The later optimizer step then updates the sharded optimizer state.

Full-sharding communication-unit lifecycle: persistent parameter shard, all-gather for forward, reshard, re-gather for backward, and reduce-scatter gradient shard. — With reshard-after-forward behavior, full parameters exist only around compute for the current block; persistent storage returns to shards after the unit finishes.
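
The lifecycle above can be written out as an ordered event trace. This is a toy sketch in plain Python with hypothetical unit names; it lists events strictly in sequence, whereas real FSDP implementations prefetch and overlap collectives with compute.

```python
def full_shard_trace(units):
    """Return the ordered collective and compute events for one training step."""
    events = []
    for unit in units:                  # forward runs first to last
        events += [("all_gather", unit), ("forward", unit), ("reshard", unit)]
    for unit in reversed(units):        # backward runs last to first
        events += [("all_gather", unit), ("backward", unit),
                   ("reduce_scatter", unit), ("reshard", unit)]
    return events


units = ["block0", "block1", "block2"]
trace = full_shard_trace(units)
for op, unit in trace:
    print(f"{op:>14} {unit}")

# Count collectives per step: two all-gathers and one reduce-scatter per unit.
counts = {}
for op, _ in trace:
    counts[op] = counts.get(op, 0) + 1
print(counts)
```

The counts line up with the "two parameter all-gathers plus one gradient reduce-scatter" accounting used in the communication heuristic.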

Sequence diagram of one wrapped unit in a two-GPU FSDP job (GPU 0 and GPU 1).

The code that follows is a practical FSDP1 setup. You'll still meet this API in existing codebases, so it's worth reading even though new designs should begin with fully_shard. The important details are:

Wrap at the transformer-block level instead of wrapping the whole model as one giant unit.
Create the optimizer after FSDP wraps the model.
Don't confuse backward_prefetch with limit_all_gathers. The first helps overlap communication with compute. The second is mainly a CPU-side rate limiter that caps in-flight all-gathers and peak memory.^{[6]Reference 6FullyShardedDataParallelhttps://docs.pytorch.org/docs/stable/fsdp.html}

This is distributed setup code, not a single-process notebook cell. It assumes torch.distributed.init_process_group() has already run under torchrun and that each rank has selected its local CUDA device.

how-fsdp-works.py

import torch
import torch.nn as nn
import functools
from torch.distributed.fsdp import (
    BackwardPrefetch,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        attn_out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x

my_auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerDecoderLayer},
)

mixed_precision_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = nn.Sequential(*[
    TransformerDecoderLayer(d_model=1024, n_heads=16)
    for _ in range(8)
])

fsdp_model = FSDP(
    model,
    auto_wrap_policy=my_auto_wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=mixed_precision_policy,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    limit_all_gathers=True,
    device_id=torch.cuda.current_device(),
)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=2e-5)  # example SFT start; evaluate it

FSDP2: the newer DTensor-based API

FSDP2 moves away from the flat-parameter wrapper model and toward per-parameter sharding with DTensor. The point isn't only elegance. It changes how composable and inspectable the system is.

For new code, the structural pattern is bottom-up application of fully_shard, then optimizer construction over the resulting DTensor parameters:^{[3]Reference 3torch.distributed.fsdp.fully_shardhttps://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html}

fsdp2_structure.py

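import torch
from torch.distributed.fsdp import fully_shard

model = TransformerModel()   # placeholder: any module that exposes its transformer blocks as model.blocks
for block in model.blocks:
    fully_shard(block)       # one communication group per transformer block
fully_shard(model)           # shard remaining root-level parameters

# Build the optimizer after sharding so it holds the DTensor parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)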

This snippet gives structure rather than a runnable local demo: fully_shard must execute inside a distributed process group on suitable accelerators. The learning rate remains an SFT baseline to evaluate, not a distributed-training requirement.

Feature	FSDP1 (`FullyShardedDataParallel`)	FSDP2 (`fully_shard`)
How it applies sharding	Wrapper module with flat parameters	In-place hooks on original modules
Sharding representation	`FlatParameter` groups	Per-parameter `DTensor`
Parameter names	Flattened internals to reason about	Original FQNs stay intact
Composition with TP / frozen params	More awkward	Better fit with `DeviceMesh` and frozen params
Checkpointing style	Full and sharded state dict APIs	Sharded state dicts first, reshard to full when needed

Key technical improvements

Cleaner composition: FSDP2 uses DeviceMesh, which makes it much easier to compose data-parallel sharding with tensor parallelism in the same topology.
Cleaner state dict story for large runs: Sharded state dicts are first-class, and distributed checkpointing avoids the old pattern of gathering everything onto rank 0 for saving.
Cleaner handling of frozen parameters: Per-parameter sharding is a better fit for PEFT and LoRA-style workflows where only a small subset of weights should update.
Different scheduling controls: fully_shard is no longer a wrapper configured through FSDP1's auto_wrap_policy, backward_prefetch, use_orig_params, or limit_all_gathers arguments. Apply it bottom-up, then use FSDP2 methods such as set_modules_to_forward_prefetch or set_modules_to_backward_prefetch when explicit prefetch control is needed.^{[3]Reference 3torch.distributed.fsdp.fully_shardhttps://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html}

The practical takeaway is simple: FSDP1 still shows up in existing codebases, while current PyTorch's FSDP2 documentation provides the migration path and fully_shard API for new work.

The most important FSDP choice is usually sharding granularity. Use transformer blocks as communication units. If one root unit owns the entire model, its single all-gather materializes every parameter at once and peak memory shoots back up.

This simplified calculation isolates why granularity matters. It models only the transient full-parameter materialization, not activations or prefetch overlap:

fsdp_unit_granularity.py

block_parameter_gb = [3.5, 3.5, 3.5, 3.5]

root_only_peak_full_materialization = sum(block_parameter_gb)
blockwise_peak_full_materialization = max(block_parameter_gb)

print("root_only_active_parameters_GB=", root_only_peak_full_materialization)
print("blockwise_active_parameters_GB=", blockwise_peak_full_materialization)
print("modeled_reduction=", root_only_peak_full_materialization / blockwise_peak_full_materialization)

Sharding granularity output

root_only_active_parameters_GB= 14.0
blockwise_active_parameters_GB= 3.5
modeled_reduction= 4.0

FSDP1 sharding strategies reference

Strategy	Equivalent	Description
`FULL_SHARD`	ZeRO-3	Shards parameters, gradients, and optimizer states. Most memory efficient.
`SHARD_GRAD_OP`	ZeRO-2-like	Unshards parameters for forward, keeps them unsharded until backward finishes, then reshards. Uses more memory than `FULL_SHARD`, but does fewer reshard steps.
`NO_SHARD`	DDP-like replication	Keeps model states replicated and avoids sharded parameter materialization, but doesn't reduce per-rank model-state memory.
`HYBRID_SHARD`	ZeRO-3 within node + replication across nodes	Shards within a node and replicates across nodes. Useful when intra-node links are fast and inter-node bandwidth is tighter.
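
To make the strategy table concrete, here is a rough sketch of persistent model-state memory per GPU for the 70B example. It assumes 8 GPUs per node, which is an illustrative choice, and it leaves out SHARD_GRAD_OP because that strategy's parameter residency changes within a step.

```python
TOTAL_GB = 1120.0      # 70B model states under the 16-byte recipe
WORLD_SIZE = 256
GPUS_PER_NODE = 8      # assumption for illustration


def persistent_model_state_gb(strategy: str) -> float:
    # Steady-state storage between compute phases; transient unsharded parameters are extra.
    shard_group_size = {
        "NO_SHARD": 1,                  # everything replicated on every rank
        "HYBRID_SHARD": GPUS_PER_NODE,  # sharded inside a node, replicated across nodes
        "FULL_SHARD": WORLD_SIZE,       # sharded across all ranks
    }[strategy]
    return TOTAL_GB / shard_group_size


for strategy in ("NO_SHARD", "HYBRID_SHARD", "FULL_SHARD"):
    print(f"{strategy}: {persistent_model_state_gb(strategy):.2f} GB per GPU")
```

Under these assumptions HYBRID_SHARD trades a much larger per-GPU footprint for keeping parameter all-gathers on the fast intra-node links, so it only helps when the model states fit once sharded across a single node.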

Checkpointing, resume, and fault recovery

Large training runs fail in the middle all the time. Nodes reboot, preemptible instances disappear, and jobs get rescheduled. If your checkpoint only contains model weights, you didn't save the training run. You saved an export artifact.^{[7]Reference 7Checkpointing in torchtune.https://meta-pytorch.org/torchtune/stable/deep_dives/checkpointer.html}

For a real resume point, the checkpoint usually needs at least:

Must-save state	Why it matters on resume
Model weights	Restores current parameters
Optimizer state	Preserves momentum and adaptive moments
Scheduler state	Keeps warmup and decay aligned with training step
Gradient-scaler / AMP state, if used	Preserves mixed-precision control state across resume
Consumed step or token count	Lets logs, schedulers, and stop rules stay honest
Sampler / dataloader cursor	Prevents replaying or skipping large data regions
RNG state	Keeps dropout, shuffling, and sampling reproducible when needed

In a sharded system, the safest default is to save sharded checkpoints, not to gather everything onto rank 0 first. FSDP2 and distributed checkpoint APIs are pushing in that direction because the "collect all weights on one machine and write a giant file" pattern eventually becomes the bottleneck or the failure point.^{[3]Reference 3torch.distributed.fsdp.fully_shardhttps://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html}^{[7]Reference 7Checkpointing in torchtune.https://meta-pytorch.org/torchtune/stable/deep_dives/checkpointer.html}

DeepSpeed goes one step further with Universal Checkpointing, which is meant to make checkpoint artifacts more portable across different parallelism layouts instead of tying restore logic to the exact original topology.^{[8]Reference 8Universal Checkpointing with DeepSpeed: A Practical Guide.https://www.deepspeed.ai/tutorials/universal-checkpointing/} That matters once you start changing world size between runs or resume on a different cluster shape.

The operational rules are simple:

Name checkpoints by training step and consumed tokens, not vague names like latest-final.
Test one real resume path early in the run, before trusting a week-long job to it.
Verify that resumed loss on the next batch is close to the non-resumed run.

If you can't answer "what exact state do we restore, and how do we know resume is faithful?", the training system isn't production-ready yet.
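
Two of those rules are easy to encode. The sketch below uses hypothetical helper names, a hypothetical naming scheme, and an arbitrary tolerance; the loss values are made up to show both outcomes.

```python
def checkpoint_name(step: int, consumed_tokens: int) -> str:
    # Zero-padded so lexicographic order matches training order.
    return f"step_{step:08d}-tokens_{consumed_tokens:014d}"


def resume_is_faithful(reference_loss: float, resumed_loss: float,
                       rel_tol: float = 1e-3) -> bool:
    # Compare the first batch after resume with the same batch in an uninterrupted run.
    return abs(resumed_loss - reference_loss) <= rel_tol * abs(reference_loss)


print(checkpoint_name(800, 104_857_600))
print(resume_is_faithful(reference_loss=1.8420, resumed_loss=1.8423))
print(resume_is_faithful(reference_loss=1.8420, resumed_loss=1.9100))
```

The right tolerance depends on how deterministic the run is. Bitwise equality is only realistic when kernels, data order, and RNG state are all restored exactly.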

A distributed checkpoint manifest also has to record the sharding topology it was written under, even when a portable conversion path exists:

distributed_checkpoint_contract.py

required = {
    "model_shards",
    "optimizer_shards",
    "scheduler_state",
    "consumed_tokens",
    "sampler_cursor",
    "rng_state",
    "gradient_scaler_state",
    "template_tokenizer_version",
    "data_manifest",
    "evaluation_manifest",
    "best_metric",
    "world_size",
    "sharding_strategy",
}

manifest = {
    "model_shards": "ckpt/step_800/model/",
    "optimizer_shards": "ckpt/step_800/optimizer/",
    "scheduler_state": "ckpt/step_800/scheduler.pt",
    "consumed_tokens": 104_857_600,
    "sampler_cursor": {"epoch": 0, "batch": 800},
    "rng_state": "ckpt/step_800/rng.pt",
    "gradient_scaler_state": None,  # BF16 run; store scaler state for FP16 when used
    "template_tokenizer_version": "support-chat-v3",
    "data_manifest": "manifests/sft-train-v7.json",
    "evaluation_manifest": "manifests/support-eval-v4.json",
    "best_metric": {"name": "support_resolution_accuracy", "value": 0.78},
    "world_size": 32,
    "sharding_strategy": "FULL_SHARD",
}

missing = sorted(required - manifest.keys())
print("resume_manifest_complete=", not missing)
print("saved_topology=", manifest["sharding_strategy"], manifest["world_size"])
print("gradient_scaler_state=", manifest["gradient_scaler_state"])
print("best_metric=", manifest["best_metric"])

Distributed checkpoint output

resume_manifest_complete= True
saved_topology= FULL_SHARD 32
gradient_scaler_state= None
best_metric= {'name': 'support_resolution_accuracy', 'value': 0.78}
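
Recording the topology matters because shard boundaries depend on it. The toy sketch below reshards a flat list from four ranks to three. It is plain Python for illustration only, with hypothetical function names, and is not how DeepSpeed Universal Checkpointing or PyTorch distributed checkpointing are implemented.

```python
def shard(flat, world_size):
    # Even, contiguous split; pad with None so every rank holds the same count.
    per_rank = -(-len(flat) // world_size)   # ceiling division
    padded = flat + [None] * (per_rank * world_size - len(flat))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def reshard(shards, new_world_size):
    # Rebuild the logical tensor, drop padding, then split for the new topology.
    flat = [x for s in shards for x in s if x is not None]
    return shard(flat, new_world_size)


weights = list(range(10))
saved = shard(weights, 4)       # written by a 4-rank job
restored = reshard(saved, 3)    # resumed on 3 ranks
print("saved   =", saved)
print("restored=", restored)
```

Without the saved world size and sharding rule, a loader cannot tell where padding ends or which rank owned which slice.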

DeepSpeed ZeRO

DeepSpeed is Microsoft's distributed training runtime that introduced ZeRO and provides heterogeneous-memory offload through ZeRO-Offload and ZeRO-Infinity.^{[1]Reference 1ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.https://arxiv.org/abs/1910.02054}^{[9]Reference 9ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.https://arxiv.org/abs/2104.07857} It has a steeper learning curve than native PyTorch, but it remains a strong choice when memory pressure is the real blocker or when the rest of your stack already depends on Megatron-DeepSpeed style training.

One important DeepSpeed-specific detail is that ZeRO-3 depends on consistent module execution order across ranks. Dynamic routing patterns such as MoE can deadlock parameter all-gathers if different ranks enter different submodules, which is why DeepSpeed exposes the idea of ZeRO-3 leaf modules.^{[10]Reference 10Training APIhttps://deepspeed.readthedocs.io/en/latest/training.html}
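
The ordering requirement can be illustrated with a toy check that compares the sequence of modules each rank would gather. The module names are hypothetical, and a real job would hang inside the collective and not report a mismatch, so this is only a way to see the failure mode.

```python
def first_divergence(per_rank_order):
    # Return (step, names) for the first step where ranks disagree, or None.
    length = max(len(order) for order in per_rank_order)
    for step in range(length):
        names = {order[step] if step < len(order) else None
                 for order in per_rank_order}
        if len(names) > 1:
            return step, names
    return None


dense = [["embed", "block0", "block1"], ["embed", "block0", "block1"]]
routed = [["embed", "expert_a", "block1"], ["embed", "expert_b", "block1"]]

print("dense :", first_divergence(dense))
print("routed:", first_divergence(routed))
```

In the routed case the two ranks would wait on all-gathers for different parameters at step 1, which is the situation that marking a module as a ZeRO-3 leaf is meant to avoid.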

This distributed setup snippet configures and initializes a model with DeepSpeed. Run it through the DeepSpeed launcher in a real training job. The configuration dictionary defines the ZeRO stage, offloading settings, optimizer, and training batch parameters. The learning rate is an example SFT starting point to evaluate, and the bucket sizes are tuning knobs rather than canonical values:

deepspeed-zero.py

import deepspeed
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {
        "enabled": True,
    },
    "optimizer": {
        "type": "AdamW",  # with offload_optimizer on CPU, DeepSpeed selects its CPU Adam implementation
        "params": {
            "lr": 2e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True,
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 500_000_000,
        "stage3_prefetch_bucket_size": 50_000_000,
        "stage3_param_persistence_threshold": 100_000,
    }
}

class SimpleTransformer(nn.Module):
    def __init__(self, d_model=1024, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
            for _ in range(n_layers)
        ])
        self.proj = nn.Linear(d_model, 1000)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.proj(x)

model = SimpleTransformer(d_model=1024, n_layers=4)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=model.parameters(),
)

# for batch in dataloader:
#     outputs = model_engine(batch)
#     loss = compute_loss(outputs)
#     model_engine.backward(loss)
#     model_engine.step()
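
The batch settings in that config determine the effective global batch, which grows with the data-parallel group. A small sketch of the arithmetic follows; the 4096-token sequence length is a hypothetical value, and with tensor or pipeline parallelism the multiplier is the data-parallel size, not the total GPU count.

```python
def effective_batch(micro_batch_per_gpu: int, grad_accum_steps: int,
                    data_parallel_size: int) -> int:
    # Sequences consumed per optimizer step.
    return micro_batch_per_gpu * grad_accum_steps * data_parallel_size


ASSUMED_SEQ_LEN = 4096   # hypothetical sequence length

for data_parallel_size in (8, 32, 256):
    sequences = effective_batch(1, 8, data_parallel_size)
    tokens = sequences * ASSUMED_SEQ_LEN
    print(f"dp={data_parallel_size:>3}: {sequences} sequences, {tokens:,} tokens per step")
```

If you widen the group without lowering gradient accumulation, the global batch scales with it, and the learning rate and schedule may need re-evaluation.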

ZeRO-Infinity: CPU and NVMe offloading

DeepSpeed extends ZeRO-3 with ZeRO-Infinity, which allows offloading sharded parameters and optimizer states to CPU RAM or even NVMe SSDs.^{[9]Reference 9ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.https://arxiv.org/abs/2104.07857}

Tier	Memory Pool	Use Case
GPU	GPU HBM	Active compute, activations, temporary buffers
CPU	System RAM	Optimizer states or parameter shards when GPU memory is tight
NVMe	Local SSD / NVMe	Deep spillover tier when RAM still isn't enough

This makes some otherwise impossible training runs feasible, but it doesn't make offload free. Once parameters or optimizer state spill into CPU or NVMe, throughput becomes constrained by PCIe, host memory bandwidth, storage latency, and how well the runtime can overlap data movement with compute.
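
A back-of-envelope sketch shows why. The bandwidth figures are placeholders to replace with measured values, and the per-step traffic for each tier is an assumption about what moves, not a description of DeepSpeed's exact data path.

```python
def transfer_seconds(gigabytes: float, bandwidth_gb_s: float) -> float:
    return gigabytes / bandwidth_gb_s


WORLD_SIZE = 256
# (assumed GB moved per step per GPU, assumed GB/s)
traffic = {
    # assumption: gradient shard goes to the host, updated parameter shard comes back
    "cpu_offload_over_pcie": (2 * 140 / WORLD_SIZE, 25.0),
    # assumption: optimizer shard is read from NVMe and written back
    "nvme_optimizer_offload": (2 * 840 / WORLD_SIZE, 5.0),
}

for tier, (moved_gb, bandwidth) in traffic.items():
    seconds = transfer_seconds(moved_gb, bandwidth)
    print(f"{tier}: {moved_gb:.2f} GB per step, {seconds * 1000:.0f} ms if nothing overlaps")
```

Compare those numbers with your measured compute time per step: if the transfer time is of the same order, overlap quality decides whether offload is tolerable.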

DeepSpeed's feature matrix is broader, but not every combination is valid. Current docs say AutoTP training supports ZeRO stages 0, 1, and 2, while PipelineModule isn't compatible with ZeRO-2 or ZeRO-3.^{[10]Reference 10Training APIhttps://deepspeed.readthedocs.io/en/latest/training.html}^{[11]Reference 11Pipeline Parallelismhttps://deepspeed.readthedocs.io/en/latest/pipeline.html}

Turn compatibility constraints into a config gate, not a surprise after scheduling a large job:

deepspeed_compatibility_gate.py

requested_runs = [
    {"feature": "AutoTP", "zero_stage": 2},
    {"feature": "AutoTP", "zero_stage": 3},
    {"feature": "PipelineModule", "zero_stage": 2},
]

def supported(run):
    if run["feature"] == "AutoTP":
        return run["zero_stage"] in {0, 1, 2}
    if run["feature"] == "PipelineModule":
        return run["zero_stage"] in {0, 1}
    return False

for run in requested_runs:
    print(run["feature"], f"ZeRO-{run['zero_stage']}", "allowed=", supported(run))

Compatibility gate output

AutoTP ZeRO-2 allowed= True
AutoTP ZeRO-3 allowed= False
PipelineModule ZeRO-2 allowed= False

FSDP vs DeepSpeed comparison

Both frameworks solve the same core problem, but they optimize for different operational constraints. FSDP optimizes for staying close to standard PyTorch. DeepSpeed optimizes for a wider set of large-scale runtime features.

Feature	FSDP	DeepSpeed ZeRO
Runtime model	Native PyTorch distributed stack	Separate DeepSpeed runtime
ZeRO stages	ZeRO-2-like and ZeRO-3-like sharding modes	Native ZeRO-1/2/3
CPU offloading	`CPUOffload` for params and gradients	Strong CPU and NVMe offload
NVMe offloading	No	Yes
Compile story	Compile the train step or top-level module first; fall back to the inner module if the wrapper causes issues. FSDP1 needs `use_orig_params=True`	Depends on integration path
Pipeline parallel	External	Built-in, but `PipelineModule` excludes ZeRO-2/3
Tensor parallel	External	AutoTP or Megatron integration (AutoTP supports ZeRO-0/1/2)
Checkpointing / tooling	Standard PyTorch APIs	DeepSpeed-specific runtime APIs
Debugging	Standard PyTorch stack traces and profiler	More runtime-specific behavior

When to use which?

Use FSDP when you want the PyTorch-native path, the model fits in aggregate GPU memory once sharded, and standard PyTorch profiling and checkpointing matter more than exotic runtime features.
Use DeepSpeed when CPU or NVMe offload is non-negotiable, or when you're already in a DeepSpeed or Megatron-DeepSpeed stack whose TP/PP requirements fit the current stage-compatibility rules.

Beyond data parallelism: 3D parallelism

For very large jobs, data-parallel sharding alone isn't enough. As you push the data-parallel group wider, global batch size, optimizer dynamics, and cross-node communication all start fighting you. The standard answer is to combine multiple forms of parallelism in the same training topology.^{[12]Reference 12Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.https://arxiv.org/abs/1909.08053}

That's what people mean by 3D parallelism: data parallelism for replica groups, tensor parallelism (splitting weight matrices within a layer across GPUs) inside layers, and pipeline parallelism (splitting sequential layer groups across different devices) across layer ranges.

This diagram shows how data, tensor, and pipeline parallelism combine to distribute a model across a large cluster. Solid arrows show pipeline flow through layer ranges. Dotted arrows show synchronization between matching data-parallel replicas.

Figure: 3D parallelism topology with two data-parallel replicas, pipeline stages across depth, and tensor-parallel ranks within each stage. 3D parallelism isn't one split: data parallelism repeats the pipeline, pipeline parallelism splits depth, and tensor parallelism splits the layer work inside each stage.

Tensor parallelism (TP)

Tensor parallelism splits the work inside a layer. Instead of putting a full weight matrix on one GPU, it partitions the matrix across GPUs.

How it works: In a transformer MLP, the first projection is often column-sharded and the second projection row-sharded. Each rank computes a partial result, then the ranks synchronize those partials.
Bandwidth requirement: This synchronization happens inside the layer, so TP strongly prefers very fast intra-node links such as NVLink or NVSwitch (high-bandwidth GPU interconnects).
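The column-then-row split can be checked on a toy MLP. This sketch uses plain Python lists and small integer matrices so the comparison is exact; the two "ranks" are simulated in a loop, and the matrices, the ReLU activation, and the helper names are all made up for illustration. The elementwise sum at the end stands in for the all-reduce a real tensor-parallel runtime would issue.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

x = [[1, -2], [3, 4]]                      # (batch=2, d_model=2)
w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]        # (d_model=2, d_ff=4)
w2 = [[1, 2], [0, 1], [-1, 0], [2, -2]]    # (d_ff=4, d_model=2)

# Unsharded reference: the whole MLP on one device.
reference = matmul(relu(matmul(x, w1)), w2)

# Simulated rank 0 and rank 1 each own half of the hidden dimension.
partials = []
for cols in ([0, 1], [2, 3]):
    w1_shard = [[row[c] for c in cols] for row in w1]  # column shard of the first projection
    w2_shard = [w2[c] for c in cols]                   # matching row shard of the second projection
    hidden = relu(matmul(x, w1_shard))                 # elementwise activation needs no communication
    partials.append(matmul(hidden, w2_shard))          # partial sum of the final output

# The elementwise sum plays the role of the all-reduce between ranks.
combined = add(partials[0], partials[1])
print("reference=", reference)
print("combined =", combined)
assert combined == reference
```

The split works because the activation is elementwise over the hidden dimension, so each rank can apply it to its own columns and only the final partial sums need to be synchronized.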

Pipeline parallelism (PP)

Pipeline parallelism splits the model across depth by assigning different layer ranges to different devices or nodes.

How it works: Stage 0 might own layers 1-12, stage 1 owns 13-24, and activations flow from one stage to the next.
Main trade-off: You reduce memory pressure per rank and avoid TP-style per-layer collectives across the whole model, but you introduce pipeline bubbles and need micro-batching to keep stages busy.
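The bubble cost can be estimated before running anything. The sketch below assumes a simple fill-and-drain schedule where every stage takes the same time per micro-batch; under that assumption the idle share is (stages - 1) / (micro_batches + stages - 1). Interleaved or other schedules behave differently, so treat this as a rough planning number.

```python
def bubble_fraction(stages, micro_batches):
    """Idle share of a simple fill-and-drain pipeline with equal-cost stages."""
    return (stages - 1) / (micro_batches + stages - 1)

# With 4 stages, more micro-batches per step shrink the idle share.
for micro_batches in [1, 4, 16, 64]:
    share = bubble_fraction(4, micro_batches)
    print(f"stages=4 micro_batches={micro_batches:>2} bubble={share:.1%}")
```

With one micro-batch, three of four stages sit idle at any time; the idle share only becomes small once the micro-batch count is well above the stage count.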

Parallelism	Primary shard	Communication pattern	Best for
Data / ZeRO / FSDP	Batch across replicas, with model states optionally sharded	All-reduce, reduce-scatter, all-gather	General scaling and memory reduction
Tensor (TP)	Weight matrices within layers	Collectives inside each layer	Very wide layers on fast intra-node links
Pipeline (PP)	Sequential layer groups	Point-to-point activations and gradients	Very deep models and multi-node partitioning

The escalation path should be deliberate.^{[13]Reference 13Parallelism Strategies Guide.https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html}

Start with single-node or data-parallel sharding when the model mostly fits and you want the simplest debug story.
Add tensor parallelism when single layers are too wide for one device or when matmul shards need to stay on NVLink-class links.
Add pipeline parallelism when layer depth itself must be partitioned across devices.

A common placement keeps TP on fast intra-node links, uses PP to partition depth across broader topology boundaries, and applies data parallelism across replica groups. The right degrees still depend on available links, memory, model shape, and global-batch constraints.
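One way to sanity-check that placement is to enumerate rank coordinates before launching. This sketch assumes 8 GPUs per node, ranks assigned to nodes contiguously, and tensor parallelism as the fastest-varying axis; those are illustrative assumptions, not a guarantee about how any particular runtime orders its ranks.

```python
from itertools import product

# Axes listed from slowest-varying to fastest-varying (assumed ordering for this sketch).
degrees = {"data": 2, "pipeline": 2, "tensor": 4}
gpus_per_node = 8  # assumption: ranks 0-7 share node 0, ranks 8-15 share node 1

coords = list(product(*(range(d) for d in degrees.values())))
placement = {
    rank: dict(zip(degrees, coord), node=rank // gpus_per_node)
    for rank, coord in enumerate(coords)
}

# Collect which nodes each tensor-parallel group touches.
tp_group_nodes = {}
for rank, info in placement.items():
    key = (info["data"], info["pipeline"])
    tp_group_nodes.setdefault(key, set()).add(info["node"])

for key, nodes in sorted(tp_group_nodes.items()):
    print(f"dp={key[0]} pp={key[1]} tensor-parallel group on nodes {sorted(nodes)}")

# Every tensor-parallel group should stay inside a single node.
assert all(len(nodes) == 1 for nodes in tp_group_nodes.values())
```

If the final assertion fails for your real degrees and node size, the tensor-parallel collectives would cross the slower inter-node network, which is the situation the placement rule is meant to avoid.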

In current PyTorch, a DeviceMesh is the abstraction used to express multiple device dimensions, which lets FSDP2 compose with tensor parallelism on the same hardware (often called FSDP + TP, or 2D parallelism). Megatron Core also documents context parallelism, which splits the sequence dimension for long-context workloads. Combining data, tensor, pipeline, and context parallelism creates a four-axis topology; runtimes don't always use the same "3D" or "4D" label for that composition.^{[3]Reference 3torch.distributed.fsdp.fully_shardhttps://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html}^{[13]Reference 13Parallelism Strategies Guide.https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html}

Each topology degree multiplies into the required world size. Calculate it before reserving hardware:

parallelism_topology_size.py

topology = {
    "data_parallel": 8,
    "tensor_parallel": 4,
    "pipeline_parallel": 2,
    "context_parallel": 2,
}

world_size = 1
for degree in topology.values():
    world_size *= degree

print("world_size=", world_size)
print("ranks_per_data_replica=", topology["tensor_parallel"] * topology["pipeline_parallel"] * topology["context_parallel"])
assert world_size == 128

Topology size output

world_size= 128
ranks_per_data_replica= 16

Debugging & profiling distributed training

Distributed training fails in ways that single-GPU training doesn't. Jobs hang instead of crash. One bad rank can stall everybody else. Performance regressions often come from communication scheduling, not math kernels.

Common failure modes

NCCL timeouts or hangs: Often one rank died earlier or entered a different collective pattern. TORCH_NCCL_BLOCKING_WAIT=1 can turn a silent hang into a visible failure.
Startup OOM: Large models can OOM before training starts if you materialize full parameters too early. Lazy or meta-device initialization matters.
Wrong gradient clipping: With an FSDP1 sharded strategy, use fsdp_model.clip_grad_norm_() because it computes across sharded gradients. FSDP2's DTensor-based parameters instead support torch.nn.utils.clip_grad_norm_(model.parameters(), ...) in the current tutorial.^{[6]Reference 6FullyShardedDataParallelhttps://docs.pytorch.org/docs/stable/fsdp.html}^{[4]Reference 4Getting Started with Fully Sharded Data Parallel (FSDP2)https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html}
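The clipping pitfall comes down to arithmetic: the clip threshold applies to the norm of the whole gradient, not to each rank's slice. The sketch below simulates two ranks with made-up gradient shards in plain Python; summing the squared local norms stands in for the cross-rank reduction that a sharding-aware clip performs.

```python
import math

# Hypothetical gradient shards: each rank holds a different slice of one flat gradient.
grad_shards = [
    [3.0, 4.0],    # rank 0
    [12.0, 0.0],   # rank 1
]
max_norm = 6.5

local_sq = [sum(g * g for g in shard) for shard in grad_shards]
local_norms = [math.sqrt(s) for s in local_sq]

# Summing squared local norms stands in for an all-reduce(SUM) across ranks.
global_norm = math.sqrt(sum(local_sq))
clip_scale = min(1.0, max_norm / global_norm)

print("local_norms=", local_norms)
print("global_norm=", global_norm, "clip_scale=", clip_scale)

# A purely local clip would pick a different scale on each rank.
local_scales = [min(1.0, max_norm / n) for n in local_norms]
print("local_scales=", local_scales)
```

Rank 0's local norm is under the threshold, so a local-only clip would leave its slice untouched while rank 1 scales its slice down, and the ranks would apply inconsistent updates to the same model.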

Profiling tools

To optimize throughput (tokens/sec), you must identify whether you're compute-bound or communication-bound.

torch.profiler: Look at ncclAllGather, ncclReduceScatter, and gaps where GPUs wait for communication. For FSDP1, tune wrapping granularity and backward_prefetch; use limit_all_gathers=True for peak-memory rate limiting, not as the overlap knob. For FSDP2, use its module prefetch APIs when implicit scheduling is insufficient.^{[6]Reference 6FullyShardedDataParallelhttps://docs.pytorch.org/docs/stable/fsdp.html}^{[3]Reference 3torch.distributed.fsdp.fully_shardhttps://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html}
DeepSpeed Flops Profiler or DeepSpeed runtime logs: Use them to break step time into compute vs communication and to see whether offload is the bottleneck.

For hard distributed bugs, capture one clean repro run with NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL instead of trying to reason from high-level symptoms alone.

Training observability signals that matter

Once the job runs, you still need to know why it's slow or unstable. Good training dashboards separate data, compute, communication, and memory instead of showing only one loss curve.^{[13]Reference 13Parallelism Strategies Guide.https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html}

Signal	What it answers	Bad symptom
tokens/sec	Is end-to-end throughput improving?	flat or falling throughput after adding GPUs
step time split	Is time spent in data load, forward, backward, optimizer, or checkpoint save?	one phase dominates without explanation
all-gather / reduce-scatter share	Is sharding traffic now bottlenecking the job?	communication time grows faster than compute time
dataloader wait time	Are GPUs starved by CPU preprocessing or storage?	GPU idle gaps before forward
peak HBM and reserved memory	Are you close to OOM or fragmenting memory?	retries, allocator spikes, sudden OOM at checkpoint save
MFU or FLOP utilization	Are expensive kernels doing useful math?	very low utilization despite full memory use
checkpoint save duration	Is fault tolerance becoming a throughput tax?	long pauses every save interval

If you only watch training loss, a run can look healthy while throughput quietly collapses. Distributed training needs systems telemetry, not ML telemetry alone.
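Several of these signals are simple ratios over step-level measurements. The sketch below uses made-up numbers for one step; the 6-FLOPs-per-parameter-per-token estimate is a common approximation that ignores attention-specific FLOPs, and the peak FLOP rate is an assumed placeholder you should replace with your accelerator's documented value.

```python
# Hypothetical measurements for one training step; replace with values from your own traces.
step = {
    "tokens": 4_000_000,            # global tokens consumed per step
    "seconds": 20.0,                # wall-clock step time
    "communication_seconds": 6.0,   # exposed (non-overlapped) collective time
    "dataloader_wait_seconds": 1.0, # time GPUs waited on input
}
params = 70e9
gpus = 256
peak_flops_per_gpu = 1e15  # assumed dense BF16 peak; check your accelerator's spec sheet

tokens_per_sec = step["tokens"] / step["seconds"]
comm_share = step["communication_seconds"] / step["seconds"]
data_wait_share = step["dataloader_wait_seconds"] / step["seconds"]

# Approximation: about 6 FLOPs per parameter per trained token (forward plus backward).
achieved_flops = 6 * params * tokens_per_sec
mfu = achieved_flops / (gpus * peak_flops_per_gpu)

print(f"tokens_per_sec={tokens_per_sec:,.0f}")
print(f"comm_share={comm_share:.0%} data_wait_share={data_wait_share:.0%}")
print(f"mfu={mfu:.1%}")
```

Tracking these ratios per step, rather than raw times, makes it easier to see which share grows when you add GPUs or change the sharding configuration.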

Training-system bottleneck map

Strong distributed-training debugging connects the abstraction to the bottleneck. A useful answer names the split, the communication primitive, and the symptom you would measure.

Topic	What to explain	Debug signal
Data parallelism	each rank owns a batch shard and synchronizes gradients	all-reduce time rises with parameter size
Tensor parallelism	one layer is split across ranks	frequent collectives inside attention and MLP blocks
Pipeline parallelism	different ranks own different layer ranges	idle bubbles when micro-batching is too small
ZeRO / FSDP	optimizer states, gradients, and parameters are sharded	all-gather and reduce-scatter dominate step time
Activation checkpointing	save memory by recomputing activations in backward	lower peak memory, higher compute time
NCCL bottlenecks	collectives depend on rank order, topology, and matching calls	hangs, timeouts, or wide gaps in profiler traces
FlashAttention	reduce attention memory traffic with tiled kernels	better long-context throughput when attention is memory-bound

A practical debugging loop is:

Reproduce on the smallest rank count that still fails.
Confirm every rank enters the same collective sequence.
Capture profiler traces and separate compute, communication, and idle time.
Check peak memory before and after forward, backward, and optimizer step.
Change one axis at a time: sharding granularity, micro-batch size, checkpointing, or placement.

This is the bridge between ML and distributed systems. Weak answers stop at "use FSDP" or "use DeepSpeed." Stronger answers say "model states don't fit, so I shard them; then I measure whether communication or activation memory became the new bottleneck."
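The second step of the loop above, confirming that every rank enters the same collective sequence, can be checked offline once each rank logs the collectives it issues. This sketch assumes you already have such per-rank logs; the log contents and the `first_divergence` helper are hypothetical.

```python
# Hypothetical per-rank logs of collective calls, in the order each rank issued them.
rank_logs = {
    0: ["all_gather", "reduce_scatter", "all_gather", "all_reduce"],
    1: ["all_gather", "reduce_scatter", "all_gather", "all_reduce"],
    2: ["all_gather", "reduce_scatter", "all_reduce"],  # this rank skipped one all_gather
}

def first_divergence(logs):
    """Return (position, {rank: op}) for the first position where ranks disagree, else None."""
    longest = max(len(ops) for ops in logs.values())
    for position in range(longest):
        ops = {
            rank: (seq[position] if position < len(seq) else "<missing>")
            for rank, seq in logs.items()
        }
        if len(set(ops.values())) > 1:
            return position, ops
    return None

print(first_divergence(rank_logs))
```

The first mismatched position is usually more informative than the place where the job finally hangs, because the hang often shows up several collectives after the ranks actually diverged.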

A worked sizing exercise

A common sizing prompt is: "How many 80 GB GPUs do you need to train a 70B model with ZeRO-3?" Walk through it step by step.

Step 1: count the model states

With mixed-precision Adam, each parameter carries:

2 bytes (BF16 weight)
2 bytes (BF16 gradient)
12 bytes (FP32 master weight + Adam moments)

Total: 16 bytes per parameter.

For 70B parameters: $70 \times 10^9 \times 16 \text{ bytes} = 1{,}120 \text{ GB}$ of model states.

Step 2: compute the model-state lower bound

If you looked only at model states, the absolute lower bound on 80 GB GPUs would be:

$\frac{1{,}120 \text{ GB}}{80 \text{ GB/GPU}} = 14 \text{ GPUs}$

So 14 GPUs is the arithmetic floor for model states alone.

The same calculation is worth making executable:

zero3_gpu_floor.py

import math

total_state_gb = 70 * 16
gpu_memory_gb = 80

raw_floor = math.ceil(total_state_gb / gpu_memory_gb)
print("raw_lower_bound_gpus=", raw_floor)

for gpus in [14, 16, 24]:
    per_gpu = total_state_gb / gpus
    headroom = gpu_memory_gb - per_gpu
    print(f"{gpus:>2} GPUs -> {per_gpu:5.1f} GB states/GPU, {headroom:5.1f} GB before activations")

GPU lower-bound output

raw_lower_bound_gpus= 14
14 GPUs ->  80.0 GB states/GPU,   0.0 GB before activations
16 GPUs ->  70.0 GB states/GPU,  10.0 GB before activations
24 GPUs ->  46.7 GB states/GPU,  33.3 GB before activations

Step 3: add overhead

The math above is for model states only. Real training also needs:

Activations (depends on sequence length and batch size)
Communication buffers for all-gather and reduce-scatter
CUDA allocator overhead and fragmentation

There isn't one honest fixed headroom percentage: a context-length or micro-batch change can dominate it. Measure or estimate the remaining peak memory for the specific workload, then compare candidate topologies:

zero3_workload_budget.py

state_total_gb = 1120
gpu_capacity_gb = 80
measured_non_state_peak_gb = 18  # activations + buffers + allocator margin for this trial workload

def fits(world_size):
    state_per_gpu = state_total_gb / world_size
    peak = state_per_gpu + measured_non_state_peak_gb
    return state_per_gpu, peak, peak <= gpu_capacity_gb

for world_size in [16, 24, 32]:
    state, peak, ok = fits(world_size)
    print(f"{world_size} GPUs -> states={state:.1f} GB peak_estimate={peak:.1f} GB fits={ok}")

Workload memory budget output

16 GPUs -> states=70.0 GB peak_estimate=88.0 GB fits=False
24 GPUs -> states=46.7 GB peak_estimate=64.7 GB fits=True
32 GPUs -> states=35.0 GB peak_estimate=53.0 GB fits=True

Step 4: answer the question

The clean answer is: 14 GPUs is only the raw ZeRO-3 lower bound for model states under this 16-byte recipe. It isn't a runnable cluster recommendation. A practical count requires a measured or modeled activation/buffer peak and an acceptable throughput result. In the example above, an 18 GB non-state peak rules out 16 GPUs but fits at 24; a longer sequence could invalidate that answer again. Activation checkpointing, a smaller micro-batch, tensor parallelism, or pipeline parallelism may change the result.
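The same budget can be inverted to ask for the smallest world size that fits a given non-state peak. This is a sketch under the same 16-byte recipe; the non-state peaks are hypothetical trial values, and rounding up to whole nodes of 8 GPUs is an assumption about how the cluster is allocated.

```python
import math

state_total_gb = 1120
gpu_capacity_gb = 80
gpus_per_node = 8  # assumption: hardware is reserved in whole 8-GPU nodes

def min_world_size(non_state_peak_gb):
    """Smallest GPU count whose state shard plus non-state peak fits on one GPU."""
    budget = gpu_capacity_gb - non_state_peak_gb
    if budget <= 0:
        return None  # no amount of state sharding helps; reduce activations first
    return math.ceil(state_total_gb / budget)

for non_state_peak_gb in [18, 40, 60]:
    gpus = min_world_size(non_state_peak_gb)
    reserved = math.ceil(gpus / gpus_per_node) * gpus_per_node
    print(f"non_state_peak={non_state_peak_gb} GB -> min_gpus={gpus}, reserved_in_whole_nodes={reserved}")
```

Fitting in memory is only the first gate. The chosen count still has to deliver acceptable throughput once all-gather and reduce-scatter traffic is measured.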

Common pitfalls

Misidentifying which axis is split

Symptom: A team says they "added ZeRO-3 for tensor parallelism," but profiler traces still show whole-layer matmuls on each rank and the real change is only lower model-state memory.
Cause: ZeRO and FSDP shard model states across data-parallel ranks. They don't split the matrix multiply itself. Tensor parallelism splits layer computation, and pipeline parallelism splits model depth.
Fix: Check what moved in the trace or memory profile before naming the technique. If weights, gradients, and optimizer states shrank per rank, that's sharding. If one layer's matmul is split across devices, that's tensor parallelism.

Wrapping the whole model as one FSDP unit

Symptom: You enable FSDP full sharding (ZeRO-3-style), but peak memory is still unexpectedly high and you get OOM during the forward pass.
Cause: You wrapped the entire model as a single FSDP unit instead of wrapping individual transformer layers. When the whole model is one unit, FSDP all-gathers every parameter at once. That defeats the purpose of layer-by-layer sharding.
Fix: Use auto_wrap_policy to wrap at the transformer-block level. The code example earlier shows how to do this with transformer_auto_wrap_policy.
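The size of the problem is easy to estimate. The sketch below compares the unsharded parameter footprint a rank must hold while a unit is gathered, for one whole-model unit versus per-block units. It assumes BF16 parameters, 80 equally sized blocks, and 24 ranks, and it ignores embeddings and prefetched neighbors, so the numbers are rough.

```python
params_total = 70e9
bytes_per_param = 2   # BF16 parameters, matching the 16-byte sizing recipe
n_blocks = 80         # assumption: 80 equally sized transformer blocks
world_size = 24       # assumption: number of data-parallel ranks

def unit_peak_gb(params_in_unit):
    """Unsharded size of one unit while it is gathered for compute."""
    return params_in_unit * bytes_per_param / 1e9

whole_model_unit = unit_peak_gb(params_total)
per_block_unit = unit_peak_gb(params_total / n_blocks)
resident_shard = params_total * bytes_per_param / world_size / 1e9

print(f"resident parameter shard per rank: {resident_shard:.2f} GB")
print(f"gathered size, whole model as one unit: {whole_model_unit:.2f} GB")
print(f"gathered size, one transformer block per unit: {per_block_unit:.2f} GB")
```

Under these assumptions a single whole-model unit needs more unsharded parameter memory than an 80 GB GPU has, while a per-block unit adds under 2 GB on top of the resident shard.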

Thinking FSDP1 `limit_all_gathers=True` creates overlap

Symptom: You set limit_all_gathers=True and expect faster training, but throughput doesn't improve.
Cause: limit_all_gathers is a CPU-side rate limiter. It caps how many all-gathers are in flight at once to control peak memory. It isn't an overlap or performance knob.
Fix: In FSDP1, use backward_prefetch and communication-unit granularity for overlap. In FSDP2, use its prefetch methods when needed; it doesn't expose FSDP1's limit_all_gathers argument.

Treating `SHARD_GRAD_OP` as identical to ZeRO-2

Symptom: You switch from ZeRO-2-style reasoning to FSDP SHARD_GRAD_OP, then wonder why peak memory is higher than expected even though gradients are sharded.
Cause: The names are similar, but the parameter lifetime differs. PyTorch's SHARD_GRAD_OP keeps parameters unsharded through forward and backward, then reshards after backward.^{[6]Reference 6FullyShardedDataParallelhttps://docs.pytorch.org/docs/stable/fsdp.html} That changes peak memory behavior even when the high-level mental model sounds like ZeRO-2.
Fix: Confirm the actual parameter lifetime in docs and profiler traces before sizing memory. If you need the lowest parameter footprint, compare against FULL_SHARD rather than assuming SHARD_GRAD_OP behaves like a DeepSpeed stage name.

Ignoring activation memory

Symptom: The model-state math says everything should fit, but you still OOM.
Cause: Activations can be larger than model states for long sequences. A single forward pass with a long context can generate gigabytes of intermediate tensors.
Fix: Pair ZeRO-3 or FSDP with activation checkpointing. Trade compute for memory by recomputing activations during the backward pass instead of storing them.
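A rough estimate shows why checkpointing changes the picture. The sketch assumes a hypothetical 70B-like shape and that full checkpointing keeps about one BF16 input tensor per block. The 34-bytes-per-element multiplier for the non-checkpointed case is a placeholder assumption, not a measured value, and the estimate ignores the one block being recomputed during backward, so measure your own workload before sizing hardware.

```python
def activation_gb(batch, seq_len, hidden, layers, bytes_per_element):
    """Stored activation estimate: one (batch, seq_len, hidden) tensor per layer, scaled."""
    return batch * seq_len * hidden * layers * bytes_per_element / 1e9

# Hypothetical 70B-like shape for one micro-batch on one rank.
shape = dict(batch=1, seq_len=8192, hidden=8192, layers=80)

# Full checkpointing keeps roughly one BF16 block input per layer (2 bytes per element).
checkpointed_gb = activation_gb(**shape, bytes_per_element=2)

# Placeholder multiplier for everything a block saves without checkpointing.
assumed_bytes_no_ckpt = 34
stored_gb = activation_gb(**shape, bytes_per_element=assumed_bytes_no_ckpt)

print(f"with full checkpointing: ~{checkpointed_gb:.1f} GB")
print(f"without checkpointing (assumed multiplier): ~{stored_gb:.1f} GB")
```

The exact multiplier matters less than the scaling: stored activations grow with batch size, sequence length, hidden size, and depth, and sharding model states across ranks does nothing to shrink them.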

Assuming every DeepSpeed feature composes with ZeRO-3

Symptom: You configure AutoTP with ZeRO-3 and the job crashes or hangs.
Cause: DeepSpeed has documented compatibility limits. AutoTP currently supports ZeRO stages 0, 1, and 2, but not 3. PipelineModule isn't compatible with ZeRO-2 or ZeRO-3.
Fix: Check the stage-compatibility matrix before you design your training topology. Don't assume that because DeepSpeed supports both features, they work together.

What to check before moving on

Check that you can explain and operate these parts of a distributed training setup:

How weights, gradients, FP32 master weights, and Adam moments create the memory wall.
Why DDP improves throughput but doesn't reduce per-GPU model-state memory.
What ZeRO-1, ZeRO-2, and ZeRO-3 shard.
How FSDP1 FULL_SHARD, SHARD_GRAD_OP, NO_SHARD, and HYBRID_SHARD trade memory for communication, and why new code should start with FSDP2 fully_shard.
Why FSDP2's fully_shard is different from the FSDP1 wrapper API.
How all-reduce, reduce-scatter, and all-gather show up in training traces.
When to use FSDP, DeepSpeed ZeRO, offload, tensor parallelism, and pipeline parallelism.
How to size a 70B-style training job from model states first, then adjust the count for activations and communication buffers.
Which profiler and observability signals reveal communication, memory, checkpointing, and dataloader bottlenecks.

Final selection workflow

Use this sequence when sizing a real run:

Compute model-state memory first. If the bytes-per-parameter math already blows past one GPU, DDP isn't your answer.
Ask what is limiting memory. If sharding fixes model states but activations still dominate, add activation checkpointing or reduce sequence and micro-batch before changing frameworks.
Measure the new bottleneck after sharding. If step time now shifts into all-gather and reduce-scatter, network quality and wrapping granularity matter more than the headline ZeRO stage.
Choose the operational stack. Prefer FSDP when you want native PyTorch tooling and the run fits once sharded. Prefer DeepSpeed when CPU or NVMe offload, or an existing DeepSpeed stack, is the real requirement.
Escalate topology only when needed. Add tensor parallelism when layers are too wide for one device, pipeline parallelism when depth must be split, and always verify stage-compatibility before combining features.

Quick decision reference

Scenario	Recommendation	Why
Model fits in aggregate GPU memory and you want native PyTorch tooling	FSDP2 `fully_shard`, applied bottom-up	Current native path with DTensor-backed sharding and PyTorch checkpointing
Maintaining FSDP1 code and willing to spend more memory for fewer reshard events	FSDP1 `SHARD_GRAD_OP`	Higher peak memory, less aggressive resharding
GPU memory is the blocker and CPU/NVMe offload is acceptable	DeepSpeed ZeRO-Infinity	Uses larger memory tiers to make the run feasible
Need DeepSpeed tensor parallel training with optimizer sharding	DeepSpeed AutoTP + ZeRO-1/2	Current docs support AutoTP with ZeRO stages 0, 1, and 2, but not 3
Need DeepSpeed `PipelineModule`	DeepSpeed pipeline + ZeRO-0/1	Current docs say `PipelineModule` isn't compatible with ZeRO-2 or ZeRO-3

Practice drill

Build a sharded-training readiness sheet for a 70B run:

Calculate per-rank memory for DDP, ZeRO-1, ZeRO-2, ZeRO-3, and FSDP full shard, then add activation and temporary-buffer headroom.
Choose wrap granularity or module units and note the peak all-gather size.
Write a checkpoint contract that includes model shards, optimizer state, scheduler, RNG, sampler cursor, consumed progress, and AMP scaler if used.
Mark topology constraints: data parallel, tensor parallel, pipeline parallel, context parallel, and unsupported combinations.

The sheet should explain whether the model fits and what will slow or break the resume path.
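The checkpoint contract from the drill can start as a checklist in code. This sketch only verifies that a manifest names every item from the list above; the key names, the manifest format, and the helper are hypothetical, and it does not validate that the saved contents are actually restorable.

```python
# Items taken from the drill's checkpoint contract; the key spellings are hypothetical.
REQUIRED_CHECKPOINT_KEYS = {
    "model_shards",
    "optimizer_state",
    "scheduler",
    "rng_state",
    "sampler_cursor",
    "consumed_progress",
}

def missing_checkpoint_keys(manifest, uses_amp_scaler=False):
    """Return the contract items a checkpoint manifest fails to mention, sorted by name."""
    required = set(REQUIRED_CHECKPOINT_KEYS)
    if uses_amp_scaler:
        required.add("amp_scaler")
    return sorted(required - set(manifest))

# Hypothetical manifest written next to a checkpoint directory.
manifest = {
    "model_shards": "step_1000/model",
    "optimizer_state": "step_1000/optim",
    "scheduler": "step_1000/scheduler.pt",
    "consumed_progress": {"step": 1000, "tokens": 4_000_000_000},
}

print("missing=", missing_checkpoint_keys(manifest, uses_amp_scaler=True))
```

A resume path that silently lacks the RNG state or sampler cursor will still load and train, which is why the gap is worth catching at save time.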

Next Step

Continue to LoRA & Parameter-Efficient Tuning

Distributed training shows how full-model updates become possible; LoRA shows when you can avoid that cost by training small adapter matrices for targeted adaptation instead of updating the entire model.


References

[1] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Rajbhandari, S., et al. · 2020 · SC 2020
[2] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Zhao, Y., et al. · 2023 · VLDB 2023
[3] torch.distributed.fsdp.fully_shard. PyTorch Contributors · 2025
[4] Getting Started with Fully Sharded Data Parallel (FSDP2). PyTorch Contributors · 2025
[5] Where to apply torch.compile? PyTorch Contributors · 2025
[6] FullyShardedDataParallel. PyTorch Contributors · 2026
[7] Checkpointing in torchtune. PyTorch Contributors · 2026
[8] Universal Checkpointing with DeepSpeed: A Practical Guide. DeepSpeed Team · 2026
[9] ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. Rajbhandari, S., et al. · 2021 · SC 2021
[10] Training API. DeepSpeed Team · 2026
[11] Pipeline Parallelism. DeepSpeed Team · 2026
[12] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Shoeybi, M., et al. · 2019
[13] Parallelism Strategies Guide. NVIDIA · 2026
