Learn tensor parallelism, pipeline parallelism, context parallelism, and how multi-GPU serving trades memory capacity for communication overhead.
HBM, KV cache, and scheduler policy limit single-node serving. The next question is what changes when a single large language model (LLM) copy no longer fits comfortably on one accelerator.
Serving Qwen3.6-35B-A3B for a codebase assistant that reads long diffs, build logs, and architecture notes creates two memory questions. Model-weight memory decides whether one replica fits at all, while per-request KV state decides how many long-context sessions can run at once.
Qwen3.6-35B-A3B is a sparse MoE checkpoint with about 35B total parameters and about 3B activated per token.[1] At BF16, the full checkpoint is roughly 70 GB before KV cache, runtime buffers, and allocator headroom. The A3B suffix helps reason about active compute, but it doesn't mean the serving system only needs 3B parameters worth of memory.
Model parallelism is the set of techniques that split one model across multiple accelerators. The split can make a large model fit, but every added device also creates communication or scheduling work.
Distributed training cares about gradients, optimizer states, activation checkpointing, and throughput over many examples. Distributed inference cares about time to first token (TTFT), tokens per second (TPS), KV-cache memory, and request scheduling.
The same names appear in both worlds, but the trade-offs shift:
| Technique | Training concern | Inference concern |
|---|---|---|
| Tensor parallelism | Split matmuls and gradients | Split weights and activations with low latency |
| Pipeline parallelism | Fill stages with microbatches | Avoid pipeline bubbles during generation |
| Sequence parallelism | Reduce selected activation memory alongside tensor parallelism | Usually a training optimization, not shorthand for long-context inference |
| Context parallelism | Split long-sequence work across devices | Split long prompts, attention work, or KV-cache state when runtime supports it |
| Data parallel serving | Replicate model | Increase throughput for many requests |
Serving runtimes expose topology controls such as tensor- and pipeline-parallel sizes. Treat those knobs as a deployment mechanism, not a performance guarantee: the selected topology still needs memory accounting and latency benchmarks. vLLM's official scaling guide recommends one GPU when the model fits, single-node tensor parallelism when it needs several GPUs in one node, and tensor plus pipeline parallelism when it exceeds one node.[2]
Before picking a sharding strategy, do the simplest memory math:
1weight memory ~= parameters x bytes per parameter
2serving memory ~= weights + KV cache + runtime buffers + safety marginFor BF16 or FP16 weights, a parameter takes 2 bytes. That makes Qwen3.6-35B-A3B about 70 GB by total parameters before a single prompt arrives. KV cache grows with active requests, context length, layers, KV heads, head dimension, and bytes per value. Runtime buffers and memory fragmentation add more headroom. "Enough VRAM" means all of those buckets fit at the traffic level you plan to serve, not the checkpoint file or active-parameter count alone.
1from math import ceil
2
3def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
4 return params_billion * bytes_per_param
5
6def next_power_of_two(value: int) -> int:
7 size = 1
8 while size < value:
9 size *= 2
10 return size
11
12params_billion = 35
13gpu_gb = 80
14reserve_fraction = 0.20
15usable_gb = gpu_gb * (1 - reserve_fraction)
16
17bf16_weights = weight_memory_gb(params_billion, bytes_per_param=2)
18raw_min_gpus = ceil(bf16_weights / usable_gb)
19tp_candidate = next_power_of_two(raw_min_gpus)
20int4_weights = weight_memory_gb(params_billion, bytes_per_param=0.5)
21
22print(f"Qwen3.6-35B-A3B BF16 total weights: {bf16_weights:.0f} GB")
23print(f"80GB GPU usable with 20% reserve: {usable_gb:.0f} GB")
24print(f"minimum GPUs for weights with reserve: {raw_min_gpus}")
25print(f"practical TP candidate: {tp_candidate}")
26print(f"Qwen3.6-35B-A3B INT4 total weight-only estimate: {int4_weights:.0f} GB")1Qwen3.6-35B-A3B BF16 total weights: 70 GB
280GB GPU usable with 20% reserve: 64 GB
3minimum GPUs for weights with reserve: 2
4practical TP candidate: 2
5Qwen3.6-35B-A3B INT4 total weight-only estimate: 18 GBThis calculator is deliberately conservative but incomplete. It only sizes weights plus a reserve. Production sizing still needs KV cache, activation buffers, tensor-parallel divisibility, interconnect measurements, and latency SLOs.
1weights_gb = 70
2gpu_gb = 80
3reserved_runtime_gb = 16
4
5for tp_size in (1, 2):
6 shard_gb = weights_gb / tp_size
7 kv_room_gb = gpu_gb - reserved_runtime_gb - shard_gb
8 print(f"TP={tp_size}: {shard_gb:.0f} GB weights/GPU, {kv_room_gb:.0f} GB left for KV")1TP=1: 70 GB weights/GPU, -6 GB left for KV
2TP=2: 35 GB weights/GPU, 29 GB left for KVTensor parallelism splits large matrix operations across GPUs. In a transformer layer, the model contains big linear projections for attention and feed-forward blocks. Tensor parallelism shards those weights so each GPU owns part of the matrix.
For a simplified linear layer:
1y = xWyou might split W across four GPUs. Each GPU computes part of the output, then the system communicates to combine results. Interconnect decides whether that split pays off. Tensor parallelism is strongest on GPUs connected by fast NVLink (NVIDIA high-bandwidth GPU-to-GPU interconnect) or similar high-bandwidth links.
Naively, every sharded matmul would need a sync to glue the pieces back together. Megatron-LM avoids that by choosing the split directions so each transformer sub-block needs only one synchronization.[3] It chains a column-parallel layer into a row-parallel layer.
In the MLP block Z = (GeLU(xA))B:
A by columns. Each GPU computes GeLU(x A_i) independently. GeLU is element-wise, so no sync is needed before the nonlinearity. (Splitting A by rows would force a sync first, because GeLU(x1 A1 + x2 A2) isn't the sum of the per-shard GeLUs.)B by rows. The column-sharded output of the first layer is exactly the input layout the row-parallel second layer expects, so the partial results flow through with no intermediate communication. One all-reduce after B sums the partial outputs.The attention block follows the same shape. The query, key, and value projections are split column-wise, which maps cleanly onto independent attention heads, and the output projection is split row-wise. That gives one all-reduce after attention.
For the dense Megatron layout described here, a tensor-parallel group executes two all-reduces per transformer layer in the forward pass: one for attention and one for the MLP.[3] During decode, that means every generated token pays two collectives per layer. An 80-layer model therefore executes 160 all-reduces per decode step under this layout, unless a runtime changes or fuses the communication scheme.
For inference, tensor parallelism often reduces memory pressure and can improve latency for large models, but it adds communication inside layers. If communication is slow, adding GPUs can make serving worse.
The interconnect hierarchy is why tensor parallelism is usually easiest to justify within a fast-link domain. NVIDIA's H100 SXM specification advertises up to 900 GB/s NVLink aggregate bandwidth per GPU.[4] PCIe and cross-node network paths have different bandwidth and latency characteristics, so the same tensor-parallel layout can perform very differently across machines. Because every decode token triggers collectives, benchmark the target topology instead of assuming more GPUs lower latency.
1layers = 80
2all_reduces_per_layer = 2
3output_tokens = 128
4
5per_token = layers * all_reduces_per_layer
6generation_total = per_token * output_tokens
7
8print(f"all-reduces per decode token: {per_token}")
9print(f"all-reduces for {output_tokens} output tokens: {generation_total:,}")1all-reduces per decode token: 160
2all-reduces for 128 output tokens: 20,4801collectives_per_token = 160
2
3for assumed_collective_latency_us in (5, 20):
4 floor_ms = collectives_per_token * assumed_collective_latency_us / 1000
5 print(f"{assumed_collective_latency_us} us startup -> {floor_ms:.1f} ms/token before payload transfer")15 us startup -> 0.8 ms/token before payload transfer
220 us startup -> 3.2 ms/token before payload transferUse tensor parallelism when:
Megatron-LM popularized practical tensor model parallelism for large transformers.[3] In serving stacks, the same idea appears as tensor_parallel_size.
Pipeline parallelism splits layers into stages. GPU 0 owns early layers, GPU 1 owns middle layers, and GPU 2 owns later layers. A token's hidden state moves through the stages.
This reduces memory per GPU because each stage stores only part of the model. It can also reduce communication compared with tensor parallelism because tensors move between stages rather than across every large matmul.
Pipeline parallelism creates bubbles. If only one request is active, stage 2 waits for stage 1, stage 3 waits for stage 2, and so on. Bigger batches or many concurrent requests can fill the pipeline better.
Use pipeline parallelism when:
Tensor and pipeline parallelism can be combined. For example, eight GPUs can run tensor_parallel_size=4 and pipeline_parallel_size=2, giving two layer stages where each stage is a four-GPU tensor-parallel group. For a multi-node vLLM deployment, the common first layout is tensor parallelism inside each node and pipeline parallelism across nodes. vLLM also recommends considering pipeline parallelism inside one node when GPU count doesn't evenly divide the model or the node lacks NVLink.[2] This may make a larger replica fit, but queue depth and interconnect measurements still decide TTFT and throughput.
1def ideal_pipeline_utilization(stages: int, microbatches: int) -> float:
2 return microbatches / (microbatches + stages - 1)
3
4for microbatches in (1, 4, 16):
5 utilization = ideal_pipeline_utilization(stages=4, microbatches=microbatches)
6 print(f"4 stages, {microbatches:2d} microbatches: {utilization:.1%} ideal utilization")14 stages, 1 microbatches: 25.0% ideal utilization
24 stages, 4 microbatches: 57.1% ideal utilization
34 stages, 16 microbatches: 84.2% ideal utilizationBoth names mention the token dimension, but they solve different problems. In Megatron Core, sequence parallelism works alongside tensor parallelism: it shards sequence-dimension work in components such as LayerNorm and Dropout to reduce activation memory. Context parallelism partitions the sequence across devices through the transformer layers and is the long-sequence strategy in Megatron's current guide.[5]
For inference, support depends on the runtime and model architecture. Context sharding means the system distributes long-prompt attention work or KV state across devices instead of only splitting weights. Ring Attention is one family of techniques for this problem. Devices hold local query blocks while Key and Value blocks circulate through a ring for blockwise attention. The paper applies this idea to training and inference.[6]
For a codebase assistant reading a long repository map plus build logs, sequence length can dominate prefill cost. Tensor parallelism helps with model weights. Prefix caching helps with repeated prefixes. Context-aware serving helps when the prompt itself is large and attention work or KV state needs to be spread out.
Context parallelism isn't the first knob most teams touch. Start with model size, quantization, tensor parallelism, and batching. Reach for context-level techniques when long prompts are the bottleneck and your runtime supports the required communication pattern.
1tokens = 1_000_000
2layers = 80
3kv_heads = 8
4head_dim = 128
5dtype_bytes = 2
6devices = 4
7
8kv_gib = 2 * tokens * layers * kv_heads * head_dim * dtype_bytes / 1024**3
9print(f"single-request KV footprint: {kv_gib:.1f} GiB")
10print(f"even {devices}-way context shard: {kv_gib / devices:.1f} GiB/device before overhead")1single-request KV footprint: 305.2 GiB
2even 4-way context shard: 76.3 GiB/device before overheadMixture-of-Experts models add another sharding axis: experts. Instead of every token using every feed-forward block, the router sends each token to a small subset of experts.[7] Expert parallelism places different experts on different GPUs, so the serving system can scale total expert capacity without copying every expert to every device.
Expert parallelism pays in routing communication and load balance. Expert-parallel implementations commonly dispatch tokens to devices that own selected experts and combine results afterward, often using all-to-all-style communication. If many tokens choose the same expert, that expert's device becomes the bottleneck while other devices wait. For dense models, start with tensor/pipeline/context choices. For MoE serving, add expert placement and router-load metrics to the plan.
DeepSeek-V3, for example, reports 671B total parameters with 37B activated per token, illustrating why total expert storage and active-token compute are different capacity questions.[8] Serving such a model still needs expert placement, routing balance, and communication measurements; expert parallelism can combine with tensor and data parallelism rather than replace them.
1tokens_by_device = [48, 19, 17, 16]
2average = sum(tokens_by_device) / len(tokens_by_device)
3peak_ratio = max(tokens_by_device) / average
4
5print(f"average routed tokens/device: {average:.1f}")
6print(f"hottest device tokens: {max(tokens_by_device)}")
7print(f"hotspot ratio: {peak_ratio:.2f}x average")1average routed tokens/device: 25.0
2hottest device tokens: 48
3hotspot ratio: 1.92x averageSuppose you need to serve Qwen3.6-35B-A3B for a codebase-reasoning assistant:
| Requirement | Implication |
|---|---|
| Full BF16 checkpoint exceeds conservative one-GPU budget | Need tensor, pipeline, or expert-aware placement |
| 8K context and many concurrent users | KV cache budget matters |
| Low TTFT | Avoid slow cross-node communication |
| High traffic bursts | Consider replicas plus batching |
| Strict data boundary | Maybe self-host rather than hosted API |
A reasonable benchmark candidate is one fast-linked node with enough high-memory GPUs for one replica, tensor parallelism within that node, continuous batching, and prefix caching for stable policy text. If its cost or latency is unacceptable, measure quantization before adding cross-node parallelism.
Some systems add one more axis: disaggregated serving runs prefill and decode on separate worker pools, each with its own parallelism, and transfers KV state between them. Systems such as DistServe and Splitwise study when this can raise goodput by reducing phase interference, subject to KV-transfer overhead.[9][10] The parallelism choices below still apply, but they can be evaluated per phase.
1gpu_count = 2
2gpu_capacity_gb = 80
3weights_gb = 70
4runtime_reserve_gb = 32
5measured_kv_per_request_gb = 0.75
6
7kv_budget_gb = gpu_count * gpu_capacity_gb - weights_gb - runtime_reserve_gb
8arithmetic_batch_ceiling = int(kv_budget_gb / measured_kv_per_request_gb)
9
10print(f"KV budget after weights and reserve: {kv_budget_gb} GB")
11print(f"arithmetic request ceiling at measured KV/request: {arithmetic_batch_ceiling}")
12print("Latency and burst headroom determine the admitted batch below this ceiling.")1KV budget after weights and reserve: 58 GB
2arithmetic request ceiling at measured KV/request: 77
3Latency and burst headroom determine the admitted batch below this ceiling.Multi-GPU inference should be measured with serving metrics, not offline tokens per second alone.
Track:
The worst mistake is counting total VRAM and declaring victory. A four-GPU box with enough raw memory can still miss latency targets if the interconnect is saturated or the scheduler can't fill the pipeline.
Model parallelism is a capacity tool for models that need multiple GPUs. Replicas fit many independent requests when each model copy fits on one GPU. Combine both when the product needs a large model and real throughput.
1benchmarks = [
2 {"name": "TP=1", "fits": False, "ttft_p95": 230, "tpot_p95": 34, "tps": 650},
3 {"name": "TP=2", "fits": True, "ttft_p95": 310, "tpot_p95": 45, "tps": 620},
4 {"name": "TP=4", "fits": True, "ttft_p95": 430, "tpot_p95": 62, "tps": 700},
5]
6ttft_limit, tpot_limit = 400, 55
7eligible = [b for b in benchmarks if b["fits"] and b["ttft_p95"] <= ttft_limit and b["tpot_p95"] <= tpot_limit]
8best = max(eligible, key=lambda b: b["tps"])
9print(f"eligible configurations: {[b['name'] for b in eligible]}")
10print(f"highest-throughput configuration inside SLO: {best['name']}")1eligible configurations: ['TP=2']
2highest-throughput configuration inside SLO: TP=2Consider two serving plans:
| Workload | Better first move | Reason |
|---|---|---|
| Gemma 4 12B code-classification model, high traffic | Replicas | One copy fits on one GPU, so duplicate it for throughput |
| Qwen3.6-35B-A3B reasoning model, low traffic | Tensor or expert-aware parallelism | One copy is too tight for a conservative one-GPU serving budget |
| Qwen3.6-35B-A3B, high traffic | Parallelism plus replicas | One request needs a sharded copy, and traffic needs multiple copies |
| Gemma 4 12B, 64K repository prompts | Measure sequence pressure | Long prefill or KV cache may dominate before weights do |
This is the decision habit to build: ask whether the bottleneck is model weight memory, request volume, context length, or communication. Different bottlenecks need different tools.
1gpu_budget = 8
2measured_tps_per_replica = {"Gemma4 12B on 1 GPU": 600, "Qwen3.6-35B-A3B TP=2": 260}
3gpus_per_replica = {"Gemma4 12B on 1 GPU": 1, "Qwen3.6-35B-A3B TP=2": 2}
4
5for name, per_replica_tps in measured_tps_per_replica.items():
6 replicas = gpu_budget // gpus_per_replica[name]
7 print(f"{name}: {replicas} replicas, {replicas * per_replica_tps} measured aggregate TPS")1Gemma4 12B on 1 GPU: 8 replicas, 4800 measured aggregate TPS
2Qwen3.6-35B-A3B TP=2: 4 replicas, 1040 measured aggregate TPSStrong answers should:
Symptom: The model fits on paper, but the runtime still hits OOM. Cause: You counted weight memory and forgot KV cache, runtime buffers, allocator slack, and bursty long-context headroom. Fix: Size memory by bucket, not checkpoint size alone.
Symptom: Latency gets worse after you spread the model across more GPUs. Cause: Cross-node tensor-parallel collectives now dominate decode. Fix: Keep the tensor-parallel group inside one fast node first. Only cross weaker links when pipeline behavior and queue depth justify it.
Symptom: TTFT rises even though the bigger shard plan finally fits. Cause: Extra communication and startup coordination removed less pressure than they added. Fix: Measure TTFT and decode TPS directly. More GPUs aren't automatically a serving win.
Symptom: You shard a model that already fits, but throughput barely improves. Cause: Traffic volume was the real bottleneck, so communication replaced a simpler replica plan. Fix: Use replicas first when one full model copy fits and requests are independent.
Answer every question, then check your score. Score above 75% to mark this lesson complete.
9 questions remaining.
Qwen3.6-35B-A3B
Qwen Team · 2026
Distributed Inference and Serving.
vLLM Project. · 2026 · Official documentation
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.
Shoeybi, M., et al. · 2019
NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference
NVIDIA · 2024
Parallelism Strategies Guide.
NVIDIA · 2026
Ring Attention with Blockwise Transformers for Near-Infinite Context.
Liu, H., et al. · 2024 · arXiv preprint
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
Shazeer, N., et al. · 2017 · ICLR 2017
DeepSeek-V3 Technical Report.
DeepSeek-AI · 2024 · arXiv preprint
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Zhong, Y., et al. · 2024 · OSDI 2024
Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
Patel, P., et al. · 2023