Model Parallelism for LLM Inference

Learn tensor parallelism, pipeline parallelism, context parallelism, and how multi-GPU serving trades memory capacity for communication overhead.

HBM, KV cache, and scheduler policy limit single-node serving. The next question is what changes when a single large language model (LLM) copy no longer fits comfortably on one accelerator.

Serving Qwen3.6-35B-A3B for a codebase assistant that reads long diffs, build logs, and architecture notes creates two memory questions. Model-weight memory decides whether one replica fits at all, while per-request KV state decides how many long-context sessions can run at once.

Qwen3.6-35B-A3B is a sparse MoE checkpoint with about 35B total parameters and about 3B activated per token.^[1] At BF16, the full checkpoint is roughly 70 GB before KV cache, runtime buffers, and allocator headroom. The A3B suffix helps reason about active compute, but it doesn't mean the serving system only needs 3B parameters worth of memory.

Model parallelism is the set of techniques that split one model across multiple accelerators. The split can make a large model fit, but every added device also creates communication or scheduling work.

Figure: inference parallelism decision frame that starts from the bottleneck, maps it to a split axis, and highlights the communication or memory cost you pay next. Parallelism choices start with the pressure point. Replicas solve traffic. Shards solve fit problems in weights, layer depth, or long-context state.

Why inference sharding differs from training

Distributed training cares about gradients, optimizer states, activation checkpointing, and throughput over many examples. Distributed inference cares about time to first token (TTFT), tokens per second (TPS), KV-cache memory, and request scheduling.

The same names appear in both worlds, but the trade-offs shift:

Technique	Training concern	Inference concern
Tensor parallelism	Split matmuls and gradients	Split weights and activations with low latency
Pipeline parallelism	Fill stages with microbatches	Avoid pipeline bubbles during generation
Sequence parallelism	Reduce selected activation memory alongside tensor parallelism	Usually a training optimization, not shorthand for long-context inference
Context parallelism	Split long-sequence work across devices	Split long prompts, attention work, or KV-cache state when runtime supports it
Data parallel serving	Replicate model	Increase throughput for many requests

Serving runtimes expose topology controls such as tensor- and pipeline-parallel sizes. Treat those knobs as a deployment mechanism, not a performance guarantee: the selected topology still needs memory accounting and latency benchmarks. vLLM's official scaling guide recommends one GPU when the model fits, single-node tensor parallelism when it needs several GPUs in one node, and tensor plus pipeline parallelism when it exceeds one node.^[2]
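
As a rough illustration of that fit-based ordering, the sketch below maps a memory estimate to a first topology to benchmark. The helper `pick_topology` and its inputs are hypothetical and are not part of vLLM; only the key names mirror the tensor- and pipeline-parallel size knobs. It ignores divisibility constraints such as attention-head counts, so treat the result as a starting candidate, not a recommendation.

```python
from math import ceil

def pick_topology(required_gb: float, usable_gb_per_gpu: float, gpus_per_node: int) -> dict:
    """Hypothetical helper: choose a first layout to benchmark from a memory estimate."""
    gpus_needed = ceil(required_gb / usable_gb_per_gpu)
    if gpus_needed <= 1:
        # The model fits on one GPU: no sharding needed.
        return {"tensor_parallel_size": 1, "pipeline_parallel_size": 1}
    if gpus_needed <= gpus_per_node:
        # Needs several GPUs but stays inside one node: tensor parallelism only.
        return {"tensor_parallel_size": gpus_needed, "pipeline_parallel_size": 1}
    # Exceeds one node: tensor parallelism inside each node, pipeline across nodes.
    nodes = ceil(gpus_needed / gpus_per_node)
    return {"tensor_parallel_size": gpus_per_node, "pipeline_parallel_size": nodes}

# Assumed inputs: 64 GB usable per 80 GB GPU, 8 GPUs per node.
for required_gb in (18, 70, 700):
    print(required_gb, "GB ->", pick_topology(required_gb, usable_gb_per_gpu=64, gpus_per_node=8))
```

The 700 GB case is an invented size chosen only to exercise the multi-node branch.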

Before picking a sharding strategy, do the simplest memory math:

weight memory ~= parameters x bytes per parameter
serving memory ~= weights + KV cache + runtime buffers + safety margin

For BF16 or FP16 weights, a parameter takes 2 bytes. That makes Qwen3.6-35B-A3B about 70 GB by total parameters before a single prompt arrives. KV cache grows with active requests, context length, layers, KV heads, head dimension, and bytes per value. Runtime buffers and memory fragmentation add more headroom. "Enough VRAM" means all of those buckets fit at the traffic level you plan to serve, not the checkpoint file or active-parameter count alone.

multi-gpu-sizing-sketch.py

from math import ceil

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # Billions of parameters x bytes per parameter gives decimal gigabytes.
    return params_billion * bytes_per_param

def next_power_of_two(value: int) -> int:
    # Round a GPU count up to the nearest power of two.
    size = 1
    while size < value:
        size *= 2
    return size

params_billion = 35       # total parameters, not the ~3B activated per token
gpu_gb = 80
reserve_fraction = 0.20   # share of each GPU held back as reserve
usable_gb = gpu_gb * (1 - reserve_fraction)

bf16_weights = weight_memory_gb(params_billion, bytes_per_param=2)
# Smallest GPU count whose combined usable memory holds the weights.
raw_min_gpus = ceil(bf16_weights / usable_gb)
tp_candidate = next_power_of_two(raw_min_gpus)
# Weight-only INT4 estimate at 0.5 bytes per parameter.
int4_weights = weight_memory_gb(params_billion, bytes_per_param=0.5)

print(f"Qwen3.6-35B-A3B BF16 total weights: {bf16_weights:.0f} GB")
print(f"80GB GPU usable with 20% reserve: {usable_gb:.0f} GB")
print(f"minimum GPUs for weights with reserve: {raw_min_gpus}")
print(f"practical TP candidate: {tp_candidate}")
print(f"Qwen3.6-35B-A3B INT4 total weight-only estimate: {int4_weights:.0f} GB")

Output

Qwen3.6-35B-A3B BF16 total weights: 70 GB
80GB GPU usable with 20% reserve: 64 GB
minimum GPUs for weights with reserve: 2
practical TP candidate: 2
Qwen3.6-35B-A3B INT4 total weight-only estimate: 18 GB

This calculator is deliberately conservative but incomplete. It only sizes weights plus a reserve. Production sizing still needs KV cache, activation buffers, tensor-parallel divisibility, interconnect measurements, and latency SLOs.

weight-shards-leave-room-for-kv.py

weights_gb = 70
gpu_gb = 80
reserved_runtime_gb = 16

for tp_size in (1, 2):
    shard_gb = weights_gb / tp_size
    kv_room_gb = gpu_gb - reserved_runtime_gb - shard_gb
    print(f"TP={tp_size}: {shard_gb:.0f} GB weights/GPU, {kv_room_gb:.0f} GB left for KV")

Output

TP=1: 70 GB weights/GPU, -6 GB left for KV
TP=2: 35 GB weights/GPU, 29 GB left for KV
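
The KV-cache bucket can be estimated the same way. The sketch below is a minimal example under loudly stated assumptions: the layer count, KV-head count, and head dimension are hypothetical placeholders for a plain attention stack, not the published Qwen3.6-35B-A3B configuration, and the per-GPU figure assumes tensor parallelism splits KV heads evenly across devices.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> float:
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes / 1024**3

# Hypothetical layout, chosen only to make the arithmetic concrete.
layers, kv_heads, head_dim = 48, 8, 128
context_tokens = 8192
tp_size = 2

per_token_kib = kv_cache_gib(1, layers, kv_heads, head_dim) * 1024**2
per_request_gib = kv_cache_gib(context_tokens, layers, kv_heads, head_dim)
# Assumption: each tensor-parallel rank stores KV only for its own share of the heads.
per_gpu_gib = per_request_gib / tp_size

print(f"KV per token: {per_token_kib:.0f} KiB")
print(f"KV per {context_tokens}-token request: {per_request_gib:.2f} GiB")
print(f"KV per request per GPU at TP={tp_size}: {per_gpu_gib:.2f} GiB")
```

Swap in the real model configuration and the runtime's KV dtype before using numbers like these for capacity planning.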

Tensor parallelism

Tensor parallelism splits large matrix operations across GPUs. In a transformer layer, the model contains big linear projections for attention and feed-forward blocks. Tensor parallelism shards those weights so each GPU owns part of the matrix.

For a simplified linear layer:

y = xW

you might split W across four GPUs. Each GPU computes part of the output, then the system communicates to combine results. Interconnect decides whether that split pays off. Tensor parallelism is strongest on GPUs connected by fast NVLink (NVIDIA high-bandwidth GPU-to-GPU interconnect) or similar high-bandwidth links.

The Megatron pattern: column then row

Naively, every sharded matmul would need a sync to glue the pieces back together. Megatron-LM avoids that by choosing the split directions so each transformer sub-block needs only one synchronization.^[3] It chains a column-parallel layer into a row-parallel layer.

In the MLP block Z = (GeLU(xA))B:

Split the first weight A by columns. Each GPU computes GeLU(x A_i) independently. GeLU is element-wise, so no sync is needed before the nonlinearity. (Splitting A by rows would force a sync first, because GeLU(x1 A1 + x2 A2) isn't the sum of the per-shard GeLUs.)
Split the second weight B by rows. The column-sharded output of the first layer is exactly the input layout the row-parallel second layer expects, so the partial results flow through with no intermediate communication. One all-reduce after B sums the partial outputs.
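
The two steps above can be checked numerically. This is a minimal pure-Python sketch, assuming tiny made-up matrices and two simulated GPUs; it is not Megatron-LM code. Each simulated rank runs its column shard of A and matching row shard of B with no communication, and a single sum stands in for the all-reduce.

```python
import math

def gelu(v: float) -> float:
    # Exact GeLU using the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matmul(x, w):
    # x is [rows][inner], w is [inner][cols]; zip(*w) iterates the columns of w.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def mlp(x, a, b):
    hidden = [[gelu(v) for v in row] for row in matmul(x, a)]
    return matmul(hidden, b)

x = [[1.0, -2.0]]                                        # one token, hidden size 2
A = [[0.5, -1.0, 0.3, 0.8], [0.1, 0.4, -0.7, 0.2]]       # 2 x 4 first projection
B = [[0.2, -0.5], [0.6, 0.1], [-0.3, 0.9], [0.4, 0.7]]   # 4 x 2 second projection

reference = mlp(x, A, B)  # unsharded result

world_size = 2
cols_per_rank = len(A[0]) // world_size
partials = []
for rank in range(world_size):
    lo, hi = rank * cols_per_rank, (rank + 1) * cols_per_rank
    a_shard = [row[lo:hi] for row in A]   # column shard of A
    b_shard = B[lo:hi]                    # matching row shard of B
    partials.append(mlp(x, a_shard, b_shard))  # purely local work

# The single all-reduce: sum the partial outputs element-wise.
combined = [[sum(p[i][j] for p in partials) for j in range(len(B[0]))] for i in range(len(x))]

matches = all(math.isclose(c, r, abs_tol=1e-12) for c, r in zip(combined[0], reference[0]))
print(f"sharded MLP matches unsharded MLP: {matches}")
```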

The attention block follows the same shape. The query, key, and value projections are split column-wise, which maps cleanly onto independent attention heads, and the output projection is split row-wise. That gives one all-reduce after attention.

For the dense Megatron layout described here, a tensor-parallel group executes two all-reduces per transformer layer in the forward pass: one for attention and one for the MLP.^[3] During decode, that means every generated token pays two collectives per layer. An 80-layer model therefore executes 160 all-reduces per decode step under this layout, unless a runtime changes or fuses the communication scheme.

Figure: Megatron tensor-parallel MLP flow where GPUs compute local column-sharded GeLU activations, feed them directly into row-sharded second projections, and all-reduce only after the paired projections. Megatron's column-then-row pairing matters: local GeLU shards flow into local row-parallel projections before one all-reduce restores the full block output.

For inference, tensor parallelism often reduces memory pressure and can improve latency for large models, but it adds communication inside layers. If communication is slow, adding GPUs can make serving worse.

The interconnect hierarchy is why tensor parallelism is usually easiest to justify within a fast-link domain. NVIDIA's H100 SXM specification advertises up to 900 GB/s NVLink aggregate bandwidth per GPU.^[4] PCIe and cross-node network paths have different bandwidth and latency characteristics, so the same tensor-parallel layout can perform very differently across machines. Because every decode token triggers collectives, benchmark the target topology instead of assuming more GPUs lower latency.

decode-collective-count.py

layers = 80
all_reduces_per_layer = 2
output_tokens = 128

per_token = layers * all_reduces_per_layer
generation_total = per_token * output_tokens

print(f"all-reduces per decode token: {per_token}")
print(f"all-reduces for {output_tokens} output tokens: {generation_total:,}")

Output

all-reduces per decode token: 160
all-reduces for 128 output tokens: 20,480

collective-latency-floor.py

collectives_per_token = 160

for assumed_collective_latency_us in (5, 20):
    floor_ms = collectives_per_token * assumed_collective_latency_us / 1000
    print(f"{assumed_collective_latency_us} us startup -> {floor_ms:.1f} ms/token before payload transfer")

Output

5 us startup -> 0.8 ms/token before payload transfer
20 us startup -> 3.2 ms/token before payload transfer
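
Payload time adds to that startup floor. The sketch below is a back-of-the-envelope estimate that assumes a ring all-reduce, in which each device sends and receives about 2(N-1)/N of the message; real collective libraries may choose other algorithms. The hidden size, batch size, and effective link bandwidths are hypothetical values, not measurements.

```python
def ring_all_reduce_payload_us(message_bytes: float, world_size: int, link_gb_per_s: float) -> float:
    # Ring all-reduce moves about 2 * (N - 1) / N of the message through each device.
    traffic_bytes = 2 * (world_size - 1) / world_size * message_bytes
    return traffic_bytes / (link_gb_per_s * 1e9) * 1e6

# Hypothetical decode step: 32 sequences, hidden size 8192, 2-byte activations.
batch_tokens, hidden_size, dtype_bytes = 32, 8192, 2
message_bytes = batch_tokens * hidden_size * dtype_bytes
collectives_per_step = 160   # 80 layers x 2 all-reduces, as in the count above
tp_size = 4

# Assumed effective bandwidths in GB/s; measure the real links before trusting these.
for label, bandwidth in (("fast intra-node link", 300), ("slow cross-node link", 25)):
    per_collective_us = ring_all_reduce_payload_us(message_bytes, tp_size, bandwidth)
    step_ms = collectives_per_step * per_collective_us / 1000
    print(f"{label}: {per_collective_us:.1f} us/collective, {step_ms:.2f} ms/step payload")
```

Startup latency and payload time together bound the communication cost of one decode step under this model.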

Use tensor parallelism when:

The model doesn't fit on one GPU.
The GPUs have fast interconnect.
You need one request to use multiple GPUs at once.
Batch sizes aren't large enough to rely only on model replicas.

Megatron-LM popularized practical tensor model parallelism for large transformers.^[3] In serving stacks, the same idea appears as tensor_parallel_size.

Pipeline parallelism

Pipeline parallelism splits layers into stages. GPU 0 owns early layers, GPU 1 owns middle layers, and GPU 2 owns later layers. A token's hidden state moves through the stages.

This reduces memory per GPU because each stage stores only part of the model. It can also reduce communication compared with tensor parallelism because activations cross a stage boundary once per stage instead of being all-reduced inside every layer.

Pipeline parallelism creates bubbles. If only one request is active, stage 2 waits for stage 1, stage 3 waits for stage 2, and so on. Bigger batches or many concurrent requests can fill the pipeline better.

Use pipeline parallelism when:

Tensor parallelism alone doesn't fit the model.
The model must cross node boundaries.
You can batch enough work to keep stages busy.
You can tolerate slightly more scheduling complexity.

Tensor and pipeline parallelism can be combined. For example, eight GPUs can run tensor_parallel_size=4 and pipeline_parallel_size=2, giving two layer stages where each stage is a four-GPU tensor-parallel group. For a multi-node vLLM deployment, the common first layout is tensor parallelism inside each node and pipeline parallelism across nodes. vLLM also recommends considering pipeline parallelism inside one node when GPU count doesn't evenly divide the model or the node lacks NVLink.^[2] This may make a larger replica fit, but queue depth and interconnect measurements still decide TTFT and throughput.
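
The eight-GPU example can be laid out as a small grid. This sketch assumes consecutive ranks form one tensor-parallel group and that 80 layers split evenly across stages; the rank ordering and layer count are illustrative assumptions, and real runtimes choose their own placement.

```python
tensor_parallel_size = 4
pipeline_parallel_size = 2
world_size = tensor_parallel_size * pipeline_parallel_size

def placement(rank: int, tp_size: int) -> dict:
    # Assumption: ranks 0..tp_size-1 form stage 0, the next tp_size ranks form stage 1, and so on.
    return {"pipeline_stage": rank // tp_size, "tensor_rank": rank % tp_size}

layers = 80  # hypothetical layer count
layers_per_stage = layers // pipeline_parallel_size

for rank in range(world_size):
    spot = placement(rank, tensor_parallel_size)
    first_layer = spot["pipeline_stage"] * layers_per_stage
    last_layer = first_layer + layers_per_stage - 1
    print(
        f"rank {rank}: stage {spot['pipeline_stage']}, "
        f"tensor rank {spot['tensor_rank']}, layers {first_layer}-{last_layer}"
    )
```

Keeping each tensor-parallel group on contiguous ranks is one way to keep the per-layer collectives on the fastest links while only stage boundaries cross slower ones.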

Figure: pipeline parallel timeline comparing one active request, where later stages sit idle, with continuous queued requests that keep most stages occupied. Pipeline parallelism reduces per-GPU memory by splitting model depth, but queue depth decides whether that split becomes useful throughput or just idle bubbles.

pipeline-bubble-utilization.py

def ideal_pipeline_utilization(stages: int, microbatches: int) -> float:
    # Each microbatch needs one time slot per stage, and the pipeline takes
    # (microbatches + stages - 1) slots to fill and drain.
    return microbatches / (microbatches + stages - 1)

for microbatches in (1, 4, 16):
    utilization = ideal_pipeline_utilization(stages=4, microbatches=microbatches)
    print(f"4 stages, {microbatches:2d} microbatches: {utilization:.1%} ideal utilization")

Output

4 stages,  1 microbatches: 25.0% ideal utilization
4 stages,  4 microbatches: 57.1% ideal utilization
4 stages, 16 microbatches: 84.2% ideal utilization

Sequence parallelism and context parallelism aren't interchangeable

Both names mention the token dimension, but they solve different problems. In Megatron Core, sequence parallelism works alongside tensor parallelism: it shards sequence-dimension work in components such as LayerNorm and Dropout to reduce activation memory. Context parallelism partitions the sequence across devices through the transformer layers and is the long-sequence strategy in Megatron's current guide.^[5]

For inference, support depends on the runtime and model architecture. Context sharding means the system distributes long-prompt attention work or KV state across devices instead of only splitting weights. Ring Attention is one family of techniques for this problem. Devices hold local query blocks while Key and Value blocks circulate through a ring for blockwise attention. The paper applies this idea to training and inference.^[6]
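
To make the ring idea concrete, here is a toy single-head sketch, not the paper's implementation. It assumes one query vector per simulated device, no causal mask, and no 1/sqrt(d) scaling. Each device keeps its query, sees one key/value block per ring step, and folds it into a running softmax so the final result equals attention over the full sequence.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(query, keys, values):
    """Reference: softmax(query . keys) applied to values on a single device."""
    scores = [dot(query, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(len(values[0]))]

def ring_attention(queries, kv_blocks):
    """Device i keeps queries[i]; key/value blocks rotate one hop per ring step."""
    n = len(kv_blocks)
    dim = len(kv_blocks[0][1][0])
    state = [{"peak": -math.inf, "denom": 0.0, "acc": [0.0] * dim} for _ in range(n)]
    held = list(range(n))  # device i starts with its own block i
    for _ in range(n):
        for device in range(n):
            keys, values = kv_blocks[held[device]]
            s = state[device]
            for k, v in zip(keys, values):
                score = dot(queries[device], k)
                new_peak = max(s["peak"], score)
                rescale = math.exp(s["peak"] - new_peak)  # 0.0 for the first key seen
                weight = math.exp(score - new_peak)
                s["denom"] = s["denom"] * rescale + weight
                s["acc"] = [a * rescale + weight * x for a, x in zip(s["acc"], v)]
                s["peak"] = new_peak
        # Every device hands its current block to the next device in the ring.
        held = [held[(device - 1) % n] for device in range(n)]
    return [[a / s["denom"] for a in s["acc"]] for s in state]

# Made-up data: 3 devices, 2 key/value pairs per block, vector size 2.
queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
kv_blocks = [
    ([[1.0, 0.0], [0.2, 0.7]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[-0.5, 0.4], [0.9, -0.1]], [[5.0, 6.0], [7.0, 8.0]]),
    ([[0.3, 0.3], [-1.0, 1.0]], [[9.0, 10.0], [11.0, 12.0]]),
]
all_keys = [k for keys, _ in kv_blocks for k in keys]
all_values = [v for _, values in kv_blocks for v in values]

ring_out = ring_attention(queries, kv_blocks)
results_match = all(
    math.isclose(a, b, abs_tol=1e-9)
    for device, query in enumerate(queries)
    for a, b in zip(ring_out[device], full_attention(query, all_keys, all_values))
)
print(f"ring result matches full attention on every device: {results_match}")
```

No device ever holds more than one key/value block at a time, which is the memory property that makes the pattern interesting for long contexts.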

For a codebase assistant reading a long repository map plus build logs, sequence length can dominate prefill cost. Tensor parallelism helps with model weights. Prefix caching helps with repeated prefixes. Context-aware serving helps when the prompt itself is large and attention work or KV state needs to be spread out.

Context parallelism isn't the first knob most teams touch. Start with model size, quantization, tensor parallelism, and batching. Reach for context-level techniques when long prompts are the bottleneck and your runtime supports the required communication pattern.

context-parallel-kv-capacity.py

tokens = 1_000_000
layers = 80
kv_heads = 8
head_dim = 128
dtype_bytes = 2
devices = 4

kv_gib = 2 * tokens * layers * kv_heads * head_dim * dtype_bytes / 1024**3
print(f"single-request KV footprint: {kv_gib:.1f} GiB")
print(f"even {devices}-way context shard: {kv_gib / devices:.1f} GiB/device before overhead")

Output

single-request KV footprint: 305.2 GiB
even 4-way context shard: 76.3 GiB/device before overhead

Expert parallelism for MoE models

Mixture-of-Experts models add another sharding axis: experts. Instead of every token using every feed-forward block, the router sends each token to a small subset of experts.^[7] Expert parallelism places different experts on different GPUs, so the serving system can scale total expert capacity without copying every expert to every device.

Expert parallelism pays in routing communication and load balance. Expert-parallel implementations commonly dispatch tokens to devices that own selected experts and combine results afterward, often using all-to-all-style communication. If many tokens choose the same expert, that expert's device becomes the bottleneck while other devices wait. For dense models, start with tensor/pipeline/context choices. For MoE serving, add expert placement and router-load metrics to the plan.
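
The dispatch-and-combine step can be sketched in a few lines. Everything here is illustrative: the router scores are made up, `toy_expert` is a hypothetical stand-in for a feed-forward block, experts are assigned to devices in contiguous pairs, and plain Python lists stand in for the all-to-all exchange.

```python
num_experts, num_devices, top_k = 8, 4, 2
experts_per_device = num_experts // num_devices

def owner(expert_id: int) -> int:
    # Assumption: experts are placed on devices in contiguous groups.
    return expert_id // experts_per_device

def toy_expert(expert_id: int, value: float) -> float:
    # Hypothetical stand-in for an expert feed-forward block.
    return value * (expert_id + 1)

tokens = [1.0, 2.0, 3.0, 4.0]
router_scores = [  # one row of made-up scores per token, one column per expert
    [0.1, 0.9, 0.0, 0.3, 0.2, 0.0, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.6, 0.5, 0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.9, 0.0, 0.4],
]

# Dispatch: send each token to the devices that own its top-k experts.
inbox = [[] for _ in range(num_devices)]
for token_id, scores in enumerate(router_scores):
    chosen = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:top_k]
    gate_total = sum(scores[e] for e in chosen)
    for expert_id in chosen:
        inbox[owner(expert_id)].append((token_id, expert_id, scores[expert_id] / gate_total))

# Local expert compute, then combine: gate-weighted sum back into each token's slot.
outputs = [0.0] * len(tokens)
for items in inbox:
    for token_id, expert_id, gate in items:
        outputs[token_id] += gate * toy_expert(expert_id, tokens[token_id])

for device, items in enumerate(inbox):
    print(f"device {device}: {len(items)} routed token-expert pairs")
```

Even in this tiny example one device receives half of the routed work, which is the load-balance problem described above.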

DeepSeek-V3, for example, reports 671B total parameters with 37B activated per token, illustrating why total expert storage and active-token compute are different capacity questions.^[8] Serving such a model still needs expert placement, routing balance, and communication measurements; expert parallelism can combine with tensor and data parallelism rather than replace them.

expert-routing-imbalance.py

tokens_by_device = [48, 19, 17, 16]
average = sum(tokens_by_device) / len(tokens_by_device)
peak_ratio = max(tokens_by_device) / average

print(f"average routed tokens/device: {average:.1f}")
print(f"hottest device tokens: {max(tokens_by_device)}")
print(f"hotspot ratio: {peak_ratio:.2f}x average")

Output

average routed tokens/device: 25.0
hottest device tokens: 48
hotspot ratio: 1.92x average

Sizing example

Suppose you need to serve Qwen3.6-35B-A3B for a codebase-reasoning assistant:

Requirement	Implication
Full BF16 checkpoint exceeds conservative one-GPU budget	Need tensor, pipeline, or expert-aware placement
8K context and many concurrent users	KV cache budget matters
Low TTFT	Avoid slow cross-node communication
High traffic bursts	Consider replicas plus batching
Strict data boundary	Maybe self-host rather than hosted API

Figure: two illustrative charts for multi-GPU inference planning. One compares the Qwen3.6-35B-A3B BF16 weight footprint with one 80 GB GPU after reserve; the other shows relative decode communication cost as tensor parallelism crosses slower links. Fit and speed are separate tests. One chart asks whether the model can stay in memory. The other asks whether the chosen shard plan will still serve tokens quickly.

A reasonable benchmark candidate is one fast-linked node with enough high-memory GPUs for one replica, tensor parallelism within that node, continuous batching, and prefix caching for stable policy text. If its cost or latency is unacceptable, measure quantization before adding cross-node parallelism.

Some systems add one more axis: disaggregated serving runs prefill and decode on separate worker pools, each with its own parallelism, and transfers KV state between them. Systems such as DistServe and Splitwise study when this can raise goodput by reducing phase interference, subject to KV-transfer overhead.^[9]^[10] The parallelism choices described above still apply, but they can be evaluated per phase.
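
A quick way to reason about the KV-transfer overhead is to divide the prompt's KV bytes by the link bandwidth. The sketch below is a rough estimate only: the model layout and link speeds are hypothetical, and it ignores overlap, compression, and protocol overhead that a real disaggregated system would have.

```python
def kv_bytes(prompt_tokens: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Keys and values for every prompt token, layer, and KV head.
    return 2 * prompt_tokens * layers * kv_heads * head_dim * dtype_bytes

def transfer_ms(num_bytes: float, link_gb_per_s: float) -> float:
    return num_bytes / (link_gb_per_s * 1e9) * 1000

# Hypothetical layout and prompt length, not a published model configuration.
payload = kv_bytes(prompt_tokens=8192, layers=48, kv_heads=8, head_dim=128)

# Assumed effective bandwidths between prefill and decode workers, in GB/s.
for label, bandwidth in (("50 GB/s link", 50.0), ("12.5 GB/s link", 12.5)):
    print(f"{label}: {payload / 1e9:.2f} GB of KV state moves in about {transfer_ms(payload, bandwidth):.0f} ms")
```

If the transfer time is a large share of the TTFT budget, the phase split is unlikely to pay off on that link.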

serving-memory-budget.py

gpu_count = 2
gpu_capacity_gb = 80
weights_gb = 70
runtime_reserve_gb = 32
measured_kv_per_request_gb = 0.75

kv_budget_gb = gpu_count * gpu_capacity_gb - weights_gb - runtime_reserve_gb
arithmetic_batch_ceiling = int(kv_budget_gb / measured_kv_per_request_gb)

print(f"KV budget after weights and reserve: {kv_budget_gb} GB")
print(f"arithmetic request ceiling at measured KV/request: {arithmetic_batch_ceiling}")
print("Latency and burst headroom determine the admitted batch below this ceiling.")

Output

KV budget after weights and reserve: 58 GB
arithmetic request ceiling at measured KV/request: 77
Latency and burst headroom determine the admitted batch below this ceiling.

What to measure

Multi-GPU inference should be measured with serving metrics, not offline tokens per second alone.

Track:

Time to first token.
Decode tokens per second.
Aggregate throughput.
GPU memory used by weights.
GPU memory used by KV cache.
Interconnect utilization.
Queue time under burst traffic.
Error rate when one GPU or node fails.
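
The first two metrics can be derived from per-token timestamps. This is a minimal sketch assuming the load generator records an arrival time and one timestamp per output token, in seconds; the helper names and the sample latencies are invented, and the percentile uses the simple nearest-rank method.

```python
from math import ceil

def request_metrics(arrival_s: float, token_times_s: list) -> tuple:
    """Return (TTFT in ms, mean time per output token in ms) for one request."""
    ttft_ms = (token_times_s[0] - arrival_s) * 1000
    decode_tokens = len(token_times_s) - 1
    if decode_tokens == 0:
        return ttft_ms, 0.0
    tpot_ms = (token_times_s[-1] - token_times_s[0]) * 1000 / decode_tokens
    return ttft_ms, tpot_ms

def nearest_rank_percentile(values: list, pct: float) -> float:
    # Nearest-rank percentile: the smallest value with at least pct% of samples at or below it.
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

ttft_ms, tpot_ms = request_metrics(arrival_s=0.0, token_times_s=[0.25, 0.30, 0.35])
print(f"TTFT: {ttft_ms:.0f} ms, time per output token: {tpot_ms:.0f} ms")

# Invented TTFT samples in ms; use measured values from the target topology.
ttft_samples_ms = [210, 250, 190, 400, 230, 220, 260, 240]
print(f"p95 TTFT: {nearest_rank_percentile(ttft_samples_ms, 95)} ms")
```

Reporting percentiles rather than means keeps burst-induced queueing visible.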

The worst mistake is counting total VRAM and declaring victory. A four-GPU box with enough raw memory can still miss latency targets if the interconnect is saturated or the scheduler can't fill the pipeline.

Model parallelism is a capacity tool for models that need multiple GPUs. Replicas are a throughput tool for many independent requests when each model copy fits on one GPU. Combine both when the product needs a large model and real throughput.

select-config-under-slo.py

# Illustrative benchmark rows; latencies are assumed to be in milliseconds.
benchmarks = [
    {"name": "TP=1", "fits": False, "ttft_p95": 230, "tpot_p95": 34, "tps": 650},
    {"name": "TP=2", "fits": True, "ttft_p95": 310, "tpot_p95": 45, "tps": 620},
    {"name": "TP=4", "fits": True, "ttft_p95": 430, "tpot_p95": 62, "tps": 700},
]
ttft_limit, tpot_limit = 400, 55
# Keep only configurations that fit in memory and meet both latency limits.
eligible = [b for b in benchmarks if b["fits"] and b["ttft_p95"] <= ttft_limit and b["tpot_p95"] <= tpot_limit]
print(f"eligible configurations: {[b['name'] for b in eligible]}")
if eligible:
    # Among eligible configurations, prefer the highest throughput.
    best = max(eligible, key=lambda b: b["tps"])
    print(f"highest-throughput configuration inside SLO: {best['name']}")
else:
    print("no configuration fits and meets the SLO")

Output

eligible configurations: ['TP=2']
highest-throughput configuration inside SLO: TP=2

Practice: choose sharding or replicas

Consider four serving workloads:

Workload	Better first move	Reason
Gemma 4 12B code-classification model, high traffic	Replicas	One copy fits on one GPU, so duplicate it for throughput
Qwen3.6-35B-A3B reasoning model, low traffic	Tensor or expert-aware parallelism	One copy is too tight for a conservative one-GPU serving budget
Qwen3.6-35B-A3B, high traffic	Parallelism plus replicas	One request needs a sharded copy, and traffic needs multiple copies
Gemma 4 12B, 64K repository prompts	Measure sequence pressure	Long prefill or KV cache may dominate before weights do

This is the decision habit to build: ask whether the bottleneck is model weight memory, request volume, context length, or communication. Different bottlenecks need different tools.
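
The table's logic can be written as a small decision helper. This is a hypothetical sketch that mirrors the four rows above; the function name, its boolean inputs, and the fallback for a small low-traffic model are assumptions, and a real plan still needs the measurements described earlier.

```python
def first_move(fits_one_gpu: bool, high_traffic: bool, long_context: bool) -> str:
    """Hypothetical helper: name the first thing to try for a workload."""
    if not fits_one_gpu:
        # Weight memory is the bottleneck, so one copy must be sharded.
        if high_traffic:
            return "parallelism plus replicas"
        return "tensor or expert-aware parallelism"
    if long_context:
        # Weights fit, but prefill or KV cache may dominate.
        return "measure sequence pressure"
    if high_traffic:
        # One copy fits, so request volume is the bottleneck.
        return "replicas"
    return "single GPU"  # assumption: not a row in the table above

workloads = {
    "12B classifier, high traffic": (True, True, False),
    "35B MoE, low traffic": (False, False, False),
    "35B MoE, high traffic": (False, True, False),
    "12B, 64K prompts": (True, False, True),
}
for name, flags in workloads.items():
    print(f"{name}: {first_move(*flags)}")
```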

replicas-versus-shards.py

gpu_budget = 8
measured_tps_per_replica = {"Gemma4 12B on 1 GPU": 600, "Qwen3.6-35B-A3B TP=2": 260}
gpus_per_replica = {"Gemma4 12B on 1 GPU": 1, "Qwen3.6-35B-A3B TP=2": 2}

for name, per_replica_tps in measured_tps_per_replica.items():
    replicas = gpu_budget // gpus_per_replica[name]
    print(f"{name}: {replicas} replicas, {replicas * per_replica_tps} measured aggregate TPS")

Output

Gemma4 12B on 1 GPU: 8 replicas, 4800 measured aggregate TPS
Qwen3.6-35B-A3B TP=2: 4 replicas, 1040 measured aggregate TPS

Mastery check

You should be able to explain:

Why a model may need sharding for inference even after quantization.
How tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, expert parallelism, and replicas solve different bottlenecks.
Why sequence parallelism and context parallelism aren't interchangeable names for long-context serving.
Why the Megatron column-then-row pattern needs only two all-reduces per layer, and what that costs on every decode token.
Why tensor parallelism depends on fast interconnect (NVLink vs PCIe vs network) and can become communication-bound.
Why pipeline parallelism can save memory but hurt small-batch latency through bubbles.
How disaggregated prefill and decode let each phase pick its own parallelism layout.
How to design a first serving plan from memory buckets, traffic shape, context length, and latency targets.

Evaluation rubric

Strong answers should:

identify the real bottleneck before recommending replicas or sharding
separate memory-fit math from latency and communication measurements
explain why tensor, pipeline, context, and expert parallelism solve different limits
connect interconnect speed directly to TTFT and decode behavior
name the smallest production-worthy first plan instead of the fanciest one

Common pitfalls

Symptom: The model fits on paper, but the runtime still hits OOM. Cause: You counted weight memory and forgot KV cache, runtime buffers, allocator slack, and bursty long-context headroom. Fix: Size memory by bucket, not checkpoint size alone.
Symptom: Latency gets worse after you spread the model across more GPUs. Cause: Cross-node tensor-parallel collectives now dominate decode. Fix: Keep the tensor-parallel group inside one fast node first. Only cross weaker links when pipeline behavior and queue depth justify it.
Symptom: TTFT rises even though the bigger shard plan finally fits. Cause: Extra communication and startup coordination removed less pressure than they added. Fix: Measure TTFT and decode TPS directly. More GPUs aren't automatically a serving win.
Symptom: You shard a model that already fits, but throughput barely improves. Cause: Traffic volume was the real bottleneck, so communication replaced a simpler replica plan. Fix: Use replicas first when one full model copy fits and requests are independent.

Next Step

Continue to Model Quantization: GPTQ, AWQ & GGUF

Model parallelism splits a model across GPUs; quantization shrinks the bytes each GPU must store and move, so the next chapter teaches the main compression lever for serving.


References

1. Qwen Team (2026). Qwen3.6-35B-A3B.
2. vLLM Project (2026). Distributed Inference and Serving. Official documentation.
3. Shoeybi, M., et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.
4. NVIDIA (2024). NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference.
5. NVIDIA (2026). Parallelism Strategies Guide.
6. Liu, H., et al. (2024). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint.
7. Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.
8. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv preprint.
9. Zhong, Y., et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.
10. Patel, P., et al. (2023). Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

Back to Topics

LearnInference & Production ScaleModel Parallelism for LLM Inference

🚀HardInference Optimization

Model Parallelism for LLM Inference

Learn tensor parallelism, pipeline parallelism, context parallelism, and how multi-GPU serving trades memory capacity for communication overhead.

21 min read

Learning path

Step 133 of 158 in the full curriculum

Scaling LLM Inference Model Quantization: GPTQ, AWQ & GGUF

HBM, KV cache, and scheduler policy limit single-node serving. The next question is what changes when a single large language model (LLM) copy no longer fits comfortably on one accelerator.

Why inference sharding differs from training

The same names appear in both worlds, but the trade-offs shift:

Technique	Training concern	Inference concern
Tensor parallelism	Split matmuls and gradients	Split weights and activations with low latency
Pipeline parallelism	Fill stages with microbatches	Avoid pipeline bubbles during generation
Sequence parallelism	Reduce selected activation memory alongside tensor parallelism	Usually a training optimization, not shorthand for long-context inference
Context parallelism	Split long-sequence work across devices	Split long prompts, attention work, or KV-cache state when runtime supports it
Data parallel serving	Replicate model	Increase throughput for many requests

Before picking a sharding strategy, do the simplest memory math:

text

weight memory ~= parameters x bytes per parameter
serving memory ~= weights + KV cache + runtime buffers + safety margin

multi-gpu-sizing-sketch.py

from math import ceil

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

def next_power_of_two(value: int) -> int:
    size = 1
    while size < value:
        size *= 2
    return size

params_billion = 35
gpu_gb = 80
reserve_fraction = 0.20
usable_gb = gpu_gb * (1 - reserve_fraction)

bf16_weights = weight_memory_gb(params_billion, bytes_per_param=2)
raw_min_gpus = ceil(bf16_weights / usable_gb)
tp_candidate = next_power_of_two(raw_min_gpus)
int4_weights = weight_memory_gb(params_billion, bytes_per_param=0.5)

print(f"Qwen3.6-35B-A3B BF16 total weights: {bf16_weights:.0f} GB")
print(f"80GB GPU usable with 20% reserve: {usable_gb:.0f} GB")
print(f"minimum GPUs for weights with reserve: {raw_min_gpus}")
print(f"practical TP candidate: {tp_candidate}")
print(f"Qwen3.6-35B-A3B INT4 total weight-only estimate: {int4_weights:.0f} GB")

Output

Qwen3.6-35B-A3B BF16 total weights: 70 GB
80GB GPU usable with 20% reserve: 64 GB
minimum GPUs for weights with reserve: 2
practical TP candidate: 2
Qwen3.6-35B-A3B INT4 total weight-only estimate: 18 GB

weight-shards-leave-room-for-kv.py

weights_gb = 70
gpu_gb = 80
reserved_runtime_gb = 16

for tp_size in (1, 2):
    shard_gb = weights_gb / tp_size
    kv_room_gb = gpu_gb - reserved_runtime_gb - shard_gb
    print(f"TP={tp_size}: {shard_gb:.0f} GB weights/GPU, {kv_room_gb:.0f} GB left for KV")

Output

TP=1: 70 GB weights/GPU, -6 GB left for KV
TP=2: 35 GB weights/GPU, 29 GB left for KV

Tensor parallelism

For a simplified linear layer:

text

y = xW

The Megatron pattern: column then row

In the MLP block Z = (GeLU(xA))B:

Split the first weight A by columns. Each GPU computes GeLU(x A_i) independently. GeLU is element-wise, so no sync is needed before the nonlinearity. (Splitting A by rows would force a sync first, because GeLU(x1 A1 + x2 A2) isn't the sum of the per-shard GeLUs.)
Split the second weight B by rows. The column-sharded output of the first layer is exactly the input layout the row-parallel second layer expects, so the partial results flow through with no intermediate communication. One all-reduce after B sums the partial outputs.

decode-collective-count.py

layers = 80
all_reduces_per_layer = 2
output_tokens = 128

per_token = layers * all_reduces_per_layer
generation_total = per_token * output_tokens

print(f"all-reduces per decode token: {per_token}")
print(f"all-reduces for {output_tokens} output tokens: {generation_total:,}")

Output

all-reduces per decode token: 160
all-reduces for 128 output tokens: 20,480

collective-latency-floor.py

collectives_per_token = 160

for assumed_collective_latency_us in (5, 20):
    floor_ms = collectives_per_token * assumed_collective_latency_us / 1000
    print(f"{assumed_collective_latency_us} us startup -> {floor_ms:.1f} ms/token before payload transfer")

Output

5 us startup -> 0.8 ms/token before payload transfer
20 us startup -> 3.2 ms/token before payload transfer

Use tensor parallelism when:

The model doesn't fit on one GPU.
The GPUs have fast interconnect.
You need one request to use multiple GPUs at once.
Batch sizes aren't large enough to rely only on model replicas.

Megatron-LM popularized practical tensor model parallelism for large transformers.^[3] In serving stacks, the same idea appears as tensor_parallel_size.

Pipeline parallelism

Pipeline parallelism splits layers into stages. GPU 0 owns early layers, GPU 1 owns middle layers, and GPU 2 owns later layers. A token's hidden state moves through the stages.

Use pipeline parallelism when:

Tensor parallelism alone doesn't fit the model.
The model must cross node boundaries.
You can batch enough work to keep stages busy.
You can tolerate slightly more scheduling complexity.

pipeline-bubble-utilization.py

def ideal_pipeline_utilization(stages: int, microbatches: int) -> float:
    return microbatches / (microbatches + stages - 1)

for microbatches in (1, 4, 16):
    utilization = ideal_pipeline_utilization(stages=4, microbatches=microbatches)
    print(f"4 stages, {microbatches:2d} microbatches: {utilization:.1%} ideal utilization")

Output

stages,  1 microbatches: 25.0% ideal utilization
stages,  4 microbatches: 57.1% ideal utilization
stages, 16 microbatches: 84.2% ideal utilization

Sequence parallelism and context parallelism aren't interchangeable

context-parallel-kv-capacity.py

tokens = 1_000_000
layers = 80
kv_heads = 8
head_dim = 128
dtype_bytes = 2
devices = 4

kv_gib = 2 * tokens * layers * kv_heads * head_dim * dtype_bytes / 1024**3
print(f"single-request KV footprint: {kv_gib:.1f} GiB")
print(f"even {devices}-way context shard: {kv_gib / devices:.1f} GiB/device before overhead")

Output

single-request KV footprint: 305.2 GiB
even 4-way context shard: 76.3 GiB/device before overhead

Expert parallelism for MoE models

expert-routing-imbalance.py

tokens_by_device = [48, 19, 17, 16]
average = sum(tokens_by_device) / len(tokens_by_device)
peak_ratio = max(tokens_by_device) / average

print(f"average routed tokens/device: {average:.1f}")
print(f"hottest device tokens: {max(tokens_by_device)}")
print(f"hotspot ratio: {peak_ratio:.2f}x average")

Output

average routed tokens/device: 25.0
hottest device tokens: 48
hotspot ratio: 1.92x average

Sizing example

Suppose you need to serve Qwen3.6-35B-A3B for a codebase-reasoning assistant:

Requirement	Implication
Full BF16 checkpoint exceeds conservative one-GPU budget	Need tensor, pipeline, or expert-aware placement
8K context and many concurrent users	KV cache budget matters
Low TTFT	Avoid slow cross-node communication
High traffic bursts	Consider replicas plus batching
Strict data boundary	Maybe self-host rather than hosted API

serving-memory-budget.py

gpu_count = 2
gpu_capacity_gb = 80
weights_gb = 70
runtime_reserve_gb = 32
measured_kv_per_request_gb = 0.75

kv_budget_gb = gpu_count * gpu_capacity_gb - weights_gb - runtime_reserve_gb
arithmetic_batch_ceiling = int(kv_budget_gb / measured_kv_per_request_gb)

print(f"KV budget after weights and reserve: {kv_budget_gb} GB")
print(f"arithmetic request ceiling at measured KV/request: {arithmetic_batch_ceiling}")
print("Latency and burst headroom determine the admitted batch below this ceiling.")

Output

KV budget after weights and reserve: 58 GB
arithmetic request ceiling at measured KV/request: 77
Latency and burst headroom determine the admitted batch below this ceiling.
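
The aggregate budget hides a per-device constraint: each GPU has to hold its own weight shard and its own runtime reserve. The sketch below repeats the arithmetic per GPU, assuming the weights shard evenly and assuming the 32 GB reserve above corresponds to 16 GB per GPU. Both are simplifications to check against a real deployment.

```python
def per_gpu_kv_budget_gb(gpu_count, gpu_capacity_gb, weights_gb, reserve_per_gpu_gb):
    # Assumes weights shard evenly and each GPU keeps its own runtime reserve.
    return gpu_capacity_gb - weights_gb / gpu_count - reserve_per_gpu_gb

for gpu_count in (1, 2, 4):
    budget = per_gpu_kv_budget_gb(
        gpu_count, gpu_capacity_gb=80, weights_gb=70, reserve_per_gpu_gb=16
    )
    status = "fits" if budget > 0 else "does not fit"
    print(f"{gpu_count} GPU(s): {budget:.1f} GB per GPU for KV ({status})")
```

With one GPU the budget is negative, which is the "exceeds conservative one-GPU budget" row in the table. With two GPUs each device keeps 29 GB for KV, matching the 58 GB total.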

What to measure

Multi-GPU inference should be measured with serving metrics, not offline tokens per second alone.

Track:

Time to first token.
Decode tokens per second.
Aggregate throughput.
GPU memory used by weights.
GPU memory used by KV cache.
Interconnect utilization.
Queue time under burst traffic.
Error rate when one GPU or node fails.
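
The first two metrics can be derived from per-request timestamps. The sketch below uses a nearest-rank percentile and a handful of hypothetical request timings; a real benchmark would need far more samples before a p95 means anything.

```python
import math

def percentile(values, pct):
    # Nearest-rank percentile: smallest value with at least pct% of samples at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request timings in milliseconds:
# (arrival, first token, last token, output tokens)
requests = [
    (0, 180, 2180, 50),
    (10, 250, 2710, 60),
    (20, 400, 3340, 70),
    (30, 310, 2350, 40),
]

ttft_ms = [first - arrival for arrival, first, _, _ in requests]
# Time per output token excludes the first token, which TTFT already covers.
tpot_ms = [(last - first) / (tokens - 1) for _, first, last, tokens in requests]

print(f"TTFT p50/p95: {percentile(ttft_ms, 50)} / {percentile(ttft_ms, 95)} ms")
print(f"TPOT p95: {percentile(tpot_ms, 95):.1f} ms")
```

Keeping TTFT and time per output token (TPOT) separate matters because sharding choices can move them in different directions.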

select-config-under-slo.py

benchmarks = [
    {"name": "TP=1", "fits": False, "ttft_p95": 230, "tpot_p95": 34, "tps": 650},
    {"name": "TP=2", "fits": True, "ttft_p95": 310, "tpot_p95": 45, "tps": 620},
    {"name": "TP=4", "fits": True, "ttft_p95": 430, "tpot_p95": 62, "tps": 700},
]
ttft_limit, tpot_limit = 400, 55
eligible = [b for b in benchmarks if b["fits"] and b["ttft_p95"] <= ttft_limit and b["tpot_p95"] <= tpot_limit]
best = max(eligible, key=lambda b: b["tps"])
print(f"eligible configurations: {[b['name'] for b in eligible]}")
print(f"highest-throughput configuration inside SLO: {best['name']}")

Output

eligible configurations: ['TP=2']
highest-throughput configuration inside SLO: TP=2
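
The selection above assumes at least one configuration passes the SLO. A slightly more defensive sketch, using the same benchmark numbers and a hypothetical `pick_config` helper, returns `None` instead of raising when nothing qualifies.

```python
def pick_config(benchmarks, ttft_limit, tpot_limit):
    eligible = [
        b for b in benchmarks
        if b["fits"] and b["ttft_p95"] <= ttft_limit and b["tpot_p95"] <= tpot_limit
    ]
    # max(..., default=None) avoids a ValueError when nothing meets the SLO.
    return max(eligible, key=lambda b: b["tps"], default=None)

benchmarks = [
    {"name": "TP=1", "fits": False, "ttft_p95": 230, "tpot_p95": 34, "tps": 650},
    {"name": "TP=2", "fits": True, "ttft_p95": 310, "tpot_p95": 45, "tps": 620},
    {"name": "TP=4", "fits": True, "ttft_p95": 430, "tpot_p95": 62, "tps": 700},
]

for ttft_limit, tpot_limit in ((400, 55), (300, 40)):
    best = pick_config(benchmarks, ttft_limit, tpot_limit)
    label = best["name"] if best else "none (relax the SLO or change the hardware plan)"
    print(f"TTFT<={ttft_limit} ms, TPOT<={tpot_limit} ms: {label}")
```

An empty result is useful information: it says the SLO and the hardware plan disagree, which is better discovered in a benchmark script than in production.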

Practice: choose sharding or replicas

Consider four serving workloads:

Workload	Better first move	Reason
Gemma 4 12B code-classification model, high traffic	Replicas	One copy fits on one GPU, so duplicate it for throughput
Qwen3.6-35B-A3B reasoning model, low traffic	Tensor or expert-aware parallelism	One copy is too tight for a conservative one-GPU serving budget
Qwen3.6-35B-A3B, high traffic	Parallelism plus replicas	One request needs a sharded copy, and traffic needs multiple copies
Gemma 4 12B, 64K repository prompts	Measure sequence pressure	Long prefill or KV cache may dominate before weights do

This is the decision habit to build: ask whether the bottleneck is model weight memory, request volume, context length, or communication. Different bottlenecks need different tools.
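
The table can be read as a small decision function. The sketch below is one possible encoding, not a rule from any serving framework: `first_move` and its three flags are hypothetical, and the "single replica" fallback for a small, low-traffic model is an assumption that does not appear in the table.

```python
def first_move(fits_on_one_gpu: bool, high_traffic: bool, long_context: bool) -> str:
    # Weight fit comes first: a model that does not fit must be sharded.
    if not fits_on_one_gpu:
        return "parallelism plus replicas" if high_traffic else "tensor or expert-aware parallelism"
    # A model that fits can still be limited by prefill time or KV cache.
    if long_context:
        return "measure sequence pressure"
    # Assumption: with no pressure at all, one replica is enough to start.
    return "replicas" if high_traffic else "single replica"

workloads = {
    "Gemma 4 12B, high traffic": (True, True, False),
    "Qwen3.6-35B-A3B, low traffic": (False, False, False),
    "Qwen3.6-35B-A3B, high traffic": (False, True, False),
    "Gemma 4 12B, 64K prompts": (True, False, True),
}
for name, flags in workloads.items():
    print(f"{name}: {first_move(*flags)}")
```

The function only names a starting point. Measurements still decide whether the first move was enough.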

replicas-versus-shards.py

gpu_budget = 8
# Per-replica throughput comes from benchmarks; the aggregate below is a linear projection.
measured_tps_per_replica = {"Gemma 4 12B on 1 GPU": 600, "Qwen3.6-35B-A3B TP=2": 260}
gpus_per_replica = {"Gemma 4 12B on 1 GPU": 1, "Qwen3.6-35B-A3B TP=2": 2}

for name, per_replica_tps in measured_tps_per_replica.items():
    # Whole replicas only: leftover GPUs that cannot hold a full copy are unused.
    replicas = gpu_budget // gpus_per_replica[name]
    print(f"{name}: {replicas} replicas, {replicas * per_replica_tps} projected aggregate TPS")

Output

Gemma 4 12B on 1 GPU: 8 replicas, 4800 projected aggregate TPS
Qwen3.6-35B-A3B TP=2: 4 replicas, 1040 projected aggregate TPS
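
The same numbers answer the reverse question: how many GPUs does a traffic target need? The sketch below assumes throughput scales linearly with replica count and uses a hypothetical 2,000 TPS target; `gpus_for_target` is an illustrative helper.

```python
import math

def gpus_for_target(target_tps, tps_per_replica, gpus_per_replica):
    # Assumes replicas are independent, so aggregate throughput scales linearly.
    replicas = math.ceil(target_tps / tps_per_replica)
    return replicas, replicas * gpus_per_replica

target_tps = 2000  # hypothetical traffic target
plans = {
    "Gemma 4 12B on 1 GPU": (600, 1),
    "Qwen3.6-35B-A3B TP=2": (260, 2),
}
for name, (tps, gpus) in plans.items():
    replicas, total_gpus = gpus_for_target(target_tps, tps, gpus)
    print(f"{name}: {replicas} replicas on {total_gpus} GPUs for {target_tps} TPS")
```

Under these assumptions the small model reaches the target with 4 GPUs, while the sharded model needs 16.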

Mastery check

You should be able to explain:

Why a model may need sharding for inference even after quantization.
How tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, expert parallelism, and replicas solve different bottlenecks.
Why sequence parallelism and context parallelism aren't interchangeable names for long-context serving.
Why the Megatron column-then-row pattern needs only two all-reduces per layer, and what that costs on every decode token.
Why tensor parallelism depends on fast interconnect (NVLink vs PCIe vs network) and can become communication-bound.
Why pipeline parallelism can save memory but hurt small-batch latency through bubbles.
How disaggregated prefill and decode let each phase pick its own parallelism layout.
How to design a first serving plan from memory buckets, traffic shape, context length, and latency targets.
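
For the all-reduce item, a back-of-the-envelope calculation helps show what "two all-reduces per layer" costs on each decoded token. The sketch below uses hypothetical model dimensions (48 layers, hidden size 4096, BF16 activations) and the standard ring all-reduce volume of 2(N-1)/N times the payload per GPU. It estimates bytes only and says nothing about latency.

```python
def allreduce_payload_bytes_per_token(layers, hidden_size, dtype_bytes, allreduces_per_layer=2):
    # Each all-reduce carries one hidden-state vector per decoded token.
    return layers * allreduces_per_layer * hidden_size * dtype_bytes

def ring_bytes_sent_per_gpu(payload_bytes, tp_size):
    # A ring all-reduce sends 2 * (N - 1) / N of the payload from each GPU.
    return 2 * (tp_size - 1) / tp_size * payload_bytes

layers, hidden_size, dtype_bytes = 48, 4096, 2  # hypothetical dimensions, BF16
payload = allreduce_payload_bytes_per_token(layers, hidden_size, dtype_bytes)
print(f"collectives per decoded token: {layers * 2}")
for tp_size in (2, 4, 8):
    sent = ring_bytes_sent_per_gpu(payload, tp_size)
    print(f"TP={tp_size}: about {sent / 1024:.0f} KiB sent per GPU per token")
```

The byte counts are small, so during decode the fixed latency of each collective often matters more than bandwidth. That is one reason the interconnect decides whether tensor parallelism helps.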

Evaluation rubric

Strong answers should:

identify the real bottleneck before recommending replicas or sharding
separate memory-fit math from latency and communication measurements
explain why tensor, pipeline, context, and expert parallelism solve different limits
connect interconnect speed directly to TTFT and decode behavior
name the smallest production-worthy first plan instead of the fanciest one

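Follow-up questions

When is tensor parallelism the right default?
Why can pipeline parallelism hurt latency for small batches?
Why can more GPUs make TTFT worse?
Why does Megatron split the first MLP matrix by columns and the second by rows?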

Common pitfalls

Symptom: The model fits on paper, but the runtime still hits OOM. Cause: You counted weight memory and forgot KV cache, runtime buffers, allocator slack, and bursty long-context headroom. Fix: Size memory by bucket, not checkpoint size alone.
Symptom: Latency gets worse after you spread the model across more GPUs. Cause: Cross-node tensor-parallel collectives now dominate decode. Fix: Keep the tensor-parallel group inside one fast node first. Only cross weaker links when pipeline behavior and queue depth justify it.
Symptom: TTFT rises even though the bigger shard plan finally fits. Cause: The added communication and startup coordination cost more latency than the extra memory headroom saved. Fix: Measure TTFT and decode TPS directly. More GPUs aren't automatically a serving win.
Symptom: You shard a model that already fits, but throughput barely improves. Cause: Traffic volume was the real bottleneck, so communication replaced a simpler replica plan. Fix: Use replicas first when one full model copy fits and requests are independent.

Next Step

Continue to Model Quantization: GPTQ, AWQ & GGUF

Model parallelism splits a model across GPUs; quantization shrinks the bytes each GPU must store and move, so the next chapter teaches the main compression lever for serving.


References

[1] Qwen3.6-35B-A3B. Qwen Team · 2026
[2] Distributed Inference and Serving. vLLM Project · 2026 · Official documentation
[3] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Shoeybi, M., et al. · 2019
[4] NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference. NVIDIA · 2024
[5] Parallelism Strategies Guide. NVIDIA · 2026
[6] Ring Attention with Blockwise Transformers for Near-Infinite Context. Liu, H., et al. · 2024 · arXiv preprint
[7] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Shazeer, N., et al. · 2017 · ICLR 2017
[8] DeepSeek-V3 Technical Report. DeepSeek-AI · 2024 · arXiv preprint
[9] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Zhong, Y., et al. · 2024 · OSDI 2024
[10] Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023


