LeetLLM
Model Parallelism for LLM Inference

Learn tensor parallelism, pipeline parallelism, context parallelism, and how multi-GPU serving trades memory capacity for communication overhead.


HBM, KV cache, and scheduler policy limit single-node serving. The next question is what changes when a single large language model (LLM) copy no longer fits comfortably on one accelerator.

Serving Qwen3.6-35B-A3B for a codebase assistant that reads long diffs, build logs, and architecture notes creates two memory questions. Model-weight memory decides whether one replica fits at all, while per-request KV state decides how many long-context sessions can run at once.

Qwen3.6-35B-A3B is a sparse MoE checkpoint with about 35B total parameters and about 3B activated per token.[1] At BF16, the full checkpoint is roughly 70 GB before KV cache, runtime buffers, and allocator headroom. The A3B suffix helps reason about active compute, but it doesn't mean the serving system only needs 3B parameters worth of memory.

Model parallelism is the set of techniques that split one model across multiple accelerators. The split can make a large model fit, but every added device also creates communication or scheduling work.

Figure: Inference parallelism decision frame that starts from the bottleneck, maps it to a split axis, and highlights the communication or memory cost you pay next.
Parallelism choices start with the pressure point. Replicas solve traffic. Shards solve fit problems in weights, layer depth, or long-context state.

Why inference sharding differs from training

Distributed training cares about gradients, optimizer states, activation checkpointing, and throughput over many examples. Distributed inference cares about time to first token (TTFT), tokens per second (TPS), KV-cache memory, and request scheduling.

The same names appear in both worlds, but the trade-offs shift:

| Technique | Training concern | Inference concern |
| --- | --- | --- |
| Tensor parallelism | Split matmuls and gradients | Split weights and activations with low latency |
| Pipeline parallelism | Fill stages with microbatches | Avoid pipeline bubbles during generation |
| Sequence parallelism | Reduce selected activation memory alongside tensor parallelism | Usually a training optimization, not shorthand for long-context inference |
| Context parallelism | Split long-sequence work across devices | Split long prompts, attention work, or KV-cache state when runtime supports it |
| Data parallel serving | Replicate model | Increase throughput for many requests |

Serving runtimes expose topology controls such as tensor- and pipeline-parallel sizes. Treat those knobs as a deployment mechanism, not a performance guarantee: the selected topology still needs memory accounting and latency benchmarks. vLLM's official scaling guide recommends one GPU when the model fits, single-node tensor parallelism when it needs several GPUs in one node, and tensor plus pipeline parallelism when it exceeds one node.[2]
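The scaling guidance above can be sketched as a small decision helper. This is a minimal illustration, not part of vLLM: `first_topology` and its arguments are hypothetical names, and the helper counts weights only, so KV cache, attention-head divisibility, and latency benchmarks still need separate checks.

```python
from math import ceil

def first_topology(weights_gb: float, usable_gb_per_gpu: float, gpus_per_node: int) -> tuple[int, int]:
    """Return a starting (tensor_parallel_size, pipeline_parallel_size) pair.

    Hypothetical helper: it counts weights only, so KV cache and latency
    targets still need their own accounting and benchmarks.
    """
    gpus_needed = ceil(weights_gb / usable_gb_per_gpu)
    if gpus_needed <= 1:
        return 1, 1                      # model fits: one GPU, no sharding
    if gpus_needed <= gpus_per_node:
        return gpus_needed, 1            # fits in one node: tensor parallelism only
    nodes = ceil(gpus_needed / gpus_per_node)
    return gpus_per_node, nodes          # exceeds one node: TP inside, PP across nodes

for weights_gb in (16, 70, 1300):
    tp, pp = first_topology(weights_gb, usable_gb_per_gpu=64, gpus_per_node=8)
    print(f"{weights_gb:>5} GB weights -> tensor_parallel_size={tp}, pipeline_parallel_size={pp}")
```

The result is only a first candidate. A tensor-parallel size that doesn't divide the model's attention heads, for example, would need to be rounded to a supported value before benchmarking.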

Before picking a sharding strategy, do the simplest memory math:

    weight memory  ~= parameters x bytes per parameter
    serving memory ~= weights + KV cache + runtime buffers + safety margin

For BF16 or FP16 weights, a parameter takes 2 bytes. That makes Qwen3.6-35B-A3B about 70 GB by total parameters before a single prompt arrives. KV cache grows with active requests, context length, layers, KV heads, head dimension, and bytes per value. Runtime buffers and memory fragmentation add more headroom. "Enough VRAM" means all of those buckets fit at the traffic level you plan to serve, not the checkpoint file or active-parameter count alone.

multi-gpu-sizing-sketch.py

    from math import ceil

    def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
        return params_billion * bytes_per_param

    def next_power_of_two(value: int) -> int:
        size = 1
        while size < value:
            size *= 2
        return size

    params_billion = 35
    gpu_gb = 80
    reserve_fraction = 0.20
    usable_gb = gpu_gb * (1 - reserve_fraction)

    bf16_weights = weight_memory_gb(params_billion, bytes_per_param=2)
    raw_min_gpus = ceil(bf16_weights / usable_gb)
    tp_candidate = next_power_of_two(raw_min_gpus)
    int4_weights = weight_memory_gb(params_billion, bytes_per_param=0.5)

    print(f"Qwen3.6-35B-A3B BF16 total weights: {bf16_weights:.0f} GB")
    print(f"80GB GPU usable with 20% reserve: {usable_gb:.0f} GB")
    print(f"minimum GPUs for weights with reserve: {raw_min_gpus}")
    print(f"practical TP candidate: {tp_candidate}")
    print(f"Qwen3.6-35B-A3B INT4 total weight-only estimate: {int4_weights:.0f} GB")

Output

    Qwen3.6-35B-A3B BF16 total weights: 70 GB
    80GB GPU usable with 20% reserve: 64 GB
    minimum GPUs for weights with reserve: 2
    practical TP candidate: 2
    Qwen3.6-35B-A3B INT4 total weight-only estimate: 18 GB

This calculator is deliberately conservative but incomplete. It only sizes weights plus a reserve. Production sizing still needs KV cache, activation buffers, tensor-parallel divisibility, interconnect measurements, and latency SLOs.

weight-shards-leave-room-for-kv.py

    weights_gb = 70
    gpu_gb = 80
    reserved_runtime_gb = 16

    for tp_size in (1, 2):
        shard_gb = weights_gb / tp_size
        kv_room_gb = gpu_gb - reserved_runtime_gb - shard_gb
        print(f"TP={tp_size}: {shard_gb:.0f} GB weights/GPU, {kv_room_gb:.0f} GB left for KV")

Output

    TP=1: 70 GB weights/GPU, -6 GB left for KV
    TP=2: 35 GB weights/GPU, 29 GB left for KV
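To turn the 29 GB of KV room per GPU into a token budget, a rough sketch can apply the KV-cache factors listed earlier (layers, KV heads, head dimension, bytes per value). The attention dimensions below are hypothetical placeholders, not Qwen3.6-35B-A3B's published configuration, and the sketch assumes KV heads shard evenly across the tensor-parallel group.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # The factor 2 covers one key vector and one value vector per head per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical attention dimensions, chosen only for illustration.
layers, kv_heads, head_dim, dtype_bytes = 48, 8, 128, 2
tp_size = 2
kv_room_gb_per_gpu = 29  # taken from the TP=2 row of the sizing output

per_token_total = kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
per_token_per_gpu = per_token_total // tp_size  # assumes KV heads split evenly across TP ranks
tokens_that_fit = kv_room_gb_per_gpu * 10**9 // per_token_per_gpu
sessions_at_8k = tokens_that_fit // 8192

print(f"KV bytes per token (all GPUs): {per_token_total:,}")
print(f"cached tokens that fit per TP group: {tokens_that_fit:,}")
print(f"full 8K-token sessions that fit: {sessions_at_8k}")
```

The session count is an arithmetic ceiling. Fragmentation, block-granular allocation, and burst headroom lower the number a scheduler can admit in practice.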

Tensor parallelism

Tensor parallelism splits large matrix operations across GPUs. In a transformer layer, the model contains big linear projections for attention and feed-forward blocks. Tensor parallelism shards those weights so each GPU owns part of the matrix.

For a simplified linear layer:

    y = xW

you might split W across four GPUs. Each GPU computes part of the output, then the system communicates to combine results. Interconnect decides whether that split pays off. Tensor parallelism is strongest on GPUs connected by fast NVLink (NVIDIA high-bandwidth GPU-to-GPU interconnect) or similar high-bandwidth links.

The Megatron pattern: column then row

Naively, every sharded matmul would need a sync to glue the pieces back together. Megatron-LM avoids that by choosing the split directions so each transformer sub-block needs only one synchronization.[3] It chains a column-parallel layer into a row-parallel layer.

In the MLP block Z = (GeLU(xA))B:

  • Split the first weight A by columns. Each GPU computes GeLU(x A_i) independently. GeLU is element-wise, so no sync is needed before the nonlinearity. (Splitting A by rows would force a sync first, because GeLU(x1 A1 + x2 A2) isn't the sum of the per-shard GeLUs.)
  • Split the second weight B by rows. The column-sharded output of the first layer is exactly the input layout the row-parallel second layer expects, so the partial results flow through with no intermediate communication. One all-reduce after B sums the partial outputs.
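The column-then-row pairing can be checked numerically. The sketch below simulates two "GPUs" with plain Python lists; the matrices are toy values and the helper names (`matmul`, `gelu`, `add`) are illustrative, not Megatron-LM APIs. The single `add` stands in for the one all-reduce after B.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    # Exact GeLU, applied element-wise.
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

x = [[1.0, -2.0, 0.5]]                      # one token, hidden size 3
A = [[0.2, -0.1, 0.4, 0.3],
     [0.5, 0.1, -0.3, 0.2],
     [-0.4, 0.6, 0.1, -0.2]]                # 3 x 4 first projection
B = [[0.1, 0.2, -0.1],
     [0.3, -0.2, 0.4],
     [-0.5, 0.1, 0.2],
     [0.2, 0.3, -0.3]]                      # 4 x 3 second projection

reference = matmul(gelu(matmul(x, A)), B)   # unsharded result

# Column-parallel A: each shard owns two output columns.
A_shards = [[row[0:2] for row in A], [row[2:4] for row in A]]
# Row-parallel B: each shard owns the two matching input rows.
B_shards = [B[0:2], B[2:4]]

# Each shard runs GeLU(x A_i) B_i locally with no communication in between.
partials = [matmul(gelu(matmul(x, A_i)), B_i) for A_i, B_i in zip(A_shards, B_shards)]

# The one all-reduce: sum the partial outputs across shards.
combined = add(partials[0], partials[1])
max_error = max(abs(r - c) for r, c in zip(reference[0], combined[0]))
print(f"max difference from unsharded MLP: {max_error:.2e}")
```

The difference is at floating-point rounding level, which confirms that the only cross-shard step needed is the final sum.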

The attention block follows the same shape. The query, key, and value projections are split column-wise, which maps cleanly onto independent attention heads, and the output projection is split row-wise. That gives one all-reduce after attention.

For the dense Megatron layout described here, a tensor-parallel group executes two all-reduces per transformer layer in the forward pass: one for attention and one for the MLP.[3] During decode, that means every generated token pays two collectives per layer. An 80-layer model therefore executes 160 all-reduces per decode step under this layout, unless a runtime changes or fuses the communication scheme.

Figure: Megatron tensor-parallel MLP flow where GPUs compute local column-sharded GeLU activations, feed them directly into row-sharded second projections, and all-reduce only after the paired projections.
Megatron's column-then-row pairing matters: local GeLU shards flow into local row-parallel projections before one all-reduce restores the full block output.

For inference, tensor parallelism often reduces memory pressure and can improve latency for large models, but it adds communication inside layers. If communication is slow, adding GPUs can make serving worse.

The interconnect hierarchy is why tensor parallelism is usually easiest to justify within a fast-link domain. NVIDIA's H100 SXM specification advertises up to 900 GB/s NVLink aggregate bandwidth per GPU.[4] PCIe and cross-node network paths have different bandwidth and latency characteristics, so the same tensor-parallel layout can perform very differently across machines. Because every decode token triggers collectives, benchmark the target topology instead of assuming more GPUs lower latency.

decode-collective-count.py

    layers = 80
    all_reduces_per_layer = 2
    output_tokens = 128

    per_token = layers * all_reduces_per_layer
    generation_total = per_token * output_tokens

    print(f"all-reduces per decode token: {per_token}")
    print(f"all-reduces for {output_tokens} output tokens: {generation_total:,}")

Output

    all-reduces per decode token: 160
    all-reduces for 128 output tokens: 20,480

collective-latency-floor.py

    collectives_per_token = 160

    for assumed_collective_latency_us in (5, 20):
        floor_ms = collectives_per_token * assumed_collective_latency_us / 1000
        print(f"{assumed_collective_latency_us} us startup -> {floor_ms:.1f} ms/token before payload transfer")

Output

    5 us startup -> 0.8 ms/token before payload transfer
    20 us startup -> 3.2 ms/token before payload transfer
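The startup floor above leaves out payload transfer. A rough sketch can compare the two terms for single-sequence decode, using the standard ring all-reduce traffic estimate of 2(n-1)/n times the payload per GPU. The hidden size, effective bandwidth, and startup latency are assumed values for illustration, not measurements of any particular system.

```python
hidden_size = 8192        # hypothetical model width
dtype_bytes = 2           # BF16 activations
batch_tokens = 1          # one sequence decoding one token
tp_size = 4
link_gb_per_s = 450       # assumed effective all-reduce bandwidth per GPU
startup_us = 10           # assumed fixed latency per collective
collectives_per_token = 160

payload_bytes = batch_tokens * hidden_size * dtype_bytes
# A ring all-reduce sends roughly 2 * (n - 1) / n of the payload from each GPU.
wire_bytes = 2 * (tp_size - 1) / tp_size * payload_bytes
transfer_us = wire_bytes / (link_gb_per_s * 1e9) * 1e6
per_token_ms = collectives_per_token * (startup_us + transfer_us) / 1000

print(f"payload per all-reduce: {payload_bytes / 1024:.0f} KiB")
print(f"transfer term: {transfer_us:.3f} us vs startup term: {startup_us} us")
print(f"communication per decode token: {per_token_ms:.2f} ms")
```

Under these assumptions the decode payload is so small that fixed per-collective latency dominates, which is why link latency and collective implementation can matter more than headline bandwidth for small-batch decode.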

Use tensor parallelism when:

  1. The model doesn't fit on one GPU.
  2. The GPUs have fast interconnect.
  3. You need one request to use multiple GPUs at once.
  4. Batch sizes aren't large enough to rely only on model replicas.

Megatron-LM popularized practical tensor model parallelism for large transformers.[3] In serving stacks, the same idea appears as tensor_parallel_size.

Pipeline parallelism

Pipeline parallelism splits layers into stages. GPU 0 owns early layers, GPU 1 owns middle layers, and GPU 2 owns later layers. A token's hidden state moves through the stages.

This reduces memory per GPU because each stage stores only part of the model. It can also reduce communication compared with tensor parallelism because tensors move between stages rather than across every large matmul.

Pipeline parallelism creates bubbles. If only one request is active, stage 2 waits for stage 1, stage 3 waits for stage 2, and so on. Bigger batches or many concurrent requests can fill the pipeline better.

Use pipeline parallelism when:

  1. Tensor parallelism alone doesn't fit the model.
  2. The model must cross node boundaries.
  3. You can batch enough work to keep stages busy.
  4. You can tolerate slightly more scheduling complexity.

Tensor and pipeline parallelism can be combined. For example, eight GPUs can run tensor_parallel_size=4 and pipeline_parallel_size=2, giving two layer stages where each stage is a four-GPU tensor-parallel group. For a multi-node vLLM deployment, the common first layout is tensor parallelism inside each node and pipeline parallelism across nodes. vLLM also recommends considering pipeline parallelism inside one node when GPU count doesn't evenly divide the model or the node lacks NVLink.[2] This may make a larger replica fit, but queue depth and interconnect measurements still decide TTFT and throughput.
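The eight-GPU layout can be written out as a rank table. This is a sketch under stated assumptions: `rank_layout` is a hypothetical helper, the 80-layer depth is illustrative, and real runtimes choose their own rank ordering and may split layers unevenly.

```python
def rank_layout(tp_size: int, pp_size: int, num_layers: int) -> list[dict]:
    """Map each GPU rank to a pipeline stage, a tensor-parallel rank, and a layer range."""
    if num_layers % pp_size != 0:
        raise ValueError("this sketch assumes layers divide evenly across stages")
    layers_per_stage = num_layers // pp_size
    layout = []
    for rank in range(tp_size * pp_size):
        stage = rank // tp_size          # consecutive ranks share a pipeline stage
        first_layer = stage * layers_per_stage
        layout.append({
            "rank": rank,
            "stage": stage,
            "tp_rank": rank % tp_size,
            "layers": (first_layer, first_layer + layers_per_stage - 1),
        })
    return layout

layout = rank_layout(tp_size=4, pp_size=2, num_layers=80)
for entry in layout:
    first, last = entry["layers"]
    print(f"GPU {entry['rank']}: stage {entry['stage']}, TP rank {entry['tp_rank']}, layers {first}-{last}")
```

Ranks 0 to 3 form the first tensor-parallel group and hold a shard of every layer in the first half of the model. Ranks 4 to 7 do the same for the second half, so all-reduces stay inside each group and only hidden states cross the stage boundary.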

Figure: Pipeline parallel timeline comparing one active request, where later stages sit idle, with continuous queued requests that keep most stages occupied.
Pipeline parallelism reduces per-GPU memory by splitting model depth, but queue depth decides whether that split becomes useful throughput or just idle bubbles.
pipeline-bubble-utilization.py

    def ideal_pipeline_utilization(stages: int, microbatches: int) -> float:
        return microbatches / (microbatches + stages - 1)

    for microbatches in (1, 4, 16):
        utilization = ideal_pipeline_utilization(stages=4, microbatches=microbatches)
        print(f"4 stages, {microbatches:2d} microbatches: {utilization:.1%} ideal utilization")

Output

    4 stages,  1 microbatches: 25.0% ideal utilization
    4 stages,  4 microbatches: 57.1% ideal utilization
    4 stages, 16 microbatches: 84.2% ideal utilization

Sequence parallelism and context parallelism aren't interchangeable

Both names mention the token dimension, but they solve different problems. In Megatron Core, sequence parallelism works alongside tensor parallelism: it shards sequence-dimension work in components such as LayerNorm and Dropout to reduce activation memory. Context parallelism partitions the sequence across devices through the transformer layers and is the long-sequence strategy in Megatron's current guide.[5]

For inference, support depends on the runtime and model architecture. Context sharding means the system distributes long-prompt attention work or KV state across devices instead of only splitting weights. Ring Attention is one family of techniques for this problem. Devices hold local query blocks while Key and Value blocks circulate through a ring for blockwise attention. The paper applies this idea to training and inference.[6]
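The ring pattern described above can be simulated on toy data. The sketch below is a simplified, non-causal illustration in pure Python, not the paper's implementation: each simulated device keeps its query block, sees one Key/Value block per step, and folds it into a running softmax state. All names and values are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values):
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(len(values[0]))]

def absorb_block(state, q, keys, values):
    """Fold one K/V block into a running (max, denominator, accumulator) softmax state."""
    running_max, denom, acc = state
    for k, v in zip(keys, values):
        score = dot(q, k) / math.sqrt(len(q))
        new_max = max(running_max, score)
        scale = math.exp(running_max - new_max)   # rescale earlier sums to the new max
        weight = math.exp(score - new_max)
        denom = denom * scale + weight
        acc = [a * scale + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return running_max, denom, acc

devices, tokens_per_device, dim = 3, 2, 2
n = devices * tokens_per_device
# Deterministic toy tensors.
Q = [[math.sin(i + 1 + d) for d in range(dim)] for i in range(n)]
K = [[math.cos(0.5 * i + d) for d in range(dim)] for i in range(n)]
V = [[0.1 * i - 0.2 * d for d in range(dim)] for i in range(n)]

def block(rows, device):
    start = device * tokens_per_device
    return rows[start:start + tokens_per_device]

states = [[(-math.inf, 0.0, [0.0] * dim) for _ in range(tokens_per_device)] for _ in range(devices)]
for step in range(devices):
    for device in range(devices):
        source = (device - step) % devices        # K/V block that has reached this device
        k_block, v_block = block(K, source), block(V, source)
        states[device] = [absorb_block(s, q, k_block, v_block)
                          for s, q in zip(states[device], block(Q, device))]

ring_out = [[a / denom for a in acc] for device_states in states for (_, denom, acc) in device_states]
reference = [full_attention(q, K, V) for q in Q]
max_error = max(abs(r - o) for ref_row, out_row in zip(reference, ring_out)
                for r, o in zip(ref_row, out_row))
print(f"max difference from full attention: {max_error:.2e}")
```

No device ever holds more than one Key/Value block at a time, yet the output matches full attention to rounding error. A real system overlaps the block transfer with the blockwise compute, and causal masking changes which blocks each device can skip.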

For a codebase assistant reading a long repository map plus build logs, sequence length can dominate prefill cost. Tensor parallelism helps with model weights. Prefix caching helps with repeated prefixes. Context-aware serving helps when the prompt itself is large and attention work or KV state needs to be spread out.

Context parallelism isn't the first knob most teams touch. Start with model size, quantization, tensor parallelism, and batching. Reach for context-level techniques when long prompts are the bottleneck and your runtime supports the required communication pattern.

context-parallel-kv-capacity.py

    tokens = 1_000_000
    layers = 80
    kv_heads = 8
    head_dim = 128
    dtype_bytes = 2
    devices = 4

    kv_gib = 2 * tokens * layers * kv_heads * head_dim * dtype_bytes / 1024**3
    print(f"single-request KV footprint: {kv_gib:.1f} GiB")
    print(f"even {devices}-way context shard: {kv_gib / devices:.1f} GiB/device before overhead")

Output

    single-request KV footprint: 305.2 GiB
    even 4-way context shard: 76.3 GiB/device before overhead

Expert parallelism for MoE models

Mixture-of-Experts models add another sharding axis: experts. Instead of every token using every feed-forward block, the router sends each token to a small subset of experts.[7] Expert parallelism places different experts on different GPUs, so the serving system can scale total expert capacity without copying every expert to every device.

Expert parallelism pays in routing communication and load balance. Expert-parallel implementations commonly dispatch tokens to devices that own selected experts and combine results afterward, often using all-to-all-style communication. If many tokens choose the same expert, that expert's device becomes the bottleneck while other devices wait. For dense models, start with tensor/pipeline/context choices. For MoE serving, add expert placement and router-load metrics to the plan.
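The dispatch step and its load-balance cost can be sketched by counting routed tokens per device under two expert placements. The router output and both placement maps are hypothetical toy data, and the helper names are illustrative.

```python
from collections import Counter

def device_loads(routed, placement, num_devices):
    """Count token-to-expert dispatches landing on each device."""
    load = Counter()
    for chosen_experts in routed:
        for expert in chosen_experts:
            load[placement[expert]] += 1
    return [load[d] for d in range(num_devices)]

def hotspot_ratio(loads):
    return max(loads) / (sum(loads) / len(loads))

num_experts, num_devices = 8, 4
# Hypothetical top-2 router output: the two expert ids chosen for each of 8 tokens.
routed = [(0, 1), (0, 5), (1, 0), (0, 3), (2, 0), (0, 7), (1, 4), (0, 6)]

# Two illustrative placements: neighbors together, or experts dealt round-robin.
contiguous = {e: e // (num_experts // num_devices) for e in range(num_experts)}
interleaved = {e: e % num_devices for e in range(num_experts)}

for name, placement in (("contiguous", contiguous), ("interleaved", interleaved)):
    loads = device_loads(routed, placement, num_devices)
    print(f"{name:>11}: loads {loads}, hotspot {hotspot_ratio(loads):.2f}x average")
```

In this toy trace experts 0 and 1 are both popular, so placing them on the same device concentrates load. Separating them lowers the hotspot but doesn't remove it, because expert 0 alone still attracts most tokens.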

DeepSeek-V3, for example, reports 671B total parameters with 37B activated per token, illustrating why total expert storage and active-token compute are different capacity questions.[8] Serving such a model still needs expert placement, routing balance, and communication measurements; expert parallelism can combine with tensor and data parallelism rather than replace them.

expert-routing-imbalance.py

    tokens_by_device = [48, 19, 17, 16]
    average = sum(tokens_by_device) / len(tokens_by_device)
    peak_ratio = max(tokens_by_device) / average

    print(f"average routed tokens/device: {average:.1f}")
    print(f"hottest device tokens: {max(tokens_by_device)}")
    print(f"hotspot ratio: {peak_ratio:.2f}x average")

Output

    average routed tokens/device: 25.0
    hottest device tokens: 48
    hotspot ratio: 1.92x average

Sizing example

Suppose you need to serve Qwen3.6-35B-A3B for a codebase-reasoning assistant:

| Requirement | Implication |
| --- | --- |
| Full BF16 checkpoint exceeds conservative one-GPU budget | Need tensor, pipeline, or expert-aware placement |
| 8K context and many concurrent users | KV cache budget matters |
| Low TTFT | Avoid slow cross-node communication |
| High traffic bursts | Consider replicas plus batching |
| Strict data boundary | Maybe self-host rather than hosted API |
Figure: Two illustrative charts for multi-GPU inference planning: a Qwen3.6-35B-A3B BF16 weight footprint compared with one 80 GB GPU after reserve, and relative decode communication cost as tensor parallelism crosses slower links.
Fit and speed are separate tests. One chart asks whether the model can stay in memory. The other asks whether the chosen shard plan will still serve tokens quickly.

A reasonable benchmark candidate is one fast-linked node with enough high-memory GPUs for one replica, tensor parallelism within that node, continuous batching, and prefix caching for stable policy text. If its cost or latency is unacceptable, measure quantization before adding cross-node parallelism.

Some systems add one more axis: disaggregated serving runs prefill and decode on separate worker pools, each with its own parallelism, and transfers KV state between them. Systems such as DistServe and Splitwise study when this can raise goodput by reducing phase interference, subject to KV-transfer overhead.[9][10] The parallelism choices described above still apply, but they can be evaluated per phase.
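The KV-transfer overhead can be estimated before running a disaggregation experiment. Every number below is an assumption for illustration: the per-token KV size comes from a hypothetical attention configuration, and the link bandwidth and prefill time are placeholders to be replaced with measurements.

```python
prompt_tokens = 8192
kv_bytes_per_token = 196_608   # hypothetical: 2 * 48 layers * 8 KV heads * 128 dims * 2 bytes
link_gb_per_s = 50             # assumed effective bandwidth between prefill and decode pools
prefill_ms = 400               # assumed measured prefill time for this prompt

kv_bytes = prompt_tokens * kv_bytes_per_token
transfer_ms = kv_bytes / (link_gb_per_s * 1e9) * 1e3
overhead = transfer_ms / prefill_ms

print(f"KV state to move: {kv_bytes / 1024**3:.2f} GiB")
print(f"transfer time: {transfer_ms:.1f} ms ({overhead:.0%} of prefill)")
```

If the transfer time is small next to prefill and can overlap with other work, disaggregation is worth benchmarking. If it approaches the TTFT budget, the interference it removes has to be larger than the delay it adds.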

serving-memory-budget.py

    gpu_count = 2
    gpu_capacity_gb = 80
    weights_gb = 70
    runtime_reserve_gb = 32
    measured_kv_per_request_gb = 0.75

    kv_budget_gb = gpu_count * gpu_capacity_gb - weights_gb - runtime_reserve_gb
    arithmetic_batch_ceiling = int(kv_budget_gb / measured_kv_per_request_gb)

    print(f"KV budget after weights and reserve: {kv_budget_gb} GB")
    print(f"arithmetic request ceiling at measured KV/request: {arithmetic_batch_ceiling}")
    print("Latency and burst headroom determine the admitted batch below this ceiling.")

Output

    KV budget after weights and reserve: 58 GB
    arithmetic request ceiling at measured KV/request: 77
    Latency and burst headroom determine the admitted batch below this ceiling.

What to measure

Multi-GPU inference should be measured with serving metrics, not offline tokens per second alone.

Track:

  1. Time to first token.
  2. Decode tokens per second.
  3. Aggregate throughput.
  4. GPU memory used by weights.
  5. GPU memory used by KV cache.
  6. Interconnect utilization.
  7. Queue time under burst traffic.
  8. Error rate when one GPU or node fails.

The worst mistake is counting total VRAM and declaring victory. A four-GPU box with enough raw memory can still miss latency targets if the interconnect is saturated or the scheduler can't fill the pipeline.
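Two of the tracked metrics, TTFT and time per output token, can be derived from per-token timestamps and summarized at p95. The traces below are hypothetical, and the helper names and the nearest-rank percentile method are choices made for this sketch.

```python
from math import ceil

def request_metrics(arrival_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean time per output token) in seconds for one request."""
    ttft = token_times_s[0] - arrival_s
    decode_steps = len(token_times_s) - 1
    tpot = (token_times_s[-1] - token_times_s[0]) / decode_steps if decode_steps else 0.0
    return ttft, tpot

def nearest_rank_percentile(values: list[float], pct: int) -> float:
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical traces: (arrival time, timestamp of each streamed token), in seconds.
traces = [
    (0.00, [0.25, 0.29, 0.33, 0.37]),
    (0.10, [0.42, 0.47, 0.52, 0.57]),
    (0.20, [0.95, 1.01, 1.07, 1.13]),   # queued behind a burst
    (0.30, [0.58, 0.62, 0.66, 0.70]),
]
metrics = [request_metrics(arrival, times) for arrival, times in traces]
p95_ttft_ms = nearest_rank_percentile([m[0] * 1000 for m in metrics], 95)
p95_tpot_ms = nearest_rank_percentile([m[1] * 1000 for m in metrics], 95)

print(f"p95 TTFT: {p95_ttft_ms:.0f} ms")
print(f"p95 time per output token: {p95_tpot_ms:.0f} ms")
```

Measuring TTFT from request arrival rather than from scheduling time keeps queue delay inside the metric, so burst traffic shows up in the tail instead of being hidden.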

Model parallelism is a capacity tool for models that need multiple GPUs. Replicas are a throughput tool: they serve many independent requests when each model copy fits on one GPU. Combine both when the product needs a large model and real throughput.

select-config-under-slo.py

    benchmarks = [
        {"name": "TP=1", "fits": False, "ttft_p95": 230, "tpot_p95": 34, "tps": 650},
        {"name": "TP=2", "fits": True, "ttft_p95": 310, "tpot_p95": 45, "tps": 620},
        {"name": "TP=4", "fits": True, "ttft_p95": 430, "tpot_p95": 62, "tps": 700},
    ]
    ttft_limit, tpot_limit = 400, 55
    eligible = [
        b for b in benchmarks
        if b["fits"] and b["ttft_p95"] <= ttft_limit and b["tpot_p95"] <= tpot_limit
    ]
    best = max(eligible, key=lambda b: b["tps"])
    print(f"eligible configurations: {[b['name'] for b in eligible]}")
    print(f"highest-throughput configuration inside SLO: {best['name']}")

Output

    eligible configurations: ['TP=2']
    highest-throughput configuration inside SLO: TP=2

Practice: choose sharding or replicas

Consider two serving plans:

| Workload | Better first move | Reason |
| --- | --- | --- |
| Gemma 4 12B code-classification model, high traffic | Replicas | One copy fits on one GPU, so duplicate it for throughput |
| Qwen3.6-35B-A3B reasoning model, low traffic | Tensor or expert-aware parallelism | One copy is too tight for a conservative one-GPU serving budget |
| Qwen3.6-35B-A3B, high traffic | Parallelism plus replicas | One request needs a sharded copy, and traffic needs multiple copies |
| Gemma 4 12B, 64K repository prompts | Measure sequence pressure | Long prefill or KV cache may dominate before weights do |

This is the decision habit to build: ask whether the bottleneck is model weight memory, request volume, context length, or communication. Different bottlenecks need different tools.

replicas-versus-shards.py

    gpu_budget = 8
    measured_tps_per_replica = {"Gemma4 12B on 1 GPU": 600, "Qwen3.6-35B-A3B TP=2": 260}
    gpus_per_replica = {"Gemma4 12B on 1 GPU": 1, "Qwen3.6-35B-A3B TP=2": 2}

    for name, per_replica_tps in measured_tps_per_replica.items():
        replicas = gpu_budget // gpus_per_replica[name]
        print(f"{name}: {replicas} replicas, {replicas * per_replica_tps} measured aggregate TPS")

Output

    Gemma4 12B on 1 GPU: 8 replicas, 4800 measured aggregate TPS
    Qwen3.6-35B-A3B TP=2: 4 replicas, 1040 measured aggregate TPS
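The decision habit from the practice table can be encoded as a tiny rule function. This is a deliberately simplified sketch: `first_move` is a hypothetical helper, the three boolean inputs stand in for real measurements, and the "single GPU" outcome for a small, low-traffic model is an added default that the table doesn't list.

```python
def first_move(fits_one_gpu: bool, high_traffic: bool, long_context: bool) -> str:
    """Suggest a first serving move from the dominant bottleneck."""
    if not fits_one_gpu:
        # Weight memory is the limit: shard first, then replicate if traffic demands it.
        return "parallelism plus replicas" if high_traffic else "tensor or expert-aware parallelism"
    if long_context:
        # Weights fit, so prefill time and KV cache are the likelier limits.
        return "measure sequence pressure"
    return "replicas" if high_traffic else "single GPU"

workloads = {
    "Gemma 4 12B classifier, high traffic": (True, True, False),
    "Qwen3.6-35B-A3B, low traffic": (False, False, False),
    "Qwen3.6-35B-A3B, high traffic": (False, True, False),
    "Gemma 4 12B, 64K prompts": (True, False, True),
}
for name, flags in workloads.items():
    print(f"{name}: {first_move(*flags)}")
```

The order of the checks is the point: fit comes first, then context pressure, then traffic. Communication cost doesn't appear as an input because it is a measurement taken after a candidate layout is chosen.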

Mastery check

  • Why a model may need sharding for inference even after quantization.
  • How tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, expert parallelism, and replicas solve different bottlenecks.
  • Why sequence parallelism and context parallelism aren't interchangeable names for long-context serving.
  • Why the Megatron column-then-row pattern needs only two all-reduces per layer, and what that costs on every decode token.
  • Why tensor parallelism depends on fast interconnect (NVLink vs PCIe vs network) and can become communication-bound.
  • Why pipeline parallelism can save memory but hurt small-batch latency through bubbles.
  • How disaggregated prefill and decode let each phase pick its own parallelism layout.
  • How to design a first serving plan from memory buckets, traffic shape, context length, and latency targets.

Evaluation rubric

Strong answers should:

  • identify the real bottleneck before recommending replicas or sharding
  • separate memory-fit math from latency and communication measurements
  • explain why tensor, pipeline, context, and expert parallelism solve different limits
  • connect interconnect speed directly to TTFT and decode behavior
  • name the smallest production-worthy first plan instead of the fanciest one

Common pitfalls

  • Symptom: The model fits on paper, but the runtime still hits OOM. Cause: You counted weight memory and forgot KV cache, runtime buffers, allocator slack, and bursty long-context headroom. Fix: Size memory by bucket, not checkpoint size alone.

  • Symptom: Latency gets worse after you spread the model across more GPUs. Cause: Cross-node tensor-parallel collectives now dominate decode. Fix: Keep the tensor-parallel group inside one fast node first. Only cross weaker links when pipeline behavior and queue depth justify it.

  • Symptom: TTFT rises even though the bigger shard plan finally fits. Cause: The extra communication and startup coordination added more latency than the freed memory removed. Fix: Measure TTFT and decode TPS directly. More GPUs aren't automatically a serving win.

  • Symptom: You shard a model that already fits, but throughput barely improves. Cause: Traffic volume was the real bottleneck, so communication replaced a simpler replica plan. Fix: Use replicas first when one full model copy fits and requests are independent.

Mastery Check

1. Qwen3.6-35B-A3B is served in BF16 on 80 GB GPUs, with 20% of each GPU reserved for runtime headroom. Before KV cache is counted, which sizing estimate respects total checkpoint memory and headroom?
2. Three TP candidates have these results: TP=1 does not fit; TP=2 fits with p95 TTFT 310 ms, p95 time per output token 45 ms, and 620 TPS; TP=4 fits with 430 ms, 62 ms, and 700 TPS. The limits are 400 ms TTFT and 55 ms per output token. Which candidate should be deployed?
3. In Megatron-style tensor parallelism for the MLP block Z = (GeLU(xA))B, why split A by columns and B by rows?
4. An 80-layer transformer uses the dense Megatron layout with two forward all-reduces per layer. For a 128-token decode, what communication count is implied, and why can topology dominate latency?
5. An 8B code-assistant model fits on one GPU, but 64K repository prompts make prefill time and KV-cache memory the bottleneck. Which deployment choice targets long-prompt attention or KV-state pressure without confusing Megatron sequence parallelism with context parallelism?
6. Eight GPUs run tensor_parallel_size=4 and pipeline_parallel_size=2 for a large model. What does this layout mean, and why might it hurt a single-request workload?
7. In an MoE inference step, the router sends tokens to expert devices as [48, 19, 17, 16]. What is the capacity warning?
8. An 8-GPU budget must handle a Gemma 4 12B classifier that fits on one GPU with many independent requests and a Qwen3.6-35B-A3B model that needs TP=2 per copy with high traffic. Which plan matches those bottlenecks?
9. A serving system moves prefill and decode to separate worker pools. Which trade-off should the deployment test?

Next Step
Continue to Model Quantization: GPTQ, AWQ & GGUF

Model parallelism splits a model across GPUs; quantization shrinks the bytes each GPU must store and move, so the next chapter teaches the main compression lever for serving.

References

[1] Qwen Team. Qwen3.6-35B-A3B. 2026.

[2] vLLM Project. Distributed Inference and Serving. Official documentation, 2026.

[3] Shoeybi, M., et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. 2019.

[4] NVIDIA. NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference. 2024.

[5] NVIDIA. Parallelism Strategies Guide. 2026.

[6] Liu, H., et al. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint, 2024.

[7] Shazeer, N., et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.

[8] DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv preprint, 2024.

[9] Zhong, Y., et al. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.

[10] Patel, P., et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. 2023.