Master MoE routing and load balancing, and understand why models like Mixtral and DeepSeek deliver strong capacity-per-compute tradeoffs compared with dense architectures.
The last two chapters focused on serving pressure: memory, long contexts, batching, and GPU capacity. Mixture of Experts (MoE) attacks a different bottleneck inside the Transformer layer itself.
By the time you reach this article, you've seen how a Transformer layer works: attention mixes information across tokens, then a Feed-Forward Network (FFN) processes each token independently. That FFN is dense. Every single weight in it participates in every single forward pass, no matter what the input says. This is simple, but it's also expensive. If you want the model to know more facts, handle more syntax, or reason about more domains, the standard recipe is to make the FFN wider or deeper. Every extra neuron adds parameters and adds compute for every token.
Mixture of Experts (MoE) breaks that link. Instead of one giant FFN, an MoE layer keeps many smaller FFNs in parallel and uses a lightweight router to decide which ones to run for each token. This means the model can have hundreds of billions of total parameters while only spending compute on a small subset of them for any given input.
MoE is one widely used way to separate model capacity from per-token compute. Switch Transformer[1] popularized sparse activation in trillion-parameter models, and open models such as Mixtral[2], DeepSeek-V2[3], and DeepSeek-V3[4] show how MoE can scale to hundreds of billions of total parameters while keeping the active compute for each token much closer to a far smaller dense model.
In a standard Transformer layer, the FFN sublayer is a single dense network. In an MoE layer, that one FFN is replaced by $N$ parallel "expert" FFNs, plus a small router (also called a gating network) that decides which experts handle each token.[5]
In many modern LLMs, the experts replace the FFN sublayer while attention stays dense and shared across all tokens.[2][3] A common router design is a single linear projection from the token's hidden state down to $N$ scores, one per expert.
Before we write a formula or code, let's trace one token through an MoE layer by hand. This is the shape of the computation that happens inside models like Mixtral on every single token.
The router multiplies the token by a learned weight matrix to get a raw score for each expert. Suppose the scores come out as:
| Expert | Raw score (logit) |
|---|---|
| 1 | -0.65 |
| 2 | -1.77 |
| 3 | -1.35 |
| 4 | -3.00 |
Apply softmax so the scores become probabilities that sum to 1. The result is approximately:
| Expert | Probability |
|---|---|
| 1 | 0.52 |
| 2 | 0.17 |
| 3 | 0.26 |
| 4 | 0.05 |
Pick the top 2 experts. That's Expert 1 (0.52) and Expert 3 (0.26).
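The selected weights are renormalized so they sum to 1 again within the chosen subset:

$$\tilde{g}_1 = \frac{0.52}{0.52 + 0.26} \approx 0.67, \qquad \tilde{g}_3 = \frac{0.26}{0.52 + 0.26} \approx 0.33$$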
Only Experts 1 and 3 run. Each processes the same input token through its own FFN weights.
The final layer output blends the two expert outputs using the renormalized gate weights:

$$y = 0.67 \cdot E_1(x) + 0.33 \cdot E_3(x)$$
That's it. The other two experts never process this token, so they contribute no expert-FFN FLOPs for this routing decision. Their weights still need to be stored or fetched by the serving system for other tokens that may select them.
```python
from math import exp

logits = [-0.65, -1.77, -1.35, -3.00]
unnormalized = [exp(score) for score in logits]
probabilities = [value / sum(unnormalized) for value in unnormalized]
selected = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)[:2]
selected_total = sum(probabilities[index] for index in selected)
weights = [probabilities[index] / selected_total for index in selected]

print("selected experts:", [index + 1 for index in selected])
print("renormalized weights:", [round(weight, 2) for weight in weights])
print("selected weight sum:", round(sum(weights), 2))
```

```text
selected experts: [1, 3]
renormalized weights: [0.67, 0.33]
selected weight sum: 1.0
```

We can write the same walkthrough as a compact equation. For an input token $x$, the MoE layer output is:

$$y = \sum_{i \in \mathcal{T}(x)} \tilde{g}_i(x)\, E_i(x), \qquad \tilde{g}_i(x) = \frac{g_i(x)}{\sum_{j \in \mathcal{T}(x)} g_j(x)}, \qquad g(x) = \operatorname{softmax}(W_r x)$$

Here $E_i$ is the $i$-th expert FFN, $W_r$ is the router's weight matrix, and $\mathcal{T}(x)$ is the set of the $k$ experts with the largest gate values $g_i(x)$.
Reading the formula: the router computes a gate distribution $g(x)$ over all $N$ experts. We keep only the configured top-$k$ entries, zero out the rest, and renormalize the remaining weights so they sum to 1. Mixtral uses top-2 out of 8, while other architectures choose different expert counts and routing policies.[2][3] Each selected expert processes $x$ independently, then their outputs are blended using the renormalized gate weights $\tilde{g}_i(x)$.
The architecture figure above is the visual form of this equation: the router scores every expert, keeps the top-$k$, runs only those FFNs, then blends the selected outputs.
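To make the full layer concrete, here is a minimal pure-Python sketch of one MoE forward pass for a single token. It is an illustration under stated assumptions, not any model's actual implementation: `moe_forward`, `make_scaling_expert`, the router weights, and the toy "experts" that merely scale the token are all hypothetical, and experts are zero-indexed here.

```python
from math import exp

def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, top_k):
    # Router: one logit per expert, computed as a dot product with the token.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    # Keep the top_k experts and renormalize their gate weights.
    selected = sorted(range(len(gates)), key=gates.__getitem__, reverse=True)[:top_k]
    selected_total = sum(gates[i] for i in selected)
    output = [0.0] * len(x)
    for i in selected:
        weight = gates[i] / selected_total
        expert_out = experts[i](x)  # unselected experts are never called
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output, selected

def make_scaling_expert(scale):
    # Hypothetical stand-in for an expert FFN: multiply the token by a constant.
    return lambda x: [scale * xi for xi in x]

experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

output, selected = moe_forward(token, router_weights, experts, top_k=2)
print("selected experts:", selected)
print("output:", [round(v, 3) for v in output])
```

With these toy values the router logits are `[2, 1, -2, -1]`, so experts 0 and 1 are selected and the other two are skipped entirely, which is the source of the sparse compute saving.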
Think of a dense model like one fulfillment center where every active station processes every order. An MoE model keeps a larger menu of stations but sends each order through only a selected subset. It can spend less computation per order than a similarly sized dense menu, but the facilities still occupy space and coordination between them still costs time. The analogy maps to active FLOPs, total weight residency, and dispatch traffic respectively.
| Property | Dense Model | MoE Model |
|---|---|---|
| Total parameters | 7B | ~47B (8 experts) |
| Active parameters/token | 7B | ~13B (top-2 of 8) |
| Training FLOPs | Per total params | Per active params |
| Quality at fixed active compute | Baseline | Often stronger if routing trains well |
| Inference compute/token | Per total params | Roughly per active params |
The main benefit is that MoE can scale total parameters faster than per-token compute. By adding more experts while keeping top-$k$ fixed, the model gets a larger pool of FFN parameters, but each token still runs only a small subset. The expert FLOPs stay roughly tied to the active experts, not the full expert pool. Routing, batching, and communication still add overhead, so "same active parameters" doesn't mean identical latency.
That doesn't mean serving cost scales only with active parameters. Compute per token is sparse, but the system still needs access to the full expert pool, so memory footprint and communication often scale with total parameters, not active parameters.
Only the FFN blocks are replicated across experts. Attention layers, embeddings, normalization layers, and the router are still shared. That's why Mixtral 8x7B exposes about 47B total parameters and about 13B active parameters per token, not the naive 56B (8 × 7B) and 14B (2 × 7B) you would get by treating every expert as a complete 7B model.[2]
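As a rough back-of-the-envelope check, the two reported Mixtral figures can be split into a shared part and a per-expert part. The sketch below assumes a simple two-term model (total = shared + 8 × expert, active = shared + 2 × expert); that decomposition is a simplification introduced here, and "per expert" means one expert slot summed across all layers.

```python
# Reported Mixtral 8x7B figures used in this article (billions of parameters).
total_b = 46.7
active_b = 12.9
expert_count = 8
top_k = 2

# Assumed model: total  = shared + expert_count * per_expert
#                active = shared + top_k * per_expert
# Subtracting the two equations isolates the per-expert term.
per_expert_b = (total_b - active_b) / (expert_count - top_k)
shared_b = active_b - top_k * per_expert_b

print("estimated params per expert slot:", f"{per_expert_b:.2f}B")
print("estimated shared params:", f"{shared_b:.2f}B")
print("reconstructed total:", f"{shared_b + expert_count * per_expert_b:.1f}B")
```

Under this assumption most of the model sits in the experts (roughly 5.6B per expert slot) and only a small remainder is shared, which is why the totals land well below 56B and 14B.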
The key point is the split: active parameters mostly drive per-token expert FLOPs, while total parameters drive weight memory and checkpoint size. Network traffic depends on where selected experts live.
```python
shared_parameters_b = 3.0
expert_parameters_b = 2.0
expert_count = 8
top_k = 2

total_parameters_b = shared_parameters_b + expert_count * expert_parameters_b
active_parameters_b = shared_parameters_b + top_k * expert_parameters_b

print("illustrative total parameters:", f"{total_parameters_b:.1f}B")
print("illustrative active parameters/token:", f"{active_parameters_b:.1f}B")
print("weights needed for residency:", f"{total_parameters_b:.1f}B")
```

```text
illustrative total parameters: 19.0B
illustrative active parameters/token: 7.0B
weights needed for residency: 19.0B
```

Here's a basic PyTorch implementation of the standard top-k softmax router. It takes token embeddings, computes scores for each expert, and returns the normalized weights and indices for the top-$k$ choices.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """
    Routes tokens to the top-k experts based on learned gating weights.
    """
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, n_experts)
        scores = F.softmax(logits, dim=-1)
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)

        # Renormalize selected expert weights so they sum to 1
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)

        return top_scores, top_indices  # (gate weights, expert indices)

torch.manual_seed(0)
torch.set_printoptions(precision=3, sci_mode=False)
router = TopKRouter(d_model=4, n_experts=4, top_k=2)
x = torch.randn(2, 3, 4)
weights, indices = router(x)

first_weights = [round(value, 3) for value in weights[0, 0].tolist()]
weight_sums = [
    [round(value, 3) for value in row]
    for row in weights.sum(dim=-1).tolist()
]

print("first token experts:", indices[0, 0].tolist())
print("first token weights:", first_weights)
print("weight sums:", weight_sums)
```

```text
first token experts: [3, 1]
first token weights: [0.682, 0.318]
weight sums: [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

Switch Transformer used top-1 routing, while Mixtral and DeepSeek-V2 use multiple selected routed experts per token.[1][2][3] Selecting more experts changes quality, compute, capacity, and communication trade-offs; it is not automatically a better serving configuration.
While the standard Top-K router forces each token to choose its top experts, Expert Choice Routing[6] flips the mechanism: each expert selects a fixed bucket of top-scoring tokens from the current batch.
This gives fixed expert-side load balancing because every expert processes the same number of tokens. However, it means some tokens might be processed by many experts while other tokens might be ignored entirely if no expert selects them.
Because every expert gets a predictable bucket, expert choice routing needs less balancing pressure on the expert side, which can simplify the training objective and improve hardware utilization. The risk moves to token coverage: a token can be selected by many experts, one expert, or none unless the design adds coverage rules. In practice, token-choice routing with balancing losses remains the simpler mental model for open-weight MoE LLMs, while expert-choice routing is useful to know when reading training-efficiency papers.
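The selection step itself can be sketched in a few lines. This is a toy illustration of the idea, assuming a hypothetical token-by-expert score matrix and a fixed bucket size of 2; it is not the paper's implementation, which works on batched tensors.

```python
# Hypothetical router scores: rows are tokens t0..t4, columns are experts E1..E3.
scores = [
    [0.9, 0.8, 0.1],  # t0
    [0.7, 0.2, 0.3],  # t1
    [0.1, 0.6, 0.9],  # t2
    [0.2, 0.1, 0.5],  # t3
    [0.3, 0.3, 0.2],  # t4
]
capacity = 2  # every expert takes exactly this many tokens

buckets = {}
for expert in range(len(scores[0])):
    # Each expert ranks all tokens by its own score column and keeps the best ones.
    ranked = sorted(range(len(scores)), key=lambda t: scores[t][expert], reverse=True)
    buckets[f"E{expert + 1}"] = [f"t{t}" for t in ranked[:capacity]]

covered = {token for bucket in buckets.values() for token in bucket}
uncovered = sorted({f"t{t}" for t in range(len(scores))} - covered)

print("buckets:", buckets)
print("uncovered tokens:", uncovered)
```

Every expert ends up with exactly two tokens, but `t4` scores too low in every column to be picked by anyone, which is the coverage risk described above.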
```python
expert_buckets = {
    "E1": ["t0", "t1"],
    "E2": ["t0", "t2"],
    "E3": ["t2", "t3"],
}
all_tokens = {"t0", "t1", "t2", "t3", "t4"}
selected_tokens = [token for bucket in expert_buckets.values() for token in bucket]
uncovered = sorted(all_tokens - set(selected_tokens))

print("tokens per expert:", [len(bucket) for bucket in expert_buckets.values()])
print("t0 expert count:", selected_tokens.count("t0"))
print("uncovered tokens:", uncovered)
```

```text
tokens per expert: [2, 2, 2]
t0 expert count: 2
uncovered tokens: ['t4']
```

During training, routers can drift toward expert collapse: many tokens route to a few "popular" experts while others are underutilized. Load statistics tell you whether this is occurring; they do not prove that a perfectly uniform router is best for model quality.
Imagine an e-commerce fulfillment center with four specialist teams: one for small electronics, one for apparel, one for bulky furniture, and one for perishable goods. On the first day, the electronics team happens to be slightly faster at handling common orders. The router notices this and sends more orders to the electronics team. That team gets more practice, so it improves. The router sends even more orders to them. Within a few weeks, the electronics team is handling 80% of all orders, while the furniture and perishables teams sit idle. The warehouse is still paying rent for all four teams, but it's effectively operating with one overworked team.
The router is trained jointly alongside the rest of the network.[5] If a particular expert looks better for common tokens early in training, the router can send it more assignments. That expert then receives more gradient updates, which can reinforce the preference and leave other experts under-trained. The operational test is not a story about specialization: inspect assignment shares, overflow, task loss, and communication load together.
```python
assignments = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
expert_count = 4
counts = [assignments.count(expert) for expert in range(expert_count)]
shares = [count / len(assignments) for count in counts]
max_share = max(shares)

print("assignment shares:", [round(share, 2) for share in shares])
print("flag for investigation:", max_share > 0.50)
```

```text
assignment shares: [0.6, 0.2, 0.1, 0.1]
flag for investigation: True
```

Switch Transformer uses a top-1 auxiliary loss that penalizes uneven routing.[1] The following form is also a useful teaching calculation for top-$k$ routing when $f_i$ is normalized over routed assignments:

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \sum_{i=1}^{N} f_i \cdot P_i$$
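Reading the formula: multiply each expert's routed-assignment fraction by its average routing probability and sum them up. If the router sends too many assignments to one expert (high $f_i$) and gives it high probability (high $P_i$), this product is large, so the loss penalizes that concentration. It is a balancing pressure, not proof that equal utilization maximizes task quality.
where:
- $N$ is the number of experts.
- $f_i$ is the fraction of routed assignments in the batch that went to expert $i$.
- $P_i$ is the average router probability assigned to expert $i$ across the batch.
- $\alpha$ is the coefficient that weights this auxiliary term against the main loss.

A balanced reference point is $f_i = P_i = 1/N$ (uniform distribution). A trained deployment still needs task-quality and systems measurements before you judge its observed specialization.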
Here's a concrete top-1 numeric check. Suppose we have 4 experts and a mini-batch of 100 tokens:
| Expert | $f_i$ (actual fraction) | $P_i$ (average prob) | Product $f_i \cdot P_i$ |
|---|---|---|---|
| 1 | 0.60 | 0.55 | 0.330 |
| 2 | 0.20 | 0.22 | 0.044 |
| 3 | 0.15 | 0.15 | 0.022 |
| 4 | 0.05 | 0.08 | 0.004 |
| Sum | | | 0.400 |
If the load were perfectly uniform, each expert would get $f_i = P_i = 0.25$, and the sum would be $4 \times 0.25 \times 0.25 = 0.25$. With positive $\alpha$, this auxiliary term applies pressure against the imbalanced 0.400 case. Tune its weight with task-quality measurements because too much balancing pressure can conflict with the primary objective.
The auxiliary loss is a routing pressure, not the main learning objective. It says, "use the expert pool evenly enough," while the language-modeling loss still decides whether the model predicts the next token well.
```python
assignment_counts = [60, 20, 15, 5]
mean_router_probability = [0.55, 0.22, 0.15, 0.08]
expert_count = len(assignment_counts)
assignment_total = sum(assignment_counts)
fractions = [count / assignment_total for count in assignment_counts]

concentration = expert_count * sum(
    fraction * probability
    for fraction, probability in zip(fractions, mean_router_probability)
)
uniform_reference = expert_count * sum((1 / expert_count) ** 2 for _ in fractions)

print("scaled concentration:", round(concentration, 2))
print("uniform reference:", round(uniform_reference, 2))
print("excess balancing pressure:", round(concentration - uniform_reference, 2))
```

```text
scaled concentration: 1.6
uniform reference: 1.0
excess balancing pressure: 0.6
```

In a batched environment, many MoE implementations use fixed-size expert buffers so the layer can run predictable tensor operations. This size is determined by the capacity factor:

$$\text{expert capacity} = \frac{\text{tokens per batch} \times k}{N} \times \text{capacity factor}$$
This computes the average number of token assignments each expert would receive under perfectly uniform routing, then multiplies by the capacity factor. A factor of $1.0$ means the buffer is exactly sized for uniform routing. Setting it to $1.2$ allows an expert to receive 20% more tokens than the average, providing a safety margin.
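What happens if an expert is assigned more tokens than its capacity buffer allows depends on implementation:
- Drop: a Switch-style policy skips the expert computation for the overflowed assignments, which may hurt quality.
- Reroute: overflowed assignments are sent to another expert, at extra routing and communication cost.
- Dropless: the design avoids fixed-size buffers so every assignment is processed, at the cost of less predictable tensor shapes.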
This highlights the trade-off: a higher capacity factor reduces overflow risk but reserves more buffer space and may waste compute on empty slots. A lower factor is more compact but makes whichever overflow policy you selected more important.
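The trade-off can be made visible by sweeping the capacity factor over one fixed, skewed load. The numbers below are hypothetical: a single hot expert receives 411 of the 8192 assignments and the other 31 experts share the rest evenly.

```python
from math import ceil

tokens, top_k, experts = 4096, 2, 32
# Hypothetical skewed load: one hot expert, the remaining experts split the rest.
hot_load = 411
other_load = (tokens * top_k - hot_load) // (experts - 1)  # 251 each, no remainder
loads = [hot_load] + [other_load] * (experts - 1)

results = []
for capacity_factor in (1.0, 1.25, 1.75):
    capacity = ceil(tokens * top_k / experts * capacity_factor)
    # Assignments beyond the buffer overflow; unused buffer rows are padding.
    overflow = sum(max(load - capacity, 0) for load in loads)
    empty_slots = sum(max(capacity - load, 0) for load in loads)
    results.append((capacity_factor, capacity, overflow, empty_slots))
    print(f"factor {capacity_factor}: capacity {capacity}, "
          f"overflow {overflow}, empty slots {empty_slots}")
```

In this toy load, raising the factor from 1.0 to 1.75 removes all 155 overflowed assignments but grows the empty buffer slots from 155 to 6144, so the right setting depends on how costly overflow is under the chosen policy.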
```python
from math import ceil

tokens = 4096
top_k = 2
experts = 32
capacity_factor = 1.0
assigned_to_hot_expert = 400
capacity = ceil(tokens * top_k / experts * capacity_factor)
overflow = max(assigned_to_hot_expert - capacity, 0)

print("per-expert capacity:", capacity)
print("overflow assignments:", overflow)
for policy in ("drop", "reroute", "dropless"):
    print(policy, "must be evaluated for quality and throughput")
```

```text
per-expert capacity: 256
overflow assignments: 144
drop must be evaluated for quality and throughput
reroute must be evaluated for quality and throughput
dropless must be evaluated for quality and throughput
```

The basic MoE pattern is consistent across modern models, but architects have refined it in interesting ways.
| Property | Value |
|---|---|
| Total params | 46.7B |
| Active params | ~12.9B |
| Experts | 8 |
| Top-K | 2 |
| Expert type | Standard FFN (SwiGLU activation[7]) |
| Performance | Matches or exceeds Llama 2 70B and GPT-3.5 across the evaluated benchmarks[2] |
MoE is applied only to the Feed-Forward Network (FFN) layers. Attention layers are shared across all tokens, which is the standard approach. It uses a SwiGLU (Swish Gated Linear Unit) activation function, a variant of the Gated Linear Unit (GLU) that's been shown to improve Transformer performance.
| Property | Value |
|---|---|
| Total params | 236B |
| Active params | 21B |
| Routed experts | 160 (fine-grained) |
| Shared experts | 2 (always-on) |
| Routed Top-K | 6 |
| Total active experts | 8 (6 routed + 2 shared) |
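DeepSeek-V2 uses several extensions to coarse-grained MoE:[3]
- Fine-grained routed experts: 160 smaller experts with 6 selected per token, instead of a few large ones.
- Shared experts: 2 always-on experts that process every token regardless of the router's choice.
- Device-limited routing: each token's selected experts are restricted to a bounded number of devices to cap communication.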
The following diagram contrasts Mixtral's coarse-grained top-2 routing with DeepSeek-V2's fine-grained routed-plus-shared design. The DeepSeek side is schematic, not a full 162-expert grid.
DeepSeek-V3 retains the DeepSeekMoE pattern at a larger reported scale.
| Property | Value |
|---|---|
| Total params | 671B |
| Active params | 37B |
| Routed experts | 256 |
| Shared experts | 1 |
| Routed Top-K | 8 |
| Total active experts | 9 (8 routed + 1 shared) |
| Load balancing | Auxiliary-loss-free strategy |
| Other key component | Multi-head Latent Attention (MLA) |
Each MoE layer has 1 shared expert and 256 routed experts, with 8 routed experts activated per token.[4]
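Putting the three models' reported numbers side by side shows how sparsity has increased. This small calculation uses only the figures from the tables in this article; the dictionary layout is just for illustration.

```python
# (total params B, active params B, routed experts, routed top-k) from the tables.
models = {
    "Mixtral 8x7B": (46.7, 12.9, 8, 2),
    "DeepSeek-V2": (236.0, 21.0, 160, 6),
    "DeepSeek-V3": (671.0, 37.0, 256, 8),
}

active_fraction = {}
for name, (total_b, active_b, routed, top_k) in models.items():
    active_fraction[name] = active_b / total_b
    print(f"{name}: {active_b / total_b:.1%} of parameters active, "
          f"{top_k / routed:.1%} of routed experts selected per token")
```

The active-parameter fraction is higher than the routed-expert fraction in every case because attention, embeddings, and any shared experts run for every token.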
The paper also highlights a major training change: instead of leaning on auxiliary balancing losses, DeepSeek-V3 uses an auxiliary-loss-free load-balancing strategy.[4] That's important because it shows how modern MoE work is not only about adding more experts. Router stability, communication efficiency, and keeping the main training objective clean matter just as much.
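The core idea of that strategy, as described in the DeepSeek-V3 report, is a per-expert bias that is added to the routing scores only when choosing the top-$k$ experts and is nudged down for overloaded experts and up for underloaded ones. The sketch below is a toy illustration of that idea, not the paper's implementation: the function names, the step size, the score values, and the use of top-1 routing over three experts are all assumptions made for readability.

```python
def select_experts(scores, bias, top_k):
    # The bias influences which experts are chosen, not the gate weights themselves.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:top_k]

def update_bias(bias, loads, step_size):
    mean_load = sum(loads) / len(loads)
    updated = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            updated.append(b - step_size)  # overloaded: make it less attractive
        elif load < mean_load:
            updated.append(b + step_size)  # underloaded: make it more attractive
        else:
            updated.append(b)
    return updated

# Hypothetical batch in which every token slightly prefers expert 0.
batch = [
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
    [0.40, 0.38, 0.22],
    [0.40, 0.25, 0.35],
    [0.36, 0.34, 0.30],
    [0.38, 0.28, 0.34],
]
bias = [0.0, 0.0, 0.0]
history = []
for step in range(3):
    loads = [0, 0, 0]
    for scores in batch:
        for expert in select_experts(scores, bias, top_k=1):
            loads[expert] += 1
    history.append(loads)
    print("step", step, "loads:", loads)
    bias = update_bias(bias, loads, step_size=0.03)
```

In this toy batch all six tokens pick expert 0 at first, and a single bias update is enough to spread them two per expert. Real training is far noisier, so treat the step size and the update rule's details as tunable assumptions.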
Implementing MoE introduces complexities not present in dense models. The compute per token is sparse, but the system still faces several hard serving constraints.
In distributed training and inference, experts often reside on different devices to meet memory requirements. Routing a token to an expert on another GPU requires transferring the token's hidden state over a high-bandwidth interconnect such as NVLink within a node or InfiniBand across nodes.
This communication can quickly become a bottleneck if not managed carefully. The process usually has an all-to-all dispatch phase that scatters tokens to the GPUs hosting their selected experts, followed by an all-to-all combine phase that returns the expert outputs to the original token order. To maintain high throughput, system designers try to overlap those transfers with useful compute so the network isn't stalling the whole layer.
```python
expert_device = {"E0": "GPU0", "E1": "GPU0", "E2": "GPU1", "E3": "GPU1"}
routed_assignments = [
    ("GPU0", "E0"),
    ("GPU0", "E2"),
    ("GPU1", "E2"),
    ("GPU1", "E1"),
    ("GPU0", "E3"),
]
remote = sum(
    source_device != expert_device[expert]
    for source_device, expert in routed_assignments
)

print("routed assignments:", len(routed_assignments))
print("remote assignments:", remote)
print("remote rate:", f"{remote / len(routed_assignments):.0%}")
```

```text
routed assignments: 5
remote assignments: 3
remote rate: 60%
```

Unlike dense models where there's only one FFN per layer, MoE models carry many expert FFNs. Fast GPU deployments generally keep those expert weights resident somewhere in the serving pool so routing does not stall on weight fetches.
This creates the core MoE paradox: low active compute per token, but very high total weight residency. A model like Mixtral 8x7B only activates about 13B parameters per token, yet the serving system still needs the full ~47B parameter checkpoint available across the cluster.[2] Tensor Parallelism (TP) splits individual layer computations across multiple GPUs, allowing larger shared layers to fit into memory, but it doesn't solve the unique structure of MoE. Expert Parallelism (EP) is needed to place different full experts on different GPUs while keeping every expert reachable when a token needs it.
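A quick estimate shows the gap between resident weights and the weights a single token touches. The parameter counts are the Mixtral figures cited above; the bytes-per-parameter values are assumptions (2 bytes for a 16-bit format, 0.5 bytes for 4-bit quantization), and the estimate ignores KV cache, activations, and quantization metadata.

```python
total_params = 46.7e9   # Mixtral 8x7B total parameters, as cited above
active_params = 12.9e9  # active parameters per token, as cited above

# Assumed storage formats; real checkpoints add overhead not modeled here.
formats = {"16-bit": 2.0, "4-bit": 0.5}

estimates = {}
for label, bytes_per_param in formats.items():
    resident_gb = total_params * bytes_per_param / 1e9
    touched_gb = active_params * bytes_per_param / 1e9
    estimates[label] = (resident_gb, touched_gb)
    print(f"{label}: resident weights ~{resident_gb:.1f} GB, "
          f"weights touched per token ~{touched_gb:.1f} GB")
```

Memory must be sized against the first number in each pair, while per-token expert compute tracks the second.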
MoE training can be more sensitive than training a comparable dense Transformer. The top-$k$ routing step introduces discontinuities at expert-selection boundaries, and small routing biases can snowball into expert collapse. Gradients still flow through the selected experts and their gate weights, but the router is noisier to optimize than a dense FFN.
To combat this, implementations may add stabilizers. Router z-loss[8] penalizes unnecessarily large routing logits; the ST-MoE paper reports improved training stability from this term. Router jitter can add exploration pressure early in training. The relevant evidence is training loss, routing distribution, overflow, and downstream quality rather than assuming one stabilizer is mandatory for every MoE.
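Router z-loss is simple to compute: for each token, take the log-sum-exp of its router logits, square it, and average over the batch. The sketch below follows that definition in plain Python; the function name and the example logits are illustrative, and the coefficient that scales this term in the total loss is omitted.

```python
from math import exp, log

def router_z_loss(batch_logits):
    # Mean over tokens of (log-sum-exp of that token's router logits) squared.
    total = 0.0
    for logits in batch_logits:
        peak = max(logits)  # shift by the max for numerical stability
        log_sum_exp = peak + log(sum(exp(v - peak) for v in logits))
        total += log_sum_exp ** 2
    return total / len(batch_logits)

small = [[-0.65, -1.77, -1.35, -3.00]]
# Adding a constant leaves the softmax probabilities unchanged
# but makes the logits themselves much larger.
shifted = [[v + 10.0 for v in small[0]]]

print("z-loss, small logits:", round(router_z_loss(small), 4))
print("z-loss, shifted logits:", round(router_z_loss(shifted), 2))
```

Both inputs produce identical routing probabilities, yet the shifted logits incur a far larger penalty. That is the point of the term: it discourages large logit magnitudes without dictating which expert wins.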
Large distributed MoE deployments commonly use Expert Parallelism (EP). While Tensor Parallelism (TP) splits individual matrix multiplications, EP maps entire experts to specific GPUs. The layout below illustrates how a 96-expert model might be distributed across three GPUs, with each device holding a distinct subset of experts while sharing the attention parameters:
```text
GPU 0: Attention layers (shared) + Experts 0-31
GPU 1: Attention layers (shared) + Experts 32-63
GPU 2: Attention layers (shared) + Experts 64-95
...
```

Each GPU stores a subset of experts. When a token routes to an expert on another GPU, an all-to-all communication sends the token there and retrieves the result.
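The dispatch and combine phases can be mimicked without any GPUs. This sketch assumes the contiguous 32-experts-per-GPU placement from the layout above, top-1 routing, and a string as a stand-in for each expert's output; the routes are hypothetical.

```python
from collections import defaultdict

EXPERTS_PER_GPU = 32  # matches the illustrative 96-expert, three-GPU layout

def gpu_for_expert(expert_id):
    # Contiguous placement: experts 0-31 on GPU 0, 32-63 on GPU 1, and so on.
    return expert_id // EXPERTS_PER_GPU

# Hypothetical top-1 routes: (token position, selected expert id).
routes = [(0, 70), (1, 5), (2, 33), (3, 5), (4, 95)]

# Dispatch: group tokens by the GPU that hosts their selected expert.
inbox = defaultdict(list)
for position, expert_id in routes:
    inbox[gpu_for_expert(expert_id)].append((position, expert_id))

# Each GPU runs its local experts on the tokens it received.
results = []
for gpu in sorted(inbox):
    for position, expert_id in inbox[gpu]:
        results.append((position, f"E{expert_id}(t{position}) on GPU{gpu}"))

# Combine: sort by token position to restore the original order.
combined = [output for _, output in sorted(results)]

for gpu in sorted(inbox):
    print(f"GPU{gpu} received tokens:", [position for position, _ in inbox[gpu]])
print("combined:", combined)
```

In a real system the two grouping steps are all-to-all collectives across devices, and their cost depends on how many tokens cross device boundaries.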
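For CPU-offloaded serving (for example, running a quantized MoE checkpoint on consumer hardware), one possible design keeps the shared layers on the GPU, holds the expert weights in CPU memory, and copies an expert to the GPU only when the router selects it.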
This can make large MoE checkpoints fit where full GPU residency cannot, especially when combined with weight quantization, but it trades memory for transfer-dependent latency. You cannot know an exact next-layer route until that layer receives its hidden state, so unconditional "prefetch the next experts" is not free.
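In batch serving, different tokens in a batch may route to different experts. The key optimization is batched expert execution: group the batch's tokens by selected expert, run each expert once over its whole group, then scatter the outputs back to the original token positions.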
This turns small matrix multiplies into a smaller number of large, efficient ones, which is critical for achieving high GPU utilization with MoE models.
```python
from collections import defaultdict

routes = [
    ("t0", "E2"),
    ("t1", "E0"),
    ("t2", "E2"),
    ("t3", "E1"),
    ("t4", "E2"),
]
grouped = defaultdict(list)
for token, expert in routes:
    grouped[expert].append(token)

for expert in sorted(grouped):
    print(expert, "batch tokens:", grouped[expert], "gemm rows:", len(grouped[expert]))
```

```text
E0 batch tokens: ['t1'] gemm rows: 1
E1 batch tokens: ['t3'] gemm rows: 1
E2 batch tokens: ['t0', 't2', 't4'] gemm rows: 3
```

| Batch Size | Dense model | MoE model | Typical effect |
|---|---|---|---|
| 1 | Higher baseline utilization | Lower | Dense models often win on latency |
| 8 | Strong | Catching up | Routing overhead starts to amortize |
| 32+ | High | High | Throughput gap narrows a lot |
This table is a hypothesis to benchmark, not a universal crossover. Exact outcomes depend on kernels, expert placement, interconnect, and request mix. MoE adds router computation, expert selection, token grouping, and sometimes all-to-all communication; batching can amortize that overhead, but only measurements on the target system establish whether MoE wins latency or throughput.
```python
dense = {"quality": 0.82, "p95_ms": 48, "tokens_per_second": 820}
moe = {"quality": 0.83, "p95_ms": 55, "tokens_per_second": 1100}
requirements = {"quality": 0.82, "max_p95_ms": 60, "min_tokens_per_second": 1000}

approved = (
    moe["quality"] >= requirements["quality"]
    and moe["p95_ms"] <= requirements["max_p95_ms"]
    and moe["tokens_per_second"] >= requirements["min_tokens_per_second"]
)

print("moe throughput improvement:", f"{moe['tokens_per_second'] / dense['tokens_per_second'] - 1:.1%}")
print("moe p95 delta ms:", moe["p95_ms"] - dense["p95_ms"])
print("candidate approved:", approved)
```

```text
moe throughput improvement: 34.1%
moe p95 delta ms: 7
candidate approved: True
```

Symptom: FLOP estimates look cheap, but deployment still runs out of VRAM. Cause: Team budgeted only for active parameters per token and forgot that the full expert pool must still be resident somewhere in the serving system. Fix: Separate active expert FLOPs from total weight residency. Size memory for the full expert pool, not only the routed path.
Symptom: One or two experts receive most tokens by early training. Cause: Router found a slightly better path, that expert got more gradients, and collapse reinforced itself. Fix: Track routing histograms from the start and add balancing pressure such as auxiliary loss, z-loss, jitter, or capacity constraints before unused experts go stale.
Symptom: MoE benchmark wins disappear at batch size 1. Cause: Sparse expert FLOPs are not the whole latency story. Router work, grouping, dispatch, and all-to-all traffic can dominate single-request latency. Fix: Evaluate MoE on the real serving regime. Use dense models for low-latency single requests when routing overhead outweighs sparse compute gains.
Symptom: Throughput drops sharply after experts are split across GPUs. Cause: Expert placement created communication hotspots. Tokens now bounce across devices faster than grouped GEMMs can amortize the transfers. Fix: Treat routing and placement as one systems problem. Use expert parallelism, device-aware routing, and batching to keep all-to-all traffic under control.
Symptom: Quality drops even though router histograms look balanced. Cause: Experts may be hitting capacity limits, and the configured overflow policy is dropping, rerouting, or delaying assignments. Fix: Inspect overflow counts and policy directly. Evaluate capacity factor, routing balance, burstiness, and throughput before assuming the issue is elsewhere.
Symptom: Expert-choice routing leaves some tokens underprocessed. Cause: Experts select their favorite tokens first, so token coverage is no longer guaranteed automatically. Fix: Add explicit coverage rules or use token-choice top-k routing when every token must reliably receive expert computation.
Symptom: Teams describe experts as clean topic buckets like "math expert" or "biology expert." Cause: Human-friendly labels are being projected onto behavior that often follows token patterns, formatting, or structural clusters instead. Fix: Validate specialization with routing traces, activation analysis, and targeted prompts that test a claimed specialty before telling a story about it.
Active FLOPs follow only the experts selected for each token, but the full expert pool still has to live somewhere in the serving system. That means VRAM or cluster-wide weight residency is driven by total parameters, not only active parameters. MoE separates compute cost from capacity, not memory from capacity.
It is self-reinforcing because the favored expert gets more tokens, more gradients, and therefore improves faster, which makes the router favor it even more. First inspect routing histograms, average router probabilities, dropped-token rates, and balancing-loss terms. Those signals show whether collapse is still reversible or already starving the rest of the expert pool.
At batch size 1, MoE pays router, grouping, and often all-to-all overhead without enough routed tokens to amortize it. Dense models avoid that coordination overhead and can win on raw latency. At larger batch sizes, grouped expert execution can make sparse compute competitive on throughput. Benchmark both claims on the actual hardware.
Tensor parallelism shards individual matrix multiplications, but MoE also needs whole-expert placement, token dispatch, and result recombination. That is why expert parallelism exists: the problem is not only splitting math, but moving tokens to whichever GPU holds their selected experts and then restoring original token order.
The nominal per-expert buffer is (4096 * 2) / 32 = 256 token assignments. If one expert gets 400 assignments, then 144 assignments overflow that buffer. A Switch-style dropping policy skips the overflowed expert computation and may hurt quality; rerouting or dropless designs pay different compute and communication costs. Inspect the configured policy.
Start by constraining and measuring communication, not by assuming more experts solve it. Device-limited routing, expert placement that keeps common routes local, shared experts for common computation, and larger grouped batches are candidate mitigations. Adding routed experts changes capacity and traffic; benchmark that change against the existing bottleneck.[3][4]
Here's a self-contained pencil-and-paper exercise that checks whether you can trace the exact computation an MoE router performs. Think of it as routing a single customer order through the right fulfillment specialists. Do this without looking back at the worked example above.
If your numbers are within 0.02 of these, you understand the routing math well enough to read any MoE paper.
You now understand the central trick behind large sparse MoE models: replace selected dense FFNs with routed expert sets and activate only configured paths per token. You can trace a token through the router by hand, explain why total parameters do not equal compute cost, and evaluate balance, overflow policy, memory residency, and communication before claiming a serving win.
[1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
Fedus, W., Zoph, B., & Shazeer, N. · 2022
[2] Mixtral of Experts.
Jiang, A. Q., et al. · 2024
[3] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.
DeepSeek-AI · 2024
[4] DeepSeek-V3 Technical Report.
DeepSeek-AI · 2024 · arXiv preprint
[5] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
Shazeer, N., et al. · 2017 · ICLR 2017
[6] Mixture-of-Experts with Expert Choice Routing.
Zhou, Y., et al. · 2022
[7] GLU Variants Improve Transformer.
Shazeer, N. · 2020
[8] ST-MoE: Designing Stable and Transferable Sparse Expert Models.
Zoph, B., et al. · 2022