Mixture of Experts Architecture

Master MoE routing, load balancing, and the dense-vs-sparse serving tradeoffs behind Mixtral, DeepSeek, and Qwen3.6-style expert models.

Serving pressure shows up as KV-cache memory, long contexts, batching, and GPU capacity. Mixture of Experts (MoE) attacks a different bottleneck inside the Transformer layer itself.

You've seen how a Transformer layer works: self-attention mixes information across tokens, then a Feed-Forward Network (FFN) processes each token independently. That FFN is dense. Every single weight in it participates in every single forward pass, no matter what the input says. This is simple, but it's also expensive. If you want the model to know more facts, handle more syntax, or reason about more domains, the standard recipe is to make the FFN wider or deeper. Every extra neuron adds parameters and adds compute for every token.

Mixture of Experts (MoE) breaks that link. Instead of one giant FFN, an MoE layer keeps many smaller FFNs in parallel and uses a lightweight router to decide which ones to run for each token. This means the model can have hundreds of billions of total parameters while only spending compute on a small subset of them for any given input.

MoE is one widely used way to separate model capacity from per-token compute. Switch Transformer^{[1]Reference 1Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.https://arxiv.org/abs/2101.03961} popularized sparse activation in trillion-parameter models, and open models such as Mixtral^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}, DeepSeek-V2^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}, DeepSeek-V3^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437}, and Qwen3.6-35B-A3B^{[5]Reference 5Qwen3.6-35B-A3Bhttps://huggingface.co/Qwen/Qwen3.6-35B-A3B} show how MoE can scale total parameters while keeping the active compute for each token much closer to a far smaller dense model.

From one FFN to many experts

In a standard Transformer layer, the FFN sublayer is a single dense network. In an MoE layer, that one FFN is replaced by $N$ parallel "expert" FFNs, plus a small router (also called a gating network) that decides which $k$ experts handle each token. Those winners form a small top-k subset.^{[6]Reference 6Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.https://arxiv.org/abs/1701.06538}

In many modern LLMs, the experts replace the FFN sublayer while attention stays dense and shared across all tokens.^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434} A common router design is a single linear projection from the token's hidden state down to $N$ scores.

MoE layer where shared attention feeds a router that activates only top-k experts. — Track one token through the layer: shared attention runs first, the router selects E1 and E3, skipped experts stay inactive, and only selected FFN outputs are blended.

A concrete routing walkthrough

Before any formula or code, trace one token through an MoE layer by hand. This is the shape of the computation that happens inside models like Mixtral on every single token.

Setup

One token arrives with a hidden state $x$ of dimension 4.
There are 4 experts, each a small FFN.
The router uses top-$k$ routing with $k = 2$ (it picks the 2 best experts).

Step 1: Router scores

The router multiplies the token by a learned weight matrix $W_g$ to get a raw score for each expert. Suppose the scores come out as:

Expert	Raw score (logit)
1	-0.65
2	-1.77
3	-1.35
4	-3.00

Step 2: Softmax probabilities

Apply softmax so the scores become probabilities that sum to 1. Approximate probabilities:

Expert	Probability
1	0.52
2	0.17
3	0.26
4	0.05

Step 3: Top-k selection

Pick the top 2 experts. That's Expert 1 (0.52) and Expert 3 (0.26).

Step 4: Renormalize

The selected weights are renormalized so they sum to 1 again within the chosen subset:

Expert 1: $0.52 / (0.52 + 0.26) \approx 0.67$
Expert 3: $0.26 / (0.52 + 0.26) \approx 0.33$

Step 5: Expert computation

Only Experts 1 and 3 run. Each processes the same input token $x$ through its own FFN weights.

$E_1(x)$ produces output vector $o_1$
$E_3(x)$ produces output vector $o_3$

Step 6: Weighted sum

The final layer output blends the two expert outputs using the renormalized gate weights:

$$\text{Output} = 0.67 \cdot E_1(x) + 0.33 \cdot E_3(x)$$

That's it. The other two experts never process this token, so they contribute no expert-FFN FLOPs for this routing decision. Their weights still need to be stored or fetched by the serving system for other tokens that may select them.

routing-walkthrough.py

from math import exp

logits = [-0.65, -1.77, -1.35, -3.00]
unnormalized = [exp(score) for score in logits]
probabilities = [value / sum(unnormalized) for value in unnormalized]
selected = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)[:2]
selected_total = sum(probabilities[index] for index in selected)
weights = [probabilities[index] / selected_total for index in selected]

print("selected experts:", [index + 1 for index in selected])
print("renormalized weights:", [round(weight, 2) for weight in weights])
print("selected weight sum:", round(sum(weights), 2))

Output

selected experts: [1, 3]
renormalized weights: [0.67, 0.33]
selected weight sum: 1.0

The general MoE layer formula

We can write the same walkthrough as a compact equation. For an input token $x$, the MoE layer output is:

$$\text{MoE}(x) = \sum_{i \in \text{TopK}(g(x))} \tilde{g}_i(x) \cdot E_i(x)$$

Reading the formula: the router computes a gate distribution $g(x)$ over all $N$ experts. We keep only the configured top- $k$ entries, zero out the rest, and renormalize the survivors into $\tilde{g}(x)$ so those remaining weights sum to 1. Mixtral uses top-2 out of 8, while other architectures choose different expert counts and routing policies.^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434} Each selected expert $E_i$ processes $x$ independently, then their outputs are blended using the renormalized gates $\tilde{g}_i$ .

The architecture figure above is the visual form of this equation: the router scores every expert, keeps the top-$k$, runs only those FFNs, then blends the selected outputs.
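As a rough sketch of the equation in plain Python, the top-$k$ selection, renormalization, and weighted sum could look like the following. The `moe_layer` helper and the `toy_experts` list are hypothetical names for illustration, and the toy experts just scale their input instead of running a real FFN.

```python
from math import exp

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def moe_layer(x, router_logits, experts, top_k):
    """Sum over the top-k experts of (renormalized gate) * E_i(x)."""
    gates = softmax(router_logits)  # g(x) over all N experts
    chosen = sorted(range(len(gates)), key=gates.__getitem__, reverse=True)[:top_k]
    kept = sum(gates[i] for i in chosen)
    output = [0.0] * len(x)
    for i in chosen:
        weight = gates[i] / kept      # renormalized gate for expert i
        expert_out = experts[i](x)    # only selected experts are called
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return chosen, output

# Assumption: toy expert i multiplies the input by (i + 1).
toy_experts = [lambda x, scale=i + 1: [scale * value for value in x] for i in range(4)]

chosen, output = moe_layer([1.0, 2.0], [-0.65, -1.77, -1.35, -3.00], toy_experts, top_k=2)
print("chosen experts (0-based):", chosen)
print("output:", [round(value, 3) for value in output])
```

With the walkthrough logits, the sketch selects indices 0 and 2 (Experts 1 and 3 in the 1-based tables) and never calls the other two expert functions.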

Why sparse activation can pay off

A dense model is a single pool of workers in which every block processes every token. An MoE model keeps a larger set of stations but sends each token through only a selected subset. It can spend less compute per token than a dense model with the same total parameter count, but all of its parameters still occupy memory, and moving tokens between stations still costs time. Those three properties correspond to active FLOPs, total weight residency, and dispatch traffic, respectively.

Property	Dense baseline (~active compute)	MoE Model (Mixtral-shaped)
Total parameters	~13B illustrative dense peer	46.7B (8 experts)
Active parameters/token	~13B	~12.9B (top-2 of 8)
Training FLOPs	Per total params	Per active params
Quality at fixed active compute	Baseline	Often stronger if routing trains well
Inference compute/token	Per total params	Roughly per active params

The main benefit is that MoE can scale total parameters faster than per-token compute. By adding more experts while keeping top- $k$ fixed, the model gets a larger pool of FFN parameters, but each token still runs only a small subset. The expert FLOPs stay roughly tied to the active experts, not the full expert pool. Routing, batching, and communication still add overhead, so "same active parameters" doesn't mean identical latency.

That doesn't mean serving cost scales only with active parameters. Compute per token is sparse, but the system still needs access to the full expert pool, so memory footprint and communication often scale with total parameters, not active parameters.

Only the FFN blocks are replicated across experts. Attention layers, embeddings, normalization layers, and the router are still shared. That's why Mixtral 8x7B exposes 46.7B total parameters and about 12.9B active parameters per token, not the naive 56B ($8 \times 7$B) and 14B ($2 \times 7$B) that the name suggests.^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}

The split matters: active parameters mostly drive per-token expert FLOPs, while total parameters drive weight memory and checkpoint size. Network traffic depends on where selected experts live.

capacity-versus-compute.py

shared_parameters_b = 3.0
expert_parameters_b = 2.0
expert_count = 8
top_k = 2

total_parameters_b = shared_parameters_b + expert_count * expert_parameters_b
active_parameters_b = shared_parameters_b + top_k * expert_parameters_b

print("illustrative total parameters:", f"{total_parameters_b:.1f}B")
print("illustrative active parameters/token:", f"{active_parameters_b:.1f}B")
print("weights needed for residency:", f"{total_parameters_b:.1f}B")

Output

illustrative total parameters: 19.0B
illustrative active parameters/token: 7.0B
weights needed for residency: 19.0B
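The same two-equation model can be turned around to estimate how Mixtral's reported figures split between shared and expert weights. This is a back-of-envelope sketch, not a breakdown from the paper: it assumes every parameter is either fully shared or belongs to one of 8 equally sized experts, and that the active count is the shared part plus 2 experts.

```python
# Reported Mixtral 8x7B figures, in billions of parameters.
total_b = 46.7
active_b = 12.9
expert_count = 8
top_k = 2

# Assumed model:
#   total  = shared + expert_count * per_expert
#   active = shared + top_k * per_expert
# Subtracting the two equations isolates per_expert.
per_expert_b = (total_b - active_b) / (expert_count - top_k)
shared_b = active_b - top_k * per_expert_b

print(f"approx. parameters per expert (summed over layers): {per_expert_b:.2f}B")
print(f"approx. shared parameters: {shared_b:.2f}B")
print(f"check total: {shared_b + expert_count * per_expert_b:.1f}B")
```

Under these assumptions, most of the checkpoint sits in expert FFNs, which is why total residency grows so much faster than active compute.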

The router in code

Here's a basic PyTorch implementation of the standard top-k softmax router. It takes token hidden states, computes scores for each expert, and returns the renormalized weights and expert indices for the top-$k$ choices.

the-router-in-code.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """
    Routes tokens to the top-k experts based on learned gating weights.
    """
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, n_experts)
        scores = F.softmax(logits, dim=-1)
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)

        # Renormalize selected expert weights so they sum to 1
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)

        return top_scores, top_indices  # (renormalized weights, expert indices)

torch.manual_seed(0)
torch.set_printoptions(precision=3, sci_mode=False)
router = TopKRouter(d_model=4, n_experts=4, top_k=2)
x = torch.randn(2, 3, 4)
weights, indices = router(x)

first_weights = [round(value, 3) for value in weights[0, 0].tolist()]
weight_sums = [
    [round(value, 3) for value in row]
    for row in weights.sum(dim=-1).tolist()
]

print("first token experts:", indices[0, 0].tolist())
print("first token weights:", first_weights)
print("weight sums:", weight_sums)

Output

first token experts: [3, 1]
first token weights: [0.682, 0.318]
weight sums: [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

Switch Transformer used top-1 routing, while Mixtral and DeepSeek-V2 use multiple selected routed experts per token.^{[1]Reference 1Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.https://arxiv.org/abs/2101.03961}^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434} Selecting more experts changes quality, compute, capacity, and communication trade-offs; it isn't automatically a better serving configuration.

Alternative: expert choice routing

While the standard Top-K router forces each token to choose its top $k$ experts, Expert Choice Routing^{[7]Reference 7Mixture-of-Experts with Expert Choice Routinghttps://arxiv.org/abs/2202.09368} flips the mechanism: each expert selects a fixed bucket of top-scoring tokens from the current batch.

Each expert processes a fixed token budget, so expert-side load is predictable and hardware utilization is easier to schedule. The trade-off moves to token coverage: one token may land in many expert buckets, one bucket, or none unless the design adds coverage rules. Token-choice routing with balancing losses remains the simpler mental model for open-weight MoE LLMs; expert-choice routing is still worth knowing when you read training-efficiency papers.

expert-choice-coverage.py

expert_buckets = {
    "E1": ["t0", "t1"],
    "E2": ["t0", "t2"],
    "E3": ["t2", "t3"],
}
all_tokens = {"t0", "t1", "t2", "t3", "t4"}
selected_tokens = [token for bucket in expert_buckets.values() for token in bucket]
uncovered = sorted(all_tokens - set(selected_tokens))

print("tokens per expert:", [len(bucket) for bucket in expert_buckets.values()])
print("t0 expert count:", selected_tokens.count("t0"))
print("uncovered tokens:", uncovered)

Output

tokens per expert: [2, 2, 2]
t0 expert count: 2
uncovered tokens: ['t4']
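The buckets above were written by hand. A minimal sketch of how such buckets could be derived is shown below, assuming a hypothetical token-by-expert score matrix and a fixed `bucket_size` of 2; both are illustrative values, not taken from the Expert Choice paper.

```python
# Hypothetical router scores: rows are tokens t0..t4, columns are experts E1..E3.
scores = [
    [0.9, 0.8, 0.1],  # t0
    [0.7, 0.2, 0.3],  # t1
    [0.1, 0.9, 0.8],  # t2
    [0.2, 0.1, 0.7],  # t3
    [0.3, 0.3, 0.2],  # t4
]
bucket_size = 2  # fixed per-expert token budget (assumed)

buckets = {}
for expert in range(len(scores[0])):
    # Each expert ranks all tokens by its own column and keeps the best ones.
    ranked = sorted(range(len(scores)), key=lambda t: scores[t][expert], reverse=True)
    buckets[f"E{expert + 1}"] = [f"t{t}" for t in ranked[:bucket_size]]

covered = {token for bucket in buckets.values() for token in bucket}
uncovered = sorted({f"t{t}" for t in range(len(scores))} - covered)

print("buckets:", buckets)
print("uncovered tokens:", uncovered)
```

Every expert gets exactly `bucket_size` tokens, while `t4` scores too low in every column to be picked by anyone, which is the coverage gap described above.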

Why routers collapse without a balancing policy

During training, routers can drift toward expert collapse: many tokens route to a few "popular" experts while others are underutilized. Load statistics tell you whether this is occurring; they don't prove that a perfectly uniform router is best for model quality.

Start with four interchangeable experts labeled E1 through E4. Early in training, E1 happens to win slightly higher gate scores on common tokens. The router therefore assigns E1 more often. Because those tokens flow through E1, E1 receives more gradient updates and becomes even better at the tokens it already sees. The preference reinforces itself. After enough steps, E1 may take most assignments while E2 through E4 stay under-trained. The model still reserves memory for all four experts, but it's effectively operating with one overworked expert. That is load collapse, not proof that E1 became a "math expert" or that the idle ones own other domains.

Two token-load histograms comparing collapsed and balanced expert routing. — Routing histograms reveal imbalance candidates. Compare them with quality, overflow, and communication metrics before selecting a balancing policy.

The router is trained jointly alongside the rest of the network.^{[6]Reference 6Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.https://arxiv.org/abs/1701.06538} Joint training is what closes this feedback loop: routing decisions determine which experts receive gradient updates, and those updates shape later routing decisions. The operational test isn't a story about specialization: inspect assignment shares, overflow, task loss, and communication load together.

routing-load-monitor.py

assignments = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
expert_count = 4
counts = [assignments.count(expert) for expert in range(expert_count)]
shares = [count / len(assignments) for count in counts]
max_share = max(shares)

print("assignment shares:", [round(share, 2) for share in shares])
print("flag for investigation:", max_share > 0.50)

Output

assignment shares: [0.6, 0.2, 0.1, 0.1]
flag for investigation: True

An auxiliary load-balancing loss

Switch Transformer uses a top-1 auxiliary loss that penalizes uneven routing.^{[1]Reference 1Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.https://arxiv.org/abs/2101.03961} This form is also a useful teaching calculation for top- $k$ routing when $f_i$ is normalized over routed assignments:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot N \sum_{i=1}^N f_i \cdot p_i$$

Reading the formula: multiply each expert's routed-assignment fraction $f_i$ by its average routing probability $p_i$ and sum them up. If the router sends too many assignments to one expert (high $f_i$) and gives it high probability (high $p_i$), this product is large, so the loss penalizes that concentration. It's a balancing pressure, not proof that equal utilization maximizes task quality.

where:

$f_i = \frac{\text{assignments routed to expert } i}{\text{total routed assignments}}$; with top-$k$, the denominator is $T \times k$
$p_i = \frac{1}{T}\sum_{t=1}^T \text{softmax}(W_g \cdot x_t)_i$ (average router probability)
$\alpha$ is a configurable coefficient (Switch reports experiments including $10^{-2}$ )
$N$ is the number of experts

A balanced reference point is $f_i \approx p_i \approx 1/N$ (uniform distribution). A trained deployment still needs task-quality and systems measurements before you judge its observed specialization.

Here's a concrete top-1 numeric check. Suppose we have 4 experts and a mini-batch of 100 tokens:

Expert	$f_i$ (actual fraction)	$p_i$ (average prob)	Product $f_i \cdot p_i$
1	0.60	0.55	0.330
2	0.20	0.22	0.044
3	0.15	0.15	0.022
4	0.05	0.08	0.004
Sum			0.400

If the load were perfectly uniform, each expert would get $f_i = p_i = 0.25$, and the sum would be $4 \times (0.25 \times 0.25) = 0.25$. Multiplying both sums by $N = 4$, as the formula does, gives about $1.60$ for the imbalanced case against $1.00$ for the uniform one. With positive $\alpha$, this auxiliary term applies pressure against the imbalanced case. Tune its weight with task-quality measurements because too much balancing pressure can conflict with the primary objective.

The auxiliary loss is a routing pressure, not the main learning objective. It says, "use the expert pool evenly enough," while the language-modeling loss still decides whether the model predicts the next token well.

load-balancing-score.py

assignment_counts = [60, 20, 15, 5]
mean_router_probability = [0.55, 0.22, 0.15, 0.08]
expert_count = len(assignment_counts)
assignment_total = sum(assignment_counts)
fractions = [count / assignment_total for count in assignment_counts]

concentration = expert_count * sum(
    fraction * probability
    for fraction, probability in zip(fractions, mean_router_probability)
)
uniform_reference = expert_count * sum((1 / expert_count) ** 2 for _ in fractions)

print("scaled concentration:", round(concentration, 2))
print("uniform reference:", round(uniform_reference, 2))
print("excess balancing pressure:", round(concentration - uniform_reference, 2))

Output

scaled concentration: 1.6
uniform reference: 1.0
excess balancing pressure: 0.6
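The score above starts from precomputed counts. For top-$k$ routing, $f_i$ and $p_i$ could be derived from raw router probabilities roughly as follows. This is a sketch under assumptions: `balance_terms` is a hypothetical helper, the probability rows are made up, and $f_i$ is normalized over $T \times k$ assignments as in the definition above.

```python
def balance_terms(prob_rows, top_k):
    """Return (f, p, scaled_sum) for per-token router probabilities."""
    n_experts = len(prob_rows[0])
    counts = [0] * n_experts
    for row in prob_rows:
        chosen = sorted(range(n_experts), key=row.__getitem__, reverse=True)[:top_k]
        for expert in chosen:
            counts[expert] += 1
    total_assignments = len(prob_rows) * top_k  # T * k
    f = [count / total_assignments for count in counts]
    p = [sum(row[i] for row in prob_rows) / len(prob_rows) for i in range(n_experts)]
    scaled_sum = n_experts * sum(fi * pi for fi, pi in zip(f, p))
    return f, p, scaled_sum

# Hypothetical softmax outputs for T = 3 tokens over N = 4 experts.
rows = [
    [0.50, 0.30, 0.15, 0.05],
    [0.45, 0.35, 0.10, 0.10],
    [0.10, 0.20, 0.60, 0.10],
]
f, p, scaled_sum = balance_terms(rows, top_k=2)
alpha = 0.01  # one of the coefficient values Switch reports experimenting with

print("f:", [round(value, 3) for value in f])
print("p:", [round(value, 3) for value in p])
print("auxiliary loss:", round(alpha * scaled_sum, 4))
```

Because $f_i$ comes from a hard top-$k$ count, it carries no gradient in a real training setup; the pressure on the router flows through $p_i$.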

Capacity factor and overflow policy

In a batched environment, many MoE implementations use fixed-size expert buffers so the layer can run predictable tensor operations. This size is determined by the capacity factor:

$$\text{Expert Capacity} = \frac{\text{Tokens per Batch} \times k}{N} \times \text{Capacity Factor}$$

This computes the average number of tokens each expert would receive under perfectly uniform routing, then multiplies by the capacity factor. A factor of $1.0$ means the buffer is exactly sized for uniform routing. Setting it to $1.2$ allows an expert to receive 20% more tokens than the average, providing a safety margin.

What happens if an expert is assigned more tokens than its capacity buffer allows depends on implementation:

Dropping or pass-through: Switch-style designs can skip overflowed expert computations and rely on the residual path for those token-layer assignments.^{[1]Reference 1Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.https://arxiv.org/abs/2101.03961}
Rerouting: A system can send overflow to another eligible expert, changing load and communication.
Dropless execution: A system can execute all assignments with variable work, retaining quality paths while accepting less predictable workload.

The capacity factor is a trade-off: a higher value reduces overflow risk but reserves more buffer space and may waste compute on empty slots. A lower value is more compact but makes whichever overflow policy you selected more important.

capacity-overflow-policy.py

from math import ceil

tokens = 4096
top_k = 2
experts = 32
capacity_factor = 1.0
assigned_to_hot_expert = 400
capacity = ceil(tokens * top_k / experts * capacity_factor)
overflow = max(assigned_to_hot_expert - capacity, 0)

print("per-expert capacity:", capacity)
print("overflow assignments:", overflow)
for policy in ("drop", "reroute", "dropless"):
    print(policy, "must be evaluated for quality and throughput")

Output

per-expert capacity: 256
overflow assignments: 144
drop must be evaluated for quality and throughput
reroute must be evaluated for quality and throughput
dropless must be evaluated for quality and throughput
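To make the three overflow policies concrete, here is a toy sketch of how each could treat the same routing preferences. It assumes top-1 routing with a ranked fallback list per token; `apply_capacity` and the preference data are hypothetical and don't mirror any particular framework.

```python
def apply_capacity(preferences, capacity, policy):
    """Assign tokens to experts under a per-expert capacity.

    preferences: per-token list of expert ids, best first.
    policy: "drop" skips overflow, "reroute" tries the next preference,
            "dropless" ignores the capacity limit.
    """
    load, placed, dropped = {}, {}, []
    for token, ranked in enumerate(preferences):
        candidates = ranked if policy == "reroute" else ranked[:1]
        for expert in candidates:
            if policy == "dropless" or load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                placed[token] = expert
                break
        else:
            # No candidate had room: this token skips the expert FFN in this layer.
            dropped.append(token)
    return placed, dropped, load

prefs = [[0, 1], [0, 1], [0, 2], [1, 0], [0, 1]]  # expert 0 is the hot expert
for policy in ("drop", "reroute", "dropless"):
    placed, dropped, load = apply_capacity(prefs, capacity=2, policy=policy)
    print(policy, "dropped:", dropped, "load:", dict(sorted(load.items())))
```

Dropping keeps every expert within its buffer but loses two assignments, rerouting keeps them by pushing work to second choices, and dropless keeps first choices at the cost of an uneven load on expert 0.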

How modern architectures scale the idea

The basic MoE pattern is consistent across modern models, but the architectures below differ in expert count and granularity, shared experts, and load-balancing strategy.

Mixtral 8x7B^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}

Property	Value
Total params	46.7B
Active params	~12.9B
Experts	8
Top-K	2
Expert type	Standard FFN (SwiGLU activation^{[8]Reference 8GLU Variants Improve Transformerhttps://arxiv.org/abs/2002.05202})
Performance	Mixtral paper (2024) reported matching or exceeding Llama 2 70B and GPT-3.5 on its evaluated suite^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088}

MoE is applied only to the Feed-Forward Network (FFN) layers. Attention layers are shared across all tokens, which is the standard approach. Each expert uses a SwiGLU (Swish Gated Linear Unit) activation function, a variant of the Gated Linear Unit (GLU) that's been shown to improve Transformer performance.

DeepSeek-V2^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}

Property	Value
Total params	236B
Active params	21B
Routed experts	160 (fine-grained)
Shared experts	2 (always-on)
Routed Top-K	6
Total active experts	8 (6 routed + 2 shared)

DeepSeek-V2 uses several extensions to coarse-grained MoE:

Fine-grained experts: 160 smaller routed experts instead of a handful of coarse experts, which gives the router many more combinations to compose per token.^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}
Shared experts: 2 experts are always active for all tokens. The design aims to capture common computation and reduce redundancy among routed experts; every token gets 2 shared expert outputs in addition to the 6 routed ones.^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}
Device-limited routing: The selected routed experts for each token are constrained to a small number of devices, reducing cross-device communication overhead.^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}
Multi-level balancing losses: DeepSeek-V2 adds separate expert-level, device-level, and communication-balance losses to keep routing efficient at cluster scale.^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}
Multi-head Latent Attention (MLA): Compresses the KV cache into a low-rank latent vector, which the paper reports as a large KV-memory and throughput win relative to DeepSeek 67B.^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}
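Device-limited routing from the list above can be sketched in a few lines. The sketch assumes devices are ranked by the single highest affinity among the experts they host, and that ordinary top-$k$ then runs only over experts on the allowed devices. The layout, scores, and the `device_limited_top_k` helper are hypothetical and far smaller than the real 160-expert configuration.

```python
def device_limited_top_k(affinity, expert_device, max_devices, top_k):
    """Pick top_k experts, restricted to the max_devices best-scoring devices."""
    # Score each device by the highest affinity among the experts it hosts.
    best_on_device = {}
    for expert, score in enumerate(affinity):
        device = expert_device[expert]
        best_on_device[device] = max(score, best_on_device.get(device, float("-inf")))
    allowed = sorted(best_on_device, key=best_on_device.get, reverse=True)[:max_devices]
    # Ordinary top-k, but only over experts that live on the allowed devices.
    eligible = [e for e in range(len(affinity)) if expert_device[e] in allowed]
    chosen = sorted(eligible, key=affinity.__getitem__, reverse=True)[:top_k]
    return sorted(allowed), chosen

# Hypothetical layout: 8 experts spread over 4 devices, two per device.
expert_device = [0, 0, 1, 1, 2, 2, 3, 3]
affinity = [0.30, 0.05, 0.25, 0.02, 0.20, 0.01, 0.15, 0.02]

unrestricted = sorted(range(8), key=affinity.__getitem__, reverse=True)[:3]
allowed, chosen = device_limited_top_k(affinity, expert_device, max_devices=2, top_k=3)

print("unrestricted top-3 devices:", sorted({expert_device[e] for e in unrestricted}))
print("device-limited experts:", chosen, "on devices:", allowed)
```

The unrestricted choice touches three devices, while the limited version swaps the third-best expert for a lower-scoring one that already lives on an allowed device. That is the quality-for-communication trade the constraint makes.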

The diagram contrasts Mixtral's coarse-grained top-2 routing with DeepSeek-V2's fine-grained routed-plus-shared design. It shows the full expert counts, while tile sizes and the example selections are schematic.

Expert-pool comparison showing Mixtral selecting 2 of 8 routed experts and DeepSeek-V2 activating 2 shared experts plus 6 of 160 routed experts. — Cell counts are exact: Mixtral chooses 2 of 8 routed experts. DeepSeek-V2 runs 2 shared experts plus 6 selected from 160 routed experts; tile sizes are schematic.

DeepSeek-V3^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437}

DeepSeek-V3 retains the DeepSeekMoE pattern at a larger reported scale.

Property	Value
Total params	671B
Active params	37B
Routed experts	256
Shared experts	1
Routed Top-K	8
Total active experts	9 (8 routed + 1 shared)
Main load balancing	Bias-adjusted auxiliary-loss-free strategy
Auxiliary guard	Sequence-wise balance loss
Other key component	Multi-head Latent Attention (MLA)

Each MoE layer has 1 shared expert and 256 routed experts, with 8 routed experts activated per token.^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437}

DeepSeek-V3 changes the routing details too. It computes token-to-expert affinity with sigmoid scores, normalizes the selected scores, and adds a per-expert bias only when choosing the top- $k$ routed experts. Training adjusts that bias from observed expert load. The weighted expert output still uses the original affinity score, not the routing bias.^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437}

The paper calls this main batch-wise mechanism auxiliary-loss-free because it doesn't rely on an auxiliary loss to keep expert load balanced across a training batch. It still keeps a small complementary sequence-wise balance loss to prevent extreme imbalance inside one sequence.^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437} The distinction matters: the standard softmax router above is a useful mental model, not a universal implementation contract.
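A minimal sketch of that selection-versus-weighting split follows. It assumes a simple update rule that moves each bias by a fixed step depending on whether the expert's load is above or below the mean. The logits, load numbers, and especially the step size are made up, and the step is deliberately exaggerated so that a single update changes the selection.

```python
from math import exp

def sigmoid(value):
    return 1.0 / (1.0 + exp(-value))

def route_with_bias(logits, bias, top_k):
    """Select experts by (affinity + bias); weight outputs by affinity only."""
    affinity = [sigmoid(value) for value in logits]
    biased = [a + b for a, b in zip(affinity, bias)]
    chosen = sorted(range(len(logits)), key=biased.__getitem__, reverse=True)[:top_k]
    kept = sum(affinity[i] for i in chosen)
    gates = {i: affinity[i] / kept for i in chosen}  # bias never enters the gate value
    return chosen, gates

def update_bias(bias, load, step):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    updated = []
    for value, count in zip(bias, load):
        if count > mean_load:
            updated.append(value - step)
        elif count < mean_load:
            updated.append(value + step)
        else:
            updated.append(value)
    return updated

logits = [2.0, 1.9, 0.0, -1.0]
bias = [0.0, 0.0, 0.0, 0.0]
chosen_before, _ = route_with_bias(logits, bias, top_k=2)
bias = update_bias(bias, load=[90, 70, 30, 10], step=0.5)  # hypothetical observed load
chosen_after, gates_after = route_with_bias(logits, bias, top_k=2)

print("before:", chosen_before, "after:", chosen_after)
print("gate weights after:", {i: round(w, 3) for i, w in gates_after.items()})
```

The bias changes which experts win, but the blended output still weights them by their original sigmoid affinities, renormalized over the selected set.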

Qwen3.6 sparse MoE checkpoint^{[5]Reference 5Qwen3.6-35B-A3Bhttps://huggingface.co/Qwen/Qwen3.6-35B-A3B}

Qwen3.6 makes the active-versus-total split visible in the model name:

Checkpoint	Total params	Activated params	What the suffix means
Qwen3.6-35B-A3B	35B	3B	35B total, about 3B active per token

This naming is a useful production reminder. "A3B" hints at active compute, not full deployment footprint. You still need enough memory, sharding, quantization, or offload capacity for the full checkpoint, and you still need serving support for routing, grouped expert execution, and any long-context KV-cache pressure.
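A quick weights-only estimate shows how far apart the two numbers in the name are. The sketch assumes the nominal 35B and 3B counts and idealized bytes per parameter; it ignores quantization scales, KV cache, activations, and runtime buffers, so real footprints will be higher.

```python
def weight_gb(param_count, bytes_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring all runtime overhead."""
    return param_count * bytes_per_param / 1e9

total_params = 35e9   # "35B" in the checkpoint name
active_params = 3e9   # "A3B" in the checkpoint name

# Idealized storage widths; real quantized formats add metadata.
for label, size in {"bf16": 2.0, "int8": 1.0, "int4": 0.5}.items():
    print(
        f"{label}: ~{weight_gb(total_params, size):.1f} GB resident, "
        f"~{weight_gb(active_params, size):.1f} GB of weights touched per token"
    )
```

The resident column is what the deployment has to hold somewhere; the per-token column only hints at compute and memory-bandwidth cost for a single token, and batches that hit different experts touch more.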

When MoE is worth the complexity

Use MoE when the problem needs more model capacity at roughly fixed active compute, and the serving stack can absorb routing complexity. Prefer a dense model when the bottleneck is single-request latency, weak interconnect, or full-checkpoint memory.

Decision signal	Dense is likely better	MoE is worth testing
Traffic shape	Low batch size, strict p95 latency	Large batches or offline throughput
Memory budget	One model must fit on a small fixed GPU set	Full checkpoint can be sharded, quantized, or offloaded
Network	All-to-all traffic already saturates links	Fast interconnect and device-aware routing are available
Quality target	Dense baseline already meets requirements	Extra capacity may improve harder tasks
Serving stack	Kernels lack grouped expert execution	Runtime supports routing, dispatch, combine, and expert placement
Observability	No routing, overflow, or per-expert metrics	You can inspect load balance, capacity, drops, and communication

The wrong question is "does this model activate fewer parameters?" The useful question is whether the target workload gains enough quality or throughput after you account for full weight residency, routing overhead, interconnect cost, and operational visibility.

The hardware reality of serving an MoE

Implementing MoE introduces complexities not present in dense models. The compute per token is sparse, but the system still faces several hard serving constraints.

Communication overhead

In distributed training and inference, experts often reside on different devices to meet memory requirements. Routing a token to an expert on another GPU requires transferring the token's hidden state over a high-bandwidth interconnect such as NVLink within a node or InfiniBand across nodes.

This communication can quickly become a bottleneck if not managed carefully. The process usually has an all-to-all dispatch phase that scatters tokens to the GPUs hosting their selected experts, followed by an all-to-all combine phase that returns the expert outputs to the original token order. To maintain high throughput, system designers try to overlap those transfers with useful compute so the network isn't stalling the whole layer.

expert-placement-traffic.py

expert_device = {"E0": "GPU0", "E1": "GPU0", "E2": "GPU1", "E3": "GPU1"}
routed_assignments = [
    ("GPU0", "E0"),
    ("GPU0", "E2"),
    ("GPU1", "E2"),
    ("GPU1", "E1"),
    ("GPU0", "E3"),
]
remote = sum(
    source_device != expert_device[expert]
    for source_device, expert in routed_assignments
)

print("routed assignments:", len(routed_assignments))
print("remote assignments:", remote)
print("remote rate:", f"{remote / len(routed_assignments):.0%}")

Output

routed assignments: 5
remote assignments: 3
remote rate: 60%

Memory requirements

Unlike dense models where there's only one FFN per layer, MoE models carry many expert FFNs. Fast GPU deployments generally keep those expert weights resident somewhere in the serving pool so routing doesn't stall on weight fetches.

This creates the core MoE paradox: low active compute per token, but very high total weight residency. A model like Mixtral 8x7B only activates about 12.9B parameters per token, yet the serving system still needs the full 46.7B parameter checkpoint available across the cluster.^{[2]Reference 2Mixtral of Experts.https://arxiv.org/abs/2401.04088} Tensor Parallelism (TP) splits individual layer computations across multiple GPUs, which lets larger shared layers fit into memory, but it doesn't exploit the fact that each token touches only a few experts. Expert Parallelism (EP) is the common complement: it places whole experts on different GPUs while keeping every expert reachable when a token needs it.

Training instability

MoE training can be more sensitive than training a comparable dense Transformer. The top- $k$ routing step introduces discontinuities at expert-selection boundaries, and small routing biases can snowball into expert collapse. Gradients still flow through the selected experts and their gate weights, but the router is noisier to optimize than a dense FFN.

To combat this, implementations may add stabilizers. Router z-loss^{[9]Reference 9ST-MoE: Designing Stable and Transferable Sparse Expert Modelshttps://arxiv.org/abs/2202.08906} penalizes unnecessarily large routing logits; the ST-MoE paper reports improved training stability from this term. Input jitter is another candidate to test, but ST-MoE's XL-scale ablation improved stability while reducing quality.^{[9]Reference 9ST-MoE: Designing Stable and Transferable Sparse Expert Modelshttps://arxiv.org/abs/2202.08906} The relevant evidence is training loss, routing distribution, overflow, and downstream quality rather than assuming one stabilizer is free or mandatory for every MoE.
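As a sketch of what the router z-loss measures, the function below averages the squared log-sum-exp of each token's router logits, which is the form given in the ST-MoE paper. The logit rows are hypothetical, and the coefficient that scales this term in the total loss is left out because it is a tuning choice.

```python
from math import exp, log

def router_z_loss(logit_rows):
    """Mean over tokens of (log-sum-exp of that token's router logits) squared."""
    total = 0.0
    for row in logit_rows:
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + log(sum(exp(value - peak) for value in row))
        total += log_sum_exp ** 2
    return total / len(logit_rows)

small = [[0.5, -0.5, 0.0, 0.2], [0.1, 0.3, -0.2, 0.0]]
large = [[value * 10 for value in row] for row in small]  # same ranking, bigger magnitudes

print("z-loss, small logits:", round(router_z_loss(small), 3))
print("z-loss, large logits:", round(router_z_loss(large), 3))
```

Scaling the logits leaves the top-$k$ ranking unchanged but raises the penalty, so the term discourages large logit magnitudes without directly dictating which expert wins.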

Expert parallelism layout

Large distributed MoE deployments commonly use Expert Parallelism (EP). While Tensor Parallelism (TP) splits individual matrix multiplications, EP maps entire experts to specific GPUs. The layout below is deliberately schematic: it illustrates expert ownership and a shared attention path, not one universal sharding plan. Real deployments can combine EP with TP, sequence parallelism, data parallelism, and redundant expert placement. For example, DeepSeek-V3 describes a decoding layout with TP4 for attention and EP320 for the MoE path.^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437}

The smaller 96-expert sketch looks like this:

GPU 0: Attention layers (shared) + Experts 0-31
GPU 1: Attention layers (shared) + Experts 32-63
GPU 2: Attention layers (shared) + Experts 64-95
...

Each GPU stores a subset of experts. When a token routes to an expert on another GPU, an all-to-all communication sends the token there and retrieves the result.

Expert parallelism pipeline where a router assigns tokens, token states scatter to GPU-local experts, batches compute on owner GPUs, and outputs gather back to sequence order. — Expert weights stay on their owner GPUs. Token states scatter to selected experts, batch for grouped computation, then cross the network again to restore original order.
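To make the ownership idea concrete, here is a minimal sketch that maps expert IDs to owner GPUs using the 32-experts-per-GPU layout from the schematic and classifies each routed assignment as local or remote. The `owner_gpu` helper and the token assignments are hypothetical; a real system would perform the remote transfers with an all-to-all collective instead of printing them.

```python
from collections import Counter

EXPERTS_PER_GPU = 32  # matches the schematic: GPU 0 owns 0-31, GPU 1 owns 32-63, ...

def owner_gpu(expert_id):
    """Return the GPU that stores the given expert's weights."""
    return expert_id // EXPERTS_PER_GPU

# Hypothetical (token, source GPU, selected expert) assignments after top-2 routing.
assignments = [
    ("t0", 0, 5),
    ("t0", 0, 40),
    ("t1", 1, 33),
    ("t1", 1, 70),
    ("t2", 2, 95),
    ("t2", 2, 3),
]

traffic = Counter()
for token, src, expert in assignments:
    dst = owner_gpu(expert)
    kind = "local" if src == dst else "remote"
    traffic[kind] += 1
    print(f"{token}: GPU {src} -> expert {expert} on GPU {dst} ({kind})")

print("local:", traffic["local"], "remote:", traffic["remote"])
```

Every remote assignment crosses the interconnect twice: once to reach the expert and once to return the result.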

Expert prefetching

For CPU-offloaded serving (for example, running a quantized MoE checkpoint on consumer hardware), one possible design is:

Keep attention weights and router weights in GPU memory (always needed).
Keep expert weights on CPU or SSD.
After a layer's router decides which experts are needed, transfer those expert weights before its expert computation.
Optionally use prediction or caching to stage likely later-layer experts, accepting misses or unused transfers.

This can make large MoE checkpoints fit where full GPU residency can't, especially when combined with weight quantization, but it trades memory for transfer-dependent latency. You can't know an exact next-layer route until that layer receives its hidden state, so unconditional "prefetch the next experts" isn't free.
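One way to reason about the caching option is a small least-recently-used (LRU) cache that stands in for a fixed budget of GPU-resident experts. The `ExpertCache` class, its capacity, and the request sequence below are all hypothetical; the sketch counts hits and misses only and doesn't move any weights.

```python
from collections import OrderedDict

class ExpertCache:
    """Tiny LRU cache standing in for a fixed budget of GPU-resident experts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # a miss stands for a CPU/SSD -> GPU weight transfer
        self.resident[expert_id] = True
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used expert

cache = ExpertCache(capacity=2)
for expert_id in ["E2", "E0", "E2", "E1", "E2", "E0"]:
    cache.fetch(expert_id)

print("hits:", cache.hits, "misses (transfers):", cache.misses)
```

Each miss is a transfer on the critical path, so the hit rate under the real routing distribution decides whether offloading is tolerable.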

Batch routing efficiency

In batch serving, different tokens in a batch may route to different experts. Batched expert execution is the serving trick:

1. Router assigns all tokens in the batch to their top-k experts.
2. Group tokens by expert assignment.
3. Execute each expert as a batched matrix multiply over its assigned tokens.
4. Scatter results back to original token positions.

This groups many per-token expert calls into a smaller number of larger matrix multiplies, which helps keep GPU utilization high with MoE models.

grouped-expert-execution.py

from collections import defaultdict

routes = [
    ("t0", "E2"),
    ("t1", "E0"),
    ("t2", "E2"),
    ("t3", "E1"),
    ("t4", "E2"),
]
grouped = defaultdict(list)
for token, expert in routes:
    grouped[expert].append(token)

for expert in sorted(grouped):
    print(expert, "batch tokens:", grouped[expert], "gemm rows:", len(grouped[expert]))

Output

E0 batch tokens: ['t1'] gemm rows: 1
E1 batch tokens: ['t3'] gemm rows: 1
E2 batch tokens: ['t0', 't2', 't4'] gemm rows: 3

Benchmark these slices on the target system rather than memorizing a single batch-size crossover:

Measurement slice	What it exposes
Single request, batch size 1	Fixed router, grouping, and dispatch overhead
Moderate and large batches	How much grouped expert execution amortizes that overhead
Local versus remote expert routes	Cost of expert placement and all-to-all traffic
Strong versus weak interconnect	Whether network bandwidth changes the winner

Exact outcomes depend on kernels, expert placement, interconnect, and request mix. MoE adds router computation, expert selection, token grouping, and sometimes all-to-all communication; batching can amortize that overhead, but only measurements on the target system establish whether MoE wins latency or throughput.

The next code block uses an illustrative measurement fixture. Replace these numbers with benchmark output from the serving stack under evaluation.

measured-serving-gate.py

dense = {"quality": 0.82, "p95_ms": 48, "tokens_per_second": 820}
moe = {"quality": 0.83, "p95_ms": 55, "tokens_per_second": 1100}
requirements = {"quality": 0.82, "max_p95_ms": 60, "min_tokens_per_second": 1000}

approved = (
    moe["quality"] >= requirements["quality"]
    and moe["p95_ms"] <= requirements["max_p95_ms"]
    and moe["tokens_per_second"] >= requirements["min_tokens_per_second"]
)

print("moe throughput improvement:", f"{moe['tokens_per_second'] / dense['tokens_per_second'] - 1:.1%}")
print("moe p95 delta ms:", moe["p95_ms"] - dense["p95_ms"])
print("candidate approved:", approved)

Output

moe throughput improvement: 34.1%
moe p95 delta ms: 7
candidate approved: True

When MoE serving breaks down

"Active FLOPs are small, so deployment memory will be small too"

Symptom: FLOP estimates look cheap, but deployment still runs out of VRAM.
Cause: The team budgeted only for active parameters per token and forgot that the full expert pool must still be resident somewhere in the serving system.
Fix: Separate active expert FLOPs from total weight residency. Size memory for the full expert pool as well as the routed path.
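The gap is easy to see with back-of-the-envelope arithmetic. The sketch below uses the 46.7B total and 12.9B active parameter counts quoted elsewhere in this article and assumes 16-bit weights (2 bytes per parameter). It covers weights only and leaves out KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 46.7e9    # all experts plus shared layers
ACTIVE_PARAMS = 12.9e9   # parameters touched per token
BYTES_PER_PARAM = 2      # assumption: 16-bit weights, no quantization

def weight_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Weight footprint in decimal gigabytes."""
    return params * bytes_per_param / 1e9

resident_gb = weight_gb(TOTAL_PARAMS)
active_gb = weight_gb(ACTIVE_PARAMS)

print(f"weights that must be resident: {resident_gb:.1f} GB")
print(f"weights read per token:        {active_gb:.1f} GB")
print(f"residency / active ratio:      {resident_gb / active_gb:.2f}x")
```

Quantization changes `BYTES_PER_PARAM`, but it doesn't change which of the two parameter counts sets the residency budget.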

"Router imbalance will fix itself later"

Symptom: One or two experts receive most tokens early in training.
Cause: The router found a slightly better path, that expert got more gradient updates, and the collapse reinforced itself.
Fix: Track routing histograms from the start and tune the load-balancing policy before unused experts go stale. Evaluate router z-loss separately for numerical stability, and inspect capacity plus overflow policy separately for dropped or delayed work.
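A routing histogram check can be as simple as the sketch below, which turns per-token expert assignments into per-expert shares and flags any expert far above its uniform share. The assignments, the `routing_shares` helper, and the 2x alert threshold are hypothetical choices for illustration.

```python
from collections import Counter

def routing_shares(expert_ids, num_experts):
    """Fraction of assignments received by each expert."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]

# Hypothetical top-1 assignments from one training step with 4 experts.
assignments = [0] * 14 + [1] * 3 + [2] * 2 + [3] * 1

shares = routing_shares(assignments, num_experts=4)
uniform = 1 / 4
ALERT_RATIO = 2.0  # assumption: flag experts receiving over 2x their uniform share

for expert, share in enumerate(shares):
    flag = "  <-- overloaded" if share > ALERT_RATIO * uniform else ""
    print(f"expert {expert}: {share:.0%}{flag}")
```

Logging these shares per layer and per step from the start of training makes it possible to see drift before the underused experts go stale.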

"Sparse compute guarantees lower latency"

Symptom: MoE benchmark wins disappear at batch size 1.
Cause: Sparse expert FLOPs aren't the whole latency story. Router work, grouping, dispatch, and all-to-all traffic can dominate single-request latency.
Fix: Evaluate MoE on the real serving regime. Use dense models for low-latency single requests when routing overhead outweighs sparse compute gains.

"Expert placement is a later infra detail"

Symptom: Throughput drops sharply after experts are split across GPUs.
Cause: Expert placement created communication hotspots. Tokens now cross devices so often that the transfer cost exceeds what grouped GEMMs can amortize.
Fix: Treat routing and placement as one systems problem. Use expert parallelism, device-aware routing, and batching to keep all-to-all traffic under control.

"Balanced histograms mean routing is healthy"

Symptom: Quality drops even though router histograms look balanced.
Cause: Experts may be hitting capacity limits, and the configured overflow policy is dropping, rerouting, or delaying assignments.
Fix: Inspect overflow counts and policy directly. Evaluate capacity factor, routing balance, burstiness, and throughput before assuming the issue is elsewhere.
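The sketch below shows one way an aggregate histogram can look balanced while individual batches still overflow. The two batches and the per-expert capacity of 4 are hypothetical numbers chosen to make the effect visible.

```python
from collections import Counter

CAPACITY = 4  # assumption: per-expert slots available in each batch

# Hypothetical expert assignments for two consecutive batches.
batches = [
    ["E0"] * 6 + ["E1"] * 2,
    ["E0"] * 2 + ["E1"] * 6,
]

aggregate = Counter()
overflow = 0
for batch in batches:
    counts = Counter(batch)
    aggregate.update(counts)
    # Assignments beyond capacity are what the overflow policy has to handle.
    overflow += sum(max(0, n - CAPACITY) for n in counts.values())

print("aggregate histogram:", dict(sorted(aggregate.items())))
print("overflowed assignments:", overflow)
```

Both experts receive 8 assignments in total, yet 4 of the 16 assignments exceed capacity because the load is bursty within each batch.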

"Expert-choice routing covers every token automatically"

Symptom: Expert-choice routing leaves some tokens underprocessed.
Cause: Experts select their favorite tokens first, so token coverage is no longer guaranteed automatically.
Fix: Add explicit coverage rules or use token-choice top-k routing when every token must reliably receive expert computation.
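A toy version of expert-choice selection makes the coverage gap visible. In the sketch below each expert picks its top-scoring tokens up to a fixed budget; the affinity scores, the two-expert setup, and the budget of two tokens per expert are hypothetical.

```python
# Hypothetical router affinities: scores[token][expert].
scores = {
    "t0": [0.9, 0.8],
    "t1": [0.7, 0.6],
    "t2": [0.2, 0.1],
    "t3": [0.6, 0.9],
}
NUM_EXPERTS = 2
TOKENS_PER_EXPERT = 2  # each expert picks its top-2 tokens

coverage = {token: 0 for token in scores}
for expert in range(NUM_EXPERTS):
    # Rank tokens by this expert's affinity, highest first.
    ranked = sorted(scores, key=lambda t: scores[t][expert], reverse=True)
    for token in ranked[:TOKENS_PER_EXPERT]:
        coverage[token] += 1

uncovered = [t for t, n in coverage.items() if n == 0]
print("experts per token:", coverage)
print("tokens with no expert:", uncovered)
```

Expert load is perfectly even here, but `t2` receives no expert computation at all, which is the case a coverage rule has to catch.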

"Experts are neat human-labeled topic buckets"

Symptom: Teams describe experts as clean topic buckets like "math expert" or "biology expert."
Cause: Human-friendly labels are being projected onto behavior that often follows token patterns, formatting, or structural clusters instead.
Fix: Validate specialization with routing traces, activation analysis, and targeted prompts that test a claimed specialty before telling a story about it.

Mastery check

Key concepts

Sparse versus dense FFN computation
Router logits, top-k selection, and renormalization
Expert collapse and balancing pressure
Capacity factor and dropped-token overflow
Total weight residency versus active FLOPs
Expert parallelism, dispatch, and all-to-all cost
Dense-vs-MoE deployment gates for latency, throughput, memory, and interconnect limits

What strong answers show

Choose dense vs MoE for a serving regime, and justify that choice with latency, throughput, memory residency, and communication cost.
Trace one token by hand: compute router scores, pick top-$k$, renormalize, then blend expert outputs.
Diagnose collapse from routing histograms and say which mitigation you would try first.
Compute capacity risk from batch tokens, top-$k$, number of experts, and capacity factor, then explain what overflow does to quality.
Explain why tensor parallelism helps shared layers but expert parallelism and placement strategy are needed for routed FFNs.
Defend when shared experts, device-limited routing, or finer-grained expert pools buy enough systems value to justify extra complexity.
Interpret names like Qwen3.6-35B-A3B correctly: active parameters are a compute hint, not a full memory estimate.

Follow-up questions

Your team wants a 48-expert MoE because active FLOPs look like a 12.9B dense model. Why can memory still look like a much larger checkpoint?

Active FLOPs follow only the experts selected for each token, but the full expert pool still has to live somewhere in the serving system. That means VRAM or cluster-wide weight residency is driven by total parameters, not only by active parameters. MoE separates compute cost from capacity, not memory from capacity.

A checkpoint is named Qwen3.6-35B-A3B. What does that name tell you, and what does it not tell you?

The name says the model has about 35B total parameters and about 3B activated parameters per token. It doesn't mean deployment only needs 3B parameters' worth of memory. The full checkpoint still has to be stored, sharded, quantized, or offloaded, and the serving runtime still has to handle routing and expert execution.

One expert takes 70% of tokens by step 500. Why is that failure self-reinforcing, and what do you inspect first?

It's self-reinforcing because the favored expert gets more tokens, more gradients, and therefore improves faster, which makes the router favor it even more. First inspect routing histograms, average router probabilities, dropped-token rates, and balancing-loss terms. Those signals show whether collapse is still reversible or already starving the rest of the expert pool.

Why might a dense model beat MoE on latency for batch size 1 while an MoE candidate wins throughput at batch size 64?

At batch size 1, MoE pays router, grouping, and often all-to-all overhead without enough routed tokens to amortize it. Dense models avoid that coordination overhead and can win on raw latency. At larger batch sizes, grouped expert execution can make sparse compute competitive on throughput. Benchmark both claims on the actual hardware.

A serving stack already uses tensor parallelism for attention. Why is that not enough for MoE layers?

Tensor parallelism shards individual matrix multiplications, but MoE also needs whole-expert placement, token dispatch, and result recombination. That's why expert parallelism exists: the problem isn't only splitting math, but moving tokens to whichever GPU holds their selected experts and then restoring original token order.

A batch sends 4096 tokens through top-2 routing with 32 experts and capacity factor 1.0. What is the nominal per-expert buffer, and what failure appears if one expert gets 400 assignments?

The nominal per-expert buffer is (4096 * 2) / 32 = 256 token assignments. If one expert gets 400 assignments, then 144 assignments overflow that buffer. A Switch-style dropping policy skips the overflowed expert computation and may hurt quality; rerouting or dropless designs pay different compute and communication costs. Inspect the configured policy.
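The same arithmetic as a short sketch. The `expert_capacity` helper is hypothetical, and rounding up with `math.ceil` is an assumption, since implementations differ on how they round the buffer size.

```python
import math

def expert_capacity(tokens, top_k, num_experts, capacity_factor):
    """Per-expert buffer size in token assignments."""
    return math.ceil(tokens * top_k / num_experts * capacity_factor)

capacity = expert_capacity(tokens=4096, top_k=2, num_experts=32, capacity_factor=1.0)
overflow = max(0, 400 - capacity)

print("per-expert capacity:", capacity)
print("overflow at 400 assignments:", overflow)
print("capacity at factor 1.25:", expert_capacity(4096, 2, 32, 1.25))
```

Raising the capacity factor to 1.25 grows the buffer to 320 assignments, which still leaves 80 of the 400 over the limit in this scenario.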

Your interconnect is weak, and all-to-all traffic is already the bottleneck. What do you change before adding even more experts?

Start by constraining and measuring communication, not by assuming more experts solve it. Device-limited routing, expert placement that keeps common routes local, shared experts for common computation, and larger grouped batches are candidate mitigations. Adding routed experts changes capacity and traffic; benchmark that change against the existing bottleneck.^{[3]Reference 3DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}^{[4]Reference 4DeepSeek-V3 Technical Report.https://arxiv.org/abs/2412.19437}

Try it yourself: manual router exercise

Here's a self-contained pencil-and-paper exercise that checks whether you can trace the exact computation an MoE router performs. Route one token representation through the selected expert subnetworks without looking back at the worked example above.

Setup

One token arrives with hidden state $x = [0.5, -0.2, 0.1]$.
The router weight matrix is $W_g = \begin{bmatrix} 0.4 & -0.1 & 0.2 \\ -0.2 & 0.3 & 0.1 \\ 0.1 & 0.1 & -0.3 \end{bmatrix}$.
There are 3 experts, and top-$k$ routing uses $k = 2$.
Suppose Expert 1 outputs $o_1 = [1.0, 0.5]$, Expert 2 outputs $o_2 = [0.2, 1.2]$, and Expert 3 outputs $o_3 = [0.8, -0.1]$.

Questions

1. Compute the raw router scores: $s = W_g \cdot x$. (Hint: this is a 3-vector.)
2. Apply softmax to get probabilities $p_1, p_2, p_3$.
3. Which two experts are selected?
4. Renormalize the selected probabilities so they sum to 1.
5. Compute the final layer output as a weighted sum of the selected expert outputs.

Solution sketch

1. $s = [0.4(0.5) + (-0.1)(-0.2) + 0.2(0.1), \dots] = [0.24, -0.15, 0.00]$
2. Softmax gives roughly $p \approx [0.406, 0.275, 0.319]$.
3. Selected: Expert 1 and Expert 3 (top 2 probabilities).
4. Renormalized: $g_1 = 0.406 / (0.406 + 0.319) \approx 0.560$, $g_3 = 0.319 / (0.406 + 0.319) \approx 0.440$
5. Output $\approx 0.560 \cdot [1.0, 0.5] + 0.440 \cdot [0.8, -0.1] \approx [0.912, 0.236]$

If your numbers are within 0.02 of these, you can trace the core top-$k$ routing math that most MoE papers build on.
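To check a hand calculation, the exercise can be reproduced in a few lines of plain Python. This is a sketch of the exercise setup only: the variable names are arbitrary, experts are 0-indexed internally and printed 1-indexed, and softmax is computed without the usual max-subtraction because the scores are small.

```python
import math

x = [0.5, -0.2, 0.1]
W_g = [
    [0.4, -0.1, 0.2],
    [-0.2, 0.3, 0.1],
    [0.1, 0.1, -0.3],
]
expert_outputs = [[1.0, 0.5], [0.2, 1.2], [0.8, -0.1]]
TOP_K = 2

# Router scores: one dot product per expert row of W_g.
scores = [sum(w * v for w, v in zip(row, x)) for row in W_g]

# Softmax over all experts.
exps = [math.exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]

# Keep the top-k experts and renormalize their probabilities to sum to 1.
selected = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:TOP_K]
norm = sum(probs[i] for i in selected)
gates = {i: probs[i] / norm for i in selected}

# Weighted sum of the selected expert outputs.
output = [
    sum(gates[i] * expert_outputs[i][d] for i in selected)
    for d in range(len(expert_outputs[0]))
]

print("probs:", [round(p, 3) for p in probs])
print("selected experts (1-indexed):", [i + 1 for i in selected])
print("gates:", {i + 1: round(g, 3) for i, g in gates.items()})
print("output:", [round(v, 3) for v in output])
```

Changing `TOP_K` to 1 or editing a row of `W_g` is a quick way to see how selection and renormalization shift the output.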

Sparse routing checklist

The central trick behind large sparse MoE models should be concrete: replace selected dense FFNs with routed expert sets and activate only configured paths per token. A real serving claim still has to pass the router trace, total-parameter versus active-compute distinction, balance, overflow policy, memory residency, and communication checks.

Next Step

Continue to Mamba & State Space Models

There, you'll compare selective state-space sequence processing with attention-based paths and reason about when architecture changes alter memory, throughput, and quality trade-offs.

References

[1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Fedus, W., Zoph, B., & Shazeer, N. · 2022

[2] Mixtral of Experts. Jiang, A. Q., et al. · 2024

[3] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. DeepSeek-AI · 2024

[4] DeepSeek-V3 Technical Report. DeepSeek-AI · 2024 · arXiv preprint

[5] Qwen3.6-35B-A3B. Qwen Team · 2026

[6] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Shazeer, N., et al. · 2017 · ICLR 2017

[7] Mixture-of-Experts with Expert Choice Routing. Zhou, Y., et al. · 2022

[8] GLU Variants Improve Transformer. Shazeer, N. · 2020

[9] ST-MoE: Designing Stable and Transferable Sparse Expert Models. Zoph, B., et al. · 2022
