LeetLLM

Mixture of Experts Architecture

Master MoE routing, load balancing, and understand why models like Mixtral and DeepSeek deliver strong capacity-per-compute tradeoffs compared with dense architectures.


The last two chapters focused on serving pressure: memory, long contexts, batching, and GPU capacity. Mixture of Experts (MoE) attacks a different bottleneck inside the Transformer layer itself.

By the time you reach this article, you've seen how a Transformer layer works: attention mixes information across tokens, then a Feed-Forward Network (FFN) processes each token independently. That FFN is dense. Every single weight in it participates in every single forward pass, no matter what the input says. This is simple, but it's also expensive. If you want the model to know more facts, handle more syntax, or reason about more domains, the standard recipe is to make the FFN wider or deeper. Every extra neuron adds parameters and adds compute for every token.

Mixture of Experts (MoE) breaks that link. Instead of one giant FFN, an MoE layer keeps many smaller FFNs in parallel and uses a lightweight router to decide which ones to run for each token. This means the model can have hundreds of billions of total parameters while only spending compute on a small subset of them for any given input.

MoE is one widely used way to separate model capacity from per-token compute. Switch Transformer[1] popularized sparse activation in trillion-parameter models, and open models such as Mixtral[2], DeepSeek-V2[3], and DeepSeek-V3[4] show how MoE can scale to hundreds of billions of total parameters while keeping the active compute for each token much closer to a far smaller dense model.

From one FFN to many experts

In a standard Transformer layer, the FFN sublayer is a single dense network. In an MoE layer, that one FFN is replaced by $N$ parallel "expert" FFNs, plus a small router (also called a gating network) that decides which $k$ experts handle each token.[5]

In many modern LLMs, the experts replace the FFN sublayer while attention stays dense and shared across all tokens.[2][3] A common router design is a single linear projection from the token's hidden state down to $N$ scores.

[Figure: MoE block where shared attention stays dense, the router picks the top two experts, and only those FFNs run.]
Caption: One token scores the expert menu, keeps the top two winners, runs only those FFNs, then blends the outputs.

A concrete routing walkthrough

Before we write a formula or code, let's trace one token through an MoE layer by hand. This is the shape of the computation that happens inside models like Mixtral on every single token.

Setup

  • One token arrives with a hidden state $x$ of dimension 4.
  • There are 4 experts, each a small FFN.
  • The router uses top-$k$ with $k = 2$ (it picks the 2 best experts).

Step 1: Router scores

The router multiplies the token by a learned weight matrix $W_g$ to get a raw score for each expert. Suppose the scores come out as:

| Expert | Raw score (logit) |
|--------|-------------------|
| 1      | -0.65             |
| 2      | -1.77             |
| 3      | -1.35             |
| 4      | -3.00             |

Step 2: Softmax probabilities

Apply softmax so the scores become probabilities that sum to 1. The result is approximately:

| Expert | Probability |
|--------|-------------|
| 1      | 0.52        |
| 2      | 0.17        |
| 3      | 0.26        |
| 4      | 0.05        |

Step 3: Top-k selection

Pick the top 2 experts. That's Expert 1 (0.52) and Expert 3 (0.26).

Step 4: Renormalize

The selected weights are renormalized so they sum to 1 again within the chosen subset:

  • Expert 1: $0.52 / (0.52 + 0.26) \approx 0.67$
  • Expert 3: $0.26 / (0.52 + 0.26) \approx 0.33$

Step 5: Expert computation

Only Experts 1 and 3 run. Each processes the same input token $x$ through its own FFN weights.

  • $E_1(x)$ produces output vector $o_1$
  • $E_3(x)$ produces output vector $o_3$

Step 6: Weighted sum

The final layer output blends the two expert outputs using the renormalized gate weights:

$$\text{Output} = 0.67 \cdot E_1(x) + 0.33 \cdot E_3(x)$$

That's it. The other two experts never process this token, so they contribute no expert-FFN FLOPs for this routing decision. Their weights still need to be stored or fetched by the serving system for other tokens that may select them.

routing-walkthrough.py
```python
from math import exp

logits = [-0.65, -1.77, -1.35, -3.00]
unnormalized = [exp(score) for score in logits]
probabilities = [value / sum(unnormalized) for value in unnormalized]
selected = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)[:2]
selected_total = sum(probabilities[index] for index in selected)
weights = [probabilities[index] / selected_total for index in selected]

print("selected experts:", [index + 1 for index in selected])
print("renormalized weights:", [round(weight, 2) for weight in weights])
print("selected weight sum:", round(sum(weights), 2))
```
Output
```text
selected experts: [1, 3]
renormalized weights: [0.67, 0.33]
selected weight sum: 1.0
```

The general MoE layer formula

We can write the same walkthrough as a compact equation. For an input token $x$, the MoE layer output is:

$$\text{MoE}(x) = \sum_{i \in \text{TopK}(g(x))} g_i(x) \cdot E_i(x)$$

Reading the formula: the router computes a gate distribution $g(x)$ over all $N$ experts. We keep only the configured top-$k$ entries, zero out the rest, and renormalize the remaining weights so they sum to 1. Mixtral uses top-2 out of 8, while other architectures choose different expert counts and routing policies.[2][3] Each selected expert $E_i$ processes $x$ independently, then their outputs are blended using the gate weights $g_i$.

The architecture figure above is the visual form of this equation: the router scores every expert, keeps the top-$k$, runs only those FFNs, then blends the selected outputs.
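As a rough illustration of the formula, the sketch below runs one token through a toy MoE layer in plain Python. The function name `moe_forward` and the toy experts (each one only scales its input) are hypothetical stand-ins for real FFNs; the router logits are reused from the walkthrough.

```python
from math import exp

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    peak = max(logits)
    exps = [exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def moe_forward(x, router_logits, experts, top_k):
    """Blend the outputs of the top-k experts for one token vector x."""
    gate = softmax(router_logits)
    chosen = sorted(range(len(gate)), key=gate.__getitem__, reverse=True)[:top_k]
    chosen_total = sum(gate[i] for i in chosen)
    output = [0.0] * len(x)
    for i in chosen:
        weight = gate[i] / chosen_total      # renormalized gate weight g_i(x)
        expert_out = experts[i](x)           # E_i(x); unselected experts never run
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output, chosen

# Hypothetical toy experts: expert i multiplies the input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
x = [1.0, -1.0]
out, chosen = moe_forward(x, [-0.65, -1.77, -1.35, -3.00], experts, top_k=2)

print("selected experts (0-based):", chosen)
print("blended output:", [round(v, 3) for v in out])
```

With these toy experts the blend is about 0.67 times the input plus 0.33 times three times the input, so each output component is roughly 1.66 times the corresponding input component.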

Why sparse activation can pay off

Think of a dense model like one fulfillment center where every active station processes every order. An MoE model keeps a larger menu of stations but sends each order through only a selected subset. It can spend less computation per order than a similarly sized dense menu, but the facilities still occupy space and coordination between them still costs time. The analogy maps to active FLOPs, total weight residency, and dispatch traffic respectively.

| Property                        | Dense Model      | MoE Model                             |
|---------------------------------|------------------|---------------------------------------|
| Total parameters                | 7B               | ~47B (8 experts)                      |
| Active parameters/token         | 7B               | ~13B (top-2 of 8)                     |
| Training FLOPs                  | Per total params | Per active params                     |
| Quality at fixed active compute | Baseline         | Often stronger if routing trains well |
| Inference compute/token         | Per total params | Roughly per active params             |

The main benefit is that MoE can scale total parameters faster than per-token compute. By adding more experts while keeping top-$k$ fixed, the model gets a larger pool of FFN parameters, but each token still runs only a small subset. The expert FLOPs stay roughly tied to the active experts, not the full expert pool. Routing, batching, and communication still add overhead, so "same active parameters" doesn't mean identical latency.

That doesn't mean serving cost scales only with active parameters. Compute per token is sparse, but the system still needs access to the full expert pool, so memory footprint and communication often scale with total parameters, not active parameters.

Only the FFN blocks are replicated across experts. Attention layers, embeddings, normalization layers, and the router are still shared. That's why Mixtral 8x7B exposes about 47B total parameters and about 13B active parameters per token, not the naive 56B (8 × 7B) and 14B (2 × 7B) you would get by treating each expert as a full 7B model.[2]

The key point is the split: active parameters mostly drive per-token expert FLOPs, while total parameters drive weight memory and checkpoint size. Network traffic depends on where selected experts live.

capacity-versus-compute.py
```python
shared_parameters_b = 3.0
expert_parameters_b = 2.0
expert_count = 8
top_k = 2

total_parameters_b = shared_parameters_b + expert_count * expert_parameters_b
active_parameters_b = shared_parameters_b + top_k * expert_parameters_b

print("illustrative total parameters:", f"{total_parameters_b:.1f}B")
print("illustrative active parameters/token:", f"{active_parameters_b:.1f}B")
print("weights needed for residency:", f"{total_parameters_b:.1f}B")
```
Output
```text
illustrative total parameters: 19.0B
illustrative active parameters/token: 7.0B
weights needed for residency: 19.0B
```
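The same accounting can be run backwards on the rounded Mixtral 8x7B figures quoted above (about 47B total, about 13B active). The sketch below assumes a simple two-bucket model, total = shared + N × expert and active = shared + k × expert, which is an approximation for illustration and not a breakdown taken from the Mixtral paper.

```python
# Rounded figures from the text, in billions of parameters.
total_b = 47.0
active_b = 13.0
expert_count = 8
top_k = 2

# Assumed two-bucket model:
#   total_b  = shared_b + expert_count * per_expert_b
#   active_b = shared_b + top_k * per_expert_b
# Subtracting the two equations isolates the per-expert size.
per_expert_b = (total_b - active_b) / (expert_count - top_k)
shared_b = active_b - top_k * per_expert_b

print("approx. parameters per expert:", f"{per_expert_b:.2f}B")
print("approx. shared parameters:", f"{shared_b:.2f}B")
```

Under that assumption each expert holds roughly 5.7B parameters and the shared layers roughly 1.7B, which is why the total lands well below 8 × 7B.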

The router in code

Here's a basic PyTorch implementation of the standard top-k softmax router. It takes token embeddings, computes scores for each expert, and returns the indices and normalized weights for the top-$k$ choices.

the-router-in-code.py
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """
    Routes tokens to the top-k experts based on learned gating weights.
    """
    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, seq_len, d_model)
        logits = self.gate(x)  # (batch, seq_len, n_experts)
        scores = F.softmax(logits, dim=-1)
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)

        # Renormalize selected expert weights so they sum to 1
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)

        return top_scores, top_indices  # Which experts, with what weight

torch.manual_seed(0)
torch.set_printoptions(precision=3, sci_mode=False)
router = TopKRouter(d_model=4, n_experts=4, top_k=2)
x = torch.randn(2, 3, 4)
weights, indices = router(x)

first_weights = [round(value, 3) for value in weights[0, 0].tolist()]
weight_sums = [
    [round(value, 3) for value in row]
    for row in weights.sum(dim=-1).tolist()
]

print("first token experts:", indices[0, 0].tolist())
print("first token weights:", first_weights)
print("weight sums:", weight_sums)
```
Output
```text
first token experts: [3, 1]
first token weights: [0.682, 0.318]
weight sums: [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

Switch Transformer used top-1 routing, while Mixtral and DeepSeek-V2 use multiple selected routed experts per token.[1][2][3] Selecting more experts changes quality, compute, capacity, and communication trade-offs; it is not automatically a better serving configuration.

Alternative: expert choice routing

While the standard top-k router has each token choose its top $k$ experts, Expert Choice Routing[6] flips the mechanism: each expert selects a fixed-size bucket of top-scoring tokens from the current batch.

This fixes the expert-side load because every expert processes the same number of tokens, which can simplify the training objective and improve hardware utilization. The risk moves to token coverage: a token can be selected by many experts, one expert, or none unless the design adds coverage rules.

In practice, token-choice routing with balancing losses remains the simpler mental model for open-weight MoE LLMs, while expert-choice routing is useful to know when reading training-efficiency papers.

expert-choice-coverage.py
```python
expert_buckets = {
    "E1": ["t0", "t1"],
    "E2": ["t0", "t2"],
    "E3": ["t2", "t3"],
}
all_tokens = {"t0", "t1", "t2", "t3", "t4"}
selected_tokens = [token for bucket in expert_buckets.values() for token in bucket]
uncovered = sorted(all_tokens - set(selected_tokens))

print("tokens per expert:", [len(bucket) for bucket in expert_buckets.values()])
print("t0 expert count:", selected_tokens.count("t0"))
print("uncovered tokens:", uncovered)
```
Output
```text
tokens per expert: [2, 2, 2]
t0 expert count: 2
uncovered tokens: ['t4']
```
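The buckets above were written by hand. A minimal sketch of how they could be derived, assuming a small made-up token-by-expert score matrix and a bucket size of 2 (both hypothetical), looks like this: each expert ranks all tokens by its own score column and keeps the best ones.

```python
# Hypothetical router scores: rows are tokens t0..t4, columns are experts 0..2.
scores = [
    [0.70, 0.20, 0.10],  # t0
    [0.60, 0.30, 0.10],  # t1
    [0.20, 0.50, 0.30],  # t2
    [0.10, 0.20, 0.70],  # t3
    [0.50, 0.25, 0.25],  # t4
]
bucket_size = 2  # tokens each expert is allowed to pick
token_count = len(scores)
expert_count = len(scores[0])

# Expert-choice step: every expert keeps its own top-scoring tokens.
buckets = {}
for expert in range(expert_count):
    ranked = sorted(range(token_count), key=lambda t: scores[t][expert], reverse=True)
    buckets[expert] = ranked[:bucket_size]

# Coverage: how many experts picked each token.
coverage = [sum(token in bucket for bucket in buckets.values()) for token in range(token_count)]

print("buckets:", buckets)
print("experts per token:", coverage)
```

Every expert ends up with exactly two tokens, while t1 and t2 are picked twice and t4 is picked by no expert, which is the coverage risk described above.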

Why routers collapse without a balancing policy

During training, routers can drift toward expert collapse: many tokens route to a few "popular" experts while others are underutilized. Load statistics tell you whether this is occurring; they do not prove that a perfectly uniform router is best for model quality.

Imagine an e-commerce fulfillment center with four specialist teams: one for small electronics, one for apparel, one for bulky furniture, and one for perishable goods. On the first day, the electronics team happens to be slightly faster at handling common orders. The router notices this and sends more orders to the electronics team. That team gets more practice, so it improves. The router sends even more orders to them. Within a few weeks, the electronics team is handling 80% of all orders, while the furniture and perishables teams sit idle. The warehouse is still paying rent for all four teams, but it's effectively operating with one overworked team.

[Figure: Two illustrative token-load histograms. A suspected collapse case sends most tokens to Expert 1, while a comparison case spreads assignments across all experts.]
Caption: Routing histograms reveal imbalance candidates. Compare them with quality, overflow, and communication metrics before selecting a balancing policy.

The router is trained jointly alongside the rest of the network.[5] If a particular expert looks better for common tokens early in training, the router can send it more assignments. That expert then receives more gradient updates, which can reinforce the preference and leave other experts under-trained. The operational test is not a story about specialization: inspect assignment shares, overflow, task loss, and communication load together.

routing-load-monitor.py
```python
assignments = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
expert_count = 4
counts = [assignments.count(expert) for expert in range(expert_count)]
shares = [count / len(assignments) for count in counts]
max_share = max(shares)

print("assignment shares:", [round(share, 2) for share in shares])
print("flag for investigation:", max_share > 0.50)
```
Output
```text
assignment shares: [0.6, 0.2, 0.1, 0.1]
flag for investigation: True
```

An auxiliary load-balancing loss

Switch Transformer uses a top-1 auxiliary loss that penalizes uneven routing.[1] The following form is also a useful teaching calculation for top-$k$ routing when $f_i$ is normalized over routed assignments:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot N \sum_{i=1}^{N} f_i \cdot p_i$$

Reading the formula: multiply each expert's routed-assignment fraction $f_i$ by its average routing probability $p_i$ and sum them up. If the router sends too many assignments to one expert (high $f_i$) and gives it high probability (high $p_i$), this product is large, so the loss penalizes that concentration. It is a balancing pressure, not proof that equal utilization maximizes task quality.

where:

  • $f_i = \frac{\text{assignments routed to expert } i}{\text{total routed assignments}}$; with top-$k$, the denominator is $T \times k$
  • $p_i = \frac{1}{T}\sum_{t=1}^{T} \text{softmax}(W_g \cdot x_t)_i$ (average router probability)
  • $\alpha$ is a configurable coefficient (Switch reports experiments including $10^{-2}$)
  • $N$ is the number of experts

A balanced reference point is $f_i \approx p_i \approx 1/N$ (uniform distribution). A trained deployment still needs task-quality and systems measurements before you judge its observed specialization.

Here's a concrete top-1 numeric check. Suppose we have 4 experts and a mini-batch of 100 tokens:

| Expert | $f_i$ (actual fraction) | $p_i$ (average prob) | Product $f_i \cdot p_i$ |
|--------|-------------------------|----------------------|-------------------------|
| 1      | 0.60                    | 0.55                 | 0.330                   |
| 2      | 0.20                    | 0.22                 | 0.044                   |
| 3      | 0.15                    | 0.15                 | 0.022                   |
| 4      | 0.05                    | 0.08                 | 0.004                   |
| Sum    |                         |                      | 0.400                   |

If the load were perfectly uniform, each expert would get $f_i = p_i = 0.25$, and the sum would be $4 \times (0.25 \times 0.25) = 0.25$. Multiplying by $N = 4$ as in the formula gives a scaled value of 1.0 for the uniform case and about 1.6 for the imbalanced case. With positive $\alpha$, this auxiliary term applies pressure against the imbalanced 0.400 case. Tune its weight with task-quality measurements because too much balancing pressure can conflict with the primary objective.

The auxiliary loss is a routing pressure, not the main learning objective. It says, "use the expert pool evenly enough," while the language-modeling loss still decides whether the model predicts the next token well.

load-balancing-score.py
```python
assignment_counts = [60, 20, 15, 5]
mean_router_probability = [0.55, 0.22, 0.15, 0.08]
expert_count = len(assignment_counts)
assignment_total = sum(assignment_counts)
fractions = [count / assignment_total for count in assignment_counts]

concentration = expert_count * sum(
    fraction * probability
    for fraction, probability in zip(fractions, mean_router_probability)
)
uniform_reference = expert_count * sum((1 / expert_count) ** 2 for _ in fractions)

print("scaled concentration:", round(concentration, 2))
print("uniform reference:", round(uniform_reference, 2))
print("excess balancing pressure:", round(concentration - uniform_reference, 2))
```
Output
```text
scaled concentration: 1.6
uniform reference: 1.0
excess balancing pressure: 0.6
```
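The check above starts from precomputed counts and averages. A minimal sketch of the same quantity computed from raw per-token router probabilities under top-2 routing is shown below; the probability rows and the value of `alpha` are made-up illustration values, and $f_i$ is normalized by $T \times k$ as in the definition above.

```python
# Hypothetical router probabilities: one row per token, one column per expert.
router_probs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.60, 0.25, 0.10, 0.05],
    [0.45, 0.10, 0.35, 0.10],
    [0.40, 0.15, 0.15, 0.30],
]
top_k = 2
alpha = 0.01  # assumed coefficient, for illustration only
token_count = len(router_probs)
expert_count = len(router_probs[0])

# f_i: share of the T * k routed assignments that land on expert i.
counts = [0] * expert_count
for probs in router_probs:
    chosen = sorted(range(expert_count), key=probs.__getitem__, reverse=True)[:top_k]
    for expert in chosen:
        counts[expert] += 1
fractions = [count / (token_count * top_k) for count in counts]

# p_i: mean router probability for expert i over the batch.
mean_probs = [sum(row[e] for row in router_probs) / token_count for e in range(expert_count)]

scaled = expert_count * sum(f * p for f, p in zip(fractions, mean_probs))
balance_loss = alpha * scaled

print("assignment counts:", counts)
print("scaled concentration:", round(scaled, 3))
print("auxiliary loss:", round(balance_loss, 5))
```

Expert 0 appears in every token's top-2 here, so the scaled value sits above the uniform reference of 1.0.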

Capacity factor and overflow policy

In a batched environment, many MoE implementations use fixed-size expert buffers so the layer can run predictable tensor operations. This size is determined by the capacity factor:

$$\text{Expert Capacity} = \frac{\text{Tokens per Batch} \times k}{N} \times \text{Capacity Factor}$$

This computes the average number of tokens each expert would receive under perfectly uniform routing, then multiplies by the capacity factor. A factor of $1.0$ means the buffer is exactly sized for uniform routing. Setting it to $1.2$ allows an expert to receive 20% more tokens than the average, providing a safety margin.

What happens if an expert is assigned more tokens than its capacity buffer allows depends on implementation:

  1. Dropping or pass-through: Switch-style designs can skip overflowed expert computations and rely on the residual path for those token-layer assignments.[1]
  2. Rerouting: A system can send overflow to another eligible expert, changing load and communication.
  3. Dropless execution: A system can execute all assignments with variable work, retaining quality paths while accepting less predictable workload.

This highlights the trade-off: a higher capacity factor reduces overflow risk but reserves more buffer space and may waste compute on empty slots. A lower factor is more compact but makes whichever overflow policy you selected more important.

capacity-overflow-policy.py
```python
from math import ceil

tokens = 4096
top_k = 2
experts = 32
capacity_factor = 1.0
assigned_to_hot_expert = 400
capacity = ceil(tokens * top_k / experts * capacity_factor)
overflow = max(assigned_to_hot_expert - capacity, 0)

print("per-expert capacity:", capacity)
print("overflow assignments:", overflow)
for policy in ("drop", "reroute", "dropless"):
    print(policy, "must be evaluated for quality and throughput")
```
Output
```text
per-expert capacity: 256
overflow assignments: 144
drop must be evaluated for quality and throughput
reroute must be evaluated for quality and throughput
dropless must be evaluated for quality and throughput
```
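To make the drop policy concrete, the sketch below fills fixed-size expert buffers in token order and records which tokens overflow. It assumes top-1 routing, first-come-first-served buffer filling, and a tiny hand-written assignment list; real implementations may prioritize tokens differently.

```python
from math import ceil

def fill_buffers(assignments, expert_count, capacity_factor):
    """Return (capacity, kept tokens, overflowed tokens) under a drop policy.

    assignments[t] is the single expert chosen for token t (top-1 routing).
    """
    capacity = ceil(len(assignments) / expert_count * capacity_factor)
    used = [0] * expert_count
    kept, overflowed = [], []
    for token, expert in enumerate(assignments):
        if used[expert] < capacity:
            used[expert] += 1
            kept.append(token)
        else:
            # Buffer is full: this token skips the expert and relies on the residual path.
            overflowed.append(token)
    return capacity, kept, overflowed

# Hypothetical routing where expert 0 is "hot".
assignments = [0, 0, 1, 0, 2, 0, 3, 0]

tight = fill_buffers(assignments, expert_count=4, capacity_factor=1.0)
loose = fill_buffers(assignments, expert_count=4, capacity_factor=1.5)

print("factor 1.0 -> capacity, overflowed:", tight[0], tight[2])
print("factor 1.5 -> capacity, overflowed:", loose[0], loose[2])
```

Raising the factor from 1.0 to 1.5 grows each buffer from 2 to 3 slots and reduces the overflowed tokens from three to two, at the cost of reserving slots that the cold experts never use.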

How modern architectures scale the idea

The basic MoE pattern is consistent across modern models, but architects have refined it in interesting ways.

Mixtral 8x7B[2]

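| Property     | Value                                                                          |
|--------------|--------------------------------------------------------------------------------|
| Total params | 46.7B                                                                          |
| Active params| ~12.9B                                                                         |
| Experts      | 8                                                                              |
| Top-K        | 2                                                                              |
| Expert type  | Standard FFN (SwiGLU activation[7])                                            |
| Performance  | Matches or exceeds Llama 2 70B and GPT-3.5 across the evaluated benchmarks[2] |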

MoE is applied only to the Feed-Forward Network (FFN) layers. Attention layers are shared across all tokens, which is the standard approach. It uses a SwiGLU (Swish Gated Linear Unit) activation function, a variant of the Gated Linear Unit (GLU) that's been shown to improve Transformer performance.

DeepSeek-V2[3]

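| Property             | Value                   |
|----------------------|-------------------------|
| Total params         | 236B                    |
| Active params        | 21B                     |
| Routed experts       | 160 (fine-grained)      |
| Shared experts       | 2 (always-on)           |
| Routed Top-K         | 6                       |
| Total active experts | 8 (6 routed + 2 shared) |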

DeepSeek-V2 uses several extensions to coarse-grained MoE:

  • Fine-grained experts: 160 smaller routed experts instead of a handful of coarse experts, which gives the router many more combinations to compose per token.[3]
  • Shared experts: 2 experts are always active for all tokens. The design aims to capture common computation and reduce redundancy among routed experts; every token gets 2 shared expert outputs in addition to the 6 routed ones.[3]
  • Device-limited routing: The selected routed experts for each token are constrained to a small number of devices, reducing cross-device communication overhead.[3]
  • Multi-level balancing losses: DeepSeek-V2 adds separate expert-level, device-level, and communication-balance losses to keep routing efficient at cluster scale.[3]
  • Multi-head Latent Attention (MLA): Compresses the KV cache into a low-rank latent vector, which the paper reports as a large KV-memory and throughput win relative to DeepSeek 67B.[3]

The following diagram contrasts Mixtral's coarse-grained top-2 routing with DeepSeek-V2's fine-grained routed-plus-shared design. The DeepSeek side is schematic, not a full 162-expert grid.

[Figure: Comparison of two MoE patterns. Mixtral uses 8 large experts with top-2 routing. DeepSeek-V2 is shown schematically with many small routed experts plus shared experts that always stay active.]
Caption: Mixtral is the clean mental model: 8 large experts, choose 2. DeepSeek-style MoE keeps sparse routing but adds many smaller experts, shared paths, and stronger communication constraints.
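The routed-plus-shared pattern can be sketched in a few lines. This is a simplified illustration, not DeepSeek's implementation: the experts are hypothetical scalar functions, the gate values are made up, and the routed gate weights are used without renormalization, which is an assumption here since normalization details differ between models.

```python
def shared_plus_routed(x, gate, shared_experts, routed_experts, top_k):
    """Sum always-on shared experts with the gate-weighted top-k routed experts."""
    chosen = sorted(range(len(gate)), key=gate.__getitem__, reverse=True)[:top_k]
    output = 0.0
    for expert in shared_experts:
        output += expert(x)                      # shared experts run for every token
    for i in chosen:
        output += gate[i] * routed_experts[i](x)  # routed experts run only if selected
    return output, chosen

# Hypothetical toy experts operating on a scalar "hidden state".
shared_experts = [lambda x: 0.5 * x]
routed_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate = [0.1, 0.4, 0.2, 0.3]  # made-up routed gate values for one token

output, chosen = shared_plus_routed(2.0, gate, shared_experts, routed_experts, top_k=2)
active_experts = len(shared_experts) + len(chosen)

print("routed experts chosen:", chosen)
print("active experts for this token:", active_experts)
print("output:", round(output, 3))
```

The active-expert count is always the number of shared experts plus the routed top-k, which matches the "6 routed + 2 shared" accounting in the DeepSeek-V2 table.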

DeepSeek-V3[4]

DeepSeek-V3 retains the DeepSeekMoE pattern at a larger reported scale.

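| Property             | Value                             |
|----------------------|-----------------------------------|
| Total params         | 671B                              |
| Active params        | 37B                               |
| Routed experts       | 256                               |
| Shared experts       | 1                                 |
| Routed Top-K         | 8                                 |
| Total active experts | 9 (8 routed + 1 shared)           |
| Load balancing       | Auxiliary-loss-free strategy      |
| Other key component  | Multi-head Latent Attention (MLA) |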

Each MoE layer has 1 shared expert and 256 routed experts, with 8 routed experts activated per token.[4]

The paper also highlights a major training change: instead of leaning on auxiliary balancing losses, DeepSeek-V3 uses an auxiliary-loss-free load-balancing strategy.[4] That's important because it shows how modern MoE work is not only about adding more experts. Router stability, communication efficiency, and keeping the main training objective clean matter just as much.

The hardware reality of serving an MoE

Implementing MoE introduces complexities not present in dense models. The compute per token is sparse, but the system still faces several hard serving constraints.

Communication overhead

In distributed training and inference, experts often reside on different devices to meet memory requirements. Routing a token to an expert on another GPU requires transferring the token's hidden state over a high-bandwidth interconnect such as NVLink within a node or InfiniBand across nodes.

This communication can quickly become a bottleneck if not managed carefully. The process usually has an all-to-all dispatch phase that scatters tokens to the GPUs hosting their selected experts, followed by an all-to-all combine phase that returns the expert outputs to the original token order. To maintain high throughput, system designers try to overlap those transfers with useful compute so the network isn't stalling the whole layer.

expert-placement-traffic.py
1expert_device = {"E0": "GPU0", "E1": "GPU0", "E2": "GPU1", "E3": "GPU1"} 2routed_assignments = [ 3 ("GPU0", "E0"), 4 ("GPU0", "E2"), 5 ("GPU1", "E2"), 6 ("GPU1", "E1"), 7 ("GPU0", "E3"), 8] 9remote = sum( 10 source_device != expert_device[expert] 11 for source_device, expert in routed_assignments 12) 13 14print("routed assignments:", len(routed_assignments)) 15print("remote assignments:", remote) 16print("remote rate:", f"{remote / len(routed_assignments):.0%}")
Output
1routed assignments: 5 2remote assignments: 3 3remote rate: 60%
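The dispatch and combine phases described above can be mimicked on a single machine. The sketch below assumes top-1 routing and hypothetical toy experts that add a constant; it groups tokens by expert (dispatch), runs each expert once on its group, and writes the results back in the original token order (combine). No real network transfer is modeled.

```python
from collections import defaultdict

# Hypothetical per-token hidden states (scalars for brevity) and top-1 expert choices.
hidden = [1.0, 2.0, 3.0, 4.0]
assignments = ["E2", "E0", "E2", "E1"]
expert_fn = {
    "E0": lambda v: v + 10.0,
    "E1": lambda v: v + 20.0,
    "E2": lambda v: v + 30.0,
}

# Dispatch: scatter token indices into one bucket per expert.
buckets = defaultdict(list)
for token, expert in enumerate(assignments):
    buckets[expert].append(token)

# Expert compute followed by combine: results return to their original positions.
outputs = [None] * len(hidden)
for expert, tokens in buckets.items():
    batch = [hidden[t] for t in tokens]             # what would be sent to the expert's device
    results = [expert_fn[expert](v) for v in batch]
    for token, result in zip(tokens, results):
        outputs[token] = result

print("tokens per expert:", {expert: tokens for expert, tokens in buckets.items()})
print("combined outputs:", outputs)
```

In a distributed deployment the two loops become all-to-all collectives, and the size of each bucket determines how much data crosses the interconnect.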

Memory requirements

Unlike dense models where there's only one FFN per layer, MoE models carry many expert FFNs. Fast GPU deployments generally keep those expert weights resident somewhere in the serving pool so routing does not stall on weight fetches.

This creates the core MoE paradox: low active compute per token, but very high total weight residency. A model like Mixtral 8x7B only activates about 13B parameters per token, yet the serving system still needs the full ~47B parameter checkpoint available across the cluster.[2] Tensor Parallelism (TP) splits individual layer computations across multiple GPUs, allowing larger shared layers to fit into memory, but it doesn't solve the unique structure of MoE. Expert Parallelism (EP) is needed to place different full experts on different GPUs while keeping every expert reachable when a token needs it.
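A back-of-the-envelope calculation shows the gap between resident and active weights. The sketch below uses the rounded Mixtral figures from the paragraph above and assumes 2 bytes per parameter for 16-bit weights and 0.5 bytes for 4-bit weights; it ignores KV cache, activations, and quantization metadata, so treat the numbers as lower bounds.

```python
total_params = 47e9    # approximate Mixtral 8x7B total parameters
active_params = 13e9   # approximate active parameters per token

# Assumed storage cost per parameter for two common weight formats.
bytes_per_param = {"16-bit": 2.0, "4-bit": 0.5}

resident_gb = {name: total_params * size / 1e9 for name, size in bytes_per_param.items()}
active_gb = {name: active_params * size / 1e9 for name, size in bytes_per_param.items()}

for name in bytes_per_param:
    print(f"{name}: resident weights ~{resident_gb[name]:.1f} GB, "
          f"weights touched per token ~{active_gb[name]:.1f} GB")
```

At 16 bits the full checkpoint needs roughly 94 GB of weight storage even though a single token only touches about 26 GB of it.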

Training instability

MoE training can be more sensitive than training a comparable dense Transformer. The top-$k$ routing step introduces discontinuities at expert-selection boundaries, and small routing biases can snowball into expert collapse. Gradients still flow through the selected experts and their gate weights, but the router is noisier to optimize than a dense FFN.

To combat this, implementations may add stabilizers. Router z-loss[8] penalizes unnecessarily large routing logits; the ST-MoE paper reports improved training stability from this term. Router jitter can add exploration pressure early in training. The relevant evidence is training loss, routing distribution, overflow, and downstream quality rather than assuming one stabilizer is mandatory for every MoE.
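One common way to write the router z-loss is the mean, over tokens, of the squared log-sum-exp of the router logits. The sketch below follows that form; the weighting coefficient applied to this term during training is omitted, and the two logit batches are made-up examples.

```python
from math import exp, log

def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of router logits over a batch of tokens."""
    total = 0.0
    for logits in batch_logits:
        peak = max(logits)  # subtract the max before exponentiating for stability
        log_sum_exp = peak + log(sum(exp(value - peak) for value in logits))
        total += log_sum_exp ** 2
    return total / len(batch_logits)

# Two batches that produce identical softmax routing but different logit magnitudes.
small_logits = [[0.0, 0.0, 0.0, 0.0]]
large_logits = [[10.0, 10.0, 10.0, 10.0]]

z_small = router_z_loss(small_logits)
z_large = router_z_loss(large_logits)

print("z-loss with small logits:", round(z_small, 3))
print("z-loss with large logits:", round(z_large, 3))
```

Because softmax is invariant to adding a constant to every logit, both batches route identically, yet the second one is penalized far more. That is the sense in which the term discourages unnecessarily large logits without dictating which expert wins.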

Expert parallelism layout

Large distributed MoE deployments commonly use Expert Parallelism (EP). While Tensor Parallelism (TP) splits individual matrix multiplications, EP maps entire experts to specific GPUs. The layout below illustrates how a 96-expert model might be distributed across three GPUs, with each device holding a distinct subset of experts while sharing the attention parameters:

text
GPU 0: Attention layers (shared) + Experts 0-31
GPU 1: Attention layers (shared) + Experts 32-63
GPU 2: Attention layers (shared) + Experts 64-95
...

Each GPU stores a subset of experts. When a token routes to an expert on another GPU, an all-to-all communication sends the token there and retrieves the result.

Figure: Compact expert-parallel dispatch path from GPU 0 to Expert 45 on GPU 1.
Expert parallelism keeps expert weights resident by GPU; routed token states cross all-to-all links.
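The contiguous layout above can be sketched as a simple lookup. The helper names `owner_gpu` and `needs_dispatch` are hypothetical, and real systems may place experts non-contiguously or replicate hot experts.

```python
NUM_EXPERTS = 96
NUM_GPUS = 3
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS  # 32 experts per device in this layout


def owner_gpu(expert_id: int) -> int:
    """GPU index that holds a given expert under contiguous placement."""
    if not 0 <= expert_id < NUM_EXPERTS:
        raise ValueError(f"expert_id out of range: {expert_id}")
    return expert_id // EXPERTS_PER_GPU


def needs_dispatch(source_gpu: int, expert_id: int) -> bool:
    """True when the token state must cross devices to reach its expert."""
    return owner_gpu(expert_id) != source_gpu


print("Expert 45 lives on GPU", owner_gpu(45))
print("Token on GPU 0 -> Expert 45 remote?", needs_dispatch(0, 45))
print("Token on GPU 1 -> Expert 45 remote?", needs_dispatch(1, 45))
```

A token that starts on GPU 0 and routes to Expert 45 has to travel to GPU 1, while a token already on GPU 1 stays local.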

Expert prefetching

For CPU-offloaded serving (for example, running a quantized MoE checkpoint on consumer hardware), one possible design is:

  1. Keep attention weights and router weights in GPU memory (always needed).
  2. Keep expert weights on CPU or SSD.
  3. After a layer's router decides which experts are needed, transfer those expert weights before its expert computation.
  4. Optionally use prediction or caching to stage likely later-layer experts, accepting misses or unused transfers.

This can make large MoE checkpoints fit where full GPU residency cannot, especially when combined with weight quantization, but it trades memory for transfer-dependent latency. You cannot know an exact next-layer route until that layer receives its hidden state, so unconditional "prefetch the next experts" is not free.
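One way to reason about that trade is to replay a routing trace against a small cache of GPU-resident experts. The sketch below is a toy simulation, not a serving implementation: the `ExpertCache` class, the LRU eviction policy, the capacity of 2, and the trace are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache of expert IDs resident in GPU memory (hypothetical)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id: str) -> None:
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # stands in for a CPU/SSD -> GPU weight transfer
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[expert_id] = True


# Made-up trace: experts requested by successive tokens at one layer.
trace = ["E0", "E1", "E0", "E2", "E0", "E3", "E1", "E0"]
cache = ExpertCache(capacity=2)
for expert_id in trace:
    cache.fetch(expert_id)

print("hits:", cache.hits, "misses:", cache.misses)
```

On this trace the cache records 2 hits and 6 misses, so most requests would still wait on a transfer. Replaying real routing traces this way is a cheap first check before building an offloaded serving path.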

Batch routing efficiency

In batch serving, different tokens in a batch may route to different experts. The key optimization is batched expert execution:

  1. Router assigns all tokens in the batch to their top-k experts.
  2. Group tokens by expert assignment.
  3. Execute each expert as a batched matrix multiply over its assigned tokens.
  4. Scatter results back to original token positions.

This turns $N$ small matrix multiplies into a smaller number of large, efficient ones, which is critical for achieving high GPU utilization with MoE models.

grouped-expert-execution.py
from collections import defaultdict

routes = [
    ("t0", "E2"),
    ("t1", "E0"),
    ("t2", "E2"),
    ("t3", "E1"),
    ("t4", "E2"),
]
grouped = defaultdict(list)
for token, expert in routes:
    grouped[expert].append(token)

for expert in sorted(grouped):
    print(expert, "batch tokens:", grouped[expert], "gemm rows:", len(grouped[expert]))
Output
E0 batch tokens: ['t1'] gemm rows: 1
E1 batch tokens: ['t3'] gemm rows: 1
E2 batch tokens: ['t0', 't2', 't4'] gemm rows: 3
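
The grouping step is only half of the path; results also have to be scattered back to the original token positions. The sketch below runs all four steps with top-1 routing on the same five routes. The "experts" here are toy scaling functions standing in for FFNs, and `scale_by` is a hypothetical helper.

```python
from collections import defaultdict


def scale_by(factor):
    """Return a toy 'expert' that processes a whole batch in one call."""
    return lambda batch: [factor * value for value in batch]


experts = {"E0": scale_by(10), "E1": scale_by(100), "E2": scale_by(1000)}
token_values = [1, 2, 3, 4, 5]
routes = ["E2", "E0", "E2", "E1", "E2"]  # top-1 assignment per token

# Step 2: group token positions by expert.
positions = defaultdict(list)
for index, expert in enumerate(routes):
    positions[expert].append(index)

# Steps 3-4: one batched call per expert, then scatter to original positions.
outputs = [None] * len(token_values)
for expert, indices in positions.items():
    batch_out = experts[expert]([token_values[i] for i in indices])
    for i, value in zip(indices, batch_out):
        outputs[i] = value

print(outputs)
```

This prints `[1000, 20, 3000, 400, 5000]`: three expert calls instead of five, with every result back in its original slot.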
| Batch size | Dense model | MoE model | Typical effect |
| --- | --- | --- | --- |
| 1 | Higher baseline utilization | Lower | Dense models often win on latency |
| 8 | Strong | Catching up | Routing overhead starts to amortize |
| 32+ | High | High | Throughput gap narrows a lot |

This table is a hypothesis to benchmark, not a universal crossover. Exact outcomes depend on kernels, expert placement, interconnect, and request mix. MoE adds router computation, expert selection, token grouping, and sometimes all-to-all communication; batching can amortize that overhead, but only measurements on the target system establish whether MoE wins latency or throughput.

measured-serving-gate.py
dense = {"quality": 0.82, "p95_ms": 48, "tokens_per_second": 820}
moe = {"quality": 0.83, "p95_ms": 55, "tokens_per_second": 1100}
requirements = {"quality": 0.82, "max_p95_ms": 60, "min_tokens_per_second": 1000}

approved = (
    moe["quality"] >= requirements["quality"]
    and moe["p95_ms"] <= requirements["max_p95_ms"]
    and moe["tokens_per_second"] >= requirements["min_tokens_per_second"]
)

print("moe throughput improvement:", f"{moe['tokens_per_second'] / dense['tokens_per_second'] - 1:.1%}")
print("moe p95 delta ms:", moe["p95_ms"] - dense["p95_ms"])
print("candidate approved:", approved)
Output
moe throughput improvement: 34.1%
moe p95 delta ms: 7
candidate approved: True

Common pitfalls

"Active FLOPs are small, so deployment memory will be small too"

Symptom: FLOP estimates look cheap, but deployment still runs out of VRAM. Cause: The team budgeted only for active parameters per token and forgot that the full expert pool must still be resident somewhere in the serving system. Fix: Separate active expert FLOPs from total weight residency. Size memory for the full expert pool, not only the routed path.

"Router imbalance will fix itself later"

Symptom: One or two experts receive most tokens early in training. Cause: The router found a slightly better path, that expert received more gradient updates, and the collapse reinforced itself. Fix: Track routing histograms from the start and add balancing pressure such as auxiliary loss, z-loss, jitter, or capacity constraints before unused experts go stale.

"Sparse compute guarantees lower latency"

Symptom: MoE benchmark wins disappear at batch size 1. Cause: Sparse expert FLOPs are not the whole latency story. Router work, grouping, dispatch, and all-to-all traffic can dominate single-request latency. Fix: Evaluate MoE on the real serving regime. Use dense models for low-latency single requests when routing overhead outweighs sparse compute gains.

"Expert placement is a later infra detail"

Symptom: Throughput drops sharply after experts are split across GPUs. Cause: Expert placement created communication hotspots. Tokens now bounce across devices faster than grouped GEMMs can amortize the transfers. Fix: Treat routing and placement as one systems problem. Use expert parallelism, device-aware routing, and batching to keep all-to-all traffic under control.

"Balanced histograms mean routing is healthy"

Symptom: Quality drops even though router histograms look balanced. Cause: Experts may be hitting capacity limits, and the configured overflow policy is dropping, rerouting, or delaying assignments. Fix: Inspect overflow counts and policy directly. Evaluate capacity factor, routing balance, burstiness, and throughput before assuming the issue is elsewhere.

"Expert-choice routing covers every token automatically"

Symptom: Expert-choice routing leaves some tokens underprocessed. Cause: Experts select their favorite tokens first, so token coverage is no longer guaranteed automatically. Fix: Add explicit coverage rules or use token-choice top-k routing when every token must reliably receive expert computation.
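
A tiny sketch of how that coverage gap arises, under made-up affinities: each expert independently takes its `capacity` highest-scoring tokens, and nothing forces the union of those picks to include every token. The scores, the capacity of 2, and the two-expert pool are assumptions for illustration.

```python
# Made-up router affinities: scores[token][expert].
scores = {
    "t0": {"E0": 0.9, "E1": 0.8},
    "t1": {"E0": 0.7, "E1": 0.6},
    "t2": {"E0": 0.2, "E1": 0.3},
    "t3": {"E0": 0.1, "E1": 0.1},
}
experts = ["E0", "E1"]
capacity = 2  # tokens each expert may select (assumed)

# Expert-choice: every expert independently takes its highest-scoring tokens.
chosen = {
    expert: sorted(scores, key=lambda token: scores[token][expert], reverse=True)[:capacity]
    for expert in experts
}

covered = {token for tokens in chosen.values() for token in tokens}
uncovered = sorted(set(scores) - covered)

print("chosen:", chosen)
print("uncovered tokens:", uncovered)
```

Both experts pick `t0` and `t1`, so `t2` and `t3` receive no expert computation in this layer. Counting uncovered tokens per batch is a direct way to check whether a coverage rule is needed.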

"Experts are neat human-labeled topic buckets"

Symptom: Teams describe experts as clean topic buckets like "math expert" or "biology expert." Cause: Human-friendly labels are being projected onto behavior that often follows token patterns, formatting, or structural clusters instead. Fix: Validate specialization with routing traces, activation analysis, and targeted prompts that test a claimed specialty before telling a story about it.

Mastery check

Key concepts

  • Sparse versus dense FFN computation
  • Router logits, top-k selection, and renormalization
  • Expert collapse and balancing pressure
  • Capacity factor and dropped-token overflow
  • Total weight residency versus active FLOPs
  • Expert parallelism, dispatch, and all-to-all cost

Evaluation rubric

  • Choose dense vs MoE for a serving regime, and justify that choice with latency, throughput, memory residency, and communication cost.
  • Trace one token by hand: compute router scores, pick top-$k$, renormalize, then blend expert outputs.
  • Diagnose collapse from routing histograms and say which mitigation you would try first.
  • Compute capacity risk from batch tokens, top-$k$, number of experts, and capacity factor, then explain what overflow does to quality.
  • Explain why tensor parallelism helps shared layers but expert parallelism and placement strategy are needed for routed FFNs.
  • Defend when shared experts, device-limited routing, or finer-grained expert pools buy enough systems value to justify extra complexity.

Follow-up questions

Your team wants a 48-expert MoE because active FLOPs look like a 13B dense model. Why can memory still look like a much larger checkpoint?

Active FLOPs follow only the experts selected for each token, but the full expert pool still has to live somewhere in the serving system. That means VRAM or cluster-wide weight residency is driven by total parameters, not only active parameters. MoE separates compute cost from capacity, not memory from capacity.

One expert takes 70% of tokens by step 500. Why is that failure self-reinforcing, and what do you inspect first?

It is self-reinforcing because the favored expert gets more tokens, more gradients, and therefore improves faster, which makes the router favor it even more. First inspect routing histograms, average router probabilities, dropped-token rates, and balancing-loss terms. Those signals show whether collapse is still reversible or already starving the rest of the expert pool.
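
One balancing-loss term worth inspecting is the Switch Transformer auxiliary loss[1], which multiplies the number of experts by the sum over experts of (fraction of tokens dispatched) times (mean router probability). The sketch below evaluates it, without its scaling coefficient, on a uniform case and on a made-up collapsed case where one expert takes 70% of tokens.

```python
def load_balance_loss(dispatch_fraction, mean_router_prob):
    """Switch-style auxiliary loss without its scaling coefficient.

    dispatch_fraction[i]: share of tokens sent to expert i.
    mean_router_prob[i]: average router probability assigned to expert i.
    """
    num_experts = len(dispatch_fraction)
    return num_experts * sum(
        f * p for f, p in zip(dispatch_fraction, mean_router_prob)
    )


# Uniform routing over 4 experts.
balanced = load_balance_loss([0.25] * 4, [0.25] * 4)

# Made-up collapsed statistics: one expert receives 70% of tokens.
collapsed = load_balance_loss(
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.15, 0.15, 0.10],
)

print("balanced loss:", round(balanced, 3))
print("collapsed loss:", round(collapsed, 3))
```

The uniform case evaluates to 1.0 and the collapsed case to about 1.84, so a rising value of this term alongside a skewed histogram is a sign that collapse is under way.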

Why might a dense model beat MoE on latency for batch size 1 while an MoE candidate wins throughput at batch size 64?

At batch size 1, MoE pays router, grouping, and often all-to-all overhead without enough routed tokens to amortize it. Dense models avoid that coordination overhead and can win on raw latency. At larger batch sizes, grouped expert execution can make sparse compute competitive on throughput. Benchmark both claims on the actual hardware.

A serving stack already uses tensor parallelism for attention. Why is that not enough for MoE layers?

Tensor parallelism shards individual matrix multiplications, but MoE also needs whole-expert placement, token dispatch, and result recombination. That is why expert parallelism exists: the problem is not only splitting math, but moving tokens to whichever GPU holds their selected experts and then restoring original token order.

A batch sends 4096 tokens through top-2 routing with 32 experts and capacity factor 1.0. What is the nominal per-expert buffer, and what failure appears if one expert gets 400 assignments?

The nominal per-expert buffer is (4096 * 2) / 32 = 256 token assignments. If one expert gets 400 assignments, then 144 assignments overflow that buffer. A Switch-style dropping policy skips the overflowed expert computation and may hurt quality; rerouting or dropless designs pay different compute and communication costs. Inspect the configured policy.
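
The same arithmetic as a small sketch. The function name `expert_capacity` and the choice to round up with `math.ceil` are assumptions; implementations differ on rounding and on minimum capacity.

```python
import math


def expert_capacity(tokens: int, top_k: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Nominal per-expert buffer in token assignments (rounded up, by assumption)."""
    return math.ceil(tokens * top_k / num_experts * capacity_factor)


capacity = expert_capacity(tokens=4096, top_k=2, num_experts=32, capacity_factor=1.0)
assignments_to_hot_expert = 400
overflow = max(0, assignments_to_hot_expert - capacity)

print("per-expert capacity:", capacity)
print("overflowed assignments:", overflow)
```

This reproduces the 256-assignment buffer and the 144 overflowed assignments from the answer above. Raising `capacity_factor` to 1.25 would lift the buffer to 320, which still leaves 80 assignments over the limit.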

Your interconnect is weak, and all-to-all traffic is already the bottleneck. What do you change before adding even more experts?

Start by constraining and measuring communication, not by assuming more experts solve it. Device-limited routing, expert placement that keeps common routes local, shared experts for common computation, and larger grouped batches are candidate mitigations. Adding routed experts changes capacity and traffic; benchmark that change against the existing bottleneck.[3][4]

Try it yourself: manual router exercise

Here's a self-contained pencil-and-paper exercise that checks whether you can trace the exact computation an MoE router performs. Think of it as routing a single customer order through the right fulfillment specialists. Do this without looking back at the worked example above.

Setup

  • One token arrives with hidden state $x = [0.5, -0.2, 0.1]$.
  • The router weight matrix is $W_g = \begin{bmatrix} 0.4 & -0.1 & 0.2 \\ -0.2 & 0.3 & 0.1 \\ 0.1 & 0.1 & -0.3 \end{bmatrix}$.
  • There are 3 experts. Top-$k = 2$.
  • Suppose Expert 1 outputs $o_1 = [1.0, 0.5]$, Expert 2 outputs $o_2 = [0.2, 1.2]$, and Expert 3 outputs $o_3 = [0.8, -0.1]$.

Questions

  1. Compute the raw router scores: $s = W_g \cdot x$. (Hint: this is a 3-vector.)
  2. Apply softmax to get probabilities $p_1, p_2, p_3$.
  3. Which two experts are selected?
  4. Renormalize the selected probabilities so they sum to 1.
  5. Compute the final layer output as a weighted sum of the selected expert outputs.

Solution sketch

  1. $s = [0.4(0.5) + (-0.1)(-0.2) + 0.2(0.1), \dots] = [0.24, -0.15, 0.00]$
  2. Softmax gives roughly $p \approx [0.406, 0.275, 0.319]$
  3. Selected: Expert 1 and Expert 3 (top 2 probabilities).
  4. Renormalized: $g_1 = 0.406 / (0.406 + 0.319) \approx 0.560$, $g_3 = 0.319 / (0.406 + 0.319) \approx 0.440$
  5. Output $\approx 0.560 \cdot [1.0, 0.5] + 0.440 \cdot [0.8, -0.1] \approx [0.912, 0.236]$

If your numbers are within 0.02 of these, you understand the routing math well enough to follow the routing sections of most MoE papers.
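
To check your hand calculation, the sketch below repeats the five steps in plain Python using the setup values from the exercise. List indices are 0-based, so index 0 is Expert 1 and index 2 is Expert 3.

```python
import math

x = [0.5, -0.2, 0.1]
W_g = [
    [0.4, -0.1, 0.2],
    [-0.2, 0.3, 0.1],
    [0.1, 0.1, -0.3],
]
expert_outputs = [[1.0, 0.5], [0.2, 1.2], [0.8, -0.1]]
k = 2

# 1. Raw router scores: s = W_g . x
scores = [sum(w * v for w, v in zip(row, x)) for row in W_g]

# 2. Softmax (subtracting the max for numerical stability).
peak = max(scores)
exp_scores = [math.exp(s - peak) for s in scores]
probs = [e / sum(exp_scores) for e in exp_scores]

# 3. Top-k expert indices by probability.
top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# 4. Renormalize the selected probabilities so they sum to 1.
norm = sum(probs[i] for i in top)
gates = {i: probs[i] / norm for i in top}

# 5. Weighted sum of the selected expert outputs.
output = [sum(gates[i] * expert_outputs[i][d] for i in top) for d in range(2)]

print("probs:", [round(p, 3) for p in probs])
print("selected experts (1-based):", [i + 1 for i in top])
print("gates:", {i + 1: round(g, 3) for i, g in gates.items()})
print("output:", [round(v, 3) for v in output])
```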

What this unlocks next

You now understand the central trick behind large sparse MoE models: replace selected dense FFNs with routed expert sets and activate only configured paths per token. You can trace a token through the router by hand, explain why total parameters do not equal compute cost, and evaluate balance, overflow policy, memory residency, and communication before claiming a serving win.

Next Step
Continue to Mamba & State Space Models

There, you will compare selective state-space sequence processing with attention-based paths and reason about when architecture changes alter memory, throughput, and quality trade-offs.

References

[1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Fedus, W., Zoph, B., & Shazeer, N. · 2022

[2] Mixtral of Experts. Jiang, A. Q., et al. · 2024

[3] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. DeepSeek-AI · 2024

[4] DeepSeek-V3 Technical Report. DeepSeek-AI · 2024 · arXiv preprint

[5] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Shazeer, N., et al. · 2017 · ICLR 2017

[6] Mixture-of-Experts with Expert Choice Routing. Zhou, Y., et al. · 2022

[7] GLU Variants Improve Transformer. Shazeer, N. · 2020

[8] ST-MoE: Designing Stable and Transferable Sparse Expert Models. Zoph, B., et al. · 2022