LeetLLM

© 2026 LeetLLM. All rights reserved.
🏗️ Medium · System Design

Content Moderation System

Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.

35 min read · Meta, Google, OpenAI +2 · 6 key concepts

Imagine managing a town square with 100,000 speakers shouting simultaneously. Some are friendly, some are spamming ads, and a few are dangerously violent. A human moderator can handle one conversation at a time; an AI system must handle thousands per second, instantly deciding who gets the microphone and who gets silenced. This is the challenge of real-time content moderation: balancing safety, free speech, and speed at massive scale.

Designing this system requires balancing the raw throughput of lightweight classifiers with the reasoning capabilities of Large Language Models (LLMs). While traditional BERT-based models are fast, they struggle with nuance, satire, and evolving policies. LLMs excel at understanding context but are too slow and expensive to run on every message. The engineering challenge is to architect a tiered pipeline that routes content to the right model at the right time.


System requirements

A production-grade moderation system for a major social platform must meet strict operational constraints:

  • Latency: <200ms p95 (95th percentile) for real-time chat; <1s for posts.
  • Throughput: Handle 10K+ requests per second (RPS) with varying load spikes.
  • Flexibility: Support evolving content policies without requiring full model retraining.
  • Multimodality: Moderate text, images, video, and audio simultaneously.
  • Due Process: Support transparent appeal workflows and human review loops.

Architecture

The standard industry pattern is a cascade or tiered architecture. Content flows through a funnel of increasingly expensive but accurate models.

[Diagram] Real-time content moderation system: text and images are classified through fast first-pass filters, then escalated to LLM-based analysis for edge cases.

💡 Key insight: Do not send every request to the LLM. 90-95% of content is either obviously safe (e.g., "Hello world") or obviously spam/toxic (e.g., specific slurs). A lightweight Tier 1 model can handle these for a fraction of the cost.
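The routing logic behind this insight fits in a few lines. The sketch below is illustrative only: `fast_classify` and `llm_judge` stand in for the Tier 1 and Tier 2 models described in the next section, and the toy lambdas are placeholders, not real classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Decision:
    action: str  # "ALLOW" | "BLOCK" | "REVIEW"
    tier: int    # which tier made the call

def cascade(text: str,
            fast_classify: Callable[[str], Optional[str]],
            llm_judge: Callable[[str], str]) -> Decision:
    """Cheap model first; the LLM sees only the uncertain 5-10% of traffic."""
    verdict = fast_classify(text)  # "ALLOW"/"BLOCK" when confident, None when unsure
    if verdict is not None:
        return Decision(action=verdict, tier=1)
    return Decision(action=llm_judge(text), tier=2)

# Toy stand-ins for the real models:
fast = lambda t: "ALLOW" if "hello" in t.lower() else None
judge = lambda t: "REVIEW"

print(cascade("Hello world", fast, judge))      # obvious content never touches the LLM
print(cascade("borderline slang", fast, judge)) # escalated to Tier 2
```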


Tiered approach

Tier 1: Fast classifier (<10ms)[1]

The first line of defense is a high-performance, specialized model (typically a distilled BERT, DeBERTa, or even a fastText classifier). The transformer variants are smaller, faster distillations of large models, trained on labeled data to detect specific categories of violations.

Characteristics:

  • Architecture: DistilBERT, MobileBERT, or DeBERTa-v3-xsmall.
  • Task: Multi-label classification (Output: [prob_hate, prob_violence, prob_spam, ...]).
  • Precision: Tuned for high precision on "Block" (don't auto-block unless sure) and high precision on "Allow" (don't auto-allow unless sure). Everything else goes to Tier 2.

The following Python code demonstrates a tiered moderation controller. It takes the user's text as input and computes probability scores across predefined categories. If any score exceeds the strict block threshold, the action is blocked; if it falls into the review range, it is escalated.

```python
from typing import Dict, Literal, Optional

# Pseudocode for a tiered moderation controller (load_model is a placeholder)

categories = [
    "hate_speech", "violence", "sexual_content",
    "self_harm", "spam", "misinformation",
]

class ModerationResult:
    def __init__(self, action: Literal["ALLOW", "BLOCK", "REVIEW"],
                 category: Optional[str] = None):
        self.action = action
        self.category = category

class FastModerator:
    def __init__(self):
        # Load a lightweight, distilled model (e.g., ONNX optimized).
        # In production, use a framework like ONNX Runtime or TensorRT.
        self.model = load_model("moderation-bert-v3-quantized")
        # Strict thresholds for auto-action; categories without an explicit
        # entry fall back to conservative defaults
        self.thresholds = {
            "hate_speech": {"block": 0.98, "review": 0.65},
            "violence": {"block": 0.95, "review": 0.60},
        }
        self.default_thresholds = {"block": 0.97, "review": 0.70}

    def classify(self, text: str) -> ModerationResult:
        # Inference takes ~5-10ms on CPU or <1ms on GPU
        scores: Dict[str, float] = self.model.predict(text)

        # First pass: auto-block only when a category is near-certain
        for cat, score in scores.items():
            if score > self.thresholds.get(cat, self.default_thresholds)["block"]:
                return ModerationResult(action="BLOCK", category=cat)

        # Second pass: any suspicious-but-uncertain category escalates to Tier 2
        for cat, score in scores.items():
            if score > self.thresholds.get(cat, self.default_thresholds)["review"]:
                return ModerationResult(action="REVIEW", category=cat)

        # All scores are low: safe to auto-allow
        return ModerationResult(action="ALLOW")
```

Tier 2: LLM policy judge (<200ms)[2]

For the 5-10% of content that is "uncertain" (e.g., potential hate speech vs. reclaimed slurs, or news reporting vs. glorification of violence), we invoke an LLM.

State-of-the-art systems now use specialized safety models like Llama Guard 3[3] or ShieldGemma[4], which are fine-tuned specifically for instruction-following moderation tasks.

Instead of generic training, we inject the current policy definitions directly into the context. This makes the system "policy-aware" and allows for immediate rule changes without retraining. The code block below illustrates a system prompt containing the exact policy rules. It outputs a structured JSON decision based on these guidelines.

```text
System: You are a content moderation expert.
Your task is to classify user content based on the following policy:

<POLICY_DEFINITION>
Hate Speech: Dehumanizing speech, calls for violence, or inferiority claims based on protected characteristics (race, religion, etc.).
Exceptions:
- Counterspeech (raising awareness)
- Self-referential use (reclaimed terms)
- Fictional content (unless glorifying)
</POLICY_DEFINITION>

Analyze the content below. Think step-by-step about whether it falls under an exception.
Respond with JSON: { "action": "ALLOW" | "BLOCK" | "ESCALATE", "reason": "...", "category": "..." }

Content: {user_content}
```

Advantages:

  • Zero-shot adaptation: Update the system prompt to change enforcement rules instantly.
  • Contextual awareness: Can distinguish between "I want to kill you" (violence) and "I killed it at the gym" (slang).
  • Explanation: Provides a reasoning chain ("reason": "..."), which is crucial for the appeals process.
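Wiring this in code mostly means building the prompt and defensively parsing the model's reply. The sketch below assumes a stubbed model response; `POLICY`, `build_prompt`, and `parse_verdict` are illustrative names, and a real system would send the prompt to its inference endpoint.

```python
import json

# Current policy text, injected verbatim into the prompt (abbreviated here)
POLICY = (
    "Hate Speech: Dehumanizing speech or inferiority claims based on "
    "protected characteristics. Exceptions: counterspeech, self-referential "
    "use of reclaimed terms, fictional content unless glorifying."
)

def build_prompt(user_content: str) -> str:
    # Rule changes take effect instantly: edit POLICY, no retraining needed
    return (
        "System: You are a content moderation expert.\n"
        f"<POLICY_DEFINITION>\n{POLICY}\n</POLICY_DEFINITION>\n"
        'Respond with JSON: {"action": "ALLOW" | "BLOCK" | "ESCALATE", '
        '"reason": "...", "category": "..."}\n'
        f"Content: {user_content}"
    )

def parse_verdict(raw: str) -> dict:
    # Be defensive: models sometimes wrap the JSON in prose or code fences
    start, end = raw.find("{"), raw.rfind("}") + 1
    verdict = json.loads(raw[start:end])
    if verdict.get("action") not in {"ALLOW", "BLOCK", "ESCALATE"}:
        raise ValueError(f"invalid action: {verdict!r}")
    return verdict

# Simulated model response; a real system would call its inference endpoint
raw = '{"action": "ALLOW", "reason": "counterspeech exception", "category": "hate_speech"}'
print(parse_verdict(raw)["action"])
```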

Tier 3: Human review

When the LLM is low-confidence (e.g., it returns ESCALATE or a low-probability verdict), the content is queued for human review. This is the most expensive tier, but it provides the ground truth for future training.

| Priority | SLA (Service Level Agreement) | Typical Cases |
|---|---|---|
| P0 (Critical) | 15 min | CSAM (Child Sexual Abuse Material), imminent self-harm, terrorism |
| P1 (High) | 1 hour | Severe harassment, viral misinformation |
| P2 (Normal) | 24 hours | General policy gray areas, appeals |
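These SLA tiers can be enforced with a deadline-ordered priority queue, so reviewers always pull the case closest to breaching its SLA. A minimal sketch (class and mapping names are illustrative, with SLA values taken from the table):

```python
import heapq
import time
from dataclasses import dataclass, field
from typing import List, Optional

# SLA targets in seconds
SLA_SECONDS = {"P0": 15 * 60, "P1": 60 * 60, "P2": 24 * 60 * 60}

@dataclass(order=True)
class ReviewItem:
    deadline: float                        # heap orders by SLA deadline only
    content_id: str = field(compare=False)
    priority: str = field(compare=False)

class ReviewQueue:
    def __init__(self) -> None:
        self._heap: List[ReviewItem] = []

    def enqueue(self, content_id: str, priority: str,
                now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        item = ReviewItem(now + SLA_SECONDS[priority], content_id, priority)
        heapq.heappush(self._heap, item)

    def next_item(self) -> ReviewItem:
        # Pop the case closest to breaching its SLA
        return heapq.heappop(self._heap)

q = ReviewQueue()
q.enqueue("post-1", "P2", now=0)  # routine gray-area case
q.enqueue("post-2", "P0", now=0)  # critical report, 15-minute SLA
print(q.next_item().content_id)   # the P0 case surfaces first
```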

Multi-modal content

Modern platforms aren't just text. They must moderate images, video, and audio. The architecture extends the tiered approach to other modalities by passing each medium through its own specialized pipeline before combining the signals.

A major engineering challenge in multimodality is cross-modal context. A benign text overlay like "Look at what I found today" becomes a policy violation when paired with a graphic image. To handle this, the outputs from individual classifiers must feed into a late-fusion layer or a multimodal LLM that can evaluate the combined context of the post.
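A late-fusion step can be sketched as a function over per-modality category scores. The combination rule here (max score plus a small boost when both modalities show the same weak signal) is an illustrative heuristic, not a production formula; real systems often learn the fusion weights or hand the combined context to a multimodal LLM.

```python
from typing import Dict

def fuse(text_scores: Dict[str, float],
         image_scores: Dict[str, float],
         escalate_at: float = 0.5) -> str:
    """Late fusion: a post can be risky even if each modality looks benign alone."""
    combined = {}
    for cat in set(text_scores) | set(image_scores):
        t, i = text_scores.get(cat, 0.0), image_scores.get(cat, 0.0)
        # Max single-modality score, boosted when both modalities agree (heuristic)
        combined[cat] = max(t, i) + 0.2 * min(t, i)
    return "ESCALATE" if max(combined.values()) >= escalate_at else "ALLOW"

# Benign caption + borderline graphic image: neither alone crosses 0.5,
# but the combined signal does
print(fuse({"violence": 0.30}, {"violence": 0.45}))
```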

| Content Type | Tier 1 (Fast) | Tier 2 (Deep) |
|---|---|---|
| Text | DistilBERT / DeBERTa | Llama Guard 3 / ShieldGemma |
| Images | ResNet / EfficientNet (NSFW detection) | Vision-language models (CLIP, Llama Guard 3 Vision) |
| Video | Keyframe sampling + image classifier | Multimodal LLM (Gemini/GPT-5.2) on sampled frames |
| Audio | Audio event classification | Whisper transcription → text pipeline |

⚠️ Common mistake: Treating video as a black box. Moderating every frame is impossible. Keyframe sampling (e.g., 1 frame per second) + audio transcription is the standard efficient approach. Note that NSFW (Not Safe For Work) classifiers must run on every keyframe to catch flashing content.
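The keyframe-sampling logic is simple to sketch. The 1 fps default matches the text; the `max_frames` cap and its value are assumptions to keep hour-long videos bounded (the stride widens instead of truncating coverage):

```python
from typing import List

def keyframe_timestamps(duration_s: float, fps_sample: float = 1.0,
                        max_frames: int = 300) -> List[float]:
    """Timestamps (seconds) at which to extract frames for moderation.

    Samples at `fps_sample` (default 1 fps); for long videos the stride
    widens so the total stays bounded by `max_frames` (assumed cap).
    """
    step = max(1.0 / fps_sample, duration_s / max_frames)
    ts: List[float] = []
    t = 0.0
    while t < duration_s and len(ts) < max_frames:
        ts.append(round(t, 3))
        t += step
    return ts

print(len(keyframe_timestamps(90)))    # short clip: one frame per second
print(len(keyframe_timestamps(3600)))  # hour-long video: stride widens to 12s
```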


Handling policy evolution

A major challenge in moderation is that the rules change. A new slang term becomes offensive, or a geopolitical event requires new strictness on misinformation.

[Diagram] Content policy decision tree: content is evaluated against policies for hate speech, violence, self-harm, and sexual content, with severity levels determining action (allow, warn, block).

Key insight: The LLM-based Tier 2 adapts immediately via prompt engineering. The Tier 1 classifier lags behind because it must be retrained on new data (often labeled by Tier 2 or humans) to learn the new pattern. This "student-teacher" loop keeps the fast path efficient while the slow path handles novelty.[5]

Robust systems also continuously generate "red team" data to test classifiers against adversarial attacks (e.g., using datasets like ToxiGen[6] to validate resilience against implicit hate speech).


Scaling to 10K+ RPS

Content fingerprinting and caching

Duplicate and near-duplicate content is pervasive (reposts, viral memes, copypasta). Never waste GPU cycles re-moderating known content.

The following caching implementation shows how to bypass the ML pipeline entirely for known content. It checks for exact string matches using SHA256 (returning instantly if found), and for minor edits or re-encodings using SimHash-based Locality-Sensitive Hashing.

```python
from typing import Optional
import hashlib

from redis import Redis                             # exact-match cache
from simhash import SimHashIndex, compute_simhash   # near-duplicate index (pseudocode API)

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def normalize(text: str) -> str:
    return text.lower().strip()

class ModerationCache:
    def __init__(self):
        self.exact_cache = Redis()         # entries written with a 1-hour TTL
        self.fuzzy_index = SimHashIndex()  # near-duplicate detection (locality-sensitive hashing)

    def check(self, content: str) -> Optional[ModerationResult]:
        # 1. Exact match (O(1)) - SHA256
        #    Catches exact copy-pastes
        content_hash = sha256(normalize(content))
        if result := self.exact_cache.get(content_hash):
            return result

        # 2. Near-duplicate (O(log N)) - SimHash / LSH
        #    Catches minor edits, resizing, re-encoding
        simhash = compute_simhash(content)
        if similar := self.fuzzy_index.find_similar(simhash, threshold=0.95):
            return similar.result

        return None  # Cache miss - run the full pipeline
```

At scale, 30-50% of content can often be served from cache, dramatically reducing inference costs.

Batch processing for throughput

GPUs are throughput-optimized devices. Processing requests one-by-one is inefficient. Implement micro-batching at the inference gateway.

| Component | Single Request | Batch of 32 | Throughput Gain |
|---|---|---|---|
| Tier 1 (BERT) | 3ms | 8ms | 12× |
| Tier 2 (LLM) | 150ms | 400ms | 12× |
| Embedding | 5ms | 10ms | 16× |

Strategy:

  1. Accumulate: Collect incoming requests for a small window (e.g., 5ms) or until a batch fills (e.g., 32 items).
  2. Execute: Run the batch through the model.
  3. Scatter: Distribute results back to individual response handlers.

This is typically implemented using Envoy or Nginx sidecars for rate limiting and buffering, with Kafka topics decoupling ingestion from processing. Specialized inference servers like Triton Inference Server or vLLM handle the actual batching logic to maximize GPU utilization.
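The accumulate/execute/scatter loop can be sketched single-threaded; production servers like Triton implement this with background threads and per-request futures, so everything below (class name, toy model) is illustrative only.

```python
import time
from typing import Callable, List

class MicroBatcher:
    """Accumulate up to `max_batch` items or `window_s` seconds, then one batched call."""

    def __init__(self, batched_model: Callable[[List[str]], List[str]],
                 max_batch: int = 32, window_s: float = 0.005):
        self.batched_model = batched_model
        self.max_batch = max_batch
        self.window_s = window_s
        self._pending: List[str] = []
        self._opened_at: float = 0.0

    def submit(self, text: str) -> List[str]:
        if not self._pending:
            self._opened_at = time.monotonic()  # batch window opens with first item
        self._pending.append(text)
        if (len(self._pending) >= self.max_batch or
                time.monotonic() - self._opened_at >= self.window_s):
            return self.flush()
        return []  # still accumulating

    def flush(self) -> List[str]:
        batch, self._pending = self._pending, []
        return self.batched_model(batch)  # one GPU call; caller scatters results

toy_model = lambda batch: [f"ok:{t}" for t in batch]
b = MicroBatcher(toy_model, max_batch=4, window_s=10.0)
results = []
for msg in ["a", "b", "c", "d", "e"]:
    results.extend(b.submit(msg))  # batch of 4 fires on "d"
results.extend(b.flush())          # drain the leftover "e"
print(results)
```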

Geographic distribution

Latency is governed by the speed of light. Moderation must happen close to the user.


Architecture pattern: Deploy Tier 1 classifiers (CPU-friendly or light GPU) in all edge locations (Blue). Route only the complex Tier 2 cases to regional hubs with heavy GPU clusters (Purple). This minimizes latency for the vast majority of safe traffic while centralizing expensive compute.


Appeal and review workflow

No moderation system is perfect. In consumer products, Precision (avoiding false bans) is often more important than Recall (catching every bad post) because false bans actively destroy user trust. However, for severe violations like CSAM, recall becomes paramount. This tension requires an explicit appeal and review loop built directly into the core user experience.

When a user appeals a moderation decision:

  1. Auto-Review: Re-run the content through the pipeline with a "lenient" configuration or a larger, more accurate state-of-the-art model (e.g., Gemini 1.5 Pro or GPT-5.2 instead of a smaller Llama Guard model). This catches obvious classifier errors immediately.
  2. Human Queue: If the auto-review upholds the ban, the case goes to a human reviewer. Crucially, the reviewer is provided the full context: the flagged post, the user's history, the specific policy triggered, and the LLM's reasoning chain.

The Feedback Loop: The real value of the appeal system is training data generation:

  • False Positives: These are gold. They become "hard negatives" for the next training round of the Tier 1 classifier, actively reducing future false ban rates.
  • Confirmed Violations: Added to the dataset as new positive examples to strengthen pattern recognition.
  • Novel Violations: Edge cases that humans struggle with typically trigger formal policy reviews and prompt updates for the LLM.

Regulatory and cultural considerations

Content policies are rarely global. What is considered standard political discourse in one country might be illegal hate speech in another. A robust moderation system must be flexible enough to apply different rulesets based on user geography, while still maintaining a baseline of universal safety.

| Region | Regulation | Implication |
|---|---|---|
| EU | DSA (Digital Services Act) | Strict SLAs on appeals (24h), transparency reports required. |
| US | Section 230 | Broad platform discretion, but strict reporting on CSAM. |
| India | IT Rules 2021 | Requirement to remove flagged content within 36 hours. |

Implementation:

  • Base Policy: Global rules (e.g., no CSAM, no malware).
  • Regional Overlays: Inject region-specific instructions into the Tier 2 LLM prompt based on user IP/locale.
  • Geo-fencing: Some content may be blocked in Germany (e.g., Nazi symbolism) but allowed in the US (First Amendment).
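Assembling the Tier 2 prompt from a base policy plus overlays can be sketched as below. The policy strings and locale keys are illustrative stand-ins, not real policy text:

```python
# Global baseline enforced everywhere
BASE_POLICY = "No CSAM. No malware. No credible threats of violence."

# Illustrative region-specific overlays keyed by locale
REGIONAL_OVERLAYS = {
    "DE": "Additionally block Nazi symbolism and Holocaust denial.",
    "IN": "Flag for removal within 36 hours per IT Rules 2021 timelines.",
}

def build_policy_prompt(locale: str) -> str:
    # Tier 2 prompt = global baseline + whatever overlay the locale requires
    overlay = REGIONAL_OVERLAYS.get(locale)
    return BASE_POLICY if overlay is None else f"{BASE_POLICY}\n{overlay}"

print(build_policy_prompt("DE"))  # baseline plus the German overlay
print(build_policy_prompt("US"))  # baseline only
```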

Summary

  1. Tiered Architecture: Always filter with cheap models first. Use LLMs only when necessary.
  2. Latency vs. Accuracy: Optimizing for <200ms requires caching, batching, and distillation.
  3. Policy Agility: Use LLMs to adapt to policy changes instantly; retrain classifiers asynchronously.
  4. Multimodality: Don't ignore images/video. Use specialized pipelines and fuse the signals.
  5. Feedback Loops: The system is a cycle. Human decisions must improve the models.

Common misconceptions

  • "Just use an LLM for everything." Too slow and too expensive. At 10K RPS, pure LLM moderation would burn millions of dollars a month.
  • "Moderation is a binary classification problem." It is highly contextual. "I hate you" is bad; "I hate Mondays" is fine. Context matters.
  • "Once trained, the model is done." Adversaries evolve. Users find ways to bypass filters (e.g., "unc1vilized"). The model must be continuously retrained.
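The last point can be made concrete with a normalization pass that runs before Tier 1 classification, collapsing common character-substitution tricks. This is a toy sketch; the substitution map is illustrative, and production systems use much richer Unicode confusable tables:

```python
import re

# Illustrative leetspeak map; real systems use full Unicode confusable tables
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                          "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize_evasions(text: str) -> str:
    # Lowercase, undo digit/symbol substitutions, collapse character repetition
    text = text.lower().translate(LEET_MAP)
    return re.sub(r"(.)\1{2,}", r"\1\1", text)  # "heeeey" -> "heey"

print(normalize_evasions("unc1vilized"))  # the filter-bypass spelling normalizes back
```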

Evaluation Rubric

  1. Designs fast first pass with lightweight classifiers
  2. Uses LLM as second stage for nuanced decisions
  3. Plans for configurable policy rules per content type
  4. Includes human review queue for edge cases
  5. Addresses false positive/negative tradeoffs
  6. Handles multimodal content (text, image, video)
  7. Implements caching and batching for scale

Common Pitfalls

  • Using only an LLM (too slow and expensive)
  • Ignoring cultural and linguistic context
  • Not planning for appeals workflow
  • Overlooking adversarial content manipulation
  • Lack of versioning for policy changes
Key Concepts Tested

  • Multi-stage moderation pipeline
  • Speed vs accuracy tradeoffs
  • Policy enforcement and configurability
  • Appeal and human review workflows
  • Multilingual moderation challenges
  • Adversarial evasion defense
References

  1. Markov, T., et al. (2023). A Holistic Approach to Undesired Content Detection in the Real World. AAAI 2023.
  2. Inan, H., et al. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations.
  3. Llama Team, Meta (2024). Llama Guard 3. Meta AI Research.
  4. Google DeepMind (2024). ShieldGemma: Generative AI Content Moderation. Google DeepMind Report.
  5. Lees, A., et al. (2022). A New Generation of Perspective API: Efficient Multilingual Character-level Transformers.
  6. Hartvigsen, T., et al. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech. ACL 2022.
