🏗️ Hard · System Design

Content Moderation System

Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.

25 min read · Meta, Google, OpenAI +2 · 8 key concepts

Imagine managing a town square with 10,000 speakers shouting simultaneously. Some are friendly, some are spamming ads, and a few are dangerously violent. A human moderator can handle one conversation at a time; an AI system must handle thousands per second, instantly deciding who gets the microphone and who gets silenced. This is the challenge of real-time content moderation. It requires balancing safety, free speech, and speed at massive scale.

Designing this system requires balancing the raw throughput of lightweight classifiers with the reasoning capabilities of Large Language Models (LLMs). While traditional BERT (Bidirectional Encoder Representations from Transformers) models are fast, they struggle with subtlety, satire, and evolving policies. LLMs excel at understanding context but are too slow and expensive to run on every message. The engineering challenge is to architect a tiered pipeline (a cascade of increasingly sophisticated filters) that routes content to the right model at the right time.

System requirements

A production-grade moderation system for a major social platform has to meet strict operational constraints to ensure user safety without disrupting the user experience. It needs to operate invisibly to good actors while instantly catching bad actors.

Latency

<200ms p95 (95th percentile) for real-time chat; <1s for posts. Fast turnaround is critical so the conversational flow isn't interrupted. If moderation takes too long, users experience noticeable lag when sending messages.

Throughput

Handle 10K+ requests per second (RPS) with varying load spikes. Social platforms experience dramatic traffic surges during major real-world events (like sports finals or elections). The moderation pipeline needs to scale elastically to handle these spikes without falling behind.

Flexibility

Support evolving content policies without requiring full model retraining. The definition of a policy violation can change overnight based on new trends, slang, or geopolitical events.

Multimodality

Moderate text, images, video, and audio simultaneously. Users interact across multiple formats, and malicious actors often hide violations in cross-media contexts (for example, placing hateful text over a benign image).

Due Process

Support transparent appeal workflows and human review loops. Automation errors are inevitable, so durable and transparent paths for human remediation are needed to maintain user trust.

Architecture

The standard industry pattern is a cascade or tiered architecture. Content flows through a funnel of increasingly expensive but accurate models. This design isolates the bulk of the computational cost to a small percentage of edge cases, ensuring that simple requests are handled almost instantly.

Diagram: Real-time content moderation system. Text and images are classified through fast first-pass filters, then escalated to LLM-based analysis for edge cases.

By placing lightweight, high-throughput systems at the top of the funnel, the architecture preserves low latency for the vast majority of benign user interactions. Only the ambiguous, context-dependent content is escalated to the more complex and slower models further down the pipeline.

💡 Key insight: Don't send every request to the LLM. Tier 1 handles 90-95% of content: the obvious safe cases get auto-allowed, and the obvious violations get auto-blocked. Only the remaining 5-10% of ambiguous content needs to escalate to the expensive Tier 2 LLM pipeline.
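The routing logic in that insight can be sketched in a few lines. This is a minimal cascade controller, with `tier1` and `tier2` passed in as stand-in callables (any real implementation would wire in the classifier and LLM clients described later in this article):

```python
from dataclasses import dataclass
from typing import Callable, Literal

Action = Literal["ALLOW", "BLOCK", "REVIEW"]

@dataclass
class Verdict:
    action: Action
    tier: int  # which tier produced the final decision

def moderate(content: str,
             tier1: Callable[[str], Action],
             tier2: Callable[[str], Action]) -> Verdict:
    """Cascade router: the cheap classifier decides whenever it is confident;
    only the uncertain slice pays for an LLM call; anything still uncertain
    lands in the human review queue."""
    first = tier1(content)
    if first != "REVIEW":
        return Verdict(first, tier=1)   # 90-95% of traffic stops here
    second = tier2(content)             # 5-10% escalates to the LLM
    if second != "REVIEW":
        return Verdict(second, tier=2)
    return Verdict("REVIEW", tier=3)    # human queue
```

The key property is that Tier 2 never sees content Tier 1 was confident about, which is what keeps the expensive path at 5-10% of traffic.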

Tiered approach

To balance speed and contextual understanding, the system divides the workload across three distinct operational tiers. Each tier serves a specific purpose in the funnel, filtering out clear-cut cases and escalating only what it can't confidently resolve.

Tier 1: Fast classifier (<10ms)

The first line of defense is a high-performance, specialized model: typically a distilled BERT, a DeBERTa (Decoding-enhanced BERT with disentangled attention) variant, or even a fastText classifier (fastText is a library for efficient text classification and word-representation learning). These are smaller, faster versions of large transformers, trained on labeled data to detect specific categories of violations.

Characteristics

  • Architecture: DistilBERT, MobileBERT, or DeBERTa-v3-xsmall.
  • Task: Multi-label classification (Output: [prob_hate, prob_violence, prob_spam, ...]).
  • Precision: Tuned for high precision on "Block" (don't auto-block unless sure) and high precision on "Allow" (don't auto-allow unless sure). Everything else goes to Tier 2.

The following Python code demonstrates a tiered moderation controller using ONNX (Open Neural Network Exchange) Runtime.

🎯 Production tip: Threshold tuning is itself a product decision. Set block thresholds conservatively (high confidence before removing content) to avoid over-blocking. Review thresholds can be looser since humans are in the loop. Tune these quarterly against your false-positive and false-negative rates.

python

import re
import onnxruntime as ort
import numpy as np
from typing import Dict, Literal, Optional

categories = [
    "hate_speech", "violence", "sexual_content",
    "self_harm", "spam", "misinformation"
]

class ModerationResult:
    def __init__(self, action: Literal["ALLOW", "BLOCK", "REVIEW"],
                 category: Optional[str] = None):
        self.action = action
        self.category = category

class FastModerator:
    def __init__(self):
        # Load a lightweight, distilled model via ONNX Runtime for low latency
        self.session = ort.InferenceSession("moderation-bert-v3-quantized.onnx")
        # Strict thresholds for auto-action
        self.thresholds = {
            "hate_speech": {"block": 0.98, "review": 0.65},
            "violence": {"block": 0.95, "review": 0.60},
        }

    def _preprocess(self, text: str) -> dict:
        # Normalize: lowercase, strip whitespace, replace URLs/mentions with tokens
        text = text.lower().strip()
        text = re.sub(r'https?://\S+', '<URL>', text)
        text = re.sub(r'@\w+', '<USER>', text)
        # In production the tokenizer is baked into the ONNX graph.
        # For this example, inputs are tokenized externally.
        # The ONNX model input name is typically "input_ids".
        return {"input_ids": np.array([[1, 2, 3]])}  # placeholder

    def classify(self, text: str) -> ModerationResult:
        # Inference takes ~5-10ms on CPU
        inputs = self._preprocess(text)
        scores_array = self.session.run(None, inputs)[0]

        # Map array back to categories (mock mapping logic)
        scores: Dict[str, float] = {
            cat: float(scores_array[0][i]) for i, cat in enumerate(categories)
        }

        # Check every category for a confident block first, so a later
        # category's clear violation isn't masked by an earlier review hit
        for cat, score in scores.items():
            if cat in self.thresholds and score > self.thresholds[cat]["block"]:
                return ModerationResult(action="BLOCK", category=cat)

        # If any category is suspicious but not certain, escalation is needed
        for cat, score in scores.items():
            if cat in self.thresholds and score > self.thresholds[cat]["review"]:
                return ModerationResult(action="REVIEW", category=cat)

        # If all scores are low, it's safe
        return ModerationResult(action="ALLOW")

Tier 2: LLM policy judge (<200ms)

For the 5-10% of content that's "uncertain" (for example, potential hate speech versus reclaimed slurs, or news reporting versus glorification of violence), we use an LLM[1].

State-of-the-art systems now use specialized safety models like Qwen3.5 (a multimodal open-source model) or ShieldGemma[2] that are fine-tuned specifically for instruction-following moderation tasks.

Instead of generic training, we inject the current policy definitions directly into the context. This makes the system "policy-aware" and allows for immediate rule changes without retraining. The code block below illustrates a system prompt containing the exact policy rules. It serves as the instruction template for the Tier 2 LLM. The template takes the user's content as an input variable and instructs the model to output a structured JSON decision with an explicit reasoning chain.

text

System: You are a content moderation expert.
Your task is to classify user content based on the following policy:

<POLICY_DEFINITION>
Hate Speech: Dehumanizing speech, calls for violence, or inferiority claims based on protected characteristics (race, religion, etc.).
Exceptions:
- Counterspeech (raising awareness)
- Self-referential use (reclaimed terms)
- Fictional content (unless glorifying)
</POLICY_DEFINITION>

Analyze the content below. Think step-by-step about whether it falls under an exception.
Respond with JSON: { "action": "ALLOW" | "BLOCK" | "ESCALATE", "reason": "...", "category": "..." }

Content: {user_content}
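In practice this template is wrapped in a small helper that injects the live policy text and defensively parses the model's JSON. The sketch below omits the actual LLM call; `build_judge_prompt` and `parse_verdict` are illustrative names, and the fail-closed behavior (malformed output escalates rather than allowing) is the design point being shown:

```python
import json
from typing import Literal, TypedDict

class JudgeVerdict(TypedDict):
    action: Literal["ALLOW", "BLOCK", "ESCALATE"]
    reason: str
    category: str

JUDGE_TEMPLATE = """System: You are a content moderation expert.
Your task is to classify user content based on the following policy:

<POLICY_DEFINITION>
{policy}
</POLICY_DEFINITION>

Analyze the content below. Think step-by-step about whether it falls under an exception.
Respond with JSON: {{ "action": "ALLOW" | "BLOCK" | "ESCALATE", "reason": "...", "category": "..." }}

Content: {user_content}"""

def build_judge_prompt(policy: str, user_content: str) -> str:
    """Inject the current policy text into the Tier 2 instruction template."""
    return JUDGE_TEMPLATE.format(policy=policy, user_content=user_content)

def parse_verdict(raw: str) -> JudgeVerdict:
    """Defensively parse the model's JSON; malformed output escalates
    instead of silently allowing or blocking."""
    try:
        payload = raw[raw.index("{"): raw.rindex("}") + 1]
        data = json.loads(payload)
        if data.get("action") not in ("ALLOW", "BLOCK", "ESCALATE"):
            raise ValueError("unknown action")
        return JudgeVerdict(action=data["action"],
                            reason=str(data.get("reason", "")),
                            category=str(data.get("category", "")))
    except ValueError:
        return JudgeVerdict(action="ESCALATE",
                            reason="unparseable model output", category="")
```

Because the policy is a parameter rather than trained weights, a rule change is a string update, which is exactly what makes Tier 2 policy-agile.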

Advantages

  • Zero-shot adaptation: Update the system prompt to change enforcement rules instantly.
  • Contextual awareness: It can distinguish between "I want to kill you" (violence) and "I killed it at the gym" (slang).
  • Explanation: It provides a reasoning chain ("reason": "..."), which is crucial for the appeals process.

Tier 3: Human review

When the LLM is low-confidence (for example, returning an ESCALATE status or a low probability score), the content gets queued for human review. This is the most expensive and slowest tier, but it's indispensable for maintaining system integrity and providing ground truth labels for future training.

Human reviewers operate under strict Service Level Agreements (SLAs) tailored to the severity of the potential violation. Critical safety issues require immediate intervention, whereas ambiguous policy debates can be handled asynchronously.

| Priority | SLA (Service Level Agreement) | Typical Cases |
|---|---|---|
| P0 (Critical) | 15 min | CSAM (Child Sexual Abuse Material), imminent self-harm, terrorism |
| P1 (High) | 1 hour | Severe harassment, viral misinformation |
| P2 (Normal) | 24 hours | General policy gray areas, appeals |

To protect the psychological well-being of the moderation team, platforms employ safety tooling. This includes blurring graphic imagery by default, removing audio from videos, and strictly limiting the amount of time any single reviewer spends evaluating P0 content. This tier isn't just a fallback; it's the critical feedback mechanism that defines what the automated systems will learn next.
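The SLA table above translates directly into queue assignment. A minimal sketch, with a hypothetical category-to-priority mapping (real systems drive this from policy configuration, not a hard-coded dict):

```python
from datetime import datetime, timedelta

# Hypothetical category -> priority mapping mirroring the SLA table above
PRIORITY = {
    "csam": "P0", "imminent_self_harm": "P0", "terrorism": "P0",
    "severe_harassment": "P1", "viral_misinformation": "P1",
}
SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=24),
}

def enqueue_for_review(category: str, flagged_at: datetime) -> dict:
    """Attach a priority and a hard review deadline to a flagged item.
    Categories not explicitly listed default to the P2 gray-area queue."""
    priority = PRIORITY.get(category, "P2")
    return {
        "category": category,
        "priority": priority,
        "review_deadline": flagged_at + SLA[priority],
    }
```

Defaulting unknown categories to P2 keeps novel policy areas reviewable without blocking the critical-safety queues.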

Multi-modal content

Modern platforms aren't just text. They need to moderate images, video, and audio. The architecture extends the tiered approach to other modalities by passing each medium through its own specialized pipeline before combining the signals.

A major engineering challenge in multimodality is handling cross-modal context. A benign text overlay like "Look at what I found today" becomes a policy violation when paired with a graphic image. To handle this, the outputs from individual classifiers need to feed into a late-fusion layer or a multimodal LLM that can evaluate the combined context of the post.

| Content Type | Tier 1 (Fast) | Tier 2 (Deep) |
|---|---|---|
| Text | DistilBERT / DeBERTa | Qwen3.5 / ShieldGemma |
| Images | ResNet / EfficientNet (NSFW detection) | Vision-Language Models (CLIP, Qwen3.5 multimodal) |
| Video | Keyframe sampling + Image Classifier | Multimodal judge / vision-language model on sampled frames |
| Audio | Audio event classification | Whisper transcription → Text Pipeline |

Here is how a keyframe sampler works in practice. Given a video, it extracts frames at a fixed interval (e.g., every 1 second) plus additional frames whenever a significant scene change is detected. The combined set of keyframes is then passed to the image classifier pipeline. Scenes with rapid cuts or flashing content need extra coverage.

python

import cv2
import numpy as np
from dataclasses import dataclass
from typing import List

@dataclass
class Keyframe:
    frame_index: int
    timestamp_sec: float
    is_scene_cut: bool

def extract_keyframes(video_path: str, fps: int = 30,
                      sample_interval_sec: float = 1.0,
                      scene_cut_threshold: float = 30.0) -> List[Keyframe]:
    """
    Extract keyframes from a video at regular intervals plus scene cuts.

    Args:
        video_path: Path to the video file.
        fps: Frames per second of the video (use cv2.VideoCapture to detect).
        sample_interval_sec: Extract one frame every N seconds.
        scene_cut_threshold: Mean absolute pixel difference above which two
            consecutive frames are treated as a scene cut.
    """
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    last_frame = None
    keyframes: List[Keyframe] = []

    interval_frames = int(sample_interval_sec * fps)

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        if last_frame is not None:
            diff = np.mean(np.abs(gray.astype(float) - last_frame.astype(float)))
            if diff > scene_cut_threshold:
                # Capture the scene-cut frame itself so rapid cuts get coverage
                keyframes.append(Keyframe(frame_idx, frame_idx / fps, True))

        # Regular interval samples
        if frame_idx % interval_frames == 0:
            keyframes.append(Keyframe(frame_idx, frame_idx / fps, False))

        last_frame = gray
        frame_idx += 1

    cap.release()
    return keyframes

🎯 Production tip: For live streams, use a sliding window of the last N frames and continuously emit keyframes to the moderation pipeline. For on-demand video uploads, batch processing is cheaper because you can wait for the full file before starting inference.

Once the individual pipelines process their respective modalities, the late-fusion layer normalizes the confidence scores across different models. This aggregated representation then feeds into a final classification head or is used as the context for a multimodal LLM to make the ultimate moderation decision. If the combined score falls within the uncertain range, the entire multimodal package is escalated to human review.
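That fusion step can be approximated with a hand-rolled scoring rule. Production systems typically learn this head from labeled data; the sketch below is a naive stand-in with invented weights and thresholds, shown only to make the control flow concrete:

```python
from typing import Dict, Optional

def late_fusion(scores: Dict[str, float],
                weights: Optional[Dict[str, float]] = None,
                block_threshold: float = 0.90,
                review_threshold: float = 0.50) -> str:
    """Naive late fusion over per-modality violation scores in [0, 1].
    The fused score is the max of (a) the worst single modality and (b) the
    weighted mean, so one near-certain modality forces a block while
    moderate combined evidence still surfaces for review."""
    weights = weights or {m: 1.0 for m in scores}
    mean = (sum(scores[m] * weights.get(m, 1.0) for m in scores)
            / sum(weights.get(m, 1.0) for m in scores))
    fused = max(max(scores.values()), mean)
    if fused >= block_threshold:
        return "BLOCK"
    if fused >= review_threshold:
        return "REVIEW"
    return "ALLOW"
```

A post with benign text (0.2) but a borderline image (0.6) fuses to 0.6 and escalates to review, which is the cross-modal behavior the paragraph above calls for.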

⚠️ Common mistake: Treating video as a black box. Moderating every frame is impossible. Keyframe sampling (e.g., 1 frame per second) plus audio transcription is the standard efficient approach. Note that NSFW classifiers need to run on every keyframe to catch flashing content, which is why sampling rate is a critical tuning knob.

Handling policy evolution

A major challenge in moderation is that the rules change constantly. A new slang term becomes offensive, or a geopolitical event requires new strictness on misinformation. The moderation architecture needs to be able to deploy new policy enforcements in minutes, not weeks, to prevent the viral spread of harmful content. The decision tree below shows how the system routes content based on model confidence scores: high-confidence violations are auto-blocked, mid-range cases go to human review, and low-confidence content is auto-approved. The key architectural insight is the two-track policy update workflow: Tier 2 adapts immediately via prompt changes, while Tier 1 lags because it needs retraining on new labeled data.

Diagram: Content policy decision tree. Content is evaluated against policies for hate speech, violence, self-harm, and sexual content, with severity levels determining action (allow, warn, block).

💡 Key insight: The LLM-based Tier 2 adapts immediately via prompt engineering. The Tier 1 classifier lags behind because it needs to be retrained on new data (often labeled by Tier 2 or humans) to learn the new pattern. This "student-teacher" loop keeps the fast path efficient while the slow path handles novelty.

Well-designed systems also continuously generate "red team" data to test classifiers against adversarial attacks (for example, using datasets like ToxiGen[3] to validate resilience against implicit hate speech).

Scaling to 10K+ RPS

Building a system that can accurately classify content is only half the challenge. The other half is making sure the system stays responsive when traffic spikes unpredictably during major global events. Achieving 10,000+ requests per second (RPS) requires heavy optimization at the infrastructure layer. We can use a combination of caching, batching, and intelligent routing to keep latency low and compute costs manageable.

Content fingerprinting and caching

Duplicate and near-duplicate content is common (reposts, viral memes, copypasta). Never waste GPU cycles re-moderating known content.

The following caching implementation shows how to bypass the ML pipeline entirely for known content. It takes the raw user content as input and checks for exact string matches using SHA256 (Secure Hash Algorithm 256), returning instantly if found. If no exact match exists, it checks for minor edits using SimHash-based locality-sensitive hashing (LSH), outputting a cached moderation decision if a high-similarity match is found.

Here is why LSH matters with a concrete example. A user posts "You are such an idiot" and it gets reviewed and blocked. Another user reposts the same text with a single character changed: "You are such an id1ot." A SHA256 exact match would miss this, but SimHash would produce fingerprints that differ by only 1 bit, and the LSH bucketing system would catch it immediately. The k=3 parameter in the code below means two strings are considered near-duplicates if their SimHash fingerprints differ in at most 3 of 64 bit positions.

python

import hashlib
from dataclasses import dataclass
from typing import Dict, Literal, Optional

from redis import Redis
from simhash import Simhash, SimhashIndex  # pip install simhash

@dataclass
class CachedModerationResult:
    action: Literal["ALLOW", "BLOCK", "REVIEW"]
    category: Optional[str] = None

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def normalize(text: str) -> str:
    return text.lower().strip()

class ModerationCache:
    def __init__(self):
        self.redis = Redis()
        self.ttl = 3600  # cache verdicts for 1 hour
        self.fuzzy_index = SimhashIndex([], k=3)  # near-duplicate detection
        self.results: Dict[str, CachedModerationResult] = {}  # keyed by hash

    def store(self, content: str, result: CachedModerationResult) -> None:
        content_hash = sha256(normalize(content))
        self.redis.setex(content_hash, self.ttl, result.action)
        self.results[content_hash] = result
        self.fuzzy_index.add(content_hash, Simhash(content))

    def check(self, content: str) -> Optional[CachedModerationResult]:
        # 1. Exact match (O(1)) - SHA256 catches exact copy-pastes
        content_hash = sha256(normalize(content))
        if (action := self.redis.get(content_hash)) is not None:
            return CachedModerationResult(action=action.decode())

        # 2. Near-duplicate (O(log N)) - SimHash / LSH catches minor edits.
        #    get_near_dups returns the keys of previously indexed items.
        near = self.fuzzy_index.get_near_dups(Simhash(content))
        if near:
            return self.results.get(near[0])

        return None  # Cache miss - run full pipeline

At scale, 30-50% of content can often be served from cache, dramatically reducing inference costs.

Here is how the three fingerprinting strategies compare in practice:

| Strategy | Speed | What it Catches | False Positive Risk |
|---|---|---|---|
| SHA256 exact hash | <1ms | Identical copy-pastes | Very low (collision unlikely) |
| SimHash + LSH | ~2ms | Minor edits, resizing, re-encoding | Moderate (similar-but-different may match) |
| Perceptual hashing (pHash) | ~5ms | Image/video transformations | Moderate (semantically different may match) |
| Embedding cosine similarity | ~10ms | Semantically similar content | Higher (requires threshold tuning) |

In practice, most systems layer these: SHA256 first (fastest), then SimHash for text, then perceptual hashing for images and video. Embedding similarity is the last resort because it requires a full model inference call.
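To make the SimHash layer concrete, here is a from-scratch toy version (the code above uses the simhash library; this one hashes character trigrams with MD5, so its distances are illustrative rather than the library's exact values):

```python
import hashlib

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def _shingles(text: str, n: int = 3):
    # Character trigrams; a single swapped character changes only a few
    text = text.lower()
    return [text[i:i + n] for i in range(max(1, len(text) - n + 1))]

def simhash64(text: str) -> int:
    """Toy 64-bit SimHash: each shingle votes on every bit position,
    and the fingerprint keeps the majority sign per bit."""
    counts = [0] * 64
    for shingle in _shingles(text):
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(a ^ b).count("1")
```

On the "idiot" / "id1ot" example from earlier, the SHA256 digests are completely different, while the SimHash fingerprints stay far closer to each other than either is to unrelated text, which is exactly the property LSH bucketing exploits.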

💡 Key insight: The fingerprinting layer pays for itself immediately. At 10K RPS, even a 40% cache hit rate eliminates 4,000 GPU inference calls per second. If each inference costs 0.1ms of GPU time, that's 0.4 GPU-seconds of compute saved every second, and the savings compound on the expensive tiers: every cached verdict for escalated content also avoids a ~150ms Tier 2 LLM call. At scale, a 40% reduction is the difference between needing 10 GPUs versus 16.

🔬 Research insight: Micro-batching at inference time amortizes the overhead of GPU kernel launches and memory transfers across multiple requests. At batch size 32, a typical A100 sustains 200+ TFLOPs versus under 20 TFLOPs at batch size 1, making batching the single highest-impact optimization in a high-RPS moderation system.

GPUs are throughput-optimized devices. Processing requests one by one leaves GPU memory bandwidth and compute massively underutilized. Micro-batching accumulates incoming requests into a small batch (typically 8-32 items) before running inference, so the GPU processes all items in parallel and shares the overhead cost.

This is distinct from training-time gradient accumulation. At inference time, each request is independent. Micro-batching is purely an infrastructure optimization. The table below shows measured latency and throughput improvements:

| Component | Single Request | Batch of 32 | Throughput Gain |
|---|---|---|---|
| Tier 1 (BERT) | 3ms | 8ms | 12× |
| Tier 2 (LLM) | 150ms | 400ms | 12× |
| Embedding | 5ms | 10ms | 16× |

Strategy

There are two common accumulation strategies, each with a different latency-throughput trade-off:

  1. Window-based: Wait a fixed time window (e.g., 5ms) regardless of batch size. This keeps tail latency bounded, but during low-traffic periods you may underfill the batch.
  2. Size-based: Wait until the batch reaches a target size (e.g., 32). This maximizes GPU utilization, but during low traffic the batch may never fill, leaving the requests at the head of the queue waiting indefinitely.

Production systems typically use a hybrid: start a timer when the first request arrives, and fire the batch when either the timer expires or the batch fills, whichever comes first. This keeps p99 latency predictable while still achieving high GPU utilization during traffic spikes.
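The hybrid trigger fits in a small class. This sketch keeps the clock external (callers pass `now`) so the fire-on-size-or-timeout policy is deterministic and easy to unit-test; a real deployment would rely on the inference server's scheduler rather than hand-rolling this:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HybridBatcher:
    """Fire a batch when it reaches max_size OR when max_wait has elapsed
    since the first queued request, whichever comes first."""
    max_size: int = 32
    max_wait: float = 0.005  # 5ms window
    _items: List[object] = field(default_factory=list)
    _window_start: Optional[float] = None

    def add(self, item: object, now: float) -> Optional[List[object]]:
        if not self._items:
            self._window_start = now        # timer starts with the first request
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._flush()            # size trigger
        return None

    def poll(self, now: float) -> Optional[List[object]]:
        """Called periodically by the serving loop to enforce the timeout."""
        if self._items and now - self._window_start >= self.max_wait:
            return self._flush()            # timeout trigger
        return None

    def _flush(self) -> List[object]:
        batch, self._items = self._items, []
        self._window_start = None
        return batch
```

Starting the timer on the first request (not on every arrival) is what bounds p99 latency: no request waits longer than `max_wait` before its batch fires.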

The actual batching and GPU scheduling is handled by specialized inference servers. Triton Inference Server exposes a BLS (Business Logic Scripting) backend that lets you write custom batching logic in Python, while vLLM uses continuous batching with paged attention to overlap batch assembly with inference execution. Both approaches avoid the GPU starvation that occurs when a single large batch blocks the queue.

⚠️ Common mistake: Batching alone is not enough. Without proper request timeout handling, a slow request at the head of the batch blocks all subsequent requests in that batch. Set per-request deadlines at the gateway layer and immediately return a "REVIEW" verdict if the timeout fires, so the user doesn't experience a timeout error.

Geographic distribution and regional policies

Latency is governed by the speed of light, making physical distance a primary bottleneck for real-time applications. To achieve the sub-200ms latency requirement, moderation inference needs to happen geographically close to the user. This means distributing lightweight classifiers to edge regions while maintaining powerful LLMs in centralized hubs to handle the complex, low-confidence edge cases.

Beyond latency, geographic distribution also serves a critical policy-routing purpose. Content that is legal in one country may be illegal in another. Nazi symbolism is banned in Germany but protected speech in the United States. A platform operating globally cannot apply a single rulebook. The architecture must route content to region-specific policy configurations at the inference layer.

Policy routing strategy:

  • Each user request carries a region_code derived from their account settings and IP geolocation.
  • At inference time, the region code selects the active policy overlay. This overlay is injected into the Tier 2 LLM prompt alongside the base policy.
  • Tier 1 fast classifiers are trained on region-specific labeled data where available, and fall back to the most conservative interpretation for unlabeled regions.
  • Regional SLAs also vary: India's IT Rules 2021 mandate content removal within 3 hours for flagged posts, while the EU's DSA requires explicit transparency reporting on moderation outcomes.
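The overlay-selection step reduces to a lookup with a conservative fallback. The overlay strings and `resolve_policy` helper below are invented for illustration; production systems load overlays from a versioned policy service:

```python
BASE_POLICY = "No CSAM. No malware. No credible threats of violence."

# Hypothetical per-region overlays for illustration
REGION_OVERLAYS = {
    "DE": "Additionally block Nazi symbolism and Holocaust denial.",
    "IN": "Flagged content must be actioned within the 3-hour IT Rules 2021 SLA.",
    "US": "Apply the base policy; political speech receives wide latitude.",
}
CONSERVATIVE_OVERLAY = "Escalate all borderline content to human review."

def resolve_policy(region_code: str) -> str:
    """Select the policy text injected into the Tier 2 prompt for a request.
    Unknown regions fall back to the most conservative interpretation."""
    overlay = REGION_OVERLAYS.get(region_code, CONSERVATIVE_OVERLAY)
    return f"{BASE_POLICY}\n{overlay}"
```

Because the overlay is resolved per request, a single model deployment can enforce different rulebooks without per-region model variants.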

Architecture pattern

Deploy both Tier 1 and Tier 2 classifiers in major edge regions (US, EU). Route only the most complex cases from smaller regions to regional hubs with heavy GPU clusters. For APAC, Tier 1 runs locally to minimize latency for the bulk of safe traffic, with Tier 2 calls routed to a centralized hub. This hybrid approach keeps latency low for the vast majority of requests while centralizing expensive LLM inference.

Appeal and review workflow

No moderation system is perfect. In consumer products, Precision (avoiding false bans) is often more important than Recall (catching every bad post) because false bans destroy user trust. However, for severe violations like CSAM (Child Sexual Abuse Material), recall becomes the top priority. This tension requires an explicit appeal and review loop built directly into the core user experience.

When a user appeals a moderation decision:

  1. Auto-Review: Re-run the content through the pipeline with a "lenient" configuration or a larger, more accurate review model than the fast path uses. This catches obvious classifier errors immediately.
  2. Human Queue: If the auto-review upholds the ban, the case goes to a human reviewer. Crucially, the reviewer gets the full context: the flagged post, the user's history, the specific policy triggered, and the LLM's reasoning chain.

By automating the initial stages of an appeal, platforms can resolve the vast majority of user complaints without human intervention. This auto-review tier acts as a pressure relief valve for moderation teams, ensuring that only genuinely complex disputes need manual processing.
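A minimal sketch of that auto-review stage, with `review_model` as a hypothetical stand-in for the larger, more accurate model:

```python
from typing import Callable, Dict, Literal

Action = Literal["ALLOW", "BLOCK"]

def handle_appeal(content: str,
                  original_action: Action,
                  review_model: Callable[[str], Action]) -> Dict:
    """Stage 1 of an appeal: re-run the content through a larger model.
    Overturn immediately if it disagrees with the original decision;
    otherwise escalate to the human queue with full context attached."""
    second_opinion = review_model(content)
    if second_opinion != original_action:
        return {"resolution": "OVERTURNED", "needs_human": False}
    return {
        "resolution": "UPHELD_PENDING_HUMAN",
        "needs_human": True,
        "context": {"content": content, "original_action": original_action},
    }
```

Attaching the original decision and content to the escalation payload is what gives the human reviewer the full context described above.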

The feedback loop

The appeal system's real value is in generating training data:

  • False Positives: These are gold. They become "hard negatives" for the next training round of the Tier 1 classifier, actively reducing future false ban rates.
  • Confirmed Violations: Added to the dataset as new positive examples to strengthen pattern recognition.
  • Novel Violations: Edge cases that humans struggle with typically trigger formal policy reviews and prompt updates for the LLM.

Training moderation models on harmful content requires strict safety controls to prevent inadvertent exposure. The preferred approach is synthetic data generation: instead of feeding real toxic content to a model, prompt an LLM to generate complex adversarial and edge-case examples that teach the classifier to recognize subtle violations without handling actual harmful material. Human annotators who do review raw content operate under rotation policies and have access to psychological support, limiting cumulative exposure. Federated learning is another avenue where models train on distributed sensitive data without centralizing it. Verified decisions from the human review pipeline become ground-truth labels for offline retraining cycles, ensuring the automated system improves without engineers directly handling toxic datasets.

Regulatory and cultural considerations

Content policies are rarely global. What's considered standard political discourse in one country might be illegal hate speech in another. A well-architected moderation system needs to be flexible enough to apply different rulesets based on user geography, while still maintaining a baseline of universal safety.

| Region | Regulation | Implication |
|---|---|---|
| EU | DSA (Digital Services Act) | Mandatory appeal mechanisms, strict transparency rules. |
| US | Section 230 | Broad platform discretion, but strict reporting on CSAM. |
| India | IT Rules 2021 | Requirement to remove flagged content within 3 hours. |

Implementation

  • Base Policy: Global rules (e.g., no CSAM, no malware).
  • Regional Overlays: Inject region-specific instructions into the Tier 2 LLM prompt based on user IP/locale.
  • Geo-fencing: Some content may be blocked in Germany (e.g., Nazi symbolism) but allowed in the US (First Amendment).

Summary

  1. Tiered Architecture: Always filter with cheap models first. Use LLMs only when necessary.
  2. Latency vs. Accuracy: Optimizing for <200ms requires caching, batching, and distillation.
  3. Policy Agility: Use LLMs to adapt to policy changes instantly; retrain classifiers asynchronously.
  4. Multimodality: Don't ignore images/video. Use specialized pipelines and fuse the signals.
  5. Feedback Loops: The system is a cycle. Human decisions must improve the models.

Common misconceptions

  • "Just use an LLM for everything." Too slow and too expensive. At 10K RPS, pure LLM moderation would burn millions of dollars a month.
  • "Moderation is a binary classification problem." It's highly contextual. "I hate you" is bad; "I hate Mondays" is fine. Context matters.
  • "Once trained, the model is done." Adversaries evolve. Users find ways to bypass filters (e.g., "unc1vilized"). The model needs to be continuously retrained.
Evaluation Rubric

  1. Architects a multi-tier moderation pipeline using fast classifiers and LLMs
  2. Implements policy-aware zero-shot adaptation using context-injected LLM prompts
  3. Designs caching strategies using exact hashing (SHA256) and locality-sensitive hashing (SimHash)
  4. Optimizes GPU throughput using micro-batching for high-RPS environments
  5. Structures a multi-modal evaluation strategy using late-fusion layers
  6. Builds an appeals workflow with auto-review to generate hard-negative training data
Common Pitfalls
  • Using only an LLM (too slow and expensive)
  • Ignoring cultural and linguistic context
  • Not planning for appeals workflow
  • Overlooking adversarial content manipulation
  • Lack of versioning for policy changes

Key Concepts Tested

  • Multi-tier Model Cascades
  • Locality-Sensitive Hashing (LSH)
  • Micro-batching for GPU Optimization
  • Policy-Injected LLM Prompts
  • Multimodal Late-Fusion
  • Adversarial Evasion Defense
  • Human-in-the-Loop SLAs
  • Continuous Classifier Retraining
References

[1] Inan, H., et al. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint.
[2] Google DeepMind (2024). ShieldGemma: Generative AI Content Moderation. Google DeepMind Report.
[3] Hartvigsen, T., et al. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech. ACL 2022.
