Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.
Imagine managing a town square with 10,000 speakers shouting simultaneously. Some are friendly, some are spamming ads, and a few are dangerously violent. A human moderator can handle one conversation at a time; an AI system must handle thousands per second, instantly deciding who gets the microphone and who gets silenced. This is the challenge of real-time content moderation. It requires balancing safety, free speech, and speed at massive scale.
Designing this system requires balancing the raw throughput of lightweight classifiers with the reasoning capabilities of Large Language Models (LLMs). While traditional BERT (Bidirectional Encoder Representations from Transformers) models are fast, they struggle with subtlety, satire, and evolving policies. LLMs excel at understanding context but are too slow and expensive to run on every message. The engineering challenge is to architect a tiered pipeline (a cascade of increasingly sophisticated filters) that routes content to the right model at the right time.
A production-grade moderation system for a major social platform has to meet strict operational constraints to ensure user safety without disrupting the user experience. It needs to operate invisibly to good actors while instantly catching bad actors.
Latency: <200ms p95 (95th percentile) for real-time chat; <1s for posts. Fast turnaround is critical so the conversational flow isn't interrupted. If moderation takes too long, users experience noticeable lag when sending messages.
Handle 10K+ requests per second (RPS) with varying load spikes. Social platforms experience dramatic traffic surges during major real-world events (like sports finals or elections). The moderation pipeline needs to scale elastically to handle these spikes without falling behind.
Support evolving content policies without requiring full model retraining. The definition of a policy violation can change overnight based on new trends, slang, or geopolitical events.
Moderate text, images, video, and audio simultaneously. Users interact across multiple formats, and malicious actors often hide violations in cross-media contexts (for example, placing hateful text over a benign image).
Support transparent appeal workflows and human review loops. Automation errors are inevitable, so durable and transparent paths for human remediation are needed to maintain user trust.
The standard industry pattern is a cascade or tiered architecture. Content flows through a funnel of increasingly expensive but accurate models. This design isolates the bulk of the computational cost to a small percentage of edge cases, ensuring that simple requests are handled almost instantly.
By placing lightweight, high-throughput systems at the top of the funnel, the architecture preserves low latency for the vast majority of benign user interactions. Only the ambiguous, context-dependent content is escalated to the more complex and slower models further down the pipeline.
💡 Key insight: Don't send every request to the LLM. Tier 1 handles 90-95% of content: the obvious safe cases get auto-allowed, and the obvious violations get auto-blocked. Only the remaining 5-10% of ambiguous content needs to escalate to the expensive Tier 2 LLM pipeline.
To balance speed and contextual understanding, the system divides the workload across three distinct operational tiers. Each tier serves a specific purpose in the funnel, filtering out clear-cut cases and escalating only what it can't confidently resolve.
The first line of defense is a high-performance, specialized model (typically a distilled BERT, DeBERTa (Decoding-enhanced BERT with disentangled attention), or even a fastText (a library for efficient learning of word representations and text classification) classifier). These are smaller, faster versions of large transformers, trained on labeled data to detect specific categories of violations.
The classifier outputs a probability score for each violation category (for example, `[prob_hate, prob_violence, prob_spam, ...]`). The following Python code demonstrates a tiered moderation controller using ONNX (Open Neural Network Exchange) Runtime.
🎯 Production tip: Threshold tuning is itself a product decision. Set `block` thresholds conservatively (high confidence before removing content) to avoid over-blocking. `review` thresholds can be looser since humans are in the loop. Tune these quarterly against your false-positive and false-negative rates.
```python
import re
from typing import Dict, Literal, Optional

import numpy as np
import onnxruntime as ort

categories = [
    "hate_speech", "violence", "sexual_content",
    "self_harm", "spam", "misinformation",
]

class ModerationResult:
    def __init__(self, action: Literal["ALLOW", "BLOCK", "REVIEW"],
                 category: Optional[str] = None):
        self.action = action
        self.category = category

class FastModerator:
    def __init__(self):
        # Load a lightweight, distilled model via ONNX Runtime for low latency
        self.session = ort.InferenceSession("moderation-bert-v3-quantized.onnx")
        # Strict thresholds for auto-action
        self.thresholds = {
            "hate_speech": {"block": 0.98, "review": 0.65},
            "violence": {"block": 0.95, "review": 0.60},
        }

    def _preprocess(self, text: str) -> dict:
        # Normalize: lowercase, strip whitespace, replace URLs/mentions with tokens
        text = text.lower().strip()
        text = re.sub(r'https?://\S+', '<URL>', text)
        text = re.sub(r'@\w+', '<USER>', text)
        # In production, the tokenizer is baked into the ONNX graph.
        # For this example, inputs are tokenized externally.
        # The ONNX model input name is typically "input_ids".
        return {"input_ids": np.array([[1, 2, 3]])}  # placeholder

    def classify(self, text: str) -> ModerationResult:
        # Inference takes ~5-10ms on CPU
        inputs = self._preprocess(text)
        scores_array = self.session.run(None, inputs)[0]

        # Map the output array back to named categories (mock mapping logic)
        scores: Dict[str, float] = {
            cat: float(scores_array[0][i]) for i, cat in enumerate(categories)
        }

        # First pass: auto-block anything above the high-confidence threshold
        for cat, limits in self.thresholds.items():
            if scores[cat] > limits["block"]:
                return ModerationResult(action="BLOCK", category=cat)

        # Second pass: anything suspicious but not certain needs escalation
        for cat, limits in self.thresholds.items():
            if scores[cat] > limits["review"]:
                return ModerationResult(action="REVIEW", category=cat)

        # If all scores are low, it's safe
        return ModerationResult(action="ALLOW")
```
For the 5-10% of content that's "uncertain" (for example, potential hate speech versus reclaimed slurs, or news reporting versus glorification of violence), we use an LLM[1].
State-of-the-art systems now use specialized safety models like Qwen3.5 (a multimodal open-source model) or ShieldGemma[2] that are fine-tuned specifically for instruction-following moderation tasks.
Instead of generic training, we inject the current policy definitions directly into the context. This makes the system "policy-aware" and allows for immediate rule changes without retraining. The code block below illustrates a system prompt containing the exact policy rules. It serves as the instruction template for the Tier 2 LLM. The template takes the user's content as an input variable and instructs the model to output a structured JSON decision with an explicit reasoning chain.
```text
System: You are a content moderation expert.
Your task is to classify user content based on the following policy:

<POLICY_DEFINITION>
Hate Speech: Dehumanizing speech, calls for violence, or inferiority claims based on protected characteristics (race, religion, etc.).
Exceptions:
- Counterspeech (raising awareness)
- Self-referential use (reclaimed terms)
- Fictional content (unless glorifying)
</POLICY_DEFINITION>

Analyze the content below. Think step-by-step about whether it falls under an exception.
Respond with JSON: { "action": "ALLOW" | "BLOCK" | "ESCALATE", "reason": "...", "category": "..." }

Content: {user_content}
```
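Downstream code should never assume the model emits valid JSON. A minimal sketch of the consuming side (the `Tier2Decision` shape and `parse_llm_decision` helper are illustrative, not part of any specific library): validate the `action` field and fail closed to ESCALATE on anything malformed, so a confused model never silently allows content.

```python
import json
from typing import Literal, TypedDict

class Tier2Decision(TypedDict):
    action: Literal["ALLOW", "BLOCK", "ESCALATE"]
    reason: str
    category: str

def parse_llm_decision(raw: str) -> Tier2Decision:
    """Validate the LLM's JSON output; fall back to ESCALATE on anything malformed."""
    fallback: Tier2Decision = {
        "action": "ESCALATE",
        "reason": "unparseable model output",
        "category": "unknown",
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    # Reject unknown or missing actions rather than guessing
    if data.get("action") not in ("ALLOW", "BLOCK", "ESCALATE"):
        return fallback
    return {
        "action": data["action"],
        "reason": str(data.get("reason", "")),
        "category": str(data.get("category", "")),
    }
```

Failing closed here means a malformed response costs one extra review, never a missed violation.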
The structured JSON output includes an explicit reasoning field (`"reason": "..."`), which is crucial for the appeals process. When the LLM is low-confidence (for example, returning an ESCALATE status or a low probability score), the content gets queued for human review. This is the most expensive and slowest tier, but it's indispensable for maintaining system integrity and providing ground-truth labels for future training.
Human reviewers operate under strict Service Level Agreements (SLAs) tailored to the severity of the potential violation. Critical safety issues require immediate intervention, whereas ambiguous policy debates can be handled asynchronously.
| Priority | SLA (Service Level Agreement) | Typical Cases |
|---|---|---|
| P0 (Critical) | 15 min | CSAM (Child Sexual Abuse Material), imminent self-harm, terrorism |
| P1 (High) | 1 hour | Severe harassment, viral misinformation |
| P2 (Normal) | 24 hours | General policy gray areas, appeals |
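Routing into these queues can be sketched as a category-to-priority mapping with an SLA deadline stamped at enqueue time. The `CATEGORY_PRIORITY` table and `enqueue_for_review` helper below are hypothetical; real taxonomies are far larger and policy-specific.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from violation category to review priority
CATEGORY_PRIORITY = {
    "csam": "P0",
    "self_harm": "P0",
    "terrorism": "P0",
    "harassment": "P1",
    "misinformation": "P1",
}

# SLA windows in minutes, matching the table above
SLA_MINUTES = {"P0": 15, "P1": 60, "P2": 24 * 60}

@dataclass
class ReviewTicket:
    category: str
    priority: str
    deadline: datetime

def enqueue_for_review(category: str) -> ReviewTicket:
    # Unmapped categories default to the normal (P2) queue
    priority = CATEGORY_PRIORITY.get(category, "P2")
    deadline = datetime.now(timezone.utc) + timedelta(minutes=SLA_MINUTES[priority])
    return ReviewTicket(category, priority, deadline)
```

Stamping the deadline at enqueue time makes SLA breaches directly measurable: any ticket still open past its `deadline` is an alertable event.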
To protect the psychological well-being of the moderation team, platforms employ safety tooling. This includes blurring graphic imagery by default, removing audio from videos, and strictly limiting the amount of time any single reviewer spends evaluating P0 content. This tier isn't just a fallback; it's the critical feedback mechanism that defines what the automated systems will learn next.
Modern platforms aren't just text. They need to moderate images, video, and audio. The architecture extends the tiered approach to other modalities by passing each medium through its own specialized pipeline before combining the signals.
A major engineering challenge in multimodality is handling cross-modal context. A benign text overlay like "Look at what I found today" becomes a policy violation when paired with a graphic image. To handle this, the outputs from individual classifiers need to feed into a late-fusion layer or a multimodal LLM that can evaluate the combined context of the post.
| Content Type | Tier 1 (Fast) | Tier 2 (Deep) |
|---|---|---|
| Text | DistilBERT / DeBERTa | Qwen3.5 / ShieldGemma |
| Images | ResNet / EfficientNet (NSFW detection) | Vision-Language Models (CLIP, Qwen3.5 multimodal) |
| Video | Keyframe sampling + Image Classifier | Multimodal judge / vision-language model on sampled frames |
| Audio | Audio event classification | Whisper transcription → Text Pipeline |
Here is how a keyframe sampler works in practice. Given a video, it extracts frames at a fixed interval (e.g., every 1 second) plus additional frames whenever a significant scene change is detected. The combined set of keyframes is then passed to the image classifier pipeline. Scenes with rapid cuts or flashing content need extra coverage.
```python
from dataclasses import dataclass
from typing import List

import cv2
import numpy as np

@dataclass
class Keyframe:
    frame_index: int
    timestamp_sec: float
    is_scene_cut: bool

def extract_keyframes(video_path: str, fps: int = 30,
                      sample_interval_sec: float = 1.0,
                      scene_cut_threshold: float = 30.0) -> List[Keyframe]:
    """
    Extract keyframes from a video at regular intervals plus scene cuts.

    Args:
        video_path: Path to the video file.
        fps: Frames per second of the video (use cv2.VideoCapture to detect).
        sample_interval_sec: Extract one frame every N seconds.
        scene_cut_threshold: Mean absolute pixel difference above which
            two consecutive frames are treated as a scene cut.
    """
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    last_frame = None
    keyframes: List[Keyframe] = []

    interval_frames = int(sample_interval_sec * fps)

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        if last_frame is not None:
            diff = np.mean(np.abs(gray.astype(float) - last_frame.astype(float)))
            if diff > scene_cut_threshold:
                # Scene cuts get their own keyframes for full context
                keyframes.append(Keyframe(frame_idx, frame_idx / fps, True))

        # Regular interval samples
        if frame_idx % interval_frames == 0:
            keyframes.append(Keyframe(frame_idx, frame_idx / fps, False))

        last_frame = gray
        frame_idx += 1

    cap.release()
    return keyframes
```
🎯 Production tip: For live streams, use a sliding window of the last N frames and continuously emit keyframes to the moderation pipeline. For on-demand video uploads, batch processing is cheaper because you can wait for the full file before starting inference.
Once the individual pipelines process their respective modalities, the late-fusion layer normalizes the confidence scores across different models. This aggregated representation then feeds into a final classification head or is used as the context for a multimodal LLM to make the ultimate moderation decision. If the combined score falls within the uncertain range, the entire multimodal package is escalated to human review.
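The fusion step can be sketched as follows. This version uses max-pooling over per-modality category scores — a simplifying assumption; production systems would train a fusion head or hand the combined context to a multimodal LLM. The function name and thresholds are illustrative.

```python
from typing import Dict, Tuple

def late_fusion(modality_scores: Dict[str, Dict[str, float]],
                block: float = 0.95,
                review: float = 0.60) -> Tuple[str, Dict[str, float]]:
    """
    modality_scores: e.g. {"text": {"hate_speech": 0.2}, "image": {"violence": 0.9}}
    Takes the max score per category across modalities, since a violation
    surfacing in any single modality should dominate the decision.
    """
    fused: Dict[str, float] = {}
    for scores in modality_scores.values():
        for cat, s in scores.items():
            fused[cat] = max(fused.get(cat, 0.0), s)

    top = max(fused.values(), default=0.0)
    if top >= block:
        return "BLOCK", fused
    if top >= review:
        # Uncertain band: escalate the whole multimodal package to human review
        return "REVIEW", fused
    return "ALLOW", fused
```

Note that max-pooling cannot catch violations that only emerge from the *combination* of benign modalities (the text-over-image case above); that failure mode is exactly why the uncertain band routes the full package to a multimodal judge or a human.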
⚠️ Common mistake: Treating video as a black box. Moderating every frame is impossible. Keyframe sampling (e.g., 1 frame per second) plus audio transcription is the standard efficient approach. Note that NSFW classifiers need to run on every keyframe to catch flashing content, which is why sampling rate is a critical tuning knob.
A major challenge in moderation is that the rules change constantly. A new slang term becomes offensive, or a geopolitical event requires new strictness on misinformation. The moderation architecture needs to be able to deploy new policy enforcements in minutes, not weeks, to prevent the viral spread of harmful content. The decision tree below shows how the system routes content based on model confidence scores: high-confidence violations are auto-blocked, mid-range cases go to human review, and low-confidence content is auto-approved. The key architectural insight is the two-track policy update workflow: Tier 2 adapts immediately via prompt changes, while Tier 1 lags because it needs retraining on new labeled data.
💡 Key insight: The LLM-based Tier 2 adapts immediately via prompt engineering. The Tier 1 classifier lags behind because it needs to be retrained on new data (often labeled by Tier 2 or humans) to learn the new pattern. This "student-teacher" loop keeps the fast path efficient while the slow path handles novelty.
Well-designed systems also continuously generate "red team" data to test classifiers against adversarial attacks (for example, using datasets like ToxiGen[3] to validate resilience against implicit hate speech).
Building a system that can accurately classify content is only half the challenge. The other half is making sure the system stays responsive when traffic spikes unpredictably during major global events. Achieving 10,000+ requests per second (RPS) requires heavy optimization at the infrastructure layer. We can use a combination of caching, batching, and intelligent routing to keep latency low and compute costs manageable.
Duplicate and near-duplicate content is common (reposts, viral memes, copypasta). Never waste GPU cycles re-moderating known content.
The following caching implementation shows how to bypass the ML pipeline entirely for known content. It takes the raw user content as input and checks for exact string matches using SHA256 (Secure Hash Algorithm 256), returning instantly if found. If no exact match exists, it checks for minor edits using SimHash-based locality-sensitive hashing (LSH), outputting a cached moderation decision if a high-similarity match is found.
Here is why LSH matters with a concrete example. A user posts "You are such an idiot" and it gets reviewed and blocked. Another user reposts the same text with a single character changed: "You are such an id1ot." A SHA256 exact match would miss this, but SimHash would produce fingerprints that differ by only 1 bit, and the LSH bucketing system would catch it immediately. The k=3 parameter in the code below means two strings are considered near-duplicates if their SimHash fingerprints differ in at most 3 of 64 bit positions.
```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Literal, Optional

from redis import Redis
from simhash import Simhash, SimhashIndex  # pip install simhash

@dataclass
class CachedModerationResult:
    action: Literal["ALLOW", "BLOCK", "REVIEW"]
    category: Optional[str] = None

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def normalize(text: str) -> str:
    return text.lower().strip()

class ModerationCache:
    def __init__(self):
        self.redis = Redis()
        self.ttl = 3600  # 1 hour TTL on exact-match entries
        self.fuzzy_index = SimhashIndex([], k=3)  # near-duplicate detection
        self.fuzzy_results: Dict[str, CachedModerationResult] = {}

    def store(self, content: str, result: CachedModerationResult) -> None:
        content_hash = sha256(normalize(content))
        self.redis.setex(content_hash, self.ttl, json.dumps(result.__dict__))
        self.fuzzy_index.add(content_hash, Simhash(content))
        self.fuzzy_results[content_hash] = result

    def check(self, content: str) -> Optional[CachedModerationResult]:
        # 1. Exact match (O(1)) - SHA256: catches exact copy-pastes
        content_hash = sha256(normalize(content))
        if raw := self.redis.get(content_hash):
            return CachedModerationResult(**json.loads(raw))

        # 2. Near-duplicate (O(log N)) - SimHash / LSH: catches minor edits
        near_dups = self.fuzzy_index.get_near_dups(Simhash(content))
        if near_dups:
            return self.fuzzy_results.get(near_dups[0])

        return None  # Cache miss - run full pipeline
```
At scale, 30-50% of content can often be served from cache, dramatically reducing inference costs.
Here is how the three fingerprinting strategies compare in practice:
| Strategy | Speed | What it Catches | False Positive Risk |
|---|---|---|---|
| SHA256 exact hash | <1ms | Identical copy-pastes | Very low (collision unlikely) |
| SimHash + LSH | ~2ms | Minor text edits, character substitutions | Moderate (similar-but-different may match) |
| Perceptual hashing (pHash) | ~5ms | Image/video transformations | Moderate (semantically different may match) |
| Embedding cosine similarity | ~10ms | Semantically similar content | Higher (requires threshold tuning) |
In practice, most systems layer these: SHA256 first (fastest), then SimHash for text, then perceptual hashing for images and video. Embedding similarity is the last resort because it requires a full model inference call.
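The layering can be expressed as a short-circuiting cascade that runs checks cheapest-first. The `layered_lookup` helper and the stub checkers below are illustrative; real checkers would wrap the Redis, SimhashIndex, and perceptual-hash stores described above.

```python
from typing import Callable, List, Optional, Tuple

Checker = Tuple[str, Callable[[str], Optional[str]]]

def layered_lookup(content: str, checkers: List[Checker]) -> Optional[Tuple[str, str]]:
    """
    checkers: ordered (name, fn) pairs, cheapest first.
    Each fn returns a cached verdict string, or None on a miss.
    """
    for name, check in checkers:
        verdict = check(content)
        if verdict is not None:
            return name, verdict  # short-circuit on the first hit
    return None  # all layers missed: run the full pipeline

# Usage with stub checkers (placeholders for the real fingerprint stores):
checkers: List[Checker] = [
    ("sha256", lambda c: "BLOCK" if c == "known bad post" else None),
    ("simhash", lambda c: None),    # stub miss
    ("embedding", lambda c: None),  # last resort: requires a model call
]
```

Because the list is ordered by cost, the expensive embedding check only runs for content that every cheaper layer has already missed.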
💡 Key insight: The fingerprinting layer pays for itself immediately. At 10K RPS, a 40% cache hit rate eliminates 4,000 GPU inference calls per second. If each inference costs 0.1ms of GPU time, the full workload is 1 GPU-second of compute per second, and caching removes 0.4 of it. At scale, that 40% reduction is the difference between needing 10 GPUs versus 16.
🔬 Research insight: Micro-batching at inference time amortizes the overhead of GPU kernel launches and memory transfers across multiple requests. At batch size 32, a typical A100 sustains 200+ TFLOPs versus under 20 TFLOPs at batch size 1, making batching the single highest-impact optimization in a high-RPS moderation system.
GPUs are throughput-optimized devices. Processing requests one by one leaves GPU memory bandwidth and compute massively underutilized. Micro-batching accumulates incoming requests into a small batch (typically 8-32 items) before running inference, so the GPU processes all items in parallel and shares the overhead cost.
This is distinct from training-time gradient accumulation. At inference time, each request is independent. Micro-batching is purely an infrastructure optimization. The table below shows measured latency and throughput improvements:
| Component | Single Request | Batch of 32 | Throughput Gain |
|---|---|---|---|
| Tier 1 (BERT) | 3ms | 8ms | 12× |
| Tier 2 (LLM) | 150ms | 400ms | 12× |
| Embedding | 5ms | 10ms | 16× |
There are two common accumulation strategies, each with a different latency-throughput trade-off: time-based batching fires the batch every N milliseconds regardless of how full it is, while size-based batching waits until a fixed number of requests has accumulated.
Production systems typically use a hybrid: start a timer when the first request arrives, and fire the batch when either the timer expires or the batch fills, whichever comes first. This keeps p99 latency predictable while still achieving high GPU utilization during traffic spikes.
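The hybrid strategy can be sketched as follows, assuming an asyncio service and a synchronous `infer_batch` function (both names are hypothetical). This is a teaching sketch; production deployments would delegate this logic to an inference server rather than hand-roll it.

```python
import asyncio
from typing import Any, Callable, List

class MicroBatcher:
    """Hybrid batcher: fires when the batch fills OR the timer expires, whichever first."""

    def __init__(self, infer_batch: Callable[[List[Any]], List[Any]],
                 max_batch: int = 32, max_wait_ms: float = 5.0):
        self.infer_batch = infer_batch
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut  # resolved when the batch containing this item runs

    async def run(self) -> None:
        while True:
            # The first request starts the timer
            item, fut = await self.queue.get()
            batch = [(item, fut)]
            deadline = asyncio.get_running_loop().time() + self.max_wait

            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break  # timer expired: fire a partial batch

            # One GPU call for the whole batch
            results = self.infer_batch([i for i, _ in batch])
            for (_, f), r in zip(batch, results):
                f.set_result(r)
```

Under light load the timer dominates (bounded added latency of `max_wait_ms`); under heavy load batches fill before the timer fires, so the GPU stays saturated.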
The actual batching and GPU scheduling is handled by specialized inference servers. Triton Inference Server exposes a BLS (Business Logic Scripting) backend that lets you write custom batching logic in Python, while vLLM uses continuous batching with paged attention to overlap batch assembly with inference execution. Both approaches avoid the GPU starvation that occurs when a single large batch blocks the queue.
⚠️ Common mistake: Batching alone is not enough. Without proper request timeout handling, a slow request at the head of the batch blocks all subsequent requests in that batch. Set per-request deadlines at the gateway layer and immediately return a "REVIEW" verdict if the timeout fires, so the user doesn't experience a timeout error.
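That deadline pattern can be sketched with `asyncio.wait_for`, assuming an async `moderate` callable (hypothetical): on timeout the request fails over to the review queue instead of surfacing an error to the user.

```python
import asyncio
from typing import Awaitable, Callable

async def moderate_with_deadline(moderate: Callable[[str], Awaitable[str]],
                                 content: str,
                                 deadline_ms: float = 180.0) -> str:
    """Enforce a per-request deadline; fail over to human review, not a user-facing error."""
    try:
        return await asyncio.wait_for(moderate(content), timeout=deadline_ms / 1000.0)
    except asyncio.TimeoutError:
        # Verdict on timeout: queue for review rather than blocking the UX
        return "REVIEW"
```

Whether timed-out content is shown while the review is pending is a product decision; the key point is that the deadline fires at the gateway, not inside the batch.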
Latency is governed by the speed of light, making physical distance a primary bottleneck for real-time applications. To achieve the sub-200ms latency requirement, moderation inference needs to happen geographically close to the user. This means distributing lightweight classifiers to edge regions while maintaining powerful LLMs in centralized hubs to handle the complex, low-confidence edge cases.
Beyond latency, geographic distribution also serves a critical policy-routing purpose. Content that is legal in one country may be illegal in another. Nazi symbolism is banned in Germany but protected speech in the United States. A platform operating globally cannot apply a single rulebook. The architecture must route content to region-specific policy configurations at the inference layer.
Policy routing strategy: tag every request with a `region_code` derived from the user's account settings and IP geolocation, and use that code to select the policy configuration applied at the inference layer.
Deploy both Tier 1 and Tier 2 classifiers in major edge regions (US, EU). Route only the most complex cases from smaller regions to regional hubs with heavy GPU clusters. For APAC, Tier 1 runs locally to minimize latency for the bulk of safe traffic, with Tier 2 calls routed to a centralized hub. This hybrid approach keeps latency low for the vast majority of requests while centralizing expensive LLM inference.
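One way to sketch the routing table follows. The region codes, ruleset IDs, and hub names are illustrative; a real system would load this mapping from a config service so policy changes don't require a deploy.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PolicyConfig:
    ruleset_id: str
    extra_banned_categories: List[str] = field(default_factory=list)
    tier2_endpoint: str = "hub-us-east"  # hypothetical hub name

# Hypothetical region table reflecting the examples in the text
REGION_POLICIES: Dict[str, PolicyConfig] = {
    "DE": PolicyConfig("eu-strict", ["nazi_symbolism"], "hub-eu-west"),
    "US": PolicyConfig("us-default", [], "hub-us-east"),
    "IN": PolicyConfig("in-it-rules", [], "hub-ap-south"),
}

# Universal safety baseline applied to any unmapped region
GLOBAL_BASELINE = PolicyConfig("global-baseline")

def resolve_policy(region_code: str) -> PolicyConfig:
    return REGION_POLICIES.get(region_code, GLOBAL_BASELINE)
```

The fallback to `GLOBAL_BASELINE` encodes the principle stated below: regional rulesets extend, but never replace, the universal safety floor.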
No moderation system is perfect. In consumer products, Precision (avoiding false bans) is often more important than Recall (catching every bad post) because false bans destroy user trust. However, for severe violations like CSAM (Child Sexual Abuse Material), recall becomes the top priority. This tension requires an explicit appeal and review loop built directly into the core user experience.
When a user appeals a moderation decision, the case is first re-examined automatically, typically by re-running the Tier 2 LLM with fuller context, before it ever reaches a human queue. By automating the initial stages of an appeal, platforms can resolve the vast majority of user complaints without human intervention. This auto-review tier acts as a pressure-relief valve for moderation teams, ensuring that only genuinely complex disputes need manual processing.
The appeal system's real value is in generating training data: every overturned decision is a labeled example of a model error, and those corrections feed directly into the retraining loop.
Training moderation models on harmful content requires strict safety controls to prevent inadvertent exposure. The preferred approach is synthetic data generation: instead of feeding real toxic content to a model, prompt an LLM to generate complex adversarial and edge-case examples that teach the classifier to recognize subtle violations without handling actual harmful material. Human annotators who do review raw content operate under rotation policies and have access to psychological support, limiting cumulative exposure. Federated learning is another avenue where models train on distributed sensitive data without centralizing it. Verified decisions from the human review pipeline become ground-truth labels for offline retraining cycles, ensuring the automated system improves without engineers directly handling toxic datasets.
Content policies are rarely global. What's considered standard political discourse in one country might be illegal hate speech in another. A well-architected moderation system needs to be flexible enough to apply different rulesets based on user geography, while still maintaining a baseline of universal safety.
| Region | Regulation | Implication |
|---|---|---|
| EU | DSA (Digital Services Act) | Mandatory appeal mechanisms, strict transparency rules. |
| US | Section 230 | Broad platform discretion, but strict reporting on CSAM. |
| India | IT Rules 2021 | Requirement to remove flagged content within 3 hours. |
[1] Inan, H., et al. (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv preprint.
[2] Google DeepMind (2024). "ShieldGemma: Generative AI Content Moderation." Google DeepMind Report.
[3] Hartvigsen, T., et al. (2022). "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech." ACL 2022.