LeetLLM

Mechanistic Interpretability

Learn how sparse autoencoders decompose transformer activations into candidate interpretable features, support circuit tracing, and enable controlled activation-steering experiments.


Layer normalization kept the residual stream numerically stable. Mechanistic interpretability asks the next question: what do the directions inside that stream mean?

A code-assistant LLM keeps refusing legitimate password-reset scripts because they mention credentials. When you inspect the attention heads and MLP neurons that fire on those prompts, dozens of neurons light up in confusing combinations. One neuron activates on "rotate the API key," "credential theft attempt," and "urgent tone." Another neuron fires for "secret manager docs" and also for "shell command syntax." You can't point to a clean "credential-safety feature."

This is the core problem of polysemanticity (one neuron responding to multiple unrelated concepts at once) inside large language models. Raw neurons often fail to map to single human concepts. One influential hypothesis is superposition: useful features can be represented in overlapping directions when clean coordinate axes are scarce. Toy models demonstrate this mechanism; for real language models, it remains a motivating hypothesis and an empirical question.

Mechanistic interpretability researchers train sparse autoencoders (SAEs), small models that reconstruct model activations through sparse latent codes, to look for more inspectable feature directions [1: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, https://transformer-circuits.pub/2023/monosemantic-features/index.html] [2: Sparse Autoencoders Find Highly Interpretable Features in Language Models, https://arxiv.org/abs/2309.08600]. In evaluated models, Bricken et al. and Cunningham et al. found SAE features that were more interpretable than raw-neuron or alternative baselines. A candidate feature might fire mostly on credential-reset language, exploit-request language, or urgent instructions. Its label remains provisional until examples and interventions support it.

You'll build that path in layers: how SAEs work, what they reveal beyond raw neurons, how to train a minimal NumPy version, and how feature-level interventions support modern safety research.

The problem: polysemantic neurons and superposition

In a transformer, residual-stream vectors and MLP activations combine many signals. If a model represents more useful features than available orthogonal axes, non-orthogonal directions offer an efficient representation but introduce interference. This is the superposition hypothesis for why the same neuron can participate in unrelated computations.

The 2022 "Toy Models of Superposition" paper demonstrated this mechanism with tiny ReLU networks trained on synthetic sparse data. When the data contained more ground-truth features than hidden units, the networks learned overlapping representations. The experiment doesn't establish the full explanation for an LLM, but it gives researchers a concrete model to test against LLM activations [3: Toy Models of Superposition, https://transformer-circuits.pub/2022/toy_model/index.html].

Direct inspection of weights or raw neuron activations therefore gives you an entangled view. A raw neuron might mix "credential reset + exploit request + urgency." You can't tell which part of the mixture the model is using at any moment without a cleaner representation and an intervention test.
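A quick numerical check of the geometric constraint, offered as a sketch rather than a reproduction of the paper's experiments. It packs 24 random unit directions into 8 dimensions and measures how far the set is from orthogonal. The sizes, the seed, and the use of random directions in place of learned features are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed chosen for this sketch
d_model, n_features = 8, 24     # more feature directions than dimensions

# Random unit vectors stand in for feature directions (an assumption, not learned features).
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

gram = directions @ directions.T          # pairwise cosine similarities
off_diagonal = gram - np.eye(n_features)  # drop each direction's similarity with itself
max_overlap = float(np.abs(off_diagonal).max())
rank = int(np.linalg.matrix_rank(directions))

print(f"rank of the direction matrix: {rank} (cannot exceed d_model = {d_model})")
print(f"largest |cosine| between distinct directions: {max_overlap:.2f}")
```

Because the rank can't exceed 8, at most 8 of the 24 directions can be mutually orthogonal. Every additional direction must overlap with the others, which is the interference the superposition hypothesis describes.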

Figure: Toy superposition geometry with six feature directions in two dimensions and an exact projection trace showing credential vector (1, 0) and urgency vector (0.8, 0.6) combining into activation (2.3, 0.6) on a shared raw neuron axis.
The credential and urgency directions both project onto raw axis A. Their exact mixture raises that coordinate from the credential-only value $1.5$ to $2.3$, so a large neuron reading doesn't isolate one concept.

The calculation below measures that overlap with cosine similarity and then forms the same mixture shown in the figure.

nonorthogonal-directions-interfere.py
    import math

    credential = [1.0, 0.0]
    urgency = [0.8, 0.6]

    def cosine(left, right):
        dot = sum(a * b for a, b in zip(left, right))
        norm = math.sqrt(sum(a * a for a in left) * sum(b * b for b in right))
        return dot / norm

    mixed_activation = [1.5 * credential[0] + urgency[0], 1.5 * credential[1] + urgency[1]]
    print(f"credential/urgency cosine: {cosine(credential, urgency):.2f}")
    print(f"axis A reading after both fire: {mixed_activation[0]:.2f}")

    assert cosine(credential, urgency) > 0
    assert mixed_activation[0] > 1.5

Overlapping directions

    credential/urgency cosine: 0.80
    axis A reading after both fire: 2.30

How sparse autoencoders produce candidate features

A sparse autoencoder attacks the entanglement problem with classic dictionary learning. You model activation vectors $x \in \mathbb{R}^{d_\text{model}}$ as approximately sparse linear combinations of directions from a much larger learned dictionary. Some learned directions may correspond to useful model features. That correspondence is a hypothesis to evaluate, not an assumption built into the math.

An SAE consists of two learned matrices plus a non-linearity:

  • Encoder: $z = f(W_\text{enc} x + b_\text{enc})$, where $W_\text{enc} \in \mathbb{R}^{n_\text{features} \times d_\text{model}}$ with $n_\text{features} \gg d_\text{model}$ and $f$ is ReLU or a TopK operation. Experiments choose dictionary widths larger than the activation dimension and compare fidelity, sparsity, and feature quality.
  • Decoder: $\hat{x} = W_\text{dec} z + b_\text{dec}$, where $W_\text{dec} \in \mathbb{R}^{d_\text{model} \times n_\text{features}}$, so each column of $W_\text{dec}$ is one dictionary direction.
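The two maps above can be sketched in a few lines of NumPy to make the shapes concrete. The sizes and random weights here are hypothetical, and because nothing has been trained, the code it produces is neither sparse nor accurate yet.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 4, 16  # hypothetical sizes with n_features > d_model

# Untrained stand-in parameters, shaped as in the encoder and decoder definitions.
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode one activation vector into a latent code, then decode it."""
    z = np.maximum(0.0, W_enc @ x + b_enc)  # encoder with f = ReLU
    x_hat = W_dec @ z + b_dec               # decoder
    return z, x_hat

x = rng.normal(size=d_model)
z, x_hat = sae_forward(x)
print(f"x: {x.shape}, z: {z.shape}, x_hat: {x_hat.shape}")
print(f"active latents: {int((z > 0).sum())} of {n_features}")
```

Training is what pushes `z` toward few active coordinates while keeping `x_hat` close to `x`.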

Training minimizes a loss that has two competing terms:

$$L = \underbrace{\|x - \hat{x}\|_2^2}_{\text{reconstruction}} + \lambda \underbrace{\|z\|_1}_{\text{sparsity}}$$

(or a TopK variant that keeps the $k$ largest latent activations and zeros the rest). The reconstruction term rewards preserving information from the original activation. The sparsity term pushes the latent code $z$ toward few active coordinates rather than a dense remapping. Learned columns of $W_\text{dec}$ are candidate feature directions; evaluating them requires top-activating examples and intervention tests in addition to reconstruction metrics.

reconstruction-sparsity-objective.py
    def mse(target, reconstructed):
        return sum((a - b) ** 2 for a, b in zip(target, reconstructed)) / len(target)

    def objective(target, reconstructed, code, sparsity_weight):
        return mse(target, reconstructed) + sparsity_weight * sum(abs(value) for value in code)

    target = [1.0, 0.0]
    sparse_code = [0.75, 0.0]
    sparse_reconstruction = [0.75, 0.0]
    dense_code = [0.75, 0.50]
    dense_reconstruction = [1.0, 0.0]

    for weight in [0.01, 0.10]:
        sparse_loss = objective(target, sparse_reconstruction, sparse_code, weight)
        dense_loss = objective(target, dense_reconstruction, dense_code, weight)
        chosen = "sparse" if sparse_loss < dense_loss else "dense"
        print(f"lambda={weight:.2f}: sparse={sparse_loss:.4f}, dense={dense_loss:.4f} -> {chosen}")

    assert objective(target, dense_reconstruction, dense_code, 0.01) < objective(target, sparse_reconstruction, sparse_code, 0.01)
    assert objective(target, sparse_reconstruction, sparse_code, 0.10) < objective(target, dense_reconstruction, dense_code, 0.10)

Objective tradeoff

    lambda=0.01: sparse=0.0387, dense=0.0125 -> dense
    lambda=0.10: sparse=0.1063, dense=0.1250 -> sparse

The original "Towards Monosemanticity" result (2023) made this concrete: an SAE trained on a one-layer transformer decomposed a 512-neuron MLP layer into more than 4,000 features, with inspected examples for patterns such as DNA sequences, legal language, HTTP requests, and Hebrew text. A year later, "Scaling Monosemanticity" (2024) applied the approach to a middle-layer residual stream of Claude 3 Sonnet, extracting up to 34 million features and reporting abstract and multilingual examples [1] [4: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, https://transformer-circuits.pub/2024/scaling-monosemanticity/].

topk-keeps-largest-latents.py
    def top_k_sparse(values, k):
        ranked = sorted(range(len(values)), key=lambda index: values[index], reverse=True)
        keep = set(ranked[:k])
        return [value if index in keep else 0.0 for index, value in enumerate(values)]

    latent = [0.1, 2.4, 0.3, 1.7, 0.05]
    sparse = top_k_sparse(latent, k=2)

    assert sparse == [0.0, 2.4, 0.0, 1.7, 0.0]
    assert sum(value != 0.0 for value in sparse) == 2
Figure: Sparse autoencoder tensor diagram for the article toy trainer, mapping an 8-dimensional activation through a 64 by 8 encoder into 64 latent slots with about 18.9 active, then through an 8 by 64 decoder back to 8 dimensions, alongside reconstruction and direction-alignment diagnostics.
The toy SAE maps $x \in \mathbb{R}^8$ into a 64-slot dictionary and reconstructs back to eight dimensions. Reconstruction improves from $1.426$ to $0.128$, but direction alignment stays at $0.798 \rightarrow 0.799$, so this run demonstrates sparse reconstruction rather than feature recovery.

The SAE learns an overcomplete sparse dictionary. The input activation $x$ is projected into a higher-dimensional sparse code $z$, and the decoder reconstructs $\hat{x}$ from active coordinates. Reconstruction and sparsity are necessary diagnostics; whether a direction is interpretable still has to be measured.

Where to attach SAEs inside a transformer

Researchers attach SAEs at several natural "hook points":

  • Residual stream immediately after an attention block or after the MLP block.
  • The output of the MLP (the "MLP SAE").
  • Attention-layer outputs (more experimental).

Cunningham et al. presented their method for residual-stream, MLP-output, or attention-output activations. They mainly studied residual streams and reported mixed MLP-SAE results, including dead features. Marks et al. evaluated feature circuits with SAEs at several sublayers in Pythia and Gemma models. Hook point changes the question you can ask, but a residual-stream or MLP-output label still needs feature-quality and causal validation [2] [5: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, https://arxiv.org/abs/2403.19647].
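To make the hook points concrete, here is a minimal sketch of one simplified block that records the activations an SAE could be trained on. The cache keys (`attn_out`, `resid_mid`, `mlp_out`, `resid_post`) are hypothetical labels chosen for this example, not the API of any particular library. The "attention" is a plain linear map with no softmax or token mixing, and layer normalization is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 3, 4, 8  # hypothetical toy sizes

W_attn = rng.normal(scale=0.3, size=(d_model, d_model))  # stand-in for an attention sublayer
W_in = rng.normal(scale=0.3, size=(d_model, d_hidden))
W_out = rng.normal(scale=0.3, size=(d_hidden, d_model))

def block_with_cache(resid):
    """Run one simplified block and record activations at candidate hook points."""
    cache = {}
    cache["attn_out"] = resid @ W_attn                      # attention-layer output
    cache["resid_mid"] = resid + cache["attn_out"]          # residual stream after attention
    cache["mlp_out"] = np.maximum(0.0, cache["resid_mid"] @ W_in) @ W_out
    cache["resid_post"] = cache["resid_mid"] + cache["mlp_out"]  # residual stream after the MLP
    return cache["resid_post"], cache

resid = rng.normal(size=(seq_len, d_model))
_, cache = block_with_cache(resid)
for name, value in cache.items():
    print(f"{name}: shape {value.shape}")
```

Each cached tensor has one row per token, so an SAE attached at any of these points trains on rows of the corresponding array.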

Feature circuits: from features to algorithms

Once you have candidate interpretable features at multiple layers, you can study feature circuits: sparse subgraphs that connect features across layers. Attribution methods can suggest which earlier features feed later features, but those edges are hypotheses rather than causal proof. A circuit becomes credible only when ablation, patching, or steering changes behavior in the predicted direction [5].
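Activation patching, one of the tests named above, can be sketched on a tiny hand-built network. The weights and inputs are hypothetical and chosen so the answer is known in advance: both hidden units are equally active on the clean input, but only unit 0 reaches the output.

```python
import numpy as np

# Hypothetical toy weights: each hidden unit copies one input coordinate.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
w2 = np.array([2.0, 0.0])  # only hidden unit 0 reaches the output

def forward(x, patch=None):
    """Run the toy model; optionally overwrite one hidden unit with a stored value."""
    hidden = np.maximum(0.0, W1 @ x)
    if patch is not None:
        index, value = patch
        hidden[index] = value
    return float(w2 @ hidden), hidden

clean_input = np.array([1.0, 1.0])
corrupt_input = np.array([0.0, 0.0])
clean_out, clean_hidden = forward(clean_input)
corrupt_out, _ = forward(corrupt_input)

recovered_by_unit = []
for unit in range(2):
    # Run the corrupted input, but restore this one hidden unit to its clean value.
    patched_out, _ = forward(corrupt_input, patch=(unit, clean_hidden[unit]))
    recovered_by_unit.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
    print(f"patch hidden unit {unit}: recovers {recovered_by_unit[-1]:.0%} of the clean output")
```

Activation size alone can't separate the two units here, while patching does. That is the sense in which an attribution edge stays a hypothesis until an intervention backs it.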

Circuit papers have recovered recognizable behaviors such as:

  • Induction heads that copy patterns across context windows [6: In-context Learning and Induction Heads, https://arxiv.org/abs/2209.11895].
  • An indirect-object identification circuit in GPT-2 small [7: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, https://arxiv.org/abs/2211.00593].
  • A greater-than circuit for simple year-comparison prompts in GPT-2 small [8: How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, https://arxiv.org/abs/2305.00586].
  • Sparse feature circuits that express some behaviors in human-readable feature units rather than raw neuron or head indices [5].

The promise isn't that every model becomes transparent overnight. The promise is narrower: turn one behavior at a time from "the model produced this and we don't know why" into "these internal components appear to implement it, and interventions support that explanation."

Attribution graphs and circuit tracing

Later work built end-to-end graph hypotheses on top of learned feature decompositions. Anthropic's Circuit Tracing methods article replaces MLPs with cross-layer transcoders, which read from one layer and write to later layers. It then builds attribution graphs whose nodes are active features and whose edges approximate linear contributions to logits and intermediate features. The methods article demonstrates graphs on simple behaviors of a small replacement model; its companion study applies the technique to Claude 3.5 Haiku [9: On the Biology of a Large Language Model, https://transformer-circuits.pub/2025/attribution-graphs/biology.html].

Attribution graphs compress many candidate dependencies into one object, but the replacement model can diverge from the original model and important graph hypotheses still require perturbation tests. Feature-level circuits therefore explain scoped behaviors under measured approximations, rather than exposing a complete model algorithm [9].
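The idea of edges as approximate linear contributions, plus a leftover term for what the replacement misses, can be illustrated with a deliberately small calculation. This is a simplification under stated assumptions, not the cited method: it covers only a direct feature-to-logit effect, ignores attention and normalization, and every value below is hypothetical.

```python
import numpy as np

# Hypothetical toy values: three features with decoder directions in a 2-d stream.
feature_activations = np.array([0.9, 0.5, 0.0])
decoder_directions = np.array([[1.0, 0.0],
                               [0.0, 1.0],
                               [1.0, 1.0]])   # one row per feature
unembed_direction = np.array([2.0, -1.0])     # direction that reads out one logit
true_activation = np.array([1.0, 0.4])        # what the original model computed

# Each edge weight is a feature's activation times its linear effect on the logit.
edge_weights = feature_activations * (decoder_directions @ unembed_direction)
replacement_logit = float(edge_weights.sum())
true_logit = float(true_activation @ unembed_direction)
error_node = true_logit - replacement_logit   # the part the replacement model misses

for index, weight in enumerate(edge_weights):
    print(f"feature {index} -> logit edge: {weight:+.2f}")
print(f"replacement logit {replacement_logit:.2f} + error {error_node:.2f} = true logit {true_logit:.2f}")
```

A large error term relative to the edges signals that the graph describes the replacement more than the original model, which is why perturbation tests on the real model remain necessary.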

Figure: Feature-circuit intervention test where credential safety, urgency, and formatting features contribute to a refusal score, then target ablation lowers the score while the formatting control leaves it unchanged.
In the constructed score, credential safety contributes $1.08$, urgency contributes $0.08$, and formatting contributes $0$. Ablating credential safety changes the score by $-1.08$ while the formatting control changes it by $0$, which is the intervention pattern a causal circuit claim needs.

intervention-not-correlation.py
    def refusal_score(features):
        weights = {"credential_safety": 1.2, "urgency": 0.1, "formatting": 0.0}
        return sum(features[name] * weights[name] for name in weights)

    observed = {"credential_safety": 0.9, "urgency": 0.8, "formatting": 0.7}
    credential_ablated = {**observed, "credential_safety": 0.0}
    formatting_ablated = {**observed, "formatting": 0.0}

    baseline = refusal_score(observed)
    target_change = baseline - refusal_score(credential_ablated)
    control_change = baseline - refusal_score(formatting_ablated)
    print(f"target ablation change: {target_change:.2f}")
    print(f"negative-control change: {control_change:.2f}")

    assert target_change > 1.0
    assert control_change == 0.0

Intervention check

    target ablation change: 1.08
    negative-control change: 0.00

This constructed scoring function illustrates an intervention pattern: the named candidate and a negative control are ablated separately. Applying the pattern to a language model requires measured outputs on held-out prompts, not a label alone.

Activation steering with SAE features

Feature directions selected for interpretation experiments can also become intervention directions. After you have a trained SAE, you can perform activation steering at inference time without touching any weights:

  1. Run the prompt and cache the residual-stream activations at the layer where your target SAE lives.
  2. Encode those activations with the SAE encoder to obtain the sparse feature vector $z$.
  3. Add a multiple of the decoder vector of a chosen feature: $\text{new activation} = x + \alpha \cdot W_\text{dec}[:, f]$.
  4. Continue the forward pass with the modified activation.
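The four steps above can be sketched end to end with stand-in components. Everything here is hypothetical: the SAE weights are random rather than trained, so the chosen feature has no meaning, and `readout` is a single linear map standing in for the rest of the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 4, 12  # hypothetical sizes

# Stand-ins for a trained SAE; random weights, so the features mean nothing.
W_enc = rng.normal(scale=0.5, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm decoder columns

readout = rng.normal(size=d_model)  # hypothetical stand-in for the remaining layers

def steer(x, feature, alpha):
    """Step 3: add alpha times one decoder column to the cached activation."""
    return x + alpha * W_dec[:, feature]

x = rng.normal(size=d_model)                       # step 1: cached activation
z = np.maximum(0.0, W_enc @ x + b_enc)             # step 2: sparse feature vector
feature = int(np.argmax(z))                        # pick the most active feature to steer
x_steered = steer(x, feature, alpha=2.5)
shift = float(readout @ x_steered - readout @ x)   # step 4: effect on the downstream readout
print(f"steered feature {feature}, readout shift {shift:+.3f}")
```

Note that the encoder is used only to inspect and choose a feature. The edit itself touches the activation through the decoder column, and no weights change.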

If feature 17 is the "credential-safety refusal" feature, adding $+2.5 \times W_\text{dec}[:, 17]$ might make the model more likely to refuse a borderline credential request. Subtracting the same direction might make it more lenient. You don't trust that from one demo. You run a sweep, measure side effects, and check unrelated prompts.

Anthropic's Claude 3 Sonnet work showed that amplifying or suppressing learned features can change model behavior, including safety-relevant examples such as scam-email and sycophantic-praise features [4]. Related refusal-direction work shows the same broader lesson from a non-SAE direction: residual-stream edits can strongly affect refusal behavior, so steering needs careful safety handling [10: Refusal in Language Models Is Mediated by a Single Direction, https://arxiv.org/abs/2406.11717].

Figure: Exact activation-steering sweep for x = (0.2, 0.1) and candidate direction (0, 1), showing candidate projection rising from -0.9 to 2.1 while an unrelated control projection remains fixed at 0.2.
Adding $\alpha[0,1]$ moves the candidate projection from $-0.9$ to $2.1$ across the sweep. The orthogonal control $[1,0]$ stays exactly $0.2$, making the intended edit and one negative control explicit.

activation-steering-with-sae-features.py
    def add_scaled_direction(vector, direction, scale):
        return [value + scale * delta for value, delta in zip(vector, direction)]

    activation = [0.2, -0.1, 0.0, 0.4]
    credential_feature_direction = [0.0, 0.5, 0.5, 0.0]
    steered = add_scaled_direction(activation, credential_feature_direction, scale=2.0)

    assert steered == [0.2, 0.9, 1.0, 0.4]
    assert steered[1] > activation[1]
    assert steered[2] > activation[2]
steering-sweep.py
    def add_scaled(vector, direction, scale):
        return [value + scale * delta for value, delta in zip(vector, direction)]

    def project(vector, direction):
        return sum(value * delta for value, delta in zip(vector, direction))

    activation = [0.2, 0.1]
    candidate_direction = [0.0, 1.0]
    unrelated_direction = [1.0, 0.0]

    for alpha in [-1.0, 0.0, 1.0, 2.0]:
        steered = add_scaled(activation, candidate_direction, alpha)
        score = project(steered, candidate_direction)
        control = project(steered, unrelated_direction)
        print(f"alpha={alpha:+.1f}: candidate={score:+.1f}, control={control:+.1f}")

    assert project(add_scaled(activation, candidate_direction, 2.0), unrelated_direction) == 0.2

Steering sweep

    alpha=-1.0: candidate=-0.9, control=+0.2
    alpha=+0.0: candidate=+0.1, control=+0.2
    alpha=+1.0: candidate=+1.1, control=+0.2
    alpha=+2.0: candidate=+2.1, control=+0.2
Figure: Evidence ledger for a candidate SAE feature label.
Top examples suggest the credential-safety label, while target and control ablations support a causal role in the constructed score. The held-out-prompt row remains open, so the broader semantic claim isn't yet established.

The visual shows the review flow for one candidate credential-safety feature. Repeated top-activating examples suggest a label. Held-out prompts and causal interventions then test whether that label keeps predicting behavior. A plausible name isn't ground truth.
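One way to fill in the held-out-prompt step is to score how well the feature's firing predicts a human label on prompts that were not used to name it. The sketch below is illustrative only: the records, the firing threshold, and the choice of precision and recall as the summary are all assumptions.

```python
# Hypothetical held-out records: (feature activation, human label for "credential-safety").
held_out = [
    (2.1, True), (1.7, True), (0.0, False), (0.3, False),
    (1.9, False), (0.0, True), (2.4, True), (0.1, False),
]
threshold = 1.0  # assumed firing threshold for this sketch

fires = [activation > threshold for activation, _ in held_out]
labels = [label for _, label in held_out]

true_positives = sum(fired and label for fired, label in zip(fires, labels))
precision = true_positives / sum(fires)    # of the prompts where it fired, how many match the label
recall = true_positives / sum(labels)      # of the labeled prompts, how many made it fire
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision suggests the feature responds to more than its label claims; low recall suggests the label covers cases the feature misses. Either result argues for renaming or narrowing the label before building a circuit claim on it.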

A toy NumPy SAE you can run today

A direct way to internalize the mechanics is to implement a tiny SAE yourself. The lab trains an SAE on synthetic activations that deliberately contain superposition (24 ground-truth features packed into eight-dimensional activations). After a few hundred gradient steps the reconstruction loss drops while the latent activations remain sparse.

a-toy-numpy-sae-you-can-run-today.py
    import numpy as np

    # --- Synthetic data with known superposition ---
    rng = np.random.default_rng(42)
    d_model = 8
    n_true_features = 24  # more features than dimensions
    n_samples = 768

    # Ground-truth feature directions (unit vectors)
    true_features = rng.normal(size=(n_true_features, d_model))
    true_features /= np.linalg.norm(true_features, axis=1, keepdims=True)

    # Each sample activates only 2-4 features (sparse)
    activations = np.zeros((n_samples, d_model))
    for i in range(n_samples):
        active_idx = rng.choice(n_true_features, size=rng.integers(2, 5), replace=False)
        coeffs = rng.normal(loc=1.5, scale=0.4, size=len(active_idx))
        activations[i] = (coeffs[:, None] * true_features[active_idx]).sum(axis=0)

    # Add a little noise
    activations += 0.03 * rng.normal(size=activations.shape)

    # --- Tiny SAE implementation ---
    n_features = 64  # 8x expansion
    learning_rate = 0.03
    lambda_sparsity = 5.0  # L1 coefficient
    epochs = 300

    W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
    W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-12
    initial_W_dec = W_dec.copy()
    b_dec = np.zeros(d_model)

    def encode(x: np.ndarray):
        pre_activation = x @ W_enc.T + b_enc
        z = np.maximum(0.0, pre_activation)
        return pre_activation, z

    def reconstruct(x: np.ndarray):
        _, z = encode(x)
        return z, z @ W_dec.T + b_dec

    def reconstruction_loss(x: np.ndarray) -> float:
        _, x_hat = reconstruct(x)
        return float(np.mean((x - x_hat) ** 2))

    def train_sae(x: np.ndarray) -> None:
        global W_enc, b_enc, W_dec, b_dec
        for epoch in range(epochs):
            # Forward
            pre_activation, z = encode(x)
            x_hat = z @ W_dec.T + b_dec

            # Loss
            recon_loss = np.mean((x - x_hat) ** 2)
            sparsity_loss = np.mean(np.abs(z))
            total_loss = recon_loss + lambda_sparsity * sparsity_loss

            # Gradients (manual backprop for clarity)
            d_recon = 2 * (x_hat - x) / x.size
            dW_dec = d_recon.T @ z
            db_dec = d_recon.sum(axis=0)

            dz = d_recon @ W_dec
            dz += (lambda_sparsity / z.size) * np.sign(z)
            dz[pre_activation <= 0] = 0.0  # ReLU backward
            dW_enc = dz.T @ x
            db_enc = dz.sum(axis=0)

            # SGD step
            W_dec -= learning_rate * dW_dec
            b_dec -= learning_rate * db_dec
            W_enc -= learning_rate * dW_enc
            b_enc -= learning_rate * db_enc

            # Keep decoder columns normalized so scale cannot move from z into W_dec.
            W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-12

            if epoch % 100 == 0:
                print(f"Epoch {epoch:4d} | total={total_loss:.4f} | recon={recon_loss:.4f} | sparsity={sparsity_loss:.4f}")

    initial_recon = reconstruction_loss(activations)
    train_sae(activations)

Training log

    Epoch    0 | total=1.9426 | recon=1.4259 | sparsity=0.1033
    Epoch  100 | total=0.5301 | recon=0.2370 | sparsity=0.0586
    Epoch  200 | total=0.4394 | recon=0.1564 | sparsity=0.0566

Now inspect the learned sparse code after training. This follow-up cell keeps the heavy optimization loop separate from the summary checks you read.

a-toy-numpy-sae-summary.py
    final_recon = reconstruction_loss(activations)
    z, _ = reconstruct(activations)
    active_per_sample = (z > 0).sum(axis=1).mean()

    print(f"Training complete: recon {initial_recon:.3f} -> {final_recon:.3f}, active features {active_per_sample:.1f}")

Learned feature summary

    Training complete: recon 1.426 -> 0.128, active features 18.9

Run the first cell and watch reconstruction loss drop, then use the follow-up cell to confirm that average active features per sample stays far below the full 64-feature dictionary. Reconstruction and sparse use alone still say nothing about whether columns recover known generating directions. Because this dataset has ground truth, the next cell measures that question directly.

dictionary-alignment-diagnostic.py
    def direction_alignment(decoder: np.ndarray):
        columns = decoder.T.copy()
        columns /= np.linalg.norm(columns, axis=1, keepdims=True) + 1e-12
        similarities = np.abs(true_features @ columns.T)
        strongest_per_truth = similarities.max(axis=1)
        return float(np.median(strongest_per_truth)), int((strongest_per_truth >= 0.90).sum())

    initial_median, initial_high_matches = direction_alignment(initial_W_dec)
    trained_median, trained_high_matches = direction_alignment(W_dec)
    print(f"median max |cosine|: {initial_median:.3f} -> {trained_median:.3f}")
    print(f"true directions at >= 0.90: {initial_high_matches} -> {trained_high_matches}")

Direction alignment

    median max |cosine|: 0.798 -> 0.799
    true directions at >= 0.90: 1 -> 2

Here, reconstruction improves while median direction alignment barely moves (0.798 to 0.799; high matches move from one to two). This tiny trainer therefore demonstrates sparse reconstruction, not recovery of ground-truth features. On synthetic data, a stronger trainer could use rising alignment to support a recovery claim. Real LLM activation datasets lack known generating directions, so feature interpretation instead relies on held-out activation examples and causal tests.

Reading the toy SAE code line by line

  • The synthetic data generator deliberately activates only 2 to 4 of the 24 true features per sample and then sums their scaled direction vectors. Because all 24 directions share only $d_\text{model} = 8$ dimensions, this is textbook superposition.
  • The encoder W_enc is deliberately initialized small so that early activation magnitudes stay controlled.
  • The manual gradient for the sparsity term adds a sub-gradient of the L1 norm (sign(z)) to the latent gradient before the ReLU backward step. In a real PyTorch implementation you would let torch.autograd handle it.
  • Decoder columns are renormalized after every update. Without that constraint, a model could shrink $z$ and inflate decoder vectors, lowering the L1 penalty without changing the reconstruction.
  • The learning-rate schedule and $\lambda$ value were hand-tuned for this toy problem; on real LLM activations, optimizer settings are found by sweeping on a held-out set of prompts while monitoring both reconstruction MSE and the fraction of "dead" (never-activated) features.
  • After convergence the average $\ell_0$ norm of $z$ (number of non-zero entries) should stay far below the full 64-feature dictionary. It won't exactly match the 2 to 4 ground-truth features because this tiny L1 trainer is deliberately simple; the important result is that reconstruction improves without activating the entire dictionary.
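
The decoder-renormalization bullet above can be checked numerically. The sketch below is not the lab's training code: the shapes, the random decoder, and the three active indices are illustrative assumptions. It scales a latent code by 0.1 and the decoder by 10, then compares reconstruction and L1 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy shapes: 8-dimensional activations, 64 dictionary features.
d_model, n_features = 8, 64
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm columns

# A sparse non-negative latent code with three active features (arbitrary picks).
z = np.zeros(n_features)
z[[3, 17, 42]] = [1.2, 0.7, 0.4]

def l1_penalty(code: np.ndarray) -> float:
    return float(np.abs(code).sum())

# The loophole: shrink the code and inflate the decoder by the inverse factor.
z_shrunk = 0.1 * z
W_inflated = 10.0 * W_dec

same_reconstruction = np.allclose(W_dec @ z, W_inflated @ z_shrunk)
print("reconstruction unchanged:", same_reconstruction)
print("L1 penalty:", round(l1_penalty(z), 3), "->", round(l1_penalty(z_shrunk), 3))

# Renormalizing the columns returns the decoder to unit norm, so the
# penalty can no longer be reduced by rescaling alone.
W_renormalized = W_inflated / np.linalg.norm(W_inflated, axis=0, keepdims=True)
print("renormalized equals original:", np.allclose(W_renormalized, W_dec))
```

The reconstruction is identical in both cases while the penalty falls by a factor of ten, which is why the sparsity term only means something when decoder norms are fixed.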

The small SAE lab contains the core ingredients used in large SAE work: an activation dataset, an overcomplete dictionary, a reconstruction objective, a sparsity constraint, and diagnostics for feature quality.[4][11]

Practical training details and pitfalls

Published SAE experiments use several refinements:

| Refinement | Why it matters |
| --- | --- |
| TopK SAEs | Replace the soft L1 penalty with an explicit keep-k operation. Gao et al. report an improved reconstruction-sparsity frontier over their evaluated ReLU baselines and train a 16-million-latent SAE on GPT-4 activations.[11] |
| Gated SAEs | Separate feature selection from magnitude estimation. Rajamanoharan et al. report reduced shrinkage and comparable fidelity with roughly half as many firing features in their tested settings.[12] |
| JumpReLU SAEs | Replace ReLU with a learned threshold and train an L0 penalty with straight-through estimators. On Gemma 2 9B activations, Rajamanoharan et al. report equal or stronger reconstruction-sparsity performance than their Gated and TopK comparisons.[13] |
| Dead feature mitigation | Gao et al. use tied initialization and an auxiliary AuxK loss to reduce dead latents in their TopK recipe.[11] |
| Expansion factor and $\lambda$ tuning | Too small an expansion leaves features entangled; too large wastes capacity and can cause "feature splitting" (one concept split across several nearly duplicate features). |
| Layer and token selection | SAEs are trained on activations from many prompts and token positions. Measure whether the sampled data covers rare or safety-relevant contexts instead of assuming it does. |
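
The keep-k operation in the TopK row can be sketched in a few lines of NumPy. This is an illustration, not Gao et al.'s implementation: the function name `topk_activation`, the example pre-activations, and the ReLU applied after selection are assumptions made for this sketch.

```python
import numpy as np

def topk_activation(pre_activations: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations in each row and zero the rest."""
    # argpartition leaves the indices of the k largest entries in the last k slots.
    top_indices = np.argpartition(pre_activations, -k, axis=1)[:, -k:]
    latents = np.zeros_like(pre_activations)
    rows = np.arange(pre_activations.shape[0])[:, None]
    latents[rows, top_indices] = pre_activations[rows, top_indices]
    # Assumed ReLU after selection: a selected negative value becomes zero,
    # so a row can end up with fewer than k positive latents.
    return np.maximum(latents, 0.0)

pre = np.array([
    [0.9, -0.2, 0.4, 0.1, 1.5, 0.0],
    [-0.3, -0.1, -0.5, 0.2, -0.4, -0.6],
])
z = topk_activation(pre, k=2)
print(z)
print("positive latents per row:", (z > 0).sum(axis=1))
```

In the second row only one pre-activation is positive, so the selected pair yields a single positive latent even though k is 2.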

Common beginner mistakes are easier to debug if you name the symptom:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Every latent seems to mean several things | Expansion factor too small, sparsity too weak, or the training data mixes unrelated domains without enough coverage | Increase dictionary width, sweep sparsity, and inspect top-activating examples per domain. |
| Many features never activate | Sparsity pressure is too strong or initialization left features behind early | Lower $\lambda$ for L1 training or raise $k$ for TopK, then consider dead-feature resampling or an auxiliary revival loss. |
| Reconstruction looks good but labels look messy | Dense latent use can reconstruct activations without creating human-readable features | Track $\ell_0$, feature purity, top examples, and causal effects alongside mean squared error. |
| A steering demo works once and then fails elsewhere | Feature label was correlational or the intervention changed unrelated behavior | Run held-out sweeps, ablations, and negative-control prompts before treating the feature as causal. |
| One concept splits into many near-duplicates | Dictionary is too wide or sparsity target makes narrow variants cheaper | Compare nearest-neighbor decoder directions and merge labels only after checking causal behavior. |

dead-latent-monitor.py

```python
latent_batches = [
    [1.2, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]

rates = [
    sum(row[index] > 0 for row in latent_batches) / len(latent_batches)
    for index in range(len(latent_batches[0]))
]
rare_or_dead = [index for index, rate in enumerate(rates) if rate < 0.30]
print("firing rates:", [round(rate, 2) for rate in rates])
print("inspect latents:", rare_or_dead)

assert rare_or_dead == [1, 2, 3]
```

Dead-latent monitor

```text
firing rates: [0.5, 0.25, 0.0, 0.25]
inspect latents: [1, 2, 3]
```
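
The feature-splitting row in the symptom table suggests comparing nearest-neighbor decoder directions. A minimal sketch of that check follows. The 3-by-4 decoder is made up for illustration, and the 0.95 cutoff is an assumed value, not one taken from the cited papers.

```python
import numpy as np

# Hypothetical decoder: 4 features in a 3-dimensional activation space.
# Columns 0 and 2 are deliberately near-duplicates.
W_dec = np.array([
    [1.0, 0.0, 0.98, 0.0],
    [0.0, 1.0, 0.20, 0.0],
    [0.0, 0.0, 0.00, 1.0],
])

columns = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)
cosine = columns.T @ columns
np.fill_diagonal(cosine, -np.inf)  # ignore each feature's similarity to itself

nearest = cosine.argmax(axis=1)
nearest_cosine = cosine.max(axis=1)

threshold = 0.95  # assumed cutoff for "near-duplicate"; tune per dictionary
for index, (neighbor, value) in enumerate(zip(nearest, nearest_cosine)):
    flag = "  <- possible split" if value >= threshold else ""
    print(f"feature {index}: nearest {neighbor}, cosine {value:.3f}{flag}")
```

A high cosine only flags a candidate pair. Whether the two features deserve one label still depends on the causal checks named in the table.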

L1 Penalty vs TopK SAEs - a practical comparison

| Aspect | L1 Penalty SAE | TopK SAE |
| --- | --- | --- |
| Sparsity control | Indirect via $\lambda$; soft thresholding | Selects the $k$ largest activations; positive nonzero count can be lower |
| Training signal | L1 pressure applies shrinkage to active magnitudes | Selected latents receive reconstruction gradients; Gao et al. add AuxK for dead latents |
| Reconstruction comparison | A useful simple baseline | Gao et al. report a stronger frontier than evaluated ReLU baselines |
| Dead latents | Must be monitored | Gao et al. report few dead latents with their mitigation recipe |
| Research usage | Straightforward for small experiments | Used in the GPT-4-activation scaling experiment |
| Hyperparameter sensitivity | $\lambda$ needs careful sweeps | $k$ is intuitive (e.g., keep 32 out of 4096) |

These papers evaluate different ways to control sparsity and shrinkage: TopK fixes active count, Gated separates selection from magnitude, and JumpReLU trains a learned threshold with an L0 penalty.[11][12][13] Results are architecture- and dataset-specific, so variant choice should be followed by reconstruction, sparsity, feature-quality, and intervention checks.
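
To make the JumpReLU threshold concrete, the sketch below compares a plain ReLU with a thresholded activation on the same pre-activations. It covers only the forward pass. The threshold values are fixed by hand here as an assumption, whereas the paper learns them per feature, and the straight-through training step is omitted.

```python
import numpy as np

def relu(pre: np.ndarray) -> np.ndarray:
    return np.maximum(pre, 0.0)

def jump_relu(pre: np.ndarray, threshold: np.ndarray) -> np.ndarray:
    """Pass a pre-activation through unchanged only when it exceeds its threshold."""
    return np.where(pre > threshold, pre, 0.0)

pre = np.array([0.05, 0.30, -0.20, 0.80])
threshold = np.array([0.10, 0.10, 0.10, 0.50])  # hand-picked; learned in practice

relu_out = relu(pre)
jump_out = jump_relu(pre, threshold)
print("ReLU:    ", relu_out)
print("JumpReLU:", jump_out)
print("active count (ReLU, JumpReLU):", int((relu_out > 0).sum()), int((jump_out > 0).sum()))
```

The small 0.05 activation survives ReLU but not the threshold, and the surviving values are not shrunk, which is the behavior the L0 penalty is meant to shape.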

Figure: Sparse-autoencoder training tradeoff showing reconstruction MSE falling, mean latent magnitude staying small, and the same candidate code switching winner as the sparsity weight lambda increases.
The toy logs show reconstruction MSE falling while mean latent magnitude stays small. Changing $\lambda$ can reverse which candidate code wins, but feature meaning still needs examples, ablations, and held-out tests.

Applications to Safety and Governance

Templeton et al. inspected Claude 3 Sonnet features related to safety-relevant behaviors, including sycophantic praise and scam-related text, and experimented with feature steering.[4] These results establish useful research probes, not a general safety monitor. A feature that activates on suspicious examples gives researchers an internal signal to test with interventions and behavioral evaluations.

In a code-assistant deployment, the research question becomes more precise. Instead of asking "why did the model refuse this password-reset script?" in the abstract, you can ask which candidate features activated, whether a credential-safety direction was causally involved, and whether changing that direction shifts decisions on a held-out security-prompt set. That remains research evidence rather than a complete compliance story.

Limits and Caution

SAEs are a genuine advance, not a solved problem. Keep four limits in mind so you don't oversell them in an interview or a design review.

  • Reconstruction error matters. When you splice an SAE reconstruction into the model to evaluate fidelity, $\hat{x}$ replaces the original activation. Cunningham et al. report a notable perplexity increase when substituting an SAE reconstruction in their Pythia-70M experiment, and feature-circuit evaluations include error terms for unexplained computation.[2][5]
  • Coverage is incomplete. The 34-million-feature SAE for Claude 3 Sonnet didn't capture everything the model knows; for example, only about 60% of London boroughs had a dedicated feature, and rare concepts were far less likely to get one. Exhaustive coverage would likely need orders of magnitude more features.[4]
  • Feature splitting. One human concept can fracture into several near-duplicate features as dictionary width changes, so a feature index shouldn't be treated as a canonical human concept.
  • Labels are hypotheses, not ground truth. A name like "credential safety" comes from top-activating examples and survives only if ablation and steering move behavior in the predicted direction. Reconstruction quality alone never proves interpretability.
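
The ablation and steering test in the last bullet can be sketched on a toy activation. Everything named here is hypothetical: the 4-dimensional activation, the unit-norm decoder direction, the feature activation of 0.5, and the linear `toy_refusal_score` readout, which stands in for running the rest of a real model.

```python
import numpy as np

def steer(activation, direction, alpha):
    """Add alpha times a decoder direction to the activation."""
    return activation + alpha * direction

def ablate_feature(activation, direction, feature_activation):
    """Remove one feature's decoded contribution from the activation."""
    return activation - feature_activation * direction

def toy_refusal_score(activation):
    # Stand-in readout; a real experiment would measure model behavior.
    readout = np.array([0.0, 1.0, 1.0, 0.0])
    return float(readout @ activation)

x = np.array([0.2, -0.1, 0.0, 0.4])
candidate = np.array([0.0, 0.6, 0.8, 0.0])  # hypothetical feature direction
control = np.array([1.0, 0.0, 0.0, 0.0])    # negative-control direction

baseline = toy_refusal_score(x)
steered = toy_refusal_score(steer(x, candidate, alpha=2.0))
ablated = toy_refusal_score(ablate_feature(x, candidate, feature_activation=0.5))
controlled = toy_refusal_score(steer(x, control, alpha=2.0))

print(f"baseline {baseline:.2f}, steered {steered:.2f}, ablated {ablated:.2f}")
print(f"negative control {controlled:.2f}")
```

In this toy setup the candidate direction moves the score in both directions while the control leaves it unchanged. On a real model the same comparison would need to be repeated across held-out prompts before the label is trusted.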

What to check before moving on

  • Explain polysemanticity, superposition, and why raw neurons are often weak interpretability units.
  • Write the SAE reconstruction plus sparsity objective and identify encoder, latent code, and decoder shapes.
  • Choose a transformer hook point and explain what a residual-stream SAE or MLP-output SAE can reveal.
  • Implement a minimal SAE forward pass, loss, and gradient update on toy data.
  • Explain how feature circuits connect interpretable features into candidate algorithms, and how attribution graphs extend this with cross-layer transcoders.
  • Describe activation steering with a decoder direction and the controls needed before trusting it.
  • Diagnose dead features, feature splitting, weak sparsity, and reconstruction-only evaluation.
  • Connect SAE research to safety evaluations without claiming the whole model is solved.

Practice checkpoints

Common failures and fixes

  • Symptom: A raw neuron seems to mean three unrelated things at once. Cause: You're reading a polysemantic axis rather than a reliable unit of explanation. Fix: Inspect SAE candidate features and verify them with interventions, not neuron anecdotes alone.
  • Symptom: Reconstruction loss is low, but feature labels still look messy. Cause: The SAE may be reconstructing activations with dense mixtures instead of sparse, stable features. Fix: Track active-latent count, dead-feature rate, top examples, and causal effects together.
  • Symptom: A feature label sounds convincing, but ablation changes nothing. Cause: The label came from correlational examples instead of causal evidence. Fix: Treat names as hypotheses until patching, ablation, or steering moves behavior in the predicted direction.
  • Symptom: Many latent slots never fire. Cause: Sparsity pressure is too strong, initialization is unlucky, or training data coverage is weak. Fix: Sweep sparsity, use dead-feature revival, and inspect domain coverage.
  • Symptom: Steering one feature helps on a demo prompt but breaks unrelated prompts. Cause: The direction is entangled or overscaled. Fix: Run held-out sweeps and negative controls before trusting the intervention.

What to remember

  • Under the superposition hypothesis, polysemantic neurons arise because features share overlapping directions; you can't interpret a model by looking only at raw neuron activations or raw weight matrices.
  • Sparse autoencoders learn an overcomplete sparse basis; published evaluations report more interpretable features than selected native-representation baselines in evaluated models.
  • The training objective (reconstruction plus sparsity) is simple, but the results depend heavily on expansion factor, sparsity target, data coverage, and dead-feature handling.
  • Feature circuits let us test how candidate sparse features compose across layers into scoped algorithm hypotheses; attribution-graph work extends this to per-prompt graph hypotheses built on cross-layer transcoders.
  • SAE features can be read and written at inference time, enabling activation steering experiments that need careful controls before deployment.
  • A minimal NumPy implementation on toy data with known ground-truth features lets you measure reconstruction, sparsity, and dictionary alignment before attempting larger-model experiments.
  • The same toolkit can support auditing, red-teaming, and feature-level evidence, but it doesn't replace behavioral evals or policy review.
  • Progress in this area depends on matching feature claims to evidence: toy ground truth where available, open-model interventions, and carefully scoped published results.

Practice projects

If you want to move from the toy example to larger-model work, the natural next projects are:

  1. Train a small SAE on the residual stream of a 1B to 8B open model using SAELens, TransformerLens, or nnsight to capture activations. Use a few hundred thousand tokens from code, security, documentation, or instruction-following data.
  2. Implement a TopK SAE trainer in PyTorch and compare reconstruction and interpretability metrics against your L1 version.
  3. Build a simple feature browser: for each learned decoder direction, collect the top-20 activating tokens across a large corpus and let a human (or an LLM judge) label the emerging concept.
  4. Reproduce a known circuit (e.g., the greater-than circuit) using SAE features instead of raw attention heads, then attempt a causal intervention by zero-ablating the responsible feature.
  5. Build an attribution graph for a two-hop factual prompt on a small open model using an open-source circuit-tracing library, then validate one intermediate feature by suppressing it.[9]

Each project exercises these concepts and produces inspectable experimental evidence.

The stable core is superposition, dictionary learning, reconstruction plus sparsity, and causal interventions on feature directions. Once those four ideas are clear, newer SAE variants (TopK, Gated, JumpReLU) and the transcoder-plus-attribution-graph stack become engineering choices rather than a new conceptual foundation.

Mastery Check

1. In a toy residual space, a credential feature is [1.0, 0.0] and an urgency feature is [0.8, 0.6]. A raw neuron reads the first coordinate. If both features are active as 1.5 * credential + urgency, its reading is 2.3 rather than 1.5. What does this illustrate?
2. An activation vector has dimension d_model = 512. An SAE uses 4096 latent features, an encoder W_enc, a sparse latent z, and a decoder W_dec. Which description matches the role of the overcomplete dictionary?
3. During L1 SAE training, suppose each latent code is replaced by z' = 0.1z and each decoder column by W_dec' = 10W_dec, with the reconstruction otherwise unchanged. What does normalizing decoder columns after each update prevent?
4. You train one SAE on the residual stream immediately after an MLP block and another on the MLP output before it is added to the residual stream. Which interpretation is valid?
5. A circuit-tracing tool replaces MLPs with cross-layer transcoders and draws a graph for one prompt. The graph has active feature nodes and edges estimating contributions to later features and logits. What should you conclude before using the graph as an explanation?
6. A candidate credential_safety feature co-activates with urgency and formatting features on refusal examples. On held-out prompts, ablating credential_safety drops a refusal score by 1.08, ablating formatting changes it by 0.00, and unrelated prompts are mostly unchanged. Which conclusion is supported by this intervention pattern?
7. At a layer where an SAE was trained, the cached activation is [0.2, -0.1, 0.0, 0.4]. Feature 17's decoder direction is [0.0, 0.5, 0.5, 0.0]. If you steer with alpha = 2.0 using x + alpha W_dec[:,17], what activation is passed onward?
8. A toy SAE trained on 8-dimensional activations with a 64-feature dictionary improves reconstruction from 1.426 to 0.128 and uses about 18.9 active features per sample. A ground-truth alignment check changes median max |cosine| only from 0.798 to 0.799. What claim is supported?
9. During SAE training, many latent slots have firing rates below 0.30, including some that never activate. The model also uses an L1 penalty with a high lambda. Which response matches the training guidance?
10. A team finds an SAE feature that fires on many credential-safety examples and wants to use that feature as an automatic policy decision rule. Which claim is justified?

Next Step
Continue to Decoding Strategies: Greedy to Nucleus

Layer normalization and SAE feature experiments both concern the residual stream that eventually produces logits. Decoding then determines how those logits become generated tokens.

References

[1] Bricken, T., Templeton, A., et al. (Anthropic). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023.

[2] Cunningham, H., Ewart, A., Riggs, L., Huben, R., Sharkey, L. Sparse Autoencoders Find Highly Interpretable Features in Language Models. 2023. https://arxiv.org/abs/2309.08600

[3] Elhage, N., Hume, T., Olsson, C., et al. (Anthropic). Toy Models of Superposition. Transformer Circuits Thread, 2022.

[4] Templeton, A., Conerly, T., et al. (Anthropic). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/

[5] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., Mueller, A. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. 2024 (ICLR 2025). https://arxiv.org/abs/2403.19647

[6] Olsson, C., et al. In-context Learning and Induction Heads. 2022.

[7] Wang, K., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. 2022.

[8] Hanna, M., Liu, O., Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. NeurIPS 2023.

[9] Lindsey, J., Gurnee, W., Ameisen, E., et al. On the Biology of a Large Language Model. 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

[10] Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N. Refusal in Language Models Is Mediated by a Single Direction. 2024.

[11] Gao, L., et al. Scaling and Evaluating Sparse Autoencoders. 2024. https://cdn.openai.com/papers/sparse-autoencoders.pdf

[12] Rajamanoharan, S., Conmy, A., Smith, L., et al. Improving Dictionary Learning with Gated Sparse Autoencoders. 2024. https://arxiv.org/abs/2404.16014

[13] Rajamanoharan, S., Lieberum, T., Sonnerat, N., et al. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. 2024. https://arxiv.org/abs/2407.14435