Learn how sparse autoencoders decompose transformer activations into candidate interpretable features, support circuit tracing, and enable controlled activation-steering experiments.
Layer normalization kept the residual stream numerically stable. Mechanistic interpretability asks the next question: what do the directions inside that stream mean?
A code-assistant LLM keeps refusing legitimate password-reset scripts because they mention credentials. When you inspect the attention heads and MLP neurons that fire on those prompts, dozens of neurons light up in confusing combinations. One neuron activates on "rotate the API key," "credential theft attempt," and "urgent tone." Another neuron fires for "secret manager docs" and also for "shell command syntax." You can't point to a clean "credential-safety feature."
This is the core problem of polysemanticity (one neuron responding to multiple unrelated concepts at once) inside large language models. Raw neurons often fail to map to single human concepts. One influential hypothesis is superposition: useful features can be represented in overlapping directions when clean coordinate axes are scarce. Toy models demonstrate this mechanism; for real language models, it remains a motivating hypothesis and an empirical question.
Mechanistic interpretability researchers train sparse autoencoders (SAEs), small models that reconstruct model activations through sparse latent codes, to look for more inspectable feature directions.[1][2] In evaluated models, Bricken et al. and Cunningham et al. found SAE features that were more interpretable than raw-neuron or alternative baselines. A candidate feature might fire mostly on credential-reset language, exploit-request language, or urgent instructions. Its label remains provisional until examples and interventions support it.
You'll build that path in layers: how SAEs work, what they reveal beyond raw neurons, how to train a minimal NumPy version, and how feature-level interventions support modern safety research.
In a transformer, residual-stream vectors and MLP activations combine many signals. If a model represents more useful features than available orthogonal axes, non-orthogonal directions offer an efficient representation but introduce interference. This is the superposition hypothesis for why the same neuron can participate in unrelated computations.
The 2022 "Toy Models of Superposition" paper demonstrated this mechanism with tiny ReLU networks trained on synthetic sparse data. When the data contained more ground-truth features than hidden units, the networks learned overlapping representations. The experiment doesn't establish the full explanation for an LLM, but it gives researchers a concrete model to test against LLM activations.[3]
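The packing idea can be sketched in a few lines. The example below is a hypothetical construction, not the paper's training setup: it places three made-up features in two dimensions, 120 degrees apart, and reads each feature back with a dot product followed by a ReLU. The angles, the `readouts` helper, and the readout rule are all assumptions chosen for illustration.

```python
import math

# Three hypothetical features packed into two dimensions, 120 degrees apart.
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
directions = [(math.cos(a), math.sin(a)) for a in angles]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def readouts(active):
    """Sum the active feature directions, then read every feature back with a ReLU."""
    activation = [sum(directions[i][axis] for i in active) for axis in range(2)]
    return [max(0.0, dot(activation, d)) for d in directions]

single = readouts([0])   # only feature 0 is active
pair = readouts([0, 1])  # features 0 and 1 are active together

print("single:", [round(v, 2) for v in single])
print("pair:  ", [round(v, 2) for v in pair])
```

When a single feature is active, the ReLU clips the negative overlap with the other two directions, so the readout is clean. When two features fire together, each reads back at roughly half strength. That is the interference cost of packing more features than axes, and it stays small only while co-activation is rare.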
Direct inspection of weights or raw neuron activations therefore gives you an entangled view. A raw neuron might mix "credential reset + exploit request + urgency." You can't tell which part of the mixture the model is using at any moment without a cleaner representation and an intervention test.
The calculation below measures that overlap with cosine similarity and then forms the same mixture shown in the figure.
```python
import math

credential = [1.0, 0.0]
urgency = [0.8, 0.6]

def cosine(left, right):
    dot = sum(a * b for a, b in zip(left, right))
    norm = math.sqrt(sum(a * a for a in left) * sum(b * b for b in right))
    return dot / norm

mixed_activation = [1.5 * credential[0] + urgency[0], 1.5 * credential[1] + urgency[1]]
print(f"credential/urgency cosine: {cosine(credential, urgency):.2f}")
print(f"axis A reading after both fire: {mixed_activation[0]:.2f}")

assert cosine(credential, urgency) > 0
assert mixed_activation[0] > 1.5
```

```text
credential/urgency cosine: 0.80
axis A reading after both fire: 2.30
```

A sparse autoencoder attacks the entanglement problem with classic dictionary learning. You model activation vectors as approximately sparse linear combinations of directions from a much larger learned dictionary. Some learned directions may correspond to useful model features. That correspondence is a hypothesis to evaluate, not an assumption built into the math.
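An SAE consists of two learned matrices plus a non-linearity: an encoder that maps an activation $x$ to a wider latent code, $z = \mathrm{ReLU}(W_{\text{enc}}\, x + b_{\text{enc}})$, and a decoder that maps the code back to a reconstruction, $\hat{x} = W_{\text{dec}}\, z + b_{\text{dec}}$.

Training minimizes a loss that has two competing terms:

$$\mathcal{L} = \mathrm{MSE}(x, \hat{x}) + \lambda \lVert z \rVert_1$$

(or a TopK variant that keeps the largest latent activations and zeros the rest). The reconstruction term rewards preserving information from the original activation. The sparsity term, weighted by $\lambda$, pushes the latent code toward few active coordinates rather than a dense remapping. Learned columns of $W_{\text{dec}}$ are candidate feature directions; evaluating them requires top-activating examples and intervention tests in addition to reconstruction metrics.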
```python
def mse(target, reconstructed):
    return sum((a - b) ** 2 for a, b in zip(target, reconstructed)) / len(target)

def objective(target, reconstructed, code, sparsity_weight):
    return mse(target, reconstructed) + sparsity_weight * sum(abs(value) for value in code)

target = [1.0, 0.0]
sparse_code = [0.75, 0.0]
sparse_reconstruction = [0.75, 0.0]
dense_code = [0.75, 0.50]
dense_reconstruction = [1.0, 0.0]

for weight in [0.01, 0.10]:
    sparse_loss = objective(target, sparse_reconstruction, sparse_code, weight)
    dense_loss = objective(target, dense_reconstruction, dense_code, weight)
    chosen = "sparse" if sparse_loss < dense_loss else "dense"
    print(f"lambda={weight:.2f}: sparse={sparse_loss:.4f}, dense={dense_loss:.4f} -> {chosen}")

assert objective(target, dense_reconstruction, dense_code, 0.01) < objective(target, sparse_reconstruction, sparse_code, 0.01)
assert objective(target, sparse_reconstruction, sparse_code, 0.10) < objective(target, dense_reconstruction, dense_code, 0.10)
```

```text
lambda=0.01: sparse=0.0387, dense=0.0125 -> dense
lambda=0.10: sparse=0.1063, dense=0.1250 -> sparse
```

The original "Towards Monosemanticity" result (2023) made this concrete: an SAE trained on a one-layer transformer decomposed a 512-neuron MLP layer into more than 4,000 features, with inspected examples for patterns such as DNA sequences, legal language, HTTP requests, and Hebrew text. A year later, "Scaling Monosemanticity" (2024) applied the approach to a middle-layer residual stream of Claude 3 Sonnet, extracting up to 34 million features and reporting abstract and multilingual examples.[1][4]
```python
def top_k_sparse(values, k):
    ranked = sorted(range(len(values)), key=lambda index: values[index], reverse=True)
    keep = set(ranked[:k])
    return [value if index in keep else 0.0 for index, value in enumerate(values)]

latent = [0.1, 2.4, 0.3, 1.7, 0.05]
sparse = top_k_sparse(latent, k=2)

assert sparse == [0.0, 2.4, 0.0, 1.7, 0.0]
assert sum(value != 0.0 for value in sparse) == 2
```

The SAE learns an overcomplete sparse dictionary. The input activation $x$ is projected into a higher-dimensional sparse code $z$, and the decoder reconstructs $\hat{x}$ from the active coordinates. Reconstruction and sparsity are necessary diagnostics; whether a direction is interpretable still has to be measured.
Researchers attach SAEs at several natural "hook points," including the residual stream, MLP outputs, and attention outputs.
Cunningham et al. presented their method for residual-stream, MLP-output, or attention-output activations. They mainly studied residual streams and reported mixed MLP-SAE results, including dead features. Marks et al. evaluated feature circuits with SAEs at several sublayers in Pythia and Gemma models. Hook point changes the question you can ask, but a residual-stream or MLP-output label still needs feature-quality and causal validation.[2][5]
Once you have candidate interpretable features at multiple layers, you can study feature circuits: sparse subgraphs that connect features across layers. Attribution methods can suggest which earlier features feed later features, but those edges are hypotheses rather than causal proof. A circuit becomes credible only when ablation, patching, or steering changes behavior in the predicted direction.[5]
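To see why an attribution edge is only a hypothesis, it helps to compare it with an actual ablation on a toy nonlinearity. The sketch below is a constructed example: `later_feature`, its weights, and its bias are hypothetical and not taken from any cited paper. It computes a gradient-times-activation attribution for one upstream feature and compares it with the change measured when that feature is zeroed.

```python
def later_feature(credential, urgency):
    # Hypothetical later-layer feature: a ReLU over a weighted sum with a negative bias.
    return max(0.0, 1.0 * credential + 1.0 * urgency - 1.0)

credential, urgency = 0.8, 0.8
baseline = later_feature(credential, urgency)

# Attribution: local gradient (estimated by finite difference) times the upstream activation.
eps = 1e-6
gradient = (later_feature(credential + eps, urgency) - baseline) / eps
attribution = gradient * credential

# Ablation: zero the upstream feature and measure the change that actually results.
ablation_effect = baseline - later_feature(0.0, urgency)

print(f"attribution estimate: {attribution:.2f}")
print(f"measured ablation effect: {ablation_effect:.2f}")
```

The linear attribution overstates the measured effect here because the ReLU saturates at zero once the upstream feature is removed. The gap is the reason a proposed edge should be followed by an intervention before it is treated as causal.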
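Circuit papers have recovered recognizable behaviors such as induction heads that support in-context learning,[6] indirect object identification in GPT-2 small,[7] and greater-than comparison in a pre-trained language model.[8]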
The promise isn't that every model becomes transparent overnight. The promise is narrower: turn one behavior at a time from "the model produced this and we don't know why" into "these internal components appear to implement it, and interventions support that explanation."
Later work built end-to-end graph hypotheses on top of learned feature decompositions. Anthropic's Circuit Tracing methods article replaces MLPs with cross-layer transcoders, which read from one layer and write to later layers. It then builds attribution graphs whose nodes are active features and whose edges approximate linear contributions to logits and intermediate features. The methods article demonstrates graphs on simple behaviors of a small replacement model; its companion study applies the technique to Claude 3.5 Haiku.[9]
Attribution graphs compress many candidate dependencies into one object, but the replacement model can diverge from the original model and important graph hypotheses still require perturbation tests. Feature-level circuits therefore explain scoped behaviors under measured approximations, rather than exposing a complete model algorithm.[9]
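One simple way to quantify how far a replacement diverges from the original is the fraction of variance unexplained (FVU): the squared reconstruction error divided by the variance of the original activations. The sketch below uses made-up two-dimensional activations; the function name and data are hypothetical, and published work may report different or additional fidelity metrics.

```python
def fraction_variance_unexplained(originals, replacements):
    """Squared error of the replacement divided by total variance of the originals."""
    dim = len(originals[0])
    mean = [sum(row[i] for row in originals) / len(originals) for i in range(dim)]
    residual = sum(
        (a - b) ** 2
        for row, rep in zip(originals, replacements)
        for a, b in zip(row, rep)
    )
    total = sum((a - m) ** 2 for row in originals for a, m in zip(row, mean))
    return residual / total

# Hypothetical activations from the original model and from a replacement model.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
replacement = [[0.9, 0.1], [0.0, 0.8], [1.0, 1.0], [0.1, 0.0]]

fvu = fraction_variance_unexplained(original, replacement)
print(f"fraction of variance unexplained: {fvu:.3f}")
```

An FVU of 0 means the replacement reproduces the activations exactly, and a value near 1 means it does no better than predicting the mean. A low value is a precondition for trusting a graph built on the replacement; it does not by itself validate the graph.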
```python
def refusal_score(features):
    weights = {"credential_safety": 1.2, "urgency": 0.1, "formatting": 0.0}
    return sum(features[name] * weights[name] for name in weights)

observed = {"credential_safety": 0.9, "urgency": 0.8, "formatting": 0.7}
credential_ablated = {**observed, "credential_safety": 0.0}
formatting_ablated = {**observed, "formatting": 0.0}

baseline = refusal_score(observed)
target_change = baseline - refusal_score(credential_ablated)
control_change = baseline - refusal_score(formatting_ablated)
print(f"target ablation change: {target_change:.2f}")
print(f"negative-control change: {control_change:.2f}")

assert target_change > 1.0
assert control_change == 0.0
```

```text
target ablation change: 1.08
negative-control change: 0.00
```

This constructed scoring function illustrates an intervention pattern: the named candidate and a negative control are ablated separately. Applying the pattern to a language model requires measured outputs on held-out prompts, not a label alone.
Feature directions selected for interpretation experiments can also become intervention directions. After you have a trained SAE, you can perform activation steering at inference time without touching any weights:

$$x' = x + \alpha\, d_i$$

Here $x$ is the original activation, $d_i$ is the decoder direction for feature $i$, and $\alpha$ sets the strength and sign of the intervention.

If feature 17 is the "credential-safety refusal" feature, adding $\alpha\, d_{17}$ with $\alpha > 0$ might make the model more likely to refuse a borderline credential request. Subtracting the same direction might make it more lenient. You don't trust that from one demo. You run a sweep, measure side effects, and check unrelated prompts.
Anthropic's Claude 3 Sonnet work showed that amplifying or suppressing learned features can change model behavior, including safety-relevant examples such as scam-email and sycophantic-praise features.[4] Related refusal-direction work shows the same broader lesson from a non-SAE direction: residual-stream edits can strongly affect refusal behavior, so steering needs careful safety handling.[10]
```python
def add_scaled_direction(vector, direction, scale):
    return [value + scale * delta for value, delta in zip(vector, direction)]

activation = [0.2, -0.1, 0.0, 0.4]
credential_feature_direction = [0.0, 0.5, 0.5, 0.0]
steered = add_scaled_direction(activation, credential_feature_direction, scale=2.0)

assert steered == [0.2, 0.9, 1.0, 0.4]
assert steered[1] > activation[1]
assert steered[2] > activation[2]
```

```python
def add_scaled(vector, direction, scale):
    return [value + scale * delta for value, delta in zip(vector, direction)]

def project(vector, direction):
    return sum(value * delta for value, delta in zip(vector, direction))

activation = [0.2, 0.1]
candidate_direction = [0.0, 1.0]
unrelated_direction = [1.0, 0.0]

for alpha in [-1.0, 0.0, 1.0, 2.0]:
    steered = add_scaled(activation, candidate_direction, alpha)
    score = project(steered, candidate_direction)
    control = project(steered, unrelated_direction)
    print(f"alpha={alpha:+.1f}: candidate={score:+.1f}, control={control:+.1f}")

assert project(add_scaled(activation, candidate_direction, 2.0), unrelated_direction) == 0.2
```

```text
alpha=-1.0: candidate=-0.9, control=+0.2
alpha=+0.0: candidate=+0.1, control=+0.2
alpha=+1.0: candidate=+1.1, control=+0.2
alpha=+2.0: candidate=+2.1, control=+0.2
```
The visual shows the review flow for one candidate credential-safety feature. Repeated top-activating examples suggest a label. Held-out prompts and causal interventions then test whether that label keeps predicting behavior. A plausible name isn't ground truth.
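The held-out step of that review flow can be sketched as a precision and recall check. Everything in the example is hypothetical: the activation values, the human judgments, and the firing threshold are invented to show the bookkeeping, not measured from any model.

```python
# Hypothetical held-out examples: (feature activation, human judgment that the
# text really is credential-reset language).
held_out = [
    (2.1, True),
    (1.7, True),
    (0.0, False),
    (1.4, False),  # the feature fires, but the text is off-label
    (0.2, True),   # on-label text that the feature misses
    (0.0, False),
]
threshold = 1.0  # assumed firing threshold for this sketch

fired_labels = [label for activation, label in held_out if activation > threshold]
positives = sum(label for _, label in held_out)

precision = sum(fired_labels) / len(fired_labels)  # of the firings, how many match the label
recall = sum(fired_labels) / positives             # of the labeled texts, how many fired

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Low precision suggests the label is too narrow for what the feature tracks, and low recall suggests the concept is spread across other features. Either result is a reason to revise the label before running steering experiments on it.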
A direct way to internalize the mechanics is to implement a tiny SAE yourself. The lab trains an SAE on synthetic activations that deliberately contain superposition (24 ground-truth features packed into eight-dimensional activations). After a few hundred gradient steps the reconstruction loss drops while the latent activations remain sparse.
```python
import numpy as np

# --- Synthetic data with known superposition ---
rng = np.random.default_rng(42)
d_model = 8
n_true_features = 24  # more features than dimensions
n_samples = 768

# Ground-truth feature directions (unit vectors)
true_features = rng.normal(size=(n_true_features, d_model))
true_features /= np.linalg.norm(true_features, axis=1, keepdims=True)

# Each sample activates only 2-4 features (sparse)
activations = np.zeros((n_samples, d_model))
for i in range(n_samples):
    active_idx = rng.choice(n_true_features, size=rng.integers(2, 5), replace=False)
    coeffs = rng.normal(loc=1.5, scale=0.4, size=len(active_idx))
    activations[i] = (coeffs[:, None] * true_features[active_idx]).sum(axis=0)

# Add a little noise
activations += 0.03 * rng.normal(size=activations.shape)

# --- Tiny SAE implementation ---
n_features = 64  # 8x expansion
learning_rate = 0.03
lambda_sparsity = 5.0  # L1 coefficient
epochs = 300

W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-12
initial_W_dec = W_dec.copy()
b_dec = np.zeros(d_model)

def encode(x: np.ndarray):
    pre_activation = x @ W_enc.T + b_enc
    z = np.maximum(0.0, pre_activation)
    return pre_activation, z

def reconstruct(x: np.ndarray):
    _, z = encode(x)
    return z, z @ W_dec.T + b_dec

def reconstruction_loss(x: np.ndarray) -> float:
    _, x_hat = reconstruct(x)
    return float(np.mean((x - x_hat) ** 2))

def train_sae(x: np.ndarray) -> None:
    global W_enc, b_enc, W_dec, b_dec
    for epoch in range(epochs):
        # Forward
        pre_activation, z = encode(x)
        x_hat = z @ W_dec.T + b_dec

        # Loss
        recon_loss = np.mean((x - x_hat) ** 2)
        sparsity_loss = np.mean(np.abs(z))
        total_loss = recon_loss + lambda_sparsity * sparsity_loss

        # Gradients (manual backprop for clarity)
        d_recon = 2 * (x_hat - x) / x.size
        dW_dec = d_recon.T @ z
        db_dec = d_recon.sum(axis=0)

        dz = d_recon @ W_dec
        dz += (lambda_sparsity / z.size) * np.sign(z)
        dz[pre_activation <= 0] = 0.0  # ReLU backward
        dW_enc = dz.T @ x
        db_enc = dz.sum(axis=0)

        # SGD step
        W_dec -= learning_rate * dW_dec
        b_dec -= learning_rate * db_dec
        W_enc -= learning_rate * dW_enc
        b_enc -= learning_rate * db_enc

        # Keep decoder columns normalized so scale cannot move from z into W_dec.
        W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-12

        if epoch % 100 == 0:
            print(f"Epoch {epoch:4d} | total={total_loss:.4f} | recon={recon_loss:.4f} | sparsity={sparsity_loss:.4f}")

initial_recon = reconstruction_loss(activations)
train_sae(activations)
```

```text
Epoch    0 | total=1.9426 | recon=1.4259 | sparsity=0.1033
Epoch  100 | total=0.5301 | recon=0.2370 | sparsity=0.0586
Epoch  200 | total=0.4394 | recon=0.1564 | sparsity=0.0566
```

Now inspect the learned sparse code after training. This follow-up cell keeps the heavy optimization loop separate from the summary checks you read.
```python
final_recon = reconstruction_loss(activations)
z, _ = reconstruct(activations)
active_per_sample = (z > 0).sum(axis=1).mean()

print(f"Training complete: recon {initial_recon:.3f} -> {final_recon:.3f}, active features {active_per_sample:.1f}")
```

```text
Training complete: recon 1.426 -> 0.128, active features 18.9
```

Run the first cell and watch reconstruction loss drop, then use the follow-up cell to confirm that average active features per sample stays far below the full 64-feature dictionary. Reconstruction and sparse use alone still say nothing about whether columns recover known generating directions. Because this dataset has ground truth, the next cell measures that question directly.
```python
def direction_alignment(decoder: np.ndarray):
    columns = decoder.T.copy()
    columns /= np.linalg.norm(columns, axis=1, keepdims=True) + 1e-12
    similarities = np.abs(true_features @ columns.T)
    strongest_per_truth = similarities.max(axis=1)
    return float(np.median(strongest_per_truth)), int((strongest_per_truth >= 0.90).sum())

initial_median, initial_high_matches = direction_alignment(initial_W_dec)
trained_median, trained_high_matches = direction_alignment(W_dec)
print(f"median max |cosine|: {initial_median:.3f} -> {trained_median:.3f}")
print(f"true directions at >= 0.90: {initial_high_matches} -> {trained_high_matches}")
```

```text
median max |cosine|: 0.798 -> 0.799
true directions at >= 0.90: 1 -> 2
```

Here, reconstruction improves while median direction alignment barely moves (0.798 to 0.799; high matches move from one to two). This tiny trainer therefore demonstrates sparse reconstruction, not recovery of ground-truth features. On synthetic data, a stronger trainer could use rising alignment to support a recovery claim. Real LLM activation datasets lack known generating directions, so feature interpretation instead relies on held-out activation examples and causal tests.
W_enc is deliberately initialized small so that early activation magnitudes stay controlled.

The manual backward pass adds the L1 subgradient, (lambda_sparsity / z.size) * np.sign(z), to the latent gradient before the ReLU backward step. In a real PyTorch implementation you would let torch.autograd handle it.

The small SAE lab contains the core ingredients used in large SAE work: an activation dataset, an overcomplete dictionary, a reconstruction objective, a sparsity constraint, and diagnostics for feature quality.[4][11]
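Hand-written backprop like the lab's is easy to get subtly wrong, and one common sanity check is to compare it against finite differences. The sketch below re-derives the same loss (mean squared error plus an L1 term averaged over latents) for a single sample in plain Python. The tiny weights, the `lam` value, and the helper names are hypothetical and chosen only so that every pre-activation sits away from zero, where the ReLU is differentiable.

```python
def forward(x, W_enc, b_enc, W_dec, b_dec, lam):
    """Return pre-activations, latent code, reconstruction, and total loss."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    z = [max(0.0, p) for p in pre]
    x_hat = [sum(w * zj for w, zj in zip(row, z)) + b for row, b in zip(W_dec, b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in z) / len(z)
    return pre, z, x_hat, recon + lam * sparsity

def manual_grad_w_enc(x, W_enc, b_enc, W_dec, b_dec, lam):
    """Hand-derived gradient of the loss with respect to W_enc."""
    pre, z, x_hat, _ = forward(x, W_enc, b_enc, W_dec, b_dec, lam)
    d_recon = [2.0 * (xh - xi) / len(x) for xh, xi in zip(x_hat, x)]
    dz = [sum(d_recon[i] * W_dec[i][j] for i in range(len(x))) for j in range(len(z))]
    # L1 term: sign(z) is 1 for active latents and 0 for inactive ones.
    dz = [g + (lam / len(z)) * (1.0 if zj > 0 else 0.0) for g, zj in zip(dz, z)]
    # ReLU backward: no gradient flows through latents whose pre-activation is <= 0.
    dz = [g if p > 0 else 0.0 for g, p in zip(dz, pre)]
    return [[g * xi for xi in x] for g in dz]

def numeric_grad_w_enc(x, W_enc, b_enc, W_dec, b_dec, lam, eps=1e-6):
    """Central finite differences over every encoder weight."""
    grads = []
    for j, row in enumerate(W_enc):
        grad_row = []
        for k in range(len(row)):
            plus = [r[:] for r in W_enc]
            minus = [r[:] for r in W_enc]
            plus[j][k] += eps
            minus[j][k] -= eps
            loss_plus = forward(x, plus, b_enc, W_dec, b_dec, lam)[3]
            loss_minus = forward(x, minus, b_enc, W_dec, b_dec, lam)[3]
            grad_row.append((loss_plus - loss_minus) / (2 * eps))
        grads.append(grad_row)
    return grads

# Hypothetical tiny SAE: 2-dimensional activation, 3 latents.
x = [1.0, -0.5]
W_enc = [[0.6, 0.2], [-0.4, 0.3], [0.5, -0.7]]
b_enc = [0.1, 0.0, -0.2]
W_dec = [[0.8, 0.1, 0.3], [0.2, -0.5, 0.9]]
b_dec = [0.0, 0.0]
lam = 0.1

manual = manual_grad_w_enc(x, W_enc, b_enc, W_dec, b_dec, lam)
numeric = numeric_grad_w_enc(x, W_enc, b_enc, W_dec, b_dec, lam)
max_gap = max(abs(a - b) for ra, rb in zip(manual, numeric) for a, b in zip(ra, rb))
print(f"largest manual-vs-numeric gap: {max_gap:.2e}")
```

A gap near floating-point noise suggests the derivation matches the loss. A large gap usually points to a missing normalization factor or a misplaced ReLU mask.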
Published SAE experiments use several refinements:
| Refinement | Why it matters |
|---|---|
| TopK SAEs | Replace the soft L1 penalty with an explicit keep-k operation. Gao et al. report an improved reconstruction-sparsity frontier over their evaluated ReLU baselines and train a 16-million-latent SAE on GPT-4 activations.[11] |
| Gated SAEs | Separate feature selection from magnitude estimation. Rajamanoharan et al. report reduced shrinkage and comparable fidelity with roughly half as many firing features in their tested settings.[12] |
| JumpReLU SAEs | Replace ReLU with a learned threshold and train an L0 penalty with straight-through estimators. On Gemma 2 9B activations, Rajamanoharan et al. report equal or stronger reconstruction-sparsity performance than their Gated and TopK comparisons.[13] |
| Dead feature mitigation | Gao et al. use tied initialization and an auxiliary AuxK loss to reduce dead latents in their TopK recipe.[11] |
| Expansion factor and tuning | Too small an expansion leaves features entangled; too large wastes capacity and can cause "feature splitting" (one concept split across several nearly duplicate features). |
| Layer and token selection | SAEs are trained on activations from many prompts and token positions. Measure whether the sampled data covers rare or safety-relevant contexts instead of assuming it does. |
Common beginner mistakes are easier to debug if you name the symptom:
| Symptom | Likely cause | Fix |
|---|---|---|
| Every latent seems to mean several things | Expansion factor too small, sparsity too weak, or the training data mixes unrelated domains without enough coverage | Increase dictionary width, sweep sparsity, and inspect top-activating examples per domain. |
| Many features never activate | Sparsity pressure is too strong or initialization left features behind early | Lower $\lambda$ for L1 training or raise $k$ for TopK, then consider dead-feature resampling or an auxiliary revival loss. |
| Reconstruction looks good but labels look messy | Dense latent use can reconstruct activations without creating human-readable features | Track $L_0$ (active latents per input), feature purity, top examples, and causal effects alongside mean squared error. |
| A steering demo works once and then fails elsewhere | Feature label was correlational or the intervention changed unrelated behavior | Run held-out sweeps, ablations, and negative-control prompts before treating the feature as causal. |
| One concept splits into many near-duplicates | Dictionary is too wide or sparsity target makes narrow variants cheaper | Compare nearest-neighbor decoder directions and merge labels only after checking causal behavior. |
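The nearest-neighbor comparison from the last row can be sketched with pairwise cosine similarity over decoder directions. The latent names, the three-dimensional directions, and the 0.95 cutoff below are all hypothetical; a real dictionary would have thousands of columns and a cutoff chosen by inspection.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

# Hypothetical decoder directions keyed by latent name.
decoder_directions = {
    "latent_3": [1.0, 0.0, 0.0],
    "latent_8": [0.98, 0.2, 0.0],  # nearly parallel to latent_3
    "latent_12": [0.0, 1.0, 0.0],
    "latent_20": [0.0, 0.1, 1.0],
}

names = list(decoder_directions)
split_candidates = [
    (first, second)
    for position, first in enumerate(names)
    for second in names[position + 1:]
    if cosine(decoder_directions[first], decoder_directions[second]) >= 0.95  # assumed cutoff
]
print("possible feature splitting:", split_candidates)
```

A flagged pair is only a prompt for inspection. Two nearly parallel directions can still differ in when they fire, so compare their top-activating examples and intervention effects before merging labels.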
```python
latent_batches = [
    [1.2, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]

rates = [
    sum(row[index] > 0 for row in latent_batches) / len(latent_batches)
    for index in range(len(latent_batches[0]))
]
rare_or_dead = [index for index, rate in enumerate(rates) if rate < 0.30]
print("firing rates:", [round(rate, 2) for rate in rates])
print("inspect latents:", rare_or_dead)

assert rare_or_dead == [1, 2, 3]
```

```text
firing rates: [0.5, 0.25, 0.0, 0.25]
inspect latents: [1, 2, 3]
```

| Aspect | L1 Penalty SAE | TopK SAE |
|---|---|---|
| Sparsity control | Indirect via $\lambda$; soft thresholding | Selects the $k$ largest activations; positive nonzero count can be lower |
| Training signal | L1 pressure applies shrinkage to active magnitudes | Selected latents receive reconstruction gradients; Gao et al. add AuxK for dead latents |
| Reconstruction comparison | A useful simple baseline | Gao et al. report a stronger frontier than evaluated ReLU baselines |
| Dead latents | Must be monitored | Gao et al. report few dead latents with their mitigation recipe |
| Research usage | Straightforward for small experiments | Used in the GPT-4-activation scaling experiment |
| Hyperparameter sensitivity | $\lambda$ needs careful sweeps | $k$ is intuitive (e.g., keep 32 out of 4096) |
These papers evaluate different ways to control sparsity and shrinkage: TopK fixes active count, Gated separates selection from magnitude, and JumpReLU trains a learned threshold with an L0 penalty.[11][12][13] Results are architecture- and dataset-specific, so variant choice should be followed by reconstruction, sparsity, feature-quality, and intervention checks.
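The difference between these activation rules is easiest to see on one pre-activation vector. The sketch below is a simplified illustration rather than any paper's implementation: the JumpReLU thresholds are fixed by hand here, whereas the published method learns them, and the TopK version applies a ReLU after selection, which is an assumption of this example.

```python
def relu(pre):
    return [max(0.0, value) for value in pre]

def top_k(pre, k):
    # Keep the k largest pre-activations, then rectify (rectification is assumed here).
    ranked = sorted(range(len(pre)), key=lambda index: pre[index], reverse=True)
    keep = set(ranked[:k])
    return [max(0.0, value) if index in keep else 0.0 for index, value in enumerate(pre)]

def jump_relu(pre, thresholds):
    # Pass a value through unchanged only when it exceeds its per-latent threshold.
    return [value if value > theta else 0.0 for value, theta in zip(pre, thresholds)]

pre_activations = [0.05, 1.2, -0.3, 0.4, 0.9]
fixed_thresholds = [0.3] * len(pre_activations)  # hypothetical; learned in the real method

relu_code = relu(pre_activations)
top_k_code = top_k(pre_activations, k=2)
jump_code = jump_relu(pre_activations, fixed_thresholds)

print("ReLU:    ", relu_code)
print("TopK(2): ", top_k_code)
print("JumpReLU:", jump_code)
```

ReLU keeps the small 0.05 activation, TopK keeps exactly two latents regardless of their size, and JumpReLU drops anything under its threshold while leaving the surviving magnitudes untouched.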
Templeton et al. inspected Claude 3 Sonnet features related to safety-relevant behaviors, including sycophantic praise and scam-related text, and experimented with feature steering.[4] These results establish useful research probes, not a general safety monitor. A feature that activates on suspicious examples gives researchers an internal signal to test with interventions and behavioral evaluations.
In a code-assistant deployment, the research question becomes more precise. Instead of asking "why did the model refuse this password-reset script?" in the abstract, you can ask which candidate features activated, whether a credential-safety direction was causally involved, and whether changing that direction shifts decisions on a held-out security-prompt set. That remains research evidence rather than a complete compliance story.
SAEs are a genuine advance, not a solved problem. Keep four limits in mind so you don't oversell them in an interview or a design review.
If you want to move from the toy example to larger-model work, the natural next projects are:
Each project exercises these concepts and produces inspectable experimental evidence.
The stable core is superposition, dictionary learning, reconstruction plus sparsity, and causal interventions on feature directions. Once those four ideas are clear, newer SAE variants (TopK, Gated, JumpReLU) and the transcoder-plus-attribution-graph stack become engineering choices rather than a new conceptual foundation.
[1] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Bricken, T., Templeton, A., et al. (Anthropic) · 2023 · Transformer Circuits Thread

[2] Sparse Autoencoders Find Highly Interpretable Features in Language Models. Cunningham, H., Ewart, A., Riggs, L., Huben, R., Sharkey, L. · 2023

[3] Toy Models of Superposition. Elhage, N., Hume, T., Olsson, C., et al. (Anthropic) · 2022 · Transformer Circuits Thread

[4] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Templeton, A., Conerly, T., et al. (Anthropic) · 2024 · Transformer Circuits Thread

[5] Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., Mueller, A. · 2024 · ICLR 2025

[6] In-context Learning and Induction Heads. Olsson, C., et al. · 2022

[7] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. Wang, K., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J. · 2022

[8] How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Hanna, M., Liu, O., Variengien, A. · 2023 · NeurIPS 2023

[9] On the Biology of a Large Language Model. Lindsey, J., Gurnee, W., Ameisen, E., et al. · 2025

[10] Refusal in Language Models Is Mediated by a Single Direction. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N. · 2024

[11] Scaling and Evaluating Sparse Autoencoders. Gao, L., et al. · 2024

[12] Improving Dictionary Learning with Gated Sparse Autoencoders. Rajamanoharan, S., Conmy, A., Smith, L., et al. · 2024

[13] Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. Rajamanoharan, S., Lieberum, T., Sonnerat, N., et al. · 2024