🏛️ Easy · Model Architecture

Autoencoders and VAEs

Compress an equipment-panel patch, turn its latent code into a sampleable distribution, and implement VAE loss and training.

13 min read
Learning path
Step 23 of 158 in the full curriculum

An RNN can compress three ordered event records into one final state. An equipment-monitoring workflow also receives images: a close crop of a cracked panel, a loose connector, or a clean surface. These pixel arrays are much larger than the decision they support.

An autoencoder learns to reconstruct an input through a latent code. The undercomplete autoencoder used here makes that code smaller than the input, creating a visible compression bottleneck. A variational autoencoder (VAE) makes its encoder predict a distribution over latent codes and trains that distribution against a prior, so sampling becomes part of the model rather than an unsupported guess.[1]

Compress one photo patch

Start with a tiny crop from an inspection image. Each grayscale number is between 0 (dark) and 1 (bright). A real photo has many more pixels, but a 5 by 5 crop lets us read every number.

[Figure: A 5 by 5 grayscale patch is multiplied by two 5 by 5 encoder weight maps: four center weights of 0.25 produce z1 0.888 and two lower weights of 0.50 produce z2 0.170, reducing 25 values to 2.]
The first weight map averages four bright center pixels; the second averages two lower pixels. Their weighted sums form `[0.888, 0.170]`, making the compression step visible before weights are learned.

An autoencoder has two learned functions:

$$z = f_\phi(x) \qquad \hat{x} = g_\theta(z)$$

The encoder $f_\phi$ maps the input $x$ to the latent code $z$. The decoder $g_\theta$ maps that code to a reconstruction $\hat{x}$. A narrow $z$ creates the bottleneck: the decoder can't copy all 25 pixels directly, so training must preserve information that helps reconstruction.

For normalized pixels, mean-squared error (MSE) is a simple reconstruction measure:

$$L_{\text{recon}} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{x}_i)^2$$

First run a hand-designed encoder and decoder. The weights aren't trained; they make the bottleneck calculation visible before we ask optimization to discover one.

crafted-autoencoder-forward.py
```python
import numpy as np

x = np.array([
    0.08, 0.12, 0.09, 0.11, 0.07,
    0.15, 0.85, 0.88, 0.14, 0.10,
    0.11, 0.90, 0.92, 0.13, 0.09,
    0.07, 0.18, 0.16, 0.12, 0.08,
    0.06, 0.09, 0.10, 0.07, 0.05,
], dtype=np.float32)

center = [6, 7, 11, 12]
lower = [16, 17]
W_encoder = np.zeros((25, 2), dtype=np.float32)
W_encoder[center, 0] = 0.25
W_encoder[lower, 1] = 0.50

W_decoder = np.zeros((2, 25), dtype=np.float32)
W_decoder[0, center] = 1.0
W_decoder[0, [5, 8, 10, 13]] = 0.12
W_decoder[1, lower] = 1.0
W_decoder[1, [15, 18, 21, 22]] = 0.35

z = x @ W_encoder
x_hat = np.clip(z @ W_decoder + 0.08, 0.0, 1.0)
mse = np.mean((x - x_hat) ** 2)

print("input numbers:", x.size)
print("latent numbers:", z.size, np.round(z, 3))
print("reconstruction MSE:", round(float(mse), 4))
```

Output

```
input numbers: 25
latent numbers: 2 [0.888 0.17 ]
reconstruction MSE: 0.0027
```

The input shrinks from 25 numbers to 2. This crafted decoder does well on this one pattern because we built its weights around that pattern. A model becomes useful only when its weights are learned across many examples.

Learn the bottleneck

The next program makes noisy variations of the same inspection-image crop and trains an autoencoder in PyTorch. PyTorch is the right tool here because the weights must receive gradients and optimizer updates.

train-tiny-autoencoder.py
```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(3)
base = torch.tensor([
    0.08, 0.12, 0.09, 0.11, 0.07,
    0.15, 0.85, 0.88, 0.14, 0.10,
    0.11, 0.90, 0.92, 0.13, 0.09,
    0.07, 0.18, 0.16, 0.12, 0.08,
    0.06, 0.09, 0.10, 0.07, 0.05,
])
patches = torch.clamp(base + 0.025 * torch.randn(48, 25), 0.0, 1.0)

encoder = nn.Sequential(nn.Linear(25, 8), nn.Tanh(), nn.Linear(8, 2))
decoder = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 25), nn.Sigmoid())
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.03
)

with torch.no_grad():
    before = F.mse_loss(decoder(encoder(patches)), patches).item()

for _ in range(250):
    reconstruction = decoder(encoder(patches))
    loss = F.mse_loss(reconstruction, patches)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    after = F.mse_loss(decoder(encoder(patches)), patches).item()
    latent = encoder(patches[:1])

print(f"MSE before training: {before:.4f}")
print(f"MSE after training: {after:.4f}")
print("latent shape:", tuple(latent.shape))
```

Output

```
MSE before training: 0.1537
MSE after training: 0.0006
latent shape: (1, 2)
```

The decoder learned to reconstruct this small family of patches through two latent numbers. It wasn't trained to decode arbitrary random pairs of numbers. That's the generation problem: a deterministic autoencoder defines codes for observed inputs, but its loss doesn't make a simple random distribution over codes useful.

| Model question | Deterministic autoencoder answer |
| --- | --- |
| What does the encoder output? | One code $z$ for each input |
| What is optimized? | Reconstruction of encoded training examples |
| Can you decode a known encoded patch? | Yes, if reconstruction training worked |
| Is $z \sim \mathcal{N}(0,I)$ trained as an input source? | No |

A VAE learns distributions over codes

A VAE makes prior sampling useful through its training objective. Instead of producing one code, the encoder predicts the parameters of an approximate posterior, written $q_\phi(z \mid x)$. The word approximate matters: computing the exact posterior $p_\theta(z \mid x)$ is generally impractical, so the encoder learns a tractable stand-in. For the common diagonal-Gaussian case, it predicts a mean vector $\mu$ and a log-variance vector $\log \sigma^2$:

$$q_\phi(z \mid x) = \mathcal{N}\left(\mu(x), \operatorname{diag}(\sigma^2(x))\right)$$

The model also declares a simple prior, usually:

$$p(z) = \mathcal{N}(0, I)$$

The posterior describes codes for one observed patch. The prior is the distribution we want to sample from when no input patch is provided.

Why predict logvar rather than a raw standard deviation? A neural layer may output any real number, while a standard deviation must be positive. Exponentiating half the log variance converts any real output into a valid standard deviation.

log-variance-to-standard-deviation.py
```python
import numpy as np

logvar = np.array([-0.70, 0.10])
sigma = np.exp(0.5 * logvar)
variance = np.exp(logvar)

print("log variance:", logvar)
print("variance:", np.round(variance, 3))
print("standard deviation:", np.round(sigma, 3))
print("all sigma positive:", bool(np.all(sigma > 0)))
```

Output

```
log variance: [-0.7  0.1]
variance: [0.497 1.105]
standard deviation: [0.705 1.051]
all sigma positive: True
```

Keep posteriors near a sampleable prior

If the encoder may put each patch distribution anywhere it likes, prior samples still won't reliably land where the decoder has trained. VAEs penalize mismatch between the posterior and prior with Kullback-Leibler (KL) divergence. KL isn't a symmetric distance metric: direction matters. For a diagonal Gaussian posterior and standard-normal prior, the VAE uses:

$$D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) = -\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

The penalty is zero when $\mu=0$ and $\sigma=1$. Moving means away from zero or making variances very different from one increases it.
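
To see where the zero comes from, it can help to evaluate one dimension of the sum by hand. The sketch below introduces $d_j$ as a label for the per-dimension term (a name used only for this illustration), checks it at $\mu_j = 0$, $\sigma_j = 1$, and then plugs in the values $\mu = [0.40, -0.20]$ and $\log \sigma^2 = [-0.70, 0.10]$ used in this lesson's examples:

$$
d_j = -\tfrac{1}{2}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right), \qquad d_j\big|_{\mu_j=0,\ \sigma_j=1} = -\tfrac{1}{2}(1 + 0 - 0 - 1) = 0
$$

$$
d_1 = -\tfrac{1}{2}(1 - 0.70 - 0.16 - 0.4966) \approx 0.1783, \qquad d_2 = -\tfrac{1}{2}(1 + 0.10 - 0.04 - 1.1052) \approx 0.0226, \qquad d_1 + d_2 \approx 0.2009
$$

Most of the penalty in this example comes from the first dimension, whose mean sits further from zero and whose variance is further from one.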

kl-divergence-from-prior.py
```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    value = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return max(0.0, float(value))  # Remove negative zero from roundoff.

posteriors = {
    "matches prior": (np.array([0.0, 0.0]), np.array([0.0, 0.0])),
    "small offset": (np.array([0.4, -0.2]), np.array([-0.7, 0.1])),
    "far offset": (np.array([2.0, -1.5]), np.array([-1.5, 1.2])),
}

for name, (mu, logvar) in posteriors.items():
    print(f"{name:13} KL = {kl_to_standard_normal(mu, logvar):.4f}")
```

Output

```
matches prior KL = 0.0000
small offset  KL = 0.2009
far offset    KL = 4.0466
```

KL pressure isn't free. If it dominates reconstruction, all inputs can be encoded nearly like the prior and the decoder may ignore $z$. The posterior-collapse monitor later in this lesson catches that failure.

Sample while gradients still flow

Kingma and Welling's VAE estimator rewrites random sampling so gradients can reach encoder parameters.[1] Draw independent noise $\epsilon$ from a standard normal and then transform it:

$$\epsilon \sim \mathcal{N}(0,I), \qquad z = \mu + \sigma \odot \epsilon$$

During one backward pass, $\epsilon$ is a sampled constant. The multiplication and addition still expose how the loss changes when the encoder changes $\mu$ or $\sigma$.

[Figure: Vector arithmetic and a latent-plane plot show log variance converting to sigma 0.705 and 1.051, scaled noise 0.352 and minus 1.051 added to mu 0.400 and minus 0.200, moving to sampled z 0.752 and minus 1.251.]
The sampled noise becomes a displacement `sigma * epsilon`; adding it to `mu` moves from the posterior mean to `z`. During backpropagation, `epsilon` is fixed, so reconstruction gradients still reach `mu` and `sigma`.
reparameterized-sample.py
```python
import numpy as np

mu = np.array([0.40, -0.20])
logvar = np.array([-0.70, 0.10])
epsilon = np.array([0.50, -1.00])

sigma = np.exp(0.5 * logvar)
z = mu + sigma * epsilon

print("mu:", np.round(mu, 3))
print("sigma:", np.round(sigma, 3))
print("epsilon:", np.round(epsilon, 3))
print("sampled z:", np.round(z, 3))
```

Output

```
mu: [ 0.4 -0.2]
sigma: [0.705 1.051]
epsilon: [ 0.5 -1. ]
sampled z: [ 0.752 -1.251]
```
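
The claim that gradients still reach the encoder outputs can be checked numerically. The following is a minimal sketch for one latent dimension, using only the standard library: it holds `epsilon` fixed and compares central finite differences against the analytic derivatives of `z = mu + exp(0.5 * logvar) * epsilon`. The helper name `sample_z` and the step size are assumptions made for this illustration.

```python
import math

def sample_z(mu, logvar, epsilon):
    """Reparameterized sample for one latent dimension (hypothetical helper)."""
    sigma = math.exp(0.5 * logvar)
    return mu + sigma * epsilon

mu, logvar, epsilon = 0.40, -0.70, 0.50
step = 1e-6  # assumed finite-difference step

# Central finite differences with epsilon held fixed, as in one backward pass.
dz_dmu = (sample_z(mu + step, logvar, epsilon)
          - sample_z(mu - step, logvar, epsilon)) / (2 * step)
dz_dlogvar = (sample_z(mu, logvar + step, epsilon)
              - sample_z(mu, logvar - step, epsilon)) / (2 * step)

# Analytic derivatives: dz/dmu = 1 and dz/dlogvar = 0.5 * sigma * epsilon.
sigma = math.exp(0.5 * logvar)
analytic_dlogvar = 0.5 * sigma * epsilon

print("dz/dmu     numeric:", round(dz_dmu, 4), "analytic:", 1.0)
print("dz/dlogvar numeric:", round(dz_dlogvar, 4), "analytic:", round(analytic_dlogvar, 4))
```

Because both derivatives are well defined once `epsilon` is treated as a constant, any loss computed from `z` can send gradients back to `mu` and `logvar`. Sampling `z` directly from a random-number generator would give the backward pass nothing to differentiate.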
[Figure: Two latent scatter plots compare deterministic autoencoder codes forming separated islands around an unsupported random sample with VAE posterior samples filling a prior-compatible region around a random sample.]
Reconstruction alone trains the decoder on encoded points, so a simple prior sample can land in a gap. The VAE's KL term keeps posterior samples near the declared prior, making prior sampling a trained interface rather than an unsupported guess.

The VAE objective

The VAE optimizes an evidence lower bound (ELBO) on the data log-likelihood.[1] Written as a quantity to maximize:

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid x)\,\|\,p(z))$$

The first term rewards a decoder that makes the observed patch likely after sampling its latent. The second discourages a posterior that moves too far from the prior.

In code we usually minimize the negative ELBO:

$$L_{\mathrm{VAE}} = L_{\mathrm{recon}} + \beta D_{\mathrm{KL}}$$

For $\beta=1$, this corresponds to the standard objective when $L_{\mathrm{recon}}$ is the selected negative log-likelihood. An MSE reconstruction term is proportional to a Gaussian decoder negative log-likelihood only after choosing a fixed observation variance, so treat the scale of MSE and KL as a modeling choice, not an identity.
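
The relationship between MSE and a Gaussian negative log-likelihood can be made concrete. The sketch below, using only the standard library, assumes an independent Gaussian observation model with a fixed standard deviation `sigma` per pixel. The function name `gaussian_nll` and the three `sigma` values are illustrative assumptions, not part of the lesson's training code. It shows that the NLL equals a constant plus `N / (2 * sigma**2)` times the MSE.

```python
import math

x = [0.10, 0.90, 0.85, 0.12]
x_hat = [0.12, 0.84, 0.80, 0.14]
n = len(x)

mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n

def gaussian_nll(x, x_hat, sigma):
    """Summed NLL of x under independent N(x_hat_i, sigma^2) (hypothetical helper)."""
    const = 0.5 * math.log(2 * math.pi * sigma**2)
    return sum(const + (a - b) ** 2 / (2 * sigma**2) for a, b in zip(x, x_hat))

for sigma in [1.0, 0.5, 0.1]:  # assumed fixed observation scales
    nll = gaussian_nll(x, x_hat, sigma)
    constant = n * 0.5 * math.log(2 * math.pi * sigma**2)
    mse_weight = n / (2 * sigma**2)
    print(f"sigma={sigma}: NLL={nll:.4f} = {constant:.4f} + {mse_weight:.1f} * MSE")
```

A smaller assumed `sigma` multiplies the reconstruction term by a larger weight, which has the same effect on the balance as lowering $\beta$. That is why the relative scale of MSE and KL is a modeling decision.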

This ledger computes both terms for one patch using a fixed hypothetical reconstruction:

vae-loss-ledger.py
```python
import numpy as np

x = np.array([0.10, 0.90, 0.85, 0.12])
x_hat = np.array([0.12, 0.84, 0.80, 0.14])
mu = np.array([0.40, -0.20])
logvar = np.array([-0.70, 0.10])

reconstruction_mse = np.mean((x - x_hat) ** 2)
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

for beta in [0.0, 0.1, 1.0]:
    total = reconstruction_mse + beta * kl
    print(f"beta={beta:.1f}: recon={reconstruction_mse:.4f} KL={kl:.4f} total={total:.4f}")
```

Output

```
beta=0.0: recon=0.0017 KL=0.2009 total=0.0017
beta=0.1: recon=0.0017 KL=0.2009 total=0.0218
beta=1.0: recon=0.0017 KL=0.2009 total=0.2026
```

Higgins et al. studied the $\beta$-VAE objective, where increasing $\beta$ applies stronger latent-capacity and independence pressure while trading against reconstruction accuracy.[2] That trade-off must be measured on your task; a higher value isn't automatically better.

beta-tradeoff-accounting.py
```python
reconstruction_losses = [0.004, 0.010, 0.028]
kl_losses = [1.20, 0.42, 0.08]
betas = [0.01, 0.10, 1.00]

for beta, recon, kl in zip(betas, reconstruction_losses, kl_losses):
    total = recon + beta * kl
    print(f"beta={beta:>4.2f}: reconstruction={recon:.3f}, KL={kl:.2f}, objective={total:.3f}")
```

Output

```
beta=0.01: reconstruction=0.004, KL=1.20, objective=0.016
beta=0.10: reconstruction=0.010, KL=0.42, objective=0.052
beta=1.00: reconstruction=0.028, KL=0.08, objective=0.108
```

These rows are an accounting exercise, not a reported experiment. They show why you should log reconstruction and KL separately instead of reading only the total.

Train a tiny VAE

Now train a real two-dimensional VAE on two small families of noisy patches: one with bright center cells and one with bright lower cells. The sample batch at the end demonstrates the generative interface. It doesn't claim photographic quality from this miniature dataset.

train-tiny-vae.py
```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(8)
center_mark = torch.tensor([
    0.05, 0.10, 0.08, 0.10, 0.06,
    0.12, 0.88, 0.86, 0.13, 0.08,
    0.10, 0.91, 0.90, 0.12, 0.07,
    0.06, 0.15, 0.14, 0.10, 0.06,
    0.05, 0.08, 0.08, 0.05, 0.04,
])
lower_mark = torch.tensor([
    0.04, 0.07, 0.06, 0.08, 0.05,
    0.10, 0.15, 0.12, 0.10, 0.06,
    0.08, 0.18, 0.16, 0.09, 0.06,
    0.06, 0.84, 0.88, 0.11, 0.07,
    0.04, 0.07, 0.08, 0.05, 0.04,
])
patches = torch.cat([
    torch.clamp(center_mark + 0.02 * torch.randn(32, 25), 0.0, 1.0),
    torch.clamp(lower_mark + 0.02 * torch.randn(32, 25), 0.0, 1.0),
])

class TinyVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(25, 12), nn.Tanh())
        self.mu = nn.Linear(12, 2)
        self.logvar = nn.Linear(12, 2)
        self.decoder = nn.Sequential(
            nn.Linear(2, 12), nn.Tanh(), nn.Linear(12, 25), nn.Sigmoid()
        )

    def forward(self, x):
        hidden = self.encoder(x)
        mu = self.mu(hidden)
        logvar = self.logvar(hidden)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)
beta = 0.02

def loss_terms(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
    return recon + beta * kl, recon, kl

for _ in range(400):
    x_hat, mu, logvar = model(patches)
    loss, recon, kl = loss_terms(x_hat, patches, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.manual_seed(99)
with torch.no_grad():
    x_hat, mu, logvar = model(patches)
    loss, recon, kl = loss_terms(x_hat, patches, mu, logvar)
    generated = model.decoder(torch.randn(3, 2))

print(f"reconstruction={recon.item():.4f} KL={kl.item():.4f} total={loss.item():.4f}")
print("generated batch shape:", tuple(generated.shape))
```

Output

```
reconstruction=0.0075 KL=0.7382 total=0.0222
generated batch shape: (3, 25)
```

Each object in the program now has a job: `mu` and `logvar` define each approximate posterior, `torch.randn_like` supplies $\epsilon$, `decoder` consumes sampled $z$, and the logged losses reveal the compromise.

Watch for posterior collapse

If a decoder can reconstruct without using its latent input, KL pressure can push every posterior close to the prior. Then the latent code stops carrying information. This failure is called posterior collapse. Van den Oord et al. describe it as a common issue for VAEs paired with high-capacity autoregressive decoders.[3]

A first diagnostic is to track KL per latent dimension. The illustrative monitor below flags a run whose KL has nearly vanished in every dimension:

posterior-collapse-monitor.py
```python
import numpy as np

logged_kl_by_dimension = {
    "using latent": np.array([0.31, 0.18]),
    "suspected collapse": np.array([0.0003, 0.0001]),
}
threshold = 0.001

for run_name, kl_dims in logged_kl_by_dimension.items():
    collapsed = bool(np.all(kl_dims < threshold))
    print(f"{run_name:18} total_KL={kl_dims.sum():.4f} collapsed={collapsed}")
```

Output

```
using latent       total_KL=0.4900 collapsed=False
suspected collapse total_KL=0.0004 collapsed=True
```

Low KL alone isn't a proof of failure. Pair it with reconstruction quality, generated samples, and checks that predictions change when $z$ changes. Still, a rapid collapse toward zero is worth investigating.
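
One way to run the "does the output change when $z$ changes" check is a small sensitivity probe. The sketch below uses two hypothetical stand-in decoders written in plain Python, one that responds to `z` and one that ignores it; neither is the trained `TinyVAE` decoder, and the perturbation size `delta` is an assumed value. With a real model you would pass the trained decoder instead.

```python
import math

def responsive_decoder(z):
    """Hypothetical decoder whose two outputs depend on z."""
    return [math.tanh(0.9 * z[0] - 0.4 * z[1]), math.tanh(0.3 * z[0] + 0.8 * z[1])]

def collapsed_decoder(z):
    """Hypothetical decoder that ignores z and emits a constant output."""
    return [0.42, 0.17]

def latent_sensitivity(decoder, z, delta=0.5):
    """Mean absolute output change when each latent dimension moves by delta."""
    baseline = decoder(z)
    changes = []
    for dim in range(len(z)):
        shifted = list(z)
        shifted[dim] += delta
        output = decoder(shifted)
        changes.extend(abs(a - b) for a, b in zip(output, baseline))
    return sum(changes) / len(changes)

z = [0.2, -0.1]
for name, decoder in [("responsive", responsive_decoder), ("collapsed", collapsed_decoder)]:
    print(f"{name:10} sensitivity = {latent_sensitivity(decoder, z):.4f}")
```

A sensitivity near zero alongside near-zero KL is stronger evidence of collapse than either signal alone.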

VQ-VAE: replace continuous samples with codes

A Vector Quantised Variational Autoencoder (VQ-VAE) takes a different route. Its encoder output is mapped to the nearest vector in a learned codebook, so its latent representation is discrete rather than Gaussian. The original VQ-VAE paper also learns a prior over those discrete codes.[3]

For one encoder vector, quantization is a nearest-neighbor lookup:

$$k = \arg\min_j \lVert z_e(x)-e_j\rVert_2^2, \qquad z_q(x)=e_k$$

vq-vae-nearest-code.py
```python
import numpy as np

encoder_output = np.array([0.72, -0.10])
codebook = np.array([
    [-0.80, 0.20],
    [0.65, -0.05],
    [0.10, 0.90],
])

squared_distances = ((codebook - encoder_output) ** 2).sum(axis=1)
index = int(squared_distances.argmin())
quantized = codebook[index]

print("squared distances:", np.round(squared_distances, 3))
print("selected code ID:", index)
print("decoder receives:", quantized)
```

Output

```
squared distances: [2.4   0.007 1.384]
selected code ID: 1
decoder receives: [ 0.65 -0.05]
```

Nearest-code selection isn't differentiable. VQ-VAE uses a straight-through gradient estimator for the encoder path plus codebook and commitment losses so encoder outputs stay close to useful entries.[3] The important distinction is interface:

| Bottleneck | Decoder receives | How a new latent is obtained |
| --- | --- | --- |
| Autoencoder | Deterministic vector $z$ | Encode an input |
| VAE | Continuous sampled vector $z$ | Sample from a prior after training |
| VQ-VAE | Codebook vector selected by an ID | Model or choose discrete code IDs |
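
The codebook and commitment losses mentioned above can be written out for the single vector from the lookup example. This is a forward-value sketch in plain Python, so it cannot show gradients; the comments mark where a stop-gradient (written `sg[...]`) would sit in an autograd framework. The `commitment_weight` of 0.25 is an assumed setting for illustration.

```python
encoder_output = [0.72, -0.10]   # z_e(x) from the lookup example
selected_code = [0.65, -0.05]    # e_k, the nearest codebook vector
commitment_weight = 0.25         # assumed weight; tune per task

squared_gap = sum((z - e) ** 2 for z, e in zip(encoder_output, selected_code))

# Codebook loss ||sg[z_e] - e_k||^2: its gradient moves only the codebook vector.
codebook_loss = squared_gap
# Commitment loss weight * ||z_e - sg[e_k]||^2: its gradient moves only the encoder.
commitment_loss = commitment_weight * squared_gap

# Straight-through forward value: z_e + sg[e_k - z_e] equals e_k numerically,
# while a backward pass would treat the bracketed term as a constant and
# copy the decoder's gradient onto z_e.
decoder_input = [z + (e - z) for z, e in zip(encoder_output, selected_code)]

print(f"codebook loss:   {codebook_loss:.5f}")
print(f"commitment loss: {commitment_loss:.5f}")
print("decoder input:", [round(v, 2) for v in decoder_input])
```

Both terms have the same forward value up to the weight. What differs is which side receives the gradient, and that is the part a framework's stop-gradient operation controls.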

Latent models reduce the expensive space

Rombach et al. train diffusion models in the latent space of pretrained autoencoders rather than applying all denoising operations directly to pixels.[4] This makes the autoencoder's role precise: compress image data before expensive generation steps, then decode a final latent back to pixels.

The size reduction itself is simple arithmetic. A downsampling factor of 8 reduces each spatial axis by a factor of 8, so it reduces the number of spatial positions by a factor of 64:

latent-spatial-budget.py
```python
height, width = 512, 512
downsample = 8
latent_height = height // downsample
latent_width = width // downsample

pixel_positions = height * width
latent_positions = latent_height * latent_width

print("pixel grid:", (height, width), "positions:", pixel_positions)
print("latent grid:", (latent_height, latent_width), "positions:", latent_positions)
print("spatial reduction:", pixel_positions // latent_positions, "x")
```

Output

```
pixel grid: (512, 512) positions: 262144
latent grid: (64, 64) positions: 4096
spatial reduction: 64 x
```

Channels and decoder quality still matter, and individual systems choose their own latent representation. The durable idea is that a learned bottleneck can make later modeling cheaper while retaining useful structure.
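
To see why channels matter, the spatial count can be extended to total stored values. The sketch below assumes a 3-channel RGB input and a 4-channel latent. Both channel counts are illustrative assumptions, since the lesson doesn't fix them and real systems differ.

```python
height, width, pixel_channels = 512, 512, 3   # assumed RGB input
downsample, latent_channels = 8, 4            # assumed latent layout

pixel_values = height * width * pixel_channels
latent_values = (height // downsample) * (width // downsample) * latent_channels
value_reduction = pixel_values / latent_values

print("pixel values:", pixel_values)
print("latent values:", latent_values)
print("value reduction:", value_reduction, "x")
```

Under these assumptions the reduction in stored values is smaller than the 64x spatial figure, because the latent carries more channels per position than the image does.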

Debugging checklist

  • Plot or print reconstruction loss and KL loss separately. Total loss hides which side dominates.
  • Check that sigma = exp(0.5 * logvar), not exp(logvar), before reparameterized sampling. If samples or KL become NaN or Inf, inspect logvar for unstable extremes before adding an explicit clamp.
  • Keep posterior and prior names straight: the encoder supplies $q_\phi(z \mid x)$; generation samples from $p(z)$.
  • Inspect whether reconstructions or predictions change when $z$ changes. If not, the decoder may ignore its latent input.
  • For VQ-VAE, log code usage. A codebook where only a few IDs are selected wastes capacity.
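
For the code-usage item in the checklist, a minimal logging sketch might look like the following. The list of selected IDs is hypothetical data for one batch, and reporting usage perplexity (the exponential of the usage entropy) is one common summary, not something prescribed by this lesson.

```python
import math
from collections import Counter

codebook_size = 8
# Hypothetical code IDs selected for one logged batch.
selected_ids = [1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 3, 1]

counts = Counter(selected_ids)
total = len(selected_ids)
used = len(counts)

# Perplexity = exp(entropy) of the empirical usage distribution.
entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
perplexity = math.exp(entropy)

print(f"codes used: {used} of {codebook_size}")
print(f"usage perplexity: {perplexity:.2f} (maximum {codebook_size})")
```

A perplexity far below the codebook size means most selections concentrate on a few IDs, which is the wasted-capacity pattern the checklist warns about.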

Mastery check

Key concepts

  • An undercomplete autoencoder reconstructs inputs through a narrow latent bottleneck.
  • Reconstruction loss trains codes for observed inputs but doesn't define a sampleable prior.
  • A VAE encoder predicts a Gaussian approximate posterior using mu and logvar.
  • KL divergence keeps each posterior near the chosen prior.
  • Reparameterization isolates random noise while preserving gradients to encoder outputs.
  • The negative ELBO balances reconstruction likelihood and KL regularization.
  • Posterior collapse means the decoder stops using its latent code.
  • VQ-VAE exposes discrete codebook selections rather than continuous Gaussian samples.
  • Latent diffusion runs expensive generation steps in a compressed autoencoder space.

Evaluation rubric

  • Foundational: Explains why a bottleneck forces compression and computes an MSE reconstruction loss.
  • Foundational: Distinguishes reconstructing encoded data from generating by sampling a prior.
  • Intermediate: Converts logvar to sigma, calculates Gaussian KL, and produces a reparameterized sample.
  • Intermediate: Identifies reconstruction and KL terms in a PyTorch VAE training loop.
  • Advanced: Diagnoses posterior collapse with multiple signals rather than KL alone.
  • Advanced: Compares continuous VAE latents with VQ-VAE code IDs and latent diffusion's compressed space.

Common pitfalls

  • Symptom: Random deterministic-autoencoder codes are treated as valid samples. Cause: Reconstruction training was mistaken for prior training. Fix: State how new codes are generated and whether that source appeared in the objective.
  • Symptom: Sampling or KL loss returns NaN or Inf. Cause: Scale parameterization is wrong or logvar became extreme before exponentiation. Fix: Compute sigma = exp(0.5 * logvar), inspect the logvar range, and stabilize optimization before adding an explicit clamp.
  • Symptom: Reconstruction looks adequate while latent dimensions carry no information. Cause: Posterior collapse. Fix: Inspect KL per dimension and test sensitivity to changing z.
  • Symptom: MSE and ELBO are described as identical in every VAE. Cause: Decoder likelihood choice was skipped. Fix: Tie reconstruction loss to the observation model and its scaling.
  • Symptom: VQ lookup is presented as ordinary differentiable code. Cause: The discrete selection step was omitted. Fix: Explain the straight-through estimator and commitment/codebook losses.


Mastery Check


1. An autoencoder maps a 25-value patch to a 2-value latent code before reconstructing it. What does this undercomplete bottleneck do during training?
2. A grayscale patch has values x = [0.10, 0.90, 0.85, 0.12] and an autoencoder reconstructs x_hat = [0.12, 0.84, 0.80, 0.14]. Using mean-squared error, what is the reconstruction loss?
3. A deterministic autoencoder reconstructs its training-like patches with very low MSE. At inference time, you sample z from N(0, I) and feed those random codes to the decoder, but the outputs are unreliable. What was missing from training?
4. A VAE encoder emits mu and logvar for q(z|x). Which operation gives a valid standard deviation, and where should z come from when generating without an input x?
5. A VAE encoder outputs mu = [0.40, -0.20] and logvar = [-0.70, 0.10]. If epsilon = [0.50, -1.00], which sampled z and gradient explanation are correct?
6. For mu = [0.40, -0.20] and logvar = [-0.70, 0.10], use KL = -0.5 * sum(1 + logvar - mu^2 - exp(logvar)). What is the KL divergence to N(0, I)?
7. A VAE run reports reconstruction MSE = 0.0017, KL = 0.2009, and beta = 0.1. Which report correctly applies L = recon + beta * KL?
8. KL per latent dimension is [0.0003, 0.0001]. Reconstructions look acceptable, but decoded outputs barely change when z is varied. Which diagnosis is supported?
9. A VQ-VAE encoder emits [0.72, -0.10]. The nearest codebook vector is code 1 at [0.65, -0.05]. Which statement describes the forward and training paths?
10. An image model downsamples a 512 by 512 grid by a factor of 8 along each spatial axis before running expensive generation steps. What latent grid and spatial reduction result?


Next Step
Continue to The Transformer Architecture End-to-End

You now know how a learned encoder creates a useful compressed representation. Next you'll see how attention builds contextual token representations and turns them into next-token predictions.

References

[1] Kingma, D. P., & Welling, M. · Auto-Encoding Variational Bayes. · 2014 · ICLR 2014 · https://arxiv.org/abs/1312.6114

[2] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., & Lerchner, A. · beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. · 2017 · ICLR 2017 · https://openreview.net/forum?id=Sy2fzU9gl

[3] Oord, A. van den, Vinyals, O., & Kavukcuoglu, K. · Neural Discrete Representation Learning. · 2017 · https://arxiv.org/abs/1711.00937

[4] Rombach, R., et al. · High-Resolution Image Synthesis with Latent Diffusion Models. · 2022 · CVPR 2022 · https://arxiv.org/abs/2112.10752