Compress an equipment-panel patch, turn its latent code into a sampleable distribution, and implement VAE loss and training.
An RNN can compress three ordered event records into one final state. An equipment-monitoring workflow also receives images: a close crop of a cracked panel, a loose connector, or a clean surface. These pixel arrays are much larger than the decision they support.
An autoencoder learns to reconstruct an input through a latent code. The undercomplete autoencoder used here makes that code smaller than the input, creating a visible compression bottleneck. A variational autoencoder (VAE) makes its encoder predict a distribution over latent codes and trains that distribution against a prior, so sampling becomes part of the model rather than an unsupported guess.[1]
Start with a tiny crop from an inspection image. Each grayscale number is between 0 (dark) and 1 (bright). A real photo has many more pixels, but a 5 by 5 crop lets us read every number.
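As a small reading aid, the sketch below lays the 25 numbers out as a 5 by 5 grid and reports which cells are bright. It reuses the patch values from this lesson's hand-designed example; the 0.5 cutoff and the name `bright_threshold` are assumptions made only for illustration.

```python
import numpy as np

# The 25 grayscale values of the lesson's example crop, stored row by row.
x = np.array([
    0.08, 0.12, 0.09, 0.11, 0.07,
    0.15, 0.85, 0.88, 0.14, 0.10,
    0.11, 0.90, 0.92, 0.13, 0.09,
    0.07, 0.18, 0.16, 0.12, 0.08,
    0.06, 0.09, 0.10, 0.07, 0.05,
], dtype=np.float32)

grid = x.reshape(5, 5)   # row-major: flat index = 5 * row + col
bright_threshold = 0.5   # hypothetical cutoff, not part of the lesson

# Flat positions and (row, col) positions of the bright cells.
bright_flat = np.flatnonzero(x > bright_threshold)
bright_rows_cols = [
    (int(r), int(c)) for r, c in np.argwhere(grid > bright_threshold)
]

print(grid)
print("bright flat indices:", bright_flat.tolist())
print("bright (row, col):", bright_rows_cols)
```

The bright 2 by 2 block sits at flat indices 6, 7, 11, and 12, which are the positions the hand-designed encoder averages.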
An autoencoder has two learned functions:

$$z = f_\phi(x), \qquad \hat{x} = g_\theta(z)$$

The encoder $f_\phi$ maps the input $x$ to the latent code $z$. The decoder $g_\theta$ maps that code to a reconstruction $\hat{x}$. A narrow $z$ creates the bottleneck: the decoder can't copy all 25 pixels directly, so training must preserve information that helps reconstruction.

For normalized pixels, mean-squared error (MSE) is a simple reconstruction measure:

$$\mathrm{MSE}(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2$$
First run a hand-designed encoder and decoder. The weights aren't trained; they make the bottleneck calculation visible before we ask optimization to discover one.
```python
import numpy as np

# Flattened 5x5 grayscale crop, row-major, values in [0, 1].
x = np.array([
    0.08, 0.12, 0.09, 0.11, 0.07,
    0.15, 0.85, 0.88, 0.14, 0.10,
    0.11, 0.90, 0.92, 0.13, 0.09,
    0.07, 0.18, 0.16, 0.12, 0.08,
    0.06, 0.09, 0.10, 0.07, 0.05,
], dtype=np.float32)

center = [6, 7, 11, 12]  # flat indices of the bright 2x2 block
lower = [16, 17]         # flat indices of the two cells just below it

# Encoder: each latent number is the average of one pixel group.
W_encoder = np.zeros((25, 2), dtype=np.float32)
W_encoder[center, 0] = 0.25
W_encoder[lower, 1] = 0.50

# Decoder: paint each latent value back onto its group and nearby cells.
W_decoder = np.zeros((2, 25), dtype=np.float32)
W_decoder[0, center] = 1.0
W_decoder[0, [5, 8, 10, 13]] = 0.12
W_decoder[1, lower] = 1.0
W_decoder[1, [15, 18, 21, 22]] = 0.35

z = x @ W_encoder                                    # 25 numbers -> 2
x_hat = np.clip(z @ W_decoder + 0.08, 0.0, 1.0)      # 2 numbers -> 25
mse = np.mean((x - x_hat) ** 2)

print("input numbers:", x.size)
print("latent numbers:", z.size, np.round(z, 3))
print("reconstruction MSE:", round(float(mse), 4))
```

```text
input numbers: 25
latent numbers: 2 [0.888 0.17 ]
reconstruction MSE: 0.0027
```

The input shrinks from 25 numbers to 2. This crafted decoder does well on this one pattern because we built its weights around that pattern. A model becomes useful only when its weights are learned across many examples.
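To make the "built around that pattern" point concrete, the sketch below applies the same hand-set weights to a second patch whose bright cells sit one row lower. The second patch reuses the lower-mark values from this lesson's VAE example; the helper name `reconstruction_mse` is an assumption introduced here.

```python
import numpy as np

center = [6, 7, 11, 12]
lower = [16, 17]

# Same hand-designed weights as in the lesson's first example.
W_encoder = np.zeros((25, 2), dtype=np.float32)
W_encoder[center, 0] = 0.25
W_encoder[lower, 1] = 0.50
W_decoder = np.zeros((2, 25), dtype=np.float32)
W_decoder[0, center] = 1.0
W_decoder[0, [5, 8, 10, 13]] = 0.12
W_decoder[1, lower] = 1.0
W_decoder[1, [15, 18, 21, 22]] = 0.35


def reconstruction_mse(patch):
    """Encode to 2 numbers, decode back to 25, and return the MSE."""
    z = patch @ W_encoder
    patch_hat = np.clip(z @ W_decoder + 0.08, 0.0, 1.0)
    return float(np.mean((patch - patch_hat) ** 2))


# Pattern the weights were designed for: bright 2x2 block in the center.
center_patch = np.array([
    0.08, 0.12, 0.09, 0.11, 0.07,
    0.15, 0.85, 0.88, 0.14, 0.10,
    0.11, 0.90, 0.92, 0.13, 0.09,
    0.07, 0.18, 0.16, 0.12, 0.08,
    0.06, 0.09, 0.10, 0.07, 0.05,
], dtype=np.float32)

# Different pattern: the bright cells sit in the fourth row instead.
lower_patch = np.array([
    0.04, 0.07, 0.06, 0.08, 0.05,
    0.10, 0.15, 0.12, 0.10, 0.06,
    0.08, 0.18, 0.16, 0.09, 0.06,
    0.06, 0.84, 0.88, 0.11, 0.07,
    0.04, 0.07, 0.08, 0.05, 0.04,
], dtype=np.float32)

mse_center = reconstruction_mse(center_patch)
mse_lower = reconstruction_mse(lower_patch)
print(f"designed-for pattern MSE: {mse_center:.4f}")
print(f"shifted pattern MSE:      {mse_lower:.4f}")
```

The shifted patch reconstructs noticeably worse, because nothing in the fixed weights adapts to it. Learned weights are meant to close that gap across a whole family of inputs.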
The next program makes noisy variations of the same inspection-image crop and trains an autoencoder in PyTorch. PyTorch is the right tool here because the weights must receive gradients and optimizer updates.
```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(3)
base = torch.tensor([
    0.08, 0.12, 0.09, 0.11, 0.07,
    0.15, 0.85, 0.88, 0.14, 0.10,
    0.11, 0.90, 0.92, 0.13, 0.09,
    0.07, 0.18, 0.16, 0.12, 0.08,
    0.06, 0.09, 0.10, 0.07, 0.05,
])
# 48 noisy copies of the base patch, kept inside the valid pixel range.
patches = torch.clamp(base + 0.025 * torch.randn(48, 25), 0.0, 1.0)

# 25 -> 8 -> 2 encoder and 2 -> 8 -> 25 decoder; Sigmoid keeps outputs in (0, 1).
encoder = nn.Sequential(nn.Linear(25, 8), nn.Tanh(), nn.Linear(8, 2))
decoder = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 25), nn.Sigmoid())
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=0.03
)

with torch.no_grad():
    before = F.mse_loss(decoder(encoder(patches)), patches).item()

# Full-batch training on reconstruction error only.
for _ in range(250):
    reconstruction = decoder(encoder(patches))
    loss = F.mse_loss(reconstruction, patches)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    after = F.mse_loss(decoder(encoder(patches)), patches).item()
    latent = encoder(patches[:1])

print(f"MSE before training: {before:.4f}")
print(f"MSE after training: {after:.4f}")
print("latent shape:", tuple(latent.shape))
```

```text
MSE before training: 0.1537
MSE after training: 0.0006
latent shape: (1, 2)
```

The decoder learned to reconstruct this small family of patches through two latent numbers. It wasn't trained to decode arbitrary random pairs of numbers. That's the generation problem: a deterministic autoencoder defines codes for observed inputs, but its loss doesn't make a simple random distribution over codes useful.
| Model question | Deterministic autoencoder answer |
|---|---|
| What does the encoder output? | One code for each input |
| What is optimized? | Reconstruction of encoded training examples |
| Can you decode a known encoded patch? | Yes, if reconstruction training worked |
| Is a randomly sampled $z$ trained as an input source? | No |
A VAE makes prior sampling useful through its training objective. Instead of producing one code, the encoder predicts the parameters of an approximate posterior, written $q_\phi(z \mid x)$. The word approximate matters: computing the exact posterior is generally impractical, so the encoder learns a tractable stand-in. For the common diagonal-Gaussian case, it predicts a mean vector $\mu_\phi(x)$ and a log-variance vector $\log \sigma_\phi^2(x)$:

$$q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)$$

The model also declares a simple prior, usually:

$$p(z) = \mathcal{N}(0, I)$$
The posterior describes codes for one observed patch. The prior is the distribution we want to sample from when no input patch is provided.
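The difference between the two distributions can be seen by drawing from each. This is a minimal sketch with hypothetical posterior parameters for a single patch (the `mu` and `sigma` values are assumptions, not outputs of a trained encoder).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical diagonal-Gaussian posterior for one observed patch.
mu = np.array([0.40, -0.20])
sigma = np.array([0.705, 1.051])

# Posterior samples: codes that are plausible for this particular patch.
posterior_samples = rng.normal(loc=mu, scale=sigma, size=(10_000, 2))

# Prior samples: codes drawn with no input patch at all.
prior_samples = rng.standard_normal(size=(10_000, 2))

print("posterior mean:", np.round(posterior_samples.mean(axis=0), 2))
print("posterior std: ", np.round(posterior_samples.std(axis=0), 2))
print("prior mean:    ", np.round(prior_samples.mean(axis=0), 2))
print("prior std:     ", np.round(prior_samples.std(axis=0), 2))
```

Posterior samples cluster around this patch's own mean and spread, while prior samples are centered at zero with unit spread. Training has to bring the two close enough that prior samples land where the decoder has seen codes.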
Why predict logvar rather than a raw standard deviation? A neural layer may output any real number, while a standard deviation must be positive. Exponentiating half the log variance converts any real output into a valid standard deviation.
```python
import numpy as np

logvar = np.array([-0.70, 0.10])
sigma = np.exp(0.5 * logvar)   # standard deviation: exp of half the log variance
variance = np.exp(logvar)      # variance: exp of the log variance

print("log variance:", logvar)
print("variance:", np.round(variance, 3))
print("standard deviation:", np.round(sigma, 3))
print("all sigma positive:", bool(np.all(sigma > 0)))
```

```text
log variance: [-0.7  0.1]
variance: [0.497 1.105]
standard deviation: [0.705 1.051]
all sigma positive: True
```

If the encoder may put each patch distribution anywhere it likes, prior samples still won't reliably land where the decoder has trained. VAEs penalize mismatch between the posterior and prior with Kullback-Leibler (KL) divergence. KL isn't a symmetric distance metric: direction matters. For a diagonal Gaussian posterior and standard-normal prior, the VAE uses:

$$D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = -\frac{1}{2} \sum_{j} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$
The penalty is zero when $\mu = 0$ and $\sigma^2 = 1$ in every latent dimension. Moving means away from zero or making variances very different from one increases it.
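The remark that direction matters can be checked numerically. The sketch below uses the standard closed form for the KL divergence between two one-dimensional Gaussians; the function name `kl_gaussian` and the example values are assumptions chosen for illustration.

```python
import numpy as np


def kl_gaussian(mu_a, sigma_a, mu_b, sigma_b):
    """KL( N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2) ) for scalar Gaussians."""
    return float(
        np.log(sigma_b / sigma_a)
        + (sigma_a**2 + (mu_a - mu_b) ** 2) / (2.0 * sigma_b**2)
        - 0.5
    )


# Hypothetical posterior q = N(1.0, 0.5^2) and standard-normal prior p.
q_mu, q_sigma = 1.0, 0.5
p_mu, p_sigma = 0.0, 1.0

kl_q_to_p = kl_gaussian(q_mu, q_sigma, p_mu, p_sigma)  # direction used by the VAE
kl_p_to_q = kl_gaussian(p_mu, p_sigma, q_mu, q_sigma)  # reversed direction

# The VAE formula, written with log variance, for the same posterior.
logvar = np.log(q_sigma**2)
vae_form = float(-0.5 * (1 + logvar - q_mu**2 - np.exp(logvar)))

print(f"KL(q || p) = {kl_q_to_p:.4f}")
print(f"KL(p || q) = {kl_p_to_q:.4f}")
print(f"VAE formula = {vae_form:.4f}")
```

For these values KL(q || p) is about 0.818 and KL(p || q) is about 2.807, so swapping the arguments changes the penalty. The VAE formula reproduces the first direction.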
The function below evaluates this penalty for three posteriors:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL from a diagonal Gaussian to N(0, I), summed over dimensions.
    value = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return max(0.0, float(value))  # Remove negative zero from roundoff.

# Each entry is (mu, logvar) for a two-dimensional posterior.
posteriors = {
    "matches prior": (np.array([0.0, 0.0]), np.array([0.0, 0.0])),
    "small offset": (np.array([0.4, -0.2]), np.array([-0.7, 0.1])),
    "far offset": (np.array([2.0, -1.5]), np.array([-1.5, 1.2])),
}

for name, (mu, logvar) in posteriors.items():
    print(f"{name:13} KL = {kl_to_standard_normal(mu, logvar):.4f}")
```

```text
matches prior KL = 0.0000
small offset  KL = 0.2009
far offset    KL = 4.0466
```

KL pressure isn't free. If it dominates reconstruction, all inputs can be encoded nearly like the prior and the decoder may ignore $z$. The posterior-collapse diagnostic later in this lesson catches that failure.
Kingma and Welling's VAE estimator rewrites random sampling so gradients can reach encoder parameters.[1] Draw independent noise $\epsilon$ from a standard normal and then transform it:

$$\epsilon \sim \mathcal{N}(0, I), \qquad z = \mu + \sigma \odot \epsilon$$

During one backward pass, $\epsilon$ is a sampled constant. The multiplication and addition still expose how the loss changes when the encoder changes $\mu$ or $\sigma$.
```python
import numpy as np

mu = np.array([0.40, -0.20])
logvar = np.array([-0.70, 0.10])
epsilon = np.array([0.50, -1.00])  # fixed noise draw, chosen by hand here

sigma = np.exp(0.5 * logvar)
z = mu + sigma * epsilon           # reparameterized sample

print("mu:", np.round(mu, 3))
print("sigma:", np.round(sigma, 3))
print("epsilon:", np.round(epsilon, 3))
print("sampled z:", np.round(z, 3))
```

```text
mu: [ 0.4 -0.2]
sigma: [0.705 1.051]
epsilon: [ 0.5 -1. ]
sampled z: [ 0.752 -1.251]
```
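The gradient claim can be checked with PyTorch autograd. This is a minimal sketch using the same numbers as the NumPy example; `toy_loss` is a hypothetical stand-in for a real decoder loss and simply sums the sampled code.

```python
import torch

# Encoder outputs are the quantities we want gradients for.
mu = torch.tensor([0.40, -0.20], requires_grad=True)
logvar = torch.tensor([-0.70, 0.10], requires_grad=True)

# Noise is sampled outside the graph, so it acts as a constant.
epsilon = torch.tensor([0.50, -1.00])

sigma = torch.exp(0.5 * logvar)
z = mu + sigma * epsilon

toy_loss = z.sum()  # hypothetical loss, only for inspecting gradients
toy_loss.backward()

# Analytic gradients: dz/dmu = 1 and dz/dlogvar = 0.5 * sigma * epsilon.
expected_logvar_grad = (0.5 * sigma * epsilon).detach()

print("grad wrt mu:    ", mu.grad.tolist())
print("grad wrt logvar:", logvar.grad.tolist())
print("analytic logvar:", expected_logvar_grad.tolist())
```

Both `mu.grad` and `logvar.grad` are populated, which is the point of the rewrite: the randomness lives in `epsilon`, while the path from the loss to the encoder outputs is ordinary arithmetic.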
The VAE optimizes an evidence lower bound (ELBO) on the data log-likelihood.[1] Written as a quantity to maximize:

$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

The first term rewards a decoder that makes the observed patch likely after sampling its latent. The second discourages a posterior that moves too far from the prior.

In code we usually minimize the negative ELBO:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

For $\beta = 1$, this corresponds to the standard objective when $\mathcal{L}_{\text{recon}}$ is the selected negative log-likelihood. An MSE reconstruction term is proportional to a Gaussian decoder negative log-likelihood only after choosing a fixed observation variance, so treat the scale of MSE and KL as a modeling choice, not an identity.
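The MSE caveat can be made concrete. The sketch below assumes an independent Gaussian decoder with one fixed observation standard deviation (`obs_sigma`, a name introduced here) and checks that its negative log-likelihood equals a scaled MSE plus a constant.

```python
import numpy as np

x = np.array([0.10, 0.90, 0.85, 0.12])
x_hat = np.array([0.12, 0.84, 0.80, 0.14])
n = x.size
mse = float(np.mean((x - x_hat) ** 2))


def gaussian_nll(x, x_hat, obs_sigma):
    """Negative log-likelihood of x under N(x_hat, obs_sigma^2 I), summed over pixels."""
    var = obs_sigma**2
    return float(np.sum(0.5 * (x - x_hat) ** 2 / var + 0.5 * np.log(2 * np.pi * var)))


for obs_sigma in [0.05, 0.1, 1.0]:
    nll = gaussian_nll(x, x_hat, obs_sigma)
    scale = n / (2 * obs_sigma**2)                         # multiplier on the MSE
    const = 0.5 * n * np.log(2 * np.pi * obs_sigma**2)     # does not depend on x_hat
    rebuilt = scale * mse + const
    print(f"obs_sigma={obs_sigma:<4}: NLL={nll:9.4f}  scale*MSE+const={rebuilt:9.4f}  scale={scale:.1f}")
```

The multiplier on the MSE changes by orders of magnitude with the assumed observation variance. Picking that variance, or equivalently picking $\beta$ for a plain MSE term, sets how strongly reconstruction competes with KL.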
This ledger computes both terms for one patch using a fixed hypothetical reconstruction:
```python
import numpy as np

x = np.array([0.10, 0.90, 0.85, 0.12])
x_hat = np.array([0.12, 0.84, 0.80, 0.14])  # fixed hypothetical reconstruction
mu = np.array([0.40, -0.20])
logvar = np.array([-0.70, 0.10])

reconstruction_mse = np.mean((x - x_hat) ** 2)
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

# Only the weighting changes; the two terms themselves stay fixed.
for beta in [0.0, 0.1, 1.0]:
    total = reconstruction_mse + beta * kl
    print(f"beta={beta:.1f}: recon={reconstruction_mse:.4f} KL={kl:.4f} total={total:.4f}")
```

```text
beta=0.0: recon=0.0017 KL=0.2009 total=0.0017
beta=0.1: recon=0.0017 KL=0.2009 total=0.0218
beta=1.0: recon=0.0017 KL=0.2009 total=0.2026
```

Higgins et al. studied the $\beta$-VAE objective, where increasing $\beta$ applies stronger latent-capacity and independence pressure while trading against reconstruction accuracy.[2] That trade-off must be measured on your task; a higher value isn't automatically better.
```python
# Illustrative values only: one (reconstruction, KL) pair per beta setting.
reconstruction_losses = [0.004, 0.010, 0.028]
kl_losses = [1.20, 0.42, 0.08]
betas = [0.01, 0.10, 1.00]

for beta, recon, kl in zip(betas, reconstruction_losses, kl_losses):
    total = recon + beta * kl
    print(f"beta={beta:>4.2f}: reconstruction={recon:.3f}, KL={kl:.2f}, objective={total:.3f}")
```

```text
beta=0.01: reconstruction=0.004, KL=1.20, objective=0.016
beta=0.10: reconstruction=0.010, KL=0.42, objective=0.052
beta=1.00: reconstruction=0.028, KL=0.08, objective=0.108
```

These rows are an accounting exercise, not a reported experiment. They show why you should log reconstruction and KL separately instead of reading only the total.
Now train a real two-dimensional VAE on two small families of noisy patches: one with bright center cells and one with bright lower cells. The sample batch at the end demonstrates the generative interface. It doesn't claim photographic quality from this miniature dataset.
```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(8)
center_mark = torch.tensor([
    0.05, 0.10, 0.08, 0.10, 0.06,
    0.12, 0.88, 0.86, 0.13, 0.08,
    0.10, 0.91, 0.90, 0.12, 0.07,
    0.06, 0.15, 0.14, 0.10, 0.06,
    0.05, 0.08, 0.08, 0.05, 0.04,
])
lower_mark = torch.tensor([
    0.04, 0.07, 0.06, 0.08, 0.05,
    0.10, 0.15, 0.12, 0.10, 0.06,
    0.08, 0.18, 0.16, 0.09, 0.06,
    0.06, 0.84, 0.88, 0.11, 0.07,
    0.04, 0.07, 0.08, 0.05, 0.04,
])
# 32 noisy copies of each pattern, 64 patches in total.
patches = torch.cat([
    torch.clamp(center_mark + 0.02 * torch.randn(32, 25), 0.0, 1.0),
    torch.clamp(lower_mark + 0.02 * torch.randn(32, 25), 0.0, 1.0),
])

class TinyVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(25, 12), nn.Tanh())
        self.mu = nn.Linear(12, 2)       # posterior mean head
        self.logvar = nn.Linear(12, 2)   # posterior log-variance head
        self.decoder = nn.Sequential(
            nn.Linear(2, 12), nn.Tanh(), nn.Linear(12, 25), nn.Sigmoid()
        )

    def forward(self, x):
        hidden = self.encoder(x)
        mu = self.mu(hidden)
        logvar = self.logvar(hidden)
        # Reparameterized sample: z = mu + sigma * epsilon.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)
beta = 0.02

def loss_terms(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x)
    # KL summed over latent dimensions, then averaged over the batch.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
    return recon + beta * kl, recon, kl

for _ in range(400):
    x_hat, mu, logvar = model(patches)
    loss, recon, kl = loss_terms(x_hat, patches, mu, logvar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.manual_seed(99)
with torch.no_grad():
    x_hat, mu, logvar = model(patches)
    loss, recon, kl = loss_terms(x_hat, patches, mu, logvar)
    # Generation: decode three codes drawn from the prior N(0, I).
    generated = model.decoder(torch.randn(3, 2))

print(f"reconstruction={recon.item():.4f} KL={kl.item():.4f} total={loss.item():.4f}")
print("generated batch shape:", tuple(generated.shape))
```

```text
reconstruction=0.0075 KL=0.7382 total=0.0222
generated batch shape: (3, 25)
```

Each object in the program now has a job: `mu` and `logvar` define each approximate posterior, `torch.randn_like` supplies $\epsilon$, `decoder` consumes sampled $z$, and the logged losses reveal the compromise.
If a decoder can reconstruct without using its latent input, KL pressure can push every posterior close to the prior. Then the latent code stops carrying information. This failure is called posterior collapse. Van den Oord et al. describe it as a common issue for VAEs paired with high-capacity autoregressive decoders.[3]
A first diagnostic is to track KL per latent dimension. The illustrative monitor below flags a run whose KL has nearly vanished in every dimension:
```python
import numpy as np

# Illustrative per-dimension KL values logged from two hypothetical runs.
logged_kl_by_dimension = {
    "using latent": np.array([0.31, 0.18]),
    "suspected collapse": np.array([0.0003, 0.0001]),
}
threshold = 0.001

for run_name, kl_dims in logged_kl_by_dimension.items():
    # Flag a run only when every dimension is below the threshold.
    collapsed = bool(np.all(kl_dims < threshold))
    print(f"{run_name:18} total_KL={kl_dims.sum():.4f} collapsed={collapsed}")
```

```text
using latent       total_KL=0.4900 collapsed=False
suspected collapse total_KL=0.0004 collapsed=True
```

Low KL alone isn't proof of failure. Pair it with reconstruction quality, generated samples, and checks that predictions change when $z$ changes. Still, a rapid collapse toward zero is worth investigating.
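The monitor above starts from already-logged values. One way to produce them from encoder outputs is sketched below; the helper `kl_per_dimension`, the example tensors, and the 0.001 threshold are assumptions for illustration, not part of the training program in this lesson.

```python
import torch


def kl_per_dimension(mu, logvar):
    """Gaussian KL to N(0, I), kept separate per latent dimension and averaged over the batch."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())  # shape: (batch, latent_dim)
    return kl.mean(dim=0)                                # shape: (latent_dim,)


# Hypothetical encoder outputs for two patches and two latent dimensions.
# Dimension 0 varies with the input; dimension 1 sits exactly on the prior.
mu = torch.tensor([[0.40, 0.00],
                   [-0.40, 0.00]])
logvar = torch.tensor([[-0.70, 0.00],
                       [-0.70, 0.00]])

kl_dims = kl_per_dimension(mu, logvar)
threshold = 0.001
inactive = kl_dims < threshold

print("KL per dimension:", [round(v, 4) for v in kl_dims.tolist()])
print("inactive dimensions:", inactive.tolist())
```

Summing `kl_dims` recovers the batch-averaged KL term used in the loss, so logging the vector adds detail without changing the objective. A dimension flagged as inactive carries almost no information about the input.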
A Vector Quantised Variational Autoencoder (VQ-VAE) takes a different route. Its encoder output is mapped to the nearest vector in a learned codebook, so its latent representation is discrete rather than Gaussian. The original VQ-VAE paper also learns a prior over those discrete codes.[3]
For one encoder vector, quantization is a nearest-neighbor lookup:
```python
import numpy as np

encoder_output = np.array([0.72, -0.10])
codebook = np.array([
    [-0.80, 0.20],
    [0.65, -0.05],
    [0.10, 0.90],
])

# Squared Euclidean distance from the encoder vector to every codebook entry.
squared_distances = ((codebook - encoder_output) ** 2).sum(axis=1)
index = int(squared_distances.argmin())
quantized = codebook[index]

print("squared distances:", np.round(squared_distances, 3))
print("selected code ID:", index)
print("decoder receives:", quantized)
```

```text
squared distances: [2.4   0.007 1.384]
selected code ID: 1
decoder receives: [ 0.65 -0.05]
```

Nearest-code selection isn't differentiable. VQ-VAE uses a straight-through gradient estimator for the encoder path plus codebook and commitment losses so encoder outputs stay close to useful entries.[3] The important distinction is interface:
| Bottleneck | Decoder receives | How a new latent is obtained |
|---|---|---|
| Autoencoder | Deterministic vector | Encode an input |
| VAE | Continuous sampled vector | Sample from a prior after training |
| VQ-VAE | Codebook vector selected by an ID | Model or choose discrete code IDs |
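The straight-through idea mentioned for VQ-VAE can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the paper's reference code: `decoder_loss` is a hypothetical stand-in for a real reconstruction loss, and `commitment_weight = 0.25` is an assumed setting.

```python
import torch
import torch.nn.functional as F

encoder_output = torch.tensor([[0.72, -0.10]], requires_grad=True)
codebook = torch.tensor([
    [-0.80, 0.20],
    [0.65, -0.05],
    [0.10, 0.90],
], requires_grad=True)
commitment_weight = 0.25  # assumed value for this sketch

# Nearest-neighbor lookup: argmin itself passes no gradient.
squared_distances = ((encoder_output[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = squared_distances.argmin(dim=1)
quantized = codebook[indices]

# Straight-through: forward value equals the codebook vector,
# backward gradient is copied to the encoder output unchanged.
quantized_st = encoder_output + (quantized - encoder_output).detach()

# Codebook loss moves the selected entry toward the encoder output;
# commitment loss moves the encoder output toward the selected entry.
codebook_loss = F.mse_loss(quantized, encoder_output.detach())
commitment_loss = F.mse_loss(encoder_output, quantized.detach())

decoder_loss = quantized_st.sum()  # hypothetical stand-in for a decoder loss
total = decoder_loss + codebook_loss + commitment_weight * commitment_loss
total.backward()

print("selected code ID:", indices.item())
print("encoder gradient:", encoder_output.grad.tolist())
print("codebook gradient:", codebook.grad.tolist())
```

The encoder receives the decoder gradient even though the lookup is discrete, and only the selected codebook row gets a nonzero gradient. That matches the table's description: the decoder sees a codebook vector chosen by an ID.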
Rombach et al. train diffusion models in the latent space of pretrained autoencoders rather than applying all denoising operations directly to pixels.[4] This makes the autoencoder's role precise: compress image data before expensive generation steps, then decode a final latent back to pixels.
The size reduction itself is simple arithmetic. A downsampling factor of 8 reduces each spatial axis by 8, so it reduces spatial positions by 64:
```python
height, width = 512, 512
downsample = 8
latent_height = height // downsample
latent_width = width // downsample

# Count spatial positions only; channels are ignored here.
pixel_positions = height * width
latent_positions = latent_height * latent_width

print("pixel grid:", (height, width), "positions:", pixel_positions)
print("latent grid:", (latent_height, latent_width), "positions:", latent_positions)
print("spatial reduction:", pixel_positions // latent_positions, "x")
```

```text
pixel grid: (512, 512) positions: 262144
latent grid: (64, 64) positions: 4096
spatial reduction: 64 x
```

Channels and decoder quality still matter, and individual systems choose their own latent representation. The durable idea is that a learned bottleneck can make later modeling cheaper while retaining useful structure.
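To see how channels change the count, the sketch below repeats the arithmetic with channel dimensions included. The 3 pixel channels (RGB) and 4 latent channels are assumptions for illustration; a given system may use different values.

```python
height, width = 512, 512
downsample = 8
pixel_channels = 3    # assumption: RGB input
latent_channels = 4   # assumption: hypothetical latent channel count

latent_height = height // downsample
latent_width = width // downsample

# Total stored numbers, not just spatial positions.
pixel_values = height * width * pixel_channels
latent_values = latent_height * latent_width * latent_channels
value_reduction = pixel_values / latent_values

print("pixel values:", pixel_values)
print("latent values:", latent_values)
print("value reduction:", value_reduction, "x")
```

Under these assumptions the number of stored values shrinks by a factor of 48 rather than 64, because the latent keeps more channels per position than the image does.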
Compute `sigma = exp(0.5 * logvar)`, not `exp(logvar)`, before reparameterized sampling. If samples or KL become NaN or Inf, inspect `logvar` for unstable extremes before adding an explicit clamp.

Symptom: samples or KL become NaN or Inf. Cause: Scale parameterization is wrong or `logvar` became extreme before exponentiation. Fix: Compute `sigma = exp(0.5 * logvar)`, inspect the `logvar` range, and stabilize optimization before adding an explicit clamp.
[1] Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014.
[2] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., & Lerchner, A. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017.
[3] van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning.
[4] Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.