Master diffusion models from the forward noising process and DDPM training to Classifier-Free Guidance, latent diffusion, DiT backbones, and fast sampling trade-offs.
Multimodal LLM Architecture showed how systems consume visual and audio evidence. Diffusion flips direction: it builds images from conditioning signals by iteratively denoising a latent state.
Diffusion models generate images by learning how to remove noise step by step. This chapter explains the architecture and product design tradeoffs behind text-to-image systems, including conditioning, sampling, safety, and evaluation.
Imagine a furniture catalog content team preparing campaign concepts for ten thousand products. A diffusion system can propose lifestyle scenes from prompts such as "mid-century oak dining table in natural light," then a reviewer can select variants for marketing work. It cannot replace factual product photography when customers need to see the actual SKU, finish, damage, or fit.
The core trick is iterative refinement. Instead of guessing the final image in one shot, the model starts from random noise and repeatedly updates it toward a structured sample. Think of it like a sculptor who roughs out the general shape before adding fine details.
An online fashion retailer might generate approved background variants or campaign mockups from prompts like "white linen shirt on a neutral mannequin, soft studio lighting." Product-detail views, virtual try-on claims, and package-condition evidence need separate fidelity validation and review because a plausible-looking output can invent material, fit, branding, or damage.
For an e-commerce or logistics platform, the same mechanics can generate ad concepts, synthetic training examples, warehouse-layout visualizations, or explicitly labeled package-damage mockups from text and reference images. Keep that governed product-media lens in mind while the math stays general.
If you compare diffusion to one-pass generators, the defining difference is iterative refinement. The model learns a reverse-time denoising process that turns Gaussian noise into a structured sample through repeated model evaluations.[1] An unaccelerated sampler therefore performs more denoiser work than a single generator pass, while giving the system repeated opportunities to inject conditioning such as text, masks, depth maps, or reference images.
Classical DDPM training starts from a variational lower bound, but its common simple objective is a noise-prediction regression loss derived from it. Unlike Generative Adversarial Networks (GANs), this removes the adversarial minimax game; it does not by itself prove coverage, memorization safety, or output suitability.
A natural first guess is to train a single neural network that maps a random vector straight to a 512×512 image. GANs pair that generator with a discriminator. Their adversarial objective can be difficult to tune and can suffer mode collapse, while strong GAN implementations can still produce excellent samples. Diffusion changes the training objective and serving cost rather than guaranteeing better output for every domain.
Diffusion models take a different path. They don't ask the network to produce a clean image in one step. They ask it to remove a tiny bit of noise. Repeating that cleanup hundreds of times turns a random field into a structured image. The task at each step is much simpler, the training objective is a plain supervised loss, and the model naturally supports conditioning because the prompt can steer every step.
GANs, Variational Autoencoders (VAEs), and diffusion models make different trade-offs between sample quality, coverage, likelihood modeling, and inference cost. Treat the table as an architecture comparison, not a universal ranking:
| Feature | GANs | VAEs | Diffusion Models |
|---|---|---|---|
| Training Objective | Minimax Game (Adversarial) | ELBO (Evidence Lower Bound) | Variational ELBO / MSE |
| Sample Quality | Can be sharp; data and tuning dependent | Often reconstruction-smoothed | Strong results; model and sampler dependent |
| Coverage Risk | Mode collapse is an explicit risk | Compression can remove detail | Must evaluate diversity and memorization |
| Training Behavior | Adversarial instability risk | Stable reconstruction objective | Supervised denoising objective |
| Inference Work | One generator pass | One decoder pass | Repeated denoiser evaluations unless accelerated |
| Likelihood | Implicit | Approximate | Tractable lower bound |
Think of this like a drop of ink dispersing in a glass of water. At first, the ink is a structured blob (the data). Over time, it diffuses until the water is uniformly colored (pure noise). The forward process mathematically describes this diffusion, drawing its principles from non-equilibrium thermodynamics.[2]
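Starting from a clean image $x_0$, we iteratively add Gaussian noise over $T$ steps. This process is a fixed Markov chain, which is just a sequence where each new state depends only on the previous one, not on the full history. It gradually destroys structure until the data becomes indistinguishable from isotropic Gaussian noise, meaning random static that has the same variance on every pixel and every color channel.[2]

The transition kernel is defined as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\right)$$

Where $\beta_t$ is the noise schedule (typically linear from $10^{-4}$ to $0.02$). Because the sum of independent Gaussians is also a Gaussian, we can marginalize over intermediate steps. Let $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1 - \bar\alpha_t) I\right), \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

At $t = T$ (the final step), $\bar\alpha_T \approx 0$, so $x_T$ is approximately distributed as $\mathcal{N}(0, I)$ (pure isotropic Gaussian noise). This reparameterization trick lets us sample $x_t$ at any arbitrary timestep directly from $x_0$ without simulating the chain.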
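The marginalization step can be checked numerically. The sketch below assumes a linear DDPM-style schedule ($\beta$ from $10^{-4}$ to $0.02$ over 1000 steps, chosen for illustration) and applies the chain one step at a time, tracking the coefficient on $x_0$ and the accumulated noise variance. The helper name `linear_betas` is hypothetical.

```python
import math


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list[float]:
    """Linearly spaced betas (assumed DDPM-style schedule, for illustration)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


betas = linear_betas(1000)

# Closed form: alpha_bar_t is the running product of (1 - beta_s).
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Step-by-step chain: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise.
# Track the coefficient on x_0 and the accumulated noise variance.
signal_coeff = 1.0
noise_var = 0.0
max_gap = 0.0
for beta, alpha_bar in zip(betas, alpha_bars):
    signal_coeff *= math.sqrt(1.0 - beta)
    noise_var = (1.0 - beta) * noise_var + beta
    gap = max(
        abs(signal_coeff - math.sqrt(alpha_bar)),
        abs(noise_var - (1.0 - alpha_bar)),
    )
    max_gap = max(max_gap, gap)

print(f"alpha_bar at final step: {alpha_bars[-1]:.2e}")
print(f"largest chain vs closed-form gap: {max_gap:.2e}")
```

The gap stays at floating-point rounding level, which is the scalar version of the claim that one closed-form draw replaces the whole chain. The final $\bar\alpha_T$ is small but not exactly zero under this schedule.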
To see how the signal fades, imagine a single normalized pixel with initial brightness $x_0 = 1$ and a schedule where $\bar\alpha_t = 0.5$ at some intermediate step $t$:

$$x_t = \sqrt{0.5} \cdot 1 + \sqrt{0.5}\,\epsilon \approx 0.707 + 0.707\,\epsilon$$

About half the original signal power remains, mixed with noise. By the final step $T$, $\bar\alpha_T \approx 0$, so the $\sqrt{\bar\alpha_T}\,x_0$ term is nearly zero and the image is almost pure noise. This is why the model sees very different tasks at early and late timesteps. At small $t$ it removes corruption from an input that still retains recognizable structure. At large $t$ it must infer plausible structure from very little remaining signal.
The function below demonstrates the closed-form sampling equation on a single normalized pixel. Real code applies the same formula elementwise to an image or latent tensor.
```python
import json
import math


def forward_diffusion_value(x0: float, alpha_bar: float, epsilon: float) -> dict[str, float]:
    # Closed-form forward sample: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * epsilon
    signal = math.sqrt(alpha_bar) * x0
    noise = math.sqrt(1 - alpha_bar) * epsilon
    return {
        "alpha_bar": alpha_bar,
        "signal": round(signal, 3),
        "noise": round(noise, 3),
        "xt": round(signal + noise, 3),
        "snr": round(alpha_bar / (1 - alpha_bar), 3),
    }


x0 = 1.0
epsilon = -0.4  # one fixed noise draw, reused so the rows are comparable
steps = [
    forward_diffusion_value(x0, alpha_bar, epsilon)
    for alpha_bar in [0.9, 0.5, 0.001]
]

print(json.dumps(steps, indent=2))
```

```json
[
  {
    "alpha_bar": 0.9,
    "signal": 0.949,
    "noise": -0.126,
    "xt": 0.822,
    "snr": 9.0
  },
  {
    "alpha_bar": 0.5,
    "signal": 0.707,
    "noise": -0.283,
    "xt": 0.424,
    "snr": 1.0
  },
  {
    "alpha_bar": 0.001,
    "signal": 0.032,
    "noise": -0.4,
    "xt": -0.368,
    "snr": 0.001
  }
]
```

The goal is to learn the reverse process $p_\theta(x_{t-1} \mid x_t)$, which corresponds to denoising the image step-by-step.[1]
The subtle point is that the exact reverse posterior $q(x_{t-1} \mid x_t)$ is intractable because it marginalizes over every possible clean image that could have produced $x_t$. What is tractable is $q(x_{t-1} \mid x_t, x_0)$, which stays Gaussian. DDPM uses that closed form in the derivation, then trains a model that predicts enough information from $x_t$ and $t$ to approximate the reverse step without ever observing $x_0$ at inference time.[1]
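To make the tractable posterior concrete, here is a minimal scalar sketch. It assumes the standard DDPM expressions for the posterior mean and variance, and checks that the mean written in terms of the clean value matches the mean written in terms of the noise, which is the form DDPM-style samplers evaluate with a predicted noise. All numeric values are illustrative assumptions, not outputs of a trained model.

```python
import math

# Toy per-step schedule values (illustrative assumptions).
beta_t = 0.03
alpha_t = 1.0 - beta_t
alpha_bar_prev = 0.98  # alpha_bar at step t-1
alpha_bar_t = alpha_bar_prev * alpha_t

x0 = 0.8        # clean value, known only during training
epsilon = -0.4  # noise used to build x_t

# Closed-form forward sample.
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * epsilon

# Posterior q(x_{t-1} | x_t, x_0): mean written in terms of x_0 and x_t.
coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
mean_from_x0 = coef_x0 * x0 + coef_xt * x_t

# Same mean after substituting x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t).
mean_from_eps = (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * epsilon) / math.sqrt(alpha_t)

# Posterior variance is a rescaled beta_t and never exceeds it.
posterior_variance = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t

print(round(mean_from_x0, 6), round(mean_from_eps, 6), round(posterior_variance, 6))
```

Because the two means agree, a network that predicts the noise carries the same information as one that predicts the clean value, at least for computing this reverse-step mean.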
The model does not have to predict the clean image directly. In standard DDPM (Denoising Diffusion Probabilistic Models), the model predicts the noise component $\epsilon$ that was added at step $t$.

We train a neural network $\epsilon_\theta$ to predict the noise from a noisy input $x_t$:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right], \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$$

In this formula, we sample a timestep $t$, a clean image $x_0$, and noise $\epsilon \sim \mathcal{N}(0, I)$, then train the network to predict that noise from the noisy input $x_t$. The squared error is averaged over many such samples.

In practice, people usually optimize the "simple" objective: an MSE loss on the predicted noise. It comes from simplifying and reweighting the original ELBO, but the implementation is just a supervised regression problem from $(x_t, t)$ to $\epsilon$.[1] The training step samples a timestep, creates $x_t$, asks the model to predict the known noise, and computes mean squared error:
```python
import json


def mse(true_values: list[float], predicted_values: list[float]) -> float:
    # Mean of squared differences between target noise and predicted noise.
    squared_errors = [
        (true - predicted) ** 2
        for true, predicted in zip(true_values, predicted_values, strict=True)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hand-written stand-ins for the known noise and a model's prediction.
training_rows = [
    {
        "timestep": 100,
        "true_noise": [0.20, -0.10, 0.05],
        "predicted_noise": [0.18, -0.08, 0.02],
    },
    {
        "timestep": 700,
        "true_noise": [0.90, -0.40, 0.30],
        "predicted_noise": [0.72, -0.35, 0.10],
    },
]

losses = [
    {
        "timestep": row["timestep"],
        "mse_loss": round(mse(row["true_noise"], row["predicted_noise"]), 4),
    }
    for row in training_rows
]

print(json.dumps(losses, indent=2))
```

```json
[
  {
    "timestep": 100,
    "mse_loss": 0.0006
  },
  {
    "timestep": 700,
    "mse_loss": 0.025
  }
]
```

The entire training step is a supervised regression problem. You know the exact noise you added, so the model has a ground-truth target. There is no adversarial game and no discriminator. The difficulty is that the same network must learn to denoise at every timestep $t$, from nearly-clean images to nearly-pure static.
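The schedule isn't just bookkeeping. In the closed-form forward equation, the signal coefficient is $\sqrt{\bar\alpha_t}$ and the noise coefficient is $\sqrt{1 - \bar\alpha_t}$, so the effective signal-to-noise ratio at timestep $t$ is:

$$\mathrm{SNR}(t) = \frac{\bar\alpha_t}{1 - \bar\alpha_t}$$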
High-SNR timesteps still contain recognizable structure. Low-SNR timesteps are close to pure noise. If the schedule destroys signal too quickly, many late timesteps become almost uninformative and the model spends capacity denoising inputs with very little semantic content left. That's why schedule design and loss weighting matter in practice: they decide where along the SNR curve the model spends most of its learning budget.
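A quick way to see where a schedule spends its timesteps is to tabulate the SNR along it. The sketch below assumes a 1000-step linear schedule with $\beta$ from $10^{-4}$ to $0.02$ (an illustrative choice) and uses the ratio $\bar\alpha_t / (1 - \bar\alpha_t)$; the 0.01 cutoff for "low signal" is an arbitrary threshold picked for the example.

```python
import math

# Assumed linear schedule for illustration: 1000 steps, beta from 1e-4 to 0.02.
num_steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t), with alpha_bar as a running product.
snr_by_step = []
alpha_bar = 1.0
for beta in betas:
    alpha_bar *= 1.0 - beta
    snr_by_step.append(alpha_bar / (1.0 - alpha_bar))

for t in [1, 250, 500, 750, 1000]:
    snr = snr_by_step[t - 1]
    print(f"t={t:4d}  SNR={snr:12.4f}  log10(SNR)={math.log10(snr):6.2f}")

# Share of timesteps where signal power is below 1% of noise power.
low_signal_share = sum(snr < 0.01 for snr in snr_by_step) / num_steps
print(f"share of steps with SNR < 0.01: {low_signal_share:.2f}")
```

If a large share of steps falls below the threshold, that is a hint to inspect the schedule or the loss weighting before spending more training compute.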
Original DDPM predicts $\epsilon$ directly.[1] That's not the only valid parameterization. Some systems predict the clean sample $x_0$, and others predict a velocity-style target $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1 - \bar\alpha_t}\,x_0$, which is a different linear parameterization of the same reverse step.[3] These targets are mathematically convertible, but they change optimization behavior and how errors are distributed across timesteps. In interviews, it's often enough to explain that the scheduler still defines the reverse update, while the prediction target controls what the network is trained to output.
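The convertibility claim is easy to verify on scalars. This sketch assumes the variance-preserving convention used in this chapter, with signal coefficient $\sqrt{\bar\alpha_t}$ and noise coefficient $\sqrt{1 - \bar\alpha_t}$, and the velocity definition $v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1 - \bar\alpha_t}\,x_0$ from the progressive distillation paper.[3] The function names are hypothetical.

```python
import math


def to_targets(x0: float, epsilon: float, alpha_bar: float) -> dict[str, float]:
    """Build x_t and the velocity-style target from a clean value and its noise."""
    a = math.sqrt(alpha_bar)        # signal coefficient
    s = math.sqrt(1.0 - alpha_bar)  # noise coefficient
    return {"x_t": a * x0 + s * epsilon, "v": a * epsilon - s * x0}


def from_velocity(x_t: float, v: float, alpha_bar: float) -> dict[str, float]:
    """Recover x0 and epsilon from x_t and a velocity-style prediction."""
    a = math.sqrt(alpha_bar)
    s = math.sqrt(1.0 - alpha_bar)
    return {"x0": a * x_t - s * v, "epsilon": s * x_t + a * v}


def x0_from_epsilon(x_t: float, epsilon: float, alpha_bar: float) -> float:
    """Recover x0 from x_t and a noise prediction."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * epsilon) / math.sqrt(alpha_bar)


x0, epsilon, alpha_bar = 0.8, -0.4, 0.5
targets = to_targets(x0, epsilon, alpha_bar)
recovered = from_velocity(targets["x_t"], targets["v"], alpha_bar)
print(recovered, x0_from_epsilon(targets["x_t"], epsilon, alpha_bar))
```

Note that `x0_from_epsilon` divides by $\sqrt{\bar\alpha_t}$, which becomes tiny at high noise levels. That sensitivity is one practical reason the choice of target affects how errors are distributed across timesteps.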
To generate a new image, we start from pure noise and iteratively denoise it using the learned model. Each step depends on the output of the previous step, so the reverse process is sequential even when the model itself is batched efficiently. The simplified scalar example below preserves that dependency while leaving out tensor mechanics:
```python
import json
import math

betas = [0.02, 0.03, 0.04]
alphas = [1 - beta for beta in betas]
alpha_bars: list[float] = []
running = 1.0
for alpha in alphas:
    running *= alpha
    alpha_bars.append(running)


def predicted_noise(x: float, step: int) -> float:
    # Toy stand-in for a trained denoiser.
    return 0.45 * x + 0.02 * step


x = 1.25
trace = []
for step in reversed(range(len(betas))):
    beta_t = betas[step]
    alpha_t = alphas[step]
    alpha_bar_t = alpha_bars[step]
    eps = predicted_noise(x, step)
    # Reverse-step mean only; the stochastic DDPM step would also add
    # sigma_t * z for step > 0, which is omitted to keep the trace deterministic.
    x = (x - (beta_t / math.sqrt(1 - alpha_bar_t)) * eps) / math.sqrt(alpha_t)
    trace.append({
        "step": step,
        "predicted_noise": round(eps, 3),
        "next_x": round(x, 3),
    })

print(json.dumps(trace, indent=2))
```

```json
[
  {
    "step": 2,
    "predicted_noise": 0.603,
    "next_x": 1.193
  },
  {
    "step": 1,
    "predicted_noise": 0.557,
    "next_x": 1.135
  },
  {
    "step": 0,
    "predicted_noise": 0.511,
    "next_x": 1.073
  }
]
```

The original DDPM sampler evaluates the denoiser across its configured reverse chain, commonly presented with $T = 1000$ steps. That baseline is expensive compared with one-pass generation and motivates sparse samplers and distilled models.
Imagine a detective trying to reconstruct a blurry photo. They first zoom out to understand the overall scene (the contracting path), then zoom in on details, using clues from the wider scene to refine their understanding of specific features (the expanding path). The "skip connections" are like the detective constantly cross-referencing their zoomed-out understanding with the fine details, ensuring consistency.
For years, U-Net (U-shaped convolutional network with skip connections) has been the standard backbone for diffusion models. It combines a contracting path that builds context with an expanding path that restores spatial detail through skip connections, which makes it a strong fit for multiscale denoising. In text-conditioned image models, timestep embeddings modulate every residual block, and cross-attention is usually inserted at several resolutions so prompt tokens can steer both coarse layout and fine detail.[4]
The diagram shows representative conditioning sites, not every single block. Stable Diffusion-style U-Nets use cross-attention in multiple down, middle, and up blocks rather than only at the bottleneck.[4]
DiT (Diffusion Transformer) doesn't mean "transformers are always better." It means the denoiser patchifies its input into tokens and processes them with transformer blocks instead of a convolutional U-Net. In the DiT experiments, increasing model compute improved FID for latent diffusion settings.[5] That is evidence for a scaling path, not a promise that a DiT is cheaper or better for every resolution, latency budget, or dataset.
A concrete published example is Stable Diffusion 3's MM-DiT (multimodal diffusion transformer). Instead of feeding text only through one-directional cross-attention, its text and image representations participate in a joint attention operation while retaining separate modality-specific weights.[6] The SD3 authors report prompt-following, typography, and scaling results for that recipe; those measured results should not be generalized to every transformer-based image generator.
DDPM frames generation as reversing a stochastic noising chain. Flow matching reframes it as learning a velocity field that transports samples along a path between a noise distribution and the data distribution.[7] Rectified flow picks the simplest possible path: a straight line that interpolates noise and data as $x_t = (1 - t)\,x_0 + t\,\epsilon$ for $t \in [0, 1]$.[8] The network is trained to predict the constant velocity along that line, which is just another regression target much like $\epsilon$ or $v$:

$$\mathcal{L}_{\text{RF}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\| v_\theta(x_t, t) - (\epsilon - x_0) \right\|^2\right]$$
Straight-line paths are designed to make numerical integration effective with fewer solver steps, but quality at any step count remains a measurement, not a guarantee. Stable Diffusion 3 reports rectified flow together with MM-DiT and a shifted timestep-sampling strategy.[6] For interview purposes, connect flow matching to diffusion-style deployment concerns such as conditioning, latent representation, and sampler cost, while keeping their training objectives and scheduler equations distinct.
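As a minimal sketch of the straight-line idea, the code below assumes the convention that $t = 0$ is data and $t = 1$ is noise (papers differ on the direction) and uses an oracle velocity in place of a trained network. With a truly constant velocity, Euler integration lands on the data point at any step count; a learned field is only approximately straight, which is why step count still has to be measured.

```python
def interpolate(x_data: float, noise: float, t: float) -> float:
    """Straight-line point between data (t = 0) and noise (t = 1)."""
    return (1.0 - t) * x_data + t * noise


def velocity_target(x_data: float, noise: float) -> float:
    """Constant velocity along the line: d x_t / d t = noise - x_data."""
    return noise - x_data


def euler_sample(noise: float, velocity: float, num_steps: int) -> float:
    """Integrate from t = 1 (noise) back to t = 0 with fixed-size Euler steps."""
    x = noise
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        x -= dt * velocity  # an oracle velocity stands in for the trained network
    return x


x_data, noise = 0.8, -0.4
v = velocity_target(x_data, noise)
print("midpoint:", round(interpolate(x_data, noise, 0.5), 6))
for steps in [1, 4, 16]:
    print(steps, round(euler_sample(noise, v, steps), 6))
```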
U-Net-style models usually encode the timestep with a sinusoidal embedding, project it with an MLP, and use the result to modulate residual blocks through FiLM (Feature-wise Linear Modulation) or AdaGN (Adaptive Group Normalization). DiT carries the same timestep signal, but typically injects it through adaptive LayerNorm inside transformer blocks instead of GroupNorm-based residual blocks.[5] The example below shows the same scale-and-shift idea without framework dependencies:
```python
import json


def modulate(features: list[float], scale: float, shift: float) -> list[float]:
    # Scale-and-shift modulation: (1 + scale) keeps scale = 0 as the identity.
    return [round(value * (1 + scale) + shift, 3) for value in features]


feature_map = [0.20, -0.10, 0.50]
# Hand-picked controls; a real model produces them from the timestep embedding.
timestep_controls = {
    "early_high_noise": {"scale": 0.50, "shift": -0.05},
    "late_low_noise": {"scale": 0.05, "shift": 0.02},
}

outputs = {
    name: modulate(feature_map, **control)
    for name, control in timestep_controls.items()
}

print(json.dumps(outputs, indent=2))
```

```json
{
  "early_high_noise": [
    0.25,
    -0.2,
    0.7
  ],
  "late_low_noise": [
    0.23,
    -0.085,
    0.545
  ]
}
```

This modulation acts like a global controller. Early in sampling (high noise), the scaling parameters adjust the network's activations to focus on large structural changes. Late in sampling (low noise), they tune the network to refine fine textures and details.
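The timestep signal that drives this modulation usually starts as a sinusoidal embedding. The sketch below is one common layout, assumed here for illustration: geometrically spaced frequencies with a maximum period of 10000, sines in the first half and cosines in the second, and an even `dim`. Specific models may order or scale the features differently.

```python
import math


def sinusoidal_embedding(timestep: int, dim: int, max_period: float = 10000.0) -> list[float]:
    """Sinusoidal timestep features: sines in the first half, cosines in the second."""
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    angles = [timestep * freq for freq in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


for t in [0, 10, 500]:
    print(t, [round(value, 3) for value in sinusoidal_embedding(t, dim=8)])
```

In a full model, an MLP would project this vector and emit the per-block `scale` and `shift` values.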
How does the model know to draw a "golden retriever" and not a "mountain"? Early conditional diffusion systems used classifier guidance: they trained a separate image classifier on noisy inputs, then added the classifier's gradient into the sampling step to push the image toward a target label. That worked, but it required training and maintaining an extra classifier.
CFG (Classifier-Free Guidance) removes that extra classifier entirely. It gets a similar steering effect by training one denoiser to handle both conditional and unconditional prediction.[9]
CFG works by training a single model to be both a conditional and an unconditional predictor. At inference time, we "extrapolate" away from the unconditional prediction toward the conditional one.
During training, we randomly drop the conditioning (e.g., a text prompt) with some probability (typically 10%). In real text-to-image systems, the unconditional branch is usually represented by an empty-prompt embedding or a learned null token, not a literal all-zero vector. The example below shows how a batch mixes conditional and unconditional rows:
```python
import json
import random


def cfg_conditioning_batch(prompts: list[str], drop_prob: float, seed: int = 7) -> list[dict[str, str]]:
    rng = random.Random(seed)
    rows = []
    for prompt in prompts:
        # Drop the conditioning with probability drop_prob.
        dropped = rng.random() < drop_prob
        rows.append({
            "original_prompt": prompt,
            "conditioning_used": "<null>" if dropped else prompt,
            "branch": "unconditional" if dropped else "conditional",
        })
    return rows


# drop_prob is set well above the typical 10% so a 4-row batch shows both branches.
batch = cfg_conditioning_batch(
    [
        "golden retriever on a sofa",
        "oak dining table in natural light",
        "red running shoe product photo",
        "warehouse package damage closeup",
    ],
    drop_prob=0.35,
)

print(json.dumps(batch, indent=2))
```

```json
[
  {
    "original_prompt": "golden retriever on a sofa",
    "conditioning_used": "<null>",
    "branch": "unconditional"
  },
  {
    "original_prompt": "oak dining table in natural light",
    "conditioning_used": "<null>",
    "branch": "unconditional"
  },
  {
    "original_prompt": "red running shoe product photo",
    "conditioning_used": "red running shoe product photo",
    "branch": "conditional"
  },
  {
    "original_prompt": "warehouse package damage closeup",
    "conditioning_used": "<null>",
    "branch": "unconditional"
  }
]
```

During sampling, we combine the conditional and unconditional predictions:

$$\tilde\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)$$

Here, $w$ is the guidance scale. When $w = 1$, the equation reduces to just the conditional prediction $\epsilon_\theta(x_t, c)$. When $w > 1$, the model amplifies the difference between the conditional and unconditional predictions, actively pushing the generated image away from generic features and toward the specific meaning of the text prompt.
Watch the convention. This is the guidance-scale form used in Stable Diffusion and the diffusers library, where $w = 1$ means no extra steering. The original paper writes the same thing as $(1 + w')\,\epsilon_\theta(x_t, c) - w'\,\epsilon_\theta(x_t, \varnothing)$, where $w' = 0$ means no steering.[9] The two map exactly via $w = 1 + w'$, so a reported guidance scale of $w = 7.5$ here corresponds to $w' = 6.5$ in the paper's notation.
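The two conventions can be checked against each other in a few lines. This is a scalar sketch with hypothetical function names; it assumes the guidance-scale form `uncond + w * (cond - uncond)` and the paper-style form `(1 + w') * cond - w' * uncond`, related by `w = 1 + w'`.

```python
def guided_scale_form(uncond: float, cond: float, w: float) -> float:
    """Guidance-scale form: w = 1 returns the conditional prediction."""
    return uncond + w * (cond - uncond)


def guided_paper_form(uncond: float, cond: float, w_paper: float) -> float:
    """Paper-style form: w_paper = 0 returns the conditional prediction."""
    return (1.0 + w_paper) * cond - w_paper * uncond


uncond, cond = 0.20, 0.05
for w in [1.0, 3.0, 7.5]:
    w_paper = w - 1.0
    print(
        w,
        w_paper,
        round(guided_scale_form(uncond, cond, w), 6),
        round(guided_paper_form(uncond, cond, w_paper), 6),
    )
```

When comparing guidance values across papers and libraries, convert to one convention first; otherwise the same setting can appear to differ by exactly one.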
A concrete way to read the formula: if the unconditional prediction says "this patch looks like blurry fur" and the conditional prediction says "this patch looks like golden-retriever fur," a guidance scale of $w = 7.5$ pushes the result seven and a half times farther along that direction. The patch becomes distinctly golden-retriever fur rather than generic fur.
The practical implementation below shows the CFG extrapolation formula on short vectors. Real systems batch the unconditional and conditional inputs together for one forward pass, then split the two predictions before applying the same math.
```python
import json


def guided_noise(
    unconditional: list[float],
    conditional: list[float],
    guidance_scale: float,
) -> list[float]:
    # Extrapolate from the unconditional prediction toward the conditional one.
    return [
        round(base + guidance_scale * (target - base), 3)
        for base, target in zip(unconditional, conditional, strict=True)
    ]


noise_uncond = [0.20, -0.10, 0.05]
noise_cond = [0.05, -0.35, 0.20]

rows = [
    {
        "guidance_scale": scale,
        "guided_noise": guided_noise(noise_uncond, noise_cond, scale),
    }
    for scale in [0.0, 1.0, 7.5, 15.0]
]

print(json.dumps(rows, indent=2))
```

```json
[
  {
    "guidance_scale": 0.0,
    "guided_noise": [
      0.2,
      -0.1,
      0.05
    ]
  },
  {
    "guidance_scale": 1.0,
    "guided_noise": [
      0.05,
      -0.35,
      0.2
    ]
  },
  {
    "guidance_scale": 7.5,
    "guided_noise": [
      -0.925,
      -1.975,
      1.175
    ]
  },
  {
    "guidance_scale": 15.0,
    "guided_noise": [
      -2.05,
      -3.85,
      2.3
    ]
  }
]
```

| Guidance Scale | Effect |
|---|---|
| 0.0 | Unconditional sample. Ignores the prompt entirely. |
| 1.0 | Pure conditional prediction. Good baseline, no extra CFG boost. |
| 3.0 to 8.0 | Common operating range. Better adherence with moderate diversity loss. |
| 15.0+ | Over-guided. Higher artifact risk, oversaturation, and repeated textures. |
Compare denoiser-evaluation budgets. The original DDPM setup is commonly demonstrated with about 1000 reverse steps; a DDIM deployment may try a sparse schedule such as 20 or 50 steps, while distilled or consistency models target still fewer. Whether a reduced schedule preserves acceptable quality depends on model, prompts, guidance, and evaluation slice.
DDIM (Denoising Diffusion Implicit Models) reformulates the reverse process so it's no longer strictly Markovian. In other words, each step is allowed to look back further than just the immediately previous state. The resulting process keeps the same marginal distributions as DDPM but allows for deterministic sampling.[10]
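DDIM lets us skip steps. Instead of sampling $x_{t-1}$ from $x_t$, we can step from $x_t$ directly to an earlier timestep $x_s$ with $s < t - 1$. This can reduce denoiser evaluations substantially; it does not eliminate the need to compare prompt adherence, detail preservation, and edit behavior against the slower baseline.

$$x_s = \sqrt{\bar\alpha_s}\,\hat{x}_0 + \sqrt{1 - \bar\alpha_s - \sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t z, \qquad \hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$

Where $\hat{x}_0$ is the model's clean-image estimate, $\epsilon_\theta(x_t, t)$ is the predicted noise direction, and $\sigma_t z$ with $z \sim \mathcal{N}(0, I)$ injects randomness (or vanishes for deterministic DDIM when $\sigma_t = 0$).

Setting $\sigma_t = 0$ makes sampling deterministic. By bypassing the strict Markovian requirement, DDIM decoupled the generative process from the exact number of forward steps used during training. That means you can train on 1000 noise levels but sample on a sparse subset during deployment. DDIM is the canonical interview answer because it exposes the key systems idea: training and inference schedules don't need to be identical.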
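The decoupling of training and sampling schedules can be sketched on a scalar. The code below assumes a 1000-step linear training schedule and a hand-picked five-step sampling subset, and it uses the true noise as an oracle in place of a trained denoiser, so it demonstrates the update rule rather than sample quality. The function name `ddim_step` is hypothetical.

```python
import math


def ddim_step(x_t: float, eps_pred: float, alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """One deterministic DDIM update (sigma = 0) between two noise levels."""
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Assumed training schedule: 1000 linear betas from 1e-4 to 0.02.
num_train_steps = 1000
alpha_bars = []
running = 1.0
for i in range(num_train_steps):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (num_train_steps - 1))
    alpha_bars.append(running)

# Sparse sampling schedule: visit only 5 of the 1000 trained noise levels.
timesteps = [999, 749, 499, 249, 0]

x0_true, eps_true = 0.8, -0.4
x = math.sqrt(alpha_bars[999]) * x0_true + math.sqrt(1.0 - alpha_bars[999]) * eps_true

for i, t in enumerate(timesteps):
    # After the last listed timestep, step to the clean level (alpha_bar = 1).
    alpha_bar_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    # Oracle noise stands in for the trained denoiser's prediction.
    x = ddim_step(x, eps_true, alpha_bars[t], alpha_bar_prev)

print(round(x, 6))
```

With a perfect noise prediction, five updates recover the clean value. A real denoiser's predictions change from step to step, so the gap between sparse and dense schedules has to be measured on the target prompts.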
Each step can require more than one denoiser evaluation. For ordinary CFG, a system needs conditional and unconditional predictions unless it batches or distills that work. Count evaluations before claiming an interactive route is affordable:
```python
import json

routes = [
    {"name": "offline_ddim_cfg", "steps": 50, "predictions_per_step": 2},
    {"name": "interactive_distilled", "steps": 6, "predictions_per_step": 1},
]

# Total denoiser evaluations = sampling steps x predictions needed per step.
budget = [
    {
        "route": route["name"],
        "denoiser_evaluations": route["steps"] * route["predictions_per_step"],
        "quality_gate_required": True,
    }
    for route in routes
]

print(json.dumps(budget, indent=2))
```

```json
[
  {
    "route": "offline_ddim_cfg",
    "denoiser_evaluations": 100,
    "quality_gate_required": true
  },
  {
    "route": "interactive_distilled",
    "denoiser_evaluations": 6,
    "quality_gate_required": true
  }
]
```

Pixel-space diffusion is computationally expensive because the model must process high-resolution feature maps (e.g., $512 \times 512 \times 3$) at every denoising step. Latent Diffusion Models (LDMs), like Stable Diffusion, solve this by training the diffuser in the latent space of a powerful autoencoder.
If you haven't worked with VAEs recently, recall that a variational autoencoder has two parts: an encoder that compresses an image into a compact latent representation, and a decoder that reconstructs the image from it. The latent space is learned lossy compression: it preserves enough perceptual structure for the training objective, while its reconstruction errors remain part of final output quality.[4]
This is the continuous-latent branch of the autoencoder family. A VQ-VAE uses a discrete codebook instead: each patch is snapped to a learned code ID, which makes the compressed image look more like a sequence of tokens.[11] That design is useful when a Transformer is trained to predict image tokens autoregressively or with masking. Latent diffusion usually takes the other route: keep the compressed representation continuous, add noise to it, then train a denoiser over that continuous latent tensor.
During training, the VAE encoder maps an image $x$ to a clean latent $z_0$, and the model learns to denoise noisy versions of that latent. During inference, there's no input image: we start from random latent noise and run the reverse process entirely in latent space before decoding once at the end.[4]
Only the final denoised latent gets decoded. If the denoiser predicts noise or velocity, the scheduler uses that prediction to update $z_t$; the decoder never consumes raw predicted noise.
Operating directly in pixel space demands immense computational resources. Moving the diffusion process to a compressed latent space splits the workload into two phases. The VAE handles high-frequency perceptual details, while the diffusion model focuses on semantic structure and layout.
A $64 \times 64 \times 4$ latent tensor has $48\times$ fewer scalar values than a $512 \times 512 \times 3$ image. That reduction explains why the repeated denoising loop can be substantially cheaper than pixel-space denoising; actual throughput still depends on architecture, step count, batch policy, hardware, and safety checks.
```python
import json
import math

pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)
pixel_values = math.prod(pixel_shape)
latent_values = math.prod(latent_shape)

print(json.dumps({
    "pixel_values": pixel_values,
    "latent_values": latent_values,
    # Ratio of raw scalar counts, channels included.
    "scalar_reduction": pixel_values // latent_values,
    # Ratio of spatial positions only (height x width).
    "spatial_reduction": (pixel_shape[0] * pixel_shape[1]) // (latent_shape[0] * latent_shape[1]),
    "throughput_requires_benchmark": True,
}, indent=2))
```

```json
{
  "pixel_values": 786432,
  "latent_values": 16384,
  "scalar_reduction": 48,
  "spatial_reduction": 64,
  "throughput_requires_benchmark": true
}
```

Text-conditioning recipes are model-specific. The latent diffusion paper demonstrates cross-attention conditioning and uses pretrained encoders such as CLIP for text-to-image experiments.[4][12] Other image generators may use T5-family encoders, multiple text encoders, or different freeze/train choices. The stable concept is that prompt representations condition repeated denoising updates, not that every text encoder is frozen.
A cross-attention block takes flattened image or latent features as queries, and text embeddings as keys and values. The runnable version below shows one latent patch attending to prompt tokens:
```python
import json
import math


def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))


def softmax(values: list[float]) -> list[float]:
    # Subtract the max before exponentiating for numerical stability.
    largest = max(values)
    exp_values = [math.exp(value - largest) for value in values]
    total = sum(exp_values)
    return [value / total for value in exp_values]


latent_query = [0.25, 0.10, 0.70]
text_tokens = [
    {"token": "oak", "key": [0.30, 0.05, 0.65], "value": [0.70, 0.20]},
    {"token": "dining", "key": [0.10, 0.80, 0.10], "value": [0.20, 0.70]},
    {"token": "table", "key": [0.35, 0.10, 0.55], "value": [0.80, 0.10]},
]

# Scaled dot-product scores between the latent query and each text key.
logits = [
    dot(latent_query, token["key"]) / math.sqrt(len(latent_query))
    for token in text_tokens
]
weights = softmax(logits)
# Attention output: weighted sum of the text value vectors.
conditioned_patch = [
    sum(weight * token["value"][i] for weight, token in zip(weights, text_tokens, strict=True))
    for i in range(2)
]

print(json.dumps({
    "attention_weights": {
        token["token"]: round(weight, 3)
        for token, weight in zip(text_tokens, weights, strict=True)
    },
    "conditioned_patch": [round(value, 3) for value in conditioned_patch],
}, indent=2))
```

```json
{
  "attention_weights": {
    "oak": 0.359,
    "dining": 0.292,
    "table": 0.349
  },
  "conditioned_patch": [
    0.589,
    0.311
  ]
}
```

Production image systems rarely stop at plain text-to-image. Teams may add targeted modules around a base diffuser instead of retraining the whole backbone.
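DDIM shows the first big shortcut; a production system can combine further optimizations, such as consistency-style few-step models[15] or IO-aware attention kernels like FlashAttention[16], after measuring each route.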
In an e-commerce setting, these optimizations determine which governed routes are affordable: offline campaign concepts tolerate slower samplers, while an interactive preview route may require fewer evaluations and stricter latency budgets. No step-count choice makes an output faithful to a real product; quality and representation checks remain separate gates.
By the end of this lesson, you should be able to:
CFG extrapolates away from the unconditional prediction toward the conditional one. At $w = 0$, you get the unconditional model. At $w = 1$, you recover the plain conditional prediction. On a chosen evaluation slice, increasing $w$ may improve prompt fidelity while reducing diversity or increasing artifacts; tune it with measured outputs rather than as a fixed rule.
Operating in pixel space, for example $512 \times 512 \times 3$, is expensive because the denoiser must process a large spatial grid at every step. A VAE compresses the image into a smaller latent tensor such as $64 \times 64 \times 4$, which is $48\times$ smaller in raw values and $64\times$ smaller in spatial locations. That is what makes the iterative denoising loop practical at high resolution, while the decoder reconstructs pixel-space detail at the end.
DDPM defines a reverse Markov chain; the original baseline is commonly shown with a long schedule such as 1000 denoiser evaluations. DDIM defines a non-Markovian process that allows deterministic sampling on a sparse timestep subset. A route can evaluate schedules such as 20 or 50 steps for latency, quality, and inversion/edit behavior rather than assuming they match the baseline.
The text prompt is tokenized and encoded into a sequence of embeddings, often by a CLIP-family or T5-family encoder depending on the model family. Those embeddings are injected through cross-attention layers, where image or latent features provide the queries and text embeddings provide keys and values. Each spatial location can then attend to prompt tokens that matter for layout, objects, and attributes.
The schedule determines the signal-to-noise ratio at every timestep. In the closed-form forward process, remaining signal power is proportional to $\bar\alpha_t$ and noise power is proportional to $1 - \bar\alpha_t$, so $\mathrm{SNR}(t) = \bar\alpha_t / (1 - \bar\alpha_t)$. If signal disappears too quickly, many late steps become nearly pure noise and contribute weak learning signal. If too much signal remains, the model never learns hard denoising steps.
The reparameterization trick lets us sample $x_t$ at any timestep directly from $x_0$ without iterating through all previous noising steps. Because sums of independent Gaussian variables remain Gaussian, we can marginalize over the full forward Markov chain and express $x_t$ as a linear combination of $x_0$ and one standard normal noise variable $\epsilon$. That enables parallel, random timestep sampling across batches during training.
$w = 0$ is unconditional, $w = 1$ is conditional without extra boost, and larger values trade diversity for prompt fidelity.[9]
Teams building product-media pipelines cannot treat synthetic imagery as verified product evidence. Generated outputs can show incorrect labels, brand marks, colors, materials, package damage, or unsafe content. High-risk routes need output filtering, provenance labeling, and human review before publication.
```python
import json

outputs = [
    {"asset": "campaign_background", "claims_real_sku": False, "contains_logo": False},
    {"asset": "product_detail_view", "claims_real_sku": True, "contains_logo": False},
    {"asset": "package_damage_evidence", "claims_real_sku": True, "contains_logo": True},
]

decisions = []
for output in outputs:
    # Anything that claims to show a real SKU or carries a logo goes to a human.
    needs_review = output["claims_real_sku"] or output["contains_logo"]
    decisions.append({
        "asset": output["asset"],
        "decision": "human_review" if needs_review else "synthetic_labeled_preview",
    })

print(json.dumps(decisions, indent=2))
```

```json
[
  {
    "asset": "campaign_background",
    "decision": "synthetic_labeled_preview"
  },
  {
    "asset": "product_detail_view",
    "decision": "human_review"
  },
  {
    "asset": "package_damage_evidence",
    "decision": "human_review"
  }
]
```

You should now be able to trace a full diffusion pipeline from prompt to pixel, explain why the forward process uses a closed-form noise sample, and read a training loop to identify exactly where the noise is added, where the MSE loss is computed, and why the model is learning anything at all. You should also be able to explain why Stable Diffusion runs in latent space, what CFG does to the noise prediction, and why fewer sampling steps are a speed-quality trade-off rather than a free lunch.
[1] Denoising Diffusion Probabilistic Models. Ho, J., Jain, A., & Abbeel, P. · 2020 · NeurIPS 2020
[2] Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Sohl-Dickstein et al. · 2015
[3] Progressive Distillation for Fast Sampling of Diffusion Models. Salimans, T., & Ho, J. · 2022 · ICLR 2022
[4] High-Resolution Image Synthesis with Latent Diffusion Models. Rombach, R., et al. · 2022 · CVPR 2022
[5] Scalable Diffusion Models with Transformers. Peebles, W., & Chen, S. · 2023 · ICCV 2023
[6] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. Esser, P., Kulal, S., Blattmann, A., et al. · 2024
[7] Flow Matching for Generative Modeling. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. · 2022
[8] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. Liu, X., Gong, C., & Liu, Q. · 2022
[9] Classifier-Free Diffusion Guidance. Ho, J., & Salimans, T. · 2022 · NeurIPS 2021 Workshop
[10] Denoising Diffusion Implicit Models. Song, J., Meng, C., & Ermon, S. · 2020 · ICLR 2021
[11] Neural Discrete Representation Learning. Oord, A. van den, Vinyals, O., & Kavukcuoglu, K. · 2017
[12] Learning Transferable Visual Models From Natural Language Supervision. Radford, A., et al. · 2021 · ICML 2021
[13] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR
[14] Adding Conditional Control to Text-to-Image Diffusion Models. Zhang, L., et al. · 2023 · ICCV 2023
[15] Consistency Models. Song, Y., et al. · 2023 · ICML 2023
[16] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022