Diffusion Models: Images & Text

Design a governed image-generation service while learning DDPM noising, stochastic sampling, latent diffusion, classifier-free guidance, DiT backbones, and text diffusion.

Multimodal LLM Architecture showed how systems consume visual and audio evidence. Diffusion flips direction: it builds samples from conditioning signals by iteratively denoising a state.

Most learners meet diffusion through text-to-image systems. The same cleanup idea can also generate text by refining masked token blocks instead of predicting one token at a time.^{[1] DiffusionGemma. https://ai.google.dev/gemma/docs/diffusiongemma} Start with a single image brightness value so you can compute the signal-to-noise trade-off by hand, then generalize that toy into full images and language.

Product brief: a governed design-asset generator

Design a service for product teams that creates synthetic campaign backgrounds and illustration previews. It must never present generated UI text, charts, or incident screenshots as verified evidence.

Requirements and API

POST /v1/image-jobs accepts prompt, negative_prompt, width, height, variant_count, route (preview or quality), seed, and an idempotency key. It returns job_id, accepted model and policy versions, and an estimated completion window.
GET /v1/image-jobs/{job_id} returns queued, running, review, failed, or complete, plus asset URLs, seed, sampler, step count, safety disposition, and synthetic-content label.
Preview requests target p95 first preview below 2 seconds. Quality jobs may take longer but must finish within 30 seconds or return a visible failure.
Prompts and outputs pass policy checks. Logo use, realistic UI claims, and safety-sensitive imagery route to review. Tenant data and fine-tune adapters stay isolated.

Data flow and sizing

The request path is API -> prompt policy -> route scheduler -> text encoder -> seeded latent -> repeated denoiser/scheduler steps -> VAE decode -> output policy -> object store -> job record. Store model, adapter, scheduler, policy, and seed versions with every asset so an operator can reproduce or explain it.

Size denoiser evaluations rather than HTTP requests alone. A peak of 20 preview jobs/s with 4 variants and a 6-step distilled route needs 20 x 4 x 6 = 480 denoiser evaluations/s. A smaller quality lane at 2 jobs/s with 4 variants, 50 steps, and ordinary two-pass CFG needs 2 x 4 x 50 x 2 = 800 evaluations/s. Separate queues prevent offline quality work from starving interactive previews.
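The sizing arithmetic above can be kept as a small helper so each route's load is recomputed when step counts or guidance settings change. This is a minimal sketch: the function name and the `cfg_passes` parameter are illustrative assumptions, not part of the service API.

```python
def denoiser_evals_per_second(
    jobs_per_s: float, variants: int, steps: int, cfg_passes: int = 1
) -> float:
    """Denoiser evaluations per second for one route (hypothetical helper)."""
    return jobs_per_s * variants * steps * cfg_passes


# Preview lane: 20 jobs/s, 4 variants, 6-step distilled route, single pass.
preview = denoiser_evals_per_second(20, 4, 6)
# Quality lane: 2 jobs/s, 4 variants, 50 steps, two-pass CFG.
quality = denoiser_evals_per_second(2, 4, 50, cfg_passes=2)

print({"preview": preview, "quality": quality, "total": preview + quality})
```

The quality lane carries more denoiser work than the preview lane despite a tenth of the request rate, which is the argument for separate queues.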

Failure recovery, rollout, and evaluation

Admission is idempotent, so a client retry returns the same job. A worker may rerun an unpublished job from its stored seed and versioned configuration. A stochastic DDPM retry must restore the RNG state or restart the full sample; resuming with fresh noise changes the asset. Timeout, out-of-memory, or failed safety checks move the job to failed or review rather than publishing a partial image.
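The recovery rules above can be sketched in a few lines. Everything here is an illustrative assumption: `JobStore` is an in-memory stand-in for a durable job table, and Python's `random.Random` stands in for the sampler's noise generator. The first half shows a retry returning the same job; the second shows that restoring a checkpointed RNG state reproduces the remaining noise draws.

```python
import random
import uuid


class JobStore:
    """In-memory stand-in for a durable job table (hypothetical, not a real API)."""

    def __init__(self) -> None:
        self._jobs_by_key: dict[tuple[str, str], dict] = {}

    def admit(self, tenant_id: str, idempotency_key: str, request: dict) -> dict:
        # Scope the key by tenant so two tenants cannot collide on one key.
        key = (tenant_id, idempotency_key)
        if key in self._jobs_by_key:
            return self._jobs_by_key[key]  # client retry: same job, no new work
        job = {
            "job_id": str(uuid.uuid4()),
            "status": "queued",
            "seed": request["seed"],
            "route": request["route"],
        }
        self._jobs_by_key[key] = job
        return job


store = JobStore()
request = {"prompt": "abstract campaign background", "route": "preview", "seed": 7}
first = store.admit("tenant-a", "key-123", request)
retry = store.admit("tenant-a", "key-123", request)
other = store.admit("tenant-a", "key-456", request)

# Checkpoint the RNG mid-sample, then resume from the checkpoint.
rng = random.Random(request["seed"])
draws_before_failure = [rng.gauss(0, 1) for _ in range(2)]
checkpoint = rng.getstate()
expected_rest = [rng.gauss(0, 1) for _ in range(2)]

resumed = random.Random()
resumed.setstate(checkpoint)
resumed_rest = [resumed.gauss(0, 1) for _ in range(2)]

print(first["job_id"] == retry["job_id"], resumed_rest == expected_rest)
```

A worker that cannot restore the checkpoint should restart the whole sample from the stored seed rather than continue with fresh noise.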

Roll out a new model or sampler against a frozen prompt suite, then shadow it, canary 1% of eligible jobs, expand to 10%, and finally ramp by route. Roll back on safety-policy regressions, memorization alerts, p95 latency breaches, or a material drop in human preference. Evaluation covers prompt adherence, human preference, diversity, text and layout fidelity, unsafe-output rate, memorization checks, p95 latency, failure rate, and cost per accepted asset.

A one-pixel forward process first

Before any transition kernels, fix one normalized pixel with brightness $x_0 = 1.0$ and a known noise sample $\epsilon = -0.4$ . At each noise level $\bar{\alpha}$ , the closed-form mix is:

$x_t = \sqrt{\bar{\alpha}} \, x_0 + \sqrt{1 - \bar{\alpha}} \, \epsilon$

$\bar{\alpha}$	Signal $\sqrt{\bar{\alpha}}\,x_0$	Noise $\sqrt{1-\bar{\alpha}}\,\epsilon$	$x_t$	SNR $\bar{\alpha}/(1-\bar{\alpha})$
0.9	0.949	-0.126	0.822	9.0
0.5	0.707	-0.283	0.424	1.0
0.001	0.032	-0.400	-0.368	0.001

At $\bar{\alpha}=0.5$, the signal and noise coefficients are equal, so SNR is 1.0 and the signal term keeps about 71% of the original brightness, not half. At $\bar{\alpha}=0.001$, the pixel is almost pure noise. That is the whole forward story in miniature: larger forward-noising timesteps ask the model to recover structure from weaker and weaker signal.

adding-noise-gradually.py

import json
import math

def forward_diffusion_value(x0: float, alpha_bar: float, epsilon: float) -> dict[str, float]:
    signal = math.sqrt(alpha_bar) * x0
    noise = math.sqrt(1 - alpha_bar) * epsilon
    return {
        "alpha_bar": alpha_bar,
        "signal": round(signal, 3),
        "noise": round(noise, 3),
        "xt": round(signal + noise, 3),
        "snr": round(alpha_bar / (1 - alpha_bar), 3),
    }

x0 = 1.0
epsilon = -0.4
steps = [
    forward_diffusion_value(x0, alpha_bar, epsilon)
    for alpha_bar in [0.9, 0.5, 0.001]
]

print(json.dumps(steps, indent=2))

Output

[
  {
    "alpha_bar": 0.9,
    "signal": 0.949,
    "noise": -0.126,
    "xt": 0.822,
    "snr": 9.0
  },
  {
    "alpha_bar": 0.5,
    "signal": 0.707,
    "noise": -0.283,
    "xt": 0.424,
    "snr": 1.0
  },
  {
    "alpha_bar": 0.001,
    "signal": 0.032,
    "noise": -0.4,
    "xt": -0.368,
    "snr": 0.001
  }
]

Real images apply the same formula elementwise to every pixel or latent channel. Generation will later reverse the path: start near pure noise and clean structure back in step by step.^{[2] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239}

Figure: Latent diffusion path. Prompt conditioning is reused at every denoising step, noisy latent grids become structured, and pixels appear only after one final VAE decode.

Why iterative refinement?

A natural first guess is to train a single neural network that maps a random vector straight to a 512x512 image. Generative Adversarial Networks (GANs) pair that generator with a discriminator. Their adversarial objective can be difficult to tune and can suffer mode collapse, while strong GAN implementations can still produce excellent samples. Diffusion changes the training objective and serving cost rather than guaranteeing better output for every domain.

Diffusion models take a different path. They don't ask the network to produce a clean image in one step. They ask it to remove a bit of noise, like undoing one row of the table above. Repeating that cleanup turns a random field into a structured image. The task at each step is much simpler, the training objective is a plain supervised loss, and the model naturally supports conditioning because the prompt can steer every step.

Classical DDPM training starts from an evidence lower bound (ELBO), a trainable lower bound on data log-likelihood, but its common simple objective is a noise-prediction regression loss derived from it. Unlike GANs, this removes the adversarial minimax game; it doesn't by itself prove coverage, memorization safety, or output suitability.

Generative models comparison

GANs, Variational Autoencoders (VAEs), and diffusion models make different trade-offs between sample quality, coverage, likelihood modeling, and inference cost. Treat the table as an architecture comparison, not a universal ranking:

Feature	GANs	VAEs	Diffusion Models
Training Objective	Minimax Game (Adversarial)	ELBO (Evidence Lower Bound)	Variational ELBO / MSE
Sample Quality	Can be sharp; data and tuning dependent	Often reconstruction-smoothed	Strong results; model and sampler dependent
Coverage Risk	Mode collapse is an explicit risk	Compression can remove detail	Must evaluate diversity and memorization
Training Behavior	Adversarial instability risk	Stable reconstruction objective	Supervised denoising objective
Inference Work	One generator pass	One decoder pass	Repeated denoiser evaluations unless accelerated
Likelihood	Implicit	Approximate	Tractable lower bound

Forward diffusion process

Adding noise gradually

Forward diffusion resembles a drop of ink dispersing in water. At first, the ink is a structured blob (the data). Over time, it diffuses until the water is uniformly colored (pure noise). The single-pixel table above is that story for one brightness value. The formal process extends it to full tensors and draws its principles from non-equilibrium thermodynamics.^{[3] Deep Unsupervised Learning using Nonequilibrium Thermodynamics. https://arxiv.org/abs/1503.03585}

Starting from a clean image $x_0$, we iteratively add Gaussian noise over $T$ steps. This process is a fixed Markov chain, which is just a sequence where each new state depends only on the previous one, not on the full history. It gradually destroys structure until the data becomes indistinguishable from isotropic Gaussian noise, meaning random static that has the same variance on every pixel and every color channel.^{[3] Deep Unsupervised Learning using Nonequilibrium Thermodynamics. https://arxiv.org/abs/1503.03585}

Figure: Five image tiles from clean structure to terminal Gaussian noise, with signal and noise bars. Forward diffusion gradually erases image structure while the noise term grows, and closed-form noising lets training jump straight to any timestep.

The transition kernel generalizes the one-pixel mix. Each step adds a little more Gaussian noise:

$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} \cdot x_{t-1}, \beta_t \mathbf{I})$

Here, $\beta_t$ is the noise schedule. The original DDPM experiments used a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps; newer systems may choose different schedules.^{[2] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239} Because a sum of independent Gaussian variables is itself Gaussian, the derivation can marginalize over intermediate steps in closed form. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} \cdot x_0, (1 - \bar{\alpha}_t) \mathbf{I})$

That is exactly the formula you already computed by hand: sample $x_t$ at any timestep $t$ directly from $x_0$ and one noise draw, without simulating every intermediate step. At $t = T$, $\bar{\alpha}_T \approx 0$, so $q(x_T \mid x_0) \approx \mathcal{N}(0, \mathbf{I})$: the sample is nearly pure isotropic Gaussian noise regardless of $x_0$.
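The marginalization claim can be checked numerically without sampling. The sketch below pushes the mean coefficient and variance of a scalar through every step of the transition kernel, using the linear DDPM schedule quoted above, and compares the result with the closed form. The variable names are illustrative assumptions.

```python
import math

T = 1000
# Linear schedule from beta_1 = 1e-4 to beta_T = 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

mean_coeff = 1.0  # multiplies x0 in the mean of x_t
variance = 0.0    # variance of x_t given x0
alpha_bar = 1.0   # running product of alpha_s

for beta in betas:
    alpha = 1.0 - beta
    # One kernel step: x_t = sqrt(alpha) * x_{t-1} + sqrt(beta) * z
    mean_coeff *= math.sqrt(alpha)
    variance = alpha * variance + beta
    alpha_bar *= alpha

print({
    "stepwise_mean_coeff": mean_coeff,
    "closed_form_mean_coeff": math.sqrt(alpha_bar),
    "stepwise_variance": variance,
    "closed_form_variance": 1.0 - alpha_bar,
})
```

Both pairs agree to floating-point precision, and the final $\bar{\alpha}_T$ is small enough that the terminal state carries almost no signal.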

At small forward timestep $t$ , the input still retains recognizable structure. At large forward $t$ near $T$ , it contains very little original signal. The SNR curve $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$ describes that spectrum without confusing forward time with the direction of reverse sampling.

Reverse process: learned denoising

DDPM training

Diffusion training learns the reverse process $p_\theta(x_{t-1} | x_t)$, which corresponds to denoising the image step-by-step.^{[2] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239}

The subtle point is that the exact reverse posterior $q(x_{t-1} \mid x_t)$ is intractable because it marginalizes over every possible clean image $x_0$ that could have produced $x_t$. What is tractable is $q(x_{t-1} \mid x_t, x_0)$, which stays Gaussian. DDPM uses that closed form in the derivation, then trains a model that predicts enough information from $x_t$ and $t$ to approximate the reverse step without ever observing $x_0$ at inference time.^{[2] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239}
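For reference, the tractable posterior has the following closed form in the DDPM paper, written with the same $\alpha_t$, $\beta_t$, and $\bar{\alpha}_t$ defined earlier:

$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I})$

$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t} \, x_0 + \frac{\sqrt{\alpha_t} \, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \, x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \, \beta_t$

Substituting $x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon) / \sqrt{\bar{\alpha}_t}$ into $\tilde{\mu}_t$ gives $\frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon\right)$, which is why a network that predicts $\epsilon$ is enough to parameterize the reverse mean.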

The model doesn't have to predict the clean image $x_0$ directly. In standard DDPM (Denoising Diffusion Probabilistic Models), the model predicts the noise component $\epsilon$ that was added at step $t$ .

We train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ from a noisy input $x_t$ :

$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]$

In this formula, we sample a timestep $t$ , a clean image $x_0$ , and noise $\epsilon$ , then train the network to predict that noise from the noisy input $x_t$ . The squared error is averaged over many such samples.

In practice, people usually optimize this "simple" objective. It comes from simplifying and reweighting the original ELBO, but the implementation is just a supervised regression problem from $x_t$ to $\epsilon$.^{[2] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239} The training step samples a timestep, creates $x_t$, asks the model to predict the known noise, and computes mean squared error. The toy below hard-codes the true and predicted noise vectors to isolate the loss computation:

ddpm-training.py

import json

def mse(true_values: list[float], predicted_values: list[float]) -> float:
    squared_errors = [
        (true - predicted) ** 2
        for true, predicted in zip(true_values, predicted_values, strict=True)
    ]
    return sum(squared_errors) / len(squared_errors)

training_rows = [
    {
        "timestep": 100,
        "true_noise": [0.20, -0.10, 0.05],
        "predicted_noise": [0.18, -0.08, 0.02],
    },
    {
        "timestep": 700,
        "true_noise": [0.90, -0.40, 0.30],
        "predicted_noise": [0.72, -0.35, 0.10],
    },
]

losses = [
    {
        "timestep": row["timestep"],
        "mse_loss": round(mse(row["true_noise"], row["predicted_noise"]), 4),
    }
    for row in training_rows
]

print(json.dumps(losses, indent=2))

Output

[
  {
    "timestep": 100,
    "mse_loss": 0.0006
  },
  {
    "timestep": 700,
    "mse_loss": 0.025
  }
]

The entire training step is a supervised regression problem. You know the exact noise $\epsilon$ you added, so the model has a ground-truth target. There is no adversarial game and no discriminator. The difficulty is that the same network must learn to denoise at every timestep $t$ , from nearly-clean images to nearly-pure static.

Noise schedule and SNR

The schedule isn't just bookkeeping. In the closed-form forward equation, the signal coefficient is $\sqrt{\bar{\alpha}_t}$ and the noise coefficient is $\sqrt{1 - \bar{\alpha}_t}$ , so the effective signal-to-noise ratio at timestep $t$ is:

$\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$

High-SNR timesteps still contain recognizable structure. Low-SNR timesteps are close to pure noise. If the schedule destroys signal too quickly, many late timesteps become almost uninformative and the model spends capacity denoising inputs with very little semantic content left. That's why schedule design and loss weighting matter in practice: they decide where along the SNR curve the model spends most of its learning budget.
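To see where a schedule spends its timesteps, the sketch below tabulates SNR along the linear DDPM schedule quoted earlier and counts how many timesteps fall below two thresholds. The thresholds (SNR below 1 and below 0.01) are illustrative assumptions, not standard cutoffs.

```python
T = 1000
# Linear schedule from beta_1 = 1e-4 to beta_T = 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

alpha_bar = 1.0
snr: list[float] = []
for beta in betas:
    alpha_bar *= 1.0 - beta
    snr.append(alpha_bar / (1.0 - alpha_bar))

# First 1-indexed timestep where noise power exceeds signal power.
crossover = next(t for t, value in enumerate(snr, start=1) if value < 1.0)
# Share of timesteps that are already close to pure noise.
low_snr_fraction = sum(value < 0.01 for value in snr) / T

print({
    "snr_at_t1": snr[0],
    "snr_at_T": snr[-1],
    "crossover_timestep": crossover,
    "fraction_below_0.01": low_snr_fraction,
})
```

Swapping in a different `betas` list makes it easy to compare how two schedules distribute timesteps across the SNR range before committing to a training run.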

Figure: Line chart of signal power falling and noise power rising across diffusion timesteps. Noise schedule shape decides where signal survives and where denoising becomes hardest.

Prediction targets

Original DDPM predicts $\epsilon$ directly.^{[2] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239} That's not the only valid parameterization. Some systems predict the clean sample $x_0$, and others predict a velocity-style target $v$, which is a different linear parameterization of the same reverse step.^{[4] Progressive Distillation for Fast Sampling of Diffusion Models. https://arxiv.org/abs/2202.00512} These targets are mathematically convertible, but they change optimization behavior and how errors are distributed across timesteps. In interviews, it's often enough to explain that the scheduler still defines the reverse update, while the prediction target controls what the network is trained to output.
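The conversions between targets are short enough to verify on the one-pixel example. The sketch below assumes the variance-preserving setup used throughout this lesson, with signal coefficient $\sqrt{\bar{\alpha}_t}$ and noise coefficient $\sqrt{1 - \bar{\alpha}_t}$, and the velocity definition $v = \sqrt{\bar{\alpha}_t} \, \epsilon - \sqrt{1 - \bar{\alpha}_t} \, x_0$ from the progressive distillation paper. The function names are illustrative.

```python
import math


def to_v(x0: float, eps: float, alpha_bar: float) -> float:
    # v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0
    return math.sqrt(alpha_bar) * eps - math.sqrt(1 - alpha_bar) * x0


def x0_from_eps(xt: float, eps: float, alpha_bar: float) -> float:
    # Invert x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps.
    return (xt - math.sqrt(1 - alpha_bar) * eps) / math.sqrt(alpha_bar)


def x0_from_v(xt: float, v: float, alpha_bar: float) -> float:
    return math.sqrt(alpha_bar) * xt - math.sqrt(1 - alpha_bar) * v


def eps_from_v(xt: float, v: float, alpha_bar: float) -> float:
    return math.sqrt(1 - alpha_bar) * xt + math.sqrt(alpha_bar) * v


# Same pixel and noise draw as the forward-process table.
x0, eps, alpha_bar = 1.0, -0.4, 0.5
xt = math.sqrt(alpha_bar) * x0 + math.sqrt(1 - alpha_bar) * eps
v = to_v(x0, eps, alpha_bar)

print({
    "x0_from_eps": x0_from_eps(xt, eps, alpha_bar),
    "x0_from_v": x0_from_v(xt, v, alpha_bar),
    "eps_from_v": eps_from_v(xt, v, alpha_bar),
})
```

Every route recovers the same $x_0$ and $\epsilon$ up to floating-point error, so the choice of target changes what the network regresses, not what information the scheduler can reconstruct.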

Sampling (reverse process)

To generate a new image, we start from pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ and iteratively denoise it using the learned model. A DDPM reverse step includes both a learned mean and fresh stochastic noise on non-final steps:

$\begin{aligned} x_{t-1} &= \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t,t)\right) + \sigma_t z, \\ z &\sim \mathcal{N}(0,\mathbf{I}) \end{aligned}$

Set $z=0$ for the final step into $x_0$ . The original formulation allows a chosen reverse variance; the scalar teaching example uses $\sigma_t=\sqrt{\beta_t}$ . Each step depends on the previous state, so the reverse process remains sequential even when denoiser calls are batched. A seeded random generator makes the stochastic trace reproducible:

sampling-reverse-process.py

import json
import math
import random

betas = [0.02, 0.03, 0.04]
alphas = [1 - beta for beta in betas]
alpha_bars: list[float] = []
running = 1.0
for alpha in alphas:
    running *= alpha
    alpha_bars.append(running)

def predicted_noise(x: float, step: int) -> float:
    return 0.45 * x + 0.02 * step

x = 1.25
trace = []
rng = random.Random(7)
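for step in reversed(range(len(betas))):
    beta_t = betas[step]
    alpha_t = alphas[step]
    alpha_bar_t = alpha_bars[step]
    # Toy stand-in for the learned denoiser epsilon_theta(x_t, t).
    eps = predicted_noise(x, step)
    # Reverse mean: subtract the scaled noise estimate, then rescale by 1/sqrt(alpha_t).
    mean = (x - (beta_t / math.sqrt(1 - alpha_bar_t)) * eps) / math.sqrt(alpha_t)
    # Fresh Gaussian noise on every step except the final one (index 0 here).
    z = rng.gauss(0, 1) if step > 0 else 0.0
    # sigma_t = sqrt(beta_t) is the reverse-variance choice for this example.
    x = mean + math.sqrt(beta_t) * z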
    trace.append({
        "step": step,
        "predicted_noise": round(eps, 3),
        "reverse_mean": round(mean, 3),
        "stochastic_z": round(z, 3),
        "next_x": round(x, 3),
    })

print(json.dumps(trace, indent=2))

Output

[
  {
    "step": 2,
    "predicted_noise": 0.603,
    "reverse_mean": 1.193,
    "stochastic_z": -0.256,
    "next_x": 1.141
  },
  {
    "step": 1,
    "predicted_noise": 0.534,
    "reverse_mean": 1.086,
    "stochastic_z": 0.511,
    "next_x": 1.174
  },
  {
    "step": 0,
    "predicted_noise": 0.528,
    "reverse_mean": 1.111,
    "stochastic_z": 0.0,
    "next_x": 1.111
  }
]

The original DDPM sampler evaluates the denoiser across its configured reverse chain, commonly presented with $T=1000$ steps. That baseline is expensive compared with one-pass generation and motivates sparse samplers and distilled models.

Model architectures: U-Net and DiT

The U-Net backbone

A denoiser must recover both layout and small spatial details. For a UI illustration scene, it needs to place panels and visual anchors coherently while preserving edges, textures, and lighting cues. A U-Net's down path builds broader context, its up path restores spatial resolution, and skip connections carry local cues around the deepest bottleneck.

For years, U-Net (U-shaped convolutional network with skip connections) has been the standard backbone for diffusion models because that contracting-and-expanding structure is a strong fit for multiscale denoising. In text-conditioned image models, timestep embeddings modulate every residual block, and cross-attention is usually inserted at several resolutions so prompt tokens can steer both coarse layout and fine detail.^{[5] High-Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752}

Figure: U-Net denoiser path with down blocks, a mid block, up blocks, and side conditioning from timestep and text at many blocks. U-Net builds context on the way down, restores detail on the way up, and keeps local cues through skip paths.

The conditioning sites in the figure are representative, not a list of every block. Stable Diffusion-style U-Nets use cross-attention in multiple down, middle, and up blocks rather than only at the bottleneck.^{[5] High-Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752}

The Transformer shift (DiT)

DiT (Diffusion Transformer) doesn't mean "transformers are always better." It means the denoiser patchifies its input into tokens and processes them with transformer blocks instead of a convolutional U-Net. In the DiT experiments, increasing model compute improved Fréchet Inception Distance (FID), an image-quality metric where lower is better, for latent diffusion settings.^{[6] Scalable Diffusion Models with Transformers. https://arxiv.org/abs/2212.09748} That's evidence for a scaling path, not a promise that a DiT is cheaper or better for every resolution, latency budget, or dataset.
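Patchifying is easiest to see on a tiny grid. The sketch below is a simplified illustration that assumes a single-channel 4x4 latent and a patch size of 2. A real DiT also has a channel dimension, a learned linear projection per patch, and positional embeddings, all omitted here.

```python
def patchify(grid: list[list[float]], patch: int) -> list[list[float]]:
    """Split a 2D grid into non-overlapping patches, each flattened to one token."""
    height, width = len(grid), len(grid[0])
    if height % patch or width % patch:
        raise ValueError("grid dimensions must be divisible by the patch size")
    tokens = []
    # Walk patches in row-major order so token order matches spatial order.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                grid[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens


# 4x4 latent whose values encode their own position, for readability.
latent = [[float(row * 4 + col) for col in range(4)] for row in range(4)]
tokens = patchify(latent, patch=2)

for index, token in enumerate(tokens):
    print(index, token)
```

Halving the patch size quadruples the token count, which is one reason patch size is a direct lever on transformer compute.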

A concrete published example is Stable Diffusion 3's MM-DiT (multimodal diffusion transformer). Instead of feeding text only through one-directional cross-attention, its text and image representations participate in a joint attention operation while retaining separate modality-specific weights.^{[7] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. https://arxiv.org/abs/2403.03206} The SD3 authors report prompt-following, typography, and scaling results for that recipe; those measured results shouldn't be generalized to every transformer-based image generator.

Flow matching and rectified flow

DDPM frames generation as reversing a stochastic noising chain. Flow matching reframes it as learning a velocity field that transports samples along a path between a noise distribution and the data distribution.^{[8] Flow Matching for Generative Modeling. https://arxiv.org/abs/2210.02747} Rectified flow picks the simplest possible path: a straight line that interpolates noise $x_1 \sim \mathcal{N}(0, \mathbf{I})$ and data $x_0$ as $x_\tau = (1 - \tau) x_0 + \tau x_1$ for $\tau \in [0, 1]$.^{[9] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. https://arxiv.org/abs/2209.03003} The network is trained to predict the constant velocity $x_1 - x_0$ along that line, which is just another regression target much like $\epsilon$ or $v$:

$\mathcal{L}_{\text{FM}} = \mathbb{E}_{\tau, x_0, x_1}\big[\,\|\,v_\theta(x_\tau, \tau) - (x_1 - x_0)\,\|^2\,\big]$

Straight-line paths are designed to make numerical integration effective with fewer solver steps, but quality at any step count remains a measurement, not a guarantee. Stable Diffusion 3 reports rectified flow together with MM-DiT and a shifted timestep-sampling strategy.^{[7] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. https://arxiv.org/abs/2403.03206} For interview purposes, connect flow matching to diffusion-style deployment concerns such as conditioning, latent representation, and sampler cost, while keeping their training objectives and scheduler equations distinct.
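A scalar sketch shows why straight paths are friendly to coarse solvers. It uses an oracle velocity that already knows the paired $x_0$ and $x_1$, which a trained $v_\theta$ never does; the point is only that a constant velocity field is integrated exactly by Euler steps of any size. All function names are illustrative assumptions.

```python
import random
from typing import Callable


def interpolate(x0: float, x1: float, tau: float) -> float:
    # x_tau = (1 - tau) * x0 + tau * x1, with x1 as noise and x0 as data.
    return (1 - tau) * x0 + tau * x1


def euler_sample(x1: float, velocity: Callable[[float, float], float], steps: int) -> float:
    """Integrate from tau = 1 (noise) down to tau = 0 (data) with fixed Euler steps."""
    x, tau = x1, 1.0
    dt = 1.0 / steps
    for _ in range(steps):
        x -= dt * velocity(x, tau)
        tau -= dt
    return x


rng = random.Random(0)
x0 = 1.0              # the "data" point
x1 = rng.gauss(0, 1)  # the paired noise draw


def oracle_velocity(x: float, tau: float) -> float:
    # Regression target from the loss above; stands in for a learned v_theta.
    return x1 - x0


samples = {steps: euler_sample(x1, oracle_velocity, steps) for steps in (1, 4, 16)}
midpoint = interpolate(x0, x1, 0.5)
print(samples, midpoint)
```

With a learned field, paths from different noise and data pairs interact and the velocity is no longer exactly constant, so step count still has to be validated against quality.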

Text diffusion and DiffusionGemma

Image diffusion usually corrupts continuous pixel or latent values with Gaussian noise. Text is different because a sentence is a sequence of discrete token IDs. A text diffuser therefore uses a discrete corruption process, often replacing tokens with masks or sampling from categorical transitions, then trains a model to reconstruct the original tokens from noisy context.^{[10] Structured Denoising Diffusion Models in Discrete State-Spaces. https://arxiv.org/abs/2107.03006}
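The mask-based variant of that corruption can be sketched directly. This is an illustrative toy, assuming independent per-token masking at a single rate; real discrete diffusion schedules tie the rate to a timestep and may use other categorical transitions.

```python
import random


def mask_tokens(tokens: list[str], mask_rate: float, rng: random.Random) -> list[str]:
    """Replace each token with <mask> independently with probability mask_rate."""
    return ["<mask>" if rng.random() < mask_rate else token for token in tokens]


tokens = ["write", "unit", "tests", "first"]
rng = random.Random(0)

# Higher rates play the role of later forward timesteps.
levels = {rate: mask_tokens(tokens, rate, rng) for rate in (0.0, 0.5, 1.0)}

for rate, corrupted in levels.items():
    print(rate, " ".join(corrupted))
```

Training would then ask a model to recover the original tokens at the masked positions, the discrete counterpart of predicting the added Gaussian noise.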

Autoregressive language models generate left to right: $p(x_i \mid x_{<i})$ . Once token 12 is emitted, token 4 usually won't be revised. A diffusion language model starts from a noisy or masked sequence and refines many positions over several rounds. Each round can use context from both sides of a missing token, so the model can repair an earlier word after seeing later words.

Google's DiffusionGemma is a current open model using this idea. Google describes it as a 26B-parameter mixture-of-experts model with 4B active parameters per token that accepts text, image, and video inputs and generates text through discrete diffusion rather than left-to-right autoregressive decoding.^{[1] DiffusionGemma. https://ai.google.dev/gemma/docs/diffusiongemma}^{[11] DiffusionGemma Model Card. https://ai.google.dev/gemma/docs/diffusiongemma/model_card} The public docs describe generation as iterative refinement over a target output canvas, with output lengths handled in 256-token blocks.^{[1] DiffusionGemma. https://ai.google.dev/gemma/docs/diffusiongemma}

Figure: Autoregressive decoding commits tokens left to right, while discrete text diffusion refines a whole canvas of masked token slots over several passes.

The important systems difference is dependency shape. Autoregressive decoding has a hard sequential dependency between adjacent generated tokens. Text diffusion still needs multiple model passes, but each pass can update many uncertain positions in the block. That makes it attractive for high-throughput short or medium outputs, infilling, and editing workflows where bidirectional context matters. It also introduces new engineering questions: how long should the canvas be, how many refinement rounds are enough, and how should the UI handle text that can change across the whole block before it's final?

The toy example below isn't a language model. It shows only the state shape: a masked canvas gets refined in rounds, and later rounds can fill earlier positions after seeing surrounding context.

text-diffusion-refinement.py

import json

target = ["write", "unit", "tests", "first"]
canvas = ["<mask>"] * len(target)
update_plan = [
    {"round": 1, "positions": [0, 3]},
    {"round": 2, "positions": [1]},
    {"round": 3, "positions": [2]},
]

trace = []
for step in update_plan:
    for position in step["positions"]:
        canvas[position] = target[position]
    trace.append({
        "round": step["round"],
        "canvas": " ".join(canvas),
        "remaining_masks": canvas.count("<mask>"),
    })

print(json.dumps(trace, indent=2))

Output

[
  {
    "round": 1,
    "canvas": "write <mask> <mask> first",
    "remaining_masks": 2
  },
  {
    "round": 2,
    "canvas": "write unit <mask> first",
    "remaining_masks": 1
  },
  {
    "round": 3,
    "canvas": "write unit tests first",
    "remaining_masks": 0
  }
]

Text diffusion shouldn't be described as "Stable Diffusion but with words." The forward corruption is discrete, the scheduler decides which token positions remain uncertain, and the output length often has to be planned up front. The shared idea is denoising; the data type, conditioning, and serving constraints are different.

Timestep conditioning

U-Net-style models usually encode the timestep $t$ with a sinusoidal embedding, project it with an MLP, and use the result to modulate residual blocks through FiLM (Feature-wise Linear Modulation) or AdaGN (Adaptive Group Normalization). DiT carries the same timestep signal, but typically injects it through adaptive LayerNorm inside transformer blocks instead of GroupNorm-based residual blocks.^{[6] Scalable Diffusion Models with Transformers. https://arxiv.org/abs/2212.09748} A framework-free scale-and-shift sketch makes the shared idea visible:

timestep-conditioning.py

import json

def modulate(features: list[float], scale: float, shift: float) -> list[float]:
    return [round(value * (1 + scale) + shift, 3) for value in features]

feature_map = [0.20, -0.10, 0.50]
timestep_controls = {
    "reverse_start_high_noise": {"scale": 0.50, "shift": -0.05},
    "reverse_end_low_noise": {"scale": 0.05, "shift": 0.02},
}

outputs = {
    name: modulate(feature_map, **control)
    for name, control in timestep_controls.items()
}

print(json.dumps(outputs, indent=2))

Output

{
  "reverse_start_high_noise": [
    0.25,
    -0.2,
    0.7
  ],
  "reverse_end_low_noise": [
    0.23,
    -0.085,
    0.545
  ]
}

This modulation acts like a global controller. Reverse sampling starts at large $t$ with high noise, where the network must establish broad structure. It ends near $t=0$ with low noise, where the network refines texture and detail. In the forward noising direction, those same endpoints are called large or late $t$ and small or early $t$ , respectively, so naming the direction avoids an ambiguous word such as "early."
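The scale and shift values have to come from somewhere, and the usual first stage is the sinusoidal embedding mentioned earlier. The sketch below shows one common variant, with sines in the first half and cosines in the second. The `dim` and `max_period` values are assumptions, and the MLP that would map this vector to scale and shift is omitted.

```python
import math


def sinusoidal_embedding(timestep: int, dim: int, max_period: float = 10000.0) -> list[float]:
    """Encode a timestep as sines and cosines at geometrically spaced frequencies."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies fall from 1 toward 1 / max_period across the half-width.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return (
        [math.sin(timestep * freq) for freq in freqs]
        + [math.cos(timestep * freq) for freq in freqs]
    )


high_noise = sinusoidal_embedding(900, dim=8)
low_noise = sinusoidal_embedding(10, dim=8)
origin = sinusoidal_embedding(0, dim=8)

print([round(value, 3) for value in high_noise])
print([round(value, 3) for value in low_noise])
```

Nearby timesteps get similar vectors while distant ones differ across many frequencies, which gives the downstream MLP a smooth signal to turn into per-block modulation.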

Classifier-free guidance (CFG)

How CFG steers generation without a classifier

How does the model know to draw an "oak dining table" and not a "red running shoe"? Early conditional diffusion systems used classifier guidance: they trained a separate image classifier on noisy inputs, then added the classifier's gradient into the sampling step to push the image toward a target label. That worked, but it required training and maintaining an extra classifier.

CFG (Classifier-Free Guidance) removes that extra classifier entirely. It gets a similar steering effect by training one denoiser to handle both conditional and unconditional prediction.^{[12] Classifier-Free Diffusion Guidance. https://arxiv.org/abs/2207.12598}

At inference time, sampling runs that one model with and without the prompt, then extrapolates away from the unconditional prediction toward the conditional one.

During training, we randomly drop the conditioning (e.g., a text prompt) with some probability $p_{uncond}$ (often around 10%). In real text-to-image systems, the unconditional branch is usually represented by an empty-prompt embedding or a learned null token, not a literal all-zero vector. The batch below mixes conditional and unconditional rows; it uses an exaggerated drop probability of 0.35 so that dropped rows show up in a four-prompt example:

cfg-training-step.py

import json
import random

def cfg_conditioning_batch(prompts: list[str], drop_prob: float, seed: int = 7) -> list[dict[str, str]]:
    rng = random.Random(seed)
    rows = []
    for prompt in prompts:
        dropped = rng.random() < drop_prob
        rows.append({
            "original_prompt": prompt,
            "conditioning_used": "<null>" if dropped else prompt,
            "branch": "unconditional" if dropped else "conditional",
        })
    return rows

batch = cfg_conditioning_batch(
    [
        "oak dining table in soft light",
        "oak dining table in natural light",
        "mobile dashboard empty state",
        "annotated latency chart hero image",
    ],
    drop_prob=0.35,
)

print(json.dumps(batch, indent=2))

Output

[
  {
    "original_prompt": "oak dining table in soft light",
    "conditioning_used": "<null>",
    "branch": "unconditional"
  },
  {
    "original_prompt": "oak dining table in natural light",
    "conditioning_used": "<null>",
    "branch": "unconditional"
  },
  {
    "original_prompt": "mobile dashboard empty state",
    "conditioning_used": "mobile dashboard empty state",
    "branch": "conditional"
  },
  {
    "original_prompt": "annotated latency chart hero image",
    "conditioning_used": "<null>",
    "branch": "unconditional"
  }
]

During sampling, we combine the conditional and unconditional predictions. Common guidance scales above $1$ extrapolate beyond the conditional prediction:

$\hat{\epsilon} = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$

Here, $w$ is the guidance scale. When $w=1$ , the equation reduces to just the conditional prediction $\epsilon_\theta(x_t, t, c)$ . When $w>1$ , the model amplifies the difference between the conditional and unconditional predictions, actively pushing the generated image away from generic features and toward the specific meaning of the text prompt.

Watch the convention. This is the guidance-scale form used in Stable Diffusion and the diffusers library, where $w=1$ means no extra steering. The original paper writes the same thing as $\hat{\epsilon} = (1 + w')\epsilon_\theta(x_t, t, c) - w' \epsilon_\theta(x_t, t, \emptyset)$, where $w'=0$ means no steering.^{[12] Classifier-Free Diffusion Guidance. https://arxiv.org/abs/2207.12598} The two map exactly via $w = 1 + w'$, so a reported guidance scale of $7.5$ here corresponds to $w' = 6.5$ in the paper's notation.
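The equivalence of the two conventions is quick to confirm on one scalar. The function names below are illustrative, not library APIs.

```python
import math


def guided_scale_form(uncond: float, cond: float, w: float) -> float:
    # Guidance-scale form: w = 1 returns the conditional prediction.
    return uncond + w * (cond - uncond)


def guided_paper_form(uncond: float, cond: float, w_prime: float) -> float:
    # Paper form: w' = 0 returns the conditional prediction.
    return (1 + w_prime) * cond - w_prime * uncond


uncond, cond = 0.20, 0.05
w = 7.5
w_prime = w - 1  # 6.5 in the paper's notation

scale_result = guided_scale_form(uncond, cond, w)
paper_result = guided_paper_form(uncond, cond, w_prime)
print(round(scale_result, 3), round(paper_result, 3))
```

When porting a guidance setting between codebases, checking which form the code implements avoids an off-by-one in guidance strength.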

A concrete way to read the formula: if the unconditional prediction says "this patch looks like generic wood grain" and the conditional prediction says "this patch looks like oak grain," a guidance scale of $w = 7.5$ places the result 7.5 times as far from the unconditional prediction as the conditional prediction sits, along the same direction. The patch becomes more distinctly oak-like rather than generic wood.

The practical implementation below shows the CFG extrapolation formula on short vectors. Real systems batch the unconditional and conditional inputs together for one forward pass, then split the two predictions before applying the same math.

cfg-guidance-formula.py

import json

def guided_noise(
    unconditional: list[float],
    conditional: list[float],
    guidance_scale: float,
) -> list[float]:
    return [
        round(base + guidance_scale * (target - base), 3)
        for base, target in zip(unconditional, conditional, strict=True)
    ]

noise_uncond = [0.20, -0.10, 0.05]
noise_cond = [0.05, -0.35, 0.20]

rows = [
    {
        "guidance_scale": scale,
        "guided_noise": guided_noise(noise_uncond, noise_cond, scale),
    }
    for scale in [0.0, 1.0, 7.5, 15.0]
]

print(json.dumps(rows, indent=2))

Output

[
  {
    "guidance_scale": 0.0,
    "guided_noise": [
      0.2,
      -0.1,
      0.05
    ]
  },
  {
    "guidance_scale": 1.0,
    "guided_noise": [
      0.05,
      -0.35,
      0.2
    ]
  },
  {
    "guidance_scale": 7.5,
    "guided_noise": [
      -0.925,
      -1.975,
      1.175
    ]
  },
  {
    "guidance_scale": 15.0,
    "guided_noise": [
      -2.05,
      -3.85,
      2.3
    ]
  }
]

Guidance Scale $w$	Effect
0.0	Unconditional sample. Ignores the prompt entirely.
1.0	Pure conditional prediction. Good baseline, no extra CFG boost.
3.0 to 8.0	Illustrative tuning range. Measure adherence, diversity, and artifacts for this model and route.
15.0+	Over-guided. Higher artifact risk, oversaturation, and repeated textures.

Figure: Illustrative classifier-free guidance curves compare prompt adherence, diversity, and artifact risk as guidance scale rises, with low, balanced, and over-guided operating zones. Low guidance drifts, middle guidance aligns, and high guidance oversteers.

DDIM: fast sampling

Figure: Diffusion sampler schedule comparison showing DDPM, DDIM, distillation, and consistency model routes as denoising state tiles with different numbers of reverse steps. Shorter samplers skip more of the reverse chain, so latency improves only if quality, editability, and control still pass evaluation.

Compare denoiser-evaluation budgets. The original DDPM setup is commonly demonstrated with about 1000 reverse steps; a DDIM deployment may try a sparse schedule such as 20 or 50 steps, while distilled or consistency models target still fewer. Whether a reduced schedule preserves acceptable quality depends on model, prompts, guidance, and evaluation slice.

DDIM (Denoising Diffusion Implicit Models) constructs a non-Markovian forward family with the same per-timestep marginals and training objective used by DDPM. Its reverse trajectory still updates one selected state into the next selected state. The practical difference is that generation can follow a shorter timestep subsequence, and one setting makes that trajectory deterministic.^{[13]Reference 13Denoising Diffusion Implicit Models.https://arxiv.org/abs/2010.02502}

DDIM lets us skip steps. Instead of visiting every timestep $t, t-1, t-2, \dots$, the sampler can step $t, t-\Delta, t-2\Delta, \dots$. This can reduce denoiser evaluations substantially, but it doesn't eliminate the need to compare prompt adherence, detail preservation, and edit behavior against the slower baseline.

$\begin{aligned} x_{t-1} &= \sqrt{\bar{\alpha}_{t-1}} \cdot \underbrace{\hat{x}_0}_{\text{predicted } x_0} \\ &\quad + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon_\theta(x_t, t) \\ &\quad + \sigma_t \cdot z \end{aligned}$

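Here $\hat{x}_0 = \left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right) / \sqrt{\bar{\alpha}_t}$ is the model's clean-image estimate, $\epsilon_\theta(x_t,t)$ is the predicted noise direction, and $\sigma_t z$ with $z \sim \mathcal{N}(0, \mathbf{I})$ injects randomness (it vanishes when $\sigma_t = 0$). On a sparse schedule, read $t-1$ as the previous selected timestep rather than the literal neighbor.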

Setting $\sigma_t = 0$ makes the trajectory deterministic: the same starting noise and conditioning always produce the same sample. Because the update doesn't depend on a strict Markov chain, DDIM decouples generation from the exact number of forward steps used during training. You can train on 1000 noise levels but sample on a sparse subset during deployment. The systems lesson is that training and inference schedules don't need to be identical.
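As a minimal sketch of the deterministic update, the toy below applies the DDIM step to a single scalar. The schedule values in `abar_path`, the helper name `ddim_step`, and the "oracle" noise prediction are all assumptions made for illustration; a real sampler would call a trained denoiser at each visited timestep.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma=0.0, z=0.0):
    """One DDIM update from noise level abar_t to abar_prev (scalar toy)."""
    # Clean-sample estimate implied by the predicted noise.
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Direction pointing back toward x_t, shrunk by any injected variance.
    direction = math.sqrt(max(1.0 - abar_prev - sigma**2, 0.0)) * eps_pred
    return math.sqrt(abar_prev) * x0_hat + direction + sigma * z

# Illustrative values: a one-value "image" and the noise that corrupted it.
x0_true, eps_true = 1.0, -0.4

# Hypothetical sparse schedule: cumulative signal fractions at the visited
# timesteps, ending at 1.0 (fully clean). Training could use many more levels.
abar_path = [0.05, 0.30, 0.70, 1.0]

# Start from the closed-form noisy sample at the first visited level.
x = math.sqrt(abar_path[0]) * x0_true + math.sqrt(1.0 - abar_path[0]) * eps_true

for abar_t, abar_prev in zip(abar_path, abar_path[1:]):
    # Oracle stand-in for the trained denoiser: it returns the true noise.
    x = ddim_step(x, eps_true, abar_t, abar_prev)

print(round(x, 6))
```

With a perfect noise prediction, three sparse steps land back on `x0_true` up to floating-point error. A trained model is not an oracle, so real sparse schedules still need the quality comparison described above.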

Each step can require more than one denoiser evaluation. Ordinary CFG needs both a conditional and an unconditional prediction per step. Batching computes the pair in one forward pass at twice the batch size, and guidance distillation removes the second prediction altogether. Count evaluations before claiming an interactive route is affordable:

sampler-evaluation-budget.py

import json

routes = [
    {"name": "offline_ddim_cfg", "steps": 50, "predictions_per_step": 2},
    {"name": "interactive_distilled", "steps": 6, "predictions_per_step": 1},
]

budget = [
    {
        "route": route["name"],
        "denoiser_evaluations": route["steps"] * route["predictions_per_step"],
        "quality_gate_required": True,
    }
    for route in routes
]

print(json.dumps(budget, indent=2))

Output

[
  {
    "route": "offline_ddim_cfg",
    "denoiser_evaluations": 100,
    "quality_gate_required": true
  },
  {
    "route": "interactive_distilled",
    "denoiser_evaluations": 6,
    "quality_gate_required": true
  }
]

Latent diffusion

Architecture

Pixel-space diffusion is computationally expensive because the model must process full-resolution tensors (e.g., $512 \times 512 \times 3$) at every denoising step. Latent Diffusion Models (LDMs), like Stable Diffusion, address this by training the denoiser in the latent space of a pretrained autoencoder.

If you haven't worked with VAEs recently, recall that a variational autoencoder has two parts: an encoder that compresses an image into a compact latent representation, and a decoder that reconstructs the image from it. The latent space is learned lossy compression: it preserves enough perceptual structure for the training objective, while its reconstruction errors remain part of final output quality.^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}

This is the continuous-latent branch of the autoencoder family. A VQ-VAE uses a discrete codebook instead: each patch is snapped to a learned code ID, which makes the compressed image look more like a sequence of tokens.^{[14]Reference 14Neural Discrete Representation Learning.https://arxiv.org/abs/1711.00937} That design is useful when a Transformer is trained to predict image tokens autoregressively or with masking. Latent diffusion usually takes the other route: keep the compressed representation continuous, add noise to it, then train a denoiser over that continuous latent tensor.

During training, the VAE encoder maps an image to a clean latent $z_0$, and the model learns to denoise noisy versions of that latent. During inference, there's no input image: we start from random latent noise $z_T \sim \mathcal{N}(0, \mathbf{I})$ and run the reverse process entirely in latent space before decoding once at the end.^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}

Figure: Latent diffusion spends iterative denoising work in compressed latent space and decodes pixels once.

Only the final denoised latent gets decoded. If the denoiser predicts noise or velocity, the scheduler uses that prediction to update $z_t \rightarrow z_{t-1}$; the decoder never consumes raw predicted noise.

Why latent space?

Operating directly in pixel space means every denoising step touches every pixel. Moving the diffusion process to a compressed latent space splits the workload into two phases. The autoencoder compresses and reconstructs pixels with some detail loss that should be measured, while the diffusion model performs repeated denoising on the smaller latent state.

A $64 \times 64 \times 4$ latent tensor has $48\times$ fewer scalar values than a $512 \times 512 \times 3$ image. That reduction explains why the repeated denoising loop can be substantially cheaper than pixel-space denoising; actual throughput still depends on architecture, step count, batch policy, hardware, and safety checks.

Efficiency: The latent state holds $48\times$ fewer scalar values and $64\times$ fewer spatial locations than the $512 \times 512 \times 3$ image.
Compression: The latent representation removes information; reconstruction quality must be evaluated on details the route cares about.
Practical effect: The expensive denoising loop runs on a much smaller state, reducing computational demand relative to pixel-space diffusion.^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}

latent-work-state.py

import json
import math

pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)
pixel_values = math.prod(pixel_shape)
latent_values = math.prod(latent_shape)

print(json.dumps({
    "pixel_values": pixel_values,
    "latent_values": latent_values,
    "scalar_reduction": pixel_values // latent_values,
    "spatial_reduction": (pixel_shape[0] * pixel_shape[1]) // (latent_shape[0] * latent_shape[1]),
    "throughput_requires_benchmark": True,
}, indent=2))

Output

{
  "pixel_values": 786432,
  "latent_values": 16384,
  "scalar_reduction": 48,
  "spatial_reduction": 64,
  "throughput_requires_benchmark": true
}

Text conditioning via cross-attention

Text-conditioning recipes are model-specific. The latent diffusion paper demonstrates cross-attention conditioning and uses pretrained encoders such as CLIP for text-to-image experiments.^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}^{[15]Reference 15Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020} Other image generators may use T5-family encoders, multiple text encoders, or different freeze/train choices. The stable concept is that prompt embeddings condition repeated denoising updates, not that every text encoder is frozen.

The CrossAttentionBlock takes flattened image or latent features as queries and text embeddings as keys and values. Scaled dot-product scores pass through a numerically stable softmax to become attention weights. The runnable version below omits the learned projection matrices and shows one latent patch attending to three prompt tokens:

text-conditioning-via-cross-attention.py

import json
import math

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def softmax(values: list[float]) -> list[float]:
    largest = max(values)
    exp_values = [math.exp(value - largest) for value in values]
    total = sum(exp_values)
    return [value / total for value in exp_values]

latent_query = [0.25, 0.10, 0.70]
text_tokens = [
    {"token": "oak", "key": [0.30, 0.05, 0.65], "value": [0.70, 0.20]},
    {"token": "dining", "key": [0.10, 0.80, 0.10], "value": [0.20, 0.70]},
    {"token": "table", "key": [0.35, 0.10, 0.55], "value": [0.80, 0.10]},
]

logits = [
    dot(latent_query, token["key"]) / math.sqrt(len(latent_query))
    for token in text_tokens
]
weights = softmax(logits)
conditioned_patch = [
    sum(weight * token["value"][i] for weight, token in zip(weights, text_tokens, strict=True))
    for i in range(2)
]

print(json.dumps({
    "attention_weights": {
        token["token"]: round(weight, 3)
        for token, weight in zip(text_tokens, weights, strict=True)
    },
    "conditioned_patch": [round(value, 3) for value in conditioned_patch],
}, indent=2))

Output

{
  "attention_weights": {
    "oak": 0.359,
    "dining": 0.292,
    "table": 0.349
  },
  "conditioned_patch": [
    0.589,
    0.311
  ]
}

Control and specialization

Production image systems rarely stop at plain text-to-image. Teams may add targeted modules around a base diffuser instead of retraining the whole backbone.

LoRA (Low-Rank Adaptation): Full fine-tuning is expensive. LoRA^{[16]Reference 16LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685} freezes the base weights and adds pairs of small trainable low-rank matrices to selected layers, so you can specialize a base model with a much smaller parameter update.
Inpainting and outpainting: The model conditions on known pixels and a mask, which lets it fill holes, replace objects, or extend the canvas while preserving local context.
ControlNet / adapters: ControlNet (a neural network architecture for adding conditional control to diffusion models)^{[17]Reference 17Adding Conditional Control to Text-to-Image Diffusion Models.https://arxiv.org/abs/2302.05543} keeps the pretrained backbone and adds trainable control branches for signals like depth, edges, or human pose. This is how you turn "draw a person" into "draw a person in exactly this pose."

Fast inference in practice

DDIM shows the first big shortcut; a production system can combine optimizations after measuring each route:

Sampler choice: DDPM is the canonical stochastic sampler. DDIM keeps the same training objective but allows deterministic skipped-step trajectories.^{[13]Reference 13Denoising Diffusion Implicit Models.https://arxiv.org/abs/2010.02502}
Distillation: Each round of progressive distillation halves the number of sampling steps by training a student to match two teacher steps in one; repeated rounds compound the reduction.^{[4]Reference 4Progressive Distillation for Fast Sampling of Diffusion Models.https://arxiv.org/abs/2202.00512}
Consistency models: Consistency training learns to map any point on a noising trajectory directly to that trajectory's clean endpoint, which enables one-step or few-step generation.^{[18]Reference 18Consistency Models.https://arxiv.org/abs/2303.01469}
Straight-line objectives: Rectified flow and flow matching aim to learn straighter transport paths between noise and data, so a numerical solver may need fewer integration steps.^{[9]Reference 9Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flowhttps://arxiv.org/abs/2209.03003}^{[8]Reference 8Flow Matching for Generative Modelinghttps://arxiv.org/abs/2210.02747}
Guidance distillation: A distilled model can absorb guidance behavior so serving avoids separate conditional and unconditional denoiser evaluations per step; quality must be reevaluated after distillation.
Kernel efficiency: Transformer-heavy denoisers benefit from fused kernels such as FlashAttention, which reduce memory traffic and make larger batch sizes practical.^{[19]Reference 19FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.https://arxiv.org/abs/2205.14135}
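For the distillation item, the arithmetic of repeated halving is easy to sketch. The starting count of 1024 steps and the target of 4 are illustrative assumptions, and `halving_rounds` is a hypothetical helper; each round is a separate training run whose student must pass the same quality gate as any other route.

```python
def halving_rounds(teacher_steps, target_steps):
    # Each round trains a student that takes half as many sampling steps.
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        schedule.append(schedule[-1] // 2)
    return schedule

schedule = halving_rounds(teacher_steps=1024, target_steps=4)
print(schedule, len(schedule) - 1)
# [1024, 512, 256, 128, 64, 32, 16, 8, 4] 8
```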

In a design-tool setting, these optimizations determine which governed routes are affordable: offline concept boards tolerate slower samplers, while an interactive preview route may require fewer evaluations and stricter latency budgets. No step-count choice makes an output faithful to a real screenshot; quality and representation checks remain separate gates.

Skills to defend in review

Foundational: Derive the forward process $q(x_t \mid x_{t-1})$ as Gaussian noise addition.
Intermediate: Explain the DDPM training objective as noise prediction and why the reverse posterior is approximated.
Intermediate: Explain how the noise schedule controls the SNR trajectory across timesteps.
Advanced: Describe U-Net and DiT denoisers with timestep and text conditioning.
Advanced: Show the classifier-free guidance extrapolation formula.
Advanced: Distinguish classifier guidance, classifier-free guidance, and prediction targets such as $\epsilon$, $x_0$, and $v$.
Advanced: Explain latent-space diffusion with a VAE encoder, latent denoiser, and VAE decoder.
Advanced: Explain how discrete text diffusion differs from Gaussian image diffusion.
Advanced: Describe why DiffusionGemma-style generation refines a token canvas instead of appending one token at a time.
Advanced: Explain the reparameterization trick for closed-form forward sampling.
Advanced: Contrast pixel-space diffusion with latent-space diffusion.
Advanced: Explain fast inference tradeoffs across DDIM, distillation, and consistency models.
Advanced: Relate rectified flow / flow matching to diffusion and describe the published Stable Diffusion 3 MM-DiT example.
Advanced: Define an asynchronous image-job API, size denoiser evaluations by route, preserve seeded retry semantics, and gate a staged rollout on quality and safety slices.

Production questions to answer

How does classifier-free guidance trade diversity for fidelity?

CFG extrapolates away from the unconditional prediction toward the conditional one. At $w = 0$, you get the unconditional model. At $w = 1$, you recover the plain conditional prediction. On a chosen evaluation slice, increasing $w$ may improve prompt fidelity while reducing diversity or increasing artifacts; tune it with measured outputs rather than as a fixed rule.

Why does Stable Diffusion operate in latent space instead of pixel space?

Operating in pixel space, for example $512 \times 512 \times 3$, is expensive because the denoiser must process a large spatial grid at every step. A VAE compresses the image into a smaller latent tensor such as $64 \times 64 \times 4$, which has $48\times$ fewer raw values and $64\times$ fewer spatial locations. That's what makes the iterative denoising loop practical at high resolution, while the decoder reconstructs pixel-space detail at the end.

What is the difference between DDPM and DDIM sampling?

DDPM defines a reverse Markov chain; the original baseline is commonly shown with a long schedule such as 1000 denoiser evaluations. DDIM defines a non-Markovian process that allows deterministic sampling on a sparse timestep subset. A route can evaluate schedules such as 20 or 50 steps for latency, quality, and inversion/edit behavior rather than assuming they match the baseline.

How do text encoders integrate with a U-Net or DiT?

The text prompt is tokenized and encoded into a sequence of embeddings, often by a CLIP-family or T5-family encoder depending on the model family. Stable Diffusion-style U-Nets inject those embeddings through cross-attention layers, where image or latent features provide the queries and text embeddings provide keys and values. Newer multimodal DiTs may instead use joint text-image attention with modality-specific weights, as Stable Diffusion 3 does.^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}^{[7]Reference 7Scaling Rectified Flow Transformers for High-Resolution Image Synthesishttps://arxiv.org/abs/2403.03206}

How does DiffusionGemma-style text generation differ from ordinary LLM decoding?

Ordinary decoder-only LLMs append tokens left to right, so each new token depends on a committed prefix. DiffusionGemma-style text generation starts with a target token canvas and repeatedly refines uncertain positions through discrete diffusion.^{[1]Reference 1DiffusionGemmahttps://ai.google.dev/gemma/docs/diffusiongemma} That can reduce strict next-token dependency inside a block and can use bidirectional context during refinement. Serving now has new knobs: target length blocks, number of refinement passes, confidence thresholds, and UI behavior for text that can change before final commit.
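The canvas-refinement pattern can be illustrated with a toy loop. This is a generic masked-refinement sketch under loud assumptions, not DiffusionGemma's actual algorithm: the token proposals, the confidence numbers, the neighbor bonus, and the fallback rule are all invented for illustration.

```python
MASK = "[MASK]"

# Hypothetical stand-in for a text-diffusion model: fixed token proposals and
# base confidences. A real model would recompute both from the whole canvas.
proposals = ["the", "oak", "table", "glows"]
base_confidence = [0.9, 0.5, 0.8, 0.3]

def position_confidence(canvas, index):
    # Toy rule: each committed neighbor adds confidence (bidirectional context).
    neighbors = [i for i in (index - 1, index + 1) if 0 <= i < len(canvas)]
    committed = sum(canvas[i] != MASK for i in neighbors)
    return base_confidence[index] + 0.25 * committed

def refine(canvas, threshold):
    passes = 0
    while MASK in canvas:
        masked = [i for i, token in enumerate(canvas) if token == MASK]
        # Score every masked position before committing anything this pass.
        scores = {i: position_confidence(canvas, i) for i in masked}
        ready = [i for i in masked if scores[i] >= threshold]
        if not ready:
            # Always commit the most confident position so the loop terminates.
            ready = [max(masked, key=scores.get)]
        for i in ready:
            canvas[i] = proposals[i]
        passes += 1
    return canvas, passes

final_canvas, passes_used = refine([MASK] * 4, threshold=0.7)
print(final_canvas, passes_used)
# ['the', 'oak', 'table', 'glows'] 3
```

Four positions fill in three passes because two positions clear the threshold together in the first pass. The threshold and pass count are the kind of serving knobs the paragraph above describes.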

Why does the noise schedule matter so much in diffusion training?

The schedule determines the signal-to-noise ratio at every timestep. In the closed-form forward process, remaining signal power is proportional to $\bar{\alpha}_t$ and noise power is proportional to $1 - \bar{\alpha}_t$, so $\mathrm{SNR}_t = \bar{\alpha}_t / (1 - \bar{\alpha}_t)$ for unit-variance data. If signal disappears too quickly, many late steps become nearly pure noise and contribute a weak learning signal. If too much signal remains at the final timestep, training never covers the near-pure-noise states that sampling starts from, so the model never learns the hardest denoising steps.
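A short sketch makes the SNR trajectory visible. It assumes a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over 1000 steps, the setting used in the DDPM paper; the helper names are hypothetical, and other schedules would give a different curve.

```python
def linear_betas(num_steps, beta_start, beta_end):
    # Evenly spaced per-step noise variances.
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def snr_trajectory(betas):
    abar, snrs = 1.0, []
    for beta in betas:
        abar *= 1.0 - beta               # cumulative signal fraction
        snrs.append(abar / (1.0 - abar))  # signal power over noise power
    return snrs

# Assumed schedule: 1000 steps, beta from 1e-4 to 0.02.
snrs = snr_trajectory(linear_betas(1000, 1e-4, 0.02))
for t in (1, 250, 500, 750, 1000):
    print(t, f"{snrs[t - 1]:.4g}")
```

The printed values fall by many orders of magnitude from the first step to the last, which is why the spacing of noise levels shapes which denoising tasks the model sees most often.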

What is the reparameterization trick in the forward process?

The reparameterization trick lets us sample $x_t$ at any timestep directly from $x_0$ without iterating through all previous noising steps. Because sums of independent Gaussian variables remain Gaussian, the full forward Markov chain can be marginalized into $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$, a linear combination of $x_0$ and one standard normal noise variable $\epsilon$. That lets training draw a random timestep for every example in a batch and noise all of them in parallel.

Common design failures

Confusing noise prediction, $\epsilon$, with clean-image prediction, $x_0$.
Assuming the forward process requires a neural network, even though it's fixed math.
Confusing classifier guidance with classifier-free guidance.
Thinking classifier-free guidance requires a separate classifier model.
Thinking the VAE decoder consumes the denoiser's raw noise prediction instead of the final denoised latent.
Overlooking the perceptual compression loss introduced by the VAE in latent diffusion.
Treating guidance scale as a probability rather than an extrapolation coefficient.
Assuming fewer sampling steps are a free speedup with no quality or editability tradeoff.
Assuming diffusion always means Gaussian noise over pixels, even though text diffusion uses discrete token corruption and refinement.
Treating DiffusionGemma-style generation as ordinary next-token decoding with a different sampler.

Traps that catch beginners

CFG scale isn't a probability. It's an extrapolation coefficient. $w=0$ is unconditional, $w=1$ is conditional without extra boost, and larger values trade diversity for prompt fidelity.^{[12]Reference 12Classifier-Free Diffusion Guidance.https://arxiv.org/abs/2207.12598}
Latent diffusion doesn't remove the autoencoder bottleneck. It moves the denoising loop into a smaller space, but decoder quality still caps the final image quality.^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}
The denoiser's raw prediction isn't what gets decoded. If the model predicts $\epsilon$ or $v$, the scheduler converts that into the next latent and only the final clean latent goes through the VAE decoder.^{[2]Reference 2Denoising Diffusion Probabilistic Models.https://arxiv.org/abs/2006.11239}^{[5]Reference 5High-Resolution Image Synthesis with Latent Diffusion Models.https://arxiv.org/abs/2112.10752}
DiT doesn't automatically mean cheaper inference. Transformers scale well with model size, but self-attention cost grows quadratically with token count. Patchifying a compressed latent instead of raw pixels keeps that token count manageable, which is what makes DiT practical for image generation.^{[6]Reference 6Scalable Diffusion Models with Transformers.https://arxiv.org/abs/2212.09748}
Text diffusion isn't pixel diffusion with token labels. The noisy state is discrete, the model refines masked or corrupted token positions, and target length planning becomes part of serving.

Teams building design-asset pipelines can't treat synthetic imagery as verified UI or incident evidence. Generated outputs can show incorrect labels, brand marks, chart values, legal text, or unsafe content. High-risk routes need output filtering, provenance labeling, and human review before publication.

synthetic-output-release-gate.py

import json

outputs = [
    {"asset": "campaign_background", "claims_real_screenshot": False, "contains_logo": False},
    {"asset": "ui_mockup_with_text", "claims_real_screenshot": True, "contains_logo": False},
    {"asset": "incident_screenshot_evidence", "claims_real_screenshot": True, "contains_logo": True},
]

decisions = []
for output in outputs:
    needs_review = output["claims_real_screenshot"] or output["contains_logo"]
    decisions.append({
        "asset": output["asset"],
        "decision": "human_review" if needs_review else "synthetic_labeled_preview",
    })

print(json.dumps(decisions, indent=2))

Output

[
  {
    "asset": "campaign_background",
    "decision": "synthetic_labeled_preview"
  },
  {
    "asset": "ui_mockup_with_text",
    "decision": "human_review"
  },
  {
    "asset": "incident_screenshot_evidence",
    "decision": "human_review"
  }
]

Diffusion model checklist

What diffusion models do

Forward process: a fixed Markov chain gradually destroys information by adding Gaussian noise.
Reverse process: a learned model (U-Net or DiT) predicts and removes noise step-by-step.
Classifier-Free Guidance (CFG): a common text-steering mechanism that extrapolates from the unconditional prediction toward and past the conditional prediction.
Latent diffusion: iterative denoising happens in a compressed representation, which reduces work while introducing a reconstruction bottleneck that must be measured.
Text diffusion: the same denoising idea moves onto discrete token sequences, where masked positions are refined over several passes instead of appended left to right.
Modern architecture: Stable Diffusion 3 is a published example of MM-DiT with rectified flow; DiffusionGemma is a current Google example of discrete diffusion for text.
Inference engineering: sampler choice, denoiser evaluations, refinement rounds, output canvas length, distillation, and quality checks determine whether a route is usable in production.

Common misconceptions

"The model predicts the image directly." Many standard diffusion models predict the noise $\epsilon$, though predicting the clean image $x_0$ or a velocity target $v$ is also possible.
"More steps always equal better quality." Extra steps reduce sampler discretization error, but gains flatten once model error and guidance artifacts dominate. Few-step models exist, but they're an engineering trade-off, not a free lunch.
"Latent diffusion throws away everything important." The VAE is lossy, but it's trained for useful perceptual reconstruction. Measure compression errors separately from denoiser errors on the details the route requires.
"DiffusionGemma is just faster next-token decoding." It changes the generation dependency pattern: a masked token canvas is refined over passes, so serving has different length, streaming, and confidence controls.

Trace a full image diffusion pipeline from prompt to pixel, explain why the forward process uses a closed-form noise sample, and read a training loop to identify exactly where the noise is added, where the MSE loss is computed, and why the model is learning anything at all. Also explain why Stable Diffusion runs in latent space, what CFG does to the noise prediction, why fewer sampling steps are a speed-quality trade-off, and how DiffusionGemma-style text generation changes the dependency pattern from prefix appending to canvas refinement.

Next Step

Continue to Real-Time Voice AI Agent

You can now reason about iterative generation under latency and quality trade-offs. Next you'll apply that systems discipline to live audio, where streaming stages, interruptions, and first-audio latency shape the architecture.


References

[1] DiffusionGemma. Google · 2026

[2] Denoising Diffusion Probabilistic Models. Ho, J., Jain, A., & Abbeel, P. · 2020 · NeurIPS 2020

[3] Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Sohl-Dickstein et al. · 2015

[4] Progressive Distillation for Fast Sampling of Diffusion Models. Salimans, T., & Ho, J. · 2022 · ICLR 2022

[5] High-Resolution Image Synthesis with Latent Diffusion Models. Rombach, R., et al. · 2022 · CVPR 2022

[6] Scalable Diffusion Models with Transformers. Peebles, W., & Xie, S. · 2023 · ICCV 2023

[7] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. Esser, P., Kulal, S., Blattmann, A., et al. · 2024

[8] Flow Matching for Generative Modeling. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. · 2022

[9] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. Liu, X., Gong, C., & Liu, Q. · 2022

[10] Structured Denoising Diffusion Models in Discrete State-Spaces. Austin, J., Johnson, D. D., Ho, J., Tarlow, D., & van den Berg, R. · 2021 · NeurIPS 2021

[11] DiffusionGemma Model Card. Google · 2026

[12] Classifier-Free Diffusion Guidance. Ho, J., & Salimans, T. · 2022 · NeurIPS 2021 Workshop

[13] Denoising Diffusion Implicit Models. Song, J., Meng, C., & Ermon, S. · 2020 · ICLR 2021

[14] Neural Discrete Representation Learning. Oord, A. van den, Vinyals, O., & Kavukcuoglu, K. · 2017

[15] Learning Transferable Visual Models From Natural Language Supervision. Radford, A., et al. · 2021 · ICML 2021

[16] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR 2022

[17] Adding Conditional Control to Text-to-Image Diffusion Models. Zhang, L., et al. · 2023 · ICCV 2023

[18] Consistency Models. Song, Y., et al. · 2023 · ICML 2023

[19] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

