Understand how Vision Transformers split images into patches, build visual tokens, train encoders, and connect to CLIP and multimodal LLMs.
Text transformers start with tokens. Vision transformers start by inventing tokens from pixels.
You just learned how attention routes information across a sequence. ViT changes where the sequence comes from: image patches instead of word tokens.
That one shift explains why multimodal systems feel less mysterious. A document scan, UI screenshot, or field-inspection image can't enter a transformer as raw pixels. It must first become a sequence of vectors. A Vision Transformer (ViT) does that by cutting the image into patches, projecting each patch into an embedding, adding position information, and running transformer encoder blocks.[1]
Take a 224 x 224 RGB image. If you use 16 x 16 patches, the image becomes:
```
224 / 16 = 14 patches per side
14 x 14 = 196 image patches
```

Each patch contains 16 x 16 x 3 = 768 raw pixel values. ViT flattens each patch and applies a learned linear projection to map it to the model's hidden size d_model. ViT-Base uses d_model = 768. That this happens to equal the raw patch width of 768 is a coincidence of this configuration, not a rule: a 16 x 16 patch projected to ViT-Large's d_model = 1024 would go from 768 raw values to a 1024-dimensional vector. The raw patch width and the hidden size are two independent numbers.
More generally, for an H x W x C image and square patch width P, assuming P divides both H and W:
```
patch_count = (H / P) x (W / P)
raw_patch_width = P x P x C
projected_tokens shape = [patch_count, d_model]
```

Track those three values separately. Changing patch size changes sequence length and raw patch width. Changing d_model changes projected feature width.
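The three formulas above can be checked with a few lines of Python. This is a minimal sketch: the helper name `patch_geometry` and the non-square 256 x 192 example are illustrative assumptions, not part of any library.

```python
def patch_geometry(height: int, width: int, channels: int,
                   patch_size: int, d_model: int):
    """Return (patch_count, raw_patch_width, projected token shape)."""
    # The formulas assume the patch width divides both image sides.
    assert height % patch_size == 0 and width % patch_size == 0
    patch_count = (height // patch_size) * (width // patch_size)
    raw_patch_width = patch_size * patch_size * channels
    return patch_count, raw_patch_width, (patch_count, d_model)

# Same image and patch size, two hidden sizes: only the projected width changes.
base = patch_geometry(224, 224, 3, 16, 768)
large = patch_geometry(224, 224, 3, 16, 1024)
# Hypothetical non-square input: patch count follows (H / P) x (W / P).
wide = patch_geometry(256, 192, 3, 16, 768)

print(base)
print(large)
print(wide)
```

Here the raw patch width stays at 768 in all three calls because the patch size and channel count never change, while the token count and projected width move independently.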
Now the transformer sees a sequence:
```
[patch_1, patch_2, ..., patch_196]
```

This is the same shape idea you learned for text. The model doesn't process "an image" directly. It processes a sequence of vectors, and attention lets each visual token look at other visual tokens.
Why flatten and project? A linear layer (a matrix multiplication) learns to turn the raw pixel values into a d_model-dimensional input representation that the transformer can refine. The weights are shared across all patches -- the same "patch embedder" is applied everywhere. This is analogous to the token embedding matrix in text transformers.
Patchify can be a strided convolution. The ViT paper describes flattening each patch and applying one shared linear projection. A 2D convolution with kernel size P and stride P can perform exactly that same operation: each output location reads one non-overlapping patch, and each output channel is one learned projection dimension. So nn.Conv2d(3, 768, kernel_size=16, stride=16) can produce the same patch tokens as explicit reshape-then-matmul when its weights and bias are copied into the equivalent linear layer.[1]
This minimal, runnable NumPy implementation shows the patch embedding step with toy numbers:
```python
import numpy as np

def extract_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split HxWxC image into (N_patches, patch_size*patch_size*C) flat patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.reshape(
        h // patch_size, patch_size,
        w // patch_size, patch_size, c
    ).swapaxes(1, 2).reshape(-1, patch_size * patch_size * c)
    return patches

def patch_embed(patches: np.ndarray, proj_weight: np.ndarray) -> np.ndarray:
    """Project (N, patch_width) patches with shared (d_model, patch_width) weights."""
    return patches @ proj_weight.T  # shared weights for every patch

# Toy 32x32 RGB image, 8x8 patches: 16 patches, each 192 raw values
rng = np.random.default_rng(0)
toy_image = rng.normal(size=(32, 32, 3)).astype(np.float32)
patches = extract_patches(toy_image, patch_size=8)
print("Patches shape:", patches.shape)

d_model = 256
proj = rng.normal(size=(d_model, 192)).astype(np.float32) * 0.02
embedded = patch_embed(patches, proj)
print("Embedded tokens shape:", embedded.shape)
```

```
Patches shape: (16, 192)
Embedded tokens shape: (16, 256)
```

The next check uses PyTorch's Conv2d layer and an explicit flattened-patch multiplication with identical weights. Matching output confirms that these are two implementations of the same patch projection, not two different vision architectures.
```python
import torch

torch.manual_seed(3)
image = torch.arange(1 * 3 * 4 * 4, dtype=torch.float32).reshape(1, 3, 4, 4)
patch_size = 2
d_model = 5

conv = torch.nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
conv_tokens = conv(image).flatten(2).transpose(1, 2)

flat_patches = (
    image.unfold(2, patch_size, patch_size)
    .unfold(3, patch_size, patch_size)
    .permute(0, 2, 3, 1, 4, 5)
    .reshape(1, -1, 3 * patch_size * patch_size)
)
linear_tokens = flat_patches @ conv.weight.reshape(d_model, -1).T + conv.bias

print("Token shape:", tuple(conv_tokens.shape))
print("Conv equals explicit projection:", torch.allclose(conv_tokens, linear_tokens))
```

```
Token shape: (1, 4, 5)
Conv equals explicit projection: True
```

Patch size is a major design knob:
| Patch size | Tokens for 224 x 224 image | Trade-off |
|---|---|---|
| 32 x 32 | 49 | Cheaper, less detail |
| 16 x 16 | 196 | Common balance |
| 14 x 14 | 256 | More detail, more attention cost |
| 8 x 8 | 784 | Expensive, useful for fine detail |
This matters when reading tiny chart labels, serial numbers, or document text. Smaller patches preserve more visual detail (a small axis label or printed ID), but self-attention cost grows with the square of token count. Systems that need fine detail often add smaller patches, higher-resolution crops, OCR fallback, or variable-resolution processing for text-heavy regions.
```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    assert image_size % patch_size == 0, "image must split into complete patches"
    patches_per_side = image_size // patch_size
    return patches_per_side**2

def attention_pairs(tokens: int) -> int:
    return tokens**2

tokens_16 = patch_tokens(224, 16)
tokens_14 = patch_tokens(224, 14)
tokens_8 = patch_tokens(224, 8)

print(tokens_16, tokens_14, tokens_8)
print(round(attention_pairs(tokens_8) / attention_pairs(tokens_16), 1), "x more attention pairs")
print("16-patch count correct:", tokens_16 == 196)
print("14-patch count correct:", tokens_14 == 256)
print("8-patch count correct:", tokens_8 == 784)
```

```
196 256 784
16.0 x more attention pairs
16-patch count correct: True
14-patch count correct: True
8-patch count correct: True
```

A bag of patches isn't enough. The model needs to know where each patch came from. A patch in the top-left corner of an inspection photo means something different from an identical patch near the barcode.
The original ViT adds a learned one-dimensional position embedding table to the raster-ordered patch sequence. A patch in row 0, column 1 receives the vector for its flattened sequence slot. The paper found no clear gain from its tested 2D-aware alternatives; the learned 1D vectors still developed row-and-column structure after training.[1] Later encoders may use explicitly 2D or relative position schemes.
For patch sequence slot i, the input to the encoder is:

```
visual_token_i = patch_projection_i + position_embedding_i
```

Original ViT also prepends a learnable classification token ([CLS]). After the encoder, that token is the image representation supplied to its classifier. Other image encoders may pool patch tokens or keep the full patch sequence for downstream tasks.
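To make the assembly concrete, here is a minimal NumPy sketch of building the encoder input for a ViT-Base-sized sequence. The random arrays are stand-ins for learned parameters, and the variable names are illustrative. The position table has one row per slot, including the [CLS] slot, as in the original ViT.

```python
import numpy as np

rng = np.random.default_rng(1)
num_patches, d_model = 196, 768

# Stand-ins for values a trained model would supply (assumptions for illustration).
patch_projection = rng.normal(size=(num_patches, d_model))   # projected patches
cls_token = rng.normal(size=(1, d_model)) * 0.02             # learnable [CLS] vector
position_embedding = rng.normal(size=(num_patches + 1, d_model)) * 0.02

# Prepend [CLS], then add one position vector per sequence slot.
sequence = np.concatenate([cls_token, patch_projection], axis=0)
encoder_input = sequence + position_embedding

print("Patch tokens:", patch_projection.shape)
print("Encoder input:", encoder_input.shape)
```

The sequence grows from 196 to 197 slots because of the prepended token; the feature width stays at d_model.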
The tiny attention calculation below swaps two patch contents. A summary query can't distinguish the swap when there's no position signal; after position vectors are added, the same query produces a different summary.
```python
import numpy as np

def attend(query: np.ndarray, tokens: np.ndarray) -> np.ndarray:
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ tokens

query = np.array([1.0, 0.0])
patches = np.array([[2.0, 0.1], [0.2, 1.5]])
swapped = patches[::-1]
positions = np.array([[0.0, 0.6], [0.5, 0.0]])

without_position = np.allclose(attend(query, patches), attend(query, swapped))
with_position = np.allclose(
    attend(query, patches + positions),
    attend(query, swapped + positions),
)
print("Same summary after swap without positions:", without_position)
print("Same summary after swap with positions:", with_position)
```

```
Same summary after swap without positions: True
Same summary after swap with positions: False
```

The important habit is to track shape and meaning:
| Tensor | Shape example | Meaning |
|---|---|---|
| image | [3, 224, 224] | RGB pixels |
| patches | [196, 768] | flattened patch values |
| patch embeddings | [196, 768] | projected visual tokens |
| encoder input with [CLS] | [197, 768] | summary slot plus positioned patches |
| encoded tokens | [197, 768] | contextual image representation |
Learned position tables also expose a common failure when resolution changes. Moving from 224 x 224 to 256 x 256 with 16 x 16 patches changes the patch sequence from 196 to 256 entries. The original ViT approach handles higher-resolution fine-tuning by interpolating the pretrained position embeddings on their 2D patch grid.[1]
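The interpolation step can be sketched in plain NumPy. The paper describes 2D interpolation of the pretrained position embeddings on the patch grid; the bilinear, corner-aligned scheme below is an assumption chosen for brevity, and the helper names and the toy width of 8 (instead of 768) are illustrative.

```python
import numpy as np

def resize_grid_bilinear(grid: np.ndarray, new_side: int) -> np.ndarray:
    """Bilinearly resize a (side, side, d) grid to (new_side, new_side, d)."""
    old_side = grid.shape[0]
    # Sample positions in the old grid, with corners aligned (an assumption).
    coords = np.linspace(0.0, old_side - 1, new_side)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_side - 1)
    frac = coords - lo
    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1.0 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    return rows[:, lo] * (1.0 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]

def interpolate_positions(table: np.ndarray, old_side: int, new_side: int) -> np.ndarray:
    """Resize the patch rows of a position table; keep the [CLS] row unchanged."""
    cls_row, patch_rows = table[:1], table[1:]
    width = table.shape[1]
    grid = patch_rows.reshape(old_side, old_side, width)
    resized = resize_grid_bilinear(grid, new_side).reshape(new_side * new_side, width)
    return np.concatenate([cls_row, resized], axis=0)

rng = np.random.default_rng(0)
trained = rng.normal(size=(1 + 14 * 14, 8))          # 224 / 16 = 14 patches per side
resized = interpolate_positions(trained, old_side=14, new_side=16)

print("trained table:", trained.shape)
print("resized table:", resized.shape)
```

The [CLS] row has no grid location, so this sketch copies it through and interpolates only the 14 x 14 patch rows up to 16 x 16.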
```python
import numpy as np

def sequence_length(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    assert image_size % patch_size == 0, "image must split into complete patches"
    patch_count = (image_size // patch_size) ** 2
    return patch_count + int(cls_token)

trained_positions = np.zeros((sequence_length(224, 16), 768))
higher_resolution_tokens = np.zeros((sequence_length(256, 16), 768))

print("trained slots:", trained_positions.shape[0])
print("new slots:", higher_resolution_tokens.shape[0])
print("needs resized position table:", trained_positions.shape != higher_resolution_tokens.shape)
```

```
trained slots: 197
new slots: 257
needs resized position table: True
```

ViT uses standard transformer encoder blocks (the same scaled dot-product attention and feed-forward layers you learned in the attention chapter). Unlike a decoder-only language model, it doesn't need a causal mask for image understanding. Every patch can attend to every other patch -- the attention is bidirectional.
For an inspection image, one patch might show an error panel, another might show a status badge, and another might show a tiny warning icon. Self-attention lets the encoder combine these clues across the image instead of treating each patch independently. A patch containing "ERROR" text can directly influence the representation of a distant "warning" patch.
That's why ViT connects naturally to the attention chapter. Query and key vectors still decide where information flows. Value vectors still carry content. The sequence positions are image patches instead of words. The full stack also includes residual connections, LayerNorm (covered in a later chapter), and position-wise feed-forward networks -- exactly the same building blocks as text transformers.
To see the mask difference directly, the next example uses the same attention scores twice. A vision encoder leaves all patch-to-patch entries available. A left-to-right decoder masks future entries.
```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    return probs / probs.sum(axis=-1, keepdims=True)

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
full_weights = softmax(scores)
causal_scores = np.where(np.triu(np.ones_like(scores), k=1) == 1, -np.inf, scores)
causal_weights = softmax(causal_scores)

print("Vision patch 0 can read later patch 1:", full_weights[0, 1] > 0)
print("Causal token 0 can read later token 1:", causal_weights[0, 1] > 0)
print("Vision row sums to one:", np.isclose(full_weights[0].sum(), 1.0))
```

```
Vision patch 0 can read later patch 1: True
Causal token 0 can read later token 1: False
Vision row sums to one: True
```

A complete Vision Transformer typically looks like this:
1. Image: [3, 224, 224]
2. Flattened patches: 196 × (16×16×3) = 196 × 768
3. Patch embeddings: 196 × 768 tokens
4. Encoder input with [CLS] token (for the original classification model): 197 × 768
5. Encoded tokens: 197 × 768
6. Image representation: 768 dimensions

The output is a set of contextual visual tokens plus an optional global image representation. CLIP-style image encoders commonly read a pooled image vector, while a VLM bridge may retain many patch features.

This shape discipline is the same habit you developed for text token sequences.
Why pretraining scale matters. A convolutional neural network (CNN) bakes locality and translation equivariance into each layer. Original ViT uses patch extraction and learned position embeddings as its limited image-specific structure, then learns spatial relationships through attention. In the ViT paper's controlled experiments, ResNet baselines do better with smaller pretraining datasets, while ViT catches up or does better as pretraining data grows to ImageNet-21k and JFT-300M.[1] For a limited private image dataset, begin evaluation with a pretrained encoder rather than assuming training a large ViT from scratch will win.
CLIP (Contrastive Language-Image Pre-training) uses contrastive learning to train an image encoder and a text encoder together so matching image-text pairs land near each other in embedding space. Its paper trains on 400 million image-text pairs gathered from the internet instead of using only a fixed class-label list.[2]
A simplified training batch looks like this:
| Image | Matching text |
|---|---|
| Photo of a cracked equipment panel | "cracked panel with warning label" |
| Instrument-panel photo | "voltage warning panel on gray rack" |
| Document scan | "barcode and form label" |
The model normalizes image and text embeddings, computes a matrix of scaled cosine similarities, and applies symmetric softmax cross-entropy: each image must pick its paired text, and each text must pick its paired image. For zero-shot classification, it embeds text prompts for the candidate classes and ranks their scaled similarities to the image.[2]
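The training objective described above can be sketched in NumPy. This is a minimal illustration under stated assumptions: the embeddings are hand-picked toy vectors, `logit_scale` is fixed at 10.0 rather than learned, and the function names are not from any library.

```python
import numpy as np

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, logit_scale: float) -> float:
    # Scaled cosine similarities: rows index images, columns index texts.
    logits = logit_scale * (normalize(image_emb) @ normalize(text_emb).T)
    idx = np.arange(logits.shape[0])
    # Each image must pick its paired text (softmax over each row) ...
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # ... and each text must pick its paired image (softmax over each column).
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (image_to_text + text_to_image))

# Toy batch of three pairs; row i of each array belongs to pair i.
image_emb = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]])
text_emb = np.array([[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]])

aligned = symmetric_contrastive_loss(image_emb, text_emb, logit_scale=10.0)
# Rotating the texts breaks every pairing, so the loss should rise.
mismatched = symmetric_contrastive_loss(image_emb, text_emb[[1, 2, 0]], logit_scale=10.0)

print("Aligned pairs loss:", round(aligned, 4))
print("Mismatched pairs loss:", round(mismatched, 4))
print("Aligned is lower:", aligned < mismatched)
```

The two softmax directions share one similarity matrix; only the axis of normalization differs.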
CLIP isn't a captioning model by itself. It returns similarity scores. A multimodal assistant needs extra components (a language model head or a separate captioner) to turn image understanding into grounded language.
The connection to the sentence-embedding lesson is direct: the same contrastive learning machinery that turns sentences into useful vectors is used here to align two different modalities (vision and language) in one shared embedding space.
This compact inference example treats two descriptions as candidate classes. The returned number is a score-derived probability over those supplied candidates, not generated language or verified evidence.
```python
import numpy as np

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

image = normalize(np.array([[0.95, 0.20, 0.05]]))
prompts = ["a cracked equipment panel", "a clean equipment panel"]
text = normalize(np.array([[0.90, 0.25, 0.05], [0.05, 0.95, 0.10]]))
logit_scale = 5.0

logits = logit_scale * (image @ text.T)[0]
probabilities = np.exp(logits - logits.max())
probabilities = probabilities / probabilities.sum()

print("Ranked prompt:", prompts[int(np.argmax(probabilities))])
print("Candidate probabilities:", np.round(probabilities, 3).tolist())
```

```
Ranked prompt: a cracked equipment panel
Candidate probabilities: [0.976, 0.024]
```

Modern vision-language models (VLMs) can connect a vision encoder to a large language model (LLM). Original LLaVA uses CLIP ViT-L/14 grid features, a learned linear projection into the language embedding dimension, and a language decoder.[3] The bridge maps each selected vision feature into an input vector the LLM can consume. Which visual features are selected, and whether they are pooled or compressed, is a model design choice rather than a ViT rule.
That handoff creates practical engineering questions about the token budget and evidence quality:
| Concern | Engineering question |
|---|---|
| Context window | Before any compression, a 224 x 224 image at 16 x 16 patches yields 196 patch features. At 14 x 14 patches it yields 256. Does your bridge pass all features into the LLM's context window or reduce them first? |
| Detail preservation | Does the chosen encoder retain the label or tracking-code evidence on your evaluated images? If it doesn't, measure higher-resolution crops, optical character recognition (OCR), or a document-specific path. |
| Interleaving | Are visual tokens injected once at the beginning, or can the model "look again" at the image in later turns? |
| Tool use | Can the inspection VLM read an equipment photo, output a structured report, and then call a policy lookup tool with the detected asset tag? |
For an inspection workflow, you might ask a VLM to identify visible damage, extract an asset code, and prepare a structured request for an inspection tool. This must be evaluated separately for damage detection, text extraction, and action correctness. Visual tokens provide model input; they don't prove that a claimed detail was read correctly.
The projector itself is easy to shape-check. This scaled-down example maps vision width 8 to language-model width 16. A production bridge may use widths such as 1024 and 4096, but the invariant is identical: a linear projector changes feature width, not feature-slot count.
```python
import numpy as np

rng = np.random.default_rng(5)
visual_features = rng.normal(size=(1, 4, 8))
projector = rng.normal(size=(8, 16))
llm_visual_inputs = visual_features @ projector
text_token_count = 3

print("Vision features:", visual_features.shape)
print("Projected features:", llm_visual_inputs.shape)
print("Slots before generation:", llm_visual_inputs.shape[1] + text_token_count)
```

```
Vision features: (1, 4, 8)
Projected features: (1, 4, 16)
Slots before generation: 7
```

```python
def patch_feature_slots(image_size: int, patch_size: int, images: int = 1) -> int:
    assert image_size % patch_size == 0, "image must split into complete patches"
    per_image = (image_size // patch_size) ** 2
    return images * per_image

base = patch_feature_slots(224, 16, images=1)
two_images = patch_feature_slots(224, 16, images=2)
high_res = patch_feature_slots(448, 16, images=1)

print(base, two_images, high_res)
print("single-image patch features correct:", base == 196)
print("two-image patch features correct:", two_images == 392)
print("high-res patch features correct:", high_res == 784)
```

```
196 392 784
single-image patch features correct: True
two-image patch features correct: True
high-res patch features correct: True
```

These counts are a patch-feature baseline, not a promise about every VLM. A classifier might use one [CLS] output. A bridge might pass all patch features, merge adjacent features, or resample them into another length.
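As one illustration of merging adjacent features, the sketch below averages each 2 x 2 neighborhood of the patch grid, cutting 196 features to 49. Average pooling is an assumption chosen for simplicity, not the method of any particular VLM, and `merge_2x2` is a hypothetical helper name.

```python
import numpy as np

def merge_2x2(features: np.ndarray, grid_side: int) -> np.ndarray:
    """Average each 2x2 block of a raster-ordered (grid_side**2, d) feature grid."""
    count, width = features.shape
    assert count == grid_side * grid_side, "features must form a square grid"
    assert grid_side % 2 == 0, "grid side must be even to form 2x2 blocks"
    half = grid_side // 2
    # Axes: (block row, row within block, block col, col within block, feature).
    blocks = features.reshape(half, 2, half, 2, width)
    return blocks.mean(axis=(1, 3)).reshape(half * half, width)

# Toy features for a 14 x 14 grid with width 4 instead of a real encoder width.
features = np.arange(196 * 4, dtype=np.float64).reshape(196, 4)
merged = merge_2x2(features, grid_side=14)

print("Before merge:", features.shape)
print("After merge:", merged.shape)
```

Merging quarters the slot count and leaves the feature width unchanged, so the attention cost inside the LLM drops while fine detail within each 2 x 2 block is blurred together.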
The pipeline stays recognizable as researchers change its training objective and image sizing policy.
CLIP's softmax loss needs normalization across rows and columns of the batch similarity matrix. SigLIP replaces this with a sigmoid loss on each image-text pair: diagonal pairs receive positive labels and off-diagonal pairs receive negative labels. The SigLIP paper reports that this loss is simpler to distribute, uses less memory in its implementation, and outperforms its softmax baseline at small batch sizes in the evaluated setups.[4]
This code computes the pairwise logistic losses for one tiny batch. High diagonal logits and low off-diagonal logits reduce the loss.
```python
import numpy as np

logits = np.array([[2.0, -1.0], [-0.4, 1.7]])
labels = 2 * np.eye(2) - 1
pair_losses = np.logaddexp(0.0, -labels * logits)

print("Positive losses:", np.round(np.diag(pair_losses), 3).tolist())
print("Negative losses:", np.round(pair_losses[labels < 0], 3).tolist())
print("Mean pair loss:", round(float(pair_losses.mean()), 3))
```

```
Positive losses: [0.127, 0.168]
Negative losses: [0.313, 0.513]
Mean pair loss: 0.28
```

A fixed-resolution ViT setup resizes images to a chosen patch grid. Original ViT can be fine-tuned at a higher resolution by interpolating its learned position table. SigLIP 2's NaFlex variant instead supports multiple sequence lengths while preserving an image's native aspect ratio as much as its patch constraints allow.[1][5] This is relevant to document images because squeezing a tall label into a square input changes the evidence before the model sees it.
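The arithmetic behind that concern is easy to check. The sketch below is not the NaFlex procedure; it only compares a square resize with a grid that keeps the image's own proportions. The 640 x 160 label crop and both helper names are hypothetical.

```python
from fractions import Fraction

def square_resize_stretch(height: int, width: int) -> Fraction:
    """Horizontal scale divided by vertical scale when H x W is resized to a square."""
    # Resizing to S x S scales width by S / W and height by S / H; the ratio is H / W.
    return Fraction(height, width)

def native_grid(height: int, width: int, patch_size: int) -> tuple[int, int]:
    """Patch rows and columns when the image keeps its own aspect ratio."""
    assert height % patch_size == 0 and width % patch_size == 0
    return height // patch_size, width // patch_size

# Hypothetical tall label crop: 640 pixels high, 160 pixels wide.
height, width, patch_size = 640, 160, 16
rows, cols = native_grid(height, width, patch_size)
square_tokens = (224 // patch_size) ** 2

print("Relative stretch after square resize:", square_resize_stretch(height, width))
print("Native patch grid:", (rows, cols), "->", rows * cols, "patch features")
print("Square 224 x 224 grid:", square_tokens, "patch features")
```

For this crop, a square resize widens the content by a factor of 4 relative to its height, while the native grid avoids that distortion at the price of a different, here larger, sequence length.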
SigLIP 2 also adds captioning-based training, self-distillation, and masked prediction; its paper reports improvements over matching SigLIP encoders on localization and dense-prediction evaluations.[5] DINOv2 takes another route: it trains visual features without text supervision and evaluates patch-level features for segmentation and depth estimation.[6] Which encoder to use depends on measured task behavior, not on a general rule that one backbone is best.
The mental model remains stable: patchify, project, supply position information, run encoder blocks, then choose what representation downstream code consumes. Variants change the training signal and token budget, but you still need to track shapes and preserve relevant evidence.
[1] Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
[2] Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
[3] Liu, H., et al. (2023). Visual Instruction Tuning. NeurIPS 2023.
[4] Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-training. ICCV 2023.
[5] Tschannen, M., Gritsenko, A., Wang, X., et al. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features.
[6] Oquab, M., Darcet, T., Moutakanni, T., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision.