LeetLLM

Vision Transformers and Image Encoders

Understand how Vision Transformers split images into patches, build visual tokens, train encoders, and connect to CLIP and multimodal LLMs.


Text transformers start with tokens. Vision transformers start by inventing tokens from pixels.

You just learned how attention routes information across a sequence. ViT changes where the sequence comes from: image patches instead of word tokens.

That one shift explains why multimodal systems feel less mysterious. A document scan, UI screenshot, or field-inspection image can't enter a transformer as raw pixels. It must first become a sequence of vectors. A Vision Transformer (ViT) does that by cutting the image into patches, projecting each patch into an embedding, adding position information, and running transformer encoder blocks.[1]

[Figure: Concrete ViT-Base shape pipeline from a 224 by 224 RGB image through 196 flattened 16 by 16 patches, shared projection to width 768, CLS and position tokens, transformer encoders, and image or patch readout]
For ViT-Base, a 224 by 224 RGB image becomes 196 flattened width-768 patches. A shared projection creates 196 patch tokens, a class token expands the sequence to 197, position embeddings preserve layout, and encoder blocks return either a summary representation or contextual patch features.

From pixels to patches

Take a 224 x 224 RGB image. If you use 16 x 16 patches, the image becomes:

    224 / 16 = 14 patches per side
    14 x 14 = 196 image patches

Each patch contains 16 x 16 x 3 = 768 raw pixel values. ViT flattens each patch and applies a learned linear projection to map it to the model's hidden size d_model. ViT-Base uses d_model = 768. That this happens to equal the raw patch width of 768 is a coincidence of this configuration, not a rule: a 16 x 16 patch projected to ViT-Large's d_model = 1024 would go from 768 raw values to a 1024-dimensional vector. The raw patch width and the hidden size are two independent numbers.

More generally, for an H x W x C image and square patch width P, assuming P divides both H and W:

    patch_count = (H / P) x (W / P)
    raw_patch_width = P x P x C
    projected_tokens shape = [patch_count, d_model]

Track those three values separately. Changing patch size changes sequence length and raw patch width. Changing d_model changes projected feature width.
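As a quick check on the formulas above, here is a minimal sketch that computes all three quantities for one configuration. The helper name `patch_geometry` and its dictionary layout are illustrative choices, not part of any library, and the sketch assumes the patch width divides both image sides exactly.

```python
def patch_geometry(height: int, width: int, channels: int,
                   patch: int, d_model: int) -> dict:
    """Return patch count, raw patch width, and projected token shape."""
    # Assumption: the patch width divides both image sides exactly.
    assert height % patch == 0 and width % patch == 0
    patch_count = (height // patch) * (width // patch)
    raw_patch_width = patch * patch * channels
    return {
        "patch_count": patch_count,
        "raw_patch_width": raw_patch_width,
        "projected_shape": (patch_count, d_model),
    }

# Same image and patch size, two hidden sizes (ViT-Base and ViT-Large widths).
base = patch_geometry(224, 224, 3, patch=16, d_model=768)
large = patch_geometry(224, 224, 3, patch=16, d_model=1024)
print(base)
print(large)
```

Only `projected_shape` differs between the two calls, which is the point: the hidden size does not affect how many patches there are or how wide each raw patch is.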

Now the transformer sees a sequence:

    [patch_1, patch_2, ..., patch_196]

This is the same shape idea you learned for text. The model doesn't process "an image" directly. It processes a sequence of vectors, and attention lets each visual token look at other visual tokens.

Why flatten and project? A linear layer (a matrix multiplication) learns to turn the raw pixel values into a d_model-dimensional input representation that the transformer can refine. The weights are shared across all patches -- the same "patch embedder" is applied everywhere. This is analogous to the token embedding matrix in text transformers.

Patchify can be a strided convolution. The ViT paper describes flattening each patch and applying one shared linear projection. A 2D convolution with kernel size P and stride P can perform exactly that same operation: each output location reads one non-overlapping patch, and each output channel is one learned projection dimension. So nn.Conv2d(3, 768, kernel_size=16, stride=16) can produce the same patch tokens as explicit reshape-then-matmul when its weights and bias are copied into the equivalent linear layer.[1]

This minimal, runnable NumPy implementation shows the patch embedding step with toy numbers:

from-pixels-to-patches.py

    import numpy as np

    def extract_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
        """Split HxWxC image into (N_patches, patch_size*patch_size*C) flat patches."""
        h, w, c = image.shape
        assert h % patch_size == 0 and w % patch_size == 0
        patches = image.reshape(
            h // patch_size, patch_size,
            w // patch_size, patch_size, c
        ).swapaxes(1, 2).reshape(-1, patch_size * patch_size * c)
        return patches

    def patch_embed(patches: np.ndarray, proj_weight: np.ndarray) -> np.ndarray:
        """Project (N, patch_width) patches with shared (d_model, patch_width) weights."""
        return patches @ proj_weight.T  # shared weights for every patch

    # Toy 32x32 RGB image, 8x8 patches: 16 patches, each 192 raw values
    rng = np.random.default_rng(0)
    toy_image = rng.normal(size=(32, 32, 3)).astype(np.float32)
    patches = extract_patches(toy_image, patch_size=8)
    print("Patches shape:", patches.shape)

    d_model = 256
    proj = rng.normal(size=(d_model, 192)).astype(np.float32) * 0.02
    embedded = patch_embed(patches, proj)
    print("Embedded tokens shape:", embedded.shape)

Output

    Patches shape: (16, 192)
    Embedded tokens shape: (16, 256)

The next check uses PyTorch's Conv2d layer and an explicit flattened-patch multiplication with identical weights. Matching output confirms that these are two implementations of the same patch projection, not two different vision architectures.

verify-convolutional-patch-embedding.py

    import torch

    torch.manual_seed(3)
    image = torch.arange(1 * 3 * 4 * 4, dtype=torch.float32).reshape(1, 3, 4, 4)
    patch_size = 2
    d_model = 5

    conv = torch.nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
    conv_tokens = conv(image).flatten(2).transpose(1, 2)

    flat_patches = (
        image.unfold(2, patch_size, patch_size)
        .unfold(3, patch_size, patch_size)
        .permute(0, 2, 3, 1, 4, 5)
        .reshape(1, -1, 3 * patch_size * patch_size)
    )
    linear_tokens = flat_patches @ conv.weight.reshape(d_model, -1).T + conv.bias

    print("Token shape:", tuple(conv_tokens.shape))
    print("Conv equals explicit projection:", torch.allclose(conv_tokens, linear_tokens))

Output

    Token shape: (1, 4, 5)
    Conv equals explicit projection: True

Patch size is a major design knob:

| Patch size | Tokens for 224 x 224 image | Trade-off |
| --- | --- | --- |
| 32 x 32 | 49 | Cheaper, less detail |
| 16 x 16 | 196 | Common balance |
| 14 x 14 | 256 | More detail, more attention cost |
| 8 x 8 | 784 | Expensive, useful for fine detail |

[Figure: Four exact patch grids for a 224 by 224 image, showing patch widths 32, 16, 14, and 8 with their patches per side, token counts, attention-pair counts, and cost relative to 16 pixel patches]
For a fixed 224 by 224 image, patch width 32 produces 49 tokens and 2,401 attention pairs, while patch width 8 produces 784 tokens and 614,656 pairs. Halving patch width creates four times as many tokens and sixteen times as many query-key pairs.

This matters when reading tiny chart labels, serial numbers, or document text. Smaller patches preserve more visual detail (a small axis label or printed ID), but self-attention cost grows with the square of token count. Systems that need fine detail often add smaller patches, higher-resolution crops, OCR fallback, or variable-resolution processing for text-heavy regions.

from-pixels-to-patches-2.py

    def patch_tokens(image_size: int, patch_size: int) -> int:
        assert image_size % patch_size == 0, "image must split into complete patches"
        patches_per_side = image_size // patch_size
        return patches_per_side**2

    def attention_pairs(tokens: int) -> int:
        return tokens**2

    tokens_16 = patch_tokens(224, 16)
    tokens_14 = patch_tokens(224, 14)
    tokens_8 = patch_tokens(224, 8)

    print(tokens_16, tokens_14, tokens_8)
    print(round(attention_pairs(tokens_8) / attention_pairs(tokens_16), 1), "x more attention pairs")
    print("16-patch count correct:", tokens_16 == 196)
    print("14-patch count correct:", tokens_14 == 256)
    print("8-patch count correct:", tokens_8 == 784)

Output

    196 256 784
    16.0 x more attention pairs
    16-patch count correct: True
    14-patch count correct: True
    8-patch count correct: True

Position information

A bag of patches isn't enough. The model needs to know where each patch came from. A patch in the top-left corner of an inspection photo means something different from an identical patch near the barcode.

The original ViT adds a learned one-dimensional position embedding table to the raster-ordered patch sequence. A patch in row 0, column 1 receives the vector for its flattened sequence slot. The paper found no clear gain from its tested 2D-aware alternatives; the learned 1D vectors still developed row-and-column structure after training.[1] Later encoders may use explicitly 2D or relative position schemes.

For patch sequence slot i, the input to the encoder is:

    visual_token_i = patch_projection_i + position_embedding_i

Original ViT also prepends a learnable classification token ([CLS]). After the encoder, that token is the image representation supplied to its classifier. Other image encoders may pool patch tokens or keep the full patch sequence for downstream tasks.
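The two steps just described, prepending the class token and adding one position vector per slot, can be sketched in a few lines of NumPy. The random arrays below are stand-ins for learned parameters (an assumption for illustration), and the variable names are not taken from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model = 196, 768

# Stand-ins for learned parameters; real values come from training.
patch_tokens = rng.normal(size=(num_patches, d_model))
cls_token = rng.normal(size=(1, d_model))
position_table = rng.normal(size=(num_patches + 1, d_model))  # one row per slot, [CLS] included

# Prepend [CLS], then add one position vector per sequence slot.
sequence = np.concatenate([cls_token, patch_tokens], axis=0)
encoder_input = sequence + position_table

print("Encoder input shape:", encoder_input.shape)
```

The position table has 197 rows rather than 196 because the class-token slot receives its own position vector in this layout.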

The tiny attention calculation below swaps two patch contents. A summary query can't distinguish the swap when there's no position signal; after position vectors are added, the same query produces a different summary.

position-makes-layout-visible.py

    import numpy as np

    def attend(query: np.ndarray, tokens: np.ndarray) -> np.ndarray:
        scores = tokens @ query / np.sqrt(tokens.shape[1])
        scores = scores - scores.max()
        weights = np.exp(scores) / np.exp(scores).sum()
        return weights @ tokens

    query = np.array([1.0, 0.0])
    patches = np.array([[2.0, 0.1], [0.2, 1.5]])
    swapped = patches[::-1]
    positions = np.array([[0.0, 0.6], [0.5, 0.0]])

    without_position = np.allclose(attend(query, patches), attend(query, swapped))
    with_position = np.allclose(
        attend(query, patches + positions),
        attend(query, swapped + positions),
    )
    print("Same summary after swap without positions:", without_position)
    print("Same summary after swap with positions:", with_position)

Output

    Same summary after swap without positions: True
    Same summary after swap with positions: False

The important habit is to track shape and meaning:

| Tensor | Shape example | Meaning |
| --- | --- | --- |
| image | [3, 224, 224] | RGB pixels |
| patches | [196, 768] | flattened patch values |
| patch embeddings | [196, 768] | projected visual tokens |
| encoder input with [CLS] | [197, 768] | summary slot plus positioned patches |
| encoded tokens | [197, 768] | contextual image representation |

[Figure: Vision Transformer position embedding diagram mapping a 3 by 3 raster patch grid to sequence slots, adding learned slot vectors to patch content, and showing that swapping patch contents is invisible to CLS without positions but visible with fixed slot embeddings]
Raster order maps each patch to a fixed learned slot vector, so encoder input is $z_i=x_i+e_i$. Without $e_i$, permuting patch tokens leaves the class-token summary unchanged; with fixed slots, swapping two patch contents changes their content-position sums.

Learned position tables also expose a common failure when resolution changes. Moving from 224 x 224 to 256 x 256 with 16 x 16 patches changes the patch sequence from 196 to 256 entries. The original ViT approach handles higher-resolution fine-tuning by interpolating the pretrained position embeddings on their 2D patch grid.[1]

position-table-resolution-mismatch.py

    import numpy as np

    def sequence_length(image_size: int, patch_size: int, cls_token: bool = True) -> int:
        assert image_size % patch_size == 0, "image must split into complete patches"
        patch_count = (image_size // patch_size) ** 2
        return patch_count + int(cls_token)

    trained_positions = np.zeros((sequence_length(224, 16), 768))
    higher_resolution_tokens = np.zeros((sequence_length(256, 16), 768))

    print("trained slots:", trained_positions.shape[0])
    print("new slots:", higher_resolution_tokens.shape[0])
    print("needs resized position table:", trained_positions.shape != higher_resolution_tokens.shape)

Output

    trained slots: 197
    new slots: 257
    needs resized position table: True
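One way to resolve that mismatch is to resample the patch-position vectors on their 2D grid. The sketch below is a simplified illustration, not the paper's code: it assumes plain bilinear interpolation with the corner slots held fixed, keeps the class-token slot unchanged, and uses a toy width of 8. The helper names `resize_axis` and `resize_position_table` are hypothetical.

```python
import numpy as np

def resize_axis(grid: np.ndarray, new_len: int, axis: int) -> np.ndarray:
    """Linearly resample one axis, keeping the two end slots fixed."""
    old_len = grid.shape[axis]
    coords = np.linspace(0.0, old_len - 1, new_len)  # positions in old-grid units
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    shape = [1] * grid.ndim
    shape[axis] = new_len
    frac = (coords - lo).reshape(shape)              # blend weight toward the upper neighbor
    return np.take(grid, lo, axis=axis) * (1 - frac) + np.take(grid, hi, axis=axis) * frac

def resize_position_table(table: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Resize a (1 + old_grid**2, d) table to (1 + new_grid**2, d)."""
    cls_slot, patch_slots = table[:1], table[1:]     # the [CLS] slot has no grid location
    grid = patch_slots.reshape(old_grid, old_grid, -1)
    grid = resize_axis(resize_axis(grid, new_grid, axis=0), new_grid, axis=1)
    return np.concatenate([cls_slot, grid.reshape(new_grid * new_grid, -1)], axis=0)

rng = np.random.default_rng(1)
trained = rng.normal(size=(1 + 14 * 14, 8))          # toy width 8 instead of 768
resized = resize_position_table(trained, old_grid=14, new_grid=16)
print("Resized table shape:", resized.shape)
```

Real implementations may choose a different interpolation mode or corner convention, so treat the exact resampling rule here as an assumption and check the library you load weights from.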

Encoder attention

ViT uses standard transformer encoder blocks (the same scaled dot-product attention and feed-forward layers you learned in the attention chapter). Unlike a decoder-only language model, it doesn't need a causal mask for image understanding. Every patch can attend to every other patch -- the attention is bidirectional.

For an inspection image, one patch might show an error panel, another might show a status badge, and another might show a tiny warning icon. Self-attention lets the encoder combine these clues across the image instead of treating each patch independently. A patch containing "ERROR" text can directly influence the representation of a distant "warning" patch.

[Figure: Bidirectional Vision Transformer attention matrix for damage, label, barcode, and background patches, with every row normalized and the damage query mixing all four value vectors into one contextual token]
Every query row can use every image patch. In this illustrative damage row, weights 0.38, 0.31, 0.22, and 0.09 mix damage, label, barcode, and background value vectors into one contextual representation.

That's why ViT connects naturally to the attention chapter. Query and key vectors still decide where information flows. Value vectors still carry content. The sequence positions are image patches instead of words. The full stack also includes residual connections, LayerNorm (covered in a later chapter), and position-wise feed-forward networks -- exactly the same building blocks as text transformers.

To see the mask difference directly, the next example uses the same attention scores twice. A vision encoder leaves all patch-to-patch entries available. A left-to-right decoder masks future entries.

bidirectional-image-attention.py

    import numpy as np

    def softmax(scores: np.ndarray) -> np.ndarray:
        scores = scores - scores.max(axis=-1, keepdims=True)
        probs = np.exp(scores)
        return probs / probs.sum(axis=-1, keepdims=True)

    tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    full_weights = softmax(scores)
    causal_scores = np.where(np.triu(np.ones_like(scores), k=1) == 1, -np.inf, scores)
    causal_weights = softmax(causal_scores)

    print("Vision patch 0 can read later patch 1:", full_weights[0, 1] > 0)
    print("Causal token 0 can read later token 1:", causal_weights[0, 1] > 0)
    print("Vision row sums to one:", np.isclose(full_weights[0].sum(), 1.0))

Output

    Vision patch 0 can read later patch 1: True
    Causal token 0 can read later token 1: False
    Vision row sums to one: True

The ViT architecture in full

A complete Vision Transformer typically looks like this:

  1. Patch embedding (linear projection of flattened patches)
  2. Prepend a learnable [CLS] token for the original classification model
  3. Add learned sequence-position embeddings; some later variants encode 2D structure explicitly
  4. Stack of L identical encoder blocks:
    • LayerNorm -> multi-head self-attention (bidirectional) -> residual add
    • LayerNorm -> position-wise MLP (usually 4× expansion) -> residual add
  5. Final LayerNorm
  6. Readout: [CLS] token, mean-pool of patch tokens, or full feature map depending on the downstream task

The output is a set of contextual visual tokens plus an optional global image representation. CLIP-style image encoders commonly read a pooled image vector, while a VLM bridge may retain many patch features.
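The encoder block in step 4 can be sketched in NumPy to confirm that it preserves the token shape. This is a minimal illustration under stated simplifications: LayerNorm's learned scale and shift and all bias terms are omitted, GELU uses the tanh approximation, weights are random stand-ins, and the function and parameter names are made up for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Learned scale and shift are omitted to keep the sketch short.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def split_heads(x, heads):
    n, d = x.shape
    return x.reshape(n, heads, d // heads).transpose(1, 0, 2)  # (heads, n, d_head)

def self_attention(x, p, heads):
    q, k, v = (split_heads(x @ p[name], heads) for name in ("wq", "wk", "wv"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # no causal mask
    context = softmax(scores) @ v                              # (heads, n, d_head)
    merged = context.transpose(1, 0, 2).reshape(x.shape)       # concatenate heads
    return merged @ p["wo"]

def encoder_block(x, p, heads):
    x = x + self_attention(layer_norm(x), p, heads)            # LayerNorm -> attention -> residual
    return x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]         # LayerNorm -> MLP (4x) -> residual

rng = np.random.default_rng(2)
d_model, heads, tokens = 8, 2, 5                               # toy sizes: [CLS] + 4 patches
shapes = {"wq": (d_model, d_model), "wk": (d_model, d_model),
          "wv": (d_model, d_model), "wo": (d_model, d_model),
          "w1": (d_model, 4 * d_model), "w2": (4 * d_model, d_model)}
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

x = rng.normal(size=(tokens, d_model))
y = encoder_block(x, params, heads)
print("Input shape:", x.shape, "Output shape:", y.shape)
```

Nothing inside the block refers to a token's index, so reordering the input rows reorders the output rows in the same way. Layout information has to arrive through the position embeddings added before the first block.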

Shape tracking example (ViT-B/16 style)

  • Input image: [3, 224, 224]
  • Patches: 196 × (16×16×3) = 196 × 768
  • After projection: 196 × 768 tokens
  • After adding position embeddings and [CLS]: 197 × 768
  • After 12 encoder blocks: still 197 × 768
  • Final image embedding (from CLS): 768 dimensions

This shape discipline is the same habit you developed for text token sequences.

Why pretraining scale matters. A convolutional neural network (CNN) bakes locality and translation equivariance into each layer. Original ViT uses patch extraction and learned position embeddings as its limited image-specific structure, then learns spatial relationships through attention. In the ViT paper's controlled experiments, ResNet baselines do better with smaller pretraining datasets, while ViT catches up or does better as pretraining data grows to ImageNet-21k and JFT-300M.[1] For a limited private image dataset, begin evaluation with a pretrained encoder rather than assuming training a large ViT from scratch will win.

CLIP and image-text training

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning to train an image encoder and a text encoder together so matching image-text pairs land near each other in embedding space. Its paper trains on 400 million image-text pairs gathered from the internet instead of using only a fixed class-label list.[2]

A simplified training batch looks like this:

| Image | Matching text |
| --- | --- |
| Photo of a cracked equipment panel | "cracked panel with warning label" |
| Instrument-panel photo | "voltage warning panel on gray rack" |
| Document scan | "barcode and form label" |

The model normalizes image and text embeddings, computes a matrix of scaled cosine similarities, and applies symmetric softmax cross-entropy: each image must pick its paired text, and each text must pick its paired image. For zero-shot classification, it embeds text prompts for the candidate classes and ranks their scaled similarities to the image.[2]

[Figure: CLIP batch with three normalized image vectors and three normalized text vectors forming a 3 by 3 scaled cosine logit matrix, where diagonal matches are positives, six off-diagonal cells are in-batch negatives, and row plus column cross-entropy train both retrieval directions]
A batch of three matched image-text pairs creates a 3 by 3 scaled-cosine logit matrix: three diagonal positives and six off-diagonal negatives. CLIP averages image-to-text row loss and text-to-image column loss.
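The symmetric objective described above can be written as a short NumPy sketch. The embeddings are hand-picked toy vectors and the fixed `logit_scale` of 10.0 is an assumption for illustration; in CLIP the scale is a learned temperature. The function names are hypothetical.

```python
import numpy as np

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

def diagonal_cross_entropy(logits: np.ndarray) -> float:
    """Mean cross-entropy where row i's correct column is i."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray, logit_scale: float) -> float:
    logits = logit_scale * normalize(image_emb) @ normalize(text_emb).T
    image_to_text = diagonal_cross_entropy(logits)     # rows: each image picks its text
    text_to_image = diagonal_cross_entropy(logits.T)   # columns: each text picks its image
    return 0.5 * (image_to_text + text_to_image)

# Toy batch of three pairs; row i of each array belongs to the same pair.
image_emb = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]])
text_emb = np.array([[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]])

aligned = clip_style_loss(image_emb, text_emb, logit_scale=10.0)
shuffled = clip_style_loss(image_emb, text_emb[[1, 2, 0]], logit_scale=10.0)
print("Aligned loss lower than shuffled:", aligned < shuffled)
```

Shuffling the text rows breaks the pairing, so the diagonal no longer holds the highest similarities and the loss rises.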

CLIP isn't a captioning model by itself. It returns similarity scores. A multimodal assistant needs extra components (a language model head or a separate captioner) to turn image understanding into grounded language.

The connection to the sentence-embedding lesson is direct: the same contrastive learning machinery that turns sentences into useful vectors is used here to align two different modalities (vision and language) in one shared embedding space.

This compact inference example treats two descriptions as candidate classes. The returned number is a score-derived probability over those supplied candidates, not generated language or verified evidence.

clip-style-zero-shot-ranking.py

    import numpy as np

    def normalize(rows: np.ndarray) -> np.ndarray:
        return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

    image = normalize(np.array([[0.95, 0.20, 0.05]]))
    prompts = ["a cracked equipment panel", "a clean equipment panel"]
    text = normalize(np.array([[0.90, 0.25, 0.05], [0.05, 0.95, 0.10]]))
    logit_scale = 5.0

    logits = logit_scale * (image @ text.T)[0]
    probabilities = np.exp(logits - logits.max())
    probabilities = probabilities / probabilities.sum()

    print("Ranked prompt:", prompts[int(np.argmax(probabilities))])
    print("Candidate probabilities:", np.round(probabilities, 3).tolist())

Output

    Ranked prompt: a cracked equipment panel
    Candidate probabilities: [0.976, 0.024]

From image encoder to VLM

Modern vision-language models (VLMs) can connect a vision encoder to a large language model (LLM). Original LLaVA uses CLIP ViT-L/14 grid features, a learned linear projection into the language embedding dimension, and a language decoder.[3] The bridge maps each selected vision feature into an input vector the LLM can consume. Which visual features are selected, and whether they are pooled or compressed, is a model design choice rather than a ViT rule.

[Figure: Vision-language model bridge shape diagram where 196 vision features are projected from vision width to LLM width without changing slot count, concatenated with text prompt tokens, and compared with 392-slot two-image and 784-slot high-resolution budgets]
A linear VLM projector changes each feature from vision width to language-model width but preserves the 196 visual slots in this baseline. Two images require 392 slots and one 448-pixel image requires 784 before any explicit pooling, resampling, or compression.

That handoff creates practical engineering questions about the token budget and evidence quality:

| Concern | Engineering question |
| --- | --- |
| Context window | Before any compression, a 224 x 224 image at 16 x 16 patches yields 196 patch features. At 14 x 14 patches it yields 256. Does your bridge pass all features into the LLM's context window or reduce them first? |
| Detail preservation | Does the chosen encoder retain the label or tracking-code evidence on your evaluated images? If it doesn't, measure higher-resolution crops, optical character recognition (OCR), or a document-specific path. |
| Interleaving | Are visual tokens injected once at the beginning, or can the model "look again" at the image in later turns? |
| Tool use | Can the inspection VLM read an equipment photo, output a structured report, and then call a policy lookup tool with the detected asset tag? |

For an inspection workflow, you might ask a VLM to identify visible damage, extract an asset code, and prepare a structured request for an inspection tool. This must be evaluated separately for damage detection, text extraction, and action correctness. Visual tokens provide model input; they don't prove that a claimed detail was read correctly.

The projector itself is easy to shape-check. This scaled-down example maps vision width 8 to language-model width 16. A production bridge may use widths such as 1024 and 4096, but the invariant is identical: a linear projector changes feature width, not feature-slot count.

project-visual-features-into-llm-space.py

    import numpy as np

    rng = np.random.default_rng(5)
    visual_features = rng.normal(size=(1, 4, 8))
    projector = rng.normal(size=(8, 16))
    llm_visual_inputs = visual_features @ projector
    text_token_count = 3

    print("Vision features:", visual_features.shape)
    print("Projected features:", llm_visual_inputs.shape)
    print("Slots before generation:", llm_visual_inputs.shape[1] + text_token_count)

Output

    Vision features: (1, 4, 8)
    Projected features: (1, 4, 16)
    Slots before generation: 7

from-image-encoder-to-vlm.py

    def patch_feature_slots(image_size: int, patch_size: int, images: int = 1) -> int:
        assert image_size % patch_size == 0, "image must split into complete patches"
        per_image = (image_size // patch_size) ** 2
        return images * per_image

    base = patch_feature_slots(224, 16, images=1)
    two_images = patch_feature_slots(224, 16, images=2)
    high_res = patch_feature_slots(448, 16, images=1)

    print(base, two_images, high_res)
    print("single-image patch features correct:", base == 196)
    print("two-image patch features correct:", two_images == 392)
    print("high-res patch features correct:", high_res == 784)

Output

    196 392 784
    single-image patch features correct: True
    two-image patch features correct: True
    high-res patch features correct: True

These counts are a patch-feature baseline, not a promise about every VLM. A classifier might use one [CLS] output. A bridge might pass all patch features, merge adjacent features, or resample them into another length.
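To make "merge adjacent features" concrete, here is one simple possibility: averaging each 2 x 2 neighborhood of the patch grid, which cuts 196 slots to 49. This is a hypothetical reduction chosen for illustration, not the method of any specific VLM; real bridges may use learned pooling or resamplers instead. The helper name `merge_2x2` is made up.

```python
import numpy as np

def merge_2x2(features: np.ndarray, grid: int) -> np.ndarray:
    """Average each 2x2 neighborhood of raster-ordered patch features."""
    assert grid % 2 == 0 and features.shape[0] == grid * grid
    d = features.shape[-1]
    # Axes after reshape: (block_row, row_in_block, block_col, col_in_block, d)
    blocks = features.reshape(grid // 2, 2, grid // 2, 2, d)
    return blocks.mean(axis=(1, 3)).reshape(-1, d)

rng = np.random.default_rng(4)
patch_features = rng.normal(size=(196, 32))   # 14 x 14 grid, toy width 32
merged = merge_2x2(patch_features, grid=14)
print("Slots before:", patch_features.shape[0], "after:", merged.shape[0])
```

The feature width stays at 32 while the slot count drops by a factor of four, the reverse of what the linear projector does.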

Encoder variants change supervision and token budgets

The pipeline stays recognizable as researchers change its training objective and image sizing policy.

SigLIP: independent pair losses

CLIP's softmax loss needs normalization across rows and columns of the batch similarity matrix. SigLIP replaces this with a sigmoid loss on each image-text pair: diagonal pairs receive positive labels and off-diagonal pairs receive negative labels. The SigLIP paper reports that this loss is simpler to distribute, uses less memory in its implementation, and outperforms its softmax baseline at small batch sizes in the evaluated setups.[4]

This code computes the pairwise logistic losses for one tiny batch. High diagonal logits and low off-diagonal logits reduce the loss.

siglip-pairwise-loss.py

    import numpy as np

    logits = np.array([[2.0, -1.0], [-0.4, 1.7]])
    labels = 2 * np.eye(2) - 1
    pair_losses = np.logaddexp(0.0, -labels * logits)

    print("Positive losses:", np.round(np.diag(pair_losses), 3).tolist())
    print("Negative losses:", np.round(pair_losses[labels < 0], 3).tolist())
    print("Mean pair loss:", round(float(pair_losses.mean()), 3))

Output

    Positive losses: [0.127, 0.168]
    Negative losses: [0.313, 0.513]
    Mean pair loss: 0.28

Variable resolution and dense features

A fixed-resolution ViT setup resizes images to a chosen patch grid. Original ViT can be fine-tuned at a higher resolution by interpolating its learned position table. SigLIP 2's NaFlex variant instead supports multiple sequence lengths while preserving an image's native aspect ratio as much as its patch constraints allow.[1][5] This is relevant to document images because squeezing a tall label into a square input changes the evidence before the model sees it.
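To see why aspect ratio matters for token budgets, the sketch below picks a patch grid for a tall image under a fixed slot budget. It is not SigLIP 2's NaFlex procedure, whose exact resizing rules are not reproduced here; `grid_under_budget` is a hypothetical helper that scales both sides by the same factor and rounds down to whole patches.

```python
import math

def grid_under_budget(height: int, width: int, patch: int, max_tokens: int) -> tuple[int, int]:
    """Pick (rows, cols) with rows * cols <= max_tokens and rows/cols close to height/width."""
    # One shared scale factor keeps the aspect ratio; flooring keeps the product within budget.
    scale = math.sqrt(max_tokens * patch * patch / (height * width))
    rows = min(max(1, math.floor(scale * height / patch)), max_tokens)
    cols = max(1, min(math.floor(scale * width / patch), max_tokens // rows))
    return rows, cols

# A tall 600 x 200 label photo with 16-pixel patches and a 256-slot budget.
rows, cols = grid_under_budget(height=600, width=200, patch=16, max_tokens=256)
print("Aspect-preserving grid:", (rows, cols), "tokens:", rows * cols)
print("Square-resize grid:", (16, 16), "tokens:", 16 * 16)
```

Here the aspect-preserving grid is 27 rows by 9 columns (243 slots), which keeps the 3:1 shape of the source image. The square 16 x 16 grid spends a similar budget but squeezes the height and stretches the width.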

SigLIP 2 also adds captioning-based training, self-distillation, and masked prediction; its paper reports improvements over matching SigLIP encoders on localization and dense-prediction evaluations.[5] DINOv2 takes another route: it trains visual features without text supervision and evaluates patch-level features for segmentation and depth estimation.[6] Which encoder to use depends on measured task behavior, not on a general rule that one backbone is best.

The mental model remains stable: patchify, project, supply position information, run encoder blocks, then choose what representation downstream code consumes. Variants change the training signal and token budget, but you still need to track shapes and preserve relevant evidence.
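The first three steps of that mental model can be traced with shapes alone. The sketch below uses random values and assumed sizes (a 32x32 RGB image, 8x8 patches, `d_model = 256`); the encoder blocks are left out, and the weight names are illustrative, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32      # image height and width (assumed)
C = 3           # RGB channels
P = 8           # patch side length
d_model = 256   # projected token width

image = rng.random((H, W, C))

# Patchify: split each spatial axis into (grid, within-patch), then group by patch.
grid = H // P
patches = image.reshape(grid, P, grid, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(grid * grid, P * P * C)       # raster-ordered flat patches

# Project: one shared linear layer maps every flattened patch to d_model.
w_proj = rng.normal(scale=0.02, size=(P * P * C, d_model))
tokens = patches @ w_proj

# Supply position information: one learned vector per raster position.
pos_table = rng.normal(scale=0.02, size=(grid * grid, d_model))
tokens = tokens + pos_table

print("Flattened patches:", patches.shape)
print("Projected tokens:", tokens.shape)
```

Here 16 patches of raw width 192 become 16 tokens of width 256; patch count and token width are separate quantities.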

What to check before moving on

  • Convert image size, patch width, and channels into patch count, raw patch width, and projected token shape.
  • Explain why position vectors are added to raster-ordered patches and why changing resolution can require position-table interpolation.
  • Trace bidirectional patch attention through a ViT encoder without introducing a causal mask.
  • Distinguish CLIP similarity scoring, SigLIP pairwise training, and generative VLM behavior.
  • Track how a VLM projector changes feature width while visual slots consume context-window capacity.
  • Choose patch size, native resolution, cropping, or OCR based on measured detail, latency, and token-budget needs.
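For the position-table item, a rough sketch of resizing a learned table from a 14x14 grid (224x224 input, 16x16 patches) to a 16x16 grid (256x256 input) may help. It assumes a table with one leading [CLS] row and uses separable linear interpolation with aligned endpoints; real implementations often use a framework's bilinear or bicubic resize instead, and the function names here are hypothetical.

```python
import numpy as np

def axis_weights(old, new):
    """Linear interpolation weights from `old` grid points to `new` points."""
    src = np.linspace(0, old - 1, new)   # new sample positions in old coordinates
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, old - 1)
    frac = src - lo
    weights = np.zeros((new, old))
    rows = np.arange(new)
    weights[rows, lo] += 1.0 - frac
    weights[rows, hi] += frac
    return weights

def resize_position_table(table, old_side, new_side):
    # Keep the [CLS] position as-is; only the patch grid is interpolated.
    cls_pos, grid_pos = table[:1], table[1:]
    d_model = table.shape[1]
    grid = grid_pos.reshape(old_side, old_side, d_model)
    w = axis_weights(old_side, new_side)
    # Interpolate along rows and columns of the 2D grid, per feature channel.
    resized = np.einsum("ia,jb,abd->ijd", w, w, grid)
    return np.concatenate([cls_pos, resized.reshape(new_side * new_side, d_model)])

rng = np.random.default_rng(0)
old_table = rng.normal(size=(1 + 14 * 14, 8))   # tiny d_model for illustration
new_table = resize_position_table(old_table, old_side=14, new_side=16)
print(old_table.shape, "->", new_table.shape)
```

The table must be reshaped to its 2D grid before interpolation; interpolating the flat raster sequence would mix positions from the end of one row with the start of the next.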

Common failures and fixes

  • Symptom: Cost estimates assume one image token equals one pixel. Cause: Patch width, channel count, and projected token count were collapsed into one idea. Fix: Track patch size, flattened patch width, patch count, and d_model separately.
  • Symptom: Prototype looks fine at one resolution, then production latency jumps hard. Cause: Patch-size and image-resolution effects on visual token count were not measured. Fix: Measure emitted visual tokens on representative images before choosing encoder settings.
  • Symptom: Team expects CLIP embeddings to produce grounded explanations. Cause: Retrieval-style similarity scoring was confused with generative multimodal reasoning. Fix: Use CLIP for retrieval, filtering, or zero-shot classification, and use a VLM when you need language output or tool calls.
  • Symptom: Model catches broad damage but misses tiny text or torn-label details. Cause: One low-resolution encoder was asked to solve both coarse and fine tasks. Fix: Add a hybrid path with smaller patches, crops, OCR, or a detector for detail-heavy regions.
  • Symptom: Large ViT trained on a small private dataset underperforms simpler baselines. Cause: In the original study, ViT benefits more from larger pretraining sets than comparable ResNet baselines. Fix: Start from a pretrained encoder, then compare it against a smaller baseline on your data.[1]
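The token-count failures above come down to simple arithmetic that is worth scripting before choosing encoder settings. This sketch assumes a square image whose side divides evenly by the patch size and ignores any extra tokens such as [CLS]; `visual_token_stats` is a hypothetical helper name.

```python
def visual_token_stats(image_size, patch_size):
    """Return (patch tokens, attention pairs) for a square image."""
    side = image_size // patch_size
    tokens = side * side
    # Full bidirectional self-attention scores every token against every token.
    return tokens, tokens * tokens

for patch in (16, 14, 8):
    tokens, pairs = visual_token_stats(224, patch)
    print(f"patch {patch:>2}: {tokens:>4} tokens, {pairs:>7} attention pairs")
```

Halving the patch side quadruples the token count and multiplies attention pairs by sixteen, which is why a small change in encoder settings can produce a large latency jump.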
Mastery Check

1. A 32x32 RGB image is split into 8x8 patches and each flattened patch is projected to d_model = 256 with one shared linear layer. What are the shapes before and after projection?
2. For a 224x224 image, an encoder switches from 16x16 patches to 8x8 patches with the same d_model. What happens to token count and attention-pair count?
3. A ViT-B/16 model was pretrained on 224x224 images with learned position embeddings and a [CLS] token. You fine-tune it on 256x256 images using the same patch size. What position-table issue must be handled?
4. A ViT encoder is processing patch tokens from an inspection image. Which attention pattern matches standard image understanding in this encoder?
5. You compare one image embedding against two text prompts, 'a cracked equipment panel' and 'a clean equipment panel', and CLIP gives the first prompt the higher score. What can you conclude from that output alone?
6. A VLM bridge passes every patch feature from one 224x224 image into the LLM. If patch size changes from 16x16 to 14x14, how many visual slots are added before text tokens?
7. An inspection-photo VLM reliably flags cracked panels but often invents characters in tiny asset-code text. Evaluation shows the wrong characters are already absent from the visual features. Which change targets the failure most directly?
8. In a batch of image-text pairs, a SigLIP-style loss labels matching pairs positive and mismatches negative, then applies a logistic loss to each pair. What is the key difference from CLIP's softmax contrastive loss?
9. A team has only a small private set of inspection images and wants to train a large ViT from random initialization. What should it try first?
10. A ViT gives the same summary when two patch embeddings are swapped between raster positions. Which modification directly lets attention distinguish the two layouts?

Next Step
Continue to Positional Encoding: RoPE & ALiBi

ViT showed why position matters for image patches; positional encoding now shows how transformer systems represent order and distance for text and long-context sequence modeling.

References

[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Dosovitskiy, A., et al. · 2020 · ICLR 2021

[2] Learning Transferable Visual Models From Natural Language Supervision. Radford, A., et al. · 2021 · ICML 2021

[3] Visual Instruction Tuning. Liu, H., et al. · 2023 · NeurIPS 2023

[4] Sigmoid Loss for Language Image Pre-training. Zhai, X., et al. · 2023 · ICCV 2023

[5] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. Tschannen, M., Gritsenko, A., Wang, X., et al. · 2025

[6] DINOv2: Learning Robust Visual Features without Supervision. Oquab, M., Darcet, T., Moutakanni, T., et al. · 2024