Vision-Language Models & CLIP

Master CLIP's contrastive pre-training, zero-shot classification, visual token budgets, and the architecture of modern published VLMs like LLaVA, BLIP-2, and Qwen-VL.

LLM-powered search turned retrieved evidence into cited answers. Vision-language models add visual evidence to that same loop: images, diagrams, screenshots, document pages, and product photos become searchable and reasoned about alongside text.

Vision-language models connect images and text so systems can search, classify, and reason across both modalities. This chapter starts with CLIP-style contrastive learning, then expands to production uses and failure modes.

Imagine training a warehouse system to recognize damaged packages. You wouldn't just give it a fixed list of labels like damaged_box or intact_box. You'd show it real photos paired with descriptions: "corner crushed during transit," "shipping label torn near barcode," or "fragile sticker visible on side panel." The system learns the concept from images and language together.

Many supervised computer vision systems expose a fixed label inventory: recognizing a new category generally requires labeled examples and another training step. CLIP (Contrastive Language-Image Pre-training) showed that image-text matching at web scale can produce representations that transfer to new text-defined classes without training a new classifier head.[1] That unlocked practical open-vocabulary retrieval and provided useful vision features for later generative systems.

Conceptual foundation

A quick refresher: embeddings and similarity

Before we dive into CLIP, recall what embeddings are: high-dimensional vectors that compress meaning into a list of numbers. A sentence like "corner crushed during transit" becomes a vector, perhaps 768 numbers long. An image of that same crushed box becomes another 768-number vector.

If the model has done its job, those two vectors point in roughly the same direction in space. We measure that alignment with cosine similarity, a score between -1 and 1. Within one trained model, a higher score ranks an image-text pair as more aligned than a lower-scoring candidate. It is not a calibrated probability and does not prove that the pair means exactly the same thing. That's the geometric heart of CLIP: it learns to rank matching image-text pairs above mismatches.

From fixed labels to open vocabulary

Imagine trying to learn a warehouse photo taxonomy from a fixed label list. You might learn that one class is damaged_box, but you wouldn't understand how damage appears across lighting, angles, materials, and carrier labels.

Now imagine learning from many real catalog and logistics images with full captions. This is how Vision-Language Models (VLMs) learn. Instead of memorizing fixed categories, they learn to align the visual concept of an object with its natural language description.

This shared image-text space allows a system to score text prompts that were not fixed training labels. Transfer still depends on whether pre-training learned the relevant visual evidence and whether the new domain resembles the data it saw. A warehouse team must validate prompts on its own cartons, lighting, and damage policy before using scores operationally.

Shared image-text space

Think of CLIP as a shared routing map for vision and language. One side starts with pixels, the other with words. The two sides do not share a raw input format; what connects them is training on 400 million image-text pairs,[1] during which matched images and captions come to receive higher similarity than mismatches. If that training produced useful package-damage features, the model may rank a new prompt such as "torn corner on shipping box" well even when that exact label was not a training class. That is a transfer hypothesis to evaluate, not an automatic guarantee.

CLIP: Contrastive Language-Image Pre-training

Architecture[1]

CLIP jointly trains an image encoder and text encoder to align visual and textual representations in a shared embedding space. In the original paper, the vision side was either a ResNet (Residual Network) or a ViT (Vision Transformer), while the text side was a Transformer that produced one representation for each caption.[1] The contrastive objective pulls matching pairs together and pushes mismatches apart, which is what makes zero-shot transfer and cross-modal retrieval work.

CLIP shared-space diagram showing image and text encoded separately, then scored for relative alignment inside one common embedding plot.
CLIP learns one shared space. Matching image-text pairs land close; mismatches stay far.

The illustration above shows the core CLIP idea: keep the encoders separate, normalize both outputs, then judge whether the image and text land near each other in one shared space.

Training objective

Think of this process like matching warehouse photos to their correct captions. For every image, the model must pick the correct caption out of thousands of incorrect ones.

Let's walk through a tiny batch first, because the pattern is easier to see with real numbers. Suppose you have 3 images and 3 captions:

| Image | Caption |
| --- | --- |
| A | "poly mailer torn near barcode" |
| B | "return label covers old tracking number" |
| C | "fragile sticker visible on side panel" |

There are $3 \times 3 = 9$ possible pairings. Only the 3 diagonal pairs (A+A, B+B, C+C) are correct. The other 6 are mismatches. CLIP's job is to make the similarity scores for the diagonal pairs as high as possible, while keeping the off-diagonal scores low.

CLIP contrastive loss matrix for a batch of three paired images and captions. Diagonal pairs are positives and off-diagonal pairs are treated as negatives in both directions, even when related examples could collide.
The diagonal cells are positives. Every off-diagonal cell becomes a negative example for both encoders.

In a real training run, the batch size is much larger (32,768 in the original paper[1]), so each image sees thousands of negative examples. But the logic is the same: maximize the similarity of matching pairs and minimize it for non-matching pairs using a symmetric contrastive loss that's often described as InfoNCE-style (Information Noise Contrastive Estimation).

Let $I_i$ and $T_j$ be the L2-normalized image and text embeddings. CLIP computes logits

$$z_{ij} = \exp(t) \cdot I_i^\top T_j$$

where $t$ is a learned logit-scale parameter. You'll also see this written as division by a temperature $\tau$, where $\tau = 1 / \exp(t)$. The symmetric objective is:

$$\mathcal{L}_{I \rightarrow T} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(z_{ii})}{\sum_{j=1}^{N}\exp(z_{ij})}$$

$$\mathcal{L}_{T \rightarrow I} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(z_{ii})}{\sum_{j=1}^{N}\exp(z_{ji})}$$

$$\mathcal{L} = \frac{\mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}}{2}$$

The first equation asks: "Given image $i$, what's the probability that caption $i$ is the right one?" It compares the matching score $\exp(z_{ii})$ against the sum of all scores for that image across every caption in the batch. The second equation does the same from the text side. The final loss averages both directions so neither encoder dominates.
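The two ways of writing the scaling, multiplying by $\exp(t)$ or dividing by $\tau$, are the same operation. The sketch below checks that numerically. The initial temperature of 0.07 and the cap of 100 on the scale are the values reported in the CLIP paper;[1] the function name and the example cosine value are assumptions for illustration.

```python
import math

INITIAL_TEMPERATURE = 0.07  # initial value reported in the CLIP paper
MAX_LOGIT_SCALE = 100.0     # the paper clips the scale so logits grow by at most 100x

def logit_scale_from_param(t: float) -> float:
    """Return exp(t), capped at MAX_LOGIT_SCALE."""
    return min(math.exp(t), MAX_LOGIT_SCALE)

t = math.log(1.0 / INITIAL_TEMPERATURE)  # value of the learned parameter at initialization
scale = logit_scale_from_param(t)
cosine = 0.30  # hypothetical cosine similarity for one image-text pair

via_scale = scale * cosine                      # z = exp(t) * cos
via_temperature = cosine / INITIAL_TEMPERATURE  # z = cos / tau
print(f"scale={scale:.2f} logit_via_scale={via_scale:.4f} logit_via_tau={via_temperature:.4f}")
```

A larger scale (smaller temperature) sharpens the softmax, so small differences in cosine similarity become large differences in probability.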

The implementation below shows a simplified fixed-temperature version of that loss. It keeps the math readable while matching the same symmetric training pattern.

training-objective.py

```python
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def cross_entropy(logits: list[float], target: int) -> float:
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -logits[target] + log_sum_exp

def clip_loss(
    image_embeds: list[list[float]],
    text_embeds: list[list[float]],
    temperature: float = 0.2,
) -> tuple[float, list[list[float]]]:
    images = [normalize(vector) for vector in image_embeds]
    texts = [normalize(vector) for vector in text_embeds]
    logits = [
        [dot(image, text) / temperature for text in texts]
        for image in images
    ]

    loss_i2t = sum(cross_entropy(row, i) for i, row in enumerate(logits)) / len(logits)
    columns = [list(column) for column in zip(*logits, strict=True)]
    loss_t2i = sum(cross_entropy(column, i) for i, column in enumerate(columns)) / len(columns)
    return (loss_i2t + loss_t2i) / 2, logits

image_embeds = [
    [0.95, 0.05, 0.00],
    [0.05, 0.92, 0.03],
    [0.00, 0.07, 0.94],
]
text_embeds = [
    [0.91, 0.08, 0.01],
    [0.06, 0.89, 0.04],
    [0.02, 0.09, 0.90],
]

loss, logits = clip_loss(image_embeds, text_embeds)
print("logit matrix:")
for row in logits:
    print([round(value, 2) for value in row])
print(f"symmetric_loss: {loss:.3f}")
```

Output

```text
logit matrix:
[5.0, 0.6, 0.14]
[0.71, 5.0, 0.66]
[0.09, 0.59, 5.0]
symmetric_loss: 0.022
```

Low loss means the diagonal image-caption pairs dominate the off-diagonal mismatches. A random or poorly aligned batch would have a much flatter matrix and a higher loss.

The softmax-based cross-entropy requires the full $N \times N$ similarity matrix, which means CLIP benefits from very large batch sizes (32,768 in the original paper[1]) so that each example sees many in-batch negatives. In practice, that pushes training toward large distributed runs with expensive cross-device synchronization.
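A quick back-of-the-envelope calculation shows how fast that matrix grows. The sketch below assumes the whole matrix is materialized in one place as 4-byte floats, which is a simplification: real training code typically shards it across devices and may use lower precision. The function name is hypothetical.

```python
def similarity_matrix_stats(batch_size: int, bytes_per_value: int = 4) -> dict[str, float]:
    """Size of the N x N logit matrix under the stated single-buffer assumption."""
    entries = batch_size * batch_size
    return {
        "entries": entries,
        "negatives_per_example": batch_size - 1,
        "gib_if_materialized": entries * bytes_per_value / 2**30,
    }

for batch_size in (256, 4096, 32768):
    stats = similarity_matrix_stats(batch_size)
    print(batch_size, stats)
```

Doubling the batch doubles the negatives each example sees but quadruples the number of logits, which is why batch size and synchronization cost are discussed together.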

An off-diagonal pair is treated as negative by this objective, but it is not automatically a clean semantic negative. Two photographs of the same product defect may be valid matches even if they arrived as separate labeled pairs. A batch-construction audit catches obvious collisions before the loss pushes them apart.

audit-false-negatives.py

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pair:
    image_id: str
    caption: str
    concept: str

pairs = [
    Pair("box-front", "corner crushed during transit", "crushed-corner"),
    Pair("box-side", "cardboard corner is dented", "crushed-corner"),
    Pair("label", "barcode obscured by tape", "blocked-barcode"),
]

potential_false_negatives = [
    (left.image_id, right.image_id, left.concept)
    for index, left in enumerate(pairs)
    for right in pairs[index + 1:]
    if left.concept == right.concept
]

print("pairs_treated_as_negative_but_related:", potential_false_negatives)
```

Output

```text
pairs_treated_as_negative_but_related: [('box-front', 'box-side', 'crushed-corner')]
```

Training data

The success of CLIP is heavily dependent on the sheer scale and diversity of its pre-training data. The original paper doesn't release or assign a canonical public name to the full corpus. It describes training on 400 million (image, text) pairs collected from the internet.[1]

Unlike prior vision datasets (e.g., ImageNet) that relied on manual class labels, CLIP learns from natural language supervision attached to images on the web. That means the model trains against captions and descriptive text rather than a closed vocabulary of hand-assigned categories. The result is a much broader representation of visual concepts and a much stronger zero-shot transfer story.

Zero-shot classification

CLIP enables classification without any task-specific training. Instead of a final classification layer with fixed weights, the model computes the similarity between the image and the text descriptions of possible classes.

Analogy: instead of teaching a child that "Class 42 is a zebra," you give them a dictionary. They look at a picture and find the word in the dictionary that best describes it.

Here's a concrete example. Suppose you have a warehouse photo and you want to classify it into one of three categories:

| Class | Prompt | Similarity |
| --- | --- | --- |
| damaged box | "a photo of a damaged box" | 0.87 |
| intact box | "a photo of an intact box" | 0.12 |
| fragile item | "a photo of a fragile item" | 0.41 |

The highest score is the model's top candidate. In a real decision path, it should win only if validation has established an acceptable score or margin policy; ambiguous images should abstain or route to review. You did not retrain the model, but you still need to calibrate the decision rule.

The practical breakthrough isn't classification alone; it's a text-defined candidate vocabulary. A warehouse damage-claim system can test a prompt such as "poly mailer with barcode partially obscured by packing tape" without fitting a new label head. Whether the prompt is reliable on real claims still has to be measured.

CLIP downstream tasks: zero-shot classification, image search, frozen-feature probing, and reranking all reuse shared image-text embeddings. A score chart compares candidate prompts for a damaged package image.
The same embeddings support classification, retrieval, probing, and reranking. Only the scoring layer changes.

The function below shows the standard zero-shot pattern: prompt the candidate classes in natural language, encode image and text, normalize both, then score them with a scaled dot product. This standalone version uses small hand-written vectors so the inference mechanics are visible.

zero-shot-classification.py

```python
import json
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def softmax(values: list[float]) -> list[float]:
    max_value = max(values)
    exp_values = [math.exp(value - max_value) for value in values]
    total = sum(exp_values)
    return [value / total for value in exp_values]

def zero_shot_classify(
    image_embed: list[float],
    class_names: list[str],
    text_embeds: dict[str, list[float]],
    logit_scale: float = 12.0,
) -> dict[str, object]:
    image = normalize(image_embed)
    prompts = [
        f"a photo of {'an' if name[0] in 'aeiou' else 'a'} {name}"
        for name in class_names
    ]
    logits = [
        logit_scale * dot(image, normalize(text_embeds[prompt]))
        for prompt in prompts
    ]
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_class": class_names[best_index],
        "scores": {
            class_name: round(probability, 3)
            for class_name, probability in zip(class_names, probs, strict=True)
        },
    }

text_embeds = {
    "a photo of a damaged box": [0.92, 0.20, 0.05],
    "a photo of an intact box": [0.18, 0.88, 0.08],
    "a photo of a fragile item": [0.48, 0.31, 0.81],
}

result = zero_shot_classify(
    image_embed=[0.89, 0.24, 0.08],
    class_names=["damaged box", "intact box", "fragile item"],
    text_embeds=text_embeds,
)
print(json.dumps(result, indent=2))
```

Output

```text
{
  "predicted_class": "damaged box",
  "scores": {
    "damaged box": 0.988,
    "intact box": 0.001,
    "fragile item": 0.01
  }
}
```

Prompt engineering matters

The format of the text prompt significantly impacts performance. "A photo of a {class}" typically works better than just "{class}".[1]

Ensembling multiple prompts can improve the stability of zero-shot classification. In a real pipeline, each template is filled with the class name, tokenized, and passed through the text encoder. The snippet below substitutes small hand-written vectors for those encoder outputs so it runs standalone: it L2-normalizes each template vector, averages them, and renormalizes the mean. The result is a single class embedding that is less sensitive to the wording of any one template.

prompt-engineering-matters.py

```python
import json
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def mean_vector(vectors: list[list[float]]) -> list[float]:
    return [
        sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(len(vectors[0]))
    ]

prompt_vectors = {
    "a photo of a damaged box": [0.92, 0.20, 0.05],
    "a picture of a damaged box": [0.88, 0.24, 0.07],
    "an image showing damaged box": [0.86, 0.28, 0.08],
    "a damaged box in the wild": [0.82, 0.31, 0.12],
}

ensemble = normalize(mean_vector([
    normalize(vector)
    for vector in prompt_vectors.values()
]))

print(json.dumps({
    "class": "damaged box",
    "templates": len(prompt_vectors),
    "ensemble_embedding": [round(value, 3) for value in ensemble],
}, indent=2))
```

Output

```text
{
  "class": "damaged box",
  "templates": 4,
  "ensemble_embedding": [
    0.955,
    0.284,
    0.088
  ]
}
```

Prompt ensembling reduces wording sensitivity; it does not create a deployment threshold. Add an abstention rule using held-out data, and make uncertain classifications visible to downstream policy.

zero-shot-abstention-policy.py

```python
def decide(scores: dict[str, float], min_score: float, min_margin: float) -> str:
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[:2]
    if best_score < min_score or best_score - second_score < min_margin:
        return "review"
    return best_label

cases = {
    "clear_damage": {"damaged": 0.87, "intact": 0.12, "barcode_blocked": 0.31},
    "ambiguous_corner": {"damaged": 0.56, "intact": 0.53, "barcode_blocked": 0.18},
}

for name, scores in cases.items():
    print(name, decide(scores, min_score=0.60, min_margin=0.10))
```

Output

```text
clear_damage damaged
ambiguous_corner review
```

Limitations of CLIP

A common deployment mistake is using CLIP's global similarity score to count inventory from warehouse shelf photos or localize damaged regions on a package. Standard CLIP inference exposes a pooled image representation optimized for image-text matching, not calibrated counts or boxes. For counting and grounded localization, validate a detector such as Grounding DINO[2] or a grounding-aware VLM instead of interpreting global similarity as a region prediction.

CLIP was an important advance in vision-language alignment, but its contrastive objective and standard scoring interface impose specific constraints. Because zero-shot scoring compares one pooled image representation with one text representation, the output does not expose region coordinates or counting evidence. Understanding that boundary matters before using it in production systems.

No generation. CLIP can only classify or retrieve; it can't generate text captions or new images on its own.

  • Symptom: You ask CLIP to "describe this image" and it gives you a similarity score instead of a sentence.
  • Cause: CLIP is a dual-encoder alignment model, not a decoder. It has no language-generation head.
  • Fix: Use a generative VLM like LLaVA (Large Language-and-Vision Assistant) for captioning, or use CLIP for retrieval and a separate LLM for generation.

Compositional understanding can be weak. For example, CLIP benchmarks show failures when concepts are rearranged into different relations, such as "a red box beside a blue return bin" versus "a blue box beside a red return bin."

  • Symptom: Zero-shot accuracy drops sharply on tasks that require understanding spatial relationships between multiple objects.
  • Cause: The training objective only asks "does this caption match this image overall?" It doesn't supervise exact object positions or relations.
  • Fix: For spatial reasoning, use grounding-aware models or pair CLIP with an open-set detector such as Grounding DINO.
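One way to measure this weakness on your own data is a swap test: score each image against its correct caption and against a copy with the attributes exchanged, then check whether the correct one wins. The sketch below shows the harness with a deliberately order-insensitive toy scorer. That scorer is an assumption for illustration and is not how CLIP encodes text (its text Transformer is order-aware); it only shows what a failing swap test looks like. All names are hypothetical.

```python
from collections import Counter
from typing import Callable

def bag_of_words_score(image_tags: list[str], caption: str) -> int:
    """Toy scorer: counts caption words found in the image tags, ignoring word order."""
    tag_counts = Counter(image_tags)
    caption_counts = Counter(caption.lower().split())
    return sum(min(count, tag_counts[word]) for word, count in caption_counts.items())

def passes_swap_test(
    score: Callable[[list[str], str], float],
    image_tags: list[str],
    correct: str,
    swapped: str,
) -> bool:
    """True only if the scorer ranks the correct caption strictly above the swapped one."""
    return score(image_tags, correct) > score(image_tags, swapped)

image_tags = ["red", "box", "blue", "return", "bin", "beside"]
correct = "a red box beside a blue return bin"
swapped = "a blue box beside a red return bin"

correct_score = bag_of_words_score(image_tags, correct)
swapped_score = bag_of_words_score(image_tags, swapped)
passed = passes_swap_test(bag_of_words_score, image_tags, correct, swapped)
print(correct_score, swapped_score, passed)
```

To audit a real model, replace `bag_of_words_score` with a function that returns the model's image-text similarity and report the pass rate over many swapped pairs.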

Fine-grained tasks need separate validation. Standard pooled scoring is a poor contract for counting ("how many boxes") or exact spatial reasoning because it does not output region-level evidence.

  • Symptom: A team maps a top similarity score to "three people," but a labeled audit shows five.
  • Cause: The deployed scoring path returns global alignment rather than count-supervised region predictions.
  • Fix: Use region-based models or add an explicit counting module.

Text in images is brittle. CLIP shows non-trivial OCR transfer, especially on rendered text, but performance drops on dense documents, small fonts, and stylized text.[1]

  • Symptom: CLIP misses label numbers on shipping boxes or misreads stylized logos.
  • Cause: Web-crawled training data contains more natural-scene text than document text.
  • Fix: For document OCR, use specialized document models or preprocessing pipelines with explicit OCR stages.

Bias. It inherits biases from web-crawled data, sometimes associating certain visual concepts with stereotypes.[3]

  • Symptom: Certain demographic groups or contexts get systematically lower similarity scores for neutral prompts.
  • Cause: The training data reflects internet distributions, which include skewed representations.
  • Fix: Audit embeddings on representative data before deployment; use debiasing techniques or domain-specific fine-tuning.
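A first-pass audit can be as simple as comparing average similarity for one neutral prompt across groups in a labeled evaluation set. The sketch below is a minimal version under stated assumptions: the group names, scores, and the 0.05 threshold are all hypothetical, and a real audit would use many prompts, confidence intervals, and a policy-defined threshold.

```python
from statistics import mean

def score_gap_by_group(
    scores_by_group: dict[str, list[float]],
) -> tuple[dict[str, float], float]:
    """Return per-group mean similarity and the gap between the highest and lowest mean."""
    means = {group: mean(scores) for group, scores in scores_by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap

# Hypothetical similarity scores for one neutral prompt, split by an audit attribute.
scores_by_group = {
    "group_a": [0.31, 0.29, 0.33, 0.30],
    "group_b": [0.24, 0.22, 0.26, 0.23],
}

MAX_ALLOWED_GAP = 0.05  # assumed policy threshold; derive yours from validation data

means, gap = score_gap_by_group(scores_by_group)
needs_review = gap > MAX_ALLOWED_GAP
print({group: round(value, 4) for group, value in means.items()}, round(gap, 4), needs_review)
```

A flagged gap is a prompt for investigation rather than proof of bias, since group differences in image quality or framing can also move scores.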

Vision encoder architectures

The vision encoder is the foundation of any VLM. Its job is to compress high-dimensional pixels into a semantic vector. This transformation must preserve essential details like shape, color, and object relationships while discarding irrelevant noise. The choice of architecture fundamentally determines the model's scalability and its ability to capture fine-grained spatial information.

Vision Transformer (ViT)[4]

While earlier models used Convolutional Neural Networks (CNNs), like ResNet, many modern published VLMs use Vision Transformers (ViT). ViT treats images as sequences of patches, processing them with a standard Transformer encoder, much like Large Language Models (LLMs) process text tokens. The figure below illustrates this process: an input image is sliced into fixed-size patches, linearly projected into embeddings, and fed through a Transformer stack to produce a sequence of contextualized visual tokens.

Vision Transformer patch budget: a 224 by 224 image becomes a 14 by 14 patch grid, then 196 visual tokens flow into the encoder.
Patch size is a token-budget knob. A 224 by 224 image with 16 by 16 patches becomes 196 visual tokens before any language reasoning starts.

Patch tokenization

A 224×224 image divided into 16×16 patches results in $14 \times 14 = 196$ visual tokens, plus a special [CLS] token used to aggregate the global image representation.

Patch size is part of the serving configuration. The example above uses 16x16 patches; the high-resolution calculation later in this chapter uses 14x14 patches. Compute the budget from the actual tower rather than copying a memorized number.

patch-token-budget.py

```python
def patch_tokens(height: int, width: int, patch_size: int, crops: int = 1) -> int:
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size) * crops

for label, size, crops in [("thumbnail", 224, 1), ("page_crop", 336, 1), ("tiled_page", 336, 4)]:
    print(label, patch_tokens(size, size, patch_size=14, crops=crops))
```

Output

```text
thumbnail 256
page_crop 576
tiled_page 2304
```

ViT vs. ResNet

| Feature | Vision Transformer (ViT) | ResNet (CNN) |
| --- | --- | --- |
| Scaling | Scales well with sufficient data; attention cost grows with patch-token count. | Strong convolutional baseline with different compute and inductive-bias trade-offs. |
| Context Window | Global context from the first layer via self-attention across all patches. | Local context initially, building to global context only in deep layers. |
| Inductive Bias | Low (treats image as sequence of patches), requiring more data to learn structure. | High (translation invariance built-in), requires less data to train from scratch. |
| Architecture | Transformer-shaped token sequence is convenient for many connectors. | Also usable as a vision tower; connector still maps its features to the downstream task. |
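The scaling row is worth quantifying. Standard self-attention compares every token with every other token, so the score matrix grows with the square of the token count. The sketch below counts those entries for one attention head at two resolutions; the helper names are hypothetical, and it ignores heads, layers, and optimizations such as fused attention kernels.

```python
def vit_tokens(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    """Patch tokens for a square image, optionally plus one [CLS] token."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if cls_token else 0)

def attention_score_entries(num_tokens: int) -> int:
    """Entries in one head's token-by-token attention score matrix."""
    return num_tokens * num_tokens

base = vit_tokens(224, 16)
high_res = vit_tokens(448, 16)
ratio = attention_score_entries(high_res) / attention_score_entries(base)
print(base, high_res, round(ratio, 1))
```

Doubling the image side length roughly quadruples the tokens and increases the attention score matrix by about sixteen times, which is why resolution is a serving-cost decision and not only a quality decision.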

SigLIP[5]

A scaling cost in standard CLIP training is the softmax-based contrastive loss over image-text candidates in an effective batch. Original CLIP used a batch size of 32,768, which required a distributed setup; smaller implementations can still train, with a different accuracy and negative-diversity trade-off.[1]

Instead of asking "Which of these 32,000 captions is the best match?" (a global multiple-choice question), SigLIP (Sigmoid Loss for Language-Image Pre-training) asks "Is this specific image and caption a good match, yes or no?" (a local true/false question). This improves upon CLIP by replacing the softmax-based contrastive loss with a pairwise sigmoid loss:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\big(z_{ij} \cdot (t \cdot v_i^\top u_j + b)\big)$$

Here $N$ is the batch size, $z_{ij} = +1$ for matching pairs and $z_{ij} = -1$ otherwise (a pair label here, not the logit that $z_{ij}$ denoted in the CLIP objective), while $v_i$ and $u_j$ are the L2-normalized image and text embeddings. $t$ and $b$ are learned temperature and bias parameters, and $\sigma$ is the logistic sigmoid. The original paper initializes the bias to a large negative value so that the loss starts near the all-negative prior, which stabilizes early training when most pairs in a batch are mismatches.[5] Rather than forcing a single matching caption to compete against all others in the batch, this equation independently evaluates each pairing, treating it as a separate binary classification problem.

Advantages over CLIP

  • Decoupled computation: No global normalization is needed. Each pair is scored independently, allowing the similarity matrix to be computed in chunks across devices.[5]
  • Efficiency: The sigmoid loss removes global softmax normalization. The SigLIP paper reports stronger results than its softmax baseline at studied smaller batch sizes and diminishing returns from pushing batch size much further, rather than a blanket hardware guarantee.[5]
  • Memory and partitioning: The loss is easier to chunk across devices because it doesn't require a single global softmax over the full batch.
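The chunking claim in the bullets above can be checked on a toy batch. This is a minimal sketch with hypothetical logits: it sums the pairwise loss over the full matrix, then over four independent blocks, and compares the two. A softmax loss could not be split this way without first sharing each row's normalizer across blocks.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def block_loss(logits, rows, cols) -> float:
    """Sum of pairwise sigmoid losses over one block of the logit matrix."""
    total = 0.0
    for i in rows:
        for j in cols:
            sign = 1.0 if i == j else -1.0  # matching pairs sit on the diagonal
            total += softplus(-sign * logits[i][j])
    return total

# Hypothetical 4x4 logits (t * cosine + b); rows are images, columns are captions.
logits = [
    [2.5, -3.0, -1.2, -4.1],
    [-2.2, 1.8, -0.3, -3.5],
    [-3.9, -0.7, 3.1, -2.8],
    [-1.5, -2.6, -3.3, 0.9],
]
n = len(logits)
full = block_loss(logits, range(n), range(n)) / n

# Split into 2x2 blocks, as if each block were computed on a different device.
halves = [range(0, 2), range(2, 4)]
chunked = sum(block_loss(logits, r, c) for r in halves for c in halves) / n

print("losses agree:", abs(full - chunked) < 1e-9)
```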

SigLIP's pairwise sigmoid loss changes the computation contract: positive and negative pair terms can be evaluated without a batch-wide probability distribution. It does not by itself say what accelerator, data volume, or accuracy a warehouse matcher will need.

siglip-pairwise-loss.py
```python
import math

def softplus(value: float) -> float:
    # Numerically stable log(1 + exp(value)); avoids overflow for large inputs.
    return max(value, 0.0) + math.log1p(math.exp(-abs(value)))

def pair_loss(score: float, is_match: bool) -> float:
    # -log(sigmoid(label * score)) written as softplus(-label * score).
    label = 1.0 if is_match else -1.0
    return softplus(-label * score)

# (name, logit after temperature and bias, whether the pair truly matches)
scores = [
    ("same_package", 3.2, True),
    ("wrong_caption", -2.1, False),
    ("hard_negative", 0.4, False),
]

for name, score, match in scores:
    print(name, round(pair_loss(score, match), 3))
```
Output
```
same_package 0.04
wrong_caption 0.116
hard_negative 0.913
```

Why SigLIP matters in newer stacks

A common published VLM pattern is to swap plain CLIP-style towers for newer SigLIP-family encoders because the sigmoid objective is easier to scale and stronger at moderate batch sizes. SigLIP 2 extends that recipe with caption-based pretraining, self-distillation, masked prediction, and native-aspect-ratio support, and the paper reports better transfer than original SigLIP across model scales when used as a VLM vision backbone.[6] The contrastive idea from CLIP is still the foundation; SigLIP changed the loss, and SigLIP 2 hardened the training recipe.

Modern VLMs: from alignment to generation

CLIP and SigLIP tell a model whether an image and a caption belong together. Modern VLMs go further: they turn visual features into something a language model can reason over token by token.

The safest way to study those systems is to stick to architectures that have actually been published. Open papers already show the main patterns we need:

  • simple projectors that map vision features into the LLM space
  • query-based compressors that reduce many visual patches into a fixed token budget
  • resamplers and adapters that preserve more spatial detail for high-resolution inputs

That gives us a solid mental model without pretending we know the internals of proprietary systems.

Figure: Connector patterns for generative vision-language models: LLaVA uses a projector into LLM tokens, BLIP-2 uses Q-Former query compression, and Flamingo uses a resampler plus gated cross-attention.
Connector choice decides how much visual detail reaches the language model and how many tokens serving must carry.

LLaVA: the straightforward projector path[7]

LLaVA is the clearest example of the "project visual features into the language model" recipe. It takes a pretrained CLIP vision encoder and a pretrained LLM, then learns a projection layer between them.

The original LLaVA recipe uses two stages:[7]

  1. Feature alignment: freeze the vision encoder and LLM, train only the projector.
  2. Visual instruction tuning: fine-tune the projector and LLM on multimodal instruction data.

The architectural lesson is important: LLaVA reported useful multimodal instruction-following behavior from a pretrained vision encoder, an LLM, and a learned projector, without introducing a new vision backbone.[7]
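The two-stage recipe can be written down as a freeze plan. This is a minimal sketch: the module names and the `trainable_modules` helper are hypothetical, and keeping the vision encoder frozen in the second stage is an assumption of this sketch, since the text above only says which parts are fine-tuned.

```python
# True means the module receives gradient updates in that stage.
TRAINING_STAGES = {
    "feature_alignment": {"vision_encoder": False, "projector": True, "llm": False},
    # Assumption: the vision encoder stays frozen here as well.
    "visual_instruction_tuning": {"vision_encoder": False, "projector": True, "llm": True},
}

def trainable_modules(stage: str) -> list[str]:
    """List the modules that are updated during the given stage."""
    return [name for name, trainable in TRAINING_STAGES[stage].items() if trainable]

for stage in TRAINING_STAGES:
    print(stage, "->", trainable_modules(stage))
```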

Common multimodal training ladder

Most strong VLMs are not trained in one jump from raw images to polished chat behavior. They are usually built in stages, with each stage solving a different problem.

| Stage | Typical objective | What it teaches |
| --- | --- | --- |
| Contrastive pretraining | image-text matching as in CLIP or SigLIP | shared semantic space for retrieval and zero-shot recognition |
| Connector or projector alignment | freeze big backbones, train connector | map vision features into language-model space cheaply |
| Multimodal instruction tuning | image + prompt -> assistant response | teach task behavior, dialogue format, OCR usage, and grounded answers |
| Preference or safety post-training | chosen vs rejected responses, critiques, or policy filters | reduce harmful or low-quality multimodal behavior |

CLIP and SigLIP mostly cover the first row.[1][5] LLaVA makes the second and third rows explicit with feature alignment followed by visual instruction tuning.[7] A chat-oriented system needs generative and instruction-following training beyond contrastive alignment, because matching alone does not teach long-form answer behavior.

This ladder is useful because it prevents a common confusion:

  • contrastive pretraining teaches matching
  • instruction tuning teaches how to answer
  • post-training teaches which answers are preferred or allowed

If a multimodal assistant sees the right evidence but answers in the wrong format, you need better instruction tuning, not more CLIP data. If it gives polished but unsafe grounded answers, you need better post-training, not a new projector.

BLIP-2 and the Q-Former: compress before the LLM[8]

BLIP-2 (Bootstrapping Language-Image Pre-training, version 2) is useful because it attacks a different problem. Instead of forwarding every visual feature directly into the LLM, it introduces a Q-Former that learns a fixed set of query vectors. Those queries pull the most relevant information out of the frozen vision encoder.

This gives you a bounded number of visual tokens regardless of how large the raw vision feature map is.[8]

That is relevant in production because tokens admitted to the language path affect prefill and cache cost. If a smaller fixed budget preserves task-critical evidence, it can improve the measured latency-cost trade-off over naive concatenation.
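The fixed-budget property comes from cross-attention: the number of outputs equals the number of queries, not the number of visual features. The sketch below is a simplified stand-in rather than the Q-Former itself. It uses a single attention head with no learned projections, random vectors in place of trained queries, and illustrative sizes.

```python
import math
import random

def cross_attend(queries, features):
    """Single-head dot-product cross-attention: one output vector per query."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) / total for d in range(dim)]
        )
    return outputs

rng = random.Random(0)
dim, num_queries = 8, 4  # illustrative sizes only
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

for num_patches in (256, 576, 2304):
    features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    tokens = cross_attend(queries, features)
    print(num_patches, "visual features ->", len(tokens), "tokens for the LLM")
```

Whatever the size of the feature map, the language model receives `num_queries` tokens. Whether those tokens keep the evidence a task needs is a separate question that has to be measured.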

Flamingo-style resampling: preserve more context without exploding token count[9]

Flamingo pushed another important idea: first compress variable-length visual features with a Perceiver Resampler, then let language tokens attend to that compact bank through gated cross-attention blocks.[9]

This kind of design is especially useful when:

  • multiple images appear in one prompt
  • video or long visual sequences are involved
  • the language model needs to revisit visual evidence later in the answer

The core trade-off is complexity. Resamplers and cross-attention blocks preserve more structure, but they also add more moving parts than the very simple LLaVA recipe.
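One of those moving parts is the gate. The text above only says the cross-attention blocks are gated; the tanh form and the zero starting value used below are assumptions of this sketch, and `gated_residual` is a hypothetical helper. It shows how a gate at zero leaves the language model's state untouched, so visual evidence can be blended in gradually.

```python
import math

def gated_residual(text_state, visual_update, gate_param):
    """Add a cross-attention update to the text state, scaled by tanh(gate_param)."""
    gate = math.tanh(gate_param)
    return [x + gate * u for x, u in zip(text_state, visual_update)]

text_state = [0.5, -1.0, 2.0]
visual_update = [3.0, 3.0, 3.0]  # stand-in for a cross-attention output

# With the gate parameter at zero the block is an identity mapping.
print(gated_residual(text_state, visual_update, gate_param=0.0))
# A non-zero gate mixes the visual update into the text state.
print(gated_residual(text_state, visual_update, gate_param=1.0))
```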

Qwen-VL: grounding and text reading beyond caption matching[10]

Qwen-VL (Qwen Vision-Language) is a good published example of a chat-oriented VLM that was explicitly built for grounding and text reading, not just caption matching. The paper highlights a visual receptor, an input-output interface, and a multi-stage training pipeline that let the model handle OCR-heavy and box-grounded tasks in addition to generic image understanding.[10]

That matters because text reading and grounding are resolution-hungry tasks.

High-resolution token math

Suppose your vision tower uses 14×14 patches. A single 336×336 crop becomes $24 \times 24 = 576$ patch tokens before any projector-side compression. Tile one document page into four such crops and you're already above 2,300 visual tokens before the model even reads the user's question.

In an autoregressive VLM, those visual tokens hurt twice: they make prefill slower, and they enlarge the KV cache that must remain resident during decode. PagedAttention helps the serving stack pack and recycle that memory more efficiently, but it doesn't reduce the number of visual tokens you created in the first place.[11]
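A rough sketch of that KV-cache cost, assuming a hypothetical decoder with 32 layers, 32 key-value heads, a head dimension of 128, and 16-bit cache values. None of these numbers describe a specific model; models that use grouped-query attention would have fewer key-value heads and a smaller cache.

```python
def patch_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patches in one image or crop."""
    return (height // patch) * (width // patch)

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    # The factor 2 covers keys and values, stored per layer and per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical decoder configuration, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 32, 32, 128, 2

for crops in (1, 4):
    tokens = crops * patch_tokens(336, 336, 14)
    mib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES) / 2**20
    print(f"{crops} crop(s): {tokens} visual tokens, {mib:.0f} MiB of KV cache")
```

Under these assumptions each token costs 0.5 MiB of cache, so the four-crop page holds about 1,152 MiB per request for the visual prefix alone.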

Figure: Visual token budget chart for 14 by 14 patches: one 224 by 224 image creates 256 patch tokens, one 336 by 336 crop creates 576 patch tokens, and four such crops create 2304 visual tokens before text.
High-resolution inputs can become long prefixes. Measure compression against evidence loss before choosing a serving policy.

The connector decides how many encoder features become autoregressive prefix tokens. The example below compares raw tiles with a fixed-size compression path without claiming that compression preserves enough OCR evidence.

visual-prefix-serving-budget.py
```python
def prefix_tokens(
    text_tokens: int,
    crops: int,
    patch_tokens_per_crop: int,
    compressed_tokens: int | None,
) -> int:
    visual_tokens = crops * patch_tokens_per_crop
    if compressed_tokens is not None:
        # The compressed budget replaces the total visual token count, not the per-crop count.
        visual_tokens = compressed_tokens
    return text_tokens + visual_tokens

raw_prefix = prefix_tokens(text_tokens=180, crops=4, patch_tokens_per_crop=576, compressed_tokens=None)
compressed_prefix = prefix_tokens(text_tokens=180, crops=4, patch_tokens_per_crop=576, compressed_tokens=64)

print("raw_prefix_tokens:", raw_prefix)
print("compressed_prefix_tokens:", compressed_prefix)
print("tokens_avoided_if_quality_holds:", raw_prefix - compressed_prefix)
```
Output
```
raw_prefix_tokens: 2484
compressed_prefix_tokens: 244
tokens_avoided_if_quality_holds: 2240
```

Production systems usually respond in one of three ways:

  • resize aggressively and accept some information loss
  • tile or crop the image, then summarize each region
  • compress the visual stream with a Q-Former, resampler, or other projector-side bottleneck before handing it to the LLM
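The tiling option can be sketched as a grid of crop boxes. The page size here is made up, and the token count assumes every tile is padded or resized to 336×336 and split into 14×14 patches before encoding.

```python
import math

def tile_boxes(width: int, height: int, tile: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes that cover the page, clipped at its edges."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [
        (c * tile, r * tile, min((c + 1) * tile, width), min((r + 1) * tile, height))
        for r in range(rows)
        for c in range(cols)
    ]

boxes = tile_boxes(width=1344, height=1000, tile=336)  # hypothetical page size in pixels
tokens_per_tile = (336 // 14) ** 2  # assumes each tile is padded or resized to 336x336

print("tiles:", len(boxes))
print("visual tokens before compression:", len(boxes) * tokens_per_tile)
```

A policy that tiles selectively would drop boxes with no text or figures before encoding, which is where most of the savings come from.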

When a model performs well on documents, inspect the full evidence path: resolution, tiling, OCR, connector compression, and the model's measured failure slices. Published designs such as BLIP-2's Q-Former and Flamingo's resampler show ways to bound or mediate visual features before language reasoning.[8][9]

Connector design controls a serving trade-off

The component between the vision encoder and the LLM controls what evidence reaches generation and how many visual tokens the language path must handle.

Common options include:

  1. Linear projection: cheap and simple, but can bottleneck representational power.
  2. MLP projector: still simple, usually stronger than a single linear layer.
  3. Q-Former / query transformer: compresses many visual features into a fixed learned token set.[8]
  4. Convolutional or locality-aware abstractors: preserve more spatial structure before handing features to the LLM.[12]

An overcompressed connector can remove evidence needed for OCR or grounding. A large or expensive connector can increase latency or language-prefix cost. Both outcomes must be measured by task slice.
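A back-of-envelope comparison of the first three options, using hypothetical widths (a 1024-dimensional vision tower and a 4096-dimensional LLM) and hypothetical token counts. The two-layer MLP shape is also an assumption of this sketch. Projectors change the width of each token but keep the count; a query-based connector changes the count.

```python
def linear_params(d_in: int, d_out: int) -> int:
    return d_in * d_out + d_out  # weight matrix plus bias

def mlp_params(d_in: int, d_hidden: int, d_out: int) -> int:
    # Two linear layers; the activation between them has no parameters.
    return linear_params(d_in, d_hidden) + linear_params(d_hidden, d_out)

VISION_DIM, LLM_DIM = 1024, 4096  # hypothetical widths
PATCHES, QUERIES = 576, 64        # hypothetical token counts

print("linear projection:", linear_params(VISION_DIM, LLM_DIM), "params,", PATCHES, "tokens out")
print("MLP projector:", mlp_params(VISION_DIM, LLM_DIM, LLM_DIM), "params,", PATCHES, "tokens out")
print("query-based connector:", QUERIES, "tokens out regardless of patch count")
```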

That's why there's no single best adapter. The right design depends on whether you care more about:

  • raw simplicity
  • token efficiency
  • spatial fidelity
  • video or multi-image support

Designing around unpublished internals

Closed vendors may publish capability demos and benchmark numbers without enough architecture detail to justify internals claims.

Treat the model as an evaluated dependency: measure supported input resolution, visual-token billing if exposed, latency by image count, OCR and grounding slices, refusal behavior, and output contract. Do not infer a projector type, token budget, or training curriculum unless the provider publishes it.

Evaluation

Evaluating a VLM requires more than one benchmark because different tests stress very different abilities.

| Benchmark | Focus | Metric | What it tells you |
| --- | --- | --- | --- |
| ImageNet | Zero-shot classification | Top-1 Accuracy | Whether the model recognizes common visual categories |
| COCO (Common Objects in Context) Captions | Captioning | CIDEr, BLEU | Whether the generated text overlaps with human references |
| VQAv2 | Visual question answering | Accuracy | Whether the model can answer grounded questions about an image |
| TextVQA | OCR + reasoning | Accuracy | Whether the model can read and reason over embedded text |
| MMMU | Expert multimodal reasoning | Accuracy | Whether it can combine specialist knowledge and visual evidence |
| MMBench | Prompt variation | Score | Whether the model is stable under option ordering and prompt variations |

No single metric is enough.

  • CLIP-like models can do well on retrieval and still fail badly on counting.
  • Caption metrics can look fine even when spatial reasoning is weak.
  • OCR-heavy tasks expose very different failure modes than general image description.

That's why serious evaluation needs a benchmark mix, not a single scoreboard.

Aggregate scores can also hide the exact slice that blocks launch. Keep capability slices separate and require task-specific pass bars.

vlm-evaluation-slices.py
```python
# Each slice maps to (measured value, required pass bar).
results = {
    "product_retrieval_recall_at_10": (0.94, 0.90),
    "shipping_label_ocr_accuracy": (0.71, 0.92),
    "damage_box_grounding_iou": (0.76, 0.70),
}

# A single failing slice blocks launch, even if the others pass.
failed = [
    metric
    for metric, (measured, required) in results.items()
    if measured < required
]

print("launch_ready:", not failed)
print("failed_slices:", failed)
```
Output
```
launch_ready: False
failed_slices: ['shipping_label_ocr_accuracy']
```

Production Applications

Document understanding

Document AI is fundamentally a token-budget problem. Dense documents contain tiny text, tables, figures, and layout cues that don't survive aggressive resizing.

In practice, production systems usually combine some mix of:

  • page tiling or crop-based preprocessing
  • OCR features
  • a VLM for ambiguous or layout-heavy cases

The engineering goal isn't "feed the page in at maximum resolution." It's "preserve the evidence that matters while keeping the token budget survivable."

Image search and retrieval

For image search, a generative VLM is usually the wrong first tool.

A dual-encoder model like CLIP or SigLIP is better because you can:

  • precompute image embeddings offline
  • index them once
  • embed only the query at request time

That supports an indexed first stage whose latency and cost can be measured independently. Generative VLMs are candidates for reranking or explanation after retrieval when those added capabilities justify their measured cost.
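The three steps above can be sketched with toy vectors. The three-dimensional embeddings and file names are invented stand-ins for encoder outputs, and the brute-force scan stands in for a real vector index.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Offline: embed each image once and store the unit vectors.
image_index = {
    "red_mug.jpg": normalize([0.9, 0.1, 0.0]),
    "blue_mug.jpg": normalize([0.1, 0.9, 0.1]),
    "red_kettle.jpg": normalize([0.7, 0.0, 0.6]),
}

def search(query_embedding, k=2):
    """Request time: embed only the query, then rank the stored image vectors."""
    query = normalize(query_embedding)
    ranked = sorted(image_index, key=lambda name: cosine(query, image_index[name]), reverse=True)
    return ranked[:k]

print(search([1.0, 0.0, 0.1]))  # toy embedding for a text query
```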

route-vision-requests.py
```python
def select_path(task: str) -> list[str]:
    # Route each task to the cheapest pipeline that can satisfy it.
    if task == "retrieve_similar_products":
        return ["dual_encoder", "vector_index"]
    if task == "answer_from_label_text":
        return ["ocr", "generative_vlm", "citation_check"]
    if task == "click_refund_button":
        return ["grounding_model", "action_policy", "human_confirm"]
    raise ValueError(task)

for task in ["retrieve_similar_products", "answer_from_label_text", "click_refund_button"]:
    print(task, "->", " + ".join(select_path(task)))
```
Output
```
retrieve_similar_products -> dual_encoder + vector_index
answer_from_label_text -> ocr + generative_vlm + citation_check
click_refund_button -> grounding_model + action_policy + human_confirm
```

Visual agents

Visual agents need more than a captioning model. They need grounding.

It isn't enough to answer "the blue button is near the top-right." The system often has to emit an action like:

  • click (x, y)
  • select the second menu item
  • type into the input field beneath a label

That usually requires a grounding-aware model or a separate grounding component on top of the core VLM, plus strong safety checks around actions.
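A minimal sketch of such a safety check, placed between the grounding output and the executor. The `ClickAction` type, the list of risky labels, and the three outcomes are all hypothetical; a real policy would be defined by the product's risk model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClickAction:
    x: int
    y: int
    target_label: str

RISKY_LABELS = {"refund", "delete", "pay"}  # hypothetical policy list

def review_click(action: ClickAction, screen_width: int, screen_height: int) -> str:
    """Decide whether a proposed click is rejected, escalated, or executed."""
    if not (0 <= action.x < screen_width and 0 <= action.y < screen_height):
        return "reject"  # the grounding step produced an off-screen point
    if action.target_label.lower() in RISKY_LABELS:
        return "human_confirm"  # side-effecting actions need explicit approval
    return "execute"

for action in [
    ClickAction(120, 48, "Search"),
    ClickAction(900, 300, "Refund"),
    ClickAction(5000, 10, "Search"),
]:
    print(action.target_label, (action.x, action.y), "->", review_click(action, 1280, 720))
```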

Mastery check

Use these checkpoints to test core mechanics before moving on.

Evaluation rubric

By the end of this chapter, you should be able to defend these design choices:

| Decision | Pass bar |
| --- | --- |
| Retrieval vs generation | You can explain when CLIP or SigLIP is enough, and when a generative VLM is required because the task needs open-ended answers, OCR-heavy reasoning, or grounded actions. |
| Backbone choice | You can compare ViT and ResNet trade-offs around scale, inductive bias, and tokenization, and explain why patch size changes prefill and KV-cache budget. |
| Spatial failure modes | You can predict when pooled image embeddings will fail on counting, localization, OCR, or grounding before a team ships the wrong model class. |
| Connector choice | You can defend when a simple projector is enough and when a Q-Former, resampler, or other compression step is needed before the LLM. |
| Document strategy | You can choose between higher resolution, tiling, OCR preprocessing, and crop-and-compress pipelines based on what evidence must survive. |
| Serving budget | You can estimate how visual token count affects prefill latency, concurrency, and KV-cache pressure, not only encoder FLOPs. |

Common pitfalls

"One pooled CLIP embedding understands an image like a person does"

Symptom: Teams ask for counts, boxes, or grounded actions from one global embedding. Cause: CLIP was trained for global alignment, not dense spatial supervision. Fix: Use detectors, grounding-aware VLMs, or region features when localization or counting matters.

"CLIP or SigLIP can also serve my chat and captioning path"

Symptom: Similarity scores are wired into a task that needs open-ended explanations or grounded dialogue. Cause: Image-text matching was confused with multimodal generation. Fix: Keep CLIP or SigLIP for retrieval and zero-shot scoring, and use a generative VLM when you need open-ended answers.

"Higher resolution always means better OCR"

Symptom: Document quality rises a little, but latency and memory spike hard. Cause: Resolution was increased without checking token budget, tiling policy, or compression. Fix: Measure visual token count, tile selectively, preserve only the evidence that matters, and compress before the LLM when possible.

"Visual serving cost ends when the encoder finishes"

Symptom: Offline profiling looks cheap, but production decode concurrency collapses. Cause: The team counted encoder FLOPs but ignored prefill latency and KV-cache growth from large visual prefixes. Fix: Track prefill cost and KV pressure together with encoder cost before picking image resolution or crop count.

"Every in-batch non-match is a clean negative"

Symptom: Contrastive training underperforms on visually similar products or captions that describe the same item in different words. Cause: Some in-batch negatives are false negatives, so the loss pushes semantically related pairs apart. Fix: Curate data carefully, use richer captioning, and treat contrastive batch design as a modeling choice rather than a harmless training detail.

"Zero-shot prompts will hold in fine-grained domains"

Symptom: Prompt wording swings predictions or specialist categories collapse together. Cause: Generic templates do not provide enough domain signal. Fix: Prompt-ensemble, calibrate on validation data, and fine-tune or probe when domain precision matters.
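Prompt ensembling can be sketched as averaging the text embeddings of several templates per class before scoring. The class names and three-dimensional vectors below are invented; in practice each vector would come from the text encoder applied to one prompt template.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_embedding(prompt_embeddings):
    """Average unit-length prompt embeddings, then renormalize the mean."""
    unit = [normalize(v) for v in prompt_embeddings]
    mean = [sum(column) / len(unit) for column in zip(*unit)]
    return normalize(mean)

def classify(image_embedding, class_embeddings):
    image = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, emb))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get)

# Toy per-template embeddings for two hypothetical fine-grained classes.
templates = {
    "cracked_screen": [[0.9, 0.2, 0.1], [0.8, 0.3, 0.0], [0.7, 0.1, 0.3]],
    "scratched_screen": [[0.2, 0.9, 0.1], [0.3, 0.8, 0.2], [0.1, 0.7, 0.3]],
}
class_embeddings = {name: ensemble_embedding(vecs) for name, vecs in templates.items()}

print(classify([0.8, 0.3, 0.1], class_embeddings))
```

Averaging reduces the effect of any single template's wording, but it does not add domain signal that none of the templates carry. That is where calibration or fine-tuning comes in.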

What you should carry forward

After this article, you should be able to:

  • Explain how CLIP's dual-encoder architecture aligns images and text in a shared embedding space.
  • Walk through the contrastive loss logic with a concrete batch example.
  • Implement zero-shot classification using prompt engineering and embedding similarity.
  • Diagnose why CLIP fails on spatial reasoning, counting, or fine-grained OCR.
  • Compare LLaVA, BLIP-2, and Flamingo-style designs for connecting vision encoders to language models.
  • Size the visual token budget for a production VLM and choose an appropriate compression strategy.
Next Step
Continue to Multimodal LLM Architecture

You will now take the CLIP-style alignment and visual token compression ideas and apply them to the full architecture of multimodal LLMs that fuse vision encoders with language models through projectors, cross-attention, or early fusion. Visual token budgets, projector design, and the KV-cache implications of long image sequences become first-class constraints in the next chapter.

References

1. Learning Transferable Visual Models From Natural Language Supervision. Radford, A., et al. · 2021 · ICML 2021
2. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. Liu, S., et al. · 2023 · arXiv preprint
3. Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications. Agarwal, S., et al. · 2021 · arXiv preprint
4. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Dosovitskiy, A., et al. · 2020 · ICLR 2021
5. Sigmoid Loss for Language Image Pre-training. Zhai, X., et al. · 2023 · ICCV 2023
6. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. Tschannen, M., Gritsenko, A., Wang, X., et al. · 2025
7. Visual Instruction Tuning. Liu, H., et al. · 2023 · NeurIPS 2023
8. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Li, J., et al. · 2023 · ICML 2023
9. Flamingo: a Visual Language Model for Few-Shot Learning. Alayrac, J.-B., et al. · 2022 · NeurIPS 2022
10. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Bai, J., et al. · 2023 · arXiv preprint
11. Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
12. Honeybee: Locality-enhanced Projector for Multimodal LLM. Cha, J., et al. · 2023 · arXiv preprint