Master CLIP's contrastive pre-training, zero-shot classification, visual token budgets, and the architecture of modern published VLMs like LLaVA, BLIP-2, and Qwen-VL.
LLM-powered search turned retrieved evidence into cited answers. Vision-language models add visual evidence to that same loop: images, diagrams, screenshots, document pages, and product photos become searchable and reasoned about alongside text.
Vision-language models connect images and text so systems can search, classify, and reason across both modalities. This chapter starts with CLIP-style contrastive learning, then expands to production uses and failure modes.
Imagine training a warehouse system to recognize damaged packages. You wouldn't just give it a fixed list of labels like damaged_box or intact_box. You'd show it real photos paired with descriptions: "corner crushed during transit," "shipping label torn near barcode," or "fragile sticker visible on side panel." The system learns the concept from images and language together.
Many supervised computer vision systems expose a fixed label inventory: recognizing a new category generally requires labeled examples and another training step. CLIP (Contrastive Language-Image Pre-training) showed that image-text matching at web scale can produce representations that transfer to new text-defined classes without training a new classifier head.[1] That unlocked practical open-vocabulary retrieval and provided useful vision features for later generative systems.
Before we dive into CLIP, recall what embeddings are: high-dimensional vectors that compress meaning into a list of numbers. A sentence like "corner crushed during transit" becomes a vector, perhaps 768 numbers long. An image of that same crushed box becomes another 768-number vector.
If the model has done its job, those two vectors point in roughly the same direction in space. We measure that alignment with cosine similarity, a score between -1 and 1. Within one trained model, a higher score ranks an image-text pair as more aligned than a lower-scoring candidate. It is not a calibrated probability and does not prove that the pair means exactly the same thing. That's the geometric heart of CLIP: it learns to rank matching image-text pairs above mismatches.
Imagine trying to learn a warehouse photo taxonomy from a fixed label list. You might learn that one class is damaged_box, but you wouldn't understand how damage appears across lighting, angles, materials, and carrier labels.
Now imagine learning from many real catalog and logistics images with full captions. This is how Vision-Language Models (VLMs) learn. Instead of memorizing fixed categories, they learn to align the visual concept of an object with its natural language description.
This shared image-text space allows a system to score text prompts that were not fixed training labels. Transfer still depends on whether pre-training learned the relevant visual evidence and whether the new domain resembles the data it saw. A warehouse team must validate prompts on its own cartons, lighting, and damage policy before using scores operationally.
Think of CLIP as a shared routing map for vision and language. One side starts with pixels, the other with words. The two encoders never share a raw input format; what ties them together is training on matched pairs, 400 million image-text pairs in the original model.[1] Over training, matched images and captions receive higher similarity than mismatches. If that training produced useful package-damage features, the model may rank a new prompt such as "torn corner on shipping box" well even when that exact label was not a training class. That is a transfer hypothesis to evaluate, not an automatic guarantee.
CLIP jointly trains an image encoder and text encoder to align visual and textual representations in a shared embedding space. In the original paper, the vision side was either a ResNet (Residual Network) or a ViT (Vision Transformer), while the text side was a Transformer that produced one representation for each caption.[1] The contrastive objective pulls matching pairs together and pushes mismatches apart, which is what makes zero-shot transfer and cross-modal retrieval work.
The illustration above shows the core CLIP idea: keep the encoders separate, normalize both outputs, then judge whether the image and text land near each other in one shared space.
Think of this process like matching warehouse photos to their correct captions. For every image, the model must pick the correct caption out of thousands of incorrect ones.
Let's walk through a tiny batch first, because the pattern is easier to see with real numbers. Suppose you have 3 images and 3 captions:
| Image | Caption |
|---|---|
| A | "poly mailer torn near barcode" |
| B | "return label covers old tracking number" |
| C | "fragile sticker visible on side panel" |
There are $3 \times 3 = 9$ possible pairings. Only the 3 diagonal pairs (A+A, B+B, C+C) are correct. The other 6 are mismatches. CLIP's job is to make the similarity scores for the diagonal pairs as high as possible, while keeping the off-diagonal scores low.
In a real training run, the batch size is much larger (32,768 in the original paper[1]), so each image sees thousands of negative examples. But the logic is the same: maximize the similarity of matching pairs and minimize it for non-matching pairs using a symmetric contrastive loss that's often described as InfoNCE-style (Information Noise Contrastive Estimation).
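Let $u_i$ and $v_j$ be the L2-normalized image and text embeddings for a batch of $N$ pairs. CLIP computes logits
$$s_{ij} = \alpha \, u_i^\top v_j$$
where $\alpha$ is a learned logit-scale parameter. You'll also see this written as division by a temperature $\tau$, where $\alpha = 1/\tau$. The symmetric objective is:
$$\mathcal{L}_{\text{i2t}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$$
$$\mathcal{L}_{\text{t2i}} = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(s_{jj})}{\sum_{i=1}^{N} \exp(s_{ij})}$$
$$\mathcal{L} = \frac{1}{2} \left( \mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}} \right)$$
The first equation asks: "Given image $i$, what's the probability that caption $i$ is the right one?" It compares the matching score against the sum of all scores for that image across every caption in the batch. The second equation does the same from the text side. The final loss averages both directions so neither encoder dominates.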
The implementation below shows a simplified fixed-temperature version of that loss. It keeps the math readable while matching the same symmetric training pattern.

```python
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def cross_entropy(logits: list[float], target: int) -> float:
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -logits[target] + log_sum_exp

def clip_loss(
    image_embeds: list[list[float]],
    text_embeds: list[list[float]],
    temperature: float = 0.2,
) -> tuple[float, list[list[float]]]:
    images = [normalize(vector) for vector in image_embeds]
    texts = [normalize(vector) for vector in text_embeds]
    logits = [
        [dot(image, text) / temperature for text in texts]
        for image in images
    ]

    loss_i2t = sum(cross_entropy(row, i) for i, row in enumerate(logits)) / len(logits)
    columns = [list(column) for column in zip(*logits, strict=True)]
    loss_t2i = sum(cross_entropy(column, i) for i, column in enumerate(columns)) / len(columns)
    return (loss_i2t + loss_t2i) / 2, logits

image_embeds = [
    [0.95, 0.05, 0.00],
    [0.05, 0.92, 0.03],
    [0.00, 0.07, 0.94],
]
text_embeds = [
    [0.91, 0.08, 0.01],
    [0.06, 0.89, 0.04],
    [0.02, 0.09, 0.90],
]

loss, logits = clip_loss(image_embeds, text_embeds)
print("logit matrix:")
for row in logits:
    print([round(value, 2) for value in row])
print(f"symmetric_loss: {loss:.3f}")
```

```text
logit matrix:
[5.0, 0.6, 0.14]
[0.71, 5.0, 0.66]
[0.09, 0.59, 5.0]
symmetric_loss: 0.022
```

Low loss means the diagonal image-caption pairs dominate the off-diagonal mismatches. A random or poorly aligned batch would have a much flatter matrix and a higher loss.
Each softmax normalizes over every candidate in the batch, so the batch itself supplies the negatives. That is why CLIP benefits from very large batch sizes (32,768 in the original paper[1]): each example is contrasted against many in-batch negatives. In practice, that pushes training toward large distributed runs with expensive cross-device synchronization.
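To make the batch-size argument concrete, the arithmetic below counts how many pairings a symmetric in-batch loss scores. The helper name `in_batch_pairs` is an illustrative assumption; the counts follow directly from the square similarity matrix:

```python
def in_batch_pairs(batch_size: int) -> dict[str, int]:
    """Count the pairings scored by a symmetric in-batch contrastive loss."""
    total = batch_size * batch_size  # every image against every caption
    positives = batch_size           # the diagonal of the similarity matrix
    return {
        "total_pairs": total,
        "positives": positives,
        "negatives": total - positives,
        "negatives_per_image": batch_size - 1,
    }

for batch_size in (3, 256, 32_768):
    print(batch_size, in_batch_pairs(batch_size))
```

The similarity matrix grows quadratically with batch size, which is one reason the full-batch comparison is costly to distribute.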
An off-diagonal pair is treated as negative by this objective, but it is not automatically a clean semantic negative. Two photographs of the same product defect may be valid matches even if they arrived as separate labeled pairs. A batch-construction audit catches obvious collisions before the loss pushes them apart.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pair:
    image_id: str
    caption: str
    concept: str

pairs = [
    Pair("box-front", "corner crushed during transit", "crushed-corner"),
    Pair("box-side", "cardboard corner is dented", "crushed-corner"),
    Pair("label", "barcode obscured by tape", "blocked-barcode"),
]

potential_false_negatives = [
    (left.image_id, right.image_id, left.concept)
    for index, left in enumerate(pairs)
    for right in pairs[index + 1:]
    if left.concept == right.concept
]

print("pairs_treated_as_negative_but_related:", potential_false_negatives)
```

```text
pairs_treated_as_negative_but_related: [('box-front', 'box-side', 'crushed-corner')]
```

The success of CLIP is heavily dependent on the sheer scale and diversity of its pre-training data. The original paper doesn't release or assign a canonical public name to the full corpus. It describes training on 400 million (image, text) pairs collected from the internet.[1]
Unlike prior vision datasets (e.g., ImageNet) that relied on manual class labels, CLIP learns from natural language supervision attached to images on the web. That means the model trains against captions and descriptive text rather than a closed vocabulary of hand-assigned categories. The result is a much broader representation of visual concepts and a much stronger zero-shot transfer story.
CLIP enables classification without any task-specific training. Instead of a final classification layer with fixed weights, the model computes the similarity between the image and the text descriptions of possible classes.
Analogy: instead of teaching a child that "Class 42 is a zebra," you give them a dictionary. They look at a picture and find the word in the dictionary that best describes it.
Here's a concrete example. Suppose you have a warehouse photo and you want to classify it into one of three categories:
| Class | Prompt | Similarity |
|---|---|---|
| damaged box | "a photo of a damaged box" | 0.87 |
| intact box | "a photo of an intact box" | 0.12 |
| fragile item | "a photo of a fragile item" | 0.41 |
The highest score is the model's top candidate. In a real decision path, it should win only if validation has established an acceptable score or margin policy; ambiguous images should abstain or route to review. You did not retrain the model, but you still need to calibrate the decision rule.
The practical breakthrough isn't classification alone; it's a text-defined candidate vocabulary. A warehouse damage-claim system can test a prompt such as "poly mailer with barcode partially obscured by packing tape" without fitting a new label head. Whether the prompt is reliable on real claims still has to be measured.
The function below shows the standard zero-shot pattern: prompt the candidate classes in natural language, encode image and text, normalize both, then score them with a scaled dot product. This standalone version uses small hand-written vectors so the inference mechanics are visible.

```python
import json
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def softmax(values: list[float]) -> list[float]:
    max_value = max(values)
    exp_values = [math.exp(value - max_value) for value in values]
    total = sum(exp_values)
    return [value / total for value in exp_values]

def zero_shot_classify(
    image_embed: list[float],
    class_names: list[str],
    text_embeds: dict[str, list[float]],
    logit_scale: float = 12.0,
) -> dict[str, object]:
    image = normalize(image_embed)
    prompts = [
        f"a photo of {'an' if name[0] in 'aeiou' else 'a'} {name}"
        for name in class_names
    ]
    logits = [
        logit_scale * dot(image, normalize(text_embeds[prompt]))
        for prompt in prompts
    ]
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_class": class_names[best_index],
        "scores": {
            class_name: round(probability, 3)
            for class_name, probability in zip(class_names, probs, strict=True)
        },
    }

text_embeds = {
    "a photo of a damaged box": [0.92, 0.20, 0.05],
    "a photo of an intact box": [0.18, 0.88, 0.08],
    "a photo of a fragile item": [0.48, 0.31, 0.81],
}

result = zero_shot_classify(
    image_embed=[0.89, 0.24, 0.08],
    class_names=["damaged box", "intact box", "fragile item"],
    text_embeds=text_embeds,
)
print(json.dumps(result, indent=2))
```

```text
{
  "predicted_class": "damaged box",
  "scores": {
    "damaged box": 0.988,
    "intact box": 0.001,
    "fragile item": 0.01
  }
}
```

The format of the text prompt significantly impacts performance. "A photo of a {class}" typically works better than just "{class}".[1]
Ensembling multiple prompts can improve the stability of zero-shot classification. In a real pipeline, each template is filled with the class name, tokenized, and passed through the text encoder; the resulting embeddings are normalized, averaged, and normalized again into a single class vector that is less sensitive to any one template's wording. The snippet below uses hand-written vectors in place of encoder outputs so the averaging step is visible.

```python
import json
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def mean_vector(vectors: list[list[float]]) -> list[float]:
    return [
        sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(len(vectors[0]))
    ]

prompt_vectors = {
    "a photo of a damaged box": [0.92, 0.20, 0.05],
    "a picture of a damaged box": [0.88, 0.24, 0.07],
    "an image showing damaged box": [0.86, 0.28, 0.08],
    "a damaged box in the wild": [0.82, 0.31, 0.12],
}

ensemble = normalize(mean_vector([
    normalize(vector)
    for vector in prompt_vectors.values()
]))

print(json.dumps({
    "class": "damaged box",
    "templates": len(prompt_vectors),
    "ensemble_embedding": [round(value, 3) for value in ensemble],
}, indent=2))
```

```text
{
  "class": "damaged box",
  "templates": 4,
  "ensemble_embedding": [
    0.955,
    0.284,
    0.088
  ]
}
```

Prompt ensembling reduces wording sensitivity; it does not create a deployment threshold. Add an abstention rule using held-out data, and make uncertain classifications visible to downstream policy.

```python
def decide(scores: dict[str, float], min_score: float, min_margin: float) -> str:
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[:2]
    if best_score < min_score or best_score - second_score < min_margin:
        return "review"
    return best_label

cases = {
    "clear_damage": {"damaged": 0.87, "intact": 0.12, "barcode_blocked": 0.31},
    "ambiguous_corner": {"damaged": 0.56, "intact": 0.53, "barcode_blocked": 0.18},
}

for name, scores in cases.items():
    print(name, decide(scores, min_score=0.60, min_margin=0.10))
```

```text
clear_damage damaged
ambiguous_corner review
```

A common deployment mistake is using CLIP's global similarity score to count inventory from warehouse shelf photos or localize damaged regions on a package. Standard CLIP inference exposes a pooled image representation optimized for image-text matching, not calibrated counts or boxes. For counting and grounded localization, validate a detector such as Grounding DINO[2] or a grounding-aware VLM instead of interpreting global similarity as a region prediction.
CLIP was an important advance in vision-language alignment, but its contrastive objective and standard scoring interface impose specific constraints. Because zero-shot scoring compares one pooled image representation with one text representation, the output does not expose region coordinates or counting evidence. Understanding that boundary matters before using it in production systems.
No generation. CLIP can only classify or retrieve; it can't generate text captions or new images on its own.
Compositional understanding can be weak. Evaluations of CLIP-style models report failures when the same concepts are rearranged into different relations, such as "a red box beside a blue return bin" versus "a blue box beside a red return bin."
Fine-grained tasks need separate validation. Standard pooled scoring is a poor contract for counting ("how many boxes") or exact spatial reasoning because it does not output region-level evidence.
Text in images is brittle. CLIP shows non-trivial OCR transfer, especially on rendered text, but performance drops on dense documents, small fonts, and stylized text.[1]
Bias. It inherits biases from web-crawled data, sometimes associating certain visual concepts with stereotypes.[3]
The vision encoder is the foundation of any VLM. Its job is to compress high-dimensional pixels into a semantic vector. This transformation must preserve essential details like shape, color, and object relationships while discarding irrelevant noise. The choice of architecture fundamentally determines the model's scalability and its ability to capture fine-grained spatial information.
While earlier models used Convolutional Neural Networks (CNNs), like ResNet, many modern published VLMs use Vision Transformers (ViT). ViT treats images as sequences of patches, processing them with a standard Transformer encoder, much like Large Language Models (LLMs) process text tokens. The figure below illustrates this process: an input image is sliced into fixed-size patches, linearly projected into embeddings, and fed through a Transformer stack to produce a sequence of contextualized visual tokens.
A 224×224 image divided into 16×16 patches results in $(224/16)^2 = 14^2 = 196$ visual tokens, plus a special [CLS] token used to aggregate the global image representation.
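The slicing step can be sketched in plain Python. This is a toy version under stated assumptions: a single-channel image stored as nested lists and a hypothetical `patchify` helper. It covers only the split into flattened patches, not the linear projection or the Transformer stack:

```python
def patchify(image: list[list[float]], patch_size: int) -> list[list[float]]:
    """Split a 2D grayscale image into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches

# Toy 4x4 single-channel "image"; a real ViT input has three channels.
image = [[float(4 * row + col) for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), "patches of length", len(patches[0]))
print("first patch:", patches[0])
```

Each flattened patch is what the model would project into an embedding before adding position information.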
Patch size is part of the serving configuration. The example above uses 16x16 patches; the high-resolution calculation later in this chapter uses 14x14 patches. Compute the budget from the actual tower rather than copying a memorized number.

```python
def patch_tokens(height: int, width: int, patch_size: int, crops: int = 1) -> int:
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size) * crops

for label, size, crops in [("thumbnail", 224, 1), ("page_crop", 336, 1), ("tiled_page", 336, 4)]:
    print(label, patch_tokens(size, size, patch_size=14, crops=crops))
```

```text
thumbnail 256
page_crop 576
tiled_page 2304
```

| Feature | Vision Transformer (ViT) | ResNet (CNN) |
|---|---|---|
| Scaling | Scales well with sufficient data; attention cost grows with patch-token count. | Strong convolutional baseline with different compute and inductive-bias trade-offs. |
| Context Window | Global context from the first layer via self-attention across all patches. | Local context initially, building to global context only in deep layers. |
| Inductive Bias | Low (treats image as sequence of patches), requiring more data to learn structure. | High (locality and translation equivariance built in), typically needing less data to train from scratch. |
| Architecture | Transformer-shaped token sequence is convenient for many connectors. | Also usable as a vision tower; connector still maps its features to the downstream task. |
A scaling cost in standard CLIP training is the softmax-based contrastive loss over image-text candidates in an effective batch. Original CLIP used a batch size of 32,768, which required a distributed setup; smaller implementations can still train, with a different accuracy and negative-diversity trade-off.[1]
Instead of asking "Which of these 32,000 captions is the best match?" (a global multiple-choice question), SigLIP (Sigmoid Loss for Language-Image Pre-training) asks "Is this specific image and caption a good match, yes or no?" (a local true/false question). This improves upon CLIP by replacing the softmax-based contrastive loss with a pairwise sigmoid loss:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\left( z_{ij} \left( t \, u_i^\top v_j + b \right) \right)$$
Here $N$ is the batch size, $z_{ij} = 1$ for matching pairs and $-1$ otherwise, while $u_i$ and $v_j$ are the L2-normalized image and text embeddings. $t$ and $b$ are learned temperature and bias parameters, and $\sigma$ is the logistic sigmoid. The original paper initializes the bias to a large negative value so that the loss starts near the all-negative prior, which stabilizes early training when most pairs in a batch are mismatches.[5] Rather than forcing a single matching caption to compete against all others in the batch, this equation independently evaluates each pairing, treating it as a separate binary classification problem.
SigLIP's pairwise sigmoid loss changes the computation contract: positive and negative pair terms can be evaluated without a batch-wide probability distribution. It does not by itself say what accelerator, data volume, or accuracy a warehouse matcher will need.

```python
import math

def softplus(value: float) -> float:
    return math.log1p(math.exp(value))

def pair_loss(score: float, is_match: bool) -> float:
    label = 1.0 if is_match else -1.0
    return softplus(-label * score)

scores = [
    ("same_package", 3.2, True),
    ("wrong_caption", -2.1, False),
    ("hard_negative", 0.4, False),
]

for name, score, match in scores:
    print(name, round(pair_loss(score, match), 3))
```

```text
same_package 0.04
wrong_caption 0.116
hard_negative 0.913
```

A common published VLM pattern is to swap plain CLIP-style towers for newer SigLIP-family encoders because the sigmoid objective is easier to scale and stronger at moderate batch sizes. SigLIP 2 extends that recipe with caption-based pretraining, self-distillation, masked prediction, and native-aspect-ratio support, and the paper reports better transfer than original SigLIP across model scales when used as a VLM vision backbone.[6] The contrastive idea from CLIP is still the foundation; SigLIP changed the loss, and SigLIP 2 hardened the training recipe.
CLIP and SigLIP tell a model whether an image and a caption belong together. Modern VLMs go further: they turn visual features into something a language model can reason over token by token.
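The safest way to study those systems is to stick to architectures that have actually been published. Open papers already show the main patterns we need: a learned projector into the language model (LLaVA), a fixed set of learned queries (BLIP-2), a resampler with gated cross-attention (Flamingo), and a chat-oriented model built for grounding and text reading (Qwen-VL).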
That gives us a solid mental model without pretending we know the internals of proprietary systems.
LLaVA is the clearest example of the "project visual features into the language model" recipe. It takes a pretrained CLIP vision encoder and a pretrained LLM, then learns a projection layer between them.
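The original LLaVA recipe uses two stages:[7] first, feature-alignment pre-training, which trains only the projection layer while the vision encoder and the LLM stay frozen; second, visual instruction tuning, which updates the projector and the LLM while the vision encoder remains frozen.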
The architectural lesson is important: LLaVA reported useful multimodal instruction-following behavior from a pretrained vision encoder, an LLM, and a learned projector, without introducing a new vision backbone.[7]
Most strong VLMs are not trained in one jump from raw images to polished chat behavior. They are usually built in stages, with each stage solving a different problem.
| Stage | Typical objective | What it teaches |
|---|---|---|
| Contrastive pretraining | image-text matching as in CLIP or SigLIP | shared semantic space for retrieval and zero-shot recognition |
| Connector or projector alignment | freeze big backbones, train connector | map vision features into language-model space cheaply |
| Multimodal instruction tuning | image + prompt -> assistant response | teach task behavior, dialogue format, OCR usage, and grounded answers |
| Preference or safety post-training | chosen vs rejected responses, critiques, or policy filters | reduce harmful or low-quality multimodal behavior |
CLIP and SigLIP mostly cover the first row.[1][5] LLaVA makes the second and third rows explicit with feature alignment followed by visual instruction tuning.[7] A chat-oriented system needs generative and instruction-following training beyond contrastive alignment, because matching alone does not teach long-form answer behavior.
This ladder is useful because it prevents a common confusion:
If a multimodal assistant sees the right evidence but answers in the wrong format, you need better instruction tuning, not more CLIP data. If it gives polished but unsafe grounded answers, you need better post-training, not a new projector.
BLIP-2 (Bootstrapping Language-Image Pre-training, version 2) is useful because it attacks a different problem. Instead of forwarding every visual feature directly into the LLM, it introduces a Q-Former that learns a fixed set of query vectors. Those queries pull the most relevant information out of the frozen vision encoder.
This gives you a bounded number of visual tokens regardless of how large the raw vision feature map is.[8]
That is relevant in production because tokens admitted to the language path affect prefill and cache cost. If a smaller fixed budget preserves task-critical evidence, it can improve the measured latency-cost trade-off over naive concatenation.
Flamingo pushed another important idea: first compress variable-length visual features with a Perceiver Resampler, then let language tokens attend to that compact bank through gated cross-attention blocks.[9]
This kind of design is especially useful when:
The core trade-off is complexity. Resamplers and cross-attention blocks preserve more structure, but they also add more moving parts than the very simple LLaVA recipe.
Qwen-VL (Qwen Vision-Language) is a good published example of a chat-oriented VLM that was explicitly built for grounding and text reading, not just caption matching. The paper highlights a visual receptor, an input-output interface, and a multi-stage training pipeline that let the model handle OCR-heavy and box-grounded tasks in addition to generic image understanding.[10]
That matters because text reading and grounding are resolution-hungry tasks.
Suppose your vision tower uses 14×14 patches. A single 336×336 crop becomes $(336/14)^2 = 24^2 = 576$ patch tokens before any projector-side compression. Tile one document page into four such crops and you're already above 2,300 visual tokens before the model even reads the user's question.
In an autoregressive VLM, those visual tokens hurt twice: they make prefill slower, and they enlarge the KV cache that must remain resident during decode. PagedAttention helps the serving stack pack and recycle that memory more efficiently, but it doesn't reduce the number of visual tokens you created in the first place.[11]
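A back-of-the-envelope estimate makes the cache cost tangible. The sketch below assumes a hypothetical decoder shape (32 layers, 8 key-value heads, head dimension 128, 2-byte values) and counts only the keys and values for the visual tokens of one sequence; substitute a real model's configuration before drawing conclusions:

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Bytes of cached keys plus values for `tokens` positions of one sequence."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Hypothetical decoder shape, used only for illustration.
config = {"layers": 32, "kv_heads": 8, "head_dim": 128}

for label, visual_tokens in [("one_crop", 576), ("four_crops", 2304), ("compressed", 64)]:
    mib = kv_cache_bytes(visual_tokens, **config) / 2**20
    print(f"{label}: {visual_tokens} visual tokens -> {mib:.0f} MiB of KV cache")
```

The cost scales linearly with visual token count and is paid per concurrent request, so it limits batch size as well as latency.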
The connector decides how many encoder features become autoregressive prefix tokens. The example below compares raw tiles with a fixed-size compression path without claiming that compression preserves enough OCR evidence.

```python
def prefix_tokens(text_tokens: int, crops: int, patch_tokens_per_crop: int, compressed_tokens: int | None) -> int:
    visual_tokens = crops * patch_tokens_per_crop
    if compressed_tokens is not None:
        visual_tokens = compressed_tokens
    return text_tokens + visual_tokens

raw_prefix = prefix_tokens(text_tokens=180, crops=4, patch_tokens_per_crop=576, compressed_tokens=None)
compressed_prefix = prefix_tokens(text_tokens=180, crops=4, patch_tokens_per_crop=576, compressed_tokens=64)

print("raw_prefix_tokens:", raw_prefix)
print("compressed_prefix_tokens:", compressed_prefix)
print("tokens_avoided_if_quality_holds:", raw_prefix - compressed_prefix)
```

```text
raw_prefix_tokens: 2484
compressed_prefix_tokens: 244
tokens_avoided_if_quality_holds: 2240
```

Production systems usually respond in one of three ways:
When a model performs well on documents, inspect the full evidence path: resolution, tiling, OCR, connector compression, and the model's measured failure slices. Published designs such as BLIP-2's Q-Former and Flamingo's resampler show ways to bound or mediate visual features before language reasoning.[8][9]
The component between the vision encoder and the LLM controls what evidence reaches generation and how many visual tokens the language path must handle.
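Common options include a simple learned projector, a fixed-query module such as a Q-Former, and a resampler paired with cross-attention.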
An overcompressed connector can remove evidence needed for OCR or grounding. A large or expensive connector can increase latency or language-prefix cost. Both outcomes must be measured by task slice.
That's why there's no single best adapter. The right design depends on whether you care more about:
Closed vendors may publish capability demos and benchmark numbers without enough architecture detail to justify internals claims.
Treat the model as an evaluated dependency: measure supported input resolution, visual-token billing if exposed, latency by image count, OCR and grounding slices, refusal behavior, and output contract. Do not infer a projector type, token budget, or training curriculum unless the provider publishes it.
Evaluating a VLM requires more than one benchmark because different tests stress very different abilities.
| Benchmark | Focus | Metric | What it tells you |
|---|---|---|---|
| ImageNet | Zero-shot classification | Top-1 Accuracy | Whether the model recognizes common visual categories |
| COCO (Common Objects in Context) Captions | Captioning | CIDEr, BLEU | Whether the generated text overlaps with human references |
| VQAv2 | Visual question answering | Accuracy | Whether the model can answer grounded questions about an image |
| TextVQA | OCR + reasoning | Accuracy | Whether the model can read and reason over embedded text |
| MMMU | Expert multimodal reasoning | Accuracy | Whether it can combine specialist knowledge and visual evidence |
| MMBench | Prompt variation | Score | Whether the model is stable under option ordering and prompt variations |
No single metric is enough.
That's why serious evaluation needs a benchmark mix, not a single scoreboard.
Aggregate scores can also hide the exact slice that blocks launch. Keep capability slices separate and require task-specific pass bars.

```python
results = {
    "product_retrieval_recall_at_10": (0.94, 0.90),
    "shipping_label_ocr_accuracy": (0.71, 0.92),
    "damage_box_grounding_iou": (0.76, 0.70),
}

failed = [
    metric
    for metric, (measured, required) in results.items()
    if measured < required
]

print("launch_ready:", not failed)
print("failed_slices:", failed)
```

```text
launch_ready: False
failed_slices: ['shipping_label_ocr_accuracy']
```

Document AI is fundamentally a token-budget problem. Dense documents contain tiny text, tables, figures, and layout cues that don't survive aggressive resizing.
In practice, production systems usually combine some mix of:
The engineering goal isn't "feed the page in at maximum resolution." It's "preserve the evidence that matters while keeping the token budget survivable."
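One way to reason about that trade-off is to compute the tile grid and its token cost before sending anything to the model. The sketch below is an assumption-laden illustration: `tile_boxes` and `visual_token_budget` are hypothetical helpers, tiles are square with an optional pixel overlap, and the last row and column are shifted inward so every tile stays on the page.

```python
import math

def tile_boxes(page_width: int, page_height: int, tile_size: int, overlap: int = 0) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes that cover the whole page."""
    stride = tile_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile_size")
    cols = max(1, math.ceil((page_width - overlap) / stride))
    rows = max(1, math.ceil((page_height - overlap) / stride))
    boxes = []
    for row in range(rows):
        for col in range(cols):
            # Clamp the final tiles so they end at the page edge.
            left = min(col * stride, max(0, page_width - tile_size))
            top = min(row * stride, max(0, page_height - tile_size))
            boxes.append((left, top, min(left + tile_size, page_width), min(top + tile_size, page_height)))
    return boxes

def visual_token_budget(boxes: list[tuple[int, int, int, int]], tile_size: int, patch_size: int) -> int:
    """Tokens if every tile is encoded at tile_size with square patches."""
    return len(boxes) * (tile_size // patch_size) ** 2

boxes = tile_boxes(page_width=672, page_height=672, tile_size=336)
print("tiles:", len(boxes))
print("visual tokens:", visual_token_budget(boxes, tile_size=336, patch_size=14))
```

Running the same calculation at several tile sizes and overlaps gives a token budget to compare against measured OCR accuracy on the slices that matter.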
For image search, a generative VLM is usually the wrong first tool.
A dual-encoder model like CLIP or SigLIP is better because you can:
That supports an indexed first stage whose latency and cost can be measured independently. Generative VLMs are candidates for reranking or explanation after retrieval when those added capabilities justify their measured cost.

```python
def select_path(task: str) -> list[str]:
    if task == "retrieve_similar_products":
        return ["dual_encoder", "vector_index"]
    if task == "answer_from_label_text":
        return ["ocr", "generative_vlm", "citation_check"]
    if task == "click_refund_button":
        return ["grounding_model", "action_policy", "human_confirm"]
    raise ValueError(task)

for task in ["retrieve_similar_products", "answer_from_label_text", "click_refund_button"]:
    print(task, "->", " + ".join(select_path(task)))
```

```text
retrieve_similar_products -> dual_encoder + vector_index
answer_from_label_text -> ocr + generative_vlm + citation_check
click_refund_button -> grounding_model + action_policy + human_confirm
```

Visual agents need more than a captioning model. They need grounding.
It isn't enough to answer "the blue button is near the top-right." The system often has to emit an action like:
```text
(x, y)
```

That usually requires a grounding-aware model or a separate grounding component on top of the core VLM, plus strong safety checks around actions.
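As a rough sketch of what sits between a grounding output and an action, the snippet below turns a box into a click point and refuses to act when the box is malformed or the confidence is low. The `GroundedBox` format (normalized corner coordinates plus a confidence), the `to_click` helper, and the 0.8 threshold are all assumptions made for illustration; real grounding models and agent frameworks define their own formats and policies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedBox:
    """Hypothetical grounding output: normalized corners plus a confidence."""
    x0: float
    y0: float
    x1: float
    y1: float
    confidence: float


def to_click(box: GroundedBox, screen_w: int, screen_h: int,
             min_confidence: float = 0.8) -> Optional[tuple[int, int]]:
    """Return the pixel center of the box, or None if the action is unsafe."""
    in_bounds = (0.0 <= box.x0 < box.x1 <= 1.0) and (0.0 <= box.y0 < box.y1 <= 1.0)
    if not in_bounds or box.confidence < min_confidence:
        return None  # escalate to a human instead of guessing
    cx = (box.x0 + box.x1) / 2 * screen_w
    cy = (box.y0 + box.y1) / 2 * screen_h
    return round(cx), round(cy)


refund_button = GroundedBox(0.80, 0.10, 0.90, 0.20, confidence=0.93)
uncertain = GroundedBox(0.80, 0.10, 0.90, 0.20, confidence=0.41)

print(to_click(refund_button, 1920, 1080))
print(to_click(uncertain, 1920, 1080))
```

Returning `None` rather than a best-guess coordinate keeps the decision to act outside the perception step, which is where a human-confirmation gate would plug in.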
Use these checkpoints to test core mechanics before moving on.
By the end of this chapter, you should be able to defend these design choices:
| Decision | Pass bar |
|---|---|
| Retrieval vs generation | You can explain when CLIP or SigLIP is enough, and when a generative VLM is required because the task needs open-ended answers, OCR-heavy reasoning, or grounded actions. |
| Backbone choice | You can compare ViT and ResNet trade-offs around scale, inductive bias, and tokenization, and explain why patch size changes prefill and KV-cache budget. |
| Spatial failure modes | You can predict when pooled image embeddings will fail on counting, localization, OCR, or grounding before a team ships the wrong model class. |
| Connector choice | You can defend when a simple projector is enough and when a Q-Former, resampler, or other compression step is needed before the LLM. |
| Document strategy | You can choose between higher resolution, tiling, OCR preprocessing, and crop-and-compress pipelines based on what evidence must survive. |
| Serving budget | You can estimate how visual token count affects prefill latency, concurrency, and KV-cache pressure, not only encoder FLOPs. |
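
For the serving-budget row, a back-of-the-envelope estimate is often enough to start. The sketch below assumes a standard decoder that caches one key and one value vector per layer, per KV head, per token, stored in 16-bit precision, with no paging, quantization, or visual-token compression. The decoder configuration and the 200-token text prompt are hypothetical and not taken from any model card.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical decoder configuration, for illustration only.
config = dict(layers=32, kv_heads=8, head_dim=128)

text_tokens = 200
for visual in (256, 1024, 12288):
    total = kv_cache_bytes(visual + text_tokens, **config)
    print(f"{visual:>6} visual tokens -> {total / 2**20:.1f} MiB of KV cache per request")
```

With this configuration each token costs 128 KiB of cache, so the visual prefix, not the text prompt, dominates per-request memory once images are tiled. Prefill latency grows with the same token count and should be measured alongside it.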
Symptom: Teams ask for counts, boxes, or grounded actions from one global embedding. Cause: CLIP was trained for global alignment, not dense spatial supervision. Fix: Use detectors, grounding-aware VLMs, or region features when localization or counting matters.
Symptom: Similarity scores are wired into a task that needs open-ended explanations or grounded dialogue. Cause: Image-text matching was confused with multimodal generation. Fix: Keep CLIP or SigLIP for retrieval and zero-shot scoring, and use a generative VLM when you need open-ended answers.
Symptom: Document quality rises a little, but latency and memory spike hard. Cause: Resolution was increased without checking token budget, tiling policy, or compression. Fix: Measure visual token count, tile selectively, preserve only the evidence that matters, and compress before the LLM when possible.
Symptom: Offline profiling looks cheap, but production decode concurrency collapses. Cause: The team counted encoder FLOPs but ignored prefill latency and KV-cache growth from large visual prefixes. Fix: Track prefill cost and KV pressure together with encoder cost before picking image resolution or crop count.
Symptom: Contrastive training underperforms on visually similar products or captions that describe the same item in different words. Cause: Some in-batch negatives are false negatives, so the loss pushes semantically related pairs apart. Fix: Curate data carefully, use richer captioning, and treat contrastive batch design as a modeling choice rather than a harmless training detail.
Symptom: Prompt wording swings predictions or specialist categories collapse together. Cause: Generic templates do not provide enough domain signal. Fix: Prompt-ensemble, calibrate on validation data, and fine-tune or probe when domain precision matters.
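Prompt ensembling, mentioned in the last fix, can be sketched as averaging the text embeddings of several templates for the same class and renormalizing the result. The encoder here is a deterministic stand-in (`fake_embed_text`), not a real model, and all templates after the first are hypothetical examples for the warehouse setting. In practice you would pass your actual text encoder as the `embed` callable.

```python
import hashlib
from typing import Callable

import numpy as np

TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a warehouse photo showing a {}.",
]


def fake_embed_text(text: str, dim: int = 32) -> np.ndarray:
    """Stand-in for a text encoder: a deterministic pseudo-random unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    vec = np.random.default_rng(seed).normal(size=dim)
    return vec / np.linalg.norm(vec)


def ensemble_embedding(label: str, embed: Callable[[str], np.ndarray]) -> np.ndarray:
    """Average the unit-norm embeddings of several prompts, then renormalize."""
    vecs = np.stack([embed(template.format(label)) for template in TEMPLATES])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)


class_vec = ensemble_embedding("crushed box corner", fake_embed_text)
print(class_vec.shape, round(float(np.linalg.norm(class_vec)), 6))
```

The class vector is computed once per label and reused for every image, so ensembling adds no cost at query time. Whether it helps on a given domain should still be checked on validation data.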
Learning Transferable Visual Models From Natural Language Supervision.
Radford, A., et al. · 2021 · ICML 2021
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
Liu, S., et al. · 2023 · arXiv preprint
Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications.
Agarwal, S., et al. · 2021 · arXiv preprint
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Dosovitskiy, A., et al. · 2020 · ICLR 2021
Sigmoid Loss for Language Image Pre-training.
Zhai, X., et al. · 2023 · ICCV 2023
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features.
Tschannen, M., et al. · 2025 · arXiv preprint
Visual Instruction Tuning.
Liu, H., et al. · 2023 · NeurIPS 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.
Li, J., et al. · 2023 · ICML 2023
Flamingo: a Visual Language Model for Few-Shot Learning.
Alayrac, J.-B., et al. · 2022 · NeurIPS 2022
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.
Bai, J., et al. · 2023 · arXiv preprint
Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
Honeybee: Locality-enhanced Projector for Multimodal LLM.
Cha, J., et al. · 2023 · arXiv preprint