Vision-Language Models & CLIP

Design a visual inspection and search product while learning CLIP, zero-shot classification, visual token budgets, grounding, and modern VLM connectors.

LLM-powered search turned retrieved evidence into cited answers. Vision-language models add visual evidence to that same loop: images, diagrams, screenshots, document pages, and asset photos become searchable and reasoned about alongside text.

Vision-language models connect images and text so systems can search, classify, and reason across both modalities. Start with CLIP-style contrastive learning, then expand to production uses and failure modes.

A data-center inspection system shouldn't learn rack hazards only from a fixed list of labels like rack_hazard or safe_rack. It needs real photos paired with descriptions: "blocked vent on rack door," "asset tag torn near barcode," or "warning label visible on power unit." The system learns the concept from images and language together.

Product brief: data-center visual inspection

Design a service that lets technicians upload rack photos, search past inspections in natural language, and route uncertain hazards to a reviewer. CLIP or SigLIP powers first-pass retrieval; OCR, grounding, or a generative VLM handles questions that need label text, regions, or explanations.

Requirements and API

POST /v1/inspection-images accepts an immutable image reference, site, rack, capture time, and idempotency key. It returns image_id, ingestion state, and encoder version.
POST /v1/inspection-search accepts query, authorized site filters, limit, and an optional minimum reviewed score. It returns ranked image_id values, similarity scores, thumbnails, and review state. Scores are ranking signals, not probabilities.
POST /v1/inspection-analysis accepts image_id, question, and requested evidence type. It returns answer, OCR spans or grounded regions, model versions, and answered, abstained, or review.
Search targets p95 below 300ms. Analysis targets p95 below 3 seconds. Access filters apply before retrieval, and a hazard never auto-closes from an unreviewed model result.

Data flow and sizing

Ingestion follows upload -> malware and format checks -> metadata authorization -> image encoder -> normalized embedding -> versioned vector index. Search follows authorized query -> text encoder -> site-filtered nearest-neighbor search -> optional reranker -> response. Analysis adds crop or tile policy -> OCR or VLM -> grounding check -> review policy after the selected image is authorized.
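The search path above can be sketched as a filter-then-rank loop. This is a minimal illustration with hand-written 3-dimensional vectors. `IndexedImage`, `search`, and the site names are hypothetical, and a production service would likely use an approximate nearest-neighbor index rather than a linear scan.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexedImage:
    image_id: str
    site: str
    embedding: tuple[float, ...]  # assumed L2-normalized at ingestion

def normalize(vector):
    length = math.sqrt(sum(value * value for value in vector))
    return tuple(value / length for value in vector)

def search(query_embedding, index, authorized_sites, limit=2):
    query = normalize(query_embedding)
    # Apply the access filter first so unauthorized images are never scored.
    candidates = [item for item in index if item.site in authorized_sites]
    scored = [
        (sum(q * e for q, e in zip(query, item.embedding)), item.image_id)
        for item in candidates
    ]
    scored.sort(reverse=True)
    # Scores are ranking signals, not probabilities.
    return [(image_id, round(score, 3)) for score, image_id in scored[:limit]]

index = [
    IndexedImage("img-1", "site-a", normalize((0.9, 0.1, 0.0))),
    IndexedImage("img-2", "site-b", normalize((0.95, 0.05, 0.0))),
    IndexedImage("img-3", "site-a", normalize((0.1, 0.9, 0.1))),
]

results = search((1.0, 0.0, 0.0), index, authorized_sites={"site-a"})
print(results)
```

Here `img-2` would score highest on similarity alone, but it never reaches the ranking step because the caller isn't authorized for `site-b`.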

For 10 million images and 768-dimensional FP16 embeddings, raw vectors need 10,000,000 x 768 x 2 = 15.36 GB before index structures, metadata, replicas, and old encoder versions. Keep a 100-QPS retrieval pool separate from a 5-QPS generative-analysis pool so expensive VLM calls can't consume search capacity.
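The sizing arithmetic above is easy to keep as a small function so it can be rerun when the corpus or embedding width changes. The replica and generation multipliers below are hypothetical placeholders, not recommendations.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int) -> int:
    # Raw storage only: no index structures or metadata.
    return num_vectors * dims * bytes_per_value

fp16_bytes = raw_vector_bytes(10_000_000, 768, 2)
print(f"FP16 raw vectors: {fp16_bytes / 1e9:.2f} GB")

# Hypothetical multipliers: 2 replicas, 2 encoder generations live during a backfill.
replicas, generations = 2, 2
print(f"with replicas and generations: {fp16_bytes * replicas * generations / 1e9:.2f} GB")
```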

Recovery, rollout, and evaluation

Ingestion is idempotent by content hash and source version. Failed images enter a retry or dead-letter state without becoming searchable. An encoder upgrade writes a new index generation; reads stay on the old complete generation until backfill and validation finish. An atomic alias switch then promotes the new generation, and keeping the old generation intact makes rollback a second alias switch. If OCR or grounding fails, return abstained or review rather than a caption presented as verified evidence.

Roll out with a frozen, technician-reviewed set, shadow traffic, a site-limited canary, and staged index switching. Block expansion when asset retrieval recall@10, hazard false-negative rate, asset-tag OCR accuracy, grounding IoU, p95 latency, review rate, or cross-site access tests miss their route-specific bars. Monitor drift by site, camera, lighting, rack type, and prompt template instead of relying on one aggregate benchmark.

Many supervised computer vision systems expose a fixed label inventory: recognizing a new category generally requires labeled examples and another training step. CLIP (Contrastive Language-Image Pre-training) showed that image-text matching at web scale can produce representations that transfer to new text-defined classes without training a new classifier head.^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020} That made practical open-vocabulary retrieval possible and provided useful vision features for later generative systems.

Conceptual foundation

A quick refresher: embeddings and similarity

Before CLIP, recall what embeddings are: high-dimensional vectors that compress meaning into a list of numbers. A sentence like "blocked vent on rack door" becomes a vector, perhaps 768 numbers long. An image of that same blocked rack vent becomes another 768-number vector.

If the model has done its job, those two vectors point in roughly the same direction in space. We measure that alignment with cosine similarity, a score between -1 and 1. Within one trained model, a higher score ranks an image-text pair as more aligned than a lower-scoring candidate. It isn't a calibrated probability and doesn't prove that the pair means exactly the same thing. That's the geometric heart of CLIP: it learns to rank matching image-text pairs above mismatches.

From fixed labels to open vocabulary

A fixed label list can teach that one class is rack_hazard, but it can't show how blocked airflow appears across lighting, angles, cables, and rack labels.

Learning from many real inspection images, UI screenshots, diagrams, and captions gives Vision-Language Models (VLMs) a broader signal. Instead of memorizing fixed categories, they learn to align the visual concept of an object with its natural language description.

This shared image-text space allows a system to score text prompts that were not fixed training labels. Transfer still depends on whether pre-training learned the relevant visual evidence and whether the new domain resembles the data it saw. A data-center team must validate prompts on its own racks, lighting, cabling, and review policy before using scores operationally.

Shared image-text space

CLIP creates a shared embedding space for vision and language. One side starts with pixels, the other with words. The two sides don't share a raw input format, so the alignment has to be learned: the original model was trained on 400 million image-text pairs.^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020} Over training, matched images and captions receive higher similarity than mismatches. If its training produced useful rack-safety features, it may rank a new prompt such as "blocked airflow behind rack panel" well even when that exact label was not a training class. That's a transfer hypothesis to evaluate, not an automatic guarantee.

CLIP: contrastive language-image pre-training

Architecture^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020}

CLIP jointly trains an image encoder and text encoder to align visual and textual representations in a shared embedding space. In the original paper, the vision side was either a ResNet (Residual Network) or a ViT (Vision Transformer), while the text side was a Transformer that produced one representation for each caption.^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020} The contrastive objective pulls matching pairs together and pushes mismatches apart, which is what makes zero-shot transfer and cross-modal retrieval work.

CLIP shared-space trace where an image lane and caption lane both project into one embedding map, and the caption describing a rack hazard lands closest to the image point while a safe-rack caption point stays farther away. — CLIP keeps encoders separate, then lets shared-space distance rank the right caption nearest to the image.

The illustration above shows the core CLIP idea: keep the encoders separate, normalize both outputs, then judge whether image and text land near each other in one shared space.

Training objective

The training process resembles matching data-center photos to their correct captions. For every image, the model must pick the correct caption out of thousands of incorrect ones.

A tiny batch makes the pattern easier to see with real numbers. Suppose you have 3 images and 3 captions:

Image	Caption
A	"cable label torn near port"
B	"asset tag partly covered by tape"
C	"warning label visible on power unit"

There are $3 \times 3 = 9$ possible pairings. Only the 3 diagonal pairs (A+A, B+B, C+C) are correct. The other 6 are mismatches. CLIP's job is to make the similarity scores for the diagonal pairs as high as possible, while keeping the off-diagonal scores low.

CLIP contrastive training matrix for three paired images and captions, where the three diagonal image-caption pairs are positives and the six off-diagonal pairs are negatives pushed apart in both retrieval directions. — The diagonal pairs stay close. Every off-diagonal pair becomes a negative for both encoders.

In a real training run, the batch size is much larger (32,768 in the original paper^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020}), so each image sees thousands of negative examples. But the logic is the same: maximize the similarity of matching pairs and minimize it for non-matching pairs using a symmetric contrastive loss that's often described as InfoNCE-style (Information Noise Contrastive Estimation).

Let $I_i$ and $T_j$ be the L2-normalized image and text embeddings. CLIP computes logits

$$z_{ij} = \exp(t) \cdot I_i^\top T_j$$

where $t$ is a learned logit-scale parameter. You'll also see this written as division by a temperature $\tau$, where $\tau = 1 / \exp(t)$. The symmetric objective is:

$$\mathcal{L}_{I \rightarrow T} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(z_{ii})}{\sum_{j=1}^{N}\exp(z_{ij})}$$

$$\mathcal{L}_{T \rightarrow I} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(z_{ii})}{\sum_{j=1}^{N}\exp(z_{ji})}$$

$$\mathcal{L} = \frac{\mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}}{2}$$

The first equation asks: "Given image $i$, what's the probability that caption $i$ is the right one?" It compares the matching score $\exp(z_{ii})$ against the sum of all scores for that image across every caption in the batch. From the text side, the second equation repeats the same comparison in the other direction. The final loss averages both directions so neither encoder dominates.

This implementation uses a simplified fixed-temperature version of that loss. It keeps the math readable while matching the same symmetric training pattern.

training-objective.py

import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def cross_entropy(logits: list[float], target: int) -> float:
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -logits[target] + log_sum_exp

def clip_loss(
    image_embeds: list[list[float]],
    text_embeds: list[list[float]],
    temperature: float = 0.2,
) -> tuple[float, list[list[float]]]:
    # L2-normalize both sides so each dot product is a cosine similarity.
    images = [normalize(vector) for vector in image_embeds]
    texts = [normalize(vector) for vector in text_embeds]
    # logits[i][j] scores image i against caption j; a smaller temperature sharpens the softmax.
    logits = [
        [dot(image, text) / temperature for text in texts]
        for image in images
    ]

    # Image-to-text: each row should pick its own caption (the diagonal entry).
    loss_i2t = sum(cross_entropy(row, i) for i, row in enumerate(logits)) / len(logits)
    # Text-to-image: transpose so each caption picks its own image.
    columns = [list(column) for column in zip(*logits, strict=True)]
    loss_t2i = sum(cross_entropy(column, i) for i, column in enumerate(columns)) / len(columns)
    return (loss_i2t + loss_t2i) / 2, logits

image_embeds = [
    [0.95, 0.05, 0.00],
    [0.05, 0.92, 0.03],
    [0.00, 0.07, 0.94],
]
text_embeds = [
    [0.91, 0.08, 0.01],
    [0.06, 0.89, 0.04],
    [0.02, 0.09, 0.90],
]

loss, logits = clip_loss(image_embeds, text_embeds)
print("logit matrix:")
for row in logits:
    print([round(value, 2) for value in row])
print(f"symmetric_loss: {loss:.3f}")

Output

logit matrix:
[5.0, 0.6, 0.14]
[0.71, 5.0, 0.66]
[0.09, 0.59, 5.0]
symmetric_loss: 0.022

Low loss means the diagonal image-caption pairs dominate the off-diagonal mismatches. A random or poorly aligned batch would have a much flatter matrix and a higher loss.
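To make "flatter matrix, higher loss" concrete: when every logit in a row is equal, the softmax assigns $1/N$ to each caption and the cross-entropy is exactly $\ln N$. The sketch below compares a perfectly flat 3x3 matrix with the rounded logit matrix printed above. The `symmetric_loss` helper is a hypothetical name that re-implements the same loss directly on logits.

```python
import math

def cross_entropy(logits: list[float], target: int) -> float:
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

def symmetric_loss(logits: list[list[float]]) -> float:
    rows = sum(cross_entropy(row, i) for i, row in enumerate(logits)) / len(logits)
    columns = [list(column) for column in zip(*logits)]
    cols = sum(cross_entropy(column, i) for i, column in enumerate(columns)) / len(columns)
    return (rows + cols) / 2

# Every pairing scores the same: the model has no preference for the diagonal.
flat = [[1.0] * 3 for _ in range(3)]
# Rounded logits from the aligned example above.
aligned = [[5.0, 0.6, 0.14], [0.71, 5.0, 0.66], [0.09, 0.59, 5.0]]

flat_loss = symmetric_loss(flat)
aligned_loss = symmetric_loss(aligned)
print(f"flat: {flat_loss:.4f} (ln 3 = {math.log(3):.4f})")
print(f"aligned: {aligned_loss:.4f}")
```

The $\ln N$ value is a useful sanity check early in training: a loss stuck near it suggests the encoders haven't yet learned to separate matches from mismatches.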

The softmax in each direction normalizes over a full row or column of the $N \times N$ similarity matrix, so every other example in the batch acts as a negative. That is why CLIP benefits from very large batch sizes (32,768 in the original paper^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020}): each example sees many in-batch negatives. In practice, that pushes training toward large distributed runs with expensive cross-device synchronization.
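A back-of-envelope calculation shows the scale involved. The FP32 storage and the single fully materialized matrix are simplifying assumptions for illustration; real training code shards this work across devices.

```python
batch_size = 32_768

# Every image is scored against every caption in the batch.
pairs = batch_size * batch_size
negatives_per_image = batch_size - 1

# Assumption: one FP32 value (4 bytes) per logit, held in a single matrix.
fp32_logit_bytes = pairs * 4

print(f"pairs: {pairs:,}")
print(f"negatives per image: {negatives_per_image:,}")
print(f"logit matrix: {fp32_logit_bytes / 2**30:.2f} GiB")
```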

An off-diagonal pair is treated as negative by this objective, but it isn't automatically a clean semantic negative. Two photographs of the same rack hazard may be valid matches even if they arrived as separate labeled pairs. A batch-construction audit catches obvious collisions before the loss pushes them apart.

audit-false-negatives.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Pair:
    image_id: str
    caption: str
    concept: str

pairs = [
    Pair("rack-front", "blocked vent on rack door", "blocked-airflow"),
    Pair("rack-side", "airflow blocked by loose cabling", "blocked-airflow"),
    Pair("asset-tag", "barcode obscured by cable tie", "blocked-barcode"),
]

potential_false_negatives = [
    (left.image_id, right.image_id, left.concept)
    for index, left in enumerate(pairs)
    for right in pairs[index + 1:]
    if left.concept == right.concept
]

print("pairs_treated_as_negative_but_related:", potential_false_negatives)

Output

pairs_treated_as_negative_but_related: [('rack-front', 'rack-side', 'blocked-airflow')]

Training data

The success of CLIP is heavily dependent on the sheer scale and diversity of its pre-training data. The original paper doesn't release or assign a canonical public name to the full corpus. It describes training on 400 million (image, text) pairs collected from the internet.^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020}

Unlike prior vision datasets (e.g., ImageNet) that relied on manual class labels, CLIP learns from natural language supervision attached to images on the web. The model trains against captions and descriptive text rather than a closed vocabulary of hand-assigned categories. That broader supervision is what lets CLIP represent visual concepts outside any fixed label set and transfer zero-shot to new text-defined classes.

Zero-shot classification

CLIP enables classification without any task-specific training. Instead of a final classification layer with fixed weights, the model computes the similarity between the image and the text descriptions of possible classes.

Analogy: instead of teaching an inspection station that "Class 42 means blocked airflow," you give it a catalog of plain-language descriptions. For each rack photo, it scores which description fits best.

Here's a concrete example. Suppose you have a data-center rack photo and you want to classify it into one of three categories:

Class	Prompt	Similarity
rack hazard	"a photo of a rack hazard"	0.87
safe rack	"a photo of a safe rack"	0.12
blocked asset tag	"a photo of a blocked asset tag"	0.41

The highest score is the model's top candidate. In a real decision path, it should win only if validation has established an acceptable score or margin policy; ambiguous images should abstain or route to review. You did not retrain the model, but you still need to calibrate the decision rule.

The practical breakthrough isn't classification alone; it's a text-defined candidate vocabulary. A data-center inspection system can test a prompt such as "cable label partially obscured by a tie" without fitting a new label head. Whether the prompt is reliable on real rack photos still has to be measured.

CLIP task map where one shared image-text embedding space fans out into zero-shot classification, image search, probing, and reranking, plus a rack-hazard example where candidate prompts score differently against the same image vector. — CLIP keeps one shared embedding space; downstream tasks mostly change scoring, indexing, or light heads around it.

This function shows the standard zero-shot pattern: prompt the candidate classes in natural language, encode image and text, normalize both, then score them with a scaled dot product. This standalone version uses small hand-written vectors so the inference mechanics are visible.

zero-shot-classification.py

import json
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right, strict=True))

def softmax(values: list[float]) -> list[float]:
    max_value = max(values)
    exp_values = [math.exp(value - max_value) for value in values]
    total = sum(exp_values)
    return [value / total for value in exp_values]

def zero_shot_classify(
    image_embed: list[float],
    class_names: list[str],
    text_embeds: dict[str, list[float]],
    logit_scale: float = 12.0,
) -> dict[str, object]:
    image = normalize(image_embed)
    prompts = [
        f"a photo of {'an' if name[0] in 'aeiou' else 'a'} {name}"
        for name in class_names
    ]
    logits = [
        logit_scale * dot(image, normalize(text_embeds[prompt]))
        for prompt in prompts
    ]
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_class": class_names[best_index],
        "scores": {
            class_name: round(probability, 3)
            for class_name, probability in zip(class_names, probs, strict=True)
        },
    }

text_embeds = {
    "a photo of a rack hazard": [0.92, 0.20, 0.05],
    "a photo of a safe rack": [0.18, 0.88, 0.08],
    "a photo of a blocked asset tag": [0.48, 0.31, 0.81],
}

result = zero_shot_classify(
    image_embed=[0.89, 0.24, 0.08],
    class_names=["rack hazard", "safe rack", "blocked asset tag"],
    text_embeds=text_embeds,
)
print(json.dumps(result, indent=2))

Output

{
  "predicted_class": "rack hazard",
  "scores": {
    "rack hazard": 0.988,
    "safe rack": 0.001,
    "blocked asset tag": 0.01
  }
}

Prompt engineering matters

The format of the text prompt affects performance. "A photo of a {class}" typically works better than just "{class}".^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020}

Ensembling multiple prompts can improve the stability of zero-shot classification. In a real pipeline, each template is filled with the class name, tokenized, and passed through the text encoder; the resulting embeddings are normalized, averaged, and normalized again. The snippet below uses hardcoded vectors as stand-ins for those text-encoder outputs and produces a single vector that is less sensitive to the wording of any one template.

prompt-engineering-matters.py

import json
import math

def normalize(vector: list[float]) -> list[float]:
    length = math.sqrt(sum(value * value for value in vector))
    return [value / length for value in vector]

def mean_vector(vectors: list[list[float]]) -> list[float]:
    return [
        sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(len(vectors[0]))
    ]

prompt_vectors = {
    "a photo of a rack hazard": [0.92, 0.20, 0.05],
    "a picture of a rack hazard": [0.88, 0.24, 0.07],
    "an image showing rack hazard": [0.86, 0.28, 0.08],
    "a rack hazard in the wild": [0.82, 0.31, 0.12],
}

ensemble = normalize(mean_vector([
    normalize(vector)
    for vector in prompt_vectors.values()
]))

print(json.dumps({
    "class": "rack hazard",
    "templates": len(prompt_vectors),
    "ensemble_embedding": [round(value, 3) for value in ensemble],
}, indent=2))

Output

{
  "class": "rack hazard",
  "templates": 4,
  "ensemble_embedding": [
    0.955,
    0.284,
    0.088
  ]
}

Prompt ensembling reduces wording sensitivity; it doesn't create a deployment threshold. Add an abstention rule using held-out data, and make uncertain classifications visible to downstream policy.

zero-shot-abstention-policy.py

def decide(scores: dict[str, float], min_score: float, min_margin: float) -> str:
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[:2]
    if best_score < min_score or best_score - second_score < min_margin:
        return "review"
    return best_label

cases = {
    "clear_hazard": {"rack_hazard": 0.87, "safe_rack": 0.12, "tag_blocked": 0.31},
    "ambiguous_panel": {"rack_hazard": 0.56, "safe_rack": 0.53, "tag_blocked": 0.18},
}

for name, scores in cases.items():
    print(name, decide(scores, min_score=0.60, min_margin=0.10))

Output

clear_hazard rack_hazard
ambiguous_panel review

Limitations of CLIP

A common deployment mistake is using CLIP's global similarity score to count assets from data-center rack photos or localize blocked vents. Standard CLIP inference exposes a pooled image representation optimized for image-text matching, not calibrated counts or boxes. For counting and grounded localization, validate a detector such as Grounding DINO^{[2]Reference 2Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.https://arxiv.org/abs/2303.05499} or a grounding-aware VLM instead of interpreting global similarity as a region prediction.

CLIP was an important advance in vision-language alignment, but its contrastive objective and pooled scoring interface impose specific constraints. The limits below are worth understanding before using it in production systems.

No generation. CLIP can only classify or retrieve; it can't generate text captions or new images on its own.

Symptom: You ask CLIP to "describe this image" and it gives you a similarity score instead of a sentence.
Cause: CLIP is a dual-encoder alignment model, not a decoder. It has no language-generation head.
Fix: Use a generative VLM like LLaVA (Large Language-and-Vision Assistant) for captioning, or use CLIP for retrieval and a separate LLM for generation.

Compositional understanding can be weak. Compositional benchmarks report CLIP failures when the same concepts are rearranged into different relations, such as "a red toolbox beside a blue safety cone" versus "a blue toolbox beside a red safety cone."

Symptom: Zero-shot accuracy drops sharply on tasks that require understanding spatial relationships between multiple objects.
Cause: The training objective only asks "does this caption match this image overall?" It doesn't supervise exact object positions or relations.
Fix: For spatial reasoning, use a grounding-aware model or pair CLIP with an open-set detector such as Grounding DINO.

Fine-grained tasks need separate validation. Standard pooled scoring is a poor contract for counting ("how many racks") or exact spatial reasoning because it doesn't output region-level evidence.

Symptom: A team maps a top similarity score to "three rack hazards," but a labeled audit shows five.
Cause: The deployed scoring path returns global alignment rather than count-supervised region predictions.
Fix: Use region-based models or add an explicit counting module.

Text in images is brittle. CLIP shows non-trivial OCR transfer, especially on rendered text, but the paper reports highly variable performance across text domains and formats.^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020} Don't treat a pooled similarity score as a document-OCR interface.

Symptom: CLIP misses label numbers on server racks or misreads stylized logos.
Cause: Standard CLIP scoring was not trained as a reliable text-transcription contract, and OCR transfer changes with the text domain and format.
Fix: For document OCR, use specialized document models or preprocessing pipelines with explicit OCR stages.

Bias. It inherits biases from web-crawled data, sometimes associating certain visual concepts with stereotypes.^{[3]Reference 3Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications.https://arxiv.org/abs/2108.02818}

Symptom: Certain demographic groups or contexts get systematically lower similarity scores for neutral prompts.
Cause: The training data reflects internet distributions, which include skewed representations.
Fix: Audit embeddings on representative data before deployment; use debiasing techniques or domain-specific fine-tuning.

Vision encoder architectures

The vision encoder is the foundation of any VLM. Its job is to compress high-dimensional pixels into useful features. This transformation should preserve task-critical details like shape, color, and object relationships while discarding irrelevant noise. Architecture choice affects scalability and fine-grained spatial fidelity, but training data, objective, resolution, and connector design matter too.

Vision Transformer (ViT)^{[4]Reference 4An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.https://arxiv.org/abs/2010.11929}

Earlier models often used Convolutional Neural Networks (CNNs) such as ResNet, but many modern published VLMs use Vision Transformers (ViT). ViT treats an image as a sequence of patches and processes them with a standard Transformer encoder, much like Large Language Models (LLMs) process text tokens. The figure below illustrates the process: an input image is sliced into fixed-size patches, linearly projected into embeddings, and fed through a Transformer stack to produce a sequence of contextualized visual tokens.

Vision Transformer patch-token trace where a 224 by 224 image is cut into a 14 by 14 patch grid, producing 196 visual tokens before any class token or projector compression. — Patch size is a token-budget knob. A 224 by 224 image with 16 by 16 patches already produces 196 visual tokens before any language path starts.

Patch tokenization

A 224×224 image divided into 16×16 patches results in $14 \times 14 = 196$ visual tokens, plus a special [CLS] token used to aggregate the global image representation.

Patch size is a token-budget choice. The example above uses 16x16 patches; the high-resolution calculation below uses 14x14 patches. Compute the budget from the actual tower rather than copying a memorized number.

patch-token-budget.py

def patch_tokens(height: int, width: int, patch_size: int, crops: int = 1) -> int:
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size) * crops

for label, size, crops in [("thumbnail", 224, 1), ("page_crop", 336, 1), ("tiled_page", 336, 4)]:
    print(label, patch_tokens(size, size, patch_size=14, crops=crops))

Output

thumbnail 256
page_crop 576
tiled_page 2304

ViT vs. ResNet

Feature	Vision Transformer (ViT)	ResNet (CNN)
Scaling	Scales well with sufficient data; attention cost grows with patch-token count.	Strong convolutional baseline with different compute and inductive-bias trade-offs.
Receptive field	Global context from the first layer via self-attention across all patches.	Local context initially, building to global context only in deep layers.
Inductive Bias	Low (treats image as sequence of patches), requiring more data to learn structure.	High (locality and translation-equivariance biases), often requiring less data to train from scratch.
Architecture	Transformer-shaped token sequence is convenient for many connectors.	Also usable as a vision tower; connector still maps its features to the downstream task.
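The "attention cost grows with patch-token count" entry can be made concrete with a rough ratio. This sketch assumes full self-attention over all tokens in a single sequence, so the number of pairwise scores grows with the square of the token count; it ignores the MLP layers and any windowed or per-tile attention. The labels are illustrative.

```python
def attention_pairs(tokens: int) -> int:
    # Full self-attention scores every token against every token.
    return tokens * tokens

budgets = {
    "vit_224_patch16": 196,
    "vit_224_patch14": 256,
    "vit_336_patch14": 576,
    "four_336_tiles": 2304,
}

baseline = attention_pairs(budgets["vit_224_patch16"])
relative_cost = {
    label: attention_pairs(tokens) / baseline
    for label, tokens in budgets.items()
}

for label, ratio in relative_cost.items():
    print(label, budgets[label], f"{ratio:.1f}x")
```

Under this assumption, quadrupling the token count multiplies the pairwise attention work by sixteen, which is why crop and tile policy is a budget decision rather than a free quality upgrade.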

SigLIP^{[5]Reference 5Sigmoid Loss for Language Image Pre-training.https://arxiv.org/abs/2303.15343}

A scaling cost in standard CLIP training is the softmax-based contrastive loss over image-text candidates in an effective batch. Original CLIP used a batch size of 32,768, which required a distributed setup; smaller implementations can still train, with a different accuracy and negative-diversity trade-off.^{[1]Reference 1Learning Transferable Visual Models From Natural Language Supervision.https://arxiv.org/abs/2103.00020}

Instead of asking "Which of these 32,768 captions is the best match?" (a global multiple-choice question), SigLIP (Sigmoid Loss for Language-Image Pre-training) asks "Is this specific image and caption a good match, yes or no?" (a local true/false question). It replaces the softmax-based contrastive loss with a pairwise sigmoid loss:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\big(z_{ij} \cdot (t \cdot v_i^\top u_j + b)\big)$$

Here $N$ is the batch size, $z_{ij} = +1$ for matching pairs and $z_{ij} = -1$ otherwise, while $v_i$ and $u_j$ are the L2-normalized image and text embeddings. $t$ and $b$ are learned temperature and bias parameters, and $\sigma$ is the logistic sigmoid. The original paper initializes the bias to a large negative value so that the loss starts near the all-negative prior, which stabilizes early training when most pairs in a batch are mismatches.^{[5]Reference 5Sigmoid Loss for Language Image Pre-training.https://arxiv.org/abs/2303.15343} Rather than forcing a single matching caption to compete against all others in the batch, this equation independently evaluates each pairing, treating it as a separate binary classification problem.

Advantages over CLIP

Decoupled computation: No global normalization is needed. Each pair is scored independently, allowing the similarity matrix to be computed in chunks across devices.^{[5]Reference 5Sigmoid Loss for Language Image Pre-training.https://arxiv.org/abs/2303.15343}
Efficiency: The sigmoid loss removes global softmax normalization. The SigLIP paper reports stronger results than its softmax baseline at studied smaller batch sizes and diminishing returns from pushing batch size much further, rather than a blanket hardware guarantee.^{[5]Reference 5Sigmoid Loss for Language Image Pre-training.https://arxiv.org/abs/2303.15343}
Memory and partitioning: Because no term depends on a batch-wide normalizer, a device can score one block of the similarity matrix at a time instead of holding the full matrix for a global softmax.

SigLIP's pairwise sigmoid loss changes the computation contract: positive and negative pair terms can be evaluated without a batch-wide probability distribution. It doesn't by itself say what accelerator, data volume, or accuracy a data-center matcher will need.

siglip-pairwise-loss.py

import math

def softplus(value: float) -> float:
    return max(value, 0.0) + math.log1p(math.exp(-abs(value)))

def pair_loss(score: float, is_match: bool) -> float:
    label = 1.0 if is_match else -1.0
    return softplus(-label * score)

scores = [
    ("same_asset", 3.2, True),
    ("wrong_caption", -2.1, False),
    ("hard_negative", 0.4, False),
]

for name, score, match in scores:
    print(name, round(pair_loss(score, match), 3))

Output

same_asset 0.04
wrong_caption 0.116
hard_negative 0.913
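The per-pair view extends to a whole batch. The sketch below applies the sigmoid loss to a small 3x3 cosine-similarity matrix twice: once over the full matrix and once row block by row block, to show that the chunked sum matches. The similarity values and the settings `t = 10.0` and `b = -5.0` are illustrative assumptions, not the paper's configuration, and `block_loss_sum` is a hypothetical helper name.

```python
import math

def log_sigmoid(value: float) -> float:
    # Numerically stable log(sigmoid(x)) = -softplus(-x).
    return -(max(-value, 0.0) + math.log1p(math.exp(-abs(value))))

def block_loss_sum(similarities, row_start, row_end, t, b):
    total = 0.0
    for i in range(row_start, row_end):
        for j, similarity in enumerate(similarities[i]):
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total -= log_sigmoid(label * (t * similarity + b))
    return total

similarities = [
    [0.95, 0.10, 0.05],
    [0.12, 0.90, 0.20],
    [0.02, 0.15, 0.92],
]
t, b = 10.0, -5.0  # illustrative temperature and bias
n = len(similarities)

full = block_loss_sum(similarities, 0, n, t, b) / n
# Each row block could run on a different device; no global normalizer is needed.
chunked = sum(block_loss_sum(similarities, i, i + 1, t, b) for i in range(n)) / n

print(f"full: {full:.4f}  chunked: {chunked:.4f}")
```

A softmax loss can't be split this way without first exchanging the per-row normalizers, which is the practical difference the pairwise formulation removes.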

Why SigLIP matters in newer stacks

A common published VLM pattern is to swap plain CLIP-style towers for newer SigLIP-family encoders because the sigmoid objective is easier to scale and stronger at moderate batch sizes. SigLIP 2 extends that recipe with caption-based pretraining, self-distillation, masked prediction, and native-aspect-ratio support, and the paper reports better transfer than original SigLIP across model scales when used as a VLM vision backbone.^{[6]Reference 6SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Featureshttps://arxiv.org/abs/2502.14786} The contrastive idea from CLIP is still the foundation; SigLIP changed the loss, and SigLIP 2 hardened the training recipe.

Modern VLMs: from alignment to generation

CLIP and SigLIP score whether an image and a caption belong together. Modern VLMs go further: they turn visual features into something a language model can reason over token by token.

Study published architectures first. Open papers already show the main patterns we need:

simple projectors that map vision features into the LLM space
query-based compressors that reduce many visual patches into a fixed token budget
resamplers and adapters that preserve more spatial detail for high-resolution inputs

That gives us a solid mental model without pretending we know the internals of proprietary systems.

Generative VLM connector comparison where LLaVA sends many visual tokens straight into the text prefix, BLIP-2 compresses them through learned queries, and Flamingo keeps a compact visual bank that language tokens can revisit later. — Connector choice changes two things at once: how many visual tokens hit the prefix and how much evidence the model can revisit later.

LLaVA: the straightforward projector path[7]

LLaVA is the clearest example of the "project visual features into the language model" recipe. It takes a pretrained CLIP vision encoder and a pretrained LLM, then learns a projection layer between them.

The original LLaVA recipe uses two stages:[7]

Feature alignment: freeze the vision encoder and LLM, train only the projector.
Visual instruction tuning: keep the vision encoder frozen and fine-tune the projector and LLM on multimodal instruction data.

The architectural lesson is important: LLaVA reported useful multimodal instruction-following behavior from a pretrained vision encoder, an LLM, and a learned projector, without introducing a new vision backbone.[7]
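
The two stages differ only in which components receive gradient updates. A minimal sketch of that schedule follows. The component names (`vision_encoder`, `projector`, `llm`) and the `STAGES` table are illustrative, not identifiers from the LLaVA codebase, and the sketch assumes the vision encoder stays frozen in both stages.

```python
# Hypothetical component names; True means the component is trained in that stage.
STAGES = {
    "feature_alignment": {"vision_encoder": False, "projector": True, "llm": False},
    "visual_instruction_tuning": {"vision_encoder": False, "projector": True, "llm": True},
}

def trainable(stage: str) -> list[str]:
    """Return the components that receive gradient updates in a stage."""
    return [name for name, train in STAGES[stage].items() if train]

for stage in STAGES:
    print(stage, "->", trainable(stage))
```

In a real training script, the same table would decide which parameter groups are handed to the optimizer.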

Common multimodal training ladder

A common published VLM recipe doesn't train in one jump from raw images to polished chat behavior. It uses stages, with each stage solving a different problem.

Stage	Typical objective	What it teaches
Contrastive pretraining	image-text matching as in CLIP or SigLIP	shared semantic space for retrieval and zero-shot recognition
Connector or projector alignment	freeze big backbones, train connector	map vision features into language-model space cheaply
Multimodal instruction tuning	image + prompt -> assistant response	teach task behavior, dialogue format, OCR usage, and grounded answers
Preference or safety post-training	chosen vs rejected responses, critiques, or policy filters	reduce harmful or low-quality multimodal behavior

CLIP and SigLIP mostly cover the first row.[1][5] LLaVA makes the second and third rows explicit with feature alignment followed by visual instruction tuning.[7] A chat-oriented system needs generative and instruction-following training beyond contrastive alignment, because matching alone doesn't teach long-form answer behavior.

This ladder is useful because it prevents a common confusion:

contrastive pretraining teaches matching
instruction tuning teaches how to answer
post-training teaches which answers are preferred or allowed

If a multimodal assistant sees the right evidence but answers in the wrong format, you need better instruction tuning, not more CLIP data. If it gives polished but unsafe grounded answers, you need better post-training, not a new projector.

BLIP-2 and the Q-Former: compress before the LLM[8]

BLIP-2 (Bootstrapping Language-Image Pre-training, version 2) is useful because it attacks a different problem. Instead of forwarding every visual feature directly into the LLM, it introduces a Q-Former that learns a fixed set of query vectors. Those queries pull the most relevant information out of the frozen vision encoder.

This creates a bounded number of visual tokens regardless of how large the raw vision feature map is.[8]

That's relevant in production because tokens admitted to the language path affect prefill and cache cost. If a smaller fixed budget preserves task-critical evidence, it can improve the measured latency-cost trade-off over naive concatenation.
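
The bounded-output property comes from cross-attention: the number of outputs equals the number of queries, not the number of image features. The sketch below shows only that property with a single-head, pure-Python cross-attention. It isn't the Q-Former itself, which is a multi-layer transformer with trained queries. The toy sizes and the random stand-in queries are assumptions for illustration.

```python
import math
import random

def cross_attend(queries, features):
    """Single-head dot-product cross-attention: one output vector per query."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * f[i] for w, f in zip(weights, features))
            for i in range(dim)
        ])
    return outputs

random.seed(0)
DIM, NUM_QUERIES = 8, 4  # toy sizes, chosen only for illustration

# Random vectors stand in for learned query embeddings.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

for num_patches in (256, 576, 2304):
    # Random vectors stand in for frozen vision-encoder outputs.
    features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(num_patches)]
    visual_tokens = cross_attend(queries, features)
    print(num_patches, "patch features ->", len(visual_tokens), "visual tokens")
```

The output count stays at `NUM_QUERIES` for every input size. Whether those few tokens still carry the evidence a task needs has to be measured separately.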

Flamingo-style resampling: preserve more context without exploding token count[9]

Flamingo pushed another important idea: first compress variable-length visual features with a Perceiver Resampler, then let language tokens attend to that compact bank through gated cross-attention blocks.[9]

This kind of design is especially useful when:

multiple images appear in one prompt
video or long visual sequences are involved
the language model needs to revisit visual evidence later in the answer

The core trade-off is complexity. Resamplers and cross-attention blocks preserve more structure, but they also add more moving parts than the very simple LLaVA recipe.
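
The "gated" part of gated cross-attention can be sketched in a few lines. The Flamingo paper describes a tanh gate whose parameter starts at zero, so the newly inserted layers initially leave the frozen language model's hidden states unchanged. In the sketch below, `visual_update` is a stand-in for the cross-attention output and all values are made up.

```python
import math

def gated_residual(text_state, visual_update, gate):
    """Add the visual update to the text state, scaled by tanh(gate)."""
    scale = math.tanh(gate)
    return [t + scale * v for t, v in zip(text_state, visual_update)]

text_state = [0.5, -1.0, 2.0]      # hypothetical hidden state of one language token
visual_update = [1.0, 1.0, -1.0]   # stand-in for what cross-attention would return

# gate = 0.0 mimics initialization: the visual path contributes nothing yet.
print("at_init:", gated_residual(text_state, visual_update, gate=0.0))

# A nonzero gate (reached through training in a real model) mixes visual evidence in.
print("after_training:", gated_residual(text_state, visual_update, gate=1.0))
```

This is one of the extra moving parts the trade-off refers to: the gate is one more parameter per inserted layer that training must open.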

Qwen-VL: grounding and text reading beyond caption matching[10]

Qwen-VL (Qwen Vision-Language) is a good published example of a chat-oriented VLM explicitly built for grounding and text reading, rather than caption matching alone. The paper describes a visual receptor, an input-output interface, and a multi-stage training pipeline that let the model handle OCR-heavy and box-grounded tasks in addition to generic image understanding.[10]

That matters because text reading and grounding are resolution-hungry tasks.

High-resolution token math

Suppose your vision tower uses 14×14 patches. A single 336×336 crop becomes $24 \times 24 = 576$ patch tokens before any projector-side compression. Tile one document page into four such crops and you're already above 2,300 visual tokens before the model even reads the user's question.

In a projected-prefix VLM, those admitted visual tokens hurt twice: they make prefill slower, and they enlarge the KV cache that must remain resident during decode. PagedAttention helps the serving stack pack and recycle that memory more efficiently, but it doesn't reduce the number of prefix tokens you created in the first place.[11] Cross-attention designs keep a separate visual bank, so measure their memory and latency path separately.
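
To make the KV-cache half of that cost concrete, here is a back-of-the-envelope sketch. The decoder shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not the shape of any specific model. Grouped-query attention or cache quantization would lower the result.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Keys and values (the factor of 2) stored per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical decoder shape; substitute your model's real values.
config = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_value=2)

for label, tokens in [("one_336_crop", 576), ("four_336_crops", 2304)]:
    mib = kv_cache_bytes(tokens, **config) / 2**20
    print(label, tokens, "visual tokens ->", mib, "MiB of KV cache per request")
```

Under these assumptions each prefix token costs 0.5 MiB of cache, so four crops hold 1,152 MiB per request before the question or answer adds any tokens.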

Projected-prefix visual token budget ladder where one 224 by 224 image creates 256 patch tokens, one 336 by 336 crop creates 576 patch tokens, and four such crops create 2304 visual tokens before text, showing how detail-preserving tiling quickly turns into long prefixes. — In projected-prefix paths, detail-preserving crops turn into longer prefixes. Measure compression against evidence loss before choosing serving policy.

In a projector path, the connector decides how many encoder features become autoregressive prefix tokens. Other architectures can expose a separate cross-attention memory bank instead. The comparison below counts prefix tokens for raw tiles and for a fixed-size compressed prefix; it doesn't claim that compression preserves enough OCR evidence.

visual-prefix-serving-budget.py

def prefix_tokens(text_tokens: int, crops: int, patch_tokens_per_crop: int, compressed_tokens: int | None) -> int:
    # Raw path: every patch token from every crop enters the prefix.
    visual_tokens = crops * patch_tokens_per_crop
    # Compressed path: a fixed budget replaces the raw count, whatever the crop count.
    if compressed_tokens is not None:
        visual_tokens = compressed_tokens
    return text_tokens + visual_tokens

# Same page and prompt, with and without a 64-token connector bottleneck.
raw_prefix = prefix_tokens(text_tokens=180, crops=4, patch_tokens_per_crop=576, compressed_tokens=None)
compressed_prefix = prefix_tokens(text_tokens=180, crops=4, patch_tokens_per_crop=576, compressed_tokens=64)

print("raw_prefix_tokens:", raw_prefix)
print("compressed_prefix_tokens:", compressed_prefix)
print("tokens_avoided_if_quality_holds:", raw_prefix - compressed_prefix)

Output

raw_prefix_tokens: 2484
compressed_prefix_tokens: 244
tokens_avoided_if_quality_holds: 2240

Production systems usually respond in one of three ways:

resize aggressively and accept some information loss
tile or crop the image, then summarize each region
compress the visual stream with a Q-Former, resampler, or other projector-side bottleneck before handing it to the LLM

When a model performs well on documents, inspect the full evidence path: resolution, tiling, OCR, connector compression, and the model's measured failure slices. Published designs such as BLIP-2's Q-Former and Flamingo's resampler show ways to bound or mediate visual features before language reasoning.[8][9]

Connector design controls a serving trade-off

The component between the vision encoder and the LLM controls what evidence reaches generation and how many visual tokens the language path must handle.

Common options include:

Linear projection: cheap and simple, but can bottleneck representational power.
MLP projector: still simple, usually stronger than a single linear layer.
Q-Former / query transformer: compresses many visual features into a fixed learned token set.[8]
Convolutional or locality-aware abstractors: preserve more spatial structure before handing features to the LLM.[12]

An overcompressed connector can remove evidence needed for OCR or grounding. A large or expensive connector can increase latency or language-prefix cost. Both outcomes must be measured by task slice.
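
For the two simplest options, the connector itself is small; what it passes through matters more. The sketch below counts parameters for a linear projector and a two-layer MLP projector under hypothetical widths (1024-d vision features, 4096-d LLM embeddings, MLP hidden width equal to the LLM width). Neither option reduces the number of visual tokens.

```python
def linear_projector_params(in_dim: int, out_dim: int) -> int:
    """One weight matrix plus one bias vector."""
    return in_dim * out_dim + out_dim

def mlp_projector_params(vision_dim: int, llm_dim: int) -> int:
    """Two linear layers; the hidden width equals llm_dim (an assumption)."""
    return linear_projector_params(vision_dim, llm_dim) + linear_projector_params(llm_dim, llm_dim)

# Hypothetical widths, chosen only to make the comparison concrete.
VISION_DIM, LLM_DIM = 1024, 4096

print("linear_params:", linear_projector_params(VISION_DIM, LLM_DIM))
print("mlp_params:", mlp_projector_params(VISION_DIM, LLM_DIM))
print("visual_tokens_in_equals_out:", True)  # both map each patch feature to one LLM token
```

Both counts are small next to a multi-billion-parameter LLM, which is why the serving question for these connectors is token count rather than connector size.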

That's why there's no single best adapter. The right design depends on whether you care more about:

raw simplicity
token efficiency
spatial fidelity
video or multi-image support

Designing around unpublished internals

Closed vendors may publish capability demos and benchmark numbers without enough architecture detail to justify internals claims.

Treat the model as an evaluated dependency: measure supported input resolution, visual-token billing if exposed, latency by image count, OCR and grounding slices, refusal behavior, and output contract. Don't infer a projector type, token budget, or training curriculum unless the provider publishes it.
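
One way to keep that discipline is to record only what has been measured and make the gaps visible. The sketch below is a hypothetical record type; the field names mirror the checklist above and don't correspond to any vendor's API.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class VendorModelProfile:
    """Measured facts about a closed VLM; None means 'not yet measured'."""
    max_input_resolution: Optional[str] = None
    visual_token_billing: Optional[str] = None
    p95_latency_ms_by_image_count: Optional[dict] = None
    ocr_slice_accuracy: Optional[float] = None
    grounding_slice_iou: Optional[float] = None
    refusal_rate: Optional[float] = None
    output_contract: Optional[str] = None

def unmeasured(profile: VendorModelProfile) -> list[str]:
    """List the checklist items that still have no measurement."""
    return [f.name for f in fields(profile) if getattr(profile, f.name) is None]

# Illustrative values only; fill these in from your own evaluation runs.
profile = VendorModelProfile(ocr_slice_accuracy=0.71, output_contract="json_schema_v1")
print("still_unmeasured:", unmeasured(profile))
```

Note what the record leaves out: there is no field for projector type or training curriculum, because those can't be measured from outside.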

Evaluation

Evaluating a VLM requires more than one benchmark because different tests stress very different abilities.

Benchmark	Focus	Metric	What it tells you
ImageNet	Zero-shot classification	Top-1 Accuracy	Whether the model recognizes common visual categories
COCO (Common Objects in Context) Captions	Captioning	CIDEr, BLEU	Whether the generated text overlaps with human references
VQAv2	Visual question answering	Accuracy	Whether the model can answer grounded questions about an image
TextVQA	OCR + reasoning	Accuracy	Whether the model can read and reason over embedded text
MMMU	Expert multimodal reasoning	Accuracy	Whether it can combine specialist knowledge and visual evidence
MMBench	Fine-grained multimodal QA	CircularEval score	Whether performance holds across ability slices and reordered multiple-choice options

No single metric is enough.

CLIP-like models can do well on retrieval and still fail badly on counting.
Caption metrics can look fine even when spatial reasoning is weak.
OCR-heavy tasks expose very different failure modes than general image description.

That's why serious evaluation needs a benchmark mix, not a single scoreboard.
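
The CircularEval score in the MMBench row is one example of a metric designed to catch a specific shortcut. A multiple-choice question counts as correct only if the model answers correctly under every circular shift of the options. The sketch below uses two hypothetical stand-ins for a model to show why a position-biased answerer fails.

```python
def circular_eval(options: list[str], correct: str, answer_fn) -> bool:
    """Pass only if the answer is right under every circular shift of the options."""
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        if answer_fn(rotated) != correct:
            return False
    return True

options = ["blocked vent", "torn asset tag", "loose cable", "no hazard"]

def content_aware(opts: list[str]) -> str:
    # Stand-in for a model that picks the right option wherever it appears.
    return "blocked vent"

def position_biased(opts: list[str]) -> str:
    # Stand-in for a model that always picks the first slot.
    return opts[0]

print("content_aware:", circular_eval(options, "blocked vent", content_aware))
print("position_biased:", circular_eval(options, "blocked vent", position_biased))
```

The position-biased stand-in would score as correct on the original ordering alone, which is the failure this metric is built to expose.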

Aggregate scores can also hide the exact slice that blocks launch. Keep capability slices separate and require task-specific pass bars.

vlm-evaluation-slices.py

# metric -> (measured value, required pass bar)
results = {
    "asset_retrieval_recall_at_10": (0.94, 0.90),
    "asset_tag_ocr_accuracy": (0.71, 0.92),
    "rack_hazard_grounding_iou": (0.76, 0.70),
}

# Any slice below its own bar blocks launch, whatever the other slices score.
failed = [
    metric
    for metric, (measured, required) in results.items()
    if measured < required
]

print("launch_ready:", not failed)
print("failed_slices:", failed)

Output

launch_ready: False
failed_slices: ['asset_tag_ocr_accuracy']

Production applications

Document understanding

Document AI is a token-budget problem. Dense documents contain tiny text, tables, figures, and layout cues that don't survive aggressive resizing.

In practice, production systems usually combine some mix of:

page tiling or crop-based preprocessing
OCR features
a VLM for ambiguous or layout-heavy cases

The engineering goal isn't "feed the page in at maximum resolution." It's "preserve the evidence that matters while keeping the token budget survivable."
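
A tiling policy makes that trade-off explicit, because the crop count sets the visual token bill. The sketch below computes crop boxes for a page. `tile_boxes` and its parameters are hypothetical, and the 576 tokens per crop assumes a 336×336 crop with 14×14 patches.

```python
def tile_boxes(width: int, height: int, tile: int, overlap: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crops that cover the page."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(size: int) -> list[int]:
        # Step by the stride, then add a final tile flush with the far edge.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)
        return positions

    return [
        (left, top, min(left + tile, width), min(top + tile, height))
        for top in starts(height)
        for left in starts(width)
    ]

boxes = tile_boxes(width=672, height=672, tile=336, overlap=0)
print(len(boxes), "crops ->", len(boxes) * 576, "patch tokens at 14x14 patches")
```

Overlap protects text that straddles a tile boundary but adds crops, so it belongs in the same budget as resolution.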

Image search and retrieval

For image search, a generative VLM is usually the wrong first tool.

A dual-encoder model like CLIP or SigLIP is better because you can:

precompute image embeddings offline
index them once
embed only the query at request time

That supports an indexed first stage whose latency and cost can be measured independently. Generative VLMs are candidates for reranking or explanation after retrieval when those added capabilities justify their measured cost.

route-vision-requests.py

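def select_path(task: str) -> list[str]:
    # Retrieval needs only precomputed embeddings and an index.
    if task == "retrieve_similar_assets":
        return ["dual_encoder", "vector_index"]
    # Reading label text needs OCR evidence plus a generator and a citation check.
    if task == "answer_from_label_text":
        return ["ocr", "generative_vlm", "citation_check"]
    # Acting on a screen needs grounding and a human in the loop.
    if task == "click_disable_port":
        return ["grounding_model", "action_policy", "human_confirm"]
    # Unknown tasks fail loudly instead of falling back to a default path.
    raise ValueError(task)

for task in ["retrieve_similar_assets", "answer_from_label_text", "click_disable_port"]:
    print(task, "->", " + ".join(select_path(task)))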

Output

retrieve_similar_assets -> dual_encoder + vector_index
answer_from_label_text -> ocr + generative_vlm + citation_check
click_disable_port -> grounding_model + action_policy + human_confirm
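
The dual-encoder path itself reduces to precomputing normalized image embeddings and ranking them against one query embedding. The sketch below uses toy 3-d vectors as stand-ins for encoder outputs and a brute-force scan in place of a vector index. The image IDs and values are made up.

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Rank precomputed, unit-normalized image embeddings against one query."""
    q = normalize(query)
    scored = [(image_id, sum(a * b for a, b in zip(q, emb))) for image_id, emb in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Offline step: embed and normalize every image once (toy vectors stand in for the image tower).
index = {
    "img_blocked_vent": normalize([0.9, 0.1, 0.0]),
    "img_torn_tag": normalize([0.1, 0.9, 0.1]),
    "img_clean_rack": normalize([0.0, 0.2, 0.9]),
}

# Request time: only the query is embedded (stand-in for the text tower's output).
query_embedding = [0.8, 0.2, 0.1]
for image_id, score in top_k(query_embedding, index, k=2):
    print(image_id, round(score, 3))
```

The scores order candidates; they aren't probabilities, so any threshold on them needs calibration against reviewed examples.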

Visual agents

Visual agents need more than a captioning model. They need grounding.

It isn't enough to answer "the blue button is near the top-right." The system often has to emit an action like:

click (x, y)
select the second menu item
type into the input field beneath a label

That usually requires a grounding-aware model or a separate grounding component on top of the core VLM, plus strong safety checks around actions.
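
A minimal sketch of such a safety check follows. It turns a grounded box into a click only when grounding confidence clears a threshold, and flags destructive actions for human confirmation. `GroundedTarget`, `plan_click`, the threshold, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GroundedTarget:
    label: str
    box: tuple[int, int, int, int]  # (left, top, right, bottom) in screenshot pixels
    confidence: float               # grounding model's score for this box

def plan_click(target: GroundedTarget, min_confidence: float, destructive: bool) -> dict:
    """Return a click at the box center, or abstain when grounding is uncertain."""
    if target.confidence < min_confidence:
        return {"action": "abstain", "reason": "low_grounding_confidence"}
    left, top, right, bottom = target.box
    click = {"action": "click", "x": (left + right) // 2, "y": (top + bottom) // 2}
    if destructive:
        # Destructive actions are never executed without a human in the loop.
        click["requires_human_confirm"] = True
    return click

target = GroundedTarget(label="disable port", box=(820, 40, 900, 72), confidence=0.91)
print(plan_click(target, min_confidence=0.8, destructive=True))
```

Caption quality never enters this decision. The gate depends on box-level evidence and on the action's risk.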

Mastery check

Use these checkpoints to test core mechanics before moving on.

Why does CLIP need large batch sizes during training?
Zero-shot prompts "a photo of a rack hazard" and "a photo of a safe rack" score 0.51 and 0.49 for a blocked-airflow rack photo. What is one likely cause and one fix?
Why can't CLIP tell you "there are three red boxes on the left shelf"?
A colleague says SigLIP eliminates the need for large batches entirely. Is that true?

What strong answers show

Decision	Pass bar
Retrieval vs generation	You can explain when CLIP or SigLIP is enough, and when a generative VLM is required because the task needs open-ended answers, OCR-heavy reasoning, or grounded actions.
Backbone choice	You can compare ViT and ResNet trade-offs around scale, inductive bias, and tokenization, and explain why patch size changes prefill and KV-cache budget.
Spatial failure modes	You can predict when pooled image embeddings will fail on counting, localization, OCR, or grounding before a team ships the wrong model class.
Connector choice	You can defend when a simple projector is enough and when a Q-Former, resampler, or other compression step is needed before the LLM.
Document strategy	You can choose between higher resolution, tiling, OCR preprocessing, and crop-and-compress pipelines based on what evidence must survive.
Serving budget	You can estimate how visual token count affects prefill latency, concurrency, and KV-cache pressure, instead of tracking encoder FLOPs alone.
Product contract	You can specify search and analysis APIs, size the embedding index, fail closed on missing evidence, and roll encoder versions through a reversible index generation.

Follow-up questions

Your team wants to replace CLIP with SigLIP because distributed training keeps stalling on all-gather pressure. What changes, and what doesn't change?
An operations lead asks CLIP to count rack hazards in a rack row and point to each one. Why is that a design mismatch?
You inherit a LLaVA-style stack with a CLIP tower and an LLM, but OCR-heavy screenshots are too expensive to serve. Which component should you question first?
Operations asks for a visual agent that can click the 'disable port' button in a screenshot. Why is caption quality not enough evidence that the model is ready?
In a projected-prefix stack, one document page is tiled into four 336 by 336 crops. Why can that hurt twice at inference time?

Common pitfalls

"One pooled CLIP embedding understands an image like a person does"

Symptom: Teams ask for counts, boxes, or grounded actions from one global embedding.
Cause: CLIP was trained for global alignment, not dense spatial supervision.
Fix: Use detectors, grounding-aware VLMs, or region features when localization or counting matters.

"CLIP or SigLIP can also serve my chat and captioning path"

Symptom: Similarity scores are wired into a task that needs open-ended explanations or grounded dialogue.
Cause: Image-text matching was confused with multimodal generation.
Fix: Keep CLIP or SigLIP for retrieval and zero-shot scoring, and use a generative VLM when you need open-ended answers.

"Higher resolution always means better OCR"

Symptom: Document quality rises a little, but latency and memory spike hard.
Cause: Resolution was increased without checking token budget, tiling policy, or compression.
Fix: Measure visual token count, tile selectively, preserve only the evidence that matters, and compress before the LLM when possible.

"Visual serving cost ends when the encoder finishes"

Symptom: Offline profiling looks cheap, but production decode concurrency collapses.
Cause: The team counted encoder FLOPs but ignored prefill latency and KV-cache growth from large visual prefixes.
Fix: Track prefill cost and KV pressure together with encoder cost before picking image resolution or crop count.

"Every in-batch non-match is a clean negative"

Symptom: Contrastive training underperforms on visually similar assets or captions that describe the same scene in different words.
Cause: Some in-batch negatives are false negatives, so the loss pushes semantically related pairs apart.
Fix: Curate data carefully, use richer captioning, and treat contrastive batch design as a modeling choice rather than a harmless training detail.

"Zero-shot prompts will hold in fine-grained domains"

Symptom: Prompt wording swings predictions or specialist categories collapse together.
Cause: Generic templates don't provide enough domain signal.
Fix: Prompt-ensemble, calibrate on validation data, and fine-tune or probe when domain precision matters.

What you should carry forward

At this point, check these skills:

Explain how CLIP's dual-encoder architecture aligns images and text in a shared embedding space.
Walk through the contrastive loss logic with a concrete batch example.
Implement zero-shot classification using prompt engineering and embedding similarity.
Diagnose why CLIP fails on spatial reasoning, counting, or fine-grained OCR.
Compare LLaVA, BLIP-2, and Flamingo-style designs for connecting vision encoders to language models.
Size the visual token budget for a production VLM and choose an appropriate compression strategy.

Next Step

Continue to Multimodal LLM Architecture

CLIP showed how images and text can align, and you saw the cost of carrying visual evidence into generation. Next you'll design the full multimodal stack: modality encoders, connectors, fusion choices, training stages, and serving controls.


References

1. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. https://arxiv.org/abs/2103.00020
2. Liu, S., et al. (2023). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint. https://arxiv.org/abs/2303.05499
3. Agarwal, S., et al. (2021). Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications. arXiv preprint. https://arxiv.org/abs/2108.02818
4. Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. https://arxiv.org/abs/2010.11929
5. Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-training. ICCV 2023. https://arxiv.org/abs/2303.15343
6. Tschannen, M., Gritsenko, A., Wang, X., et al. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. https://arxiv.org/abs/2502.14786
7. Liu, H., et al. (2023). Visual Instruction Tuning. NeurIPS 2023. https://arxiv.org/abs/2304.08485
8. Li, J., et al. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. https://arxiv.org/abs/2301.12597
9. Alayrac, J.-B., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022. https://arxiv.org/abs/2204.14198
10. Bai, J., et al. (2023). Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint. https://arxiv.org/abs/2308.12966
11. Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. https://arxiv.org/abs/2309.06180
12. Cha, J., et al. (2023). Honeybee: Locality-enhanced Projector for Multimodal LLM. arXiv preprint. https://arxiv.org/abs/2312.06742
