Sentence Embeddings & Contrastive Loss

Learn how contrastive losses train sentence embeddings, why hard negatives matter, and how retrieval systems combine bi-encoders, rerankers, and dimension tradeoffs.

In the production-agent capstone, document_qa_v2 found return-policy-us-v3 before the agent drafted an answer about a cracked tablet. That contract deliberately hid one important mechanism: how did a policy passage become a good candidate for the question in the first place?

A sentence embedding maps a query or passage to one fixed-width vector. A retriever can then find passages near a customer question even when wording changes:

Query	Passage that should rank near it	Tempting non-match
"Can I return a tablet that arrived cracked?"	"Damaged electronics may be returned within 30 days."	"A private seller note requests an immediate refund."

This chapter teaches how contrastive learning shapes that vector space. The agent still needs evidence and authorization gates; embeddings decide which text gets considered before those gates run.

From word embeddings to sentence embeddings

Word embeddings gave individual tokens numerical coordinates. Retrieval needs one vector for an entire query or policy passage. The challenge is to compress variable-length text into a fixed-size representation whose neighborhood ordering is useful for the task.

Averaging context-free word vectors is a useful baseline, but it loses order and context. Don't confuse that baseline with mean pooling contextual token outputs inside a trained sentence encoder: pooling specifies how to produce one vector; the training objective determines whether its geometry supports retrieval. A common objective is contrastive learning, which rewards a relevant pair for scoring above irrelevant candidates.

Naive approaches (and why they fail)

Mean pooling of word embeddings

A simple way to create a sentence embedding is to look up a word vector for each token in the sentence and average them. This naive baseline is fast but loses important structural information. The function below uses a tiny local vector table so you can run the baseline and see exactly what it throws away:

mean-pooling-of-word-embeddings.py

word_vectors: dict[str, tuple[float, float, float]] = {
    "carrier": (0.9, 0.1, 0.0),
    "delayed": (0.8, 0.2, 0.1),
    "order": (0.7, 0.1, 0.2),
    "refund": (0.1, 0.9, 0.2),
    "posted": (0.2, 0.8, 0.1),
}

def mean_pool(
    sentence: str,
    vectors: dict[str, tuple[float, float, float]],
) -> tuple[float, float, float]:
    tokens = [token.lower() for token in sentence.split()]
    token_vectors = [vectors[token] for token in tokens if token in vectors]

    if not token_vectors:
        width = len(next(iter(vectors.values())))
        return tuple(0.0 for _ in range(width))

    return tuple(
        sum(vector[dim] for vector in token_vectors) / len(token_vectors)
        for dim in range(len(token_vectors[0]))
    )

pooled_a = mean_pool("carrier delayed order", word_vectors)
pooled_b = mean_pool("order delayed carrier", word_vectors)

print("A:", tuple(round(value, 3) for value in pooled_a))
print("B:", tuple(round(value, 3) for value in pooled_b))
print("Same vector:", pooled_a == pooled_b)

Output

A: (0.8, 0.133, 0.1)
B: (0.8, 0.133, 0.1)
Same vector: True

The final line is the problem: both sentences produce the same vector because averaging ignores order.

Problems

Ignores word order: "carrier delayed order" and "order delayed carrier" have identical embeddings because the sum of vectors is commutative ( $A+B = B+A$ ), even though word order can change what the sentence emphasizes.
Common-word dilution: Frequent words and boilerplate phrasing can wash out the signal from rare, informative tokens.
No context: Polysemous words like "charge" (payment dispute vs. powering a scanner) get averaged into a single messy vector.
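The common-word dilution problem can be sketched with the same kind of toy table. The vectors below are hypothetical and chosen only for illustration: one informative token ("refund") and three boilerplate tokens that all share a generic direction. Averaging pulls the padded sentence toward that generic direction.

```python
from math import sqrt

# Hypothetical toy vectors: one informative token plus boilerplate tokens
# that all share the same generic direction.
toy_vectors = {
    "refund": (0.1, 0.9, 0.2),
    "please": (0.5, 0.5, 0.5),
    "the": (0.5, 0.5, 0.5),
    "thanks": (0.5, 0.5, 0.5),
}

def mean_pool_tokens(tokens: list[str]) -> tuple[float, ...]:
    """Average the toy vectors of the given tokens, dimension by dimension."""
    rows = [toy_vectors[token] for token in tokens]
    return tuple(sum(column) / len(rows) for column in zip(*rows))

def cosine(left: tuple[float, ...], right: tuple[float, ...]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

generic = toy_vectors["the"]
short = mean_pool_tokens(["refund"])
padded = mean_pool_tokens(["please", "the", "refund", "thanks"])

# The padded sentence sits closer to the generic direction than "refund" alone.
print("short vs generic:", round(cosine(short, generic), 3))
print("padded vs generic:", round(cosine(padded, generic), 3))
```

In this toy setup the informative token contributes only a quarter of the average, so its signal is outvoted by the boilerplate.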

[CLS] token from BERT

Using the [CLS] (Classification) token from BERT (Bidirectional Encoder Representations from Transformers) directly as a sentence embedding isn't a sound retrieval baseline. Reimers & Gurevych (2019) showed that BERT's paired-input architecture is impractical for large semantic search and introduced SBERT so independently encoded sentence vectors could be compared with cosine similarity.^[1] A pretrained task token hasn't been trained to make nearest-neighbor distance a relevance score.

The anisotropy problem

Contextual representations can show anisotropy: vectors occupy a narrow cone in the high-dimensional space rather than using directions evenly.^[2] Think of many policy passages all pointing toward one generic "support text" direction. Their cosine scores can look high even when they answer different questions. Contrastive objectives can improve this geometry by rewarding aligned positives while penalizing competing candidates.
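One rough way to check for this effect is to measure the average pairwise cosine similarity over a sample of embeddings. The sketch below uses hypothetical 3-dimensional vectors built from four distinct directions plus one large shared offset. Mean-centering is used here only as a simple diagnostic to show how much of the similarity comes from the shared direction; it is not the contrastive training described in the text.

```python
from itertools import combinations
from math import sqrt

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

def mean_pairwise_cosine(vectors: list[list[float]]) -> float:
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def center(vectors: list[list[float]]) -> list[list[float]]:
    """Subtract the per-dimension mean from every vector."""
    width = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(width)]
    return [[v[d] - mean[d] for d in range(width)] for v in vectors]

# Hypothetical "narrow cone": distinct directions dominated by a shared offset.
offset = [5.0, 5.0, 5.0]
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
raw = [[o + d for o, d in zip(offset, direction)] for direction in directions]

raw_mean = mean_pairwise_cosine(raw)
centered_mean = mean_pairwise_cosine(center(raw))

print("mean pairwise cosine, raw:", round(raw_mean, 3))
print("mean pairwise cosine, centered:", round(centered_mean, 3))
```

A raw average close to 1 across unrelated texts suggests the scores mostly reflect a shared direction rather than meaning.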

Embedding-space geometry comparison showing raw contextual sentence vectors collapsed into a narrow cone before contrastive training and separated semantic clusters after contrastive fine-tuning. — Raw contextual vectors can bunch into one generic direction, while contrastive fine-tuning creates separable semantic neighborhoods that make nearest-neighbor search meaningful.

Contrastive learning for sentence embeddings

The core idea

The goal of contrastive learning is simple but powerful: reshape the embedding space so that sentences with similar meaning land close to each other, while unrelated sentences are pushed apart.

Before we formalize it, recall what "close" means here. Cosine similarity measures how aligned two vectors point (their angle), ignoring their length. For two unit vectors (length 1), it's simply their dot product. A value of +1 means same direction, 0 means orthogonal directions, and a negative value means opposing directions. None of those numbers proves semantic identity or irrelevance by itself; only an evaluated embedding model makes cosine ranking useful. The next lesson studies this scoring rule in detail.
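As a minimal sketch of that definition, the function below computes cosine similarity in plain Python. The input vectors are arbitrary illustrations and do not come from any model.

```python
from math import sqrt

def cosine_similarity(left: list[float], right: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = sqrt(sum(a * a for a in left))
    right_norm = sqrt(sum(b * b for b in right))
    return dot / (left_norm * right_norm)

# Scaling a vector changes its length but not its direction.
same_direction = cosine_similarity([1.0, 2.0], [3.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])
opposite = cosine_similarity([1.0, 2.0], [-2.0, -4.0])

# For unit vectors, cosine similarity is just the dot product.
unit_a, unit_b = [0.6, 0.8], [0.8, 0.6]
unit_dot = sum(a * b for a, b in zip(unit_a, unit_b))

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
print(round(unit_dot, 6), round(cosine_similarity(unit_a, unit_b), 6))
```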

With that in mind, contrastive training teaches the model to:

Pull embeddings of similar sentences toward each other (high cosine similarity)
Push embeddings of dissimilar sentences away (low cosine similarity)

SimCSE (Simple Contrastive Learning of Sentence Embeddings)^[3] demonstrated an unusually small self-supervised construction: pass the same sentence through the encoder twice while dropout supplies different noisy views, then treat those views as a positive pair. Wang & Isola's analysis gives useful vocabulary for the resulting geometry: alignment asks whether positives are close, and uniformity asks whether representations avoid crowding into a small region of the hypersphere.^[4] E5 later trained single-vector text embeddings contrastively from a large weakly supervised pair corpus.^[5]
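The alignment and uniformity vocabulary can be made concrete with a small sketch. The functions below follow the commonly used forms of the two metrics (mean squared distance between positives, and the log of the mean Gaussian potential over all pairs with $t = 2$); the 2-dimensional vectors are hypothetical, and lower is better for both values.

```python
from itertools import combinations
from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def squared_distance(left: list[float], right: list[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(left, right))

def alignment(positive_pairs: list[tuple[list[float], list[float]]]) -> float:
    """Mean squared distance between positives; lower means better aligned."""
    return sum(squared_distance(a, b) for a, b in positive_pairs) / len(positive_pairs)

def uniformity(vectors: list[list[float]], t: float = 2.0) -> float:
    """Log of the mean pairwise Gaussian potential; lower means more spread out."""
    pairs = list(combinations(vectors, 2))
    return log(sum(exp(-t * squared_distance(a, b)) for a, b in pairs) / len(pairs))

# Hypothetical unit vectors: one crowded set and one evenly spread set.
crowded = [normalize(v) for v in ([1.0, 0.05], [1.0, 0.0], [1.0, -0.05], [1.0, 0.1])]
spread = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0])]

tight_pairs = [(crowded[0], crowded[1])]
loose_pairs = [(spread[0], spread[1])]

print("uniformity crowded:", round(uniformity(crowded), 3))
print("uniformity spread:", round(uniformity(spread), 3))
print("alignment tight:", round(alignment(tight_pairs), 4))
print("alignment loose:", round(alignment(loose_pairs), 4))
```

The two metrics pull in different directions: collapsing everything to one point gives perfect alignment and poor uniformity, which is why both are tracked together.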

The most common training objective here is Information Noise Contrastive Estimation (InfoNCE): for one anchor sentence, the model should rank the true match above every other candidate in the batch.

Contrastive learning for sentence embeddings showing an anchor support query, a semantically matching positive, in-batch negatives, and the InfoNCE denominator that compares the positive against every candidate in the batch. — Contrastive training turns a batch into a ranking problem: the true match should outrank every in-batch negative for the same anchor query.

InfoNCE loss

Think of contrastive learning as matching a support ticket to its true duplicate while pushing away unrelated tickets. You want the matching pair close together and every non-match farther away. InfoNCE formalizes this push-and-pull dynamic as a softmax classification over the candidates in the batch.

Before looking at the formula, walk through a tiny concrete case. Suppose you have a batch of 2 query-passage pairs, and after normalizing their embeddings you measure cosine similarities (which, for unit vectors, are just dot products):

Query	Positive	Similarity
$q_1$	$p_1$	0.90
$q_1$	$p_2$	0.20
$q_2$	$p_1$	0.15
$q_2$	$p_2$	0.85

For query $q_1$ , the true match is $p_1$ (similarity 0.90). The other passage in the batch, $p_2$ , acts as an in-batch negative (similarity 0.20). InfoNCE wants the model to make $p_1$ look more likely than $p_2$ .

For this worked row, choose a sharp temperature $\tau = 0.05$ and compute the loss contribution for $q_1$ step by step:

Scale the similarities: positive logit = $0.90 / 0.05 = 18.0$ , negative logit = $0.20 / 0.05 = 4.0$
Exponentiate (this turns logits into unnormalized probabilities): $\exp(18.0) \approx 65{,}659{,}969$ , $\exp(4.0) \approx 54.6$
Normalize into a probability for the positive: $65{,}659{,}969 / (65{,}659{,}969 + 54.6) \approx 0.99999917$
Take negative log: $-\log(0.99999917) \approx 0.00000083$ (tiny loss; the model is already very confident)

If the model were wrong ( $q_1$ similarity to $p_1$ only 0.20, to $p_2$ 0.90), the positive probability would drop to about $8.3 \times 10^{-7}$ and the loss would jump to roughly 14 nats. The optimizer would receive a strong gradient pushing the correct pair closer.
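The arithmetic for both cases can be checked with a few lines of Python. This sketch uses raw exponentials on purpose to mirror the hand calculation; that is safe here only because the logits are small.

```python
from math import exp, log

def positive_probability(
    positive_similarity: float,
    negative_similarity: float,
    temperature: float,
) -> float:
    """Two-candidate softmax probability assigned to the positive."""
    positive_term = exp(positive_similarity / temperature)
    negative_term = exp(negative_similarity / temperature)
    return positive_term / (positive_term + negative_term)

tau = 0.05
correct = positive_probability(0.90, 0.20, tau)  # model ranks p1 first
swapped = positive_probability(0.20, 0.90, tau)  # model ranks p2 first

print(f"correct: P={correct:.8f}, loss={-log(correct):.8f}")
print(f"swapped: P={swapped:.2e}, loss={-log(swapped):.2f} nats")
```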

InfoNCE similarity matrix for two query-passage pairs showing diagonal positives, off-diagonal in-batch negatives, temperature-scaled logits, and positive probabilities. — InfoNCE reads each similarity row as a multiple-choice question where the diagonal passage is the correct answer and off-diagonal passages are in-batch negatives.

The standard contrastive loss for a batch of $N$ positive pairs:^[6]

$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i, z_i^+) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i, z_j^+) / \tau)}$

Reading the formula

For each example $i$ , compare its similarity to the true match $z_i^+$ (numerator) against the full batch-level denominator. That denominator includes the positive pair itself plus every other candidate in the batch. Temperature $\tau$ controls how sharply the model distinguishes between similar and dissimilar pairs. The loss says: "make the true pair more likely than every alternative in the batch."

Where:

$z_i, z_i^+$ are embeddings of a positive pair (e.g., query and relevant document)
$\tau$ is the temperature parameter
All non-matching examples in that denominator act as in-batch negatives

The following copy-runnable implementation shows the same calculation without hiding the matrix math behind a framework. Production training code would vectorize this in PyTorch, JAX, or another tensor library, but the loop below makes the denominator explicit:

reading-the-formula.py

from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

def logsumexp(values: list[float]) -> float:
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def row_cross_entropy(logits: list[float], correct: int) -> float:
    return logsumexp(logits) - logits[correct]

def info_nce_loss(
    query_vectors: list[list[float]],
    positive_vectors: list[list[float]],
    temperature: float = 0.2,
) -> float:
    queries = [normalize(vector) for vector in query_vectors]
    positives = [normalize(vector) for vector in positive_vectors]
    losses: list[float] = []

    for row, query in enumerate(queries):
        logits = [dot(query, candidate) / temperature for candidate in positives]
        losses.append(row_cross_entropy(logits, row))

    return sum(losses) / len(losses)

query_vectors = [[1.0, 0.0], [0.0, 1.0]]
positive_vectors = [[0.95, 0.05], [0.10, 0.90]]

loss = info_nce_loss(query_vectors, positive_vectors)
score_pos = dot(normalize(query_vectors[0]), normalize(positive_vectors[0]))
score_neg = dot(normalize(query_vectors[0]), normalize(positive_vectors[1]))
extreme_loss = row_cross_entropy([1000.0, 986.0], correct=0)

print("loss:", round(loss, 4))
print("q1 positive score:", round(score_pos, 4))
print("q1 negative score:", round(score_neg, 4))
print("stable large-logit loss:", f"{extreme_loss:.8f}")

Output

loss: 0.0104
q1 positive score: 0.9986
q1 negative score: 0.1104
stable large-logit loss: 0.00000083

The equation is often expanded into raw exponentials when calculating a small example on paper. Code should compute the same expression with log-sum-exp or a framework cross-entropy operation, so large logits don't overflow.

Triplet loss

A second contrastive objective, triplet loss, focuses on individual anchor-positive-negative triplets instead of comparing one anchor against a whole candidate pool:

$\mathcal{L} = \max(0, d(a, p) - d(a, n) + m)$

Where:

$d(\cdot, \cdot)$ is the Euclidean distance between embeddings
$m$ is a margin hyperparameter
$a$ is the anchor, $p$ is the positive, $n$ is the negative

The loss enforces that the anchor must be closer to the positive than to the negative by at least margin $m$ : $d(a, p) + m \leq d(a, n)$ . If this constraint is already satisfied, the loss is zero and the triplet contributes no gradient. The hinge at zero is what keeps the model from spending capacity on pushing already-distant negatives even farther away; the margin sets how large the gap must be before a triplet counts as solved.

Worked example: computing triplet loss by hand

Consider three sentences about return labels:

Role	Sentence
Anchor ( $a$ )	"How do I print a return label?"
Positive ( $p$ )	"Where can I generate a return shipping label?"
Negative ( $n$ )	"How do I replace a damaged shipping label?"

After encoding, suppose the distances are:

$d(a, p) = 0.2$ (the paraphrase is close)
$d(a, n) = 0.5$ (the hard negative is farther, but not by much)

With margin $m = 0.1$ , plug into the formula:

$0.2 - 0.5 + 0.1 = -0.2$

$\max(0, -0.2) = 0$

The loss is zero because the model already satisfies the margin constraint: the positive is closer than the negative by more than 0.1. Now imagine a bad model where $d(a, p) = 0.5$ and $d(a, n) = 0.3$ (the negative is closer than the positive):

$0.5 - 0.3 + 0.1 = 0.3$

$\max(0, 0.3) = 0.3$

A non-zero loss tells the optimizer to push the anchor and positive together while pushing the negative away until the gap exceeds the margin.
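Both hand calculations can be reproduced with a short function. The 2-dimensional points in the second half are hypothetical, placed so that their Euclidean distances match the first worked case.

```python
from math import dist

def triplet_loss(d_ap: float, d_an: float, margin: float = 0.1) -> float:
    """Hinge on the gap between positive and negative distances."""
    return max(0.0, d_ap - d_an + margin)

good_model = triplet_loss(d_ap=0.2, d_an=0.5)  # margin already satisfied
bad_model = triplet_loss(d_ap=0.5, d_an=0.3)   # negative is closer than positive

print("good model loss:", round(good_model, 4))
print("bad model loss:", round(bad_model, 4))

# The same loss computed from (hypothetical) embedding points.
anchor, positive, negative = (0.0, 0.0), (0.2, 0.0), (0.5, 0.0)
from_points = triplet_loss(dist(anchor, positive), dist(anchor, negative))
print("loss from points:", round(from_points, 4))
```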

Key differences from InfoNCE

Triplet loss compares a chosen negative against a margin, so mining determines most of its learning signal.
InfoNCE compares each anchor with a candidate pool. Batches supply negatives cheaply, but they can also contain false negatives.
Neither objective is an automatic win. Choose data construction deliberately and evaluate held-out retrieval failures, not training loss alone.

Temperature parameter τ

Temperature controls the "sharpness" of the softmax distribution over similarity scores.

For the worked similarity gap, $0.90 - 0.20 = 0.70$ , changing temperature changes the positive probability:

τ	$P(\text{positive})$ for this row	What to inspect
0.01	$>0.9999$	Saturates quickly; a false negative receives extreme pressure.
0.05	$0.999999$	Very sharp separation for this easy row.
0.10	$0.9991$	Still confident, with less sharpness.
1.00	$0.6682$	Much flatter signal.

There is no universal best temperature. Tune it against held-out retrieval failures and implement the loss stably. Low temperature amplifies mislabeled or false negatives; overflow is an implementation bug that stable log-softmax or cross-entropy avoids.

Temperature effects in contrastive learning showing sharper probabilities for low temperature, flatter probabilities for high temperature, and higher pressure on false negatives at very low values. — Temperature is a sharpness knob: low values make hard negatives dominate training, while high values flatten the softmax and weaken the separation signal.

temperature-sharpens-the-same-row.py

from math import exp

def probability_of_positive(
    positive_similarity: float,
    negative_similarity: float,
    temperature: float,
) -> float:
    scaled_gap = (positive_similarity - negative_similarity) / temperature
    return 1.0 / (1.0 + exp(-scaled_gap))

for temperature in (0.01, 0.05, 0.10, 1.00):
    probability = probability_of_positive(0.90, 0.20, temperature)
    print(f"tau={temperature:.2f}: P(positive)={probability:.6f}")

Output

tau=0.01: P(positive)=1.000000
tau=0.05: P(positive)=0.999999
tau=0.10: P(positive)=0.999089
tau=1.00: P(positive)=0.668188

Training strategies for sentence embeddings

Training a good embedding model requires good training data. You need pairs of sentences that are either semantically similar or different, and you need enough diversity to teach the model real distinctions. Two main approaches dominate.

Supervised: fine-tuning on NLI data

Natural Language Inference (NLI) labels whether a hypothesis follows from a premise (entailment), conflicts with it (contradiction), or does neither (neutral). Entailment is directional, not a promise that two texts are interchangeable.

SBERT trained Siamese and triplet architectures with NLI supervision and evaluated sentence similarity behavior.^[1] Supervised SimCSE uses entailment as a positive and the corresponding contradiction as a hard negative.^[3] This is a useful training construction, but you still need retrieval evaluation before treating two policy passages as substitutes.

SBERT's architectural move is simple but important: it runs both sentences through the same encoder with shared weights, pools each sentence into a single vector, and then trains on top of those pooled embeddings.^[1] During inference, you only keep the single-sentence encoding path. That shared-weight Siamese network setup (two inputs processed through the same shared encoder) is what makes precomputing document embeddings and doing nearest-neighbor search practical.
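The pooling step can be sketched in plain Python. Mean pooling is one common choice; the token outputs and the attention mask below are hypothetical stand-ins for what an encoder would produce, with the mask marking which positions are real tokens rather than padding. The same function is applied to both sentences, which is the shared-weight idea in miniature.

```python
def masked_mean_pool(
    token_outputs: list[list[float]],
    attention_mask: list[int],
) -> list[float]:
    """Average token vectors, skipping padded positions (mask value 0)."""
    kept = [vector for vector, keep in zip(token_outputs, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask removed every token")
    return [sum(column) / len(kept) for column in zip(*kept)]

# Hypothetical contextual outputs; the last row of the first sentence is padding.
sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask_a = [1, 1, 1, 0]
sentence_b = [[0.8, 0.2], [0.2, 0.8], [0.9, 0.9], [1.0, 1.0]]
mask_b = [1, 1, 1, 1]

pooled_a = masked_mean_pool(sentence_a, mask_a)
pooled_b = masked_mean_pool(sentence_b, mask_b)

print("pooled A:", [round(value, 3) for value in pooled_a])
print("pooled B:", [round(value, 3) for value in pooled_b])
```

Ignoring the mask would let the padding row dominate the first average, which is a frequent source of silently degraded embeddings.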

Self-supervised: SimCSE

What if you don't have labeled NLI data? SimCSE (Simple Contrastive Learning of Sentence Embeddings) from Gao et al. (2021)^[3] shows you don't need it. The trick is elegant: pass the same sentence through the encoder twice with different dropout masks. Since dropout randomly zeros out different neurons each time, you get two slightly different embedding vectors for the same sentence. These two views are positives (they should be similar), while all other sentences in the batch are negatives.

This gives surprisingly strong embeddings without labeled pairs, although supervised SimCSE performs better across the paper's reported STS tasks.^[3] The key insight is that the stochasticity in the transformer forward pass, which is normally just a regularization technique, becomes an implicit data augmentation mechanism for contrastive learning.
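A minimal sketch of the two-view idea follows. It makes several simplifying assumptions: dropout is applied to a single hidden vector rather than inside transformer layers, the vectors are hypothetical, and the masks are written by hand so the result is reproducible (real dropout samples a fresh mask on every forward pass).

```python
from math import sqrt

def apply_dropout(hidden: list[float], mask: list[int], drop_rate: float) -> list[float]:
    """Inverted dropout: zero the masked units and rescale the survivors."""
    scale = 1.0 / (1.0 - drop_rate)
    return [value * scale if keep else 0.0 for value, keep in zip(hidden, mask)]

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

# Hypothetical hidden states for one sentence and for another batch sentence.
hidden = [0.4, -0.2, 0.7, 0.1, -0.5, 0.3]
other = [-0.6, 0.5, 0.1, -0.3, 0.2, -0.4]

view_a = apply_dropout(hidden, [1, 1, 1, 0, 1, 1], drop_rate=0.1)
view_b = apply_dropout(hidden, [1, 0, 1, 1, 1, 1], drop_rate=0.1)
negative = apply_dropout(other, [1, 1, 1, 1, 0, 1], drop_rate=0.1)

positive_score = cosine(view_a, view_b)   # same sentence, different masks
negative_score = cosine(view_a, negative)  # different sentence in the batch

print("views differ:", view_a != view_b)
print("positive cosine:", round(positive_score, 3))
print("negative cosine:", round(negative_score, 3))
```

The two views are close but not identical, which gives the contrastive loss a positive pair with a small, meaning-preserving perturbation.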

Data augmentation

Beyond dropout, augmentations are hypotheses about meaning preservation:

Dropout masks (SimCSE): different mask patterns per forward pass
Verified paraphrases or back-translations: use only after checking that policy scope and exceptions survive
Domain pairs: mine resolved duplicate questions that cite the same approved policy passage
Re-ranking with cross-encoders: score candidate paraphrases before accepting them as positives

Avoid casual word deletion or insertion for policy text: dropping "not," a product category, or an approval condition changes the rule while incorrectly labeling the pair positive.

In-batch negatives in practice

Most contrastive learning implementations use in-batch negatives by default: for a batch of $N$ positive pairs, each anchor has one matching positive, and the other $N-1$ candidates in the batch act as negatives. This is efficient because you get many negatives "for free" without explicitly labeling them.

Larger batches increase the chance of informative competitors, but they also increase the chance of false negatives: another row may cite the same relevant policy as the anchor while the loss treats it as wrong. Distributed training commonly gathers embeddings across GPUs before computing this loss. Plain gradient accumulation doesn't create more in-batch negatives unless the implementation explicitly reuses embeddings across microbatches.

audit-false-negatives-before-training.py

batch = [
    {"query": "Can I return a cracked tablet?", "policy_id": "return-policy-us-v3"},
    {"query": "What if electronics arrive damaged?", "policy_id": "return-policy-us-v3"},
    {"query": "Where is my return label?", "policy_id": "shipping-label-v2"},
]

false_negatives = []
for anchor_index, anchor in enumerate(batch):
    for candidate_index, candidate in enumerate(batch):
        if anchor_index == candidate_index:
            continue
        if candidate["policy_id"] == anchor["policy_id"]:
            false_negatives.append(
                f"row {anchor_index} treats row {candidate_index} as negative"
            )

print("false negatives:", false_negatives)
print("action: deduplicate shared policy positives before InfoNCE")

Output

false negatives: ['row 0 treats row 1 as negative', 'row 1 treats row 0 as negative']
action: deduplicate shared policy positives before InfoNCE
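Deduplicating batches is one remedy. Another option, sketched here as an assumption rather than a prescribed method, is to drop known false negatives from the InfoNCE denominator for each row. The logits are hypothetical temperature-scaled scores for the first query in the batch above.

```python
from math import exp, log

def logsumexp(values: list[float]) -> float:
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def masked_row_loss(
    logits: list[float],
    positive_index: int,
    group_ids: list[str],
) -> float:
    """Cross-entropy for one row, skipping candidates that share the
    anchor's group (here: policy_id) but are not its labeled positive."""
    anchor_group = group_ids[positive_index]
    kept = [
        logit
        for index, logit in enumerate(logits)
        if index == positive_index or group_ids[index] != anchor_group
    ]
    return logsumexp(kept) - logits[positive_index]

policy_ids = ["return-policy-us-v3", "return-policy-us-v3", "shipping-label-v2"]
row_logits = [8.0, 7.5, 1.0]  # hypothetical scaled scores for row 0

unmasked_loss = logsumexp(row_logits) - row_logits[0]
masked_loss = masked_row_loss(row_logits, 0, policy_ids)

print("unmasked loss:", round(unmasked_loss, 4))
print("masked loss:", round(masked_loss, 4))
```

Without the mask, the row is penalized for scoring a passage from the same policy highly, which is exactly the behavior retrieval needs.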

Hard negative mining

Why hard negatives matter

Random negatives are too easy. The model quickly learns to distinguish "refund label missing" from "warehouse forklift battery." Hard negatives force the model to learn subtle semantic distinctions, which is where real retrieval quality comes from.

Negative type	Anchor	Candidate	Why it matters
Easy negative	"How do I print a return label?"	"The warehouse forklift needs charging"	Different topic; the model learns this separation almost immediately
Hard negative	"How do I print a return label?"	"How do I replace a damaged shipping label?"	Same keywords, different intent; forces fine-grained learning

Hard negative mining for sentence embeddings comparing easy negatives, BM25 lexical negatives, cross-encoder filtered hard negatives, and iterative mining as the model improves. — Hard negative mining works because lexical similarity alone isn't enough: the best training examples share words with the anchor but answer a different intent.

Mining strategies

1. In-batch negatives

Use other examples in the batch. Simple, scales with batch size, but negatives may be too easy.

2. BM25 negatives

Use a lexical search algorithm like BM25 (Best Matching 25, a sparse retrieval function based on keyword frequency) to find documents that are lexically similar but semantically different:

text

Query: "How do I print a return label?"
Hard negative: "Return label printer calibration guide"  # shares "return" and "label" but answers a different question
Easy negative: "Forklift battery maintenance schedule"
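For intuition about what the lexical scorer rewards, here is a compact sketch of one common BM25 variant. The tokenizer, the three-document corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; production systems normally rely on a search engine or a maintained library instead.

```python
from collections import Counter
from math import log

def tokenize(text: str) -> list[str]:
    return [part.strip("?.!,").lower() for part in text.split()]

def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in the corpus against the query."""
    docs = [tokenize(document) for document in corpus]
    average_length = sum(len(doc) for doc in docs) / len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scores: list[float] = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = log((len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(doc) / average_length)
            score += idf * term_freq[term] * (k1 + 1.0) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

corpus = [
    "Where can I generate a return shipping label?",
    "Return label printer calibration guide",
    "Forklift battery maintenance schedule",
]
scores = bm25_scores("How do I print a return label?", corpus)

for document, score in zip(corpus, scores):
    print(f"{score:.3f}  {document}")
```

The printer guide earns a non-zero score purely from shared keywords, which is what makes it a candidate hard negative, while the forklift document scores zero.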

3. Cross-encoder mining

Use a cross-encoder to find high-BM25-score but low-relevance passages. The code below demonstrates the control flow with deterministic helper functions: lexical overlap stands in for BM25, and a small relevance scorer stands in for the cross-encoder. The important behavior is the filter: keep candidates that look lexically relevant but fail the semantic relevance check.

3-cross-encoder-mining.py

def tokens(text: str) -> set[str]:
    return {part.strip("?.!,").lower() for part in text.split()}

def lexical_overlap(query: str, candidate: str) -> int:
    return len(tokens(query) & tokens(candidate))

def cross_encoder_relevance(query: str, candidate: str) -> float:
    query_terms = tokens(query)
    candidate_terms = tokens(candidate)

    if "print" in query_terms and "print" in candidate_terms:
        return 0.92
    if "generate" in candidate_terms and "label" in candidate_terms:
        return 0.81
    return 0.18

def mine_hard_negatives(
    queries: list[str],
    corpus: list[str],
    top_k: int = 2,
) -> dict[str, list[str]]:
    hard_negatives: dict[str, list[str]] = {}

    for query in queries:
        lexical_candidates = sorted(
            corpus,
            key=lambda candidate: lexical_overlap(query, candidate),
            reverse=True,
        )

        hard_negatives[query] = [
            candidate
            for candidate in lexical_candidates
            if lexical_overlap(query, candidate) > 0
            and cross_encoder_relevance(query, candidate) < 0.30
        ][:top_k]

    return hard_negatives

queries = ["How do I print a return label?"]
corpus = [
    "How do I print a return shipping label?",
    "Return label printer calibration guide",
    "How do I replace a damaged shipping label?",
    "Forklift battery maintenance schedule",
]

hard = mine_hard_negatives(queries, corpus)
print(hard[queries[0]])

Output

['How do I replace a damaged shipping label?', 'Return label printer calibration guide']

4. Iterative mining

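Re-mine hard negatives periodically, for example after each training epoch, using the improved model. This progressively finds harder examples as the model improves. Keep a relevance filter in the loop, because a stronger retriever also surfaces more unlabeled positives that would otherwise be trained as negatives.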

Bi-encoder vs cross-encoder

Bi-encoder, cross-encoder, and late-interaction retrieval architectures compared by query-document interaction, precomputability, latency, relevance detail, and index size. — Embedding retrieval architecture is a placement decision: bi-encoders interact at vector comparison time, cross-encoders interact inside attention, and late-interaction models keep token-level matching without full pairwise scoring.

Bi-encoder (dual encoder)

Encode the query and each document independently, then compare the vectors with a dot product or cosine similarity.

Advantages

Documents can be pre-encoded and indexed. At query time, you only encode the query once, then use an ANN (Approximate Nearest Neighbor) index to retrieve candidates sublinearly in practice.

Disadvantages

No cross-attention between query and document; it may miss fine-grained relevance signals.
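The precompute-then-search flow can be sketched as follows. The passage vectors are hypothetical values standing in for encoder output, and the exact scan over every document is a stand-in for an ANN index, which would avoid visiting the whole corpus.

```python
from math import sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

# Offline step: passage vectors are computed once and stored.
# These numbers are hypothetical, not real model output.
doc_index = {
    "return-policy-us-v3": normalize([0.9, 0.1, 0.2]),
    "shipping-label-v2": normalize([0.2, 0.9, 0.1]),
    "forklift-battery": normalize([0.1, 0.1, 0.9]),
}

def search(
    query_vector: list[float],
    index: dict[str, list[float]],
    top_k: int = 2,
) -> list[tuple[str, float]]:
    """Online step: encode the query once, then score stored vectors."""
    query = normalize(query_vector)
    scored = sorted(
        ((dot(query, vector), doc_id) for doc_id, vector in index.items()),
        reverse=True,
    )
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]

results = search([0.8, 0.2, 0.1], doc_index)
print(results)
```

Because the query never touches document text at search time, all document-side cost is paid once during indexing.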

Cross-encoder

Concatenate the query and document, then process them jointly through the full transformer.

Advantages

Full attention can model phrase order, negation, and query-document interactions that a single-vector score misses. On a suitable shortlist, this often improves precision over first-stage vector scoring.

Disadvantages

You must run inference for every (query, document) pair. If you scored the full corpus directly, that's O(N) transformer passes per query, which is too slow for large-scale search.

Late interaction: ColBERT

ColBERT (Contextualized Late Interaction over BERT)^[7] offers a middle ground between bi-encoder and cross-encoder.

ColBERT late interaction MaxSim scoring showing query-token vectors matched against document-token vectors, max similarity selected per query token, and summed into a relevance score. — ColBERT stores token vectors for documents, then scores a query by taking each query token's best document-token match and summing those MaxSim values.

Instead of a single embedding per document, ColBERT stores per-token embeddings and computes relevance using MaxSim: for each query token, find the maximum similarity to any document token, then sum.

Advantages

Retains token-level matching signals while documents can still be pre-encoded.

Disadvantages

Much larger index size (one vector per token instead of per document).

This architecture choice is a budget tradeoff, not a universal ranking. A bi-encoder protects first-stage recall at corpus scale; a cross-encoder can improve precision for a bounded candidate set; late interaction spends more index space to preserve token-level evidence.

maxsim-keeps-token-matches.py

similarities = {
    "return": {"damaged": 0.12, "return": 0.93, "tablet": 0.08},
    "tablet": {"damaged": 0.19, "return": 0.07, "tablet": 0.91},
}

best_by_query_token = {
    query_token: max(document_scores.values())
    for query_token, document_scores in similarities.items()
}
score = sum(best_by_query_token.values())

print("best token scores:", best_by_query_token)
print("MaxSim score:", round(score, 2))

Output

best token scores: {'return': 0.93, 'tablet': 0.91}
MaxSim score: 1.84
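The index-size cost of storing one vector per token can be estimated with simple arithmetic. Every number below is an assumption chosen for illustration (corpus size, tokens per passage, vector widths, uncompressed float32 storage); real deployments often compress token vectors, so treat this as an upper-bound sketch.

```python
def index_bytes(
    num_documents: int,
    vectors_per_document: int,
    dimensions: int,
    bytes_per_value: int = 4,  # float32, assumed
) -> int:
    """Raw storage for an uncompressed dense index."""
    return num_documents * vectors_per_document * dimensions * bytes_per_value

# Hypothetical corpus: one million passages of about 128 tokens each.
single_vector = index_bytes(1_000_000, vectors_per_document=1, dimensions=768)
per_token = index_bytes(1_000_000, vectors_per_document=128, dimensions=128)

print(f"single-vector index: {single_vector / 1e9:.2f} GB")
print(f"per-token index:     {per_token / 1e9:.2f} GB")
print(f"ratio: {per_token / single_vector:.1f}x")
```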

A production reranking pattern

Two-stage retrieval pipeline where a bi-encoder retrieves a top-10 shortlist within a 200 millisecond example budget, a cross-encoder reranks all 10 candidates, and final results are served. — Bi-encoder retrieval protects recall across the whole corpus. Cross-encoder attention is reserved for a small shortlist where precision gains justify the latency cost.

Production systems commonly combine these two architectures in a two-stage pipeline to balance speed and accuracy. The function below illustrates this pattern with deterministic, precomputed scores: first retrieve a broad candidate set using a fast bi-encoder score, then re-score only those candidates with a slower cross-encoder score and return the final ranking.

the-reranking-pattern-production-standard.py

documents = [
    {
        "id": "carrier-delay-credit",
        "text": "Carrier delay credit policy for late packages",
        "bi_score": 0.82,
        "cross_score": 0.97,
    },
    {
        "id": "return-label-printer",
        "text": "Return label printer calibration guide",
        "bi_score": 0.79,
        "cross_score": 0.22,
    },
    {
        "id": "late-package-refund",
        "text": "Refund workflow for late package delivery",
        "bi_score": 0.77,
        "cross_score": 0.91,
    },
    {
        "id": "forklift-battery",
        "text": "Warehouse forklift battery maintenance",
        "bi_score": 0.10,
        "cross_score": 0.05,
    },
]

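def search_with_rerank(
    query: str,
    corpus: list[dict[str, str | float]],
    candidate_k: int = 3,
    top_k: int = 2,
) -> list[str]:
    # `query` is unused in this toy: every document carries precomputed scores.
    # A real system would encode the query and score documents against it.
    # Stage 1: keep the candidate_k documents with the best bi-encoder scores.
    candidates = sorted(corpus, key=lambda doc: float(doc["bi_score"]), reverse=True)[
        :candidate_k
    ]
    # Stage 2: reorder only that shortlist by the slower cross-encoder score.
    reranked = sorted(
        candidates,
        key=lambda doc: float(doc["cross_score"]),
        reverse=True,
    )
    return [str(doc["id"]) for doc in reranked[:top_k]]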

results = search_with_rerank("late package refund", documents)
print(results)

Output

['carrier-delay-credit', 'late-package-refund']

Instruction-tuned embeddings

The problem with task ambiguity

"Label" means different things for clustering (group return-label documents), retrieval (find barcode-label troubleshooting), and classification (is this about shipping or inventory?). Traditional embeddings produce the same vector regardless of the downstream task. This limits performance when a single model must serve multiple purposes.

Task-specific prefixes and instructions

Some embedding families expose task prefixes or lightweight instructions that steer the encoder toward retrieval, clustering, or classification. E5 is a simple example: it distinguishes inputs like query: and passage: during contrastive pre-training.^[5] INSTRUCTOR-style models go further and condition the embedding on an explicit task instruction.^[8] The format is model-specific. Prefixes that help one family can hurt another, so follow its training or model documentation.

This example keeps the families separate on purpose. The exact prefix or instruction string depends on the embedding family you chose:

task-specific-prefixes-and-instructions.py

def format_e5_pair(query: str, passage: str) -> tuple[str, str]:
    """E5-style inputs are prefixed strings."""
    return f"query: {query}", f"passage: {passage}"

def format_instructor_input(instruction: str, text: str) -> list[str]:
    """INSTRUCTOR-style inputs are commonly [instruction, text] pairs."""
    return [instruction, text]

e5_query, e5_passage = format_e5_pair(
    "what is contrastive learning?",
    "Contrastive learning pulls positive pairs together in embedding space.",
)

instructor_example = format_instructor_input(
    "Represent the Wikipedia question for retrieving supporting documents:",
    "what is contrastive learning?",
)

assert e5_query.startswith("query: ")
assert e5_passage.startswith("passage: ")
assert instructor_example[0].startswith("Represent")
assert instructor_example[1] == "what is contrastive learning?"

This doesn't mean one magic prefix solves every task. It means some embedding families expect an extra conditioning signal. Use the format documented for that specific model family, then benchmark it on your own retrieval, clustering, or classification workload.

Matryoshka representation learning (MRL)

The idea

Matryoshka embeddings are named after nesting dolls because selected prefix widths are trained to remain useful on their own. For example, a full 768-dimensional embedding can be trained together with 128- and 32-dimensional prefixes. You then choose among trained and evaluated widths based on the storage-quality budget.

The training recipe applies a separate loss at each selected prefix width so that every prefix stays useful on its own:^[9]

Figure: One embedding trained with losses at several prefix dimensions. Matryoshka training applies contrastive losses at multiple prefix sizes so smaller dimensions remain usable instead of becoming arbitrary truncations.

For a contrastively trained embedding model, the loss can be computed at several predefined truncation points simultaneously. The following runnable toy implementation slices full embeddings down to smaller prefixes, calculates the same InfoNCE objective at each prefix, and averages the losses:

the-idea.py

from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

def logsumexp(values: list[float]) -> float:
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def info_nce_loss(
    query_vectors: list[list[float]],
    positive_vectors: list[list[float]],
    temperature: float = 0.2,
) -> float:
    queries = [normalize(vector) for vector in query_vectors]
    positives = [normalize(vector) for vector in positive_vectors]
    losses: list[float] = []

    for row, query in enumerate(queries):
        logits = [dot(query, candidate) / temperature for candidate in positives]
        losses.append(logsumexp(logits) - logits[row])

    return sum(losses) / len(losses)

def matryoshka_loss(
    embeddings_a: list[list[float]],
    embeddings_b: list[list[float]],
    dims: tuple[int, int, int] = (2, 4, 6),
) -> float:
    losses = []

    for dim in dims:
        truncated_a = [row[:dim] for row in embeddings_a]
        truncated_b = [row[:dim] for row in embeddings_b]
        losses.append(info_nce_loss(truncated_a, truncated_b))

    return sum(losses) / len(losses)

queries = [[1.0, 0.0, 0.9, 0.1, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9, 0.2, 0.5]]
positives = [[0.95, 0.05, 0.85, 0.15, 0.45, 0.25], [0.05, 0.95, 0.15, 0.85, 0.25, 0.45]]

loss = matryoshka_loss(queries, positives)
print(round(loss, 4))

Output

0.0153

Why it matters

Benefit	Why it matters
Flexible deployment	Use the full embedding width for maximum accuracy, or a smaller prefix when storage is constrained.
No retraining	One model can serve several dimensionality budgets.
Graceful degradation	Performance should drop smoothly as dimensions shrink, but you still need to benchmark each cutoff.
Deployment constraint	Shorten only at dimensions a chosen model documents or you validate; arbitrary slicing isn't guaranteed to preserve rankings.
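On the serving side, shortening a Matryoshka-style embedding usually means keeping the leading coordinates and re-normalizing before cosine scoring. The following is a minimal sketch under stated assumptions: the helper names `truncate_and_normalize` and `cosine_ranking`, the 6-dimensional toy vectors, the document ID `damaged-electronics-returns`, and the widths 2, 4, and 6 are all illustrative and don't come from any library or model card. It shows one way to check whether a shorter width preserves the full-width ranking on your own data.

```python
from math import sqrt

def truncate_and_normalize(vector: list[float], width: int) -> list[float]:
    """Keep the first `width` coordinates, then rescale to unit length."""
    if not 0 < width <= len(vector):
        raise ValueError("width must be between 1 and the full dimension")
    prefix = vector[:width]
    norm = sqrt(sum(value * value for value in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm and cannot be normalized")
    return [value / norm for value in prefix]

def cosine_ranking(
    query: list[float],
    passages: dict[str, list[float]],
    width: int,
) -> list[str]:
    """Rank passage IDs by cosine similarity using only the first `width` dims."""
    short_query = truncate_and_normalize(query, width)
    scores = {
        doc_id: sum(
            a * b
            for a, b in zip(short_query, truncate_and_normalize(vector, width))
        )
        for doc_id, vector in passages.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors (hypothetical); a real check would use embeddings from your model.
query = [1.0, 0.0, 0.9, 0.1, 0.5, 0.2]
passages = {
    "damaged-electronics-returns": [0.95, 0.05, 0.85, 0.15, 0.45, 0.25],
    "carrier-delay-credit": [0.05, 0.95, 0.15, 0.85, 0.25, 0.45],
}

full_ranking = cosine_ranking(query, passages, width=6)
for width in (2, 4, 6):
    ranking = cosine_ranking(query, passages, width)
    print(width, ranking, "matches full width:", ranking == full_ranking)
```

A ranking match on two toy passages proves nothing about a real index. The same loop run over a labeled query set is what tells you whether a given width is safe to ship.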

Evaluation: STS and MTEB

Semantic textual similarity (STS)

Before broad benchmarks like MTEB existed, sentence embeddings were commonly evaluated on Semantic Textual Similarity (STS). Datasets like STS-B (STS Benchmark) provide sentence pairs that human annotators rated for semantic relatedness on a 0 to 5 scale:

text

"A package is delayed" / "A shipment is running late" => 4.5
"A package is delayed" / "A refund has posted" => 1.2

To evaluate a model, you compute the cosine similarity for every pair using the model's embeddings, and then calculate the Spearman rank correlation between the model's similarity scores and the human ratings. A high correlation means the model's embedding space aligns well with human judgment.
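The evaluation loop described above can be sketched in a few lines. This is an illustration, not a benchmark harness: the three-dimensional vectors stand in for real model embeddings, the ratings are made up, and the helper names are hypothetical. Spearman correlation is computed here as the Pearson correlation of ranks, with tied values sharing their average rank. In practice a library routine such as `scipy.stats.spearmanr` would typically replace the hand-written version.

```python
from math import sqrt

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(xs: list[float], ys: list[float]) -> float:
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)

def spearman(xs: list[float], ys: list[float]) -> float:
    return pearson(average_ranks(xs), average_ranks(ys))

# (embedding_a, embedding_b, human_rating): toy values, not real STS-B data.
pairs = [
    ([0.9, 0.1, 0.2], [0.8, 0.2, 0.3], 4.5),
    ([0.9, 0.1, 0.2], [0.1, 0.9, 0.3], 1.2),
    ([0.2, 0.8, 0.1], [0.3, 0.7, 0.3], 4.8),
    ([0.5, 0.5, 0.0], [0.0, 0.1, 0.9], 0.4),
]

model_scores = [cosine(a, b) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
print(round(spearman(model_scores, human_scores), 3))
```

For these toy values the correlation is 0.8 rather than 1.0, because the model and the annotators disagree on the order of the two highest-rated pairs. Spearman only looks at rank order, so the absolute cosine values never need to match the 0 to 5 rating scale.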

MTEB (Massive text embedding benchmark)

As models improved, optimizing only for STS was no longer sufficient: a model that excels at STS can still perform poorly at information retrieval or clustering. To address this, the Massive Text Embedding Benchmark (MTEB) was introduced.^[10] The original MTEB paper evaluated models on 58 datasets grouped into 8 task categories, giving a much broader view of embedding quality than STS alone:

Task	# Datasets	Example
Classification	12	Sentiment, topic
Clustering	11	Document clustering
Pair Classification	3	Paraphrase detection
Reranking	4	Passage reranking
Retrieval	15	Question-passage retrieval
STS	10	Semantic similarity
Summarization	1	Summary similarity
Bitext Mining	2	Parallel sentence mining

Figure: STS as a narrow semantic-similarity check versus MTEB's broader task coverage. STS is a useful narrow check, but production model choice should look across retrieval, reranking, clustering, classification, latency, and storage behavior.

Choosing a model in practice

Don't anchor on a single MTEB average. Deployment success usually depends on a few operational questions:

Does the model expect plain text, query/passage prefixes, or explicit instructions?
Can you shorten the embedding width safely, or are you locked into the full dimensionality?
How well does it handle your language mix, domain jargon, and query length distribution?
What are the latency, throughput, and memory costs once you batch and index it at production scale?
Do you still need BM25 or a cross-encoder reranker to hit Recall@K and NDCG targets?

In practice, architecture often matters more than a tiny leaderboard delta. A strong bi-encoder with good hard negatives, sensible chunking, and a reranker usually beats blindly swapping to the latest model name.

The original MTEB finding is the durable lesson: no one model dominated every task category.^[10] Evaluate policy retrieval, reranking, languages, latency, and storage behavior that match your deployment instead of selecting by one aggregate score.
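Evaluating on your own deployment usually comes down to a small labeled query set and two ranking metrics. The sketch below computes Recall@K and NDCG@K with binary relevance. The function names, the two-query evaluation set, and the ranked lists are hypothetical, and the code assumes each ranked list contains unique document IDs.

```python
from math import log2

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / log2(position + 2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal_dcg = sum(1.0 / log2(position + 2) for position in range(ideal_hits))
    return dcg / ideal_dcg if ideal_dcg else 0.0

# Hypothetical retriever output for two labeled queries.
eval_set = [
    {
        "ranked": ["return-policy-us-v3", "warehouse-note-12", "shipping-label-v2"],
        "relevant": {"return-policy-us-v3"},
    },
    {
        "ranked": ["warehouse-note-12", "shipping-label-v2", "return-policy-us-v3"],
        "relevant": {"shipping-label-v2"},
    },
]

k = 2
mean_recall = sum(recall_at_k(c["ranked"], c["relevant"], k) for c in eval_set) / len(eval_set)
mean_ndcg = sum(ndcg_at_k(c["ranked"], c["relevant"], k) for c in eval_set) / len(eval_set)
print(f"Recall@{k}: {mean_recall:.3f}")
print(f"NDCG@{k}: {mean_ndcg:.3f}")
```

Both queries find their passage inside the top 2, so Recall@2 is perfect, while NDCG@2 is lower because the second query ranks its passage in position 2. Tracking both separates "the passage was retrieved" from "the passage was ranked first".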

Carry the evidence boundary into retrieval evaluation

document_qa_v2 can use an embedding retriever to propose policy passages, but vector proximity isn't authorization. A release test should verify both retrieval quality and that unapproved text never becomes answer evidence:

policy-retrieval-release-gate.py

approved_evidence = {"return-policy-us-v3", "shipping-label-v2"}
retrieval_cases = [
    {
        "query": "Can I return a cracked tablet?",
        "expected": "return-policy-us-v3",
        "candidates": ["seller-private-note-44", "return-policy-us-v3"],
    },
    {
        "query": "Where do I get a return label?",
        "expected": "shipping-label-v2",
        "candidates": ["shipping-label-v2", "warehouse-note-12"],
    },
]
attack_candidates = ["seller-private-note-44"]

def approved_candidate(candidates: list[str]) -> str | None:
    return next((doc for doc in candidates if doc in approved_evidence), None)

served = [approved_candidate(case["candidates"]) for case in retrieval_cases]
hits = sum(
    evidence == case["expected"]
    for evidence, case in zip(served, retrieval_cases)
)
attack_evidence = approved_candidate(attack_candidates)

print("approved evidence recall@2:", f"{hits / len(retrieval_cases):.0%}")
print("served evidence:", served)
print("private-note attack evidence:", attack_evidence)

Output

approved evidence recall@2: 100%
served evidence: ['return-policy-us-v3', 'shipping-label-v2']
private-note attack evidence: None

Key libraries and tools

Two widely used libraries for training and searching sentence embeddings:

Tool	What it gives you
Sentence-Transformers (`sentence-transformers`)	Pretrained sentence embedding models, pooling modules, contrastive training losses such as MultipleNegativesRankingLoss, and batched encoding utilities.
FAISS (Facebook AI Similarity Search)	Efficient similarity search and clustering for dense vectors, including inverted-file and product-quantization approaches for approximate nearest neighbor search.^[11]

Mastery check

Key concepts

alignment and uniformity in embedding space
InfoNCE numerator, denominator, and temperature
hard negatives vs easy negatives
bi-encoder vs cross-encoder vs late interaction
reranking as recall first, then precision
Matryoshka prefix training for safe dimension cuts

Evaluation rubric

Foundational: Derives the InfoNCE objective and explains what the numerator, denominator, and temperature do.
Intermediate: Explains why raw BERT [CLS] embeddings fail for semantic search without sentence-level contrastive fine-tuning.
Intermediate: Explains why hard negatives matter more than random negatives once the model already separates broad topics.
Advanced: Compares bi-encoders, cross-encoders, and late-interaction models by latency, indexability, accuracy, and storage.
Advanced: Explains ColBERT's MaxSim scoring and why it keeps more token-level signal than a single document vector.
Advanced: Explains Matryoshka embeddings and when shorter prefixes are worth the storage-accuracy tradeoff.
Advanced: Designs a two-stage production retrieval pipeline with recall and latency budgets defended quantitatively.

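Follow-up questions

Your cross-encoder gives excellent scores when you manually include the correct passage, but production search still misses that passage. Which stage is failing?
Why do hard negatives improve embedding quality more than random warehouse-style negatives?
When can you safely shorten a sentence embedding from 768 dimensions to 128?
When is a cross-encoder the right tool even if a bi-encoder already exists?
Why does ColBERT usually recover more relevance detail than a standard bi-encoder, and what price does it pay?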

Common pitfalls

Raw [CLS] is treated like a search-ready sentence embedding

Symptom: Nearly every query-document pair gets suspiciously similar cosine scores. Cause: Raw BERT [CLS] vectors weren't tuned for semantic retrieval and can inherit poorly discriminative anisotropic geometry. Fix: Start from a sentence embedding model or fine-tune with a contrastive objective before building nearest-neighbor search.

Negatives stay too easy

Symptom: Training loss falls, but recall on realistic queries barely moves. Cause: Random negatives stop teaching once the model separates unrelated topics. Fix: Mine BM25 or cross-encoder negatives that share words with the anchor but answer a different intent.

The reranker is asked to save missing recall

Symptom: The reranker looks good in pairwise inspection, yet the right document is often absent in production results. Cause: The correct passage never entered the shortlist. Fix: Tune first-stage Recall@K separately, then widen candidate budget before blaming the reranker.

Dimensions are shortened blindly

Symptom: Index storage drops as expected, but retrieval quality falls off a cliff. Cause: A standard embedding vector was truncated without prefix-aware training or provider support. Fix: Use Matryoshka-trained or provider-documented shortening controls and benchmark each target width.

Task conditioning is ignored

Symptom: One embedding model works for clustering but underperforms on retrieval. Cause: The model family expected query/passage prefixes or instructions, but every input was embedded as plain text. Fix: Follow the model card format for retrieval, clustering, and classification separately.

Try it yourself

These exercises let you verify your understanding without needing a GPU cluster.

Exercise 1: compute triplet loss by hand

Given an anchor $a$, positive $p$, and negative $n$ with distances $d(a,p) = 0.3$ and $d(a,n) = 0.7$, compute the triplet loss for margins $m = 0.1$ and $m = 0.5$. In which case does the model still have work to do?

Solution sketch: For $m = 0.1$: $0.3 - 0.7 + 0.1 = -0.3$, so $\max(0, -0.3) = 0$. The margin is already satisfied. For $m = 0.5$: $0.3 - 0.7 + 0.5 = 0.1$, so $\max(0, 0.1) = 0.1$. The larger margin forces the model to pull the positive even closer or push the negative farther away.
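To check the hand calculation, here is a minimal sketch of the same formula, $\max(0,\, d(a,p) - d(a,n) + m)$, with the exercise's distances plugged in. The function name `triplet_loss` is illustrative.

```python
def triplet_loss(d_anchor_positive: float, d_anchor_negative: float, margin: float) -> float:
    """Hinge on the gap between positive and negative distances."""
    return max(0.0, d_anchor_positive - d_anchor_negative + margin)

for margin in (0.1, 0.5):
    loss = triplet_loss(0.3, 0.7, margin)
    status = "margin satisfied" if loss == 0.0 else "still has work to do"
    print(f"margin={margin}: loss={loss:.1f} ({status})")
```

A zero loss contributes no gradient, so with $m = 0.1$ this triplet no longer teaches the model anything.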

Exercise 2: spot the embedding mistake

A teammate reports that their semantic search system returns nearly identical similarity scores for every query-document pair. They are using a pretrained BERT model and taking the [CLS] token as the sentence embedding. What is the most likely cause, and what is the smallest change that would fix it?

Solution sketch: Raw BERT [CLS] embeddings weren't trained to make cosine distance a semantic-retrieval score, and anisotropic geometry can make their scores poorly discriminative. The smallest fix is to switch to a sentence embedding model that was fine-tuned with a sentence-level objective (for example, SBERT or E5), rather than using raw BERT.

Exercise 3: design a two-stage retrieval pipeline

You have 2 million support tickets and a latency budget of 200 ms per query. You own a bi-encoder that encodes a query in 10 ms and a cross-encoder that scores one query-document pair in 15 ms. Why is scoring the full corpus with the cross-encoder impossible, and what pipeline would hit the latency budget?

Solution sketch: Scoring every ticket one pair at a time costs $2{,}000{,}000 \times 15\,\text{ms} = 30{,}000{,}000\,\text{ms} \approx 8.3$ hours per query, so the cross-encoder can't run over the full corpus. Instead, use the bi-encoder and an ANN index to build a shortlist, then rerank only that shortlist. Query encoding takes 10 ms and reranking a top-10 shortlist takes $10 \times 15\,\text{ms} = 150\,\text{ms}$, which leaves 40 ms for ANN lookup and overhead inside the 200 ms budget. If Recall@10 is inadequate, the budget or reranker throughput must change; silently widening to top 100 would cost 1,500 ms of reranking alone and violate the requirement.
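The budget arithmetic can be written as a small helper. This is a sketch under two assumptions: the cross-encoder scores pairs one at a time, and the 40 ms allowance for ANN lookup and overhead is a ceiling you would measure rather than take as given. The function name `max_rerank_candidates` is hypothetical.

```python
def max_rerank_candidates(
    budget_ms: int,
    encode_ms: int,
    ann_and_overhead_ms: int,
    pair_ms: int,
) -> int:
    """Largest shortlist the reranker can score inside the remaining budget."""
    remaining_ms = budget_ms - encode_ms - ann_and_overhead_ms
    return max(0, remaining_ms // pair_ms)

# Brute force: every ticket scored sequentially by the cross-encoder.
full_corpus_hours = 2_000_000 * 15 / 1000 / 3600

# Two-stage pipeline with the exercise's numbers.
shortlist = max_rerank_candidates(
    budget_ms=200, encode_ms=10, ann_and_overhead_ms=40, pair_ms=15
)
total_ms = 10 + 40 + shortlist * 15

print(f"full-corpus cross-encoder: {full_corpus_hours:.1f} hours per query")
print(f"shortlist size: {shortlist}, worst-case latency: {total_ms} ms")
```

If batching lets the cross-encoder score several pairs per forward pass, `pair_ms` drops and the affordable shortlist grows, which is the "reranker throughput must change" option from the solution.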

What you have now

You now understand why raw transformer outputs fail for semantic search, how contrastive learning reshapes the embedding space, and how to train and deploy sentence embeddings in production. You can explain InfoNCE and triplet loss with concrete numbers, mine hard negatives from lexical overlap, and design a two-stage retrieval pipeline that balances speed and accuracy.

The next article, Embedding Similarity & Quantization, builds directly on this foundation. You will learn the mathematical details of cosine similarity versus dot product, how Matryoshka truncation changes the accuracy-storage tradeoff, and how to quantize embeddings to 8-bit, 4-bit, or binary formats for large-scale indexes. Those techniques only make sense once you understand why the embedding space was shaped by contrastive loss in the first place.

Next Step

Continue to Embedding Similarity & Quantization

Contrastive learning explains how useful sentence vectors are trained; similarity scoring and quantization show how those vectors are searched and stored efficiently at scale.


References

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

Reimers, N., & Gurevych, I. · 2019 · EMNLP 2019

How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.

Ethayarajh, K. · 2019

SimCSE: Simple Contrastive Learning of Sentence Embeddings.

Gao, T., Yao, X., & Chen, D. · 2021 · EMNLP 2021

Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere.

Wang, T., & Isola, P. · 2020 · ICML 2020

Text Embeddings by Weakly-Supervised Contrastive Pre-training.

Wang, L., et al. · 2022

Representation Learning with Contrastive Predictive Coding.

Oord, A. van den, Li, Y., & Vinyals, O. · 2018

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.

Khattab, O., & Zaharia, M. · 2020 · SIGIR 2020

One Embedder, Any Task: Instruction-Finetuned Text Embeddings.

Su, H., et al. · 2022 · arXiv preprint

Matryoshka Representation Learning.

Kusupati, A., et al. · 2022 · NeurIPS 2022

MTEB: Massive Text Embedding Benchmark.

Muennighoff, N., et al. · 2023 · EACL 2023

Billion-scale similarity search with GPUs.

Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint
