Learn how contrastive losses train sentence embeddings, why hard negatives matter, and how retrieval systems combine bi-encoders, rerankers, and dimension tradeoffs.
In the production-agent capstone, document_qa_v2 found return-policy-us-v3 before the agent drafted an answer about a cracked tablet. That contract deliberately hid one important mechanism: how did a policy passage become a good candidate for the question in the first place?
A sentence embedding maps a query or passage to one fixed-width vector. A retriever can then find passages near a customer question even when wording changes:
| Query | Passage that should rank near it | Tempting non-match |
|---|---|---|
| "Can I return a tablet that arrived cracked?" | "Damaged electronics may be returned within 30 days." | "A private seller note requests an immediate refund." |
This chapter teaches how contrastive learning shapes that vector space. The agent still needs evidence and authorization gates; embeddings decide which text gets considered before those gates run.
Word embeddings gave individual tokens numerical coordinates. Retrieval needs one vector for an entire query or policy passage. The challenge is to compress variable-length text into a fixed-size representation whose neighborhood ordering is useful for the task.
Averaging context-free word vectors is a useful baseline, but it loses order and context. Don't confuse that baseline with mean pooling contextual token outputs inside a trained sentence encoder: pooling specifies how to produce one vector; the training objective determines whether its geometry supports retrieval. A common objective is contrastive learning, which rewards a relevant pair for scoring above irrelevant candidates.
A simple approach to creating a sentence embedding is to look up a word vector for each token in the sentence and then average them. This naive baseline is fast but loses important structural information. The function below uses a tiny local vector table so you can run the baseline and see exactly what it throws away:
```python
from math import fsum

word_vectors: dict[str, tuple[float, float, float]] = {
    "carrier": (0.9, 0.1, 0.0),
    "delayed": (0.8, 0.2, 0.1),
    "order": (0.7, 0.1, 0.2),
    "refund": (0.1, 0.9, 0.2),
    "posted": (0.2, 0.8, 0.1),
}

def mean_pool(
    sentence: str,
    vectors: dict[str, tuple[float, float, float]],
) -> tuple[float, float, float]:
    tokens = [token.lower() for token in sentence.split()]
    token_vectors = [vectors[token] for token in tokens if token in vectors]

    if not token_vectors:
        width = len(next(iter(vectors.values())))
        return tuple(0.0 for _ in range(width))

    # fsum returns a correctly rounded total, so the exact comparison below
    # does not depend on floating-point summation order.
    return tuple(
        fsum(vector[dim] for vector in token_vectors) / len(token_vectors)
        for dim in range(len(token_vectors[0]))
    )

pooled_a = mean_pool("carrier delayed order", word_vectors)
pooled_b = mean_pool("order delayed carrier", word_vectors)

print("A:", tuple(round(value, 3) for value in pooled_a))
print("B:", tuple(round(value, 3) for value in pooled_b))
print("Same vector:", pooled_a == pooled_b)
```

```text
A: (0.8, 0.133, 0.1)
B: (0.8, 0.133, 0.1)
Same vector: True
```

The final line is the problem: both sentences produce the same vector because averaging ignores order.
Using the [CLS] (Classification) token from BERT (Bidirectional Encoder Representations from Transformers) directly as a sentence embedding isn't a sound retrieval baseline. Reimers & Gurevych (2019) showed that BERT's paired-input architecture is impractical for large semantic search and introduced SBERT so independently encoded sentence vectors could be compared with cosine similarity.[1] A pretrained task token hasn't been trained to make nearest-neighbor distance a relevance score.
Contextual representations can show anisotropy: vectors occupy a narrow cone in the high-dimensional space rather than using directions evenly.[2] Think of many policy passages all pointing toward one generic "support text" direction. Their cosine scores can look high even when they answer different questions. Contrastive objectives can improve this geometry by rewarding aligned positives while penalizing competing candidates.
The goal of contrastive learning is simple but powerful: reshape the embedding space so that sentences with similar meaning land close to each other, while unrelated sentences are pushed apart.
Before we formalize it, recall what "close" means here. Cosine similarity measures how aligned two vectors point (their angle), ignoring their length. For two unit vectors (length 1), it's simply their dot product. A value of +1 means same direction, 0 means orthogonal directions, and a negative value means opposing directions. None of those numbers proves semantic identity or irrelevance by itself; only an evaluated embedding model makes cosine ranking useful. The next lesson studies this scoring rule in detail.
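As a minimal sketch of the scoring rule described above, the snippet below computes cosine similarity in plain Python. The function name `cosine_similarity` and the 2-D vectors are illustrative assumptions, chosen only to reproduce the three reference cases (same direction, orthogonal, opposing); they are not outputs of any embedding model.

```python
from math import sqrt

def cosine_similarity(left: list[float], right: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = sqrt(sum(a * a for a in left))
    right_norm = sqrt(sum(b * b for b in right))
    return dot / (left_norm * right_norm)

# Hypothetical vectors: length differs, only direction matters.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposing = cosine_similarity([1.0, 1.0], [-2.0, -2.0])

print("same direction:", round(same_direction, 6))
print("orthogonal:", round(orthogonal, 6))
print("opposing:", round(opposing, 6))
```

Because the function divides by both norms, scaling either input leaves the score unchanged, which is why unit-normalized vectors can use a plain dot product instead.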
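With that in mind, contrastive training teaches the model to raise the similarity between an anchor and its positive while lowering the similarity between that anchor and the negatives it is compared against.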
SimCSE (Simple Contrastive Learning of Sentence Embeddings)[3] demonstrated an unusually small self-supervised construction: pass the same sentence through the encoder twice while dropout supplies different noisy views, then treat those views as a positive pair. Wang & Isola's analysis gives useful vocabulary for the resulting geometry: alignment asks whether positives are close, and uniformity asks whether representations avoid crowding into a small region of the hypersphere.[4] E5 later trained single-vector text embeddings contrastively from a large weakly supervised pair corpus.[5]
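One way to make the alignment and uniformity vocabulary concrete is to compute both quantities on toy unit vectors. The sketch below assumes the commonly used forms of the two metrics (mean squared distance between positive pairs for alignment, and the log of the mean Gaussian potential with t = 2 for uniformity). The function names and the hand-placed 2-D vectors are hypothetical; lower is better for both metrics.

```python
from itertools import combinations
from math import cos, exp, log, pi, sin

def squared_distance(left: list[float], right: list[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(left, right))

def alignment(positive_pairs: list[tuple[list[float], list[float]]]) -> float:
    """Mean squared distance between positive pairs (lower = closer positives)."""
    total = sum(squared_distance(x, y) for x, y in positive_pairs)
    return total / len(positive_pairs)

def uniformity(vectors: list[list[float]], t: float = 2.0) -> float:
    """Log mean Gaussian potential over distinct pairs (lower = more spread out)."""
    pairs = list(combinations(vectors, 2))
    potential = sum(exp(-t * squared_distance(x, y)) for x, y in pairs)
    return log(potential / len(pairs))

def on_circle(angle: float) -> list[float]:
    return [cos(angle), sin(angle)]

# Hypothetical embeddings: one set crowded into a narrow arc, one spread out.
crowded = [on_circle(angle) for angle in (0.0, 0.1, 0.2)]
spread = [on_circle(angle) for angle in (0.0, pi / 2, pi, 3 * pi / 2)]

print("uniformity (crowded):", round(uniformity(crowded), 3))
print("uniformity (spread):", round(uniformity(spread), 3))
print("alignment (near-duplicate pair):", round(alignment([(crowded[0], crowded[1])]), 4))
```

The crowded set mirrors the narrow-cone geometry described earlier: its positives can be well aligned while its uniformity score stays poor.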
The most common training objective here is Information Noise Contrastive Estimation (InfoNCE): for one anchor sentence, the model should rank the true match above every other candidate in the batch.
Think of contrastive learning as matching a support ticket to its true duplicate while pushing away unrelated tickets. You want the matching pair close together and every non-match farther away. InfoNCE formalizes this push-and-pull dynamic mathematically.
Before looking at the formula, walk through a tiny concrete case. Suppose you have a batch of 2 query-passage pairs, $(q_1, p_1)$ and $(q_2, p_2)$, and after normalizing their embeddings you measure cosine similarities (which, for unit vectors, are just dot products):

| Query | Passage | Similarity |
|---|---|---|
| $q_1$ | $p_1$ | 0.90 |
| $q_1$ | $p_2$ | 0.20 |
| $q_2$ | $p_1$ | 0.15 |
| $q_2$ | $p_2$ | 0.85 |

For query $q_1$, the true match is $p_1$ (similarity 0.90). The other passage in the batch, $p_2$, acts as an in-batch negative (similarity 0.20). InfoNCE wants the model to make $p_1$ look more likely than $p_2$.
For this worked row, choose a sharp temperature $\tau = 0.05$ and compute the loss contribution for $q_1$ step by step:

$$\frac{0.90}{0.05} = 18, \qquad \frac{0.20}{0.05} = 4$$

$$P(p_1 \mid q_1) = \frac{e^{18}}{e^{18} + e^{4}} = \frac{1}{1 + e^{-14}} \approx 0.99999917$$

$$\mathcal{L}_1 = -\log P(p_1 \mid q_1) \approx 8.3 \times 10^{-7} \text{ nats}$$

If the model were wrong ($q_1$ similarity to $p_1$ only 0.20, to $p_2$ 0.90), the positive probability would drop to about $e^{-14} \approx 8.3 \times 10^{-7}$ and the loss would jump to roughly 14 nats. The optimizer would receive a strong gradient pushing the correct pair closer.
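The arithmetic for this row can be checked with a few lines of Python. This is a sketch under one assumption: the temperature is taken to be 0.05, the value implied by the "roughly 14 nats" figure for a 0.70 similarity gap. The helper name `row_loss` is hypothetical.

```python
from math import exp, log

def row_loss(
    positive_similarity: float,
    negative_similarity: float,
    temperature: float,
) -> float:
    """InfoNCE loss for one anchor with one positive and one in-batch negative."""
    positive_logit = positive_similarity / temperature
    negative_logit = negative_similarity / temperature
    # Subtract the largest logit before exponentiating to avoid overflow.
    peak = max(positive_logit, negative_logit)
    log_denominator = peak + log(
        exp(positive_logit - peak) + exp(negative_logit - peak)
    )
    return log_denominator - positive_logit

TAU = 0.05  # assumed temperature for the worked row

correct = row_loss(0.90, 0.20, TAU)  # true match scores highest
swapped = row_loss(0.20, 0.90, TAU)  # the negative outranks the true match

print(f"correct ranking loss: {correct:.2e} nats")
print(f"swapped ranking loss: {swapped:.2f} nats")
```

The two calls differ only in which passage holds the higher similarity, so the gap between the printed losses isolates the penalty for ranking the negative first.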
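The standard contrastive loss for a batch of $N$ positive pairs:[6]

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(q_i, p_i) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(q_i, p_j) / \tau\right)}$$

For each example $i$, compare its similarity to the true match (numerator) against the full batch-level denominator. That denominator includes the positive pair itself plus every other candidate in the batch. Temperature $\tau$ controls how sharply the model distinguishes between similar and dissimilar pairs. The loss says: "make the true pair more likely than every alternative in the batch."
Where:
- $q_i$ is the embedding of the $i$-th anchor (query).
- $p_i$ is the embedding of its positive; every $p_j$ with $j \neq i$ acts as an in-batch negative.
- $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity (the dot product of unit-normalized vectors).
- $\tau$ is the temperature.
- $N$ is the number of positive pairs in the batch.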
The following copy-runnable implementation shows the same calculation without hiding the matrix math behind a framework. Production training code would vectorize this in PyTorch, JAX, or another tensor library, but the loop below makes the denominator explicit:
```python
from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

def logsumexp(values: list[float]) -> float:
    # Subtract the peak so the largest exponent is exp(0) = 1.
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def row_cross_entropy(logits: list[float], correct: int) -> float:
    return logsumexp(logits) - logits[correct]

def info_nce_loss(
    query_vectors: list[list[float]],
    positive_vectors: list[list[float]],
    temperature: float = 0.2,
) -> float:
    queries = [normalize(vector) for vector in query_vectors]
    positives = [normalize(vector) for vector in positive_vectors]
    losses: list[float] = []

    for row, query in enumerate(queries):
        # Every passage in the batch is a candidate; index `row` is the positive.
        logits = [dot(query, candidate) / temperature for candidate in positives]
        losses.append(row_cross_entropy(logits, row))

    return sum(losses) / len(losses)

query_vectors = [[1.0, 0.0], [0.0, 1.0]]
positive_vectors = [[0.95, 0.05], [0.10, 0.90]]

loss = info_nce_loss(query_vectors, positive_vectors)
score_pos = dot(normalize(query_vectors[0]), normalize(positive_vectors[0]))
score_neg = dot(normalize(query_vectors[0]), normalize(positive_vectors[1]))
extreme_loss = row_cross_entropy([1000.0, 986.0], correct=0)

print("loss:", round(loss, 4))
print("q1 positive score:", round(score_pos, 4))
print("q1 negative score:", round(score_neg, 4))
print("stable large-logit loss:", f"{extreme_loss:.8f}")
```

```text
loss: 0.0104
q1 positive score: 0.9986
q1 negative score: 0.1104
stable large-logit loss: 0.00000083
```

The equation is often expanded into raw exponentials when calculating a small example on paper. Code should compute the same expression with log-sum-exp or a framework cross-entropy operation, so large logits don't overflow.
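A second contrastive objective, triplet loss, focuses on individual anchor-positive-negative triplets instead of comparing one anchor against a whole candidate pool:

$$\mathcal{L}_{\text{triplet}} = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)$$

Where:
- $a$ is the anchor embedding, $p$ the positive, and $n$ the negative.
- $d(\cdot, \cdot)$ is a distance function, such as Euclidean or cosine distance.
- $m$ is the margin.

The loss enforces that the anchor must be closer to the positive than to the negative by at least margin $m$: $d(a, p) + m \le d(a, n)$. If this constraint is already satisfied, the loss is zero. The margin prevents the model from wasting capacity pushing already-distant negatives even farther away.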
Consider three sentences about return labels:
| Role | Sentence |
|---|---|
| Anchor ($a$) | "How do I print a return label?" |
| Positive ($p$) | "Where can I generate a return shipping label?" |
| Negative ($n$) | "How do I replace a damaged shipping label?" |

After encoding, suppose the distances satisfy $d(a, p) + 0.1 \le d(a, n)$: the positive sits well inside the negative's distance.
With margin $m = 0.1$, plug into the formula:

$$\mathcal{L} = \max\bigl(0,\; d(a, p) - d(a, n) + 0.1\bigr) = 0$$

The loss is zero because the model already satisfies the margin constraint: the positive is closer than the negative by at least 0.1. Now imagine a bad model where $d(a, p) > d(a, n)$ (the negative is closer than the positive):

$$\mathcal{L} = d(a, p) - d(a, n) + 0.1 > 0$$
A non-zero loss tells the optimizer to push the anchor and positive together while pushing the negative away until the gap exceeds the margin.
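To see both cases numerically, here is a minimal sketch of the triplet loss. The distances 0.30 and 0.80 are hypothetical values picked for illustration, not measurements from an encoder, and the function name `triplet_loss` is likewise an assumption.

```python
def triplet_loss(
    d_anchor_positive: float,
    d_anchor_negative: float,
    margin: float,
) -> float:
    """Hinge on the gap between positive and negative distances."""
    return max(0.0, d_anchor_positive - d_anchor_negative + margin)

# Hypothetical distances: the good model keeps the positive much closer.
good_model = triplet_loss(d_anchor_positive=0.30, d_anchor_negative=0.80, margin=0.1)

# Hypothetical distances: the bad model places the negative closer.
bad_model = triplet_loss(d_anchor_positive=0.80, d_anchor_negative=0.30, margin=0.1)

print("good model loss:", round(good_model, 2))
print("bad model loss:", round(bad_model, 2))
```

Only the second call produces a gradient signal, which is why triplets that already satisfy the margin contribute nothing to training.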
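Temperature $\tau$ controls the "sharpness" of the softmax distribution over similarity scores: a low $\tau$ magnifies small similarity gaps into near-certain probabilities, while a high $\tau$ flattens the distribution so candidates receive more similar weight.
For the worked similarity gap, $0.90 - 0.20 = 0.70$, changing temperature changes the positive probability:

| $\tau$ | $P(\text{positive})$ for this row | What to inspect |
|---|---|---|
| 0.01 | ≈ 1.000000 | Saturates quickly; a false negative receives extreme pressure. |
| 0.05 | 0.999999 | Very sharp separation for this easy row. |
| 0.10 | 0.999089 | Still confident, with less sharpness. |
| 1.00 | 0.668188 | Much flatter signal. |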
There is no universal best temperature. Tune it against held-out retrieval failures and implement the loss stably. Low temperature amplifies mislabeled or false negatives; overflow is an implementation bug that stable log-softmax or cross-entropy avoids.
```python
from math import exp

def probability_of_positive(
    positive_similarity: float,
    negative_similarity: float,
    temperature: float,
) -> float:
    # With one positive and one negative, the softmax reduces to a sigmoid of the scaled gap.
    scaled_gap = (positive_similarity - negative_similarity) / temperature
    return 1.0 / (1.0 + exp(-scaled_gap))

for temperature in (0.01, 0.05, 0.10, 1.00):
    probability = probability_of_positive(0.90, 0.20, temperature)
    print(f"tau={temperature:.2f}: P(positive)={probability:.6f}")
```

```text
tau=0.01: P(positive)=1.000000
tau=0.05: P(positive)=0.999999
tau=0.10: P(positive)=0.999089
tau=1.00: P(positive)=0.668188
```

Training a good embedding model requires good training data. You need pairs of sentences that are either semantically similar or different, and you need enough diversity to teach the model real distinctions. Two main approaches dominate.
Natural Language Inference (NLI) labels whether a hypothesis follows from a premise (entailment), conflicts with it (contradiction), or does neither (neutral). Entailment is directional, not a promise that two texts are interchangeable.
SBERT trained Siamese and triplet architectures with NLI supervision and evaluated sentence similarity behavior.[1] Supervised SimCSE uses entailment as a positive and the corresponding contradiction as a hard negative.[3] This is a useful training construction, but you still need retrieval evaluation before treating two policy passages as substitutes.
SBERT's architectural move is simple but important: it runs both sentences through the same encoder with shared weights, pools each sentence into a single vector, and then trains on top of those pooled embeddings.[1] During inference, you only keep the single-sentence encoding path. That shared-weight Siamese network setup (two inputs processed through the same shared encoder) is what makes precomputing document embeddings and doing nearest-neighbor search practical.
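The pooling step can be sketched without a transformer. The snippet below assumes the encoder has already produced one contextual vector per token and that an attention mask marks padding with 0; the function name `masked_mean_pool` and the 2-D token outputs are hypothetical stand-ins, not SBERT's actual code.

```python
def masked_mean_pool(
    token_vectors: list[list[float]],
    attention_mask: list[int],
) -> list[float]:
    """Average only the vectors of real tokens (mask 1), skipping padding (mask 0)."""
    kept = [vector for vector, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    width = len(kept[0])
    return [sum(vector[dim] for vector in kept) / len(kept) for dim in range(width)]

# Hypothetical contextual outputs for a 3-token sentence padded to length 4.
token_outputs = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.2], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 1, 0]

sentence_vector = masked_mean_pool(token_outputs, mask)
print([round(value, 3) for value in sentence_vector])
```

Ignoring the mask would let the padding row dominate the average, so the same sentence would embed differently depending on how much padding its batch required.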
What if you don't have labeled NLI data? SimCSE (Simple Contrastive Learning of Sentence Embeddings) from Gao et al. (2021)[3] shows you don't need it. The trick is elegant: pass the same sentence through the encoder twice with different dropout masks. Since dropout randomly zeros out different neurons each time, you get two slightly different embedding vectors for the same sentence. These two views are positives (they should be similar), while all other sentences in the batch are negatives.
This gives surprisingly strong embeddings without labeled pairs, although supervised SimCSE performs better across the paper's reported STS tasks.[3] The key insight is that the stochasticity in the transformer forward pass, which is normally just a regularization technique, becomes an implicit data augmentation mechanism for contrastive learning.
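The dropout-as-augmentation idea can be illustrated with a toy that skips the transformer entirely. In this sketch, dropout is applied directly to a hypothetical feature vector rather than inside an encoder, and the names `dropout_view`, `sentence_a`, and `sentence_b` are assumptions made for the example. It shows only that two noisy views of one input stay closer to each other than to a view of a different input.

```python
import random
from math import sqrt

def dropout_view(features: list[float], rate: float, rng: random.Random) -> list[float]:
    """Zero each feature with probability `rate` and rescale survivors (inverted dropout)."""
    keep = 1.0 - rate
    return [value / keep if rng.random() < keep else 0.0 for value in features]

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical pre-dropout features for two unrelated sentences.
sentence_a = [1.0 if index % 2 == 0 else 0.1 for index in range(64)]
sentence_b = [0.1 if index % 2 == 0 else 1.0 for index in range(64)]

view_1 = dropout_view(sentence_a, rate=0.1, rng=rng)  # first pass
view_2 = dropout_view(sentence_a, rate=0.1, rng=rng)  # second pass, different mask
other = dropout_view(sentence_b, rate=0.1, rng=rng)   # in-batch negative

print("positive pair cosine:", round(cosine(view_1, view_2), 3))
print("negative pair cosine:", round(cosine(view_1, other), 3))
```

In real training the two views come from two forward passes through the same encoder with its standard dropout active, and the contrastive loss then treats `view_1` and `view_2` as the positive pair.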
Beyond dropout, augmentations are hypotheses about meaning preservation: each one asserts that the edited text still means the same thing as the original.
Avoid casual word deletion or insertion for policy text: dropping "not," a product category, or an approval condition changes the rule while incorrectly labeling the pair positive.
Most contrastive learning implementations use in-batch negatives by default: for a batch of $N$ positive pairs, each anchor has one matching positive, and the other $N - 1$ candidates in the batch act as negatives. This is efficient because you get many negatives "for free" without explicitly labeling them.
Larger batches increase the chance of informative competitors, but they also increase the chance of false negatives: another row may cite the same relevant policy as the anchor while the loss treats it as wrong. Distributed training commonly gathers embeddings across GPUs before computing this loss. Plain gradient accumulation doesn't create more in-batch negatives unless the implementation explicitly reuses embeddings across microbatches.
```python
batch = [
    {"query": "Can I return a cracked tablet?", "policy_id": "return-policy-us-v3"},
    {"query": "What if electronics arrive damaged?", "policy_id": "return-policy-us-v3"},
    {"query": "Where is my return label?", "policy_id": "shipping-label-v2"},
]

false_negatives = []
for anchor_index, anchor in enumerate(batch):
    for candidate_index, candidate in enumerate(batch):
        if anchor_index == candidate_index:
            continue
        # Another row with the same policy is a valid answer, not a negative.
        if candidate["policy_id"] == anchor["policy_id"]:
            false_negatives.append(
                f"row {anchor_index} treats row {candidate_index} as negative"
            )

print("false negatives:", false_negatives)
print("action: deduplicate shared policy positives before InfoNCE")
```

```text
false negatives: ['row 0 treats row 1 as negative', 'row 1 treats row 0 as negative']
action: deduplicate shared policy positives before InfoNCE
```

Random negatives are too easy. The model quickly learns to distinguish "refund label missing" from "warehouse forklift battery." Hard negatives force the model to learn subtle semantic distinctions, which is where real retrieval quality comes from.
| Negative type | Anchor | Candidate | Why it matters |
|---|---|---|---|
| Easy negative | "How do I print a return label?" | "The warehouse forklift needs charging" | Different topic; the model learns this separation almost immediately |
| Hard negative | "How do I print a return label?" | "How do I replace a damaged shipping label?" | Same keywords, different intent; forces fine-grained learning |
The simplest source of negatives is the batch itself: use the other examples in the batch. This is simple and scales with batch size, but the negatives may be too easy.
A stronger source is lexical mining: use a lexical search algorithm like BM25 (Best Matching 25, a sparse retrieval function based on keyword frequency) to find documents that are lexically similar but semantically different:

```text
Query: "How do I print a return label?"
Hard negative: "Return label printer calibration guide"  # shares "return" and "label" but answers a different question
Easy negative: "Forklift battery maintenance schedule"
```

A third source adds a semantic filter: use a cross-encoder to find high-BM25-score but low-relevance passages. The code below demonstrates the control flow with deterministic helper functions: lexical overlap stands in for BM25, and a small relevance scorer stands in for the cross-encoder. The important behavior is the filter: keep candidates that look lexically relevant but fail the semantic relevance check.
```python
def tokens(text: str) -> set[str]:
    return {part.strip("?.!,").lower() for part in text.split()}

def lexical_overlap(query: str, candidate: str) -> int:
    return len(tokens(query) & tokens(candidate))

def cross_encoder_relevance(query: str, candidate: str) -> float:
    # Deterministic stand-in for a real cross-encoder score.
    query_terms = tokens(query)
    candidate_terms = tokens(candidate)

    if "print" in query_terms and "print" in candidate_terms:
        return 0.92
    if "generate" in candidate_terms and "label" in candidate_terms:
        return 0.81
    return 0.18

def mine_hard_negatives(
    queries: list[str],
    corpus: list[str],
    top_k: int = 2,
) -> dict[str, list[str]]:
    hard_negatives: dict[str, list[str]] = {}

    for query in queries:
        lexical_candidates = sorted(
            corpus,
            key=lambda candidate: lexical_overlap(query, candidate),
            reverse=True,
        )

        # Keep candidates that share words with the query but score as irrelevant.
        hard_negatives[query] = [
            candidate
            for candidate in lexical_candidates
            if lexical_overlap(query, candidate) > 0
            and cross_encoder_relevance(query, candidate) < 0.30
        ][:top_k]

    return hard_negatives

queries = ["How do I print a return label?"]
corpus = [
    "How do I print a return shipping label?",
    "Return label printer calibration guide",
    "How do I replace a damaged shipping label?",
    "Forklift battery maintenance schedule",
]

hard = mine_hard_negatives(queries, corpus)
print(hard[queries[0]])
```

```text
['How do I replace a damaged shipping label?', 'Return label printer calibration guide']
```

Re-mine hard negatives after each training epoch using the improved model. This progressively finds harder examples as the model improves.
A bi-encoder encodes the query and document independently, then compares the two vectors with a dot product or cosine similarity.
The advantage is that documents can be pre-encoded and indexed. At query time, you encode the query once, then use an ANN (Approximate Nearest Neighbor) index to retrieve candidates in sublinear time in practice.
The limitation is that there is no cross-attention between query and document, so the score may miss fine-grained relevance signals.
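The encode-once, compare-many pattern can be sketched with an exact (brute-force) search over pre-computed vectors. Everything here is a stand-in: the 3-D unit vectors, the document ids reused from earlier examples, and the `search` helper are hypothetical, and a production system would swap the loop for an ANN index that approximates the same ranking.

```python
def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

# Hypothetical unit-length passage embeddings, computed offline and stored.
passage_index = {
    "return-policy-us-v3": [0.8, 0.6, 0.0],
    "shipping-label-v2": [0.0, 0.6, 0.8],
    "forklift-battery": [0.6, 0.0, -0.8],
}

def search(
    query_vector: list[float],
    index: dict[str, list[float]],
    top_k: int = 2,
) -> list[str]:
    """Score every stored vector against the query and return the best ids."""
    scored = sorted(
        ((dot(query_vector, vector), doc_id) for doc_id, vector in index.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical query embedding: the only vector computed at query time.
query_vector = [0.6, 0.8, 0.0]
results = search(query_vector, passage_index)
print(results)
```

No passage text is touched at query time, which is the property that makes the first stage cheap and also the reason it cannot model query-passage interactions.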
A cross-encoder concatenates the query and document and processes them jointly through the full transformer.
The advantage is that full attention can model phrase order, negation, and query-document interactions that a single-vector score misses. On a suitable shortlist, this often improves precision over first-stage vector scoring.
The limitation is that you must run inference for every (query, document) pair. If you scored the full corpus directly, that's O(N) transformer passes per query, which is too slow for large-scale search.
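A back-of-the-envelope calculation shows why the O(N) cost rules out full-corpus scoring. The figures below (one million passages, 10 ms per scored pair, 5 ms to encode a query, a shortlist of 20) are hypothetical, and the sketch assumes pairs are scored sequentially and ignores ANN lookup time and batching.

```python
def full_corpus_seconds(corpus_size: int, ms_per_pair: float) -> float:
    """Cross-encoder scores every passage for a single query."""
    return corpus_size * ms_per_pair / 1000.0

def rerank_seconds(
    query_encode_ms: float,
    shortlist_size: int,
    ms_per_pair: float,
) -> float:
    """Bi-encoder encodes the query once, cross-encoder scores only the shortlist."""
    return (query_encode_ms + shortlist_size * ms_per_pair) / 1000.0

# Hypothetical workload numbers, chosen only to show the scale difference.
full_scan = full_corpus_seconds(corpus_size=1_000_000, ms_per_pair=10.0)
two_stage = rerank_seconds(query_encode_ms=5.0, shortlist_size=20, ms_per_pair=10.0)

print(f"full-corpus cross-encoder: {full_scan:,.0f} s per query")
print(f"shortlist rerank: {two_stage:.3f} s per query")
```

The shortlist size is the tuning knob: a larger shortlist protects recall but raises latency linearly with the per-pair cost.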
ColBERT (Contextualized Late Interaction over BERT)[7] offers a middle ground between bi-encoder and cross-encoder:
Instead of a single embedding per document, ColBERT stores per-token embeddings and computes relevance using MaxSim: for each query token, find the maximum similarity to any document token, then sum.
The advantage is that ColBERT retains token-level matching signals while documents can still be pre-encoded.
The limitation is a much larger index (one vector per token instead of one per document).
This architecture choice is a budget tradeoff, not a universal ranking. A bi-encoder protects first-stage recall at corpus scale; a cross-encoder can improve precision for a bounded candidate set; late interaction spends more index space to preserve token-level evidence.
```python
# Rows are query tokens; columns are document tokens (precomputed similarities).
similarities = {
    "return": {"damaged": 0.12, "return": 0.93, "tablet": 0.08},
    "tablet": {"damaged": 0.19, "return": 0.07, "tablet": 0.91},
}

# MaxSim: best document-token match per query token, then sum.
best_by_query_token = {
    query_token: max(document_scores.values())
    for query_token, document_scores in similarities.items()
}
score = sum(best_by_query_token.values())

print("best token scores:", best_by_query_token)
print("MaxSim score:", round(score, 2))
```

```text
best token scores: {'return': 0.93, 'tablet': 0.91}
MaxSim score: 1.84
```
In a production system, the bi-encoder and cross-encoder are combined in a two-stage pipeline to balance speed and accuracy. The function below illustrates this pattern with deterministic scores: first retrieve a broad candidate set using a fast bi-encoder score, then re-score those candidates with a slower cross-encoder score and return the final ranking.
```python
documents = [
    {
        "id": "carrier-delay-credit",
        "text": "Carrier delay credit policy for late packages",
        "bi_score": 0.82,
        "cross_score": 0.97,
    },
    {
        "id": "return-label-printer",
        "text": "Return label printer calibration guide",
        "bi_score": 0.79,
        "cross_score": 0.22,
    },
    {
        "id": "late-package-refund",
        "text": "Refund workflow for late package delivery",
        "bi_score": 0.77,
        "cross_score": 0.91,
    },
    {
        "id": "forklift-battery",
        "text": "Warehouse forklift battery maintenance",
        "bi_score": 0.10,
        "cross_score": 0.05,
    },
]

def search_with_rerank(
    query: str,
    corpus: list[dict[str, str | float]],
    candidate_k: int = 3,
    top_k: int = 2,
) -> list[str]:
    # Stage 1: cheap bi-encoder score selects the shortlist.
    candidates = sorted(corpus, key=lambda doc: float(doc["bi_score"]), reverse=True)[
        :candidate_k
    ]
    # Stage 2: slower cross-encoder score reorders only the shortlist.
    reranked = sorted(
        candidates,
        key=lambda doc: float(doc["cross_score"]),
        reverse=True,
    )
    return [str(doc["id"]) for doc in reranked[:top_k]]

results = search_with_rerank("late package refund", documents)
print(results)
```

```text
['carrier-delay-credit', 'late-package-refund']
```

"Label" means different things for clustering (group return-label documents), retrieval (find barcode-label troubleshooting), and classification (is this about shipping or inventory?). Traditional embeddings produce the same vector regardless of the downstream task. This limits performance when a single model must serve multiple purposes.
Some embedding families expose task prefixes or lightweight instructions that steer the encoder toward retrieval, clustering, or classification. E5 is a simple example: it distinguishes inputs like query: and passage: during contrastive pre-training.[5] INSTRUCTOR-style models go further and condition the embedding on an explicit task instruction.[8] The format is model-specific. Prefixes that help one family can hurt another, so follow its training or model documentation.
This example keeps the families separate on purpose. The exact prefix or instruction string depends on the embedding family you chose:
```python
def format_e5_pair(query: str, passage: str) -> tuple[str, str]:
    """E5-style inputs are prefixed strings."""
    return f"query: {query}", f"passage: {passage}"

def format_instructor_input(instruction: str, text: str) -> list[str]:
    """INSTRUCTOR-style inputs are commonly [instruction, text] pairs."""
    return [instruction, text]

e5_query, e5_passage = format_e5_pair(
    "what is contrastive learning?",
    "Contrastive learning pulls positive pairs together in embedding space.",
)

instructor_example = format_instructor_input(
    "Represent the Wikipedia question for retrieving supporting documents:",
    "what is contrastive learning?",
)

assert e5_query.startswith("query: ")
assert e5_passage.startswith("passage: ")
assert instructor_example[0].startswith("Represent")
assert instructor_example[1] == "what is contrastive learning?"
```

This doesn't mean one magic prefix solves every task. It means some embedding families expect an extra conditioning signal. Use the format documented for that specific model family, then benchmark it on your own retrieval, clustering, or classification workload.
Matryoshka embeddings are named after nesting dolls because selected prefix widths are trained to remain useful on their own. For example, a full 768-dimensional embedding can be trained together with 128- and 32-dimensional prefixes. You then choose among trained and evaluated widths based on the storage-quality budget.
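Train embeddings so that selected prefix widths preserve useful representations under their own losses:[9]

$$\mathcal{L}_{\text{MRL}} = \sum_{k \in \mathcal{K}} c_k \, \mathcal{L}\left(z_{1:k}\right)$$

Here $\mathcal{K}$ is the set of trained prefix widths, $z_{1:k}$ is the first $k$ dimensions of the embedding, and $c_k$ weights the loss at each width.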
For a contrastively trained embedding model, the loss can be computed at several predefined truncation points simultaneously. The following runnable toy implementation slices full embeddings down to smaller prefixes, calculates the same InfoNCE objective at each prefix, and averages the losses:
```python
from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

def logsumexp(values: list[float]) -> float:
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def info_nce_loss(
    query_vectors: list[list[float]],
    positive_vectors: list[list[float]],
    temperature: float = 0.2,
) -> float:
    queries = [normalize(vector) for vector in query_vectors]
    positives = [normalize(vector) for vector in positive_vectors]
    losses: list[float] = []

    for row, query in enumerate(queries):
        logits = [dot(query, candidate) / temperature for candidate in positives]
        losses.append(logsumexp(logits) - logits[row])

    return sum(losses) / len(losses)

def matryoshka_loss(
    embeddings_a: list[list[float]],
    embeddings_b: list[list[float]],
    dims: tuple[int, int, int] = (2, 4, 6),
) -> float:
    losses = []

    for dim in dims:
        # Slice each embedding to its first `dim` values; info_nce_loss re-normalizes.
        truncated_a = [row[:dim] for row in embeddings_a]
        truncated_b = [row[:dim] for row in embeddings_b]
        losses.append(info_nce_loss(truncated_a, truncated_b))

    # Equal weight for every prefix width.
    return sum(losses) / len(losses)

queries = [[1.0, 0.0, 0.9, 0.1, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9, 0.2, 0.5]]
positives = [[0.95, 0.05, 0.85, 0.15, 0.45, 0.25], [0.05, 0.95, 0.15, 0.85, 0.25, 0.45]]

loss = matryoshka_loss(queries, positives)
print(round(loss, 4))
```

```text
0.0153
```

| Benefit | Why it matters |
|---|---|
| Flexible deployment | Use the full embedding width for maximum accuracy, or a smaller prefix when storage is constrained. |
| No retraining | One model can serve several dimensionality budgets. |
| Graceful degradation | Performance should drop smoothly as dimensions shrink, but you still need to benchmark each cutoff. |
| Deployment constraint | Shorten only at dimensions a chosen model documents or you validate; arbitrary slicing isn't guaranteed to preserve rankings. |
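At inference time, using a shorter width amounts to keeping a prefix, re-normalizing it, and storing fewer values. The sketch below assumes float32 storage (4 bytes per value) and ignores index overhead; the helper names, the 4-D example vector, and the one-million-vector corpus are hypothetical, and the truncation is only meaningful for widths the model was trained or validated at.

```python
from math import sqrt

def truncate_and_normalize(vector: list[float], width: int) -> list[float]:
    """Keep the first `width` dimensions and rescale the prefix to unit length."""
    prefix = vector[:width]
    norm = sqrt(sum(value * value for value in prefix))
    return [value / norm for value in prefix]

def index_bytes(num_vectors: int, width: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; real indexes add their own overhead."""
    return num_vectors * width * bytes_per_value

full_vector = [0.5, 0.5, 0.5, 0.5]  # hypothetical unit-length embedding
short_vector = truncate_and_normalize(full_vector, 2)

print("short vector:", [round(value, 4) for value in short_vector])
for width in (768, 128, 32):
    megabytes = index_bytes(1_000_000, width) / 1_000_000
    print(f"width {width}: {megabytes:,.0f} MB for 1M vectors")
```

Re-normalizing matters because a prefix of a unit vector is generally shorter than unit length, so cosine and dot-product scores would otherwise diverge.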
Before large-scale benchmarks like MTEB existed, a common benchmark for evaluating sentence embeddings was Semantic Textual Similarity (STS). Datasets like STS-B (STS Benchmark) provide sentence pairs rated by human annotators for semantic relatedness:

```text
"A package is delayed" / "A shipment is running late" => 4.5
"A package is delayed" / "A refund has posted" => 1.2
```

To evaluate a model, you compute the cosine similarity for every pair using the model's embeddings, and then calculate the Spearman rank correlation between the model's similarity scores and the human ratings. A high correlation means the model's embedding space aligns well with human judgment.
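The evaluation step can be sketched in plain Python by ranking both score lists and taking the Pearson correlation of the ranks. In the example data, the first two human ratings come from the pairs above, while the other two ratings and all four model cosine scores are hypothetical values added for illustration.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

def pearson(xs: list[float], ys: list[float]) -> float:
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

human_ratings = [4.5, 1.2, 3.0, 0.4]     # last two ratings are hypothetical
model_cosines = [0.91, 0.35, 0.40, 0.10]  # hypothetical model scores

print("Spearman:", round(spearman(human_ratings, model_cosines), 3))
```

Because only ranks matter, the model is not penalized for compressing its cosine scores into a narrow range as long as the ordering agrees with the annotators.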
As models improved, optimizing only for STS was no longer sufficient. An embedding model excellent at STS might fail miserably at information retrieval or clustering. To address this, the Massive Text Embedding Benchmark (MTEB) was introduced.[10] The original MTEB paper evaluated models on 58 datasets grouped into 8 task categories, giving a much broader view of embedding quality than STS alone:
| Task | # Datasets | Example |
|---|---|---|
| Classification | 12 | Sentiment, topic |
| Clustering | 11 | Document clustering |
| Pair Classification | 3 | Paraphrase detection |
| Reranking | 4 | Passage reranking |
| Retrieval | 15 | Question-passage retrieval |
| STS | 10 | Semantic similarity |
| Summarization | 1 | Summary similarity |
| BitextMining | 2 | Parallel sentence mining |
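Don't anchor on a single MTEB average. Deployment success usually depends on a few operational questions: whether the model handles your domain and languages, whether it meets your latency budget, and whether its index fits your storage.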
In practice, architecture often matters more than a tiny leaderboard delta. A strong bi-encoder with good hard negatives, sensible chunking, and a reranker usually beats blindly swapping to the latest model name.
The original MTEB finding is the durable lesson: no one model dominated every task category.[10] Evaluate policy retrieval, reranking, languages, latency, and storage behavior that match your deployment instead of selecting by one aggregate score.
document_qa_v2 can use an embedding retriever to propose policy passages, but vector proximity isn't authorization. A release test should verify both retrieval quality and that unapproved text never becomes answer evidence:
```python
approved_evidence = {"return-policy-us-v3", "shipping-label-v2"}
retrieval_cases = [
    {
        "query": "Can I return a cracked tablet?",
        "expected": "return-policy-us-v3",
        "candidates": ["seller-private-note-44", "return-policy-us-v3"],
    },
    {
        "query": "Where do I get a return label?",
        "expected": "shipping-label-v2",
        "candidates": ["shipping-label-v2", "warehouse-note-12"],
    },
]
attack_candidates = ["seller-private-note-44"]

def approved_candidate(candidates: list[str]) -> str | None:
    # Return the highest-ranked candidate that is on the approved list, if any.
    return next((doc for doc in candidates if doc in approved_evidence), None)

served = [approved_candidate(case["candidates"]) for case in retrieval_cases]
hits = sum(
    evidence == case["expected"]
    for evidence, case in zip(served, retrieval_cases)
)
attack_evidence = approved_candidate(attack_candidates)

print("approved evidence recall@2:", f"{hits / len(retrieval_cases):.0%}")
print("served evidence:", served)
print("private-note attack evidence:", attack_evidence)
```

```text
approved evidence recall@2: 100%
served evidence: ['return-policy-us-v3', 'shipping-label-v2']
private-note attack evidence: None
```

Building embedding-based systems requires the right tooling:
| Tool | What it gives you |
|---|---|
| Sentence-Transformers (sentence-transformers) | Pretrained sentence embedding models, pooling modules, contrastive training losses such as MultipleNegativesRankingLoss, and batched encoding utilities. |
| FAISS (Facebook AI Similarity Search) | Efficient similarity search and clustering for dense vectors, including inverted-file and product-quantization approaches for approximate nearest neighbor search.[11] |
Symptom: Nearly every query-document pair gets suspiciously similar cosine scores. Cause: Raw BERT [CLS] vectors weren't tuned for semantic retrieval and can inherit poorly discriminative anisotropic geometry. Fix: Start from a sentence embedding model or fine-tune with a contrastive objective before building nearest-neighbor search.
Symptom: Training loss falls, but recall on realistic queries barely moves. Cause: Random negatives stop teaching once the model separates unrelated topics. Fix: Mine BM25 or cross-encoder negatives that share words with the anchor but answer a different intent.
Symptom: The reranker looks good in pairwise inspection, yet the right document is often absent in production results. Cause: The correct passage never entered the shortlist. Fix: Tune first-stage Recall@K separately, then widen candidate budget before blaming the reranker.
Symptom: Index storage drops as expected, but retrieval quality falls off a cliff. Cause: A standard embedding vector was truncated without prefix-aware training or provider support. Fix: Use Matryoshka-trained or provider-documented shortening controls and benchmark each target width.
Symptom: One embedding model works for clustering but underperforms on retrieval. Cause: The model family expected query/passage prefixes or instructions, but every input was embedded as plain text. Fix: Follow the model card format for retrieval, clustering, and classification separately.
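The shortlist pitfall above is easier to see with numbers. The sketch below measures first-stage Recall@K on its own, before any reranker runs. The rankings are hypothetical and made up for illustration; only the document ids are borrowed from this chapter, and `recall_at_k` is a helper name introduced here.

```python
# Hypothetical first-stage rankings: (ranked doc ids, id of the relevant doc).
# The ids reuse names from this chapter; the rankings are made up for illustration.
CASES = [
    (["return-policy-us-v3", "warehouse-note-12", "shipping-label-v2"], "return-policy-us-v3"),
    (["warehouse-note-12", "shipping-label-v2", "return-policy-us-v3"], "shipping-label-v2"),
    (["warehouse-note-12", "shipping-label-v2", "return-policy-us-v3"], "return-policy-us-v3"),
    (["warehouse-note-12", "shipping-label-v2", "seller-private-note-44"], "return-policy-us-v3"),
]


def recall_at_k(cases: list[tuple[list[str], str]], k: int) -> float:
    """Fraction of queries whose relevant doc appears in the top-k shortlist."""
    hits = sum(relevant in ranked[:k] for ranked, relevant in cases)
    return hits / len(cases)


for k in (1, 2, 3):
    # A reranker can only reorder the shortlist, so this value is its accuracy ceiling.
    print(f"Recall@{k}: {recall_at_k(CASES, k):.0%}")
```

In this toy data the last query never retrieves its relevant passage at any K shown, so no reranker could recover it. That is the case to look for before tuning the second stage.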
These exercises let you verify your understanding without needing a GPU cluster.
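Given an anchor $a$, positive $p$, and negative $n$ with distances $d(a, p)$ and $d(a, n)$, compute the triplet loss $L = \max(0,\ d(a, p) - d(a, n) + m)$ for a smaller margin $m_1$ and a larger margin $m_2$. In which case does the model still have work to do?
Solution sketch: For $m_1$: $d(a, p) - d(a, n) + m_1 \le 0$, so $L = 0$. The margin is already satisfied. For $m_2$: $d(a, p) - d(a, n) + m_2 > 0$, so $L > 0$. The larger margin forces the model to pull the positive even closer or push the negative farther away.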
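To check this kind of calculation by hand, a minimal sketch of the hinge-style triplet loss on precomputed distances may help. The distances and margins below are illustrative assumptions, not the exercise's original values, and `triplet_loss` is a helper name introduced here.

```python
def triplet_loss(d_ap: float, d_an: float, margin: float) -> float:
    """Hinge-style triplet loss: zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, d_ap - d_an + margin)


# Illustrative values only (assumption): anchor-positive and anchor-negative distances.
d_ap, d_an = 0.4, 0.9

for margin in (0.2, 0.7):
    loss = triplet_loss(d_ap, d_an, margin)
    status = "margin satisfied" if loss == 0.0 else "model still has work to do"
    print(f"margin={margin}: loss={loss:.2f} ({status})")
```

With these numbers the gap between the two distances is 0.5, so the smaller margin yields zero loss while the larger one leaves a positive loss for the optimizer to reduce.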
A teammate reports that their semantic search system returns nearly identical similarity scores for every query-document pair. They are using a pretrained BERT model and taking the [CLS] token as the sentence embedding. What is the most likely cause, and what is the smallest change that would fix it?
Solution sketch: Raw BERT [CLS] embeddings weren't trained to make cosine distance a semantic-retrieval score, and anisotropic geometry can make their scores poorly discriminative. The smallest fix is to switch to a sentence embedding model that was fine-tuned with a sentence-level objective (for example, SBERT or E5), rather than using raw BERT.
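Before swapping models, it can be worth confirming the symptom. The sketch below summarizes the spread of pairwise cosine scores; a mean near 1 with a tiny standard deviation suggests the vectors sit in a narrow cone. The toy vectors are assumptions chosen to mimic that geometry, not real BERT outputs, and the helper names are introduced here for illustration.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine_stats(vectors: list[list[float]]) -> tuple[float, float]:
    """Mean and population std of cosine similarity over all distinct pairs."""
    scores = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return mean(scores), pstdev(scores)


# Toy data (assumption): two similar vectors and one unrelated vector.
spread = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
# The same vectors plus a large shared offset, mimicking a narrow cone.
narrow = [[x + 10.0 for x in vec] for vec in spread]

for name, vectors in (("spread", spread), ("narrow", narrow)):
    avg, std = pairwise_cosine_stats(vectors)
    print(f"{name}: mean cosine={avg:.3f}, std={std:.3f}")
```

This is a diagnostic, not a fix: it only shows whether cosine scores can discriminate at all. The remedy in the solution sketch, a model trained with a sentence-level objective, still applies.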
You have 2 million support tickets and a latency budget of 200 ms per query. You own a bi-encoder that encodes a query in 10 ms and a cross-encoder that scores one query-document pair in 15 ms. Why is scoring the full corpus with the cross-encoder impossible, and what pipeline would hit the latency budget?
Solution sketch: $2{,}000{,}000 \times 15\ \text{ms} = 30{,}000\ \text{s} \approx 8.3$ hours per query. The cross-encoder is far too slow for the full corpus. Reserve 10 ms for query encoding and choose a top-10 shortlist only if ANN lookup and overhead fit inside the remaining 40 ms: reranking then costs $10 \times 15\ \text{ms} = 150\ \text{ms}$, for at most 200 ms total. If Recall@10 is inadequate, the budget or reranker throughput must change; silently widening to top 100 violates the requirement.
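The budget arithmetic can be sketched in a few lines. The constants come from the exercise statement; the function names are introduced here, and the sketch assumes the cross-encoder scores pairs one after another with no batching or parallelism.

```python
CORPUS_SIZE = 2_000_000
BUDGET_MS = 200
QUERY_ENCODE_MS = 10
CROSS_ENCODER_MS_PER_PAIR = 15  # assumed sequential, one pair at a time


def full_scan_hours(corpus_size: int, ms_per_pair: float) -> float:
    """Hours needed to score every document with the cross-encoder for one query."""
    return corpus_size * ms_per_pair / 1000 / 3600


def ann_headroom_ms(shortlist_size: int, budget_ms: float = BUDGET_MS) -> float:
    """Time left for ANN lookup and overhead after query encoding and reranking."""
    return budget_ms - QUERY_ENCODE_MS - shortlist_size * CROSS_ENCODER_MS_PER_PAIR


print(f"full scan: {full_scan_hours(CORPUS_SIZE, CROSS_ENCODER_MS_PER_PAIR):.1f} hours per query")
for k in (10, 12, 100):
    headroom = ann_headroom_ms(k)
    verdict = "fits" if headroom >= 0 else "over budget"
    print(f"top-{k}: {headroom:+.0f} ms headroom ({verdict})")
```

A negative headroom means the shortlist cannot be reranked inside the budget at all, which is why widening to top 100 is not an option under these assumptions.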
You now understand why raw transformer outputs fail for semantic search, how contrastive learning reshapes the embedding space, and how to train and deploy sentence embeddings in production. You can explain InfoNCE and triplet loss with concrete numbers, mine hard negatives from lexical overlap, and design a two-stage retrieval pipeline that balances speed and accuracy.
The next article, Embedding Similarity & Quantization, builds directly on this foundation. You will learn the mathematical details of cosine similarity versus dot product, how Matryoshka truncation changes the accuracy-storage tradeoff, and how to quantize embeddings to 8-bit, 4-bit, or binary formats for large-scale indexes. Those techniques only make sense once you understand why the embedding space was shaped by contrastive loss in the first place.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Reimers, N., & Gurevych, I. · 2019 · EMNLP 2019
How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.
Ethayarajh, K. · 2019
SimCSE: Simple Contrastive Learning of Sentence Embeddings.
Gao, T., Yao, X., & Chen, D. · 2021 · EMNLP 2021
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere.
Wang, T., & Isola, P. · 2020 · ICML 2020
Text Embeddings by Weakly-Supervised Contrastive Pre-training.
Wang, L., et al. · 2022
Representation Learning with Contrastive Predictive Coding.
Oord, A. van den, Li, Y., & Vinyals, O. · 2018
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.
Khattab, O., & Zaharia, M. · 2020 · SIGIR 2020
One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
Su, H., et al. · 2022 · arXiv preprint
Matryoshka Representation Learning.
Kusupati, A., et al. · 2022 · NeurIPS 2022
MTEB: Massive Text Embedding Benchmark.
Muennighoff, N., et al. · 2023 · EACL 2023
Billion-scale similarity search with GPUs.
Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint