Sentence Embeddings & Contrastive Loss

Learn how contrastive losses train sentence embeddings, why hard negatives matter, and how retrieval systems combine bi-encoders, rerankers, and dimension tradeoffs.


In the production-agent capstone, document_qa_v2 found return-policy-us-v3 before the agent drafted an answer about a cracked tablet. That contract deliberately hid one important mechanism: how did a policy passage become a good candidate for the question in the first place?

A sentence embedding maps a query or passage to one fixed-width vector. A retriever can then find passages near a customer question even when wording changes:

| Query | Passage that should rank near it | Tempting non-match |
| --- | --- | --- |
| "Can I return a tablet that arrived cracked?" | "Damaged electronics may be returned within 30 days." | "A private seller note requests an immediate refund." |

This chapter teaches how contrastive learning shapes that vector space. The agent still needs evidence and authorization gates; embeddings decide which text gets considered before those gates run.

From word embeddings to sentence embeddings

Word embeddings gave individual tokens numerical coordinates. Retrieval needs one vector for an entire query or policy passage. The challenge is to compress variable-length text into a fixed-size representation whose neighborhood ordering is useful for the task.

Averaging context-free word vectors is a useful baseline, but it loses order and context. Don't confuse that baseline with mean pooling contextual token outputs inside a trained sentence encoder: pooling specifies how to produce one vector; the training objective determines whether its geometry supports retrieval. A common objective is contrastive learning, which rewards a relevant pair for scoring above irrelevant candidates.


Naive approaches (and why they fail)

Mean pooling of word embeddings

A simple way to create a sentence embedding is to look up the word vector for each token in the sentence and average them. This naive baseline is fast but loses important structural information. The function below uses a tiny local vector table so you can run the baseline and see exactly what it throws away:

mean-pooling-of-word-embeddings.py
```python
from math import fsum

word_vectors: dict[str, tuple[float, float, float]] = {
    "carrier": (0.9, 0.1, 0.0),
    "delayed": (0.8, 0.2, 0.1),
    "order": (0.7, 0.1, 0.2),
    "refund": (0.1, 0.9, 0.2),
    "posted": (0.2, 0.8, 0.1),
}

def mean_pool(
    sentence: str,
    vectors: dict[str, tuple[float, float, float]],
) -> tuple[float, ...]:
    tokens = [token.lower() for token in sentence.split()]
    token_vectors = [vectors[token] for token in tokens if token in vectors]

    # No known tokens: fall back to a zero vector of the table's width.
    if not token_vectors:
        width = len(next(iter(vectors.values())))
        return tuple(0.0 for _ in range(width))

    # fsum returns a correctly rounded sum, so token order can't change the
    # result through floating-point rounding. A plain sum() can differ in the
    # last bit between orderings, which would make the equality check below
    # fail for reasons unrelated to the lesson.
    return tuple(
        fsum(vector[dim] for vector in token_vectors) / len(token_vectors)
        for dim in range(len(token_vectors[0]))
    )

pooled_a = mean_pool("carrier delayed order", word_vectors)
pooled_b = mean_pool("order delayed carrier", word_vectors)

print("A:", tuple(round(value, 3) for value in pooled_a))
print("B:", tuple(round(value, 3) for value in pooled_b))
print("Same vector:", pooled_a == pooled_b)
```
Output

```text
A: (0.8, 0.133, 0.1)
B: (0.8, 0.133, 0.1)
Same vector: True
```

The final line is the problem: both sentences produce the same vector because averaging ignores order.

Problems

  • Ignores word order: "carrier delayed order" and "order delayed carrier" have identical embeddings because the sum of vectors is commutative ($A + B = B + A$), even though word order can change what the sentence emphasizes.
  • Common-word dilution: Frequent words and boilerplate phrasing can wash out the signal from rare, informative tokens.
  • No context: Polysemous words like "charge" (payment dispute vs. powering a scanner) get averaged into a single messy vector.

[CLS] token from BERT

Using the [CLS] (Classification) token from BERT (Bidirectional Encoder Representations from Transformers) directly as a sentence embedding isn't a sound retrieval baseline. Reimers & Gurevych (2019) showed that BERT's paired-input architecture is impractical for large semantic search and introduced SBERT so independently encoded sentence vectors could be compared with cosine similarity.[1] A pretrained task token hasn't been trained to make nearest-neighbor distance a relevance score.

The anisotropy problem

Contextual representations can show anisotropy: vectors occupy a narrow cone in the high-dimensional space rather than using directions evenly.[2] Think of many policy passages all pointing toward one generic "support text" direction. Their cosine scores can look high even when they answer different questions. Contrastive objectives can improve this geometry by rewarding aligned positives while penalizing competing candidates.

[Figure: Embedding-space geometry comparison showing raw contextual sentence vectors collapsed into a narrow cone before contrastive training and separated semantic clusters after contrastive fine-tuning.]
Raw contextual vectors can bunch into one generic direction, while contrastive fine-tuning creates separable semantic neighborhoods that make nearest-neighbor search meaningful.
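
One rough way to check for this crowding is to average the cosine similarity over all pairs of vectors. The sketch below is only a diagnostic on hypothetical hand-written vectors: the `raw` set shares a large common component, and the `spread` set is the same set with that component removed. It does not show what contrastive training does to a real encoder.

```python
from itertools import combinations
from math import sqrt

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

def mean_pairwise_cosine(vectors: list[list[float]]) -> float:
    """Average cosine similarity over every distinct pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical vectors: each "raw" vector carries the same large first component,
# which plays the role of a generic "support text" direction.
raw = [[5.0, 1.0, 0.0], [5.0, 0.0, 1.0], [5.0, -1.0, 0.0], [5.0, 0.0, -1.0]]
# The same vectors with the shared component removed.
spread = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]

raw_score = mean_pairwise_cosine(raw)
spread_score = mean_pairwise_cosine(spread)

print("raw mean cosine:", round(raw_score, 3))
print("spread mean cosine:", round(spread_score, 3))
```

A mean pairwise cosine close to 1 on unrelated passages suggests that raw cosine scores carry little ranking information, whichever encoder produced them.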

Contrastive learning for sentence embeddings

The core idea

The goal of contrastive learning is simple but powerful: reshape the embedding space so that sentences with similar meaning land close to each other, while unrelated sentences are pushed apart.

Before we formalize it, recall what "close" means here. Cosine similarity measures how aligned two vectors point (their angle), ignoring their length. For two unit vectors (length 1), it's simply their dot product. A value of +1 means same direction, 0 means orthogonal directions, and a negative value means opposing directions. None of those numbers proves semantic identity or irrelevance by itself; only an evaluated embedding model makes cosine ranking useful. The next lesson studies this scoring rule in detail.
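
As a minimal sketch of that definition, the function below computes cosine similarity in plain Python. The input vectors are made-up values chosen to hit the three reference cases, not outputs of a trained encoder.

```python
from math import sqrt

def cosine_similarity(left: list[float], right: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(left, right))
    left_norm = sqrt(sum(a * a for a in left))
    right_norm = sqrt(sum(b * b for b in right))
    return dot / (left_norm * right_norm)

# Different lengths, same direction: the angle is what counts.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite directions.
opposing = cosine_similarity([1.0, 1.0], [-2.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposing, 6))
```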

With that in mind, contrastive training teaches the model to:

  • Pull embeddings of similar sentences toward each other (high cosine similarity)
  • Push embeddings of dissimilar sentences away (low cosine similarity)

SimCSE (Simple Contrastive Learning of Sentence Embeddings)[3] demonstrated an unusually small self-supervised construction: pass the same sentence through the encoder twice while dropout supplies different noisy views, then treat those views as a positive pair. Wang & Isola's analysis gives useful vocabulary for the resulting geometry: alignment asks whether positives are close, and uniformity asks whether representations avoid crowding into a small region of the hypersphere.[4] E5 later trained single-vector text embeddings contrastively from a large weakly supervised pair corpus.[5]
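
Alignment and uniformity can be turned into two small measurements. The sketch below follows the commonly used form from Wang & Isola's analysis: mean squared distance between positives, and the log of the mean Gaussian potential over all pairs with `t = 2`. The vectors are hypothetical two-dimensional points chosen for illustration; for both numbers, lower is better.

```python
from itertools import combinations
from math import exp, log, sqrt

Vector = list[float]

def normalize(vector: Vector) -> Vector:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def squared_distance(left: Vector, right: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(left, right))

def alignment(positive_pairs: list[tuple[Vector, Vector]]) -> float:
    """Mean squared distance between positives; lower means positives sit closer."""
    return sum(squared_distance(a, b) for a, b in positive_pairs) / len(positive_pairs)

def uniformity(vectors: list[Vector], t: float = 2.0) -> float:
    """Log mean of exp(-t * squared distance) over distinct pairs; lower means more spread."""
    pairs = list(combinations(vectors, 2))
    return log(sum(exp(-t * squared_distance(a, b)) for a, b in pairs) / len(pairs))

# Hypothetical unit vectors: one crowded set, one spread around the circle.
crowded = [normalize(v) for v in ([1.0, 0.1], [1.0, 0.0], [1.0, -0.1])]
spread = [normalize(v) for v in ([1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866])]

tight_pairs = [(crowded[0], crowded[1])]
loose_pairs = [(spread[0], spread[1])]

print("alignment tight vs loose:", round(alignment(tight_pairs), 4), round(alignment(loose_pairs), 4))
print("uniformity crowded vs spread:", round(uniformity(crowded), 4), round(uniformity(spread), 4))
```

The two numbers pull in different directions: collapsing every vector to one point gives perfect alignment and terrible uniformity, which is why they are read together.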

The most common training objective here is Information Noise Contrastive Estimation (InfoNCE): for one anchor sentence, the model should rank the true match above every other candidate in the batch.

[Figure: Contrastive learning for sentence embeddings showing an anchor support query, a semantically matching positive, in-batch negatives, and the InfoNCE denominator that compares the positive against every candidate in the batch.]
Contrastive training turns a batch into a ranking problem: the true match should outrank every in-batch negative for the same anchor query.

InfoNCE loss

Think of contrastive learning as matching a support ticket to its true duplicate while pushing away unrelated tickets. You want the matching pair close together and every non-match farther away. InfoNCE formalizes this push-and-pull dynamic.

Before looking at the formula, walk through a tiny concrete case. Suppose you have a batch of 2 query-passage pairs, and after normalizing their embeddings you measure cosine similarities (which, for unit vectors, are just dot products):

| Query | Passage | Similarity |
| --- | --- | --- |
| $q_1$ | $p_1$ | 0.90 |
| $q_1$ | $p_2$ | 0.20 |
| $q_2$ | $p_1$ | 0.15 |
| $q_2$ | $p_2$ | 0.85 |

For query $q_1$, the true match is $p_1$ (similarity 0.90). The other passage in the batch, $p_2$, acts as an in-batch negative (similarity 0.20). InfoNCE wants the model to make $p_1$ look more likely than $p_2$.

For this worked row, choose a sharp temperature $\tau = 0.05$ and compute the loss contribution for $q_1$ step by step:

  1. Scale the similarities: positive logit = $0.90 / 0.05 = 18.0$, negative logit = $0.20 / 0.05 = 4.0$
  2. Exponentiate (this turns logits into unnormalized probabilities): $\exp(18.0) \approx 65{,}659{,}969$, $\exp(4.0) \approx 54.6$
  3. Normalize into a probability for the positive: $65{,}659{,}969 / (65{,}659{,}969 + 54.6) \approx 0.99999917$
  4. Take negative log: $-\log(0.99999917) \approx 0.00000083$ (tiny loss; the model is already very confident)

If the model were wrong ($q_1$ similarity to $p_1$ only 0.20, to $p_2$ 0.90), the positive probability would drop to about $8.3 \times 10^{-7}$ and the loss would jump to roughly 14 nats. The optimizer would receive a strong gradient pushing the correct pair closer.
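
The four steps can be checked with a few lines of Python. This sketch mirrors the paper arithmetic with raw exponentials, which is safe here only because the logits are small; `positive_probability` is a helper name introduced for illustration.

```python
from math import exp, log

def positive_probability(
    positive_similarity: float,
    negative_similarity: float,
    temperature: float,
) -> float:
    """Two-candidate softmax probability assigned to the positive passage."""
    positive_term = exp(positive_similarity / temperature)  # steps 1 and 2
    negative_term = exp(negative_similarity / temperature)
    return positive_term / (positive_term + negative_term)  # step 3

# The worked row: q1 scores its true match 0.90 and the in-batch negative 0.20.
confident = positive_probability(0.90, 0.20, temperature=0.05)
# The failure case: the same two scores, swapped.
swapped = positive_probability(0.20, 0.90, temperature=0.05)

confident_loss = -log(confident)  # step 4
swapped_loss = -log(swapped)

print(f"confident: P={confident:.8f} loss={confident_loss:.8f}")
print(f"swapped:   P={swapped:.2e} loss={swapped_loss:.2f}")
```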

[Figure: InfoNCE similarity matrix for two query-passage pairs showing diagonal positives, off-diagonal in-batch negatives, temperature-scaled logits, and positive probabilities.]
InfoNCE reads each similarity row as a multiple-choice question where the diagonal passage is the correct answer and off-diagonal passages are in-batch negatives.

The standard contrastive loss for a batch of $N$ positive pairs:[6]

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i, z_i^+) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i, z_j^+) / \tau)}$$

Reading the formula

For each example $i$, compare its similarity to the true match $z_i^+$ (numerator) against the full batch-level denominator. That denominator includes the positive pair itself plus every other candidate in the batch. Temperature $\tau$ controls how sharply the model distinguishes between similar and dissimilar pairs. The loss says: "make the true pair more likely than every alternative in the batch."

Where:

  • $z_i, z_i^+$ are embeddings of a positive pair (e.g., query and relevant document)
  • $\tau$ is the temperature parameter
  • All non-matching examples in that denominator act as in-batch negatives

The following copy-runnable implementation shows the same calculation without hiding the matrix math behind a framework. Production training code would vectorize this in PyTorch, JAX, or another tensor library, but the loop below makes the denominator explicit:

reading-the-formula.py
```python
from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

def logsumexp(values: list[float]) -> float:
    # Subtract the largest logit first so exp() never overflows.
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def row_cross_entropy(logits: list[float], correct: int) -> float:
    return logsumexp(logits) - logits[correct]

def info_nce_loss(
    query_vectors: list[list[float]],
    positive_vectors: list[list[float]],
    temperature: float = 0.2,
) -> float:
    queries = [normalize(vector) for vector in query_vectors]
    positives = [normalize(vector) for vector in positive_vectors]
    losses: list[float] = []

    # Row i treats positive i as the correct answer and every other positive
    # in the batch as a negative.
    for row, query in enumerate(queries):
        logits = [dot(query, candidate) / temperature for candidate in positives]
        losses.append(row_cross_entropy(logits, row))

    return sum(losses) / len(losses)

query_vectors = [[1.0, 0.0], [0.0, 1.0]]
positive_vectors = [[0.95, 0.05], [0.10, 0.90]]

loss = info_nce_loss(query_vectors, positive_vectors)
score_pos = dot(normalize(query_vectors[0]), normalize(positive_vectors[0]))
score_neg = dot(normalize(query_vectors[0]), normalize(positive_vectors[1]))
extreme_loss = row_cross_entropy([1000.0, 986.0], correct=0)

print("loss:", round(loss, 4))
print("q1 positive score:", round(score_pos, 4))
print("q1 negative score:", round(score_neg, 4))
print("stable large-logit loss:", f"{extreme_loss:.8f}")
```
Output

```text
loss: 0.0104
q1 positive score: 0.9986
q1 negative score: 0.1104
stable large-logit loss: 0.00000083
```

The equation is often expanded into raw exponentials when calculating a small example on paper. Code should compute the same expression with log-sum-exp or a framework cross-entropy operation, so large logits don't overflow.

Triplet loss

A second contrastive objective, triplet loss, focuses on individual anchor-positive-negative triplets instead of comparing one anchor against a whole candidate pool:

$$\mathcal{L} = \max(0, d(a, p) - d(a, n) + m)$$

Where:

  • $d(\cdot, \cdot)$ is the Euclidean distance between embeddings
  • $m$ is a margin hyperparameter
  • $a$ is the anchor, $p$ is the positive, $n$ is the negative

The loss enforces that the anchor must be closer to the positive than to the negative by at least margin $m$: $d(a, p) + m \leq d(a, n)$. If this constraint is already satisfied, the loss is zero. Because the loss is clipped at zero, triplets that already satisfy the margin contribute no gradient, so the model doesn't waste capacity pushing already-distant negatives even farther away.

Worked example: computing triplet loss by hand

Consider three sentences about return labels:

| Role | Sentence |
| --- | --- |
| Anchor ($a$) | "How do I print a return label?" |
| Positive ($p$) | "Where can I generate a return shipping label?" |
| Negative ($n$) | "How do I replace a damaged shipping label?" |

After encoding, suppose the distances are:

  • $d(a, p) = 0.2$ (the paraphrase is close)
  • $d(a, n) = 0.5$ (the hard negative is farther, but not by much)

With margin $m = 0.1$, plug into the formula:

$$0.2 - 0.5 + 0.1 = -0.2$$

$$\max(0, -0.2) = 0$$

The loss is zero because the model already satisfies the margin constraint: the positive is closer than the negative by more than 0.1. Now imagine a bad model where $d(a, p) = 0.5$ and $d(a, n) = 0.3$ (the negative is closer than the positive):

$$0.5 - 0.3 + 0.1 = 0.3$$

$$\max(0, 0.3) = 0.3$$

A non-zero loss tells the optimizer to push the anchor and positive together while pushing the negative away until the gap exceeds the margin.
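
Both hand calculations can be reproduced with a short function. In this sketch the embeddings are hypothetical two-dimensional points placed on one axis so that their Euclidean distances match the worked numbers; real sentence embeddings would come from an encoder.

```python
from math import dist

def triplet_loss(
    anchor: list[float],
    positive: list[float],
    negative: list[float],
    margin: float = 0.1,
) -> float:
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

anchor = [0.0, 0.0]

# Good model: d(a, p) = 0.2 and d(a, n) = 0.5, so the margin is already satisfied.
good_loss = triplet_loss(anchor, positive=[0.2, 0.0], negative=[0.5, 0.0])

# Bad model: d(a, p) = 0.5 and d(a, n) = 0.3, so the negative sits closer.
bad_loss = triplet_loss(anchor, positive=[0.5, 0.0], negative=[0.3, 0.0])

print("good model loss:", round(good_loss, 3))
print("bad model loss:", round(bad_loss, 3))
```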

Key differences from InfoNCE

  • Triplet loss compares a chosen negative against a margin, so mining determines most of its learning signal.
  • InfoNCE compares each anchor with a candidate pool. Batches supply negatives cheaply, but they can also contain false negatives.
  • Neither objective is an automatic win. Choose data construction deliberately and evaluate held-out retrieval failures, not training loss alone.

Temperature parameter τ

Temperature controls the "sharpness" of the softmax distribution over similarity scores.

For the worked similarity gap, $0.90 - 0.20 = 0.70$, changing temperature changes the positive probability:

| $\tau$ | $P(\text{positive})$ for this row | What to inspect |
| --- | --- | --- |
| 0.01 | $> 0.9999$ | Saturates quickly; a false negative receives extreme pressure. |
| 0.05 | 0.999999 | Very sharp separation for this easy row. |
| 0.10 | 0.9991 | Still confident, with less sharpness. |
| 1.00 | 0.6682 | Much flatter signal. |

There is no universal best temperature. Tune it against held-out retrieval failures and implement the loss stably. Low temperature amplifies mislabeled or false negatives; overflow is an implementation bug that stable log-softmax or cross-entropy avoids.

[Figure: Temperature effects in contrastive learning showing sharper probabilities for low temperature, flatter probabilities for high temperature, and higher pressure on false negatives at very low values.]
Temperature is a sharpness knob: low values make hard negatives dominate training, while high values flatten the softmax and weaken the separation signal.
temperature-sharpens-the-same-row.py
```python
from math import exp

def probability_of_positive(
    positive_similarity: float,
    negative_similarity: float,
    temperature: float,
) -> float:
    # With one positive and one negative, the softmax reduces to a sigmoid
    # of the temperature-scaled similarity gap.
    scaled_gap = (positive_similarity - negative_similarity) / temperature
    return 1.0 / (1.0 + exp(-scaled_gap))

for temperature in (0.01, 0.05, 0.10, 1.00):
    probability = probability_of_positive(0.90, 0.20, temperature)
    print(f"tau={temperature:.2f}: P(positive)={probability:.6f}")
```
Output

```text
tau=0.01: P(positive)=1.000000
tau=0.05: P(positive)=0.999999
tau=0.10: P(positive)=0.999089
tau=1.00: P(positive)=0.668188
```


Training strategies for sentence embeddings

Training a good embedding model requires good training data. You need pairs of sentences that are either semantically similar or different, and you need enough diversity to teach the model real distinctions. Two main approaches dominate.

Supervised: fine-tuning on NLI data

Natural Language Inference (NLI) labels whether a hypothesis follows from a premise (entailment), conflicts with it (contradiction), or does neither (neutral). Entailment is directional, not a promise that two texts are interchangeable.

SBERT trained Siamese and triplet architectures with NLI supervision and evaluated sentence similarity behavior.[1] Supervised SimCSE uses entailment as a positive and the corresponding contradiction as a hard negative.[3] This is a useful training construction, but you still need retrieval evaluation before treating two policy passages as substitutes.

SBERT's architectural move is simple but important: it runs both sentences through the same encoder with shared weights, pools each sentence into a single vector, and then trains on top of those pooled embeddings.[1] During inference, you only keep the single-sentence encoding path. That shared-weight Siamese network setup (two inputs processed through the same shared encoder) is what makes precomputing document embeddings and doing nearest-neighbor search practical.

Self-supervised: SimCSE

What if you don't have labeled NLI data? SimCSE (Simple Contrastive Learning of Sentence Embeddings) from Gao et al. (2021)[3] shows you don't need it. The trick is elegant: pass the same sentence through the encoder twice with different dropout masks. Since dropout randomly zeros out different neurons each time, you get two slightly different embedding vectors for the same sentence. These two views are positives (they should be similar), while all other sentences in the batch are negatives.

This gives surprisingly strong embeddings without labeled pairs, although supervised SimCSE performs better across the paper's reported STS tasks.[3] The key insight is that the stochasticity in the transformer forward pass, which is normally just a regularization technique, becomes an implicit data augmentation mechanism for contrastive learning.
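
The dropout trick can be illustrated without a transformer. The sketch below is a toy stand-in, not SimCSE itself: it applies inverted dropout directly to one fixed hypothetical vector, whereas the real method relies on dropout inside the encoder layers. The `hidden` values, the 0.1 rate, and the seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt

def dropout(vector: list[float], rate: float, rng: random.Random) -> list[float]:
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [value / keep if rng.random() < keep else 0.0 for value in vector]

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    return dot / (sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right)))

rng = random.Random(0)

# Hypothetical 64-dimensional "sentence representation" with values in [-1, 1].
hidden = [((i * 37) % 11 - 5) / 5.0 for i in range(64)]

# Two passes over the same input draw two different dropout masks.
view_a = dropout(hidden, rate=0.1, rng=rng)
view_b = dropout(hidden, rate=0.1, rng=rng)

views_differ = view_a != view_b
view_similarity = cosine(view_a, view_b)

print("views differ:", views_differ)
print("cosine between views:", round(view_similarity, 3))
```

The two views are not identical, yet they stay close to each other, which is the property a positive pair needs.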

Data augmentation

Beyond dropout, augmentations are hypotheses about meaning preservation:

  • Dropout masks (SimCSE): different mask patterns per forward pass
  • Verified paraphrases or back-translations: use only after checking that policy scope and exceptions survive
  • Domain pairs: mine resolved duplicate questions that cite the same approved policy passage
  • Re-ranking with cross-encoders: score candidate paraphrases before accepting them as positives

Avoid casual word deletion or insertion for policy text: dropping "not," a product category, or an approval condition changes the rule while incorrectly labeling the pair positive.

In-batch negatives in practice

Most contrastive learning implementations use in-batch negatives by default: for a batch of $N$ positive pairs, each anchor has one matching positive, and the other $N-1$ candidates in the batch act as negatives. This is efficient because you get many negatives "for free" without explicitly labeling them.

Larger batches increase the chance of informative competitors, but they also increase the chance of false negatives: another row may cite the same relevant policy as the anchor while the loss treats it as wrong. Distributed training commonly gathers embeddings across GPUs before computing this loss. Plain gradient accumulation doesn't create more in-batch negatives unless the implementation explicitly reuses embeddings across microbatches.

audit-false-negatives-before-training.py
```python
batch = [
    {"query": "Can I return a cracked tablet?", "policy_id": "return-policy-us-v3"},
    {"query": "What if electronics arrive damaged?", "policy_id": "return-policy-us-v3"},
    {"query": "Where is my return label?", "policy_id": "shipping-label-v2"},
]

# Two rows that cite the same policy are each other's false negatives:
# the loss would treat a relevant passage as wrong.
false_negatives = []
for anchor_index, anchor in enumerate(batch):
    for candidate_index, candidate in enumerate(batch):
        if anchor_index == candidate_index:
            continue
        if candidate["policy_id"] == anchor["policy_id"]:
            false_negatives.append(
                f"row {anchor_index} treats row {candidate_index} as negative"
            )

print("false negatives:", false_negatives)
print("action: deduplicate shared policy positives before InfoNCE")
```
Output

```text
false negatives: ['row 0 treats row 1 as negative', 'row 1 treats row 0 as negative']
action: deduplicate shared policy positives before InfoNCE
```

Hard negative mining

Why hard negatives matter

Random negatives are too easy. The model quickly learns to distinguish "refund label missing" from "warehouse forklift battery." Hard negatives force the model to learn subtle semantic distinctions, which is where real retrieval quality comes from.

| Negative type | Anchor | Candidate | Why it matters |
| --- | --- | --- | --- |
| Easy negative | "How do I print a return label?" | "The warehouse forklift needs charging" | Different topic; the model learns this separation almost immediately |
| Hard negative | "How do I print a return label?" | "How do I replace a damaged shipping label?" | Same keywords, different intent; forces fine-grained learning |
[Figure: Hard negative mining for sentence embeddings comparing easy negatives, BM25 lexical negatives, cross-encoder filtered hard negatives, and iterative mining as the model improves.]
Hard negative mining works because lexical similarity alone isn't enough: the best training examples share words with the anchor but answer a different intent.

Mining strategies

1. In-batch negatives

Use other examples in the batch. Simple, scales with batch size, but negatives may be too easy.

2. BM25 negatives

Use a lexical search algorithm like BM25 (Best Matching 25, a sparse retrieval function based on keyword frequency) to find documents that are lexically similar but semantically different:

```text
Query: "How do I print a return label?"
Hard negative: "Return label printer calibration guide"  # shares "return" and "label" but answers a different question
Easy negative: "Forklift battery maintenance schedule"
```
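
To make the lexical step concrete, here is a minimal BM25 scorer in plain Python. It is a sketch under stated assumptions: the `k1 = 1.5` and `b = 0.75` values are commonly used defaults rather than settings from this lesson, the IDF uses the variant with `+ 1` inside the logarithm so scores stay non-negative, and the tokenizer only lowercases and strips punctuation.

```python
from collections import Counter
from math import log

def tokenize(text: str) -> list[str]:
    return [part.strip("?.!,").lower() for part in text.split()]

def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    documents = [tokenize(document) for document in corpus]
    average_length = sum(len(document) for document in documents) / len(documents)
    # Number of documents that contain each term at least once.
    document_frequency = Counter(term for document in documents for term in set(document))

    scores: list[float] = []
    for document in documents:
        term_frequency = Counter(document)
        length_penalty = k1 * (1.0 - b + b * len(document) / average_length)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_frequency:
                continue
            matches = document_frequency[term]
            idf = log((len(documents) - matches + 0.5) / (matches + 0.5) + 1.0)
            frequency = term_frequency[term]
            score += idf * frequency * (k1 + 1.0) / (frequency + length_penalty)
        scores.append(score)
    return scores

corpus = [
    "How do I print a return shipping label?",
    "Return label printer calibration guide",
    "How do I replace a damaged shipping label?",
    "Forklift battery maintenance schedule",
]
scores = bm25_scores("How do I print a return label?", corpus)

for document, score in sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {document}")
```

A high BM25 score only says a passage shares informative words with the query. Whether it is a hard negative or a true positive still has to be decided by labels or a relevance model.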

3. Cross-encoder mining

Use a cross-encoder to find high-BM25-score but low-relevance passages. The code below demonstrates the control flow with deterministic helper functions: lexical overlap stands in for BM25, and a small relevance scorer stands in for the cross-encoder. The important behavior is the filter: keep candidates that look lexically relevant but fail the semantic relevance check.

3-cross-encoder-mining.py
```python
def tokens(text: str) -> set[str]:
    return {part.strip("?.!,").lower() for part in text.split()}

def lexical_overlap(query: str, candidate: str) -> int:
    # Stand-in for BM25: count shared tokens.
    return len(tokens(query) & tokens(candidate))

def cross_encoder_relevance(query: str, candidate: str) -> float:
    # Stand-in for a cross-encoder: hand-written rules that return a score.
    query_terms = tokens(query)
    candidate_terms = tokens(candidate)

    if "print" in query_terms and "print" in candidate_terms:
        return 0.92
    if "generate" in candidate_terms and "label" in candidate_terms:
        return 0.81
    return 0.18

def mine_hard_negatives(
    queries: list[str],
    corpus: list[str],
    top_k: int = 2,
) -> dict[str, list[str]]:
    hard_negatives: dict[str, list[str]] = {}

    for query in queries:
        lexical_candidates = sorted(
            corpus,
            key=lambda candidate: lexical_overlap(query, candidate),
            reverse=True,
        )

        # Keep candidates that look lexically relevant but fail the relevance check.
        hard_negatives[query] = [
            candidate
            for candidate in lexical_candidates
            if lexical_overlap(query, candidate) > 0
            and cross_encoder_relevance(query, candidate) < 0.30
        ][:top_k]

    return hard_negatives

queries = ["How do I print a return label?"]
corpus = [
    "How do I print a return shipping label?",
    "Return label printer calibration guide",
    "How do I replace a damaged shipping label?",
    "Forklift battery maintenance schedule",
]

hard = mine_hard_negatives(queries, corpus)
print(hard[queries[0]])
```
Output

```text
['How do I replace a damaged shipping label?', 'Return label printer calibration guide']
```

4. Iterative mining

Re-mine hard negatives after each training epoch using the improved model. This progressively finds harder examples as the model improves.


Bi-encoder vs cross-encoder

[Figure: Bi-encoder, cross-encoder, and late-interaction retrieval architectures compared by query-document interaction, precomputability, latency, relevance detail, and index size.]
Embedding retrieval architecture is a placement decision: bi-encoders interact at vector comparison time, cross-encoders interact inside attention, and late-interaction models keep token-level matching without full pairwise scoring.

Bi-encoder (dual encoder)

Encode the query and the document independently, then compare the two vectors with a dot product or cosine similarity.

Advantages

Documents can be pre-encoded and indexed. At query time, you only encode the query once, then use an ANN (Approximate Nearest Neighbor) index to retrieve candidates sublinearly in practice.

Disadvantages

No cross-attention between query and document; it may miss fine-grained relevance signals.
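A minimal sketch of the bi-encoder pattern, assuming the document vectors below are hypothetical outputs of an encoder that ran once offline. The brute-force scan stands in for an ANN index, which would replace the loop at corpus scale.

```python
from math import sqrt

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    norm = sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right))
    return dot / norm

# Hypothetical document vectors, encoded once offline and stored in an index.
document_index = {
    "return-policy": [0.9, 0.1, 0.0],
    "label-printer": [0.2, 0.9, 0.1],
    "forklift-battery": [0.0, 0.1, 0.9],
}

def search(
    query_vector: list[float],
    index: dict[str, list[float]],
    top_k: int = 2,
) -> list[str]:
    """Score every stored vector against the query; no document is re-encoded."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]

# Only the query is encoded at request time (vector is a hypothetical encoder output).
query_vector = [0.8, 0.3, 0.0]
print(search(query_vector, document_index))
```

The query and each document only meet inside `cosine`, which is why the model cannot use cross-attention between them.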

Cross-encoder

Concatenate the query and document into one input and process them jointly through the full transformer to produce a relevance score.

Advantages

Full attention can model phrase order, negation, and query-document interactions that a single-vector score misses. On a suitable shortlist, this often improves precision over first-stage vector scoring.

Disadvantages

You must run inference for every (query, document) pair. If you scored the full corpus directly, that's O(N) transformer passes per query, which is too slow for large-scale search.

Late interaction: ColBERT

ColBERT (Contextualized Late Interaction over BERT)[7] offers a middle ground between bi-encoder and cross-encoder:

[Figure: ColBERT late interaction MaxSim scoring showing query-token vectors matched against document-token vectors, max similarity selected per query token, and summed into a relevance score.]
ColBERT stores token vectors for documents, then scores a query by taking each query token's best document-token match and summing those MaxSim values.

Instead of a single embedding per document, ColBERT stores per-token embeddings and computes relevance using MaxSim: for each query token, find the maximum similarity to any document token, then sum.
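Written as a formula, the MaxSim description reads as follows. The notation is an assumption of this sketch: $\mathbf{q}_i$ is the $i$-th query-token vector, $\mathbf{d}_j$ is the $j$-th document-token vector, and both are taken to be unit-normalized so the dot product acts as a cosine similarity.

```latex
S(q, d) \;=\; \sum_{i=1}^{|q|} \; \max_{1 \le j \le |d|} \; \mathbf{q}_i \cdot \mathbf{d}_j
```

The inner maximum picks each query token's best document-token match, and the outer sum adds those per-token matches into one relevance score.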

Advantages

Retains token-level matching signals while documents can still be pre-encoded.

Disadvantages

Much larger index size (one vector per token instead of per document).
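A back-of-the-envelope estimate shows the scale of that storage gap. Every number here is an illustrative assumption: one million documents, 200 stored token vectors per document, 128-dimensional token vectors, a 768-dimensional single-vector baseline, and uncompressed float32 values.

```python
def index_bytes(
    num_docs: int,
    vectors_per_doc: int,
    dims: int,
    bytes_per_value: int = 4,  # float32, no compression (assumption)
) -> int:
    return num_docs * vectors_per_doc * dims * bytes_per_value

num_docs = 1_000_000
single_vector = index_bytes(num_docs, vectors_per_doc=1, dims=768)
per_token = index_bytes(num_docs, vectors_per_doc=200, dims=128)

print(f"single-vector index: {single_vector / 1e9:.2f} GB")
print(f"per-token index: {per_token / 1e9:.2f} GB")
print(f"ratio: {per_token / single_vector:.1f}x")
```

Under these assumptions the per-token index is roughly 33 times larger. Compression and shorter documents shrink the gap, so treat the ratio as a prompt to measure your own corpus.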

This architecture choice is a budget tradeoff, not a universal ranking. A bi-encoder protects first-stage recall at corpus scale; a cross-encoder can improve precision for a bounded candidate set; late interaction spends more index space to preserve token-level evidence.

maxsim-keeps-token-matches.py
```python
similarities = {
    "return": {"damaged": 0.12, "return": 0.93, "tablet": 0.08},
    "tablet": {"damaged": 0.19, "return": 0.07, "tablet": 0.91},
}

best_by_query_token = {
    query_token: max(document_scores.values())
    for query_token, document_scores in similarities.items()
}
score = sum(best_by_query_token.values())

print("best token scores:", best_by_query_token)
print("MaxSim score:", round(score, 2))
```
Output
```text
best token scores: {'return': 0.93, 'tablet': 0.91}
MaxSim score: 1.84
```

A production reranking pattern

[Figure: Two-stage retrieval pipeline where a bi-encoder retrieves a top-10 shortlist within a 200 millisecond example budget, a cross-encoder reranks all 10 candidates, and final results are served.]
Bi-encoder retrieval protects recall across the whole corpus. Cross-encoder attention is reserved for a small shortlist where precision gains justify the latency cost.

In a production system, these two architectures are combined in a two-stage pipeline to balance speed and accuracy. The function below illustrates this pattern with deterministic scores: first retrieve a broad candidate set using a fast bi-encoder score, then re-score those candidates with a slower cross-encoder score and return the final ranking.

the-reranking-pattern-production-standard.py
```python
documents = [
    {
        "id": "carrier-delay-credit",
        "text": "Carrier delay credit policy for late packages",
        "bi_score": 0.82,
        "cross_score": 0.97,
    },
    {
        "id": "return-label-printer",
        "text": "Return label printer calibration guide",
        "bi_score": 0.79,
        "cross_score": 0.22,
    },
    {
        "id": "late-package-refund",
        "text": "Refund workflow for late package delivery",
        "bi_score": 0.77,
        "cross_score": 0.91,
    },
    {
        "id": "forklift-battery",
        "text": "Warehouse forklift battery maintenance",
        "bi_score": 0.10,
        "cross_score": 0.05,
    },
]

def search_with_rerank(
    query: str,
    corpus: list[dict[str, str | float]],
    candidate_k: int = 3,
    top_k: int = 2,
) -> list[str]:
    candidates = sorted(corpus, key=lambda doc: float(doc["bi_score"]), reverse=True)[
        :candidate_k
    ]
    reranked = sorted(
        candidates,
        key=lambda doc: float(doc["cross_score"]),
        reverse=True,
    )
    return [str(doc["id"]) for doc in reranked[:top_k]]

results = search_with_rerank("late package refund", documents)
print(results)
```
Output
```text
['carrier-delay-credit', 'late-package-refund']
```

Instruction-tuned embeddings

The problem with task ambiguity

"Label" means different things for clustering (group return-label documents), retrieval (find barcode-label troubleshooting), and classification (is this about shipping or inventory?). Traditional embeddings produce the same vector regardless of the downstream task. This limits performance when a single model must serve multiple purposes.

Task-specific prefixes and instructions

Some embedding families expose task prefixes or lightweight instructions that steer the encoder toward retrieval, clustering, or classification. E5 is a simple example: it distinguishes inputs like query: and passage: during contrastive pre-training.[5] INSTRUCTOR-style models go further and condition the embedding on an explicit task instruction.[8] The format is model-specific. Prefixes that help one family can hurt another, so use the format given in each model's documentation or training recipe.

This example keeps the families separate on purpose. The exact prefix or instruction string depends on the embedding family you chose:

task-specific-prefixes-and-instructions.py
```python
def format_e5_pair(query: str, passage: str) -> tuple[str, str]:
    """E5-style inputs are prefixed strings."""
    return f"query: {query}", f"passage: {passage}"

def format_instructor_input(instruction: str, text: str) -> list[str]:
    """INSTRUCTOR-style inputs are commonly [instruction, text] pairs."""
    return [instruction, text]

e5_query, e5_passage = format_e5_pair(
    "what is contrastive learning?",
    "Contrastive learning pulls positive pairs together in embedding space.",
)

instructor_example = format_instructor_input(
    "Represent the Wikipedia question for retrieving supporting documents:",
    "what is contrastive learning?",
)

assert e5_query.startswith("query: ")
assert e5_passage.startswith("passage: ")
assert instructor_example[0].startswith("Represent")
assert instructor_example[1] == "what is contrastive learning?"
```

This doesn't mean one magic prefix solves every task. It means some embedding families expect an extra conditioning signal. Use the format documented for that specific model family, then benchmark it on your own retrieval, clustering, or classification workload.


Matryoshka representation learning (MRL)

The idea

Matryoshka embeddings are named after nesting dolls because selected prefix widths are trained to remain useful on their own. For example, a full 768-dimensional embedding can be trained together with 128- and 32-dimensional prefixes. You then choose among trained and evaluated widths based on the storage-quality budget.

Train embeddings so that selected prefix widths preserve useful representations under their own losses:[9]

[Figure: Matryoshka representation learning showing one embedding trained with losses at several prefix dimensions and the resulting storage-accuracy tradeoff when truncating dimensions.]
Matryoshka training applies contrastive losses at multiple prefix sizes so smaller dimensions remain usable instead of becoming arbitrary truncations.

For a contrastively trained embedding model, the loss can be computed at several predefined truncation points simultaneously. The following runnable toy implementation slices full embeddings down to smaller prefixes, calculates the same InfoNCE objective at each prefix, and averages the losses:

the-idea.py
```python
from math import exp, log, sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def dot(left: list[float], right: list[float]) -> float:
    return sum(a * b for a, b in zip(left, right))

def logsumexp(values: list[float]) -> float:
    peak = max(values)
    return peak + log(sum(exp(value - peak) for value in values))

def info_nce_loss(
    query_vectors: list[list[float]],
    positive_vectors: list[list[float]],
    temperature: float = 0.2,
) -> float:
    queries = [normalize(vector) for vector in query_vectors]
    positives = [normalize(vector) for vector in positive_vectors]
    losses: list[float] = []

    for row, query in enumerate(queries):
        logits = [dot(query, candidate) / temperature for candidate in positives]
        losses.append(logsumexp(logits) - logits[row])

    return sum(losses) / len(losses)

def matryoshka_loss(
    embeddings_a: list[list[float]],
    embeddings_b: list[list[float]],
    dims: tuple[int, int, int] = (2, 4, 6),
) -> float:
    losses = []

    for dim in dims:
        truncated_a = [row[:dim] for row in embeddings_a]
        truncated_b = [row[:dim] for row in embeddings_b]
        losses.append(info_nce_loss(truncated_a, truncated_b))

    return sum(losses) / len(losses)

queries = [[1.0, 0.0, 0.9, 0.1, 0.5, 0.2], [0.0, 1.0, 0.1, 0.9, 0.2, 0.5]]
positives = [[0.95, 0.05, 0.85, 0.15, 0.45, 0.25], [0.05, 0.95, 0.15, 0.85, 0.25, 0.45]]

loss = matryoshka_loss(queries, positives)
print(round(loss, 4))
```
Output
```text
0.0153
```

Why it matters

| Benefit | Why it matters |
| --- | --- |
| Flexible deployment | Use the full embedding width for maximum accuracy, or a smaller prefix when storage is constrained. |
| No retraining | One model can serve several dimensionality budgets. |
| Graceful degradation | Performance should drop smoothly as dimensions shrink, but you still need to benchmark each cutoff. |
| Deployment constraint | Shorten only at dimensions a chosen model documents or you validate; arbitrary slicing isn't guaranteed to preserve rankings. |
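One way to benchmark a cutoff is to check how often the truncated vectors return the same top result as the full vectors. The sketch below is a toy under stated assumptions: the 4-dimensional vectors are hypothetical, they were not Matryoshka-trained, and `top1_agreement` is an illustrative helper rather than a standard metric name.

```python
from math import sqrt

def normalize(vector: list[float]) -> list[float]:
    norm = sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector]

def top1(query: list[float], docs: list[list[float]], dim: int) -> int:
    """Index of the best document using only the first `dim` dimensions."""
    q = normalize(query[:dim])
    scores = [
        sum(a * b for a, b in zip(q, normalize(doc[:dim])))
        for doc in docs
    ]
    return scores.index(max(scores))

def top1_agreement(
    queries: list[list[float]],
    docs: list[list[float]],
    dim: int,
    full_dim: int,
) -> float:
    """Fraction of queries whose top-1 result is unchanged after truncation."""
    matches = sum(top1(q, docs, dim) == top1(q, docs, full_dim) for q in queries)
    return matches / len(queries)

# Hypothetical vectors that were NOT trained with prefix-aware losses.
docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.1, 0.0, 0.8],
]
queries = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.9],  # relies on the last dimension to find its match
    [0.1, 0.9, 0.0, 0.0],
]

for dim in (4, 2):
    print(dim, "dims:", round(top1_agreement(queries, docs, dim, full_dim=4), 2))
```

The second query loses its correct match at width 2 because the signal it needs sits in a dropped dimension. That is the failure prefix-aware training is meant to reduce, and it is why each width needs its own measurement on real queries.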

Evaluation: STS and MTEB

Semantic textual similarity (STS)

Before large-scale benchmarks like MTEB existed, a common benchmark for evaluating sentence embeddings was Semantic Textual Similarity (STS). Datasets like STS-B (STS Benchmark) provide sentence pairs rated by human annotators for semantic relatedness:

```text
"A package is delayed" / "A shipment is running late" => 4.5
"A package is delayed" / "A refund has posted" => 1.2
```

To evaluate a model, you compute the cosine similarity for every pair using the model's embeddings, and then calculate the Spearman rank correlation between the model's similarity scores and the human ratings. A high correlation means the model's embedding space aligns well with human judgment.
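The evaluation step can be sketched without any dependencies. Spearman correlation is the Pearson correlation of the ranks, so the helper below ranks both lists first. The first two human ratings come from the example pairs above; the other two ratings and all four model cosine scores are hypothetical.

```python
from math import sqrt

def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    position = 0
    while position < len(order):
        end = position
        while end + 1 < len(order) and values[order[end + 1]] == values[order[position]]:
            end += 1
        mean_rank = (position + end) / 2 + 1
        for i in order[position : end + 1]:
            result[i] = mean_rank
        position = end + 1
    return result

def pearson(x: list[float], y: list[float]) -> float:
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x: list[float], y: list[float]) -> float:
    return pearson(ranks(x), ranks(y))

human_ratings = [4.5, 1.2, 3.0, 0.5]
model_cosines = [0.91, 0.40, 0.35, 0.10]  # hypothetical model outputs

print(round(spearman(model_cosines, human_ratings), 2))
```

The model swaps the order of the two middle pairs, so the correlation comes out at 0.8 rather than 1.0. Only the ordering matters: rescaling the cosine scores would not change the result.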

MTEB (Massive text embedding benchmark)

As models improved, optimizing only for STS was no longer sufficient. An embedding model excellent at STS might fail miserably at information retrieval or clustering. To address this, the Massive Text Embedding Benchmark (MTEB) was introduced.[10] The original MTEB paper evaluated models on 58 datasets grouped into 8 task categories, giving a much broader view of embedding quality than STS alone:

| Task | # Datasets | Example |
| --- | --- | --- |
| Classification | 12 | Sentiment, topic |
| Clustering | 11 | Document clustering |
| Pair Classification | 3 | Paraphrase detection |
| Reranking | 4 | Passage reranking |
| Retrieval | 15 | Question-passage retrieval |
| STS | 10 | Semantic similarity |
| Summarization | 1 | Summary similarity |
| Bitext Mining | 2 | Parallel sentence mining |

[Figure: Sentence embedding evaluation comparison showing STS as a narrow semantic-similarity check and MTEB as a broader benchmark across classification, clustering, reranking, retrieval, STS, summarization, and bitext mining tasks.]
STS is a useful narrow check, but production model choice should look across retrieval, reranking, clustering, classification, latency, and storage behavior.

Choosing a model in practice

Don't anchor on a single MTEB average. Deployment success usually depends on a few operational questions:

  • Does the model expect plain text, query/passage prefixes, or explicit instructions?
  • Can you shorten the embedding width safely, or are you locked into the full dimensionality?
  • How well does it handle your language mix, domain jargon, and query length distribution?
  • What are the latency, throughput, and memory costs once you batch and index it at production scale?
  • Do you still need BM25 or a cross-encoder reranker to hit Recall@K and NDCG targets?

In practice, architecture often matters more than a tiny leaderboard delta. A strong bi-encoder with good hard negatives, sensible chunking, and a reranker usually beats blindly swapping to the latest model name.

The original MTEB finding is the durable lesson: no one model dominated every task category.[10] Evaluate policy retrieval, reranking, languages, latency, and storage behavior that match your deployment instead of selecting by one aggregate score.
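Evaluating on your own workload usually starts with Recall@K and NDCG@K over a labeled query set. The sketch below assumes graded relevance labels and the linear-gain form of DCG; the document ids and labels are hypothetical.

```python
from math import log2

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of relevant documents that appear in the top k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def dcg(gains: list[float]) -> float:
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(gain / log2(position + 2) for position, gain in enumerate(gains))

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """DCG of the served ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking and graded labels for one query.
ranked = ["label-printer", "return-policy", "refund-workflow", "forklift"]
relevance = {"return-policy": 2.0, "refund-workflow": 1.0}

print("recall@2:", recall_at_k(ranked, set(relevance), k=2))
print("ndcg@3:", round(ndcg_at_k(ranked, relevance, k=3), 2))
```

Recall@K asks whether the right documents reached the shortlist at all, while NDCG@K also penalizes placing them low. Tracking both separates first-stage problems from reranking problems.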

Carry the evidence boundary into retrieval evaluation

document_qa_v2 can use an embedding retriever to propose policy passages, but vector proximity isn't authorization. A release test should verify both retrieval quality and that unapproved text never becomes answer evidence:

policy-retrieval-release-gate.py
```python
approved_evidence = {"return-policy-us-v3", "shipping-label-v2"}
retrieval_cases = [
    {
        "query": "Can I return a cracked tablet?",
        "expected": "return-policy-us-v3",
        "candidates": ["seller-private-note-44", "return-policy-us-v3"],
    },
    {
        "query": "Where do I get a return label?",
        "expected": "shipping-label-v2",
        "candidates": ["shipping-label-v2", "warehouse-note-12"],
    },
]
attack_candidates = ["seller-private-note-44"]

def approved_candidate(candidates: list[str]) -> str | None:
    return next((doc for doc in candidates if doc in approved_evidence), None)

served = [approved_candidate(case["candidates"]) for case in retrieval_cases]
hits = sum(
    evidence == case["expected"]
    for evidence, case in zip(served, retrieval_cases)
)
attack_evidence = approved_candidate(attack_candidates)

print("approved evidence recall@2:", f"{hits / len(retrieval_cases):.0%}")
print("served evidence:", served)
print("private-note attack evidence:", attack_evidence)
```
Output
```text
approved evidence recall@2: 100%
served evidence: ['return-policy-us-v3', 'shipping-label-v2']
private-note attack evidence: None
```

Key libraries and tools

Building embedding-based systems requires the right tooling:

| Tool | What it gives you |
| --- | --- |
| Sentence-Transformers (sentence-transformers) | Pretrained sentence embedding models, pooling modules, contrastive training losses such as MultipleNegativesRankingLoss, and batched encoding utilities. |
| FAISS (Facebook AI Similarity Search) | Efficient similarity search and clustering for dense vectors, including inverted-file and product-quantization approaches for approximate nearest neighbor search.[11] |

Mastery check

Key concepts

  • alignment and uniformity in embedding space
  • InfoNCE numerator, denominator, and temperature
  • hard negatives vs easy negatives
  • bi-encoder vs cross-encoder vs late interaction
  • reranking as recall first, then precision
  • Matryoshka prefix training for safe dimension cuts

Evaluation rubric

  • Foundational: Derives the InfoNCE objective and explains what the numerator, denominator, and temperature do.
  • Intermediate: Explains why raw BERT [CLS] embeddings fail for semantic search without sentence-level contrastive fine-tuning.
  • Intermediate: Explains why hard negatives matter more than random negatives once the model already separates broad topics.
  • Advanced: Compares bi-encoders, cross-encoders, and late-interaction models by latency, indexability, accuracy, and storage.
  • Advanced: Explains ColBERT's MaxSim scoring and why it keeps more token-level signal than a single document vector.
  • Advanced: Explains Matryoshka embeddings and when shorter prefixes are worth the storage-accuracy tradeoff.
  • Advanced: Designs a two-stage production retrieval pipeline with recall and latency budgets defended quantitatively.

Common pitfalls

Raw [CLS] is treated like a search-ready sentence embedding

Symptom: Nearly every query-document pair gets suspiciously similar cosine scores. Cause: Raw BERT [CLS] vectors weren't tuned for semantic retrieval and can inherit poorly discriminative anisotropic geometry. Fix: Start from a sentence embedding model or fine-tune with a contrastive objective before building nearest-neighbor search.

Negatives stay too easy

Symptom: Training loss falls, but recall on realistic queries barely moves. Cause: Random negatives stop teaching once the model separates unrelated topics. Fix: Mine BM25 or cross-encoder negatives that share words with the anchor but answer a different intent.

The reranker is asked to save missing recall

Symptom: The reranker looks good in pairwise inspection, yet the right document is often absent in production results. Cause: The correct passage never entered the shortlist. Fix: Tune first-stage Recall@K separately, then widen candidate budget before blaming the reranker.

Dimensions are shortened blindly

Symptom: Index storage drops as expected, but retrieval quality falls off a cliff. Cause: A standard embedding vector was truncated without prefix-aware training or provider support. Fix: Use Matryoshka-trained or provider-documented shortening controls and benchmark each target width.

Task conditioning is ignored

Symptom: One embedding model works for clustering but underperforms on retrieval. Cause: The model family expected query/passage prefixes or instructions, but every input was embedded as plain text. Fix: Follow the model card format for retrieval, clustering, and classification separately.


Try it yourself

These exercises let you verify your understanding without needing a GPU cluster.

Exercise 1: compute triplet loss by hand

Given an anchor $a$, positive $p$, and negative $n$ with distances $d(a,p) = 0.3$ and $d(a,n) = 0.7$, compute the triplet loss for margins $m = 0.1$ and $m = 0.5$. In which case does the model still have work to do?

Solution sketch: For $m = 0.1$: $0.3 - 0.7 + 0.1 = -0.3$, so $\max(0, -0.3) = 0$. The margin is already satisfied. For $m = 0.5$: $0.3 - 0.7 + 0.5 = 0.1$, so $\max(0, 0.1) = 0.1$. The larger margin forces the model to pull the positive even closer or push the negative farther away.
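The same arithmetic can be checked in a few lines. The function name and argument names are illustrative; the formula is the margin form used in the solution, max(0, d(a,p) - d(a,n) + m).

```python
def triplet_loss(d_ap: float, d_an: float, margin: float) -> float:
    """Hinge on the gap between positive and negative distances."""
    return max(0.0, d_ap - d_an + margin)

for margin in (0.1, 0.5):
    loss = triplet_loss(d_ap=0.3, d_an=0.7, margin=margin)
    print(f"margin={margin}: loss={loss:.2f}")
```

A loss of zero means the triplet contributes no gradient, which is why easy triplets stop teaching the model anything.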

Exercise 2: spot the embedding mistake

A teammate reports that their semantic search system returns nearly identical similarity scores for every query-document pair. They are using a pretrained BERT model and taking the [CLS] token as the sentence embedding. What is the most likely cause, and what is the smallest change that would fix it?

Solution sketch: Raw BERT [CLS] embeddings weren't trained to make cosine distance a semantic-retrieval score, and anisotropic geometry can make their scores poorly discriminative. The smallest fix is to switch to a sentence embedding model that was fine-tuned with a sentence-level objective (for example, SBERT or E5), rather than using raw BERT.
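A quick diagnostic for this symptom is the mean pairwise cosine similarity across unrelated sentences. The toy below uses hypothetical 3-dimensional vectors built from a large shared offset plus a small sentence-specific part, which is a simplified stand-in for anisotropic geometry rather than real BERT output.

```python
from itertools import combinations
from math import sqrt

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    norm = sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right))
    return dot / norm

def mean_pairwise_cosine(vectors: list[list[float]]) -> float:
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical geometry: every vector shares one dominant direction.
shared_offset = [10.0, 10.0, 10.0]
sentence_specific = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
raw_vectors = [
    [offset + part for offset, part in zip(shared_offset, row)]
    for row in sentence_specific
]

print("with shared offset:", round(mean_pairwise_cosine(raw_vectors), 3))
print("sentence-specific parts only:", round(mean_pairwise_cosine(sentence_specific), 3))
```

When unrelated sentences average a cosine near 1, the scores carry little ranking signal. The second number only shows what the scores look like without the shared component; it is not a proposed fix, and the recommended change is still a model trained with a sentence-level objective.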

Exercise 3: design a two-stage retrieval pipeline

You have 2 million support tickets and a latency budget of 200 ms per query. You own a bi-encoder that encodes a query in 10 ms and a cross-encoder that scores one query-document pair in 15 ms. Why is scoring the full corpus with the cross-encoder impossible, and what pipeline would hit the latency budget?

Solution sketch: $2{,}000{,}000 \times 15\,\text{ms} = 30{,}000{,}000\,\text{ms} \approx 8.3$ hours per query. The cross-encoder is far too slow for the full corpus. Reserve 10 ms for query encoding and choose a top-10 shortlist only if ANN lookup and overhead fit inside the remaining 40 ms: reranking then costs $10 \times 15\,\text{ms} = 150\,\text{ms}$, for at most 200 ms total. If Recall@10 is inadequate, the budget or reranker throughput must change; silently widening to top 100 violates the requirement.
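The budget arithmetic can be written as a small calculator. It assumes pairs are scored one after another with no batching, and it treats the 40 ms ANN-plus-overhead figure as an upper bound taken from the solution rather than a measured value.

```python
def pipeline_ms(encode_ms: int, ann_ms: int, pair_ms: int, shortlist: int) -> int:
    """Total latency when shortlist pairs are reranked sequentially."""
    return encode_ms + ann_ms + shortlist * pair_ms

def max_shortlist(budget_ms: int, encode_ms: int, ann_ms: int, pair_ms: int) -> int:
    """Largest shortlist that still fits inside the latency budget."""
    return (budget_ms - encode_ms - ann_ms) // pair_ms

full_corpus_hours = 2_000_000 * 15 / 1000 / 3600
print("full-corpus cross-encoder:", round(full_corpus_hours, 1), "hours per query")
print("top-10 pipeline:", pipeline_ms(10, 40, 15, 10), "ms")
print("largest shortlist in budget:", max_shortlist(200, 10, 40, 15))
print("top-100 pipeline:", pipeline_ms(10, 40, 15, 100), "ms")
```

The last line makes the violation concrete: a top-100 shortlist costs 1,550 ms under the same assumptions, so widening it requires a faster or batched reranker, not a quiet configuration change.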


What you have now

You now understand why raw transformer outputs fail for semantic search, how contrastive learning reshapes the embedding space, and how to train and deploy sentence embeddings in production. You can explain InfoNCE and triplet loss with concrete numbers, mine hard negatives from lexical overlap, and design a two-stage retrieval pipeline that balances speed and accuracy.

The next article, Embedding Similarity & Quantization, builds directly on this foundation. You will learn the mathematical details of cosine similarity versus dot product, how Matryoshka truncation changes the accuracy-storage tradeoff, and how to quantize embeddings to 8-bit, 4-bit, or binary formats for large-scale indexes. Those techniques only make sense once you understand why the embedding space was shaped by contrastive loss in the first place.


Contrastive learning explains how useful sentence vectors are trained; similarity scoring and quantization show how those vectors are searched and stored efficiently at scale.

References

[1] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Reimers, N., & Gurevych, I. · 2019 · EMNLP 2019

[2] How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. Ethayarajh, K. · 2019

[3] SimCSE: Simple Contrastive Learning of Sentence Embeddings. Gao, T., Yao, X., & Chen, D. · 2021 · EMNLP 2021

[4] Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. Wang, T., & Isola, P. · 2020 · ICML 2020

[5] Text Embeddings by Weakly-Supervised Contrastive Pre-training. Wang, L., et al. · 2022

[6] Representation Learning with Contrastive Predictive Coding. Oord, A. van den, Li, Y., & Vinyals, O. · 2018

[7] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Khattab, O., & Zaharia, M. · 2020 · SIGIR 2020

[8] One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Su, H., et al. · 2022 · arXiv preprint

[9] Matryoshka Representation Learning. Kusupati, A., et al. · 2022 · NeurIPS 2022

[10] MTEB: Massive Text Embedding Benchmark. Muennighoff, N., et al. · 2023 · EACL 2023

[11] Billion-scale similarity search with GPUs. Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint