LeetLLM
© 2026 LeetLLM. All rights reserved.

Hard · Embeddings & Vector Search

Embedding Similarity & Quantization

Learn vector scoring contracts, evaluate Matryoshka widths, and measure scalar, product, and binary quantization before shipping compressed retrieval.

38 min read

The previous chapter left document_qa_v2 with a strict boundary: a retriever may propose policy passages, but only approved evidence may support an answer. Now the approved policy collection is growing, and every query has to search it quickly without silently degrading which passage is retrieved.

Suppose a customer reports a cracked tablet. The correct result is the approved return policy, not a private seller note that happens to repeat "cracked tablet" many times. Both passages become long lists of numbers called embeddings (or vectors). Retrieval compares those vectors, but its scoring and compression choices determine whether the approved policy even reaches the evidence gate.

This chapter teaches three vector scoring rules, shows exactly when they rank results identically, and then tackles memory pressure. You will compress stored vectors with scalar, product, and binary quantization, measure what each shortcut loses, and build the shortlist-and-rescore pattern that protects answer quality.

Three ways to measure closeness

An embedding is just a point in high-dimensional space. If you have two points, you can ask how near they are. There are three common rulers, and each answers a slightly different question.

  • Cosine similarity asks: "Do they point in the same direction?"
  • Dot product asks: "Do they point in the same direction, and how strong are they?"
  • Euclidean distance asks: "What's the straight-line distance between them?"

For any pipeline that normalizes every vector to length 1, cosine similarity and dot product become mathematically equivalent for ranking. That turns the metric decision into a systems decision: use the scoring rule your search library can execute quickly, then keep the same normalization contract at ingest time and query time.

Let's see how they differ with a tiny concrete example you can follow without a computer.

A worked example in two dimensions

Suppose we embed two support tickets into tiny 2D vectors so we can draw them on paper:

  • Ticket A: "Carrier failed pickup" maps to [3, 4]
  • Ticket B: "Pickup delayed by carrier" maps to [4, 3]

Both point northeast, so they should be similar. Let's measure that similarity with each ruler.

Cosine similarity

Cosine similarity ignores length and looks only at direction. The formula is:

$$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$

First, compute the dot product in the numerator. Multiply matching dimensions and add:

$$\mathbf{A} \cdot \mathbf{B} = (3 \times 4) + (4 \times 3) = 12 + 12 = 24$$

Next, compute the lengths (norms) of each vector. The length of a vector is the square root of the sum of squared components:

$$\|\mathbf{A}\| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5$$

$$\|\mathbf{B}\| = \sqrt{4^2 + 3^2} = \sqrt{16 + 9} = \sqrt{25} = 5$$

Now divide:

$$\cos(\theta) = \frac{24}{5 \times 5} = \frac{24}{25} = 0.96$$

A score of 0.96 means these invented vectors point in nearly the same direction. Whether direction corresponds to relevance in a real index is something the embedding training and retrieval evaluation must establish.

Dot product

The dot product skips the normalization step and uses the raw sum we already calculated:

$$\mathbf{A} \cdot \mathbf{B} = 24$$

Because both vectors have length 5, the dot product here is simply the cosine scaled by 5 × 5 = 25. But if one vector were much longer, the dot product would grow even though the angle stayed the same.

Euclidean distance

Subtract the dimensions, square the differences, sum them, and take the square root:

$$\|\mathbf{A} - \mathbf{B}\|_2 = \sqrt{(3-4)^2 + (4-3)^2} = \sqrt{1 + 1} = \sqrt{2} \approx 1.41$$

A small Euclidean distance also tells us these tickets are close in space. For this single pair, all three rulers describe a close relationship. Add candidates with different norms, even in two dimensions, and their rankings can diverge. The choice of ruler changes which tickets you retrieve.

The same arithmetic becomes a few lines of NumPy. This is a direct way to test whether your retrieval code matches the math you just worked by hand:

a-worked-example-in-two-dimensions.py

```python
import numpy as np

ticket_a = np.array([3.0, 4.0])
ticket_b = np.array([4.0, 3.0])

dot = float(ticket_a @ ticket_b)
cosine = dot / (np.linalg.norm(ticket_a) * np.linalg.norm(ticket_b))
euclidean = float(np.linalg.norm(ticket_a - ticket_b))

print(f"dot={dot:.0f}, cosine={cosine:.2f}, euclidean={euclidean:.2f}")

# Scale one vector by 10: the angle is unchanged, but the dot product grows.
longer_ticket_a = ticket_a * 10
same_direction_cosine = float(
    (longer_ticket_a @ ticket_b)
    / (np.linalg.norm(longer_ticket_a) * np.linalg.norm(ticket_b))
)
larger_dot = float(longer_ticket_a @ ticket_b)

print(f"10x vector: cosine={same_direction_cosine:.2f}, dot={larger_dot:.0f}")
```

Output

```text
dot=24, cosine=0.96, euclidean=1.41
10x vector: cosine=0.96, dot=240
```
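To make the divergence concrete, here is a small sketch with three invented 2D candidates scored against Ticket A as the query. The candidate vectors and their labels are assumptions chosen for illustration so that each ruler prefers a different candidate; they are not real embeddings.

```python
import numpy as np

# Hypothetical 2D candidates, picked so each scoring rule prefers a different one.
query = np.array([3.0, 4.0])  # Ticket A from the worked example
labels = ["short_same_direction", "ticket_b", "long_off_axis"]
candidates = np.array(
    [
        [0.6, 0.8],    # same direction as the query, but short
        [4.0, 3.0],    # Ticket B from the worked example
        [30.0, 10.0],  # off-axis and far away, but very long
    ]
)

dot = candidates @ query
cosine = dot / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(query))
euclidean = np.linalg.norm(candidates - query, axis=1)

# Similarities rank by largest value; distances rank by smallest value.
cosine_winner = labels[int(np.argmax(cosine))]
dot_winner = labels[int(np.argmax(dot))]
euclidean_winner = labels[int(np.argmin(euclidean))]

print(f"cosine winner: {cosine_winner}")
print(f"dot winner: {dot_winner}")
print(f"euclidean winner: {euclidean_winner}")
```

Cosine picks the short vector with the identical direction, raw dot product picks the long off-axis vector, and Euclidean distance picks Ticket B, the nearest point.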

Cosine similarity: direction only

Cosine similarity is useful when the retrieval contract says vector length shouldn't change ranking. It compares direction, not "meaning" directly. A strong embedding model can arrange relevant policy text in similar directions, but cosine alone can't prove relevance or authorization.

A useful mental model is direction versus load. Cosine checks whether two delivery trucks head the same way; it doesn't care how heavily loaded they are. Raw dot product cares about direction and load size. Which signal matters is set by model training and validated on retrieval queries, not chosen by intuition after indexing.

Cosine similarity ranges from $-1$ to $1$: 1 means same direction, 0 means orthogonal, and -1 means opposite direction. Its key property is magnitude invariance. A vector of [1, 1] has the same cosine direction as [100, 100] because they point the same way.

When to use cosine similarity

  • When the model card or your retrieval evaluation calls for angular comparison.
  • When document and query vectors are normalized consistently and vector norm isn't intended to carry ranking signal.
  • When an experiment on your own queries confirms that angular scoring retrieves the right evidence.

Dot product: direction plus strength

The dot product is an un-normalized alignment score between two vectors:

$$\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^n a_i b_i$$

Multiply each pair of matching dimensions and add them all up. Unlike cosine similarity, there's no dividing by lengths, so if one vector is twice as long, the dot product doubles. That makes it the cheaper query-time metric once vectors have already been unit-normalized, but it also makes raw dot products sensitive to how "big" each embedding is.

The dot product ranges from $-\infty$ to $+\infty$. Its key property is sensitivity to both direction and magnitude.

When to use dot product

  • When magnitude is intentionally part of the learned score, as in maximum inner product search (MIPS).
  • When your training setup or ANN library is optimized for inner-product search.
  • When both indexed vectors and query vectors are L2-normalized before scoring, so inner product preserves cosine ranking without norm divisions during candidate scoring.

When all three metrics agree

Some APIs return unit-length embeddings, and many production pipelines normalize vectors at ingest time even when the model itself doesn't. Treat normalization as an explicit contract between model output, index construction, and query serving. Don't assume it's implicit because the model is modern.

When both vectors are unit length ($\|\mathbf{A}\| = \|\mathbf{B}\| = 1$), the mathematical relationship between our similarity metrics collapses into a clean equivalence. The denominator in the cosine similarity formula becomes $1 \times 1 = 1$, making it identical to the raw dot product. Euclidean distance also becomes a direct function of the dot product:

$$\cos(\mathbf{A}, \mathbf{B}) = \mathbf{A} \cdot \mathbf{B} = 1 - \frac{\|\mathbf{A} - \mathbf{B}\|^2}{2}$$
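The Euclidean link follows from expanding the squared norm. A short derivation, using only the unit-length assumption stated above:

```latex
\begin{aligned}
\|\mathbf{A} - \mathbf{B}\|^2
  &= \|\mathbf{A}\|^2 + \|\mathbf{B}\|^2 - 2\,\mathbf{A} \cdot \mathbf{B} \\
  &= 1 + 1 - 2\,\mathbf{A} \cdot \mathbf{B} \\
\Rightarrow \quad \mathbf{A} \cdot \mathbf{B}
  &= 1 - \frac{\|\mathbf{A} - \mathbf{B}\|^2}{2}
\end{aligned}
```

A larger dot product therefore always means a smaller squared distance, which is why the rankings coincide.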

Why this matters

When all vectors have length 1, cosine similarity, dot product, and squared Euclidean distance all induce the same ranking. In practice, pick the metric your library optimizes best and keep query-time and index-time conventions consistent. After the stored vectors and each incoming query have been normalized once, inner product maps directly to fast matrix multiplication kernels without norm divisions during candidate scoring.

[Figure: Comparison of cosine similarity and dot product showing that cosine keeps the same score when vector length changes, while dot product scales with magnitude until vectors are normalized.]
Cosine cares about direction. Raw dot product cares about direction and norm. After L2 normalization, both rankings match.

The following retrieval fixture makes the contract visible. A private-note fixture has a deliberately large raw norm and wins a raw dot-product search. Once all candidates and the query are normalized, the approved policy wins under both cosine and unit-vector dot product.

normalization-is-a-ranking-contract.py

```python
import numpy as np

labels = np.array(["approved_return_policy", "private_seller_note", "carrier_tracking"])
documents = np.array(
    [
        [0.95, 0.05],  # Direction closely matches the return-policy query.
        [8.00, 6.00],  # Deliberately large norm lets raw dot product dominate.
        [0.20, 0.90],
    ],
    dtype=np.float32,
)
query = np.array([1.0, 0.0], dtype=np.float32)

def normalize(rows: np.ndarray) -> np.ndarray:
    """L2-normalize a single vector or each row of a matrix."""
    rows_2d = np.atleast_2d(rows)
    unit = rows_2d / np.linalg.norm(rows_2d, axis=1, keepdims=True)
    return unit[0] if rows.ndim == 1 else unit

raw_dot = documents @ query

# Cosine from its definition: raw dot product divided by both norms.
cosine = raw_dot / (np.linalg.norm(documents, axis=1) * np.linalg.norm(query))

# Unit-vector dot product: normalize once, then score with a plain inner product.
unit_dot = normalize(documents) @ normalize(query)

print(f"raw dot winner: {labels[np.argmax(raw_dot)]}")
print(f"normalized winner: {labels[np.argmax(cosine)]}")
print(f"cosine equals unit dot: {np.allclose(cosine, unit_dot)}")
```

Output

```text
raw dot winner: private_seller_note
normalized winner: approved_return_policy
cosine equals unit dot: True
```

You shouldn't infer that every large-norm vector is bad. The lesson is narrower: if the serving contract is cosine-like ranking, skipping normalization changes what the index is optimizing.
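One way to keep that contract from drifting is a guard that runs at both ingest and query time. The sketch below is an assumption about how such a check could look: `assert_unit_norm` and its tolerance are hypothetical names and values, not part of any library.

```python
import numpy as np

def assert_unit_norm(vectors: np.ndarray, atol: float = 1e-3) -> None:
    """Hypothetical guard: raise if any row is not approximately unit length."""
    norms = np.linalg.norm(np.atleast_2d(vectors), axis=1)
    worst = float(np.max(np.abs(norms - 1.0)))
    if worst > atol:
        raise ValueError(f"normalization contract violated: max |norm - 1| = {worst:.4f}")

# Rows that already satisfy the contract pass silently.
unit_rows = np.array([[0.6, 0.8], [1.0, 0.0]], dtype=np.float32)
assert_unit_norm(unit_rows)

# The large-norm fixture from the example above trips the guard.
try:
    assert_unit_norm(np.array([[8.0, 6.0]], dtype=np.float32))
    guard_fired = False
except ValueError as error:
    guard_fired = True
    print(error)
```

The tolerance is a judgment call: float32 rounding leaves norms slightly off 1, so an exact equality check would reject valid vectors.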

Euclidean distance: straight-line gap

Euclidean distance (also called L2 distance) measures the straight-line distance between two points:

$$\|\mathbf{A} - \mathbf{B}\|_2 = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}$$

Subtract matching dimensions, square each difference, sum them up, and take the square root. This gives the straight-line distance between two points in high-dimensional space, just like measuring distance on a map, but with hundreds of dimensions instead of two.

High dimension doesn't make Euclidean distance automatically wrong. On unit-normalized vectors, $\|\mathbf{A} - \mathbf{B}\|^2 = 2 - 2\,\mathbf{A} \cdot \mathbf{B}$, so ranking by smallest squared Euclidean distance gives exactly the same order as ranking by largest cosine or dot product. Without that normalization, L2 distance also reflects norm differences. Use it when the model, index, and evaluation are built around that geometry.
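A quick numerical check of that claim, as a sketch on synthetic data: the random vectors, the seed, and the `unit` helper are assumptions for illustration, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data only

def unit(rows: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return rows / np.linalg.norm(rows, axis=-1, keepdims=True)

documents = unit(rng.normal(size=(50, 16)))
query = unit(rng.normal(size=16))

dot = documents @ query
squared_l2 = np.sum((documents - query) ** 2, axis=1)

# On unit vectors: ||A - B||^2 = 2 - 2 * (A . B)
identity_holds = bool(np.allclose(squared_l2, 2.0 - 2.0 * dot))

# Largest dot product first should match smallest squared distance first.
same_order = bool(np.array_equal(np.argsort(-dot), np.argsort(squared_l2)))

print(f"identity holds: {identity_holds}, same ranking: {same_order}")
```

Dropping the `unit` calls breaks the identity, which is a useful way to see how unnormalized norms leak into L2 ranking.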

Euclidean distance ranges from $0$ to $+\infty$, where 0 means identical points. It also appears inside product quantization, where the original PQ formulation approximates squared L2 distances between a full-precision query and compressed database vectors.[1]

[Figure: Metric contract and payload guide: angular ranking after normalization, raw inner product when vector norm is intended as signal, L2 for distance-based objectives, plus FP32 to INT8 to binary payload sizes.]
Choose the scoring rule your model and evaluation support, then measure compressed retrieval against that baseline. The payload bar shows why compression is tempting.

Choosing how many dimensions you need

Embedding dimensionality limits how much information a vector can encode, and that capacity comes with direct storage, memory-bandwidth, and latency costs. Choosing dimensions is a real architecture decision, not a leaderboard footnote.

Width is a model-and-evaluation contract

A dimension count has no universal quality meaning by itself. A 512-dimensional model trained well for your support queries can outperform a different 1,536-dimensional model. Width comparisons become defensible when you hold the model family, scoring convention, corpus, and evaluation set fixed.

| Choice under evaluation | What you can conclude | What you still must measure |
| --- | --- | --- |
| Full trained width | Baseline payload and retrieval score for this model | Latency, RAM, and evidence recall |
| Model-supported shorter prefix | Lower raw payload for the same family | Recall loss and failure slices |
| Different smaller model | Possibly cheaper candidate system | Everything again; training changed too |

Dimension count and normalization are separate choices. A supported shorter vector still needs the serving normalization contract documented by its model or applied by your pipeline.

Why dimensions matter

For a fixed representation family, more stored dimensions have deterministic systems costs:

  1. Storage: Each float32 is 4 bytes. 1M vectors at 1536d is about 5.7 GiB of raw payload.
  2. Exact scoring work: A dot product or L2 comparison reads and combines $O(d)$ numbers per candidate.
  3. Memory bandwidth: Loading vectors from memory is often the bottleneck for Approximate Nearest Neighbor (ANN) search, a class of algorithms that find close vectors without an exhaustive search.
  4. End-to-end latency: Halving dimensions doesn't promise halved latency because graph traversal, filters, networking, and reranking remain. Benchmark before paying or saving the memory bill.

Matryoshka embeddings

Some embedding families are trained so selected truncated prefixes stay useful. Matryoshka Representation Learning (MRL) optimizes losses at multiple chosen representation sizes during training.[2] Hosted APIs can expose a supported shortening control too; OpenAI's embeddings guide documents a dimensions parameter for text-embedding-3 models.[3]

Matryoshka embeddings are named after nesting dolls because one representation contains selected shorter prefixes. That doesn't guarantee every arbitrary cut point, every task, or every query keeps acceptable recall. Use widths that the training setup or provider supports, then evaluate each proposed serving width on your own evidence queries.

Standard embedding training minimizes a single loss function (like contrastive loss) on the full vector. MRL minimizes a weighted sum of losses at a selected set of truncation points, written as $d \in \mathcal{D}$. This nested structure encourages the trained prefix widths to form useful representations on their own.
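One way to write that objective, as a sketch: the weights $w_d$, the per-width loss $\mathcal{L}$, and the prefix notation $z_{1:d}$ are notation introduced here for illustration, and the exact formulation in the MRL paper may differ.

```latex
\mathcal{L}_{\text{MRL}} = \sum_{d \in \mathcal{D}} w_d \, \mathcal{L}\!\left(z_{1:d}\right)
```

Here $z_{1:d}$ is the first $d$ coordinates of the full embedding $z$. Choosing $\mathcal{D}$ to contain only the full width with weight 1 recovers standard single-loss training.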

MRL is different from post-hoc dimensionality reduction such as PCA. PCA projects already-trained embeddings into a smaller linear subspace, usually by preserving variance on a fixed dataset. MRL changes training itself so prefix slices are useful without a separate projection step. PCA can still be useful for frozen embeddings, but MRL is cleaner when the model was trained for truncation and you want one serving path with several dimension budgets.
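For contrast, a post-hoc PCA projection on frozen embeddings can be sketched in a few lines of NumPy. The random matrix stands in for real embeddings, and `fit_pca` and `project` are hypothetical helper names. The point is the extra state: a mean and a projection matrix that queries must share with the index.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for frozen embeddings: 200 vectors, 32 dimensions.
embeddings = rng.normal(size=(200, 32)).astype(np.float32)

def fit_pca(rows: np.ndarray, width: int) -> tuple[np.ndarray, np.ndarray]:
    """Return the dataset mean and the top `width` principal directions."""
    mean = rows.mean(axis=0)
    # Rows of vt are orthonormal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(rows - mean, full_matrices=False)
    return mean, vt[:width]

def project(rows: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Apply the fitted projection; queries must reuse the same mean and components."""
    return (rows - mean) @ components.T

mean, components = fit_pca(embeddings, width=8)
reduced = project(embeddings, mean, components)
print(reduced.shape)
```

With an MRL-trained model the equivalent step is a slice followed by renormalization, with no fitted projection to version alongside the index.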

[Figure: Nested embedding prefix example showing supported small, middle, and full width tiers, with reminders to evaluate recall and normalize a manually sliced prefix before indexing.]
A prefix-aware or provider-supported model can expose several widths. Treat each width as a candidate that must pass recall checks, not a guaranteed quality tier.

Supported prefix shortening can reduce raw payload while keeping one model family in the serving stack. OpenAI's embeddings documentation, for example, lists default sizes of 1,536 for text-embedding-3-small and 3,072 for text-embedding-3-large, and documents shorter embeddings through the dimensions parameter.[3] Those documented capabilities still don't replace evaluation on your retrieval corpus.

If you ask a provider for a shorter embedding through the API, follow that provider's normalization contract. Manual slicing is different. If you cut a prefix from an already generated vector, normalize that sliced prefix before indexing. The code below demonstrates that contract without any external API call:

matryoshka-embeddings.py

```python
import numpy as np

def normalize_l2(vector: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

full_embedding = np.array(
    [0.21, -0.05, 0.44, 0.31, 0.18, -0.27, 0.12, 0.09],
    dtype=np.float32,
)

# Manually slice a 4-dimensional prefix, then renormalize before indexing.
short_embedding = normalize_l2(full_embedding[:4])

print(f"dims={short_embedding.shape[0]}, norm={np.linalg.norm(short_embedding):.3f}")
```

Output

```text
dims=4, norm=1.000
```

Normalization is necessary for a unit-vector scoring contract, but it can't restore information removed by truncation. This deliberately small fixture makes that distinction concrete: the last two dimensions separate an approved policy from a private seller note, so the shortened representation loses one correct result.

a-prefix-still-needs-recall-evaluation.py

```python
import numpy as np

labels = np.array(["private_seller_note", "approved_return_policy", "carrier_delay_policy"])
documents = np.array(
    [
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 1.0],
    ],
    dtype=np.float32,
)
queries = np.array(
    [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 1.0],
    ],
    dtype=np.float32,
)
expected = np.array(["approved_return_policy", "carrier_delay_policy"])

def top1_at_width(width: int) -> np.ndarray:
    """Truncate to `width` dimensions, renormalize, and return the top-1 label per query."""
    docs = documents[:, :width]
    asks = queries[:, :width]
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    asks = asks / np.linalg.norm(asks, axis=1, keepdims=True)
    return labels[np.argmax(asks @ docs.T, axis=1)]

full = top1_at_width(4)
short = top1_at_width(2)
print(f"full top-1 recall={np.mean(full == expected):.1%}, results={full.tolist()}")
print(f"short top-1 recall={np.mean(short == expected):.1%}, results={short.tolist()}")
```

Output

```text
full top-1 recall=100.0%, results=['approved_return_policy', 'carrier_delay_policy']
short top-1 recall=50.0%, results=['private_seller_note', 'carrier_delay_policy']
```

This isn't a benchmark for a real MRL model. It's the release test shape you need: evaluate supported widths on approved-query fixtures, including near-duplicate but unauthorized passages.

The memory problem: a sizing example

Work through this step by step. The same arithmetic appears whenever you size a vector database.

Scenario: You have 1 million embeddings from a hosted API, each with 1,536 dimensions, stored as 32-bit floats (FP32). How much RAM do you need for the raw vector payload, and how can you reduce it?

Step 1: Calculate the baseline

Each dimension is a float32, which takes 4 bytes.

$$1{,}000{,}000 \times 1{,}536 \times 4 \text{ bytes} \approx 6.14 \text{ GB}$$

That's just the raw numbers. Real indexes add overhead for graph links, metadata, and replicas, but the vector payload alone is already over 6 GB.

Step 2: Apply 8-bit scalar quantization

If you map each float32 to a single byte (INT8), you cut storage by roughly 4x:

$$1{,}000{,}000 \times 1{,}536 \times 1 \text{ byte} \approx 1.54 \text{ GB}$$

Step 3: Apply product quantization aggressively

Suppose you split each 1,536-dimensional vector into 96 subvectors of 16 dimensions each, and you represent each subvector with an 8-bit centroid ID. The compressed vector is just 96 bytes:

$$1{,}000{,}000 \times 96 \text{ bytes} \approx 96 \text{ MB}$$

That's a 64x compression from the original FP32 payload. The trade-off is that you're now comparing approximate codes instead of exact floats, so you typically pair PQ with a higher-precision reranking stage.

The byte math is easy to turn into a sizing helper. Notice the difference between decimal gigabytes, which vendors often use in pricing copy, and binary gibibytes, which operating systems usually report:

the-memory-problem-a-sizing-example.py

```python
def vector_payload_bytes(num_vectors: int, dimensions: int, bytes_per_dim: float) -> float:
    """Raw vector payload in bytes, excluding index overhead."""
    return num_vectors * dimensions * bytes_per_dim

raw_bytes = vector_payload_bytes(1_000_000, 1_536, 4)
int8_bytes = vector_payload_bytes(1_000_000, 1_536, 1)
# PQ stores 96 one-byte centroid IDs per vector.
pq_bytes = vector_payload_bytes(1_000_000, 96, 1)

print(f"FP32={raw_bytes / 1e9:.2f} GB ({raw_bytes / 1024**3:.2f} GiB)")
print(f"INT8={int8_bytes / 1e9:.2f} GB")
print(f"PQ={pq_bytes / 1024**2:.1f} MiB")
```

Output

```text
FP32=6.14 GB (5.72 GiB)
INT8=1.54 GB
PQ=91.6 MiB
```

Re-embedding cost scales with total input tokens, not document count alone. Compute it as total_tokens / 1_000_000 * embedding_price_per_1M_tokens, using the current price for the model you have selected. Keep model version, dimensions, normalization, and quantization settings alongside every index so you can compare and migrate without mixing incompatible vectors.
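The cost formula and the per-index record can be sketched together. Every number and field name below is a placeholder: the token count, the price, the model name, and the manifest keys are assumptions for illustration, so substitute the current price and your own metadata schema.

```python
def reembedding_cost(total_tokens: int, price_per_1m_tokens: float) -> float:
    """Cost of re-embedding a corpus: total_tokens / 1_000_000 * price."""
    return total_tokens / 1_000_000 * price_per_1m_tokens

# Placeholder inputs; look up the current price for the model you selected.
corpus_tokens = 250_000_000
placeholder_price = 0.10  # hypothetical currency units per 1M tokens

cost = reembedding_cost(corpus_tokens, placeholder_price)
print(f"re-embedding cost: {cost:.2f}")

# Hypothetical manifest stored next to an index; field names are illustrative.
index_manifest = {
    "model_version": "example-embedding-model-v1",
    "dimensions": 1536,
    "normalization": "l2",
    "quantization": "int8-symmetric",
}
```

Comparing two manifests before merging or querying across indexes is a cheap way to catch vectors built under different contracts.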

Shrinking vectors without losing meaning

At scale, storing and searching raw float32 vectors may exceed a practical memory or latency budget. Every exact scan must read $N \times d$ values; an approximate index reduces candidates, while quantization reduces bytes per stored candidate. Both are engineering choices that can lower recall, so each needs an evaluation gate.

Scalar quantization (SQ)

Scalar quantization (SQ) replaces each float32 (4 bytes) with a lower-precision representation, typically mapping each dimension independently. The simplest signed int8 version is symmetric around zero:

$$x_q = \operatorname{clip}\left(\operatorname{round}\left(\frac{x}{s}\right), -127, 127\right), \qquad x \approx s x_q$$

Here, s is the scale. Each dimension goes from 4 bytes (float32) to 1 byte (int8), so raw storage drops by about 4x while preserving an approximate reconstruction path back to float space. In practice, some libraries use this symmetric signed form, while others use affine quantization with a zero-point, often over uint8.

The function below implements symmetric scalar quantization. It takes an array of float32 vectors and returns the compressed int8 vectors along with the per-dimension scales needed to approximately reconstruct the original floats later:

scalar-quantization-sq.py

```python
import numpy as np

def symmetric_quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Per-dimension symmetric quantization from float32 to int8.
    Returns quantized vectors and scale.
    """
    qmax = 127

    # One scale per dimension, floored to avoid division by zero.
    max_abs = np.maximum(np.max(np.abs(vectors), axis=0), 1e-8)
    scale = max_abs / qmax

    quantized = np.clip(
        np.round(vectors / scale),
        -qmax,
        qmax,
    ).astype(np.int8)

    return quantized, scale

vectors = np.array(
    [
        [0.20, -1.00, 2.50],
        [0.24, -0.80, 2.10],
        [0.50, -1.40, 2.30],
    ],
    dtype=np.float32,
)

quantized, scale = symmetric_quantize_int8(vectors)
reconstructed = scale * quantized.astype(np.float32)
max_error = float(np.max(np.abs(vectors - reconstructed)))

print(quantized.dtype, quantized.shape, f"max_error={max_error:.4f}")
```

Output

```text
int8 (3, 3) max_error=0.0063
```
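The affine variant mentioned earlier can be sketched the same way. This is one possible formulation, not a specific library's implementation: `affine_quantize_uint8` is a hypothetical name, and widening each dimension's range to include zero is an assumption made here so the zero-point always fits in uint8.

```python
import numpy as np

def affine_quantize_uint8(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Per-dimension affine quantization: x is approximated by scale * (q - zero_point)."""
    # Assumption: extend each range to include 0 so zero_point lands in [0, 255].
    lo = np.minimum(vectors.min(axis=0), 0.0)
    hi = np.maximum(vectors.max(axis=0), 0.0)
    scale = np.maximum(hi - lo, 1e-8) / 255.0
    zero_point = np.round(-lo / scale)
    quantized = np.clip(np.round(vectors / scale + zero_point), 0, 255).astype(np.uint8)
    return quantized, scale, zero_point

vectors = np.array(
    [
        [0.20, -1.00, 2.50],
        [0.24, -0.80, 2.10],
        [0.50, -1.40, 2.30],
    ],
    dtype=np.float32,
)

quantized, scale, zero_point = affine_quantize_uint8(vectors)
reconstructed = scale * (quantized.astype(np.float32) - zero_point)
max_error = float(np.max(np.abs(vectors - reconstructed)))

print(quantized.dtype, quantized.shape, f"max_error={max_error:.4f}")
```

When a dimension's values sit on one side of zero, as all three do in this fixture, the affine form spends all 256 levels on the occupied half of the range instead of splitting them across both signs.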

Quality impact

INT8 cuts the raw vector payload from four bytes per dimension to one. It doesn't promise any fixed Recall@K. Model distribution, calibration sample, scoring rule, shortlist depth, and reranking all change the observed loss, so treat INT8 as the first candidate to benchmark rather than an automatic release decision.

The outlier trap

The biggest practical mistake in scalar quantization is deriving the scale from the absolute minimum and maximum of a calibration sample without checking outliers. If one dimension spikes far outside the usual range, the scale stretches and ordinary values collapse into fewer integer buckets. Robust clipping can preserve ordinary resolution while saturating outliers, but its percentile is another measured choice rather than a universal constant.

The next executable example quantizes ordinary values plus one spike. It compares a scale set by the spike with a clipped scale set from the ordinary calibration range:

outlier-calibration-costs-resolution.py

```python
import numpy as np

ordinary = np.linspace(-1.0, 1.0, 101, dtype=np.float32)
with_outlier = np.concatenate([ordinary, np.array([25.0], dtype=np.float32)])

def quantize_reconstruct(values: np.ndarray, bound: float) -> np.ndarray:
    """Quantize to int8 with scale bound / 127, then map back to float32."""
    q = np.clip(np.round(values / (bound / 127)), -127, 127).astype(np.int8)
    return q.astype(np.float32) * (bound / 127)

bound_from_max = float(np.max(np.abs(with_outlier)))  # stretched by the spike
bound_from_calibration = 1.0  # clipped to the ordinary range
max_scaled = quantize_reconstruct(ordinary, bound_from_max)
clipped_scaled = quantize_reconstruct(ordinary, bound_from_calibration)

max_error = float(np.mean(np.abs(ordinary - max_scaled)))
clipped_error = float(np.mean(np.abs(ordinary - clipped_scaled)))
print(f"outlier-bound ordinary MAE={max_error:.4f}")
print(f"clipped-bound ordinary MAE={clipped_error:.4f}")
```

Output

```text
outlier-bound ordinary MAE=0.0483
clipped-bound ordinary MAE=0.0019
```

Lower reconstruction error on ordinary values is encouraging, but retrieval is the real target. A release check must measure whether relevant approved passages stay in the shortlist after calibration and compression.
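A minimal sketch of the shape of such a check follows. It uses synthetic unit vectors and treats exact float32 search as the reference, which is an assumption made to keep the example self-contained; a real release gate would score against labeled approved passages for your own queries. The helpers `top_k` and `recall_at_k` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(7)  # synthetic data; not a benchmark

def unit(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

documents = unit(rng.normal(size=(500, 32))).astype(np.float32)
queries = unit(rng.normal(size=(20, 32))).astype(np.float32)

# Symmetric per-dimension int8 quantization, then reconstruction to float32.
scale = np.maximum(np.max(np.abs(documents), axis=0), 1e-8) / 127
codes = np.clip(np.round(documents / scale), -127, 127).astype(np.int8)
approx_documents = codes.astype(np.float32) * scale

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores per query row."""
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Fraction of reference shortlist entries that the candidate shortlist kept."""
    hits = [len(set(r) & set(c)) for r, c in zip(reference, candidate)]
    return sum(hits) / reference.size

k = 10
exact_top = top_k(queries @ documents.T, k)
approx_top = top_k(queries @ approx_documents.T, k)
recall = recall_at_k(exact_top, approx_top)

print(f"Recall@{k} of int8 shortlist vs exact: {recall:.3f}")
```

Swapping `exact_top` for the IDs of known-relevant passages turns the same function into an evidence-recall check.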

Product quantization (PQ)

Introduced by Jégou et al.[1], Product Quantization (PQ) splits each vector into $M$ subvectors and quantizes each independently using a learned codebook.

Example with $M=4$ on an 8-dimensional vector:

$$\text{Original}: [1.2, 3.4 \mid 0.8, 2.1 \mid 5.3, 1.7 \mid 0.4, 3.2] \quad (8 \times 4 = 32 \text{ bytes})$$

This example shows a single 8-dimensional vector split into 4 subvectors of 2 dimensions each. Each subvector gets mapped to the ID of its nearest centroid in that subspace's codebook:

  1. $[1.2, 3.4] \to \text{Centroid ID: } 7$
  2. $[0.8, 2.1] \to \text{Centroid ID: } 23$
  3. $[5.3, 1.7] \to \text{Centroid ID: } 3$
  4. $[0.4, 3.2] \to \text{Centroid ID: } 41$

$$\text{Compressed code}: \quad (7, 23, 3, 41) \quad (4 \text{ bytes at 8 bits/code})$$

Each subvector is now stored as a centroid index instead of raw floats. Storage drops from 32 bytes to 4 bytes because we keep IDs; at query time, lookup tables score those IDs against the full-precision query.
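
The encoding step can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained index: the codebooks here are random placeholders (a real system learns them, typically with k-means per subspace), so the resulting IDs will not match the 7, 23, 3, 41 of the worked example. The helper names `pq_encode` and `pq_decode` are hypothetical.

```python
import numpy as np

def pq_encode(vector: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Return one nearest-centroid ID per subvector.

    codebooks has shape (M, K, d_sub): M subspaces, K centroids in each.
    """
    num_subspaces, _, sub_dim = codebooks.shape
    chunks = vector.reshape(num_subspaces, sub_dim)
    # Squared L2 distance from each chunk to every centroid in its own codebook.
    distances = np.sum((codebooks - chunks[:, None, :]) ** 2, axis=2)
    return np.argmin(distances, axis=1).astype(np.uint8)

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Rebuild an approximate vector by concatenating the selected centroids."""
    return codebooks[np.arange(len(codes)), codes].reshape(-1)

rng = np.random.default_rng(0)
# Hypothetical, untrained codebooks: 4 subspaces x 256 centroids x 2 dims.
codebooks = rng.normal(size=(4, 256, 2)).astype(np.float32)
vector = np.array([1.2, 3.4, 0.8, 2.1, 5.3, 1.7, 0.4, 3.2], dtype=np.float32)

codes = pq_encode(vector, codebooks)
reconstructed = pq_decode(codes, codebooks)
print(f"codes={codes.tolist()} ({codes.nbytes} bytes vs {vector.nbytes} bytes raw)")
```

With 256 centroids per subspace, each ID fits in one `uint8`, which is where the 4-byte code comes from.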

Figure: Product quantization. One 8-dimensional vector is split into four 2-dimensional chunks, each chunk is replaced by one centroid ID, and the compressed code stores 7, 23, 3, 41 instead of raw floats.
PQ stores centroid IDs instead of raw float chunks. Query scoring uses lookup tables over those IDs, not full decompression of every vector.

With 256 centroids per subspace (8 bits), the stored code can be much smaller than the original vector. For example, compressing 768d float32 vectors (3,072 bytes) using PQ with 96 subspaces produces a 96-byte per-vector code, a 32x payload compression. Shared codebooks and index metadata add separate overhead.
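
The arithmetic behind that example is easy to reproduce. The sketch below assumes the codebooks themselves are stored as float32, which is an assumption for illustration; `pq_code_bytes` is a hypothetical helper name.

```python
import math

def pq_code_bytes(num_subspaces: int, centroids_per_subspace: int) -> int:
    """Bytes per stored vector when each subspace keeps one centroid ID."""
    bits_per_id = math.ceil(math.log2(centroids_per_subspace))
    return math.ceil(num_subspaces * bits_per_id / 8)

dims = 768
raw_bytes = dims * 4                        # float32 payload: 3072 bytes
code_bytes = pq_code_bytes(96, 256)         # 96 IDs x 8 bits: 96 bytes
ratio = raw_bytes / code_bytes              # 32.0

# Shared codebooks are paid once per index, not per vector.
# Assumption: float32 centroids, 96 subspaces x 256 centroids x 8 dims each.
codebook_bytes = 96 * 256 * (dims // 96) * 4  # 786432 bytes (768 KiB)

print(raw_bytes, code_bytes, ratio, codebook_bytes)
```

The codebook cost is fixed, so it becomes negligible as the corpus grows, while the per-vector code is what scales.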

Asymmetric distance computation (ADC)

At query time, you usually keep the query vector in full precision and compare each query subvector against the centroids in the matching codebook[1]. That gives a small lookup table per subspace:

$$d(\mathbf{q}, \hat{\mathbf{x}}) \approx \sum_{m=1}^{M} \left\| \mathbf{q}_m - \mathbf{c}_{m, i_m} \right\|_2^2$$

The compressed database vector stores only centroid IDs $i_m$, so scoring a candidate becomes $M$ table lookups and additions. That's why PQ can search compressed vectors without fully decompressing them. This squared-L2 form is the original PQ/ADC setup; a system serving inner-product search must use an implementation and evaluation contract that supports its chosen score.

In a real index, training data learns each codebook. Here we write down two tiny codebooks so you can see asymmetric distance computation (ADC) itself: the full-precision query creates a lookup table, and stored IDs select four distances to add.

pq-asymmetric-distance-lookups.py
```python
import numpy as np

# Four subspaces, two centroids in each; production PQ learns many centroids.
codebooks = np.array(
    [
        [[0.0, 0.0], [1.2, 3.4]],
        [[0.0, 0.0], [0.8, 2.1]],
        [[0.0, 0.0], [5.3, 1.7]],
        [[0.0, 0.0], [0.4, 3.2]],
    ],
    dtype=np.float32,
)
query = np.array([1.1, 3.3, 0.7, 2.0, 5.2, 1.8, 0.3, 3.1], dtype=np.float32)
stored_codes = {
    "approved_return_policy": np.array([1, 1, 1, 1]),
    "carrier_delay_policy": np.array([0, 0, 0, 0]),
}

query_chunks = query.reshape(4, 2)
lookup = np.sum((codebooks - query_chunks[:, None, :]) ** 2, axis=2)
scores = {
    name: float(lookup[np.arange(4), code].sum())
    for name, code in stored_codes.items()
}
winner = min(scores, key=scores.get)

print({name: round(score, 2) for name, score in scores.items()})
print(f"ADC nearest code: {winner}")
```
Output
```
{'approved_return_policy': 0.08, 'carrier_delay_policy': 56.57}
ADC nearest code: approved_return_policy
```

Trade-off

Higher compression usually means lower recall. The exact loss depends on the number of subspaces, centroid count, and reranking budget, so benchmark PQ with your real query distribution instead of relying on a single headline number.

Binary quantization (BQ)

Of the three payload reductions in this chapter, Binary Quantization (BQ) is the most aggressive: it reduces each dimension to a single bit. This simple version thresholds at zero: positive numbers become 1, and zero or negative numbers become 0. It discards magnitude and granular precision, so a binary shortlist is a candidate to evaluate, not a final ranking guarantee.

The code snippet below demonstrates how simple this transformation is in practice. It takes one raw float32 vector and returns packed bytes containing one sign bit per dimension:

binary-quantization-bq.py
```python
import numpy as np

def binary_quantize(vector: np.ndarray) -> np.ndarray:
    """
    Pack one sign bit per dimension into bytes.
    Positive values become 1, zero/negative become 0.
    """
    bits = (vector > 0).astype(np.uint8)
    return np.packbits(bits, bitorder="little")

def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
    differing_bits = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(differing_bits, bitorder="little").sum())

ticket_a = np.array([0.3, -0.1, 0.8, -0.4], dtype=np.float32)
ticket_b = np.array([0.5, -0.2, 0.7, 0.1], dtype=np.float32)

packed_a = binary_quantize(ticket_a)
packed_b = binary_quantize(ticket_b)
distance = hamming_distance(packed_a, packed_b)

print(f"packed_a={packed_a[0]}, packed_b={packed_b[0]}, hamming={distance}")
```
Output
```
packed_a=5, packed_b=13, hamming=1
```

Note that (vector > 0).astype(np.uint8) is still one byte per dimension. You only get true 1-bit storage once you pack those bits.

Worked example

Take a 4-dimensional embedding after L2 normalization: [0.3, -0.1, 0.8, -0.4]. Binary quantization thresholds at zero:

  • Positive maps to 1, non-positive maps to 0, giving bit vector [1, 0, 1, 0]
  • Packed into bytes with little-endian bit order: one byte holding those four bits, decimal value 5 (0b00000101).

Now compare against another binary vector [1, 0, 1, 1] (from [0.5, -0.2, 0.7, 0.1]):

  • XOR the packed bytes: 5 XOR 13 = 8 (0b00001000), so the only differing bit is the one for the fourth dimension
  • popcount (number of 1s in XOR result) = 1, so Hamming distance = 1

A Hamming distance of 1 out of 4 bits means the vectors are close in this binary approximation. In production, you can use this fast filter to pull a shortlist, then rescore survivors with INT8 or FP32 scores. Because sign bits discard magnitude, binary quantization always needs validation on your model and query distribution. This zero-threshold version works best when coordinates are centered around zero; other distributions need a different calibration or encoding choice.

In this packed-bit design, candidate scoring starts with Hamming distance (XOR + popcount), which is fast on modern CPUs because it relies on bitwise operations. It's only an approximation to angular similarity, though, so production systems often rescore the shortlist with a higher-precision representation.
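
To score one query against many packed documents at once, one option is a 256-entry popcount lookup table indexed by the XOR result. The sketch below is a plain NumPy illustration under that assumption; `pack_signs` and `hamming_to_all` are hypothetical helper names, and production libraries typically use hardware popcount or SIMD instead.

```python
import numpy as np

# POPCOUNT[b] is the number of set bits in the byte value b.
POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(axis=1)

def pack_signs(vectors: np.ndarray) -> np.ndarray:
    """Threshold at zero and pack one bit per dimension along the last axis."""
    return np.packbits(vectors > 0, axis=-1, bitorder="little")

def hamming_to_all(packed_query: np.ndarray, packed_docs: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed document row."""
    differing = np.bitwise_xor(packed_docs, packed_query)
    return POPCOUNT[differing].sum(axis=1)

docs = np.array(
    [
        [0.3, -0.1, 0.8, -0.4],
        [0.5, -0.2, 0.7, 0.1],
        [-0.6, 0.2, -0.9, 0.3],
    ],
    dtype=np.float32,
)
query = np.array([0.4, -0.3, 0.6, -0.2], dtype=np.float32)

distances = hamming_to_all(pack_signs(query), pack_signs(docs))
print(distances.tolist())  # [0, 1, 4]
```

The first document shares every sign with the query, the second differs in one dimension, and the third is the sign-flipped opposite.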

| Dimension | Practical effect |
| --- | --- |
| Compression | 32x raw payload reduction when moving from float32 to one bit per dimension. |
| Speed | Packed bitsets use XOR plus popcount, so both memory traffic and arithmetic drop sharply. Implementations can add architecture-specific SIMD acceleration for packed-byte comparisons. |
| Quality | Recall can drop noticeably, so BQ usually fits best as a first-stage filter. |

The two-stage retrieval pattern

Production systems often combine quantization with reranking to balance speed and accuracy.

Two-stage retrieval works like triaging a warehouse exception queue. Stage 1 (binary quantization) scans 10,000 tickets for rough matches, which is fast but approximate. Stage 2 (full-precision reranking) gives the top 50 tickets a detailed comparison. You wouldn't deeply inspect all 10,000 because that's too slow, and you wouldn't resolve cases from rough matches alone because that's too inaccurate. The funnel gives you both speed and quality.

One cost-sensitive pattern stretches this into three stages: a binary Hamming search for the cheap shortlist, an INT8 dot-product rescore of that shortlist, and an optional cross-encoder reranker[4] on the final handful when answer quality justifies the extra latency. Each stage looks at fewer candidates with more precision than the one before it.

Figure: Two-stage retrieval funnel where binary Hamming search finds a rough top 1000 candidate set, then FP32 or INT8 rescoring and a cross-encoder reranker produce the final top 10.
Binary quantization is a speed filter, not the final judge. The second stage spends precise scoring only on the shortlist.
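
The first two stages of the funnel can be sketched with NumPy alone. The corpus below is synthetic, the shortlist sizes (200, then 10) are arbitrary assumptions, and the INT8 scale comes from the plain maximum only because this toy data has no outliers. The optional cross-encoder stage is left as a comment because its API depends on the reranker you choose.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic unit-normalized corpus; document 42 is planted as the true match.
docs = rng.normal(size=(5000, 128)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.05 * rng.normal(size=128).astype(np.float32)
query /= np.linalg.norm(query)

# Stage 1: cheap binary shortlist by Hamming distance over sign bits.
doc_bits = docs > 0
hamming = np.count_nonzero(doc_bits != (query > 0), axis=1)
stage1 = np.argsort(hamming, kind="stable")[:200]

# Stage 2: INT8 dot-product rescore, applied to the shortlist only.
scale = float(np.max(np.abs(docs))) / 127
docs_int8 = np.clip(np.round(docs / scale), -127, 127).astype(np.int8)
query_int8 = np.clip(np.round(query / scale), -127, 127).astype(np.int8)
int8_scores = docs_int8[stage1].astype(np.int32) @ query_int8.astype(np.int32)
stage2 = stage1[np.argsort(-int8_scores, kind="stable")[:10]]

# Stage 3 (optional): pass the stage2 passages to a cross-encoder reranker.

print(f"stage 1 kept {len(stage1)} of {len(docs)}, stage 2 kept {len(stage2)}")
print(f"planted match in shortlist: {42 in stage1}, top after rescore: {int(stage2[0])}")
```

Each stage touches fewer rows than the one before it, so the expensive arithmetic is only ever paid on the shortlist.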

The production-agent boundary from the previous lesson still applies after compression. This miniature release gate checks that binary retrieval keeps the approved policy in its shortlist, then applies precise scoring only to approved candidates:

binary-shortlist-evidence-gate.py
```python
import numpy as np

labels = np.array(["private_seller_note", "approved_return_policy", "carrier_delay_policy"])
approved = np.array([False, True, True])
documents = np.array(
    [
        [0.05, -2.00, 0.05, -2.00],  # Same signs, wrong private evidence.
        [0.78, -0.38, 0.58, -0.22],  # Approved policy, close in full precision.
        [0.10, 0.60, -0.40, -0.20],
    ],
    dtype=np.float32,
)
query = np.array([0.80, -0.40, 0.60, -0.20], dtype=np.float32)
expected = "approved_return_policy"

def sign_bits(rows: np.ndarray) -> np.ndarray:
    return rows > 0

def normalize(rows: np.ndarray) -> np.ndarray:
    rows_2d = np.atleast_2d(rows)
    unit = rows_2d / np.linalg.norm(rows_2d, axis=1, keepdims=True)
    return unit[0] if rows.ndim == 1 else unit

hamming = np.count_nonzero(sign_bits(documents) != sign_bits(query), axis=1)
shortlist = np.argsort(hamming, kind="stable")[:2]
allowed_shortlist = shortlist[approved[shortlist]]
precise_scores = normalize(documents[allowed_shortlist]) @ normalize(query)
served = labels[allowed_shortlist[np.argmax(precise_scores)]]

recall_at_2 = expected in labels[shortlist]
print(f"binary shortlist={labels[shortlist].tolist()}, recall@2={recall_at_2}")
print(f"served evidence={served}, authorized={bool(approved[labels == served][0])}")
assert recall_at_2 and served == expected
```
Output
```
binary shortlist=['private_seller_note', 'approved_return_policy'], recall@2=True
served evidence=approved_return_policy, authorized=True
```

On real data, run that same gate over many labeled queries and failure slices. A compressed first pass is acceptable only if its shortlist recall remains high enough for precise, authorized evidence selection to succeed.

Building a production pipeline

One production vector-search pattern stacks quantization for smaller payloads, an ANN index for lower search work, and reranking for a more precise final comparison. Exact composition is library-specific: some implementations combine HNSW with scalar or product quantization, while others use a different ANN structure or retain full-precision vectors alongside compressed codes for rescoring.

ANN indexing with HNSW

Many vector databases offer HNSW (Hierarchical Navigable Small World) indices[5] as an ANN structure. HNSW builds a multi-layered proximity graph: upper layers contain progressively fewer sampled nodes, while the base layer contains the indexed collection. Search starts in an upper layer, follows promising graph links, and descends toward a candidate neighborhood at the base.

When configuring an HNSW implementation, three controls commonly expose its quality/cost tradeoff. Exact names and allowed values depend on the library:

| Control | Changed during | Usually buys | Usually costs |
| --- | --- | --- | --- |
| M (graph connectivity) | Index build | More routes to true neighbors | More graph memory and build work |
| ef_construction | Index build | A better-connected graph | Slower index construction |
| ef_search | Query serving | More candidate exploration | Higher query latency |

Start by sweeping query-time exploration (ef_search) and measuring recall and latency at each setting. If the graph can't reach the required recall within the latency budget, rebuild the index with adjusted construction settings (M, ef_construction) and repeat the sweep. No copied parameter triple proves quality for a new corpus.
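
A sweep harness for that measurement might look like the sketch below. It is deliberately library-free: `toy_search` is a hypothetical stand-in that is NOT HNSW (it scores a fixed random subset of documents), and its `budget` argument only plays the role that a query-time exploration knob would play. With a real index, you would swap in your library's query call and keep the rest of the loop.

```python
import time
import numpy as np

rng = np.random.default_rng(3)
docs = rng.normal(size=(3000, 32)).astype(np.float32)
queries = rng.normal(size=(40, 32)).astype(np.float32)
k = 10

# Exact search over the sample provides the baseline to compare against.
exact = np.argsort(-(queries @ docs.T), axis=1)[:, :k]

visit_order = rng.permutation(len(docs))

def toy_search(query: np.ndarray, budget: int) -> np.ndarray:
    """Hypothetical stand-in for an ANN query: score only `budget` documents."""
    candidates = visit_order[:budget]
    scores = docs[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

def sweep(budgets):
    """Return (budget, recall@k, mean latency in ms) for each setting."""
    rows = []
    for budget in budgets:
        start = time.perf_counter()
        found = [toy_search(q, budget) for q in queries]
        latency_ms = (time.perf_counter() - start) * 1000 / len(queries)
        recall = float(np.mean([
            len(set(f.tolist()) & set(e.tolist())) / k
            for f, e in zip(found, exact)
        ]))
        rows.append((budget, recall, latency_ms))
    return rows

results = sweep([250, 1000, 3000])
for budget, recall, latency_ms in results:
    print(f"budget={budget:5d} recall@{k}={recall:.2f} mean latency={latency_ms:.3f} ms")
```

The useful output is the pair of columns side by side: the smallest setting that meets the recall target is the one to check against the latency budget.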

Full pipeline overview

Figure: Production vector retrieval pipeline with a library-specific indexing path that embeds documents, optionally compresses vectors, builds an ANN index, and stores hot search payload, plus a query path that embeds the query, searches ANN, rescores, reranks, and checks top-k results.
A retrieval stack separates offline index construction from online query serving. Compression and ANN layout are library-specific; rescoring gives final quality a measurable gate.

Model and dimension selection

Public benchmarks such as MTEB evaluate embedding models across several task families.[6] They help you assemble candidates, but they can't tell you which index retrieves your approved policy passages under your storage and latency limits.

| Decision | Establish a baseline | Accept a change only when |
| --- | --- | --- |
| Model family | Full-precision Recall@K on approved evidence queries | Target slices and latency remain acceptable |
| Supported prefix width | Full trained width with identical scoring | Measured recall loss fits your release threshold |
| SQ, PQ, or BQ | Uncompressed shortlist and final ranking | Compressed shortlist preserves required evidence recall |
| ANN settings | Exact-search results for a labeled sample | Recall and latency both fit the serving budget |

Keep a versioned experiment row for model, width, normalization, quantizer, index settings, shortlist size, and reranker. That row is what makes a compressed-index rollout explainable when a return-policy answer later changes.
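
One lightweight way to keep such a row is a frozen dataclass with a content fingerprint. The sketch below is only one possible shape: the class name, field values, and the 12-character SHA-256 prefix are all hypothetical choices for illustration, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class RetrievalExperiment:
    model: str
    width: int
    normalization: str
    quantizer: str
    index_settings: dict
    shortlist_size: int
    reranker: str

    def fingerprint(self) -> str:
        """Stable short hash of every field, usable as an index version tag."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Hypothetical values; substitute the settings your stack actually uses.
baseline = RetrievalExperiment(
    model="example-embedder-v1",
    width=768,
    normalization="l2",
    quantizer="none",
    index_settings={"M": 16, "ef_construction": 200, "ef_search": 64},
    shortlist_size=100,
    reranker="none",
)
candidate = replace(baseline, quantizer="int8_sq")

print(baseline.fingerprint(), candidate.fingerprint())
```

Because any changed field changes the fingerprint, two indexes with the same tag were built under the same embedding contract.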

Cost analysis at scale

As your vector corpus grows from millions to billions of documents, the cost structure of your retrieval system shifts sharply. At small scale, storing raw float32 vectors in RAM is feasible. At billion scale, it can become the dominant infrastructure cost because low-latency approximate nearest neighbor search keeps hot index data in memory.

The byte math is deterministic even when cloud pricing isn't. The table below shows just the vector payload, rounded in binary units, before HNSW links, metadata, replicas, or region-specific RAM pricing:

| Scale | Raw payload (1024d, FP32) | With INT8 SQ | With PQ (96-byte code) |
| --- | --- | --- | --- |
| 1M vectors | 3.8 GiB | 0.95 GiB | 92 MiB |
| 10M vectors | 38 GiB | 9.5 GiB | 916 MiB |
| 100M vectors | 380 GiB | 95 GiB | 8.9 GiB |
| 1B vectors | 3.7 TiB | 950 GiB | 89 GiB |
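
The payload figures follow directly from vectors times bytes per vector. A short sketch that reproduces them in GiB (before any rounding to two significant figures) is shown here; `payload_gib` is a hypothetical helper name.

```python
GIB = 2**30

def payload_gib(num_vectors: int, bytes_per_vector: int) -> float:
    """Raw vector payload in GiB, excluding graph links, metadata, and replicas."""
    return num_vectors * bytes_per_vector / GIB

DIMS = 1024
layouts = {
    "FP32": DIMS * 4,          # 4 bytes per dimension
    "INT8 SQ": DIMS * 1,       # 1 byte per dimension
    "PQ (96-byte code)": 96,   # fixed-size code per vector
}

for num_vectors in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    row = ", ".join(
        f"{name}={payload_gib(num_vectors, size):.2f} GiB"
        for name, size in layouts.items()
    )
    print(f"{num_vectors:>13,} vectors: {row}")
```

Dividing the 1B-vector FP32 result by 1024 gives the TiB figure.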

Practice: choose a retrieval stack

You have 50 million approved policy and support-resolution embeddings at 768 dimensions. A private seller-note collection must never be served as answer evidence. P99 latency budget is 60 ms, and raw FP32 vectors would crowd out the rest of the service's RAM.

  1. Would you keep raw FP32 payload in memory for first-stage search?
  2. If your ingest path already L2-normalizes vectors, would you score with cosine or dot product online?
  3. Would you start with INT8 SQ, PQ, or BQ for the first pass?
  4. Where would you spend full-precision compute?
  5. What test prevents a cheaper index from silently changing evidence quality?

Solution sketch:

  • Don't start with raw FP32 if memory is already tight. Benchmark INT8 first, then PQ if you still need a bigger cut.
  • Use dot product online if vectors are already unit-normalized. It preserves cosine ranking without redundant norm work.
  • Start by benchmarking INT8 SQ because it changes payload less aggressively than PQ or BQ. Move further only when the evaluated RAM/latency gain justifies measured shortlist loss.
  • Spend full-precision compute on a shortlist, not on the whole corpus. That is where rescoring and reranking pay off.
  • Build a labeled release set containing approved policies plus tempting unauthorized near-matches. Compare Recall@K against the full-precision baseline and reject any serving result sourced from unapproved text.

Mastery check

Key concepts

  • Cosine similarity measures direction. Raw dot product measures direction plus magnitude.
  • Once vectors are unit-normalized, cosine and dot product give the same ranking.
  • Matryoshka Representation Learning trains selected prefix widths to remain useful; each deployed width still needs evaluation.
  • Scalar quantization compresses each dimension independently. Product quantization compresses subvectors into centroid IDs.
  • Binary quantization is a cheap filter, not a trustworthy final ranker.
  • A safe retrieval rollout is staged and measured: compressed first pass, authorized shortlist check, higher-precision rescore, then optional reranker.

Evaluation rubric

  • Strong: derives why normalized cosine and dot product are equivalent, then ties that to serving efficiency.
  • Strong: computes rough RAM savings for FP32, INT8, and PQ without confusing bytes, GB, GiB, and code sizes.
  • Strong: explains PQ as subvector codebooks plus ADC lookup tables, not as generic "compression magic."
  • Strong: chooses compression stage based on latency, evidence-recall target, and RAM budget, then defines a release gate.
  • Weak: says "always use cosine" or "higher dimensions are always better" without checking normalization or workload.
  • Weak: treats quantization as inherently safe or inherently inaccurate instead of measuring shortlist recall and served evidence.

Follow-up questions

Why does dot product become equivalent to cosine similarity for normalized vectors?

Cosine divides the dot product by the product of the vector norms. If both vectors have norm 1, that denominator is 1, so cosine similarity equals the dot product. That lets a retrieval system normalize once, then use fast inner-product scoring without recomputing norms.
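
A quick numerical check of that equivalence, using random placeholder vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(100, 16))
query = rng.normal(size=16)

# Cosine on raw vectors: dot product divided by both norms.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Normalize once, then use a plain dot product.
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
dot = unit_docs @ unit_query

same_scores = bool(np.allclose(cosine, dot))
same_top10 = bool(np.array_equal(np.argsort(-cosine)[:10], np.argsort(-dot)[:10]))
print(f"scores match: {same_scores}, top-10 order matches: {same_top10}")
```

The document norms are computed once at ingest time, so the online path only pays for the dot product.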

How does Matryoshka Representation Learning differ from PCA?

MRL is a training-time technique that teaches one embedding to work at several prefix lengths. PCA is a post-training projection applied to frozen vectors. MRL is useful when the model was trained for prefix truncation; PCA is a retrofit when you already have embeddings and need a smaller projected representation.
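
The operational difference shows up in what each approach needs at serving time. In the sketch below, the embeddings are random placeholders, so neither reduction is meaningful in terms of quality: prefix truncation is only appropriate when the model was actually trained with MRL, and the function names are hypothetical. The point is that truncation needs no fitted state, while PCA must fit and ship a mean and a projection matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
embeddings = rng.normal(size=(500, 64)).astype(np.float32)  # placeholder vectors

def truncate_prefix(vectors: np.ndarray, width: int) -> np.ndarray:
    """MRL-style serving: keep the first `width` dims, then re-normalize."""
    prefix = vectors[:, :width]
    return prefix / np.linalg.norm(prefix, axis=1, keepdims=True)

def fit_pca(vectors: np.ndarray, width: int):
    """Post-hoc retrofit: learn a mean and the top `width` principal directions."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:width]

def apply_pca(vectors: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    return (vectors - mean) @ components.T

mrl_16 = truncate_prefix(embeddings, 16)
mean, components = fit_pca(embeddings, 16)
pca_16 = apply_pca(embeddings, mean, components)

print(mrl_16.shape, pca_16.shape, components.shape)
```

Either way, the reduced width still has to pass the same recall evaluation as any other compression choice.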

Why is Product Quantization often more attractive than Scalar Quantization at billion-vector scale?

Scalar quantization usually gives about 4x raw payload compression by moving from float32 to int8. Product Quantization can compress much more aggressively by storing centroid IDs for subspaces, which can be the difference between a huge distributed RAM footprint and a smaller compressed index. The tradeoff is approximation error, so PQ usually needs rescoring.

Why is Binary Quantization faster than float32 dot products?

Binary Quantization stores one sign bit per dimension, packs bits into bytes, and scores candidates with XOR plus popcount. That cuts memory traffic and replaces floating-point multiply-adds with cheap bit operations, but the score is coarse and should usually feed a reranker.

When is bigger embedding dimension worth the extra RAM bill?

Only when your own query set shows that smaller prefixes or lower-dimensional models miss too many relevant items even after reranking. Dimension count is a measured tradeoff, not a prestige number.

Common pitfalls

  • Recall falls after INT8 rollout. Cause: scale got stretched by outliers or mismatch between offline and online normalization. Fix: clip calibration range robustly, recheck normalization contract, and benchmark recall on real queries.
  • ANN results look fine offline but wrong in production. Cause: you tuned on MTEB or synthetic samples instead of real ticket language and chunking. Fix: build an eval set from your own queries and review misses manually.
  • Search is still too slow after quantization. Cause: you compressed payload but still rescore too many candidates or set ef_search too high. Fix: tune shortlist size, HNSW parameters, and rerank depth together.
  • Migration silently degrades quality. Cause: old and new vectors use different model versions, dimensions, or quantization settings. Fix: version the whole embedding contract and build the new index side by side before traffic cutover.
Next Step
Continue to Scaled Dot-Product Attention

Embeddings taught you how to represent meaning as vectors and compare them at retrieval time. Attention shows how Transformer layers build contextual vectors by comparing tokens inside a sequence, including the scaling term, multi-head structure, quadratic cost, KV-cache pressure, and FlashAttention.

References

[1] Jégou, H., Douze, M., & Schmid, C. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI.

[2] Kusupati, A., et al. (2022). Matryoshka Representation Learning. NeurIPS 2022.

[3] OpenAI (2024). Vector embeddings.

[4] Cohere (2026). Rerank API. Official documentation.

[5] Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Muennighoff, N., et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023.