LearnTransformer Deep DivesEmbedding Similarity & Quantization

📐HardEmbeddings & Vector Search

Embedding Similarity & Quantization

Learn vector scoring contracts, evaluate Matryoshka widths, and measure scalar, product, and binary quantization before deploying compressed retrieval.

39 min read

Learning path

Step 89 of 158 in the full curriculum

Sentence Embeddings & Contrastive Loss Scaled Dot-Product Attention

document_qa_v2 has a strict boundary: a retriever may propose policy passages, but only approved evidence may support an answer. As the approved policy collection grows, every query has to search it quickly without silently degrading which passage is retrieved.

Suppose a user asks whether a contractor can open production dashboards. The correct result is the approved access policy, not a private draft note that happens to repeat "contractor dashboard" many times. Both passages become long lists of numbers called embeddings (or vectors). Retrieval compares those vectors, but its scoring and compression choices determine whether the approved policy even reaches the evidence gate.

Start with three vector scoring rules, see exactly when they rank results identically, and then tackle memory pressure. You'll compress stored vectors with scalar, product, and binary quantization, measure what each shortcut loses, and build the shortlist-and-rescore pattern that protects answer quality.

Three ways to measure closeness

An embedding is just a point in high-dimensional space. If you have two points, you can ask how near they are. There are three common rulers, and each answers a slightly different question.

Cosine similarity asks: "Do they point in the same direction?"
Dot product asks: "Do they point in the same direction, and how large are their magnitudes?"
Euclidean distance asks: "What's the straight-line distance between them?"

For any pipeline that normalizes every vector to length 1, cosine similarity and dot product become mathematically equivalent for ranking. That turns the metric decision into a systems decision: use the scoring rule your search library can execute quickly, then keep the same normalization contract at ingest time and query time.

A tiny concrete example shows how the two rulers differ without needing a computer.

A worked example in two dimensions

Suppose we embed two incident notes into tiny 2D vectors so we can draw them on paper:

Note A: "GPU cache warmed" maps to [3, 4]
Note B: "Cache warm on GPU" maps to [4, 3]

Both point northeast, so they should be similar. Measure that similarity with each ruler.

Cosine similarity

Cosine similarity ignores length and looks only at direction. The formula is:

\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}

First, compute the dot product in the numerator. Multiply matching dimensions and add:

\mathbf{A} \cdot \mathbf{B} = (3 \times 4) + (4 \times 3) = 12 + 12 = 24

Next, compute the lengths (norms) of each vector. The length of a vector is the square root of the sum of squared components:

\|\mathbf{A}\| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5

\|\mathbf{B}\| = \sqrt{4^2 + 3^2} = \sqrt{16 + 9} = \sqrt{25} = 5

Now divide:

\cos(\theta) = \frac{24}{5 \times 5} = \frac{24}{25} = 0.96

A score of 0.96 means these invented vectors point in nearly the same direction. Whether direction corresponds to relevance in a real index is something the embedding training and retrieval evaluation must establish.

Dot product

The dot product skips the normalization step and uses the raw sum we already calculated:

\mathbf{A} \cdot \mathbf{B} = 24

Because both vectors happen to have the same length (5), the dot product gives a ranking that matches cosine similarity. But if one vector were much longer, the dot product would change even though the angle stayed the same.

Euclidean distance

Subtract the dimensions, square the differences, sum them, and take the square root:

\|\mathbf{A} - \mathbf{B}\|_2 = \sqrt{(3-4)^2 + (4-3)^2} = \sqrt{1 + 1} = \sqrt{2} \approx 1.41

A small Euclidean distance also tells us these notes are close in space. For this single pair, all three rulers describe a close relationship. Add candidates with different norms, even in two dimensions, and their rankings can diverge. The choice of ruler changes which notes you retrieve.

The same arithmetic becomes a few lines of NumPy. This is a direct way to test whether your retrieval code matches the math you just worked by hand:

a-worked-example-in-two-dimensions.py

import numpy as np

ticket_a = np.array([3.0, 4.0])
ticket_b = np.array([4.0, 3.0])

dot = float(ticket_a @ ticket_b)
cosine = dot / (np.linalg.norm(ticket_a) * np.linalg.norm(ticket_b))
euclidean = float(np.linalg.norm(ticket_a - ticket_b))

print(f"dot={dot:.0f}, cosine={cosine:.2f}, euclidean={euclidean:.2f}")

longer_ticket_a = ticket_a * 10
same_direction_cosine = float(
    (longer_ticket_a @ ticket_b)
    / (np.linalg.norm(longer_ticket_a) * np.linalg.norm(ticket_b))
)
larger_dot = float(longer_ticket_a @ ticket_b)

print(f"10x vector: cosine={same_direction_cosine:.2f}, dot={larger_dot:.0f}")

Output

dot=24, cosine=0.96, euclidean=1.41
10x vector: cosine=0.96, dot=240

Cosine similarity: direction only

Cosine similarity is useful when the retrieval contract says vector length shouldn't change ranking. It compares direction, not "meaning" directly. A strong embedding model can arrange relevant policy text in similar directions, but cosine alone can't prove relevance or authorization.

Separate direction from signal strength. Cosine checks whether two arrows point toward the same concept cluster; it ignores how long the arrows are. Raw dot product cares about both alignment and vector length. Which signal matters is set by model training and validated on retrieval queries, not chosen by intuition after indexing.

Cosine similarity ranges from $-1$ to $1$ : 1 means same direction, 0 means orthogonal, and -1 means opposite direction. Its key property is magnitude invariance. A vector of [1, 1] has the same cosine direction as [100, 100] because they point the same way.

When to use cosine similarity

When the model card or your retrieval evaluation calls for angular comparison.
When document and query vectors are normalized consistently and vector norm isn't intended to carry ranking signal.
When an experiment on your own queries confirms that angular scoring retrieves the right evidence.

Dot product: direction plus magnitude

The dot product is an un-normalized alignment score between two vectors:

\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^n a_i b_i

Multiply each pair of matching dimensions and add them all up. Unlike cosine similarity, there's no dividing by lengths, so if one vector is twice as long, the dot product doubles. That makes it the cheaper query-time metric once vectors have already been unit-normalized, but it also makes raw dot products sensitive to how "big" each embedding is.

The dot product ranges from $-\infty$ to $+\infty$ . Its key property is sensitivity to both direction and magnitude.

When to use dot product

When magnitude is intentionally part of the learned score, as in maximum inner product search (MIPS).
When your training setup or ANN library is optimized for inner-product search.
When vectors are already L2-normalized, because inner product and cosine produce the same ranking.
When both indexed vectors and query vectors are unit-normalized before scoring, so inner product preserves cosine ranking without doing norm divisions during candidate scoring.

When all three metrics agree

Some APIs return unit-length embeddings, and many production pipelines normalize vectors at ingest time even when the model itself doesn't. Treat normalization as an explicit contract between model output, index construction, and query serving. Don't assume it's implicit because the model is modern.

When both vectors are unit length ( $\|\mathbf{A}\| = \|\mathbf{B}\| = 1$ ), the mathematical relationship between our similarity metrics collapses into a clean equivalence. The denominator in the cosine similarity formula becomes $1 \times 1 = 1$ , making it identical to the raw dot product. Euclidean distance also becomes a direct function of the dot product:

\cos(\mathbf{A}, \mathbf{B}) = \mathbf{A} \cdot \mathbf{B} = 1 - \frac{\|\mathbf{A} - \mathbf{B}\|^2}{2}

Why this matters

When all vectors have length 1, cosine similarity, dot product, and squared Euclidean distance all induce the same ranking. In practice, pick the metric your library optimizes best and keep query-time and index-time conventions consistent. After the stored vectors and each incoming query have been normalized once, inner product maps directly to fast matrix multiplication kernels without norm divisions during candidate scoring.

Two vectors keep nearly the same direction. As one vector scales up, cosine stays flat while dot product rises. — Direction stays aligned, but dot product keeps growing with vector norm.

This retrieval fixture makes the contract visible. A private-note fixture has a deliberately large raw norm and wins a raw dot-product search. Once all candidates and the query are normalized, the approved policy wins under both cosine and unit-vector dot product.

normalization-is-a-ranking-contract.py

import numpy as np

labels = np.array(["approved_access_policy", "private_draft_note", "incident_runbook"])
documents = np.array(
    [
        [0.95, 0.05],  # Direction closely matches the access-policy query.
        [8.00, 6.00],  # Deliberately large norm lets raw dot product dominate.
        [0.20, 0.90],
    ],
    dtype=np.float32,
)
query = np.array([1.0, 0.0], dtype=np.float32)

def normalize(rows: np.ndarray) -> np.ndarray:
    rows_2d = np.atleast_2d(rows)
    unit = rows_2d / np.linalg.norm(rows_2d, axis=1, keepdims=True)
    return unit[0] if rows.ndim == 1 else unit

raw_dot = documents @ query
unit_documents = normalize(documents)
unit_query = normalize(query)
cosine = unit_documents @ unit_query
unit_dot = unit_documents @ unit_query

print(f"raw dot winner: {labels[np.argmax(raw_dot)]}")
print(f"normalized winner: {labels[np.argmax(cosine)]}")
print(f"cosine equals unit dot: {np.allclose(cosine, unit_dot)}")

Output

raw dot winner: private_draft_note
normalized winner: approved_access_policy
cosine equals unit dot: True

You shouldn't infer that every large-norm vector is bad. The safe conclusion is narrower: if the serving contract is cosine-like ranking, skipping normalization changes what the index is optimizing.

Euclidean distance: straight-line gap

Euclidean distance (also called L2 distance) measures the straight-line distance between two points:

\|\mathbf{A} - \mathbf{B}\|_2 = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}

Subtract matching dimensions, square each difference, sum them up, and take the square root. That is the straight-line distance between two points in high-dimensional space, just like measuring distance on a map, but with hundreds of dimensions instead of two.

High dimension doesn't make Euclidean distance automatically wrong. On unit-normalized vectors, squared Euclidean distance produces exactly the same ranking as cosine and dot product through the equation above. Without that normalization, L2 distance also includes norm differences. Use it when the model, index, and evaluation are built around that geometry.

Euclidean distance ranges from $0$ to $+\infty$ , where 0 means identical points. It also appears inside product quantization, where the original PQ formulation approximates squared L2 distances between a full-precision query and compressed database vectors.^{[1]Reference 1Product Quantization for Nearest Neighbor Search.https://dblp.org/rec/journals/pami/JegouDS11}

Similarity metrics compare geometry contracts with vector payload size. — Compression changes storage size and scoring cost; the retrieval contract still needs one intended geometry such as cosine, dot product, or L2 distance.

Choosing how many dimensions you need

Embedding dimensionality controls how much semantic information the vector can encode. That capacity comes with direct storage, memory-bandwidth, and latency costs. Choosing dimensions is a real architecture decision, not a leaderboard footnote.

Width is a model-and-evaluation contract

A dimension count has no universal quality meaning by itself. A 512-dimensional model trained well for your support queries can outperform a different 1,536-dimensional model. Width comparisons become defensible when you hold the model family, scoring convention, corpus, and evaluation set fixed.

Choice under evaluation	What you can conclude	What you still must measure
Full trained width	Baseline payload and retrieval score for this model	Latency, RAM, and evidence recall
Model-supported shorter prefix	Lower raw payload for the same family	Recall loss and failure slices
Different smaller model	Possibly cheaper candidate system	Everything again; training changed too

Dimension count and normalization are separate choices. A supported shorter vector still needs the serving normalization contract documented by its model or applied by your pipeline.

Why dimensions matter

For a fixed representation family, more stored dimensions have deterministic systems costs:

Storage: Each float32 is 4 bytes. 1M vectors at 1536d is about 5.7 GiB of raw payload.
Exact scoring work: A dot product or L2 comparison does linear work in the number of dimensions, written $O(d)$ , per candidate.
Memory bandwidth: Loading vectors from memory is often the bottleneck for Approximate Nearest Neighbor (ANN) search, a class of algorithms that find close vectors without an exhaustive search.
End-to-end latency: Halving dimensions doesn't promise halved latency because graph traversal, filters, networking, and reranking remain. Benchmark before paying or saving the memory bill.

Matryoshka embeddings

Some embedding families are trained so selected truncated prefixes stay useful. Matryoshka Representation Learning (MRL) optimizes losses at multiple chosen representation sizes during training.^{[2]Reference 2Matryoshka Representation Learning.https://arxiv.org/abs/2205.13147} Hosted APIs can expose a supported shortening control too; OpenAI's embeddings guide documents a dimensions parameter for text-embedding-3 models.^{[3]Reference 3Vector embeddingshttps://developers.openai.com/api/docs/guides/embeddings}

Matryoshka embeddings are named after nesting dolls because one representation contains selected shorter prefixes. That doesn't guarantee every arbitrary cut point, every task, or every query keeps acceptable recall. Use widths that the training setup or provider supports, then evaluate each proposed serving width on your own evidence queries.

Standard embedding training minimizes a single loss function (like contrastive loss) on the full vector. MRL minimizes a weighted sum of losses at a selected set of truncation points, written as $d \in \mathcal{D}$ . This nested structure encourages the trained prefix widths to form useful representations on their own.

MRL is different from post-hoc dimensionality reduction such as PCA. PCA projects already-trained embeddings into a smaller linear subspace, usually by preserving variance on a fixed dataset. MRL changes training itself so prefix slices are useful without a separate projection step. PCA can still be useful for frozen embeddings, but MRL is cleaner when the model was trained for truncation and you want one serving path with several dimension budgets.

A nested embedding strip marks selected small, middle, and full prefixes, then an exact fixture slices an eight-dimensional vector to four dimensions, renormalizes its norm from 0.580 to 1.000, and compares 100 percent full-width recall with 50 percent short-prefix recall. — Selected prefixes are nested, but each serving width needs its own normalization and retrieval test. The 100% versus 50% result is the synthetic two-query fixture below, not a benchmark for a real MRL model.

Supported prefix shortening can reduce raw payload while keeping one model family in the serving stack. OpenAI's embeddings documentation, for example, lists default sizes of 1,536 for text-embedding-3-small and 3,072 for text-embedding-3-large, and documents shorter embeddings through the dimensions parameter.^{[3]Reference 3Vector embeddingshttps://developers.openai.com/api/docs/guides/embeddings} Those documented capabilities still don't replace evaluation on your retrieval corpus.

If you ask a provider for a shorter embedding through the API, follow that provider's normalization contract. Manual slicing is different. If you cut a prefix from an already generated vector, normalize that sliced prefix before indexing. The local check enforces that contract without any external API call:

matryoshka-embeddings.py

import numpy as np

def normalize_l2(vector: np.ndarray) -> np.ndarray:
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector
    return vector / norm

full_embedding = np.array(
    [0.21, -0.05, 0.44, 0.31, 0.18, -0.27, 0.12, 0.09],
    dtype=np.float32,
)

short_embedding = normalize_l2(full_embedding[:4])

print(f"dims={short_embedding.shape[0]}, norm={np.linalg.norm(short_embedding):.3f}")

Output

dims=4, norm=1.000

Normalization is necessary for a unit-vector scoring contract, but it can't restore information removed by truncation. This deliberately small fixture makes that distinction concrete: the last two dimensions separate an approved policy from a private draft note, so the shortened representation loses one correct result.

a-prefix-still-needs-recall-evaluation.py

import numpy as np

labels = np.array(["private_draft_note", "approved_access_policy", "escalation_policy"])
documents = np.array(
    [
        [1.0, 0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 1.0],
    ],
    dtype=np.float32,
)
queries = np.array(
    [
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 1.0],
    ],
    dtype=np.float32,
)
expected = np.array(["approved_access_policy", "escalation_policy"])

def top1_at_width(width: int) -> np.ndarray:
    docs = documents[:, :width]
    asks = queries[:, :width]
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    asks = asks / np.linalg.norm(asks, axis=1, keepdims=True)
    return labels[np.argmax(asks @ docs.T, axis=1)]

full = top1_at_width(4)
short = top1_at_width(2)
print(f"full top-1 recall={np.mean(full == expected):.1%}, results={full.tolist()}")
print(f"short top-1 recall={np.mean(short == expected):.1%}, results={short.tolist()}")

Output

full top-1 recall=100.0%, results=['approved_access_policy', 'escalation_policy']
short top-1 recall=50.0%, results=['private_draft_note', 'escalation_policy']

This isn't a benchmark for a real MRL model. It's the release test shape you need: evaluate supported widths on approved-query fixtures, including near-duplicate but unauthorized passages.

The memory problem: a sizing example

Work through this step by step. The same arithmetic appears whenever you size a vector database.

Scenario: You have 1 million embeddings from a hosted API, each with 1,536 dimensions, stored as 32-bit floats (FP32). How much RAM do you need for the raw vector payload, and how can you reduce it?

Step 1: Calculate the baseline

Each dimension is a float32, which takes 4 bytes.

1{,}000{,}000 \times 1{,}536 \times 4 \text{ bytes} \approx 6.14 \text{ GB}

That's just the raw numbers. Real indexes add overhead for graph links, metadata, and replicas, but the vector payload alone is already over 6 GB.

Step 2: Apply 8-bit scalar quantization

If you map each float32 to a single byte (INT8), you cut storage by roughly 4x:

1{,}000{,}000 \times 1{,}536 \times 1 \text{ byte} \approx 1.53 \text{ GB}

Step 3: Apply product quantization aggressively

Suppose you split each 1,536-dimensional vector into 96 subvectors of 16 dimensions each, and you represent each subvector with an 8-bit centroid ID. The compressed vector is just 96 bytes:

1{,}000{,}000 \times 96 \text{ bytes} \approx 96 \text{ MB}

That's a 64x compression from the original FP32 payload. You're now comparing approximate codes instead of exact floats, so you typically pair PQ with a higher-precision reranking stage.

The byte math is easy to turn into a sizing helper. Notice the difference between decimal gigabytes, which vendors often use in pricing copy, and binary gibibytes, which operating systems usually report:

the-memory-problem-a-sizing-example.py

def vector_payload_bytes(num_vectors: int, dimensions: int, bytes_per_dim: float) -> float:
    return num_vectors * dimensions * bytes_per_dim

raw_bytes = vector_payload_bytes(1_000_000, 1_536, 4)
int8_bytes = vector_payload_bytes(1_000_000, 1_536, 1)
pq_bytes = vector_payload_bytes(1_000_000, 96, 1)

print(f"FP32={raw_bytes / 1e9:.2f} GB ({raw_bytes / 1024**3:.2f} GiB)")
print(f"INT8={int8_bytes / 1e9:.2f} GB")
print(f"PQ={pq_bytes / 1024**2:.1f} MiB")

Output

FP32=6.14 GB (5.72 GiB)
INT8=1.54 GB
PQ=91.6 MiB

Re-embedding cost scales with total input tokens, not document count alone. Compute it as total_tokens / 1_000_000 * embedding_price_per_1M_tokens, using the current price for the model you have selected. Keep model version, dimensions, normalization, and quantization settings alongside every index so you can compare and migrate without mixing incompatible vectors.

Shrinking vectors without losing meaning

At scale, storing and searching raw float32 vectors may exceed a practical memory or latency budget. Every exact scan must read $N \times d$ values; an approximate index reduces candidates, while quantization reduces bytes per stored candidate. Both are engineering choices that can lower recall, so each needs an evaluation gate.

Diagram showing 1. Measure FP32 baseline Recall@K, latency, RAM, 2. Budget fits?, yes, and Keep FP32 baseline. — 1. Measure FP32 baseline Recall@K, latency, RAM, 2. Budget fits?, yes, and Keep FP32 baseline.

Compression is a release experiment, not a one-way conversion. Start from a labeled FP32 baseline, reject any shortcut that loses required evidence from the shortlist, and keep authorization checks in the precise serving path.

Scalar quantization (SQ)

Scalar quantization (SQ) replaces each float32 (4 bytes) with a lower-precision representation, typically mapping each dimension independently. The simplest signed int8 version is symmetric around zero:

x_q = \operatorname{clip}\left(\operatorname{round}\left(\frac{x}{s}\right), -127, 127\right), \qquad x \approx s x_q

Here, s is the scale. Each dimension goes from 4 bytes (float32) to 1 byte (int8), so raw storage drops by about 4x while preserving an approximate reconstruction path back to float space. In practice, some libraries use this symmetric signed form, while others use affine quantization with a zero-point, often over uint8.

This function implements symmetric scalar quantization. It takes an array of float32 vectors and returns the compressed int8 vectors along with the per-dimension scales needed to approximately reconstruct the original floats later:

scalar-quantization-sq.py

import numpy as np

def symmetric_quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Per-dimension symmetric quantization from float32 to int8.
    Returns quantized vectors and scale.
    """
    qmax = 127

    max_abs = np.maximum(np.max(np.abs(vectors), axis=0), 1e-8)
    scale = max_abs / qmax

    quantized = np.clip(
        np.round(vectors / scale),
        -qmax,
        qmax,
    ).astype(np.int8)

    return quantized, scale

vectors = np.array(
    [
        [0.20, -1.00, 2.50],
        [0.24, -0.80, 2.10],
        [0.50, -1.40, 2.30],
    ],
    dtype=np.float32,
)

quantized, scale = symmetric_quantize_int8(vectors)
reconstructed = scale * quantized.astype(np.float32)
max_error = float(np.max(np.abs(vectors - reconstructed)))

print(quantized.dtype, quantized.shape, f"max_error={max_error:.4f}")

Output

int8 (3, 3) max_error=0.0063

Quality impact

INT8 cuts the raw vector payload from four bytes per dimension to one. It doesn't promise any fixed Recall@K. Model distribution, calibration sample, scoring rule, shortlist depth, and reranking all change the observed loss, so treat INT8 as the first candidate to benchmark rather than an automatic release decision.

The outlier trap

The biggest practical mistake in scalar quantization is deriving the scale from the absolute minimum and maximum of a calibration sample without checking outliers. If one dimension spikes far outside the usual range, the scale stretches and ordinary values collapse into fewer integer buckets. Percentile clipping can preserve ordinary resolution while saturating outliers, but the percentile is another measured choice rather than a universal constant.

The next executable example quantizes ordinary values plus one spike. It compares a scale set by the spike with a clipped scale set from the ordinary calibration range:

outlier-calibration-costs-resolution.py

import numpy as np

ordinary = np.linspace(-1.0, 1.0, 101, dtype=np.float32)
with_outlier = np.concatenate([ordinary, np.array([25.0], dtype=np.float32)])

def quantize_reconstruct(values: np.ndarray, bound: float) -> np.ndarray:
    q = np.clip(np.round(values / (bound / 127)), -127, 127).astype(np.int8)
    return q.astype(np.float32) * (bound / 127)

bound_from_max = float(np.max(np.abs(with_outlier)))
bound_from_calibration = 1.0
max_scaled = quantize_reconstruct(ordinary, bound_from_max)
clipped_scaled = quantize_reconstruct(ordinary, bound_from_calibration)

max_error = float(np.mean(np.abs(ordinary - max_scaled)))
clipped_error = float(np.mean(np.abs(ordinary - clipped_scaled)))
print(f"outlier-bound ordinary MAE={max_error:.4f}")
print(f"clipped-bound ordinary MAE={clipped_error:.4f}")

Output

outlier-bound ordinary MAE=0.0483
clipped-bound ordinary MAE=0.0019

Lower reconstruction error on ordinary values is encouraging, but retrieval is the real target. A release check must measure whether relevant approved passages stay in the shortlist after calibration and compression.

Product quantization (PQ)

Introduced by Jégou et al.^{[1]Reference 1Product Quantization for Nearest Neighbor Search.https://dblp.org/rec/journals/pami/JegouDS11}, Product Quantization (PQ) splits each vector into $M$ subvectors and quantizes each independently using a learned codebook.

Example with $M=4$ on an 8-dimensional vector:

\text{Original}: [1.2, 3.4 \mid 0.8, 2.1 \mid 5.3, 1.7 \mid 0.4, 3.2] \quad (8 \times 4 = 32 \text{ bytes})

This example shows a single 8-dimensional vector split into 4 subvectors of 2 dimensions each. Each subvector gets mapped to the ID of its nearest centroid in that subspace's codebook:

$[1.2, 3.4] \to \text{Centroid ID: } 7$
$[0.8, 2.1] \to \text{Centroid ID: } 23$
$[5.3, 1.7] \to \text{Centroid ID: } 3$
$[0.4, 3.2] \to \text{Centroid ID: } 41$

\text{Compressed code}: \quad (7, 23, 3, 41) \quad (4 \text{ bytes at 8 bits/code})

Each subvector is now stored as a centroid index instead of raw floats. Storage drops from 32 bytes to 4 bytes because we keep IDs; at query time, lookup tables score those IDs against the full-precision query.

Product quantization turns float chunks into centroid IDs, then ADC picks the lower lookup sum. — PQ stores four chunk IDs. ADC wins by the lower summed lookup distance.

With 256 centroids per subspace (8 bits), the stored code can be much smaller than the original vector. For example, compressing 768d float32 vectors (3,072 bytes) using PQ with 96 subspaces produces a 96-byte per-vector code, a 32x payload compression. Shared codebooks and index metadata add separate overhead.

Asymmetric distance computation (ADC)

At query time, you usually keep the query vector in full precision and compare each query subvector against the centroids in the matching codebook^{[1]Reference 1Product Quantization for Nearest Neighbor Search.https://dblp.org/rec/journals/pami/JegouDS11}. That gives a small lookup table per subspace:

d(\mathbf{q}, \hat{\mathbf{x}}) \approx \sum_{m=1}^{M} \left\| \mathbf{q}_m - \mathbf{c}_{m, i_m} \right\|_2^2

The compressed database vector stores only centroid IDs $i_m$ , so scoring a candidate becomes $M$ table lookups and additions. That's why PQ can search compressed vectors without fully decompressing them. This squared-L2 form is the original PQ/ADC setup; a system serving inner-product search must use an implementation and evaluation contract that supports its chosen score.

In a real index, training data learns each codebook. Here we write down two tiny codebooks so you can see asymmetric distance computation (ADC) itself: the full-precision query creates a lookup table, and stored IDs select four distances to add.

pq-asymmetric-distance-lookups.py

import numpy as np

# Four subspaces, two centroids in each; production PQ learns many centroids.
codebooks = np.array(
    [
        [[0.0, 0.0], [1.2, 3.4]],
        [[0.0, 0.0], [0.8, 2.1]],
        [[0.0, 0.0], [5.3, 1.7]],
        [[0.0, 0.0], [0.4, 3.2]],
    ],
    dtype=np.float32,
)
query = np.array([1.1, 3.3, 0.7, 2.0, 5.2, 1.8, 0.3, 3.1], dtype=np.float32)
stored_codes = {
    "approved_access_policy": np.array([1, 1, 1, 1]),
    "escalation_policy": np.array([0, 0, 0, 0]),
}

query_chunks = query.reshape(4, 2)
lookup = np.sum((codebooks - query_chunks[:, None, :]) ** 2, axis=2)
scores = {
    name: float(lookup[np.arange(4), code].sum())
    for name, code in stored_codes.items()
}
winner = min(scores, key=scores.get)

print({name: round(score, 2) for name, score in scores.items()})
print(f"ADC nearest code: {winner}")

Output

{'approved_access_policy': 0.08, 'escalation_policy': 56.57}
ADC nearest code: approved_access_policy

Trade-off

Higher compression usually means lower recall. The exact loss depends on the number of subspaces, centroid count, and reranking budget, so benchmark PQ with your real query distribution instead of relying on a single headline number.

Binary quantization (BQ)

Of the three payload reductions here, Binary Quantization (BQ) is the most aggressive: it reduces each dimension to a single bit. This version thresholds at zero: positive numbers become 1, and zero or negative numbers become 0. It discards magnitude and granular precision, so a binary shortlist is a candidate to evaluate, not a final ranking guarantee.

The code snippet below keeps the transformation small. It takes one raw float32 vector and returns packed bytes containing one sign bit per dimension:

binary-quantization-bq.py

import numpy as np

def binary_quantize(vector: np.ndarray) -> np.ndarray:
    """
    Pack one sign bit per dimension into bytes.
    Positive values become 1, zero/negative become 0.
    """
    bits = (vector > 0).astype(np.uint8)
    return np.packbits(bits, bitorder="little")

def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
    differing_bits = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(differing_bits, bitorder="little").sum())

ticket_a = np.array([0.3, -0.1, 0.8, -0.4], dtype=np.float32)
ticket_b = np.array([0.5, -0.2, 0.7, 0.1], dtype=np.float32)

packed_a = binary_quantize(ticket_a)
packed_b = binary_quantize(ticket_b)
distance = hamming_distance(packed_a, packed_b)

print(f"packed_a={packed_a[0]}, packed_b={packed_b[0]}, hamming={distance}")

Output

packed_a=5, packed_b=13, hamming=1

(vector > 0).astype(np.uint8) is still one byte per dimension. You only get true 1-bit storage once you pack those bits.

Worked example

Take a 4-dimensional embedding after L2 normalization: [0.3, -0.1, 0.8, -0.4]. Binary quantization thresholds at zero:

Positive maps to 1, non-positive maps to 0, giving bit vector [1, 0, 1, 0]
Packed into bytes with little-endian bit order: one byte holding those four bits, decimal value 5 (0b00000101).

Now compare against another binary vector [1, 0, 1, 1] (from [0.5, -0.2, 0.7, 0.1]):

XOR the packed bytes, leaving differing bits only in the last position
popcount (number of 1s in XOR result) = 1, so Hamming distance = 1

A Hamming distance of 1 out of 4 bits means the vectors are close in this binary approximation. In production, you can use this fast filter to pull a shortlist, then rescore survivors with INT8 or FP32 scores. Because sign bits discard magnitude, binary quantization always needs validation on your model and query distribution. This zero-threshold version works best when coordinates are centered around zero; other distributions need a different calibration or encoding choice.

In this packed-bit design, candidate scoring starts with Hamming distance (XOR + popcount), which is fast on modern CPUs because it relies on bitwise operations. It's only an approximation to angular similarity, though, so production systems often rescore the shortlist with a higher-precision representation.

Dimension	Practical effect
Compression	32x raw payload reduction when moving from float32 to one bit per dimension.
Speed	Packed bitsets use XOR plus popcount, so both memory traffic and arithmetic drop sharply. Implementations can add architecture-specific SIMD acceleration for packed-byte comparisons.
Quality	Recall can drop noticeably, so BQ usually fits best as a first-stage filter.

The two-stage retrieval pattern

Production systems often combine quantization with reranking to balance speed and accuracy.

Two-stage retrieval works like triaging an incident-note backlog. Stage 1 (binary quantization) scans 10,000 notes for rough matches, which is fast but approximate. Stage 2 (full-precision reranking) gives the top 50 notes a detailed comparison. You wouldn't inspect all 10,000 in full because that's too slow, and you wouldn't resolve cases from rough matches alone because that's too inaccurate. The funnel gives you both speed and quality.

One cost-sensitive pattern stretches this into three stages: a binary Hamming search for the cheap shortlist, an INT8 dot-product rescore of that shortlist, and an optional cross-encoder reranker^{[4]Reference 4Rerank API.https://docs.cohere.com/reference/rerank} on the final handful when answer quality justifies the extra latency. Each stage looks at fewer candidates with more precision than the one before it.

Binary quantization retrieval funnel where a cheap Hamming scan creates a shortlist, higher-precision reranking resolves tied candidates, authorization filters private chunks, and only usable evidence reaches the answer. — Binary search shortlists cheaply; rerank restores ordering quality; authorization filters private chunks before usable evidence enters context.

The production-agent boundary from the previous lesson still applies after compression. This miniature release gate checks that binary retrieval keeps the approved policy in its shortlist, then applies precise scoring only to approved candidates:

binary-shortlist-evidence-gate.py

import numpy as np

labels = np.array(["private_draft_note", "approved_access_policy", "escalation_policy"])
approved = np.array([False, True, True])
documents = np.array(
    [
        [0.05, -2.00, 0.05, -2.00],  # Same signs, wrong private evidence.
        [0.78, -0.38, 0.58, -0.22],  # Approved policy, close in full precision.
        [0.10, 0.60, -0.40, -0.20],
    ],
    dtype=np.float32,
)
query = np.array([0.80, -0.40, 0.60, -0.20], dtype=np.float32)
expected = "approved_access_policy"

def sign_bits(rows: np.ndarray) -> np.ndarray:
    return rows > 0

def normalize(rows: np.ndarray) -> np.ndarray:
    rows_2d = np.atleast_2d(rows)
    unit = rows_2d / np.linalg.norm(rows_2d, axis=1, keepdims=True)
    return unit[0] if rows.ndim == 1 else unit

hamming = np.count_nonzero(sign_bits(documents) != sign_bits(query), axis=1)
shortlist = np.argsort(hamming, kind="stable")[:2]
allowed_shortlist = shortlist[approved[shortlist]]
precise_scores = normalize(documents[allowed_shortlist]) @ normalize(query)
served = labels[allowed_shortlist[np.argmax(precise_scores)]]

recall_at_2 = expected in labels[shortlist]
print(f"binary shortlist={labels[shortlist].tolist()}, recall@2={recall_at_2}")
print(f"served evidence={served}, authorized={bool(approved[labels == served][0])}")
assert recall_at_2 and served == expected

Output

binary shortlist=['private_draft_note', 'approved_access_policy'], recall@2=True
served evidence=approved_access_policy, authorized=True

On real data, run that same gate over many labeled queries and failure slices. A compressed first pass is acceptable only if its shortlist recall remains high enough for precise, authorized evidence selection to succeed.

Building a production pipeline

One production vector-search pattern stacks quantization for smaller payloads, an ANN index for lower search work, and reranking for a more precise final comparison. Exact composition is library-specific: some implementations combine HNSW with scalar or product quantization, while others use a different ANN structure or retain full-precision vectors alongside compressed codes for rescoring.

ANN indexing with HNSW

Many vector databases offer HNSW (Hierarchical Navigable Small World) indices^{[5]Reference 5Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.https://arxiv.org/abs/1603.09320} as an ANN structure. HNSW builds a multi-layered proximity graph: upper layers contain progressively fewer sampled nodes, while the base layer contains the indexed collection. Search starts in an upper layer, follows promising graph links, and descends toward a candidate neighborhood at the base.

When configuring an HNSW implementation, three controls commonly expose its quality/cost tradeoff. Exact names and allowed values depend on the library:

Control	Changed during	Usually buys	Usually costs
M (graph connectivity)	Index build	More routes to true neighbors	More graph memory and build work
ef_construction	Index build	A better-connected graph	Slower index construction
ef_search	Query serving	More candidate exploration	Higher query latency

Measure recall and latency first while changing query-time exploration. If the graph can't reach the required recall within the latency budget, rebuild candidates with adjusted construction settings. No copied parameter triple proves quality for a new corpus.

Full pipeline overview

Production vector retrieval pipeline showing offline index build feeding online query narrowing from 100,000 indexed chunks to ANN candidates, precise scores, reranked passages, and approval. — Follow the shared index from offline build to online serving: the candidate set shrinks at each stage, so recall, precision, reranking, and approval are separate measurements.

Model and dimension selection

Public benchmarks such as MTEB evaluate embedding models across several task families.^{[6]Reference 6MTEB: Massive Text Embedding Benchmark.https://arxiv.org/abs/2210.07316} They help you assemble candidates, but they can't tell you which index retrieves your approved policy passages under your storage and latency limits.

Decision	Establish a baseline	Accept a change only when
Model family	Full-precision Recall@K on approved evidence queries	Target slices and latency remain acceptable
Supported prefix width	Full trained width with identical scoring	Measured recall loss fits your release threshold
SQ, PQ, or BQ	Uncompressed shortlist and final ranking	Compressed shortlist preserves required evidence recall
ANN settings	Exact-search results for a labeled sample	Recall and latency both fit the serving budget

Keep a versioned experiment row for model, width, normalization, quantizer, index settings, shortlist size, and reranker. That row is what makes a compressed-index rollout explainable when an access-policy answer later changes.

Cost analysis at scale

As your vector corpus grows from millions to billions of documents, the cost structure of your retrieval system shifts sharply. At small scale, storing raw float32 vectors in RAM is feasible. At billion scale, it can become the dominant infrastructure cost because low-latency approximate nearest neighbor search keeps hot index data in memory.

The byte math is deterministic even when cloud pricing isn't. This table shows just the vector payload, rounded in binary units, before HNSW links, metadata, replicas, or region-specific RAM pricing:

Scale	Raw Payload (1024d, FP32)	With INT8 SQ	With PQ (96-byte code)
1M vectors	3.8 GiB	0.95 GiB	92 MiB
10M vectors	38 GiB	9.5 GiB	916 MiB
100M vectors	380 GiB	95 GiB	8.9 GiB
1B vectors	3.7 TiB	950 GiB	89 GiB

Practice: choose a retrieval stack

You have 50 million approved policy and incident-resolution embeddings at 768 dimensions. A private draft-note collection must never be served as answer evidence. P99 latency budget is 60 ms, and raw FP32 vectors would crowd out the rest of the service's RAM.

Would you keep raw FP32 payload in memory for first-stage search?
If your ingest path already L2-normalizes vectors, would you score with cosine or dot product online?
Would you start with INT8 SQ, PQ, or BQ for the first pass?
Where would you spend full-precision compute?
What test prevents a cheaper index from silently changing evidence quality?

Solution sketch:

Don't start with raw FP32 if memory is already tight. Benchmark INT8 first, then PQ if you still need a bigger cut.
Use dot product online if vectors are already unit-normalized. It preserves cosine ranking without redundant norm work.
Start by benchmarking INT8 SQ because it changes payload less aggressively than PQ or BQ. Move further only when the evaluated RAM/latency gain justifies measured shortlist loss.
Spend full-precision compute on a shortlist, not on the whole corpus. That's where rescoring and reranking pay off.
Build a labeled release set containing approved policies plus tempting unauthorized near-matches. Compare Recall@K against the full-precision baseline and reject any serving result sourced from unapproved text.

Mastery check

Key concepts

Cosine similarity measures direction. Raw dot product measures direction plus magnitude.
Once vectors are unit-normalized, cosine and dot product give the same ranking.
Matryoshka Representation Learning trains selected prefix widths to remain useful; each deployed width still needs evaluation.
Scalar quantization compresses each dimension independently. Product quantization compresses subvectors into centroid IDs.
Binary quantization is a cheap filter, not a trustworthy final ranker.
A safe retrieval rollout is staged and measured: compressed first pass, authorized shortlist check, higher-precision rescore, then optional reranker.

Evaluation rubric

Strong: derives why normalized cosine and dot product are equivalent, then ties that to serving efficiency.
Strong: computes rough RAM savings for FP32, INT8, and PQ without confusing bytes, GB, GiB, and code sizes.
Strong: explains PQ as subvector codebooks plus ADC lookup tables, not as generic compression.
Strong: chooses compression stage based on latency, evidence-recall target, and RAM budget, then defines a release gate.
Weak: says "always use cosine" or "higher dimensions are always better" without checking normalization or workload.
Weak: treats quantization as inherently safe or inherently inaccurate instead of measuring shortlist recall and served evidence.

Follow-up questions

Why does dot product become equivalent to cosine similarity for normalized vectors?

Cosine divides the dot product by the product of the vector norms. If both vectors have norm 1, that denominator is 1, so cosine similarity equals the dot product. That lets a retrieval system normalize once, then use fast inner-product scoring without recomputing norms.

How does Matryoshka Representation Learning differ from PCA?

MRL is a training-time technique that teaches one embedding to work at several prefix lengths. PCA is a post-training projection applied to frozen vectors. MRL is useful when the model was trained for prefix truncation; PCA is a retrofit when you already have embeddings and need a smaller projected representation.

Why is Product Quantization often more attractive than Scalar Quantization at billion-vector scale?

Scalar quantization usually gives about 4x raw payload compression by moving from float32 to int8. Product Quantization can compress much more aggressively by storing centroid IDs for subspaces, which can be the difference between a huge distributed RAM footprint and a smaller compressed index. PQ usually needs rescoring because it introduces approximation error.

Why is Binary Quantization faster than float32 dot products?

Binary Quantization stores one sign bit per dimension, packs bits into bytes, and scores candidates with XOR plus popcount. That cuts memory traffic and replaces floating-point multiply-adds with cheap bit operations, but the score is coarse and should usually feed a reranker.

When is bigger embedding dimension worth the extra RAM bill?

Only when your own query set shows that smaller prefixes or lower-dimensional models miss too many relevant items even after reranking. Dimension count is a measured tradeoff, not a prestige number.

Common pitfalls

Recall falls after INT8 rollout. Cause: scale got stretched by outliers or mismatch between offline and online normalization. Fix: clip calibration range robustly, recheck normalization contract, and benchmark recall on real queries.
ANN results look fine offline but wrong in production. Cause: you tuned on MTEB or synthetic samples instead of real ticket language and chunking. Fix: build an eval set from your own queries and review misses manually.
Search is still too slow after quantization. Cause: you compressed payload but still rescore too many candidates or set ef_search too high. Fix: tune shortlist size, HNSW parameters, and rerank depth together.
Migration silently degrades quality. Cause: old and new vectors use different model versions, dimensions, or quantization settings. Fix: version the whole embedding contract and build the new index side by side before traffic cutover.

Next Step

Continue to Scaled Dot-Product Attention

Embeddings taught you how to represent meaning as vectors and compare them at retrieval time. Attention shows how Transformer layers build contextual vectors by comparing tokens inside a sequence, including the scaling term, multi-head structure, quadratic cost, <span data-glossary="kv-cache">KV-cache</span> pressure, and <span data-glossary="flashattention">FlashAttention</span>.

PreviousSentence Embeddings & Contrastive Loss

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Product Quantization for Nearest Neighbor Search.

Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011

Matryoshka Representation Learning.

Kusupati, A., et al. · 2022 · NeurIPS 2022

Vector embeddings

OpenAI · 2024

Rerank API.

Cohere. · 2026 · Official documentation

Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.

Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence

MTEB: Massive Text Embedding Benchmark.

Muennighoff, N., et al. · 2023 · EACL 2023

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Embedding Similarity & Quantization

Three ways to measure closeness

A worked example in two dimensions

Cosine similarity

Dot product

Euclidean distance

If vector A is [3, 4] and vector C is [30, 40], what happens to cosine similarity and dot product when you compare both vectors to [4, 3]?

Cosine similarity: direction only

When to use cosine similarity

Dot product: direction plus magnitude

When to use dot product

When all three metrics agree

Why this matters

Euclidean distance: straight-line gap

Choosing how many dimensions you need

Width is a model-and-evaluation contract

Why dimensions matter

Matryoshka embeddings

You shorten a unit-normalized 3072-dimensional embedding by taking only the first 256 values. Should you index that prefix immediately?

The memory problem: a sizing example

Step 1: Calculate the baseline

Step 2: Apply 8-bit scalar quantization

Step 3: Apply product quantization aggressively

Shrinking vectors without losing meaning

Scalar quantization (SQ)

Quality impact

The outlier trap

Product quantization (PQ)

Asymmetric distance computation (ADC)

Trade-off

Binary quantization (BQ)

Worked example

The two-stage retrieval pattern

Why is binary quantization usually paired with reranking instead of used as the final answer?

Building a production pipeline

ANN indexing with HNSW

Full pipeline overview

Model and dimension selection

Cost analysis at scale

Practice: choose a retrieval stack

Mastery check

You have two unit-length vectors. Vector A has dot product 0.73 with Vector B. What is their cosine similarity, and why does this simplify production retrieval?

Key concepts

Evaluation rubric

Follow-up questions

Why does dot product become equivalent to cosine similarity for normalized vectors?

How does Matryoshka Representation Learning differ from PCA?

Why is Product Quantization often more attractive than Scalar Quantization at billion-vector scale?

Why is Binary Quantization faster than float32 dot products?

When is bigger embedding dimension worth the extra RAM bill?

Common pitfalls

Mastery Check

Discussion