Learn vector scoring contracts, evaluate Matryoshka widths, and measure scalar, product, and binary quantization before shipping compressed retrieval.
The previous chapter left document_qa_v2 with a strict boundary: a retriever may propose policy passages, but only approved evidence may support an answer. Now the approved policy collection is growing, and every query has to search it quickly without silently degrading which passage is retrieved.
Suppose a customer reports a cracked tablet. The correct result is the approved return policy, not a private seller note that happens to repeat "cracked tablet" many times. Both passages become long lists of numbers called embeddings (or vectors). Retrieval compares those vectors, but its scoring and compression choices determine whether the approved policy even reaches the evidence gate.
This chapter teaches three vector scoring rules, shows exactly when they rank results identically, and then tackles memory pressure. You will compress stored vectors with scalar, product, and binary quantization, measure what each shortcut loses, and build the shortlist-and-rescore pattern that protects answer quality.
An embedding is just a point in high-dimensional space. If you have two points, you can ask how near they are. There are three common rulers, and each answers a slightly different question.
For any pipeline that normalizes every vector to length 1, cosine similarity and dot product become mathematically equivalent for ranking. That turns the metric decision into a systems decision: use the scoring rule your search library can execute quickly, then keep the same normalization contract at ingest time and query time.
Let's see how they differ with a tiny concrete example you can follow without a computer.
Suppose we embed two support tickets into tiny 2D vectors so we can draw them on paper:
[3, 4][4, 3]Both point northeast, so they should be similar. Let's measure that similarity with each ruler.
Cosine similarity ignores length and looks only at direction. The formula is:
First, compute the dot product in the numerator. Multiply matching dimensions and add:
Next, compute the lengths (norms) of each vector. The length of a vector is the square root of the sum of squared components:
Now divide:
A score of 0.96 means these invented vectors point in nearly the same direction. Whether direction corresponds to relevance in a real index is something the embedding training and retrieval evaluation must establish.
The dot product skips the normalization step and uses the raw sum we already calculated:
Because both vectors happen to have the same length (5), the dot product gives a ranking that matches cosine similarity. But if one vector were much longer, the dot product would change even though the angle stayed the same.
Subtract the dimensions, square the differences, sum them, and take the square root:
A small Euclidean distance also tells us these tickets are close in space. For this single pair, all three rulers describe a close relationship. Add candidates with different norms, even in two dimensions, and their rankings can diverge. The choice of ruler changes which tickets you retrieve.
The same arithmetic becomes a few lines of NumPy. This is a direct way to test whether your retrieval code matches the math you just worked by hand:
1import numpy as np
2
3ticket_a = np.array([3.0, 4.0])
4ticket_b = np.array([4.0, 3.0])
5
6dot = float(ticket_a @ ticket_b)
7cosine = dot / (np.linalg.norm(ticket_a) * np.linalg.norm(ticket_b))
8euclidean = float(np.linalg.norm(ticket_a - ticket_b))
9
10print(f"dot={dot:.0f}, cosine={cosine:.2f}, euclidean={euclidean:.2f}")
11
12longer_ticket_a = ticket_a * 10
13same_direction_cosine = float(
14 (longer_ticket_a @ ticket_b)
15 / (np.linalg.norm(longer_ticket_a) * np.linalg.norm(ticket_b))
16)
17larger_dot = float(longer_ticket_a @ ticket_b)
18
19print(f"10x vector: cosine={same_direction_cosine:.2f}, dot={larger_dot:.0f}")1dot=24, cosine=0.96, euclidean=1.41
210x vector: cosine=0.96, dot=240Cosine similarity is useful when the retrieval contract says vector length shouldn't change ranking. It compares direction, not "meaning" directly. A strong embedding model can arrange relevant policy text in similar directions, but cosine alone can't prove relevance or authorization.
A useful mental model is direction versus load. Cosine checks whether two delivery trucks head the same way; it doesn't care how heavily loaded they are. Raw dot product cares about direction and load size. Which signal matters is set by model training and validated on retrieval queries, not chosen by intuition after indexing.
Cosine similarity ranges from to : 1 means same direction, 0 means orthogonal, and -1 means opposite direction. Its key property is magnitude invariance. A vector of [1, 1] has the same cosine direction as [100, 100] because they point the same way.
The dot product is an un-normalized alignment score between two vectors:
Multiply each pair of matching dimensions and add them all up. Unlike cosine similarity, there's no dividing by lengths, so if one vector is twice as long, the dot product doubles. That makes it the cheaper query-time metric once vectors have already been unit-normalized, but it also makes raw dot products sensitive to how "big" each embedding is.
The dot product ranges from to . Its key property is sensitivity to both direction and magnitude.
Some APIs return unit-length embeddings, and many production pipelines normalize vectors at ingest time even when the model itself doesn't. Treat normalization as an explicit contract between model output, index construction, and query serving. Don't assume it's implicit because the model is modern.
When both vectors are unit length (), the mathematical relationship between our similarity metrics collapses into a clean equivalence. The denominator in the cosine similarity formula becomes , making it identical to the raw dot product. Euclidean distance also becomes a direct function of the dot product:
When all vectors have length 1, cosine similarity, dot product, and squared Euclidean distance all induce the same ranking. In practice, pick the metric your library optimizes best and keep query-time and index-time conventions consistent. After the stored vectors and each incoming query have been normalized once, inner product maps directly to fast matrix multiplication kernels without norm divisions during candidate scoring.
The following retrieval fixture makes the contract visible. A private-note fixture has a deliberately large raw norm and wins a raw dot-product search. Once all candidates and the query are normalized, the approved policy wins under both cosine and unit-vector dot product.
1import numpy as np
2
3labels = np.array(["approved_return_policy", "private_seller_note", "carrier_tracking"])
4documents = np.array(
5 [
6 [0.95, 0.05], # Direction closely matches the return-policy query.
7 [8.00, 6.00], # Deliberately large norm lets raw dot product dominate.
8 [0.20, 0.90],
9 ],
10 dtype=np.float32,
11)
12query = np.array([1.0, 0.0], dtype=np.float32)
13
14def normalize(rows: np.ndarray) -> np.ndarray:
15 rows_2d = np.atleast_2d(rows)
16 unit = rows_2d / np.linalg.norm(rows_2d, axis=1, keepdims=True)
17 return unit[0] if rows.ndim == 1 else unit
18
19raw_dot = documents @ query
20unit_documents = normalize(documents)
21unit_query = normalize(query)
22cosine = unit_documents @ unit_query
23unit_dot = unit_documents @ unit_query
24
25print(f"raw dot winner: {labels[np.argmax(raw_dot)]}")
26print(f"normalized winner: {labels[np.argmax(cosine)]}")
27print(f"cosine equals unit dot: {np.allclose(cosine, unit_dot)}")1raw dot winner: private_seller_note
2normalized winner: approved_return_policy
3cosine equals unit dot: TrueYou shouldn't infer that every large-norm vector is bad. The lesson is narrower: if the serving contract is cosine-like ranking, skipping normalization changes what the index is optimizing.
Euclidean distance (also called L2 distance) measures the straight-line distance between two points:
Subtract matching dimensions, square each difference, sum them up, and take the square root. This gives the straight-line distance between two points in high-dimensional space, just like measuring distance on a map, but with hundreds of dimensions instead of two.
High dimension doesn't make Euclidean distance automatically wrong. On unit-normalized vectors, squared Euclidean distance produces exactly the same ranking as cosine and dot product through the equation above. Without that normalization, L2 distance also includes norm differences. Use it when the model, index, and evaluation are built around that geometry.
Euclidean distance ranges from to , where 0 means identical points. It also appears inside product quantization, where the original PQ formulation approximates squared L2 distances between a full-precision query and compressed database vectors.[1]
Embedding dimensionality controls how much semantic information the vector can encode. That capacity comes with direct storage, memory-bandwidth, and latency costs. Choosing dimensions is a real architecture decision, not a leaderboard footnote.
A dimension count has no universal quality meaning by itself. A 512-dimensional model trained well for your support queries can outperform a different 1,536-dimensional model. Width comparisons become defensible when you hold the model family, scoring convention, corpus, and evaluation set fixed.
| Choice under evaluation | What you can conclude | What you still must measure |
|---|---|---|
| Full trained width | Baseline payload and retrieval score for this model | Latency, RAM, and evidence recall |
| Model-supported shorter prefix | Lower raw payload for the same family | Recall loss and failure slices |
| Different smaller model | Possibly cheaper candidate system | Everything again; training changed too |
Dimension count and normalization are separate choices. A supported shorter vector still needs the serving normalization contract documented by its model or applied by your pipeline.
For a fixed representation family, more stored dimensions have deterministic systems costs:
Some embedding families are trained so selected truncated prefixes stay useful. Matryoshka Representation Learning (MRL) optimizes losses at multiple chosen representation sizes during training.[2] Hosted APIs can expose a supported shortening control too; OpenAI's embeddings guide documents a dimensions parameter for text-embedding-3 models.[3]
Matryoshka embeddings are named after nesting dolls because one representation contains selected shorter prefixes. That doesn't guarantee every arbitrary cut point, every task, or every query keeps acceptable recall. Use widths that the training setup or provider supports, then evaluate each proposed serving width on your own evidence queries.
Standard embedding training minimizes a single loss function (like contrastive loss) on the full vector. MRL minimizes a weighted sum of losses at a selected set of truncation points, written as . This nested structure encourages the trained prefix widths to form useful representations on their own.
MRL is different from post-hoc dimensionality reduction such as PCA. PCA projects already-trained embeddings into a smaller linear subspace, usually by preserving variance on a fixed dataset. MRL changes training itself so prefix slices are useful without a separate projection step. PCA can still be useful for frozen embeddings, but MRL is cleaner when the model was trained for truncation and you want one serving path with several dimension budgets.
Supported prefix shortening can reduce raw payload while keeping one model family in the serving stack. OpenAI's embeddings documentation, for example, lists default sizes of 1,536 for text-embedding-3-small and 3,072 for text-embedding-3-large, and documents shorter embeddings through the dimensions parameter.[3] Those documented capabilities still don't replace evaluation on your retrieval corpus.
If you ask a provider for a shorter embedding through the API, follow that provider's normalization contract. Manual slicing is different. If you cut a prefix from an already generated vector, normalize that sliced prefix before indexing. The code below demonstrates that contract without any external API call:
1import numpy as np
2
3def normalize_l2(vector: np.ndarray) -> np.ndarray:
4 norm = np.linalg.norm(vector)
5 if norm == 0:
6 return vector
7 return vector / norm
8
9full_embedding = np.array(
10 [0.21, -0.05, 0.44, 0.31, 0.18, -0.27, 0.12, 0.09],
11 dtype=np.float32,
12)
13
14short_embedding = normalize_l2(full_embedding[:4])
15
16print(f"dims={short_embedding.shape[0]}, norm={np.linalg.norm(short_embedding):.3f}")1dims=4, norm=1.000Normalization is necessary for a unit-vector scoring contract, but it can't restore information removed by truncation. This deliberately small fixture makes that distinction concrete: the last two dimensions separate an approved policy from a private seller note, so the shortened representation loses one correct result.
1import numpy as np
2
3labels = np.array(["private_seller_note", "approved_return_policy", "carrier_delay_policy"])
4documents = np.array(
5 [
6 [1.0, 0.0, 1.0, 0.0],
7 [1.0, 0.0, 0.0, 1.0],
8 [0.0, 1.0, 0.0, 1.0],
9 ],
10 dtype=np.float32,
11)
12queries = np.array(
13 [
14 [1.0, 0.0, 0.0, 1.0],
15 [0.0, 1.0, 0.0, 1.0],
16 ],
17 dtype=np.float32,
18)
19expected = np.array(["approved_return_policy", "carrier_delay_policy"])
20
21def top1_at_width(width: int) -> np.ndarray:
22 docs = documents[:, :width]
23 asks = queries[:, :width]
24 docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
25 asks = asks / np.linalg.norm(asks, axis=1, keepdims=True)
26 return labels[np.argmax(asks @ docs.T, axis=1)]
27
28full = top1_at_width(4)
29short = top1_at_width(2)
30print(f"full top-1 recall={np.mean(full == expected):.1%}, results={full.tolist()}")
31print(f"short top-1 recall={np.mean(short == expected):.1%}, results={short.tolist()}")1full top-1 recall=100.0%, results=['approved_return_policy', 'carrier_delay_policy']
2short top-1 recall=50.0%, results=['private_seller_note', 'carrier_delay_policy']This isn't a benchmark for a real MRL model. It's the release test shape you need: evaluate supported widths on approved-query fixtures, including near-duplicate but unauthorized passages.
Work through this step by step. The same arithmetic appears whenever you size a vector database.
Scenario: You have 1 million embeddings from a hosted API, each with 1,536 dimensions, stored as 32-bit floats (FP32). How much RAM do you need for the raw vector payload, and how can you reduce it?
Each dimension is a float32, which takes 4 bytes.
That's just the raw numbers. Real indexes add overhead for graph links, metadata, and replicas, but the vector payload alone is already over 6 GB.
If you map each float32 to a single byte (INT8), you cut storage by roughly 4x:
Suppose you split each 1,536-dimensional vector into 96 subvectors of 16 dimensions each, and you represent each subvector with an 8-bit centroid ID. The compressed vector is just 96 bytes:
That's a 64x compression from the original FP32 payload. The trade-off is that you're now comparing approximate codes instead of exact floats, so you typically pair PQ with a higher-precision reranking stage.
The byte math is easy to turn into a sizing helper. Notice the difference between decimal gigabytes, which vendors often use in pricing copy, and binary gibibytes, which operating systems usually report:
1def vector_payload_bytes(num_vectors: int, dimensions: int, bytes_per_dim: float) -> float:
2 return num_vectors * dimensions * bytes_per_dim
3
4raw_bytes = vector_payload_bytes(1_000_000, 1_536, 4)
5int8_bytes = vector_payload_bytes(1_000_000, 1_536, 1)
6pq_bytes = vector_payload_bytes(1_000_000, 96, 1)
7
8print(f"FP32={raw_bytes / 1e9:.2f} GB ({raw_bytes / 1024**3:.2f} GiB)")
9print(f"INT8={int8_bytes / 1e9:.2f} GB")
10print(f"PQ={pq_bytes / 1024**2:.1f} MiB")1FP32=6.14 GB (5.72 GiB)
2INT8=1.54 GB
3PQ=91.6 MiBRe-embedding cost scales with total input tokens, not document count alone. Compute it as total_tokens / 1_000_000 * embedding_price_per_1M_tokens, using the current price for the model you have selected. Keep model version, dimensions, normalization, and quantization settings alongside every index so you can compare and migrate without mixing incompatible vectors.
At scale, storing and searching raw float32 vectors may exceed a practical memory or latency budget. Every exact scan must read values; an approximate index reduces candidates, while quantization reduces bytes per stored candidate. Both are engineering choices that can lower recall, so each needs an evaluation gate.
Scalar quantization (SQ) replaces each float32 (4 bytes) with a lower-precision representation, typically mapping each dimension independently. The simplest signed int8 version is symmetric around zero:
Here, s is the scale. Each dimension goes from 4 bytes (float32) to 1 byte (int8), so raw storage drops by about 4x while preserving an approximate reconstruction path back to float space. In practice, some libraries use this symmetric signed form, while others use affine quantization with a zero-point, often over uint8.
The function below implements symmetric scalar quantization. It takes an array of float32 vectors and returns the compressed int8 vectors along with the per-dimension scales needed to approximately reconstruct the original floats later:
1import numpy as np
2
3def symmetric_quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
4 """
5 Per-dimension symmetric quantization from float32 to int8.
6 Returns quantized vectors and scale.
7 """
8 qmax = 127
9
10 max_abs = np.maximum(np.max(np.abs(vectors), axis=0), 1e-8)
11 scale = max_abs / qmax
12
13 quantized = np.clip(
14 np.round(vectors / scale),
15 -qmax,
16 qmax,
17 ).astype(np.int8)
18
19 return quantized, scale
20
21vectors = np.array(
22 [
23 [0.20, -1.00, 2.50],
24 [0.24, -0.80, 2.10],
25 [0.50, -1.40, 2.30],
26 ],
27 dtype=np.float32,
28)
29
30quantized, scale = symmetric_quantize_int8(vectors)
31reconstructed = scale * quantized.astype(np.float32)
32max_error = float(np.max(np.abs(vectors - reconstructed)))
33
34print(quantized.dtype, quantized.shape, f"max_error={max_error:.4f}")1int8 (3, 3) max_error=0.0063INT8 cuts the raw vector payload from four bytes per dimension to one. It doesn't promise any fixed Recall@K. Model distribution, calibration sample, scoring rule, shortlist depth, and reranking all change the observed loss, so treat INT8 as the first candidate to benchmark rather than an automatic release decision.
The biggest practical mistake in scalar quantization is deriving the scale from the absolute minimum and maximum of a calibration sample without checking outliers. If one dimension spikes far outside the usual range, the scale stretches and ordinary values collapse into fewer integer buckets. Robust clipping can preserve ordinary resolution while saturating outliers, but its percentile is another measured choice rather than a universal constant.
The next executable example quantizes ordinary values plus one spike. It compares a scale set by the spike with a clipped scale set from the ordinary calibration range:
1import numpy as np
2
3ordinary = np.linspace(-1.0, 1.0, 101, dtype=np.float32)
4with_outlier = np.concatenate([ordinary, np.array([25.0], dtype=np.float32)])
5
6def quantize_reconstruct(values: np.ndarray, bound: float) -> np.ndarray:
7 q = np.clip(np.round(values / (bound / 127)), -127, 127).astype(np.int8)
8 return q.astype(np.float32) * (bound / 127)
9
10bound_from_max = float(np.max(np.abs(with_outlier)))
11bound_from_calibration = 1.0
12max_scaled = quantize_reconstruct(ordinary, bound_from_max)
13clipped_scaled = quantize_reconstruct(ordinary, bound_from_calibration)
14
15max_error = float(np.mean(np.abs(ordinary - max_scaled)))
16clipped_error = float(np.mean(np.abs(ordinary - clipped_scaled)))
17print(f"outlier-bound ordinary MAE={max_error:.4f}")
18print(f"clipped-bound ordinary MAE={clipped_error:.4f}")1outlier-bound ordinary MAE=0.0483
2clipped-bound ordinary MAE=0.0019Lower reconstruction error on ordinary values is encouraging, but retrieval is the real target. A release check must measure whether relevant approved passages stay in the shortlist after calibration and compression.
Introduced by Jégou et al.[1], Product Quantization (PQ) splits each vector into subvectors and quantizes each independently using a learned codebook.
Example with on an 8-dimensional vector:
This example shows a single 8-dimensional vector split into 4 subvectors of 2 dimensions each. Each subvector gets mapped to the ID of its nearest centroid in that subspace's codebook:
Each subvector is now stored as a centroid index instead of raw floats. Storage drops from 32 bytes to 4 bytes because we keep IDs; at query time, lookup tables score those IDs against the full-precision query.
With 256 centroids per subspace (8 bits), the stored code can be much smaller than the original vector. For example, compressing 768d float32 vectors (3,072 bytes) using PQ with 96 subspaces produces a 96-byte per-vector code, a 32x payload compression. Shared codebooks and index metadata add separate overhead.
At query time, you usually keep the query vector in full precision and compare each query subvector against the centroids in the matching codebook[1]. That gives a small lookup table per subspace:
The compressed database vector stores only centroid IDs , so scoring a candidate becomes table lookups and additions. That's why PQ can search compressed vectors without fully decompressing them. This squared-L2 form is the original PQ/ADC setup; a system serving inner-product search must use an implementation and evaluation contract that supports its chosen score.
In a real index, training data learns each codebook. Here we write down two tiny codebooks so you can see asymmetric distance computation (ADC) itself: the full-precision query creates a lookup table, and stored IDs select four distances to add.
1import numpy as np
2
3# Four subspaces, two centroids in each; production PQ learns many centroids.
4codebooks = np.array(
5 [
6 [[0.0, 0.0], [1.2, 3.4]],
7 [[0.0, 0.0], [0.8, 2.1]],
8 [[0.0, 0.0], [5.3, 1.7]],
9 [[0.0, 0.0], [0.4, 3.2]],
10 ],
11 dtype=np.float32,
12)
13query = np.array([1.1, 3.3, 0.7, 2.0, 5.2, 1.8, 0.3, 3.1], dtype=np.float32)
14stored_codes = {
15 "approved_return_policy": np.array([1, 1, 1, 1]),
16 "carrier_delay_policy": np.array([0, 0, 0, 0]),
17}
18
19query_chunks = query.reshape(4, 2)
20lookup = np.sum((codebooks - query_chunks[:, None, :]) ** 2, axis=2)
21scores = {
22 name: float(lookup[np.arange(4), code].sum())
23 for name, code in stored_codes.items()
24}
25winner = min(scores, key=scores.get)
26
27print({name: round(score, 2) for name, score in scores.items()})
28print(f"ADC nearest code: {winner}")1{'approved_return_policy': 0.08, 'carrier_delay_policy': 56.57}
2ADC nearest code: approved_return_policyHigher compression usually means lower recall. The exact loss depends on the number of subspaces, centroid count, and reranking budget, so benchmark PQ with your real query distribution instead of relying on a single headline number.
Of the three payload reductions in this chapter, Binary Quantization (BQ) is the most aggressive: it reduces each dimension to a single bit. This simple version thresholds at zero: positive numbers become 1, and zero or negative numbers become 0. It discards magnitude and granular precision, so a binary shortlist is a candidate to evaluate, not a final ranking guarantee.
The code snippet below demonstrates how simple this transformation is in practice. It takes one raw float32 vector and returns packed bytes containing one sign bit per dimension:
1import numpy as np
2
3def binary_quantize(vector: np.ndarray) -> np.ndarray:
4 """
5 Pack one sign bit per dimension into bytes.
6 Positive values become 1, zero/negative become 0.
7 """
8 bits = (vector > 0).astype(np.uint8)
9 return np.packbits(bits, bitorder="little")
10
11def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
12 differing_bits = np.bitwise_xor(packed_a, packed_b)
13 return int(np.unpackbits(differing_bits, bitorder="little").sum())
14
15ticket_a = np.array([0.3, -0.1, 0.8, -0.4], dtype=np.float32)
16ticket_b = np.array([0.5, -0.2, 0.7, 0.1], dtype=np.float32)
17
18packed_a = binary_quantize(ticket_a)
19packed_b = binary_quantize(ticket_b)
20distance = hamming_distance(packed_a, packed_b)
21
22print(f"packed_a={packed_a[0]}, packed_b={packed_b[0]}, hamming={distance}")1packed_a=5, packed_b=13, hamming=1Note that (vector > 0).astype(np.uint8) is still one byte per dimension. You only get true 1-bit storage once you pack those bits.
Take a 4-dimensional embedding after L2 normalization: [0.3, -0.1, 0.8, -0.4]. Binary quantization thresholds at zero:
[1, 0, 1, 0]5 (0b00000101).Now compare against another binary vector [1, 0, 1, 1] (from [0.5, -0.2, 0.7, 0.1]):
A Hamming distance of 1 out of 4 bits means the vectors are close in this binary approximation. In production, you can use this fast filter to pull a shortlist, then rescore survivors with INT8 or FP32 scores. Because sign bits discard magnitude, binary quantization always needs validation on your model and query distribution. This zero-threshold version works best when coordinates are centered around zero; other distributions need a different calibration or encoding choice.
In this packed-bit design, candidate scoring starts with Hamming distance (XOR + popcount), which is fast on modern CPUs because it relies on bitwise operations. It's only an approximation to angular similarity, though, so production systems often rescore the shortlist with a higher-precision representation.
| Dimension | Practical effect |
|---|---|
| Compression | 32x raw payload reduction when moving from float32 to one bit per dimension. |
| Speed | Packed bitsets use XOR plus popcount, so both memory traffic and arithmetic drop sharply. Implementations can add architecture-specific SIMD acceleration for packed-byte comparisons. |
| Quality | Recall can drop noticeably, so BQ usually fits best as a first-stage filter. |
Production systems often combine quantization with reranking to balance speed and accuracy.
Two-stage retrieval works like triaging a warehouse exception queue. Stage 1 (binary quantization) scans 10,000 tickets for rough matches, which is fast but approximate. Stage 2 (full-precision reranking) gives the top 50 tickets a detailed comparison. You wouldn't deeply inspect all 10,000 because that's too slow, and you wouldn't resolve cases from rough matches alone because that's too inaccurate. The funnel gives you both speed and quality.
One cost-sensitive pattern stretches this into three stages: a binary Hamming search for the cheap shortlist, an INT8 dot-product rescore of that shortlist, and an optional cross-encoder reranker[4] on the final handful when answer quality justifies the extra latency. Each stage looks at fewer candidates with more precision than the one before it.
The production-agent boundary from the previous lesson still applies after compression. This miniature release gate checks that binary retrieval keeps the approved policy in its shortlist, then applies precise scoring only to approved candidates:
1import numpy as np
2
3labels = np.array(["private_seller_note", "approved_return_policy", "carrier_delay_policy"])
4approved = np.array([False, True, True])
5documents = np.array(
6 [
7 [0.05, -2.00, 0.05, -2.00], # Same signs, wrong private evidence.
8 [0.78, -0.38, 0.58, -0.22], # Approved policy, close in full precision.
9 [0.10, 0.60, -0.40, -0.20],
10 ],
11 dtype=np.float32,
12)
13query = np.array([0.80, -0.40, 0.60, -0.20], dtype=np.float32)
14expected = "approved_return_policy"
15
16def sign_bits(rows: np.ndarray) -> np.ndarray:
17 return rows > 0
18
19def normalize(rows: np.ndarray) -> np.ndarray:
20 rows_2d = np.atleast_2d(rows)
21 unit = rows_2d / np.linalg.norm(rows_2d, axis=1, keepdims=True)
22 return unit[0] if rows.ndim == 1 else unit
23
24hamming = np.count_nonzero(sign_bits(documents) != sign_bits(query), axis=1)
25shortlist = np.argsort(hamming, kind="stable")[:2]
26allowed_shortlist = shortlist[approved[shortlist]]
27precise_scores = normalize(documents[allowed_shortlist]) @ normalize(query)
28served = labels[allowed_shortlist[np.argmax(precise_scores)]]
29
30recall_at_2 = expected in labels[shortlist]
31print(f"binary shortlist={labels[shortlist].tolist()}, recall@2={recall_at_2}")
32print(f"served evidence={served}, authorized={bool(approved[labels == served][0])}")
33assert recall_at_2 and served == expected1binary shortlist=['private_seller_note', 'approved_return_policy'], recall@2=True
2served evidence=approved_return_policy, authorized=TrueOn real data, run that same gate over many labeled queries and failure slices. A compressed first pass is acceptable only if its shortlist recall remains high enough for precise, authorized evidence selection to succeed.
One production vector-search pattern stacks quantization for smaller payloads, an ANN index for lower search work, and reranking for a more precise final comparison. Exact composition is library-specific: some implementations combine HNSW with scalar or product quantization, while others use a different ANN structure or retain full-precision vectors alongside compressed codes for rescoring.
Many vector databases offer HNSW (Hierarchical Navigable Small World) indices[5] as an ANN structure. HNSW builds a multi-layered proximity graph: upper layers contain progressively fewer sampled nodes, while the base layer contains the indexed collection. Search starts in an upper layer, follows promising graph links, and descends toward a candidate neighborhood at the base.
When configuring an HNSW implementation, three controls commonly expose its quality/cost tradeoff. Exact names and allowed values depend on the library:
| Control | Changed during | Usually buys | Usually costs |
|---|---|---|---|
| M (graph connectivity) | Index build | More routes to true neighbors | More graph memory and build work |
| ef_construction | Index build | A better-connected graph | Slower index construction |
| ef_search | Query serving | More candidate exploration | Higher query latency |
Measure recall and latency first while changing query-time exploration. If the graph can't reach the required recall within the latency budget, rebuild candidates with adjusted construction settings. No copied parameter triple proves quality for a new corpus.
Public benchmarks such as MTEB evaluate embedding models across several task families.[6] They help you assemble candidates, but they can't tell you which index retrieves your approved policy passages under your storage and latency limits.
| Decision | Establish a baseline | Accept a change only when |
|---|---|---|
| Model family | Full-precision Recall@K on approved evidence queries | Target slices and latency remain acceptable |
| Supported prefix width | Full trained width with identical scoring | Measured recall loss fits your release threshold |
| SQ, PQ, or BQ | Uncompressed shortlist and final ranking | Compressed shortlist preserves required evidence recall |
| ANN settings | Exact-search results for a labeled sample | Recall and latency both fit the serving budget |
Keep a versioned experiment row for model, width, normalization, quantizer, index settings, shortlist size, and reranker. That row is what makes a compressed-index rollout explainable when a return-policy answer later changes.
As your vector corpus grows from millions to billions of documents, the cost structure of your retrieval system shifts sharply. At small scale, storing raw float32 vectors in RAM is feasible. At billion scale, it can become the dominant infrastructure cost because low-latency approximate nearest neighbor search keeps hot index data in memory.
The byte math is deterministic even when cloud pricing isn't. The table below shows just the vector payload, rounded in binary units, before HNSW links, metadata, replicas, or region-specific RAM pricing:
| Scale | Raw Payload (1024d, FP32) | With INT8 SQ | With PQ (96-byte code) |
|---|---|---|---|
| 1M vectors | 3.8 GiB | 0.95 GiB | 92 MiB |
| 10M vectors | 38 GiB | 9.5 GiB | 916 MiB |
| 100M vectors | 380 GiB | 95 GiB | 8.9 GiB |
| 1B vectors | 3.7 TiB | 950 GiB | 89 GiB |
You have 50 million approved policy and support-resolution embeddings at 768 dimensions. A private seller-note collection must never be served as answer evidence. P99 latency budget is 60 ms, and raw FP32 vectors would crowd out the rest of the service's RAM.
Solution sketch:
Cosine divides the dot product by the product of the vector norms. If both vectors have norm 1, that denominator is 1, so cosine similarity equals the dot product. That lets a retrieval system normalize once, then use fast inner-product scoring without recomputing norms.
MRL is a training-time technique that teaches one embedding to work at several prefix lengths. PCA is a post-training projection applied to frozen vectors. MRL is useful when the model was trained for prefix truncation; PCA is a retrofit when you already have embeddings and need a smaller projected representation.
Scalar quantization usually gives about 4x raw payload compression by moving from float32 to int8. Product Quantization can compress much more aggressively by storing centroid IDs for subspaces, which can be the difference between a huge distributed RAM footprint and a smaller compressed index. The tradeoff is approximation error, so PQ usually needs rescoring.
Binary Quantization stores one sign bit per dimension, packs bits into bytes, and scores candidates with XOR plus popcount. That cuts memory traffic and replaces floating-point multiply-adds with cheap bit operations, but the score is coarse and should usually feed a reranker.
Only when your own query set shows that smaller prefixes or lower-dimensional models miss too many relevant items even after reranking. Dimension count is a measured tradeoff, not a prestige number.
ef_search too high. Fix: tune shortlist size, HNSW parameters, and rerank depth together.Product Quantization for Nearest Neighbor Search.
Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011
Matryoshka Representation Learning.
Kusupati, A., et al. · 2022 · NeurIPS 2022
Vector embeddings
OpenAI · 2024
Rerank API.
Cohere. · 2026 · Official documentation
Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.
Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence
MTEB: Massive Text Embedding Benchmark.
Muennighoff, N., et al. · 2023 · EACL 2023