Learn how approximate nearest neighbor indexes use HNSW, IVF, and Product Quantization to balance speed, recall, and memory in production vector databases.
Vector database internals explain why similarity search can be fast without scanning every embedding. This chapter covers Hierarchical Navigable Small World (HNSW), inverted file (IVF) indexing, recall-latency tradeoffs, and the operational knobs behind nearest-neighbor retrieval.
Recursive Language Models showed how to organize compute over huge external context. This chapter starts the retrieval section from the other side: before an agent or RAG pipeline can reason over documents, the system has to find the candidate chunks quickly enough to fit an interactive product.
Imagine finding a specific returns-support ticket in an archive with millions of cases, but instead of an exact ticket ID, you only have a vague description of the customer problem. If you had to check every case one by one, the search would take too long for an interactive product. This is the challenge of similarity search in modern AI systems.
When you ask an AI model a question, it converts your query into a vector (a long list of numbers). To find relevant answers, for example in a RAG (Retrieval-Augmented Generation) system where an LLM grounds its responses in your documents, the database must search through millions or billions of other vectors to find the ones most similar to yours.
This article builds on the embedding idea you learned earlier: text gets turned into vectors so a machine can compare meaning. Here, we'll look at how a production database actually finds the closest vectors when there are millions of them. By the end, you'll know how to choose between the two main index families, tune their knobs, and spot the mistakes that make vector search slow or inaccurate in production.
The simplest method, called Exact Search (or k-NN for k-Nearest Neighbors), calculates the distance from your query vector to every vector in the database.
Mathematically, given a query and a database of vectors each with dimensions, exact search has complexity. That notation means the work grows in direct proportion to both the number of vectors and the number of dimensions.
Let's make that concrete. Suppose you have 100 million product-review embeddings, each with 1,536 dimensions, stored as 4-byte floats. A flat query scores all 100 million vectors, or 153.6 billion scalar components before considering vectorization, batching, memory bandwidth, or accelerator hardware. Whether that fits an SLA is a benchmark question, not something dimensionality alone can answer.
The memory picture is just as stark:
That's raw vector storage alone, before any index overhead (the lab below reports the same quantity in GiB, the binary units RAM is usually measured in). A single ordinary server won't keep that raw collection in RAM. Approximate Nearest Neighbor (ANN) indexes reduce scored candidates by accepting that some true neighbors may be missed.
1def gib_for_float32_vectors(count: int, dimensions: int) -> float:
2 return count * dimensions * 4 / (1024 ** 3)
3
4def scalar_components_scored(count: int, dimensions: int) -> int:
5 return count * dimensions
6
7n_vectors = 100_000_000
8dimensions = 1_536
9
10print(f"raw float32 storage: {gib_for_float32_vectors(n_vectors, dimensions):.1f} GiB")
11print(f"scalar components scored per flat query: {scalar_components_scored(n_vectors, dimensions):,}")1raw float32 storage: 572.2 GiB
2scalar components scored per flat query: 153,600,000,000For large, weakly filtered collections, ANN is usually the practical serving path. Exact flat search still matters: it's the ground truth for measuring recall, and it can be the right production execution path when a metadata filter leaves only a small candidate set. For example, pgvector performs exact nearest-neighbor search by default and documents exact search as useful for selective filter conditions.[1]
The main quality metric is Recall@k: of the true top-k nearest neighbors, how many did your ANN index recover? Latency numbers without recall are misleading because an index can look fast simply by missing good candidates. A system that returns random vectors in 1 ms has terrible recall. A system that takes 20 ms and finds 99 of the true top 100 has excellent recall.
1def recall_at_k(exact_ids: list[int], approximate_ids: list[int]) -> float:
2 expected = set(exact_ids)
3 return len(expected.intersection(approximate_ids)) / len(expected)
4
5exact_top_5 = [10, 21, 34, 55, 89]
6fast_but_weak = [10, 21, 8, 13, 5]
7slower_but_better = [10, 21, 34, 55, 3]
8
9print(f"fast candidate recall@5: {recall_at_k(exact_top_5, fast_but_weak):.0%}")
10print(f"better candidate recall@5: {recall_at_k(exact_top_5, slower_but_better):.0%}")1fast candidate recall@5: 40%
2better candidate recall@5: 80%Key insight: Latency without recall is incomplete. Keep an exact-search evaluation set, measure ANN recall against it, and choose a latency-recall operating point that satisfies the product requirement.
HNSW is a widely used in-memory ANN index because it can provide a strong recall-latency trade-off without a separate centroid-training phase.[2] Whether it beats another index on your data still needs measurement under the same memory and recall constraints.
HNSW is based on a multi-layer graph structure that mimics how people find things in the real world. Think of zooming in on a map: you start at the continent level, narrow down to a country, then a city, then a street. Each layer has more detail than the one above it.
This hierarchical structure is analogous to a skip list (a data structure that uses multiple layers of linked lists to allow fast search by skipping over intermediate nodes), but it's generalized for graphs.
In the figure, Layer 2 has only a few nodes with long-range links, Layer 1 holds more nodes for regional refinement, and Layer 0 contains every node in a dense local graph. The parameters M, ef_search, and Layer 0 density (often 2M) control the recall-latency trade-off.
To see how the layers work in practice, imagine you are building an HNSW index with just 8 vectors. The graph already has 7 vectors arranged across three layers. You now insert vector 8.
ef_construction candidates, pick the best neighbors (up to M), and add bidirectional edges. You also check if any existing neighbor now has too many edges and prune if needed.2M neighbors because the base layer is denser.This incremental construction is why HNSW doesn't need a training phase. Every new vector is wired into the existing graph using local search.
The code below is simplified educational pseudocode. The key steps are: sample a level, zoom in through upper layers, connect on each layer, and update the entry point if the new node is taller than the current graph.
1import math
2import random
3from typing import Protocol, Sequence
4
5class HNSWGraph(Protocol):
6 entry_point: int | None
7 max_level: int
8
9 def neighbors(self, node_id: int, level: int) -> Sequence[int]: ...
10
11 def distance(self, query_vector: Sequence[float], node_id: int) -> float: ...
12
13 def add_node(
14 self, node_id: int, vector: Sequence[float], max_level: int
15 ) -> None: ...
16
17 def add_bidirectional_edge(self, left_id: int, right_id: int, level: int) -> None: ...
18
19 def prune_if_needed(self, node_id: int, level: int, max_degree: int) -> None: ...
20
21def greedy_search(graph, query_vector, entry_id, level) -> int:
22 """Walk this layer until no neighbor is closer to query_vector."""
23 current = entry_id
24 improved = True
25 while improved:
26 improved = False
27 for neighbor in graph.neighbors(current, level):
28 if graph.distance(query_vector, neighbor) < graph.distance(query_vector, current):
29 current = neighbor
30 improved = True
31 return current
32
33def search_layer(
34 graph: HNSWGraph,
35 query_vector: Sequence[float],
36 entry_id: int,
37 level: int,
38 ef: int,
39) -> list[int]:
40 """Best-first search that returns up to ef candidate ids in one layer."""
41 import heapq
42 visited = {entry_id}
43 entry_dist = graph.distance(query_vector, entry_id)
44 candidates = [(entry_dist, entry_id)] # min-heap by distance
45 results = [(-entry_dist, entry_id)] # max-heap of the best ef via negative distance
46
47 while candidates:
48 dist, current = heapq.heappop(candidates)
49 worst_result_dist = -results[0][0]
50 if len(results) >= ef and dist > worst_result_dist:
51 break
52 for neighbor in graph.neighbors(current, level):
53 if neighbor not in visited:
54 visited.add(neighbor)
55 d = graph.distance(query_vector, neighbor)
56 if len(results) < ef or d < -results[0][0]:
57 heapq.heappush(candidates, (d, neighbor))
58 heapq.heappush(results, (-d, neighbor))
59 if len(results) > ef:
60 heapq.heappop(results)
61 return [
62 node
63 for distance, node in sorted((-neg_distance, node) for neg_distance, node in results)
64 ]
65
66def select_neighbors_placeholder(
67 graph, query_vector, candidate_ids, level, max_degree
68) -> list[int]:
69 """Teaching simplification: take nearest candidates by distance."""
70 # Production HNSW applies its diversity heuristic here to improve connectivity.
71 candidates = sorted(candidate_ids,
72 key=lambda nid: graph.distance(query_vector, nid))
73 return candidates[:max_degree]
74
75def sample_level(mL: float) -> int:
76 u = max(random.random(), 1e-12)
77 return math.floor(-math.log(u) * mL)
78
79def hnsw_insert(
80 graph: HNSWGraph,
81 new_id: int,
82 new_vector: Sequence[float],
83 M: int = 16,
84 ef_construction: int = 200,
85 mL: float = 0.36,
86) -> None:
87 """
88 Insert one vector into an HNSW index.
89
90 Typical implementations allow up to 2*M neighbors on layer 0
91 and up to M neighbors on upper layers.
92 """
93 new_level = sample_level(mL)
94
95 if graph.entry_point is None:
96 graph.add_node(new_id, new_vector, max_level=new_level)
97 graph.entry_point = new_id
98 graph.max_level = new_level
99 return
100
101 graph.add_node(new_id, new_vector, max_level=new_level)
102 entry_id = graph.entry_point
103
104 # Phase 1: greedy descent through layers above the new node's top layer.
105 for level in range(graph.max_level, new_level, -1):
106 entry_id = greedy_search(graph, new_vector, entry_id, level)
107
108 # Phase 2: connect the new node from its top layer down to layer 0.
109 for level in range(min(new_level, graph.max_level), -1, -1):
110 candidate_ids = search_layer(
111 graph, new_vector, entry_id, level, ef=ef_construction
112 )
113 max_degree = 2 * M if level == 0 else M
114 neighbors = select_neighbors_placeholder(
115 graph, new_vector, candidate_ids, level, max_degree
116 )
117
118 for neighbor_id in neighbors:
119 graph.add_bidirectional_edge(new_id, neighbor_id, level)
120 graph.prune_if_needed(neighbor_id, level, max_degree)
121
122 # Use the closest candidate as the entry point for the next lower layer.
123 entry_id = min(candidate_ids,
124 key=lambda nid: graph.distance(new_vector, nid))
125
126 if new_level > graph.max_level:
127 graph.entry_point = new_id
128 graph.max_level = new_levelSearch mirrors insertion in reverse. The algorithm starts at the sparse top layers for fast coarse routing, then runs a bounded best-first search at layer 0. That last step isn't exhaustive over the full dataset. It only explores a candidate set of size ef_search, which is why raising ef_search improves recall at the cost of more work. On good graphs, search often behaves close to logarithmic, but HNSW doesn't provide a hard worst-case logarithmic guarantee.
1import heapq
2from dataclasses import dataclass
3from typing import Sequence
4
5@dataclass
6class ToyHNSWGraph:
7 vectors: dict[int, tuple[float, ...]]
8 edges: dict[tuple[int, int], list[int]]
9 entry_point: int
10 max_level: int
11
12 def neighbors(self, node_id: int, level: int) -> Sequence[int]:
13 return self.edges.get((node_id, level), [])
14
15 def distance(self, query_vector: Sequence[float], node_id: int) -> float:
16 vector = self.vectors[node_id]
17 return sum((left - right) ** 2 for left, right in zip(query_vector, vector))
18
19def greedy_search(
20 graph: ToyHNSWGraph, query_vector: Sequence[float], entry_id: int, level: int
21) -> int:
22 current = entry_id
23 improved = True
24 while improved:
25 improved = False
26 for neighbor in graph.neighbors(current, level):
27 if graph.distance(query_vector, neighbor) < graph.distance(query_vector, current):
28 current = neighbor
29 improved = True
30 return current
31
32def search_layer(
33 graph: ToyHNSWGraph,
34 query_vector: Sequence[float],
35 entry_id: int,
36 level: int,
37 ef: int,
38) -> list[int]:
39 visited = {entry_id}
40 entry_dist = graph.distance(query_vector, entry_id)
41 candidates = [(entry_dist, entry_id)]
42 results = [(-entry_dist, entry_id)]
43
44 while candidates:
45 dist, current = heapq.heappop(candidates)
46 worst_result_dist = -results[0][0]
47 if len(results) >= ef and dist > worst_result_dist:
48 break
49 for neighbor in graph.neighbors(current, level):
50 if neighbor in visited:
51 continue
52 visited.add(neighbor)
53 neighbor_dist = graph.distance(query_vector, neighbor)
54 if len(results) < ef or neighbor_dist < -results[0][0]:
55 heapq.heappush(candidates, (neighbor_dist, neighbor))
56 heapq.heappush(results, (-neighbor_dist, neighbor))
57 if len(results) > ef:
58 heapq.heappop(results)
59
60 return [
61 node
62 for distance, node in sorted((-neg_distance, node) for neg_distance, node in results)
63 ]
64
65def hnsw_search(
66 graph: ToyHNSWGraph, query: Sequence[float], k: int = 10, ef_search: int = 100
67) -> list[int]:
68 """
69 Performs approximate nearest neighbor search.
70
71 Args:
72 query: The query vector.
73 k: Number of results to return.
74 ef_search: Size of the dynamic candidate list during search.
75 """
76 if ef_search < k:
77 raise ValueError("ef_search must be at least k")
78
79 current_node = graph.entry_point
80
81 # Phase 1: greedy descent from the top layer down to layer 1.
82 for level in range(graph.max_level, 0, -1):
83 current_node = greedy_search(graph, query, current_node, level)
84
85 # Phase 2: bounded best-first search on layer 0.
86 candidates = search_layer(graph, query, current_node, level=0, ef=ef_search)
87
88 return sorted(candidates,
89 key=lambda nid: graph.distance(query, nid))[:k]
90
91graph = ToyHNSWGraph(
92 vectors={
93 0: (0.0, 0.0),
94 1: (1.0, 0.0),
95 2: (1.0, 1.0),
96 3: (5.0, 5.0),
97 },
98 edges={
99 (0, 1): [1],
100 (1, 1): [0, 2],
101 (2, 1): [1],
102 (0, 0): [1],
103 (1, 0): [0, 2],
104 (2, 0): [1, 3],
105 (3, 0): [2],
106 },
107 entry_point=0,
108 max_level=1,
109)
110
111nearest_ids = hnsw_search(graph, query=(1.05, 1.0), k=2, ef_search=3)
112print(f"nearest ids: {nearest_ids}")1nearest ids: [2, 1]If you set ef_search = 5 but ask for k = 10, most implementations enforce ef_search >= k because you need at least k candidates to return k results. The real risk is subtler: if ef_search is only slightly larger than k, you might miss better neighbors outside the explored candidate set. That's the recall-latency trade-off in action.
Common mistake: Setting
ef_searchfrom intuition rather than an evaluation curve. A small beam can miss true neighbors; an unnecessarily large beam spends latency without a useful recall gain. Measureef_searchvalues against exact top-kresults on representative queries.
Tuning HNSW requires balancing the quality of the graph (which impacts recall) against the memory consumption and indexing speed. The most critical parameters control the number of edges and the beam width used during the graph traversal.
| Parameter | Meaning | Typical Value | Effect |
|---|---|---|---|
M | Max neighbors per node on upper layers | 16-48 | Higher = better recall, higher memory usage and slower builds. |
M0 | Max neighbors on layer 0 (often 2*M) | 2*M | Denser base layer, better local connectivity, more memory. |
ef_construction | Candidate list size during insert | 100-400 | Higher = better graph quality, slower indexing. |
ef_search | Candidate list size during query | k to a few hundred | Higher = better recall, higher latency. Many implementations require ef_search >= k. |
mL | Controls the layer sampling distribution | Often near | Higher = more upper-layer nodes, shallower descents. |
ef_search is a query-time knob. You can trade latency for recall dynamically without rebuilding the index.
1measurements = [
2 {"ef_search": 24, "recall": 0.91, "p95_ms": 4.1},
3 {"ef_search": 48, "recall": 0.97, "p95_ms": 6.8},
4 {"ef_search": 96, "recall": 0.985, "p95_ms": 12.4},
5]
6required_recall = 0.96
7latency_budget_ms = 10.0
8
9eligible = [
10 point for point in measurements
11 if point["recall"] >= required_recall and point["p95_ms"] <= latency_budget_ms
12]
13choice = min(eligible, key=lambda point: point["p95_ms"])
14print(f"release ef_search={choice['ef_search']} recall={choice['recall']:.3f} p95_ms={choice['p95_ms']}")1release ef_search=48 recall=0.970 p95_ms=6.8While HNSW is graph-based, IVF is a partition-based algorithm. It divides the vector space into distinct regions called Voronoi cells, where every point is closer to one centroid than to any other centroid. During search, the algorithm only looks inside a small subset of those cells.
The easiest way to think about IVF is a warehouse zone map. Instead of checking every bin in the building, you find the zone whose centroid is closest to the query and only search those bins. If the item might be in a neighboring zone, you check a few extra zones to protect recall.
The process is broken down into three phases:
nlist.nprobe lists, then do exact distance calculations only inside those selected lists.
Suppose you have 1 million embeddings with 128 dimensions. Here's how the operation count breaks down for brute force versus IVF.
nlist = 1,024 and nprobe = 5That's about a 170x reduction in distance computations. The actual speedup in a real system is smaller because of overhead, but the principle is clear: IVF avoids scanning most of the database.
1def balanced_ivf_scores(total_vectors: int, nlist: int, nprobe: int) -> int:
2 return nlist + round(total_vectors / nlist * nprobe)
3
4total_vectors = 1_000_000
5nlist = 1_024
6nprobe = 5
7estimated_scores = balanced_ivf_scores(total_vectors, nlist, nprobe)
8
9print(f"flat vector scores: {total_vectors:,}")
10print(f"balanced IVF vector and centroid scores: {estimated_scores:,}")
11print(f"estimated reduction: {total_vectors / estimated_scores:.1f}x")1flat vector scores: 1,000,000
2balanced IVF vector and centroid scores: 5,907
3estimated reduction: 169.3xFaiss is a widely used implementation of IVF-style ANN indexing and documents the IndexIVFFlat behavior used below.[3] The example builds an IVF index, trains its coarse quantizer, assigns vectors to cells, and searches the selected lists.
1import faiss
2import numpy as np
3
4def build_ivf_index(
5 vectors: np.ndarray, nlist: int = 1024, nprobe: int = 10
6) -> faiss.IndexIVFFlat:
7 """
8 Builds an IVF index using Faiss.
9
10 Args:
11 vectors: (N, d) float32 array of data.
12 nlist: Number of coarse centroids / inverted lists.
13 nprobe: Number of inverted lists to visit during search.
14 """
15 vectors = np.ascontiguousarray(vectors, dtype=np.float32)
16 if vectors.ndim != 2:
17 raise ValueError("vectors must be a 2D array")
18 if vectors.shape[0] < nlist:
19 raise ValueError("nlist cannot exceed the number of training vectors")
20
21 d = vectors.shape[1]
22
23 # 1. Quantizer: index used to assign vectors to centroids.
24 # IndexFlatL2 uses brute-force L2 distance.
25 quantizer = faiss.IndexFlatL2(d)
26
27 # 2. IVF index.
28 index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
29
30 # 3. Train: run k-means to learn centroids from representative data.
31 index.train(vectors)
32
33 # 4. Add: assign all vectors to their nearest centroid.
34 index.add(vectors)
35
36 # 5. Configure search scope.
37 index.nprobe = min(nprobe, nlist)
38
39 return index
40
41left_cluster = np.column_stack((np.linspace(0.0, 0.39, 40), np.zeros(40)))
42right_cluster = np.column_stack((10.0 + np.linspace(0.0, 0.39, 40), np.full(40, 10.0)))
43vectors = np.vstack((left_cluster, right_cluster)).astype(np.float32)
44
45index = build_ivf_index(vectors, nlist=2, nprobe=2)
46distances, ids = index.search(np.array([[0.02, 0.0]], dtype=np.float32), k=3)
47
48print(f"trained: {index.is_trained}")
49print(f"stored vectors: {index.ntotal}")
50print(f"nearest ids: {ids[0].tolist()}")1trained: True
2stored vectors: 80
3nearest ids: [2, 1, 3]Recall vs nprobe: The nprobe parameter is the most important knob for tuning an IVF index in production. It controls how many inverted lists are visited during search. If the inverted lists were perfectly balanced, the scanned fraction would be roughly nprobe / nlist. Real systems deviate from that because list sizes are uneven, but the intuition is still useful: more probes mean higher recall and more work.
nprobe regime | What gets scanned | Practical effect |
|---|---|---|
Very small (1 to a few lists) | Only the closest coarse cells | Lowest latency, highest miss rate. |
| Moderate | A small fraction of the database | Usually the best operating point. |
Large (nprobe close to nlist) | Most or all inverted lists | IndexIVFFlat approaches exhaustive search and often loses its speed advantage. |
The choice of distance metric affects both training and search. Faiss supports L2 and inner product directly across its main ANN indexes, and cosine similarity is usually implemented by normalizing vectors and then using inner product search.
L2 distance is the standard Euclidean objective. It's a good fit when vector magnitude carries signal or when your model was trained with an L2-style loss.
Inner product (IP) is the right objective for maximum inner product search. Query norm doesn't affect ranking, but database vector norms do, so large-norm items can dominate if that's how your embeddings behave.
Cosine similarity ignores magnitude and keeps only direction. In practice, you normalize both database vectors and query vectors, then use inner product search. On normalized vectors, L2 and inner product give equivalent rankings because .
| Metric | What it measures | Best when | IVF retrain needed? |
|---|---|---|---|
| L2 (Euclidean) | Straight-line distance in space | Magnitude carries signal | Yes, if switching from another metric |
| Inner product | You want maximum dot-product ranking | Yes, if switching from another metric | |
| Cosine | Direction only, no magnitude | You only care about semantic similarity | Yes, on normalized vectors |
Metric choice changes the geometry seen by the coarse quantizer. If you switch an IVF system from raw L2 search to cosine search, retrain the centroids on normalized vectors rather than reusing the old clustering.
1import numpy as np
2
3vectors = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]], dtype=np.float32)
4query = np.array([1.0, 0.8], dtype=np.float32)
5
6normalized_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
7normalized_query = query / np.linalg.norm(query)
8inner_product_scores = normalized_vectors @ normalized_query
9l2_scores = np.sum((normalized_vectors - normalized_query) ** 2, axis=1)
10
11print(f"inner-product rank: {np.argsort(-inner_product_scores).tolist()}")
12print(f"L2 rank after normalization: {np.argsort(l2_scores).tolist()}")1inner-product rank: [1, 0, 2]
2L2 rank after normalization: [1, 0, 2]| Feature | HNSW | IVF-Flat |
|---|---|---|
| Category | Proximity graph | Clustering / inverted file |
| Search cost | Empirically sublinear on suitable well-built graphs; benchmark on your data | Centroid scoring plus scanning roughly nprobe / nlist of the data if lists are balanced |
| Memory | High (raw vectors + graph edges) | High (raw vectors + ids), but lower than HNSW |
| Training step | None | Required to learn centroids |
| Index updates | Inserts are natural; deletes are implementation-specific | Inserts are easy, but centroid quality can drift over time |
| Recall | Often strong at a fixed in-RAM latency budget; workload dependent | Tunable via nprobe; workload dependent |
| Build time | Slow | Faster once centroids are trained |
| Best for | Low-latency, high-recall search in RAM | Large datasets where you want a controllable scan fraction or a base for PQ |
The biggest operational difference is that IVF requires a training step on representative data before indexing. If your vector distribution drifts over time, stale coarse centroids can unbalance lists or lower recall. HNSW doesn't have that centroid-training dependency, but commonly pays with more memory and slower construction.
For billion-scale datasets, storing full float32 vectors in RAM is usually impractical on a single RAM-resident node.
This calculation shows that one billion vectors, each with 768 float32 dimensions (4 bytes each), requires about 3 TB just for raw vector storage. That's before index overhead, operating-system buffers, or any other application memory.
Product Quantization (PQ) compresses vectors by breaking them into smaller chunks and quantizing those chunks independently.[4] Combined with IVF, PQ is one established way to make billion-point memory budgets tractable; whether a deployment also needs sharding depends on code size, metadata, throughput, and availability requirements.
Product Quantization operates by dividing the high-dimensional space into several lower-dimensional subspaces and quantizing each subspace independently. This avoids the "curse of dimensionality" that makes standard K-Means clustering computationally infeasible for very large vectors.
The compression process follows three distinct steps:
In an IVF-PQ system, those PQ codes are usually built on the residual vector after subtracting the assigned coarse centroid, not on the raw vector itself.[3] Quantizing the smaller residual usually gives better accuracy for the same code size.
By using centroids, an ID requires just 1 byte of storage (since ). A 768-dim float vector (3072 bytes) compressed with becomes an array of 96 single-byte IDs, taking up exactly 96 bytes. That achieves a massive 32x compression ratio, enabling billion-scale search on reasonable hardware.
1def bytes_per_raw_vector(dimensions: int) -> int:
2 return dimensions * 4
3
4def bytes_per_pq_code(chunks: int, bits_per_chunk: int = 8) -> int:
5 return chunks * bits_per_chunk // 8
6
7raw_bytes = bytes_per_raw_vector(768)
8code_bytes = bytes_per_pq_code(96)
9ivf_pq_bytes_with_id = code_bytes + 8
10
11print(f"raw vector bytes: {raw_bytes}")
12print(f"PQ code bytes: {code_bytes}")
13print(f"IVF-PQ code plus 64-bit id bytes: {ivf_pq_bytes_with_id}")
14print(f"raw-to-code compression: {raw_bytes / code_bytes:.0f}x")1raw vector bytes: 3072
2PQ code bytes: 96
3IVF-PQ code plus 64-bit id bytes: 104
4raw-to-code compression: 32xPlain PQ has a weakness: it splits dimensions into fixed contiguous chunks. If the variance in your embeddings is unevenly spread across dimensions, some subspaces carry most of the signal while others are nearly constant, and the per-chunk codebooks waste their 256 centroids on the wrong directions. Optimized Product Quantization (OPQ) addresses this by learning an orthogonal rotation matrix and applying it before the split so the quantizer can distribute useful variation across chunks.[5] An orthogonal rotation preserves L2 distances, so it costs one matrix multiply per query but loses no geometric information before quantization.
In Faiss you can request OPQ with the factory string OPQ96,IVF4096,PQ96: the OPQ96 prefix trains the rotation, then IVF and PQ run on the rotated vectors. OPQ is a candidate to benchmark at the same code size; its recall benefit depends on the embedding distribution and training sample.
We don't decompress vectors to search them. Instead, we calculate the distance from the uncompressed query to the compressed database vectors using lookup tables.
Mathematically, the squared Euclidean distance between a query vector and a quantized vector (approximated by codebook centroids ) is:
Where is the -th sub-vector of the query, and is the centroid for the -th sub-vector of . This is the core of Asymmetric Distance Computation (ADC). Instead of reconstructing an approximate full vector and computing distance against it, we precompute the distance from each query sub-vector to all centroids in each codebook. The query-time distance for any stored vector then becomes a simple table lookup and sum. The implementation below demonstrates this. It takes a query vector and pre-computed PQ codebooks, creates the lookup tables, and returns the top k nearest neighbors by partial distance lookup alone.
1import numpy as np
2
3def adc_pq_search(
4 query: np.ndarray,
5 pq_codes: np.ndarray,
6 codebooks: np.ndarray,
7 k: int = 10
8) -> np.ndarray:
9 """
10 Approximate nearest neighbor search using Asymmetric Distance Computation (ADC)
11 with Product Quantization.
12
13 Args:
14 query: (d,) float vector.
15 pq_codes: (N, m) uint8 array of stored codes.
16 codebooks: (m, 256, d/m) float array of centroids.
17 """
18 m = pq_codes.shape[1]
19 if query.shape[0] % m != 0:
20 raise ValueError("query dimension must be divisible by number of PQ chunks")
21 d_sub = query.shape[0] // m
22
23 # 1. Precompute distance tables.
24 # For each subspace, measure query sub-vector distance to every centroid.
25 dist_tables = np.zeros((m, codebooks.shape[1]), dtype=np.float32)
26 for i in range(m):
27 query_sub = query[i*d_sub : (i+1)*d_sub]
28 dist_tables[i, :] = np.sum((codebooks[i] - query_sub) ** 2, axis=1)
29
30 # 2. Aggregate distances with table lookups.
31 # Sum partial distances for each vector based on stored code IDs.
32 dists = np.zeros(len(pq_codes))
33 for i in range(m):
34 # Look up precomputed distance for code stored at this position.
35 dists += dist_tables[i, pq_codes[:, i]]
36
37 return np.argsort(dists)[:k]
38
39query = np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32)
40codebooks = np.zeros((2, 256, 2), dtype=np.float32)
41codebooks[0, 1] = np.array([2.0, 0.0])
42codebooks[1, 1] = np.array([0.0, 2.0])
43pq_codes = np.array(
44 [
45 [0, 0],
46 [1, 0],
47 [0, 1],
48 [1, 1],
49 ],
50 dtype=np.uint8,
51)
52
53nearest_ids = adc_pq_search(query, pq_codes, codebooks, k=2).tolist()
54print(f"nearest ids: {nearest_ids}")1nearest ids: [0, 1]Real queries almost never search the whole corpus. A support tool asks for "tickets like this one, but only from the last 30 days and only for the EU region." That metadata predicate is a filter, and where you apply it changes both recall and latency. There are three strategies, and the choice is one of the most common production interview questions.
Filter then exact scan first finds rows matching the predicate, then scores only that subset. If the allowed set is small, this is exact and can be efficient. It gets costly as the allowed set grows.
Post-filtering (search, then filter) runs normal ANN search for some candidate count, then discards anything that fails the predicate. This keeps the graph intact, but it has an overfiltering failure mode: if the predicate is selective (say only 1% of vectors match), the initial ANN candidate buffer may contain zero matches, so you return fewer than k results or none at all. Some engines address this by expanding the scan iteratively until enough matching results appear or a budget is exhausted.[1]
Allow-list graph traversal is another useful option. The graph walk only counts allowed nodes toward the result set while it can still traverse disallowed nodes for connectivity. Weaviate documents this pre-filtered HNSW behavior, a sweeping strategy, and an ACORN-inspired option; its documentation recommends ACORN particularly when restrictive filters correlate poorly with the vector neighborhood.[6][7] This isn't a universal winner: a tiny allowed set can make a flat scan cheaper, while broader or poorly correlated filters can benefit from graph traversal.
1candidate_ids = ["eu-1", "us-1", "us-2", "us-3", "eu-2"]
2region = {"eu-1": "eu", "us-1": "us", "us-2": "us", "us-3": "us", "eu-2": "eu"}
3
4def post_filter(candidates: list[str], required_region: str, k: int) -> list[str]:
5 return [item for item in candidates if region[item] == required_region][:k]
6
7initial_candidates = candidate_ids[:3]
8expanded_candidates = candidate_ids
9
10print(f"initial result: {post_filter(initial_candidates, 'eu', k=2)}")
11print(f"expanded result: {post_filter(expanded_candidates, 'eu', k=2)}")1initial result: ['eu-1']
2expanded result: ['eu-1', 'eu-2']Vendor defaults drift faster than the underlying concepts, so skip memorizing a leaderboard table. In interviews and production reviews, ask four concrete questions instead:
That framing travels better than product trivia:
The pattern that stays stable is conceptual, not vendor-specific: HNSW is a candidate when you want strong in-memory recall without centroid training, while IVF and PQ are candidates when scan fraction or bytes per vector become central constraints. Choose from measured recall, tail latency, build/update cost, and memory on your workload.
When data exceeds standard RAM limits, you usually need one of two moves: compress the vectors more aggressively, or move most of the index body out of RAM. A global logistics platform tracking billions of shipment embeddings faces this exact choice: either compress each route vector with PQ or keep a sparse graph in memory and store the full index on fast SSD.
Combining IVF (to reduce search scope) and PQ (to compress vectors) is a common billion-scale recipe.
IndexIVFPQ in Faiss.IndexIVFPQ stores about ceil(m_pq * nbits / 8) + 8 bytes per vector for the PQ code plus the vector id, where m_pq is the number of PQ sub-vectors/codebooks. A 32-byte code therefore lands near 40 bytes per vector, so 1B vectors is roughly 40 GB before centroids and other overhead.Microsoft's DiskANN[8] shows that you can serve a billion-point graph index from a single node by keeping a compact in-memory structure and placing the large index body on SSD.
>5000 queries per second, <3 ms mean latency, and 95%+ 1-recall@1 on SIFT1B on a 16-core machine with 64 GB RAM plus SSD. Treat those numbers as benchmark-specific, not a universal promise for every embedding workload.If single-node vertical scaling fails, we shard horizontally using a Scatter-Gather approach.
These are the symptoms, causes, and fixes we see most often in production vector search.
Symptom: recall drops after a data shift Cause: You trained an IVF index on English product descriptions, then started adding Spanish descriptions. The old centroids no longer represent the new data distribution, so queries land in the wrong Voronoi cells. Fix: Retrain centroids on a representative sample that includes the new data. Monitor recall@k on a held-out validation set and trigger retraining when it drifts below a threshold.
Symptom: HNSW index causes OOM crashes
Cause: An uncompressed HNSW index stores graph adjacency data on top of raw vectors. Edge representation, layer distribution, ids, allocator overhead, and implementation choices all affect the real total.
Fix: Measure bytes per indexed vector in the chosen engine at production-like M and dimension settings, then reserve headroom for builds, queries, operating-system buffers, and metadata. If RAM is tight, evaluate compression or an IVF-PQ design against recall requirements.
Symptom: search is fast but returns poor matches
Cause: The ANN budget is too small for the index family you chose. For IVF that often means nprobe is too low, so you inspect too few lists. For HNSW it often means ef_search is barely above k, so the beam never explores enough candidates.
Fix: Diagnose by index family, not with one generic knob story. Measure recall@k on a labeled set, then raise nprobe for IVF or ef_search for HNSW until recall meets the product target.
Symptom: switching from L2 to cosine breaks IVF results
Cause: You changed the distance metric at search time but didn't retrain the coarse quantizer on normalized vectors. The centroids were learned in L2 geometry, so they misplace vectors after normalization.
Fix: Normalize training vectors before calling index.train(), and normalize both database and query vectors before index.add() and index.search().
HNSW handles inserts naturally because a new node can be attached incrementally. Deletes are harder because removing nodes can damage graph connectivity. Many systems therefore use tombstones and periodic rebuilds or compaction instead of pretending deletion is free.
Because memory is the hard constraint. IVF-PQ compresses vectors aggressively enough to stay resident where HNSW may spend too much RAM on raw vectors plus graph edges. HNSW is often better when the working set already fits; IVF-PQ becomes the practical choice when bytes per vector decide feasibility.
Because ADC is the whole trick that keeps PQ search efficient. Query-time work should be lookup-table additions, not full-vector reconstruction for every candidate. Reconstructing every vector throws away much of the compute benefit of compressed search.
You shrank local indexes but increased fan-out and made the request wait on more shard responses. Scatter-gather throughput can rise while p99 still gets worse if one slow shard or merge overhead dominates.
By the end, you should be able to defend these points without relying on metadata or hidden notes:
M, ef_construction, and ef_search move recall, latency, build time, and memory in different directions.nlist inverted lists and probing only nprobe of them.By this point in the article, you should be able to:
ef_search controls the recall-latency trade-off.nlist and nprobe, estimate the fraction of the database that gets scanned.| Index family | Core idea | Main strength | Main cost | Reach for it when |
|---|---|---|---|---|
| HNSW | Hierarchical proximity graph | Often strong in-RAM recall/latency behavior | Graph memory and slower builds | Low latency matters, the working set fits in RAM, and evaluation supports it |
| IVF-Flat | Coarse partitioning plus exact scan inside probed lists | Simple and tunable | Needs training and still stores raw vectors | You want controllable scan fraction without graph complexity |
| IVF-PQ | IVF plus compressed vector codes | Much lower bytes per vector | Quantization error and more tuning | Memory cost dominates and some recall loss is acceptable |
| DiskANN | SSD-backed graph index | Single-node scale beyond RAM | Storage-aware engineering complexity | The dataset is larger than RAM but one fast SSD-backed node is still attractive |
Evaluate HNSW when you can afford RAM and need low-latency high-recall retrieval. Evaluate IVF-PQ when raw vector storage becomes the bottleneck. Consider DiskANN-style systems when a graph index remains attractive but the index body no longer fits in memory. In each case, release from a recall-and-latency curve measured on representative queries.
pgvector
pgvector contributors · 2026 · GitHub
Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.
Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence
Billion-scale similarity search with GPUs.
Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint
Product Quantization for Nearest Neighbor Search.
Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011
Optimized Product Quantization
Ge, T., He, K., Ke, Q., & Sun, J. · 2013
ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
Patel, L., Kraft, P., Guestrin, C., & Zaharia, M. · 2024
Filtering
Weaviate · 2026
DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node.
Subramanya, S. J., et al. · 2019 · NeurIPS