Vector DB Internals: HNSW & IVF

Learn how approximate nearest neighbor indexes use HNSW, IVF, and Product Quantization to balance speed, recall, and memory in production vector databases.

Exact vector scoring and production RAG established how retrieval feeds grounded generation. Now open the index itself: before an agent or RAG pipeline can reason over documents, the system has to find relevant candidate chunks quickly enough for an interactive product.

Finding one incident runbook in an archive with millions of traces, alerts, and postmortems is hard when you have only a vague description of the failure. Embeddings turn the query and documents into vectors so the system can compare meaning, but checking every stored vector is expensive. You'll see how Hierarchical Navigable Small World (HNSW) graphs and inverted file (IVF) partitions avoid most of that scan, what Product Quantization (PQ) compresses, and how to choose a recall-latency-memory operating point.

The brute-force problem

The simplest method, called Exact Search (or k-NN for k-Nearest Neighbors), calculates the distance from your query vector to every vector in the database.

Mathematically, given a query $q$ and a database of $N$ vectors each with $d$ dimensions, exact search has $O(N \cdot d)$ complexity. That notation means the work grows in direct proportion to both the number of vectors and the number of dimensions.

Make that concrete with 100 million trace-summary embeddings, each with 1,536 dimensions, stored as 4-byte floats. A flat query scores all 100 million vectors, or 153.6 billion scalar components before considering vectorization, batching, memory bandwidth, or accelerator hardware. Whether that fits an SLA is a benchmark question, not something dimensionality alone can answer.

The memory picture is just as stark:

$100{,}000{,}000 \text{ vectors} \times 1{,}536 \text{ dims} \times 4 \text{ bytes} \approx 614 \text{ GB} \approx 572 \text{ GiB}$

That's raw vector storage alone, before any index overhead. (The lab below reports the same quantity in GiB, the binary unit in which RAM is usually measured.) An ordinary single server can't hold that raw collection in RAM. Approximate Nearest Neighbor (ANN) indexes reduce the number of scored candidates by accepting that some true neighbors may be missed.

estimate-flat-storage-and-work.py

def gib_for_float32_vectors(count: int, dimensions: int) -> float:
    return count * dimensions * 4 / (1024 ** 3)

def scalar_components_scored(count: int, dimensions: int) -> int:
    return count * dimensions

n_vectors = 100_000_000
dimensions = 1_536

print(f"raw float32 storage: {gib_for_float32_vectors(n_vectors, dimensions):.1f} GiB")
print(f"scalar components scored per flat query: {scalar_components_scored(n_vectors, dimensions):,}")

Output

raw float32 storage: 572.2 GiB
scalar components scored per flat query: 153,600,000,000

For large, weakly filtered collections, ANN is usually the practical serving path. Exact flat search still matters: it's the ground truth for measuring recall, and it can be the right production execution path when a metadata filter leaves only a small candidate set. For example, pgvector performs exact nearest-neighbor search by default and documents exact search as useful for selective filter conditions [1] (pgvector: https://github.com/pgvector/pgvector).
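
The ground-truth role of exact search can be sketched in a few lines of NumPy. This is a minimal illustration, not a production scanner: the function name `exact_top_k`, the random data, and the query built from row 42 are all assumptions made for the example.

```python
import numpy as np

def exact_top_k(database: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return ids of the k nearest database rows by squared L2 distance."""
    # One distance per stored vector: this is the O(N * d) scan.
    diffs = database - query
    distances = np.einsum("ij,ij->i", diffs, diffs)
    # argpartition finds the k smallest distances without sorting all N.
    candidate_ids = np.argpartition(distances, k - 1)[:k]
    # Sort only those k so the result is ordered nearest-first.
    return candidate_ids[np.argsort(distances[candidate_ids])]

rng = np.random.default_rng(0)
database = rng.normal(size=(10_000, 64)).astype(np.float32)
query = database[42] + 0.01  # hypothetical query: a slightly shifted copy of row 42

ground_truth = exact_top_k(database, query, k=5)
print(f"nearest id: {ground_truth[0]}")  # row 42 is the closest stored vector
```

The ids returned here are what an ANN index's results would be compared against when measuring recall.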

The main quality metric is Recall@k: of the true top-k nearest neighbors, how many did your ANN index recover? Latency numbers without recall are misleading because an index can look fast by missing good candidates. A system that returns random vectors in 1 ms has terrible recall. A system that takes 20 ms and finds 99 of the true top 100 has excellent recall.

measure-recall-at-k.py

def recall_at_k(exact_ids: list[int], approximate_ids: list[int]) -> float:
    expected = set(exact_ids)
    return len(expected.intersection(approximate_ids)) / len(expected)

exact_top_5 = [10, 21, 34, 55, 89]
fast_but_weak = [10, 21, 8, 13, 5]
slower_but_better = [10, 21, 34, 55, 3]

print(f"fast candidate recall@5: {recall_at_k(exact_top_5, fast_but_weak):.0%}")
print(f"better candidate recall@5: {recall_at_k(exact_top_5, slower_but_better):.0%}")

Output

fast candidate recall@5: 40%
better candidate recall@5: 80%

Recall gate: Latency without recall is incomplete. Keep an exact-search evaluation set, measure ANN recall against it, and choose a latency-recall operating point that satisfies the product requirement.

Figure: HNSW graph descent compared with IVF routing into selected lists. HNSW drops through sparse graph layers into dense local search. IVF scores coarse centroids first, then scans only selected lists with exact vectors or PQ codes.

HNSW (Hierarchical Navigable Small World)

HNSW is a widely used in-memory ANN index because it can provide a strong recall-latency trade-off without a separate centroid-training phase [2] (Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs: https://arxiv.org/abs/1603.09320). Whether it beats another index on your data still needs measurement under the same memory and recall constraints.

Core concept

HNSW is based on a multi-layer graph structure that mimics how people narrow a search. Map zoom is a useful analogy: start at the continent level, narrow down to a country, then a city, then a street. Each layer has more detail than the one above it.

Top Layers: Sparse graphs with long-range links. These are used for coarse navigation to quickly "zoom in" to the target neighborhood.
Bottom Layer (Layer 0): A dense graph containing all vectors. This is used for fine-grained local search.

This hierarchical structure is analogous to a skip list (a data structure that uses multiple layers of linked lists to allow fast search by skipping over intermediate nodes), but it's generalized for graphs.

Figure: HNSW search routes quickly through sparse upper layers, then spends most of its search work in dense layer 0 before returning neighbors.

In the figure, Layer 2 has only a few nodes with long-range links, Layer 1 holds more nodes for regional refinement, and Layer 0 contains every node in a dense local graph. The parameters M, ef_search, and the Layer 0 adjacency cap (often 2M) control the recall-latency trade-off.

A worked example: tracing one insertion

To see how the layers work in practice, imagine you're building an HNSW index with just 8 vectors. The graph already has 7 vectors arranged across three layers. You now insert vector 8.

Sample a level. A random process decides that vector 8 will live on Layers 0 and 1 (but not Layer 2).
Zoom in. The entry point is on Layer 2. You greedily walk Layer 2 to find the node closest to vector 8, then drop to Layer 1.
Connect on Layer 1. On Layer 1, you run a beam search with ef_construction candidates, pick the best neighbors (up to M), and add bidirectional edges. You also check if any existing neighbor now has too many edges and prune if needed.
Drop to Layer 0. You again select up to M neighbors for the new node. Because connections are bidirectional, an existing Layer 0 node can commonly retain up to 2M total connections before pruning. That larger adjacency cap makes the base layer denser.
Update entry point. If vector 8 had been assigned above the graph's current maximum level, it would become the new entry point. Landing on an existing top layer isn't enough.

This incremental construction is why HNSW doesn't need a training phase. Every new vector is wired into the existing graph using local search.
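
The "sample a level" step is what keeps upper layers sparse. A minimal sketch, assuming the common default `mL = 1 / ln(M)` and an exponentially decaying draw, shows how many nodes end up on each layer. The seed, node count, and the `nodes_on_layer` name are illustrative choices, not part of any library.

```python
import math
import random

def sample_level(mL: float, rng: random.Random) -> int:
    """Draw a node's top layer from an exponentially decaying distribution."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return math.floor(-math.log(u) * mL)

M = 16
mL = 1 / math.log(M)       # common default, about 0.36 for M = 16
rng = random.Random(7)     # fixed seed so the sketch is repeatable
n_nodes = 100_000
top_levels = [sample_level(mL, rng) for _ in range(n_nodes)]

# A node with top level t is present on every layer from 0 through t.
nodes_on_layer = {
    layer: sum(1 for top in top_levels if top >= layer)
    for layer in range(max(top_levels) + 1)
}
for layer, count in nodes_on_layer.items():
    expected = n_nodes / M**layer
    print(f"layer {layer}: {count:>7,} nodes (about n / M^{layer} = {expected:,.0f})")
```

With this setting each layer holds roughly 1/M of the nodes in the layer below it, which is why only a handful of vectors ever reach the top of the graph.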

Insert algorithm

This example is simplified educational pseudocode. The key steps are: sample a level, zoom in through upper layers, connect on each layer, and update the entry point if the new node is taller than the current graph.

insert-algorithm.py

import math
import random
from typing import Protocol, Sequence

class HNSWGraph(Protocol):
    entry_point: int | None
    max_level: int

    def neighbors(self, node_id: int, level: int) -> Sequence[int]: ...

    def distance(self, query_vector: Sequence[float], node_id: int) -> float: ...

    def add_node(
        self, node_id: int, vector: Sequence[float], max_level: int
    ) -> None: ...

    def add_bidirectional_edge(self, left_id: int, right_id: int, level: int) -> None: ...

    def prune_if_needed(self, node_id: int, level: int, max_degree: int) -> None: ...

def greedy_search(graph, query_vector, entry_id, level) -> int:
    """Walk this layer until no neighbor is closer to query_vector."""
    current = entry_id
    improved = True
    while improved:
        improved = False
        for neighbor in graph.neighbors(current, level):
            if graph.distance(query_vector, neighbor) < graph.distance(query_vector, current):
                current = neighbor
                improved = True
    return current

def search_layer(
    graph: HNSWGraph,
    query_vector: Sequence[float],
    entry_id: int,
    level: int,
    ef: int,
) -> list[int]:
    """Best-first search that returns up to ef candidate ids in one layer."""
    import heapq
    visited = {entry_id}
    entry_dist = graph.distance(query_vector, entry_id)
    candidates = [(entry_dist, entry_id)]  # min-heap by distance
    results = [(-entry_dist, entry_id)]    # max-heap of the best ef via negative distance

    while candidates:
        dist, current = heapq.heappop(candidates)
        worst_result_dist = -results[0][0]
        if len(results) >= ef and dist > worst_result_dist:
            break
        for neighbor in graph.neighbors(current, level):
            if neighbor not in visited:
                visited.add(neighbor)
                d = graph.distance(query_vector, neighbor)
                if len(results) < ef or d < -results[0][0]:
                    heapq.heappush(candidates, (d, neighbor))
                    heapq.heappush(results, (-d, neighbor))
                    if len(results) > ef:
                        heapq.heappop(results)
    return [
        node
        for distance, node in sorted((-neg_distance, node) for neg_distance, node in results)
    ]

def select_neighbors_placeholder(
    graph, query_vector, candidate_ids, level, max_degree
) -> list[int]:
    """Teaching simplification: take nearest candidates by distance."""
    # Production HNSW applies its diversity heuristic here to improve connectivity.
    candidates = sorted(candidate_ids,
                       key=lambda nid: graph.distance(query_vector, nid))
    return candidates[:max_degree]

def sample_level(mL: float) -> int:
    u = max(random.random(), 1e-12)
    return math.floor(-math.log(u) * mL)

def hnsw_insert(
    graph: HNSWGraph,
    new_id: int,
    new_vector: Sequence[float],
    M: int = 16,
    ef_construction: int = 200,
    mL: float = 0.36,
) -> None:
    """
    Insert one vector into an HNSW index.

    Typical implementations select up to M neighbors for the new node.
    Existing nodes commonly retain up to 2*M connections on layer 0
    and up to M connections on upper layers before pruning.
    """
    new_level = sample_level(mL)

    if graph.entry_point is None:
        graph.add_node(new_id, new_vector, max_level=new_level)
        graph.entry_point = new_id
        graph.max_level = new_level
        return

    graph.add_node(new_id, new_vector, max_level=new_level)
    entry_id = graph.entry_point

    # Phase 1: greedy descent through layers above the new node's top layer.
    for level in range(graph.max_level, new_level, -1):
        entry_id = greedy_search(graph, new_vector, entry_id, level)

    # Phase 2: connect the new node from its top layer down to layer 0.
    for level in range(min(new_level, graph.max_level), -1, -1):
        candidate_ids = search_layer(
            graph, new_vector, entry_id, level, ef=ef_construction
        )
        max_degree = 2 * M if level == 0 else M
        neighbors = select_neighbors_placeholder(
            graph, new_vector, candidate_ids, level, M
        )

        for neighbor_id in neighbors:
            graph.add_bidirectional_edge(new_id, neighbor_id, level)
            graph.prune_if_needed(neighbor_id, level, max_degree)

        # Use the closest candidate as the entry point for the next lower layer.
        entry_id = min(candidate_ids,
                        key=lambda nid: graph.distance(new_vector, nid))

    if new_level > graph.max_level:
        graph.entry_point = new_id
        graph.max_level = new_level
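
The placeholder above keeps the nearest candidates. The diversity heuristic it stands in for can be sketched as follows: a candidate is kept only if it is closer to the new vector than to every neighbor already kept. This is a simplified reading of the heuristic in the HNSW paper, without its optional candidate-extension and pruned-connection steps; the function name and the three toy points are assumptions for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]

def select_neighbors_diverse(
    query: Vector,
    candidates: dict[int, Vector],
    max_degree: int,
) -> list[int]:
    """Pick up to max_degree neighbors that are near the query and spread out."""
    selected: list[int] = []
    # Visit candidates nearest-first.
    for cid in sorted(candidates, key=lambda i: math.dist(query, candidates[i])):
        if len(selected) >= max_degree:
            break
        d_query = math.dist(query, candidates[cid])
        # Keep cid only if it is closer to the query than to every kept neighbor.
        if all(d_query < math.dist(candidates[cid], candidates[s]) for s in selected):
            selected.append(cid)
    return selected

candidates = {
    1: (1.0, 0.0),
    2: (1.1, 0.1),   # almost the same direction as node 1
    3: (-1.5, 0.0),  # farther away, but on the opposite side
}
chosen = select_neighbors_diverse((0.0, 0.0), candidates, max_degree=2)
print(chosen)  # [1, 3]: node 2 is skipped because node 1 already covers it
```

Taking the two nearest candidates would have returned nodes 1 and 2, leaving the new node with no link toward the other side of the space.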

Search algorithm

Search follows the same top-down route as insertion but adds no edges. The algorithm starts at the sparse top layers for fast coarse routing, then runs a bounded best-first search at layer 0. That last step isn't exhaustive over the full dataset. It keeps up to ef_search best result candidates while the walk may visit more nodes, which is why raising ef_search often improves recall at the cost of more work. On good graphs, search cost often grows close to logarithmically with the number of vectors, but HNSW doesn't provide a hard worst-case logarithmic guarantee.

search-algorithm.py

import heapq
from dataclasses import dataclass
from typing import Sequence

@dataclass
class ToyHNSWGraph:
    vectors: dict[int, tuple[float, ...]]
    edges: dict[tuple[int, int], list[int]]
    entry_point: int
    max_level: int

    def neighbors(self, node_id: int, level: int) -> Sequence[int]:
        return self.edges.get((node_id, level), [])

    def distance(self, query_vector: Sequence[float], node_id: int) -> float:
        vector = self.vectors[node_id]
        return sum((left - right) ** 2 for left, right in zip(query_vector, vector))

def greedy_search(
    graph: ToyHNSWGraph, query_vector: Sequence[float], entry_id: int, level: int
) -> int:
    current = entry_id
    improved = True
    while improved:
        improved = False
        for neighbor in graph.neighbors(current, level):
            if graph.distance(query_vector, neighbor) < graph.distance(query_vector, current):
                current = neighbor
                improved = True
    return current

def search_layer(
    graph: ToyHNSWGraph,
    query_vector: Sequence[float],
    entry_id: int,
    level: int,
    ef: int,
) -> list[int]:
    visited = {entry_id}
    entry_dist = graph.distance(query_vector, entry_id)
    candidates = [(entry_dist, entry_id)]
    results = [(-entry_dist, entry_id)]

    while candidates:
        dist, current = heapq.heappop(candidates)
        worst_result_dist = -results[0][0]
        if len(results) >= ef and dist > worst_result_dist:
            break
        for neighbor in graph.neighbors(current, level):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            neighbor_dist = graph.distance(query_vector, neighbor)
            if len(results) < ef or neighbor_dist < -results[0][0]:
                heapq.heappush(candidates, (neighbor_dist, neighbor))
                heapq.heappush(results, (-neighbor_dist, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)

    return [
        node
        for distance, node in sorted((-neg_distance, node) for neg_distance, node in results)
    ]

def hnsw_search(
    graph: ToyHNSWGraph, query: Sequence[float], k: int = 10, ef_search: int = 100
) -> list[int]:
    """
    Performs approximate nearest neighbor search.

    Args:
        query: The query vector.
        k: Number of results to return.
        ef_search: Size of the dynamic candidate list during search.
    """
    if ef_search < k:
        raise ValueError("ef_search must be at least k")

    current_node = graph.entry_point

    # Phase 1: greedy descent from the top layer down to layer 1.
    for level in range(graph.max_level, 0, -1):
        current_node = greedy_search(graph, query, current_node, level)

    # Phase 2: bounded best-first search on layer 0.
    candidates = search_layer(graph, query, current_node, level=0, ef=ef_search)

    return sorted(candidates,
                  key=lambda nid: graph.distance(query, nid))[:k]

graph = ToyHNSWGraph(
    vectors={
        0: (0.0, 0.0),
        1: (1.0, 0.0),
        2: (1.0, 1.0),
        3: (5.0, 5.0),
    },
    edges={
        (0, 1): [1],
        (1, 1): [0, 2],
        (2, 1): [1],
        (0, 0): [1],
        (1, 0): [0, 2],
        (2, 0): [1, 3],
        (3, 0): [2],
    },
    entry_point=0,
    max_level=1,
)

nearest_ids = hnsw_search(graph, query=(1.05, 1.0), k=2, ef_search=3)
print(f"nearest ids: {nearest_ids}")

Output

nearest ids: [2, 1]

If you set ef_search = 5 but ask for k = 10, a 5-entry candidate list can't supply 10 results, so many implementations enforce ef_search >= k, either by rejecting the request or by raising the value to k. The real risk is subtler: if ef_search is only slightly larger than k, you might miss better neighbors outside the explored candidate set. That's the recall-latency trade-off in action.

Common mistake: Setting ef_search from intuition rather than an evaluation curve. A small beam can miss true neighbors; an unnecessarily large beam spends latency without a useful recall gain. Measure ef_search values against exact top-k results on representative queries.

Key parameters

Tuning HNSW requires balancing the quality of the graph (which impacts recall) against the memory consumption and indexing speed. The parameters that matter most control the number of edges and the beam width used during graph traversal.

Parameter	Meaning	Typical Value	Effect
`M`	Max neighbors per node on upper layers	16-48	Higher = better recall, higher memory usage and slower builds.
`M0`	Max neighbors on layer 0 (often `2*M`)	`2*M`	Denser base layer, better local connectivity, more memory.
`ef_construction`	Candidate list size during insert	100-400	Higher = better graph quality, slower indexing.
`ef_search`	Candidate list size during query	`k` to a few hundred	Higher = better recall, higher latency. Many implementations require `ef_search >= k`.
`mL`	Controls the layer sampling distribution	Often near $1/\ln(M)$	Higher = more nodes promoted to upper layers and a taller hierarchy.
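
To make the memory effect of `M` concrete, here is a rough estimate of adjacency storage alone. It is a sketch under stated assumptions: 4-byte neighbor ids, every neighbor list filled to its cap, `mL = 1/ln(M)`, and no per-node bookkeeping. Real implementations differ, so treat the numbers as an upper-bound style guide rather than a measurement.

```python
def hnsw_link_bytes(n_vectors: int, M: int, id_bytes: int = 4) -> float:
    """Rough adjacency storage, assuming full neighbor lists (hypothetical model)."""
    layer0_slots = 2 * M
    # With mL = 1/ln(M), a node reaches layer l with probability M**-l,
    # so it appears on 1/(M - 1) upper layers on average.
    expected_upper_layers = 1 / (M - 1)
    upper_slots = M * expected_upper_layers
    return n_vectors * (layer0_slots + upper_slots) * id_bytes

for M in (16, 32, 48):
    gib = hnsw_link_bytes(100_000_000, M) / 1024**3
    print(f"M={M}: about {gib:.1f} GiB of links for 100M vectors")
```

Under these assumptions the links for 100 million vectors cost about 12 GiB at `M=16` and roughly triple at `M=48`, on top of the raw vectors themselves.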

ef_search is a query-time knob. You can trade latency for recall dynamically without rebuilding the index.

select-hnsw-operating-point.py

measurements = [
    {"ef_search": 24, "recall": 0.91, "p95_ms": 4.1},
    {"ef_search": 48, "recall": 0.97, "p95_ms": 6.8},
    {"ef_search": 96, "recall": 0.985, "p95_ms": 12.4},
]
required_recall = 0.96
latency_budget_ms = 10.0

eligible = [
    point for point in measurements
    if point["recall"] >= required_recall and point["p95_ms"] <= latency_budget_ms
]
choice = min(eligible, key=lambda point: point["p95_ms"])
print(f"release ef_search={choice['ef_search']} recall={choice['recall']:.3f} p95_ms={choice['p95_ms']}")

Output

release ef_search=48 recall=0.970 p95_ms=6.8

IVF (Inverted File Index)

While HNSW is graph-based, IVF is a partition-based algorithm. It divides the vector space into distinct regions called Voronoi cells, where every point is closer to one centroid than to any other centroid. During search, the algorithm only looks inside a small subset of those cells.

Core concept: the centroid shard map

The easiest way to think about IVF is a centroid shard map. Instead of checking every vector in the collection, you find the shard whose centroid is closest to the query and only search that list. If a neighbor might sit across a boundary, you probe a few extra lists to protect recall.

The process is broken down into three phases:

Train: Run K-means clustering on a representative sample of your vectors to identify a set of $C$ cluster centers, known as centroids. In Faiss terminology, this count is usually called nlist.
Index: During ingestion, assign each vector to its nearest centroid. That places the vector into one inverted list.
Search: When a query vector arrives, first score the centroids, keep the closest nprobe lists, then score candidates only inside those selected lists. IVF-Flat calculates exact distances there; IVF-PQ can score compressed codes with ADC lookup tables.
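
The three phases can be sketched from scratch in NumPy. This is a teaching sketch of IVF-Flat, not how Faiss implements it: the plain Lloyd k-means, the helper names (`sq_l2`, `train_centroids`, `build_lists`, `ivf_search`), and the random data are all assumptions for illustration.

```python
import numpy as np

def sq_l2(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Squared L2 distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def train_centroids(sample, nlist, iters=10, seed=0):
    """Train: plain Lloyd k-means (a stand-in for a production trainer)."""
    rng = np.random.default_rng(seed)
    centroids = sample[rng.choice(len(sample), size=nlist, replace=False)].copy()
    for _ in range(iters):
        assignment = sq_l2(sample, centroids).argmin(axis=1)
        for c in range(nlist):
            members = sample[assignment == c]
            if len(members):  # an empty cell keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    return centroids

def build_lists(vectors, centroids):
    """Index: one inverted list of vector ids per centroid."""
    assignment = sq_l2(vectors, centroids).argmin(axis=1)
    return [np.flatnonzero(assignment == c) for c in range(len(centroids))]

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Search: score centroids, keep nprobe lists, scan only those lists."""
    centroid_dist = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(centroid_dist)[:nprobe]
    candidate_ids = np.concatenate([lists[c] for c in probed])
    dist = ((vectors[candidate_ids] - query) ** 2).sum(axis=1)
    return candidate_ids[np.argsort(dist)[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2_000, 8))
centroids = train_centroids(vectors, nlist=16)
lists = build_lists(vectors, centroids)
query = rng.normal(size=8)

exact = np.argsort(((vectors - query) ** 2).sum(axis=1))[:5]
for nprobe in (1, 4, 16):
    found = ivf_search(query, vectors, centroids, lists, nprobe, k=5)
    overlap = len(set(exact.tolist()) & set(found.tolist()))
    print(f"nprobe={nprobe:>2}: recovered {overlap}/5 exact neighbors")
```

Probing all 16 lists scans every vector, so that setting always recovers the exact top 5. Smaller `nprobe` values do less work and may miss neighbors that sit just across a cell boundary.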

Figure: IVF probe flow. A query scores coarse centroids, keeps only the closest lists, and then scores candidates inside those lists with either exact distances or PQ-based approximate distances.

A worked example: counting comparisons

Suppose you have 1 million embeddings with 128 dimensions. Compare brute force against IVF with nlist = 1,024 and nprobe = 5:

Brute force: $1{,}000{,}000$ distance computations per query.
IVF centroid scoring: $1{,}024$ distance computations.
IVF list scans: average list size $1{,}000{,}000 / 1{,}024 \approx 976.6$, so $5 \times 976.6 \approx 4{,}883$ vector scores inside probed lists.
IVF total: roughly $5{,}907$ distance computations.

That's about a 169x reduction in distance computations, assuming balanced lists. The actual speedup in a real system is usually smaller because of overhead and uneven list sizes, but the principle is clear: IVF avoids scanning most of the database.

estimate-ivf-probe-work.py

def balanced_ivf_scores(total_vectors: int, nlist: int, nprobe: int) -> int:
    return nlist + round(total_vectors / nlist * nprobe)

total_vectors = 1_000_000
nlist = 1_024
nprobe = 5
estimated_scores = balanced_ivf_scores(total_vectors, nlist, nprobe)

print(f"flat vector scores: {total_vectors:,}")
print(f"balanced IVF vector and centroid scores: {estimated_scores:,}")
print(f"estimated reduction: {total_vectors / estimated_scores:.1f}x")

Output

flat vector scores: 1,000,000
balanced IVF vector and centroid scores: 5,907
estimated reduction: 169.3x

Implementation with Faiss

Faiss is a widely used implementation of IVF-style ANN indexing and documents the IndexIVFFlat behavior used below [3] (Billion-scale similarity search with GPUs: https://arxiv.org/abs/1702.08734). The example builds an IVF index, trains its coarse quantizer, assigns vectors to cells, and searches the selected lists.

implementation-with-faiss.py

import faiss
import numpy as np

def build_ivf_index(
    vectors: np.ndarray, nlist: int = 1024, nprobe: int = 10
) -> faiss.IndexIVFFlat:
    """
    Builds an IVF index using Faiss.

    Args:
        vectors: (N, d) float32 array of data.
        nlist: Number of coarse centroids / inverted lists.
        nprobe: Number of inverted lists to visit during search.
    """
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    if vectors.ndim != 2:
        raise ValueError("vectors must be a 2D array")
    if vectors.shape[0] < nlist:
        raise ValueError("nlist cannot exceed the number of training vectors")

    d = vectors.shape[1]

    # 1. Quantizer: index used to assign vectors to centroids.
    # IndexFlatL2 uses brute-force L2 distance.
    quantizer = faiss.IndexFlatL2(d)

    # 2. IVF index.
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

    # 3. Train: run k-means to learn centroids from representative data.
    index.train(vectors)

    # 4. Add: assign all vectors to their nearest centroid.
    index.add(vectors)

    # 5. Configure search scope.
    index.nprobe = min(nprobe, nlist)

    return index

left_cluster = np.column_stack((np.linspace(0.0, 0.39, 40), np.zeros(40)))
right_cluster = np.column_stack((10.0 + np.linspace(0.0, 0.39, 40), np.full(40, 10.0)))
vectors = np.vstack((left_cluster, right_cluster)).astype(np.float32)

index = build_ivf_index(vectors, nlist=2, nprobe=2)
distances, ids = index.search(np.array([[0.02, 0.0]], dtype=np.float32), k=3)

print(f"trained: {index.is_trained}")
print(f"stored vectors: {index.ntotal}")
print(f"nearest ids: {ids[0].tolist()}")

Output

trained: True
stored vectors: 80
nearest ids: [2, 1, 3]

Recall vs nprobe: The nprobe parameter is the most important knob for tuning an IVF index in production. It controls how many inverted lists are visited during search. If the inverted lists were perfectly balanced, the scanned fraction would be roughly nprobe / nlist. Real systems deviate from that because list sizes are uneven, but the direction still holds: more probes mean higher recall and more work.

`nprobe` regime	What gets scanned	Practical effect
Very small (`1` to a few lists)	Only the closest coarse cells	Lowest latency, highest miss rate.
Moderate	A small fraction of the database	Usually the best operating point.
Large (`nprobe` close to `nlist`)	Most or all inverted lists	`IndexIVFFlat` approaches exhaustive search and often loses its speed advantage.
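
Uneven list sizes matter more than the `nprobe / nlist` estimate suggests. The sketch below assumes a simplified model in which a query lands in list `i` with probability proportional to that list's size, which is roughly what happens when queries follow the data distribution. The function name and the two size profiles are hypothetical.

```python
def expected_scanned_first_probe(list_sizes: list[int]) -> float:
    """Expected vectors scanned in the first probed list.

    Assumes a query falls in list i with probability size_i / N.
    """
    total = sum(list_sizes)
    return sum(size * size for size in list_sizes) / total

balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

print(expected_scanned_first_probe(balanced))  # 250.0
print(expected_scanned_first_probe(skewed))    # 520.0
```

Both indexes hold 1,000 vectors in 4 lists, but under this model the skewed one scans about twice as many vectors per probe because queries tend to land in the oversized list.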

Distance metrics

The choice of distance metric affects both training and search. Faiss supports L2 and inner product directly across its main ANN indexes, and cosine similarity is usually implemented by normalizing vectors and then using inner product search.

L2 distance is the standard Euclidean objective. It's a good fit when vector magnitude carries signal or when your model was trained with an L2-style loss.

Inner product (IP) is the right objective for maximum dot-product ranking. Query norm doesn't affect ranking, but database vector norms do, so large-norm items can dominate if that's how your embeddings behave.
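
A two-vector example makes the norm effect visible. The data is invented for illustration: row 0 points exactly along the query, while row 1 points 45 degrees away but has a larger norm.

```python
import numpy as np

database = np.array([[1.0, 0.0], [2.0, 2.0]])  # row 1 has the larger norm
query = np.array([1.0, 0.0])                   # points exactly along row 0

ip_rank = np.argsort(-(database @ query)).tolist()
ip_rank_scaled_query = np.argsort(-(database @ (10 * query))).tolist()

cosine = (database @ query) / (np.linalg.norm(database, axis=1) * np.linalg.norm(query))
cosine_rank = np.argsort(-cosine).tolist()

print(ip_rank, ip_rank_scaled_query, cosine_rank)  # [1, 0] [1, 0] [0, 1]
```

Scaling the query by 10 leaves the inner-product ranking unchanged, while the larger database norm lets row 1 outrank the perfectly aligned row 0. Cosine similarity reverses that order.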

Cosine similarity ignores magnitude and keeps only direction. In practice, you normalize both database vectors and query vectors, then use inner product search. On normalized vectors, L2 and inner product give equivalent rankings because $||x-y||^2 = 2 - 2\langle x, y \rangle$.
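
The identity follows from expanding the squared norm, assuming both vectors are already unit length ($||x|| = ||y|| = 1$):

$||x-y||^2 = \langle x-y, x-y \rangle = ||x||^2 + ||y||^2 - 2\langle x, y \rangle = 1 + 1 - 2\langle x, y \rangle = 2 - 2\langle x, y \rangle$

Because the right-hand side shrinks as the inner product grows, sorting by ascending L2 distance and sorting by descending inner product produce the same order.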

Metric	What it measures	Best when	IVF retrain needed?
L2 (Euclidean)	Straight-line distance in space	Magnitude carries signal	Yes, if switching from another metric
Inner product	$x \cdot y$	You want maximum dot-product ranking	Yes, if switching from another metric
Cosine	Direction only, no magnitude	You only care about semantic similarity	Yes, on normalized vectors

Metric choice changes the geometry seen by the coarse quantizer. If you switch an IVF system from raw L2 search to cosine search, retrain the centroids on normalized vectors rather than reusing the old clustering.

verify-cosine-inner-product-ranking.py

import numpy as np

vectors = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]], dtype=np.float32)
query = np.array([1.0, 0.8], dtype=np.float32)

normalized_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
normalized_query = query / np.linalg.norm(query)
inner_product_scores = normalized_vectors @ normalized_query
l2_scores = np.sum((normalized_vectors - normalized_query) ** 2, axis=1)

print(f"inner-product rank: {np.argsort(-inner_product_scores).tolist()}")
print(f"L2 rank after normalization: {np.argsort(l2_scores).tolist()}")

Output

inner-product rank: [1, 0, 2]
L2 rank after normalization: [1, 0, 2]

HNSW vs IVF

Feature	HNSW	IVF-Flat
Category	Proximity graph	Clustering / inverted file
Search cost	Empirically sublinear on suitable well-built graphs; benchmark on your data	Centroid scoring plus scanning roughly `nprobe / nlist` of the data if lists are balanced
Memory	High (raw vectors + graph edges)	High (raw vectors + ids), but lower than HNSW
Training step	None	Required to learn centroids
Index updates	Inserts are natural; deletes are implementation-specific	Inserts are easy, but centroid quality can drift over time
Recall	Often strong at a fixed in-RAM latency budget; workload dependent	Tunable via `nprobe`; workload dependent
Build time	Slow	Faster once centroids are trained
Best for	Low-latency, high-recall search in RAM	Large datasets where you want a controllable scan fraction or a base for PQ

The biggest operational difference is that IVF requires a training step on representative data before indexing. If your vector distribution drifts over time, stale coarse centroids can unbalance lists or lower recall. HNSW doesn't have that centroid-training dependency, but commonly pays with more memory and slower construction.

Product Quantization (PQ)

For billion-scale datasets, keeping full float32 vectors in RAM on a single node is usually impractical.

$10^9 \text{ vectors} \times 768 \text{ dims} \times 4 \text{ bytes} \approx 3.07 \text{ TB}$

One billion vectors at 768 float32 dimensions need about 3 TB of raw vector storage alone. Add index overhead, operating-system buffers, and application memory on top of that.

Product Quantization (PQ) compresses vectors by breaking them into smaller chunks and quantizing those chunks independently [4] (Product Quantization for Nearest Neighbor Search: https://dblp.org/rec/journals/pami/JegouDS11). Combined with IVF, PQ is one established way to make billion-point memory budgets tractable; whether a deployment also needs sharding depends on code size, metadata, throughput, and availability requirements.

The algorithm

Product Quantization divides the high-dimensional space into several lower-dimensional subspaces and quantizes each subspace independently. A single full-vector codebook would need an impractically large number of centroids to represent many combinations of values. PQ factorizes that codebook into smaller pieces that are cheaper to train and store.

The compression process follows three distinct steps:

Split: Divide the original $d$-dimensional vector into $m$ smaller sub-vectors, each having a dimension of $d^* = d/m$. For instance, a 768-dimension vector split into 96 sub-vectors would yield sub-vectors of dimension 8.
Cluster: For each of the $m$ independent subspaces, run K-Means clustering over your training data to find a set of centroids (often $k=256$). This collection of centroids for a given subspace is called its "codebook".
Encode: For every vector in your database, map each of its sub-vectors to the nearest centroid in the corresponding codebook. Replace the actual sub-vector values with the ID of that centroid.
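
The split, cluster, and encode steps can be sketched in NumPy. This is a toy illustration under stated assumptions: it uses a plain Lloyd k-means, 16 centroids per subspace instead of 256 so it trains quickly, and hypothetical helper names (`kmeans`, `train_pq`, `encode_pq`, `decode_pq`). Codes are still stored as `uint8` to match the one-byte-per-chunk layout.

```python
import numpy as np

def kmeans(data: np.ndarray, k: int, iters: int = 15, seed: int = 0) -> np.ndarray:
    """Plain Lloyd k-means; returns a (k, dims) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignment = dist.argmin(axis=1)
        for c in range(k):
            members = data[assignment == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    return centroids

def train_pq(train: np.ndarray, m: int, ksub: int) -> np.ndarray:
    """Split + cluster: one codebook per subspace, shape (m, ksub, d / m)."""
    d_sub = train.shape[1] // m
    return np.stack([
        kmeans(train[:, j * d_sub:(j + 1) * d_sub], ksub, seed=j) for j in range(m)
    ])

def encode_pq(vectors: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Encode: replace each sub-vector with the id of its nearest centroid."""
    m, _, d_sub = codebooks.shape
    codes = np.empty((len(vectors), m), dtype=np.uint8)
    for j in range(m):
        chunk = vectors[:, j * d_sub:(j + 1) * d_sub]
        dist = ((chunk[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(axis=2)
        codes[:, j] = dist.argmin(axis=1)
    return codes

def decode_pq(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Approximate reconstruction: concatenate the selected centroids."""
    return np.concatenate(
        [codebooks[j][codes[:, j]] for j in range(codes.shape[1])], axis=1
    )

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2_000, 32)).astype(np.float32)
codebooks = train_pq(vectors, m=8, ksub=16)
codes = encode_pq(vectors, codebooks)
approx = decode_pq(codes, codebooks)

print(f"codes shape: {codes.shape}, dtype: {codes.dtype}")
print(f"bytes per vector: raw={vectors[0].nbytes}, code={codes[0].nbytes}")
print(f"mean squared reconstruction error: {((approx - vectors) ** 2).sum(axis=1).mean():.2f}")
```

The reconstruction error is the price of compression. Raising the number of chunks or the number of centroids per chunk lowers it at the cost of longer codes or larger codebooks.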

In an IVF-PQ system, those PQ codes are usually built on the residual vector after subtracting the assigned coarse centroid, not on the raw vector itself [3] (Billion-scale similarity search with GPUs: https://arxiv.org/abs/1702.08734). Quantizing the smaller residual usually gives better accuracy for the same code size.

Figure: Product quantization pipeline. PQ splits a large vector into many small sub-vectors, replaces each with a centroid ID, and stores bytes instead of float values. Compression is large because one code per chunk replaces many raw dimensions.

By using $k=256$ centroids, an ID requires just 1 byte of storage (since $2^8 = 256$). A 768-dim float32 vector (3072 bytes) compressed with $m=96$ becomes an array of 96 single-byte IDs, taking up exactly 96 bytes. That's 32x compression for the vector code, before ids, codebooks, centroids, and other index overhead.

budget-pq-code-storage.py

def bytes_per_raw_vector(dimensions: int) -> int:
    return dimensions * 4

def bytes_per_pq_code(chunks: int, bits_per_chunk: int = 8) -> int:
    return chunks * bits_per_chunk // 8

raw_bytes = bytes_per_raw_vector(768)
code_bytes = bytes_per_pq_code(96)
ivf_pq_bytes_with_id = code_bytes + 8

print(f"raw vector bytes: {raw_bytes}")
print(f"PQ code bytes: {code_bytes}")
print(f"IVF-PQ code plus 64-bit id bytes: {ivf_pq_bytes_with_id}")
print(f"raw-to-code compression: {raw_bytes / code_bytes:.0f}x")

Output

raw vector bytes: 3072
PQ code bytes: 96
IVF-PQ code plus 64-bit id bytes: 104
raw-to-code compression: 32x

OPQ: rotating before you split

Plain PQ has a weakness: it splits dimensions into fixed contiguous chunks. If the variance in your embeddings is unevenly spread across dimensions, some subspaces carry most of the signal while others are nearly constant, and the per-chunk codebooks waste their 256 centroids on the wrong directions. Optimized Product Quantization (OPQ) addresses this by learning an orthogonal rotation matrix $R$ and applying it before the split so the quantizer can distribute useful variation across chunks [5] (Optimized Product Quantization: https://openaccess.thecvf.com/content_cvpr_2013/html/Ge_Optimized_Product_Quantization_2013_CVPR_paper.html). An orthogonal rotation preserves L2 distances, so it costs one matrix multiply per query but loses no geometric information before quantization.

In Faiss you can request OPQ with the factory string OPQ96,IVF4096,PQ96: the OPQ96 prefix trains the rotation, then IVF and PQ run on the rotated vectors. OPQ is a candidate to benchmark at the same code size; its recall benefit depends on the embedding distribution and training sample.
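To make the rotation argument concrete, here is a minimal sketch in plain Python. It uses a fixed 45-degree rotation in two dimensions instead of a learned matrix (an assumption for illustration only; OPQ learns $R$ from training data), and the toy points are made up. It checks that a pairwise L2 distance is unchanged while per-dimension variance is redistributed.

```python
import math
from statistics import pvariance

def rotate(point, theta):
    """Apply a 2-D orthogonal rotation by angle theta (radians)."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

# Toy data (hypothetical): almost all variation sits in the first dimension.
points = [(-3.0, 0.1), (-1.0, -0.1), (1.0, 0.1), (3.0, -0.1)]
theta = math.pi / 4  # fixed stand-in for the rotation OPQ would learn
rotated = [rotate(p, theta) for p in points]

# An orthogonal rotation leaves pairwise L2 distances unchanged.
dist_before = math.dist(points[0], points[3])
dist_after = math.dist(rotated[0], rotated[3])

# Per-dimension variance before and after the rotation.
var_before = [pvariance([p[i] for p in points]) for i in range(2)]
var_after = [pvariance([p[i] for p in rotated]) for i in range(2)]

print(f"distance before: {dist_before:.4f}, after: {dist_after:.4f}")
print(f"variance per dim before: {[round(v, 3) for v in var_before]}")
print(f"variance per dim after:  {[round(v, 3) for v in var_after]}")
```

Total variance is the same before and after; only its split across dimensions changes. That is the property a PQ chunking step can benefit from, though whether it improves recall on real embeddings still has to be measured.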

Asymmetric distance computation (ADC)

We don't decompress vectors to search them. Instead, we calculate the distance from the uncompressed query to the compressed database vectors using lookup tables.

Mathematically, the squared Euclidean distance between a query vector $q$ and a quantized vector $y$ (approximated by codebook centroids $C(y)$) is:

$d(q, y)^2 \approx d(q, C(y))^2 = \sum_{j=1}^{m} d(q_j, C_j(y_j))^2$

Here $q_j$ is the $j$-th sub-vector of the query, and $C_j(y_j)$ is the centroid for the $j$-th sub-vector of $y$. The sum is exact for the quantized vector because the chunks cover disjoint dimensions, so squared distances add across subspaces. This is the core of Asymmetric Distance Computation (ADC). Instead of reconstructing an approximate full vector and computing distance against it, you precompute the distance from each query sub-vector to all centroids in each codebook. The query-time distance for any stored vector then becomes a table lookup and sum. The lab below builds those tables once per query, then ranks stored codes by summed partial distances only.

asymmetric-distance-computation-adc.py

import numpy as np

def adc_pq_search(
    query: np.ndarray,
    pq_codes: np.ndarray,
    codebooks: np.ndarray,
    k: int = 10
) -> np.ndarray:
    """
    Approximate nearest neighbor search using Asymmetric Distance Computation (ADC)
    with Product Quantization.

    Args:
        query: (d,) float vector.
        pq_codes: (N, m) uint8 array of stored codes.
        codebooks: (m, 256, d/m) float array of centroids.
    """
    m = pq_codes.shape[1]
    if query.shape[0] % m != 0:
        raise ValueError("query dimension must be divisible by number of PQ chunks")
    d_sub = query.shape[0] // m

    # 1. Precompute distance tables.
    # For each subspace, measure query sub-vector distance to every centroid.
    dist_tables = np.zeros((m, codebooks.shape[1]), dtype=np.float32)
    for i in range(m):
        query_sub = query[i*d_sub : (i+1)*d_sub]
        dist_tables[i, :] = np.sum((codebooks[i] - query_sub) ** 2, axis=1)

    # 2. Aggregate distances with table lookups.
    # Sum partial distances for each vector based on stored code IDs.
    dists = np.zeros(len(pq_codes))
    for i in range(m):
        # Look up precomputed distance for code stored at this position.
        dists += dist_tables[i, pq_codes[:, i]]

    # Distance first, then vector ID for deterministic ties.
    vector_ids = np.arange(len(dists))
    return np.lexsort((vector_ids, dists))[:k]

query = np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32)
codebooks = np.zeros((2, 256, 2), dtype=np.float32)
codebooks[0, 1] = np.array([2.0, 0.0])
codebooks[1, 1] = np.array([0.0, 2.0])
pq_codes = np.array(
    [
        [0, 0],
        [1, 0],
        [0, 1],
        [1, 1],
    ],
    dtype=np.uint8,
)

nearest_ids = adc_pq_search(query, pq_codes, codebooks, k=2).tolist()
print(f"nearest ids: {nearest_ids}")

Output

nearest ids: [0, 1]
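As a sanity check on the formula, the sketch below compares the table-lookup sum against the distance to an explicitly reconstructed vector. The codebooks, query, and code are tiny hypothetical values chosen for readability, not trained centroids.

```python
import math

# Hypothetical toy setup: m = 2 subspaces, 2 dims each, 3 centroids per codebook.
codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)],   # codebook for dims 0-1
    [(0.5, 0.5), (2.0, 1.0), (-1.0, 0.0)],  # codebook for dims 2-3
]
query = (0.25, 1.0, 1.5, -0.5)
code = (2, 1)  # stored PQ code: one centroid id per subspace

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# ADC: one distance table per subspace, built once per query.
tables = [
    [sq_dist(query[2 * j: 2 * j + 2], centroid) for centroid in codebooks[j]]
    for j in range(2)
]
adc = sum(tables[j][code[j]] for j in range(2))

# Reference path: reconstruct the approximate vector, then score it.
reconstructed = codebooks[0][code[0]] + codebooks[1][code[1]]
full = sq_dist(query, reconstructed)

print(f"ADC: {adc}, reconstruct-then-score: {full}")
```

Both paths give the same number; ADC simply avoids materializing the reconstructed vector for every candidate.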

Filtered search

Real queries almost never search the whole corpus. A support tool asks for "tickets like this one, but only from the last 30 days and only for the EU region." That metadata predicate is a filter, and where you apply it changes both recall and latency. Three strategies are useful to understand.

Filter then exact scan first finds rows matching the predicate, then scores only that subset. If the allowed set is small, this is exact and can be efficient. It gets costly as the allowed set grows.

Post-filtering (search, then filter) runs normal ANN search for some candidate count, then discards anything that fails the predicate. This keeps the graph intact, but it has an overfiltering failure mode: if the predicate is selective (say only 1% of vectors match), the initial ANN candidate buffer may contain zero matches, so you return fewer than k results or none at all. Some engines address this by expanding the scan iteratively until enough matching results appear or a budget is exhausted.[1]

Allow-list graph traversal is another useful option. The graph walk only counts allowed nodes toward the result set while it can still traverse disallowed nodes for connectivity. Weaviate documents this pre-filtered HNSW behavior, a sweeping strategy, and an ACORN-inspired option; its documentation recommends ACORN particularly when restrictive filters correlate poorly with the vector neighborhood.[6][7] This isn't a universal winner: a tiny allowed set can make a flat scan cheaper, while broader or poorly correlated filters can benefit from graph traversal.

show-post-filter-underfill.py

candidate_ids = ["eu-1", "us-1", "us-2", "us-3", "eu-2"]
region = {"eu-1": "eu", "us-1": "us", "us-2": "us", "us-3": "us", "eu-2": "eu"}

def post_filter(candidates: list[str], required_region: str, k: int) -> list[str]:
    return [item for item in candidates if region[item] == required_region][:k]

initial_candidates = candidate_ids[:3]
expanded_candidates = candidate_ids

print(f"initial result: {post_filter(initial_candidates, 'eu', k=2)}")
print(f"expanded result: {post_filter(expanded_candidates, 'eu', k=2)}")

Output

initial result: ['eu-1']
expanded result: ['eu-1', 'eu-2']
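The allow-list idea can be sketched on a toy graph as well. The function below is a simplified best-first walk, not any engine's actual HNSW implementation: `allow_list_search`, the 1-D points, the chain graph, and the `budget` parameter are all hypothetical names and values for illustration. Disallowed nodes are still expanded so the walk can pass through them, but only allowed nodes enter the result set.

```python
import heapq

def allow_list_search(graph, points, allowed, query, entry, k, budget):
    """Best-first walk that traverses every node but only returns allowed ones."""
    def dist(node):
        return abs(points[node] - query)

    visited = {entry}
    frontier = [(dist(entry), entry)]  # min-heap ordered by distance to query
    matches = []
    expansions = 0
    while frontier and expansions < budget:
        d, node = heapq.heappop(frontier)
        expansions += 1
        if node in allowed:
            matches.append((d, node))  # counts toward the result set
        for neighbor in graph[node]:   # disallowed nodes still provide connectivity
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(frontier, (dist(neighbor), neighbor))
    return [node for _, node in sorted(matches)[:k]]

# Chain graph 0-1-2-3-4 over 1-D points; only the two endpoints pass the filter.
points = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
allowed = {0, 4}

result = allow_list_search(graph, points, allowed, query=3.9, entry=0, k=1, budget=5)
print(f"filtered nearest: {result}")
```

Node 4 is only reachable from the entry point through disallowed nodes 1, 2, and 3. If the walk refused to step on them, the best allowed match would be unreachable, which is the connectivity problem this strategy is meant to avoid.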

How products package these ideas

Vendor defaults drift faster than the underlying concepts, so skip memorizing a leaderboard table. In interviews and production reviews, ask four concrete questions instead:

Which ANN families are exposed: HNSW, IVF, PQ, disk-backed graphs, or a managed index whose internals aren't exposed?
How does filtered search work: before search, after search, or during traversal?
Where does the index body live: RAM, SSD, or object storage?
Which compression knobs are available, and what recall do they cost?

That framing travels better than product trivia:

Postgres-style extensions usually expose a small set of explicit index types such as HNSW or IVFFlat inside SQL workflows.
General-purpose vector databases often center HNSW for RAM-resident search, then layer in quantization and filter-aware traversal.
Faiss and Faiss-derived stacks expose broad ANN menus, which shifts more tuning responsibility onto the engineer.
Storage-first systems move more bytes out of RAM and onto SSD or object storage, which improves cost efficiency but makes IO behavior part of your latency budget.

The pattern that stays stable is conceptual, not vendor-specific: HNSW is a candidate when you want strong in-memory recall without centroid training, while IVF and PQ are candidates when scan fraction or bytes per vector become central constraints. Choose from measured recall, tail latency, build/update cost, and memory on your workload.

Scaling to billions

When data exceeds standard RAM limits, you usually need one of two moves: compress the vectors more aggressively, or move most of the index body out of RAM. A large incident-search platform tracking billions of trace-summary embeddings faces this exact choice: either compress each incident vector with PQ or keep a sparse graph in memory and store the full index on fast SSD.

1. Hybrid approaches (IVF-PQ)

Combining IVF (to reduce search scope) and PQ (to compress vectors) is a common billion-scale recipe.

Index: IndexIVFPQ in Faiss.
Memory math: In Faiss, IndexIVFPQ stores about ceil(m_pq * nbits / 8) + 8 bytes per vector for the PQ code plus the vector id, where m_pq is the number of PQ sub-vectors/codebooks. A 32-byte code therefore lands near 40 bytes per vector, so 1B vectors is roughly 40 GB before centroids and other overhead.
Trade-off: Much lower memory than raw float32 vectors, but more recall loss and more tuning knobs than HNSW.
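The memory math above can be written as a short calculation. This is a budgeting sketch only: the helper name is hypothetical, the 768-dimension raw comparison is an assumption carried over from the earlier PQ example, and centroids, codebooks, and allocator overhead are deliberately left out.

```python
import math

def ivfpq_bytes_per_vector(m_pq: int, nbits: int = 8, id_bytes: int = 8) -> int:
    """PQ code bytes plus a 64-bit vector id, following the formula in the text."""
    return math.ceil(m_pq * nbits / 8) + id_bytes

n_vectors = 1_000_000_000
per_vector = ivfpq_bytes_per_vector(m_pq=32)       # 32-byte code + 8-byte id
total_gb = per_vector * n_vectors / 1e9
raw_gb = 768 * 4 * n_vectors / 1e9                 # assumed 768-dim float32 baseline

print(f"bytes per vector: {per_vector}")
print(f"IVF-PQ codes + ids: {total_gb:.0f} GB")
print(f"raw float32 vectors: {raw_gb:.0f} GB")
```

The gap between the two totals is what makes a single-node deployment plausible at this scale; the remaining overhead still has to be measured in the engine you choose.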

2. Disk-based indexes (DiskANN)

Microsoft's DiskANN[8] shows that you can serve a billion-point graph index from a single node by keeping a compact in-memory structure and placing the large index body on SSD.

Mechanism: The system is graph-based, but optimized around SSD access patterns rather than assuming every edge traversal hits RAM.
IO-aware design: Search quality now depends on storage layout and read amplification, not graph degree and beam width alone.
Performance: The NeurIPS 2019 paper abstract reports >5000 queries per second, <3 ms mean latency, and 95%+ 1-recall@1 on SIFT1B on a 16-core machine with 64 GB RAM plus SSD. Treat those numbers as benchmark-specific, not a universal promise for every embedding workload.

3. Distributed sharding

If single-node vertical scaling fails, you can shard horizontally with scatter-gather.

Distributed vector search scatter-gather diagram showing query fan-out to shards, shard-local top-k packets, latency bars, and the broker waiting for the slowest shard before global merge. — Full fan-out sharding cuts local work, but a complete global top-k still waits for the slowest shard packet before the broker can merge results.

Strategy: Split 1B vectors into $S$ shards (e.g., 10 shards of 100M each).
Query: The gateway node broadcasts the query to all $S$ nodes (Scatter).
Merge: Each node returns its top-$k$. The gateway merges these lists to find the global top-$k$ (Gather).
Bottlenecks: Under full fan-out, throughput doesn't scale linearly due to coordination overhead, and tail latency (P99) is exposed to the slowest shard.
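The gather step can be sketched in a few lines. The shard results and latencies below are made-up placeholder values, and `merge_top_k` is a hypothetical helper; a real gateway would also handle timeouts, retries, and partial results.

```python
import heapq

def merge_top_k(shard_results, k):
    """Merge shard-local (distance, vector_id) lists into a global top-k."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nsmallest(k, all_hits)

# Placeholder shard-local top-2 lists and per-shard response times.
shard_results = [
    [(0.12, "s0-7"), (0.40, "s0-3")],
    [(0.05, "s1-9"), (0.33, "s1-2")],
    [(0.20, "s2-4"), (0.21, "s2-8")],
]
shard_latency_ms = [4.0, 5.5, 19.0]

global_top = merge_top_k(shard_results, k=3)
gather_latency_ms = max(shard_latency_ms)  # a full fan-out waits for the slowest shard

print(f"global top-3: {global_top}")
print(f"gather latency before merge cost: {gather_latency_ms} ms")
```

Two shards answered in under 6 ms, yet the request cannot finish before the 19 ms shard reports, because a complete global top-$k$ needs every shard's list.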

When vector indexes break

Common pitfalls

These are the symptoms, causes, and fixes teams hit most often in production vector search.

Symptom: recall drops after a data shift
Cause: You trained an IVF index on English product descriptions, then started adding Spanish descriptions. The old centroids no longer represent the new data distribution, so queries land in the wrong Voronoi cells.
Fix: Retrain centroids on a representative sample that includes the new data. Monitor recall@k on a held-out validation set and trigger retraining when it drifts below a threshold.
Symptom: HNSW index causes OOM crashes
Cause: An uncompressed HNSW index stores graph adjacency data on top of raw vectors. Edge representation, layer distribution, ids, allocator overhead, and implementation choices all affect the real total.
Fix: Measure bytes per indexed vector in the chosen engine at production-like M and dimension settings, then reserve headroom for builds, queries, operating-system buffers, and metadata. If RAM is tight, evaluate compression or an IVF-PQ design against recall requirements.
Symptom: search is fast but returns poor matches
Cause: The ANN budget is too small for the index family you chose. For IVF that often means nprobe is too low, so you inspect too few lists. For HNSW it often means ef_search is barely above k, so the beam never explores enough candidates.
Fix: Diagnose by index family, not with one generic knob story. Measure recall@k on a labeled set, then raise nprobe for IVF or ef_search for HNSW until recall meets the product target.
Symptom: switching from L2 to cosine breaks IVF results
Cause: You changed the distance metric at search time but didn't retrain the coarse quantizer on normalized vectors. The centroids were learned in L2 geometry, so they misplace vectors after normalization.
Fix: Normalize training vectors before calling index.train(), and normalize both database and query vectors before index.add() and index.search().
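The reason consistent normalization works is the identity $\|a - b\|^2 = 2 - 2\cos(a, b)$ for unit vectors: on normalized data, ranking by L2 distance and ranking by cosine similarity agree. The sketch below checks that identity in plain Python with two hypothetical vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a, b = [3.0, 4.0], [4.0, 3.0]          # hypothetical raw embeddings
ua, ub = normalize(a), normalize(b)

sq_l2_on_unit = sum((x - y) ** 2 for x, y in zip(ua, ub))
from_cosine = 2 - 2 * cosine(a, b)

print(f"squared L2 on unit vectors: {sq_l2_on_unit:.4f}")
print(f"2 - 2*cos:                  {from_cosine:.4f}")
```

The identity only holds when training, database, and query vectors are all normalized. Mixing normalized queries with centroids trained on unnormalized vectors breaks the equivalence, which is the failure described above. In Faiss, `faiss.normalize_L2` performs the in-place normalization on a float32 matrix.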

Follow-up questions

Your corpus updates constantly and you need online inserts. What HNSW maintenance caveat matters most?

HNSW handles inserts naturally because a new node can be attached incrementally. Deletes are harder because removing nodes can damage graph connectivity. Many systems therefore use tombstones and periodic rebuilds or compaction instead of pretending deletion is free.

You have 10M vectors and only 16 GB RAM. Why might IVF-PQ beat HNSW even if HNSW usually has stronger recall-latency behavior in RAM?

Because memory is the hard constraint. IVF-PQ compresses vectors aggressively enough to stay resident where HNSW may spend too much RAM on raw vectors plus graph edges. HNSW is often better when the working set already fits; IVF-PQ becomes the practical choice when bytes per vector decide feasibility.
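A rough estimate shows how this plays out. Everything here is an assumption for illustration: 768-dimensional float32 vectors, `M = 16`, up to `2 * M` four-byte links per node at layer 0 (a layout used by some HNSW implementations), and a 96-byte PQ code. Upper-layer links, centroids, and allocator overhead are ignored, so treat the HNSW figure as a lower bound.

```python
def hnsw_bytes_per_vector(dim: int, M: int, id_bytes: int = 8, link_bytes: int = 4) -> int:
    """Rough lower bound: raw float32 vector + layer-0 links + id (assumed layout)."""
    return dim * 4 + 2 * M * link_bytes + id_bytes

def ivfpq_bytes_per_vector(m_pq: int, nbits: int = 8, id_bytes: int = 8) -> int:
    """PQ code bytes (rounded up) plus a 64-bit id."""
    return (m_pq * nbits + 7) // 8 + id_bytes

n_vectors = 10_000_000
ram_budget_gb = 16

hnsw_gb = n_vectors * hnsw_bytes_per_vector(dim=768, M=16) / 1e9
ivfpq_gb = n_vectors * ivfpq_bytes_per_vector(m_pq=96) / 1e9

print(f"HNSW (uncompressed) estimate: {hnsw_gb:.2f} GB")
print(f"IVF-PQ estimate:              {ivfpq_gb:.2f} GB")
print(f"fits in {ram_budget_gb} GB: HNSW={hnsw_gb < ram_budget_gb}, IVF-PQ={ivfpq_gb < ram_budget_gb}")
```

With smaller embeddings or a quantized HNSW variant the answer can flip, so rerun the estimate with your own dimension and engine before deciding.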

A teammate wants to reconstruct every PQ vector before scoring it because it "sounds simpler." Why is that the wrong instinct?

Because ADC is the whole trick that keeps PQ search efficient. Query-time work should be lookup-table additions, not full-vector reconstruction for every candidate. Reconstructing every vector throws away much of the compute benefit of compressed search.
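A back-of-the-envelope count makes the difference visible. The sketch below counts scalar component operations only and ignores memory traffic, SIMD, and cache effects; the candidate count and dimensions are assumed values, and the function names are hypothetical.

```python
def adc_ops(n: int, d: int, m: int, centroids: int = 256) -> int:
    """Table build once per query, then m lookups-and-adds per candidate."""
    table_build = m * centroids * (d // m)
    lookups = n * m
    return table_build + lookups

def reconstruct_ops(n: int, d: int) -> int:
    """Score each candidate against a fully reconstructed d-dimensional vector."""
    return n * d

n, d, m = 1_000_000, 768, 96   # assumed scanned candidates, dimension, PQ chunks
adc = adc_ops(n, d, m)
full = reconstruct_ops(n, d)

print(f"ADC component ops:         {adc:,}")
print(f"reconstruct component ops: {full:,}")
print(f"ratio: {full / adc:.1f}x")
```

The table build is a fixed cost per query, so the advantage grows with the number of candidates scanned and approaches $d / m$.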

Distributed search added more shards, but p99 got worse. What probably happened?

You shrank local indexes but increased fan-out and made the request wait on more shard responses. Scatter-gather throughput can rise while p99 still gets worse if one slow shard or merge overhead dominates.
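The fan-out effect follows from simple probability. Assuming each shard independently exceeds its own p99 latency 1% of the time (an idealization; real shards often share causes of slowness), the chance that a request waits on at least one slow shard is $1 - 0.99^S$.

```python
def prob_any_slow(shards: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of `shards` independent responses is slow."""
    return 1 - (1 - p_slow) ** shards

for shards in (1, 10, 50, 100):
    print(f"{shards:>3} shards: {prob_any_slow(shards):.1%} of requests hit a slow shard")
```

What was a 1-in-100 event for a single shard becomes the common case at high fan-out, which is why adding shards can raise p99 even while each shard does less work.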

Evaluation rubric

Defend these points without relying on metadata or hidden notes:

HNSW search starts at sparse upper layers, uses greedy routing there, then uses a bounded best-first search at layer 0.
M, ef_construction, and ef_search move recall, latency, build time, and memory in different directions.
IVF reduces search work by partitioning vectors into nlist inverted lists and probing only nprobe of them.
Product Quantization reduces memory by splitting vectors into subspaces and storing centroid IDs instead of float32 values.
IVF-PQ accepts quantization error in exchange for much lower memory, while HNSW spends memory and build time to get strong in-RAM recall and latency.

Key skills

After the labs above, practice:

Estimate flat-search work. Given $N$ vectors and $d$ dimensions, calculate raw memory and scored components per query, then state what must be benchmarked before making a latency claim.
Trace HNSW search. Explain why the algorithm starts at the top layer, how greedy descent works, and why ef_search controls the recall-latency trade-off.
Count IVF comparisons. Given nlist and nprobe, estimate the fraction of the database that gets scanned.
Choose an index family. State whether HNSW or IVF-PQ is a better fit when the constraint is "16 GB RAM for 10M vectors" versus "sub-50 ms latency at 99% recall."
Spot common misconfigurations. Recognize stale centroids, underestimating graph memory, and metric mismatches from their symptoms.
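The first and third skills are arithmetic, so a small calculator helps when practicing. The helper names are hypothetical, the corpus matches the 100 million by 1,536 example from the start of the article, and the IVF estimate assumes perfectly balanced lists, which real indexes only approximate.

```python
def flat_scan_work(n: int, d: int, bytes_per_component: int = 4) -> dict:
    """Scored components per query and raw storage for an exact flat scan."""
    components = n * d
    return {"components": components, "raw_gb": components * bytes_per_component / 1e9}

def ivf_scan_fraction(nlist: int, nprobe: int) -> float:
    """Fraction of the database scanned, assuming balanced inverted lists."""
    return nprobe / nlist

n, d = 100_000_000, 1536
flat = flat_scan_work(n, d)
fraction = ivf_scan_fraction(nlist=4096, nprobe=16)
scanned = n * fraction

print(f"flat: {flat['components']:,} components, {flat['raw_gb']:.1f} GB raw")
print(f"IVF: scans {fraction:.4%} of vectors, about {scanned:,.0f} per query")
```

The IVF figure excludes the `nlist` centroid comparisons needed to pick which lists to probe, and neither number is a latency claim; both only bound the work that a benchmark then has to time.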

Index family trade-offs

Index family	Core idea	Main strength	Main cost	Reach for it when
HNSW	Hierarchical proximity graph	Often strong in-RAM recall/latency behavior	Graph memory and slower builds	Low latency matters, the working set fits in RAM, and evaluation supports it
IVF-Flat	Coarse partitioning plus exact scan inside probed lists	Simple and tunable	Needs training and still stores raw vectors	You want controllable scan fraction without graph complexity
IVF-PQ	IVF plus compressed vector codes	Much lower bytes per vector	Quantization error and more tuning	Memory cost dominates and some recall loss is acceptable
DiskANN	SSD-backed graph index	Single-node scale beyond RAM	Storage-aware engineering complexity	The dataset is larger than RAM but one fast SSD-backed node is still attractive

Evaluate HNSW when you can afford RAM and need low-latency high-recall retrieval. Evaluate IVF-PQ when raw vector storage becomes the bottleneck. Consider DiskANN-style systems when a graph index remains attractive but the index body no longer fits in memory. In each case, release from a recall-and-latency curve measured on representative queries.
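Measuring that curve starts with recall@k against exact results. The sketch below is one minimal way to compute it and to pick an operating point. The id lists and the sweep rows are placeholder numbers, not measurements, and `recall_at_k` is a hypothetical helper rather than a library function.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k that the ANN index also returned."""
    if len(approx_ids) != len(exact_ids):
        raise ValueError("need one approximate result list per query")
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return hits / (k * len(exact_ids))

# Placeholder results for two queries: exact top-3 vs ANN top-3.
exact = [[1, 2, 3], [4, 5, 6]]
approx = [[1, 3, 9], [4, 5, 6]]
recall = recall_at_k(approx, exact, k=3)

# Placeholder sweep rows: (ef_search, recall@k, p99 latency in ms).
sweep = [(16, 0.81, 2.1), (64, 0.95, 4.8), (256, 0.99, 14.0)]
target_recall = 0.95
chosen = min((row for row in sweep if row[1] >= target_recall), key=lambda row: row[2])

print(f"recall@3: {recall:.3f}")
print(f"cheapest setting meeting {target_recall:.0%} recall: ef_search={chosen[0]}")
```

The same selection logic applies to `nprobe` for IVF: sweep the knob, record recall and tail latency on representative queries, and release the cheapest setting that meets the target.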

Next Step

Continue to Advanced RAG: HyDE & Self-RAG

There, you'll move one layer above the index and learn how query rewriting, HyDE, Self-RAG, and Corrective RAG change retrieval work before generation. The index trade-offs still matter: low recall or high latency can come from retrieval control, `ef_search`, or `nprobe`, and those causes need different fixes.


References

[1] pgvector
pgvector contributors · 2026 · GitHub · https://github.com/pgvector/pgvector

[2] Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs
Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence

[3] Billion-scale similarity search with GPUs
Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint

[4] Product Quantization for Nearest Neighbor Search
Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI

[5] Optimized Product Quantization
Ge, T., He, K., Ke, Q., & Sun, J. · 2013 · https://openaccess.thecvf.com/content_cvpr_2013/html/Ge_Optimized_Product_Quantization_2013_CVPR_paper.html

[6] ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
Patel, L., Kraft, P., Guestrin, C., & Zaharia, M. · 2024 · https://arxiv.org/abs/2403.04871

[7] Filtering
Weaviate · 2026 · https://docs.weaviate.io/weaviate/concepts/filtering

[8] DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
Subramanya, S. J., et al. · 2019 · NeurIPS · https://proceedings.neurips.cc/paper/2019/hash/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Abstract.html
