LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnAdvanced Agents & RetrievalVector DB Internals: HNSW & IVF
📐HardEmbeddings & Vector Search

Vector DB Internals: HNSW & IVF

Learn how approximate nearest neighbor indexes use HNSW, IVF, and Product Quantization to balance speed, recall, and memory in production vector databases.

38 min read
Learning path
Step 109 of 155 in the full curriculum
Recursive Language Models (RLM)Advanced RAG: HyDE & Self-RAG

Vector DB Internals: HNSW & IVF

Vector database internals explain why similarity search can be fast without scanning every embedding. This chapter covers Hierarchical Navigable Small World (HNSW), inverted file (IVF) indexing, recall-latency tradeoffs, and the operational knobs behind nearest-neighbor retrieval.

Recursive Language Models showed how to organize compute over huge external context. This chapter starts the retrieval section from the other side: before an agent or RAG pipeline can reason over documents, the system has to find the candidate chunks quickly enough to fit an interactive product.

Imagine finding a specific returns-support ticket in an archive with millions of cases, but instead of an exact ticket ID, you only have a vague description of the customer problem. If you had to check every case one by one, the search would take too long for an interactive product. This is the challenge of similarity search in modern AI systems.

When you ask an AI model a question, it converts your query into a vector (a long list of numbers). To find relevant answers, for example in a RAG (Retrieval-Augmented Generation) system where an LLM grounds its responses in your documents, the database must search through millions or billions of other vectors to find the ones most similar to yours.

This article builds on the embedding idea you learned earlier: text gets turned into vectors so a machine can compare meaning. Here, we'll look at how a production database actually finds the closest vectors when there are millions of them. By the end, you'll know how to choose between the two main index families, tune their knobs, and spot the mistakes that make vector search slow or inaccurate in production.

The brute-force problem

The simplest method, called Exact Search (or k-NN for k-Nearest Neighbors), calculates the distance from your query vector to every vector in the database.

Mathematically, given a query qqq and a database of NNN vectors each with ddd dimensions, exact search has O(N⋅d)O(N \cdot d)O(N⋅d) complexity. That notation means the work grows in direct proportion to both the number of vectors and the number of dimensions.

Let's make that concrete. Suppose you have 100 million product-review embeddings, each with 1,536 dimensions, stored as 4-byte floats. A flat query scores all 100 million vectors, or 153.6 billion scalar components before considering vectorization, batching, memory bandwidth, or accelerator hardware. Whether that fits an SLA is a benchmark question, not something dimensionality alone can answer.

The memory picture is just as stark:

100,000,000 vectors×1,536 dims×4 bytes≈614 GB≈572 GiB100{,}000{,}000 \text{ vectors} \times 1{,}536 \text{ dims} \times 4 \text{ bytes} \approx 614 \text{ GB} \approx 572 \text{ GiB}100,000,000 vectors×1,536 dims×4 bytes≈614 GB≈572 GiB

That's raw vector storage alone, before any index overhead (the lab below reports the same quantity in GiB, the binary units RAM is usually measured in). A single ordinary server won't keep that raw collection in RAM. Approximate Nearest Neighbor (ANN) indexes reduce scored candidates by accepting that some true neighbors may be missed.

estimate-flat-storage-and-work.py
1def gib_for_float32_vectors(count: int, dimensions: int) -> float: 2 return count * dimensions * 4 / (1024 ** 3) 3 4def scalar_components_scored(count: int, dimensions: int) -> int: 5 return count * dimensions 6 7n_vectors = 100_000_000 8dimensions = 1_536 9 10print(f"raw float32 storage: {gib_for_float32_vectors(n_vectors, dimensions):.1f} GiB") 11print(f"scalar components scored per flat query: {scalar_components_scored(n_vectors, dimensions):,}")
Output
1raw float32 storage: 572.2 GiB 2scalar components scored per flat query: 153,600,000,000

For large, weakly filtered collections, ANN is usually the practical serving path. Exact flat search still matters: it's the ground truth for measuring recall, and it can be the right production execution path when a metadata filter leaves only a small candidate set. For example, pgvector performs exact nearest-neighbor search by default and documents exact search as useful for selective filter conditions.[1]

The main quality metric is Recall@k: of the true top-k nearest neighbors, how many did your ANN index recover? Latency numbers without recall are misleading because an index can look fast simply by missing good candidates. A system that returns random vectors in 1 ms has terrible recall. A system that takes 20 ms and finds 99 of the true top 100 has excellent recall.

measure-recall-at-k.py
1def recall_at_k(exact_ids: list[int], approximate_ids: list[int]) -> float: 2 expected = set(exact_ids) 3 return len(expected.intersection(approximate_ids)) / len(expected) 4 5exact_top_5 = [10, 21, 34, 55, 89] 6fast_but_weak = [10, 21, 8, 13, 5] 7slower_but_better = [10, 21, 34, 55, 3] 8 9print(f"fast candidate recall@5: {recall_at_k(exact_top_5, fast_but_weak):.0%}") 10print(f"better candidate recall@5: {recall_at_k(exact_top_5, slower_but_better):.0%}")
Output
1fast candidate recall@5: 40% 2better candidate recall@5: 80%

Key insight: Latency without recall is incomplete. Keep an exact-search evaluation set, measure ANN recall against it, and choose a latency-recall operating point that satisfies the product requirement.

Vector database internals overview comparing an HNSW graph path with IVF probing and PQ compression. Vector database internals overview comparing an HNSW graph path with IVF probing and PQ compression.
HNSW routes through a layered proximity graph. IVF scores coarse centroids and probes selected lists; IVF-Flat scores stored vectors directly, while IVF-PQ can score compressed codes with ADC lookup tables.

HNSW (Hierarchical Navigable Small World)

HNSW is a widely used in-memory ANN index because it can provide a strong recall-latency trade-off without a separate centroid-training phase.[2] Whether it beats another index on your data still needs measurement under the same memory and recall constraints.

Core concept

HNSW is based on a multi-layer graph structure that mimics how people find things in the real world. Think of zooming in on a map: you start at the continent level, narrow down to a country, then a city, then a street. Each layer has more detail than the one above it.

  • Top Layers: Sparse graphs with long-range links. These are used for coarse navigation to quickly "zoom in" to the target neighborhood.
  • Bottom Layer (Layer 0): A dense graph containing all vectors. This is used for fine-grained local search.

This hierarchical structure is analogous to a skip list (a data structure that uses multiple layers of linked lists to allow fast search by skipping over intermediate nodes), but it's generalized for graphs.

HNSW search path descending from sparse upper graph layers into dense layer 0 before returning top neighbors. HNSW search path descending from sparse upper graph layers into dense layer 0 before returning top neighbors.
HNSW spends only a few hops in sparse upper layers, then widens the search at dense layer 0 where recall is actually won or lost.

In the figure, Layer 2 has only a few nodes with long-range links, Layer 1 holds more nodes for regional refinement, and Layer 0 contains every node in a dense local graph. The parameters M, ef_search, and Layer 0 density (often 2M) control the recall-latency trade-off.

A worked example: tracing one insertion

To see how the layers work in practice, imagine you are building an HNSW index with just 8 vectors. The graph already has 7 vectors arranged across three layers. You now insert vector 8.

  1. Sample a level. A random process decides that vector 8 will live on Layers 0 and 1 (but not Layer 2).
  2. Zoom in. The entry point is on Layer 2. You greedily walk Layer 2 to find the node closest to vector 8, then drop to Layer 1.
  3. Connect on Layer 1. On Layer 1, you run a beam search with ef_construction candidates, pick the best neighbors (up to M), and add bidirectional edges. You also check if any existing neighbor now has too many edges and prune if needed.
  4. Drop to Layer 0. You repeat the connection step on Layer 0, but this time you allow up to 2M neighbors because the base layer is denser.
  5. Update entry point. If vector 8 had been assigned to Layer 2 or 3, it would become the new entry point because it's now the highest node in the graph.

This incremental construction is why HNSW doesn't need a training phase. Every new vector is wired into the existing graph using local search.

Insert algorithm

The code below is simplified educational pseudocode. The key steps are: sample a level, zoom in through upper layers, connect on each layer, and update the entry point if the new node is taller than the current graph.

insert-algorithm.py
1import math 2import random 3from typing import Protocol, Sequence 4 5class HNSWGraph(Protocol): 6 entry_point: int | None 7 max_level: int 8 9 def neighbors(self, node_id: int, level: int) -> Sequence[int]: ... 10 11 def distance(self, query_vector: Sequence[float], node_id: int) -> float: ... 12 13 def add_node( 14 self, node_id: int, vector: Sequence[float], max_level: int 15 ) -> None: ... 16 17 def add_bidirectional_edge(self, left_id: int, right_id: int, level: int) -> None: ... 18 19 def prune_if_needed(self, node_id: int, level: int, max_degree: int) -> None: ... 20 21def greedy_search(graph, query_vector, entry_id, level) -> int: 22 """Walk this layer until no neighbor is closer to query_vector.""" 23 current = entry_id 24 improved = True 25 while improved: 26 improved = False 27 for neighbor in graph.neighbors(current, level): 28 if graph.distance(query_vector, neighbor) < graph.distance(query_vector, current): 29 current = neighbor 30 improved = True 31 return current 32 33def search_layer( 34 graph: HNSWGraph, 35 query_vector: Sequence[float], 36 entry_id: int, 37 level: int, 38 ef: int, 39) -> list[int]: 40 """Best-first search that returns up to ef candidate ids in one layer.""" 41 import heapq 42 visited = {entry_id} 43 entry_dist = graph.distance(query_vector, entry_id) 44 candidates = [(entry_dist, entry_id)] # min-heap by distance 45 results = [(-entry_dist, entry_id)] # max-heap of the best ef via negative distance 46 47 while candidates: 48 dist, current = heapq.heappop(candidates) 49 worst_result_dist = -results[0][0] 50 if len(results) >= ef and dist > worst_result_dist: 51 break 52 for neighbor in graph.neighbors(current, level): 53 if neighbor not in visited: 54 visited.add(neighbor) 55 d = graph.distance(query_vector, neighbor) 56 if len(results) < ef or d < -results[0][0]: 57 heapq.heappush(candidates, (d, neighbor)) 58 heapq.heappush(results, (-d, neighbor)) 59 if len(results) > ef: 60 heapq.heappop(results) 61 return [ 62 node 63 for distance, node in sorted((-neg_distance, node) for neg_distance, node in results) 64 ] 65 66def select_neighbors_placeholder( 67 graph, query_vector, candidate_ids, level, max_degree 68) -> list[int]: 69 """Teaching simplification: take nearest candidates by distance.""" 70 # Production HNSW applies its diversity heuristic here to improve connectivity. 71 candidates = sorted(candidate_ids, 72 key=lambda nid: graph.distance(query_vector, nid)) 73 return candidates[:max_degree] 74 75def sample_level(mL: float) -> int: 76 u = max(random.random(), 1e-12) 77 return math.floor(-math.log(u) * mL) 78 79def hnsw_insert( 80 graph: HNSWGraph, 81 new_id: int, 82 new_vector: Sequence[float], 83 M: int = 16, 84 ef_construction: int = 200, 85 mL: float = 0.36, 86) -> None: 87 """ 88 Insert one vector into an HNSW index. 89 90 Typical implementations allow up to 2*M neighbors on layer 0 91 and up to M neighbors on upper layers. 92 """ 93 new_level = sample_level(mL) 94 95 if graph.entry_point is None: 96 graph.add_node(new_id, new_vector, max_level=new_level) 97 graph.entry_point = new_id 98 graph.max_level = new_level 99 return 100 101 graph.add_node(new_id, new_vector, max_level=new_level) 102 entry_id = graph.entry_point 103 104 # Phase 1: greedy descent through layers above the new node's top layer. 105 for level in range(graph.max_level, new_level, -1): 106 entry_id = greedy_search(graph, new_vector, entry_id, level) 107 108 # Phase 2: connect the new node from its top layer down to layer 0. 109 for level in range(min(new_level, graph.max_level), -1, -1): 110 candidate_ids = search_layer( 111 graph, new_vector, entry_id, level, ef=ef_construction 112 ) 113 max_degree = 2 * M if level == 0 else M 114 neighbors = select_neighbors_placeholder( 115 graph, new_vector, candidate_ids, level, max_degree 116 ) 117 118 for neighbor_id in neighbors: 119 graph.add_bidirectional_edge(new_id, neighbor_id, level) 120 graph.prune_if_needed(neighbor_id, level, max_degree) 121 122 # Use the closest candidate as the entry point for the next lower layer. 123 entry_id = min(candidate_ids, 124 key=lambda nid: graph.distance(new_vector, nid)) 125 126 if new_level > graph.max_level: 127 graph.entry_point = new_id 128 graph.max_level = new_level

Search algorithm

Search mirrors insertion in reverse. The algorithm starts at the sparse top layers for fast coarse routing, then runs a bounded best-first search at layer 0. That last step isn't exhaustive over the full dataset. It only explores a candidate set of size ef_search, which is why raising ef_search improves recall at the cost of more work. On good graphs, search often behaves close to logarithmic, but HNSW doesn't provide a hard worst-case logarithmic guarantee.

search-algorithm.py
1import heapq 2from dataclasses import dataclass 3from typing import Sequence 4 5@dataclass 6class ToyHNSWGraph: 7 vectors: dict[int, tuple[float, ...]] 8 edges: dict[tuple[int, int], list[int]] 9 entry_point: int 10 max_level: int 11 12 def neighbors(self, node_id: int, level: int) -> Sequence[int]: 13 return self.edges.get((node_id, level), []) 14 15 def distance(self, query_vector: Sequence[float], node_id: int) -> float: 16 vector = self.vectors[node_id] 17 return sum((left - right) ** 2 for left, right in zip(query_vector, vector)) 18 19def greedy_search( 20 graph: ToyHNSWGraph, query_vector: Sequence[float], entry_id: int, level: int 21) -> int: 22 current = entry_id 23 improved = True 24 while improved: 25 improved = False 26 for neighbor in graph.neighbors(current, level): 27 if graph.distance(query_vector, neighbor) < graph.distance(query_vector, current): 28 current = neighbor 29 improved = True 30 return current 31 32def search_layer( 33 graph: ToyHNSWGraph, 34 query_vector: Sequence[float], 35 entry_id: int, 36 level: int, 37 ef: int, 38) -> list[int]: 39 visited = {entry_id} 40 entry_dist = graph.distance(query_vector, entry_id) 41 candidates = [(entry_dist, entry_id)] 42 results = [(-entry_dist, entry_id)] 43 44 while candidates: 45 dist, current = heapq.heappop(candidates) 46 worst_result_dist = -results[0][0] 47 if len(results) >= ef and dist > worst_result_dist: 48 break 49 for neighbor in graph.neighbors(current, level): 50 if neighbor in visited: 51 continue 52 visited.add(neighbor) 53 neighbor_dist = graph.distance(query_vector, neighbor) 54 if len(results) < ef or neighbor_dist < -results[0][0]: 55 heapq.heappush(candidates, (neighbor_dist, neighbor)) 56 heapq.heappush(results, (-neighbor_dist, neighbor)) 57 if len(results) > ef: 58 heapq.heappop(results) 59 60 return [ 61 node 62 for distance, node in sorted((-neg_distance, node) for neg_distance, node in results) 63 ] 64 65def hnsw_search( 66 graph: ToyHNSWGraph, query: Sequence[float], k: int = 10, ef_search: int = 100 67) -> list[int]: 68 """ 69 Performs approximate nearest neighbor search. 70 71 Args: 72 query: The query vector. 73 k: Number of results to return. 74 ef_search: Size of the dynamic candidate list during search. 75 """ 76 if ef_search < k: 77 raise ValueError("ef_search must be at least k") 78 79 current_node = graph.entry_point 80 81 # Phase 1: greedy descent from the top layer down to layer 1. 82 for level in range(graph.max_level, 0, -1): 83 current_node = greedy_search(graph, query, current_node, level) 84 85 # Phase 2: bounded best-first search on layer 0. 86 candidates = search_layer(graph, query, current_node, level=0, ef=ef_search) 87 88 return sorted(candidates, 89 key=lambda nid: graph.distance(query, nid))[:k] 90 91graph = ToyHNSWGraph( 92 vectors={ 93 0: (0.0, 0.0), 94 1: (1.0, 0.0), 95 2: (1.0, 1.0), 96 3: (5.0, 5.0), 97 }, 98 edges={ 99 (0, 1): [1], 100 (1, 1): [0, 2], 101 (2, 1): [1], 102 (0, 0): [1], 103 (1, 0): [0, 2], 104 (2, 0): [1, 3], 105 (3, 0): [2], 106 }, 107 entry_point=0, 108 max_level=1, 109) 110 111nearest_ids = hnsw_search(graph, query=(1.05, 1.0), k=2, ef_search=3) 112print(f"nearest ids: {nearest_ids}")
Output
1nearest ids: [2, 1]

If you set ef_search = 5 but ask for k = 10, most implementations enforce ef_search >= k because you need at least k candidates to return k results. The real risk is subtler: if ef_search is only slightly larger than k, you might miss better neighbors outside the explored candidate set. That's the recall-latency trade-off in action.

Common mistake: Setting ef_search from intuition rather than an evaluation curve. A small beam can miss true neighbors; an unnecessarily large beam spends latency without a useful recall gain. Measure ef_search values against exact top-k results on representative queries.

Key parameters

Tuning HNSW requires balancing the quality of the graph (which impacts recall) against the memory consumption and indexing speed. The most critical parameters control the number of edges and the beam width used during the graph traversal.

ParameterMeaningTypical ValueEffect
MMax neighbors per node on upper layers16-48Higher = better recall, higher memory usage and slower builds.
M0Max neighbors on layer 0 (often 2*M)2*MDenser base layer, better local connectivity, more memory.
ef_constructionCandidate list size during insert100-400Higher = better graph quality, slower indexing.
ef_searchCandidate list size during queryk to a few hundredHigher = better recall, higher latency. Many implementations require ef_search >= k.
mLControls the layer sampling distributionOften near 1/ln⁡(M)1/\ln(M)1/ln(M)Higher = more upper-layer nodes, shallower descents.

ef_search is a query-time knob. You can trade latency for recall dynamically without rebuilding the index.

select-hnsw-operating-point.py
1measurements = [ 2 {"ef_search": 24, "recall": 0.91, "p95_ms": 4.1}, 3 {"ef_search": 48, "recall": 0.97, "p95_ms": 6.8}, 4 {"ef_search": 96, "recall": 0.985, "p95_ms": 12.4}, 5] 6required_recall = 0.96 7latency_budget_ms = 10.0 8 9eligible = [ 10 point for point in measurements 11 if point["recall"] >= required_recall and point["p95_ms"] <= latency_budget_ms 12] 13choice = min(eligible, key=lambda point: point["p95_ms"]) 14print(f"release ef_search={choice['ef_search']} recall={choice['recall']:.3f} p95_ms={choice['p95_ms']}")
Output
1release ef_search=48 recall=0.970 p95_ms=6.8

IVF (Inverted File Index)

While HNSW is graph-based, IVF is a partition-based algorithm. It divides the vector space into distinct regions called Voronoi cells, where every point is closer to one centroid than to any other centroid. During search, the algorithm only looks inside a small subset of those cells.

Core concept: the warehouse zone map

The easiest way to think about IVF is a warehouse zone map. Instead of checking every bin in the building, you find the zone whose centroid is closest to the query and only search those bins. If the item might be in a neighboring zone, you check a few extra zones to protect recall.

The process is broken down into three phases:

  1. Train: Run K-means clustering on a representative sample of your vectors to identify a set of CCC cluster centers, known as centroids. In Faiss terminology, this count is usually called nlist.
  2. Index: During ingestion, assign each vector to its nearest centroid. That places the vector into one inverted list.
  3. Search: When a query vector arrives, first score the centroids, keep the closest nprobe lists, then do exact distance calculations only inside those selected lists.
IVF search flow showing centroid scoring, probing a few lists, then scoring candidates exactly for IVF-Flat or approximately with PQ. IVF search flow showing centroid scoring, probing a few lists, then scoring candidates exactly for IVF-Flat or approximately with PQ.
IVF first scores coarse centroids across full vector space, then probes only a few inverted lists. `nlist` sets partition count; `nprobe` decides how many lists feed candidate scoring.

A worked example: counting comparisons

Suppose you have 1 million embeddings with 128 dimensions. Here's how the operation count breaks down for brute force versus IVF.

Brute force

  • Distance computations per query: 1,000,0001{,}000{,}0001,000,000

IVF with nlist = 1,024 and nprobe = 5

  • Centroid scoring: 1,0241{,}0241,024 distance computations
  • Average vectors per list: 1,000,000/1,024≈9771{,}000{,}000 / 1{,}024 \approx 9771,000,000/1,024≈977
  • Exact scans inside probed lists: 5×977≈4,8855 \times 977 \approx 4{,}8855×977≈4,885
  • Total: roughly 5,9095{,}9095,909 distance computations

That's about a 170x reduction in distance computations. The actual speedup in a real system is smaller because of overhead, but the principle is clear: IVF avoids scanning most of the database.

estimate-ivf-probe-work.py
1def balanced_ivf_scores(total_vectors: int, nlist: int, nprobe: int) -> int: 2 return nlist + round(total_vectors / nlist * nprobe) 3 4total_vectors = 1_000_000 5nlist = 1_024 6nprobe = 5 7estimated_scores = balanced_ivf_scores(total_vectors, nlist, nprobe) 8 9print(f"flat vector scores: {total_vectors:,}") 10print(f"balanced IVF vector and centroid scores: {estimated_scores:,}") 11print(f"estimated reduction: {total_vectors / estimated_scores:.1f}x")
Output
1flat vector scores: 1,000,000 2balanced IVF vector and centroid scores: 5,907 3estimated reduction: 169.3x

Implementation with Faiss

Faiss is a widely used implementation of IVF-style ANN indexing and documents the IndexIVFFlat behavior used below.[3] The example builds an IVF index, trains its coarse quantizer, assigns vectors to cells, and searches the selected lists.

implementation-with-faiss.py
1import faiss 2import numpy as np 3 4def build_ivf_index( 5 vectors: np.ndarray, nlist: int = 1024, nprobe: int = 10 6) -> faiss.IndexIVFFlat: 7 """ 8 Builds an IVF index using Faiss. 9 10 Args: 11 vectors: (N, d) float32 array of data. 12 nlist: Number of coarse centroids / inverted lists. 13 nprobe: Number of inverted lists to visit during search. 14 """ 15 vectors = np.ascontiguousarray(vectors, dtype=np.float32) 16 if vectors.ndim != 2: 17 raise ValueError("vectors must be a 2D array") 18 if vectors.shape[0] < nlist: 19 raise ValueError("nlist cannot exceed the number of training vectors") 20 21 d = vectors.shape[1] 22 23 # 1. Quantizer: index used to assign vectors to centroids. 24 # IndexFlatL2 uses brute-force L2 distance. 25 quantizer = faiss.IndexFlatL2(d) 26 27 # 2. IVF index. 28 index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2) 29 30 # 3. Train: run k-means to learn centroids from representative data. 31 index.train(vectors) 32 33 # 4. Add: assign all vectors to their nearest centroid. 34 index.add(vectors) 35 36 # 5. Configure search scope. 37 index.nprobe = min(nprobe, nlist) 38 39 return index 40 41left_cluster = np.column_stack((np.linspace(0.0, 0.39, 40), np.zeros(40))) 42right_cluster = np.column_stack((10.0 + np.linspace(0.0, 0.39, 40), np.full(40, 10.0))) 43vectors = np.vstack((left_cluster, right_cluster)).astype(np.float32) 44 45index = build_ivf_index(vectors, nlist=2, nprobe=2) 46distances, ids = index.search(np.array([[0.02, 0.0]], dtype=np.float32), k=3) 47 48print(f"trained: {index.is_trained}") 49print(f"stored vectors: {index.ntotal}") 50print(f"nearest ids: {ids[0].tolist()}")
Output
1trained: True 2stored vectors: 80 3nearest ids: [2, 1, 3]

Recall vs nprobe: The nprobe parameter is the most important knob for tuning an IVF index in production. It controls how many inverted lists are visited during search. If the inverted lists were perfectly balanced, the scanned fraction would be roughly nprobe / nlist. Real systems deviate from that because list sizes are uneven, but the intuition is still useful: more probes mean higher recall and more work.

nprobe regimeWhat gets scannedPractical effect
Very small (1 to a few lists)Only the closest coarse cellsLowest latency, highest miss rate.
ModerateA small fraction of the databaseUsually the best operating point.
Large (nprobe close to nlist)Most or all inverted listsIndexIVFFlat approaches exhaustive search and often loses its speed advantage.

Distance metrics

The choice of distance metric affects both training and search. Faiss supports L2 and inner product directly across its main ANN indexes, and cosine similarity is usually implemented by normalizing vectors and then using inner product search.

L2 distance is the standard Euclidean objective. It's a good fit when vector magnitude carries signal or when your model was trained with an L2-style loss.

Inner product (IP) is the right objective for maximum inner product search. Query norm doesn't affect ranking, but database vector norms do, so large-norm items can dominate if that's how your embeddings behave.

Cosine similarity ignores magnitude and keeps only direction. In practice, you normalize both database vectors and query vectors, then use inner product search. On normalized vectors, L2 and inner product give equivalent rankings because ∣∣x−y∣∣2=2−2⟨x,y⟩||x-y||^2 = 2 - 2\langle x, y \rangle∣∣x−y∣∣2=2−2⟨x,y⟩.

MetricWhat it measuresBest whenIVF retrain needed?
L2 (Euclidean)Straight-line distance in spaceMagnitude carries signalYes, if switching from another metric
Inner productx⋅yx \cdot yx⋅yYou want maximum dot-product rankingYes, if switching from another metric
CosineDirection only, no magnitudeYou only care about semantic similarityYes, on normalized vectors

Metric choice changes the geometry seen by the coarse quantizer. If you switch an IVF system from raw L2 search to cosine search, retrain the centroids on normalized vectors rather than reusing the old clustering.

verify-cosine-inner-product-ranking.py
1import numpy as np 2 3vectors = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]], dtype=np.float32) 4query = np.array([1.0, 0.8], dtype=np.float32) 5 6normalized_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True) 7normalized_query = query / np.linalg.norm(query) 8inner_product_scores = normalized_vectors @ normalized_query 9l2_scores = np.sum((normalized_vectors - normalized_query) ** 2, axis=1) 10 11print(f"inner-product rank: {np.argsort(-inner_product_scores).tolist()}") 12print(f"L2 rank after normalization: {np.argsort(l2_scores).tolist()}")
Output
1inner-product rank: [1, 0, 2] 2L2 rank after normalization: [1, 0, 2]

HNSW vs IVF

FeatureHNSWIVF-Flat
CategoryProximity graphClustering / inverted file
Search costEmpirically sublinear on suitable well-built graphs; benchmark on your dataCentroid scoring plus scanning roughly nprobe / nlist of the data if lists are balanced
MemoryHigh (raw vectors + graph edges)High (raw vectors + ids), but lower than HNSW
Training stepNoneRequired to learn centroids
Index updatesInserts are natural; deletes are implementation-specificInserts are easy, but centroid quality can drift over time
RecallOften strong at a fixed in-RAM latency budget; workload dependentTunable via nprobe; workload dependent
Build timeSlowFaster once centroids are trained
Best forLow-latency, high-recall search in RAMLarge datasets where you want a controllable scan fraction or a base for PQ

The biggest operational difference is that IVF requires a training step on representative data before indexing. If your vector distribution drifts over time, stale coarse centroids can unbalance lists or lower recall. HNSW doesn't have that centroid-training dependency, but commonly pays with more memory and slower construction.

Product Quantization (PQ)

For billion-scale datasets, storing full float32 vectors in RAM is usually impractical on a single RAM-resident node.

109 vectors×768 dims×4 bytes≈3 TB RAM10^9 \text{ vectors} \times 768 \text{ dims} \times 4 \text{ bytes} \approx 3 \text{ TB RAM}109 vectors×768 dims×4 bytes≈3 TB RAM

This calculation shows that one billion vectors, each with 768 float32 dimensions (4 bytes each), requires about 3 TB just for raw vector storage. That's before index overhead, operating-system buffers, or any other application memory.

Product Quantization (PQ) compresses vectors by breaking them into smaller chunks and quantizing those chunks independently.[4] Combined with IVF, PQ is one established way to make billion-point memory budgets tractable; whether a deployment also needs sharding depends on code size, metadata, throughput, and availability requirements.

The algorithm

Product Quantization operates by dividing the high-dimensional space into several lower-dimensional subspaces and quantizing each subspace independently. This avoids the "curse of dimensionality" that makes standard K-Means clustering computationally infeasible for very large vectors.

The compression process follows three distinct steps:

  1. Split: Divide the original ddd-dimensional vector into mmm smaller sub-vectors, each having a dimension of d∗=d/md^* = d/md∗=d/m. For instance, a 768-dimension vector split into 96 sub-vectors would yield sub-vectors of dimension 8.
  2. Cluster: For each of the mmm independent subspaces, run K-Means clustering over your training data to find a set of centroids (often k=256k=256k=256). This collection of centroids for a given subspace is called its "codebook".
  3. Encode: For every vector in your database, map each of its sub-vectors to the nearest centroid in the corresponding codebook. Replace the actual sub-vector values with the ID of that centroid.

In an IVF-PQ system, those PQ codes are usually built on the residual vector after subtracting the assigned coarse centroid, not on the raw vector itself.[3] Quantizing the smaller residual usually gives better accuracy for the same code size.

Product quantization pipeline splitting one vector into chunks, replacing each chunk with a centroid id, and shrinking storage from kilobytes to bytes. Product quantization pipeline splitting one vector into chunks, replacing each chunk with a centroid id, and shrinking storage from kilobytes to bytes.
PQ splits a large vector into many small sub-vectors, replaces each with a centroid ID, and stores bytes instead of float values. Compression is large because one code per chunk replaces many raw dimensions.

By using k=256k=256k=256 centroids, an ID requires just 1 byte of storage (since 28=2562^8 = 25628=256). A 768-dim float vector (3072 bytes) compressed with m=96m=96m=96 becomes an array of 96 single-byte IDs, taking up exactly 96 bytes. That achieves a massive 32x compression ratio, enabling billion-scale search on reasonable hardware.

budget-pq-code-storage.py
1def bytes_per_raw_vector(dimensions: int) -> int: 2 return dimensions * 4 3 4def bytes_per_pq_code(chunks: int, bits_per_chunk: int = 8) -> int: 5 return chunks * bits_per_chunk // 8 6 7raw_bytes = bytes_per_raw_vector(768) 8code_bytes = bytes_per_pq_code(96) 9ivf_pq_bytes_with_id = code_bytes + 8 10 11print(f"raw vector bytes: {raw_bytes}") 12print(f"PQ code bytes: {code_bytes}") 13print(f"IVF-PQ code plus 64-bit id bytes: {ivf_pq_bytes_with_id}") 14print(f"raw-to-code compression: {raw_bytes / code_bytes:.0f}x")
Output
1raw vector bytes: 3072 2PQ code bytes: 96 3IVF-PQ code plus 64-bit id bytes: 104 4raw-to-code compression: 32x

OPQ: rotating before you split

Plain PQ has a weakness: it splits dimensions into fixed contiguous chunks. If the variance in your embeddings is unevenly spread across dimensions, some subspaces carry most of the signal while others are nearly constant, and the per-chunk codebooks waste their 256 centroids on the wrong directions. Optimized Product Quantization (OPQ) addresses this by learning an orthogonal rotation matrix RRR and applying it before the split so the quantizer can distribute useful variation across chunks.[5] An orthogonal rotation preserves L2 distances, so it costs one matrix multiply per query but loses no geometric information before quantization.

In Faiss you can request OPQ with the factory string OPQ96,IVF4096,PQ96: the OPQ96 prefix trains the rotation, then IVF and PQ run on the rotated vectors. OPQ is a candidate to benchmark at the same code size; its recall benefit depends on the embedding distribution and training sample.

Asymmetric distance computation (ADC)

We don't decompress vectors to search them. Instead, we calculate the distance from the uncompressed query to the compressed database vectors using lookup tables.

Mathematically, the squared Euclidean distance between a query vector qqq and a quantized vector yyy (approximated by codebook centroids C(y)C(y)C(y)) is:

d(q,y)2≈d(q,C(y))2=∑j=1md(qj,Cj(yj))2d(q, y)^2 \approx d(q, C(y))^2 = \sum_{j=1}^{m} d(q_j, C_j(y_j))^2d(q,y)2≈d(q,C(y))2=∑j=1m​d(qj​,Cj​(yj​))2

Where qjq_jqj​ is the jjj-th sub-vector of the query, and Cj(yj)C_j(y_j)Cj​(yj​) is the centroid for the jjj-th sub-vector of yyy. This is the core of Asymmetric Distance Computation (ADC). Instead of reconstructing an approximate full vector and computing distance against it, we precompute the distance from each query sub-vector to all centroids in each codebook. The query-time distance for any stored vector then becomes a simple table lookup and sum. The implementation below demonstrates this. It takes a query vector and pre-computed PQ codebooks, creates the lookup tables, and returns the top k nearest neighbors by partial distance lookup alone.

asymmetric-distance-computation-adc.py
1import numpy as np 2 3def adc_pq_search( 4 query: np.ndarray, 5 pq_codes: np.ndarray, 6 codebooks: np.ndarray, 7 k: int = 10 8) -> np.ndarray: 9 """ 10 Approximate nearest neighbor search using Asymmetric Distance Computation (ADC) 11 with Product Quantization. 12 13 Args: 14 query: (d,) float vector. 15 pq_codes: (N, m) uint8 array of stored codes. 16 codebooks: (m, 256, d/m) float array of centroids. 17 """ 18 m = pq_codes.shape[1] 19 if query.shape[0] % m != 0: 20 raise ValueError("query dimension must be divisible by number of PQ chunks") 21 d_sub = query.shape[0] // m 22 23 # 1. Precompute distance tables. 24 # For each subspace, measure query sub-vector distance to every centroid. 25 dist_tables = np.zeros((m, codebooks.shape[1]), dtype=np.float32) 26 for i in range(m): 27 query_sub = query[i*d_sub : (i+1)*d_sub] 28 dist_tables[i, :] = np.sum((codebooks[i] - query_sub) ** 2, axis=1) 29 30 # 2. Aggregate distances with table lookups. 31 # Sum partial distances for each vector based on stored code IDs. 32 dists = np.zeros(len(pq_codes)) 33 for i in range(m): 34 # Look up precomputed distance for code stored at this position. 35 dists += dist_tables[i, pq_codes[:, i]] 36 37 return np.argsort(dists)[:k] 38 39query = np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32) 40codebooks = np.zeros((2, 256, 2), dtype=np.float32) 41codebooks[0, 1] = np.array([2.0, 0.0]) 42codebooks[1, 1] = np.array([0.0, 2.0]) 43pq_codes = np.array( 44 [ 45 [0, 0], 46 [1, 0], 47 [0, 1], 48 [1, 1], 49 ], 50 dtype=np.uint8, 51) 52 53nearest_ids = adc_pq_search(query, pq_codes, codebooks, k=2).tolist() 54print(f"nearest ids: {nearest_ids}")
Output
1nearest ids: [0, 1]

Filtered search

Real queries almost never search the whole corpus. A support tool asks for "tickets like this one, but only from the last 30 days and only for the EU region." That metadata predicate is a filter, and where you apply it changes both recall and latency. There are three strategies, and the choice is one of the most common production interview questions.

Filter then exact scan first finds rows matching the predicate, then scores only that subset. If the allowed set is small, this is exact and can be efficient. It gets costly as the allowed set grows.

Post-filtering (search, then filter) runs normal ANN search for some candidate count, then discards anything that fails the predicate. This keeps the graph intact, but it has an overfiltering failure mode: if the predicate is selective (say only 1% of vectors match), the initial ANN candidate buffer may contain zero matches, so you return fewer than k results or none at all. Some engines address this by expanding the scan iteratively until enough matching results appear or a budget is exhausted.[1]

Allow-list graph traversal is another useful option. The graph walk only counts allowed nodes toward the result set while it can still traverse disallowed nodes for connectivity. Weaviate documents this pre-filtered HNSW behavior, a sweeping strategy, and an ACORN-inspired option; its documentation recommends ACORN particularly when restrictive filters correlate poorly with the vector neighborhood.[6][7] This isn't a universal winner: a tiny allowed set can make a flat scan cheaper, while broader or poorly correlated filters can benefit from graph traversal.

show-post-filter-underfill.py
1candidate_ids = ["eu-1", "us-1", "us-2", "us-3", "eu-2"] 2region = {"eu-1": "eu", "us-1": "us", "us-2": "us", "us-3": "us", "eu-2": "eu"} 3 4def post_filter(candidates: list[str], required_region: str, k: int) -> list[str]: 5 return [item for item in candidates if region[item] == required_region][:k] 6 7initial_candidates = candidate_ids[:3] 8expanded_candidates = candidate_ids 9 10print(f"initial result: {post_filter(initial_candidates, 'eu', k=2)}") 11print(f"expanded result: {post_filter(expanded_candidates, 'eu', k=2)}")
Output
1initial result: ['eu-1'] 2expanded result: ['eu-1', 'eu-2']

How products package these ideas

Vendor defaults drift faster than the underlying concepts, so skip memorizing a leaderboard table. In interviews and production reviews, ask four concrete questions instead:

  1. Which ANN families are exposed: HNSW, IVF, PQ, disk-backed graphs, or a managed black box?
  2. How does filtered search work: before search, after search, or during traversal?
  3. Where does the index body live: RAM, SSD, or object storage?
  4. Which compression knobs are available, and what recall do they cost?

That framing travels better than product trivia:

  • Postgres-style extensions usually expose a small set of explicit index types such as HNSW or IVFFlat inside SQL workflows.
  • General-purpose vector databases often center HNSW for RAM-resident search, then layer in quantization and filter-aware traversal.
  • Faiss and Faiss-derived stacks expose broad ANN menus, which is powerful but shifts more tuning responsibility onto the engineer.
  • Storage-first systems move more bytes out of RAM and onto SSD or object storage, which improves cost efficiency but makes IO behavior part of your latency budget.

The pattern that stays stable is conceptual, not vendor-specific: HNSW is a candidate when you want strong in-memory recall without centroid training, while IVF and PQ are candidates when scan fraction or bytes per vector become central constraints. Choose from measured recall, tail latency, build/update cost, and memory on your workload.

Scaling to billions

When data exceeds standard RAM limits, you usually need one of two moves: compress the vectors more aggressively, or move most of the index body out of RAM. A global logistics platform tracking billions of shipment embeddings faces this exact choice: either compress each route vector with PQ or keep a sparse graph in memory and store the full index on fast SSD.

1. Hybrid approaches (IVF-PQ)

Combining IVF (to reduce search scope) and PQ (to compress vectors) is a common billion-scale recipe.

  • Index: IndexIVFPQ in Faiss.
  • Memory math: In Faiss, IndexIVFPQ stores about ceil(m_pq * nbits / 8) + 8 bytes per vector for the PQ code plus the vector id, where m_pq is the number of PQ sub-vectors/codebooks. A 32-byte code therefore lands near 40 bytes per vector, so 1B vectors is roughly 40 GB before centroids and other overhead.
  • Trade-off: Much lower memory than raw float32 vectors, but more recall loss and more tuning knobs than HNSW.

2. Disk-based indexes (DiskANN)

Microsoft's DiskANN[8] shows that you can serve a billion-point graph index from a single node by keeping a compact in-memory structure and placing the large index body on SSD.

  • Mechanism: The system is graph-based, but optimized around SSD access patterns rather than assuming every edge traversal hits RAM.
  • IO-aware design: Search quality now depends on storage layout and read amplification, not just graph degree and beam width.
  • Performance: The Microsoft Research page for the NeurIPS 2019 paper reports >5000 queries per second, <3 ms mean latency, and 95%+ 1-recall@1 on SIFT1B on a 16-core machine with 64 GB RAM plus SSD. Treat those numbers as benchmark-specific, not a universal promise for every embedding workload.

3. Distributed sharding

If single-node vertical scaling fails, we shard horizontally using a Scatter-Gather approach.

Distributed vector search scatter-gather flow from query fan-out to shard-local top-k and one merged final result. Distributed vector search scatter-gather flow from query fan-out to shard-local top-k and one merged final result.
Full fan-out sharding cuts local work, but a complete global top-k still waits for the slowest shard response before the broker can merge results.
  • Strategy: "Scatter-Gather" architecture. Split 1B vectors into SSS shards (e.g., 10 shards of 100M each).
  • Query: The gateway node broadcasts the query to all SSS nodes (Scatter).
  • Merge: Each node returns its top-kkk. The gateway merges these lists to find the global top-kkk (Gather).
  • Bottlenecks: Under full fan-out, throughput doesn't scale linearly due to coordination overhead, and tail latency (P99) is exposed to the slowest shard.

Mastery check

Common pitfalls

These are the symptoms, causes, and fixes we see most often in production vector search.

Symptom: recall drops after a data shift Cause: You trained an IVF index on English product descriptions, then started adding Spanish descriptions. The old centroids no longer represent the new data distribution, so queries land in the wrong Voronoi cells. Fix: Retrain centroids on a representative sample that includes the new data. Monitor recall@k on a held-out validation set and trigger retraining when it drifts below a threshold.

Symptom: HNSW index causes OOM crashes Cause: An uncompressed HNSW index stores graph adjacency data on top of raw vectors. Edge representation, layer distribution, ids, allocator overhead, and implementation choices all affect the real total. Fix: Measure bytes per indexed vector in the chosen engine at production-like M and dimension settings, then reserve headroom for builds, queries, operating-system buffers, and metadata. If RAM is tight, evaluate compression or an IVF-PQ design against recall requirements.

Symptom: search is fast but returns poor matches Cause: The ANN budget is too small for the index family you chose. For IVF that often means nprobe is too low, so you inspect too few lists. For HNSW it often means ef_search is barely above k, so the beam never explores enough candidates. Fix: Diagnose by index family, not with one generic knob story. Measure recall@k on a labeled set, then raise nprobe for IVF or ef_search for HNSW until recall meets the product target.

Symptom: switching from L2 to cosine breaks IVF results Cause: You changed the distance metric at search time but didn't retrain the coarse quantizer on normalized vectors. The centroids were learned in L2 geometry, so they misplace vectors after normalization. Fix: Normalize training vectors before calling index.train(), and normalize both database and query vectors before index.add() and index.search().

Follow-up questions

Your corpus updates constantly and you need online inserts. What HNSW maintenance caveat matters most?

HNSW handles inserts naturally because a new node can be attached incrementally. Deletes are harder because removing nodes can damage graph connectivity. Many systems therefore use tombstones and periodic rebuilds or compaction instead of pretending deletion is free.

You have 10M vectors and only 16 GB RAM. Why might IVF-PQ beat HNSW even if HNSW usually has stronger recall-latency behavior in RAM?

Because memory is the hard constraint. IVF-PQ compresses vectors aggressively enough to stay resident where HNSW may spend too much RAM on raw vectors plus graph edges. HNSW is often better when the working set already fits; IVF-PQ becomes the practical choice when bytes per vector decide feasibility.

A teammate wants to reconstruct every PQ vector before scoring it because it “sounds simpler.” Why is that the wrong instinct?

Because ADC is the whole trick that keeps PQ search efficient. Query-time work should be lookup-table additions, not full-vector reconstruction for every candidate. Reconstructing every vector throws away much of the compute benefit of compressed search.

Distributed search added more shards, but p99 got worse. What probably happened?

You shrank local indexes but increased fan-out and made the request wait on more shard responses. Scatter-gather throughput can rise while p99 still gets worse if one slow shard or merge overhead dominates.

Evaluation rubric

By the end, you should be able to defend these points without relying on metadata or hidden notes:

  1. HNSW search starts at sparse upper layers, uses greedy routing there, then uses a bounded best-first search at layer 0.
  2. M, ef_construction, and ef_search move recall, latency, build time, and memory in different directions.
  3. IVF reduces search work by partitioning vectors into nlist inverted lists and probing only nprobe of them.
  4. Product Quantization reduces memory by splitting vectors into subspaces and storing centroid IDs instead of float32 values.
  5. IVF-PQ trades memory for quantization error, while HNSW trades memory and build time for strong in-RAM recall and latency.

What you can do now

By this point in the article, you should be able to:

  1. Estimate flat-search work. Given NNN vectors and ddd dimensions, calculate raw memory and scored components per query, then state what must be benchmarked before making a latency claim.
  2. Trace HNSW search. Explain why the algorithm starts at the top layer, how greedy descent works, and why ef_search controls the recall-latency trade-off.
  3. Count IVF comparisons. Given nlist and nprobe, estimate the fraction of the database that gets scanned.
  4. Choose an index family. State whether HNSW or IVF-PQ is a better fit when the constraint is "16 GB RAM for 10M vectors" versus "sub-50 ms latency at 99% recall."
  5. Spot common misconfigurations. Recognize stale centroids, underestimating graph memory, and metric mismatches from their symptoms.

Key takeaways

Index familyCore ideaMain strengthMain costReach for it when
HNSWHierarchical proximity graphOften strong in-RAM recall/latency behaviorGraph memory and slower buildsLow latency matters, the working set fits in RAM, and evaluation supports it
IVF-FlatCoarse partitioning plus exact scan inside probed listsSimple and tunableNeeds training and still stores raw vectorsYou want controllable scan fraction without graph complexity
IVF-PQIVF plus compressed vector codesMuch lower bytes per vectorQuantization error and more tuningMemory cost dominates and some recall loss is acceptable
DiskANNSSD-backed graph indexSingle-node scale beyond RAMStorage-aware engineering complexityThe dataset is larger than RAM but one fast SSD-backed node is still attractive

Evaluate HNSW when you can afford RAM and need low-latency high-recall retrieval. Evaluate IVF-PQ when raw vector storage becomes the bottleneck. Consider DiskANN-style systems when a graph index remains attractive but the index body no longer fits in memory. In each case, release from a recall-and-latency curve measured on representative queries.

Next Step
Continue to Advanced RAG: HyDE & Self-RAG

There, you'll learn advanced RAG techniques including query rewriting, HyDE, Self-RAG, and Corrective RAG (CRAG) to build more reliable retrieval pipelines. The indexing trade-offs you learned here will matter directly: a retrieval pipeline is only as good as the index underneath it, and knowing when your latency spike is an `ef_search` problem versus an `nprobe` problem is what separates a working demo from a production system.

PreviousRecursive Language Models (RLM)
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

pgvector

pgvector contributors · 2026 · GitHub

Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.

Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence

Billion-scale similarity search with GPUs.

Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint

Product Quantization for Nearest Neighbor Search.

Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011

Optimized Product Quantization

Ge, T., He, K., Ke, Q., & Sun, J. · 2013

ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data

Patel, L., Kraft, P., Guestrin, C., & Zaharia, M. · 2024

Filtering

Weaviate · 2026

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node.

Subramanya, S. J., et al. · 2019 · NeurIPS