Learn to count retrieval work, express growth with Big-O, avoid wasteful selection and pairwise loops, and enforce a latency budget with runnable Python.
This retrieval problem builds on permission-aware SQL: first keep only rows this project and principal may retrieve, then rank those rows. That still leaves an engineering question: will ranking finish quickly after the number of allowed chunks grows from four to four million?
An algorithm is a recipe for completing a task. Complexity describes how the work or memory required by that recipe grows as its input grows. You'll count real retrieval operations before reaching for optimization folklore.
A teammate asks an internal engineering assistant:
Why did the auth tests fail?
Assume SQL permission predicates, optionally reinforced by row-level security, have already returned only auth project chunks permitted for this request. Use four symbols:
| Symbol | Meaning here | Example |
|---|---|---|
n | authorized chunks eligible for ranking | 4, then 8, then 50,000 |
d | numbers in one embedding vector | 3 in the lab, 768 in a service |
k | citations returned to the answer generator | 2 or 5 |
t | tokens in a later prompt preview | 128 or 512 |
These variables matter because different code grows with different parts of the request. A permission filter might reduce n. A larger embedding increases d. Returning only two citations controls k, but it doesn't automatically avoid scoring every candidate.
Start with an intentionally small, inspectable retriever. It scores a chunk by counting question words that appear in the chunk text.
1chunks = [
2 {"id": "auth-fixture", "text": "auth tests fail when fixture expires"},
3 {"id": "ci-timeout", "text": "ci job timeout after slow integration test"},
4 {"id": "cli-flag", "text": "readme documents old command line flag"},
5 {"id": "deploy-runbook", "text": "rollback service after canary error"},
6]
7
8def lexical_scan(question: str, allowed_chunks: list[dict]) -> tuple[list[tuple[int, str]], int]:
9 terms = set(question.lower().split())
10 matches = []
11 checked = 0
12 for chunk in allowed_chunks:
13 checked += 1
14 overlap = len(terms & set(chunk["text"].split()))
15 if overlap:
16 matches.append((overlap, chunk["id"]))
17 return sorted(matches, reverse=True), checked
18
19matches, checked = lexical_scan("auth tests fixture", chunks)
20print("chunks_checked", checked)
21print("best_chunk", matches[0][1])1chunks_checked 4
2best_chunk auth-fixturechunks_checked isn't elapsed time. It's a count of the dominant repeated action: inspecting one candidate chunk. For four authorized chunks, this scan checks four. For forty thousand, it checks forty thousand. Real text work also depends on chunk length. When chunks have a configured size limit, counting candidate inspections is a useful first model. Without that bound, include words read as another growing input.
That's the core habit: identify an operation that repeats with scale, then count it.
Big-O gives an asymptotic upper bound: it describes how work grows once the input becomes large. Fixed multipliers and lower-order terms don't change the growth class. Under the bounded-chunk simplification above, scanning one extra authorized chunk adds one inspection, so candidate-scan work is O(n), read "order n" or "linear in n."
1base_chunks = [
2 {"id": "auth-fixture", "text": "auth fixture expired"},
3 {"id": "ci-timeout", "text": "ci timeout"},
4 {"id": "cli-flag", "text": "old cli flag"},
5 {"id": "deploy-runbook", "text": "canary rollback"},
6]
7
8def count_checks(question: str, allowed_chunks: list[dict]) -> int:
9 terms = set(question.split())
10 checks = 0
11 for chunk in allowed_chunks:
12 checks += 1
13 _ = terms & set(chunk["text"].split())
14 return checks
15
16for copies in (1, 2, 4, 8):
17 corpus = base_chunks * copies
18 print(f"n={len(corpus):>2} checks={count_checks('auth fixture', corpus):>2}")1n= 4 checks= 4
2n= 8 checks= 8
3n=16 checks=16
4n=32 checks=32Doubling n doubles the repeated checks. The code still has overhead, such as building a set of question terms, but that overhead doesn't grow once per candidate. Big-O points to the piece that becomes dangerous at larger scale.
Don't call the whole bot "O(n)." Attach the claim to an operation: ranking authorized chunks with this bounded-length scan is O(n).
Text overlap is only a teaching scorer. Semantic retrieval commonly represents the question and each chunk as embedding vectors, then computes a similarity score. The NumPy lesson already gave you the mechanics of vectors and dot products; here we count their cost.
Each exact dot product inspects d coordinates. Scoring all n allowed chunks therefore performs work proportional to n * d.
1import numpy as np
2
3chunk_ids = ["auth-fixture", "ci-timeout", "cli-flag", "deploy-runbook"]
4chunk_vectors = np.array([
5 [1.0, 0.2, 0.1],
6 [0.0, 1.0, 0.2],
7 [0.8, 0.1, 0.4],
8 [0.1, 0.2, 1.0],
9])
10question = np.array([1.0, 0.0, 0.0])
11
12scores = chunk_vectors @ question
13best = int(np.argmax(scores))
14n, d = chunk_vectors.shape
15print("best_chunk", chunk_ids[best])
16print("coordinate_multiplies", n * d)1best_chunk auth-fixture
2coordinate_multiplies 12With four three-dimensional vectors, the score pass uses 12 coordinate multiplications. With 50,000 chunks and 768 coordinates, it uses 38,400,000 multiplications before it chooses citations. The exact ranking step is O(n * d) in time and storing all vectors is O(n * d) in memory.
Suppose the assistant will cite only k = 2 chunks. A first version can score all candidates and fully sort every result. In product code, treat k <= 0 as an empty result so heap logic never reads from an empty list.
1scores = [
2 (0.92, "auth-fixture"),
3 (0.21, "ci-timeout"),
4 (0.81, "cli-flag"),
5 (0.08, "deploy-runbook"),
6 (0.32, "rate-limit-runbook"),
7]
8
9top_two = sorted(scores, reverse=True)[:2]
10print("top_two", top_two)
11print("fully_reordered_scores", len(scores))1top_two [(0.92, 'auth-fixture'), (0.81, 'cli-flag')]
2fully_reordered_scores 5This is correct. It also orders three losers that will never be shown. Sorting n scored candidates costs O(n log n).
A bounded min-heap stores only the top-k values seen so far. Its smallest retained score stays at the root, so a stronger candidate can replace the current weakest winner. Repairing the heap follows one path through a tree with k entries. That path has at most about log2(k) parent-child steps: three for k = 8, ten for k = 1,024. Python's heapq helpers implement that structure in the lab.
For k >= 2, maintaining the heap costs O(n log k) after scoring, with O(k) selection memory. For k = 1, one running maximum gives O(n) selection.[1]
1from heapq import heappush, heapreplace
2
3scores = [
4 (0.92, "auth-fixture"),
5 (0.21, "ci-timeout"),
6 (0.81, "cli-flag"),
7 (0.08, "deploy-runbook"),
8 (0.32, "rate-limit-runbook"),
9]
10
11def bounded_top_k(items: list[tuple[float, str]], k: int) -> list[tuple[float, str]]:
12 if k <= 0:
13 return []
14 heap: list[tuple[float, str]] = []
15 for item in items:
16 if len(heap) < k:
17 heappush(heap, item)
18 elif item > heap[0]:
19 heapreplace(heap, item)
20 return sorted(heap, reverse=True)
21
22top_two = bounded_top_k(scores, k=2)
23print("top_two", top_two)
24print("selection_values_stored", len(top_two))
25assert bounded_top_k(scores, k=0) == []1top_two [(0.92, 'auth-fixture'), (0.81, 'cli-flag')]
2selection_values_stored 2Important limit: the heap improves selection, not exact vector scoring. If scores came from embeddings, you still paid O(n * d) to produce them. Engineers get into trouble when they speed up a cheap tail step and claim they fixed an expensive upstream step.
Linear work doubles when the input doubles. Quadratic work grows far faster because each item is compared with many other items.
A tempting "cleanup" step might compare every retrieved chunk against every other retrieved chunk to remove near-duplicates. For n chunks it evaluates n * (n - 1) / 2 pairs, which is O(n^2).
1def pair_comparisons(n: int) -> int:
2 comparisons = 0
3 for left in range(n):
4 for right in range(left + 1, n):
5 comparisons += 1
6 return comparisons
7
8for n in (4, 8, 16, 1_000):
9 print(f"n={n:>4} pairs={pair_comparisons(n):>6}")1n= 4 pairs= 6
2n= 8 pairs= 28
3n= 16 pairs= 120
4n=1000 pairs=499500At n = 1,000, a request-time pairwise pass makes 499,500 comparisons. That's a concrete failure case: a harmless-looking loop can consume the latency budget before generation begins.
If the requirement is only exact duplicate removal, a set changes the operation. It remembers texts already seen instead of comparing every pair.
1candidate_texts = [
2 "auth fixture expiry",
3 "auth test failure",
4 "auth fixture expiry",
5 "ci timeout retry",
6 "auth test failure",
7]
8
9seen = set()
10unique = []
11membership_checks = 0
12for text in candidate_texts:
13 membership_checks += 1
14 if text not in seen:
15 seen.add(text)
16 unique.append(text)
17
18print("unique_chunks", len(unique))
19print("membership_checks", membership_checks)1unique_chunks 3
2membership_checks 5This set-based path uses expected O(n) membership checks under normal hash-table behavior. Hashing still reads each text, so unbounded string length remains part of the real cost. It doesn't solve semantic near-duplicate detection. Requirements determine the algorithm: don't silently replace "similar meaning" with "identical string" just to obtain a better complexity label.
This course will teach Transformer internals later. For now, carry one preview: scaled dot-product attention computes a matrix from queries and keys, commonly written as Q @ K.T, so a self-attention sequence with t token positions has t * t score cells.[2]
You don't need to understand the model yet to recognize the growth shape.
1def score_matrix_size(tokens: int) -> tuple[int, float]:
2 cells = tokens * tokens
3 mib_if_float32 = cells * 4 / (1024 * 1024)
4 return cells, mib_if_float32
5
6for tokens in (128, 256, 512, 1_024):
7 cells, mib = score_matrix_size(tokens)
8 print(f"tokens={tokens:>4} cells={cells:>7} raw_matrix_mib={mib:>4.2f}")1tokens= 128 cells= 16384 raw_matrix_mib=0.06
2tokens= 256 cells= 65536 raw_matrix_mib=0.25
3tokens= 512 cells= 262144 raw_matrix_mib=1.00
4tokens=1024 cells=1048576 raw_matrix_mib=4.00Doubling tokens quadruples cells. This doesn't yet model a complete language model, optimized kernels, or serving memory. It's a simple transfer of your new algorithm skill: when you see an all-pairs matrix, ask whether an input is being squared.
Big-O tells you the shape, not whether today's workload meets a service objective. A useful engineer does both:
This first guard uses a count rather than a made-up millisecond conversion.
1def exact_score_work(candidate_chunks: int, embedding_dim: int) -> int:
2 return candidate_chunks * embedding_dim
3
4def within_budget(candidate_chunks: int, embedding_dim: int, max_multiplies: int) -> bool:
5 return exact_score_work(candidate_chunks, embedding_dim) <= max_multiplies
6
7budget = 1_000_000
8for candidates in (500, 2_000, 50_000):
9 work = exact_score_work(candidates, embedding_dim=768)
10 print(candidates, work, within_budget(candidates, 768, budget))1500 384000 True
22000 1536000 False
350000 38400000 FalseThe rejected case doesn't justify weakening authorization or returning arbitrary evidence. It shows exact ranking over that many authorized candidates no longer fits this budget. Possible next experiments are a stricter legitimate filter, an offline-created index, or an approximate nearest-neighbor index evaluated for recall.
One later option is HNSW, a hierarchical proximity-graph approach to approximate nearest-neighbor search. It trades exact scoring of every vector for graph-guided candidate exploration, so any latency gain must be evaluated beside retrieval recall on real questions.[3] That's a Vector DB Internals topic, not a shortcut you need to implement in this foundation exercise.
Now assemble the lesson into a small baseline you could test before adding an index:
k citations.1from heapq import heappush, heapreplace
2import numpy as np
3
4rows = [
5 {"id": "auth-fixture", "project": "auth", "principals": {"engineer:u3"}, "vector": np.array([1.0, 0.0, 0.1])},
6 {"id": "auth-ci-timeout", "project": "auth", "principals": {"engineer:u3"}, "vector": np.array([0.2, 1.0, 0.1])},
7 {"id": "auth-cli-flag", "project": "auth", "principals": {"engineer:u3"}, "vector": np.array([0.8, 0.1, 0.4])},
8 {"id": "auth-secrets-private", "project": "auth", "principals": {"engineer:u1"}, "vector": np.array([0.0, 0.3, 1.0])},
9 {"id": "billing-auth-fixture", "project": "billing", "principals": {"engineer:u3"}, "vector": np.array([1.0, 0.0, 0.0])},
10]
11
12def authorized_exact_top_k(
13 question: np.ndarray,
14 all_rows: list[dict],
15 project: str,
16 principal: str,
17 k: int,
18) -> tuple[list[str], int]:
19 if k <= 0:
20 return [], 0
21 allowed = [
22 row
23 for row in all_rows
24 if row["project"] == project and principal in row["principals"]
25 ]
26 heap: list[tuple[float, str]] = []
27 coordinate_multiplies = 0
28 for row in allowed:
29 score = float(np.dot(question, row["vector"]))
30 coordinate_multiplies += question.size
31 item = (score, row["id"])
32 if len(heap) < k:
33 heappush(heap, item)
34 elif item > heap[0]:
35 heapreplace(heap, item)
36 ranked = sorted(heap, reverse=True)
37 return [chunk_id for _, chunk_id in ranked], coordinate_multiplies
38
39citations, work = authorized_exact_top_k(
40 np.array([1.0, 0.0, 0.0]),
41 rows,
42 "auth",
43 "engineer:u3",
44 k=2,
45)
46print("citations", citations)
47print("coordinate_multiplies", work)
48assert "auth-secrets-private" not in citations
49assert "billing-auth-fixture" not in citations
50assert work == 9
51assert authorized_exact_top_k(
52 np.array([1.0, 0.0, 0.0]),
53 rows,
54 "auth",
55 "engineer:u3",
56 k=0,
57) == ([], 0)1citations ['auth-fixture', 'auth-cli-flag']
2coordinate_multiplies 9This exact version is your correctness baseline. It has a clean contract: only authorized rows can appear, returned results are exact among those rows, and the score cost is visible. Once a benchmark says the baseline is too slow, you know which behavior an optimized version must preserve.
Work these without code first. Build intuition for growth.
n = 2,500 authorized chunks, each with d = 384 coordinates. How many coordinate multiplications does an exact dot-product scan perform?k = 5. Which part does a bounded heap reduce, and which part remains unchanged?n = 4,000, d = 256?2,500 * 384 = 960,000 coordinate multiplications.200 * 199 / 2 = 19,900 comparisons.1,024 / 256 = 4, so square score-cell count grows by 4 * 4 = 16.4,000 * 256 = 1,024,000, which exceeds the one-million budget.| Operation | Input that grows | Work shape | First engineering response |
|---|---|---|---|
| Scan authorized text chunks | n chunks | O(n) | Count candidates after permission filtering |
| Score exact embedding vectors | n chunks, d coordinates | O(n * d) | Measure score work before choosing an index |
Select best k scored chunks | n scores, k returned | O(n log k) with a heap | Don't fully sort discarded results |
| Compare every candidate pair | n chunks | O(n^2) | Challenge whether all-pairs work is required |
| Preview an attention score grid | t tokens | O(t^2) cells | Notice square growth before later model lessons |
Three rules carry into every later AI system chapter:
k, and measured time for each stage.n before shipping. Narrow the requirement or move expensive analysis out of the request path.Answer every question, then check your score. Score above 75% to mark this lesson complete.
6 questions remaining.
Introduction to Algorithms (3rd ed.)
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. · 2009
Attention Is All You Need.
Vaswani, A., et al. · 2017
Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.
Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence