A beginner-first data-structures chapter that starts with a list scan, then teaches inverted indexes, heaps, queues, and caches through one support-search story.
Data structures decide what an AI system can find, update, rank, and reuse quickly. If the structure matches the operation, a support bot can jump to the right help article, a background worker can process jobs in order, and a retrieval cache can reuse ranked document IDs without serving stale results.
Imagine a customer opens a support chat and types:
refund email
Behind the scenes, the company has thousands of help articles. The system can't read each article from start to finish under a tight latency budget. It needs a way to jump straight to the handful of articles that mention those words.
That jump is the job of a data structure. The NumPy lesson taught you that arrays and tensors are good when position is the whole story: if you know row 2, column 1, you can land on the value instantly. The CUDA and MPS chapters then showed where those tensors execute. This chapter answers the next systems question: what if your lookup doesn't start with a position? What if it starts with a word like refund, a repeated query like refund email, or a stream of jobs waiting to be processed?
Production AI products don't use one data structure for every job. A lexical retriever can index terms, an ingestion worker can queue jobs, and a retrieval cache can reuse ranked IDs only while their source documents remain current. Each structure exists because a particular operation repeats.
Keep one question in mind: what operation must this structure serve?
That is the real skill behind data structures. You are not memorizing names. You are matching an operation to a shape.
The previous NumPy lesson focused on position-based access. Here the query starts with a term, a top-k request, or a work queue, so the shape has to change.
We will use one tiny support-search story from start to finish. The documents are small enough to trace by hand, but the ideas scale to the retrieval layer inside a retrieval-augmented generation (RAG) system, where a large language model (LLM) answers with retrieved evidence.
| doc id | text |
|---|---|
| 0 | refund policy details |
| 1 | account password reset |
| 2 | refund status email |
User query:
refund email
We will use this story five times:
A list is the right starting point because the data is tiny and the order of documents is stable.
If you scan each document by hand, the match counts look like this:
| doc id | contains refund | contains email | total matches |
|---|---|---|---|
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 2 |
Document 2 should rank above document 0.
This matters for beginners because a list scan is not "wrong." It is the smallest working version. The mistake is staying with a full scan after the question has changed.
When the corpus grows from 3 documents to 3 million, a query like refund email shouldn't force the system to reread each document. A full scan touches every document, so its cost grows in proportion to the corpus size. A dictionary lookup, by contrast, jumps to the value in roughly constant time on average regardless of how big the dictionary gets, because it hashes the key directly to its slot.[1] That doesn't make the whole search constant-time: the retriever still reads the posting lists for the requested terms and ranks the resulting candidates. The next chapter formalizes this gap with Big-O notation; for now the intuition is enough: the data structure must let the query enter the system in its natural form, by term, not by document position.
The query starts with a term such as refund. That means the system should jump from a term to the matching documents.
This is what an inverted index does.
An inverted index maps each term to its posting list. A posting records a matching document ID; larger indexes can store additional information such as term positions or term frequency. In this first version, a Python set[int] is enough because our score asks only whether each query term appears in each document.[2]
Split each document into terms and store the matching document IDs:
| term | document IDs |
|---|---|
refund | {0, 2} |
policy | {0} |
details | {0} |
account | {1} |
password | {1} |
reset | {1} |
status | {2} |
email | {2} |
Now the query can enter through the index:
| query term | posting list |
|---|---|
refund | {0, 2} |
email | {2} |
Turn those posting lists into match counts:
| doc id | matches from refund | matches from email | final score |
|---|---|---|---|
| 0 | 1 | 0 | 1 |
| 2 | 1 | 1 | 2 |
Document 2 wins because it matches both terms. Document 0 still matters because it matches one term.
The dictionary solves exact term lookup. The set solves uniqueness.
Suppose document 2 were refund refund email. The posting list for refund should still contain document 2 once, not twice. That is why the value is a set of document IDs instead of a list of document IDs.
This is a good beginner habit:
Production lexical indexes commonly keep ordered, compressed posting lists and richer statistics. The Python set is a teaching choice: it makes membership and duplicate prevention visible before you optimize storage or intersections.[2]
Here is the smallest proof. The third support article says refund twice, but membership in the refund posting set still contributes one document ID.
1from collections import defaultdict
2
3docs_with_repeat = [
4 "refund policy details",
5 "account password reset",
6 "refund refund email",
7]
8postings: dict[str, set[int]] = defaultdict(set)
9
10for doc_id, text in enumerate(docs_with_repeat):
11 for term in text.split():
12 postings[term].add(doc_id)
13
14print("refund postings", sorted(postings["refund"]))
15print("document 2 counted once", len(postings["refund"]) == 2)1refund postings [0, 2]
2document 2 counted once TrueOnce documents have scores, the system needs the best few results. It doesn't need a full sorted report about the corpus.
That is the moment for a heap.
For our tiny example, scores are:
| doc id | score |
|---|---|
| 2 | 2 |
| 0 | 1 |
A heap keeps the top k results easy to retrieve. In retrieval systems, that matters because you often want the best 5, 10, or 20 candidates, rather than a fully sorted candidate list.
For a stream of scored candidates, keep a min-heap capped at k items. The weakest retained score stays at the root. When a stronger candidate arrives, replace that weakest item instead of sorting every candidate seen so far.
Real rankers also need an explicit tie-break rule. The retriever lab below prefers the lower document ID when scores tie so its output stays deterministic.
1import heapq
2
3candidate_scores = [("doc-0", 1), ("doc-2", 2), ("doc-3", 3)]
4k = 2
5top_k: list[tuple[int, str]] = []
6
7for doc_id, score in candidate_scores:
8 candidate = (score, doc_id)
9 if len(top_k) < k:
10 heapq.heappush(top_k, candidate)
11 elif candidate > top_k[0]:
12 evicted = heapq.heapreplace(top_k, candidate)
13 print("evicted", evicted)
14
15print("top k", sorted(top_k, reverse=True))1evicted (1, 'doc-0')
2top k [(3, 'doc-3'), (2, 'doc-2')]Here is how the pieces fit together in a single flow:
The query enters as plain text. Terms split apart. The index turns each term into a posting list. Scores accumulate per document. The heap returns the best k document IDs. This same pattern sits inside RAG systems, where a language model reads only the best supporting chunks instead of the whole corpus.
Start with the slow version so you can see what the index is replacing. Then add the index and heap.
Put this in data_structures_demo.py. This first tokenizer lowercases and splits on whitespace. Production tokenizers also handle punctuation and language-specific rules, but keeping normalization tiny makes the structure visible. Returning a set means this example scores binary term membership: each distinct query term contributes at most once.
1from collections import defaultdict
2import heapq
3
4docs = [
5 "refund policy details",
6 "account password reset",
7 "refund status email",
8]
9
10def tokenize(text: str) -> set[str]:
11 return set(text.lower().split())
12
13def slow_search(query: str) -> list[tuple[int, int]]:
14 terms = tokenize(query)
15 matches: list[tuple[int, int]] = []
16 for doc_id, text in enumerate(docs):
17 words = tokenize(text)
18 score = sum(term in words for term in terms)
19 if score:
20 matches.append((doc_id, score))
21 return matches
22
23index: dict[str, set[int]] = defaultdict(set)
24for doc_id, text in enumerate(docs):
25 for term in tokenize(text):
26 index[term].add(doc_id)
27
28def search(query: str, k: int = 2) -> list[int]:
29 if not query.strip():
30 return []
31
32 scores: dict[int, int] = defaultdict(int)
33 for term in tokenize(query):
34 for doc_id in index.get(term, set()):
35 scores[doc_id] += 1
36
37 best = heapq.nlargest(
38 k,
39 scores.items(),
40 key=lambda item: (item[1], -item[0]),
41 )
42 return [doc_id for doc_id, _score in best]
43
44print("slow scan", slow_search("refund email"))
45print("posting list", sorted(index["refund"]))
46print("search", search("refund email"))1slow scan [(0, 1), (2, 2)]
2posting list [0, 2]
3search [2, 0]| Job | Structure | Why it fits |
|---|---|---|
| keep the original corpus order | list | the documents start as an ordered sequence |
| jump from term to matching docs | dictionary of sets | the query starts with a term |
| count how many terms matched | dictionary of integers | each document needs a score |
| keep the best few results | heap | the system wants top-k rather than a full sorted list |
The heap key uses score first and negative document ID second. Higher scores win; lower IDs win ties. That small rule makes repeated runs predictable.
That table is more important than the syntax.
Retrieval is one place where data structures show up in AI systems.
Imagine new documents arrive and need to be split into chunks, then encoded as embedding vectors before they can enter a semantic index. For one class of equal-priority jobs, arrival order is a clean baseline. Production queues may add priorities, retries, and dead-letter handling, but they still need an explicit ordering contract.
That is a queue.
1from collections import deque
2
3embedding_jobs = deque(["doc-3", "doc-4", "doc-5"])
4next_job = embedding_jobs.popleft()
5
6print("next job", next_job)
7print("remaining", list(embedding_jobs))1next job doc-3
2remaining ['doc-4', 'doc-5']The rule is simple:
That is first-in, first-out behavior. If you accidentally process newest work first, users may see random delays or starvation in old jobs.
This failure case makes the ordering bug visible. New jobs belong at the tail; putting them at the front lets recent uploads jump ahead of old work.
1from collections import deque
2
3fifo_jobs = deque(["oldest-ticket"])
4fifo_jobs.extend(["new-ticket-a", "new-ticket-b"])
5
6newest_first_jobs = deque(["oldest-ticket"])
7newest_first_jobs.appendleft("new-ticket-a")
8newest_first_jobs.appendleft("new-ticket-b")
9
10print("FIFO runs", fifo_jobs.popleft())
11print("wrong order runs", newest_first_jobs.popleft())1FIFO runs oldest-ticket
2wrong order runs new-ticket-bIf users ask the same question repeatedly, recomputing retrieval results per request wastes work.
A retrieval cache stores prior ranked document IDs so the system can reuse them.
1cache: dict[tuple[int, str], list[int]] = {}
2corpus_version = 1
3query = "refund email"
4
5cache_key = (corpus_version, query)
6if cache_key not in cache:
7 cache[cache_key] = search(query)
8
9print("cached", cache[cache_key])1cached [2, 0]| cache key part | why it exists |
|---|---|
query | separates one repeated request from another |
corpus_version | forces refresh after the document collection changes |
If document 2 stops mentioning refund, the cached ranking [2, 0] becomes fast and wrong. A version number, timestamp, or explicit invalidation rule tells the system when old results must be dropped.
Run the failure before relying on the fix. A text-only key finds yesterday's ranking after an update; a versioned key misses and forces recomputation.
1query = "refund email"
2old_version = 1
3new_version = 2
4
5text_only_cache = {query: [2, 0]}
6versioned_cache = {(old_version, query): [2, 0]}
7
8print("text-only stale hit", query in text_only_cache)
9print("versioned stale hit", (new_version, query) in versioned_cache)1text-only stale hit True
2versioned stale hit FalseBeginners often don't choose the wrong structure because they lack vocabulary. They choose it because the first working code feels good enough.
| Symptom | Likely cause | Fix |
|---|---|---|
| query gets slower as documents grow | the code still scans the whole list per query | build an index keyed by the query field |
| same document appears more than once | a list was used where uniqueness matters | store posting lists as sets |
| old jobs starve | work is being popped in the wrong order | use a FIFO queue and test the order explicitly |
| cached result looks right until documents change | cache has no version or invalidation rule | tie the cache key to corpus updates |
| ranking feels mysterious | scores are hidden inside loops | print one query's per-document score table |
The fastest way to debug is still the same: trace one tiny example by hand.
These tests protect the structures, rather than the final answer alone:
1from collections import deque
2
3def test_refund_lookup_has_two_docs():
4 assert index["refund"] == {0, 2}
5
6def test_empty_query_returns_empty_list():
7 assert search("") == []
8
9def test_refund_email_ranks_doc_2_first():
10 assert search("refund email")[0] == 2
11
12def test_search_normalizes_case():
13 assert search("Refund EMAIL") == [2, 0]
14
15def test_queue_is_fifo():
16 jobs = deque(["doc-3", "doc-4"])
17 assert jobs.popleft() == "doc-3"
18
19test_refund_lookup_has_two_docs()
20test_empty_query_returns_empty_list()
21test_refund_email_ranks_doc_2_first()
22test_search_normalizes_case()
23test_queue_is_fifo()
24print("five structural tests passed")1five structural tests passedEach test checks one promise:
Production AI infrastructure is full of small systems with state, events, and invariants. Request queues, cache entries, retry state, tool side effects, and rollout gates all need explicit state transitions. If you can build one of these cleanly at small scale, you can transfer the pattern to larger LLM systems.
Here is a tiny fulfillment simulator. It processes events in order and refuses transitions that would break the system's promises.
1from collections import defaultdict
2
3inventory = {"SKU-7": 3}
4reserved: dict[str, int] = defaultdict(int)
5shipped: dict[str, int] = defaultdict(int)
6errors: list[str] = []
7
8events = [
9 ("reserve", "order-1", "SKU-7", 2),
10 ("ship", "order-1", "SKU-7", 1),
11 ("ship", "order-1", "SKU-7", 2),
12 ("reserve", "order-2", "SKU-7", 1),
13]
14
15def apply_event(kind: str, order_id: str, sku: str, qty: int) -> None:
16 if kind == "reserve":
17 if inventory[sku] < qty:
18 errors.append(f"{order_id}: not enough inventory")
19 return
20 inventory[sku] -= qty
21 reserved[order_id] += qty
22 return
23
24 if kind == "ship":
25 if reserved[order_id] - shipped[order_id] < qty:
26 errors.append(f"{order_id}: cannot ship more than reserved")
27 return
28 shipped[order_id] += qty
29 return
30
31 errors.append(f"{order_id}: unknown event {kind}")
32
33for event in events:
34 apply_event(*event)
35
36print("inventory", dict(inventory))
37print("reserved", dict(reserved))
38print("shipped", dict(shipped))
39print("errors", errors)1inventory {'SKU-7': 0}
2reserved {'order-1': 2, 'order-2': 1}
3shipped {'order-1': 1}
4errors ['order-1: cannot ship more than reserved']1shipped[order] <= reserved[order]If you can name the invariant, write the state transition, and produce a failing case that proves the guard works, you can handle many "build a moderately complex system quickly" prompts. The same pattern applies to token schedulers, retry queues, tool-call ledgers, and evaluation runners.
Queue workers retry failed events. If a successful shipment event is delivered twice, applying it twice silently corrupts state. A completed-event ID set turns an accidental replay into a visible skip. Record the ID only after the side effect succeeds; a failed attempt must remain retryable.
1completed_event_ids: set[str] = set()
2shipped_units = 0
3delivery_attempts = [("ship-91", 1), ("ship-91", 1), ("ship-92", 2)]
4
5for event_id, quantity in delivery_attempts:
6 if event_id in completed_event_ids:
7 print("skipped retry", event_id)
8 continue
9 shipped_units += quantity
10 completed_event_ids.add(event_id)
11
12print("shipped units", shipped_units)1skipped retry ship-91
2shipped units 3This in-memory set makes sequential replay visible, but it doesn't make an external shipment atomic. If a worker ships the package and crashes before adding the event ID, a retry can ship it again. Production code needs a durable idempotency key enforced together with the side effect, or a downstream write API that enforces the same contract.
That crash window is one reason the next SQL chapter matters. Durable rows, uniqueness constraints, and transactions let a system protect state across process failure.
Try these before you look at the solution sketches.
email support refund.sorted(index["refund"]).search("refund email").search("invoice") and explain why it shouldn't crash.refund, then explain what must happen to the cache.Use these to check your reasoning, not to skip the exercise.
| Practice item | What a good answer should show |
|---|---|
Add email support refund | refund and email both gain the new document ID. |
Predict sorted(index["refund"]) | [0, 2, 3] after the new document is added. |
Predict search("refund email") | [2, 3]. Documents 2 and 3 tie with score 2, so the lower document ID wins the deterministic tie-break. |
Query invoice | index.get(term, set()) returns an empty set, so the function returns [] instead of crashing. |
| Change document 2 | Rebuild the affected posting lists and drop cache entries tied to the old corpus version. |
If your answer feels fuzzy, go back to the score table. Good data-structure reasoning should look inspectable on paper before it looks clever in code.
Most beginner confusion disappears once you ask one concrete question:
What operation must stay fast and predictable?
Use that question to pick the structure:
| If the job is... | Reach for... |
|---|---|
| keep things in their original order | list |
| jump from key to value | dictionary |
| guarantee uniqueness or fast membership | set |
| keep the best few results | heap |
| process work in arrival order | queue |
| reuse prior results safely | cache |
This chapter intentionally stayed small. It doesn't try to cover all classic structures in one pass. That is a feature, not a missing piece. A beginner needs a handful of structures they can trace end to end before moving to bigger systems.
Later retrieval systems may combine exact term postings with approximate nearest-neighbor (ANN) indexes over embedding vectors. Hierarchical navigable small world (HNSW) is a layered graph index for ANN search; the Faiss work describes optimized GPU similarity search for brute-force, approximate, and compressed-domain setups. The core lesson doesn't change: store data in a structure that matches the lookup you need.[3][4]
Three later patterns reuse the same ideas at larger scale:
Data structures decide which candidates enter the model's field of view.
That is why this chapter sits before databases. First you need to know which structure makes a lookup fast. Then you are ready to put those structures behind durable tables, joins, transactions, and vector search.
Answer every question, then check your score. Score above 75% to mark this lesson complete.
8 questions remaining.
Designing Data-Intensive Applications.
Kleppmann, M. · 2017
Introduction to Information Retrieval.
Manning, C. D., Raghavan, P., Schutze, H. · 2008 · Cambridge University Press
Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.
Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence
Billion-scale similarity search with GPUs.
Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint
Efficiently Scaling Transformer Inference.
Pope, R., et al. · 2023 · arXiv preprint
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon, W., et al. · 2023 · SOSP 2023