Medium · RAG & Retrieval

Core Retrieval Algorithms

Hands-on chapter for core retrieval algorithms, with first-principles mechanics, runnable code, failure modes, and production checks.

40 min read · OpenAI, Anthropic, Google +1 · 7 key concepts

Retrieval is how systems find relevant information before generation. It turns a user query into a ranked list of candidate documents. This chapter starts from zero and builds toward one concrete job skill: implementing toy BM25 scoring, dense cosine similarity, and reciprocal rank fusion for a small document set. ^[1]^[2]^[3]

Figure: BM25 sparse scores, dense cosine scores, and reciprocal rank fusion combined into the final top-k documents. BM25 rewards exact words, dense search rewards meaning, and fusion keeps documents that rank well in either lane.

Step map

Stage	Beginner action	Checkpoint
Concept	Score one query against a small document set.	Reader can say input, operation, and output without naming a library.
Build	Compare lexical, dense, and fused ranking signals.	Code prints or asserts one result the reader predicted first.
Failure	Irrelevant top results expose recall or scoring failure.	The common beginner mistake has a visible symptom and guard.
Ship	Evaluation queries and ranking metrics are committed.	Artifact is small enough for another engineer to rerun.

Start here

Start with similarity. If two vectors point in similar directions, cosine similarity is high. If a document shares rare query terms, lexical retrieval scores it high. Production RAG often combines both.
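To make "shares rare query terms" concrete, here is a minimal BM25 sketch over a toy corpus. The three documents, the whitespace tokenizer, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions. The IDF uses one common smoothed form, log(1 + (N - df + 0.5) / (df + 0.5)); other variants exist.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization is a simplifying assumption.
DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "bm25 ranks documents by rare query terms",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (toy version, no stemming or stopwords)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms (low df) get a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

scores = bm25_scores("rare cat", DOCS)
ranking = sorted(range(len(DOCS)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

With this corpus, "rare" appears in one document and "cat" in two, so the document containing "rare" ranks first. Of the two "cat" documents, the shorter one scores slightly higher because of length normalization.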

Read this chapter once for the idea, then run the demo and change one value. For Core Retrieval Algorithms, progress means you can name the input, explain the operation, and say what result would prove the idea worked.

By the end, you should be able to explain Core Retrieval Algorithms with a worked example, not a library name. Keep one runnable file and one short note with the result you expected before you ran it.

Why this chapter matters

Core Retrieval Algorithms matters because later LLM work assumes you can already score and rank documents by hand. You will use that habit when you inspect data, debug model behavior, compare evaluations, or explain why a result should be trusted.

The job skill here is to implement toy BM25 scoring, dense cosine similarity, and reciprocal rank fusion for a small document set. Treat the snippet as lab equipment: run it, change one input, and write down what changed before you move on.

Beginner mental model

Imagine a query and three short documents. Before using a vector database, calculate one similarity score yourself so the ranking is not a black box.

A useful beginner checklist for Core Retrieval Algorithms:

What object enters the system?
What transformation happens to it?
What evidence says the result is correct?

Keep the answer concrete. If you can't point to the value, shape, row, metric, or test that proves the point, the Core Retrieval Algorithms concept is still fuzzy.

Vocabulary in plain English

BM25: lexical ranking method that rewards matching rare terms while controlling document length.
cosine similarity: angle-based similarity between vectors.
MIPS: maximum inner product search, used when dot product is the score.
HNSW: graph-based approximate nearest-neighbor index.
IVF: index that partitions vectors into coarse groups before search.
product quantization: compression method for large vector collections.
RRF: rank fusion rule that combines multiple ranked lists.

Use these definitions while reading the demo. Each term should map to a variable, an assertion, or a decision you could explain in review.
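As one illustration of the RRF definition above, the sketch below fuses two ranked lists by summing 1 / (k + rank) for each document across lists. The document IDs and both input rankings are made up, and k=60 is a commonly used constant, treated here as an assumption rather than a tuned value.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion; ranks start at 1."""
    totals = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            totals[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so the output is deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical outputs of a lexical retriever and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused_ranking = rrf([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused_ranking])
```

RRF uses only rank positions, not raw scores, so BM25 scores and cosine scores never need to be put on the same scale. Here d1 and d3 appear in both lists and rise above d2 and d4, which appear in only one.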

Build it

Start with the smallest version that can run from a terminal. The goal for this Core Retrieval Algorithms demo is visibility: one file, one output, and no hidden notebook state.


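```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))   # how much the vectors point together
    na = math.sqrt(sum(x * x for x in a))    # length (L2 norm) of a
    nb = math.sqrt(sum(y * y for y in b))    # length (L2 norm) of b
    return dot / (na * nb)                   # raises ZeroDivisionError for an all-zero vector

print(cosine([1, 0, 1], [0.5, 0.5, 1]))  # about 0.866
```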

Read the code in this order:

dot measures how much two vectors point together.
na and nb measure vector lengths.
dot / (na * nb) normalizes by length so scale doesn't dominate.
A real retriever repeats this score across many documents, then sorts by score.
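The last step in that list, scoring many documents and sorting, might look like the sketch below. The three document vectors, their names, and the `rank_by_cosine` helper are invented for illustration; a real system would get the vectors from an embedding model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_by_cosine(query_vec, doc_vecs, top_k=2):
    """Score every document against the query and return the top_k (name, score) pairs."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical 3-dimensional "embeddings".
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 1.0],
}
query_vec = [1.0, 0.0, 1.0]

top = rank_by_cosine(query_vec, doc_vecs)
print(top)
```

This brute-force loop scores every document, which is fine for a toy set. Indexes such as HNSW and IVF exist to avoid that full scan on large collections.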

After it runs, make three small edits. Add a normal-case test, add an edge-case test, then log the intermediate value a beginner would most likely misunderstand. That turns Core Retrieval Algorithms from a reading exercise into an engineering exercise.

For Core Retrieval Algorithms, a strong submission includes a runnable command, one test file, and notes for any assumptions. If data, randomness, training, or evaluation appears, save the split rule, seed, config, and metric definition.

Beginner failure case

A beginner may trust top-1 retrieval without checking whether relevant documents are missing from the candidate set.

For Core Retrieval Algorithms, make the failure visible before adding the fix. Write the symptom in plain English, then add the smallest guard that would catch it next time.

Good guards for Core Retrieval Algorithms are concrete: assertions, fixture rows, duplicate checks, seed control, metric intervals, or release checks. Pick the guard that makes the hidden assumption executable.
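One way to make that hidden assumption executable is a recall@k check over a labeled fixture. In the sketch below, the retrieved IDs and relevance labels are hypothetical fixture data, and the `MIN_RECALL` threshold of 0.75 is an arbitrary example value.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Hypothetical fixture: top-1 looks fine, but one relevant document never shows up.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d7", "d5"}

top1_is_relevant = retrieved[0] in relevant
recall = recall_at_k(retrieved, relevant, k=4)
print(top1_is_relevant, recall)

# The guard: flag the query when recall is low, even though top-1 is relevant.
MIN_RECALL = 0.75  # example threshold, not a recommendation
needs_review = recall < MIN_RECALL
print(needs_review)
```

The fixture is built so that a top-1 check passes while half of the relevant documents are missing, which is the symptom described above.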

Practice ladder

Run the snippet exactly as written and save the output.
Change one input value and predict the output before running it again.
Add one assertion that would catch a beginner mistake.
Add three document vectors, rank them by cosine similarity, then test what happens when one vector has a much larger norm.
Write a two-line README: one command to run the demo, one command to run the test.

Keep this ladder small. Core Retrieval Algorithms should feel runnable before it feels impressive. The capstones later reuse the same habit at product scale.
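For the fourth rung, the sketch below shows one possible version of the norm experiment. Scaling a vector leaves its cosine score unchanged but inflates its raw dot product, which is the score that MIPS-style search ranks by. The vectors and the scale factor of 10 are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0, 1.0]
close = [1.0, 0.1, 1.0]                       # points almost the same way as the query
quiet = [0.2, 1.0, 0.2]                       # points mostly elsewhere
loud = [10.0 * x for x in quiet]              # same direction as quiet, 10x the norm

for name, vec in [("close", close), ("quiet", quiet), ("loud", loud)]:
    print(name, round(cosine(query, vec), 3), round(dot(query, vec), 3))
```

Under cosine, `close` beats `loud`, and `loud` scores the same as `quiet`. Under the raw dot product, `loud` beats `close` purely because of its length. If vectors are normalized to unit length first, the two scores agree.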

Production check

Track recall@k, MRR, nDCG, latency, memory, index freshness, and per-query failure examples.
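Two of those metrics, MRR and nDCG, can be sketched in a few lines. The ranked lists and relevance labels below are invented, relevance is treated as binary, and the nDCG uses a log2 rank discount; graded relevance or a different gain function would change the numbers.

```python
import math

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG@k with a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ordering puts every relevant document at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical evaluation set: (retrieved IDs, relevant IDs) per query.
eval_queries = [
    (["d1", "d2", "d3"], {"d1"}),   # relevant doc at rank 1
    (["d4", "d5", "d6"], {"d5"}),   # relevant doc at rank 2
    (["d7", "d8", "d9"], {"d0"}),   # relevant doc never retrieved
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_queries) / len(eval_queries)
mean_ndcg = sum(ndcg_at_k(r, rel, k=3) for r, rel in eval_queries) / len(eval_queries)
print(round(mrr, 3), round(mean_ndcg, 3))
```

Keeping the third query in the set matters: a query whose relevant document is never retrieved pulls both averages down, which is the kind of per-query failure example worth saving alongside the aggregate numbers.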

A production check for Core Retrieval Algorithms is proof another engineer can trust the result. At foundation level that means a reproducible command and tests. At capstone level it also means a design note, eval evidence, cost or latency notes, and rollback criteria.

Before moving on, answer four Core Retrieval Algorithms questions: What input does this accept? What output or metric proves it worked? What failure would fool you? What test catches that failure?

What to ship

Ship a small Core Retrieval Algorithms folder with code, tests, and notes. Make it boring to run: install dependencies, run tests, run the demo. That boring path is what makes the artifact useful in a portfolio.

Core Retrieval Algorithms feeds later LLM engineering work directly. Retrieval, fine-tuning, agents, evals, and serving all depend on small foundations like this being clear before systems get large.

Evaluation Rubric

1. Explains the core mental model behind Core Retrieval Algorithms without hiding behind library calls
2. Implements the central idea in runnable Python, NumPy, PyTorch, or scikit-learn code
3. Identifies realistic failure modes and adds tests or production checks that catch them

Common Pitfalls

Dense retrieval can miss exact terms, and sparse retrieval can miss paraphrases. A good RAG system measures both.
Skipping the from-scratch version and reaching for a library before the mechanics are clear.
Treating a clean demo as proof that the implementation will survive bad inputs, drift, or scale.

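Follow-up Questions to Expect

Why does Core Retrieval Algorithms belong before advanced LLM engineering?
What should a student build after reading this chapter?
What makes the chapter job-relevant instead of academic only?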

Key Concepts Tested

BM25, cosine similarity, MIPS, HNSW, IVF, product quantization, RRF

References

The Probabilistic Relevance Framework: BM25 and Beyond.

Robertson, S., & Zaragoza, H. · 2009 · Foundations and Trends in Information Retrieval

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.

Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence

Billion-scale similarity search with GPUs.

Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint
