LeetLLM
LearnTracksPracticeBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Tracks
  • Practice
  • Blog
  • RSS

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 158 articles completed

🛠️Computing Foundations0/9
Git, Shell, Linux for AIDocker for Reproducible AIPython for AI EngineeringNumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models: Images & TextReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnComputing FoundationsAlgorithms for ML Engineers
⚙️EasyMLOps & Deployment

Algorithms for ML Engineers

Learn to count retrieval work, express growth with Big-O, avoid wasteful selection and pairwise loops, and enforce a latency budget with runnable Python.

12 min read
Learning path
Step 9 of 158 in the full curriculum
SQL and Data ModelingGradients and Backprop

This retrieval problem builds on permission-aware SQL: first keep only rows this project and principal may retrieve, then rank those rows. That still leaves an engineering question: will ranking finish quickly after the number of allowed chunks grows from four to four million?

An algorithm is a recipe for completing a task. Complexity describes how the work or memory required by that recipe grows as its input grows. You'll count real retrieval operations before reaching for optimization folklore.

Permission-aware engineering-doc retrieval flowing from authorized chunks through scoring and top-k selection to a measured budget decision. Permission-aware engineering-doc retrieval flowing from authorized chunks through scoring and top-k selection to a measured budget decision.
An authorized answer path becomes an algorithm problem only after you identify the growing input, count the work, and decide whether the budget still holds.

One request, one variable

A teammate asks an internal engineering assistant:

Why did the auth tests fail?

Assume SQL permission predicates, optionally reinforced by row-level security, have already returned only auth project chunks permitted for this request. Use four symbols:

SymbolMeaning hereExample
nauthorized chunks eligible for ranking4, then 8, then 50,000
dnumbers in one embedding vector3 in the lab, 768 in a service
kcitations returned to the answer generator2 or 5
ttokens in a later prompt preview128 or 512

These variables matter because different code grows with different parts of the request. A permission filter might reduce n. A larger embedding increases d. Returning only two citations controls k, but it doesn't automatically avoid scoring every candidate.

Count a complete scan

Start with an intentionally small, inspectable retriever. It scores a chunk by counting question words that appear in the chunk text.

count-one-scan.py
1chunks = [ 2 {"id": "auth-fixture", "text": "auth tests fail when fixture expires"}, 3 {"id": "ci-timeout", "text": "ci job timeout after slow integration test"}, 4 {"id": "cli-flag", "text": "readme documents old command line flag"}, 5 {"id": "deploy-runbook", "text": "rollback service after canary error"}, 6] 7 8def lexical_scan(question: str, allowed_chunks: list[dict]) -> tuple[list[tuple[int, str]], int]: 9 terms = set(question.lower().split()) 10 matches = [] 11 checked = 0 12 for chunk in allowed_chunks: 13 checked += 1 14 overlap = len(terms & set(chunk["text"].split())) 15 if overlap: 16 matches.append((overlap, chunk["id"])) 17 return sorted(matches, reverse=True), checked 18 19matches, checked = lexical_scan("auth tests fixture", chunks) 20print("chunks_checked", checked) 21print("best_chunk", matches[0][1])
Output
1chunks_checked 4 2best_chunk auth-fixture

chunks_checked isn't elapsed time. It's a count of the dominant repeated action: inspecting one candidate chunk. For four authorized chunks, this scan checks four. For forty thousand, it checks forty thousand. Real text work also depends on chunk length. When chunks have a configured size limit, counting candidate inspections is a useful first model. Without that bound, include words read as another growing input.

That's the core habit: identify an operation that repeats with scale, then count it.

Big-O names the growth shape

Big-O gives an asymptotic upper bound: it describes how work grows once the input becomes large. Fixed multipliers and lower-order terms don't change the growth class. Under the bounded-chunk simplification above, scanning one extra authorized chunk adds one inspection, so candidate-scan work is O(n), read "order n" or "linear in n."

observe-linear-growth.py
1base_chunks = [ 2 {"id": "auth-fixture", "text": "auth fixture expired"}, 3 {"id": "ci-timeout", "text": "ci timeout"}, 4 {"id": "cli-flag", "text": "old cli flag"}, 5 {"id": "deploy-runbook", "text": "canary rollback"}, 6] 7 8def count_checks(question: str, allowed_chunks: list[dict]) -> int: 9 terms = set(question.split()) 10 checks = 0 11 for chunk in allowed_chunks: 12 checks += 1 13 _ = terms & set(chunk["text"].split()) 14 return checks 15 16for copies in (1, 2, 4, 8): 17 corpus = base_chunks * copies 18 print(f"n={len(corpus):>2} checks={count_checks('auth fixture', corpus):>2}")
Output
1n= 4 checks= 4 2n= 8 checks= 8 3n=16 checks=16 4n=32 checks=32

Doubling n doubles the repeated checks. The code still has overhead, such as building a set of question terms, but that overhead doesn't grow once per candidate. Big-O points to the piece that becomes dangerous at larger scale.

Don't call the whole bot "O(n)." Attach the claim to an operation: ranking authorized chunks with this bounded-length scan is O(n).

Vector scoring adds another input

Text overlap is only a teaching scorer. Semantic retrieval commonly represents the question and each chunk as embedding vectors, then computes a similarity score. The NumPy lesson already gave you the mechanics of vectors and dot products; here we count their cost.

Each exact dot product inspects d coordinates. Scoring all n allowed chunks therefore performs work proportional to n * d.

count-vector-scoring-work.py
1import numpy as np 2 3chunk_ids = ["auth-fixture", "ci-timeout", "cli-flag", "deploy-runbook"] 4chunk_vectors = np.array([ 5 [1.0, 0.2, 0.1], 6 [0.0, 1.0, 0.2], 7 [0.8, 0.1, 0.4], 8 [0.1, 0.2, 1.0], 9]) 10question = np.array([1.0, 0.0, 0.0]) 11 12scores = chunk_vectors @ question 13best = int(np.argmax(scores)) 14n, d = chunk_vectors.shape 15print("best_chunk", chunk_ids[best]) 16print("coordinate_multiplies", n * d)
Output
1best_chunk auth-fixture 2coordinate_multiplies 12

With four three-dimensional vectors, the score pass uses 12 coordinate multiplications. With 50,000 chunks and 768 coordinates, it uses 38,400,000 multiplications before it chooses citations. The exact ranking step is O(n * d) in time and storing all vectors is O(n * d) in memory.

Return top-k without sorting every result

Suppose the assistant will cite only k = 2 chunks. A first version can score all candidates and fully sort every result. In product code, treat k <= 0 as an empty result so heap logic never reads from an empty list.

full-sort-is-correct-but-does-extra-work.py
1scores = [ 2 (0.92, "auth-fixture"), 3 (0.21, "ci-timeout"), 4 (0.81, "cli-flag"), 5 (0.08, "deploy-runbook"), 6 (0.32, "rate-limit-runbook"), 7] 8 9top_two = sorted(scores, reverse=True)[:2] 10print("top_two", top_two) 11print("fully_reordered_scores", len(scores))
Output
1top_two [(0.92, 'auth-fixture'), (0.81, 'cli-flag')] 2fully_reordered_scores 5

This is correct. It also orders three losers that will never be shown. Sorting n scored candidates costs O(n log n).

A bounded min-heap stores only the top-k values seen so far. Its smallest retained score stays at the root, so a stronger candidate can replace the current weakest winner. Repairing the heap follows one path through a tree with k entries. That path has at most about log2(k) parent-child steps: three for k = 8, ten for k = 1,024. Python's heapq helpers implement that structure in the lab.

For k >= 2, maintaining the heap costs O(n log k) after scoring, with O(k) selection memory. For k = 1, one running maximum gives O(n) selection.[1]Reference 1Introduction to Algorithms (3rd ed.)https://mitpress.mit.edu/9780262033848/introduction-to-algorithms/

keep-only-top-k-scores.py
1from heapq import heappush, heapreplace 2 3scores = [ 4 (0.92, "auth-fixture"), 5 (0.21, "ci-timeout"), 6 (0.81, "cli-flag"), 7 (0.08, "deploy-runbook"), 8 (0.32, "rate-limit-runbook"), 9] 10 11def bounded_top_k(items: list[tuple[float, str]], k: int) -> list[tuple[float, str]]: 12 if k <= 0: 13 return [] 14 heap: list[tuple[float, str]] = [] 15 for item in items: 16 if len(heap) < k: 17 heappush(heap, item) 18 elif item > heap[0]: 19 heapreplace(heap, item) 20 return sorted(heap, reverse=True) 21 22top_two = bounded_top_k(scores, k=2) 23print("top_two", top_two) 24print("selection_values_stored", len(top_two)) 25assert bounded_top_k(scores, k=0) == []
Output
1top_two [(0.92, 'auth-fixture'), (0.81, 'cli-flag')] 2selection_values_stored 2

Important limit: the heap improves selection, not exact vector scoring. If scores came from embeddings, you still paid O(n * d) to produce them. Engineers get into trouble when they speed up a cheap tail step and claim they fixed an expensive upstream step.

Three exact retrieval cost cards showing scan work, vector scoring work, and top-k selection work with their growth variables. Three exact retrieval cost cards showing scan work, vector scoring work, and top-k selection work with their growth variables.
Don't optimize a vague retriever. Split it into candidate scanning, vector scoring, and top-k selection, then name the variable that grows in each part.

Spot a quadratic trap

Linear work doubles when the input doubles. Quadratic work grows far faster because each item is compared with many other items.

A tempting "cleanup" step might compare every retrieved chunk against every other retrieved chunk to remove near-duplicates. For n chunks it evaluates n * (n - 1) / 2 pairs, which is O(n^2).

count-all-pairs.py
1def pair_comparisons(n: int) -> int: 2 comparisons = 0 3 for left in range(n): 4 for right in range(left + 1, n): 5 comparisons += 1 6 return comparisons 7 8for n in (4, 8, 16, 1_000): 9 print(f"n={n:>4} pairs={pair_comparisons(n):>6}")
Output
1n= 4 pairs= 6 2n= 8 pairs= 28 3n= 16 pairs= 120 4n=1000 pairs=499500

At n = 1,000, a request-time pairwise pass makes 499,500 comparisons. That's a concrete failure case: a harmless-looking loop can consume the latency budget before generation begins.

If the requirement is only exact duplicate removal, a set changes the operation. It remembers texts already seen instead of comparing every pair.

deduplicate-exact-text-with-a-set.py
1candidate_texts = [ 2 "auth fixture expiry", 3 "auth test failure", 4 "auth fixture expiry", 5 "ci timeout retry", 6 "auth test failure", 7] 8 9seen = set() 10unique = [] 11membership_checks = 0 12for text in candidate_texts: 13 membership_checks += 1 14 if text not in seen: 15 seen.add(text) 16 unique.append(text) 17 18print("unique_chunks", len(unique)) 19print("membership_checks", membership_checks)
Output
1unique_chunks 3 2membership_checks 5

This set-based path uses expected O(n) membership checks under normal hash-table behavior. Hashing still reads each text, so unbounded string length remains part of the real cost. It doesn't solve semantic near-duplicate detection. Requirements determine the algorithm: don't silently replace "similar meaning" with "identical string" just to obtain a better complexity label.

The same square shape appears later in attention

This course will teach Transformer internals later. For now, carry one preview: scaled dot-product attention computes a matrix from queries and keys, commonly written as Q @ K.T, so a self-attention sequence with t token positions has t * t score cells.[2]Reference 2Attention Is All You Need.https://arxiv.org/abs/1706.03762

You don't need to understand the model yet to recognize the growth shape.

preview-square-token-work.py
1def score_matrix_size(tokens: int) -> tuple[int, float]: 2 cells = tokens * tokens 3 mib_if_float32 = cells * 4 / (1024 * 1024) 4 return cells, mib_if_float32 5 6for tokens in (128, 256, 512, 1_024): 7 cells, mib = score_matrix_size(tokens) 8 print(f"tokens={tokens:>4} cells={cells:>7} raw_matrix_mib={mib:>4.2f}")
Output
1tokens= 128 cells= 16384 raw_matrix_mib=0.06 2tokens= 256 cells= 65536 raw_matrix_mib=0.25 3tokens= 512 cells= 262144 raw_matrix_mib=1.00 4tokens=1024 cells=1048576 raw_matrix_mib=4.00

Doubling tokens quadruples cells. This doesn't yet model a complete language model, optimized kernels, or serving memory. It's a simple transfer of your new algorithm skill: when you see an all-pairs matrix, ask whether an input is being squared.

Quadratic growth preview comparing all-pairs chunk cleanup with a later token attention score grid as input sizes double. Quadratic growth preview comparing all-pairs chunk cleanup with a later token attention score grid as input sizes double.
Pairwise chunk cleanup and a later attention score grid share the same warning sign: doubling the input makes roughly four times as many pairs or cells.

Measure scale and install a budget guard

Big-O tells you the shape, not whether today's workload meets a service objective. A useful engineer does both:

  1. Count expected work to spot dangerous growth.
  2. Benchmark the real implementation on representative data and hardware.
  3. Reject or reroute requests whose predicted work exceeds the supported budget.

This first guard uses a count rather than a made-up millisecond conversion.

guard-exact-retrieval-work.py
1def exact_score_work(candidate_chunks: int, embedding_dim: int) -> int: 2 return candidate_chunks * embedding_dim 3 4def within_budget(candidate_chunks: int, embedding_dim: int, max_multiplies: int) -> bool: 5 return exact_score_work(candidate_chunks, embedding_dim) <= max_multiplies 6 7budget = 1_000_000 8for candidates in (500, 2_000, 50_000): 9 work = exact_score_work(candidates, embedding_dim=768) 10 print(candidates, work, within_budget(candidates, 768, budget))
Output
1500 384000 True 22000 1536000 False 350000 38400000 False

The rejected case doesn't justify weakening authorization or returning arbitrary evidence. It shows exact ranking over that many authorized candidates no longer fits this budget. Possible next experiments are a stricter legitimate filter, an offline-created index, or an approximate nearest-neighbor index evaluated for recall.

One later option is HNSW, a hierarchical proximity-graph approach to approximate nearest-neighbor search. It trades exact scoring of every vector for graph-guided candidate exploration, so any latency gain must be evaluated beside retrieval recall on real questions.[3]Reference 3Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.https://arxiv.org/abs/1603.09320 That's a Vector DB Internals topic, not a shortcut you need to implement in this foundation exercise.

Build it: an authorized exact top-k retriever

Now assemble the lesson into a small baseline you could test before adding an index:

  • Authorization happens first.
  • Every authorized vector is scored exactly.
  • A bounded heap retains only k citations.
  • The function reports score work so a caller can enforce a budget.
authorized-exact-top-k-retriever.py
1from heapq import heappush, heapreplace 2import numpy as np 3 4rows = [ 5 {"id": "auth-fixture", "project": "auth", "principals": {"engineer:u3"}, "vector": np.array([1.0, 0.0, 0.1])}, 6 {"id": "auth-ci-timeout", "project": "auth", "principals": {"engineer:u3"}, "vector": np.array([0.2, 1.0, 0.1])}, 7 {"id": "auth-cli-flag", "project": "auth", "principals": {"engineer:u3"}, "vector": np.array([0.8, 0.1, 0.4])}, 8 {"id": "auth-secrets-private", "project": "auth", "principals": {"engineer:u1"}, "vector": np.array([0.0, 0.3, 1.0])}, 9 {"id": "billing-auth-fixture", "project": "billing", "principals": {"engineer:u3"}, "vector": np.array([1.0, 0.0, 0.0])}, 10] 11 12def authorized_exact_top_k( 13 question: np.ndarray, 14 all_rows: list[dict], 15 project: str, 16 principal: str, 17 k: int, 18) -> tuple[list[str], int]: 19 if k <= 0: 20 return [], 0 21 allowed = [ 22 row 23 for row in all_rows 24 if row["project"] == project and principal in row["principals"] 25 ] 26 heap: list[tuple[float, str]] = [] 27 coordinate_multiplies = 0 28 for row in allowed: 29 score = float(np.dot(question, row["vector"])) 30 coordinate_multiplies += question.size 31 item = (score, row["id"]) 32 if len(heap) < k: 33 heappush(heap, item) 34 elif item > heap[0]: 35 heapreplace(heap, item) 36 ranked = sorted(heap, reverse=True) 37 return [chunk_id for _, chunk_id in ranked], coordinate_multiplies 38 39citations, work = authorized_exact_top_k( 40 np.array([1.0, 0.0, 0.0]), 41 rows, 42 "auth", 43 "engineer:u3", 44 k=2, 45) 46print("citations", citations) 47print("coordinate_multiplies", work) 48assert "auth-secrets-private" not in citations 49assert "billing-auth-fixture" not in citations 50assert work == 9 51assert authorized_exact_top_k( 52 np.array([1.0, 0.0, 0.0]), 53 rows, 54 "auth", 55 "engineer:u3", 56 k=0, 57) == ([], 0)
Output
1citations ['auth-fixture', 'auth-cli-flag'] 2coordinate_multiplies 9

This exact version is your correctness baseline. It has a clean contract: only authorized rows can appear, returned results are exact among those rows, and the score cost is visible. Once a benchmark says the baseline is too slow, you know which behavior an optimized version must preserve.

Practice with numbers

Work these without code first. Build intuition for growth.

  1. A project has n = 2,500 authorized chunks, each with d = 384 coordinates. How many coordinate multiplications does an exact dot-product scan perform?
  2. A full sort orders 10,000 scored results even though the assistant returns k = 5. Which part does a bounded heap reduce, and which part remains unchanged?
  3. A near-duplicate cleanup compares each of 200 chunks with every later chunk. How many comparisons occur?
  4. An attention preview changes prompt length from 256 tokens to 1,024 tokens. By what factor does the number of score cells grow?
  5. Your guard allows one million coordinate multiplications. Does exact scoring fit for n = 4,000, d = 256?

Solution sketches

  1. 2,500 * 384 = 960,000 coordinate multiplications.
  2. The heap reduces selecting the best five values from full-sort work to bounded top-k work. It doesn't remove the exact score pass that produced all 10,000 values.
  3. 200 * 199 / 2 = 19,900 comparisons.
  4. Token length grows by 1,024 / 256 = 4, so square score-cell count grows by 4 * 4 = 16.
  5. No. 4,000 * 256 = 1,024,000, which exceeds the one-million budget.

What to remember

OperationInput that growsWork shapeFirst engineering response
Scan authorized text chunksn chunksO(n)Count candidates after permission filtering
Score exact embedding vectorsn chunks, d coordinatesO(n * d)Measure score work before choosing an index
Select best k scored chunksn scores, k returnedO(n log k) with a heapDon't fully sort discarded results
Compare every candidate pairn chunksO(n^2)Challenge whether all-pairs work is required
Preview an attention score gridt tokensO(t^2) cellsNotice square growth before later model lessons

Three rules carry into every later AI system chapter:

  1. Name the operation and its growing inputs before stating a complexity.
  2. Keep a smallest-correct baseline so optimization can be checked for correctness.
  3. Couple latency work with product requirements: authorization, citation quality, and recall don't disappear when a faster algorithm arrives.

Mastery check

Key concepts

  • Complexity describes growth of one named operation, not an entire product.
  • Exact vector retrieval over authorized candidates performs O(n * d) scoring work.
  • Bounded top-k selection reduces unnecessary ordering but doesn't remove exact scoring.
  • All-pairs loops are a quadratic danger sign in retrieval cleanup and later token processing.
  • A budget guard converts complexity reasoning into an observable system decision.

Evaluation rubric

  • Foundational: Counts scan checks and explains why doubling authorized chunks doubles exact scan work.
  • Intermediate: Implements exact vector scoring plus bounded top-k selection and distinguishes O(n * d) scoring from O(n log k) selection.
  • Advanced: Identifies a quadratic request-path trap, replaces it only when requirements allow, and defines a work budget before proposing approximate retrieval.

Follow-up questions

Common pitfalls

Optimizing before defining the operation

  • Symptom: Team says "use a faster retriever" without knowing whether filtering, scoring, or selection consumes time.
  • Cause: No input variables or repeated operations were named.
  • Fix: Report authorized candidate count, embedding dimension, returned k, and measured time for each stage.

Returning unauthorized evidence faster

  • Symptom: Global vector search is fast, but its candidates include chunks outside the requesting project's permissions.
  • Cause: Optimization bypassed the authorization boundary introduced in the SQL lesson.
  • Fix: Treat authorization as part of correctness. Any indexed design must preserve the same allowed-result contract.

Claiming top-k solved full retrieval

  • Symptom: Full sort was replaced by a heap, but latency barely changes.
  • Cause: Selection was cheap compared with producing every embedding similarity score.
  • Fix: Track score work and selection work separately. Optimize the dominant measured step.

Hiding quadratic work in cleanup

  • Symptom: Request time spikes after adding deduplication or reranking among many candidates.
  • Cause: A nested loop silently added all-pairs comparisons.
  • Fix: Count pairs for realistic n before shipping. Narrow the requirement or move expensive analysis out of the request path.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.A lexical retriever's candidate-scan step inspects each authorized chunk once, and chunk length is bounded. If allowed chunks grow from 4 to 32 for the same question, what claim is justified?
2.A permission-aware query returns 600 chunks for auth and 12,000 chunks for a large project. Both use 256-coordinate embeddings, and the exact-scoring budget is 1,000,000 coordinate multiplications per request. Which decision follows?
3.A retriever computes an exact dot product for every authorized vector and then returns sorted(scores, reverse=True)[:3]. If the full sort is replaced with a bounded min-heap, what changes?
4.An exact top-k retriever filters by project and principal before scoring. Five rows have 3-coordinate vectors: three auth rows allow engineer:u3, one auth row allows engineer:u1, and one billing row allows engineer:u3. For auth, engineer:u3, and k=2, what follows?
5.A teammate proposes a request-time cleanup that compares each unordered pair among 2,000 retrieved snippets once to remove semantically similar text. What is the risk, and when is a set a valid replacement?
6.An attention preview forms a self-attention score matrix with t * t cells. If a prompt grows from 256 tokens to 1,024 tokens, how does the cell count change?

6 questions remaining.

Next Step
Continue to Calculus for Gradients, Backpropagation, and Optimization

You can now reason about <span data-glossary="tensor-shape">tensor shapes</span>, data access paths, and how retrieval cost grows as evidence scales. The next chapter turns that cost reasoning into <span data-glossary="gradient">gradient</span> intuition: how parameter changes reduce error, the first mathematical step toward understanding how neural networks learn from large collections of examples.

PreviousSQL and Data Modeling
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Introduction to Algorithms (3rd ed.)

Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. · 2009

Attention Is All You Need.

Vaswani, A., et al. · 2017

Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.

Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence