Master the concepts that power modern AI systems. From foundational transformer architecture to production system design, structured to take you from basics to expert level.
Follow these modules in order. Each step builds directly on the previous one.
NumPy shapes, accelerator basics, data structures, SQL, and algorithmic cost for practical ML systems
Probability, statistics, distributions, uncertainty, hypothesis testing, bootstrap, and pass@k
Background knowledge for readers new to ML. Skip ahead if you already know neural networks and how models train.
Regression, validation, PCA, retrieval, decoding, experiments, PyTorch loops, and dataset quality
Feature pipelines, tabular prediction, ranking, forecasting, monitoring, and continuous training for production ML
First working mental models: tokenization, embeddings, evaluation basics, file ingestion, chunking, and instruction-tuned chat
Practical medium-depth patterns for reasoning, tool use, agents, RAG, evaluation, observability, data, caching, cost, and first product design
Shippable predictive ML and LLM products: ETA, ranking, forecasting, vision, pipelines, QA, evaluation, classifiers, and agents
Harder internals: sentence embeddings, vector scoring, attention, positions, normalization, and decoding
Scaling laws, distributed training, fine-tuning, alignment, rewards, distillation, merging, and prompt optimization
Production-grade retrieval and agent systems: vector indexes, GraphRAG, security, orchestration, memory, HITL, and failure recovery
Serving architecture, KV cache mechanics, batching, quantization, long context, advanced architectures, deployment, and experiments
End-to-end hard system design breakdowns for real AI products
Final interview practice for frontier AI labs: Python systems, design, behavioral evidence, and technical presentation
NumPy shapes, accelerator basics, data structures, SQL, and algorithmic cost for practical ML systems
A beginner-focused NumPy chapter that teaches axis naming, indexing, broadcasting, reductions, reshape vs transpose, attention score shapes, keepdims safety, and shape assertions.
Build beginner-first CUDA intuition for model training: CPU vs GPU roles, host-device copies, asynchronous execution, PyTorch device placement, and first-line debugging of OOM and performance issues.
Build beginner-first intuition for training on Apple silicon: what Metal and MPS are, why unified memory changes the CUDA mental model, how PyTorch exposes the `mps` device, how to check availability, where CPU fallback appears, and how synchronization and memory pressure still shape performance.
A beginner-first data-structures chapter that starts with a list scan, then teaches inverted indexes, heaps, queues, and caches through one support-search story.
Turn an in-memory support retriever into durable SQL tables. Create rows and keys, query with parameters and joins, enforce permissions, use transactions, inspect indexes, and see where pgvector fits.
Learn to count retrieval work, express growth with Big-O, avoid wasteful selection and pairwise loops, and enforce a latency budget with runnable Python.
Probability, statistics, distributions, uncertainty, hypothesis testing, bootstrap, and pass@k
Learn why training works by nudging one delivery-time weight, tracing and summing chain-rule paths, checking gradients, and confirming them with PyTorch.
Turn one gradient vector into batches of model inputs while learning dot products, matrix transforms, tensor axes, and shape debugging.
Find hidden directions in a support-ticket matrix with SVD, then use rank, PCA, truncation, and condition numbers without losing sight of what the numbers mean.
Trace SGD, momentum, Adam, AdamW, schedules, and gradient clipping on one uneven loss surface. Learn what each optimizer buffer measures and how to validate a training choice.
A beginner-first probability article that teaches events, priors, conditional probability, independence, Bayes rule, and base-rate mistakes through one e-commerce order-risk detector story.
Estimate fraud risk in a flagged review queue from finite labels, using bootstrap intuition, score intervals, sampling bias checks, and calibrated reporting.
Model an e-commerce support agent with binary outcomes, request intents, tool-call counts, and tail latency, then challenge each simulation before trusting it.
Compare an order-operations coding assistant with paired evidence, uncertainty for lift, and pass@k under a fixed sampling budget.
Background knowledge for readers new to ML. Skip ahead if you already know neural networks and how models train.
Trace a delivery-risk network from one neuron to a batched NumPy forward pass, then diagnose activation, shape, scale, and numerical-stability failures.
Trace a CNN over a damaged-package photo patch: shared kernels, feature-map shapes, pooling, padding failures, and a NumPy-to-PyTorch forward pass.
Follow a shipment-delay model through prediction, loss, gradients, parameter updates, scalar autograd, mini-batches, validation checks, and PyTorch.
Turn raw action scores into stable probabilities and a useful learning signal, then apply the same loss to next-token predictions.
Trace an RNN over ordered events, see why gradients fade or grow, and use LSTM and GRU gates to control memory.
Compress a claim-photo patch, turn its latent code into a sampleable distribution, and implement VAE loss and training.
Trace a support reply through masked attention, a decoder block, and next-token logits with readable NumPy and PyTorch code.
Learn how next-token prediction becomes a trainable language model, from bigram counts and neural n-grams to causal Transformer generation and KV-cache serving.
Trace how decoder-only models grew into modern LLMs, then inspect scaling, instruction tuning, open weights, MoE, and serving tradeoffs with runnable examples.
Build and test grounded prompts with clear roles, few-shot examples, structured outputs, evidence checks, and failure-focused evaluation.
Turn a grounded prompt into a reliable API boundary with server-side secrets, typed results, bounded retries, safe actions, and useful telemetry.
Ship one traceable return-decision workflow: validated input, model boundary, stored status, clear UI states, failure tests, and deploy checks.
Follow one return-decision assistant from base-model training to post-training, retrieval, serving, evaluation, and the fix chosen after a real failure.
Regression, validation, PCA, retrieval, decoding, experiments, PyTorch loops, and dataset quality
Fit return-assistant latency by hand, implement least squares and gradient descent in NumPy, then test failure cases and held-out behavior.
Route damaged-return requests with logistic regression from scratch: derive sigmoid and log loss, fit NumPy weights, select a cost-aware threshold on validation data, audit ranking and calibration, then compare with scikit-learn.
Model damaged-return review with decision trees from scratch: compute impurity, test a non-perfect stump on held-out cases, compare forests and boosting, and audit feature explanations.
Learn reinforcement learning through the damaged-return workflow from earlier lessons. Define an MDP, compute discounted returns and Bellman backups, implement value iteration and Q-learning, model abandonment risk, and connect policy gradients to LLM post-training.
Make model and policy claims honestly: define the decision moment, split return episodes by time and customer, expose feature and preprocessing leakage, and audit LLM evaluation contamination.
Inspect unlabeled support-message embeddings with k-means and PCA, then stress-test whether apparent neighborhoods survive scale, metric, and compression choices.
Build and evaluate the evidence-selection stage of a support assistant with BM25, dense similarity, rank fusion, reranking, and approximate search audits.
Turn retrieved evidence into controlled text by implementing stable softmax, sampling filters, constrained decoding, beam search, and reproducible generation audits.
Design a trustworthy online experiment for an AI support change: randomize customers, measure useful outcomes, quantify uncertainty, and reject false wins.
Build a PyTorch classifier from raw logits through autograd, validation, and reloadable checkpoints.
Build versioned AI datasets with schema gates, grouped splits, contamination checks, and auditable receipts.
Feature pipelines, tabular prediction, ranking, forecasting, monitoring, and continuous training for production ML
Turn delivery events into stable prediction inputs while preventing leakage and training-serving mismatch.
Build point-in-time delivery features from events and preserve the same meaning in online serving.
Train a boosted ETA-risk baseline from tabular features, evaluate slices, and package deployment evidence.
Rank products for a shopper using candidate retrieval, relevance metrics, and feedback-loop safeguards.
Forecast parcel demand with time-aware evaluation and turn large forecast errors into reviewable operational alerts.
Monitor predictive models from feature freshness through delayed labels, then gate retraining, promotion, and rollback.
First working mental models: tokenization, embeddings, evaluation basics, file ingestion, chunking, and instruction-tuned chat
Use Sutton's Bitter Lesson to compare rules, learning, and search through a measured support-ticket routing lab.
Build a small subword tokenizer, compare BPE, WordPiece, and SentencePiece, then audit token cost and Unicode behavior.
Turn token IDs into vectors, learn what nearby usage captures, and see why a word such as charge needs sentence-dependent representations.
Compute perplexity from held-out token probabilities, compare models under a fixed protocol, normalize across tokenizers, and decide what PPL can't tell you.
Turn PDFs, scans, HTML, and Markdown into faithful evidence records with provenance and quality gates before retrieval.
Turn clean documents into retrieval units that preserve answers, citations, and measurable search quality.
Build an evaluation suite for a policy-answering LLM: score evidence use, understand public benchmark contracts, control judge bias, and make release decisions from private tests.
Teach a base language model to answer as an assistant: curate grounded SFT rows, serialize chat turns exactly, choose loss targets, pack safely, and detect serving-time template drift.
Practical medium-depth patterns for reasoning, tool use, agents, RAG, evaluation, observability, data, caching, cost, and first product design
Shrink and inspect embedding indexes without guessing: measure recall while testing PCA, projections, native shortening, and quantization.
Build and evaluate reasoning controllers: single traces, answer voting, and bounded tree search for multi-step LLM decisions.
Build a safe tool-calling runtime that validates model requests, executes controlled actions, feeds observations back, and evaluates complete workflows.
Move from local function calls to reusable MCP capability servers by tracing one real session, building a working stdio integration, and enforcing trust boundaries.
Build a prompt-injection-resistant agent boundary: quarantine untrusted tool content, validate typed action proposals, require approval, and measure unsafe side effects.
Turn a tool-bearing LLM workflow into auditable evidence: classify its use, own risks, version controls, preserve traces, and gate releases.
Build a trustworthy human-feedback data flywheel: redact traces, write rubrics, measure agreement, select useful examples, prevent leakage, and promote versioned datasets.
Evaluate ShopFlow refund-agent runs by final state, observable trace, safety gates, cost, and repeatability, then map private tests to public benchmarks.
Design a secure, traceable RAG service around versioned policy evidence, grounded answers, abstention, release gates, and latency budgets.
Upgrade a permission-safe RAG retriever with BM25, semantic scores, rank fusion, and recall gates for exact codes and paraphrased policy questions.
Turn a permission-safe hybrid candidate list into precise context using cross-encoder reasoning, ordering metrics, latency gates, and traceable evidence selection.
Evaluate a permission-safe RAG answer trace with context, claim, citation, failure-attribution, and release gates before automating softer judgments.
Add calibrated soft judgments to a RAG evaluation trace without letting an LLM override deterministic evidence gates.
Build a matched-pair fairness audit for an LLM judge, measure routing gaps, and block release when evidence is too weak.
Build a claim-level grounding gate for delivery updates that verifies evidence, catches confident fabrication, abstains safely, and records release traces.
Turn claim-level answer traces into production metrics, actionable alerts, privacy-safe debugging records, and reproducible incident evidence.
Turn a live LLM regression into a reproducible candidate decision by logging inputs, metrics, artifacts, and promotion evidence.
Measure how FP16 and BF16 affect training range, update precision, memory, and release evidence before enabling faster low-precision compute.
Turn an evaluated LLM change into an immutable release bundle, promote it through measured traffic, and roll back without losing lineage.
Reuse stable policy answers across paraphrased questions without crossing release, access, or freshness boundaries; then prove the cache is both safe and worth serving.
Build an auditable LLM cost ledger from usage traces, cache decisions, output contracts, offline batch work, and release budget gates.
Turn an audited cost contract into a model gateway that preserves data, schema, review, and budget requirements across routing and fallback.
Assemble a stateful support agent that grounds replies, gates refund actions, preserves gateway policy, and hands difficult cases to humans.
Shippable predictive ML and LLM products: ETA, ranking, forecasting, vision, pipelines, QA, evaluation, classifiers, and agents
Ship a delivery-delay warning service with as-of features, versioned policy gates, baseline evidence, and monitored fallback.
Ship a marketplace ranking candidate with eligible retrieval, separate recall and NDCG gates, replayable exposure rows, and an A/B-ready rollback receipt.
Ship a demand forecast and capacity-alert artifact with rolling backtests, alert review, and retraining policy.
Ship a damaged-package photo triage service with quality gates, slice evaluation, serving bundles, and review monitoring.
Assemble predictive ML artifacts into validated training, registry promotion, canary monitoring, and rollback.
Ship the policy-evidence service required by a support agent: approved ingestion, cited answers, abstention, and dashboard-ready eval rows.
Build a release dashboard for document QA that turns grounded-answer rows into slice gates, uncertainty checks, and inspectable decisions.
Train and gate a support-escalation encoder that hands safe intake decisions to a production agent.
Assemble classifier intake, cited policy evidence, approval-gated actions, and episode release tests into a production agent.
Harder internals: sentence embeddings, vector scoring, attention, positions, normalization, and decoding
Learn how contrastive losses train sentence embeddings, why hard negatives matter, and how retrieval systems combine bi-encoders, rerankers, and dimension tradeoffs.
Learn vector scoring contracts, evaluate Matryoshka widths, and measure scalar, product, and binary quantization before shipping compressed retrieval.
Learn scaled dot-product attention from first principles, including Q/K/V routing, variance scaling, masks, multi-head shapes, KV-cache costs, and FlashAttention.
Understand how Vision Transformers split images into patches, build visual tokens, train encoders, and connect to CLIP and multimodal LLMs.
Understand why transformers need position information, how sinusoidal encodings work, how RoPE and ALiBi encode relative position, and why long-context extrapolation needs careful evaluation.
Understand LayerNorm mechanics, Pre-LN versus Post-LN placement, RMSNorm simplification, gradient stability, and hybrid normalization layouts for deep transformers.
Learn how sparse autoencoders decompose transformer activations into candidate interpretable features, support circuit tracing, and enable controlled activation-steering experiments.
Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), temperature, repetition controls, and newer variants like min-p.
Scaling laws, distributed training, fine-tuning, alignment, rewards, distillation, merging, and prompt optimization
Learn the empirical power laws governing LLM performance, from Kaplan's parameter-heavy frontier through Chinchilla-optimal ratios to modern inference-aware training strategies.
Understand how web-scale pre-training data is extracted, filtered, deduplicated, mixed, tokenized, and packed into training-ready shards, including decontamination, late-stage annealing, and synthetic-data tradeoffs.
Build and train a tiny GPT end to end on Shakespeare: tokenize with GPT-style subwords, remap active token IDs, run causal self-attention, track validation loss, save a checkpoint, and sample text.
Learn when to keep the causal language-modeling objective and continue pretraining on domain text instead of jumping straight to SFT, and how to evaluate the trade-off against forgetting, cost, and downstream gain.
Build synthetic post-training data pipelines with Self-Instruct, Evol-Instruct, calibrated judge signals, verifiers, preference pairs, diversity checks, and decontamination.
Run supervised fine-tuning as a real training system: choose the learning objective before the update surface, verify response-token loss and packing, track the real batch budget, save resumable checkpoints, and gate export on held-out behavior.
Understand ZeRO stages, current FSDP1 vs FSDP2 guidance, and when native PyTorch or DeepSpeed is the right choice for large-model training.
Understand the mathematics of Low-Rank Adaptation (LoRA), modern adapter targeting strategies, and the real memory tradeoffs compared to full fine-tuning and QLoRA.
Train reward models as a first-class post-training stage: validate chosen/rejected pairs and splits, fit a scalar reward head with Bradley-Terry loss, audit generalization, and decide when explicit rewards are worth the extra complexity.
Understand the RLHF pipeline and DPO, including reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.
Understand how Constitutional AI reduces reliance on repeated human preference labeling through AI critique and ranking, and how automated red teaming stress-tests those safeguards.
Understand RLVR, a post-training approach that uses programmatic verification instead of learned human-preference rewards to improve checked outcomes in math, code, and other contract-driven tasks.
Understand the main forms of knowledge distillation for LLMs, from logit matching and response-based supervision to on-policy KD. Learn when distillation helps, where student capacity becomes the bottleneck, and how to implement a correct teacher-student training loop.
Learn model merging techniques, from simple weight averaging and task arithmetic to TIES-Merging and DARE, including practical guidance on tokenizer compatibility, mergekit workflows, and evaluation.
Move beyond manual prompt editing. Use DSPy to search prompt and few-shot candidates from data, then release only after held-out evaluation.
Learn Recursive Language Models (RLMs): keep long context in a programmable environment, delegate targeted sub-calls, and release the design only after measured quality, cost, and safety checks.
Production-grade retrieval and agent systems: vector indexes, GraphRAG, security, orchestration, memory, HITL, and failure recovery
Learn how approximate nearest neighbor indexes use HNSW, IVF, and Product Quantization to balance speed, recall, and memory in production vector databases.
Learn how query rewriting, HyDE, Self-RAG, and Corrective RAG change retrieval control, and how to evaluate their cost and evidence quality.
Learn how GraphRAG uses entity graphs, hierarchical community reports, and embeddings to retrieve evidence for relationship-heavy and corpus-level questions.
Learn how document ACLs, tenant isolation, retrieval-time authorization, output checks, and audit logs reduce private-data leakage risk in enterprise RAG.
Build reliable LLM interfaces with JSON mode, structured outputs, schema validation, and grammar-guided decoding.
Compare ReAct for tightly coupled tool use with Plan-and-Execute for longer workflows with explicit planning and replanning.
Build layered guardrails for prompt injection defense, sensitive-data controls, structured outputs, policy enforcement, and safe tool use.
Build code agents that test candidate patches inside bounded sandboxes with runtime evidence and defense-in-depth controls.
Build browser and desktop agents whose proposed clicks and keystrokes remain behind host policy, approval, verification, and sandbox controls.
Build approval gates, durable checkpoints, and guarded resumes for agent actions that change real-world state.
Scope coding-agent tasks, isolate execution, keep patches on branches, verify behavior, and preserve human merge ownership.
Design agent memory systems with scoped storage, sourced recall, tenant isolation, and durable checkpoints without letting recalled context authorize side effects.
Learn how to implement validation gates, retries, checkpointed recovery, state reconciliation, loop breakers, and graceful degradation when LLM agents hallucinate, stall, or drift from their tools.
Master multi-agent orchestration with LangGraph, AutoGen teams, and OpenAI handoffs. Learn DAG-style routing, typed shared state, protocol boundaries, and human-in-the-loop controls for reliable AI systems.
Serving architecture, KV cache mechanics, batching, quantization, long context, advanced architectures, deployment, and experiments
Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and prefill/decode disaggregation.
Compare MHA, MQA, and GQA architectures, calculate their KV cache footprint, and reason about memory-limited serving tradeoffs.
Calculate KV cache capacity, trace paged block allocation, and separate memory packing from prefix reuse and scheduling tradeoffs.
Structure exact reusable prefixes, validate cache hits from usage fields, and enforce invalidation and tenant-isolation boundaries.
Understand how FlashAttention cuts auxiliary attention memory from O(n²) to O(n) with tiling and online softmax, and analyze its IO complexity.
Understand how LLM schedulers use continuous batching, chunked prefill, and prefill-decode disaggregation to improve throughput without violating TTFT or inter-token latency targets.
Learn why decode-heavy LLM serving is often memory-bound and how KV-cache design, batching, PagedAttention, and speculative decoding help it scale.
Learn tensor parallelism, pipeline parallelism, sequence parallelism, and how multi-GPU serving trades memory capacity for communication overhead.
Understand how GPTQ, AWQ, and GGUF trade off accuracy, memory footprint, and portability when serving LLMs on GPUs or local hardware.
Plan local LLM deployment with model size, quantization, pruning and sparsity trade-offs, Docker packaging, runtime choice, and hardware budgets.
Distill large teachers into compact SLMs using MobileLLM architectures and Phi-style data recipes. Compile and run them on-device with MLC LLM, ONNX Runtime, Core ML, and ExecuTorch while respecting power, thermal, and strict privacy constraints.
Reduce LLM inter-token latency by pairing cheap drafting with target-model verification. Learn the rejection-sampling proof, speedup model, method choices, and production rollout gates.
Master long-context LLM engineering: KV-cache math, prefill-vs-decode bottlenecks, RoPE scaling, lost-in-the-middle behavior, and long-context vs. RAG trade-offs.
Move past fitting tokens into the window and learn the discipline of context engineering: curating the smallest high-signal token set, fighting context rot, and applying write, select, compress, and isolate strategies plus tool-result pruning and sub-agent isolation.
Master MoE routing and load balancing, and understand why models like Mixtral and DeepSeek deliver strong capacity-per-compute tradeoffs compared with dense architectures.
Master linear-time sequence modeling: from S4 and HiPPO to Mamba's selective recurrence, Mamba-2's SSD framework, Mamba-3's inference-first refinements, and modern hybrid Transformer-SSM designs.
Understand how reasoning models trade extra inference compute for better answers, and what that means for search, verifiers, KV cache pressure, and routing.
Master advanced MLOps and DevOps patterns for LLM systems: GitOps for prompts and models, feature stores for embedding features, automated rollback on eval regression, shadow traffic, and production-grade model registries.
Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.
Master the design of an A/B testing framework for LLM-powered features, including traffic routing, metric selection, sample sizing, and automated guardrails.
End-to-end hard system design breakdowns for real AI products
Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.
Design a real-time code completion path with context construction, measured serving latency, privacy controls, and stale-result suppression.
Design a shared LLM platform with tenant-scoped state, quota enforcement, adapter routing, KV accounting, and measured GPU utilization.
Master the architecture of an end-to-end AI search engine, covering freshness routing, hybrid retrieval, evidence packing, citation verification, and streaming synthesis.
Master CLIP's contrastive pre-training, zero-shot classification, visual token budgets, and the architecture of modern published VLMs like LLaVA, BLIP-2, and Qwen-VL.
Deep dive into multimodal LLM architecture covering encoders, projector designs, fusion patterns, training recipes, visual token budgets, and serving constraints like KV cache growth.
Master diffusion models from the forward noising process and DDPM training to Classifier-Free Guidance, latent diffusion, DiT backbones, and fast sampling trade-offs.
Master real-time voice AI architecture: turn detection, streaming STT/LLM/TTS, native audio trade-offs, WebRTC transport, and barge-in state.
Design a production reasoning agent that routes by difficulty, evaluates candidate work, requires evidence before release, and survives serving bottlenecks like key-value (KV) cache growth.
Final interview practice for frontier AI labs: Python systems, design, behavioral evidence, and technical presentation
Practice production-shaped Python coding prompts: crawlers, in-memory stores, ledgers, schedulers, parsers, rate limiters, caches, and concurrency follow-ups.
Design AI lab systems with clear goals, scale math, APIs, data models, overload behavior, permissions, eval gates, and operational debugging paths.
Prepare behavioral answers for AI labs around judgment, humility, incident leadership, disagreement, safety mechanisms, ambiguity, and evidence of ownership.
Prepare a technical project presentation that proves ownership, architecture taste, tradeoff judgment, rollout discipline, metrics, and depth under questioning.