Master the concepts that power modern AI systems. From foundational transformer architecture to production system design, structured to take you from basics to expert level.
Follow these modules in order. Each step builds directly on the previous one.
Tokenization, embeddings, attention, and the core mental models behind modern LLMs
Serving architecture, KV cache mechanics, batching strategies, and latency/cost trade-offs
Chunking, indexing, hybrid retrieval, GraphRAG, and enterprise data access controls
Prompting, tool calling, memory, orchestration, and guardrails for robust agents
Benchmarking, LLM-as-judge, online experiments, and reliability diagnostics
Caching, deployment, versioning, and the operational discipline for production LLM systems
Pre-training to post-training: data pipelines, alignment methods, and reasoning performance
End-to-end system design breakdowns for real-world AI applications
Tokenization, embeddings, attention, and the core mental models behind modern LLMs
Understand Sutton's Bitter Lesson: why general methods that leverage computation consistently outperform human-engineered heuristics, and how this principle shapes every modern AI architecture decision.
Compare tokenization algorithms, understand vocabulary size tradeoffs, analyze the multilingual tokenization tax, and handle Unicode edge cases in production.
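As a quick illustration of that tax: byte-level tokenizers (as used by GPT-style models) start from UTF-8 bytes, so non-Latin scripts pay more bytes, and usually more tokens, per character. A minimal sketch (the sample strings are arbitrary):

```python
# Byte-level BPE starts from UTF-8 bytes, so the byte count per character is
# a rough lower bound on tokenization cost for each script.
samples = {
    "english": "hello world",
    "german":  "größer",
    "thai":    "สวัสดี",
}
for name, text in samples.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{name}: {chars} chars -> {utf8_bytes} UTF-8 bytes "
          f"({utf8_bytes / chars:.1f} bytes/char)")
```

ASCII text costs 1 byte per character, while Thai costs 3, before any merges are even applied.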
Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.
Master sentence embedding training with contrastive learning (InfoNCE), optimize retrieval with bi-encoder vs. cross-encoder architectures, and use modern advances like Matryoshka representations.
Compare PCA, t-SNE, and UMAP for visualizing and compressing embeddings, and learn when Matryoshka representation learning (MRL) and product quantization replace post-hoc reduction.
Master vector similarity (cosine vs dot product), optimize dimensions with Matryoshka learning, and implement scalar (INT8), product (PQ), and binary (BQ) quantization for billion-scale retrieval systems.
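A minimal sketch of symmetric scalar (INT8) quantization, showing that cosine similarity survives the round trip almost unchanged (the dimension and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)

# Symmetric scalar quantization: map [-max|v|, +max|v|] onto int8 [-127, 127].
scale = np.abs(v).max() / 127.0
q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)
v_hat = q.astype(np.float32) * scale  # dequantize

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine(original, dequantized) = {cos(v, v_hat):.4f}")
```

The stored vector shrinks 4x (FP32 to INT8) while the similarity structure that retrieval depends on is nearly preserved.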
Derive the attention formula, justify the √d_k scaling factor, and implement multi-head attention. Analyze O(n²) complexity and understand the three attention variants (self, causal, cross) in modern architectures.
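The formula can be sketched in a few lines of NumPy; the (n, n) score matrix is where the O(n²) cost comes from (the shapes here are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) -- the O(n^2) term
    if causal:
        mask = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V, causal=True)
print(out.shape)  # (4, 8)
```

With the causal mask, position 0 can only attend to itself, so the first output row equals V[0] exactly, which is a handy sanity check.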
Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.
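A minimal RoPE sketch illustrating the key property: after rotation, the query-key dot product depends only on the relative offset between positions (the dimension and base below are illustrative):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dims of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# The dot product depends only on the relative offset (here 2):
a = rope(q, 5) @ rope(k, 7)
b = rope(q, 100) @ rope(k, 102)
print(np.isclose(a, b))  # True
```

This is why RoPE encodes *relative* position: shifting both query and key by the same amount leaves attention scores unchanged.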
A deep dive into Layer Normalization mechanics: Pre-LN vs Post-LN gradient flow, representation collapse trade-offs, RMSNorm simplification, and modern innovations like QK-Norm and Peri-LN.
Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), and min-p sampling, with temperature scaling and repetition penalty.
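A minimal nucleus (top-p) filtering sketch over a toy distribution (the probabilities and threshold are arbitrary):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]           # sort tokens by probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1      # first index where cum >= p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()          # renormalize over the nucleus

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
print(top_p_filter(probs, p=0.85))
```

Unlike top-k, the size of the kept set adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.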
Derive perplexity from cross-entropy loss, understand bits-per-byte normalization, and navigate the modern LLM evaluation landscape including LLM-as-Judge and Arena Elo.
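The definition in a few lines, with a sanity check: a model that is uniform over a vocabulary of size V has per-token NLL of ln(V), and therefore perplexity exactly V:

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: uniform over a GPT-2-sized vocab (50257 tokens).
V = 50257
nlls = [math.log(V)] * 10
print(perplexity(nlls))
```

This is why perplexity is read as an "effective branching factor": the model is as uncertain as if it were choosing uniformly among that many tokens.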
Serving architecture, KV cache mechanics, batching strategies, and latency/cost trade-offs
Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.
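The memory formula can be sketched directly; the configuration below is a Llama-3-8B-style shape (32 layers, 8 KV heads via GQA, head dim 128) used purely for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """2 (K and V) x layers x KV heads x head_dim x seq_len x batch x bytes/elem."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# FP16 cache for one 4k-token sequence:
gib = kv_cache_bytes(32, 8, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.2f} GiB per 4k-token sequence")  # 0.50 GiB
```

Note that the cache grows linearly in both sequence length and batch size, which is why long contexts and large batches compete for the same GPU memory.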
Master the inference optimizations that make serving large models possible. Compare MHA, MQA, and GQA architectures and their impact on KV cache memory.
Understand KV cache storage strategies for multi-tenant LLM inference, including PagedAttention, memory fragmentation mitigation, and vLLM architecture.
Understand how FlashAttention achieves O(n) memory by tiling and online softmax, and analyze its IO complexity.
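The heart of the trick is the online softmax: a single streaming pass that maintains a running maximum and a rescaled partial sum, reproducing the full softmax normalizer without materializing all scores at once. A minimal sketch:

```python
import numpy as np

def online_softmax_denominator(scores):
    """Stream over scores keeping a running max m and a rescaled sum s."""
    m, s = -np.inf, 0.0
    for x in scores:
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)  # rescale old sum
        m = m_new
    return m, s  # softmax(x_i) = exp(x_i - m) / s

scores = np.array([2.0, 5.0, 3.0, 4.0])
m, s = online_softmax_denominator(scores)
direct = np.exp(scores - scores.max()).sum()
print(np.isclose(s, direct))  # True
```

FlashAttention applies this update tile by tile in SRAM, which is what lets it avoid writing the full n-by-n score matrix to HBM.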
Understand high-throughput request schedulers for LLM serving, focusing on continuous batching, prefill-decode disaggregation, and latency-aware scheduling.
Deep dive into LLM inference optimization: KV-cache management, continuous batching, PagedAttention, and speculative decoding.
Accelerate LLM inference 2-3x by decoupling drafting from verification. Learn the probability theory behind exact distribution matching and how to deploy speculative decoding in production.
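The exactness claim can be verified numerically: accepting a drafted token x with probability min(1, p(x)/q(x)) and resampling rejections from the normalized residual max(p - q, 0) recovers the target distribution p exactly. A sketch with toy distributions:

```python
import numpy as np

# Target distribution p (large model) and draft distribution q (small model).
p = np.array([0.5, 0.3, 0.15, 0.05])
q = np.array([0.25, 0.25, 0.25, 0.25])

accept = np.minimum(1.0, p / q)          # acceptance probability per token
prob_via_accept = q * accept             # = min(p, q): drafted AND accepted
reject_mass = 1.0 - prob_via_accept.sum()
residual = np.maximum(p - q, 0.0)
residual /= residual.sum()               # resampling distribution on reject

recovered = prob_via_accept + reject_mass * residual
print(np.allclose(recovered, p))  # True -- the target distribution is exact
```

The identity min(p, q) + max(p - q, 0) = p is the whole proof: speculative decoding changes only how tokens are produced, never which distribution they come from.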
Master long-context LLM engineering: from RoPE scaling and attention patterns to practical context management strategies, lost-in-the-middle effects, and chunking approaches for production systems.
Understand post-training quantization methods GPTQ, AWQ, and GGUF. Learn how to deploy 70B models on consumer GPUs with minimal quality loss.
Master MoE routing, load balancing, and understand why modern MoE models like DeepSeek-V3 achieve better compute-quality tradeoffs.
Master the linear-time alternative to transformers: from structured state spaces (S4) through Mamba's selective mechanism to hybrid architectures like Jamba that combine the best of both worlds.
Understand the shift from train-time to test-time compute scaling. Explore how reasoning models trade inference FLOPs for better logical deduction.
Chunking, indexing, hybrid retrieval, GraphRAG, and enterprise data access controls
Compare document chunking approaches for RAG: fixed-size, semantic, recursive, and their impact on retrieval quality.
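A minimal fixed-size chunker with overlap, the baseline the other strategies are measured against (the size and overlap values are arbitrary):

```python
def chunk_fixed(text, size=200, overlap=50):
    """Slide a fixed-size window with overlap: the simplest RAG chunker."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 500
chunks = chunk_fixed(doc, size=200, overlap=50)
print([len(c) for c in chunks])  # [200, 200, 200] -- starts at 0, 150, 300
```

The overlap trades index size for robustness: facts that straddle a chunk boundary still appear intact in at least one chunk.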
Master the internals of approximate nearest neighbor algorithms: HNSW, IVF, and Product Quantization. Understand the speed-recall-memory tradeoffs in production vector databases.
Understand how to build a hybrid retrieval system combining BM25 sparse search with dense vector embeddings for optimal recall.
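One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF); a minimal sketch with made-up document IDs (k = 60 is the conventional constant):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d5"]    # sparse (lexical) results
dense_ranking = ["d1", "d2", "d3"]   # dense (embedding) results
print(rrf([bm25_ranking, dense_ranking]))  # ['d1', 'd3', 'd2', 'd5']
```

Because RRF only uses ranks, it sidesteps the problem of BM25 scores and cosine similarities living on incomparable scales.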
Understand the architecture of end-to-end RAG systems: retriever design, vector indices, chunking strategies, and hallucination mitigation.
Master advanced RAG techniques including query decomposition, HyDE, Self-RAG, and Corrective RAG (CRAG) to build robust retrieval pipelines.
How Microsoft's GraphRAG architecture uses community detection and graph structure to answer questions that pure vector search cannot.
Understand row-level security, document ACLs, and per-user filtering in vector stores to prevent RAG systems from leaking confidential data.
Prompting, tool calling, memory, orchestration, and guardrails for robust agents
Master Chain-of-Thought prompting, Self-Consistency, and Tree-of-Thought strategies. Learn when to trade inference compute for reasoning accuracy.
Master the techniques for guaranteeing structurally valid LLM outputs, from JSON mode and function calling schemas to grammar-guided decoding with finite state machines.
Understand how LLMs learn to call functions, parse structured output, and handle multi-step tool use chains.
Understand the Model Context Protocol (MCP) and emerging standards for agent-tool interaction, from protocol architecture and transport layers to security considerations and ecosystem integration.
Master the core patterns for autonomous agents: ReAct loops, Plan-and-Execute architectures, and multi-agent orchestration.
Master memory systems for LLM agents, from short-term working memory and conversation buffers to long-term semantic stores, episodic recall, and MemGPT's hierarchical memory management.
Designing systems that pause agent execution for human approval. From bank transfers to code deployment, building trust into autonomous AI.
Design input/output safety filters for a production LLM application with configurable policy enforcement.
Master prompt injection attacks, understand why they bypass safety filters, and design multi-layer defense strategies for production LLM systems.
Master the architecture of code generation agents, from the generate-execute-debug loop to secure sandboxing with gVisor and WebAssembly.
Implementing deterministic fallbacks, infinite-loop breakers, and graceful degradation when LLM agents hallucinate or get stuck.
Master multi-agent DAGs using LangGraph and AutoGen. Learn to implement shared state, message passing, conditional routing, and human-in-the-loop workflows for robust AI systems.
Design evaluation frameworks for AI agents, from task-completion benchmarks like SWE-bench and OSWorld to custom metrics for tool use accuracy, multi-step reasoning, and safety in agentic workflows.
Benchmarking, LLM-as-judge, online experiments, and reliability diagnostics
Understand major LLM benchmarks (MMLU, HumanEval, GPQA), measurement protocols (pass@k, Elo), and the impact of data contamination.
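The unbiased pass@k estimator popularized by the HumanEval/Codex work can be written directly; n is the number of samples drawn, c the number that pass:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 2 correct, k = 5:
print(pass_at_k(10, 2, 5))
```

The complement form avoids the numerical instability of naively multiplying many probabilities, and it is exact rather than a Monte Carlo estimate.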
Master the LLM-as-a-Judge approach, from designing effective rubrics to handling biases like position and verbosity.
Master the design of an A/B testing framework for LLM-powered features, including metric selection, sample size calculations, and automated guardrails.
Design an observability stack for LLM applications covering logging, metrics, tracing, and drift detection.
Master the taxonomy, detection methods, and mitigation strategies for LLM hallucinations, from statistical self-consistency checks to retrieval-grounded generation and chain-of-verification.
Master the taxonomy of LLM biases, implementation of fairness metrics, and end-to-end mitigation strategies from data curation to RLHF.
Caching, deployment, versioning, and the operational discipline for production LLM systems
Implementing semantic caches, request deduplication, and cost-aware routing to cut LLM API costs by 40-70% without quality loss.
Master the economics of LLM deployment. Learn token-level cost modeling, prompt optimization, caching strategies, model routing, and build-vs-buy decisions at scale.
Master the architecture of CI/CD pipelines for LLM deployments, covering model versioning, automated evaluation gates, and rollback strategies.
Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.
Pre-training to post-training: data pipelines, alignment methods, and reasoning performance
Master the empirical power laws governing LLM performance, from Kaplan's original scaling results through Chinchilla-optimal ratios to modern inference-aware training strategies.
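Two rules of thumb from that literature, sketched as code: Chinchilla's roughly 20 training tokens per parameter, and the standard C ≈ 6ND FLOPs approximation (the 7B example is purely illustrative):

```python
def chinchilla_tokens(params):
    """Chinchilla rule of thumb: compute-optimal training ~ 20 tokens/param."""
    return 20 * params

def training_flops(params, tokens):
    """Standard approximation: C ~= 6 * N * D FLOPs for one training pass."""
    return 6 * params * tokens

N = 7e9                       # a 7B-parameter model
D = chinchilla_tokens(N)      # ~140B tokens
print(f"{D:.2e} tokens, {training_flops(N, D):.2e} FLOPs")
```

Inference-aware strategies deliberately overtrain past this ratio (far more than 20 tokens per parameter) because a smaller model is cheaper to serve for its lifetime.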
Understand the end-to-end data pipeline for pre-training a foundation model, including crawling, deduplication, quality filtering, and data mixing.
Master instruction tuning (SFT) and chat templates. Learn how raw base models are transformed into helpful assistants using structured data, loss masking, and sequence packing.
Understand FP16/BF16 training formats, the necessity of master weights, and how dynamic loss scaling prevents gradient underflow.
Master FSDP and DeepSpeed ZeRO strategies for training LLMs. Compare memory efficiency, communication overhead, and 3D parallelism techniques.
Move beyond manual prompt engineering. Learn to use DSPy's compiler to automatically optimize prompts, select few-shot examples, and improve LLM pipeline performance from data.
Master Recursive Language Models (RLMs), an inference-time approach that moves long context into a programmable environment so models can recurse over 10M+ token workloads with competitive quality and cost.
Master the mathematics of Low-Rank Adaptation (LoRA), adapter injection strategies, and memory/compute tradeoffs compared to full fine-tuning.
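The core of LoRA in a few lines: the frozen weight W is augmented with a low-rank update (alpha/r)·BA, with B initialized to zero so training starts exactly from the base model's behavior (the shapes and hyperparameters below are arbitrary):

```python
import numpy as np

d, k, r = 1024, 1024, 8                  # frozen weight is d x k, rank r
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, zero-initialized
alpha = 16

# LoRA forward pass: h = W x + (alpha / r) * B (A x); only A and B train.
x = rng.standard_normal(k)
h = W @ x + (alpha / r) * (B @ (A @ x))

trainable = A.size + B.size
print(f"trainable params: {trainable:,} vs full fine-tune {W.size:,} "
      f"({100 * trainable / W.size:.2f}%)")
```

Here the adapter trains about 1.6% of the parameters of a full fine-tune; because B starts at zero, h equals W·x at initialization.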
Understand the core mechanisms of knowledge distillation for LLMs. Master the techniques for compressing massive teacher models into efficient student models while preserving complex reasoning capabilities.
Master model merging techniques, from simple weight averaging and Task Arithmetic to TIES-Merging and DARE, including practical guidance on using mergekit for combining specialized models.
Master Constitutional AI's self-improvement loop and automated red teaming strategies for scalable model alignment.
Master the RLHF pipeline and DPO. Understand reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.
Understand RLVR (reinforcement learning with verifiable rewards), the training approach behind DeepSeek-R1's reasoning capabilities, which uses binary correctness signals instead of human preferences or reward model approximations.
End-to-end system design breakdowns for real-world AI applications
Architect a production-grade customer support agent with RAG, tool use, and human escalation capabilities.
Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.
Master the architecture of an end-to-end AI search engine, covering multi-stage retrieval, hallucination verification, and streaming synthesis.
Master the design of a real-time code completion system like Copilot, including context construction, model serving, and low-latency UX.
Master the design of a multi-tenant platform that serves large language models with strict SLA guarantees, token-aware rate limiting, and accurate cost tracking.
Master the design of a production reasoning agent (like o1/DeepSeek-R1) that uses chain-of-thought, tree search, and test-time compute scaling for complex problem solving.
Architect a real-time voice AI agent with sub-500ms latency. Covers VAD, streaming STT/LLM/TTS pipelines, WebRTC transport, and handling interruptions.
Master CLIP's contrastive pre-training, zero-shot classification, and the architecture of modern VLMs like LLaVA and GPT-4V.
Learn how to design a multimodal LLM that processes text, images, and audio, covering projection strategies and cross-modal attention.
Master the mathematics and architecture of Diffusion Models, from the forward noising process to U-Net denoising, Classifier-Free Guidance, and Latent Diffusion scaling.