Master the concepts that power modern AI systems. From foundational transformer architecture to production system design, structured to take you from basics to expert level.
Follow these modules in order. Each step builds directly on the previous one.
Tokenization, embeddings, attention, and the core mental models behind modern LLMs
Serving architecture, KV cache mechanics, batching strategies, and latency/cost trade-offs
Chunking, indexing, hybrid retrieval, GraphRAG, and enterprise data access controls
Prompting, tool calling, memory, orchestration, and guardrails for robust agents
Benchmarking, LLM-as-judge, online experiments, and reliability diagnostics
Caching, deployment, versioning, and the operational discipline for production LLM systems
Pre-training to post-training: data pipelines, alignment methods, and reasoning performance
End-to-end system design breakdowns for real-world AI applications
Tokenization, embeddings, attention, and the core mental models behind modern LLMs
Understand Sutton's Bitter Lesson: why general methods that leverage computation consistently outperform human-engineered heuristics, and how this principle shapes every modern AI architecture decision.
Master tokenization algorithms (BPE, WordPiece, SentencePiece), understand vocabulary size tradeoffs, and analyze the multilingual tokenization tax.
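For a concrete preview, here is a minimal sketch of the BPE training loop: count adjacent symbol pairs and merge the most frequent one. The toy corpus and merge count are invented for illustration, not drawn from any production tokenizer.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: "low" x5, "lower" x2, as character tuples.
words = {tuple("low"): 5, tuple("lower"): 2}
for _ in range(3):  # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```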
Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.
Master sentence embedding training with contrastive learning (InfoNCE), optimize retrieval with bi-encoder vs. cross-encoder architectures, and use modern advances like Matryoshka representations.
Compare PCA, t-SNE, and UMAP for visualizing and compressing embeddings, and learn when Matryoshka representation learning (MRL) and product quantization replace post-hoc dimensionality reduction.
Master vector similarity (cosine vs dot product), optimize dimensions with Matryoshka learning, and implement scalar, product, and binary quantization for retrieval systems.
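To make the cosine-vs-dot-product distinction concrete, a short NumPy sketch (the vectors are random stand-ins for real embeddings), plus a toy binary quantization that keeps only sign bits:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=768)          # stand-in query embedding
d = rng.normal(size=768) * 3.0    # document embedding with a larger norm

dot = q @ d
cos = dot / (np.linalg.norm(q) * np.linalg.norm(d))
print(f"dot = {dot:.2f}, cosine = {cos:.3f}")  # dot is norm-sensitive

# Binary quantization: keep one sign bit per dimension (32x smaller than
# float32); similarity degrades to per-dimension sign agreement.
agreement = ((q > 0) == (d > 0)).mean()
print(f"sign agreement = {agreement:.3f}")
```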
Master the scaled dot-product attention formula from first principles. Deep dive into the variance proof, multi-head parallelization, O(n²) memory complexity, and the three core attention variants.
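The formula itself fits in a few lines of NumPy; this sketch uses illustrative shapes and omits masking and batching:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, single head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n): the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_k = 4, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 64)
```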
Understand why transformers need position info, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.
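A minimal sketch of the sinusoidal encoding from "Attention Is All You Need", with PE[pos, 2i] = sin(pos / 10000^(2i/d)) and the matching cosine in the odd dimensions:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_pe(max_len=128, d_model=64).shape)  # (128, 64)
```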
Master Layer Normalization mechanics: Pre-LN vs Post-LN gradient flow, representation collapse trade-offs, RMSNorm simplification, and modern innovations like QK-Norm and Peri-LN.
Master decoding strategies for text generation: compare greedy, beam search, top-k, nucleus (top-p), and min-p sampling, with temperature scaling and repetition penalty.
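A compact sketch of temperature plus nucleus (top-p) sampling over an invented logit vector:

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                     # most to least likely
    cum = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cum, top_p) + 1]  # smallest covering set
    p = probs[nucleus] / probs[nucleus].sum()           # renormalize inside it
    return int(rng.choice(nucleus, p=p))

logits = [2.0, 1.5, 0.2, -1.0, -3.0]                    # invented 5-token vocab
print(sample_top_p(logits, rng=np.random.default_rng(0)))
```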
Derive perplexity from cross-entropy loss, understand bits-per-byte normalization, and explore modern LLM evaluation methods including LLM-as-Judge and Arena Elo.
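The derivation reduces to one line of arithmetic: perplexity is the exponential of the mean per-token cross-entropy. The token probabilities below are invented:

```python
import math

token_probs = [0.42, 0.11, 0.73, 0.05]     # p(token | context), invented
nll = [-math.log(p) for p in token_probs]  # per-token cross-entropy in nats
mean_nll = sum(nll) / len(nll)
print(f"mean NLL = {mean_nll:.3f} nats, perplexity = {math.exp(mean_nll):.2f}")
```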
Serving architecture, KV cache mechanics, batching strategies, and latency/cost trade-offs
Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.
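A back-of-envelope version of the KV cache formula: 2 (K and V) x layers x KV heads x head dim x bytes per element x tokens. The config below roughly matches a Llama-2-7B-class model and is an assumption, not a vendor spec:

```python
layers, kv_heads, head_dim = 32, 32, 128   # illustrative 7B-class config
bytes_per_elem = 2                          # FP16
seq_len = 4096

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(f"{per_token / 2**10:.0f} KiB per token")                      # 512 KiB
print(f"{per_token * seq_len / 2**30:.1f} GiB at {seq_len} tokens")  # 2.0 GiB
```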
Master the inference optimizations that make serving large models possible. Compare MHA, MQA, and GQA architectures and their impact on KV cache memory.
Understand KV cache storage strategies for multi-tenant LLM inference, including PagedAttention, memory fragmentation, and vLLM architecture.
Understand how FlashAttention achieves O(n) memory by tiling and online softmax, and analyze its IO complexity.
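The heart of the trick is online softmax: fold in one block of scores at a time with a running max and normalizer, so the full n x n score matrix never materializes. A 1-D sketch with invented scores and values:

```python
import numpy as np

def online_softmax_weighted_sum(score_blocks, value_blocks):
    m, denom, acc = -np.inf, 0.0, 0.0   # running max, normalizer, output
    for s, v in zip(score_blocks, value_blocks):
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        w = np.exp(s - m_new)
        denom = denom * scale + w.sum()  # rescale old terms to the new max
        acc = acc * scale + w @ v
        m = m_new
    return acc / denom

scores = np.array([1.0, 3.0, 0.5, 2.0])
values = np.array([10.0, 20.0, 30.0, 40.0])
blocked = online_softmax_weighted_sum(
    [scores[:2], scores[2:]], [values[:2], values[2:]])
w_ref = np.exp(scores - scores.max())
reference = (w_ref / w_ref.sum()) @ values
print(np.allclose(blocked, reference))  # True: blocks match the full softmax
```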
Understand high-throughput request schedulers for LLM serving, focusing on continuous batching, prefill-decode disaggregation, and latency-aware scheduling.
Explore LLM inference optimization end to end: KV cache management, continuous batching, PagedAttention, and speculative decoding.
Accelerate LLM inference 2-3x by decoupling drafting from verification. Learn the probability theory behind exact distribution matching and how to deploy speculative decoding in production.
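The exactness guarantee comes from a simple accept/reject rule: accept a drafted token with probability min(1, p_target/p_draft), and on rejection resample from the renormalized residual max(0, p_target - p_draft). A toy sketch over a 4-token vocabulary:

```python
import numpy as np

def verify(drafted_token, p_draft, p_target, rng):
    """Accept or resample one drafted token; output follows p_target exactly."""
    if rng.random() < min(1.0, p_target[drafted_token] / p_draft[drafted_token]):
        return int(drafted_token)
    residual = np.maximum(p_target - p_draft, 0.0)   # rejection correction
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual))

p_draft  = np.array([0.70, 0.20, 0.05, 0.05])  # fast draft model
p_target = np.array([0.40, 0.40, 0.10, 0.10])  # large target model
rng = np.random.default_rng(0)
print([verify(0, p_draft, p_target, rng) for _ in range(5)])
```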
Master long-context LLM engineering: from RoPE scaling and attention patterns to practical context management strategies, lost-in-the-middle effects, and chunking approaches for production systems.
Understand post-training quantization methods GPTQ, AWQ, and GGUF. Learn how to deploy 72B models on consumer GPUs with minimal quality loss.
Master MoE routing, load balancing, and understand why modern MoE models like DeepSeek-V2 achieve better compute-quality tradeoffs.
Master the linear-time alternative to transformers: from structured state spaces (S4) and Mamba's selective mechanism to hybrid architectures like Jamba.
Understand the shift from train-time to test-time compute scaling. Explore how reasoning models trade inference FLOPs for better logical deduction.
Chunking, indexing, hybrid retrieval, GraphRAG, and enterprise data access controls
Deep dive into document chunking approaches for RAG: fixed-size, semantic, recursive, and their impact on retrieval quality.
Master the internals of approximate nearest neighbor algorithms: HNSW, IVF, and Product Quantization. Understand the speed-recall-memory tradeoffs in production vector databases.
Understand how to build a hybrid retrieval system combining BM25 sparse search with dense vector embeddings for optimal recall.
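One standard way to fuse the two rankings is Reciprocal Rank Fusion (RRF); the doc IDs below are invented and k=60 is the commonly used default:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse rankings (lists of doc IDs, best first) by reciprocal rank."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["doc3", "doc1", "doc7"]   # sparse (lexical) ranking
dense_hits = ["doc1", "doc9", "doc3"]   # dense (embedding) ranking
print(rrf([bm25_hits, dense_hits]))     # docs in both lists rise to the top
```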
Understand the architecture of end-to-end RAG systems: retriever design, vector indices, chunking strategies, and hallucination mitigation.
Master advanced RAG techniques including query decomposition, HyDE, Self-RAG, and Corrective RAG (CRAG) to build robust retrieval pipelines.
Understand how Microsoft's GraphRAG architecture uses community detection and graph structure to answer questions that pure vector search can't.
Understand row-level security, document ACLs, and per-user filtering in vector stores to prevent RAG systems from leaking confidential data.
Prompting, tool calling, memory, orchestration, and guardrails for robust agents
Master Chain-of-Thought prompting, Self-Consistency, and Tree-of-Thought strategies. Learn when to trade inference compute for reasoning accuracy.
Master the techniques for guaranteeing structurally valid LLM outputs, from JSON mode and function calling schemas to grammar-guided decoding with finite state machines.
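The core idea of grammar-guided decoding is to mask the vocabulary at each step to tokens the grammar can still accept. A toy sketch with a hand-rolled "grammar" for a JSON boolean; the vocabulary and selection rule are purely illustrative:

```python
VOCAB = ["true", "false", "tr", "ue", "fal", "se", "{", "}"]

def allowed(prefix):
    """Tokens that keep the output a prefix of 'true' or 'false'."""
    return [t for t in VOCAB
            if "true".startswith(prefix + t) or "false".startswith(prefix + t)]

out = ""
while out not in ("true", "false"):
    candidates = allowed(out)
    out += candidates[-1]  # a real decoder samples from logits masked to these
    print(out, "<-", candidates)
```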
Understand how LLMs learn to call functions, parse structured output, and handle multi-step tool use chains.
Understand the Model Context Protocol (MCP) and emerging standards for agent-tool interaction, from protocol architecture and transport layers to security considerations and ecosystem integration.
Master the core patterns for autonomous agents: ReAct loops, Plan-and-Execute architectures, and multi-agent orchestration.
Master memory systems for LLM agents, from short-term working memory and conversation buffers to long-term semantic stores, episodic recall, and MemGPT's hierarchical memory management.
Design systems that pause agent execution for human approval. From bank transfers to code deployment, learn to build trust into autonomous AI.
Master the design of input and output safety filters for production LLM applications with configurable policy enforcement.
Master prompt injection attacks, understand why they bypass safety filters, and design multi-layer defense strategies for production LLM systems.
Master the architecture of code generation agents, from the generate-execute-debug loop to secure sandboxing with gVisor and WebAssembly.
Master implementing deterministic fallbacks, infinite-loop breakers, and graceful degradation for when LLM agents hallucinate or get stuck.
Master multi-agent DAGs using LangGraph and AutoGen. Learn to implement shared state, message passing, conditional routing, and human-in-the-loop workflows for robust AI systems.
Master the design of evaluation frameworks for AI agents, from task-completion benchmarks like SWE-bench and OSWorld to custom metrics for tool use accuracy, multi-step reasoning, and safety.
Benchmarking, LLM-as-judge, online experiments, and reliability diagnostics
Master major LLM benchmarks (MMLU, HumanEval, GPQA, SWE-bench) and measurement protocols (pass@k, Elo), analyze data contamination, and learn benchmark selection strategies for 2026, including agentic benchmarks and cost-to-quality ratios.
Master the LLM-as-a-Judge approach, from designing effective rubrics to handling biases like position and verbosity.
Master the design of an A/B testing framework for LLM-powered features, including metric selection, sample size calculations, and automated guardrails.
Master the design of an observability stack for LLM applications, covering logging, metrics, tracing, and drift detection.
Master the taxonomy, detection methods, and mitigation strategies for LLM hallucinations. Covers everything from SelfCheckGPT and semantic entropy to specialized detectors like Lynx, token-level probing, and cutting-edge prevention techniques including contrastive decoding.
Master the taxonomy of LLM biases, implementation of fairness metrics, and end-to-end mitigation strategies from data curation to RLHF.
Caching, deployment, versioning, and the operational discipline for production LLM systems
Master semantic caching, request deduplication, and cost-aware routing to cut LLM API costs by 40-70% without quality loss.
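A minimal semantic cache sketch: serve a stored answer when a new query embeds close enough to a cached one. `embed` is a stand-in for any sentence-embedding model, and the 0.92 threshold is an assumption to tune per workload:

```python
import numpy as np

class SemanticCache:
    """Reuse cached answers for near-duplicate queries (cosine similarity)."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed            # any text -> vector model (assumed)
        self.threshold = threshold    # similarity needed for a cache hit
        self.keys, self.values = [], []

    def _unit(self, text):
        v = np.asarray(self.embed(text), dtype=np.float64)
        return v / np.linalg.norm(v)

    def get(self, query):
        if not self.keys:
            return None
        sims = np.stack(self.keys) @ self._unit(query)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, answer):
        self.keys.append(self._unit(query))
        self.values.append(answer)

# Usage with a real embedder (hypothetical names):
#   cache = SemanticCache(embed=model.encode)
#   cache.put("What is your refund policy?", answer_text)
#   hit = cache.get("How do refunds work?")  # hits if similarity >= 0.92
```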
Master the economics of LLM deployment. Learn token-level cost modeling, prompt optimization, caching strategies, model routing, and build-vs-buy decisions at scale.
Master the architecture of CI/CD pipelines for LLM deployments, covering model versioning, automated evaluation gates, and rollback strategies.
Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.
Pre-training to post-training: data pipelines, alignment methods, and reasoning performance
Master the empirical power laws governing LLM performance, from Kaplan's original scaling results through Chinchilla-optimal ratios to modern inference-aware training strategies.
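A back-of-envelope application of the Chinchilla heuristic, with training compute C ~ 6*N*D FLOPs and compute-optimal tokens D roughly 20x the parameter count N:

```python
N = 70e9                    # parameters
D = 20 * N                  # compute-optimal tokens (Chinchilla rule of thumb)
C = 6 * N * D               # training FLOPs (forward + backward approximation)
print(f"tokens = {D:.1e}, FLOPs = {C:.2e}")  # ~1.4e12 tokens, ~5.9e23 FLOPs
```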
Understand the complete data pipeline for pre-training a foundation model, including crawling, deduplication, quality filtering, data mixing, sequence packing, data annealing, and synthetic data generation.
Master instruction tuning (SFT) and chat templates. Learn how raw base models are transformed into helpful assistants using structured data, loss masking, and sequence packing.
Understand FP16/BF16 training formats, the necessity of master weights, and how dynamic loss scaling prevents gradient underflow.
Master FSDP and DeepSpeed ZeRO strategies for training LLMs. Compare memory efficiency, communication overhead, and 3D parallelism techniques.
Move beyond manual prompt engineering. Master DSPy's compiler to automatically optimize prompts, select few-shot examples, and improve LLM pipeline performance from data.
Master Recursive Language Models (RLMs), an inference-time approach that moves long context into a programmable environment so models can recurse over 10M+ token workloads with competitive quality and cost.
Master the mathematics of Low-Rank Adaptation (LoRA), adapter injection strategies, and memory/compute tradeoffs compared to full fine-tuning.
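The core LoRA equation in a few lines: the frozen weight W is augmented by a low-rank update (alpha/r) * B @ A, with B zero-initialized so training starts as a no-op. Dimensions are illustrative:

```python
import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                 # trainable, zero init: no-op at start

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
print(np.allclose(lora_forward(x), W @ x))        # True before any training
print(r * (d + k), "trainable params vs", d * k)  # 8192 vs 262144
```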
Understand the core mechanisms of knowledge distillation for LLMs. Master the techniques for compressing massive teacher models into efficient student models while preserving complex reasoning capabilities.
Master model merging techniques, from simple weight averaging and Task Arithmetic to TIES-Merging and DARE, including practical guidance on using mergekit for combining specialized models.
Understand how Constitutional AI replaces human feedback with AI self-supervision, and explore automated red teaming strategies for scalable model alignment.
Master the RLHF pipeline and DPO. Understand reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.
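The DPO side reduces to a single loss on each preference pair, computed from per-sequence log-probabilities under the policy and a frozen reference model. The values below are invented:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response a bit more than the reference does:
print(f"{dpo_loss(-12.0, -15.0, -12.5, -14.8, beta=0.1):.4f}")  # ~0.659
```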
Understand RLVR (the training approach that produced DeepSeek-R1's reasoning capabilities) using binary correctness signals instead of human preferences or reward model approximations.
End-to-end system design breakdowns for real-world AI applications
Master the architecture of a production-grade customer support agent, including RAG, tool use, and stateful human escalation.
Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.
Master the architecture of an end-to-end AI search engine, covering multi-stage retrieval, hallucination verification, and streaming synthesis.
Master the design of a real-time code completion system like Copilot, including context construction, model serving, and low-latency UX.
Master the design of a multi-tenant platform that serves large language models with strict SLA guarantees, token-aware rate limiting, and accurate cost tracking.
Master how to design a production reasoning agent (like o1/DeepSeek-R1) that uses chain-of-thought, tree search, and test-time compute scaling for complex problem solving.
Master the architecture of a real-time voice AI agent with sub-500ms latency. Covers VAD, streaming STT/LLM/TTS pipelines, WebRTC transport, and handling interruptions.
Master CLIP's contrastive pre-training, zero-shot classification, and the architecture of modern VLMs like LLaVA, BLIP-2, and Qwen-VL.
Deep dive into multimodal LLM architecture covering encoders, projection strategies, fusion techniques, three-stage training with DPO, MoE for efficient inference, and adaptive thinking modes.
Master the mathematics and architecture of Diffusion Models, from the forward noising process to U-Net denoising, Classifier-Free Guidance, and Latent Diffusion scaling.
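The forward process has a closed form: x_t is a blend of the clean input x_0 and Gaussian noise, weighted by the cumulative schedule alpha_bar_t. A sketch with the standard linear beta schedule:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # standard linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))           # stand-in for an image or latent
xt = q_sample(x0, t=500, rng=rng)
print(f"signal fraction at t=500: {np.sqrt(alpha_bar[500]):.3f}")
```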