Advanced · 14 lessons

Inference and Serving

Understand serving bottlenecks from TTFT and KV cache through batching, quantization, model parallelism, and autoscaling.

For engineers responsible for latency, cost, local deployment, model gateways, and GPU serving reliability.

You can reason about model fit, slow responses, and which serving technique fixes each bottleneck.

  1. Inference: TTFT, TPS & KV Cache
     Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and prefill/decode disaggregation.
     Inference & Production Scale · Hard · 31 min
  2. Multi-Query & Grouped-Query Attention
     Compare MHA, MQA, and GQA architectures, calculate their KV cache footprint, and reason about memory-limited serving tradeoffs.
     Inference & Production Scale · Hard · 40 min
  3. KV Cache & PagedAttention
     Calculate KV cache capacity, trace paged block allocation, and separate memory packing from prefix reuse and scheduling tradeoffs.
     Inference & Production Scale · Hard · 37 min
  4. Prefix Caching and Prompt Caching
     Structure exact reusable prefixes, validate cache hits from usage fields, and enforce invalidation and tenant-isolation boundaries.
     Inference & Production Scale · Hard · 19 min
  5. FlashAttention & Memory Efficiency
     Understand how FlashAttention cuts auxiliary attention memory from O(n²) to O(n) with tiling and online softmax, and analyze its IO complexity.
     Inference & Production Scale · Hard · 34 min
  6. Continuous Batching & Scheduling
     Understand how LLM schedulers use continuous batching, chunked prefill, and prefill-decode disaggregation to improve throughput without violating TTFT, TPOT, or inter-token latency targets.
     Inference & Production Scale · Hard · 34 min
  7. Scaling LLM Inference
     Explains why decode-heavy LLM serving is often memory-bound and how KV-cache design, batching, PagedAttention, and speculative decoding improve scale.
     Inference & Production Scale · Hard · 42 min
  8. Model Parallelism for LLM Inference
     Learn tensor parallelism, pipeline parallelism, context parallelism, and how multi-GPU serving trades memory capacity for communication overhead.
     Inference & Production Scale · Hard · 21 min
  9. Model Quantization: GPTQ, AWQ & GGUF
     Understand how GPTQ, AWQ, and GGUF trade off accuracy, memory footprint, and portability when serving LLMs on GPUs or local hardware.
     Inference & Production Scale · Hard · 35 min
  10. Local LLM Deployment
      Plan local LLM deployment with model size, quantization, pruning and sparsity trade-offs, Docker packaging, runtime choice, and hardware budgets.
      Inference & Production Scale · Hard · 18 min
  11. Speculative Decoding
      Reduce LLM inter-token latency by pairing cheap drafting with target-model verification. Learn the rejection-sampling proof, speedup model, method choices, and production rollout gates.
      Inference & Production Scale · Hard · 35 min
  12. Long Context Window Management
      Master long-context LLM engineering: KV-cache math, prefill-vs-decode bottlenecks, RoPE scaling, lost-in-the-middle behavior, and long-context vs. RAG trade-offs.
      Inference & Production Scale · Hard · 36 min
  13. Mixture of Experts Architecture
      Master MoE routing, load balancing, and the dense-vs-sparse serving tradeoffs behind Mixtral, DeepSeek, and Qwen3.6-style expert models.
      Inference & Production Scale · Hard · 38 min
  14. GPU Serving & Autoscaling
      Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.
      Inference & Production Scale · Hard · 50 min