LeetLLM

Your go-to resource for mastering AI & LLM systems.



Learn LLM Engineering

Master the concepts that power modern AI systems. From foundational transformer architecture to production system design, the curriculum is structured to take you from the basics to expert level.

76 topics · 8 modules · 38h total content · 12 free articles

Step-by-step roadmap

Follow these modules in order. Each step builds directly on the previous one.

  1. 🧪 AI Engineering Foundations (Step 1 of 8)

     Tokenization, embeddings, attention, and the core mental models behind modern LLMs

     11 topics · ~6h

  2. ⚡ Inference Systems & Optimization (Step 2 of 8)

     Serving architecture, KV cache mechanics, batching strategies, and latency/cost trade-offs

     12 topics · ~6h

  3. 🔍 Advanced Retrieval & Enterprise Memory (Step 3 of 8)

     Chunking, indexing, hybrid retrieval, GraphRAG, and enterprise data access controls

     7 topics · ~4h

  4. 🤖 Agentic Architecture & Orchestration (Step 4 of 8)

     Prompting, tool calling, memory, orchestration, and guardrails for robust agents

     13 topics · ~6h

  5. 📊 Evaluation & Reliability (Step 5 of 8)

     Benchmarking, LLM-as-judge, online experiments, and reliability diagnostics

     6 topics · ~3h

  6. 🛠️ LLMOps & Production Engineering (Step 6 of 8)

     Caching, deployment, versioning, and the operational discipline for production LLM systems

     4 topics · ~2h

  7. 🧬 Training, Alignment & Reasoning (Step 7 of 8)

     Pre-training to post-training: data pipelines, alignment methods, and reasoning performance

     13 topics · ~6h

  8. 🏗️ System Design Case Studies (Step 8 of 8)

     End-to-end system design breakdowns for real-world AI applications

     10 topics · ~6h
🧪 AI Engineering Foundations

Tokenization, embeddings, attention, and the core mental models behind modern LLMs

1. The Bitter Lesson & Compute · Easy · 20m

   Understand Sutton's Bitter Lesson, why general methods that use computation consistently outperform human-engineered heuristics, and how this principle shapes every modern AI architecture decision.

2. Tokenization: BPE & SentencePiece · Medium · 30m

   Compare tokenization algorithms, understand vocabulary size tradeoffs, analyze the multilingual tokenization tax, and handle Unicode edge cases in production.

3. Word to Contextual Embeddings · Medium · 30m

   Trace the full evolution from count-based methods through Word2Vec/GloVe to contextual BERT/GPT representations. Understand the distributional hypothesis, embedding geometry, and when to use static vs contextual embeddings in production.

4. Sentence Embeddings & Contrastive Loss (PRO) · Medium · 30m

   Master sentence embedding training with contrastive learning (InfoNCE), optimize retrieval with bi-encoder vs. cross-encoder architectures, and use modern advances like Matryoshka representations.

5. Dimensionality Reduction for Embeddings (PRO) · Medium · 30m

   Compare PCA, t-SNE, and UMAP for visualizing and compressing embeddings, and learn when MRL and product quantization replace post-hoc reduction.

6. Embedding Similarity & Quantization (PRO) · Medium · 25m

   Master vector similarity (cosine vs dot product), optimize dimensions with Matryoshka learning, and implement scalar (INT8), product (PQ), and binary (BQ) quantization for billion-scale retrieval systems.
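The similarity and quantization ideas in the card above can be sketched in a few lines of pure Python; the vectors and tolerances here are made up for illustration:

```python
import math

def cosine_sim(a, b):
    # Cosine similarity: dot product of the vectors divided by their L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def int8_quantize(v):
    # Scalar (INT8) quantization: one scale per vector, codes in [-127, 127].
    scale = max(abs(x) for x in v) / 127.0
    return [round(x / scale) for x in v], scale

def dequantize(q, scale):
    return [x * scale for x in q]

a = [0.3, -1.2, 0.8, 2.0]
b = [0.1, -0.9, 1.1, 1.7]
qa, sa = int8_quantize(a)
qb, sb = int8_quantize(b)
exact = cosine_sim(a, b)
approx = cosine_sim(dequantize(qa, sa), dequantize(qb, sb))
```

With 8 bits per dimension instead of 32, the memory drops 4x while the cosine score barely moves, which is the basic trade the article quantifies.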
7. Scaled Dot-Product Attention · Medium · 50m

   Derive the attention formula, prove the scaling factor, and implement multi-head attention. Analyze O(n²) complexity and understand the three attention variants (self, causal, cross) in modern architectures.
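The formula the card above derives, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, fits in a short sketch; this is a single-head, unbatched, unmasked toy version in plain Python:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    # with Q, K, V given as lists of row vectors.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two keys/values:
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attn = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, which is why attention weights always sum to 1.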
8. Positional Encoding: RoPE & ALiBi · Hard · 45m

   Understand why transformers need position information, derive sinusoidal encodings, explore how RoPE encodes relative position through rotation, compare ALiBi's linear bias approach, and analyze long-context extrapolation methods.

9. Layer Normalization: Pre-LN vs Post-LN (PRO) · Hard · 30m

   A deep dive into Layer Normalization mechanics: Pre-LN vs Post-LN gradient flow, representation collapse trade-offs, RMSNorm simplification, and modern innovations like QK-Norm and Peri-LN.

10. Decoding Strategies: Greedy to Nucleus · Hard · 45m

   Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), and min-p sampling, with temperature scaling and repetition penalty.
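Nucleus (top-p) sampling with temperature, as covered in the decoding card above, reduces to a few lines; a sketch assuming raw logits as input (the function name and defaults are illustrative):

```python
import math
import random

def nucleus_sample(logits, p=0.9, temperature=1.0, rng=random.Random(0)):
    # Softmax with temperature, keep the smallest set of tokens whose
    # cumulative probability reaches p, renormalize, then sample from it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    ranked = sorted(enumerate(e / z for e in exps), key=lambda t: -t[1])
    nucleus, cum = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in nucleus)
    r = rng.random() * total
    for tok, pr in nucleus:
        r -= pr
        if r <= 0:
            return tok
    return nucleus[-1][0]
```

With a sharply peaked distribution and a small p, the nucleus collapses to the top token and the sampler behaves like greedy decoding; raising the temperature flattens the distribution and widens the nucleus.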

11. Perplexity & Model Evaluation (PRO) · Medium · 45m

   Derive perplexity from cross-entropy loss, understand bits-per-byte normalization, and navigate the modern LLM evaluation landscape including LLM-as-Judge and Arena Elo.
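The perplexity derivation mentioned above boils down to exponentiating the average negative log-likelihood of the observed tokens; a tiny illustration:

```python
import math

def perplexity(token_probs):
    # Perplexity = exp(cross-entropy) = exp(-mean log p(token)).
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every observed token has
# perplexity 4: it is "as confused as" a uniform choice among 4 tokens.
uniform = perplexity([0.25] * 10)
```

A perfect model (probability 1.0 on every observed token) scores the minimum perplexity of 1.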
⚡ Inference Systems & Optimization

Serving architecture, KV cache mechanics, batching strategies, and latency/cost trade-offs

1. Inference: TTFT, TPS & KV Cache · Medium · 30m

   Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.
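The KV cache memory formula the card above derives is, in its simplest form: 2 tensors (K and V) × layers × KV heads × head dimension × sequence length × batch × bytes per element. A back-of-envelope sketch; the model shapes below are assumptions roughly matching a Llama-2-7B-style model stored in FP16:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each of shape
    # [batch, n_kv_heads, seq_len, head_dim], bytes_per_elem=2 for FP16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-2-7B-like shapes: 32 layers, 32 KV heads, head_dim 128.
gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
```

At these shapes a single 4K-context request costs 2 GiB of cache, which is why GQA (fewer KV heads) and paging matter so much for serving.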

2. Multi-Query & Grouped-Query Attention (PRO) · Medium · 30m

   Master the inference optimizations that make serving large models possible. Compare MHA, MQA, and GQA architectures and their impact on KV cache memory.

3. KV Cache & PagedAttention (PRO) · Hard · 30m

   Understand KV cache storage strategies for multi-tenant LLM inference, including PagedAttention, memory fragmentation mitigation, and vLLM architecture.

4. FlashAttention & Memory Efficiency (PRO) · Hard · 35m

   Understand how FlashAttention achieves O(n) memory by tiling and online softmax, and analyze its IO complexity.

5. Continuous Batching & Scheduling (PRO) · Hard · 25m

   Understand high-throughput request schedulers for LLM serving, focusing on continuous batching, prefill-decode disaggregation, and latency-aware scheduling.

6. Scaling LLM Inference (PRO) · Hard · 35m

   Deep dive into LLM inference optimization: KV cache management, continuous batching, PagedAttention, and speculative decoding.

7. Speculative Decoding (PRO) · Medium · 35m

   Accelerate LLM inference 2-3x by decoupling drafting from verification. Learn the probability theory behind exact distribution matching and how to deploy speculative decoding in production.

8. Long Context Window Management (PRO) · Hard · 25m

   Master long-context LLM engineering: from RoPE scaling and attention patterns to practical context management strategies, lost-in-the-middle effects, and chunking approaches for production systems.

9. Model Quantization: GPTQ, AWQ & GGUF (PRO) · Medium · 35m

   Understand the post-training quantization methods GPTQ, AWQ, and GGUF. Learn how to deploy 70B models on consumer GPUs with minimal quality loss.

10. Mixture of Experts (MoE) (PRO) · Hard · 25m

   Master MoE routing and load balancing, and understand why modern MoE models like DeepSeek-V3 achieve better compute-quality tradeoffs.

11. Mamba & State Space Models (PRO) · Hard · 45m

   Master the linear-time alternative to transformers: from structured state spaces (S4) through Mamba's selective mechanism to hybrid architectures like Jamba that combine the best of both worlds.

12. Reasoning & Test-Time Compute (PRO) · Hard · 35m

   Understand the shift from train-time to test-time compute scaling. Explore how reasoning models trade inference FLOPs for better logical deduction.


🔍 Advanced Retrieval & Enterprise Memory

Chunking, indexing, hybrid retrieval, GraphRAG, and enterprise data access controls

1. Chunking Strategies · Medium · 30m

   Compare document chunking approaches for RAG (fixed-size, semantic, and recursive) and their impact on retrieval quality.
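The fixed-size baseline from the card above can be sketched as one function: character-based chunks with an overlap so no sentence is split without context. The sizes here are arbitrary illustrative defaults:

```python
def chunk_fixed(text, size=200, overlap=50):
    # Fixed-size character chunks with overlap: the simplest RAG baseline.
    # Each chunk starts `size - overlap` characters after the previous one.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# 500 characters of varied text -> three overlapping 200-char chunks.
text = "".join(str(i % 10) for i in range(500))
chunks = chunk_fixed(text, size=200, overlap=50)
```

Semantic and recursive chunkers replace the fixed `step` with boundaries taken from the document itself (sentences, headings, separators), which is what the article compares.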

2. Vector DB Internals: HNSW & IVF (PRO) · Hard · 45m

   Master the internals of approximate nearest neighbor algorithms: HNSW, IVF, and Product Quantization. Understand the speed-recall-memory tradeoffs in production vector databases.

3. Hybrid Search: Dense + Sparse (PRO) · Medium · 30m

   Understand how to build a hybrid retrieval system combining BM25 sparse search with dense vector embeddings for optimal recall.
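One common way to merge the sparse and dense result lists described above (not named in the card, but widely used for exactly this) is Reciprocal Rank Fusion; a sketch with hypothetical document ids:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank(d)).
    # k=60 is the constant from the original RRF paper; it damps the
    # influence of top ranks so no single list dominates.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d3", "d2"]   # e.g. BM25 order (ids are hypothetical)
dense = ["d2", "d1", "d4"]    # e.g. embedding-similarity order
fused = rrf([sparse, dense])
```

Because RRF only uses ranks, it needs no score normalization between BM25 and cosine scores, which is why it is a popular default fusion step.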

4. Production RAG Pipelines (PRO) · Medium · 20m

   Understand the architecture of end-to-end RAG systems: retriever design, vector indices, chunking strategies, and hallucination mitigation.

5. Advanced RAG: HyDE & Self-RAG (PRO) · Hard · 30m

   Master advanced RAG techniques including query decomposition, HyDE, Self-RAG, and Corrective RAG (CRAG) to build robust retrieval pipelines.

6. GraphRAG & Knowledge Graphs (PRO) · Hard · 45m

   How Microsoft's GraphRAG architecture uses community detection and graph structure to answer questions that pure vector search cannot.

7. RAG Security & Access Control (PRO) · Hard · 25m

   Understand row-level security, document ACLs, and per-user filtering in vector stores to prevent RAG systems from leaking confidential data.
🤖 Agentic Architecture & Orchestration

Prompting, tool calling, memory, orchestration, and guardrails for robust agents

1. CoT, ToT & Self-Consistency Prompting · Medium · 25m

   Master Chain-of-Thought prompting, Self-Consistency, and Tree-of-Thought strategies. Learn when to trade inference compute for reasoning accuracy.

2. Structured Output Generation (PRO) · Medium · 25m

   Master the techniques for guaranteeing structurally valid LLM outputs, from JSON mode and function calling schemas to grammar-guided decoding with finite state machines.

3. Function Calling & Tool Use (PRO) · Medium · 30m

   Understand how LLMs learn to call functions, parse structured output, and handle multi-step tool use chains.

4. MCP & Tool Protocol Standards (PRO) · Medium · 15m

   Understand the Model Context Protocol (MCP) and emerging standards for agent-tool interaction, from protocol architecture and transport layers to security considerations and ecosystem integration.

5. ReAct & Plan-and-Execute (PRO) · Hard · 25m

   Master the core patterns for autonomous agents: ReAct loops, Plan-and-Execute architectures, and multi-agent orchestration.

6. Agent Memory & Persistence (PRO) · Medium · 25m

   Master memory systems for LLM agents, from short-term working memory and conversation buffers to long-term semantic stores, episodic recall, and MemGPT's hierarchical memory management.

7. Human-in-the-Loop Agents (PRO) · Medium · 30m

   Design systems that pause agent execution for human approval, building trust into autonomous AI for everything from bank transfers to code deployment.

8. Guardrails & Safety Filters (PRO) · Hard · 30m

   Design input/output safety filters for a production LLM application with configurable policy enforcement.

9. Prompt Injection Defense (PRO) · Medium · 30m

   Master prompt injection attacks, understand why they bypass safety filters, and design multi-layer defense strategies for production LLM systems.

10. Code Generation & Sandboxing (PRO) · Hard · 35m

   Master the architecture of code generation agents, from the generate-execute-debug loop to secure sandboxing with gVisor and WebAssembly.

11. Agent Failure & Recovery (PRO) · Medium · 30m

   Implement deterministic fallbacks, infinite-loop breakers, and graceful degradation for when LLM agents hallucinate or get stuck.

12. Multi-Agent Orchestration (PRO) · Hard · 30m

   Master multi-agent DAGs using LangGraph and AutoGen. Learn to implement shared state, message passing, conditional routing, and human-in-the-loop workflows for robust AI systems.

13. AI Agent Evaluation and Benchmarking (PRO) · Medium · 20m

   Design evaluation frameworks for AI agents, from task-completion benchmarks like SWE-bench and OSWorld to custom metrics for tool use accuracy, multi-step reasoning, and safety in agentic workflows.
📊 Evaluation & Reliability

Benchmarking, LLM-as-judge, online experiments, and reliability diagnostics

1. LLM Benchmarks & Limitations · Medium · 20m

   Understand major LLM benchmarks (MMLU, HumanEval, GPQA), measurement protocols (pass@k, Elo), and the impact of data contamination.

2. LLM-as-a-Judge Evaluation (PRO) · Medium · 15m

   Master the LLM-as-a-Judge approach, from designing effective rubrics to handling biases like position and verbosity.

3. A/B Testing for LLMs (PRO) · Medium · 25m

   Master the design of an A/B testing framework for LLM-powered features, including metric selection, sample size calculations, and automated guardrails.

4. LLM Observability & Monitoring (PRO) · Medium · 30m

   Design an observability stack for LLM applications covering logging, metrics, tracing, and drift detection.

5. Hallucination Detection & Mitigation (PRO) · Medium · 30m

   Master the taxonomy, detection methods, and mitigation strategies for LLM hallucinations, from statistical self-consistency checks to retrieval-grounded generation and chain-of-verification.

6. Bias & Fairness in LLMs (PRO) · Medium · 35m

   Master the taxonomy of LLM biases, implementation of fairness metrics, and end-to-end mitigation strategies from data curation to RLHF.
🛠️ LLMOps & Production Engineering

Caching, deployment, versioning, and the operational discipline for production LLM systems

1. Semantic Caching & Cost Optimization (PRO) · Medium · 25m

   Implement semantic caches, request deduplication, and cost-aware routing to cut LLM API costs by 40-70% without quality loss.
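A semantic cache, as described above, typically treats any stored query whose embedding falls within a cosine-similarity threshold as a hit. A minimal in-memory sketch; the class name, embeddings, and threshold are illustrative, and a real system would use an ANN index rather than a linear scan:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    # A cache hit is any stored query embedding within the cosine threshold,
    # so paraphrased queries can reuse an earlier LLM response.
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, emb):
        for stored, response in self.entries:
            if _cos(emb, stored) >= self.threshold:
                return response
        return None

    def put(self, emb, response):
        self.entries.append((emb, response))

cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.1], "cached answer")
hit = cache.get([0.95, 0.05, 0.1])   # near-duplicate query -> hit
miss = cache.get([0.0, 1.0, 0.0])    # unrelated query -> miss
```

The threshold is the whole trade-off: too low and users get stale or wrong answers for genuinely different questions; too high and the cache never fires.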

2. LLM Cost Engineering and Token Economics (PRO) · Medium · 25m

   Master the economics of LLM deployment. Learn token-level cost modeling, prompt optimization, caching strategies, model routing, and build-vs-buy decisions at scale.

3. Model Versioning & Deployment (PRO) · Medium · 25m

   Master the architecture of CI/CD pipelines for LLM deployments, covering model versioning, automated evaluation gates, and rollback strategies.

4. GPU Serving & Autoscaling (PRO) · Hard · 30m

   Master the design of GPU serving infrastructure for LLMs with autoscaling, continuous batching, and cost optimization.
🧬 Training, Alignment & Reasoning

Pre-training to post-training: data pipelines, alignment methods, and reasoning performance

1. Scaling Laws & Compute Training (PRO) · Hard · 30m

   Master the empirical power laws governing LLM performance, from Kaplan's original scaling results through Chinchilla-optimal ratios to modern inference-aware training strategies.
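The Chinchilla-optimal ratio mentioned above has a well-known back-of-envelope form: training compute C ≈ 6·N·D FLOPs, with roughly D ≈ 20 training tokens per parameter at the compute-optimal point. A sketch under exactly those two assumptions:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # Rule of thumb: C ~= 6 * N * D and D ~= 20 * N at compute-optimal,
    # so N ~= sqrt(C / (6 * 20)) and D ~= 20 * N.
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# Chinchilla itself: ~70B params on ~1.4T tokens, i.e. C ~= 6 * 70e9 * 1.4e12.
n, d = chinchilla_optimal(5.88e23)
```

Plugging in Chinchilla's own compute budget recovers its published shape, which is a quick sanity check on the heuristic.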

2. Pre-training Data at Scale (PRO) · Medium · 25m

   Understand the end-to-end data pipeline for pre-training a foundation model, including crawling, deduplication, quality filtering, and data mixing.

3. Instruction Tuning & Chat Templates (PRO) · Medium · 25m

   Master instruction tuning (SFT) and chat templates. Learn how raw base models are transformed into helpful assistants using structured data, loss masking, and sequence packing.

4. Mixed Precision Training (PRO) · Medium · 30m

   Understand FP16/BF16 training formats, the necessity of master weights, and how dynamic loss scaling prevents gradient underflow.

5. Distributed Training: FSDP & ZeRO (PRO) · Hard · 30m

   Master FSDP and DeepSpeed ZeRO strategies for training LLMs. Compare memory efficiency, communication overhead, and 3D parallelism techniques.

6. Prompt Optimization with DSPy (PRO) · Hard · 25m

   Move beyond manual prompt engineering. Learn to use DSPy's compiler to automatically optimize prompts, select few-shot examples, and improve LLM pipeline performance from data.

7. Recursive Language Models (RLM) (PRO) · Hard · 25m

   Master Recursive Language Models (RLMs), an inference-time approach that moves long context into a programmable environment so models can recurse over 10M+ token workloads with competitive quality and cost.

8. LoRA & Parameter-Efficient Tuning · Hard · 35m

   Master the mathematics of Low-Rank Adaptation (LoRA), adapter injection strategies, and memory/compute tradeoffs compared to full fine-tuning.
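The LoRA memory argument from the card above comes down to parameter counting: a rank-r update B·A costs r·(d_in + d_out) trainable parameters instead of d_in·d_out for the full matrix. A quick illustration with a typical 4096-wide projection:

```python
def lora_params(d_in, d_out, rank):
    # LoRA freezes W (d_out x d_in) and learns the update B @ A,
    # with B of shape (d_out, r) and A of shape (r, d_in).
    return rank * (d_in + d_out)

full = 4096 * 4096                      # trainable params for a full update
lora = lora_params(4096, 4096, rank=8)  # trainable params for a rank-8 adapter
reduction = full / lora
```

At rank 8 the adapter trains 256x fewer parameters for this layer, which is where LoRA's optimizer-state and gradient memory savings come from.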

9. Knowledge Distillation (PRO) · Medium · 25m

   Understand the core mechanisms of knowledge distillation for LLMs. Master the techniques for compressing massive teacher models into efficient student models while preserving complex reasoning capabilities.

10. Model Merging and Weight Interpolation (PRO) · Hard · 18m

   Master model merging techniques, from simple weight averaging and Task Arithmetic to TIES-Merging and DARE, including practical guidance on using mergekit for combining specialized models.

11. Constitutional AI & Red Teaming (PRO) · Medium · 30m

   Master Constitutional AI's self-improvement loop and automated red teaming strategies for scalable model alignment.

12. RLHF & DPO Alignment (PRO) · Hard · 35m

   Master the RLHF pipeline and DPO. Understand reward modeling, PPO mechanics, and the trade-offs between iterative reinforcement learning and direct preference optimization.

13. RLVR & Verifiable Rewards (PRO) · Hard · 25m

   Understand RLVR (the training approach that produced DeepSeek-R1's reasoning capabilities), which uses binary correctness signals instead of human preferences or reward model approximations.
๐Ÿ—๏ธ

System Design Case Studies

End-to-end system design breakdowns for real-world AI applications

Automated Support Agent

PRO

Architect a production-grade customer support agent with RAG, tool use, and human escalation capabilities.

Medium20m
2

Content Moderation System

Master the architecture of a real-time content moderation system using LLMs and specialized classifiers.

Medium35m

LLM-Powered Search Engine

PRO

Master the architecture of an end-to-end AI search engine, covering multi-stage retrieval, hallucination verification, and streaming synthesis.

Hard25m

Code Completion System

PRO

Master the design of a real-time code completion system like Copilot, including context construction, model serving, and low-latency UX.

Hard20m

Multi-Tenant LLM Platform

PRO

Master the design of a multi-tenant platform that serves large language models with strict SLA guarantees, token-aware rate limiting, and accurate cost tracking.

Hard30m

Reasoning & Test-Time Compute

PRO

Master how to design a production reasoning agent (like o1/DeepSeek-R1) that uses chain-of-thought, tree search, and test-time compute scaling for complex problem solving.

Hard45m

Real-Time Voice AI Agent

PRO

Architect a real-time voice AI agent with sub-500ms latency. Covers VAD, streaming STT/LLM/TTS pipelines, WebRTC transport, and handling interruptions.

Hard45m

Vision-Language Models & CLIP

PRO

Master CLIP's contrastive pre-training, zero-shot classification, and the architecture of modern VLMs like LLaVA and GPT-4V.

Hard40m

Multimodal LLM Architecture

PRO

Learn how to design a multimodal LLM that processes text, images, and audio, covering projection strategies and cross-modal attention.

Hard30m

Diffusion Models & Image Generation

PRO

Master the mathematics and architecture of Diffusion Models, from the forward noising process to U-Net denoising, Classifier-Free Guidance, and Latent Diffusion scaling.

Hard45m
