Fifty LLM engineering concepts, organized by system layer. Each answer focuses on mechanism, trade-off, failure mode, and production intuition.
This is a systems map, not a flashcard pile. Read one concept, cover the answer, then explain the mechanism, the trade-off, and the production failure it helps you debug.
Read in this order: model mechanics, runtime constraints, systems and retrieval, then evaluation and safety. If a concept feels thin, follow the linked LeetLLM lesson for the full derivation, code, and design walk-through.
Before the deeper concepts, be able to explain next-token prediction, context windows, logits, sampling controls, prompt hierarchy, and the difference between stored weights and runtime activations. These primitives explain KV-cache growth, token budgets, decoding behavior, prompt-injection risk, and memory pressure.
Self-attention turns each token into Query, Key, and Value vectors. Queries score keys, softmax turns scores into weights, and those weights mix values: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.
With head dimension $d_k$, the scale is $1/\sqrt{d_k}$. The scale keeps scores from becoming too sharp as head dimension grows. Cost is the trade-off: every token can attend to every other token, which gives rich context but creates quadratic sequence-length scaling.[1] See Scaled Dot-Product Attention.
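The query-key-value mixing described above can be sketched in a few lines of plain Python. This is a minimal single-head sketch with no masking and no learned projections; the function names and the toy `Q`, `K`, `V` matrices are illustrative assumptions, not taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors (single head, no mask)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q in Q:
        # One score per key: this inner loop is the source of the n * n cost.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs (hypothetical values chosen so each query matches one key best).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each output row is a weighted average of the value rows, so the first query pulls mostly from the first value and the second query mostly from the second.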
One attention head can learn one similarity pattern. Multi-head attention gives separate subspaces for syntax, position, entity links, or semantic similarity. Heads aren't magic interpretable modules, but splitting attention lets the model represent several relationships in parallel.
Standard multi-head attention stores separate keys and values for every head. Multi-Query Attention (MQA) shares one key/value set across all query heads, cutting KV-cache memory but reducing representational freedom.[2] Grouped-Query Attention (GQA) is the middle path: groups of query heads share each key/value set, so memory drops while quality stays closer to multi-head attention.[3]
Transformers need position information because token order isn't inherent in parallel attention. RoPE rotates Query and Key vectors by position, so their dot product carries relative-offset information.[4] That fits decoder-only language modeling well and gives long-context extensions a useful base.
RoPE doesn't make long context free. Past training length, teams still need scaling tricks, retuning, or careful evals. More: RoPE and ALiBi.
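The relative-offset property can be checked numerically. The sketch below rotates consecutive pairs of dimensions, which is one common layout (some implementations pair the first half of the vector with the second half instead); the `base` of 10000 and the example vectors are assumptions for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
# A different offset for contrast.
score_c = dot(rope(q, 5), rope(k, 4))
```

`score_a` and `score_b` agree up to floating-point error because only the offset between the two positions survives in the dot product.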
Post-LN normalizes after the residual addition, so every residual path passes through a normalization. Pre-LN normalizes before the sublayer, leaving the residual path cleaner for gradients. That's why many deep decoder-only LLMs use Pre-LN or norm-first variants.
RMSNorm is common too. It drops mean centering, normalizes by root mean square, and is cheaper to compute. More: Layer Normalization.
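A side-by-side sketch makes the difference concrete. Both functions below omit the learned gain (and, for LayerNorm, bias) that real layers carry; the `eps` value is an assumed default.

```python
import math

def layer_norm(x, eps=1e-6):
    # Center on the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    # No mean subtraction: divide by the root mean square only.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)
```

LayerNorm output has zero mean; RMSNorm output keeps a nonzero mean but has unit mean square, and it needs one fewer pass of statistics.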
Word-level tokenization breaks on new names and misspellings. Character-level tokenization handles any string but makes sequences too long. Subword tokenizers such as BPE, WordPiece, and SentencePiece split rare words into reusable pieces while keeping common words compact.
Production trap: words and tokens aren't interchangeable. Measure token counts on your actual language mix, code, logs, and documents. More: BPE and SentencePiece.
Static embeddings assign one vector per word. Contextual embeddings change after attention layers process surrounding text, so "charge" in a payment dispute differs from "charge" in a battery instruction.
The embedding table starts as a lookup matrix. Context comes from the Transformer stack that transforms those vectors. More: Contextual Embeddings.
Cosine similarity compares direction after length normalization. Dot product compares direction and magnitude. If embedding norms vary, dot product can favor high-magnitude vectors even when they're less semantically relevant.
Some retrieval models are trained for dot product or Maximum Inner Product Search, so don't switch metrics blindly. Match metric to training objective and index type. More: Embedding Similarity.
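A small example shows how the two metrics can disagree. The vectors are made-up two-dimensional stand-ins for embeddings, chosen so that one document is well aligned with the query but short, and the other is poorly aligned but long.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
doc_aligned = [0.9, 0.1]  # nearly the same direction, small norm
doc_large = [3.0, 3.0]    # 45 degrees off, large norm

dot_winner = max(["aligned", "large"],
                 key=lambda n: dot(query, {"aligned": doc_aligned, "large": doc_large}[n]))
cos_winner = max(["aligned", "large"],
                 key=lambda n: cosine(query, {"aligned": doc_aligned, "large": doc_large}[n]))
```

Dot product ranks `doc_large` first (3.0 vs 0.9) while cosine ranks `doc_aligned` first, which is the magnitude effect described above.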
During generation, the model appends one token at a time. The KV cache stores key/value states from previous tokens so the server doesn't replay the full prompt for every next token.
Memory scales as 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * batch_size. A 100K-token request can consume tens of GB of KV cache even when weights already fit. Most serving wins start here: GQA, PagedAttention, cache quantization, shorter active context, and better scheduling. More: KV Cache.
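The formula is easy to turn into a sizing helper. The model shape below (32 layers, 32 query heads, head dimension 128, FP16 cache) is a hypothetical configuration picked for round numbers, not a specific released model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2, batch_size=1):
    # Leading 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * batch_size

GIB = 1024 ** 3
TOKENS = 100_000

mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, tokens=TOKENS)  # one KV head per query head
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=TOKENS)   # 4 query heads share each KV head
mqa = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128, tokens=TOKENS)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / GIB:.1f} GiB")
```

Under these assumptions a single 100K-token request needs roughly 48.8 GiB with full multi-head KV, 12.2 GiB with 8 KV heads, and 1.5 GiB with one KV head, which is why KV-head count is usually the first knob to check.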
PagedAttention stores KV cache in fixed-size blocks instead of one contiguous allocation. That cuts fragmentation and lets the scheduler share or allocate cache blocks more flexibly.[5]
The vLLM paper reported 2-4x higher throughput than FasterTransformer and Orca at similar latency on its evaluated workloads.[5] PagedAttention isn't an attention approximation. It's memory management for serving.
TTFT (time to first token) is dominated by prefill: reading the prompt and building KV cache. TPS (tokens per second) is dominated by decode: generating one token at a time while reading cached states.
They often conflict. Low TTFT likes short prompts and small batches. High TPS likes batching and high GPU utilization. Continuous batching and chunked prefill help balance those goals. More: TTFT and TPS.
Static batching waits for a whole batch to finish. Continuous batching admits new requests as old ones finish, keeping decode slots full even when output lengths vary.
Real traffic mixes short classifications, medium chat replies, and long generations. Serving engines such as vLLM, TensorRT-LLM, and SGLang use iteration-level scheduling for this reason.
Quantization stores weights in fewer bits, often 8-bit or 4-bit instead of FP16/BF16. GPTQ and AWQ are post-training quantization methods. GGUF is a file format common in llama.cpp style local inference.
Trade-off: memory and speed vs quality. Benchmark on your task. A 4-bit model that passes chat demos can still fail code, math, or long-context retrieval. More: Model Quantization.
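To see where the quality loss comes from, here is a minimal sketch of symmetric absmax round-to-nearest quantization to 8-bit integers. This is the simplest possible scheme, not GPTQ or AWQ, which add calibration data and error compensation; the weight values are invented.

```python
def quantize_int8(weights):
    # One scale for the whole tensor; fall back to 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) or 1.0) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.33, 1.27, -0.08]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, so a single large outlier widens the scale and costs precision for every other weight. That is one reason task-level benchmarks matter more than average error.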
Speculative decoding uses a small draft model to propose several tokens, then a larger target model verifies them in parallel.[6] Accepted tokens keep the target distribution intact under the algorithm's assumptions; at the first rejected token, the algorithm resamples from an adjusted target distribution and discards the rest of the draft.
Use it when the draft model is much faster and predicts the target well. Code and structured text often work better than open-ended creative text. More: Speculative Decoding.
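The "predicts the target well" condition can be made quantitative. In the standard speculative sampling rule, a token drawn from the draft distribution q is accepted with probability min(1, p/q) under the target distribution p, and a rejection resamples from the normalized positive part of p - q. The sketch below computes both pieces for one position; the three-token distributions are made up.

```python
def acceptance_probability(p_target, q_draft):
    """Chance that a token sampled from q_draft is accepted: sum of min(p, q)."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))

def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalize max(0, p - q)."""
    raw = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(raw)
    if total == 0.0:  # draft equals target, so rejection never happens
        return list(p_target)
    return [r / total for r in raw]

p = [0.6, 0.3, 0.1]  # target model probabilities (hypothetical)
q = [0.4, 0.4, 0.2]  # draft model probabilities (hypothetical)

alpha = acceptance_probability(p, q)
resid = residual_distribution(p, q)
# Probability of emitting each token = accepted mass + rejected mass * residual.
combined = [min(pi, qi) + (1 - alpha) * ri for pi, qi, ri in zip(p, q, resid)]
```

`combined` reproduces `p`, which is the sense in which the output distribution stays intact. The acceptance rate here is 0.8; a draft that diverges more from the target lowers it and erodes the speedup.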
Retrieval-Augmented Generation (RAG) gives the model external evidence before answering.[7] Production path: ingestion, parsing, chunking, embedding, indexing, retrieval, reranking, context assembly, generation, and validation.
Dense vector search handles semantic matches. Sparse search such as BM25 handles exact terms, codes, product names, and rare entities. Hybrid search combines both, often with Reciprocal Rank Fusion.[8]
Use hybrid retrieval for enterprise docs unless evals prove dense-only is enough. Exact terms matter more often than teams expect. More: Hybrid Search.
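Reciprocal Rank Fusion is simple enough to sketch directly: each document scores the sum of 1 / (k + rank) across the rankings it appears in. The constant k = 60 follows the original RRF paper; the document IDs and the two input rankings are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that both retrievers rank well (`doc_a`, `doc_c`) rise above documents that only one retriever found. Because RRF uses ranks rather than raw scores, the dense and sparse scores never need to be put on a common scale.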
Separate retrieval quality from generation quality. Retrieval metrics include recall@k, MRR, and nDCG. Generation metrics include faithfulness, relevance, and completeness.
RAGAS-style metrics and LLM judges can help, but they need calibration against real labels.[9][10] If you only score final answers, retrieval bugs stay hidden. More: RAG Evaluation.
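Retrieval metrics are cheap to compute once you have labeled relevant documents per query. A minimal sketch of recall@k and reciprocal rank (MRR is the mean of reciprocal rank over queries) follows; the document IDs are placeholders.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

r_at_3 = recall_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
```

Here recall@3 is 0.5 because `d2` sits at rank 4. If the context window only takes the top 3 chunks, the generator never sees `d2`, and no amount of prompt tuning fixes that.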
Chunking splits documents into retrievable units. Fixed-size chunks are predictable but can cut meaning in half. Structural chunks preserve headings and sections. Semantic chunks split where topic shifts.
There is no universal chunk size. Start with a simple baseline, then tune from retrieval evals and failure examples. More: Chunking Strategies.
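A simple baseline could be fixed-size chunking with a small overlap, sketched below. The overlap parameter is an assumption added here to soften the "cut meaning in half" problem; the sizes are arbitrary, and `tokens` can be any list (tokenizer output, words, or sentences).

```python
def chunk_fixed(tokens, size, overlap=0):
    """Split `tokens` into windows of `size`, each sharing `overlap` items with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
        start += step
    return chunks

chunks = chunk_fixed(list(range(10)), size=4, overlap=1)
```

With ten items, size 4, and overlap 1, the result is three chunks whose boundaries share one item each. Swapping this function for a structural or semantic splitter while holding the retrieval eval set fixed is how you find out whether the extra complexity pays off.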
GraphRAG builds entity and relationship structure from documents, then retrieves connected subgraphs instead of isolated chunks.[11] It helps when questions require multi-hop relationships, such as product dependencies, org structures, or legal references.
Don't add a graph if vector or hybrid retrieval solves the task. Graphs add extraction, update, and correctness burdens. More: GraphRAG.
Pre-training teaches broad next-token prediction. Supervised fine-tuning (SFT) teaches instruction following and formats. Alignment uses preferences, critiques, or rewards to steer behavior.
LoRA freezes base weights and trains small low-rank update matrices instead of every parameter.[12] It makes adaptation cheaper and produces small adapters that can be swapped or served alongside one base model.
QLoRA combines LoRA with 4-bit base-model quantization, making large-model tuning possible on much smaller hardware.[13] More: LoRA.
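The savings are easy to estimate. For a weight matrix of shape d_out by d_in, LoRA trains two factors of shapes d_out by r and r by d_in. The 4096 by 4096 projection and rank 8 below are illustrative numbers, not tied to a particular model.

```python
def lora_params(d_in, d_out, rank):
    # B is (d_out x rank), A is (rank x d_in); the frozen base matrix adds nothing trainable.
    return rank * (d_in + d_out)

d_in = d_out = 4096
full = d_in * d_out                    # trainable parameters under full fine-tuning
lora = lora_params(d_in, d_out, rank=8)
ratio = lora / full

print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

For this one matrix the adapter holds 65,536 trainable values against 16,777,216 in the base, about 0.39%. That small footprint is what makes swapping or co-serving many adapters practical.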
RLHF trains a reward model from preference data, then optimizes the language model against that reward using reinforcement learning.[14] DPO skips the explicit reward model and directly optimizes on preference pairs.[15]
DPO is simpler operationally. RLHF still matters when you want an explicit reward model for auditing, reuse, or online optimization. More: RLHF and DPO.
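The DPO objective for one preference pair can be written in a few lines. Inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; `beta=0.1` is an assumed setting, and the numbers in the example are invented.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * (policy log-ratio margin minus reference margin))) for one pair."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy widens the gap between chosen and rejected responses relative to the reference. No separate reward model appears anywhere, which is the operational simplification.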
Reinforcement Learning with Verifiable Rewards (RLVR) uses programmatic verifiers when success is objective: unit tests for code, exact answers for math, schema validators for structured output.[16]
DeepSeek-R1 made this direction visible at scale, with separate pipelines for R1-Zero and R1.[17] Production lesson: if you can write a verifier, you can create cheaper, more consistent reward signal than pure preference labels.
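A verifier in this sense is just a function that returns a score without a human or a learned reward model in the loop. The two sketches below cover exact-answer checking and a required-keys check on JSON output; the function names and the binary 0/1 reward are assumptions for illustration, and real pipelines often normalize answers more carefully.

```python
import json

def exact_answer_reward(model_answer, reference):
    """1.0 if the answers match after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def schema_reward(model_output, required_keys):
    """1.0 if the output parses as a JSON object containing every required key, else 0.0."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if set(required_keys) <= set(data) else 0.0

r1 = exact_answer_reward(" 42\n", "42")
r2 = schema_reward('{"name": "Ada", "age": 36}', {"name", "age"})
r3 = schema_reward("not json", {"name"})
```

The same input always gets the same reward, which is the consistency advantage over preference labels. The cost is that the verifier defines success, so gaps in it become gaps the policy can exploit.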
Use prompting for one-off behavior changes. Use RAG when the model needs fresh, private, or cited knowledge. Use fine-tuning when you need durable behavior, format adherence, or domain style.
Scaling laws relate loss to compute, data, and parameters.[18] Chinchilla shifted the practical lesson: under dense-training assumptions, many earlier models were too parameter-heavy for their training-token count.[19]
Don't compare parameter count alone. Training tokens, data quality, architecture, and post-training can matter as much as raw size. More: Scaling Laws.
ReAct alternates reasoning, tool action, observation, and answer.[20] The key is that each observation changes the next step, which keeps the agent grounded in tool output instead of memory.
Model Context Protocol (MCP) exposes tools, resources, and prompts through typed schemas.[21] It gives clients a consistent discovery and invocation layer instead of one-off tool glue for every integration.
MCP is less about model quality and more about tool interoperability, schema clarity, and security boundaries. More: MCP Standards.
Multi-agent systems split work across specialized roles: planner, worker, critic, retriever, or tool executor. Use them when separation improves quality, safety, or parallelism.
Don't default to many agents. Extra LLM calls add latency, cost, and debugging surface. Start with one capable agent and add roles only when failure analysis points there. More: Multi-Agent Orchestration.
Common failures: loops, hallucinated tools, context overflow, cascading errors, and goal drift. The controls are max iterations, schema validation, tool allowlists, checkpoints, summaries, and human escalation.
Strong answer: name a failure symptom and the control that catches it. More: Agent Failure.
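Several of these controls fit in one small loop. The sketch below bounds iterations, rejects tools outside an allowlist, and escalates on an identical repeated call. Everything here is hypothetical scaffolding: `step_fn` stands in for the model call, and the action dictionary shape is an assumed convention, not a framework API.

```python
def run_agent(step_fn, allowed_tools, max_iterations=5):
    """Drive an agent loop with an iteration cap, a tool allowlist, and repeat detection."""
    seen_calls = set()
    observation = None
    for i in range(max_iterations):
        action = step_fn(observation)
        if action["type"] == "final":
            return {"status": "done", "answer": action["answer"], "iterations": i + 1}
        tool, args = action["tool"], action["args"]
        if tool not in allowed_tools:  # hallucinated or forbidden tool
            return {"status": "escalate", "reason": f"unknown tool: {tool}"}
        call_key = (tool, repr(sorted(args.items())))
        if call_key in seen_calls:     # same call twice: likely a loop
            return {"status": "escalate", "reason": "repeated tool call"}
        seen_calls.add(call_key)
        observation = allowed_tools[tool](**args)
    return {"status": "escalate", "reason": "max iterations reached"}

# A stand-in policy that never makes progress.
def looping_policy(observation):
    return {"type": "tool", "tool": "search", "args": {"query": "refund policy"}}

tools = {"search": lambda query: f"results for {query}"}
loop_result = run_agent(looping_policy, tools)
```

Each exit path maps a symptom to a control: the loop is caught by repeat detection before the iteration cap is needed, and an invented tool name never reaches execution.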
Prompt injection is untrusted text trying to override higher-priority instructions or control tools. The most important boundary is trusted vs untrusted context.
Layer defenses: tool-side permission checks, schema validation, output checks, least privilege, retrieval isolation, and confirmation gates for side effects. Filters help but don't solve it alone. Full playbook: Prompt Injection Defense.
Perplexity is the exponentiated average negative log-likelihood. Lower means the model assigned higher probability to the observed text.
Use it for same-tokenizer checkpoint comparisons. Don't use it to rank instruction-following systems across model families, because tokenization and usefulness differ. More: Perplexity.
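The definition translates directly into code. The sketch assumes you already have per-token natural-log probabilities for the observed text, which is how most scoring APIs report them; the example values are synthetic.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over the observed tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every observed token.
uniform_ppl = perplexity([math.log(0.25)] * 8)
# A model that is more confident on the same number of tokens.
confident_ppl = perplexity([math.log(0.5)] * 8)
```

Assigning probability 1/4 to each token gives a perplexity of about 4, which reads as "as uncertain as a uniform choice among 4 tokens". Because the average is per token, a tokenizer that splits the same text into more pieces changes the number, which is why cross-tokenizer comparisons mislead.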
LLM-as-judge uses another model to score outputs against a rubric. It can evaluate faithfulness, relevance, style, or completeness at scale.
Pitfalls include position bias, verbosity bias, self-preference, and rubric sensitivity.[22] Mitigate with randomized order, pairwise comparisons, multiple judges, and calibration against human labels. More: LLM-as-a-Judge.
A hallucination is plausible unsupported output. You can't eliminate it, but you can reduce and detect it.
Practical controls: retrieve trusted evidence, force citations, constrain structured output, run post-generation verification, and fail closed when evidence is missing. More: Hallucination Mitigation.
LLM A/B tests need product metrics and quality metrics. Track latency, cost, task completion, user feedback, safety rate, and sampled judge or human scores.
Use guardrail metrics for rare bad behavior. A variant that improves average helpfulness but doubles hallucination rate may be unacceptable. More: A/B Testing.
A support RAG system needs query routing, hybrid retrieval, reranking, context assembly, grounded generation, validation, and eval feedback. The best answer names both data path and feedback loop.
For code completion, the hardest constraint is latency. Inline completions must feel instant, so systems gather context from current file, neighboring tabs, language-server data, recent edits, and cursor prefix/suffix, then route to a small fast model.
Larger models fit explicit generation, refactoring, and chat flows. Completion quality is measured by acceptance rate, edit survival, latency, and downstream task success. More: Code Completion Design.
For content moderation, use tiers. Fast rules and classifiers catch obvious cases. LLM review handles ambiguous content. Human review handles appeals and high-risk uncertainty.
Design depends on false-positive vs false-negative cost. High-harm categories bias toward recall and escalation. Low-risk categories can bias toward precision. More: Content Moderation Design.
Cost is mostly input tokens, output tokens, model price, retries, and cache hit rate. The controls are model routing, prompt trimming, exact or semantic caching, batching, and self-hosting at high utilization.
Don't optimize only unit price. A cheap model that needs longer prompts, retries, or manual review can cost more end-to-end. More: LLM Cost Engineering.
Track request latency, TTFT, total tokens, model, cost, status, retrieval hits, judge scores, user feedback, tool calls, and safety events. For self-hosting, add GPU utilization, queue depth, batch size, and KV-cache occupancy.
Observability should answer: "Was this bad answer caused by retrieval, prompt assembly, model behavior, tool error, or latency timeout?" More: LLM Observability.
Exact caching keys on identical input. Semantic caching embeds the query and reuses a nearby cached answer when similarity is high enough.
Risk: stale or wrong answers. Use conservative thresholds, freshness policies, and bypass rules for personalized or high-stakes queries. More: Semantic Caching.
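A minimal semantic cache is a nearest-neighbor lookup with a similarity threshold. The sketch below takes embeddings as plain lists so it stays self-contained; the `SemanticCache` class, the 0.95 threshold, and the toy vectors are all assumptions, and a real system would add the freshness and bypass rules mentioned above.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the closest cached answer if it clears the threshold, else None."""
        best_score, best_answer = -1.0, None
        for cached_embedding, answer in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Refunds take 5 business days.")
hit = cache.get([0.99, 0.05])  # near-duplicate query
miss = cache.get([0.5, 0.8])   # different question
```

The threshold is the whole risk surface: set too low, a different question gets someone else's answer. Tune it against labeled paraphrase and non-paraphrase pairs instead of picking a number by feel.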
MoE models keep many expert feed-forward networks in memory but activate only a subset per token.[23] This raises total capacity without dense compute for every token.
DeepSeek-V3 reports 671B total parameters with about 37B active per token, using 256 routed experts plus shared experts and activating 8 routed experts per token.[24] The serving catch: all experts still affect memory and routing balance.
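Top-k routing can be sketched generically: score every expert, keep the k best, and normalize their weights. The version below applies softmax over the selected logits, which is one common variant and not a reproduction of DeepSeek-V3's router; the logits are invented. The last line works out the active fraction from the parameter counts quoted above.

```python
import math

def route_top_k(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Four hypothetical experts, two active per token.
weights = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)

# Share of parameters touched per token, from the reported 37B active of 671B total.
active_fraction = 37 / 671
```

Only the selected experts run for this token, but every expert must stay resident in memory because the next token may route elsewhere. About 5.5% of parameters are active per token under the reported figures.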
State Space Models process long sequences with linear-time recurrence-like structure rather than full quadratic attention.[25] They can be more efficient for long contexts but often struggle with precise long-range recall compared with attention.
Hybrid models such as Jamba mix Transformer and Mamba-style layers, keeping attention where exact recall matters while using SSM blocks for efficient sequence handling.[26]
Reasoning models spend more inference compute on hard tasks. Test-time compute scaling means letting the model search, check, or reason longer instead of only scaling training size.[27]
Use it for math, code, multi-step logic, and verifiable tasks. Avoid it for simple factual queries and latency-sensitive flows. More: Test-Time Compute.
FlashAttention computes exact attention in tiles that fit fast on-chip memory instead of materializing the full attention matrix in high-bandwidth memory.[28] This reduces memory traffic and makes long sequences much more practical.
FlashAttention-3 targets Hopper GPUs with asynchronous data movement and low-precision support.[29] It changes memory movement, not attention math. More: FlashAttention.
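The reason tiling can stay exact is the online softmax trick: keep a running maximum and rescale the partial sums whenever a new tile raises it. The sketch below does this for a single query with scalar values to keep it short; it illustrates the arithmetic only, not the GPU kernel, and the tile size and inputs are arbitrary.

```python
import math

def streaming_attention_row(scores, values, tile=2):
    """Softmax-weighted sum of `values`, computed one tile at a time."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        denom = denom * correction + sum(math.exp(s - new_max) for s in s_tile)
        acc = acc * correction + sum(math.exp(s - new_max) * v for s, v in zip(s_tile, v_tile))
        running_max = new_max
    return acc / denom

def reference_attention_row(scores, values):
    """Same quantity with the full score vector materialized at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)

scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
streamed = streaming_attention_row(scores, values, tile=2)
reference = reference_attention_row(scores, values)
```

Both paths agree up to floating-point rounding, yet the streaming version never holds more than one tile of scores at a time. That is the property that lets the real kernel keep working data in on-chip memory.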
Distillation trains a smaller student model to mimic a larger teacher. Instead of hard labels only, the student can learn from teacher distributions, rationales, or generated examples.
Use it when latency, cost, or edge deployment matters and your task distribution is narrower than general chat. Quality depends heavily on representative distillation data. More: Knowledge Distillation.
Constitutional AI uses written principles to critique and revise model outputs, then uses those revised outputs and preference comparisons for training.[30] It reduces dependence on direct human labels for every safety case.
The constitution still needs human authorship, review, and failure analysis. Principles are policy artifacts, not magic safety guarantees. More: Constitutional AI.
Bias shows up as unequal performance, stereotypes, or different outcomes across groups. Detection uses benchmarks, counterfactual prompt swaps, and production outcome audits.[31][32]
Mitigation can involve data balancing, targeted fine-tuning, output review, and red teaming. The hard part is defining unacceptable behavior for the product context. More: Bias and Fairness.
Guardrails are checks around model calls: input filters, tool permission gates, output validators, policy checks, rate limits, cost caps, and circuit breakers.
Good guardrails are enforceable outside the model. If a tool call can spend money, delete data, or contact users, policy should live in code too. More: Guardrails.
Prompt engineering tunes the instruction. Context engineering manages the full information environment: system message, examples, retrieved chunks, tool specs, conversation memory, ordering, compression, and update policy.
This matters because many failures come from irrelevant, stale, or badly ordered context rather than weak base-model capability. Full article: Context Engineering: Beyond Prompting.
Open-weight models give more control, privacy options, tuning, and self-hosting economics. Closed APIs offer fast adoption, strong frontier capability, managed uptime, and simpler operations.
Many teams use both: prototype on APIs, route routine private work to open-weight systems when utilization justifies it, and keep frontier APIs for hard cases. Full comparison: Open-Source vs Closed-Source LLMs in 2026.
Use this guide as a checklist after you've studied the deeper lessons. You should be able to diagnose decode OOMs from KV-cache shape, choose RAG for frequently changing docs, bound agent loops with iteration and repeated-call controls, and justify hybrid open-weight plus closed-API routing when privacy, volume, and hard reasoning pull in different directions.
If you can explain each concept with a mechanism, a trade-off, and a failure mode, move into full coding, system design, behavioral, and technical presentation loops through AI Lab Coding Interview: Python Systems.
[1] Attention Is All You Need.
Vaswani, A., et al. · 2017
[2] Fast Transformer Decoding: One Write-Head is All You Need.
Shazeer, N. · 2019 · arXiv preprint
[3] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
Ainslie, J., et al. · 2023 · EMNLP 2023
[4] RoFormer: Enhanced Transformer with Rotary Position Embedding.
Su, J., et al. · 2021
[5] Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
[6] Fast Inference from Transformers via Speculative Decoding.
Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
[7] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Lewis, P., et al. · 2020 · NeurIPS 2020
[8] Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09
[9] RAGAS: Automated Evaluation of Retrieval Augmented Generation.
Es, S., et al. · 2023 · arXiv preprint
[10] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Zheng, L., et al. · 2023 · NeurIPS 2023
[11] From Local to Global: A Graph RAG Approach to Query-Focused Summarization.
Edge, D., et al. · 2024 · arXiv preprint
[12] LoRA: Low-Rank Adaptation of Large Language Models.
Hu, E. J., et al. · 2021 · ICLR
[13] QLoRA: Efficient Finetuning of Quantized Language Models.
Dettmers, T., et al. · 2023 · NeurIPS
[14] Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
Ouyang, L., et al. · 2022 · NeurIPS 2022
[15] Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Rafailov, R., et al. · 2023
[16] Tülu 3: Pushing Frontiers in Open Language Model Post-Training.
Lambert, N., et al. · 2024 · arXiv preprint
[17] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
DeepSeek-AI · 2025
[18] Scaling Laws for Neural Language Models.
Kaplan, J., et al. · 2020
[19] Training Compute-Optimal Large Language Models.
Hoffmann, J., et al. · 2022 · NeurIPS 2022
[20] ReAct: Synergizing Reasoning and Acting in Language Models.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Huang, E., & Cao, Y. · 2023 · ICLR 2023
[21] Introducing the Model Context Protocol.
Anthropic · 2024
[22] Large Language Models are not Fair Evaluators.
Wang, P., et al. · 2023
[23] Mixtral of Experts.
Jiang, A. Q., et al. · 2024
[24] DeepSeek-V3 Technical Report.
DeepSeek-AI · 2024 · arXiv preprint
[25] Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Gu, A., & Dao, T. · 2023
[26] Jamba: A Hybrid Transformer-Mamba Language Model.
AI21 Labs · 2024
[27] Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.
Snell, C., et al. · 2024 · arXiv preprint
[28] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
[29] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. · 2024
[30] Constitutional AI: Harmlessness from AI Feedback.
Bai, Y., et al. · 2022 · arXiv preprint
[31] BBQ: A Hand-Built Bias Benchmark for Question Answering.
Parrish, A., et al. · 2022 · ACL 2022
[32] Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods.
Zhao, J., et al. · 2018 · NAACL 2018