
50 Essential LLM Engineering Concepts for 2026

Fifty LLM engineering concepts, organized by system layer. Each answer focuses on mechanism, trade-off, failure mode, and production intuition.

LeetLLM Team · March 21, 2026 · Updated June 12, 2026 · 17 min read

This is a systems map, not a flashcard pile. Read one concept, cover the answer, then explain the mechanism, the trade-off, and the production failure it helps you debug.

Read in this order: model mechanics, runtime constraints, systems and retrieval, then evaluation and safety. If a concept feels thin, follow the linked LeetLLM lesson for the full derivation, code, and design walk-through.

Four-layer LLM engineering study map moving from mechanism to runtime, systems, and evaluation judgment.
Strong answers connect mechanism, runtime constraint, system design choice, and evaluation signal.

Preflight primitives

Before the deeper concepts, be able to explain next-token prediction, context windows, logits, sampling controls, prompt hierarchy, and the difference between stored weights and runtime activations. These primitives explain KV-cache growth, token budgets, decoding behavior, prompt-injection risk, and memory pressure.

Transformer architecture and attention

Concept 1: how does self-attention work?

Self-attention turns each token into Query, Key, and Value vectors. Queries score keys, softmax turns scores into weights, and those weights mix values: $\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d_k})\,V$.

If $d_k = 64$, the scale is $\sqrt{64} = 8$. The scale keeps scores from becoming too sharp as head dimension grows. Cost is the trade-off: every token can attend to every other token, which gives rich context but creates $O(n^2)$ sequence-length scaling.[1] See Scaled Dot-Product Attention.
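A minimal sketch of the formula above in plain Python, assuming tiny hand-written matrices (the values of `Q`, `K`, and `V` are illustrative, not from any model). It shows the three steps: scaled scores, softmax weights, weighted mix of values.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    scale = math.sqrt(len(K[0]))  # sqrt(d_k)
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Mix value rows using the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Illustrative inputs: two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Each query attends most strongly to the key it matches, so the first output row leans toward the first value row. The nested loops over queries and keys are where the quadratic cost comes from.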

Concept 2: why use multi-head attention?

A single attention head produces one set of attention weights per token, so it can express only one similarity pattern at a time. Multi-head attention gives separate subspaces for syntax, position, entity links, or semantic similarity. Heads aren't magic interpretable modules, but splitting attention lets the model represent several relationships in parallel.

Concept 3: what are MQA and GQA?

Standard multi-head attention stores separate keys and values for every head. Multi-Query Attention (MQA) shares one key/value set across all query heads, cutting KV-cache memory but reducing representational freedom.[2] Grouped-Query Attention (GQA) is the middle path: groups of query heads share each key/value set, so memory drops while quality stays closer to multi-head attention.[3]

Attention KV sharing comparison where MHA stores one key/value cache per query head, GQA shares a few caches across groups, and MQA shares one cache across all heads.
MQA and GQA matter because decode memory scales with stored key/value heads, not only total query heads.

Concept 4: why did RoPE become common?

Transformers need position information because token order isn't inherent in parallel attention. RoPE rotates Query and Key vectors by position, so their dot product carries relative-offset information.[4] That fits decoder-only language modeling well and gives long-context extensions a useful base.

RoPE doesn't make long context free. Past training length, teams still need scaling tricks, retuning, or careful evals. More: RoPE and ALiBi.
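A small sketch of the relative-offset property, assuming a single 2-D frequency pair and a hypothetical angle step `theta` (real RoPE applies many such pairs with different frequencies). Rotating the query and key by their positions makes the dot product depend only on the distance between them.

```python
import math

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector by angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# Illustrative query and key vectors.
q, k = (1.0, 2.0), (0.5, -1.0)

score_a = dot(rotate(q, 3), rotate(k, 1))    # positions 3 and 1, offset 2
score_b = dot(rotate(q, 10), rotate(k, 8))   # positions 10 and 8, offset 2
score_c = dot(rotate(q, 10), rotate(k, 5))   # positions 10 and 5, offset 5
```

`score_a` and `score_b` match because both pairs are two positions apart, while `score_c` differs because the offset changed.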

Concept 5: Pre-LN vs Post-LN?

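Post-LN normalizes after the residual addition, so every gradient passes through a normalization layer. Pre-LN normalizes the sublayer input and adds the result back to an untouched residual stream, leaving a cleaner identity path for gradients. That's why many deep decoder-only LLMs use Pre-LN or norm-first variants.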

RMSNorm is common too. It drops mean centering, normalizes by root mean square, and is cheaper to compute. More: Layer Normalization.
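To make the difference concrete, here is a minimal sketch of both normalizations on a plain list. The learned bias is omitted and `gain` defaults to ones; both simplifications are assumptions for illustration.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
rms_out = rms_norm(x)
ln_out = layer_norm(x)
```

RMSNorm needs one reduction (sum of squares) where LayerNorm needs two (mean and variance), which is the source of the compute saving.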

Tokenization and embeddings

Concept 6: why subword tokenization?

Word-level tokenization breaks on new names and misspellings. Character-level tokenization handles any string but makes sequences too long. Subword tokenizers such as BPE, WordPiece, and SentencePiece split rare words into reusable pieces while keeping common words compact.

Production trap: words and tokens aren't interchangeable. Measure token counts on your actual language mix, code, logs, and documents. More: BPE and SentencePiece.
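The core of BPE training is one repeated step: count adjacent symbol pairs, then merge the most frequent pair into a new symbol. The sketch below runs a single merge on a toy corpus; the words and frequencies are made up, and real tokenizers add byte fallback, pre-tokenization, and many thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as characters) -> frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
pair = most_frequent_pair(corpus)   # ("w", "e") appears 13 times
corpus = merge_pair(corpus, pair)
```

After the merge, "we" is a single reusable piece shared by all three words, which is how common fragments stay compact while rare words still decompose.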

Concept 7: static vs contextual embeddings?

Static embeddings assign one vector per word. Contextual embeddings change after attention layers process surrounding text, so "charge" in a payment dispute differs from "charge" in a battery instruction.

The embedding table starts as a lookup matrix. Context comes from the Transformer stack that transforms those vectors. More: Contextual Embeddings.

Concept 8: cosine similarity vs dot product?

Cosine similarity compares direction after length normalization. Dot product compares direction and magnitude. If embedding norms vary, dot product can favor high-magnitude vectors even when they're less semantically relevant.

Some retrieval models are trained for dot product or Maximum Inner Product Search, so don't switch metrics blindly. Match metric to training objective and index type. More: Embedding Similarity.
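A short sketch of how the two metrics can rank the same candidates differently. The vectors are hand-picked for illustration, not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
aligned_small = [0.9, 0.1]    # nearly the same direction, small norm
off_axis_large = [5.0, 5.0]   # 45 degrees away, large norm

dot_scores = (dot(query, aligned_small), dot(query, off_axis_large))
cos_scores = (cosine(query, aligned_small), cosine(query, off_axis_large))
```

Dot product prefers the large off-axis vector; cosine prefers the well-aligned small one. Normalizing embeddings to unit length before indexing makes the two metrics agree.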

Inference and serving

Concept 9: what is the KV cache?

During generation, the model appends one token at a time. The KV cache stores key/value states from previous tokens so the server doesn't replay the full prompt for every next token.

Memory scales as 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * batch_size, where the factor of 2 covers keys and values. A 100K-token request can consume tens of GB of KV cache even when weights already fit. Most serving wins start here: GQA, PagedAttention, cache quantization, shorter active context, and better scheduling. More: KV Cache.
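The formula is easy to turn into a sizing helper. The configuration below (80 layers, 64 query heads, head dimension 128, FP16 cache) is a hypothetical example chosen for round numbers, not a claim about any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens,
                   bytes_per_value=2, batch_size=1):
    # Factor of 2: one tensor for keys, one for values.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * batch_size

TOKENS = 100_000
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, tokens=TOKENS)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=TOKENS)
mqa = kv_cache_bytes(layers=80, kv_heads=1, head_dim=128, tokens=TOKENS)

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

Under these assumptions a single 100K-token request needs about 262 GB with full multi-head KV, about 33 GB with 8 KV heads, and about 4 GB with one shared KV head.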

Concept 10: how does PagedAttention help?

PagedAttention stores KV cache in fixed-size blocks instead of one contiguous allocation. That cuts fragmentation and lets the scheduler share or allocate cache blocks more flexibly.[5]

The vLLM paper reported 2-4x higher throughput than FasterTransformer and Orca at similar latency on its evaluated workloads.[5] PagedAttention isn't an attention approximation. It's memory management for serving.

Concept 11: TTFT vs TPS?

TTFT (time to first token) is dominated by prefill: reading the prompt and building KV cache. TPS (tokens per second) is dominated by decode: generating one token at a time while reading cached states.

They often conflict. Low TTFT likes short prompts and small batches. High TPS likes batching and high GPU utilization. Continuous batching and chunked prefill help balance those goals. More: TTFT and TPS.

Concept 12: continuous batching?

Static batching waits for a whole batch to finish. Continuous batching admits new requests as old ones finish, keeping decode slots full even when output lengths vary.

Real traffic mixes short classifications, medium chat replies, and long generations. Serving engines such as vLLM, TensorRT-LLM, and SGLang use iteration-level scheduling for this reason.
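A toy simulation makes the gap visible. It assumes each request needs a fixed number of decode steps, ignores prefill, and uses made-up output lengths; real schedulers also account for memory and priorities.

```python
def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """Refill free slots from the queue before every decode step."""
    queue = list(lengths)
    active = []  # remaining decode steps per running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps

# Hypothetical output lengths (in tokens) for eight requests, two slots.
lengths = [2, 50, 3, 4, 40, 5, 2, 3]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
```

With these numbers static batching takes 97 decode steps and continuous batching takes 55, because short requests no longer wait behind the longest member of their batch.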

Concept 13: quantization, GPTQ, AWQ, GGUF?

Quantization stores weights in fewer bits, often 8-bit or 4-bit instead of FP16/BF16. GPTQ and AWQ are post-training quantization methods. GGUF is a file format common in llama.cpp-style local inference.

Trade-off: memory and speed vs quality. Benchmark on your task. A 4-bit model that passes chat demos can still fail code, math, or long-context retrieval. More: Model Quantization.
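To see where the quality loss comes from, here is a sketch of the simplest scheme: symmetric round-to-nearest with one absmax scale per tensor. This is a baseline for illustration only; it is not GPTQ or AWQ, which choose scales and rounding more carefully. The weights are made up.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def max_error(weights, bits):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # illustrative values
err8 = max_error(weights, bits=8)
err4 = max_error(weights, bits=4)
```

The 4-bit grid has only 15 levels, so its worst-case rounding error is much larger than the 8-bit one. Whether that error matters is a task-level question, which is why benchmarking on your own workload is the deciding step.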

Concept 14: speculative decoding?

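Speculative decoding uses a small draft model to propose several tokens, then a larger target model verifies them in one parallel forward pass.[6] With the paper's accept/reject rule, the output follows the target distribution exactly: each draft token is accepted with a probability based on the target-to-draft probability ratio, and the first rejected token is resampled from a corrected target distribution.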

Use it when the draft model is much faster and predicts the target well. Code and structured text often work better than open-ended creative text. More: Speculative Decoding.
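The accept/reject rule from the speculative decoding paper can be checked numerically for a single token position: accept a draft token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized positive part of p - q. The sketch below computes the resulting output distribution in closed form; the two distributions are made-up examples.

```python
def residual_distribution(p, q):
    """Distribution used after a rejection: normalize max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else list(p)

def output_distribution(p, q):
    """Probability of emitting each token after one draft-then-verify step."""
    # Draft proposes x with prob q(x); it is accepted with prob min(1, p/q).
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0
              for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    resid = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, resid)]

target = [0.6, 0.3, 0.1]  # p: target model probabilities (illustrative)
draft = [0.3, 0.5, 0.2]   # q: draft model probabilities (illustrative)
out = output_distribution(target, draft)
acceptance_rate = sum(min(pi, qi) for pi, qi in zip(target, draft))
```

`out` equals `target`, so the draft model changes speed but not the sampled distribution. The acceptance rate (0.7 here) is what determines the speedup, which is why a well-matched draft model matters.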

Inference optimization map showing request routing, scheduler, model bytes, kernel speed, and hardware bandwidth as separate levers.
Inference optimization is a stack. Pick the layer that matches the bottleneck: routing, scheduling, model compression, kernel efficiency, or hardware.

RAG and retrieval

Concept 15: production RAG pipeline?

Retrieval-Augmented Generation (RAG) gives the model external evidence before answering.[7] Production path: ingestion, parsing, chunking, embedding, indexing, retrieval, reranking, context assembly, generation, and validation.

Production RAG pipeline with offline document indexing, online retrieval, an evidence gate, and grounded answer generation.
Most RAG failures happen before the final model call: bad parsing, bad chunks, bad embeddings, weak retrieval, or noisy context assembly.

Concept 16: hybrid search?

Dense vector search handles semantic matches. Sparse search such as BM25 handles exact terms, codes, product names, and rare entities. Hybrid search combines both, often with Reciprocal Rank Fusion.[8]

Use hybrid retrieval for enterprise docs unless evals prove dense-only is enough. Exact terms matter more often than teams expect. More: Hybrid Search.
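Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch, assuming the commonly used constant k = 60 from the RRF paper and hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["doc_a", "doc_b", "doc_c"]    # hypothetical vector-search order
sparse = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists rise, so `doc_a` and `doc_c` outrank documents found by only one retriever. Because only ranks are used, no score calibration between BM25 and cosine similarity is needed.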

Concept 17: how do you evaluate RAG?

Separate retrieval quality from generation quality. Retrieval metrics include recall@k, MRR, and nDCG. Generation metrics include faithfulness, relevance, and completeness.

RAGAS-style metrics and LLM judges can help, but they need calibration against real labels.[9][10] If you only score final answers, retrieval bugs stay hidden. More: RAG Evaluation.
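Retrieval metrics are cheap to compute once you have labeled relevant documents per query. The sketch below covers recall@k and reciprocal rank (averaged over queries, the latter gives MRR); nDCG is left out for brevity. The document IDs are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

r_at_2 = recall_at_k(retrieved, relevant, k=2)
r_at_4 = recall_at_k(retrieved, relevant, k=4)
rr = reciprocal_rank(retrieved, relevant)
```

Here recall@2 is 0.5, recall@4 is 1.0, and the reciprocal rank is 0.5. Tracking these separately from answer scores shows whether a bad answer started with missing evidence.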

Concept 18: what is chunking?

Chunking splits documents into retrievable units. Fixed-size chunks are predictable but can cut meaning in half. Structural chunks preserve headings and sections. Semantic chunks split where topic shifts.

There is no universal chunk size. Start with a simple baseline, then tune from retrieval evals and failure examples. More: Chunking Strategies.
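A simple baseline to start from is fixed-size chunking with overlap. The sketch assumes the input is already a list of tokens; `size` and `overlap` are tuning parameters you would set from retrieval evals, and the values below are arbitrary.

```python
def chunk_tokens(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not tokens:
        return []
    step = size - overlap
    # Stop once the remaining tail is fully covered by the previous chunk.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]

chunks = chunk_tokens(list(range(10)), size=4, overlap=1)
```

Overlap reduces the chance that a sentence split at a boundary is unrecoverable from either chunk, at the cost of indexing some tokens twice.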

Concept 19: GraphRAG?

GraphRAG builds entity and relationship structure from documents, then retrieves connected subgraphs instead of isolated chunks.[11] It helps when questions require multi-hop relationships, such as product dependencies, org structures, or legal references.

Don't add a graph if vector or hybrid retrieval solves the task. Graphs add extraction, update, and correctness burdens. More: GraphRAG.

Training, fine-tuning, and alignment

Concept 20: Pre-training vs SFT vs alignment?

Pre-training teaches broad next-token prediction. Supervised fine-tuning (SFT) teaches instruction following and formats. Alignment uses preferences, critiques, or rewards to steer behavior.

LLM training stages showing pre-training for broad next-token knowledge, supervised fine-tuning for instruction behavior, and alignment for preference shaping.
Don't collapse all post-training into "fine-tuning." Each stage changes a different part of behavior.

Concept 21: LoRA?

LoRA freezes base weights and trains small low-rank update matrices instead of every parameter.[12] It makes adaptation cheaper and produces small adapters that can be swapped or served alongside one base model.

QLoRA combines LoRA with 4-bit base-model quantization, making large-model tuning possible on much smaller hardware.[13] More: LoRA.
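A minimal sketch of a LoRA layer's forward pass, following the paper's form `W x + (alpha / r) * B A x` with `B` initialized to zero so training starts from the base model's behavior. The matrices are tiny made-up examples, and the helper names are illustrative.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """Frozen base output plus a scaled low-rank update."""
    base = matvec(W, x)                  # W is frozen
    update = matvec(B, matvec(A, x))     # B (d_out x r) times A (r x d_in)
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def param_counts(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs LoRA adapter."""
    return d_in * d_out, r * (d_in + d_out)

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # d_out=2, d_in=3
A = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1]]   # r=2, d_in=3
B_zero = [[0.0, 0.0], [0.0, 0.0]]         # d_out=2, r=2, zero init
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B_zero, x)
full, adapter = param_counts(d_in=4096, d_out=4096, r=8)
```

For one 4096 x 4096 projection at rank 8, the adapter trains 65,536 parameters instead of about 16.8 million, which is why adapters are small enough to swap per tenant or task.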

Concept 22: RLHF vs DPO?

RLHF trains a reward model from preference data, then optimizes the language model against that reward using reinforcement learning.[14] DPO skips the explicit reward model and directly optimizes on preference pairs.[15]

DPO is simpler operationally. RLHF still matters when you want an explicit reward model for auditing, reuse, or online optimization. More: RLHF and DPO.
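The DPO objective for one preference pair fits in a few lines. The sketch takes summed sequence log-probabilities from the policy and a frozen reference model as plain floats; the numbers and the `beta` value are illustrative.

```python
import math

def dpo_loss(policy_chosen, policy_rejected,
             ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raised the chosen response and lowered the rejected one.
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss starts at log 2 and falls as the policy shifts probability toward the chosen response relative to the reference. No reward model appears anywhere, which is the operational simplification.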

Concept 23: RLVR?

Reinforcement Learning with Verifiable Rewards (RLVR) uses programmatic verifiers when success is objective: unit tests for code, exact answers for math, schema validators for structured output.[16]

DeepSeek-R1 made this direction visible at scale, with separate pipelines for R1-Zero and R1.[17] Production lesson: if you can write a verifier, you can create cheaper, more consistent reward signal than pure preference labels.
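A verifier can be as small as an exact-match check on the final answer. The sketch below is a hypothetical math reward that takes the last number in a completion as the answer; production verifiers usually require a stricter answer format.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_answer(text):
    """Return the last number in the text, or None if there is none."""
    matches = NUMBER.findall(text)
    return matches[-1] if matches else None

def math_reward(completion, expected):
    """1.0 if the final number equals the expected answer, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(expected) else 0.0

r_ok = math_reward("6 * 7 = 42, so the answer is 42", "42")
r_wrong = math_reward("I believe the answer is 41", "42")
r_none = math_reward("I am not sure", "42")
```

The reward is deterministic and costs nothing per label, but it only checks the outcome. A completion with flawed reasoning and a lucky final number still scores 1.0.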

Concept 24: RAG vs fine-tuning vs prompting?

Use prompting for one-off behavior changes. Use RAG when the model needs fresh, private, or cited knowledge. Use fine-tuning when you need durable behavior, format adherence, or domain style.

Decision path from prompt changes to RAG, fine-tuning, and combined production systems.
Escalate only when a simpler method fails. Fresh knowledge usually points to RAG, not fine-tuning.

Concept 25: scaling laws and Chinchilla?

Scaling laws relate loss to compute, data, and parameters.[18] Chinchilla shifted the practical lesson: under dense-training assumptions, many earlier models were too parameter-heavy for their training-token count.[19]

Don't compare parameter count alone. Training tokens, data quality, architecture, and post-training can matter as much as raw size. More: Scaling Laws.
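For back-of-envelope planning, two widely quoted approximations are useful: training compute is roughly 6 x parameters x tokens FLOPs, and the Chinchilla result is often summarized as about 20 training tokens per parameter. Both are rules of thumb assumed here for illustration, not exact values from the papers.

```python
def chinchilla_estimate(params, tokens_per_param=20):
    """Rough compute-optimal token budget and training FLOPs."""
    tokens = params * tokens_per_param
    flops = 6 * params * tokens   # approximation: C ~ 6 * N * D
    return tokens, flops

tokens, flops = chinchilla_estimate(70_000_000_000)
print(f"tokens: {tokens:.2e}, training FLOPs: {flops:.2e}")
```

For a 70B dense model this gives about 1.4 trillion tokens and on the order of 6e23 FLOPs. Many production models deliberately train on far more tokens than this, because a smaller model that is cheaper to serve can justify extra training compute.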

Agents and tool use

Concept 26: ReAct?

ReAct alternates reasoning, tool action, observation, and answer.[20] The key is that each observation changes the next step, which keeps the agent grounded in tool output instead of memory.

ReAct agent loop where thought chooses a tool, action runs it, observation updates state, and answer exits only after evidence.
ReAct works when observations update state. If the model ignores observations, the pattern is cosmetic.

Concept 27: MCP?

Model Context Protocol (MCP) exposes tools, resources, and prompts through typed schemas.[21] It gives clients a consistent discovery and invocation layer instead of one-off tool glue for every integration.

MCP is less about model quality and more about tool interoperability, schema clarity, and security boundaries. More: MCP Standards.

Concept 28: Multi-agent systems?

Multi-agent systems split work across specialized roles: planner, worker, critic, retriever, or tool executor. Use them when separation improves quality, safety, or parallelism.

Don't default to many agents. Extra LLM calls add latency, cost, and debugging surface. Start with one capable agent and add roles only when failure analysis points there. More: Multi-Agent Orchestration.

Concept 29: agent failure modes?

Common failures: loops, hallucinated tools, context overflow, cascading errors, and goal drift. The controls are max iterations, schema validation, tool allowlists, checkpoints, summaries, and human escalation.

Strong answer: name a failure symptom and the control that catches it. More: Agent Failure.
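Two of those controls, an iteration cap and repeated-call detection, can be sketched as a wrapper around the agent step. `step_fn` is a hypothetical stand-in for the model call; tool execution is omitted so the control flow stays visible.

```python
def run_agent(step_fn, max_iterations=5, max_repeats=2):
    """step_fn(history) returns ("tool", name, args) or ("answer", text)."""
    history = []
    seen = {}
    for _ in range(max_iterations):
        action = step_fn(history)
        if action[0] == "answer":
            return {"status": "done", "answer": action[1]}
        # Identical tool + args repeated too often signals a loop.
        key = (action[1], repr(action[2]))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            return {"status": "escalate", "reason": "repeated tool call"}
        history.append(action)
    return {"status": "escalate", "reason": "max iterations"}

def looping_policy(history):
    # Simulates an agent stuck issuing the same search.
    return ("tool", "search", {"q": "same query"})

def healthy_policy(history):
    # Calls one tool, then answers.
    if not history:
        return ("tool", "search", {"q": "refund policy"})
    return ("answer", "Refunds are processed within 5 days.")

stuck = run_agent(looping_policy)
ok = run_agent(healthy_policy)
```

The looping policy is stopped on its third identical call and routed to escalation instead of burning the full iteration budget.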

Concept 30: prompt injection?

Prompt injection is untrusted text trying to override higher-priority instructions or control tools. The most important boundary is trusted vs untrusted context.

Layer defenses: tool-side permission checks, schema validation, output checks, least privilege, retrieval isolation, and confirmation gates for side effects. Filters help but don't solve it alone. Full playbook: Prompt Injection Defense.

Evaluation and reliability

Concept 31: perplexity?

Perplexity is the exponentiated average negative log-likelihood. Lower means the model assigned higher probability to the observed text.

Use it for same-tokenizer checkpoint comparisons. Don't use it to rank instruction-following systems across model families, because tokenization and usefulness differ. More: Perplexity.
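The definition translates directly into code, given per-token natural-log probabilities of the observed text. The inputs below are synthetic.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every observed token.
uniform_four = perplexity([math.log(0.25)] * 8)

# A more confident model on the same number of tokens.
confident = perplexity([math.log(0.9)] * 8)
```

A perplexity of 4 means the model was, on average, as uncertain as a uniform choice among four tokens. Because the average is per token, two tokenizers that split the same text differently produce numbers that are not directly comparable.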

Concept 32: LLM-as-judge?

LLM-as-judge uses another model to score outputs against a rubric. It can evaluate faithfulness, relevance, style, or completeness at scale.

Pitfalls include position bias, verbosity bias, self-preference, and rubric sensitivity.[22] Mitigate with randomized order, pairwise comparisons, multiple judges, and calibration against human labels. More: LLM-as-a-Judge.

Concept 33: hallucination mitigation?

A hallucination is plausible unsupported output. You can't eliminate it, but you can reduce and detect it.

Practical controls: retrieve trusted evidence, force citations, constrain structured output, run post-generation verification, and fail closed when evidence is missing. More: Hallucination Mitigation.

Concept 34: LLM A/B testing?

LLM A/B tests need product metrics and quality metrics. Track latency, cost, task completion, user feedback, safety rate, and sampled judge or human scores.

Use guardrails for rare bad behavior. A variant that improves average helpfulness but doubles hallucination rate may be unacceptable. More: A/B Testing.

System design

Concept 35: production support RAG?

A support RAG system needs query routing, hybrid retrieval, reranking, context assembly, grounded generation, validation, and eval feedback. The best answer names both data path and feedback loop.

Support RAG feedback loop where query retrieval, evidence checking, answer generation, validation, and evaluation metrics form a production improvement cycle.
Production RAG isn't only retrieval. It needs validation and measurement so failures become fixes.

Concept 36: code completion system?

The hardest constraint is latency. Inline completions must feel instant, so systems gather context from current file, neighboring tabs, language-server data, recent edits, and cursor prefix/suffix, then route to a small fast model.

Larger models fit explicit generation, refactoring, and chat flows. Completion quality is measured by acceptance rate, edit survival, latency, and downstream task success. More: Code Completion Design.

Concept 37: LLM moderation system?

Use tiers. Fast rules and classifiers catch obvious cases. LLM review handles ambiguous content. Human review handles appeals and high-risk uncertainty.

Design depends on false-positive vs false-negative cost. High-harm categories bias toward recall and escalation. Low-risk categories can bias toward precision. More: Content Moderation Design.

Production engineering and LLMOps

Concept 38: inference cost control?

Cost is driven mostly by input tokens, output tokens, model price, retries, and cache hit rate. The controls are model routing, prompt trimming, exact or semantic caching, batching, and self-hosting at high utilization.

Don't optimize only unit price. A cheap model that needs longer prompts, retries, or manual review can cost more end-to-end. More: LLM Cost Engineering.

Concept 39: observability stack?

Track request latency, TTFT, total tokens, model, cost, status, retrieval hits, judge scores, user feedback, tool calls, and safety events. For self-hosting, add GPU utilization, queue depth, batch size, and KV-cache occupancy.

Observability should answer: "Was this bad answer caused by retrieval, prompt assembly, model behavior, tool error, or latency timeout?" More: LLM Observability.

Concept 40: semantic caching?

Exact caching keys on identical input. Semantic caching embeds the query and reuses a nearby cached answer when similarity is high enough.

Risk: stale or wrong answers. Use conservative thresholds, freshness policies, and bypass rules for personalized or high-stakes queries. More: Semantic Caching.
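A minimal in-memory sketch of the lookup logic. The embeddings are hand-written 2-D vectors standing in for a real embedding model, the threshold is arbitrary, and freshness and bypass policies are left out.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the closest cached answer if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Reset your password from Settings > Security.")

hit = cache.get([0.99, 0.05])   # near-duplicate query
miss = cache.get([0.5, 0.5])    # related but different query
```

A high threshold trades hit rate for safety. A linear scan is fine for a sketch; a production cache would typically use a vector index and store a timestamp per entry.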

Advanced architecture

Concept 41: Mixture of Experts?

MoE models keep many expert feed-forward networks in memory but activate only a subset per token.[23] This raises total capacity without dense compute for every token.

DeepSeek-V3 reports 671B total parameters with about 37B active per token, using 256 routed experts plus shared experts and activating 8 routed experts per token.[24] The serving catch: every expert must still be resident in memory, and uneven routing can overload some experts while others sit idle.

Mixture of Experts routing where one token is routed to two active experts while other experts stay idle in memory.
MoE saves compute per token, but it doesn't make full model memory disappear.
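A sketch of top-k routing for a single token, using softmax over the selected router logits as the gate (one common variant; other models use different gating functions). The "experts" are trivial scaling functions and the logits are made up.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax their logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, gate in gates.items():
        y = experts[idx](x)
        out = [o + gate * v for o, v in zip(out, y)]
    return out, sorted(gates)

# Four toy experts; expert i multiplies its input by i + 1.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]

output, active = moe_forward([1.0, 2.0], experts, router_logits, k=2)
```

Only experts 1 and 3 run for this token, yet all four must exist in memory because the next token may route elsewhere.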

Concept 42: state space models and Mamba?

State Space Models process long sequences with linear-time recurrence-like structure rather than full quadratic attention.[25] They can be more efficient for long contexts but often struggle with precise long-range recall compared with attention.

Hybrid models such as Jamba mix Transformer and Mamba-style layers, keeping attention where exact recall matters while using SSM blocks for efficient sequence handling.[26]

Concept 43: reasoning models and test-time compute?

Reasoning models spend more inference compute on hard tasks. Test-time compute scaling means letting the model search, check, or reason longer instead of only scaling training size.[27]

Use it for math, code, multi-step logic, and verifiable tasks. Avoid it for simple factual queries and latency-sensitive flows. More: Test-Time Compute.

Concept 44: FlashAttention?

FlashAttention computes exact attention in tiles that fit fast on-chip memory instead of materializing the full attention matrix in high-bandwidth memory.[28] This reduces memory traffic and makes long sequences much more practical.

FlashAttention-3 targets Hopper GPUs with asynchronous data movement and low-precision support.[29] It changes memory movement, not attention math. More: FlashAttention.
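The numerical trick that lets attention be computed tile by tile is the online softmax: keep a running maximum, a running denominator, and a running weighted sum, and rescale them whenever a new tile raises the maximum. The sketch below does this for one query row in plain Python and compares it with the direct computation. It illustrates the math only; the real kernels' gains come from GPU memory placement, which this code does not model.

```python
import math

def scores_for(q, keys):
    scale = math.sqrt(len(q))
    return [sum(a * b for a, b in zip(q, k)) / scale for k in keys]

def naive_attention_row(q, K, V):
    """Materialize all scores, then softmax and mix values."""
    scores = scores_for(q, K)
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z
            for j in range(len(V[0]))]

def tiled_attention_row(q, K, V, tile=2):
    """Process keys/values in tiles, never holding all scores at once."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        v_tile = V[start:start + tile]
        scores = scores_for(q, K[start:start + tile])
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on first tile).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]

# Illustrative inputs: one query, five keys/values.
q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

naive = naive_attention_row(q, K, V)
tiled = tiled_attention_row(q, K, V, tile=2)
```

Both functions return the same values up to floating-point rounding, which is the sense in which tiling changes memory movement rather than attention math.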

Concept 45: knowledge distillation?

Distillation trains a smaller student model to mimic a larger teacher. Instead of hard labels only, the student can learn from teacher distributions, rationales, or generated examples.

Use it when latency, cost, or edge deployment matters and your task distribution is narrower than general chat. Quality depends heavily on representative distillation data. More: Knowledge Distillation.

Safety, governance, and emerging practice

Concept 46: Constitutional AI?

Constitutional AI uses written principles to critique and revise model outputs, then uses those revised outputs and preference comparisons for training.[30] It reduces dependence on direct human labels for every safety case.

The constitution still needs human authorship, review, and failure analysis. Principles are policy artifacts, not magic safety guarantees. More: Constitutional AI.

Concept 47: bias detection?

Bias shows up as unequal performance, stereotypes, or different outcomes across groups. Detection uses benchmarks, counterfactual prompt swaps, and production outcome audits.[31][32]

Mitigation can involve data balancing, targeted fine-tuning, output review, and red teaming. The hard part is defining unacceptable behavior for the product context. More: Bias and Fairness.

Concept 48: guardrails?

Guardrails are checks around model calls: input filters, tool permission gates, output validators, policy checks, rate limits, cost caps, and circuit breakers.

Good guardrails are enforceable outside the model. If a tool call can spend money, delete data, or contact users, policy should live in code too. More: Guardrails.
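A sketch of what "policy in code" can look like for tool calls: an allowlist, a confirmation gate for side effects, and a hard cap on amounts. Tool names, fields, and limits are hypothetical.

```python
# Hypothetical policy table; in practice this would live in config.
TOOL_POLICY = {
    "search_docs": {"side_effect": False},
    "issue_refund": {"side_effect": True, "max_amount": 50.0},
}

def check_tool_call(name, args, user_confirmed=False):
    """Return (allowed, reason) for a model-proposed tool call."""
    policy = TOOL_POLICY.get(name)
    if policy is None:
        return False, "tool not on allowlist"
    if policy["side_effect"] and not user_confirmed:
        return False, "confirmation required"
    limit = policy.get("max_amount")
    if limit is not None and args.get("amount", 0) > limit:
        return False, "amount exceeds cap"
    return True, "ok"

read_only = check_tool_call("search_docs", {"q": "refund policy"})
unknown = check_tool_call("delete_account", {})
unconfirmed = check_tool_call("issue_refund", {"amount": 20.0})
too_large = check_tool_call("issue_refund", {"amount": 500.0},
                            user_confirmed=True)
```

The check runs before the tool executes and does not depend on what the prompt says, so an injected instruction cannot talk its way past it.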

Concept 49: context engineering?

Prompt engineering tunes the instruction. Context engineering manages the full information environment: system message, examples, retrieved chunks, tool specs, conversation memory, ordering, compression, and update policy.

This matters because many failures come from irrelevant, stale, or badly ordered context rather than weak base-model capability. Full article: Context Engineering: Beyond Prompting.

Concept 50: Open-weight vs closed-source models?

Open-weight models give more control, privacy options, tuning, and self-hosting economics. Closed APIs offer fast adoption, strong frontier capability, managed uptime, and simpler operations.

Many teams use both: prototype on APIs, route routine private work to open-weight systems when utilization justifies it, and keep frontier APIs for hard cases. Full comparison: Open-Source vs Closed-Source LLMs in 2026.

Self-check

Use this guide as a checklist after you've studied the deeper lessons. You should be able to diagnose decode OOMs from KV-cache shape, choose RAG for frequently changing docs, bound agent loops with iteration and repeated-call controls, and justify hybrid open-weight plus closed-API routing when privacy, volume, and hard reasoning pull in different directions.

If you can explain each concept with a mechanism, a trade-off, and a failure mode, move into full coding, system design, behavioral, and technical presentation loops through AI Lab Coding Interview: Python Systems.

References

[1] Attention Is All You Need.
Vaswani, A., et al. · 2017

[2] Fast Transformer Decoding: One Write-Head is All You Need.
Shazeer, N. · 2019 · arXiv preprint

[3] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
Ainslie, J., et al. · 2023 · EMNLP 2023

[4] RoFormer: Enhanced Transformer with Rotary Position Embedding.
Su, J., et al. · 2021

[5] Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023

[6] Fast Inference from Transformers via Speculative Decoding.
Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023

[7] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Lewis, P., et al. · 2020 · NeurIPS 2020

[8] Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09

[9] RAGAS: Automated Evaluation of Retrieval Augmented Generation.
Es, S., et al. · 2023 · arXiv preprint

[10] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Zheng, L., et al. · 2023 · NeurIPS 2023

[11] From Local to Global: A Graph RAG Approach to Query-Focused Summarization.
Edge, D., et al. · 2024 · arXiv preprint

[12] LoRA: Low-Rank Adaptation of Large Language Models.
Hu, E. J., et al. · 2021 · ICLR

[13] QLoRA: Efficient Finetuning of Quantized Language Models.
Dettmers, T., et al. · 2023 · NeurIPS

[14] Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
Ouyang, L., et al. · 2022 · NeurIPS 2022

[15] Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Rafailov, R., et al. · 2023

[16] Tülu 3: Pushing Frontiers in Open Language Model Post-Training.
Lambert, N., et al. · 2024 · arXiv preprint

[17] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
DeepSeek-AI · 2025

[18] Scaling Laws for Neural Language Models.
Kaplan, J., et al. · 2020

[19] Training Compute-Optimal Large Language Models.
Hoffmann, J., et al. · 2022 · NeurIPS 2022

[20] ReAct: Synergizing Reasoning and Acting in Language Models.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

[21] Introducing the Model Context Protocol.
Anthropic · 2024

[22] Large Language Models are not Fair Evaluators.
Wang, P., et al. · 2023

[23] Mixtral of Experts.
Jiang, A. Q., et al. · 2024

[24] DeepSeek-V3 Technical Report.
DeepSeek-AI · 2024 · arXiv preprint

[25] Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Gu, A., & Dao, T. · 2023

[26] Jamba: A Hybrid Transformer-Mamba Language Model.
AI21 Labs · 2024

[27] Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.
Snell, C., et al. · 2024 · arXiv preprint

[28] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

[29] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.
Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. · 2024

[30] Constitutional AI: Harmlessness from AI Feedback.
Bai, Y., et al. · 2022 · arXiv preprint

[31] BBQ: A Hand-Built Bias Benchmark for Question Answering.
Parrish, A., et al. · 2022 · ACL 2022

[32] Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods.
Zhao, J., et al. · 2018 · NAACL 2018