LeetLLM

LoRA & Parameter-Efficient Tuning

Master the mathematics of Low-Rank Adaptation (LoRA), adapter injection strategies, and memory/compute tradeoffs compared to full fine-tuning.

35 min read · Google, Meta, Microsoft

Imagine you want to teach a master chef how to cook a specific new dish. You wouldn't send them back to culinary school for four years to relearn everything from chopping onions to molecular gastronomy. You'd just give them a small index card with the new recipe.

Full fine-tuning is like sending the chef back to school: it updates every single connection in the model's "brain," which is slow, expensive, and requires massive computing power. LoRA (Low-Rank Adaptation) is the index card. It freezes the model's vast existing knowledge and injects a tiny, separate set of instructions that modify its behavior. This reduces the trainable parameters by roughly 10,000× while achieving nearly the same results.


Why LoRA exists

Training a 70-billion parameter model is staggeringly expensive. If you want to update all parameters (full fine-tuning), you need to store not just the model weights, but also the optimizer states and gradients for every single parameter.

Here's the memory breakdown for a 70B model using the standard Adam optimizer (a popular gradient descent algorithm) with FP16 (half-precision) weights:

| Component | Memory | Formula |
|---|---|---|
| Model weights (FP16) | 140 GB | $70\text{B} \times 2$ bytes |
| Gradients (FP16) | 140 GB | $70\text{B} \times 2$ bytes |
| Adam optimizer $m$ (FP32) | 280 GB | $70\text{B} \times 4$ bytes |
| Adam optimizer $v$ (FP32) | 280 GB | $70\text{B} \times 4$ bytes |
| **Total** | ~840 GB | Needs multiple A100-80GB GPUs |
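As a sanity check, these totals can be reproduced with a few lines of arithmetic:

```python
# Reproduce the 70B full fine-tuning memory budget with Adam + FP16 weights.
PARAMS = 70e9
GB = 1e9

weights = PARAMS * 2 / GB    # FP16 weights: 2 bytes/param -> 140 GB
gradients = PARAMS * 2 / GB  # FP16 gradients: 2 bytes/param -> 140 GB
adam_m = PARAMS * 4 / GB     # FP32 first moment: 4 bytes/param -> 280 GB
adam_v = PARAMS * 4 / GB     # FP32 second moment: 4 bytes/param -> 280 GB

total = weights + gradients + adam_m + adam_v
print(f"Total: {total:.0f} GB")  # far beyond any single GPU
```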

For most companies and researchers, this is impossible. LoRA's key insight comes from a paper by Aghajanyan et al.[1]: models are over-parameterized. When they learn a new task, the actual change in their weights is mathematically simple. You don't need to change every parameter independently; you can describe the change using a much smaller set of variables.

The intuition: "low rank"

Think of a high-resolution photograph. It might have millions of pixels. But if the photo is just a simple gradient from blue to white, you don't need to list every pixel's color. You can just say "start blue at the top, fade to white at the bottom." That's a "low-rank" description: it captures the same information with a tiny fraction of the data.

LoRA assumes the update to the model weights ($\Delta W$) is like that simple gradient. Even though the weight matrix is huge, the change needed to learn a new task has "low intrinsic rank": it can be represented by multiplying two very small matrices together.


The math

Low-rank decomposition

Instead of learning a full update matrix $\Delta W$ that has the same massive dimensions as the original weights ($d_{\text{out}} \times d_{\text{in}}$), LoRA factorizes it into two smaller matrices:

$$\Delta W = B \cdot A$$

where:

  • $A \in \mathbb{R}^{r \times d_{\text{in}}}$: the down-projection (compresses the input to a small rank $r$)
  • $B \in \mathbb{R}^{d_{\text{out}} \times r}$: the up-projection (expands it back to the output dimension)
  • $r$ is the rank, a tiny number (typically 8, 16, or 64) compared to the model's dimensions (often 4096 or more)[2]

The modified forward pass becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

where $W_0$ is the frozen pretrained weight matrix, and only $A$ and $B$ are trained.

[Diagram: LoRA adapter architecture. The original weight matrix $W$ is frozen, and two small low-rank matrices $A$ and $B$ are trained such that the effective weight becomes $W + BA$, with rank $r \ll d$.]
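One consequence of this formulation is worth noting early: because the adapter is just an additive term, $BA$ can be folded into $W_0$ after training. The toy check below (plain Python, made-up small matrices) verifies that the two-branch forward pass matches a single pass through the merged weight:

```python
# Toy check (plain Python, made-up numbers): the two-branch forward pass
#   h = W0 x + (alpha/r) * B A x
# gives the same result as one pass through the merged weight
#   W' = W0 + (alpha/r) * B A.

def matmul(M, N):
    """Multiply matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matvec(M, x):
    """Matrix-vector product."""
    return [sum(M[i][k] * x[k] for k in range(len(x))) for i in range(len(M))]

d, r, alpha = 2, 1, 2
scaling = alpha / r                  # the alpha/r factor from the equation
W0 = [[1.0, 0.0], [0.0, 1.0]]        # frozen pretrained weight (d_out x d_in)
A = [[0.5, -0.5]]                    # down-projection (r x d_in)
B = [[2.0], [4.0]]                   # up-projection (d_out x r)
x = [1.0, 3.0]

# Two-branch forward: frozen base path plus scaled adapter path
adapter = matvec(B, matvec(A, x))
h = [b + scaling * a for b, a in zip(matvec(W0, x), adapter)]

# Merged forward: fold B A into W0 first, then a single matvec
BA = matmul(B, A)
W_merged = [[W0[i][j] + scaling * BA[i][j] for j in range(d)] for i in range(d)]
h_merged = matvec(W_merged, x)

assert h == h_merged  # identical outputs, so merging adds zero inference cost
```

This is the same identity that deployment relies on: after training you can ship a single merged weight matrix with no extra latency.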

Parameter savings: concrete calculation

Let's verify the savings. Consider a weight matrix $W$ with dimensions $4096 \times 4096$.

$$\text{LoRA params} = r \cdot d_{\text{in}} + d_{\text{out}} \cdot r = 2r \cdot d$$

where $r$ is the LoRA rank, and $d_{\text{in}}, d_{\text{out}}$ are the matrix dimensions (both equal to $d$ for square projection layers).

| Approach | Parameters | Ratio | Memory (FP16) |
|---|---|---|---|
| Full fine-tuning | $4096^2 =$ 16.7M | 100% | 33.5 MB |
| LoRA ($r=8$) | $2 \times 8 \times 4096 =$ 65.5K | 0.39% | 131 KB |
| LoRA ($r=16$) | $2 \times 16 \times 4096 =$ 131K | 0.78% | 262 KB |
| LoRA ($r=64$) | $2 \times 64 \times 4096 =$ 524K | 3.1% | 1 MB |

💡 Key insight: You get a 99% reduction in trainable parameters with $r=16$, while maintaining competitive performance on most downstream tasks.
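You can reproduce the table's parameter counts directly:

```python
# Reproduce the trainable-parameter counts for a square 4096 x 4096 layer.
d = 4096
full = d * d               # full fine-tuning touches every entry
for r in (8, 16, 64):
    lora = 2 * r * d       # A is (r x d_in) plus B is (d_out x r)
    print(f"r={r:>2}: {lora:>7,} params ({lora / full:.2%} of full)")
```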

Initialization strategy

This is a subtle but critical detail. If we initialized $A$ and $B$ randomly, the model would start with random noise added to its predictions, breaking its pretrained knowledge.

To fix this:

  • $A$ is initialized with small random values (Gaussian in the original paper; Kaiming uniform in the peft implementation).
  • $B$ is initialized to zero.

This ensures that at the very start of training, $B \cdot A = 0$, so $\Delta W = 0$. The model starts exactly as the pretrained model and gradually learns the adaptation.
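A quick way to convince yourself (plain Python, arbitrary small dimensions): with $B$ all zeros, every entry of $B \cdot A$ is zero regardless of how $A$ was drawn.

```python
# With B = 0 at initialization, the adapter's update B @ A is exactly zero,
# so the adapted layer reproduces the frozen pretrained layer at step 0.
import random

r, d = 4, 8
A = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]  # random
B = [[0.0] * r for _ in range(d)]                                    # zeros

# (B @ A)[i][j] = sum_k B[i][k] * A[k][j] = 0 for every entry
BA = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
      for i in range(d)]

assert all(entry == 0.0 for row in BA for entry in row)  # delta W = 0
```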

Understanding Alpha ($\alpha$)

The scaling factor $\frac{\alpha}{r}$ is crucial. It acts as a multiplier for the adapter's contribution:

$$\text{scaling} = \frac{\alpha}{r}$$

  • $r$ (rank): determines the capacity of the adapter.
  • $\alpha$ (alpha): determines the strength of the adapter's signal.

By convention, $\alpha$ is often set to $2r$. The ratio $\frac{\alpha}{r}$ lets you change the rank $r$ without re-tuning the learning rate: if you double $r$, the magnitude of the output of $BA$ likely increases, and dividing by the larger $r$ keeps the scale roughly constant.

🎯 Production tip: A common heuristic is to set $\alpha = 2r$ (e.g., if $r=16$, set $\alpha=32$). If you find the model isn't learning enough, you can increase $\alpha$ instead of increasing the learning rate.


Where to inject LoRA


Transformers have many linear layers. The original LoRA paper[2] focused on the Query ($W_Q$) and Value ($W_V$) projections in the attention mechanism. However, subsequent research suggests that adapting all linear layers yields better results.

Common injection strategies:

| Strategy | Target modules | When to use |
|---|---|---|
| Original LoRA | $W_Q, W_V$ only | Light adaptation, small datasets, maximum speed |
| Extended | $W_Q, W_K, W_V, W_O$ | Most practical use cases |
| Aggressive | All linear layers (attention + FFN) | Domain shift, reasoning tasks, larger datasets |

Targeting the MLP layers (gate, up, and down projections) effectively doubles the trainable parameters but often closes the performance gap with full fine-tuning on complex tasks.

Why adapt MLP layers? Attention layers ($W_Q, W_K, W_V, W_O$) primarily handle context mixing, routing information between tokens. MLP layers handle knowledge processing and fact retrieval. For tasks that require learning new facts or complex reasoning (like coding or medical diagnosis), adapting the MLP layers provides the model with more capacity to store new knowledge patterns.


Complete implementation with HuggingFace PEFT

The peft library makes this easy. You don't need to write the matrix multiplication manually. The following code demonstrates how to apply LoRA to a pretrained Llama-3-8B model. It takes the base model, applies a configuration targeting all linear layers with rank 16, and outputs a wrapped model ready for memory-efficient training.

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                         # Rank
    lora_alpha=32,                # Scaling factor: alpha/r = 32/16 = 2.0
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,            # Regularization
    bias="none",                  # Don't train biases
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA - freezes base weights, adds adapter params
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: "trainable params: 83,886,080 || all params: 8,114,212,864 || trainable%: 1.03%"
```

Under the hood: the forward pass

To understand what peft is doing, here is a simplified implementation of a LoRA layer in PyTorch. This custom linear layer replaces a standard nn.Linear module. It takes an input tensor x, computes both the frozen base model projection and the trainable low-rank adaptation, and returns the combined output scaled by alpha.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-augmented linear layer."""

    def __init__(self, base_layer: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base_layer
        self.base.weight.requires_grad_(False)  # Freeze the base model weights!

        d_out, d_in = base_layer.weight.shape
        # Initialize A with random small values
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        # Initialize B with zeros so B @ A = 0 at the start of training
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. Compute the original frozen output
        base_out = self.base(x)

        # 2. Compute the adapter output: x -> A -> B
        #    Shape: (batch, seq, d_in) -> (batch, seq, r) -> (batch, seq, d_out)
        lora_out = (x @ self.A.T) @ self.B.T

        # 3. Sum them up
        return base_out + self.scaling * lora_out
```

Rank selection: empirical analysis

The rank $r$ is the most important hyperparameter. How small is too small?

| Rank ($r$) | Trainable % (Llama 8B) | Quality | Use case |
|---|---|---|---|
| 4 | 0.08% | Good enough | Simple sentiment analysis, classification |
| 8 | 0.16% | Standard | Instruction following, chat |
| 16 | 0.32% | High fidelity | General-purpose fine-tuning |
| 64 | 1.3% | Diminishing returns | Complex domain adaptation (e.g., math/code) |
| 256 | 5.2% | Wasteful | Rarely outperforms $r=64$ |

Research insight:[2] Performance often plateaus quickly. Going from $r=8$ to $r=64$ gives only marginal improvements on many benchmarks. This confirms the "low intrinsic dimensionality" hypothesis: the knowledge needed for most tasks is compact.


LoRA vs full fine-tuning: decision framework


Choosing between LoRA and full fine-tuning (FFT) is a tradeoff between resource efficiency and model capacity.

  • LoRA acts as a regularizer. Because it limits the number of trainable parameters, it prevents the model from forgetting its pre-trained knowledge ("catastrophic forgetting") when training on small datasets.
  • Full fine-tuning has higher capacity. If you have a massive dataset (>100k samples) and the target domain is very different from the pre-training data (e.g., medical records vs. internet text), full fine-tuning allows the model to adjust all weights to capture new patterns.

Practical Implementation Guide

When training LoRA models in production (using libraries like peft or trl), keep these defaults in mind:

| Hyperparameter | Recommended value | Note |
|---|---|---|
| Learning rate | 2e-4 to 1e-3 | Much higher than full fine-tuning (usually 1e-5) |
| Batch size | 8–32 | Use gradient accumulation to simulate larger batches if VRAM is tight |
| Rank ($r$) | 8 or 16 | Start here; only go to 64 for difficult, domain-heavy tasks |
| Alpha ($\alpha$) | $2 \times r$ | E.g., $\alpha=32$ for $r=16$ |
| Dropout | 0.05 or 0.1 | Prevents overfitting on small datasets |
| Target modules | all-linear | Adapting all linear layers usually outperforms just q_proj, v_proj |

QLoRA: going further

Standard LoRA freezes the base model in 16-bit precision (BF16 or FP16). QLoRA (Quantized LoRA)[3] takes this a step further by quantizing the base model to 4-bit.

| Method | Memory (70B model) | GPU requirement |
|---|---|---|
| Full FT (FP16) | ~840 GB | 10+ A100-80GB |
| LoRA (FP16 base) | ~140 GB | 2× A100-80GB |
| QLoRA (NF4 base) | ~35 GB | 1× 48 GB GPU (e.g., A6000) ✅ |

QLoRA introduces three key innovations to make 4-bit training possible without losing accuracy:

  1. NF4 (Normal Float 4-bit): Standard 4-bit float types are evenly spaced, but neural network weights usually follow a bell curve (normal distribution). NF4 is information-theoretically optimal for this distribution, placing more representational levels where the weights cluster (near zero) and fewer at the outliers.
  2. Double quantization: Quantization requires "quantization constants" to scale the 4-bit values back to FP16. QLoRA quantizes these constants as well, shaving off an extra ~0.37 bits per parameter. On a 65B model, this saves ~3 GB of VRAM.
  3. Paged optimizers: This feature uses unified memory to offload optimizer states to CPU RAM when the GPU runs out of memory, preventing OOM (out-of-memory) crashes during training spikes.
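These figures check out with back-of-the-envelope arithmetic (the ~0.37 bits/parameter saving is the figure reported in the QLoRA paper):

```python
# Back-of-the-envelope QLoRA memory arithmetic.
GB = 1e9

# Frozen 70B base: FP16 (2 bytes/param) vs NF4 (0.5 bytes/param)
params_70b = 70e9
base_fp16 = params_70b * 2 / GB     # 140 GB
base_nf4 = params_70b * 0.5 / GB    # 35 GB, matching the table above

# Double quantization: ~0.37 bits/param saved on quantization constants
params_65b = 65e9
dq_savings_gb = params_65b * 0.37 / 8 / GB  # ~3 GB on a 65B model

print(f"NF4 base: {base_nf4:.0f} GB, double-quant saves ~{dq_savings_gb:.1f} GB")
```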

🎯 Production tip: QLoRA is the standard for fine-tuning large models (70B+) on consumer hardware or single server GPUs.

Multi-adapter serving

A major production advantage of LoRA is hot-swapping.

In a traditional setup, if you have 10 custom models for 10 different clients, you need to load 10 full copies of the model (10 × 140 GB). With LoRA, you load the frozen base model once (140 GB). You then load 10 tiny adapters (10 × 100 MB).

When a request comes in for Client A, the system adds Adapter A's weights to the computation. When a request comes for Client B, it swaps to Adapter B. This takes milliseconds.

Systems like S-LoRA[4] optimize this further by batching requests from different adapters together, enabling a single GPU to serve thousands of fine-tuned variants simultaneously with near-zero overhead.
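The savings are easy to quantify with the illustrative sizes above (a 140 GB base, roughly 100 MB per adapter):

```python
# Ten per-client variants: full model copies vs. one base plus LoRA adapters.
base_gb = 140.0       # frozen base model, loaded once
adapter_gb = 0.1      # ~100 MB per adapter
clients = 10

full_copies = clients * base_gb                  # 1400 GB of VRAM
multi_adapter = base_gb + clients * adapter_gb   # 141 GB of VRAM

print(f"{full_copies:.0f} GB vs {multi_adapter:.0f} GB, "
      f"a {full_copies / multi_adapter:.1f}x reduction")
```

The gap widens with every additional client, which is why adapter-level multi-tenancy scales where per-client model copies cannot.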


Advanced: DoRA (weight-decomposed LoRA)

A recent improvement called DoRA[5] decomposes the weight update into two components: magnitude and direction.

$$W' = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$$

where $m$ is a learnable magnitude vector and $\| \cdot \|_c$ is the column-wise norm.

Why this works: LoRA forces the magnitude and direction of the weight update to change together. In full fine-tuning, the model can rotate weight vectors (change direction) without changing their length (magnitude), or vice versa. Standard LoRA couples these updates. DoRA restores this flexibility by separating the magnitude vector $m$ from the directional update matrix.

Empirically, DoRA achieves higher accuracy than LoRA at the same rank, especially for tasks requiring subtle reasoning. Since the decomposition can be merged back into standard linear weights after training, DoRA has zero inference overhead compared to base models.
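To make the decomposition concrete, here is a toy numeric sketch in plain Python (made-up 2×2 values). It also shows why DoRA starts exactly at the pretrained model: with $m$ initialized to the column norms of $W_0$ and a zero adapter, $W'$ equals $W_0$, mirroring LoRA's zero-init property.

```python
# Toy DoRA decomposition (plain Python, made-up 2x2 values):
#   W' = m * (W0 + BA) / ||W0 + BA||_c
# where m holds one learnable magnitude per column.
import math

rows, cols = 2, 2
W0 = [[3.0, 0.0], [4.0, 2.0]]   # frozen base weight
BA = [[0.0, 0.0], [0.0, 0.0]]   # low-rank update (zero at initialization)

V = [[W0[i][j] + BA[i][j] for j in range(cols)] for i in range(rows)]

# Column-wise L2 norms of V
col_norm = [math.sqrt(sum(V[i][j] ** 2 for i in range(rows)))
            for j in range(cols)]

# DoRA initializes m to the column norms of W0, so training starts at W0
m = list(col_norm)

W_prime = [[m[j] * V[i][j] / col_norm[j] for j in range(cols)]
           for i in range(rows)]

assert W_prime == W0  # with a zero adapter, W' reproduces the pretrained W0
```

During training, $m$ and $BA$ then move independently, giving the model the decoupled magnitude/direction updates the paragraph above describes.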


Key takeaways

  • Low intrinsic rank: Fine-tuning mainly updates a small subspace of the model's weights. LoRA exploits this by training low-rank matrices $A$ and $B$.
  • Massive efficiency: Reduces trainable parameters by ~10,000× and GPU memory by ~3× (or more with QLoRA).
  • Injection strategy: Target all linear layers (attention + FFN) for best performance, not just Q/V.
  • Zero-overhead inference: Adapters can be mathematically merged into the base weights ($W' = W_0 + BA$) for deployment, so inference speed is identical to the base model.
  • Multi-tenancy: Keep one frozen base model in VRAM and hot-swap lightweight adapters per request to serve many users.
Evaluation Rubric

  1. Mathematically describes $\Delta W = BA$ with correct dimensions
  2. Explains freezing base weights and training only $A$, $B$
  3. Quantifies parameter reduction (e.g., rank 16 on a 4096×4096 layer: 131K vs 16.7M params)
  4. Discusses injection into attention projections (Q, K, V, O)
  5. Mentions QLoRA for further memory reduction via quantized base weights
Common Pitfalls
  • Treating LoRA as always equivalent to full fine-tuning
  • Ignoring quantization interactions with adapters
  • Not knowing where adapters are typically injected
  • Confusion about A vs B matrix dimensions and initialization
Key Concepts Tested

  • Low-rank decomposition $\Delta W = BA$
  • Adapter injection points in transformers
  • Memory and compute savings
  • Rank selection tradeoffs
  • QLoRA and quantization interactions
  • Multi-adapter serving (S-LoRA, LoRAX)
References

1. Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.
2. Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
3. Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS.
4. Liu, S., et al. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML.
5. Sheng, Y., et al. (2023). S-LoRA: Serving Thousands of Concurrent LoRA Adapters.
