
LoRA & Parameter-Efficient Tuning

Understand the mathematics of Low-Rank Adaptation (LoRA), modern adapter targeting strategies, and the real memory tradeoffs compared to full fine-tuning and QLoRA.


Distributed training showed how to make full-model updates fit by spreading weights, gradients, and optimizer state across GPUs. LoRA asks a different question: when the behavior change is narrow, can you avoid updating the full model at all?

Suppose a warehouse support model already handles returns, but a new carrier policy changes how damaged items should be escalated. The model doesn't need to relearn language, invoices, or general support behavior. It needs a targeted update for one slice of behavior.

That is the problem LoRA solves. It adapts a large model without updating every weight. This lesson explains low-rank adapters, where they attach, how they train, and why they are useful for domain-specific behavior.

Full fine-tuning updates every connection in the model, which is slow, expensive, and memory-heavy. LoRA (Low-Rank Adaptation) freezes the pretrained model and adds a small set of trainable adapter weights that modify its behavior. In practice, this often cuts the trainable parameter count by orders of magnitude. In the original GPT-3 175B experiments, LoRA reduced trainable parameters by 10,000x and GPU memory by about 3x relative to full fine-tuning.[1]

Why LoRA exists

Full fine-tuning a 70-billion parameter model is expensive because training stores more than model weights. It also needs gradients and optimizer state for every trainable parameter. Start by naming the accounting recipe: this table assumes low-precision parameters and gradients with two FP32 Adam moment tensors, but no separate FP32 master parameter copy.

Here's the memory breakdown for a 70B model using Adam with FP16 (half-precision) weights:

| Component | Memory | Formula |
| --- | --- | --- |
| Model weights (FP16) | 140 GB | $70\text{B} \times 2$ bytes |
| Gradients (FP16) | 140 GB | $70\text{B} \times 2$ bytes |
| Adam optimizer $m$ (FP32, single-precision) | 280 GB | $70\text{B} \times 4$ bytes |
| Adam optimizer $v$ (FP32, single-precision) | 280 GB | $70\text{B} \times 4$ bytes |
| Total under this 12-byte recipe | 840 GB | Before activations and temporary buffers |

This is only the model-state part of training memory. Real runs also pay for activations. A training stack that keeps a separate FP32 master copy adds another 280 GB, producing the 1.12 TB, 16-byte recipe from the distributed-training lesson.

full-finetuning-memory-recipe.py
```python
params_billion = 70
gb_per_byte_per_billion = 1

recipe = {
    "fp16_parameters": 2,
    "fp16_gradients": 2,
    "fp32_adam_moments": 8,
}
without_master = params_billion * sum(recipe.values()) * gb_per_byte_per_billion
with_master = without_master + params_billion * 4

print("bytes_per_parameter_without_master=", sum(recipe.values()))
print("full_finetuning_states_without_master_GB=", without_master)
print("full_finetuning_states_with_master_GB=", with_master)
print("frozen_fp16_base_floor_for_lora_GB=", params_billion * recipe["fp16_parameters"])
```
Fine-tuning recipe output
```
bytes_per_parameter_without_master= 12
full_finetuning_states_without_master_GB= 840
full_finetuning_states_with_master_GB= 1120
frozen_fp16_base_floor_for_lora_GB= 140
```

For most teams, this makes full fine-tuning impractical. LoRA builds on the empirical observation from Aghajanyan et al. that the update directions needed during fine-tuning often live in a surprisingly low-dimensional subspace of the full parameter space.[2] In other words, you don't need to adjust every single weight independently. Most of the task-specific signal can be captured by changing the weights along a much smaller number of "important directions." LoRA exploits exactly that structure by learning the update as a product of two thin matrices instead of one giant matrix.

What LoRA saves, and what it doesn't

LoRA saves memory mostly by eliminating gradients and optimizer state for the frozen base model. That is the big win. The frozen weights still need to stay in memory, and the adapters carry their own small gradient and optimizer state.

That means activation memory is not eliminated. Activations (the intermediate tensors computed during the forward pass) still scale with sequence length, batch size, and model depth, so you can still run out of memory on long contexts even with tiny adapters. Freezing weights can change exactly which tensors autograd retains, so do not claim activation bytes are identical without measuring your implementation. For long contexts, evaluate gradient checkpointing, smaller micro-batches, and memory-efficient attention kernels alongside LoRA.

Key Insight: LoRA removes base-weight gradient and optimizer-state storage, not the need to execute the full base network. A batch of 2,048-token warehouse support tickets can still be dominated by activations. Measure peak memory under the intended context length and micro-batch size.
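The accounting above can be turned into a rough estimate of LoRA's model-state memory. The sketch below assumes the adapters follow the same 12-byte recipe as the full fine-tuning table (FP16 weights, FP16 gradients, two FP32 Adam moments), and the 200M adapter size is a hypothetical figure chosen only for illustration. Activations and temporary buffers are deliberately left out, so treat the result as a floor, not a prediction of peak memory.

```python
# Model-state memory only; activations and temporary buffers are excluded.
BYTES_FP16 = 2
BYTES_ADAM_FP32 = 8  # two FP32 moment tensors (m and v)
BYTES_TRAINABLE = BYTES_FP16 + BYTES_FP16 + BYTES_ADAM_FP32  # the 12-byte recipe


def full_finetune_states_gb(params_billion: float) -> float:
    """Weights, gradients, and Adam moments for every parameter."""
    return params_billion * BYTES_TRAINABLE


def lora_states_gb(base_params_billion: float, adapter_params_billion: float) -> float:
    """Frozen FP16 base weights plus the 12-byte recipe for the adapters only."""
    frozen_base = base_params_billion * BYTES_FP16
    adapter_states = adapter_params_billion * BYTES_TRAINABLE
    return frozen_base + adapter_states


base_b = 70.0
adapter_b = 0.2  # hypothetical adapter size (200M parameters), for illustration only

full_gb = full_finetune_states_gb(base_b)
lora_gb = lora_states_gb(base_b, adapter_b)
print(f"full_finetuning_states_GB={full_gb:.1f}")
print(f"lora_states_GB={lora_gb:.1f}")
print(f"model_state_reduction={full_gb / lora_gb:.1f}x")
```

Under these assumptions the frozen base dominates the LoRA total, which is why quantizing the base (the QLoRA approach) is the next lever once gradients and optimizer state are gone.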

The visual below switches from the 70B planning example to a 65B reference point so its QLoRA column matches the demonstrated 65B run from the QLoRA paper.[3]

Figure: 65B memory evidence comparison. Full fine-tuning uses 780 GB of model states under a 12-byte recipe, LoRA has a 130 GB FP16 frozen-base floor, and QLoRA demonstrated a 65B run on one 48 GB GPU.
Under the stated 12-byte full-tuning recipe, LoRA drops base gradient and Adam-state storage while keeping the frozen base weights. Activations and adapter state remain workload dependent.

The overlay intuition

Think of full fine-tuning as rebuilding an entire fulfillment routing table because you need one new carrier rule. It's expensive, slow, and you might accidentally change routes that already worked.

LoRA is like keeping the original routing table frozen but adding a small overlay for the rows that need adaptation. Requests still flow through the original table, with the overlay adding a compact correction.

  • The base table: The frozen pre-trained weights ($W_0$).
  • The overlay: The low-rank matrices ($A$ and $B$).
  • The final route: The sum of both ($W_0 + \frac{\alpha}{r}BA$).

The mathematical term for this is low-rank decomposition. Even though a weight matrix is large, the change needed to learn a new task can have compact structure. LoRA represents that update by multiplying two narrow matrices together instead of learning one full-size update matrix.

The math: from intuition to formula

A worked example with small numbers

Before we write the general formula, work through a concrete example. Suppose a weight matrix $W$ has dimensions $4096 \times 4096$ (a typical Transformer projection layer).

Full fine-tuning would train $4096 \times 4096 = 16{,}777{,}216$ parameters.

With LoRA rank $r = 16$ (the low-rank dimension of the adapter update):

  • Matrix $A$ has shape $16 \times 4096$ and has $65{,}536$ parameters.
  • Matrix $B$ has shape $4096 \times 16$ and has $65{,}536$ parameters.
  • Total LoRA parameters: $131{,}072$

That's 0.78% of the full matrix size, a 99.2% reduction.

Here's how the forward pass changes. For an input vector $x$, the original output is $W_0 x$. With LoRA, the output becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

The first term is the frozen base matrix. The second term is the low-rank overlay, scaled by $\alpha / r$, where $\alpha$ is a scaling hyperparameter that controls the adapter's strength.
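The order of evaluation in the second term matters for cost. Computing $B(Ax)$ never materializes the full $d_{\text{out}} \times d_{\text{in}}$ product $BA$, so the adapter path adds only a small number of multiply-adds per token. The sketch below counts multiply-adds for one input vector, using the same $4096 \times 4096$ layer and $r = 16$ as the worked example; it ignores the scalar multiply by $\alpha / r$ and any kernel-launch or memory-traffic overhead, which real latency also depends on.

```python
def dense_macs(d_in: int, d_out: int) -> int:
    """Multiply-adds for one matrix-vector product with a d_out x d_in matrix."""
    return d_in * d_out


def lora_path_macs(d_in: int, d_out: int, rank: int) -> int:
    """A x costs rank * d_in multiply-adds; B (A x) costs d_out * rank."""
    return rank * d_in + d_out * rank


d_in = d_out = 4096
rank = 16
base = dense_macs(d_in, d_out)
adapter = lora_path_macs(d_in, d_out, rank)

print(f"base_macs={base:,}")
print(f"adapter_macs={adapter:,}")
print(f"adapter_overhead={adapter / base:.2%}")
```

The arithmetic overhead matches the parameter ratio because each adapter weight is used once per input vector, just like each base weight.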

Low-rank decomposition

Instead of learning a full update matrix $\Delta W$ with the same dimensions as the original weights ($d_{\text{out}} \times d_{\text{in}}$), LoRA factorizes it into two smaller matrices:

$$\Delta W = B \cdot A$$

where:

  • $A \in \mathbb{R}^{r \times d_{\text{in}}}$: the first matrix (maps from input to rank $r$)
  • $B \in \mathbb{R}^{d_{\text{out}} \times r}$: the second matrix (maps from rank to output)
  • $r$ is the rank, a tiny number (typically 8, 16, or 64) compared to the model's dimensions (often 4096 or more)[1]
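The factorization is what makes the update "low-rank": whatever values $A$ and $B$ take, the product $BA$ has the full $d_{\text{out}} \times d_{\text{in}}$ shape but its rank can never exceed $r$. The sketch below checks that on toy integer matrices with exact fraction arithmetic. The sizes and the random entries are arbitrary choices for illustration, not values from any model.

```python
from fractions import Fraction
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def matrix_rank(M):
    """Rank via Gaussian elimination over exact fractions (no rounding error)."""
    rows = [[Fraction(v) for v in row] for row in M]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col] != 0:
                factor = rows[i][col] / rows[rank][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


random.seed(0)
d_in, d_out, r = 6, 5, 2  # toy sizes, chosen only for illustration
A = [[random.randint(-3, 3) for _ in range(d_in)] for _ in range(r)]   # r x d_in
B = [[random.randint(-3, 3) for _ in range(r)] for _ in range(d_out)]  # d_out x r
delta_W = matmul(B, A)                                                  # d_out x d_in

print("delta_W shape =", (len(delta_W), len(delta_W[0])))
print("rank(delta_W) <= r:", matrix_rank(delta_W) <= r)
```

This is the constraint LoRA accepts in exchange for fewer trainable parameters: the update can only move the layer within an $r$-dimensional family of directions.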

The figure below shows the split between the frozen base path and the trainable adapter path.

Figure: LoRA adapter architecture showing frozen base weight W0 running alongside a low-rank path where A projects from d_in to rank r and B projects from rank r back to d_out.
Trace the two paths. The frozen base W0 runs unchanged. The LoRA path projects input down to rank r, then back up to d_out. The LoRA output is scaled by α/r and then added to the base output.

Parameter savings: concrete calculation

Let's verify the savings for a realistic projection layer.

$$\text{LoRA params} = r \cdot d_{\text{in}} + d_{\text{out}} \cdot r = 2r \cdot d$$

Where $r$ is LoRA rank, and $d_{\text{in}}, d_{\text{out}}$ are matrix dimensions (equal to $d$ for square projection layers).

| Approach | Parameters | Ratio | Memory (FP16) |
| --- | --- | --- | --- |
| Full fine-tuning | $4096^2 =$ 16.7M | 100% | 33.5 MB |
| LoRA ($r=8$) | $2 \times 8 \times 4096 =$ 65.5K | 0.39% | 131 KB |
| LoRA ($r=16$) | $2 \times 16 \times 4096 =$ 131K | 0.78% | 262 KB |
| LoRA ($r=64$) | $2 \times 64 \times 4096 =$ 524K | 3.1% | 1 MB |

For $r=16$, the adapter trains less than 1% as many parameters as the full matrix. That is the core tradeoff: you give the task a constrained update space, and in return you remove most gradient and optimizer-state memory for that layer.

parameter-savings-concrete-calculation.py
```python
def lora_parameter_count(d_in: int, d_out: int, rank: int) -> int:
    return rank * d_in + d_out * rank

d_in = d_out = 4096
rank = 16
full_params = d_in * d_out
lora_params = lora_parameter_count(d_in, d_out, rank)
ratio = lora_params / full_params

print(f"full_params={full_params:,}")
print(f"lora_params={lora_params:,}")
print(f"LoRA r={rank} trains {lora_params:,} params, {ratio:.2%} of full fine-tuning.")
```
Parameter savings output
```
full_params=16,777,216
lora_params=131,072
LoRA r=16 trains 131,072 params, 0.78% of full fine-tuning.
```

Initialization strategy

This is a subtle but important detail. The adapter path should start as a no-op. If both $A$ and $B$ started with arbitrary non-zero values, the model would begin with random noise added to its predictions and drift away from the pretrained checkpoint.

In the original LoRA setup, one factor is initialized randomly and the other is initialized to zero.[1] A common pattern is:

  • $A$ starts with small random values.
  • $B$ starts at zero.

This ensures that at the very start of training, $B \cdot A = 0$, so $\Delta W = 0$. The model starts with the same behavior as the pretrained model and gradually learns the adaptation.

Common Mistake: Initializing both $A$ and $B$ with small random values because "random init worked for the base model." If both matrices start non-zero, the first forward pass outputs pretrained predictions plus random noise, and the model immediately drifts from its checkpoint. Always zero-initialize one factor.
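It is worth seeing why zero-initializing only one factor still lets training start. For $h = W_0 x + \frac{\alpha}{r} B A x$, the gradient with respect to $B$ depends on $Ax$, while the gradient with respect to $A$ depends on $B$. With $B = 0$ and $A$ non-zero, $B$ receives a useful gradient on the first step and $A$ begins to move once $B$ is no longer zero. The sketch below works this out by hand for a toy loss $L = \sum_i h_i$; the matrices and input are made-up small values for illustration.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_grads(A, B, x, scale):
    """Gradients of L = sum(h) with h = W0 x + scale * B (A x).

    W0 is frozen and does not appear in either adapter gradient.
    """
    d_out, r = len(B), len(A)
    g = [1.0] * d_out                       # dL/dh for L = sum(h)
    Ax = matvec(A, x)                       # length r
    Bt_g = [sum(B[i][k] * g[i] for i in range(d_out)) for k in range(r)]
    grad_B = [[scale * g[i] * Ax[k] for k in range(r)] for i in range(d_out)]
    grad_A = [[scale * Bt_g[k] * xj for xj in x] for k in range(r)]
    return grad_A, grad_B


x = [2.0, -1.0, 0.5]
A = [[0.1, -0.2, 0.3], [0.05, 0.4, -0.1]]  # small non-zero values (r=2, d_in=3)
B_zero = [[0.0, 0.0], [0.0, 0.0]]          # zero init (d_out=2, r=2)

delta_out = matvec(B_zero, matvec(A, x))
grad_A, grad_B = lora_grads(A, B_zero, x, scale=2.0)

print("adapter output at init:", delta_out)
print("grad_A all zero:", all(v == 0.0 for row in grad_A for v in row))
print("grad_B has non-zero entries:", any(v != 0.0 for row in grad_B for v in row))
```

The same algebra shows why zeroing both factors would be a mistake in the other direction: both gradients would vanish and the adapter could not start learning.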

Understanding alpha ($\alpha$)

The scaling factor $\frac{\alpha}{r}$ matters. It acts as a multiplier for the adapter's contribution:

$$\text{scaling} = \frac{\alpha}{r}$$

  • $r$ (rank): Determines the capacity of the adapter.
  • $\alpha$ (alpha): Determines the strength of the adapter's signal.

For ordinary LoRA scaling, $\alpha = r$ and $\alpha = 2r$ produce multipliers of 1 and 2. Holding $\alpha / r$ constant is useful when you want a rank comparison without also changing this explicit multiplier, but it does not guarantee identical learned update magnitude. Treat rank, scaling, learning rate, and target-module coverage as coupled sweep axes rather than universal defaults.

lora-scaling-factor.py
```python
configs = [
    {"rank": 8, "alpha": 16},
    {"rank": 32, "alpha": 16},
    {"rank": 32, "alpha": 64},
]

for config in configs:
    scale = config["alpha"] / config["rank"]
    print(f"r={config['rank']:>2} alpha={config['alpha']:>2} scale={scale:.1f}")
```
LoRA scaling output
```
r= 8 alpha=16 scale=2.0
r=32 alpha=16 scale=0.5
r=32 alpha=64 scale=2.0
```

Where to inject LoRA

When adapting a Transformer with LoRA, you must decide which weight matrices will receive the adapters. The diagram below illustrates the typical structure of a Transformer block, highlighting where adapters are commonly injected during the attention and feed-forward phases.

Diagram: adapter targeting options in a Transformer block, showing original LoRA (q_proj, v_proj), all attention projections, and QLoRA-style all linear layers.

In the diagram, highlighted weights are LoRA-adapted and muted weights are frozen. The original LoRA paper focused on the Query ($W_Q$) and Value ($W_V$) projections in attention.[1] QLoRA found that adapting every linear layer in the Transformer block was needed to match full-fine-tuning results in its experiments.[3]

Common injection strategies

Depending on the complexity of your task and your compute budget, you can choose to inject LoRA adapters into a subset of the available linear layers. Here are the most common strategies used in practice:

| Strategy | Target modules | When to use |
| --- | --- | --- |
| Original LoRA | $W_Q, W_V$ only | Cheapest baseline matching the original paper's attention setup. |
| All attention | $W_Q, W_K, W_V, W_O$ | Attention-only comparison with more trainable capacity. |
| All-linear | Attention projections + MLP gate/up/down projections | QLoRA-style starting point to compare when quality matters. |

Module names vary by architecture. Current Hugging Face PEFT exposes target_modules="all-linear" as the QLoRA-style shortcut, but you should still inspect which modules were matched on your model.[4]
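To see why inspecting the matched modules matters, consider what happens when a target list written for one architecture is reused on another. The sketch below is a plain-Python stand-in that matches on the final dotted component of each module name; it is not PEFT's own matching code, and the name lists are short illustrative samples in the style of two model families, not a dump from a real checkpoint. On a real model the names come from `model.named_modules()`.

```python
# Illustrative module names only; real names come from model.named_modules().
llama_style = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]
gpt2_style = [
    "transformer.h.0.attn.c_attn",
    "transformer.h.0.attn.c_proj",
    "transformer.h.0.mlp.c_fc",
    "transformer.h.0.mlp.c_proj",
]


def matched(module_names, targets):
    """Return names whose final dotted component is in targets (simplified stand-in)."""
    return [name for name in module_names if name.rsplit(".", 1)[-1] in targets]


targets = {"q_proj", "v_proj"}
for label, names in [("llama_style", llama_style), ("gpt2_style", gpt2_style)]:
    hits = matched(names, targets)
    print(f"{label}: {len(hits)} of {len(names)} modules matched -> {hits}")
```

The same target list covers two projections in one naming scheme and none in the other, which is why the matched list belongs in the run record next to the trainable-parameter count.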

Why adapt MLP layers?

Attention layers ($W_Q, W_K, W_V, W_O$) control how tokens route information to one another. MLP (Multi-Layer Perceptron) layers, also called FFNs (Feed-Forward Networks), perform per-token feature transformations inside the block. All-linear targeting gives the adapter access to both paths, at the price of more trainable parameters; compare it against narrower targeting on held-out examples.

This toy transformer block makes the budget change visible:

target-module-budget.py
```python
dimensions = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
rank = 16

def adapter_params(names):
    return sum(rank * (dimensions[name][0] + dimensions[name][1]) for name in names)

qv = adapter_params(["q_proj", "v_proj"])
all_linear = adapter_params(list(dimensions))
print("qv_adapter_params=", qv)
print("all_linear_adapter_params=", all_linear)
print("all_linear_vs_qv_ratio=", round(all_linear / qv, 2))
```
Target-module budget output
```
qv_adapter_params= 262144
all_linear_adapter_params= 1249280
all_linear_vs_qv_ratio= 4.77
```

Soft prompts and prefix tuning

LoRA is not the only parameter-efficient fine-tuning method. Two older but still useful PEFT families learn prompt-like state instead of adding low-rank updates inside weight matrices.

Prompt tuning trains a small set of continuous virtual tokens prepended to the input.[5] These aren't human-readable words. They're learned embedding vectors that steer the frozen model toward a task, like adding a small learned routing slip before every warehouse ticket. Prompt tuning is simple and cheap, but it has limited capacity because the adaptation only enters through the context window.

Prefix tuning trains continuous key/value prefixes for transformer layers.[6] Instead of changing model weights, it gives each layer learned prefix state that attention can attend to. That makes prefix tuning more expressive than plain soft prompts, because the learned control signal reaches multiple layers directly.

Use this decision rule:

| Method | What trains | Best fit |
| --- | --- | --- |
| Prompt tuning / soft prompts | Learned input embeddings | Very cheap task steering, classification-like tasks, large base models |
| Prefix tuning | Learned per-layer prefix states | Generation tasks where a stronger steering signal helps |
| LoRA / QLoRA | Low-rank weight adapters | Domain adaptation, instruction tuning, tool behavior, stronger behavior changes |

Don't call every PEFT method "LoRA." LoRA modifies internal projections through low-rank adapters. Soft prompt tuning and prefix tuning leave those projections frozen and learn continuous prompt-like state instead.
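A rough parameter count makes the difference between the three families concrete. The sketch below uses hypothetical model dimensions (hidden width 4096, 32 layers, 20 virtual tokens) and assumes the key/value width equals the hidden width, which does not hold for grouped-query attention models. It also ignores any reparameterization network a prefix-tuning implementation may use during training, so the prefix figure counts only the final learned prefixes.

```python
d_model = 4096   # hidden width (hypothetical)
n_layers = 32    # Transformer blocks (hypothetical)
n_virtual = 20   # virtual tokens / prefix length (hypothetical)
rank = 16

# Prompt tuning: one learned embedding vector per virtual token, at the input only.
prompt_tuning = n_virtual * d_model

# Prefix tuning: a learned key prefix and a learned value prefix in every layer.
prefix_tuning = 2 * n_layers * n_virtual * d_model

# LoRA on q_proj and v_proj (each d_model x d_model) in every layer.
lora_qv = n_layers * 2 * rank * (d_model + d_model)

for name, count in [
    ("prompt_tuning", prompt_tuning),
    ("prefix_tuning", prefix_tuning),
    ("lora_qv_r16", lora_qv),
]:
    print(f"{name:<14} trainable_params={count:>10,}")
```

Parameter count is only one axis. Where the parameters sit (input embeddings, per-layer attention state, or inside projection weights) is what separates the methods.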

Implementing LoRA with Hugging Face PEFT

The peft (Parameter-Efficient Fine-Tuning) library makes this easy. You don't need to write the matrix multiplication manually. The following code demonstrates how to apply LoRA to an official Qwen2.5 base model. It uses the common all-linear starting point for dense decoder models.

This is a real training setup fragment, not a tiny local smoke test. To run it, install torch, transformers, peft, and accelerate, and use a machine that can load the target model. Let your trainer or distributed launcher place trainable models; Accelerate documents device_map="auto" as a big-model inference dispatch path rather than a distributed-training strategy.[7]

implementing-lora-with-hugging-face-peft.py
```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    dtype=torch.bfloat16,
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                         # rank
    lora_alpha=32,                # scaling factor alpha/r = 32/16 = 2.0
    target_modules="all-linear",  # QLoRA-style targeting
    lora_dropout=0.05,            # regularization
    bias="none",                  # don't train biases
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Inspect this output rather than assuming how many modules matched.
```

When you run print_trainable_parameters(), you see trainable params, all params, and percentage for the exact model and module selection. Record it with each run; a shortcut is only useful if it matched the intended projection layers.

Merging the adapter to remove the adapter path

For a single-adapter deployment, you often merge the adapter back into the base weights. This removes the small runtime cost of the adapter path and produces a single, ordinary model checkpoint you can serve with any inference engine.

merging-the-adapter-path.py
```python
# After training completes
merged_model = model.merge_and_unload()
# merged_model is now a standard transformers model with no PEFT adapters
merged_model.save_pretrained("./my-adapted-model")
```

The merge_and_unload() call computes $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$ for every adapted linear layer and returns a base-model artifact with the adapter update folded into its weights.[4] The resulting artifact uses a standard dense layer path: no separate adapter matrix multiply remains, and the adapted behavior is baked into the weights. Keep adapters separate instead when one resident base must switch among many tasks or tenants.

merge-identity.py
```python
base = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]
B = [[0.5], [1.0]]
scale = 2.0
x = [2.0, 1.0]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

adapter_matrix = [
    [scale * B[row][0] * A[0][col] for col in range(2)]
    for row in range(2)
]
merged = [
    [base[row][col] + adapter_matrix[row][col] for col in range(2)]
    for row in range(2)
]
unmerged_output = [
    value + delta
    for value, delta in zip(matvec(base, x), matvec(adapter_matrix, x))
]
print("merged_weights=", merged)
print("outputs_match=", matvec(merged, x) == unmerged_output)
```
Adapter merge output
```
merged_weights= [[2.0, 1.0], [5.0, 2.0]]
outputs_match= True
```

Under the hood: the forward pass

To understand what peft is doing, here is a simplified implementation of a LoRA layer in PyTorch. This custom layer wraps a standard nn.Linear module. It takes an input tensor x, computes both the frozen base projection and the trainable low-rank adaptation, scales the adapter term by alpha / r, and returns the sum.

under-the-hood-the-forward-pass.py
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-augmented linear layer."""
    def __init__(self, base_layer: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base_layer
        for param in self.base.parameters():
            param.requires_grad_(False)  # Freeze the pretrained layer

        d_out, d_in = base_layer.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.02)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base_out = self.base(x)
        lora_out = (x @ self.A.T) @ self.B.T
        return base_out + self.scaling * lora_out

torch.manual_seed(0)
base = nn.Linear(4, 3, bias=False)
layer = LoRALinear(base, r=2, alpha=4)
x = torch.randn(2, 5, 4)

starts_as_noop = bool(torch.allclose(layer(x), base(x), atol=1e-6))
base_trainable = any(param.requires_grad for param in base.parameters())
adapter_trainable = any(param.requires_grad for param in (layer.A, layer.B))

loss = layer(x).sum()
loss.backward()

adapter_B_grad_nonzero = layer.B.grad is not None and layer.B.grad.abs().sum().item() > 0
base_has_grad = any(param.grad is not None for param in base.parameters())

print("LoRA starts as a no-op and trains adapter weights.")
print(f"starts_as_noop={starts_as_noop}")
print(f"base_trainable={base_trainable}")
print(f"adapter_trainable={adapter_trainable}")
print(f"adapter_B_grad_nonzero={adapter_B_grad_nonzero}")
print(f"base_has_grad={base_has_grad}")
```
LoRA layer output
```
LoRA starts as a no-op and trains adapter weights.
starts_as_noop=True
base_trainable=False
adapter_trainable=True
adapter_B_grad_nonzero=True
base_has_grad=False
```

Choosing the right rank

Rank $r$ controls adapter capacity, but module coverage, data quality, and learning rate can matter as much as rank. Holding the target modules fixed, adapter parameters scale linearly with $r$:

| Rank ($r$) | Relative adapter parameters | What to compare |
| --- | --- | --- |
| 4 | 0.25x of rank 16 | Cheap capacity floor |
| 8 | 0.50x of rank 16 | Low-cost candidate |
| 16 | 1.00x | Reference candidate |
| 64 | 4.00x of rank 16 | Higher-capacity candidate only if evaluation warrants it |
| 256 | 16.00x of rank 16 | Expensive diagnostic, not an assumed improvement |

Use a held-out task set and a regression set for capabilities you need to retain. A rank that lowers training loss while degrading general behavior is not a better adapter.
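One way to make that rule mechanical is to gate on the regression set first and only then compare task quality. The sketch below is one possible selection policy, not an established procedure: the function name, thresholds, and the scores in `results` are all placeholder values invented to exercise the logic, not measurements from any run.

```python
from dataclasses import dataclass


@dataclass
class RankResult:
    rank: int
    task_score: float        # held-out task metric, higher is better
    regression_score: float  # retained-capability metric, higher is better


def pick_rank(results, baseline_regression, max_regression_drop, task_tolerance):
    """Return the smallest rank that passes the regression gate and is within
    task_tolerance of the best passing task score; None if nothing passes."""
    passing = [
        r for r in results
        if baseline_regression - r.regression_score <= max_regression_drop
    ]
    if not passing:
        return None
    best = max(r.task_score for r in passing)
    good_enough = [r for r in passing if best - r.task_score <= task_tolerance]
    return min(good_enough, key=lambda r: r.rank).rank


# Placeholder scores for illustration only.
results = [
    RankResult(rank=4, task_score=0.71, regression_score=0.90),
    RankResult(rank=8, task_score=0.76, regression_score=0.90),
    RankResult(rank=16, task_score=0.78, regression_score=0.89),
    RankResult(rank=64, task_score=0.79, regression_score=0.84),
]

chosen = pick_rank(results, baseline_regression=0.90,
                   max_regression_drop=0.02, task_tolerance=0.03)
print("chosen_rank=", chosen)
```

With these placeholder inputs the highest rank is rejected by the regression gate despite its task score, and the policy prefers the cheapest rank among the remaining close candidates.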

rank-budget-scaling.py
```python
reference_rank = 16
reference_adapter_parameters = 1_249_280  # toy all-linear block from the earlier example

for rank in [4, 8, 16, 64, 256]:
    params = reference_adapter_parameters * rank // reference_rank
    multiplier = rank / reference_rank
    print(f"r={rank:>3} params={params:>8,} relative_to_r16={multiplier:.2f}x")
```
Rank budget output
```
r=  4 params= 312,320 relative_to_r16=0.25x
r=  8 params= 624,640 relative_to_r16=0.50x
r= 16 params=1,249,280 relative_to_r16=1.00x
r= 64 params=4,997,120 relative_to_r16=4.00x
r=256 params=19,988,480 relative_to_r16=16.00x
```

The LoRA and QLoRA papers report strong results at low ranks in their studied tasks, but that does not make one rank universal.[1][3] Establish module coverage first, then sweep rank only as evaluation evidence demands.

LoRA vs full fine-tuning: when to choose which

Deciding between full fine-tuning and parameter-efficient methods depends less on raw sample count and more on domain shift, GPU budget, and whether you need one model or many variants.

Diagram: decision flow starting from "Need many variants or cheap iteration?", with a "Yes" branch leading to "Evaluate LoRA / QLoRA first" (cheap variant iteration) and a "No" branch.

Choosing between LoRA and full fine-tuning (FFT) is a tradeoff between resource efficiency and model capacity.

Raw sample count is a weak proxy. Five thousand long conversations can contain more total tokens than fifty thousand short prompts, and serving requirements often matter more than dataset size anyway.

If you need many task variants, fast iteration, or cheap deployment, LoRA often still wins even with a fairly large dataset. If the target domain is radically different from pretraining and you have enough memory to update everything, full fine-tuning can warrant its extra cost.

  • LoRA keeps the original base checkpoint recoverable, but an active adapter can still regress behavior. Because fewer parameters train, LoRA constrains update capacity and makes rollback or adapter disablement straightforward. It does not guarantee that responses with the adapter enabled preserve general capabilities. Evaluate both target-task quality and regression behavior.
  • Full fine-tuning has higher capacity. If you have a very large corpus, enough memory, and a target domain that's radically different from pre-training (for example, moving from web text to highly specialized logistics documents with custom entity schemas and multi-step fulfillment rules), full fine-tuning lets the model reshape every weight to capture those new patterns. In such cases the extra cost can be justified.

Common mistakes and how to fix them

Symptom: Validation loss is flat and the model isn't learning

Cause: You might be adapting only $W_Q$ and $W_V$ when the task needs deeper behavioral changes. Fix: Compare all-linear targeting. QLoRA reported its full-fine-tuning match when applying adapters to all linear layers in each Transformer block.[3] Inspect the matched module list so you know what the shortcut actually selected.

Symptom: Adapter cost rises with rank but held-out quality does not

Cause: Rank is not the limiting axis, or ordinary $\alpha/r$ scaling has made the comparison harder to interpret. Fix: Hold target coverage fixed and state the scaling policy explicitly while comparing ranks. Check data and module coverage before increasing rank. For high-rank experiments, rsLoRA (rank-stabilized LoRA) changes scaling from $\alpha/r$ to $\alpha/\sqrt{r}$ to improve high-rank optimization behavior.[8][4]

Symptom: You changed rank and now the model overfits or underfits

Cause: Mismatched alpha. If you change $r$ without changing $\alpha$, the effective scaling changes. Fix: For an isolated ordinary-LoRA rank comparison, hold $\alpha/r$ constant; for an optimization search, track rank and alpha as separate parameters. Do not present one alpha rule as universal.
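To make the scaling effects concrete, here is a small sketch that prints the explicit multiplier under three policies: a fixed alpha, an alpha that grows with rank, and the rank-stabilized $\alpha/\sqrt{r}$ form. The choice of `alpha = 16` and the `alpha = 2r` policy are illustrative assumptions, not recommendations.

```python
import math


def lora_scale(alpha, rank):
    """Ordinary LoRA multiplier applied to the BA update."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized multiplier: alpha / sqrt(rank)."""
    return alpha / math.sqrt(rank)


alpha = 16  # illustrative value
for rank in (8, 16, 32, 64):
    fixed_alpha = lora_scale(alpha, rank)       # alpha held at 16 while rank changes
    fixed_ratio = lora_scale(2 * rank, rank)    # alpha = 2r keeps alpha/r at 2.0
    stabilized = rslora_scale(alpha, rank)      # shrinks more slowly as rank grows
    print(
        f"r={rank:>2}  alpha/r={fixed_alpha:.3f}  "
        f"(alpha=2r)/r={fixed_ratio:.1f}  alpha/sqrt(r)={stabilized:.3f}"
    )
```

With alpha fixed, the ordinary multiplier falls by 8x between rank 8 and rank 64, so a rank comparison under that policy also changes the update scale. The multiplier is only the explicit factor; the learned magnitude of $BA$ still has to be checked by evaluation.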

Symptom: Inference latency is higher than expected in production

Cause: You are applying an adapter path at runtime in a deployment that only needs one adapter. Fix: For a single-adapter deployment, evaluate merging after training. The merged weights are $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$, a standard linear layer with no adapter-path overhead. Keep adapters separate intentionally for multi-adapter serving.
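The merge identity can be checked on a tiny example. This is a minimal pure-Python sketch with made-up matrices; `matmul` and `matvec` are illustrative helpers, and a real deployment would use the merge utility of its training library.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


alpha, r = 4, 2
scale = alpha / r

W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # d_out=2, d_in=3
A = [[0.5, 0.0, 1.0], [0.0, 2.0, 0.0]]    # r x d_in
B = [[1.0, 0.0], [0.5, -1.0]]             # d_out x r
x = [1.0, 2.0, 3.0]

# Adapter path at runtime: base output plus the scaled low-rank branch.
base_out = matvec(W0, x)
adapter_out = matvec(B, matvec(A, x))
runtime = [b + scale * a for b, a in zip(base_out, adapter_out)]

# Merged path: fold the scaled update into one dense matrix, then do a single matvec.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]
merged = matvec(W_merged, x)

print("runtime:", runtime)
print("merged: ", merged)
```

Both paths produce the same output, but the merged path needs one matrix-vector product per layer instead of three.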

Symptom: A multi-GPU training job uses device_map="auto" as its sharding plan

Cause: Inference dispatch and distributed training were confused. device_map="auto" is documented in Accelerate's Big Model Inference path. Fix: Let the trainer or launcher select supported training placement, and use FSDP, DeepSpeed, or a deliberate QLoRA setup when the model does not fit on one training device.[7]

Symptom: Your PEFT notes say "LoRA" but the method trains prompt vectors

Cause: You're mixing separate PEFT families. Soft prompt tuning learns input embeddings. Prefix tuning learns per-layer prefix state. LoRA learns low-rank updates inside weight matrices. Fix: Name the method by what trains. If the learned parameters do not form $A$ and $B$ adapter matrices inside a projection layer, it is not LoRA.

Symptom: Your implementation shape-checks only after a late runtime error

Cause: $A$ and $B$ were transposed or initialized inconsistently. For a base matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, LoRA uses $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$, so $BA$ matches $W_0$. Fix: Assert dimensions when wrapping modules, and zero-initialize one factor so the adapter path starts as a no-op.
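A minimal sketch of both fixes, using plain Python lists so the shape logic stays visible. The helper names, the Gaussian standard deviation of 0.02, and the rank bound are assumptions made for illustration; a framework implementation would run the same checks on tensors.

```python
import random


def init_lora_factors(d_out, d_in, r, seed=0):
    """Create A (r x d_in, small random values) and B (d_out x r, zeros) as lists of rows."""
    assert 0 < r <= min(d_out, d_in), "rank must be positive and at most min(d_out, d_in)"
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def check_lora_shapes(W0, A, B):
    """Fail early if BA cannot be added to W0."""
    d_out, d_in = len(W0), len(W0[0])
    r = len(A)
    assert all(len(row) == d_in for row in A), "A must be r x d_in"
    assert len(B) == d_out and all(len(row) == r for row in B), "B must be d_out x r"
    return d_out, d_in, r


W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # d_out=2, d_in=3
A, B = init_lora_factors(d_out=2, d_in=3, r=1)
print(check_lora_shapes(W0, A, B))

# With B at zero, every entry of BA is zero, so the wrapped layer starts as the base layer.
BA = [[sum(b * a for b, a in zip(b_row, a_col)) for a_col in zip(*A)] for b_row in B]
print(all(value == 0.0 for row in BA for value in row))
```

Running the check at wrap time turns a transposed factor into an immediate assertion failure instead of a shape error deep inside the first forward pass.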

Production sweep starting points

When training LoRA models with libraries such as peft or trl, define a small first sweep and log each axis. These are candidates to test, not production defaults:

| Hyperparameter | Initial candidates | Note |
| --- | --- | --- |
| Learning rate | 1e-4, 2e-4 | Adapter SFT commonly starts above full-weight SFT rates; select on evaluation. |
| Effective token batch | Measure and hold stable | Compare token throughput and quality, not only example count. |
| Rank ($r$) | 8, 16 | Add a higher-rank candidate only when evaluation shows capacity pressure. |
| Alpha ($\alpha$) | Choose an explicit $\alpha/r$ comparison | Record scaling with rank; do not hide it inside a default. |
| Dropout | 0.0, 0.05 | Decide from validation behavior and dataset size. |
| Target modules | Q/V baseline versus all-linear | QLoRA-style coverage costs more adapter state but may improve quality. |
| Gradient checkpointing | Off/on comparison if memory is tight | It reduces activation storage by adding recomputation cost. |
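The table's candidates can be expanded into an explicit sweep so every axis is logged per run. This sketch assumes a fixed $\alpha/r$ of 2.0 as the scaling policy and uses made-up dictionary keys; adapt both to your trainer's configuration format.

```python
from itertools import product

learning_rates = [1e-4, 2e-4]
ranks = [8, 16]
dropouts = [0.0, 0.05]
target_sets = ["qv", "all-linear"]
ALPHA_OVER_R = 2.0  # assumed scaling policy, held fixed so rank is the only capacity axis

sweep = [
    {
        "learning_rate": lr,
        "rank": rank,
        "alpha": ALPHA_OVER_R * rank,  # recorded explicitly instead of hidden in a default
        "dropout": dropout,
        "targets": targets,
    }
    for lr, rank, dropout, targets in product(learning_rates, ranks, dropouts, target_sets)
]

print(len(sweep))
print(sweep[0])
```

A full grid over these four axes is 16 runs. If that is too many, drop an axis deliberately and note which one was held fixed.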

When monitoring LoRA training, track task evaluation and regression evaluation alongside validation loss. Fewer trainable parameters do not make overfitting impossible. If validation quality falls while training loss decreases, compare dropout, rank, and data cleanup. If the model does not learn the task, inspect formatting and target modules before assuming more rank is the fix.

QLoRA: going further

Standard LoRA usually keeps the frozen base model in BF16 or FP16. QLoRA (Quantized LoRA)[3] takes this a step further by storing the frozen base model in 4-bit and dequantizing on the fly for compute. The trainable LoRA adapters ($A$ and $B$) stay in a higher precision such as BF16. Autograd must still propagate signals through base-layer computation to reach earlier adapters, but it does not allocate base-weight gradient or optimizer-state tensors. The 4-bit storage is for weights you are not training.

Use consistent model sizes when comparing numbers. For a 65B model, the raw storage floor and the paper's measured feasibility statement are different kinds of evidence:

| Method | Number you can defend | What it means |
| --- | --- | --- |
| Full FT under the 12-byte recipe above | 780 GB model states | $65\text{B} \times 12$ bytes, before activations |
| LoRA with a BF16/FP16 frozen base | 130 GB base-weight floor | $65\text{B} \times 2$ bytes, plus adapters and runtime memory |
| QLoRA (NF4 base) | Fine-tuned 65B on one 48 GB GPU in the paper | Whole demonstrated setup, not a universal memory formula[3] |

The raw 4-bit payload is smaller than the actual QLoRA training footprint because quantization metadata, adapters, optimizer state, activations, and temporary buffers remain:

qlora-storage-floor.py
    parameters_billion = 65
    bf16_base_gb = parameters_billion * 2  # 2 bytes per BF16 weight
    nf4_raw_payload_gb = parameters_billion * 4 / 8  # 4 bits per weight
    double_quant_overhead_bits_per_param = 0.127  # QLoRA paper's post-double-quant estimate
    nf4_plus_constants_gb = parameters_billion * (4 + double_quant_overhead_bits_per_param) / 8

    print(f"bf16_frozen_base_floor_GB={bf16_base_gb:.1f}")
    print(f"nf4_raw_payload_GB={nf4_raw_payload_gb:.1f}")
    print(f"nf4_plus_quant_constants_GB={nf4_plus_constants_gb:.1f}")
    print("paper_device_capacity_GB=48")

QLoRA storage floor output

    bf16_frozen_base_floor_GB=130.0
    nf4_raw_payload_GB=32.5
    nf4_plus_quant_constants_GB=33.5
    paper_device_capacity_GB=48

Activations still scale with sequence length and micro-batch size, so a long-context QLoRA run can still OOM even though the quantized base fits.

QLoRA introduces three techniques that make 4-bit training practical while preserving accuracy in the paper's evaluations:

  1. NF4 (4-bit NormalFloat): NF4 chooses quantile-based values for normally distributed weights. Under the QLoRA paper's assumptions, it is information-theoretically optimal for that distribution.
  2. Double quantization: Quantization requires constants that scale blocks of 4-bit values back into compute precision. QLoRA quantizes these constants as well, reducing the average overhead by about 0.37 bits per parameter, from roughly 0.5 to 0.127 bits per parameter.[3]
  3. Paged optimizers: This feature uses NVIDIA unified memory to smooth the memory spikes that show up during long-sequence minibatches, instead of crashing as soon as optimizer state briefly exceeds GPU memory.
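The double-quantization figures follow from block-size arithmetic. This sketch uses the block sizes described in the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block); the variable names are illustrative.

```python
WEIGHT_BLOCK = 64     # weights per first-level quantization block
CONSTANT_BLOCK = 256  # first-level constants per second-level block

# Without double quantization: one FP32 constant per 64-weight block.
single_quant_bits = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit constants, plus one FP32 constant shared by every 256 of them.
double_quant_bits = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

print(f"single_quant_overhead_bits={single_quant_bits:.3f}")
print(f"double_quant_overhead_bits={double_quant_bits:.3f}")
print(f"saved_bits_per_param={single_quant_bits - double_quant_bits:.3f}")
```

The saving of roughly 0.37 bits per parameter is small per weight but adds up to about 3 GB on a 65B-parameter base.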

QLoRA is a strong candidate when the frozen BF16 or FP16 base model does not fit on available training hardware. It trades memory for quantization complexity, so compare task and regression quality against an affordable higher-precision baseline where possible.

Current PEFT's documented QLoRA-style setup quantizes the base at load time, prepares it for k-bit training, and then adds adapters:[9]

qlora-peft-setup.py
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Store the frozen base in NF4 with double quantization; compute in BF16.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-7B",
        quantization_config=quantization_config,
        dtype=torch.bfloat16,
    )
    # Prepare the quantized model for training, then attach the adapters.
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(
        model,
        LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"),
    )

If you want a merged deployment artifact after QLoRA training, first load the base model in BF16 or FP16, merge the adapter, and then optionally re-quantize for inference.
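The merge step might look like the sketch below. It relies on PEFT's `PeftModel.from_pretrained` and `merge_and_unload`; the function name, the paths, and the deferred imports are illustrative choices. The `dtype=` argument follows the setup shown earlier, and some older Transformers releases expect `torch_dtype` instead, so check your installed version. Re-quantizing the saved checkpoint for inference is a separate step not shown here.

```python
def merge_qlora_adapter(base_id, adapter_dir, output_dir):
    """Merge a trained LoRA adapter into a BF16 copy of the base model and save it."""
    # Imports are deferred so this file can be imported without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the base in BF16, not 4-bit, so the merge happens in higher precision.
    base = AutoModelForCausalLM.from_pretrained(base_id, dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold the adapter into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
    return output_dir


# Example call (paths are placeholders):
# merge_qlora_adapter("Qwen/Qwen2.5-7B", "./qlora-adapter", "./merged-bf16")
```

Evaluate the merged checkpoint against the adapter-enabled training model before shipping, since the base used for the merge is not the 4-bit base the adapter was trained against.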

Multi-adapter serving

A major production advantage of LoRA is that one frozen base model can serve many task-specific adapters.

[Figure: LoRA deployment modes comparing merged single-tenant serving with multi-adapter serving from a shared frozen base model and tenant-specific adapter pool.]
Merge when one adapter owns the deployment. Keep adapters separate when one base model must serve many tenants or tasks without loading many full checkpoints.

In a traditional setup, if you have 10 custom models for 10 different clients, you need to load 10 full copies of the model. With LoRA, you load the frozen base model once and keep per-task adapters separately. Adapter size depends on rank, target modules, dtype, and model width, so calculate it instead of promising a fixed number. A shared base model can serve one adapter for fashion returns, another for electronics warranties, and a third for bulk-shipping refunds while the base weights stay resident.
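Adapter size can be estimated from the factors listed above. This is a rough sketch for a hypothetical 32-layer model with hidden width 4096 and adapters on the query and value projections only; the helper name and the model dimensions are assumptions, and real checkpoints add a small amount of metadata.

```python
def adapter_size_mib(rank, layers, target_shapes, bytes_per_param=2):
    """Estimate LoRA adapter size: each (d_out, d_in) target adds rank * (d_in + d_out) params."""
    params_per_layer = sum(rank * (d_in + d_out) for d_out, d_in in target_shapes)
    total_params = params_per_layer * layers
    return total_params, total_params * bytes_per_param / 1024**2


# Hypothetical model: 32 layers, two 4096 x 4096 targets per layer, BF16 adapters (2 bytes each).
qv_targets = [(4096, 4096), (4096, 4096)]
for rank in (8, 16, 64):
    params, size_mib = adapter_size_mib(rank, layers=32, target_shapes=qv_targets)
    print(f"rank={rank:>2} params={params:,} size_MiB={size_mib:.1f}")
```

Under these assumptions a rank-16 adapter is about 16 MiB. Switching to all-linear targeting or a wider model changes the result, which is why the number should be computed per configuration.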

Two common deployment patterns dominate:

  • Single-tenant: merge once ($W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$) and serve a standalone model with no separate adapter-path overhead.
  • Multi-tenant: keep $W_0$ resident, load adapters on demand, and apply them during inference.

Systems like S-LoRA[10] push this further by keeping many adapters in CPU memory, paging the active ones to GPU memory, and batching requests that target different adapters in the same serving step. That lets one base model serve thousands of tenant-specific adapters without loading thousands of full checkpoints.
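The paging idea can be illustrated with a toy least-recently-used cache. This is not how S-LoRA is implemented; it is a simplified sketch in which a dictionary stands in for CPU memory, a bounded `OrderedDict` stands in for GPU memory, and the class and tenant names are hypothetical.

```python
from collections import OrderedDict


class AdapterCache:
    """Toy LRU cache: a bounded 'GPU' pool backed by a larger 'CPU' store."""

    def __init__(self, cpu_store, gpu_slots):
        self.cpu_store = cpu_store     # adapter_id -> adapter weights for all tenants
        self.gpu_slots = gpu_slots     # how many adapters fit on the device
        self.gpu_pool = OrderedDict()  # resident adapters, least recently used first
        self.loads = 0                 # number of CPU-to-GPU transfers

    def get(self, adapter_id):
        if adapter_id in self.gpu_pool:
            self.gpu_pool.move_to_end(adapter_id)  # mark as most recently used
            return self.gpu_pool[adapter_id]
        if len(self.gpu_pool) >= self.gpu_slots:
            self.gpu_pool.popitem(last=False)      # evict the least recently used adapter
        self.gpu_pool[adapter_id] = self.cpu_store[adapter_id]
        self.loads += 1
        return self.gpu_pool[adapter_id]


cpu_store = {f"tenant-{i}": f"adapter-weights-{i}" for i in range(5)}
cache = AdapterCache(cpu_store, gpu_slots=2)
for request in ["tenant-0", "tenant-1", "tenant-0", "tenant-2", "tenant-1"]:
    cache.get(request)

print("resident:", list(cache.gpu_pool))
print("loads:", cache.loads)
```

Five requests trigger four loads here because only two adapters fit at once. A production server also has to batch requests across adapters and manage memory fragmentation, which this sketch ignores.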

This simplified storage calculation shows why separate adapters matter for multi-tenant deployments:

multi-adapter-storage.py
    base_checkpoint_gb = 14
    adapter_mb = 128
    tenant_count = 50

    # One full checkpoint per tenant versus one shared base plus per-tenant adapters.
    merged_full_models_gb = base_checkpoint_gb * tenant_count
    shared_base_plus_adapters_gb = base_checkpoint_gb + adapter_mb * tenant_count / 1024

    print(f"merged_full_models_GB={merged_full_models_gb:.1f}")
    print(f"shared_base_plus_adapters_GB={shared_base_plus_adapters_gb:.2f}")
    print(f"storage_reduction_x={merged_full_models_gb / shared_base_plus_adapters_gb:.1f}")

Adapter serving storage output

    merged_full_models_GB=700.0
    shared_base_plus_adapters_GB=20.25
    storage_reduction_x=34.6

Advanced: DoRA (weight-decomposed LoRA)

A LoRA variant called DoRA (Weight-Decomposed LoRA)[11] separates each pretrained weight vector into two pieces:

  • a learnable magnitude term (one scalar per output channel)
  • a direction term that's adapted with standard LoRA

Why this works

Standard LoRA learns one additive low-rank update BABABA. That update can change both magnitude and direction, but the same two factors must represent both effects.

DoRA explicitly factors the pretrained weight vector into its magnitude (a learnable scalar per output dimension) and its direction (a unit vector). The direction is then adapted with the usual low-rank LoRA matrices, while the magnitude scalars are trained directly. This separation gives the adapter an extra knob: it can boost or attenuate specific channels without needing to encode that scaling inside the low-rank matrices.

The DoRA paper reports improvements over plain LoRA on its evaluated tasks, especially at low ranks, while aiming to reduce the quality gap to full fine-tuning.[11] The magnitude parameter adds one value per output channel for each adapted matrix. Because magnitude and direction can be recombined into ordinary linear weights after training, a merged single-adapter deployment can still use a dense linear path.
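A small numeric sketch of the recombination step. It assumes a $d_{\text{out}} \times d_{\text{in}}$ weight layout with one norm per output row, matching the one-scalar-per-output-channel description above, and it passes the already-scaled low-rank update in as `delta`. The helper names are illustrative, not PEFT functions.

```python
import math


def row_norms(W):
    """L2 norm of each output row of a (d_out x d_in) matrix."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def dora_weight(W0, delta, magnitude):
    """Normalize each row of (W0 + delta) to unit length, then scale it by its magnitude."""
    adapted = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    norms = row_norms(adapted)
    return [[m * v / n for v in row] for row, m, n in zip(adapted, magnitude, norms)]


W0 = [[3.0, 4.0], [0.0, 2.0]]
magnitude = row_norms(W0)              # initialized from the pretrained weights: [5.0, 2.0]
zero_delta = [[0.0, 0.0], [0.0, 0.0]]  # scaled BA at initialization, since B starts at zero

# At initialization the recombined weight reproduces W0.
print(dora_weight(W0, zero_delta, magnitude))

# Doubling one magnitude rescales that output channel without changing its direction.
print(dora_weight(W0, zero_delta, [10.0, 2.0]))
```

The result is an ordinary dense matrix, which is why a merged single-adapter deployment needs no special runtime path.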

Current PEFT exposes DoRA through use_dora=True on LoraConfig, but its documentation warns of greater runtime overhead than LoRA and recommends evaluating merge or ephemeral-offload options for inference.[4] Treat DoRA as a measured quality/cost experiment, not a free switch.

dora-extra-parameters.py
    d_in = d_out = 4096
    rank = 16
    lora_params = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    dora_magnitude_params = d_out        # one magnitude scalar per output channel

    print("lora_adapter_params=", lora_params)
    print("dora_extra_magnitude_params=", dora_magnitude_params)
    print(f"dora_extra_vs_lora={dora_magnitude_params / lora_params:.2%}")

DoRA parameter output

    lora_adapter_params= 131072
    dora_extra_magnitude_params= 4096
    dora_extra_vs_lora=3.12%

Check your understanding

Before moving on, make sure you can answer these questions without looking back:

  1. Parameter count: For a $4096 \times 4096$ weight matrix, how many trainable parameters does LoRA with $r=16$ introduce? What percentage is that of the full matrix?

  2. Initialization: Why must matrix $B$ start at zero? What would happen if both $A$ and $B$ started with random values?

  3. Rank vs alpha: If you increase rank from $r=8$ to $r=32$ but keep $\alpha=16$ fixed, what happens to the explicit scaling factor $\frac{\alpha}{r}$? Can that factor alone tell you the trained adapter's final update magnitude?

  4. Serving: Your production setup loads one base model and keeps 50 tenant adapters in CPU RAM, paging them to GPU on demand. A colleague suggests merging all adapters into separate full models to simplify the stack. What do you lose?

Solution sketches

  1. $2 \times 16 \times 4096 = 131{,}072$ parameters. That's $131{,}072 / 16{,}777{,}216 \approx 0.78\%$ of the full matrix.

  2. $B$ starts at zero so that $B \cdot A = 0$ at initialization. If both started random, the model would output pretrained predictions plus random noise from the very first forward pass, causing immediate drift.

  3. The explicit scaling factor drops from $16/8 = 2.0$ to $16/32 = 0.5$. Raising $\alpha$ proportionally (for example, to $\alpha=64$) keeps that explicit multiplier fixed, but it doesn't guarantee the same learned update magnitude or quality. Compare rank and alpha with evaluation.

  4. You'd lose the memory savings. Merging 50 adapters into 50 full models means loading 50 complete weight sets instead of one base plus 50 small adapter files. Adapter paging becomes impossible.

Practice checkpoints

What you should be able to defend

By the end of this lesson, you should be able to:

  • State the LoRA update as $\Delta W = BA$ and give the dimensions of $A$, $B$, and $W_0$.
  • Explain why the base weights freeze while only $A$ and $B$ receive gradients.
  • Calculate the parameter savings for rank 16 on a $4096 \times 4096$ projection: 131,072 adapter parameters instead of 16,777,216 full-matrix parameters.
  • Decide when to adapt only attention projections and when to use all-linear targeting across attention and MLP projections.
  • Explain why QLoRA reduces base-weight memory with 4-bit quantization but still leaves activation memory as a training bottleneck.
  • Distinguish LoRA weight adapters from soft prompt tuning and prefix tuning.

Key takeaways

  • Low intrinsic rank: Fine-tuning often updates a small subspace of the model's weights. LoRA exploits this by training low-rank matrices $A$ and $B$.
  • Training efficiency: LoRA often cuts trainable parameters by 100x to 10,000x, depending on rank and target modules.
  • Memory caveat: Biggest savings come from optimizer states and gradients. Activation memory still matters.
  • Injection strategy: Compare all-linear against a narrower baseline; it exposes more block projections at a higher adapter cost.
  • Serving flexibility: Merge for single-tenant serving without a separate adapter path, or keep adapters separate for multi-tenant serving.
  • Modern variants: QLoRA reduces frozen-base memory through quantization. DoRA adds a magnitude/direction parameterization whose quality and overhead must be measured.
Next Step
Continue to Reward Modeling from Preference Data

LoRA and QLoRA showed how to adapt a large model cheaply. Reward modeling is next because preference tuning still needs a signal to optimize against, and one common path is to train a separate scorer that ranks chosen responses above rejected ones before RLHF or other alignment steps.

References

[1] Hu, E. J., et al. LoRA: Low-Rank Adaptation of Large Language Models. 2021. ICLR.
[2] Aghajanyan, A., Gupta, S., & Zettlemoyer, L. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. 2020. ACL 2021.
[3] Dettmers, T., et al. QLoRA: Efficient Finetuning of Quantized Language Models. 2023. NeurIPS.
[4] Hugging Face. PEFT Documentation: LoRA Developer Guide. 2026.
[5] Lester, Al-Rfou & Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. 2021.
[6] Li, X. L. & Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. 2021. ACL 2021.
[7] Hugging Face. Accelerate Documentation: Big Model Inference. 2026.
[8] Kalajdzievski, D. A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. 2023. arXiv preprint.
[9] Hugging Face. PEFT Documentation: Quantization. 2026.
[10] Sheng, Y., et al. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. 2023. arXiv preprint.
[11] Liu, S., et al. DoRA: Weight-Decomposed Low-Rank Adaptation. 2024. ICML 2024.