Understand the mathematics of Low-Rank Adaptation (LoRA), modern adapter targeting strategies, and the real memory tradeoffs compared to full fine-tuning and QLoRA.
Distributed training showed how to make full-model updates fit by spreading weights, gradients, and optimizer state across GPUs. LoRA asks a different question: when the behavior change is narrow, can you avoid updating the full model at all?
Suppose a warehouse support model already handles returns, but a new carrier policy changes how damaged items should be escalated. The model doesn't need to relearn language, invoices, or general support behavior. It needs a targeted update for one slice of behavior.
That is the problem LoRA solves. It adapts a large model without updating every weight. This lesson explains low-rank adapters, where they attach, how they train, and why they are useful for domain-specific behavior.
Full fine-tuning updates every connection in the model, which is slow, expensive, and memory-heavy. LoRA (Low-Rank Adaptation) freezes the pretrained model and adds a small set of trainable adapter weights that modify its behavior. In practice, this often cuts the trainable parameter count by orders of magnitude. In the original GPT-3 175B experiments, LoRA reduced trainable parameters by 10,000x and GPU memory by about 3x relative to full fine-tuning.[1]
Full fine-tuning a 70-billion parameter model is expensive because training stores more than model weights. It also needs gradients and optimizer state for every trainable parameter. Start by naming the accounting recipe: this table assumes low-precision parameters and gradients with two FP32 Adam moment tensors, but no separate FP32 master parameter copy.
Here's the memory breakdown for a 70B model using Adam with FP16 (half-precision) weights:
| Component | Memory | Formula |
|---|---|---|
| Model weights (FP16) | 140 GB | $70\text{B} \times 2$ bytes |
| Gradients (FP16) | 140 GB | $70\text{B} \times 2$ bytes |
| Adam first moment (FP32, single-precision) | 280 GB | $70\text{B} \times 4$ bytes |
| Adam second moment (FP32, single-precision) | 280 GB | $70\text{B} \times 4$ bytes |
| Total under this 12-byte recipe | 840 GB | Before activations and temporary buffers |
This is only the model-state part of training memory. Real runs also pay for activations. A training stack that keeps a separate FP32 master copy adds another 280 GB, producing the 1.12 TB, 16-byte recipe from the distributed-training lesson.
```python
params_billion = 70
gb_per_byte_per_billion = 1

recipe = {
    "fp16_parameters": 2,
    "fp16_gradients": 2,
    "fp32_adam_moments": 8,
}
without_master = params_billion * sum(recipe.values()) * gb_per_byte_per_billion
with_master = without_master + params_billion * 4

print("bytes_per_parameter_without_master=", sum(recipe.values()))
print("full_finetuning_states_without_master_GB=", without_master)
print("full_finetuning_states_with_master_GB=", with_master)
print("frozen_fp16_base_floor_for_lora_GB=", params_billion * recipe["fp16_parameters"])
```

```text
bytes_per_parameter_without_master= 12
full_finetuning_states_without_master_GB= 840
full_finetuning_states_with_master_GB= 1120
frozen_fp16_base_floor_for_lora_GB= 140
```

For most teams, this makes full fine-tuning impractical. LoRA builds on the empirical observation from Aghajanyan et al. that the update directions needed during fine-tuning often live in a surprisingly low-dimensional subspace of the full parameter space.[2] In other words, you don't need to adjust every single weight independently. Most of the task-specific signal can be captured by changing the weights along a much smaller number of "important directions." LoRA exploits exactly that structure by learning the update as a product of two thin matrices instead of one giant matrix.
LoRA saves memory mostly by eliminating gradients and optimizer state for the frozen base model. That is the big win. The frozen weights still need to stay in memory, and the adapters carry their own small gradient and optimizer state.
That means activation memory is not eliminated. Activations (the intermediate tensors computed during the forward pass) still scale with sequence length, batch size, and model depth, so you can still run out of memory on long contexts even with tiny adapters. Freezing weights can change exactly which tensors autograd retains, so do not claim activation bytes are identical without measuring your implementation. For long contexts, evaluate gradient checkpointing, smaller micro-batches, and memory-efficient attention kernels alongside LoRA.
Key Insight: LoRA removes base-weight gradient and optimizer-state storage, not the need to execute the full base network. A batch of 2,048-token warehouse support tickets can still be dominated by activations. Measure peak memory under the intended context length and micro-batch size.
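The accounting above can be sketched for the LoRA case. This is a rough estimate under stated assumptions, not a measurement: the 200M adapter parameter count is a hypothetical figure, the adapter state reuses the 12-byte recipe (low-precision weights and gradients plus two FP32 Adam moments), and activations and temporary buffers are excluded.

```python
# Rough model-state accounting for LoRA on a 70B base model.
# Assumption: 200M adapter parameters is a hypothetical figure, not a measurement.
BYTES_PER_GB = 1e9

def lora_model_state_gb(base_params: float, adapter_params: float) -> dict:
    """Return model-state memory in GB, excluding activations and buffers."""
    frozen_base = base_params * 2          # FP16/BF16 weights; no gradients, no Adam state
    adapter_weights = adapter_params * 2   # low-precision adapter weights
    adapter_grads = adapter_params * 2     # low-precision adapter gradients
    adapter_adam = adapter_params * 8      # two FP32 Adam moment tensors
    return {
        "frozen_base_GB": frozen_base / BYTES_PER_GB,
        "adapter_state_GB": (adapter_weights + adapter_grads + adapter_adam) / BYTES_PER_GB,
    }

state = lora_model_state_gb(base_params=70e9, adapter_params=200e6)
total_gb = state["frozen_base_GB"] + state["adapter_state_GB"]
print(state)
print(f"total_model_state_GB={total_gb:.1f}")
```

Under these assumptions the frozen base dominates the model-state total, which is why activation memory and context length become the next thing to measure.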
The visual below switches from the 70B planning example to a 65B reference point so its QLoRA column matches the demonstrated 65B run from the QLoRA paper.[3]
Think of full fine-tuning as rebuilding an entire fulfillment routing table because you need one new carrier rule. It's expensive, slow, and you might accidentally change routes that already worked.
LoRA is like keeping the original routing table frozen but adding a small overlay for the rows that need adaptation. Requests still flow through the original table, with the overlay adding a compact correction.
The mathematical term for this is low-rank decomposition. Even though a weight matrix is large, the change needed to learn a new task can have compact structure. LoRA represents that update by multiplying two narrow matrices together instead of learning one full-size update matrix.
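To make "low-rank" concrete, here is a minimal pure-Python sketch. The toy factors `B` (6×2) and `A` (2×6) are hypothetical values chosen for illustration; the helper `rank` uses exact fractions so the result does not depend on floating-point tolerance.

```python
from fractions import Fraction

def matmul(X, Y):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def rank(M):
    """Matrix rank by Gaussian elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

# Hypothetical toy factors: B is 6x2 and A is 2x6, so the update B @ A is 6x6.
B = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 2], [3, 1]]
A = [[1, 0, 2, 1, 0, 1], [0, 1, 1, 0, 2, 1]]
delta_w = matmul(B, A)

full_entries = len(delta_w) * len(delta_w[0])
factor_entries = len(B) * len(B[0]) + len(A) * len(A[0])
print("delta_w_shape=", (len(delta_w), len(delta_w[0])))
print("delta_w_rank=", rank(delta_w))
print("full_entries=", full_entries, "factor_entries=", factor_entries)
```

The product fills a full-size matrix, yet its rank can never exceed the inner dimension of the two factors, and the factors store fewer numbers than the full matrix does.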
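Before we write the general formula, work through a concrete example. Suppose a weight matrix $W_0$ has dimensions $4096 \times 4096$ (a typical Transformer projection layer).

Full fine-tuning would train $4096 \times 4096 = 16{,}777{,}216$ parameters.

With LoRA rank $r = 16$ (the low-rank dimension of the adapter update):

$$\underbrace{16 \times 4096}_{A} + \underbrace{4096 \times 16}_{B} = 131{,}072 \text{ parameters}$$

That's 0.78% of the full matrix size, a 99.2% reduction.

Here's how the forward pass changes. For an input vector $x$, the original output is $h = W_0 x$. With LoRA, the output becomes:

$$h = W_0 x + \frac{\alpha}{r} B A x$$

The first term is the frozen base matrix. The second term is the low-rank overlay, scaled by $\frac{\alpha}{r}$, where $\alpha$ is a scaling hyperparameter that controls the adapter's strength.

Instead of learning a full update matrix $\Delta W$ with the same dimensions as the original weights ($d_{\text{out}} \times d_{\text{in}}$), LoRA factorizes it into two smaller matrices:

$$\Delta W = B A$$

where:

- $B \in \mathbb{R}^{d_{\text{out}} \times r}$ projects from the rank-$r$ space back up to the output dimension
- $A \in \mathbb{R}^{r \times d_{\text{in}}}$ projects the input down to the rank-$r$ space
- $r \ll \min(d_{\text{in}}, d_{\text{out}})$ is the adapter rank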
The figure below shows the split between the frozen base path and the trainable adapter path.
Let's verify the savings for a realistic projection layer.

$$\text{LoRA parameters} = r \times (d_{\text{in}} + d_{\text{out}})$$

Here $r$ is the LoRA rank, and $d_{\text{in}}$ and $d_{\text{out}}$ are the matrix dimensions (both equal to $d_{\text{model}}$ for square projection layers).

| Approach | Parameters | Ratio | Memory (FP16) |
|---|---|---|---|
| Full fine-tuning | 16.7M | 100% | 33.5 MB |
| LoRA ($r = 8$) | 65.5K | 0.39% | 131 KB |
| LoRA ($r = 16$) | 131K | 0.78% | 262 KB |
| LoRA ($r = 64$) | 524K | 3.1% | 1 MB |

For $r = 16$, the adapter trains less than 1% as many parameters as the full matrix. That is the core tradeoff: you give the task a constrained update space, and in return you remove most gradient and optimizer-state memory for that layer.
```python
def lora_parameter_count(d_in: int, d_out: int, rank: int) -> int:
    return rank * d_in + d_out * rank

d_in = d_out = 4096
rank = 16
full_params = d_in * d_out
lora_params = lora_parameter_count(d_in, d_out, rank)
ratio = lora_params / full_params

print(f"full_params={full_params:,}")
print(f"lora_params={lora_params:,}")
print(f"LoRA r={rank} trains {lora_params:,} params, {ratio:.2%} of full fine-tuning.")
```

```text
full_params=16,777,216
lora_params=131,072
LoRA r=16 trains 131,072 params, 0.78% of full fine-tuning.
```

This is a subtle but important detail. The adapter path should start as a no-op. If both $A$ and $B$ started with arbitrary non-zero values, the model would begin with random noise added to its predictions and drift away from the pretrained checkpoint.
In the original LoRA setup, one factor is initialized randomly and the other is initialized to zero.[1] A common pattern is:

- $A \sim \mathcal{N}(0, \sigma^2)$ (small random Gaussian values)
- $B = 0$

This ensures that at the very start of training, $\Delta W = BA = 0$, so $h = W_0 x$. The model starts with the same behavior as the pretrained model and gradually learns the adaptation.

Common Mistake: Initializing both $A$ and $B$ with small random values because "random init worked for the base model." If both matrices start non-zero, the first forward pass outputs pretrained predictions plus random noise, and the model immediately drifts from its checkpoint. Always zero-initialize one factor.
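The effect of the two initialization choices can be checked with a small pure-Python sketch. The dimensions, seed, input vector, and the 0.02 standard deviation are illustrative assumptions, not values prescribed by the LoRA paper.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def adapter_delta(A, B, x, scale):
    """Compute scale * B @ (A @ x), the adapter's contribution to the output."""
    return [scale * value for value in matvec(B, matvec(A, x))]

random.seed(0)
d_in, d_out, r = 4, 3, 2
x = [1.0, -2.0, 0.5, 3.0]

# A gets small random values; B is either zero (recommended) or random (the mistake).
A = [[random.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
B_zero = [[0.0] * r for _ in range(d_out)]
B_random = [[random.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d_out)]

# Shape checks: A is r x d_in and B is d_out x r, so B @ A is d_out x d_in.
assert len(A) == r and len(A[0]) == d_in
assert len(B_zero) == d_out and len(B_zero[0]) == r

delta_zero_init = adapter_delta(A, B_zero, x, scale=2.0)
delta_random_init = adapter_delta(A, B_random, x, scale=2.0)

print("zero_init_is_noop=", all(v == 0.0 for v in delta_zero_init))
print("random_init_is_noop=", all(v == 0.0 for v in delta_random_init))
```

With a zero factor the adapter adds nothing to the base output on the first forward pass; with two random factors it perturbs every output before any training has happened.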
The scaling factor matters. It acts as a multiplier for the adapter's contribution:

$$\text{scale} = \frac{\alpha}{r}$$

For ordinary LoRA scaling, $\alpha = r$ and $\alpha = 2r$ produce multipliers of 1 and 2. Holding $\frac{\alpha}{r}$ constant is useful when you want a rank comparison without also changing this explicit multiplier, but it does not guarantee identical learned update magnitude. Treat rank, scaling, learning rate, and target-module coverage as coupled sweep axes rather than universal defaults.
```python
configs = [
    {"rank": 8, "alpha": 16},
    {"rank": 32, "alpha": 16},
    {"rank": 32, "alpha": 64},
]

for config in configs:
    scale = config["alpha"] / config["rank"]
    print(f"r={config['rank']:>2} alpha={config['alpha']:>2} scale={scale:.1f}")
```

```text
r= 8 alpha=16 scale=2.0
r=32 alpha=16 scale=0.5
r=32 alpha=64 scale=2.0
```

When adapting a Transformer with LoRA, you must decide which weight matrices will receive the adapters. The diagram below illustrates the typical structure of a Transformer block, highlighting where adapters are commonly injected during the attention and feed-forward phases.
In the diagram, highlighted weights are LoRA-adapted and muted weights are frozen. The original LoRA paper focused on the Query ($W_q$) and Value ($W_v$) projections in attention.[1] QLoRA found that adapting every linear layer in the Transformer block was needed to match full-fine-tuning results in its experiments.[3]
Depending on the complexity of your task and your compute budget, you can choose to inject LoRA adapters into a subset of the available linear layers. Here are the most common strategies used in practice:
| Strategy | Target modules | When to use |
|---|---|---|
| Original LoRA | $W_q$, $W_v$ only | Cheapest baseline matching the original paper's attention setup. |
| All attention | $W_q$, $W_k$, $W_v$, $W_o$ | Attention-only comparison with more trainable capacity. |
| All-linear | Attention projections + MLP gate/up/down projections | QLoRA-style starting point to compare when quality matters. |
Module names vary by architecture. Current Hugging Face PEFT exposes target_modules="all-linear" as the QLoRA-style shortcut, but you should still inspect which modules were matched on your model.[4]
Attention layers ($W_q$, $W_k$, $W_v$, $W_o$) control how tokens route information to one another. MLP (Multi-Layer Perceptron) layers, also called FFNs (Feed-Forward Networks), perform per-token feature transformations inside the block. All-linear targeting gives the adapter access to both paths, at the price of more trainable parameters; compare it against narrower targeting on held-out examples.
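Because module names vary by architecture, it helps to check what a target list would actually select. The sketch below is a simplified stand-in, not PEFT's matcher: the module names are hypothetical (Llama/Qwen-style), and the rule "match on the final name component" is an assumption made for illustration.

```python
# Hypothetical module names in the style of a Llama/Qwen-like decoder block.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

def matched_modules(names, targets):
    """Assumed rule: a module matches when its final name component is a target."""
    return [name for name in names if name.split(".")[-1] in targets]

strategies = {
    "original_lora": {"q_proj", "v_proj"},
    "all_attention": {"q_proj", "k_proj", "v_proj", "o_proj"},
    "attention_plus_mlp": {
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    },
}

for label, targets in strategies.items():
    matched = matched_modules(module_names, targets)
    print(f"{label}: {len(matched)} modules matched")
    for name in matched:
        print("  ", name)
```

On a real model, compare a listing like this against the adapted modules reported by your library instead of trusting the target list alone.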
This toy transformer block makes the budget change visible:
```python
dimensions = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
rank = 16

def adapter_params(names):
    return sum(rank * (dimensions[name][0] + dimensions[name][1]) for name in names)

qv = adapter_params(["q_proj", "v_proj"])
all_linear = adapter_params(list(dimensions))
print("qv_adapter_params=", qv)
print("all_linear_adapter_params=", all_linear)
print("all_linear_vs_qv_ratio=", round(all_linear / qv, 2))
```

```text
qv_adapter_params= 262144
all_linear_adapter_params= 1249280
all_linear_vs_qv_ratio= 4.77
```

LoRA is not the only parameter-efficient fine-tuning method. Two older but still useful PEFT families learn prompt-like state instead of adding low-rank updates inside weight matrices.
Prompt tuning trains a small set of continuous virtual tokens prepended to the input.[5] These aren't human-readable words. They're learned embedding vectors that steer the frozen model toward a task, like adding a small learned routing slip before every warehouse ticket. Prompt tuning is simple and cheap, but it has limited capacity because the adaptation only enters through the context window.
Prefix tuning trains continuous key/value prefixes for transformer layers.[6] Instead of changing model weights, it gives each layer learned prefix state that attention can attend to. That makes prefix tuning more expressive than plain soft prompts, because the learned control signal reaches multiple layers directly.
Use this decision rule:
| Method | What trains | Best fit |
|---|---|---|
| Prompt tuning / soft prompts | Learned input embeddings | Very cheap task steering, classification-like tasks, large base models |
| Prefix tuning | Learned per-layer prefix states | Generation tasks where a stronger steering signal helps |
| LoRA / QLoRA | Low-rank weight adapters | Domain adaptation, instruction tuning, tool behavior, stronger behavior changes |
Don't call every PEFT method "LoRA." LoRA modifies internal projections through low-rank adapters. Soft prompt tuning and prefix tuning leave those projections frozen and learn continuous prompt-like state instead.
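A back-of-the-envelope count shows how differently the three families spend parameters. Everything here is an illustrative assumption: a hypothetical 32-layer decoder with width 4096, 20 virtual tokens, a prefix length of 20, and rank-16 LoRA on the query and value projections only. The prefix-tuning count ignores any reparameterization network used during training.

```python
# Hypothetical model shape and PEFT settings, for comparison only.
n_layers = 32
d_model = 4096

def prompt_tuning_params(virtual_tokens: int) -> int:
    """Learned input embeddings: one d_model vector per virtual token."""
    return virtual_tokens * d_model

def prefix_tuning_params(prefix_length: int) -> int:
    """Learned key and value prefixes for every layer."""
    return n_layers * 2 * prefix_length * d_model

def lora_qv_params(rank: int) -> int:
    """Rank-r adapters on two square d_model x d_model projections per layer."""
    per_matrix = rank * (d_model + d_model)
    return n_layers * 2 * per_matrix

counts = {
    "prompt_tuning": prompt_tuning_params(virtual_tokens=20),
    "prefix_tuning": prefix_tuning_params(prefix_length=20),
    "lora_qv_r16": lora_qv_params(rank=16),
}
for method, count in counts.items():
    print(f"{method:>14}: {count:>10,} trainable parameters")
```

The counts say nothing about quality on their own; they only show where each method puts its trainable state.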
The peft (Parameter-Efficient Fine-Tuning) library makes this easy. You don't need to write the matrix multiplication manually. The following code demonstrates how to apply LoRA to an official Qwen2.5 base model. It uses the common all-linear starting point for dense decoder models.
This is a real training setup fragment, not a tiny local smoke test. To run it, install torch, transformers, peft, and accelerate, and use a machine that can load the target model. Let your trainer or distributed launcher place trainable models; Accelerate documents device_map="auto" as a big-model inference dispatch path rather than a distributed-training strategy.[7]
```python
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    dtype=torch.bfloat16,
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                         # rank
    lora_alpha=32,                # scaling factor alpha/r = 32/16 = 2.0
    target_modules="all-linear",  # QLoRA-style targeting
    lora_dropout=0.05,            # regularization
    bias="none",                  # don't train biases
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Inspect this output rather than assuming how many modules matched.
```

When you run `print_trainable_parameters()`, you see trainable params, all params, and percentage for the exact model and module selection. Record it with each run; a shortcut is only useful if it matched the intended projection layers.
For a single-adapter deployment, you often merge the adapter back into the base weights. This removes the small runtime cost of the adapter path and produces a single, ordinary model checkpoint you can serve with any inference engine.
```python
# After training completes
merged_model = model.merge_and_unload()
# merged_model is now a standard transformers model with no PEFT adapters
merged_model.save_pretrained("./my-adapted-model")
```

The `merge_and_unload()` call computes $W_{\text{merged}} = W_0 + \frac{\alpha}{r} B A$ for every adapted linear layer and returns a base-model artifact with the adapter update folded into its weights.[4] The resulting artifact uses a standard dense layer path: no separate adapter matrix multiply remains, and the adapted behavior is baked into the weights. Keep adapters separate instead when one resident base must switch among many tasks or tenants.
```python
base = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]
B = [[0.5], [1.0]]
scale = 2.0
x = [2.0, 1.0]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

adapter_matrix = [
    [scale * B[row][0] * A[0][col] for col in range(2)]
    for row in range(2)
]
merged = [
    [base[row][col] + adapter_matrix[row][col] for col in range(2)]
    for row in range(2)
]
unmerged_output = [
    value + delta
    for value, delta in zip(matvec(base, x), matvec(adapter_matrix, x))
]
print("merged_weights=", merged)
print("outputs_match=", matvec(merged, x) == unmerged_output)
```

```text
merged_weights= [[2.0, 1.0], [5.0, 2.0]]
outputs_match= True
```

To understand what `peft` is doing, here is a simplified implementation of a LoRA layer in PyTorch. This custom linear layer wraps a standard `nn.Linear` module. It takes an input tensor `x`, computes both the frozen base projection and the trainable low-rank adaptation, and returns the base output plus the adapter output scaled by `alpha / r`.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-augmented linear layer."""
    def __init__(self, base_layer: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base_layer
        for param in self.base.parameters():
            param.requires_grad_(False)  # Freeze the pretrained layer

        d_out, d_in = base_layer.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.02)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base_out = self.base(x)
        lora_out = (x @ self.A.T) @ self.B.T
        return base_out + self.scaling * lora_out

torch.manual_seed(0)
base = nn.Linear(4, 3, bias=False)
layer = LoRALinear(base, r=2, alpha=4)
x = torch.randn(2, 5, 4)

starts_as_noop = bool(torch.allclose(layer(x), base(x), atol=1e-6))
base_trainable = any(param.requires_grad for param in base.parameters())
adapter_trainable = any(param.requires_grad for param in (layer.A, layer.B))

loss = layer(x).sum()
loss.backward()

adapter_B_grad_nonzero = layer.B.grad is not None and layer.B.grad.abs().sum().item() > 0
base_has_grad = any(param.grad is not None for param in base.parameters())

print("LoRA starts as a no-op and trains adapter weights.")
print(f"starts_as_noop={starts_as_noop}")
print(f"base_trainable={base_trainable}")
print(f"adapter_trainable={adapter_trainable}")
print(f"adapter_B_grad_nonzero={adapter_B_grad_nonzero}")
print(f"base_has_grad={base_has_grad}")
```

```text
LoRA starts as a no-op and trains adapter weights.
starts_as_noop=True
base_trainable=False
adapter_trainable=True
adapter_B_grad_nonzero=True
base_has_grad=False
```

Rank controls adapter capacity, but module coverage, data quality, and learning rate can matter as much as rank. Holding the target modules fixed, adapter parameters scale linearly with $r$:
| Rank ($r$) | Relative adapter parameters | What to compare |
|---|---|---|
| 4 | 0.25x of rank 16 | Cheap capacity floor |
| 8 | 0.50x of rank 16 | Low-cost candidate |
| 16 | 1.00x | Reference candidate |
| 64 | 4.00x of rank 16 | Higher-capacity candidate only if evaluation warrants it |
| 256 | 16.00x of rank 16 | Expensive diagnostic, not an assumed improvement |
Use a held-out task set and a regression set for capabilities you need to retain. A rank that lowers training loss while degrading general behavior is not a better adapter.
```python
reference_rank = 16
reference_adapter_parameters = 1_249_280  # toy all-linear block from the earlier example

for rank in [4, 8, 16, 64, 256]:
    params = reference_adapter_parameters * rank // reference_rank
    multiplier = rank / reference_rank
    print(f"r={rank:>3} params={params:>8,} relative_to_r16={multiplier:.2f}x")
```

```text
r=  4 params= 312,320 relative_to_r16=0.25x
r=  8 params= 624,640 relative_to_r16=0.50x
r= 16 params=1,249,280 relative_to_r16=1.00x
r= 64 params=4,997,120 relative_to_r16=4.00x
r=256 params=19,988,480 relative_to_r16=16.00x
```

The LoRA and QLoRA papers report strong results at low ranks in their studied tasks, but that does not make one rank universal.[1][3] Establish module coverage first, then sweep rank only as evaluation evidence demands.
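One way to turn "sweep rank only as evaluation evidence demands" into a repeatable rule is sketched below. The scores are hypothetical integer points (0 to 100), and the tolerances and the helper name `pick_rank` are illustrative assumptions; substitute your own held-out and regression measurements.

```python
# Hypothetical evaluation results in integer points; replace with real measurements.
results = [
    {"rank": 4, "task_score": 71, "regression_score": 90},
    {"rank": 8, "task_score": 78, "regression_score": 90},
    {"rank": 16, "task_score": 80, "regression_score": 89},
    {"rank": 64, "task_score": 81, "regression_score": 84},
]
baseline_regression = 90   # frozen base model on the regression set (assumed)
max_regression_drop = 2    # points of retained capability you are willing to lose
task_tolerance = 2         # points of task score treated as a tie

def pick_rank(results):
    """Return the smallest rank that keeps regressions in bounds and is near the best task score."""
    acceptable = [
        row for row in results
        if baseline_regression - row["regression_score"] <= max_regression_drop
    ]
    if not acceptable:
        return None
    best_task = max(row["task_score"] for row in acceptable)
    near_best = [row for row in acceptable if best_task - row["task_score"] <= task_tolerance]
    return min(near_best, key=lambda row: row["rank"])

choice = pick_rank(results)
print("selected=", choice)
```

In this made-up sweep, the highest rank has the best task score but fails the regression bound, so the rule falls back to the smallest rank within tolerance of the best acceptable candidate.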
Deciding between full fine-tuning and parameter-efficient methods depends less on raw sample count and more on domain shift, GPU budget, and whether you need one model or many variants.
Choosing between LoRA and full fine-tuning (FFT) is a tradeoff between resource efficiency and model capacity.
Raw sample count is a weak proxy. Five thousand long conversations can contain more total tokens than fifty thousand short prompts, and serving requirements often matter more than dataset size anyway.
If you need many task variants, fast iteration, or cheap deployment, LoRA often still wins even with a fairly large dataset. If the target domain is radically different from pretraining and you have enough memory to update everything, full fine-tuning can warrant its extra cost.
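The point about sample count versus token count is simple arithmetic. The average lengths below are hypothetical; measure real token counts with your own tokenizer.

```python
# Hypothetical dataset shapes; average token lengths are assumptions.
long_conversations = {"examples": 5_000, "avg_tokens": 6_000}
short_prompts = {"examples": 50_000, "avg_tokens": 300}

def total_tokens(dataset: dict) -> int:
    """Approximate training tokens as examples times average length."""
    return dataset["examples"] * dataset["avg_tokens"]

print(f"long_conversations_tokens={total_tokens(long_conversations):,}")
print(f"short_prompts_tokens={total_tokens(short_prompts):,}")
```

With these assumed lengths, the dataset with ten times fewer examples carries twice as many training tokens.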
Cause: You might be adapting only $W_q$ and $W_v$ when the task needs deeper behavioral changes.
Fix: Compare all-linear targeting. QLoRA reported its full-fine-tuning match when applying adapters to all linear layers in each Transformer block.[3] Inspect the matched module list so you know what the shortcut actually selected.
Cause: Rank is not the limiting axis, or ordinary scaling has made the comparison harder to interpret.

Fix: Keep target coverage and the scaling policy explicit while comparing ranks. Check data and module coverage before increasing rank. For high-rank experiments, rsLoRA (rank-stabilized LoRA) changes scaling from $\frac{\alpha}{r}$ to $\frac{\alpha}{\sqrt{r}}$ to improve high-rank optimization behavior.[8][4]

Cause: Mismatched alpha. If you change $r$ without changing $\alpha$, the effective scaling changes.

Fix: For an isolated ordinary-LoRA rank comparison, hold $\frac{\alpha}{r}$ constant; for an optimization search, track rank and alpha as separate parameters. Do not present one alpha rule as universal.

Cause: You are applying an adapter path at runtime in a deployment that only needs one adapter.

Fix: For a single-adapter deployment, evaluate merging after training. The merged weights are $W_0 + \frac{\alpha}{r} B A$, a standard linear layer with no adapter-path overhead. Keep adapters separate intentionally for multi-adapter serving.

`device_map="auto"` as its sharding plan

Cause: Inference dispatch and distributed training were confused. `device_map="auto"` is documented in Accelerate's Big Model Inference path.
Fix: Let the trainer or launcher select supported training placement, and use FSDP, DeepSpeed, or a deliberate QLoRA setup when the model does not fit on one training device.[7]
Cause: You're mixing separate PEFT families. Soft prompt tuning learns input embeddings. Prefix tuning learns per-layer prefix state. LoRA learns low-rank updates inside weight matrices.

Fix: Name the method by what trains. If the learned parameters do not form $A$ and $B$ adapter matrices inside a projection layer, it is not LoRA.

Cause: $A$ and $B$ were transposed or initialized inconsistently. For a base matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, LoRA uses $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$, so $BA$ matches $W_0$.

Fix: Assert dimensions when wrapping modules, and zero-initialize one factor so the adapter path starts as a no-op.
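The difference between ordinary scaling and the rank-stabilized scaling mentioned above is easiest to see numerically. This is a small sketch with an assumed fixed `alpha` of 16; the helper names are illustrative.

```python
import math

def ordinary_scale(alpha: float, rank: int) -> float:
    """Ordinary LoRA multiplier: alpha / r."""
    return alpha / rank

def rank_stabilized_scale(alpha: float, rank: int) -> float:
    """rsLoRA multiplier: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)

alpha = 16  # assumed value, held fixed across the sweep
for rank in [4, 16, 64, 256]:
    print(
        f"r={rank:>3} "
        f"alpha/r={ordinary_scale(alpha, rank):.4f} "
        f"alpha/sqrt(r)={rank_stabilized_scale(alpha, rank):.4f}"
    )
```

With alpha fixed, the ordinary multiplier shrinks in proportion to rank while the rank-stabilized one shrinks only with its square root, so the two policies diverge most at high ranks.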
When training LoRA models with libraries such as peft or trl, define a small first sweep and log each axis. These are candidates to test, not production defaults:
| Hyperparameter | Initial candidates | Note |
|---|---|---|
| Learning Rate | 1e-4, 2e-4 | Adapter SFT commonly starts above full-weight SFT rates; select on evaluation. |
| Effective token batch | Measure and hold stable | Compare token throughput and quality, not only example count. |
| Rank ($r$) | 8, 16 | Add a higher-rank candidate only when evaluation shows capacity pressure. |
| Alpha ($\alpha$) | Choose an explicit comparison | Record scaling with rank; do not hide it inside a default. |
| Dropout | 0.0, 0.05 | Decide from validation behavior and dataset size. |
| Target Modules | Q/V baseline versus all-linear | QLoRA-style coverage costs more adapter state but may improve quality. |
| Gradient Checkpointing | Off/on comparison if memory is tight | It reduces activation storage by adding recomputation cost. |
When monitoring LoRA training, track task evaluation and regression evaluation alongside validation loss. Fewer trainable parameters do not make overfitting impossible. If validation quality falls while training loss decreases, compare dropout, rank, and data cleanup. If the model does not learn the task, inspect formatting and target modules before assuming more rank is the fix.
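The "validation falls while training loss decreases" check can be automated from logged metrics. The log values, the `patience` setting, and the helper name `first_overfit_step` are hypothetical; this sketch uses validation loss as a stand-in for whatever validation quality metric you track.

```python
# Hypothetical metrics logged at each evaluation step.
history = [
    {"step": 100, "train_loss": 1.20, "val_loss": 1.25},
    {"step": 200, "train_loss": 0.95, "val_loss": 1.05},
    {"step": 300, "train_loss": 0.80, "val_loss": 1.02},
    {"step": 400, "train_loss": 0.66, "val_loss": 1.08},
    {"step": 500, "train_loss": 0.55, "val_loss": 1.15},
]

def first_overfit_step(history, patience=2):
    """Return the step where val loss has risen while train loss fell for `patience` evals in a row."""
    streak = 0
    for previous, current in zip(history, history[1:]):
        diverging = (
            current["train_loss"] < previous["train_loss"]
            and current["val_loss"] > previous["val_loss"]
        )
        streak = streak + 1 if diverging else 0
        if streak >= patience:
            return current["step"]
    return None

print("first_overfit_step=", first_overfit_step(history))
```

A flag like this is a prompt to compare dropout, rank, and data cleanup, not an automatic stopping rule.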
Standard LoRA usually keeps the frozen base model in BF16 or FP16. QLoRA (Quantized LoRA)[3] takes this a step further by storing the frozen base model in 4-bit and dequantizing on the fly for compute. The trainable LoRA adapters ($A$ and $B$) stay in a higher precision such as BF16. Autograd must still propagate signals through base-layer computation to reach earlier adapters, but it does not allocate base-weight gradient or optimizer-state tensors. The 4-bit storage is for weights you are not training.
Use consistent model sizes when comparing numbers. For a 65B model, the raw storage floor and the paper's measured feasibility statement are different kinds of evidence:
| Method | Number you can defend | What it means |
|---|---|---|
| Full FT under the 12-byte recipe above | 780 GB model states | $65\text{B} \times 12$ bytes, before activations |
| LoRA with a BF16/FP16 frozen base | 130 GB base-weight floor | $65\text{B} \times 2$ bytes, plus adapters and runtime memory |
| QLoRA (NF4 base) | Fine-tuned 65B on one 48 GB GPU in the paper | Whole demonstrated setup, not a universal memory formula[3] |
The raw 4-bit payload is smaller than the actual QLoRA training footprint because quantization metadata, adapters, optimizer state, activations, and temporary buffers remain:
```python
parameters_billion = 65
bf16_base_gb = parameters_billion * 2
nf4_raw_payload_gb = parameters_billion * 4 / 8
double_quant_overhead_bits_per_param = 0.127  # QLoRA paper's post-double-quant estimate
nf4_plus_constants_gb = parameters_billion * (4 + double_quant_overhead_bits_per_param) / 8

print(f"bf16_frozen_base_floor_GB={bf16_base_gb:.1f}")
print(f"nf4_raw_payload_GB={nf4_raw_payload_gb:.1f}")
print(f"nf4_plus_quant_constants_GB={nf4_plus_constants_gb:.1f}")
print("paper_device_capacity_GB=48")
```

```text
bf16_frozen_base_floor_GB=130.0
nf4_raw_payload_GB=32.5
nf4_plus_quant_constants_GB=33.5
paper_device_capacity_GB=48
```

Activations still scale with sequence length and micro-batch size, so a long-context QLoRA run can still OOM even though the quantized base fits.
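QLoRA introduces three key innovations to make 4-bit training possible without losing accuracy: the 4-bit NormalFloat (NF4) data type, double quantization of the quantization constants, and paged optimizers that absorb memory spikes.[3]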
QLoRA is a strong candidate when the frozen BF16 or FP16 base model does not fit on available training hardware. It trades memory for quantization complexity, so compare task and regression quality against an affordable higher-precision baseline where possible.
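To build intuition for "store in 4-bit, dequantize on the fly", here is a deliberately simplified sketch of blockwise absmax quantization. It uses 16 uniformly spaced levels, which is not NF4 (NF4 places its levels at normal-distribution quantiles), and the weights and block size are hypothetical.

```python
def quantize_block(values, levels=16):
    """Scale a block by its absolute maximum, then round to uniform integer codes."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    step = 2.0 / (levels - 1)
    codes = [round((v / absmax + 1.0) / step) for v in values]  # integers in [0, levels - 1]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    """Map integer codes back to approximate floating-point weights."""
    step = 2.0 / (levels - 1)
    return [(code * step - 1.0) * absmax for code in codes]

weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.22, 0.40]  # hypothetical values
block_size = 4

restored = []
for start in range(0, len(weights), block_size):
    block = weights[start:start + block_size]
    codes, absmax = quantize_block(block)
    assert all(0 <= code <= 15 for code in codes)  # each code fits in 4 bits
    restored.extend(dequantize_block(codes, absmax))

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("restored=", [round(v, 3) for v in restored])
print(f"max_abs_error={max_error:.4f}")
```

Each block stores 4-bit codes plus one scale constant; those per-block constants are the overhead that double quantization compresses further.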
Current PEFT's documented QLoRA-style setup quantizes the base at load time, prepares it for k-bit training, and then adds adapters:[9]
```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"),
)
```

If you want a merged deployment artifact after QLoRA training, first load the base model in BF16 or FP16, merge the adapter, and then optionally re-quantize for inference.
A major production advantage of LoRA is that one frozen base model can serve many task-specific adapters.
In a traditional setup, if you have 10 custom models for 10 different clients, you need to load 10 full copies of the model. With LoRA, you load the frozen base model once and keep per-task adapters separately. Adapter size depends on rank, target modules, dtype, and model width, so calculate it instead of promising a fixed number. A shared base model can serve one adapter for fashion returns, another for electronics warranties, and a third for bulk-shipping refunds while the base weights stay resident.
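Adapter size can be estimated from the same quantities the paragraph lists. This sketch assumes a hypothetical 32-layer decoder whose blocks use the toy projection shapes from the earlier all-linear example; real checkpoints also carry metadata and any extra trained tensors.

```python
BYTES_PER_DTYPE = {"fp32": 4, "bf16": 2, "fp16": 2}

def adapter_size_mib(n_layers, module_shapes, rank, dtype="bf16"):
    """Estimate adapter weight storage in MiB for rank-r adapters on the given modules."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return n_layers * per_layer * BYTES_PER_DTYPE[dtype] / 2**20

# Hypothetical block shapes (d_in, d_out), reused from the toy block example.
all_linear_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
qv_shapes = {name: all_linear_shapes[name] for name in ("q_proj", "v_proj")}

all_linear_mib = adapter_size_mib(32, all_linear_shapes, rank=16)
qv_mib = adapter_size_mib(32, qv_shapes, rank=16)
print(f"all_linear_r16_bf16_MiB={all_linear_mib:.2f}")
print(f"qv_only_r16_bf16_MiB={qv_mib:.2f}")
```

Changing any one of rank, target coverage, dtype, or layer count changes the result, which is why a fixed "adapters are N MB" claim is unreliable.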
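Two common deployment patterns dominate: merging a single adapter into the base weights and serving the result as an ordinary checkpoint, or keeping one shared base model resident and loading separate adapters per task or tenant.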
Systems like S-LoRA[10] push this further by keeping many adapters in CPU memory, paging the active ones to GPU memory, and batching requests that target different adapters in the same serving step. That lets one base model serve thousands of tenant-specific adapters without loading thousands of full checkpoints.
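The core idea of serving mixed-adapter batches can be illustrated with a toy router. This is not how S-LoRA is implemented (it relies on custom kernels and memory paging); the tenant names, rank-1 adapters, and identity base matrix are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

base = [[1.0, 0.0], [0.0, 1.0]]  # shared frozen weights (toy identity matrix)

# Hypothetical tenant adapters stored as (A, B, scale), each with rank 1.
adapters = {
    "fashion_returns": ([[1.0, 0.0]], [[0.5], [0.0]], 2.0),
    "electronics_warranty": ([[0.0, 1.0]], [[0.0], [0.25]], 2.0),
}

def serve(batch):
    """Run every request through the shared base, adding that request's adapter delta."""
    outputs = []
    for adapter_name, x in batch:
        y = matvec(base, x)  # the base path is identical for every tenant
        if adapter_name is not None:
            A, B, scale = adapters[adapter_name]
            delta = matvec(B, matvec(A, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs

batch = [
    ("fashion_returns", [2.0, 1.0]),
    ("electronics_warranty", [2.0, 1.0]),
    (None, [2.0, 1.0]),  # no adapter: plain base model behavior
]
outputs = serve(batch)
for (name, _), y in zip(batch, outputs):
    print(name, y)
```

The same input produces different outputs per tenant while the base weights are stored and applied once, which is the property multi-adapter serving systems exploit at scale.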
This simplified storage calculation shows why separate adapters matter for multi-tenant deployments:
```python
base_checkpoint_gb = 14
adapter_mb = 128
tenant_count = 50

merged_full_models_gb = base_checkpoint_gb * tenant_count
shared_base_plus_adapters_gb = base_checkpoint_gb + adapter_mb * tenant_count / 1024

print(f"merged_full_models_GB={merged_full_models_gb:.1f}")
print(f"shared_base_plus_adapters_GB={shared_base_plus_adapters_gb:.2f}")
print(f"storage_reduction_x={merged_full_models_gb / shared_base_plus_adapters_gb:.1f}")
```

```text
merged_full_models_GB=700.0
shared_base_plus_adapters_GB=20.25
storage_reduction_x=34.6
```

A LoRA variant called DoRA (Weight-Decomposed LoRA)[11] separates each pretrained weight vector into two pieces: a magnitude and a direction.
Standard LoRA learns one additive low-rank update $\Delta W = BA$. That update can change both magnitude and direction, but the same two factors must represent both effects.
DoRA explicitly factors the pretrained weight vector into its magnitude (a learnable scalar per output dimension) and its direction (a unit vector). The direction is then adapted with the usual low-rank LoRA matrices, while the magnitude scalars are trained directly. This separation gives the adapter an extra knob: it can boost or attenuate specific channels without needing to encode that scaling inside the low-rank matrices.
The DoRA paper reports improvements over plain LoRA on its evaluated tasks, especially at low ranks, while aiming to reduce the quality gap to full fine-tuning.[11] The magnitude parameter adds one value per output channel for each adapted matrix. Because magnitude and direction can be recombined into ordinary linear weights after training, a merged single-adapter deployment can still use a dense linear path.
Current PEFT exposes DoRA through use_dora=True on LoraConfig, but its documentation warns of greater runtime overhead than LoRA and recommends evaluating merge or ephemeral-offload options for inference.[4] Treat DoRA as a measured quality/cost experiment, not a free switch.
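The magnitude and direction split can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's or PEFT's implementation: it normalizes each output row of a $d_{\text{out}} \times d_{\text{in}}$ matrix (one magnitude per output channel, as described above), and all matrix values are hypothetical.

```python
import math

def row_norms(matrix):
    """Euclidean norm of each output row."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]

def dora_weight(W0, A, B, magnitude, scale):
    """Rescale each row of (W0 + scale * B @ A) to the learned magnitude for that row."""
    rank = len(A)
    updated = [
        [
            W0[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(len(W0[0]))
        ]
        for i in range(len(W0))
    ]
    norms = row_norms(updated)
    return [[m * v / n for v in row] for row, m, n in zip(updated, magnitude, norms)]

W0 = [[3.0, 4.0], [0.0, 2.0]]   # hypothetical pretrained weights
A = [[1.0, -1.0]]               # rank-1 factor with arbitrary values
B_zero = [[0.0], [0.0]]         # zero-initialized factor
magnitude = row_norms(W0)       # magnitudes start at the pretrained row norms

at_init = dora_weight(W0, A, B_zero, magnitude, scale=2.0)
boosted = dora_weight(W0, A, B_zero, [2 * magnitude[0], magnitude[1]], scale=2.0)

print("at_init=", at_init)
print("boosted_row0=", boosted[0])
```

At initialization the recombined weights equal the pretrained ones, and changing a single magnitude rescales one output channel without touching the low-rank factors.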
```python
d_in = d_out = 4096
rank = 16
lora_params = rank * (d_in + d_out)
dora_magnitude_params = d_out

print("lora_adapter_params=", lora_params)
print("dora_extra_magnitude_params=", dora_magnitude_params)
print(f"dora_extra_vs_lora={dora_magnitude_params / lora_params:.2%}")
```

```text
lora_adapter_params= 131072
dora_extra_magnitude_params= 4096
dora_extra_vs_lora=3.12%
```

Before moving on, make sure you can answer these questions without looking back:
Parameter count: For a $d_{\text{out}} \times d_{\text{in}}$ weight matrix, how many trainable parameters does LoRA with rank $r$ introduce? What percentage is that of the full matrix?

Initialization: Why must matrix $B$ start at zero? What would happen if both $A$ and $B$ started with random values?

Rank vs alpha: If you increase the rank $r$ but keep $\alpha$ fixed, what happens to the explicit scaling factor $\frac{\alpha}{r}$? Can that factor alone tell you the trained adapter's final update magnitude?

Serving: Your production setup loads one base model and keeps 50 tenant adapters in CPU RAM, paging them to GPU on demand. A colleague suggests merging all adapters into separate full models to simplify the stack. What do you lose?

Parameter count: $r \times (d_{\text{in}} + d_{\text{out}})$ parameters. That's $\frac{r (d_{\text{in}} + d_{\text{out}})}{d_{\text{in}} \, d_{\text{out}}}$ of the full matrix. For the lesson's $4096 \times 4096$ example with $r = 16$, that is 131,072 parameters, or 0.78%.

Initialization: $B$ starts at zero so that $\Delta W = BA = 0$ at initialization. If both started random, the model would output pretrained predictions plus random noise from the very first forward pass, causing immediate drift.

Rank vs alpha: The explicit scaling factor $\frac{\alpha}{r}$ shrinks by the same factor that the rank grows. Raising $\alpha$ proportionally keeps that explicit multiplier fixed, but it doesn't guarantee the same learned update magnitude or quality. Compare rank and alpha with evaluation.

Serving: You'd lose the memory savings. Merging 50 adapters into 50 full models means loading 50 complete weight sets instead of one base plus 50 small adapter files. Adapter paging becomes impossible.
By the end of this lesson, you should be able to:
`all-linear` targeting across attention and MLP projections.

`all-linear` against a narrower baseline; it exposes more block projections at a higher adapter cost.

[1] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR

[2] Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Aghajanyan, A., Gupta, S., & Zettlemoyer, L. · 2020 · ACL 2021

[3] QLoRA: Efficient Finetuning of Quantized Language Models. Dettmers, T., et al. · 2023 · NeurIPS

[4] PEFT Documentation: LoRA Developer Guide. Hugging Face · 2026

[5] The Power of Scale for Parameter-Efficient Prompt Tuning. Lester, Al-Rfou & Constant · 2021

[6] Prefix-Tuning: Optimizing Continuous Prompts for Generation. Li, X. L. & Liang, P. · 2021 · ACL 2021

[7] Accelerate Documentation: Big Model Inference. Hugging Face · 2026

[8] A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. Kalajdzievski, D. · 2023 · arXiv preprint

[9] PEFT Documentation: Quantization. Hugging Face · 2026

[10] S-LoRA: Serving Thousands of Concurrent LoRA Adapters. Sheng, Y., et al. · 2023 · arXiv preprint

[11] DoRA: Weight-Decomposed Low-Rank Adaptation. Liu, S., et al. · 2024 · ICML 2024