LoRA & Parameter-Efficient Tuning

Understand the mathematics of Low-Rank Adaptation (LoRA), modern adapter targeting strategies, and the real memory tradeoffs compared to full fine-tuning and QLoRA.

Distributed training makes full-model updates fit by spreading weights, gradients, and optimizer state across GPUs. LoRA starts from a narrower bet: when the behavior change is small, can you avoid updating the full model at all?

Suppose an access-policy assistant already handles routine key reviews, but a new rotation policy changes how stale keys should be escalated. The model doesn't need to relearn language or general support behavior. It needs a targeted update for one slice of behavior. That's the problem LoRA solves.

Full fine-tuning can update every eligible trainable weight, which is slow, expensive, and memory-heavy. LoRA (Low-Rank Adaptation) freezes the pretrained model and adds a small set of trainable adapter weights that modify its behavior. In practice, this often cuts the trainable parameter count by orders of magnitude. In the original GPT-3 175B experiments, LoRA reduced trainable parameters by 10,000x and GPU memory by about 3x relative to full fine-tuning.^{[1]Reference 1LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685}

Why LoRA exists

Full fine-tuning a 70-billion parameter model is expensive because training stores more than model weights. It also needs gradients and optimizer state for every trainable parameter. Start by naming the accounting recipe: the table below assumes low-precision parameters and gradients with two FP32 Adam moment tensors, but no separate FP32 master parameter copy.

Consider the memory breakdown for a 70B model using Adam with FP16 (half-precision) weights:

Component	Memory	Formula
Model weights (FP16)	140 GB	$70\text{B} \times 2$ bytes
Gradients (FP16)	140 GB	$70\text{B} \times 2$ bytes
Adam optimizer $m$ (FP32, single-precision)	280 GB	$70\text{B} \times 4$ bytes
Adam optimizer $v$ (FP32, single-precision)	280 GB	$70\text{B} \times 4$ bytes
Total under this 12-byte recipe	840 GB	Before activations and temporary buffers

This is only the model-state part of training memory. Real runs also pay for activations. A training stack that keeps a separate FP32 master copy adds another 280 GB, producing the 1.12 TB, 16-byte recipe from the distributed-training lesson.

full-finetuning-memory-recipe.py

params_billion = 70
gb_per_byte_per_billion = 1

recipe = {
    "fp16_parameters": 2,
    "fp16_gradients": 2,
    "fp32_adam_moments": 8,
}
without_master = params_billion * sum(recipe.values()) * gb_per_byte_per_billion
with_master = without_master + params_billion * 4

print("bytes_per_parameter_without_master=", sum(recipe.values()))
print("full_finetuning_states_without_master_GB=", without_master)
print("full_finetuning_states_with_master_GB=", with_master)
print("frozen_fp16_base_floor_for_lora_GB=", params_billion * recipe["fp16_parameters"])

Fine-tuning recipe output

bytes_per_parameter_without_master= 12
full_finetuning_states_without_master_GB= 840
full_finetuning_states_with_master_GB= 1120
frozen_fp16_base_floor_for_lora_GB= 140

For most teams, this makes full fine-tuning impractical. LoRA builds on the empirical observation from Aghajanyan et al. that the update directions needed during fine-tuning often live in a surprisingly low-dimensional subspace of the full parameter space.^{[2]Reference 2Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.https://arxiv.org/abs/2012.13255} Most task-specific signal can be captured by changing the weights along a much smaller number of "important directions." LoRA exploits that structure by learning the update as a product of two thin matrices instead of one giant matrix.

What LoRA saves, and what it doesn't

LoRA saves memory mostly by eliminating gradients and optimizer state for the frozen base model. The frozen weights still need to stay in memory, and the adapters carry their own small gradient and optimizer state.

That means activation memory isn't eliminated. Activations (the intermediate tensors computed during the forward pass) still scale with sequence length, batch size, and model depth, so you can still run out of memory on long contexts even with tiny adapters. Freezing weights can change exactly which tensors autograd retains, so don't claim activation bytes are identical without measuring your implementation. For long contexts, evaluate gradient checkpointing, smaller micro-batches, and memory-efficient attention kernels alongside LoRA.

Memory boundary: LoRA removes base-weight gradient and optimizer-state storage, not the need to execute the full base network. A batch of 2,048-token access-review tickets can still be dominated by activations. Measure peak memory under the intended context length and micro-batch size.
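The accounting in this section can be sketched for the 70B planning example. The adapter size (0.2B parameters) and the choice to apply the same 12-byte recipe to adapter weights are illustrative assumptions, not measurements, and activations and temporary buffers are left out entirely.

```python
# Rough model-state accounting for LoRA on the 70B planning example.
# ASSUMPTIONS (hypothetical, for illustration only):
#   - the adapters total 0.2B trainable parameters
#   - adapter weights, gradients, and Adam moments follow the 12-byte recipe
params_billion = 70
adapter_params_billion = 0.2  # hypothetical adapter size

full_ft_gb = params_billion * 12               # weights + gradients + Adam moments
frozen_base_gb = params_billion * 2            # FP16/BF16 frozen weights only
adapter_state_gb = adapter_params_billion * 12  # the adapters' own training state
lora_total_gb = frozen_base_gb + adapter_state_gb

print(f"full_finetuning_states_GB={full_ft_gb:.1f}")
print(f"lora_frozen_base_GB={frozen_base_gb:.1f}")
print(f"lora_adapter_states_GB={adapter_state_gb:.1f}")
print(f"lora_model_states_GB={lora_total_gb:.1f}")
```

Under these assumptions the frozen base dominates the LoRA total, which is why quantizing the base is the next lever once adapter state is already small.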

The visual below switches from the 70B planning example to a 65B reference point so its QLoRA column matches the demonstrated 65B run from the QLoRA paper.^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314}

65B memory evidence comparison: full fine-tuning uses 780 GB of model states under a 12-byte recipe, LoRA has a 130 GB FP16 frozen-base floor, and QLoRA demonstrated a 65B run on one 48 GB GPU. — Under the stated 12-byte full-tuning recipe, LoRA drops base gradient and Adam-state storage while keeping the frozen base weights. Activations and adapter state remain workload dependent.

The overlay intuition

Full fine-tuning rewrites the whole weight matrix. LoRA keeps the base weights frozen and adds a small learned overlay:

base path: frozen pretrained weights ( $W_0$ )
adapter path: low-rank matrices ( $A$ and $B$ )
effective weight: $W_0 + \frac{\alpha}{r}BA$

The useful update can have compact structure even when the base matrix is large.

The math: from intuition to formula

A worked example with small numbers

Start with a concrete example before the general formula. Suppose a weight matrix $W$ has dimensions $4096 \times 4096$ (a typical Transformer projection layer).

Full fine-tuning would train $4096 \times 4096 = 16,777,216$ parameters.

With LoRA rank $r = 16$ (the low-rank dimension of the adapter update):

Matrix $A$ has shape $16 \times 4096$ and has $65,536$ parameters.
Matrix $B$ has shape $4096 \times 16$ and has $65,536$ parameters.
Total LoRA parameters: $131,072$

That's 0.78% of the full matrix size, a 99.2% reduction.

The forward pass changes by adding a low-rank update. For an input vector $x$ , the original output is $W_0 x$ . With LoRA, the output becomes:

$h = W_0 x + \frac{\alpha}{r} B A x$

The first term is the frozen base matrix. The second term is the low-rank overlay, scaled by $\alpha / r$ , where $\alpha$ is a scaling hyperparameter that controls the adapter's strength.

Low-rank decomposition

Instead of learning a full update matrix $\Delta W$ with the same dimensions as the original weights ( $d_{\text{out}} \times d_{\text{in}}$ ), LoRA factorizes it into two smaller matrices:

$\Delta W = B \cdot A$

where:

$A \in \mathbb{R}^{r \times d_{\text{in}}}$ : the first matrix (maps from input to rank $r$ )
$B \in \mathbb{R}^{d_{\text{out}} \times r}$ : the second matrix (maps from rank to output)
$r$ is the rank, a tiny number such as 8, 16, or 64 compared to the model's dimensions (often 4096 or more)^{[1]Reference 1LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685}
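As a small check on the definitions above, the sketch below builds a toy $B$ and $A$ with $r = 2$ and confirms that the product $BA$ has the full $d_{\text{out}} \times d_{\text{in}}$ shape but rank no larger than $r$. The matrices and the helper names (`matmul`, `matrix_rank`) are made up for illustration; exact fractions are used so the rank computation has no floating-point ambiguity.

```python
from fractions import Fraction

def matmul(left, right):
    """Plain nested-list matrix product."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]

def matrix_rank(matrix):
    """Rank via exact Gaussian elimination on Fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot row with a non-zero entry.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank

# Toy sizes (hypothetical): d_out = 4, d_in = 5, r = 2.
B = [[1, 0], [2, 1], [0, 3], [1, -1]]        # shape d_out x r
A = [[1, 2, 0, -1, 3], [0, 1, 1, 2, -2]]     # shape r x d_in

delta_w = matmul(B, A)
print("delta_W shape:", (len(delta_w), len(delta_w[0])))
print("delta_W rank:", matrix_rank(delta_w))
```

The update occupies a full-size matrix, yet it can only move the weights along at most $r$ independent directions.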

This visual shows the split between the frozen base path and the trainable adapter path.

LoRA adapter architecture showing frozen base weight W0 running alongside a low-rank path where A projects from d_in to rank r and B projects from rank r back to d_out. — Trace the two paths. The frozen base W0 runs unchanged. The LoRA path projects input down to rank r, then back up to d_out. Its correction is scaled by α/r before the two paths are added.

Parameter savings: concrete calculation

Verify the savings for a realistic projection layer.

$\text{LoRA params} = r \cdot d_{\text{in}} + d_{\text{out}} \cdot r = r \, (d_{\text{in}} + d_{\text{out}})$

Where $r$ is LoRA rank, and $d_{\text{in}}, d_{\text{out}}$ are matrix dimensions. For a square projection layer with $d_{\text{in}} = d_{\text{out}} = d$ , this simplifies to $2r \cdot d$ .

Approach	Parameters	Ratio	Memory (FP16)
Full fine-tuning	$4096^2 =$ 16.8M	100%	33.6 MB
LoRA ( $r=8$ )	$2 \times 8 \times 4096 =$ 65.5K	0.39%	131 KB
LoRA ( $r=16$ )	$2 \times 16 \times 4096 =$ 131K	0.78%	262 KB
LoRA ( $r=64$ )	$2 \times 64 \times 4096 =$ 524K	3.1%	1 MB

For $r=16$ , the adapter trains less than 1% as many parameters as the full matrix. That's the core tradeoff: you give the task a constrained update space, and in return you remove most gradient and optimizer-state memory for that layer.

parameter-savings-concrete-calculation.py

def lora_parameter_count(d_in: int, d_out: int, rank: int) -> int:
    return rank * d_in + d_out * rank

d_in = d_out = 4096
rank = 16
full_params = d_in * d_out
lora_params = lora_parameter_count(d_in, d_out, rank)
ratio = lora_params / full_params

print(f"full_params={full_params:,}")
print(f"lora_params={lora_params:,}")
print(f"LoRA r={rank} trains {lora_params:,} params, {ratio:.2%} of full fine-tuning.")

Parameter savings output

full_params=16,777,216
lora_params=131,072
LoRA r=16 trains 131,072 params, 0.78% of full fine-tuning.

Initialization strategy

One initialization detail matters: the adapter path should start as a no-op. If both $A$ and $B$ start non-zero, the model adds random noise before training learns anything.

In the original LoRA setup, one factor is initialized randomly and the other is initialized to zero.^{[1]Reference 1LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685} A common pattern is:

$A$ starts with small random values.
$B$ starts at zero.

This makes $B \cdot A = 0$ at the start of training, so $\Delta W = 0$ . The model begins with pretrained behavior and learns the adaptation gradually.

Common Mistake: Initializing both $A$ and $B$ with small random values because "random init worked for the base model." If both matrices start non-zero, the first forward pass outputs pretrained predictions plus random noise, and the model immediately drifts from its checkpoint. Always zero-initialize one factor.
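A minimal sketch of the no-op property, using plain Python lists and toy sizes that are assumptions for illustration. It compares the update $BA$ when $B$ starts at zero against the case where both factors start with small random values.

```python
import random

def matmul(left, right):
    """Plain nested-list matrix product."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]

def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)

def small_random(rows, cols):
    # Small Gaussian values; the 0.02 scale is an arbitrary illustrative choice.
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_in, d_out, r = 6, 4, 2  # toy sizes

A = small_random(r, d_in)
B_zero = [[0.0] * r for _ in range(d_out)]
B_random = small_random(d_out, r)

delta_zero_init = matmul(B_zero, A)      # recommended: one factor starts at zero
delta_both_random = matmul(B_random, A)  # the mistake described above

print("B = 0 gives a zero update:", max_abs(delta_zero_init) == 0.0)
print("both random gives a non-zero update:", max_abs(delta_both_random) > 0.0)
```

With $B = 0$ the first forward pass reproduces the pretrained output exactly; with both factors random, the update is already non-zero before any training step.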

Understanding alpha ( $\alpha$ )

The scaling factor $\frac{\alpha}{r}$ matters. It acts as a multiplier for the adapter's contribution:

$\text{scaling} = \frac{\alpha}{r}$

$r$ (rank): Determines the capacity of the adapter.
$\alpha$ (alpha): Determines the strength of the adapter's signal.

For ordinary LoRA scaling, $\alpha=r$ and $\alpha=2r$ produce multipliers of 1 and 2. Holding $\alpha/r$ constant is useful when you want a rank comparison without also changing this explicit multiplier, but it doesn't guarantee identical learned update magnitude. Treat rank, scaling, learning rate, and target-module coverage as coupled sweep axes rather than universal defaults.

lora-scaling-factor.py

configs = [
    {"rank": 8, "alpha": 16},
    {"rank": 32, "alpha": 16},
    {"rank": 32, "alpha": 64},
]

for config in configs:
    scale = config["alpha"] / config["rank"]
    print(f"r={config['rank']:>2} alpha={config['alpha']:>2} scale={scale:.1f}")

LoRA scaling output

r= 8 alpha=16 scale=2.0
r=32 alpha=16 scale=0.5
r=32 alpha=64 scale=2.0

Where to inject LoRA

When adapting a Transformer with LoRA, you must decide which weight matrices receive adapters. The diagram compares three common coverage choices across attention and feed-forward projections.

Diagram: adapter coverage options. Original LoRA (q_proj, v_proj), all attention projections, and QLoRA-style all linear layers.

The original LoRA paper focused on the Query ( $W_Q$ ) and Value ( $W_V$ ) projections in attention.^{[1]Reference 1LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685} QLoRA found that adapting every linear layer in the Transformer block was needed to match full-fine-tuning results in its experiments.^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314}

Common injection strategies

Target coverage trades adapter cost for capacity. Compare these common strategies:

Strategy	Target modules	When to use
Original LoRA	$W_Q, W_V$ only	Cheapest baseline matching the original paper's attention setup.
All attention	$W_Q, W_K, W_V, W_O$	Attention-only comparison with more trainable capacity.
All-linear	Attention projections + MLP gate/up/down projections	QLoRA-style starting point to compare when quality matters.

Module names vary by architecture. Current Hugging Face PEFT exposes target_modules="all-linear" as the QLoRA-style shortcut, but you should still inspect which modules were matched on your model.^{[4]Reference 4PEFT Documentation: LoRA Developer Guide.https://huggingface.co/docs/peft/main/developer_guides/lora}
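One way to inspect what a shortcut matched is to summarize module names after wrapping. The sketch below assumes PEFT's usual naming, where each wrapped layer gains a `lora_A` submodule, and runs on a hypothetical list of names; the helper `summarize_lora_targets` is not a PEFT function. Verify the naming against your installed version before relying on it.

```python
from collections import Counter

def summarize_lora_targets(module_names):
    """Count wrapped layers per target name, for example {"q_proj": 32}."""
    counts = Counter()
    for name in module_names:
        parts = name.split(".")
        # ASSUMPTION: a wrapped layer appears as "<path>.<target>.lora_A".
        if len(parts) >= 2 and parts[-1] == "lora_A":
            counts[parts[-2]] += 1
    return dict(counts)

# Hypothetical module names standing in for a real model's module tree.
sample_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.q_proj.lora_A",
    "model.layers.0.self_attn.q_proj.lora_A.default",
    "model.layers.0.self_attn.v_proj.lora_A",
    "model.layers.0.mlp.gate_proj.lora_A",
    "model.layers.1.self_attn.q_proj.lora_A",
    "model.layers.0.input_layernorm",
]

print(summarize_lora_targets(sample_names))
# {'q_proj': 2, 'v_proj': 1, 'gate_proj': 1}
```

On a real wrapped model, the names could come from `[name for name, _ in model.named_modules()]`. Compare the resulting counts with the layers you intended to adapt.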

Why adapt MLP layers?

Attention layers ( $W_Q, W_K, W_V, W_O$ ) control how tokens route information to one another. MLP (Multi-Layer Perceptron) layers, also called FFNs (Feed-Forward Networks), perform per-token feature transformations inside the block. All-linear targeting gives the adapter access to both paths, at the price of more trainable parameters; compare it against narrower targeting on held-out examples.

This toy transformer block makes the budget change visible:

target-module-budget.py

dimensions = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
rank = 16

def adapter_params(names):
    return sum(rank * (dimensions[name][0] + dimensions[name][1]) for name in names)

qv = adapter_params(["q_proj", "v_proj"])
all_linear = adapter_params(list(dimensions))
print("qv_adapter_params=", qv)
print("all_linear_adapter_params=", all_linear)
print("all_linear_vs_qv_ratio=", round(all_linear / qv, 2))

Target-module budget output

qv_adapter_params= 262144
all_linear_adapter_params= 1249280
all_linear_vs_qv_ratio= 4.77

Soft prompts and prefix tuning

LoRA isn't the only parameter-efficient fine-tuning method. Two older but still useful PEFT families learn prompt-like state instead of adding low-rank updates inside weight matrices.

Prompt tuning trains a small set of continuous virtual tokens prepended to the input.^{[5]Reference 5The Power of Scale for Parameter-Efficient Prompt Tuning.https://arxiv.org/abs/2104.08691} These aren't human-readable words. They're learned embedding vectors that steer the frozen model toward a task, like adding a small learned control prefix before every access ticket. Prompt tuning is simple and cheap, but it has limited capacity because the adaptation only enters through the context window.

Prefix tuning trains continuous key/value prefixes for transformer layers.^{[6]Reference 6Prefix-Tuning: Optimizing Continuous Prompts for Generation.https://arxiv.org/abs/2101.00190} Instead of changing model weights, it gives each layer learned prefix state that attention can attend to. That makes prefix tuning more expressive than plain soft prompts, because the learned control signal reaches multiple layers directly.

Use this decision rule:

Method	What trains	Best fit
Prompt tuning / soft prompts	Learned input embeddings	Very cheap task steering, classification-like tasks, large base models
Prefix tuning	Learned per-layer prefix states	Generation tasks where a stronger steering signal helps
LoRA / QLoRA	Low-rank weight adapters	Domain adaptation, instruction tuning, tool behavior, stronger behavior changes

Don't call every PEFT method "LoRA." LoRA modifies internal projections through low-rank adapters. Soft prompt tuning and prefix tuning leave those projections frozen and learn continuous prompt-like state instead.
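To make the capacity difference between these families concrete, here is a rough parameter count under stated assumptions. All sizes are hypothetical, attention is assumed to be standard multi-head (key and value width equal to `d_model`), and any reparameterization network used while training prefixes is ignored.

```python
# Hypothetical model and method sizes, for illustration only.
d_model = 4096
num_layers = 32
num_virtual_tokens = 20
lora_rank = 16

# Prompt tuning: one learned embedding vector per virtual token.
prompt_tuning_params = num_virtual_tokens * d_model

# Prefix tuning: a learned key prefix and value prefix in every layer.
prefix_tuning_params = num_layers * 2 * num_virtual_tokens * d_model

# LoRA on q_proj and v_proj: two square projections per layer,
# each costing rank * (d_in + d_out) parameters.
lora_qv_params = num_layers * 2 * lora_rank * (d_model + d_model)

print(f"prompt_tuning_params={prompt_tuning_params:,}")
print(f"prefix_tuning_params={prefix_tuning_params:,}")
print(f"lora_qv_params={lora_qv_params:,}")
```

The ordering matches the decision table above: soft prompts are the cheapest, per-layer prefixes cost more, and weight adapters cost the most of the three under these sizes.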

Implementing LoRA with Hugging Face PEFT

The peft (Parameter-Efficient Fine-Tuning) library handles adapter wrapping for you. You don't need to write the matrix multiplication manually. This code applies LoRA to an official Qwen2.5 base model using the common all-linear starting point for dense decoder models.

This is a real training setup fragment, not a tiny local smoke test. To run it, install torch, transformers, peft, and accelerate, and use a machine that can load the target model. Let your trainer or distributed launcher handle device placement; Accelerate documents device_map="auto" as a big-model inference dispatch path rather than a distributed-training strategy.^{[7]Reference 7Accelerate Documentation: Big Model Inference.https://huggingface.co/docs/accelerate/usage_guides/big_modeling}

implementing-lora-with-hugging-face-peft.py

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    dtype=torch.bfloat16,
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling factor alpha/r = 32/16 = 2.0
    target_modules="all-linear",   # QLoRA-style targeting
    lora_dropout=0.05,             # regularization
    bias="none",                   # don't train biases
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Inspect this output rather than assuming how many modules matched.

When you run print_trainable_parameters(), you see trainable params, all params, and percentage for the exact model and module selection. Record it with each run; a shortcut is only useful if it matched the intended projection layers.

Merging the adapter to remove the adapter path

For a single-adapter deployment, you often merge the adapter back into the base weights. This removes the small runtime cost of the adapter path and produces a single, ordinary model checkpoint for inference engines that support the base architecture.

merging-the-adapter-path.py

# After training completes
merged_model = model.merge_and_unload()
# merged_model is now a standard transformers model with no PEFT adapters
merged_model.save_pretrained("./my-adapted-model")

The merge_and_unload() call computes $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$ for every adapted linear layer and returns a base-model artifact with the adapter update folded into its weights.^{[4]Reference 4PEFT Documentation: LoRA Developer Guide.https://huggingface.co/docs/peft/main/developer_guides/lora} The resulting artifact uses a standard dense layer path: no separate adapter matrix multiply remains, and the adapted behavior is baked into the weights. Keep adapters separate instead when one resident base must switch among many tasks or tenants.

merge-identity.py

base = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]
B = [[0.5], [1.0]]
scale = 2.0
x = [2.0, 1.0]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

adapter_matrix = [
    [scale * B[row][0] * A[0][col] for col in range(2)]
    for row in range(2)
]
merged = [
    [base[row][col] + adapter_matrix[row][col] for col in range(2)]
    for row in range(2)
]
unmerged_output = [
    value + delta
    for value, delta in zip(matvec(base, x), matvec(adapter_matrix, x))
]
print("merged_weights=", merged)
print("outputs_match=", matvec(merged, x) == unmerged_output)

Adapter merge output

merged_weights= [[2.0, 1.0], [5.0, 2.0]]
outputs_match= True

Under the hood: the forward pass

To understand what peft is doing, build a simplified LoRA layer in PyTorch. This custom linear layer replaces a standard nn.Linear module. It takes an input tensor x, computes both the frozen base model projection and the trainable low-rank adaptation, scales the adapter path by $\alpha/r$ , and adds the two outputs.

under-the-hood-the-forward-pass.py

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-augmented linear layer."""
    def __init__(self, base_layer: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.base = base_layer
        for param in self.base.parameters():
            param.requires_grad_(False)  # Freeze the pretrained layer

        d_out, d_in = base_layer.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.02)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base_out = self.base(x)
        lora_out = (x @ self.A.T) @ self.B.T
        return base_out + self.scaling * lora_out

torch.manual_seed(0)
base = nn.Linear(4, 3, bias=False)
layer = LoRALinear(base, r=2, alpha=4)
x = torch.randn(2, 5, 4)

starts_as_noop = bool(torch.allclose(layer(x), base(x), atol=1e-6))
base_trainable = any(param.requires_grad for param in base.parameters())
adapter_trainable = any(param.requires_grad for param in (layer.A, layer.B))

loss = layer(x).sum()
loss.backward()

adapter_B_grad_nonzero = layer.B.grad is not None and layer.B.grad.abs().sum().item() > 0
base_has_grad = any(param.grad is not None for param in base.parameters())

print("LoRA starts as a no-op and trains adapter weights.")
print(f"starts_as_noop={starts_as_noop}")
print(f"base_trainable={base_trainable}")
print(f"adapter_trainable={adapter_trainable}")
print(f"adapter_B_grad_nonzero={adapter_B_grad_nonzero}")
print(f"base_has_grad={base_has_grad}")

LoRA layer output

LoRA starts as a no-op and trains adapter weights.
starts_as_noop=True
base_trainable=False
adapter_trainable=True
adapter_B_grad_nonzero=True
base_has_grad=False

Choosing the right rank

Rank $r$ controls adapter capacity, but module coverage, data quality, and learning rate can matter as much as rank. Holding the target modules fixed, adapter parameters scale linearly with $r$ :

Rank ( $r$ )	Relative adapter parameters	What to compare
4	0.25x of rank 16	Cheap capacity floor
8	0.50x of rank 16	Low-cost candidate
16	1.00x	Reference candidate
64	4.00x of rank 16	Higher-capacity candidate only if evaluation warrants it
256	16.00x of rank 16	Expensive diagnostic, not an assumed improvement

Use a held-out task set and a regression set for capabilities you need to retain. A rank that lowers training loss while degrading general behavior isn't a better adapter.

rank-budget-scaling.py

reference_rank = 16
reference_adapter_parameters = 1_249_280  # toy all-linear block from the earlier example

for rank in [4, 8, 16, 64, 256]:
    params = reference_adapter_parameters * rank // reference_rank
    multiplier = rank / reference_rank
    print(f"r={rank:>3} params={params:>8,} relative_to_r16={multiplier:.2f}x")

Rank budget output

r=  4 params= 312,320 relative_to_r16=0.25x
r=  8 params= 624,640 relative_to_r16=0.50x
r= 16 params=1,249,280 relative_to_r16=1.00x
r= 64 params=4,997,120 relative_to_r16=4.00x
r=256 params=19,988,480 relative_to_r16=16.00x

The LoRA and QLoRA papers report strong results at low ranks in their studied tasks, but that doesn't make one rank universal.^{[1]Reference 1LoRA: Low-Rank Adaptation of Large Language Models.https://arxiv.org/abs/2106.09685}^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314} Establish module coverage first, then sweep rank only as evaluation evidence demands.
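One possible way to turn that guidance into a repeatable rule is sketched below. The scores, thresholds, and the `pick_rank` helper are all hypothetical. The rule shown (drop ranks that regress too far, then take the smallest rank close to the best task score) is one reasonable policy, not a standard.

```python
# Hypothetical evaluation results; replace with measured scores.
results = [
    {"rank": 4, "task_score": 0.71, "regression_score": 0.90},
    {"rank": 8, "task_score": 0.78, "regression_score": 0.90},
    {"rank": 16, "task_score": 0.79, "regression_score": 0.89},
    {"rank": 64, "task_score": 0.80, "regression_score": 0.84},
]
baseline_regression = 0.90   # regression-set score of the unadapted model (assumed)
max_regression_drop = 0.02   # tolerated loss on retained capabilities (assumed)
task_tolerance = 0.02        # "close enough" to the best task score (assumed)

def pick_rank(candidates):
    """Smallest rank near the best task score that does not regress too far."""
    eligible = [c for c in candidates
                if baseline_regression - c["regression_score"] <= max_regression_drop]
    if not eligible:
        return None
    best = max(c["task_score"] for c in eligible)
    near_best = [c for c in eligible if best - c["task_score"] <= task_tolerance]
    return min(near_best, key=lambda c: c["rank"])["rank"]

print("selected_rank=", pick_rank(results))
```

In this made-up data, rank 64 has the best task score but is excluded for degrading the regression set, and rank 8 is chosen over rank 16 because the task scores are within tolerance.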

LoRA vs full fine-tuning: when to choose which

Deciding between full fine-tuning and parameter-efficient methods depends less on raw sample count and more on domain shift, GPU budget, and whether you need one model or many variants.

Diagram: decision flow starting from "Need many variants or cheap iteration?" If yes, evaluate LoRA / QLoRA first for cheap variant iteration; if no, continue down the other branch.

Choosing between LoRA and full fine-tuning (FFT) is a tradeoff between resource efficiency and model capacity.

Raw sample count is a weak proxy. Five thousand long conversations can contain more total tokens than fifty thousand short prompts, and serving requirements often matter more than dataset size anyway.
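A quick arithmetic sketch of that point, with average lengths that are assumptions chosen for illustration:

```python
# Hypothetical average lengths, in tokens per example.
long_conversation_tokens = 5_000 * 4_000   # 5,000 conversations, about 4,000 tokens each
short_prompt_tokens = 50_000 * 200         # 50,000 prompts, about 200 tokens each

print(f"long_conversation_tokens={long_conversation_tokens:,}")
print(f"short_prompt_tokens={short_prompt_tokens:,}")
print("fewer examples, more tokens:", long_conversation_tokens > short_prompt_tokens)
```

Under these lengths, the dataset with ten times fewer examples carries twice as many training tokens.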

If you need many task variants, fast iteration, or cheap deployment, LoRA often still wins even with a fairly large dataset. If the target domain is radically different from pretraining and you have enough memory to update everything, full fine-tuning can warrant its extra cost.

LoRA keeps the original base checkpoint recoverable, but an active adapter can still regress behavior. Because fewer parameters train, LoRA constrains update capacity and makes rollback or adapter disablement straightforward. It doesn't guarantee that responses with the adapter enabled preserve general capabilities. Evaluate both target-task quality and regression behavior.
Full fine-tuning has higher capacity. If you have a very large corpus, enough memory, and a target domain that's radically different from pretraining (for example, moving from web text to highly specialized security-policy documents with custom entity schemas and multi-step approval rules), full fine-tuning lets the model reshape every weight to capture those new patterns, which can justify the extra cost.

Common mistakes and how to fix them

Symptom: Validation loss is flat and the model isn't learning

Cause: You might be adapting only $W_Q$ and $W_V$ when the task needs deeper behavioral changes.
Fix: Compare all-linear targeting. QLoRA reported its full-fine-tuning match when applying adapters to all linear layers in each Transformer block.^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314} Inspect the matched module list so you know what the shortcut selected.

Symptom: Adapter cost rises with rank but held-out quality doesn't

Cause: Rank isn't the limiting axis, or ordinary $\alpha/r$ scaling has made the comparison harder to interpret.
Fix: Hold target coverage fixed and state the scaling policy explicitly while comparing ranks. Check data and module coverage before increasing rank. For high-rank experiments, rsLoRA (rank-stabilized LoRA) changes scaling from $\alpha / r$ to $\alpha / \sqrt{r}$ to improve high-rank optimization behavior.^{[8]Reference 8A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA.https://arxiv.org/abs/2312.03732}^{[4]Reference 4PEFT Documentation: LoRA Developer Guide.https://huggingface.co/docs/peft/main/developer_guides/lora}
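The two scaling rules in this fix can be compared numerically. This sketch only evaluates the multipliers; the function names are illustrative and the fixed `alpha` is an arbitrary choice.

```python
import math

def lora_scale(alpha: float, rank: int) -> float:
    """Ordinary LoRA multiplier: alpha / r."""
    return alpha / rank

def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized multiplier: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)

alpha = 16  # held fixed while rank varies
for rank in [8, 64, 256]:
    print(f"r={rank:>3} lora={lora_scale(alpha, rank):.4f} "
          f"rslora={rslora_scale(alpha, rank):.4f}")
```

With `alpha` fixed, the ordinary multiplier shrinks 32x between rank 8 and rank 256, while the rank-stabilized one shrinks by about 5.7x. That is the sense in which the comparison across ranks gets harder to interpret under ordinary scaling.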

Symptom: You changed rank and now the model overfits or underfits

Cause: Mismatched alpha. If you change $r$ without changing $\alpha$ , the effective scaling changes.
Fix: For an isolated ordinary-LoRA rank comparison, hold $\alpha/r$ constant; for an optimization search, track rank and alpha as separate parameters. Don't present one alpha rule as universal.

Symptom: Inference latency is higher than expected in production

Cause: You're applying an adapter path at runtime in a deployment that only needs one adapter.
Fix: For a single-adapter deployment, evaluate merging after training. The merged weights are $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$ , a standard linear layer with no adapter-path overhead. Keep adapters separate intentionally for multi-adapter serving.

Symptom: A multi-GPU training job uses `device_map="auto"` as its sharding plan

Cause: Inference dispatch and distributed training were confused. device_map="auto" is documented in Accelerate's Big Model Inference path.
Fix: Let the trainer or launcher select supported training placement, and use FSDP, DeepSpeed, or a deliberate QLoRA setup when the model doesn't fit on one training device.^{[7]Reference 7Accelerate Documentation: Big Model Inference.https://huggingface.co/docs/accelerate/usage_guides/big_modeling}

Symptom: Your PEFT notes say "LoRA" but the method trains prompt vectors

Cause: You're mixing separate PEFT families: soft prompt tuning learns input embeddings, prefix tuning learns per-layer prefix state, and LoRA learns low-rank updates inside weight matrices.
Fix: Name the method by what trains; if the learned parameters don't form $A$ and $B$ adapter matrices inside a projection layer, it isn't LoRA.

Symptom: New special tokens never appear in generated output

Cause: You resized the tokenizer but left lm_head frozen under LoRA-only targeting. The adapter can shift hidden states, but a frozen output projection still maps them into the old vocabulary basis.
Fix: Add lm_head (and usually embed_tokens) to modules_to_save, target them explicitly, or use PEFT trainable_token_indices for the new token IDs.^{[4]Reference 4PEFT Documentation: LoRA Developer Guide.https://huggingface.co/docs/peft/main/developer_guides/lora}

Symptom: A shape mismatch in your implementation surfaces only as a late runtime error

Cause: $A$ and $B$ were transposed or initialized inconsistently. For a base matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ , LoRA uses $A \in \mathbb{R}^{r \times d_{\text{in}}}$ and $B \in \mathbb{R}^{d_{\text{out}} \times r}$ , so $BA$ matches $W_0$ .
Fix: Assert dimensions when wrapping modules, and zero-initialize one factor so the adapter path starts as a no-op.
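A minimal sketch of such a check, written against plain shape tuples so it runs without any framework. The helper name `check_lora_shapes` is hypothetical.

```python
def check_lora_shapes(w0_shape, a_shape, b_shape):
    """Validate that B @ A has the same shape as W0; return the rank."""
    d_out, d_in = w0_shape
    rank, a_in = a_shape
    b_out, b_rank = b_shape
    if a_in != d_in:
        raise ValueError(f"A must be (r, d_in): expected d_in={d_in}, got {a_in}")
    if b_out != d_out:
        raise ValueError(f"B must be (d_out, r): expected d_out={d_out}, got {b_out}")
    if b_rank != rank:
        raise ValueError(f"A and B disagree on rank: {rank} vs {b_rank}")
    return rank

# A non-square projection: W0 is (d_out, d_in) = (11008, 4096).
print(check_lora_shapes((11008, 4096), (16, 4096), (11008, 16)))

# Transposed A is caught at wrap time instead of deep inside a training step.
try:
    check_lora_shapes((11008, 4096), (4096, 16), (11008, 16))
except ValueError as error:
    print("rejected:", error)
```

Non-square layers are the useful test case here, because a transposed factor can pass silently when $d_{\text{in}} = d_{\text{out}}$.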

Production sweep starting points

When training LoRA models with libraries such as peft or trl, define a small first sweep and log each axis. These are candidates to test, not production defaults:

Hyperparameter	Initial candidates	Note
Learning Rate	`1e-4`, `2e-4`	Adapter SFT commonly starts above full-weight SFT rates; select on evaluation.
Effective token batch	Measure and hold stable	Compare token throughput and quality, not example count alone.
Rank ( $r$ )	8, 16	Add a higher-rank candidate only when evaluation shows capacity pressure.
Alpha ( $\alpha$ )	Choose an explicit $\alpha/r$ comparison	Record scaling with rank; don't hide it inside a default.
Dropout	0.0, 0.05	Decide from validation behavior and dataset size.
Target Modules	Q/V baseline versus `all-linear`	QLoRA-style coverage costs more adapter state but may improve quality.
Gradient Checkpointing	Off/on comparison if memory is tight	It reduces activation storage by adding recomputation cost.
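The candidate table can be expanded into an explicit run list so every axis is logged. This is a sketch: the dictionary keys are illustrative rather than any trainer's argument names, and holding $\alpha/r$ at 2.0 is an assumed policy for this example.

```python
from itertools import product

learning_rates = [1e-4, 2e-4]
ranks = [8, 16]
dropouts = [0.0, 0.05]
target_sets = ["q_proj,v_proj", "all-linear"]
scale = 2.0  # ASSUMPTION: alpha / r held constant across ranks

sweep = [
    {
        "learning_rate": lr,
        "rank": rank,
        "alpha": scale * rank,  # recorded explicitly, not hidden in a default
        "dropout": dropout,
        "targets": targets,
    }
    for lr, rank, dropout, targets in product(learning_rates, ranks, dropouts, target_sets)
]

print("num_runs=", len(sweep))
print("first_run=", sweep[0])
```

Sixteen runs is already a meaningful budget, so it is reasonable to prune axes (for example, fixing dropout) once early results show which ones matter.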

When monitoring LoRA training, track task evaluation and regression evaluation alongside validation loss. Fewer trainable parameters don't make overfitting impossible. If validation quality falls while training loss decreases, compare dropout, rank, and data cleanup. If the model doesn't learn the task, inspect formatting and target modules before assuming more rank is the fix.

QLoRA: going further

Standard LoRA usually keeps the frozen base model in BF16 or FP16. QLoRA (Quantized LoRA)^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314} takes this a step further by storing the frozen base model in 4-bit and dequantizing on the fly for compute. The trainable LoRA adapters ( $A$ and $B$ ) stay in a higher precision such as BF16. Autograd must still propagate signals through base-layer computation to reach earlier adapters, but it doesn't allocate base-weight gradient or optimizer-state tensors. The 4-bit storage is for weights you aren't training.
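To illustrate "store in 4-bit, dequantize for compute," here is a toy blockwise quantizer. It uses uniform levels with a per-block absmax constant, which is a simplified stand-in and not the NF4 data type; the weights are made-up values.

```python
def quantize_block(values):
    """Uniform 4-bit absmax quantization of one block (a stand-in, NOT NF4)."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    # Map [-absmax, absmax] onto the 16 integer codes 0..15.
    codes = [round((v / absmax + 1.0) / 2.0 * 15) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Recover approximate values from 4-bit codes and the block constant."""
    return [(code / 15 * 2.0 - 1.0) * absmax for code in codes]

weights = [0.12, -0.40, 0.03, 0.25, -0.08, 0.31, -0.19, 0.00]  # hypothetical block
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes=", codes)
print("block_constant=", absmax)
print(f"max_abs_error={max_error:.4f}")
```

Each weight is stored as a 4-bit code plus a share of the block constant, and the frozen layer works with the restored approximation at compute time. The adapters, which stay in higher precision, learn on top of that approximation.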

Use consistent model sizes when comparing numbers. For a 65B model, the raw storage floor and the paper's measured feasibility statement are different kinds of evidence:

Method	Number you can defend	What it means
Full FT under the 12-byte recipe above	780 GB model states	$65\text{B} \times 12$ bytes, before activations
LoRA with a BF16/FP16 frozen base	130 GB base-weight floor	$65\text{B} \times 2$ bytes, plus adapters and runtime memory
QLoRA (NF4 base)	Fine-tuned 65B on one 48 GB GPU in the paper	Whole demonstrated setup, not a universal memory formula^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314}

The raw 4-bit payload is smaller than the actual QLoRA training footprint because quantization metadata, adapters, optimizer state, activations, and temporary buffers remain:

qlora-storage-floor.py

parameters_billion = 65
bf16_base_gb = parameters_billion * 2
nf4_raw_payload_gb = parameters_billion * 4 / 8
double_quant_overhead_bits_per_param = 0.127  # QLoRA paper's post-double-quant estimate
nf4_plus_constants_gb = parameters_billion * (4 + double_quant_overhead_bits_per_param) / 8

print(f"bf16_frozen_base_floor_GB={bf16_base_gb:.1f}")
print(f"nf4_raw_payload_GB={nf4_raw_payload_gb:.1f}")
print(f"nf4_plus_quant_constants_GB={nf4_plus_constants_gb:.1f}")
print("paper_device_capacity_GB=48")

QLoRA storage floor output

bf16_frozen_base_floor_GB=130.0
nf4_raw_payload_GB=32.5
nf4_plus_quant_constants_GB=33.5
paper_device_capacity_GB=48

Activations still scale with sequence length and micro-batch size, so a long-context QLoRA run can still OOM even though the quantized base fits.

QLoRA introduced three techniques that, on the paper's evaluated tasks, matched 16-bit fine-tuning performance while making 4-bit adapter training practical:

NF4 (4-bit NormalFloat): NF4 chooses quantile-based values for normally distributed weights. Under the QLoRA paper's assumptions, it's information-theoretically optimal for that distribution.
Double quantization: Quantization requires constants that scale blocks of 4-bit values back into compute precision. QLoRA quantizes these constants as well, reducing the average overhead by about 0.37 bits per parameter, from roughly 0.5 to 0.127 bits per parameter.^{[3]Reference 3QLoRA: Efficient Finetuning of Quantized Language Models.https://arxiv.org/abs/2305.14314}
Paged optimizers: Optimizer state is allocated in NVIDIA unified memory, so it can be paged out to CPU RAM when the GPU runs short during a memory spike, such as one caused by a long-sequence minibatch, and paged back when needed. The run slows down briefly instead of failing with an out-of-memory error.
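The double-quantization figures can be reproduced with a short calculation. This sketch assumes the configuration described in the QLoRA paper as I understand it: 64 weights per first-level block, 32-bit constants, and 8-bit second-level quantization over groups of 256 constants. Treat those block sizes as assumptions if your library uses different defaults.

```python
block_size = 64               # weights that share one quantization constant
constant_bits = 32            # FP32 constant per block before double quantization
quantized_constant_bits = 8   # each constant after second-level quantization
second_level_block = 256      # constants that share one second-level FP32 constant

# One 32-bit constant amortized over each block of weights.
overhead_before = constant_bits / block_size

# An 8-bit constant per block, plus one 32-bit constant per 256 blocks.
overhead_after = (
    quantized_constant_bits / block_size
    + constant_bits / (block_size * second_level_block)
)

print(f"overhead_before_bits_per_param={overhead_before:.3f}")
print(f"overhead_after_bits_per_param={overhead_after:.3f}")
print(f"saving_bits_per_param={overhead_before - overhead_after:.3f}")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter, which is the 0.127 figure used in the storage-floor calculation earlier in this section.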

QLoRA is a strong candidate when the frozen BF16 or FP16 base model doesn't fit on available training hardware. It buys that memory headroom at the cost of quantization complexity and extra dequantization work, so compare task and regression quality against an affordable higher-precision baseline where possible.

Current PEFT's documented QLoRA-style setup quantizes the base at load time, prepares it for k-bit training, and then adds adapters:^{[9]Reference 9PEFT Documentation: Quantization.https://huggingface.co/docs/peft/main/developer_guides/quantization}

qlora-peft-setup.py

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    quantization_config=quantization_config,
    dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM"),
)

If you want a merged deployment artifact after QLoRA training, first load the base model in BF16 or FP16, merge the adapter, and then optionally re-quantize for inference.

Multi-adapter serving

A major production advantage of LoRA is that one frozen base model can serve many task-specific adapters.

In a traditional setup, if you have 10 custom models for 10 different clients, you need to load 10 full copies of the model. With LoRA, you load the frozen base model once and keep per-task adapters separately. Adapter size depends on rank, target modules, dtype, and model width, so calculate it instead of promising a fixed number. A shared base model can serve one adapter for access reviews, another for key-rotation exceptions, and a third for quota-escalation cases while the base weights stay resident.

Two common deployment patterns dominate:

Single-tenant: merge once ( $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$ ) and serve a standalone model with no separate adapter-path overhead.
Multi-tenant: keep $W_0$ resident, load adapters on demand, and apply them during inference.

Systems like S-LoRA^{[10]Reference 10S-LoRA: Serving Thousands of Concurrent LoRA Adapters.https://arxiv.org/abs/2311.03285} push this further by keeping many adapters in CPU memory, paging the active ones to GPU memory, and batching requests that target different adapters in the same serving step. That lets one base model serve thousands of tenant-specific adapters without loading thousands of full checkpoints.
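The two deployment patterns compute the same function, which a toy example can check. This is a pure-Python sketch with hypothetical 2x3 weights and made-up tenant names. It illustrates per-request adapter routing against one shared $W_0$, not how S-LoRA or any serving system implements batching.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Shared frozen base weight W0 with shape (d_out=2, d_in=3). Values are hypothetical.
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]

# One rank-1 adapter per tenant: A has shape (r, d_in), B has shape (d_out, r).
ADAPTERS = {
    "access-review": {"A": [[1.0, 0.0, 1.0]], "B": [[0.5], [0.0]], "alpha": 2.0, "r": 1},
    "key-rotation": {"A": [[0.0, 1.0, 0.0]], "B": [[0.0], [1.0]], "alpha": 1.0, "r": 1},
}


def forward_unmerged(x, tenant):
    """Multi-tenant path: y = W0 x + (alpha / r) * B (A x), with W0 left untouched."""
    adapter = ADAPTERS[tenant]
    scale = adapter["alpha"] / adapter["r"]
    base = matvec(W0, x)
    delta = matvec(adapter["B"], matvec(adapter["A"], x))
    return [b + scale * d for b, d in zip(base, delta)]


def merge(tenant):
    """Single-tenant path: build W_merged = W0 + (alpha / r) * B A once."""
    adapter = ADAPTERS[tenant]
    scale = adapter["alpha"] / adapter["r"]
    A, B, r = adapter["A"], adapter["B"], adapter["r"]
    return [
        [w + scale * sum(B[i][k] * A[k][j] for k in range(r)) for j, w in enumerate(row)]
        for i, row in enumerate(W0)
    ]


x = [1.0, 2.0, 3.0]
for tenant in ADAPTERS:
    print(tenant, forward_unmerged(x, tenant), matvec(merge(tenant), x))
```

The unmerged path pays two small extra matrix-vector products per request but keeps a single copy of `W0`. The merged path removes that overhead and ties the weights to one tenant.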

This simplified storage calculation shows why separate adapters matter for multi-tenant deployments:

multi-adapter-storage.py

base_checkpoint_gb = 14
adapter_mb = 128
tenant_count = 50

merged_full_models_gb = base_checkpoint_gb * tenant_count
shared_base_plus_adapters_gb = base_checkpoint_gb + adapter_mb * tenant_count / 1024

print(f"merged_full_models_GB={merged_full_models_gb:.1f}")
print(f"shared_base_plus_adapters_GB={shared_base_plus_adapters_gb:.2f}")
print(f"storage_reduction_x={merged_full_models_gb / shared_base_plus_adapters_gb:.1f}")

Adapter serving storage output

merged_full_models_GB=700.0
shared_base_plus_adapters_GB=20.25
storage_reduction_x=34.6

Advanced: DoRA (weight-decomposed LoRA)

A LoRA variant called DoRA (Weight-Decomposed LoRA)^{[11]Reference 11DoRA: Weight-Decomposed Low-Rank Adaptation.https://arxiv.org/abs/2402.09353} separates each pretrained weight vector into two pieces:

a learnable magnitude term (one scalar per output channel)
a direction term that's adapted with standard LoRA

Why this works

Standard LoRA learns one additive low-rank update $BA$ . That update can change both magnitude and direction, but the same two factors must represent both effects.

DoRA explicitly factors the pretrained weight vector into its magnitude (a learnable scalar per output dimension) and its direction (a unit vector). The direction is then adapted with the usual low-rank LoRA matrices, while the magnitude scalars are trained directly. This separation gives the adapter an extra knob: it can boost or attenuate specific channels without needing to encode that scaling inside the low-rank matrices.
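A small numeric sketch shows the recombination. It assumes a weight laid out as (d_out, d_in) with one magnitude per output row, matching the "per output channel" wording here. The DoRA paper's own notation may orient the norm differently, and the function names are hypothetical.

```python
import math


def row_norms(weight):
    """L2 norm of each output row."""
    return [math.sqrt(sum(v * v for v in row)) for row in weight]


def dora_weight(w0, delta, magnitude):
    """Recombine magnitude[i] * (w0 + delta)[i] / ||(w0 + delta)[i]|| for each row i."""
    adapted = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    norms = row_norms(adapted)
    return [[m * v / n for v in row] for row, m, n in zip(adapted, magnitude, norms)]


W0 = [[3.0, 4.0], [0.0, 2.0]]          # hypothetical pretrained weight, shape (d_out, d_in)
zero_delta = [[0.0, 0.0], [0.0, 0.0]]  # the low-rank update BA before any training
magnitude = row_norms(W0)              # magnitudes initialized from the pretrained norms

start = dora_weight(W0, zero_delta, magnitude)
boosted = dora_weight(W0, zero_delta, [2 * magnitude[0], magnitude[1]])

print("start=", start)
print("boosted=", boosted)
```

With a zero update and magnitudes taken from the pretrained norms, the recombined weight equals `W0`, so the adapter starts as a no-op. Doubling one magnitude rescales that output channel without touching its direction, which is the extra knob described above.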

The DoRA paper reports improvements over plain LoRA on its evaluated tasks, especially at low ranks, while aiming to reduce the quality gap to full fine-tuning.^{[11]Reference 11DoRA: Weight-Decomposed Low-Rank Adaptation.https://arxiv.org/abs/2402.09353} The magnitude parameter adds one value per output channel for each adapted matrix. Because magnitude and direction can be recombined into ordinary linear weights after training, a merged single-adapter deployment can still use a dense linear path.

Current PEFT exposes DoRA through `use_dora=True` on `LoraConfig`, but its documentation warns of greater runtime overhead than LoRA and recommends evaluating merge or ephemeral-offload options for inference.^{[4]Reference 4PEFT Documentation: LoRA Developer Guide.https://huggingface.co/docs/peft/main/developer_guides/lora} Treat DoRA as a measured quality/cost experiment, not a free switch.

dora-extra-parameters.py

d_in = d_out = 4096
rank = 16
lora_params = rank * (d_in + d_out)
dora_magnitude_params = d_out

print("lora_adapter_params=", lora_params)
print("dora_extra_magnitude_params=", dora_magnitude_params)
print(f"dora_extra_vs_lora={dora_magnitude_params / lora_params:.2%}")

DoRA parameter output

lora_adapter_params= 131072
dora_extra_magnitude_params= 4096
dora_extra_vs_lora=3.12%

Check your understanding

Answer these from memory before you continue:

Parameter count: For a $4096 \times 4096$ weight matrix, how many trainable parameters does LoRA with $r=16$ introduce? What percentage is that of the full matrix?
Initialization: Why does standard LoRA initialize $B$ to zero while $A$ is random? What would happen if both factors started with random values?
Rank vs alpha: If you increase rank from $r=8$ to $r=32$ but keep $\alpha=16$ fixed, what happens to the explicit scaling factor $\frac{\alpha}{r}$ ? Can that factor alone tell you the trained adapter's final update magnitude?
Serving: Your production setup loads one base model and keeps 50 tenant adapters in CPU RAM, paging them to GPU on demand. A colleague suggests merging all adapters into separate full models to simplify the stack. What do you lose?

Solution sketches

$2 \times 16 \times 4096 = 131{,}072$ parameters. That's $131{,}072 / 16{,}777{,}216 \approx 0.78\%$ of the full matrix.
Standard LoRA uses random $A$ and zero $B$ , so $B \cdot A = 0$ and the adapter starts as a no-op while $B$ can receive a useful first-step gradient. Zero $B$ is a convention, not the only mathematical option: zero $A$ with random $B$ also makes $B \cdot A = 0$ , but changes the optimization dynamics. If both factors started random, the adapter would perturb the pretrained output from the first forward pass.
The explicit scaling factor drops from $16/8 = 2.0$ to $16/32 = 0.5$ . Raising $\alpha$ proportionally (for example, to $\alpha=64$ ) keeps that explicit multiplier fixed, but it doesn't guarantee the same learned update magnitude or quality. Compare rank and alpha with evaluation.
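You'd lose the shared-base memory savings. Merging 50 adapters into 50 full models means storing and loading 50 complete weight sets instead of one base plus 50 small adapter files. Adapter paging becomes impossible, and so does batching requests for different tenants against one resident base.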
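The initialization and scaling answers can be checked numerically. This is a toy sketch with hypothetical shapes and a standard deviation chosen only for illustration. It doesn't reproduce any library's initializer.

```python
import random


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)] for row in left]


def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)


random.seed(0)
d_out, d_in, r = 4, 6, 2   # tiny hypothetical shapes

A = [[random.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]          # random, (r, d_in)
B_zero = [[0.0] * r for _ in range(d_out)]                                       # zero, (d_out, r)
B_random = [[random.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d_out)]   # random, (d_out, r)

zero_start = max_abs(matmul(B_zero, A))      # standard LoRA: update is zero at step 0
random_start = max_abs(matmul(B_random, A))  # both random: base output is perturbed at once

print(f"zero_B_update_max={zero_start}")
print(f"random_B_update_max={random_start:.2e}")

alpha = 16
for rank in (8, 32):
    print(f"r={rank} alpha={alpha} scale={alpha / rank}")
```

The scale values only describe the fixed multiplier. The learned entries of `A` and `B` determine the actual update magnitude after training.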

Practice checkpoints

What to check before moving on

Check that you can:

State the LoRA update as $\Delta W = BA$ and give the dimensions of $A$ , $B$ , and $W_0$ .
Explain why the base weights freeze while only $A$ and $B$ receive gradients.
Calculate the parameter savings for rank 16 on a $4096 \times 4096$ projection: 131,072 adapter parameters instead of 16,777,216 full-matrix parameters.
Decide when to adapt only attention projections and when to use `all-linear` targeting across attention and MLP projections.
Explain why QLoRA reduces base-weight memory with 4-bit quantization but still leaves activation memory as a training bottleneck.
Distinguish LoRA weight adapters from soft prompt tuning and prefix tuning.

Adapter decisions to defend

Fine-tuning updates often lie in a low-dimensional subspace of the model's weights. LoRA exploits this by training low-rank matrices $A$ and $B$ while the pretrained weights stay frozen.
LoRA often cuts trainable parameters by 100x to 10,000x, depending on rank and target modules.
The biggest savings come from optimizer states and gradients. Activation memory still matters.
Compare all-linear against a narrower baseline; it exposes more block projections at a higher adapter cost.
Merge for single-tenant serving without a separate adapter path, or keep adapters separate for multi-tenant serving.
QLoRA reduces frozen-base memory through quantization. DoRA adds a magnitude/direction parameterization whose quality and overhead must be measured.

Practice drill

Create an adapter selection scorecard for one domain adaptation:

Compare full fine-tuning, LoRA, QLoRA, prompt tuning, and prefix tuning on trainable parameters, base memory, activation memory, serving path, and rollback cost.
Calculate adapter parameters for two ranks and two target-module sets.
Log matched modules so `q_proj`/`v_proj` or `all-linear` choices are explicit rather than assumed.
Define held-out task, regression, and latency gates before choosing rank or merge strategy.

The scorecard should prevent a lower training loss from hiding worse regression behavior or serving complexity.
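For the parameter-count row of the scorecard, a calculation along these lines may help. The layer shapes, layer count, and projection names describe a hypothetical decoder and are assumptions, so substitute the shapes you log from your own model.

```python
# Hypothetical decoder shapes; replace with the matched modules from your model.
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32

# (d_in, d_out) for each adapted projection in one transformer block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

TARGET_SETS = {
    "q_v": ["q_proj", "v_proj"],
    "all_linear": list(PROJECTIONS),
}


def adapter_params(rank, targets):
    """LoRA adds rank * (d_in + d_out) parameters per adapted matrix."""
    per_layer = sum(rank * sum(PROJECTIONS[name]) for name in targets)
    return per_layer * LAYERS


for set_name, targets in TARGET_SETS.items():
    for rank in (8, 16):
        params = adapter_params(rank, targets)
        # BF16 adapters use 2 bytes per parameter; optimizer state is extra.
        print(f"targets={set_name} rank={rank} params={params:,} bf16_MB={params * 2 / 1e6:.1f}")
```

Adapter parameters scale linearly with rank, and widening the target set multiplies the count by the added projection widths. That makes the storage and optimizer-state cost of each scorecard row explicit before training starts.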

Next Step

Continue to Reward Modeling from Preference Data

LoRA and QLoRA showed how to adapt a large model cheaply. Reward modeling is next because preference tuning still needs a signal to optimize against, and one common path is to train a separate scorer that ranks chosen responses above rejected ones before RLHF or other alignment steps.


References

[1] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR

[2] Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Aghajanyan, A., Gupta, S., & Zettlemoyer, L. · 2020 · ACL 2021

[3] QLoRA: Efficient Finetuning of Quantized Language Models. Dettmers, T., et al. · 2023 · NeurIPS

[4] PEFT Documentation: LoRA Developer Guide. Hugging Face · 2026

[5] The Power of Scale for Parameter-Efficient Prompt Tuning. Lester, Al-Rfou & Constant · 2021

[6] Prefix-Tuning: Optimizing Continuous Prompts for Generation. Li, X. L. & Liang, P. · 2021 · ACL 2021

[7] Accelerate Documentation: Big Model Inference. Hugging Face · 2026

[8] A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. Kalajdzievski, D. · 2023 · arXiv preprint

[9] PEFT Documentation: Quantization. Hugging Face · 2026

[10] S-LoRA: Serving Thousands of Concurrent LoRA Adapters. Sheng, Y., et al. · 2023 · arXiv preprint

[11] DoRA: Weight-Decomposed Low-Rank Adaptation. Liu, S., et al. · 2024 · ICML 2024
