Model Merging and Weight Interpolation

Learn model merging techniques, from simple weight averaging and task arithmetic to TIES-Merging and DARE, including practical guidance on tokenizer compatibility, mergekit workflows, and evaluation.

Knowledge distillation compressed teacher behavior into a smaller student. Model merging asks a different deployment question: if you already have several useful fine-tuned checkpoints from the same base model, can you combine their weights into one checkpoint without launching another training run?

Your model platform team runs separate assistant checkpoints for code generation, math reasoning, and general chat. All three came from the same base model, but serving them behind a router raises memory and operational costs and complicates rollback. A merge creates a candidate checkpoint without another gradient-training run, but retained behavior remains an evaluation question.

A hard prerequisite for direct tensor interpolation is compatible parameter structure: corresponding tensors must have compatible shapes and meanings. Embedding matrices and language-model heads also need an explicit tokenizer policy. You can't interpolate a 7B checkpoint with a 70B checkpoint, or blindly combine token rows from different vocabularies. Mergekit can construct a union tokenizer and assign fallback embeddings for missing tokens, but that's an explicit output-space choice that still needs evaluation.^{[1] Mergekit: Tools for merging pre-trained large language models, https://github.com/arcee-ai/mergekit} Same-base checkpoints are the conservative starting point for task-vector merging because their parameter coordinates share lineage; compatible shapes alone don't establish merge quality.

Four recipes for combining same-base specialist checkpoints are compared, ranging from simple averaging to sparse, conflict-aware task-vector merging.

Four model merging recipes compared before deployment evaluation. — Model soups average checkpoints, task arithmetic adds scaled task vectors, TIES trims and sign-aligns conflicting updates, and DARE sparsifies task vectors before a downstream merge.

When averaging weights is worth testing

If someone tells you to average the internal numbers of two trained neural networks, skepticism is the right response. Weight averaging is plausible in some fine-tuning settings because nearby checkpoints can occupy a connected low-loss region in parameter space. It isn't safe merely because both endpoints are good models.

Fine-tuning moves a checkpoint from a shared starting point through weight space. Linear mode connectivity studies whether the straight interpolation path between endpoints crosses a high-loss barrier.^{[2] Linear Mode Connectivity and the Lottery Ticket Hypothesis, https://arxiv.org/abs/1912.05671} The Model Soups paper motivates averaging in a specific setting: models fine-tuned from a shared pretrained initialization under different hyperparameters often admit useful averages in its evaluated vision and text-classification experiments.^{[3] Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time, https://arxiv.org/abs/2203.05482} This is evidence for testing nearby same-lineage candidates, not a proof that differently specialized LLM checkpoints share one basin.

In concrete terms, evaluate points on $\theta(\alpha) = (1-\alpha)\theta_A + \alpha\theta_B$ rather than assuming the midpoint is usable. Two checkpoints can each be strong on their own evaluation and still interfere when combined, especially when they represent different tasks. Models with unrelated pretraining lineages are even poorer candidates for direct interpolation because their parameter coordinate systems weren't preserved by a common base.
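As a minimal sketch of how points on this path could be built before any evaluation, the snippet below interpolates two toy checkpoints parameter by parameter. Plain Python lists stand in for tensors, and the `interpolate` helper, the `StateDict` alias, and the toy values are illustrative assumptions rather than part of any library.

```python
StateDict = dict[str, list[float]]  # toy stand-in for a tensor state dict


def interpolate(theta_a: StateDict, theta_b: StateDict, alpha: float) -> StateDict:
    """Return (1 - alpha) * theta_a + alpha * theta_b for every parameter."""
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must expose the same parameter names")
    merged = {}
    for name in theta_a:
        a, b = theta_a[name], theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


theta_a = {"w": [2.0, 4.0], "bias": [1.0]}
theta_b = {"w": [4.0, 2.0], "bias": [3.0]}

# Build one candidate per alpha; each would then go through held-out evaluation.
path = {alpha: interpolate(theta_a, theta_b, alpha) for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)}
for alpha, theta in path.items():
    print(alpha, theta)
```

The name and shape checks fail early on structurally incompatible checkpoints. Passing them says nothing about whether any point on the path is usable; that still has to be measured.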

The screening logic below treats interpolation measurements as evidence, rather than granting the midpoint a pass because the sources share lineage. In a real run, losses and task_scores come from the candidate checkpoints and held-out evaluations:

screen-measured-interpolation-path.py

alphas = [0.00, 0.25, 0.50, 0.75, 1.00]
losses = [0.18, 0.20, 0.61, 0.23, 0.19]
task_scores = [0.88, 0.86, 0.70, 0.85, 0.89]

max_accepted_loss = 0.30
min_accepted_score = 0.84

accepted = [
    alpha
    for alpha, loss, score in zip(alphas, losses, task_scores)
    if loss <= max_accepted_loss and score >= min_accepted_score
]

print("accepted_alphas:", accepted)
print("midpoint_passes:", 0.50 in accepted)

Output

accepted_alphas: [0.0, 0.25, 0.75, 1.0]
midpoint_passes: False

Interpolation diagnostic comparing a measured low-loss path between nearby same-lineage checkpoints with a measured high-loss ridge between incompatible candidates. — Check interpolation loss or downstream quality rather than inferring it from lineage. Shared lineage is a useful precondition, not a pass result.

Same-base lineage: Shared lineage makes a merge experiment defensible. It doesn't establish that a code-generation and math-reasoning merge retained either behavior. Only per-task evaluation does that.

The permutation invariance problem

Neural networks exhibit permutation invariance: reordering a layer's hidden units, together with the matching reordering of their incoming and outgoing weights, leaves the function the network computes unchanged.^{[4] The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks, https://arxiv.org/abs/2110.06296} This helps explain why independently trained networks may be poorly aligned for naive averaging even if they solve similar tasks.

Git Re-Basin^{[5] Git Re-Basin: Merging Models modulo Permutation Symmetries, https://arxiv.org/abs/2209.04836} addresses this by searching for a permutation $\pi$ that aligns the neurons of one model with those of the other before merging:

$\theta_{\text{merged}} = \frac{1}{2}(\theta_A + \pi(\theta_B))$

Permutation invariance makes the failure mode concrete in the next figure. Two checkpoints can learn the same three hidden features but store them in different slot orders, so Git Re-Basin permutes one model before averaging.

Permutation-alignment diagram showing equivalent hidden features before and after slot reordering. — Permutation invariance means slot numbers aren't semantic IDs. Git Re-Basin searches for a permutation that aligns equivalent hidden units in the network architectures studied by its paper.

Git Re-Basin uses permutation matching and reports successful merges of independently trained MLP, CNN, and ResNet models in its studied settings, including a zero-barrier ResNet result on CIFAR-10.^{[5] Git Re-Basin: Merging Models modulo Permutation Symmetries, https://arxiv.org/abs/2209.04836} That paper is a useful alignment concept, but it isn't evidence that an arbitrary pair of large language models can be repaired and merged. For LLM work, same-base candidates plus downstream evaluation remain the practical default here.
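The symmetry itself can be checked on a toy two-layer ReLU network. The sketch below permutes the hidden units of one model, confirms the function is unchanged, and then shows that naive averaging of the two equivalent models changes the output while averaging after undoing the permutation does not. The permutation is known here by construction; this isn't Git Re-Basin's matching procedure, which has to search for the alignment. All helper names and weights are illustrative assumptions.

```python
Matrix = list[list[float]]


def matvec(m: Matrix, v: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def mlp(w1: Matrix, w2: Matrix, x: list[float]) -> list[float]:
    """Two-layer network: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def permute_hidden(w1: Matrix, w2: Matrix, perm: list[int]) -> tuple[Matrix, Matrix]:
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    return [w1[j] for j in perm], [[row[j] for j in perm] for row in w2]


def average(a: Matrix, b: Matrix) -> Matrix:
    return [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
w2 = [[1.0, -2.0, 0.5]]                    # 1 output
x = [1.0, 2.0]

perm = [2, 0, 1]
p1, p2 = permute_hidden(w1, w2, perm)
same_function = mlp(w1, w2, x) == mlp(p1, p2, x)

# Naive averaging mixes hidden units that sit in different slots.
naive = mlp(average(w1, p1), average(w2, p2), x)

# Undo the (known) permutation first, then average.
inverse = [perm.index(i) for i in range(len(perm))]
a1, a2 = permute_hidden(p1, p2, inverse)
aligned = mlp(average(w1, a1), average(w2, a2), x)

print("same_function:", same_function)
print("original:", mlp(w1, w2, x), "naive:", naive, "aligned:", aligned)
```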

The merging toolkit: from simple averages to conflict-aware methods

Several techniques exist for merging models, ranging from mathematical averages to geometric interpolation rules. The choice depends on source lineage, observed delta conflict, and the evaluations the output must pass.

Method	Mechanism	Pros	Cons	Best For
Model Soups / Linear	Uniform or weighted averaging	Simple baseline	Interference between conflicting parameters	Nearby checkpoints with representative evaluation
Task Arithmetic	Weighted task vectors	Separate coefficients per delta	Coefficients don't guarantee separate capabilities survive	Same-base task-vector experiments
TIES-Merging (Trim, Elect Sign, Merge)	Trim and aggregate-sign filtering	Explicitly handles conflicting delta signs	Density and scale need tuning	Conflicting same-base task vectors
DARE (sparsify first)	Random dropping and rescaling of task vectors	Sparse preprocessing evaluated by its paper	Not a complete merge rule on its own	Testing DARE plus a downstream merge
SLERP (Spherical Linear Interpolation)	Spherical interpolation in direction space	Has a norm-preserving geometric interpretation	Geometry alone doesn't establish task quality	Pairwise interpolation experiment

Model soups (uniform averaging)

The simplest approach to merging is called Model Soups.^{[3] Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time, https://arxiv.org/abs/2203.05482} It averages the weights of multiple fine-tuned models to create a single averaged model:

$\theta_{\text{merged}} = \frac{1}{N} \sum_{i=1}^{N} \theta_i$

To see the operation, imagine two scalars instead of billion-parameter tensors. If checkpoint A has a weight 2.0 and checkpoint B has 4.0, their equal average is 3.0. Whether that compromise retains behavior can't be determined from this parameter alone. Model Soups evaluates averaging models fine-tuned from a shared initialization over hyperparameter configurations and reports improved accuracy and robustness in its studied settings without the inference cost of an ensemble.^{[3] Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time, https://arxiv.org/abs/2203.05482}

Uniform vs. greedy soups

The original Model Soups paper distinguishes between two selection strategies:

Uniform Soups: Average all fine-tuned checkpoints with equal weight. Simple but risky, as a single poorly performing checkpoint can drag down the overall quality.
Greedy Soups: Iteratively add checkpoints to the soup only if they improve performance on a held-out validation set. Start with the best individual model, then test each remaining model: if adding it to the average improves the validation metric, keep it; otherwise, discard it.

In the original paper, greedy soups outperform uniform averaging in the reported experiments because the selection rule skips candidates that hurt its held-out validation metric.^{[3] Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time, https://arxiv.org/abs/2203.05482} A similar selection rule is reasonable to test only when your validation slices represent the behavior you need to keep.
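The greedy selection rule described above can be sketched as follows. This is a toy illustration under stated assumptions: lists stand in for tensors, `evaluate` is a hypothetical held-out validation metric (here, negative squared distance to a made-up target), and "improves" is read as a strict increase. Swap `>` for `>=` if ties should be accepted.

```python
from typing import Callable

StateDict = dict[str, list[float]]


def average_state_dicts(models: list[StateDict]) -> StateDict:
    """Uniform average of every parameter across the given models."""
    n = len(models)
    return {k: [sum(vals) / n for vals in zip(*(m[k] for m in models))] for k in models[0]}


def greedy_soup(
    candidates: list[StateDict],
    evaluate: Callable[[StateDict], float],
) -> tuple[StateDict, list[int]]:
    """Start from the best single model; add a candidate only if the soup improves."""
    order = sorted(range(len(candidates)), key=lambda i: evaluate(candidates[i]), reverse=True)
    kept = [order[0]]
    best_score = evaluate(candidates[order[0]])
    for i in order[1:]:
        trial = average_state_dicts([candidates[j] for j in kept + [i]])
        score = evaluate(trial)
        if score > best_score:  # strict improvement on the validation metric
            kept.append(i)
            best_score = score
    return average_state_dicts([candidates[j] for j in kept]), kept


# Hypothetical validation metric: higher is better, best at w == 3.0.
target = [3.0]
evaluate = lambda sd: -sum((w - t) ** 2 for w, t in zip(sd["w"], target))

candidates = [{"w": [2.5]}, {"w": [4.0]}, {"w": [9.0]}]
soup, kept = greedy_soup(candidates, evaluate)
uniform = average_state_dicts(candidates)

print("kept indices:", kept)
print("greedy soup:", soup, "score:", evaluate(soup))
print("uniform soup:", uniform, "score:", evaluate(uniform))
```

In this toy run the third checkpoint is skipped because adding it lowers the validation score, while the uniform soup has to absorb it.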

uniform_merge performs uniform averaging. It takes a list of model state dictionaries and an optional list of weights, and returns a single state dictionary containing the weighted average of their parameters:

uniform-vs-greedy-soups.py

import torch

def uniform_merge(
    models: list[dict[str, torch.Tensor]],
    weights: list[float] | None = None,
) -> dict[str, torch.Tensor]:
    """Merge models by weighted averaging of parameters."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)

    if abs(sum(weights) - 1.0) >= 1e-6:
        raise ValueError("Weights must sum to 1")

    merged = {}
    for key in models[0].keys():
        merged[key] = sum(w * m[key] for w, m in zip(weights, models))

    return merged

code_model = {"w": torch.tensor([2.0, 4.0]), "bias": torch.tensor([1.0])}
math_model = {"w": torch.tensor([4.0, 2.0]), "bias": torch.tensor([3.0])}
merged = uniform_merge([code_model, math_model])

print("weights_ok:", bool(torch.allclose(merged["w"], torch.tensor([3.0, 3.0]))))
print("bias_ok:", bool(torch.allclose(merged["bias"], torch.tensor([2.0]))))
print("merged w:", merged["w"].tolist())
print("merged bias:", merged["bias"].tolist())

Output

weights_ok: True
bias_ok: True
merged w: [3.0, 3.0]
merged bias: [2.0]

Pros

Exceptionally simple to implement and requires no hyperparameters beyond the optional weighting scheme.

Cons

If the models are fine-tuned on wildly different tasks, direct averaging can cause destructive interference between conflicting parameters, degrading overall performance.

Task arithmetic

Instead of averaging absolute weights, Task Arithmetic^{[6] Editing Models with Task Arithmetic, https://arxiv.org/abs/2212.04089} operates on task vectors (the difference between fine-tuned weights and the base model). Task vectors are specific directions from the base checkpoint (e.g., "walk 10 steps north"). Instead of averaging the final destinations of different hikers, you take the path each hiker took from base camp and combine those paths. By scaling them (e.g., "take half the steps north"), you control how much of each fine-tune's change enters the merge.

$\tau_i = \theta_i - \theta_0 \quad \text{(task vector for model } i \text{)}$

$\theta_{\text{merged}} = \theta_0 + \sum_{i=1}^{N} \lambda_i \cdot \tau_i$

Where $\theta_0$ is the base model, $\theta_i$ are fine-tuned models, $\tau_i$ are per-task deltas, and $\lambda_i$ are merge weights.

A concrete scalar walkthrough

Before you run this on billion-parameter tensors, try it on a single parameter. Imagine a base model where one weight is 2.0. A code-generation fine-tune pushes that weight to 2.5, so its task vector is +0.5. A math-reasoning fine-tune pushes it to 1.5, so its task vector is -0.5. If you want to favor the code delta without discarding the math delta, you might set $\lambda_{\text{code}} = 0.6$ and $\lambda_{\text{math}} = 0.4$:

$\text{merged weight} = 2.0 + 0.6(0.5) + 0.4(-0.5) = 2.0 + 0.3 - 0.2 = 2.1$

The result, 2.1, is a parameter nudge in the code-generation direction. If you had instead averaged the absolute fine-tuned weights (2.5 and 1.5), you'd get 2.0, cancelling both deltas at this coordinate. A scalar calculation can't establish how either capability performs; it only shows how task-vector coefficients act on weights.

A different coefficient pair shows the same tuning choice. Suppose, from the same base weight of 2.0, a code-generation fine-tune nudges the weight to 2.4 (task vector +0.4) and a math-reasoning fine-tune nudges it to 1.8 (task vector -0.2). Setting $\lambda_{\text{code}} = 0.7$ and $\lambda_{\text{math}} = 0.5$ gives $2.0 + 0.7(0.4) + 0.5(-0.2) = 2.0 + 0.28 - 0.10 = 2.18$. Whether this coefficient pair keeps either capability is measured after the merge.

task_arithmetic_merge implements the recipe in code. It takes a shared base model, a list of fine-tuned models, and scaling coefficients. It subtracts the base to get task vectors, applies a scaling coefficient, and adds them back to return the merged model:

a-concrete-scalar-walkthrough.py

import torch

def task_arithmetic_merge(
    base_model: dict[str, torch.Tensor],
    fine_tuned_models: list[dict[str, torch.Tensor]],
    scaling_coefficients: list[float]
) -> dict[str, torch.Tensor]:
    """Merge via task vectors (differences from base model).

    Args:
        base_model: The shared base model weights
        fine_tuned_models: List of fine-tuned model weights
        scaling_coefficients: Per-task scaling factors (lambda_i)
    """
    merged = {k: v.clone() for k, v in base_model.items()}

    for model, coeff in zip(fine_tuned_models, scaling_coefficients):
        for key in merged:
            task_vector = model[key] - base_model[key]
            merged[key] += coeff * task_vector

    return merged

shared_base = {"w": torch.tensor([2.0])}
code_ft = {"w": torch.tensor([2.5])}
math_ft = {"w": torch.tensor([1.5])}

# Example: 0.6x code delta + 0.4x math delta
merged = task_arithmetic_merge(
    base_model=shared_base,
    fine_tuned_models=[code_ft, math_ft],
    scaling_coefficients=[0.6, 0.4],
)

print("base:", float(shared_base["w"]))
print("matches_expected:", bool(torch.allclose(merged["w"], torch.tensor([2.1]))))
print("merged:", round(float(merged["w"]), 2))

Output

base: 2.0
matches_expected: True
merged: 2.1

Advantage

Scaling coefficients control how much of each task vector enters the candidate checkpoint; evaluation determines retained behavior.

TIES-Merging (trim, elect sign, merge)

A major issue with simple averaging is that task vectors can directly contradict one another (e.g., one model increases a weight by 0.5, while another decreases it by 0.5). TIES-Merging^{[7] TIES-Merging: Resolving Interference When Merging Models, https://arxiv.org/abs/2306.01708} addresses two forms of interference in its recipe: it trims low-magnitude deltas, elects a sign for each coordinate using total signed movement, and merges only values aligned with that elected sign.

TIES resolves a coordinate as if three specialists proposed parameter deltas for it. First, discard small proposals (Trim). Next, sum the size of increases against the size of decreases, rather than counting voters (Elect sign). Finally, average only proposals in the winning direction (Disjoint merge).

TIES-Merging flow from raw task vectors to trimmed updates, aggregate-magnitude sign election, and a merged aligned delta. — TIES handles interference coordinate by coordinate: trim low-magnitude updates, elect the direction with greater aggregate movement, then average aligned deltas.

Step 1: Trim

First, TIES drops low-magnitude changes, which the recipe treats as redundant. The trim function below takes a task vector and a density (the fraction of entries to keep), then zeroes every entry whose magnitude falls below the corresponding quantile:

step-1-trim.py

import torch

def trim(task_vector: torch.Tensor, density: float = 0.2) -> torch.Tensor:
    """Keep only the top-k% of parameter changes by magnitude."""
    threshold = torch.quantile(task_vector.abs(), 1 - density)
    mask = task_vector.abs() >= threshold
    return task_vector * mask

task_vector = torch.tensor([0.1, -0.5, 0.02, 0.8])
trimmed = trim(task_vector, density=0.5)

print("matches_expected:", bool(torch.allclose(trimmed, torch.tensor([0.0, -0.5, 0.0, 0.8]))))
print("trimmed:", [round(float(x), 2) for x in trimmed])

Output

matches_expected: True
trimmed: [0.0, -0.5, 0.0, 0.8]

Step 2: Elect sign

When multiple task vectors update the same parameter, TIES-Merging resolves conflicting directions with $\gamma_m^p = \mathrm{sgn}(\sum_t \hat{\tau}_t^p)$.^{[7] TIES-Merging: Resolving Interference When Merging Models, https://arxiv.org/abs/2306.01708} The elect_sign function sums the trimmed signed deltas and takes the resulting sign. This differs from a majority vote: one large update can outweigh two smaller opposing updates.

step-2-elect-sign.py

import torch

def elect_sign(trimmed_vectors: list[torch.Tensor]) -> torch.Tensor:
    """Elect sign with greatest aggregate signed movement per parameter."""
    aggregate_delta = sum(trimmed_vectors)
    return torch.sign(aggregate_delta)

trimmed_vectors = [
    torch.tensor([0.0, -0.5, 0.0, 0.8]),
    torch.tensor([0.0, 1.1, 0.0, -0.6]),
    torch.tensor([0.0, -0.4, 0.0, 0.7]),
]
elected = elect_sign(trimmed_vectors)
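negative_votes_at_p2 = sum(int(tv[1] < 0) for tv in trimmed_vectors)  # count, don't hard-code
print("two_negative_votes_at_p2:", negative_votes_at_p2 == 2)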
print("positive_mass_wins_at_p2:", bool(elected[1] == 1))
print("matches_expected:", bool(torch.allclose(elected, torch.tensor([0.0, 1.0, 0.0, 1.0]))))
print("elected signs:", elected.tolist())

Output

two_negative_votes_at_p2: True
positive_mass_wins_at_p2: True
matches_expected: True
elected signs: [0.0, 1.0, 0.0, 1.0]

Step 3: Disjoint merge

Finally, TIES averages, coordinate by coordinate, only the trimmed values whose sign matches the elected sign. The disjoint_merge function takes trimmed task vectors and their elected signs, ignores zeros and dissenting values, and returns the final merged task vector:

step-3-disjoint-merge.py

import torch

def disjoint_merge(
    trimmed_vectors: list[torch.Tensor],
    elected_signs: torch.Tensor
) -> torch.Tensor:
    """Average only non-zero values that agree with the elected sign."""
    merged = torch.zeros_like(trimmed_vectors[0])
    counts = torch.zeros_like(trimmed_vectors[0])

    for tv in trimmed_vectors:
        agree = (tv != 0) & (elected_signs != 0) & (torch.sign(tv) == elected_signs)
        merged += torch.where(agree, tv, 0)
        counts += agree.float()

    return torch.where(
        counts > 0,
        merged / counts.clamp(min=1),
        torch.zeros_like(merged)
    )

trimmed_vectors = [
    torch.tensor([0.0, -0.5, 0.0, 0.8]),
    torch.tensor([0.0, 1.1, 0.0, -0.6]),
    torch.tensor([0.0, -0.4, 0.0, 0.7]),
]
elected_signs = torch.tensor([0.0, 1.0, 0.0, 1.0])
merged = disjoint_merge(trimmed_vectors, elected_signs)

print("matches_expected:", bool(torch.allclose(merged, torch.tensor([0.0, 1.1, 0.0, 0.75]))))
print("merged task vector:", [round(float(x), 2) for x in merged])

Output

matches_expected: True
merged task vector: [0.0, 1.1, 0.0, 0.75]

TIES-Merging outperforms the baselines it is compared against in the paper's evaluated vision and T5 task-vector settings, and its analysis highlights sign interference as an important source of degradation.^{[7] TIES-Merging: Resolving Interference When Merging Models, https://arxiv.org/abs/2306.01708} For an LLM merge, use it as a candidate recipe when task vectors conflict, then compare per-task evaluation against simpler baselines.
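The three steps can be chained into one end-to-end sketch that starts from a base and several fine-tuned checkpoints and ends with merged weights, $\theta_0 + \lambda \tau_{\text{merged}}$. This is an illustrative toy under stated assumptions: lists stand in for flattened tensors, `ties_merge` and its `lam` parameter are hypothetical names, and trimming keeps the `k = ceil(density * n)` largest-magnitude entries instead of using a quantile threshold.

```python
import math


def trim(tv: list[float], density: float) -> list[float]:
    """Keep the k largest-magnitude entries, k = ceil(density * len(tv)); zero the rest."""
    k = math.ceil(density * len(tv))
    keep = set(sorted(range(len(tv)), key=lambda i: abs(tv[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(tv)]


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def ties_merge(
    base: list[float],
    fine_tuned: list[list[float]],
    density: float = 0.5,
    lam: float = 1.0,
) -> list[float]:
    """Trim each task vector, elect a sign per coordinate, average agreeing values."""
    task_vectors = [[f - b for f, b in zip(ft, base)] for ft in fine_tuned]
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for b, column in zip(base, zip(*trimmed)):
        elected = sign(sum(column))  # sign of the summed signed deltas
        agreeing = [v for v in column if v != 0 and sign(v) == elected]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * delta)
    return merged


base = [1.0, 2.0, -1.0, 0.5]
code_ft = [1.125, 1.5, -0.9375, 1.25]   # task vector [ 0.125, -0.5,   0.0625,  0.75]
math_ft = [1.0625, 3.0, -1.125, 0.0]    # task vector [ 0.0625, 1.0,  -0.125,  -0.5 ]
chat_ft = [0.9375, 1.75, -0.875, 1.25]  # task vector [-0.0625, -0.25, 0.125,   0.75]

merged = ties_merge(base, [code_ft, math_ft, chat_ft], density=0.5, lam=1.0)
print("merged:", merged)
```

At the second coordinate two models vote negative, but the single +1.0 delta outweighs them, so only that value survives. The scale `lam` and `density` are the knobs to tune against per-task evaluation.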

DARE (drop and rescale)^{[8] Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch, https://arxiv.org/abs/2311.03099}

DARE randomly drops entries from each task vector and rescales the survivors before a downstream merge. The rescaling preserves an entry's expected delta under the random mask; it doesn't by itself prove that the resulting model preserves a capability.

The paper studies redundancy in supervised fine-tuning (SFT) deltas and reports that its evaluated models can often tolerate dropping 90% of delta entries, and in some cases 99%, before merging. Its size ablation reports that WizardMath-70B remains effective at a 99% drop rate while the evaluated 7B and 13B variants fail there.^{[8] Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch, https://arxiv.org/abs/2311.03099} Treat that as experimental evidence for the paper's SFT models, not a default density setting for a new merge. DARE is a sparsification step, not a complete merge recipe: sparsified task vectors still need a merger such as averaging or TIES.

DARE's analysis also separates SFT deltas, which it observes are typically within roughly 0.002, from continued-pretraining deltas that approach 0.03; its drop-and-rescale approach becomes ineffective on the latter.^{[8] Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch, https://arxiv.org/abs/2311.03099} One candidate pipeline is DARE-TIES: DARE sparsifies each delta, then TIES resolves directional conflicts among surviving values. Mergekit exposes that composition as dare_ties.^{[1] Mergekit: Tools for merging pre-trained large language models, https://github.com/arcee-ai/mergekit}

$\tau_i^{\text{DARE}} = \frac{\tau_i \odot m}{1 - p}$

where $m$ is a random binary keep-mask whose entries are 0 with probability $p$ (the drop rate) and 1 otherwise, and the $1/(1-p)$ factor preserves each entry's expectation under masking.^{[8] Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch, https://arxiv.org/abs/2311.03099}
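The expectation claim can be checked directly. The derivation below assumes each mask entry $m_j$ is drawn independently and equals 1 (keep) with probability $1-p$; the per-entry index $j$ is notation introduced here for illustration.

```latex
\begin{aligned}
\mathbb{E}\!\left[\tau_{i,j}^{\text{DARE}}\right]
  &= \frac{\tau_{i,j}\,\mathbb{E}[m_j]}{1-p}
   = \frac{\tau_{i,j}\,(1-p)}{1-p}
   = \tau_{i,j} \\
\operatorname{Var}\!\left[\tau_{i,j}^{\text{DARE}}\right]
  &= \frac{\tau_{i,j}^{2}\,\operatorname{Var}[m_j]}{(1-p)^{2}}
   = \frac{\tau_{i,j}^{2}\,p\,(1-p)}{(1-p)^{2}}
   = \tau_{i,j}^{2}\,\frac{p}{1-p}
\end{aligned}
```

The mean is preserved for any $p < 1$, but the per-entry variance grows as $p$ approaches 1 (for example, $9\,\tau_{i,j}^{2}$ at $p = 0.9$). That is one reason to treat the drop rate as a setting to tune and evaluate rather than a free choice.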

dare_sparsify implements the DARE step itself. It sparsifies one task vector, after which you can pass the result to a downstream merge rule:

dare-drop-and-rescale-yu2023.py

import torch

def dare_sparsify(
    task_vector: torch.Tensor,
    drop_rate: float = 0.9
) -> torch.Tensor:
    """DARE preprocessing for one task vector."""
    keep_prob = 1.0 - drop_rate
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    mask = torch.bernoulli(torch.full_like(task_vector, keep_prob))
    return (task_vector * mask) / keep_prob

torch.manual_seed(4)
task_vector = torch.tensor([0.2, -0.4, 0.1, 0.6])
sparsified = dare_sparsify(task_vector, drop_rate=0.5)

shape_ok = sparsified.shape == task_vector.shape
finite_ok = bool(torch.isfinite(sparsified).all())
binary_mask_ok = set(torch.unique((sparsified != 0).int()).tolist()).issubset({0, 1})
makes_invalid_drop_rate_fail = False

try:
    dare_sparsify(task_vector, drop_rate=1.0)
except ValueError as exc:
    makes_invalid_drop_rate_fail = "drop_rate" in str(exc)

print("shape_ok:", shape_ok)
print("finite_ok:", finite_ok)
print("binary_mask_ok:", binary_mask_ok)
print("invalid_drop_rate_rejected:", makes_invalid_drop_rate_fail)
print("original nonzero:", int((task_vector != 0).sum()))
print("sparsified nonzero:", int((sparsified != 0).sum()))
print("sparsified:", [0.0 if abs(float(x)) < 1e-8 else round(float(x), 2) for x in sparsified])

Output

shape_ok: True
finite_ok: True
binary_mask_ok: True
invalid_drop_rate_rejected: True
original nonzero: 4
sparsified nonzero: 2
sparsified: [0.0, 0.0, 0.2, 1.2]

What the paper establishes

On the SFT models and tasks it evaluates, DARE finds substantial redundancy in task-vector entries and improves several downstream merge methods after drop-and-rescale preprocessing.^{[8] Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch, https://arxiv.org/abs/2311.03099} For a new checkpoint family, density remains a tuned parameter: compare unsparsified and DARE-preprocessed merges on every required task slice.
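Because DARE is only a preprocessing step, a complete candidate needs a downstream merge rule. The sketch below pairs drop-and-rescale with plain averaging of the sparsified task vectors, which is one of several possible pairings. It uses lists and the standard library's `random` module as stand-ins for tensors; `dare_then_average` and the toy checkpoints are illustrative assumptions.

```python
import random


def dare_sparsify(tv: list[float], drop_rate: float, rng: random.Random) -> list[float]:
    """Drop each entry with probability drop_rate; rescale survivors by 1 / (1 - drop_rate)."""
    keep_prob = 1.0 - drop_rate
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in tv]


def dare_then_average(
    base: list[float],
    fine_tuned: list[list[float]],
    drop_rate: float,
    seed: int = 0,
) -> list[float]:
    """Sparsify every task vector with DARE, average them, and add the result to the base."""
    rng = random.Random(seed)
    sparsified = [
        dare_sparsify([f - b for f, b in zip(ft, base)], drop_rate, rng)
        for ft in fine_tuned
    ]
    n = len(sparsified)
    return [b + sum(col) / n for b, col in zip(base, zip(*sparsified))]


base = [1.0, 2.0, -1.0, 0.5]
code_ft = [1.125, 1.5, -0.9375, 1.25]
math_ft = [1.0625, 3.0, -1.125, 0.0]

# Baseline: drop_rate=0.0 leaves the task vectors untouched (plain averaging).
unsparsified = dare_then_average(base, [code_ft, math_ft], drop_rate=0.0)
# Candidate: same pipeline with half of each task vector dropped at random.
candidate = dare_then_average(base, [code_ft, math_ft], drop_rate=0.5, seed=0)

print("unsparsified:", unsparsified)
print("dare candidate:", candidate)
```

Building both merges from the same function keeps the comparison fair: the only difference between the two candidates is the DARE step.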

SLERP (spherical linear interpolation)

SLERP is a geometric interpolation rule. It follows an angular arc between normalized vector directions rather than the chord used by linear interpolation. It isn't an algorithm for discovering a low-loss path around an incompatible-model ridge.

Rather than interpolating directions along a straight line, SLERP^{[9] Animating Rotation with Quaternion Curves, https://dl.acm.org/doi/10.1145/325334.325242} interpolates along a great-circle arc on the unit sphere. Write the geometry in terms of normalized directions:

$\begin{aligned} \text{SLERP}(\hat{\theta}_A, \hat{\theta}_B, t) &= \frac{\sin((1-t)\Omega)}{\sin \Omega} \hat{\theta}_A \\ &\quad + \frac{\sin(t\Omega)}{\sin \Omega} \hat{\theta}_B \end{aligned}$

where $t$ is the interpolation factor (from 0 to 1) and $\Omega = \arccos(\hat{\theta}_A \cdot \hat{\theta}_B)$ is the angle between normalized weight vectors. Practical merge implementations may handle magnitude separately after interpolating direction, as the code does here.

SLERP was introduced for computer graphics to interpolate rotations represented as quaternions.^{[9] Animating Rotation with Quaternion Curves, https://dl.acm.org/doi/10.1145/325334.325242} That source establishes its geometry, not downstream quality for neural-network weight merges. Use a SLERP checkpoint as another candidate and measure loss and required tasks just as you would for a linear merge.

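The slerp function performs this geometric interpolation. It takes two unnormalized tensors and an interpolation factor t, normalizes them to unit directions, interpolates along the arc between those directions, and rescales the result by a linearly interpolated magnitude. When either input has zero norm, or the directions are nearly parallel or antipodal, it falls back to linear interpolation: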

slerp-spherical-linear-interpolation.py

import torch

def slerp(
    v0: torch.Tensor,
    v1: torch.Tensor,
    t: float
) -> torch.Tensor:
    """Practical SLERP for a single weight tensor."""
    flat_v0 = v0.flatten()
    flat_v1 = v1.flatten()

    v0_mag = flat_v0.norm()
    v1_mag = flat_v1.norm()
    if v0_mag.item() == 0 or v1_mag.item() == 0:
        return (1 - t) * v0 + t * v1

    # Separate direction from magnitude
    v0_dir = flat_v0 / v0_mag
    v1_dir = flat_v1 / v1_mag

    dot = torch.dot(v0_dir, v1_dir)
    omega = torch.acos(torch.clamp(dot, -1.0, 1.0))

    # Parallel and antipodal directions make the spherical path degenerate.
    sin_omega = torch.sin(omega)
    if sin_omega.abs().item() < 1e-6:
        return (1 - t) * v0 + t * v1

    direction = (
        torch.sin((1 - t) * omega) / sin_omega * v0_dir +
        torch.sin(t * omega) / sin_omega * v1_dir
    )
    magnitude = (1 - t) * v0_mag + t * v1_mag
    return direction.view_as(v0) * magnitude

v0 = torch.tensor([1.0, 0.0])
v1 = torch.tensor([0.0, 1.0])
mid = slerp(v0, v1, t=0.5)
antipodal_mid = slerp(v0, -v0, t=0.5)

print("unit_norm:", bool(torch.allclose(mid.norm(), torch.tensor(1.0), atol=1e-6)))
print("matches_45_degree:", bool(torch.allclose(mid, torch.tensor([2**-0.5, 2**-0.5]), atol=1e-6)))
print("antipodal_fallback_finite:", bool(torch.isfinite(antipodal_mid).all()))
print("midpoint:", [round(float(x), 4) for x in mid])
print("norm:", round(float(mid.norm()), 4))

Output

unit_norm: True
matches_45_degree: True
antipodal_fallback_finite: True
midpoint: [0.7071, 0.7071]
norm: 1.0

What it guarantees geometrically

In the equal-norm orthogonal-vector example above, the SLERP midpoint remains on the unit circle while a linear midpoint would have smaller norm. That's a geometric property of this example, not evidence that the midpoint preserves either model's behavior. For a pair of checkpoints, evaluate linear and spherical candidates against the same release gates.
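The norm difference between the two interpolation rules can be tabulated along the whole path for the same orthogonal unit vectors. This is a minimal standard-library sketch; `lerp`, `slerp_unit`, and `norm` are illustrative helpers, and `slerp_unit` assumes unit-norm inputs that are neither parallel nor antipodal.

```python
import math


def norm(v: list[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def lerp(a: list[float], b: list[float], t: float) -> list[float]:
    """Straight-line (chord) interpolation."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def slerp_unit(a: list[float], b: list[float], t: float) -> list[float]:
    """Arc interpolation between unit vectors (no degenerate-case handling)."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


a, b = [1.0, 0.0], [0.0, 1.0]
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(
        f"t={t:.2f}",
        "lerp_norm:", round(norm(lerp(a, b, t)), 4),
        "slerp_norm:", round(norm(slerp_unit(a, b, t)), 4),
    )
```

The linear path dips to a norm of about 0.7071 at the midpoint while the spherical path stays at 1.0. Neither column is a quality measurement; both candidates still go through the same release gates.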

Running a merge with mergekit

mergekit^{[1] Mergekit: Tools for merging pre-trained large language models, https://github.com/arcee-ai/mergekit} is an open-source toolkit for merging language-model checkpoints. Its documented CLI supports YAML-defined merges with CPU or limited-VRAM execution, and mergekit-multi can run multi-stage recipes where later merges consume earlier outputs.^{[1]}

To use mergekit, you typically define a YAML configuration file that specifies the base model, the fine-tuned source models, their respective merging coefficients, and the desired algorithm. This configuration acts as the input to the CLI tool to generate the merged model:

running-a-merge-with-mergekit.yaml

# mergekit config: merge_config.yml
models:
  - model: your-org/base-8b-code
    parameters:
      weight: 0.35
      density: 0.5       # retained fraction for this task vector
  - model: your-org/base-8b-math
    parameters:
      weight: 0.35
      density: 0.5
  - model: your-org/base-8b-chat
    parameters:
      weight: 0.30
      density: 0.5

merge_method: dare_ties
base_model: your-org/base-8b
tokenizer:
  source: base           # switch to union if you must preserve extra tokens
chat_template: auto      # or pin a specific template when model families differ
dtype: float16

For dare_ties, each source model's density controls the retained fraction of that source's task vector. Merge methods have different schemas, so don't treat merge_method as a drop-in switch: for example, mergekit's slerp method takes exactly two source models.^{[1] Mergekit: Tools for merging pre-trained large language models, https://github.com/arcee-ai/mergekit} If all required tokens are already in the base tokenizer, tokenizer.source: base pins the output vocabulary to that base. Modern mergekit configuration defaults to a union tokenizer, which adds tokens present in source vocabularies and assigns fallback embeddings where an input model lacks them.^{[1]} Either policy is an output-space decision that needs targeted evaluation.

Common Mistake: Assuming all specialist checkpoints use the same tokenizer because they began from the same base. If the code checkpoint added fill-in-the-middle tokens such as <fim_prefix>, choosing base drops those added output entries while choosing union introduces filled embeddings for models that lack them. Verify tokenizer vocabularies, choose the output policy explicitly, and test code-completion prompts that require those tokens.

This preflight check makes the output-vocabulary decision explicit before any expensive merge runs:

choose-output-vocabulary-policy.py

def output_vocab(base_vocab, source_vocabs, policy):
    if policy == "base":
        return set(base_vocab)
    if policy == "union":
        return set().union(*source_vocabs)
    raise ValueError("policy must be 'base' or 'union'")

base_vocab = {"<bos>", "def", "return"}
code_vocab = base_vocab | {"<fim_prefix>"}
required_tokens = {"def", "<fim_prefix>"}

base_output = output_vocab(base_vocab, [base_vocab, code_vocab], "base")
union_output = output_vocab(base_vocab, [base_vocab, code_vocab], "union")

print("base_missing_required:", sorted(required_tokens - base_output))
print("union_missing_required:", sorted(required_tokens - union_output))
print("union_requires_added_embedding_eval:", "<fim_prefix>" not in base_vocab)

Output

base_missing_required: ['<fim_prefix>']
union_missing_required: []
union_requires_added_embedding_eval: True

Once the configuration is set, you can execute the merge using the CLI tool. The command below takes the YAML configuration file and the output path, producing the final merged model on disk:

terminal

# Run merge on a local GPU
mergekit-yaml merge_config.yml ./output_model --cuda

For hardware-specific and memory-saving flags, check mergekit-yaml --help because supported options vary by version.[1]

Mergekit lets engineers build candidate checkpoints without a new gradient-training run. Iterate over parameters such as retained density and task weights only against a defined evaluation suite.

Run an iterative validation loop after every merge: evaluate each target slice, compare it to declared gates and source baselines, and retune or reject any candidate that misses a critical threshold.

Figure: Merge validation loop from source models through merged candidate and per-task evaluation, showing that promotion stops when a critical task misses its threshold. Every merge is provisional until a per-task evaluation suite confirms that each critical slice clears its release threshold instead of hiding misses behind an aggregate score.

An aggregate score is useful for ranking candidates, but a release decision should fail on any critical threshold miss:

evaluate-per-task-release-gates.py

def failed_release_gates(scores, thresholds):
    return {
        task: (scores[task], minimum)
        for task, minimum in thresholds.items()
        if scores[task] < minimum
    }

scores = {"code": 92.4, "math": 82.1, "chat": 87.0}
thresholds = {"code": 90.0, "math": 85.0, "chat": 84.0}
failures = failed_release_gates(scores, thresholds)

print("aggregate_score:", round(sum(scores.values()) / len(scores), 1))
print("failed_tasks:", sorted(failures))
print("promote:", not failures)

Output

aggregate_score: 87.2
failed_tasks: ['math']
promote: False

Finding good coefficients automatically

Manually searching merge coefficients and layer selections can be expensive. Evolutionary model merging [10] applies evolutionary search to optimize merge recipes against a supplied fitness evaluation. Its search space can include per-layer source choices, interpolation weights, and whether to merge in parameter space, data-flow space, or both.

Sakana AI reports using this approach to build EvoLLM-JP from Japanese-language and math-oriented models, optimizing for the evaluations selected in that work.[10] The principle is straightforward: treat each merge configuration as a "genome," score it on validation data, mutate its parameters, and retain better-scoring candidates. The resulting model inherits the objective's coverage and blind spots, so release gates still need independent slices.
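
The genome, score, mutate, retain loop described above can be sketched in a few lines. This is a toy illustration, not the method used in the cited work: the two-coefficient genome, the toy_fitness function (a stand-in for a real validation score), and all hyperparameters are assumptions chosen only to make the loop runnable.

```python
import random

def toy_fitness(genome):
    """Hypothetical stand-in for a validation score; peaks at (0.6, 0.4)."""
    lam_code, lam_math = genome
    return -((lam_code - 0.6) ** 2 + (lam_math - 0.4) ** 2)

def mutate(genome, rng, step=0.1):
    """Perturb each coefficient and clamp it to [0, 1]."""
    return tuple(min(1.0, max(0.0, g + rng.uniform(-step, step))) for g in genome)

def evolve(fitness, generations=30, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [(rng.random(), rng.random()) for _ in range(population_size)]
    for _ in range(generations):
        # Retain the better-scoring half, then refill with mutated copies.
        survivors = sorted(population, key=fitness, reverse=True)[: population_size // 2]
        children = [mutate(parent, rng) for parent in survivors]
        population = survivors + children
    return max(population, key=fitness)

best = evolve(toy_fitness)
print("best_genome:", tuple(round(g, 2) for g in best))
print("best_fitness:", round(toy_fitness(best), 4))
```

In a real search, each fitness call would build and evaluate a merged checkpoint, so the evaluation budget, not the loop itself, dominates cost. The loop also optimizes only what toy_fitness measures, which is why independent release gates remain necessary.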

In practice, a small grid search is a transparent baseline: try lambda values in [0.2, 0.4, 0.6, 0.8] for each task vector, measure each required slice on held-out data, and retain only configurations that clear all release gates. For a code/math/chat merge, those slices should include code-generation tests, math word problems, and instruction-following prompts; a code-only objective doesn't protect math behavior.
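
A minimal sketch of that grid search follows. The toy_evaluate function is a hypothetical stand-in for "merge, then score each held-out slice"; its formulas, the interference term, and the 60-point gates are invented for illustration and carry no empirical meaning.

```python
from itertools import product

LAMBDAS = [0.2, 0.4, 0.6, 0.8]
GATES = {"code": 60.0, "math": 60.0, "chat": 60.0}  # hypothetical release gates

def toy_evaluate(lam_code, lam_math):
    """Hypothetical stand-in for merging and scoring on held-out slices.

    A real run would build each candidate checkpoint and execute the eval suite.
    """
    interference = 10.0 * lam_code * lam_math
    return {
        "code": 40.0 + 40.0 * lam_code - interference,
        "math": 35.0 + 45.0 * lam_math - interference,
        "chat": 80.0 - 10.0 * (lam_code + lam_math),
    }

def passes_all_gates(scores, gates):
    """A candidate survives only if every gated slice meets its minimum."""
    return all(scores[task] >= minimum for task, minimum in gates.items())

passing = []
for lam_code, lam_math in product(LAMBDAS, repeat=2):
    scores = toy_evaluate(lam_code, lam_math)
    if passes_all_gates(scores, GATES):
        passing.append(((lam_code, lam_math), scores))

print("candidates_tried:", len(LAMBDAS) ** 2)
print("candidates_passing:", [pair for pair, _ in passing])
```

The filter is a conjunction over slices, not a ranking by average, so a configuration that excels on code but misses the math gate never reaches the shortlist.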

Passthrough and frankenmerging

Not every merge averages weights. Mergekit's passthrough method copies selected tensors or layer ranges from source models into the output checkpoint.[1] This is model splicing rather than interpolation: tensor dimensions must compose, while useful behavior across the splice remains an evaluation result.

For example, a recipe might copy layers 0-19 from checkpoint A and layers 20-31 from checkpoint B. Compatible dimensions make that artifact constructible; they don't show that B's later layers can interpret the hidden states produced by A's earlier layers.
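
The layer recipe in that example can be written down as an explicit plan before any tensors are copied. The sketch below is a hypothetical planning helper, not mergekit's passthrough code; the checkpoint names, the 32-layer depth, and the 4096 hidden size are assumed values.

```python
def build_splice_plan(slices, expected_layers):
    """Expand (source, start, end) ranges into one source per output layer.

    Ranges are half-open: (source, 0, 20) covers layers 0-19.
    """
    plan = []
    for source, start, end in slices:
        if start != len(plan):
            raise ValueError(f"gap or overlap before layer {start}")
        plan.extend((source, layer) for layer in range(start, end))
    if len(plan) != expected_layers:
        raise ValueError(f"plan has {len(plan)} layers, expected {expected_layers}")
    return plan

# Hypothetical metadata: both checkpoints expose 32 layers with hidden size 4096.
hidden_size = {"checkpoint_a": 4096, "checkpoint_b": 4096}
plan = build_splice_plan(
    [("checkpoint_a", 0, 20), ("checkpoint_b", 20, 32)],
    expected_layers=32,
)

shapes_compose = len({hidden_size[source] for source, _ in plan}) == 1
print("output_layers:", len(plan))
print("splice_boundary:", plan[19], "->", plan[20])
print("shapes_compose:", shapes_compose)
print("behavior_verified:", False)  # only per-task evaluation can change this
```

The plan records source lineage and the exact boundary, which is the information a later evaluation report needs in order to attribute a regression to the splice.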

Use passthrough as an experiment with explicit source lineage, layer boundaries, and the same per-task gates used for any other candidate. A successful build only proves shape compatibility.

Merging pitfalls and hard limits

Direct weight merging requires aligned parameter meaning, shape, and output-space assumptions. You can't directly interpolate a 7-billion-parameter checkpoint with a 70-billion-parameter checkpoint, or silently combine incompatible vocabulary mappings and treat token IDs as equivalent. A generated merged artifact is a candidate, not evidence of retained quality.
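
A shape preflight makes the first of these requirements checkable before any merge runs. The sketch below compares tensor names and shapes from two checkpoints; the tensor names and dimensions are hypothetical, and a real check would read them from checkpoint metadata rather than hard-coded dictionaries.

```python
def shape_mismatches(shapes_a, shapes_b):
    """Return tensors that are missing from one side or differ in shape."""
    problems = {}
    for name in sorted(set(shapes_a) | set(shapes_b)):
        a, b = shapes_a.get(name), shapes_b.get(name)
        if a != b:
            problems[name] = (a, b)
    return problems

# Hypothetical tensor metadata; real values would come from checkpoint headers.
small = {"embed.weight": (32000, 4096), "layers.0.attn.q_proj.weight": (4096, 4096)}
same_base = dict(small)
large = {"embed.weight": (32000, 8192), "layers.0.attn.q_proj.weight": (8192, 8192)}

print("same_base_mismatches:", shape_mismatches(small, same_base))
print("cross_size_mismatches:", sorted(shape_mismatches(small, large)))
```

An empty result only shows that interpolation is constructible. It says nothing about whether the parameters share meaning, which still depends on lineage and evaluation.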

Avoid direct interpolation when:

Different architectures: Source models must share the exact same structural architecture, parameter count, and layer layout. You can't directly interpolate a MiniMax model with a Mistral model, nor an 8B model with a 72B model.
Different tokenizers: Direct interpolation assumes aligned embeddings and next-token heads. Mergekit can help with tokenizer union in compatible cases, but it doesn't make arbitrary vocabulary mismatches disappear.[1]
Mismatched prompt formats or chat templates: These usually won't block the raw tensor merge, but they can make the merged checkpoint look broken at inference time because the prompt serialization no longer matches the behaviors learned during tuning.
Quantized-only checkpoints: If you want to merge, do it on dequantized or full-precision weights first, then quantize the final artifact. The merge math depends on real-valued deltas, not already-rounded integers. Treat the merged checkpoint as a new model and re-run your quantization calibration and evals.
Distant fine-tuning trajectories: Models fine-tuned on divergent data or with large update magnitudes may be poor interpolation candidates. Git Re-Basin demonstrates permutation alignment in studied MLP, CNN, and ResNet settings; it isn't an established repair step for arbitrary LLM merges.[5]
Critical precision tasks: Don't ship a merge when a required slice misses its threshold, even if an aggregate metric rises. A specialist baseline remains part of the comparison.
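
The quantized-only item in the list above can be illustrated with a single scalar weight. This toy sketch assumes a uniform quantization step of 0.02 and invented weight values; real quantization schemes are more elaborate, but the rounding effect on small deltas is the same in kind.

```python
SCALE = 0.02  # hypothetical uniform quantization step

def quantize(weight):
    """Map a real-valued weight to its nearest integer level."""
    return round(weight / SCALE)

base, code_ft, math_ft = 0.500, 0.506, 0.493

# Task vectors computed on real-valued weights keep the small updates.
float_deltas = [round(code_ft - base, 3), round(math_ft - base, 3)]

# Task vectors computed on quantized levels collapse to zero here,
# because all three weights round to the same integer level.
level_deltas = [quantize(code_ft) - quantize(base), quantize(math_ft) - quantize(base)]

print("float_deltas:", float_deltas)
print("quantized_level_deltas:", level_deltas)
```

When fine-tuning updates are smaller than the quantization step, the integer checkpoints no longer contain the information a task-vector merge needs, which is why the merge should run on full-precision weights and quantization should come last.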

Diagnosing a broken merge

When a merge goes wrong, the model usually tells you quickly. Three common failure patterns, their causes, and their fixes are:

Symptom	Likely Cause	Fix
Output is incoherent or random tokens	Tokenizer/output mapping, prompt template, or architecture mismatch	Verify output vocabulary policy, chat template, and layer shapes
Model loops repetitive text	Coefficient overload or weights pushed too far out of distribution	Reduce coefficient magnitudes, start additive lambdas in the `0.0-1.0` range, and re-run evals
Merged model is worse than every source	Source interference or incompatible lineage	Check source lineage and tokenizer policy; prefer compatible sources and retune or reject the merge

Don't predict quality from the merge rule alone. Model Soups [3] reported gains over the best individual checkpoint in its evaluated shared-initialization experiments, including ImageNet settings. That result motivates trying a merge; it doesn't predict whether a new LLM merge will clear its specialist task gates.

After merging, run a full evaluation suite across all target tasks, not the aggregate score alone. Strong code-generation accuracy can mask a miss in math reasoning. Track per-task metrics independently and compare against declared thresholds and each source model's baseline.

Quick recall prompts

Why is weight averaging worth testing at all? Shouldn't mixing model weights produce garbage? Shared-initialization checkpoints provide a plausible starting point because studies such as Model Soups observe useful averages in evaluated settings. Shared lineage isn't a quality proof: sample the interpolation path and run task gates. Independently trained models can also encode equivalent functions with incompatible neuron orderings, making direct averaging particularly suspect.

How do you choose merge coefficients for three or more models? Start with equal weights as a baseline, then tune against a held-out validation set that reflects your target task mixture. Track each target capability separately. If equal weights cause interference, try task-vector coefficients, TIES, DARE-TIES, or a small grid search over weights and density values.

When should you merge models instead of using mixture-of-experts routing? Merge when you want one dense checkpoint with roughly one-model serving cost, simpler rollback, and no router path. Use MoE or explicit routing when tasks are highly distinct, specialists must stay sharp, or you can afford multiple experts plus routing complexity.

How does LoRA adapter merging differ from full-weight merging? LoRA adapters are low-rank delta weights attached to a frozen base, so adapter merging operates on a much smaller parameter set. That makes experiments cheaper, but it doesn't remove interference: two adapters can still push the same target matrix in conflicting directions.

What should you check first when a merge outputs incoherent text? Check architecture, tensor shapes, tokenizer output policy, and chat template before coefficient tuning. These checks identify whether inputs and output IDs still mean what the merge recipe assumed.

Try it yourself

Start merging work by reasoning through a concrete scenario before you touch a GPU.

Compatibility warm-up

Question: Why can't we average the weights of GPT-2 and Llama-3?

Reasoning: Even if you ignored the parameter-count mismatch, the two models have different architectures and different tokenizers. Their parameter entries therefore don't represent an aligned interpolation space. Same-base checkpoints are conservative direct-merge candidates; alignment methods such as Git Re-Basin provide research evidence in studied neural-network architectures, not a generic way to make GPT-2 and Llama compatible.

Grid-search challenge

You have a base model and two fine-tuned variants:

code_ft improves code-generation accuracy from 40% to 70%
math_ft improves math-reasoning accuracy from 35% to 65%

You want a merged model that scores at least 60% on both tasks. You try Task Arithmetic with coefficients $\lambda_{\text{code}}$ and $\lambda_{\text{math}}$ .

Given this table, which coefficient pair is the best starting point?

$\lambda_{\text{code}}$	$\lambda_{\text{math}}$	Code-generation score	Math-reasoning score
0.8	0.8	64%	59%
0.6	0.6	58%	53%
0.5	0.5	55%	50%
1.0	0.0	70%	35%

Answer: None of the listed pairs reaches 60% on both tasks. The best starting point is 0.8, 0.8 because it gets closest at 64% code-generation accuracy and 59% math-reasoning accuracy, which tells you the merge is almost there but still suffering from interference. In practice, you'd try TIES or DARE-TIES next, or continue tuning coefficients around that region, to lift math performance without giving up too much code-generation accuracy.
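
One way to make "gets closest" explicit is to rank each pair by how many points it falls short of the 60% target across both tasks. The total-shortfall rule below is an assumed tie-breaking convention for this exercise, not a standard metric; the scores are copied from the table.

```python
TARGET = 60.0

# (lambda_code, lambda_math) -> (code score, math score), copied from the table.
results = {
    (0.8, 0.8): (64.0, 59.0),
    (0.6, 0.6): (58.0, 53.0),
    (0.5, 0.5): (55.0, 50.0),
    (1.0, 0.0): (70.0, 35.0),
}

def total_shortfall(scores, target=TARGET):
    """Sum of points by which each task misses the target (0 if it passes)."""
    return sum(max(0.0, target - score) for score in scores)

ranked = sorted(results, key=lambda pair: total_shortfall(results[pair]))
for pair in ranked:
    print(pair, "shortfall:", total_shortfall(results[pair]))
```

Under this rule the 0.8, 0.8 pair misses by one point in total, while the code-only pair misses by 25 points despite having the highest single-task score.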

Debug a broken merge

You merge three assistant specialists (code, math, chat) using uniform averaging. The merged model answers algebra questions well but responds to code-completion prompts with malformed identifiers and random API names.

What is the most likely cause, and what should you check first?

Answer: Tokenizer/output-space mismatch is a plausible first suspect, but the symptom alone doesn't identify one root cause. Inspect each source tokenizer and chat template, then compare code-completion behavior in the source checkpoints. If fill-in-the-middle tokens are needed, choose and evaluate an explicit union tokenizer policy; otherwise retain a shared-base output vocabulary and investigate source interference.

Mastery check

Key concepts

Model soups and uniform averaging
Task vectors and task arithmetic
Linear mode connectivity
Git Re-Basin and permutation alignment
TIES-Merging
DARE sparsification
SLERP
mergekit workflow and tokenizer handling
Per-task regression testing for merged checkpoints

Evaluation rubric

Foundational: Explains why same-base fine-tuned checkpoints are reasonable averaging candidates but still require evaluation
Intermediate: Computes task vectors and shows how scaling coefficients change the merged checkpoint
Intermediate: Explains why absolute averaging can cancel opposing task deltas
Advanced: Compares when to use model soups, task arithmetic, TIES, DARE, or SLERP
Advanced: Diagnoses tokenizer mismatch, architecture mismatch, and coefficient overload in a broken merge
Advanced: Designs a post-merge eval suite that blocks promotion when any required behavior misses its release threshold

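Follow-up questions

Why is task arithmetic often more controllable than averaging full checkpoints?
When does TIES-Merging help more than plain averaging?
Why can a merged model look good on one aggregate score but still be unsafe to ship?
What should you verify before merging checkpoints that came from different teams?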

When merges break

"Same parameter count means mergeable"

Symptom: The merged model produces incoherent or unstable outputs right away.
Cause: Matching parameter count isn't enough. The checkpoints may differ in architecture details, tokenizer alignment, or base-model lineage.
Fix: Verify shared architecture, layer shapes, tokenizer handling, and common base checkpoint before merging any weights.

"One aggregate score proves the merge worked"

Symptom: The merged checkpoint looks strong overall but one specialist workflow regresses badly in production.
Cause: Aggregate metrics can hide per-task damage from interference.
Fix: Run per-task checks against declared release thresholds and source baselines, then block promotion when any critical skill falls below threshold.

"More coefficient means more skill"

Symptom: Larger lambda values make outputs repetitive, brittle, or off-distribution.
Cause: Oversized coefficients can push the merged checkpoint too far away from the stable base region.
Fix: Start with small or equal weights, tune on held-out evals, and watch for regressions rather than assuming stronger scaling always helps.

"Tokenizer issues only matter at preprocessing time"

Symptom: Prompts that depend on added tokens, such as fill-in-the-middle code completion, fail or produce random tokens even though the merge completed.
Cause: Embedding rows and LM-head rows no longer line up cleanly across token IDs.
Fix: Check vocabulary equality first. If source models added tokens, use explicit tokenizer alignment such as mergekit union handling and then rerun evals.

"DARE is a full merge recipe by itself"

Symptom: Engineers sparsify task vectors and assume the job is done.
Cause: DARE is a preprocessing step that drops and rescales delta entries; it still needs a downstream merge rule such as averaging or TIES.
Fix: Treat DARE as sparsification first, then run a real merge rule and evaluate the combined checkpoint.

Next Step

Continue to Vector DB Internals: HNSW & IVF

You can now train, adapt, distill, and combine model weights under evaluation gates. The advanced systems phase begins with retrieval infrastructure, where HNSW, IVF, and Product Quantization decide which evidence reaches those models under latency and memory limits.


References

[1] Mergekit: Tools for merging pre-trained large language models

Goddard, C., et al. · 2023 · https://github.com/arcee-ai/mergekit

[2] Linear Mode Connectivity and the Lottery Ticket Hypothesis

Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. · 2020 · ICML 2020

[3] Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time

Wortsman, M., et al. · 2022 · ICML 2022 · https://arxiv.org/abs/2203.05482

[4] The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks

Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. · 2022 · ICLR 2022

[5] Git Re-Basin: Merging Models modulo Permutation Symmetries

Ainsworth, S. K., Hayase, J., & Srinivasa, S. · 2022 · ICLR 2023 · https://arxiv.org/abs/2209.04836

[6] Editing Models with Task Arithmetic

Ilharco, G., et al. · 2022 · ICLR 2023

[7] TIES-Merging: Resolving Interference When Merging Models

Yadav, P., et al. · 2023 · NeurIPS 2023

[8] Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch

Yu, L., et al. · 2023 · ICML 2024 · https://arxiv.org/abs/2311.03099

[9] Animating Rotation with Quaternion Curves

Shoemake, K. · 1985 · SIGGRAPH '85

[10] Evolutionary Optimization of Model Merging Recipes

Sakana AI · 2024 · https://sakana.ai/evolutionary-model-merge/
