Learn model merging techniques, from simple weight averaging and task arithmetic to TIES-Merging and DARE, including practical guidance on tokenizer compatibility, mergekit workflows, and evaluation.
Knowledge distillation compressed teacher behavior into a smaller student. Model merging asks a different deployment question: if you already have several useful fine-tuned checkpoints from the same base model, can you combine their weights into one checkpoint without launching another training run?
Your logistics team runs a customer-support LLM that answers shipping, catalog, and refund questions. After a peak season, you have three fine-tuned checkpoints from the same base model: one tuned on carrier-policy updates, one on product-search queries, and one on refund workflows. Serving all three behind a router adds cold-start latency and complicates rollback. Model merging asks whether you can combine those checkpoints into one deployable model that keeps useful behavior from each specialist without running another training job.
This chapter explains when interpolation is worth testing, where it fails, and how to evaluate merged models before shipping them.
In the previous articles on fine-tuning and alignment, you learned that a model checkpoint is a large vector of numbers, and that fine-tuning nudges those numbers in directions that improve a specific task. Model merging asks a natural follow-up question: what if you could combine those nudges? If one team fine-tuned a base model for Python code and another team fine-tuned the same base model for math reasoning, can you average their checkpoints and get a model that's good at both?
The attraction is practical: a merge creates a candidate checkpoint without another gradient-training run. Whether that candidate retains any target behavior is an evaluation question, not a property guaranteed by the merge recipe.
A hard prerequisite for direct tensor interpolation is compatible parameter structure: corresponding tensors must have compatible shapes and meanings. Embedding matrices and language-model heads also need an explicit tokenizer policy. You can't interpolate a 7B checkpoint with a 70B checkpoint, or blindly combine token rows from different vocabularies. Mergekit can construct a union tokenizer and assign fallback embeddings for missing tokens, but that is an explicit output-space choice that still needs evaluation.[1] Same-base checkpoints are the conservative starting point for task-vector merging because their parameter coordinates share lineage; compatible shapes alone do not establish merge quality.
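The shape prerequisite above can be checked before any merge is attempted. The following is a minimal sketch, assuming each checkpoint has already been summarized as a mapping from tensor name to shape; the helper `shape_mismatches` and the tensor names are hypothetical. Passing this check only shows that a merge is constructible, not that it will be good.

```python
def shape_mismatches(shapes_a, shapes_b):
    """Return tensor names that are missing on one side or differ in shape."""
    problems = {}
    for name in sorted(set(shapes_a) | set(shapes_b)):
        a, b = shapes_a.get(name), shapes_b.get(name)
        if a != b:
            problems[name] = (a, b)
    return problems

# Hypothetical shape summaries; real ones would be read from the checkpoints.
shipping = {"embed.weight": (32000, 4096), "layer0.attn.q": (4096, 4096)}
refund = {"embed.weight": (32000, 4096), "layer0.attn.q": (4096, 4096)}
catalog = {"embed.weight": (32001, 4096), "layer0.attn.q": (4096, 4096)}  # one added token

print("shipping vs refund:", shape_mismatches(shipping, refund))
print("shipping vs catalog:", shape_mismatches(shipping, catalog))
```

An embedding-row mismatch like the one in `catalog` is the signal to pick a tokenizer policy explicitly rather than interpolating token rows blindly.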
The illustration below compares four recipes covered in this chapter, from averaging to sparse conflict-aware task-vector merging.
If someone told you to average the internal numbers of two trained neural networks, skepticism is correct. Weight averaging is plausible in some fine-tuning settings because nearby checkpoints can occupy a connected low-loss region in parameter space. It is not safe merely because both endpoints are good models.
Think of fine-tuning as moving from a shared starting point through weight space. Linear mode connectivity studies whether the straight interpolation path between endpoints crosses a high-loss barrier.[2] The Model Soups paper motivates averaging in a specific setting: models fine-tuned from a shared pretrained initialization under different hyperparameters often admit useful averages in its evaluated vision and text-classification experiments.[3] This is evidence for testing nearby same-lineage candidates, not a proof that differently specialized LLM checkpoints share one basin.
In concrete terms, evaluate points on the interpolation path $\theta(\alpha) = (1 - \alpha)\,\theta_A + \alpha\,\theta_B$ for several $\alpha \in [0, 1]$ rather than assuming the midpoint is usable. Two checkpoints can each be strong on their own evaluation and still interfere when combined, especially when they represent different tasks. Models with unrelated pretraining lineages are even poorer candidates for direct interpolation because their parameter coordinate systems were not preserved by a common base.
The screening logic below treats interpolation measurements as evidence, rather than granting the midpoint a pass because the sources share lineage. In a real run, losses and task_scores come from the candidate checkpoints and held-out evaluations:
```python
alphas = [0.00, 0.25, 0.50, 0.75, 1.00]
losses = [0.18, 0.20, 0.61, 0.23, 0.19]
task_scores = [0.88, 0.86, 0.70, 0.85, 0.89]

max_accepted_loss = 0.30
min_accepted_score = 0.84

accepted = [
    alpha
    for alpha, loss, score in zip(alphas, losses, task_scores)
    if loss <= max_accepted_loss and score >= min_accepted_score
]

print("accepted_alphas:", accepted)
print("midpoint_passes:", 0.50 in accepted)
```

Output:

```text
accepted_alphas: [0.0, 0.25, 0.75, 1.0]
midpoint_passes: False
```
Key insight: Same-base lineage makes a merge experiment defensible. It does not establish that a shipping-policy and refund-workflow merge retained either behavior. Only per-task evaluation does that.
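The same measurements can be summarized as a single barrier height. This is a minimal sketch under one common definition (the largest excess of the path loss over the straight line between the endpoint losses); the function name `loss_barrier` is hypothetical, and the numbers reuse the illustrative losses from the screening example rather than real measurements.

```python
def loss_barrier(alphas, losses):
    """Largest excess of path loss over the line joining the two endpoint losses.

    Assumes alphas[0] == 0.0 and alphas[-1] == 1.0.
    """
    start, end = losses[0], losses[-1]
    excess = [
        loss - ((1 - alpha) * start + alpha * end)
        for alpha, loss in zip(alphas, losses)
    ]
    return max(excess)

alphas = [0.00, 0.25, 0.50, 0.75, 1.00]
losses = [0.18, 0.20, 0.61, 0.23, 0.19]

barrier = loss_barrier(alphas, losses)
print("barrier:", round(barrier, 3))
```

A barrier near zero supports testing merged candidates along the path; a large one, as here at the midpoint, suggests the endpoints do not share a low-loss region for this loss.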
Neural networks can exhibit permutation invariance: under appropriate corresponding reordering of incoming and outgoing weights, hidden-unit permutations can preserve the function a network computes.[4] This helps explain why independently trained networks may be poorly aligned for naive averaging even if they solve similar tasks.
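Git Re-Basin[5] addresses this by searching for a permutation $\pi$ of one model's hidden units that aligns them with the other model's before merging. In its weight-matching form, the alignment and merge can be written as:

$$\pi^{*} = \arg\min_{\pi} \left\lVert \theta_A - \pi(\theta_B) \right\rVert^{2}, \qquad \theta_{\text{merged}} = (1 - \alpha)\,\theta_A + \alpha\,\pi^{*}(\theta_B)$$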
The illustration below makes the failure mode concrete. Two checkpoints can learn the same three hidden features but store them in different slot orders, so Git Re-Basin permutes one model before averaging.
Git Re-Basin uses permutation matching and reports merged independently trained MLP, CNN, and ResNet models in its studied settings, including a zero-barrier ResNet result on CIFAR-10.[5] That paper is a useful alignment concept, but it is not evidence that an arbitrary pair of large language models can be repaired and merged. For LLM work, same-base candidates plus downstream evaluation remain the practical default here.
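To make the permutation argument concrete, here is a toy sketch with a two-unit ReLU network in plain Python. It assumes the aligning permutation is already known, which is exactly the part Git Re-Basin has to estimate; the helper names are hypothetical and this is not the paper's matching algorithm.

```python
def mlp(x, w1, w2):
    """Two-layer ReLU network without biases: w2 . relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def permute_hidden(w1, w2, perm):
    """Reorder hidden units: rows of w1 and the matching entries of w2."""
    return [w1[i] for i in perm], [w2[i] for i in perm]

def average(m_a, m_b):
    """Elementwise midpoint of two (w1, w2) models."""
    (w1_a, w2_a), (w1_b, w2_b) = m_a, m_b
    w1 = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(w1_a, w1_b)]
    w2 = [(a + b) / 2 for a, b in zip(w2_a, w2_b)]
    return w1, w2

model_a = ([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0])
model_b = permute_hidden(*model_a, perm=[1, 0])  # same function, swapped slots
x = [1.0, 2.0]

naive = average(model_a, model_b)
aligned = average(model_a, permute_hidden(*model_b, perm=[1, 0]))  # undo the swap first

print("A:", mlp(x, *model_a), "B:", mlp(x, *model_b))
print("naive average:", mlp(x, *naive))
print("aligned average:", mlp(x, *aligned))
```

Both endpoints compute the same output, yet the naive midpoint does not; only the aligned average recovers it. Real networks add the hard part: the permutation is unknown and has to be searched for.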
Several techniques exist for merging models, ranging from mathematical averages to geometric interpolation rules. Choice depends on source lineage, observed delta conflict, and the evaluations the output must pass.
| Method | Mechanism | Pros | Cons | Best For |
|---|---|---|---|---|
| Model Soups / Linear | Uniform or weighted averaging | Simple baseline | Interference between conflicting parameters | Nearby checkpoints with representative evaluation |
| Task Arithmetic | Weighted task vectors | Separate coefficients per delta | Coefficients do not guarantee separate capabilities survive | Same-base task-vector experiments |
| TIES-Merging (Trim, Elect Sign, Merge) | Trim and aggregate-sign filtering | Explicitly handles conflicting delta signs | Density and scale need tuning | Conflicting same-base task vectors |
| DARE (sparsify first) | Random dropping and rescaling of task vectors | Sparse preprocessing evaluated by its paper | Not a complete merge rule on its own | Testing DARE plus a downstream merge |
| SLERP (Spherical Linear Interpolation) | Spherical interpolation in direction space | Has a norm-preserving geometric interpretation | Geometry alone does not establish task quality | Pairwise interpolation experiment |
The simplest approach to merging is called Model Soups[3]. It averages the weights of $N$ fine-tuned models to create a single model:

$$\theta_{\text{soup}} = \frac{1}{N} \sum_{i=1}^{N} \theta_i$$
To see the operation, imagine two scalars instead of billion-parameter tensors. If checkpoint A has a weight 2.0 and checkpoint B has 4.0, their equal average is 3.0. Whether that compromise retains behavior cannot be determined from this parameter alone. Model Soups evaluates averaging models fine-tuned from a shared initialization over hyperparameter configurations and reports improved accuracy and robustness in its studied settings without the inference cost of an ensemble.[3]
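The original Model Soups paper distinguishes between two selection strategies: a uniform soup, which averages every candidate checkpoint, and a greedy soup, which adds candidates one at a time in order of held-out validation accuracy and keeps each one only if the soup's validation accuracy does not drop.[3]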
In the original paper, greedy soups outperform uniform averaging in the reported experiments because the selection rule skips candidates that hurt its held-out validation metric.[3] A similar selection rule is reasonable to test only when your validation slices represent the behavior you need to keep.
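The greedy selection rule can be sketched on scalar "checkpoints". This is a minimal illustration, not the paper's code: `toy_validation_score` is a hypothetical stand-in for a held-out metric, and each checkpoint is a dict with a single parameter.

```python
def average(models):
    """Uniform average of scalar-parameter checkpoints with identical keys."""
    return {k: sum(m[k] for m in models) / len(models) for k in models[0]}

def greedy_soup(candidates, score):
    """Add candidates in descending score order; keep each only if the soup does not get worse."""
    ranked = sorted(candidates, key=score, reverse=True)
    soup = [ranked[0]]
    for candidate in ranked[1:]:
        if score(average(soup + [candidate])) >= score(average(soup)):
            soup.append(candidate)
    return average(soup), len(soup)

def toy_validation_score(model):
    # Hypothetical stand-in for held-out accuracy: best at w == 1.0.
    return -((model["w"] - 1.0) ** 2)

candidates = [{"w": 0.9}, {"w": 1.2}, {"w": 3.0}]
soup, kept = greedy_soup(candidates, toy_validation_score)

print("kept:", kept, "of", len(candidates))
print("soup w:", round(soup["w"], 3))
```

The outlier at `w = 3.0` is skipped because adding it lowers the toy score. With a real metric, the rule is only as trustworthy as the validation slice behind `score`.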
The following function demonstrates uniform averaging. It takes a list of model state dictionaries and an optional list of weights, and returns a single state dictionary containing the weighted average of their parameters:
```python
import torch

def uniform_merge(
    models: list[dict[str, torch.Tensor]],
    weights: list[float] | None = None,
) -> dict[str, torch.Tensor]:
    """Merge models by weighted averaging of parameters."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)

    if abs(sum(weights) - 1.0) >= 1e-6:
        raise ValueError("Weights must sum to 1")

    merged = {}
    for key in models[0].keys():
        merged[key] = sum(w * m[key] for w, m in zip(weights, models))

    return merged

code_model = {"w": torch.tensor([2.0, 4.0]), "bias": torch.tensor([1.0])}
math_model = {"w": torch.tensor([4.0, 2.0]), "bias": torch.tensor([3.0])}
merged = uniform_merge([code_model, math_model])

print("weights_ok:", bool(torch.allclose(merged["w"], torch.tensor([3.0, 3.0]))))
print("bias_ok:", bool(torch.allclose(merged["bias"], torch.tensor([2.0]))))
print("merged w:", merged["w"].tolist())
print("merged bias:", merged["bias"].tolist())
```

Output:

```text
weights_ok: True
bias_ok: True
merged w: [3.0, 3.0]
merged bias: [2.0]
```

Uniform averaging is exceptionally simple to implement and requires no hyperparameters beyond the optional weighting scheme.
If the models are fine-tuned on wildly different tasks, direct averaging can cause destructive interference between conflicting parameters, degrading overall performance.
Instead of averaging absolute weights, Task Arithmetic[6] operates on task vectors (the difference between fine-tuned weights and the base model). Think of task vectors like a set of specific directions (e.g., "walk 10 steps north"). Instead of averaging the final destinations of different hikers, you take the specific path each hiker took from base camp and combine them. By scaling these paths (e.g., "take half the steps north"), you can carefully mix different skills.
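$$\theta_{\text{merged}} = \theta_{\text{base}} + \sum_{i} \lambda_i \tau_i, \qquad \tau_i = \theta_i - \theta_{\text{base}}$$

where $\theta_{\text{base}}$ is the base model, $\theta_i$ are fine-tuned models, $\tau_i$ are per-task deltas, and $\lambda_i$ are merge weights.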
Before you run this on billion-parameter tensors, try it on a single parameter. Imagine a base model where one weight is 2.0. A Python expert fine-tune pushes that weight to 2.5, so its task vector is +0.5. A math expert fine-tune pushes it to 1.5, so its task vector is -0.5. If you want a model that is slightly better at Python but keeps its math ability, you might set $\lambda_{\text{python}} = 0.6$ and $\lambda_{\text{math}} = 0.4$:

$$\theta_{\text{merged}} = 2.0 + 0.6 \times (+0.5) + 0.4 \times (-0.5) = 2.0 + 0.3 - 0.2 = 2.1$$
The result, 2.1, is a parameter nudge in the Python-delta direction. If you had instead averaged the absolute fine-tuned weights (2.5 and 1.5), you'd get 2.0, cancelling both deltas at this coordinate. A scalar calculation cannot establish how either task performs; it only shows how task-vector coefficients act on weights.
You can use the same arithmetic for logistics. Suppose a shipping-policy fine-tune nudges a weight to 2.4 (task vector +0.4) and a refund-workflow fine-tune nudges it to 1.8 (task vector -0.2). Setting $\lambda_{\text{shipping}} = 0.7$ and $\lambda_{\text{refund}} = 0.5$ gives 2.0 + 0.28 - 0.10 = 2.18. Whether this coefficient pair keeps either workflow is measured after the merge.
The task_arithmetic_merge function below shows how this works in practice. It takes a shared base model, a list of fine-tuned models, and scaling coefficients. It subtracts the base to get task vectors, applies a scaling coefficient, and adds them back to return the merged model:
```python
import torch

def task_arithmetic_merge(
    base_model: dict[str, torch.Tensor],
    fine_tuned_models: list[dict[str, torch.Tensor]],
    scaling_coefficients: list[float]
) -> dict[str, torch.Tensor]:
    """Merge via task vectors (differences from base model).

    Args:
        base_model: The shared base model weights
        fine_tuned_models: List of fine-tuned model weights
        scaling_coefficients: Per-task scaling factors (lambda_i)
    """
    merged = {k: v.clone() for k, v in base_model.items()}

    for model, coeff in zip(fine_tuned_models, scaling_coefficients):
        for key in merged:
            task_vector = model[key] - base_model[key]
            merged[key] += coeff * task_vector

    return merged

shared_base = {"w": torch.tensor([2.0])}
code_ft = {"w": torch.tensor([2.5])}
math_ft = {"w": torch.tensor([1.5])}

# Example: 0.6 x code task vector + 0.4 x math task vector
merged = task_arithmetic_merge(
    base_model=shared_base,
    fine_tuned_models=[code_ft, math_ft],
    scaling_coefficients=[0.6, 0.4],
)

print("base:", float(shared_base["w"]))
print("matches_expected:", bool(torch.allclose(merged["w"], torch.tensor([2.1]))))
print("merged:", round(float(merged["w"]), 2))
```

Output:

```text
base: 2.0
matches_expected: True
merged: 2.1
```

Scaling coefficients control how much of each task vector enters the candidate checkpoint; evaluation determines retained behavior.
A major issue with simple averaging is that task vectors can directly contradict one another (e.g., one model increases a weight by 0.5, while another decreases it by 0.5). TIES-Merging[7] addresses two forms of interference in its recipe: it trims low-magnitude deltas, elects a sign for each coordinate using total signed movement, and merges only values aligned with that elected sign.
Think of this process like a logistics operations committee proposing numeric budget changes. TIES resolves a coordinate in three stages. First, discard small proposals (Trim). Next, sum the size of increases against the size of decreases, rather than counting voters (Elect sign). Finally, average only proposals in the winning direction (Disjoint merge).
First, it drops low-magnitude changes treated as redundant by the recipe. The trim function below takes a task vector and a density threshold, then keeps high-magnitude updates:
```python
import torch

def trim(task_vector: torch.Tensor, density: float = 0.2) -> torch.Tensor:
    """Keep only the top-k% of parameter changes by magnitude."""
    threshold = torch.quantile(task_vector.abs(), 1 - density)
    mask = task_vector.abs() >= threshold
    return task_vector * mask

task_vector = torch.tensor([0.1, -0.5, 0.02, 0.8])
trimmed = trim(task_vector, density=0.5)

print("matches_expected:", bool(torch.allclose(trimmed, torch.tensor([0.0, -0.5, 0.0, 0.8]))))
print("trimmed:", [round(float(x), 2) for x in trimmed])
```

Output:

```text
matches_expected: True
trimmed: [0.0, -0.5, 0.0, 0.8]
```

When multiple task vectors update the same parameter, TIES-Merging resolves conflicting directions by total magnitude. The elect_sign function sums the trimmed signed deltas and takes the resulting sign. This differs from a majority vote: one large update can outweigh two smaller opposing updates.
```python
import torch

def elect_sign(trimmed_vectors: list[torch.Tensor]) -> torch.Tensor:
    """Elect sign with greatest aggregate signed movement per parameter."""
    aggregate_delta = sum(trimmed_vectors)
    return torch.sign(aggregate_delta)

trimmed_vectors = [
    torch.tensor([0.0, -0.5, 0.0, 0.8]),
    torch.tensor([0.0, 1.1, 0.0, -0.6]),
    torch.tensor([0.0, -0.4, 0.0, 0.7]),
]
elected = elect_sign(trimmed_vectors)

# Count the negative "votes" at the second parameter instead of hard-coding the result.
negative_votes_at_p2 = sum(int(tv[1] < 0) for tv in trimmed_vectors)
print("two_negative_votes_at_p2:", negative_votes_at_p2 == 2)
print("positive_mass_wins_at_p2:", bool(elected[1] == 1))
print("matches_expected:", bool(torch.allclose(elected, torch.tensor([0.0, 1.0, 0.0, 1.0]))))
print("elected signs:", elected.tolist())
```

Output:

```text
two_negative_votes_at_p2: True
positive_mass_wins_at_p2: True
matches_expected: True
elected signs: [0.0, 1.0, 0.0, 1.0]
```

Finally, it computes the average of only those task vectors whose updates match the consensus direction. The disjoint_merge function takes trimmed task vectors and their elected signs. It averages only the parameter updates that agree with the elected direction, zeroing out any dissenting values to produce the final merged task vector:
```python
import torch

def disjoint_merge(
    trimmed_vectors: list[torch.Tensor],
    elected_signs: torch.Tensor
) -> torch.Tensor:
    """Average only non-zero values that agree with the elected sign."""
    merged = torch.zeros_like(trimmed_vectors[0])
    counts = torch.zeros_like(trimmed_vectors[0])

    for tv in trimmed_vectors:
        agree = (tv != 0) & (elected_signs != 0) & (torch.sign(tv) == elected_signs)
        # Use an explicit zero tensor so this works across PyTorch versions.
        merged += torch.where(agree, tv, torch.zeros_like(tv))
        counts += agree.float()

    return torch.where(
        counts > 0,
        merged / counts.clamp(min=1),
        torch.zeros_like(merged)
    )

trimmed_vectors = [
    torch.tensor([0.0, -0.5, 0.0, 0.8]),
    torch.tensor([0.0, 1.1, 0.0, -0.6]),
    torch.tensor([0.0, -0.4, 0.0, 0.7]),
]
elected_signs = torch.tensor([0.0, 1.0, 0.0, 1.0])
merged = disjoint_merge(trimmed_vectors, elected_signs)

print("matches_expected:", bool(torch.allclose(merged, torch.tensor([0.0, 1.1, 0.0, 0.75]))))
print("merged task vector:", [round(float(x), 2) for x in merged])
```

Output:

```text
matches_expected: True
merged task vector: [0.0, 1.1, 0.0, 0.75]
```

TIES-Merging outperforms compared baselines in the paper's evaluated vision and T5 task-vector settings, and its analysis highlights sign interference.[7] For an LLM merge, use it as a candidate recipe when task vectors conflict, then compare per-task evaluation against simpler baselines.
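The three TIES stages can also be chained into one pass and applied back onto a base vector. This is a plain-Python sketch with lists standing in for tensors: `ties_merge`, the untrimmed task vectors, and the scale `lam` are illustrative assumptions, and the top-k trim here is a simplification that does not reproduce quantile interpolation.

```python
def trim(tv, density):
    """Keep the largest-magnitude entries; zero the rest (ties at the cutoff are kept)."""
    k = max(1, round(len(tv) * density))
    cutoff = sorted((abs(x) for x in tv), reverse=True)[k - 1]
    return [x if abs(x) >= cutoff else 0.0 for x in tv]

def ties_merge(base, task_vectors, density=0.5, lam=1.0):
    """Trim, elect a sign per coordinate, average agreeing values, add to base."""
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for column in zip(*trimmed):
        total = sum(column)
        # The elected sign follows summed signed movement, not a head count.
        agreeing = [x for x in column if x != 0 and (x > 0) == (total > 0)]
        merged.append(sum(agreeing) / len(agreeing) if total != 0 and agreeing else 0.0)
    return [b + lam * m for b, m in zip(base, merged)]

base = [1.0, 1.0, 1.0, 1.0]
task_vectors = [
    [0.10, -0.5, 0.02, 0.8],
    [0.05, 1.1, -0.01, -0.6],
    [-0.03, -0.4, 0.02, 0.7],
]
result = ties_merge(base, task_vectors, density=0.5, lam=1.0)
print("merged weights:", [round(x, 2) for x in result])
```

Coordinates where every delta was trimmed stay at the base value, which is one way to sanity-check a real run: count how many parameters actually moved.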
DARE randomly drops entries from each task vector and rescales the survivors before a downstream merge. The rescaling preserves an entry's expected delta under the random mask; it does not by itself prove that the resulting model preserves a capability.
The paper studies redundancy in supervised fine-tuning (SFT) deltas and reports that its evaluated models can often tolerate dropping 90% of delta entries, and in some cases 99%, before merging. Its size ablation reports that WizardMath-70B remains effective at a 99% drop rate while the evaluated 7B and 13B variants fail there.[8] Treat that as experimental evidence for the paper's SFT models, not a default density setting for a new merge. DARE is a sparsification step, not a complete merge recipe: sparsified task vectors still need a merger such as averaging or TIES.
DARE's analysis also separates SFT deltas, which it observes are typically within roughly 0.002, from continued-pretraining deltas that approach 0.03; its drop-and-rescale approach becomes ineffective on the latter.[8] One candidate pipeline is DARE-TIES: DARE sparsifies each delta, then TIES resolves directional conflicts among surviving values. Mergekit exposes that composition as dare_ties.[1]
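$$\tilde{\tau} = \frac{m \odot \tau}{1 - p}, \qquad m_j \sim \text{Bernoulli}(1 - p)$$

where $m$ is a random binary mask with drop rate $p$ and the factor $1/(1 - p)$ preserves each entry's expectation under masking.[8]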
The dare_sparsify function below shows the DARE step itself. It sparsifies one task vector, after which you can pass the result to a downstream merge rule:
```python
import torch

def dare_sparsify(
    task_vector: torch.Tensor,
    drop_rate: float = 0.9
) -> torch.Tensor:
    """DARE preprocessing for one task vector."""
    keep_prob = 1.0 - drop_rate
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    mask = torch.bernoulli(torch.full_like(task_vector, keep_prob))
    return (task_vector * mask) / keep_prob

torch.manual_seed(4)
task_vector = torch.tensor([0.2, -0.4, 0.1, 0.6])
sparsified = dare_sparsify(task_vector, drop_rate=0.5)

shape_ok = sparsified.shape == task_vector.shape
finite_ok = bool(torch.isfinite(sparsified).all())
binary_mask_ok = set(torch.unique((sparsified != 0).int()).tolist()).issubset({0, 1})
makes_invalid_drop_rate_fail = False

try:
    dare_sparsify(task_vector, drop_rate=1.0)
except ValueError as exc:
    makes_invalid_drop_rate_fail = "drop_rate" in str(exc)

print("shape_ok:", shape_ok)
print("finite_ok:", finite_ok)
print("binary_mask_ok:", binary_mask_ok)
print("invalid_drop_rate_rejected:", makes_invalid_drop_rate_fail)
print("original nonzero:", int((task_vector != 0).sum()))
print("sparsified nonzero:", int((sparsified != 0).sum()))
print("sparsified:", [0.0 if abs(float(x)) < 1e-8 else round(float(x), 2) for x in sparsified])
```

Output:

```text
shape_ok: True
finite_ok: True
binary_mask_ok: True
invalid_drop_rate_rejected: True
original nonzero: 4
sparsified nonzero: 2
sparsified: [0.0, 0.0, 0.2, 1.2]
```

On the SFT models and tasks it evaluates, DARE finds substantial redundancy in task-vector entries and improves several downstream merge methods after drop-and-rescale preprocessing.[8] For a new checkpoint family, density remains a tuned parameter: compare unsparsified and DARE-preprocessed merges on every required task slice.
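The expectation-preserving claim about the rescale factor can be checked empirically. This is a minimal sketch that reimplements drop-and-rescale in plain Python with a seeded `random.Random`; the trial count and the 0.9 drop rate are arbitrary choices for illustration.

```python
import random

def dare_sparsify(task_vector, drop_rate, rng):
    """Drop each entry with probability drop_rate; rescale survivors by 1 / (1 - drop_rate)."""
    keep_prob = 1.0 - drop_rate
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in task_vector]

rng = random.Random(0)
task_vector = [0.2, -0.4, 0.1, 0.6]
trials = 20_000

# Average many independent masked-and-rescaled draws per coordinate.
totals = [0.0] * len(task_vector)
for _ in range(trials):
    for i, value in enumerate(dare_sparsify(task_vector, 0.9, rng)):
        totals[i] += value
means = [t / trials for t in totals]

print("original:      ", task_vector)
print("empirical mean:", [round(m, 2) for m in means])
```

Any single draw is far from the original (each survivor is ten times larger at this drop rate); only the average matches. That gap is why a sparsified merge still needs task-level evaluation.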
SLERP is a geometric interpolation rule. It follows an angular arc between normalized vector directions rather than the chord used by linear interpolation. It is not an algorithm for discovering a low-loss path around an incompatible-model ridge.
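Rather than interpolating directions along a straight line, SLERP[9] interpolates along a sphere. The cleanest way to write the geometry is in terms of normalized directions:

$$\operatorname{slerp}(\hat{v}_0, \hat{v}_1; t) = \frac{\sin((1 - t)\,\Omega)}{\sin \Omega}\,\hat{v}_0 + \frac{\sin(t\,\Omega)}{\sin \Omega}\,\hat{v}_1$$

where $t$ is the interpolation factor (from 0 to 1) and $\Omega = \arccos(\hat{v}_0 \cdot \hat{v}_1)$ is the angle between normalized weight vectors. Practical merge implementations may handle magnitude separately after interpolating direction, which is what the code below does.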
SLERP was introduced for computer graphics to interpolate rotations represented as quaternions.[9] That source establishes its geometry, not downstream quality for neural-network weight merges. Use a SLERP checkpoint as another candidate and measure loss and required tasks just as you would for a linear merge.
The slerp function demonstrates this geometric interpolation. It takes two unnormalized vectors and an interpolation factor t, projects them onto a unit sphere, computes the interpolation, and returns the combined vector:
```python
import torch

def slerp(
    v0: torch.Tensor,
    v1: torch.Tensor,
    t: float
) -> torch.Tensor:
    """Practical SLERP for a single weight tensor."""
    flat_v0 = v0.flatten()
    flat_v1 = v1.flatten()

    v0_mag = flat_v0.norm()
    v1_mag = flat_v1.norm()
    if v0_mag.item() == 0 or v1_mag.item() == 0:
        return (1 - t) * v0 + t * v1

    # Separate direction from magnitude
    v0_dir = flat_v0 / v0_mag
    v1_dir = flat_v1 / v1_mag

    dot = torch.dot(v0_dir, v1_dir)
    omega = torch.acos(torch.clamp(dot, -1.0, 1.0))

    # Fall back to linear interpolation for near-identical vectors
    if omega.abs().item() < 1e-6:
        return (1 - t) * v0 + t * v1

    sin_omega = torch.sin(omega)
    direction = (
        torch.sin((1 - t) * omega) / sin_omega * v0_dir +
        torch.sin(t * omega) / sin_omega * v1_dir
    )
    magnitude = (1 - t) * v0_mag + t * v1_mag
    return direction.view_as(v0) * magnitude

v0 = torch.tensor([1.0, 0.0])
v1 = torch.tensor([0.0, 1.0])
mid = slerp(v0, v1, t=0.5)

print("unit_norm:", bool(torch.allclose(mid.norm(), torch.tensor(1.0), atol=1e-6)))
print("matches_45_degree:", bool(torch.allclose(mid, torch.tensor([2**-0.5, 2**-0.5]), atol=1e-6)))
print("midpoint:", [round(float(x), 4) for x in mid])
print("norm:", round(float(mid.norm()), 4))
```

Output:

```text
unit_norm: True
matches_45_degree: True
midpoint: [0.7071, 0.7071]
norm: 1.0
```

In the equal-norm orthogonal-vector example above, the SLERP midpoint remains on the unit circle while a linear midpoint would have smaller norm. That is a geometric property of this example, not evidence that the midpoint preserves either model's behavior. For a pair of checkpoints, evaluate linear and spherical candidates against the same release gates.
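The norm difference between the two interpolation rules can be tabulated directly. This is a small standard-library sketch for the same orthogonal unit vectors; `slerp_unit` is a hypothetical helper that assumes unit-length inputs that are neither parallel nor opposite.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def lerp(v0, v1, t):
    """Straight-line (chord) interpolation."""
    return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

def slerp_unit(v0, v1, t):
    """SLERP between two unit vectors that are neither parallel nor opposite."""
    dot = sum(a * b for a, b in zip(v0, v1))
    omega = math.acos(max(-1.0, min(1.0, dot)))
    s = math.sin(omega)
    return [
        math.sin((1 - t) * omega) / s * a + math.sin(t * omega) / s * b
        for a, b in zip(v0, v1)
    ]

v0, v1 = [1.0, 0.0], [0.0, 1.0]
for t in (0.25, 0.5, 0.75):
    print(t, round(norm(lerp(v0, v1, t)), 4), round(norm(slerp_unit(v0, v1, t)), 4))
```

The linear norm dips toward the midpoint while the spherical norm stays at 1. The dip shrinks as the angle between the vectors shrinks, so for nearby checkpoints the two candidates can be nearly identical.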
mergekit[1] is an open-source toolkit for merging language-model checkpoints. Its documented CLI supports YAML-defined merges with CPU or limited-VRAM execution, and mergekit-multi can run multi-stage recipes where later merges consume earlier outputs.[1]
To use mergekit, you typically define a YAML configuration file that specifies the base model, the fine-tuned source models, their respective merging coefficients, and the desired algorithm. This configuration acts as the input to the CLI tool to generate the merged model:
```yaml
# mergekit config: merge_config.yml
models:
  - model: your-org/base-8b-shipping
    parameters:
      weight: 0.35
  - model: your-org/base-8b-refund
    parameters:
      weight: 0.35
  - model: your-org/base-8b-catalog
    parameters:
      weight: 0.30

merge_method: dare_ties  # alternatives such as ties, linear, or slerp take different parameters
base_model: your-org/base-8b
tokenizer:
  source: base  # switch to union if you must preserve extra tokens
chat_template: auto  # or pin a specific template when model families differ
parameters:
  density: 0.5  # fraction of delta parameters to keep
dtype: float16
```

If all required tokens are already in the base tokenizer, tokenizer.source: base pins the output vocabulary to that base. Modern mergekit configuration defaults to a union tokenizer, which adds tokens present in source vocabularies and assigns fallback embeddings where an input model lacks them.[1] Either policy is an output-space decision that needs targeted evaluation.
Common Mistake: Assuming all logistics checkpoints use the same tokenizer because they began from the same base. If the catalog checkpoint added SKU tokens (`<SKU-12345>`), choosing `base` drops those added output entries, while choosing `union` introduces filled embeddings for models that lack them. Verify tokenizer vocabularies, choose the output policy explicitly, and test SKU-heavy prompts.
This preflight check makes the output-vocabulary decision explicit before any expensive merge runs:
```python
def output_vocab(base_vocab, source_vocabs, policy):
    if policy == "base":
        return set(base_vocab)
    if policy == "union":
        return set().union(*source_vocabs)
    raise ValueError("policy must be 'base' or 'union'")

base_vocab = {"<bos>", "shipping", "refund"}
catalog_vocab = base_vocab | {"<SKU-12345>"}
required_tokens = {"shipping", "<SKU-12345>"}

base_output = output_vocab(base_vocab, [base_vocab, catalog_vocab], "base")
union_output = output_vocab(base_vocab, [base_vocab, catalog_vocab], "union")

print("base_missing_required:", sorted(required_tokens - base_output))
print("union_missing_required:", sorted(required_tokens - union_output))
print("union_requires_added_embedding_eval:", "<SKU-12345>" not in base_vocab)
```

Output:

```text
base_missing_required: ['<SKU-12345>']
union_missing_required: []
union_requires_added_embedding_eval: True
```

Once the configuration is set, you can execute the merge using the CLI tool. The command below takes the YAML configuration file and the output path, producing the final merged model on disk:
```bash
# Run merge on a local GPU
mergekit-yaml merge_config.yml ./output_model --cuda
```

For hardware-specific and memory-saving flags, check mergekit-yaml --help because supported options vary by version.[1]
Mergekit lets engineers build candidate checkpoints without a new gradient-training run. Iterate over parameters such as retained density and task weights only against a defined evaluation suite.
The diagram below shows the iterative validation loop you should run after every merge: evaluate each target slice, compare it to declared gates and source baselines, and retune or reject any candidate that misses a critical threshold.
An aggregate score is useful for ranking candidates, but a release decision should fail on any critical threshold miss:
```python
def failed_release_gates(scores, thresholds):
    return {
        task: (scores[task], minimum)
        for task, minimum in thresholds.items()
        if scores[task] < minimum
    }

scores = {"shipping": 92.4, "refunds": 82.1, "catalog": 87.0}
thresholds = {"shipping": 90.0, "refunds": 85.0, "catalog": 84.0}
failures = failed_release_gates(scores, thresholds)

print("aggregate_score:", round(sum(scores.values()) / len(scores), 1))
print("failed_tasks:", sorted(failures))
print("promote:", not failures)
```

Output:

```text
aggregate_score: 87.2
failed_tasks: ['refunds']
promote: False
```

Manually searching merge coefficients and layer selections can be expensive. Evolutionary model merging[10] applies evolutionary search to optimize merge recipes against a supplied fitness evaluation. Its search space can include per-layer source choices, interpolation weights, and whether to merge in parameter space, data-flow space, or both.
Sakana AI reports using this approach to build EvoLLM-JP from Japanese-language and math-oriented models, optimizing for the evaluations selected in that work.[10] The principle is straightforward: treat each merge configuration as a "genome," score it on validation data, mutate its parameters, and retain better-scoring candidates. The resulting model inherits the objective's coverage and blind spots, so release gates still need independent slices.
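The genome, score, mutate, retain loop can be sketched in a few lines. This is a deliberately simple elitist hill climb, not the search algorithm used in [10]: `toy_fitness` is a hypothetical stand-in for a validation score, and the genome is just two merge coefficients clipped to [0, 1].

```python
import random

def toy_fitness(genome):
    # Hypothetical stand-in for a validation score; peaks at coefficients (0.7, 0.5).
    target = (0.7, 0.5)
    return -sum((g - t) ** 2 for g, t in zip(genome, target))

def mutate(genome, rng, scale=0.1):
    """Add Gaussian noise to each coefficient and clip to [0, 1]."""
    return tuple(min(1.0, max(0.0, g + rng.gauss(0.0, scale))) for g in genome)

def evolve(fitness, generations=30, population=8, seed=0):
    rng = random.Random(seed)
    best = (0.5, 0.5)  # start from equal coefficients
    history = [fitness(best)]
    for _ in range(generations):
        children = [mutate(best, rng) for _ in range(population)]
        best = max(children + [best], key=fitness)  # elitism: never lose the incumbent
        history.append(fitness(best))
    return best, history

best, history = evolve(toy_fitness)
print("best genome:", tuple(round(g, 2) for g in best))
print("fitness improved:", history[-1] >= history[0])
```

In a real search, each fitness call means building and evaluating a merged checkpoint, so the population size and generation count are a compute budget decision.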
In practice, a small grid search is a transparent baseline: try lambda values in [0.2, 0.4, 0.6, 0.8] for each task vector, measure each required slice on held-out data, and retain only configurations that clear all release gates. For a logistics merge, those slices should include carrier-policy questions, refund scenarios, and catalog lookups; a shipping-only objective does not protect refund behavior.
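A grid search gated on every slice might look like the sketch below. The scoring function `toy_evaluate` is entirely hypothetical (a linear formula with mild cross-task interference) and stands in for building each merge and running held-out evaluations; the gate values are also illustrative.

```python
from itertools import product

def toy_evaluate(l_ship, l_refund):
    # Hypothetical scores with mild cross-task interference; replace with held-out evals.
    return {
        "shipping": 80 + 15 * l_ship - 5 * l_refund,
        "refunds": 78 + 15 * l_refund - 5 * l_ship,
    }

gates = {"shipping": 85.0, "refunds": 85.0}
grid = [0.2, 0.4, 0.6, 0.8]

accepted = []
for l_ship, l_refund in product(grid, repeat=2):
    scores = toy_evaluate(l_ship, l_refund)
    # Keep a configuration only if every required slice clears its gate.
    if all(scores[task] >= minimum for task, minimum in gates.items()):
        accepted.append((l_ship, l_refund))

print("configurations tried:", len(grid) ** 2)
print("accepted:", accepted)
```

Under this toy formula, a shipping-heavy setting such as (0.8, 0.2) scores well on shipping and still fails the refund gate, which is the pattern the per-slice check exists to catch.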
Not every merge averages weights. Mergekit's passthrough method copies selected tensors or layer ranges from source models into the output checkpoint.[1] This is model splicing rather than interpolation: tensor dimensions must compose, while useful behavior across the splice remains an evaluation result.
For example, a recipe might copy layers 0-19 from checkpoint A and layers 20-31 from checkpoint B. Compatible dimensions make that artifact constructible; they do not show that B's later layers can interpret the hidden states produced by A's earlier layers.
Use passthrough as an experiment with explicit source lineage, layer boundaries, and the same per-task gates used for any other candidate. A successful build only proves shape compatibility.
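The layer-range splice described above can be sketched on labeled placeholders. This is not mergekit's implementation or its configuration syntax: `splice_layers` is a hypothetical helper, and each "layer" is only a string recording which checkpoint it came from.

```python
def splice_layers(layers_a, layers_b, boundary):
    """Copy layers [0, boundary) from A and [boundary, end) from B."""
    if len(layers_a) != len(layers_b):
        raise ValueError("this sketch assumes equal layer counts")
    if not 0 <= boundary <= len(layers_a):
        raise ValueError("boundary out of range")
    return layers_a[:boundary] + layers_b[boundary:]

# Hypothetical stand-ins: each "layer" is just a label recording its source.
layers_a = [f"A{i}" for i in range(32)]
layers_b = [f"B{i}" for i in range(32)]
spliced = splice_layers(layers_a, layers_b, boundary=20)

print("layers:", len(spliced))
print("around the seam:", spliced[18:22])
```

Recording the seam location alongside the artifact makes it easier to attribute a regression to the splice boundary when per-task gates fail.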
Direct weight merging requires aligned parameter meaning, shape, and output-space assumptions. You cannot directly interpolate a 7-billion-parameter checkpoint with a 70-billion-parameter checkpoint, or silently combine incompatible vocabulary mappings and treat token IDs as equivalent. A generated merged artifact is a candidate, not evidence of retained quality.
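You should avoid merging when the sources differ in architecture or parameter count, when their vocabularies are incompatible and no explicit tokenizer policy has been chosen, when they do not share a base lineage, or when you have no per-task evaluation suite to gate the result.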
When a merge goes wrong, the model usually tells you quickly. Here are three common failure patterns, their causes, and how to fix them:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Output is incoherent or random tokens | Tokenizer/output mapping, prompt template, or architecture mismatch | Verify output vocabulary policy, chat template, and layer shapes |
| Model loops repetitive text | Coefficient overload or weights pushed too far out of distribution | Reduce coefficient magnitudes, start additive lambdas in the 0.0-1.0 range, and re-run evals |
| Merged model is worse than every source | Source interference or incompatible lineage | Check source lineage and tokenizer policy; prefer compatible sources and retune or reject the merge |
Do not predict quality from the merge rule alone. Model Soups[3] reported gains over the best individual checkpoint in its evaluated shared-initialization experiments, including ImageNet settings. That result motivates trying a merge; it does not predict whether a new LLM merge will clear its specialist task gates.
After merging, run a full evaluation suite across all target tasks, not only the aggregate score. Strong shipping accuracy can mask a miss in refund handling. Track per-task metrics independently and compare against declared thresholds and each source model's baseline.
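Comparing against source baselines can be automated alongside the threshold gates. In this sketch, the per-source baseline numbers are invented for illustration, `regressions` is a hypothetical helper, and the one-point tolerance is an arbitrary choice.

```python
def regressions(merged_scores, source_baselines, tolerance=1.0):
    """Tasks where the merge trails the best source score by more than `tolerance` points."""
    report = {}
    for task, merged in merged_scores.items():
        best_source = max(scores[task] for scores in source_baselines.values())
        if best_source - merged > tolerance:
            report[task] = round(best_source - merged, 2)
    return report

# Hypothetical per-source baselines; real values come from evaluating each specialist.
source_baselines = {
    "shipping_ft": {"shipping": 93.0, "refunds": 70.2, "catalog": 71.5},
    "refund_ft": {"shipping": 74.1, "refunds": 88.6, "catalog": 69.9},
    "catalog_ft": {"shipping": 72.8, "refunds": 68.4, "catalog": 87.3},
}
merged_scores = {"shipping": 92.4, "refunds": 82.1, "catalog": 87.0}

report = regressions(merged_scores, source_baselines)
print("regressed tasks:", report)
```

A merge can clear an absolute threshold and still sit well below its best specialist, so the two checks answer different questions and are worth reporting separately.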
Why is weight averaging worth testing at all? Shouldn't mixing model weights produce garbage? Shared-initialization checkpoints provide a plausible starting point because studies such as Model Soups observe useful averages in evaluated settings. Shared lineage is not a quality proof: sample the interpolation path and run task gates. Independently trained models can also encode equivalent functions with incompatible neuron orderings, making direct averaging particularly suspect.
How do you choose merge coefficients for three or more models?
Start with equal weights as a baseline, then tune against a held-out validation set that reflects your target task mixture. Track each target capability separately. If equal weights cause interference, try task-vector coefficients, TIES, DARE-TIES, or a small grid search over weights and density values.
When should you merge models instead of using mixture-of-experts routing?
Merge when you want one dense checkpoint with roughly one-model serving cost, simpler rollback, and no router path. Use MoE or explicit routing when tasks are highly distinct, specialists must stay sharp, or you can afford multiple experts plus routing complexity.
How does LoRA adapter merging differ from full-weight merging?
LoRA adapters are low-rank delta weights attached to a frozen base, so adapter merging operates on a much smaller parameter set. That makes experiments cheaper, but it doesn't remove interference: two adapters can still push the same target matrix in conflicting directions.
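To make the interference point concrete, the sketch below expands two toy rank-1 adapters into full deltas using the usual LoRA convention, delta = (alpha / rank) * B @ A, and measures how often they disagree in sign on the same target matrix. The helper names and all matrix values are hypothetical, and sign disagreement is only one rough proxy for interference.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_delta(B: Matrix, A: Matrix, alpha: float, rank: int) -> Matrix:
    """Expand a low-rank adapter into the full-size weight update it implies."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def sign_conflict_fraction(d1: Matrix, d2: Matrix) -> float:
    """Fraction of entries where the two deltas push in opposite directions."""
    pairs = [(u, v) for r1, r2 in zip(d1, d2) for u, v in zip(r1, r2)]
    conflicts = sum(1 for u, v in pairs if u * v < 0)
    return conflicts / len(pairs)


# Two toy rank-1 adapters targeting the same 2x2 weight matrix.
delta_refunds = lora_delta([[1.0], [2.0]], [[1.0, -1.0]], alpha=1.0, rank=1)
delta_shipping = lora_delta([[-1.0], [1.0]], [[1.0, 1.0]], alpha=1.0, rank=1)

print(delta_refunds)
print(delta_shipping)
print(sign_conflict_fraction(delta_refunds, delta_shipping))
```

The adapters themselves are tiny, but the conflict is measured on the expanded deltas, which is where the merged update actually lands.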
What should you check first when a merge outputs incoherent text?
Check architecture, tensor shapes, tokenizer output policy, and chat template before coefficient tuning. These checks identify whether inputs and output IDs still mean what the merge recipe assumed.
The best way to learn merging is to reason through a concrete scenario before you touch a GPU.
Question: Why can't we average the weights of GPT-2 and Llama-3?
Reasoning: Even if you ignored the parameter-count mismatch, the two models have different architectures and different tokenizers. Their parameter entries therefore do not represent an aligned interpolation space. Same-base checkpoints are conservative direct-merge candidates; alignment methods such as Git Re-Basin provide research evidence in studied neural-network architectures, not a generic way to make GPT-2 and Llama compatible.
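Misaligned coordinates can break averaging even when the architecture matches. The toy sketch below builds a two-unit ReLU network, permutes its hidden units to get a second network that computes the same function, and then averages the two. All names and values are illustrative; this demonstrates the permutation symmetry that alignment methods target, not the Git Re-Basin algorithm itself.

```python
from typing import List


def forward(W1: List[List[float]], w2: List[float], x: List[float]) -> float:
    """Tiny MLP: output = w2 . relu(W1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * h for w, h in zip(w2, hidden))


# Original network.
W1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [1.0, -1.0]

# Same function with the two hidden units swapped.
W1_perm = [W1[1], W1[0]]
w2_perm = [w2[1], w2[0]]

# Naive element-wise average of the two parameter sets.
W1_avg = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(W1, W1_perm)]
w2_avg = [(a + b) / 2 for a, b in zip(w2, w2_perm)]

x = [2.0, 1.0]
print(forward(W1, w2, x))            # 1.0
print(forward(W1_perm, w2_perm, x))  # 1.0, identical function
print(forward(W1_avg, w2_avg, x))    # 0.0, the average computes something else
```

Both source networks agree on every input, yet their average does not, because the same parameter index refers to different hidden units in each.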
You have a base model and two fine-tuned variants:
- code_ft improves Python accuracy from 40% to 70%
- math_ft improves algebra accuracy from 35% to 65%

You want a merged model that scores at least 60% on both tasks. You try Task Arithmetic with coefficients λ_code and λ_math.
Given the table below, which coefficient pair is the best starting point?
| λ_code | λ_math | Python score | Algebra score |
|---|---|---|---|
| 0.8 | 0.8 | 64% | 59% |
| 0.6 | 0.6 | 58% | 53% |
| 0.5 | 0.5 | 55% | 50% |
| 1.0 | 0.0 | 70% | 35% |
Answer: None of the listed pairs reaches 60% on both tasks. The best starting point is 0.8, 0.8 because it gets closest at 64% Python and 59% algebra, which tells you the merge is almost there but still suffering from interference. In practice, you'd try TIES or DARE next, or continue tuning coefficients around that region, to lift algebra without giving up too much Python performance.
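The Task Arithmetic rule used in this exercise adds scaled task vectors (fine-tuned weights minus base weights) back onto the base. Below is a minimal sketch over flat lists; the function names and toy weights are hypothetical, and the numbers are unrelated to the accuracy table above.

```python
from typing import List, Sequence


def task_vector(finetuned: List[float], base: List[float]) -> List[float]:
    """Task vector: what fine-tuning changed relative to the base."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_task_vectors(
    base: List[float],
    vectors: Sequence[List[float]],
    lambdas: Sequence[float],
) -> List[float]:
    """Return base + sum_i lambda_i * vector_i."""
    if len(vectors) != len(lambdas):
        raise ValueError("need one coefficient per task vector")
    merged = list(base)
    for vector, lam in zip(vectors, lambdas):
        merged = [m + lam * v for m, v in zip(merged, vector)]
    return merged


# Toy weights standing in for full checkpoints.
base = [1.0, 2.0]
code_ft = [2.0, 2.0]
math_ft = [1.0, 4.0]

vectors = [task_vector(code_ft, base), task_vector(math_ft, base)]
for lambdas in [(0.8, 0.8), (0.5, 0.5), (1.0, 0.0)]:
    print(lambdas, apply_task_vectors(base, vectors, lambdas))
```

With coefficients (1.0, 0.0) the rule reproduces `code_ft` exactly, which matches the last table row where Python accuracy equals the specialist's and algebra stays at the base level.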
You merge three e-commerce support models (refunds, shipping, catalog) using uniform averaging. The merged model answers refund questions perfectly but responds to shipping queries with seemingly random product names.
What is the most likely cause, and what should you check first?
Answer: Tokenizer/output-space mismatch is a plausible first suspect, but the symptom alone does not identify one root cause. Inspect each source tokenizer and chat template, then compare shipping behavior in the source checkpoints. If added SKU tokens are needed, choose and evaluate an explicit union tokenizer policy; otherwise retain a shared-base output vocabulary and investigate source interference.
Symptom: The merged model produces incoherent or unstable outputs right away.
Cause: Matching parameter count is not enough. The checkpoints may differ in architecture details, tokenizer alignment, or base-model lineage.
Fix: Verify shared architecture, layer shapes, tokenizer handling, and common base checkpoint before merging any weights.
Symptom: The merged checkpoint looks strong overall but one specialist workflow regresses badly in production.
Cause: Aggregate metrics can hide per-task damage from interference.
Fix: Run per-task checks against declared release thresholds and source baselines, then block promotion when any critical skill falls below threshold.
Symptom: Larger lambda values make outputs repetitive, brittle, or off-distribution.
Cause: Oversized coefficients can push the merged checkpoint too far away from the stable base region.
Fix: Start with small or equal weights, tune on held-out evals, and watch for regressions rather than assuming stronger scaling always helps.
Symptom: Catalog-heavy prompts fail or produce random product tokens even though the merge completed.
Cause: Embedding rows and LM-head rows no longer line up cleanly across token IDs.
Fix: Check vocabulary equality first. If source models added tokens, use explicit tokenizer alignment such as mergekit union handling and then rerun evals.
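The vocabulary check in this fix can be sketched as a comparison of token-to-ID mappings. The function `vocab_differences` and the toy vocabularies are hypothetical; with Hugging Face tokenizers, a mapping of this shape is available from `tokenizer.get_vocab()`.

```python
from typing import Dict, List


def vocab_differences(
    vocab_a: Dict[str, int], vocab_b: Dict[str, int]
) -> Dict[str, List[str]]:
    """Report tokens that are missing on one side or mapped to different IDs."""
    shared = set(vocab_a) & set(vocab_b)
    return {
        "only_in_a": sorted(set(vocab_a) - set(vocab_b)),
        "only_in_b": sorted(set(vocab_b) - set(vocab_a)),
        "different_id": sorted(t for t in shared if vocab_a[t] != vocab_b[t]),
    }


# Toy vocabularies: the catalog model added a SKU token after fine-tuning.
base_vocab = {"<s>": 0, "refund": 1, "ship": 2}
catalog_vocab = {"<s>": 0, "refund": 1, "ship": 2, "<sku_1001>": 3}
shifted_vocab = {"<s>": 0, "ship": 1, "refund": 2}

print(vocab_differences(base_vocab, catalog_vocab))
print(vocab_differences(base_vocab, shifted_vocab))
```

Any non-empty entry means embedding and LM-head rows cannot be averaged by index without an explicit alignment policy.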
Symptom: Engineers sparsify task vectors and assume the job is done.
Cause: DARE is a preprocessing step that drops and rescales delta entries; it still needs a downstream merge rule such as averaging or TIES.
Fix: Treat DARE as sparsification first, then run a real merge rule and evaluate the combined checkpoint.
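The two-stage structure can be sketched as follows: a drop-and-rescale step on each delta, followed by a separate merge rule. The names `dare_sparsify` and `merge_with_dare` are hypothetical, the downstream rule here is a simple weighted sum of task vectors rather than TIES, and the weights are toy values.

```python
import random
from typing import List, Sequence


def dare_sparsify(
    delta: List[float], drop_rate: float, rng: random.Random
) -> List[float]:
    """Randomly zero delta entries, then rescale survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


def merge_with_dare(
    base: List[float],
    finetuned: Sequence[List[float]],
    drop_rate: float,
    weight: float,
    seed: int = 0,
) -> List[float]:
    """Sparsify each task vector with DARE, then add the weighted sum to the base."""
    rng = random.Random(seed)
    merged = list(base)
    for checkpoint in finetuned:
        delta = [f - b for f, b in zip(checkpoint, base)]
        sparse = dare_sparsify(delta, drop_rate, rng)
        # The merge rule is a separate step from the sparsification above.
        merged = [m + weight * s for m, s in zip(merged, sparse)]
    return merged


base = [0.0, 1.0, 2.0, 3.0]
refunds_ft = [1.0, 1.0, 2.5, 3.0]
shipping_ft = [0.0, 2.0, 2.0, 2.0]

print(merge_with_dare(base, [refunds_ft, shipping_ft], drop_rate=0.5, weight=0.5))
```

The output of `merge_with_dare` is still only a candidate checkpoint; the per-task evaluation applies to it exactly as it would to any other merge.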
Mergekit: Tools for merging pre-trained large language models
Goddard, C., et al. · 2023
Linear Mode Connectivity and the Lottery Ticket Hypothesis
Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. · 2020 · ICML 2020
Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time
Wortsman, M., et al. · 2022 · ICML 2022
The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks
Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. · 2022 · ICLR 2022
Git Re-Basin: Merging Models modulo Permutation Symmetries
Ainsworth, S. K., Hayase, J., & Srinivasa, S. · 2022 · ICLR 2023
Editing Models with Task Arithmetic
Ilharco, G., et al. · 2022 · ICLR 2023
TIES-Merging: Resolving Interference When Merging Models
Yadav, P., et al. · 2023 · NeurIPS 2023
Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch
Yu, L., et al. · 2023 · ICML 2024
Animating Rotation with Quaternion Curves
Shoemake, K. · 1985 · SIGGRAPH '85
Evolutionary Optimization of Model Merging Recipes
Sakana AI · 2024