LeetLLM
© 2026 LeetLLM. All rights reserved.

Model Merging and Weight Interpolation

Learn model merging techniques, from simple weight averaging and task arithmetic to TIES-Merging and DARE, including practical guidance on tokenizer compatibility, mergekit workflows, and evaluation.

36 min read

Knowledge distillation compressed teacher behavior into a smaller student. Model merging asks a different deployment question: if you already have several useful fine-tuned checkpoints from the same base model, can you combine their weights into one checkpoint without launching another training run?

Your logistics team runs a customer-support LLM that answers shipping, catalog, and refund questions. After a peak season, you have three fine-tuned checkpoints from the same base model: one tuned on carrier-policy updates, one on product-search queries, and one on refund workflows. Serving all three behind a router adds cold-start latency and complicates rollback. Model merging asks whether you can combine those checkpoints into one deployable model that keeps useful behavior from each specialist without running another training job.

This chapter explains when interpolation is worth testing, where it fails, and how to evaluate merged models before shipping them.

In the previous articles on fine-tuning and alignment, you learned that a model checkpoint is a large vector of numbers, and that fine-tuning nudges those numbers in directions that improve a specific task. Model merging asks a natural follow-up question: what if you could combine those nudges? If one team fine-tuned a base model for Python code and another team fine-tuned the same base model for math reasoning, can you average their checkpoints and get a model that's good at both?

The attraction is practical: a merge creates a candidate checkpoint without another gradient-training run. Whether that candidate retains any target behavior is an evaluation question, not a property guaranteed by the merge recipe.

A hard prerequisite for direct tensor interpolation is compatible parameter structure: corresponding tensors must have compatible shapes and meanings. Embedding matrices and language-model heads also need an explicit tokenizer policy. You can't interpolate a 7B checkpoint with a 70B checkpoint, or blindly combine token rows from different vocabularies. Mergekit can construct a union tokenizer and assign fallback embeddings for missing tokens, but that is an explicit output-space choice that still needs evaluation.[1] Same-base checkpoints are the conservative starting point for task-vector merging because their parameter coordinates share lineage; compatible shapes alone do not establish merge quality.
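The structural prerequisite above can be screened before any weights are touched. The following is a minimal sketch, assuming each checkpoint has already been reduced to a mapping from parameter name to shape tuple (for a PyTorch state dict, something like `{k: tuple(v.shape) for k, v in sd.items()}`). The function name `find_structure_mismatches` and the parameter names are hypothetical, and passing this check says nothing about merge quality.

```python
def find_structure_mismatches(
    shapes_a: dict[str, tuple[int, ...]],
    shapes_b: dict[str, tuple[int, ...]],
) -> list[str]:
    """List reasons why two checkpoints cannot be interpolated tensor by tensor."""
    problems = []
    # Parameters present in only one checkpoint have no counterpart to merge with.
    for name in sorted(shapes_a.keys() - shapes_b.keys()):
        problems.append(f"only in A: {name}")
    for name in sorted(shapes_b.keys() - shapes_a.keys()):
        problems.append(f"only in B: {name}")
    # Shared parameters must agree on shape, including vocabulary-sized dimensions.
    for name in sorted(shapes_a.keys() & shapes_b.keys()):
        if shapes_a[name] != shapes_b[name]:
            problems.append(
                f"shape mismatch: {name} {shapes_a[name]} vs {shapes_b[name]}"
            )
    return problems


# Hypothetical shape maps: the second checkpoint was trained with extra tokens.
shipping = {"embed.weight": (32000, 4096), "layer0.mlp.weight": (11008, 4096)}
refunds = {"embed.weight": (32016, 4096), "layer0.mlp.weight": (11008, 4096)}

problems = find_structure_mismatches(shipping, refunds)
print(problems)
```

A mismatch in the embedding rows, as in this toy case, is the signal that a tokenizer policy has to be chosen explicitly before merging.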

The illustration below compares four recipes covered in this chapter, from averaging to sparse conflict-aware task-vector merging.

Figure: Comparison of four model merging recipes. Model soups average checkpoints, task arithmetic adds scaled task vectors, TIES trims and sign-aligns conflicting updates, and DARE sparsifies task vectors before a downstream merge.

When averaging weights is worth testing

If someone told you to average the internal numbers of two trained neural networks, skepticism would be the right reaction. Weight averaging is plausible in some fine-tuning settings because nearby checkpoints can occupy a connected low-loss region in parameter space. It is not safe merely because both endpoints are good models.

Think of fine-tuning as moving from a shared starting point through weight space. Linear mode connectivity studies whether the straight interpolation path between endpoints crosses a high-loss barrier.[2] The Model Soups paper motivates averaging in a specific setting: models fine-tuned from a shared pretrained initialization under different hyperparameters often admit useful averages in its evaluated vision and text-classification experiments.[3] This is evidence for testing nearby same-lineage candidates, not a proof that differently specialized LLM checkpoints share one basin.

In concrete terms, evaluate points on $\theta(\alpha) = (1-\alpha)\theta_A + \alpha\theta_B$ rather than assuming the midpoint is usable. Two checkpoints can each be strong on their own evaluation and still interfere when combined, especially when they represent different tasks. Models with unrelated pretraining lineages are even poorer candidates for direct interpolation because their parameter coordinate systems were not preserved by a common base.
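Building the candidates along that path is a per-parameter weighted sum. The sketch below is illustrative only: it assumes checkpoints are represented as plain dictionaries of float lists standing in for tensors, and `interpolate_checkpoints` is a hypothetical helper name rather than a library function.

```python
def interpolate_checkpoints(
    theta_a: dict[str, list[float]],
    theta_b: dict[str, list[float]],
    alpha: float,
) -> dict[str, list[float]]:
    """Return (1 - alpha) * theta_a + alpha * theta_b for every parameter."""
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name, values_a in theta_a.items():
        values_b = theta_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"size mismatch for {name}")
        merged[name] = [
            (1 - alpha) * a + alpha * b for a, b in zip(values_a, values_b)
        ]
    return merged


# Toy checkpoints; real ones would be loaded from disk.
theta_a = {"w": [2.0, 4.0]}
theta_b = {"w": [4.0, 2.0]}

# One candidate per alpha; each candidate still has to be evaluated separately.
path = {
    alpha: interpolate_checkpoints(theta_a, theta_b, alpha)
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)
}
print(path[0.25]["w"], path[0.5]["w"])
```

Each entry in `path` is only a candidate checkpoint; the loss and task scores for it come from running held-out evaluations.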

The screening logic below treats interpolation measurements as evidence, rather than granting the midpoint a pass because the sources share lineage. In a real run, losses and task_scores come from the candidate checkpoints and held-out evaluations:

screen-measured-interpolation-path.py

```python
alphas = [0.00, 0.25, 0.50, 0.75, 1.00]
losses = [0.18, 0.20, 0.61, 0.23, 0.19]
task_scores = [0.88, 0.86, 0.70, 0.85, 0.89]

max_accepted_loss = 0.30
min_accepted_score = 0.84

accepted = [
    alpha
    for alpha, loss, score in zip(alphas, losses, task_scores)
    if loss <= max_accepted_loss and score >= min_accepted_score
]

print("accepted_alphas:", accepted)
print("midpoint_passes:", 0.50 in accepted)
```

Output

```
accepted_alphas: [0.0, 0.25, 0.75, 1.0]
midpoint_passes: False
```
Figure: Interpolation diagnostic comparing a measured low-loss path between nearby same-lineage checkpoints with a measured high-loss ridge between incompatible candidates.
Check interpolation loss or downstream quality rather than inferring it from lineage. Shared lineage is a useful precondition, not a pass result.

Key insight: Same-base lineage makes a merge experiment defensible. It does not establish that a shipping-policy and refund-workflow merge retained either behavior. Only per-task evaluation does that.
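The measured losses from the screening example can also be summarized as a single barrier height. The sketch below assumes one common definition, the largest gap between the measured loss and the straight line joining the two endpoint losses; other definitions exist, and `loss_barrier` is a hypothetical helper name.

```python
def loss_barrier(alphas: list[float], losses: list[float]) -> tuple[float, float]:
    """Return (barrier_height, alpha_at_barrier) for a measured interpolation path.

    Assumed definition: the largest gap between the measured loss and the
    linear interpolation of the two endpoint losses.
    """
    if alphas[0] != 0.0 or alphas[-1] != 1.0:
        raise ValueError("path must start at alpha=0 and end at alpha=1")
    loss_a, loss_b = losses[0], losses[-1]
    gaps = [
        loss - ((1 - alpha) * loss_a + alpha * loss_b)
        for alpha, loss in zip(alphas, losses)
    ]
    height = max(gaps)
    return height, alphas[gaps.index(height)]


# Same illustrative measurements as the screening example.
alphas = [0.00, 0.25, 0.50, 0.75, 1.00]
losses = [0.18, 0.20, 0.61, 0.23, 0.19]

height, at_alpha = loss_barrier(alphas, losses)
print("barrier_height:", round(height, 3), "at alpha:", at_alpha)
```

A barrier near zero is consistent with a connected low-loss path on the measured grid; it still does not replace per-task evaluation of the chosen candidate.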

The permutation invariance problem

Neural networks can exhibit permutation invariance: under appropriate corresponding reordering of incoming and outgoing weights, hidden-unit permutations can preserve the function a network computes.[4] This helps explain why independently trained networks may be poorly aligned for naive averaging even if they solve similar tasks.

Git Re-Basin[5] addresses this by searching for a permutation $\pi$ that aligns the hidden units of one model with those of the other before merging:

$$\theta_{\text{merged}} = \frac{1}{2}\left(\theta_A + \pi(\theta_B)\right)$$

The illustration below makes the failure mode concrete. Two checkpoints can learn the same three hidden features but store them in different slot orders, so Git Re-Basin permutes one model before averaging.

Figure: Illustrative permutation-alignment diagram showing equivalent hidden features in different slot orders before and after reordering.
Permutation invariance means slot numbers are not semantic IDs. Git Re-Basin searches for a permutation that aligns equivalent hidden units in the network architectures studied by its paper.

Git Re-Basin uses permutation matching and reports merging independently trained MLP, CNN, and ResNet models in its studied settings, including a zero-barrier ResNet result on CIFAR-10.[5] That paper is a useful alignment concept, but it is not evidence that an arbitrary pair of large language models can be repaired and merged. For LLM work, same-base candidates plus downstream evaluation remain the practical default.
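The slot-order failure mode can be reproduced with a toy network. The sketch below is a hand-built illustration, not the Git Re-Basin algorithm: it uses a one-hidden-layer ReLU network stored as plain lists, and it undoes a permutation that is known in advance, whereas Git Re-Basin has to search for one. All helper names are hypothetical.

```python
def forward(model: dict, x: list[float]) -> float:
    """One-hidden-layer ReLU network with a scalar output."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(model["W1"], model["b1"])
    ]
    return sum(w * h for w, h in zip(model["w2"], hidden))


def permute_hidden(model: dict, order: list[int]) -> dict:
    """Reorder hidden units: rows of W1, entries of b1, and matching entries of w2."""
    return {
        "W1": [model["W1"][j] for j in order],
        "b1": [model["b1"][j] for j in order],
        "w2": [model["w2"][j] for j in order],
    }


def average(m1: dict, m2: dict) -> dict:
    """Naive coordinate-wise midpoint of two models."""
    return {
        "W1": [[(a + b) / 2 for a, b in zip(r1, r2)]
               for r1, r2 in zip(m1["W1"], m2["W1"])],
        "b1": [(a + b) / 2 for a, b in zip(m1["b1"], m2["b1"])],
        "w2": [(a + b) / 2 for a, b in zip(m1["w2"], m2["w2"])],
    }


model_a = {"W1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0], "w2": [1.0, -1.0]}
model_b = permute_hidden(model_a, [1, 0])  # same function, swapped slots
x = [2.0, 1.0]

naive = average(model_a, model_b)
aligned = average(model_a, permute_hidden(model_b, [1, 0]))  # known swap undone

print("a:", forward(model_a, x), "b:", forward(model_b, x))
print("naive average:", forward(naive, x))
print("aligned average:", forward(aligned, x))
```

In this toy case the two endpoints compute the same function, the naive midpoint collapses to a constant zero output, and the aligned midpoint recovers the original function.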

The merging toolkit: from simple averages to conflict-aware methods

Several techniques exist for merging models, ranging from mathematical averages to geometric interpolation rules. Choice depends on source lineage, observed delta conflict, and the evaluations the output must pass.

Figure: Merge workflow. A shared base checkpoint (same architecture) is fine-tuned into specialists (shipping, refunds, catalog); a merge recipe (linear, task arithmetic, TIES, DARE-TIES, or SLERP) produces a merged candidate.
| Method | Mechanism | Pros | Cons | Best For |
| --- | --- | --- | --- | --- |
| Model Soups / Linear | Uniform or weighted averaging | Simple baseline | Interference between conflicting parameters | Nearby checkpoints with representative evaluation |
| Task Arithmetic | Weighted task vectors | Separate coefficients per delta | Coefficients do not guarantee separate capabilities survive | Same-base task-vector experiments |
| TIES-Merging (Trim, Elect Sign, Merge) | Trim and aggregate-sign filtering | Explicitly handles conflicting delta signs | Density and scale need tuning | Conflicting same-base task vectors |
| DARE (sparsify first) | Random dropping and rescaling of task vectors | Sparse preprocessing evaluated by its paper | Not a complete merge rule on its own | Testing DARE plus a downstream merge |
| SLERP (Spherical Linear Interpolation) | Spherical interpolation in direction space | Has a norm-preserving geometric interpretation | Geometry alone does not establish task quality | Pairwise interpolation experiment |

Model soups (uniform averaging)

The simplest approach to merging is called Model Soups[3]. It averages the weights of multiple fine-tuned models to create a single robust model:

$$\theta_{\text{merged}} = \frac{1}{N} \sum_{i=1}^{N} \theta_i$$

To see the operation, imagine two scalars instead of billion-parameter tensors. If checkpoint A has a weight 2.0 and checkpoint B has 4.0, their equal average is 3.0. Whether that compromise retains behavior cannot be determined from this parameter alone. Model Soups evaluates averaging models fine-tuned from a shared initialization over hyperparameter configurations and reports improved accuracy and robustness in its studied settings without the inference cost of an ensemble.[3]

Uniform vs. greedy soups

The original Model Soups paper distinguishes between two selection strategies:

  • Uniform Soups: Average all fine-tuned checkpoints with equal weight. Simple but risky, as a single poorly performing checkpoint can drag down the overall quality.
  • Greedy Soups: Iteratively add checkpoints to the soup only if they improve performance on a held-out validation set. Start with the best individual model, then test each remaining model: if adding it to the average improves the validation metric, keep it; otherwise, discard it.

In the original paper, greedy soups outperform uniform averaging in the reported experiments because the selection rule skips candidates that hurt its held-out validation metric.[3] A similar selection rule is reasonable to test only when your validation slices represent the behavior you need to keep.
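The greedy selection rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: checkpoints are flat lists of floats, `toy_validation_score` is a made-up stand-in for a real held-out metric, and ties are discarded, which is a choice of this sketch rather than something the source specifies.

```python
from typing import Callable

Checkpoint = list[float]


def average(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Uniform average of equally sized checkpoints."""
    return [sum(values) / len(checkpoints) for values in zip(*checkpoints)]


def greedy_soup(
    checkpoints: list[Checkpoint],
    validate: Callable[[Checkpoint], float],
) -> tuple[Checkpoint, list[int]]:
    """Return the greedy soup and the indices of the checkpoints it kept."""
    # Visit candidates from best to worst individual validation score.
    order = sorted(
        range(len(checkpoints)),
        key=lambda i: validate(checkpoints[i]),
        reverse=True,
    )
    kept = [order[0]]
    best_score = validate(checkpoints[order[0]])
    for i in order[1:]:
        candidate = average([checkpoints[j] for j in kept + [i]])
        score = validate(candidate)
        if score > best_score:  # ties are discarded in this sketch
            kept.append(i)
            best_score = score
    return average([checkpoints[j] for j in kept]), kept


# Hypothetical validation metric: higher is better, peaks at `target`.
target = [1.1, 1.0]


def toy_validation_score(checkpoint: Checkpoint) -> float:
    return -sum((w - t) ** 2 for w, t in zip(checkpoint, target))


candidates = [[1.0, 1.0], [1.3, 1.0], [3.0, -1.0]]
soup, kept = greedy_soup(candidates, toy_validation_score)
print("kept:", kept)
print("soup:", [round(w, 3) for w in soup])
```

Here the third candidate is rejected because adding it lowers the toy validation score, which mirrors the reason the selection rule depends so heavily on how representative the validation set is.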

The following function demonstrates uniform averaging. It takes a list of model state dictionaries and an optional list of weights, and returns a single state dictionary containing the weighted average of their parameters:

uniform-vs-greedy-soups.py

```python
import torch

def uniform_merge(
    models: list[dict[str, torch.Tensor]],
    weights: list[float] | None = None,
) -> dict[str, torch.Tensor]:
    """Merge models by weighted averaging of parameters."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)

    if abs(sum(weights) - 1.0) >= 1e-6:
        raise ValueError("Weights must sum to 1")

    merged = {}
    for key in models[0].keys():
        merged[key] = sum(w * m[key] for w, m in zip(weights, models))

    return merged

code_model = {"w": torch.tensor([2.0, 4.0]), "bias": torch.tensor([1.0])}
math_model = {"w": torch.tensor([4.0, 2.0]), "bias": torch.tensor([3.0])}
merged = uniform_merge([code_model, math_model])

print("weights_ok:", bool(torch.allclose(merged["w"], torch.tensor([3.0, 3.0]))))
print("bias_ok:", bool(torch.allclose(merged["bias"], torch.tensor([2.0]))))
print("merged w:", merged["w"].tolist())
print("merged bias:", merged["bias"].tolist())
```

Output

```
weights_ok: True
bias_ok: True
merged w: [3.0, 3.0]
merged bias: [2.0]
```

Pros

Exceptionally simple to implement and requires no hyperparameters beyond the optional weighting scheme.

Cons

If the models are fine-tuned on wildly different tasks, direct averaging can cause destructive interference between conflicting parameters, degrading overall performance.

Task arithmetic

Instead of averaging absolute weights, Task Arithmetic[6] operates on task vectors (the difference between fine-tuned weights and the base model). Think of task vectors like a set of specific directions (e.g., "walk 10 steps north"). Instead of averaging the final destinations of different hikers, you take the specific path each hiker took from base camp and combine them. By scaling these paths (e.g., "take half the steps north"), you can carefully mix different skills.

$$\tau_i = \theta_i - \theta_0 \quad \text{(task vector for model } i \text{)}$$

$$\theta_{\text{merged}} = \theta_0 + \sum_{i=1}^{N} \lambda_i \cdot \tau_i$$

where $\theta_0$ is the base model, $\theta_i$ are fine-tuned models, $\tau_i$ are per-task deltas, and $\lambda_i$ are merge weights.

A concrete scalar walkthrough

Before you run this on billion-parameter tensors, try it on a single parameter. Imagine a base model where one weight is 2.0. A Python expert fine-tune pushes that weight to 2.5, so its task vector is +0.5. A math expert fine-tune pushes it to 1.5, so its task vector is -0.5. If you want a model that is slightly better at Python but keeps its math ability, you might set $\lambda_{\text{python}} = 0.6$ and $\lambda_{\text{math}} = 0.4$:

$$\text{merged weight} = 2.0 + 0.6(0.5) + 0.4(-0.5) = 2.0 + 0.3 - 0.2 = 2.1$$

The result, 2.1, is a parameter nudge in the Python-delta direction. If you had instead averaged the absolute fine-tuned weights (2.5 and 1.5), you'd get 2.0, cancelling both deltas at this coordinate. A scalar calculation cannot establish how either task performs; it only shows how task-vector coefficients act on weights.

You can use the same arithmetic for logistics. Suppose a shipping-policy fine-tune nudges a weight to 2.4 (task vector +0.4) and a refund-workflow fine-tune nudges it to 1.8 (task vector -0.2). Setting $\lambda_{\text{shipping}} = 0.7$ and $\lambda_{\text{refund}} = 0.5$ gives 2.0 + 0.28 - 0.10 = 2.18. Whether this coefficient pair keeps either workflow is measured after the merge.

The task_arithmetic_merge function below shows how this works in practice. It takes a shared base model, a list of fine-tuned models, and scaling coefficients. It subtracts the base to get task vectors, applies a scaling coefficient, and adds them back to return the merged model:

a-concrete-scalar-walkthrough.py

```python
import torch

def task_arithmetic_merge(
    base_model: dict[str, torch.Tensor],
    fine_tuned_models: list[dict[str, torch.Tensor]],
    scaling_coefficients: list[float]
) -> dict[str, torch.Tensor]:
    """Merge via task vectors (differences from base model).

    Args:
        base_model: The shared base model weights
        fine_tuned_models: List of fine-tuned model weights
        scaling_coefficients: Per-task scaling factors (lambda_i)
    """
    merged = {k: v.clone() for k, v in base_model.items()}

    for model, coeff in zip(fine_tuned_models, scaling_coefficients):
        for key in merged:
            task_vector = model[key] - base_model[key]
            merged[key] += coeff * task_vector

    return merged

shared_base = {"w": torch.tensor([2.0])}
code_ft = {"w": torch.tensor([2.5])}
math_ft = {"w": torch.tensor([1.5])}

# Example: 0.6x the code task vector + 0.4x the math task vector
merged = task_arithmetic_merge(
    base_model=shared_base,
    fine_tuned_models=[code_ft, math_ft],
    scaling_coefficients=[0.6, 0.4],
)

print("base:", float(shared_base["w"]))
print("matches_expected:", bool(torch.allclose(merged["w"], torch.tensor([2.1]))))
print("merged:", round(float(merged["w"]), 2))
```

Output

```
base: 2.0
matches_expected: True
merged: 2.1
```

Advantage

Scaling coefficients control how much of each task vector enters the candidate checkpoint; evaluation determines retained behavior.

TIES-Merging (trim, elect sign, merge)

A major issue with simple averaging is that task vectors can directly contradict one another (e.g., one model increases a weight by 0.5, while another decreases it by 0.5). TIES-Merging[7] addresses two forms of interference in its recipe: it trims low-magnitude deltas, elects a sign for each coordinate using total signed movement, and merges only values aligned with that elected sign.

Think of this process like a logistics operations committee proposing numeric budget changes. TIES resolves a coordinate in three stages. First, discard small proposals (Trim). Next, sum the size of increases against the size of decreases, rather than counting voters (Elect sign). Finally, average only proposals in the winning direction (Disjoint merge).

Figure: TIES-Merging flow from raw task vectors to trimmed updates, aggregate-magnitude sign election, and a merged aligned delta.
TIES handles interference coordinate by coordinate: trim low-magnitude updates, elect the direction with greater aggregate movement, then average aligned deltas.

Step 1: Trim

First, it drops low-magnitude changes treated as redundant by the recipe. The trim function below takes a task vector and a density threshold, then keeps high-magnitude updates:

step-1-trim.py

```python
import torch

def trim(task_vector: torch.Tensor, density: float = 0.2) -> torch.Tensor:
    """Keep only the top-k% of parameter changes by magnitude."""
    threshold = torch.quantile(task_vector.abs(), 1 - density)
    mask = task_vector.abs() >= threshold
    return task_vector * mask

task_vector = torch.tensor([0.1, -0.5, 0.02, 0.8])
trimmed = trim(task_vector, density=0.5)

print("matches_expected:", bool(torch.allclose(trimmed, torch.tensor([0.0, -0.5, 0.0, 0.8]))))
print("trimmed:", [round(float(x), 2) for x in trimmed])
```

Output

```
matches_expected: True
trimmed: [0.0, -0.5, 0.0, 0.8]
```

Step 2: Elect sign

When multiple task vectors update the same parameter, TIES-Merging resolves conflicting directions by total magnitude. The elect_sign function sums the trimmed signed deltas and takes the resulting sign. This differs from a majority vote: one large update can outweigh two smaller opposing updates.

step-2-elect-sign.py

```python
import torch

def elect_sign(trimmed_vectors: list[torch.Tensor]) -> torch.Tensor:
    """Elect sign with greatest aggregate signed movement per parameter."""
    aggregate_delta = sum(trimmed_vectors)
    return torch.sign(aggregate_delta)

trimmed_vectors = [
    torch.tensor([0.0, -0.5, 0.0, 0.8]),
    torch.tensor([0.0, 1.1, 0.0, -0.6]),
    torch.tensor([0.0, -0.4, 0.0, 0.7]),
]
elected = elect_sign(trimmed_vectors)
# Count the vectors that vote negative at the second parameter (index 1)
negative_votes_at_p2 = sum(int(tv[1] < 0) for tv in trimmed_vectors)
print("two_negative_votes_at_p2:", negative_votes_at_p2 == 2)
print("positive_mass_wins_at_p2:", bool(elected[1] == 1))
print("matches_expected:", bool(torch.allclose(elected, torch.tensor([0.0, 1.0, 0.0, 1.0]))))
print("elected signs:", elected.tolist())
```

Output

```
two_negative_votes_at_p2: True
positive_mass_wins_at_p2: True
matches_expected: True
elected signs: [0.0, 1.0, 0.0, 1.0]
```

Step 3: Disjoint merge

Finally, TIES averages only those task-vector entries whose sign matches the elected direction. The disjoint_merge function takes trimmed task vectors and their elected signs. For each parameter it averages the non-zero updates that agree with the elected sign and ignores dissenting values, producing the final merged task vector:

step-3-disjoint-merge.py

```python
import torch

def disjoint_merge(
    trimmed_vectors: list[torch.Tensor],
    elected_signs: torch.Tensor
) -> torch.Tensor:
    """Average only non-zero values that agree with the elected sign."""
    merged = torch.zeros_like(trimmed_vectors[0])
    counts = torch.zeros_like(trimmed_vectors[0])

    for tv in trimmed_vectors:
        agree = (tv != 0) & (elected_signs != 0) & (torch.sign(tv) == elected_signs)
        merged += torch.where(agree, tv, 0)
        counts += agree.float()

    return torch.where(
        counts > 0,
        merged / counts.clamp(min=1),
        torch.zeros_like(merged)
    )

trimmed_vectors = [
    torch.tensor([0.0, -0.5, 0.0, 0.8]),
    torch.tensor([0.0, 1.1, 0.0, -0.6]),
    torch.tensor([0.0, -0.4, 0.0, 0.7]),
]
elected_signs = torch.tensor([0.0, 1.0, 0.0, 1.0])
merged = disjoint_merge(trimmed_vectors, elected_signs)

print("matches_expected:", bool(torch.allclose(merged, torch.tensor([0.0, 1.1, 0.0, 0.75]))))
print("merged task vector:", [round(float(x), 2) for x in merged])
```

Output

```
matches_expected: True
merged task vector: [0.0, 1.1, 0.0, 0.75]
```

TIES-Merging outperforms compared baselines in the paper's evaluated vision and T5 task-vector settings, and its analysis highlights sign interference.[7] For an LLM merge, use it as a candidate recipe when task vectors conflict, then compare per-task evaluation against simpler baselines.
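The three stages can be chained into one end-to-end pass that starts from checkpoints rather than from pre-trimmed vectors. The sketch below is a dependency-free illustration under stated assumptions: checkpoints are flat lists of floats, trimming keeps a top-k count rather than using a quantile threshold, and `density` and `lam` are illustrative values, not recommended settings.

```python
import math


def trim(task_vector: list[float], density: float) -> list[float]:
    """Keep the largest-magnitude entries (top-k by count); zero out the rest."""
    k = max(1, math.ceil(density * len(task_vector)))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def ties_merge(
    base: list[float],
    fine_tuned: list[list[float]],
    density: float = 0.5,
    lam: float = 1.0,
) -> list[float]:
    """Trim each task vector, elect a sign per coordinate, average the agreeing entries."""
    task_vectors = [[f - b for f, b in zip(ft, base)] for ft in fine_tuned]
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for j, b in enumerate(base):
        column = [tv[j] for tv in trimmed]
        elected = sign(sum(column))  # direction with more total movement
        agreeing = [v for v in column if v != 0 and sign(v) == elected]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * delta)
    return merged


# Toy checkpoints whose deltas match the trimmed vectors used in the steps above.
base = [2.0, 2.0, 2.0, 2.0]
fine_tuned = [
    [2.1, 1.5, 2.02, 2.8],
    [2.05, 3.1, 2.01, 1.4],
    [2.03, 1.6, 2.0, 2.7],
]

merged = ties_merge(base, fine_tuned, density=0.5, lam=1.0)
print([round(w, 2) for w in merged])
```

The merged deltas agree with the step-by-step outputs: the second coordinate follows the single large positive update, and the fourth averages the two positive updates.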

DARE (drop and rescale)[8]

DARE randomly drops entries from each task vector and rescales the survivors before a downstream merge. The rescaling preserves an entry's expected delta under the random mask; it does not by itself prove that the resulting model preserves a capability.

The paper studies redundancy in supervised fine-tuning (SFT) deltas and reports that its evaluated models can often tolerate dropping 90% of delta entries, and in some cases 99%, before merging. Its size ablation reports that WizardMath-70B remains effective at a 99% drop rate while the evaluated 7B and 13B variants fail there.[8] Treat that as experimental evidence for the paper's SFT models, not a default density setting for a new merge. DARE is a sparsification step, not a complete merge recipe: sparsified task vectors still need a merger such as averaging or TIES.

DARE's analysis also separates SFT deltas, which it observes are typically within roughly 0.002, from continued-pretraining deltas that approach 0.03; its drop-and-rescale approach becomes ineffective on the latter.[8] One candidate pipeline is DARE-TIES: DARE sparsifies each delta, then TIES resolves directional conflicts among surviving values. Mergekit exposes that composition as dare_ties.[1]

$$\tau_i^{\text{DARE}} = \frac{\tau_i \odot m}{1 - p}$$

where $m$ is a random binary mask with drop rate $p$ and the $1/(1-p)$ factor preserves each entry's expectation under masking.[8]

The dare_sparsify function below shows the DARE step itself. It sparsifies one task vector, after which you can pass the result to a downstream merge rule:

dare-drop-and-rescale-yu2023.py

```python
import torch

def dare_sparsify(
    task_vector: torch.Tensor,
    drop_rate: float = 0.9
) -> torch.Tensor:
    """DARE preprocessing for one task vector."""
    keep_prob = 1.0 - drop_rate
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    mask = torch.bernoulli(torch.full_like(task_vector, keep_prob))
    return (task_vector * mask) / keep_prob

torch.manual_seed(4)
task_vector = torch.tensor([0.2, -0.4, 0.1, 0.6])
sparsified = dare_sparsify(task_vector, drop_rate=0.5)

shape_ok = sparsified.shape == task_vector.shape
finite_ok = bool(torch.isfinite(sparsified).all())
binary_mask_ok = set(torch.unique((sparsified != 0).int()).tolist()).issubset({0, 1})
makes_invalid_drop_rate_fail = False

try:
    dare_sparsify(task_vector, drop_rate=1.0)
except ValueError as exc:
    makes_invalid_drop_rate_fail = "drop_rate" in str(exc)

print("shape_ok:", shape_ok)
print("finite_ok:", finite_ok)
print("binary_mask_ok:", binary_mask_ok)
print("invalid_drop_rate_rejected:", makes_invalid_drop_rate_fail)
print("original nonzero:", int((task_vector != 0).sum()))
print("sparsified nonzero:", int((sparsified != 0).sum()))
print("sparsified:", [0.0 if abs(float(x)) < 1e-8 else round(float(x), 2) for x in sparsified])
```

Output

```
shape_ok: True
finite_ok: True
binary_mask_ok: True
invalid_drop_rate_rejected: True
original nonzero: 4
sparsified nonzero: 2
sparsified: [0.0, 0.0, 0.2, 1.2]
```

What the paper establishes

On the SFT models and tasks it evaluates, DARE finds substantial redundancy in task-vector entries and improves several downstream merge methods after drop-and-rescale preprocessing.[8] For a new checkpoint family, density remains a tuned parameter: compare unsparsified and DARE-preprocessed merges on every required task slice.
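The expectation-preserving role of the rescaling factor can be checked empirically. The sketch below is a dependency-free Monte Carlo check, assuming plain float lists in place of tensors; the seed, sample count, and the name `dare_sparsify_list` are arbitrary choices for illustration. It only verifies the averaging property of the mask, not anything about model behavior.

```python
import random


def dare_sparsify_list(
    task_vector: list[float],
    drop_rate: float,
    rng: random.Random,
) -> list[float]:
    """Drop each entry with probability drop_rate and rescale the survivors."""
    keep_prob = 1.0 - drop_rate
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in task_vector]


rng = random.Random(0)
task_vector = [0.2, -0.4, 0.1, 0.6]
drop_rate = 0.9

# A single draw is very different from the original vector...
one_sample = dare_sparsify_list(task_vector, drop_rate, rng)

# ...but the average over many independent masks approaches it.
num_samples = 20_000
totals = [0.0] * len(task_vector)
for _ in range(num_samples):
    sample = dare_sparsify_list(task_vector, drop_rate, rng)
    totals = [t + s for t, s in zip(totals, sample)]
means = [t / num_samples for t in totals]

print("one sample:", [round(s, 2) for s in one_sample])
print("mean over samples:", [round(m, 2) for m in means])
```

A real merge applies one mask per task vector, not an average over masks, so the single-draw variance is what a downstream evaluation actually sees.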

SLERP (spherical linear interpolation)

SLERP is a geometric interpolation rule. It follows an angular arc between normalized vector directions rather than the chord used by linear interpolation. It is not an algorithm for discovering a low-loss path around an incompatible-model ridge.

Where linear interpolation moves directions along a straight line, SLERP[9] moves them along the surface of a sphere. The cleanest way to write the geometry is in terms of normalized directions:

$$\text{SLERP}(\hat{\theta}_A, \hat{\theta}_B, t) = \frac{\sin((1-t)\Omega)}{\sin \Omega}\,\hat{\theta}_A + \frac{\sin(t\Omega)}{\sin \Omega}\,\hat{\theta}_B$$

where $t$ is the interpolation factor (from 0 to 1) and $\Omega = \arccos(\hat{\theta}_A \cdot \hat{\theta}_B)$ is the angle between the normalized weight vectors. The formula is undefined when $\sin \Omega = 0$, which happens for parallel or antiparallel directions, so implementations need a fallback there. Practical merge implementations may handle magnitude separately after interpolating direction, which is what the code below does.

SLERP was introduced for computer graphics to interpolate rotations represented as quaternions.[9] That source establishes its geometry, not downstream quality for neural-network weight merges. Use a SLERP checkpoint as another candidate and measure loss and required tasks just as you would for a linear merge.

The slerp function demonstrates this geometric interpolation. It takes two unnormalized vectors and an interpolation factor t, projects them onto a unit sphere, computes the interpolation, and returns the combined vector:

slerp-spherical-linear-interpolation.py
import torch

def slerp(
    v0: torch.Tensor,
    v1: torch.Tensor,
    t: float
) -> torch.Tensor:
    """Practical SLERP for a single weight tensor."""
    flat_v0 = v0.flatten()
    flat_v1 = v1.flatten()

    v0_mag = flat_v0.norm()
    v1_mag = flat_v1.norm()
    if v0_mag.item() == 0 or v1_mag.item() == 0:
        return (1 - t) * v0 + t * v1

    # Separate direction from magnitude
    v0_dir = flat_v0 / v0_mag
    v1_dir = flat_v1 / v1_mag

    dot = torch.dot(v0_dir, v1_dir)
    omega = torch.acos(torch.clamp(dot, -1.0, 1.0))
    sin_omega = torch.sin(omega)

    # Fall back to linear interpolation when sin(omega) is near zero:
    # near-identical vectors (omega ~ 0) or near-opposite vectors (omega ~ pi)
    if sin_omega.abs().item() < 1e-6:
        return (1 - t) * v0 + t * v1

    direction = (
        torch.sin((1 - t) * omega) / sin_omega * v0_dir +
        torch.sin(t * omega) / sin_omega * v1_dir
    )
    magnitude = (1 - t) * v0_mag + t * v1_mag
    return direction.view_as(v0) * magnitude

v0 = torch.tensor([1.0, 0.0])
v1 = torch.tensor([0.0, 1.0])
mid = slerp(v0, v1, t=0.5)

print("unit_norm:", bool(torch.allclose(mid.norm(), torch.tensor(1.0), atol=1e-6)))
print("matches_45_degree:", bool(torch.allclose(mid, torch.tensor([2**-0.5, 2**-0.5]), atol=1e-6)))
print("midpoint:", [round(float(x), 4) for x in mid])
print("norm:", round(float(mid.norm()), 4))
Output
unit_norm: True
matches_45_degree: True
midpoint: [0.7071, 0.7071]
norm: 1.0

What it guarantees geometrically

In the equal-norm orthogonal-vector example above, the SLERP midpoint remains on the unit circle while a linear midpoint would have smaller norm. That is a geometric property of this example, not evidence that the midpoint preserves either model's behavior. For a pair of checkpoints, evaluate linear and spherical candidates against the same release gates.
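The norm difference in that example is easy to reproduce by hand. The following is a minimal standard-library sketch for two orthogonal unit vectors; the helper names `linear_midpoint`, `slerp_midpoint`, and `norm` are illustrative, and `slerp_midpoint` assumes unit-length inputs that are neither parallel nor antiparallel:

```python
import math

def linear_midpoint(a, b):
    """Chord midpoint: plain element-wise average of two vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def slerp_midpoint(a, b):
    """Arc midpoint for two unit vectors (assumes 0 < angle < pi)."""
    dot = sum(x * y for x, y in zip(a, b))
    omega = math.acos(max(-1.0, min(1.0, dot)))
    # At t = 0.5 both SLERP coefficients are sin(omega / 2) / sin(omega).
    w = math.sin(0.5 * omega) / math.sin(omega)
    return [w * x + w * y for x, y in zip(a, b)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

a, b = [1.0, 0.0], [0.0, 1.0]
linear_norm = norm(linear_midpoint(a, b))
slerp_norm = norm(slerp_midpoint(a, b))

print("linear midpoint norm:", round(linear_norm, 4))  # 0.7071
print("slerp midpoint norm:", round(slerp_norm, 4))    # 1.0
```

The linear midpoint shrinks to $1/\sqrt{2}$ of the endpoint norm, while the spherical midpoint stays at 1. This says nothing about task quality; it only illustrates the geometric difference between the two candidates.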

Running a merge with mergekit

mergekit[1] is an open-source toolkit for merging language-model checkpoints. Its documented CLI supports YAML-defined merges with CPU or limited-VRAM execution, and mergekit-multi can run multi-stage recipes where later merges consume earlier outputs.[1]

To use mergekit, you typically define a YAML configuration file that specifies the base model, the fine-tuned source models, their respective merging coefficients, and the desired algorithm. This configuration acts as the input to the CLI tool to generate the merged model:

running-a-merge-with-mergekit.yaml
# mergekit config: merge_config.yml
models:
  - model: your-org/base-8b-shipping
    parameters:
      weight: 0.35
  - model: your-org/base-8b-refund
    parameters:
      weight: 0.35
  - model: your-org/base-8b-catalog
    parameters:
      weight: 0.30

merge_method: dare_ties # or: ties, linear, slerp
base_model: your-org/base-8b
tokenizer:
  source: base # switch to union if you must preserve extra tokens
chat_template: auto # or pin a specific template when model families differ
parameters:
  density: 0.5 # fraction of delta parameters to keep
dtype: float16

If all required tokens are already in the base tokenizer, tokenizer.source: base pins the output vocabulary to that base. Modern mergekit configuration defaults to a union tokenizer, which adds tokens present in source vocabularies and assigns fallback embeddings where an input model lacks them.[1] Either policy is an output-space decision that needs targeted evaluation.

Common Mistake: Assuming all logistics checkpoints use the same tokenizer because they began from the same base. If the catalog checkpoint added SKU tokens (<SKU-12345>), choosing base drops those added output entries while choosing union introduces filled embeddings for models that lack them. Verify tokenizer vocabularies, choose the output policy explicitly, and test SKU-heavy prompts.

This preflight check makes the output-vocabulary decision explicit before any expensive merge runs:

choose-output-vocabulary-policy.py
def output_vocab(base_vocab, source_vocabs, policy):
    if policy == "base":
        return set(base_vocab)
    if policy == "union":
        return set().union(*source_vocabs)
    raise ValueError("policy must be 'base' or 'union'")

base_vocab = {"<bos>", "shipping", "refund"}
catalog_vocab = base_vocab | {"<SKU-12345>"}
required_tokens = {"shipping", "<SKU-12345>"}

base_output = output_vocab(base_vocab, [base_vocab, catalog_vocab], "base")
union_output = output_vocab(base_vocab, [base_vocab, catalog_vocab], "union")

print("base_missing_required:", sorted(required_tokens - base_output))
print("union_missing_required:", sorted(required_tokens - union_output))
print("union_requires_added_embedding_eval:", "<SKU-12345>" not in base_vocab)
Output
base_missing_required: ['<SKU-12345>']
union_missing_required: []
union_requires_added_embedding_eval: True

Once the configuration is set, you can execute the merge using the CLI tool. The command below takes the YAML configuration file and the output path, producing the final merged model on disk:

terminal
# Run merge on a local GPU
mergekit-yaml merge_config.yml ./output_model --cuda

For hardware-specific and memory-saving flags, check mergekit-yaml --help because supported options vary by version.[1]

Mergekit lets engineers build candidate checkpoints without a new gradient-training run. Iterate over parameters such as retained density and task weights only against a defined evaluation suite.

The diagram below shows the iterative validation loop you should run after every merge: evaluate each target slice, compare it to declared gates and source baselines, and retune or reject any candidate that misses a critical threshold.

[Figure: Merge validation loop from source models through merged candidate and per-task evaluation, showing that promotion stops when a critical task misses its threshold. Diagram nodes: Source models, Merge step, Merged candidate, and Per-task evaluation suite.]
Every merge is provisional until a per-task evaluation suite confirms that each critical slice clears its release threshold instead of hiding misses behind an aggregate score.

An aggregate score is useful for ranking candidates, but a release decision should fail on any critical threshold miss:

evaluate-per-task-release-gates.py
def failed_release_gates(scores, thresholds):
    return {
        task: (scores[task], minimum)
        for task, minimum in thresholds.items()
        if scores[task] < minimum
    }

scores = {"shipping": 92.4, "refunds": 82.1, "catalog": 87.0}
thresholds = {"shipping": 90.0, "refunds": 85.0, "catalog": 84.0}
failures = failed_release_gates(scores, thresholds)

print("aggregate_score:", round(sum(scores.values()) / len(scores), 1))
print("failed_tasks:", sorted(failures))
print("promote:", not failures)
Output
aggregate_score: 87.2
failed_tasks: ['refunds']
promote: False

Finding good coefficients automatically

Manually searching merge coefficients and layer selections can be expensive. Evolutionary model merging[10] applies evolutionary search to optimize merge recipes against a supplied fitness evaluation. Its search space can include per-layer source choices, interpolation weights, and whether to merge in parameter space, data-flow space, or both.

Sakana AI reports using this approach to build EvoLLM-JP from Japanese-language and math-oriented models, optimizing for the evaluations selected in that work.[10] The principle is straightforward: treat each merge configuration as a "genome," score it on validation data, mutate its parameters, and retain better-scoring candidates. The resulting model inherits the objective's coverage and blind spots, so release gates still need independent slices.
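The genome, score, mutate, retain loop described above can be sketched with a minimal (1+1) evolution strategy. This is an illustration only, not the search procedure used in the cited work: `toy_fitness` is a hypothetical stand-in for a real merge-and-evaluate step, and the two-coefficient genome, step size, and generation count are assumptions chosen for readability.

```python
import random

def toy_fitness(weights):
    """Hypothetical stand-in for a real evaluation: peaks at weights (0.6, 0.3)."""
    target = (0.6, 0.3)
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def mutate(weights, rng, step=0.1):
    """Perturb each coefficient with Gaussian noise and clamp it to [0, 1]."""
    return tuple(min(1.0, max(0.0, w + rng.gauss(0.0, step))) for w in weights)

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    best = (0.5, 0.5)                 # initial "genome": two merge coefficients
    best_score = toy_fitness(best)
    for _ in range(generations):
        child = mutate(best, rng)
        child_score = toy_fitness(child)
        if child_score > best_score:  # retain only better-scoring candidates
            best, best_score = child, child_score
    return best, best_score

best, best_score = evolve()
print("best genome:", tuple(round(w, 2) for w in best))
print("fitness:", round(best_score, 4))
```

In a real run, each fitness call would build and evaluate a merged checkpoint, so the evaluation budget usually dominates the cost of the search.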

In practice, a small grid search is a transparent baseline: try lambda values in [0.2, 0.4, 0.6, 0.8] for each task vector, measure each required slice on held-out data, and retain only configurations that clear all release gates. For a logistics merge, those slices should include carrier-policy questions, refund scenarios, and catalog lookups; a shipping-only objective does not protect refund behavior.
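A gate-aware grid search along those lines might look like the sketch below. It uses two task vectors rather than three for brevity, and `toy_evaluate` is a hypothetical scoring function with made-up numbers; a real run would merge with each coefficient pair and score held-out data.

```python
from itertools import product

def toy_evaluate(lam_ship, lam_refund):
    """Hypothetical scores; each coefficient helps its own task and hurts the other."""
    shipping = 80.0 + 15.0 * lam_ship - 5.0 * lam_refund
    refunds = 78.0 + 14.0 * lam_refund - 4.0 * lam_ship
    return {"shipping": shipping, "refunds": refunds}

thresholds = {"shipping": 86.0, "refunds": 85.0}  # illustrative release gates
grid = [0.2, 0.4, 0.6, 0.8]

passing = []
for lam_ship, lam_refund in product(grid, repeat=2):
    scores = toy_evaluate(lam_ship, lam_refund)
    # Keep a configuration only if every required slice clears its gate.
    if all(scores[task] >= minimum for task, minimum in thresholds.items()):
        passing.append((lam_ship, lam_refund))

print("configs tried:", len(grid) ** 2)
print("configs passing all gates:", passing)
```

With these toy numbers only one of the 16 pairs clears both gates. Filtering on every slice, rather than ranking by the mean, is what keeps a shipping-heavy configuration from being selected.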

Passthrough and frankenmerging

Not every merge averages weights. Mergekit's passthrough method copies selected tensors or layer ranges from source models into the output checkpoint.[1] This is model splicing rather than interpolation: tensor dimensions must compose, while useful behavior across the splice remains an evaluation result.

For example, a recipe might copy layers 0-19 from checkpoint A and layers 20-31 from checkpoint B. Compatible dimensions make that artifact constructible; they do not show that B's later layers can interpret the hidden states produced by A's earlier layers.

Use passthrough as an experiment with explicit source lineage, layer boundaries, and the same per-task gates used for any other candidate. A successful build only proves shape compatibility.
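One way to keep lineage and layer boundaries explicit is to write the splice plan down and validate it before building anything. The helper below is a hypothetical planning sketch, not mergekit's configuration format; `build_splice_plan` and the checkpoint labels are names invented for this example, and it assumes the output has the same depth as the sources.

```python
def build_splice_plan(ranges, total_layers):
    """Expand (source, start, end_inclusive) ranges into one source per output layer."""
    plan = []
    for source, start, end in ranges:
        if start != len(plan):
            raise ValueError(f"gap or overlap before layer {start}")
        plan.extend((source, layer) for layer in range(start, end + 1))
    if len(plan) != total_layers:
        raise ValueError(f"plan covers {len(plan)} layers, expected {total_layers}")
    return plan

plan = build_splice_plan(
    [("checkpoint_a", 0, 19), ("checkpoint_b", 20, 31)],
    total_layers=32,
)
# First output layer whose source differs from the previous layer's source.
boundary = next(i for i in range(1, len(plan)) if plan[i][0] != plan[i - 1][0])

print("layers:", len(plan))
print("splice boundary at output layer:", boundary)
print("lineage at boundary:", plan[boundary - 1], "->", plan[boundary])
```

Recording the boundary makes it easier to design targeted evaluations around the splice, since that is where hidden-state compatibility is least certain.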

Merging pitfalls and hard limits

Direct weight merging requires aligned parameter meaning, shape, and output-space assumptions. You cannot directly interpolate a 7-billion-parameter checkpoint with a 70-billion-parameter checkpoint, or silently combine incompatible vocabulary mappings and treat token IDs as equivalent. A generated merged artifact is a candidate, not evidence of retained quality.

You should avoid merging when:

  1. Different architectures: Source models must share the exact same structural architecture, parameter count, and layer layout. You can't merge a MiniMax model with a Mistral model, nor an 8B model with a 72B model.
  2. Different tokenizers: Direct interpolation assumes aligned embeddings and next-token heads. Mergekit can help with tokenizer union in compatible cases, but it doesn't make arbitrary vocabulary mismatches disappear.[1]
  3. Mismatched prompt formats or chat templates: These usually won't block the raw tensor merge, but they can make the merged checkpoint look broken at inference time because the prompt serialization no longer matches the behaviors learned during tuning.
  4. Quantized-only checkpoints: If you want to merge, do it on dequantized or full-precision weights first, then quantize the final artifact. The merge math depends on real-valued deltas, not already-rounded integers. Treat the merged checkpoint as a new model and re-run your quantization calibration and evals.
  5. Distant fine-tuning trajectories: Models fine-tuned on divergent data or with large update magnitudes may be poor interpolation candidates. Git Re-Basin demonstrates permutation alignment in studied MLP, CNN, and ResNet settings; it is not an established repair step for arbitrary LLM merges.[5]
  6. Critical precision tasks: Do not ship a merge when a required slice misses its threshold, even if an aggregate metric rises. A specialist baseline remains part of the comparison.
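The structural items in this list (architecture, shapes, tokenizer mapping) can be screened before any tensors are loaded. The following is a minimal sketch that assumes you have already extracted a name-to-shape map and a token-to-ID map for each checkpoint; `preflight` and the example parameter names are hypothetical.

```python
def preflight(shapes_a, shapes_b, vocab_a, vocab_b):
    """Return a list of blocking issues; an empty list means tensors line up."""
    issues = []
    unmatched = sorted(set(shapes_a) ^ set(shapes_b))
    if unmatched:
        issues.append(f"parameter names differ: {unmatched}")
    for name in sorted(set(shapes_a) & set(shapes_b)):
        if shapes_a[name] != shapes_b[name]:
            issues.append(
                f"shape mismatch for {name}: {shapes_a[name]} vs {shapes_b[name]}"
            )
    if vocab_a != vocab_b:
        issues.append("token-to-ID mappings differ")
    return issues

shapes_a = {"embed.weight": (32000, 4096), "layers.0.q_proj.weight": (4096, 4096)}
shapes_b = {"embed.weight": (32001, 4096), "layers.0.q_proj.weight": (4096, 4096)}
vocab_a = {"<bos>": 0, "shipping": 1}
vocab_b = {"<bos>": 0, "shipping": 1, "<SKU-12345>": 2}

issues = preflight(shapes_a, shapes_b, vocab_a, vocab_b)
for issue in issues:
    print("BLOCK:", issue)
print("clean self-check:", preflight(shapes_a, shapes_a, vocab_a, vocab_a) == [])
```

An empty issue list only shows that the merge is constructible. Chat templates, fine-tuning distance, and per-task thresholds still need their own checks.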

Diagnosing a broken merge

When a merge goes wrong, the model usually tells you quickly. Here are three common failure patterns, their causes, and how to fix them:

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Output is incoherent or random tokens | Tokenizer/output mapping, prompt template, or architecture mismatch | Verify output vocabulary policy, chat template, and layer shapes |
| Model loops repetitive text | Coefficient overload or weights pushed too far out of distribution | Reduce coefficient magnitudes, start additive lambdas in the 0.0-1.0 range, and re-run evals |
| Merged model is worse than every source | Source interference or incompatible lineage | Check source lineage and tokenizer policy; prefer compatible sources and retune or reject the merge |

Do not predict quality from the merge rule alone. Model Soups[3] reported gains over the best individual checkpoint in its evaluated shared-initialization experiments, including ImageNet settings. That result motivates trying a merge; it does not predict whether a new LLM merge will clear its specialist task gates.

After merging, run a full evaluation suite across all target tasks, not only the aggregate score. Strong shipping accuracy can mask a miss in refund handling. Track per-task metrics independently and compare against declared thresholds and each source model's baseline.

Practice checkpoints

Why is weight averaging worth testing at all? Shouldn't mixing model weights produce garbage? Shared-initialization checkpoints provide a plausible starting point because studies such as Model Soups observe useful averages in evaluated settings. Shared lineage is not a quality proof: sample the interpolation path and run task gates. Independently trained models can also encode equivalent functions with incompatible neuron orderings, making direct averaging particularly suspect.

How do you choose merge coefficients for three or more models? Start with equal weights as a baseline, then tune against a held-out validation set that reflects your target task mixture. Track each target capability separately. If equal weights cause interference, try task-vector coefficients, TIES, DARE-TIES, or a small grid search over weights and density values.

When should you merge models instead of using mixture-of-experts routing? Merge when you want one dense checkpoint with roughly one-model serving cost, simpler rollback, and no router path. Use MoE or explicit routing when tasks are highly distinct, specialists must stay sharp, or you can afford multiple experts plus routing complexity.

How does LoRA adapter merging differ from full-weight merging? LoRA adapters are low-rank delta weights attached to a frozen base, so adapter merging operates on a much smaller parameter set. That makes experiments cheaper, but it doesn't remove interference: two adapters can still push the same target matrix in conflicting directions.
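To make the LoRA point concrete, the sketch below builds two rank-1 deltas for the same small weight matrix and measures how they interact. The adapter values and dimensions are made up for illustration, and the usual LoRA scaling factor is omitted for simplicity.

```python
def outer(col, row):
    """Rank-1 update: delta = col (d_out x 1) times row (1 x d_in)."""
    return [[c * r for r in row] for c in col]

def frobenius_dot(m1, m2):
    """Sum of element-wise products; negative means the deltas oppose each other."""
    return sum(x * y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

# Two hypothetical rank-1 adapters targeting the same 2x3 weight matrix.
delta_refund = outer([1.0, 0.5], [0.2, -0.1, 0.0])
delta_shipping = outer([-1.0, 0.5], [0.2, 0.1, 0.0])

overlap = frobenius_dot(delta_refund, delta_shipping)
conflicts = sum(
    1
    for ra, rb in zip(delta_refund, delta_shipping)
    for a, b in zip(ra, rb)
    if a * b < 0  # entries where the two adapters push in opposite directions
)

# Parameter counts for an assumed 4096x4096 matrix with rank-8 factors.
full_params = 4096 * 4096
lora_params = 8 * (4096 + 4096)

print("overlap:", round(overlap, 4))
print("sign-conflicting entries:", conflicts)
print("full vs LoRA params:", full_params, lora_params)
```

The adapter stores far fewer numbers than the full matrix, yet the two deltas still disagree in sign on some entries, which is the same interference that sign-aware methods such as TIES try to resolve.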

What should you check first when a merge outputs incoherent text? Check architecture, tensor shapes, tokenizer output policy, and chat template before coefficient tuning. These checks identify whether inputs and output IDs still mean what the merge recipe assumed.

Try it yourself

The best way to learn merging is to reason through a concrete scenario before you touch a GPU.

Interview warm-up

Question: Why can't we average the weights of GPT-2 and Llama-3?

Reasoning: Even if you ignored the parameter-count mismatch, the two models have different architectures and different tokenizers. Their parameter entries therefore do not represent an aligned interpolation space. Same-base checkpoints are conservative direct-merge candidates; alignment methods such as Git Re-Basin provide research evidence in studied neural-network architectures, not a generic way to make GPT-2 and Llama compatible.

Grid-search challenge

You have a base model and two fine-tuned variants:

  • code_ft improves Python accuracy from 40% to 70%
  • math_ft improves algebra accuracy from 35% to 65%

You want a merged model that scores at least 60% on both tasks. You try Task Arithmetic with coefficients $\lambda_{\text{code}}$ and $\lambda_{\text{math}}$.

Given the table below, which coefficient pair is the best starting point?

| $\lambda_{\text{code}}$ | $\lambda_{\text{math}}$ | Python score | Algebra score |
| --- | --- | --- | --- |
| 0.8 | 0.8 | 64% | 59% |
| 0.6 | 0.6 | 58% | 53% |
| 0.5 | 0.5 | 55% | 50% |
| 1.0 | 0.0 | 70% | 35% |

Answer: None of the listed pairs reaches 60% on both tasks. The best starting point is 0.8, 0.8 because it gets closest at 64% Python and 59% algebra, which tells you the merge is almost there but still suffering from interference. In practice, you'd try TIES or DARE next, or continue tuning coefficients around that region, to lift algebra without giving up too much Python performance.
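The same reasoning can be checked mechanically. This sketch encodes the four rows from the table and ranks them by their worst shortfall against the 60% target; the `worst_shortfall` helper is an illustrative choice of ranking rule, not the only reasonable one.

```python
rows = [
    {"lam_code": 0.8, "lam_math": 0.8, "python": 64, "algebra": 59},
    {"lam_code": 0.6, "lam_math": 0.6, "python": 58, "algebra": 53},
    {"lam_code": 0.5, "lam_math": 0.5, "python": 55, "algebra": 50},
    {"lam_code": 1.0, "lam_math": 0.0, "python": 70, "algebra": 35},
]
target = 60

def worst_shortfall(row):
    """Largest gap below the target across both tasks (0 means every gate passes)."""
    return max(0, target - row["python"], target - row["algebra"])

passing = [r for r in rows if worst_shortfall(r) == 0]
closest = min(rows, key=worst_shortfall)

print("passing rows:", passing)  # []
print("closest pair:", (closest["lam_code"], closest["lam_math"]))  # (0.8, 0.8)
print("worst shortfall (points):", worst_shortfall(closest))  # 1
```

Ranking by worst shortfall rather than by mean score keeps the 1.0, 0.0 row from looking attractive: its Python score is the highest in the table, but it misses the algebra target by 25 points.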

Debug a broken merge

You merge three e-commerce support models (refunds, shipping, catalog) using uniform averaging. The merged model answers refund questions perfectly but responds to shipping queries with seemingly random product names.

What is the most likely cause, and what should you check first?

Answer: Tokenizer/output-space mismatch is a plausible first suspect, but the symptom alone does not identify one root cause. Inspect each source tokenizer and chat template, then compare shipping behavior in the source checkpoints. If added SKU tokens are needed, choose and evaluate an explicit union tokenizer policy; otherwise retain a shared-base output vocabulary and investigate source interference.

Mastery check

Key concepts

  • Model soups and uniform averaging
  • Task vectors and task arithmetic
  • Linear mode connectivity
  • Git Re-Basin and permutation alignment
  • TIES-Merging
  • DARE sparsification
  • SLERP
  • mergekit workflow and tokenizer handling
  • Per-task regression testing for merged checkpoints

Evaluation rubric

  • Foundational: Explains why same-base fine-tuned checkpoints are reasonable averaging candidates but still require evaluation
  • Intermediate: Computes task vectors and shows how scaling coefficients change the merged checkpoint
  • Intermediate: Explains why plain signed averaging can cancel opposing task deltas
  • Advanced: Compares when to use model soups, task arithmetic, TIES, DARE, or SLERP
  • Advanced: Diagnoses tokenizer mismatch, architecture mismatch, and coefficient overload in a broken merge
  • Advanced: Designs a post-merge eval suite that blocks promotion when any critical capability misses its release threshold

Common pitfalls

"Same parameter count means mergeable"

Symptom: The merged model produces incoherent or unstable outputs right away. Cause: Matching parameter count is not enough. The checkpoints may differ in architecture details, tokenizer alignment, or base-model lineage. Fix: Verify shared architecture, layer shapes, tokenizer handling, and common base checkpoint before merging any weights.

"One aggregate score proves the merge worked"

Symptom: The merged checkpoint looks strong overall but one specialist workflow regresses badly in production. Cause: Aggregate metrics can hide per-task damage from interference. Fix: Run per-task checks against declared release thresholds and source baselines, then block promotion when any critical skill falls below threshold.

"More coefficient means more skill"

Symptom: Larger lambda values make outputs repetitive, brittle, or off-distribution. Cause: Oversized coefficients can push the merged checkpoint too far away from the stable base region. Fix: Start with small or equal weights, tune on held-out evals, and watch for regressions rather than assuming stronger scaling always helps.

"Tokenizer issues only matter at preprocessing time"

Symptom: Catalog-heavy prompts fail or produce random product tokens even though the merge completed. Cause: Embedding rows and LM-head rows no longer line up cleanly across token IDs. Fix: Check vocabulary equality first. If source models added tokens, use explicit tokenizer alignment such as mergekit union handling and then rerun evals.

"DARE is a full merge recipe by itself"

Symptom: Engineers sparsify task vectors and assume the job is done. Cause: DARE is a preprocessing step that drops and rescales delta entries; it still needs a downstream merge rule such as averaging or TIES. Fix: Treat DARE as sparsification first, then run a real merge rule and evaluate the combined checkpoint.
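The two-stage structure can be sketched in a few lines of standard-library Python. Here `dare` performs only the drop-and-rescale step, and `merge_average` is a deliberately simple stand-in for the downstream rule (a TIES-style sign election could take its place); the base weights and task vectors are made-up values.

```python
import random

def dare(delta, drop_rate, rng):
    """Stage 1: drop each entry with probability drop_rate, rescale survivors."""
    keep_prob = 1.0 - drop_rate
    return [d / keep_prob if rng.random() < keep_prob else 0.0 for d in delta]

def merge_average(base, deltas, weight=1.0):
    """Stage 2: add the mean of the sparsified task vectors to the base weights."""
    mean = [sum(col) / len(deltas) for col in zip(*deltas)]
    return [b + weight * m for b, m in zip(base, mean)]

rng = random.Random(0)
base = [1.0, 1.0, 1.0, 1.0]
task_vectors = [[0.2, -0.4, 0.1, 0.6], [0.3, 0.1, -0.2, 0.0]]

sparsified = [dare(tv, drop_rate=0.5, rng=rng) for tv in task_vectors]
merged = merge_average(base, sparsified)

print("sparsified task vectors:", sparsified)
print("merged weights:", [round(w, 3) for w in merged])
```

Stopping after the first stage leaves you with sparse deltas and no checkpoint. The merged weights from the second stage are what the per-task evaluation should score.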

Next Step
Continue to Prompt Optimization with DSPy

Model merging changes behavior by combining weights; DSPy changes behavior at the program and prompt layer, giving you a faster iteration loop before expensive training work.

References

[1] Goddard, C., et al. (2023). Mergekit: Tools for merging pre-trained large language models.

[2] Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2020). Linear Mode Connectivity and the Lottery Ticket Hypothesis. ICML 2020.

[3] Wortsman, M., et al. (2022). Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy without Increasing Inference Time. ICML 2022.

[4] Entezari, R., Sedghi, H., Saukh, O., & Neyshabur, B. (2022). The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. ICLR 2022.

[5] Ainsworth, S. K., Hayase, J., & Srinivasa, S. (2022). Git Re-Basin: Merging Models modulo Permutation Symmetries. ICLR 2023.

[6] Ilharco, G., et al. (2022). Editing Models with Task Arithmetic. ICLR 2023.

[7] Yadav, P., et al. (2023). TIES-Merging: Resolving Interference When Merging Models. NeurIPS 2023.

[8] Yu, L., et al. (2023). Language Models are Super Mario: Absorbing Capabilities from Homologous Models as a Free Lunch. ICML 2024.

[9] Shoemake, K. (1985). Animating Rotation with Quaternion Curves. SIGGRAPH '85.

[10] Sakana AI (2024). Evolutionary Optimization of Model Merging Recipes.