Model Merging: Getting SOTA Quality at Zero Training Cost
SLERP, TIES, DARE, model soup, and LoRA adapter merging — how the open-source community routinely beats fine-tuned models by merging weights in parameter space. What works, what doesn't, and how to pick merge ratios without running full evals.
Model merging combines the weights of multiple fine-tuned models without any additional training. The result is often a model that outperforms any of its parents on the combined task distribution. The open-source community has turned this into an art form — top models on many HuggingFace leaderboards are merges, not vanilla fine-tunes. Here is why it works and how to do it.
Why It Works
Fine-tuned models that share the same base model occupy nearby regions of parameter space. Fine-tuning moves weights in directions that improve task-specific performance, but those directions are not arbitrary — the loss landscape of the pre-trained model shapes where fine-tuning can go. Interpolation between two fine-tuned models often lands in a region that preserves the capabilities of both, because the loss surface between nearby minima tends to be relatively flat.
This is the (linear) mode connectivity hypothesis: models fine-tuned independently from the same base are typically connected by low-loss paths in parameter space, often close to straight lines. Model merging exploits this geometry. Models with different architectures cannot be merged this way at all, because their tensors do not line up, and models built on very different base checkpoints merge unreliably because their parameter spaces are not aligned.
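One way to probe this on your own checkpoints is to evaluate a loss at evenly spaced points on the straight line between them and look for a bump. The sketch below does that in plain Python on toy problems. The names `interpolate`, `loss_barrier`, `bowl_loss`, and `double_well_loss` are illustrative assumptions, and with real models `loss_fn` would be an evaluation on held-out data rather than a closed-form function.

```python
def interpolate(sd_a, sd_b, t):
    """Linear interpolation of two toy 'state dicts' (name -> list of floats)."""
    return {k: [(1 - t) * a + t * b for a, b in zip(sd_a[k], sd_b[k])] for k in sd_a}

def loss_barrier(sd_a, sd_b, loss_fn, steps=11):
    """Largest excess of the on-path loss over the straight line between the endpoint losses."""
    loss_a, loss_b = loss_fn(sd_a), loss_fn(sd_b)
    barrier = 0.0
    for i in range(steps):
        t = i / (steps - 1)
        on_path = loss_fn(interpolate(sd_a, sd_b, t))
        straight = (1 - t) * loss_a + t * loss_b
        barrier = max(barrier, on_path - straight)
    return barrier

def bowl_loss(sd):
    # Toy stand-in for two checkpoints in the same basin: a single convex bowl
    x, y = sd["w"]
    return (x - 1.0) ** 2 + y ** 2

def double_well_loss(sd):
    # Toy stand-in for checkpoints in different basins: minima at x = -1 and x = +1
    (x,) = sd["w"]
    return (x * x - 1.0) ** 2

same_basin = loss_barrier({"w": [0.8, 0.1]}, {"w": [1.2, -0.1]}, bowl_loss)
different_basins = loss_barrier({"w": [-1.0]}, {"w": [1.0]}, double_well_loss)
print(same_basin, different_basins)
```

A barrier near zero suggests the two checkpoints can be interpolated safely; a large barrier is a warning that a plain weight-space merge is likely to land in a high-loss region.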
SLERP
Spherical Linear Interpolation (SLERP) moves along the arc between two weight vectors rather than along the straight line between them. When the two vectors have similar norms, the interpolated weights keep that norm, whereas naive linear averaging shrinks it as the angle between the vectors grows. SLERP is defined for exactly two models. One hyperparameter: t in [0, 1], where t=0 returns model A and t=1 returns model B.
import torch

def slerp(w1, w2, t, eps=1e-8):
    """Spherical linear interpolation between two weight tensors of the same shape."""
    w1_flat = w1.float().flatten()
    w2_flat = w2.float().flatten()
    # Cosine of the angle between the tensors; eps guards against zero norms
    cos_theta = (w1_flat @ w2_flat) / (w1_flat.norm() * w2_flat.norm()).clamp(min=eps)
    theta = torch.acos(cos_theta.clamp(-1, 1))
    sin_theta = torch.sin(theta)
    if sin_theta.abs() < 1e-6:
        # Nearly parallel (or anti-parallel) vectors: fall back to linear interpolation
        interp = (1 - t) * w1_flat + t * w2_flat
    else:
        interp = (torch.sin((1 - t) * theta) / sin_theta) * w1_flat \
            + (torch.sin(t * theta) / sin_theta) * w2_flat
    return interp.reshape(w1.shape).to(w1.dtype)
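To see what the arc buys you, compare the norm of the midpoint under linear and spherical interpolation for two orthogonal unit vectors. This is a dependency-free sketch using plain lists; `lerp`, `slerp_vec`, and `norm` are illustrative helper names, not part of any library.

```python
import math

def norm(u):
    return math.sqrt(sum(a * a for a in u))

def lerp(u, v, t):
    """Straight-line interpolation between two vectors."""
    return [(1 - t) * a + t * b for a, b in zip(u, v)]

def slerp_vec(u, v, t):
    """Spherical interpolation between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    cos_theta = max(-1.0, min(1.0, dot / (norm(u) * norm(v))))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < 1e-6:
        return lerp(u, v, t)  # nearly parallel: the arc and the line coincide
    coeff_u = math.sin((1 - t) * theta) / sin_theta
    coeff_v = math.sin(t * theta) / sin_theta
    return [coeff_u * a + coeff_v * b for a, b in zip(u, v)]

u, v = [1.0, 0.0], [0.0, 1.0]
lerp_norm = norm(lerp(u, v, 0.5))        # about 0.707: the midpoint shrinks
slerp_norm = norm(slerp_vec(u, v, 0.5))  # about 1.0: the midpoint stays on the unit circle
print(lerp_norm, slerp_norm)
```

Fine-tunes of the same base usually sit at a much smaller angle than 90 degrees, so in practice the gap between the two methods is smaller than in this deliberately extreme example.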
TIES Merging
TIES (Trim, Elect Sign, Disjoint Merge) addresses a specific failure mode of naive weight averaging: sign conflicts. When model A has increased a weight and model B has decreased it, averaging produces a near-zero value that neither model wanted. TIES resolves this with three steps:
- Trim: compute the task vector (fine-tuned weights minus base weights) for each model and zero out the small deltas, the parameters that fine-tuning barely touched. The TIES paper keeps only the top-k% of deltas by magnitude in each task vector; the code here uses a fixed absolute threshold as a simpler stand-in.
- Elect Sign: for each parameter, compare the total positive and total negative movement across all models being merged. The elected sign is the one with the larger total magnitude, which is the sign of the summed deltas. This is a magnitude-weighted vote rather than a simple head count.
- Disjoint Merge: average only the deltas whose sign matches the elected sign. Trimmed deltas and deltas with the opposite sign are excluded from the average for that position. This prevents cancellation.
import torch

def ties_merge(base, models, trim_threshold=0.02, t=1.0):
    """TIES-style merge of state dicts that share the same base model."""
    task_vectors = [{k: m[k].float() - base[k].float() for k in base} for m in models]
    merged = {}
    for k in base:
        deltas = torch.stack([tv[k] for tv in task_vectors])
        # Step 1: Trim small deltas (fixed threshold; the paper keeps the top-k% by magnitude)
        deltas[deltas.abs() < trim_threshold] = 0
        # Step 2: Elect the sign with the larger total magnitude at each position
        elected_sign = deltas.sum(dim=0).sign()
        elected_sign[elected_sign == 0] = 1  # break exact ties
        # Step 3: Average only the deltas whose sign matches the elected sign
        mask = (deltas.sign() == elected_sign.unsqueeze(0)).float()
        count = mask.sum(dim=0).clamp(min=1)
        merged_delta = (deltas * mask).sum(dim=0) / count
        # t scales the merged task vector before it is added back to the base
        merged[k] = (base[k].float() + t * merged_delta).to(base[k].dtype)
    return merged
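A single parameter position makes the effect of the three steps concrete. The sketch below works on plain floats and takes the elected sign to be the sign of the summed deltas, as in the TIES paper; `naive_average` and `ties_scalar` are illustrative names, and the four delta values are made up for the example.

```python
def naive_average(deltas):
    return sum(deltas) / len(deltas)

def ties_scalar(deltas, trim_threshold=0.02):
    """Trim, elect sign, and disjoint-merge for one parameter position."""
    # Trim: deltas below the threshold are treated as noise
    trimmed = [d if abs(d) >= trim_threshold else 0.0 for d in deltas]
    # Elect sign: sign of the summed deltas (positive on an exact tie)
    elected_positive = sum(trimmed) >= 0
    # Disjoint merge: average only the non-zero deltas that agree with the elected sign
    kept = [d for d in trimmed if d != 0.0 and (d > 0) == elected_positive]
    return sum(kept) / len(kept) if kept else 0.0

# Three models moved this weight noticeably (one in the opposite direction); one barely touched it
deltas = [0.5, 0.4, -0.6, 0.01]
naive = naive_average(deltas)  # about 0.0775: the conflict nearly cancels the update
ties = ties_scalar(deltas)     # about 0.45: the mean of the two agreeing deltas
print(naive, ties)
```

The naive average is dragged toward zero by the dissenting model and by the untouched one, while the TIES-style result stays close to what the agreeing models actually learned.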
DARE
DARE (Drop And REscale) sparsifies each model's task vector before merging, which helps most when merging many models at once. With 4+ models, parameter interference accumulates even after TIES sign resolution. DARE randomly drops delta weights with probability p and rescales the survivors by 1/(1-p) to preserve the expected value, so the task vectors overlap less and interfere less.
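import torch

def dare(task_vector, drop_rate=0.9):
    """
    Randomly drop delta weights with probability p = drop_rate.
    Rescale the survivors by 1/(1-p) to preserve the expected value.
    Rule of thumb: drop_rate=0.9 for 4-8 model merges; 0.7 for 2-3 model merges.
    Apply to each model's task vector before the TIES merge.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # Keep each element independently with probability 1 - drop_rate
    keep = torch.rand_like(task_vector, dtype=torch.float32) >= drop_rate
    scale = 1.0 / (1.0 - drop_rate)
    return task_vector * keep.to(task_vector.dtype) * scale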
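The "preserve expected value" claim is easy to check empirically. The following dependency-free sketch applies the same drop-and-rescale rule to a list of floats with a seeded generator; `dare_list` is an illustrative name, and the constant task vector of 0.01 is a made-up example chosen so the means are easy to read.

```python
import random

def dare_list(task_vector, drop_rate, rng):
    """Drop each delta with probability drop_rate and rescale survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [d * scale if rng.random() >= drop_rate else 0.0 for d in task_vector]

rng = random.Random(0)            # seeded so the run is repeatable
task_vector = [0.01] * 100_000    # toy task vector: every delta is 0.01
sparse = dare_list(task_vector, 0.9, rng)

kept_fraction = sum(1 for d in sparse if d != 0.0) / len(sparse)
mean_before = sum(task_vector) / len(task_vector)
mean_after = sum(sparse) / len(sparse)
print(kept_fraction, mean_before, mean_after)
```

Roughly one delta in ten survives, each survivor is about ten times larger, and the mean stays close to 0.01. Any single parameter is noisier after DARE, but the task vector as a whole keeps its average effect.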
Model Soup
Model soup is the simplest merge technique: average the weights of multiple fine-tuned checkpoints of the same base model, trained with different hyperparameters (learning rate, batch size, data order, LoRA rank). Averaging often beats any single checkpoint because the checkpoints occupy nearby loss minima and their average lands in a flatter, more generalizable region.
- Greedy soup: start with the best single checkpoint. Try adding the remaining checkpoints one by one in decreasing order of validation score, and keep each one only if the soup's validation performance does not drop. Checkpoints that hurt are skipped.
- Uniform soup: average all checkpoints equally. Works when all checkpoints are trained to reasonable quality. Degrades if any checkpoint significantly underperforms.
- Model soup is particularly effective for fine-tuning sweeps: run 5-10 hyperparameter configurations and merge the top-k by validation performance. The merged model often beats the single best checkpoint by 1-3 points.
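The greedy procedure can be sketched in a few lines. This version follows the skip-and-continue recipe from the model soups paper and works on toy state dicts (name to list of floats); `average_weights`, `greedy_soup`, and the `evaluate` callback are illustrative names, and the toy score (negative squared distance to a target) stands in for a real validation metric.

```python
def average_weights(state_dicts):
    """Uniform average of toy state dicts (name -> list of floats)."""
    n = len(state_dicts)
    return {
        k: [sum(vals) / n for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in state_dicts[0]
    }

def greedy_soup(checkpoints, evaluate):
    """Add checkpoints in order of validation score, keeping each only if the soup does not get worse."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(average_weights(soup + [candidate]))
        if score >= best:
            soup.append(candidate)
            best = score
    return average_weights(soup), best, len(soup)

def evaluate(sd):
    # Toy validation score: higher is better, best possible is 0 at w = [1, 1]
    return -sum((w - 1.0) ** 2 for w in sd["w"])

checkpoints = [{"w": [0.8, 1.1]}, {"w": [1.2, 0.9]}, {"w": [3.0, 3.0]}]
soup_weights, soup_score, n_used = greedy_soup(checkpoints, evaluate)
print(soup_weights, soup_score, n_used)
```

The two good checkpoints average to something better than either one, and the outlier is rejected because adding it would lower the validation score. A uniform soup over all three would have been pulled far off target.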
LoRA Adapter Merging
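LoRA adapters can be merged at the adapter level instead of merging full fine-tuned checkpoints. A LoRA adapter is defined by two low-rank matrices, A (r x in_features) and B (out_features x r); the effective weight delta is (alpha/r) * B @ A. Merging adapters is cheaper than merging full weights because only the adapter parameters are involved. The merge should combine the deltas B @ A: averaging the A matrices and the B matrices separately gives a different result, because the product is not linear in the pair (A, B).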
def merge_lora_adapters(adapters, weights=None):
    """
    Linear combination of LoRA adapter deltas.
    adapters: list of (A, B, alpha) tuples, with A of shape (r, in_features)
              and B of shape (out_features, r)
    weights: merge coefficients (default: uniform)
    Returns: merged delta to add to base weight
    """
    if weights is None:
        weights = [1.0 / len(adapters)] * len(adapters)
    if len(weights) != len(adapters):
        raise ValueError("need exactly one weight per adapter")
    merged_delta = None
    for (A, B, alpha), w in zip(adapters, weights):
        r = A.shape[0]  # LoRA rank
        delta = (alpha / r) * (B @ A)
        merged_delta = w * delta if merged_delta is None else merged_delta + w * delta
    return merged_delta  # W_final = W_base + merged_delta
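If you want the merged result to remain an adapter rather than a dense delta, one option is to concatenate the adapters along the rank dimension: stacking the A matrices row-wise and the scaled B matrices column-wise gives a single pair whose product equals the weighted sum of the individual deltas, at rank equal to the sum of the input ranks. The sketch below checks that identity with nested lists; `matmul`, `concat_lora`, and the tiny example matrices are illustrative assumptions, not an API from any LoRA library.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def concat_lora(adapters, weights):
    """Return (A_cat, B_cat) with B_cat @ A_cat == sum_i w_i * (alpha_i / r_i) * B_i @ A_i."""
    A_cat = [row for A, _, _ in adapters for row in A]  # shape (sum of ranks, in_features)
    out_features = len(adapters[0][1])
    B_cat = [[] for _ in range(out_features)]           # shape (out_features, sum of ranks)
    for (A, B, alpha), w in zip(adapters, weights):
        scale = w * alpha / len(A)                      # fold weight and alpha / r into B
        for i in range(out_features):
            B_cat[i].extend(scale * b for b in B[i])
    return A_cat, B_cat

# Two rank-1 adapters on a 2x2 weight: (A, B, alpha)
adapters = [
    ([[1.0, 2.0]], [[1.0], [0.0]], 2.0),
    ([[0.0, 1.0]], [[1.0], [1.0]], 1.0),
]
weights = [0.5, 0.5]

A_cat, B_cat = concat_lora(adapters, weights)
low_rank_delta = matmul(B_cat, A_cat)

# Reference: materialize each delta and combine, as merge_lora_adapters does
direct_delta = [[0.0, 0.0], [0.0, 0.0]]
for (A, B, alpha), w in zip(adapters, weights):
    delta = matmul(B, A)
    for i in range(2):
        for j in range(2):
            direct_delta[i][j] += w * (alpha / len(A)) * delta[i][j]
print(low_rank_delta, direct_delta)
```

The trade-off is rank growth: merging n adapters of rank r this way yields rank n*r, which is still far smaller than a dense delta for typical layer sizes.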
How to Pick Merge Ratios
The interpolation coefficient t (for SLERP) or merge weight (for TIES/DARE) is a hyperparameter. Running full evals to tune it is expensive. Use proxy tasks instead: a small set of 50-100 representative queries where you can judge quality quickly. Run inference at t = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0. Fit a quadratic curve to the scores. Pick the peak.
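Cost estimate: 6 inference runs on a 7B model with 50 proxy queries cost approximately $5-20 depending on your serving setup, whereas a full eval sweep costs $500+. As a rough rule of thumb, the proxy approach gives you about 80% of the information at about 2% of the cost. With only 50-100 queries, small score differences between neighboring values of t can be noise, which is one reason to fit a curve rather than pick the single best run. The main risk is a proxy set that does not represent your actual eval distribution, so choose proxy tasks that span the full range of your expected use cases.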
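The fit-and-pick-the-peak step can be sketched without any numerical library. The code below does an ordinary least-squares quadratic fit via the normal equations and returns the vertex, clipped to the sampled range. `fit_quadratic` and `best_t` are illustrative names, and the scores are synthetic values generated from a known curve so the result can be checked; they are not measurements.

```python
def fit_quadratic(ts, scores):
    """Least-squares fit of score = a*t^2 + b*t + c; returns (a, b, c)."""
    # Normal equations (X^T X) beta = X^T y, where each row of X is [t^2, t, 1]
    s = [sum(t ** k for t in ts) for k in range(5)]                      # power sums of t
    r = [sum(y * t ** k for t, y in zip(ts, scores)) for k in range(3)]  # moments of y
    M = [
        [s[4], s[3], s[2], r[2]],
        [s[3], s[2], s[1], r[1]],
        [s[2], s[1], s[0], r[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(3):
            if row != col:
                factor = M[row][col] / M[col][col]
                M[row] = [x - factor * y for x, y in zip(M[row], M[col])]
    return tuple(M[i][3] / M[i][i] for i in range(3))

def best_t(ts, scores):
    """Vertex of the fitted parabola, clipped to the sampled range."""
    a, b, _ = fit_quadratic(ts, scores)
    if a >= 0:
        # The fit has no interior maximum: fall back to the best measured point
        return max(zip(scores, ts))[1]
    return min(max(-b / (2 * a), min(ts)), max(ts))

ts = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
scores = [0.7 - (t - 0.35) ** 2 for t in ts]  # synthetic proxy scores peaking at t = 0.35
peak = best_t(ts, scores)
print(peak)
```

Note that the recovered peak (0.35) falls between two sampled points, which is the reason to fit a curve instead of taking the best of the six runs. With real, noisy proxy scores it is worth confirming the chosen t with one extra inference run.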
When Merging Beats Fine-Tuning
- Combining capabilities: you have model A fine-tuned for task X and model B fine-tuned for task Y, both from the same base. TIES merge often produces a model that handles both tasks at near-single-task quality — without the catastrophic forgetting that would result from sequential fine-tuning.
- Zero training cost: you have no labeled data for a new task but there are relevant community models on HuggingFace. Merge 2-3 relevant models and evaluate — often competitive with fine-tuning from scratch.
- Avoiding catastrophic forgetting: sequential fine-tuning on task B tends to degrade performance on task A. Merging two independently fine-tuned models sidesteps this entirely.
- Community model leverage: the open-source community continuously releases specialized fine-tunes. Merging is often faster than retraining and benefits from community dataset curation work.
Decision Table
| Scenario | Method | Hyperparams | Notes |
|---|---|---|---|
| 2 models, simple blend | SLERP | t in [0,1] | Best for equal-capability models from same base |
| 2-4 models with sign conflicts | TIES | trim threshold, t | Use when naive averaging degrades performance |
| 4+ models simultaneously | DARE + TIES | drop rate p, trim threshold | drop_rate=0.9 standard; reduces parameter interference |
| Same base, different HPs | Model Soup | which checkpoints to include | Greedy soup for safety; uniform for speed |
| Adapter-level merge | LoRA Merge | per-adapter weights | Cheapest option; works across task-specific adapters |
References
- TIES-Merging: Resolving Interference When Merging Models (Yadav et al., 2023)
- DARE: Language Models are Super Mario (Yu et al., 2023)
- Model Soups: Averaging Weights of Multiple Fine-Tuned Models (Wortsman et al., 2022)
- Evolutionary Optimization of Model Merging Recipes (Akiba et al., 2024)
- MergeKit: Open-source model merging library