GenAI Systems Lab
AI Engineering 10 min read

Model Merging: Getting SOTA Quality at Zero Training Cost

SLERP, TIES, DARE, model soup, and LoRA adapter merging — how the open-source community routinely beats fine-tuned models by merging weights in parameter space. What works, what doesn't, and how to pick merge ratios without running full evals.

Model merging combines the weights of multiple fine-tuned models without any additional training. The result is often a model that outperforms any of its parents on the combined task distribution. The open-source community has turned this into an art form — top models on many HuggingFace leaderboards are merges, not vanilla fine-tunes. Here is why it works and how to do it.

Why It Works

Fine-tuned models that share the same base model occupy nearby regions of parameter space. Fine-tuning moves weights in directions that improve task-specific performance, but those directions are not arbitrary — the loss landscape of the pre-trained model shapes where fine-tuning can go. Interpolation between two fine-tuned models often lands in a region that preserves the capabilities of both, because the loss surface between nearby minima tends to be relatively flat.

This is the mode connectivity hypothesis: independently fine-tuned models starting from the same base are typically connected by low-loss paths in parameter space. Model merging exploits this geometry. It does not work reliably when merging models with different base architectures or very different base checkpoints — the parameter spaces are not aligned.
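One way to check whether two candidate parents are connected in this sense is to measure the loss barrier along the straight line between them. The sketch below assumes you have already evaluated the interpolated model's loss at evenly spaced values of t; `loss_barrier` is a hypothetical helper, and the definition used (worst excess over the straight line between the two endpoint losses) is one common convention rather than the only one.

```python
def loss_barrier(path_losses):
    """Largest excess loss along a linear interpolation path.

    path_losses[i] is the loss of the interpolated model at t = i / (n - 1),
    so path_losses[0] is model A and path_losses[-1] is model B.
    (Hypothetical helper for illustration.)
    """
    n = len(path_losses)
    if n < 2:
        raise ValueError("need the loss at both endpoints")
    loss_a, loss_b = path_losses[0], path_losses[-1]
    barrier = 0.0
    for i, loss in enumerate(path_losses):
        t = i / (n - 1)
        # Loss we would see if it changed linearly from A to B
        baseline = (1 - t) * loss_a + t * loss_b
        barrier = max(barrier, loss - baseline)
    return barrier


flat_path = loss_barrier([1.0, 1.0, 1.0])    # 0.0: no barrier between the parents
bumpy_path = loss_barrier([1.0, 3.0, 1.0])   # 2.0: loss spikes at the midpoint
```

A barrier near zero suggests the two models are good merge candidates; a large barrier suggests the parameter spaces are not aligned.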

SLERP

Spherical Linear Interpolation (SLERP) moves along the arc between two weight vectors rather than along the straight line between them. When the two vectors have similar norms, this keeps the norm of the interpolated weights roughly constant, whereas a straight-line average of vectors that point in different directions shrinks it. SLERP merges exactly two models and has one hyperparameter: t in [0, 1], where t=0 returns model A and t=1 returns model B.

import torch

def slerp(w1, w2, t, eps=1e-6):
    """Spherical linear interpolation between two weight tensors of the same shape."""
    # Work in float32 on flattened copies so fp16/bf16 inputs keep their precision
    w1_flat = w1.float().flatten()
    w2_flat = w2.float().flatten()

    # Angle between the two weight vectors (eps guards against zero-norm tensors)
    cos_theta = (w1_flat @ w2_flat) / (w1_flat.norm() * w2_flat.norm()).clamp(min=eps)
    theta = torch.acos(cos_theta.clamp(-1, 1))
    sin_theta = torch.sin(theta)

    if sin_theta.abs() < eps:
        # Nearly parallel (or anti-parallel) vectors: the SLERP coefficients are
        # ill-conditioned, so fall back to linear interpolation
        interp = (1 - t) * w1_flat + t * w2_flat
    else:
        interp = (
            (torch.sin((1 - t) * theta) / sin_theta) * w1_flat
            + (torch.sin(t * theta) / sin_theta) * w2_flat
        )

    return interp.to(w1.dtype).reshape(w1.shape)
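For reference, the function above follows the standard SLERP formula, written here with the symbols w_1, w_2, and θ for the two flattened weight tensors and the angle between them (the notation is ours, not from a specific library):

```latex
\cos\theta = \frac{\langle w_1, w_2 \rangle}{\lVert w_1 \rVert \, \lVert w_2 \rVert},
\qquad
\operatorname{slerp}(w_1, w_2; t)
  = \frac{\sin\bigl((1 - t)\,\theta\bigr)}{\sin\theta}\, w_1
  + \frac{\sin(t\,\theta)}{\sin\theta}\, w_2
```

At t = 0 the second coefficient vanishes and the result is w_1; at t = 1 the first vanishes and the result is w_2. When the two norms are equal, the interpolated vector has that same norm for every t.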

TIES Merging

TIES (TrIm, Elect Sign & Merge) addresses a specific failure mode of naive weight averaging: sign conflicts. When model A has increased a weight and model B has decreased it, averaging produces a near-zero value that neither model wanted. TIES resolves this in three steps: trim small deltas, elect one sign per parameter, and average only the deltas that agree with the elected sign (a disjoint merge).

def ties_merge(base, models, trim_threshold=0.02, t=1.0):
    """Merge fine-tuned state dicts that share `base`; t scales the merged delta."""
    merged = {}
    for k in base:
        base_w = base[k].float()
        # Task vectors (fine-tuned minus base), computed in float32
        deltas = torch.stack([m[k].float() - base_w for m in models])

        # Step 1: Trim small deltas. This uses an absolute threshold; the TIES
        # paper instead keeps the top-k% of each task vector by magnitude.
        deltas[deltas.abs() < trim_threshold] = 0

        # Step 2: Elect the sign with the larger total magnitude
        # (a count-based majority vote is a common variant)
        elected_sign = deltas.sum(dim=0).sign()
        elected_sign[elected_sign == 0] = 1  # break ties

        # Step 3: Average only the deltas whose sign matches the elected sign
        mask = (deltas.sign() == elected_sign.unsqueeze(0))
        count = mask.float().sum(dim=0).clamp(min=1)
        merged_delta = (deltas * mask.float()).sum(dim=0) / count

        merged[k] = (base_w + t * merged_delta).to(base[k].dtype)
    return merged
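To see what the three steps buy you, it can help to trace a single parameter by hand. The sketch below is a scalar illustration only: `ties_single_parameter` is a hypothetical helper, the delta values are made up, and the sign is elected from the summed deltas (a count-based vote gives the same answer on this input).

```python
def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def ties_single_parameter(deltas, trim_threshold=0.02):
    """Trim, elect sign, and disjoint-merge the deltas of ONE parameter."""
    # Step 1: trim deltas whose magnitude is below the threshold
    kept = [d if abs(d) >= trim_threshold else 0.0 for d in deltas]
    # Step 2: elect the sign of the summed deltas (ties default to +1)
    elected = sign(sum(kept)) or 1
    # Step 3: average only the deltas that agree with the elected sign
    agreeing = [d for d in kept if sign(d) == elected]
    return sum(agreeing) / max(len(agreeing), 1)


# Made-up deltas from four fine-tuned models for the same weight
deltas = [0.5, 0.4, -0.6, 0.01]
naive = sum(deltas) / len(deltas)       # about 0.0775: the conflict nearly cancels the update
merged = ties_single_parameter(deltas)  # about 0.45: mean of the two agreeing deltas
```

The dissenting delta (-0.6) and the trimmed one (0.01) do not dilute the result, which is the behavior the disjoint merge is designed to produce.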

DARE

DARE (Drop And REscale) is designed for merging many models simultaneously. When merging 4+ models, parameter interference accumulates even after TIES sign resolution. DARE randomly drops delta weights before merging and rescales the survivors to preserve expected value — reducing interference across many models.

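def dare(task_vector, drop_rate=0.9):
    """
    Randomly drop delta weights at rate p = drop_rate.
    Rescale the survivors by 1/(1-p) to preserve the expected value.
    drop_rate=0.9 for 4-8 model merges; 0.7 for 2-3 model merges.
    Apply to each model's task vector before the TIES merge.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # Keep each element independently with probability 1 - p
    mask = torch.rand_like(task_vector, dtype=torch.float32) >= drop_rate
    scale = 1.0 / (1.0 - drop_rate)
    return task_vector * mask.to(task_vector.dtype) * scale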
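The rescaling factor follows from a short expectation calculation. As a sketch, write δ for a single delta weight, p for the drop rate, and m for its keep mask (these symbols are ours, introduced for illustration):

```latex
m \sim \mathrm{Bernoulli}(1 - p), \qquad
\tilde{\delta} = \frac{m\,\delta}{1 - p}

\mathbb{E}[\tilde{\delta}] = \frac{(1 - p)\,\delta}{1 - p} = \delta,
\qquad
\operatorname{Var}[\tilde{\delta}] = \frac{p\,(1 - p)\,\delta^{2}}{(1 - p)^{2}}
  = \frac{p}{1 - p}\,\delta^{2}
```

The mean is unchanged, but the per-weight variance grows with p: at p = 0.9 it is 9δ². That is the trade-off behind the drop rate: higher values leave fewer overlapping deltas between models, at the cost of a noisier estimate of each individual delta.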

Model Soup

Model soup is the simplest merge technique: average the weights of multiple fine-tuned checkpoints of the same base model, trained with different hyperparameters (learning rate, batch size, data order, LoRA rank). Averaging often beats any single checkpoint because the checkpoints occupy nearby loss minima and their average lands in a flatter, more generalizable region.
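A minimal sketch of a uniform soup and a greedy soup follows. It assumes each checkpoint is a dict mapping parameter names to values that support `+` and `/` (torch tensors or, as in the toy usage, plain floats), and that `evaluate` is a user-supplied function returning a held-out score where higher is better. Both function names are hypothetical.

```python
def uniform_soup(state_dicts):
    """Average every parameter across all checkpoints."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}


def greedy_soup(state_dicts, evaluate):
    """Add checkpoints one at a time, keeping each only if the soup does not get worse."""
    # Start from the best single checkpoint
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_score = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(uniform_soup(ingredients + [candidate]))
        if score >= best_score:
            ingredients.append(candidate)
            best_score = score
    return uniform_soup(ingredients)


# Toy usage: one scalar "weight" per checkpoint, best possible value is 1.0
checkpoints = [{"w": 0.5}, {"w": 1.5}, {"w": 4.0}]
score_fn = lambda sd: -(sd["w"] - 1.0) ** 2

uniform = uniform_soup(checkpoints)          # {"w": 2.0}: the outlier drags the average
greedy = greedy_soup(checkpoints, score_fn)  # {"w": 1.0}: the outlier is rejected
```

The greedy variant needs one evaluation per candidate, which is the price of never ending up below the best single checkpoint on the held-out set.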

LoRA Adapter Merging

LoRA adapters can be merged at the adapter level, before anything is folded into the full base weights. A LoRA adapter is defined by two low-rank matrices, A (r × d_in) and B (d_out × r); the effective weight delta is (alpha/r) * B @ A. Combining adapters is cheaper than merging full checkpoints because only the adapted layers and their small A and B matrices are involved. The combination is taken over the per-adapter deltas, which are then added to the base weight.

def merge_lora_adapters(adapters, weights=None):
    """
    Linear combination of LoRA adapter deltas.
    adapters: list of (A, B, alpha) tuples
    weights: merge coefficients (default: uniform)
    Returns: merged delta to add to base weight
    """
    if weights is None:
        weights = [1.0 / len(adapters)] * len(adapters)

    merged_delta = None
    for (A, B, alpha), w in zip(adapters, weights):
        r = A.shape[0]
        delta = (alpha / r) * (B @ A)
        merged_delta = w * delta if merged_delta is None else merged_delta + w * delta

    return merged_delta  # W_final = W_base + merged_delta
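One pitfall is worth spelling out: averaging the A matrices and the B matrices separately is not the same as averaging the deltas B @ A, which is why the function above materializes each delta first. The sketch below shows the difference with two made-up rank-1 adapters in plain Python lists (alpha/r is taken as 1, and the helper names are hypothetical).

```python
def outer(col, row):
    """Outer product of a column vector and a row vector, as nested lists."""
    return [[c * r for r in row] for c in col]


def mean_vec(vecs):
    """Element-wise mean of equally sized vectors."""
    return [sum(xs) / len(vecs) for xs in zip(*vecs)]


def mean_mat(mats):
    """Element-wise mean of equally sized matrices."""
    return [mean_vec(rows) for rows in zip(*mats)]


# Two rank-1 adapters: B is a column (d_out = 2), A is a row (d_in = 2)
B1, A1 = [1.0, 0.0], [1.0, 0.0]
B2, A2 = [0.0, 1.0], [0.0, 1.0]

# Average of the deltas: keeps both updates at half strength
delta_mean = mean_mat([outer(B1, A1), outer(B2, A2)])
# -> [[0.5, 0.0], [0.0, 0.5]]

# Product of the averaged factors: a different, rank-1 matrix with cross terms
factor_mean = outer(mean_vec([B1, B2]), mean_vec([A1, A2]))
# -> [[0.25, 0.25], [0.25, 0.25]]
```

The averaged delta here has rank 2, so it cannot be written as a single rank-1 adapter. In general the merged delta can have rank up to the sum of the individual ranks.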

How to Pick Merge Ratios

The interpolation coefficient t (for SLERP) or merge weight (for TIES/DARE) is a hyperparameter. Running full evals to tune it is expensive. Use proxy tasks instead: a small set of 50-100 representative queries where you can judge quality quickly. Run inference at t = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0. Fit a quadratic curve to the scores. Pick the peak.
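The fit-and-pick step can be sketched in a few lines. The version below is dependency-free and solves the least-squares normal equations directly; `numpy.polyfit(ts, scores, 2)` would return the same coefficients. The function names and the example scores are illustrative assumptions, not measurements.

```python
def fit_quadratic(ts, scores):
    """Least-squares fit of score ~ a*t^2 + b*t + c. Returns (a, b, c)."""
    rows = [[t * t, t, 1.0] for t in ts]
    # Augmented normal equations: (X^T X | X^T y)
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, scores))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * p for x, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def pick_merge_ratio(ts, scores):
    """Return the t in [min(ts), max(ts)] that maximizes the fitted quadratic."""
    a, b, c = fit_quadratic(ts, scores)
    candidates = [min(ts), max(ts)]
    if a < 0:  # concave fit: the vertex is a maximum
        vertex = -b / (2 * a)
        if min(ts) <= vertex <= max(ts):
            candidates.append(vertex)
    return max(candidates, key=lambda t: a * t * t + b * t + c)


ts = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
proxy_scores = [0.54, 0.66, 0.70, 0.66, 0.54, 0.34]  # made-up proxy-task scores
best_t = pick_merge_ratio(ts, proxy_scores)           # close to 0.4 for these scores
```

If the fitted curve is not concave, or its peak falls outside the sampled range, the sketch returns the better endpoint, which usually means one parent simply wins on the proxy set. With only six points the fit is sensitive to noise, so it is worth confirming the chosen t with one extra inference run.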

Cost estimate: 6 inference runs on a 7B model with 50 proxy queries cost approximately $5-20 depending on your serving setup, while a full eval sweep costs $500+. As a rough rule of thumb, the proxy task approach gives you about 80% of the information at a few percent of the cost. The main risk is a proxy set that does not represent your actual eval distribution, so choose proxy tasks that span the full range of your expected use cases.

When Merging Beats Fine-Tuning

Decision Table

| Scenario | Method | Hyperparams | Notes |
|---|---|---|---|
| 2 models, simple blend | SLERP | t in [0,1] | Best for equal-capability models from same base |
| 2-4 models with sign conflicts | TIES | trim threshold, t | Use when naive averaging degrades performance |
| 4+ models simultaneously | DARE + TIES | drop rate p, trim threshold | drop_rate=0.9 standard; reduces parameter interference |
| Same base, different HPs | Model Soup | which checkpoints to include | Greedy soup for safety; uniform for speed |
| Adapter-level merge | LoRA Merge | per-adapter weights | Cheapest option; works across task-specific adapters |
