When to Retrain: Accuracy Triggers, Drift Triggers, and Continuous Training Pipelines

Why retraining on a calendar schedule is the wrong default. Accuracy-based and distribution-shift-based triggers, full retrain vs. warm start vs. incremental learning, and a minimal Airflow DAG for event-triggered retraining.

The Retrain-on-a-Schedule Trap

Retraining every Monday is a popular default. It's also wrong. A model deployed on Tuesday might need retraining by Thursday if a major product change ships in between. A stable model might not need retraining for three months. Schedule-based retraining ignores the actual signal in favour of calendar comfort.

The right retraining trigger is a condition, not a date. The conditions fall into three categories: a detected drop in performance, a detected shift in the input distribution, or a known business event.

Trigger 1: Accuracy-Based Triggers

When ground truth labels arrive with acceptable latency, use accuracy triggers. Compare a rolling window of production predictions against the labels when they arrive. Trigger retraining when accuracy drops below a threshold.

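import numpy as np
from collections import deque

def alert(message: str) -> None:
    """Placeholder: route to your paging or chat integration."""
    print(message)

def submit_retraining_job() -> None:
    """Placeholder: enqueue the retraining pipeline (e.g. trigger a DAG run)."""
    print("retraining job submitted")

class AccuracyMonitor:
    def __init__(self, window_size: int = 500, threshold: float = 0.85):
        self.window_size = window_size
        self.threshold = threshold
        self.correct = deque(maxlen=window_size)  # 1 = correct, 0 = wrong
    
    def observe(self, prediction: int, ground_truth: int) -> None:
        self.correct.append(int(prediction == ground_truth))
        # Only evaluate once the window is full, so early noise cannot fire the trigger.
        if len(self.correct) >= self.window_size:
            acc = float(np.mean(self.correct))
            if acc < self.threshold:
                self._trigger_retrain(acc)
                # Reset the window: otherwise every later observation re-fires the trigger.
                self.correct.clear()
    
    def _trigger_retrain(self, current_acc: float) -> None:
        alert(f"Accuracy {current_acc:.3f} < threshold {self.threshold}. Retraining triggered.")
        submit_retraining_job()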
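
Labels rarely arrive in the same call as the prediction, so something has to hold each prediction until its label shows up. The following is a minimal sketch of that join, assuming every prediction carries a request ID. `PendingPredictions` and its method names are hypothetical, and the store is an in-memory dict standing in for whatever database you actually use.

```python
from typing import Dict, Optional, Tuple

class PendingPredictions:
    """Holds predictions until their ground-truth label arrives (in-memory sketch)."""

    def __init__(self, max_pending: int = 100_000):
        self.max_pending = max_pending
        self._pending: Dict[str, int] = {}   # request_id -> predicted class

    def record_prediction(self, request_id: str, prediction: int) -> None:
        if len(self._pending) >= self.max_pending:
            # Drop the oldest entry; dicts preserve insertion order.
            oldest = next(iter(self._pending))
            del self._pending[oldest]
        self._pending[request_id] = prediction

    def resolve(self, request_id: str, ground_truth: int) -> Optional[Tuple[int, int]]:
        """Return (prediction, ground_truth) if the request was recorded, else None."""
        prediction = self._pending.pop(request_id, None)
        if prediction is None:
            return None
        return prediction, ground_truth


store = PendingPredictions()
store.record_prediction("req-1", 1)
store.record_prediction("req-2", 0)
pair = store.resolve("req-1", 1)      # the label for req-1 arrives later
unknown = store.resolve("req-9", 0)   # a label for a request that was never recorded
```

Each resolved pair can then be passed to `AccuracyMonitor.observe(prediction, ground_truth)`.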

Trigger 2: Distribution Shift Triggers

When labels are delayed (days to weeks), use input distribution triggers as a leading indicator. Run a Population Stability Index (PSI) or Kolmogorov-Smirnov (KS) test daily, comparing each input feature's recent distribution against a baseline week. Trigger retraining as soon as shift is detected, rather than waiting for accuracy to degrade.

import numpy as np
from scipy import stats

class DistributionMonitor:
    def __init__(self, baseline_features: np.ndarray, p_value_threshold: float = 0.01):
        self.baseline = baseline_features            # shape: (n_samples, n_features)
        self.p_value_threshold = p_value_threshold
    
    def check(self, recent_features: np.ndarray) -> bool:
        drifted = []
        for i in range(recent_features.shape[1]):
            baseline_col = self.baseline[:, i]
            recent_col   = recent_features[:, i]
            
            # KS test for continuous features. With many features or very large
            # samples it fires easily, so tune the threshold to your data.
            _, p_value = stats.ks_2samp(baseline_col, recent_col)
            if p_value < self.p_value_threshold:
                print(f"Feature {i}: KS test significant (p={p_value:.4f})")
                drifted.append(i)
        return bool(drifted)  # True if at least one feature drifted
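
The monitor above covers the KS test only. For PSI, a minimal NumPy sketch might look like the following. The function name, the choice of ten quantile bins fitted on the baseline, and the `eps` clipping for empty bins are assumptions for illustration; 0.2 is a commonly quoted rule-of-thumb cut-off for a significant shift, not a universal constant.

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, recent: np.ndarray,
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between two 1-D samples, using quantile bins fitted on the baseline."""
    # Interior quantile edges from the baseline; values outside fall into the end bins.
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1)[1:-1])
    base_counts = np.bincount(np.searchsorted(edges, baseline), minlength=n_bins)
    recent_counts = np.bincount(np.searchsorted(edges, recent), minlength=n_bins)
    # Clip proportions away from zero so the log is defined for empty bins.
    base_pct = np.clip(base_counts / base_counts.sum(), eps, None)
    recent_pct = np.clip(recent_counts / recent_counts.sum(), eps, None)
    return float(np.sum((recent_pct - base_pct) * np.log(recent_pct / base_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=5000)
same = rng.normal(0.0, 1.0, size=5000)      # same distribution as the baseline
shifted = rng.normal(1.0, 1.0, size=5000)   # mean moved by one standard deviation

psi_same = population_stability_index(baseline, same)
psi_shifted = population_stability_index(baseline, shifted)
drift_detected = psi_shifted > 0.2
```

Unlike a p-value, PSI does not shrink toward "significant" just because the sample is large, which makes a fixed threshold easier to reason about on high-traffic features.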

Trigger 3: Business Event Triggers

Some triggers bypass statistical monitoring entirely. They're manual but predictable: a major product launch (new user cohort), a marketing campaign (traffic spike with different intent distribution), a regulatory change, a competitor announcement that changes user behaviour patterns. These events are known in advance; you can pre-schedule retraining runs to execute immediately after them.
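
Because these events are known ahead of time, they can live in a small calendar that a daily job checks. The sketch below is one way to do that; `BusinessEvent`, `events_due_for_retrain`, the event names, and the `settle_days` delay (time for post-event data to accumulate) are all hypothetical. Set `settle_days=0` to retrain on the day of the event.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

@dataclass(frozen=True)
class BusinessEvent:
    name: str
    event_date: date
    # Days to wait after the event so post-event data can accumulate (assumed value).
    settle_days: int = 3

def events_due_for_retrain(events: List[BusinessEvent], today: date) -> List[BusinessEvent]:
    """Return the events whose settle period ends exactly today."""
    return [e for e in events
            if e.event_date + timedelta(days=e.settle_days) == today]


calendar = [
    BusinessEvent("spring_product_launch", date(2024, 3, 4)),
    BusinessEvent("pricing_change", date(2024, 3, 10), settle_days=7),
]
due = events_due_for_retrain(calendar, today=date(2024, 3, 7))
```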

Retraining Strategies

Full retrain: retrain from scratch on all available data. Expensive. Required when the model architecture changes or the label schema changes. Because nothing is carried over from the previous model, no stale learned patterns remain.

Rolling window retrain: keep only the last N weeks of data. Useful when older data is misleading (concept drift). Danger: if you use too short a window, you lose coverage of rare but important patterns (rare classes, seasonal events).
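
A rolling window is a timestamp filter at heart. The sketch below also shows one possible mitigation for the short-window danger: keeping older rows that belong to rare classes. That retention rule, the `rolling_window` name, and the `(timestamp, label)` row shape are assumptions for illustration, not a prescribed method.

```python
from datetime import datetime, timedelta
from typing import FrozenSet, List, Tuple

Row = Tuple[datetime, int]  # (event timestamp, class label); features omitted for brevity

def rolling_window(rows: List[Row], now: datetime, weeks: int,
                   rare_labels: FrozenSet[int] = frozenset()) -> List[Row]:
    """Keep rows from the last `weeks` weeks, plus older rows with a rare label."""
    cutoff = now - timedelta(weeks=weeks)
    return [r for r in rows if r[0] >= cutoff or r[1] in rare_labels]


now = datetime(2024, 6, 1)
rows = [
    (datetime(2024, 1, 15), 0),   # old, common class: dropped
    (datetime(2024, 1, 20), 2),   # old, rare class: kept
    (datetime(2024, 5, 20), 0),   # recent: kept
    (datetime(2024, 5, 30), 1),   # recent: kept
]
train_rows = rolling_window(rows, now=now, weeks=4, rare_labels=frozenset({2}))
```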

Incremental/online learning: update model parameters on each new batch without full retraining. Efficient, but only supported by certain algorithms (SGD-based models, or the streaming models in the river library). Risk: catastrophic forgetting if the gradient updates are too large, because recent batches can overwrite what the model learned from older data.

Warm start: initialize new training run from the previous model's weights. Converges faster than random init. Good for fine-tuning on recent data without discarding old knowledge.
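
The last two strategies can be sketched with one toy model. This is a plain-NumPy logistic regression written for illustration, not the API of river or any other library; the class name, `partial_fit`, `init_weights`, and the synthetic data are all assumptions. `partial_fit` shows the incremental update, and `init_weights` shows the warm start.

```python
import numpy as np
from typing import Optional

class OnlineLogisticRegression:
    """Binary logistic regression trained by mini-batch gradient steps (sketch)."""

    def __init__(self, n_features: int, lr: float = 0.1,
                 init_weights: Optional[np.ndarray] = None):
        # Warm start: begin from a previous model's weights instead of zeros.
        self.w = np.zeros(n_features) if init_weights is None else init_weights.copy()
        self.lr = lr

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-X @ self.w))

    def partial_fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """One gradient step on one batch; older data is not revisited."""
        grad = X.T @ (self.predict_proba(X) - y) / len(y)
        self.w -= self.lr * grad  # a small lr limits how far one batch can pull the weights

    def log_loss(self, X: np.ndarray, y: np.ndarray) -> float:
        p = np.clip(self.predict_proba(X), 1e-9, 1 - 1e-9)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X_old = rng.normal(size=(2000, 3))
y_old = (rng.random(2000) < 1 / (1 + np.exp(-X_old @ true_w))).astype(float)
X_new = rng.normal(size=(500, 3))
y_new = (rng.random(500) < 1 / (1 + np.exp(-X_new @ true_w))).astype(float)

# Incremental learning: stream the older data through in batches of 100.
previous = OnlineLogisticRegression(n_features=3)
for _ in range(20):
    for start in range(0, len(X_old), 100):
        previous.partial_fit(X_old[start:start + 100], y_old[start:start + 100])

# Warm start vs. cold start: the same five steps on recent data.
warm = OnlineLogisticRegression(n_features=3, init_weights=previous.w)
cold = OnlineLogisticRegression(n_features=3)
for _ in range(5):
    warm.partial_fit(X_new, y_new)
    cold.partial_fit(X_new, y_new)

warm_loss = warm.log_loss(X_new, y_new)
cold_loss = cold.log_loss(X_new, y_new)
```

The toy data has no drift, so it only demonstrates the mechanics: with the same small training budget, the warm-started model begins much closer to a good solution than the one starting from zeros.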

Retraining cost is often underestimated. Include data pull, preprocessing, hyperparameter search, evaluation, registration, review, and deployment. For a complex model on a large dataset this can easily be 4–6 hours of wall time and meaningful compute cost. Know your cost before designing your trigger policy.
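
A back-of-the-envelope calculation makes the comparison concrete. Every number below is made up for illustration (hourly rates, review time, trigger frequency), and `RetrainCost` and `monthly_cost` are hypothetical helpers; substitute your own measurements.

```python
from dataclasses import dataclass

@dataclass
class RetrainCost:
    wall_hours: float              # end-to-end pipeline time per run
    compute_rate_per_hour: float   # cluster cost while the pipeline runs
    review_hours: float            # human time to review and approve a run
    engineer_rate_per_hour: float

    def per_run(self) -> float:
        compute = self.wall_hours * self.compute_rate_per_hour
        people = self.review_hours * self.engineer_rate_per_hour
        return compute + people

def monthly_cost(cost: RetrainCost, runs_per_month: float) -> float:
    return cost.per_run() * runs_per_month


# Illustrative values only: 5 h of wall time at 40/h compute, 1 h of review at 100/h.
cost = RetrainCost(wall_hours=5.0, compute_rate_per_hour=40.0,
                   review_hours=1.0, engineer_rate_per_hour=100.0)

weekly_schedule = monthly_cost(cost, runs_per_month=4)     # retrain every Monday
trigger_policy = monthly_cost(cost, runs_per_month=1.5)    # assumed trigger frequency
```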

Continuous Training Pipeline

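# Minimal Airflow DAG for event-triggered retraining
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

# fetch_recent_data, run_preprocessing, train_and_register, compare_to_champion and
# promote_to_staging are your own callables, imported from your project code.

default_args = {"owner": "mlops", "retries": 1, "retry_delay": timedelta(minutes=10)}

# schedule=None means event-triggered only (the argument was schedule_interval before Airflow 2.4)
with DAG("triggered_retraining", schedule=None,
         start_date=datetime(2024, 1, 1), default_args=default_args) as dag:
    
    pull_data    = PythonOperator(task_id="pull_training_data",    python_callable=fetch_recent_data)
    preprocess   = PythonOperator(task_id="preprocess",            python_callable=run_preprocessing)
    train        = PythonOperator(task_id="train_model",           python_callable=train_and_register)
    evaluate     = PythonOperator(task_id="evaluate_vs_champion",  python_callable=compare_to_champion)
    # promote_to_staging must check the evaluation result itself and do nothing if the
    # challenger lost; otherwise have compare_to_champion raise to stop the run here.
    promote      = PythonOperator(task_id="promote_if_better",     python_callable=promote_to_staging)
    
    # Tasks run strictly in sequence; a failure stops everything downstream.
    pull_data >> preprocess >> train >> evaluate >> promote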
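
A DAG with no schedule only runs when something starts it. One option is the Airflow 2.x stable REST API, which accepts a POST to `/api/v1/dags/{dag_id}/dagRuns`. The sketch below builds that request without sending it. It assumes basic authentication is enabled on the webserver; the helper name, the credentials, the host, and the `trigger_reason` key inside `conf` are hypothetical. Other Airflow versions may use a different path or require extra fields.

```python
import base64
import json
import urllib.request

def build_trigger_request(base_url: str, dag_id: str, reason: str,
                          username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a request that starts one DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    # `conf` is passed through to the DAG run; the key name is our own choice.
    body = json.dumps({"conf": {"trigger_reason": reason}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
    )


request = build_trigger_request(
    "http://localhost:8080", "triggered_retraining",
    reason="accuracy_below_threshold", username="mlops", password="change-me",
)
# To actually start the run: urllib.request.urlopen(request)
```

A monitor's retraining hook can call this when a trigger fires, so the reason for each retraining run is recorded alongside the run itself.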
