When to Retrain: Accuracy Triggers, Drift Triggers, and Continuous Training Pipelines
Why retraining on a calendar schedule is the wrong default. Accuracy-based and distribution-shift-based triggers, full retrain vs. warm start vs. incremental learning, and a minimal Airflow DAG for event-triggered retraining.
The Retrain-on-a-Schedule Trap
Retraining every Monday is a popular default. It's also wrong. A model deployed on Tuesday might need retraining by Thursday if a major product change shipped. A stable model might not need retraining for three months. Schedule-based retraining ignores the actual signal in favour of calendar comfort.
The right retraining trigger is a condition, not a date. The conditions fall into three categories: performance degradation detected, input distribution shift detected, business event occurred.
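The three categories can be folded into one decision function. The sketch below is a minimal illustration, not a prescribed design: `TriggerSignals`, `retrain_reason`, and the priority order are hypothetical names and choices, and the 0.85 threshold is only a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerSignals:
    """Snapshot of the three signal types at one evaluation time (hypothetical structure)."""
    rolling_accuracy: Optional[float]  # None while labels have not arrived yet
    drift_detected: bool               # result of an input-distribution check
    business_event: Optional[str]      # e.g. "product_launch", or None


def retrain_reason(signals: TriggerSignals, accuracy_threshold: float = 0.85) -> Optional[str]:
    """Return a human-readable reason to retrain, or None if no condition holds."""
    if signals.rolling_accuracy is not None and signals.rolling_accuracy < accuracy_threshold:
        return f"accuracy {signals.rolling_accuracy:.3f} below {accuracy_threshold}"
    if signals.drift_detected:
        return "input distribution shift"
    if signals.business_event is not None:
        return f"business event: {signals.business_event}"
    return None


print(retrain_reason(TriggerSignals(0.91, False, None)))   # no condition holds
print(retrain_reason(TriggerSignals(None, True, None)))    # labels missing, drift seen
```

Returning the reason rather than a bare boolean means every retraining run can record why it started.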
Trigger 1: Accuracy-Based Triggers
When ground truth labels arrive with acceptable latency, use accuracy triggers. Compare a rolling window of production predictions against the labels when they arrive. Trigger retraining when accuracy drops below a threshold.
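import numpy as np
from collections import deque


class AccuracyMonitor:
    """Rolling-window accuracy check that fires a retraining trigger."""

    def __init__(self, window_size: int = 500, threshold: float = 0.85):
        self.window_size = window_size
        self.threshold = threshold
        self.correct = deque(maxlen=window_size)  # 1 = correct, 0 = wrong

    def observe(self, prediction: int, ground_truth: int):
        self.correct.append(int(prediction == ground_truth))
        # Evaluate only once the window is full, so a few early errors cannot trigger.
        if len(self.correct) >= self.window_size:
            acc = sum(self.correct) / len(self.correct)
            if acc < self.threshold:
                self._trigger_retrain(acc)
                # Reset the window; otherwise every later observation re-triggers.
                self.correct.clear()

    def _trigger_retrain(self, current_acc: float):
        # alert() and submit_retraining_job() stand in for your own alerting and job hooks.
        alert(f"Accuracy {current_acc:.3f} < threshold {self.threshold}. Retraining triggered.")
        submit_retraining_job()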
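Comparing predictions against labels "when they arrive" implies holding each prediction until its label shows up. A minimal sketch of that join follows, assuming every prediction is logged with a request ID that the label later carries; `LabelJoiner` and its method names are hypothetical.

```python
from collections import deque
from typing import Dict, Optional


class LabelJoiner:
    """Holds predictions until their ground-truth label arrives, then scores them."""

    def __init__(self, window_size: int = 500):
        self.pending: Dict[str, int] = {}          # request_id -> prediction awaiting a label
        self.correct = deque(maxlen=window_size)   # 1 = correct, 0 = wrong

    def log_prediction(self, request_id: str, prediction: int) -> None:
        self.pending[request_id] = prediction

    def log_label(self, request_id: str, ground_truth: int) -> None:
        prediction = self.pending.pop(request_id, None)
        if prediction is None:
            return  # label for an unknown or already-scored request; ignore it
        self.correct.append(int(prediction == ground_truth))

    def rolling_accuracy(self) -> Optional[float]:
        if not self.correct:
            return None  # nothing scored yet
        return sum(self.correct) / len(self.correct)


joiner = LabelJoiner(window_size=3)
joiner.log_prediction("a", 1)
joiner.log_prediction("b", 0)
joiner.log_label("a", 1)   # correct
joiner.log_label("b", 1)   # wrong
print(joiner.rolling_accuracy())  # 0.5
```

In this sketch `pending` grows without bound if some labels never arrive, so a production version would need to expire old entries.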
Trigger 2: Distribution Shift Triggers
When labels are delayed (days to weeks), use input distribution triggers as a leading indicator. Run PSI or KS tests daily, comparing each input feature's recent distribution against a baseline week. Trigger retraining as soon as shift is detected, rather than waiting for accuracy to degrade.
import numpy as np
from scipy import stats


class DistributionMonitor:
    """Flags drift when any feature's distribution differs from the baseline."""

    def __init__(self, baseline_features: np.ndarray, psi_threshold: float = 0.2,
                 ks_alpha: float = 0.01):
        self.baseline = baseline_features
        self.psi_threshold = psi_threshold  # kept for a PSI check; check() uses only the KS test
        self.ks_alpha = ks_alpha            # significance level for the KS test

    def check(self, recent_features: np.ndarray) -> bool:
        for i in range(recent_features.shape[1]):
            baseline_col = self.baseline[:, i]
            recent_col = recent_features[:, i]
            # Two-sample KS test; suitable for continuous features only
            _, p_value = stats.ks_2samp(baseline_col, recent_col)
            if p_value < self.ks_alpha:
                print(f"Feature {i}: KS test significant (p={p_value:.4f})")
                return True  # drift detected
        return False
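The monitor above takes a `psi_threshold` but only runs the KS test. A PSI check for a single continuous feature could look like the sketch below. The quantile binning, the bin count of 10, and the `eps` floor are assumptions chosen for illustration, not settings taken from this article.

```python
import numpy as np


def psi(baseline: np.ndarray, recent: np.ndarray, n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of one continuous feature."""
    # Bin edges come from baseline quantiles, so each bin holds about 1/n_bins of the baseline.
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only: values outside the baseline range land in the first or last bin.
    interior = edges[1:-1]
    base_counts = np.bincount(np.searchsorted(interior, baseline, side="right"), minlength=n_bins)
    recent_counts = np.bincount(np.searchsorted(interior, recent, side="right"), minlength=n_bins)
    # eps keeps the logarithm finite when a bin is empty.
    base_frac = np.clip(base_counts / base_counts.sum(), eps, None)
    recent_frac = np.clip(recent_counts / recent_counts.sum(), eps, None)
    return float(np.sum((recent_frac - base_frac) * np.log(recent_frac / base_frac)))


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # same distribution as the baseline
shifted = rng.normal(1.0, 1.0, 5000)   # mean moved by one standard deviation

print(psi(baseline, same) < 0.2, psi(baseline, shifted) > 0.2)
```

Unlike a p-value, PSI measures the size of the shift, so it does not become more sensitive simply because the sample grows.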
Trigger 3: Business Event Triggers
Some triggers bypass statistical monitoring entirely. They're manual but predictable: a major product launch (new user cohort), a marketing campaign (traffic spike with different intent distribution), a regulatory change, a competitor announcement that changes user behaviour patterns. These events are usually known in advance, so you can pre-schedule retraining runs to execute immediately after them.
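Pre-scheduling can be as simple as a small event calendar that a periodic job polls. The sketch below assumes such a polling job exists; `BusinessEvent`, `due_retrains`, and the optional `delay` field are hypothetical, and the example date is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class BusinessEvent:
    name: str
    occurs_at: datetime
    # Zero means "retrain immediately after the event"; raise it if the run
    # should wait for post-event data to accumulate (an assumption, not a rule).
    delay: timedelta = timedelta(0)

    @property
    def retrain_at(self) -> datetime:
        return self.occurs_at + self.delay


def due_retrains(events: List[BusinessEvent], last_check: datetime, now: datetime) -> List[BusinessEvent]:
    """Events whose retrain time fell inside the interval (last_check, now]."""
    return [e for e in events if last_check < e.retrain_at <= now]


calendar = [BusinessEvent("product_launch", datetime(2024, 3, 1, 9, 0))]
first = due_retrains(calendar, datetime(2024, 3, 1, 8, 0), datetime(2024, 3, 1, 10, 0))
second = due_retrains(calendar, datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 12, 0))
print([e.name for e in first], [e.name for e in second])
```

The half-open interval means each event is picked up by exactly one poll, even if polls run at irregular times.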
Retraining Strategies
Full retrain: retrain on all available data. Expensive. Required when the model architecture changes or the label schema changes. Ensures no stale patterns remain.
Rolling window retrain: keep only the last N weeks of data. Useful when older data is misleading (concept drift). Danger: if you use too short a window, you lose coverage of rare but important patterns (rare classes, seasonal events).
Incremental/online learning: update model parameters on each new batch without full retraining. Efficient but only supported by certain algorithms (SGD-based models, river library for streaming). Risk: catastrophic forgetting if the gradient updates are too large.
Warm start: initialize new training run from the previous model's weights. Converges faster than random init. Good for fine-tuning on recent data without discarding old knowledge.
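To make the warm-start claim concrete, here is a toy comparison on synthetic data: a logistic regression trained by plain gradient descent, started either from zeros or from the previous model's weights. Everything in it (the data generator, the coefficients, the step counts) is invented for illustration, so treat it as a sketch of the idea and not as a benchmark.

```python
import numpy as np


def log_loss(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    p = np.clip(1.0 / (1.0 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit_logreg(X, y, init_w=None, steps=200, lr=0.5):
    """Full-batch gradient descent; init_w=None means a cold start from zeros."""
    w = np.zeros(X.shape[1]) if init_w is None else init_w.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y)) / len(y)
    return w


def make_data(rng, true_w, n=2000):
    X = rng.normal(size=(n, len(true_w)))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
    return X, y


rng = np.random.default_rng(0)
X_old, y_old = make_data(rng, np.array([2.0, -1.0, 0.5]))
X_new, y_new = make_data(rng, np.array([2.0, -1.0, 1.0]))  # mild drift in one coefficient

w_prev = fit_logreg(X_old, y_old)                            # stands in for the production model
w_warm = fit_logreg(X_new, y_new, init_w=w_prev, steps=10)   # warm start, short budget
w_cold = fit_logreg(X_new, y_new, steps=10)                  # cold start, same budget

print(f"warm: {log_loss(w_warm, X_new, y_new):.3f}  cold: {log_loss(w_cold, X_new, y_new):.3f}")
```

With the same ten-step budget, the warm-started run ends at a lower loss on the new data because it begins close to the new optimum. The advantage shrinks as the drift gets larger.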
Retraining cost is often underestimated. Include data pull, preprocessing, hyperparameter search, evaluation, registration, review, and deployment. For a complex model on a large dataset this can easily be 4–6 hours of wall time and meaningful compute cost. Know your cost before designing your trigger policy.
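One way to reflect that cost in the trigger policy is a cooldown, so that a noisy trigger cannot queue several multi-hour runs back to back. This is a minimal sketch under that assumption; `RetrainBudget` and the 24-hour default are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


class RetrainBudget:
    """Suppresses triggers that arrive within `cooldown` of the last accepted one."""

    def __init__(self, cooldown: timedelta = timedelta(hours=24)):
        self.cooldown = cooldown
        self.last_accepted: Optional[datetime] = None

    def allow(self, now: datetime) -> bool:
        if self.last_accepted is not None and now - self.last_accepted < self.cooldown:
            return False  # a run was accepted recently; skip this trigger
        self.last_accepted = now
        return True


budget = RetrainBudget(cooldown=timedelta(hours=6))
t0 = datetime(2024, 1, 1, 0, 0)
print(budget.allow(t0))                       # True: first trigger is accepted
print(budget.allow(t0 + timedelta(hours=2)))  # False: inside the cooldown
print(budget.allow(t0 + timedelta(hours=7)))  # True: cooldown has elapsed
```

Setting the cooldown to at least the wall time of one full run keeps runs from overlapping.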
Continuous Training Pipeline
# Minimal Airflow DAG for event-triggered retraining
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# fetch_recent_data, run_preprocessing, train_and_register, compare_to_champion and
# promote_to_staging are your own functions, assumed to be importable here.
default_args = {"owner": "mlops", "retries": 1, "retry_delay": timedelta(minutes=10)}

with DAG(
    "triggered_retraining",
    schedule_interval=None,  # None = event-triggered only (Airflow 2.4+ prefers the name `schedule`)
    start_date=datetime(2024, 1, 1),
    default_args=default_args,
) as dag:
    pull_data = PythonOperator(task_id="pull_training_data", python_callable=fetch_recent_data)
    preprocess = PythonOperator(task_id="preprocess", python_callable=run_preprocessing)
    train = PythonOperator(task_id="train_model", python_callable=train_and_register)
    evaluate = PythonOperator(task_id="evaluate_vs_champion", python_callable=compare_to_champion)
    promote = PythonOperator(task_id="promote_if_better", python_callable=promote_to_staging)

    # Tasks run in sequence; a failed task stops everything downstream of it.
    pull_data >> preprocess >> train >> evaluate >> promote
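Because the DAG has no schedule, something has to start it. One option is the Airflow 2 stable REST API, which creates a DAG run via `POST /api/v1/dags/{dag_id}/dagRuns`. The sketch below only builds the request and does not send it. The base URL, the `reason` field inside `conf`, and the helper name are assumptions. Authentication depends on how your deployment is configured, and other major versions may serve the API under a different path, so check the documentation for your version.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, reason: str) -> urllib.request.Request:
    """Build (but do not send) a POST that asks Airflow to start one DAG run."""
    body = json.dumps({"conf": {"reason": reason}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_trigger_request("http://localhost:8080", "triggered_retraining", "accuracy below threshold")
print(req.get_method(), req.full_url)
# Sending requires credentials for your deployment, e.g. urllib.request.urlopen(req).
```

Passing the trigger reason in `conf` makes it available to the tasks, so each run can log what started it.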