Model Drift

TL;DR

Model performance can degrade over time when the data distribution shifts away from the training data.
Causes include changes in the business environment or in consumer behavior, or changes in physical or demographic characteristics of a location.
Mitigation requires monitoring model performance and updating models (e.g., retraining or using online learning) when drift is detected.

Definition

Model drift is a phenomenon in which the performance of a machine learning model deteriorates over time due to changes in the distribution of the data on which it was trained. In this context, drift refers to the difference between the model’s expected performance and its actual performance over time.

Explanation

Model drift occurs when the real-world data that a model encounters changes relative to the data used during training. Such changes can stem from shifts in the underlying business environment, consumer behavior, or the physical and demographic characteristics of an environment. As these shifts alter the data distribution, the model’s predictions can become less accurate, producing a decline in measured performance.

Examples

Retail industry example

A model trained on data representing the current state of the retail industry may accurately predict consumer demand at the time of training. If the industry shifts—for example, a move towards online shopping—consumer behavior changes and the model may no longer accurately predict demand.

Geographic / location example

A model trained on data for a specific city may accurately predict traffic patterns and congestion levels at the time of training. If the city’s population grows significantly or new roads or highways are built, those underlying characteristics change and the model may fail to accurately predict traffic patterns.

Notes or pitfalls

The underlying cause of model drift is changes in the distribution of the data used for training.
Addressing model drift requires regularly retraining and updating models to account for changing data.
Online learning algorithms can allow models to be updated in real time as new data becomes available.
Ongoing performance monitoring is important to detect drift and take corrective action when necessary.

Online learning
Data distribution
Model performance