The philosopher Heraclitus once said, "Everything flows, nothing stands still" (τὰ πάντα ῥεῖ καὶ οὐδὲν μένει). This idea rings true in machine learning, especially when it comes to concept and data drift.
Concept drift refers to changes in the relationship between the input data and the target variable that our machine learning models aim to predict. We need to recognize and adapt to these changes to keep our models relevant and, most importantly, accurate. For example, consider a model predicting customer preferences. Over time, customer behavior may shift due to factors such as seasonal trends, market dynamics, or new product introductions. These changes can alter the patterns the model has learned, leading to a decline in its predictive performance.
Monitoring and Detecting Concept Drift
Monitoring concept drift is essential for any system that receives non-stationary data: we must continuously watch for signs that the data patterns are shifting. This involves:
Statistical tests, such as the Kolmogorov–Smirnov test or the Chi-squared test.
Performance-based drift detection using metrics such as accuracy, F1-score, etc.
Machine learning mechanisms that detect changes in the data distributions.
By tracking these indicators, we can detect when our models start to diverge from their expected performance, allowing us to take proactive measures. We will explore such methods in the second part of this series of blog posts.
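As a taste of the statistical approach, here is a minimal sketch of drift detection with a two-sample Kolmogorov–Smirnov test: a feature's reference window (seen at training time) is compared against the most recent production window. The variable names and the simulated shift are illustrative, not from a real system.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1_000)  # feature values seen at training time
current = rng.normal(loc=1.5, scale=1.0, size=1_000)    # recent values, mean has shifted

# The KS test compares the two empirical distributions; a small p-value
# rejects the hypothesis that both windows come from the same distribution.
stat, p_value = ks_2samp(reference, current)
drift_detected = p_value < 0.05
print(f"KS statistic={stat:.3f}, p-value={p_value:.4f}, drift={drift_detected}")
```

In practice the same comparison would run periodically per feature, with the significance level tuned to balance false alarms against detection delay.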
Understanding Data Drifts in Stream Learning
While discussing drifts, we'll use the terms "concept drift" and "data drift" interchangeably. However, it's important to distinguish between concept drift and virtual drift.
Virtual Drift (also known as feature change) occurs when the statistical characteristics of the input data used to train our machine learning models shift over time. Importantly, this shift does not affect the relationship between the input data and the target variable. In essence, while the input data may change, the model's method of predicting outcomes remains stable.
In the example above, we see the decision boundary of a binary classification model at two different time intervals, `t` and `t + 1`. For simplicity, assume our data are characterized by two features, X1 and X2, and that the dashed line is the decision boundary of a fitted model. On the left, we observe the data at timepoint `t`, while on the right we see the newly inserted red data points. The new data lie further away from the old ones; however, this change in the data distribution does not affect the decision boundary of the model. Such cases are called virtual drifts or feature changes.
Concept Drift (or Data Drift), on the other hand, refers to changes in the statistical properties of the target variable itself over time. Unlike virtual drift, concept drift directly impacts the model's predictions as it encounters new data. This phenomenon reveals that our understanding of the target variable may evolve, necessitating that our models adapt to these changes.
Now let's consider the same binary classification scenario, but this time the new data in the right figure directly affect the decision boundary of the fitted model. If we kept using the blue decision boundary on the new data, accuracy would deteriorate. Such cases are called concept drifts, and they call for model adaptation, which in this example means replacing the blue boundary with the green dashed line.
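This degradation is easy to reproduce in a few lines. The toy simulation below (all names and data are illustrative) fits a linear classifier on an old concept, then scores it on data where the relationship between features and label has changed, mimicking a rotated decision boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Old concept: the class depends on whether X1 > X2.
X_old = rng.uniform(-1, 1, size=(500, 2))
y_old = (X_old[:, 0] > X_old[:, 1]).astype(int)
model = LogisticRegression().fit(X_old, y_old)

# New concept: the boundary has rotated (class depends on X1 > -X2),
# so the old model's predictions no longer match the labels.
X_new = rng.uniform(-1, 1, size=(500, 2))
y_new = (X_new[:, 0] > -X_new[:, 1]).astype(int)

print("accuracy on old concept:", model.score(X_old, y_old))
print("accuracy on new concept:", model.score(X_new, y_new))  # noticeably lower
```

The stale model stays accurate on the old concept but drops to near chance level on the new one, which is exactly the signal performance-based monitoring looks for.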
Types of Concept Drift
We are primarily interested in concept drifts, which come in four main types, each exhibiting unique behaviors in this ever-evolving landscape:
Sudden Concept Drift: This type of drift occurs abruptly, similar to a sudden change in weather. It presents a significant challenge to our models due to the unexpected shift in the underlying data distribution. Rapid adaptation is necessary to maintain model accuracy, often requiring the current model to be replaced with a new one trained on the most recent data.
Incremental Concept Drift: In contrast to sudden drift, incremental concept drift happens slowly and subtly. It involves a continuous, gradual replacement of the old concept with a new one. Though its immediate impact on model performance might be minimal, it can accumulate over time and degrade accuracy if not addressed. Models need to adapt progressively, often using online or incremental learning methods, where the model is updated instance by instance.
Gradual Concept Drift: This occurs when the target distribution shifts progressively from one concept to another. While it shares similarities with incremental drift, gradual drift does not necessarily imply a smooth or continuous transition. It can involve discrete steps, making it somewhat easier to detect, especially if the changes are distinct and occur at predictable intervals.
Recurring Concept Drift: This type involves patterns that reemerge cyclically. It includes two subtypes:
Cyclic Recurrent Drift: Occurs with a certain periodicity or seasonal trend, such as Christmas sales discounts.
Acyclic Recurrent Drift: Lacks clear periodicity, meaning the concept may reappear unpredictably. For instance, electricity prices may spike due to rising petrol prices and later return to previous levels when petrol prices decrease.
Understanding these types of concept drift is crucial for developing models that can effectively adapt and maintain performance in the face of changing data distributions.
Conclusions
Concept drift is an important issue in machine learning, significantly impacting the accuracy and reliability of predictive models over time. In this blog post, we highlighted the nature of concept drift, explaining how the relationships between input features and target variables evolve, necessitating continuous monitoring. Concept drifts exhibit different behaviors, such as sudden, incremental, gradual, and recurring, which makes monitoring and adaptation challenging tasks. In the second part of this blog post series, we will explore monitoring and tracking concept drifts.
** If you are interested in other ML use cases, please contact me using the form (and include a publicly available dataset for the case; I'm always curious to explore new problems).