⚠️ Data Leakage in Machine Learning
Part 2 of the ML Engineering Failure Series
Symptoms
| What You See | Example |
|---|---|
| Extremely high validation accuracy | “Wow! This model is amazing!” |
| Unrealistic performance vs. industry benchmarks | “We beat SOTA without trying!” |
| Near‑perfect predictions in training | “It’s ready for production!” |
| Sudden collapse after deployment | “Everything is broken. Why?!” |
When a model accidentally learns from information it should never have access to at prediction time, it performs perfectly in training but is completely useless in the real world.
Illustrative case
A retail company built a model to predict which customers would cancel subscriptions.
- Training accuracy: 94%
- Production AUC: 0.51 (almost random)
A feature named `cancellation_timestamp` leaked the answer: during training, the model learned that a non-null `cancellation_timestamp` meant the customer would cancel. That feature doesn't exist at inference time, which caused the collapse. The issue was a pipeline problem, not an algorithm problem.
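One way to catch this class of leak before training is to audit each feature's missingness pattern against the label. Here is a minimal sketch, assuming a pandas DataFrame with a 0/1 target column; `audit_null_pattern_leakage`, `customers_df`, and `churned` are hypothetical names, not part of the original pipeline:

```python
import pandas as pd

# Hypothetical audit: flag features whose "is populated" pattern tracks the
# 0/1 label almost perfectly, a strong hint of target leakage.
def audit_null_pattern_leakage(df: pd.DataFrame, target_col: str, threshold: float = 0.95) -> list:
    target = df[target_col]
    suspects = []
    for col in df.columns.drop(target_col):
        # Correlation between "feature is non-null" and the target label.
        corr = df[col].notna().astype(int).corr(target)
        if pd.notna(corr) and abs(corr) >= threshold:
            suspects.append(col)
    return suspects

# A leaked column such as cancellation_timestamp would show up here.
# print(audit_null_pattern_leakage(customers_df, target_col="churned"))
```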
Types of Leakage
| Type | Explanation |
|---|---|
| Target Leakage | Model sees target information before prediction. |
| Train–Test Contamination | Same records appear in both training and testing sets. |
| Future Information Leakage | Data from future timestamps used during training. |
| Proxy Leakage | Features highly correlated with the target act as hidden shortcuts. |
| Preprocessing Leakage | Scaling or encoding done before the split creates overlap. |
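Of these, train–test contamination is often the cheapest to rule out: before training, check whether any record appears verbatim in both splits. A minimal sketch, assuming both splits are pandas DataFrames with the same columns; `count_overlapping_rows` is a hypothetical helper:

```python
import pandas as pd

# Hypothetical contamination check: count records that appear verbatim in
# both the training and test sets (matched on their shared columns).
def count_overlapping_rows(train_df: pd.DataFrame, test_df: pd.DataFrame) -> int:
    overlap = train_df.drop_duplicates().merge(test_df.drop_duplicates(), how="inner")
    return len(overlap)

# A non-zero result means the test set is not truly unseen data.
# print(count_overlapping_rows(x_train_df, x_test_df))
```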
Example: Preprocessing Leakage
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Leaky version: the scaler is fit on the full dataset, so test-set
# statistics (mean and std) leak into the training features.
scaler = StandardScaler()
scaled = scaler.fit_transform(dataset)  # LEAKS TEST INFORMATION
x_train, x_test, y_train, y_test = train_test_split(scaled, y)

# Correct version: split first, then fit the scaler on training data only.
x_train, x_test, y_train, y_test = train_test_split(dataset, y)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit on training data only
x_test = scaler.transform(x_test)        # reuse the training statistics
```
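An even safer pattern is to keep preprocessing inside a scikit-learn `Pipeline`, so each cross-validation fold refits the scaler on its own training portion and validation data never influences the scaling statistics. A minimal sketch, reusing the same hypothetical `dataset` and `y` as above (the `LogisticRegression` estimator is just a placeholder):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The pipeline bundles scaling and the model, so cross_val_score refits the
# scaler inside each fold instead of on the full dataset.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, dataset, y, cv=5)
print(scores.mean())
```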
Detection Methods
| Signal | What It Suggests |
|---|---|
| Training accuracy ≫ validation accuracy | Suspicious model performance |
| Validation accuracy ≫ production accuracy | Pipeline mismatch |
| Certain features dominate importance scores | Proxy leakage |
| Model perfectly predicts rare events | Impossible without leakage |
| Sudden accuracy degradation post‑deployment | Real‑world collapse |
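The "dominating feature" signal is easy to automate: fit a quick probe model and flag any feature that captures an implausibly large share of the importance. A minimal sketch, assuming a tabular feature matrix; the 0.5 threshold and the `feature_names` list are illustrative choices, not a standard:

```python
from sklearn.ensemble import RandomForestClassifier

# Probe model: its feature importances sum to 1, so any single feature
# holding more than half of the total importance deserves scrutiny.
probe = RandomForestClassifier(n_estimators=100, random_state=0)
probe.fit(x_train, y_train)

for name, share in sorted(zip(feature_names, probe.feature_importances_),
                          key=lambda pair: -pair[1]):
    flag = "  <-- investigate for proxy leakage" if share > 0.5 else ""
    print(f"{name}: {share:.2%}{flag}")
```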
A robust workflow:
- Split → Preprocess → Train → Evaluate (chronological split for time-series data; see the split sketch after this list).
- Document data lineage and ownership.
- Define allowed features for production.
- Continuously track drift, accuracy, and live feedback.
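For the first step, a time-ordered split is what keeps future rows out of training. A minimal sketch, assuming a pandas DataFrame with an event-time column; `chronological_split`, `events_df`, and `event_time` are hypothetical names:

```python
import pandas as pd

# Chronological split: everything before the cutoff trains the model,
# everything at or after the cutoff evaluates it. No shuffling across time.
def chronological_split(df: pd.DataFrame, time_col: str, cutoff: str):
    cutoff_ts = pd.Timestamp(cutoff)
    df = df.sort_values(time_col)
    train = df[df[time_col] < cutoff_ts]
    test = df[df[time_col] >= cutoff_ts]
    return train, test

# Hypothetical usage:
# train_df, test_df = chronological_split(events_df, "event_time", "2024-01-01")
```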
If a model performs unbelievably well, don’t celebrate—investigate. Good models improve gradually; perfect models almost always hide leakage.
Truth About Model Performance
- Training accuracy is not real performance; production is the only ground truth.
- Leakage is a pipeline problem, not an algorithm problem; engineering matters more than modeling.
- Prevention > debugging: fix the data design before training.
Feature Drift & Concept Drift — Why Models Rot in Production
Models lose accuracy over time due to changes in input data distribution (feature drift) or changes in the underlying relationship between inputs and target (concept drift). Detecting and preventing degradation requires:
- Monitoring feature statistics and model predictions in real time (a drift-check sketch follows this list).
- Retraining with up‑to‑date data when drift is detected.
- Maintaining a feedback loop from production outcomes back into the training pipeline.
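A lightweight way to implement the monitoring step is a per-feature two-sample Kolmogorov–Smirnov test between a training snapshot and recent production data. A minimal sketch, assuming numeric features in pandas DataFrames; `detect_feature_drift` and the 0.01 p-value threshold are illustrative choices:

```python
from scipy.stats import ks_2samp

# Hypothetical drift check: compare each feature's production distribution
# against its training distribution and report significant shifts.
def detect_feature_drift(train_df, prod_df, features, p_threshold=0.01):
    drifted = []
    for col in features:
        result = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        if result.pvalue < p_threshold:
            drifted.append((col, result.statistic))
    return drifted

# Features returned here have shifted enough to justify investigation,
# and possibly retraining on more recent data.
# print(detect_feature_drift(train_snapshot, production_window, numeric_features))
```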