⚠️ Data Leakage in Machine Learning

Published: December 1, 2025 at 11:28 PM EST
2 min read
Source: Dev.to

Part 2 of the ML Engineering Failure Series

Symptoms

| What You See | Example |
| --- | --- |
| Extremely high validation accuracy | “Wow! This model is amazing!” |
| Unrealistic performance vs. industry benchmarks | “We beat SOTA without trying!” |
| Near‑perfect predictions in training | “It’s ready for production!” |
| Sudden collapse after deployment | “Everything is broken. Why?!” |

When the model accidentally learns patterns it should never have access to, it performs perfectly in training but is completely useless in the real world.

Illustrative case

A retail company built a model to predict which customers would cancel subscriptions.

  • Training accuracy: 94 %
  • Production AUC: 0.51 (almost random)

A feature named cancellation_timestamp leaked the answer: during training the model learned that a non‑null cancellation_timestamp meant the customer would cancel. This feature didn’t exist at inference time, causing the collapse. The issue was a pipeline problem, not an algorithm problem.
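
One way a feature like this can be caught before training is a quick per-feature audit: score every column on its own against the target and flag anything that is suspiciously predictive by itself. A minimal sketch, assuming a pandas DataFrame df with a binary churn target (the function and column names are hypothetical):

import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_leaky_features(df: pd.DataFrame, target: str, threshold: float = 0.95) -> dict:
    """Score each feature alone against a binary target; near-perfect AUC suggests leakage."""
    y = df[target].astype(int)
    suspicious = {}
    for col in df.columns.drop(target):
        series = df[col]
        if series.isna().any():
            signal = series.notna().astype(int)      # "is this field filled in at all?"
        elif pd.api.types.is_numeric_dtype(series):
            signal = series
        else:
            continue                                 # skip free-text columns in this quick audit
        auc = roc_auc_score(y, signal)
        auc = max(auc, 1.0 - auc)                    # direction of the relationship is irrelevant
        if auc >= threshold:
            suspicious[col] = round(auc, 3)
    return suspicious

# A result like {'cancellation_timestamp': 1.0} points straight at the leak.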

Types of Leakage

| Type | Explanation |
| --- | --- |
| Target Leakage | The model sees target information before prediction time. |
| Train–Test Contamination | The same records appear in both the training and test sets. |
| Future Information Leakage | Data from future timestamps is used during training. |
| Proxy Leakage | Features highly correlated with the target act as hidden shortcuts. |
| Preprocessing Leakage | Scaling or encoding fitted on the full dataset before the split lets test-set statistics leak into training. |
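
Train–test contamination in particular is cheap to rule out: deduplicate before splitting and verify that no identical row appears in both sets. A minimal sketch, assuming pandas DataFrames train_df and test_df with the same columns (names are illustrative):

import pandas as pd

def count_shared_rows(train_df: pd.DataFrame, test_df: pd.DataFrame) -> int:
    """Count identical rows present in both splits; anything above zero inflates validation scores."""
    overlap = train_df.drop_duplicates().merge(test_df.drop_duplicates(), how="inner")
    return len(overlap)

# Deduplicating before the split avoids the problem in the first place:
# dataset = dataset.drop_duplicates().reset_index(drop=True)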

Example: Preprocessing Leakage

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Leaky version
scaler = StandardScaler()
scaled = scaler.fit_transform(dataset)   # fit uses every row, so test statistics leak into training
x_train, x_test, y_train, y_test = train_test_split(scaled, y)

# Correct version: split first, then fit the scaler on training data only
x_train, x_test, y_train, y_test = train_test_split(dataset, y)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # statistics come from the training set only
x_test = scaler.transform(x_test)        # test set is transformed with training statistics
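
When cross-validation is involved, the easiest way to keep preprocessing on the right side of the split is to put it inside a scikit-learn Pipeline, so the scaler is refit on each training fold and only applied to the held-out fold. A minimal sketch (the model choice here is arbitrary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# fit_transform runs on each training fold only; held-out folds are
# transformed with those fold-specific statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, dataset, y, cv=5)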

Detection Methods

| Signal | Indicator |
| --- | --- |
| Training accuracy ≫ validation accuracy | Suspicious model performance |
| Validation accuracy ≫ production accuracy | Pipeline mismatch |
| Certain features dominate importance scores | Proxy leakage |
| Model perfectly predicts rare events | Impossible without leakage |
| Sudden accuracy degradation post‑deployment | Real‑world collapse |
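
Two of the signals above, a large train/validation gap and one feature dominating the importance scores, are easy to check automatically after every training run. A minimal sketch reusing the split from the scaling example, with a random forest and arbitrary warning thresholds:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(x_train, y_train)

# Signal 1: training accuracy far above validation accuracy.
gap = model.score(x_train, y_train) - model.score(x_test, y_test)
if gap > 0.15:
    print(f"WARNING: train/validation gap of {gap:.2f} looks suspicious")

# Signal 2: a single feature carrying most of the importance (classic proxy leak).
top_share = model.feature_importances_.max()
if top_share > 0.5:
    print(f"WARNING: one feature accounts for {top_share:.0%} of total importance")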

A robust workflow:

  1. Split → Preprocess → Train → Evaluate (chronological split for time‑series data; see the sketch after this list).
  2. Document data lineage and ownership.
  3. Define allowed features for production.
  4. Continuously track drift, accuracy, and live feedback.
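
For time-series data, step 1's chronological split simply means ordering by time and holding out the most recent slice, rather than splitting at random. A minimal sketch, assuming a pandas DataFrame with an event_date column (the column name and the 80/20 ratio are illustrative):

import pandas as pd

# Order by time, then hold out the most recent 20% for evaluation.
df = df.sort_values("event_date").reset_index(drop=True)
split_idx = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split_idx], df.iloc[split_idx:]

# All preprocessing is then fit on train_df only, exactly as in the scaling example above.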

If a model performs unbelievably well, don’t celebrate—investigate. Good models improve gradually; perfect models almost always hide leakage.

Truth About Model Performance

  • Training accuracy is not real performance; production is the only ground truth.
  • Leakage is a pipeline problem, not an algorithm problem; engineering matters more than modeling.
  • Prevention > debugging: fix the data design before training.

Feature Drift & Concept Drift — Why Models Rot in Production

Models lose accuracy over time due to changes in input data distribution (feature drift) or changes in the underlying relationship between inputs and target (concept drift). Detecting and preventing degradation requires:

  • Monitoring feature statistics and model predictions in real time (a minimal drift check is sketched after this list).
  • Retraining with up‑to‑date data when drift is detected.
  • Maintaining a feedback loop from production outcomes back into the training pipeline.
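
One common way to catch feature drift is a two-sample test between the training distribution and a recent window of production data, run per feature. A minimal sketch using the Kolmogorov–Smirnov test (the window size and the 0.05 cutoff are just conventional starting points):

import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(train_values: np.ndarray, live_values: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the live distribution differs significantly from the training one."""
    _, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

# Run per numeric feature on a sliding window of recent production rows, e.g.:
# drifted = {c: feature_has_drifted(train_df[c], live_df[c]) for c in numeric_columns}

Concept drift is harder to see from the inputs alone; it usually shows up only once delayed labels arrive, which is why the feedback loop from production outcomes back into training matters.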