
Data Leakage: The Silent Killer of Real-World ML Performance

khaled · January 2, 2025

Data leakage is the single most common cause of the gap between evaluation performance and production performance in ML projects. A model with AUC-ROC of 0.97 in evaluation that delivers 0.72 in production is not a fluke — it is almost certainly a leakage problem. Leakage occurs when information about the target is inadvertently included in the training features in a way that would not be available at inference time. Understanding where leakage hides, and how to eliminate it, is essential for building models that actually work.

What Data Leakage Means

During training, the model has access to both features and labels. Leakage occurs when some features encode information that is only available because we know the label. In production, those features would either be unavailable or would look different — meaning the model's performance collapses.

The insidious part: models affected by leakage typically show suspiciously high evaluation performance. Any classification model achieving AUC > 0.98 on a non-trivial real-world problem deserves a leakage audit.

Pattern 1: Temporal Leakage

The most common form of leakage in time-series and temporal problems is using future information to predict the past. If your training set includes features computed at time T+1 to predict an outcome at time T, you have temporal leakage.

Example: predicting customer churn using "account closed date" as a feature. If the customer is in the training set, you might have their actual account close date — which directly reveals whether they churned.

Prevention: implement strict temporal splits. Training data must only contain information available before the prediction date for each example. Any feature that could change after the target event is a leakage risk. For rolling aggregates (30-day purchase count), ensure the window ends before the prediction date.
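
A rough sketch with pandas, assuming a hypothetical transactions table with customer_id, event_time and amount columns: the 30-day aggregate is computed only from events strictly before the prediction date.

import pandas as pd

prediction_date = pd.Timestamp('2024-06-01')  # hypothetical prediction cutoff

# strict temporal split: only events observed before the prediction date are usable
history = transactions[transactions['event_time'] < prediction_date]

# rolling 30-day aggregate whose window ends at the prediction date, never after it
window_start = prediction_date - pd.Timedelta(days=30)
recent = history[history['event_time'] >= window_start]
purchase_count_30d = recent.groupby('customer_id')['amount'].count()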

Pattern 2: Train-Test Contamination in Preprocessing

A very common pipeline error: fitting preprocessing transformers (scalers, imputers, encoders) on the full dataset including the test split, then evaluating on that test split.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is the feature matrix (assumed already loaded)

# WRONG — scaler fitted on full dataset including test
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # leaks test statistics into training
X_train, X_test = train_test_split(X_scaled)

# CORRECT — scaler fitted only on training data
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # transform only, no fit

When the test set's statistics influence the scaler parameters, the test set is no longer independent from training — evaluation metrics are optimistically biased.

Solution: always wrap preprocessing in sklearn Pipelines, so that cross-validation utilities fit each preprocessing step on the training folds only.
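
A minimal sketch, assuming X and y are already loaded: with the scaler inside a Pipeline, cross_val_score refits it on the training portion of every fold, so the held-out fold never influences the scaling parameters (the logistic regression is just a placeholder estimator).

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# the scaler is refitted inside every training fold; held-out folds are only transformed
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')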

Pattern 3: Target Encoding Leakage

Target encoding (replacing a categorical with the mean of the target for that category) leaks when computed on the full training set and then applied to cross-validation folds.

# WRONG — target statistics computed on full training set
mean_target = train_df.groupby('category')['target'].mean()
train_df['encoded'] = train_df['category'].map(mean_target)
# every row's own target contributed to its encoding, so validation folds see leaked information

Correct: use k-fold target encoding, where each fold's encoding is computed on the other k-1 folds. Libraries like category_encoders implement this correctly.
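
A minimal sketch of out-of-fold target encoding, assuming a train_df with the category and target columns from the snippet above; the number of folds and the fallback to the global mean are arbitrary choices.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=5):
    # each row is encoded with target means computed on the other folds only
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(df):
        fold_means = df.iloc[fit_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][cat_col].map(fold_means).to_numpy()
    return encoded.fillna(global_mean)  # categories unseen in a fold fall back to the global mean

train_df['encoded'] = kfold_target_encode(train_df, 'category', 'target')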

Pattern 4: Duplicate Records Across Splits

If the same record appears in both training and test sets — due to duplicates in the raw data that were not deduplicated before splitting — the model can effectively "memorise" those examples. Test performance looks better than true generalisation performance.

Prevention: deduplicate before splitting. For time-series data with multiple records per entity, ensure that all records for a given entity are in the same split (group-based splitting).
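
A minimal sketch with sklearn's GroupShuffleSplit, assuming a df with a hypothetical customer_id column identifying the entity.

from sklearn.model_selection import GroupShuffleSplit

df = df.drop_duplicates()  # deduplicate before splitting

# group-based split: all records for a given customer land in the same split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df['customer_id']))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]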

Pattern 5: High-Cardinality ID Leakage

Including user IDs, session IDs, or other high-cardinality identifier columns as features is almost always leakage. The model learns to associate specific IDs with specific outcomes from the training set; at inference, new IDs carry no information and the feature provides no value. And if the same IDs appear in both training and test splits, evaluation scores are inflated by memorisation rather than genuine signal.
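
One way to catch these columns before training is a simple cardinality heuristic. The 0.95 threshold below is an arbitrary assumption and only string/categorical columns are checked, so treat it as a rough filter rather than a guarantee.

import pandas as pd

def drop_id_like_columns(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    # flag string/categorical columns whose cardinality is close to the row count
    candidates = df.select_dtypes(include=['object', 'category']).columns
    id_like = [c for c in candidates if df[c].nunique() >= threshold * len(df)]
    return df.drop(columns=id_like)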

Leakage Detection Checklist

  1. Check AUC-ROC: is it suspiciously high (>0.95 on a non-trivial problem)?
  2. Audit feature generation: for each feature, would it be available at inference time?
  3. Confirm preprocessing is fitted only on training data
  4. Verify temporal splits use strict cutoffs
  5. Check for duplicates across train and test sets (a quick audit is sketched after this list)
  6. Remove ID columns before training
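
For item 5, a quick audit, assuming train_df and test_df share the same columns: an inner merge on all columns counts rows that appear verbatim in both splits.

import pandas as pd

# rows that appear verbatim in both splits (merge on all shared columns)
overlap = pd.merge(train_df.drop_duplicates(), test_df.drop_duplicates(), how='inner')
print(f'{len(overlap)} identical rows shared between train and test')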

Conclusion

Data leakage is not a sign of incompetence — it is a sign of complexity in real-world ML pipelines. The best defence is systematic auditing: for every feature and every preprocessing step, ask "would this information be available at the time we need to make this prediction?" A model that survives this audit is a model that will actually work in production.

Keywords: data leakage, machine learning, temporal leakage, train test split, target encoding leakage, ML pipeline, production ML, feature engineering leakage