
Overfitting Is Not Just a Model Problem — It Is a Mindset Problem

khaled, April 25, 2023

Overfitting — when a model learns to perform well on training data but fails on new data — has well-known technical remedies: more data, regularisation, dropout, early stopping, cross-validation. These tools work. But they treat overfitting as a technical failure when it is also a systemic failure in how ML projects are run. The hardest-to-detect forms of overfitting are not in the model — they are in the decisions, processes, and incentives surrounding the model.

The Standard Story (Incomplete)

The standard story: overfitting occurs when a model is too complex relative to the size of the training dataset. The model memorises noise rather than learning signal. Fix it with regularisation, with more data, or by reducing model complexity. Evaluate on a held-out test set; if test accuracy matches training accuracy, there is no overfitting.
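
A minimal sketch of the standard remedy, assuming scikit-learn; the wide synthetic dataset and the ridge penalty (alpha=10) are arbitrary illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Wide data (40 features, 60 samples) invites memorising noise.
X, y = make_regression(n_samples=60, n_features=40, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unregularised linear model shows a large train/test gap;
# ridge regularisation shrinks the coefficients and narrows it.
for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          f"train R^2 = {model.score(X_train, y_train):.2f},",
          f"test R^2 = {model.score(X_test, y_test):.2f}")
```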

This story is accurate as far as it goes. It is incomplete because it focuses exclusively on the model and ignores how the test set evaluation itself can become overfitted.

Evaluation Overfitting: The Deeper Problem

Every time you check performance on your test set and use that information to make a modelling decision — selecting between two models, tuning a threshold, choosing a feature — you use the test set as implicit training data. Repeat this across 50 experiments and the model you report as "best on the test set" is the one that happened to benefit from test-set random variation, not necessarily the one with the best true generalisation.

This phenomenon is sometimes called evaluation overfitting or benchmark overfitting. It explains why models that perform brilliantly in research papers often underperform in production.
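
A toy simulation makes the effect concrete. Every number below is invented for illustration: fifty candidate models with identical true accuracy are scored on one shared test set, and only the best score is reported.

```python
import numpy as np

rng = np.random.default_rng(0)
true_accuracy = 0.80    # every candidate model is equally good
n_test = 1_000          # size of the shared test set
n_experiments = 50      # how many times the test set is consulted

# Each experiment's observed accuracy: n_test Bernoulli(0.8) trials.
observed = rng.binomial(n_test, true_accuracy, size=n_experiments) / n_test

print(f"true accuracy:   {true_accuracy:.3f}")
print(f"mean observed:   {observed.mean():.3f}")
print(f"best (reported): {observed.max():.3f}")
# The "winner" typically beats its true accuracy by 2-3 standard errors,
# purely through test-set random variation.
```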

How to prevent it:

  • Use a dedicated validation set for all iterative decisions; reserve the test set for a single final evaluation
  • Use nested cross-validation if your dataset is too small for a three-way split (see the sketch after this list)
  • Pre-register your final evaluation protocol before conducting it
  • Be deeply sceptical of performance gains that are not reproducible on fresh data
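
Here is the nested cross-validation sketch referenced above, assuming a scikit-learn setup; the SVM, parameter grid, and dataset are placeholder choices. The inner loop spends data on hyperparameter selection, while each outer test fold is consulted exactly once.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
inner = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5)  # tuning loop
outer_scores = cross_val_score(inner, X, y, cv=5)           # evaluation loop

print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Note that the selected hyperparameters may differ across outer folds: nested cross-validation estimates the performance of the whole tuning procedure, not of one fixed model.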

Pipeline Leakage: Feature Engineering Overfitting

A subtler form arises when feature engineering decisions are made by looking at the full dataset, including the test split. If you discover and add a feature after observing that it correlates with the target on the test set, you have incorporated test-set information into the feature set — a form of data leakage.

Correct approach: all feature engineering decisions, including which features to include, how to encode categoricals, and what transformations to apply, must be made from the training set only. One practical way to enforce this is with sklearn Pipeline objects that encapsulate all preprocessing, so every transformer is refitted inside each cross-validation fold on training data alone, as in the sketch below.
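
A hedged sketch of that pattern on synthetic data; the scaler, univariate selector, and classifier are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, only a handful genuinely informative.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Because scaling and feature selection live inside the Pipeline,
# cross_val_score refits them on the training folds of each split,
# so the test fold never influences which features are kept.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"leakage-free CV accuracy: {scores.mean():.3f}")
```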

Reporting Overfitting: The Incentive Problem

Research environments and business ML teams share an incentive structure that systematically promotes overfitting: good-looking results are rewarded, with little scrutiny of how many attempts produced them. If a team tries 20 models and reports the best, the expected reported accuracy is substantially higher than the true generalisation accuracy of any individual model, exactly the selection effect shown in the simulation above.

Addressing this requires cultural and process changes, not technical ones:

  • Report confidence intervals, not point estimates (a bootstrap sketch follows this list)
  • Report the number of experiments conducted alongside the best result
  • Require out-of-time validation (testing on data from a time period after the training period) for any time-series or temporal model
  • Treat production performance as the ground truth, not benchmark performance
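
For the first point, a minimal bootstrap sketch; y_true and y_pred below are placeholders standing in for a real, single-shot final test evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder final-evaluation results; substitute your real arrays.
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)  # ~85% accurate

# Resample per-example correctness with replacement to get an interval.
correct = (y_true == y_pred).astype(float)
boot = [rng.choice(correct, size=correct.size).mean() for _ in range(2_000)]
low, high = np.percentile(boot, [2.5, 97.5])

print(f"accuracy = {correct.mean():.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```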

The Mindset Shift

Treating overfitting as only a technical problem leads to technical-only solutions (add regularisation) while leaving systemic causes intact (test sets reused for model selection, features engineered on full data, incentives for good-looking numbers).

The mindset shift: every decision that touches outcome data is a potential source of leakage. This includes how features are built, how models are selected, how hyperparameters are tuned, how thresholds are set, and how performance is reported. Treating each of these decisions as requiring the same rigour as a scientific experiment is the foundation of models that actually generalise.

Conclusion

Technical overfitting is solved by regularisation and more data. Systemic overfitting is solved by evaluation discipline, pipeline hygiene, and a cultural commitment to measuring what actually generalises — not what looks best on the held-out set that everyone has been peeking at. The best ML practitioners are sceptical of their own results and design their processes to protect against the natural human tendency to find patterns that are not really there.

Keywords: overfitting, machine learning, evaluation overfitting, data leakage, generalisation, regularisation, cross-validation, ML evaluation, benchmark overfitting