
Calibrating Your Classifier: Why Accuracy Alone Is Not Enough

khaled November 17, 2024 4 mins read

A classifier that correctly orders examples — ranking positives above negatives — can be completely wrong about the magnitude of its predictions. If a model outputs a probability of 0.9 for examples that are actually positive only 60% of the time, it is miscalibrated: its confidence levels do not match reality. For any application where the raw probability estimate drives a downstream decision — risk thresholds, expected value calculations, portfolio management, medical triage — miscalibration is as damaging as poor accuracy.

What Calibration Means

A perfectly calibrated model satisfies the following property: among all examples to which the model assigns probability p, exactly a fraction p should be positive. If the model says "I'm 80% confident this is positive," roughly 80% of those examples should actually be positive.
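
One standard way to write this formally, with \(\hat{p}(X)\) denoting the model's predicted probability for input X:

\[
\Pr\big(Y = 1 \mid \hat{p}(X) = p\big) = p \quad \text{for all } p \in [0, 1].
\]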

In practice, no model achieves perfect calibration, but the degree of miscalibration matters enormously for real decisions.

Common Calibration Failures by Model Type

Naive Bayes: typically severely overconfident. Its conditional independence assumption double-counts correlated evidence, so it produces probabilities near 0 and 1 that are far more extreme than the actual frequencies.

Logistic Regression: generally well-calibrated if regularisation is not too strong, because it is explicitly trained to minimise log-loss (a proper scoring rule).

Random Forest: tends toward underconfidence near the extremes; averaging across trees pulls predicted probabilities away from 0 and 1, so a prediction of 0.9 often corresponds to a true frequency even closer to 1.

Gradient Boosting (XGBoost, LightGBM): often overconfident; the sequential boosting process tends to push predictions toward extreme values.

Neural Networks: can be severely miscalibrated, especially modern high-capacity networks. The 2017 paper "On Calibration of Modern Neural Networks" (Guo et al.) showed that today's deep networks are significantly worse calibrated than the smaller networks of earlier years, even as their accuracy has improved.

Measuring Calibration: Reliability Diagrams and ECE

The reliability diagram (calibration plot) is the standard visualisation: bin predictions by probability (e.g., [0.0-0.1), [0.1-0.2), ..., [0.9-1.0]) and for each bin, plot the mean predicted probability (x-axis) against the fraction of positive examples (y-axis). A perfectly calibrated model produces a diagonal line. Points above the diagonal indicate underconfidence; points below it indicate overconfidence.
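
A minimal sketch of a reliability diagram using scikit-learn's calibration_curve (y_test and probs are placeholders for your held-out labels and predicted probabilities):

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# probs = model.predict_proba(X_test)[:, 1]; y_test holds the true 0/1 labels
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)

plt.plot(prob_pred, prob_true, marker='o', label='model')
plt.plot([0, 1], [0, 1], linestyle='--', label='perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()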

Expected Calibration Error (ECE) quantifies calibration as a scalar: the absolute difference between mean predicted probability and the true fraction of positives in each bin, averaged with weights proportional to the number of predictions in the bin. Lower ECE is better; as a rough rule of thumb, ECE under 0.03 is often considered good, though the acceptable level depends on the application.
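
A direct NumPy implementation of that uniform-binning definition (a sketch; exact bin boundaries and tie handling vary between libraries):

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean of |mean predicted prob - fraction positive| over bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < 1.0:
            in_bin = (y_prob >= lo) & (y_prob < hi)
        else:
            in_bin = (y_prob >= lo) & (y_prob <= hi)  # include 1.0 in the last bin
        if in_bin.any():
            gap = abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's share of predictions
    return ece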

Calibration Methods

Platt Scaling

Fit a logistic regression on top of the model's raw output scores, using a held-out calibration set:

from sklearn.calibration import CalibratedClassifierCV

# base_model must already be fitted on the training data; cv='prefit'
# tells scikit-learn to fit only the calibrator, on the held-out set
calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv='prefit')
calibrated_model.fit(X_calibration, y_calibration)

Platt scaling is effective when the calibration curve is roughly sigmoid-shaped. It has only two parameters (a slope and an intercept) and needs only a small calibration set; 200-500 examples is often sufficient.
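
The same idea spelled out by hand (a sketch; sklearn's method='sigmoid' fits the two Platt parameters with its own internal procedure, but conceptually it is a one-feature logistic regression on the raw scores):

from sklearn.linear_model import LogisticRegression

# base_model is assumed to be already fitted and to expose decision_function
scores = base_model.decision_function(X_calibration).reshape(-1, 1)

platt = LogisticRegression()  # learns p = 1 / (1 + exp(-(A*score + B)))
platt.fit(scores, y_calibration)

test_scores = base_model.decision_function(X_test).reshape(-1, 1)
calibrated_probs = platt.predict_proba(test_scores)[:, 1]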

Isotonic Regression

A non-parametric alternative that fits a monotonic function between raw scores and calibrated probabilities. More flexible than Platt scaling but requires more calibration data (500-1000 examples minimum) and can overfit on small calibration sets.

calibrated_model = CalibratedClassifierCV(base_model, method='isotonic', cv='prefit')
calibrated_model.fit(X_calibration, y_calibration)
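
Equivalently, the monotonic mapping can be fitted directly with IsotonicRegression (a sketch, assuming base_model is already fitted and supports predict_proba):

from sklearn.isotonic import IsotonicRegression

cal_scores = base_model.predict_proba(X_calibration)[:, 1]
iso = IsotonicRegression(out_of_bounds='clip')  # clamp unseen scores into [0, 1]
iso.fit(cal_scores, y_calibration)

calibrated_probs = iso.transform(base_model.predict_proba(X_test)[:, 1])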

Temperature Scaling (for Neural Networks)

Divide the model's output logits by a learned temperature parameter T before applying softmax. T > 1 softens the distribution (reduces overconfidence); T < 1 sharpens it. Temperature is a single parameter fit on the calibration set by minimising negative log-likelihood.
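
A minimal NumPy/SciPy sketch of fitting T on held-out logits (for a real network you would typically do this inside the training framework with a few gradient steps; logits and labels here are placeholder arrays of shape (n, n_classes) and (n,)):

import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Find the T > 0 that minimises negative log-likelihood on the calibration set."""
    def nll(T):
        probs = softmax(logits, T)
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method='bounded').x

# T = fit_temperature(calibration_logits, calibration_labels)
# calibrated_probs = softmax(test_logits, T)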

When to Prioritise Calibration

Calibration matters most when:

  • Probability thresholds drive decisions: setting a cutoff for fraud alerts, approval thresholds for loan applications
  • Probabilities feed downstream calculations: expected value = predicted probability × business value (see the sketch after this list)
  • Multiple models are compared: calibrated probabilities allow meaningful comparison across different model types
  • Users see probability estimates: displaying "85% chance of rain" to users requires calibration to maintain trust
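
For instance, a fraud-alerting rule built on a calibrated probability might look like this (a hypothetical sketch: the cost and saving figures, the transaction_features row, and the raise_alert helper are made up for illustration):

# Hypothetical economics: an alert costs 5 to investigate,
# a caught fraud saves 200 on average.
p_fraud = calibrated_model.predict_proba(transaction_features)[0, 1]
expected_value = p_fraud * 200 - 5

if expected_value > 0:  # alert only when the expected saving exceeds the cost
    raise_alert(transaction_features)  # placeholder for your alerting path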

Calibration is less critical when only the ranking of predictions matters, not the absolute probability (e.g., a recommendation system where you only care about the relative order of items).

Conclusion

Calibration is a component of model quality that many practitioners check too late or not at all. A well-calibrated model is a trustworthy model — one whose probability outputs can be used directly in decision systems without systematic bias. Measure calibration with reliability diagrams and ECE, and apply Platt scaling or isotonic regression as a lightweight post-processing step that significantly improves usability in production decision systems.

Keywords: model calibration, classifier calibration, Platt scaling, isotonic regression, temperature scaling, reliability diagram, ECE, probability calibration, machine learning