Blog·08/05/2026

How to evaluate whether an ML model actually works

A model with 95% accuracy can be completely useless. A model that looks worse in validation may be the one that actually works in production. Here is what separates honest evaluation from the kind that inflates results.

A model with 95% accuracy on a loan-default dataset with 5% positives tells you nothing: a model that always predicts 'no default' reaches the same figure without learning anything. The metric you choose defines what you end up optimising, and if you choose wrong, you optimise the wrong thing.
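To make the arithmetic concrete, here is a minimal sketch with made-up labels (1,000 loans, 5% defaults) and a "model" that always answers 'no default'. The numbers are illustrative, not taken from a real dataset.

```python
# Synthetic labels: 50 defaults (1) and 950 non-defaults (0), i.e. 5% positives.
y_true = [1] * 50 + [0] * 950

# A "model" that has learned nothing: it always predicts 'no default'.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of real defaults that the model catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent while the model catches none of the defaults, which is the only thing it was built to do.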

The problem with global metrics. Accuracy on imbalanced datasets is the most common trap. For binary classification where positives are rare (fraud, default, disease), the useful metrics are ROC-AUC (how well the model ranks positives above negatives, regardless of threshold) and PR-AUC (area under the precision-recall curve, more informative when positives are few). For problems with asymmetric cost, what matters is how much each type of error costs your business.
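A quick sketch of computing both metrics, assuming scikit-learn and NumPy are available. The labels and scores are synthetic, and `average_precision_score` is used as a common step-wise summary of the precision-recall curve; it is one way to get a PR-AUC figure, not the only one.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced labels: roughly 5% positives.
y_true = (rng.random(10_000) < 0.05).astype(int)

# Hypothetical model scores: positives score higher on average, with overlap.
scores = rng.normal(loc=0.0, scale=1.0, size=y_true.size) + 1.0 * y_true

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

# A random ranker gets about 0.5 ROC-AUC, but a PR-AUC close to the positive rate.
prevalence = y_true.mean()

print(f"ROC-AUC = {roc_auc:.3f} (random ranker: 0.5)")
print(f"PR-AUC  = {pr_auc:.3f} (random ranker: about {prevalence:.3f})")
```

The point of printing the baselines is that a PR-AUC only makes sense next to the positive rate: the same score distribution can give a comfortable ROC-AUC and a far more modest PR-AUC.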

Data leakage: the source of optimistic evaluations. If preprocessing (scaling, encoding, missing-value imputation) is fitted on the full dataset before cross-validation, the model has indirectly seen the validation data during training. The result: validation metrics that do not reproduce in production. The solution is for preprocessing to live inside the pipeline, fitted only on the training data of each fold.
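One way to do this, sketched with scikit-learn under assumed choices: the dataset is synthetic (`make_classification`), and the imputer, scaler and logistic regression are placeholders for whatever steps your problem needs. What matters is that `cross_val_score` receives the whole pipeline, so every step is refitted on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced dataset (about 5% positives), for illustration only.
X, y = make_classification(
    n_samples=2_000,
    n_features=20,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],
    random_state=0,
)

# Preprocessing and model live in one object.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1_000),
)

# For each fold, the imputer and scaler are fitted on the training split only,
# then applied to the validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("ROC-AUC per fold:", fold_scores.round(3))
print("mean ROC-AUC:", round(fold_scores.mean(), 3))
```

The leaky version would call `StandardScaler().fit_transform(X)` once on all the data and then cross-validate the classifier alone; the code looks almost identical, which is why this mistake is so easy to make.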

Calibration: when the score needs to be a probability. If the model says 0.7 probability of default, does default actually occur in 70% of those cases? If the answer is no, the model is not calibrated. You can have good ROC-AUC and still need calibration to make decisions based on the score. The Brier score and reliability curve tell you whether probabilities are reliable; Platt scaling or isotonic calibration corrects them when they are not.
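The 0.7 question can be checked by hand. The sketch below uses plain Python and two made-up groups of ten cases, all scored 0.7: in one, 7 of 10 default; in the other, only 3 of 10. The helper names `brier_score` and `reliability_table` are my own, not a library API.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reliability_table(y_true, y_prob, n_bins=5):
    """Per equal-width bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


y_prob = [0.7] * 10                     # the model says 0.7 for every case
y_calibrated = [1] * 7 + [0] * 3        # default occurs in 70% of them
y_overconfident = [1] * 3 + [0] * 7     # default occurs in only 30% of them

print(f"{brier_score(y_calibrated, y_prob):.3f}")     # 0.210
print(f"{brier_score(y_overconfident, y_prob):.3f}")  # 0.370

for mean_pred, observed, count in reliability_table(y_overconfident, y_prob):
    print(f"predicted {mean_pred:.2f}, observed {observed:.2f}, n={count}")
```

A reliability curve is this table plotted: predicted probability on one axis, observed rate on the other, with a calibrated model sitting on the diagonal.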

The decision threshold is not 0.5. The threshold that minimises total error is not necessarily the best for your business. If a false negative (letting fraud through) costs 10 times more than a false positive (blocking a legitimate transaction), the optimal threshold reflects that asymmetry. Optimising the threshold for the real business cost usually changes the operational result more than changing the algorithm.
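A minimal sketch of a cost-based threshold search, using the 10-to-1 ratio from the example. Everything else is assumed: the scores are synthetic and calibrated by construction, and the costs are in arbitrary units.

```python
import random

COST_FN = 10.0  # assumed cost of letting a fraud through
COST_FP = 1.0   # assumed cost of blocking a legitimate transaction

random.seed(0)

# Synthetic scores, calibrated by construction: each case is a fraud
# with probability equal to its score.
scores = [random.betavariate(1, 8) for _ in range(5_000)]
labels = [1 if random.random() < s else 0 for s in scores]


def total_cost(threshold):
    """Business cost of blocking every case with score >= threshold."""
    false_negatives = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    false_positives = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return COST_FN * false_negatives + COST_FP * false_positives


# Sweep a grid of thresholds and keep the cheapest one.
thresholds = [i / 100 for i in range(1, 100)]
best_threshold = min(thresholds, key=total_cost)

print(f"best threshold: {best_threshold:.2f}")
print(f"cost at best threshold: {total_cost(best_threshold):.0f}")
print(f"cost at 0.50:           {total_cost(0.5):.0f}")
```

For calibrated scores, decision theory puts the cost-minimising threshold at COST_FP / (COST_FP + COST_FN), about 0.09 here, so the search should land well below 0.5. In practice the search belongs on a validation set, not on the test set used for the final report.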

Per-segment error analysis. A global metric of 0.87 ROC-AUC can hide that the model works well in one segment and poorly in another. You need to analyse error by subgroups relevant to the problem: customer tenure, product category, region, etc. Errors concentrated in one segment usually point to a data or feature engineering problem, not an algorithm problem.
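A toy illustration in plain Python: twelve hand-made records split by customer tenure. The segment names, labels and scores are invented, and `roc_auc` is a small pairwise implementation written for this sketch.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# (segment, true label, model score)
records = [
    ("tenured", 0, 0.10), ("tenured", 0, 0.20), ("tenured", 0, 0.35),
    ("tenured", 1, 0.80), ("tenured", 1, 0.90), ("tenured", 1, 0.70),
    ("new", 0, 0.60), ("new", 0, 0.50), ("new", 0, 0.20),
    ("new", 1, 0.40), ("new", 1, 0.70), ("new", 1, 0.30),
]

# Global metric over all records.
overall = roc_auc([y for _, y, _ in records], [s for _, _, s in records])

# Same metric, computed separately inside each segment.
by_segment = defaultdict(list)
for segment, y, s in records:
    by_segment[segment].append((y, s))

per_segment = {
    segment: roc_auc([y for y, _ in rows], [s for _, s in rows])
    for segment, rows in by_segment.items()
}

print(f"overall: {overall:.2f}")  # overall: 0.86
for segment, auc in per_segment.items():
    print(f"{segment}: {auc:.2f}")  # tenured: 1.00, new: 0.56
```

The global figure looks healthy, yet for new customers the model is barely better than a coin flip. On real data, check that each segment has enough positives for its metric to mean something.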

The rule I apply: before delivering any model, the evaluation report must include at least ROC-AUC and PR-AUC on a held-out test set, the calibration curve, the chosen threshold with its justification and the per-segment error analysis. Without that, you cannot claim the model works.
