Blog·13/06/2026

Leakage between train and test: the mistake that inflates every metric

A row that appears in both training and test makes the model look better than it is. It is easy to introduce by accident and hard to see by eye. Here is where it creeps in and how to catch it before you trust your numbers.

The point of the test set is to estimate how the model will behave on data it has not seen. The moment an example appears in both training and test, that estimate stops being honest: the model already saw the answer, so the metric goes up, not because the model generalises better, but because it is remembering.

How leakage creeps in unnoticed. It is almost never an obvious bug. It is a `concat` of two files that already overlapped. It is a dataset regenerated and re-split without fixing the seed. It is duplication at source: the same event logged twice with a different id. It is text normalisation collapsing two near-identical rows into the same text. In time-series data, it is splitting at random instead of by date, and training on the future to predict the past.
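For the time-series case, a minimal sketch of a split by date rather than at random. The `split_by_date` helper, the `date` field and the sample rows are illustrative assumptions, not part of any particular library.

```python
from datetime import date

def split_by_date(rows, cutoff, date_key="date"):
    """Train on rows strictly before the cutoff, test on rows on or after it."""
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    return train, test

# Hypothetical rows; the only requirement is a comparable date field.
rows = [
    {"date": date(2026, 1, 5), "value": 10},
    {"date": date(2026, 2, 9), "value": 12},
    {"date": date(2026, 3, 2), "value": 11},
    {"date": date(2026, 4, 20), "value": 15},
]

train, test = split_by_date(rows, cutoff=date(2026, 3, 1))

# Every training row is older than every test row, so nothing from the
# future can inform a prediction about the past.
print(len(train), "train rows,", len(test), "test rows")
```

Because the cutoff is a fixed date rather than a random draw, regenerating the dataset and re-splitting it puts every existing row back in the same split.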

Why you do not see it by eye. A 2% overlap between train and test can lift your accuracy by up to two points, just enough for a model to look better than the baseline when it is not. A gain that small does not jump out of a results table; it looks like a legitimate improvement. And because leakage inflates exactly the metric you are looking at, it reinforces the wrong conclusion.
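The arithmetic behind that can be sketched under one simplifying assumption: leaked rows are always answered correctly because the model memorised them, and every other test row is answered at the model's true accuracy. The function name and the example accuracies are illustrative.

```python
def measured_accuracy(true_accuracy: float, leak_fraction: float) -> float:
    """Accuracy reported on a test set where leak_fraction of rows were seen in training.

    Assumption: leaked rows are always correct (memorised); the remaining
    rows are answered at true_accuracy.
    """
    return (1 - leak_fraction) * true_accuracy + leak_fraction

for true_accuracy in (0.50, 0.80, 0.95):
    reported = measured_accuracy(true_accuracy, leak_fraction=0.02)
    lift = (reported - true_accuracy) * 100
    print(f"true {true_accuracy:.2f} -> reported {reported:.3f} (+{lift:.1f} points)")
```

Under this assumption the lift equals `leak_fraction * (1 - true_accuracy)`, so the inflation is largest when the leaked rows are ones the model would otherwise get wrong.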

Detection: exact and normalised. The minimum check is to look for identical rows across splits. The check that actually catches real cases is the normalised one: compare after lowercasing, stripping punctuation and collapsing whitespace, because duplication usually comes with cosmetic differences. What matters is measuring what fraction of the test set also appears in training; that fraction is your leakage.
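A minimal sketch of both checks, assuming each row is a single text string. The `normalise` and `leakage` helpers and the sample rows are illustrative, and only ASCII punctuation is stripped here.

```python
import re
import string

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalise(text: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip()

def leakage(train: list[str], test: list[str], key=lambda s: s) -> float:
    """Fraction of test rows whose key also appears in train."""
    if not test:
        return 0.0
    seen = {key(row) for row in train}
    return sum(key(row) in seen for row in test) / len(test)

train = ["The cat sat on the mat.", "Dogs bark loudly", "Rain is forecast"]
test = ["the cat  sat on the mat", "Dogs bark loudly", "Snow is forecast", "Birds sing"]

exact = leakage(train, test)
normalised = leakage(train, test, key=normalise)
print(f"exact: {exact:.2f}  normalised: {normalised:.2f}")
# prints "exact: 0.25  normalised: 0.50"
```

The exact check finds one of the four test rows in training; the normalised check also catches the row that differs only in case, spacing and a trailing full stop.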

Contamination against benchmarks. The same problem appears at larger scale when you train or fine-tune on data that, without your knowing it, contains examples from the benchmark you later evaluate on. The result is a spectacular number that does not hold in production. The defence is the same: compare your training set against the benchmark before trusting the figure.

I automated this check in splitcheck: it compares two or more splits, reports exact and normalised overlap, and fails CI when leakage exceeds a threshold. The rule I follow is simple: no test metric is credible until you have checked that the test set contains nothing from training.
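The idea of a CI gate can be sketched in a few lines. This is not splitcheck's interface: `check_splits`, its `threshold` parameter and the choice to measure each pair as the share of the smaller split found in the larger one are assumptions made for the example.

```python
import re
import string
from itertools import combinations

_PUNCT = str.maketrans("", "", string.punctuation)

def normalise(text: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower().translate(_PUNCT)).strip()

def check_splits(splits: dict[str, list[str]], threshold: float = 0.0) -> int:
    """Return 0 if every pair of splits is within threshold, 1 otherwise.

    For each pair, leakage is the fraction of the smaller split whose
    normalised rows also appear in the larger one (an assumption of this sketch).
    """
    keys = {name: [normalise(r) for r in rows] for name, rows in splits.items()}
    exit_code = 0
    for a, b in combinations(keys, 2):
        small, large = sorted((a, b), key=lambda name: len(keys[name]))
        if not keys[small]:
            continue  # an empty split cannot leak
        seen = set(keys[large])
        rate = sum(k in seen for k in keys[small]) / len(keys[small])
        status = "ok" if rate <= threshold else "FAIL"
        print(f"{small} in {large}: {rate:.1%} [{status}]")
        if rate > threshold:
            exit_code = 1
    return exit_code

# Hypothetical splits; "alpha  BETA!" matches "Alpha beta" after normalisation.
splits = {
    "train": ["Alpha beta", "Gamma delta", "Epsilon"],
    "val": ["Zeta"],
    "test": ["alpha  BETA!", "Eta"],
}
code = check_splits(splits)  # 1, because half of test also appears in train
# In a CI job, finish with: raise SystemExit(code)
```

A non-zero exit code is what makes the pipeline stop, so a leaking split blocks the build before anyone reads the metric.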

