Blog·20/06/2026

Your eval dropped from 90% to 89%: real regression or noise?

A new model scores 89.4% where the old one scored 90.0% on 1,000 examples. It looks like a regression. At that sample size it is indistinguishable from noise. Here is how to tell the two apart before you block a deploy.

You change a part of the pipeline, re-run the evaluation, and accuracy goes from 90.0% to 89.4%. The question that decides whether you ship is easy to state and hard to answer by eye: is that half point a real regression, or the normal variation of measuring on a finite sample?

The number alone says nothing. An accuracy is a proportion measured over a specific number of examples. On 1,000 examples at 90% accuracy, the standard error is about 0.95 points, so the margin of error is roughly ±1.9 points at 95% confidence. A model whose true accuracy is 90% can land anywhere between about 88.1% and 91.9% on a given run purely by the luck of which examples landed in the test set. The difference between two independently measured accuracies is noisier still. Blocking a deploy over half a point on 1,000 examples is reacting to noise.
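The ±1.9 figure can be reproduced in a few lines. This is a minimal sketch using the normal approximation for a proportion; the function name accuracy_margin is made up for illustration, and other interval methods (Wilson, for example) would give slightly different bounds.

```python
import math
from statistics import NormalDist


def accuracy_margin(accuracy: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of a proportion measured on n examples.
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return z * standard_error


margin = accuracy_margin(0.90, 1000)
print(f"90.0% ± {100 * margin:.1f} points")  # prints "90.0% ± 1.9 points"
```

Because the margin shrinks with the square root of the sample size, quadrupling the test set only halves it.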

The right statistical test. For two proportions (two accuracies with their sample sizes) the test is a two-proportion z-test: compute the difference, the pooled standard error, and a p-value. If the p-value is above your threshold (0.05 is the conventional choice), the difference is not distinguishable from noise. When both models were evaluated on the same examples, a paired test such as McNemar's test is more powerful: it looks only at the cases where the two models disagree, which is where the signal is.
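Both tests fit in a short standard-library sketch. The function names are illustrative, the z-test is the two-sided pooled version, and the McNemar variant shown is the exact binomial one. The discordant counts in the example (26 and 20) are invented to match a six-example gap; in practice they come from comparing per-example results.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for two independent proportions. Returns (z, p_value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def mcnemar_exact(only_a_correct, only_b_correct):
    """Exact two-sided McNemar p-value from the two discordant counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the models never disagree, so there is nothing to test
    k = min(only_a_correct, only_b_correct)
    # Under the null, each disagreement favours either model with probability 0.5.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Old model: 900/1000 correct. New model: 894/1000 correct.
z, p_value = two_proportion_z_test(900, 1000, 894, 1000)
print(f"z = {z:.2f}, p = {p_value:.2f}")  # p is roughly 0.66, far above 0.05

# Hypothetical paired breakdown: 26 examples only the old model gets right,
# 20 only the new model gets right.
paired_p = mcnemar_exact(26, 20)
```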

What breaks when you gate on the raw number in CI. If your continuous integration fails the moment the metric dips, the build flaps on every re-run from pure sampling noise, and the team learns to ignore it. If you never gate, a real regression slips through. The way out is to gate on significance: fail only when the candidate is significantly worse, and let through what is within noise.

A four-state verdict, not two. Instead of pass/fail, report one of four verdicts: improvement, no measurable change, worse but within noise, or significant regression. Only the last should break the build. This is exactly the logic I implemented in evalgate, a command-line tool that runs the right test and returns the verdict with its p-value.
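As an illustration of the idea (this is not evalgate's code or its interface), a four-state gate could look like the sketch below. The gate function and Verdict enum are hypothetical names. The mapping is one possible reading of the four states: a delta counts as an improvement or a regression only when the two-sided z-test is significant, and both models are assumed to be scored on the same number of examples.

```python
import math
from enum import Enum
from statistics import NormalDist


class Verdict(Enum):
    IMPROVEMENT = "improvement"
    NO_MEASURABLE_CHANGE = "no measurable change"
    WORSE_WITHIN_NOISE = "worse but within noise"
    SIGNIFICANT_REGRESSION = "significant regression"


def gate(baseline_correct, candidate_correct, n, alpha=0.05):
    """Return (verdict, p_value) for two models scored on n examples each."""
    delta = (candidate_correct - baseline_correct) / n
    pooled = (baseline_correct + candidate_correct) / (2 * n)
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n)
    # Degenerate scores (all right or all wrong) carry no signal to test.
    if standard_error == 0:
        p_value = 1.0
    else:
        p_value = 2 * (1 - NormalDist().cdf(abs(delta) / standard_error))
    significant = p_value < alpha
    if delta < 0:
        verdict = (Verdict.SIGNIFICANT_REGRESSION if significant
                   else Verdict.WORSE_WITHIN_NOISE)
    else:
        verdict = (Verdict.IMPROVEMENT if significant
                   else Verdict.NO_MEASURABLE_CHANGE)
    return verdict, p_value


verdict, p_value = gate(baseline_correct=900, candidate_correct=894, n=1000)
print(verdict.value, round(p_value, 2))

# Only a significant regression should fail the build.
exit_code = 1 if verdict is Verdict.SIGNIFICANT_REGRESSION else 0
```

In CI, the exit code is what breaks the build; the other three verdicts can be logged so the trend stays visible without blocking anyone.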

The practical takeaway: when the sample is small, the right answer to an ambiguous delta is almost never to block or to ship blindly, but to collect more examples. A borderline p-value is a sign that your evaluation set is too small to decide, not that the model is worse.
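To get a feel for how many more examples "more" means, a standard sample-size approximation for the unpaired two-proportion z-test helps. The sketch below assumes a two-sided test at 0.05 and 80% power, both of which are choices made here rather than anything fixed by the method; examples_needed is an illustrative name. A paired design on shared examples typically needs fewer.

```python
import math
from statistics import NormalDist


def examples_needed(p_baseline, p_candidate, alpha=0.05, power=0.80):
    """Approximate examples per model to detect the gap with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    gap = p_baseline - p_candidate
    return math.ceil((z_alpha + z_power) ** 2 * variance / gap ** 2)


n = examples_needed(0.900, 0.894)
print(f"about {n:,} examples per model")
```

For a 0.6-point gap around 90% accuracy this lands in the neighbourhood of 40,000 examples per model, which is why a 1,000-example set cannot settle the question.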
