evalgate
Decide whether a metric change is a real regression or noise.
Install
pip install evalgate-cli
This works once the package is published to PyPI. Until then, install from GitHub:
pip install git+https://github.com/jmweb-org/evalgate
What it does
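An eval that drops from 90.0% to 89.4% on 1,000 examples looks like a regression, but at that sample size the 0.6-point gap is well within sampling noise. evalgate runs the statistical test that matches the data (a two-proportion test for aggregate accuracies, McNemar's test for paired per-example results) and fails only when the candidate is significantly worse.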
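To see why the 90.0% to 89.4% example counts as noise, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. It is not evalgate's own code: the function name `two_proportion_z_test` is hypothetical, and the sketch assumes both runs were scored on 1,000 independent examples.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided pooled two-proportion z-test. Returns (z, p_value).

    Hypothetical helper for illustration; not part of evalgate's API.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis that both runs share one rate.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both runs all-correct or all-wrong: no evidence of change
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Baseline: 900/1000 correct (90.0%). Candidate: 894/1000 correct (89.4%).
z, p = two_proportion_z_test(900, 1000, 894, 1000)
print(f"z = {z:.2f}, p = {p:.2f}")  # z = -0.44, p = 0.66
```

With p around 0.66, the drop is far from significant at a typical alpha of 0.05, so a gate built on this test would not fail.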
Features
- Two-proportion test over aggregate accuracies.
- McNemar's test for paired per-example results.
- Verdict: improvement, unchanged, noise or regression.
- CI gate with a configurable alpha.
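The paired test, verdict, and gate can be sketched together as below. This is an illustration under stated assumptions, not evalgate's implementation: the names `mcnemar_exact`, `verdict`, and `gate_exit_code` are hypothetical, the exact (binomial) form of McNemar's test is one of several variants, and the mapping from test result to the four verdicts is a guess at reasonable semantics.

```python
import math

def mcnemar_exact(baseline_correct, candidate_correct):
    """Exact two-sided McNemar test on paired per-example booleans.

    Returns (b, c, p_value). Hypothetical helper; not evalgate's API.
    """
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(1 for base, cand in pairs if base and not cand)  # candidate broke it
    c = sum(1 for base, cand in pairs if cand and not base)  # candidate fixed it
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, nothing to test
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

def verdict(b, c, p_value, alpha=0.05):
    """Assumed mapping onto the four verdict labels."""
    if b + c == 0:
        return "unchanged"
    if p_value >= alpha:
        return "noise"
    return "regression" if b > c else "improvement"

def gate_exit_code(result):
    """Non-zero only for a significant regression; a CI script could pass this to sys.exit."""
    return 1 if result == "regression" else 0

# 100 paired examples: 12 broken by the candidate, 2 fixed, 86 correct in both.
baseline = [True] * 12 + [False] * 2 + [True] * 86
candidate = [False] * 12 + [True] * 2 + [True] * 86
b, c, p = mcnemar_exact(baseline, candidate)
result = verdict(b, c, p, alpha=0.05)
print(b, c, round(p, 3), result, gate_exit_code(result))  # 12 2 0.013 regression 1
```

Because the test looks only at examples where the two runs disagree, it can detect a regression that an aggregate comparison of 98% against 88% on 100 examples would judge with less power. Lowering alpha makes the gate stricter about what it calls a regression.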