splitcheck
Detect rows leaking between train, validation and test splits.
Install
pip install splitcheck
Available once the package is published to PyPI. It can also be installed now from GitHub:
pip install git+https://github.com/jmweb-org/splitcheck
What it does
A row that appears in both train and test inflates every metric and is easy to introduce by accident. splitcheck compares your splits and reports how much of one appears in another, both exact and after normalization.
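To make the idea concrete, here is a minimal standalone sketch of exact versus normalized overlap. It does not use splitcheck itself, and this section does not document the library's API; the function names `normalize` and `overlap_fraction` are hypothetical, and the normalization shown (lowercasing and collapsing whitespace) is only an assumption about what "normalization" might involve.

```python
def normalize(row):
    # Assumed normalization (not confirmed by splitcheck): lowercase each
    # field and collapse runs of whitespace into single spaces.
    return tuple(" ".join(str(value).lower().split()) for value in row)


def overlap_fraction(source, target, key=tuple):
    # Fraction of rows in `target` whose key also appears in `source`.
    if not target:
        return 0.0
    seen = {key(row) for row in source}
    hits = sum(1 for row in target if key(row) in seen)
    return hits / len(target)


train = [
    ("The cat sat.", "pos"),
    ("Dogs bark", "neg"),
    ("Rain again", "neg"),
]
test = [
    ("the  cat sat.", "pos"),  # matches train only after normalization
    ("Dogs bark", "neg"),      # exact duplicate of a train row
    ("Sunny day", "pos"),
    ("Cold night", "neg"),
]

exact = overlap_fraction(train, test)                      # 1 of 4 test rows
normalized = overlap_fraction(train, test, key=normalize)  # 2 of 4 test rows

print(f"exact overlap:      {exact:.2f}")
print(f"normalized overlap: {normalized:.2f}")
```

The gap between the two numbers is the interesting part: rows that differ only in case or spacing are missed by an exact comparison but still leak information.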
Features
- Exact and normalized overlap between splits.
- Whole-row or single-column comparison.
- Leakage as a fraction of the target split.
- CI gate; reads CSV, Parquet, JSONL and text.
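The features above can be illustrated with a small standalone sketch. It is not splitcheck's interface, which this section does not show: the names `leakage` and `gate`, the `max_leakage` threshold, and the inline CSV data are all hypothetical. It compares whole rows and a single column, expresses leakage as a fraction of the target split, and turns the result into an exit code the way a CI gate might.

```python
import csv
import io


def leakage(source, target, columns=None):
    # Fraction of target rows that also appear in source.
    # columns=None compares whole rows; otherwise only the named columns.
    def key(row):
        return tuple(row[c] for c in (columns or row.keys()))

    if not target:
        return 0.0
    seen = {key(row) for row in source}
    return sum(1 for row in target if key(row) in seen) / len(target)


def gate(fraction, max_leakage=0.0):
    # Hypothetical CI gate: exit code 0 if leakage is within the threshold.
    return 0 if fraction <= max_leakage else 1


# Inline CSV stands in for files on disk so the sketch is self-contained.
TRAIN_CSV = "id,text,label\n1,good film,pos\n2,bad film,neg\n"
TEST_CSV = "id,text,label\n7,good film,neg\n8,fine film,pos\n"

train = list(csv.DictReader(io.StringIO(TRAIN_CSV)))
test = list(csv.DictReader(io.StringIO(TEST_CSV)))

whole_row = leakage(train, test)                    # ids and labels differ
text_only = leakage(train, test, columns=["text"])  # "good film" repeats
exit_code = gate(text_only, max_leakage=0.0)

print(f"whole-row leakage: {whole_row:.2f}")
print(f"text-only leakage: {text_only:.2f}")
print(f"gate exit code:    {exit_code}")
```

In a real CI script the exit code would be passed to `sys.exit`. The example also shows why single-column comparison is useful: the repeated text is invisible to a whole-row check because the `id` and `label` fields differ.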