
Data

splitcheck

Detect rows leaking between train, validation and test splits.

Install

pip install splitcheck

The command above works once the package is published to PyPI. Until then, install directly from GitHub:

pip install git+https://github.com/jmweb-org/splitcheck

What it does

A row that appears in both train and test inflates evaluation metrics, and such leaks are easy to introduce by accident. splitcheck compares your splits and reports how much of one split appears in another, both as exact matches and as matches after normalization.
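To make the difference between exact and normalized overlap concrete, here is a minimal sketch in plain Python. It does not use splitcheck's API, which this page does not document. The function names and the normalization steps (Unicode NFKC, case folding, whitespace collapsing) are assumptions chosen for illustration and may differ from what the tool actually does.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def leakage_fraction(source, target, key=lambda row: row):
    """Fraction of rows in `target` whose key also occurs in `source`."""
    if not target:
        return 0.0
    seen = {key(row) for row in source}
    hits = sum(1 for row in target if key(row) in seen)
    return hits / len(target)


# Hypothetical toy splits, one text row per item.
train = ["The cat sat.", "Dogs bark loudly", "Rain is wet"]
test = ["the  cat sat.", "Rain is wet", "Birds fly", "Fish swim"]

exact = leakage_fraction(train, test)                      # byte-for-byte matches only
normalized = leakage_fraction(train, test, key=normalize)  # also catches case/spacing variants

print(f"exact: {exact:.2f}, normalized: {normalized:.2f}")
# prints "exact: 0.25, normalized: 0.50"
```

In this toy data, one test row is an exact copy of a train row, and a second differs only in casing and spacing, so the normalized check reports twice the leakage of the exact check.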

Features

  • Exact and normalized overlap between splits.
  • Whole-row or single-column comparison.
  • Leakage as a fraction of the target split.
  • Can be used as a CI gate.
  • Reads CSV, Parquet, JSONL and text files.
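
The features above can be illustrated with a short sketch. This is not splitcheck's implementation or interface. The helpers `row_key`, `leaked_fraction` and `gate`, the `max_allowed` threshold, and the exit-code convention are all hypothetical, meant only to show how whole-row versus single-column comparison and a CI gate could fit together.

```python
def row_key(row, column=None):
    """Whole-row key (all fields) or, if `column` is given, that column's value."""
    if column is None:
        return tuple(sorted(row.items()))
    return row[column]


def leaked_fraction(source, target, column=None):
    """Leakage expressed as a fraction of the target split."""
    if not target:
        return 0.0
    seen = {row_key(r, column) for r in source}
    return sum(row_key(r, column) in seen for r in target) / len(target)


def gate(fraction, max_allowed=0.0):
    """Assumed CI convention: exit code 0 passes the build, 1 fails it."""
    return 0 if fraction <= max_allowed else 1


# Hypothetical splits: same text under different ids.
train = [
    {"id": 1, "text": "good movie", "label": "pos"},
    {"id": 2, "text": "bad plot", "label": "neg"},
]
test = [
    {"id": 7, "text": "good movie", "label": "pos"},
    {"id": 8, "text": "great cast", "label": "pos"},
]

whole_row = leaked_fraction(train, test)                 # ids differ, so no whole-row match
text_only = leaked_fraction(train, test, column="text")  # one of two test texts is in train
exit_code = gate(text_only, max_allowed=0.0)             # a CI script would pass this to sys.exit

print(whole_row, text_only, exit_code)
# prints "0.0 0.5 1"
```

The example shows why single-column comparison matters: an identifier column can hide a duplicate that a whole-row check misses.
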
View the code on GitHub

hola@jmwebsoluciones.com