dsdiff
A git-style diff between two datasets, with distribution drift detection.
Install
pip install dsdiff
The command above works once the package is published to PyPI. Until then, install from GitHub:
pip install git+https://github.com/jmweb-org/dsdiff
What it does
When a dataset is regenerated, columns can quietly be renamed or retyped, gain nulls, or shift in distribution, and the pipeline keeps running while the model degrades. dsdiff compares two files and reports what changed, ranked by severity.
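To make the kinds of change concrete, here is a minimal sketch of a schema and null-rate comparison in plain Python. It is not dsdiff's implementation or API: `schema_changes`, `null_rate`, and the `{column: dtype}` mapping are hypothetical names chosen for illustration, and treating empty strings as nulls is an assumption.

```python
def schema_changes(old, new):
    """Compare two {column: dtype} mappings (hypothetical input shape)."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        # Columns present in both versions whose declared type differs.
        "retyped": {
            col: (old[col], new[col])
            for col in sorted(old_cols & new_cols)
            if old[col] != new[col]
        },
    }


def null_rate(values):
    """Fraction of missing values; None and "" count as null (an assumption)."""
    if not values:
        return 0.0
    return sum(v is None or v == "" for v in values) / len(values)


old_schema = {"user_id": "int64", "amount": "float64", "country": "string"}
new_schema = {"user_id": "string", "amount": "float64", "region": "string"}

changes = schema_changes(old_schema, new_schema)
print(changes)
```

In this sketch a renamed column (`country` to `region`) appears as one removal plus one addition. Recognising it as a rename would need extra logic, such as comparing the values of the two columns.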
Features
- Schema changes: added, removed or retyped columns.
- Per-column distribution drift with PSI.
- Null-rate and cardinality jumps.
- CI gate and JSON output; reads CSV, Parquet and JSONL.
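For readers unfamiliar with PSI (Population Stability Index), the sketch below shows one common way to compute it for a numeric column. It does not reflect how dsdiff bins or smooths values, which is not described here. The function name `psi`, the equal-width binning, the bin count, and the `eps` floor are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of `expected`; values of `actual`
    outside that range fall into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    # Interior bin edges; bisect_right maps each value to a bin index 0..bins-1.
    edges = [lo + i * width for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each proportion at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
shifted = [x + 2.0 for x in baseline]       # same shape, moved right by 2

print(psi(baseline, baseline))  # identical samples give 0.0
print(psi(baseline, shifted))   # a clear shift gives a large positive value
```

A widely used rule of thumb reads a PSI below 0.1 as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as significant drift. These cut-offs are a convention and are not necessarily the ones dsdiff uses for its severity ranking.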