pii-sweep
Scan datasets for PII before sharing them.
Install
pip install pii-sweep
The command above will work once the package is published to PyPI. Until then, install from GitHub:
pip install git+https://github.com/jmweb-org/pii-sweep
What it does
Before a dataset is shared, copied to a notebook or pushed to a bucket, it is worth knowing whether a column holds emails, card numbers or IDs. pii-sweep samples each column and reports which look like PII and how strongly.
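To make the idea of sampling a column and scoring it concrete, here is a minimal sketch in plain Python. It is not pii-sweep's API: the names `column_confidence` and `flag_columns`, the simple email pattern, the sample size and the 0.8 threshold are all assumptions chosen for illustration.

```python
import random
import re

# Deliberately simple email pattern, for illustration only (an assumption,
# not the pattern pii-sweep uses).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def column_confidence(values, pattern=EMAIL_RE, sample_size=100, seed=0):
    """Return the fraction of sampled non-empty values that match `pattern`."""
    non_empty = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(non_empty, min(sample_size, len(non_empty)))
    hits = sum(1 for v in sample if pattern.match(v))
    return hits / len(sample)


def flag_columns(table, threshold=0.8):
    """Map column name -> confidence for columns scoring at or above `threshold`."""
    scores = {name: column_confidence(vals) for name, vals in table.items()}
    return {name: score for name, score in scores.items() if score >= threshold}


# Hypothetical in-memory table: column name -> list of cell values.
table = {
    "contact": ["ana@example.com", "bo@example.org", None, "cy@example.net"],
    "city": ["Lima", "Oslo", "Pune", "Kyoto"],
}
print(flag_columns(table))  # {'contact': 1.0}
```

Scoring a sample rather than every row keeps the scan cheap on large files, at the cost of possibly missing PII that appears in only a few rows.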
Features
- Checksum detectors: cards (Luhn), IBAN (mod-97), SSN.
- Email, phone and IPv4, grouped by severity.
- Confidence per column and a configurable threshold.
- CI gate; reads CSV, Parquet and JSONL.
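The two checksums named in the feature list are standard algorithms, and a short sketch shows why they cut false positives compared with matching digits by pattern alone. The functions below are an independent illustration, not pii-sweep's code. The names `luhn_valid` and `iban_valid` are hypothetical, and the IBAN check covers only length and the mod-97 test, not country-specific formats.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check for card-like numbers; spaces and hyphens are ignored."""
    s = number.replace(" ", "").replace("-", "")
    if len(s) < 2 or not (s.isascii() and s.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Mod-97 check for IBANs (length and checksum only)."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    # Move country code and check digits to the end, then map A=10 ... Z=35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))          # True
print(luhn_valid("4111 1111 1111 1112"))          # False
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_valid("GB83 WEST 1234 5698 7654 32"))  # False
```

The card number and IBAN used here are widely published test values, not real accounts.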