---
language: en
license: mit
tags:
- token-classification
- ner
- pii
- privacy
- english
datasets:
- ai4privacy/pii-masking-300k
metrics:
- precision
- recall
- f1
base_model: distilbert/distilbert-base-uncased
library_name: transformers
pipeline_tag: token-classification
---

# pleno_anonymize_en

Lightweight English PII NER trained on the English split of [`ai4privacy/pii-masking-300k`](https://huggingface.co/datasets/ai4privacy/pii-masking-300k).

Built as the English counterpart of [`0xhikae/pleno_anonymize_ja`](https://huggingface.co/0xhikae/pleno_anonymize_ja); recipe is intentionally a mirror of the JP supervised v2 pipeline but with `distilbert-base-uncased` as the backbone (~66M params) so the artefact stays small for CPU inference.

## Acceptance tier

Smoke (≥0.50 F1) on the EN validation split is the explicit target. Numbers reported alongside the JP card are 1000-iter document-level bootstrap CIs; this card refreshes after the run completes.

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tok = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_en")
mdl = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_en")
ner = pipeline("token-classification", model=mdl, tokenizer=tok, aggregation_strategy="simple")
ner("Contact: Alice Johnson , phone 555-123-4567.")
```

## Training

- Base: `distilbert-base-uncased`
- Dataset: `ai4privacy/pii-masking-300k`, English slice (~30k train / ~8k val)
- Recipe: 2 epochs, batch 16, lr 5e-5, fp16, seed 42 (mirror of JP v2)
- Hardware: single RTX 4090 on RunPod

Reproduce:

```bash
make -C packages/training dump-supervised-en
make -C packages/training train-supervised-en
make -C packages/training eval-300k-en
```

See [`docs/benchmark-pleno-anonymize-ja.md`](https://github.com/plenoai/pleno-anonymize/blob/main/docs/benchmark-pleno-anonymize-ja.md) for the JP methodology this mirrors.

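
For readers unfamiliar with document-level bootstrap CIs, the following is a minimal sketch of one common way to compute them. It is not taken from the project's evaluation code: the micro-averaged F1, the percentile interval, the helper names `micro_f1` and `bootstrap_f1_ci`, and the toy counts are all assumptions made for illustration. The idea is to resample whole documents with replacement, recompute F1 on each resample, and read the interval off the sorted scores.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from a list of per-document (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(doc_counts, n_iter=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for F1, resampling at the document level.

    Assumption: the interval is a plain percentile interval; the real
    benchmark may use a different estimator.
    """
    rng = random.Random(seed)
    n = len(doc_counts)
    # Each iteration draws n documents with replacement and rescores them.
    scores = sorted(
        micro_f1([doc_counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_iter)
    )
    lo = scores[int((alpha / 2) * n_iter)]
    hi = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return micro_f1(doc_counts), lo, hi


# Hypothetical per-document entity counts (tp, fp, fn), not real results.
doc_counts = [(3, 1, 0), (2, 0, 1), (0, 1, 2), (5, 0, 0), (1, 1, 1)]
point, lo, hi = bootstrap_f1_ci(doc_counts)
print(f"F1 = {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Resampling documents rather than individual entities keeps entities from the same document together, so the interval reflects variation between documents.
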
## License

MIT (matches the upstream `ai4privacy/pii-masking-300k` license).