---
language: en
license: mit
tags:
- token-classification
- ner
- pii
- privacy
- english
datasets:
- ai4privacy/pii-masking-300k
metrics:
- precision
- recall
- f1
base_model: distilbert/distilbert-base-uncased
library_name: transformers
pipeline_tag: token-classification
---

# pleno_anonymize_en

A lightweight English PII NER model trained on the English split of
[`ai4privacy/pii-masking-300k`](https://huggingface.co/datasets/ai4privacy/pii-masking-300k).
It is the English counterpart of
[`0xhikae/pleno_anonymize_ja`](https://huggingface.co/0xhikae/pleno_anonymize_ja).
The recipe deliberately mirrors the JP supervised v2 pipeline, but uses
`distilbert-base-uncased` (~66M params) as the backbone so the
artefact stays small enough for CPU inference.

## Acceptance tier

The explicit target is the smoke tier: F1 ≥ 0.50 on the EN validation split.
Numbers reported alongside the JP card are 1000-iteration document-level
bootstrap confidence intervals; this card will be updated with results once
the training run completes.
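
As a rough illustration of what a document-level bootstrap CI involves, the sketch below resamples whole documents with replacement and recomputes F1 on each resample. It is not the project's evaluation code: the micro-averaged F1, the percentile interval, the function names, and the per-document counts are all assumptions made for the example.

```python
import random

def micro_f1(docs):
    """Micro-averaged F1 over per-document (tp, fp, fn) counts."""
    tp = sum(d[0] for d in docs)
    fp = sum(d[1] for d in docs)
    fn = sum(d[2] for d in docs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(doc_counts, n_iter=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI, resampling documents (not tokens)."""
    rng = random.Random(seed)
    n = len(doc_counts)
    scores = sorted(
        micro_f1([doc_counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_iter)
    )
    lo = scores[int((alpha / 2) * n_iter)]
    hi = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return micro_f1(doc_counts), lo, hi

# Hypothetical per-document (TP, FP, FN) counts, for illustration only
counts = [(3, 1, 0), (2, 0, 2), (5, 1, 1), (0, 1, 1), (4, 0, 0)]
point, lo, hi = bootstrap_f1_ci(counts)
print(f"F1 = {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Resampling at the document level keeps entities from the same document together, which avoids treating correlated spans as independent observations.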

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub
tok = AutoTokenizer.from_pretrained("0xhikae/pleno_anonymize_en")
mdl = AutoModelForTokenClassification.from_pretrained("0xhikae/pleno_anonymize_en")

# "simple" aggregation merges word pieces into entity spans with character offsets
ner = pipeline("token-classification", model=mdl, tokenizer=tok, aggregation_strategy="simple")
print(ner("Contact: Alice Johnson <alice@example.com>, phone 555-123-4567."))
```
```
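
To turn detections into anonymized text, one option is to replace each predicted span with a placeholder. The sketch below assumes the pipeline output shape produced by `aggregation_strategy="simple"` (dicts with `entity_group`, `start`, and `end`). The helper `mask_entities` and the label names `NAME` and `EMAIL` are hypothetical; the model's actual label set may differ.

```python
def mask_entities(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Hand-written spans in the pipeline's output shape; label names are hypothetical
sample = "Contact: Alice Johnson <alice@example.com>"
spans = [
    {"entity_group": "NAME", "start": 9, "end": 22},
    {"entity_group": "EMAIL", "start": 24, "end": 41},
]
masked = mask_entities(sample, spans)
print(masked)
```

With a loaded pipeline, the same helper could be called as `mask_entities(text, ner(text))`.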

## Training

- Base: `distilbert-base-uncased`
- Dataset: `ai4privacy/pii-masking-300k`, English slice (~30k train / ~8k val)
- Recipe: 2 epochs, batch size 16, learning rate 5e-5, fp16 mixed precision, seed 42 (mirrors JP v2)
- Hardware: single RTX 4090 on RunPod
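
For readers reproducing the run outside the Makefile, the hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the project's training script: it assumes batch 16 is the per-device batch size on the single GPU with no gradient accumulation, and it uses the card's approximate "~30k" training-set size to estimate the step count.

```python
import math

# Keys follow transformers.TrainingArguments parameter names.
train_kwargs = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 16,  # assumed per-device, single GPU
    "learning_rate": 5e-5,
    "fp16": True,  # mixed precision; needs a CUDA GPU
    "seed": 42,
}

# "~30k" from the card; the exact example count is not stated there.
approx_train_examples = 30_000
steps_per_epoch = math.ceil(
    approx_train_examples / train_kwargs["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * train_kwargs["num_train_epochs"]
print(f"~{steps_per_epoch} steps per epoch, ~{total_steps} optimizer steps in total")

# Usage (output_dir is a placeholder path):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **train_kwargs)
```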

Reproduce:

```bash
make -C packages/training dump-supervised-en
make -C packages/training train-supervised-en
make -C packages/training eval-300k-en
```

See [`docs/benchmark-pleno-anonymize-ja.md`](https://github.com/plenoai/pleno-anonymize/blob/main/docs/benchmark-pleno-anonymize-ja.md) for the JP methodology this mirrors.

## License

MIT (matches the upstream `ai4privacy/pii-masking-300k` license).