How to use nationaldesignstudio/rampart with Transformers.js:
// npm i @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Allocate pipeline
const pipe = await pipeline('token-classification', 'nationaldesignstudio/rampart');
Update model card to reflect released weights
README.md
@@ -97,12 +97,24 @@ const { text } = await guard.protect("My name is Alex Rivera and my SSN is 472-8
 
 ## Training data
 
 | Source | Rows used | License | Role |
 | ------ | --------- | ------- | ---- |
-
-
 
-The held-out 100,000 rows are split into two non-overlapping subsets, seeded for full reproducibility:
 
 - **10,000 rows** for recall-floor threshold tuning.
 - **30,000 rows** for the headline test results below (per-language row counts in the eval table).
 
 ## Training data
 
+The shipped model is trained on a **cumulative three-source mix**, added in the
+order below; the selection matrix found that folding in all three sources
+produced every top-recall configuration, with breadth of corpus mattering more
+than the volume of any single source.
+
 | Source | Rows used | License | Role |
 | ------ | --------- | ------- | ---- |
+| Synthetic conversation corpus (in-house) | ~250,000 conversations | CC BY 4.0 | Primary in-house corpus. Chat-style messages generated to be **deliberately messy and realistic**: low-effort/typo-prone text, voice-dictated phrasing, values pasted out of forms, multilingual mixing, and contradictory/duplicated/wrong-field entries, across a range of assistant personas, so the model learns to catch fragmented PII in disordered chat rather than only clean prose. Covers all 17 entity types. |
+| OCR'd-document corpus (in-house) | in-house, span-tagged | CC BY 4.0 | Scanned and photographed forms, IDs, and documents run through OCR and then **span-tagged** with the 17 entity types. Adds OCR noise (character confusions, broken line wraps, stray glyphs) and form-style field layouts absent from the conversational sources, hardening recall on values lifted out of documents. |
+| [`ai4privacy/pii-masking-openpii-1.5m`](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1.5m) | full corpus: ~1.4M train + 100,000 held-out | CC BY 4.0 | Public AI4Privacy template corpus, used in **full**: its `train` and `validation` splits are pooled, deduplicated by `uid`, shuffled with a fixed seed, then split into all-but-100k for training (~1.4M rows) and a 100k held-out set. No training cap and no language filter are applied; this is the entire deduplicated corpus, not a subsampled slice. Broad multilingual entropy across 7 Latin-script languages (en, es, fr, de, it, pt, nl); the OpenPII schema is mapped to our 35-label BIO schema (17 entity types). Also the source of the held-out evaluation split below. |
+
+The synthetic and OCR'd corpora supply the disordered, document-noisy inputs
+OpenPII lacks (its conversations are clean and well-formed); OpenPII supplies the
+multilingual breadth and the held-out test set. The exact synthetic/OCR mix and
+volume were chosen by the end-to-end selection matrix, not assumed; see the
+project whitepaper (§3.1, §3.5, §4.2) for the ablation that fixed the recipe.
 
+The held-out 100,000 rows (from the AI4Privacy corpus) are split into two non-overlapping subsets, seeded for full reproducibility:
 
 - **10,000 rows** for recall-floor threshold tuning.
 - **30,000 rows** for the headline test results below (per-language row counts in the eval table).
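
The pooling and splitting procedure described for the AI4Privacy corpus (pool `train` and `validation`, deduplicate by `uid`, shuffle with a fixed seed, hold out a fixed number of rows, then take non-overlapping tuning and test subsets from the held-out rows) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the seed values, the `mulberry32` generator, and the Fisher-Yates shuffle are all hypothetical choices, since the model card only says the steps are seeded. It also assumes the two subsets are taken as consecutive slices of a second seeded shuffle, which the model card does not specify.

```javascript
// Small deterministic PRNG (mulberry32): the same seed always yields the
// same sequence. Hypothetical choice; the model card does not name a generator.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Keep the first row seen for each uid.
function dedupeByUid(rows) {
  const seen = new Set();
  const out = [];
  for (const row of rows) {
    if (!seen.has(row.uid)) {
      seen.add(row.uid);
      out.push(row);
    }
  }
  return out;
}

// Fisher-Yates shuffle on a copy, driven by the seeded PRNG.
function seededShuffle(rows, seed) {
  const rng = mulberry32(seed);
  const out = rows.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Pool both splits, dedupe, shuffle, then hold out the last `heldOutSize` rows.
function splitCorpus(trainRows, validationRows, seed, heldOutSize) {
  const pooled = dedupeByUid(trainRows.concat(validationRows));
  if (heldOutSize > pooled.length) {
    throw new RangeError('heldOutSize exceeds the deduplicated corpus size');
  }
  const shuffled = seededShuffle(pooled, seed);
  const cut = shuffled.length - heldOutSize;
  return { train: shuffled.slice(0, cut), heldOut: shuffled.slice(cut) };
}

// Take two non-overlapping subsets from the held-out rows (assumed approach).
function carveSubsets(heldOut, seed, tuningSize, testSize) {
  if (tuningSize + testSize > heldOut.length) {
    throw new RangeError('subsets do not fit in the held-out rows');
  }
  const shuffled = seededShuffle(heldOut, seed);
  return {
    tuning: shuffled.slice(0, tuningSize),
    test: shuffled.slice(tuningSize, tuningSize + testSize),
  };
}

// Toy usage: 6 train rows and 3 validation rows, two of which repeat a uid.
const toyTrain = ['a', 'b', 'c', 'd', 'e', 'f'].map((uid) => ({ uid }));
const toyValidation = ['e', 'f', 'g'].map((uid) => ({ uid }));
const demoSplit = splitCorpus(toyTrain, toyValidation, 1234, 4);
const demoSubsets = carveSubsets(demoSplit.heldOut, 5678, 1, 2);
console.log(demoSplit.train.length, demoSplit.heldOut.length); // 3 4
console.log(demoSubsets.tuning.length, demoSubsets.test.length); // 1 2
```

With the sizes quoted in the model card, the calls would look like `splitCorpus(train, validation, seed, 100000)` followed by `carveSubsets(heldOut, seed, 10000, 30000)`. Slicing both subsets from a single shuffled copy is what keeps them disjoint.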