NER Benchmark: PharmaDetect on Pill Packaging Text
Latest Tier 1 Pipeline Benchmark (2026-05-22)
Status: GCP Cloud Run benchmark execution is working end to end from GitHub Actions, with immutable outputs stored in the HF experiments bucket. The latest validated run includes benchmark-only diagnostics for per-component timing, RxNorm attempts, DDInter/OpenFDA pair attempts, NER diagnostics, and pipeline errors.
Latest validated runs
| Run | GitHub Actions | Artifact path | Limit | Result |
|---|---|---|---|---|
| tier1-smoke-20260522-metrics2 | 26314362770 | benchmark-results/2026-05-22/tier1-smoke-20260522-metrics2/ | 1 | Passed; real Git SHA and diagnostic fields verified. |
| tier1-25-20260522-metrics1 | 26314539958 | benchmark-results/2026-05-22/tier1-25-20260522-metrics1/ | 25 | Passed; errors.jsonl empty and 25 predictions written. |
Superseded smoke run: tier1-smoke-20260522-metrics1 passed artifact validation but exposed manifest.git_commit = 0000000 because the Cloud Run image did not contain .git. PR #73 fixed this by passing BENCHMARK_GIT_COMMIT=${GITHUB_SHA} into the Cloud Run Job and tightening artifact validation.
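The fix described above can be pictured as a two-step lookup followed by a validation check. The sketch below is illustrative only: `resolve_git_commit` and `is_real_commit` are hypothetical names, and the actual logic in PR #73 is not shown in this document. It assumes the job reads `BENCHMARK_GIT_COMMIT` from its environment and only falls back to a local `.git` checkout when the variable is absent.

```python
import os
import re
import subprocess

PLACEHOLDER_COMMIT = "0000000"  # value recorded by the superseded smoke run


def resolve_git_commit(env=None):
    """Prefer BENCHMARK_GIT_COMMIT, then fall back to the local .git checkout."""
    env = os.environ if env is None else env
    commit = env.get("BENCHMARK_GIT_COMMIT", "").strip()
    if commit:
        return commit
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # No git binary or no .git directory (e.g. inside a container image).
        return PLACEHOLDER_COMMIT


def is_real_commit(commit):
    """Accept 7-40 lowercase hex characters, but never an all-zero placeholder."""
    return bool(re.fullmatch(r"[0-9a-f]{7,40}", commit)) and set(commit) != {"0"}
```

An artifact validator built this way would reject a manifest whose commit fails `is_real_commit`, which is the failure mode the first smoke run exposed.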
Reproducibility metadata
| Field | Value |
|---|---|
| Dataset | SPerva/pillchecker-ner-benchmark |
| Dataset revision | d0b6566c202f37fe23a4222423c51186ed064776 |
| Code commit | 698a7ac548e2c8414525dfb0c58acd8c6755a0bd |
| Metric schema | benchmark-diagnostics-v1 |
| DDInter source | GitHub release configured by INTERACTION_DB_REPO / INTERACTION_DB_TAG |
| Result bucket | hf://buckets/SPerva/pillchecker-experiments/ |
Limit=25 headline metrics
| Metric | Value |
|---|---|
| Records completed | 25 / 25 |
| Errors | 0 |
| Throughput | 1.085 records/sec |
| Total latency p50 | 3146 ms |
| Total latency p95 | 11035 ms |
| Slowest component | RxNorm |
| Slowest-component counts | RxNorm 17, OpenFDA 7, NER 1 |
| NER strict F1 | 0.655 |
| NER lenient F1 | 0.709 |
| RxNorm coverage | 1.000 |
| RxNorm NIL rate | 0.187 |
| RxNorm attempts | 107 |
| Interaction pairs checked | 36 |
| DDInter hit rate | 0.111 |
| DDInter RxCUI hit rate | 0.083 |
| DDInter FTS rescue rate | 0.028 |
| OpenFDA hit rate | 0.000 |
| Unknown interaction rate | 0.889 |
| Seed smoke recall | 1.000 |
| Seed smoke false-alarm rate | 0.000 |
| FP taxonomy total | 10 |
Diagnostic interpretation
The new traces explain the high unknown interaction rate. In the limit=25 run, all 32 unknown interaction pairs had both RxCUIs resolved. Every unknown pair then missed DDInter RxCUI lookup, DDInter FTS lookup, and OpenFDA fallback, with miss_reason = no_source_hit.
So the unknowns are not primarily an RxNorm linking failure. They are mostly interaction-input quality and source-coverage issues:
- Brand/generic duplicates are being paired as if they were interacting drugs, such as `Bevacizumab + Avastin`, `Allegra + Fexofenadine`, `Spironolactone + Aldactone`, and `Amoxycillin + Augmentin`.
- Salt/form fragments can be extracted as drugs, especially `Maleate`, creating spurious pairs such as `Maleate + Phenylephrine`.
- Co-formulated ingredient pairs in cough/cold or combination products often have no explicit DDInter/OpenFDA pair evidence.
- Some remaining true ingredient pairs may be DDInter/OpenFDA coverage gaps, but that should be evaluated after cleanup.
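As a quick consistency check, the rates in the headline table line up with the 32 unknown pairs out of 36 cited in this section. The hit counts below (3 RxCUI hits, 1 FTS rescue) are not stated in the report; they are inferred from the published rates and should be read as an assumption.

```python
pairs_checked = 36
unknown_pairs = 32

# Inferred counts (assumption): chosen so that they reproduce the published rates.
ddinter_rxcui_hits = 3
ddinter_fts_rescues = 1
openfda_hits = 0

ddinter_hits = ddinter_rxcui_hits + ddinter_fts_rescues
rates = {
    "ddinter_hit_rate": round(ddinter_hits / pairs_checked, 3),
    "ddinter_rxcui_hit_rate": round(ddinter_rxcui_hits / pairs_checked, 3),
    "ddinter_fts_rescue_rate": round(ddinter_fts_rescues / pairs_checked, 3),
    "openfda_hit_rate": round(openfda_hits / pairs_checked, 3),
    "unknown_rate": round(unknown_pairs / pairs_checked, 3),
}
# Every pair is either a source hit or unknown.
assert ddinter_hits + openfda_hits + unknown_pairs == pairs_checked
```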
Recommendation
Do not scale to a larger benchmark yet. A higher limit will mostly amplify known noisy pair-generation behavior. Next implementation should clean interaction inputs before pair generation:
- Canonicalize brand/generic duplicates and collapse them before interaction checking.
- Filter salts/fragments/forms that are not active ingredients for DDI purposes.
- Generate pairs only from unique canonical active ingredients.
- Rerun `limit=25` and compare unknown rate, top unknown pairs, RxNorm timing, and seed-smoke recall.
- Run `limit=100` only after the cleaned `limit=25` run shows the diagnostic artifacts are stable and pair quality is improved.
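The cleanup steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `BRAND_TO_INGREDIENT`, `SALT_FRAGMENTS`, `canonical_ingredients`, and `interaction_pairs` are hypothetical names, and the hard-coded lookup tables stand in for whatever canonical source (for example RxNorm ingredient relationships) the pipeline ends up using.

```python
from itertools import combinations

# Hypothetical lookup tables, for illustration only.
BRAND_TO_INGREDIENT = {
    "avastin": "bevacizumab",
    "allegra": "fexofenadine",
    "aldactone": "spironolactone",
}
# Deliberately small: words like "calcium" can also be active ingredients,
# so a real filter would need context rather than a flat stop-list.
SALT_FRAGMENTS = {"maleate", "hydrochloride"}


def canonical_ingredients(names):
    """Map brands to ingredients, drop salt fragments, keep first-seen order."""
    seen = []
    for name in names:
        key = name.strip().lower()
        if not key or key in SALT_FRAGMENTS:
            continue
        key = BRAND_TO_INGREDIENT.get(key, key)
        if key not in seen:
            seen.append(key)
    return seen


def interaction_pairs(names):
    """Generate unordered pairs from unique canonical ingredients only."""
    return list(combinations(canonical_ingredients(names), 2))
```

With this shape, `Bevacizumab + Avastin` collapses to a single ingredient and yields no pair, and `Maleate + Phenylephrine` never reaches the interaction lookup.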
Clinical accuracy caveat: DDInter/OpenFDA outputs remain candidate/source evidence, not ground truth. Interaction precision, recall, and severity accuracy require reviewed expected_interactions and known_safe_pairs in the benchmark dataset.
Date: 2026-04-14
Model: OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M (108M params)
Dataset: 11,796 synthesized pack-label texts from HuggingFace MattBastar/Medicine_Details
Pipeline tested: OCR cleaner → NER entity extraction (no RxNorm validation)
Dataset
Source
MattBastar/Medicine_Details — 11,825 Indian brand medicines with structured Composition, Manufacturer, Uses, and Image URL fields. We use the Medicine Name + Composition + Manufacturer columns to synthesize realistic pack-label OCR text.
How It Works
prepare_hf_dataset.py takes each row and:

1. Synthesizes a pack label from a random template (blister pack, box label, prescription-style, syrup label, etc.):

   ```
   Augmentin 625 Duo Tablet
   Each tablet contains:
   Amoxycillin 500mg
   Clavulanic Acid 125mg
   Glaxo SmithKline Pharmaceuticals Ltd
   ```

2. Parses ground-truth labels from the Composition field:

   `"Amoxycillin (500mg) + Clavulanic Acid (125mg)"` → `["Amoxycillin", "Clavulanic Acid"]`

3. Optionally injects OCR noise (`--noise light|heavy`) using pharma-specific distortion patterns drawn from real OCR failures:

| Pattern | Example | Source |
|---|---|---|
| m→rn | Metformin → Metforrnin | Glyph confusion in serif fonts |
| I→l (word start) | Ibuprofen → lbuprofen | Uppercase I vs lowercase L |
| l→1 (interior) | Alprazolam → A1prazolam | l/1 confusion |
| o→0 / O→0 | Omeprazole → 0mepraz0le | Letter/digit confusion |
| cl→d | Clavulanic → Davulanic | Ligature misread |
| mg→rng | 500mg → 500rng | m→rn in dosage suffix |
| Mid-word splits | Bevacizumab → Bevacizu mab | Line-wrap OCR artifact |
| All-caps | ATORVASTATIN 40MG | Uppercase printed labels |
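The label-parsing and noise-injection steps can be approximated in a few lines. The sketch below is an assumption about how such a script might work, not the contents of `prepare_hf_dataset.py`: `parse_composition` and `inject_noise` are hypothetical names, only three of the listed patterns are covered (m→rn, o→0 / O→0, interior l→1), and the per-character `rate` parameter is invented for illustration.

```python
import random
import re


def parse_composition(composition):
    """Split on '+', then drop the parenthesised dose from each part."""
    parts = (re.sub(r"\s*\([^)]*\)", "", p).strip() for p in composition.split("+"))
    return [p for p in parts if p]


def inject_noise(text, rate, rng):
    """Apply m→rn, o/O→0 and interior l→1, each with probability `rate` per character."""
    out = []
    for i, ch in enumerate(text):
        # "Interior" means the character has a letter on both sides.
        interior = 0 < i < len(text) - 1 and text[i - 1].isalpha() and text[i + 1].isalpha()
        if rng.random() < rate:
            if ch == "m":
                ch = "rn"
            elif ch in "oO":
                ch = "0"
            elif ch == "l" and interior:
                ch = "1"
        out.append(ch)
    return "".join(out)


labels = parse_composition("Amoxycillin (500mg) + Clavulanic Acid (125mg)")
noisy = inject_noise("Metformin 500mg", rate=1.0, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the synthetic noise reproducible across runs, which matters when comparing noise levels on the same sample.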
Category Breakdown
| Category | N | Description |
|---|---|---|
| single_ingredient | 7,081 | One active ingredient (e.g., Azithromycin) |
| dual_ingredient | 3,591 | Two active ingredients (e.g., Amoxycillin + Clavulanic Acid) |
| multi_ingredient | 1,124 | Three or more active ingredients |
Reproducing
```bash
uv run python eval/prepare_hf_dataset.py                 # clean text (default)
uv run python eval/prepare_hf_dataset.py --noise light   # light OCR artifacts
uv run python eval/prepare_hf_dataset.py --noise heavy   # heavy OCR distortion
uv run python eval/prepare_hf_dataset.py --limit 500     # smaller sample
```
About the Model
Architecture
PharmaDetect-BioPatient-108M is a token-classification (NER) model from the
OpenMed NER suite (Panahi, 2025). It
detects chemical entities using BIO tagging (B-CHEM, I-CHEM).
| Property | Value |
|---|---|
| Base model | Bio_Discharge_Summary_BERT (BioBERT v1.0 → MIMIC-III discharge summaries, ~880M words) |
| DAPT corpus | 350k passages (90M tokens): 100k PubMed, 100k arXiv, 100k MIMIC-III, 50k ClinicalTrials.gov |
| DAPT method | LoRA (rank=16, α=32, dropout=0.05) on query/value matrices, 3 epochs, single A100 ~4h |
| Fine-tuning dataset | BC5CDR-CHEM (BioCreative V Chemical-Disease Relation) |
| Parameters | 108M (1.4% trainable via LoRA during DAPT) |
| Entity types | Chemical entities only (B-CHEM / I-CHEM) |
| Published F1 | 95.83% on BC5CDR-CHEM test set |
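To make the BIO scheme concrete, the following sketch groups per-token `B-CHEM` / `I-CHEM` tags into entity strings. It is a generic decoder written for illustration (`bio_to_spans` is a hypothetical name) and assumes whole-word tokens joined by spaces; it is not the model's or the project's own decoding code.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-CHEM / I-CHEM into entity strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CHEM":
            # A new entity starts; close any span that was still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-CHEM" and current:
            current.append(token)
        else:
            # "O", or a stray I-CHEM with no open span (ignored here).
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["Amoxycillin", "500mg", "Clavulanic", "Acid", "125mg"]
tags = ["B-CHEM", "O", "B-CHEM", "I-CHEM", "O"]
entities = bio_to_spans(tokens, tags)  # → ["Amoxycillin", "Clavulanic Acid"]
```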
Domain Gap: Literature vs. Packaging
| Aspect | BC5CDR (training) | Pill packaging (our use) |
|---|---|---|
| Text style | Scientific prose | Formulaic labels |
| Length | Full abstracts (~200 words) | Short labels (~5-20 words) |
| Chemical mentions | In sentence context | Standalone, prominent |
| Brand names | Rare | Very common |
| Salt forms | Part of scientific name | Separated on packaging |
| OCR artifacts | None | Common (m→rn, o→0, etc.) |
Our Pipeline
```
Raw OCR text
  → ocr_cleaner.py (fix rn→m, 0→o, ligatures, whitespace)
  → ner_model.py (PharmaDetect with manual sub-word merging)
  → drug_analyzer.py (filter CHEM labels, dedupe, skip single-char/punctuation)
  → rxnorm_client.py (validate against RxNorm, get rxcui)
  → [if NER finds nothing] rxnorm fallback (approximate term search)
```
The benchmark tests the first three steps (OCR cleaner → NER → basic filtering) without RxNorm validation, to isolate the NER model's behavior on packaging text.
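Two of the steps named in the pipeline, sub-word merging and basic entity filtering, are simple enough to sketch. The code below is an assumption about their general shape, not the contents of `ner_model.py` or `drug_analyzer.py`: the function names and the `{"text", "label"}` entity format are hypothetical. The `##` continuation prefix is the standard WordPiece convention used by BERT-family tokenizers.

```python
import string


def merge_subwords(pieces):
    """Join WordPiece continuations ('##...') onto the preceding piece."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


def filter_entities(entities):
    """Keep CHEM entities, skip single-char or punctuation-only text, dedupe case-insensitively."""
    kept, seen = [], set()
    for ent in entities:
        text = ent["text"].strip()
        if not ent["label"].endswith("CHEM"):
            continue
        if len(text) <= 1 or all(c in string.punctuation for c in text):
            continue
        key = text.lower()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```

Note that nothing in this filtering step can tell a brand name from an ingredient, which is consistent with the precision figures reported below for the bare NER path.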
Results
By Noise Level (Bare NER, 500-case samples)
| Noise | Precision | Recall | F1 | Detection |
|---|---|---|---|---|
| none (clean) | 46.9% | 84.4% | 60.3% | 99.6% |
| light (5-15% char errors) | 44.9% | 79.8% | 57.5% | 99.8% |
| heavy (40% errors + splits) | 26.2% | 53.5% | 35.2% | 99.8% |
Detection rate = percentage of cases where NER found at least one entity.
Full Pipeline vs Bare NER (50-case samples)
| Pipeline Step | Noise | Precision | Recall | F1 | Latency / Case |
|---|---|---|---|---|---|
| Bare NER | none | 48.6% | 81.0% | 60.7% | 64ms |
| Full Pipeline | none | 71.6% | 81.0% | 76.0% | 961ms |
| Bare NER | light | 47.5% | 79.8% | 59.6% | 69ms |
| Full Pipeline | light | 74.4% | 79.8% | 77.0% | 1089ms |
| Bare NER | heavy | 24.7% | 47.6% | 32.5% | 78ms |
| Full Pipeline | heavy | 65.6% | 47.6% | 55.2% | 2597ms |
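On clean text, the RxNorm validation step rejected 37 false positives, raising precision by 23 points (48.6% → 71.6%) without reducing recall. The gain is larger under noise (about 27 points for light, about 41 points for heavy). However, relying on an external HTTP API adds roughly 900 ms of latency per case on clean text (64 ms → 961 ms) and about 2.5 s under heavy noise (78 ms → 2597 ms).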
By Category (clean text, 500 cases)
| Category | N | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| single_ingredient | 279 | 41% | 86% | 55% | 239 | 347 | 40 |
| dual_ingredient | 155 | 49% | 85% | 62% | 261 | 270 | 47 |
| multi_ingredient | 66 | 54% | 82% | 65% | 170 | 143 | 37 |
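The per-category counts can be pooled to recover the clean-text headline numbers. The short calculation below uses the TP/FP/FN values from the table above and the standard micro-averaged definitions; `prf` is just a helper name chosen for this example.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# (TP, FP, FN) per category, copied from the table above.
counts = {
    "single_ingredient": (239, 347, 40),
    "dual_ingredient": (261, 270, 47),
    "multi_ingredient": (170, 143, 37),
}

# Micro-average: sum the counts first, then compute the metrics once.
tp, fp, fn = (sum(col) for col in zip(*counts.values()))
micro = [round(100 * x, 1) for x in prf(tp, fp, fn)]  # → [46.9, 84.4, 60.3]
```

The pooled result matches the clean row of the noise-level table (46.9% precision, 84.4% recall, 60.3% F1), so the two tables describe the same 500-case run.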
GLiNER Pipeline Augmentation (500-case samples, clean text)
To improve precision (rejecting brands, salts, and manufacturers) and recall (catching edge cases), we evaluated urchade/gliner_medium-v2.1 as a secondary observer and adjudicator alongside the baseline PharmaDetect + RxNorm pipeline.
Evaluated Architectures
- `baseline` (Current): PharmaDetect NER → RxNorm Validation. If no entities are found, falls back to raw text blocks via RxNorm's `approximateTerm`.
- `gliner_filter` (Precision): GLiNER runs alongside PharmaDetect. If GLiNER explicitly tags a PharmaDetect entity as a negative label (e.g., `brand or trade name`, `manufacturer`, `salt or counter-ion`), the entity is rejected before hitting RxNorm.
- `gliner_sequential` (Speed & Precision): PharmaDetect runs first. Only the short entity spans extracted by PharmaDetect are passed to GLiNER for classification. If GLiNER tags the snippet as an `active pharmaceutical ingredient`, it proceeds to RxNorm. This saves massive CPU overhead compared to running GLiNER on the whole document.
- `gliner_fallback` (Recall): If PharmaDetect returns zero entities, the pipeline queries GLiNER for `active pharmaceutical ingredient` spans instead of running the raw text `approximateTerm` fallback.
- `gliner_union` (Recall & Precision): Both PharmaDetect and GLiNER run. All GLiNER `active pharmaceutical ingredient` spans are merged with the PharmaDetect results, deduplicated by their resolved RxNorm IDs.
- `gliner_adjudicated` (Complex Logic): An advanced version of the filter. It rejects negative GLiNER labels, but is "salt-aware": it won't reject a standalone salt form if GLiNER detects an active ingredient immediately adjacent to it in the text.
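The filter and adjudicated modes differ only in how they treat salts, which the sketch below tries to make explicit. It is a simplified illustration under stated assumptions: `adjudicate` is a hypothetical function, GLiNER's output is represented as a plain dict from entity text to label rather than through the GLiNER API, and adjacency detection is assumed to have happened upstream.

```python
NEGATIVE_LABELS = {"brand or trade name", "manufacturer", "salt or counter-ion"}
API_LABEL = "active pharmaceutical ingredient"


def adjudicate(candidates, gliner_labels, salt_adjacent_to_api=frozenset()):
    """Reject candidates GLiNER tags negatively, except salts next to an ingredient.

    candidates: entity strings from PharmaDetect.
    gliner_labels: dict mapping entity string -> GLiNER label (absent = no opinion).
    salt_adjacent_to_api: salts found immediately next to an ingredient span.
    """
    kept = []
    for ent in candidates:
        label = gliner_labels.get(ent)
        if label in NEGATIVE_LABELS:
            salt_exempt = label == "salt or counter-ion" and ent in salt_adjacent_to_api
            if not salt_exempt:
                continue  # rejected before it reaches RxNorm
        kept.append(ent)
    return kept
```

Calling it with an empty `salt_adjacent_to_api` reproduces plain `gliner_filter` behaviour; passing the adjacent salts gives the salt-aware variant.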
| Experiment Mode | Pipeline Precision | Pipeline Recall | Pipeline F1 | Avg Latency (ms)* |
|---|---|---|---|---|
| baseline (Current) | 76.2% | 84.1% | 80.0% | 984.5 |
| gliner_filter | 77.8% | 84.0% | 80.8% | 1221.9 |
| gliner_fallback | 76.5% | 84.5% | 80.3% | 1208.5 |
| gliner_sequential | 84.9% | 58.6% | 69.3% | 990.9 |
| gliner_adjudicated | 78.2% | 84.4% | 81.2% | 1218.8 |
| gliner_union | 78.0% | 93.6% | 85.1% | 1266.3 |
*Note: Latency is inflated by concurrent background testing hitting CPU limits, but relative differences show the overhead of GLiNER. The sequential mode has very low latency because it only runs GLiNER on short snippets.
Findings:
- Recall Improvement (`gliner_union`): The union mode raised recall from 84.1% to 93.6% (+9.5pp) while also raising precision from 76.2% to 78.0%. GLiNER identifies active ingredients that PharmaDetect misses entirely, and the added spans did not lower precision.
- The Context Trap (`gliner_sequential`): The sequential mode achieved the highest precision (84.9%) and the lowest latency of the GLiNER modes (990.9 ms, close to the 984.5 ms baseline). However, its recall fell to 58.6%. Because GLiNER was only fed the short text snippets extracted by PharmaDetect, it lost all surrounding context. GLiNER relies heavily on context to classify entities; without it, it failed to recognize many valid active ingredients, causing them to be wrongly filtered out.
- Precision Improvement (`gliner_filter` and `gliner_adjudicated`): Both modes stripped out false positives (brands, salts). The salt-aware adjudicator raised precision by 2.0pp (76.2% → 78.2%) and the plain filter by 1.6pp (76.2% → 77.8%), with recall staying within 0.3pp of baseline.
- Conclusion: `gliner_union` produced the best overall F1 score (85.1%) because running GLiNER on the full text preserves its contextual reasoning.
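The union mode's merge step amounts to deduplicating two entity lists by their resolved RxNorm ID. A minimal sketch, assuming a caller-supplied `resolve_rxcui` function that returns an ID string or `None`; the function name, the resolver interface, and the placeholder IDs in the usage lines are all hypothetical and not taken from the project code or from RxNorm.

```python
def union_by_rxcui(pharmadetect, gliner, resolve_rxcui):
    """Merge two entity lists, keeping one entry per resolved RxNorm ID."""
    merged = {}
    for name in [*pharmadetect, *gliner]:
        rxcui = resolve_rxcui(name)
        # Unresolvable names are dropped; the first name seen per ID wins.
        if rxcui is not None and rxcui not in merged:
            merged[rxcui] = name
    return merged


# Stub resolver with made-up IDs, standing in for a real RxNorm lookup.
FAKE_IDS = {"amoxycillin": "ID-1", "clavulanic acid": "ID-2"}
resolve = lambda name: FAKE_IDS.get(name.lower())

result = union_by_rxcui(
    ["Amoxycillin", "Cipla"],
    ["amoxycillin", "Clavulanic Acid"],
    resolve,
)
```

Deduplicating on the resolved ID rather than on surface text means spelling and casing variants from the two models collapse into one entry.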
Interpretation
Recall is strong (84%). The model finds the active ingredient in most cases. Multi-ingredient labels are slightly harder (82%) due to more complex text. Under light OCR noise, recall drops only to 80% — the model is reasonably robust to minor distortion.
Precision is poor (47%). The model tags brand names, manufacturer names, salt forms, and dosage form words as chemical entities. This is correct per BC5CDR training ("find all chemicals") but wrong for our use case ("find active pharmaceutical ingredients only").
Heavy noise cuts recall from 84% to 54%. Mid-word splits (`Bevacizu mab`) and character corruption (`Metforrnin`) break tokenization. This is the target for Phase 4 (OCR Modernization with dictionary-backed fuzzy matching).
False Positive Analysis
All false positives fall into predictable categories:
1. Brand Names
The model tags brand names as chemical entities: Augmentin, Avastin, Allegra, Lipitor, etc. These are the medicine's trade names, not active ingredients.
2. Salt Forms
The model tags salt/counter-ion names separately: Sodium, Hydrochloride, Calcium, Phosphate, Maleate, Potassium. These appear in compositions like "Atorvastatin Calcium" but are not the active drug.
3. Manufacturer Names
Pharmaceutical company names occasionally get tagged, such as "Cipla" and "Lupin", which look chemical-like to the model.
4. Dosage Form Words
Words like "Tablet", "Capsule", "Syrup", "Injection" sometimes get tagged, especially in compact label layouts.
Noise Degradation Analysis
| Metric | None → Light | Light → Heavy |
|---|---|---|
| Recall | 84.4% → 79.8% (−4.6pp) | 79.8% → 53.5% (−26.3pp) |
| Precision | 46.9% → 44.9% (−2.0pp) | 44.9% → 26.2% (−18.7pp) |
| F1 | 60.3% → 57.5% (−2.8pp) | 57.5% → 35.2% (−22.3pp) |
Light noise has minimal impact: the OCR cleaner handles common o→0 and l→1 substitutions. Heavy noise causes severe degradation because:
- Mid-word splits break tokenization (`Amoxy cillin` becomes two tokens)
- m→rn corruption in drug names (`Metforrnin`) escapes the cleaner's limited regex patterns
- All-caps text shifts token distributions away from training data
Comparison with Published Benchmarks
| Benchmark | Precision | Recall | F1 |
|---|---|---|---|
| BC5CDR-CHEM (published) | 95.1% | 96.6% | 95.8% |
| Our packaging — clean | 46.9% | 84.4% | 60.3% |
| Our packaging — light noise | 44.9% | 79.8% | 57.5% |
| Our packaging — heavy noise | 26.2% | 53.5% | 35.2% |
The precision gap (95% → 47%) reflects a task mismatch, not model quality. BC5CDR measures "find all chemicals" — we measure "find only active pharmaceutical ingredients." The recall gap (97% → 84%) reflects the domain shift from clean scientific text to formulaic packaging labels.
Remediation Plan
| Phase | Target | Expected Impact |
|---|---|---|
| Phase 2: Entity Linking | Filter NER output through DrugBank to reject non-drug entities | Precision 47% → ~85%+ |
| Phase 3: Fallback | DrugBank fuzzy search when NER finds 0 entities | Recall +5-10% on edge cases |
| Phase 4: OCR Modernization | Dictionary-backed fuzzy correction before NER | Heavy-noise recall 54% → ~75%+ |
See docs/plans/phase{2,3,4}_*.md for detailed designs.
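As a rough picture of what dictionary-backed fuzzy correction (Phase 4) could look like, the sketch below uses the standard library's `difflib.get_close_matches`. It is not the planned design: `fuzzy_correct`, the four-word vocabulary, and the 0.8 cutoff are illustrative assumptions, and a real implementation would need a full drug dictionary and would have to handle mid-word splits separately.

```python
import difflib

# Tiny illustrative vocabulary; a real one would come from a drug dictionary.
VOCAB = ["metformin", "bevacizumab", "amoxycillin", "alprazolam"]


def fuzzy_correct(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary entry, or the token unchanged if none is close."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


fixed = [fuzzy_correct(t) for t in ["Metforrnin", "A1prazolam", "Tablet"]]
```

The cutoff trades recall for safety: a lower value repairs more corrupted tokens but raises the risk of mapping an unrelated word onto a drug name.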
Reproducing
```bash
# Generate dataset (once)
uv run python eval/prepare_hf_dataset.py                 # clean
uv run python eval/prepare_hf_dataset.py --noise light   # light OCR noise
uv run python eval/prepare_hf_dataset.py --noise heavy   # heavy OCR noise

# Run benchmark
uv run python eval/benchmark.py                 # full dataset
uv run python eval/benchmark.py --limit 500     # quick run
uv run python eval/benchmark.py --with-rxnorm   # full pipeline (needs network)
uv run python eval/benchmark.py -v              # per-case details
```
Results are written to eval/results.json.
References
- OpenMed NER paper — Panahi, 2025
- BC5CDR corpus — Li et al., 2016
- Bio_Discharge_Summary_BERT — Alsentzer et al.
- PharmaDetect-BioPatient-108M — OpenMed
- MattBastar/Medicine_Details — HuggingFace dataset