NER Benchmark: PharmaDetect on Pill Packaging Text

Latest Tier 1 Pipeline Benchmark (2026-05-22)

Status: GCP Cloud Run benchmark execution is working end to end from GitHub Actions, with immutable outputs stored in the HF experiments bucket. The latest validated run includes benchmark-only diagnostics for per-component timing, RxNorm attempts, DDInter/OpenFDA pair attempts, NER diagnostics, and pipeline errors.

Latest validated runs

Run	GitHub Actions	Artifact path	Limit	Result
`tier1-smoke-20260522-metrics2`	`26314362770`	`benchmark-results/2026-05-22/tier1-smoke-20260522-metrics2/`	1	Passed; real Git SHA and diagnostic fields verified.
`tier1-25-20260522-metrics1`	`26314539958`	`benchmark-results/2026-05-22/tier1-25-20260522-metrics1/`	25	Passed; `errors.jsonl` empty and 25 predictions written.

Superseded smoke run: tier1-smoke-20260522-metrics1 passed artifact validation but exposed manifest.git_commit = 0000000 because the Cloud Run image did not contain .git. PR #73 fixed this by passing BENCHMARK_GIT_COMMIT=${GITHUB_SHA} into the Cloud Run Job and tightening artifact validation.
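The commit-resolution fix described above can be sketched as follows. This is a minimal illustration, not the code from PR #73: only the `BENCHMARK_GIT_COMMIT` variable and the `0000000` placeholder come from the text, while the function names and the exact validation rule are assumptions.

```python
import os
import subprocess

PLACEHOLDER_COMMIT = "0000000"  # value seen in the superseded smoke run


def resolve_git_commit(env=None):
    """Prefer the injected env var, then a local .git, then the placeholder."""
    env = os.environ if env is None else env
    sha = env.get("BENCHMARK_GIT_COMMIT", "").strip()
    if sha:
        return sha
    try:
        # Fails inside an image that was built without the .git directory.
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return PLACEHOLDER_COMMIT


def is_valid_commit(sha):
    """Hypothetical artifact check: a full 40-char hex SHA that is not all zeros."""
    sha = sha.lower()
    is_hex = len(sha) == 40 and all(c in "0123456789abcdef" for c in sha)
    return is_hex and set(sha) != {"0"}
```

With a check of this kind, a manifest carrying the placeholder would fail validation instead of passing silently.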

Reproducibility metadata

Field	Value
Dataset	`SPerva/pillchecker-ner-benchmark`
Dataset revision	`d0b6566c202f37fe23a4222423c51186ed064776`
Code commit	`698a7ac548e2c8414525dfb0c58acd8c6755a0bd`
Metric schema	`benchmark-diagnostics-v1`
DDInter source	GitHub release configured by `INTERACTION_DB_REPO` / `INTERACTION_DB_TAG`
Result bucket	`hf://buckets/SPerva/pillchecker-experiments/`

Limit=25 headline metrics

Metric	Value
Records completed	25 / 25
Errors	0
Throughput	1.085 records/sec
Total latency p50	3146 ms
Total latency p95	11035 ms
Slowest component	RxNorm
Slowest-component counts	RxNorm 17, OpenFDA 7, NER 1
NER strict F1	0.655
NER lenient F1	0.709
RxNorm coverage	1.000
RxNorm NIL rate	0.187
RxNorm attempts	107
Interaction pairs checked	36
DDInter hit rate	0.111
DDInter RxCUI hit rate	0.083
DDInter FTS rescue rate	0.028
OpenFDA hit rate	0.000
Unknown interaction rate	0.889
Seed smoke recall	1.000
Seed smoke false-alarm rate	0.000
FP taxonomy total	10
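The interaction rates in the table are consistent with simple counts over the 36 checked pairs. The sketch below recomputes them; the 32 unknown pairs are stated in the diagnostic interpretation, while the hit counts (3 RxCUI hits, 1 FTS rescue) are inferred from the published rates rather than read from the artifacts.

```python
pairs_checked = 36
rxcui_hits = 3       # inferred: 0.083 * 36 is about 3
fts_rescues = 1      # inferred: 0.028 * 36 is about 1
openfda_hits = 0
unknown_pairs = 32   # stated in the diagnostic interpretation


def rate(count, total):
    """Rate rounded to three decimals, matching the table's precision."""
    return round(count / total, 3)


ddinter_hits = rxcui_hits + fts_rescues
rates = {
    "ddinter_hit_rate": rate(ddinter_hits, pairs_checked),
    "ddinter_rxcui_hit_rate": rate(rxcui_hits, pairs_checked),
    "ddinter_fts_rescue_rate": rate(fts_rescues, pairs_checked),
    "openfda_hit_rate": rate(openfda_hits, pairs_checked),
    "unknown_interaction_rate": rate(unknown_pairs, pairs_checked),
}
```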

Diagnostic interpretation

The new traces explain the high unknown interaction rate. In the limit=25 run, all 32 unknown interaction pairs had both RxCUIs resolved. Every unknown pair then missed DDInter RxCUI lookup, DDInter FTS lookup, and OpenFDA fallback, with miss_reason = no_source_hit.

So the unknowns are not primarily an RxNorm linking failure. They are mostly interaction-input quality and source-coverage issues:

Brand/generic duplicates are being paired as if they were interacting drugs, such as Bevacizumab + Avastin, Allegra + Fexofenadine, Spironolactone + Aldactone, and Amoxycillin + Augmentin.
Salt/form fragments can be extracted as drugs, especially Maleate, creating spurious pairs such as Maleate + Phenylephrine.
Co-formulated ingredient pairs in cough/cold or combination products often have no explicit DDInter/OpenFDA pair evidence.
Some remaining true ingredient pairs may be DDInter/OpenFDA coverage gaps, but that should be evaluated after cleanup.

Recommendation

Do not scale to a larger benchmark yet. A higher limit will mostly amplify known noisy pair-generation behavior. The next implementation step should clean interaction inputs before pair generation:

Canonicalize brand/generic duplicates and collapse them before interaction checking.
Filter salts/fragments/forms that are not active ingredients for DDI purposes.
Generate pairs only from unique canonical active ingredients.
Rerun limit=25 and compare unknown rate, top unknown pairs, RxNorm timing, and seed-smoke recall.
Run limit=100 only after the cleaned limit=25 run shows the diagnostic artifacts are stable and pair quality is improved.
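The first three cleanup steps above could look roughly like the sketch below. The lookup tables are hypothetical stand-ins built from the examples in this report; a real implementation would more likely canonicalize through ingredient-level RxCUIs, and the function names are not taken from the codebase.

```python
from itertools import combinations

# Hypothetical brand -> active ingredient(s) table, from the examples above.
BRAND_TO_INGREDIENTS = {
    "avastin": ("bevacizumab",),
    "allegra": ("fexofenadine",),
    "aldactone": ("spironolactone",),
    "augmentin": ("amoxycillin", "clavulanic acid"),
}

# Hypothetical salt/form fragments that are not active ingredients for DDI.
NON_ACTIVE_FRAGMENTS = {"maleate", "hydrochloride", "phosphate"}


def canonical_ingredients(entities):
    """Drop fragments, expand brands, and keep unique ingredients in order."""
    unique = []
    for name in entities:
        key = name.strip().lower()
        if key in NON_ACTIVE_FRAGMENTS:
            continue
        for ingredient in BRAND_TO_INGREDIENTS.get(key, (key,)):
            if ingredient not in unique:
                unique.append(ingredient)
    return unique


def interaction_pairs(entities):
    """Generate pairs only from unique canonical active ingredients."""
    return list(combinations(canonical_ingredients(entities), 2))
```

Under these assumptions, `Bevacizumab + Avastin` and `Maleate + Phenylephrine` both collapse to a single ingredient and produce no pair to check.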

Clinical accuracy caveat: DDInter/OpenFDA outputs remain candidate/source evidence, not ground truth. Interaction precision, recall, and severity accuracy require reviewed expected_interactions and known_safe_pairs in the benchmark dataset.

Date: 2026-04-14
Model: OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M (108M params)
Dataset: 11,796 synthesized pack-label texts from HuggingFace MattBastar/Medicine_Details
Pipeline tested: OCR cleaner → NER entity extraction (no RxNorm validation)

Dataset

Source

MattBastar/Medicine_Details — 11,825 Indian brand medicines with structured Composition, Manufacturer, Uses, and Image URL fields. We use the Medicine Name + Composition + Manufacturer columns to synthesize realistic pack-label OCR text.

How It Works

prepare_hf_dataset.py takes each row and:

Synthesizes a pack label from a random template (blister pack, box label, prescription-style, syrup label, etc.):

Augmentin 625 Duo Tablet
Each tablet contains:
Amoxycillin 500mg
Clavulanic Acid 125mg
Glaxo SmithKline Pharmaceuticals Ltd

Parses ground-truth labels from the Composition field: "Amoxycillin (500mg) + Clavulanic Acid (125mg)" → ["Amoxycillin", "Clavulanic Acid"]

Optionally injects OCR noise (--noise light|heavy) using pharma-specific distortion patterns drawn from real OCR failures:

Pattern	Example	Source
m→rn	Metformin → Metforrnin	Glyph confusion in serif fonts
I→l (word start)	Ibuprofen → lbuprofen	Uppercase I vs lowercase L
l→1 (interior)	Alprazolam → A1prazolam	l/1 confusion
o→0 / O→0	Omeprazole → 0mepraz0le	Letter/digit confusion
cl→d	Clavulanic → Davulanic	Ligature misread
mg→rng	500mg → 500rng	m→rn in dosage suffix
Mid-word splits	Bevacizumab → Bevacizu mab	Line-wrap OCR artifact
All-caps	ATORVASTATIN 40MG	Uppercase printed labels
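A few of the distortion patterns in the table can be sketched as probabilistic regex substitutions. This is an illustrative reconstruction, not the code in `prepare_hf_dataset.py`: the rule names, the `rate` parameter, and the seeding scheme are assumptions.

```python
import random
import re

# Subset of the table's patterns; names are hypothetical.
NOISE_RULES = {
    "m_to_rn": (re.compile(r"m"), "rn"),
    "I_to_l": (re.compile(r"\bI"), "l"),            # uppercase I at word start
    "l_to_1": (re.compile(r"(?<=\w)l(?=\w)"), "1"),  # interior lowercase l
    "o_to_0": (re.compile(r"[oO]"), "0"),
}


def inject_noise(text, rate, rule_names=None, seed=0):
    """Apply each selected rule to each match with probability `rate`."""
    rng = random.Random(seed)
    names = list(NOISE_RULES) if rule_names is None else rule_names
    for name in names:
        pattern, replacement = NOISE_RULES[name]

        def maybe_replace(match, replacement=replacement):
            # Keep the original text when the random draw says "no error here".
            return replacement if rng.random() < rate else match.group(0)

        text = pattern.sub(maybe_replace, text)
    return text
```

Calling `inject_noise("Metformin", 1.0, ["m_to_rn"])` reproduces the `Metforrnin` example from the table; a lower `rate` leaves most characters intact.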

Category Breakdown

Category	N	Description
single_ingredient	7,081	One active ingredient (e.g., Azithromycin)
dual_ingredient	3,591	Two active ingredients (e.g., Amoxycillin + Clavulanic Acid)
multi_ingredient	1,124	Three or more active ingredients
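The ground-truth parsing and the category assignment can be sketched together. The sketch assumes what the Composition example shows: ingredients separated by `+`, with doses in parentheses. The function names are illustrative and the real script may handle more edge cases.

```python
import re

_PAREN = re.compile(r"\([^)]*\)")  # dose annotations such as "(500mg)"


def parse_composition(composition):
    """Split a Composition string on '+' and strip parenthesised doses."""
    names = []
    for part in composition.split("+"):
        name = " ".join(_PAREN.sub(" ", part).split())
        if name:
            names.append(name)
    return names


def categorize(ingredients):
    """Map the number of active ingredients to a benchmark category."""
    count = len(ingredients)
    if count == 1:
        return "single_ingredient"
    if count == 2:
        return "dual_ingredient"
    if count >= 3:
        return "multi_ingredient"
    raise ValueError("no ingredients parsed")
```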

Reproducing

uv run python eval/prepare_hf_dataset.py                  # clean text (default)
uv run python eval/prepare_hf_dataset.py --noise light     # light OCR artifacts
uv run python eval/prepare_hf_dataset.py --noise heavy     # heavy OCR distortion
uv run python eval/prepare_hf_dataset.py --limit 500       # smaller sample

About the Model

Architecture

PharmaDetect-BioPatient-108M is a token-classification (NER) model from the OpenMed NER suite (Panahi, 2025). It detects chemical entities using BIO tagging (B-CHEM, I-CHEM).

Property	Value
Base model	`Bio_Discharge_Summary_BERT` (BioBERT v1.0 → MIMIC-III discharge summaries, ~880M words)
DAPT corpus	350k passages (90M tokens): 100k PubMed, 100k arXiv, 100k MIMIC-III, 50k ClinicalTrials.gov
DAPT method	LoRA (rank=16, α=32, dropout=0.05) on query/value matrices, 3 epochs, single A100 ~4h
Fine-tuning dataset	BC5CDR-CHEM (BioCreative V Chemical-Disease Relation)
Parameters	108M (1.4% trainable via LoRA during DAPT)
Entity types	Chemical entities only (`B-CHEM` / `I-CHEM`)
Published F1	95.83% on BC5CDR-CHEM test set
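To make the BIO scheme concrete, the sketch below turns a tagged token sequence into entity spans. It is a generic decoder, not the project's `ner_model.py`; dropping an `I-CHEM` tag that has no preceding `B-CHEM` is a design choice made here for simplicity.

```python
def bio_to_spans(tokens, tags):
    """Group B-CHEM / I-CHEM tagged tokens into chemical entity strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CHEM":
            # A new entity starts; close any entity still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-CHEM" and current:
            current.append(token)
        else:
            # "O" or an orphan I-CHEM ends the current entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

For the label text `Amoxycillin 500mg Clavulanic Acid 125mg` tagged `B-CHEM O B-CHEM I-CHEM O`, this yields the two entities `Amoxycillin` and `Clavulanic Acid`.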

Domain Gap: Literature vs. Packaging

Aspect	BC5CDR (training)	Pill packaging (our use)
Text style	Scientific prose	Formulaic labels
Length	Full abstracts (~200 words)	Short labels (~5-20 words)
Chemical mentions	In sentence context	Standalone, prominent
Brand names	Rare	Very common
Salt forms	Part of scientific name	Separated on packaging
OCR artifacts	None	Common (m→rn, o→0, etc.)

Our Pipeline

Raw OCR text
    → ocr_cleaner.py (fix rn→m, 0→o, ligatures, whitespace)
    → ner_model.py (PharmaDetect with manual sub-word merging)
    → drug_analyzer.py (filter CHEM labels, dedupe, skip single-char/punctuation)
    → rxnorm_client.py (validate against RxNorm, get rxcui)
    → [if NER finds nothing] rxnorm fallback (approximate term search)

The benchmark tests the first three steps (OCR cleaner → NER → basic filtering) without RxNorm validation, to isolate the NER model's behavior on packaging text.
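The sub-word merging and basic filtering steps can be sketched as below. This assumes WordPiece-style `##` continuation pieces and entities given as `(text, label)` tuples with a `CHEM` label; both are assumptions about the data shape, and the function names do not come from `ner_model.py` or `drug_analyzer.py`.

```python
import string


def merge_wordpieces(pieces):
    """Rejoin WordPiece continuation pieces ("##...") onto the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


def filter_entities(entities):
    """Keep CHEM entities; drop single-char, punctuation-only, and duplicates."""
    kept, seen = [], set()
    for text, label in entities:
        cleaned = text.strip()
        if label != "CHEM" or len(cleaned) <= 1:
            continue
        if all(ch in string.punctuation for ch in cleaned):
            continue
        key = cleaned.lower()  # case-insensitive dedupe
        if key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept
```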

Results

By Noise Level (Bare NER, 500-case samples)

Noise	Precision	Recall	F1	Detection
none (clean)	46.9%	84.4%	60.3%	99.6%
light (5-15% char errors)	44.9%	79.8%	57.5%	99.8%
heavy (40% errors + splits)	26.2%	53.5%	35.2%	99.8%

Detection rate = percentage of cases where NER found at least one entity.

Full Pipeline vs Bare NER (50-case samples)

Pipeline Step	Noise	Precision	Recall	F1	Latency / Case
Bare NER	none	48.6%	81.0%	60.7%	64ms
Full Pipeline	none	71.6%	81.0%	76.0%	961ms
Bare NER	light	47.5%	79.8%	59.6%	69ms
Full Pipeline	light	74.4%	79.8%	77.0%	1089ms
Bare NER	heavy	24.7%	47.6%	32.5%	78ms
Full Pipeline	heavy	65.6%	47.6%	55.2%	2597ms

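The RxNorm validation step rejected 37 false positives and raised precision by 23 to 41 points depending on noise level (48.6% → 71.6% on clean text, 24.7% → 65.6% under heavy noise), without changing recall. However, relying on an external HTTP API adds roughly 900 ms of latency per case on clean text (64 ms → 961 ms) and about 2.5 s under heavy noise (78 ms → 2597 ms).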

By Category (clean text, 500 cases)

Category	N	Precision	Recall	F1	TP	FP	FN
single_ingredient	279	41%	86%	55%	239	347	40
dual_ingredient	155	49%	85%	62%	261	270	47
multi_ingredient	66	54%	82%	65%	170	143	37
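The precision, recall, and F1 columns follow directly from the TP/FP/FN counts. The helper below is a standard formulation, not necessarily the one in `eval/benchmark.py`, and can be used to check the rows of the table.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the category table above.
CATEGORY_COUNTS = {
    "single_ingredient": (239, 347, 40),
    "dual_ingredient": (261, 270, 47),
    "multi_ingredient": (170, 143, 37),
}

# Percentages rounded to whole numbers, as in the table.
rounded = {
    name: tuple(round(100 * value) for value in prf(*counts))
    for name, counts in CATEGORY_COUNTS.items()
}
```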

GLiNER Pipeline Augmentation (500-case samples, clean text)

To improve precision (rejecting brands, salts, and manufacturers) and recall (catching edge cases), we evaluated urchade/gliner_medium-v2.1 as a secondary observer and adjudicator alongside the baseline PharmaDetect + RxNorm pipeline.

Evaluated Architectures

baseline (Current): PharmaDetect NER → RxNorm Validation. If no entities are found, falls back to raw text blocks via RxNorm's approximateTerm.
gliner_filter (Precision): GLiNER runs alongside PharmaDetect. If GLiNER explicitly tags a PharmaDetect entity as a negative label (e.g., brand or trade name, manufacturer, salt or counter-ion), the entity is rejected before hitting RxNorm.
gliner_sequential (Speed & Precision): PharmaDetect runs first. Only the short entity spans extracted by PharmaDetect are passed to GLiNER for classification. If GLiNER tags the snippet as an active pharmaceutical ingredient, it proceeds to RxNorm. This avoids most of the CPU overhead of running GLiNER on the whole document.
gliner_fallback (Recall): If PharmaDetect returns zero entities, the pipeline queries GLiNER for active pharmaceutical ingredient spans instead of running the raw text approximateTerm fallback.
gliner_union (Recall & Precision): Both PharmaDetect and GLiNER run. All GLiNER active pharmaceutical ingredient spans are merged with the PharmaDetect results, deduplicated by their resolved RxNorm IDs.
gliner_adjudicated (Complex Logic): An advanced version of the filter. It rejects negative GLiNER labels, but is "salt-aware"—it won't reject a standalone salt form if GLiNER detects an active ingredient immediately adjacent to it in the text.
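As an illustration of the union mode, the sketch below merges spans from both models and deduplicates them by resolved identifier. The `resolve_rxcui` callable is a hypothetical stand-in for the RxNorm client, and the identifiers in the usage example are made up for the demonstration.

```python
def union_by_rxcui(pharmadetect_spans, gliner_spans, resolve_rxcui):
    """Merge spans from both models, keeping one span per resolved ID.

    `resolve_rxcui` is a caller-supplied function returning an ID or None.
    Spans that do not resolve are dropped, mirroring RxNorm validation.
    """
    merged = {}
    for span in [*pharmadetect_spans, *gliner_spans]:
        rxcui = resolve_rxcui(span)
        if rxcui is None:
            continue
        # First span seen for an ID wins, so PharmaDetect takes precedence.
        merged.setdefault(rxcui, span)
    return merged


# Usage with a fake resolver; the IDs are placeholders, not real RxCUIs.
FAKE_IDS = {"amoxycillin": "id-1", "amoxicillin": "id-1", "clavulanic acid": "id-2"}


def fake_resolver(span):
    return FAKE_IDS.get(span.lower())


result = union_by_rxcui(
    ["Amoxycillin", "Tablet"], ["Amoxicillin", "Clavulanic Acid"], fake_resolver
)
```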

Experiment Mode	Pipeline Precision	Pipeline Recall	Pipeline F1	Avg Latency (ms)*
baseline (Current)	76.2%	84.1%	80.0%	984.5
gliner_filter	77.8%	84.0%	80.8%	1221.9
gliner_fallback	76.5%	84.5%	80.3%	1208.5
gliner_sequential	84.9%	58.6%	69.3%	990.9
gliner_adjudicated	78.2%	84.4%	81.2%	1218.8
gliner_union	78.0%	93.6%	85.1%	1266.3

*Note: Latency is inflated by concurrent background testing hitting CPU limits, but relative differences show the overhead of GLiNER. The sequential mode adds almost no latency over baseline (990.9 ms vs 984.5 ms) because it only runs GLiNER on short snippets.

Findings:

Recall Improvement (gliner_union): The union mode increased recall from 84.1% to 93.6% (a 9.5pp increase) while also raising precision from 76.2% to 78.0%. GLiNER identifies active ingredients that PharmaDetect misses entirely, and the added spans did not lower precision.
The Context Trap (gliner_sequential): The sequential mode achieved the highest precision (84.9%) and the lowest latency of the GLiNER modes (990.9 ms, close to the 984.5 ms baseline). However, its recall dropped to 58.6%, 25.5pp below baseline. Because GLiNER was only fed the short text snippets extracted by PharmaDetect, it lost all surrounding context. GLiNER relies heavily on context to classify entities; without it, it failed to recognize many valid active ingredients, causing them to be falsely filtered out.
Precision Improvement (gliner_filter and gliner_adjudicated): Both filters stripped out false positives (brands, salts), raising precision by 1.6pp (gliner_filter) and 2.0pp (gliner_adjudicated) with virtually no change in recall (84.0% and 84.4% vs 84.1% baseline).
Conclusion: gliner_union produced the best overall F1 score (85.1%) because running GLiNER on the full text preserves its contextual reasoning.

Interpretation

Recall is strong (84%). The model finds the active ingredient in most cases. Multi-ingredient labels are slightly harder (82%) due to more complex text. Under light OCR noise, recall drops only to 80% — the model is reasonably robust to minor distortion.

Precision is poor (47%). The model tags brand names, manufacturer names, salt forms, and dosage form words as chemical entities. This is correct per BC5CDR training ("find all chemicals") but wrong for our use case ("find active pharmaceutical ingredients only").

Heavy noise cuts recall by about 30 points (84% → 54%). Mid-word splits (Bevacizu mab) and character corruption (Metforrnin) break the tokenizer. This is the target for Phase 4 (OCR Modernization with dictionary-backed fuzzy matching).

False Positive Analysis

All false positives fall into predictable categories:

1. Brand Names

The model tags brand names as chemical entities: Augmentin, Avastin, Allegra, Lipitor, etc. These are the medicine's trade names, not active ingredients.

2. Salt Forms

The model tags salt/counter-ion names separately: Sodium, Hydrochloride, Calcium, Phosphate, Maleate, Potassium. These appear in compositions like "Atorvastatin Calcium" but are not the active drug.

3. Manufacturer Names

Pharmaceutical company names that look chemical-like to the model, such as "Cipla" and "Lupin", occasionally get tagged.

4. Dosage Form Words

Words like "Tablet", "Capsule", "Syrup", "Injection" sometimes get tagged, especially in compact label layouts.

Noise Degradation Analysis

Metric	None → Light	Light → Heavy
Recall	84.4% → 79.8% (−4.6pp)	79.8% → 53.5% (−26.3pp)
Precision	46.9% → 44.9% (−2.0pp)	44.9% → 26.2% (−18.7pp)
F1	60.3% → 57.5% (−2.8pp)	57.5% → 35.2% (−22.3pp)

Light noise has minimal impact — the OCR cleaner handles common o→0 and l→1 substitutions. Heavy noise causes severe degradation because:

Mid-word splits break tokenization (Amoxy cillin becomes two tokens)
m→rn corruption in drug names (Metforrnin) escapes the cleaner's limited regex patterns
All-caps text shifts token distributions away from training data

Comparison with Published Benchmarks

Benchmark	Precision	Recall	F1
BC5CDR-CHEM (published)	95.1%	96.6%	95.8%
Our packaging — clean	46.9%	84.4%	60.3%
Our packaging — light noise	44.9%	79.8%	57.5%
Our packaging — heavy noise	26.2%	53.5%	35.2%

The precision gap (95% → 47%) reflects a task mismatch, not model quality. BC5CDR measures "find all chemicals" — we measure "find only active pharmaceutical ingredients." The recall gap (97% → 84%) reflects the domain shift from clean scientific text to formulaic packaging labels.

Remediation Plan

Phase	Target	Expected Impact
Phase 2: Entity Linking	Filter NER output through DrugBank to reject non-drug entities	Precision 47% → ~85%+
Phase 3: Fallback	DrugBank fuzzy search when NER finds 0 entities	Recall +5-10% on edge cases
Phase 4: OCR Modernization	Dictionary-backed fuzzy correction before NER	Heavy-noise recall 54% → ~75%+

See docs/plans/phase{2,3,4}_*.md for detailed designs.
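To give a feel for what the Phase 4 dictionary-backed correction could look like, here is a minimal sketch using Python's standard `difflib`. The tiny dictionary, the 0.8 cutoff, and the function names are all assumptions for illustration; the actual design lives in the phase plans and may differ substantially.

```python
import difflib

# Hypothetical mini-dictionary; a real one would come from a drug vocabulary.
DRUG_DICTIONARY = ["Metformin", "Bevacizumab", "Amoxycillin", "Ibuprofen"]
_LOOKUP = {name.lower(): name for name in DRUG_DICTIONARY}


def correct_token(token, cutoff=0.8):
    """Replace a token with its closest dictionary entry, if close enough."""
    matches = difflib.get_close_matches(
        token.lower(), list(_LOOKUP), n=1, cutoff=cutoff
    )
    return _LOOKUP[matches[0]] if matches else token


def correct_text(text, cutoff=0.8):
    """Rejoin mid-word splits that form a dictionary word, then fuzzy-correct."""
    tokens = text.split()
    corrected, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = (tokens[i] + tokens[i + 1]).lower()
            if joined in _LOOKUP:
                corrected.append(_LOOKUP[joined])
                i += 2
                continue
        corrected.append(correct_token(tokens[i], cutoff))
        i += 1
    return " ".join(corrected)
```

In this sketch `Metforrnin 500mg` is corrected to `Metformin 500mg`, and the split `Bevacizu mab` is rejoined because the concatenation is an exact dictionary hit. The cutoff trades missed corrections against the risk of rewriting unrelated words into drug names.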

Reproducing

# Generate dataset (once)
uv run python eval/prepare_hf_dataset.py                  # clean
uv run python eval/prepare_hf_dataset.py --noise light     # light OCR noise
uv run python eval/prepare_hf_dataset.py --noise heavy     # heavy OCR noise

# Run benchmark
uv run python eval/benchmark.py                    # full dataset
uv run python eval/benchmark.py --limit 500        # quick run
uv run python eval/benchmark.py --with-rxnorm      # full pipeline (needs network)
uv run python eval/benchmark.py -v                 # per-case details

Results are written to eval/results.json.

References

OpenMed NER paper — Panahi, 2025
BC5CDR corpus — Li et al., 2016
Bio_Discharge_Summary_BERT — Alsentzer et al.
PharmaDetect-BioPatient-108M — OpenMed
MattBastar/Medicine_Details — HuggingFace dataset
