---
license: mit
language: en
library_name: transformers
tags:
- onnx
- onnxruntime
- quantized
- int8
- sequence-classification
- modernbert
- prerequisite-detection
- knowledge-graph
base_model: answerdotai/ModernBERT-base
---

# Concept Verifier (ONNX, int8)

A ModernBERT-base classifier fine-tuned for concept-level verification in the EXAMI knowledge-graph pipeline. Distributed in ONNX format with a dynamic int8 quantized variant for efficient CPU inference.

## Files

| File | Purpose | Size |
|---|---|---|
| `model.onnx` | FP32 ONNX export (reference) | ~600 MB |
| `model_int8.onnx` | Dynamic int8 quantized for deployment | ~150 MB |
| `config.json` | HuggingFace config | |
| `tokenizer.json` / `tokenizer_config.json` | Fast tokenizer | |

## Usage (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the tokenizer and the int8 model from the current directory.
tok = AutoTokenizer.from_pretrained(".")
sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

enc = tok("your concept text here", return_tensors="np",
          padding=True, truncation=True, max_length=128)

# Feed only the inputs the ONNX graph declares, cast to int64.
input_names = {i.name for i in sess.get_inputs()}
feed = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}
logits = sess.run(None, feed)[0]
```

## Production context

This model is one of two classifiers used in the EXAMI knowledge-graph pipeline. The other (the *merge verifier*) handles `same-as` / merge classification. For details on how this model fits into the broader incremental knowledge-graph architecture, see the merge-verifier model card and its accompanying `CLUSTERING_STRATEGY.md` and `MERGE_AND_CLUSTERING_ARCHITECTURE.md` documents.

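
Before the pipeline can act on a prediction, the raw `logits` from the usage snippet above have to be turned into a label and a confidence. The following is a minimal sketch of one way to do that, assuming `logits` has shape `(batch, num_labels)`. The label mapping shown is hypothetical (only `real_concept` is named in this card); the actual mapping should be read from `id2label` in `config.json`.

```python
import math

# Hypothetical mapping for illustration only; read the real one from
# the "id2label" field of config.json.
ID2LABEL = {0: "not_concept", 1: "real_concept"}

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(float(x) - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, id2label=ID2LABEL):
    """Return a (label, probability) pair for each row of logits."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results
```

`classify(logits)` accepts the NumPy array returned by `sess.run` as well as a plain list of lists, since it only iterates over rows.
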
## Notes on int8 quantization: partial regression validated on test set

Validated on a 5,000-row stratified test sample (same seed=42 split as fp32):

| Metric | fp32 test (full 21,651) | int8 test (5k sample) | Δ |
|---|---|---|---|
| real_concept P | 0.9361 | 0.9371 | +0.0010 (tied) |
| real_concept R | 0.9389 | **0.9045** | **−0.0344** |
| macro_f0.5 | 0.9165 | 0.8944 | −0.0221 |

**The int8 model gives up recall while precision holds**: `real_concept` recall drops by about 3.4 percentage points relative to fp32 (≈464 additional missed admissions per 13,552 valid concepts in test). Precision is intact.

**Deployment guidance:**

- Use int8 if file size matters (151 MB vs 599 MB) and you can tolerate a 3.4-point recall loss. The dropped concepts are recoverable via re-extraction from another document.
- Use fp32 if you need maximum recall.
- 95.4% of MatMuls are properly quantized (vs 50.3% on DeBERTa-v3-large, which is broken; see the v2 model card). ModernBERT's standard transformer architecture round-trips through `quantize_dynamic` cleanly.

Diagnostic command (for reproducing the integrity check):

```python
from collections import Counter

import onnx

m = onnx.load("model_int8.onnx")

# Count op types: MatMul nodes stayed fp32, MatMulInteger nodes were quantized.
ops = Counter(n.op_type for n in m.graph.node)
fp32_mm = ops.get("MatMul", 0)
int8_mm = ops.get("MatMulInteger", 0)
total_mm = fp32_mm + int8_mm

print(f"MatMul fp32 left: {fp32_mm}; MatMulInteger: {int8_mm}; "
      f"quantized %: {100 * int8_mm / total_mm:.1f}")
# ModernBERT-base: 66.2% (with surrounding fp32 ops normal; model accuracy fine)
# DeBERTa-v3-large: 50.3% (with disentangled-attention partially fp32; broken)
```
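
The `macro_f0.5` row in the table above uses an F-beta score with β = 0.5, which weights precision more heavily than recall. As a rough illustration, the sketch below applies the standard F-beta formula to the `real_concept` precision and recall values from the table. It covers that one class only: the macro figure averages over all classes, and their per-class values are not listed in this card, so this does not reproduce the `macro_f0.5` numbers.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# real_concept precision and recall copied from the table above.
fp32 = f_beta(0.9361, 0.9389)
int8 = f_beta(0.9371, 0.9045)

print(f"real_concept F0.5  fp32: {fp32:.4f}  int8: {int8:.4f}  "
      f"delta: {int8 - fp32:+.4f}")
```

Because β = 0.5 discounts recall, the per-class F0.5 for `real_concept` moves less than the recall figure itself does.
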