---
license: mit
language: en
library_name: transformers
tags:
  - onnx
  - onnxruntime
  - quantized
  - int8
  - sequence-classification
  - modernbert
  - prerequisite-detection
  - knowledge-graph
base_model: answerdotai/ModernBERT-base
---

# Concept Verifier (ONNX, int8)

A ModernBERT-base classifier fine-tuned for concept-level verification in the
EXAMI knowledge-graph pipeline. Distributed in ONNX format with a dynamic int8
quantized variant for efficient CPU inference.

## Files

| File | Purpose | Size |
|---|---|---|
| `model.onnx` | FP32 ONNX export (reference) | ~600 MB |
| `model_int8.onnx` | Dynamic int8 quantized for deployment | ~150 MB |
| `config.json` | Hugging Face config | |
| `tokenizer.json` / `tokenizer_config.json` | Fast tokenizer | |

## Usage (onnxruntime)

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

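# Load the tokenizer files shipped alongside the ONNX models in this directory.
tok = AutoTokenizer.from_pretrained(".")
sess = ort.InferenceSession("model_int8.onnx",
                            providers=["CPUExecutionProvider"])

enc = tok("your concept text here",
          return_tensors="np", padding=True, truncation=True, max_length=128)
# Pass only the inputs the graph declares, cast to int64.
input_names = {i.name for i in sess.get_inputs()}
feed = {k: v.astype(np.int64) for k, v in enc.items() if k in input_names}
logits = sess.run(None, feed)[0]  # first graph output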
```
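
The snippet above stops at raw logits. A minimal sketch of turning them into probabilities and a predicted class index follows, assuming the output has shape `(batch, num_labels)`. The `logits` array here is a made-up stand-in for the value returned by `sess.run`, and the mapping from index to label name is not stated in this card, so check `id2label` in `config.json` before interpreting the index.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical stand-in for `logits = sess.run(None, feed)[0]`
logits = np.array([[-1.2, 2.3]], dtype=np.float32)

probs = softmax(logits)        # each row sums to 1
pred = probs.argmax(axis=-1)   # index of the highest-probability class per row
print(probs, pred)
```

Thresholding `probs` instead of taking the argmax is one way to trade precision against recall at inference time; whether that suits the pipeline is a deployment decision.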

## Production context

This model is one of two classifiers used in the EXAMI knowledge-graph pipeline.
The other (the *merge verifier*) handles `same-as` / merge classification.

For details on how this model fits into the broader incremental knowledge-graph
architecture, see the merge-verifier model card and its accompanying
`CLUSTERING_STRATEGY.md` and `MERGE_AND_CLUSTERING_ARCHITECTURE.md` documents.

## Notes on int8 quantization — partial regression validated on test set

The int8 model was validated on a 5,000-row stratified sample of the test set (same seed=42 split as fp32):

| | fp32 test (full 21,651) | int8 test (5k sample) | Δ |
|---|---|---|---|
| real_concept P | 0.9361 | 0.9371 | +0.0010 (tied) |
| real_concept R | 0.9389 | **0.9045** | **−0.0344** |
| macro_f0.5 | 0.9165 | 0.8944 | −0.0221 |

**The int8 model loses recall while precision is unchanged.** Recall on
`real_concept` drops by about 3.4 percentage points relative to fp32
(≈464 missed admissions per 13,552 valid concepts in test). Precision
is intact.
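
For readers unfamiliar with the `macro_f0.5` row, the sketch below shows the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall, and an unweighted macro average over classes. It is applied only to the `real_concept` precision and recall from the table; the other class's figures are not given in this card, so the macro values in the table cannot be reproduced from it. The helper names `f_beta` and `macro_f_beta` are illustrative, not part of the pipeline.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

def macro_f_beta(per_class_pr, beta=0.5):
    """Unweighted mean of per-class F-beta scores from (precision, recall) pairs."""
    scores = [f_beta(p, r, beta) for p, r in per_class_pr]
    return sum(scores) / len(scores)

# real_concept precision/recall taken from the table above
fp32 = f_beta(0.9361, 0.9389)
int8 = f_beta(0.9371, 0.9045)
print(f"real_concept F0.5: fp32={fp32:.4f}, int8={int8:.4f}")
```

Because F0.5 discounts recall, a recall drop moves this metric less than it would move F1.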

**Deployment guidance:**
- Use int8 if file size matters (151 MB vs. 599 MB) and you can tolerate a
  recall drop of about 3.4 percentage points. The dropped concepts are
  recoverable via re-extraction from another document.
- Use fp32 if you need maximum recall.
- 95.4% of MatMuls are properly quantized (vs. 50.3% on DeBERTa-v3-large, whose
  int8 export is broken; see the v2 model card). ModernBERT's standard transformer
  architecture round-trips through `quantize_dynamic` cleanly.
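
For reference, a dynamic int8 export with onnxruntime's `quantize_dynamic` can be sketched as below. This is an illustration only: the exact options used to produce `model_int8.onnx` are not documented in this card, so `weight_type=QuantType.QInt8` and the otherwise default settings are assumptions, and the wrapper name `quantize_to_int8` is hypothetical.

```python
from pathlib import Path

def quantize_to_int8(src="model.onnx", dst="model_int8.onnx"):
    """Dynamically quantize an FP32 ONNX model's weights to int8 (assumed settings)."""
    # Imported lazily so the helper can be defined without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(model_input=src, model_output=dst,
                     weight_type=QuantType.QInt8)
    return Path(dst)

# Only run when the FP32 export is present in the working directory.
if Path("model.onnx").exists():
    out = quantize_to_int8()
    print(f"{out}: {out.stat().st_size / 1e6:.0f} MB")
```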

Diagnostic script (for reproducing the integrity check):
```python
from collections import Counter
import onnx

# Count operator types in the quantized graph.
m = onnx.load("model_int8.onnx")
ops = Counter(n.op_type for n in m.graph.node)
fp32_mm = ops.get("MatMul", 0)         # MatMuls left in floating point
int8_mm = ops.get("MatMulInteger", 0)  # MatMuls converted to int8
total = fp32_mm + int8_mm
pct = 100 * int8_mm / total if total else 0.0  # avoid dividing by zero
print(f"MatMul fp32 left: {fp32_mm}; MatMulInteger: {int8_mm}; "
      f"quantized %: {pct:.1f}")
# ModernBERT-base: 66.2% (surrounding fp32 ops are normal; model accuracy is fine)
# DeBERTa-v3-large: 50.3% (disentangled attention partially left in fp32; broken)
```