opir-multilang-onnx
opir-multilang-onnx is an ONNX export of
knowledgator/opir-multitask-multilang-v1.0
(GLiClass uni-encoder over microsoft/mdeberta-v3-base), packaged as an offline, multilingual
content-safety classifier with a frozen taxonomy baked into the graph. Produced for
AgentGuard but usable standalone with ONNX Runtime in
any language.
What it does
Scores text against a fixed label set and returns one logit per label. The candidate labels are
prepended to the text as <<LABEL>>l1<<LABEL>>l2…<<SEP>>text and run through a single mDeBERTa-v3
forward pass (GLiClass uni-encoder); each label's pooled hidden state is scored. Decision:
P(label) = sigmoid(logit), block iff max P over the harm labels >= threshold.
Frozen V1 taxonomy. The block decision is over 6 harm categories:
toxicity, hate speech, violence, sexual content, self-harm, harassment
The graph bakes in a 7th label, safe and benign, as label 0. GLiClass scores all labels jointly
in one forward pass (they cross-attend through the encoder), so this sentinel is essential for
calibration: it absorbs benign probability mass. It is excluded from the block decision
(prefix.json lists the 6 harm labels under unsafe_labels). Omitting it inflates both recall and
false positives.
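The decision rule and the sentinel exclusion can be sketched in a few lines of Python. This is a minimal illustration, not the AgentGuard implementation: the label strings and the order of labels 1 to 6 are assumed to follow the list above (prefix.json is the source of truth), and `decide` is a hypothetical helper name.

```python
import math

# Assumed label order: safe sentinel at index 0, then the six harm labels
# in the order listed above. Read the real order from prefix.json.
LABELS = ["safe and benign", "toxicity", "hate speech", "violence",
          "sexual content", "self-harm", "harassment"]
UNSAFE_LABELS = LABELS[1:]  # the sentinel never takes part in the block decision


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decide(logits, threshold=0.5, labels=LABELS, unsafe_labels=UNSAFE_LABELS):
    """Return (blocked, top_harm_label, top_harm_probability) for one logits row."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per baked label")
    probs = {label: sigmoid(z) for label, z in zip(labels, logits)}
    # Block iff the largest harm probability reaches the threshold.
    top = max(unsafe_labels, key=lambda name: probs[name])
    return probs[top] >= threshold, top, probs[top]
```

A high sentinel logit alone never triggers a block, and a low one never prevents it; only the six harm probabilities are compared against the threshold.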
The label prefix is constant, so its token-id sequence is precomputed and shipped in
prefix.json; an integrator only needs to SentencePiece-encode the variable text and assemble
prefix_ids ++ spm(text) ++ [SEP].
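A minimal sketch of that assembly step in Python follows. The helper names are hypothetical, `max_length=512` is an assumption based on the usual mDeBERTa-v3 sequence limit rather than something this card states, and truncating only the text part is one possible design choice. `encode_text` relies on the third-party `sentencepiece` package.

```python
def encode_text(text, spm_path="spm.model"):
    """SentencePiece-encode the variable text into raw ids (no special tokens)."""
    import sentencepiece as spm  # third-party; imported lazily

    # For repeated calls, load the processor once and reuse it.
    sp = spm.SentencePieceProcessor(model_file=spm_path)
    return sp.encode(text, out_type=int)


def build_input_ids(prefix_ids, text_ids, sep_id=2, max_length=512):
    """Return prefix_ids ++ text_ids ++ [sep_id], truncating only the text part.

    sep_id=2 is the [SEP] id given in this card; max_length is an assumption.
    """
    budget = max_length - len(prefix_ids) - 1  # room left for text tokens
    if budget < 0:
        raise ValueError("prefix alone exceeds max_length")
    return list(prefix_ids) + list(text_ids)[:budget] + [sep_id]
```

Here `prefix_ids` is the precomputed id sequence read from prefix.json; the key it is stored under is not documented here, so check the shipped file. Truncating the text rather than the prefix keeps every label token intact.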
Files
| File | Size | Notes |
|---|---|---|
| model.onnx | ~1.12 GB | fp32 graph, logits[batch, 7] (label 0 = safe sentinel) |
| model_fp16.onnx | ~561 MB | fp16, numerically identical (max ΔP(unsafe) 0.0003); default |
| spm.model | ~4.3 MB | stock microsoft/mdeberta-v3-base SentencePiece (250k multilingual vocab) |
| prefix.json | <1 KB | baked labels (7, safe first) + unsafe_labels (6 harm) + precomputed [CLS] <<LABEL>>…<<SEP>> id prefix + special ids |
Inputs: input_ids (int64), attention_mask (int64). Output: logits ([batch, 7] - the safe
sentinel plus the 6 harm labels). Special ids: [CLS]=1, [SEP]=2, <<LABEL>>=250102, <<SEP>>=250103, pad=0.
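The input and output contract above can be exercised with ONNX Runtime roughly as follows. This is a sketch under assumptions: right-padding with pad=0 is assumed (the card gives the pad id but not the padding side), `pad_batch` and `run_model` are hypothetical helper names, and `run_model` needs the third-party `numpy` and `onnxruntime` packages.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to a common width and build the attention mask."""
    if not sequences:
        raise ValueError("need at least one sequence")
    width = max(len(s) for s in sequences)
    input_ids = [list(s) + [pad_id] * (width - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return input_ids, attention_mask


def run_model(sequences, model_path="model_fp16.onnx"):
    """Run one batch through the graph and return logits of shape [batch, 7]."""
    import numpy as np          # third-party; imported lazily
    import onnxruntime as ort   # third-party; imported lazily

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_ids, attention_mask = pad_batch(sequences)
    (logits,) = session.run(
        ["logits"],
        {
            "input_ids": np.asarray(input_ids, dtype=np.int64),
            "attention_mask": np.asarray(attention_mask, dtype=np.int64),
        },
    )
    # Cast to float32 so downstream code sees one dtype for both model files.
    return np.asarray(logits, dtype=np.float32)
```

Each element of `sequences` is a fully assembled id list (label prefix, text ids, closing [SEP]). In a long-running service the session would be created once and reused rather than rebuilt per call.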
Threshold
Default 0.5 (shipped in prefix.json). Per-deployment tunable: the false-positive rate is
somewhat threshold-sensitive on this multilingual model (unlike the English Opir variant), e.g.
Hindi toxicity moves from 56% recall / 16% FPR at 0.5 to 36% / 4% at 0.8.
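To pick a threshold for a specific deployment, recall and false-positive rate can be swept over a labelled sample. The helper below is a minimal sketch with hypothetical names; `p_unsafe` is assumed to hold, per example, the maximum sigmoid probability over the harm labels, and `is_unsafe` the ground-truth flags from your own evaluation set.

```python
def recall_and_fpr(p_unsafe, is_unsafe, threshold):
    """Recall and false-positive rate when blocking at p >= threshold."""
    pairs = list(zip(p_unsafe, is_unsafe))
    positives = sum(1 for _, y in pairs if y)
    negatives = len(pairs) - positives
    true_pos = sum(1 for p, y in pairs if y and p >= threshold)
    false_pos = sum(1 for p, y in pairs if not y and p >= threshold)
    recall = true_pos / positives if positives else float("nan")
    fpr = false_pos / negatives if negatives else float("nan")
    return recall, fpr


def sweep(p_unsafe, is_unsafe, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Map each candidate threshold to its (recall, fpr) pair."""
    return {t: recall_and_fpr(p_unsafe, is_unsafe, t) for t in thresholds}
```

Running `sweep` per language makes the trade-off visible in the same terms as the Hindi figures above, so the threshold can be set against a target FPR.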
Positioning
This is an offline / sovereign multilingual content-safety guard. It fills a gap that
English-only injection classifiers (which score ~0% recall off-English) cannot fill, and that
cloud content-safety APIs cover only on a per-call basis. It is not a prompt-injection specialist and is
not intended to replace a mature cloud content-safety product on that product's own categories;
it provides genuine offline non-English toxicity coverage (≈40–76% recall at 16–36% FPR on
textdetox/multilingual_toxicity_dataset across de/es/ru/ar/zh/hi), free of charge and PII-safe
because inference runs locally.
Attribution
Derivative of knowledgator/opir-multitask-multilang-v1.0 (Apache-2.0) using Microsoft's
mdeberta-v3-base SentencePiece tokenizer. ONNX export, fp16 conversion, and frozen-taxonomy
packaging by AgentGuard (Apache-2.0). The frozen taxonomy and the int-id prefix are the only
additions; the model weights are unchanged.
Citation
If you use this model, please cite the original Opir work:
@misc{stepanov2026opirefficientmultitasksafety,
title={Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content},
author={Ihor Stepanov and Aleksandr Smechov},
year={2026},
eprint={2605.29659},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.29659},
}