MiniLM Content Guard - 3 Class

A lightweight content moderation model that classifies text into one of three classes: safe, toxic, or spam. Built on MiniLMv2-L6-H384 and fine-tuned with focal loss for robust handling of hard examples. The model is provided in ONNX format and optimized for CPU inference.

Suggested confidence threshold: 0.9. Predictions with a confidence below 0.9 carry a higher risk of being wrong and should not be treated as reliable on their own.
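One way to apply the suggested threshold is sketched below. It is a minimal, dependency-free illustration, not part of the model's API: `classify_with_threshold` and the `needs_review` flag are hypothetical names, the logits are made-up values, and the label mapping is copied from the label table in this card. In practice the logits would come from the model output.

```python
import math

# Label mapping as listed in this card.
ID2LABEL = {0: "safe", 1: "toxic", 2: "spam"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_threshold(logits, threshold=0.9):
    """Return the top label and flag predictions below the threshold.

    `needs_review` is a hypothetical field: how low-confidence
    predictions are handled is up to the application.
    """
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[pred_id]
    return {
        "label": ID2LABEL[pred_id],
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }


# Made-up logits: one confident prediction, one borderline prediction.
confident = classify_with_threshold([0.1, 4.0, 0.2])
borderline = classify_with_threshold([1.0, 1.5, 0.2])
print(confident["label"], confident["needs_review"])
print(borderline["label"], borderline["needs_review"])
```

Flagging instead of discarding keeps the low-confidence prediction available for a fallback rule or manual review.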

Labels

Label ID	Label	Description
0	`safe`	Normal, non-harmful content
1	`toxic`	Hate speech, threats, personal attacks, severe insults
2	`spam`	Unsolicited promotions, scams, phishing attempts

Usage

ONNX Runtime

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import torch

model_name = "navodPeiris/minilm-toxic-spam-classifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
model = ORTModelForSequenceClassification.from_pretrained(model_name)

text = "look like garbage!"

# Truncate to the model's maximum sequence length (512 tokens)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

outputs = model(**inputs)

# Convert logits → probabilities
probs = torch.softmax(outputs.logits, dim=-1)

# Get predicted class
pred_id = torch.argmax(probs, dim=-1).item()

label = config.id2label[pred_id]
confidence = probs[0][pred_id].item()

print(label, f"{confidence:.3f}")

Transformers.js

import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline(
  "text-classification",
  "navodPeiris/minilm-toxic-spam-classifier",
);

const res = await pipe("this code is uglier than u ugghh");
console.log("res:", res);

Performance

Evaluated on a held-out test set of 13,123 samples:

              precision    recall  f1-score   support

        safe       0.98      0.96      0.97      4332
       toxic       0.90      0.96      0.93      1626
        spam       0.99      0.97      0.98      1157

    accuracy                           0.96      7115
   macro avg       0.95      0.96      0.96      7115
weighted avg       0.96      0.96      0.96      7115
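The averaged F1 rows can be reproduced from the per-class rows. The sketch below uses only the rounded F1 and support values printed in the table, so it shows how the macro and weighted averages are defined rather than recomputing them from raw predictions.

```python
# Per-class F1 and support, copied from the table above (rounded values).
report = {
    "safe": {"f1": 0.97, "support": 4332},
    "toxic": {"f1": 0.93, "support": 1626},
    "spam": {"f1": 0.98, "support": 1157},
}

total = sum(row["support"] for row in report.values())

# Macro average: unweighted mean over classes.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: mean over classes, weighted by class support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 7115 0.96 0.96
```

Because the inputs are already rounded to two decimals, averages recomputed this way can differ from the reported ones in the last digit.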

Training Details

Architecture

Base model: nreimers/MiniLMv2-L6-H384-distilled-from-BERT-Large (6 layers, 384 hidden dim, ~22M params)
Task head: Linear classification head (3 classes)
Max sequence length: 512 tokens
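Inputs longer than 512 tokens have to be truncated or split before inference. The sketch below shows one possible splitting strategy on a plain list of token ids. `chunk_token_ids` is a hypothetical helper, the overlap of 32 tokens is an arbitrary choice, and reserving 2 positions assumes a BERT-style tokenizer that adds one start and one end token per sequence.

```python
def chunk_token_ids(token_ids, max_length=512, reserved=2, overlap=32):
    """Split token ids into overlapping windows that fit the model.

    `reserved` leaves room for special tokens (assumed: 2 per sequence).
    """
    window = max_length - reserved
    if overlap >= window:
        raise ValueError("overlap must be smaller than the usable window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks


# Stand-in for a tokenized long document (1000 token ids).
chunks = chunk_token_ids(list(range(1000)))
print(len(chunks), [len(c) for c in chunks])
```

Each chunk can then be classified separately. How the per-chunk predictions are combined (for example, taking the most severe label) is an application decision that this card does not specify.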

Training Data

The model was trained on a combined dataset from multiple sources:

Source	Type	Usage
Jigsaw Toxic Comments	Toxicity	safe / toxic labels
Civil Comments	Toxicity	safe / toxic labels
Mail Spam/Ham	Spam	spam labels
Enron Spam	Spam	spam labels

Hyperparameters

Epochs: 5
Batch size: 16 (train) / 32 (eval)
Learning rate: 3e-5
Weight decay: 0.01
Loss: Focal loss (gamma=2) for better handling of hard/borderline examples
Early stopping: Enabled on F1 macro
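For reference, the standard focal loss for a single example is FL = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below is a dependency-free illustration of that formula with gamma=2. It is not the training code used for this model, it omits any class weighting (none is stated in this card), and the logits are made-up values.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one example."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by (1 - p_t)^gamma."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [5.0, 0.0, 0.0]  # true class 0 predicted with high confidence
hard = [0.0, 1.0, 0.0]  # true class 0 predicted with low confidence

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: ce={ce:.4f} focal={fl:.6f} ratio={fl / ce:.4f}")
```

The ratio column shows the effect: the confidently correct example is scaled down far more than the hard one, so hard and borderline examples dominate the gradient. With gamma=0 the function reduces to plain cross-entropy.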

ONNX Export

An ONNX version of the model is included for fast CPU inference:

Opset version: 18
Dynamic axes: batch size and sequence length
Constant folding: Enabled
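The settings above map onto keyword arguments of `torch.onnx.export`. The sketch below is an assumption about how such an export could be configured, not the script used for this model: the input and output names are guesses, and `export_to_onnx` is a hypothetical helper.

```python
# Export settings mirroring the list above.
# Assumption: input/output names are not stated in this card.
EXPORT_KWARGS = {
    "opset_version": 18,
    "do_constant_folding": True,
    "input_names": ["input_ids", "attention_mask"],
    "output_names": ["logits"],
    "dynamic_axes": {
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
}


def export_to_onnx(model, example_inputs, path="model.onnx"):
    """Export a PyTorch model using the settings above.

    `example_inputs` is a tuple of tensors (input_ids, attention_mask).
    torch is imported lazily so the settings can be inspected without it.
    """
    import torch

    model.eval()
    torch.onnx.export(model, example_inputs, path, **EXPORT_KWARGS)


print(EXPORT_KWARGS["opset_version"], sorted(EXPORT_KWARGS["dynamic_axes"]))
```

Marking axis 0 and axis 1 as dynamic is what lets the exported graph accept any batch size and sequence length at inference time.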

Limitations

English-only — not tested on other languages
May struggle with subtle or implicit toxicity where the language closely resembles negative sentiment (e.g., strong product complaints vs. personal attacks)
Not designed for nuanced content policy enforcement — best used as a first-pass filter

License

Apache 2.0
