---
language: en
license: apache-2.0
tags:
- safety-classifier
- content-moderation
- multi-task
- deberta-v3
- text-classification
datasets:
- budecosystem/guardrail-training-data
metrics:
- accuracy
- f1
---

# 🛡️ Guard Safety Classifier

A multi-task safety classifier based on **DeBERTa-v3-small**, fine-tuned for content moderation and safety detection on a dataset of 3.9M+ samples (about 3.18M of them used for training).

## 🎯 Model Tasks

This model performs **three simultaneous predictions**:

1. **Binary Safety Classification** (`is_safe`)
   - ✅ Safe content
   - ⚠️ Unsafe content

2. **Single-Label Category Classification** (`category`)
   - Identifies the primary safety concern category

3. **Multi-Label Categories** (`categories`)
   - Can detect multiple safety issues simultaneously
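
As a rough illustration of how three sets of raw scores become the three predictions, the sketch below decodes toy logits using only the standard library. The logit values and the three-name label list are made up for illustration, and one list is reused for both label heads to keep the example short. The convention that index 1 of the `is_safe` pair means "safe" and the 0.5 thresholds mirror the Quick Start code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(is_safe_logits, category_logits, multi_logits, labels, threshold=0.5):
    """Turn the three heads' raw logits into the three predictions."""
    # Binary head: two logits, index 1 assumed to be the "safe" class.
    is_safe = softmax(is_safe_logits)[1] > threshold
    # Single-label head: pick the highest-scoring class.
    best = max(range(len(category_logits)), key=category_logits.__getitem__)
    # Multi-label head: each class is thresholded independently.
    active = [lab for lab, z in zip(labels, multi_logits) if sigmoid(z) > threshold]
    return is_safe, labels[best], active


# Toy example with a hypothetical three-label subset.
labels = ["benign", "self_harm", "privacy_violation"]
print(decode([2.0, -1.0], [0.1, 1.5, 0.3], [-2.0, 1.2, 0.4], labels))
```

The single-label head always returns exactly one class, while the multi-label head can return none, one, or several, because each of its outputs is compared against the threshold on its own.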

## 📊 Performance Metrics

| Metric | Score |
|--------|-------|
| **is_safe Accuracy** | 92.76% |
| **category F1** | 0.5037 |
| **categories F1** | 0.9068 |
| **Test Loss** | 1.0233 |

## 🚀 Quick Start

```python
import pickle

import torch
from transformers import AutoTokenizer

# Placeholder import: MultiTaskSafetyClassifier is the custom model class
# and must be importable from your own module.
from your_model_file import MultiTaskSafetyClassifier

# Load tokenizer
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load label encoders first; their class counts size the output heads.
# Only unpickle files from a source you trust.
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
le_category = encoders['le_category']
mlb = encoders['mlb']

# Build the model architecture
model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=len(le_category.classes_),
    num_multi_labels=len(mlb.classes_)
)

# Load weights (map_location lets this run on CPU-only machines)
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128,
                   truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Index 1 of the is_safe logits is read as the "safe" class
is_safe = torch.softmax(outputs['is_safe'], dim=1)[0][1].item() > 0.5
category = le_category.inverse_transform([outputs['category'].argmax(1).item()])[0]
# Threshold each multi-label output independently at 0.5
multi_hot = (torch.sigmoid(outputs['categories']) > 0.5).int().cpu().numpy()
categories = mlb.inverse_transform(multi_hot)[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")
```

## 🏗️ Model Architecture

- **Base Model**: `microsoft/deberta-v3-small` (141M parameters)
- **Hidden Size**: 768
- **Max Sequence Length**: 128 tokens
- **Training Framework**: PyTorch + Transformers
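
The model class itself is not included in this card, so the following is only a sketch of what a three-head classifier over a shared encoder could look like. The head layout (three linear layers), first-token pooling, and the optional `encoder` argument are all assumptions; `encoder` is a hypothetical addition that lets the sketch be shape-checked with a stand-in module (`TinyEncoder`, also hypothetical) instead of downloading weights. The constructor arguments and output keys follow the Quick Start code.

```python
import types

import torch
import torch.nn as nn


class MultiTaskSafetyClassifier(nn.Module):
    """Hypothetical three-head classifier over a shared encoder."""

    def __init__(self, model_name, num_categories, num_multi_labels, encoder=None):
        super().__init__()
        if encoder is None:
            # Imported lazily so the sketch can be exercised without transformers.
            from transformers import AutoModel
            encoder = AutoModel.from_pretrained(model_name)
        self.encoder = encoder
        hidden = encoder.config.hidden_size  # 768 for deberta-v3-small
        self.is_safe_head = nn.Linear(hidden, 2)
        self.category_head = nn.Linear(hidden, num_categories)
        self.categories_head = nn.Linear(hidden, num_multi_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        pooled = out.last_hidden_state[:, 0]  # first-token pooling (assumed)
        return {
            "is_safe": self.is_safe_head(pooled),
            "category": self.category_head(pooled),
            "categories": self.categories_head(pooled),
        }


class TinyEncoder(nn.Module):
    """Stand-in encoder for a shape check; not a real language model."""

    def __init__(self, vocab_size=100, hidden_size=768):
        super().__init__()
        self.config = types.SimpleNamespace(hidden_size=hidden_size)
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        return types.SimpleNamespace(last_hidden_state=self.embed(input_ids))


# Shape check: batch of 2 sequences, 128 tokens each, 26 and 28 output classes.
model = MultiTaskSafetyClassifier("microsoft/deberta-v3-small", 26, 28,
                                  encoder=TinyEncoder())
with torch.no_grad():
    outputs = model(torch.randint(0, 100, (2, 128)))
print({name: tuple(t.shape) for name, t in outputs.items()})
```

Leaving `encoder` unset makes the constructor load `microsoft/deberta-v3-small` through `AutoModel`. Weights saved from the real training code will only load into a class whose layer names match, so treat this as a reading aid rather than a drop-in replacement.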

## 📚 Training Details

- **Dataset**: [budecosystem/guardrail-training-data](https://huggingface.co/datasets/budecosystem/guardrail-training-data)
- **Training Samples**: 3,182,844
- **Validation Samples**: 397,855
- **Test Samples**: 397,856
- **Batch Size**: 64
- **Learning Rate**: 2e-5
- **Epochs**: 1
- **Optimizer**: AdamW with linear warmup
- **Hardware**: NVIDIA Tesla T4 (16GB)
- **Training Time**: ~8 hours
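
A few derived quantities follow from the numbers above, and the sketch below works them out. The split ratio and step count are plain arithmetic, assuming one optimizer step per batch with no gradient accumulation. The schedule function is more speculative: the card states only "linear warmup", so the 10% warmup length and the linear decay to zero afterwards are assumptions modeled on the common warmup-then-decay pattern.

```python
import math

TRAIN, VAL, TEST = 3_182_844, 397_855, 397_856
BATCH_SIZE, EPOCHS, PEAK_LR = 64, 1, 2e-5

total = TRAIN + VAL + TEST  # 3,978,555 samples overall
# Optimizer steps, assuming one step per batch (no gradient accumulation).
steps = math.ceil(TRAIN / BATCH_SIZE) * EPOCHS
print(f"split: {TRAIN / total:.1%} / {VAL / total:.1%} / {TEST / total:.1%}")
print(f"steps: {steps}")


def lr_at(step, total_steps=steps, warmup_frac=0.1, peak=PEAK_LR):
    """Linear warmup to `peak`, then linear decay to zero (assumed shape)."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total_steps - step) / (total_steps - warmup))


# Sample the schedule at a few points across the single epoch.
for s in (0, steps // 20, steps // 10, steps // 2, steps):
    print(s, f"{lr_at(s):.2e}")
```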

## 🏷️ Categories

The single-label `category` head predicts one of the following 26 classes:

```python
[
  "animal_abuse",
  "benign",
  "child_abuse",
  "code_vulnerabilities",
  "controversial_topics_politics",
  "cwe_compliance",
  "dangerous_expert_advice",
  "discrimination_stereotype_injustice",
  "drug_abuse_weapons_banned_substance",
  "financial_crime_property_crime_theft",
  "fraud_deception_misinformation",
  "gender_bias",
  "hate_speech_offensive_language",
  "jailbreak_prompt_injection",
  "malware_hacking_cyberattack",
  "misinformation_regarding_ethics_laws_and_safety",
  "mitre_compliance",
  "non_violent_unethical_behavior",
  "orientation_bias",
  "privacy_violation",
  "race_bias",
  "religious_bias",
  "self_harm",
  "sexually_explicit_adult_content",
  "terrorism_organized_crime",
  "violence_aiding_and_abetting_incitement"
]
```

## 🔢 Multi-Label Classes

The multi-label `categories` head has 28 output classes, listed here as stored in the label encoder. They are single characters rather than category names, so `categories` predictions decode to characters.

```python
[
  " ",
  ",",
  "_",
  "a",
  "b",
  "c",
  "d",
  "e",
  "f",
  "g",
  "h",
  "i",
  "j",
  "k",
  "l",
  "m",
  "n",
  "o",
  "p",
  "r",
  "s",
  "t",
  "u",
  "v",
  "w",
  "x",
  "y",
  "z"
]
```
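
One possible explanation for a character-level class list like this is that the multi-label targets were stored as comma-separated strings: scikit-learn's `MultiLabelBinarizer` iterates over each sample, and iterating over a string yields characters. The sketch below does not establish how the encoder was actually fit. It only checks that the character set of the 26 category names joined by `, ` has the same 28 entries as the list above. The joined string `raw` is a hypothetical example of such a target.

```python
CATEGORIES = [
    "animal_abuse", "benign", "child_abuse", "code_vulnerabilities",
    "controversial_topics_politics", "cwe_compliance",
    "dangerous_expert_advice", "discrimination_stereotype_injustice",
    "drug_abuse_weapons_banned_substance",
    "financial_crime_property_crime_theft",
    "fraud_deception_misinformation", "gender_bias",
    "hate_speech_offensive_language", "jailbreak_prompt_injection",
    "malware_hacking_cyberattack",
    "misinformation_regarding_ethics_laws_and_safety", "mitre_compliance",
    "non_violent_unethical_behavior", "orientation_bias",
    "privacy_violation", "race_bias", "religious_bias", "self_harm",
    "sexually_explicit_adult_content", "terrorism_organized_crime",
    "violence_aiding_and_abetting_incitement",
]

# Hypothetical raw target: several category names stored as one string.
raw = ", ".join(CATEGORIES)

# Iterating over a string yields characters, so a label set built from
# raw strings is a character set.
char_classes = sorted(set(raw))
print(len(char_classes), char_classes[:3])

# Splitting the string first yields whole category names instead.
name_classes = sorted({part.strip() for part in raw.split(",")})
print(len(name_classes))
```

If this is indeed the cause, splitting each target string into a list of names before fitting the binarizer would give name-level classes.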

## ⚙️ Configuration

Full model configuration is available in `config.json`.

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- Base model: [microsoft/deberta-v3-small](https://huggingface.co/microsoft/deberta-v3-small)
- Training data: [budecosystem/guardrail-training-data](https://huggingface.co/datasets/budecosystem/guardrail-training-data)

## 📮 Contact

For questions or issues, please open an issue on the model repository.