---
language: en
license: apache-2.0
tags:
- safety-classifier
- content-moderation
- multi-task
- deberta-v3
- text-classification
datasets:
- budecosystem/guardrail-training-data
metrics:
- accuracy
- f1
---

# 🛡️ Guard Safety Classifier

A multi-task safety classifier based on **DeBERTa-v3-small**, built from a dataset of 3.9M+ samples (about 3.18M of them in the training split) for content moderation and safety detection.

## 🎯 Model Tasks

This model makes **three simultaneous predictions**:

1. **Binary Safety Classification** (`is_safe`)
   - ✅ Safe content
   - ⚠️ Unsafe content

2. **Single-Label Category Classification** (`category`)
   - Identifies the primary safety concern category

3. **Multi-Label Categories** (`categories`)
   - Can flag multiple safety issues in the same text

## 📊 Performance Metrics

| Metric | Score |
|--------|-------|
| **is_safe Accuracy** | 92.76% |
| **category F1** | 0.5037 |
| **categories F1** | 0.9068 |
| **Test Loss** | 1.0233 |

## 🚀 Quick Start

```python
import pickle

import torch
from transformers import AutoTokenizer

# Placeholder import: replace `your_model_file` with the module that defines the model class
from your_model_file import MultiTaskSafetyClassifier

# Load the tokenizer (replace YOUR_USERNAME with the repository owner)
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the label encoders first; their class lists set the size of each head
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
le_category = encoders["le_category"]
mlb = encoders["mlb"]

# Build the model architecture and load the trained weights
model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=len(le_category.classes_),
    num_multi_labels=len(mlb.classes_),
)
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Index 1 of the is_safe head is treated as the "safe" class
is_safe = torch.softmax(outputs["is_safe"], dim=1)[0][1].item() > 0.5
category = le_category.inverse_transform([outputs["category"].argmax(1).item()])[0]
categories = mlb.inverse_transform((torch.sigmoid(outputs["categories"]) > 0.5).cpu().numpy())[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")
```

## 🏗️ Model Architecture

- **Base Model**: `microsoft/deberta-v3-small` (141M parameters)
- **Hidden Size**: 768
- **Max Sequence Length**: 128 tokens
- **Training Framework**: PyTorch + Transformers

## 📚 Training Details

- **Dataset**: [budecosystem/guardrail-training-data](https://huggingface.co/datasets/budecosystem/guardrail-training-data)
- **Training Samples**: 3,182,844
- **Validation Samples**: 397,855
- **Test Samples**: 397,856
- **Batch Size**: 64
- **Learning Rate**: 2e-5
- **Epochs**: 1
- **Optimizer**: AdamW with linear warmup
- **Hardware**: NVIDIA Tesla T4 (16GB)
- **Training Time**: ~8 hours

## 🏷️ Categories

The single-label `category` head can identify the following safety categories:

```python
[
    "animal_abuse",
    "benign",
    "child_abuse",
    "code_vulnerabilities",
    "controversial_topics_politics",
    "cwe_compliance",
    "dangerous_expert_advice",
    "discrimination_stereotype_injustice",
    "drug_abuse_weapons_banned_substance",
    "financial_crime_property_crime_theft",
    "fraud_deception_misinformation",
    "gender_bias",
    "hate_speech_offensive_language",
    "jailbreak_prompt_injection",
    "malware_hacking_cyberattack",
    "misinformation_regarding_ethics_laws_and_safety",
    "mitre_compliance",
    "non_violent_unethical_behavior",
    "orientation_bias",
    "privacy_violation",
    "race_bias",
    "religious_bias",
    "self_harm",
    "sexually_explicit_adult_content",
    "terrorism_organized_crime",
    "violence_aiding_and_abetting_incitement"
]
```

## 🔢 Multi-Label Classes

The classes stored in the multi-label encoder are individual characters rather than category names, which suggests the encoder was fit on comma-separated strings instead of lists of labels. Interpret the decoded `categories` output with that in mind.

```python
[
    " ", ",", "_",
    "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
    "n", "o", "p", "r", "s", "t", "u", "v", "w", "x", "y", "z"
]
```

## ⚙️ Configuration

The full model configuration is available in `config.json`.

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- Base model: [microsoft/deberta-v3-small](https://huggingface.co/microsoft/deberta-v3-small)
- Training data: [budecosystem/guardrail-training-data](https://huggingface.co/datasets/budecosystem/guardrail-training-data)

## 📮 Contact

For questions or issues, please open an issue on the model repository.
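
## 🧪 Example: Decoding the Three Heads

The Quick Start decodes the three outputs with a softmax, an argmax, and a thresholded sigmoid. The sketch below reproduces that decoding logic in plain Python so the steps can be inspected without loading the model. The `decode_outputs` helper, the toy logits, and the three-name label list are hypothetical and chosen only for illustration. Like the Quick Start, it assumes index 1 of the `is_safe` head means "safe".

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_outputs(is_safe_logits, category_logits, multi_logits,
                   category_names, multi_names, threshold=0.5):
    """Hypothetical helper mirroring the Quick Start post-processing."""
    # Binary head: probability of class index 1 (assumed to be "safe")
    p_safe = softmax(is_safe_logits)[1]
    # Single-label head: pick the highest-scoring category
    best = max(range(len(category_logits)), key=category_logits.__getitem__)
    # Multi-label head: keep every label whose sigmoid score clears the threshold
    active = [name for name, z in zip(multi_names, multi_logits)
              if sigmoid(z) > threshold]
    return {
        "is_safe": p_safe > threshold,
        "p_safe": p_safe,
        "category": category_names[best],
        "categories": active,
    }


# Toy subset of labels and made-up logits, for illustration only
names = ["benign", "self_harm", "privacy_violation"]
example = decode_outputs(
    is_safe_logits=[2.0, -1.0],
    category_logits=[0.1, 3.2, -0.5],
    multi_logits=[-2.0, 1.5, 0.3],
    category_names=names,
    multi_names=names,
)
print(example)
```

Returning `p_safe` alongside the boolean makes it easier to tune the 0.5 threshold for a stricter or more permissive moderation policy.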
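
## 🧮 Example: Split Sizes and Optimizer Steps

The sample counts, batch size, and epoch count listed under Training Details imply a dataset total, split ratios, and a number of optimizer steps. The short calculation below works these out. It assumes the final partial batch is kept and that no gradient accumulation is used, neither of which the model card states.

```python
import math

# Figures taken from the Training Details section
splits = {"train": 3_182_844, "validation": 397_855, "test": 397_856}
batch_size = 64
epochs = 1

total_samples = sum(splits.values())
fractions = {name: round(count / total_samples, 3) for name, count in splits.items()}

# Assumption: the last, smaller batch is not dropped
steps_per_epoch = math.ceil(splits["train"] / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Total samples:   {total_samples:,}")
print(f"Split fractions: {fractions}")
print(f"Optimizer steps: {total_steps:,}")
```

Under these assumptions the three splits sum to 3,978,555 samples in an 80/10/10 ratio, and one epoch corresponds to 49,732 optimizer steps.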