---
language:
- uz
license: mit
library_name: transformers
tags:
- text-classification
- spam-detection
- uzbek
- telegram
- xlm-roberta
- fine-tuned
datasets:
- sukhrobnurali/uzbek_spam_dataset
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: uzbek-spam-detector
  results:
  - task:
      type: text-classification
      name: Spam Detection
    dataset:
      name: Uzbek Spam Dataset
      type: sukhrobnurali/uzbek_spam_dataset
    metrics:
    - type: accuracy
      value: 1
      name: Accuracy
    - type: f1
      value: 1
      name: F1
    - type: precision
      value: 1
      name: Precision
    - type: recall
      value: 1
      name: Recall
base_model:
- FacebookAI/xlm-roberta-base
---
# Uzbek Spam Detector
A fine-tuned XLM-RoBERTa model for detecting spam messages in the Uzbek language, designed specifically for short, Telegram-style messages.
## Model Description
This model classifies Uzbek text messages as either **spam** or **normal** (ham). It was fine-tuned on a synthetic dataset of 2,000 Uzbek messages covering common spam patterns found on Telegram and other messaging platforms.
| Property | Value |
|----------|-------|
| **Base Model** | [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) |
| **Language** | Uzbek (Latin & Cyrillic scripts) |
| **Task** | Binary Text Classification |
| **Labels** | `spam`, `normal` |
## Intended Use
### Primary Use Cases
- Spam filtering for Uzbek Telegram bots
- Content moderation for Uzbek social platforms
- Message classification in Uzbek chat applications
### Out-of-Scope Use
- Languages other than Uzbek
- Long-form document classification
- Detecting other types of harmful content (hate speech, etc.)
## How to Use
### Quick Start
```python
from transformers import pipeline
# Load the model
classifier = pipeline("text-classification", model="sukhrobnurali/uzbek-spam-detector")
# Classify messages
result = classifier("Salom! Bugun uchrashuvga kela olasanmi?")
print(result)
# [{'label': 'normal', 'score': 0.98}]
result = classifier("TEZKOR KREDIT! 50% chegirma! Bosing: example.com")
print(result)
# [{'label': 'spam', 'score': 0.99}]
```
### Batch Classification
```python
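# Reuses the `classifier` pipeline created in Quick Start
messages = [
    "Rahmat katta yordam uchun!",
    "Tabriklaymiz! Siz 1000$ yutdingiz!",
    "Kecha juda charchab uyga keldim",
]

# Passing a list returns one result dict per message, in the same order
results = classifier(messages)
for msg, res in zip(messages, results):
    print(f"{res['label']}: {msg[:40]}...")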
```
### Using with PyTorch
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("sukhrobnurali/uzbek-spam-detector")
model = AutoModelForSequenceClassification.from_pretrained("sukhrobnurali/uzbek-spam-detector")
text = "Bizning kanalga obuna bo'ling!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
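# Disable gradient tracking for inference
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class and map its index to a label name
prediction = torch.argmax(outputs.logits, dim=-1)
label = model.config.id2label[prediction.item()]
print(f"Prediction: {label}")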
```
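The `argmax` in the PyTorch example returns only the winning label. When a confidence score is also needed, the logits can be passed through a softmax, which in PyTorch is `torch.softmax(outputs.logits, dim=-1)`. The sketch below performs the same computation in plain Python so that it runs without downloading the model. The logit values, the `label_order` list, and the 0.9 threshold are illustrative assumptions, not values taken from this model; check `model.config.id2label` for the actual label order.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_as_spam(logits, label_order, threshold=0.9):
    """Return (is_spam, spam_probability) for one message.

    label_order is assumed to mirror model.config.id2label,
    e.g. ["normal", "spam"]; threshold is a hypothetical cut-off.
    """
    probs = softmax(logits)
    spam_prob = probs[label_order.index("spam")]
    return spam_prob >= threshold, spam_prob


# Made-up logits for a single message, for illustration only.
is_spam, spam_prob = flag_as_spam([-1.2, 2.3], ["normal", "spam"])
print(is_spam, round(spam_prob, 3))
```

Raising the threshold flags fewer messages, which reduces false positives at the cost of letting more spam through.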
## Training Details
### Training Data
The model was trained on [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset), a synthetic dataset of 2,000 Uzbek messages:
| Split | Samples | Spam | Normal |
|-------|---------|------|--------|
| Train | 1,800 | ~900 | ~900 |
| Test | 200 | 94 | 106 |
### Spam Categories Covered
- Aggressive advertising and promotions
- Get-rich-quick schemes
- Unsolicited loan/credit offers
- Fake prize/giveaway announcements
- Clickbait messages
- Channel/group promotion spam
### Training Procedure
| Parameter | Value |
|-----------|-------|
| Base model | `xlm-roberta-base` |
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Precision | FP16 |
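
The hyperparameters above imply a fairly short training run. As a rough sketch, the step counts can be worked out from this table and the 1,800-sample training split. It assumes a single device, no gradient accumulation, and that partial batches and warmup steps are rounded up; the card states none of these.

```python
import math

train_samples = 1_800   # training split size from the dataset table
batch_size = 16
epochs = 3
warmup_ratio = 0.1

# Assumption: single device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warmup length is rounded up to a whole number of steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)  # 113 339 34
```

Under these assumptions the learning rate would ramp up to 2e-5 over roughly the first 34 optimizer steps.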
## Evaluation Results
Performance on the held-out test set (200 samples):
| Metric | Score |
|--------|-------|
| **Accuracy** | 100.0% |
| **F1 Score** | 100.0% |
| **Precision** | 100.0% |
| **Recall** | 100.0% |
### Classification Report
```
              precision    recall  f1-score   support

      normal       1.00      1.00      1.00       106
        spam       1.00      1.00      1.00        94

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
```
> **Note**: The perfect scores reflect the synthetic nature of the dataset: the training and test messages come from the same generated source, so spam and normal messages follow distinct, easily learned patterns. Performance on organic messages, whose spam indicators are often more subtle, may be lower.
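
For reference, the scores in the report follow directly from the confusion counts. The sketch below recomputes them for the spam class, assuming every one of the 94 spam and 106 normal test messages was classified correctly, which is what the 1.00 scores imply. The function name and the second, imperfect scenario are illustrative only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Counts implied by the report: 94 spam and 106 normal, no errors.
perfect = binary_metrics(tp=94, fp=0, fn=0, tn=106)
print(perfect)  # (1.0, 1.0, 1.0, 1.0)

# Hypothetical: two normal messages wrongly flagged as spam.
acc, prec, rec, f1 = binary_metrics(tp=94, fp=2, fn=0, tn=104)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))  # 0.99 0.979 1.0 0.989
```

With only 200 test messages, each misclassification moves accuracy by half a percentage point.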
## Limitations
1. **Synthetic Data**: The model was trained on AI-generated messages, which may not capture all real-world spam patterns.
2. **Domain-Specific**: Optimized for short, Telegram-style messages. Performance may vary on:
- Long-form content
- Formal documents
- Other messaging platforms
3. **Language Coverage**: Primarily tested on Uzbek. The model may behave unpredictably on:
- Code-mixed Uzbek-Russian text
- Heavy use of transliteration
4. **Evolving Spam**: Spam tactics change over time. The model may need retraining to catch new patterns.
## Ethical Considerations
- **False Positives**: The model may incorrectly flag legitimate messages as spam. Always provide users a way to report misclassifications.
- **Bias**: Synthetic training data may contain biases from the generation model.
- **Privacy**: When run locally, the model processes text on your own machine and does not store or transmit user messages.
## Citation
If you use this model in your research or project, please cite:
```bibtex
@misc{uzbek-spam-detector,
author = {Sukhrob Nurali},
title = {Uzbek Spam Detector: Fine-tuned XLM-RoBERTa for Uzbek Spam Classification},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/sukhrobnurali/uzbek-spam-detector}
}
```
## Links
- **Model**: [sukhrobnurali/uzbek-spam-detector](https://huggingface.co/sukhrobnurali/uzbek-spam-detector)
- **Dataset**: [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset)
- **Base Model**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)
## License
This model is released under the [MIT License](https://opensource.org/licenses/MIT).