---
language:
- uz
license: mit
library_name: transformers
tags:
- text-classification
- spam-detection
- uzbek
- telegram
- xlm-roberta
- fine-tuned
datasets:
- sukhrobnurali/uzbek_spam_dataset
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: uzbek-spam-detector
  results:
  - task:
      type: text-classification
      name: Spam Detection
    dataset:
      name: Uzbek Spam Dataset
      type: sukhrobnurali/uzbek_spam_dataset
    metrics:
    - type: accuracy
      value: 1
      name: Accuracy
    - type: f1
      value: 1
      name: F1
    - type: precision
      value: 1
      name: Precision
    - type: recall
      value: 1
      name: Recall
base_model:
- FacebookAI/xlm-roberta-base
---

# Uzbek Spam Detector

A fine-tuned XLM-RoBERTa model for detecting spam in Uzbek-language messages, designed for short, Telegram-style text.

## Model Description

This model classifies Uzbek text messages as either **spam** or **normal** (ham). It was fine-tuned on a synthetic dataset of 2,000 Uzbek messages covering common spam patterns found in Telegram and messaging platforms.

| Property | Value |
|----------|-------|
| **Base Model** | [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) |
| **Language** | Uzbek (Latin & Cyrillic scripts) |
| **Task** | Binary Text Classification |
| **Labels** | `spam`, `normal` |

## Intended Use

### Primary Use Cases
- Spam filtering for Uzbek Telegram bots
- Content moderation for Uzbek social platforms
- Message classification in Uzbek chat applications

### Out-of-Scope Use
- Languages other than Uzbek
- Long-form document classification
- Detecting other types of harmful content (hate speech, etc.)

## How to Use

### Quick Start

```python
from transformers import pipeline

# Load the model
classifier = pipeline("text-classification", model="sukhrobnurali/uzbek-spam-detector")

# Classify messages
result = classifier("Salom! Bugun uchrashuvga kela olasanmi?")
print(result)
# Example output (exact scores will vary): [{'label': 'normal', 'score': 0.98}]

result = classifier("TEZKOR KREDIT! 50% chegirma! Bosing: example.com")
print(result)
# Example output (exact scores will vary): [{'label': 'spam', 'score': 0.99}]
```

### Batch Classification

```python
messages = [
    "Rahmat katta yordam uchun!",
    "Tabriklaymiz! Siz 1000$ yutdingiz!",
    "Kecha juda charchab uyga keldim",
]

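# `classifier` is the pipeline created in Quick Start.
results = classifier(messages)
for msg, res in zip(messages, results):
    # Truncate long messages for display; add an ellipsis only when truncated.
    preview = msg if len(msg) <= 40 else msg[:40] + "..."
    print(f"{res['label']}: {preview}")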
```

### Using with PyTorch

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sukhrobnurali/uzbek-spam-detector")
model = AutoModelForSequenceClassification.from_pretrained("sukhrobnurali/uzbek-spam-detector")

text = "Bizning kanalga obuna bo'ling!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1)

label = model.config.id2label[prediction.item()]
print(f"Prediction: {label}")
```
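
The snippet above returns only the top label. If a confidence score is also needed, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python so the arithmetic is visible; the logit values and the label order are made-up placeholders, not output from this model. In practice, `torch.softmax(outputs.logits, dim=-1)` performs the same computation on the tensor.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the largest logit first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one message. Real values would come from
# `outputs.logits[0].tolist()` and the label order from `model.config.id2label`.
example_logits = [-2.0, 3.0]
example_id2label = {0: "normal", 1: "spam"}  # assumed order, check the model config

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{example_id2label[best]}: {probs[best]:.3f}")
```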

## Training Details

### Training Data

The model was trained on [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset), a synthetic dataset of 2,000 Uzbek messages:

| Split | Samples | Spam | Normal |
|-------|---------|------|--------|
| Train | 1,800 | ~900 | ~900 |
| Test | 200 | 94 | 106 |

### Spam Categories Covered
- Aggressive advertising and promotions
- Get-rich-quick schemes
- Unsolicited loan/credit offers
- Fake prize/giveaway announcements
- Clickbait messages
- Channel/group promotion spam


### Training Procedure

| Parameter | Value |
|-----------|-------|
| Base model | `xlm-roberta-base` |
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Precision | FP16 |
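
To make the schedule concrete, the sketch below derives the number of optimizer steps and warmup steps implied by the table. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these are stated in this card. The commented-out `TrainingArguments` call shows how the same values might be passed to the Hugging Face `Trainer`. Whether the author used `Trainer` is an assumption, and `output_dir="out"` is a placeholder.

```python
import math

# Values taken from the tables in this card.
TRAIN_SAMPLES = 1800
HYPERPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "fp16": True,  # mixed precision; typically requires a GPU
}

def schedule(train_samples, params):
    """Return (steps_per_epoch, total_steps, warmup_steps)."""
    steps_per_epoch = math.ceil(train_samples / params["per_device_train_batch_size"])
    total_steps = steps_per_epoch * params["num_train_epochs"]
    warmup_steps = math.ceil(total_steps * params["warmup_ratio"])
    return steps_per_epoch, total_steps, warmup_steps

steps_per_epoch, total_steps, warmup_steps = schedule(TRAIN_SAMPLES, HYPERPARAMS)
print(steps_per_epoch, total_steps, warmup_steps)

# Assumed usage with the Hugging Face Trainer (not confirmed by this card):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **HYPERPARAMS)
```

Under these assumptions the learning rate ramps up linearly over roughly the first tenth of training before the scheduler takes over.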

## Evaluation Results

Performance on the held-out test set (200 samples):

| Metric | Score |
|--------|-------|
| **Accuracy** | 100.0% |
| **F1 Score** | 100.0% |
| **Precision** | 100.0% |
| **Recall** | 100.0% |

### Classification Report

```
              precision    recall  f1-score   support

      normal       1.00      1.00      1.00       106
        spam       1.00      1.00      1.00        94

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
```
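
For reference, the figures in the report follow directly from the confusion counts. A minimal sketch in plain Python, treating `spam` as the positive class and using the support values above (94 spam, 106 normal) with zero errors. The second call uses a hypothetical single missed spam message to show how the scores would move.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive (spam) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Counts implied by the report: all 200 test messages classified correctly.
reported = binary_metrics(tp=94, fp=0, fn=0, tn=106)
print(reported)

# Hypothetical: one spam message slips through as normal.
one_miss = binary_metrics(tp=93, fp=0, fn=1, tn=106)
print({name: round(value, 4) for name, value in one_miss.items()})
```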

> **Note**: The perfect scores most likely reflect the synthetic nature of the data. The test set comes from the same synthetic dataset as the training set, so spam and normal messages follow distinct, easily learned patterns. Expect lower scores on organic messages, where spam indicators are subtler.
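
One way to read a perfect score on 200 samples: zero observed errors does not imply a zero error rate. The sketch below computes a one-sided 95% upper bound on the true error rate, assuming the test messages are independent draws from the same synthetic distribution. This is a standard binomial argument, not an analysis reported by the author.

```python
def zero_error_upper_bound(n_samples, confidence=0.95):
    """Largest error rate still consistent with 0 errors in n_samples trials.

    Solves (1 - p) ** n_samples = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)

bound = zero_error_upper_bound(200)
print(f"Upper bound on error rate for similar data: {bound:.2%}")
```

The bound only covers data that resembles the test set; it says nothing about organic Telegram messages.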

## Limitations

1. **Synthetic Data**: The model was trained on AI-generated messages, which may not capture all real-world spam patterns.

2. **Domain-Specific**: Optimized for short, Telegram-style messages. Performance may degrade on:
   - Long-form content
   - Formal documents
   - Other messaging platforms

3. **Language Coverage**: Primarily tested on Uzbek. Behavior may be unpredictable on:
   - Code-mixed Uzbek-Russian text
   - Heavy use of transliteration

4. **Evolving Spam**: Spam tactics change over time. The model may need retraining to catch new patterns.

## Ethical Considerations

- **False Positives**: The model may incorrectly flag legitimate messages as spam. Always give users a way to report misclassifications.
- **Bias**: Synthetic training data may contain biases from the generation model.
- **Privacy**: When run locally, the model does not store or transmit user messages. If you call it through a hosted inference service, that service's data handling applies.
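
Because false positives are the main risk named above, one common mitigation is to act only on high-confidence spam predictions and route the rest to review. A minimal sketch over the list-of-dicts output that the `text-classification` pipeline returns; the `triage` helper, the action names, and the 0.90 threshold are illustrative assumptions, not part of this model.

```python
def triage(predictions, spam_threshold=0.90):
    """Map pipeline predictions to actions: 'block', 'review' or 'allow'."""
    actions = []
    for pred in predictions:
        if pred["label"] != "spam":
            actions.append("allow")
        elif pred["score"] >= spam_threshold:
            actions.append("block")
        else:
            # Low-confidence spam: leave the decision to a human or the user.
            actions.append("review")
    return actions

# Hypothetical pipeline output for three messages (scores are made up).
sample = [
    {"label": "normal", "score": 0.98},
    {"label": "spam", "score": 0.99},
    {"label": "spam", "score": 0.62},
]
print(triage(sample))
```

A suitable threshold depends on how costly a blocked legitimate message is for your users, so it is worth tuning on real traffic rather than on the synthetic test set.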

## Citation

If you use this model in your research or project, please cite:

```bibtex
@misc{uzbek-spam-detector,
  author = {Sukhrob Nurali},
  title = {Uzbek Spam Detector: Fine-tuned XLM-RoBERTa for Uzbek Spam Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/sukhrobnurali/uzbek-spam-detector}
}
```

## Links

- **Model**: [sukhrobnurali/uzbek-spam-detector](https://huggingface.co/sukhrobnurali/uzbek-spam-detector)
- **Dataset**: [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset)
- **Base Model**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).