---
language:
- uz
license: mit
library_name: transformers
tags:
- text-classification
- spam-detection
- uzbek
- telegram
- xlm-roberta
- fine-tuned
datasets:
- sukhrobnurali/uzbek_spam_dataset
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: uzbek-spam-detector
  results:
  - task:
      type: text-classification
      name: Spam Detection
    dataset:
      name: Uzbek Spam Dataset
      type: sukhrobnurali/uzbek_spam_dataset
    metrics:
    - type: accuracy
      value: 1
      name: Accuracy
    - type: f1
      value: 1
      name: F1
    - type: precision
      value: 1
      name: Precision
    - type: recall
      value: 1
      name: Recall
base_model:
- FacebookAI/xlm-roberta-base
---

# Uzbek Spam Detector

A fine-tuned XLM-RoBERTa model for detecting spam messages in the Uzbek language, designed specifically for short, Telegram-style messages.

## Model Description

This model classifies Uzbek text messages as either **spam** or **normal** (ham). It was fine-tuned on a synthetic dataset of 2,000 Uzbek messages covering common spam patterns found on Telegram and other messaging platforms.

| Property | Value |
|----------|-------|
| **Base Model** | [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) |
| **Language** | Uzbek (Latin & Cyrillic scripts) |
| **Task** | Binary Text Classification |
| **Labels** | `spam`, `normal` |

## Intended Use

### Primary Use Cases
- Spam filtering for Uzbek Telegram bots
- Content moderation for Uzbek social platforms
- Message classification in Uzbek chat applications

### Out-of-Scope Use
- Languages other than Uzbek
- Long-form document classification
- Detecting other types of harmful content (hate speech, etc.)

## How to Use

### Quick Start

```python
from transformers import pipeline

# Load the model
classifier = pipeline("text-classification", model="sukhrobnurali/uzbek-spam-detector")

# Classify messages (the scores shown are examples; exact values will vary)
result = classifier("Salom! Bugun uchrashuvga kela olasanmi?")
print(result)
# e.g. [{'label': 'normal', 'score': 0.98}]

result = classifier("TEZKOR KREDIT! 50% chegirma! Bosing: example.com")
print(result)
# e.g. [{'label': 'spam', 'score': 0.99}]
```

### Batch Classification

```python
messages = [
    "Rahmat katta yordam uchun!",
    "Tabriklaymiz! Siz 1000$ yutdingiz!",
    "Kecha juda charchab uyga keldim",
]

results = classifier(messages)
for msg, res in zip(messages, results):
    print(f"{res['label']}: {msg[:40]}...")
```

### Using with PyTorch

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sukhrobnurali/uzbek-spam-detector")
model = AutoModelForSequenceClassification.from_pretrained("sukhrobnurali/uzbek-spam-detector")

text = "Bizning kanalga obuna bo'ling!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1)

label = model.config.id2label[prediction.item()]
print(f"Prediction: {label}")
```
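
Taking the `argmax` alone discards how confident the model is. The following is a minimal sketch of converting raw logits into per-label probabilities with a softmax, written in plain Python so that it runs without downloading the model. The logit values and the `example_id2label` mapping are made-up placeholders for illustration, not outputs of this model; with the real model, the mapping would come from `model.config.id2label`.

```python
import math

def logits_to_probs(logits, id2label):
    """Apply a softmax to a list of logits and key the result by label name."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}

# Hypothetical values, for illustration only.
example_id2label = {0: "normal", 1: "spam"}
example_logits = [-1.2, 2.3]

probs = logits_to_probs(example_logits, example_id2label)
print(probs)
```

With the PyTorch snippet above, the input list could be obtained with `outputs.logits[0].tolist()`.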

## Training Details

### Training Data

The model was trained on [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset), a synthetic dataset of 2,000 Uzbek messages:

| Split | Samples | Spam | Normal |
|-------|---------|------|--------|
| Train | 1,800 | ~900 | ~900 |
| Test | 200 | 94 | 106 |

### Spam Categories Covered
- Aggressive advertising and promotions
- Get-rich-quick schemes
- Unsolicited loan/credit offers
- Fake prize/giveaway announcements
- Clickbait messages
- Channel/group promotion spam


### Training Procedure

| Parameter | Value |
|-----------|-------|
| Base model | `xlm-roberta-base` |
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Precision | FP16 |
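
As a rough illustration of what these settings imply, the sketch below works out the number of optimizer steps and warmup steps for the 1,800-sample training split. It assumes a single device, no gradient accumulation, a kept final partial batch, and warmup steps rounded up; none of these details are stated in this card, so treat the numbers as an estimate rather than the actual training log.

```python
import math

# Values taken from the table above.
train_samples = 1800
batch_size = 16
epochs = 3
warmup_ratio = 0.1

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch)  # 113
print(total_steps)      # 339
print(warmup_steps)     # 34
```

Under these assumptions the learning rate would ramp up to 2e-5 over roughly the first 34 of 339 steps.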

## Evaluation Results

Performance on the held-out test set (200 samples):

| Metric | Score |
|--------|-------|
| **Accuracy** | 100.0% |
| **F1 Score** | 100.0% |
| **Precision** | 100.0% |
| **Recall** | 100.0% |

### Classification Report

```
              precision    recall  f1-score   support

      normal       1.00      1.00      1.00       106
        spam       1.00      1.00      1.00        94

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
```

> **Note**: The perfect scores most likely reflect the synthetic nature of the data: the test set was generated the same way as the training set, so spam and normal messages follow distinct, easily learnable patterns. Expect lower scores on organic messages with subtler spam indicators.
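
To make the reported figures concrete, here is a small sketch of how accuracy, precision, recall, and F1 follow from confusion-matrix counts, treating `spam` as the positive class. The first call uses the support counts from the report (94 spam, 106 normal, no errors). The second call uses invented error counts purely to show how a few mistakes would move the metrics; it is not a measured result.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Counts implied by the report above: every test message classified correctly.
perfect = binary_metrics(tp=94, fp=0, fn=0, tn=106)
print(perfect)

# Hypothetical counts: 3 spam messages missed, 2 normal messages flagged.
degraded = binary_metrics(tp=91, fp=2, fn=3, tn=104)
print(degraded)
```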

## Limitations

1. **Synthetic Data**: The model was trained on AI-generated messages, which may not capture all real-world spam patterns.

2. **Domain Specific**: Optimized for Telegram-style short messages. Performance may vary on:
   - Long-form content
   - Formal documents
   - Other messaging platforms

3. **Language Coverage**: Primarily tested on Uzbek. Behavior may be unpredictable on:
   - Code-mixed Uzbek-Russian text
   - Heavy use of transliteration

4. **Evolving Spam**: Spam tactics change over time. The model may need retraining to catch new patterns.

## Ethical Considerations

- **False Positives**: The model may incorrectly flag legitimate messages as spam. Always give users a way to report misclassifications.
- **Bias**: Synthetic training data may contain biases from the generation model.
- **Privacy**: The model itself does not store or transmit user messages when run locally. Privacy in a deployed system depends on how the surrounding application handles message data.
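
One common way to reduce the impact of false positives is to act only on high-confidence spam predictions and leave everything else untouched or route it to review. The sketch below assumes results shaped like the `text-classification` pipeline output (a dict with `label` and `score`). The `flag_spam` helper, the `0.9` threshold, and the sample results are hypothetical and should be tuned on real traffic.

```python
def flag_spam(result, threshold=0.9):
    """Return True only for spam predictions at or above the threshold.

    `result` is a dict with 'label' and 'score' keys, as produced by a
    text-classification pipeline. The default threshold is an assumption.
    """
    return result["label"] == "spam" and result["score"] >= threshold

# Hypothetical pipeline-style results, for illustration only.
sample_results = [
    {"label": "spam", "score": 0.99},    # confident spam: flagged
    {"label": "spam", "score": 0.62},    # uncertain spam: not flagged
    {"label": "normal", "score": 0.97},  # normal: not flagged
]

decisions = [flag_spam(r) for r in sample_results]
print(decisions)  # [True, False, False]
```

Raising the threshold trades missed spam for fewer wrongly flagged messages, so the right value depends on which error is more costly for the platform.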

## Citation

If you use this model in your research or project, please cite:

```bibtex
@misc{uzbek-spam-detector,
  author = {Sukhrob Nurali},
  title = {Uzbek Spam Detector: Fine-tuned XLM-RoBERTa for Uzbek Spam Classification},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sukhrobnurali/uzbek-spam-detector}
}
```

## Links

- **Model**: [sukhrobnurali/uzbek-spam-detector](https://huggingface.co/sukhrobnurali/uzbek-spam-detector)
- **Dataset**: [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset)
- **Base Model**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).