---
language:
- uz
license: mit
library_name: transformers
tags:
- text-classification
- spam-detection
- uzbek
- telegram
- xlm-roberta
- fine-tuned
datasets:
- sukhrobnurali/uzbek_spam_dataset
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
model-index:
- name: uzbek-spam-detector
  results:
  - task:
      type: text-classification
      name: Spam Detection
    dataset:
      name: Uzbek Spam Dataset
      type: sukhrobnurali/uzbek_spam_dataset
    metrics:
    - type: accuracy
      value: 1
      name: Accuracy
    - type: f1
      value: 1
      name: F1
    - type: precision
      value: 1
      name: Precision
    - type: recall
      value: 1
      name: Recall
base_model:
- FacebookAI/xlm-roberta-base
---

# Uzbek Spam Detector

A fine-tuned XLM-RoBERTa model for detecting spam messages in the Uzbek language, designed specifically for Telegram-style messages.

## Model Description

This model classifies Uzbek text messages as either **spam** or **normal** (ham). It was fine-tuned on a synthetic dataset of 2,000 Uzbek messages covering common spam patterns found on Telegram and other messaging platforms.

| Property | Value |
|----------|-------|
| **Base Model** | [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) |
| **Language** | Uzbek (Latin & Cyrillic scripts) |
| **Task** | Binary Text Classification |
| **Labels** | `spam`, `normal` |

## Intended Use

### Primary Use Cases

- Spam filtering for Uzbek Telegram bots
- Content moderation for Uzbek social platforms
- Message classification in Uzbek chat applications

### Out-of-Scope Use

- Languages other than Uzbek
- Long-form document classification
- Detecting other types of harmful content (hate speech, etc.)

## How to Use

### Quick Start

```python
from transformers import pipeline

# Load the model
classifier = pipeline("text-classification", model="sukhrobnurali/uzbek-spam-detector")

# Classify messages
result = classifier("Salom! Bugun uchrashuvga kela olasanmi?")
print(result)
# [{'label': 'normal', 'score': 0.98}]

result = classifier("TEZKOR KREDIT! 50% chegirma! Bosing: example.com")
print(result)
# [{'label': 'spam', 'score': 0.99}]
```

### Batch Classification

```python
# Reuses the `classifier` pipeline created in Quick Start
messages = [
    "Rahmat katta yordam uchun!",
    "Tabriklaymiz! Siz 1000$ yutdingiz!",
    "Kecha juda charchab uyga keldim",
]

results = classifier(messages)
for msg, res in zip(messages, results):
    print(f"{res['label']}: {msg[:40]}...")
```

### Using with PyTorch

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sukhrobnurali/uzbek-spam-detector")
model = AutoModelForSequenceClassification.from_pretrained("sukhrobnurali/uzbek-spam-detector")

text = "Bizning kanalga obuna bo'ling!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Map the highest-scoring logit to its label name
prediction = torch.argmax(outputs.logits, dim=-1)
label = model.config.id2label[prediction.item()]
print(f"Prediction: {label}")
```

## Training Details

### Training Data

The model was trained on [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset), a synthetic dataset of 2,000 Uzbek messages:

| Split | Samples | Spam | Normal |
|-------|---------|------|--------|
| Train | 1,800 | ~900 | ~900 |
| Test | 200 | ~100 | ~100 |

### Spam Categories Covered

- Aggressive advertising and promotions
- Get-rich-quick schemes
- Unsolicited loan/credit offers
- Fake prize/giveaway announcements
- Clickbait messages
- Channel/group promotion spam

### Training Procedure

| Parameter | Value |
|-----------|-------|
| Base model | `xlm-roberta-base` |
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Precision | FP16 |

## Evaluation Results

Performance on the held-out test set (200 samples):

| Metric | Score |
|--------|-------|
| **Accuracy** | 100.0% |
| **F1 Score** | 100.0% |
| **Precision** | 100.0% |
| **Recall** | 100.0% |

### Classification Report

```
              precision    recall  f1-score   support

      normal       1.00      1.00      1.00       106
        spam       1.00      1.00      1.00        94

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
```

> **Note**: The perfect scores are due to the synthetic nature of the data, where spam and normal messages have distinct, learnable patterns. Real-world performance may be lower on organic messages that have subtler spam indicators.

## Limitations

1. **Synthetic Data**: The model was trained on AI-generated messages, which may not capture all real-world spam patterns.
2. **Domain Specific**: Optimized for Telegram-style short messages. Performance may vary on:
   - Long-form content
   - Formal documents
   - Other messaging platforms
3. **Language Coverage**: Primarily tested on Uzbek. Behavior may be unpredictable on:
   - Code-mixed Uzbek-Russian text
   - Heavy use of transliteration
4. **Evolving Spam**: Spam tactics change over time. The model may need retraining to catch new patterns.

## Ethical Considerations

- **False Positives**: The model may incorrectly flag legitimate messages as spam. Always provide users a way to report misclassifications.
- **Bias**: Synthetic training data may contain biases from the generation model.
- **Privacy**: When run locally, the model processes text on your own hardware and does not store or transmit user messages.
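
One way to limit the impact of false positives is to auto-flag only high-confidence spam predictions and send the rest to a review or report flow. The following is a minimal sketch, not part of the model or its repository: the `decide` helper, the `"review"` outcome, and the `0.9` threshold are all hypothetical, and the scores in `examples` are illustrative values rather than real model outputs. It assumes predictions arrive as the `{'label': ..., 'score': ...}` dictionaries that the `text-classification` pipeline returns.

```python
from typing import Dict, Union

# Shape of one pipeline prediction, e.g. {'label': 'spam', 'score': 0.99}
Prediction = Dict[str, Union[str, float]]


def decide(prediction: Prediction, spam_threshold: float = 0.9) -> str:
    """Map one prediction to 'spam', 'review', or 'normal' (hypothetical helper)."""
    label = prediction["label"]
    score = float(prediction["score"])
    if label != "spam":
        return "normal"
    # Auto-flag only confident spam; lower-confidence spam goes to manual review.
    return "spam" if score >= spam_threshold else "review"


# Illustrative inputs in the pipeline's output shape (not real model scores).
examples = [
    {"label": "spam", "score": 0.99},
    {"label": "spam", "score": 0.62},
    {"label": "normal", "score": 0.98},
]
decisions = [decide(p) for p in examples]
print(decisions)  # ['spam', 'review', 'normal']
```

In practice the input would be something like `classifier(text)[0]` from the Quick Start example. The threshold is an assumption and should be tuned on real messages, since confidence scores from a model trained on synthetic data may not be well calibrated.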

## Citation

If you use this model in your research or project, please cite:

```bibtex
@misc{uzbek-spam-detector,
  author = {Sukhrob Nurali},
  title = {Uzbek Spam Detector: Fine-tuned XLM-RoBERTa for Uzbek Spam Classification},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sukhrobnurali/uzbek-spam-detector}
}
```

## Links

- **Model**: [sukhrobnurali/uzbek-spam-detector](https://huggingface.co/sukhrobnurali/uzbek-spam-detector)
- **Dataset**: [sukhrobnurali/uzbek_spam_dataset](https://huggingface.co/datasets/sukhrobnurali/uzbek_spam_dataset)
- **Base Model**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).