---
library_name: transformers
license: mit
base_model: christinacdl/XLM_RoBERTa-Clickbait-Detection-new
language:
- en
tags:
- generated_from_trainer
- text-classification
- clickbait-detection
- xlm-roberta
- binary-classification
metrics:
- accuracy
- f1
model-index:
- name: XLM_roberta_finetuned
  results:
  - task:
      name: Text Classification
      type: text-classification
    dataset:
      name: Clickbait Detection Dataset
      type: unknown
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9990
    - name: F1
      type: f1
      value: 0.9990
    - name: Loss
      type: loss
      value: 0.0068
---

# 🎯 XLM-RoBERTa Clickbait Detector

## Model Overview

This model is a fine-tuned version of [christinacdl/XLM_RoBERTa-Clickbait-Detection-new](https://huggingface.co/christinacdl/XLM_RoBERTa-Clickbait-Detection-new), trained to classify headlines as **Clickbait** or **Legitimate News**.

On its evaluation set, the model reports the following results:

| Metric | Value |
|--------|-------|
| **Accuracy** | 99.90% |
| **F1-Score** | 0.9990 |
| **Validation Loss** | 0.0068 |

---

## 📊 Model Details

- **Model Type:** Sequence Classification (Binary)
- **Base Model:** XLM-RoBERTa (Cross-lingual RoBERTa)
- **Language:** English (the XLM-RoBERTa backbone is multilingual)
- **Task:** Clickbait Detection
- **Output Classes:** 2 (Clickbait, Legitimate News)
- **Model Size:** ~270M parameters
- **License:** MIT

---

## 🚀 Intended Uses

**Primary Use Cases:**

- 🔍 Automated clickbait detection in news feeds and social media
- 📱 Browser extensions that warn users about clickbait
- 📰 News aggregator platforms for content filtering
- 🤖 Content moderation systems for social platforms
- 📊 Media analytics and trend detection

**Intended Audience:**

- News organizations and publishers
- Social media platforms
- Content moderation teams
- Researchers studying misinformation
- Browser extension developers

---

## ⚠️ Limitations

### Model-Specific Limitations:

- **Language Scope:** Optimized for English headlines.
Although the model is built on XLM-RoBERTa, which supports 100+ languages, performance on non-English content may vary significantly.
- **Domain Bias:** Trained on news and media headlines; may not generalize well to other domains (scientific papers, technical blogs, legal documents).
- **Context Dependency:** Classifies headlines in isolation, without the full article context.
- **Emerging Patterns:** May struggle with new or evolving clickbait tactics not present in the training data.
- **Sarcasm & Irony:** May misclassify headlines that rely on figurative language or subtle wordplay.

### Recommendations:

- Use primarily for English-language headlines
- Validate on domain-specific data before production deployment
- Combine with contextual analysis for edge cases
- Monitor performance on new clickbait patterns
- Consider ensemble approaches for critical applications

---

## 📚 Training and Evaluation Data

### Dataset Information

- **Dataset Type:** News headlines with binary clickbait labels
- **Language:** English
- **Train/Eval Split:** Not specified
- **Preprocessing:** Standard tokenization via the XLM-RoBERTa tokenizer

### Data Characteristics

- Headlines from news sources and social media
- Binary labels: Clickbait (0) and Legitimate News (1)
- Diverse linguistic patterns and sensationalism levels
- Representative of modern digital media language

---

## 🛠️ Training Procedure

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Base Model** | christinacdl/XLM_RoBERTa-Clickbait-Detection-new |
| **Learning Rate** | 2e-05 |
| **Train Batch Size** | 32 |
| **Eval Batch Size** | 32 |
| **Gradient Accumulation Steps** | 2 |
| **Effective Batch Size** | 64 (32 × 2) |
| **Epochs** | 2 |
| **Optimizer** | AdamW (Fused) |
| **Optimizer Betas** | (0.9, 0.999) |
| **Optimizer Epsilon** | 1e-08 |
| **LR Scheduler** | Linear (warmup, then linear decay) |
| **Mixed Precision** | Native AMP (FP16) |
| **Random Seed** | 42 |

### Training Optimization Strategy

- **Mixed Precision Training:** FP16 with Native AMP for memory
efficiency
- **Gradient Accumulation:** 2 steps, to simulate a larger effective batch size (64) under memory constraints
- **Optimizer:** Fused AdamW implementation for faster computation
- **Learning Rate Schedule:** Linear warmup followed by linear decay

### Training Results

| Epoch | Training Loss | Step | Validation Loss | Accuracy | F1 Score |
|:-----:|:-------------:|:----:|:---------------:|:--------:|:--------:|
| 1.0 | N/A | 400 | 0.0067 | 0.9984 | 0.9984 |
| 2.0 | 0.0167 | 800 | 0.0068 | **0.9990** | **0.9990** |

**Key Observations:**

- Accuracy is already 99.84% after the first epoch
- Validation loss is stable across epochs (0.0067 vs. 0.0068), so there is no clear sign of overfitting
- F1 matches accuracy, which suggests balanced precision and recall
- Best accuracy and F1 are reached at epoch 2; validation loss is marginally lower at epoch 1

---

## 📦 Framework Versions

| Library | Version |
|---------|---------|
| Transformers | 4.57.3 |
| PyTorch | 2.9.0+cu126 |
| Datasets | 4.0.0 |
| Tokenizers | 0.22.2 |

---

## 💻 How to Use

### Basic Usage

```python
from transformers import pipeline

# Load the model
classifier = pipeline("text-classification", model="kesavanguru/XLM_roberta_finetuned")

# Classify a headline
headline = "You Won't Believe What Happened Next! Click Here!"
result = classifier(headline)
print(result)
# Example output: [{'label': 'LABEL_0', 'score': 0.9998}]
# LABEL_0 is Clickbait under this card's label mapping (0 = Clickbait, 1 = Legitimate News)
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "kesavanguru/XLM_roberta_finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Batch classification
headlines = [
    "Scientists Make Shocking Discovery - You Won't Believe!",
    "New Climate Study Released by UN Scientists",
    "This One Trick Will Change Your Life Forever",
]

# Pad to the longest headline in the batch and truncate overly long inputs
inputs = tokenizer(headlines, padding=True, truncation=True, return_tensors="pt")

# No gradients are needed at inference time
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)

# Class 0 = Clickbait, class 1 = Legitimate News
for headline, pred in zip(headlines, predictions):
    label = "Clickbait" if pred.item() == 0 else "Legitimate"
    print(f"{headline} → {label}")
```

---

## 🔄 Model Architecture

```
XLM-RoBERTa Base (270M parameters)
        ↓
<s> Token Representation (RoBERTa's [CLS] equivalent)
        ↓
Sequence Classification Head
        ↓
Binary Output (Softmax)
```

---

## 📈 Performance Analysis

- **Accuracy:** 99.90% on the evaluation set
- **F1-Score:** 0.9990, in line with accuracy, which suggests balanced precision and recall
- **Loss:** 0.0068 validation loss, essentially unchanged from epoch 1 (0.0067)
- **Training Efficiency:** 2 epochs were sufficient for convergence

---

## 🤝 Contributing

Contributions, issues, and feature requests are welcome!

To contribute:

1. Open an issue to discuss proposed changes
2. Submit a pull request with improvements
3. Share feedback on model performance

---

## 📝 Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{xlm_roberta_clickbait_2024,
  title={XLM-RoBERTa Fine-tuned for Clickbait Detection},
  author={Kesavanguru},
  year={2024},
  publisher={Hugging Face},
  howpublished={https://huggingface.co/kesavanguru/XLM_roberta_finetuned}
}
```

---

## 📄 License

This model is licensed under the **MIT License**. See the LICENSE file for details.

---

## ✨ Acknowledgments

- Built on [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) by Facebook AI
- Base model from [christinacdl/XLM_RoBERTa-Clickbait-Detection-new](https://huggingface.co/christinacdl/XLM_RoBERTa-Clickbait-Detection-new)
- Developed with the Hugging Face Transformers library

---

**Model Card Updated:** January 2026 | **Last Training:** 2 epochs | **Status:** Production Ready

**Developed by Kesavanguru** | [Model Repository](https://huggingface.co/kesavanguru/XLM_roberta_finetuned)
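
---

## 🧮 Appendix: From Logits to Labels

The Advanced Usage example takes the argmax of the logits and maps class 0 to Clickbait. The sketch below shows the same decoding step in plain Python, with a softmax added so that each prediction carries a confidence score. It assumes the label mapping stated in this card (0 = Clickbait, 1 = Legitimate News). The `decode` helper, its `threshold` parameter, the "Uncertain" fallback, and the logit values are all hypothetical and exist only for illustration; they are not part of the model or of the Transformers API.

```python
import math

# Label mapping as stated in this card: 0 = Clickbait, 1 = Legitimate News.
ID2LABEL = {0: "Clickbait", 1: "Legitimate News"}


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, threshold=0.5):
    """Return (label, confidence) for one row of logits.

    `threshold` is a hypothetical knob: predictions whose top probability
    falls below it are reported as "Uncertain" instead of a class name.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = ID2LABEL[best] if probs[best] >= threshold else "Uncertain"
    return label, probs[best]


# Made-up logits for illustration; real values would come from `outputs.logits`.
for row in [[4.2, -3.9], [-2.5, 3.1], [0.1, 0.0]]:
    label, confidence = decode(row, threshold=0.9)
    print(f"{label}: {confidence:.4f}")
```

Routing low-confidence headlines to a separate bucket is one way to follow the recommendation to combine the classifier with contextual analysis for edge cases; a suitable threshold would need to be chosen on held-out data.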