---
title: Hatespeech Detection
emoji: 🛡️
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: testing huggingface spaces
license: mit
---

# 🛡️ Hate Speech Detection Streamlit App

A professional web application for detecting hate speech using advanced NLP with explainable AI.

## Features

- **Real-time Hate Speech Detection**: Classify text as hate speech or not
- **Batch File Processing**: Upload CSV files to analyze multiple texts at once (up to 200 rows)
- **Explainable AI**: See which words influenced the prediction
- **Token Importance Visualization**: Color-coded highlighting of important tokens
- **Probability Distribution**: Visual representation of model confidence
- **Performance Metrics**: View F1 score, accuracy, precision, recall, and confusion matrix for batch processing
- **Resource Monitoring**: Track CPU and memory usage during batch predictions
- **Professional UI**: Clean, modern interface with interactive elements

## Installation

1. Install the required packages:

```bash
uv sync
```

## Running the Application

Run the Streamlit app with:

```bash
uv run main.py
```

The application will open in your default web browser at `http://localhost:8501`

## Usage

### Single Text Analysis

1. **Enter Text**: Type or paste text into the main input area
2. **Optional Context**: Provide additional context or rationale (optional)
3. **Analyze**: Click the "🔍 Analyze Text" button
4. **View Results**:
   - See the classification (Hate Speech or Not Hate Speech)
   - View confidence scores and probability distribution
   - Explore token importance visualization
   - Check which words influenced the decision

### Batch File Analysis

1. **Enable File Upload**: Check the "Enable File Upload" option in the sidebar
2. **Upload CSV File**: Click "Browse files" and select your CSV file
   - Required columns: `text`, `CF_Rationales`, `label`
   - Maximum recommended: 200 rows
3. **Preview Data**: Review the file statistics and preview
4. **Analyze**: Click the "🔍 Analyze Text" button
5. **View Results**:
   - Classification metrics (F1 score, accuracy, precision, recall)
   - Confusion matrix heatmap
   - CPU and memory usage statistics
   - Processing time and performance summary

### CSV File Format

Your CSV file should contain the following columns:

- `text`: The text to analyze for hate speech
- `CF_Rationales`: Contextual rationale or explanation (can be empty)
- `label`: Ground truth label (0 = not hate speech, 1 = hate speech)

Example:

```csv
text,CF_Rationales,label
"This is a sample text",Some context here,0
"Another example text",More context,1
```

## Model Information

- **Architecture**: HateBERT + Rationale BERT + Multi-Scale CNN + Attention
- **Model Repository**: [seffyehl/BetterShield](https://huggingface.co/seffyehl/BetterShield)
- **Training Details**:
  - Batch Size: 8
  - Learning Rate: 1e-5
  - Weight Decay: 0.05
  - Best Validation Loss: 0.27

## Files

- `app.py` - Main Streamlit application
- `hatespeech_model.py` - Model loading and prediction functions
- `requirements.txt` - Python dependencies
- `README.md` - This file

## Troubleshooting

### Model Loading Issues

If the model fails to load:

- Check your internet connection (model downloads from Hugging Face)
- Ensure you have enough disk space (~500MB for model files)
- The first run will take longer as it downloads the model

### Memory Issues

If you encounter memory errors:

- The model requires approximately 2GB of RAM
- Close other applications to free up memory
- Use CPU mode if GPU memory is limited

## Configuration

You can modify settings in the sidebar:

- **Enable File Upload**: Toggle between single text and batch file processing
- **Show Token Importance**: Toggle token highlighting
- **Show Probability Distribution**: Toggle probability chart
- **Show Technical Details**: View raw model outputs

## Performance Optimizations

The application includes several optimizations for efficient batch processing:

- **Selective column loading**: Only loads required CSV columns to reduce memory usage
- **Optimized resource monitoring**: Samples CPU/memory every 10th prediction instead of every prediction
- **No blocking delays**: Removed sleep intervals from performance tracking
- **Memory efficient**: Processes up to 200 rows with minimal memory overhead (~15-20MB reduction)

## Examples

Try the built-in examples:

- **Hate Speech Example**: Clear example of offensive content
- **Not Hate Speech Example**: Disagreement expressed respectfully
- **Borderline Example**: Strong criticism without hate

## Credits

Model trained using best practices:

- Early stopping to prevent overfitting
- Batch size optimization (8 vs 16)
- Proper regularization (weight decay, dropout)
- Extensive hyperparameter tuning

## License

MIT License
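
## Appendix: Checking a CSV File Before Upload

The CSV requirements described under "CSV File Format" (the `text`, `CF_Rationales`, and `label` columns, binary labels, and the recommended 200-row limit) can be checked locally before uploading. The following is a minimal sketch using only the Python standard library. The function name `check_batch_csv` and its messages are hypothetical and are not part of this app's code; the app's own validation may differ.

```python
import csv
import io

# Column names and row limit are taken from this README.
REQUIRED_COLUMNS = ("text", "CF_Rationales", "label")
MAX_ROWS = 200


def check_batch_csv(handle):
    """Hypothetical helper: return a list of problems; an empty list means none found."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    row_count = 0
    for row_count, row in enumerate(reader, start=1):
        # Short rows yield None for absent fields, hence the `or ""` guards.
        if not (row["text"] or "").strip():
            problems.append(f"row {row_count}: empty text")
        if (row["label"] or "").strip() not in {"0", "1"}:
            problems.append(f"row {row_count}: label must be 0 or 1")
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceeds the recommended {MAX_ROWS}")
    return problems


# The example file from the "CSV File Format" section.
sample = io.StringIO(
    "text,CF_Rationales,label\n"
    '"This is a sample text",Some context here,0\n'
    '"Another example text",More context,1\n'
)
print(check_batch_csv(sample))  # prints []
```

To check a file on disk, pass an open handle instead, for example `check_batch_csv(open("data.csv", newline="", encoding="utf-8"))`, where `data.csv` is a placeholder path.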