---
language:
- ur
- en
license: mit
library_name: transformers
tags:
- audio-classification
- speech-emotion-recognition
- multilingual
- urdu
- english
- wav2vec2
- xls-r
- emotion-detection
- multimodal-ai
- bilingual
datasets:
- RAVDESS
- CREMA-D
- UrduSER
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: audio-classification
model-index:
- name: urdu-ser-model
  results:
  - task:
      type: audio-classification
      name: Speech Emotion Recognition
    dataset:
      name: UrduSER + RAVDESS + CREMA-D (Bilingual Urdu-English)
      type: bilingual-speech-emotion
    metrics:
    - type: accuracy
      value: 0.573
    - type: f1-weighted
      value: 0.554
    - type: f1-macro
      value: 0.561
---

# Bilingual Speech Emotion Recognition Model (Urdu + English)

## Model Overview

This model performs **bilingual Speech Emotion Recognition (SER)** from audio input in **both Urdu and English languages**. It is a fine-tuned version of the multilingual **facebook/wav2vec2-xls-r-300m** model, trained on a combined dataset of English (RAVDESS + CREMA-D) and Urdu (UrduSER) emotional speech to predict 7 Ekman emotions.

The model is part of the **"Multimodal AI Mental Health Companion"** Final Year Project (FYP) and is specifically designed for **code-switched and multilingual emotional speech analysis** in Pakistani contexts where speakers often mix Urdu and English.

### Supported Emotions (7 Ekman Classes)

| Label ID | Emotion |
|----------|---------|
| 0 | anger |
| 1 | disgust |
| 2 | fear |
| 3 | joy |
| 4 | neutral |
| 5 | sadness |
| 6 | surprise |

### Language Capabilities

| Language | Supported | Training Data | Notes |
|----------|-----------|---------------|-------|
| **Urdu** | ✅ Yes | UrduSER (~3,500 samples) | Primary target language |
| **English** | ✅ Yes | RAVDESS + CREMA-D (~8,900 samples) | Strong performance |
| **Code-Switched** | ⚠️ Partial | Not explicitly trained | May work due to bilingual base model |

---

## Model Details

- **Model ID:** `muhammadsuleman1533/urdu-ser-model`
- **Task:** Audio Classification / Speech Emotion Recognition
- **Languages:** Urdu & English (Bilingual)
- **Base Model:** `facebook/wav2vec2-xls-r-300m` (Pre-trained on 128 languages)
- **Framework:** PyTorch + Hugging Face Transformers
- **Model Size:** 0.3B Parameters
- **Developed by:** Muhammad Suleman (Team Leader: Muhammad)
- **License:** MIT

### Why wav2vec2 XLS-R for Bilingual SER?

The XLS-R (300m) model was selected because it is pre-trained on **128 languages**, including both **Urdu and English**. This makes it well suited for:

1. **Cross-lingual transfer:** Knowledge learned from high-resource English emotional speech (RAVDESS, CREMA-D) transfers to improve Urdu emotion recognition.
2. **Bilingual robustness:** The shared multilingual representations help handle the code-switched Urdu-English speech common in urban Pakistani populations.
3. **Low-resource adaptation:** The model leverages pre-trained Urdu speech features despite the limited availability of Urdu SER data.

---

## Dataset Information

The model was trained on a **bilingual hybrid corpus** combining English and Urdu emotional speech to maximize generalization for both languages.

### Dataset Composition

| Dataset | Language | Samples | Speakers | Type | Emotion Labels |
|---------|----------|---------|----------|------|----------------|
| **RAVDESS** | English | ~1,440 | 24 (12M/12F) | Acted | 8 emotions |
| **CREMA-D** | English | ~7,442 | 91 (48M/43F) | Acted | 6 emotions |
| **UrduSER** | Urdu | ~3,500 | ~40 | Acted | 7 emotions |
| **TOTAL** | **Bilingual** | **12,376** | **~155** | **Acted** | **7 (standardized)** |

### Language Distribution

| Language | Total Samples | Percentage |
|----------|---------------|------------|
| **English** | ~8,882 | 71.8% |
| **Urdu** | ~3,494 | 28.2% |
| **Total** | 12,376 | 100% |

### Data Split (Speaker-Disjoint)

A **GroupShuffleSplit** keyed on `speaker_id` was used to ensure **zero speaker overlap** between the train and test sets. This measures the model's ability to recognize emotion in *unseen voices* across both languages.

| Split | Samples | English | Urdu |
|-------|---------|---------|------|
| **Training** | 9,619 | ~6,900 | ~2,719 |
| **Testing** | 2,757 | ~1,982 | ~775 |
| **Total** | 12,376 | ~8,882 | ~3,494 |

### Emotion Distribution & Class Imbalance

The combined dataset is **moderately imbalanced**, with `surprise` heavily under-represented: it appears only in the English data and is absent from the Urdu data. The per-emotion counts below are approximate.

| Emotion | Total Samples | English | Urdu | Class Weight |
|---------|---------------|---------|------|--------------|
| anger | ~1,963 | ~1,400 | ~563 | 0.898 |
| disgust | ~1,963 | ~1,400 | ~563 | 0.898 |
| fear | ~1,963 | ~1,400 | ~563 | 0.899 |
| joy | ~1,963 | ~1,400 | ~563 | 0.898 |
| neutral | ~2,375 | ~1,800 | ~575 | 0.759 |
| sadness | ~1,963 | ~1,400 | ~563 | 0.899 |
| **surprise** | **~192** | **~192** | **0** | **8.588** |

**Important Notes:**

- **Surprise class:** Only present in the English data (RAVDESS). UrduSER does not contain surprise samples.
- **Mitigation:** Weighted Cross-Entropy Loss was used, with `surprise` assigned a class weight of 8.588 (versus roughly 0.76-0.90 for the other classes) to compensate for its extreme under-representation.

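The class weights listed above are consistent with the common "balanced" heuristic, weight = N / (K × n_c), applied to the training split: for `surprise`, 9,619 / (7 × 160) ≈ 8.588. The card does not state the formula, so treat the following as a minimal sketch under that assumption. The per-class training counts are approximations (rounded totals minus the test support reported in the class-wise results), and `balanced_class_weights` is a hypothetical helper name. Ordered by label ID, weights like these could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, which is one way to implement a weighted cross-entropy loss in PyTorch.

```python
def balanced_class_weights(counts):
    """Return N / (K * n_c) for each class, given per-class sample counts."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Approximate TRAINING-split counts, assumed for illustration only.
# Each value is the rounded total from the table minus the test support,
# so the sum is close to, but not exactly, the 9,619 training samples.
train_counts = {
    "anger": 1530,
    "disgust": 1532,
    "fear": 1530,
    "joy": 1530,
    "neutral": 1813,
    "sadness": 1530,
    "surprise": 160,
}

weights = balanced_class_weights(train_counts)

# Print one weight per class; rare classes receive larger weights.
for label, weight in weights.items():
    print(f"{label:<9} {weight:.3f}")
```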
---

## Training Details

### Training Configuration

| Parameter | Value |
|-----------|-------|
| **Epochs** | 10 (with Early Stopping Patience=3) |
| **Batch Size** | 2 (Effective 16 via Gradient Accumulation ×8) |
| **Learning Rate** | 2e-5 |
| **Optimizer** | AdamW |
| **Loss Function** | Weighted Cross-Entropy Loss |
| **LR Scheduler** | Cosine with 10% Warmup |
| **Weight Decay** | 0.01 |
| **Max Audio Length** | 10 seconds (160,000 samples @ 16kHz) |
| **Mixed Precision** | FP16 |
| **Hardware** | NVIDIA T4 GPU (Google Colab) |
| **Training Duration** | ~7-8 hours |

### Freezing Strategy

To prevent catastrophic forgetting of pre-trained multilingual speech representations:

- **Frozen:** CNN feature extractor layers (`wav2vec2.feature_extractor`)
- **Trainable:** Transformer encoder layers + Classification head

### Preprocessing Pipeline

1. **Audio Loading:** All files loaded with `librosa` at **16kHz** mono
2. **Duration Filtering:** Files outside **0.5-10 seconds** filtered out (removed 6 corrupted files)
3. **Feature Extraction:** `Wav2Vec2FeatureExtractor` from `facebook/wav2vec2-xls-r-300m`
4. **Label Encoding:** 7 emotions mapped to numeric IDs (0-6)
5. **Standardization:** All emotions mapped to 7 Ekman classes (e.g., "calm" → "neutral", "happy" → "joy")

---

## Evaluation Metrics

### Overall Performance (Final Model)

The model was trained for 10 epochs with early stopping (patience=3). Training converged at epoch 6.7.

| Metric | Value |
|--------|-------|
| **Accuracy** | 0.573 (57.3%) |
| **Weighted F1** | 0.554 |
| **Macro F1** | 0.561 |
| **Validation Loss** | 1.208 |

### Per-Language Performance (Estimated)

Per-language metrics were not computed separately, because the test samples are not labeled by language. The figures below are rough estimates based on dataset composition, not measured results:

| Language | Estimated Accuracy | Notes |
|----------|-------------------|-------|
| **English** | ~60-65% | More training data (72% of corpus), better representation |
| **Urdu** | ~45-50% | Less data (28% of corpus) but benefits from multilingual transfer |
| **Code-Switched** | Unknown | Not explicitly trained, performance may vary |

### Class-wise Performance (Actual Results)

The following table shows the measured per-class performance on the 2,757 test samples:

| Emotion | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| **anger** | 0.690 | 0.880 | 0.774 | 433 |
| **disgust** | 0.387 | 0.281 | 0.325 | 431 |
| **fear** | 0.660 | 0.430 | 0.520 | 433 |
| **joy** | 0.593 | 0.524 | 0.556 | 433 |
| **neutral** | 0.510 | 0.842 | 0.635 | 562 |
| **sadness** | 0.688 | 0.367 | 0.479 | 433 |
| **surprise** | 0.464 | 1.000 | 0.634 | 32 |
| **Weighted Avg** | **0.583** | **0.573** | **0.554** | **2,757** |

### Key Observations

**Strong Performance (F1 > 0.60):**

- **Anger (0.774):** Excellent recall (88%); the model rarely misses anger when it is present. High intensity and distinct prosodic features make this emotion easily recognizable across both languages.
- **Neutral (0.635):** Very high recall (84%); the model effectively identifies non-emotional speech, though precision is moderate (51%) due to some confusion with joy.
- **Surprise (0.634):** Despite having only 32 test samples (all English), the model correctly identifies every surprise sample (100% recall). Precision is lower (46%) because it sometimes misclassifies fear as surprise.

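As a sanity check on the class-wise table, the aggregate scores can be recomputed from the per-class precision, recall, and support. The sketch below is not part of the original evaluation code; the dictionary simply restates the table values, and the function name `f1_from_pr` is hypothetical. Accuracy is recovered as support-weighted recall, which holds for single-label classification. Recomputed this way, the values agree with the reported accuracy (0.573), weighted F1 (0.554), and macro F1 (0.561) to within rounding.

```python
# (precision, recall, support) per class, copied from the class-wise table.
per_class = {
    "anger": (0.690, 0.880, 433),
    "disgust": (0.387, 0.281, 431),
    "fear": (0.660, 0.430, 433),
    "joy": (0.593, 0.524, 433),
    "neutral": (0.510, 0.842, 562),
    "sadness": (0.688, 0.367, 433),
    "surprise": (0.464, 1.000, 32),
}


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = {label: f1_from_pr(p, r) for label, (p, r, _) in per_class.items()}
total = sum(support for _, _, support in per_class.values())

# Macro F1: unweighted mean over classes, so rare classes count equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted F1: each class contributes in proportion to its support.
weighted_f1 = sum(f1[label] * s for label, (_, _, s) in per_class.items()) / total

# Accuracy equals support-weighted recall in single-label classification.
accuracy = sum(r * s for _, r, s in per_class.values()) / total

print(f"accuracy={accuracy:.3f} weighted_f1={weighted_f1:.3f} macro_f1={macro_f1:.3f}")
```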
**Moderate Performance (F1 0.45-0.60):**

- **Joy (0.556):** Moderate performance with balanced precision and recall. Some confusion with neutral speech.
- **Fear (0.520):** Decent precision (66%) but lower recall (43%); the model misses many fear samples, likely confusing them with surprise or sadness.
- **Sadness (0.479):** Good precision (69%) but very low recall (37%); the model is conservative in predicting sadness and misses many true sadness samples.

**Weak Performance (F1 < 0.45):**

- **Disgust (0.325):** The most challenging emotion. Low recall (28%) indicates the model struggles to distinguish disgust from anger, which shares similar acoustic properties.

### Training Progress

| Step | Training Loss | Validation Loss | Accuracy | F1 Weighted | F1 Macro |
|------|---------------|-----------------|----------|-------------|----------|
| 300 | 11.562 | 1.926 | 0.157 | 0.043 | 0.039 |
| 900 | 10.939 | 1.806 | 0.351 | 0.269 | 0.226 |
| 1500 | 9.505 | 1.514 | 0.413 | 0.329 | 0.314 |
| 2100 | 8.083 | 1.445 | 0.435 | 0.367 | 0.365 |
| 2700 | 7.821 | 1.304 | 0.493 | 0.447 | 0.440 |
| 3300 | 7.096 | 1.420 | 0.479 | 0.421 | 0.406 |
| 3900 | 6.836 | 1.241 | 0.549 | 0.520 | 0.532 |
| 4500 | 6.129 | 1.208 | 0.573 | 0.554 | 0.560 |
| 5400 | 5.999 | 1.286 | 0.556 | 0.529 | 0.524 |

Validation metrics improved over most of training, with a temporary dip at step 3300. The best F1-weighted score (0.554) was reached at step 4500 (epoch ~5.6); by step 5400, validation loss had risen and both accuracy and F1 had regressed, even though training loss kept falling.

---

## How to Use the Model

### Installation

```bash
pip install transformers torch librosa soundfile