Commit f59ccbc (verified) · committed by tomasBernal · 1 parent: debe4f1

Create README.md

Files changed (1)
  1. README.md +226 -0
README.md ADDED
@@ -0,0 +1,226 @@
+ ---
+ language:
+ - en
+ license: mit
+ library_name: transformers
+ pipeline_tag: audio-classification
+ tags:
+ - emotion-recognition
+ - speech-emotion-recognition
+ - multimodal-learning
+ - audio-classification
+ - speech-processing
+ - text-processing
+ - english
+ - affective-computing
+ - umuteam
+ datasets:
+ - RAVDESS
+ - TESS
+ metrics:
+ - accuracy
+ - f1
+
+ model-index:
+ - name: UMUTeam/w2v-bert-beto-concat-emotion-en
+   results:
+   - task:
+       type: audio-classification
+       name: Multimodal Speech Emotion Recognition
+     dataset:
+       name: English Multimodal Emotion Recognition Benchmark
+       type: custom
+     metrics:
+     - type: accuracy
+       value: 96.0462
+       name: Accuracy
+     - type: weighted-f1
+       value: 96.0257
+       name: Weighted F1
+     - type: macro-f1
+       value: 96.0462
+       name: Macro F1
+ ---
+
+ # UMUTeam/w2v-bert-beto-concat-emotion-en
+
+ ## Model description
+
+ `UMUTeam/w2v-bert-beto-concat-emotion-en` is an English multimodal emotion recognition model developed as part of **speech-emotion**, an open-source multilingual and multimodal toolkit for emotion recognition from speech, text, and multimodal inputs.
+
+ This model performs **multimodal emotion classification from English speech and text inputs**.
+
+ The model combines acoustic representations extracted with Wav2Vec2-BERT and linguistic representations generated with RoBERTa using a concatenation-based multimodal fusion strategy.
+
+ It is designed to jointly exploit complementary emotional information from speech and text to improve emotion recognition performance over unimodal approaches.
+
+ The model predicts one of the following emotion labels:
+
+ - `angry`
+ - `disgust`
+ - `fear`
+ - `happy`
+ - `neutral`
+ - `sad`
+ - `surprise`
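
The concatenation-based fusion described above can be illustrated with a minimal, dependency-free sketch. It is not the toolkit's implementation: the embedding sizes, the random weights, and the single linear classification head are all assumptions made for illustration, since the model card does not specify the head architecture.

```python
import math
import random

# The seven labels listed in the model card.
LABELS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def concat_fusion(audio_vec, text_vec):
    """Concatenate the two modality embeddings into one feature vector."""
    return list(audio_vec) + list(text_vec)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias):
    """Hypothetical linear head: one weight row per label, then argmax."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs


# Toy inputs: the dimensions (4 and 3) are illustrative, not the real sizes.
rng = random.Random(0)
audio_vec = [rng.uniform(-1, 1) for _ in range(4)]
text_vec = [rng.uniform(-1, 1) for _ in range(3)]

fused = concat_fusion(audio_vec, text_vec)
weights = [[rng.uniform(-1, 1) for _ in fused] for _ in LABELS]
bias = [0.0] * len(LABELS)

label, probs = classify(fused, weights, bias)
print("Predicted label:", label)
```

The point of the sketch is only that concatenation keeps every dimension of both embeddings, so the classifier sees a vector whose length is the sum of the two input lengths.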
+
+ ## Intended use
+
+ This model is intended for research and applied scenarios involving multimodal emotion recognition in English, such as:
+
+ - multimodal conversational analysis
+ - speech and text emotion analysis
+ - affective computing research
+ - emotion-aware conversational systems
+ - human-computer interaction
+ - multimodal AI research
+
+ The model is particularly useful in scenarios where both speech audio and transcribed text are available.
+
+ It can be used through the `speech-emotion` toolkit.
+
+ ## Out-of-scope use
+
+ This model should not be used as the sole basis for high-stakes decisions, including but not limited to:
+
+ - clinical diagnosis
+ - mental health assessment
+ - employment, legal, or educational decisions
+ - biometric profiling or surveillance
+ - automated decisions affecting individuals without human oversight
+
+ Emotion recognition is inherently uncertain and context-dependent. Predictions should be interpreted as model estimates, not as definitive assessments of a person's emotional state.
+
+ ## Training data
+
+ The model was trained on the English multimodal datasets used in the `speech-emotion` project.
+
+ The training data combines multiple publicly available English speech and multimodal emotion recognition datasets, including:
+
+ - RAVDESS
+ - TESS
+ - datasets derived from prior speech emotion recognition research benchmarks
+
+ Because the original datasets use different emotion taxonomies, all datasets were harmonized into a unified seven-class emotion taxonomy:
+
+ - `angry`
+ - `disgust`
+ - `fear`
+ - `happy`
+ - `neutral`
+ - `sad`
+ - `surprise`
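
Label harmonization of this kind can be sketched as a lookup from dataset-specific label spellings to the unified taxonomy. The alias table below is hypothetical: the raw spellings and the mapping choices are assumptions for illustration, and the project's actual rules live in its repository. Labels without an entry return `None` so the caller decides whether to drop or remap them.

```python
# Unified seven-class taxonomy from the model card.
UNIFIED = ("angry", "disgust", "fear", "happy", "neutral", "sad", "surprise")

# Hypothetical alias table: raw spellings are assumptions, not the
# project's actual mapping.
ALIASES = {
    "angry": "angry",
    "anger": "angry",
    "disgust": "disgust",
    "fear": "fear",
    "fearful": "fear",
    "happy": "happy",
    "happiness": "happy",
    "neutral": "neutral",
    "sad": "sad",
    "sadness": "sad",
    "surprise": "surprise",
    "surprised": "surprise",
    "pleasant surprise": "surprise",
}


def harmonize(raw_label):
    """Map a raw dataset label to the unified taxonomy, or None if unmapped."""
    key = raw_label.strip().lower().replace("_", " ")
    return ALIASES.get(key)


for raw in ["Fearful", "pleasant_surprise", "calm"]:
    print(raw, "->", harmonize(raw))
```

Returning `None` for unknown labels keeps the sketch from silently assigning a class to categories that the unified taxonomy does not cover.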
+
+ For the English multimodal emotion recognition setup, the same aligned speech-text samples were used for both the acoustic and textual modalities:
+
+ - Training samples: 3,622
+ - Validation samples: 453
+ - Test samples: 453
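
As a quick check on the split sizes listed above, the following snippet computes each split's share of the total. It uses only the three counts from this section; the variable names are illustrative.

```python
# Sample counts reported in the model card.
SPLITS = {"train": 3622, "validation": 453, "test": 453}

total = sum(SPLITS.values())
shares = {name: count / total for name, count in SPLITS.items()}

print("Total samples:", total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts correspond to a split of roughly 80% training, 10% validation, and 10% test.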
+
+ More details about the dataset preprocessing and label harmonization pipeline are available in the project repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
+ ## Evaluation
+
+ The model was evaluated on the English held-out test set used in the `speech-emotion` toolkit.
+
+ ### Performance comparison on English emotion recognition
+
+ | Configuration | Accuracy | Weighted Precision | Weighted F1 | Macro F1 |
+ |---|---:|---:|---:|---:|
+ | Speech-only | 95.1435 | 95.2700 | 95.1575 | 95.1679 |
+ | Text-only | 76.0842 | 75.5723 | 75.6852 | 68.0266 |
+ | Multimodal (Concat) | **96.0462** | **96.0880** | **96.0257** | **96.0462** |
+ | Multimodal (Mean) | 90.2870 | 90.5162 | 90.2334 | 90.2589 |
+ | Multimodal (Multihead) | 93.1567 | 93.2715 | 93.1898 | 93.2115 |
+
+ The results show that concatenation-based fusion of acoustic and linguistic representations improves on both unimodal systems: accuracy rises by about 0.9 points over speech-only (96.0462 vs. 95.1435) and by about 20 points over text-only (76.0842). The mean and multihead fusion variants outperform the text-only system but score below the speech-only system.
+
+ Among the evaluated fusion strategies, the concatenation-based multimodal approach achieved the best overall performance across all reported metrics.
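
The comparison in the table can be made explicit by computing each configuration's difference from the speech-only baseline. The snippet below copies the accuracy, weighted F1, and macro F1 values from the table; the dictionary layout and the `gain_over` helper are illustrative names, not part of the toolkit.

```python
# Scores copied from the performance table above.
RESULTS = {
    "Speech-only": {"accuracy": 95.1435, "weighted_f1": 95.1575, "macro_f1": 95.1679},
    "Text-only": {"accuracy": 76.0842, "weighted_f1": 75.6852, "macro_f1": 68.0266},
    "Multimodal (Concat)": {"accuracy": 96.0462, "weighted_f1": 96.0257, "macro_f1": 96.0462},
    "Multimodal (Mean)": {"accuracy": 90.2870, "weighted_f1": 90.2334, "macro_f1": 90.2589},
    "Multimodal (Multihead)": {"accuracy": 93.1567, "weighted_f1": 93.1898, "macro_f1": 93.2115},
}


def gain_over(baseline, metric="accuracy"):
    """Difference in percentage points between each configuration and the baseline."""
    base = RESULTS[baseline][metric]
    return {
        name: round(scores[metric] - base, 4)
        for name, scores in RESULTS.items()
        if name != baseline
    }


gains = gain_over("Speech-only")
for name, delta in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {delta:+.4f}")
```

Only the concatenation variant has a positive difference against speech-only on accuracy, which is why it is the configuration published in this repository.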
+
+ ## How to use
+
+ Install the toolkit:
+
+ ```bash
+ pip install speech-emotion
+ ```
+
+ ### Multimodal emotion recognition using audio and text
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # Classify one utterance from its audio file and its transcript.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     text="I was really happy to see you again.",
+     language="en",
+     mode="concat",  # concatenation-based fusion
+     model_config_path="model.json",
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ ### Multimodal emotion recognition using automatic transcription (Whisper)
+
+ If no transcription is provided, the toolkit can automatically generate it using Whisper before performing emotion recognition.
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # No `text` argument: the transcript is generated with Whisper.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     language="en",
+     mode="concat",
+     model_config_path="model.json",
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ Repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
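
When some recordings have a transcript and others do not, the two calls above can be folded into one helper that passes `text` only when it is available. The sketch below is an assumption-laden illustration: `build_request` and `classify` are hypothetical helper names, and `fake_predict` is a stand-in so the example runs without the toolkit installed. In real use, `predict_emotion` from `speech_emotion` would be passed instead.

```python
from typing import Callable, Optional


def build_request(
    audio_path: str,
    text: Optional[str] = None,
    *,
    language: str = "en",
    mode: str = "concat",
    model_config_path: str = "model.json",
) -> dict:
    """Build keyword arguments, omitting `text` when no transcript exists."""
    kwargs = {
        "audio_path": audio_path,
        "language": language,
        "mode": mode,
        "model_config_path": model_config_path,
    }
    if text is not None and text.strip():
        kwargs["text"] = text
    return kwargs


def classify(predict: Callable[..., object], audio_path: str, text: Optional[str] = None):
    """Call the given predictor with or without a transcript."""
    return predict(**build_request(audio_path, text))


def fake_predict(**kwargs):
    """Stand-in predictor for this sketch; it only reports what it received."""
    return "with transcript" if "text" in kwargs else "transcript omitted"


print(classify(fake_predict, "audio.wav", "I was really happy to see you again."))
print(classify(fake_predict, "audio.wav"))
```

Leaving `text` out of the call, rather than passing an empty string, mirrors the second example above, where the toolkit falls back to automatic transcription.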
+
+ ## Limitations
+
+ - The model is designed for English multimodal emotion recognition and may not generalize reliably to other languages.
+ - It predicts a single label from a fixed set of seven emotions.
+ - Emotion expression is subjective and highly context-dependent.
+ - Performance may decrease with noisy audio, inaccurate transcriptions, overlapping speakers, or domain shifts.
+ - The model assumes that audio and text inputs are semantically aligned.
+ - Errors in automatic speech transcription may negatively affect multimodal performance.
+
+ ## Bias and ethical considerations
+
+ Emotion recognition systems may reflect biases present in their training data, including differences related to accents, speaking styles, demographics, recording conditions, or annotation subjectivity.
+
+ Users should avoid interpreting predictions as objective truths about a person's internal emotional state. The model should be used with transparency, appropriate consent, and human oversight, especially in sensitive contexts.
+
+ ## Citation
+
+ If you use this model in your research, please cite the following work:
+
+ ### speech-emotion toolkit
+
+ ```bibtex
+ @article{PAN2026102677,
+   title = {speech-emotion: A multilingual and multimodal toolkit for emotion recognition from speech},
+   journal = {SoftwareX},
+   volume = {34},
+   pages = {102677},
+   year = {2026},
+   issn = {2352-7110},
+   doi = {https://doi.org/10.1016/j.softx.2026.102677},
+   url = {https://www.sciencedirect.com/science/article/pii/S235271102600169X},
+   author = {Ronghao Pan and Tomás Bernal-Beltrán and José Antonio García-Díaz and Rafael Valencia-García},
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is part of the research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF/EU - FEDER/UE), “A way of making Europe”.
+
+ Mr. Tomás Bernal-Beltrán is supported by the University of Murcia through the predoctoral programme.