| library_name: transformers | |
| tags: | |
| - speech | |
| - automatic-speech-recognition | |
| - whisper | |
| - multilingual | |
| - speaker-diarization | |
| - meeting-transcription | |
| - target-speaker-asr | |
| - SE-DiCoW | |
| - BUT-FIT | |
| pipeline_tag: automatic-speech-recognition | |
| license: apache-2.0 | |
| datasets: | |
| - microsoft/NOTSOFAR | |
| - edinburghcstr/ami | |
| - LibriSpeechMix | |
| - LibriMix | |
| > Warning: This is an older version of the SE-DiCoW model. Please use https://huggingface.co/BUT-FIT/SE-DiCoW instead, which is compatible with the training and inference code. | |
| # 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper | |
| This repository hosts the **SE-DiCoW** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with **JHU CLSP/HLTCOE** and **CMU LTI**, tailored for **target-speaker multi-talker automatic speech recognition (TS-ASR)**. | |
| ## 🔧 Key Innovations | |
| * **Self-Enrollment (SE):** | |
| Automatically selects the most informative segment of the target speaker within a conversation and integrates it via **cross-attention** at each encoder layer. | |
| * **Improved Initialization & Segmentation:** | |
| Refined initialization of the frame-level diarization-dependent transformations (FDDT) and corrected data segmentation for more stable training. | |
| * **Augmentations:** | |
| - Gaussian noise injection into the STNO (silence, target, non-target, overlap) masks | |
| - Segment-wise flipping of dominant STNO classes | |
| - Joint **SpecAugment** on input + STNO | |
| - **MUSAN** noise mixing | |
| ➡️ Together, these yield a **49.7% tcpWER reduction** over the original DiCoW on the **EMMA MT-ASR benchmark**, with gains of over **70%** on heavily overlapped Libri3Mix. | |
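The card does not say how the "most informative" enrollment segment is scored. As a minimal sketch, assuming the score is simply how much target-only speech a fixed-length window contains, the selection could look like the code below. The function name `select_enrollment_window`, the `window` parameter, and the scoring rule are all illustrative assumptions, not the released implementation.

```python
import numpy as np


def select_enrollment_window(target_only: np.ndarray, window: int) -> tuple[int, int]:
    """Return (start, end) frame indices of the window with the most target-only speech.

    target_only: per-frame probability that only the target speaker is active, shape (T,).
    window: window length in frames (hypothetical parameter of this sketch).
    """
    if target_only.ndim != 1:
        raise ValueError("target_only must be 1-D")
    if not 0 < window <= target_only.shape[0]:
        raise ValueError("window must be between 1 and the number of frames")
    # Sliding-window sums from a cumulative sum: sums[i] == target_only[i:i + window].sum()
    csum = np.concatenate(([0.0], np.cumsum(target_only)))
    sums = csum[window:] - csum[:-window]
    start = int(np.argmax(sums))
    return start, start + window


# Toy activity track (made-up values): the target speaker is clearest in frames 2..4.
activity = np.array([0.1, 0.2, 0.9, 0.8, 0.9, 0.1, 0.0, 0.7])
print(select_enrollment_window(activity, window=3))  # (2, 5)
```

In a real system the selected frames would be cut from the encoder input and fed to the cross-attention path; that wiring lives in the model's custom code and is not reproduced here.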
|  | |
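To make the STNO augmentations more concrete, here is a minimal NumPy sketch. It assumes an STNO mask is a `(4, T)` array of per-frame probabilities for silence, target-only, non-target-only, and overlap, derived from per-speaker diarization probabilities under an independence assumption. The function names, the noise level, and the interpretation of "flipping" as swapping the dominant class with another class inside a segment are assumptions made for illustration. SpecAugment and MUSAN mixing are not covered.

```python
import numpy as np


def stno_from_diarization(probs: np.ndarray, target: int) -> np.ndarray:
    """Build a (4, T) STNO mask from per-speaker activity probabilities of shape (S, T).

    Rows: silence, target-only, non-target-only, overlap (target plus someone else).
    Speakers are treated as independent, which is an assumption of this sketch.
    """
    tgt = probs[target]
    others = np.delete(probs, target, axis=0)
    no_other = np.prod(1.0 - others, axis=0)  # P(no non-target speaker is active)
    silence = (1.0 - tgt) * no_other
    target_only = tgt * no_other
    non_target = (1.0 - tgt) * (1.0 - no_other)
    overlap = tgt * (1.0 - no_other)
    return np.stack([silence, target_only, non_target, overlap])


def add_gaussian_noise(stno: np.ndarray, std: float, rng: np.random.Generator) -> np.ndarray:
    """Perturb the mask with Gaussian noise, then clip and renormalise each frame."""
    noisy = np.clip(stno + rng.normal(0.0, std, size=stno.shape), 1e-8, None)
    return noisy / noisy.sum(axis=0, keepdims=True)


def flip_dominant_class(stno: np.ndarray, start: int, end: int, new_class: int) -> np.ndarray:
    """Within frames [start, end), swap the dominant class's row with `new_class`."""
    out = stno.copy()
    seg = out[:, start:end]  # a view, so writes below land in `out`
    dominant = int(np.argmax(seg.sum(axis=1)))
    seg[[dominant, new_class]] = seg[[new_class, dominant]]
    return out


rng = np.random.default_rng(0)
# Made-up diarization output: 2 speakers, 4 frames.
diar = np.array([[0.9, 0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.9, 0.0]])
stno = stno_from_diarization(diar, target=0)
noisy = add_gaussian_noise(stno, std=0.05, rng=rng)
flipped = flip_dominant_class(stno, start=0, end=2, new_class=2)
print(stno.argmax(axis=0))  # [1 3 2 0]
```

Both augmentations keep every frame a valid distribution over the four classes: the noise step renormalises explicitly, and the flip only permutes rows.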
| --- | |
| ## 🛠️ Model Usage | |
| ```python | |
| from transformers import AutoModelForSpeechSeq2Seq | |
| MODEL_NAME = "BUT-FIT/SE_DiCoW" | |
| model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True) | |
| ``` | |
| ➡️ Training and inference pipelines: | |
| * [**Training Code (TS-ASR-Whisper)**](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) | |
| * [**Inference Code**](https://github.com/BUTSpeechFIT/DiCoW) | |
| --- | |
| ## 🏆 Performance | |
| **Benchmark:** EMMA MT-ASR (multi-domain, multi-talker) | |
| * SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both **oracle** and **real diarization**, particularly in highly overlapped conditions (Libri3Mix). | |
| * Achieves **state-of-the-art** performance, or performance comparable to domain-tuned systems, on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures. | |
| 🔗 [**EMMA-MT ASR Leaderboard**](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard) | |
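When comparing leaderboard entries, it helps to be explicit about how a percentage reduction is computed. The sketch below assumes that reductions such as those quoted on this card are relative, not absolute, which the card does not state. The helper name and the numbers are placeholders for illustration, not benchmark results.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative error reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Placeholder tcpWER values in percent; NOT taken from the EMMA leaderboard.
baseline_tcpwer = 20.0
improved_tcpwer = 10.0

print(relative_reduction(baseline_tcpwer, improved_tcpwer))  # 50.0 (relative, in percent)
print(baseline_tcpwer - improved_tcpwer)                     # 10.0 (absolute, in points)
```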
| --- | |
| ## 📦 Model Details | |
| * **Base Model:** [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) | |
| * **Training Datasets:** | |
| * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge) | |
| * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/) | |
| * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix) | |
| * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures | |
| --- | |
| ## 🧬 Source Repositories | |
| * 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper) | |
| * 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW) | |
| --- | |
| ## 📚 Related Publications | |
| * 📰 **ICASSP 2026:** | |
| *SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper* | |
| IEEE ICASSP 2026 | |
| * 📰 **Journal Paper (CSL 2026):** | |
| *DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR* | |
| [Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X) | |
| * 📰 **ICASSP 2025:** | |
| *Target Speaker ASR with Whisper* | |
| [IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683) | |
| --- | |
| ## 📝 Citation | |
| If you use this model, please cite the following works: | |
| ```bibtex | |
| @INPROCEEDINGS{polok2026sedicow, | |
| author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš}, | |
| booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, | |
| title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper}, | |
| year={2026}, | |
| pages={1-5}, | |
| } | |
| @article{POLOK2026101841, | |
| title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition}, | |
| journal = {Computer Speech \& Language}, | |
| volume = {95}, | |
| pages = {101841}, | |
| year = {2026}, | |
| doi = {10.1016/j.csl.2025.101841}, | |
| author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget}, | |
| } | |
| @INPROCEEDINGS{10887683, | |
| author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš}, | |
| booktitle={ICASSP 2025}, | |
| title={Target Speaker ASR with Whisper}, | |
| year={2025}, | |
| doi={10.1109/ICASSP49660.2025.10887683} | |
| } | |
| ``` | |
| --- | |
| ## 📬 Contact | |
| For questions or collaboration inquiries: | |
| 📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz) | |
| 🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology | |
| 🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT) | |