	---
	library_name: transformers
	tags:
	- speech
	- automatic-speech-recognition
	- whisper
	- multilingual
	- speaker-diarization
	- meeting-transcription
	- target-speaker-asr
	- SE-DiCoW
	- BUT-FIT
	pipeline_tag: automatic-speech-recognition
	license: apache-2.0
	datasets:
	- microsoft/NOTSOFAR
	- edinburghcstr/ami
	- LibriSpeechMix
	- LibriMix
	---

	> Warning: This is an older version of the SE-DiCoW model. Please use https://huggingface.co/BUT-FIT/SE-DiCoW instead, which is compatible with the training and inference code.

	# 🧠 SE-DiCoW — Self-Enrolled Diarization-Conditioned Whisper

	This repository hosts the SE-DiCoW model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT), in collaboration with JHU CLSP/HLTCOE and CMU LTI, tailored for target-speaker multi-talker automatic speech recognition (TS-ASR).

	## 🔧 Key Innovations

	* Self-Enrollment (SE):
	Automatically selects the most informative segment of the target speaker within a conversation and integrates it via cross-attention at each encoder layer.
	* Improved Initialization & Segmentation:
	Refined FDDT initialization and corrected data segmentation for more stable training.
	* Augmentations:
	  - Gaussian noise injection to STNO masks
	  - Segment-wise flipping of dominant STNO classes
	  - Joint SpecAugment on input + STNO
	  - MUSAN noise mixing

	➡️ Together, these yield a 49.7% tcpWER reduction over the original DiCoW on the EMMA MT-ASR benchmark, with gains of over 70% on the heavily overlapped Libri3Mix.
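
The Gaussian noise augmentation listed above can be illustrated with a minimal sketch. It assumes an STNO mask is stored as one 4-way probability vector per frame (silence, target, non-target, overlap), and that the noisy mask is clipped and renormalised afterwards. The function name `add_gaussian_noise`, the `sigma` value, and the renormalisation step are illustrative assumptions, not the training code from TS-ASR-Whisper.

```python
import random

# Assumed class order for one STNO frame (hypothetical layout for this sketch).
STNO_CLASSES = ("silence", "target", "non_target", "overlap")


def add_gaussian_noise(stno, sigma=0.1, rng=None):
    """Return a noisy copy of a per-frame STNO mask.

    stno: list of frames, each a list of 4 probabilities that sum to 1.
    sigma: standard deviation of the injected noise (assumed value).
    """
    rng = rng or random.Random()
    noisy = []
    for frame in stno:
        # Perturb each class probability and clip so it stays positive.
        perturbed = [max(p + rng.gauss(0.0, sigma), 1e-8) for p in frame]
        total = sum(perturbed)
        # Renormalise so every frame is still a probability distribution.
        noisy.append([p / total for p in perturbed])
    return noisy


# Three frames: pure silence, target only, and overlapped speech.
mask = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
noisy_mask = add_gaussian_noise(mask, sigma=0.1, rng=random.Random(0))
```

The idea behind this kind of augmentation is that masks produced by a real diarization system are rarely as clean as oracle masks, so training on perturbed masks may make the model less sensitive to diarization errors.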

	![SE-DiCoW Architecture](./SE-DiCoW_figure.png)

	---

	## 🛠️ Model Usage

	```python
	from transformers import AutoModelForSpeechSeq2Seq

	MODEL_NAME = "BUT-FIT/SE_DiCoW"
	model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
	```

	➡️ Training and inference pipelines:

	* [Training Code (TS-ASR-Whisper)](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
	* [Inference Code](https://github.com/BUTSpeechFIT/DiCoW)

	---

	## 🏆 Performance

	Benchmark: EMMA MT-ASR (multi-domain, multi-talker)

	* SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both oracle and real diarization, particularly in highly overlapped conditions (Libri3Mix).
	* Achieves performance that is state-of-the-art or comparable to domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

	🔗 [EMMA MT-ASR Leaderboard](https://huggingface.co/spaces/BUT-FIT/EMMA_leaderboard)

	---

	## 📦 Model Details

	* Base Model: [Whisper large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo)
	* Training Datasets:
	  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
	  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
	  * [Libri2Mix / Libri3Mix](https://github.com/JorisCos/LibriMix)
	  * [LibriSpeech](https://www.openslr.org/12) synthetic mixtures

	---

	## 🧬 Source Repositories

	* 🔧 [Training Code: TS-ASR-Whisper](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
	* 🚀 [Inference (DiCoW)](https://github.com/BUTSpeechFIT/DiCoW)

	---

	## 📚 Related Publications

	* 📰 ICASSP 2026:
	SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
	IEEE ICASSP 2026

	* 📰 Journal Paper (CSL 2026):
	DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR
	[Computer Speech & Language, 2026](https://www.sciencedirect.com/science/article/pii/S088523082500066X)

	* 📰 ICASSP 2025:
	Target Speaker ASR with Whisper
	[IEEE ICASSP 2025](https://doi.org/10.1109/ICASSP49660.2025.10887683)

	---

	## 📝 Citation

	If you use this model, please cite the following works:

	```bibtex
	@INPROCEEDINGS{polok2026sedicow,
	author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
	booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
	title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
	year={2026},
	pages={1-5},
	}

	@article{POLOK2026101841,
	title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
	journal = {Computer Speech \& Language},
	volume = {95},
	pages = {101841},
	year = {2026},
	doi = {10.1016/j.csl.2025.101841},
	author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
	}

	@INPROCEEDINGS{10887683,
	author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
	booktitle={ICASSP 2025},
	title={Target Speaker ASR with Whisper},
	year={2025},
	doi={10.1109/ICASSP49660.2025.10887683}
	}
	```

	---

	## 📬 Contact

	For questions or collaboration inquiries:

	📧 Email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)

	🏢 Affiliation: [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology

	🔗 GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT)