---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriMix
base_model: BUT-FIT/DiCoW_v3_2
---

# DiCoW v3.2-SF: Sortformer Fine-Tuned Diarization-Conditioned Whisper

This repository hosts **DiCoW v3.2-SF**, a variant of [DiCoW v3.2](https://huggingface.co/BUT-FIT/DiCoW_v3_2) developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT). It is fine-tuned to accept **Sortformer soft speaker-activity masks** as diarization input, with the goal of adapting the FDDT conditioning layers to the noise patterns of the Sortformer diarization system.

---

## What is this model?

Standard DiCoW models are trained with ground-truth (GT) binary diarization masks. At inference time, however, real-world diarization systems such as [Sortformer](https://arxiv.org/abs/2409.06656) produce continuous soft probabilities that can differ substantially from clean binary signals.

DiCoW v3.2-SF addresses this mismatch by fine-tuning the **FDDT (Frame-Level Diarization-Dependent Transformation)** layers of DiCoW v3.2 exclusively on **Sortformer soft masks**, while keeping the Whisper decoder frozen.

## Quick Usage

### Load in Python

```python
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_2_SF",
    trust_remote_code=True
)
```

## Performance (tcpWER in %, 5 s collar)

*Comparison between this model and its baseline DiCoW v3.2 under both ground-truth (GT) and Sortformer (SF) diarization.*

| Dataset | Baseline (GT diar.) | Baseline (SF diar.) | **This model (GT diar.)** | **This model (SF diar.)** |
|---|---|---|---|---|
| NOTSOFAR-1 eval-small | 16.4 | 57.6 | 20.2 | **58.7** |
| AMI-SDM | 15.3 | 81.3 | 17.6 | **77.3** |
| Libri2Mix clean | 4.7 | 8.9 | 7.9 | **5.2** |
| Libri2Mix noisy | 11.3 | 21.5 | 16.1 | **12.3** |
| Libri3Mix clean | 28.8 | 38.9 | 36.7 | **38.2** |
| Libri3Mix noisy | 39.6 | 53.9 | 47.1 | **47.3** |

### Interpretation

Fine-tuning on Sortformer soft masks teaches the FDDT layers to be more tolerant of continuous probability values, yielding substantial gains on **simulated LibriMix** conditions: Libri2Mix-clean drops from 8.9 % to **5.2 %** and Libri2Mix-noisy from 21.5 % to **12.3 %** under SF diarization.

This adaptation carries a trade-off: GT diarization performance degrades across all datasets, because the FDDT layers have adjusted their learned transformations to expect noisy soft inputs rather than clean binary masks.

On real-world meeting data the SF-diarization gains are modest: AMI-SDM improves from 81.3 % to **77.3 %**, while NOTSOFAR-1 is largely unchanged.

> **Recommendation:** Use DiCoW v3.2-SF when your inference pipeline uses Sortformer-style diarization on relatively clean or simulated mixtures. For real-world meeting transcription with the best overall accuracy, prefer [DiCoW v3.3](https://huggingface.co/BUT-FIT/DiCoW_v3_3) or [SE-DiCoW](https://huggingface.co/BUT-FIT/SE-DiCoW).

## Limitations

* **GT mask degradation:** Compared to DiCoW v3.2, oracle (GT) diarization performance is meaningfully worse. Do not use this model when clean binary masks are available.
* **Meeting-domain gap:** Sortformer mask adaptation on LibriMix data does not fully transfer to spontaneous meeting corpora such as NOTSOFAR-1.

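
The trade-off described under Interpretation and Limitations can be quantified directly from the performance table. The sketch below is only an arithmetic aid: the tcpWER values are copied from the table, while the helper names `relative_change` and `summarize` are illustrative and not part of any DiCoW tooling.

```python
# tcpWER (%) values copied from the performance table, per dataset:
# (baseline GT, baseline SF, this model GT, this model SF)
RESULTS = {
    "NOTSOFAR-1 eval-small": (16.4, 57.6, 20.2, 58.7),
    "AMI-SDM": (15.3, 81.3, 17.6, 77.3),
    "Libri2Mix clean": (4.7, 8.9, 7.9, 5.2),
    "Libri2Mix noisy": (11.3, 21.5, 16.1, 12.3),
    "Libri3Mix clean": (28.8, 38.9, 36.7, 38.2),
    "Libri3Mix noisy": (39.6, 53.9, 47.1, 47.3),
}


def relative_change(baseline, new):
    """Relative change in percent; negative means lower (better) tcpWER."""
    return 100.0 * (new - baseline) / baseline


def summarize(results):
    """Map each dataset to (GT relative change, SF relative change), rounded."""
    rows = {}
    for name, (base_gt, base_sf, model_gt, model_sf) in results.items():
        rows[name] = (
            round(relative_change(base_gt, model_gt), 1),
            round(relative_change(base_sf, model_sf), 1),
        )
    return rows


if __name__ == "__main__":
    for name, (gt, sf) in summarize(RESULTS).items():
        print(f"{name:<24} GT: {gt:+6.1f} %   SF: {sf:+6.1f} %")
```

Read this way, the GT column worsens on every dataset, while the SF column ranges from a relative reduction of roughly 42 % on Libri2Mix clean to a slight increase on NOTSOFAR-1.
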
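
To make the difference between clean binary masks and soft speaker-activity masks concrete, here is a minimal toy sketch in plain Python. All numbers are invented for illustration and are not real Sortformer output; the 0.5 threshold and the helper names are assumptions as well, and nothing here reflects the actual input format expected by the model's remote code.

```python
# Hypothetical toy masks: rows are frames, columns are speakers.
# A ground-truth mask is binary; a soft mask holds per-frame probabilities.
gt_mask = [
    [1, 0],
    [1, 0],
    [1, 1],  # overlapped speech: both speakers active
    [0, 1],
]
soft_mask = [
    [0.92, 0.05],
    [0.71, 0.18],
    [0.55, 0.43],  # the overlap is only weakly detected for speaker 2
    [0.12, 0.88],
]


def binarize(mask, threshold=0.5):
    """Hard-threshold a soft mask (the threshold is an assumed value)."""
    return [[1 if p >= threshold else 0 for p in frame] for frame in mask]


def mean_abs_deviation(soft, binary):
    """Average absolute gap between soft probabilities and binary labels."""
    diffs = [abs(p - b) for sf, bf in zip(soft, binary) for p, b in zip(sf, bf)]
    return sum(diffs) / len(diffs)


def frame_mismatches(a, b):
    """Count frames whose speaker-activity pattern differs between two masks."""
    return sum(fa != fb for fa, fb in zip(a, b))


if __name__ == "__main__":
    hard = binarize(soft_mask)
    print("frames changed by thresholding:", frame_mismatches(hard, gt_mask))
    print("mean |soft - GT|:", mean_abs_deviation(soft_mask, gt_mask))
```

In this toy case, thresholding loses the overlap in the third frame, whereas the soft values still carry that uncertainty. This is the kind of input mismatch the fine-tuned FDDT layers are meant to absorb.
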

---

## Citations

If you use this model, please cite the original papers:

```bibtex
@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech & Language},
    volume = {95},
    year = {2026},
    doi = {10.1016/j.csl.2025.101841},
    author = {Alexander Polok et al.}
}

@INPROCEEDINGS{10887683,
    title={Target Speaker ASR with Whisper},
    author={Polok, Alexander et al.},
    booktitle={ICASSP 2025},
    year={2025},
    doi={10.1109/ICASSP49660.2025.10887683}
}
```

## Contact

* **Email:** [xbohatd00@stud.fit.vut.cz](mailto:xbohatd00@stud.fit.vut.cz)