---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- speaker-diarization
- meeting-transcription
- target-speaker-asr
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: cc-by-4.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
- LibriMix
base_model: BUT-FIT/DiCoW_v3_2
---

# DiCoW v3.2-SF — Sortformer Fine-Tuned Diarization-Conditioned Whisper

This repository hosts **DiCoW v3.2-SF**, a variant of [DiCoW v3.2](https://huggingface.co/BUT-FIT/DiCoW_v3_2) developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT). It is fine-tuned to accept **Sortformer soft speaker-activity masks** as diarization input, with the goal of adapting the FDDT conditioning layers to the noise patterns of the Sortformer diarization system.

---

## What is this model?

Standard DiCoW models are trained with ground-truth (GT) binary diarization masks. At inference time, however, real-world diarization systems such as [Sortformer](https://arxiv.org/abs/2409.06656) produce continuous soft probabilities that can differ substantially from clean binary signals.

DiCoW v3.2-SF addresses this mismatch by fine-tuning the **FDDT (Frame-Level Diarization-Dependent Transformation)** layers of DiCoW v3.2 exclusively on **Sortformer soft masks**, while keeping the Whisper decoder frozen.
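
The mismatch between binary and soft masks can be illustrated with a small, library-free sketch. The frame values below are made up for illustration, and `binarize` and `mean_abs_gap` are hypothetical helpers; nothing here uses the DiCoW or Sortformer APIs.

```python
# Hypothetical frame-level activity for one target speaker (values are made up).
gt_mask = [0, 0, 1, 1, 1, 1, 0, 0]                          # binary ground truth
sf_soft = [0.02, 0.35, 0.71, 0.96, 0.88, 0.40, 0.12, 0.01]  # soft probabilities


def binarize(probs, threshold=0.5):
    """Threshold soft probabilities into a hard 0/1 mask."""
    return [1 if p >= threshold else 0 for p in probs]


def mean_abs_gap(reference, estimate):
    """Mean absolute per-frame difference between two masks."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


sf_hard = binarize(sf_soft)

print("thresholded mask:", sf_hard)
print("gap GT vs. soft :", mean_abs_gap(gt_mask, sf_soft))
print("gap GT vs. hard :", mean_abs_gap(gt_mask, sf_hard))
```

In this toy case the soft mask never exactly matches the binary reference, and thresholding still misses the frame where the probability dips to 0.40. A model trained only on clean binary masks sees neither kind of deviation during training.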


## Quick Usage

### Load in Python

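```python
from transformers import AutoModelForSpeechSeq2Seq

# trust_remote_code=True allows transformers to run the custom model code
# shipped in the model repository.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "BUT-FIT/DiCoW_v3_2_SF",
    trust_remote_code=True,
)
```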


## Performance (tcpWER, 5 s collar)

*Comparison between this model and its baseline DiCoW v3.2 under both ground-truth (GT) and Sortformer (SF) diarization. Values are tcpWER in % (lower is better).*

| Dataset | Baseline — GT diar. | Baseline — SF diar. | **This model — GT diar.** | **This model — SF diar.** |
|---|---|---|---|---|
| NOTSOFAR-1 eval-small | 16.4 | 57.6 | 20.2 | **58.7** |
| AMI-SDM | 15.3 | 81.3 | 17.6 | **77.3** |
| Libri2Mix clean | 4.7 | 8.9 | 7.9 | **5.2** |
| Libri2Mix noisy | 11.3 | 21.5 | 16.1 | **12.3** |
| Libri3Mix clean | 28.8 | 38.9 | 36.7 | **38.2** |
| Libri3Mix noisy | 39.6 | 53.9 | 47.1 | **47.3** |

### Interpretation

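Fine-tuning on Sortformer soft masks teaches the FDDT layers to be more tolerant of continuous probability values. Under SF diarization this yields substantial gains on the **simulated two-speaker LibriMix** conditions: Libri2Mix clean drops from 8.9 % to **5.2 %** and Libri2Mix noisy from 21.5 % to **12.3 %**. Gains on Libri3Mix are smaller: the noisy set improves from 53.9 % to **47.3 %**, while the clean set moves only from 38.9 % to **38.2 %**.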

This adaptation carries a trade-off: GT diarization performance degrades on every dataset (for example, from 16.4 % to 20.2 % on NOTSOFAR-1 eval-small and from 4.7 % to 7.9 % on Libri2Mix clean), because the FDDT layers have adjusted their learned transformations to expect noisy soft inputs rather than clean binary masks.

On real-world meeting data the effect under SF diarization is small: AMI-SDM improves from 81.3 % to **77.3 %**, while NOTSOFAR-1 eval-small is slightly worse (57.6 % to **58.7 %**).
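
To compare datasets with very different absolute error rates, it can help to look at the relative change under SF diarization. The sketch below only recomputes this from the SF columns of the table above; `relative_reduction` is a hypothetical helper, not part of any released tooling.

```python
# tcpWER (%) under Sortformer diarization, copied from the table above:
# dataset -> (baseline DiCoW v3.2, DiCoW v3.2-SF)
sf_tcpwer = {
    "NOTSOFAR-1 eval-small": (57.6, 58.7),
    "AMI-SDM": (81.3, 77.3),
    "Libri2Mix clean": (8.9, 5.2),
    "Libri2Mix noisy": (21.5, 12.3),
    "Libri3Mix clean": (38.9, 38.2),
    "Libri3Mix noisy": (53.9, 47.3),
}


def relative_reduction(baseline, finetuned):
    """Relative tcpWER reduction in percent; negative means the error rose."""
    return 100.0 * (baseline - finetuned) / baseline


reductions = {
    name: relative_reduction(base, tuned)
    for name, (base, tuned) in sf_tcpwer.items()
}

for name, value in reductions.items():
    print(f"{name:<24} {value:+6.1f} %")
```

With these numbers the relative reduction is roughly 42 to 43 % on both Libri2Mix sets, about 12 % on Libri3Mix noisy, about 5 % on AMI-SDM, and slightly negative on NOTSOFAR-1 eval-small.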

> **Recommendation:** Use DiCoW v3.2-SF when your inference pipeline uses Sortformer-style diarization on relatively clean or simulated mixtures. For real-world meeting transcription with the best overall accuracy, prefer [DiCoW v3.3](https://huggingface.co/BUT-FIT/DiCoW_v3_3) or [SE-DiCoW](https://huggingface.co/BUT-FIT/SE-DiCoW).


## Limitations

* **GT mask degradation:** Compared to DiCoW v3.2, oracle (GT) diarization performance is meaningfully worse. Do not use this model when clean binary masks are available.
* **Meeting-domain gap:** Sortformer mask adaptation on LibriMix data does not fully transfer to spontaneous meeting corpora such as NOTSOFAR-1.

---

## Citations

If you use this model, please cite the original papers:

```bibtex
@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech \& Language},
    volume = {95},
    year = {2026},
    doi = {10.1016/j.csl.2025.101841},
    author = {Alexander Polok and others}
}

@INPROCEEDINGS{10887683,
    title={Target Speaker ASR with Whisper},
    author={Polok, Alexander and others},
    booktitle={ICASSP 2025},
    year={2025},
    doi={10.1109/ICASSP49660.2025.10887683}
}
```

## Contact

* **Email:** [xbohatd00@stud.fit.vut.cz](mailto:xbohatd00@stud.fit.vut.cz)