---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
- whisper
- lora
- automatic-speech-recognition
- disfluent-speech
- stuttering
- accessibility
- speech-disorders
datasets:
- amaai-lab/DisfluencySpeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small-disfluent-smoothed-lora
  results:
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition
    dataset:
      name: FluencyBank Adults Who Stutter (interviews, smoothed transcripts)
      type: fluencybank-aws-interview
      split: full
    metrics:
    - type: wer
      value: 20.46
      name: Word Error Rate (smoothed)
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition
    dataset:
      name: held-out validation split
      type: validation
      split: validation
    metrics:
    - type: wer
      value: 18.56
      name: Word Error Rate (held-out)
---
# whisper-small-disfluent-smoothed (LoRA adapter)
A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that
transcribes **disfluent / stuttered speech** to clean, fluent text, recovering the speaker's
*intended* words and dropping the disfluencies (filled pauses, repetitions, sound prolongations,
blocks, interjections).
This is the **smoothed** variant. A companion **verbatim** variant, which preserves disfluencies
in the output, is published as
[`nazarkozak/whisper-small-disfluent-verbatim-lora`](#related-models).
## Why this exists
Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or
mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model
reaches **26.04 % WER**. With this adapter applied, the same model reaches **20.46 % WER**, a
**21.4 % relative reduction** on real disfluent speech.
The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users
can keep using their existing pipeline and toggle the adapter on for disfluent input.
## Use cases
- **Inclusive voice assistants** that don't fail on disfluent users
- **Live captioning** where the speaker stutters
- **AAC apps** for people with speech-disorder profiles
- **Speech-therapy session transcription** (smoothed mode is what most clinicians want)
- **Inclusive media transcription** of interviews / podcasts with disfluent speakers
## How to use
```python
import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and its processor, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Inference is identical to vanilla Whisper-small.
audio, sr = sf.read("path/to/disfluent_speech.wav")  # expects 16 kHz mono
if sr != 16_000:
    raise ValueError(f"Expected 16 kHz audio, got {sr} Hz; resample first.")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
generated = model.generate(inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
If you need the *verbatim* version (preserves disfluencies in the output), load
`nazarkozak/whisper-small-disfluent-verbatim-lora` instead.
## Performance
All numbers are Word Error Rate (lower is better), computed with Whisper's official English
text normalizer.
| Split | Vanilla Whisper-small | This adapter | Δ Absolute | Δ Relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.04 %** | **20.46 %** | **−5.58** | **−21.4 %** |
| Held-out validation (200 utts) | ~26 %† | **18.56 %** | **−7.4** | **−28 %** |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |
† Baseline on the exact 200-sample held-out val split was not separately re-run; the figure is
estimated from the 26.04 % full-set baseline. Held-out test was re-run and shows the test split
is harder than the validation split (likely a session-level effect with only 200 samples).
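For readers who want to check the two delta columns, the arithmetic can be reproduced with a few lines of Python. This is only a sketch: the helper name `wer_deltas` is hypothetical, and the inputs are the WER figures from the rows whose baseline was actually re-run.

```python
def wer_deltas(baseline: float, adapted: float) -> tuple[float, float]:
    """Return (absolute change in WER points, relative change in percent)."""
    absolute = adapted - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# (vanilla WER, adapter WER) copied from the table above.
rows = {
    "full FluencyBank AWS-interview": (26.04, 20.46),
    "held-out test": (28.71, 26.08),
}

for name, (baseline, adapted) in rows.items():
    abs_delta, rel_delta = wer_deltas(baseline, adapted)
    print(f"{name}: {abs_delta:+.2f} points, {rel_delta:+.1f} %")
```

The relative change is measured against the vanilla baseline, which is why a smaller absolute gain on the harder test split also shows up as a smaller relative gain.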
### Per-example breakdown (5,020 paired samples)
| Outcome | Count | % |
|---|---:|---:|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |
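One plausible way to produce such a breakdown is to count word-level errors for each hypothesis against the same reference and compare the two counts. The sketch below assumes exactly that; the function names `word_errors` and `compare` are hypothetical, the inputs are assumed to be already normalized, and the card does not state how ties were actually determined.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def compare(reference: str, base_hyp: str, lora_hyp: str) -> str:
    """Classify one utterance as 'adapter', 'vanilla', or 'tie'."""
    base_err = word_errors(reference, base_hyp)
    lora_err = word_errors(reference, lora_hyp)
    if lora_err < base_err:
        return "adapter"
    if lora_err > base_err:
        return "vanilla"
    return "tie"


# Transcripts taken from the truncation example in this card.
print(compare("we can speak to pets", "because we can", "we can speak to pets"))
```

Because both hypotheses share one reference, comparing raw error counts gives the same ordering as comparing per-utterance WER.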
### What kinds of failures it fixes
The adapter reliably fixes Whisper-small's failure modes on disfluent input:
| Vanilla failure mode | Example |
|---|---|
| **Hallucination repeats** ("saint saint saint…" ×100) | `REF: my stuttering has the least impact on my life like my friends and family` <br> `BASE: my saint saint saint saint saint saint saint…` <br> `LoRA: my stuttering has the least impact on my life like my friends and family` |
| **Mistaking disfluencies for words** | `REF: and they stutter like that person` <br> `BASE: and they they they said are like like that person` <br> `LoRA: and they stutter like that person` |
| **Mishearing / wrong intent** | `REF: stuttering has had a variety of impacts on me` <br> `BASE: i still do not have a variety of packs on me` <br> `LoRA: stuttering has had a variety of impacts on me` |
| **Truncation** | `REF: we can speak to pets` <br> `BASE: because we can` <br> `LoRA: we can speak to pets` |
## Training data
| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching: Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |
Total: ~27 hr training, 9,402 utterance pairs.
The smoothed transcripts come from:
- For FluencyBank: CHAT-format transcripts, with a custom parser that drops filled-pause markers
(`&-uh`, `&-um`), bracket annotations (`[/]`, `[//]`), pause marks (`(.)`), unintelligible markers
(`xxx`/`yyy`), and collapses consecutive duplicate words.
- For DisfluencySpeech: the dataset's own `transcript_c` field (its cleanest variant, with
filled pauses, editing terms, and false starts removed).
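The FluencyBank clean-up steps listed above can be sketched roughly as follows. This is not the author's parser: the function name `smooth_chat_utterance` and the regular expressions are assumptions made for illustration, and real CHAT transcripts carry more codes (speaker tiers, angle-bracket scopes, timing bullets) than this handles.

```python
import re

# Assumed patterns for the marker classes named in this card.
FILLED_PAUSE = re.compile(r"&-\w+")                # &-uh, &-um
BRACKET_CODE = re.compile(r"\[[^\]]*\]")           # [/], [//], other bracket annotations
PAUSE_MARK = re.compile(r"\(\.+\)")                # (.), (..), (...)
UNINTELLIGIBLE = re.compile(r"\b(?:xxx|yyy)\b")    # unintelligible speech


def smooth_chat_utterance(utterance: str) -> str:
    """Drop disfluency markers, then collapse consecutive duplicate words."""
    text = utterance
    for pattern in (FILLED_PAUSE, BRACKET_CODE, PAUSE_MARK, UNINTELLIGIBLE):
        text = pattern.sub(" ", text)

    smoothed = []
    for word in text.split():
        # Case-insensitive comparison so "The the" also collapses.
        if not smoothed or smoothed[-1].lower() != word.lower():
            smoothed.append(word)
    return " ".join(smoothed)


print(smooth_chat_utterance("and &-um I [/] I went (.) to the xxx store"))
```

Note that collapsing every consecutive duplicate also removes legitimate repeats such as "had had", so a rule like this trades a small amount of fidelity for simplicity.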
## Training recipe
| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |
The released checkpoint is the one with the lowest validation WER over training (step 1200).
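Several figures in the table can be cross-checked with simple arithmetic. The sketch below makes some assumptions that are not stated in this card: Whisper-small's hidden size (768) and layer counts (12 encoder, 12 decoder) come from the public model configuration, all 9,402 utterance pairs are treated as training examples with a partial final batch counted as a step, and the learning rate is assumed to decay linearly to zero at the last step. All names are hypothetical.

```python
import math

# Assumed Whisper-small architecture (not stated in this card).
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

RANK = 32
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# Each encoder layer has self-attention; each decoder layer has
# self-attention plus cross-attention.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION

# A LoRA pair adds A (rank x d_in) and B (d_out x rank) per adapted module.
params_per_module = RANK * (D_MODEL + D_MODEL)
trainable_params = adapted_modules * params_per_module

steps_per_epoch = math.ceil(9402 / 16)  # effective batch size 16
total_steps = 3 * steps_per_epoch


def learning_rate(step: int, peak: float = 1e-4, warmup: int = 100,
                  total: int = total_steps) -> float:
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print(f"trainable LoRA parameters: {trainable_params:,}")
print(f"optimizer steps: {total_steps}")
print(f"learning rate at step 1200: {learning_rate(1200):.2e}")
```

Under these assumptions the parameter count comes to about 3.5 M and the step count to 1764, matching the table; at 4 bytes per float32 weight, that parameter count is also consistent with an adapter file of roughly 14 MB.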
## Limitations
- **English only.** Trained exclusively on English speech.
- **Adult speakers.** Training data is from adults who stutter; the adapter has not been validated on children.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0). For a
commercial-friendly variant trained on Apache 2.0 data only, see
[`nazarkozak/whisper-small-disfluent-commercial-lora`](#) (planned).
- **Not a clinical instrument.** The adapter improves transcription, not assessment of
stuttering severity.
- **Generalization gap.** On a 200-sample held-out test split the relative WER reduction is
smaller (~9 %) than on validation (~28 %), suggesting some sensitivity to which sessions land
in train vs test. Real-world WER will fall somewhere in between depending on speaker similarity.
## Citation
If you use this adapter, please cite both the base model and the source datasets:
```bibtex
@misc{kozak2026whisperdisfluent,
author = {Nazar Kozak},
title = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}
@article{radford2022whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
Christine and Sutskever, Ilya},
journal = {arXiv preprint arXiv:2212.04356},
year = {2022}
}
@article{ratner2018fluencybank,
title = {Fluency Bank: A new resource for fluency research and practice},
author = {Ratner, Nan Bernstein and MacWhinney, Brian},
journal= {Journal of Fluency Disorders},
volume = {56},
pages = {69--80},
year = {2018}
}
@misc{wang2024disfluencyspeech,
title = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
author = {Kyra Wang and Dorien Herremans},
year = {2024},
eprint = {2406.08820},
archivePrefix = {arXiv}
}
```
## Related models
- **Verbatim variant** (preserves disfluencies in the output):
[`nazarkozak/whisper-small-disfluent-verbatim-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-verbatim-lora)
- **CoreML build** (for on-device iOS use, planned):
[`nazarkozak/whisper-small-disfluent-smoothed-coreml`](#)
- **Hosted API** (Replicate endpoint, planned):
`replicate.com/nazarkozak/whisper-disfluent`