---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
- whisper
- lora
- automatic-speech-recognition
- disfluent-speech
- stuttering
- accessibility
- speech-disorders
datasets:
- amaai-lab/DisfluencySpeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small-disfluent-smoothed-lora
  results:
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition
    dataset:
      name: FluencyBank Adults Who Stutter (interviews, smoothed transcripts)
      type: fluencybank-aws-interview
      split: full
    metrics:
    - type: wer
      value: 20.46
      name: Word Error Rate (smoothed)
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition
    dataset:
      name: held-out validation split
      type: validation
      split: validation
    metrics:
    - type: wer
      value: 18.56
      name: Word Error Rate (held-out)
---
# whisper-small-disfluent-smoothed (LoRA adapter)
A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that
transcribes **disfluent / stuttered speech** to clean, fluent text — recovering the speaker's
*intended* words and dropping the disfluencies (filled pauses, repetitions, sound prolongations,
blocks, interjections).
This is the **smoothed** variant. A companion **verbatim** variant — which preserves disfluencies
in the output — is published as
[`nazarkozak/whisper-small-disfluent-verbatim-lora`](#related-models).
## Why this exists
Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or
mangle disfluent input. On the full FluencyBank AWS-interview set (5,444 utterances by adults who
stutter), the base model scores **26.04 % WER**. With this adapter applied it scores
**20.46 % WER**, a **21.4 % relative reduction** (5.58 points absolute) on real disfluent speech.
The gain on the 200-utterance held-out test split is smaller (9.2 % relative); see
[Performance](#performance).
The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users
can keep using their existing pipeline and toggle the adapter on for disfluent input.
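
A minimal sketch of that toggle, assuming `model` is a `peft.PeftModel` wrapping Whisper-small (loaded as in "How to use"). The helper name `generate_ids` and its `use_adapter` flag are hypothetical and not part of this repository; `disable_adapter()` is PEFT's context manager for temporarily bypassing the LoRA layers.

```python
from contextlib import nullcontext


def generate_ids(model, input_features, use_adapter=True, **generate_kwargs):
    """Run Whisper generation with the LoRA adapter active or temporarily bypassed.

    `model` is assumed to be a peft.PeftModel wrapping Whisper-small.
    """
    # PeftModel.disable_adapter() is a context manager: inside it the forward
    # pass uses only the frozen base weights, i.e. vanilla Whisper-small.
    context = nullcontext() if use_adapter else model.disable_adapter()
    with context:
        return model.generate(input_features=input_features, **generate_kwargs)
```

Calling `generate_ids(model, inputs, use_adapter=False, language="english", task="transcribe")` should then behave like the base model, which is also a convenient way to compare both outputs on the same audio without loading the weights twice.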
## Use cases
- **Inclusive voice assistants** that don't fail on disfluent users
- **Live captioning** where the speaker stutters
- **AAC apps** for people with speech-disorder profiles
- **Speech-therapy session transcription** (smoothed output gives readable session notes; the verbatim variant keeps the disfluencies when they are the point of interest)
- **Inclusive media transcription** of interviews / podcasts with disfluent speakers
## How to use
```python
import soundfile as sf
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Inference is identical to vanilla Whisper-small
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")  # expects 16 kHz mono
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
if sr != 16_000:
    raise ValueError(f"Expected 16 kHz audio, got {sr} Hz; resample before transcribing")

inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
with torch.no_grad():
    generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
If you need the *verbatim* version (preserves disfluencies in the output), load
`nazarkozak/whisper-small-disfluent-verbatim-lora` instead.
## Performance
All numbers are Word Error Rate (lower is better), computed with Whisper's official English
text normalizer.
| Split | Vanilla Whisper-small | This adapter | Δ Absolute | Δ Relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.04 %** | **20.46 %** | **−5.58** | **−21.4 %** |
| Held-out validation (200 utts) | ~26 %† | **18.56 %** | **~−7.4** | **~−28 %** |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |

† The baseline was not re-run on the exact 200-sample held-out validation split; the figure is
estimated from the 26.04 % full-set baseline, so the validation deltas are approximate. The
held-out test baseline was re-run and shows that the test split is harder than the validation
split (likely a session-level effect with only 200 samples).
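
For readers who want to check the arithmetic in the table, here is a minimal pure-Python sketch of word error rate and of the two delta columns. It is not the evaluation script used for this card: `str.lower` stands in for Whisper's English text normalizer, and pooling errors over the whole corpus (rather than averaging per-utterance WER) is an assumption.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def corpus_wer(references, hypotheses, normalize=str.lower):
    """Corpus-level WER in percent: total word edits / total reference words."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref).split()
        hyp_words = normalize(hyp).split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return 100.0 * errors / total


def wer_deltas(baseline, adapted):
    """Absolute (percentage points) and relative (%) change from the baseline WER."""
    absolute = adapted - baseline
    return round(absolute, 2), round(100.0 * absolute / baseline, 1)


print(corpus_wer(["we can speak to pets"], ["because we can"]))  # 80.0
print(wer_deltas(26.04, 20.46))  # (-5.58, -21.4)
print(wer_deltas(28.71, 26.08))  # (-2.63, -9.2)
```

The first call scores the truncation example from this card: one inserted word plus three deleted words against a five-word reference gives 80 % WER for that utterance.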
### Per-example breakdown (5,020 paired samples)
| Outcome | Count | % |
|---|---:|---:|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |
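
A breakdown like this can be produced by comparing the two systems utterance by utterance. The sketch below assumes the comparison is made on per-utterance word-error counts (the card does not say whether counts or per-utterance WER were used); the function names are illustrative.

```python
from collections import Counter


def tally_outcomes(baseline_errors, adapter_errors):
    """Count utterances where the adapter makes fewer, equal, or more word errors."""
    outcomes = Counter()
    for base, lora in zip(baseline_errors, adapter_errors):
        if lora < base:
            outcomes["adapter_better"] += 1
        elif lora > base:
            outcomes["vanilla_better"] += 1
        else:
            outcomes["tie"] += 1
    return outcomes


def as_percentages(outcomes):
    """Convert outcome counts to percentages of all paired samples."""
    total = sum(outcomes.values())
    return {name: round(100.0 * count / total, 1) for name, count in outcomes.items()}


# The counts from the table reproduce its percentage column.
print(as_percentages({"adapter_better": 1836, "tie": 2564, "vanilla_better": 620}))
# {'adapter_better': 36.6, 'tie': 51.1, 'vanilla_better': 12.4}
```

The three percentages sum to 100.1 because each is rounded independently.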
### What kinds of failures it fixes
The adapter fixes several of Whisper-small's characteristic failure modes on disfluent input:

| Vanilla failure mode | Example |
|---|---|
| **Hallucination repeats** ("saint saint saint…" ×100) | `REF: my stuttering has the least impact on my life like my friends and family`<br>`BASE: my saint saint saint saint saint saint saint…`<br>`LoRA: my stuttering has the least impact on my life like my friends and family` |
| **Mistaking disfluencies for words** | `REF: and they stutter like that person`<br>`BASE: and they they they said are like like that person`<br>`LoRA: and they stutter like that person` |
| **Mishearing / wrong intent** | `REF: stuttering has had a variety of impacts on me`<br>`BASE: i still do not have a variety of packs on me`<br>`LoRA: stuttering has had a variety of impacts on me` |
| **Truncation** | `REF: we can speak to pets`<br>`BASE: because we can`<br>`LoRA: we can speak to pets` |
## Training data
| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching — Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hours of training audio, 9,402 utterance pairs.

The smoothed transcripts come from:
- For FluencyBank — CHAT-format transcripts, with a custom parser that drops filled-pause markers
(`&-uh`, `&-um`), bracket annotations (`[/]`, `[//]`), pause marks (`(.)`), unintelligible markers
(`xxx`/`yyy`), and collapses consecutive duplicate words.
- For DisfluencySpeech — the dataset's own `transcript_c` field (its cleanest variant, with
filled pauses, editing terms, and false starts removed).
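
To make the FluencyBank smoothing rules concrete, here is a rough regex-based sketch of the steps listed above. It is not the parser used to build the training set: the function name and patterns are illustrative, CHAT has many more codes than are handled here, and retraced words themselves are left in place (only the markers are dropped).

```python
import re

# Illustrative patterns for the CHAT markers named in this card
FILLED_PAUSE = re.compile(r"&-\w+")               # &-uh, &-um
BRACKET_CODE = re.compile(r"\[[^\]]*\]")          # [/], [//] and other bracket annotations
PAUSE_MARK = re.compile(r"\(\.+\)")               # (.), (..), (...)
UNINTELLIGIBLE = re.compile(r"\b(?:xxx|yyy)\b")   # unintelligible speech markers


def smooth_chat_utterance(line: str) -> str:
    """Strip disfluency markup from one CHAT utterance and collapse repeated words."""
    for pattern in (FILLED_PAUSE, BRACKET_CODE, PAUSE_MARK, UNINTELLIGIBLE):
        line = pattern.sub(" ", line)
    # Angle brackets only delimit the scope of a following code such as [//]
    line = line.replace("<", " ").replace(">", " ")

    smoothed = []
    for word in line.split():
        # Collapse consecutive duplicates ("go go" -> "go"), case-insensitively
        if not smoothed or word.lower() != smoothed[-1].lower():
            smoothed.append(word)
    return " ".join(smoothed)


print(smooth_chat_utterance("I &-um I [/] I want (.) to xxx go go home"))
# I want to go home
```

Collapsing every consecutive duplicate also removes legitimate repeats such as "had had", which is a trade-off of this simple rule.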
## Training recipe
| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |
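
The released checkpoint is the one with the lowest validation WER over training (step 1200 of
1764). Because the validation split was used for this selection, its WER is likely somewhat
optimistic compared with the held-out test split.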
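
The figures in the recipe table can be cross-checked with a few lines of arithmetic. The sketch below assumes Whisper-small's published architecture (hidden size 768, 12 encoder and 12 decoder layers, with cross-attention in every decoder layer), float32 adapter weights, and a warmup-then-linear-decay-to-zero schedule; none of these details are stated explicitly in this card, and the function names are illustrative.

```python
import math

# Whisper-small architecture (assumed here, not stated in the card)
D_MODEL = 768
ENCODER_LAYERS = 12   # one self-attention block per layer
DECODER_LAYERS = 12   # self-attention + cross-attention per layer

RANK = 32
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION
# Each adapted projection gains A (rank x d_model) and B (d_model x rank).
lora_params = adapted_modules * RANK * (D_MODEL + D_MODEL)


def optimizer_steps(num_examples, batch_size, grad_accum, epochs):
    """Optimizer updates when gradients are accumulated over `grad_accum` batches."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return (batches_per_epoch // grad_accum) * epochs


def learning_rate(step, peak=1e-4, warmup=100, total=1764):
    """Linear warmup to `peak`, then linear decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print(lora_params)                       # 3538944
print(round(lora_params * 4 / 1e6, 1))   # 14.2 (MB at 4 bytes per parameter)
print(optimizer_steps(9402, 8, 2, 3))    # 1764
```

Under these assumptions the parameter count, adapter size, and step count all agree with the values reported in this card, and `learning_rate(1200)` gives the approximate learning rate at the released checkpoint.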
## Limitations
- **English only.** Trained exclusively on English speech.
- **Adult speakers.** Training data is from adults who stutter — not validated on children.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0). For a
commercial-friendly variant trained on Apache 2.0 data only, see
[`nazarkozak/whisper-small-disfluent-commercial-lora`](#) (planned).
- **Not a clinical instrument.** The adapter improves transcription, not assessment of
stuttering severity.
- **Generalization gap.** On a 200-sample held-out test split the relative WER reduction is
smaller (~9 %) than on validation (~28 %), suggesting some sensitivity to which sessions land
in train vs test. Real-world gains will likely fall somewhere in between, depending on how
similar the speaker is to the training speakers.
## Citation
If you use this adapter, please cite it together with the base model and the source datasets:
```bibtex
@misc{kozak2026whisperdisfluent,
author = {Nazar Kozak},
title = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}
@article{radford2022whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
Christine and Sutskever, Ilya},
journal = {arXiv preprint arXiv:2212.04356},
year = {2022}
}
@article{ratner2018fluencybank,
title = {Fluency Bank: A new resource for fluency research and practice},
author = {{Bernstein Ratner}, Nan and MacWhinney, Brian},
journal= {Journal of Fluency Disorders},
volume = {56},
pages = {69--80},
year = {2018}
}
@misc{wang2024disfluencyspeech,
title = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
author = {Kyra Wang and Dorien Herremans},
year = {2024},
eprint = {2406.08820},
archivePrefix = {arXiv}
}
```
## Related models
- **Verbatim variant** — preserves disfluencies in the output:
[`nazarkozak/whisper-small-disfluent-verbatim-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-verbatim-lora)
- **CoreML build** — for on-device iOS use (planned):
[`nazarkozak/whisper-small-disfluent-smoothed-coreml`](#)
- **Hosted API** — Replicate endpoint (planned):
`replicate.com/nazarkozak/whisper-disfluent`