whisper-small-disfluent-smoothed (LoRA adapter)
A LoRA fine-tune of openai/whisper-small that
transcribes disfluent / stuttered speech to clean, fluent text — recovering the speaker's
intended words and dropping the disfluencies (filled pauses, repetitions, sound prolongations,
blocks, interjections).
This is the smoothed variant. A companion verbatim variant — which preserves disfluencies
in the output — is published as
nazarkozak/whisper-small-disfluent-verbatim-lora.
Why this exists
Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model reaches 26.04 % WER. With this adapter applied, the same model reaches 20.46 % WER, a 21.4 % relative reduction on real disfluent speech.
The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users can keep using their existing pipeline and toggle the adapter on for disfluent input.
Use cases
- Inclusive voice assistants that don't fail on disfluent users
- Live captioning where the speaker stutters
- AAC apps for people with speech-disorder profiles
- Speech-therapy session transcription (smoothed mode is what most clinicians want)
- Inclusive media transcription of interviews / podcasts with disfluent speakers
How to use
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()
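# Inference is identical to vanilla Whisper-small
import soundfile as sf
audio, sr = sf.read("path/to/disfluent_speech.wav")  # must be 16 kHz mono; resample beforehand if not
# Passing the file's own rate lets the processor raise an error if it is not 16 kHz
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features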
generated = model.generate(inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
If you need the verbatim version (preserves disfluencies in the output), load
nazarkozak/whisper-small-disfluent-verbatim-lora instead.
Performance
All numbers are Word Error Rate (lower is better), computed with Whisper's official English text normalizer.
| Split | Vanilla Whisper-small | This adapter | Δ absolute (pp) | Δ relative |
|---|---|---|---|---|
| Full FluencyBank AWS-interview (5,444 utts) | 26.04 % | 20.46 % | −5.58 | −21.4 % |
| Held-out validation (200 utts) | ~26 %† | 18.56 % | ~−7.4 | ~−28 % |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |
†Baseline on the exact 200-sample held-out val split was not separately re-run; the figure is estimated from the 26.04 % full-set baseline. Held-out test was re-run and shows the test split is harder than the validation split (likely a session-level effect with only 200 samples).
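The two delta columns follow directly from the WER pairs in the table. A minimal sketch of that arithmetic (the function name `wer_deltas` is illustrative, not part of any released code; the validation baseline uses the estimated ~26 % figure, so its deltas are approximate):

```python
def wer_deltas(baseline: float, adapted: float) -> tuple[float, float]:
    """Return (absolute delta in percentage points, relative delta in percent)."""
    absolute = adapted - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# (baseline WER %, adapter WER %) as reported in the table above
rows = {
    "full": (26.04, 20.46),
    "validation": (26.0, 18.56),  # baseline is the estimated ~26 %, not a re-run
    "test": (28.71, 26.08),
}

for name, (base, adapted) in rows.items():
    absolute, relative = wer_deltas(base, adapted)
    print(f"{name}: {absolute:+.2f} pp, {relative:+.1f} %")
```

Negative values mean the adapter lowers WER relative to vanilla Whisper-small.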
Per-example breakdown (5,020 paired samples)
| Outcome | Count | % |
|---|---|---|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |
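A per-example comparison of this kind can be sketched as follows. This is an illustration under assumptions, not the evaluation script used for the table: it assumes both hypotheses and the reference have already been passed through the same text normalizer, and the helper names (`word_errors`, `wer`, `outcome`) are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Per-utterance WER; guards against an empty reference."""
    return word_errors(reference, hypothesis) / max(len(reference.split()), 1)


def outcome(reference: str, baseline_hyp: str, adapter_hyp: str) -> str:
    """Classify one paired sample as in the breakdown table."""
    base, adapted = wer(reference, baseline_hyp), wer(reference, adapter_hyp)
    if adapted < base:
        return "adapter_better"
    if adapted > base:
        return "vanilla_better"
    return "tie"


print(outcome("we can speak to pets", "because we can", "we can speak to pets"))
```

Note that corpus-level WER is normally total errors divided by total reference words, which is not the same as averaging these per-utterance values.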
What kinds of failures it fixes
The adapter fixes several of Whisper-small's characteristic failure modes on disfluent input:
| Vanilla failure mode | Reference | Vanilla output | Adapter output |
|---|---|---|---|
| Hallucinated repeats ("saint saint saint…" ×100) | my stuttering has the least impact on my life like my friends and family | my saint saint saint saint saint saint saint… | my stuttering has the least impact on my life like my friends and family |
| Mistaking disfluencies for words | and they stutter like that person | and they they they said are like like that person | and they stutter like that person |
| Mishearing / wrong intent | stuttering has had a variety of impacts on me | i still do not have a variety of packs on me | stuttering has had a variety of impacts on me |
| Truncation | we can speak to pets | because we can | we can speak to pets |
Training data
| Source | Hours | License | Use |
|---|---|---|---|
| FluencyBank Teaching — Adults Who Stutter (interview) | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| DisfluencySpeech (amaai-lab) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |
Total: ~27 hr training, 9,402 utterance pairs.
The smoothed transcripts come from:
- For FluencyBank: CHAT-format transcripts, with a custom parser that drops filled-pause markers (&-uh, &-um), bracket annotations ([/], [//]), pause marks ((.)), and unintelligible markers (xxx / yyy), and collapses consecutive duplicate words.
- For DisfluencySpeech: the dataset's own transcript_c field (its cleanest variant, with filled pauses, editing terms, and false starts removed).
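The FluencyBank cleaning rules listed above could look roughly like the sketch below. It is not the author's parser: the function name `smooth_chat` and every regular expression are assumptions that cover only the marker types named in this card, plus longer pause marks such as (..). It does not handle other CHAT conventions such as angle-bracket scoping.

```python
import re


def smooth_chat(utterance: str) -> str:
    """Approximate a 'smoothed' transcript from one CHAT-style utterance string."""
    text = re.sub(r"&-\w+", " ", utterance)       # filled pauses: &-uh, &-um
    text = re.sub(r"\[[^\]]*\]", " ", text)       # bracket annotations: [/], [//]
    text = re.sub(r"\(\.+\)", " ", text)          # pause marks: (.), assumed also (..), (...)
    text = re.sub(r"\b(?:xxx|yyy)\b", " ", text)  # unintelligible markers

    # Collapse consecutive duplicate words (case-insensitive comparison).
    cleaned = []
    for word in text.split():
        if not cleaned or cleaned[-1].lower() != word.lower():
            cleaned.append(word)
    return " ".join(cleaned)


print(smooth_chat("&-um I [/] I want (.) to xxx go"))
```

Collapsing duplicates runs last on purpose: removing a marker such as [/] is what makes the repeated words adjacent in the first place.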
Training recipe
| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small (244 M params) |
| LoRA target modules | q_proj, v_proj |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |
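The trainable-parameter figure can be sanity-checked with a short calculation. The sketch assumes the standard Whisper-small shape (hidden size 768, 12 encoder and 12 decoder layers) and that the q_proj / v_proj targets match every attention block, including decoder cross-attention; neither detail is stated in this card.

```python
D_MODEL = 768              # Whisper-small hidden size (assumed from the base model config)
ENCODER_LAYERS = 12        # assumed
DECODER_LAYERS = 12        # assumed
RANK = 32                  # LoRA rank from the table
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# Each encoder layer has self-attention; each decoder layer has self- and cross-attention.
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_linears = attention_blocks * TARGETS_PER_ATTENTION

# A LoRA pair on a d x d linear layer adds A (r x d) and B (d x r).
params_per_linear = RANK * (D_MODEL + D_MODEL)
trainable = adapted_linears * params_per_linear

BASE_PARAMS = 244_000_000  # rounded figure from the table
share = 100 * trainable / (BASE_PARAMS + trainable)

print(f"{trainable:,} trainable parameters, {share:.2f} % of total")
```

Under these assumptions the count comes to about 3.5 M, in line with the table. With rank 32 and alpha 64, the usual LoRA scaling factor alpha / rank is 2.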
The released checkpoint is the one with the lowest validation WER over training (step 1200).
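The step count and learning-rate schedule in the table can be reproduced with a small sketch. It assumes all 9,402 utterance pairs are used each epoch, that the final partial batch is kept, and that the rate decays linearly to zero at the last step; the card states only "linear warmup over 100 steps, then linear decay". Function names are illustrative.

```python
import math


def optimizer_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps for a run, keeping the last partial batch of each epoch."""
    return math.ceil(num_examples / effective_batch) * epochs


def linear_warmup_decay(step: int, peak_lr: float = 1e-4,
                        warmup_steps: int = 100, total_steps: int = 1764) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # ramp up from 0 to peak_lr
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)      # ramp down to 0 (assumed end point)


print(optimizer_steps(9402, 16, 3))
for step in (0, 50, 100, 1200, 1764):
    print(step, linear_warmup_decay(step))
```

Under these assumptions the run has 1764 optimizer steps, matching the table, and the released checkpoint at step 1200 sits about two thirds of the way through the decay phase.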
Limitations
- English only. Trained exclusively on English speech.
- Adult speakers. Training data is from adults who stutter — not validated on children.
- Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0). For a commercial-friendly variant trained on Apache 2.0 data only, see nazarkozak/whisper-small-disfluent-commercial-lora (planned).
- Not a clinical instrument. The adapter improves transcription, not assessment of stuttering severity.
- Generalization gap. On a 200-sample held-out test split the relative WER reduction is smaller (about 9 %) than on validation (about 28 %), suggesting some sensitivity to which sessions land in train vs test. Real-world results will likely fall somewhere in between, depending on speaker similarity.
Citation
If you use this adapter, please cite both the base model and the source datasets:
@misc{kozak2026whisperdisfluent,
author = {Nazar Kozak},
title = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}
@article{radford2022whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
Christine and Sutskever, Ilya},
journal = {arXiv preprint arXiv:2212.04356},
year = {2022}
}
@article{ratner2018fluencybank,
title = {Fluency Bank: A new resource for fluency research and practice},
author = {Ratner, Nan Bernstein and MacWhinney, Brian},
journal= {Journal of Fluency Disorders},
volume = {56},
pages = {69--80},
year = {2018}
}
@misc{wang2024disfluencyspeech,
title = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
author = {Kyra Wang and Dorien Herremans},
year = {2024},
eprint = {2406.08820},
archivePrefix = {arXiv}
}
Related models
- Verbatim variant (preserves disfluencies in the output): nazarkozak/whisper-small-disfluent-verbatim-lora
- CoreML build for on-device iOS use (planned): nazarkozak/whisper-small-disfluent-smoothed-coreml
- Hosted API, Replicate endpoint (planned): replicate.com/nazarkozak/whisper-disfluent