whisper-small-disfluent-smoothed (LoRA adapter)

A LoRA fine-tune of openai/whisper-small that transcribes disfluent / stuttered speech to clean, fluent text — recovering the speaker's intended words and dropping the disfluencies (filled pauses, repetitions, sound prolongations, blocks, interjections).

This is the smoothed variant. A companion verbatim variant — which preserves disfluencies in the output — is published as nazarkozak/whisper-small-disfluent-verbatim-lora.

Why this exists

Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model reaches 26.04 % WER. With this adapter applied, the same model reaches 20.46 % WER: a 5.58-point absolute drop, or a 21.4 % relative reduction, on real disfluent speech.

The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users can keep using their existing pipeline and toggle the adapter on for disfluent input.

Use cases

Inclusive voice assistants that don't fail on disfluent users
Live captioning where the speaker stutters
AAC (augmentative and alternative communication) apps for people with speech-disorder profiles
Speech-therapy session transcription (smoothed mode is what most clinicians want)
Inclusive media transcription of interviews / podcasts with disfluent speakers

How to use

import soundfile as sf
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Inference is identical to vanilla Whisper-small
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
assert sr == 16_000, "Whisper expects 16 kHz audio; resample the file first"
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
with torch.no_grad():
    generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

If you need the verbatim version (preserves disfluencies in the output), load nazarkozak/whisper-small-disfluent-verbatim-lora instead.

Performance

All numbers are Word Error Rate (lower is better), computed with Whisper's official English text normalizer.

Split	Vanilla Whisper-small	This adapter	Δ Absolute (points)	Δ Relative
Full FluencyBank AWS-interview (5,444 utts)	26.04 %	20.46 %	−5.58	−21.4 %
Held-out validation (200 utts)	~26 %†	18.56 %	−7.4	−28 %
Held-out test (200 utts)	28.71 %	26.08 %	−2.63	−9.2 %

† The baseline was not re-run on the exact 200-sample held-out validation split; the figure is estimated from the 26.04 % full-set baseline, so the deltas in that row are approximate. The held-out test baseline was re-run, and it shows that the test split is harder than the validation split (likely a session-level effect, given only 200 samples).
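For readers who want to reproduce the Δ columns, here is a minimal sketch of corpus-level WER and the relative change. It makes two assumptions that this card does not confirm: `normalize` is a crude stand-in for Whisper's English text normalizer (so it will not reproduce the exact figures above), and WER is pooled over the whole corpus rather than averaged per utterance.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in for a text normalizer: lowercase, drop punctuation, split."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER in percent: total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = normalize(ref), normalize(hyp)
        errors += word_errors(ref_words, hyp_words)
        words += len(ref_words)
    return 100.0 * errors / words


def relative_change(baseline: float, adapted: float) -> float:
    """Relative WER change in percent; negative means the adapter is better."""
    return 100.0 * (adapted - baseline) / baseline


# The truncation example from this card: 4 word errors against 5 reference words.
print(corpus_wer([("we can speak to pets", "because we can")]))  # 80.0
# Full-set row of the table: 26.04 % -> 20.46 %.
print(round(relative_change(26.04, 20.46), 1))  # -21.4
```

The same `relative_change` call on the held-out test row (28.71 % to 26.08 %) gives −9.2 %, matching the table.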

Per-example breakdown (5,020 paired samples)

Outcome	Count	%
Adapter strictly better than baseline	1,836	36.6 %
Tie (both correct or both equally wrong)	2,564	51.1 %
Vanilla strictly better	620	12.4 %
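The breakdown above can be tallied with a few lines of Python. This is a sketch under one assumption: that "better" means fewer word errors on the same utterance. The card does not state which per-example metric was compared, and the function names are hypothetical.

```python
from collections import Counter


def tally_outcomes(base_errors: list[int], lora_errors: list[int]) -> Counter:
    """Count per-utterance outcomes from paired error counts (lower is better)."""
    counts = Counter({"adapter_better": 0, "tie": 0, "vanilla_better": 0})
    for base, lora in zip(base_errors, lora_errors):
        if lora < base:
            counts["adapter_better"] += 1
        elif lora > base:
            counts["vanilla_better"] += 1
        else:
            counts["tie"] += 1
    return counts


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert outcome counts to percentages of all paired samples."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Counts reported in the table above (5,020 paired samples in total).
reported = {"adapter_better": 1836, "tie": 2564, "vanilla_better": 620}
print(shares(reported))
# {'adapter_better': 36.6, 'tie': 51.1, 'vanilla_better': 12.4}
```

The three shares sum to 100.1 % because each is rounded independently.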

What kinds of failures it fixes

The adapter fixes several characteristic Whisper-small failure modes on disfluent input:

Vanilla failure mode	Example
Hallucination repeats ("saint saint saint…" ×100)	`REF: my stuttering has the least impact on my life like my friends and family` `BASE: my saint saint saint saint saint saint saint…` `LoRA: my stuttering has the least impact on my life like my friends and family`
Mistaking disfluencies for words	`REF: and they stutter like that person` `BASE: and they they they said are like like that person` `LoRA: and they stutter like that person`
Mishearing / wrong intent	`REF: stuttering has had a variety of impacts on me` `BASE: i still do not have a variety of packs on me` `LoRA: stuttering has had a variety of impacts on me`
Truncation	`REF: we can speak to pets` `BASE: because we can` `LoRA: we can speak to pets`

Training data

Source	Hours	License	Use
FluencyBank Teaching — Adults Who Stutter (interview)	~17	CC BY-NC-SA 4.0 (TalkBank)	Train (most of the disfluent signal)
DisfluencySpeech (amaai-lab)	10	Apache 2.0	Train (single-speaker scripted, augmentation)

Total: ~27 hours of training audio, 9,402 utterance pairs.

The smoothed transcripts come from:

For FluencyBank — CHAT-format transcripts, with a custom parser that drops filled-pause markers (&-uh, &-um), bracket annotations ([/], [//]), pause marks ((.)), unintelligible markers (xxx/yyy), and collapses consecutive duplicate words.
For DisfluencySpeech — the dataset's own transcript_c field (its cleanest variant, with filled pauses, editing terms, and false starts removed).
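The FluencyBank smoothing rules listed above can be approximated with a handful of regular expressions. This is an illustrative sketch, not the parser used to build the training set: `smooth_chat` and the patterns are hypothetical, and real CHAT files contain further codes (for example angle-bracket scoping of multi-word retracings) that this does not handle.

```python
import re

# Patterns for the marker types named in this card (the exact regexes are assumptions).
FILLED_PAUSE = re.compile(r"&-\w+")               # &-uh, &-um
BRACKET_CODE = re.compile(r"\[[^\]]*\]")          # [/], [//]
PAUSE_MARK = re.compile(r"\(\.+\)")               # (.) and longer pause variants
UNINTELLIGIBLE = re.compile(r"\b(?:xxx|yyy)\b")   # xxx, yyy


def smooth_chat(line: str) -> str:
    """Strip disfluency markers, then collapse consecutive duplicate words."""
    for pattern in (FILLED_PAUSE, BRACKET_CODE, PAUSE_MARK, UNINTELLIGIBLE):
        line = pattern.sub(" ", line)
    smoothed: list[str] = []
    for word in line.split():
        # Case-insensitive comparison so "I i" also collapses to one word.
        if not smoothed or word.lower() != smoothed[-1].lower():
            smoothed.append(word)
    return " ".join(smoothed)


print(smooth_chat("&-um I [/] I want (.) xxx to to go"))  # I want to go
```

Note that collapsing duplicates also removes intentional repetitions such as "very very", which is a trade-off of this simple rule.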

Training recipe

Hyperparameter	Value
Base model	`openai/whisper-small` (244 M params)
LoRA target modules	`q_proj`, `v_proj`
LoRA rank / alpha / dropout	32 / 64 / 0.05
Trainable params	3.5 M (1.4 % of total)
Batch size (effective)	16 (8 × 2 grad accum)
Learning rate	1e-4 with linear warmup over 100 steps, then linear decay
Epochs	3 (1,764 optimizer steps)
Hardware	Apple M1 Max 64 GB (PyTorch + MPS backend)
Wall clock	~2.5 h
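Several numbers in the table can be cross-checked with simple arithmetic. The sketch below assumes the standard Whisper-small layout (hidden size 768, 12 encoder and 12 decoder layers, with each decoder layer holding both a self-attention and a cross-attention block), float32 adapter weights, and that all 9,402 pairs were used for training. None of these is stated in this card, and the function names are hypothetical.

```python
import math


def lora_param_count(d_model: int = 768, encoder_layers: int = 12,
                     decoder_layers: int = 12, rank: int = 32,
                     targets_per_block: int = 2) -> int:
    """Trainable LoRA parameters when q_proj and v_proj are adapted in every attention block."""
    attention_blocks = encoder_layers + 2 * decoder_layers   # 12 + 24 = 36
    adapted_matrices = attention_blocks * targets_per_block  # 72
    # Each adapted d x d projection gains A (rank x d) and B (d x rank).
    return adapted_matrices * rank * (d_model + d_model)


def optimizer_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps, assuming the last partial batch of each epoch is kept."""
    return epochs * math.ceil(num_examples / effective_batch)


params = lora_param_count()
print(params)                        # 3538944, i.e. ~3.5 M
print(params * 4 / 1e6)              # 14.155776 MB in float32
print(optimizer_steps(9402, 16, 3))  # 1764
```

Under these assumptions the results line up with the reported 3.5 M trainable parameters, the ≈14 MB adapter size, and the 1,764 optimizer steps.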

The released checkpoint is the one with the lowest validation WER over training (step 1200).
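To give a sense of where step 1200 sits in the schedule, here is a sketch of a linear warmup followed by linear decay to zero, using the values from the recipe table. It assumes the decay ends at zero on the last step, which is the shape of the common linear-with-warmup schedule; the card does not say which scheduler implementation was used, and `linear_warmup_decay` is a hypothetical name.

```python
def linear_warmup_decay(step: int, peak_lr: float = 1e-4,
                        warmup_steps: int = 100, total_steps: int = 1764) -> float:
    """Learning rate at a given optimizer step under linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return peak_lr * remaining


for step in (0, 100, 1200, 1764):
    print(step, linear_warmup_decay(step))
```

Under this assumption the released checkpoint was saved when the learning rate had decayed to roughly a third of its peak (564 / 1664 of 1e-4).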

Limitations

English only. Trained exclusively on English speech.
Adult speakers. Training data is from adults who stutter — not validated on children.
Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0). For a commercial-friendly variant trained on Apache 2.0 data only, see nazarkozak/whisper-small-disfluent-commercial-lora (planned).
Not a clinical instrument. The adapter improves transcription, not assessment of stuttering severity.
Generalization gap. On a 200-sample held-out test split the relative WER reduction is smaller (~9 %) than on validation (~28 %), suggesting some sensitivity to which sessions land in train vs test. The real-world WER reduction will likely fall somewhere in between, depending on speaker similarity.

Citation

If you use this adapter, please cite it together with the base model and the source datasets:

@misc{kozak2026whisperdisfluent,
  author = {Nazar Kozak},
  title  = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
             Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@article{ratner2018fluencybank,
  title  = {Fluency Bank: A new resource for fluency research and practice},
  author = {Ratner, Nan Bernstein and MacWhinney, Brian},
  journal= {Journal of Fluency Disorders},
  volume = {56},
  pages  = {69--80},
  year   = {2018}
}

@misc{wang2024disfluencyspeech,
  title  = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
  author = {Kyra Wang and Dorien Herremans},
  year   = {2024},
  eprint = {2406.08820},
  archivePrefix = {arXiv}
}

Related models

Verbatim variant — preserves disfluencies in the output: nazarkozak/whisper-small-disfluent-verbatim-lora
CoreML build — for on-device iOS use (planned): nazarkozak/whisper-small-disfluent-smoothed-coreml
Hosted API — Replicate endpoint (planned): replicate.com/nazarkozak/whisper-disfluent
