whisper-small-disfluent-smoothed (LoRA adapter)

A LoRA fine-tune of openai/whisper-small that transcribes disfluent / stuttered speech to clean, fluent text — recovering the speaker's intended words and dropping the disfluencies (filled pauses, repetitions, sound prolongations, blocks, interjections).

This is the smoothed variant. A companion verbatim variant — which preserves disfluencies in the output — is published as nazarkozak/whisper-small-disfluent-verbatim-lora.

Why this exists

Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model reaches 26.04 % WER; with this adapter applied, the same model reaches 20.46 % WER, a 21.4 % relative reduction (5.58 points absolute) on real disfluent speech.

The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users can keep using their existing pipeline and toggle the adapter on for disfluent input.

Use cases

  • Inclusive voice assistants that don't fail on disfluent users
  • Live captioning where the speaker stutters
  • AAC (augmentative and alternative communication) apps for people with speech-disorder profiles
  • Speech-therapy session transcription (smoothed mode is what most clinicians want)
  • Inclusive media transcription of interviews / podcasts with disfluent speakers

How to use

import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and its processor, then attach the LoRA adapter on top.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Inference is identical to vanilla Whisper-small.
# Whisper expects 16 kHz mono audio; resample beforehand if your file differs.
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
assert sr == 16_000 and audio.ndim == 1, "expected 16 kHz mono audio"
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
# Pass input_features by keyword so the call works with any PEFT model wrapper.
generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

If you need the verbatim version (preserves disfluencies in the output), load nazarkozak/whisper-small-disfluent-verbatim-lora instead.
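
Because the adapter sits on top of unchanged Whisper-small weights, one loaded model can serve both fluent and disfluent input. The sketch below assumes the `model` and `processor` objects from the snippet above and relies on PEFT's `disable_adapter()` context manager. The `transcribe` helper and its `use_adapter` flag are hypothetical names introduced here for illustration, not part of this repository.

```python
from contextlib import nullcontext


def transcribe(model, processor, audio, use_adapter=True):
    """Transcribe 16 kHz mono audio, optionally bypassing the LoRA adapter.

    `model` is assumed to be the PeftModel and `processor` the WhisperProcessor
    loaded as in the snippet above. This helper is illustrative only.
    """
    features = processor(audio, sampling_rate=16_000, return_tensors="pt").input_features
    # PeftModel.disable_adapter() temporarily routes around the LoRA weights,
    # which gives vanilla Whisper-small behaviour without reloading anything.
    context = nullcontext() if use_adapter else model.disable_adapter()
    with context:
        token_ids = model.generate(
            input_features=features,
            language="english",
            task="transcribe",
            max_new_tokens=200,
        )
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]
```

Calling `transcribe(model, processor, audio)` returns the smoothed transcript, while `transcribe(model, processor, audio, use_adapter=False)` returns the baseline output from the same loaded weights.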

Performance

All numbers are Word Error Rate (lower is better), computed with Whisper's official English text normalizer.

| Split | Vanilla Whisper-small | This adapter | Δ absolute (pp) | Δ relative |
|---|---|---|---|---|
| Full FluencyBank AWS-interview (5,444 utts) | 26.04 % | 20.46 % | −5.58 | −21.4 % |
| Held-out validation (200 utts) | ~26 %† | 18.56 % | −7.4 | −28 % |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |

† Baseline on the exact 200-sample held-out val split was not separately re-run; the figure is estimated from the 26.04 % full-set baseline. Held-out test was re-run and shows the test split is harder than the validation split (likely a session-level effect with only 200 samples).
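
The absolute and relative deltas follow directly from each WER pair. As a quick check, this sketch recomputes them for the two rows whose baseline was actually measured; the function name `wer_deltas` is made up for this example.

```python
def wer_deltas(baseline_wer: float, adapter_wer: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = adapter_wer - baseline_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# WER pairs (baseline, adapter) in percent, copied from the figures above.
rows = {
    "full set": (26.04, 20.46),
    "held-out test": (28.71, 26.08),
}
for name, (base, adapted) in rows.items():
    absolute, relative = wer_deltas(base, adapted)
    print(f"{name}: {absolute:+.2f} pp, {relative:+.1f} %")
# full set: -5.58 pp, -21.4 %
# held-out test: -2.63 pp, -9.2 %
```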

Per-example breakdown (5,020 paired samples)

| Outcome | Count | % |
|---|---|---|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |
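
A paired comparison like this can be produced by scoring each utterance under both systems and bucketing the result. The sketch below is one way to do that, not the evaluation script behind these counts: it uses plain lowercase whitespace tokenization instead of Whisper's English normalizer, and `word_errors` and `compare` are illustrative names.

```python
from collections import Counter


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match / substitution
        previous = current
    return previous[-1]


def compare(reference: str, baseline: str, adapter: str) -> str:
    """Label one utterance by which system made fewer word errors."""
    base_err = word_errors(reference, baseline)
    lora_err = word_errors(reference, adapter)
    if lora_err < base_err:
        return "adapter better"
    if lora_err > base_err:
        return "vanilla better"
    return "tie"


# (REF, BASE, LoRA) triples: the first comes from the truncation example in
# this card, the second is a made-up tie for illustration.
samples = [
    ("we can speak to pets", "because we can", "we can speak to pets"),
    ("thank you", "thank you", "thank you"),
]
print(Counter(compare(*sample) for sample in samples))
```

Both systems are scored against the same reference, so comparing raw error counts gives the same ordering as comparing per-utterance WER.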

What kinds of failures it fixes

The adapter corrects several of Whisper-small's characteristic failure modes on disfluent input, as these examples show:

| Vanilla failure mode | Reference (REF) | Vanilla output (BASE) | Adapter output (LoRA) |
|---|---|---|---|
| Hallucinated repetition ("saint saint saint…" ×100) | my stuttering has the least impact on my life like my friends and family | my saint saint saint saint saint saint saint… | my stuttering has the least impact on my life like my friends and family |
| Mistaking disfluencies for words | and they stutter like that person | and they they they said are like like that person | and they stutter like that person |
| Mishearing / wrong intent | stuttering has had a variety of impacts on me | i still do not have a variety of packs on me | stuttering has had a variety of impacts on me |
| Truncation | we can speak to pets | because we can | we can speak to pets |

Training data

| Source | Hours | License | Use |
|---|---|---|---|
| FluencyBank Teaching - Adults Who Stutter (interview) | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| DisfluencySpeech (amaai-lab) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hr training, 9,402 utterance pairs.

The smoothed transcripts come from:

  • For FluencyBank — CHAT-format transcripts, with a custom parser that drops filled-pause markers (&-uh, &-um), bracket annotations ([/], [//]), pause marks ((.)), unintelligible markers (xxx/yyy), and collapses consecutive duplicate words.
  • For DisfluencySpeech — the dataset's own transcript_c field (its cleanest variant, with filled pauses, editing terms, and false starts removed).
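
The FluencyBank smoothing rules listed above can be sketched with a few regular expressions. This is a hypothetical re-implementation for illustration only, since the actual parser is not included in this card; it handles just the markers named above and ignores other CHAT codes (for example angle-bracket scopes or utterance terminators).

```python
import re

# Hypothetical patterns for the markers named above; not the author's parser.
FILLED_PAUSE = re.compile(r"&-\w+")            # &-uh, &-um
BRACKET_CODE = re.compile(r"\[[^\]]*\]")       # [/], [//] and other bracketed codes
PAUSE_MARK = re.compile(r"\(\.+\)")            # (.), (..), (...)
UNINTELLIGIBLE = re.compile(r"\b(?:xxx|yyy)\b")


def smooth_chat_line(line: str) -> str:
    """Strip the disfluency markers and collapse consecutive duplicate words."""
    text = line
    for pattern in (FILLED_PAUSE, BRACKET_CODE, PAUSE_MARK, UNINTELLIGIBLE):
        text = pattern.sub(" ", text)
    smoothed = []
    for word in text.split():
        # Drop a word when it repeats the previous one (case-insensitive).
        if not smoothed or word.lower() != smoothed[-1].lower():
            smoothed.append(word)
    return " ".join(smoothed)


print(smooth_chat_line("and &-um they [/] they (.) stutter xxx like that person"))
# and they stutter like that person
```

One design consequence worth noting: collapsing every consecutive duplicate also removes legitimate repetitions such as "had had" or "that that", so a rule of this kind trades a small amount of fidelity for cleaner targets.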

Training recipe

| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small (244 M params) |
| LoRA target modules | q_proj, v_proj |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |

The released checkpoint is the one with the lowest validation WER over training (step 1200).
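
The trainable-parameter count, step count, and adapter size quoted in this card are mutually consistent, which the following back-of-the-envelope sketch checks. It uses Whisper-small's published architecture (12 encoder and 12 decoder layers, hidden width 768). Three further points are assumptions made for this check rather than statements from the training code: that `q_proj`/`v_proj` match every attention block including decoder cross-attention, that all 9,402 pairs are used for training, and that the adapter is stored as 32-bit floats.

```python
import math

D_MODEL = 768              # Whisper-small hidden width
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 32
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# Each encoder layer has one attention block (self-attention); each decoder
# layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION

# A rank-r LoRA update on a d x d projection adds A (r x d) and B (d x r).
params_per_matrix = RANK * (D_MODEL + D_MODEL)
trainable_params = adapted_matrices * params_per_matrix

# Optimizer steps: ceil(pairs / effective batch) per epoch, times epochs.
steps = math.ceil(9_402 / 16) * 3

# Adapter size on disk, assuming float32 (4 bytes per parameter).
size_mb = trainable_params * 4 / 1e6

print(trainable_params, steps, round(size_mb, 1))
# 3538944 1764 14.2
```

Under these assumptions the result matches the reported 3.5 M trainable parameters, 1764 optimizer steps, and ≈14 MB adapter.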

Limitations

  • English only. Trained exclusively on English speech.
  • Adult speakers. Training data is from adults who stutter — not validated on children.
  • Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0). For a commercial-friendly variant trained on Apache 2.0 data only, see nazarkozak/whisper-small-disfluent-commercial-lora (planned).
  • Not a clinical instrument. The adapter improves transcription, not assessment of stuttering severity.
  • Generalization gap. On the 200-sample held-out test split the relative WER reduction is smaller (9 %) than on validation (28 %), suggesting some sensitivity to which sessions land in train vs. test. Real-world gains will likely fall somewhere between these two figures, depending on how similar the speaker is to the training speakers.

Citation

If you use this adapter, please cite it together with the base model and the source datasets:

@misc{kozak2026whisperdisfluent,
  author = {Nazar Kozak},
  title  = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
             Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@article{ratner2018fluencybank,
  title  = {Fluency Bank: A new resource for fluency research and practice},
  author = {Ratner, Nan Bernstein and MacWhinney, Brian},
  journal= {Journal of Fluency Disorders},
  volume = {56},
  pages  = {69--80},
  year   = {2018}
}

@misc{wang2024disfluencyspeech,
  title  = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
  author = {Kyra Wang and Dorien Herremans},
  year   = {2024},
  eprint = {2406.08820},
  archivePrefix = {arXiv}
}
