---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
- whisper
- lora
- automatic-speech-recognition
- disfluent-speech
- stuttering
- accessibility
- speech-disorders
datasets:
- amaai-lab/DisfluencySpeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small-disfluent-smoothed-lora
  results:
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition
    dataset:
      name: FluencyBank Adults Who Stutter (interviews, smoothed transcripts)
      type: fluencybank-aws-interview
      split: full
    metrics:
    - type: wer
      value: 20.46
      name: Word Error Rate (smoothed)
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition
    dataset:
      name: held-out validation split
      type: validation
      split: validation
    metrics:
    - type: wer
      value: 18.56
      name: Word Error Rate (held-out)
---

# whisper-small-disfluent-smoothed (LoRA adapter)

A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that transcribes **disfluent / stuttered speech** to clean, fluent text: it recovers the speaker's *intended* words and drops the disfluencies (filled pauses, repetitions, sound prolongations, blocks, interjections).

This is the **smoothed** variant. A companion **verbatim** variant, which preserves disfluencies in the output, is published as [`nazarkozak/whisper-small-disfluent-verbatim-lora`](#related-models).

## Why this exists

Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model hits **26.04 % WER**. The same model with this adapter applied reaches **20.46 % WER**, a **21.4 % relative reduction** on real disfluent speech.

The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users can keep their existing pipeline and toggle the adapter on for disfluent input.

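
The toggling mentioned above does not require reloading the model. The following is a minimal sketch, assuming `model` is the `PeftModel` and `processor` the `WhisperProcessor` built in the How to use section; `transcribe` and `use_adapter` are hypothetical names introduced for illustration. It relies on PEFT's `disable_adapter()` context manager to bypass the LoRA weights temporarily.

```python
from contextlib import nullcontext


def transcribe(model, processor, input_features, use_adapter=True):
    """Transcribe one batch of log-mel features, with or without the LoRA weights.

    Hypothetical helper: `model` is assumed to be a peft.PeftModel wrapping
    Whisper-small, `processor` the matching WhisperProcessor.
    """
    # PeftModel.disable_adapter() is a context manager: inside it the forward
    # pass uses the base Whisper-small weights only.
    context = nullcontext() if use_adapter else model.disable_adapter()
    with context:
        generated = model.generate(
            input_features=input_features,
            language="english",
            task="transcribe",
            max_new_tokens=200,
        )
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling `transcribe(model, processor, inputs, use_adapter=False)` on the same audio would then return the vanilla Whisper-small output, which makes side-by-side comparisons cheap.
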
## Use cases

- **Inclusive voice assistants** that don't fail on disfluent users
- **Live captioning** where the speaker stutters
- **AAC apps** for people with speech-disorder profiles
- **Speech-therapy session transcription** (smoothed mode is what most clinicians want)
- **Inclusive media transcription** of interviews / podcasts with disfluent speakers

## How to use

```python
import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Inference is identical to vanilla Whisper-small
audio, sr = sf.read("path/to/disfluent_speech.wav")  # expects 16 kHz mono
assert sr == 16_000, "resample to 16 kHz before calling the processor"
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
generated = model.generate(inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

If you need the *verbatim* version (preserves disfluencies in the output), load `nazarkozak/whisper-small-disfluent-verbatim-lora` instead.

## Performance

All numbers are Word Error Rate (lower is better), computed with Whisper's official English text normalizer.

| Split | Vanilla Whisper-small | This adapter | Δ Absolute (pts) | Δ Relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.04 %** | **20.46 %** | **−5.58** | **−21.4 %** |
| Held-out validation (200 utts) | ~26 %† | **18.56 %** | **~−7.4** | **~−28 %** |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |

† Baseline on the exact 200-sample held-out val split was not separately re-run; the figure is estimated from the 26.04 % full-set baseline, so the deltas in that row are approximate.

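
The Δ columns in the table above are plain arithmetic on the two WER values. As a small worked check, the sketch below recomputes them for the two rows whose baselines were actually measured; `wer_delta` is a hypothetical helper name, not part of any library.

```python
def wer_delta(baseline: float, adapted: float) -> tuple[float, float]:
    """Return (absolute change in WER points, relative change in percent)."""
    absolute = adapted - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# WER values copied from the performance table.
abs_full, rel_full = wer_delta(26.04, 20.46)
abs_test, rel_test = wer_delta(28.71, 26.08)

print(f"full set:      {abs_full:+.2f} points, {rel_full:+.1f} % relative")
print(f"held-out test: {abs_test:+.2f} points, {rel_test:+.1f} % relative")
# full set:      -5.58 points, -21.4 % relative
# held-out test: -2.63 points, -9.2 % relative
```
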
The held-out test baseline was re-run; it shows that the test split is harder than the validation split (likely a session-level effect with only 200 samples).

### Per-example breakdown (5,020 paired samples)

| Outcome | Count | % |
|---|---:|---:|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |

### What kinds of failures it fixes

The adapter fixes several recurring Whisper-small failure modes on disfluent input:

| Vanilla failure mode | Example |
|---|---|
| **Hallucination repeats** ("saint saint saint…" ×100) | `REF: my stuttering has the least impact on my life like my friends and family`<br>`BASE: my saint saint saint saint saint saint saint…`<br>`LoRA: my stuttering has the least impact on my life like my friends and family` |
| **Mistaking disfluencies for words** | `REF: and they stutter like that person`<br>`BASE: and they they they said are like like that person`<br>`LoRA: and they stutter like that person` |
| **Mishearing / wrong intent** | `REF: stuttering has had a variety of impacts on me`<br>`BASE: i still do not have a variety of packs on me`<br>`LoRA: stuttering has had a variety of impacts on me` |
| **Truncation** | `REF: we can speak to pets`<br>`BASE: because we can`<br>`LoRA: we can speak to pets` |

## Training data

| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching, Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hr training, 9,402 utterance pairs.

The smoothed transcripts come from:

- **FluencyBank:** CHAT-format transcripts, passed through a custom parser that drops filled-pause markers (`&-uh`, `&-um`), bracket annotations (`[/]`, `[//]`), pause marks (`(.)`), and unintelligible markers (`xxx`/`yyy`), then collapses consecutive duplicate words.
- **DisfluencySpeech:** the dataset's own `transcript_c` field (its cleanest variant, with filled pauses, editing terms, and false starts removed).

## Training recipe

| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |

The released checkpoint is the one with the lowest validation WER over training (step 1200).

## Limitations

- **English only.** Trained exclusively on English speech.
- **Adult speakers.** Training data is from adults who stutter; the adapter has not been validated on children.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0). For a commercial-friendly variant trained on Apache 2.0 data only, see [`nazarkozak/whisper-small-disfluent-commercial-lora`](#) (planned).
- **Not a clinical instrument.** The adapter improves transcription, not assessment of stuttering severity.
- **Generalization gap.** On a 200-sample held-out test split the relative WER reduction is smaller (~9 %) than on validation (~28 %), suggesting some sensitivity to which sessions land in train vs test. Real-world gains will likely fall between these two figures, depending on how similar the speakers are to the training data.

## Citation

If you use this adapter, please cite both the base model and the source datasets:

```bibtex
@misc{kozak2026whisperdisfluent,
  author    = {Nazar Kozak},
  title     = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@article{ratner2018fluencybank,
  title   = {Fluency Bank: A new resource for fluency research and practice},
  author  = {Ratner, Nan Bernstein and MacWhinney, Brian},
  journal = {Journal of Fluency Disorders},
  volume  = {56},
  pages   = {69--80},
  year    = {2018}
}

@misc{wang2024disfluencyspeech,
  title         = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
  author        = {Kyra Wang and Dorien Herremans},
  year          = {2024},
  eprint        = {2406.08820},
  archivePrefix = {arXiv}
}
```

## Related models

- **Verbatim variant** (preserves disfluencies in the output): [`nazarkozak/whisper-small-disfluent-verbatim-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-verbatim-lora)
- **CoreML build** for on-device iOS use (planned): [`nazarkozak/whisper-small-disfluent-smoothed-coreml`](#)
- **Hosted API** via a Replicate endpoint (planned): `replicate.com/nazarkozak/whisper-disfluent`
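
For readers who want to build similar smoothed targets, the FluencyBank cleanup rules listed under Training data can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser used to train the adapter: `smooth_chat_utterance` is a hypothetical name, every bracket annotation is dropped (not only `[/]` and `[//]`), and longer pause marks such as `(..)` are assumed to be treated like `(.)`. Other CHAT codes are not handled.

```python
import re

# Marker patterns named in the Training data section. Dropping every bracket
# annotation and every dotted pause mark is an assumption of this sketch.
FILLED_PAUSE = re.compile(r"&-\w+")        # &-uh, &-um
BRACKET_CODE = re.compile(r"\[[^\]]*\]")   # [/], [//]
PAUSE_MARK = re.compile(r"\(\.+\)")        # (.)
UNINTELLIGIBLE = {"xxx", "yyy"}


def smooth_chat_utterance(text: str) -> str:
    """Hypothetical helper: turn one CHAT utterance into a smoothed target string."""
    for pattern in (FILLED_PAUSE, BRACKET_CODE, PAUSE_MARK):
        text = pattern.sub(" ", text)
    words = [w for w in text.split() if w.lower() not in UNINTELLIGIBLE]
    smoothed = []
    for word in words:
        # Collapse consecutive duplicates ("they they" -> "they").
        if not smoothed or smoothed[-1].lower() != word.lower():
            smoothed.append(word)
    return " ".join(smoothed)


print(smooth_chat_utterance("and &-um they [/] they (.) stutter xxx like that person"))
# prints: and they stutter like that person
```

Collapsing duplicates also removes intentional repetitions ("very very good"), which is consistent with the smoothed-target goal but would be wrong for the verbatim variant.
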