---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
  - whisper
  - lora
  - automatic-speech-recognition
  - disfluent-speech
  - stuttering
  - accessibility
  - speech-disorders
datasets:
  - amaai-lab/DisfluencySpeech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: whisper-small-disfluent-smoothed-lora
    results:
      - task:
          type: automatic-speech-recognition
          name: Disfluent Speech Recognition
        dataset:
          name: FluencyBank Adults Who Stutter (interviews, smoothed transcripts)
          type: fluencybank-aws-interview
          split: full
        metrics:
          - type: wer
            value: 20.46
            name: Word Error Rate (smoothed)
      - task:
          type: automatic-speech-recognition
          name: Disfluent Speech Recognition
        dataset:
          name: held-out validation split
          type: validation
          split: validation
        metrics:
          - type: wer
            value: 18.56
            name: Word Error Rate (held-out)
---

# whisper-small-disfluent-smoothed (LoRA adapter)

A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that
transcribes **disfluent / stuttered speech** to clean, fluent text — recovering the speaker's
*intended* words and dropping the disfluencies (filled pauses, repetitions, sound prolongations,
blocks, interjections).

This is the **smoothed** variant. A companion **verbatim** variant — which preserves disfluencies
in the output — is published as
[`nazarkozak/whisper-small-disfluent-verbatim-lora`](#related-models).

## Why this exists

Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or
mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model
reaches **26.04 % WER**. With this adapter applied, the same model reaches **20.46 % WER**, a
**21.4 % relative reduction** on real disfluent speech.

The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users
can keep using their existing pipeline and toggle the adapter on for disfluent input.

## Use cases

- **Inclusive voice assistants** that don't fail on disfluent users
- **Live captioning** where the speaker stutters
- **Augmentative and alternative communication (AAC) apps** for people with speech disorders
- **Speech-therapy session transcription** (smoothed mode suits readable session notes)
- **Inclusive media transcription** of interviews / podcasts with disfluent speakers

## How to use

```python
import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Whisper expects 16 kHz mono audio; resample beforehand if the file differs.
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
assert sr == 16_000, f"expected 16 kHz audio, got {sr} Hz"

# Inference is identical to vanilla Whisper-small.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

If you need the *verbatim* version (preserves disfluencies in the output), load
`nazarkozak/whisper-small-disfluent-verbatim-lora` instead.
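
Because the adapter sits on top of unchanged base weights, one loaded model can serve both fluent
and disfluent audio. The sketch below shows one way to toggle it per request. It assumes
`peft_model` is the `PeftModel` loaded above and relies on PEFT's `disable_adapter()` context
manager; `generate_with_optional_adapter` is a hypothetical helper name, not part of this repository.

```python
from contextlib import nullcontext


def generate_with_optional_adapter(peft_model, input_features, use_adapter=True, **generate_kwargs):
    """Run generation with the LoRA adapter active, or bypassed for fluent input.

    Hypothetical helper: `peft_model` is assumed to be a peft.PeftModel wrapping
    Whisper-small, and `input_features` the processor output shown above.
    """
    # PeftModel.disable_adapter() is a context manager that re-enables the adapter on exit.
    context = nullcontext() if use_adapter else peft_model.disable_adapter()
    with context:
        return peft_model.generate(input_features=input_features, **generate_kwargs)
```

With `use_adapter=False`, the call should behave like vanilla Whisper-small, for example
`generate_with_optional_adapter(model, inputs, use_adapter=False, language="english", task="transcribe")`.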

## Performance

All numbers are Word Error Rate (lower is better), computed with Whisper's official English
text normalizer.
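
For readers who want to reproduce the metric: WER is the word-level edit distance between
reference and hypothesis, divided by the number of reference words. The sketch below is a
dependency-free illustration. `simple_normalize` is a simplified stand-in (lowercasing and
punctuation stripping only), not Whisper's English text normalizer, so its scores can differ
slightly from the numbers reported here.

```python
import re


def simple_normalize(text: str) -> list[str]:
    """Simplified stand-in normalizer: lowercase, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = simple_normalize(reference), simple_normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Truncation example from this card: one inserted word plus three deleted words.
print(word_error_rate("we can speak to pets", "because we can"))  # 0.8
```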

| Split | Vanilla Whisper-small | This adapter | Δ Absolute (pp) | Δ Relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.04 %** | **20.46 %** | **−5.58** | **−21.4 %** |
| Held-out validation (200 utts) | ~26 %† | **18.56 %** | ~−7.4† | ~−28 %† |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |

† Baseline on the exact 200-sample held-out validation split was not separately re-run; the
figure, and the deltas derived from it, are estimated from the 26.04 % full-set baseline. The
held-out test baseline was re-run. The adapter's WER is markedly higher there (26.08 % vs 18.56 %),
which suggests the test split is harder than the validation split (likely a session-level effect
with only 200 samples).
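
The Δ columns follow directly from the two WER columns: the absolute change is a difference in
percentage points, and the relative change divides that difference by the baseline. A short check,
with `wer_deltas` as an illustrative helper name:

```python
def wer_deltas(baseline_wer: float, adapter_wer: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = adapter_wer - baseline_wer
    relative = 100.0 * absolute / baseline_wer
    return round(absolute, 2), round(relative, 1)


print(wer_deltas(26.04, 20.46))  # full set: (-5.58, -21.4)
print(wer_deltas(28.71, 26.08))  # held-out test: (-2.63, -9.2)
```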

### Per-example breakdown (5,020 paired samples)

| Outcome | Count | % |
|---|---:|---:|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |
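
A breakdown like this can be produced by comparing per-utterance WER for the two systems. The
sketch below assumes that "strictly better" means a strictly lower per-utterance WER, which this
card does not spell out. `classify_outcomes` and `as_percentages` are hypothetical helper names.

```python
from collections import Counter


def classify_outcomes(baseline_wers, adapter_wers):
    """Count utterances where the adapter is better than, tied with, or worse than the baseline."""
    if len(baseline_wers) != len(adapter_wers):
        raise ValueError("both systems must be scored on the same utterances")
    counts = Counter({"adapter_better": 0, "tie": 0, "baseline_better": 0})
    for base, adapted in zip(baseline_wers, adapter_wers):
        if adapted < base:
            counts["adapter_better"] += 1
        elif adapted > base:
            counts["baseline_better"] += 1
        else:
            counts["tie"] += 1
    return counts


def as_percentages(counts):
    """Convert outcome counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# The counts reported above reproduce the percentage column.
print(as_percentages({"adapter_better": 1836, "tie": 2564, "baseline_better": 620}))
# {'adapter_better': 36.6, 'tie': 51.1, 'baseline_better': 12.4}
```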

### What kinds of failures it fixes

The adapter fixes several of Whisper-small's characteristic failure modes on disfluent input, as in these examples:

| Vanilla failure mode | Example |
|---|---|
| **Hallucination repeats** ("saint saint saint…" ×100) | `REF: my stuttering has the least impact on my life like my friends and family` <br> `BASE: my saint saint saint saint saint saint saint…` <br> `LoRA: my stuttering has the least impact on my life like my friends and family` |
| **Mistaking disfluencies for words** | `REF: and they stutter like that person` <br> `BASE: and they they they said are like like that person` <br> `LoRA: and they stutter like that person` |
| **Mishearing / wrong intent** | `REF: stuttering has had a variety of impacts on me` <br> `BASE: i still do not have a variety of packs on me` <br> `LoRA: stuttering has had a variety of impacts on me` |
| **Truncation** | `REF: we can speak to pets` <br> `BASE: because we can` <br> `LoRA: we can speak to pets` |

## Training data

| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching — Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hours of training audio across 9,402 utterance pairs.

The smoothed transcripts come from:
- For FluencyBank — CHAT-format transcripts, with a custom parser that drops filled-pause markers
  (`&-uh`, `&-um`), bracket annotations (`[/]`, `[//]`), pause marks (`(.)`), unintelligible markers
  (`xxx`/`yyy`), and collapses consecutive duplicate words.
- For DisfluencySpeech — the dataset's own `transcript_c` field (its cleanest variant, with
  filled pauses, editing terms, and false starts removed).
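
The FluencyBank rules listed above can be approximated with a few regular expressions. This is a
minimal sketch, not the author's parser: `smooth_chat_line` is a hypothetical name, the patterns
cover only the markers named in the list (plus longer pause marks such as `(..)`, an assumption),
and real CHAT transcripts contain further codes, such as speaker tiers and angle-bracket retracing
scopes, that it does not handle.

```python
import re

# Patterns for the markers named in the text; the author's exact rules may differ.
FILLED_PAUSE = re.compile(r"&-\w+")              # e.g. &-uh, &-um
BRACKET_ANNOTATION = re.compile(r"\[[^\]]*\]")   # e.g. [/], [//]
PAUSE_MARK = re.compile(r"\(\.+\)")              # e.g. (.)
UNINTELLIGIBLE = re.compile(r"\b(?:xxx|yyy)\b")  # unintelligible-speech placeholders


def smooth_chat_line(line: str) -> str:
    """Strip disfluency markers from one CHAT utterance and collapse repeated words."""
    text = line
    for pattern in (FILLED_PAUSE, BRACKET_ANNOTATION, PAUSE_MARK, UNINTELLIGIBLE):
        text = pattern.sub(" ", text)
    collapsed = []
    for word in text.split():
        # Drop a word when it repeats the previous kept word (case-insensitive).
        if not collapsed or collapsed[-1].lower() != word.lower():
            collapsed.append(word)
    return " ".join(collapsed)


print(smooth_chat_line("and &-um they [/] they (.) stutter xxx like like that person"))
# and they stutter like that person
```

Collapsing every consecutive duplicate also removes intentional repetitions ("very very good"),
which is a trade-off of this simple rule.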

## Training recipe

| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |

The released checkpoint is the one with the lowest validation WER over training (step 1200).
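
Several figures in this recipe can be cross-checked with simple arithmetic. The sketch below rests
on these assumptions:

- Whisper-small's architecture: hidden size 768, 12 encoder layers, 12 decoder layers.
- Every `q_proj` and `v_proj` in self- and cross-attention receives an adapter.
- All 9,402 pairs are used for training, with the final partial batch kept.
- Adapter weights are stored as float32.
- The learning rate decays to zero at the last step.

The function names are illustrative, and the schedule follows the usual linear warmup/decay shape
rather than a confirmed copy of the training script.

```python
import math

D_MODEL = 768         # Whisper-small hidden size
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 32


def lora_param_count() -> int:
    """LoRA adds two matrices per adapted projection: A (rank x d_in) and B (d_out x rank)."""
    per_projection = RANK * (D_MODEL + D_MODEL)
    # Encoder layers have self-attention only; decoder layers also have cross-attention.
    adapted = ENCODER_LAYERS * 2 + DECODER_LAYERS * 4  # q_proj + v_proj per attention block
    return adapted * per_projection


def optimizer_steps(num_examples: int = 9402, effective_batch: int = 16, epochs: int = 3) -> int:
    """Steps per epoch (rounded up) times the number of epochs."""
    return math.ceil(num_examples / effective_batch) * epochs


def learning_rate(step: int, peak: float = 1e-4, warmup: int = 100, total: int = 1764) -> float:
    """Linear warmup to `peak`, then linear decay (assumed to reach zero at `total`)."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


params = lora_param_count()
print(params)                      # 3538944, i.e. about 3.5 M
print(round(params * 4 / 1e6, 1))  # 14.2 (MB, if stored as float32)
print(optimizer_steps())           # 1764
```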

## Limitations

- **English only.** Trained exclusively on English speech.
- **Adult speakers.** Training data is from adults who stutter — not validated on children.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0). For a
  commercial-friendly variant trained on Apache 2.0 data only, see
  [`nazarkozak/whisper-small-disfluent-commercial-lora`](#) (planned).
- **Not a clinical instrument.** The adapter improves transcription, not assessment of
  stuttering severity.
- **Generalization gap.** On a 200-sample held-out test split the relative WER reduction is
  smaller (~9 %) than on validation (~28 %), suggesting some sensitivity to which sessions land
  in train vs. test. Real-world reductions will likely fall somewhere in between, depending on
  how similar the speaker is to the training speakers.

## Citation

If you use this adapter, please cite it together with the base model and the source datasets:

```bibtex
@misc{kozak2026whisperdisfluent,
  author = {Nazar Kozak},
  title  = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
             Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@article{ratner2018fluencybank,
  title  = {Fluency Bank: A new resource for fluency research and practice},
  author = {Ratner, Nan Bernstein and MacWhinney, Brian},
  journal= {Journal of Fluency Disorders},
  volume = {56},
  pages  = {69--80},
  year   = {2018}
}

@misc{wang2024disfluencyspeech,
  title  = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
  author = {Kyra Wang and Dorien Herremans},
  year   = {2024},
  eprint = {2406.08820},
  archivePrefix = {arXiv}
}
```

## Related models

- **Verbatim variant** — preserves disfluencies in the output:
  [`nazarkozak/whisper-small-disfluent-verbatim-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-verbatim-lora)
- **CoreML build** — for on-device iOS use (planned):
  [`nazarkozak/whisper-small-disfluent-smoothed-coreml`](#)
- **Hosted API** — Replicate endpoint (planned):
  `replicate.com/nazarkozak/whisper-disfluent`