---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
  - whisper
  - lora
  - automatic-speech-recognition
  - disfluent-speech
  - stuttering
  - accessibility
  - speech-disorders
datasets:
  - amaai-lab/DisfluencySpeech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: whisper-small-disfluent-smoothed-lora
    results:
      - task:
          type: automatic-speech-recognition
          name: Disfluent Speech Recognition
        dataset:
          name: FluencyBank Adults Who Stutter (interviews, smoothed transcripts)
          type: fluencybank-aws-interview
          split: full
        metrics:
          - type: wer
            value: 20.46
            name: Word Error Rate (smoothed)
      - task:
          type: automatic-speech-recognition
          name: Disfluent Speech Recognition
        dataset:
          name: held-out validation split
          type: validation
          split: validation
        metrics:
          - type: wer
            value: 18.56
            name: Word Error Rate (held-out)
---

# whisper-small-disfluent-smoothed (LoRA adapter)

A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that
transcribes **disfluent / stuttered speech** to clean, fluent text, recovering the speaker's
*intended* words and dropping the disfluencies (filled pauses, repetitions, sound prolongations,
blocks, interjections).

This is the **smoothed** variant. A companion **verbatim** variant, which preserves disfluencies
in the output, is published as
[`nazarkozak/whisper-small-disfluent-verbatim-lora`](#related-models).

## Why this exists

Vanilla Whisper-small was trained on largely fluent speech and tends to hallucinate, truncate, or
mangle disfluent input. On a corpus of 5,444 utterances by adults who stutter, the base model
reaches **26.04 % WER**. With this adapter applied, the same model reaches **20.46 % WER**, a
**21.4 % relative reduction** on real disfluent speech.

The adapter is small (≈14 MB) and applies on top of the existing Whisper-small weights, so users
can keep using their existing pipeline and toggle the adapter on for disfluent input.

## Use cases

- **Inclusive voice assistants** that don't fail on disfluent users
- **Live captioning** where the speaker stutters
- **AAC apps** for people with speech-disorder profiles
- **Speech-therapy session transcription** (smoothed mode is what most clinicians want)
- **Inclusive media transcription** of interviews / podcasts with disfluent speakers

## How to use

```python
import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-smoothed-lora")
model.eval()

# Whisper expects 16 kHz mono audio; resample beforehand if the file differs.
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
assert sr == 16_000, f"expected 16 kHz audio, got {sr} Hz"

# Inference is identical to vanilla Whisper-small.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

If you need the *verbatim* version (preserves disfluencies in the output), load
`nazarkozak/whisper-small-disfluent-verbatim-lora` instead.
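
For pipelines that handle both fluent and disfluent input, the adapter can be bypassed per call instead of loading two models. The helper below is a minimal sketch, not part of this repository: `transcribe` and its `use_adapter` flag are hypothetical names, and it assumes `model` is the `PeftModel` and `processor` the `WhisperProcessor` loaded as shown above. It relies on PEFT's `disable_adapter()` context manager to run the plain base weights.

```python
from contextlib import nullcontext


def transcribe(model, processor, input_features, use_adapter=True):
    """Transcribe one batch, optionally bypassing the LoRA adapter.

    `model` is assumed to be a peft.PeftModel wrapping Whisper-small and
    `processor` the matching WhisperProcessor (hypothetical helper).
    """
    # PeftModel.disable_adapter() temporarily routes around the LoRA weights.
    ctx = nullcontext() if use_adapter else model.disable_adapter()
    with ctx:
        ids = model.generate(
            input_features=input_features,
            language="english",
            task="transcribe",
            max_new_tokens=200,
        )
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Calling `transcribe(model, processor, inputs, use_adapter=False)` should then behave like vanilla Whisper-small, which makes side-by-side comparisons on the same audio straightforward.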

## Performance

All numbers are Word Error Rate (lower is better), computed with Whisper's official English
text normalizer.

| Split | Vanilla Whisper-small | This adapter | Δ Absolute | Δ Relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.04 %** | **20.46 %** | **−5.58** | **−21.4 %** |
| Held-out validation (200 utts) | ~26 %† | **18.56 %** | **−7.4** | **−28 %** |
| Held-out test (200 utts) | 28.71 % | 26.08 % | −2.63 | −9.2 % |

† Baseline on the exact 200-sample held-out val split was not separately re-run; the figure is
estimated from the 26.04 % full-set baseline. Held-out test was re-run and shows the test split
is harder than the validation split (likely a session-level effect with only 200 samples).
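
The delta columns follow directly from the two WER columns. As a quick check, the sketch below recomputes them for the two rows whose baselines were actually measured; `wer_deltas` is a hypothetical helper name, and the validation row is left out because its baseline is an estimate.

```python
def wer_deltas(baseline: float, adapted: float) -> tuple[float, float]:
    """Return (absolute change in WER points, relative change in percent)."""
    absolute = adapted - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


# (baseline WER %, adapter WER %) taken from the table above.
rows = {"full": (26.04, 20.46), "held-out test": (28.71, 26.08)}
for split, (base, lora) in rows.items():
    print(split, wer_deltas(base, lora))
```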

### Per-example breakdown (5,020 paired samples)

| Outcome | Count | % |
|---|---:|---:|
| Adapter strictly better than baseline | 1,836 | 36.6 % |
| Tie (both correct or both equally wrong) | 2,564 | 51.1 % |
| Vanilla strictly better | 620 | 12.4 % |

### What kinds of failures it fixes

The adapter fixes several of Whisper-small's characteristic failure modes on disfluent input:

| Vanilla failure mode | Example |
|---|---|
| **Hallucination repeats** ("saint saint saint…" ×100) | `REF: my stuttering has the least impact on my life like my friends and family` <br> `BASE: my saint saint saint saint saint saint saint…` <br> `LoRA: my stuttering has the least impact on my life like my friends and family` |
| **Mistaking disfluencies for words** | `REF: and they stutter like that person` <br> `BASE: and they they they said are like like that person` <br> `LoRA: and they stutter like that person` |
| **Mishearing / wrong intent** | `REF: stuttering has had a variety of impacts on me` <br> `BASE: i still do not have a variety of packs on me` <br> `LoRA: stuttering has had a variety of impacts on me` |
| **Truncation** | `REF: we can speak to pets` <br> `BASE: because we can` <br> `LoRA: we can speak to pets` |
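
To make the metric concrete, here is a minimal word-level WER sketch applied to the truncation example. It is a plain edit-distance stand-in, not the evaluation code behind the reported numbers: it assumes both strings are already normalized (the card's figures use Whisper's English text normalizer) and that the reference is non-empty. `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: ref word missing from hyp
                curr[j - 1] + 1,         # insertion: extra word in hyp
                prev[j - 1] + (r != h),  # substitution, or free match
            )
        prev = curr
    return prev[-1] / len(ref)


ref = "we can speak to pets"
print(word_error_rate(ref, "because we can"))        # baseline output: 0.8
print(word_error_rate(ref, "we can speak to pets"))  # adapter output: 0.0
```

The truncated baseline needs one insertion ("because") and three deletions ("speak to pets") against a five-word reference, which gives 4 / 5 = 0.8.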

## Training data

| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching: Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hr training, 9,402 utterance pairs.

The smoothed transcripts come from:
- For FluencyBank: CHAT-format transcripts, with a custom parser that drops filled-pause markers
  (`&-uh`, `&-um`), bracket annotations (`[/]`, `[//]`), pause marks (`(.)`), unintelligible markers
  (`xxx`/`yyy`), and collapses consecutive duplicate words.
- For DisfluencySpeech: the dataset's own `transcript_c` field (its cleanest variant, with
  filled pauses, editing terms, and false starts removed).
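
The FluencyBank smoothing steps listed above can be sketched as follows. This is a hypothetical re-implementation for illustration only; the actual parser is not included in this card, and real CHAT transcripts carry further markup (for example angle-bracketed retracing groups) that this sketch ignores. The function name and the regular expressions are assumptions.

```python
import re

# Assumption: any "&-word" token is a filled pause (the card names &-uh, &-um).
FILLED_PAUSE = re.compile(r"&-\w+")
# Assumption: every square-bracket code is dropped (the card names [/], [//]).
BRACKET_CODE = re.compile(r"\[[^\]]*\]")
# Pause marks such as (.); longer forms like (..) are matched as an assumption.
PAUSE_MARK = re.compile(r"\(\.+\)")
UNINTELLIGIBLE = {"xxx", "yyy"}


def smooth_chat_line(line: str) -> str:
    """Strip disfluency markup from one CHAT utterance and collapse repeats."""
    text = FILLED_PAUSE.sub(" ", line)
    text = BRACKET_CODE.sub(" ", text)
    text = PAUSE_MARK.sub(" ", text)
    words = [w for w in text.split() if w.lower() not in UNINTELLIGIBLE]
    smoothed = []
    for word in words:
        # Collapse consecutive duplicate words (case-insensitive).
        if not smoothed or smoothed[-1].lower() != word.lower():
            smoothed.append(word)
    return " ".join(smoothed)


print(smooth_chat_line("and &-um they [/] they (.) stutter xxx like like that person"))
# and they stutter like that person
```

Note that collapsing duplicates also removes intentional repeats such as "that that", which is a trade-off of this simple rule.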

## Training recipe

| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~2.5 h |
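
For readers who want to reproduce a similar setup, the LoRA rows of the table map onto PEFT's `LoraConfig` roughly as below. This is a sketch under assumptions: only the four values from the table come from this card, and every other field (such as `bias`) is left at the PEFT default because the card does not state it.

```python
from peft import LoraConfig

# Values taken from the table above; all other fields stay at PEFT defaults,
# which is an assumption since the card does not list them.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
```

Passing this config together with the base model to `peft.get_peft_model` would produce a trainable adapter of the same shape.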

The released checkpoint is the one with the lowest validation WER over training (step 1200).
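
The trainable-parameter and step counts in the table can be sanity-checked with a little arithmetic. The sketch below assumes the standard Whisper-small shape (12 encoder layers, 12 decoder layers, hidden size 768), that `q_proj` and `v_proj` are adapted in every attention block including decoder cross-attention, and that the final partial batch of each epoch is kept; none of these details are stated explicitly in this card.

```python
import math

D_MODEL = 768         # Whisper-small hidden size (assumed)
ENC_LAYERS = 12       # encoder layers: one self-attention block each
DEC_LAYERS = 12       # decoder layers: self-attention plus cross-attention
RANK = 32             # LoRA rank from the table
TARGETS_PER_ATTN = 2  # q_proj and v_proj

attn_blocks = ENC_LAYERS * 1 + DEC_LAYERS * 2
adapted_linears = attn_blocks * TARGETS_PER_ATTN
# Each adapted linear gets A (RANK x d_in) and B (d_out x RANK).
params_per_linear = RANK * (D_MODEL + D_MODEL)
trainable = adapted_linears * params_per_linear

UTTERANCES, EFFECTIVE_BATCH, EPOCHS = 9402, 16, 3
steps_per_epoch = math.ceil(UTTERANCES / EFFECTIVE_BATCH)
total_steps = EPOCHS * steps_per_epoch

print(f"trainable LoRA params: {trainable:,}")         # 3,538,944
print(f"fp32 size: {trainable * 4 / 1e6:.1f} MB")      # 14.2 MB
print(f"optimizer steps: {total_steps}")               # 1764
```

Under these assumptions the result agrees with the 3.5 M trainable parameters, the ≈14 MB adapter size, and the 1764 optimizer steps reported above.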

## Limitations

- **English only.** Trained exclusively on English speech.
- **Adult speakers.** Training data comes from adults who stutter; the adapter has not been validated on children.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0). For a
  commercial-friendly variant trained on Apache 2.0 data only, see
  [`nazarkozak/whisper-small-disfluent-commercial-lora`](#) (planned).
- **Not a clinical instrument.** The adapter improves transcription, not assessment of
  stuttering severity.
- **Generalization gap.** On a 200-sample held-out test split the relative WER reduction is
  smaller (~9 %) than on validation (~28 %), suggesting some sensitivity to which sessions land
  in train vs test. Real-world WER will likely fall somewhere in between, depending on speaker similarity.

## Citation

If you use this adapter, please cite both the base model and the source datasets:

```bibtex
@misc{kozak2026whisperdisfluent,
  author = {Nazar Kozak},
  title  = {whisper-small-disfluent-smoothed-lora: a Whisper LoRA for disfluent speech},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora}
}

@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey,
             Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@article{ratner2018fluencybank,
  title  = {Fluency Bank: A new resource for fluency research and practice},
  author = {Ratner, Nan Bernstein and MacWhinney, Brian},
  journal= {Journal of Fluency Disorders},
  volume = {56},
  pages  = {69--80},
  year   = {2018}
}

@misc{wang2024disfluencyspeech,
  title  = {DisfluencySpeech: Single-Speaker Conversational Speech Dataset with Paralanguage},
  author = {Kyra Wang and Dorien Herremans},
  year   = {2024},
  eprint = {2406.08820},
  archivePrefix = {arXiv}
}
```

## Related models

- **Verbatim variant** (preserves disfluencies in the output):
  [`nazarkozak/whisper-small-disfluent-verbatim-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-verbatim-lora)
- **CoreML build** for on-device iOS use (planned):
  [`nazarkozak/whisper-small-disfluent-smoothed-coreml`](#)
- **Hosted API** via a Replicate endpoint (planned):
  `replicate.com/nazarkozak/whisper-disfluent`