language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
- whisper
- lora
- automatic-speech-recognition
- disfluent-speech
- stuttering
- accessibility
- speech-disorders
- verbatim-transcription
datasets:
- amaai-lab/DisfluencySpeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small-disfluent-verbatim-lora
  results:
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition (verbatim)
    dataset:
      name: held-out validation split (verbatim transcripts)
      type: validation
      split: validation
    metrics:
    - type: wer
      value: 20.29
      name: Word Error Rate (held-out, verbatim)
whisper-small-disfluent-verbatim (LoRA adapter)
A LoRA fine-tune of openai/whisper-small that
transcribes disfluent / stuttered speech verbatim, preserving filled pauses (uh, um),
exact word repetitions, and partial-word fragments rather than smoothing them away.
This is the verbatim variant. A companion smoothed variant, which drops disfluencies
to give clean text of the speaker's intended words, is published as
nazarkozak/whisper-small-disfluent-smoothed-lora.
Why verbatim, not smoothed?
Most ASR systems aim for smoothed output (clean intended speech). But for clinical, research, and assistive use cases the disfluencies are the signal:
- Speech therapy assessment: clinicians need verbatim transcripts to measure stuttering frequency, type distribution (blocks vs repetitions vs prolongations), and progress over time.
- Stuttering research: corpus annotation work needs the model to capture what was actually said, including the disfluencies.
- Accessibility tooling: some downstream tools want to expose disfluencies to the user (e.g. "your speech included 3 blocks and 5 filled pauses in the last minute").
- Faithful captioning of interviews / oral history: when the disfluency is part of the speaker's voice, smoothing it out distorts the record.
The smoothed variant is the right pick for voice assistants, AAC apps, and most live-captioning use cases. Use this verbatim variant only when the disfluencies themselves matter.
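As a rough illustration of the accessibility use case above, the sketch below tallies disfluencies that are visible in a verbatim transcript. The function name `count_disfluencies` and its counting rules are hypothetical and not part of this model or its training code; it assumes fillers are written as uh / um and fragments end in a hyphen, as in this card's example output. Blocks and prolongations cannot be recovered from text alone, so they are not counted.

```python
FILLED_PAUSES = {"uh", "um"}  # assumption: only the two fillers named in this card


def count_disfluencies(transcript: str) -> dict:
    """Count filled pauses, word fragments, and immediate whole-word repetitions."""
    counts = {"filled_pauses": 0, "fragments": 0, "repetitions": 0}
    previous = None
    for raw in transcript.lower().split():
        token = raw.strip(".,?!\"'")  # drop surrounding punctuation, keep trailing '-'
        if not token:
            continue
        if token in FILLED_PAUSES:
            counts["filled_pauses"] += 1
        elif token.endswith("-"):
            counts["fragments"] += 1
        elif token == previous:
            # a whole word said twice in a row, e.g. "to to"
            counts["repetitions"] += 1
        previous = token
    return counts


print(count_disfluencies("and uh I- I- I think that um yeah"))
print(count_disfluencies("I I want to to go"))
```

A repetition interrupted by a filler ("I uh I") is not counted by this simple rule; a real tool would need a more careful definition.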
How to use
import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and processor, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-verbatim-lora")
model.eval()

# The feature extractor expects 16 kHz mono audio; resample beforehand if needed.
audio, sr = sf.read("path/to/disfluent_speech.wav")
assert sr == 16_000, f"expected 16 kHz audio, got {sr} Hz"
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
generated = model.generate(inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
Compared to the smoothed variant, expect output like
"and uh I- I- I think that um yeah" rather than "and I think that yeah".
Performance
All numbers are Word Error Rate (lower is better), computed against verbatim references with Whisper's official English text normalizer.
| Split | Vanilla Whisper-small (verbatim refs) | This adapter | Δ Absolute (points) | Δ Relative |
|---|---|---|---|---|
| Full FluencyBank AWS-interview (5,444 utts) | 26.75 % | 26.46 % | −0.29 | −1.1 % |
| Held-out validation (200 utts) | ~26 % | 20.29 % | −5.7 | −22 % |
| Held-out test (200 utts) | 28.71 % | 27.99 % | −0.72 | −2.5 % |
The contrast between the held-out validation (−22 %) and full-set (−1.1 %) numbers indicates that this adapter overfits to training sessions: it learns the disfluency style of specific speakers it saw rather than a general verbatim transcription policy. The companion smoothed variant generalizes much better (−21 % relative reduction on the same full set). For most production use cases, prefer the smoothed variant.
Note that verbatim is intrinsically a harder task than smoothed: the model must produce a
longer transcript that includes natural-spoken markers (uh, um) and word fragments (I-)
which are out-of-distribution for vanilla Whisper. The improvement on the held-out test split
is modest, while the validation split shows a much larger gain. Real-world WER will fall
somewhere in between depending on speaker/session similarity.
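For readers who want to reproduce this kind of comparison, here is a minimal sketch of a corpus-level WER computation and of the absolute and relative deltas in the table. It is not the evaluation script used for this card: `normalize` is a simplified stand-in (lowercase plus whitespace split) for Whisper's English text normalizer, and pooling errors over the whole corpus is an assumption about how the scores were aggregated.

```python
def normalize(text: str) -> list[str]:
    # Stand-in normalizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER in percent: total errors over total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = normalize(ref), normalize(hyp)
        errors += word_errors(ref_words, hyp_words)
        words += len(ref_words)
    return 100.0 * errors / words


def deltas(baseline: float, adapted: float) -> tuple[float, float]:
    """Absolute change in WER points and relative change in percent."""
    return adapted - baseline, 100.0 * (adapted - baseline) / baseline


# A smoothed hypothesis scored against a verbatim reference: every dropped
# disfluency counts as a deletion.
print(wer(["and uh I- I- I think that um yeah"], ["and I think that yeah"]))
print(deltas(26.75, 26.46))  # full-set row of the table
```

The first call shows why scoring against verbatim references penalizes smoothing: four of the nine reference words are disfluencies, and omitting them yields four deletions.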
Training data
| Source | Hours | License | Use |
|---|---|---|---|
| FluencyBank Teaching: Adults Who Stutter (interview) | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| DisfluencySpeech (amaai-lab) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |
Total: ~27 hr training, 9,402 utterance pairs.
The verbatim transcripts come from:
- For FluencyBank: CHAT-format transcripts processed with a custom parser that preserves disfluencies as natural words: &-uh → uh, &fri → fri-, exact repetitions kept. Bracket annotations ([/], [//]), pause marks (.), and unintelligible markers (xxx / yyy) are dropped; they are CHAT markup, not spoken content.
- For DisfluencySpeech: the dataset's transcript_a field (its most-detailed natural-text variant, which keeps filled pauses and repetitions).
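The FluencyBank token rules listed above can be sketched as follows. This is an illustrative approximation, not the custom parser used to build the training data: the function name `chat_to_verbatim` is hypothetical, it works on whitespace-separated tokens only, and it does not handle other CHAT constructs such as angle-bracket scopes or multi-token annotations.

```python
import re


def chat_to_verbatim(utterance: str) -> str:
    """Apply the token rules listed above to one CHAT utterance (simplified)."""
    words = []
    for token in utterance.split():
        if re.fullmatch(r"\[[^\]]*\]", token):   # bracket annotations such as [/] and [//]
            continue
        if re.fullmatch(r"\(\.+\)", token):      # pause marks such as (.)
            continue
        if token in {"xxx", "yyy"}:              # unintelligible markers
            continue
        if token.startswith("&-"):               # filled pause: &-uh -> uh
            words.append(token[2:])
        elif token.startswith("&"):              # fragment: &fri -> fri-
            words.append(token[1:] + "-")
        else:                                    # ordinary words, including repetitions
            words.append(token)
    return " ".join(words)


print(chat_to_verbatim("&-uh I [/] I went (.) to the &fri Friday xxx party"))
# uh I I went to the fri- Friday party
```

Note that the repeated "I" survives because only the `[/]` marker is removed, which matches the "exact repetitions kept" rule.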
Training recipe
| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small (244 M params) |
| LoRA target modules | q_proj, v_proj |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~3.5 h |
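The figures in this table can be cross-checked with a little arithmetic. The sketch below is a sanity check, not the training script: the layer count and hidden size of whisper-small (12 encoder layers, 12 decoder layers, width 768) are taken from the published architecture rather than from this card, and decaying the learning rate to zero at the final step is an assumption about the scheduler.

```python
import math

# Architecture of whisper-small (assumption: from the published model config).
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

# Values from the table above.
RANK = 32
TARGETS_PER_ATTENTION = 2      # q_proj and v_proj
TRAIN_PAIRS = 9_402
EFFECTIVE_BATCH = 8 * 2
EPOCHS = 3
PEAK_LR = 1e-4
WARMUP_STEPS = 100


def lora_trainable_params() -> int:
    # One attention block per encoder layer; self- plus cross-attention per decoder layer.
    attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
    adapted_matrices = attention_blocks * TARGETS_PER_ATTENTION
    # Each adapted d x d projection gains two low-rank factors: (r x d) and (d x r).
    return adapted_matrices * RANK * (D_MODEL + D_MODEL)


def optimizer_steps() -> int:
    return math.ceil(TRAIN_PAIRS / EFFECTIVE_BATCH) * EPOCHS


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - WARMUP_STEPS))


print(lora_trainable_params())  # 3538944, i.e. about 3.5 M
print(optimizer_steps())        # 1764
```

Both results agree with the table: roughly 3.5 M trainable parameters and 1764 optimizer steps over 3 epochs.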
The released checkpoint is the lowest-WER one over training (step 1600). Note: the eval trajectory is noisier than the smoothed variant's; there were transient generation spikes at steps 1000, 1200, and 1764 where eval WER climbed back to ~37 % before recovering. This is a known interaction between Seq2SeqTrainer's eval-time generation, MPS backend, and PEFT-wrapped models on Apple Silicon. Loss decreased monotonically throughout, indicating that the model itself was healthy across the spikes.
Limitations
- English only.
- Adult speakers only; not validated on children.
- Verbatim style is FluencyBank-shaped. The model learned to emit uh / um / one-word repetitions in the style of CHAT-format transcripts. Other annotation conventions (e.g. Switchboard's {F uh, } markup) will not be reproduced.
- Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0).
- The large gap between validation and test WER gains suggests session-level sensitivity. If the test speakers' speech patterns differ from training, gains may be smaller.
Citation
Same as the smoothed variant; see
nazarkozak/whisper-small-disfluent-smoothed-lora.
Related models
- Smoothed variant, which drops disfluencies and returns clean intended speech:
nazarkozak/whisper-small-disfluent-smoothed-lora