whisper-small-disfluent-verbatim (LoRA adapter)

A LoRA fine-tune of openai/whisper-small that transcribes disfluent / stuttered speech verbatim, preserving filled pauses (uh, um), exact word repetitions, and partial-word fragments rather than smoothing them away.

This is the verbatim variant. A companion smoothed variant, which drops disfluencies to give clean text of the speaker's intended words, is published as nazarkozak/whisper-small-disfluent-smoothed-lora.

Why verbatim, not smoothed?

Most ASR systems aim for smoothed output (clean intended speech). But for clinical, research, and assistive use cases the disfluencies are the signal:

  • Speech therapy assessment: clinicians need verbatim transcripts to measure stuttering frequency, type distribution (blocks vs repetitions vs prolongations), and progress over time.
  • Stuttering research: corpus annotation work needs the model to capture what was actually said, including the disfluencies.
  • Accessibility tooling: some downstream tools want to expose disfluencies to the user (e.g. "your speech included 3 blocks and 5 filled pauses in the last minute").
  • Faithful captioning of interviews / oral history: when the disfluency is part of the speaker's voice, smoothing it out distorts the record.

The smoothed variant is the right pick for voice assistants, AAC apps, and most live-captioning use cases. Use this verbatim variant only when the disfluencies themselves matter.

How to use

import soundfile as sf
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and its processor, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-verbatim-lora")
model.eval()

# Whisper expects 16 kHz mono audio; resample or downmix beforehand if your file differs.
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
assert sr == 16_000 and audio.ndim == 1, "expected 16 kHz mono audio"

# Extract log-mel features and decode without tracking gradients.
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
with torch.no_grad():
    generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

Compared to the smoothed variant, expect output like "and uh I- I- I think that um yeah" rather than "and I think that yeah".
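
For the assessment and accessibility use cases listed earlier, verbatim output can be post-processed into simple counts. The following is a minimal sketch, not part of the released model: the name `count_disfluencies` and its counting rules (filled pauses are uh / um, fragments are tokens ending in "-", repetitions are adjacent identical full words) are assumptions chosen for illustration. It cannot detect blocks or prolongations, which are not visible in the text.

```python
from collections import Counter

FILLED_PAUSES = {"uh", "um"}  # assumption: only the two fillers named in this card


def count_disfluencies(transcript: str) -> Counter:
    """Count filled pauses, word fragments, and exact word repetitions."""
    # Lowercase and strip sentence punctuation, but keep the trailing "-"
    # that marks a partial-word fragment.
    tokens = [t.strip(".,!?;:\"").lower() for t in transcript.split()]
    tokens = [t for t in tokens if t]

    counts = Counter(filled_pauses=0, fragments=0, repetitions=0)
    previous = None
    for tok in tokens:
        if tok in FILLED_PAUSES:
            counts["filled_pauses"] += 1
        elif tok.endswith("-"):
            counts["fragments"] += 1
        elif tok == previous:
            counts["repetitions"] += 1
        previous = tok
    return counts


example = count_disfluencies("and uh I- I- I think that um yeah")
print(dict(example))
```

On the example string this reports two filled pauses, two fragments, and no full-word repetitions.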

Performance

All numbers are Word Error Rate (lower is better), computed against verbatim references with Whisper's official English text normalizer.

| Split | Vanilla Whisper-small (verbatim refs) | This adapter | Δ absolute | Δ relative |
| --- | --- | --- | --- | --- |
| Full FluencyBank AWS-interview (5,444 utts) | 26.75 % | 26.46 % | −0.29 | −1.1 % |
| Held-out validation (200 utts) | ~26 % | 20.29 % | −5.7 | −22 % |
| Held-out test (200 utts) | 28.71 % | 27.99 % | −0.72 | −2.5 % |
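
WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal pure-Python sketch follows. The `simple_normalize` helper is a hypothetical stand-in (lowercasing and punctuation stripping only), not Whisper's English normalizer, so its scores will not match the table.

```python
import re


def simple_normalize(text: str) -> list[str]:
    """Stand-in normalizer (assumption): lowercase, drop punctuation except - and '."""
    return re.sub(r"[^\w\s'-]", "", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = simple_normalize(reference), simple_normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programme over hypothesis positions.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, diag + cost)
    return row[-1] / len(ref)


# A smoothed hypothesis scored against a verbatim reference.
score = wer("and uh I- I- I think that um yeah", "and I think that yeah")
print(f"{score:.2f}")
```

The example illustrates why the reference style matters: a perfectly smoothed hypothesis drops four of the nine reference words, so it scores 4/9 against a verbatim reference under this stand-in normalizer.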

The contrast between the held-out validation (−22 %) and full-set (−1.1 %) numbers suggests that this adapter overfits to training sessions: it learns the disfluency style of specific speakers it saw rather than a general verbatim transcription policy. The companion smoothed variant generalizes much better (a 21 % relative WER reduction on the same full set). For most production use cases, prefer the smoothed variant.
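
The Δ figures follow directly from the two WER columns. As a quick check, this sketch recomputes them from the reported values; the validation baseline is only given approximately (~26 %), so that row is approximate too.

```python
def wer_deltas(baseline: float, adapter: float) -> tuple[float, float]:
    """Return (absolute delta in WER points, relative delta in percent)."""
    absolute = adapter - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# (vanilla Whisper-small WER, adapter WER) as reported in this card.
rows = {
    "full": (26.75, 26.46),
    "validation": (26.0, 20.29),  # baseline is approximate in the card
    "test": (28.71, 27.99),
}
for split, (baseline, adapter) in rows.items():
    absolute, relative = wer_deltas(baseline, adapter)
    print(f"{split}: {absolute:+.2f} points, {relative:+.1f} % relative")
```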

Note that verbatim is intrinsically a harder task than smoothed: the model must produce a longer transcript that includes filled pauses (uh, um) and word fragments (I-), which are out-of-distribution for vanilla Whisper. The improvement on the held-out test split is modest, while the validation split shows a much larger gain. Real-world WER is likely to fall somewhere in between, depending on speaker/session similarity.

Training data

| Source | Hours | License | Use |
| --- | --- | --- | --- |
| FluencyBank Teaching: Adults Who Stutter (interview) | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| DisfluencySpeech (amaai-lab) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hr training, 9,402 utterance pairs.

The verbatim transcripts come from:

  • For FluencyBank: CHAT-format transcripts processed with a custom parser that preserves disfluencies as natural words: &-uh → uh, &fri → fri-, exact repetitions kept. Bracket annotations ([/], [//]), pause marks (.), and unintelligible markers (xxx/yyy) are dropped because they are CHAT markup, not spoken content.
  • For DisfluencySpeech: the dataset's transcript_a field (its most-detailed natural-text variant, which keeps filled pauses and repetitions).
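
The FluencyBank rules above can be sketched as a token-level filter. This is an illustrative reconstruction, not the parser used for training: the function name `chat_to_verbatim` is hypothetical, and it handles only the markers named in this card (real CHAT transcripts contain further markup, such as angle-bracket scoping and utterance terminators, that a full parser would need to cover).

```python
# Markup tokens dropped entirely (assumption: whitespace-separated in the input).
DROPPED = {"[/]", "[//]", "(.)", "xxx", "yyy"}


def chat_to_verbatim(utterance: str) -> str:
    """Convert a CHAT-style utterance into plain verbatim text."""
    words = []
    for token in utterance.split():
        if token in DROPPED:
            continue  # CHAT markup, not spoken content
        if token.startswith("&-"):
            words.append(token[2:])  # filled pause: &-uh -> uh
        elif token.startswith("&"):
            words.append(token[1:] + "-")  # fragment: &fri -> fri-
        else:
            words.append(token)  # ordinary word, repetitions kept
    return " ".join(words)


print(chat_to_verbatim("&-uh I [/] I want &fri Friday (.) xxx"))
```

Checking the "&-" prefix before the bare "&" prefix matters here, since filled pauses would otherwise be treated as fragments.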

Training recipe

| Hyperparameter | Value |
| --- | --- |
| Base model | openai/whisper-small (244 M params) |
| LoRA target modules | q_proj, v_proj |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~3.5 h |
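
The trainable-parameter count and step count above can be cross-checked with a little arithmetic. This sketch assumes the standard Whisper-small shape (hidden size 768, 12 encoder and 12 decoder layers), that all 9,402 pairs are used each epoch with the last partial batch kept, and that the learning rate decays linearly to zero at the final step; the card itself states none of these three details.

```python
import math

# Values taken from this card.
RANK = 32
NUM_PAIRS = 9_402
EFFECTIVE_BATCH = 8 * 2
EPOCHS = 3
PEAK_LR = 1e-4
WARMUP_STEPS = 100

# Assumed Whisper-small architecture: d_model 768, 12 + 12 layers.
D_MODEL = 768
# Encoder self-attention (12) plus decoder self- and cross-attention (12 + 12).
ATTENTION_BLOCKS = 12 + 12 + 12

# Each adapted projection adds A (rank x d_model) and B (d_model x rank).
params_per_projection = RANK * (D_MODEL + D_MODEL)
trainable_params = ATTENTION_BLOCKS * 2 * params_per_projection  # q_proj and v_proj

steps_per_epoch = math.ceil(NUM_PAIRS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - WARMUP_STEPS))


print(trainable_params, total_steps, lr_at(50), lr_at(total_steps))
```

Under these assumptions the sketch gives 3,538,944 trainable parameters and 1,764 optimizer steps, consistent with the table.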

The released checkpoint is the one with the lowest eval WER during training (step 1600). Note: the eval trajectory is noisier than the smoothed variant's. There were transient generation spikes at steps 1000, 1200, and 1764 where eval WER climbed back to ~37 % before recovering. This is a known interaction between Seq2SeqTrainer's eval-time generation, MPS backend, and PEFT-wrapped models on Apple Silicon. Loss decreased monotonically throughout, which suggests the model itself was healthy across the spikes.

Limitations

  • English only.
  • Adult speakers only; not validated on children.
  • Verbatim style is FluencyBank-shaped. The model learned to emit uh / um / one-word repetitions in the style of CHAT-format transcripts. Other annotation conventions (e.g. Switchboard's {F uh, } markup) will not be reproduced.
  • Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0).
  • The large gap between validation and test WER suggests session-level sensitivity. If the test speakers' speech patterns differ from those seen in training, gains may be smaller.

Citation

Same as the smoothed variant; see nazarkozak/whisper-small-disfluent-smoothed-lora.
