whisper-small-disfluent-verbatim (LoRA adapter)

A LoRA fine-tune of openai/whisper-small that transcribes disfluent / stuttered speech verbatim — preserving filled pauses (uh, um), exact word repetitions, and partial-word fragments rather than smoothing them away.

This is the verbatim variant. A companion smoothed variant — which drops disfluencies to give clean text of the speaker's intended words — is published as nazarkozak/whisper-small-disfluent-smoothed-lora.

Why verbatim, not smoothed?

Most ASR systems aim for smoothed output (clean intended speech). But for clinical, research, and assistive use cases the disfluencies are the signal:

Speech therapy assessment — clinicians need verbatim transcripts to measure stuttering frequency, type distribution (blocks vs repetitions vs prolongations), and progress over time.
Stuttering research — corpus annotation work needs the model to capture what was actually said, including the disfluencies.
Accessibility tooling — some downstream tools want to expose disfluencies to the user (e.g. "your speech included 3 blocks and 5 filled pauses in the last minute").
Faithful captioning of interviews / oral history — when the disfluency is part of the speaker's voice, smoothing it out distorts the record.

The smoothed variant is the right pick for voice assistants, AAC apps, and most live-captioning use cases. Use this verbatim variant only when the disfluencies themselves matter.

How to use

import soundfile as sf
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and processor, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-verbatim-lora")
model.eval()

# The feature extractor expects 16 kHz mono audio; resample beforehand if needed.
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
assert sr == 16_000 and audio.ndim == 1, "expected 16 kHz mono audio"
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
with torch.no_grad():  # inference only, no gradients needed
    generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

Compared to the smoothed variant, expect output like "and uh I- I- I think that um yeah" rather than "and I think that yeah".
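As a rough illustration of what a downstream tool could do with this kind of output, the sketch below counts filled pauses, word fragments, and immediate word repetitions in a transcript string. The function name `count_disfluencies` and the counting rules are assumptions made for this example, not part of the released model or its evaluation. Blocks and prolongations are not visible in text, so they are not counted here.

```python
# Hypothetical helper: counts text-visible disfluencies in a verbatim transcript.
FILLED_PAUSES = {"uh", "um"}  # assumption: only the two fillers named in this card


def count_disfluencies(transcript: str) -> dict:
    counts = {"filled_pauses": 0, "fragments": 0, "repetitions": 0}
    previous_word = None
    for raw in transcript.lower().split():
        # Strip surrounding punctuation but keep a trailing hyphen (fragment marker).
        token = raw.strip(".,!?;:\"'")
        if not token:
            continue
        if token in FILLED_PAUSES:
            counts["filled_pauses"] += 1
            previous_word = None
        elif token.endswith("-"):
            counts["fragments"] += 1
            previous_word = None
        else:
            # A repetition is a complete word immediately followed by itself.
            if token == previous_word:
                counts["repetitions"] += 1
            previous_word = token
    return counts


print(count_disfluencies("and uh I- I- I think that um yeah"))
```

For the example string above, this reports two filled pauses, two fragments, and no whole-word repetitions.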

Performance

All numbers are Word Error Rate (lower is better), computed against verbatim references with Whisper's official English text normalizer.

Split	Vanilla Whisper-small WER (verbatim refs)	This adapter WER	Δ absolute (pp)	Δ relative
Full FluencyBank AWS-interview (5,444 utts)	26.75 %	26.46 %	−0.29	−1.1 %
Held-out validation (200 utts)	~26 %	20.29 %	−5.7	−22 %
Held-out test (200 utts)	28.71 %	27.99 %	−0.72	−2.5 %
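To make the metric and the two Δ columns concrete, here is a minimal sketch of word-level WER and of the absolute and relative change between two WER values. The `simple_normalize` function is a deliberately simplified stand-in and does not replicate Whisper's English text normalizer. All function names are assumptions for this example. Corpus-level WER is usually aggregated over all utterances rather than averaged per utterance.

```python
import re


def simple_normalize(text: str) -> list[str]:
    # Simplified stand-in normalizer (NOT Whisper's): lowercase and drop
    # punctuation, keeping apostrophes and hyphens so fragments like "i-" survive.
    text = re.sub(r"[^a-z0-9'\- ]+", " ", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = simple_normalize(reference), simple_normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def wer_change(baseline: float, adapted: float) -> tuple[float, float]:
    # Returns (absolute change in percentage points, relative change in percent).
    absolute = adapted - baseline
    return absolute, 100.0 * absolute / baseline


# A smoothed hypothesis scored against a verbatim reference loses every disfluency.
print(word_error_rate("and uh I- I- I think that um yeah", "and I think that yeah"))
print(wer_change(26.75, 26.46))  # full-set row of the table
```

Applied to the full-set and held-out test rows, `wer_change` reproduces the table's deltas after rounding.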

The contrast between the held-out validation (−22 %) and full-set (−1.1 %) results suggests that this adapter overfits to training sessions: it learns the disfluency style of specific speakers it saw rather than a general verbatim transcription policy. The companion smoothed variant generalizes much better (−21 % relative WER reduction on the same full set). For most production use cases, prefer the smoothed variant.

Note that verbatim transcription is intrinsically harder than smoothed transcription: the model must produce a longer transcript that includes filled pauses (uh, um) and word fragments (I-), both of which are out-of-distribution for vanilla Whisper. The improvement on the held-out test split is modest, while the validation split shows a much larger gain. Real-world WER will fall somewhere in between, depending on how similar the speakers and sessions are to the training data.

Training data

Source	Hours	License	Use
FluencyBank Teaching — Adults Who Stutter (interview)	~17	CC BY-NC-SA 4.0 (TalkBank)	Train (most of the disfluent signal)
DisfluencySpeech (amaai-lab)	10	Apache 2.0	Train (single-speaker scripted, augmentation)

Total: ~27 hr training, 9,402 utterance pairs.

The verbatim transcripts come from:

For FluencyBank — CHAT-format transcripts processed with a custom parser that preserves disfluencies as natural words: &-uh → uh, &fri → fri-, exact repetitions kept. Bracket annotations ([/], [//]), pause marks (.), and unintelligible markers (xxx/yyy) are dropped — they are CHAT markup, not spoken content.
For DisfluencySpeech — the dataset's transcript_a field (its most-detailed natural-text variant, which keeps filled pauses and repetitions).
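The FluencyBank mapping described above can be sketched with a few regular expressions. This is an illustrative approximation, not the custom parser used for training: the function name `chat_to_verbatim` is hypothetical, and three behaviours are assumptions beyond what this card states (dropping every `[...]` code rather than only `[/]` and `[//]`, accepting the `&+` fragment prefix alongside `&`, and removing the angle brackets that scope retracings while keeping their words).

```python
import re


def chat_to_verbatim(utterance: str) -> str:
    """Approximate CHAT-to-verbatim conversion (illustrative sketch only)."""
    text = utterance
    # Drop bracket annotations such as [/] and [//] (assumption: any [...] code).
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Drop pause marks such as (.), (..), (...).
    text = re.sub(r"\(\.+\)", " ", text)
    # Drop unintelligible markers.
    text = re.sub(r"\b(?:xxx|yyy)\b", " ", text)
    # Filled pauses: &-uh -> uh.
    text = re.sub(r"&-(\w+)", r"\1", text)
    # Fragments: &fri -> fri- (assumption: &+fri is treated the same way).
    text = re.sub(r"&\+?(\w+)", r"\1-", text)
    # Angle brackets only scope a retracing; keep the words inside (assumption).
    text = text.replace("<", " ").replace(">", " ")
    # Collapse the whitespace left behind by removed markup.
    return " ".join(text.split())


print(chat_to_verbatim("and &-uh <I I> [/] I &fri friday (.) xxx went"))
```

The filled-pause rule runs before the fragment rule so that `&-uh` is never mistaken for a fragment.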

Training recipe

Hyperparameter	Value
Base model	`openai/whisper-small` (244 M params)
LoRA target modules	`q_proj`, `v_proj`
LoRA rank / alpha / dropout	32 / 64 / 0.05
Trainable params	3.5 M (1.4 % of total)
Batch size (effective)	16 (8 × 2 grad accum)
Learning rate	1e-4 with linear warmup over 100 steps, then linear decay
Epochs	3 (1764 optimizer steps)
Hardware	Apple M1 Max 64 GB (PyTorch + MPS backend)
Wall clock	~3.5 h
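Two figures in the table can be cross-checked with simple arithmetic. The sketch below assumes the standard whisper-small layout (hidden size 768, 12 encoder and 12 decoder layers), that the `q_proj` / `v_proj` targets match encoder self-attention, decoder self-attention, and decoder cross-attention alike, and that all 9,402 utterance pairs were used for training. None of these assumptions is stated explicitly in this card.

```python
import math

# Assumed whisper-small architecture constants.
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
LORA_RANK = 32

# Encoder layers have one attention block; decoder layers have two
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_linears = attention_blocks * 2  # q_proj and v_proj in each block
# Each adapted linear gets A (rank x d_in) and B (d_out x rank).
params_per_linear = LORA_RANK * (D_MODEL + D_MODEL)
lora_params = adapted_linears * params_per_linear

# Optimizer steps: a final partial batch still counts as one step per epoch.
utterances = 9_402
effective_batch = 8 * 2
epochs = 3
optimizer_steps = math.ceil(utterances / effective_batch) * epochs

print(f"LoRA trainable params: {lora_params:,}")
print(f"Optimizer steps: {optimizer_steps}")
```

Under these assumptions the count comes to about 3.5 M trainable parameters and 1764 optimizer steps, matching the table.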

The released checkpoint is the one with the lowest eval WER during training (step 1600). Note that the eval trajectory is noisier than the smoothed variant's: there were transient generation spikes at steps 1000, 1200, and 1764 where eval WER climbed back to ~37 % before recovering. This is a known interaction between Seq2SeqTrainer's eval-time generation, the MPS backend, and PEFT-wrapped models on Apple Silicon. Training loss decreased monotonically throughout, which indicates that the model itself remained healthy across the spikes.

Limitations

English only.
Adult speakers — not validated on children.
Verbatim style is FluencyBank-shaped. The model learned to emit uh / um / one-word repetitions in the style of CHAT-format transcripts. Other annotation conventions (e.g. Switchboard's {F uh, } markup) will not be reproduced.
Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0).
The large gap between validation and test WER suggests session-level sensitivity. If the target speakers' speech patterns differ from those seen in training, gains may be smaller.

Citation

Same as the smoothed variant — see nazarkozak/whisper-small-disfluent-smoothed-lora.

Related models

Smoothed variant — drops disfluencies, returns clean intended speech: nazarkozak/whisper-small-disfluent-smoothed-lora
