---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
- whisper
- lora
- automatic-speech-recognition
- disfluent-speech
- stuttering
- accessibility
- speech-disorders
- verbatim-transcription
datasets:
- amaai-lab/DisfluencySpeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: whisper-small-disfluent-verbatim-lora
  results:
  - task:
      type: automatic-speech-recognition
      name: Disfluent Speech Recognition (verbatim)
    dataset:
      name: held-out validation split (verbatim transcripts)
      type: validation
      split: validation
    metrics:
    - type: wer
      value: 20.29
      name: Word Error Rate (held-out, verbatim)
---

# whisper-small-disfluent-verbatim (LoRA adapter)

A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that transcribes **disfluent / stuttered speech verbatim**, preserving filled pauses (`uh`, `um`), exact word repetitions, and partial-word fragments rather than smoothing them away.

This is the **verbatim** variant. A companion **smoothed** variant, which drops disfluencies to give clean text of the speaker's intended words, is published as [`nazarkozak/whisper-small-disfluent-smoothed-lora`](#related-models).

## Why verbatim, not smoothed?

Most ASR systems aim for smoothed output (clean intended speech). But for **clinical, research, and assistive use cases** the disfluencies *are* the signal:

- **Speech therapy assessment**: clinicians need verbatim transcripts to measure stuttering frequency, type distribution (blocks vs repetitions vs prolongations), and progress over time.
- **Stuttering research**: corpus annotation work needs the model to capture what was actually said, including the disfluencies.
- **Accessibility tooling**: some downstream tools want to expose disfluencies to the user (e.g. "your speech included 3 blocks and 5 filled pauses in the last minute").
- **Faithful captioning of interviews / oral history**: when the disfluency is part of the speaker's voice, smoothing it out distorts the record.

The smoothed variant is the right pick for voice assistants, AAC apps, and most live-captioning use cases. Use this verbatim variant only when the disfluencies themselves matter.

## How to use

```python
import soundfile as sf
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and processor, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-verbatim-lora")
model.eval()

# Whisper expects 16 kHz mono audio. Passing the file's real sample rate
# makes the feature extractor raise an error on a mismatch instead of
# silently producing bad features.
audio, sr = sf.read("path/to/disfluent_speech.wav", dtype="float32")
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features

generated = model.generate(
    input_features=input_features,
    language="english",
    task="transcribe",
    max_new_tokens=200,
)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Compared to the smoothed variant, expect output like `"and uh I- I- I think that um yeah"` rather than `"and I think that yeah"`.

## Performance

All numbers are Word Error Rate (lower is better), computed against verbatim references with Whisper's official English text normalizer.

| Split | Vanilla Whisper-small (verbatim refs) | This adapter | Δ absolute (pp) | Δ relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.75 %** | **26.46 %** | **−0.29** | **−1.1 %** |
| Held-out validation (200 utts) | ~26 % | **20.29 %** | **−5.7** | **−22 %** |
| Held-out test (200 utts) | 28.71 % | 27.99 % | −0.72 | −2.5 % |

The contrast between the held-out validation (−22 %) and full-set (−1.1 %) numbers suggests that this adapter **overfits to training sessions**: it learns the disfluency style of specific speakers it saw rather than a general verbatim transcription policy.
The companion *smoothed* variant generalizes much better (−21 % relative WER reduction on the same full set). For most production use cases, prefer the smoothed variant.

Note that verbatim transcription is intrinsically a *harder* task than smoothed transcription: the model must produce a longer transcript that includes filled pauses (`uh`, `um`) and word fragments (`I-`), which are out-of-distribution for vanilla Whisper. The improvement on the held-out test split is modest, while the validation split shows a much larger gain. Real-world WER will likely fall somewhere in between, depending on how closely the speakers and sessions resemble the training data.

## Training data

| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching — Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hours of training audio, 9,402 utterance pairs.

The verbatim transcripts come from:

- **FluencyBank**: CHAT-format transcripts processed with a custom parser that **preserves** disfluencies as natural words: `&-uh` → `uh`, `&fri` → `fri-`, exact repetitions kept. Bracket annotations (`[/]`, `[//]`), pause marks `(.)`, and unintelligible markers (`xxx`/`yyy`) are dropped because they are CHAT markup, not spoken content.
- **DisfluencySpeech**: the dataset's `transcript_a` field (its most-detailed natural-text variant, which keeps filled pauses and repetitions).
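
The FluencyBank rules listed above can be sketched roughly as follows. This is an illustration only, not the parser used to build the training set: `chat_to_verbatim` is a hypothetical helper name, it covers only the markers named in this card, and treating longer pauses such as `(..)` like `(.)` is an assumption. Other CHAT codes (angle-bracket scoping, utterance terminators, and so on) are not handled.

```python
import re

# Hypothetical helper, NOT the author's actual parser: a minimal sketch of the
# CHAT-to-verbatim rules described in this card.
_BRACKET = re.compile(r"\[[^\]]*\]")   # bracket annotations such as [/] and [//]
_PAUSE = re.compile(r"\(\.+\)")        # pause marks: (.), plus (..)/(...) by assumption
_UNINTELLIGIBLE = {"xxx", "yyy"}       # unintelligible-speech markers


def chat_to_verbatim(utterance: str) -> str:
    """Turn one CHAT-coded utterance into plain verbatim text."""
    text = _BRACKET.sub(" ", utterance)
    text = _PAUSE.sub(" ", text)

    words = []
    for token in text.split():
        if token in _UNINTELLIGIBLE:
            continue                   # markup, not spoken content
        if token.startswith("&-"):
            word = token[2:]           # filled pause: &-uh -> uh
        elif token.startswith("&"):
            word = token[1:] + "-"     # word fragment: &fri -> fri-
        else:
            word = token               # ordinary word; repetitions are kept
        if word and word != "-":
            words.append(word)
    return " ".join(words)


print(chat_to_verbatim("and &-uh I [/] I [/] I think (.) that &fri friends xxx yeah"))
# and uh I I I think that fri- friends yeah
```

Working token by token keeps each rule independent, so handling for further CHAT codes could be added as extra branches without touching the existing ones.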

## Training recipe

| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~3.5 h |

The released checkpoint is the one with the lowest eval WER over training (step 1600).

Note: the eval trajectory is **noisier than the smoothed variant's**. There were transient generation spikes at steps 1000, 1200, and 1764 where eval WER climbed back to ~37 % before recovering. This is a known interaction between Seq2SeqTrainer's eval-time generation, the MPS backend, and PEFT-wrapped models on Apple Silicon. Training loss decreased monotonically throughout, which indicates the model itself was healthy across the spikes.

## Limitations

- **English only.**
- **Adult speakers.** Not validated on children.
- **Verbatim style is FluencyBank-shaped.** The model learned to emit `uh` / `um` / one-word repetitions in the style of CHAT-format transcripts. Other annotation conventions (e.g. Switchboard's `{F uh, }` markup) will not be reproduced.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0).
- **Session-level sensitivity.** The large gap between validation and test WER suggests the adapter is sensitive to speaker and session. If your speakers' speech patterns differ from those in training, gains may be smaller.

## Citation

Same as the smoothed variant; see [`nazarkozak/whisper-small-disfluent-smoothed-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora#citation).

## Related models

- **Smoothed variant**: drops disfluencies, returns clean intended speech: [`nazarkozak/whisper-small-disfluent-smoothed-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora)