---
language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
  - whisper
  - lora
  - automatic-speech-recognition
  - disfluent-speech
  - stuttering
  - accessibility
  - speech-disorders
  - verbatim-transcription
datasets:
  - amaai-lab/DisfluencySpeech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: whisper-small-disfluent-verbatim-lora
    results:
      - task:
          type: automatic-speech-recognition
          name: Disfluent Speech Recognition (verbatim)
        dataset:
          name: held-out validation split (verbatim transcripts)
          type: validation
          split: validation
        metrics:
          - type: wer
            value: 20.29
            name: Word Error Rate (held-out, verbatim)
---

# whisper-small-disfluent-verbatim (LoRA adapter)

A LoRA fine-tune of [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) that
transcribes **disfluent / stuttered speech verbatim** — preserving filled pauses (`uh`, `um`),
exact word repetitions, and partial-word fragments rather than smoothing them away.

This is the **verbatim** variant. A companion **smoothed** variant, which drops disfluencies
to give clean text of the speaker's intended words, is published as
[`nazarkozak/whisper-small-disfluent-smoothed-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora).

## Why verbatim, not smoothed?

Most ASR systems aim for smoothed output (clean intended speech). But for **clinical, research,
and assistive use cases** the disfluencies *are* the signal:

- **Speech therapy assessment** — clinicians need verbatim transcripts to measure stuttering
  frequency, type distribution (blocks vs repetitions vs prolongations), and progress over time.
- **Stuttering research** — corpus annotation work needs the model to capture what was actually
  said, including the disfluencies.
- **Accessibility tooling** — some downstream tools want to expose disfluencies to the user
  (e.g. "your speech included 3 blocks and 5 filled pauses in the last minute").
- **Faithful captioning of interviews / oral history** — when the disfluency is part of the
  speaker's voice, smoothing it out distorts the record.

The smoothed variant is the right pick for voice assistants, AAC apps, and most live-captioning
use cases. Use this verbatim variant only when the disfluencies themselves matter.

## How to use

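```python
import soundfile as sf
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and processor, then attach the LoRA adapter on top.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-verbatim-lora")
model.eval()

# Whisper's feature extractor expects 16 kHz mono audio; resample beforehand if needed.
audio, sr = sf.read("path/to/disfluent_speech.wav")
assert sr == 16_000, f"expected 16 kHz audio, got {sr} Hz"
if audio.ndim > 1:  # down-mix multi-channel audio to mono
    audio = audio.mean(axis=1)

inputs = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
with torch.no_grad():  # inference only, no gradients needed
    generated = model.generate(input_features=inputs, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```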

Compared to the smoothed variant, expect output like
`"and uh I- I- I think that um yeah"` rather than `"and I think that yeah"`.
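
As a rough illustration of how downstream tooling might consume verbatim output, the sketch
below counts filled pauses, hyphen-marked fragments, and immediate repetitions in a transcript.
The function name `count_disfluencies`, the filler list, and the rule that fillers are skipped
when matching repetitions are assumptions made for this example. They are not part of the
released adapter or its training code.

```python
import re

# Assumed filler inventory for this sketch; extend it to match your transcripts.
FILLED_PAUSES = {"uh", "um"}

def count_disfluencies(transcript: str) -> dict:
    """Count simple disfluency types in a verbatim transcript (illustrative only)."""
    # Tokens are lowercase words, optionally ending in a hyphen (e.g. "i-", "fri-").
    tokens = re.findall(r"[a-z']+-?", transcript.lower())
    counts = {"filled_pauses": 0, "fragments": 0, "repetitions": 0}
    previous = None
    for token in tokens:
        word = token.rstrip("-")
        if word in FILLED_PAUSES:
            counts["filled_pauses"] += 1
            continue  # fillers are ignored when looking for repetitions
        if token.endswith("-"):
            counts["fragments"] += 1
        if word == previous:
            counts["repetitions"] += 1
        previous = word
    return counts

print(count_disfluencies("and uh I- I- I think that um yeah"))
```

Blocks and prolongations are not visible in plain text, so a counter like this can only cover
the disfluency types that the transcript itself marks.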

## Performance

All numbers are Word Error Rate (lower is better), computed against verbatim references with
Whisper's official English text normalizer.

| Split | Vanilla Whisper-small (verbatim refs) | This adapter | Δ absolute (pp) | Δ relative |
|---|---|---|---|---|
| **Full FluencyBank AWS-interview (5,444 utts)** | **26.75 %** | **26.46 %** | **−0.29** | **−1.1 %** |
| Held-out validation (200 utts) | ~26 % | **20.29 %** | **−5.7** | **−22 %** |
| Held-out test (200 utts) | 28.71 % | 27.99 % | −0.72 | −2.5 % |
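
For readers who want to see how such figures are formed, here is a minimal sketch of a
corpus-level WER computation and of the two delta columns. The `normalize` helper is a
deliberately simplified stand-in (lowercasing and punctuation stripping only) and is not
Whisper's English normalizer, which applies many more rules. The function names are
hypothetical, and this is not the evaluation script used for this card.

```python
import re

def normalize(text: str) -> list:
    """Simplified stand-in normalizer: lowercase, keep letters, digits, ' and -."""
    return re.sub(r"[^a-z0-9'\- ]+", " ", text.lower()).split()

def word_error_rate(references: list, hypotheses: list) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits, total_words = 0, 0
    for ref, hyp in zip(references, hypotheses):
        r, h = normalize(ref), normalize(hyp)
        # Word-level Levenshtein distance, keeping a single row of the table.
        row = list(range(len(h) + 1))
        for i, ref_word in enumerate(r, start=1):
            prev_diag, row[0] = row[0], i
            for j, hyp_word in enumerate(h, start=1):
                cost = 0 if ref_word == hyp_word else 1
                prev_diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev_diag + cost)
        total_edits += row[len(h)]
        total_words += len(r)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words

def deltas(baseline_wer: float, adapter_wer: float) -> tuple:
    """Return (absolute delta in percentage points, relative delta in percent)."""
    absolute = adapter_wer - baseline_wer
    return absolute, 100.0 * absolute / baseline_wer

# A smoothed hypothesis scored against a verbatim reference: 4 deletions over 9 words.
print(word_error_rate(["and uh I- I- I think that um yeah"], ["and I think that yeah"]))
print(deltas(26.75, 26.46))  # full-set row of the table
```

The first call shows why a smoothing model scores poorly against verbatim references: every
dropped filler or fragment counts as a deletion.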

The contrast between the held-out validation (−22 %) and full-set (−1.1 %) relative improvements
suggests that this adapter **overfits to training sessions**: it learns the disfluency style of
specific speakers it saw rather than a general verbatim transcription policy. The companion
*smoothed* variant generalizes much better (−21 % relative WER reduction on the same full set).
For most production use cases, prefer the smoothed variant.

Note that verbatim transcription is intrinsically *harder* than smoothed transcription: the
model must produce a longer transcript that includes filled pauses (`uh`, `um`) and word
fragments (`I-`), which are out-of-distribution for vanilla Whisper. The improvement on the
held-out test split is modest, while the validation split shows a much larger gain. Real-world
WER will likely fall somewhere in between, depending on how similar the speakers and sessions
are to the training data.

## Training data

| Source | Hours | License | Use |
|---|---|---|---|
| **FluencyBank Teaching — Adults Who Stutter (interview)** | ~17 | CC BY-NC-SA 4.0 (TalkBank) | Train (most of the disfluent signal) |
| **DisfluencySpeech** ([amaai-lab](https://huggingface.co/datasets/amaai-lab/DisfluencySpeech)) | 10 | Apache 2.0 | Train (single-speaker scripted, augmentation) |

Total: ~27 hours of training audio, 9,402 utterance pairs.

The verbatim transcripts come from:
- For FluencyBank — CHAT-format transcripts processed with a custom parser that **preserves**
  disfluencies as natural words: `&-uh` → `uh`, `&fri` → `fri-`, exact repetitions kept.
  Bracket annotations (`[/]`, `[//]`), pause marks `(.)`, and unintelligible markers (`xxx`/`yyy`)
  are dropped — they are CHAT markup, not spoken content.
- For DisfluencySpeech — the dataset's `transcript_a` field (its most-detailed natural-text
  variant, which keeps filled pauses and repetitions).
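
The FluencyBank conversion rules listed above can be sketched with a few regular expressions.
This is an illustration built only from the rules stated in this card, not the custom parser
itself: the function name `chat_to_verbatim` is hypothetical, and real CHAT transcripts contain
further markup (speaker tiers, terminators, scoped `<...>` groups) that this sketch does not
handle.

```python
import re

def chat_to_verbatim(utterance: str) -> str:
    """Convert a CHAT-style utterance to plain verbatim text (illustrative rules only)."""
    text = re.sub(r"\[/{1,2}\]", " ", utterance)   # drop [/] and [//] bracket annotations
    text = re.sub(r"\(\.\)", " ", text)            # drop the pause mark (.)
    text = re.sub(r"\b(?:xxx|yyy)\b", " ", text)   # drop unintelligible markers
    text = re.sub(r"&-(\w+)", r"\1", text)         # filled pause: &-uh -> uh
    text = re.sub(r"&(\w+)", r"\1-", text)         # fragment: &fri -> fri-
    return " ".join(text.split())                  # collapse leftover whitespace

print(chat_to_verbatim("&-uh I [/] I want &fri fries (.) xxx"))
```

The filled-pause rule runs before the fragment rule so that `&-uh` is not mistaken for a
fragment. Repeated words are left untouched, which keeps exact repetitions in the output.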

## Training recipe

| Hyperparameter | Value |
|---|---|
| Base model | `openai/whisper-small` (244 M params) |
| LoRA target modules | `q_proj`, `v_proj` |
| LoRA rank / alpha / dropout | 32 / 64 / 0.05 |
| Trainable params | 3.5 M (1.4 % of total) |
| Batch size (effective) | 16 (8 × 2 grad accum) |
| Learning rate | 1e-4 with linear warmup over 100 steps, then linear decay |
| Epochs | 3 (1764 optimizer steps) |
| Hardware | Apple M1 Max 64 GB (PyTorch + MPS backend) |
| Wall clock | ~3.5 h |
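
The step count, trainable-parameter count, and learning-rate schedule in the table can be
cross-checked with some simple arithmetic. The sketch below assumes that all 9,402 pairs were
used for training, that `whisper-small` has a hidden size of 768 with 12 encoder and 12 decoder
layers (architecture facts not stated in this card), that LoRA is attached to every `q_proj`
and `v_proj` in self-attention and cross-attention, and that the linear decay ends at zero on
the final step. The function names are hypothetical.

```python
import math

# Values taken from the tables in this card.
NUM_PAIRS = 9_402
EFFECTIVE_BATCH = 8 * 2          # per-step batch of 8 with 2 gradient-accumulation steps
EPOCHS = 3
BASE_LR = 1e-4
WARMUP_STEPS = 100
RANK = 32

# Assumed whisper-small architecture (not stated in this card).
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

def optimizer_steps() -> int:
    """Optimizer steps over the whole run, rounding each epoch up to a full step."""
    return math.ceil(NUM_PAIRS / EFFECTIVE_BATCH) * EPOCHS

def lora_trainable_params() -> int:
    """LoRA adds two rank-r matrices (d x r and r x d) per targeted projection."""
    # Encoder layers have self-attention; decoder layers have self- and cross-attention.
    attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
    targeted_projections = attention_blocks * 2  # q_proj and v_proj
    return targeted_projections * RANK * (D_MODEL + D_MODEL)

def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to BASE_LR, then linear decay (assumed to reach zero at the end)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0.0, (total_steps - step) / (total_steps - WARMUP_STEPS))

total = optimizer_steps()
print(total)                         # 1764
print(lora_trainable_params())       # 3538944, i.e. about 3.5 M
print(learning_rate(1600, total))    # learning rate at the released checkpoint's step
```

Under these assumptions the arithmetic agrees with the 1764 optimizer steps and the 3.5 M
trainable parameters reported in the table.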

The released checkpoint is the one with the lowest eval WER over training (step 1600). Note that
the eval trajectory is **noisier than the smoothed variant's**: there were transient generation
spikes at steps 1000, 1200, and 1764 where eval WER climbed back to ~37 % before recovering.
This is a known interaction between `Seq2SeqTrainer`'s eval-time generation, the MPS backend,
and PEFT-wrapped models on Apple Silicon. Loss decreased monotonically throughout, which
suggests the model itself was healthy across the spikes.

## Limitations

- **English only.**
- **Adult speakers** — not validated on children.
- **Verbatim style is FluencyBank-shaped.** The model learned to emit `uh` / `um` / one-word
  repetitions in the style of CHAT-format transcripts. Other annotation conventions (e.g.
  Switchboard's `{F uh, }` markup) will not be reproduced.
- **Non-commercial license** (inherited from FluencyBank's CC BY-NC-SA 4.0).
- **Large validation/test WER gap.** The gap between validation and test WER suggests
  session-level sensitivity. If test speakers' speech patterns differ from those seen in
  training, gains may be smaller.

## Citation

Same as the smoothed variant — see
[`nazarkozak/whisper-small-disfluent-smoothed-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora#citation).

## Related models

- **Smoothed variant** — drops disfluencies, returns clean intended speech:
  [`nazarkozak/whisper-small-disfluent-smoothed-lora`](https://huggingface.co/nazarkozak/whisper-small-disfluent-smoothed-lora)