language: en
license: cc-by-nc-sa-4.0
library_name: peft
base_model: openai/whisper-small
tags:
  - whisper
  - lora
  - automatic-speech-recognition
  - disfluent-speech
  - stuttering
  - accessibility
  - speech-disorders
  - verbatim-transcription
datasets:
  - amaai-lab/DisfluencySpeech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: whisper-small-disfluent-verbatim-lora
    results:
      - task:
          type: automatic-speech-recognition
          name: Disfluent Speech Recognition (verbatim)
        dataset:
          name: held-out validation split (verbatim transcripts)
          type: validation
          split: validation
        metrics:
          - type: wer
            value: 20.29
            name: Word Error Rate (held-out, verbatim)

whisper-small-disfluent-verbatim (LoRA adapter)

A LoRA fine-tune of openai/whisper-small that transcribes disfluent / stuttered speech verbatim — preserving filled pauses (uh, um), exact word repetitions, and partial-word fragments rather than smoothing them away.

This is the verbatim variant. A companion smoothed variant — which drops disfluencies to give clean text of the speaker's intended words — is published as nazarkozak/whisper-small-disfluent-smoothed-lora.

Why verbatim, not smoothed?

Most ASR systems aim for smoothed output (clean intended speech). But for clinical, research, and assistive use cases the disfluencies are the signal:

Speech therapy assessment — clinicians need verbatim transcripts to measure stuttering frequency, type distribution (blocks vs repetitions vs prolongations), and progress over time.
Stuttering research — corpus annotation work needs the model to capture what was actually said, including the disfluencies.
Accessibility tooling — some downstream tools want to expose disfluencies to the user (e.g. "your speech included 3 blocks and 5 filled pauses in the last minute").
Faithful captioning of interviews / oral history — when the disfluency is part of the speaker's voice, smoothing it out distorts the record.

The smoothed variant is the right pick for voice assistants, AAC apps, and most live-captioning use cases. Use this verbatim variant only when the disfluencies themselves matter.

How to use

import soundfile as sf
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model and its processor, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(model, "nazarkozak/whisper-small-disfluent-verbatim-lora")
model.eval()

# Input must be 16 kHz mono; resample first if your file differs.
audio, sr = sf.read("path/to/disfluent_speech.wav")
# Passing the file's real rate lets the processor reject anything that is not 16 kHz.
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
with torch.no_grad():
    generated = model.generate(input_features=input_features, language="english", task="transcribe", max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

Compared to the smoothed variant, expect output like "and uh I- I- I think that um yeah" rather than "and I think that yeah".
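
For the assessment and accessibility use cases listed earlier, a downstream tool might tally disfluencies directly from the verbatim text. The following is a minimal sketch, not part of this model or its training code: the function name `count_disfluencies`, the category labels, and the rules (uh/um are filled pauses, a token ending in "-" is a fragment, a complete word identical to the previous token is a repetition) are all assumptions made for illustration. It works only on what appears in the text, so it does not attempt to detect blocks or prolongations.

```python
from collections import Counter

# Assumption: the filled pauses named in this card (uh, um) are the only ones counted.
FILLED_PAUSES = {"uh", "um"}


def count_disfluencies(transcript: str) -> Counter:
    """Tally filled pauses, fragments, and exact word repetitions (hypothetical helper)."""
    counts = Counter()
    prev = None
    for token in transcript.lower().split():
        word = token.strip(".,?!")  # ignore sentence punctuation, keep the trailing "-"
        if not word:
            continue
        if word in FILLED_PAUSES:
            counts["filled_pause"] += 1
        elif word.endswith("-"):
            counts["fragment"] += 1
        elif word == prev:
            counts["repetition"] += 1
        prev = word
    return counts


print(count_disfluencies("and uh I- I- I think that um yeah"))
print(count_disfluencies("I I went to the the store"))
```

On the example output above, these rules count two filled pauses and two fragments; the second call counts two repetitions. A real clinical tool would need a more careful taxonomy than this.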

Performance

All numbers are Word Error Rate (lower is better), computed against verbatim references with Whisper's official English text normalizer.

Split	Vanilla Whisper-small WER (verbatim refs)	This adapter WER	Δ Absolute (pp)	Δ Relative
Full FluencyBank AWS-interview (5,444 utts)	26.75 %	26.46 %	−0.29	−1.1 %
Held-out validation (200 utts)	~26 %	20.29 %	−5.7	−22 %
Held-out test (200 utts)	28.71 %	27.99 %	−0.72	−2.5 %
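
The two Δ columns follow from the WER columns by simple arithmetic: the absolute change is a difference in percentage points, and the relative change divides that difference by the baseline WER. A small sketch that reproduces the table's values (the helper name `wer_deltas` is hypothetical, and the validation baseline is taken as 26.0 because the table only gives it as "~26 %"):

```python
def wer_deltas(baseline_wer: float, adapter_wer: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = adapter_wer - baseline_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


rows = {
    "full set": (26.75, 26.46),
    "validation": (26.0, 20.29),  # baseline is approximate in the table
    "test": (28.71, 27.99),
}
for name, (baseline, adapter) in rows.items():
    absolute, relative = wer_deltas(baseline, adapter)
    print(f"{name}: {absolute:+.2f} pp, {relative:+.1f} %")
```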

The contrast between the held-out validation (−22 % relative) and full-set (−1.1 % relative) numbers suggests that this adapter overfits to training sessions: it learns the disfluency style of specific speakers it saw rather than a general verbatim transcription policy. The companion smoothed variant generalizes much better (−21 % relative WER reduction on the same full set). For most production use cases, prefer the smoothed variant.

Note that verbatim is intrinsically a harder task than smoothed — the model must produce a longer transcript that includes natural-spoken markers (uh, um) and word fragments (I-) which are out-of-distribution for vanilla Whisper. The improvement on the held-out test split is modest, while the validation split shows a much larger gain. Real-world WER will fall somewhere in between depending on speaker/session similarity.
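
To see why verbatim references make the task harder, it helps to score a smoothed hypothesis against a verbatim reference. The sketch below computes WER as word-level edit distance divided by reference length. It is an illustration only: it normalizes by lowercasing and splitting on whitespace, which is an assumption and not the Whisper text normalizer used for the numbers above, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


verbatim_ref = "and uh I- I- I think that um yeah"
smoothed_hyp = "and I think that yeah"
print(f"{word_error_rate(verbatim_ref, smoothed_hyp):.3f}")
```

Under this simplified scoring, every dropped filled pause or fragment counts as a deletion, so a perfectly smoothed transcript of the example sentence is still charged four errors against nine reference words.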

Training data

Source	Hours	License	Use
FluencyBank Teaching — Adults Who Stutter (interview)	~17	CC BY-NC-SA 4.0 (TalkBank)	Train (most of the disfluent signal)
DisfluencySpeech (amaai-lab)	10	Apache 2.0	Train (single-speaker scripted, augmentation)

Total: ~27 hr training, 9,402 utterance pairs.

The verbatim transcripts come from:

For FluencyBank — CHAT-format transcripts processed with a custom parser that preserves disfluencies as natural words: &-uh → uh, &fri → fri-, exact repetitions kept. Bracket annotations ([/], [//]), pause marks (.), and unintelligible markers (xxx/yyy) are dropped — they are CHAT markup, not spoken content.
For DisfluencySpeech — the dataset's transcript_a field (its most-detailed natural-text variant, which keeps filled pauses and repetitions).
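
The FluencyBank rules described above can be sketched roughly as follows. This is not the parser used to build the training set; it is a hypothetical `chat_to_verbatim` helper that covers only the cases named in this section (`&-uh`, `&fri`, `[/]`, `[//]`, `(.)`, `xxx`, `yyy`) and ignores everything else in the CHAT format.

```python
import re


def chat_to_verbatim(utterance: str) -> str:
    """Turn a CHAT-style utterance into natural verbatim text (illustrative rules only)."""
    text = re.sub(r"\[/{1,2}\]", " ", utterance)  # drop [/] and [//] bracket annotations
    text = text.replace("(.)", " ")               # drop the pause mark
    words = []
    for token in text.split():
        if token in {"xxx", "yyy"}:
            continue                        # unintelligible markers are not spoken content
        if token.startswith("&-"):
            words.append(token[2:])         # filled pause: &-uh -> uh
        elif token.startswith("&"):
            words.append(token[1:] + "-")   # fragment: &fri -> fri-
        else:
            words.append(token)             # ordinary words, including exact repetitions
    return " ".join(words)


print(chat_to_verbatim("&-uh I [/] I went to my &fri friend's (.) house xxx"))
```

The example line is made up for illustration and is not taken from FluencyBank.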

Training recipe

Hyperparameter	Value
Base model	`openai/whisper-small` (244 M params)
LoRA target modules	`q_proj`, `v_proj`
LoRA rank / alpha / dropout	32 / 64 / 0.05
Trainable params	3.5 M (1.4 % of total)
Batch size (effective)	16 (8 × 2 grad accum)
Learning rate	1e-4 with linear warmup over 100 steps, then linear decay
Epochs	3 (1764 optimizer steps)
Hardware	Apple M1 Max 64 GB (PyTorch + MPS backend)
Wall clock	~3.5 h
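
The trainable-parameter and step counts in the table can be sanity-checked with a little arithmetic. The sketch below assumes the standard whisper-small shape (hidden size 768, 12 encoder and 12 decoder layers), that `q_proj` and `v_proj` are matched in every attention block including decoder cross-attention, and that all 9,402 pairs are used for training with the last partial batch kept. None of these assumptions is stated in this card.

```python
import math

# Assumed whisper-small architecture (not stated in this card).
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

RANK = 32
BASE_PARAMS = 244_000_000  # "244 M params" from the table

# Encoder self-attention, plus decoder self-attention and cross-attention.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_linears = attention_blocks * 2           # q_proj and v_proj in each block
params_per_linear = RANK * (D_MODEL + D_MODEL)   # LoRA A (r x in) plus B (out x r)
trainable = adapted_linears * params_per_linear
share = trainable / (BASE_PARAMS + trainable)

# Optimizer steps: micro-batches of 8, two accumulated per update, 3 epochs.
micro_batches = math.ceil(9_402 / 8)
steps_per_epoch = micro_batches // 2
total_steps = 3 * steps_per_epoch

print(f"trainable params: {trainable:,} ({100 * share:.1f} % of total)")
print(f"optimizer steps: {total_steps}")
```

Under these assumptions the result is about 3.5 M trainable parameters and 1764 optimizer steps, in line with the table.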

The released checkpoint is the one with the lowest eval WER during training (step 1600). Note: the eval trajectory is noisier than the smoothed variant's. There were transient generation spikes at steps 1000, 1200, and 1764 where eval WER climbed back to ~37 % before recovering. This is a known interaction between Seq2SeqTrainer's eval-time generation, the MPS backend, and PEFT-wrapped models on Apple Silicon. Loss decreased monotonically throughout, which suggests the model itself was healthy across the spikes.

Limitations

English only.
Adult speakers — not validated on children.
Verbatim style is FluencyBank-shaped. The model learned to emit uh / um / one-word repetitions in the style of CHAT-format transcripts. Other annotation conventions (e.g. Switchboard's {F uh, } markup) will not be reproduced.
Non-commercial license (inherited from FluencyBank's CC BY-NC-SA 4.0).
The large gap between validation and test WER suggests session-level sensitivity. If the test speakers' speech patterns differ from those seen in training, gains may be smaller.

Citation

Same as the smoothed variant — see nazarkozak/whisper-small-disfluent-smoothed-lora.

Related models

Smoothed variant — drops disfluencies, returns clean intended speech: nazarkozak/whisper-small-disfluent-smoothed-lora