phonetic-whisper-mlx-narrow-en

Whisper-large-v3 decoder fine-tuned for narrow International Phonetic Alphabet (IPA) transcription of English, trained on TIMIT alone using MLX on a single Apple Silicon machine.

Companion variant: phonetic-whisper-mlx-broad-multi trains on TIMIT broad + CommonVoice broad in 7 languages and emits broad-phonemic IPA. Use this narrow-en variant for English narrow phonetic detail; use broad-multi for cross-lingual broad IPA.

Code: barathanaslan/phonetic-whisper-mlx

Model description

phonetic-whisper-mlx-narrow-en is a decoder-only fine-tune of mlx-community/whisper-large-v3-mlx. The encoder is frozen during training; only the decoder weights are updated. The model takes 16 kHz English audio and emits TIMIT-narrow IPA strings.

Output convention. TIMIT-narrow IPA, NFC-normalized, with the TIMIT-style closures (bcl, dcl, gcl, pcl, tcl, kcl) and silences (pau, epi, h#) dropped. The remaining 52-symbol inventory preserves narrow distinctions such as the glottal stop ʔ, the flap ɾ, syllabic consonants (m̩, n̩, l̩, ŋ̍), r-coloured vowels (ɝ, ɚ), the reduced vowel ɨ, the devoiced schwa ə̥, the fronted ʉ, the voiced glottal ɦ, and the nasal flap ɾ̃.
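The output convention above can be sketched as a small filter over TIMIT phone labels. This is only an illustration: `timit_labels_to_ipa`, `DROPPED_LABELS`, and `PARTIAL_IPA_MAP` are hypothetical names, the table covers just a subset of the 52-symbol inventory, and joining symbols without a separator is an assumption. The repository's `prepare_timit_dataset.py` remains the authoritative mapping.

```python
import unicodedata

# Closures and silences that the output convention drops.
DROPPED_LABELS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# Partial, illustrative TIMIT-label -> IPA table (NOT the full inventory).
PARTIAL_IPA_MAP = {
    "q": "ʔ", "dx": "ɾ", "nx": "ɾ̃",
    "em": "m̩", "en": "n̩", "el": "l̩", "eng": "ŋ̍",
    "er": "ɝ", "axr": "ɚ", "ix": "ɨ", "ax-h": "ə̥",
    "ux": "ʉ", "hv": "ɦ", "ah": "ʌ",
    "b": "b", "d": "d", "g": "g", "p": "p", "t": "t", "k": "k",
}


def timit_labels_to_ipa(labels):
    """Drop closures/silences, map the rest to IPA, and NFC-normalize.

    Raises KeyError for labels missing from the partial table above.
    """
    kept = [PARTIAL_IPA_MAP[label] for label in labels if label not in DROPPED_LABELS]
    # Assumption: symbols are concatenated with no separator.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    # "butter"-like label sequence with leading/trailing silence and a closure.
    print(timit_labels_to_ipa(["h#", "bcl", "b", "ah", "dx", "axr", "h#"]))
```

NFC normalization matters mainly for symbols built from a base letter plus a combining diacritic, so that string comparison during scoring does not depend on how the diacritics were typed.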

Intended use

Research on Whisper-decoder fine-tuning for narrow phonetic transcription of English.
Generation of TIMIT-style IPA transcripts for English speech corpora.
Comparison work against this checkpoint on TIMIT-narrow conventions.

Out of scope: broad-IPA transcription (use the companion broad-multi variant); non-English input (this model was trained only on English speech with TIMIT-style narrow transcriptions); orthographic ASR; cross-lingual phonetic recognition; commercial deployment without complying with the upstream LDC TIMIT non-commercial licensing terms.

How to use

MLX (Apple Silicon)

from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-english-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())

For training reproduction, see the GitHub repository.

Training data

Source	Samples	Convention
TIMIT narrow (English, ARPABET → IPA via `prepare_timit_dataset.py`)	4,620	Narrow

Approximately 3 hours of read English speech.

TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see License.

Training procedure

Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with MLX. Training was set up with automatic early-stopping; full hyperparameters, launchers, and reproduction commands are in the GitHub repository.
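As a rough illustration of the schedule shape named above (linear warmup followed by cosine decay), the sketch below computes the learning rate for a given step in plain Python. The function name `lr_at_step` and every numeric value in the usage lines are hypothetical placeholders, not the hyperparameters used for this checkpoint; those are documented in the GitHub repository.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, final_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to final_lr."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Warmup phase: learning rate grows linearly with the step index.
        return peak_lr * step / warmup_steps
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step).
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at_step(step, peak_lr=1e-5, warmup_steps=100, total_steps=1000))
```

MLX also provides schedule helpers in `mlx.optimizers` that can express the same shape; the pure-Python version is used here only so the arithmetic is visible.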

Training-time language token

All training samples use <|en|> as the language token in the start-of-transcript prefix, regardless of source-audio language; the token is overloaded to mean "emit IPA". This is intentional: phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass language="en" at inference.

Evaluation

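PFER (Phonetic Feature Error Rate) is a feature-weighted edit distance: substituting one phone for another costs the Hamming distance between their PanPhon vectors of 24 articulatory features, divided by 24, and each insertion or deletion costs 1. PER (Phone Error Rate) is the segment-level edit distance divided by the reference length.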

Benchmark	n	PFER (%)	PER (%)
TIMIT narrow core test (in-distribution)	1,680	5.83	14.98
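The two metrics defined above can be sketched with one edit-distance routine and two substitution costs. This is a minimal illustration, not the repository's evaluation code: `TOY_FEATURES` is a hypothetical 4-feature table standing in for PanPhon's 24-feature vectors, all function names are made up for this example, and normalizing PFER by reference length is an assumption.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical toy table (sonorant, voiced, coronal, syllabic); NOT PanPhon's values.
TOY_FEATURES: Dict[str, Tuple[int, ...]] = {
    "t": (-1, -1, +1, -1),
    "d": (-1, +1, +1, -1),
    "ɾ": (+1, +1, +1, -1),
    "a": (+1, +1, -1, +1),
}


def hamming_sub_cost(a: str, b: str, table: Dict[str, Tuple[int, ...]] = TOY_FEATURES) -> float:
    """Fraction of features that differ (with 24 features this is Hamming / 24)."""
    fa, fb = table[a], table[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def exact_sub_cost(a: str, b: str) -> float:
    """Segment-level cost used for PER: 0 if identical, else 1."""
    return 0.0 if a == b else 1.0


def edit_distance(ref: Sequence[str], hyp: Sequence[str],
                  sub_cost: Callable[[str, str], float]) -> float:
    """Dynamic-programming edit distance with insertion/deletion cost 1."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1.0,                  # deletion
                            curr[j - 1] + 1.0,              # insertion
                            prev[j - 1] + sub_cost(r, h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str],
               sub_cost: Callable[[str, str], float]) -> float:
    """Edit distance divided by reference length (assumed normalization)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp, sub_cost) / len(ref)


if __name__ == "__main__":
    ref, hyp = ["d", "a", "t", "a"], ["d", "a", "ɾ", "a"]
    print("PER :", error_rate(ref, hyp, exact_sub_cost))
    print("PFER:", error_rate(ref, hyp, hamming_sub_cost))
```

In this toy case the single t/ɾ confusion counts as a full error for PER but only as a partial error for PFER, because the two phones share half of the toy features. This is why PFER is typically much lower than PER on the same output.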

No fair peer comparison

There is no published Whisper-decoder fine-tune on TIMIT narrow evaluated under the per-phone Hamming/24 PFER convention used here, so this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail) and are not published as quality claims.

Limitations

English-only. This checkpoint was trained only on English speech with TIMIT-style narrow transcriptions. For multilingual or broad-IPA transcription, use the companion broad-multi variant.
Small training corpus. ~3 hours of audio; the in-training validation curve shows clear overfitting after step 4,000, which is why early stopping triggered at step 9,000.
AR-decoder repetition. Whisper's autoregressive decoder can produce repetition hallucinations on out-of-distribution short utterances; this is a known structural property of AR decoders vs. CTC.
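One way to screen outputs for the repetition failure mode described above is a simple post-hoc check that looks for an n-gram repeated many times in a row. The sketch below is a heuristic under stated assumptions, not part of this model or its repository: `longest_repeat_run` and `looks_repetitive` are hypothetical helpers, and the default threshold of 4 is an arbitrary placeholder that would need tuning on real transcripts.

```python
def longest_repeat_run(symbols, max_ngram=4):
    """Largest number of back-to-back repeats of any n-gram with n <= max_ngram."""
    if not symbols:
        return 0
    symbols = list(symbols)
    best = 1
    for n in range(1, max_ngram + 1):
        for start in range(len(symbols) - n + 1):
            gram = symbols[start:start + n]
            run, pos = 1, start + n
            # Count how many times the same n-gram follows itself immediately.
            while symbols[pos:pos + n] == gram:
                run += 1
                pos += n
            best = max(best, run)
    return best


def looks_repetitive(symbols, threshold=4, max_ngram=4):
    """Flag a transcript whose longest repeat run reaches the (assumed) threshold."""
    return longest_repeat_run(symbols, max_ngram) >= threshold


if __name__ == "__main__":
    print(looks_repetitive(list("ðəkæt")))         # ordinary transcript
    print(looks_repetitive(list("abcabcabcabc")))  # looping output
```

Passing a plain string splits it into code points, so combining diacritics count as separate symbols; that is adequate for spotting loops but is not a phone-level segmentation.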

Citation

@software{aslan2026phonetic_whisper_mlx,
  author       = {Aslan, Barathan},
  title        = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year         = {2026},
  url          = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version      = {0.1.0},
  license      = {MIT (code), CC BY-NC 4.0 (weights)}
}

For training data:

Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.

For the per-phone Hamming/24 PFER convention:

Taguchi, C. Universal Automatic Phonetic Transcription into the IPA. arXiv:2308.03917, 2023.

Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.

License

Trained model weights: CC BY-NC 4.0. The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.

Source code: MIT, distributed via the GitHub repository.
