---
license: cc-by-nc-4.0
language:
- en
library_name: mlx
pipeline_tag: automatic-speech-recognition
tags:
- speech-recognition
- phonetic-transcription
- ipa
- narrow-ipa
- whisper
- whisper-decoder-finetune
- mlx
- apple-silicon
- english
datasets:
- timit-asr/timit_asr
metrics:
- per
- pfer
base_model: mlx-community/whisper-large-v3-mlx
model-index:
- name: phonetic-whisper-mlx-narrow-en
  results:
  - task:
      type: automatic-speech-recognition
      name: Narrow-IPA phonetic transcription (English)
    dataset:
      name: TIMIT core test (narrow)
      type: timit
    metrics:
    - type: pfer
      value: 5.83
      name: Phonetic Feature Error Rate (PanPhon Hamming/24)
    - type: per
      value: 14.98
      name: Phone Error Rate (segment-level edit distance)
---

# phonetic-whisper-mlx-narrow-en

Whisper-large-v3 decoder fine-tuned for **narrow** International Phonetic Alphabet (IPA) transcription of English, trained on TIMIT alone using [MLX](https://github.com/ml-explore/mlx) on a single Apple Silicon machine.

> **Companion variant:** [`phonetic-whisper-mlx-broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi)
> trains on TIMIT broad + Common Voice broad in 7 languages and emits
> broad-phonemic IPA. Use this `narrow-en` variant for English narrow
> phonetic detail; use `broad-multi` for cross-lingual broad IPA.
>
> **Code:** [`barathanaslan/phonetic-whisper-mlx`](https://github.com/barathanaslan/phonetic-whisper-mlx)

## Model description

`phonetic-whisper-mlx-narrow-en` is a decoder-only fine-tune of [`mlx-community/whisper-large-v3-mlx`](https://huggingface.co/mlx-community/whisper-large-v3-mlx). The encoder is frozen during training; only the decoder weights are updated. The model takes 16 kHz English audio and emits TIMIT-narrow IPA strings.

**Output convention.** TIMIT-narrow IPA, NFC-normalized, with the TIMIT-style closures (`bcl`, `dcl`, `gcl`, `pcl`, `tcl`, `kcl`) and silences (`pau`, `epi`, `h#`) dropped.
The remaining 52-symbol inventory preserves narrow distinctions such as the glottal stop `ʔ`, the flap `ɾ`, syllabic consonants (`m̩`, `n̩`, `l̩`, `ŋ̍`), r-coloured vowels (`ɝ`, `ɚ`), the reduced vowel `ɨ`, the devoiced schwa `ə̥`, the fronted `ʉ`, the voiced glottal `ɦ`, and the nasal flap `ɾ̃`.

## Intended use

- Research on Whisper-decoder fine-tuning for narrow phonetic transcription of English.
- Generation of TIMIT-style IPA transcripts for English speech corpora.
- Comparison work against this checkpoint on TIMIT-narrow conventions.

**Out of scope:** broad-IPA transcription (use the companion `broad-multi` variant); non-English input (this model has only seen TIMIT-style English narrow); orthographic ASR; cross-lingual phonetic recognition; commercial deployment without complying with the upstream LDC TIMIT non-commercial licensing terms.

## How to use

### MLX (Apple Silicon)

```python
from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-narrow-en")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en"; see Training-time language token.
audio = load_audio("your-english-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())
```

For training reproduction, see the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).

## Training data

| Source | Samples | Convention |
|---|---:|---|
| TIMIT narrow (English, ARPABET → IPA via `prepare_timit_dataset.py`) | 4,620 | Narrow |

Approximately 3 hours of English read speech. TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see [License](#license).

## Training procedure

Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Training was set up with automatic early stopping; full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).

### Training-time language token

All training samples use `<|en|>` as the language token in the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional: phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**

## Evaluation

PFER (Phonetic Feature Error Rate) is per-phone Hamming distance over PanPhon's 24 articulatory features ÷ 24, with insertion/deletion cost = 1. PER is segment-level edit distance ÷ reference length.
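
The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code from the repository: `weighted_edit_distance`, `per`, `pfer`, and `TOY_FEATURES` are hypothetical names, the three-feature table is made up and only stands in for PanPhon's 24-feature vectors, and dividing PFER by the reference length is an assumption carried over from the PER definition.

```python
from typing import Callable, Dict, Sequence, Tuple


def weighted_edit_distance(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Edit distance with insertion/deletion cost 1 and a pluggable substitution cost."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,               # delete r from the reference
                    curr[j - 1] + 1.0,           # insert h from the hypothesis
                    prev[j - 1] + sub_cost(r, h),  # substitute (or match)
                )
            )
        prev = curr
    return prev[-1]


def per(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Segment-level edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return weighted_edit_distance(ref, hyp, lambda a, b: 0.0 if a == b else 1.0) / len(ref)


def pfer(
    ref: Sequence[str],
    hyp: Sequence[str],
    features: Dict[str, Tuple[int, ...]],
    n_features: int,
) -> float:
    """Edit distance whose substitution cost is Hamming distance / n_features.

    ASSUMPTION: normalized by reference length, as for PER.
    """
    if not ref:
        raise ValueError("reference must not be empty")

    def sub(a: str, b: str) -> float:
        return sum(x != y for x, y in zip(features[a], features[b])) / n_features

    return weighted_edit_distance(ref, hyp, sub) / len(ref)


# ASSUMPTION: made-up 3-feature vectors for illustration only.
# The real metric uses PanPhon's 24 articulatory features per phone.
TOY_FEATURES = {
    "t": (0, 0, 0),
    "d": (1, 0, 0),
    "n": (1, 1, 0),
    "a": (1, 0, 1),
}

if __name__ == "__main__":
    ref = ["t", "a", "n"]
    hyp = ["d", "a"]
    print(f"PER  = {per(ref, hyp):.3f}")
    print(f"PFER = {pfer(ref, hyp, TOY_FEATURES, n_features=3):.3f}")
```

With the toy table, substituting `d` for `t` costs 1/3 under PFER but a full 1 under PER, while the deleted `n` costs 1 under both. That gives PER = 2/3 and PFER = 4/9 for this pair, which shows why PFER is usually much lower than PER on the same output.
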

| Benchmark | n | PFER (%) | PER (%) |
|---|---:|---:|---:|
| TIMIT narrow core test (in-distribution) | 1,680 | **5.83** | **14.98** |

### No fair peer comparison

There is no published Whisper-decoder fine-tune on TIMIT narrow at the per-phone Hamming/24 PFER convention used here; this is a standalone in-distribution result. The benchmark adapters in the GitHub repository can run this checkpoint on other narrow benchmarks, but the resulting numbers are dominated by inventory mismatch (this model emits TIMIT-narrow detail) and are not published as quality claims.

## Limitations

- **English-only.** This checkpoint has only seen TIMIT-style English narrow during training. For multilingual or broad-IPA transcription use the companion [`broad-multi`](https://huggingface.co/barathanasln/phonetic-whisper-mlx-broad-multi) variant.
- **Small training corpus.** ~3 hours of audio; the in-training validation curve shows clear overfitting after step 4,000, which is why early stopping triggered at step 9,000.
- **AR-decoder repetition.** Whisper's autoregressive decoder can produce repetition hallucinations on out-of-distribution short utterances; this is a known structural property of AR decoders vs. CTC.

## Citation

```bibtex
@software{aslan2026phonetic_whisper_mlx,
  author  = {Aslan, Barathan},
  title   = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year    = {2026},
  url     = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version = {0.1.0},
  license = {MIT (code), CC BY-NC 4.0 (weights)}
}
```

For training data:

> Garofolo, J. S., et al. *TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1.* Web download. Philadelphia: Linguistic Data Consortium, 1993.

For the per-phone Hamming/24 PFER convention:

> Taguchi, C. *Universal Automatic Phonetic Transcription into the IPA.* arXiv:2308.03917, 2023.
>
> Lu et al. *POWSM: A Phonetic Open Whisper-Style Speech Foundation Model.* arXiv:2510.24992, 2025.

## License

**Trained model weights:** [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.

**Source code:** MIT, distributed via the GitHub repository.