Pathumma-whisper-th-large-v3 (MLX)

Unofficial community conversion. This repository is a community-maintained MLX-format conversion. It is not affiliated with, endorsed by, or maintained by NECTEC. Model weights are unchanged from the upstream release; only the storage format has been converted.

This repository provides NECTEC's Pathumma-whisper-th-large-v3 pre-converted to Apple MLX format, so Apple Silicon users can load the model directly with mlx-whisper without re-running the conversion (which takes ~5–10 minutes plus a 3 GB upstream download).

Apple Silicon required. MLX is not supported on Intel Macs or non-Apple platforms. If you are not on an M-series Mac, use the original nectec/Pathumma-whisper-th-large-v3 with Hugging Face Transformers or faster-whisper instead.
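A quick platform check before installing can be sketched as follows. It uses only Python's standard `platform` module; the helper name `is_apple_silicon` is hypothetical, and an interpreter running under Rosetta may report `x86_64` even on an M-series Mac, so treat a negative result as a hint rather than proof.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True when the interpreter runs natively on an Apple Silicon Mac.

    Both arguments default to the values reported by the current interpreter;
    they are exposed only so the check can be exercised on any machine.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # macOS reports "Darwin"; native Apple Silicon Python reports "arm64".
    return system == "Darwin" and machine == "arm64"
```

Calling `is_apple_silicon()` with no arguments checks the current machine.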

Why Pathumma over generic Whisper-large-v3

Generic Whisper-large-v3 (OpenAI release, or mlx-community/whisper-large-v3-mlx) is widely reported by the community to produce repetition-loop hallucinations on Thai audio — the same word or token repeats indefinitely until the segment ends. Pathumma is a Thai-language full fine-tune that mitigates this issue and reports the lowest CER among open-source offline baselines in the Typhoon ASR Real-time paper.

In informal testing on a 73-second Thai social-media audio clip:

  • Generic mlx-community/whisper-large-v3-mlx: produced repetition-loop output on the first run.
  • This Pathumma MLX conversion: produced fluent Thai output without repetition loops.

This is not a controlled hallucination-rate study, only an anecdotal observation. Readers who care about hallucination behaviour should evaluate on their own data.
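For readers who want to screen their own transcripts, one rough way to flag a repetition loop is sketched below. This is a hypothetical heuristic, not something used by Pathumma or mlx-whisper: it looks for any unit of 2 to 20 characters repeated back-to-back at least five times. It works on characters rather than words because Thai is written without spaces between words. Legitimate repetition (for example "5555555555" as laughter) will also be flagged, so the thresholds are assumptions to tune.

```python
import re

# Assumed thresholds: a unit of 2-20 characters, repeated >= 5 times in a row.
_LOOP_PATTERN = re.compile(r"(.{2,20}?)\1{4,}", re.DOTALL)


def find_repetition_loop(text):
    """Return the repeated unit if the text contains a loop, else None."""
    match = _LOOP_PATTERN.search(text)
    return match.group(1) if match else None
```

A transcript for which `find_repetition_loop` returns a non-`None` value is worth inspecting by hand.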

Reported benchmarks (CER %, lower is better)

The CER values below are reported in the literature for the original Pathumma model. They have not been re-measured against this MLX conversion. The conversion involves no retraining and no quantization beyond FP16, and is intended to preserve the FP16 weights bit-identically, so accuracy should match the upstream model up to floating-point noise at inference time; this equivalence has not been independently verified in this repository.

Source: Sirichotedumrong et al., "Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition", arXiv:2601.13044, Table 6 (Impact of Data Quality on Model Performance).

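| Model                                | TVSpeech (noisy) | Gigaspeech2-Th | FLEURS-Th (Orig. / Norm.) |
|--------------------------------------|------------------|----------------|---------------------------|
| Biodatlab Whisper Large (Thonburian) | 18.96            | 13.22          | 16.50 / 15.26             |
| Biodatlab Distil-Whisper Large       | 13.82            | 8.24           | 6.77 / 8.63               |
| Pathumma Whisper Large-v3            | 10.36            | 5.84           | 6.29 / 7.88               |
| (reference) Typhoon Whisper Large-v3 | 6.32             | 4.69           | 9.98 / 5.69               |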

The FLEURS-Th column has two values in the source: Orig. (original references as released) and Norm. (after the Typhoon team's normalization pipeline). Pathumma achieves the best Orig. score; Typhoon Whisper Large-v3 achieves the best Norm. score. Neither is universally "best" — the right choice depends on whether your downstream pipeline normalizes Thai numbers, the mai yamok repetition marker, and similar conventions.
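To measure CER on your own data, a minimal sketch is shown below: character-level Levenshtein distance divided by the reference length. It does not reproduce the Typhoon team's normalization pipeline; the optional `normalize` hook is a hypothetical placeholder for whatever normalization your pipeline applies. Characters are counted as Unicode code points, so Thai vowel and tone marks count individually.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis, normalize=None):
    """Character error rate as a fraction (multiply by 100 for percent)."""
    if normalize is not None:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Scoring the same transcripts with and without a normalizer shows how much of the gap between the Orig. and Norm. columns comes from conventions rather than recognition errors.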

Usage

# Apple Silicon Mac required
pip install mlx-whisper
# ffmpeg is required by mlx-whisper to decode .m4a / .mp3 / .mp4
brew install ffmpeg

import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.wav",
    path_or_hf_repo="kinoppy555/Pathumma-whisper-th-large-v3-mlx",
    language="th",
)
print(result["text"])

Passing language="th" explicitly is recommended. Whisper's automatic language detection can mislabel Thai-only content when the first 30 seconds contain music, silence, or code-switched English; specifying the language avoids these failure modes.
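Beyond the full text, the result dictionary also carries per-segment timings. The sketch below wraps the call shown above and prints one timestamped line per segment. The helper names are hypothetical; `condition_on_previous_text` is an existing `mlx_whisper.transcribe` option that is left at its default here and exposed only so it can be switched off when experimenting with repetition behaviour. `mlx_whisper` is imported lazily so the formatting helpers can be used on any machine.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments):
    """Turn Whisper-style segments (start, end, text) into printable lines."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]


def transcribe_thai(audio_path, condition_on_previous_text=True):
    """Transcribe Thai audio with this repository's weights (Apple Silicon only)."""
    import mlx_whisper  # lazy import: requires an Apple Silicon Mac

    return mlx_whisper.transcribe(
        audio_path,
        path_or_hf_repo="kinoppy555/Pathumma-whisper-th-large-v3-mlx",
        language="th",
        condition_on_previous_text=condition_on_previous_text,
    )
```

For example, `print("\n".join(segments_to_lines(transcribe_thai("audio.wav")["segments"])))` lists each segment with its start and end time.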

Anecdotal speed

On an Apple M3 Pro (36 GB unified memory), transcribing a 73-second Thai social-media audio clip:

  • This MLX conversion (Pathumma): ~85 s wall time (≈0.87× real-time speed, i.e. slightly slower than real time; includes model load).
  • mlx-community/whisper-large-v3-mlx (generic): ~100–230 s, output sometimes degraded by repetition loops.

These are single-run informal numbers, not a benchmark.
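To take comparable measurements on your own hardware, the arithmetic can be sketched as below. The helper names are hypothetical. The speed factor is audio duration divided by wall time, so values below 1 mean slower than real time.

```python
import time


def speed_factor(audio_seconds, wall_seconds):
    """Audio duration divided by wall time; < 1.0 is slower than real time."""
    return audio_seconds / wall_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

With the figures above, 73 s of audio in roughly 85 s of wall time gives `speed_factor(73, 85)` of about 0.86, in line with the quoted value given that the wall time is approximate. Wrapping the transcription call in `timed` includes model load in the measurement, as the numbers above do.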

Conversion details

Converted using ml-explore/mlx-examples/whisper/convert.py at commit e52c128d113f10546f0fa391f87edcc58d3880cb (sha256 of convert.py: 1f1de41ac5d3faeb241bcea97a3f99c760b802cd92e02715699c17b5658f5cb2).

python3 convert.py \
    --torch-name-or-path nectec/Pathumma-whisper-th-large-v3 \
    --mlx-path ./pathumma-th-large-v3-mlx \
    --dtype float16
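When reproducing the conversion, the local copy of convert.py can be checked against the sha256 quoted above. The sketch below uses only the standard library; the helper name and the chunk size are arbitrary choices.

```python
import hashlib

# sha256 of convert.py as stated in this model card.
EXPECTED_SHA256 = "1f1de41ac5d3faeb241bcea97a3f99c760b802cd92e02715699c17b5658f5cb2"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If `sha256_of_file("convert.py") != EXPECTED_SHA256`, the script differs from the one used for this repository, and the output file names or layout may differ as well.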

Note for re-converters only. Upstream convert.py writes weights as model.safetensors, but mlx_whisper.load_models expects weights.safetensors. After running the command above, rename the file:

mv pathumma-th-large-v3-mlx/model.safetensors pathumma-th-large-v3-mlx/weights.safetensors

The pre-converted files in this Hugging Face repository are already named correctly; the rename is only needed if you reproduce the conversion from scratch.
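For scripted re-conversions, the rename step can be made idempotent, as in the sketch below. The helper name is hypothetical; the two file names are the ones described in the note above.

```python
from pathlib import Path


def ensure_weights_name(mlx_dir):
    """Rename model.safetensors to weights.safetensors if that is still needed.

    Returns the path to weights.safetensors; does nothing when the file
    already has the expected name.
    """
    mlx_dir = Path(mlx_dir)
    target = mlx_dir / "weights.safetensors"
    source = mlx_dir / "model.safetensors"
    if target.exists():
        return target
    if not source.exists():
        raise FileNotFoundError(f"no safetensors weights found in {mlx_dir}")
    source.rename(target)
    return target
```

Calling `ensure_weights_name("./pathumma-th-large-v3-mlx")` after the conversion command is safe to repeat.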

Limitations

  • Format conversion only. No retraining, no fine-tuning, no quantization beyond the FP16 down-cast performed by convert.py --dtype float16. All limitations of the upstream Pathumma model are inherited.
  • Tokenizer. mlx-whisper loads its tokenizer from assets bundled with the mlx-whisper package (originating from the openai-whisper distribution) rather than from this repo, so this repository ships only weights.safetensors and config.json.
  • Hallucinations. Whisper-family models can still produce repetition loops or skip segments on out-of-domain audio (very noisy, very short, multi-speaker overlap, code-switched audio, music). The mitigation described above is anecdotal, not a controlled study.
  • Streaming. Pathumma is an offline encoder-decoder Whisper. For real-time / low-latency streaming Thai ASR, see Typhoon ASR Real-time.
  • Domain coverage. Trained primarily on the corpora described in the upstream NECTEC release; performance on highly specialised domains (medical, legal, regional dialects) is not characterised here.

Files

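| File                | Size    | Description                                                            |
|---------------------|---------|------------------------------------------------------------------------|
| weights.safetensors | ~2.9 GB | MLX-formatted FP16 weights (bit-identical conversion of the original)  |
| config.json         | <1 KB   | MLX whisper model dimensions                                           |
| LICENSE             | 11 KB   | Apache License 2.0 (inherited from upstream)                           |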

License

Apache License 2.0, inherited from nectec/Pathumma-whisper-th-large-v3. The full text is included in the LICENSE file in this repository. This MLX-format derivative is distributed under the same Apache License 2.0.

This repository is a "Derivative Work" of the upstream model in the sense of Apache 2.0 §1: the weights have been converted from the upstream PyTorch checkpoint format to MLX-format safetensors (FP16). Only the storage format differs; the architecture, training data, and training procedure are unchanged, and no retraining has been applied to the weights.

Credits

Citation

If you use Pathumma in your research, please cite the upstream NECTEC work as instructed on the source model card:

@misc{tipaksorn2024PathummaWhisper,
    title        = { {Pathumma Whisper Large V3 (TH)} },
    author       = { Pattara Tipaksorn and Wayupuk Sommuang and Oatsada Chatthong and Kwanchiva Thangthai },
    url          = { https://huggingface.co/nectec/Pathumma-whisper-th-large-v3 },
    publisher    = { Hugging Face },
    year         = { 2024 },
}

For the benchmark numbers reproduced above, please cite:

@misc{sirichotedumrong2026typhoon,
    title         = { Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition },
    author        = { Warit Sirichotedumrong and Adisai Na-Thalang and Potsawee Manakul and Pittawat Taveekitworachai and Sittipong Sripaisarnmongkol and Kunat Pipatanakul },
    eprint        = { 2601.13044 },
    archivePrefix = { arXiv },
    year          = { 2026 },
}