Pathumma-whisper-th-large-v3 (MLX)

Unofficial community conversion. This repository is a community-maintained MLX-format conversion. It is not affiliated with, endorsed by, or maintained by NECTEC. Model weights are unchanged from the upstream release; only the storage format has been converted.

This repository provides NECTEC's Pathumma-whisper-th-large-v3 pre-converted to Apple MLX format, so Apple Silicon users can load the model directly with mlx-whisper without re-running the conversion (which takes ~5–10 minutes plus a 3 GB upstream download).

Apple Silicon required. MLX is not supported on Intel Macs or non-Apple platforms. If you are not on an M-series Mac, use the original nectec/Pathumma-whisper-th-large-v3 with HuggingFace Transformers or faster-whisper instead.
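A script can check the platform before importing mlx-whisper. The sketch below assumes that a native Apple Silicon Python reports `Darwin` / `arm64` through the standard `platform` module; the helper name `is_apple_silicon` is hypothetical and not part of mlx-whisper. A Python interpreter running under Rosetta typically reports `x86_64` and is treated as unsupported here.

```python
import platform
from typing import Optional


def is_apple_silicon(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """Return True when the interpreter runs natively on an Apple Silicon Mac.

    The arguments default to the values reported by the current interpreter;
    they are overridable so the check can be exercised on any machine.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # Assumption: native Apple Silicon Python reports Darwin / arm64.
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    if is_apple_silicon():
        print("Apple Silicon detected: the MLX weights in this repo should be usable.")
    else:
        print("Not Apple Silicon: use nectec/Pathumma-whisper-th-large-v3 instead.")
```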

Why Pathumma over generic Whisper-large-v3

Generic Whisper-large-v3 (the OpenAI release, or mlx-community/whisper-large-v3-mlx) is widely reported by the community to produce repetition-loop hallucinations on Thai audio, where the same word or token repeats until the segment ends. Pathumma is a full fine-tune on Thai data that mitigates this issue. Among the open-source offline baselines compared in the Typhoon ASR Real-time paper, it has the lowest CER (the Typhoon Whisper Large-v3 reference model aside).

In informal testing on a 73-second Thai social-media audio clip:

Generic mlx-community/whisper-large-v3-mlx: produced repetition-loop output on the first run.
This Pathumma MLX conversion: produced fluent Thai output without repetition loops.

This is not a controlled hallucination-rate study, only an anecdotal observation. Readers who care about hallucination behaviour should evaluate on their own data.
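For readers evaluating on their own data, one simple way to flag repetition loops in a transcript is to look for a short character sequence that repeats back-to-back many times. The sketch below is a heuristic, not a method used by Pathumma or mlx-whisper; the function names and the thresholds (`max_unit=20`, `min_repeats=5`) are assumptions chosen for illustration. It works on characters rather than words because Thai text is not space-delimited.

```python
def longest_repeat_run(text: str, max_unit: int = 20):
    """Find the back-to-back repeated unit covering the most characters.

    Returns (unit, count). When nothing repeats, returns ("", 1).
    Whitespace is ignored so that spacing differences do not hide a loop.
    """
    chars = [c for c in text if not c.isspace()]
    best_unit, best_count = "", 1
    for n in range(1, max_unit + 1):
        i = 0
        while i + 2 * n <= len(chars):
            unit = chars[i:i + n]
            count, j = 1, i + n
            # Count how many times the same unit follows itself directly.
            while chars[j:j + n] == unit:
                count += 1
                j += n
            if count > 1 and count * n > best_count * len(best_unit):
                best_unit, best_count = "".join(unit), count
            # Skip past a run once it has been counted.
            i = j if count > 1 else i + 1
    return best_unit, best_count


def looks_like_loop(text: str, min_repeats: int = 5) -> bool:
    """Heuristic flag: some unit repeats at least `min_repeats` times in a row."""
    _, count = longest_repeat_run(text)
    return count >= min_repeats
```

A flagged transcript is only a candidate for manual review: legitimate speech can also contain repeated words, and a loop with small variations between repetitions would not be caught.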

Reported benchmarks (CER %, lower is better)

The CER values below are reported in the literature for the original Pathumma model. They have not been re-measured against this MLX conversion. Because the conversion changes only the storage format and stores the weights as FP16 (no retraining, no quantization beyond FP16), accuracy should be equivalent up to floating-point conversion noise; this has not been independently verified in this repository.

Source: Sirichotedumrong et al., "Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition", arXiv:2601.13044, Table 6 (Impact of Data Quality on Model Performance).

Model	TVSpeech (noisy)	Gigaspeech2-Th	FLEURS-Th (Orig. / Norm.)
Biodatlab Whisper Large (Thonburian)	18.96	13.22	16.50 / 15.26
Biodatlab Distil-Whisper Large	13.82	8.24	6.77 / 8.63
Pathumma Whisper Large-v3	10.36	5.84	6.29 / 7.88
(reference) Typhoon Whisper Large-v3	6.32	4.69	9.98 / 5.69

The FLEURS-Th column has two values in the source: Orig. (original references as released) and Norm. (after the Typhoon team's normalization pipeline). Pathumma achieves the best Orig. score; Typhoon Whisper Large-v3 achieves the best Norm. score. Neither is universally "best" — the right choice depends on whether your downstream pipeline normalizes Thai numbers, the mai yamok repetition marker, and similar conventions.
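Since the figures above have not been re-measured for this conversion, readers may want to compute CER themselves. The sketch below computes character error rate as the Levenshtein distance between reference and hypothesis divided by the reference length, as a percentage. The `normalize` step (Unicode NFC plus whitespace removal) is an assumption for illustration only; it is not the Typhoon team's normalization pipeline, so the resulting numbers are not directly comparable to the table.

```python
import unicodedata


def normalize(text: str) -> str:
    """Illustrative normalization: NFC form with all whitespace removed."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if not ch.isspace())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Corpus-level CER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance percentages.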

Usage

# Apple Silicon Mac required
pip install mlx-whisper
# ffmpeg is required by mlx-whisper to decode .m4a / .mp3 / .mp4
brew install ffmpeg

import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.wav",
    path_or_hf_repo="kinoppy555/Pathumma-whisper-th-large-v3-mlx",
    language="th",
)
print(result["text"])

Passing language="th" explicitly is recommended. Whisper's automatic language detection can mislabel Thai-only content when the first 30 seconds contain music, silence, or code-switched English; specifying the language avoids these failure modes.

Anecdotal speed

On an Apple M3 Pro (36 GB unified memory), transcribing a 73-second Thai social-media audio clip:

This MLX conversion (Pathumma): ~85 s wall time (≈0.86× real-time speed, i.e. slightly slower than real time, including model load).
mlx-community/whisper-large-v3-mlx (generic): ~100–230 s, output sometimes degraded by repetition loops.

These are single-run informal numbers, not a benchmark.
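The relation between clip length, wall time, and the "× real-time" figure can be sketched as below. The function names are hypothetical; the inputs in the usage loop are the approximate values quoted above (a 73-second clip, wall times of roughly 85, 100, and 230 seconds).

```python
def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall time (above 1 is faster than real time)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Wall time per second of audio (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


if __name__ == "__main__":
    clip = 73.0  # clip length in seconds, as quoted above
    for label, wall in [("Pathumma MLX", 85.0),
                        ("generic, fastest run", 100.0),
                        ("generic, slowest run", 230.0)]:
        print(f"{label}: {speed_factor(clip, wall):.2f}x real-time, "
              f"RTF {real_time_factor(clip, wall):.2f}")
```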

Conversion details

Converted using ml-explore/mlx-examples/whisper/convert.py at commit e52c128d113f10546f0fa391f87edcc58d3880cb (sha256 of convert.py: 1f1de41ac5d3faeb241bcea97a3f99c760b802cd92e02715699c17b5658f5cb2).

python3 convert.py \
    --torch-name-or-path nectec/Pathumma-whisper-th-large-v3 \
    --mlx-path ./pathumma-th-large-v3-mlx \
    --dtype float16
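To confirm that a local copy of convert.py matches the checksum quoted above, a small standard-library check is enough. In the sketch below, the helper names are hypothetical and the path `convert.py` in the usage lines is an assumption about where the script was saved.

```python
import hashlib

# SHA-256 of convert.py as quoted in this section.
EXPECTED_SHA256 = "1f1de41ac5d3faeb241bcea97a3f99c760b802cd92e02715699c17b5658f5cb2"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    import os
    if os.path.exists("convert.py"):  # assumed location of the downloaded script
        print("checksum OK" if matches_expected("convert.py") else "checksum MISMATCH")
```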

Note for re-converters only. Upstream convert.py writes weights as model.safetensors, but mlx_whisper.load_models expects weights.safetensors. After running the command above, rename the file:
mv pathumma-th-large-v3-mlx/model.safetensors pathumma-th-large-v3-mlx/weights.safetensors
The pre-converted files in this Hugging Face repository are already named correctly; the rename is only needed if you reproduce the conversion from scratch.
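For scripted re-conversions, the rename can be made idempotent so that re-running the pipeline does not fail once the file has already been renamed. The sketch below uses only `pathlib`; the helper name `ensure_weights_name` is hypothetical, and the two file names are the ones given in the note above.

```python
from pathlib import Path


def ensure_weights_name(mlx_dir) -> Path:
    """Rename model.safetensors to weights.safetensors if that has not happened yet.

    Returns the path to weights.safetensors. Raises FileNotFoundError when
    neither file is present in the directory.
    """
    mlx_dir = Path(mlx_dir)
    src = mlx_dir / "model.safetensors"
    dst = mlx_dir / "weights.safetensors"
    if dst.exists():
        return dst  # already in the expected layout, nothing to do
    if not src.exists():
        raise FileNotFoundError(f"no model.safetensors or weights.safetensors in {mlx_dir}")
    src.rename(dst)
    return dst


if __name__ == "__main__":
    target = Path("./pathumma-th-large-v3-mlx")  # output directory from the command above
    if target.is_dir():
        print(ensure_weights_name(target))
```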

Limitations

Format conversion only. No retraining, no fine-tuning, no quantization beyond the FP16 down-cast performed by convert.py --dtype float16. All limitations of the upstream Pathumma model are inherited.
Tokenizer. mlx-whisper loads the standard Whisper tokenizer assets bundled with the installed mlx-whisper package (originally from openai-whisper) rather than from this repo, so this repository ships only weights.safetensors and config.json, plus the LICENSE file.
Hallucinations. Whisper-family models can still produce repetition loops or skip segments on out-of-domain audio (very noisy, very short, multi-speaker overlap, code-switched audio, music). The mitigation described above is anecdotal, not a controlled study.
Streaming. Pathumma is an offline encoder-decoder Whisper. For real-time / low-latency streaming Thai ASR, see Typhoon ASR Real-time.
Domain coverage. Trained primarily on the corpora described in the upstream NECTEC release; performance on highly specialised domains (medical, legal, regional dialects) is not characterised here.

Files

File	Size	Description
`weights.safetensors`	~2.9 GB	MLX-formatted FP16 weights (converted from the upstream checkpoint without retraining)
`config.json`	<1 KB	MLX whisper model dimensions
`LICENSE`	11 KB	Apache License 2.0 (inherited from upstream)

License

Apache License 2.0, inherited from nectec/Pathumma-whisper-th-large-v3. The full text is included in the LICENSE file in this repository. This MLX-format derivative is distributed under the same Apache License 2.0.

This repository is a "Derivative Work" as defined in Apache 2.0 §1. As the statement of changes called for by §4(b): the weights have been converted from the upstream PyTorch checkpoint format to MLX-format safetensors (FP16). The architecture, training data, and training procedure are unchanged, and the model has not been retrained.

Credits

Original model: NECTEC (National Electronics and Computer Technology Center), Thailand — nectec/Pathumma-whisper-th-large-v3. Authors: Pattara Tipaksorn, Wayupuk Sommuang, Oatsada Chatthong, Kwanchiva Thangthai (Pathumma Audio Team).
Base architecture: OpenAI Whisper-large-v3 (openai/whisper-large-v3).
MLX framework: Apple Inc. — ml-explore/mlx.
Conversion script: ml-explore/mlx-examples (Apache 2.0).

Citation

If you use Pathumma in your research, please cite the upstream NECTEC work as instructed on the source model card:

@misc{tipaksorn2024PathummaWhisper,
    title        = { {Pathumma Whisper Large V3 (TH)} },
    author       = { Pattara Tipaksorn and Wayupuk Sommuang and Oatsada Chatthong and Kwanchiva Thangthai },
    url          = { https://huggingface.co/nectec/Pathumma-whisper-th-large-v3 },
    publisher    = { Hugging Face },
    year         = { 2024 },
}

For the benchmark numbers reproduced above, please cite:

@misc{sirichotedumrong2026typhoon,
    title         = { Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition },
    author        = { Warit Sirichotedumrong and Adisai Na-Thalang and Potsawee Manakul and Pittawat Taveekitworachai and Sittipong Sripaisarnmongkol and Kunat Pipatanakul },
    eprint        = { 2601.13044 },
    archivePrefix = { arXiv },
    year          = { 2026 },
}
