Whisper large-v3-turbo: Apple Core AI export (autoregressive)

A pre-converted .aimodel from Apple's official coreai-models Whisper recipe, packaged so it actually transcribes on the stock Core AI runtime (no engine patch).

Whisper large-v3-turbo (809 M parameters) is OpenAI's multilingual ASR encoder-decoder.

What's different from the stock recipe (and why)

Apple's models/whisper/export.py traces the model with decoder_input_ids of shape [1, 1]: a single decode step with no KV cache. That graph can't be driven autoregressively. With one token and no cache, every step is "position 0" and loses all prior context, so it emits nothing useful.

This bundle is the same recipe with one change: the decoder is traced at a fixed 128-token window (decoder_input_ids: [1, 128]). You pad the decoder buffer to 128 and read the logits at the real last position. Because the self-attention is causal, the real token at position k never attends to the padding that follows it, so the read is exact. Because the shape is constant, MPSGraph compiles once; a dynamic-length export instead recompiles on every step, at ~15 s/token versus ~0.18 s/token here. Everything else is the upstream recipe, and it runs on the stock runtime.
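The causality argument can be checked on a toy model. The sketch below is not the Whisper decoder: it is a single-head causal self-attention with identity projections, and every name in it (causal_self_attention, pad_a, pad_b) is made up for illustration. It only shows that the output at a real position k is unchanged by whatever sits in the padded positions after it.

```python
import numpy as np

def causal_self_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head causal self-attention with identity Q/K/V projections."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)   # True where key index > query index
    scores = np.where(future, -np.inf, scores)           # a query never sees later positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
real = rng.standard_normal((4, 8))        # 4 "real" token embeddings
pad_a = np.zeros((124, 8))                # one choice of padding
pad_b = rng.standard_normal((124, 8))     # a completely different padding

out_a = causal_self_attention(np.vstack([real, pad_a]))   # padded to 128
out_b = causal_self_attention(np.vstack([real, pad_b]))   # padded to 128
out_short = causal_self_attention(real)                   # no padding at all

k = 3  # index of the last real token
same = np.allclose(out_a[k], out_b[k]) and np.allclose(out_a[k], out_short[k])
print("output at position k is independent of padding:", same)
```

In the real decoder the cross-attention and positional embeddings are also computed per position, which is consistent with the claim above, but this toy does not model them.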

# stock single-step recipe:           uv run models/whisper/export.py
# this bundle (fixed 128-token decode): see _export_whisper_fixed.py in the zoo conversion/ dir

Bundle

whisper-large-v3-turbo_float16_fixed128.aimodel/   main.mlirb + main.hash + metadata.json
tokenizer/                                         HF Whisper tokenizer (detokenize output ids)
mel_filters_128.npy                                [201, 128] mel filterbank for the audio frontend
preprocessor_config.json                           n_fft=400, hop=160, 128 mels, 16 kHz

File                                      SHA-256
…_fixed128.aimodel/main.mlirb (~1.5 GB)   f5824a2e01906ad72bb3241573e75a41ecbf89c2ebe5fb8b87716752cf144881
…_fixed128.aimodel/main.hash              2bd169f0ca2812f7a6321f973a7bf88ef0ff37bc823265efb13e1beb12a7bf2c
mel_filters_128.npy                       4eb6b0fe7aa985fa2ce80d81260d8b5b30ff908d2808a5d637683228c648db6f
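The checksums above can be verified after download with a few lines of standard-library Python. This is a minimal sketch: sha256_of, verify and EXPECTED are hypothetical names, and only the mel filterbank entry is filled in; add the .aimodel entries using the paths as they appear in your local copy.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 so the ~1.5 GB main.mlirb never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Relative path -> expected digest, copied from the table above.
EXPECTED = {
    "mel_filters_128.npy": "4eb6b0fe7aa985fa2ce80d81260d8b5b30ff908d2808a5d637683228c648db6f",
}

def verify(root, expected=EXPECTED) -> dict:
    """Return {relative_path: bool}; a missing file is reported as False."""
    results = {}
    for rel, want in expected.items():
        p = Path(root) / rel
        results[rel] = p.is_file() and sha256_of(p) == want
    return results
```

Calling verify(".") from the bundle directory returns a dictionary that should contain only True values if the files match.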

Measured (M4 Max, GPU)

Greedy decode, English clip, vs the HF PyTorch reference (generate, greedy):

Metric                          Value
Transcript                      token-for-token identical to PyTorch greedy
First step (compile + warmup)   0.68 s
Per token (steady state)        0.18 s

The fixed window caps the decode of a single 30 s segment at 128 tokens, prompt included, which is enough for a 30 s window. Chunk longer audio into 30 s segments.
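Chunking longer audio can be as simple as the sketch below, which assumes mono samples already resampled to 16 kHz. The helper name chunk_audio is hypothetical, and the split is naive: it cuts on fixed 30 s boundaries, so a word that straddles a boundary may be split. The bundle does not prescribe a chunking strategy.

```python
import numpy as np

SAMPLE_RATE = 16_000                          # Hz, as in preprocessor_config.json
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480_000 samples per segment

def chunk_audio(samples: np.ndarray) -> list:
    """Split mono 16 kHz audio into 30 s segments, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            # Pad the tail with silence so every segment has the same length.
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks
```

Each segment is then transcribed independently and the transcripts are concatenated in order.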

How to run it (the decode loop)

The graph takes input_features [1, 128, 3000] (log-mel) + decoder_input_ids [1, 128] → logits [1, 128, 51866].

  1. Audio → log-mel (mel_filters_128.npy, n_fft 400 / hop 160): STFT → power → mel filterbank → log10, raise anything below (max - 8) to (max - 8), then (x + 4) / 4; pad/trim to 3000 frames.
  2. Prompt: [<|startoftranscript|>, <|en|>, <|transcribe|>, <|notimestamps|>] (50258, 50259, 50360, 50364).
  3. Loop: pad the prompt to 128, run, take argmax(logits[0, k]) at the real last index k, append, repeat until <|endoftext|> (50257).
  4. Detokenize the generated ids with the bundled tokenizer.
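Steps 1 to 3 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the bundle's reference code: run_graph is a hypothetical callable standing in for whatever runtime call executes the .aimodel, the int32 id dtype and the choice of PAD_ID are guesses (any valid id should do, since padding sits after position k), and the periodic Hann window, reflect padding and 1e-10 floor follow the usual Whisper frontend rather than anything stated here.

```python
import numpy as np

N_FFT, HOP, N_FRAMES, WINDOW = 400, 160, 3000, 128
PROMPT = [50258, 50259, 50360, 50364]   # ids from step 2
EOT = 50257                             # <|endoftext|>
PAD_ID = EOT                            # assumption: the pad value is never read

def log_mel(audio: np.ndarray, mel_filters: np.ndarray) -> np.ndarray:
    """audio: mono 16 kHz samples; mel_filters: [201, 128]. Returns [1, 128, 3000]."""
    n = N_FRAMES * HOP                                    # 480_000 samples = 30 s
    audio = np.pad(audio[:n], (0, max(0, n - len(audio))))  # trim or zero-pad to 30 s
    audio = np.pad(audio, N_FFT // 2, mode="reflect")     # centre each frame
    hann = np.hanning(N_FFT + 1)[:-1]                     # periodic Hann window
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(N_FRAMES)[:, None]
    power = np.abs(np.fft.rfft(audio[idx] * hann, axis=-1)) ** 2   # [3000, 201]
    log_spec = np.log10(np.maximum(power @ mel_filters, 1e-10))    # [3000, 128]
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)          # clamp to max - 8
    return ((log_spec + 4.0) / 4.0).T[None].astype(np.float32)

def greedy_decode(run_graph, features) -> list:
    """run_graph(features, ids[1, 128]) -> logits[1, 128, vocab] (hypothetical)."""
    tokens = list(PROMPT)
    while len(tokens) < WINDOW:
        ids = np.full((1, WINDOW), PAD_ID, dtype=np.int32)
        ids[0, :len(tokens)] = tokens          # real tokens first, padding after
        logits = run_graph(features, ids)
        k = len(tokens) - 1                    # index of the real last token
        next_id = int(np.argmax(logits[0, k]))
        if next_id == EOT:
            break
        tokens.append(next_id)
    return tokens[len(PROMPT):]                # generated ids only, for step 4
```

With the bundled files, features would come from log_mel(audio, np.load("mel_filters_128.npy")). Because the prompt occupies 4 of the 128 slots, this loop can hold at most 124 generated tokens per segment.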

Runs on macOS + iOS via coreai.runtime / the Swift CoreAI framework. See the CoreAITranscribe sample app.

Export environment

  • macOS 27.0 beta · coreai-core 1.0.0b1 · coreai-torch 0.4.0 · transformers 4.57
  • recipe: Apple models/whisper/export.py + a fixed 128-token decoder trace

License

Whisper is Apache-2.0 (OpenAI). This bundle is a format conversion and inherits that license.


Maintained alongside coreai-model-zoo (official/).
