Whisper large-v3-turbo for Apple Core AI (.aimodel)
OpenAI Whisper large-v3-turbo converted to Apple's Core AI format (.aimodel),
introduced with macOS 27 at WWDC 2026. Runs fully on-device on Apple Silicon
(GPU-accelerated, fp16), several times faster than real time.
Converted for and used by Sophist Lite, a local-only dictation app for macOS.
Requirements
- macOS 27.0 or later (Core AI framework), Apple Silicon
- Xcode 27+ to build an app against `CoreAI.framework`
Contents
| File | Graph | Size |
|---|---|---|
| `whisper-large-v3-turbo-encoder_float16.aimodel` | audio encoder | ~1.6 GB |
| `whisper-large-v3-turbo-decoder_float16.aimodel` | text decoder | ~0.34 GB |
| `export_split.py` | conversion recipe (reproducibility) | n/a |
Architecture: encoder/decoder split
The model is split into two graphs instead of one monolith: the 32-layer audio encoder runs once per 30-second chunk, and only the 4-layer decoder runs per token. In our measurements this is ~100× faster than the naive single-graph export, and ~9× real time on a base M2 for typical dictation.
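As a rough illustration of the "once per 30-second chunk" part, the sketch below cuts 16 kHz mono samples into fixed 30 s windows and zero-pads the last one. The function name `splitIntoChunks` and the plain back-to-back windowing are assumptions made for this example; a real app may prefer to cut at pauses or to overlap windows.

```swift
// Whisper consumes fixed 30-second windows of 16 kHz mono audio.
let sampleRate = 16_000
let chunkSamples = 30 * sampleRate  // 480_000 samples per window

/// Splits `samples` into back-to-back windows of exactly `chunkSamples`
/// values and zero-pads the final window. Empty input yields one silent window.
/// (Hypothetical helper, not part of this repo.)
func splitIntoChunks(_ samples: [Float]) -> [[Float]] {
    var chunks: [[Float]] = []
    var start = 0
    repeat {
        let end = min(start + chunkSamples, samples.count)
        var chunk = Array(samples[start..<end])
        let missing = chunkSamples - chunk.count
        chunk.append(contentsOf: repeatElement(Float(0), count: missing))
        chunks.append(chunk)
        start += chunkSamples
    } while start < samples.count
    return chunks
}
```

Each window would then go through feature extraction and one encoder run, followed by as many decoder steps as the transcript needs.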
Encoder, function main:
  input_features : [1, 128, 3000] float16 (Whisper log-mel, 30 s @ 16 kHz)
  → encoder_hidden_states : [1, 1500, 1280] float16
Decoder, function main (static shape, full forward per step):
  encoder_hidden_states : [1, 1500, 1280] float16 (reuse the encoder output every step)
  decoder_input_ids : [1, 256] int32 (prompt + generated tokens, EOT-padded)
  → logits : [1, 256, 51866] float16 (read the row at the real last position)
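Reading "the row at the real last position" comes down to index arithmetic on the logits shape. The sketch below assumes the logits have already been copied into a flat, row-major array; how to obtain such a copy from Core AI is not covered here, and `lastRowRange` is a hypothetical helper name.

```swift
let decoderPositions = 256    // static sequence length of the decoder graph
let vocabularySize = 51_866   // logits per position

/// Index range of the logits row that belongs to the last real (non-padding)
/// token, assuming a flat row-major copy of the [1, 256, 51866] output.
func lastRowRange(realTokenCount: Int) -> Range<Int> {
    precondition((1...decoderPositions).contains(realTokenCount))
    let start = (realTokenCount - 1) * vocabularySize
    return start..<(start + vocabularySize)
}
```

With the 4-token prompt and no generated tokens yet, `realTokenCount` is 4, so the row starts at offset `3 * 51866`.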
Decoding notes (what your app must implement)
The .aimodel files contain only the neural graphs. A minimal greedy decoder:
- Features: HF-`WhisperFeatureExtractor`-compatible log-mel (128 bins, n_fft 400, hop 160), audio zero-padded/trimmed to exactly 30 s.
- Prompt: `[50258 (<|startoftranscript|>), <lang>, 50360 (<|transcribe|>), 50364 (<|notimestamps|>)]`. Language tokens start at `50259 (<|en|>)`; for auto-detect, run one step with `[50258]` and take the argmax over the language-token range.
- Each step: pad the token sequence to length 256 with `50257` (EOT), run the decoder, read logits at the real last position, argmax with suppression, append, repeat until EOT.
- Suppression: the `suppress_tokens` list from the model's `generation_config.json`, plus all timestamp tokens (ids ≥ 50365) in no-timestamps mode, plus `begin_suppress_tokens {220, 50257}` on the first generated step only. A repetition guard (e.g. stop on a 4× repeated token cycle) is recommended, because plain greedy can loop on trailing silence.
- Tokenizer: standard Whisper byte-level BPE (e.g. `openai/whisper-large-v3-turbo` via `swift-transformers/tokenizers`).
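The prompt, padding and suppression steps listed above could look roughly like this in plain Swift. It is a sketch, not the Sophist Lite code: the helper names are made up for this example, the logits row is assumed to be available as `[Float]`, and the contents of `suppress_tokens` are left to the caller rather than reproduced here.

```swift
// Token ids quoted in the notes above.
let eotToken: Int32 = 50257          // <|endoftext|>, doubles as padding
let startOfTranscript: Int32 = 50258
let transcribeToken: Int32 = 50360
let noTimestampsToken: Int32 = 50364
let firstTimestampId = 50365         // ids >= this are timestamp tokens
let decoderLength = 256              // static decoder_input_ids length

/// Prompt for a known language token, e.g. 50259 for <|en|>.
func makePrompt(languageToken: Int32) -> [Int32] {
    [startOfTranscript, languageToken, transcribeToken, noTimestampsToken]
}

/// Pads the real tokens to the static decoder length with EOT.
func padToDecoderLength(_ tokens: [Int32]) -> [Int32] {
    precondition(!tokens.isEmpty && tokens.count <= decoderLength)
    return tokens + Array(repeating: eotToken, count: decoderLength - tokens.count)
}

/// Greedy pick over one logits row (the row at the last real position).
/// `suppress` is meant to hold the ids from generation_config.json.
func pickToken(row: [Float], suppress: Set<Int>, isFirstGeneratedStep: Bool) -> Int32 {
    let beginSuppress: Set<Int> = [220, 50257]
    var bestId = Int(eotToken)       // fall back to EOT if everything is masked
    var bestValue = -Float.infinity
    for (id, value) in row.enumerated() {
        if id >= firstTimestampId { continue }                  // no-timestamps mode
        if suppress.contains(id) { continue }
        if isFirstGeneratedStep && beginSuppress.contains(id) { continue }
        if value > bestValue {
            bestValue = value
            bestId = id
        }
    }
    return Int32(bestId)
}
```

Skipping suppressed ids during the scan has the same effect as setting their logits to negative infinity before the argmax, without copying the row.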
Reference implementation: the CoreAIWhisperEngine/CoreAIWhisperClient sources in
Sophist Lite.
Swift sketch:
import CoreAI

// Load both graphs and look up their entry points.
let encoder = try await AIModel(contentsOf: encoderURL)
let decoder = try await AIModel(contentsOf: decoderURL)
let encFn = try encoder.loadFunction(named: "main")!
let decFn = try decoder.loadFunction(named: "main")!

// Once per 30 s chunk: log-mel features in, hidden states out.
let enc = try await encFn.run(inputs: ["input_features": melFeatures])
let hidden = enc["encoder_hidden_states"]!.ndArray

// Per token: rerun the full decoder on the EOT-padded sequence.
let out = try await decFn.run(inputs: [
    "encoder_hidden_states": hidden,
    "decoder_input_ids": paddedTokenIds, // [1, 256] int32
])
let logits = out["logits"]!.ndArray // [1, 256, 51866]
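The per-token call still needs a loop around it. The sketch below covers only the control flow: append, stop on EOT, stop when the 256-token window is full, and stop on a repeated cycle. The `nextToken` closure is a hypothetical stand-in for "pad, run the decoder, read the last row, pick a token", and it is synchronous here for brevity, whereas the Core AI calls above are `async`. The cycle length limit of 8 is an arbitrary choice for this example.

```swift
/// True when the tail of `tokens` is some block of 1...maxPeriod tokens
/// repeated `repeats` times back to back.
func endsInRepeatedCycle(_ tokens: [Int32], repeats: Int = 4, maxPeriod: Int = 8) -> Bool {
    guard repeats >= 2, maxPeriod >= 1 else { return false }
    for period in 1...maxPeriod {
        let span = period * repeats
        if span > tokens.count { break }
        let tail = Array(tokens.suffix(span))
        // The tail is periodic if every token equals the one `period` places earlier.
        if (period..<span).allSatisfy({ tail[$0] == tail[$0 - period] }) {
            return true
        }
    }
    return false
}

/// Greedy loop. `nextToken` receives the real (unpadded) tokens so far and
/// returns the chosen next token id. Returns only the generated tokens.
func greedyDecode(prompt: [Int32],
                  eot: Int32 = 50257,
                  contextLength: Int = 256,
                  nextToken: ([Int32]) throws -> Int32) rethrows -> [Int32] {
    var tokens = prompt
    var generated: [Int32] = []
    while tokens.count < contextLength {
        let token = try nextToken(tokens)
        if token == eot { break }
        tokens.append(token)
        generated.append(token)
        // The guard looks at generated tokens only, never at the prompt.
        if endsInRepeatedCycle(generated) { break }
    }
    return generated
}
```

A guard like this also fires on speech that really does repeat a word four times, so an app may want to trim the repeated tail or loosen the threshold rather than treat every hit as an error.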
Why not Apple's stock recipe, and why no KV cache?
As of the macOS 27.0 beta toolchain:
- Apple's stock `models/whisper/export.py` (apple/coreai-models) exports a static `decoder_input_ids` shape of `[1, 1]`, which cannot autoregress; re-exporting with dynamic shapes crashes (SIGSEGV in MPSGraph) mid-decode. The fixed `[1, 256]` shape used here is stable.
- A KV-cached single-token decoder (cross-K/V precomputed, static self-cache) was implemented and was bit-exact vs. Hugging Face in eager PyTorch (fp32 and fp16), but the converted graphs produced systematically wrong logits across four different formulations, which we attribute to a Core AI beta miscompilation for that graph class. The split in this repo uses the standard full-decoder graph, which converts correctly and matches the HF fp16 greedy reference word-for-word in our tests.
These findings may improve with later Core AI releases; the .aimodel format itself is
from a beta toolchain and may need re-export for future macOS 27 builds.
Conversion provenance
- Source weights: openai/whisper-large-v3-turbo (Apache-2.0)
- Toolchain: apple/coreai-models, `coreai-torch==0.4.0`, `coreai-core==1.0.0b1`, `transformers==4.57.3`
- Export: `export_split.py` (this repo), dtype float16, macOS 27.0 / Xcode 27.0
- Verification: teacher-forcing parity vs. HF (PSNR > 130 dB in fp32) and word-for-word match with the HF fp16 greedy reference on real audio
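For readers who want to repeat the parity check on their own outputs, a PSNR between two flattened tensors can be computed as 10 · log10(peak² / MSE). The sketch below takes the peak to be the largest absolute value in the reference, which is an assumption: the exact convention behind the figure quoted above is not stated here.

```swift
import Foundation

/// PSNR in dB between a reference tensor and a test tensor, both flattened.
/// Assumption: peak = largest absolute value found in the reference.
func psnrDB(reference: [Float], test: [Float]) -> Double {
    precondition(reference.count == test.count && !reference.isEmpty)
    var sumSquaredError = 0.0
    var peak = 0.0
    for (r, t) in zip(reference, test) {
        let diff = Double(r) - Double(t)
        sumSquaredError += diff * diff
        peak = max(peak, abs(Double(r)))
    }
    let mse = sumSquaredError / Double(reference.count)
    if mse == 0 { return .infinity }   // identical tensors
    return 10 * log10(peak * peak / mse)
}
```

Under this convention, 130 dB corresponds to a root-mean-square error of roughly 3 × 10⁻⁷ of the peak value, which is on the order of fp32 rounding error.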
License
Apache-2.0, same as the source model. Whisper is by OpenAI (paper); this repo only redistributes a converted, split variant for Apple Core AI.