Whisper large-v3-turbo for Apple Core AI (.aimodel)

OpenAI Whisper large-v3-turbo converted to Apple's Core AI format (.aimodel), introduced with macOS 27 at WWDC 2026. Runs fully on-device on Apple Silicon (GPU-accelerated, fp16), several times faster than real time.

Converted for and used by Sophist Lite — a local-only dictation app for macOS.

Requirements

macOS 27.0 or later (Core AI framework), Apple Silicon
Xcode 27+ to build an app against CoreAI.framework

Files

File	Graph	Size
`whisper-large-v3-turbo-encoder_float16.aimodel`	audio encoder	~1.6 GB
`whisper-large-v3-turbo-decoder_float16.aimodel`	text decoder	~0.34 GB
`export_split.py`	conversion recipe (reproducibility)	—

Architecture: encoder/decoder split

The model is exported as two separate graphs instead of one monolith: the 32-layer audio encoder runs once per 30-second chunk, and only the 4-layer decoder runs per token. In our measurements this is ~100× faster than the naive single-graph export and reaches ~9× real time on a base M2 for typical dictation.

Encoder — function main:

input_features        : [1, 128, 3000]  float16   (Whisper log-mel, 30 s @ 16 kHz)
→ encoder_hidden_states : [1, 1500, 1280] float16
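
The encoder shape follows from simple arithmetic: 30 s at 16 kHz is 480,000 samples, and with the hop of 160 samples from the decoding notes that gives 3,000 mel frames. A minimal sketch of the pad/trim step that precedes feature extraction is shown here; `padOrTrim` and the constant names are hypothetical helpers for illustration, not part of Core AI or of this repo.

```swift
// Constants taken from the shapes in this README: 30 s at 16 kHz, hop 160.
let sampleRate = 16_000
let chunkSeconds = 30
let hopLength = 160
let chunkSamples = sampleRate * chunkSeconds   // 480_000 samples per chunk
let chunkFrames = chunkSamples / hopLength     // 3_000 mel frames per chunk

/// Zero-pads or trims mono PCM samples to exactly one 30-second chunk.
/// Hypothetical helper name; the log-mel computation itself is not shown.
func padOrTrim(_ samples: [Float]) -> [Float] {
    if samples.count >= chunkSamples {
        return Array(samples.prefix(chunkSamples))
    }
    return samples + [Float](repeating: 0, count: chunkSamples - samples.count)
}
```

Audio longer than one chunk would need to be split into several 30-second chunks by the caller; this sketch only handles a single chunk.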

Decoder — function main (static shape, full forward per step):

encoder_hidden_states : [1, 1500, 1280] float16   (reuse the encoder output every step)
decoder_input_ids     : [1, 256]        int32     (prompt + generated tokens, EOT-padded)
→ logits                : [1, 256, 51866] float16   (read the row at the real last position)
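
Because the decoder shape is static, the app has to pad the ids and then pick the right row out of the output. The sketch below assumes the logits have already been copied into a flat, row-major `[Float]` (the model emits float16, so a conversion step is assumed), and all function names are hypothetical. Reading the row at the last real position should be safe as long as the decoder's self-attention is causal, which is our understanding of the standard Whisper decoder.

```swift
let eotToken: Int32 = 50257   // <|endoftext|>, used as padding
let maxTokens = 256           // static decoder sequence length
let vocabSize = 51866         // last logits dimension

/// Pads prompt + generated tokens with EOT up to the static length.
/// Returns nil when the sequence is empty or no longer fits.
func padTokens(_ tokens: [Int32]) -> [Int32]? {
    guard !tokens.isEmpty, tokens.count <= maxTokens else { return nil }
    return tokens + [Int32](repeating: eotToken, count: maxTokens - tokens.count)
}

/// Returns the logits row for the last real token.
/// ASSUMPTION: `flatLogits` is the [1, 256, 51866] output in row-major order.
func lastRealRow(of flatLogits: [Float], tokenCount: Int) -> ArraySlice<Float> {
    precondition(flatLogits.count == maxTokens * vocabSize)
    precondition((1...maxTokens).contains(tokenCount))
    let start = (tokenCount - 1) * vocabSize
    return flatLogits[start ..< start + vocabSize]
}
```

With a 4-token prompt, `tokenCount` is 4 and the row of interest starts at offset 3 × 51866.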

Decoding notes (what your app must implement)

The .aimodel files contain only the neural graphs; everything else is up to the app. A minimal greedy decoder needs the following:

Features: HF-WhisperFeatureExtractor-compatible log-mel (128 bins, n_fft 400, hop 160), audio zero-padded/trimmed to exactly 30 s.
Prompt: [50258 (<|startoftranscript|>), <lang>, 50360 (<|transcribe|>), 50364 (<|notimestamps|>)]. Language tokens start at 50259 (<|en|>); for auto-detect, run one step with [50258] and take the argmax over the language-token range.
Each step: pad the token sequence to length 256 with 50257 (EOT), run the decoder, read logits at the real last position, argmax with suppression, append, repeat until EOT.
Suppression: the suppress_tokens list from the model's generation_config.json, plus all timestamp tokens (ids ≥ 50365) in no-timestamps mode, plus begin_suppress_tokens {220, 50257} on the first generated step only. A repetition guard (e.g. stop on a 4× repeated token cycle) is recommended — plain greedy can loop on trailing silence.
Tokenizer: standard Whisper byte-level BPE (e.g. openai/whisper-large-v3-turbo via swift-transformers / tokenizers).
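
The prompt, language detection, and suppression rules above can be sketched in plain Swift as follows. The function names are hypothetical, the logits row is assumed to be a zero-based `[Float]` of vocabulary length, and the `suppress` set must be filled from the model's generation_config.json (it is not reproduced here). The size of the language-token range (100 ids, ending just before 50359) is an assumption that should be checked against the tokenizer.

```swift
// Token ids as listed in the notes above.
let sot: Int32 = 50258             // <|startoftranscript|>
let firstLanguage = 50259          // <|en|>
let transcribe: Int32 = 50360      // <|transcribe|>
let noTimestamps: Int32 = 50364    // <|notimestamps|>
let firstTimestamp = 50365         // ids >= this are timestamp tokens
let beginSuppress: Set<Int> = [220, 50257]
// ASSUMPTION: 100 language tokens, i.e. ids 50259..<50359.
let languageCount = 100

func buildPrompt(language: Int32) -> [Int32] {
    [sot, language, transcribe, noTimestamps]
}

/// Argmax restricted to the language-token range of one logits row
/// (the row obtained by running a single step with [50258]).
func detectLanguage(row: [Float]) -> Int32 {
    precondition(row.count >= firstLanguage + languageCount)
    let range = firstLanguage ..< firstLanguage + languageCount
    return Int32(range.max { row[$0] < row[$1] }!)
}

/// Greedy pick in no-timestamps mode: skips the suppress list, all
/// timestamp ids, and the begin-suppress ids on the first generated step.
/// Returns nil if every candidate is suppressed.
func pickToken(row: [Float], suppress: Set<Int>, isFirstStep: Bool) -> Int32? {
    var best: (id: Int, logit: Float)? = nil
    for id in 0 ..< min(row.count, firstTimestamp) where !suppress.contains(id) {
        if isFirstStep && beginSuppress.contains(id) { continue }
        if best == nil || row[id] > best!.logit { best = (id, row[id]) }
    }
    return best.map { Int32($0.id) }
}
```

EOT (50257) stays selectable after the first step, which is what lets the loop terminate.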

Reference implementation: the CoreAIWhisperEngine/CoreAIWhisperClient sources in Sophist Lite.
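
The repetition guard mentioned in the suppression note can be sketched independently of Core AI. This is one possible reading of "4× repeated token cycle": the tail of the generated sequence is a short pattern repeated four times back to back. The function name and the `maxPeriod` limit of 8 are assumptions for illustration, not taken from the reference implementation.

```swift
/// Returns true when the tail of `tokens` is one cycle of length
/// 1...maxPeriod repeated `repeats` times in a row.
func hasRepeatedCycle(_ tokens: [Int32], repeats: Int = 4, maxPeriod: Int = 8) -> Bool {
    guard repeats >= 2, maxPeriod >= 1 else { return false }
    for period in 1 ... maxPeriod {
        let span = period * repeats
        if span > tokens.count { break }   // longer periods cannot fit either
        let tail = tokens.suffix(span)
        // Cyclic means every token equals the one a full period before it.
        let first = tail.startIndex + period
        if (first ..< tail.endIndex).allSatisfy({ tail[$0] == tail[$0 - period] }) {
            return true
        }
    }
    return false
}
```

The check is a heuristic: legitimate speech that repeats a word four times would also trigger it, so an app may prefer to truncate the repeated tail rather than discard the whole chunk.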

Swift sketch:

import CoreAI

let encoder = try await AIModel(contentsOf: encoderURL)
let decoder = try await AIModel(contentsOf: decoderURL)
let encFn = try encoder.loadFunction(named: "main")!
let decFn = try decoder.loadFunction(named: "main")!

let enc = try await encFn.run(inputs: ["input_features": melFeatures])
let hidden = enc["encoder_hidden_states"]!.ndArray
// per token:
let out = try await decFn.run(inputs: [
    "encoder_hidden_states": hidden,
    "decoder_input_ids": paddedTokenIds,   // [1, 256] int32
])
let logits = out["logits"]!.ndArray        // [1, 256, 51866]
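
To show how the per-token call fits into a loop without guessing at further Core AI API, the sketch below takes the decoder as a closure. `decoderRow` is a hypothetical stand-in that would wrap the `decFn.run` call above and return the logits row at the given position; `pick` is the suppressed argmax and `shouldStop` an optional repetition guard. The loop is written synchronously for brevity; in an app the closure would presumably be `async`.

```swift
let eot: Int32 = 50257
let maxTokens = 256

/// Greedy loop over a static-shape decoder. Returns only the generated
/// tokens (prompt excluded). All closure parameters are hypothetical hooks.
func greedyDecode(
    prompt: [Int32],
    decoderRow: (_ paddedIds: [Int32], _ position: Int) throws -> [Float],
    pick: (_ row: [Float], _ isFirstStep: Bool) -> Int32?,
    shouldStop: (_ generated: [Int32]) -> Bool = { _ in false }
) rethrows -> [Int32] {
    precondition(!prompt.isEmpty)
    var tokens = prompt
    var generated: [Int32] = []
    while tokens.count < maxTokens {
        // Pad to the static length with EOT, then read the last real row.
        let padded = tokens + [Int32](repeating: eot, count: maxTokens - tokens.count)
        let row = try decoderRow(padded, tokens.count - 1)
        guard let next = pick(row, generated.isEmpty), next != eot else { break }
        tokens.append(next)
        generated.append(next)
        if shouldStop(generated) { break }
    }
    return generated
}
```

The loop also stops once all 256 positions are used, since the static graph cannot accept a longer sequence.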

Why not Apple's stock recipe, and why no KV cache?

As of the macOS 27.0 beta toolchain:

Apple's stock models/whisper/export.py (apple/coreai-models) exports a static decoder_input_ids shape of [1, 1], which cannot autoregress; re-exporting with dynamic shapes crashes (SIGSEGV in MPSGraph) mid-decode. The fixed [1, 256] shape used here is stable.
A KV-cached single-token decoder (cross-K/V precomputed, static self-cache) was implemented and was bit-exact vs. Hugging Face in eager PyTorch (fp32 and fp16). The converted graphs, however, produced systematically wrong logits across four different formulations, which points to a Core AI beta miscompilation for that graph class. The split in this repo uses the standard full-decoder graph, which converts correctly and matches the HF fp16 greedy reference word-for-word in our tests.

These limitations may be lifted in later Core AI releases. The .aimodel files themselves were produced with a beta toolchain and may need to be re-exported for future macOS 27 builds.

Conversion provenance

Source weights: openai/whisper-large-v3-turbo (Apache-2.0)
Toolchain: apple/coreai-models, coreai-torch==0.4.0, coreai-core==1.0.0b1, transformers==4.57.3
Export: export_split.py (this repo), dtype float16, macOS 27.0 / Xcode 27.0
Verification: teacher-forcing parity vs. HF (PSNR > 130 dB in fp32) and word-for-word match with the HF fp16 greedy reference on real audio
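
For readers who want to run a similar parity check, PSNR between reference and converted logits can be computed as below. This is a sketch under an assumption: the peak is taken as the largest absolute reference value, which may differ from the definition used for the figure above. The function name is hypothetical.

```swift
import Foundation

/// PSNR in dB between reference logits and logits from the converted graph.
/// ASSUMPTION: peak signal = largest absolute value in `reference`.
func psnr(reference: [Float], test: [Float]) -> Double {
    precondition(reference.count == test.count && !reference.isEmpty)
    var squaredError = 0.0
    var peak = 0.0
    for (r, t) in zip(reference, test) {
        let diff = Double(r) - Double(t)
        squaredError += diff * diff
        peak = max(peak, abs(Double(r)))
    }
    let mse = squaredError / Double(reference.count)
    if mse == 0 { return .infinity }   // identical outputs
    return 10 * log10(peak * peak / mse)
}
```

Under this definition, 130 dB corresponds to an RMS error of roughly 3 × 10⁻⁷ of the peak value.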

License

Apache-2.0, same as the source model. Whisper is by OpenAI (paper: arXiv 2212.04356); this repo only redistributes a converted, split variant for Apple Core AI.

Paper: Robust Speech Recognition via Large-Scale Weak Supervision (arXiv 2212.04356, published Dec 6, 2022)
