Whisper large-v3-turbo for Apple Core AI (.aimodel)

OpenAI Whisper large-v3-turbo, converted to Apple's Core AI format (.aimodel), which was introduced with macOS 27 at WWDC 2026. The model runs fully on-device on Apple Silicon (GPU-accelerated, fp16), several times faster than real time.

Converted for and used by Sophist Lite, a local-only dictation app for macOS.

Requirements

  • macOS 27.0 or later (Core AI framework), Apple Silicon
  • Xcode 27+ to build an app against CoreAI.framework

Contents

File                                             Graph                                 Size
whisper-large-v3-turbo-encoder_float16.aimodel   audio encoder                         ~1.6 GB
whisper-large-v3-turbo-decoder_float16.aimodel   text decoder                          ~0.34 GB
export_split.py                                  conversion recipe (reproducibility)   -

Architecture: encoder/decoder split

The model is exported as two separate graphs instead of one monolith, so the 32-layer audio encoder runs once per 30-second chunk and only the 4-layer decoder runs per token. In our measurements this is ~100× faster than the naive single-graph export, and ~9× real time on a base M2 for typical dictation.

Encoder, function main:

input_features        : [1, 128, 3000]  float16   (Whisper log-mel, 30 s @ 16 kHz)
→ encoder_hidden_states : [1, 1500, 1280] float16
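The encoder shapes follow from simple arithmetic on the chunk length, which the sketch below spells out together with the zero-pad/trim step. The names `WhisperChunk` and `padOrTrim` are hypothetical and not part of Core AI or this repo; the factor of 2 between mel frames and encoder frames reflects the stride-2 convolution in Whisper's encoder stem.

```swift
/// Chunk geometry implied by the shapes above (hypothetical helper names).
enum WhisperChunk {
    static let sampleRate = 16_000                   // Hz
    static let seconds = 30                          // fixed chunk length
    static let hopLength = 160                       // samples between mel frames
    static let sampleCount = sampleRate * seconds    // 480_000 samples
    static let melFrames = sampleCount / hopLength   // 3_000 frames -> [1, 128, 3000]
    static let encoderFrames = melFrames / 2         // 1_500 frames -> [1, 1500, 1280]
}

/// Zero-pads or truncates mono 16 kHz audio to exactly one 30-second chunk.
func padOrTrim(_ samples: [Float], to length: Int = WhisperChunk.sampleCount) -> [Float] {
    if samples.count >= length {
        return Array(samples[0..<length])
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}
```

Audio longer than 30 s would need to be split into several chunks before this step; how to split is up to the app.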

Decoder, function main (static shape, full forward per step):

encoder_hidden_states : [1, 1500, 1280] float16   (reuse the encoder output every step)
decoder_input_ids     : [1, 256]        int32     (prompt + generated tokens, EOT-padded)
→ logits                : [1, 256, 51866] float16   (read the row at the real last position)
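Reading "the row at the real last position" amounts to an offset computation once the logits are available as a flat buffer. The sketch below assumes the [1, 256, 51866] output has already been copied into a row-major `[Float]`; that layout and the helper name `lastPositionLogits` are assumptions, and the conversion from the Core AI array type is not shown.

```swift
/// Returns the vocabulary-sized logits row for the last real (non-padding) token.
/// Assumes `flatLogits` is the decoder output in row-major order: [position][vocab].
func lastPositionLogits(_ flatLogits: [Float],
                        tokenCount: Int,
                        maxTokens: Int = 256,
                        vocabSize: Int = 51_866) -> ArraySlice<Float> {
    precondition(flatLogits.count == maxTokens * vocabSize, "unexpected logits size")
    precondition((1...maxTokens).contains(tokenCount), "tokenCount out of range")
    // With n real tokens, the next-token distribution sits at position n - 1.
    let start = (tokenCount - 1) * vocabSize
    return flatLogits[start..<(start + vocabSize)]
}
```

For a 4-token prompt, `tokenCount` is 4 and the row at position 3 is read; the EOT padding after it is ignored.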

Decoding notes (what your app must implement)

The .aimodel files contain only the neural graphs; feature extraction, tokenization, and the decoding loop are up to your app. A minimal greedy decoder needs:

  1. Features: HF-WhisperFeatureExtractor-compatible log-mel (128 bins, n_fft 400, hop 160), audio zero-padded/trimmed to exactly 30 s.
  2. Prompt: [50258 (<|startoftranscript|>), <lang>, 50360 (<|transcribe|>), 50364 (<|notimestamps|>)]. Language tokens start at 50259 (<|en|>); for auto-detect, run one step with [50258] and take the argmax over the language-token range.
  3. Each step: pad the token sequence to length 256 with 50257 (EOT), run the decoder, read logits at the real last position, argmax with suppression, append, repeat until EOT.
  4. Suppression: the suppress_tokens list from the model's generation_config.json, plus all timestamp tokens (ids ≥ 50365) in no-timestamps mode, plus begin_suppress_tokens {220, 50257} on the first generated step only. A repetition guard (e.g. stop on a 4× repeated token cycle) is recommended, because plain greedy decoding can loop on trailing silence.
  5. Tokenizer: standard Whisper byte-level BPE (e.g. openai/whisper-large-v3-turbo via swift-transformers / tokenizers).
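Steps 2 to 4 above can be sketched as a few pure functions that do not touch Core AI. All names here (`WhisperToken`, `makePrompt`, `padTokens`, `suppressedArgmax`, `detectLanguage`) are hypothetical. The `suppress` set is expected to be loaded from generation_config.json by the caller, and the default of 100 language tokens is an assumption about the large-v3 vocabulary that should be checked against the tokenizer.

```swift
/// Special token ids quoted in the notes above.
enum WhisperToken {
    static let endOfText: Int32 = 50257
    static let startOfTranscript: Int32 = 50258
    static let firstLanguage: Int32 = 50259     // <|en|>
    static let transcribe: Int32 = 50360
    static let noTimestamps: Int32 = 50364
    static let firstTimestamp: Int32 = 50365
}

/// Step 2: the four-token transcription prompt for a given language token.
func makePrompt(languageToken: Int32) -> [Int32] {
    [WhisperToken.startOfTranscript, languageToken,
     WhisperToken.transcribe, WhisperToken.noTimestamps]
}

/// Step 2 (auto-detect): argmax restricted to the language-token range.
/// `languageCount` is an assumption; verify it against your tokenizer.
func detectLanguage(_ logits: [Float], languageCount: Int = 100) -> Int32 {
    let first = Int(WhisperToken.firstLanguage)
    precondition(logits.count > first, "logits row too short")
    let range = first..<min(first + languageCount, logits.count)
    return Int32(range.max(by: { logits[$0] < logits[$1] })!)
}

/// Step 3: pads the running sequence to the static decoder length with EOT.
func padTokens(_ tokens: [Int32], to length: Int = 256) -> [Int32] {
    precondition(!tokens.isEmpty && tokens.count <= length, "sequence must fit the static shape")
    return tokens + [Int32](repeating: WhisperToken.endOfText, count: length - tokens.count)
}

/// Steps 3 and 4: argmax over one logits row, skipping suppressed ids.
/// Falls back to EOT if every candidate is suppressed.
func suppressedArgmax(_ logits: [Float],
                      suppress: Set<Int32>,
                      isFirstGeneratedStep: Bool) -> Int32 {
    // No-timestamps mode: ids at or above the first timestamp are never considered.
    let limit = min(logits.count, Int(WhisperToken.firstTimestamp))
    var best: (id: Int32, value: Float)? = nil
    for index in 0..<limit {
        let id = Int32(index)
        if suppress.contains(id) { continue }
        // begin_suppress_tokens apply to the first generated token only.
        if isFirstGeneratedStep && (id == 220 || id == WhisperToken.endOfText) { continue }
        if best == nil || logits[index] > best!.value {
            best = (id, logits[index])
        }
    }
    return best?.id ?? WhisperToken.endOfText
}
```

Masking by skipping ids, rather than writing negative infinity into the logits, leaves the row untouched in case the app also wants token probabilities.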

Reference implementation: the CoreAIWhisperEngine/CoreAIWhisperClient sources in Sophist Lite.
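The repetition guard recommended in step 4 could be implemented as a periodicity check on the tail of the generated sequence. This is only one possible heuristic, not the guard used in Sophist Lite; the function name and the `maxCycleLength` bound are hypothetical.

```swift
/// Returns true when `tokens` ends in some cycle of length 1...maxCycleLength
/// repeated `repeats` times back to back (e.g. "a b a b a b a b" for repeats = 4).
func endsInRepeatedCycle(_ tokens: [Int32],
                         repeats: Int = 4,
                         maxCycleLength: Int = 8) -> Bool {
    precondition(repeats >= 2 && maxCycleLength >= 1)
    for cycleLength in 1...maxCycleLength {
        let span = cycleLength * repeats
        if tokens.count < span { break }   // longer cycles cannot fit either
        let tail = Array(tokens.suffix(span))
        // The tail is `repeats` copies of one cycle iff it is periodic with this period.
        let periodic = (cycleLength..<span).allSatisfy { tail[$0] == tail[$0 - cycleLength] }
        if periodic { return true }
    }
    return false
}
```

Calling this on the generated tokens (excluding the prompt) after each step and stopping when it returns true bounds the damage from a loop. Genuinely repetitive speech can also trigger it, so an app may prefer to trim the repeated tail rather than discard the whole result.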

Swift sketch:

import CoreAI

// Load both graphs once and reuse them for every chunk.
let encoder = try await AIModel(contentsOf: encoderURL)
let decoder = try await AIModel(contentsOf: decoderURL)
let encFn = try encoder.loadFunction(named: "main")!
let decFn = try decoder.loadFunction(named: "main")!

// Encoder: once per 30-second chunk.
let enc = try await encFn.run(inputs: ["input_features": melFeatures])   // [1, 128, 3000] float16
let hidden = enc["encoder_hidden_states"]!.ndArray                       // [1, 1500, 1280] float16

// Decoder: once per generated token, reusing `hidden`.
let out = try await decFn.run(inputs: [
    "encoder_hidden_states": hidden,
    "decoder_input_ids": paddedTokenIds,   // [1, 256] int32
])
let logits = out["logits"]!.ndArray        // [1, 256, 51866]
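The per-token part of the sketch above sits inside a loop. A minimal version of that loop is shown here with the decoder call abstracted behind a closure, so the control flow can be read and tested without Core AI. `greedyDecode`, `step`, and `pick` are hypothetical names. The closure is synchronous only to keep the sketch self-contained; wrapping the `try await decFn.run(...)` call would require marking both the closure and the function `async`.

```swift
/// Greedy decoding over a static-shape decoder.
/// - step: runs the decoder on the padded ids and returns the logits row at `position`.
/// - pick: chooses the next token from that row (e.g. argmax with suppression).
/// Returns only the generated tokens, without the prompt or the final EOT.
func greedyDecode(prompt: [Int32],
                  maxLength: Int = 256,
                  endOfText: Int32 = 50257,
                  step: (_ paddedIds: [Int32], _ position: Int) throws -> [Float],
                  pick: (_ logits: [Float], _ isFirstGeneratedStep: Bool) -> Int32) throws -> [Int32] {
    precondition(!prompt.isEmpty && prompt.count <= maxLength, "prompt must fit the static shape")
    var tokens = prompt
    var generated: [Int32] = []
    while tokens.count < maxLength {
        // Pad to the fixed decoder length with EOT on every step.
        let padded = tokens + [Int32](repeating: endOfText, count: maxLength - tokens.count)
        // The next-token logits live at the last real position.
        let logits = try step(padded, tokens.count - 1)
        let next = pick(logits, generated.isEmpty)
        if next == endOfText { break }
        tokens.append(next)
        generated.append(next)
    }
    return generated
}
```

Because the decoder graph has a fixed length of 256, prompt plus output can never exceed 256 tokens per chunk; the loop stops there even if no EOT was produced.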

Why not Apple's stock recipe, and why no KV cache?

As of the macOS 27.0 beta toolchain:

  • Apple's stock models/whisper/export.py (apple/coreai-models) exports a static decoder_input_ids shape of [1, 1], which cannot autoregress; re-exporting with dynamic shapes crashes (SIGSEGV in MPSGraph) mid-decode. The fixed [1, 256] shape used here is stable.
  • A KV-cached single-token decoder (cross-K/V precomputed, static self-cache) was implemented and was bit-exact vs. Hugging Face in eager PyTorch (fp32 and fp16), but the converted graphs produced systematically wrong logits across four different formulations, which points to a Core AI beta miscompilation for that graph class. The split in this repo uses the standard full-decoder graph, which converts correctly and matches the HF fp16 greedy reference word-for-word in our tests.

These limitations may be resolved in later Core AI releases. The .aimodel files themselves come from a beta toolchain and may need to be re-exported for future macOS 27 builds.

Conversion provenance

  • Source weights: openai/whisper-large-v3-turbo (Apache-2.0)
  • Toolchain: apple/coreai-models, coreai-torch==0.4.0, coreai-core==1.0.0b1, transformers==4.57.3
  • Export: export_split.py (this repo), dtype float16, macOS 27.0 / Xcode 27.0
  • Verification: teacher-forcing parity vs. HF (PSNR > 130 dB in fp32) and word-for-word match with the HF fp16 greedy reference on real audio

License

Apache-2.0, same as the source model. Whisper is by OpenAI (paper); this repo only redistributes a converted, split variant for Apple Core AI.
