Parakeet-TDT-0.6b-v3 Conformer encoder: CoreML / Apple Neural Engine

The Conformer encoder of nvidia/parakeet-tdt-0.6b-v3, converted to CoreML so it runs on the Apple Neural Engine (ANE). Pair it with the MLX TDT decoder in mlx-audio-swift: the encoder runs on the ANE while decoding stays on the GPU/CPU.

This is the encoder only: you still need the MLX model (beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16) for the decoder.

Format

  • CoreML MLProgram, fp16 weights, fp32 I/O, CPU_AND_NE.
  • Fixed input shape: features [1, 128, 1000] (1000 mel frames ≈ 10 s) → encoded [1, 1024, 125]. A fixed shape is required for ANE residency (a dynamic axis drops it to 0%); chunks are padded to 1000 frames and the output cropped back.
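
The pad-and-crop step can be sketched in plain Python. This is an illustration only, not the code used by mlx-audio-swift: the helper names are hypothetical, zero padding is an assumption (the card does not state the pad value), and the number of valid output frames is assumed to be ceil(n / 8), inferred from the 1000 → 125 shape ratio.

```python
ENCODER_FRAMES = 1000   # fixed input length (mel frames), from the format notes
ENCODED_FRAMES = 125    # fixed output length, from the format notes
SUBSAMPLING = ENCODER_FRAMES // ENCODED_FRAMES  # 8, inferred from the two shapes


def pad_features(features, pad_value=0.0):
    """Pad each mel row to ENCODER_FRAMES; return (padded rows, original length).

    `features` is a list of rows (one per mel bin), each with n <= 1000 values.
    Zero padding is an assumption made for this sketch.
    """
    n = len(features[0])
    if n > ENCODER_FRAMES:
        raise ValueError(f"chunk has {n} frames, encoder accepts at most {ENCODER_FRAMES}")
    padded = [list(row) + [pad_value] * (ENCODER_FRAMES - n) for row in features]
    return padded, n


def valid_encoded_frames(n):
    """Output frames that correspond to real input, assuming ceil(n / 8)."""
    return -(-n // SUBSAMPLING)


def crop_encoded(encoded, n):
    """Drop the output frames that were produced only by padding."""
    keep = valid_encoded_frames(n)
    return [row[:keep] for row in encoded]
```

For a 600-frame chunk, `pad_features` returns 128 rows of 1000 values, and `crop_encoded` keeps the first 75 of the 125 output frames.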

Usage (mlx-audio-swift)

mlx-audio-swift-stt \
  --model beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16 \
  --audio input.wav --output-path out \
  --coreml-encoder parakeet_enc_0.6b_v3.mlpackage \
  --chunk-duration 9.95

Keep --chunk-duration ≤ 10 s (the fixed encoder length).
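
A quick way to check whether a chunk duration fits the fixed encoder length is to convert it to mel frames. The sketch below assumes 16 kHz audio and a 160-sample (10 ms) hop, which is consistent with "1000 mel frames ≈ 10 s" but is not stated in this card; the exact frame count also depends on how the feature extractor pads the signal, so treat the result as an estimate. The function names are hypothetical.

```python
SAMPLE_RATE = 16_000     # assumed input sample rate in Hz
HOP_SAMPLES = 160        # assumed hop of 10 ms, i.e. about 100 mel frames per second
ENCODER_FRAMES = 1000    # fixed encoder length from the format notes


def mel_frames(chunk_seconds):
    """Estimate the mel frame count for a chunk (ceil of samples / hop)."""
    samples = round(chunk_seconds * SAMPLE_RATE)
    return -(-samples // HOP_SAMPLES)


def fits_encoder(chunk_seconds):
    """True if the estimated frame count fits the fixed 1000-frame input."""
    return mel_frames(chunk_seconds) <= ENCODER_FRAMES
```

Under these assumptions a 9.95 s chunk maps to 995 frames, leaving a few frames of headroom, while a 10.5 s chunk would exceed the encoder input.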

Measured (M1 Max, TED-LIUM 3 talk, 20.8 min)

Metric                all-MLX    hybrid (this encoder)
ANE residency         n/a        100% (0 graph interruptions)
WER vs reference      7.28%      7.11% (agreement 1.07%)
RTF (Swift release)   ~95×       ~131× (~1.38×)
GPU power             17.3 W     3.0 W (÷5.8)
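
The relative figures in parentheses follow from the two measured columns. The short sketch below recomputes them and also estimates processing time for the 20.8 min talk. It assumes the RTF values mean "seconds of audio processed per second of wall time" (so higher is faster); that reading is an interpretation, and the function name is hypothetical.

```python
AUDIO_SECONDS = 20.8 * 60            # length of the test talk
RTF_MLX, RTF_HYBRID = 95.0, 131.0    # measured, approximate
GPU_W_MLX, GPU_W_HYBRID = 17.3, 3.0  # measured GPU power in watts


def derived_figures():
    """Recompute the ratios quoted next to the hybrid column."""
    return {
        "speedup": RTF_HYBRID / RTF_MLX,               # hybrid vs all-MLX throughput
        "gpu_power_ratio": GPU_W_MLX / GPU_W_HYBRID,   # all-MLX vs hybrid GPU power
        # Wall-clock estimates, assuming RTF = audio seconds per wall second.
        "wall_seconds_mlx": AUDIO_SECONDS / RTF_MLX,
        "wall_seconds_hybrid": AUDIO_SECONDS / RTF_HYBRID,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.2f}")
```

Rounded, the speedup is 1.38 and the GPU power ratio is 5.8, matching the table. Under the stated reading of RTF, the talk takes roughly 13 s all-MLX and roughly 9.5 s with the hybrid path.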

The transcript is reproduced ~1:1; the CoreML fp16 encoder is actually closer to fp32 than the shipped MLX bf16 encoder. The hybrid path uses only the public MLModel and MLComputeUnits APIs.

Conversion

Produced with the converter in mlx-audio-swift tools/coreml-ane/: NeMo encoder → torch.jit.trace (fixed shape) → coremltools (fp16 MLProgram, CPU_AND_NE).
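
The pipeline above can be approximated as follows. This is a minimal sketch and not the converter in tools/coreml-ane/: `convert_encoder` is a hypothetical helper, and `encoder` is assumed to be a `torch.nn.Module` that maps a single [1, 128, 1000] tensor to the encoded tensor. The real NeMo encoder also takes a length argument, so it would need a thin wrapper before tracing. The I/O names "features" and "encoded" are taken from the format notes.

```python
FEATURES_SHAPE = (1, 128, 1000)                     # fixed input shape from the card
OUTPUT_PACKAGE = "parakeet_enc_0.6b_v3.mlpackage"   # file name from the usage example


def convert_encoder(encoder, output_path=OUTPUT_PACKAGE):
    """Trace `encoder` at a fixed shape and convert it to a CoreML MLProgram.

    Requires torch and coremltools; they are imported lazily so this module
    can be loaded without them.
    """
    import numpy as np
    import torch
    import coremltools as ct

    encoder = encoder.eval()
    example = torch.zeros(FEATURES_SHAPE, dtype=torch.float32)
    with torch.no_grad():
        # Tracing with a concrete example bakes in the fixed shape.
        traced = torch.jit.trace(encoder, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="features", shape=FEATURES_SHAPE, dtype=np.float32)],
        outputs=[ct.TensorType(name="encoded")],
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT16,   # fp16 weights and compute
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # CPU + Neural Engine, no GPU
    )
    mlmodel.save(output_path)
    return mlmodel
```

Passing a fixed `shape` rather than a flexible range keeps every axis static, which the format notes identify as the condition for ANE residency.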

License follows the base model (nvidia/parakeet-tdt-0.6b-v3, CC-BY-4.0).
