Parakeet-TDT-0.6b-v3 Conformer encoder — CoreML / Apple Neural Engine

The Conformer encoder of nvidia/parakeet-tdt-0.6b-v3, converted to CoreML so it runs on the Apple Neural Engine (ANE). Pair it with the MLX TDT decoder in mlx-audio-swift: the encoder runs on the ANE while decoding stays on the GPU/CPU.

This is the encoder only — you still need the MLX model (beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16) for the decoder.

Format

CoreML MLProgram, fp16 weights, fp32 I/O, CPU_AND_NE.
Fixed input shape: features [1, 128, 1000] (128 mel bins × 1000 frames ≈ 10 s) → encoded [1, 1024, 125]. A fixed shape is required for ANE residency (with a dynamic axis, residency drops to 0%), so shorter chunks are padded to 1000 frames and the output is cropped back to the matching length.
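The pad-and-crop step can be sketched in a few lines of plain Python. This is only an illustration, not the code used by mlx-audio-swift: the helper names, the zero pad value, and the rule that the number of valid output frames is ceil(input frames / 8) are all assumptions. The factor 8 is simply 1000 / 125 from the shapes above.

```python
import math

ENCODER_FRAMES = 1000  # fixed input length (from the shapes above)
ENCODED_FRAMES = 125   # fixed output length (from the shapes above)
SUBSAMPLING = ENCODER_FRAMES // ENCODED_FRAMES  # 8


def pad_features(features, pad_value=0.0):
    """Right-pad every mel row of a [n_mels][n_frames] nested list to 1000 frames.

    Assumption: padding with zeros; the real pipeline may use another value.
    """
    n_frames = len(features[0])
    if n_frames > ENCODER_FRAMES:
        raise ValueError(f"chunk has {n_frames} frames, max is {ENCODER_FRAMES}")
    padding = [pad_value] * (ENCODER_FRAMES - n_frames)
    return [list(row) + padding for row in features]


def valid_output_frames(n_frames):
    """Assumed rule: keep ceil(n_frames / 8) encoder output frames."""
    return min(ENCODED_FRAMES, math.ceil(n_frames / SUBSAMPLING))


def crop_encoded(encoded, n_frames):
    """Drop the output frames that correspond only to padding."""
    keep = valid_output_frames(n_frames)
    return [row[:keep] for row in encoded]
```

For a 6 s chunk (600 frames) this pads 400 frames on the right and keeps the first 75 of the 125 output frames.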

Usage (mlx-audio-swift)

mlx-audio-swift-stt \
  --model beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16 \
  --audio input.wav --output-path out \
  --coreml-encoder parakeet_enc_0.6b_v3.mlpackage \
  --chunk-duration 9.95

Keep --chunk-duration at or below 10 s, the length of the fixed encoder input (1000 frames).
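As a rough check of why 9.95 s fits, the sketch below converts a chunk duration to mel frames. It assumes about 10 ms per frame (derived from "1000 frames ≈ 10 s") and non-overlapping chunks. Both are assumptions, and the function names are hypothetical rather than part of the CLI.

```python
import math

ENCODER_FRAMES = 1000
FRAME_SECONDS = 10.0 / ENCODER_FRAMES  # assumed ~10 ms per mel frame


def frames_for(chunk_seconds):
    """Approximate number of mel frames in a chunk of the given duration."""
    return round(chunk_seconds / FRAME_SECONDS)


def fits_encoder(chunk_seconds):
    """True if the chunk fits in the fixed 1000-frame encoder input."""
    return frames_for(chunk_seconds) <= ENCODER_FRAMES


def chunk_count(total_seconds, chunk_seconds):
    """Number of chunks, assuming no overlap between consecutive chunks."""
    return math.ceil(total_seconds / chunk_seconds)
```

Under these assumptions a 9.95 s chunk is 995 frames, leaving 5 frames of padding, and a 20.8 min recording splits into 126 chunks.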

Measured (M1 Max, TED-LIUM 3 talk, 20.8 min)

Metric	all-MLX	hybrid (this encoder)
ANE residency	n/a	100% (0 graph interruptions)
WER vs reference	7.28%	7.11% (agreement 1.07%)
RTF (Swift release)	~95×	~131× (~1.38× faster)
GPU power	17.3 W	3.0 W (5.8× lower)

The transcript is reproduced ~1:1; CoreML-fp16 is actually closer to fp32 than the shipped MLX-bf16 encoder. The integration uses only the public MLModel and MLComputeUnits APIs.
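The derived ratios in the table follow from the raw numbers. The short calculation below reproduces them and also estimates wall-clock time, assuming the RTF column is a speed multiple (audio duration divided by processing time). The timing estimates are derived here, not separately measured.

```python
AUDIO_SECONDS = 20.8 * 60  # length of the test recording

rtf_mlx, rtf_hybrid = 95.0, 131.0   # speed multiples from the table
gpu_w_mlx, gpu_w_hybrid = 17.3, 3.0  # GPU power in watts from the table

speedup = rtf_hybrid / rtf_mlx            # ~1.38
power_ratio = gpu_w_mlx / gpu_w_hybrid    # ~5.8

# Estimated processing time if RTF = audio duration / processing time.
time_mlx = AUDIO_SECONDS / rtf_mlx        # ~13.1 s
time_hybrid = AUDIO_SECONDS / rtf_hybrid  # ~9.5 s

print(f"speedup {speedup:.2f}x, GPU power ratio {power_ratio:.1f}x")
print(f"all-MLX ~{time_mlx:.1f} s, hybrid ~{time_hybrid:.1f} s")
```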

Conversion

Produced with the converter in mlx-audio-swift tools/coreml-ane/: NeMo encoder → torch.jit.trace (fixed shape) → coremltools (fp16 MLProgram, CPU_AND_NE).
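The pipeline above might look roughly like the following. This is a sketch, not the converter in tools/coreml-ane. It assumes `encoder` is a torch.nn.Module already wrapped to take a single [1, 128, 1000] tensor and return a [1, 1024, 125] tensor. The function name and the macOS 13 deployment target are also assumptions.

```python
INPUT_SHAPE = (1, 128, 1000)  # "features" input shape from the Format section


def convert_encoder(encoder, out_path="parakeet_enc_0.6b_v3.mlpackage"):
    """Trace `encoder` at a fixed shape and convert it to an fp16 MLProgram.

    `encoder` is assumed to be a torch.nn.Module mapping one
    [1, 128, 1000] float tensor to one [1, 1024, 125] float tensor.
    """
    # Heavy dependencies are imported lazily so the module loads without them.
    import numpy as np
    import torch
    import coremltools as ct

    encoder.eval()
    example = torch.zeros(INPUT_SHAPE, dtype=torch.float32)
    with torch.no_grad():
        traced = torch.jit.trace(encoder, example)  # fixed-shape trace

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="features", shape=INPUT_SHAPE, dtype=np.float32)],
        outputs=[ct.TensorType(name="encoded", dtype=np.float32)],  # fp32 I/O
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT16,        # fp16 weights and compute
        compute_units=ct.ComputeUnit.CPU_AND_NE,
        minimum_deployment_target=ct.target.macOS13,   # assumed target
    )
    mlmodel.save(out_path)
    return mlmodel
```

Passing a concrete shape in `ct.TensorType` rather than a flexible one is what keeps the input axis fixed, which the Format section identifies as the requirement for ANE residency.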

License follows the base model (nvidia/parakeet-tdt-0.6b-v3, CC-BY-4.0).
