Low-latency English streaming ASR with native punctuation and capitalization, converted to CoreML for Apple Neural Engine inference. Part of speech-swift, on-device speech AI for Apple Silicon.
Based on nvidia/nemotron-speech-streaming-en-0.6b (cache-aware FastConformer encoder + RNN-T decoder).
```swift
// Add to Package.swift:
// .package(url: "https://github.com/soniqo/speech-swift.git", branch: "main")
import NemotronStreamingASR

let model = try await NemotronStreamingASRModel.fromPretrained()

// Batch
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)

// Streaming
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}
```
Or via CLI:
```bash
git clone https://github.com/soniqo/speech-swift && cd speech-swift && make build
.build/release/audio transcribe recording.wav --engine nemotron
.build/release/audio transcribe recording.wav --engine nemotron --stream --partial
```
Guide: soniqo.audio/guides/nemotron.
| Property | Value |
|---|---|
| Parameters | 600M |
| Architecture | Cache-aware FastConformer (24 layers, 1024 hidden) + RNN-T (2-layer LSTM, 640 hidden) |
| Format | CoreML (.mlmodelc) |
| Quantization | INT8 k-means palettization (encoder) |
| Vocabulary | 1024 BPE + blank (1025 total), punctuation + capitalization inline |
| Sample rate | 16 kHz mono |
| Streaming chunk | 160 ms (this bundle); upstream supports 80 / 160 / 560 / 1120 ms |
| Language | English only |
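
To make the chunk sizes concrete, here is a minimal sketch of the arithmetic at 16 kHz. The helpers `samplesPerChunk` and `chunked` are hypothetical names introduced for illustration and are not part of speech-swift; the library presumably handles chunking internally when streaming.

```swift
/// Number of PCM samples in a chunk of the given duration.
/// Hypothetical helper; 16 kHz is the sample rate listed in the table above.
func samplesPerChunk(milliseconds: Int, sampleRate: Int = 16_000) -> Int {
    return sampleRate * milliseconds / 1000
}

/// Split a mono buffer into fixed-size chunks; the final chunk may be shorter.
func chunked(_ samples: [Float], chunkSize: Int) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    return stride(from: 0, to: samples.count, by: chunkSize).map { start in
        Array(samples[start..<min(start + chunkSize, samples.count)])
    }
}

let chunkSize = samplesPerChunk(milliseconds: 160)      // 2560 samples
let oneSecond = [Float](repeating: 0, count: 16_000)    // 1 s of silence
let chunks = chunked(oneSecond, chunkSize: chunkSize)   // 6 full chunks + 1 of 640 samples
print(chunkSize, chunks.count)                          // 2560 7
```

The same arithmetic gives 1280, 8960, and 17920 samples for the 80, 560, and 1120 ms chunk sizes supported upstream.
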
| File | Size | Description |
|---|---|---|
| `encoder.mlmodelc` | 562 MB | Cache-aware FastConformer encoder (INT8 palettized) |
| `decoder.mlmodelc` | 14 MB | 2-layer LSTM prediction network |
| `joint.mlmodelc` | 3.3 MB | RNN-T joint network (1025 outputs) |
| `config.json` | <1 KB | Model configuration + streaming params |
| `vocab.json` | ~20 KB | SentencePiece BPE vocabulary |
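
As a rough illustration of how the encoder, prediction network, and joint network fit together, the sketch below implements textbook greedy RNN-T decoding. It is not the speech-swift implementation: `GreedyRNNT`, `predict`, and `joint` are hypothetical stand-ins for the CoreML models, priming the prediction network with the blank token is an assumed convention, and the toy closures use a 3-token vocabulary rather than the real 1025 outputs.

```swift
/// Textbook greedy RNN-T decoding over already-encoded frames.
/// `predict` stands in for the prediction network, `joint` for the joint network.
struct GreedyRNNT<State> {
    let blankID: Int
    let maxSymbolsPerFrame: Int   // guards against emitting forever on one frame
    let predict: (_ lastToken: Int, _ state: State) -> (output: [Float], state: State)
    let joint: (_ encoderFrame: [Float], _ predictorOutput: [Float]) -> [Float]

    func decode(frames: [[Float]], initialState: State) -> [Int] {
        var tokens: [Int] = []
        // Assumption: the prediction network is primed with blank as start-of-sequence.
        var (predOut, state) = predict(blankID, initialState)
        for frame in frames {
            var emitted = 0
            while emitted < maxSymbolsPerFrame {
                let logits = joint(frame, predOut)
                // Arg-max over the joint outputs; blank means "move to the next frame".
                guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }),
                      best != blankID else { break }
                tokens.append(best)
                (predOut, state) = predict(best, state)
                emitted += 1
            }
        }
        return tokens
    }
}

// Toy stand-ins: each frame carries the token it "wants"; the joint returns
// blank once the predictor output shows that token was just emitted.
let blank = 2
let decoder = GreedyRNNT<Int>(
    blankID: blank,
    maxSymbolsPerFrame: 10,
    predict: { lastToken, step in (output: [Float(lastToken)], state: step + 1) },
    joint: { frame, predOut in
        let target = Int(frame[0])
        let winner = Int(predOut[0]) == target ? blank : target
        var logits = [Float](repeating: 0, count: 3)
        logits[winner] = 1
        return logits
    }
)
let tokens = decoder.decode(frames: [[0], [1], [2], [0]], initialState: 0)
print(tokens)   // [0, 1, 0]
```

In a streaming setting the same loop would run once per encoder chunk, carrying the prediction-network state across chunks.
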
From the NVIDIA model card:
| Dataset | WER (%) |
|---|---|
| Average | 6.93 |
| LibriSpeech test-clean | 2.32 |
| LibriSpeech test-other | 4.84 |
| SPGI Speech | 2.97 |
| TEDLIUM | 3.50 |
| VoxPopuli | 7.91 |
| Gigaspeech | 9.66 |
| AMI | 11.73 |
| Earnings22 | 12.52 |
WER by chunk size: 1.12 s → 6.93%, 0.56 s → 7.07%, 0.16 s → 7.67%, 0.08 s → 8.43%.
Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. There are two ways to segment continuous audio into utterances:
- Call `finalize()` to commit the current utterance.
- The model emits `.`, `?`, and `!` inline, so trailing sentence-ending punctuation in the partial text can be treated as a commit cue.

Released under the NVIDIA Open Model License (same as the upstream checkpoint). See the license URL for the full terms.
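
The punctuation-based commit cue for utterance segmentation can be sketched as a small predicate on the partial text. `endsWithSentencePunctuation` is a hypothetical helper, not part of speech-swift, and how it would be wired to `finalize()` is left open here.

```swift
/// Returns true when the partial transcript ends in `.`, `?`, or `!`,
/// ignoring trailing whitespace. Hypothetical helper for illustration only.
func endsWithSentencePunctuation(_ partialText: String) -> Bool {
    guard let last = partialText.last(where: { !$0.isWhitespace }) else {
        return false   // empty or whitespace-only text is never a commit cue
    }
    return last == "." || last == "?" || last == "!"
}

print(endsWithSentencePunctuation("How are you? "))   // true
print(endsWithSentencePunctuation("How are"))         // false
```

A trailing period can also come from an abbreviation such as "Dr.", so in practice this cue is probably best combined with a short pause or timeout before committing.
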