Nemotron Speech Streaming 0.6B — CoreML INT8

Low-latency English streaming ASR with native punctuation and capitalization, converted to CoreML for Apple Neural Engine inference. Part of speech-swift — on-device speech AI for Apple Silicon.

Based on nvidia/nemotron-speech-streaming-en-0.6b (cache-aware FastConformer encoder + RNN-T decoder).

Quick Start

// Add to Package.swift:
// .package(url: "https://github.com/soniqo/speech-swift.git", branch: "main")

import NemotronStreamingASR

let model = try await NemotronStreamingASRModel.fromPretrained()

// Batch
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)

// Streaming
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}

Or via CLI:

git clone https://github.com/soniqo/speech-swift && cd speech-swift && make build
.build/release/audio transcribe recording.wav --engine nemotron
.build/release/audio transcribe recording.wav --engine nemotron --stream --partial

Guide: soniqo.audio/guides/nemotron.

Model

Property	Value
Parameters	600M
Architecture	Cache-aware FastConformer (24 layers, 1024 hidden) + RNN-T (2-layer LSTM, 640 hidden)
Format	CoreML (`.mlmodelc`)
Quantization	INT8 k-means palettization (encoder)
Vocabulary	1024 BPE + blank (1025 total), punctuation + capitalization inline
Sample rate	16 kHz mono
Streaming chunk	160 ms (this bundle) — upstream supports 80 / 160 / 560 / 1120 ms
Language	English only
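
The chunk duration and sample rate above fix how many PCM samples make up one streaming step. The sketch below works through that arithmetic and shows one way to slice a buffer into chunks. `samplesPerChunk` and `chunked` are hypothetical helper names, not part of speech-swift, and zero-padding the final chunk is an assumption made for illustration; the library may handle chunking internally.

```swift
/// Number of PCM samples in one streaming chunk.
func samplesPerChunk(chunkMs: Int, sampleRate: Int) -> Int {
    // Integer arithmetic avoids floating-point rounding.
    return chunkMs * sampleRate / 1000
}

/// Splits a buffer into fixed-size chunks, zero-padding the last one
/// (the padding policy is an assumption for this sketch).
func chunked(_ samples: [Float], chunkSize: Int) -> [[Float]] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkSize, samples.count)
        var chunk = Array(samples[start..<end])
        if chunk.count < chunkSize {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkSize - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

// 160 ms at 16 kHz
let chunkSize = samplesPerChunk(chunkMs: 160, sampleRate: 16_000)
print(chunkSize) // 2560

// One second of audio spans 6.25 chunks, so the last one is padded.
let oneSecond = [Float](repeating: 0, count: 16_000)
print(chunked(oneSecond, chunkSize: chunkSize).count) // 7
```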

Files

File	Size	Description
`encoder.mlmodelc`	562 MB	Cache-aware FastConformer encoder (INT8 palettized)
`decoder.mlmodelc`	14 MB	2-layer LSTM prediction network
`joint.mlmodelc`	3.3 MB	RNN-T joint network (1025 outputs)
`config.json`	<1 KB	Model configuration + streaming params
`vocab.json`	~20 KB	SentencePiece BPE vocabulary

Upstream WER (English, 1.12 s chunk)

From the NVIDIA model card:

Dataset	WER (%)
Average	6.93
LibriSpeech test-clean	2.32
LibriSpeech test-other	4.84
SPGI Speech	2.97
TEDLIUM	3.50
VoxPopuli	7.91
Gigaspeech	9.66
AMI	11.73
Earnings22	12.52

By chunk size: 1.12 s → 6.93 %, 0.56 s → 7.07 %, 0.16 s → 7.67 %, 0.08 s → 8.43 %.
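
As a quick sanity check on these figures, the sketch below recomputes the average from the eight per-dataset rows and expresses each chunk size's WER as a relative increase over the 1.12 s baseline. It uses only the upstream numbers quoted above. This bundle runs at 160 ms, so the 0.16 s row is the closest upstream reference point; the effect of INT8 palettization is not measured here.

```swift
// Per-dataset upstream WER (%) at 1.12 s chunks, from the table above.
let datasetWER: [Double] = [2.32, 4.84, 2.97, 3.50, 7.91, 9.66, 11.73, 12.52]
let average = datasetWER.reduce(0, +) / Double(datasetWER.count)
print((average * 100).rounded() / 100) // 6.93

// Upstream WER (%) by chunk size in seconds.
let werByChunk: [(seconds: Double, wer: Double)] = [
    (1.12, 6.93), (0.56, 7.07), (0.16, 7.67), (0.08, 8.43),
]
let baseline = werByChunk[0].wer

for entry in werByChunk {
    // Relative increase over the 1.12 s baseline, in percent, to one decimal.
    let relative = (entry.wer - baseline) / baseline * 100
    let rounded = (relative * 10).rounded() / 10
    print("\(entry.seconds) s: \(entry.wer) % WER, +\(rounded) % relative")
}
```

Under these numbers, moving from 1.12 s to 0.16 s chunks costs 0.74 points of absolute WER, or roughly 11 % relative.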

No EOU head

Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. There are two ways to segment continuous audio into utterances:

External VAD: pair the session with Silero VAD; on sustained silence, call finalize() to commit the current utterance.
Punctuation boundary: the model emits ., ?, and ! inline, so a trailing sentence-ending punctuation mark in the partial text can be treated as a commit cue.
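
Both cues can be sketched as small helpers. `endsWithSentencePunctuation` and `SilenceCommitter` are hypothetical names introduced here, not speech-swift APIs, and the per-chunk `isSpeech` flag stands in for whatever decision the VAD produces. The threshold of 4 chunks (640 ms at 160 ms per chunk) is an arbitrary illustration, not a recommended value.

```swift
/// Returns true when the partial transcript ends with ".", "?" or "!".
func endsWithSentencePunctuation(_ partial: String) -> Bool {
    // Ignore trailing whitespace before checking the final character.
    guard let last = partial.last(where: { !$0.isWhitespace }) else { return false }
    return ".?!".contains(last)
}

/// Counts consecutive non-speech chunks and signals once when silence is sustained.
struct SilenceCommitter {
    let chunksToCommit: Int
    private var silentRun = 0

    init(chunksToCommit: Int) {
        self.chunksToCommit = chunksToCommit
    }

    /// Feed one VAD decision per chunk. Returns true exactly once per silence run,
    /// at the chunk where the run reaches `chunksToCommit`.
    mutating func update(isSpeech: Bool) -> Bool {
        if isSpeech {
            silentRun = 0
            return false
        }
        silentRun += 1
        return silentRun == chunksToCommit
    }
}

var committer = SilenceCommitter(chunksToCommit: 4)
let vadDecisions = [true, true, false, false, false, false, false]
for (index, isSpeech) in vadDecisions.enumerated() {
    if committer.update(isSpeech: isSpeech) {
        // This is the point where the session's finalize() would be called.
        print("commit at chunk \(index)") // prints "commit at chunk 5"
    }
}

print(endsWithSentencePunctuation("How are you? ")) // true
print(endsWithSentencePunctuation("How are"))       // false
```

The punctuation cue alone can fire early on abbreviations such as "Dr." or on a partial that is later revised, so combining it with a short silence requirement is a reasonable design choice.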

License

Released under the NVIDIA Open Model License (same as the upstream checkpoint). See the license URL for the full terms.
