Nemotron Speech Streaming 0.6B: CoreML INT8

Low-latency English streaming ASR with native punctuation and capitalization, converted to CoreML for Apple Neural Engine inference. Part of speech-swift β€” on-device speech AI for Apple Silicon.

Based on nvidia/nemotron-speech-streaming-en-0.6b (cache-aware FastConformer encoder + RNN-T decoder).

Quick Start

// Add to Package.swift:
// .package(url: "https://github.com/soniqo/speech-swift.git", branch: "main")

import NemotronStreamingASR

let model = try await NemotronStreamingASRModel.fromPretrained()

// Batch
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)

// Streaming
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.text, partial.isFinal ? "[FINAL]" : "")
}

Or via CLI:

git clone https://github.com/soniqo/speech-swift && cd speech-swift && make build
.build/release/audio transcribe recording.wav --engine nemotron
.build/release/audio transcribe recording.wav --engine nemotron --stream --partial

Guide: soniqo.audio/guides/nemotron.

Model

Property         Value
Parameters       600M
Architecture     Cache-aware FastConformer (24 layers, 1024 hidden) + RNN-T (2-layer LSTM, 640 hidden)
Format           CoreML (.mlmodelc)
Quantization     INT8 k-means palettization (encoder)
Vocabulary       1024 BPE + blank (1025 total), punctuation + capitalization inline
Sample rate      16 kHz mono
Streaming chunk  160 ms (this bundle); upstream supports 80 / 160 / 560 / 1120 ms
Language         English only
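
To make the sample-rate and chunk-size figures concrete, the sketch below works out how many samples one 160 ms chunk holds at 16 kHz and slices a buffer accordingly. `splitIntoChunks` is a hypothetical helper written for this illustration, not part of speech-swift, and the card does not say whether the library expects pre-chunked input.

```swift
import Foundation

// Values taken from the table above.
let sampleRate = 16_000        // Hz, mono
let chunkMilliseconds = 160    // streaming chunk used by this bundle

// Samples per chunk = sample rate x chunk duration in seconds.
let samplesPerChunk = sampleRate * chunkMilliseconds / 1000

/// Splits a mono buffer into fixed-size chunks; the final chunk may be shorter.
/// Hypothetical helper for illustration only.
func splitIntoChunks(_ samples: [Float], chunkSize: Int) -> [ArraySlice<Float>] {
    precondition(chunkSize > 0, "chunkSize must be positive")
    return stride(from: 0, to: samples.count, by: chunkSize).map { start in
        samples[start..<min(start + chunkSize, samples.count)]
    }
}

// One second of silence: six full chunks plus a shorter remainder.
let oneSecond = [Float](repeating: 0, count: sampleRate)
let chunks = splitIntoChunks(oneSecond, chunkSize: samplesPerChunk)
print(samplesPerChunk, chunks.count, chunks.last?.count ?? 0)
```

At 16 kHz a 160 ms chunk is 2560 samples, so one second of audio does not divide evenly and the last slice carries the 640-sample remainder.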

Files

File              Size     Description
encoder.mlmodelc  562 MB   Cache-aware FastConformer encoder (INT8 palettized)
decoder.mlmodelc  14 MB    2-layer LSTM prediction network
joint.mlmodelc    3.3 MB   RNN-T joint network (1025 outputs)
config.json       <1 KB    Model configuration + streaming params
vocab.json        ~20 KB   SentencePiece BPE vocabulary
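
The encoder, prediction network, and joint network listed above are the usual three parts of an RNN-T. As a rough illustration of how such parts typically cooperate in greedy decoding, here is a toy sketch. It is not the speech-swift implementation: `TransducerParts`, `greedyDecode`, the closures, and the use of the last index as the blank id are all assumptions made for this example, and the real models run through CoreML with LSTM and encoder cache state.

```swift
import Foundation

/// Toy stand-ins for the prediction and joint networks (hypothetical names).
struct TransducerParts<State> {
    let blankId: Int          // assumption: blank is the last id (1024 of 1025)
    let initialState: State
    let predict: (_ lastToken: Int?, _ state: State) -> (output: [Float], state: State)
    let joint: (_ encoderFrame: [Float], _ predictorOutput: [Float]) -> [Float]
}

/// Greedy RNN-T decoding: for each encoder frame, keep emitting the argmax
/// token until the joint network predicts blank, then move to the next frame.
func greedyDecode<State>(
    encoderFrames: [[Float]],
    parts: TransducerParts<State>,
    maxSymbolsPerFrame: Int = 10
) -> [Int] {
    var tokens: [Int] = []
    var prediction = parts.predict(nil, parts.initialState)
    for frame in encoderFrames {
        var emitted = 0
        while emitted < maxSymbolsPerFrame {
            let logits = parts.joint(frame, prediction.output)
            // Stop on blank (or empty logits) and advance to the next frame.
            guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }),
                  best != parts.blankId else { break }
            tokens.append(best)
            // Only non-blank tokens update the prediction network.
            prediction = parts.predict(best, prediction.state)
            emitted += 1
        }
    }
    return tokens
}

// Toy networks: emit the frame's value once, then predict blank (id 3).
let toy = TransducerParts<Int>(
    blankId: 3,
    initialState: 0,
    predict: { lastToken, state in (output: [Float(lastToken ?? -1)], state: state + 1) },
    joint: { frame, predictorOutput in
        var logits = [Float](repeating: 0, count: 4)
        let target = predictorOutput[0] == frame[0] ? 3 : Int(frame[0])
        logits[target] = 1
        return logits
    }
)
let decoded = greedyDecode(encoderFrames: [[1], [1], [2], [3]], parts: toy)
print(decoded)
```

The `maxSymbolsPerFrame` cap is a common safeguard against a frame that never yields blank; the value 10 here is arbitrary.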

Upstream WER (English, 1.12 s chunk)

From the NVIDIA model card:

Dataset                 WER (%)
Average                 6.93
LibriSpeech test-clean  2.32
LibriSpeech test-other  4.84
SPGI Speech             2.97
TEDLIUM                 3.50
VoxPopuli               7.91
Gigaspeech              9.66
AMI                     11.73
Earnings22              12.52

By chunk size: 1.12 s → 6.93 %, 0.56 s → 7.07 %, 0.16 s → 7.67 %, 0.08 s → 8.43 %.
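
To put the latency/accuracy trade-off in numbers, the short sketch below computes the WER increase of each chunk size relative to the 1.12 s baseline. It uses only the upstream figures quoted above; the variable names are illustrative.

```swift
import Foundation

// (chunk length in seconds, average WER in %) copied from the figures above.
let werByChunk: [(chunkSeconds: Double, wer: Double)] = [
    (1.12, 6.93), (0.56, 7.07), (0.16, 7.67), (0.08, 8.43),
]
let baseline = werByChunk[0].wer

for entry in werByChunk {
    let absolute = entry.wer - baseline          // percentage points
    let relative = absolute / baseline * 100     // percent of the baseline WER
    print(String(format: "%.2f s: +%.2f points (+%.1f %% relative)",
                 entry.chunkSeconds, absolute, relative))
}
```

For the 160 ms chunk shipped in this bundle, the upstream numbers imply roughly 0.74 WER points over the 1.12 s setting.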

No EOU head

Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. Two ways to segment continuous audio into utterances:

  1. External VAD: pair the session with Silero VAD; on sustained silence, call finalize() to commit the current utterance.
  2. Punctuation boundary: the model emits ., ?, and ! inline, so a trailing sentence-ending punctuation mark in the partial text can be treated as a commit cue.
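
Both cues can be sketched in a few lines. The helpers below are hypothetical and independent of speech-swift: `endsWithSentenceBoundary` checks the tail of a partial transcript, and `SilenceGate` counts consecutive non-speech chunks from whatever VAD is in use. The four-chunk threshold is an arbitrary choice for illustration, and the point at which to call finalize() is left to the caller.

```swift
import Foundation

/// Returns true when the partial transcript ends with ".", "?" or "!".
/// Hypothetical helper; trailing whitespace is ignored.
func endsWithSentenceBoundary(_ partialText: String) -> Bool {
    let trimmed = partialText.trimmingCharacters(in: .whitespacesAndNewlines)
    guard let last = trimmed.last else { return false }
    return ".?!".contains(last)
}

/// Counts consecutive non-speech chunks and fires once silence is sustained.
/// Hypothetical helper; feed it one VAD decision per audio chunk.
struct SilenceGate {
    let chunksRequired: Int
    private var silentChunks = 0

    init(chunksRequired: Int) { self.chunksRequired = chunksRequired }

    /// Returns true exactly once, on the chunk where the threshold is reached.
    mutating func update(isSpeech: Bool) -> Bool {
        if isSpeech {
            silentChunks = 0
            return false
        }
        silentChunks += 1
        return silentChunks == chunksRequired
    }
}

// 4 chunks x 160 ms = 640 ms of silence before committing (arbitrary threshold).
var gate = SilenceGate(chunksRequired: 4)
let vadDecisions = [true, true, false, false, false, false, false]
for (index, isSpeech) in vadDecisions.enumerated() {
    if gate.update(isSpeech: isSpeech) {
        print("commit utterance after chunk \(index)")
    }
}
print(endsWithSentenceBoundary("How are you? "))
```

A trailing period can also come from an abbreviation or a number, so the punctuation cue is best treated as a hint and combined with a short silence check where false commits matter.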

License

Released under the NVIDIA Open Model License (same as the upstream checkpoint). See the license URL for the full terms.
