IBM Granite Speech 4.1 2b NAR - ONNX export
ONNX export of ibm-granite/granite-speech-4.1-2b-nar produced by smcleod. Three precision tiers (fp32/, int8/, fp16w/) ship in this repo - see Files below for sizes and trade-offs. The graphs target opset 20 / IR 10 / ai.onnx-only, so they load under the ort 2.0-rc.x Rust crate and onnxruntime 1.17 - 1.25.
Three graphs and a host-side splice: encoder.onnx runs the conformer + CTC heads + BPE-collapsing projector and emits bpe_logits_dense plus pooled audio embeddings. embed_tokens.onnx looks up text-token embeddings for the CTC draft (with insertion slots). editor.onnx runs the bidirectional NLE editor over the concatenation of audio embeddings and text-with-slots embeddings and emits per-position vocab logits. Decoding is a single argmax pass; no KV cache, no autoregression. See How to use for the slot-insertion algorithm.
Files
Each precision tier ships in its own subdirectory (fp32/, int8/, fp16w/). Inside, files use the clean stem (no precision suffix) - the directory name carries the tier. Download a single subdirectory if you only need one precision; the tokeniser, processor, scripts, and metadata at the bundle root are shared across all tiers.
fp32/ - FP32 (reference, full precision) - 9.8 GB total
Use when you need byte-for-byte parity with the upstream PyTorch reference, or as a baseline for quantisation/conversion experiments.
- fp32/encoder.onnx + fp32/encoder.onnx_data
- fp32/editor.onnx + fp32/editor.onnx_data
- fp32/embed_tokens.onnx + fp32/embed_tokens.onnx_data
int8/ - INT8 (smallest) - 2.5 GB total
Dynamic weights-only INT8 (MatMulInteger + ConvInteger, all ai.onnx). Mild quality drop on case/punctuation but transcripts remain semantically accurate. Choose when disk or memory is tight.
- int8/encoder.onnx + int8/encoder.onnx_data
- int8/editor.onnx + int8/editor.onnx_data
- int8/embed_tokens.onnx + int8/embed_tokens.onnx_data
fp16w/ - FP16w (recommended for highest quality at smaller-than-FP32 size) - 4.9 GB total
Weights-FP16 with FP32 compute and IO. Each FP32 initializer is rewritten to FP16 storage with a Cast(FP16->FP32) inserted before each consumer; arithmetic and IO stay FP32. Quality is essentially identical to FP32 (mean norm WER 0.04% vs 0.72% for INT8) at 50% of FP32 storage. Choose when you have the disk and want FP32-grade transcripts.
- fp16w/encoder.onnx + fp16w/encoder.onnx_data
- fp16w/editor.onnx + fp16w/editor.onnx_data
- fp16w/embed_tokens.onnx + fp16w/embed_tokens.onnx_data
Shared (used by every precision tier)
- Tokeniser / processor: tokenizer.json, tokenizer_config.json, special_tokens_map.json, preprocessor_config.json
- Export scripts: export_nar_encoder.py, export_nar_editor.py, export_embed_tokens.py, quantise.py, convert_fp.py
- granite_export_metadata.json (graph IO, parity numbers, toolchain)
- LICENSE (Apache 2.0)
- test_fixtures/ - golden inputs/outputs for integration testing. See test_fixtures/README.md.
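Since each tier lives in its own subdirectory and the shared files sit at the bundle root, a single-tier download can be expressed as a set of glob patterns. The sketch below is one way to do that with `huggingface_hub.snapshot_download` and its `allow_patterns` argument. The helper names `tier_allow_patterns` and `download_tier` are hypothetical, and the shared-file list is copied from the list above (scripts, licence and fixtures are left out on the assumption that inference does not need them).

```python
# Hypothetical helpers: fetch one precision tier plus the shared root files.
SHARED_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "preprocessor_config.json",
    "granite_export_metadata.json",
]
TIERS = ("fp32", "int8", "fp16w")


def tier_allow_patterns(tier: str) -> list[str]:
    """Return glob patterns selecting one tier directory and the shared files."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {TIERS}")
    return [f"{tier}/*", *SHARED_FILES]


def download_tier(
    tier: str,
    repo_id: str = "smcleod/ibm-granite-speech-4.1-2b-nar-onnx",
) -> str:
    """Download a single tier and return the local snapshot directory."""
    # Imported lazily so the pattern helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=tier_allow_patterns(tier))
```

Calling `download_tier("int8")` would then pull only `int8/*` and the shared files, which keeps the download near the 2.5 GB figure quoted for that tier.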
How to use end-to-end
Audio frontend (shared across variants)
Same 16 kHz log-mel frontend as the AR variants (see the bundled
preprocessor_config.json). NAR's encoder additionally wants an
attention_mask [B, T] int64 (1 = valid, 0 = padding). The included
test_fixtures/expected_attention_mask.npy and
test_fixtures/expected_input_features.npy are reference outputs from the
upstream AutoFeatureExtractor for verifying your frontend.
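As an illustration of the mask convention described above (1 = valid, 0 = padding), the following sketch builds a padded [B, T] mask from per-clip frame counts. It assumes T is the padded frame axis of input_features; the function name `build_attention_mask` is hypothetical, and the result should be checked against test_fixtures/expected_attention_mask.npy rather than taken as the upstream behaviour.

```python
def build_attention_mask(frame_counts: list[int]) -> list[list[int]]:
    """Build a [B, T] mask: 1 for valid frames, 0 for right-padding.

    frame_counts holds the number of valid frames per clip (assumption:
    clips are right-padded to the longest clip in the batch).
    """
    if not frame_counts:
        raise ValueError("frame_counts must not be empty")
    t_max = max(frame_counts)
    return [[1] * n + [0] * (t_max - n) for n in frame_counts]


# Convert to the int64 array the encoder expects, for example with
# numpy.asarray(mask, dtype=numpy.int64), before feeding it to the session.
```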
Call sequence (non-autoregressive)
Three graphs and a host-side splice; no KV cache, no per-token loop.
1. encoder.onnx (input_features, attention_mask)
-> bpe_logits_dense, bpe_mask, audio_embeds, audio_lengths, char_logits
2. CTC decode (host) (bpe_logits_dense + bpe_mask)
-> draft text token IDs (greedy + collapse blanks/dupes)
3. embed_tokens.onnx (text token IDs with insertion-slot tokens) -> text_embeds
4. splice (host) concat(audio_embeds[:audio_len], text_embeds) -> inputs_embeds [1, N, 2048]
5. editor.onnx (inputs_embeds, position_ids, 4-D zero attention_mask)
-> logits [1, N, 100352]
6. argmax + slice (host) over the text segment of `logits` -> final token IDs -> tokenizer.decode
bpe_logits_dense (shape [B, T_bpe, V_bpe]) is the head used downstream.
char_logits is exposed for diagnostics but not part of the inference path.
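Step 2 of the call sequence (greedy CTC decode with blank and duplicate collapsing) can be sketched in plain Python as follows. This is a generic greedy CTC decoder, not the upstream implementation: the blank index (`blank_id=0`) is an assumption to confirm against granite_export_metadata.json, and whether the resulting ids need remapping before embed_tokens.onnx is not stated in this card.

```python
def ctc_greedy_decode(logits, mask, blank_id=0):
    """Greedy CTC decode for one utterance.

    logits: sequence of frames, each a sequence of V_bpe scores.
    mask:   sequence of truthy/falsy flags, one per frame (bpe_mask).
    blank_id: index of the CTC blank symbol (assumed, not confirmed).
    """
    draft = []
    prev = None
    for frame, valid in zip(logits, mask):
        if not valid:
            continue  # skip padded frames
        token = max(range(len(frame)), key=frame.__getitem__)  # argmax
        # Emit only when the symbol changes and is not the blank.
        if token != prev and token != blank_id:
            draft.append(token)
        prev = token
    return draft
```

Because `prev` also tracks blanks, a repeated token separated by a blank is emitted twice, which is the standard CTC collapsing rule.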
Slot-insertion algorithm (NAR)
add_insertion_slots interleaves the LLM's eos_token_id between every CTC
draft token (and at the boundaries), giving the editor a fixed insertion slot
to rewrite or expand each span. Reproduce it directly without the upstream
class:
def add_insertion_slots(t, eos_id):
    # t: list/tensor of CTC draft token ids (after greedy argmax + blank-collapse).
    # eos_id: tokenizer.eos_token_id (read from tokenizer_config.json).
    n = len(t)
    out_len = max(2 * n + 1, 8)
    out = [eos_id] * out_len
    for i in range(n):
        out[2 * i + 1] = t[i]
    return out
Then plug it in:
slots = add_insertion_slots(t, eos_id)      # length max(2n+1, 8)
text_emb = embed_tokens.onnx(slots)         # [1, len(slots), 2048]
audio = audio_embeds[:audio_len]            # [audio_len, 2048]
flat = concat([audio, text_emb[0]], dim=0)  # [N, 2048]
N = audio_len + len(slots)
position = arange(N)
attn = zeros([1, 1, N, N], float32)         # bidirectional, no masking
logits = editor.onnx(flat.unsqueeze(0), position.unsqueeze(0), attn)
The text segment of logits is logits[:, audio_len:, :]; argmax over that
segment yields the final token IDs, which are then decoded with the LLM
tokeniser. The 4-D attention_mask is identically zero (additive-mask
convention: 0 = unmasked, -inf = masked), but the graph still expects it as
an explicit input.
Note: if config.scale_projected_embeddings is set on the upstream config,
divide audio_embeds by config.embedding_multiplier before splicing.
The shipped encoder graph does not bake in that division; do it host-side.
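The host-side splice and the final slice can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: `build_editor_inputs` and `text_token_ids` are hypothetical helper names, the feed keys follow the input names used in the call sequence above, and the int64 dtype for position_ids is a guess to confirm against granite_export_metadata.json or `session.get_inputs()`.

```python
import numpy as np


def build_editor_inputs(audio_embeds, audio_len, text_embeds, embedding_multiplier=None):
    """Assemble editor.onnx feeds for one utterance.

    audio_embeds: [T_audio, H] float array from the encoder.
    text_embeds:  [S, H] float array from embed_tokens.onnx (batch axis removed).
    embedding_multiplier: pass config.embedding_multiplier only when
        config.scale_projected_embeddings is set upstream.
    """
    audio = audio_embeds[:audio_len]
    if embedding_multiplier is not None:
        audio = audio / embedding_multiplier  # division is not baked into the graph
    flat = np.concatenate([audio, text_embeds], axis=0).astype(np.float32)
    n = flat.shape[0]
    return {
        "inputs_embeds": flat[None, ...],                            # [1, N, H]
        "position_ids": np.arange(n, dtype=np.int64)[None, :],       # [1, N], dtype assumed
        "attention_mask": np.zeros((1, 1, n, n), dtype=np.float32),  # additive, all unmasked
    }


def text_token_ids(logits, audio_len):
    """Argmax over the text segment of editor logits shaped [1, N, V]."""
    return logits[0, audio_len:, :].argmax(axis=-1)
```

The returned dictionary can be passed as the feed to an ONNX Runtime session once the input names have been confirmed; the ids from `text_token_ids` still contain whatever the editor wrote into the insertion slots.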
Tokeniser
We ship tokenizer.json + tokenizer_config.json from
ibm-granite/granite-speech-4.1-2b-nar. NAR has no chat
template (no chat_template.jinja); the tokeniser is used directly on the
draft text and on the editor output.
Runtime / EP notes
- ai.onnx-only at opset 20; no com.microsoft.* ops. Load under ort 2.0-rc.x or onnxruntime 1.17 - 1.25.
- CoreML EP: opset 20 contains ops without CoreML kernels; ORT falls back to CPU silently at session-load. FP16w is the better fit for MPS-targeted inference.
- CUDA / CPU: work out of the box across all three tiers.
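For the Python runtime, execution-provider selection could look like the sketch below. It uses `onnxruntime.get_available_providers()` and the `providers` argument of `onnxruntime.InferenceSession`; the helper names `pick_providers` and `open_session` are hypothetical, and the default preference order is only an example.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep preferred providers that are available, always ending with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # explicit CPU fallback
    return chosen


def open_session(model_path, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Open a session for one graph, e.g. 'fp16w/encoder.onnx'."""
    # Imported lazily so pick_providers can be used without onnxruntime.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers(), preferred)
    return ort.InferenceSession(model_path, providers=providers)
```

The external-data sidecar (`<stem>.onnx_data`) is expected to sit next to the `.onnx` file when the session is created.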
How the tiers are produced
- INT8 is dynamic, weights-only, per-channel QInt8 over MatMul + Conv ops. The quantiser emits MatMulInteger + ConvInteger and leaves activations in FP32. The unquantised ~22% of MatMul nodes in the LLM body graphs are activation x activation (attention QK^T and attention_weights x V); dynamic weight-only INT8 cannot quantise those, so this is the expected ceiling, not a coverage gap.
- FP16w stores weights as FP16 initializers with a Cast(FP16->FP32) inserted before each consumer, so arithmetic and IO stay FP32. Quality matches FP32 within numeric tolerance at ~50% of FP32 storage.
- embed_tokens is shipped as its own graph in all three tiers. INT8 uses per-row symmetric quantisation rather than the dynamic MatMul/Conv quantiser (Gather is not in that op set), giving the embedding table its own ~4x storage win at INT8.
- No com.microsoft.* ops are used. Re-validate the op-domain set with assert_pure_ai_onnx in quantise.py / convert_fp.py after any change.
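To make the per-row symmetric scheme for the embedding table concrete, here is a generic sketch of the technique: one scale per row, zero-point fixed at 0, values clamped to [-127, 127]. It is not the code in export_embed_tokens.py, whose exact range and scale storage may differ; `quantise_rows` and `dequantise_row` are hypothetical names.

```python
def quantise_rows(table):
    """Per-row symmetric INT8 quantisation of a 2-D table (list of rows).

    Returns (int8_rows, scales) where row ~= [q * scale for q in int8_row].
    """
    q_rows, scales = [], []
    for row in table:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid divide-by-zero
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantise_row(q_row, scale):
    """Recover approximate FP values for one looked-up row."""
    return [q * scale for q in q_row]
```

Storing one INT8 value per weight plus one scale per row is what gives roughly a 4x reduction over FP32 for a wide embedding table.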
Parity
Parity is taken against the upstream PyTorch reference on a single LibriSpeech
clip (10226_10111_000000.wav, 8.43 seconds, 844 mel frames). FP32 graphs
match the reference within numeric tolerance; INT8 graphs are validated in
argmax-only mode (logit values shift, but the decoded transcript is
unchanged on this clip).
| graph | precision | max-abs-err | argmax mismatches | transcript match |
|---|---|---|---|---|
| encoder (bpe_logits_dense) | FP32 | 0.00204 | 0/211 | n/a |
| encoder (bpe_logits_dense) | INT8 | 1.84 | 0/211 | n/a |
| editor | FP32 | 0.00147 | 0/257 | Y |
| editor | INT8 | 94.5 | 15/257 | Y |
INT8 note: the encoder graph emits two CTC heads. bpe_logits_dense (used downstream) holds argmax-stable through quantisation; char_logits (unused downstream) drifts noticeably and is not part of the inference path. The editor INT8 graph reproduces the reference transcript despite a max-abs logit delta of 94.5 and 15/257 per-position argmax mismatches, so on this clip the residual quantisation error does not change the decoded text.
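The two columns in the table above (max-abs-err and argmax mismatches) can be computed from a pair of logit tensors along the lines of the sketch below. How the published numbers were actually produced is not described here, so treat this as one plausible definition; `parity_stats` is a hypothetical name and the inputs are assumed to be per-position lists of logits with the batch axis removed.

```python
def parity_stats(reference, candidate):
    """Compare two [positions][vocab] logit tables.

    Returns (max_abs_err, argmax_mismatches, positions).
    """
    if len(reference) != len(candidate):
        raise ValueError("position counts differ")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    max_abs_err = 0.0
    mismatches = 0
    for ref_row, cand_row in zip(reference, candidate):
        row_err = max(abs(r - c) for r, c in zip(ref_row, cand_row))
        max_abs_err = max(max_abs_err, row_err)
        if argmax(ref_row) != argmax(cand_row):
            mismatches += 1
    return max_abs_err, mismatches, len(reference)
```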
Multi-clip transcript parity
Three additional 16 kHz mono clips covering longer utterances (39 to 94 seconds), single and two-speaker conversational content. Word error rate (WER) and Levenshtein edit distance computed against the upstream PyTorch reference. Numbers measured end-to-end through the full ONNX pipeline (no PyTorch encoder fallback).
WER is the strict word-error rate against the PyTorch reference (case + punctuation sensitive). norm WER lower-cases both transcripts and strips punctuation before comparing - the dominant driver of strict WER on this model at INT8 is capitalisation and trailing punctuation drift, not actual word substitution. Pick whichever metric matches your downstream task. FP16w is essentially FP32 quality at 50% of FP32 storage; INT8 is the smallest tier with a mild quality drop.
| Clip | Duration | FP32 byte-exact | INT8 byte-exact | INT8 WER | INT8 norm WER | FP16w byte-exact | FP16w WER | FP16w norm WER |
|---|---|---|---|---|---|---|---|---|
| is-it-more-wood | 46.9 s | Y | N | 4.3% | 2.05% | Y | 0.0% | 0.00% |
| two-speakers-1 | 93.8 s | N | N | 3.5% | 1.72% | N | 0.4% | 0.34% |
| two-speakers-2 | 38.8 s | Y | N | 5.1% | 0.96% | Y | 0.0% | 0.00% |
Raw multi-clip data including full transcripts: see granite_export_metadata.json multi_clip_parity block.
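The strict WER and norm WER metrics described above can be reproduced with a word-level Levenshtein distance. The sketch below is a generic implementation; the exact normalisation used for the published numbers is not specified beyond lower-casing and punctuation stripping, so the regular expression in `normalise` (which also removes apostrophes) is an assumption.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Strict word error rate: case and punctuation sensitive."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def normalise(text):
    """Lower-case and strip punctuation (assumed normalisation)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def norm_wer(reference, hypothesis):
    """WER after lower-casing and punctuation stripping."""
    return wer(normalise(reference), normalise(hypothesis))
```

For example, comparing "Hello, world." with "hello world" gives a strict WER of 1.0 but a norm WER of 0.0, which is the kind of case and punctuation drift the INT8 column reflects.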
Reference transcript for the single LibriSpeech test clip:
after his nap timothy lazily stretched first one gray velvet foot then another strolled indolently to his plate turning over the food carefully selecting choice bits nosing out that which he scorned upon the clean hearth
The FP32 and FP16w paths reproduce this transcript exactly on the test clip, and INT8 reproduces it within argmax-only tolerance (token argmax preserved).
Toolchain
- transformers 5.8.0
- torch 2.11.0
- onnx 1.21.0
- onnxruntime 1.25.1
- exporter: torch.onnx.export TorchScript path (dynamo=False)
- opset: 20 (ai.onnx only)
- IR version: 10
- external data layout: single <stem>.onnx_data sidecar per graph
Compatibility
Targeted at the ort 2.0-rc.x Rust crate.
Compatible with onnxruntime Python 1.17 through 1.25. No com.microsoft
ops are used. Graphs were emitted via the TorchScript path
(torch.onnx.export(..., dynamo=False)); the dynamo exporter was deliberately
avoided because it injects aten::* ops ort does not understand. See the
Runtime / EP notes above for CoreML / CUDA / CPU
specifics including which precision tier to pick per backend.
Reproducing the export
The included scripts and quantise.py regenerate every artefact in this
bundle. The export pipeline writes flat-layout files into exports/<variant>/;
the per-tier subdirectory layout you see in this repo is produced by
scripts/stage_bundles.py (in the source tree at
https://github.com/sammcj/granite-speech-4.1-onnx). From a checkout:
python export_nar_encoder.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b-nar> \
    --out-dir exports/granite-speech-4.1-2b-nar
python export_nar_editor.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b-nar> \
    --out-dir exports/granite-speech-4.1-2b-nar
python export_embed_tokens.py --variant nar
# INT8 (NAR variant: no exclusion - both AR-style exclusions regressed NAR norm WER).
# embed_tokens uses a hand-rolled per-row INT8 path baked into export_embed_tokens.py.
python quantise.py --input exports/granite-speech-4.1-2b-nar/encoder.onnx --output exports/granite-speech-4.1-2b-nar/encoder_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b-nar/editor.onnx --output exports/granite-speech-4.1-2b-nar/editor_int8.onnx
# FP16w (weights-FP16, FP32 compute - no exclusions needed):
python convert_fp.py --precision fp16w --input exports/granite-speech-4.1-2b-nar/encoder.onnx --output exports/granite-speech-4.1-2b-nar/encoder_fp16w.onnx
python convert_fp.py --precision fp16w --input exports/granite-speech-4.1-2b-nar/editor.onnx --output exports/granite-speech-4.1-2b-nar/editor_fp16w.onnx
Licence
Apache 2.0 for both the upstream IBM model and this ONNX export. See
LICENSE for the full text.
Model tree for smcleod/ibm-granite-speech-4.1-2b-nar-onnx
Base model
ibm-granite/granite-4.0-1b-base