IBM Granite Speech 4.1 2b NAR - ONNX export

ONNX export of ibm-granite/granite-speech-4.1-2b-nar produced by smcleod. Three precision tiers (fp32/, int8/, fp16w/) ship in this repo - see Files below for sizes and trade-offs. The graphs target opset 20 / IR 10 / ai.onnx-only, so they load under the ort 2.0-rc.x Rust crate and onnxruntime 1.17 - 1.25.

Three graphs and a host-side splice: encoder.onnx runs the conformer + CTC heads + BPE-collapsing projector and emits bpe_logits_dense plus pooled audio embeddings. embed_tokens.onnx looks up text-token embeddings for the CTC draft (with insertion slots). editor.onnx runs the bidirectional NLE editor over the concatenation of audio embeddings and text-with-slots embeddings and emits per-position vocab logits. Decoding is a single argmax pass; no KV cache, no autoregression. See How to use for the slot-insertion algorithm.

Files

Each precision tier ships in its own subdirectory (fp32/, int8/, fp16w/). Inside, files use the clean stem (no precision suffix) - the directory name carries the tier. Download a single subdirectory if you only need one precision; the tokeniser, processor, scripts, and metadata at the bundle root are shared across all tiers.
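
One way to fetch a single tier is to filter the download by file pattern. The sketch below assumes the `huggingface_hub` package and assumes the repo id is `smcleod/ibm-granite-speech-4.1-2b-nar-onnx`; `tier_patterns` and `download_tier` are illustrative names, not part of this bundle. The shared list covers the tokeniser, processor, and metadata files only; add the scripts or test_fixtures/ if you need them.

```python
from typing import List

# Assumption: repo id as shown on the model page.
REPO_ID = "smcleod/ibm-granite-speech-4.1-2b-nar-onnx"

# Shared files at the bundle root (tokeniser, processor, metadata).
SHARED_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "preprocessor_config.json",
    "granite_export_metadata.json",
]


def tier_patterns(tier: str) -> List[str]:
    """Glob patterns selecting one precision tier plus the shared root files."""
    if tier not in ("fp32", "int8", "fp16w"):
        raise ValueError(f"unknown tier: {tier!r}")
    return [f"{tier}/*"] + SHARED_FILES


def download_tier(tier: str) -> str:
    """Download one tier and return the local snapshot directory."""
    # Imported lazily so tier_patterns() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, allow_patterns=tier_patterns(tier))
```

Calling `download_tier("fp16w")` would then pull only fp16w/ and the shared files.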

`fp32/` - FP32 (reference, full precision) - 9.8 GB total

Use when you need the closest parity with the upstream PyTorch reference (FP32 graphs match it within numeric tolerance, see Parity), or as a baseline for quantisation/conversion experiments.

fp32/encoder.onnx + fp32/encoder.onnx_data
fp32/editor.onnx + fp32/editor.onnx_data
fp32/embed_tokens.onnx + fp32/embed_tokens.onnx_data

`int8/` - INT8 (smallest) - 2.5 GB total

Dynamic weights-only INT8 (MatMulInteger + ConvInteger, all ai.onnx). Mild quality drop on case/punctuation but transcripts remain semantically accurate. Choose when disk or memory is tight.

int8/encoder.onnx + int8/encoder.onnx_data
int8/editor.onnx + int8/editor.onnx_data
int8/embed_tokens.onnx + int8/embed_tokens.onnx_data

`fp16w/` - FP16w (recommended for highest quality at smaller-than-FP32 size) - 4.9 GB total

Weights-FP16 with FP32 compute and IO. Each FP32 initializer is rewritten to FP16 storage with a Cast(FP16->FP32) inserted before each consumer; arithmetic and IO stay FP32. Quality is essentially identical to FP32: on the multi-clip set under Parity, norm WER is 0.00% to 0.34% for FP16w versus 0.96% to 2.05% for INT8, at 50% of FP32 storage. Choose when you have the disk and want FP32-grade transcripts.

fp16w/encoder.onnx + fp16w/encoder.onnx_data
fp16w/editor.onnx + fp16w/editor.onnx_data
fp16w/embed_tokens.onnx + fp16w/embed_tokens.onnx_data

Shared (used by every precision tier)

Tokeniser / processor: tokenizer.json, tokenizer_config.json, special_tokens_map.json, preprocessor_config.json
Export scripts: export_nar_encoder.py, export_nar_editor.py, export_embed_tokens.py, quantise.py, convert_fp.py
granite_export_metadata.json (graph IO, parity numbers, toolchain)
LICENSE (Apache 2.0)
test_fixtures/ - golden inputs/outputs for integration testing. See test_fixtures/README.md.

How to use end-to-end

Audio frontend (shared across variants)

Same 16 kHz log-mel frontend as the AR variants (see the bundled preprocessor_config.json). NAR's encoder additionally wants an attention_mask [B, T] int64 (1 = valid, 0 = padding). The included test_fixtures/expected_attention_mask.npy and test_fixtures/expected_input_features.npy are reference outputs from the upstream AutoFeatureExtractor for verifying your frontend.
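
For a padded batch, the mask can be built from per-item valid lengths. This is a minimal NumPy sketch, assuming you already know how many frames of each item are real; `make_attention_mask` is a hypothetical helper, and T is whatever time dimension your input_features has.

```python
import numpy as np


def make_attention_mask(frame_lengths, max_frames=None):
    """Return a [B, T] int64 mask: 1 for valid frames, 0 for padding."""
    lengths = np.asarray(frame_lengths, dtype=np.int64)
    T = int(lengths.max()) if max_frames is None else int(max_frames)
    # Broadcast a [1, T] frame index against [B, 1] lengths.
    return (np.arange(T)[None, :] < lengths[:, None]).astype(np.int64)


mask = make_attention_mask([5, 3])
print(mask)
# [[1 1 1 1 1]
#  [1 1 1 0 0]]
```

To check a frontend against the fixtures, compare its mask with `np.load("test_fixtures/expected_attention_mask.npy")` using `np.array_equal`.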

Call sequence (non-autoregressive)

Three graphs and a host-side splice; no KV cache, no per-token loop.

1. encoder.onnx       (input_features, attention_mask)
                                          -> bpe_logits_dense, bpe_mask, audio_embeds, audio_lengths, char_logits
2. CTC decode (host)  (bpe_logits_dense + bpe_mask)
                                          -> draft text token IDs (greedy + collapse blanks/dupes)
3. embed_tokens.onnx  (text token IDs with insertion-slot tokens) -> text_embeds
4. splice (host)      concat(audio_embeds[:audio_len], text_embeds) -> inputs_embeds [1, N, 2048]
5. editor.onnx        (inputs_embeds, position_ids, 4-D zero attention_mask)
                                          -> logits [1, N, 100352]
6. argmax + slice (host) over the text segment of `logits` -> final token IDs -> tokenizer.decode

bpe_logits_dense (shape [B, T_bpe, V_bpe]) is the head used downstream. char_logits is exposed for diagnostics but not part of the inference path.
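
Step 2 can be sketched as a standard greedy CTC decode. The blank id is not stated in this README, so it is a parameter here; the sketch also assumes bpe_mask is nonzero on valid positions and that the function receives one batch item. `ctc_greedy_decode` is an illustrative name.

```python
import numpy as np


def ctc_greedy_decode(bpe_logits_dense, bpe_mask, blank_id):
    """Greedy CTC decode for one batch item.

    bpe_logits_dense: [T_bpe, V_bpe] float array.
    bpe_mask:         [T_bpe], nonzero = valid position (assumption).
    blank_id:         CTC blank index (assumption: supplied by the caller).
    """
    ids = np.argmax(bpe_logits_dense, axis=-1)
    ids = ids[np.asarray(bpe_mask).astype(bool)]
    out, prev = [], None
    for tok in ids.tolist():
        # Keep a token only when it starts a new run and is not the blank.
        if tok != prev and tok != blank_id:
            out.append(tok)
        prev = tok
    return out


# Toy input: one-hot logits for the frame ids below, last frame masked out.
frame_ids = [0, 5, 5, 0, 5, 7, 7, 9]
toy_logits = np.eye(10, dtype=np.float32)[frame_ids]
toy_mask = np.array([1, 1, 1, 1, 1, 1, 1, 0])
draft = ctc_greedy_decode(toy_logits, toy_mask, blank_id=0)
print(draft)  # [5, 5, 7]
```

The blank between the two runs of 5 keeps them as separate tokens, which is the usual CTC behaviour.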

Slot-insertion algorithm (NAR)

add_insertion_slots interleaves the LLM's eos_token_id between every CTC draft token (and at the boundaries), giving the editor a fixed insertion slot to rewrite or expand each span. Reproduce it directly without the upstream class:

def add_insertion_slots(t, eos_id):
    # t: list/tensor of CTC draft token ids (after greedy argmax + blank-collapse).
    # eos_id: tokenizer.eos_token_id (read from tokenizer_config.json).
    n = len(t)
    out_len = max(2 * n + 1, 8)
    out = [eos_id] * out_len
    for i in range(n):
        out[2 * i + 1] = t[i]
    return out

Then plug it in:

slots    = add_insertion_slots(t, eos_id)             # length max(2n+1, 8)
text_emb = embed_tokens.onnx(slots)                    # [1, len(slots), 2048]
audio    = audio_embeds[:audio_len]                    # [audio_len, 2048]
N        = audio_len + len(slots)
flat     = concat([audio, text_emb[0]], dim=0)         # [N, 2048]
position = arange(N)
attn     = zeros([1, 1, N, N], float32)                # bidirectional, no masking

logits   = editor.onnx(flat.unsqueeze(0), position.unsqueeze(0), attn)

The text segment of logits is logits[:, audio_len:, :]; argmax over that segment yields the final token IDs, which are then decoded with the LLM tokeniser. The 4-D attention_mask is identically zero (additive-mask convention: 0 = unmasked, -inf = masked), but the graph still expects it as an explicit input.
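
The splice and the final slice can be written with plain NumPy. This is a sketch under assumptions: position_ids is taken to be int64 (the dtype is not stated here), the helper names are hypothetical, and a small stand-in array replaces the real editor.onnx output so the example runs without the model.

```python
import numpy as np

HIDDEN = 2048  # embedding width from the call sequence


def build_editor_inputs(audio_embeds, audio_len, text_embeds):
    """audio_embeds: [T_audio, 2048]; text_embeds: [len(slots), 2048]."""
    audio = audio_embeds[:audio_len]
    flat = np.concatenate([audio, text_embeds], axis=0).astype(np.float32)
    n = flat.shape[0]
    inputs_embeds = flat[None, :, :]                           # [1, N, 2048]
    position_ids = np.arange(n, dtype=np.int64)[None, :]       # [1, N]
    attention_mask = np.zeros((1, 1, n, n), dtype=np.float32)  # additive, all unmasked
    return inputs_embeds, position_ids, attention_mask


def text_token_ids(logits, audio_len):
    """logits: [1, N, V]. Argmax over the text segment only."""
    return np.argmax(logits[0, audio_len:, :], axis=-1)


# slots as produced by add_insertion_slots([11, 12, 13], eos_id=0)
slots = [0, 11, 0, 12, 0, 13, 0, 0]
audio_embeds = np.ones((10, HIDDEN), dtype=np.float32)
text_embeds = np.zeros((len(slots), HIDDEN), dtype=np.float32)
inputs_embeds, position_ids, attention_mask = build_editor_inputs(
    audio_embeds, audio_len=6, text_embeds=text_embeds
)

# Stand-in for the editor output, with a tiny vocabulary of 5.
fake_logits = np.zeros((1, 14, 5), dtype=np.float32)
for i in range(len(slots)):
    fake_logits[0, 6 + i, i % 5] = 1.0
final_ids = text_token_ids(fake_logits, audio_len=6)
```

Here N is 6 + 8 = 14, so the mask is [1, 1, 14, 14] and `final_ids` has one entry per slot position.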

Note: if config.scale_projected_embeddings is set on the upstream config, divide audio_embeds by config.embedding_multiplier before splicing. The shipped encoder graph does not bake in that division; do it host-side.
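
The host-side division amounts to one conditional. In this sketch the two config values are passed in explicitly; reading them from the upstream config is left to the caller, and `unscale_audio_embeds` is a hypothetical name.

```python
import numpy as np


def unscale_audio_embeds(audio_embeds, scale_projected_embeddings, embedding_multiplier):
    """Divide by embedding_multiplier only when the upstream flag is set."""
    if scale_projected_embeddings:
        return audio_embeds / np.float32(embedding_multiplier)
    return audio_embeds


example = unscale_audio_embeds(np.full((2, 4), 8.0, dtype=np.float32), True, 4.0)
```

Apply it to audio_embeds before concatenating with the text embeddings.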

Tokeniser

We ship tokenizer.json + tokenizer_config.json from ibm-granite/granite-speech-4.1-2b-nar. NAR has no chat template (no chat_template.jinja); the tokeniser is used directly on the draft text and on the editor output.

Runtime / EP notes

ai.onnx-only at opset 20; no com.microsoft.* ops. Load under ort 2.0-rc.x or onnxruntime 1.17 - 1.25.
CoreML EP: opset 20 contains ops without CoreML kernels; ORT falls back to CPU silently at session-load. FP16w is the better fit for MPS-targeted inference.
CUDA / CPU: work out of the box across all three tiers.
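
A small helper can make the provider choice explicit. The sketch below uses `onnxruntime.get_available_providers()` and `InferenceSession.get_providers()`; `pick_providers` and `open_session` are illustrative names. Note that `get_providers()` only reports which providers the session registered; it does not show per-node CPU fallback.

```python
from typing import List, Sequence

PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Sequence[str], preferred: Sequence[str] = PREFERRED) -> List[str]:
    """Keep preferred providers that are available, always ending with CPU."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def open_session(model_path: str):
    """Open one graph and print the providers the session registered."""
    import onnxruntime as ort  # imported lazily; requires onnxruntime

    providers = pick_providers(ort.get_available_providers())
    sess = ort.InferenceSession(model_path, providers=providers)
    print(model_path, "->", sess.get_providers())
    return sess
```

For example, `open_session("fp16w/encoder.onnx")` would load the FP16w encoder with CUDA when it is available and CPU otherwise.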

How the tiers are produced

INT8 is dynamic, weights-only, per-channel QInt8 over MatMul + Conv ops. The quantiser emits MatMulInteger + ConvInteger and leaves activations in FP32. The unquantised ~22% of MatMul nodes in the LLM body graphs are activation x activation (attention QK^T and attention_weights x V); dynamic weight-only INT8 cannot quantise those, so this is the expected ceiling, not a coverage gap.
FP16w stores weights as FP16 initializers with a Cast(FP16->FP32) inserted before each consumer, so arithmetic and IO stay FP32. Quality matches FP32 within numeric tolerance at ~50% of FP32 storage.
embed_tokens is shipped as its own graph in all three tiers. INT8 uses per-row symmetric quantisation rather than the dynamic MatMul/Conv quantiser (Gather is not in that op set), giving the embedding table its own ~4x storage win at INT8.
No com.microsoft.* ops are used. Re-validate the op-domain set with assert_pure_ai_onnx in quantise.py / convert_fp.py after any change.
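
The two storage schemes can be illustrated on a toy table. This NumPy sketch is not the code in convert_fp.py or export_embed_tokens.py; it assumes one FP32 scale per row with a zero-point of 0 for the INT8 case, and all function names are hypothetical.

```python
import numpy as np


def fp16w_roundtrip(w):
    """Store as FP16, then cast back to FP32 before use (the role of the inserted Cast)."""
    return w.astype(np.float16).astype(np.float32)


def quantise_rows_int8(table):
    """Per-row symmetric INT8: q = round(w / scale), scale = max|row| / 127."""
    max_abs = np.abs(table).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(table / scale), -127, 127).astype(np.int8)
    return q, scale


def lookup(q, scale, ids):
    """Gather rows first, then dequantise only the gathered rows."""
    return q[ids].astype(np.float32) * scale[ids]


rng = np.random.default_rng(0)
table = rng.standard_normal((16, 8)).astype(np.float32)

q, scale = quantise_rows_int8(table)
ids = np.array([3, 7])
approx = lookup(q, scale, ids)

print("int8 bytes / fp32 bytes:", q.nbytes / table.nbytes)  # 0.25, before scales
print("fp16w max abs error:", np.abs(fp16w_roundtrip(table) - table).max())
```

Each dequantised value is within half a quantisation step (scale / 2) of the original, and the INT8 table is a quarter of the FP32 size before the per-row scales are counted.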

Parity

Parity is taken against the upstream PyTorch reference on a single LibriSpeech clip (10226_10111_000000.wav, 8.43 seconds, 844 mel frames). FP32 graphs match the reference within numeric tolerance. INT8 graphs are validated in argmax-only mode: logit values shift, so the check is on token argmax and on the decoded transcript, which is unchanged.

| graph | precision | max-abs-err | argmax mismatches | transcript match |
| --- | --- | --- | --- | --- |
| encoder (bpe_logits_dense) | FP32 | 0.00204 | 0/211 | n/a |
| encoder (bpe_logits_dense) | INT8 | 1.84 | 0/211 | n/a |
| editor | FP32 | 0.00147 | 0/257 | Y |
| editor | INT8 | 94.5 | 15/257 | Y |

INT8 note: the encoder graph emits two CTC heads. bpe_logits_dense (used downstream) holds argmax-stable through quantisation; char_logits (unused downstream) drifts noticeably and is not part of the inference path. The editor INT8 graph reproduces the reference transcript despite a max-abs logit delta of 94.5 and 15/257 per-position argmax mismatches (see the table above).
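
The two numeric columns in the table can be computed as follows. This is a sketch of the metrics as they are usually defined, not the exact code behind the reported numbers; `parity_metrics` is a hypothetical helper.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Return (max_abs_err, argmax_mismatches, positions) for two logit arrays."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    max_abs_err = float(np.max(np.abs(reference - candidate)))
    ref_ids = reference.argmax(axis=-1)
    cand_ids = candidate.argmax(axis=-1)
    mismatches = int(np.count_nonzero(ref_ids != cand_ids))
    return max_abs_err, mismatches, int(ref_ids.size)


# Toy logits: two positions, vocabulary of three.
ref = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]
cand = [[0.6, 1.5, 0.3], [1.0, 1.2, 0.1]]
err, mismatches, positions = parity_metrics(ref, cand)
print(f"max-abs-err={err:.3f} argmax mismatches={mismatches}/{positions}")
```

In the toy case the second position flips from index 0 to index 1, giving one mismatch out of two positions.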

Multi-clip transcript parity

Three additional 16 kHz mono clips cover longer utterances (39 to 94 seconds) with single-speaker and two-speaker conversational content. Word error rate (WER) and Levenshtein edit distance are computed against the upstream PyTorch reference. Numbers are measured end-to-end through the full ONNX pipeline (no PyTorch encoder fallback).

WER is the strict word-error rate against the PyTorch reference (case + punctuation sensitive). norm WER lower-cases both transcripts and strips punctuation before comparing - the dominant driver of strict WER on this model at INT8 is capitalisation and trailing punctuation drift, not actual word substitution. Pick whichever metric matches your downstream task. FP16w is essentially FP32 quality at 50% of FP32 storage; INT8 is the smallest tier with a mild quality drop.

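| Clip | Duration | FP32 byte-exact | INT8 byte-exact | INT8 WER | INT8 norm WER | FP16w byte-exact | FP16w WER | FP16w norm WER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| is-it-more-wood | 46.9 s | Y | N | 4.3% | 2.05% | Y | 0.0% | 0.00% |
| two-speakers-1 | 93.8 s | N | N | 3.5% | 1.72% | N | 0.4% | 0.34% |
| two-speakers-2 | 38.8 s | Y | N | 5.1% | 0.96% | Y | 0.0% | 0.00% |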
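
Strict WER and norm WER differ only in the text normalisation applied before the word-level edit distance. The sketch below uses a plain Levenshtein distance over whitespace-split words and strips ASCII punctuation via `string.punctuation`; the exact normalisation behind the reported numbers may differ, and the function names are hypothetical.

```python
import string


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (substitute, insert, delete all cost 1)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Strict WER: case and punctuation sensitive."""
    ref = reference.split()
    return word_edit_distance(ref, hypothesis.split()) / max(len(ref), 1)


def normalise(text):
    """Lower-case and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def norm_wer(reference, hypothesis):
    return wer(normalise(reference), normalise(hypothesis))


ref_text = "After his nap, Timothy stretched."
hyp_text = "after his nap Timothy stretched"
strict = wer(ref_text, hyp_text)         # 0.6 (three of five words differ)
relaxed = norm_wer(ref_text, hyp_text)   # 0.0
```

The example shows the effect described above: capitalisation and punctuation drift alone produce a non-zero strict WER while norm WER stays at zero.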

Raw multi-clip data including full transcripts: see granite_export_metadata.json multi_clip_parity block.

Reference transcript for the single LibriSpeech parity clip (10226_10111_000000.wav):

after his nap timothy lazily stretched first one gray velvet foot then another strolled indolently to his plate turning over the food carefully selecting choice bits nosing out that which he scorned upon the clean hearth

The FP32 and FP16w paths reproduce this transcript exactly on the test clip. The INT8 path also reproduces it (transcript match Y in the Parity table), although its logits are validated in argmax-only mode rather than by value.

Toolchain

transformers 5.8.0
torch 2.11.0
onnx 1.21.0
onnxruntime 1.25.1
exporter: torch.onnx.export TorchScript path (dynamo=False)
opset: 20 (ai.onnx only)
IR version: 10
external data layout: single <stem>.onnx_data sidecar per graph

Compatibility

Targeted at the ort 2.0-rc.x Rust crate. Compatible with onnxruntime Python 1.17 through 1.25. No com.microsoft ops are used. Graphs were emitted via the TorchScript path (torch.onnx.export(..., dynamo=False)); the dynamo exporter was deliberately avoided because it injects aten::* ops ort does not understand. See the Runtime / EP notes above for CoreML / CUDA / CPU specifics including which precision tier to pick per backend.

Reproducing the export

The included export scripts, quantise.py, and convert_fp.py regenerate every artefact in this bundle. The export pipeline writes flat-layout files into exports/<variant>/; the per-tier subdirectory layout you see in this repo is produced by scripts/stage_bundles.py (in the source tree at https://github.com/sammcj/granite-speech-4.1-onnx). From a checkout:

python export_nar_encoder.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b-nar> \
    --out-dir exports/granite-speech-4.1-2b-nar
python export_nar_editor.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b-nar> \
    --out-dir exports/granite-speech-4.1-2b-nar
python export_embed_tokens.py --variant nar

# INT8 (NAR variant: no exclusion - both AR-style exclusions regressed NAR norm WER).
# embed_tokens uses a hand-rolled per-row INT8 path baked into export_embed_tokens.py.
python quantise.py --input exports/granite-speech-4.1-2b-nar/encoder.onnx --output exports/granite-speech-4.1-2b-nar/encoder_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b-nar/editor.onnx  --output exports/granite-speech-4.1-2b-nar/editor_int8.onnx

# FP16w (weights-FP16, FP32 compute - no exclusions needed):
python convert_fp.py --precision fp16w --input exports/granite-speech-4.1-2b-nar/encoder.onnx --output exports/granite-speech-4.1-2b-nar/encoder_fp16w.onnx
python convert_fp.py --precision fp16w --input exports/granite-speech-4.1-2b-nar/editor.onnx  --output exports/granite-speech-4.1-2b-nar/editor_fp16w.onnx
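
The relationship between the flat export names and the per-tier layout can be expressed as a name mapping. This is a hypothetical illustration, not the logic of scripts/stage_bundles.py, and it covers the graph files named in the commands above only. Renaming an external-data sidecar also requires updating the location recorded inside the graph, which this sketch does not attempt.

```python
from pathlib import PurePosixPath

# Assumption: precision suffixes as used by the --output names above.
TIER_SUFFIXES = {"_int8": "int8", "_fp16w": "fp16w"}


def staged_path(flat_name: str) -> str:
    """Map a flat export file name to <tier>/<clean stem><extension>."""
    p = PurePosixPath(flat_name)
    for suffix, tier in TIER_SUFFIXES.items():
        if p.stem.endswith(suffix):
            return f"{tier}/{p.stem[: -len(suffix)]}{p.suffix}"
    # No precision suffix: treat as the FP32 reference graph.
    return f"fp32/{p.name}"


for name in ("encoder.onnx", "encoder_int8.onnx", "editor_fp16w.onnx"):
    print(name, "->", staged_path(name))
```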

Licence

Apache 2.0 for both the upstream IBM model and this ONNX export. See LICENSE for the full text.
