IBM Granite Speech 4.1 2b - ONNX export

ONNX export of ibm-granite/granite-speech-4.1-2b produced by Sam McLeod (https://smcleod.net). Repository: smcleod/ibm-granite-speech-4.1-2b-onnx. Both FP32 and INT8 weight-only graphs are included. The graphs target opset 20, IR 10, ai.onnx operators only - no com.microsoft ops - so they load under the ort 2.0-rc.x Rust crate as well as standard onnxruntime 1.17 - 1.25.

Additional precision tiers are in progress: a statically calibrated INT8 variant (better quality than the dynamic INT8 already in this repo) and a half-precision encoder are in active development. The repo will be updated once those graphs pass the multi-clip parity gate.

Three graphs cooperate: encoder.onnx projects mel features to audio embeddings; prompt_encode.onnx runs the LLM forward over the full prompt (text tokens + projected audio embeds) and returns the first-token logits plus a 40-layer KV cache; decode_step.onnx consumes one token at a time plus the past KV cache and emits the next logits.
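The hand-off between the prompt and decode graphs can be sketched as a greedy decode loop. This is a minimal sketch, not the runner used for this export: `run_prompt_encode` and `run_decode_step` are hypothetical callables standing in for the ONNX sessions (the first is assumed to already hold the prompt with the encoder output merged in), and the real tensor names, KV cache layout, and end-of-sequence id have to come from granite_export_metadata.json and the tokeniser files.

```python
from typing import Callable, List, Sequence, Tuple

Logits = Sequence[float]
KVCache = object  # opaque in this sketch; the real cache is a set of per-layer tensors


def argmax(values: Logits) -> int:
    """Index of the largest logit (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(
    run_prompt_encode: Callable[[], Tuple[Logits, KVCache]],
    run_decode_step: Callable[[int, KVCache], Tuple[Logits, KVCache]],
    eos_token_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy loop: one prompt pass, then one decode_step call per new token."""
    # Prompt stage: a single pass over the full prompt gives the
    # first-token logits and the initial KV cache.
    logits, cache = run_prompt_encode()
    tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(logits)
        if token == eos_token_id:  # hypothetical stop condition
            break
        tokens.append(token)
        # Decode stage: feed the chosen token plus the past cache,
        # receive the next logits and the updated cache.
        logits, cache = run_decode_step(token, cache)
    return tokens
```

Passing the stages in as callables keeps the loop independent of the runtime, so the same control flow applies whether the sessions come from onnxruntime Python or another binding.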

The audio placeholder token id is 100352. Replace those positions in the prompt with the projector outputs from encoder.onnx before running prompt_encode.onnx.
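The replacement step can be illustrated with plain Python lists. This is a sketch under assumptions: it presumes the caller has already looked up text embeddings for every prompt token and that the encoder output has one row per placeholder position. The function name and argument layout are hypothetical, and the exact inputs prompt_encode.onnx expects are documented in granite_export_metadata.json, not here.

```python
from typing import List, Sequence

AUDIO_TOKEN_ID = 100352  # placeholder id stated in this README


def splice_audio_embeddings(
    input_ids: Sequence[int],
    text_embeds: Sequence[Sequence[float]],
    audio_embeds: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return prompt embeddings with placeholder rows replaced by audio rows."""
    if len(input_ids) != len(text_embeds):
        raise ValueError("input_ids and text_embeds must have the same length")
    positions = [i for i, tok in enumerate(input_ids) if tok == AUDIO_TOKEN_ID]
    if len(positions) != len(audio_embeds):
        raise ValueError(
            f"{len(positions)} placeholder positions but "
            f"{len(audio_embeds)} audio embedding rows"
        )
    merged = [list(row) for row in text_embeds]  # copy; leave the input untouched
    for pos, audio_row in zip(positions, audio_embeds):
        merged[pos] = list(audio_row)
    return merged
```

The length check matters in practice: a mismatch between the number of placeholder tokens and the number of encoder output rows usually means the prompt was built for a different audio length.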

Files

encoder.onnx + encoder.onnx_data (FP32) and encoder_int8.onnx + encoder_int8.onnx_data (INT8 weight-only quantisation)
prompt_encode.onnx + prompt_encode.onnx_data (FP32) and prompt_encode_int8.onnx + prompt_encode_int8.onnx_data (INT8 weight-only quantisation)
decode_step.onnx + decode_step.onnx_data (FP32) and decode_step_int8.onnx + decode_step_int8.onnx_data (INT8 weight-only quantisation)
Tokeniser / processor: tokenizer.json, tokenizer_config.json, processor_config.json, chat_template.jinja, special_tokens_map.json, preprocessor_config.json
Export scripts: export_speech_2b_ar.py, quantise.py
granite_export_metadata.json (graph IO, parity numbers, toolchain)
LICENSE (Apache 2.0)
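Each graph stores its weights in an external .onnx_data sidecar, which needs to sit next to the .onnx file when the graph is loaded. A quick way to confirm that a download is complete is sketched below; the helper names are hypothetical, and the expected file list is derived only from the names in the list above.

```python
from pathlib import Path
from typing import List

GRAPH_STEMS = ["encoder", "prompt_encode", "decode_step"]


def expected_graph_files() -> List[str]:
    """FP32 and INT8 graph files plus their external-data sidecars."""
    names: List[str] = []
    for stem in GRAPH_STEMS:
        for suffix in ("", "_int8"):
            names.append(f"{stem}{suffix}.onnx")
            names.append(f"{stem}{suffix}.onnx_data")
    return names


def missing_graph_files(bundle_dir: str) -> List[str]:
    """Return the expected graph files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in expected_graph_files() if not (root / name).is_file()]
```

An empty list from `missing_graph_files` means all twelve graph and sidecar files are present; it does not check file contents.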

Parity

Parity is taken against the upstream PyTorch reference on a single LibriSpeech clip (10226_10111_000000.wav, 8.43 seconds, 844 mel frames). FP32 graphs match the reference within numeric tolerance. INT8 graphs are validated in argmax-only mode: logit values shift, so the checks are the token argmax and the decoded transcript, which is unchanged on this clip.

Encoder (numeric output, no argmax decoding):

precision	max-abs-err	mean-abs-err	p99-abs-err
FP32	4.48e-06	1.24e-07	6.46e-07
INT8	0.169	0.0109	0.0447
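The three columns above are element-wise absolute-error statistics between the reference and ONNX encoder outputs. A minimal sketch of how such statistics can be computed over flattened outputs follows. The nearest-rank percentile used here is an assumption; the parity script may use a different method (NumPy's default, for example, interpolates linearly), so values can differ slightly in the last digits.

```python
from typing import Dict, Sequence


def abs_error_stats(
    reference: Sequence[float], candidate: Sequence[float]
) -> Dict[str, float]:
    """Max, mean, and 99th-percentile absolute error between two flat arrays."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sorted(abs(r - c) for r, c in zip(reference, candidate))
    n = len(errors)
    # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
    rank = (99 * n + 99) // 100
    return {
        "max_abs_err": errors[-1],
        "mean_abs_err": sum(errors) / n,
        "p99_abs_err": errors[rank - 1],
    }
```

Reporting p99 alongside the maximum helps separate a few outlier elements from a broad shift, which is the pattern the INT8 row shows (maximum 0.169 against p99 0.0447).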

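LLM stages (argmax decoding; INT8 logit max-abs deltas are large, and prompt_encode INT8 shows argmax mismatches at 58 of 190 positions, but the decoded transcript is unchanged on this clip):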

graph	precision	max-abs-err	argmax mismatches	transcript match
prompt_encode	FP32	0.000364	0/190	Y
prompt_encode	INT8	10.1	58/190	Y
decode_step	FP32	n/a	0/51	Y
decode_step	INT8	5.76	0/51	Y
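The "argmax mismatches" column counts positions where the highest-scoring token differs between reference and ONNX logits. The sketch below shows one way to compute such a count. It assumes one row of logits per position or step; what the denominators 190 and 51 cover exactly is recorded in granite_export_metadata.json rather than in this README, and the function name is hypothetical.

```python
from typing import Sequence


def count_argmax_mismatches(
    reference_logits: Sequence[Sequence[float]],
    candidate_logits: Sequence[Sequence[float]],
) -> int:
    """Number of rows whose argmax token differs between the two logit sets."""
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("both inputs must have the same number of rows")

    def argmax(row: Sequence[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)

    return sum(
        1
        for ref_row, cand_row in zip(reference_logits, candidate_logits)
        if argmax(ref_row) != argmax(cand_row)
    )
```

A large max-abs logit delta does not by itself flip the argmax: if every logit in a row shifts by a similar amount, the winning token stays the same, which is why the two columns are reported separately.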

Multi-clip transcript parity

Three additional 16 kHz mono clips cover longer utterances (39 to 94 seconds) with single-speaker and two-speaker conversational content. Word error rate (WER) and Levenshtein edit distance (Lev) are computed against the upstream PyTorch (PT) reference. All numbers are measured end-to-end through the full ONNX pipeline (no PyTorch encoder fallback).

Clip	Duration	FP32 byte-exact vs PT	INT8 byte-exact vs PT	INT8 WER vs PT	INT8 vs FP32 Lev
is-it-more-wood	46.9 s	Y	N	1.4%	2
two-speakers-1	93.8 s	Y	N	1.0%	12
two-speakers-2	38.8 s	Y	N	23.5%	26

Raw multi-clip data including full transcripts: see granite_export_metadata.json multi_clip_parity block.
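For readers who want to recompute the WER and edit-distance columns from the stored transcripts, a standard-library sketch follows. It is an illustration under assumptions: the text normalisation (case, punctuation) and the unit of the Lev column (characters or words) used for the table are not stated here, so this version splits on whitespace for WER and leaves the choice of unit for Levenshtein to the caller.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

Calling `levenshtein` on two strings gives a character-level distance, while passing word lists gives a word-level one.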

Reference transcript:

After his nap, Timothy lazily stretched, first one gray velvet foot, then another, strolled indolently to his plate, turning over the food, carefully selecting choice bits, nosing out that which he scorned upon the clean hearth

Both FP32 and INT8 paths reproduce this transcript exactly on the single-clip parity test (10226_10111_000000.wav).

Toolchain

transformers 5.8.0
torch 2.11.0
onnx 1.21.0
onnxruntime 1.25.1
exporter: torch.onnx.export TorchScript path (dynamo=False)
opset: 20 (ai.onnx only)
IR version: 10
external data layout: single <stem>.onnx_data sidecar per graph

Compatibility

The graphs are targeted at the ort 2.0-rc.x Rust crate and are compatible with onnxruntime Python 1.17 through 1.25. No com.microsoft ops are used. Graphs were emitted via the TorchScript path (torch.onnx.export(..., dynamo=False)); the dynamo exporter was deliberately avoided because it injects aten::* ops that ort does not understand.

Reproducing the export

The included export_speech_2b_ar.py and quantise.py scripts regenerate every artefact in this bundle. From a checkout of https://github.com/sammcj/granite-speech-4.1-onnx:

python export_speech_2b_ar.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b> \
    --out-dir exports/granite-speech-4.1-2b
python quantise.py --input exports/granite-speech-4.1-2b/encoder.onnx       --output exports/granite-speech-4.1-2b/encoder_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/prompt_encode.onnx --output exports/granite-speech-4.1-2b/prompt_encode_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/decode_step.onnx   --output exports/granite-speech-4.1-2b/decode_step_int8.onnx

Sandboxed environments may need:

HF_HOME=$TMPDIR/hf_home HF_MODULES_CACHE=$TMPDIR/hf_modules <command above>

Licence

Apache 2.0 for both the upstream IBM model and this ONNX export. See LICENSE for the full text.