# IBM Granite Speech 4.1 2b - ONNX export

ONNX export of [`ibm-granite/granite-speech-4.1-2b`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b) produced by Sam McLeod. Repository: `smcleod/ibm-granite-speech-4.1-2b-onnx`.

Both FP32 and INT8 weight-only graphs are included. The graphs target opset 20, IR 10, `ai.onnx` operators only - no `com.microsoft` ops - so they load under the `ort` 2.0-rc.x Rust crate as well as standard `onnxruntime` 1.17 - 1.25.

> **Additional precision tiers in progress.** A statically-calibrated INT8 variant (better quality vs the dynamic INT8 already in this repo) and a half-precision encoder are in active development. The repo will be updated when those graphs pass the multi-clip parity gate.

Three graphs cooperate:

- `encoder.onnx` projects mel features to audio embeddings.
- `prompt_encode.onnx` runs the LLM forward over the full prompt (text tokens + projected audio embeds) and returns the first-token logits plus a 40-layer KV cache.
- `decode_step.onnx` consumes one token at a time plus the past KV cache and emits the next logits.

The audio placeholder token id is `100352`. Replace those positions in the prompt with the projector outputs from `encoder.onnx` before running `prompt_encode.onnx`.
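The placeholder substitution described above can be sketched with NumPy. This is a minimal illustration and not the export's own code: the function name `splice_audio_embeds`, the array layouts, and the assumption that the caller already holds one embedding row per prompt position are all hypothetical. The actual graph input names and shapes are recorded in `granite_export_metadata.json`.

```python
import numpy as np

AUDIO_TOKEN_ID = 100352  # audio placeholder id stated in this README


def splice_audio_embeds(input_ids, text_embeds, audio_embeds):
    """Return a copy of text_embeds with placeholder rows replaced by audio rows.

    Hypothetical layouts (assumptions, not taken from the export):
      input_ids:    (seq_len,) integer token ids of the prompt
      text_embeds:  (seq_len, hidden) one embedding row per prompt position
      audio_embeds: (n_audio, hidden) projector outputs from the encoder
    """
    input_ids = np.asarray(input_ids)
    mask = input_ids == AUDIO_TOKEN_ID
    n_slots = int(mask.sum())
    if n_slots != audio_embeds.shape[0]:
        raise ValueError(
            f"prompt has {n_slots} placeholder positions "
            f"but the encoder produced {audio_embeds.shape[0]} embeddings"
        )
    merged = np.array(text_embeds, copy=True)  # leave the caller's array untouched
    merged[mask] = audio_embeds                # rows are filled in prompt order
    return merged


# Toy data: five prompt positions, two of them audio placeholders.
ids = np.array([1, 2, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 3])
text = np.zeros((5, 4), dtype=np.float32)
audio = np.ones((2, 4), dtype=np.float32)
merged = splice_audio_embeds(ids, text, audio)
```

The count check is worth keeping in real code: a mismatch between the number of placeholder tokens and the number of encoder output frames usually means the prompt was built for a different clip length.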
## Files

- `encoder.onnx` + `encoder.onnx_data` (FP32) and `encoder_int8.onnx` + `encoder_int8.onnx_data` (INT8 weight-only quantisation)
- `prompt_encode.onnx` + `prompt_encode.onnx_data` (FP32) and `prompt_encode_int8.onnx` + `prompt_encode_int8.onnx_data` (INT8 weight-only quantisation)
- `decode_step.onnx` + `decode_step.onnx_data` (FP32) and `decode_step_int8.onnx` + `decode_step_int8.onnx_data` (INT8 weight-only quantisation)
- Tokeniser / processor: `tokenizer.json`, `tokenizer_config.json`, `processor_config.json`, `chat_template.jinja`, `special_tokens_map.json`, `preprocessor_config.json`
- Export scripts: `export_speech_2b_ar.py`, `quantise.py`
- `granite_export_metadata.json` (graph IO, parity numbers, toolchain)
- `LICENSE` (Apache 2.0)

## Parity

Parity is taken against the upstream PyTorch reference on a single LibriSpeech clip (`10226_10111_000000.wav`, 8.43 seconds, 844 mel frames). FP32 graphs match the reference within numeric tolerance; INT8 graphs are validated in argmax-only mode (logit values shift but token argmax is preserved, so the decoded transcript is unchanged).

Encoder (numeric output, no argmax decoding):

| precision | max-abs-err | mean-abs-err | p99-abs-err |
| --- | --- | --- | --- |
| FP32 | 4.48e-06 | 1.24e-07 | 6.46e-07 |
| INT8 | 0.169 | 0.0109 | 0.0447 |

LLM stages (argmax decoding; INT8 logit max-abs delta is large but argmax is preserved):

| graph | precision | max-abs-err | argmax mismatches | transcript match |
| --- | --- | --- | --- | --- |
| prompt_encode | FP32 | 0.000364 | 0/190 | Y |
| prompt_encode | INT8 | 10.1 | 58/190 | Y |
| decode_step | FP32 | n/a | 0/51 | Y |
| decode_step | INT8 | 5.76 | 0/51 | Y |

### Multi-clip transcript parity

Three additional 16 kHz mono clips cover longer utterances (39 to 94 seconds) with single- and two-speaker conversational content. Word error rate (WER) and Levenshtein edit distance are computed against the upstream PyTorch reference.
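For readers unfamiliar with the two metrics named above, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the script used to produce the reported numbers: this README does not say how transcripts were normalised or tokenised, nor whether the Levenshtein distance is counted over characters or words. The sketch uses a plain whitespace split for WER and accepts either strings or word lists for the edit distance. The example strings are made up.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenisation only
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


# Made-up strings for illustration only.
ref = "after his nap timothy lazily stretched"
hyp = "after his nap timothy lazy stretched"
print(f"WER: {wer(ref, hyp):.3f}")
print(f"character-level edit distance: {levenshtein(ref, hyp)}")
```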
Numbers are measured end-to-end through the full ONNX pipeline (no PyTorch encoder fallback).

| Clip | Duration | FP32 byte-exact vs PT | INT8 byte-exact vs PT | INT8 WER vs PT | INT8 vs FP32 Lev |
| --- | ---: | :---: | :---: | ---: | ---: |
| is-it-more-wood | 46.9 s | Y | N | 1.4% | 2 |
| two-speakers-1 | 93.8 s | Y | N | 1.0% | 12 |
| two-speakers-2 | 38.8 s | Y | N | 23.5% | 26 |

Raw multi-clip data including full transcripts: see the `multi_clip_parity` block in `granite_export_metadata.json`.

Reference transcript:

> After his nap, Timothy lazily stretched, first one gray velvet foot, then another, strolled indolently to his plate, turning over the food, carefully selecting choice bits, nosing out that which he scorned upon the clean hearth

Both FP32 and INT8 paths reproduce this transcript exactly on the test clip.

## Toolchain

- transformers 5.8.0
- torch 2.11.0
- onnx 1.21.0
- onnxruntime 1.25.1
- exporter: torch.onnx.export TorchScript path (dynamo=False)
- opset: 20 (`ai.onnx` only)
- IR version: 10
- external data layout: single `.onnx_data` sidecar per graph

## Compatibility

Targeted at the [`ort`](https://crates.io/crates/ort) 2.0-rc.x Rust crate. Compatible with `onnxruntime` Python 1.17 through 1.25. No `com.microsoft` ops are used. Graphs were emitted via the TorchScript path (`torch.onnx.export(..., dynamo=False)`); the dynamo exporter was deliberately avoided because it injects `aten::*` ops `ort` does not understand.

## Reproducing the export

The included `export_speech_2b_ar.py` and `quantise.py` scripts regenerate every artefact in this bundle.
From a checkout of :

```bash
python export_speech_2b_ar.py \
    --model-dir \
    --out-dir exports/granite-speech-4.1-2b
python quantise.py --input exports/granite-speech-4.1-2b/encoder.onnx --output exports/granite-speech-4.1-2b/encoder_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/prompt_encode.onnx --output exports/granite-speech-4.1-2b/prompt_encode_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/decode_step.onnx --output exports/granite-speech-4.1-2b/decode_step_int8.onnx
```

Sandboxed environments may need:

```bash
HF_HOME=$TMPDIR/hf_home HF_MODULES_CACHE=$TMPDIR/hf_modules
```

## Licence

Apache 2.0 for both the upstream IBM model and this ONNX export. See [`LICENSE`](LICENSE) for the full text.