IBM Granite Speech 4.1 2b - ONNX export
ONNX export of ibm-granite/granite-speech-4.1-2b produced by Sam McLeod
(https://smcleod.net). Repository: smcleod/ibm-granite-speech-4.1-2b-onnx. Both FP32 and INT8 weight-only
graphs are included. The graphs target opset 20, IR 10, ai.onnx operators
only - no com.microsoft ops - so they load under the ort 2.0-rc.x Rust
crate as well as standard onnxruntime 1.17 - 1.25.
Additional precision tiers are in progress: a statically-calibrated INT8 variant (intended to give better quality than the dynamic INT8 graphs already in this repo) and a half-precision encoder are in active development. The repo will be updated once those graphs pass the multi-clip parity gate.
Three graphs cooperate: encoder.onnx projects mel features to audio embeddings; prompt_encode.onnx runs the LLM forward over the full prompt (text tokens + projected audio embeds) and returns the first-token logits plus a 40-layer KV cache; decode_step.onnx consumes one token at a time plus the past KV cache and emits the next logits.
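The hand-off between prompt_encode.onnx and decode_step.onnx can be sketched as a greedy loop. This is a minimal illustration, not the author's runner: `prompt_encode` and `decode_step` are hypothetical callables standing in for wrappers around the two onnxruntime sessions, and the end-of-sequence id and token budget are assumptions to be taken from the tokeniser files. The real tensor names and shapes are recorded in granite_export_metadata.json.

```python
from typing import Any, Callable, List, Sequence, Tuple


def argmax(values: Sequence[float]) -> int:
    """Index of the largest logit (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_decode(
    prompt_encode: Callable[[], Tuple[Sequence[float], Any]],
    decode_step: Callable[[int, Any], Tuple[Sequence[float], Any]],
    eos_token_id: int,
    max_new_tokens: int = 256,
) -> List[int]:
    """Greedy decoding over a prompt pass followed by single-token steps.

    prompt_encode() -> (first-token logits, KV cache)          [hypothetical wrapper]
    decode_step(token, kv) -> (next logits, updated KV cache)  [hypothetical wrapper]
    """
    logits, kv_cache = prompt_encode()
    tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(logits)
        if token == eos_token_id:
            break
        tokens.append(token)
        # Feed the chosen token and the past KV cache back in for the next logits.
        logits, kv_cache = decode_step(token, kv_cache)
    return tokens
```

Passing the sessions in as callables keeps the loop independent of the graph IO names, which this card does not list.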
The audio placeholder token id is 100352. Replace those positions in the prompt with the projector outputs from encoder.onnx before running prompt_encode.onnx.
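The placeholder replacement can be sketched with NumPy. The sketch assumes the prompt has already been turned into a `(seq_len, hidden)` embedding array and that encoder.onnx yields one row per placeholder position. Both are assumptions about the graph IO, so check granite_export_metadata.json for the actual layout. The function name is hypothetical.

```python
import numpy as np

AUDIO_TOKEN_ID = 100352  # audio placeholder token id stated in this card


def splice_audio_embeddings(input_ids, text_embeds, audio_embeds,
                            audio_token_id=AUDIO_TOKEN_ID):
    """Return a copy of text_embeds with placeholder rows replaced by audio rows.

    input_ids:    (seq_len,) token ids of the full prompt
    text_embeds:  (seq_len, hidden) embeddings for those ids (assumed layout)
    audio_embeds: (n_audio, hidden) projector outputs from the encoder (assumed layout)
    """
    input_ids = np.asarray(input_ids)
    audio_embeds = np.asarray(audio_embeds)
    mask = input_ids == audio_token_id
    n_slots = int(mask.sum())
    if n_slots != audio_embeds.shape[0]:
        raise ValueError(
            f"prompt has {n_slots} placeholder positions but "
            f"{audio_embeds.shape[0]} audio embeddings were supplied"
        )
    merged = np.array(text_embeds, copy=True)
    # Boolean-mask assignment fills the placeholder rows in order.
    merged[mask] = audio_embeds
    return merged
```

The length check matters because a mismatch between placeholder count and encoder frames would otherwise fail only later, inside the graph.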
Files
- encoder.onnx + encoder.onnx_data (FP32) and encoder_int8.onnx + encoder_int8.onnx_data (INT8 weight-only quantisation)
- prompt_encode.onnx + prompt_encode.onnx_data (FP32) and prompt_encode_int8.onnx + prompt_encode_int8.onnx_data (INT8 weight-only quantisation)
- decode_step.onnx + decode_step.onnx_data (FP32) and decode_step_int8.onnx + decode_step_int8.onnx_data (INT8 weight-only quantisation)
- Tokeniser / processor: tokenizer.json, tokenizer_config.json, processor_config.json, chat_template.jinja, special_tokens_map.json, preprocessor_config.json
- Export scripts: export_speech_2b_ar.py, quantise.py
- granite_export_metadata.json (graph IO, parity numbers, toolchain)
- LICENSE (Apache 2.0)
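Because each graph keeps its weights in a separate .onnx_data sidecar, a download that misses a sidecar will typically fail only when the graph is loaded. A small pre-flight check along these lines can catch that earlier. The helper names are hypothetical; the file names are those listed above.

```python
from pathlib import Path
from typing import List

GRAPH_STEMS = ("encoder", "prompt_encode", "decode_step")


def expected_graph_files(precision: str = "fp32") -> List[str]:
    """File names for one precision tier: each graph plus its external-data sidecar."""
    suffix = {"fp32": "", "int8": "_int8"}[precision]
    names: List[str] = []
    for stem in GRAPH_STEMS:
        names.append(f"{stem}{suffix}.onnx")
        names.append(f"{stem}{suffix}.onnx_data")
    return names


def missing_graph_files(repo_dir: str, precision: str = "fp32") -> List[str]:
    """Return the expected files that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected_graph_files(precision)
            if not (root / name).is_file()]
```

An empty list from `missing_graph_files(local_dir, "int8")` means all three INT8 graphs and their sidecars are in place.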
Parity
Parity is taken against the upstream PyTorch reference on a single LibriSpeech
clip (10226_10111_000000.wav, 8.43 seconds, 844 mel frames). FP32 graphs
match the reference within numeric tolerance; INT8 graphs are validated in
argmax-only mode (logit values shift but token argmax is preserved, so the
decoded transcript is unchanged).
Encoder (numeric output, no argmax decoding):
| precision | max-abs-err | mean-abs-err | p99-abs-err |
|---|---|---|---|
| FP32 | 4.48e-06 | 1.24e-07 | 6.46e-07 |
| INT8 | 0.169 | 0.0109 | 0.0447 |
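For readers who want to recompute the three error columns on their own clips, a minimal NumPy sketch follows. It assumes the errors are taken element-wise over the whole output tensor and that the 99th percentile uses NumPy's default linear interpolation. The card does not state either detail, so results may differ slightly from the published numbers.

```python
import numpy as np


def abs_error_stats(reference, candidate):
    """Max, mean and 99th-percentile absolute error between two same-shape arrays."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    err = np.abs(ref - cand).ravel()
    return {
        "max_abs_err": float(err.max()),
        "mean_abs_err": float(err.mean()),
        "p99_abs_err": float(np.percentile(err, 99)),
    }
```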
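LLM stages (argmax decoding; INT8 logit max-abs delta is large, and prompt_encode INT8 shows argmax mismatches at 58 of 190 positions, but decode_step argmax is preserved and the decoded transcript is unchanged):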
| graph | precision | max-abs-err | argmax mismatches | transcript match |
|---|---|---|---|---|
| prompt_encode | FP32 | 0.000364 | 0/190 | Y |
| prompt_encode | INT8 | 10.1 | 58/190 | Y |
| decode_step | FP32 | n/a | 0/51 | Y |
| decode_step | INT8 | 5.76 | 0/51 | Y |
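The "argmax mismatches" column can be reproduced with a position-wise comparison of logits. The sketch below assumes both arrays are shaped `(positions, vocab)` and counts positions where the winning token differs. How the export script gathers the logits is not described here, so treat the function as illustrative.

```python
import numpy as np


def argmax_mismatches(reference_logits, candidate_logits):
    """Return (mismatching positions, total positions) for two logit arrays."""
    ref_ids = np.argmax(np.asarray(reference_logits), axis=-1)
    cand_ids = np.argmax(np.asarray(candidate_logits), axis=-1)
    if ref_ids.shape != cand_ids.shape:
        raise ValueError(f"shape mismatch: {ref_ids.shape} vs {cand_ids.shape}")
    mismatches = int(np.count_nonzero(ref_ids != cand_ids))
    return mismatches, int(ref_ids.size)
```

A large max-abs-err with zero mismatches, as in the decode_step INT8 row, means the logits moved but the top-ranked token at each step did not.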
Multi-clip transcript parity
Three additional 16 kHz mono clips cover longer utterances (39 to 94 seconds) with single-speaker and two-speaker conversational content. Word error rate (WER) and Levenshtein edit distance are computed against the upstream PyTorch (PT) reference. All numbers are measured end-to-end through the full ONNX pipeline (no PyTorch encoder fallback).
| Clip | Duration | FP32 byte-exact vs PT | INT8 byte-exact vs PT | INT8 WER vs PT | INT8 vs FP32 Lev |
|---|---|---|---|---|---|
| is-it-more-wood | 46.9 s | Y | N | 1.4% | 2 |
| two-speakers-1 | 93.8 s | Y | N | 1.0% | 12 |
| two-speakers-2 | 38.8 s | Y | N | 23.5% | 26 |
Raw multi-clip data including full transcripts: see granite_export_metadata.json multi_clip_parity block.
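For reference, WER and Levenshtein distance can be computed as sketched below. This assumes the conventional definitions: WER is word-level edit distance divided by the number of reference words, with no text normalisation. Whether the "Lev" column above is word-level or character-level is not stated, so the distance function accepts either a string or a list of words.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free when equal
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```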
Reference transcript for the single LibriSpeech clip (10226_10111_000000.wav):
After his nap, Timothy lazily stretched, first one gray velvet foot, then another, strolled indolently to his plate, turning over the food, carefully selecting choice bits, nosing out that which he scorned upon the clean hearth
Both the FP32 and INT8 paths reproduce this transcript exactly on that clip.
Toolchain
- transformers 5.8.0
- torch 2.11.0
- onnx 1.21.0
- onnxruntime 1.25.1
- exporter: torch.onnx.export TorchScript path (dynamo=False)
- opset: 20 (ai.onnx only)
- IR version: 10
- external data layout: single <stem>.onnx_data sidecar per graph
Compatibility
Targeted at the ort 2.0-rc.x Rust crate.
Compatible with onnxruntime Python 1.17 through 1.25. No com.microsoft
ops are used. Graphs were emitted via the TorchScript path
(torch.onnx.export(..., dynamo=False)); the dynamo exporter was deliberately
avoided because it injects aten::* ops ort does not understand.
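The "no com.microsoft ops" property can be checked from the operator domains of a graph's nodes. The helper below is a hypothetical, dependency-free sketch: it takes an iterable of domain strings and reports any that are not the default ai.onnx domain, which ONNX also writes as the empty string.

```python
from typing import Iterable, Set

# ONNX treats an empty domain string as the default "ai.onnx" operator set.
STANDARD_DOMAINS = frozenset({"", "ai.onnx"})


def non_standard_domains(node_domains: Iterable[str]) -> Set[str]:
    """Return the operator domains that fall outside the default ai.onnx set."""
    return {domain for domain in node_domains if domain not in STANDARD_DOMAINS}
```

With the onnx package installed, the domains could be collected with `(node.domain for node in onnx.load(path, load_external_data=False).graph.node)`. That expression covers top-level nodes only, so subgraphs inside control-flow ops would need a recursive walk.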
Reproducing the export
The included export_speech_2b_ar.py and quantise.py scripts regenerate every artefact in this
bundle. From a checkout of https://github.com/sammcj/granite-speech-4.1-onnx:
python export_speech_2b_ar.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b> \
    --out-dir exports/granite-speech-4.1-2b
python quantise.py --input exports/granite-speech-4.1-2b/encoder.onnx --output exports/granite-speech-4.1-2b/encoder_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/prompt_encode.onnx --output exports/granite-speech-4.1-2b/prompt_encode_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/decode_step.onnx --output exports/granite-speech-4.1-2b/decode_step_int8.onnx
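The three quantisation commands differ only in the graph stem, so they can be generated rather than typed. The following sketch builds the argument lists using only the `--input` and `--output` flags shown above. The helper name is hypothetical, and it assumes quantise.py is in the current working directory.

```python
import sys
from pathlib import Path
from typing import List

GRAPH_STEMS = ("encoder", "prompt_encode", "decode_step")


def quantise_commands(export_dir: str, script: str = "quantise.py") -> List[List[str]]:
    """Build one quantise.py invocation per FP32 graph in export_dir."""
    root = Path(export_dir)
    return [
        [
            sys.executable, script,
            "--input", str(root / f"{stem}.onnx"),
            "--output", str(root / f"{stem}_int8.onnx"),
        ]
        for stem in GRAPH_STEMS
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failed quantisation stops the run instead of leaving a partial set of INT8 graphs.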
Sandboxed environments may need:
HF_HOME=$TMPDIR/hf_home HF_MODULES_CACHE=$TMPDIR/hf_modules <command above>
Licence
Apache 2.0 for both the upstream IBM model and this ONNX export. See
LICENSE for the full text.