# IBM Granite Speech 4.1 2b - ONNX export

ONNX export of [`ibm-granite/granite-speech-4.1-2b`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b) produced by Sam McLeod
(<https://smcleod.net>). Repository: `smcleod/ibm-granite-speech-4.1-2b-onnx`. Both FP32 and INT8 weight-only
graphs are included. The graphs target opset 20, IR 10, `ai.onnx` operators
only - no `com.microsoft` ops - so they load under the `ort` 2.0-rc.x Rust
crate as well as standard `onnxruntime` 1.17 - 1.25.

> **Additional precision tiers in progress.** A statically-calibrated INT8 variant (intended to improve quality over the dynamic INT8 already in this repo) and a half-precision encoder are in active development. The repo will be updated once those graphs pass the multi-clip parity gate.

Three graphs cooperate: `encoder.onnx` projects mel features to audio embeddings; `prompt_encode.onnx` runs the LLM forward over the full prompt (text tokens + projected audio embeds) and returns the first-token logits plus a 40-layer KV cache; `decode_step.onnx` consumes one token at a time plus the past KV cache and emits the next logits.
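
The hand-off between the prompt and decode graphs can be sketched as a greedy loop. This is a minimal sketch rather than the runtime code used for this export: `prefill` and `step` are hypothetical callables standing in for sessions over `prompt_encode.onnx` and `decode_step.onnx`, and `eos_id` and `max_new_tokens` are illustrative parameters. The real tensor names and shapes are listed in `granite_export_metadata.json`.

```python
from typing import Callable, List, Sequence, Tuple

Logits = Sequence[float]  # logits over the vocabulary for one position
KVCache = object          # opaque handle for the past key/value tensors


def argmax(logits: Logits) -> int:
    """Index of the largest logit (the first one wins on ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def greedy_decode(
    prefill: Callable[[], Tuple[Logits, KVCache]],
    step: Callable[[int, KVCache], Tuple[Logits, KVCache]],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """One prefill call, then one step call per generated token."""
    logits, cache = prefill()  # hypothetical wrapper around prompt_encode.onnx
    tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(logits)
        if token == eos_id:
            break
        tokens.append(token)
        # hypothetical wrapper around decode_step.onnx
        logits, cache = step(token, cache)
    return tokens
```

Passing the two stages in as callables keeps the loop independent of any particular runtime binding, so the same control flow applies whether the sessions are created from Python `onnxruntime` or from the `ort` crate.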

The audio placeholder token id is `100352`. Replace those positions in the prompt with the projector outputs from `encoder.onnx` before running `prompt_encode.onnx`.
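
The placeholder substitution can be sketched with NumPy. This assumes the prompt is already available as a per-token embedding matrix with no batch dimension, which may not match the actual inputs of `prompt_encode.onnx` (check `granite_export_metadata.json` for those). The function name `merge_audio_embeds` and its argument layout are illustrative, not part of this export.

```python
import numpy as np

AUDIO_TOKEN_ID = 100352  # audio placeholder id stated in this README


def merge_audio_embeds(input_ids, text_embeds, audio_embeds):
    """Overwrite placeholder rows of text_embeds with rows of audio_embeds.

    input_ids:    (seq_len,) integer token ids
    text_embeds:  (seq_len, hidden) embeddings of the prompt tokens (assumed layout)
    audio_embeds: (n_audio, hidden) projector outputs from the encoder
    """
    input_ids = np.asarray(input_ids)
    audio_embeds = np.asarray(audio_embeds)
    mask = input_ids == AUDIO_TOKEN_ID
    n_slots = int(mask.sum())
    if n_slots != audio_embeds.shape[0]:
        raise ValueError(
            f"{n_slots} placeholder tokens but {audio_embeds.shape[0]} audio embeddings"
        )
    merged = np.array(text_embeds, copy=True)  # leave the caller's array untouched
    merged[mask] = audio_embeds                # rows are filled in prompt order
    return merged
```

The count check matters because the number of placeholder tokens in the prompt has to equal the number of embeddings the encoder produced for the clip; a mismatch would otherwise surface as a confusing shape error further downstream.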

## Files

- `encoder.onnx` + `encoder.onnx_data` (FP32) and `encoder_int8.onnx` + `encoder_int8.onnx_data` (INT8 weight-only quantisation)
- `prompt_encode.onnx` + `prompt_encode.onnx_data` (FP32) and `prompt_encode_int8.onnx` + `prompt_encode_int8.onnx_data` (INT8 weight-only quantisation)
- `decode_step.onnx` + `decode_step.onnx_data` (FP32) and `decode_step_int8.onnx` + `decode_step_int8.onnx_data` (INT8 weight-only quantisation)
- Tokeniser / processor: `tokenizer.json`, `tokenizer_config.json`, `processor_config.json`, `chat_template.jinja`, `special_tokens_map.json`, `preprocessor_config.json`
- Export scripts: `export_speech_2b_ar.py`, `quantise.py`
- `granite_export_metadata.json` (graph IO, parity numbers, toolchain)
- `LICENSE` (Apache 2.0)
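
Because each graph stores its weights in an external `.onnx_data` sidecar, the sidecar has to sit next to its `.onnx` file for the graph to load. A small pre-flight check along these lines can catch an incomplete download; the helper name `missing_bundle_files` is hypothetical and only the file names come from the list above.

```python
from pathlib import Path
from typing import List

GRAPH_STEMS = ["encoder", "prompt_encode", "decode_step"]


def missing_bundle_files(bundle_dir: str, int8: bool = False) -> List[str]:
    """Return graph or sidecar file names that are absent from bundle_dir."""
    suffix = "_int8" if int8 else ""
    root = Path(bundle_dir)
    missing = []
    for stem in GRAPH_STEMS:
        graph = f"{stem}{suffix}.onnx"
        # Each graph needs both the .onnx file and its .onnx_data sidecar.
        for name in (graph, graph + "_data"):
            if not (root / name).is_file():
                missing.append(name)
    return missing
```

Calling `missing_bundle_files(path)` checks the FP32 set, and `missing_bundle_files(path, int8=True)` checks the INT8 set; an empty list means all six files for that precision are present.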

## Parity

Parity is measured against the upstream PyTorch reference on a single LibriSpeech
clip (`10226_10111_000000.wav`, 8.43 seconds, 844 mel frames). FP32 graphs
match the reference within numeric tolerance; INT8 graphs are validated in
argmax-only mode (logit values shift, but the greedy-decoded transcript is
unchanged on this clip).

Encoder (numeric output, no argmax decoding):

| precision | max-abs-err | mean-abs-err | p99-abs-err |
| --- | --- | --- | --- |
| FP32 | 4.48e-06 | 1.24e-07 | 6.46e-07 |
| INT8 | 0.169 | 0.0109 | 0.0447 |
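
For readers who want to run a similar comparison, the three error columns can be computed along these lines. This is a sketch under assumptions: the function name `abs_error_stats` is illustrative, and the percentile uses NumPy's default linear interpolation, which may differ from the method used to produce the table.

```python
import numpy as np


def abs_error_stats(reference, candidate):
    """Max, mean and 99th-percentile absolute error between two arrays."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    err = np.abs(ref - cand).ravel()  # element-wise error, flattened
    return {
        "max_abs_err": float(err.max()),
        "mean_abs_err": float(err.mean()),
        "p99_abs_err": float(np.percentile(err, 99)),
    }
```

Here `reference` would be the PyTorch encoder output and `candidate` the output of `encoder.onnx` or `encoder_int8.onnx` for the same mel features.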

LLM stages (argmax decoding; INT8 logit max-abs deltas are large and `prompt_encode` INT8 shows 58/190 argmax mismatches, but the decoded transcript still matches):

| graph | precision | max-abs-err | argmax mismatches | transcript match |
| --- | --- | --- | --- | --- |
| prompt_encode | FP32 | 0.000364 | 0/190 | Y |
| prompt_encode | INT8 | 10.1 | 58/190 | Y |
| decode_step | FP32 | n/a | 0/51 | Y |
| decode_step | INT8 | 5.76 | 0/51 | Y |

### Multi-clip transcript parity

Three additional 16 kHz mono clips cover longer utterances (39 to 94 seconds) with single-speaker and two-speaker conversational content. Word error rate (WER) and Levenshtein edit distance are computed against the upstream PyTorch reference. All numbers are measured end-to-end through the full ONNX pipeline (no PyTorch encoder fallback).

| Clip | Duration | FP32 byte-exact vs PT | INT8 byte-exact vs PT | INT8 WER vs PT | INT8 vs FP32 Lev |
| --- | ---: | :---: | :---: | ---: | ---: |
| is-it-more-wood | 46.9 s | Y | N | 1.4% | 2 |
| two-speakers-1 | 93.8 s | Y | N | 1.0% | 12 |
| two-speakers-2 | 38.8 s | Y | N | 23.5% | 26 |

Raw multi-clip data including full transcripts: see `granite_export_metadata.json` `multi_clip_parity` block.
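
The two metrics in the table can be sketched in plain Python as follows. This is the textbook definition, not necessarily the exact procedure used here: whether the Lev column counts characters or words, and whether any text normalisation was applied before scoring, is not stated in this README.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return levenshtein(ref_words, hyp_words) / len(ref_words)
```

`levenshtein` accepts any sequence, so passing two strings gives a character-level distance and passing two word lists gives a word-level one.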

Reference transcript for the single LibriSpeech clip (`10226_10111_000000.wav`):

> After his nap, Timothy lazily stretched, first one gray velvet foot, then another, strolled indolently to his plate, turning over the food, carefully selecting choice bits, nosing out that which he scorned upon the clean hearth

Both FP32 and INT8 paths reproduce this transcript exactly on the test clip.

## Toolchain

- transformers 5.8.0
- torch 2.11.0
- onnx 1.21.0
- onnxruntime 1.25.1
- exporter: torch.onnx.export TorchScript path (dynamo=False)
- opset: 20 (`ai.onnx` only)
- IR version: 10
- external data layout: single `<stem>.onnx_data` sidecar per graph

## Compatibility

Targeted at the [`ort`](https://crates.io/crates/ort) 2.0-rc.x Rust crate.
Compatible with `onnxruntime` Python 1.17 through 1.25. No `com.microsoft`
ops are used. Graphs were emitted via the TorchScript path
(`torch.onnx.export(..., dynamo=False)`); the dynamo exporter was deliberately
avoided because it injects `aten::*` ops `ort` does not understand.

## Reproducing the export

The included `export_speech_2b_ar.py` and `quantise.py` scripts regenerate every
artefact in this bundle. From a checkout of <https://github.com/sammcj/granite-speech-4.1-onnx>:

```bash
python export_speech_2b_ar.py \
    --model-dir <path-to-ibm-granite/granite-speech-4.1-2b> \
    --out-dir exports/granite-speech-4.1-2b
python quantise.py --input exports/granite-speech-4.1-2b/encoder.onnx       --output exports/granite-speech-4.1-2b/encoder_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/prompt_encode.onnx --output exports/granite-speech-4.1-2b/prompt_encode_int8.onnx
python quantise.py --input exports/granite-speech-4.1-2b/decode_step.onnx   --output exports/granite-speech-4.1-2b/decode_step_int8.onnx
```

Sandboxed environments may need:

```bash
HF_HOME=$TMPDIR/hf_home HF_MODULES_CACHE=$TMPDIR/hf_modules <command above>
```

## Licence

Apache 2.0 for both the upstream IBM model and this ONNX export. See
[`LICENSE`](LICENSE) for the full text.