---
license: cc-by-4.0
language:
  - en
  - es
  - fr
  - de
  - bg
  - hr
  - cs
  - da
  - nl
  - et
  - fi
  - el
  - hu
  - it
  - lv
  - lt
  - mt
  - pl
  - pt
  - ro
  - sk
  - sl
  - sv
  - ru
  - uk
base_model:
  - nvidia/parakeet-tdt-0.6b-v3
  - istupakov/parakeet-tdt-0.6b-v3-onnx
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - asr
  - onnx
  - onnx-asr
  - smoothquant
  - quantization
---

# Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder

This is [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
with **one change**: the int8 encoder (`encoder-model.int8.onnx`) is rebuilt with
[SmoothQuant](https://github.com/onnx/neural-compressor), and with it I was no
longer able to reproduce the loss of accuracy on longer audio that I had measured
on the original int8 encoder. Everything else (the fp32 encoder, the fp16 encoder, the
decoder, the preprocessor and the tokenizer) is unchanged, so this repo is a
drop-in replacement for the original: point your loader at it and the better int8
is picked up automatically by its canonical name.

It also ships the **fp16** encoder (which the upstream istupakov repo does not),
so all three precisions are available here in one place. Unlike the int8 encoder,
the fp16 one is **not** SmoothQuant or anything clever: it is a naive fp16 cast of
the fp32 pieces ([`scripts/quantize-fp16.py`](./scripts/quantize-fp16.py)). In my testing it scored
**exactly equal to fp32** (same WER, overall and in every section), at half the
size. Note that fp16 compute is not implemented on every backend (for example the
CPU / WASM ONNX Runtime EP has no fp16 kernels), so there it is upcast back to fp32
at session build and gives you no runtime benefit. Even then it is still useful
purely as a smaller artifact: it is half the download / packaging size of fp32, for
identical accuracy. That is why [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
serves it (on a WebGPU backend it also runs natively in fp16, halving GPU memory).

This was originally built to improve the int8 transcription quality of
[parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) (live demo:
[parakeetweb.olicorne.org](https://parakeetweb.olicorne.org/)), a browser-based
Parakeet ASR app that runs the int8 encoder on its CPU / WASM backend, where fp16
is not an option.

I also contribute to [Kieirra/murmure](https://github.com/Kieirra/murmure),
another browser-based Parakeet ASR project, where this SmoothQuant int8 encoder is
progressively being upstreamed (see the
[discussion](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354)).

## Why this exists

The stock int8 encoder transcribes short clips fine, but its accuracy degrades
badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32
encoders do **not** show this, so the cause is not the model architecture but an int8
*numerics* problem. The stock int8 uses fully **dynamic, per-tensor** activation
quantization (one runtime scale for an entire activation tensor). Once a longer
sequence widens the activation distribution, that single scale can no longer
represent it and the transcript falls apart.
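
A toy illustration of that failure mode, on synthetic data rather than real encoder activations: once one channel widens the range, a single per-tensor scale leaves almost no resolution for the ordinary channels. The symmetric rounding scheme, the tensor shape and the 200x outlier factor are all assumptions chosen for the demo, not values taken from the stock encoder.

```python
import numpy as np


def fake_quant_per_tensor(x: np.ndarray) -> np.ndarray:
    """Quantize to int8 and back using ONE symmetric scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale


def relative_error(x: np.ndarray, cols: slice) -> float:
    """Relative L2 quantization error measured on the selected channels only."""
    deq = fake_quant_per_tensor(x)
    return float(np.linalg.norm(deq[:, cols] - x[:, cols]) / np.linalg.norm(x[:, cols]))


rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(256, 64))  # synthetic (frames, channels)

ordinary = slice(1, None)                    # every channel except channel 0
narrow_err = relative_error(acts, ordinary)

wide = acts.copy()
wide[:, 0] *= 200.0                          # one outlier channel widens the range
wide_err = relative_error(wide, ordinary)

print(f"narrow range: {narrow_err:.3f}   with outlier channel: {wide_err:.3f}")
```

The error is measured on the channels that were *not* changed, which is the point: they are damaged only because they share a scale with the outlier.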

SmoothQuant targets exactly this failure mode: it migrates the per-channel
activation outliers into the weights (a folded multiply), then statically
quantizes activations together with **per-channel** weights. With the smoothed,
per-channel int8 encoder I was no longer able to reproduce the long-audio
degradation in my own testing (see the numbers below).
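
The smoothing step can be sketched in a few lines of NumPy. This follows the formula from the SmoothQuant paper, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`, on synthetic data; `alpha = 0.5` is the paper's default and an assumption here, not necessarily the value used by `scripts/quantize-int8-smoothquant.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))        # synthetic activations (frames, in_channels)
X[:, 3] *= 50.0                       # one outlier input channel
W = rng.normal(size=(16, 8)) * 0.1    # synthetic weights (in_channels, out_channels)

alpha = 0.5                           # assumed migration strength
act_max = np.abs(X).max(axis=0)       # per input channel activation range
w_max = np.abs(W).max(axis=1)         # per input channel weight range
s = act_max**alpha / w_max ** (1.0 - alpha)

X_smooth = X / s                      # activations become easier to quantize
W_smooth = W * s[:, None]             # the inverse scale is folded into the weights

# The transform is mathematically exact: the MatMul output is unchanged.
assert np.allclose(X @ W, X_smooth @ W_smooth)

spread_before = act_max.max() / act_max.min()
smooth_max = np.abs(X_smooth).max(axis=0)
spread_after = smooth_max.max() / smooth_max.min()
print(f"channel range spread: {spread_before:.1f} -> {spread_after:.1f}")
```

Only after this rebalancing are the activations and the per-channel weights quantized; the sketch stops before that step.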

Background and discussion:
[Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354).

## Results

Benchmark: a single ~390 second pass of a JFK speech clip (no chunking), scored
**per 60 second section**. The reference for each section is the fp32 encoder's
transcript of that same section run on its own as a short clip, a regime that all
the encoders handle well. A WER that climbs as
you go down the table is the long-audio degradation. Run with `scripts/wer-quants.py`
from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
project repository; the export and comparison are fully reproducible with the
scripts included in this repo.
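
For readers unfamiliar with the metric, here is a minimal word error rate sketch: word-level edit distance divided by the reference length. It is not the code in `wer-quants.py`; the lowercase-and-split normalization and the function name are assumptions, and the real harness may normalize text differently.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


# One substitution ("your" -> "the") and one deletion ("do") over 7 reference words.
score = wer("ask not what your country can do", "ask not what the country can")
print(f"{score:.4f}")
```

Per-section scoring then amounts to calling `wer` once per 60 second section with that section's reference and hypothesis text.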

Overall (single 390 s pass, lower WER is better):

| encoder precision           | encoder size | overall WER | peak RAM |
| --------------------------- | ------------ | ----------- | -------- |
| stock int8 (istupakov)      | 622 MB       | 40.40%      | ~5.0 GB  |
| **SmoothQuant int8 (this)** | **842 MB**   | **11.32%**  | ~5.0 GB  |
| fp16/fp32                   | ~1.2 GB      | 10.17%      | ~9.5 GB  |

Per-section WER:

| section     | stock int8 | **SmoothQuant int8** | fp16/fp32 |
| ----------- | ---------- | -------------------- | --------- |
| 0 to 60 s   | 41.4%      | **3.4%**             | 2.6%      |
| 60 to 120 s | 29.2%      | **3.5%**             | 5.3%      |
| 120 to 180 s| 39.1%      | **7.0%**             | 3.9%      |
| 180 to 240 s| 28.2%      | **4.3%**             | 3.4%      |
| 240 to 300 s| 69.5%      | **46.3%**            | 45.1%     |
| 300 to 360 s| 46.8%      | **25.5%**            | 23.4%     |
| 360 to 390 s| 37.5%      | **6.2%**             | 4.2%      |

*fp16 and fp32 produced the exact same WER (overall and in every section), so they
share one column. The encoder size and peak RAM in the overall table are fp16's;
the fp32 encoder is roughly twice as large.*

The SmoothQuant int8 **tracks fp16 closely** (11.32% overall vs fp16's 10.17%, a
1.2 point gap), and its WER is about 3.6x lower than the stock int8's 40.40%. The 240 to
360 s sections are elevated for fp16 too, so the cause there is the audio or the reference
for those windows, not a quantization artifact: the SmoothQuant int8 stays within about
2 points of fp16 there, while the stock int8 climbs to 69.5% and 46.8%. The JFK clip is **held out of the
calibration set** (see below), so this is an out-of-sample measurement, not a fit
to the eval audio.

### Calibration data (no labels, disjoint from every eval, bilingual audio)

SmoothQuant is a static method: it needs representative **activations** (not
labels or transcripts) to estimate per-channel ranges, which it then folds into
the weights as an exact equivalence transform. No labels, transcripts, or training
targets are used. It does use audio data, and that audio is deliberately bilingual
(French and English), but only as raw signal to exercise the activation ranges.
Nothing is fit to any transcript, and the model's multilingual ability is inherited
unchanged from the base model rather than learned or tuned here.

The calibration corpus is eight public political speeches, chosen to be
**disjoint from every evaluation set** (the JFK long-audio WER clip and the FLEURS
French split are both strictly held out) and to span decades, recording
conditions and two languages so the activation distribution stays broad:

| speaker | lang | speech | year | crop | source (YouTube id) |
| ------- | :--: | ------ | :--: | :--: | ------------------- |
| Dominique de Villepin | FR | UN Security Council address against the Iraq war | 2003 | 390 s | `RNxU-tN8qNc` |
| Bernie Sanders | EN | Senate floor filibuster against the tax-cut extension | 2010 | 390 s | `K6pa-QdL4Wo` |
| Georges Pompidou | FR | presidential press conference (INA archive) | 1970 | 390 s | `RNWFPX_Yafw` |
| Lyndon B. Johnson | EN | "We Shall Overcome" voting-rights address | 1965 | 390 s | `o74X_rTzrGI` |
| Jacques Chirac | FR | "Notre maison brule" Earth Summit speech | 2002 | 60 s | `M_oR0wZ3lI4` |
| Richard Nixon | EN | resignation address | 1974 | 60 s | `ZEOGJJ7UKFM` |
| Simone Veil | FR | speech defending the law legalizing abortion (INA) | 1974 | 390 s | `45MOc6PYoY8` |
| Robert Badinter | FR | speech for abolishing the death penalty (INA) | 1981 | 390 s | `kIVuz9NGQXY` |

Each clip is decoded to 16 kHz mono and sliced into 30 s windows (the six long
crops deliberately exercise the long-range regime where the int8 long-audio bug
lives), then evenly subsampled across all eight speakers for the calibration pass.
The fp32 encoder is the accuracy oracle, and the export ends with a
cosine-similarity fidelity check of the new encoder's output against fp32.
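
A minimal sketch of those three steps (30 s windowing, even subsampling across speakers, cosine-similarity check). The helper names are hypothetical, dropping the trailing partial window is an assumption, and round-robin selection is only one reading of "evenly subsampled"; the export script may do each of these differently.

```python
import numpy as np

SAMPLE_RATE = 16_000   # 16 kHz mono, as stated above
WINDOW_S = 30          # window length in seconds


def slice_windows(audio: np.ndarray) -> list[np.ndarray]:
    """Cut a clip into full 30 s windows (any trailing partial window is dropped)."""
    step = WINDOW_S * SAMPLE_RATE
    return [audio[i:i + step] for i in range(0, len(audio) - step + 1, step)]


def subsample_evenly(windows_by_speaker: dict[str, list], total: int) -> list:
    """Round-robin across speakers so the long clips do not crowd out the short ones."""
    queues = {name: list(windows) for name, windows in windows_by_speaker.items()}
    picked: list = []
    while len(picked) < total and any(queues.values()):
        for name in queues:
            if queues[name] and len(picked) < total:
                picked.append(queues[name].pop(0))
    return picked


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened encoder outputs."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


long_clip = np.zeros(390 * SAMPLE_RATE, dtype=np.float32)   # placeholder audio
short_clip = np.zeros(60 * SAMPLE_RATE, dtype=np.float32)   # placeholder audio
n_long, n_short = len(slice_windows(long_clip)), len(slice_windows(short_clip))
print(n_long, n_short)
```

Under these assumptions a 390 s crop yields 13 windows and a 60 s crop yields 2, which is why the subsampling step matters for balance.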

These clips are **not redistributed here** (they are copyrighted third-party
broadcasts; this repo ships under `cc-by-4.0`), which is why they are documented
by source above rather than committed. To re-run the export, fetch them yourself
from the listed sources and drop them in a `calibration_audio/` folder at the
repo root: `scripts/quantize-int8-smoothquant.py` reads that folder by default (or
pass your own clips/folders with `--audio`).

### Generalization (held-out, two domains, greedy vs beam)

As independent checks that the recalibrated int8 generalizes beyond the JFK clip,
it was evaluated on two sets that are **not** in the calibration data: the
[FLEURS](https://huggingface.co/datasets/google/fleurs) French validation split (a
general-French read-speech benchmark) and a small in-house medical-dictation set.
Both are scored greedy (beam 1) and with MAES beam search (width 10):

| dataset | utterances | beam 1 WER | beam 10 WER | beam 10 CER |
| ------- | :--------: | :--------: | :---------: | :---------: |
| FLEURS French (validation) | 289 | 5.05% | **4.98%** | **2.06%** |
| in-house medical dictation | 205 | 17.65% | **17.27%** | 10.39% |
| overall | 494 | 9.37% | **9.19%** | 5.13% |

The **2.06% FLEURS-fr CER** is evidence that the model keeps its non-English accuracy
(measured here on French only): French audio is part of the calibration set, but FLEURS
itself is held out and no French
transcript or label was ever used, so this is a genuine held-out measurement.
Width-10 beam search buys only a small accuracy gain over greedy (roughly 0.1 to
0.4 WER points here) at about 10x the decode cost, so greedy is a reasonable default
and the beam is there when the last fraction of a point matters. Run with
`scripts/grid_search_benchmark.mjs` from the
[parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) repository.

### Trade-off: heavier than the stock int8, much more accurate

This int8 encoder is **842 MB versus the stock 622 MB**. That is deliberate: only
the MatMul ops are quantized, and the convolutional subsampling front-end is kept
in fp32 (statically quantizing it collapsed the encoder to an empty transcript).
The extra size buys long-audio accuracy that tracks fp16. It still uses about
**half the RAM of fp16** (~5.0 GB versus ~9.5 GB), which is the point: if you can
run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8
matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8
is the only precision that both fits and runs.

## Browser-friendly fp32 shards (`sharded/`)

The fp32 encoder is shipped two ways here: as the canonical single sidecar
(`encoder-model.onnx` + a ~2.3 GB `encoder-model.onnx.data`), and, under
`sharded/`, as the **same weights repacked into several files each under 2 GB**.
The sharded copy exists so the fp32 encoder can be loaded **in a web browser** (and
on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32
**cannot**.

Why a browser cannot load the single 2.3 GB sidecar (these are *ingest* limits, not
a total-memory limit):

1. **32-bit WASM ArrayBuffer cap.** A WASM build is wasm32, so any single
   `ArrayBuffer` it holds caps at `2^31 - 1` bytes (~2 GB). A 2.3 GB sidecar cannot
   live in one buffer. (This is the same wall that forces projects like wllama to
   shard their GGUF files.)
2. **Chromium blob-URL fetch cap.** Fetching a `blob:` URL larger than ~2 GB fails
   in Chromium with `TypeError: Failed to fetch`, so the file cannot even be read
   into memory in one piece.

Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it
is *not* upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so
fp32 **fits** once no single buffer or fetch exceeds 2 GB. Sharding is purely about
clearing the two per-buffer ingest walls above.

`scripts/shard-fp32.py` rewrites each big initializer's `external_data` location to spread
the encoder's tensors across N shard files (`encoder-model.onnx.data.000`,
`encoder-model.onnx.data.001`, ... each under a 1.5 GB budget by default), leaving a
small rewritten `encoder-model.onnx` graph that points at them. Here that produces
**two shards** (~1.4 GB + ~0.9 GB). It is a **pure repack**: no tensor value is
touched, so the sharded encoder is **numerically identical** (every tensor is bit for bit the same) to the
single-file fp32 and has the **exact same WER**. A loader (for example
[parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web), with its
`allowWasmFp32` opt-in) mounts each shard as a separate `externalData` entry, each
under the 2 GB caps, and reads them straight to bytes (no >2 GB `blob:` URL, no
multi-GB IndexedDB blob). The decoder, tokenizer and config are **not** duplicated
into `sharded/`; a loader takes the rewritten encoder + shards from `sharded/` and
everything else from the repo root.
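
The repack idea can be sketched without touching ONNX at all: assign each tensor, in order, to the current shard until the budget would be exceeded, then open a new shard. This is an illustrative greedy planner, not the logic of `scripts/shard-fp32.py`; the function name, the decimal interpretation of "1.5 GB" and the toy tensor sizes are assumptions. The `(name, offset, length)` triples mirror the `offset` and `length` keys of ONNX external data entries.

```python
WASM32_BUFFER_CAP = 2**31 - 1      # single ArrayBuffer limit discussed above
DEFAULT_BUDGET = 1_500_000_000     # assumed meaning of the 1.5 GB default budget


def plan_shards(tensor_sizes: dict[str, int], budget: int = DEFAULT_BUDGET):
    """Greedily pack tensors into shards; returns [(name, offset, length), ...] per shard."""
    shards: list[list[tuple[str, int, int]]] = [[]]
    used = 0
    for name, size in tensor_sizes.items():
        if size > budget:
            raise ValueError(f"{name} alone exceeds the shard budget")
        if used + size > budget and shards[-1]:
            shards.append([])      # current shard is full: start the next one
            used = 0
        shards[-1].append((name, used, size))
        used += size
    return shards


# Toy sizes in bytes with a toy budget, just to show the packing behaviour.
plan = plan_shards({"a": 900, "b": 500, "c": 400, "d": 300}, budget=1500)
for index, shard in enumerate(plan):
    total = sum(length for _, _, length in shard)
    print(f"encoder-model.onnx.data.{index:03d}", total, shard)
```

Keeping the budget well under `WASM32_BUFFER_CAP` leaves headroom, so every shard clears both ingest limits.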

When to use which: on **WebGPU**, prefer fp16 (half the download, native fp16
kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards
matter on **CPU / WASM**, where fp16 has no compute kernels and the single-file fp32
cannot be ingested, so the sharded fp32 is the only way to run full precision.

The shards are regenerated with [`scripts/shard-fp32.py`](./scripts/shard-fp32.py) (see
[How it was built](#how-it-was-built)).

## Files

| file                            | what it is                                              |
| ------------------------------- | ------------------------------------------------------- |
| `encoder-model.onnx` (+ `.data`)| fp32 encoder (unchanged from istupakov)                 |
| `encoder-model.fp16.onnx`       | fp16 encoder (not shipped by the upstream istupakov repo)|
| `encoder-model.int8.onnx`       | **SmoothQuant int8 encoder (the reason for this repo)** |
| `sharded/encoder-model.onnx` (+ `.data.000`, `.data.001`) | fp32 encoder repacked into <2 GB shards so a browser / WASM backend can load it (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)) |
| `decoder_joint-model.onnx`      | fp32 decoder / joint network (unchanged)                |
| `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged)                |
| `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged)                |
| `nemo128.onnx`                  | 128-bin mel preprocessor (unchanged)                    |
| `vocab.txt`, `config.json`      | tokenizer and model config (unchanged)                  |
| `scripts/quantize-int8-smoothquant.py` | script that produced the SmoothQuant int8 encoder |
| `scripts/quantize-fp16.py`      | script that produced the fp16 encoder                   |
| `scripts/shard-fp32.py`         | script that produced the sharded fp32 encoder           |

## How it was built

- `encoder-model.int8.onnx`: `scripts/quantize-int8-smoothquant.py`. SmoothQuant +
  static per-channel int8, MatMul ops only (convolutions stay fp32), with
  Percentile activation calibration. Calibration uses no labels: it is the eight
  public speeches (disjoint from every evaluation set) listed under [Calibration data](#calibration-data-no-labels-disjoint-from-every-eval-bilingual-audio),
  read from a local `calibration_audio/` folder by default (override with
  `--audio`), sliced into 30 s windows, with the fp32 encoder as the accuracy
  oracle. The script ends with a cosine-similarity fidelity check of the new
  encoder's output against fp32.
- `encoder-model.fp16.onnx`: `scripts/quantize-fp16.py`, a straight fp16 cast of the
  fp32 encoder pieces.
- `sharded/`: `scripts/shard-fp32.py`, a pure repack of the single-file fp32 encoder into
  <2 GB shards (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)).
  No weights are altered, so the sharded encoder is numerically identical to the
  single-file fp32.

All three scripts live in `scripts/`, are self-contained, and run from the repo
root against the model files here, which they default to finding in the current
directory (so invoke them as e.g. `uv run scripts/quantize-fp16.py`). Each declares
its own dependencies via a [PEP 723](https://peps.python.org/pep-0723/) header and
runs with `uv run` (which installs them on the fly).
The only external inputs are the calibration clips for the int8 export (fetched
from the documented sources into `calibration_audio/`, see above) and the
optional WER comparison harnesses that the scripts point to in their final output
(`wer-quants.py`, `wer-bench.mjs`), which live in the
[parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
project repository.
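
For reference, a PEP 723 header is a specially formatted comment block at the top of a script that `uv run` reads to build a throwaway environment. The script below is a hypothetical example, not one of the scripts in this repo, and its dependency list is illustrative only.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
# ]
# ///
"""Hypothetical example: estimate the size of an fp16 copy of a model."""
import numpy as np


def fp16_bytes(n_params: int) -> int:
    """Bytes needed to store n_params weights in half precision."""
    return n_params * np.dtype(np.float16).itemsize


if __name__ == "__main__":
    # A nominal 0.6B parameter count is assumed here for illustration.
    print(f"{fp16_bytes(600_000_000) / 1e9:.1f} GB")
```

Saved as, say, `example.py`, it would run with `uv run example.py` and no prior `pip install`.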

## Sources and credits

- ONNX base model this repo is built on:
  [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- Original model:
  [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- SmoothQuant implementation:
  [onnx/neural-compressor](https://github.com/onnx/neural-compressor)
- Loaded with [onnx-asr](https://github.com/istupakov/onnx-asr)
- Discussion:
  [Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354)
- This repository (model export, benchmarking and documentation) was produced
  with [Claude Code](https://claude.com/claude-code).

## License

`cc-by-4.0`, inherited from the upstream istupakov ONNX model and the original
NVIDIA Parakeet TDT 0.6B v3.