license: cc-by-4.0
language:
- en
- es
- fr
- de
- bg
- hr
- cs
- da
- nl
- et
- fi
- el
- hu
- it
- lv
- lt
- mt
- pl
- pt
- ro
- sk
- sl
- sv
- ru
- uk
base_model:
- nvidia/parakeet-tdt-0.6b-v3
- istupakov/parakeet-tdt-0.6b-v3-onnx
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- asr
- onnx
- onnx-asr
- smoothquant
- quantization
Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder
This is istupakov/parakeet-tdt-0.6b-v3-onnx
with one change: the int8 encoder (encoder-model.int8.onnx) is rebuilt with
SmoothQuant, and with it I was no
longer able to reproduce the loss of accuracy on longer audio that I had measured
on the original int8 encoder. Everything else (the fp32 encoder, the fp16 encoder, the
decoder, the preprocessor and the tokenizer) is unchanged, so this repo is a
drop-in replacement for the original: point your loader at it and the better int8
is picked up automatically by its canonical name.
It also ships the fp16 encoder (which the upstream istupakov repo does not),
so all three precisions are available here in one place. Unlike the int8 encoder,
the fp16 one is not SmoothQuant or anything clever: it is a naive fp16 cast of
the fp32 pieces (scripts/quantize-fp16.py). In my testing it scored
exactly equal to fp32 (same WER, overall and in every section), at half the
size. Note that fp16 compute is not implemented on every backend (for example the
CPU / WASM ONNX Runtime EP has no fp16 kernels), so there it is upcast back to fp32
at session build and gives you no runtime benefit. Even then it is still useful
purely as a smaller artifact: it is half the download / packaging size of fp32, for
identical accuracy. That is why parakeet_web
serves it (on a WebGPU backend it also runs natively in fp16, halving GPU memory).
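As a rough illustration of the size argument (not the code in scripts/quantize-fp16.py), the NumPy sketch below casts a toy fp32 weight matrix to fp16 and back. The matrix shape and value scale are assumptions chosen for the example; a real export also has to rewrite the ONNX graph, which is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one encoder weight matrix (shape and scale are assumptions).
weights_fp32 = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)

# Naive cast: no calibration data, no per-channel scales.
weights_fp16 = weights_fp32.astype(np.float16)

# What a backend without fp16 kernels effectively computes on after upcasting.
upcast = weights_fp16.astype(np.float32)

size_ratio = weights_fp16.nbytes / weights_fp32.nbytes
max_abs_error = float(np.max(np.abs(upcast - weights_fp32)))

print(f"stored size ratio fp16/fp32: {size_ratio}")
print(f"largest rounding error after the round trip: {max_abs_error:.2e}")
```

The stored tensor is half the size, while the upcast copy occupies as much memory as the original fp32 tensor, which is why the benefit on such a backend is limited to download and packaging size.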
This was originally built to improve the int8 transcription quality of parakeet_web (live demo: parakeetweb.olicorne.org), a browser-based Parakeet ASR app that runs the int8 encoder on its CPU / WASM backend, where fp16 is not an option.
I also contribute to Kieirra/murmure, another browser-based Parakeet ASR project, where this SmoothQuant int8 encoder is progressively being upstreamed (see the discussion).
Why this exists
The stock int8 encoder transcribes short clips fine, but its accuracy degrades badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32 encoders do not show this, so the cause is int8 numerics rather than the model architecture. The stock int8 uses fully dynamic, per-tensor activation quantization (one runtime scale for an entire activation tensor). Once a longer sequence widens the activation distribution, that single scale can no longer represent it and the transcript falls apart.
SmoothQuant targets exactly this failure mode: it migrates the per-channel activation outliers into the weights (a folded multiply), then statically quantizes activations together with per-channel weights. With the smoothed, per-channel int8 encoder I was no longer able to reproduce the long-audio degradation in my own testing (see the numbers below).
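To make the mechanism concrete, here is a minimal NumPy sketch. It is not the code in scripts/quantize-int8-smoothquant.py. It assumes the smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5, a toy activation matrix with one outlier channel, and a per-tensor activation scale read from the data itself as a stand-in for a calibrated static scale. All function names are illustrative.

```python
import numpy as np

def quantize_per_tensor(x):
    """Symmetric int8 round trip with ONE scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def quantize_per_channel(w):
    """Symmetric int8 round trip with one scale per output channel (column)."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 127.0
    return np.clip(np.round(w / scale), -127, 127) * scale

def smooth(x, w, alpha=0.5):
    """Divide each activation channel by s_j and multiply the matching weight row by s_j."""
    s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1.0 - alpha)
    return x / s, w * s[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))   # activations: (tokens, input channels)
x[:, 3] *= 50.0                  # one outlier channel dominates the tensor range
w = rng.normal(size=(64, 32))    # weights: (input channels, output channels)

reference = x @ w
x_s, w_s = smooth(x, w)

# The transform is an exact equivalence before quantization.
assert np.allclose(x_s @ w_s, reference)

def relative_error(approx):
    return float(np.linalg.norm(approx - reference) / np.linalg.norm(reference))

err_plain = relative_error(quantize_per_tensor(x) @ quantize_per_channel(w))
err_smooth = relative_error(quantize_per_tensor(x_s) @ quantize_per_channel(w_s))
print(f"relative error without smoothing: {err_plain:.4f}")
print(f"relative error with smoothing:    {err_smooth:.4f}")
```

In this toy setup the outlier channel forces a coarse scale on every other channel, and moving part of its magnitude into the weights lets the single activation scale resolve the remaining channels again.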
Background and discussion: Kieirra/murmure#289 (comment).
Results
Benchmark: a single ~390 second pass of a JFK speech clip (no chunking), scored
per 60 second section. The reference for each section comes from the fp32
encoder, used as the oracle: the section is transcribed on its own as a short
clip, which every encoder handles well, and that short-clip transcription is
what the long-pass output is scored against. A WER that climbs as you go down
the table is the long-audio degradation. Run with scripts/wer-quants.py from the
parakeet_web project repository; the export and comparison are fully
reproducible with the scripts included in this repo.
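The per-section scoring described above can be sketched as a word-level edit distance. This is an illustrative implementation, not wer-quants.py: the text normalization that script applies is not documented here, so the sketch splits on whitespace only, and pooling errors over all sections for the overall figure is an assumption.

```python
def word_errors(reference, hypothesis):
    """Return (word-level Levenshtein distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1], len(ref)

def wer(reference, hypothesis):
    errors, n_words = word_errors(reference, hypothesis)
    return errors / max(n_words, 1)

# Each pair is (short-clip oracle transcript, same section cut from the long pass).
# The sentences are made up for the example.
sections = [
    ("ask not what your country can do for you",
     "ask not what your country can do for you"),
    ("ask what you can do for your country",
     "ask what you can do for the country"),
]

per_section = [wer(ref, hyp) for ref, hyp in sections]
counts = [word_errors(ref, hyp) for ref, hyp in sections]
overall = sum(e for e, _ in counts) / sum(n for _, n in counts)
print(per_section, overall)
```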
Overall (single 390 s pass, lower WER is better):
| encoder precision | encoder size | overall WER | peak RAM |
|---|---|---|---|
| stock int8 (istupakov) | 622 MB | 40.40% | ~5.0 GB |
| SmoothQuant int8 (this) | 842 MB | 11.32% | ~5.0 GB |
| fp16/fp32 | ~1.2 GB | 10.17% | ~9.5 GB |
Per-section WER:
| section | stock int8 | SmoothQuant int8 | fp16/fp32 |
|---|---|---|---|
| 0 to 60 s | 41.4% | 3.4% | 2.6% |
| 60 to 120 s | 29.2% | 3.5% | 5.3% |
| 120 to 180 s | 39.1% | 7.0% | 3.9% |
| 180 to 240 s | 28.2% | 4.3% | 3.4% |
| 240 to 300 s | 69.5% | 46.3% | 45.1% |
| 300 to 360 s | 46.8% | 25.5% | 23.4% |
| 360 to 390 s | 37.5% | 6.2% | 4.2% |
fp16 and fp32 produced the exact same WER (overall and in every section), so they share one column. The encoder size and peak RAM in the overall table are fp16's; the fp32 encoder is roughly twice as large.
The SmoothQuant int8 tracks fp16 closely (11.32% overall vs fp16's 10.17%, a 1.2 point gap), and its WER is about 3.6x lower than the stock int8's 40.40%. The 240 to 360 s sections are elevated for fp16 too, which points to the audio or the oracle reference for those windows rather than a quantization artifact: the SmoothQuant int8 stays within about 2 points of fp16 there, while the stock int8 climbs as high as 69.5%. The JFK clip is held out of the calibration set (see below), so this is an out-of-sample measurement, not a fit to the eval audio.
Calibration data (no labels, disjoint from every eval, bilingual audio)
SmoothQuant is a static method: it needs representative activations to estimate per-channel ranges, which it then folds into the weights as an exact equivalence transform. No labels, transcripts, or training targets are used. The calibration audio is deliberately bilingual (French and English), but it serves only as raw signal to exercise the activation ranges. Nothing is fit to any transcript, and the model's multilingual ability is inherited unchanged from the base model rather than learned or tuned here.
The calibration corpus is eight public political speeches, chosen to be disjoint from every evaluation set (the JFK long-audio WER clip and the FLEURS French split are both strictly held out) and to span decades, recording conditions and two languages so the activation distribution stays broad:
| speaker | lang | speech | year | crop | source (YouTube id) |
|---|---|---|---|---|---|
| Dominique de Villepin | FR | UN Security Council address against the Iraq war | 2003 | 390 s | RNxU-tN8qNc |
| Bernie Sanders | EN | Senate floor filibuster against the tax-cut extension | 2010 | 390 s | K6pa-QdL4Wo |
| Georges Pompidou | FR | presidential press conference (INA archive) | 1970 | 390 s | RNWFPX_Yafw |
| Lyndon B. Johnson | EN | "We Shall Overcome" voting-rights address | 1965 | 390 s | o74X_rTzrGI |
| Jacques Chirac | FR | "Notre maison brule" Earth Summit speech | 2002 | 60 s | M_oR0wZ3lI4 |
| Richard Nixon | EN | resignation address | 1974 | 60 s | ZEOGJJ7UKFM |
| Simone Veil | FR | speech defending the law legalizing abortion (INA) | 1974 | 390 s | 45MOc6PYoY8 |
| Robert Badinter | FR | speech for abolishing the death penalty (INA) | 1981 | 390 s | kIVuz9NGQXY |
Each clip is decoded to 16 kHz mono and sliced into 30 s windows (the six long crops deliberately exercise the long-range regime where the int8 long-audio bug lives), then evenly subsampled across all eight speakers for the calibration pass. The fp32 encoder is the accuracy oracle, and the export ends with a cosine-similarity fidelity check of the new encoder's output against fp32.
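The windowing and subsampling step can be sketched as follows. This is a hedged illustration rather than the logic of scripts/quantize-int8-smoothquant.py: dropping a short trailing window, the per_speaker parameter, and the function names are all assumptions, and silent arrays stand in for decoded audio.

```python
import numpy as np

SAMPLE_RATE = 16_000      # 16 kHz mono, as stated above
WINDOW_SECONDS = 30

def slice_windows(samples, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
    """Cut a mono waveform into consecutive fixed-length windows (short tail dropped)."""
    size = sample_rate * seconds
    count = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(count)]

def subsample_evenly(windows_by_speaker, per_speaker):
    """Keep up to `per_speaker` windows per speaker, spread evenly over each clip."""
    picked = []
    for speaker, windows in windows_by_speaker.items():
        k = min(per_speaker, len(windows))
        idx = np.linspace(0, len(windows) - 1, num=k).round().astype(int)
        picked.extend((speaker, windows[i]) for i in idx)
    return picked

# Placeholder clips: one 390 s crop and one 60 s crop.
clips = {
    "long_crop": np.zeros(390 * SAMPLE_RATE, dtype=np.float32),
    "short_crop": np.zeros(60 * SAMPLE_RATE, dtype=np.float32),
}
windows = {name: slice_windows(audio) for name, audio in clips.items()}
calibration_set = subsample_evenly(windows, per_speaker=2)
print({name: len(w) for name, w in windows.items()}, len(calibration_set))
```

Capping the number of windows per speaker keeps the six long crops from outweighing the two short ones in the activation statistics.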
These clips are not redistributed here (they are copyrighted third-party
broadcasts; this repo ships under cc-by-4.0), which is why they are documented
by source above rather than committed. To re-run the export, fetch them yourself
from the listed sources and drop them in a calibration_audio/ folder at the
repo root: scripts/quantize-int8-smoothquant.py reads that folder by default (or
pass your own clips/folders with --audio).
Generalization (held-out, two domains, greedy vs beam)
As independent checks that the recalibrated int8 generalizes beyond the JFK clip, it was evaluated on two sets that are not in the calibration data: the FLEURS French validation split (a general-French read-speech benchmark) and a small in-house medical-dictation set. Both are scored greedy (beam 1) and with MAES beam search (width 10):
| dataset | utterances | beam 1 WER | beam 10 WER | beam 10 CER |
|---|---|---|---|---|
| FLEURS French (validation) | 289 | 5.05% | 4.98% | 2.06% |
| in-house medical dictation | 205 | 17.65% | 17.27% | 10.39% |
| overall | 494 | 9.37% | 9.19% | 5.13% |
The 2.06% FLEURS-fr CER is consistent with the model staying strongly
multilingual, at least for French: French audio is part of the calibration set,
but FLEURS itself is held out and no French transcript or label was ever used,
so this is a genuine held-out measurement.
Width-10 beam search buys only a small accuracy gain over greedy (roughly 0.1 to
0.4 WER points here) at about 10x the decode cost, so greedy is a reasonable default
and the beam is there when the last fraction of a point matters. Run with
scripts/grid_search_benchmark.mjs from the
parakeet_web repository.
Trade-off: heavier than the stock int8, much more accurate
This int8 encoder is 842 MB versus the stock 622 MB. That is deliberate: only the MatMul ops are quantized, and the convolutional subsampling front-end is kept in fp32 (statically quantizing it collapsed the encoder to an empty transcript). The extra size buys long-audio accuracy that tracks fp16. It still uses about half the RAM of fp16 (~5.0 GB versus ~9.5 GB), which is the reason to keep an int8 at all. If you can run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8 matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8 is the only precision that both fits and runs.
Browser-friendly fp32 shards (sharded/)
The fp32 encoder is shipped two ways here: as the canonical single sidecar
(encoder-model.onnx + a ~2.3 GB encoder-model.onnx.data), and, under
sharded/, as the same weights repacked into several files each under 2 GB.
The sharded copy exists so the fp32 encoder can be loaded in a web browser (and
on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32
cannot.
Why a browser cannot load the single 2.3 GB sidecar (these are ingest limits, not a total-memory limit):
- 32-bit WASM ArrayBuffer cap. A WASM build is wasm32, so any single ArrayBuffer it holds caps at 2^31 - 1 bytes (~2 GB). A 2.3 GB sidecar cannot live in one buffer. (This is the same wall that forces projects like wllama to shard their GGUF files.)
- Chromium blob-URL fetch cap. Fetching a blob: URL larger than ~2 GB fails in Chromium with TypeError: Failed to fetch, so the file cannot even be read into memory in one piece.
Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it is not upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so fp32 fits once no single buffer or fetch exceeds 2 GB. Sharding is purely about clearing the two per-buffer ingest walls above.
scripts/shard-fp32.py rewrites each big initializer's external_data location to spread
the encoder's tensors across N shard files (encoder-model.onnx.data.000,
encoder-model.onnx.data.001, ... each under a 1.5 GB budget by default), leaving a
small rewritten encoder-model.onnx graph that points at them. Here that produces
two shards (~1.4 GB + ~0.9 GB). It is a pure repack: no tensor value is
touched, so the sharded encoder is byte-for-byte numerically identical to the
single-file fp32 and has the exact same WER. A loader (for example
parakeet_web, with its
allowWasmFp32 opt-in) mounts each shard as a separate externalData entry, each
under the 2 GB caps, and reads them straight to bytes (no >2 GB blob: URL, no
multi-GB IndexedDB blob). The decoder, tokenizer and config are not duplicated
into sharded/; a loader takes the rewritten encoder + shards from sharded/ and
everything else from the repo root.
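The repack can be pictured as a sequential packing of tensors into shard files. The sketch below is not scripts/shard-fp32.py: it only plans which shard each tensor lands in and at which byte offset (the location, offset and length fields that an ONNX external_data entry records), and the tensor names and sizes are invented toy values.

```python
WASM32_BUFFER_CAP = 2**31 - 1          # per-ArrayBuffer limit discussed above
DEFAULT_BUDGET = int(1.5 * 2**30)      # 1.5 GB budget, read here as GiB (an assumption)

def plan_shards(tensor_sizes, budget_bytes, base_name="encoder-model.onnx.data"):
    """Assign tensors, in order, to shard files without exceeding the budget.

    Returns {tensor_name: (location, offset, length)}.
    """
    plan, shard, offset = {}, 0, 0
    for name, length in tensor_sizes:
        if length > budget_bytes:
            raise ValueError(f"{name} alone exceeds the shard budget")
        if offset + length > budget_bytes:
            shard, offset = shard + 1, 0   # start a new shard file
        plan[name] = (f"{base_name}.{shard:03d}", offset, length)
        offset += length
    return plan

# Toy example with a 10-byte budget so the packing is easy to follow.
toy_plan = plan_shards([("a", 6), ("b", 3), ("c", 4), ("d", 5)], budget_bytes=10)
for tensor, entry in toy_plan.items():
    print(tensor, entry)
```

Because only the location and offset of each tensor change, the bytes of every tensor stay the same, which matches the pure-repack property described above.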
When to use which: on WebGPU, prefer fp16 (half the download, native fp16 kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards matter on CPU / WASM, where fp16 has no compute kernels and the single-file fp32 cannot be ingested, so the sharded fp32 is the only way to run full precision.
The shards are regenerated with scripts/shard-fp32.py (see
How it was built).
Files
| file | what it is |
|---|---|
| encoder-model.onnx (+ .data) | fp32 encoder (unchanged from istupakov) |
| encoder-model.fp16.onnx | fp16 encoder (not shipped by the upstream istupakov repo) |
| encoder-model.int8.onnx | SmoothQuant int8 encoder (the reason for this repo) |
| sharded/encoder-model.onnx (+ .data.000, .data.001) | fp32 encoder repacked into <2 GB shards so a browser / WASM backend can load it (see Browser-friendly fp32 shards) |
| decoder_joint-model.onnx | fp32 decoder / joint network (unchanged) |
| decoder_joint-model.fp16.onnx | fp16 decoder / joint network (unchanged) |
| decoder_joint-model.int8.onnx | int8 decoder / joint network (unchanged) |
| nemo128.onnx | 128-bin mel preprocessor (unchanged) |
| vocab.txt, config.json | tokenizer and model config (unchanged) |
| scripts/quantize-int8-smoothquant.py | script that produced the SmoothQuant int8 encoder |
| scripts/quantize-fp16.py | script that produced the fp16 encoder |
| scripts/shard-fp32.py | script that produced the sharded fp32 encoder |
How it was built
- encoder-model.int8.onnx: scripts/quantize-int8-smoothquant.py. SmoothQuant + static per-channel int8, MatMul ops only (convolutions stay fp32), with Percentile activation calibration. Calibration uses no labels: it is the eight public speeches listed under Calibration data (disjoint from every eval set), read from a local calibration_audio/ folder by default (override with --audio), sliced into 30 s windows, with the fp32 encoder as the accuracy oracle. The script ends with a cosine-similarity fidelity check of the new encoder's output against fp32.
- encoder-model.fp16.onnx: scripts/quantize-fp16.py, a straight fp16 cast of the fp32 encoder pieces.
- sharded/: scripts/shard-fp32.py, a pure repack of the single-file fp32 encoder into <2 GB shards (see Browser-friendly fp32 shards). No weights are altered, so the sharded encoder is numerically identical to the single-file fp32.
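A cosine-similarity fidelity check of the kind mentioned above can be sketched in a few lines of NumPy. The output shape, the noise level and the function name are assumptions for illustration. The threshold the export script applies, if any, is not stated here, so none is shown.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs on the same audio (the shape is hypothetical).
fp32_output = rng.normal(size=(1, 1024, 50))
int8_output = fp32_output + 0.01 * rng.normal(size=fp32_output.shape)

similarity = cosine_similarity(fp32_output, int8_output)
print(f"cosine similarity vs fp32: {similarity:.6f}")
```

A value close to 1.0 means the quantized encoder's output points in nearly the same direction as the fp32 output. It is a quick sanity check and does not replace the WER comparison.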
All three scripts live in scripts/, are self-contained, and run from the repo
root against the model files here, which they default to finding in the current
directory (so invoke them as e.g. uv run scripts/quantize-fp16.py). Each declares
its own dependencies via a PEP 723 header and
runs with uv run (which installs them on the fly).
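For readers unfamiliar with PEP 723, the inline metadata is a comment block at the top of the script, which uv run reads before execution. The sketch below shows the general shape only: the Python version and the dependency list are placeholders, not the actual headers of the scripts in this repo, and describe_runtime is a made-up helper.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "onnx",
# ]
# ///
# The block above is PEP 723 inline metadata. The listed packages are
# placeholders for illustration; each real script declares its own.
import sys

def describe_runtime():
    """Report which interpreter ended up running the script."""
    return f"python {sys.version_info.major}.{sys.version_info.minor}"

if __name__ == "__main__":
    print(describe_runtime())
```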
The only external inputs are the calibration clips for the int8 export (fetched
from the documented sources into calibration_audio/, see above) and the
optional WER comparison harnesses whose names the scripts print when they finish
(wer-quants.py, wer-bench.mjs), which live in the parakeet_web project
repository.
Sources and credits
- ONNX base model this repo is built on: istupakov/parakeet-tdt-0.6b-v3-onnx
- Original model: nvidia/parakeet-tdt-0.6b-v3
- SmoothQuant implementation: onnx/neural-compressor
- Loaded with onnx-asr
- Discussion: Kieirra/murmure#289 (comment)
- This repository (model export, benchmarking and documentation) was produced with Claude Code.
License
cc-by-4.0, inherited from the upstream istupakov ONNX model and the original
NVIDIA Parakeet TDT 0.6B v3.