---
license: cc-by-4.0
language:
- en
- es
- fr
- de
- bg
- hr
- cs
- da
- nl
- et
- fi
- el
- hu
- it
- lv
- lt
- mt
- pl
- pt
- ro
- sk
- sl
- sv
- ru
- uk
base_model:
- nvidia/parakeet-tdt-0.6b-v3
- istupakov/parakeet-tdt-0.6b-v3-onnx
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- asr
- onnx
- onnx-asr
- smoothquant
- quantization
---

# Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder

This is [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx) with **one change**: the int8 encoder (`encoder-model.int8.onnx`) is rebuilt with [SmoothQuant](https://github.com/onnx/neural-compressor), and with it I was no longer able to reproduce the loss of accuracy on longer audio that I had measured on the original int8 encoder. Everything else (the fp32 encoder, the fp16 encoder, the decoder, the preprocessor and the tokenizer) is unchanged, so this repo is a drop-in replacement for the original: point your loader at it and the better int8 is picked up automatically by its canonical name.

It also ships the **fp16** encoder (which the upstream istupakov repo does not), so all three precisions are available here in one place. Unlike the int8 encoder, the fp16 one is **not** SmoothQuant or anything clever: it is a naive fp16 cast of the fp32 pieces ([`scripts/quantize-fp16.py`](./scripts/quantize-fp16.py)). In my testing it scored **exactly equal to fp32** (same WER, overall and in every section), at half the size. Note that fp16 compute is not implemented on every backend (for example the CPU / WASM ONNX Runtime EP has no fp16 kernels), so there it is upcast back to fp32 at session build and gives you no runtime benefit. Even then it is still useful purely as a smaller artifact: it is half the download / packaging size of fp32, for identical accuracy.
That is why [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) serves it (on a WebGPU backend it also runs natively in fp16, halving GPU memory).

This was originally built to improve the int8 transcription quality of [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) (live demo: [parakeetweb.olicorne.org](https://parakeetweb.olicorne.org/)), a browser-based Parakeet ASR app that runs the int8 encoder on its CPU / WASM backend, where fp16 is not an option. I also contribute to [Kieirra/murmure](https://github.com/Kieirra/murmure), another browser-based Parakeet ASR project, where this SmoothQuant int8 encoder is progressively being upstreamed (see the [discussion](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354)).

## Why this exists

The stock int8 encoder transcribes short clips fine, but its accuracy degrades badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32 encoders do **not** show this, so the cause is not the model architecture but an int8 *numerics* problem.

The stock int8 uses fully **dynamic, per-tensor** activation quantization (one runtime scale for an entire activation tensor). Once a longer sequence widens the activation distribution, that single scale can no longer represent it and the transcript falls apart.

SmoothQuant targets exactly this failure mode: it migrates the per-channel activation outliers into the weights (a folded multiply), then statically quantizes activations together with **per-channel** weights. With the smoothed, per-channel int8 encoder I was no longer able to reproduce the long-audio degradation in my own testing (see the numbers below).

Background and discussion: [Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354).
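
The outlier migration described above can be illustrated with a tiny pure-Python sketch. This is not the export script (that one relies on neural-compressor); the matrices, the migration strength `alpha = 0.5` and every helper name here are assumptions chosen for illustration. It shows that dividing each activation channel by a scale and multiplying the matching weight row by the same scale leaves the matrix product unchanged, while shrinking the activation outlier that a single per-tensor scale would otherwise have to cover.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth_scales(x, w, alpha=0.5):
    """Per-input-channel scale s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j, w_row in enumerate(w):
        act_max = max(abs(row[j]) for row in x)
        wgt_max = max(abs(v) for v in w_row)
        scales.append(act_max ** alpha / wgt_max ** (1 - alpha))
    return scales


def smooth(x, w, scales):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    x_s = [[v / s for v, s in zip(row, scales)] for row in x]
    w_s = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_s, w_s


# Hypothetical activations (2 frames x 3 channels); channel 1 carries an outlier.
X = [[0.5, 40.0, -0.2],
     [-0.3, -36.0, 0.4]]
# Hypothetical weights (3 input channels x 2 output channels).
W = [[0.2, -0.1],
     [0.01, 0.02],
     [0.3, 0.25]]

scales = smooth_scales(X, W)
X_s, W_s = smooth(X, W, scales)

# The transform is an exact equivalence up to floating-point rounding.
reference = matmul(X, W)
smoothed = matmul(X_s, W_s)
max_diff = max(abs(a - b) for ra, rb in zip(reference, smoothed) for a, b in zip(ra, rb))

# The range an activation quantizer must cover shrinks after smoothing.
abs_max_before = max(abs(v) for row in X for v in row)
abs_max_after = max(abs(v) for row in X_s for v in row)
print(f"max |XW - X'W'| = {max_diff:.2e}")
print(f"activation abs-max: {abs_max_before} -> {abs_max_after:.3f}")
```

In a real export the scales are estimated from calibration activations and folded into the weights once, so nothing extra is computed at inference time.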

## Results

Benchmark: a single ~390 second pass of a JFK speech clip (no chunking), scored **per 60 second section** against the fp32 encoder as the oracle (each section is also transcribed independently as a short clip, which the encoders all handle well, and that short-clip transcription is the reference). A WER that climbs as you go down the table is the long-audio degradation. Run with `scripts/wer-quants.py` from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) project repository; the export and comparison are fully reproducible with the scripts included in this repo.

Overall (single 390 s pass, lower WER is better):

| encoder precision           | encoder size | overall WER | peak RAM |
| --------------------------- | ------------ | ----------- | -------- |
| stock int8 (istupakov)      | 622 MB       | 40.40%      | ~5.0 GB  |
| **SmoothQuant int8 (this)** | **842 MB**   | **11.32%**  | ~5.0 GB  |
| fp16/fp32                   | ~1.2 GB      | 10.17%      | ~9.5 GB  |

Per-section WER:

| section      | stock int8 | **SmoothQuant int8** | fp16/fp32 |
| ------------ | ---------- | -------------------- | --------- |
| 0 to 60 s    | 41.4%      | **3.4%**             | 2.6%      |
| 60 to 120 s  | 29.2%      | **3.5%**             | 5.3%      |
| 120 to 180 s | 39.1%      | **7.0%**             | 3.9%      |
| 180 to 240 s | 28.2%      | **4.3%**             | 3.4%      |
| 240 to 300 s | 69.5%      | **46.3%**            | 45.1%     |
| 300 to 360 s | 46.8%      | **25.5%**            | 23.4%     |
| 360 to 390 s | 37.5%      | **6.2%**             | 4.2%      |

*fp16 and fp32 produced the exact same WER (overall and in every section), so they share one column. The encoder size and peak RAM in the overall table are fp16's; the fp32 encoder is roughly twice as large.*

The SmoothQuant int8 **tracks fp16 closely** (11.32% overall vs fp16's 10.17%, a 1.2 point gap) and is about 3.6x better than the stock int8's 40.40%. The 240 to 360 s sections are elevated for fp16 too, so that is the audio / oracle for those windows, not a quantization artifact: the SmoothQuant int8 matches fp16 there while the stock int8 blows up to 69.5%.
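
For readers unfamiliar with the metric, the per-section scoring above can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration, not the `wer-quants.py` harness: the text normalization (lowercasing and whitespace splitting) is an assumption, and the section transcripts below are made-up placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical (reference, long-pass hypothesis) pairs, one per section.
sections = {
    "0 to 60 s": ("ask not what your country can do for you",
                  "ask not what your county can do for you"),
    "60 to 120 s": ("ask what you can do for your country",
                    "ask what you can do for your country"),
}
for name, (ref_text, hyp_text) in sections.items():
    print(f"{name}: {word_error_rate(ref_text, hyp_text):.1%}")
```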

The JFK clip is **held out of the calibration set** (see below), so this is an out-of-sample measurement, not a fit to the eval audio.

### Calibration data (no labels, disjoint from every eval, bilingual audio)

SmoothQuant is a static method: it needs representative **activations** (not labels or transcripts) to estimate per-channel ranges, which it then folds into the weights as an exact equivalence transform. No labels, transcripts, or training targets are used. It does use audio data, and that audio is deliberately bilingual (French and English), but only as raw signal to exercise the activation ranges. Nothing is fit to any transcript, and the model's multilingual ability is inherited unchanged from the base model rather than learned or tuned here.

The calibration corpus is eight public political speeches, chosen to be **disjoint from every evaluation set** (the JFK long-audio WER clip and the FLEURS French split are both strictly held out) and to span decades, recording conditions and two languages so the activation distribution stays broad:

| speaker | lang | speech | year | crop | source (YouTube id) |
| ------- | :--: | ------ | :--: | :--: | ------------------- |
| Dominique de Villepin | FR | UN Security Council address against the Iraq war | 2003 | 390 s | `RNxU-tN8qNc` |
| Bernie Sanders | EN | Senate floor filibuster against the tax-cut extension | 2010 | 390 s | `K6pa-QdL4Wo` |
| Georges Pompidou | FR | presidential press conference (INA archive) | 1970 | 390 s | `RNWFPX_Yafw` |
| Lyndon B. Johnson | EN | "We Shall Overcome" voting-rights address | 1965 | 390 s | `o74X_rTzrGI` |
| Jacques Chirac | FR | "Notre maison brule" Earth Summit speech | 2002 | 60 s | `M_oR0wZ3lI4` |
| Richard Nixon | EN | resignation address | 1974 | 60 s | `ZEOGJJ7UKFM` |
| Simone Veil | FR | speech defending the law legalizing abortion (INA) | 1974 | 390 s | `45MOc6PYoY8` |
| Robert Badinter | FR | speech for abolishing the death penalty (INA) | 1981 | 390 s | `kIVuz9NGQXY` |

Each clip is decoded to 16 kHz mono and sliced into 30 s windows (the six long crops deliberately exercise the long-range regime where the int8 long-audio bug lives), then evenly subsampled across all eight speakers for the calibration pass. The fp32 encoder is the accuracy oracle, and the export ends with a cosine-similarity fidelity check of the new encoder's output against fp32.

These clips are **not redistributed here** (they are copyrighted third-party broadcasts; this repo ships under `cc-by-4.0`), which is why they are documented by source above rather than committed. To re-run the export, fetch them yourself from the listed sources and drop them in a `calibration_audio/` folder at the repo root: `scripts/quantize-int8-smoothquant.py` reads that folder by default (or pass your own clips/folders with `--audio`).

### Generalization (held-out, two domains, greedy vs beam)

As independent checks that the recalibrated int8 generalizes beyond the JFK clip, it was evaluated on two sets that are **not** in the calibration data: the [FLEURS](https://huggingface.co/datasets/google/fleurs) French validation split (a general-French read-speech benchmark) and a small in-house medical-dictation set.
Both are scored greedy (beam 1) and with MAES beam search (width 10):

| dataset | utterances | beam 1 WER | beam 10 WER | beam 10 CER |
| ------- | :--------: | :--------: | :---------: | :---------: |
| FLEURS French (validation) | 289 | 5.05% | **4.98%** | **2.06%** |
| in-house medical dictation | 205 | 17.65% | **17.27%** | 10.39% |
| overall | 494 | 9.37% | **9.19%** | 5.13% |

The **2.06% FLEURS-fr CER** confirms the model stays strongly multilingual: French audio is part of the calibration set, but FLEURS itself is held out and no French transcript or label was ever used, so this is a genuine held-out measurement. Width-10 beam search buys only a small accuracy gain over greedy (roughly 0.1 to 0.4 WER points here) at about 10x the decode cost, so greedy is a reasonable default and the beam is there when the last fraction of a point matters. Run with `scripts/grid_search_benchmark.mjs` from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) repository.

### Trade-off: heavier than the stock int8, much more accurate

This int8 encoder is **842 MB versus the stock 622 MB**. That is deliberate: only the MatMul ops are quantized, and the convolutional subsampling front-end is kept in fp32 (statically quantizing it collapsed the encoder to an empty transcript). The extra size buys long-audio accuracy that tracks fp16.

It still uses about **half the RAM of fp16** (~5.0 GB versus ~9.5 GB), which is the point: if you can run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8 matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8 is the only precision that both fits and runs.

## Browser-friendly fp32 shards (`sharded/`)

The fp32 encoder is shipped two ways here: as the canonical single sidecar (`encoder-model.onnx` + a ~2.3 GB `encoder-model.onnx.data`), and, under `sharded/`, as the **same weights repacked into several files each under 2 GB**.
The sharded copy exists so the fp32 encoder can be loaded **in a web browser** (and on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32 **cannot**.

Why a browser cannot load the single 2.3 GB sidecar (these are *ingest* limits, not a total-memory limit):

1. **32-bit WASM ArrayBuffer cap.** A WASM build is wasm32, so any single `ArrayBuffer` it holds caps at `2^31 - 1` bytes (~2 GB). A 2.3 GB sidecar cannot live in one buffer. (This is the same wall that forces projects like wllama to shard their GGUF files.)
2. **Chromium blob-URL fetch cap.** Fetching a `blob:` URL larger than ~2 GB fails in Chromium with `TypeError: Failed to fetch`, so the file cannot even be read into memory in one piece.

Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it is *not* upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so fp32 **fits** once no single buffer or fetch exceeds 2 GB. Sharding is purely about clearing the two per-buffer ingest walls above.

`scripts/shard-fp32.py` rewrites each big initializer's `external_data` location to spread the encoder's tensors across N shard files (`encoder-model.onnx.data.000`, `encoder-model.onnx.data.001`, ... each under a 1.5 GB budget by default), leaving a small rewritten `encoder-model.onnx` graph that points at them. Here that produces **two shards** (~1.4 GB + ~0.9 GB). It is a **pure repack**: no tensor value is touched, so the sharded encoder is **byte-for-byte numerically identical** to the single-file fp32 and has the **exact same WER**.

A loader (for example [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web), with its `allowWasmFp32` opt-in) mounts each shard as a separate `externalData` entry, each under the 2 GB caps, and reads them straight to bytes (no >2 GB `blob:` URL, no multi-GB IndexedDB blob).
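
The repacking idea can be sketched as a simple planning step: walk the tensors in order and start a new shard whenever the next tensor would push the current one over budget. This is only a hedged illustration, not `scripts/shard-fp32.py` itself. The greedy in-order strategy, the decimal reading of "1.5 GB", the `plan_shards` helper and the tensor names and sizes are all assumptions; the real script additionally rewrites the ONNX `external_data` entries and writes the bytes.

```python
def plan_shards(tensor_sizes, budget_bytes=1_500_000_000,
                base="encoder-model.onnx.data"):
    """Assign tensors, in order, to numbered shard files within a byte budget.

    tensor_sizes: iterable of (tensor_name, size_in_bytes).
    Returns {tensor_name: (shard_file_name, byte_offset_inside_shard)}.
    """
    plan = {}
    shard_index = 0
    used = 0  # bytes already placed in the current shard
    for name, size in tensor_sizes:
        if size > budget_bytes:
            raise ValueError(f"{name} ({size} bytes) does not fit in one shard")
        if used + size > budget_bytes:
            # The current shard is full: open the next one.
            shard_index += 1
            used = 0
        plan[name] = (f"{base}.{shard_index:03d}", used)
        used += size
    return plan


# Hypothetical initializers; real names and sizes come from the ONNX graph.
tensors = [
    ("layer0.weight", 600_000_000),
    ("layer1.weight", 700_000_000),
    ("layer2.weight", 500_000_000),
    ("layer3.weight", 400_000_000),
]
plan = plan_shards(tensors)
for name, (shard, offset) in plan.items():
    print(f"{name} -> {shard} @ {offset}")
```

Because only the location and offset of each tensor change, the tensor bytes themselves are copied untouched, which is what makes the repack numerically identical to the single-file encoder.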

The decoder, tokenizer and config are **not** duplicated into `sharded/`; a loader takes the rewritten encoder + shards from `sharded/` and everything else from the repo root.

When to use which: on **WebGPU**, prefer fp16 (half the download, native fp16 kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards matter on **CPU / WASM**, where fp16 has no compute kernels and the single-file fp32 cannot be ingested, so the sharded fp32 is the only way to run full precision. The shards are regenerated with [`scripts/shard-fp32.py`](./scripts/shard-fp32.py) (see [How it was built](#how-it-was-built)).

## Files

| file | what it is |
| ---- | ---------- |
| `encoder-model.onnx` (+ `.data`) | fp32 encoder (unchanged from istupakov) |
| `encoder-model.fp16.onnx` | fp16 encoder (not shipped by the upstream istupakov repo) |
| `encoder-model.int8.onnx` | **SmoothQuant int8 encoder (the reason for this repo)** |
| `sharded/encoder-model.onnx` (+ `.data.000`, `.data.001`) | fp32 encoder repacked into <2 GB shards so a browser / WASM backend can load it (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)) |
| `decoder_joint-model.onnx` | fp32 decoder / joint network (unchanged) |
| `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged) |
| `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged) |
| `nemo128.onnx` | 128-bin mel preprocessor (unchanged) |
| `vocab.txt`, `config.json` | tokenizer and model config (unchanged) |
| `scripts/quantize-int8-smoothquant.py` | script that produced the SmoothQuant int8 encoder |
| `scripts/quantize-fp16.py` | script that produced the fp16 encoder |
| `scripts/shard-fp32.py` | script that produced the sharded fp32 encoder |

## How it was built

- `encoder-model.int8.onnx`: `scripts/quantize-int8-smoothquant.py`. SmoothQuant + static per-channel int8, MatMul ops only (convolutions stay fp32), with Percentile activation calibration. Calibration uses no labels: it runs on the eight public speeches listed under [Calibration data](#calibration-data-no-labels-disjoint-from-every-eval-bilingual-audio) (disjoint from every eval set), read from a local `calibration_audio/` folder by default (override with `--audio`), sliced into 30 s windows, with the fp32 encoder as the accuracy oracle. The script ends with a cosine-similarity fidelity check of the new encoder's output against fp32.
- `encoder-model.fp16.onnx`: `scripts/quantize-fp16.py`, a straight fp16 cast of the fp32 encoder pieces.
- `sharded/`: `scripts/shard-fp32.py`, a pure repack of the single-file fp32 encoder into <2 GB shards (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)). No weights are altered, so the sharded encoder is numerically identical to the single-file fp32.

All three scripts live in `scripts/`, are self-contained, and run from the repo root against the model files here, which they default to finding in the current directory (so invoke them as e.g. `uv run scripts/quantize-fp16.py`). Each declares its own dependencies via a [PEP 723](https://peps.python.org/pep-0723/) header and runs with `uv run` (which installs them on the fly). The only external inputs are the calibration clips for the int8 export (fetched from the documented sources into `calibration_audio/`, see above) and the optional WER comparison harnesses the scripts print at the end (`wer-quants.py`, `wer-bench.mjs`), which live in the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web) project repository.
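
The cosine-similarity fidelity check mentioned above compares the quantized encoder's output with the fp32 output on the same input. A minimal sketch of that comparison, assuming both outputs have been flattened to equal-length lists of floats; the sample values and the `0.99` threshold are hypothetical, not what the export script uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical flattened encoder outputs for the same audio window.
fp32_output = [0.12, -0.50, 0.33, 0.90]
int8_output = [0.11, -0.52, 0.35, 0.88]

similarity = cosine_similarity(fp32_output, int8_output)
threshold = 0.99  # hypothetical acceptance threshold
print(f"cosine similarity vs fp32: {similarity:.4f}")
print("fidelity OK" if similarity >= threshold else "fidelity check failed")
```

Cosine similarity ignores overall scale and only measures direction, so it is a quick sanity check rather than a substitute for the WER comparison.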

## Sources and credits

- ONNX base model this repo is built on: [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
- Original model: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- SmoothQuant implementation: [onnx/neural-compressor](https://github.com/onnx/neural-compressor)
- Loaded with [onnx-asr](https://github.com/istupakov/onnx-asr)
- Discussion: [Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354)
- This repository (model export, benchmarking and documentation) was produced with [Claude Code](https://claude.com/claude-code).

## License

`cc-by-4.0`, inherited from the upstream istupakov ONNX model and the original NVIDIA Parakeet TDT 0.6B v3.