thiswillbeyourgithub

license: cc-by-4.0
language:
  - en
  - es
  - fr
  - de
  - bg
  - hr
  - cs
  - da
  - nl
  - et
  - fi
  - el
  - hu
  - it
  - lv
  - lt
  - mt
  - pl
  - pt
  - ro
  - sk
  - sl
  - sv
  - ru
  - uk
base_model:
  - nvidia/parakeet-tdt-0.6b-v3
  - istupakov/parakeet-tdt-0.6b-v3-onnx
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - asr
  - onnx
  - onnx-asr
  - smoothquant
  - quantization

Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder

This is istupakov/parakeet-tdt-0.6b-v3-onnx with one change: the int8 encoder (encoder-model.int8.onnx) is rebuilt with SmoothQuant. With the rebuilt encoder I was no longer able to reproduce the loss of accuracy on longer audio that I had measured on the original int8 encoder. Everything else inherited from upstream (the fp32 encoder, the decoder, the preprocessor and the tokenizer) is unchanged, so this repo is a drop-in replacement for the original: point your loader at it and the better int8 is picked up automatically by its canonical name.

It also ships the fp16 encoder (which the upstream istupakov repo does not), so all three precisions are available here in one place. Unlike the int8 encoder, the fp16 one is not SmoothQuant or anything clever: it is a naive fp16 cast of the fp32 pieces (scripts/quantize-fp16.py). In my testing it scored exactly equal to fp32 (same WER, overall and in every section), at half the size. Note that fp16 compute is not implemented on every backend (for example the CPU / WASM ONNX Runtime EP has no fp16 kernels), so there it is upcast back to fp32 at session build and gives you no runtime benefit. Even then it is still useful purely as a smaller artifact: it is half the download / packaging size of fp32, for identical accuracy. That is why parakeet_web serves it (on a WebGPU backend it also runs natively in fp16, halving GPU memory).
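To make "a naive fp16 cast" concrete, here is a minimal standard-library sketch. It is not scripts/quantize-fp16.py (which works on the ONNX files); the weight values and helper names are made up for illustration. It only shows the two properties relied on above: half the bytes per value, and a small rounding error per value.

```python
import struct


def to_fp32_bytes(values):
    """Pack floats as IEEE 754 single precision (4 bytes each), little-endian."""
    return struct.pack(f"<{len(values)}f", *values)


def to_fp16_bytes(values):
    """Pack floats as IEEE 754 half precision (2 bytes each), little-endian."""
    return struct.pack(f"<{len(values)}e", *values)


def from_fp16_bytes(raw):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))


# Hypothetical weights, not taken from the real encoder.
weights = [0.1234567, -1.5, 3.14159, 0.0009765625]

fp32_raw = to_fp32_bytes(weights)
fp16_raw = to_fp16_bytes(weights)
restored = from_fp16_bytes(fp16_raw)

# Relative rounding error introduced by the cast, per value.
rel_errors = [abs(r - w) / abs(w) for r, w in zip(restored, weights)]

print(f"fp32 bytes: {len(fp32_raw)}, fp16 bytes: {len(fp16_raw)}")
print(f"largest relative error: {max(rel_errors):.2e}")
```

A plain cast like this has no calibration step, which is why it needs no audio at all. Values outside the fp16 range would need separate handling, which this sketch does not attempt.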

This was originally built to improve the int8 transcription quality of parakeet_web (live demo: parakeetweb.olicorne.org), a browser-based Parakeet ASR app that runs the int8 encoder on its CPU / WASM backend, where fp16 is not an option.

I also contribute to Kieirra/murmure, another browser-based Parakeet ASR project, where this SmoothQuant int8 encoder is progressively being upstreamed (see the discussion).

Why this exists

The stock int8 encoder transcribes short clips fine, but its accuracy degrades badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32 encoders do not show this, so the cause is not the model architecture but int8 numerics. The stock int8 uses fully dynamic, per-tensor activation quantization (one runtime scale for an entire activation tensor). Once a longer sequence widens the activation distribution, that single scale can no longer represent it and the transcript falls apart.
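The failure mode can be illustrated with a toy example. This is a minimal sketch of symmetric per-tensor int8 quantization in pure Python, with made-up activation values; it is not the ONNX Runtime implementation. It shows that once one large value sets the scale, the many small values lose almost all of their resolution.

```python
def quantize_dequantize(values, scale):
    """Symmetric int8 round trip: round to the nearest step, clamp, scale back."""
    restored = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        restored.append(q * scale)
    return restored


def per_tensor_scale(values):
    """One dynamic scale for the whole tensor, set by its largest magnitude."""
    return max(abs(v) for v in values) / 127.0


def mean_abs_error(values, scale):
    """Average absolute error after an int8 round trip at the given scale."""
    restored = quantize_dequantize(values, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical activations: many small values in [-0.5, 0.5].
small = [0.01 * i for i in range(-50, 51)]
# The same tensor once a single outlier widens the distribution.
widened = small + [60.0]

# Error on the small values with a scale fitted to them alone.
err_narrow = mean_abs_error(small, per_tensor_scale(small))
# Error on the same small values when the outlier dictates the scale.
err_wide = mean_abs_error(small, per_tensor_scale(widened))

print(f"scale without outlier: {per_tensor_scale(small):.5f}, error {err_narrow:.5f}")
print(f"scale with outlier:    {per_tensor_scale(widened):.5f}, error {err_wide:.5f}")
```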

SmoothQuant targets exactly this failure mode: it migrates the per-channel activation outliers into the weights (a folded multiply), then statically quantizes activations together with per-channel weights. With the smoothed, per-channel int8 encoder I was no longer able to reproduce the long-audio degradation in my own testing (see the numbers below).
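The smoothing step can be sketched on a toy matrix product. This is an illustration under stated assumptions, not the onnx/neural-compressor code: the data is made up, alpha = 0.5 is an assumed migration strength (the value used for this export is not stated here), and the division of the activations is done explicitly rather than folded into a preceding op. The per-channel factor is s_k = max|X_k|^alpha / max|W_k|^(1 - alpha), and the product X W is unchanged because X is divided by s while W is multiplied by it.

```python
def matmul(x, w):
    """Plain Python matrix product: x is (tokens, in), w is (in, out)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for row in x]


def smooth(x, w, alpha=0.5):
    """Migrate per-channel activation range into the weights.

    alpha = 0.5 is an assumption made for this sketch.
    """
    n_in = len(w)
    act_max = [max(abs(row[k]) for row in x) for k in range(n_in)]
    w_max = [max(abs(v) for v in w[k]) for k in range(n_in)]
    s = [act_max[k] ** alpha / w_max[k] ** (1 - alpha) for k in range(n_in)]
    x_smoothed = [[row[k] / s[k] for k in range(n_in)] for row in x]
    w_smoothed = [[v * s[k] for v in w[k]] for k in range(n_in)]
    return x_smoothed, w_smoothed, s


# Hypothetical toy data: input channel 1 carries the activation outliers.
x = [[0.2, 40.0, -0.1], [-0.3, -55.0, 0.4]]
w = [[0.5, -0.2], [0.01, 0.03], [-0.4, 0.6]]

x_s, w_s, s = smooth(x, w)
y = matmul(x, w)
y_s = matmul(x_s, w_s)

print("largest activation before:", max(abs(v) for row in x for v in row))
print("largest activation after: ", max(abs(v) for row in x_s for v in row))
print("outputs match:", all(abs(a - b) < 1e-9
                            for ra, rb in zip(y, y_s) for a, b in zip(ra, rb)))
```

After smoothing, the activation channels have comparable ranges, so a static activation scale wastes far fewer int8 levels on a single outlier channel.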

Background and discussion: Kieirra/murmure#289 (comment).

Results

Benchmark: a single ~390 second pass of a JFK speech clip (no chunking), scored per 60 second section (the last section is 30 s) with the fp32 encoder as the oracle. Each section is also transcribed independently as a short clip, which the encoders all handle well, and that short-clip transcription is the reference. A WER that climbs as you go down the table is the long-audio degradation. Run with scripts/wer-quants.py from the parakeet_web project repository; the export and comparison are fully reproducible with the scripts included in this repo.
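For readers unfamiliar with the metric, here is a minimal word error rate sketch: word-level edit distance divided by the number of reference words. It is not scripts/wer-quants.py. The text normalization (lowercasing and whitespace splitting) is an assumption, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical section: short-clip reference vs the long-pass transcript.
reference = "ask not what your country can do for you"
long_pass = "ask not what you country can do"

print(f"section WER: {word_error_rate(reference, long_pass):.1%}")
```

Scoring each section this way, against the short-clip transcription of the same section, gives one row of the per-section table.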

Overall (single 390 s pass, lower WER is better):

encoder precision	encoder size	overall WER	peak RAM
stock int8 (istupakov)	622 MB	40.40%	~5.0 GB
SmoothQuant int8 (this)	842 MB	11.32%	~5.0 GB
fp16/fp32	~1.2 GB	10.17%	~9.5 GB

Per-section WER:

section	stock int8	SmoothQuant int8	fp16/fp32
0 to 60 s	41.4%	3.4%	2.6%
60 to 120 s	29.2%	3.5%	5.3%
120 to 180 s	39.1%	7.0%	3.9%
180 to 240 s	28.2%	4.3%	3.4%
240 to 300 s	69.5%	46.3%	45.1%
300 to 360 s	46.8%	25.5%	23.4%
360 to 390 s	37.5%	6.2%	4.2%

fp16 and fp32 produced the exact same WER (overall and in every section), so they share one column. The encoder size and peak RAM in the overall table are fp16's; the fp32 encoder is roughly twice as large.

The SmoothQuant int8 tracks fp16 closely (11.32% overall vs fp16's 10.17%, a gap of about 1.2 points) and is about 3.6x better than the stock int8's 40.40%. The 240 to 360 s sections are elevated for fp16 too, so that comes from the audio or the oracle reference for those windows, not from quantization: the SmoothQuant int8 stays within about 2 points of fp16 there (46.3% vs 45.1%, 25.5% vs 23.4%), while the stock int8 blows up to 69.5%. The JFK clip is held out of the calibration set (see below), so this is an out-of-sample measurement, not a fit to the eval audio.

Calibration data (no labels, disjoint from every eval, bilingual audio)

SmoothQuant is a static method: it needs representative activations to estimate per-channel ranges, which it then folds into the weights as an exact equivalence transform. No labels, transcripts, or training targets are used. The calibration audio is deliberately bilingual (French and English), but it serves only as raw signal to exercise the activation ranges. The model's multilingual ability is inherited unchanged from the base model rather than learned or tuned here.

The calibration corpus is eight public political speeches, chosen to be disjoint from every evaluation set (the JFK long-audio WER clip and the FLEURS French split are both strictly held out) and to span decades, recording conditions and two languages so the activation distribution stays broad:

speaker	lang	speech	year	crop	source (YouTube id)
Dominique de Villepin	FR	UN Security Council address against the Iraq war	2003	390 s	`RNxU-tN8qNc`
Bernie Sanders	EN	Senate floor filibuster against the tax-cut extension	2010	390 s	`K6pa-QdL4Wo`
Georges Pompidou	FR	presidential press conference (INA archive)	1970	390 s	`RNWFPX_Yafw`
Lyndon B. Johnson	EN	"We Shall Overcome" voting-rights address	1965	390 s	`o74X_rTzrGI`
Jacques Chirac	FR	"Notre maison brule" Earth Summit speech	2002	60 s	`M_oR0wZ3lI4`
Richard Nixon	EN	resignation address	1974	60 s	`ZEOGJJ7UKFM`
Simone Veil	FR	speech defending the law legalizing abortion (INA)	1974	390 s	`45MOc6PYoY8`
Robert Badinter	FR	speech for abolishing the death penalty (INA)	1981	390 s	`kIVuz9NGQXY`

Each clip is decoded to 16 kHz mono and sliced into 30 s windows (the six long crops deliberately exercise the long-range regime where the int8 long-audio bug lives), then evenly subsampled across all eight speakers for the calibration pass. The fp32 encoder is the accuracy oracle, and the export ends with a cosine-similarity fidelity check of the new encoder's output against fp32.
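The windowing and subsampling can be sketched as follows. This is an illustration, not the code in scripts/quantize-int8-smoothquant.py: dropping a trailing partial window, the per-speaker quota of 2, and the helper names are all assumptions. Only the 16 kHz rate, the 30 s window and the crop lengths come from the text and table above.

```python
SAMPLE_RATE = 16_000      # Hz, mono
WINDOW_SECONDS = 30


def window_bounds(num_samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return (start, end) sample indices of consecutive full windows.

    Assumption: a trailing partial window is dropped.
    """
    size = sample_rate * window_seconds
    return [(start, start + size) for start in range(0, num_samples - size + 1, size)]


def subsample_evenly(windows_by_speaker, per_speaker):
    """Pick up to per_speaker windows per speaker, spread across each clip."""
    picked = []
    for speaker, windows in windows_by_speaker.items():
        n = min(per_speaker, len(windows))
        if n == 0:
            continue
        if n == 1:
            indices = [0]
        else:
            # Evenly spaced indices from the first window to the last.
            indices = [(i * (len(windows) - 1)) // (n - 1) for i in range(n)]
        picked.extend((speaker, windows[i]) for i in indices)
    return picked


# Crop lengths in seconds, as listed in the calibration table.
crops = {"de Villepin": 390, "Sanders": 390, "Pompidou": 390, "Johnson": 390,
         "Chirac": 60, "Nixon": 60, "Veil": 390, "Badinter": 390}

windows_by_speaker = {name: window_bounds(seconds * SAMPLE_RATE)
                      for name, seconds in crops.items()}
calibration = subsample_evenly(windows_by_speaker, per_speaker=2)  # quota is hypothetical

print({name: len(w) for name, w in windows_by_speaker.items()})
print("calibration windows:", len(calibration))
```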

These clips are not redistributed here (they are copyrighted third-party broadcasts; this repo ships under cc-by-4.0), which is why they are documented by source above rather than committed. To re-run the export, fetch them yourself from the listed sources and drop them in a calibration_audio/ folder at the repo root: scripts/quantize-int8-smoothquant.py reads that folder by default (or pass your own clips/folders with --audio).

Generalization (held-out, two domains, greedy vs beam)

As independent checks that the recalibrated int8 generalizes beyond the JFK clip, it was evaluated on two sets that are not in the calibration data: the FLEURS French validation split (a general-French read-speech benchmark) and a small in-house medical-dictation set. Both are scored greedy (beam 1) and with MAES beam search (width 10):

dataset	utterances	beam 1 WER	beam 10 WER	beam 10 CER
FLEURS French (validation)	289	5.05%	4.98%	2.06%
in-house medical dictation	205	17.65%	17.27%	10.39%
overall	494	9.37%	9.19%	5.13%

The 2.06% FLEURS-fr CER is consistent with the model staying strongly multilingual: French audio is part of the calibration set, but FLEURS itself is held out and no French transcript or label was ever used, so this is a genuine held-out measurement. Width-10 beam search buys only a small accuracy gain over greedy (roughly 0.1 to 0.4 WER points here) at about 10x the decode cost, so greedy is a reasonable default and the beam is there when the last fraction of a point matters. Run with scripts/grid_search_benchmark.mjs from the parakeet_web repository.

Trade-off: heavier than the stock int8, much more accurate

This int8 encoder is 842 MB versus the stock 622 MB. That is deliberate: only the MatMul ops are quantized, and the convolutional subsampling front-end is kept in fp32 (statically quantizing it collapsed the encoder to an empty transcript). The extra size buys long-audio accuracy that tracks fp16, while peak RAM stays at about half of fp16's (~5.0 GB versus ~9.5 GB). If you can run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8 matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8 is the only precision that both fits and runs.

Browser-friendly fp32 shards (`sharded/`)

The fp32 encoder is shipped two ways here: as the canonical single sidecar (encoder-model.onnx + a ~2.3 GB encoder-model.onnx.data), and, under sharded/, as the same weights repacked into several files each under 2 GB. The sharded copy exists so the fp32 encoder can be loaded in a web browser (and on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32 cannot.

Why a browser cannot load the single 2.3 GB sidecar (these are ingest limits, not a total-memory limit):

32-bit WASM ArrayBuffer cap. A WASM build is wasm32, so any single ArrayBuffer it holds caps at 2^31 - 1 bytes (~2 GB). A 2.3 GB sidecar cannot live in one buffer. (This is the same wall that forces projects like wllama to shard their GGUF files.)
Chromium blob-URL fetch cap. Fetching a blob: URL larger than ~2 GB fails in Chromium with TypeError: Failed to fetch, so the file cannot even be read into memory in one piece.

Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it is not upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so fp32 fits once no single buffer or fetch exceeds 2 GB. Sharding is purely about clearing the two per-buffer ingest walls above.

scripts/shard-fp32.py rewrites each big initializer's external_data location to spread the encoder's tensors across N shard files (encoder-model.onnx.data.000, encoder-model.onnx.data.001, ... each under a 1.5 GB budget by default), leaving a small rewritten encoder-model.onnx graph that points at them. Here that produces two shards (~1.4 GB + ~0.9 GB). It is a pure repack: no tensor value is touched, so the sharded encoder is numerically identical to the single-file fp32 and has the exact same WER.

A loader (for example parakeet_web, with its allowWasmFp32 opt-in) mounts each shard as a separate externalData entry, each under the 2 GB caps, and reads them straight to bytes (no >2 GB blob: URL, no multi-GB IndexedDB blob). The decoder, tokenizer and config are not duplicated into sharded/; a loader takes the rewritten encoder + shards from sharded/ and everything else from the repo root.
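The repacking idea can be sketched as a simple planning step: walk the tensors in order and start a new shard whenever the next tensor would exceed the budget. This is an assumption about the general approach, not the code in scripts/shard-fp32.py. The tensor names and sizes are made up, and reading "1.5 GB" as 1,500,000,000 bytes is also an assumption.

```python
WASM32_BUFFER_CAP = 2**31 - 1      # largest single ArrayBuffer, in bytes
DEFAULT_BUDGET = 1_500_000_000     # assumed byte value of the 1.5 GB budget


def plan_shards(tensor_sizes, budget=DEFAULT_BUDGET):
    """Assign tensors, in order, to consecutive shards under a byte budget.

    tensor_sizes: list of (name, num_bytes).
    Returns a list of shards, each a list of (name, offset, num_bytes).
    """
    shards, current, used = [], [], 0
    for name, size in tensor_sizes:
        if size > budget:
            raise ValueError(f"{name} ({size} bytes) exceeds the shard budget")
        if current and used + size > budget:
            # The next tensor would not fit: close this shard, start a new one.
            shards.append(current)
            current, used = [], 0
        current.append((name, used, size))
        used += size
    if current:
        shards.append(current)
    return shards


# Hypothetical encoder: 23 tensors of 100 MB each, 2.3 GB in total.
tensors = [(f"tensor_{i}", 100_000_000) for i in range(23)]
shards = plan_shards(tensors)

for index, shard in enumerate(shards):
    total = sum(size for _, _, size in shard)
    print(f"encoder-model.onnx.data.{index:03d}: {len(shard)} tensors, {total} bytes")
```

Each planned entry corresponds to the file, offset and length that a tensor's external-data record would point at. No tensor bytes change, only where they are stored.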

When to use which: on WebGPU, prefer fp16 (half the download, native fp16 kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards matter on CPU / WASM, where fp16 has no compute kernels and the single-file fp32 cannot be ingested, so the sharded fp32 is the only way to run full precision.
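The guidance above can be written down as a small selection helper. This is only a restatement of the recommendations in this section, not code from parakeet_web; the backend labels "webgpu" and "wasm" and the function name are hypothetical.

```python
def pick_encoder(backend: str, full_precision: bool = False) -> str:
    """Return the encoder file suggested above for a given backend.

    backend: "webgpu" or "wasm" (hypothetical labels for this sketch).
    full_precision: True when fp32 is wanted regardless of size.
    """
    if backend == "webgpu":
        # No 2 GB per-buffer wall: fp16 by default, single-file fp32 on request.
        return "encoder-model.onnx" if full_precision else "encoder-model.fp16.onnx"
    if backend == "wasm":
        # fp16 has no compute kernels here, and the single-file fp32
        # cannot be ingested, so full precision means the shards.
        return "sharded/encoder-model.onnx" if full_precision else "encoder-model.int8.onnx"
    raise ValueError(f"unknown backend: {backend!r}")


for backend in ("webgpu", "wasm"):
    for full in (False, True):
        print(backend, "full precision" if full else "default", "->",
              pick_encoder(backend, full))
```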

The shards are regenerated with scripts/shard-fp32.py (see How it was built).

Files

file	what it is
`encoder-model.onnx` (+ `.data`)	fp32 encoder (unchanged from istupakov)
`encoder-model.fp16.onnx`	fp16 encoder (not shipped by the upstream istupakov repo)
`encoder-model.int8.onnx`	SmoothQuant int8 encoder (the reason for this repo)
`sharded/encoder-model.onnx` (+ `.data.000`, `.data.001`)	fp32 encoder repacked into <2 GB shards so a browser / WASM backend can load it (see Browser-friendly fp32 shards)
`decoder_joint-model.onnx`	fp32 decoder / joint network (unchanged)
`decoder_joint-model.fp16.onnx`	fp16 decoder / joint network (unchanged)
`decoder_joint-model.int8.onnx`	int8 decoder / joint network (unchanged)
`nemo128.onnx`	128-bin mel preprocessor (unchanged)
`vocab.txt`, `config.json`	tokenizer and model config (unchanged)
`scripts/quantize-int8-smoothquant.py`	script that produced the SmoothQuant int8 encoder
`scripts/quantize-fp16.py`	script that produced the fp16 encoder
`scripts/shard-fp32.py`	script that produced the sharded fp32 encoder

How it was built

encoder-model.int8.onnx: scripts/quantize-int8-smoothquant.py. SmoothQuant + static per-channel int8, MatMul ops only (convolutions stay fp32), with Percentile activation calibration. Calibration uses no labels: it is the eight public speeches listed under Calibration data (disjoint from every eval set), read from a local calibration_audio/ folder by default (override with --audio), sliced into 30 s windows, with the fp32 encoder as the accuracy oracle. The script ends with a cosine-similarity fidelity check of the new encoder's output against fp32.
encoder-model.fp16.onnx: scripts/quantize-fp16.py, a straight fp16 cast of the fp32 encoder pieces.
sharded/: scripts/shard-fp32.py, a pure repack of the single-file fp32 encoder into <2 GB shards (see Browser-friendly fp32 shards). No weights are altered, so the sharded encoder is numerically identical to the single-file fp32.
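The cosine-similarity fidelity check mentioned for the int8 export can be sketched like this. It is a minimal illustration on flattened output vectors, not the check in scripts/quantize-int8-smoothquant.py. The vectors are made up, and the 0.99 threshold is a hypothetical value, not one taken from the script.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical flattened encoder outputs for the same audio window.
fp32_output = [0.50, -1.25, 2.00, 0.10, 3.50]
int8_output = [0.52, -1.20, 1.95, 0.08, 3.55]

similarity = cosine_similarity(fp32_output, int8_output)
THRESHOLD = 0.99  # hypothetical pass mark for this sketch

print(f"cosine similarity vs fp32: {similarity:.4f}")
print("fidelity check:", "pass" if similarity >= THRESHOLD else "fail")
```

A value near 1 means the quantized encoder's output points in almost the same direction as fp32's. It is a cheap sanity check and does not replace the WER comparison.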

All three scripts live in scripts/ and are self-contained. Run them from the repo root: by default they look for the model files in the current directory (for example, uv run scripts/quantize-fp16.py). Each declares its own dependencies via a PEP 723 header and runs with uv run (which installs them on the fly). The only external inputs are the calibration clips for the int8 export (fetched from the documented sources into calibration_audio/, see above) and the optional WER comparison harnesses named in the scripts' final output (wer-quants.py, wer-bench.mjs), which live in the parakeet_web project repository.

Sources and credits

ONNX base model this repo is built on: istupakov/parakeet-tdt-0.6b-v3-onnx
Original model: nvidia/parakeet-tdt-0.6b-v3
SmoothQuant implementation: onnx/neural-compressor
Loaded with onnx-asr
Discussion: Kieirra/murmure#289 (comment)
This repository (model export, benchmarking and documentation) was produced with Claude Code.

License

cc-by-4.0, inherited from the upstream istupakov ONNX model and the original NVIDIA Parakeet TDT 0.6B v3.