Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder
Changelog
2026-06-17
- Third candidate: models_in_testing/all_fleurs_balanced/. A SmoothQuant int8 encoder calibrated on the FLEURS train split of every language, ~50 clips per language, balanced by audio-seconds budget with no per-language weight (every language gets the same calibration-seconds budget, filled by a deterministic shuffle and truncated, so none dominates). It searches the finer --auto-alpha-step 0.1 SmoothQuant grid (vs the coarser 0.2 the french_only_100 run pinned). The intent is a calibration set that represents the whole multilingual distribution evenly, rather than down-weighting one language or going monolingual. Like the other candidates it is a real encoder-model.int8.onnx plus relative symlinks to the shared base files, shipped for tester feedback, not promoted to the canonical names.
- wer-fleurs-validation.sh is now committed (was local-only). It compares a roster of models (istupakov fp32/int8, this repo's int8, and the models_in_testing/ candidates) by per-language WER over the whole FLEURS validation split, driving the parent repo's scripts/wer-quants.py --manifest with each model loaded once over every language. It reads machine-specific paths (FLEURS_DIR, WER_QUANTS_SCRIPT) from the gitignored .env, so it carries no personal paths.
2026-06-16
- New models_in_testing/ area for sharing candidate (not-yet-promoted) encoder builds with testers. Each subfolder is a self-contained, loadable model directory: it holds one candidate file as a real artifact (for example encoder-model.int8.onnx) and borrows everything else (decoder, preprocessor, tokenizer, config, the fp32 / fp16 encoders) as relative symlinks back to the repo root, so a candidate costs only its own weights on disk instead of a full copy of the repo. models_in_testing/link-model-files.sh <dir> creates those missing links for any directory without overwriting files already present. The canonical encoder-model.int8.onnx at the repo root is unchanged (still the 2026-06-14 build); these are candidates under evaluation, not a new default.
- First candidate: models_in_testing/english_downweighted_0.2/. A SmoothQuant int8 encoder calibrated with English downweighted to a fifth of every other language's calibration-seconds budget (--fleurs-lang-weight en=0.2, with --fleurs-per-lang 6 and the duration-balanced FLEURS selection). The intent is to bias the int8 activation ranges away from the dominant training language so a degraded embedding is less likely to tip non-English audio into English (see Per-language weighting). It is shipped here for tester feedback, not promoted to the canonical names.
- Second candidate: models_in_testing/french_only_100/. A SmoothQuant int8 encoder calibrated on French only (a FLEURS root holding a single fr split, --fleurs-per-lang 100, no speeches), to test whether a monolingual calibration set sharpens the int8 activation ranges for French audio. Like the first candidate it is a real encoder-model.int8.onnx plus relative symlinks to the shared base files, shipped for tester feedback, not promoted to the canonical names.
2026-06-15
- Prefix-search recombination now ships off by default for decoding. A grid search with this int8 encoder + int8 decoder (in-house French-medical dictation + FLEURS-fr validation, 494 utterances, MAES beam 5), repeated on both the CPU (ONNX Runtime node) and GPU (cuda) backends, found the MAES prefix_alpha knob moved WER/CER only within noise (under 0.05 WER points either way) while costing ~15-20% more decode time, so parakeet_web defaults maes_prefix_alpha to 0 (NeMo uses 1). The same sweep kept beam 5 (overall accuracy plateaus by ~beam 5 on both backends) and the other MAES knobs at their defaults (num-steps 2, gamma 2.3); note CPU decode cost scales roughly linearly with beam width whereas the GPU's batched curve is flatter.
- The int8 decoder / joint network is as accurate as the fp32 one. Crossing encoder precision against decoder precision on the 494-utterance held-out harness (FLEURS-fr validation + in-house dictation, MAES beam 10), the int8 decoder matches the fp32 decoder to within noise for either encoder (with this int8 encoder and keyword boosting, 8.98% vs 9.01% overall WER), at about a quarter the size (18 MB vs ~70 MB). So the decoder quantization is effectively lossless. See The int8 decoder is as accurate as fp32.
2026-06-14
- Rebuilt encoder-model.int8.onnx to essentially match fp32 on long audio, at the cost of size. The export keeps the 11 most quantization-damaged MatMuls in fp32 (--exclude-worst 0.05, the mid-layer feed-forward linear2 projections) and quantizes the other 206 MatMuls plus all 77 Convs to int8. The file is 757 MB (stock int8 is 622 MB); the extra size is what buys back the accuracy.
- Calibration is now FLEURS-only, so the eight political speeches are no longer calibration data and are reused as a held-out long-audio evaluation set. Each of the 25 supported languages contributes 3 FLEURS train clips, one calibration window per clip with no concatenation (75 windows, kept whole).
- New long-audio benchmark (eight held-out speeches, single ~390 s pass each, fp32 encoder as oracle, fp32 decoder): mean overall WER 7.5% for this int8 versus 7.8% for fp32 and 24.8% for the stock int8. The recalibrated int8 now tracks fp32 to within noise and is about 3x better than the stock int8.
- Every fp32 weight file is mantissa-rounded (12 low bits zeroed,
round-to-nearest) for compressibility (the 2.3 GB fp32 encoder's weights then
compress to 1.55 GB, down from 2.26 GB unrounded, 31.6% smaller). This is
near-lossless (it keeps 11 of fp32's 23 mantissa bits, one more than fp16) and
applies to the int8 encoder's leftover fp32 weights (--zero-mantissa-bits 12), the fp32 decoder, and the standalone + sharded fp32 encoder. The fp32 decoder used in every benchmark here is itself rounded, and reproduced the stock-decoder WER to within 0.2 points, so the rounding does not move accuracy.
- Generalization re-measured (FLEURS-fr validation + in-house medical dictation, MAES beam 10, fp32 decoder): this int8 tracks fp32 closely, 5.18% / 4.87% WER on FLEURS-fr and 17.60% / 17.40% on the dictation set (int8 / fp32, no keyword boost). Details below.
- Export tooling: GPU calibration (--ep cuda), per-op-type alpha grids (--op-alpha), FLEURS calibration via --fleurs-dir, streaming Percentile calibration (flat RAM in the number of calibration windows), and an auto-alpha search that no longer retains protobuf arena memory on every evaluation (fine alpha grids used to exhaust RAM on the 2.4 GB encoder), all in the vendored neural-compressor fork. The fork's diy branch is pushed and its README documents every change versus upstream, making the whole export reproducible.
This is istupakov/parakeet-tdt-0.6b-v3-onnx
with two kinds of change. The substantive one is the int8 encoder
(encoder-model.int8.onnx), rebuilt with
SmoothQuant: with it I was no longer
able to reproduce the loss of accuracy on longer audio that I had measured on the
original int8 encoder. The cosmetic one is that every fp32 weight file here is
mantissa-rounded for compressibility (12 low mantissa bits zeroed,
round-to-nearest): it is near-lossless (it keeps one more mantissa bit than fp16),
so the models stay numerically equivalent but get far smaller once compressed.
The preprocessor and tokenizer are byte-for-byte from istupakov. This is still a
drop-in replacement: canonical names, numerically equivalent fp32/decoder, and
the better int8 is picked up automatically.
It also ships the fp16 encoder (which the upstream istupakov repo does not),
so all three precisions are available here in one place. Unlike the int8 encoder,
the fp16 one is not SmoothQuant or anything clever: it is a naive fp16 cast of
the fp32 pieces (scripts/quantize-fp16.py). In my testing it scored
exactly equal to fp32 (the same WER), at half the
size. Note that fp16 compute is not implemented on every backend (for example the
CPU / WASM ONNX Runtime EP has no fp16 kernels), so there it is upcast back to fp32
at session build and gives you no runtime benefit. Even then it is still useful
purely as a smaller artifact: it is half the download / packaging size of fp32, for
identical accuracy. That is why parakeet_web
serves it (on a WebGPU backend it also runs natively in fp16, halving GPU memory).
This was originally built to improve the int8 transcription quality of parakeet_web (live demo: parakeetweb.olicorne.org), a browser-based Parakeet ASR app that runs the int8 encoder on its CPU / WASM backend, where fp16 is not an option.
I also contribute to Kieirra/murmure, another browser-based Parakeet ASR project, where this SmoothQuant int8 encoder is progressively being upstreamed (see the discussion).
Why this exists
The stock int8 encoder transcribes short clips fine, but its accuracy degrades badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32 encoders do not show this, so the cause is not the model architecture but int8 numerics. The stock int8 uses fully dynamic, per-tensor activation quantization (one runtime scale for an entire activation tensor). Once a longer sequence widens the activation distribution, that single scale can no longer represent it and the transcript falls apart.
SmoothQuant targets exactly this failure mode: it migrates the per-channel activation outliers into the weights (a folded multiply), then statically quantizes activations together with per-channel weights. With the smoothed, per-channel int8 encoder I was no longer able to reproduce the long-audio degradation in my own testing (see the numbers below).
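To make the mechanism concrete, here is a minimal pure-Python sketch of the smoothing step. It rests on several assumptions: the matrices are made-up toy values, alpha is fixed at 0.5, and only the activations are fake-quantized (per-tensor, symmetric int8) so that the effect of smoothing is isolated. The helper names (smooth, quantize_per_tensor, max_abs_diff) are hypothetical and are not taken from the export script or the neural-compressor fork.

```python
def matmul(a, b):
    """Plain nested-list matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(X, W, alpha=0.5):
    """Move per-channel activation outliers into the weights.

    X is tokens x channels, W is channels x outputs. Input channel j is
    divided by s[j] in X and multiplied by s[j] in W, so X @ W is unchanged.
    """
    n_ch = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_ch)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_ch)]
    s = [max(act_max[j], 1e-8) ** alpha / max(w_max[j], 1e-8) ** (1 - alpha)
         for j in range(n_ch)]
    Xs = [[row[j] / s[j] for j in range(n_ch)] for row in X]
    Ws = [[v * s[j] for v in W[j]] for j in range(n_ch)]
    return Xs, Ws


def quantize_per_tensor(X):
    """Symmetric int8 fake-quantization with ONE scale for the whole tensor."""
    scale = max(abs(v) for row in X for v in row) / 127.0
    return [[max(-127, min(127, round(v / scale))) * scale for v in row] for row in X]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy data (assumption): channel 2 carries an activation outlier.
X = [[0.5, -0.3, 40.0], [-0.2, 0.4, -35.0], [0.1, -0.6, 50.0]]
W = [[0.8, -0.5], [0.3, 0.9], [0.02, -0.01]]

reference = matmul(X, W)
Xs, Ws = smooth(X, W)

# The transform is an exact equivalence, up to floating-point rounding.
equivalence_gap = max_abs_diff(reference, matmul(Xs, Ws))

# Output error caused by quantizing activations with a single scale.
err_before = max_abs_diff(reference, matmul(quantize_per_tensor(X), W))
err_after = max_abs_diff(reference, matmul(quantize_per_tensor(Xs), Ws))
print(f"gap={equivalence_gap:.2e} before={err_before:.4f} after={err_after:.4f}")
```

In this toy case the outlier channel forces a coarse scale on the unsmoothed tensor, so the small channels lose most of their resolution; after smoothing all channels share a similar range. The real export additionally quantizes the weights per channel and uses calibrated static ranges, which this sketch leaves out.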
Background and discussion: Kieirra/murmure#289 (comment).
Results
Benchmark: eight political speeches (listed under
Evaluation data), each
transcribed in a single pass with no chunking (capped at 390 s, the encoder's
positional-encoding reach), scored against the fp32 encoder as the oracle. The
eight speeches are held out of calibration (this build calibrates on FLEURS
only), so this is an out-of-sample long-audio test, exactly the regime where the
stock int8 falls apart. All three precisions use the fp32 decoder / joint
network (not the int8 decoder), and that fp32 decoder is itself
mantissa-rounded (see the changelog and
Files). The fp32 decoder is used here only to keep the encoder the
single variable under test; the int8 decoder / joint network is just as
accurate (see The int8 decoder is as accurate as fp32).
Run with scripts/wer-quants.py from the
parakeet_web repo.
Mean across the eight speeches (lower WER is better):
| encoder precision | encoder size | mean overall WER | mean worst-chunk WER |
|---|---|---|---|
| stock int8 (istupakov) | 622 MB | 24.8% | 58.7% |
| SmoothQuant int8 (this) | 757 MB | 7.5% | 37.1% |
| fp32 (oracle) | ~2.3 GB | 7.8% | 36.8% |
Per-speech overall WER:
| speech | stock int8 | SmoothQuant int8 | fp32 |
|---|---|---|---|
| Villepin (FR, Iraq) | 24.2% | 20.4% | 22.0% |
| Sanders (EN, filibuster) | 30.5% | 9.3% | 10.8% |
| Pompidou (FR, press) | 12.0% | 8.6% | 8.5% |
| Johnson (EN, voting rights) | 48.8% | 5.9% | 7.8% |
| Chirac (FR, Earth Summit) | 25.8% | 0.0% | 0.0% |
| Nixon (EN, resignation) | 0.9% | 0.9% | 0.0% |
| Veil (FR, abortion law) | 8.8% | 5.4% | 5.2% |
| Badinter (FR, death penalty) | 47.6% | 9.4% | 8.0% |
| mean | 24.8% | 7.5% | 7.8% |
The recalibrated int8 matches fp32 to within noise (7.5% vs 7.8% mean; it is better on some speeches, slightly worse on others) and is about 3x better than the stock int8 (24.8%). The "worst-chunk" column is each speech's single worst 60 s window; its mean is inflated by a few windows that score badly for fp32 too (for example one Villepin window is ~160% WER for every precision, an audio/oracle artifact rather than a quantization failure), which is why the SmoothQuant int8 (37.1%) again tracks fp32 (36.8%) rather than the stock int8 (58.7%).
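For readers unfamiliar with how a window can score above 100% WER, the following is a minimal sketch of word-level WER (substitutions + deletions + insertions, divided by the reference word count). It is an illustration only: the function name is hypothetical, the strings are made up, and it does not reproduce whatever text normalization scripts/wer-quants.py applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Insertions are counted against a short reference, so WER is not capped at 1.0.
print(word_error_rate("the cat sat", "the bat sat"))
print(word_error_rate("yes", "yes yes indeed so"))
```

Because the denominator is the reference length, a hypothesis that is much longer than the reference (or an oracle transcript that is unusually short for that window) can push a single window well past 100%.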
Speed and RAM are intentionally omitted. This run used --cuda, so the fp32
encoder ran on the GPU while the int8 encoder fell back to the CPU (ONNX Runtime
has no CUDA kernels for the QOperator int8 format), which makes the two
precisions' timing and host-memory numbers incomparable. Encoder size on disk is
the one hardware-independent cost shown.
Calibration data (FLEURS train only, no labels)
SmoothQuant is a static method: it needs representative activations (not labels or transcripts) to estimate per-channel ranges, which it then folds into the weights as an exact equivalence transform. No labels, transcripts or training targets are used, only raw audio as signal to exercise the activation ranges; the model's multilingual ability is inherited unchanged from the base model rather than learned or tuned here.
This build calibrates on the
FLEURS train split only
(--fleurs-dir). For each of the model's 25 supported languages, train clips are
taken (a deterministic shuffle, diverse speakers) as their own calibration
windows with no concatenation (FLEURS clips are short, so one window per clip).
Selection is balanced by audio duration, not clip count: SmoothQuant
accumulates its ranges over activation frames (∝ clip duration), so equal clip
counts would let whichever language has the longest clips dominate. Instead every
language is given the same audio-seconds budget (--fleurs-per-lang x the median clip length), filling it in shuffle order and truncating the clip that crosses it,
so a single long clip cannot skew a language's total. With uniform-length clips
this still works out to --fleurs-per-lang clips per language.
--window-sec is a separate per-window length cap that defaults to the
encoder's positional-encoding reach (400 s), so an unusually long clip would be
truncated rather than split.
Because calibration is FLEURS-train-only, every evaluation set is held out: the eight political speeches (the long-audio WER above), the FLEURS French validation split and the in-house medical dictation (both below), and the JFK clip are all disjoint from calibration. The fp32 encoder is the accuracy oracle, and the export ends with a cosine-similarity fidelity check of the new encoder's output against fp32 (0.96 for this build).
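The duration-balanced selection described above can be sketched as follows. This is a hedged illustration, not the export script's code: select_clips, the clip tuples and the seed are all hypothetical, and the budget is passed in directly rather than derived from --fleurs-per-lang and a median clip length.

```python
import random


def select_clips(clips, budget_sec, seed=0):
    """Fill a per-language audio-seconds budget in shuffled order.

    clips is a list of (clip_id, duration_sec). Returns (clip_id, used_sec)
    pairs; the clip that crosses the budget is truncated, not dropped.
    """
    order = sorted(clips)                 # fixed starting order
    random.Random(seed).shuffle(order)    # deterministic shuffle for a given seed
    chosen, used = [], 0.0
    for clip_id, duration in order:
        remaining = budget_sec - used
        if remaining <= 0:
            break
        take = min(duration, remaining)   # truncate the clip that crosses the budget
        chosen.append((clip_id, take))
        used += take
    return chosen


# Hypothetical clips for one language, with a 25 s budget.
clips = [("a", 10.0), ("b", 12.0), ("c", 8.0), ("d", 15.0)]
picked = select_clips(clips, budget_sec=25.0)
print(picked, sum(sec for _, sec in picked))
```

Running the same call for every language with the same budget gives each language an equal number of calibration seconds regardless of how long its individual clips are.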
The FLEURS audio is not redistributed here (it is a public dataset): download it
and point --fleurs-dir at a directory holding one <lang>/wavs_train/ folder
per language.
Per-language weighting (--fleurs-lang-weight). Every language gets the same
seconds budget by default, so the int8 activation ranges are a uniform 25-language
compromise. Because the multilingual TDT model decides its output language
implicitly from the (quantized) encoder embedding, letting a dominant training
language (English) weigh on those ranges can make a degraded embedding tip the
output into English on non-English audio. --fleurs-lang-weight LANG=W (repeatable)
scales one language's seconds budget: W applies to every FLEURS dir matching
LANG (the exact name like en_us, or the code before _ like en to cover all
variants), so 0 drops the language, <1 downweights it and >1 upweights it.
Unlisted languages keep the full budget. For example
--fleurs-lang-weight en=0.2 gives English a fifth of every other language's
calibration seconds, biasing the ranges away from English without favouring any
single target language.
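As a rough sketch of the matching rule above (exact directory name, or the code before the underscore), the weight lookup could look like this. The function names are hypothetical, and the precedence of an exact match over a language-code match is an assumption; the text above does not say which wins when both are given.

```python
def resolve_weight(fleurs_dir: str, weights: dict) -> float:
    """Return the budget multiplier for one FLEURS language directory."""
    if fleurs_dir in weights:              # exact name, e.g. "en_us"
        return weights[fleurs_dir]
    code = fleurs_dir.split("_", 1)[0]     # language code before "_", e.g. "en"
    return weights.get(code, 1.0)          # unlisted languages keep the full budget


def weighted_budget(base_budget_sec: float, fleurs_dir: str, weights: dict) -> float:
    """Scale one language's seconds budget; a weight of 0 drops the language."""
    return base_budget_sec * resolve_weight(fleurs_dir, weights)


weights = {"en": 0.2}  # corresponds to --fleurs-lang-weight en=0.2
for name in ("en_us", "fr_fr"):
    print(name, weighted_budget(60.0, name, weights))
```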
Balance to the smallest language (--fleurs-balance-smallest). Instead of a
fixed --fleurs-per-lang count, this calibrates on the whole FLEURS train set
of every language, capped so no language uses more audio than the least-represented
one. The per-language seconds budget becomes the true minimum total (capped)
duration across all languages: the smallest language is used whole and sets the cap,
every larger language is truncated to it, so all 25 languages contribute equal
seconds with no weight imbalance. It scans every clip's duration up front (decoding
the train set once), so it is slower to start and produces a large calibration pool
(~the minimum language's clip count x number of languages). It overrides
--fleurs-per-lang; --fleurs-lang-weight still scales the per-language budget.
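A minimal sketch of how that shared budget could be derived, assuming that "capped" means each clip is first limited to the per-window length cap; that reading, the function name and the sample durations are all assumptions rather than the script's actual code.

```python
def smallest_language_budget(durations_by_lang, window_sec=None):
    """Seconds budget = smallest per-language total, after an optional per-clip cap."""
    totals = {}
    for lang, durations in durations_by_lang.items():
        if window_sec is not None:
            durations = [min(d, window_sec) for d in durations]  # cap each clip
        totals[lang] = sum(durations)
    return min(totals.values())


# Hypothetical clip durations in seconds for two languages.
durations_by_lang = {"mt": [5.0, 6.0], "en": [10.0, 10.0, 10.0]}
print(smallest_language_budget(durations_by_lang))
```

Every language would then be filled up to this budget, so the smallest language is used whole and the others are truncated to match it.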
Evaluation data: eight held-out speeches (long audio)
The long-audio benchmark uses eight public political speeches, chosen to span decades, recording conditions and two languages, and held out of calibration:
| speaker | lang | speech | year | crop | source (YouTube id) |
|---|---|---|---|---|---|
| Dominique de Villepin | FR | UN Security Council address against the Iraq war | 2003 | 390 s | RNxU-tN8qNc |
| Bernie Sanders | EN | Senate floor filibuster against the tax-cut extension | 2010 | 390 s | K6pa-QdL4Wo |
| Georges Pompidou | FR | presidential press conference (INA archive) | 1970 | 390 s | RNWFPX_Yafw |
| Lyndon B. Johnson | EN | "We Shall Overcome" voting-rights address | 1965 | 390 s | o74X_rTzrGI |
| Jacques Chirac | FR | "Notre maison brule" Earth Summit speech | 2002 | 60 s | M_oR0wZ3lI4 |
| Richard Nixon | EN | resignation address | 1974 | 60 s | ZEOGJJ7UKFM |
| Simone Veil | FR | speech defending the law legalizing abortion (INA) | 1974 | 390 s | 45MOc6PYoY8 |
| Robert Badinter | FR | speech for abolishing the death penalty (INA) | 1981 | 390 s | kIVuz9NGQXY |
Each clip is decoded to 16 kHz mono and transcribed in a single pass (capped at
390 s). These clips are not redistributed here (they are copyrighted
third-party broadcasts; this repo ships under cc-by-4.0), which is why they are
documented by source above rather than committed. To reproduce the long-audio
benchmark, fetch them from the listed sources and drop them in a
calibration_audio/ folder at the repo root, then run
scripts/wer-quants.py --audio calibration_audio --decoder-quant fp32 from the
parakeet_web repo (the
--decoder-quant fp32 reproduces the fp32-decoder oracle of the table above;
wer-quants.py now defaults to the int8 decoder to match the production app). (The
folder name is historical: in this build the speeches are evaluation, not
calibration. To fold them back into calibration as long-text coverage, pass
--audio calibration_audio to scripts/quantize-int8-smoothquant.py.)
Generalization (held-out, two domains)
As independent checks that the recalibrated int8 generalizes beyond the eight speeches, it was evaluated on two sets that are not in the calibration data: the FLEURS French validation split (general-French read speech) and a small in-house medical-dictation set. Both are scored with MAES beam search (width 10) and the fp32 decoder, alongside the fp32 encoder under the identical harness, so this is a like-for-like int8-versus-fp32 comparison (lower is better):
| dataset | utterances | int8 WER | fp32 WER | int8 CER | fp32 CER |
|---|---|---|---|---|---|
| FLEURS French (validation) | 289 | 5.18% | 4.87% | 2.23% | 1.98% |
| in-house medical dictation | 205 | 17.60% | 17.40% | 10.40% | 10.38% |
| overall | 494 | 9.43% | 9.16% | 5.24% | 5.08% |
The int8 tracks fp32 within ~0.3 WER on FLEURS-fr and within ~0.2 WER on the dictation set, confirming the recalibrated encoder stays strongly multilingual (2.23% FLEURS-fr CER). French audio is in the calibration set, but the FLEURS validation split evaluated here is held out (calibration uses FLEURS train only) and no French transcript or label was ever used, so this is a genuine held-out measurement.
parakeet_web's domain-keyword boosting is orthogonal to quantization (it would
help any encoder) but shows the practical ceiling on the drug-name-heavy
dictation set: with boosting the dictation WER drops to 15.30% for int8 and
14.89% for fp32. On FLEURS-fr, which has no relevant keywords, boosting is a
near-no-op (5.72% int8 / 5.17% fp32), a sanity check. Run with
scripts/grid_search_benchmark.mjs from the
parakeet_web repository.
Full FLEURS validation, all 25 languages
The French number above is one slice of a wider check: the whole FLEURS
validation split of every one of the 25 supported languages (all clips, no
--limit), comparing the stock istupakov encoder at fp32 and int8 against this
repo's SmoothQuant int8. Every model uses the fp32 decoder / joint (so the
encoder is the only variable), WER is normalized (case / punctuation folded), and
all three were scored under one harness with scripts/wer-quants.py --manifest
from the parakeet_web
repo. The validation split is held out of calibration (this build calibrates on
FLEURS train only).
Every WER below uses greedy decoding (beam width 1). A wider beam search would lower the absolute numbers across all three models, so read these as conservative figures (an upper bound on each model's WER, and above all as relative gaps between the encoders, which is what the comparison is about), not as the best achievable WER. The width-10 MAES beam used for the French-only figure above is why that number reads lower.
Per-language WER (%), lower is better:
| lang | ref words | istupakov fp32 | istupakov int8 | this repo int8 |
|---|---|---|---|---|
| bg | 8084 | 11.94 | 15.51 | 12.37 |
| cs | 5396 | 12.92 | 21.98 | 12.75 |
| da | 8018 | 18.06 | 25.95 | 19.54 |
| de | 7410 | 5.47 | 8.07 | 5.82 |
| el | 6061 | 36.69 | 43.18 | 37.96 |
| en | 8323 | 5.88 | 7.88 | 6.12 |
| es | 10125 | 3.58 | 4.58 | 3.85 |
| et | 6035 | 17.28 | 23.55 | 18.00 |
| fi | 6216 | 12.81 | 18.48 | 13.80 |
| fr | 7021 | 4.90 | 6.08 | 5.70 |
| hr | 6904 | 11.62 | 17.45 | 12.40 |
| hu | 7356 | 15.61 | 30.26 | 16.23 |
| it | 9018 | 2.71 | 3.81 | 3.10 |
| lt | 6760 | 22.91 | 34.05 | 24.66 |
| lv | 5988 | 23.75 | 37.12 | 25.52 |
| mt | 9185 | 20.88 | 40.97 | 22.46 |
| nl | 3736 | 8.43 | 10.44 | 8.94 |
| pl | 5940 | 6.23 | 9.71 | 6.75 |
| pt | 8464 | 4.86 | 4.69 | 5.07 |
| ro | 8746 | 13.31 | 21.39 | 14.35 |
| ru | 6501 | 6.21 | 8.14 | 6.60 |
| sk | 6440 | 9.19 | 16.06 | 9.91 |
| sl | 6431 | 24.97 | 46.60 | 27.27 |
| sv | 6370 | 14.58 | 19.20 | 15.65 |
| uk | 5900 | 6.53 | 9.80 | 6.95 |
| MACRO | | 12.85 | 19.40 | 13.67 |
| MICRO | | 12.49 | 18.99 | 13.30 |
MACRO is the unweighted mean of per-language WER; MICRO is total word edits over total reference words across all languages. The SmoothQuant int8 closes almost the entire stock-int8 gap to fp32 across the board: it is within ~0.8 WER of fp32 on the macro average (13.67% vs 12.85%) while the stock int8 trails by ~6.5 WER (19.40%). The gain is largest exactly where the stock int8 collapses worst, the morphologically rich / lower-resource languages (Maltese 40.97% -> 22.46%, Slovenian 46.60% -> 27.27%, Hungarian 30.26% -> 16.23%, Latvian 37.12% -> 25.52%). It beats the stock int8 on every language except Portuguese, where the stock int8 (4.69%) edges out both this int8 (5.07%) and fp32 (4.86%).
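The difference between the two averages is easy to see in a small sketch. The edit and word counts below are made up for illustration (they are not taken from the table), and the function names are hypothetical.

```python
def macro_wer(counts):
    """Unweighted mean of per-language WER, in percent."""
    return 100.0 * sum(edits / words for edits, words in counts.values()) / len(counts)


def micro_wer(counts):
    """Total word edits over total reference words, in percent."""
    total_edits = sum(edits for edits, _ in counts.values())
    total_words = sum(words for _, words in counts.values())
    return 100.0 * total_edits / total_words


# Hypothetical (edits, reference words) per language.
counts = {"xx": (50, 1000), "yy": (30, 100)}
print(f"macro={macro_wer(counts):.2f} micro={micro_wer(counts):.2f}")
```

MACRO treats every language equally, so a small language with a high WER moves it a lot; MICRO weights each language by its reference word count, so large languages dominate.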
The int8 decoder is as accurate as fp32
Every WER above uses the fp32 decoder / joint network as a clean oracle, so that the encoder stays the only variable under test. But the int8 decoder / joint is just as accurate, and it is much smaller: 18 MB versus ~70 MB for the fp32 decoder (and 35 MB for fp16). Crossing encoder precision against decoder precision on the same 494-utterance held-out harness (FLEURS-fr validation + in-house medical dictation, MAES beam 10, keyword boosting on), the int8 decoder matches the fp32 decoder to within noise (here even marginally better) for either encoder:
| encoder | decoder | overall WER | overall CER |
|---|---|---|---|
| int8 (this repo) | int8 | 8.98% | 4.73% |
| int8 (this repo) | fp32 | 9.01% | 4.87% |
| fp32 (oracle) | int8 | 8.32% | 3.84% |
| fp32 (oracle) | fp32 | 8.48% | 4.14% |
Without keyword boosting the two decoders are likewise tied (this int8 encoder
scores 9.43% overall WER with either decoder). So the decoder / joint
quantization is effectively lossless here: a loader can pair any encoder
precision with the much smaller int8 decoder at no measurable accuracy cost. The
benchmarks above still report the fp32 decoder only so the encoder stays the
single variable. Run with scripts/grid_search_benchmark.mjs from the
parakeet_web repository.
Trade-off: larger than the stock int8, but fp32-grade accuracy
This build deliberately trades file size for accuracy. At 757 MB it is larger
than the stock int8 (622 MB), because --exclude-worst 0.05 keeps the 11 most
quantization-damaged MatMuls
(mid-layer feed-forward linear2 projections) in fp32 instead of int8. That
escape hatch is what closes the long-audio gap: this int8 now matches fp32
(7.5% vs 7.8% mean WER on the eight held-out speeches) instead of trailing it.
The leftover fp32 weights are mantissa-rounded (12 low bits zeroed), so although
the file is 757 MB on disk it compresses well for transfer (zip / HTTP
content-encoding) while staying more precise than an fp16 cast.
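What "12 low mantissa bits zeroed, round-to-nearest" means at the bit level can be sketched on a single value with the standard library. This is an illustration under assumptions, not the repo's zero_fp32_mantissa helper: the function name is hypothetical, and ties are rounded up in magnitude here, whereas the real helper may break ties differently.

```python
import struct


def round_fp32_mantissa(x: float, zero_bits: int = 12) -> float:
    """Round x (as float32) so its lowest `zero_bits` mantissa bits are zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000:
        return x                                  # leave inf / NaN untouched
    half = 1 << (zero_bits - 1)                   # half of the dropped range
    mask = ~((1 << zero_bits) - 1) & 0xFFFFFFFF   # keeps sign, exponent, top mantissa bits
    rounded = (bits + half) & mask                # add half, then truncate
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


value = 0.1
rounded = round_fp32_mantissa(value)
rounded_bits = struct.unpack("<I", struct.pack("<f", rounded))[0]
print(rounded, f"{rounded_bits:032b}")
```

With 12 of the 23 mantissa bits cleared, 11 remain, so for normal values the relative error stays within 2^-12 while the fp32 exponent range is kept. The run of trailing zero bits in every weight is what makes the file compress so well.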
Dialing --exclude-worst higher (size / RAM / accuracy)
--exclude-worst is the lever for this trade-off, and it is close to linear in the
useful range. Of the 217 searched MatMuls the quantization damage is concentrated
in the feed-forward linear2 down-projections: the 46 worst-ranked nodes are
all linear2, and rank 47 (the first attention layer) sits at less than half the
loss, a clean gap. Since every linear2 weight is the same size (4096x1024), each
node you keep in fp32 instead of int8 costs a fixed amount:
- +12.0 MiB (~12.6 MB) uncompressed, which is also the +resident RAM at inference (ONNX Runtime loads weights resident; the resident size tracks the uncompressed file, not the gzip size, so the mantissa-rounding that shrinks the download does not shrink RAM).
- +~7 MiB gzip (the fp32-rounded weights compress to ~0.635 of raw, the int8 they replace to ~0.71-0.80, measured on these files).
| --exclude-worst | fp32 linear2 kept | encoder file ≈ resident RAM | gzip download |
|---|---|---|---|
| 0 (pure int8) | 0 | 622 MB | 443 MB |
| 11 (0.05, shipped) | 11 | 757 MB | 579 MB |
| 35 | 35 | ~1045 MB | ~745 MB |
| 46 (whole hard cluster) | 46 | ~1177 MB | ~830 MB |
Pass it as an integer (e.g. --exclude-worst 35) to keep exactly the N worst
nodes; a float in (0, 1) keeps that fraction of the 217 searched MatMuls (round(fraction * 217), so 0.05 keeps 11). Returns diminish
fast: the shipped 11 already match fp32 on long audio (7.5% vs 7.8% WER), so higher
values buy fidelity in fractions of a percent for hundreds of MB of file/RAM. The
single-file int8 stays WASM-loadable up to ~0.85 (above that it crosses the ~2 GB
single-file ingest wall and would need sharding like the fp32 encoder).
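The per-node figures quoted above follow from simple arithmetic, reproduced here as a sketch. The shape, the node count and the compression ratios come from the text; the function names and the 0.75 default (a midpoint of the quoted 0.71-0.80 range) are assumptions for illustration.

```python
MIB = 1024 * 1024
LINEAR2_SHAPE = (4096, 1024)   # every linear2 weight, per the text above
SEARCHED_MATMULS = 217


def extra_mib_per_node() -> float:
    """Uncompressed cost of keeping one linear2 weight in fp32 instead of int8."""
    params = LINEAR2_SHAPE[0] * LINEAR2_SHAPE[1]
    return (params * 4 - params * 1) / MIB      # 4 bytes (fp32) vs 1 byte (int8)


def extra_gzip_mib_per_node(fp32_ratio: float = 0.635, int8_ratio: float = 0.75) -> float:
    """Compressed cost, using the measured compression ratios quoted above."""
    params = LINEAR2_SHAPE[0] * LINEAR2_SHAPE[1]
    return (params * 4 * fp32_ratio - params * 1 * int8_ratio) / MIB


def nodes_kept(exclude_worst) -> int:
    """An integer keeps exactly that many nodes; a float in (0, 1) is a fraction."""
    if isinstance(exclude_worst, int):
        return exclude_worst
    return round(exclude_worst * SEARCHED_MATMULS)


print(extra_mib_per_node(), round(extra_gzip_mib_per_node(), 2), nodes_kept(0.05))
```

This covers only the incremental cost per node; the totals in the table also include the rest of the encoder, which this sketch does not model.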
int8 still matters most on a CPU / WASM backend, where fp16 has no compute kernels and the single-file fp32 cannot be ingested (see the shards section below), so int8 is the only precision that both fits and runs. If you can run fp16 or fp32 (for example on a WebGPU backend), prefer those.
Browser-friendly fp32 shards (sharded/)
The fp32 encoder is shipped two ways here: as the canonical single sidecar
(encoder-model.onnx + a ~2.3 GB encoder-model.onnx.data), and, under
sharded/, as the same weights repacked into several files each under 2 GB.
Both copies hold the mantissa-rounded fp32 weights (see the
changelog); the rounding is near-lossless but makes the encoder far
more compressible: DEFLATE squeezes the rounded weights to 1.55 GB versus
2.26 GB unrounded (31.6% smaller, ~716 MB saved), while the raw fp32 stays
2.44 GB on disk. The sharded copy exists so the fp32 encoder
can be loaded in a web browser (and on the CPU / WASM ONNX Runtime backend
generally), which the single-file fp32 cannot.
Why a browser cannot load the single 2.3 GB sidecar (these are ingest limits, not a total-memory limit):
- 32-bit WASM ArrayBuffer cap. A WASM build is wasm32, so any single ArrayBuffer it holds caps at 2^31 - 1 bytes (~2 GB). A 2.3 GB sidecar cannot live in one buffer. (This is the same wall that forces projects like wllama to shard their GGUF files.)
- Chromium blob-URL fetch cap. Fetching a blob: URL larger than ~2 GB fails in Chromium with TypeError: Failed to fetch, so the file cannot even be read into memory in one piece.
Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it is not upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so fp32 fits once no single buffer or fetch exceeds 2 GB. Sharding is purely about clearing the two per-buffer ingest walls above.
scripts/shard-fp32.py rewrites each big initializer's external_data location to spread
the encoder's tensors across N shard files (encoder-model.onnx.data.000,
encoder-model.onnx.data.001, ... each under a 1.5 GB budget by default), leaving a
small rewritten encoder-model.onnx graph that points at them. Here that produces
two shards (~1.4 GB + ~0.9 GB). It is a pure repack: no tensor value is
touched, so the sharded encoder is byte-for-byte numerically identical to the
single-file fp32 and has the exact same WER. A loader (for example
parakeet_web, with its
allowWasmFp32 opt-in) mounts each shard as a separate externalData entry, each
under the 2 GB caps, and reads them straight to bytes (no >2 GB blob: URL, no
multi-GB IndexedDB blob). The decoder, tokenizer and config are not duplicated
into sharded/; a loader takes the rewritten encoder + shards from sharded/ and
everything else from the repo root.
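The repacking idea can be sketched as a simple planning step that assigns tensors to shard files under a byte budget. This is not the code of scripts/shard-fp32.py: the greedy in-order packing, the function names and the tensor sizes are assumptions for illustration; only the shard file naming pattern comes from the text above.

```python
def plan_shards(tensor_sizes, budget_bytes):
    """Greedily pack (name, size_bytes) tensors, in order, into shards under a budget.

    A tensor larger than the budget still gets a shard of its own.
    """
    shards, current, used = [], [], 0
    for name, size in tensor_sizes:
        if current and used + size > budget_bytes:
            shards.append(current)        # close the shard that would overflow
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        shards.append(current)
    return shards


def shard_name(index: int) -> str:
    """File name of the index-th shard, following the pattern described above."""
    return f"encoder-model.onnx.data.{index:03d}"


# Hypothetical tensor sizes (arbitrary units) with a budget of 1500.
tensors = [("w0", 900), ("w1", 500), ("w2", 400), ("w3", 300)]
for i, names in enumerate(plan_shards(tensors, 1500)):
    print(shard_name(i), names)
```

In a real repack each tensor's external_data location, offset and length would then be rewritten to point into its assigned shard, leaving the tensor bytes themselves untouched.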
When to use which: on WebGPU, prefer fp16 (half the download, native fp16 kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards matter on CPU / WASM, where fp16 has no compute kernels and the single-file fp32 cannot be ingested, so the sharded fp32 is the only way to run full precision.
The shards are regenerated with scripts/shard-fp32.py (see
How it was built).
Files
Every tracked file is listed below with its origin. "from istupakov" means the
file is copied byte-for-byte from
istupakov/parakeet-tdt-0.6b-v3-onnx
and is unchanged. "from istupakov, mantissa-rounded" means istupakov's
weights with the 12-low-bit mantissa rounding applied (near-lossless, still fp32,
see the changelog). "generated" means the file is the output of
one of the scripts in scripts/; the exact command that produces it is given (run
from the repo root, see How it was built).
| file | origin | what it is |
|---|---|---|
encoder-model.onnx (+ .data) |
from istupakov, mantissa-rounded | fp32 encoder (the accuracy oracle the other quants are built from); istupakov weights with 12 low mantissa bits zeroed for compressibility |
encoder-model.fp16.onnx |
generated: uv run scripts/quantize-fp16.py |
fp16 encoder (a naive fp16 cast of the fp32 pieces; not shipped by the upstream istupakov repo). Mantissa rounding is fp32-only, so not applied here |
encoder-model.int8.onnx | generated: uv run scripts/quantize-int8-smoothquant.py | SmoothQuant int8 encoder (the reason for this repo); its 11 excluded-worst layers and other leftover fp32 weights are mantissa-rounded (--zero-mantissa-bits 12) |
sharded/encoder-model.onnx (+ .data.000, .data.001) | generated: uv run scripts/shard-fp32.py | the mantissa-rounded fp32 encoder repacked into <2 GB shards (a pure byte-identical repack) so a browser / WASM backend can load it (see Browser-friendly fp32 shards) |
decoder_joint-model.onnx | from istupakov, mantissa-rounded | fp32 decoder / joint network; istupakov weights with 12 low mantissa bits zeroed (near-lossless). The fp32 decoder used in every benchmark above |
decoder_joint-model.fp16.onnx | from istupakov | fp16 decoder / joint network (fp16, not rounded) |
decoder_joint-model.int8.onnx | from istupakov | int8 decoder / joint network (int8, not rounded); as accurate as the fp32 decoder at ~1/4 the size (see The int8 decoder is as accurate as fp32) |
nemo128.onnx | from istupakov | 128-bin mel preprocessor (137 KB; left stock, too small for rounding to matter) |
vocab.txt | from istupakov | tokenizer vocabulary |
config.json | from istupakov | model config |
scripts/quantize-int8-smoothquant.py | this repo | script that produced encoder-model.int8.onnx |
scripts/quantize-fp16.py | this repo | script that produced encoder-model.fp16.onnx |
scripts/shard-fp32.py | this repo | script that produced sharded/ |
scripts/zero-mantissa-bits.py | this repo | standalone tool: rounds an fp32 model's weights to N zero low mantissa bits (round-to-nearest, default 12) to make it far more compressible while staying fp32; refuses to re-round an already-rounded file unless --force, and by default reports a DEFLATE (gzip/zip, the algorithm murmure ships its model with) before/after size + compress/decompress-time comparison |
scripts/mantissa.py | this repo | shared fp32 mantissa helper imported by the int8 export (--zero-mantissa-bits) and zero-mantissa-bits.py: zero_fp32_mantissa (the rounding) and mantissa_floor_bits (existing-truncation detector) |
scripts/test_quantize-int8-smoothquant.py | this repo | regression tests (T1-T34) for the SmoothQuant fixes in the vendored fork and the export/tooling helpers (uv run scripts/test_quantize-int8-smoothquant.py) |
neural-compressor-fork/ | git submodule (fork, branch diy) | our fork of onnx/neural-compressor carrying the SmoothQuant fixes the int8 script needs; imported via the script's [tool.uv.sources] |
models_in_testing/link-model-files.sh | this repo | helper that turns any directory into a complete, loadable model dir by symlinking every missing standard model file (relative links) back to the repo root, without overwriting files already present; used to assemble the candidate dirs below |
models_in_testing/english_downweighted_0.2/ | this repo (candidate) | a candidate SmoothQuant int8 encoder shared with testers but not promoted to the canonical names: a real encoder-model.int8.onnx (English downweighted in calibration, --fleurs-lang-weight en=0.2) plus relative symlinks to the shared base files. See the changelog |
models_in_testing/french_only_100/ | this repo (candidate) | a candidate SmoothQuant int8 encoder shared with testers but not promoted to the canonical names: a real encoder-model.int8.onnx (calibrated on French only: a single-fr FLEURS root, --fleurs-per-lang 100, no speeches) plus relative symlinks to the shared base files. See the changelog |
models_in_testing/all_fleurs_balanced/ | this repo (candidate) | a candidate SmoothQuant int8 encoder shared with testers but not promoted to the canonical names: a real encoder-model.int8.onnx (calibrated on every FLEURS language, ~50 clips/language, balanced by audio-seconds budget with no per-language weight, --auto-alpha-step 0.1) plus relative symlinks to the shared base files. See the changelog |
run.sh | this repo | convenience wrapper that runs the int8 export, then the WER eval, with the exact flags used for the current build; reads machine-specific paths from a gitignored .env (see How it was built) so it carries no personal paths |
wer-fleurs-validation.sh | this repo | per-language FLEURS validation WER comparison harness: drives the parent repo's scripts/wer-quants.py --manifest to score a roster of models (istupakov fp32/int8, this repo's int8, and the models_in_testing/ candidates) against the human labels, each model loaded once over every language, and prints a model x language WER matrix. Reads FLEURS_DIR/WER_QUANTS_SCRIPT from the gitignored .env so it carries no personal paths; pre-flight skips any model dir that does not resolve |
.env.example | this repo | template for the gitignored .env that run.sh and wer-fleurs-validation.sh source; copy it to .env and set FLEURS_DIR (your local FLEURS path) and, if your layout differs, WER_QUANTS_SCRIPT (path to the parent repo's wer-quants.py) |
README.md | this repo | this document |
.gitmodules | this repo | declares the neural-compressor-fork/ submodule |
.gitattributes | this repo | Git LFS / line-ending attributes for the model artifacts |
.gitignore | this repo | excludes the (copyrighted) calibration_audio/, local logs, and the personal .env from the repo |
(The copyrighted calibration_audio/ clips and the local *.log benchmark
outputs are gitignored and not part of the published repo; the speeches in
calibration_audio/ are documented by source under Evaluation data instead of
being redistributed.)
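The candidate directories listed above each hold one real file plus relative symlinks for everything else. A rough Python sketch of that assembly step follows. It is an illustration only: the actual helper is the shell script models_in_testing/link-model-files.sh, and both the `STANDARD_FILES` list and the `link_missing_files` name are assumptions made for this example.

```python
import os
from pathlib import Path

# Assumed subset of the shared base files; the real helper's list may differ.
STANDARD_FILES = ["decoder_joint-model.onnx", "nemo128.onnx", "vocab.txt", "config.json"]


def link_missing_files(repo_root: Path, candidate_dir: Path, names=STANDARD_FILES) -> list[str]:
    """Symlink every missing file in candidate_dir back to repo_root (relative links).

    Files or links already present in candidate_dir are left untouched.
    Returns the names that were linked.
    """
    created = []
    candidate_dir.mkdir(parents=True, exist_ok=True)
    for name in names:
        target = repo_root / name
        link = candidate_dir / name
        if link.exists() or link.is_symlink():
            continue  # never overwrite the candidate's own artifact
        if not target.exists():
            continue  # nothing to borrow for this name
        # A relative target keeps the link valid when the whole repo is moved.
        link.symlink_to(os.path.relpath(target, start=candidate_dir))
        created.append(name)
    return created
```

Calling `link_missing_files(Path("."), Path("models_in_testing/my_candidate"))` from the repo root (with a hypothetical `my_candidate` folder) would leave an existing encoder-model.int8.onnx in that folder alone and link only the names that are absent.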
How it was built
encoder-model.int8.onnx: scripts/quantize-int8-smoothquant.py. SmoothQuant + static per-channel int8 of the MatMul and Conv ops, with Percentile activation calibration. The exact command that produced the current build:

    uv run scripts/quantize-int8-smoothquant.py \
      --op-types MatMul,Conv \
      --op-alpha MatMul=0.0:1.0:0.1 \
      --ep cuda \
      --fleurs-dir /path/to/fleurs --fleurs-per-lang 3 \
      --exclude-worst 0.05 \
      --zero-mantissa-bits 12
    # No --audio -> calibrate on FLEURS only; the eight speeches stay held out as
    # long-audio evaluation. Add --audio calibration_audio to fold them back in.

--exclude-worst 0.05 keeps the most quantization-damaged ~5% of searched MatMuls in fp32 (here 11 of 217, the mid-layer feed-forward linear2 projections); this is what brings the long-audio WER down to fp32 level, at the cost of file size. --zero-mantissa-bits 12 rounds the leftover fp32 weights for compressibility (near-lossless).

Note: if you re-run this, the coarser MatMul=0.0:1.0:0.2 alpha grid is the better default: in earlier testing it scored just as well as the 0.1 grid used above while running the alpha search in about half the time and with less RAM.

The export is resumable: each run streams its intermediates (mel features, smoother calibration, the per-node auto-alpha results in alphas.jsonl, the static-calibration params) into sq-cache/<hash-of-the-configuration>/ as they are produced, so if the run dies (typically OOM during static calibration) re-running the same command with --resume picks up from the last completed step instead of re-paying the hours-long alpha search. --resume DIR keeps the cache in a folder of your choosing instead (created if missing, refused if its recorded configuration differs).

Resume granularity goes below whole steps: the two per-sample calibration loops (the smoother's activation collection and the static Percentile/Entropy calibration) additionally dump their in-progress state at most once per --checkpoint-interval-min (default 20 minutes), so a run killed inside one of those passes resumes from the last dumped sample rather than redoing the pass; a pass faster than the interval writes nothing extra.

The smoother's activation collection streams its per-channel percentile through a running top-k (bit-identical to the stacked np.percentile, enforced by test) instead of holding every sample's activations in RAM, so its memory no longer grows with windows x window length (75 x 395 s windows used to need ~444 GB; the streamed state is a few hundred MB). alphas.jsonl (one appended line per node) doubles as a per-layer report of the alpha each node picked and its QDQ loss over the whole grid.

On --ep cuda with long windows the remaining peak is the static calibration's GPU memory: it augments the encoder so every calibrated tensor is a graph output, and ONNX Runtime keeps all of them resident for the whole forward, so one long-window forward (the per-layer attention-score MatMuls are ~0.8 GB each near the ~400 s reach) can exceed a 24 GB GPU even though the smoother and alpha-search passes fit. --calib-dump-batch N caps it by dumping the tensors in slices of N graph outputs per forward (re-running the calibration set once per slice); the per-tensor result is bit-identical to dumping all at once (enforced by test), so it is a pure memory/speed knob and, like --ep, is not part of the cache hash, so you can add it on --resume without invalidating the alpha search. On a 24 GB GPU with long windows try --calib-dump-batch 32 (or 64). A progress bar tracks each calibration slice.

Calibration uses no labels: the FLEURS train splits only (see Calibration data), one window per clip, with the fp32 encoder as the accuracy oracle. The script ends with a cosine-similarity fidelity check of the new encoder's output against fp32. (Pass --audio calibration_audio to also calibrate on the eight speeches as long-text coverage.)

It imports SmoothQuant from our vendored neural-compressor-fork/ submodule (branch diy), whose README documents every change versus upstream onnx/neural-compressor: auto-alpha search fixes (it otherwise runs on an exhausted calibration reader and returns a degenerate alpha for every layer), per-op-type alpha grids, streaming Percentile calibration (flat RAM in the window count), per-node QDQ session caching, an alpha search that retains no protobuf arena memory per evaluation, and more. Clone with --recurse-submodules (or run git submodule update --init) so uv run can find the fork.

encoder-model.fp16.onnx: scripts/quantize-fp16.py, a straight fp16 cast of the fp32 encoder pieces.

sharded/: scripts/shard-fp32.py, a pure repack of the single-file fp32 encoder into <2 GB shards (see Browser-friendly fp32 shards). No weights are altered, so the sharded encoder is numerically identical to the single-file fp32.

Mantissa rounding (compressibility): scripts/zero-mantissa-bits.py rounds an fp32 model's weights to 12 zero low mantissa bits (round-to-nearest, near-lossless, more precise than an fp16 cast). It is applied to the standalone fp32 encoder and the fp32 decoder; the int8 export does the same to its leftover fp32 weights via --zero-mantissa-bits 12. Round the fp32 encoder before sharding so the shards carry the rounded weights:

    uv run scripts/zero-mantissa-bits.py encoder-model.onnx # -> encoder-model.zm12.onnx
    uv run scripts/zero-mantissa-bits.py decoder_joint-model.onnx # -> decoder_joint-model.zm12.onnx

Each rounded file then takes its canonical name in the published repo, so the models stay drop-in. To round directly under the canonical name (so the <name>.data sidecar reference inside the .onnx is never left pointing at a renamed <stem>.zm12.onnx.data), pass --inplace to overwrite the input and reuse its .data name:

    uv run scripts/zero-mantissa-bits.py encoder-model.onnx --inplace

zero-mantissa-bits.py refuses to re-round an already-rounded file unless --force.
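The mantissa rounding described above can be illustrated on a single value. This is a minimal stdlib sketch, not the code in scripts/mantissa.py: it assumes round-to-nearest is done by adding half a step to the raw IEEE-754 bit pattern and masking the low bits (so ties round away from zero), and the names `round_low_mantissa_bits` and `common_zero_low_bits` are hypothetical. An fp32 value has 23 explicit mantissa bits, so zeroing 12 leaves 11, one more than fp16's 10, with the fp32 exponent range kept.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after conversion to IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Float value of a 32-bit float32 pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_low_mantissa_bits(x: float, n_bits: int = 12) -> float:
    """Round x (as float32) so its n_bits lowest mantissa bits are zero."""
    if not 0 <= n_bits < 23:
        raise ValueError("n_bits must be in [0, 23)")
    bits = f32_bits(x)
    # Leave inf/NaN (exponent all ones) and the n_bits == 0 case untouched.
    if n_bits == 0 or (bits & 0x7F800000) == 0x7F800000:
        return bits_f32(bits)
    half = 1 << (n_bits - 1)                  # half of one rounding step
    mask = 0xFFFFFFFF ^ ((1 << n_bits) - 1)   # clears the low n_bits
    # A carry out of the mantissa bumps the exponent, which is the correct
    # rounding to the next power of two.
    return bits_f32((bits + half) & mask)


def common_zero_low_bits(values) -> int:
    """Number of low mantissa bits that are zero in every value (23 if all are)."""
    acc = 0
    for v in values:
        acc |= f32_bits(v) & 0x007FFFFF
    if acc == 0:
        return 23
    return (acc & -acc).bit_length() - 1      # count trailing zero bits
```

For normal values the relative error of this rounding is at most 2^-12. A check like `common_zero_low_bits(weights) >= 12` is one simple way a tool could notice that a file was already rounded; whether the repo's detector works this way is an assumption.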
run.sh is a convenience wrapper that runs the int8 export above (with the exact
flags used for the current build) followed by the WER eval. It keeps no personal
paths: copy .env.example to .env (gitignored), set FLEURS_DIR to your local
FLEURS path (and WER_QUANTS_SCRIPT if your wer-quants.py is not at the default
in-monorepo location), then run ./run.sh. It exits early with a clear message if
.env is missing.
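The early-exit behavior described above can be sketched in Python for readers who drive the export from their own tooling. This is an analogue written for illustration, not the contents of run.sh (which is a shell script): the function names are hypothetical, and the parser only handles plain KEY=VALUE lines.

```python
from pathlib import Path

# FLEURS_DIR must be set; WER_QUANTS_SCRIPT is optional per the text above.
REQUIRED_KEYS = ("FLEURS_DIR",)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")  # drop surrounding quotes
    return values


def load_env(path: Path) -> dict[str, str]:
    """Load a .env file, exiting with a clear message if it is missing or incomplete."""
    if not path.is_file():
        raise SystemExit(f"{path} is missing: copy .env.example to .env and set FLEURS_DIR")
    values = parse_env_text(path.read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise SystemExit(f"{path} does not set: {', '.join(missing)}")
    return values
```

Keeping the machine-specific paths in one gitignored file means the committed wrapper can be shared unchanged between machines.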
These scripts live in scripts/, are self-contained, and run from the repo
root against the model files here, which they default to finding in the current
directory (so invoke them as e.g. uv run scripts/quantize-fp16.py). Each declares
its own dependencies via a PEP 723 header and
runs with uv run (which installs them on the fly).
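For readers unfamiliar with PEP 723, the header is a comment block at the top of the script that uv run reads before executing it. The sketch below shows the general shape on a toy script; the dependency list is illustrative and is not copied from any script in this repo, and the `inline_metadata` helper is a hypothetical, simplified reader (it stops at the first closing marker).

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
# ]
# ///
"""Toy script carrying a PEP 723 inline metadata header."""


def inline_metadata(source: str) -> list[str]:
    """Return the TOML lines of the first `# /// script` block, comment prefix removed."""
    lines = source.splitlines()
    try:
        start = lines.index("# /// script")
    except ValueError:
        return []  # no header present
    body = []
    for line in lines[start + 1:]:
        if line == "# ///":
            return body  # closing marker reached
        if not line.startswith("#"):
            break  # a non-comment line before the closing marker: not a valid block
        body.append(line[2:] if line.startswith("# ") else "")
    return []  # unterminated block
```

Running such a file with `uv run script.py` lets uv resolve the listed dependencies into a temporary environment, which is why the scripts here need no separate install step.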
The only external inputs are the FLEURS audio for the int8 calibration (via
--fleurs-dir) and the held-out speeches for the long-audio benchmark (fetched
from the documented sources into calibration_audio/, see above), plus the
optional WER comparison harnesses the scripts print at the end
(wer-quants.py, wer-bench.mjs), which live in the parakeet_web project repository.
Sources and credits
- ONNX base model this repo is built on: istupakov/parakeet-tdt-0.6b-v3-onnx
- Original model: nvidia/parakeet-tdt-0.6b-v3
- SmoothQuant implementation: onnx/neural-compressor, used here via our fork (branch diy, vendored as the neural-compressor-fork/ submodule) with auto-alpha bug fixes intended for upstream PRs
- Loaded with onnx-asr
- Discussion: Kieirra/murmure#289 (comment)
- This repository (model export, benchmarking and documentation) was produced with Claude Code.
License
cc-by-4.0, inherited from the upstream istupakov ONNX model and the original
NVIDIA Parakeet TDT 0.6B v3.