Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder

Changelog

2026-06-17

  • Third candidate: models_in_testing/all_fleurs_balanced/. A SmoothQuant int8 encoder calibrated on the FLEURS train split of every language, ~50 clips per language, balanced by audio-seconds budget with no per-language weight (every language gets the same calibration-seconds budget, filled by a deterministic shuffle and truncated, so none dominates). It searches the finer --auto-alpha-step 0.1 SmoothQuant grid (vs the coarser 0.2 the french_only_100 run pinned). The intent is a calibration set that represents the whole multilingual distribution evenly, rather than down-weighting one language or going monolingual. Like the other candidates it is a real encoder-model.int8.onnx plus relative symlinks to the shared base files, shipped for tester feedback, not promoted to the canonical names.
  • wer-fleurs-validation.sh is now committed (was local-only). It compares a roster of models (istupakov fp32/int8, this repo's int8, and the models_in_testing/ candidates) by per-language WER over the whole FLEURS validation split, driving the parent repo's scripts/wer-quants.py --manifest with each model loaded once over every language. It reads machine-specific paths (FLEURS_DIR, WER_QUANTS_SCRIPT) from the gitignored .env, so it carries no personal paths.

2026-06-16

  • New models_in_testing/ area for sharing candidate (not-yet-promoted) encoder builds with testers. Each subfolder is a self-contained, loadable model directory: it holds one candidate file as a real artifact (for example encoder-model.int8.onnx) and borrows everything else (decoder, preprocessor, tokenizer, config, the fp32 / fp16 encoders) as relative symlinks back to the repo root, so a candidate costs only its own weights on disk instead of a full copy of the repo. models_in_testing/link-model-files.sh <dir> creates those missing links for any directory without overwriting files already present. The canonical encoder-model.int8.onnx at the repo root is unchanged (still the 2026-06-14 build); these are candidates under evaluation, not a new default.
  • First candidate: models_in_testing/english_downweighted_0.2/. A SmoothQuant int8 encoder calibrated with English downweighted to a fifth of every other language's calibration-seconds budget (--fleurs-lang-weight en=0.2, with --fleurs-per-lang 6 and the duration-balanced FLEURS selection). The intent is to bias the int8 activation ranges away from the dominant training language so a degraded embedding is less likely to tip non-English audio into English (see Per-language weighting). It is shipped here for tester feedback, not promoted to the canonical names.
  • Second candidate: models_in_testing/french_only_100/. A SmoothQuant int8 encoder calibrated on French only (a FLEURS root holding a single fr split, --fleurs-per-lang 100, no speeches), to test whether a monolingual calibration set sharpens the int8 activation ranges for French audio. Like the first candidate it is a real encoder-model.int8.onnx plus relative symlinks to the shared base files, shipped for tester feedback, not promoted to the canonical names.

2026-06-15

  • Prefix-search recombination now ships off by default for decoding. A grid search with this int8 encoder + int8 decoder (in-house French-medical dictation + FLEURS-fr validation, 494 utterances, MAES beam 5), repeated on both the CPU (ONNX Runtime node) and GPU (cuda) backends, found the MAES prefix_alpha knob moved WER/CER only within noise (under 0.05 WER points either way) while costing ~15-20% more decode time, so parakeet_web defaults maes_prefix_alpha to 0 (NeMo uses 1). The same sweep kept beam 5 (overall accuracy plateaus by ~beam 5 on both backends) and the other MAES knobs at their defaults (num-steps 2, gamma 2.3); note CPU decode cost scales roughly linearly with beam width whereas the GPU's batched curve is flatter.
  • The int8 decoder / joint network is as accurate as the fp32 one. Crossing encoder precision against decoder precision on the 494-utterance held-out harness (FLEURS-fr validation + in-house dictation, MAES beam 10), the int8 decoder matches the fp32 decoder to within noise for either encoder (with this int8 encoder and keyword boosting, 8.98% vs 9.01% overall WER), at about a quarter the size (18 MB vs ~70 MB). So the decoder quantization is effectively lossless. See The int8 decoder is as accurate as fp32.

2026-06-14

  • Rebuilt encoder-model.int8.onnx to essentially match fp32 on long audio, at the cost of size. The export keeps the 11 most quantization-damaged MatMuls in fp32 (--exclude-worst 0.05, the mid-layer feed-forward linear2 projections) and quantizes the other 206 MatMuls plus all 77 Convs to int8. The file is 757 MB (stock int8 is 622 MB); the extra size is what buys back the accuracy.
  • Calibration is now FLEURS-only, so the eight political speeches are no longer calibration data and are reused as a held-out long-audio evaluation set. Each of the 25 supported languages contributes 3 FLEURS train clips, one calibration window per clip with no concatenation (75 windows, kept whole).
  • New long-audio benchmark (eight held-out speeches, single ~390 s pass each, fp32 encoder as oracle, fp32 decoder): mean overall WER 7.5% for this int8 versus 7.8% for fp32 and 24.8% for the stock int8. The recalibrated int8 now tracks fp32 to within noise and is about 3x better than the stock int8.
  • Every fp32 weight file is mantissa-rounded (12 low bits zeroed, round-to-nearest) for compressibility (the 2.3 GB fp32 encoder's weights then compress to 1.55 GB, down from 2.26 GB unrounded, 31.6% smaller). This is near-lossless (it keeps 11 of fp32's 23 mantissa bits, one more than fp16) and applies to the int8 encoder's leftover fp32 weights (--zero-mantissa-bits 12), the fp32 decoder, and the standalone + sharded fp32 encoder. The fp32 decoder used in every benchmark here is itself rounded, and reproduced the stock-decoder WER to within 0.2 points, so the rounding does not move accuracy.
  • Generalization re-measured (FLEURS-fr validation + in-house medical dictation, MAES beam 10, fp32 decoder): this int8 tracks fp32 closely, 5.18% / 4.87% WER on FLEURS-fr and 17.60% / 17.40% on the dictation set (int8 / fp32, no keyword boost). Details below.
  • Export tooling: GPU calibration (--ep cuda), per-op-type alpha grids (--op-alpha), FLEURS calibration via --fleurs-dir, streaming Percentile calibration (flat RAM in the number of calibration windows), and an auto-alpha search that no longer retains protobuf arena memory on every evaluation (fine alpha grids used to exhaust RAM on the 2.4 GB encoder), all in the vendored neural-compressor fork. The fork's diy branch is pushed and its README documents every change versus upstream, making the whole export reproducible.

This is istupakov/parakeet-tdt-0.6b-v3-onnx with two kinds of change. The substantive one is the int8 encoder (encoder-model.int8.onnx), rebuilt with SmoothQuant: with it I was no longer able to reproduce the loss of accuracy on longer audio that I had measured on the original int8 encoder. The cosmetic one is that every fp32 weight file here is mantissa-rounded for compressibility (12 low mantissa bits zeroed, round-to-nearest): it is near-lossless (it keeps one more mantissa bit than fp16), so the models stay numerically equivalent but get far smaller once compressed. The preprocessor and tokenizer are byte-for-byte from istupakov. This is still a drop-in replacement: canonical names, numerically equivalent fp32/decoder, and the better int8 is picked up automatically.

It also ships the fp16 encoder (which the upstream istupakov repo does not), so all three precisions are available here in one place. Unlike the int8 encoder, the fp16 one is not SmoothQuant or anything clever: it is a naive fp16 cast of the fp32 pieces (scripts/quantize-fp16.py). In my testing it scored exactly equal to fp32 (the same WER), at half the size. Note that fp16 compute is not implemented on every backend (for example the CPU / WASM ONNX Runtime EP has no fp16 kernels), so there it is upcast back to fp32 at session build and gives you no runtime benefit. Even then it is still useful purely as a smaller artifact: it is half the download / packaging size of fp32, for identical accuracy. That is why parakeet_web serves it (on a WebGPU backend it also runs natively in fp16, halving GPU memory).

This was originally built to improve the int8 transcription quality of parakeet_web (live demo: parakeetweb.olicorne.org), a browser-based Parakeet ASR app that runs the int8 encoder on its CPU / WASM backend, where fp16 is not an option.

I also contribute to Kieirra/murmure, another browser-based Parakeet ASR project, where this SmoothQuant int8 encoder is progressively being upstreamed (see the discussion).

Why this exists

The stock int8 encoder transcribes short clips fine, but its accuracy degrades badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32 encoders do not show this, so the cause is not the model architecture but int8 numerics. The stock int8 uses fully dynamic, per-tensor activation quantization (one runtime scale for an entire activation tensor). Once a longer sequence widens the activation distribution, that single scale can no longer represent it and the transcript falls apart.

SmoothQuant targets exactly this failure mode: it migrates the per-channel activation outliers into the weights (a folded multiply), then statically quantizes activations together with per-channel weights. With the smoothed, per-channel int8 encoder I was no longer able to reproduce the long-audio degradation in my own testing (see the numbers below).
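
The mechanism can be illustrated with a toy NumPy sketch. It is not the export script: the shapes, the seed, the 50x outlier channel and the fixed alpha of 0.5 are all assumptions chosen for illustration, and it uses the textbook SmoothQuant factor (per-channel activation max to the power alpha over weight max to the power 1 - alpha) with simulated symmetric int8 rounding.

```python
import numpy as np

def fake_quant_int8(t, axis=None):
    """Symmetric int8 quantize-then-dequantize.

    axis=None uses one scale for the whole tensor (per-tensor);
    axis=0 uses one scale per column (per output channel for weights).
    """
    scale = np.abs(t).max(axis=axis, keepdims=True) / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))   # activations: rows = frames, cols = channels
X[:, 3] *= 50.0                     # one outlier channel widens the tensor range
W = rng.standard_normal((8, 4))     # weights: (in_channels, out_channels)
Y = X @ W                           # float reference output

# Per-input-channel smoothing factors (alpha is the migration strength).
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
X_s = X / s              # outliers are divided out of the activations ...
W_s = W * s[:, None]     # ... and folded into the weights, so X_s @ W_s == X @ W

def rel_err(Y_q):
    return np.linalg.norm(Y_q - Y) / np.linalg.norm(Y)

# Per-tensor activations, per-output-channel weights, with and without smoothing.
err_plain = rel_err(fake_quant_int8(X) @ fake_quant_int8(W, axis=0))
err_smooth = rel_err(fake_quant_int8(X_s) @ fake_quant_int8(W_s, axis=0))
print(f"relative error without smoothing: {err_plain:.4f}")
print(f"relative error with smoothing:    {err_smooth:.4f}")
```

The transform itself is exact in floating point; the benefit only appears after quantization, because the single activation scale is no longer dictated by one outlier channel.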

Background and discussion: Kieirra/murmure#289 (comment).

Results

Benchmark: eight political speeches (listed under Evaluation data), each transcribed in a single pass with no chunking (capped at 390 s, the encoder's positional-encoding reach), scored against the fp32 encoder as the oracle. The eight speeches are held out of calibration (this build calibrates on FLEURS only), so this is an out-of-sample long-audio test, exactly the regime where the stock int8 falls apart. All three precisions use the fp32 decoder / joint network (not the int8 decoder), and that fp32 decoder is itself mantissa-rounded (see the changelog and Files). The fp32 decoder is used here only to keep the encoder the single variable under test; the int8 decoder / joint network is just as accurate (see The int8 decoder is as accurate as fp32). Run with scripts/wer-quants.py from the parakeet_web repo.

Mean across the eight speeches (lower WER is better):

| encoder precision | encoder size | mean overall WER | mean worst-chunk WER |
| --- | --- | --- | --- |
| stock int8 (istupakov) | 622 MB | 24.8% | 58.7% |
| SmoothQuant int8 (this) | 757 MB | 7.5% | 37.1% |
| fp32 (oracle) | ~2.3 GB | 7.8% | 36.8% |

Per-speech overall WER:

| speech | stock int8 | SmoothQuant int8 | fp32 |
| --- | --- | --- | --- |
| Villepin (FR, Iraq) | 24.2% | 20.4% | 22.0% |
| Sanders (EN, filibuster) | 30.5% | 9.3% | 10.8% |
| Pompidou (FR, press) | 12.0% | 8.6% | 8.5% |
| Johnson (EN, voting rights) | 48.8% | 5.9% | 7.8% |
| Chirac (FR, Earth Summit) | 25.8% | 0.0% | 0.0% |
| Nixon (EN, resignation) | 0.9% | 0.9% | 0.0% |
| Veil (FR, abortion law) | 8.8% | 5.4% | 5.2% |
| Badinter (FR, death penalty) | 47.6% | 9.4% | 8.0% |
| mean | 24.8% | 7.5% | 7.8% |

The recalibrated int8 matches fp32 to within noise (7.5% vs 7.8% mean; it is better on some speeches, slightly worse on others) and is about 3x better than the stock int8 (24.8%). The "worst-chunk" column is each speech's single worst 60 s window; its mean is inflated by a few windows that score badly for fp32 too (for example one Villepin window is ~160% WER for every precision, an audio/oracle artifact rather than a quantization failure), which is why the SmoothQuant int8 (37.1%) again tracks fp32 (36.8%) rather than the stock int8 (58.7%).

Speed and RAM are intentionally omitted. This run used --cuda, so the fp32 encoder ran on the GPU while the int8 encoder fell back to the CPU (ONNX Runtime has no CUDA kernels for the QOperator int8 format), which makes the two precisions' timing and host-memory numbers incomparable. Encoder size on disk is the one hardware-independent cost shown.

Calibration data (FLEURS train only, no labels)

SmoothQuant is a static method: it needs representative activations to estimate per-channel ranges, which it then folds into the weights as an exact equivalence transform. No labels, transcripts or training targets are used, only raw audio as signal to exercise the activation ranges; the model's multilingual ability is inherited unchanged from the base model rather than learned or tuned here.

This build calibrates on the FLEURS train split only (--fleurs-dir). For each of the model's 25 supported languages, train clips are taken (a deterministic shuffle, diverse speakers) as their own calibration windows with no concatenation (FLEURS clips are short, so one window per clip). Selection is balanced by audio duration, not clip count: SmoothQuant accumulates its ranges over activation frames (∝ clip duration), so equal clip counts would let whichever language has the longest clips dominate. Instead every language is given the same audio-seconds budget (--fleurs-per-lang x the median clip length), filling it in shuffle order and truncating the clip that crosses it, so a single long clip cannot skew a language's total. With uniform-length clips this still works out to --fleurs-per-lang clips per language. --window-sec is a separate per-window length cap that defaults to the encoder's positional-encoding reach (400 s), so an unusually long clip would be truncated rather than split.
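
The selection described above can be sketched as follows. This is a simplified illustration, not the code in scripts/quantize-int8-smoothquant.py: the function name and data layout are hypothetical, and taking the median over all clips of all languages (rather than per language) is an assumption, since the text does not specify which median is used.

```python
import random
from statistics import median

def select_balanced(clips_by_lang, per_lang, seed=0):
    """Pick calibration clips so every language gets the same seconds budget.

    clips_by_lang: {lang: [(clip_id, seconds), ...]}
    Returns {lang: [(clip_id, seconds_used), ...]}.
    """
    all_durations = [sec for clips in clips_by_lang.values() for _, sec in clips]
    budget = per_lang * median(all_durations)  # assumption: one global median

    selected = {}
    for lang, clips in sorted(clips_by_lang.items()):
        order = list(clips)
        # String seeds are hashed deterministically, so the order is reproducible.
        random.Random(f"{seed}:{lang}").shuffle(order)
        remaining, picked = budget, []
        for clip_id, sec in order:
            if remaining <= 0:
                break
            used = min(sec, remaining)  # truncate the clip that crosses the budget
            picked.append((clip_id, used))
            remaining -= used
        selected[lang] = picked
    return selected

clips = {
    "en": [("a", 20.0), ("b", 20.0), ("c", 20.0)],
    "fr": [("x", 5.0), ("y", 5.0), ("z", 5.0), ("w", 5.0)],
}
for lang, picked in select_balanced(clips, per_lang=2).items():
    print(lang, sum(sec for _, sec in picked), "seconds from", len(picked), "clip(s)")
```

In this toy input the long-clip language contributes one truncated clip and the short-clip language contributes two whole clips, so both end up with the same number of seconds.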

Because calibration is FLEURS-train-only, every evaluation set is held out: the eight political speeches (the long-audio WER above), the FLEURS French validation split and the in-house medical dictation (both below), and the JFK clip are all disjoint from calibration. The fp32 encoder is the accuracy oracle, and the export ends with a cosine-similarity fidelity check of the new encoder's output against fp32 (0.96 for this build).
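
For reference, a cosine-similarity check of this kind can be written in a few lines. This is a generic sketch: it assumes both encoder outputs are flattened to a single vector before comparison, which may differ from how the export script aggregates over frames.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two arrays, compared as flat vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for the fp32 and int8 encoder outputs on the same audio window.
reference = np.array([[1.0, 2.0], [3.0, 4.0]])
candidate = reference * 2.0  # same direction, different scale
print(cosine_similarity(reference, candidate))
```

Cosine similarity ignores overall scale, so it measures whether the quantized embedding points in the same direction as the fp32 one rather than whether the values match exactly.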

The FLEURS audio is not redistributed here (it is a public dataset): download it and point --fleurs-dir at a directory holding one <lang>/wavs_train/ folder per language.

Per-language weighting (--fleurs-lang-weight). Every language gets the same seconds budget by default, so the int8 activation ranges are a uniform 25-language compromise. Because the multilingual TDT model decides its output language implicitly from the (quantized) encoder embedding, letting a dominant training language (English) weigh on those ranges can make a degraded embedding tip the output into English on non-English audio. --fleurs-lang-weight LANG=W (repeatable) scales one language's seconds budget: W applies to every FLEURS dir matching LANG (the exact name like en_us, or the code before _ like en to cover all variants), so 0 drops the language, <1 downweights it and >1 upweights it. Unlisted languages keep the full budget. For example --fleurs-lang-weight en=0.2 gives English a fifth of every other language's calibration seconds, biasing the ranges away from English without favouring any single target language.
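
The matching rule can be sketched as a small lookup. This is a hypothetical helper, not the export script's code; in particular, letting an exact directory name win over the bare language code is an assumption, since the text does not say which takes precedence when both are given.

```python
def parse_lang_weights(pairs):
    """Turn repeated LANG=W strings (e.g. ["en=0.2"]) into {LANG: W}."""
    weights = {}
    for pair in pairs:
        lang, _, value = pair.partition("=")
        weights[lang] = float(value)
    return weights

def lang_weight(fleurs_dir_name, weights):
    """Weight applied to one FLEURS language directory's seconds budget."""
    if fleurs_dir_name in weights:            # exact name, e.g. "en_us"
        return weights[fleurs_dir_name]
    code = fleurs_dir_name.split("_", 1)[0]   # code before "_", e.g. "en"
    return weights.get(code, 1.0)             # unlisted languages keep the full budget

weights = parse_lang_weights(["en=0.2"])
base_budget = 60.0  # seconds; illustrative value only
for name in ["en_us", "fr_fr", "de_de"]:
    print(name, base_budget * lang_weight(name, weights))
```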

Balance to the smallest language (--fleurs-balance-smallest). Instead of a fixed --fleurs-per-lang count, this calibrates on the whole FLEURS train set of every language, capped so no language uses more audio than the least-represented one. The per-language seconds budget becomes the true minimum total (capped) duration across all languages: the smallest language is used whole and sets the cap, every larger language is truncated to it, so all 25 languages contribute equal seconds with no weight imbalance. It scans every clip's duration up front (decoding the train set once), so it is slower to start and produces a large calibration pool (~the minimum language's clip count x number of languages). It overrides --fleurs-per-lang; --fleurs-lang-weight still scales the per-language budget.
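
A minimal sketch of how that budget could be derived from scanned clip durations. The function name is hypothetical, and it assumes "capped" means each clip is first capped at the --window-sec length before the per-language totals are compared.

```python
def smallest_language_budget(durations_by_lang, window_sec):
    """Return (language, seconds) for the language with the least capped audio."""
    totals = {
        lang: sum(min(sec, window_sec) for sec in durations)
        for lang, durations in durations_by_lang.items()
    }
    smallest = min(totals, key=totals.get)
    return smallest, totals[smallest]

durations = {
    "en": [10.0, 500.0],      # the 500 s clip counts as window_sec seconds
    "mt": [5.0, 5.0],
    "fr": [8.0, 8.0, 8.0],
}
lang, budget = smallest_language_budget(durations, window_sec=400.0)
print(lang, budget)  # every language would then be truncated to this budget
```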

Evaluation data: eight held-out speeches (long audio)

The long-audio benchmark uses eight public political speeches, chosen to span decades, recording conditions and two languages, and held out of calibration:

| speaker | lang | speech | year | crop | source (YouTube id) |
| --- | --- | --- | --- | --- | --- |
| Dominique de Villepin | FR | UN Security Council address against the Iraq war | 2003 | 390 s | RNxU-tN8qNc |
| Bernie Sanders | EN | Senate floor filibuster against the tax-cut extension | 2010 | 390 s | K6pa-QdL4Wo |
| Georges Pompidou | FR | presidential press conference (INA archive) | 1970 | 390 s | RNWFPX_Yafw |
| Lyndon B. Johnson | EN | "We Shall Overcome" voting-rights address | 1965 | 390 s | o74X_rTzrGI |
| Jacques Chirac | FR | "Notre maison brule" Earth Summit speech | 2002 | 60 s | M_oR0wZ3lI4 |
| Richard Nixon | EN | resignation address | 1974 | 60 s | ZEOGJJ7UKFM |
| Simone Veil | FR | speech defending the law legalizing abortion (INA) | 1974 | 390 s | 45MOc6PYoY8 |
| Robert Badinter | FR | speech for abolishing the death penalty (INA) | 1981 | 390 s | kIVuz9NGQXY |

Each clip is decoded to 16 kHz mono and transcribed in a single pass (capped at 390 s). These clips are not redistributed here (they are copyrighted third-party broadcasts; this repo ships under cc-by-4.0), which is why they are documented by source above rather than committed. To reproduce the long-audio benchmark, fetch them from the listed sources and drop them in a calibration_audio/ folder at the repo root, then run scripts/wer-quants.py --audio calibration_audio --decoder-quant fp32 from the parakeet_web repo (the --decoder-quant fp32 reproduces the fp32-decoder oracle of the table above; wer-quants.py now defaults to the int8 decoder to match the production app). (The folder name is historical: in this build the speeches are evaluation, not calibration. To fold them back into calibration as long-text coverage, pass --audio calibration_audio to scripts/quantize-int8-smoothquant.py.)

Generalization (held-out, two domains)

As independent checks that the recalibrated int8 generalizes beyond the eight speeches, it was evaluated on two sets that are not in the calibration data: the FLEURS French validation split (general-French read speech) and a small in-house medical-dictation set. Both are scored with MAES beam search (width 10) and the fp32 decoder, alongside the fp32 encoder under the identical harness, so this is a like-for-like int8-versus-fp32 comparison (lower is better):

| dataset | utterances | int8 WER | fp32 WER | int8 CER | fp32 CER |
| --- | --- | --- | --- | --- | --- |
| FLEURS French (validation) | 289 | 5.18% | 4.87% | 2.23% | 1.98% |
| in-house medical dictation | 205 | 17.60% | 17.40% | 10.40% | 10.38% |
| overall | 494 | 9.43% | 9.16% | 5.24% | 5.08% |

The int8 tracks fp32 within ~0.3 WER on FLEURS-fr and within ~0.2 WER on the dictation set, so the recalibrated encoder holds up on non-English audio in both domains (2.23% FLEURS-fr CER). French audio is in the calibration set, but the FLEURS validation split evaluated here is held out (calibration uses FLEURS train only) and no French transcript or label was ever used, so this is a genuine held-out measurement.

parakeet_web's domain-keyword boosting is orthogonal to quantization (it would help any encoder) but shows the practical ceiling on the drug-name-heavy dictation set: with boosting the dictation WER drops to 15.30% for int8 and 14.89% for fp32. On FLEURS-fr, which has no relevant keywords, boosting is a near-no-op (5.72% int8 / 5.17% fp32), a sanity check. Run with scripts/grid_search_benchmark.mjs from the parakeet_web repository.

Full FLEURS validation, all 25 languages

The French number above is one slice of a wider check: the whole FLEURS validation split of every one of the 25 supported languages (all clips, no --limit), comparing the stock istupakov encoder at fp32 and int8 against this repo's SmoothQuant int8. Every model uses the fp32 decoder / joint (so the encoder is the only variable), WER is normalized (case / punctuation folded), and all three were scored under one harness with scripts/wer-quants.py --manifest from the parakeet_web repo. The validation split is held out of calibration (this build calibrates on FLEURS train only).

Every WER below uses greedy decoding (beam width 1). A wider beam search would lower the absolute numbers across all three models, so read these as conservative (pessimistic) figures and as relative gaps between the encoders, which is what the comparison is about, not as the best achievable WER. The width-10 MAES beam used for the French-only figure above is why that number reads lower.

Per-language WER (%), lower is better:

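| lang | ref words | istupakov fp32 | istupakov int8 | this repo int8 |
| --- | --- | --- | --- | --- |
| bg | 8084 | 11.94 | 15.51 | 12.37 |
| cs | 5396 | 12.92 | 21.98 | 12.75 |
| da | 8018 | 18.06 | 25.95 | 19.54 |
| de | 7410 | 5.47 | 8.07 | 5.82 |
| el | 6061 | 36.69 | 43.18 | 37.96 |
| en | 8323 | 5.88 | 7.88 | 6.12 |
| es | 10125 | 3.58 | 4.58 | 3.85 |
| et | 6035 | 17.28 | 23.55 | 18.00 |
| fi | 6216 | 12.81 | 18.48 | 13.80 |
| fr | 7021 | 4.90 | 6.08 | 5.70 |
| hr | 6904 | 11.62 | 17.45 | 12.40 |
| hu | 7356 | 15.61 | 30.26 | 16.23 |
| it | 9018 | 2.71 | 3.81 | 3.10 |
| lt | 6760 | 22.91 | 34.05 | 24.66 |
| lv | 5988 | 23.75 | 37.12 | 25.52 |
| mt | 9185 | 20.88 | 40.97 | 22.46 |
| nl | 3736 | 8.43 | 10.44 | 8.94 |
| pl | 5940 | 6.23 | 9.71 | 6.75 |
| pt | 8464 | 4.86 | 4.69 | 5.07 |
| ro | 8746 | 13.31 | 21.39 | 14.35 |
| ru | 6501 | 6.21 | 8.14 | 6.60 |
| sk | 6440 | 9.19 | 16.06 | 9.91 |
| sl | 6431 | 24.97 | 46.60 | 27.27 |
| sv | 6370 | 14.58 | 19.20 | 15.65 |
| uk | 5900 | 6.53 | 9.80 | 6.95 |
| MACRO | | 12.85 | 19.40 | 13.67 |
| MICRO | | 12.49 | 18.99 | 13.30 |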

MACRO is the unweighted mean of per-language WER; MICRO is total word edits over total reference words across all languages. The SmoothQuant int8 closes almost the entire stock-int8 gap to fp32: it is within ~0.8 WER of fp32 on the macro average (13.67% vs 12.85%) while the stock int8 trails by ~6.5 WER (19.40%). The gain is largest exactly where the stock int8 collapses worst, the morphologically rich / lower-resource languages (Maltese 40.97% -> 22.46%, Slovenian 46.60% -> 27.27%, Hungarian 30.26% -> 16.23%, Latvian 37.12% -> 25.52%). It beats the stock int8 on every language except Portuguese (5.07% vs 4.69%), the one row where the stock int8 also scores below fp32 (4.86%).
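
The difference between the two averages is easy to see in code. This is a minimal sketch with made-up edit counts, not the aggregation code in wer-quants.py; the function name and input layout are hypothetical.

```python
def macro_micro_wer(per_lang):
    """per_lang: {lang: (word_edits, ref_words)}. Returns (macro, micro) in percent."""
    per_lang_wer = [100.0 * edits / words for edits, words in per_lang.values()]
    macro = sum(per_lang_wer) / len(per_lang_wer)          # every language counts equally
    total_edits = sum(edits for edits, _ in per_lang.values())
    total_words = sum(words for _, words in per_lang.values())
    micro = 100.0 * total_edits / total_words              # every word counts equally
    return macro, micro

# Toy counts: a small language with a high WER and a large one with a low WER.
toy = {"small": (10, 100), "large": (10, 1000)}
macro, micro = macro_micro_wer(toy)
print(f"macro {macro:.2f}%  micro {micro:.2f}%")
```

Macro is pulled up by languages with few reference words and a high error rate, while micro is dominated by the languages with the most reference words.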

The int8 decoder is as accurate as fp32

Every WER above uses the fp32 decoder / joint network as a clean oracle, so that the encoder stays the only variable under test. But the int8 decoder / joint is just as accurate, and it is much smaller: 18 MB versus ~70 MB for the fp32 decoder (and 35 MB for fp16). Crossing encoder precision against decoder precision on the same 494-utterance held-out harness (FLEURS-fr validation + in-house medical dictation, MAES beam 10, keyword boosting on), the int8 decoder matches the fp32 decoder to within noise (here even marginally better) for either encoder:

| encoder | decoder | overall WER | overall CER |
| --- | --- | --- | --- |
| int8 (this repo) | int8 | 8.98% | 4.73% |
| int8 (this repo) | fp32 | 9.01% | 4.87% |
| fp32 (oracle) | int8 | 8.32% | 3.84% |
| fp32 (oracle) | fp32 | 8.48% | 4.14% |

Without keyword boosting the two decoders are likewise tied (this int8 encoder scores 9.43% overall WER with either decoder). So the decoder / joint quantization is effectively lossless here: a loader can pair any encoder precision with the much smaller int8 decoder at no measurable accuracy cost. The benchmarks above still report the fp32 decoder only so the encoder stays the single variable. Run with scripts/grid_search_benchmark.mjs from the parakeet_web repository.

Trade-off: larger than the stock int8, but fp32-grade accuracy

This build deliberately trades file size for accuracy. At 757 MB it is larger than the stock int8 (622 MB), because --exclude-worst 0.05 keeps the 11 most quantization-damaged MatMuls (mid-layer feed-forward linear2 projections) in fp32 instead of int8. That escape hatch is what closes the long-audio gap: this int8 now matches fp32 (7.5% vs 7.8% mean WER on the eight held-out speeches) instead of trailing it. The leftover fp32 weights are mantissa-rounded (12 low bits zeroed), so although the file is 757 MB on disk it compresses well for transfer (zip / HTTP content-encoding) while staying more precise than an fp16 cast.
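
The mantissa rounding mentioned above can be sketched with NumPy bit operations. This is an illustrative stand-in, not the zero_fp32_mantissa helper in scripts/mantissa.py: the function name is hypothetical, ties are rounded away from zero (the repo's tie rule is not stated), and the synthetic Gaussian weights are an assumption used only to show the compressibility effect.

```python
import zlib
import numpy as np

def round_low_mantissa_bits(w, bits=12):
    """Round fp32 values to the nearest value whose `bits` low mantissa bits are zero."""
    u = np.ascontiguousarray(w, dtype=np.float32).view(np.uint32)
    half = np.uint32(1 << (bits - 1))                 # half of the dropped range
    keep = np.uint32((0xFFFFFFFF >> bits) << bits)    # mask clearing the low bits
    # Adding half then masking rounds to nearest; a carry into the exponent is correct.
    return ((u + half) & keep).view(np.float32)

rng = np.random.default_rng(0)
w = (rng.standard_normal(200_000) * 0.02).astype(np.float32)  # synthetic weights
r = round_low_mantissa_bits(w, bits=12)

raw_size = len(zlib.compress(w.tobytes(), 6))
rounded_size = len(zlib.compress(r.tobytes(), 6))
print(f"DEFLATE size: {raw_size} -> {rounded_size} bytes")
print("max relative error bound 2^-12 holds:",
      bool(np.all(np.abs(r - w) <= np.abs(w) * 2.0 ** -12)))
```

The rounded array is still fp32 and the same size in memory; only the compressed size shrinks, because every value now ends in 12 zero bits.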

Dialing --exclude-worst higher (size / RAM / accuracy)

--exclude-worst is the lever for this trade-off, and it is close to linear in the useful range. Of the 217 searched MatMuls the quantization damage is concentrated in the feed-forward linear2 down-projections: the 46 worst-ranked nodes are all linear2, and rank 47 (the first attention layer) sits at less than half the loss, a clean gap. Since every linear2 weight is the same size (4096x1024), each node you keep in fp32 instead of int8 costs a fixed amount:

  • +12.0 MiB (~12.6 MB) uncompressed, which is also the +resident RAM at inference (ONNX Runtime loads weights resident; the resident size tracks the uncompressed file, not the gzip size, so the mantissa-rounding that shrinks the download does not shrink RAM).
  • +~7 MiB gzip (the fp32-rounded weights compress to ~0.635 of raw, the int8 they replace to ~0.71-0.80, measured on these files).

| --exclude-worst | fp32 linear2 kept | encoder file ≈ resident RAM | gzip download |
| --- | --- | --- | --- |
| 0 (pure int8) | 0 | 622 MB | 443 MB |
| 11 (0.05, shipped) | 11 | 757 MB | 579 MB |
| 35 | 35 | ~1045 MB | ~745 MB |
| 46 (whole hard cluster) | 46 | ~1177 MB | ~830 MB |

Pass it as an integer (e.g. --exclude-worst 35) to keep exactly the N worst nodes; a float in (0, 1) keeps that fraction of the 217 searched MatMuls (round(fraction * 217), so 0.05 keeps 11). Returns diminish fast: the shipped 11 already match fp32 on long audio (7.5% vs 7.8% WER), so higher values buy fidelity in fractions of a percent for hundreds of MB of file/RAM. The single-file int8 stays WASM-loadable up to ~0.85 (above that it crosses the ~2 GB single-file ingest wall and would need sharding like the fp32 encoder).
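
The per-node cost and the fraction-to-count rule reduce to simple arithmetic, sketched below. The helper names are hypothetical, and the size estimate is an assumption: it extrapolates from the shipped 757 MB build by adding 12 per extra node, which reproduces the approximate 35 and 46 rows of the table.

```python
LINEAR2_SHAPE = (4096, 1024)   # every linear2 weight, per the text
SEARCHED_MATMULS = 217
MIB = 2 ** 20

def nodes_kept(exclude_worst):
    """Integer: keep exactly that many nodes. Float in (0, 1): keep that fraction."""
    if isinstance(exclude_worst, int):
        return exclude_worst
    return round(exclude_worst * SEARCHED_MATMULS)

# fp32 is 4 bytes per weight and int8 is 1, so keeping a node in fp32 costs 3 extra bytes each.
extra_bytes_per_node = LINEAR2_SHAPE[0] * LINEAR2_SHAPE[1] * (4 - 1)

def estimated_file_mb(n_kept, shipped_nodes=11, shipped_mb=757, step_mb=12):
    """Rough size extrapolated from the shipped build (assumed 12 per extra node)."""
    return shipped_mb + (n_kept - shipped_nodes) * step_mb

print(extra_bytes_per_node / MIB, "MiB per node")
print(nodes_kept(0.05), "nodes for --exclude-worst 0.05")
print(estimated_file_mb(35), estimated_file_mb(46))
```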

int8 still matters most on a CPU / WASM backend, where fp16 has no compute kernels and the single-file fp32 cannot be ingested (see the shards section below), so int8 is the only precision that both fits and runs. If you can run fp16 or fp32 (for example on a WebGPU backend), prefer those.

Browser-friendly fp32 shards (sharded/)

The fp32 encoder is shipped two ways here: as the canonical single sidecar (encoder-model.onnx + a ~2.3 GB encoder-model.onnx.data), and, under sharded/, as the same weights repacked into several files each under 2 GB. Both copies hold the mantissa-rounded fp32 weights (see the changelog); the rounding is near-lossless but makes the encoder far more compressible: DEFLATE squeezes the rounded weights to 1.55 GB versus 2.26 GB unrounded (31.6% smaller, ~716 MB saved), while the raw fp32 stays 2.44 GB on disk. The sharded copy exists so the fp32 encoder can be loaded in a web browser (and on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32 cannot.

Why a browser cannot load the single 2.3 GB sidecar (these are ingest limits, not a total-memory limit):

  1. 32-bit WASM ArrayBuffer cap. A WASM build is wasm32, so any single ArrayBuffer it holds caps at 2^31 - 1 bytes (~2 GB). A 2.3 GB sidecar cannot live in one buffer. (This is the same wall that forces projects like wllama to shard their GGUF files.)
  2. Chromium blob-URL fetch cap. Fetching a blob: URL larger than ~2 GB fails in Chromium with TypeError: Failed to fetch, so the file cannot even be read into memory in one piece.

Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it is not upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so fp32 fits once no single buffer or fetch exceeds 2 GB. Sharding is purely about clearing the two per-buffer ingest walls above.

scripts/shard-fp32.py rewrites each big initializer's external_data location to spread the encoder's tensors across N shard files (encoder-model.onnx.data.000, encoder-model.onnx.data.001, ... each under a 1.5 GB budget by default), leaving a small rewritten encoder-model.onnx graph that points at them. Here that produces two shards (~1.4 GB + ~0.9 GB). It is a pure repack: no tensor value is touched, so the sharded encoder is byte-for-byte numerically identical to the single-file fp32 and has the exact same WER. A loader (for example parakeet_web, with its allowWasmFp32 opt-in) mounts each shard as a separate externalData entry, each under the 2 GB caps, and reads them straight to bytes (no >2 GB blob: URL, no multi-GB IndexedDB blob). The decoder, tokenizer and config are not duplicated into sharded/; a loader takes the rewritten encoder + shards from sharded/ and everything else from the repo root.
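
The repack idea can be sketched as a greedy packing plan. This is not the logic of scripts/shard-fp32.py, whose tensor ordering and budget units are not described here: the function name, the in-order greedy strategy and the decimal 1.5 GB budget are all assumptions. It only plans where each tensor would go (shard index, byte offset, length) and touches no tensor values.

```python
WASM32_BUFFER_CAP = 2 ** 31 - 1          # largest single ArrayBuffer on wasm32
DEFAULT_BUDGET = 1_500_000_000           # assumed decimal reading of "1.5 GB"

def plan_shards(tensor_sizes, budget_bytes=DEFAULT_BUDGET):
    """Greedily assign tensors, in order, to shards that each stay under the budget.

    tensor_sizes: [(name, nbytes), ...]
    Returns a list of shards, each a list of (name, offset, nbytes).
    """
    if budget_bytes > WASM32_BUFFER_CAP:
        raise ValueError("budget exceeds the wasm32 single-buffer cap")
    shards, current, used = [], [], 0
    for name, nbytes in tensor_sizes:
        if nbytes > budget_bytes:
            raise ValueError(f"{name} alone exceeds the shard budget")
        if current and used + nbytes > budget_bytes:
            shards.append(current)           # close the full shard, start a new one
            current, used = [], 0
        current.append((name, used, nbytes))
        used += nbytes
    if current:
        shards.append(current)
    return shards

toy = [("a", 6), ("b", 5), ("c", 4), ("d", 3)]
for index, shard in enumerate(plan_shards(toy, budget_bytes=10)):
    print(f"shard {index:03d}:", shard)
```

Each planned entry corresponds to what an ONNX external-data reference needs for one initializer: which file it lives in, where it starts, and how long it is.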

When to use which: on WebGPU, prefer fp16 (half the download, native fp16 kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards matter on CPU / WASM, where fp16 has no compute kernels and the single-file fp32 cannot be ingested, so the sharded fp32 is the only way to run full precision.

The shards are regenerated with scripts/shard-fp32.py (see How it was built).

Files

Every tracked file is listed below with its origin. "from istupakov" means the file is copied byte-for-byte from istupakov/parakeet-tdt-0.6b-v3-onnx and is unchanged. "from istupakov, mantissa-rounded" means istupakov's weights with the 12-low-bit mantissa rounding applied (near-lossless, still fp32, see the changelog). "generated" means the file is the output of one of the scripts in scripts/; the exact command that produces it is given (run from the repo root, see How it was built).

| file | origin | what it is |
| --- | --- | --- |
| encoder-model.onnx (+ .data) | from istupakov, mantissa-rounded | fp32 encoder (the accuracy oracle the other quants are built from); istupakov weights with 12 low mantissa bits zeroed for compressibility |
| encoder-model.fp16.onnx | generated: uv run scripts/quantize-fp16.py | fp16 encoder (a naive fp16 cast of the fp32 pieces; not shipped by the upstream istupakov repo). Mantissa rounding is fp32-only, so not applied here |
| encoder-model.int8.onnx | generated: uv run scripts/quantize-int8-smoothquant.py | SmoothQuant int8 encoder (the reason for this repo); its 11 excluded-worst layers and other leftover fp32 weights are mantissa-rounded (--zero-mantissa-bits 12) |
| sharded/encoder-model.onnx (+ .data.000, .data.001) | generated: uv run scripts/shard-fp32.py | the mantissa-rounded fp32 encoder repacked into <2 GB shards (a pure byte-identical repack) so a browser / WASM backend can load it (see Browser-friendly fp32 shards) |
| decoder_joint-model.onnx | from istupakov, mantissa-rounded | fp32 decoder / joint network; istupakov weights with 12 low mantissa bits zeroed (near-lossless). The fp32 decoder used in every benchmark above |
| decoder_joint-model.fp16.onnx | from istupakov | fp16 decoder / joint network (fp16, not rounded) |
| decoder_joint-model.int8.onnx | from istupakov | int8 decoder / joint network (int8, not rounded); as accurate as the fp32 decoder at ~1/4 the size (see The int8 decoder is as accurate as fp32) |
| nemo128.onnx | from istupakov | 128-bin mel preprocessor (137 KB; left stock, too small for rounding to matter) |
| vocab.txt | from istupakov | tokenizer vocabulary |
| config.json | from istupakov | model config |
| scripts/quantize-int8-smoothquant.py | this repo | script that produced encoder-model.int8.onnx |
| scripts/quantize-fp16.py | this repo | script that produced encoder-model.fp16.onnx |
| scripts/shard-fp32.py | this repo | script that produced sharded/ |
| scripts/zero-mantissa-bits.py | this repo | standalone tool: rounds an fp32 model's weights to N zero low mantissa bits (round-to-nearest, default 12) to make it far more compressible while staying fp32; refuses to re-round an already-rounded file unless --force, and by default reports a DEFLATE (gzip/zip, the algorithm murmure ships its model with) before/after size + compress/decompress-time comparison |
| scripts/mantissa.py | this repo | shared fp32 mantissa helper imported by the int8 export (--zero-mantissa-bits) and zero-mantissa-bits.py: zero_fp32_mantissa (the rounding) and mantissa_floor_bits (existing-truncation detector) |
| scripts/test_quantize-int8-smoothquant.py | this repo | regression tests (T1-T34) for the SmoothQuant fixes in the vendored fork and the export/tooling helpers (uv run scripts/test_quantize-int8-smoothquant.py) |
| neural-compressor-fork/ | git submodule (fork, branch diy) | our fork of onnx/neural-compressor carrying the SmoothQuant fixes the int8 script needs; imported via the script's [tool.uv.sources] |
| models_in_testing/link-model-files.sh | this repo | helper that turns any directory into a complete, loadable model dir by symlinking every missing standard model file (relative links) back to the repo root, without overwriting files already present; used to assemble the candidate dirs below |
| models_in_testing/english_downweighted_0.2/ | this repo (candidate) | a candidate SmoothQuant int8 encoder shared with testers but not promoted to the canonical names: a real encoder-model.int8.onnx (English downweighted in calibration, --fleurs-lang-weight en=0.2) plus relative symlinks to the shared base files. See the changelog |
| models_in_testing/french_only_100/ | this repo (candidate) | a candidate SmoothQuant int8 encoder shared with testers but not promoted to the canonical names: a real encoder-model.int8.onnx (calibrated on French only: a single-fr FLEURS root, --fleurs-per-lang 100, no speeches) plus relative symlinks to the shared base files. See the changelog |
| models_in_testing/all_fleurs_balanced/ | this repo (candidate) | a candidate SmoothQuant int8 encoder shared with testers but not promoted to the canonical names: a real encoder-model.int8.onnx (calibrated on every FLEURS language, ~50 clips/language, balanced by audio-seconds budget with no per-language weight, --auto-alpha-step 0.1) plus relative symlinks to the shared base files. See the changelog |
| run.sh | this repo | convenience wrapper that runs the int8 export, then the WER eval, with the exact flags used for the current build; reads machine-specific paths from a gitignored .env (see How it was built) so it carries no personal paths |
| wer-fleurs-validation.sh | this repo | per-language FLEURS validation WER comparison harness: drives the parent repo's scripts/wer-quants.py --manifest to score a roster of models (istupakov fp32/int8, this repo's int8, and the models_in_testing/ candidates) against the human labels, each model loaded once over every language, and prints a model x language WER matrix. Reads FLEURS_DIR/WER_QUANTS_SCRIPT from the gitignored .env so it carries no personal paths; pre-flight skips any model dir that does not resolve |
| .env.example | this repo | template for the gitignored .env that run.sh and wer-fleurs-validation.sh source; copy it to .env and set FLEURS_DIR (your local FLEURS path) and, if your layout differs, WER_QUANTS_SCRIPT (path to the parent repo's wer-quants.py) |
README.md this repo this document
.gitmodules this repo declares the neural-compressor-fork/ submodule
.gitattributes this repo Git LFS / line-ending attributes for the model artifacts
.gitignore this repo excludes the (copyrighted) calibration_audio/, local logs, and the personal .env from the repo

(The copyrighted calibration_audio/ clips and the local *.log benchmark outputs are gitignored and not part of the published repo; the speeches it holds are documented by source under Evaluation data instead of redistributed.)

How it was built

  • encoder-model.int8.onnx: scripts/quantize-int8-smoothquant.py. SmoothQuant + static per-channel int8 quantization of the MatMul and Conv ops, with Percentile activation calibration. The exact command that produced the current build:

    uv run scripts/quantize-int8-smoothquant.py \
      --op-types MatMul,Conv \
      --op-alpha MatMul=0.0:1.0:0.1 \
      --ep cuda \
      --fleurs-dir /path/to/fleurs --fleurs-per-lang 3 \
      --exclude-worst 0.05 \
      --zero-mantissa-bits 12
    # No --audio -> calibrate on FLEURS only; the eight speeches stay held out as
    # long-audio evaluation. Add --audio calibration_audio to fold them back in.
    

    --exclude-worst 0.05 keeps the most quantization-damaged ~5% of searched MatMuls in fp32 (here 11 of 217, the mid-layer feed-forward linear2 projections); this is what brings the long-audio WER down to fp32 level, at the cost of file size. --zero-mantissa-bits 12 rounds the leftover fp32 weights for compressibility (near-lossless).

    Note: if you re-run this, the coarser MatMul=0.0:1.0:0.2 alpha grid is the better default: in earlier testing it scored just as well as the 0.1 grid used above while running the alpha search in about half the time and with less RAM.

    The export is resumable. Each run streams its intermediates (mel features, smoother calibration, the per-node auto-alpha results in alphas.jsonl, the static-calibration params) into sq-cache/<hash-of-the-configuration>/ as they are produced, so if the run dies (typically OOM during static calibration), re-running the same command with --resume picks up from the last completed step instead of re-paying the hours-long alpha search. --resume DIR keeps the cache in a folder of your choosing instead (created if missing, refused if its recorded configuration differs).

    Resume granularity goes below whole steps: the two per-sample calibration loops (the smoother's activation collection and the static Percentile/Entropy calibration) additionally dump their in-progress state at most once per --checkpoint-interval-min (default 20 minutes), so a run killed inside one of those passes resumes from the last dumped sample rather than redoing the pass; a pass faster than the interval writes nothing extra.

    The smoother's activation collection streams its per-channel percentile through a running top-k (bit-identical to the stacked np.percentile, enforced by test) instead of holding every sample's activations in RAM, so its memory no longer grows with the number of windows times the window length (75 windows of 395 s used to need ~444 GB; the streamed state is a few hundred MB).

    alphas.jsonl (one appended line per node) doubles as a per-layer report of the alpha each node picked and its QDQ loss over the whole grid.

    On --ep cuda with long windows, the remaining peak is the static calibration's GPU memory. The calibration augments the encoder so that every calibrated tensor is a graph output, and ONNX Runtime keeps all of them resident for the whole forward pass, so one long-window forward (the per-layer attention-score MatMuls are ~0.8 GB each near the ~400 s reach) can exceed a 24 GB GPU even though the smoother and alpha-search passes fit. --calib-dump-batch N caps this by dumping the tensors in slices of N graph outputs per forward, re-running the calibration set once per slice. The per-tensor result is bit-identical to dumping all at once (enforced by test), so it is a pure memory/speed knob. Like --ep, it is not part of the cache hash, so you can add it on --resume without invalidating the alpha search. On a 24 GB GPU with long windows, try --calib-dump-batch 32 (or 64). A progress bar tracks each calibration slice.

    Calibration uses no labels: the FLEURS train splits only (see Calibration data), one window per clip, with the fp32 encoder as the accuracy oracle. The script ends with a cosine-similarity fidelity check of the new encoder's output against fp32. (Pass --audio calibration_audio to also calibrate on the eight speeches as long-text coverage.)

    The script imports SmoothQuant from our vendored neural-compressor-fork/ submodule (branch diy), whose README documents every change versus upstream onnx/neural-compressor: auto-alpha search fixes (it otherwise runs on an exhausted calibration reader and returns a degenerate alpha for every layer), per-op-type alpha grids, streaming Percentile calibration (flat RAM in the window count), per-node QDQ session caching, an alpha search that retains no protobuf arena memory per evaluation, and more. Clone with --recurse-submodules (or run git submodule update --init) so uv run can find the fork.

  • encoder-model.fp16.onnx: scripts/quantize-fp16.py, a straight fp16 cast of the fp32 encoder pieces.

  • sharded/: scripts/shard-fp32.py, a pure repack of the single-file fp32 encoder into <2 GB shards (see Browser-friendly fp32 shards). No weights are altered, so the sharded encoder is numerically identical to the single-file fp32.

  • Mantissa rounding (compressibility): scripts/zero-mantissa-bits.py rounds an fp32 model's weights so that their 12 lowest mantissa bits are zero (round-to-nearest, near-lossless, more precise than an fp16 cast). It is applied to the standalone fp32 encoder and the fp32 decoder; the int8 export does the same to its leftover fp32 weights via --zero-mantissa-bits 12. Round the fp32 encoder before sharding so the shards carry the rounded weights:

    uv run scripts/zero-mantissa-bits.py encoder-model.onnx        # -> encoder-model.zm12.onnx
    uv run scripts/zero-mantissa-bits.py decoder_joint-model.onnx  # -> decoder_joint-model.zm12.onnx
    

    Each rounded file then takes its canonical name in the published repo, so the models stay drop-in. To round directly under the canonical name (so the <name>.data sidecar reference inside the .onnx is never left pointing at a renamed <stem>.zm12.onnx.data), pass --inplace to overwrite the input and reuse its .data name:

    uv run scripts/zero-mantissa-bits.py encoder-model.onnx --inplace
    

    zero-mantissa-bits.py refuses to re-round an already-rounded file unless --force.
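The mantissa rounding described in the last bullet can be illustrated with a minimal NumPy sketch. This is not the repo's scripts/mantissa.py: the function names below are hypothetical, ties are rounded away from zero (the source only says "round-to-nearest"), and non-finite values are left untouched as an assumption. fp32 carries 23 explicit mantissa bits, so zeroing the low 12 leaves 11, one more than fp16's 10, while keeping fp32's full exponent range.

```python
import zlib

import numpy as np


def zero_low_mantissa_bits(weights, bits=12):
    """Round float32 values to the nearest value whose `bits` low mantissa bits are zero.

    Hypothetical helper for illustration; ties round away from zero.
    """
    if not 0 < bits < 23:
        raise ValueError("bits must be between 1 and 22")
    w = np.ascontiguousarray(weights, dtype=np.float32)
    raw = w.view(np.uint32)                       # reinterpret the IEEE-754 bit patterns
    half = np.uint32(1 << (bits - 1))             # half of the last kept unit
    mask = np.uint32((0xFFFFFFFF << bits) & 0xFFFFFFFF)
    rounded = ((raw + half) & mask).view(np.float32)
    # Assumption: keep inf/NaN as-is, and keep any value that would round up to inf.
    keep = ~np.isfinite(w) | ~np.isfinite(rounded)
    return np.where(keep, w, rounded)


def trailing_zero_mantissa_bits(weights):
    """Count the low mantissa bits that are zero in every value (0..23)."""
    raw = np.ascontiguousarray(weights, dtype=np.float32).view(np.uint32)
    combined = int(np.bitwise_or.reduce(raw.ravel())) & 0x7FFFFF
    if combined == 0:
        return 23
    return (combined & -combined).bit_length() - 1


def deflated_size(weights):
    """Size in bytes of the raw tensor bytes after DEFLATE (zlib level 6)."""
    return len(zlib.compress(np.ascontiguousarray(weights).tobytes(), 6))


# Stand-in weights; a real run would read the initializers of an ONNX model.
rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)
rounded = zero_low_mantissa_bits(weights, bits=12)

print("zero low bits before/after:",
      trailing_zero_mantissa_bits(weights), trailing_zero_mantissa_bits(rounded))
print("DEFLATE bytes before/after:", deflated_size(weights), deflated_size(rounded))
```

A detector like trailing_zero_mantissa_bits is one way a tool could notice an already-rounded file before rounding it again; whether the repo's mantissa_floor_bits works this way is not confirmed here.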

run.sh is a convenience wrapper that runs the int8 export above (with the exact flags used for the current build) followed by the WER eval. It keeps no personal paths: copy .env.example to .env (gitignored), set FLEURS_DIR to your local FLEURS path (and WER_QUANTS_SCRIPT if your wer-quants.py is not at the default in-monorepo location), then run ./run.sh. It exits early with a clear message if .env is missing.
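The int8 export that run.sh wraps ends, as noted above, with a cosine-similarity fidelity check of the new encoder's output against fp32. A minimal sketch of such a check follows; the function name, the output shape, and the synthetic data are assumptions for illustration, and the script's actual check may differ (for example in how it aggregates over windows).

```python
import numpy as np


def cosine_similarity(reference, candidate):
    """Cosine similarity between two tensors, compared as flat vectors."""
    a = np.asarray(reference, dtype=np.float64).ravel()
    b = np.asarray(candidate, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("outputs must have the same number of elements")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero output")
    return float(np.dot(a, b) / denom)


# Synthetic stand-ins for the two encoder outputs (the shape is arbitrary here).
rng = np.random.default_rng(0)
fp32_out = rng.standard_normal((1, 1024, 50))
int8_out = fp32_out + 0.01 * rng.standard_normal(fp32_out.shape)

score = cosine_similarity(fp32_out, int8_out)
print(f"cosine similarity vs fp32: {score:.6f}")
```

A score close to 1 only says the two outputs point in the same direction; it is a quick sanity check, not a replacement for the WER eval that run.sh runs afterwards.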

These scripts live in scripts/, are self-contained, and run from the repo root against the model files here, which they look for in the current directory by default (so invoke them as e.g. uv run scripts/quantize-fp16.py). Each declares its own dependencies via a PEP 723 header and runs with uv run, which installs them on the fly. The only external inputs are the FLEURS audio for the int8 calibration (via --fleurs-dir), the held-out speeches for the long-audio benchmark (fetched from the documented sources into calibration_audio/, see above), and the optional WER comparison harnesses the scripts print at the end (wer-quants.py, wer-bench.mjs), which live in the parakeet_web project repository.
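The running top-k percentile mentioned for the smoother's activation collection can be sketched as below. It relies on the fact that a high percentile only depends on the largest few order statistics, so only those need to stay in memory. The class name, the 99.9 percentile, and the requirement that the total row count is known up front are assumptions of this sketch, not details of the repo's implementation; it also only checks agreement with np.percentile to floating-point tolerance, whereas the repo states bit-identity.

```python
import numpy as np


class StreamingPercentile:
    """Per-channel percentile over rows that arrive in chunks (hypothetical helper).

    Assumes NumPy's default linear interpolation and a known total row count.
    """

    def __init__(self, n_total, q):
        self.n_total = n_total
        self.pos = (n_total - 1) * (q / 100.0)   # fractional index into the sorted rows
        self.lo = int(np.floor(self.pos))        # lower order statistic needed
        self.k = n_total - self.lo               # rows lo..n-1 are the top-k
        self.top = None                          # at most k rows per channel, unsorted
        self.seen = 0

    def update(self, rows):
        rows = np.asarray(rows, dtype=np.float64)
        self.seen += rows.shape[0]
        merged = rows if self.top is None else np.concatenate([self.top, rows], axis=0)
        if merged.shape[0] > self.k:
            # Keep only the k largest values of each channel (column).
            merged = np.partition(merged, merged.shape[0] - self.k, axis=0)[-self.k:]
        self.top = merged

    def result(self):
        if self.seen != self.n_total:
            raise RuntimeError("row count differs from the declared total")
        ordered = np.sort(self.top, axis=0)      # ordered[0] is order statistic `lo`
        frac = self.pos - self.lo
        a = ordered[0]
        b = ordered[1] if ordered.shape[0] > 1 else ordered[0]
        return a + (b - a) * frac


# Stand-in activations: 1000 rows (frames) by 8 channels, fed in 7 chunks.
rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 8))
stream = StreamingPercentile(n_total=data.shape[0], q=99.9)
for chunk in np.array_split(data, 7):
    stream.update(chunk)

print("rows kept per channel:", stream.top.shape[0])
print("max abs difference vs np.percentile:",
      np.max(np.abs(stream.result() - np.percentile(data, 99.9, axis=0))))
```

The retained state is k rows per channel regardless of how many chunks are fed, which is the property that keeps memory flat in the number of windows.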

Sources and credits

License

cc-by-4.0, inherited from the upstream istupakov ONNX model and the original NVIDIA Parakeet TDT 0.6B v3.
