Parakeet TDT 0.6B v3 (Multilingual) ONNX with a SmoothQuant int8 encoder

Drop-in replacement for istupakov/parakeet-tdt-0.6b-v3-onnx whose int8
encoder is rebuilt with SmoothQuant (MatMul-only static per-channel int8,
Percentile calibration; convolutions kept fp32) so it no longer degrades
on long audio: WER 10.89% overall vs the stock int8's 40.40% and fp16's
10.17% on a 390s single pass. Heavier than the stock int8 (842 vs 622 MB)
but tracks fp16 while using about half its RAM. Also ships the fp16
encoder (absent upstream). Includes the export scripts for provenance.
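
To make the smoothing idea concrete, here is a minimal sketch in plain Python. It is not the onnx-neural-compressor implementation and not the export script in this PR: the toy activations, the single weight column, and the helper names (`quantize`, `int8_matmul`, `smooth`) are all assumptions chosen to keep the arithmetic visible. It divides each activation channel by a smoothing scale, multiplies the matching weight by the same scale, and compares the int8 round-trip error before and after.

```python
# Toy SmoothQuant illustration (hypothetical data, standard library only).

def quantize(values, scale):
    """Symmetric int8 round trip: quantize to [-127, 127], then dequantize."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def matvec_rows(x_rows, w_col):
    """Multiply every activation row by one weight column."""
    return [sum(x * w for x, w in zip(row, w_col)) for row in x_rows]

def int8_matmul(x_rows, w_col):
    """One scale for the whole activation tensor, one for the weight column."""
    x_scale = max(abs(v) for row in x_rows for v in row) / 127
    w_scale = max(abs(w) for w in w_col) / 127
    xq = [quantize(row, x_scale) for row in x_rows]
    wq = quantize(w_col, w_scale)
    return matvec_rows(xq, wq)

def smooth(x_rows, w_col, alpha=0.5):
    """Move activation outliers into the weights: X / s and s * W."""
    act_max = [max(abs(row[j]) for row in x_rows) for j in range(len(w_col))]
    scales = [a ** alpha / abs(w) ** (1 - alpha) for a, w in zip(act_max, w_col)]
    x_s = [[v / s for v, s in zip(row, scales)] for row in x_rows]
    w_s = [w * s for w, s in zip(w_col, scales)]
    return x_s, w_s

def max_err(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

# Two "tokens", three channels; the third channel is the outlier.
X = [[0.3, -0.5, 100.0], [-0.8, 0.2, -60.0]]
W = [1.0, -1.0, 0.4]

exact = matvec_rows(X, W)
Xs, Ws = smooth(X, W)

err_plain = max_err(exact, int8_matmul(X, W))
err_smooth = max_err(exact, int8_matmul(Xs, Ws))
print(f"max error, plain int8:    {err_plain:.4f}")
print(f"max error, smoothed int8: {err_smooth:.4f}")
```

In exact arithmetic the smoothed product equals the original one. The int8 error drops in this toy case because the outlier channel no longer dictates the activation scale.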

Built with Claude Code.

Files changed (13)

.gitattributes +36 -0
README.md +170 -0
config.json +5 -0
decoder_joint-model.fp16.onnx +3 -0
decoder_joint-model.int8.onnx +3 -0
decoder_joint-model.onnx +3 -0
encoder-model.fp16.onnx +3 -0
encoder-model.onnx +3 -0
encoder-model.onnx.data +3 -0
nemo128.onnx +3 -0
quantize-fp16.py +185 -0
quantize-int8-smoothquant.py +416 -0
vocab.txt +0 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+encoder-model.onnx.data filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,170 @@

+---
+license: cc-by-4.0
+language:
+  - en
+  - es
+  - fr
+  - de
+  - bg
+  - hr
+  - cs
+  - da
+  - nl
+  - et
+  - fi
+  - el
+  - hu
+  - it
+  - lv
+  - lt
+  - mt
+  - pl
+  - pt
+  - ro
+  - sk
+  - sl
+  - sv
+  - ru
+  - uk
+base_model:
+  - nvidia/parakeet-tdt-0.6b-v3
+  - istupakov/parakeet-tdt-0.6b-v3-onnx
+pipeline_tag: automatic-speech-recognition
+tags:
+  - automatic-speech-recognition
+  - asr
+  - onnx
+  - onnx-asr
+  - smoothquant
+  - quantization
+---
+# Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder
+This is [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
+with **one change**: the int8 encoder (`encoder-model.int8.onnx`) is rebuilt with
+[SmoothQuant](https://github.com/onnx/neural-compressor) so it no longer loses
+accuracy on long audio. Every file inherited from upstream (the fp32 encoder, the
+decoder, the preprocessor and the tokenizer) is unchanged, so this repo is a
+drop-in replacement for the original: point your loader at it and the better int8
+is picked up automatically by its canonical name.
+It also ships the **fp16** encoder (which the upstream istupakov repo does not),
+so all three precisions are available here in one place.
+## Why this exists
+The stock int8 encoder transcribes short clips fine, but its accuracy degrades
+badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32
+encoders do **not** show this, so the model architecture is not at fault: it is an
+int8 *numerics* problem. The stock int8 uses fully **dynamic, per-tensor** activation
+quantization (one runtime scale for an entire activation tensor). Once a longer
+sequence widens the activation distribution, that single scale can no longer
+represent it and the transcript falls apart.
+SmoothQuant targets exactly this failure mode: it migrates the per-channel
+activation outliers into the weights (a folded multiply), then statically
+quantizes activations together with **per-channel** weights. The smoothed,
+per-channel int8 encoder holds up over long audio instead of collapsing.
+Background and discussion:
+[Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354).
+## Results
+Benchmark: a single ~390 second pass of a JFK speech clip (no chunking), scored
+**per 60 second section** with the fp32 encoder as the oracle. Each section is
+also transcribed independently as a short clip, which all the encoders handle
+well, and that short-clip fp32 transcription is the reference. A WER that climbs
+as you go down the table indicates long-audio degradation. Run with `scripts/wer-quants.py`
+from the project repository (TODO: add the GitHub URL of the parakeet_web project
+here); the export and comparison are fully reproducible with the scripts included
+in this repo.
+Overall (single 390 s pass, lower WER is better):
+| encoder precision           | encoder size | overall WER | peak RAM |
+| --------------------------- | ------------ | ----------- | -------- |
+| stock int8 (istupakov)      | 622 MB       | 40.40%      | ~5.0 GB  |
+| **SmoothQuant int8 (this)** | **842 MB**   | **10.89%**  | ~4.9 GB  |
+| fp16                        | ~1.2 GB      | 10.17%      | ~9.5 GB  |
+Per-section WER:
+| section     | stock int8 | **SmoothQuant int8** | fp16  |
+| ----------- | ---------- | -------------------- | ----- |
+| 0 to 60 s   | 41.4%      | **2.6%**             | 2.6%  |
+| 60 to 120 s | 29.2%      | **3.5%**             | 5.3%  |
+| 120 to 180 s| 39.1%      | **5.5%**             | 3.9%  |
+| 180 to 240 s| 28.2%      | **3.4%**             | 3.4%  |
+| 240 to 300 s| 69.5%      | **45.1%**            | 45.1% |
+| 300 to 360 s| 46.8%      | **25.5%**            | 23.4% |
+| 360 to 390 s| 37.5%      | **8.3%**             | 4.2%  |
+The SmoothQuant int8 **tracks fp16 closely** (10.89% overall vs fp16's
+10.17%, a 0.7 point gap) and has roughly 3.7x lower WER than the stock int8's 40.40%.
+The 240 to 360 s sections are elevated for fp16 too, which points to the audio or
+the oracle reference for those windows rather than to a quantization artifact: the
+SmoothQuant int8 stays within about 2 points of fp16 there, while the stock int8
+climbs to 69.5%.
+### Trade-off: heavier than the stock int8, much more accurate
+This int8 encoder is **842 MB versus the stock 622 MB**. That is deliberate: only
+the MatMul ops are quantized, and the convolutional subsampling front-end is kept
+in fp32 (statically quantizing it collapsed the encoder to an empty transcript).
+The extra size buys long-audio accuracy that tracks fp16, at about **half the RAM
+of fp16** (~4.9 GB versus ~9.5 GB). If you can run fp16 or fp32 (for example on a
+WebGPU backend), prefer those. This int8 matters most on a CPU / WASM backend,
+where fp16 has no compute kernels and int8 is the only precision that both fits
+and runs.
+## Files
+| file                            | what it is                                              |
+| ------------------------------- | ------------------------------------------------------- |
+| `encoder-model.onnx` (+ `.data`)| fp32 encoder (unchanged from istupakov)                 |
+| `encoder-model.fp16.onnx`       | fp16 encoder (not shipped by the upstream istupakov repo)|
+| `encoder-model.int8.onnx`       | **SmoothQuant int8 encoder (the reason for this repo)** |
+| `decoder_joint-model.onnx`      | fp32 decoder / joint network (unchanged)                |
+| `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged)                |
+| `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged)                |
+| `nemo128.onnx`                  | 128-bin mel preprocessor (unchanged)                    |
+| `vocab.txt`, `config.json`      | tokenizer and model config (unchanged)                  |
+| `quantize-int8-smoothquant.py`  | script that produced the SmoothQuant int8 encoder       |
+| `quantize-fp16.py`              | script that produced the fp16 encoder                   |
+## How it was built
+- `encoder-model.int8.onnx`: `quantize-int8-smoothquant.py`. SmoothQuant +
+  static per-channel int8, MatMul ops only (convolutions stay fp32), with
+  Percentile activation calibration. Calibration needs no labelled or long
+  dataset: it auto-discovers local speech and slices it into long 30 s windows,
+  and the fp32 encoder is used as the accuracy oracle. The script ends with a
+  cosine-similarity fidelity check of the new encoder's output against fp32.
+- `encoder-model.fp16.onnx`: `quantize-fp16.py`, a straight fp16 cast of the
+  fp32 encoder pieces.
+`quantize-int8-smoothquant.py` is self-contained: it declares its own dependencies
+via a [PEP 723](https://peps.python.org/pep-0723/) header and runs with `uv run`.
+`quantize-fp16.py` is a plain Python script that requires `onnx` and `onnxruntime`.
+Both are included here for provenance and reproducibility. They reference fixtures
+from the parakeet_web project repository, so to re-run them clone that project
+(TODO: add the GitHub URL here) and run them from there.
+## Sources and credits
+- ONNX base model this repo is built on:
+  [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
+- Original model:
+  [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
+- SmoothQuant implementation:
+  [onnx/neural-compressor](https://github.com/onnx/neural-compressor)
+- Loaded with [onnx-asr](https://github.com/istupakov/onnx-asr)
+- Discussion:
+  [Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354)
+- This repository (model export, benchmarking and documentation) was produced
+  with [Claude Code](https://claude.com/claude-code).
+## License
+`cc-by-4.0`, inherited from the upstream istupakov ONNX model and the original
+NVIDIA Parakeet TDT 0.6B v3.
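
The per-section scoring described in the README's Results section can be sketched as follows. This is not `scripts/wer-quants.py` (which is not part of this diff): the text normalisation, the function names, and the two example sections are assumptions, and the sentences are illustrative placeholders, not benchmark data. WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words.

```python
import re

def normalize(text):
    """Lowercase and split into words (a simple, assumed normalisation)."""
    return re.findall(r"[\w']+", text.lower())

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

# Hypothetical (reference, hypothesis) pairs, one per 60 s section.
sections = {
    "0 to 60 s": ("we choose to go to the moon",
                  "we choose to go to the moon"),
    "60 to 120 s": ("not because they are easy but because they are hard",
                    "not because they are easy because they are hard"),
}

for label, (reference, hypothesis) in sections.items():
    print(f"{label}: WER {100 * wer(reference, hypothesis):.1f}%")
```

In the README's setup the reference for each section would be the fp32 short-clip transcription, and the hypothesis would be the matching slice of the single long pass.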

config.json ADDED

	@@ -0,0 +1,5 @@

+{
+    "model_type": "nemo-conformer-tdt",
+    "features_size": 128,
+    "subsampling_factor": 8
+}

decoder_joint-model.fp16.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1832e60f96f6e7725ceeab5c346c84484c9ac55e12b3e8b2f4296e1710d02b2e
+size 36264822

decoder_joint-model.int8.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eea7483ee3d1a30375daedc8ed83e3960c91b098812127a0d99d1c8977667a70
+size 18202004

decoder_joint-model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e978ddf6688527182c10fde2eb4b83068421648985ef23f7a86be732be8706c1
+size 72520893

encoder-model.fp16.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fa398ad252bdeaa714cbc67d3add0a0e28f15bcd8bce2e4d0ee1eb0d4351b36
+size 1238960362

encoder-model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:98a74b21b4cc0017c1e7030319a4a96f4a9506e50f0708f3a516d02a77c96bb1
+size 41770866

encoder-model.onnx.data ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9a22d372c51455c34f13405da2520baefb7125bd16981397561423ed32d24f36
+size 2435420160

nemo128.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a9fde1486ebfcc08f328d75ad4610c67835fea58c73ba57e3209a6f6cf019e9f
+size 139764

quantize-fp16.py ADDED

	@@ -0,0 +1,185 @@

+#!/usr/bin/env python
+"""Convert the fp32 Parakeet ONNX pieces to float16, to land under the WASM /
+Chromium ~2 GB blob limits without the heavy accuracy loss of int8.
+Why fp16 (see CLAUDE.md for the full reasoning): the fp32 encoder is ~2.44 GB
+of external weights, which cannot load on the WASM backend (32-bit WASM caps a
+single ArrayBuffer at ~2 GB and Chromium's blob-URL fetch caps around 2 GB
+too). int8 (~600 MB) fits but degrades quality. fp16 halves the fp32 weights to
+~1.2 GB: under both caps, and near-lossless versus fp32. This script produces
+that fp16 variant from locally-supplied fp32 files so it can be benchmarked
+(scripts/wer-bench.mjs) before deciding whether to ship it.
+It converts the two pieces that matter:
+  - encoder-model.onnx (+ encoder-model.onnx.data)  -> encoder-model.fp16.onnx
+  - decoder_joint-model.onnx                         -> decoder_joint-model.fp16.onnx
+nemo128.onnx (the ONNX preprocessor) is intentionally skipped: the web app and
+scripts/transcribe.mjs use the pure-JS mel preprocessor (mel.js), so the ONNX
+preprocessor is never loaded.
+Useful reference: https://huggingface.co/grikdotnet/parakeet-tdt-0.6b-fp16
+documents the same conversion (same pieces, same keep_io_types=True /
+disable_shape_infer=True settings). It uses onnxconverter_common.float16 plus a
+separate post-processing pass to rewrite leftover internal Cast(to=FLOAT) nodes
+to Cast(to=FLOAT16). We instead use onnxruntime.transformers.float16, the
+evolved fork of that same converter, which handles those internal casts itself,
+so no separate cast-fixing pass is needed (a topological_sort below is enough).
+keep_io_types=True is deliberate and load-bearing: the encoder/decoder graphs
+take and return float32 tensors (audio_signal, outputs, encoder_outputs, and
+the decoder's LSTM input_states_*/output_states_*). Keeping the I/O boundary at
+float32 means the JS pipeline (parakeet.js) feeds and reads exactly the same
+dtypes as for the fp32/int8 models, so NOTHING in the JS side needs to change;
+only the weights and internal compute become fp16.
+Usage:
+  python scripts/quantize-fp16.py                       # ./fallback_models in place
+  python scripts/quantize-fp16.py --model-dir DIR --out-dir DIR
+  python scripts/quantize-fp16.py --external-data       # force .onnx.data sidecar
+Requires: onnx, onnxruntime (provides onnxruntime.transformers.float16).
+Built with Claude Code.
+"""
+import argparse
+import os
+import sys
+import time
+import onnx
+from onnxruntime.transformers.float16 import convert_float_to_float16
+from onnxruntime.transformers.onnx_model import OnnxModel
+# (input fp32 file, output fp16 file). Only the encoder carries external weights.
+PIECES = [
+    ("encoder-model.onnx", "encoder-model.fp16.onnx"),
+    ("decoder_joint-model.onnx", "decoder_joint-model.fp16.onnx"),
+]
+# Single-protobuf serialisation hard-caps at 2 GB. The fp16 encoder is ~1.2 GB
+# so an inline save normally fits, but we keep a margin and fall back to an
+# external-data sidecar (which scripts/transcribe.mjs createSession() already
+# resolves via the "<model>.data" probe) if we get close.
+TWO_GB = 2 * 1024 ** 3
+def human(n):
+    for unit in ("B", "KB", "MB", "GB"):
+        if n < 1024 or unit == "GB":
+            return f"{n:.1f} {unit}"
+        n /= 1024
+def file_size(path):
+    total = os.path.getsize(path)
+    data = path + ".data"
+    if os.path.exists(data):
+        total += os.path.getsize(data)
+    return total
+def convert_one(in_path, out_path, force_external, op_block_list):
+    if not os.path.exists(in_path):
+        raise FileNotFoundError(f"missing input model: {in_path}")
+    in_size = file_size(in_path)
+    print(f"[fp16] {os.path.basename(in_path)} ({human(in_size)}) -> "
+          f"{os.path.basename(out_path)}")
+    # load_external_data=True (default) pulls the sibling .onnx.data into memory
+    # so the converter sees real tensors. This needs ~the fp32 model's size in
+    # RAM for the encoder (~2.4 GB); that is the price of an in-memory convert.
+    t0 = time.time()
+    model = onnx.load(in_path, load_external_data=True)
+    # disable_shape_infer=True: onnx shape inference serialises the model to run,
+    # which would hit the 2 GB protobuf limit on the fp32 encoder. keep_io_types
+    # pins the float32 boundary so the converter still inserts the right casts.
+    fp16_model = convert_float_to_float16(
+        model,
+        keep_io_types=True,
+        disable_shape_infer=True,
+        op_block_list=op_block_list if op_block_list else None,
+    )
+    # keep_io_types=True prepends graph_input_cast_* / appends graph_output_cast_*
+    # nodes but does NOT re-sort the graph, leaving it not topologically sorted.
+    # onnx.checker rejects that and ORT-web fails to build the session (it
+    # surfaced as a std::bad_alloc). A topological sort fixes the node order.
+    OnnxModel(fp16_model).topological_sort()
+    convert_s = time.time() - t0
+    # Estimate serialized size to choose inline vs external. ByteSize() is exact
+    # but can itself overflow near 2 GB, so guard it.
+    try:
+        approx = fp16_model.ByteSize()
+        big = approx >= TWO_GB - (64 * 1024 ** 2)  # 64 MB safety margin
+    except (ValueError, OverflowError):
+        big = True
+    use_external = force_external or big
+    # A stale sidecar from a previous run would be silently reused by ORT, so
+    # clear it when we are NOT writing external data this time.
+    sidecar = out_path + ".data"
+    if not use_external and os.path.exists(sidecar):
+        os.remove(sidecar)
+    if use_external:
+        onnx.save(
+            fp16_model, out_path,
+            save_as_external_data=True,
+            all_tensors_to_one_file=True,
+            location=os.path.basename(sidecar),
+            convert_attribute=False,
+        )
+    else:
+        onnx.save(fp16_model, out_path)
+    out_size = file_size(out_path)
+    print(f"       converted in {convert_s:.1f}s, "
+          f"{'external' if use_external else 'inline'} -> {human(out_size)} "
+          f"({100 * out_size / in_size:.0f}% of fp32)")
+    return in_size, out_size
+def main():
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--model-dir", default="fallback_models",
+                    help="directory holding the fp32 .onnx files (default: fallback_models)")
+    ap.add_argument("--out-dir", default=None,
+                    help="output directory (default: same as --model-dir)")
+    ap.add_argument("--external-data", action="store_true",
+                    help="always write weights to a .onnx.data sidecar (default: inline when it fits under 2 GB)")
+    ap.add_argument("--op-block-list", default="",
+                    help="comma-separated ONNX op types to keep in fp32 (default: the converter's built-in list)")
+    args = ap.parse_args()
+    model_dir = args.model_dir
+    out_dir = args.out_dir or model_dir
+    os.makedirs(out_dir, exist_ok=True)
+    op_block_list = [s.strip() for s in args.op_block_list.split(",") if s.strip()]
+    total_in = total_out = 0
+    for in_name, out_name in PIECES:
+        in_size, out_size = convert_one(
+            os.path.join(model_dir, in_name),
+            os.path.join(out_dir, out_name),
+            args.external_data,
+            op_block_list,
+        )
+        total_in += in_size
+        total_out += out_size
+    print(f"[fp16] done: {human(total_in)} fp32 -> {human(total_out)} fp16 "
+          f"({100 * total_out / total_in:.0f}%). vocab.txt is reused as-is.")
+    enc_out = file_size(os.path.join(out_dir, "encoder-model.fp16.onnx"))
+    if enc_out >= TWO_GB:
+        print(f"[fp16] WARNING: fp16 encoder is {human(enc_out)}, still >= 2 GB; "
+              f"it will NOT load on the WASM backend.", file=sys.stderr)
+if __name__ == "__main__":
+    main()
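
The comment in `convert_one` about cast nodes leaving the graph unsorted can be illustrated with a toy version of the problem. The sketch below is not onnxruntime's `OnnxModel.topological_sort`: it is a generic Kahn's-algorithm sort over hypothetical `(name, inputs, outputs)` tuples, with made-up tensor names, showing how a cast node that sits after its consumer is moved in front of it.

```python
from collections import deque

def topological_sort(nodes):
    """Kahn's algorithm over (name, inputs, outputs) tuples.

    Tensors that no node produces are treated as graph inputs or
    initializers and impose no ordering constraint.
    """
    producer = {out: i for i, (_, _, outs) in enumerate(nodes) for out in outs}
    pending = []                                  # unmet dependencies per node
    consumers = {i: [] for i in range(len(nodes))}
    for i, (_, inputs, _) in enumerate(nodes):
        deps = {producer[t] for t in inputs if t in producer}
        pending.append(len(deps))
        for d in deps:
            consumers[d].append(i)

    ready = deque(i for i, count in enumerate(pending) if count == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(nodes[i])
        for c in consumers[i]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# Hypothetical node list: the input cast appears AFTER the MatMul that needs it.
unsorted_nodes = [
    ("MatMul", ["x_fp16"], ["y_fp16"]),
    ("graph_output_cast", ["y_fp16"], ["y"]),
    ("graph_input_cast", ["x"], ["x_fp16"]),
]

sorted_names = [name for name, _, _ in topological_sort(unsorted_nodes)]
print(sorted_names)
```

After sorting, every node appears after the nodes that produce its inputs, which is the ordering property the script's comment says `onnx.checker` requires.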

quantize-int8-smoothquant.py ADDED

	@@ -0,0 +1,416 @@

+#!/usr/bin/env -S uv run --script
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#     "onnx",
+#     "onnxruntime",
+#     "onnx-neural-compressor",
+#     "numpy",
+#     "sympy",
+#     "prettytable",
+#     "psutil",
+#     "scipy",
+# ]
+# ///
+"""Export a *better* int8 Parakeet encoder using SmoothQuant static quantization.
+Why this exists (see CLAUDE.md / ARCHITECTURE.md for the full story): the int8
+encoder we currently ship (istupakov's) silently loses long-range information
+past ~20 s within a single chunk, so the WASM backend is pinned to a 20 s chunk
+window while fp16/fp32 happily run 60 s. Crucially the model architecture is NOT
+the problem (fp16 holds flat at long windows); it is an int8 *numerics* problem:
+a single per-tensor activation scale copes badly once a longer sequence widens
+the activation distribution. That is exactly the regime SmoothQuant targets: it
+migrates the per-channel activation outliers into the weights (a folded Mul),
+then static-quantizes activations + per-channel weights. The bet is that a
+SmoothQuant + per-channel int8 encoder degrades far less over a long chunk,
+which would let WASM use the full 60 s window.
+This produces ONLY the encoder int8 (`encoder-model.int8.smoothquant.onnx`). The
+decoder is tiny and is not where the long-range loss lives, so we deliberately
+reuse istupakov's existing `decoder_joint-model.int8.onnx`; that isolates the
+comparison to the encoder change.
+Calibration data, with ZERO digging required: SmoothQuant needs representative
+*activations*, not labels, so any speech works. We auto-discover whatever audio
+is already in the tree (the committed FLEURS fixture is always present) and slice
+it into deliberately LONG windows (default 30 s) so the smoothing scales are
+computed over the very long-range distribution we are trying to fix. The encoder
+takes mel features, not raw audio, so each window is first run through the
+committed `nemo128.onnx` preprocessor (raw waveform -> 128-bin mel features) and
+those features are fed to the encoder, exactly as the real pipeline does.
+After export, compare against fp16 with the existing per-section harness:
+    # the NEW SmoothQuant int8 (served from the symlinked candidate dir):
+    uv run scripts/wer-quants.py --model-dir fallback_models_sq --quants int8
+    # the OLD istupakov int8 + the fp16 reference, for the baseline:
+    uv run scripts/wer-quants.py --model-dir fallback_models   --quants int8,fp16
+Both use the same fp32 oracle reference, so a per-section WER that rises less
+steeply for the new int8 (closer to fp16) is the win we are after. This script
+prints those two commands at the end and, unless --no-candidate is passed, builds
+the `fallback_models_sq` symlink farm they need.
+By default only MatMul ops are quantized (the conv subsampling front-end stays
+fp32: it is quant-fragile and collapsed the encoder when quantized) and
+activations are calibrated with the Percentile method (MinMax let a single
+long-tail outlier crush the scale). A post-export fidelity check compares the new
+encoder's output to the fp32 encoder by cosine similarity and warns loudly on a
+likely collapse, instead of only checking output shape.
+Usage:
+  uv run scripts/quantize-int8-smoothquant.py                  # auto everything
+  uv run scripts/quantize-int8-smoothquant.py --alpha 0.6      # more weight-side migration
+  uv run scripts/quantize-int8-smoothquant.py --num-windows 32 --window-sec 30
+  uv run scripts/quantize-int8-smoothquant.py --audio a.mp3 --audio b.wav
+  uv run scripts/quantize-int8-smoothquant.py --op-types MatMul,Conv   # also quantize convs
+  uv run scripts/quantize-int8-smoothquant.py --calibrate-method entropy
+  uv run scripts/quantize-int8-smoothquant.py --quant-format qdq
+Built with Claude Code.
+"""
+import argparse
+import os
+import shutil
+import subprocess
+import sys
+import time
+from pathlib import Path
+import numpy as np
+import onnx
+import onnxruntime as ort
+from onnxruntime.quantization import CalibrationMethod, QuantFormat, QuantType
+from onnx_neural_compressor import data_reader
+from onnx_neural_compressor.quantization import config, quantize
+from onnx_neural_compressor.algorithms.smoother import core as _sq_core
+ROOT = Path(__file__).resolve().parent.parent
+# --- FastConformer compatibility shim for onnx-neural-compressor's SmoothQuant -
+# The library's smoother hard-assumes a 3D activation is (batch, seq, in_channel)
+# with the in-channel LAST (there is a literal TODO admitting this in
+# Calibrator._get_max_per_channel). That holds for BERT-style graphs but NOT for
+# a few FastConformer MatMuls (the relative-position attention projections, where
+# the weight is the first operand and the activation contracts over the sequence
+# axis). For those, the per-channel activation max is taken over the wrong axis
+# and no longer matches the weight's in-channel length, so _get_smooth_scale dies
+# broadcasting e.g. (101,) against (2048,).
+#
+# These two wrappers make the smoother SKIP exactly those unresolvable nodes
+# (return None -> stripped before any Mul is inserted) instead of crashing. All
+# the well-behaved linears (FFN, standard projections, the bulk of the weights)
+# are still smoothed; the skipped handful simply fall through to plain static
+# int8. _insert_smooth_mul_op iterates scales.keys() and _adjust_weights guards
+# with `if key not in scales`, so omitting a node is safe. NOTE: this monkeypatch
+# reaches into library internals and may need revisiting on a neural-compressor
+# upgrade; it is contained to this experimental export script.
+_SKIPPED = {"count": 0}
+_orig_get_smooth_scale = _sq_core.Smoother._get_smooth_scale
+_orig_get_smooth_scales = _sq_core.Smoother._get_smooth_scales
+def _safe_get_smooth_scale(self, weights, specific_alpha, tensor):
+    weights_max = np.amax(np.abs(weights.reshape(weights.shape[0], -1)), axis=-1)
+    if self.max_vals_per_channel[tensor].shape != weights_max.shape:
+        _SKIPPED["count"] += 1
+        return None  # layout the per-channel logic can't resolve: don't smooth it
+    return _orig_get_smooth_scale(self, weights, specific_alpha, tensor)
+def _safe_get_smooth_scales(self, alpha, target_list=[]):
+    scales = _orig_get_smooth_scales(self, alpha, target_list)
+    return {k: v for k, v in scales.items() if v is not None}
+_sq_core.Smoother._get_smooth_scale = _safe_get_smooth_scale
+_sq_core.Smoother._get_smooth_scales = _safe_get_smooth_scales
+# Audio we can use for calibration with no user input. The first entry is the
+# committed FLEURS fixture (always present); the others are picked up only if
+# they happen to exist locally (the gitignored moon-speech cache is a long,
+# single-speaker bonus that strengthens the long-range calibration).
+DEFAULT_CALIB_AUDIO = [
+    ROOT / "test/fixtures/fleurs/stitched.mp3",
+    ROOT / "test/e2e/.cache/jfk-moon/full.mp3",
+    ROOT / "venlaf.aac",
+]
+SAMPLE_RATE = 16000
+def human(n):
+    for unit in ("B", "KB", "MB", "GB"):
+        if n < 1024 or unit == "GB":
+            return f"{n:.1f} {unit}"
+        n /= 1024
+def find_ffmpeg(explicit=None):
+    cand = explicit or os.environ.get("FFMPEG") or shutil.which("ffmpeg")
+    if not cand or not (shutil.which(cand) or os.path.exists(cand)):
+        sys.exit("ffmpeg not found (set $FFMPEG or pass --ffmpeg).")
+    return cand
+def decode_pcm(ffmpeg, path):
+    """Decode any audio file to mono 16 kHz float32 PCM via ffmpeg."""
+    cmd = [ffmpeg, "-v", "error", "-i", str(path),
+           "-f", "f32le", "-ac", "1", "-ar", str(SAMPLE_RATE), "-"]
+    out = subprocess.run(cmd, capture_output=True)
+    if out.returncode != 0:
+        raise RuntimeError(f"ffmpeg failed on {path}: {out.stderr.decode()[-300:]}")
+    return np.frombuffer(out.stdout, dtype=np.float32)
+def collect_windows(ffmpeg, audio_paths, window_sec, num_windows):
+    """Slice every available clip into non-overlapping FULL-length windows, then
+    evenly subsample down to num_windows so calibration stays quick but diverse.
+    All windows are exactly `win` samples long on purpose: SmoothQuant's
+    calibrator np.stacks the per-op activations across calibration samples, so a
+    variable-length tail window (different T -> different activation shape) makes
+    it raise 'all input arrays must have the same shape'. We therefore drop any
+    partial tail rather than pad it."""
+    win = int(window_sec * SAMPLE_RATE)
+    windows = []
+    for p in audio_paths:
+        if not Path(p).exists():
+            continue
+        pcm = decode_pcm(ffmpeg, p)
+        n = len(pcm)
+        count = 0
+        start = 0
+        while start + win <= n:
+            windows.append(pcm[start:start + win])
+            start += win
+            count += 1
+        print(f"  [calib] {Path(p).name}: {n / SAMPLE_RATE:.0f}s -> {count} full window(s)")
+    if not windows:
+        sys.exit(f"No calibration audio yielded a full {window_sec:g}s window. "
+                 "Pass --audio <file> or lower --window-sec.")
+    if len(windows) > num_windows:
+        # Even stride across the whole pool for speaker/content diversity.
+        idx = np.linspace(0, len(windows) - 1, num_windows).round().astype(int)
+        windows = [windows[i] for i in dict.fromkeys(idx)]
+    return windows
+def build_features(pre_path, windows):
+    """Run each raw-audio window through nemo128.onnx -> encoder mel features.
+    Precomputed once into memory so the calibration reader can rewind cheaply
+    (SmoothQuant + the static min/max + calibration passes each re-read it)."""
+    sess = ort.InferenceSession(str(pre_path), providers=["CPUExecutionProvider"])
+    feats = []
+    for w in windows:
+        wav = w.astype(np.float32)[None, :]
+        lens = np.array([wav.shape[1]], dtype=np.int64)
+        features, features_lens = sess.run(None, {"waveforms": wav, "waveforms_lens": lens})
+        feats.append({
+            "audio_signal": features.astype(np.float32),
+            "length": features_lens.astype(np.int64),
+        })
+    return feats
+class FeatureReader(data_reader.CalibrationDataReader):
+    """Feeds the encoder its real (audio_signal, length) inputs for calibration."""
+    def __init__(self, feats):
+        self.feats = feats
+        self.i = 0
+    def get_next(self):
+        if self.i >= len(self.feats):
+            return None
+        item = self.feats[self.i]
+        self.i += 1
+        return item
+    def rewind(self):
+        self.i = 0
+def build_candidate_dir(model_dir, new_encoder, candidate_dir):
+    """Symlink-farm a model dir where encoder-model.int8.onnx IS the new encoder,
+    so wer-quants.py (which loads int8 by that canonical name via onnx-asr) serves
+    the SmoothQuant encoder while reusing every other unchanged file."""
+    model_dir = Path(model_dir).resolve()
+    candidate_dir = Path(candidate_dir).resolve()
+    candidate_dir.mkdir(parents=True, exist_ok=True)
+    for f in model_dir.iterdir():
+        if f.is_dir():
+            continue
+        link = candidate_dir / f.name
+        if link.is_symlink() or link.exists():
+            link.unlink()
+        link.symlink_to(f.resolve())
+    # Override the int8 encoder to point at the freshly exported SmoothQuant file.
+    enc_link = candidate_dir / "encoder-model.int8.onnx"
+    if enc_link.is_symlink() or enc_link.exists():
+        enc_link.unlink()
+    enc_link.symlink_to(Path(new_encoder).resolve())
+    return candidate_dir
+def main():
+    ap = argparse.ArgumentParser(
+        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--model-dir", default=str(ROOT / "fallback_models"),
+                    help="dir holding encoder-model.onnx (+.data) and nemo128.onnx")
+    ap.add_argument("--out-name", default="encoder-model.int8.smoothquant.onnx",
+                    help="output filename (written into --model-dir)")
+    ap.add_argument("--candidate-dir", default=str(ROOT / "fallback_models_sq"),
+                    help="symlink-farm dir wer-quants.py points at for the new int8")
+    ap.add_argument("--no-candidate", action="store_true",
+                    help="skip building the wer-quants candidate symlink dir")
+    ap.add_argument("--alpha", type=float, default=0.5,
+                    help="SmoothQuant alpha (0..1): higher migrates more difficulty "
+                         "to the weights, better for big activation outliers")
+    ap.add_argument("--num-windows", type=int, default=24,
+                    help="max calibration windows (evenly sampled across all audio)")
+    ap.add_argument("--window-sec", type=float, default=30.0,
+                    help="calibration window length in seconds; long on purpose (the bug is long-range)")
+    ap.add_argument("--audio", action="append", default=None,
+                    help="calibration audio file(s); repeatable. Default: auto-discover.")
+    ap.add_argument("--quant-format", choices=["qoperator", "qdq"], default="qoperator",
+                    help="QOperator (QLinear* ops, matches the shipped int8) or QDQ")
+    ap.add_argument("--op-types", default="MatMul",
+                    help="comma-separated op types to quantize. Default MatMul ONLY: the "
+                         "conv subsampling front-end is quant-fragile and is the prime suspect "
+                         "for a collapsed encoder, so convs stay fp32. Pass 'MatMul,Conv' to "
+                         "also quantize convs (matches istupakov's scope).")
+    ap.add_argument("--calibrate-method", choices=["minmax", "entropy", "percentile"],
+                    default="percentile",
+                    help="static activation calibration. MinMax (the library default) lets a "
+                         "single long-tail outlier crush the scale and can collapse the encoder; "
+                         "percentile/entropy clip the tail and are far more robust here.")
+    ap.add_argument("--fidelity-warn", type=float, default=0.90,
+                    help="cosine-similarity floor (vs the fp32 encoder, one window) below which "
+                         "the export is flagged as a likely collapse before any WER run. This is "
+                         "a COLLAPSE detector, not a quality score: a healthy MatMul-only export "
+                         "measured ~0.96 cosine yet tracked fp16 WER (10.9%% vs 10.2%%), so the "
+                         "floor sits well below that. A true collapse lands far lower.")
+    ap.add_argument("--ffmpeg", default=None, help="ffmpeg binary (else $FFMPEG / PATH)")
+    args = ap.parse_args()
+    model_dir = Path(args.model_dir)
+    in_encoder = model_dir / "encoder-model.onnx"
+    pre_path = model_dir / "nemo128.onnx"
+    out_encoder = model_dir / args.out_name
+    for p in (in_encoder, pre_path):
+        if not p.exists():
+            sys.exit(f"missing required file: {p}")
+    ffmpeg = find_ffmpeg(args.ffmpeg)
+    audio = [Path(a) for a in args.audio] if args.audio else DEFAULT_CALIB_AUDIO
+    print(f"[sq] calibration: up to {args.num_windows} x {args.window_sec:g}s windows")
+    windows = collect_windows(ffmpeg, audio, args.window_sec, args.num_windows)
+    print(f"[sq] using {len(windows)} calibration window(s); extracting mel features...")
+    feats = build_features(pre_path, windows)
+    fmt = QuantFormat.QOperator if args.quant_format == "qoperator" else QuantFormat.QDQ
+    calib = {"minmax": CalibrationMethod.MinMax,
+             "entropy": CalibrationMethod.Entropy,
+             "percentile": CalibrationMethod.Percentile}[args.calibrate_method]
+    op_types = [t.strip() for t in args.op_types.split(",") if t.strip()]
+    cfg = config.StaticQuantConfig(
+        calibration_data_reader=FeatureReader(feats),
+        quant_format=fmt,
+        calibrate_method=calib,
+        activation_type=QuantType.QUInt8,
+        weight_type=QuantType.QInt8,
+        # Which weight-bearing ops to quantize. Default is MatMul ONLY: the conv
+        # subsampling front-end (pre_encode.*) sees the raw mel features with a
+        # wide dynamic range and is notoriously quant-fragile; statically
+        # quantizing it can produce garbage that propagates and empties the
+        # transcript, so we leave all convs fp32 (the user is fine trading the
+        # extra size for safety). MatMul-only also dodges the static quantizer's
+        # Pad handler, which trips on FastConformer's optional/empty Pad inputs
+        # ("Quantization parameters are not specified for param .").
+        op_types_to_quantize=op_types,
+        per_channel=True,        # the other half of the fix: per-channel weights
+        reduce_range=True,       # recommended on non-VNNI CPUs (the WASM target)
+        use_external_data_format=False,  # int8 encoder is well under 2 GB, fits a single file
+        calibration_sampling_size=len(feats),
+        execution_provider="CPUExecutionProvider",
+        extra_options={
+            "SmoothQuant": True,
+            "SmoothQuantAlpha": args.alpha,
+            "SmoothQuantFolding": True,
+        },
+    )
+    print(f"[sq] SmoothQuant(alpha={args.alpha}) static int8, per-channel, "
+          f"calib={args.calibrate_method}, ops={op_types}, format={args.quant_format} ...")
+    print(f"[sq]   {human(os.path.getsize(in_encoder) + os.path.getsize(str(in_encoder) + '.data'))} fp32 encoder")
+    t0 = time.time()
+    # ORT_DISABLE_ALL skips neural-compressor's pre-optimization InferenceSession
+    # (which has a `provides=` kwarg typo that crashes on this version) and avoids
+    # re-serializing the 2.4 GB fp32 graph.
+    quantize(str(in_encoder), str(out_encoder), cfg,
+             optimization_level=ort.GraphOptimizationLevel.ORT_DISABLE_ALL)
+    dt = time.time() - t0
+    if _SKIPPED["count"]:
+        print(f"[sq] note: {_SKIPPED['count']} node(s) had a layout SmoothQuant could not "
+              f"resolve and were left as plain static int8 (everything else was smoothed)")
+    # neural-compressor always writes the quantized weights to an external
+    # `<name>_data` sidecar for a model this size, ignoring use_external_data_format.
+    # The quantized encoder is well under the 2 GB single-protobuf cap (about
+    # 840 MB for the default MatMul-only export), so fold the weights back into
+    # ONE self-contained .onnx (matching the shipped single-file int8 and
+    # keeping the candidate symlink dir trivial).
+    sidecar = str(out_encoder) + "_data"
+    if os.path.exists(sidecar):
+        merged = onnx.load(str(out_encoder), load_external_data=True)
+        onnx.save(merged, str(out_encoder), save_as_external_data=False)
+        os.remove(sidecar)
+    out_size = os.path.getsize(out_encoder)
+    baseline = model_dir / "encoder-model.int8.onnx"
+    base_note = f" (istupakov int8 is {human(os.path.getsize(baseline))})" if baseline.exists() else ""
+    print(f"[sq] done in {dt:.0f}s -> {out_encoder.name} {human(out_size)}{base_note}")
+    # Fidelity smoke test (NOT just shape): run one calibration window through both
+    # the fp32 reference and the new int8 encoder and compare the encoder outputs by
+    # cosine similarity. A shape-only check let a fully collapsed encoder (empty
+    # transcript everywhere) pass silently once; this catches that in ~30 s instead
+    # of after a multi-minute WER run. A healthy MatMul-only export measured ~0.96
+    # cosine; a true collapse lands far lower (see --fidelity-warn).
+    try:
+        inp = {"audio_signal": feats[0]["audio_signal"], "length": feats[0]["length"]}
+        s_q = ort.InferenceSession(str(out_encoder), providers=["CPUExecutionProvider"])
+        out_q = s_q.run(None, inp)[0].astype(np.float64).ravel()
+        s_f = ort.InferenceSession(str(in_encoder), providers=["CPUExecutionProvider"])
+        out_f = s_f.run(None, inp)[0].astype(np.float64).ravel()
+        denom = (np.linalg.norm(out_q) * np.linalg.norm(out_f)) or 1.0
+        cos = float(np.dot(out_q, out_f) / denom)
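+        # A NaN/inf cosine (e.g. NaNs in the int8 output) must count as a failure:
+        # a bare `cos < floor` comparison is False for NaN and would report "healthy".
+        if not np.isfinite(cos) or cos < args.fidelity_warn: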
+            print(f"[sq] WARNING: encoder-output cosine vs fp32 is {cos:.4f} "
+                  f"(< {args.fidelity_warn}). This export likely COLLAPSED; expect a near-100% "
+                  f"WER. Try a different --calibrate-method/--alpha or keep more ops fp32.",
+                  file=sys.stderr)
+        else:
+            print(f"[sq] fidelity: encoder-output cosine vs fp32 = {cos:.4f} (>= "
+                  f"{args.fidelity_warn}). Looks healthy.")
+    except Exception as e:
+        print(f"[sq] WARNING: exported encoder failed the fidelity smoke test: {e}", file=sys.stderr)
+    if not args.no_candidate:
+        cand = build_candidate_dir(model_dir, out_encoder, args.candidate_dir)
+        print(f"[sq] candidate model dir (for wer-quants): {cand}")
+    rel_model = os.path.relpath(model_dir, ROOT)
+    print("\nCompare per-section degradation vs fp16:")
+    # Only suggest the candidate-dir run when that dir was actually built.
+    if not args.no_candidate:
+        rel_cand = os.path.relpath(args.candidate_dir, ROOT)
+        print(f"  uv run scripts/wer-quants.py --model-dir {rel_cand} --quants int8")
+    print(f"  uv run scripts/wer-quants.py --model-dir {rel_model} --quants int8,fp16")
+    print("A new-int8 per-section WER that tracks fp16 (instead of climbing) is the win.")
+if __name__ == "__main__":
+    main()
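
Note on `--alpha`: the help text above describes SmoothQuant as migrating quantization difficulty from the activations to the weights. A minimal pure-Python sketch of that idea follows, using the per-channel scale from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). It is an illustration only, not the code path the quantizer runs when `SmoothQuant` is enabled; `matmul`, `col_absmax`, `row_absmax`, `smooth_scales`, `apply_smoothing`, the `eps` clamp, and the toy matrices are all hypothetical names and data chosen for the example.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product: (rows of a) x (columns of b)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def col_absmax(x):
    """Largest magnitude seen in each activation channel (column of x)."""
    return [max(abs(row[j]) for row in x) for j in range(len(x[0]))]


def row_absmax(w):
    """Largest magnitude in each weight input channel (row of w)."""
    return [max(abs(v) for v in row) for row in w]


def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    eps is an assumed guard against all-zero channels, not part of the formula.
    """
    return [max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]


def apply_smoothing(x, w, s):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    x_s = [[row[j] / s[j] for j in range(len(s))] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(s))]
    return x_s, w_s


if __name__ == "__main__":
    # Toy layer: 3 input channels, 2 outputs. Channel 1 carries an activation outlier.
    X = [[0.5, 40.0, -1.0],
         [-0.8, -64.0, 0.25]]
    W = [[0.5, -0.25],
         [0.125, 0.0625],
         [-1.0, 0.5]]

    s = smooth_scales(col_absmax(X), row_absmax(W), alpha=0.5)
    X_s, W_s = apply_smoothing(X, W, s)

    print("activation absmax before:", col_absmax(X))
    print("activation absmax after: ", col_absmax(X_s))
    print("weight absmax before:    ", row_absmax(W))
    print("weight absmax after:     ", row_absmax(W_s))

    # The layer output is the same up to floating-point rounding.
    same = all(math.isclose(p, q, rel_tol=1e-9)
               for rp, rq in zip(matmul(X, W), matmul(X_s, W_s))
               for p, q in zip(rp, rq))
    print("output preserved:", same)
```

Because each activation channel is divided by the same factor its weight row is multiplied by, the product X·W is unchanged up to rounding, while the outlier channel's activation range shrinks and its weight range grows. At alpha = 0.5 the two ranges meet in the middle; a higher alpha pushes more of the range onto the weights, which is why the help text recommends it for large activation outliers.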

vocab.txt ADDED

The diff for this file is too large to render.