thiswillbeyourgithub committed on
Commit
d48fd91
·
0 Parent(s):

Parakeet TDT 0.6B v3 (Multilingual) ONNX with a SmoothQuant int8 encoder

Drop-in replacement for istupakov/parakeet-tdt-0.6b-v3-onnx whose int8
encoder is rebuilt with SmoothQuant (MatMul-only static per-channel int8,
Percentile calibration; convolutions kept fp32) so it no longer degrades
on long audio: WER 10.89% overall vs the stock int8's 40.40% and fp16's
10.17% on a 390s single pass. Heavier than the stock int8 (842 vs 622 MB)
but tracks fp16 while using about half its RAM. Also ships the fp16
encoder (absent upstream). Includes the export scripts for provenance.

Built with Claude Code.
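
For readers unfamiliar with the metric quoted in the commit message, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not the scoring code used by `scripts/wer-quants.py`; the function name and the lowercase/whitespace normalisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # ASSUMPTION: lowercase + whitespace split; real harnesses often normalise more.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a free match
            )
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "we choose to go to the moon"
    hyp = "we chose to go to moon"
    print(f"WER = {word_error_rate(ref, hyp):.1%}")
```

Because the denominator is the reference length, a hypothesis with many spurious insertions can score above 100%.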

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ encoder-model.onnx.data filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,170 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ - es
+ - fr
+ - de
+ - bg
+ - hr
+ - cs
+ - da
+ - nl
+ - et
+ - fi
+ - el
+ - hu
+ - it
+ - lv
+ - lt
+ - mt
+ - pl
+ - pt
+ - ro
+ - sk
+ - sl
+ - sv
+ - ru
+ - uk
+ base_model:
+ - nvidia/parakeet-tdt-0.6b-v3
+ - istupakov/parakeet-tdt-0.6b-v3-onnx
+ pipeline_tag: automatic-speech-recognition
+ tags:
+ - automatic-speech-recognition
+ - asr
+ - onnx
+ - onnx-asr
+ - smoothquant
+ - quantization
+ ---
+
+ # Parakeet TDT 0.6B v3 (Multilingual), ONNX with a SmoothQuant int8 encoder
+
+ This is [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
+ with **one change**: the int8 encoder (`encoder-model.int8.onnx`) is rebuilt with
+ [SmoothQuant](https://github.com/onnx/neural-compressor) so it no longer loses
+ accuracy on long audio. Everything else (the fp32 encoder, the decoder, the
+ preprocessor and the tokenizer) is unchanged, so this repo is a drop-in
+ replacement for the original: point your loader at it and the better int8
+ is picked up automatically by its canonical name.
+
+ It also ships the **fp16** encoder (which the upstream istupakov repo does not),
+ so all three precisions are available here in one place.
+
+ ## Why this exists
+
+ The stock int8 encoder transcribes short clips fine, but its accuracy degrades
+ badly once a single pass runs past roughly 20 to 30 seconds. The fp16 and fp32
+ encoders do **not** show this, so the cause is not the model architecture but an
+ int8 *numerics* problem. The stock int8 uses fully **dynamic, per-tensor** activation
+ quantization (one runtime scale for an entire activation tensor). Once a longer
+ sequence widens the activation distribution, that single scale can no longer
+ represent it and the transcript falls apart.
+
+ SmoothQuant targets exactly this failure mode: it migrates the per-channel
+ activation outliers into the weights (a folded multiply), then statically
+ quantizes activations together with **per-channel** weights. The smoothed,
+ per-channel int8 encoder holds up over long audio instead of collapsing.
+
+ Background and discussion:
+ [Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354).
+
+ ## Results
+
+ Benchmark: a single ~390 second pass of a JFK speech clip (no chunking), scored
+ **per 60 second section** against the fp32 encoder as the oracle (each section is
+ also transcribed independently as a short clip, which the encoders all handle
+ well, and that short-clip transcription is the reference). A WER that climbs as
+ you go down the table is the long-audio degradation. Run with `scripts/wer-quants.py`
+ from the project repository (TODO: add the GitHub URL of the parakeet_web project
+ here); the export and comparison are fully reproducible with the scripts included
+ in this repo.
+
+ Overall (single 390 s pass, lower WER is better):
+
+ | encoder precision           | encoder size | overall WER | peak RAM |
+ | --------------------------- | ------------ | ----------- | -------- |
+ | stock int8 (istupakov)      | 622 MB       | 40.40%      | ~5.0 GB  |
+ | **SmoothQuant int8 (this)** | **842 MB**   | **10.89%**  | ~4.9 GB  |
+ | fp16                        | ~1.2 GB      | 10.17%      | ~9.5 GB  |
+
+ Per-section WER:
+
+ | section      | stock int8 | **SmoothQuant int8** | fp16  |
+ | ------------ | ---------- | -------------------- | ----- |
+ | 0 to 60 s    | 41.4%      | **2.6%**             | 2.6%  |
+ | 60 to 120 s  | 29.2%      | **3.5%**             | 5.3%  |
+ | 120 to 180 s | 39.1%      | **5.5%**             | 3.9%  |
+ | 180 to 240 s | 28.2%      | **3.4%**             | 3.4%  |
+ | 240 to 300 s | 69.5%      | **45.1%**            | 45.1% |
+ | 300 to 360 s | 46.8%      | **25.5%**            | 23.4% |
+ | 360 to 390 s | 37.5%      | **8.3%**             | 4.2%  |
+
+ The SmoothQuant int8 **tracks fp16 closely** (10.89% overall vs fp16's
+ 10.17%, a 0.7 point gap), and its WER is about 3.7x lower than the stock int8's
+ 40.40%. The 240 to 360 s sections are elevated for fp16 too, so the high WER
+ there comes from the audio or the oracle reference for those windows, not from
+ quantization: the SmoothQuant int8 stays within about 2 points of fp16 there
+ while the stock int8 blows up to 69.5%.
+
+ ### Trade-off: heavier than the stock int8, much more accurate
+
+ This int8 encoder is **842 MB versus the stock 622 MB**. That is deliberate: only
+ the MatMul ops are quantized, and the convolutional subsampling front-end is kept
+ in fp32 (statically quantizing it collapsed the encoder to an empty transcript).
+ The extra size buys long-audio accuracy that tracks fp16. It still uses about
+ **half the RAM of fp16** (~4.9 GB versus ~9.5 GB), which is the point: if you can
+ run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8
+ matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8
+ is the only precision that both fits and runs.
+
+ ## Files
+
+ | file                             | what it is                                                |
+ | -------------------------------- | --------------------------------------------------------- |
+ | `encoder-model.onnx` (+ `.data`) | fp32 encoder (unchanged from istupakov)                   |
+ | `encoder-model.fp16.onnx`        | fp16 encoder (not shipped by the upstream istupakov repo) |
+ | `encoder-model.int8.onnx`        | **SmoothQuant int8 encoder (the reason for this repo)**   |
+ | `decoder_joint-model.onnx`       | fp32 decoder / joint network (unchanged)                  |
+ | `decoder_joint-model.fp16.onnx`  | fp16 decoder / joint network (unchanged)                  |
+ | `decoder_joint-model.int8.onnx`  | int8 decoder / joint network (unchanged)                  |
+ | `nemo128.onnx`                   | 128-bin mel preprocessor (unchanged)                      |
+ | `vocab.txt`, `config.json`       | tokenizer and model config (unchanged)                    |
+ | `quantize-int8-smoothquant.py`   | script that produced the SmoothQuant int8 encoder         |
+ | `quantize-fp16.py`               | script that produced the fp16 encoder                     |
+
+ ## How it was built
+
+ - `encoder-model.int8.onnx`: `quantize-int8-smoothquant.py`. SmoothQuant +
+ static per-channel int8, MatMul ops only (convolutions stay fp32), with
+ Percentile activation calibration. Calibration needs no labelled or long
+ dataset: it auto-discovers local speech and slices it into long 30 s windows,
+ and the fp32 encoder is used as the accuracy oracle. The script ends with a
+ cosine-similarity fidelity check of the new encoder's output against fp32.
+ - `encoder-model.fp16.onnx`: `quantize-fp16.py`, a straight fp16 cast of the
+ fp32 encoder pieces.
+
+ Both scripts are self-contained (they declare their own dependencies via a
+ [PEP 723](https://peps.python.org/pep-0723/) header and run with `uv run`) and
+ are included here for provenance and reproducibility. They reference fixtures
+ from the parakeet_web project repository, so to re-run them clone that project
+ (TODO: add the GitHub URL here) and run them from there.
+
+ ## Sources and credits
+
+ - ONNX base model this repo is built on:
+ [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx)
+ - Original model:
+ [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
+ - SmoothQuant implementation:
+ [onnx/neural-compressor](https://github.com/onnx/neural-compressor)
+ - Loaded with [onnx-asr](https://github.com/istupakov/onnx-asr)
+ - Discussion:
+ [Kieirra/murmure#289 (comment)](https://github.com/Kieirra/murmure/issues/289#issuecomment-4621249354)
+ - This repository (model export, benchmarking and documentation) was produced
+ with [Claude Code](https://claude.com/claude-code).
+
+ ## License
+
+ `cc-by-4.0`, inherited from the upstream istupakov ONNX model and the original
+ NVIDIA Parakeet TDT 0.6B v3.
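
The README's "Why this exists" section describes SmoothQuant as migrating per-channel activation outliers into the weights through a folded multiply. The sketch below illustrates that idea on a toy MatMul in plain Python. It uses the per-channel scale from the SmoothQuant paper, `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`, and is not the onnx-neural-compressor implementation; the helper names and the toy numbers are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Rescale x (tokens x in_channels) and w (in_channels x out_channels).

    Channel j of x is divided by s_j and row j of w is multiplied by s_j,
    so the product x @ w is mathematically unchanged.
    ASSUMPTION: no channel is all-zero (that would divide by zero here).
    """
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    scales = [(a ** alpha) / (m ** (1 - alpha)) for a, m in zip(act_max, w_max)]
    x_smoothed = [[v / s for v, s in zip(row, scales)] for row in x]
    w_smoothed = [[v * s for v in row] for row, s in zip(w, scales)]
    return x_smoothed, w_smoothed, scales


if __name__ == "__main__":
    # Toy activations with one outlier channel (the middle column).
    x = [[0.5, 40.0, -0.2], [-0.3, -60.0, 0.4]]
    w = [[0.2, -0.1], [0.05, 0.1], [-0.3, 0.2]]
    x_s, w_s, scales = smooth(x, w)
    print("largest |activation| before:", max(abs(v) for row in x for v in row))
    print("largest |activation| after: ", max(abs(v) for row in x_s for v in row))
    print("product unchanged:", matmul(x, w), matmul(x_s, w_s))
```

After smoothing, the activation channels span a much narrower range, which is what lets a static int8 scale represent them; the weights absorb the difference, which is why they are then quantized per channel.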
config.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "model_type": "nemo-conformer-tdt",
+ "features_size": 128,
+ "subsampling_factor": 8
+ }
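
As a rough illustration of what these two config values mean for sequence length, the sketch below estimates how many encoder frames a single pass produces. `SAMPLE_RATE` matches the constant in `quantize-int8-smoothquant.py`, but the 10 ms mel hop (`HOP_SAMPLES`) is an assumption that this commit does not state, and the exact counts depend on padding in the real preprocessor.

```python
SAMPLE_RATE = 16_000      # from quantize-int8-smoothquant.py
FEATURES_SIZE = 128       # from config.json: mel bins per frame
SUBSAMPLING_FACTOR = 8    # from config.json: mel frames per encoder frame
HOP_SAMPLES = 160         # ASSUMPTION: 10 ms mel hop, not stated in this commit


def encoder_frames(seconds: float) -> int:
    """Approximate encoder sequence length for a clip (ignores edge padding)."""
    mel_frames = int(seconds * SAMPLE_RATE) // HOP_SAMPLES
    return mel_frames // SUBSAMPLING_FACTOR


if __name__ == "__main__":
    for seconds in (20, 30, 60, 390):
        print(f"{seconds:>4} s -> ~{encoder_frames(seconds)} encoder frames "
              f"of {FEATURES_SIZE}-bin mel input")
```

Under these assumptions a 30 s calibration window is about 375 encoder frames and the 390 s benchmark pass about 4875, which gives a sense of how much longer the benchmark sequence is than a short clip.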
decoder_joint-model.fp16.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1832e60f96f6e7725ceeab5c346c84484c9ac55e12b3e8b2f4296e1710d02b2e
+ size 36264822
decoder_joint-model.int8.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eea7483ee3d1a30375daedc8ed83e3960c91b098812127a0d99d1c8977667a70
+ size 18202004
decoder_joint-model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e978ddf6688527182c10fde2eb4b83068421648985ef23f7a86be732be8706c1
+ size 72520893
encoder-model.fp16.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3fa398ad252bdeaa714cbc67d3add0a0e28f15bcd8bce2e4d0ee1eb0d4351b36
+ size 1238960362
encoder-model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98a74b21b4cc0017c1e7030319a4a96f4a9506e50f0708f3a516d02a77c96bb1
+ size 41770866
encoder-model.onnx.data ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a22d372c51455c34f13405da2520baefb7125bd16981397561423ed32d24f36
+ size 2435420160
nemo128.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a9fde1486ebfcc08f328d75ad4610c67835fea58c73ba57e3209a6f6cf019e9f
+ size 139764
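
The ONNX files above are committed as Git LFS pointers, so the diff shows only a version line, a SHA-256 object id and a byte size. A minimal sketch of reading those three fields follows; the function name is hypothetical, and the sizes are copied from the pointers in this commit. It also checks the fp16-to-fp32 encoder size ratio implied by those sizes.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its version, hash and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


FP16_ENCODER_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3fa398ad252bdeaa714cbc67d3add0a0e28f15bcd8bce2e4d0ee1eb0d4351b36
size 1238960362
"""

# Byte sizes copied from the encoder-model.onnx and encoder-model.onnx.data pointers.
FP32_ENCODER_BYTES = 41_770_866 + 2_435_420_160

if __name__ == "__main__":
    pointer = parse_lfs_pointer(FP16_ENCODER_POINTER)
    ratio = pointer["size"] / FP32_ENCODER_BYTES
    print(f"fp16 encoder: {pointer['size'] / 1e9:.2f} GB, "
          f"{ratio:.1%} of the fp32 encoder")
```

The ratio comes out at almost exactly one half, consistent with the fp16 script's description of halving the fp32 weights.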
quantize-fp16.py ADDED
@@ -0,0 +1,185 @@
+ #!/usr/bin/env python
+ """Convert the fp32 Parakeet ONNX pieces to float16, to land under the WASM /
+ Chromium ~2 GB blob limits without the heavy accuracy loss of int8.
+
+ Why fp16 (see CLAUDE.md for the full reasoning): the fp32 encoder is ~2.44 GB
+ of external weights, which cannot load on the WASM backend (32-bit WASM caps a
+ single ArrayBuffer at ~2 GB and Chromium's blob-URL fetch caps around 2 GB
+ too). int8 (~600 MB) fits but degrades quality. fp16 halves the fp32 weights to
+ ~1.2 GB: under both caps, and near-lossless versus fp32. This script produces
+ that fp16 variant from locally-supplied fp32 files so it can be benchmarked
+ (scripts/wer-bench.mjs) before deciding whether to ship it.
+
+ It converts the two pieces that matter:
+     - encoder-model.onnx (+ encoder-model.onnx.data) -> encoder-model.fp16.onnx
+     - decoder_joint-model.onnx -> decoder_joint-model.fp16.onnx
+ nemo128.onnx (the ONNX preprocessor) is intentionally skipped: the web app and
+ scripts/transcribe.mjs use the pure-JS mel preprocessor (mel.js), so the ONNX
+ preprocessor is never loaded.
+
+ Useful reference: https://huggingface.co/grikdotnet/parakeet-tdt-0.6b-fp16
+ documents the same conversion (same pieces, same keep_io_types=True /
+ disable_shape_infer=True settings). It uses onnxconverter_common.float16 plus a
+ separate post-processing pass to rewrite leftover internal Cast(to=FLOAT) nodes
+ to Cast(to=FLOAT16). We instead use onnxruntime.transformers.float16, the
+ evolved fork of that same converter, which handles those internal casts itself,
+ so no separate cast-fixing pass is needed (a topological_sort below is enough).
+
+ keep_io_types=True is deliberate and load-bearing: the encoder/decoder graphs
+ take and return float32 tensors (audio_signal, outputs, encoder_outputs, and
+ the decoder's LSTM input_states_*/output_states_*). Keeping the I/O boundary at
+ float32 means the JS pipeline (parakeet.js) feeds and reads exactly the same
+ dtypes as for the fp32/int8 models, so NOTHING in the JS side needs to change;
+ only the weights and internal compute become fp16.
+
+ Usage:
+     python scripts/quantize-fp16.py # ./fallback_models in place
+     python scripts/quantize-fp16.py --model-dir DIR --out-dir DIR
+     python scripts/quantize-fp16.py --external-data # force .onnx.data sidecar
+
+ Requires: onnx, onnxruntime (provides onnxruntime.transformers.float16).
+
+ Built with Claude Code.
+ """
+
+ import argparse
+ import os
+ import sys
+ import time
+
+ import onnx
+ from onnxruntime.transformers.float16 import convert_float_to_float16
+ from onnxruntime.transformers.onnx_model import OnnxModel
+
+ # (input fp32 file, output fp16 file). Only the encoder carries external weights.
+ PIECES = [
+     ("encoder-model.onnx", "encoder-model.fp16.onnx"),
+     ("decoder_joint-model.onnx", "decoder_joint-model.fp16.onnx"),
+ ]
+
+ # Single-protobuf serialisation hard-caps at 2 GB. The fp16 encoder is ~1.2 GB
+ # so an inline save normally fits, but we keep a margin and fall back to an
+ # external-data sidecar (which scripts/transcribe.mjs createSession() already
+ # resolves via the "<model>.data" probe) if we get close.
+ TWO_GB = 2 * 1024 ** 3
+
+
+ def human(n):
+     for unit in ("B", "KB", "MB", "GB"):
+         if n < 1024 or unit == "GB":
+             return f"{n:.1f} {unit}"
+         n /= 1024
+
+
+ def file_size(path):
+     total = os.path.getsize(path)
+     data = path + ".data"
+     if os.path.exists(data):
+         total += os.path.getsize(data)
+     return total
+
+
+ def convert_one(in_path, out_path, force_external, op_block_list):
+     if not os.path.exists(in_path):
+         raise FileNotFoundError(f"missing input model: {in_path}")
+
+     in_size = file_size(in_path)
+     print(f"[fp16] {os.path.basename(in_path)} ({human(in_size)}) -> "
+           f"{os.path.basename(out_path)}")
+
+     # load_external_data=True (default) pulls the sibling .onnx.data into memory
+     # so the converter sees real tensors. This needs ~the fp32 model's size in
+     # RAM for the encoder (~2.4 GB); that is the price of an in-memory convert.
+     t0 = time.time()
+     model = onnx.load(in_path, load_external_data=True)
+
+     # disable_shape_infer=True: onnx shape inference serialises the model to run,
+     # which would hit the 2 GB protobuf limit on the fp32 encoder. keep_io_types
+     # pins the float32 boundary so the converter still inserts the right casts.
+     fp16_model = convert_float_to_float16(
+         model,
+         keep_io_types=True,
+         disable_shape_infer=True,
+         op_block_list=op_block_list if op_block_list else None,
+     )
+
+     # keep_io_types=True prepends graph_input_cast_* / appends graph_output_cast_*
+     # nodes but does NOT re-sort the graph, leaving it not topologically sorted.
+     # onnx.checker rejects that and ORT-web fails to build the session (it
+     # surfaced as a std::bad_alloc). A topological sort fixes the node order.
+     OnnxModel(fp16_model).topological_sort()
+     convert_s = time.time() - t0
+
+     # Estimate serialized size to choose inline vs external. ByteSize() is exact
+     # but can itself overflow near 2 GB, so guard it.
+     try:
+         approx = fp16_model.ByteSize()
+         big = approx >= TWO_GB - (64 * 1024 ** 2)  # 64 MB safety margin
+     except (ValueError, OverflowError):
+         big = True
+
+     use_external = force_external or big
+
+     # A stale sidecar from a previous run would be silently reused by ORT, so
+     # clear it when we are NOT writing external data this time.
+     sidecar = out_path + ".data"
+     if not use_external and os.path.exists(sidecar):
+         os.remove(sidecar)
+
+     if use_external:
+         onnx.save(
+             fp16_model, out_path,
+             save_as_external_data=True,
+             all_tensors_to_one_file=True,
+             location=os.path.basename(sidecar),
+             convert_attribute=False,
+         )
+     else:
+         onnx.save(fp16_model, out_path)
+
+     out_size = file_size(out_path)
+     print(f" converted in {convert_s:.1f}s, "
+           f"{'external' if use_external else 'inline'} -> {human(out_size)} "
+           f"({100 * out_size / in_size:.0f}% of fp32)")
+     return in_size, out_size
+
+
+ def main():
+     ap = argparse.ArgumentParser(description=__doc__,
+                                  formatter_class=argparse.RawDescriptionHelpFormatter)
+     ap.add_argument("--model-dir", default="fallback_models",
+                     help="directory holding the fp32 .onnx files (default: fallback_models)")
+     ap.add_argument("--out-dir", default=None,
+                     help="output directory (default: same as --model-dir)")
+     ap.add_argument("--external-data", action="store_true",
+                     help="always write weights to a .onnx.data sidecar (default: inline when it fits under 2 GB)")
+     ap.add_argument("--op-block-list", default="",
+                     help="comma-separated ONNX op types to keep in fp32 (default: the converter's built-in list)")
+     args = ap.parse_args()
+
+     model_dir = args.model_dir
+     out_dir = args.out_dir or model_dir
+     os.makedirs(out_dir, exist_ok=True)
+     op_block_list = [s.strip() for s in args.op_block_list.split(",") if s.strip()]
+
+     total_in = total_out = 0
+     for in_name, out_name in PIECES:
+         in_size, out_size = convert_one(
+             os.path.join(model_dir, in_name),
+             os.path.join(out_dir, out_name),
+             args.external_data,
+             op_block_list,
+         )
+         total_in += in_size
+         total_out += out_size
+
+     print(f"[fp16] done: {human(total_in)} fp32 -> {human(total_out)} fp16 "
+           f"({100 * total_out / total_in:.0f}%). vocab.txt is reused as-is.")
+     enc_out = file_size(os.path.join(out_dir, "encoder-model.fp16.onnx"))
+     if enc_out >= TWO_GB:
+         print(f"[fp16] WARNING: fp16 encoder is {human(enc_out)}, still >= 2 GB; "
+               f"it will NOT load on the WASM backend.", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
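
The comment in `convert_one` notes that the cast nodes added by `keep_io_types=True` can leave the graph out of topological order, which a sort then repairs. To make that concrete, here is a minimal sketch of Kahn's algorithm on a toy node list. It is an illustration of the general technique, not the implementation behind `OnnxModel.topological_sort()`, and the `(name, inputs, outputs)` tuples and tensor names are hypothetical.

```python
from collections import deque


def topological_sort(nodes):
    """Order (name, inputs, outputs) tuples so producers precede consumers.

    Tensors that no node produces are treated as graph inputs or initializers.
    """
    producer = {out: name for name, _, outs in nodes for out in outs}
    outputs_of = {name: outs for name, _, outs in nodes}
    # Number of inputs per node that still wait on another node's output.
    pending = {name: sum(1 for i in ins if i in producer) for name, ins, _ in nodes}
    consumers = {}
    for name, ins, _ in nodes:
        for tensor in ins:
            if tensor in producer:
                consumers.setdefault(tensor, []).append(name)

    ready = deque(name for name, _, _ in nodes if pending[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for tensor in outputs_of[name]:
            for consumer in consumers.get(tensor, []):
                pending[consumer] -= 1
                if pending[consumer] == 0:
                    ready.append(consumer)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


if __name__ == "__main__":
    # The input cast is listed AFTER the MatMul that consumes its output.
    toy_graph = [
        ("matmul", ["x_fp16", "w"], ["y_fp16"]),
        ("graph_output_cast_0", ["y_fp16"], ["y"]),
        ("graph_input_cast_0", ["x"], ["x_fp16"]),
    ]
    print(topological_sort(toy_graph))
```

In the toy graph the sort moves `graph_input_cast_0` ahead of the MatMul that reads `x_fp16`, which is the kind of reordering the script relies on before saving the model.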
quantize-int8-smoothquant.py ADDED
@@ -0,0 +1,416 @@
+ #!/usr/bin/env -S uv run --script
+ # /// script
+ # requires-python = ">=3.11"
+ # dependencies = [
+ #     "onnx",
+ #     "onnxruntime",
+ #     "onnx-neural-compressor",
+ #     "numpy",
+ #     "sympy",
+ #     "prettytable",
+ #     "psutil",
+ #     "scipy",
+ # ]
+ # ///
+ """Export a *better* int8 Parakeet encoder using SmoothQuant static quantization.
+
+ Why this exists (see CLAUDE.md / ARCHITECTURE.md for the full story): the int8
+ encoder we currently ship (istupakov's) silently loses long-range information
+ past ~20 s within a single chunk, so the WASM backend is pinned to a 20 s chunk
+ window while fp16/fp32 happily run 60 s. Crucially the model architecture is NOT
+ the problem (fp16 holds flat at long windows); it is an int8 *numerics* problem:
+ a single per-tensor activation scale copes badly once a longer sequence widens
+ the activation distribution. That is exactly the regime SmoothQuant targets: it
+ migrates the per-channel activation outliers into the weights (a folded Mul),
+ then static-quantizes activations + per-channel weights. The bet is that a
+ SmoothQuant + per-channel int8 encoder degrades far less over a long chunk,
+ which would let WASM use the full 60 s window.
+
+ This produces ONLY the encoder int8 (`encoder-model.int8.smoothquant.onnx`). The
+ decoder is tiny and is not where the long-range loss lives, so we deliberately
+ reuse istupakov's existing `decoder_joint-model.int8.onnx`; that isolates the
+ comparison to the encoder change.
+
+ Calibration data, with ZERO digging required: SmoothQuant needs representative
+ *activations*, not labels, so any speech works. We auto-discover whatever audio
+ is already in the tree (the committed FLEURS fixture is always present) and slice
+ it into deliberately LONG windows (default 30 s) so the smoothing scales are
+ computed over the very long-range distribution we are trying to fix. The encoder
+ takes mel features, not raw audio, so each window is first run through the
+ committed `nemo128.onnx` preprocessor (raw waveform -> 128-bin mel features) and
+ those features are fed to the encoder, exactly as the real pipeline does.
+
+ After export, compare against fp16 with the existing per-section harness:
+
+     # the NEW SmoothQuant int8 (served from the symlinked candidate dir):
+     uv run scripts/wer-quants.py --model-dir fallback_models_sq --quants int8
+     # the OLD istupakov int8 + the fp16 reference, for the baseline:
+     uv run scripts/wer-quants.py --model-dir fallback_models --quants int8,fp16
+
+ Both use the same fp32 oracle reference, so a per-section WER that rises less
+ steeply for the new int8 (closer to fp16) is the win we are after. This script
+ prints those two commands at the end and, unless --no-candidate is passed, builds
+ the `fallback_models_sq` symlink farm they need.
+
+ By default only MatMul ops are quantized (the conv subsampling front-end stays
+ fp32: it is quant-fragile and collapsed the encoder when quantized) and
+ activations are calibrated with the Percentile method (MinMax let a single
+ long-tail outlier crush the scale). A post-export fidelity check compares the new
+ encoder's output to the fp32 encoder by cosine similarity and warns loudly on a
+ likely collapse, instead of only checking output shape.
+
+ Usage:
+     uv run scripts/quantize-int8-smoothquant.py # auto everything
+     uv run scripts/quantize-int8-smoothquant.py --alpha 0.6 # more weight-side migration
+     uv run scripts/quantize-int8-smoothquant.py --num-windows 32 --window-sec 30
+     uv run scripts/quantize-int8-smoothquant.py --audio a.mp3 --audio b.wav
+     uv run scripts/quantize-int8-smoothquant.py --op-types MatMul,Conv # also quantize convs
+     uv run scripts/quantize-int8-smoothquant.py --calibrate-method entropy
+     uv run scripts/quantize-int8-smoothquant.py --quant-format qdq
+
+ Built with Claude Code.
+ """
+
+ import argparse
+ import os
+ import shutil
+ import subprocess
+ import sys
+ import time
+ from pathlib import Path
+
+ import numpy as np
+ import onnx
+ import onnxruntime as ort
+ from onnxruntime.quantization import CalibrationMethod, QuantFormat, QuantType
+ from onnx_neural_compressor import data_reader
+ from onnx_neural_compressor.quantization import config, quantize
+ from onnx_neural_compressor.algorithms.smoother import core as _sq_core
+
+ ROOT = Path(__file__).resolve().parent.parent
+
+
+ # --- FastConformer compatibility shim for onnx-neural-compressor's SmoothQuant -
+ # The library's smoother hard-assumes a 3D activation is (batch, seq, in_channel)
+ # with the in-channel LAST (there is a literal TODO admitting this in
+ # Calibrator._get_max_per_channel). That holds for BERT-style graphs but NOT for
+ # a few FastConformer MatMuls (the relative-position attention projections, where
+ # the weight is the first operand and the activation contracts over the sequence
+ # axis). For those, the per-channel activation max is taken over the wrong axis
+ # and no longer matches the weight's in-channel length, so _get_smooth_scale dies
+ # broadcasting e.g. (101,) against (2048,).
+ #
+ # These two wrappers make the smoother SKIP exactly those unresolvable nodes
+ # (return None -> stripped before any Mul is inserted) instead of crashing. All
+ # the well-behaved linears (FFN, standard projections, the bulk of the weights)
+ # are still smoothed; the skipped handful simply fall through to plain static
+ # int8. _insert_smooth_mul_op iterates scales.keys() and _adjust_weights guards
+ # with `if key not in scales`, so omitting a node is safe. NOTE: this monkeypatch
+ # reaches into library internals and may need revisiting on a neural-compressor
+ # upgrade; it is contained to this experimental export script.
+ _SKIPPED = {"count": 0}
+ _orig_get_smooth_scale = _sq_core.Smoother._get_smooth_scale
+ _orig_get_smooth_scales = _sq_core.Smoother._get_smooth_scales
+
+
+ def _safe_get_smooth_scale(self, weights, specific_alpha, tensor):
+     weights_max = np.amax(np.abs(weights.reshape(weights.shape[0], -1)), axis=-1)
+     if self.max_vals_per_channel[tensor].shape != weights_max.shape:
+         _SKIPPED["count"] += 1
+         return None  # layout the per-channel logic can't resolve: don't smooth it
+     return _orig_get_smooth_scale(self, weights, specific_alpha, tensor)
+
+
+ def _safe_get_smooth_scales(self, alpha, target_list=[]):
+     scales = _orig_get_smooth_scales(self, alpha, target_list)
+     return {k: v for k, v in scales.items() if v is not None}
+
+
+ _sq_core.Smoother._get_smooth_scale = _safe_get_smooth_scale
+ _sq_core.Smoother._get_smooth_scales = _safe_get_smooth_scales
+
+ # Audio we can use for calibration with no user input. The first entry is the
+ # committed FLEURS fixture (always present); the others are picked up only if
+ # they happen to exist locally (the gitignored moon-speech cache is a long,
+ # single-speaker bonus that strengthens the long-range calibration).
+ DEFAULT_CALIB_AUDIO = [
+     ROOT / "test/fixtures/fleurs/stitched.mp3",
+     ROOT / "test/e2e/.cache/jfk-moon/full.mp3",
+     ROOT / "venlaf.aac",
+ ]
+
+ SAMPLE_RATE = 16000
+
+
+ def human(n):
+     for unit in ("B", "KB", "MB", "GB"):
+         if n < 1024 or unit == "GB":
+             return f"{n:.1f} {unit}"
+         n /= 1024
+
+
+ def find_ffmpeg(explicit=None):
+     cand = explicit or os.environ.get("FFMPEG") or shutil.which("ffmpeg")
+     if not cand or not shutil.which(cand) and not os.path.exists(cand):
+         sys.exit("ffmpeg not found (set $FFMPEG or pass --ffmpeg).")
+     return cand
+
+
+ def decode_pcm(ffmpeg, path):
+     """Decode any audio file to mono 16 kHz float32 PCM via ffmpeg."""
+     cmd = [ffmpeg, "-v", "error", "-i", str(path),
+            "-f", "f32le", "-ac", "1", "-ar", str(SAMPLE_RATE), "-"]
+     out = subprocess.run(cmd, capture_output=True)
+     if out.returncode != 0:
+         raise RuntimeError(f"ffmpeg failed on {path}: {out.stderr.decode()[-300:]}")
+     return np.frombuffer(out.stdout, dtype=np.float32)
+
+
+ def collect_windows(ffmpeg, audio_paths, window_sec, num_windows):
+     """Slice every available clip into non-overlapping FULL-length windows, then
+     evenly subsample down to num_windows so calibration stays quick but diverse.
+
+     All windows are exactly `win` samples long on purpose: SmoothQuant's
+     calibrator np.stacks the per-op activations across calibration samples, so a
+     variable-length tail window (different T -> different activation shape) makes
+     it raise 'all input arrays must have the same shape'. We therefore drop any
+     partial tail rather than pad it."""
+     win = int(window_sec * SAMPLE_RATE)
+     windows = []
+     for p in audio_paths:
+         if not Path(p).exists():
+             continue
+         pcm = decode_pcm(ffmpeg, p)
+         n = len(pcm)
+         count = 0
+         start = 0
+         while start + win <= n:
+             windows.append(pcm[start:start + win])
+             start += win
+             count += 1
+         print(f" [calib] {Path(p).name}: {n / SAMPLE_RATE:.0f}s -> {count} full window(s)")
+     if not windows:
+         sys.exit(f"No calibration audio yielded a full {window_sec:g}s window. "
+                  "Pass --audio <file> or lower --window-sec.")
+     if len(windows) > num_windows:
+         # Even stride across the whole pool for speaker/content diversity.
+         idx = np.linspace(0, len(windows) - 1, num_windows).round().astype(int)
+         windows = [windows[i] for i in dict.fromkeys(idx)]
+     return windows
+
+
+ def build_features(pre_path, windows):
+     """Run each raw-audio window through nemo128.onnx -> encoder mel features.
+
+     Precomputed once into memory so the calibration reader can rewind cheaply
+     (SmoothQuant + the static min/max + calibration passes each re-read it)."""
+     sess = ort.InferenceSession(str(pre_path), providers=["CPUExecutionProvider"])
+     feats = []
+     for w in windows:
+         wav = w.astype(np.float32)[None, :]
+         lens = np.array([wav.shape[1]], dtype=np.int64)
+         features, features_lens = sess.run(None, {"waveforms": wav, "waveforms_lens": lens})
+         feats.append({
+             "audio_signal": features.astype(np.float32),
+             "length": features_lens.astype(np.int64),
+         })
+     return feats
218
+
219
+
+ class FeatureReader(data_reader.CalibrationDataReader):
+     """Feeds the encoder its real (audio_signal, length) inputs for calibration."""
+
+     def __init__(self, feats):
+         self.feats = feats
+         self.i = 0
+
+     def get_next(self):
+         if self.i >= len(self.feats):
+             return None
+         item = self.feats[self.i]
+         self.i += 1
+         return item
+
+     def rewind(self):
+         self.i = 0
+
+
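The reader contract used by `FeatureReader` (return one input dict per call, then `None` when exhausted, and start over after `rewind()`) can be checked without loading any model. A minimal sketch follows; `ListReader` and `drain` are hypothetical names, and the class deliberately does not subclass the real calibration base class so the example has no dependencies.

```python
class ListReader:
    """Duck-typed stand-in for a calibration reader backed by an in-memory list."""

    def __init__(self, items):
        self.items = items
        self.i = 0

    def get_next(self):
        # None signals "no more calibration samples" to the consumer.
        if self.i >= len(self.items):
            return None
        item = self.items[self.i]
        self.i += 1
        return item

    def rewind(self):
        self.i = 0


def drain(reader):
    """Consume the reader the way a calibration pass would, until it returns None."""
    out = []
    while (item := reader.get_next()) is not None:
        out.append(item)
    return out


reader = ListReader([{"x": 1}, {"x": 2}])
first = drain(reader)       # first pass sees every sample
exhausted = drain(reader)   # without rewind() a second pass sees nothing
reader.rewind()
second = drain(reader)      # after rewind() the same samples are served again
```

This is why precomputing the features into a list matters: rewinding costs nothing, so each pass over the calibration data reuses the same arrays.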
+ def build_candidate_dir(model_dir, new_encoder, candidate_dir):
+     """Symlink-farm a model dir where encoder-model.int8.onnx IS the new encoder,
+     so wer-quants.py (which loads int8 by that canonical name via onnx-asr) serves
+     the SmoothQuant encoder while reusing every other unchanged file."""
+     model_dir = Path(model_dir).resolve()
+     candidate_dir = Path(candidate_dir).resolve()
+     candidate_dir.mkdir(parents=True, exist_ok=True)
+     for f in model_dir.iterdir():
+         if f.is_dir():
+             continue
+         link = candidate_dir / f.name
+         if link.is_symlink() or link.exists():
+             link.unlink()
+         link.symlink_to(f.resolve())
+     # Override the int8 encoder to point at the freshly exported SmoothQuant file.
+     enc_link = candidate_dir / "encoder-model.int8.onnx"
+     if enc_link.is_symlink() or enc_link.exists():
+         enc_link.unlink()
+     enc_link.symlink_to(Path(new_encoder).resolve())
+     return candidate_dir
+
+
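The symlink-farm idea in `build_candidate_dir` generalizes to "mirror a directory, then override selected names". A minimal sketch under that reading: `overlay_dir` is a hypothetical helper, the file names are placeholders, and it assumes a platform where unprivileged symlinks are allowed.

```python
import tempfile
from pathlib import Path


def overlay_dir(src_dir, dst_dir, overrides):
    """Symlink every file of src_dir into dst_dir, then repoint the overridden names."""
    src_dir, dst_dir = Path(src_dir).resolve(), Path(dst_dir).resolve()
    dst_dir.mkdir(parents=True, exist_ok=True)
    targets = {f.name: f.resolve() for f in src_dir.iterdir() if f.is_file()}
    # Overrides win over the mirrored originals.
    targets.update({name: Path(p).resolve() for name, p in overrides.items()})
    for name, target in targets.items():
        link = dst_dir / name
        # is_symlink() also catches dangling links, which exists() reports as missing.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)
    return dst_dir


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "model"
    src.mkdir()
    (src / "a.txt").write_text("original a")
    (src / "b.txt").write_text("original b")
    new_b = tmp / "new_b.txt"
    new_b.write_text("replacement b")
    out = overlay_dir(src, tmp / "candidate", {"b.txt": new_b})
    contents = {p.name: p.read_text() for p in sorted(out.iterdir())}
```

Reading through the links shows the untouched file served from the source directory and the overridden name served from the replacement, with no data copied.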
+ def main():
+     ap = argparse.ArgumentParser(
+         description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+     ap.add_argument("--model-dir", default=str(ROOT / "fallback_models"),
+                     help="dir holding encoder-model.onnx (+.data) and nemo128.onnx")
+     ap.add_argument("--out-name", default="encoder-model.int8.smoothquant.onnx",
+                     help="output filename (written into --model-dir)")
+     ap.add_argument("--candidate-dir", default=str(ROOT / "fallback_models_sq"),
+                     help="symlink-farm dir wer-quants.py points at for the new int8")
+     ap.add_argument("--no-candidate", action="store_true",
+                     help="skip building the wer-quants candidate symlink dir")
+     ap.add_argument("--alpha", type=float, default=0.5,
+                     help="SmoothQuant alpha (0..1): higher migrates more difficulty "
+                          "to the weights, better for big activation outliers")
+     ap.add_argument("--num-windows", type=int, default=24,
+                     help="max calibration windows (evenly sampled across all audio)")
+     ap.add_argument("--window-sec", type=float, default=30.0,
+                     help="calibration window length; long on purpose (the bug is long-range)")
+     ap.add_argument("--audio", action="append", default=None,
+                     help="calibration audio file(s); repeatable. Default: auto-discover.")
+     ap.add_argument("--quant-format", choices=["qoperator", "qdq"], default="qoperator",
+                     help="QOperator (QLinear* ops, matches the shipped int8) or QDQ")
+     ap.add_argument("--op-types", default="MatMul",
+                     help="comma-separated op types to quantize. Default MatMul ONLY: the "
+                          "conv subsampling front-end is quant-fragile and is the prime suspect "
+                          "for a collapsed encoder, so convs stay fp32. Pass 'MatMul,Conv' to "
+                          "also quantize convs (matches istupakov's scope).")
+     ap.add_argument("--calibrate-method", choices=["minmax", "entropy", "percentile"],
+                     default="percentile",
+                     help="static activation calibration. MinMax (the library default) lets a "
+                          "single long-tail outlier crush the scale and can collapse the encoder; "
+                          "percentile/entropy clip the tail and are far more robust here.")
+     ap.add_argument("--fidelity-warn", type=float, default=0.90,
+                     help="cosine-similarity floor (vs the fp32 encoder, one window) below which "
+                          "the export is flagged as a likely collapse before any WER run. This is "
+                          "a COLLAPSE detector, not a quality score: a healthy MatMul-only export "
+                          "measured ~0.96 cosine yet tracked fp16 WER (10.9%% vs 10.2%%), so the "
+                          "floor sits well below that. A true collapse lands far lower.")
+     ap.add_argument("--ffmpeg", default=None, help="ffmpeg binary (else $FFMPEG / PATH)")
+     args = ap.parse_args()
+
+     model_dir = Path(args.model_dir)
+     in_encoder = model_dir / "encoder-model.onnx"
+     pre_path = model_dir / "nemo128.onnx"
+     out_encoder = model_dir / args.out_name
+     for p in (in_encoder, pre_path):
+         if not p.exists():
+             sys.exit(f"missing required file: {p}")
+
+     ffmpeg = find_ffmpeg(args.ffmpeg)
+     audio = [Path(a) for a in args.audio] if args.audio else DEFAULT_CALIB_AUDIO
+
+     print(f"[sq] calibration: up to {args.num_windows} x {args.window_sec:g}s windows")
+     windows = collect_windows(ffmpeg, audio, args.window_sec, args.num_windows)
+     print(f"[sq] using {len(windows)} calibration window(s); extracting mel features...")
+     feats = build_features(pre_path, windows)
+
+     fmt = QuantFormat.QOperator if args.quant_format == "qoperator" else QuantFormat.QDQ
+     calib = {"minmax": CalibrationMethod.MinMax,
+              "entropy": CalibrationMethod.Entropy,
+              "percentile": CalibrationMethod.Percentile}[args.calibrate_method]
+     op_types = [t.strip() for t in args.op_types.split(",") if t.strip()]
+     cfg = config.StaticQuantConfig(
+         calibration_data_reader=FeatureReader(feats),
+         quant_format=fmt,
+         calibrate_method=calib,
+         activation_type=QuantType.QUInt8,
+         weight_type=QuantType.QInt8,
+         # Which weight-bearing ops to quantize. Default is MatMul ONLY: the conv
+         # subsampling front-end (pre_encode.*) sees the raw mel features with a
+         # wide dynamic range and is notoriously quant-fragile; statically
+         # quantizing it can produce garbage that propagates and empties the
+         # transcript, so we leave all convs fp32 (the user is fine trading the
+         # extra size for safety). MatMul-only also dodges the static quantizer's
+         # Pad handler, which trips on FastConformer's optional/empty Pad inputs
+         # ("Quantization parameters are not specified for param .").
+         op_types_to_quantize=op_types,
+         per_channel=True,  # the other half of the fix: per-channel weights
+         reduce_range=True,  # recommended on non-VNNI CPUs (the WASM target)
+         use_external_data_format=False,  # int8 encoder ~600 MB, fits a single file
+         calibration_sampling_size=len(feats),
+         execution_provider="CPUExecutionProvider",
+         extra_options={
+             "SmoothQuant": True,
+             "SmoothQuantAlpha": args.alpha,
+             "SmoothQuantFolding": True,
+         },
+     )
+
+     print(f"[sq] SmoothQuant(alpha={args.alpha}) static int8, per-channel, "
+           f"calib={args.calibrate_method}, ops={op_types}, format={args.quant_format} ...")
+     print(f"[sq] {human(os.path.getsize(in_encoder) + os.path.getsize(str(in_encoder) + '.data'))} fp32 encoder")
+     t0 = time.time()
+     # ORT_DISABLE_ALL skips neural-compressor's pre-optimization InferenceSession
+     # (which has a `provides=` kwarg typo that crashes on this version) and avoids
+     # re-serializing the 2.4 GB fp32 graph.
+     quantize(str(in_encoder), str(out_encoder), cfg,
+              optimization_level=ort.GraphOptimizationLevel.ORT_DISABLE_ALL)
+     dt = time.time() - t0
+     if _SKIPPED["count"]:
+         print(f"[sq] note: {_SKIPPED['count']} node(s) had a layout SmoothQuant could not "
+               f"resolve and were left as plain static int8 (everything else was smoothed)")
+
+     # neural-compressor always writes the quantized weights to an external
+     # `<name>_data` sidecar for a model this size, ignoring use_external_data_format.
+     # The int8 weights are ~620 MB, well under the 2 GB single-protobuf cap, so
+     # fold them back into ONE self-contained .onnx (matching the shipped
+     # single-file int8 and keeping the candidate symlink dir trivial).
+     sidecar = str(out_encoder) + "_data"
+     if os.path.exists(sidecar):
+         merged = onnx.load(str(out_encoder), load_external_data=True)
+         onnx.save(merged, str(out_encoder), save_as_external_data=False)
+         os.remove(sidecar)
+
+     out_size = os.path.getsize(out_encoder)
+     baseline = model_dir / "encoder-model.int8.onnx"
+     base_note = f" (istupakov int8 is {human(os.path.getsize(baseline))})" if baseline.exists() else ""
+     print(f"[sq] done in {dt:.0f}s -> {out_encoder.name} {human(out_size)}{base_note}")
+
+     # Fidelity smoke test (NOT just shape): run one calibration window through both
+     # the fp32 reference and the new int8 encoder and compare the encoder outputs by
+     # cosine similarity. A shape-only check let a fully collapsed encoder (empty
+     # transcript everywhere) pass silently once; this catches that in ~30 s instead
+     # of after a multi-minute WER run. A healthy MatMul-only export measured ~0.96
+     # (see --fidelity-warn); a true collapse lands far lower.
+     try:
+         inp = {"audio_signal": feats[0]["audio_signal"], "length": feats[0]["length"]}
+         s_q = ort.InferenceSession(str(out_encoder), providers=["CPUExecutionProvider"])
+         out_q = s_q.run(None, inp)[0].astype(np.float64).ravel()
+         s_f = ort.InferenceSession(str(in_encoder), providers=["CPUExecutionProvider"])
+         out_f = s_f.run(None, inp)[0].astype(np.float64).ravel()
+         denom = (np.linalg.norm(out_q) * np.linalg.norm(out_f)) or 1.0
+         cos = float(np.dot(out_q, out_f) / denom)
+         if cos < args.fidelity_warn:
+             print(f"[sq] WARNING: encoder-output cosine vs fp32 is {cos:.4f} "
+                   f"(< {args.fidelity_warn}). This export likely COLLAPSED; expect a near-100% "
+                   f"WER. Try a different --calibrate-method/--alpha or keep more ops fp32.",
+                   file=sys.stderr)
+         else:
+             print(f"[sq] fidelity: encoder-output cosine vs fp32 = {cos:.4f} (>= "
+                   f"{args.fidelity_warn}). Looks healthy.")
+     except Exception as e:
+         print(f"[sq] WARNING: exported encoder failed the fidelity smoke test: {e}", file=sys.stderr)
+
+     if not args.no_candidate:
+         cand = build_candidate_dir(model_dir, out_encoder, args.candidate_dir)
+         print(f"[sq] candidate model dir (for wer-quants): {cand}")
+
+     rel_cand = os.path.relpath(args.candidate_dir, ROOT)
+     rel_model = os.path.relpath(model_dir, ROOT)
+     print("\nCompare per-section degradation vs fp16:")
+     print(f" uv run scripts/wer-quants.py --model-dir {rel_cand} --quants int8")
+     print(f" uv run scripts/wer-quants.py --model-dir {rel_model} --quants int8,fp16")
+     print("A new-int8 per-section WER that tracks fp16 (instead of climbing) is the win.")
+
+
+ if __name__ == "__main__":
+     main()
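The cosine-similarity collapse check in `main()` can be exercised on synthetic arrays, with no ONNX model involved. A minimal sketch, assuming random data stands in for encoder outputs: `cosine` is a hypothetical helper that mirrors the script's zero-norm guard, and the 0.05 noise level is an arbitrary choice for illustration, not a measured quantization error.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two arrays after flattening, safe against zero norms."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    # A zero product would divide by zero; fall back to 1.0 so the result is 0.0.
    denom = (np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
    return float(np.dot(a, b) / denom)


rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 8, 16))                  # stand-in for the fp32 output
noisy = ref + 0.05 * rng.standard_normal(ref.shape)    # small perturbation
collapsed = np.zeros_like(ref)                         # stand-in for a dead encoder

cos_noisy = cosine(ref, noisy)
cos_collapsed = cosine(ref, collapsed)
```

A small perturbation keeps the similarity close to 1, while an all-zero output drops it to 0, which is the gap a floor such as `--fidelity-warn` is meant to separate.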
vocab.txt ADDED
The diff for this file is too large to render.