thiswillbeyourgithub Claude Opus 4.8 committed on
Commit d99636b · 1 Parent(s): 57efb46

Add browser-friendly sharded fp32 encoder + shard-fp32.py

The single-file fp32 encoder (encoder-model.onnx + a ~2.3 GB .data sidecar)
cannot be loaded in a browser / on the CPU-WASM ONNX Runtime backend: a wasm32
ArrayBuffer caps at 2^31-1 bytes (~2 GB) and Chromium's blob-URL fetch caps near
2 GB, so the single sidecar trips both ingest walls. sharded/ ships the same fp32
weights repacked (pure repack, byte-identical numerics, same WER) across two
<2 GB shards plus a rewritten encoder graph that points at them, so a loader can
mount each shard under the caps and run full-precision fp32 on WASM.

- shard-fp32.py: the repack script (moved here from parakeet_web for provenance,
PEP 723 self-contained), producing sharded/.
- .gitattributes: LFS-track the *.onnx.data.* shards.
- README: new 'Browser-friendly fp32 shards' section, Files table + build notes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  encoder-model.onnx.data filter=lfs diff=lfs merge=lfs -text
37
+ *.onnx.data.* filter=lfs diff=lfs merge=lfs -text
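Why the extra pattern is needed can be checked with a small sketch. It uses Python's `fnmatch` as a rough stand-in for gitattributes glob matching; that is an assumption, since Git's own matcher has extra rules (for example around `/` and `**`) that should not matter for these slash-free names. The `is_lfs_tracked` helper is hypothetical and only illustrates the idea.

```python
from fnmatch import fnmatchcase

# The two LFS patterns relevant to the encoder sidecars, copied from .gitattributes.
OLD_PATTERN = "encoder-model.onnx.data"
NEW_PATTERN = "*.onnx.data.*"


def is_lfs_tracked(filename, patterns):
    """Return True if any pattern matches the bare filename (fnmatch approximation)."""
    return any(fnmatchcase(filename, p) for p in patterns)


shards = ["encoder-model.onnx.data.000", "encoder-model.onnx.data.001"]

# The pre-existing rule names only the single sidecar, so the shards are not matched.
print([is_lfs_tracked(s, [OLD_PATTERN]) for s in shards])

# With the new glob added, both shards are matched.
print([is_lfs_tracked(s, [OLD_PATTERN, NEW_PATTERN]) for s in shards])
```

Under this approximation the new glob does not match the original `encoder-model.onnx.data` name, which would explain why both rules are kept side by side.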
README.md CHANGED
@@ -201,6 +201,53 @@ run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8
201
  matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8
202
  is the only precision that both fits and runs.
203
 
204
  ## Files
205
 
206
  | file | what it is |
@@ -208,6 +255,7 @@ is the only precision that both fits and runs.
208
  | `encoder-model.onnx` (+ `.data`)| fp32 encoder (unchanged from istupakov) |
209
  | `encoder-model.fp16.onnx` | fp16 encoder (not shipped by the upstream istupakov repo)|
210
  | `encoder-model.int8.onnx` | **SmoothQuant int8 encoder (the reason for this repo)** |
 
211
  | `decoder_joint-model.onnx` | fp32 decoder / joint network (unchanged) |
212
  | `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged) |
213
  | `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged) |
@@ -215,6 +263,7 @@ is the only precision that both fits and runs.
215
  | `vocab.txt`, `config.json` | tokenizer and model config (unchanged) |
216
  | `quantize-int8-smoothquant.py` | script that produced the SmoothQuant int8 encoder |
217
  | `quantize-fp16.py` | script that produced the fp16 encoder |
 
218
 
219
  ## How it was built
220
 
@@ -227,8 +276,12 @@ is the only precision that both fits and runs.
227
  fidelity check of the new encoder's output against fp32.
228
  - `encoder-model.fp16.onnx`: `quantize-fp16.py`, a straight fp16 cast of the
229
  fp32 encoder pieces.
 
 
 
 
230
 
231
- Both scripts are self-contained (they declare their own dependencies via a
232
  [PEP 723](https://peps.python.org/pep-0723/) header and run with `uv run`) and
233
  are included here for provenance and reproducibility. They reference fixtures
234
  from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
 
201
  matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8
202
  is the only precision that both fits and runs.
203
 
204
+ ## Browser-friendly fp32 shards (`sharded/`)
205
+
206
+ The fp32 encoder is shipped two ways here: as the canonical single sidecar
207
+ (`encoder-model.onnx` + a ~2.3 GB `encoder-model.onnx.data`), and, under
208
+ `sharded/`, as the **same weights repacked into several files each under 2 GB**.
209
+ The sharded copy exists so the fp32 encoder can be loaded **in a web browser** (and
210
+ on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32
211
+ **cannot**.
212
+
213
+ Why a browser cannot load the single 2.3 GB sidecar (these are *ingest* limits, not
214
+ a total-memory limit):
215
+
216
+ 1. **32-bit WASM ArrayBuffer cap.** A WASM build is wasm32, so any single
217
+ `ArrayBuffer` it holds caps at `2^31 - 1` bytes (~2 GB). A 2.3 GB sidecar cannot
218
+ live in one buffer. (This is the same wall that forces projects like wllama to
219
+ shard their GGUF files.)
220
+ 2. **Chromium blob-URL fetch cap.** Fetching a `blob:` URL larger than ~2 GB fails
221
+ in Chromium with `TypeError: Failed to fetch`, so the file cannot even be read
222
+ into memory in one piece.
223
+
224
+ Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it
225
+ is *not* upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so
226
+ fp32 **fits** once no single buffer or fetch exceeds 2 GB. Sharding is purely about
227
+ clearing the two per-buffer ingest walls above.
228
+
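The size argument above can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the byte counts are the shard sizes from this commit's Git LFS pointers, their sum is used as a stand-in for the single sidecar's size (the original file may differ slightly, for example through padding), and the "~4 GB" heap ceiling is taken to mean 4 GiB. It only compares weight bytes against the caps and says nothing about other runtime memory.

```python
# Sizes in bytes, taken from the Git LFS pointers for the two shards in this commit.
SHARD_SIZES = [1_483_313_152, 952_107_008]

ARRAYBUFFER_CAP = 2**31 - 1   # per-buffer cap quoted above
WASM32_HEAP_CEILING = 2**32   # "~4 GB" heap ceiling, assumed here to mean 4 GiB


def gib(n):
    """Convert a byte count to binary gigabytes (GiB)."""
    return n / 2**30


total = sum(SHARD_SIZES)

print(f"total external weights: {total} bytes = {gib(total):.2f} GiB = {total / 1e9:.2f} GB")
print("unsharded total over per-buffer cap:", total > ARRAYBUFFER_CAP)
print("every shard under per-buffer cap:", all(s <= ARRAYBUFFER_CAP for s in SHARD_SIZES))
print("total under heap ceiling:", total < WASM32_HEAP_CEILING)
```

The total comes to 2,435,420,160 bytes, about 2.27 GiB or 2.44 GB in decimal units. That would be consistent with both the "~2.3 GB" quoted here and the "~2.4 GB" in the script's docstring, if the former is read in binary units and the latter in decimal ones.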
229
+ `shard-fp32.py` rewrites each big initializer's `external_data` location to spread
230
+ the encoder's tensors across N shard files (`encoder-model.onnx.data.000`,
231
+ `encoder-model.onnx.data.001`, ... each under a 1.5 GB budget by default), leaving a
232
+ small rewritten `encoder-model.onnx` graph that points at them. Here that produces
233
+ **two shards** (~1.4 GB + ~0.9 GB). It is a **pure repack**: no tensor value is
234
+ touched, so every tensor in the sharded encoder is **byte-for-byte identical** to the
235
+ single-file fp32 and has the **exact same WER**. A loader (for example
236
+ [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web), with its
237
+ `allowWasmFp32` opt-in) mounts each shard as a separate `externalData` entry, each
238
+ under the 2 GB caps, and reads them straight to bytes (no >2 GB `blob:` URL, no
239
+ multi-GB IndexedDB blob). The decoder, tokenizer and config are **not** duplicated
240
+ into `sharded/`; a loader takes the rewritten encoder + shards from `sharded/` and
241
+ everything else from the repo root.
242
+
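The packing rule described above can be sketched without onnx. The function below is a minimal illustration that works on a plain list of tensor sizes; `plan_shards` and its return shape are hypothetical, and only the greedy rule itself (never split a tensor, roll over when a non-empty shard would exceed the budget, keep tiny tensors inline) follows `shard-fp32.py`.

```python
def plan_shards(tensor_sizes, max_shard_bytes=1_500_000_000, inline_threshold=1024):
    """Assign each tensor to (shard_index, offset), never splitting a tensor.

    Returns a list parallel to tensor_sizes: None for tensors kept inline,
    otherwise a (shard_index, offset) tuple.
    """
    placements = []
    shard_idx, offset = 0, 0
    for nbytes in tensor_sizes:
        if nbytes < inline_threshold:
            placements.append(None)  # small tensors stay in the graph file
            continue
        # Start a new shard if this tensor would overflow a non-empty one.
        # An oversized tensor therefore lands alone in its own shard.
        if offset > 0 and offset + nbytes > max_shard_bytes:
            shard_idx += 1
            offset = 0
        placements.append((shard_idx, offset))
        offset += nbytes
    return placements


# Toy sizes with a 100-byte budget and a 10-byte inline threshold.
print(plan_shards([60, 5, 30, 20, 150, 40], max_shard_bytes=100, inline_threshold=10))
# -> [(0, 0), None, (0, 60), (1, 0), (2, 0), (3, 0)]
```

In the toy run the 150-byte tensor exceeds the budget on its own, so it gets a dedicated shard. This is the case where a shard can end up larger than the budget, and it is why the script still warns about shards over 2 GB after packing.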
243
+ When to use which: on **WebGPU**, prefer fp16 (half the download, native fp16
244
+ kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards
245
+ matter on **CPU / WASM**, where fp16 has no compute kernels and the single-file fp32
246
+ cannot be ingested, so the sharded fp32 is the only way to run full precision.
247
+
248
+ The shards are regenerated with [`shard-fp32.py`](./shard-fp32.py) (see
249
+ [How it was built](#how-it-was-built)).
250
+
251
  ## Files
252
 
253
  | file | what it is |
 
255
  | `encoder-model.onnx` (+ `.data`)| fp32 encoder (unchanged from istupakov) |
256
  | `encoder-model.fp16.onnx` | fp16 encoder (not shipped by the upstream istupakov repo)|
257
  | `encoder-model.int8.onnx` | **SmoothQuant int8 encoder (the reason for this repo)** |
258
+ | `sharded/encoder-model.onnx` (+ `.data.000`, `.data.001`) | fp32 encoder repacked into <2 GB shards so a browser / WASM backend can load it (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)) |
259
  | `decoder_joint-model.onnx` | fp32 decoder / joint network (unchanged) |
260
  | `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged) |
261
  | `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged) |
 
263
  | `vocab.txt`, `config.json` | tokenizer and model config (unchanged) |
264
  | `quantize-int8-smoothquant.py` | script that produced the SmoothQuant int8 encoder |
265
  | `quantize-fp16.py` | script that produced the fp16 encoder |
266
+ | `shard-fp32.py` | script that produced the sharded fp32 encoder |
267
 
268
  ## How it was built
269
 
 
276
  fidelity check of the new encoder's output against fp32.
277
  - `encoder-model.fp16.onnx`: `quantize-fp16.py`, a straight fp16 cast of the
278
  fp32 encoder pieces.
279
+ - `sharded/`: `shard-fp32.py`, a pure repack of the single-file fp32 encoder into
280
+ <2 GB shards (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)).
281
+ No weights are altered, so the sharded encoder is numerically identical to the
282
+ single-file fp32.
283
 
284
+ All three scripts are self-contained (they declare their own dependencies via a
285
  [PEP 723](https://peps.python.org/pep-0723/) header and run with `uv run`) and
286
  are included here for provenance and reproducibility. They reference fixtures
287
  from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
shard-fp32.py ADDED
@@ -0,0 +1,219 @@
+ #!/usr/bin/env python
+ # /// script
+ # requires-python = ">=3.9"
+ # dependencies = ["onnx"]
+ # ///
+ """Shard the fp32 Parakeet encoder's external weights into <2 GB pieces so the
+ fp32 encoder can load on the WASM backend / in-browser.
+
+ This script lives in the model repo (parakeet-tdt-0.6b-v3-smoothquant-onnx)
+ alongside quantize-int8-smoothquant.py and quantize-fp16.py. Like them it reads
+ its fixtures from the parakeet_web project repository, so run it from there. It is
+ self-contained (PEP 723 header above), so `uv run shard-fp32.py` installs onnx on
+ the fly. The model repo ships pre-built shards under sharded/ for browsers; this
+ script is included for provenance and to regenerate them.
+
+ Why (see CLAUDE.md for the full reasoning): the fp32 encoder is ~2.4 GB held in
+ ONE encoder-model.onnx.data sidecar. That single file trips two *ingest* walls
+ that block WASM, and neither is a total-memory limit:
+   1. a 32-bit WASM ArrayBuffer caps at ~2 GB (2^31-1), and
+   2. Chromium's blob-URL fetch caps near 2 GB.
+ The wasm32 heap ceiling itself is ~4 GB, and fp32 (unlike fp16, which the CPU/WASM
+ EP upcasts to fp32 at session build) stays ~2.4 GB resident, so it *should* fit
+ once no single buffer exceeds 2 GB. This script rewrites the encoder's per-tensor
+ external_data locations to spread the initializers across N shard files, each under
+ a configurable byte budget (default 1.5 GB), producing:
+
+     encoder-model.onnx (graph; tensors now point at the shards)
+     encoder-model.onnx.data.000
+     encoder-model.onnx.data.001
+     ...
+
+ onnxruntime-node (native) resolves these from disk by the graph's location fields;
+ the WASM / browser loader mounts each shard as a separate externalData entry (each
+ < 2 GB), sidestepping both caps. No weights are altered: this is a pure repack, so
+ WER must be identical to the single-file fp32. That equality is the whole point of
+ the experiment (does fp32 hold up on a long chunk where int8 drops content), so the
+ script never touches tensor values, only where their bytes live.
+
+ Usage (run from the parakeet_web repo, with this script in the model-repo folder):
+     uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py # ./fallback_models -> ./fallback_models/sharded
+     uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py --model-dir DIR --out-dir DIR
+     uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py --max-shard-bytes 1000000000 # smaller shards (lower transient load peak)
+     uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py --encoder encoder-model.onnx # non-default encoder name
+
+ Built with Claude Code.
+ """
+
+ import argparse
+ import os
+ import sys
+
+ import onnx
+ from onnx import TensorProto
+ from onnx.external_data_helper import set_external_data
+
+ # Default shard budget. 1.5 GB leaves comfortable headroom under the 2 GB
+ # ArrayBuffer / blob caps even after a tensor that would straddle a boundary is
+ # pushed whole into the next shard. Smaller shards lower the transient load peak
+ # (ORT holds a shard's bytes in the heap while deserialising it), at the cost of
+ # more files; 1.5 GB is a sane default for a ~2.4 GB encoder (-> 2 shards).
+ DEFAULT_MAX_SHARD_BYTES = 1_500_000_000
+
+ # Tensors below this many bytes stay inline in the graph proto (mirrors onnx's
+ # own default size_threshold): sharding tiny scalars/biases is pointless and just
+ # inflates the file count.
+ INLINE_THRESHOLD_BYTES = 1024
+
+
+ def human(n):
+     n = float(n)
+     for unit in ("B", "KB", "MB", "GB"):
+         if n < 1024 or unit == "GB":
+             return f"{n:.0f} {unit}" if unit == "B" else f"{n:.1f} {unit}"
+         n /= 1024
+
+
+ def tensor_nbytes(t):
+     # After load_external_data the bytes live in raw_data; that is the only field
+     # the fp32 encoder's big initializers use. Non-raw tensors are left inline.
+     return len(t.raw_data) if t.HasField("raw_data") else 0
+
+
+ def shard_model(in_path, out_path, max_shard_bytes):
+     if not os.path.exists(in_path):
+         raise FileNotFoundError(f"missing input model: {in_path}")
+
+     # Pull the sibling .onnx.data into raw_data so we see real bytes to repack.
+     # Needs ~the encoder's size in RAM (~2.4 GB); cheap given the repack savings.
+     print(f"[shard] loading {in_path} (+ external data) ...")
+     model = onnx.load(in_path, load_external_data=True)
+
+     out_dir = os.path.dirname(out_path) or "."
+     os.makedirs(out_dir, exist_ok=True)
+     base = os.path.basename(out_path) # e.g. encoder-model.onnx
+
+     # Greedy bin-pack: walk initializers, open a new shard whenever adding the
+     # next tensor whole would exceed the budget. A single tensor larger than the
+     # budget gets its own shard (we never split a tensor across files, so each
+     # tensor's external_data stays a simple (location, offset, length)).
+     shard_idx = 0
+     shard_offset = 0
+     shard_file = None
+     shard_paths = []
+     inline_count = 0
+     externalised = 0
+
+     def shard_location(idx):
+         return f"{base}.data.{idx:03d}"
+
+     def open_shard(idx):
+         loc = shard_location(idx)
+         path = os.path.join(out_dir, loc)
+         f = open(path, "wb")
+         shard_paths.append(path)
+         return f, loc
+
+     shard_file, shard_loc = open_shard(shard_idx)
+
+     try:
+         for t in model.graph.initializer:
+             nbytes = tensor_nbytes(t)
+             if nbytes < INLINE_THRESHOLD_BYTES:
+                 inline_count += 1
+                 continue # leave small tensors inline in the graph
+
+             # Roll to the next shard if this tensor would push us over budget,
+             # unless the current shard is still empty (a tensor bigger than the
+             # whole budget then lands alone in its own shard).
+             if shard_offset > 0 and shard_offset + nbytes > max_shard_bytes:
+                 shard_file.close()
+                 shard_idx += 1
+                 shard_offset = 0
+                 shard_file, shard_loc = open_shard(shard_idx)
+
+             data = t.raw_data
+             shard_file.write(data)
+             set_external_data(t, location=shard_loc, offset=shard_offset, length=nbytes)
+             t.ClearField("raw_data")
+             t.data_location = TensorProto.EXTERNAL
+             shard_offset += nbytes
+             externalised += 1
+     finally:
+         if shard_file:
+             shard_file.close()
+
+     # The initializers now reference the shard files; save the graph as-is (the
+     # external_data is already set, so save_as_external_data=False is correct and
+     # must stay False or onnx would try to re-pack into a single file).
+     onnx.save(model, out_path, save_as_external_data=False)
+
+     sizes = [os.path.getsize(p) for p in shard_paths]
+     print(f"[shard] wrote {os.path.basename(out_path)} + {len(shard_paths)} shard(s) "
+           f"({externalised} external tensors, {inline_count} kept inline):")
+     for p, s in zip(shard_paths, sizes):
+         flag = " <-- OVER 2 GB!" if s >= 2 ** 31 else ""
+         print(f" {os.path.basename(p)} {human(s)}{flag}")
+     total = sum(sizes)
+     over = [p for p, s in zip(shard_paths, sizes) if s >= 2 ** 31]
+     print(f"[shard] total external: {human(total)} across {len(shard_paths)} shard(s)")
+     if over:
+         print(f"[shard] WARNING: {len(over)} shard(s) still exceed 2 GB; lower --max-shard-bytes",
+               file=sys.stderr)
+     return shard_paths
+
+
+ def link_sibling(src_dir, out_dir, name):
+     """Make `name` available in out_dir (symlink, falling back to copy) so the
+     output is a complete model dir for wer-bench/transcribe without duplicating
+     multi-hundred-MB files. Skips silently when src and out are the same dir or
+     the source is absent."""
+     src = os.path.join(src_dir, name)
+     dst = os.path.join(out_dir, name)
+     if not os.path.exists(src) or os.path.abspath(src) == os.path.abspath(dst):
+         return
+     if os.path.lexists(dst):
+         os.remove(dst)
+     try:
+         os.symlink(os.path.relpath(src, out_dir), dst)
+     except OSError:
+         import shutil
+         shutil.copy2(src, dst)
+
+
+ def main():
+     ap = argparse.ArgumentParser(description=__doc__,
+                                  formatter_class=argparse.RawDescriptionHelpFormatter)
+     ap.add_argument("--model-dir", default="./fallback_models",
+                     help="dir holding encoder-model.onnx (+ .onnx.data). Default ./fallback_models")
+     ap.add_argument("--out-dir", default=None,
+                     help="where to write the sharded encoder + a complete model dir "
+                          "(default: <model-dir>/sharded). Pass the same value as --model-dir to shard in place.")
+     ap.add_argument("--encoder", default="encoder-model.onnx",
+                     help="encoder graph filename within --model-dir (default encoder-model.onnx)")
+     ap.add_argument("--max-shard-bytes", type=int, default=DEFAULT_MAX_SHARD_BYTES,
+                     help=f"max bytes per shard (default {DEFAULT_MAX_SHARD_BYTES}, i.e. 1.5 GB)")
+     args = ap.parse_args()
+
+     out_dir = args.out_dir or os.path.join(args.model_dir, "sharded")
+     in_path = os.path.join(args.model_dir, args.encoder)
+     out_path = os.path.join(out_dir, args.encoder)
+
+     if args.max_shard_bytes >= 2 ** 31:
+         print("[shard] WARNING: --max-shard-bytes >= 2 GB defeats the purpose "
+               "(shards must stay under the WASM/blob 2 GB cap)", file=sys.stderr)
+
+     shard_model(in_path, out_path, args.max_shard_bytes)
+
+     # Round out the output into a self-contained model dir so wer-bench can point
+     # --model-dir straight at it. The fp32 decoder/vocab/preproc are reused as-is.
+     if os.path.abspath(out_dir) != os.path.abspath(args.model_dir):
+         for name in ("decoder_joint-model.onnx", "vocab.txt", "nemo128.onnx", "config.json"):
+             link_sibling(args.model_dir, out_dir, name)
+         print(f"[shard] linked decoder/vocab/preproc into {out_dir}")
+
+     print(f"[shard] done. Use: node scripts/wer-bench.mjs --model-dir {out_dir} --configs fp32@60 --ort wasm")
+
+
+ if __name__ == "__main__":
+     main()
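One way to sanity-check a sharded graph after running the script is to inspect the (location, offset, length) triples it writes into each initializer's `external_data`. The sketch below is an illustration, not part of the commit: `check_extents` is a hypothetical helper, and it takes plain tuples so it runs without onnx. In practice the tuples could be collected by loading the graph with `onnx.load(path, load_external_data=False)` and reading each initializer's `external_data` key/value entries. The no-gap check assumes the back-to-back layout this script produces; files written by other tools may contain alignment padding.

```python
from collections import defaultdict

CAP = 2**31 - 1  # per-file cap the shards are meant to stay under


def check_extents(records, cap=CAP):
    """Validate (location, offset, length) records for one sharded model.

    Returns {location: total_bytes}. Raises ValueError if the ranges within a
    shard overlap or leave gaps, or if a shard's extent exceeds `cap`.
    """
    by_shard = defaultdict(list)
    for location, offset, length in records:
        by_shard[location].append((offset, length))

    totals = {}
    for location, spans in by_shard.items():
        expected = 0
        for offset, length in sorted(spans):
            if offset != expected:
                raise ValueError(f"{location}: gap or overlap at offset {offset}")
            expected = offset + length
        if expected > cap:
            raise ValueError(f"{location}: {expected} bytes exceeds cap {cap}")
        totals[location] = expected
    return totals


# Toy records: two tensors packed back to back in shard 000, one in shard 001.
records = [
    ("encoder-model.onnx.data.000", 0, 4096),
    ("encoder-model.onnx.data.000", 4096, 2048),
    ("encoder-model.onnx.data.001", 0, 8192),
]
print(check_extents(records))
# -> {'encoder-model.onnx.data.000': 6144, 'encoder-model.onnx.data.001': 8192}
```

The per-shard totals returned here should equal the shard file sizes on disk, which gives a cheap cross-check against the sizes the script prints.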
sharded/encoder-model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea11ddde05617182f4ea0f50bc494fda783b73c7ef9ca1bb90d3de4b4fba53b7
3
+ size 41773219
sharded/encoder-model.onnx.data.000 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a6e66c096cdfbb259bccde3772955399dc756b1c2c86dd3ec296c325f98d01f7
3
+ size 1483313152
sharded/encoder-model.onnx.data.001 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:988714aa07059e6ad1b12cca13ce59ff21a272250b21edf2c9cf5eac2b76bbed
3
+ size 952107008