Add browser-friendly sharded fp32 encoder + shard-fp32.py
The single-file fp32 encoder (encoder-model.onnx + a ~2.3 GB .data sidecar)
cannot be loaded in a browser / on the CPU-WASM ONNX Runtime backend: a wasm32
ArrayBuffer caps at 2^31-1 bytes (~2 GB) and Chromium's blob-URL fetch caps near
2 GB, so the single sidecar trips both ingest walls. sharded/ ships the same fp32
weights repacked (pure repack, byte-identical numerics, same WER) across two
<2 GB shards plus a rewritten encoder graph that points at them, so a loader can
mount each shard under the caps and run full-precision fp32 on WASM.
- shard-fp32.py: the repack script (moved here from parakeet_web for provenance,
PEP 723 self-contained), producing sharded/.
- .gitattributes: LFS-track the *.onnx.data.* shards.
- README: new 'Browser-friendly fp32 shards' section, Files table + build notes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- .gitattributes +1 -0
- README.md +54 -1
- shard-fp32.py +219 -0
- sharded/encoder-model.onnx +3 -0
- sharded/encoder-model.onnx.data.000 +3 -0
- sharded/encoder-model.onnx.data.001 +3 -0
|
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
encoder-model.onnx.data filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
*.onnx.data.* filter=lfs diff=lfs merge=lfs -text
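The added pattern can be sanity-checked against the shard names. The sketch below is an approximation, not Git's own matcher: it assumes that a slash-free attribute pattern is compared against the file's basename and uses Python's `fnmatch` for the glob. The helper name `matches_basename` is hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Pattern added to .gitattributes by this commit.
SHARD_PATTERN = "*.onnx.data.*"


def matches_basename(pattern: str, path: str) -> bool:
    """Approximate a slash-free gitattributes pattern by globbing the basename."""
    return fnmatchcase(basename(path), pattern)


for path in (
    "sharded/encoder-model.onnx.data.000",
    "sharded/encoder-model.onnx.data.001",
    "encoder-model.onnx.data",
):
    print(f"{path}: {matches_basename(SHARD_PATTERN, path)}")
```

Under that assumption the unsharded `encoder-model.onnx.data` does not match the new pattern (it has no trailing `.NNN` suffix), which is consistent with the existing explicit rule for it being kept.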
|
|
@@ -201,6 +201,53 @@ run fp16 or fp32 (for example on a WebGPU backend), prefer those. This int8
|
|
| 201 |
matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8
|
| 202 |
is the only precision that both fits and runs.
|
| 203 |
|
| 204 |
## Files
|
| 205 |
|
| 206 |
| file | what it is |
|
|
@@ -208,6 +255,7 @@ is the only precision that both fits and runs.
|
|
| 208 |
| `encoder-model.onnx` (+ `.data`)| fp32 encoder (unchanged from istupakov) |
|
| 209 |
| `encoder-model.fp16.onnx` | fp16 encoder (not shipped by the upstream istupakov repo)|
|
| 210 |
| `encoder-model.int8.onnx` | **SmoothQuant int8 encoder (the reason for this repo)** |
|
|
|
|
| 211 |
| `decoder_joint-model.onnx` | fp32 decoder / joint network (unchanged) |
|
| 212 |
| `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged) |
|
| 213 |
| `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged) |
|
|
@@ -215,6 +263,7 @@ is the only precision that both fits and runs.
|
|
| 215 |
| `vocab.txt`, `config.json` | tokenizer and model config (unchanged) |
|
| 216 |
| `quantize-int8-smoothquant.py` | script that produced the SmoothQuant int8 encoder |
|
| 217 |
| `quantize-fp16.py` | script that produced the fp16 encoder |
|
|
|
|
| 218 |
|
| 219 |
## How it was built
|
| 220 |
|
|
@@ -227,8 +276,12 @@ is the only precision that both fits and runs.
|
|
| 227 |
fidelity check of the new encoder's output against fp32.
|
| 228 |
- `encoder-model.fp16.onnx`: `quantize-fp16.py`, a straight fp16 cast of the
|
| 229 |
fp32 encoder pieces.
|
|
| 230 |
|
| 231 |
-
|
| 232 |
[PEP 723](https://peps.python.org/pep-0723/) header and run with `uv run`) and
|
| 233 |
are included here for provenance and reproducibility. They reference fixtures
|
| 234 |
from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
|
|
|
|
| 201 |
matters most on a CPU / WASM backend, where fp16 has no compute kernels and int8
|
| 202 |
is the only precision that both fits and runs.
|
| 203 |
|
| 204 |
+
## Browser-friendly fp32 shards (`sharded/`)
|
| 205 |
+
|
| 206 |
+
The fp32 encoder is shipped two ways here: as the canonical single sidecar
|
| 207 |
+
(`encoder-model.onnx` + a ~2.3 GB `encoder-model.onnx.data`), and, under
|
| 208 |
+
`sharded/`, as the **same weights repacked into several files each under 2 GB**.
|
| 209 |
+
The sharded copy exists so the fp32 encoder can be loaded **in a web browser** (and
|
| 210 |
+
on the CPU / WASM ONNX Runtime backend generally), which the single-file fp32
|
| 211 |
+
**cannot**.
|
| 212 |
+
|
| 213 |
+
Why a browser cannot load the single 2.3 GB sidecar (these are *ingest* limits, not
|
| 214 |
+
a total-memory limit):
|
| 215 |
+
|
| 216 |
+
1. **32-bit WASM ArrayBuffer cap.** A WASM build is wasm32, so any single
|
| 217 |
+
`ArrayBuffer` it holds caps at `2^31 - 1` bytes (~2 GB). A 2.3 GB sidecar cannot
|
| 218 |
+
live in one buffer. (This is the same wall that forces projects like wllama to
|
| 219 |
+
shard their GGUF files.)
|
| 220 |
+
2. **Chromium blob-URL fetch cap.** Fetching a `blob:` URL larger than ~2 GB fails
|
| 221 |
+
in Chromium with `TypeError: Failed to fetch`, so the file cannot even be read
|
| 222 |
+
into memory in one piece.
|
| 223 |
+
|
| 224 |
+
Note the wasm32 heap ceiling itself is ~4 GB, and fp32 stays ~2.3 GB resident (it
|
| 225 |
+
is *not* upcast the way the CPU / WASM EP upcasts fp16 to fp32 at session build), so
|
| 226 |
+
fp32 **fits** once no single buffer or fetch exceeds 2 GB. Sharding is purely about
|
| 227 |
+
clearing the two per-buffer ingest walls above.
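The per-buffer arithmetic can be checked with the shard sizes recorded in this commit's Git LFS pointers. This is a rough sketch under one assumption: the single sidecar is taken to be about the sum of the two shards (the real file may differ slightly, for example through alignment or tensors kept inline). The names `WASM32_BUFFER_CAP` and `fits_one_buffer` are illustrative only.

```python
# Per-buffer cap quoted in the text: 2^31 - 1 bytes.
WASM32_BUFFER_CAP = 2**31 - 1

# Shard sizes in bytes, copied from the Git LFS pointers in this commit.
SHARD_SIZES = {
    "encoder-model.onnx.data.000": 1_483_313_152,
    "encoder-model.onnx.data.001": 952_107_008,
}


def fits_one_buffer(nbytes: int) -> bool:
    """True when a payload of nbytes can live in a single wasm32 ArrayBuffer."""
    return nbytes <= WASM32_BUFFER_CAP


# Assumption: the unsharded sidecar is roughly the sum of the shards.
total = sum(SHARD_SIZES.values())
print(f"combined: {total} bytes = {total / 1e9:.2f} GB = {total / 2**30:.2f} GiB, "
      f"fits one buffer: {fits_one_buffer(total)}")
for name, size in SHARD_SIZES.items():
    print(f"{name}: {size / 2**30:.2f} GiB, fits one buffer: {fits_one_buffer(size)}")
```

The same total reads as about 2.4 GB in decimal units and about 2.3 GiB in binary units, which may be why the README and the script's docstring quote slightly different figures for the same encoder.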
|
| 228 |
+
|
| 229 |
+
`shard-fp32.py` rewrites each big initializer's `external_data` location to spread
|
| 230 |
+
the encoder's tensors across N shard files (`encoder-model.onnx.data.000`,
|
| 231 |
+
`encoder-model.onnx.data.001`, ... each under a 1.5 GB budget by default), leaving a
|
| 232 |
+
small rewritten `encoder-model.onnx` graph that points at them. Here that produces
|
| 233 |
+
**two shards** (~1.4 GB + ~0.9 GB). It is a **pure repack**: no tensor value is
|
| 234 |
+
touched, so the sharded encoder is **byte-for-byte numerically identical** to the
|
| 235 |
+
single-file fp32 and has the **exact same WER**. A loader (for example
|
| 236 |
+
[parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web), with its
|
| 237 |
+
`allowWasmFp32` opt-in) mounts each shard as a separate `externalData` entry, each
|
| 238 |
+
under the 2 GB caps, and reads them straight to bytes (no >2 GB `blob:` URL, no
|
| 239 |
+
multi-GB IndexedDB blob). The decoder, tokenizer and config are **not** duplicated
|
| 240 |
+
into `sharded/`; a loader takes the rewritten encoder + shards from `sharded/` and
|
| 241 |
+
everything else from the repo root.
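The packing rule described above can be isolated from ONNX entirely. The following is a minimal sketch that works on plain byte counts; `plan_shards` is a hypothetical name, and the defaults mirror the 1.5 GB budget and the 1024-byte inline threshold used by `shard-fp32.py`.

```python
def plan_shards(sizes, max_shard_bytes=1_500_000_000, inline_threshold=1024):
    """Greedy, order-preserving packing of tensor sizes into shards.

    Returns a list of shards; each shard is a list of
    (tensor_index, offset_in_shard, length) tuples. Tensors smaller than
    inline_threshold are skipped (they would stay inline in the graph).
    """
    shards = [[]]
    offset = 0
    for index, nbytes in enumerate(sizes):
        if nbytes < inline_threshold:
            continue
        # Open a new shard only if the current one already holds data and
        # adding this tensor whole would exceed the budget. An oversized
        # tensor therefore lands alone in its own shard and is never split.
        if offset > 0 and offset + nbytes > max_shard_bytes:
            shards.append([])
            offset = 0
        shards[-1].append((index, offset, nbytes))
        offset += nbytes
    return shards


# Toy sizes with a tiny budget so the rollover is easy to follow.
plan = plan_shards([600, 900, 500, 2000, 100], max_shard_bytes=1500, inline_threshold=200)
for shard_index, entries in enumerate(plan):
    print(shard_index, entries)
```

Because tensors are never split, a shard can exceed the budget only when a single tensor is larger than the budget itself, which is the case the script warns about after writing.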
|
| 242 |
+
|
| 243 |
+
When to use which: on **WebGPU**, prefer fp16 (half the download, native fp16
|
| 244 |
+
kernels) or the single-file fp32; the GPU EP has no 2 GB per-buffer wall. The shards
|
| 245 |
+
matter on **CPU / WASM**, where fp16 has no compute kernels and the single-file fp32
|
| 246 |
+
cannot be ingested, so the sharded fp32 is the only way to run full precision.
|
| 247 |
+
|
| 248 |
+
The shards are regenerated with [`shard-fp32.py`](./shard-fp32.py) (see
|
| 249 |
+
[How it was built](#how-it-was-built)).
|
| 250 |
+
|
| 251 |
## Files
|
| 252 |
|
| 253 |
| file | what it is |
|
|
|
|
| 255 |
| `encoder-model.onnx` (+ `.data`)| fp32 encoder (unchanged from istupakov) |
|
| 256 |
| `encoder-model.fp16.onnx` | fp16 encoder (not shipped by the upstream istupakov repo)|
|
| 257 |
| `encoder-model.int8.onnx` | **SmoothQuant int8 encoder (the reason for this repo)** |
|
| 258 |
+
| `sharded/encoder-model.onnx` (+ `.data.000`, `.data.001`) | fp32 encoder repacked into <2 GB shards so a browser / WASM backend can load it (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)) |
|
| 259 |
| `decoder_joint-model.onnx` | fp32 decoder / joint network (unchanged) |
|
| 260 |
| `decoder_joint-model.fp16.onnx` | fp16 decoder / joint network (unchanged) |
|
| 261 |
| `decoder_joint-model.int8.onnx` | int8 decoder / joint network (unchanged) |
|
|
|
|
| 263 |
| `vocab.txt`, `config.json` | tokenizer and model config (unchanged) |
|
| 264 |
| `quantize-int8-smoothquant.py` | script that produced the SmoothQuant int8 encoder |
|
| 265 |
| `quantize-fp16.py` | script that produced the fp16 encoder |
|
| 266 |
+
| `shard-fp32.py` | script that produced the sharded fp32 encoder |
|
| 267 |
|
| 268 |
## How it was built
|
| 269 |
|
|
|
|
| 276 |
fidelity check of the new encoder's output against fp32.
|
| 277 |
- `encoder-model.fp16.onnx`: `quantize-fp16.py`, a straight fp16 cast of the
|
| 278 |
fp32 encoder pieces.
|
| 279 |
+
- `sharded/`: `shard-fp32.py`, a pure repack of the single-file fp32 encoder into
|
| 280 |
+
<2 GB shards (see [Browser-friendly fp32 shards](#browser-friendly-fp32-shards-sharded)).
|
| 281 |
+
No weights are altered, so the sharded encoder is numerically identical to the
|
| 282 |
+
single-file fp32.
|
| 283 |
|
| 284 |
+
All three scripts are self-contained (they declare their own dependencies via a
|
| 285 |
[PEP 723](https://peps.python.org/pep-0723/) header and run with `uv run`) and
|
| 286 |
are included here for provenance and reproducibility. They reference fixtures
|
| 287 |
from the [parakeet_web](https://github.com/thiswillbeyourgithub/parakeet_web)
|
|
@@ -0,0 +1,219 @@
|
+#!/usr/bin/env python
+# /// script
+# requires-python = ">=3.9"
+# dependencies = ["onnx"]
+# ///
+"""Shard the fp32 Parakeet encoder's external weights into <2 GB pieces so the
+fp32 encoder can load on the WASM backend / in-browser.
+
+This script lives in the model repo (parakeet-tdt-0.6b-v3-smoothquant-onnx)
+alongside quantize-int8-smoothquant.py and quantize-fp16.py. Like them it reads
+its fixtures from the parakeet_web project repository, so run it from there. It is
+self-contained (PEP 723 header above), so `uv run shard-fp32.py` installs onnx on
+the fly. The model repo ships pre-built shards under sharded/ for browsers; this
+script is included for provenance and to regenerate them.
+
+Why (see CLAUDE.md for the full reasoning): the fp32 encoder is ~2.4 GB held in
+ONE encoder-model.onnx.data sidecar. That single file trips two *ingest* walls
+that block WASM, and neither is a total-memory limit:
+  1. a 32-bit WASM ArrayBuffer caps at ~2 GB (2^31-1), and
+  2. Chromium's blob-URL fetch caps near 2 GB.
+The wasm32 heap ceiling itself is ~4 GB, and fp32 (unlike fp16, which the CPU/WASM
+EP upcasts to fp32 at session build) stays ~2.4 GB resident, so it *should* fit
+once no single buffer exceeds 2 GB. This script rewrites the encoder's per-tensor
+external_data locations to spread the initializers across N shard files, each under
+a configurable byte budget (default 1.5 GB), producing:
+
+    encoder-model.onnx            (graph; tensors now point at the shards)
+    encoder-model.onnx.data.000
+    encoder-model.onnx.data.001
+    ...
+
+onnxruntime-node (native) resolves these from disk by the graph's location fields;
+the WASM / browser loader mounts each shard as a separate externalData entry (each
+< 2 GB), sidestepping both caps. No weights are altered: this is a pure repack, so
+WER must be identical to the single-file fp32. That equality is the whole point of
+the experiment (does fp32 hold up on a long chunk where int8 drops content), so the
+script never touches tensor values, only where their bytes live.
+
+Usage (run from the parakeet_web repo, with this script in the model-repo folder):
+    uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py # ./fallback_models -> ./fallback_models/sharded
+    uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py --model-dir DIR --out-dir DIR
+    uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py --max-shard-bytes 1000000000 # smaller shards (lower transient load peak)
+    uv run parakeet-tdt-0.6b-v3-smoothquant-onnx/shard-fp32.py --encoder encoder-model.onnx # non-default encoder name
+
+Built with Claude Code.
+"""
+
+import argparse
+import os
+import sys
+
+import onnx
+from onnx import TensorProto
+from onnx.external_data_helper import set_external_data
+
+# Default shard budget. 1.5 GB leaves comfortable headroom under the 2 GB
+# ArrayBuffer / blob caps even after a tensor that would straddle a boundary is
+# pushed whole into the next shard. Smaller shards lower the transient load peak
+# (ORT holds a shard's bytes in the heap while deserialising it), at the cost of
+# more files; 1.5 GB is a sane default for a ~2.4 GB encoder (-> 2 shards).
+DEFAULT_MAX_SHARD_BYTES = 1_500_000_000
+
+# Tensors below this many bytes stay inline in the graph proto (mirrors onnx's
+# own default size_threshold): sharding tiny scalars/biases is pointless and just
+# inflates the file count.
+INLINE_THRESHOLD_BYTES = 1024
+
+
+def human(n):
+    n = float(n)
+    for unit in ("B", "KB", "MB", "GB"):
+        if n < 1024 or unit == "GB":
+            return f"{n:.0f} {unit}" if unit == "B" else f"{n:.1f} {unit}"
+        n /= 1024
+
+
+def tensor_nbytes(t):
+    # After load_external_data the bytes live in raw_data; that is the only field
+    # the fp32 encoder's big initializers use. Non-raw tensors are left inline.
+    return len(t.raw_data) if t.HasField("raw_data") else 0
+
+
+def shard_model(in_path, out_path, max_shard_bytes):
+    if not os.path.exists(in_path):
+        raise FileNotFoundError(f"missing input model: {in_path}")
+
+    # Pull the sibling .onnx.data into raw_data so we see real bytes to repack.
+    # Needs ~the encoder's size in RAM (~2.4 GB); cheap given the repack savings.
+    print(f"[shard] loading {in_path} (+ external data) ...")
+    model = onnx.load(in_path, load_external_data=True)
+
+    out_dir = os.path.dirname(out_path) or "."
+    os.makedirs(out_dir, exist_ok=True)
+    base = os.path.basename(out_path) # e.g. encoder-model.onnx
+
+    # Greedy bin-pack: walk initializers, open a new shard whenever adding the
+    # next tensor whole would exceed the budget. A single tensor larger than the
+    # budget gets its own shard (we never split a tensor across files, so each
+    # tensor's external_data stays a simple (location, offset, length)).
+    shard_idx = 0
+    shard_offset = 0
+    shard_file = None
+    shard_paths = []
+    inline_count = 0
+    externalised = 0
+
+    def shard_location(idx):
+        return f"{base}.data.{idx:03d}"
+
+    def open_shard(idx):
+        loc = shard_location(idx)
+        path = os.path.join(out_dir, loc)
+        f = open(path, "wb")
+        shard_paths.append(path)
+        return f, loc
+
+    shard_file, shard_loc = open_shard(shard_idx)
+
+    try:
+        for t in model.graph.initializer:
+            nbytes = tensor_nbytes(t)
+            if nbytes < INLINE_THRESHOLD_BYTES:
+                inline_count += 1
+                continue # leave small tensors inline in the graph
+
+            # Roll to the next shard if this tensor would push us over budget,
+            # unless the current shard is still empty (a tensor bigger than the
+            # whole budget then lands alone in its own shard).
+            if shard_offset > 0 and shard_offset + nbytes > max_shard_bytes:
+                shard_file.close()
+                shard_idx += 1
+                shard_offset = 0
+                shard_file, shard_loc = open_shard(shard_idx)
+
+            data = t.raw_data
+            shard_file.write(data)
+            set_external_data(t, location=shard_loc, offset=shard_offset, length=nbytes)
+            t.ClearField("raw_data")
+            t.data_location = TensorProto.EXTERNAL
+            shard_offset += nbytes
+            externalised += 1
+    finally:
+        if shard_file:
+            shard_file.close()
+
+    # The initializers now reference the shard files; save the graph as-is (the
+    # external_data is already set, so save_as_external_data=False is correct and
+    # must stay False or onnx would try to re-pack into a single file).
+    onnx.save(model, out_path, save_as_external_data=False)
+
+    sizes = [os.path.getsize(p) for p in shard_paths]
+    print(f"[shard] wrote {os.path.basename(out_path)} + {len(shard_paths)} shard(s) "
+          f"({externalised} external tensors, {inline_count} kept inline):")
+    for p, s in zip(shard_paths, sizes):
+        flag = " <-- OVER 2 GB!" if s >= 2 ** 31 else ""
+        print(f" {os.path.basename(p)} {human(s)}{flag}")
+    total = sum(sizes)
+    over = [p for p, s in zip(shard_paths, sizes) if s >= 2 ** 31]
+    print(f"[shard] total external: {human(total)} across {len(shard_paths)} shard(s)")
+    if over:
+        print(f"[shard] WARNING: {len(over)} shard(s) still exceed 2 GB; lower --max-shard-bytes",
+              file=sys.stderr)
+    return shard_paths
+
+
+def link_sibling(src_dir, out_dir, name):
+    """Make `name` available in out_dir (symlink, falling back to copy) so the
+    output is a complete model dir for wer-bench/transcribe without duplicating
+    multi-hundred-MB files. Skips silently when src and out are the same dir or
+    the source is absent."""
+    src = os.path.join(src_dir, name)
+    dst = os.path.join(out_dir, name)
+    if not os.path.exists(src) or os.path.abspath(src) == os.path.abspath(dst):
+        return
+    if os.path.lexists(dst):
+        os.remove(dst)
+    try:
+        os.symlink(os.path.relpath(src, out_dir), dst)
+    except OSError:
+        import shutil
+        shutil.copy2(src, dst)
+
+
+def main():
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--model-dir", default="./fallback_models",
+                    help="dir holding encoder-model.onnx (+ .onnx.data). Default ./fallback_models")
+    ap.add_argument("--out-dir", default=None,
+                    help="where to write the sharded encoder + a complete model dir "
+                         "(default: <model-dir>/sharded). Pass the same value as --model-dir to shard in place.")
+    ap.add_argument("--encoder", default="encoder-model.onnx",
+                    help="encoder graph filename within --model-dir (default encoder-model.onnx)")
+    ap.add_argument("--max-shard-bytes", type=int, default=DEFAULT_MAX_SHARD_BYTES,
+                    help=f"max bytes per shard (default {DEFAULT_MAX_SHARD_BYTES}, i.e. 1.5 GB)")
+    args = ap.parse_args()
+
+    out_dir = args.out_dir or os.path.join(args.model_dir, "sharded")
+    in_path = os.path.join(args.model_dir, args.encoder)
+    out_path = os.path.join(out_dir, args.encoder)
+
+    if args.max_shard_bytes >= 2 ** 31:
+        print("[shard] WARNING: --max-shard-bytes >= 2 GB defeats the purpose "
+              "(shards must stay under the WASM/blob 2 GB cap)", file=sys.stderr)
+
+    shard_model(in_path, out_path, args.max_shard_bytes)
+
+    # Round out the output into a self-contained model dir so wer-bench can point
+    # --model-dir straight at it. The fp32 decoder/vocab/preproc are reused as-is.
+    if os.path.abspath(out_dir) != os.path.abspath(args.model_dir):
+        for name in ("decoder_joint-model.onnx", "vocab.txt", "nemo128.onnx", "config.json"):
+            link_sibling(args.model_dir, out_dir, name)
+        print(f"[shard] linked decoder/vocab/preproc into {out_dir}")
+
+    print(f"[shard] done. Use: node scripts/wer-bench.mjs --model-dir {out_dir} --configs fp32@60 --ort wasm")
+
+
+if __name__ == "__main__":
+    main()
|
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ea11ddde05617182f4ea0f50bc494fda783b73c7ef9ca1bb90d3de4b4fba53b7
|
| 3 |
+
size 41773219
|
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a6e66c096cdfbb259bccde3772955399dc756b1c2c86dd3ec296c325f98d01f7
|
| 3 |
+
size 1483313152
|
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:988714aa07059e6ad1b12cca13ce59ff21a272250b21edf2c9cf5eac2b76bbed
|
| 3 |
+
size 952107008
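The pointer files above record a SHA-256 digest and a byte size for each shard, so a downloaded or regenerated shard could be checked against them. The sketch below is one possible way to do that with the standard library; `parse_lfs_pointer` and `stream_matches` are hypothetical helpers, and the final lines use a small synthetic payload rather than a real shard.

```python
import hashlib
import io


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def stream_matches(pointer, stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks and compare digest and size to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    seen = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
        seen += len(chunk)
    return seen == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# Pointer text copied from sharded/encoder-model.onnx.data.001 in this commit.
POINTER_001 = """version https://git-lfs.github.com/spec/v1
oid sha256:988714aa07059e6ad1b12cca13ce59ff21a272250b21edf2c9cf5eac2b76bbed
size 952107008
"""
pointer = parse_lfs_pointer(POINTER_001)
print(pointer["algo"], pointer["size"])

# Synthetic stand-in for a shard, only to exercise the check without a download.
payload = b"not a real shard"
fake_pointer = {
    "algo": "sha256",
    "digest": hashlib.sha256(payload).hexdigest(),
    "size": len(payload),
}
print(stream_matches(fake_pointer, io.BytesIO(payload)))
```

For a real shard, the stream would be a file opened in binary mode, for example `open("sharded/encoder-model.onnx.data.001", "rb")`; chunked reads keep memory use flat even for a file well over 1 GB.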
|