mjbommar committed on
Commit 91b72cd · verified · 1 Parent(s): fcee90a

README: surface ONNX bundle for users

Files changed (2)
  1. README.md +3 -0
  2. onnx/README.md +4 -4
README.md CHANGED
@@ -70,6 +70,9 @@ The family ships 28 parent cells (3 sizes × 4 vocabs × 2-3 seeds at seq\_len=1
  > **Short-sequence sibling available.** If your inputs are sub-KB (DNS payloads, sub-MTU packets, small forensic fragments), use `mjbommar/mimelens-001-medium-bpe-16k-s1-seq256` instead. Same architecture, 4× shorter context, ~5× lower CPU latency, BPE-cell accuracy ties or beats this cell on the magic-files probe-fit. See paper Appendix B.5.


+ > **ONNX bundled.** This cell ships `onnx/model_fp32.onnx` + `onnx/model_int8.onnx` (dynamic int8 of MatMul/Gemm) for direct ONNX Runtime inference. See `onnx/README.md` in this repo for input/output shapes and the latency profile.
+
+
  ---

  ## Overview
onnx/README.md CHANGED
@@ -1,8 +1,8 @@
- # ONNX exports for MimeLens-medium-bpe-16k-s1
+ # ONNX exports for MimeLens-medium-bpe-16k-s1 (seq_len=1024)

  Two ONNX exports are bundled here:

- - `model_fp32.onnx` + `model_fp32.onnx.data` float32 export via the legacy torch.onnx exporter; ~185 MB total. Load with `onnxruntime.InferenceSession`.
- - `model_int8.onnx` — dynamic int8 quantization via `onnxruntime.quantization.quantize_dynamic`; ~47 MB. Dynamic int8 was slower than fp32 on the CPU we measured (no AVX-VNNI); static (calibrated) quantization on modern int8-GEMM hardware should narrow the gap. See `data/p1/cpu_latency.json` in the GitHub repo for measured single-sample latencies.
+ - `model_fp32.onnx` (+ `model_fp32.onnx.data` if exported with external tensors) via the legacy torch.onnx exporter. Load with `onnxruntime.InferenceSession`.
+ - `model_int8.onnx` via `onnxruntime.quantization.quantize_dynamic`; dynamic int8 is slower than fp32 on this CPU (no AVX-VNNI; fp32 392 ms / int8 547 ms p50). Static (calibrated) quantization on modern int8-GEMM hardware should narrow the gap further.

- The inputs are `(input_ids: int64 [B, 1024], attention_mask: int64 [B, 1024])` and the output is `mean_pool_embedding: float32 [B, 512]`.
+ Input shapes are `(input_ids: int64 [B, 1024], attention_mask: int64 [B, 1024])` and the output is `mean_pool_embedding: float32 [B, 512]`.
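
The input/output contract stated in the new `onnx/README.md` could be exercised roughly as follows. This is a minimal sketch, not the author's reference code. The tensor names and shapes come from the README text above. The pad id of `0`, the helper names (`pad_batch`, `load_session`, `embed`, `p50_ms`), and the assumption that token ids were already produced by the model's own tokenizer are illustrative guesses. If the exported graph uses different names, `session.get_inputs()` and `session.get_outputs()` will list the real ones.

```python
import statistics
import time
from typing import Callable, List, Sequence, Tuple

SEQ_LEN = 1024  # from the README: inputs are [B, 1024]
PAD_ID = 0      # ASSUMPTION: the real pad token id is not stated in this diff


def pad_batch(
    token_ids: Sequence[Sequence[int]],
    seq_len: int = SEQ_LEN,
    pad_id: int = PAD_ID,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Truncate or right-pad each sequence to seq_len and build its attention mask."""
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in token_ids:
        kept = list(ids[:seq_len])
        n_pad = seq_len - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask


def load_session(model_path: str):
    """Open an ONNX Runtime session on CPU (imported lazily)."""
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def embed(session, token_ids: Sequence[Sequence[int]]):
    """Run one batch; per the README the result should be float32 [B, 512]."""
    import numpy as np

    ids, mask = pad_batch(token_ids)
    feeds = {
        "input_ids": np.asarray(ids, dtype=np.int64),
        "attention_mask": np.asarray(mask, dtype=np.int64),
    }
    # Output name taken from the README text.
    return session.run(["mean_pool_embedding"], feeds)[0]


def p50_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock latency of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

To compare the two bundled files on your own CPU, one option is to open a session for each of `onnx/model_fp32.onnx` and `onnx/model_int8.onnx` and pass `lambda: embed(session, batch)` to `p50_ms`. The README's 392 ms / 547 ms figures were measured on hardware without AVX-VNNI, so results elsewhere may differ.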