---
license: apache-2.0
language:
- en
- de
- zh
- multilingual
library_name: onnxruntime
tags:
- onnx
- embedding
- text-embedding
- retrieval
- sentence-similarity
- feature-extraction
- quantized
- int8
- smoothquant
pipeline_tag: sentence-similarity
base_model: codefuse-ai/F2LLM-v2-0.6B
---

# F2LLM-v2-0.6B: INT8 ONNX (SmoothQuant α=0.8, per-channel)

INT8-quantized ONNX of [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B). The artifact is ~1.06 GB, about 50 % of the FP32 memory footprint, and is validated to keep retrieval quality on a multilingual probe set.

## Quality

Per-row cosine similarity vs the upstream PyTorch model on a 6-text multilingual probe set (English + German), computed with identical token IDs:

| Variant of this repo | cos_min | cos_mean |
|----------------------------------------------|----------:|----------:|
| **`model.int8.onnx`** (SmoothQuant α=0.8) | **0.932** | **0.940** |
| `model.int8.vanilla.onnx` (kept for archive) | 0.263 | 0.426 |

The previous (vanilla `quantize_dynamic`) artifact collapsed on Qwen3-class decoder LLMs because of activation outliers: matrix multiplies whose inputs contain a small number of large-magnitude activations exceed the INT8 dynamic range, and naive per-tensor or per-channel quantization has nowhere to put them. The German "Klimawandel" sentence in our probe set was the worst case (cos≈0.26 on F2LLM, ≈0.64 on Octen).

[SmoothQuant (Xiao et al. 2023)](https://arxiv.org/abs/2211.10438) migrates these outliers from activations into weights via a per-channel scaling: `Y = (X / s) · (s · W)`. After scaling, the outliers live in `s · W`, and the now-balanced `X / s` quantizes cleanly. α=0.8 was the LLM-class recommendation; larger α moves more of the outlier magnitude into the weights at the cost of weight quantization quality.

The fastembed-rs cosine-parity CI harness asserts cos_min ≥ 0.90 against this artifact.

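
The scaling identity above can be illustrated on a toy matrix multiply. This is a minimal NumPy sketch, not the repo's `smoothquant_onnx.py`: it assumes the scale formula from the SmoothQuant paper, `s_j = max|X_j|^α / max|W_j|^(1-α)`, and uses a simplified symmetric quantize-dequantize helper (`fake_quant_int8`, a hypothetical name) in place of ONNX Runtime's dynamic quantizer. The toy shapes and the single outlier channel are made up for illustration.

```python
import numpy as np

def smooth_scales(X, W, alpha=0.8, eps=1e-8):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(X).max(axis=0)   # one value per input channel (column of X)
    w_max = np.abs(W).max(axis=1)     # one value per input channel (row of W)
    return np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1.0 - alpha)

def fake_quant_int8(A, axis=None):
    """Symmetric INT8 quantize-dequantize (a simplification, not ORT's scheme).

    axis=None uses one scale for the whole tensor; axis=0 uses one scale per column.
    """
    amax = np.abs(A).max(axis=axis, keepdims=True)
    scale = np.maximum(amax, 1e-8) / 127.0
    return np.clip(np.round(A / scale), -127, 127) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))         # toy activations: 64 tokens, 16 channels
X[:, 3] *= 50.0                       # one outlier channel
W = rng.normal(size=(16, 8)) * 0.1    # toy weights: 16 inputs, 8 outputs

s = smooth_scales(X, W, alpha=0.8)
X_s = X / s                           # balanced activations
W_s = W * s[:, None]                  # outliers now live in the weights

Y = X @ W
# The rescaling is mathematically exact before quantization.
assert np.allclose(Y, X_s @ W_s)

# Per-tensor activations, per-output-channel weights, with and without smoothing.
err_plain = np.abs(fake_quant_int8(X) @ fake_quant_int8(W, axis=0) - Y).mean()
err_smooth = np.abs(fake_quant_int8(X_s) @ fake_quant_int8(W_s, axis=0) - Y).mean()
print(f"mean abs error, plain: {err_plain:.4f}  smoothed: {err_smooth:.4f}")
```

In this toy setup the outlier channel sets the activation scale for the whole tensor, so the other channels lose most of their resolution; after smoothing, the activation range is much tighter and the extra range is absorbed by per-channel weight scales.
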
## Files

| File | Description |
|------|-------------|
| `model.int8.onnx` | **Current SmoothQuant α=0.8 INT8 weights graph (use this).** |
| `model.int8.onnx.data` | External data sidecar for the above (`use_external_data_format=True`). |
| `model.int8.vanilla.onnx` | Archived original vanilla `quantize_dynamic` INT8. DO NOT use for retrieval; kept only for reproducibility of historical reports. |
| `model.int8.vanilla.onnx.data` | External data sidecar for the archived vanilla artifact. |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`, `config.json`, `merges.txt`, `vocab.json` | Tokenizer + config copied from the upstream PyTorch repo. |

## Quantization recipe (reproducible)

```bash
# 1. SmoothQuant pre-processing (migrate outliers into weights)
python smoothquant_onnx.py \
    --fp32 model.onnx --output model.smoothed.fp32.onnx \
    --tokenizer --alpha 0.8

# 2. Standard per-channel dynamic INT8 quantize on the smoothed FP32
python -c "from onnxruntime.quantization import quantize_dynamic, QuantType; \
quantize_dynamic('model.smoothed.fp32.onnx', 'model.smoothed.int8.onnx', \
    per_channel=True, op_types_to_quantize=['MatMul'], \
    weight_type=QuantType.QInt8, use_external_data_format=True)"
```

The full driver lives at [github.com/CrispStrobe/fastembed-rs](https://github.com/CrispStrobe/fastembed-rs): validation is in `tools/dump_reference.py`, and the quantization scripts are on the `wip/validation` branch as `/scripts/smoothquant_onnx.py` and `/scripts/quant_smoothed_int8.py`.

## Usage

This artifact is consumed by [fastembed-rs](https://github.com/Anush008/fastembed-rs) under the canonical model_code `cstr/F2LLM-v2-0.6B-ONNX-INT8` with `model_file = "model.int8.onnx"`.

Direct ORT usage (ONNX Runtime ≥ 1.17) is straightforward: load the `.onnx` and ORT will discover the `.data` sidecar automatically as long as both files sit in the same directory.

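
A minimal sketch of the direct ORT path, assuming the sidecar is named `<model file>.data` as in the Files table. The helper names `sidecar_for` and `open_session` are hypothetical. Because this card does not document the graph's input and output tensor names, the sketch prints them rather than assuming any.

```python
from pathlib import Path

def sidecar_for(model_path):
    """Return the external-data sidecar path expected next to an .onnx file."""
    p = Path(model_path)
    return p.with_name(p.name + ".data")

def open_session(model_path):
    """Create a CPU InferenceSession after checking the sidecar sits beside the graph."""
    # Imported lazily so the path helper above works without ONNX Runtime installed.
    import onnxruntime as ort

    model_path = Path(model_path)
    sidecar = sidecar_for(model_path)
    if not sidecar.is_file():
        raise FileNotFoundError(f"expected external data file at {sidecar}")

    session = ort.InferenceSession(str(model_path), providers=["CPUExecutionProvider"])

    # Tensor names depend on the export, so list them instead of guessing.
    for tensor in session.get_inputs():
        print("input ", tensor.name, tensor.shape, tensor.type)
    for tensor in session.get_outputs():
        print("output", tensor.name, tensor.shape, tensor.type)
    return session
```

Calling `open_session("model.int8.onnx")` from the directory that holds both files should print the tensor names needed to build the input feed for `session.run`.
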
## License

[Apache 2.0](https://github.com/Anush008/fastembed-rs/blob/main/LICENSE), inherited from upstream [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B).

---

## Change history

- **2026-05-03**: Replaced `model.int8.onnx` with the SmoothQuant α=0.8 export. Original vanilla INT8 archived as `model.int8.vanilla.onnx`. Reason: vanilla `quantize_dynamic` produced cos_min=0.26 on this Qwen3-class decoder LLM (catastrophic outlier collapse on multilingual inputs); SmoothQuant recovers cos_min=0.93.
- *Original upload*: vanilla `quantize_dynamic` per-channel INT8 export. Now archived.