---
license: apache-2.0
language:
- en
- de
- zh
- multilingual
library_name: onnxruntime
tags:
- onnx
- embedding
- text-embedding
- retrieval
- sentence-similarity
- feature-extraction
- quantized
- int8
- smoothquant
pipeline_tag: sentence-similarity
base_model: codefuse-ai/F2LLM-v2-0.6B
---

# F2LLM-v2-0.6B — INT8 ONNX (SmoothQuant α=0.8, per-channel)

INT8-quantized ONNX of [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B).
The weights total ~1.06 GB, roughly 50 % of the FP32 footprint, and the embeddings stay close to the upstream model on a small multilingual probe set (cos_min 0.932, see Quality).

## Quality

Per-row cosine similarity vs the upstream PyTorch model on a 6-text multilingual probe set
(English + German), computed with identical token IDs:

| Variant of this repo                  | cos_min | cos_mean |
|---------------------------------------|--------:|---------:|
| **`model.int8.onnx`** (SmoothQuant α=0.8) | **0.932** | **0.940** |
| `model.int8.vanilla.onnx` (kept for archive)  | 0.263 | 0.426 |

The previous artifact, produced with vanilla `quantize_dynamic`, collapsed on this Qwen3-class decoder LLM because of activation outliers: a small number of large-magnitude activations in the MatMul inputs exceed what the INT8 dynamic range can represent alongside the many small values, and naive per-tensor or per-channel quantization has nowhere to put them. The German "Klimawandel" sentence in our probe set was the worst case (cos≈0.26 on F2LLM, ≈0.64 on Octen).
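
A toy illustration of that effect, not taken from the model itself: with symmetric per-tensor INT8 quantization, a single large value sets the scale and the small values around it round to zero. The helper names and the sample rows are made up for this sketch.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


# Hypothetical activation rows: the second one carries a single large outlier.
balanced = [0.10, -0.20, 0.05, 0.30]
with_outlier = [0.10, -0.20, 0.05, 60.0]

for name, row in [("balanced", balanced), ("with_outlier", with_outlier)]:
    codes, scale = quantize_int8(row)
    print(name, codes, [round(v, 3) for v in dequantize(codes, scale)])
```

In the second row the three small entries all map to code 0, so their information is lost before the matrix multiply even runs.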

[SmoothQuant (Xiao et al. 2023)](https://arxiv.org/abs/2211.10438) migrates these outliers from activations into weights via a per-channel scaling: `Y = (X / s) · (s · W)`. After scaling, the outliers live in `s · W`, and the now-balanced `X / s` quantizes cleanly. α=0.8 was chosen following the LLM-class recommendation; a larger α moves more of the outlier magnitude into the weights at the cost of weight quantization quality.
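
The scaling can be sketched on toy matrices. This is a minimal illustration, assuming the per-channel scale from the SmoothQuant paper, `s_j = max|X_j|^α / max|W_j|^(1-α)`; it is not the `smoothquant_onnx.py` script, and the function names and data are hypothetical. `X` stands in for calibration activations and `W` for a MatMul weight of shape (in, out).

```python
def matmul(a, b):
    """Plain nested-list matrix product: a is (n x k), b is (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth_scales(x, w, alpha):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = [max(abs(row[j]) for row in x) for j in range(len(w))]
    wgt_max = [max(abs(v) for v in w[j]) for j in range(len(w))]
    return [a ** alpha / wm ** (1.0 - alpha) for a, wm in zip(act_max, wgt_max)]


def apply_smoothing(x, w, s):
    """Divide activation column j by s_j and multiply weight row j by s_j."""
    x_s = [[v / s[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s


def abs_max(m):
    return max(abs(v) for row in m for v in row)


# Toy data: the last activation channel carries the outliers.
X = [[0.1, -0.2, 40.0], [0.3, 0.1, -55.0]]
W = [[0.5, -0.4], [0.2, 0.6], [0.3, -0.1]]

s = smooth_scales(X, W, alpha=0.8)
X_s, W_s = apply_smoothing(X, W, s)

same = all(
    abs(a - b) < 1e-9
    for ra, rb in zip(matmul(X, W), matmul(X_s, W_s))
    for a, b in zip(ra, rb)
)
print("activation max:", abs_max(X), "->", round(abs_max(X_s), 3))
print("weight max:    ", abs_max(W), "->", round(abs_max(W_s), 3))
print("product preserved:", same)
```

The product is unchanged, while the activation range shrinks and the weight range grows, which is the trade the α parameter controls.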

The fastembed-rs cosine-parity CI harness asserts cos_min ≥ 0.90 against this artifact.
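
For readers who want to reproduce the metric outside the harness, the check amounts to the following. This is a sketch, not the fastembed-rs CI code: `parity_stats` is a hypothetical helper, and the vectors are placeholders standing in for the PyTorch (reference) and INT8 ONNX (candidate) embeddings of the same texts.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parity_stats(reference, candidate):
    """Per-row cosine between two embedding matrices: returns (cos_min, cos_mean)."""
    sims = [cosine(r, c) for r, c in zip(reference, candidate)]
    return min(sims), sum(sims) / len(sims)


# Placeholder embeddings, one row per probe text.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidate = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]

cos_min, cos_mean = parity_stats(reference, candidate)
print(f"cos_min={cos_min:.3f} cos_mean={cos_mean:.3f}")
assert cos_min >= 0.90, "parity check failed"
```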

## Files

| File | Description |
|------|-------------|
| `model.int8.onnx` | **Current SmoothQuant α=0.8 INT8 weights graph (use this).** |
| `model.int8.onnx.data` | External data sidecar for the above (`use_external_data_format=True`). |
| `model.int8.vanilla.onnx` | Archived original vanilla `quantize_dynamic` INT8 — DO NOT use for retrieval; kept only for reproducibility of historical reports. |
| `model.int8.vanilla.onnx.data` | External data sidecar for the archived vanilla artifact. |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`, `config.json`, `merges.txt`, `vocab.json` | Tokenizer + config copied from the upstream PyTorch repo. |

## Quantization recipe (reproducible)

```bash
# 1. SmoothQuant pre-processing (migrate outliers into weights)
python smoothquant_onnx.py \
    --fp32 model.onnx --output model.smoothed.fp32.onnx \
    --tokenizer <upstream snapshot> --alpha 0.8

# 2. Standard per-channel dynamic INT8 quantize on the smoothed FP32
python -c "from onnxruntime.quantization import quantize_dynamic, QuantType; \
    quantize_dynamic('model.smoothed.fp32.onnx', 'model.smoothed.int8.onnx', \
        per_channel=True, op_types_to_quantize=['MatMul'], \
        weight_type=QuantType.QInt8, use_external_data_format=True)"
```

The full driver lives at [github.com/CrispStrobe/fastembed-rs](https://github.com/CrispStrobe/fastembed-rs):
`tools/dump_reference.py` handles validation, and the `wip/validation` branch holds
`/scripts/smoothquant_onnx.py` and `/scripts/quant_smoothed_int8.py`.

## Usage

This artifact is consumed by [fastembed-rs](https://github.com/Anush008/fastembed-rs) under the
canonical model_code `cstr/F2LLM-v2-0.6B-ONNX-INT8` with `model_file = "model.int8.onnx"`. Direct ORT usage (ONNX Runtime ≥ 1.17) is straightforward: load the `.onnx` file, and ORT will discover the `.data` sidecar automatically as long as both files sit in the same directory.
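
A minimal loading sketch for direct ORT use, assuming the repo files have been downloaded into one local directory. The helper names `resolve_model` and `load_session` are hypothetical. The graph's input names are printed rather than assumed, since this card does not list them, and tokenization and pooling are left out for the same reason.

```python
from pathlib import Path


def resolve_model(model_dir, model_file="model.int8.onnx"):
    """Return the model path after checking that the .data sidecar sits next to it."""
    model_path = Path(model_dir) / model_file
    sidecar = model_path.with_name(model_path.name + ".data")
    missing = [p.name for p in (model_path, sidecar) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {model_dir}: {', '.join(missing)}")
    return model_path


def load_session(model_dir):
    """Open the INT8 graph on CPU and print its declared inputs."""
    import onnxruntime as ort  # imported lazily so the path check works without ORT

    session = ort.InferenceSession(
        str(resolve_model(model_dir)), providers=["CPUExecutionProvider"]
    )
    for node in session.get_inputs():
        print(node.name, node.shape, node.type)
    return session
```

Calling `load_session("path/to/local/snapshot")` (a placeholder path) fails early with a clear message if the sidecar was not downloaded alongside the graph.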

## License

[Apache 2.0](https://github.com/Anush008/fastembed-rs/blob/main/LICENSE), inherited from
upstream [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B).

---

## Change history

- **2026-05-03** — Replaced `model.int8.onnx` with the SmoothQuant α=0.8 export.  Original vanilla INT8 archived as `model.int8.vanilla.onnx`.  Reason: vanilla `quantize_dynamic` produced cos_min=0.26 on this Qwen3-class decoder LLM (catastrophic outlier collapse on multilingual inputs); SmoothQuant recovers cos_min=0.93.
- *Original upload* — vanilla `quantize_dynamic` per-channel INT8 export.  Now archived.