---
license: apache-2.0
language:
- en
- de
- zh
- multilingual
library_name: onnxruntime
tags:
- onnx
- embedding
- text-embedding
- retrieval
- sentence-similarity
- feature-extraction
- fp16
- fastembed
pipeline_tag: sentence-similarity
base_model: codefuse-ai/F2LLM-v2-0.6B
---

# F2LLM-v2-0.6B — FP16 ONNX

FP16-converted ONNX of [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B), a Qwen3-derived 1024-dim retrieval embedding model with 32k context and last-token pooling.

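The weights total ~1.2 GB, about half the memory footprint of FP32. Embeddings are effectively identical to the FP32 reference in our parity gate (`cos_min` 0.999999 against a ≥ 0.99 threshold).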

## Quality

| Metric | Value | Threshold |
|---|---|---|
| `cos_min` vs PyTorch FP32 reference (6-text multilingual probe) | **0.999999** | ≥ 0.99 |
| `cos_mean` vs same | 1.000000 | — |

Validated under [fastembed-rs](https://github.com/CrispStrobe/fastembed-rs)' `cosine_parity` harness on `probe/ort-rc12` (ORT 1.24).
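
For readers who want to reproduce the two metrics, here is a minimal sketch of how `cos_min` and `cos_mean` can be computed from paired embeddings. It is not the `cosine_parity` harness itself; the function names `cosine` and `parity_stats` are hypothetical, and it assumes one candidate (FP16) vector and one reference (FP32) vector per probe text.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns 0.0 if either vector has zero norm (an assumption of this sketch).
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Hypothetical helper: returns (cos_min, cos_mean) over paired rows,
/// where row i of `candidate` and `reference` embed the same text.
fn parity_stats(candidate: &[Vec<f32>], reference: &[Vec<f32>]) -> (f64, f64) {
    assert_eq!(candidate.len(), reference.len(), "need one reference per candidate");
    assert!(!candidate.is_empty(), "need at least one pair");
    let sims: Vec<f64> = candidate
        .iter()
        .zip(reference)
        .map(|(c, r)| cosine(c, r))
        .collect();
    let cos_min = sims.iter().copied().fold(f64::INFINITY, f64::min);
    let cos_mean = sims.iter().sum::<f64>() / sims.len() as f64;
    (cos_min, cos_mean)
}
```

A gate in this style would pass when `cos_min` is at or above the threshold in the table, so a single badly degraded text fails the run even if the mean looks fine.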

## Files

| File | Size | Description |
|------|------|-------------|
| `model.fp16.onnx` | ~5 MB | ONNX header (external data) |
| `model.fp16.onnx.data` | ~1.2 GB | FP16 weights |
| `tokenizer.json`, `config.json`, `tokenizer_config.json`, `special_tokens_map.json` | small | tokenizer + model config |

## Conversion

The weights were converted FP32→FP16 in a streaming pass with `convert_fp16_streaming.py`, which avoids the 2 GB protobuf serialization limit.

## Use via fastembed-rs

```rust
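use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// `mut` because recent fastembed releases take `&mut self` in `embed`.
let mut embedder = TextEmbedding::try_new(
    InitOptions::new(EmbeddingModel::F2LlmV2_0_6BFp16))?;
let vectors = embedder.embed(vec!["hello world"], None)?;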
```

Pooling is last-token and is applied automatically by fastembed-rs. For queries, prepend the F2LLM instruct-format prefix (see the upstream F2LLM repo).
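
If you run the ONNX graph outside fastembed-rs, you have to apply the pooling yourself. The following is a minimal sketch of last-token pooling for one sequence, not the fastembed-rs implementation. It assumes the model output is available as per-token hidden states of shape `[seq_len][dim]` alongside the tokenizer's attention mask; `last_token_pool` is a hypothetical name.

```rust
/// Hypothetical helper: picks the hidden state of the last non-padding token.
///
/// `hidden` holds one vector per token position; `attention_mask` has a
/// nonzero entry for each real token. Returns `None` if the lengths differ
/// or the mask contains no real tokens.
fn last_token_pool(hidden: &[Vec<f32>], attention_mask: &[i64]) -> Option<Vec<f32>> {
    if hidden.len() != attention_mask.len() {
        return None;
    }
    // Searching from the right handles both padding sides: with right
    // padding this is the last real token, with left padding it is simply
    // the final position.
    let last = attention_mask.iter().rposition(|&m| m != 0)?;
    Some(hidden[last].clone())
}
```

For a batch, the same selection is applied row by row, each row using its own mask.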

## License

Apache 2.0, inherited from the base model.