---
license: apache-2.0
language:
- en
- de
- zh
- multilingual
library_name: onnxruntime
tags:
- onnx
- embedding
- text-embedding
- retrieval
- sentence-similarity
- feature-extraction
- fp16
- fastembed
pipeline_tag: sentence-similarity
base_model: codefuse-ai/F2LLM-v2-0.6B
---

# F2LLM-v2-0.6B: FP16 ONNX

FP16 ONNX conversion of [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B), a Qwen3-derived retrieval embedding model with 1024-dimensional output, 32k context, and last-token pooling. The weights take ~1.2 GB, roughly 50% of the FP32 memory footprint, and the model is retrieval-quality-equivalent to FP32 in our gates.

## Quality

| Metric | Value | Threshold |
|---|---|---|
| `cos_min` vs PyTorch FP32 reference (6-text multilingual probe) | **0.999999** | ≥ 0.99 |
| `cos_mean` vs the same reference | 1.000000 | n/a |

Validated under [fastembed-rs](https://github.com/CrispStrobe/fastembed-rs)' `cosine_parity` harness on `probe/ort-rc12` (ORT 1.24).

## Files

| File | Size | Description |
|------|------|-------------|
| `model.fp16.onnx` | ~5 MB | ONNX graph header (weights stored as external data) |
| `model.fp16.onnx.data` | ~1.2 GB | FP16 weights |
| `tokenizer.json`, `config.json`, `tokenizer_config.json`, `special_tokens_map.json` | small | Tokenizer and model config |

Keep `model.fp16.onnx` and `model.fp16.onnx.data` in the same directory so the runtime can resolve the external weights.

## Conversion

Streaming FP32→FP16 conversion via `convert_fp16_streaming.py`, which bypasses the 2 GB protobuf serialization limit.

## Use via fastembed-rs

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// `mut` is needed if your fastembed version takes `&mut self` in `embed`;
// it is harmless otherwise.
let mut embedder = TextEmbedding::try_new(
    InitOptions::new(EmbeddingModel::F2LlmV2_0_6BFp16))?;
let vectors = embedder.embed(vec!["hello world"], None)?;
```

Pooling is last-token and is applied automatically by fastembed-rs. Prefix queries with the F2LLM instruct format (see the upstream F2LLM repo).

## License

Apache 2.0, inherited from the base model.
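
## Cosine parity sketch

The `cos_min` and `cos_mean` figures in the Quality table are per-text cosine similarities between FP16 and FP32 reference embeddings, reduced to a minimum and a mean. The snippet below is a minimal sketch of that kind of check, not the `cosine_parity` harness itself: the function names `cosine` and `parity` are hypothetical, and it assumes both sets of embeddings are already available as `Vec<f32>` rows in matching order.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Hypothetical parity reduction: returns (cos_min, cos_mean) over paired
/// reference and candidate embeddings, or None if the sets do not line up.
fn parity(reference: &[Vec<f32>], candidate: &[Vec<f32>]) -> Option<(f64, f64)> {
    if reference.len() != candidate.len() || reference.is_empty() {
        return None;
    }
    let mut min = f64::INFINITY;
    let mut sum = 0.0f64;
    for (r, c) in reference.iter().zip(candidate) {
        let s = cosine(r, c)?;
        min = min.min(s);
        sum += s;
    }
    Some((min, sum / reference.len() as f64))
}
```

Under the threshold from the Quality table, a run would count as passing when the returned `cos_min` is at least 0.99.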