---
base_model: Nanbeige/Nanbeige4.1-3B
tags:
- nvfp4
- quantized
- blackwell
- bilingual
- chinese
- english
pipeline_tag: text-generation
language:
- en
- zh
---

# Nanbeige 4.1 3B - NVFP4

This is an **NVFP4 quantized version** of [Nanbeige/Nanbeige4.1-3B](https://huggingface.co/Nanbeige/Nanbeige4.1-3B), optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

**✅ FULLY FUNCTIONAL** - This model loads and generates correctly in vLLM 0.15.1!

## Model Details

- **Base Model:** Nanbeige 4.1 3B (Bilingual Chinese/English reasoning model)
- **Parameters:** 3 billion
- **Quantization:** NVFP4 (4-bit floating point)
- **Size:** 3.3GB (down from ~6GB, 45% reduction)
- **Quantizer:** SILVERTHRONE
- **Method:** llmcompressor with 512 calibration samples

## Quantization Strategy

- ✅ **Quantized to NVFP4:** All linear layers except `lm_head`
- ❌ **Preserved at full precision:** Token embeddings, `lm_head`

**Calibration:**

- Dataset: open-platypus
- Samples: 512
- Sequence length: 1024

## Hardware Requirements

- **GPU:** NVIDIA RTX 50xx series (Blackwell) or newer
- **VRAM:** ~6-8GB recommended
- **Framework:** vLLM 0.15.1+

**Note:** This model cannot be loaded with llama.cpp, the transformers library, or GGUF-based tools. The usage example relies on transformers for the tokenizer only.

## Usage

### With vLLM

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load the tokenizer (used only to apply the chat template)
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/nanbeige4.1-3b-nvfp4")

# Load the model (IMPORTANT: set max_model_len=8192)
model = LLM(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    gpu_memory_utilization=0.85,
    max_model_len=8192,  # Required! Default 262K won't fit in 16GB VRAM
)

# Format the message with the chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    stop=["<|im_end|>"],
)
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text
print(response)
```

## Chat Template

Nanbeige uses the ChatML format with `<think>` reasoning blocks:

```
<|im_start|>system
你是南北阁,一款由BOSS直聘自主研发并训练的专业大语言模型。<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{reasoning steps}
</think>
{final answer}<|im_end|>
```

The system prompt translates to: "You are Nanbeige, a professional large language model independently developed and trained by BOSS Zhipin."

The model outputs its reasoning inside `<think>` tags, then provides the final answer. You can parse the response to extract either part.

## Performance

- **Inference Speed:** ~110-120 tokens/sec on RTX 5060 Ti (16GB)
- **Quality:** Good - coherent reasoning, correct answers on factual questions
- **Language Support:** Bilingual (Chinese/English)

**Example outputs:**

- "What is 2+2?" → Correctly identifies answer as 4
- "What is the capital of France?" → Correctly identifies Paris

## Known Limitations

1. **Does NOT load with the transformers library** - Model loading crashes with `KeyError: 'weight_scale'` due to incomplete NVFP4 support in compressed-tensors. Loading only the tokenizer works.
2. **vLLM only** - This is by design; NVFP4 is optimized for vLLM inference.
3. **Must set max_model_len=8192** - The default 262K context needs 16GB for the KV cache alone.
4. **English-only calibration** - The model is bilingual but was quantized with an English-only dataset (open-platypus), which may slightly affect Chinese performance.

## Testing Environment

- **GPU:** NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
- **CPU:** AMD Ryzen 9 5950X
- **RAM:** 32GB DDR4
- **OS:** Ubuntu 24.04.3 LTS (Noble)
- **vLLM:** 0.15.1
- **transformers:** 4.57.3 (for tokenizer only)
- **llmcompressor:** Latest (Feb 2026)

## Comparison: Text-Only NVFP4 Works!

Unlike vision-language models (e.g., Apriel 1.6), this text-only NVFP4 quantization works perfectly in vLLM. The `weight_scale` error only affects:

- Vision-language models (LLaVA architecture)
- transformers library loading

Pure text models like Nanbeige work great with vLLM!

## Credits

- **Original Model:** [Nanbeige Team / BOSS直聘](https://huggingface.co/Nanbeige)
- **Quantization:** SILVERTHRONE ([@SILVERTHRONE](https://huggingface.co/SILVERTHRONE))
- **Method:** [Neural Magic llmcompressor](https://github.com/vllm-project/llm-compressor)

## License

Unspecified (inherited from base model)
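
## Parsing Reasoning Output

The Chat Template section notes that a response can be parsed into its reasoning and its final answer. Here is a minimal sketch of one way to do that. It assumes the reasoning is wrapped in `<think>...</think>` tags; `split_reasoning` is a hypothetical helper written for this example and is not part of vLLM or the model repository.

```python
import re

# Assumption: reasoning is delimited by <think>...</think>. Adjust the tag
# names if the tokenizer's chat template uses different markers.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # If the prompt already ended with the opening tag, the completion
        # may contain only the closing tag.
        head, sep, tail = text.partition("</think>")
        if sep:
            return head.strip(), tail.strip()
        # No reasoning markers at all: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."
    reasoning, answer = split_reasoning(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

In the vLLM example, `response` could be passed straight to `split_reasoning`. If generation stops at `max_tokens` before the closing tag appears, the helper returns the whole text as the answer, so a larger `max_tokens` may be needed for long reasoning chains.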