---
base_model: Nanbeige/Nanbeige4.1-3B
tags:
  - nvfp4
  - quantized
  - blackwell
  - bilingual
  - chinese
  - english
pipeline_tag: text-generation
language:
  - en
  - zh
---

# Nanbeige 4.1 3B - NVFP4

This is an **NVFP4 quantized version** of [Nanbeige/Nanbeige4.1-3B](https://huggingface.co/Nanbeige/Nanbeige4.1-3B), optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

**✅ FULLY FUNCTIONAL** - This model loads and generates correctly in vLLM 0.15.1!

## Model Details

- **Base Model:** Nanbeige 4.1 3B (Bilingual Chinese/English reasoning model)
- **Parameters:** 3 billion
- **Quantization:** NVFP4 (4-bit floating point)
- **Size:** 3.3GB (down from ~6GB, 45% reduction)
- **Quantizer:** SILVERTHRONE
- **Method:** llmcompressor with 512 calibration samples

## Quantization Strategy

- ✅ **Quantized to NVFP4:** All linear layers
- ❌ **Preserved at full precision:** Token embeddings, lm_head

**Calibration:**
- Dataset: open-platypus
- Samples: 512
- Sequence length: 1024

## Hardware Requirements

- **GPU:** NVIDIA RTX 50xx series (Blackwell) or newer
- **VRAM:** ~6-8GB recommended
- **Framework:** vLLM 0.15.1+

**Note:** This model will NOT work with llama.cpp, the transformers library, or GGUF-based tools.
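
A quick pre-flight check can catch an unsupported GPU before vLLM starts loading weights. The sketch below is only an illustration and not part of vLLM: `supports_nvfp4` and `detect_capability` are hypothetical helper names, and the threshold assumes that Blackwell GPUs report a CUDA compute capability with major version 10 or higher (RTX 50xx cards report 12.0).

```python
from typing import Optional, Tuple


def supports_nvfp4(capability: Tuple[int, int]) -> bool:
    """Return True if the compute capability looks like Blackwell or newer.

    Assumption: Blackwell-class GPUs report a major version >= 10.
    """
    major, _minor = capability
    return major >= 10


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch  # optional dependency, only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    elif supports_nvfp4(cap):
        print(f"Compute capability {cap[0]}.{cap[1]}: NVFP4 should be usable.")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]}: pre-Blackwell GPU, NVFP4 is not expected to work.")
```

The capability test is kept separate from the detection step so it can be reused with values obtained some other way, for example from `nvidia-smi`.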

## Usage

### With vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/nanbeige4.1-3b-nvfp4")

# Load model (IMPORTANT: set max_model_len=8192)
model = LLM(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    gpu_memory_utilization=0.85,
    max_model_len=8192  # Required! Default 262K won't fit in 16GB VRAM
)

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    stop=["<|im_end|>"]
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)
```

## Chat Template

Nanbeige uses the ChatML format with `<think>` reasoning blocks:
```
<|im_start|>system
你是南北阁，一款由BOSS直聘自主研发并训练的专业大语言模型。<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{reasoning steps}
</think>
{final answer}<|im_end|>
```

The model outputs reasoning in `<think>` tags, then provides the final answer. You can parse the response to extract either part.
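
One way to separate the two parts is a small string split on the closing `</think>` tag. This is a minimal sketch, assuming the response contains at most one reasoning block; `split_reasoning` is a hypothetical helper name, not part of vLLM or the model's tooling.

```python
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split a raw completion into (reasoning, final_answer).

    If no closing </think> tag is present (for example because generation
    hit max_tokens mid-reasoning), the reasoning part is returned empty and
    the whole text is treated as the answer.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return "", text.strip()
    # The opening tag may be missing if the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    raw = "<think>\n2 + 2 equals 4.\n</think>\nThe answer is 4."
    reasoning, answer = split_reasoning(raw)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

Because reasoning can be long, a small `max_tokens` value may cut the output off before `</think>` is emitted; in that case the helper returns the truncated text as the answer.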

## Performance

- **Inference Speed:** ~110-120 tokens/sec on RTX 5060 Ti (16GB)
- **Quality:** Coherent reasoning and correct answers in informal spot checks on simple factual questions (examples below)
- **Language Support:** Bilingual (Chinese/English)

**Example outputs:**
- "What is 2+2?" → Correctly identifies answer as 4
- "What is the capital of France?" → Correctly identifies Paris

## Known Limitations

1. **Does NOT work with the transformers library** - Loading crashes with `KeyError: 'weight_scale'` due to incomplete NVFP4 support in compressed-tensors
2. **vLLM only** - This is by design; the NVFP4 checkpoint targets vLLM inference
3. **Must set max_model_len=8192** - The default 262K context requires about 16GB for the KV cache alone
4. **English-only calibration** - The model is bilingual, but it was calibrated with an English-only dataset (open-platypus), which may slightly affect Chinese performance
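
The context-length limit follows from how the KV cache grows linearly with sequence length. The sketch below shows the standard size estimate for a single sequence. The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not read from this model's `config.json`; substitute the real values before relying on the numbers. A 16-bit KV cache is also assumed.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # assumption: 16-bit (FP16/BF16) KV cache
) -> int:
    """Estimate KV cache size in bytes for one sequence.

    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


if __name__ == "__main__":
    # Placeholder architecture values (hypothetical, for illustration only).
    layers, kv_heads, dim = 32, 4, 128
    for context in (8192, 262144):
        gib = kv_cache_bytes(layers, kv_heads, dim, context) / 1024**3
        print(f"{context:>7} tokens -> {gib:.1f} GiB")
```

With these placeholder values the estimate is 0.5 GiB at 8192 tokens and 16 GiB at 262144 tokens, which illustrates why a long default context can exhaust VRAM even when the quantized weights themselves are small.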

## Testing Environment

- **GPU:** NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
- **CPU:** AMD Ryzen 9 5950X
- **RAM:** 32GB DDR4
- **OS:** Ubuntu 24.04.3 LTS (Noble)
- **vLLM:** 0.15.1
- **transformers:** 4.57.3 (for tokenizer only)
- **llmcompressor:** Latest (Feb 2026)

## Comparison: Text-Only NVFP4 Works!

Unlike vision-language models (e.g., Apriel 1.6), this text-only NVFP4 quantization loads and generates correctly in vLLM. The `weight_scale` error only appears in two cases:
- Vision-language models (LLaVA architecture)
- Loading through the transformers library (including this model, see Known Limitations)

Pure text models like Nanbeige work great with vLLM!

## Credits

- **Original Model:** [Nanbeige Team / BOSS直聘](https://huggingface.co/Nanbeige)
- **Quantization:** SILVERTHRONE ([@SILVERTHRONE](https://huggingface.co/SILVERTHRONE))
- **Method:** [Neural Magic llmcompressor](https://github.com/vllm-project/llm-compressor)

## License

Unspecified (inherited from base model)