NVFP4 self-quant (llm-compressor): FP8 attn/GDN + NVFP4-W4A16 experts; beats redhat/unsloth on quality+speed+size

Files changed (13)

.gitattributes +1 -0
README.md +65 -0
benchmark.png +0 -0
chat_template.jinja +154 -0
config.json +0 -0
generation_config.json +13 -0
model.safetensors +3 -0
preprocessor_config.json +21 -0
processor_config.json +60 -0
recipe.yaml +60 -0
tokenizer.json +3 -0
tokenizer_config.json +32 -0
video_preprocessor_config.json +21 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,65 @@

+---
+license: apache-2.0
+base_model: Qwen/Qwen3.6-35B-A3B
+base_model_relation: quantized
+tags:
+- nvfp4
+- fp4
+- llm-compressor
+- compressed-tensors
+- vllm
+- moe
+- qwen3_5_moe
+language:
+- en
+pipeline_tag: text-generation
+---
+# Qwen3.6-35B-A3B-NVFP4 (self-quantized, llm-compressor)
+NVFP4 (4-bit) quantization of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), the hybrid
+**Gated-DeltaNet + 256-expert MoE** model (35B total / 3B active, multimodal, thinking-by-default).
+Produced in-house with **llm-compressor / compressed-tensors**, tuned for **NVIDIA Blackwell (sm120, RTX PRO 6000)**.
+**22.5 GB**, roughly one third of the 67 GB BF16 base and the **smallest** of the four NVFP4 builds compared below. It
+matches or beats the other three on MMLU-Pro and GSM8K and is faster than two of them.
+## Recipe (mixed-precision)
+| Component | Precision |
+|---|---|
+| Routed experts (256/layer, fused) | **NVFP4 weight-only (W4A16)** group-16 |
+| Self-attention q/k/v/o + GDN `in_proj_*`/`out_proj` + shared-expert | **FP8** (W8A8, block-128 weight / dynamic group-128 act) |
+| Routers, `lm_head`, embeddings, conv1d/SSM, vision tower, MTP | **BF16** |
+
+Why weight-only NVFP4 on experts: on sm120 the native FP4 MoE GEMM is unavailable, so NVFP4 experts are served through the
+**Marlin W4A16** path whether or not their activations were quantized. W4A16 therefore runs at the same speed as W4A4 while avoiding activation quantization error.
+Calibrated with `moe_calibrate_all_experts=True`, so every one of the 256 experts receives calibration statistics.
+## Benchmarks (measured on RTX PRO 6000 / sm120, vLLM 0.23)
+lm-eval (thinking-on, `max_gen_toks=8192`, flexible-extract); speed from engine `/metrics`, TP1 solo.
+| Build | MMLU-Pro | GSM8K | single-stream tok/s | N16 tok/s | size |
+|---|---|---|---|---|---|
+| **this model (our self-quant)** | **0.825** | **0.920** | 200.6 | 1581 | **22.5 GB** |
+| unsloth/…-NVFP4 | 0.825 | 0.890 | 175.3 | 1493 | 24.7 GB |
+| RedHatAI/…-NVFP4 | 0.819 | 0.910 | 170.0 | 1422 | 24.0 GB |
+| nvidia/…-NVFP4 | 0.817 | 0.910 | **223.6** | **1646** | 23.4 GB |
+
+![benchmark](benchmark.png)
+
+- **Pareto-dominates** the RedHatAI and unsloth builds: equal or better on both quality metrics, faster in both throughput columns, and smaller.
+- **Tied-best quality** (highest GSM8K, tied-highest MMLU-Pro) and the **smallest** build.
+- The nvidia build keeps the speed crown in both throughput columns. It sits on the sm120 hardware optimum (FP8 attention + W4A16 experts via Marlin); this build uses the same scheme and trails it only on raw decode throughput (200.6 vs 223.6 tok/s single-stream, 1581 vs 1646 tok/s at N16).
+## Serving (vLLM ≥ 0.23)
+```bash
+vllm serve <this-repo> --served-model-name qwen3.6-35b-a3b-nvfp4 \
+  --max-model-len 262144 --gpu-memory-utilization 0.90 \
+  --trust-remote-code --reasoning-parser qwen3 \
+  --enable-auto-tool-choice --tool-call-parser qwen3_xml
+```
+Quantized by `kyaky` with llm-compressor. Base model © Qwen, Apache-2.0.
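
Note on the "NVFP4 weight-only (W4A16) group-16" row of the recipe table: the sketch below illustrates, in plain Python, what group-wise FP4 fake quantization does to one weight row. It is a simplified illustration, not the llm-compressor implementation. The function names are hypothetical, and the scale is kept as an ordinary float here, whereas the recipe stores group scales as `torch.float8_e4m3fn`.

```python
# Magnitudes representable by a 4-bit E2M1 float (the element type of NVFP4).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # matches "group-16" in the recipe table


def quantize_group(values):
    """Fake-quantize one group; returns (scale, dequantized values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0.0] * len(values)
    # Map the largest magnitude in the group onto the largest FP4 value (6.0).
    # Assumption: the scale stays a Python float instead of being rounded to FP8.
    scale = amax / FP4_GRID[-1]
    out = []
    for v in values:
        # Snap |v| / scale to the nearest representable FP4 magnitude.
        q = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(q * scale if v >= 0 else -q * scale)
    return scale, out


def quantize_row(row):
    """Apply group-wise fake quantization along one weight row."""
    assert len(row) % GROUP_SIZE == 0, "row length must be a multiple of 16"
    result = []
    for start in range(0, len(row), GROUP_SIZE):
        _, deq = quantize_group(row[start:start + GROUP_SIZE])
        result.extend(deq)
    return result


if __name__ == "__main__":
    row = [i / 10 for i in range(-8, 8)]
    for original, restored in zip(row, quantize_row(row)):
        print(f"{original:+.2f} -> {restored:+.4f}")
```

Because the widest gap in the FP4 grid is between 4.0 and 6.0, the per-element error in this sketch is bounded by one group scale, which is why small groups with their own scale keep the error local.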

benchmark.png ADDED
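
The image itself is binary, but the "Pareto-dominates" claim in the README can be checked from the benchmark table alone. The sketch below copies the five numeric columns from that table; the `Build` type and `dominates` helper are hypothetical names introduced for the check.

```python
from typing import NamedTuple


class Build(NamedTuple):
    name: str
    mmlu_pro: float
    gsm8k: float
    single_tps: float  # "single-stream tok/s" column
    n16_tps: float     # "N16 tok/s" column
    size_gb: float


# Values copied from the README benchmark table.
BUILDS = [
    Build("this model", 0.825, 0.920, 200.6, 1581, 22.5),
    Build("unsloth", 0.825, 0.890, 175.3, 1493, 24.7),
    Build("RedHatAI", 0.819, 0.910, 170.0, 1422, 24.0),
    Build("nvidia", 0.817, 0.910, 223.6, 1646, 23.4),
]


def dominates(a, b):
    """True if a is no worse than b on every metric and better on at least one."""
    # Higher is better for scores and throughput; lower is better for size,
    # so size is negated to share the same comparison direction.
    pairs = [
        (a.mmlu_pro, b.mmlu_pro),
        (a.gsm8k, b.gsm8k),
        (a.single_tps, b.single_tps),
        (a.n16_tps, b.n16_tps),
        (-a.size_gb, -b.size_gb),
    ]
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


ours = BUILDS[0]
dominated = [b.name for b in BUILDS[1:] if dominates(ours, b)]
print(dominated)  # ['unsloth', 'RedHatAI']
```

The nvidia build is neither dominated by this build nor dominates it: it is faster in both throughput columns but scores lower on both quality metrics and is larger.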

chat_template.jinja ADDED

	@@ -0,0 +1,154 @@

+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- macro render_content(content, do_vision_count, is_system_content=false) %}
+    {%- if content is string %}
+        {{- content }}
+    {%- elif content is iterable and content is not mapping %}
+        {%- for item in content %}
+            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain images.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'Picture ' ~ image_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
+            {%- elif 'video' in item or item.type == 'video' %}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain videos.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'Video ' ~ video_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
+            {%- elif 'text' in item %}
+                {{- item.text }}
+            {%- else %}
+                {{- raise_exception('Unexpected item type in content.') }}
+            {%- endif %}
+        {%- endfor %}
+    {%- elif content is none or content is undefined %}
+        {{- '' }}
+    {%- else %}
+        {{- raise_exception('Unexpected content type.') }}
+    {%- endif %}
+{%- endmacro %}
+{%- if not messages %}
+    {{- raise_exception('No messages provided.') }}
+{%- endif %}
+{%- if tools and tools is iterable and tools is not mapping %}
+    {{- '<|im_start|>system\n' }}
+    {{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>" }}
+    {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
+    {%- if messages[0].role == 'system' %}
+        {%- set content = render_content(messages[0].content, false, true)|trim %}
+        {%- if content %}
+            {{- '\n\n' + content }}
+        {%- endif %}
+    {%- endif %}
+    {{- '<|im_end|>\n' }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {%- set content = render_content(messages[0].content, false, true)|trim %}
+        {{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" %}
+        {%- set content = render_content(message.content, false)|trim %}
+        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
+            {%- set ns.multi_step_tool = false %}
+            {%- set ns.last_query_index = index %}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if ns.multi_step_tool %}
+    {{- raise_exception('No user query found in messages.') }}
+{%- endif %}
+{%- for message in messages %}
+    {%- set content = render_content(message.content, true)|trim %}
+    {%- if message.role == "system" %}
+        {%- if not loop.first %}
+            {{- raise_exception('System message must be at the beginning.') }}
+        {%- endif %}
+    {%- elif message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
+            {%- endif %}
+        {%- endif %}
+        {%- set reasoning_content = reasoning_content|trim %}
+        {%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
+            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '\n' + content }}
+        {%- endif %}
+        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if tool_call.function is defined %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {%- if loop.first %}
+                    {%- if content|trim %}
+                        {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                    {%- else %}
+                        {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                    {%- endif %}
+                {%- else %}
+                    {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
+                {%- endif %}
+                {%- if tool_call.arguments is defined %}
+                    {%- for args_name, args_value in tool_call.arguments|items %}
+                        {{- '<parameter=' + args_name + '>\n' }}
+                        {%- set args_value = args_value | string if args_value is string else args_value | tojson | safe %}
+                        {{- args_value }}
+                        {{- '\n</parameter>\n' }}
+                    {%- endfor %}
+                {%- endif %}
+                {{- '</function>\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.previtem and loop.previtem.role != "tool" %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- content }}
+        {{- '\n</tool_response>' }}
+        {%- if not loop.last and loop.nextitem.role != "tool" %}
+            {{- '<|im_end|>\n' }}
+        {%- elif loop.last %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- else %}
+        {{- raise_exception('Unexpected message role.') }}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+    {%- if enable_thinking is defined and enable_thinking is false %}
+        {{- '<think>\n\n</think>\n\n' }}
+    {%- else %}
+        {{- '<think>\n' }}
+    {%- endif %}
+{%- endif %}
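
Two parts of this template are easy to misread: the generation prompt at the end, and the way assistant text is split on `</think>`. The sketch below mirrors both in plain Python for illustration. The function names are hypothetical and Jinja itself is not involved, so it only approximates how the template behaves.

```python
def generation_prompt(enable_thinking=None):
    """Mirror the template's final block when add_generation_prompt is set."""
    prompt = "<|im_start|>assistant\n"
    if enable_thinking is False:
        # Thinking disabled: emit an empty, already closed think block.
        return prompt + "<think>\n\n</think>\n\n"
    # Undefined or any other value: leave the think block open (the default).
    return prompt + "<think>\n"


def split_reasoning(content):
    """Mirror the template's split of assistant text on '</think>'."""
    if "</think>" not in content:
        return "", content
    head = content.split("</think>")[0].rstrip("\n")
    reasoning = head.split("<think>")[-1].lstrip("\n")
    answer = content.split("</think>")[-1].lstrip("\n")
    # The template applies |trim to the reasoning text afterwards.
    return reasoning.strip(), answer


if __name__ == "__main__":
    print(repr(generation_prompt()))
    print(repr(generation_prompt(enable_thinking=False)))
    print(split_reasoning("<think>\nplan\n</think>\n\n42"))
```

Only an explicit `enable_thinking=False` closes the think block, which matches the README's description of the model as thinking-by-default.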

config.json ADDED

The diff for this file is too large to render.

generation_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "bos_token_id": 248044,
+  "do_sample": true,
+  "eos_token_id": [
+    248046,
+    248044
+  ],
+  "pad_token_id": 248044,
+  "temperature": 1.0,
+  "top_k": 20,
+  "top_p": 0.95,
+  "transformers_version": "5.10.1"
+}
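
A client can pass these sampling values explicitly so that a request does not depend on server-side defaults. The sketch below builds a JSON body for the OpenAI-compatible `/v1/chat/completions` endpoint of a vLLM server started as in the README. `build_chat_request` is a hypothetical helper, and sending `top_k` and `chat_template_kwargs` in the body assumes vLLM's extra request parameters are available in the deployed version.

```python
import json

# Sampling defaults copied from generation_config.json in this commit.
SAMPLING_DEFAULTS = {"temperature": 1.0, "top_k": 20, "top_p": 0.95}


def build_chat_request(prompt, enable_thinking=True, **overrides):
    """Build a JSON body for POST /v1/chat/completions (sketch only)."""
    body = {
        # Matches --served-model-name in the README's serve command.
        "model": "qwen3.6-35b-a3b-nvfp4",
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING_DEFAULTS,
        # Forwarded to the chat template, which reads enable_thinking.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    # Caller-supplied keys (for example temperature=0.7) replace the defaults.
    body.update(overrides)
    return json.dumps(body)


if __name__ == "__main__":
    print(build_chat_request("What is 17 * 23?", enable_thinking=False))
```

The resulting string can be sent with any HTTP client; nothing in the sketch contacts a server.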

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:58d5f4d578092478b746ae99849b28166207fce38ec0380e840d502ba0cb5971
+size 22517775624
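
The `size` field of this pointer is where the README's 22.5 GB figure comes from. The sketch below parses the pointer text and does the arithmetic; `parse_lfs_pointer` is a hypothetical helper, and the 67 GB base size is taken from the README as a decimal-gigabyte figure, which is an assumption.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:58d5f4d578092478b746ae99849b28166207fce38ec0380e840d502ba0cb5971
size 22517775624
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    fields["size"] = int(fields["size"])
    return fields


info = parse_lfs_pointer(POINTER)
size_gb = info["size"] / 1e9       # decimal gigabytes, as used in the README
size_gib = info["size"] / 2 ** 30  # binary gibibytes, as many tools report
ratio = 67 / size_gb               # assumes the 67 GB base size is decimal GB

print(f"{size_gb:.1f} GB = {size_gib:.1f} GiB, base/quantized = {ratio:.2f}")
```

The GB versus GiB distinction matters when comparing against GPU memory, which is usually quoted in binary units.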

preprocessor_config.json ADDED

	@@ -0,0 +1,21 @@

+{
+    "size": {
+        "longest_edge": 16777216,
+        "shortest_edge": 65536
+    },
+    "patch_size": 16,
+    "temporal_patch_size": 2,
+    "merge_size": 2,
+    "image_mean": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "image_std": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "processor_class": "Qwen3VLProcessor",
+    "image_processor_type": "Qwen2VLImageProcessorFast"
+}
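
The `size` values above are easier to read as pixel budgets than as edge lengths. The sketch below works through the arithmetic under two assumptions that this diff does not confirm: that `shortest_edge` and `longest_edge` are minimum and maximum total pixel counts (as in other Qwen VL image processors), and that one visual token covers a `patch_size * merge_size` square of pixels.

```python
import math

PATCH_SIZE = 16
MERGE_SIZE = 2
MIN_PIXELS = 65536     # size.shortest_edge, read here as a pixel count
MAX_PIXELS = 16777216  # size.longest_edge, read here as a pixel count

# Assumption: one token after merging covers a 32 x 32 pixel square.
PIXELS_PER_TOKEN = (PATCH_SIZE * MERGE_SIZE) ** 2


def token_bounds():
    """Return (min_tokens, max_tokens) per image under the assumptions above."""
    return MIN_PIXELS // PIXELS_PER_TOKEN, MAX_PIXELS // PIXELS_PER_TOKEN


if __name__ == "__main__":
    # Side length of a square image at each pixel budget.
    print(math.isqrt(MIN_PIXELS), math.isqrt(MAX_PIXELS))
    print(token_bounds())
```

Under this reading, the budgets correspond to square images of 256 and 4096 pixels per side, or 64 and 16384 visual tokens.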

processor_config.json ADDED

	@@ -0,0 +1,60 @@

+{
+  "image_processor": {
+    "do_convert_rgb": true,
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_processor_type": "Qwen2VLImageProcessor",
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "merge_size": 2,
+    "patch_size": 16,
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098,
+    "size": {
+      "longest_edge": 16777216,
+      "shortest_edge": 65536
+    },
+    "temporal_patch_size": 2
+  },
+  "processor_class": "Qwen3VLProcessor",
+  "video_processor": {
+    "do_convert_rgb": true,
+    "do_normalize": true,
+    "do_rescale": true,
+    "do_resize": true,
+    "do_sample_frames": true,
+    "fps": 2,
+    "image_mean": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "image_std": [
+      0.5,
+      0.5,
+      0.5
+    ],
+    "max_frames": 768,
+    "merge_size": 2,
+    "min_frames": 4,
+    "patch_size": 16,
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098,
+    "return_metadata": false,
+    "size": {
+      "longest_edge": 25165824,
+      "shortest_edge": 4096
+    },
+    "temporal_patch_size": 2,
+    "video_processor_type": "Qwen3VLVideoProcessor"
+  }
+}

recipe.yaml ADDED

	@@ -0,0 +1,60 @@

+quant_stage:
+  quant_modifiers:
+    QuantizationModifier:
+      config_groups:
+        group_0:
+          targets: ['re:.*self_attn.q_proj.*', 're:.*self_attn.k_proj.*', 're:.*self_attn.v_proj.*',
+            're:.*self_attn.o_proj.*', 're:.*linear_attn.in_proj_qkv.*', 're:.*linear_attn.in_proj_z.*',
+            're:.*linear_attn.out_proj.*', 're:.*shared_expert.gate_proj.*', 're:.*shared_expert.up_proj.*',
+            're:.*shared_expert.down_proj.*']
+          weights:
+            num_bits: 8
+            type: float
+            symmetric: true
+            group_size: null
+            strategy: block
+            block_structure: [128, 128]
+            dynamic: false
+            actorder: null
+            scale_dtype: null
+            zp_dtype: null
+            observer: memoryless_minmax
+            observer_kwargs: {}
+          input_activations:
+            num_bits: 8
+            type: float
+            symmetric: true
+            group_size: 128
+            strategy: group
+            block_structure: null
+            dynamic: true
+            actorder: null
+            scale_dtype: null
+            zp_dtype: null
+            observer: null
+            observer_kwargs: {}
+          output_activations: null
+          format: null
+        group_1:
+          targets: ['re:.*mlp.experts.*gate_proj.*', 're:.*mlp.experts.*up_proj.*', 're:.*mlp.experts.*down_proj.*']
+          weights:
+            num_bits: 4
+            type: float
+            symmetric: true
+            group_size: 16
+            strategy: tensor_group
+            block_structure: null
+            dynamic: false
+            actorder: null
+            scale_dtype: torch.float8_e4m3fn
+            zp_dtype: null
+            observer: memoryless_minmax
+            observer_kwargs: {}
+          input_activations: null
+          output_activations: null
+          format: null
+      targets: [Linear]
+      ignore: ['re:.*lm_head', 're:.*embed_tokens', 're:visual.*', 're:model.visual.*', 're:.*mlp.gate$',
+        're:.*shared_expert_gate$', 're:.*linear_attn.in_proj_a', 're:.*linear_attn.in_proj_b',
+        're:.*linear_attn.conv1d', 're:^mtp\..*']
+      bypass_divisibility_checks: false
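
To see which scheme a given layer falls under, the `re:` patterns from this recipe can be tried against module names. The sketch below is an approximation of that routing, not the compressed-tensors matcher: it assumes patterns are applied with `re.match` and that the `ignore` list takes precedence over `config_groups`. The module names in the example are hypothetical.

```python
import re

# Patterns copied from recipe.yaml with the "re:" prefix removed.
GROUP_0 = [
    r".*self_attn.q_proj.*", r".*self_attn.k_proj.*", r".*self_attn.v_proj.*",
    r".*self_attn.o_proj.*", r".*linear_attn.in_proj_qkv.*",
    r".*linear_attn.in_proj_z.*", r".*linear_attn.out_proj.*",
    r".*shared_expert.gate_proj.*", r".*shared_expert.up_proj.*",
    r".*shared_expert.down_proj.*",
]
GROUP_1 = [
    r".*mlp.experts.*gate_proj.*", r".*mlp.experts.*up_proj.*",
    r".*mlp.experts.*down_proj.*",
]
IGNORE = [
    r".*lm_head", r".*embed_tokens", r"visual.*", r"model.visual.*",
    r".*mlp.gate$", r".*shared_expert_gate$", r".*linear_attn.in_proj_a",
    r".*linear_attn.in_proj_b", r".*linear_attn.conv1d", r"^mtp\..*",
]


def route(name):
    """Return the scheme a module name falls under in this sketch."""
    # Assumption: ignore patterns win over any config group.
    if any(re.match(p, name) for p in IGNORE):
        return "bf16 (ignored)"
    if any(re.match(p, name) for p in GROUP_0):
        return "fp8 w8a8 (group_0)"
    if any(re.match(p, name) for p in GROUP_1):
        return "nvfp4 w4a16 (group_1)"
    return "unmatched (left as-is)"


if __name__ == "__main__":
    # Hypothetical module names, chosen only to exercise each rule.
    for name in [
        "model.language_model.layers.3.self_attn.q_proj",
        "model.language_model.layers.0.linear_attn.in_proj_a",
        "model.language_model.layers.0.mlp.experts.17.down_proj",
        "model.language_model.layers.0.mlp.shared_expert.up_proj",
        "model.language_model.layers.0.mlp.gate",
        "mtp.layers.0.self_attn.q_proj",
    ]:
        print(f"{name} -> {route(name)}")
```

The `^mtp\..*` entry matters because MTP attention layers would otherwise match the `self_attn` patterns in `group_0`; the README lists MTP as kept in BF16.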

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f399b3cd12fa270d51457bb749fb30863521e8359b8a27059c71b6c2f7d6dd6c
+size 19989424

tokenizer_config.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "add_prefix_space": false,
+  "audio_bos_token": "<|audio_start|>",
+  "audio_eos_token": "<|audio_end|>",
+  "audio_token": "<|audio_pad|>",
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "image_token": "<|image_pad|>",
+  "is_local": true,
+  "local_files_only": true,
+  "model_max_length": 262144,
+  "model_specific_special_tokens": {
+    "audio_bos_token": "<|audio_start|>",
+    "audio_eos_token": "<|audio_end|>",
+    "audio_token": "<|audio_pad|>",
+    "image_token": "<|image_pad|>",
+    "video_token": "<|video_pad|>",
+    "vision_bos_token": "<|vision_start|>",
+    "vision_eos_token": "<|vision_end|>"
+  },
+  "pad_token": "<|endoftext|>",
+  "pretokenize_regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null,
+  "video_token": "<|video_pad|>",
+  "vision_bos_token": "<|vision_start|>",
+  "vision_eos_token": "<|vision_end|>"
+}

video_preprocessor_config.json ADDED

	@@ -0,0 +1,21 @@

+{
+    "size": {
+        "longest_edge": 25165824,
+        "shortest_edge": 4096
+    },
+    "patch_size": 16,
+    "temporal_patch_size": 2,
+    "merge_size": 2,
+    "image_mean": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "image_std": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "processor_class": "Qwen3VLProcessor",
+    "video_processor_type": "Qwen3VLVideoProcessor"
+}