kakrotto committed on
Commit 447d1aa · 1 Parent(s): 29623b9

Fix quantization method: FineGrainedFP8Config, not llmcompressor model_free_ptq

  1. README.md +52 -7
README.md CHANGED
@@ -4,23 +4,67 @@ base_model: llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1
4
  tags:
5
  - fp8
6
  - quantized
7
- - compressed-tensors
8
- - llmcompressor
9
- - qwen3
10
  ---
11
 
12
  # Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8
13
 
14
  FP8 block-quantized version of [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1).
15
 
 
 
16
  ## Quantization Details
17
 
18
- - **Method:** FP8_BLOCK (weight_block_size=[128,128], activation_scheme=dynamic)
19
- - **Tool:** [llmcompressor](https://github.com/vllm-project/llmcompressor) `model_free_ptq`
20
- - **Format:** compressed-tensors (vLLM native)
21
- - **Ignored layers:** lm_head, embedding layers
 
22
  - **Model size:** ~29 GB (vs ~55 GB BF16)
23
 
24
  ## Evaluation Results
25
 
26
  BF16 baseline vs FP8 quantized, evaluated with lm_eval 0.4.11, vLLM backend, 2 seeds averaged.
@@ -47,3 +91,4 @@ This is an uncensored model. The quantizer (kakrotto) is not responsible for the
47
  ## Attribution
48
 
49
  - **Source model:** [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1)
 
 
4
  tags:
5
  - fp8
6
  - quantized
7
+ - qwen3.5
 
 
8
  ---
9
 
10
  # Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8
11
 
12
  FP8 block-quantized version of [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1).
13
 
14
+ Quantized to match the official [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) format exactly.
15
+
16
  ## Quantization Details
17
 
18
+ - **Method:** Fine-grained FP8 quantization with block size of 128
19
+ - **Tool:** Hugging Face Transformers native `FineGrainedFP8Config` (on-the-fly quantization during model loading)
20
+ - **Format:** `quant_method: "fp8"` (Qwen/DeepSeek native format, NOT compressed-tensors)
21
+ - **Weight:** FP8 E4M3, static, block_size=(128, 128)
22
+ - **Activation:** FP8, dynamic per-token
23
  - **Model size:** ~29 GB (vs ~55 GB BF16)
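To make the block-wise scheme above concrete, here is a minimal pure-Python sketch of how one scale per block could be derived and why the storage roughly halves. It assumes the scale is the block's maximum absolute value divided by 448 (the largest finite E4M3 value) and that each scale is stored in 4 bytes. The helper names, the toy matrix, and the 5120x5120 shape are illustrative only, and the actual rounding to FP8 is omitted.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def block_scales(weight, block=(128, 128)):
    """Return one scale per block: max |w| in the block divided by E4M3_MAX."""
    rows, cols = len(weight), len(weight[0])
    br, bc = block
    scales = []
    for r0 in range(0, rows, br):
        row_scales = []
        for c0 in range(0, cols, bc):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + br, rows))
                for c in range(c0, min(c0 + bc, cols))
            )
            # Use a neutral scale for all-zero blocks to avoid dividing by zero later.
            row_scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


def storage_bytes(rows, cols, block=(128, 128), scale_bytes=4):
    """Estimate (fp8_bytes, bf16_bytes) for one weight matrix.

    Assumes 1 byte per FP8 weight, 2 bytes per BF16 weight, and
    scale_bytes per block scale (4 bytes is an assumption).
    """
    n_scales = math.ceil(rows / block[0]) * math.ceil(cols / block[1])
    return rows * cols + n_scales * scale_bytes, rows * cols * 2


# Toy 4x4 matrix with 2x2 blocks so the result is easy to inspect by hand.
toy = [
    [1.0, -2.0, 0.5, 0.25],
    [0.0, 4.0, -0.5, 0.125],
    [8.0, 1.0, 0.0, 0.0],
    [-1.0, 2.0, 0.0, 0.0],
]
scales = block_scales(toy, block=(2, 2))

# Hypothetical 5120x5120 projection with the 128x128 blocks used here.
fp8_bytes, bf16_bytes = storage_bytes(5120, 5120)
print(scales)
print(fp8_bytes, bf16_bytes)
```

Under these assumptions the per-block scales add well under 0.1% on top of the 1-byte weights, so each quantized matrix takes about half its BF16 size. That is consistent with the ~29 GB vs ~55 GB figures once the layers left in higher precision are counted.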
24
 
25
+ ### Ignored Layers (modules_to_not_convert)
26
+
27
+ Copied verbatim from the official [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) config.json, with MTP entries removed (this heretic variant has no MTP):
28
+
29
+ - `lm_head`
30
+ - `model.language_model.embed_tokens`
31
+ - All `linear_attn.conv1d`, `linear_attn.in_proj_a`, `linear_attn.in_proj_b` (DeltaNet SSM-specific subparts)
32
+ - All `model.visual.*` (entire vision tower)
33
+
34
+ **Quantized layers** (NOT in ignore list): `linear_attn.out_proj`, `linear_attn.in_proj_qkv`, `linear_attn.in_proj_z`, all `self_attn` Q/K/V/O projections, all MLP layers.
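The split between ignored and quantized layers can be illustrated with a small name filter. This is a simplified sketch: the matching rule (exact name, dotted prefix, or dotted suffix) is an assumption and may differ from how Transformers resolves `modules_to_not_convert`, and the fully qualified module names below are hypothetical examples, not copied from the actual config.

```python
def is_quantized(module_name, modules_to_not_convert):
    """Return True if no ignore entry matches the module name.

    An entry matches when it equals the name, is a dotted prefix of it
    (e.g. "model.visual" covers "model.visual.blocks.0"), or is a dotted
    suffix of it (e.g. "lm_head" covers "model.lm_head").
    """
    for entry in modules_to_not_convert:
        if (
            module_name == entry
            or module_name.startswith(entry + ".")
            or module_name.endswith("." + entry)
        ):
            return False
    return True


# Hypothetical ignore list in the spirit of the entries described above.
ignore = [
    "lm_head",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.linear_attn.in_proj_b",
    "model.visual",
]

for name in [
    "model.language_model.layers.0.linear_attn.out_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.visual.blocks.0.attn.qkv",
    "lm_head",
]:
    print(name, "->", "quantized" if is_quantized(name, ignore) else "ignored")
```

Matching on dotted boundaries rather than plain substrings keeps an entry such as `in_proj_a` from accidentally covering a differently named sibling module.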
35
+
36
+ ### Quantization Script
37
+
38
+ ```python
+ import json
+
+ import torch
+ from transformers import AutoProcessor, FineGrainedFP8Config, Qwen3_5ForConditionalGeneration
+
+ # Placeholder paths (not given in the original script); set these for your setup.
+ MODEL_DIR = "path/to/Qwen3.5-27B-ultra-uncensored-heretic-v1"
+ SAVE_DIR = "path/to/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8"
+
+ # Load the ignore list from the official Qwen FP8 config and drop the MTP
+ # entries, since this variant has no MTP module.
+ with open("Qwen3.5-27B-FP8/config.json") as f:
+     ref = json.load(f)
+ ref_ignore = ref["quantization_config"]["modules_to_not_convert"]
+ modules_to_not_convert = [m for m in ref_ignore if not m.startswith("mtp")]
+
+ # Block-wise FP8 weights (128x128 blocks) with dynamic activation scaling.
+ qc = FineGrainedFP8Config(
+     activation_scheme="dynamic",
+     weight_block_size=(128, 128),
+     modules_to_not_convert=modules_to_not_convert,
+     dequantize=False,
+ )
+
+ processor = AutoProcessor.from_pretrained(MODEL_DIR)
+ # Weights are quantized on the fly while the BF16 checkpoint is loaded.
+ model = Qwen3_5ForConditionalGeneration.from_pretrained(
+     MODEL_DIR,
+     dtype=torch.bfloat16,
+     device_map="auto",
+     max_memory={0: "30GiB", 1: "30GiB"},
+     quantization_config=qc,
+     low_cpu_mem_usage=True,
+ )
+
+ model.save_pretrained(SAVE_DIR, max_shard_size="5GB", save_original_format=False)
+ processor.save_pretrained(SAVE_DIR)
66
+ ```
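After running a script like the one above, a quick sanity check on the saved `config.json` can confirm the output has the intended format. The sketch below is an assumption-laden illustration: it presumes the saved `quantization_config` uses the keys `quant_method`, `weight_block_size`, and `modules_to_not_convert` (mirroring the constructor arguments), and `check_fp8_config` is a hypothetical helper, not part of any library.

```python
def check_fp8_config(config):
    """Return a list of problems found in a loaded config.json dict."""
    qc = config.get("quantization_config", {})
    problems = []
    if qc.get("quant_method") != "fp8":
        problems.append(f"unexpected quant_method: {qc.get('quant_method')!r}")
    if list(qc.get("weight_block_size") or []) != [128, 128]:
        problems.append(f"unexpected weight_block_size: {qc.get('weight_block_size')!r}")
    if any(m.startswith("mtp") for m in qc.get("modules_to_not_convert", [])):
        problems.append("MTP entries are still present in modules_to_not_convert")
    return problems


# Inline example standing in for the parsed contents of SAVE_DIR/config.json.
example = {
    "quantization_config": {
        "quant_method": "fp8",
        "activation_scheme": "dynamic",
        "weight_block_size": [128, 128],
        "modules_to_not_convert": ["lm_head", "model.language_model.embed_tokens"],
    }
}
# A config in the previous (compressed-tensors) format should be flagged.
old_style = {"quantization_config": {"quant_method": "compressed-tensors"}}

print(check_fp8_config(example))
print(check_fp8_config(old_style))
```

In practice the dict would come from `json.load` on the saved `config.json`. An empty result means the three checked fields look as expected, not that the weights themselves are verified.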
67
+
68
  ## Evaluation Results
69
 
70
  BF16 baseline vs FP8 quantized, evaluated with lm_eval 0.4.11, vLLM backend, 2 seeds averaged.
 
91
  ## Attribution
92
 
93
  - **Source model:** [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1)
94
+ - **Quantization reference:** [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8)