---
license: apache-2.0
base_model: llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1
tags:
- fp8
- quantized
- qwen3.5
---

# Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8

FP8 block-quantized version of [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1).

Quantized to match the official [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) format exactly.

## Quantization Details

- **Method:** Fine-grained FP8 quantization with block size of 128
- **Tool:** Hugging Face Transformers native `FineGrainedFP8Config` (on-the-fly quantization during model loading)
- **Format:** `quant_method: "fp8"` (Qwen/DeepSeek native format, NOT compressed-tensors)
- **Weight:** FP8 E4M3, static, block_size=(128, 128)
- **Activation:** FP8, dynamic per-token
- **Model size:** ~29 GB (vs ~55 GB BF16)

### Ignored Layers (modules_to_not_convert)

Copied verbatim from the official [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) config.json, with MTP entries removed (this heretic variant has no MTP):

- `lm_head`
- `model.language_model.embed_tokens`
- All `linear_attn.conv1d`, `linear_attn.in_proj_a`, `linear_attn.in_proj_b` (DeltaNet SSM-specific subparts)
- All `model.visual.*` (entire vision tower)

**Quantized layers** (NOT in ignore list): `linear_attn.out_proj`, `linear_attn.in_proj_qkv`, `linear_attn.in_proj_z`, all `self_attn` Q/K/V/O projections, all MLP layers.

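The split between ignored and quantized modules can be illustrated with a small name filter. This is only a sketch of the rule summarized in this card, not the matching logic Transformers uses internally: the pattern constants, the `stays_unquantized` helper, and the example module names are all hypothetical, and the real `modules_to_not_convert` list in config.json may enumerate full per-layer names instead of patterns.

```python
# Illustrative patterns summarizing the ignore list in this card.
# Assumption: these condense the list; the official config.json may spell out
# one full module name per layer instead.
IGNORE_EXACT = {"lm_head", "model.language_model.embed_tokens"}
IGNORE_SUFFIXES = (
    "linear_attn.conv1d",
    "linear_attn.in_proj_a",
    "linear_attn.in_proj_b",
)
IGNORE_PREFIXES = ("model.visual.",)


def stays_unquantized(module_name: str) -> bool:
    """Return True if the module matches the ignore list and keeps its original precision."""
    return (
        module_name in IGNORE_EXACT
        or module_name.endswith(IGNORE_SUFFIXES)
        or module_name.startswith(IGNORE_PREFIXES)
    )


# Hypothetical module names that follow the naming pattern used in this card.
examples = [
    "lm_head",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.visual.blocks.0.attn.qkv",
]
for name in examples:
    status = "kept as-is" if stays_unquantized(name) else "FP8"
    print(f"{name}: {status}")
```

A filter like this can help sanity-check a saved checkpoint: every weight whose module name it flags should still be stored in the original dtype, and every other linear weight should be FP8 with an accompanying scale tensor.
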

### Quantization Script

```python
import json

import torch
from transformers import (
    AutoProcessor,
    FineGrainedFP8Config,
    Qwen3_5ForConditionalGeneration,
)

# Placeholder paths: point these at your local directories.
MODEL_DIR = "path/to/Qwen3.5-27B-ultra-uncensored-heretic-v1"  # BF16 source checkpoint
SAVE_DIR = "path/to/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8"  # FP8 output

# Load the ignore list from the official Qwen FP8 config and drop the MTP entries.
with open("Qwen3.5-27B-FP8/config.json") as f:
    ref = json.load(f)
ref_ignore = ref["quantization_config"]["modules_to_not_convert"]
modules_to_not_convert = [m for m in ref_ignore if not m.startswith("mtp")]

# Dynamic activation scaling with 128x128 weight blocks.
qc = FineGrainedFP8Config(
    activation_scheme="dynamic",
    weight_block_size=(128, 128),
    modules_to_not_convert=modules_to_not_convert,
    dequantize=False,
)

processor = AutoProcessor.from_pretrained(MODEL_DIR)

# Weights are quantized on the fly while the model loads.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    MODEL_DIR,
    dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "30GiB", 1: "30GiB"},
    quantization_config=qc,
    low_cpu_mem_usage=True,
)

model.save_pretrained(SAVE_DIR, max_shard_size="5GB", save_original_format=False)
processor.save_pretrained(SAVE_DIR)
```

## Evaluation Results

BF16 baseline versus the FP8 quantized model, evaluated with lm_eval 0.4.11 on the vLLM backend and averaged over 2 seeds. Recovery is the FP8 score relative to the BF16 score.

| Benchmark | BF16 | FP8 | Recovery |
|-----------|------|-----|----------|
| GSM8k-Platinum (5-shot) | 98.10% | 97.89% | 99.79% |
| IFEval inst_strict | 92.15% | 92.93% | 100.85% |
| IFEval prompt_strict | 89.74% | 90.58% | 100.93% |

Generation parameters: `temperature=1.0, top_p=0.95, top_k=64, max_gen_toks=16384`

## Usage

```python
from vllm import LLM

model = LLM("kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8")
```

## Disclaimer

This is an uncensored model. The quantizer (kakrotto) is not responsible for the model's outputs or any misuse. This FP8 quantization preserves the original model's behavior. Please use responsibly.

## Attribution

- **Source model:** [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1)
- **Quantization reference:** [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8)