Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8
FP8 block-quantized version of llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1.
Quantized to match the official Qwen/Qwen3.5-27B-FP8 format exactly.
Quantization Details
- Method: Fine-grained FP8 quantization with block size of 128
- Tool: Hugging Face Transformers native
FineGrainedFP8Config(on-the-fly quantization during model loading) - Format:
quant_method: "fp8"(Qwen/DeepSeek native format, NOT compressed-tensors) - Weight: FP8 E4M3, static, block_size=(128, 128)
- Activation: FP8, dynamic per-token
- Model size: ~29 GB (vs ~55 GB BF16)
Ignored Layers (modules_to_not_convert)
Copied verbatim from the official Qwen/Qwen3.5-27B-FP8 config.json, with MTP entries removed (this heretic variant has no MTP):
lm_headmodel.language_model.embed_tokens- All
linear_attn.conv1d,linear_attn.in_proj_a,linear_attn.in_proj_b(DeltaNet SSM-specific subparts) - All
model.visual.*(entire vision tower)
Quantized layers (NOT in ignore list): linear_attn.out_proj, linear_attn.in_proj_qkv, linear_attn.in_proj_z, all self_attn Q/K/V/O projections, all MLP layers.
Quantization Script
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor, FineGrainedFP8Config
import json, torch
# Load ignore list from Qwen official FP8 config
ref = json.load(open("Qwen3.5-27B-FP8/config.json"))
ref_ignore = ref["quantization_config"]["modules_to_not_convert"]
modules_to_not_convert = [m for m in ref_ignore if not m.startswith("mtp")]
qc = FineGrainedFP8Config(
activation_scheme="dynamic",
weight_block_size=(128, 128),
modules_to_not_convert=modules_to_not_convert,
dequantize=False,
)
processor = AutoProcessor.from_pretrained(MODEL_DIR)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
MODEL_DIR,
dtype=torch.bfloat16,
device_map="auto",
max_memory={0: "30GiB", 1: "30GiB"},
quantization_config=qc,
low_cpu_mem_usage=True,
)
model.save_pretrained(SAVE_DIR, max_shard_size="5GB", save_original_format=False)
processor.save_pretrained(SAVE_DIR)
Evaluation Results
BF16 baseline vs FP8 quantized, evaluated with lm_eval 0.4.11, vLLM backend, 2 seeds averaged.
| Benchmark | BF16 | FP8 | Recovery |
|---|---|---|---|
| GSM8k-Platinum (5-shot) | 98.10% | 97.89% | 99.79% |
| IFEval inst_strict | 92.15% | 92.93% | 100.85% |
| IFEval prompt_strict | 89.74% | 90.58% | 100.93% |
Generation parameters: temperature=1.0, top_p=0.95, top_k=64, max_gen_toks=16384
Usage
from vllm import LLM
model = LLM("kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8")
Disclaimer
This is an uncensored model. The quantizer (kakrotto) is not responsible for the model's outputs or any misuse. This FP8 quantization preserves the original model's behavior. Please use responsibly.
Attribution
- Source model: llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1
- Quantization reference: Qwen/Qwen3.5-27B-FP8
- Downloads last month
- 35
Model tree for kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8
Base model
Qwen/Qwen3.5-27B