Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8

FP8 block-quantized version of llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1.

Quantized to match the official Qwen/Qwen3.5-27B-FP8 format exactly.

Quantization Details

Method: Fine-grained FP8 quantization with block size of 128
Tool: Hugging Face Transformers native FineGrainedFP8Config (on-the-fly quantization during model loading)
Format: quant_method: "fp8" (Qwen/DeepSeek native format, NOT compressed-tensors)
Weight: FP8 E4M3, static, block_size=(128, 128)
Activation: FP8, dynamic per-token
Model size: ~29 GB (vs ~55 GB BF16)

Ignored Layers (modules_to_not_convert)

Copied verbatim from the official Qwen/Qwen3.5-27B-FP8 config.json, with MTP entries removed (this heretic variant has no MTP):

lm_head
model.language_model.embed_tokens
All linear_attn.conv1d, linear_attn.in_proj_a, linear_attn.in_proj_b (DeltaNet SSM-specific subparts)
All model.visual.* (entire vision tower)

Quantized layers (NOT in ignore list): linear_attn.out_proj, linear_attn.in_proj_qkv, linear_attn.in_proj_z, all self_attn Q/K/V/O projections, all MLP layers.

Quantization Script

from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor, FineGrainedFP8Config
import json, torch

# Load ignore list from Qwen official FP8 config
ref = json.load(open("Qwen3.5-27B-FP8/config.json"))
ref_ignore = ref["quantization_config"]["modules_to_not_convert"]
modules_to_not_convert = [m for m in ref_ignore if not m.startswith("mtp")]

qc = FineGrainedFP8Config(
    activation_scheme="dynamic",
    weight_block_size=(128, 128),
    modules_to_not_convert=modules_to_not_convert,
    dequantize=False,
)

processor = AutoProcessor.from_pretrained(MODEL_DIR)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    MODEL_DIR,
    dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "30GiB", 1: "30GiB"},
    quantization_config=qc,
    low_cpu_mem_usage=True,
)

model.save_pretrained(SAVE_DIR, max_shard_size="5GB", save_original_format=False)
processor.save_pretrained(SAVE_DIR)

Evaluation Results

BF16 baseline vs FP8 quantized, evaluated with lm_eval 0.4.11, vLLM backend, 2 seeds averaged.

Benchmark	BF16	FP8	Recovery
GSM8k-Platinum (5-shot)	98.10%	97.89%	99.79%
IFEval inst_strict	92.15%	92.93%	100.85%
IFEval prompt_strict	89.74%	90.58%	100.93%

Generation parameters: temperature=1.0, top_p=0.95, top_k=64, max_gen_toks=16384

Usage

from vllm import LLM
model = LLM("kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8")

Disclaimer

This is an uncensored model. The quantizer (kakrotto) is not responsible for the model's outputs or any misuse. This FP8 quantization preserves the original model's behavior. Please use responsibly.