Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8

FP8 block-quantized version of llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1.

Quantized to match the official Qwen/Qwen3.5-27B-FP8 format exactly.

Quantization Details

  • Method: Fine-grained FP8 quantization with block size of 128
  • Tool: Hugging Face Transformers native FineGrainedFP8Config (on-the-fly quantization during model loading)
  • Format: quant_method: "fp8" (Qwen/DeepSeek native format, NOT compressed-tensors)
  • Weight: FP8 E4M3, static, block_size=(128, 128)
  • Activation: FP8, dynamic per-token
  • Model size: ~29 GB (vs ~55 GB BF16)

Ignored Layers (modules_to_not_convert)

Copied verbatim from the official Qwen/Qwen3.5-27B-FP8 config.json, with MTP entries removed (this heretic variant has no MTP):

  • lm_head
  • model.language_model.embed_tokens
  • All linear_attn.conv1d, linear_attn.in_proj_a, linear_attn.in_proj_b (DeltaNet SSM-specific subparts)
  • All model.visual.* (entire vision tower)

Quantized layers (NOT in ignore list): linear_attn.out_proj, linear_attn.in_proj_qkv, linear_attn.in_proj_z, all self_attn Q/K/V/O projections, all MLP layers.

Quantization Script

from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor, FineGrainedFP8Config
import json, torch

# Load ignore list from Qwen official FP8 config
ref = json.load(open("Qwen3.5-27B-FP8/config.json"))
ref_ignore = ref["quantization_config"]["modules_to_not_convert"]
modules_to_not_convert = [m for m in ref_ignore if not m.startswith("mtp")]

qc = FineGrainedFP8Config(
    activation_scheme="dynamic",
    weight_block_size=(128, 128),
    modules_to_not_convert=modules_to_not_convert,
    dequantize=False,
)

processor = AutoProcessor.from_pretrained(MODEL_DIR)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    MODEL_DIR,
    dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "30GiB", 1: "30GiB"},
    quantization_config=qc,
    low_cpu_mem_usage=True,
)

model.save_pretrained(SAVE_DIR, max_shard_size="5GB", save_original_format=False)
processor.save_pretrained(SAVE_DIR)

Evaluation Results

BF16 baseline vs FP8 quantized, evaluated with lm_eval 0.4.11, vLLM backend, 2 seeds averaged.

Benchmark BF16 FP8 Recovery
GSM8k-Platinum (5-shot) 98.10% 97.89% 99.79%
IFEval inst_strict 92.15% 92.93% 100.85%
IFEval prompt_strict 89.74% 90.58% 100.93%

Generation parameters: temperature=1.0, top_p=0.95, top_k=64, max_gen_toks=16384

Usage

from vllm import LLM
model = LLM("kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8")

Disclaimer

This is an uncensored model. The quantizer (kakrotto) is not responsible for the model's outputs or any misuse. This FP8 quantization preserves the original model's behavior. Please use responsibly.

Attribution

Downloads last month
35
Safetensors
Model size
27B params
Tensor type
BF16
F32
F8_E4M3
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8

Base model

Qwen/Qwen3.5-27B
Quantized
(14)
this model