---
license: apache-2.0
base_model: llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1
tags:
  - fp8
  - quantized
  - qwen3.5
---

# Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8

FP8 block-quantized version of [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1).

The quantization settings match the official [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) checkpoint, except that MTP entries are dropped from the ignore list.

## Quantization Details

- **Method:** Fine-grained FP8 quantization with block size of 128
- **Tool:** Hugging Face Transformers native `FineGrainedFP8Config` (on-the-fly quantization during model loading)
- **Format:** `quant_method: "fp8"` (Qwen/DeepSeek native format, NOT compressed-tensors)
- **Weight:** FP8 E4M3, static, block_size=(128, 128)
- **Activation:** FP8, dynamic per-token
- **Model size:** ~29 GB (vs ~55 GB BF16)
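
To illustrate what "block size of 128" means for the weights, here is a minimal pure-Python sketch of how one scale per tile can be derived. It is not the Transformers implementation: the helper name `block_scales` and the toy matrix are hypothetical, the tile is shrunk to 2x2 for readability, and the sketch only computes scales without performing the actual FP8 rounding. The only hard fact it relies on is that the largest finite FP8 E4M3 value is 448.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def block_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D list of floats.

    Dividing a tile by its scale maps the tile's largest magnitude onto
    E4M3_MAX, so the scaled values fit the FP8 range.
    """
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Guard against all-zero tiles, which would give a zero scale.
            row_scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


# Toy 4x4 matrix split into 2x2 tiles (hypothetical values).
w = [
    [0.5, -1.0, 8.0, 2.0],
    [0.25, 0.75, -4.0, 1.0],
    [0.0, 0.0, 448.0, -224.0],
    [0.0, 0.0, 112.0, 56.0],
]
scales = block_scales(w, block=2)
```

Each tile gets its own scale, so an outlier in one tile does not reduce the precision available to the others. With `block=128`, a weight matrix stores one extra scale per 16,384 weights.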

### Ignored Layers (modules_to_not_convert)

Taken from the official [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8) `config.json`, with the MTP (multi-token prediction) entries removed because this heretic variant has no MTP module:

- `lm_head`
- `model.language_model.embed_tokens`
- All `linear_attn.conv1d`, `linear_attn.in_proj_a`, `linear_attn.in_proj_b` (DeltaNet SSM-specific subparts)
- All `model.visual.*` (entire vision tower)

**Quantized layers** (NOT in ignore list): `linear_attn.out_proj`, `linear_attn.in_proj_qkv`, `linear_attn.in_proj_z`, all `self_attn` Q/K/V/O projections, all MLP layers.
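
The rules above can be summarized as a small name-based check. This is only a sketch of the listed rules, not how Transformers resolves `modules_to_not_convert` internally; the helper `stays_unquantized` and the example module paths (such as `model.language_model.layers.0...`) are hypothetical and assume the layer naming follows the `model.language_model.*` prefix shown in the list.

```python
SKIP_EXACT = {"lm_head", "model.language_model.embed_tokens"}
SKIP_SUFFIXES = (
    "linear_attn.conv1d",
    "linear_attn.in_proj_a",
    "linear_attn.in_proj_b",
)
SKIP_PREFIXES = ("model.visual.",)


def stays_unquantized(name: str) -> bool:
    """Return True if a module name falls under the ignore rules listed above."""
    return (
        name in SKIP_EXACT
        or name.endswith(SKIP_SUFFIXES)
        or name.startswith(SKIP_PREFIXES)
    )


# Hypothetical module names used only to exercise the rules.
examples = [
    "lm_head",
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.visual.blocks.0.attn.qkv",
]
for name in examples:
    print(f"{name}: {'kept as-is' if stays_unquantized(name) else 'FP8'}")
```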

### Quantization Script

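```python
import json

import torch
from transformers import AutoProcessor, FineGrainedFP8Config, Qwen3_5ForConditionalGeneration

# Placeholder paths: point these at your local directories.
MODEL_DIR = "path/to/Qwen3.5-27B-ultra-uncensored-heretic-v1"  # BF16 source checkpoint
SAVE_DIR = "path/to/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8"  # output directory

# Load the ignore list from the official Qwen FP8 config and drop the MTP entries
with open("Qwen3.5-27B-FP8/config.json") as f:
    ref = json.load(f)
ref_ignore = ref["quantization_config"]["modules_to_not_convert"]
modules_to_not_convert = [m for m in ref_ignore if not m.startswith("mtp")]

# Block-wise FP8 weights (128x128 tiles) with dynamic activation scaling
qc = FineGrainedFP8Config(
    activation_scheme="dynamic",
    weight_block_size=(128, 128),
    modules_to_not_convert=modules_to_not_convert,
    dequantize=False,
)

# Weights are quantized on the fly while the BF16 checkpoint is loaded
processor = AutoProcessor.from_pretrained(MODEL_DIR)
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    MODEL_DIR,
    dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "30GiB", 1: "30GiB"},  # two GPUs; adjust for your hardware
    quantization_config=qc,
    low_cpu_mem_usage=True,
)

# Write the quantized weights and the processor files side by side
model.save_pretrained(SAVE_DIR, max_shard_size="5GB", save_original_format=False)
processor.save_pretrained(SAVE_DIR)
```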
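
After saving, it can be useful to confirm that the written `config.json` carries the expected quantization settings. The sketch below is one way to do that with the standard library only. The helpers `load_config` and `check_fp8_config` are hypothetical, and the key names `weight_block_size` and `activation_scheme` are assumed to be serialized under the same names as the `FineGrainedFP8Config` arguments; `quant_method` and `modules_to_not_convert` are the keys referenced elsewhere in this card.

```python
import json
from pathlib import Path


def load_config(save_dir: str) -> dict:
    """Read config.json from a saved checkpoint directory."""
    return json.loads(Path(save_dir, "config.json").read_text())


def check_fp8_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks as expected."""
    qc = config.get("quantization_config")
    if qc is None:
        return ["missing quantization_config"]
    problems = []
    if qc.get("quant_method") != "fp8":
        problems.append(f"quant_method is {qc.get('quant_method')!r}, expected 'fp8'")
    if list(qc.get("weight_block_size") or []) != [128, 128]:
        problems.append("weight_block_size is not [128, 128]")
    if qc.get("activation_scheme") != "dynamic":
        problems.append("activation_scheme is not 'dynamic'")
    if any(m.startswith("mtp") for m in qc.get("modules_to_not_convert", [])):
        problems.append("ignore list still contains MTP entries")
    return problems


# In-memory example (hypothetical values); for a real checkpoint use
# check_fp8_config(load_config(SAVE_DIR)).
example = {
    "quantization_config": {
        "quant_method": "fp8",
        "weight_block_size": [128, 128],
        "activation_scheme": "dynamic",
        "modules_to_not_convert": ["lm_head", "model.language_model.embed_tokens"],
    }
}
problems = check_fp8_config(example)
```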

## Evaluation Results

The BF16 baseline and the FP8 quantized model were both evaluated with lm_eval 0.4.11 on the vLLM backend. Scores are averaged over 2 seeds.

| Benchmark | BF16 | FP8 | Recovery (FP8 / BF16) |
|-----------|------|-----|----------|
| GSM8k-Platinum (5-shot) | 98.10% | 97.89% | 99.79% |
| IFEval inst_strict | 92.15% | 92.93% | 100.85% |
| IFEval prompt_strict | 89.74% | 90.58% | 100.93% |

Generation parameters: `temperature=1.0, top_p=0.95, top_k=64, max_gen_toks=16384`
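
The Recovery column in the table above appears to be the FP8 score expressed as a percentage of the BF16 score. A minimal sketch under that assumption, with a hypothetical helper named `recovery`:

```python
def recovery(bf16_score: float, fp8_score: float) -> float:
    """FP8 score as a percentage of the BF16 score."""
    return 100.0 * fp8_score / bf16_score


# (BF16, FP8) scores in percent, copied from the table above.
results = {
    "GSM8k-Platinum (5-shot)": (98.10, 97.89),
    "IFEval inst_strict": (92.15, 92.93),
    "IFEval prompt_strict": (89.74, 90.58),
}

for name, (bf16, fp8) in results.items():
    print(f"{name}: {recovery(bf16, fp8):.2f}%")
```

Because the inputs here are already rounded to two decimals, the last digit can differ slightly from the table, which was presumably computed from unrounded scores. A value above 100% means the FP8 model scored higher than BF16 on that benchmark, which over only 2 seeds is consistent with sampling noise.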

## Usage

```python
from vllm import LLM
model = LLM("kakrotto/Qwen3.5-27B-ultra-uncensored-heretic-v1-FP8")
```

## Disclaimer

This is an uncensored model. The quantizer (kakrotto) is not responsible for the model's outputs or any misuse. The FP8 quantization is intended to preserve the original model's behavior (see Evaluation Results above). Please use responsibly.

## Attribution

- **Source model:** [llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1](https://huggingface.co/llmfan46/Qwen3.5-27B-ultra-uncensored-heretic-v1)
- **Quantization reference:** [Qwen/Qwen3.5-27B-FP8](https://huggingface.co/Qwen/Qwen3.5-27B-FP8)