---
license: apache-2.0
base_model:
- huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated
- Qwen/Qwen3.5-35B-A3B
- Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
tags:
- qwen3.5
- moe
- vlm
- fp8
- quantized
- compressed-tensors
- vllm
- dgx-spark
pipeline_tag: image-text-to-text
library_name: transformers
---

# Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8

**A fast, vision-capable, FP8-quantized, abliterated and distilled Qwen3.5-35B model made for Nvidia DGX Spark (~80GB VRAM is needed for full functionality)**

## Model Lineage

The lineage starts with [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (BF16).

- Jackrong then created a text-only version that is less chatty and better with tools: [Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled)
- Huihui then removed all refusals and restored the vision capabilities in [huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated)
- Finally, I quantized it to FP8 using the conservative approach demonstrated by the Qwen team in [Qwen/Qwen3.5-35B-A3B-FP8](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-FP8)

## Performance

The conservative approach to FP8 quantization appears to cause minimal quality loss (formal quality testing is still pending) while raising generation speed from **31 t/s → 51 t/s** on DGX Spark, roughly a 1.6x speedup. With a 262k context and some headroom for the KV cache, it uses only 80GB of VRAM.

*In my experience, this is currently the best and fastest abliterated model for Nvidia DGX Spark*, and it also keeps all visual layers untouched.

I have not found a case where this model refuses to answer. It is especially fun to use with pictures ;). It also has the best "tooling" skills I have seen so far: it really likes to Google things first, even when it already knows the answer.

I plan to test the quality of the model's output later and update this page.

## Quantization Details

Quantized using the `FP8_DYNAMIC` scheme from [llmcompressor](https://github.com/vllm-project/llmcompressor) (`>=0.10`) with `compressed-tensors` serialization.

### Method

FP8_DYNAMIC is a **data-free** quantization scheme — no calibration dataset required. Weights are statically quantized to FP8 (per-channel, symmetric), while activations are dynamically quantized to FP8 (per-token, symmetric) at inference time.
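
To make the scheme above more concrete, here is a minimal pure-Python sketch of symmetric FP8 quantization with one scale per row. It is an illustration only, not the llmcompressor implementation: the helper names (`round_to_e4m3`, `quantize_rows`, `dequantize_rows`) are hypothetical, and it assumes the E4M3 flavour of FP8, whose largest finite value is 448. For weights a row stands for one output channel and the scale is computed once, offline; for activations a row stands for one token and the scale is recomputed on every forward pass.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (assumed format)


def round_to_e4m3(x: float) -> float:
    """Simulate saturating round-to-nearest onto the FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate out-of-range values
    _, e = math.frexp(abs(x))      # abs(x) == m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)           # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)        # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step


def quantize_rows(rows):
    """Symmetric quantization with one scale per row.

    Weights: row = output channel, scales are static.
    Activations: row = token, scales are computed at inference time.
    """
    q_rows, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0  # avoid dividing by zero
        q_rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Map FP8 grid values back to real values using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Two toy "output channels" with very different magnitudes.
weights = [[0.5, -1.0, 0.25], [100.0, 50.0, -25.0]]
q_weights, w_scales = quantize_rows(weights)
restored = dequantize_rows(q_weights, w_scales)
```

Because each row gets its own scale, a channel (or token) with small values is not crushed by a large-magnitude neighbour, which is why no calibration data is needed to pick a single global range.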

### Modules Excluded from Quantization

Matching the conservative strategy from [Qwen/Qwen3.5-35B-A3B-FP8](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-FP8):

| Module | Reason |
|---|---|
| `lm_head` | Output head — precision-sensitive |
| `embed_tokens` | Embedding layer |
| `linear_attn.conv1d`, `linear_attn.in_proj_a/b` | Linear attention layers |
| `mlp.gate`, `mlp.shared_expert_gate` | MoE router gates — routing precision matters |
| `model.visual.*` | Entire visual encoder kept at BF16 |
| `mtp.*` | Multi-token prediction layers |
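
The table above can be read as a set of name patterns that decide which modules are skipped. llmcompressor ignore lists can take `re:`-prefixed regular expressions; the sketch below only emulates that kind of matching with Python's `re` module. The exact patterns and the example module names are assumptions written for illustration, not the list used for this model (that one lives in the conversion scripts linked under Resources).

```python
import re

# Hypothetical patterns mirroring the exclusion table; not the actual recipe.
IGNORE_PATTERNS = [
    r"lm_head",
    r".*embed_tokens",
    r".*linear_attn\.conv1d",
    r".*linear_attn\.in_proj_[ab]",
    r".*mlp\.gate",                 # router gate only, not expert gate_proj
    r".*mlp\.shared_expert_gate",
    r"model\.visual\..*",
    r"mtp\..*",
]


def is_excluded(module_name: str) -> bool:
    """Return True if the module should stay in its original precision."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


# Example module names (illustrative, not read from the real checkpoint).
examples = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.3.gate_proj",
    "model.visual.blocks.0.attn.qkv",
]
kept_in_bf16 = [name for name in examples if is_excluded(name)]
```

Using `re.fullmatch` matters here: a looser search for `mlp.gate` would also catch the experts' `gate_proj` layers, which are the bulk of the weights and are meant to be quantized.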

### Post-processing

The model was quantized via `AutoModelForCausalLM` (the only loader proven to work with llmcompressor for this architecture), then post-processed:

1. **Weight key renaming** — `model.layers.X` → `model.language_model.layers.X` to match the `ConditionalGeneration` format expected by vLLM
2. **Visual encoder restoration** — BF16 vision encoder weights copied from the source model (since `AutoModelForCausalLM` strips them)
3. **Config restructuring** — `config.json` rebuilt from the source model's nested structure with the quantization config injected
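
As a rough illustration of step 1, the key renaming can be sketched as a plain dictionary rewrite. This is a simplified sketch, not the actual conversion script: `rename_key` and `rename_state_dict` are hypothetical helper names, only the `model.layers.` prefix named above is handled, and the toy state dict uses placeholder values instead of tensors.

```python
OLD_PREFIX = "model.layers."
NEW_PREFIX = "model.language_model.layers."


def rename_key(key: str) -> str:
    """Move a CausalLM-style layer key under the language_model namespace."""
    if key.startswith(OLD_PREFIX):
        return NEW_PREFIX + key[len(OLD_PREFIX):]
    return key  # keys outside model.layers.* are left untouched


def rename_state_dict(state_dict: dict) -> dict:
    """Return a new dict with renamed keys; values are passed through as-is."""
    return {rename_key(k): v for k, v in state_dict.items()}


# Toy example with placeholder values standing in for weight tensors.
toy = {
    "model.layers.0.mlp.gate.weight": "w0",
    "lm_head.weight": "w1",
}
renamed = rename_state_dict(toy)
```

The prefix check makes the rewrite idempotent: running it on already-renamed keys leaves them unchanged, since they start with `model.language_model.` rather than `model.layers.`.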

## Resources

- Conversion scripts: [github.com/ageev/AI/tree/main/converters/qwen35](https://github.com/ageev/AI/tree/main/converters/qwen35)
- Spark recipe for [spark-vllm-docker](https://github.com/eugr/spark-vllm-docker): [github.com/ageev/AI/tree/main/spark-recipes](https://github.com/ageev/AI/tree/main/spark-recipes)

## Disclaimer

It's an abliterated model. DO NOT use it if you think that all AIs need to be politically correct and boring.