---
license: apache-2.0
base_model:
- huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated
- Qwen/Qwen3.5-35B-A3B
- Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled
tags:
- qwen3.5
- moe
- vlm
- fp8
- quantized
- compressed-tensors
- vllm
- dgx-spark
pipeline_tag: image-text-to-text
library_name: transformers
---

# Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-FP8

**A fast, vision-capable, FP8-quantized, abliterated, distilled Qwen3.5-35B model made for Nvidia DGX Spark (~80 GB of VRAM is needed for full functionality)**

## Model Lineage

It started with [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) (BF16).

- Then Jackrong created a text-only version that is less chatty and better with tools: [Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled)
- Then Huihui removed all refusals and put back the vision capabilities in [huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated)
- Then I quantized it to FP8 using the conservative approach demonstrated by the Qwen team in [Qwen/Qwen3.5-35B-A3B-FP8](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-FP8)

## Performance

The conservative approach to FP8 quantization is expected to cause minimal quality loss (not benchmarked yet, see below), while raising generation speed from **31 t/s → 51 t/s** on DGX Spark. With a 262k context and some room for KV cache, it uses only about 80 GB of VRAM.

*As far as I can tell, this is currently the best and fastest abliterated model to run on Nvidia DGX Spark*, and it keeps all visual layers untouched.

I have not found a case where this model refuses to answer. It is especially fun to use with pictures ;). It also has the best "tooling" skills I have seen so far: it really likes to Google stuff first, even if it already knows the answer.

I plan to test the quality of the model's output later and update this page.

## Quantization Details

Quantized using the `FP8_DYNAMIC` scheme from [llmcompressor](https://github.com/vllm-project/llmcompressor) (`>=0.10`) with `compressed-tensors` serialization.

### Method

FP8_DYNAMIC is a **data-free** quantization scheme: no calibration dataset is required. Weights are statically quantized to FP8 (per-channel, symmetric), while activations are dynamically quantized to FP8 (per-token, symmetric) at inference time.

### Modules Excluded from Quantization

Matching the conservative strategy from [Qwen/Qwen3.5-35B-A3B-FP8](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-FP8):

| Module | Reason |
|---|---|
| `lm_head` | Output head (precision-sensitive) |
| `embed_tokens` | Embedding layer |
| `linear_attn.conv1d`, `linear_attn.in_proj_a/b` | Linear attention layers |
| `mlp.gate`, `mlp.shared_expert_gate` | MoE router gates (routing precision matters) |
| `model.visual.*` | Entire visual encoder kept at BF16 |
| `mtp.*` | Multi-token prediction layers |

### Post-processing

The model was quantized via `AutoModelForCausalLM` (the only loader proven to work with llmcompressor for this architecture), then post-processed:

1. **Weight key renaming**: `model.layers.X` → `model.language_model.layers.X` to match the `ConditionalGeneration` format expected by vLLM
2. **Visual encoder restoration**: BF16 vision encoder weights copied from the source model (since `AutoModelForCausalLM` strips them)
3. **Config restructuring**: `config.json` rebuilt from the source model's nested structure with the quantization config injected

## Resources

- Conversion scripts: [github.com/ageev/AI/tree/main/converters/qwen35](https://github.com/ageev/AI/tree/main/converters/qwen35)
- Spark recipe for [spark-vllm-docker](https://github.com/eugr/spark-vllm-docker): [github.com/ageev/AI/tree/main/spark-recipes](https://github.com/ageev/AI/tree/main/spark-recipes)

## Disclaimer

It's an abliterated model.
DO NOT use it if you think that all AIs need to be politically correct and boring.
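
## Example: weight key renaming

The weight key renaming step from the Post-processing section can be illustrated with a small sketch. This is not the actual conversion script (that lives in the repository linked under Resources); it is a minimal illustration, assuming the checkpoint is available as a plain `dict` mapping key names to tensors. The helper names `rename_key` and `rename_state_dict` and the toy keys are hypothetical. Only the `model.layers.X` → `model.language_model.layers.X` mapping stated above is handled; any other keys pass through unchanged, and a real conversion may need to remap more of them.

```python
# Hypothetical sketch of the key renaming step; not the author's script.
OLD_PREFIX = "model.layers."
NEW_PREFIX = "model.language_model.layers."


def rename_key(key: str) -> str:
    """Move a CausalLM-style layer key under the language_model namespace."""
    if key.startswith(OLD_PREFIX):
        return NEW_PREFIX + key[len(OLD_PREFIX):]
    # Keys outside model.layers.* are left as they are in this sketch.
    return key


def rename_state_dict(state_dict: dict) -> dict:
    """Return a new dict with renamed keys, refusing to overwrite on collision."""
    renamed = {}
    for key, tensor in state_dict.items():
        new_key = rename_key(key)
        if new_key in renamed:
            raise ValueError(f"key collision after renaming: {new_key}")
        renamed[new_key] = tensor
    return renamed


# Toy stand-in for a checkpoint: the values would be tensors in practice.
toy_checkpoint = {
    "model.layers.0.mlp.gate.weight": "tensor-a",
    "model.layers.1.self_attn.q_proj.weight": "tensor-b",
    "lm_head.weight": "tensor-c",
}

for name in rename_state_dict(toy_checkpoint):
    print(name)
```

Building a new dict instead of mutating in place keeps the source checkpoint intact, and the collision check guards against silently dropping a tensor if a renamed key already exists.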