---
license: mit
library_name: transformers
base_model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated
base_model_relation: quantized
tags:
- qwen
- qwen3
- qwen3.5
- moe
- abliterated
- uncensored
- sft
- dpo
- opus
- qwopus
- kimi
- kimi-k2
- distill
- multimodal
- vision
- mtp
- nvfp4
- fp4
- quantized
- compressed-tensors
- llm-compressor
- vllm
pipeline_tag: image-text-to-text
---
## Support & Community
**☕ If these models are useful to you, consider supporting my work — it funds compute for more & larger abliterations.**

[**buymeacoffee.com/oym.kuato**](https://buymeacoffee.com/oym.kuato)
💬 **Discord:** [discord.gg/rhUZY5GEZr](https://discord.gg/rhUZY5GEZr) · ₿ **Bitcoin:** `bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv`
---
# Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4
## Overview
**4-bit NVFP4** quantization of [`OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) — the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) (Mixture of Experts, ~10B active / 122B total).
This build packs the transformer weights to NVFP4 with **[LLM Compressor](https://github.com/vllm-project/llm-compressor)**, cutting the on-disk footprint from **~250 GB to ≈82 GB** while keeping the **vision tower**, **MTP head**, router gates, and the Gated-DeltaNet attention path in higher precision. It is multimodal (image + text), uncensored, and — despite 4-bit weights — **beats the full-precision Qwen3.5-122B-A10B baseline** on every benchmark we ran (see [Evaluation](#evaluation)).
It loads anywhere `compressed-tensors` is supported and is **auto-detected by vLLM** (no `--quantization` flag needed).
## Evaluation
Scores below were measured **on this NVFP4 build** and compared against the **full-precision (BF16) `Qwen/Qwen3.5-122B-A10B`** baseline:
| Benchmark | Qwen3.5-122B-A10B (BF16, baseline) | **Qwopus3.5 NVFP4 (this model)** |
|---|---|---|
| CTI | 64.8 | **71.5** |
| LiveCodeBench | 78.9 | **79.9** |
| BFCL | 72.2 | **85.6** |
Even after 4-bit (NVFP4) weight quantization, this model **outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks** — the Kimi-K2.6 distillation + reasoning-DPO healing more than offsets any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is contamination-controlled code generation.
## Quantization (NVFP4)
Produced with **LLM Compressor** using the `QuantizationModifier` recipe shipped in this repo (`recipe.yaml`).
- **Scheme:** `NVFP4` (`format: nvfp4-pack-quantized`) — 4-bit float weights in **micro-blocks of 16**, each block carrying an FP8 (`float8_e4m3fn`) scale. Weights are static; input activations are quantized dynamically (per-group, static-minmax).
- **Quantized:** all transformer `Linear` layers — attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors).
- **Left in higher precision (BF16):** the **vision tower** (`visual.*` — 333 tensors), the **MTP head** (`model_mtp.safetensors` — 785 tensors), `lm_head`, token embeddings, the MoE router gates (`mlp.gate`, `shared_expert_gate`), and the Gated-DeltaNet linear-attention path (`linear_attn.*`).
- **Architecture preserved:** `Qwen3_5MoeForConditionalGeneration` / `model_type: qwen3_5_moe`, so the checkpoint loads as a drop-in replacement for the base at the architecture level.
## Downloads / Other Formats
| Format | Repo | Use it for |
|--------|------|-----------|
| **Full BF16 weights** | [Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) | Transformers / vLLM, fine-tuning, requantizing |
| **NVFP4** (this repo) | [Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4) | vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included) |
| **GGUF (Q4_K_M)** | […-Kimi-K2.6-destill-healed-abliterated-GGUF](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-GGUF) | llama.cpp / LM Studio (text-only). MTP head included. |
| **MLX 4-bit** | […-Kimi-K2.6-destill-healed-abliterated-MLX-4bit](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MLX-4bit) | Apple Silicon / LM Studio (vision supported) |
## Files
| File | Description | Size |
|------|-------------|------|
| `model-00001-of-00002.safetensors` | NVFP4-packed language weights (4-bit + FP8 scales) + `lm_head` | ~50.0 GB |
| `model-00002-of-00002.safetensors` | NVFP4-packed language weights (tail) + BF16 vision tower | ~26.4 GB |
| `model_mtp.safetensors` | BF16 MTP head (785 tensors, 1 hidden layer) | ~5.0 GB |
| `model.safetensors.index.json` | Combined weight map | — |
| `config.json` | Multimodal config incl. `quantization_config` (`nvfp4-pack-quantized`) | — |
| `recipe.yaml` | LLM Compressor quantization recipe | — |
| `tokenizer*`, `chat_template.jinja`, `generation_config.json`, `*preprocessor_config.json` | Standard | — |
Total on disk: **≈81.5 GB** (~76 GiB).
## Usage (vLLM)
vLLM auto-detects the NVFP4 `compressed-tensors` format — no `--quantization` flag.
```bash
vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-model-len 262144
```
The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:
```bash
--speculative-config '{"num_speculative_tokens":1}'
```
> **Tip (Qwen3.5 MoE / Gated-DeltaNet):** if `torch.compile` errors in the GDN path during startup, add `--compilation-config '{"use_inductor_graph_partition":true}'`.
Text + vision both work through `AutoProcessor` / `AutoModelForImageTextToText` (via the `compressed-tensors` integration) for non-vLLM workflows.
## Vision & MTP
Both the **vision tower** and the **MTP (multi-token-prediction) head** are **included** and kept in **BF16**.
- **Vision** works as expected (image / video → text).
- **MTP**: the head is present and shape-compatible. It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup/quality gain and would benefit from retraining — shipped intact for completeness and forward-compatibility.
## Hardware
The NVFP4 weights are **≈82 GB** (vs ~250 GB for the BF16 release), so the model runs on a **single accelerator with ≥ 96 GB**: H200, B200, RTX PRO 6000 Blackwell, or a **128 GB unified-memory NVIDIA DGX Spark / GB10**. Native FP4 math requires a **Blackwell** GPU (compute capability ≥ 10.0 / sm_120+); on other hardware vLLM runs NVFP4 via FlashInfer/emulation.
## Notes
- **License**: MIT (inherits from the upstream Qwen3.5 base license terms)
- **Base Model**: [OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) → [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
- **Quantization**: NVFP4 (`nvfp4-pack-quantized`, group size 16) via LLM Compressor
- **Modality**: Text + Vision (image / video) + MTP
- **Architecture**: Qwen3 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MTP head
## Thanks
- [Jackrong](https://huggingface.co/Jackrong) — for the idea of **Qwopus** merges (Opus distillations on Qwen models).
- [wangzhang](https://huggingface.co/wangzhang) — for the wonderful **abliterix** framework, which was customized to do this abliteration.
- The **[LLM Compressor](https://github.com/vllm-project/llm-compressor)** and **vLLM** teams for the NVFP4 tooling.
## Disclaimer
Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.