---
license: mit
library_name: transformers
base_model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated
base_model_relation: quantized
tags:
- qwen
- qwen3
- qwen3.5
- moe
- abliterated
- uncensored
- sft
- dpo
- opus
- qwopus
- kimi
- kimi-k2
- distill
- multimodal
- vision
- mtp
- nvfp4
- fp4
- quantized
- compressed-tensors
- llm-compressor
- vllm
pipeline_tag: image-text-to-text
---
OpenYourMind
## Support & Community
**☕ If these models are useful to you, consider supporting my work: it funds compute for more and larger abliterations.**

Buy Me a Coffee: [**buymeacoffee.com/oym.kuato**](https://buymeacoffee.com/oym.kuato)

💬 **Discord:** [discord.gg/rhUZY5GEZr](https://discord.gg/rhUZY5GEZr)  ·  ₿ **Bitcoin:** `bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv`

---

# Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

## Overview

**4-bit NVFP4** quantization of [`OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated`](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated), the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) (Mixture of Experts, ~10B active / 122B total).

This build packs the transformer weights to NVFP4 with **[LLM Compressor](https://github.com/vllm-project/llm-compressor)**, cutting the on-disk footprint from **~250 GB to ≈82 GB** while keeping the **vision tower**, **MTP head**, router gates, and the Gated-DeltaNet attention path in higher precision. It is multimodal (image + text), uncensored, and, despite 4-bit weights, **beats the full-precision Qwen3.5-122B-A10B baseline** on every benchmark we ran (see [Evaluation](#evaluation)).

It loads anywhere `compressed-tensors` is supported and is **auto-detected by vLLM** (no `--quantization` flag needed).

## Evaluation

Scores below were measured **on this NVFP4 build** and compared against the **full-precision (BF16) `Qwen/Qwen3.5-122B-A10B`** baseline:

| Benchmark | Qwen3.5-122B-A10B (BF16, baseline) | **Qwopus3.5 NVFP4 (this model)** |
|---|---|---|
| CTI | 64.8 | **71.5** |
| LiveCodeBench | 78.9 | **79.9** |
| BFCL | 72.2 | **85.6** |

Even after 4-bit (NVFP4) weight quantization, this model **outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks**: the Kimi-K2.6 distillation + reasoning-DPO healing more than offsets any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is contamination-controlled code generation.

## Quantization (NVFP4)

Produced with **LLM Compressor** using the `QuantizationModifier` recipe shipped in this repo (`recipe.yaml`).
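
As a rough sanity check on the footprint figures quoted in the Overview, the sketch below estimates storage from bits per weight. It assumes the nominal 122B parameter count and an NVFP4 layout of 4-bit values in blocks of 16 with one 8-bit FP8 scale per block. The function and constant names are illustrative (not part of any library), and per-tensor metadata is ignored.

```python
# Back-of-the-envelope storage estimate (a sketch, not a measurement).
# Assumption: nominal parameter count of 122e9; real tensor counts differ slightly.
TOTAL_PARAMS = 122e9
BLOCK_SIZE = 16    # weights per NVFP4 micro-block
WEIGHT_BITS = 4    # FP4 value per weight
SCALE_BITS = 8     # one FP8 scale per block


def nvfp4_bits_per_weight(block_size=BLOCK_SIZE,
                          weight_bits=WEIGHT_BITS,
                          scale_bits=SCALE_BITS):
    """Effective bits per weight once the per-block scale is amortized."""
    return weight_bits + scale_bits / block_size


def size_gb(num_params, bits_per_weight):
    """Storage in decimal gigabytes for num_params weights."""
    return num_params * bits_per_weight / 8 / 1e9


bf16_gb = size_gb(TOTAL_PARAMS, 16)
# Hypothetical lower bound: as if every parameter were stored in NVFP4.
nvfp4_floor_gb = size_gb(TOTAL_PARAMS, nvfp4_bits_per_weight())

print(f"bits per weight (NVFP4): {nvfp4_bits_per_weight():.2f}")
print(f"BF16 estimate:           {bf16_gb:.1f} GB")
print(f"all-NVFP4 lower bound:   {nvfp4_floor_gb:.1f} GB")
```

The BF16 estimate (244 GB) is in line with the ~250 GB quoted above. The all-NVFP4 figure (about 68.6 GB) is only a lower bound: the shipped checkpoint is larger because several components stay in BF16.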

- **Scheme:** `NVFP4` (`format: nvfp4-pack-quantized`): 4-bit float weights in **micro-blocks of 16**, each block carrying an FP8 (`float8_e4m3fn`) scale. Weights are quantized statically; input activations are quantized dynamically (per-group, static-minmax).
- **Quantized:** all transformer `Linear` layers other than the exclusions listed next: attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors).
- **Left in higher precision (BF16):** the **vision tower** (`visual.*`, 333 tensors), the **MTP head** (`model_mtp.safetensors`, 785 tensors), `lm_head`, token embeddings, the MoE router gates (`mlp.gate`, `shared_expert_gate`), and the Gated-DeltaNet linear-attention path (`linear_attn.*`).
- **Architecture preserved:** `Qwen3_5MoeForConditionalGeneration` / `model_type: qwen3_5_moe`, so the checkpoint loads as a drop-in replacement for the base at the architecture level.

## Downloads / Other Formats

| Format | Repo | Use it for |
|--------|------|-----------|
| **Full BF16 weights** | [Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) | Transformers / vLLM, fine-tuning, requantizing |
| **NVFP4** (this repo) | [Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4) | vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included) |
| **GGUF (Q4_K_M)** | […-Kimi-K2.6-destill-healed-abliterated-GGUF](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-GGUF) | llama.cpp / LM Studio (text-only). MTP head included. |
| **MLX 4-bit** | […-Kimi-K2.6-destill-healed-abliterated-MLX-4bit](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated-MLX-4bit) | Apple Silicon / LM Studio (vision supported) |

## Files

| File | Description | Size |
|------|-------------|------|
| `model-00001-of-00002.safetensors` | NVFP4-packed language weights (4-bit + FP8 scales) + `lm_head` | ~50.0 GB |
| `model-00002-of-00002.safetensors` | NVFP4-packed language weights (tail) + BF16 vision tower | ~26.4 GB |
| `model_mtp.safetensors` | BF16 MTP head (785 tensors, 1 hidden layer) | ~5.0 GB |
| `model.safetensors.index.json` | Combined weight map | n/a |
| `config.json` | Multimodal config incl. `quantization_config` (`nvfp4-pack-quantized`) | n/a |
| `recipe.yaml` | LLM Compressor quantization recipe | n/a |
| `tokenizer*`, `chat_template.jinja`, `generation_config.json`, `*preprocessor_config.json` | Standard | n/a |

Total on disk: **≈81.5 GB** (~76 GiB).

## Usage (vLLM)

vLLM auto-detects the NVFP4 `compressed-tensors` format, so no `--quantization` flag is needed.

```bash
vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144
```

The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:

```bash
# Append this flag to the `vllm serve` command above.
--speculative-config '{"num_speculative_tokens":1}'
```

> **Tip (Qwen3.5 MoE / Gated-DeltaNet):** if `torch.compile` errors in the GDN path during startup, add `--compilation-config '{"use_inductor_graph_partition":true}'`.

For non-vLLM workflows, text and vision both work through `AutoProcessor` / `AutoModelForImageTextToText` (via the `compressed-tensors` integration).

## Vision & MTP

Both the **vision tower** and the **MTP (multi-token-prediction) head** are **included** and kept in **BF16**.

- **Vision** works as expected (image / video → text).
- **MTP**: the head is present and shape-compatible.
  It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup or quality gain and would benefit from retraining. It is shipped intact for completeness and forward-compatibility.

## Hardware

The NVFP4 weights are **≈82 GB** (vs ~250 GB for the BF16 release), so the model runs on a **single accelerator with ≥ 96 GB**: H200, B200, RTX PRO 6000 Blackwell, or a **128 GB unified-memory NVIDIA DGX Spark / GB10**. Native FP4 math requires a **Blackwell** GPU (compute capability ≥ 10.0 / sm_100+); on other hardware vLLM runs NVFP4 via FlashInfer/emulation.

## Notes

- **License**: MIT (inherits from the upstream Qwen3.5 base license terms)
- **Base Model**: [OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated](https://huggingface.co/OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated) → [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
- **Quantization**: NVFP4 (`nvfp4-pack-quantized`, group size 16) via LLM Compressor
- **Modality**: Text + Vision (image / video) + MTP
- **Architecture**: Qwen3.5 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MTP head

## Thanks

- [Jackrong](https://huggingface.co/Jackrong): for the idea of **Qwopus** merges (Opus distillations on Qwen models).
- [wangzhang](https://huggingface.co/wangzhang): for the wonderful **abliterix** framework, which was customized to do this abliteration.
- The **[LLM Compressor](https://github.com/vllm-project/llm-compressor)** and **vLLM** teams for the NVFP4 tooling.

## Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.