---
license: apache-2.0
base_model:
- huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated
- Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF
base_model_relation: quantized
quantized_by: xero0000
pipeline_tag: text-generation
library_name: gguf
tags:
- gguf
- qwen35moe
- moe
- mixed-precision
- imatrix
- ik_llama.cpp
- mtp
- speculative-decoding
- q2_k
- uncensored
- abliterated
---

# 🕴️ G-Man: Huihui Qwen3.6-35B-A3B abliterated Mixed q2_K + Transplanted MTP Head

*Black Mesa mixed-quant series · operates outside the rules (uncensored), and now arrives sooner.*

The [G-Man (plain mixed quant)](https://huggingface.co/xero0000/G-Man-35B-A3B-abliterated-mixed-q2k) mixed-precision GGUF of **Huihui Qwen3.6-35B-A3B abliterated**, with one addition: **the multi-token-prediction (MTP) head from [Qwopus-3.6-Coder](https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF) surgically grafted on**, enabling **self-speculative decoding** in [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp).

> TL;DR: identical outputs to the plain mixed quant, but **~87 tok/s on code and
> ~83 tok/s on prose** instead of ~78, on a dual-GPU desktop with 18 GB of VRAM
> in total (8 GB + 10 GB). Free speed, verified token-by-token.

---

## The head transplant

Huihui Qwen3.6-35B-A3B abliterated ships without an MTP head, but **Qwopus-Coder is a fine-tune of the same Qwen3.6-35B-A3B base**. Its `blk.40` nextn head (a full extra layer: attention + MoE + `eh_proj`/`enorm`/`hnorm` glue, ~0.55 GB) predicts the next-next token from hidden states, and this model's hidden space is a close sibling of the one that head was trained on. So we graft it:

1. append the donor's 20 `blk.40.*` tensors after the target's 40 layers,
2. bump `block_count` 40 → 41,
3. set `qwen35moe.nextn_predict_layers = 1`.

**Why this is safe:** speculative decoding *verifies every drafted token against this model*. A foreign head can never change the output distribution; a bad match only lowers the acceptance rate (= less speedup).
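
The safety argument above can be illustrated with a toy greedy verifier. This is only a sketch under stated assumptions, not ik_llama.cpp's implementation: `target_next` is a hypothetical stand-in for the main model's greedy choice, `verify_draft` is a made-up helper name, and the bonus token that real implementations emit after a fully accepted draft is left out. It shows that emitted tokens always come from the main model, so a poor draft head can only reduce how many tokens are accepted per step.

```bash
#!/usr/bin/env bash
# Toy greedy speculative verification (illustrative only, not ik_llama.cpp code).

# Hypothetical "main model": maps the previous token to its greedy next token.
target_next() {
  case "$1" in
    the) echo cat ;;
    cat) echo sat ;;
    sat) echo down ;;
    *)   echo "<eos>" ;;
  esac
}

# verify_draft PREV DRAFT...
# Prints "accepted=N" followed by the tokens emitted in this step.
# Every emitted token is the main model's own choice; a drafted token only
# counts as accepted when it equals that choice, and the first mismatch ends the step.
verify_draft() {
  local prev="$1"; shift
  local out=() accepted=0 want tok
  for tok in "$@"; do
    want="$(target_next "$prev")"
    out+=("$want")                      # always emit the main model's token
    [[ "$tok" == "$want" ]] || break    # rejected draft: stop after the correction
    accepted=$((accepted + 1))
    prev="$want"
  done
  echo "accepted=$accepted ${out[*]}"
}

verify_draft the cat sat   # well-matched head  -> accepted=2 cat sat
verify_draft the cat dog   # one bad draft      -> accepted=1 cat sat
verify_draft the dog dog   # badly matched head -> accepted=0 cat
```

In all three calls the emitted text is a prefix of what the main model would have produced on its own; only the number of tokens gained per step changes.
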
Measured across the series, throughput (used here as a proxy for acceptance) tracks fine-tune distance from the donor. In code/prose t/s: base Qwen3.6 93/90 > abliterated 87/83 > Ornith 83/80 > AgentWorld 82/79, all against a 78 t/s no-MTP baseline.

## Recipe

- **Quant layout** (same as the parent mixed quant): `ffn_*_exps` on blocks **13–26** → **`Q2_K`** with importance matrix (the CPU-offloaded set); everything else **`Q4_K`**; output-class **`Q6_K`**. ~4.9 bpw effective, ~18.7 GB.
- **Head:** `blk.40` nextn/MTP layer from Qwopus-Coder mixed-q2k (`Q4_K` experts), grafted byte-exact.
- The mixed layout exists because decode on CPU-offload rigs is RAM-bandwidth-bound: only the *offloaded-layer* bytes matter, so those get `Q2_K` while GPU-resident tensors keep `Q4_K` quality.

## Measured performance

Rig: RTX 3060 Ti 8 GB + RTX 3080 10 GB, DDR4, ik_llama.cpp, 128K ctx, greedy.

| workload | tok/s | vs 78 t/s no-MTP baseline |
|---|---|---|
| code generation | **87** | +12% |
| prose | **83** | +6% |

## How to run

**Requires [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)**: its `-mtp` flag is what drives the nextn head (mainline llama.cpp loads the file but ignores the head).

```bash
./llama-server -m Qwen3.6-35B-A3B-abliterated-mixed-q2k-MTP.gguf \
  --jinja --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --n-gpu-layers 99 --ctx-checkpoints 8 \
  -ot 'blk\.(1[3-9]|2[0-9])\.ffn_(up|gate|down)_exps\.weight=CPU' \
  --tensor-split 44,56 --ubatch-size 256 \
  -mtp --ctx-size-draft 8192 \
  --no-mmap --threads 8 --no-warmup
```

Notes for 18 GB-class rigs:

- The MTP draft context costs VRAM, which is why this profile runs **128K ctx** (not 256K), a small **8K draft context**, and pins expert layers 13–29 to CPU (three more than the plain mixed profile). If less VRAM is freed than this, flash-attention temporary allocations run out of memory mid-decode. With more VRAM, pin fewer layers and/or raise ctx.
- `--ctx-checkpoints 8` caps ik_llama.cpp's dynamically allocated SSM checkpoints (default 32 × 64 MiB ≈ 2 GB at deep context, an OOM trap on long agent sessions).
- Add `--reasoning off --reasoning-budget 0` for tool/browser loops (drop them for deep chat).
- Drop `-mtp` (and the VRAM diet: the reduced context and the three extra CPU-pinned layers) and it behaves exactly like the parent mixed quant at 256K.

## Credits

- Base model: [huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated) (Apache-2.0)
- MTP head donor: [Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF](https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF)
- Mixed quant, imatrix, transplant & profiling: [xero0000](https://huggingface.co/xero0000)
- Series: Gordon (base) · Kleiner (coder) · G-Man (uncensored) · Vortigaunt (reasoner) · Alyx (agentic)
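
As a sanity check on the `-ot` override in the run command, the sketch below lists which block indices its pattern selects. It assumes that bash's `[[ =~ ]]` extended-regex matching treats the pattern the same way ik_llama.cpp's tensor-name matching does, and it probes only the `ffn_up_exps` tensor name of each block; the 41-block range follows the grafted layout described in the transplant section.

```bash
#!/usr/bin/env bash
# Pattern copied from the -ot argument of the run command (the part before "=CPU").
re='blk\.(1[3-9]|2[0-9])\.ffn_(up|gate|down)_exps\.weight'

pinned=()
for i in {0..40}; do                      # 40 original layers plus the grafted blk.40
  name="blk.${i}.ffn_up_exps.weight"
  if [[ "$name" =~ $re ]]; then           # assumption: same semantics as the server's matcher
    pinned+=("$i")
  fi
done

echo "pinned blocks: ${pinned[*]}"
echo "count: ${#pinned[@]}"
```

Under those assumptions it reports blocks 13 through 29, 17 in total, which is the 13–26 `Q2_K` set plus the three extra layers mentioned in the notes.
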