🕴️ G-Man — Huihui Qwen3.6-35B-A3B abliterated Mixed q2_K + Transplanted MTP Head

Black Mesa mixed-quant series · operates outside the rules (uncensored) — and now arrives sooner.

The G-Man (plain mixed quant) mixed-precision GGUF of Huihui Qwen3.6-35B-A3B abliterated, with one addition: the multi-token-prediction (MTP) head from Qwopus-3.6-Coder surgically grafted on, enabling self-speculative decoding in ik_llama.cpp.

TL;DR: identical outputs to the plain mixed quant, but ~87 tok/s on code and ~83 tok/s on prose instead of ~78, on an 18 GB dual-GPU desktop. Free speed, verified token-by-token.


The head transplant

Huihui Qwen3.6-35B-A3B abliterated ships without an MTP head — but Qwopus-Coder is a fine-tune of the same Qwen3.6-35B-A3B base, and its blk.40 nextn head (a full extra layer: attention + MoE + eh_proj/enorm/hnorm glue, ~0.55 GB) predicts the next-next token from hidden states that this model's hidden space is a close sibling of. So we graft it:

  1. append the donor's 20 blk.40.* tensors after the target's 40 layers,
  2. bump block_count 40 → 41,
  3. set qwen35moe.nextn_predict_layers = 1.

Why this is safe: speculative decoding verifies every drafted token against this model. A foreign head can never change the output distribution — a bad match only lowers the acceptance rate (= less speedup). Measured across the series, acceptance tracks fine-tune distance from the donor: base Qwen3.6 93/90 (code/prose t/s) > abliterated 87/83

Ornith 83/80 > AgentWorld 82/79, all against a 78 t/s no-MTP baseline.

Recipe

  • Quant layout (same as the parent mixed quant): ffn_*_exps on blocks 13–26Q2_K with importance matrix (the CPU-offloaded set); everything else Q4_K; output-class Q6_K. ~4.9 bpw effective, ~18.7 GB.
  • Head: blk.40 nextn/MTP layer from Qwopus-Coder mixed-q2k (Q4_K experts), grafted byte-exact.
  • The mixed layout exists because decode on CPU-offload rigs is RAM-bandwidth-bound: only the offloaded-layer bytes matter, so those get Q2_K while GPU-resident tensors keep Q4_K quality.

Measured performance

Rig: RTX 3060 Ti 8 GB + RTX 3080 10 GB, DDR4, ik_llama.cpp, 128K ctx, greedy.

workload tok/s vs 78 t/s no-MTP baseline
code generation 87 +12%
prose 83 +6%

How to run

Requires ik_llama.cpp — its -mtp flag is what drives the nextn head (mainline llama.cpp loads the file but ignores the head).

./llama-server -m Qwen3.6-35B-A3B-abliterated-mixed-q2k-MTP.gguf \
  --jinja --cache-type-k q4_0 --cache-type-v q4_0 --flash-attn on \
  --ctx-size 131072 --parallel 1 --n-gpu-layers 99 --ctx-checkpoints 8 \
  -ot 'blk\.(1[3-9]|2[0-9])\.ffn_(up|gate|down)_exps\.weight=CPU' \
  --tensor-split 44,56 --ubatch-size 256 \
  -mtp --ctx-size-draft 8192 \
  --no-mmap --threads 8 --no-warmup

Notes for 18 GB-class rigs:

  • The MTP draft context costs VRAM, which is why this profile runs 128K ctx (not 256K), a small 8K draft context, and pins expert layers 13–29 to CPU (three more than the plain mixed profile). With less freed VRAM, flash-attention temp allocations OOM mid-decode. More VRAM → pin fewer layers and/or raise ctx.
  • --ctx-checkpoints 8 caps ik's dynamically allocated SSM checkpoints (default 32 × 64 MiB ≈ 2 GB at deep context — an OOM trap on long agent sessions).
  • Add --reasoning off --reasoning-budget 0 for tool/browser loops (drop them for deep chat).
  • Drop -mtp (and the diet) and it behaves exactly like the parent mixed quant at 256K.

Credits

Downloads last month
-
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for xero0000/G-Man-35B-A3B-abliterated-mixed-q2k-MTP