Yuanl-27B-v5-7 - MTP GGUF (Q8_0 + Q4_K_M)
MTP-aware GGUF builds of lkjiop8/Yuanl-27B-v5-7 applied on unsloth/Qwen3.6-27B, converted with llama.cpp b9180 convert_hf_to_gguf.py. The blk.64.nextn.* MTP speculative-decode head is present and verified.
Files
| File | Size | Notes |
|---|---|---|
Yuanl-27B-v5-7-MTP-Q8_0.gguf |
~29 GB | Near-lossless (8.50 BPW) |
Yuanl-27B-v5-7-MTP-Q4_K_M.gguf |
~17 GB | 4.92 BPW; fits 24 GB VRAM w/ Q8 KV cache @ ~32K |
About the MTP head (important)
The unsloth training/merge path strips the MTP block (unsloth_fixed_mtp=True), so the trained adapter and its merged model contain trunk weights only. To produce a usable --spec-type draft-mtp build, the 15 mtp.* tensors were re-injected from the original unsloth/Qwen3.6-27B before GGUF conversion, then bundled into blk.64.nextn.*.
Consequences:
- The MTP head is the base, untrained head (the LoRA only touched q/k/v/o/gate/up/down on the trunk). It was NOT separately trained.
- Real-world MTP acceptance varies by hardware and prompt distribution - measure on your stack with
--metrics. No acceptance-rate claim is made.
Requirements
- llama.cpp
b9180+ (Qwen 3.6 MTP support).
Launch (RTX 4090 dual, Q8_0, 120K ctx, MTP draft 2)
./llama-server \
-m Yuanl-27B-v5-7-MTP-Q8_0.gguf \
--alias Yuanl-27B-v5-7 \
--host 0.0.0.0 --port 8080 \
-c 122880 --parallel 1 \
-ngl 99 -sm layer -ts 23,25 -fa on \
-b 4096 -ub 2048 \
-t 8 -tb 16 --threads-http 8 \
-ctk q8_0 -ctv q8_0 \
--spec-type draft-mtp --spec-draft-n-max 2 \
--cache-reuse 256 --kv-unified \
--jinja --reasoning auto --reasoning-format deepseek \
--reasoning-budget 256 \
--temp 0.3 --top-p 0.85 --top-k 20 --min-p 0.05 \
--repeat-penalty 1.05 --repeat-last-n 256 --presence-penalty 0.10 \
--no-mmproj --no-webui --metrics \
--slot-save-path ./slots
v5-7 vs v5-6
- Single-epoch S1/S2/DPO (v5-6: 2/3/2)
- DPO LR back to 5e-7 (v5-6: 5e-6)
- cursor data forced epoch=1 + ~1334 phoenix ballast rows
- Inline uncensoring into S2 (no separate U1+U2)
- First-person persona, positive instructions, explicit length-matching rule
- LoRA r=16 (was 32); max_seq 16K (was 24K)
Responsible use
Designed for authorized red-team / research / academic use. The operator carries the legal and ethical authorization.
- Downloads last month
- 177
Hardware compatibility
Log In to add your hardware
4-bit
8-bit
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐ Ask for provider support
Model tree for lkjiop8/Yuanl-27B-v5-7-GGUF
Base model
Qwen/Qwen3.6-27B