Yuanl-27B-v5-7 - MTP GGUF (Q8_0 + Q4_K_M)

MTP-aware GGUF builds of lkjiop8/Yuanl-27B-v5-7 applied on unsloth/Qwen3.6-27B, converted with llama.cpp b9180 convert_hf_to_gguf.py. The blk.64.nextn.* MTP speculative-decode head is present and verified.

Files

File	Size	Notes
`Yuanl-27B-v5-7-MTP-Q8_0.gguf`	~29 GB	Near-lossless (8.50 BPW)
`Yuanl-27B-v5-7-MTP-Q4_K_M.gguf`	~17 GB	4.92 BPW; fits 24 GB VRAM w/ Q8 KV cache @ ~32K

About the MTP head (important)

The unsloth training/merge path strips the MTP block (unsloth_fixed_mtp=True), so the trained adapter and its merged model contain trunk weights only. To produce a usable --spec-type draft-mtp build, the 15 mtp.* tensors were re-injected from the original unsloth/Qwen3.6-27B before GGUF conversion, then bundled into blk.64.nextn.*.

Consequences:

The MTP head is the base, untrained head (the LoRA only touched q/k/v/o/gate/up/down on the trunk). It was NOT separately trained.
Real-world MTP acceptance varies by hardware and prompt distribution - measure on your stack with --metrics. No acceptance-rate claim is made.

Requirements

llama.cpp b9180+ (Qwen 3.6 MTP support).

Launch (RTX 4090 dual, Q8_0, 120K ctx, MTP draft 2)

./llama-server \
    -m Yuanl-27B-v5-7-MTP-Q8_0.gguf \
    --alias Yuanl-27B-v5-7 \
    --host 0.0.0.0 --port 8080 \
    -c 122880 --parallel 1 \
    -ngl 99 -sm layer -ts 23,25 -fa on \
    -b 4096 -ub 2048 \
    -t 8 -tb 16 --threads-http 8 \
    -ctk q8_0 -ctv q8_0 \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    --cache-reuse 256 --kv-unified \
    --jinja --reasoning auto --reasoning-format deepseek \
    --reasoning-budget 256 \
    --temp 0.3 --top-p 0.85 --top-k 20 --min-p 0.05 \
    --repeat-penalty 1.05 --repeat-last-n 256 --presence-penalty 0.10 \
    --no-mmproj --no-webui --metrics \
    --slot-save-path ./slots

v5-7 vs v5-6

Single-epoch S1/S2/DPO (v5-6: 2/3/2)
DPO LR back to 5e-7 (v5-6: 5e-6)
cursor data forced epoch=1 + ~1334 phoenix ballast rows
Inline uncensoring into S2 (no separate U1+U2)
First-person persona, positive instructions, explicit length-matching rule
LoRA r=16 (was 32); max_seq 16K (was 24K)

Responsible use

Designed for authorized red-team / research / academic use. The operator carries the legal and ethical authorization.

Downloads last month: 177

GGUF

Hardware compatibility

4-bit

8-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for lkjiop8/Yuanl-27B-v5-7-GGUF

Base model

Qwen/Qwen3.6-27B

Quantized

(517)

this model