🚨 Tool calling hotfix (2026-05-15)
⚠️ For tool calling / function calling, use chat_template_full_nothink.jinja (newly added), NOT chat_template_no_think_simple.jinja.
Earlier guidance recommended the 14-line chat_template_no_think_simple.jinja for clean no-think output, but that simplified template strips the {% if tools %} rendering block entirely. The model therefore never sees the tool descriptions in the system prompt, and tool_calls come back silently empty even when you pass a valid tools field.
The new canonical chat_template_full_nothink.jinja (155 lines) is the full 152-line chat_template.jinja with an explicit enable_thinking=false default prepended. It preserves all tools / <tool_call> / function rendering.
Recommended startup:
llama-server \
-m <gguf-file> \
--jinja \
--chat-template-file chat_template_full_nothink.jinja
Verification (should fire tool_calls, not text refusal):
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages":[{"role":"user","content":"北京今天天气怎么样?"}],
"tools":[{"type":"function","function":{"name":"get_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
"max_tokens": 200
}'
Expected response: tool_calls: [{function: {name: "get_weather", arguments: {"location":"北京"}}}] ✅
Legacy chat_template_no_think_simple.jinja is retained only for text-only chat scenarios with no tool calling needed. Do not use it if you need tool/function calling.
Long-term fix: next GGUF rebuild will patch the embedded chat_template so --chat-template-file is not required for tool calling.
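The verification request above can also be scripted. This is a minimal sketch, assuming the server from the recommended startup is listening on http://localhost:8080. The helper names (`build_weather_request`, `extract_tool_calls`, `check_server`) are illustrative and not part of llama.cpp; only the Python standard library is used.

```python
import json
import urllib.request

# Assumed local endpoint; adjust host/port to match your llama-server.
URL = "http://localhost:8080/v1/chat/completions"


def build_weather_request(question: str) -> dict:
    """Build the same get_weather tool-calling payload as the curl check."""
    return {
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }],
        "max_tokens": 200,
    }


def extract_tool_calls(response: dict) -> list:
    """Return the tool_calls list of the first choice, or [] if absent."""
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []


def check_server(url: str = URL) -> list:
    """Send the request to a running server and return its tool calls."""
    # The prompt asks "What is the weather like in Beijing today?"
    payload = build_weather_request("北京今天天气怎么样?")
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_tool_calls(json.load(resp))
```

If `check_server()` returns an empty list against a running server, the active template is probably dropping the tools block, which is the symptom this hotfix addresses.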
Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M
Q4_K_M GGUF quantization of Lynn-V4-Pro-Distill-Qwen-35B-A3B for single-card 24 GB GPU deployment (RTX 4090 / RTX 5090 / RTX PRO 6000 / Apple Silicon). A single `.gguf` file of ~22 GB that runs on llama.cpp / Ollama / koboldcpp out of the box.
Quick Specs
| Field | Value |
|---|---|
| Base | nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B (BF16, 65.4 GB) |
| Quant | Q4_K_M (llama.cpp K-quants, ~4.85 bits per weight avg) |
| Size | ~22 GB single .gguf |
| Runtime | llama.cpp / Ollama / koboldcpp / llama-cpp-python (uses qwen3_5_moe GGUF loader) |
| GPU memory target | 24 GB (fits RTX 4090 / 5090 / RTX PRO 6000) |
| Active params | ~3 B per token (35 B total, 256 experts MoE) |
| License | MIT (see LICENSE / NOTICE) |
Why Q4_K_M (and not NVFP4)?
This variant takes an independent loader path from Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN:
- Q4_K_M (this) uses llama.cpp's own `qwen3_5_moe` GGUF loader: works on any GPU/CPU, no Blackwell required, no dependency on the transformers/vLLM/SGLang ecosystem
- NVFP4 v8-RTN uses SGLang `dev-cu13` with `CompressedTensorsW4A4Nvfp4MoE`: requires Blackwell sm_120+
For single-card 24 GB consumer GPU users (RTX 4090 / 5090 / Apple Silicon), Q4_K_M is the recommended path. For Blackwell production servers, NVFP4 v8-RTN gives better throughput.
How to Run
Recommended: llama.cpp server (Docker CUDA)
docker run --gpus all -d --name lynn-v4-pro-q4km \
--restart=no \
-v $(pwd):/models \
-p 8080:8080 \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-m /models/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 99 \
--ctx-size 4096 \
--jinja \
--chat-template-file /models/chat_template.jinja
Key flags:
- `-ngl 99`: offload all transformer layers to the GPU (the Q4 model takes ~22 GB on a 24 GB GPU)
- `--ctx-size 4096`: adjust to your VRAM headroom (long context costs KV cache memory)
- `--jinja`: enable Jinja chat template processing
- `--chat-template-file chat_template.jinja`: override the embedded template with the no-think version (see notes below)
Ollama
# Create Modelfile pointing at the .gguf
cat > Modelfile <<EOF
FROM ./Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER stop "<|im_end|>"
EOF
ollama create lynn-v4-pro-q4km -f Modelfile
ollama run lynn-v4-pro-q4km "用一句话解释 MoE active params"
llama.cpp CLI (single prompt)
./llama-cli -m Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
-ngl 99 -p "用一句话解释 MoE active params" \
-n 256 --temp 0 --jinja \
--chat-template-file chat_template.jinja
Query example (OpenAI-compatible API)
import requests

# Prompt: "Explain MoE active params in one sentence"
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "lynn-v4-pro-q4km",
        "messages": [{"role": "user", "content": "用一句话解释 MoE active params"}],
        "max_tokens": 256,
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()  # fail on HTTP errors instead of a KeyError below
print(resp.json()["choices"][0]["message"]["content"])
⚠️ Important Behavioral Notes
1. Chat template — use the external chat_template.jinja (no-think default)
The chat_template.jinja shipped in this repo defaults to enable_thinking=False (training-inference aligned). However, the chat template embedded inside the .gguf binary was generated from the original Qwen3.6 base which defaults to enable_thinking=True.
Recommended: pass --chat-template-file chat_template.jinja to override the embedded template with the no-think version. This produces direct answers without <think>...</think> traces.
To enable thinking output, simply omit --chat-template-file (uses embedded template that defaults to think) or pass an alternative template.
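As an alternative to swapping template files, the thinking switch can be set per request. The evaluation setup in this card lists `chat_template_kwargs.enable_thinking=false` for the GGUF run. The sketch below assumes your llama-server build accepts a `chat_template_kwargs` field in the request body and that the active template reads `enable_thinking`. The helper name `chat_payload` is illustrative.

```python
def chat_payload(prompt: str, thinking: bool = False, max_tokens: int = 256) -> dict:
    """Build a chat-completions body that sets enable_thinking per request.

    Assumption: the server forwards chat_template_kwargs to the Jinja
    template, and the template honors enable_thinking.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
```

POST the returned dict as JSON to `/v1/chat/completions`. If the output still contains `<think>` traces, the template in use likely ignores the flag, and the `--chat-template-file` override remains the reliable route.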
2. GGUF runtime only
This variant is NOT loadable by:
- `transformers.AutoModelForCausalLM` (no GGUF parser)
- vLLM / SGLang / TRT-LLM (no GGUF support for this arch)
For those runtimes, use BF16 or NVFP4 v8-RTN variants.
3. CPU offload for ≤ 16 GB GPU (Apple Silicon / 3080 Ti)
Lower `-ngl` below 99 so that fewer layers go to the GPU and the rest run on the CPU:
- 24 GB GPU: `-ngl 99` (all on GPU)
- 16 GB GPU: `-ngl 60` (partial offload, ~50% slower)
- 12 GB GPU: `-ngl 40`
- Apple Silicon (M-series unified memory): `-ngl 99` if 24 GB+ unified
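The starting points above can be encoded as a small lookup. This is a sketch only: `suggest_ngl` is a hypothetical helper, the thresholds are copied from the list, and the right value for a given machine still depends on context size and whatever else is using VRAM.

```python
# (minimum VRAM in GB, suggested -ngl), copied from the guidance above.
NGL_BY_VRAM_GB = [(24, 99), (16, 60), (12, 40)]


def suggest_ngl(vram_gb: float):
    """Return the suggested -ngl for a VRAM size, or None below 12 GB."""
    for min_gb, ngl in NGL_BY_VRAM_GB:
        if vram_gb >= min_gb:
            return ngl
    return None  # the guidance above does not cover GPUs under 12 GB
```

For example, `suggest_ngl(16)` returns 60. Treat the result as a first guess and lower it if the server fails to allocate memory at startup.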
4. Known Limitation: Ollama empty <think> wrapper
When this GGUF is loaded via Ollama, responses include an empty <think></think> wrapper before the actual content. This comes from the embedded chat_template: the GGUF embeds the original Qwen3.6 think-on template, and Ollama doesn't expose a --chat-template-file flag. It is NOT a quality issue: the Ollama smoke test shows 6/6 prompts coherent (Chinese / coding / math / tool calls all correct).
To eliminate the wrapper, use llama.cpp server with --chat-template-file chat_template.jinja (the shipped no-think version overrides the embedded template). See "How to Run" above.
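If you need to stay on Ollama, the empty wrapper can be removed on the client side. This is a minimal sketch, assuming the wrapper appears at the very start of the response and contains only whitespace. `strip_empty_think` is an illustrative name, not an Ollama API.

```python
import re

# Matches a leading <think>...</think> block that contains only whitespace.
_EMPTY_THINK = re.compile(r"^\s*<think>\s*</think>\s*")


def strip_empty_think(text: str) -> str:
    """Drop an empty leading <think></think> wrapper; leave other text alone."""
    return _EMPTY_THINK.sub("", text, count=1)
```

A non-empty `<think>` block is deliberately left untouched, so real reasoning traces are not discarded by accident.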
⚡ Performance (NVIDIA GB10 sm_121 native build, 2026-05-14 full suite)
Measured with llama-server from local build-cuda-sm121 build, --parallel 4, temperature=0, streaming + stream_options.include_usage.
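The measurement harness itself is not included in this card. As a rough sketch of how TTFT and TPS can be derived from a streamed response with `stream_options.include_usage`, the function below takes timestamped SSE lines and reads the token count from the final usage chunk. The metric definitions are assumptions: TTFT is time to the first non-empty content delta, and TPS is completion tokens divided by the time from first token to last chunk. `summarize_stream` is a hypothetical name.

```python
import json


def summarize_stream(start: float, events: list) -> dict:
    """Compute TTFT (ms) and decode TPS from (arrival_time, sse_line) pairs."""
    first_token_at = None
    last_at = start
    completion_tokens = None
    for arrived, line in events:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        data = line[len("data: "):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        last_at = arrived
        choices = chunk.get("choices") or []
        if choices and (choices[0].get("delta") or {}).get("content"):
            if first_token_at is None:
                first_token_at = arrived
        usage = chunk.get("usage")
        if usage:
            completion_tokens = usage["completion_tokens"]
    if first_token_at is None or completion_tokens is None:
        raise ValueError("stream had no content or no usage chunk")
    decode_s = last_at - first_token_at
    return {
        "ttft_ms": (first_token_at - start) * 1000.0,
        "tps": completion_tokens / decode_s if decode_s > 0 else float("inf"),
    }
```

In practice the timestamps would come from `time.perf_counter()` taken as each line arrives. Numbers produced this way are only comparable with the tables below if the original harness used the same definitions.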
Single stream
| Mode | TPS (avg) | TPS (p50) | TTFT (avg) | Notes |
|---|---|---|---|---|
| Baseline | 74.9 tok/s ⭐ | 75.2 | 122 ms | 27% faster than NVFP4 single-stream |
| `首先...` Chinese thinking-prefix injection | 75.0 | 74.9 | 106 ms | prefix has no perf impact |
Concurrent (llama.cpp --parallel)
| N | Wall time | Aggregate TPS | Per-stream TPS | TTFT avg | Speedup vs N=1 |
|---|---|---|---|---|---|
| 1 | — | 74.9 | 74.9 | 122 ms | 1.0x |
| 2 | 9.53s | 107.5 | 54.7 | 166 ms | 1.43x |
| 4 | 23.04s | 88.9 ⚠️ | 22.4 | 231 ms | 1.19x regressed below N=2 |
| 8 | 28.31s | 144.7 | 18.3 | 325 ms | 1.93x |
| 16 | 32.51s | 252.0 | 15.9 | 268 ms | 3.36x |
⚠️ Q4 concurrent scaling is poor. llama.cpp --parallel is not real continuous batching — it's slot multiplexing with isolated KV cache. Slot-switching overhead eats the benefit when batch is small (N=4 is slower than N=2 in aggregate).
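To make the scaling claim concrete, the sketch below recomputes the speedup column from the aggregate TPS figures in the table and adds a per-slot efficiency (speedup divided by N). The efficiency metric and the name `scaling` are additions for illustration, not part of the original benchmark.

```python
# Aggregate tok/s per concurrency level N, copied from the table above.
AGGREGATE_TPS = {1: 74.9, 2: 107.5, 4: 88.9, 8: 144.7, 16: 252.0}


def scaling(aggregate_tps: dict) -> dict:
    """Map N -> (speedup vs N=1, efficiency = speedup / N)."""
    base = aggregate_tps[1]
    return {
        n: (tps / base, tps / base / n)
        for n, tps in aggregate_tps.items()
    }
```

With these inputs, efficiency falls from about 72% at N=2 to about 30% at N=4 and about 21% at N=16. The recomputed speedups can differ from the table in the last digit because the published figures are rounded.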
Q4_K_M vs NVFP4 — choose based on deployment shape
| Dimension | Q4_K_M (this) | NVFP4 v8-RTN | Winner |
|---|---|---|---|
| Single stream TPS | 74.9 ⭐ | 58.7 | 🏆 Q4 (+27%) |
| Single stream TTFT | 122 ms | 81 ms | 🏆 NVFP4 |
| N=4 aggregate | 89 ⚠️ | 220 | 🏆 NVFP4 (2.5x) |
| N=16 aggregate | 252 | 599 | 🏆 NVFP4 (2.4x) |
| Long ctx 32K | ⚠️ HTTP 400 | 48.4 tok/s ✓ | 🏆 NVFP4 |
| Deploy footprint | 24 GB consumer GPU | Blackwell + SGLang | 🏆 Q4 (consumer) |
| Tool-call E2E | ⚠️ JSON-emit only | ✅ verified | 🏆 NVFP4 |
| Positioning | Consumer single-user | Server multi-user | Complementary |
Q4_K_M is the right choice for 24 GB consumer GPU + 1-2 user workloads (single-stream is fastest, runs on any GPU without Blackwell requirement). For multi-user serving with tool-calling, use the NVFP4 v8-RTN sibling + SGLang.
🔬 Evaluation
This Q4_K_M variant inherits the V4-Pro Distill (BF16) verdict and adds its own quality and serving verification:
| Gate | Result |
|---|---|
| 4-gate eval (BF16) | NET_WIN +40.00pp (reports/) |
| Differential sanity (BF16) | 5/5 PASS, logits diff 0.75-0.91 |
| Q4_K_M Ollama smoke | ✅ 6/6 PASS coherent (5/13) |
| Q4_K_M llama.cpp serving | ✅ 6/6 quality coherent + 75 tok/s perf (5/14) |
| Q4_K_M V8/V9 v4 (75 questions, thinking-on default) | 74/75 = 98.7% (see caveat below) |
V8/V9 v4 supplementary eval — Q4 thinking-on default
The chat_template embedded in the Q4_K_M GGUF defaults to thinking-on. We ran the same 75-question V8/V9 v4 suite used for NVFP4 with that default:
| Suite | Pass / Total | Pass Rate |
|---|---|---|
| V8 tool calling | 15 / 15 | 100.0% ✅ |
| V9 holdout | 8 / 8 | 100.0% |
| V9 expanded (AIME / finance / GPQA / health / SQL) | 51 / 52 | 98.1% |
| Total | 74 / 75 | 98.7% |
Throughput during the eval averaged 70.67 tok/s (p50 72.02 tok/s).
⚠️ Important: Q4 score is +6.7pp above NVFP4-think due to chat_template wrap, not quantization quality
Same Lynn V4-Pro weights. Same V8/V9 v4 suite. Same temperature=0. The 6.7pp gap (Q4 98.7% vs NVFP4-think 92.0%) comes from the GGUF embedding a more concise thinking template than SGLang's chat_template.jinja. With a 4096-token budget, Q4 reaches the answer while NVFP4 hits the ceiling mid-derivation.
| qid | NVFP4 think tokens | Q4 tokens | What happened |
|---|---|---|---|
| v9_008 (gold "0.48 eV") | 4096 ⚠️ truncated | 668 ✓ | NVFP4 still unwinding hc/λ; Q4 reached K_max=0.4816 eV cleanly |
| v9p_aime_001 (gold 468) | 4096 ⚠️ truncated | 1796 ✓ | NVFP4 mid-coordinate; Q4 reached area=468 |
| v9p_fin_005 (gold 957.88) | 4096 ⚠️ truncated | 929 ✓ | NVFP4 stuck verifying; Q4 computed bond price |
Bottom line: NVFP4 and Q4_K_M serve the same model with the same quality. Q4's 98.7% is not "better quantization" — it's "more efficient thinking format inside the budget". Use the variant that matches your deployment.
Tool-call status
| Variant | Tool-call Status |
|---|---|
| BF16 merged | ⚠️ Serving E2E not validated |
| NVFP4 v8-RTN | ✅ Verified on SGLang + qwen3_coder parser |
| Q4_K_M (this) | ⚠️ Model emits valid JSON tool-call payloads; parsing back to tool_calls[] depends on llama.cpp / Ollama runtime integration (not systematically verified) |
Full raw V8/V9 results: evaluation/ on the v8-RTN sibling repo (kept together for direct comparison).
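When the runtime returns the tool call as plain text instead of a populated tool_calls[] field, a client-side fallback can recover it. The sketch below is an assumption-heavy illustration: it assumes the model wraps a JSON object with `name` and `arguments` keys in `<tool_call>...</tool_call>` tags, which this card does not specify in detail. `parse_tool_calls` is a hypothetical helper, not a llama.cpp or Ollama function.

```python
import json
import re

# Assumed text format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(content: str) -> list:
    """Extract tool calls from message text; skip blocks that are not valid JSON."""
    calls = []
    for match in _TOOL_CALL.finditer(content):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls
```

Check the raw model output first. If it uses a different tag or a non-JSON body, the pattern has to be adapted, and a runtime-side parser remains the preferable fix.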
Available Variants
| Variant | Format | Size | Use Case |
|---|---|---|---|
| BF16 merged | BF16 safetensors (16 shards) | 65.4 GB | ⭐ canonical, full precision |
| NVFP4 v8-RTN | compressed-tensors NVFP4 (W4A4) | 21 GB | Blackwell GPU production interim |
| Q4_K_M GGUF (this) | llama.cpp GGUF (K-quants) | ~22 GB | ⭐ 24 GB consumer GPU |
🔧 Toolkit & Reproducibility
- 🐙 MerkyorLynn/lynn-distill-toolkit — 4-gate eval / sanity / quant verify / ship pipeline
- 🐙 MerkyorLynn/qwen3.6-nvfp4-toolkit — NVFP4 quantization (sister toolkit for the v8-RTN variant)
Q4_K_M conversion was done with llama.cpp convert_hf_to_gguf.py followed by llama-quantize Q4_K_M. The base 35B-A3B Qwen3.6 architecture (Qwen3_5MoeForConditionalGeneration) is supported in llama.cpp master.
To verify locally:
git clone https://github.com/MerkyorLynn/lynn-distill-toolkit
cd lynn-distill-toolkit/eval
python quant_verify.py \
--runtime llama.cpp \
--base-url http://localhost:8080 \
--prompts prompts/quant_smoke_6prompts.json
🔗 Sibling Repositories
| Platform | Repo |
|---|---|
| 🤗 HF BF16 | nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B |
| 🤗 HF NVFP4 v8-RTN | nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN |
| 🪞 MS BF16 | Merkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B |
| 🪞 MS NVFP4 v8-RTN | Merkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN |
| 🪞 MS Q4_K_M | Merkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M |
| 🐙 GitHub Distill Toolkit | MerkyorLynn/lynn-distill-toolkit |
| 🐙 GitHub NVFP4 Toolkit | MerkyorLynn/qwen3.6-nvfp4-toolkit |
| 📝 Zhihu deep dive | Lynn V4-Pro Distill retrospective |
License & Attribution
- This repo: MIT License (see `LICENSE`)
- Base model: Qwen3.6-35B-A3B by Alibaba Cloud, Apache 2.0
- Path B notice: This is a derivative work in the R1-Distill style. See `NOTICE`.
Citation
@misc{lynn-v4-pro-distill-q4km,
title = {Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M},
author = {Lynn / MerkyorLynn},
year = {2026},
url = {https://huggingface.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M}
}
Evaluation (verified 2026-05-17 on DGX Spark, GB10 sm_121)
Public benchmark matrix
| Candidate | MMLU 500 (5-shot) | GPQA Diamond (198, 0-shot) |
|---|---|---|
| (evaluation data pending) | | |
Hardware + serving
- DGX Spark (GB10, SM121), 119 GiB unified memory, single card
- GGUF Q4_K_M: `llama.cpp:server-cuda` with `--n-gpu-layers 99 --ctx-size 4096 --jinja`, `chat_template_kwargs.enable_thinking=false`
- BF16: `lynn-engine server.openai_http` with `--dtype bfloat16`, `LYNN_MOE_IMPL=optimized`, `/no_think` directive
- NVFP4 v8-RTN: `SGLang dev-cu13` with `CompressedTensorsW4A4Nvfp4MoE` MoE path
- `temperature=0` greedy decode across all candidates
- MMLU: 500-question deterministic random sample (seed=20260517), 5-shot subject-matched dev examples
- GPQA: full Diamond split (198), 0-shot, stable per-question option shuffle
Key findings
- Lynn distillation gives +17-20pp on MMLU vs the Qwen3.6 base teacher in Q4_K_M / BF16 forms. Multi-subject knowledge transfer is real.
- NVFP4 v8-RTN (W4A4) quantization erases the Lynn distillation gain — V4-Pro V8-RTN drops to ~63% MMLU, matching the Qwen base teacher (4-bit activation is too coarse to preserve distillation features).
- Q4_K_M (W4A16) preserves distillation quality — only 1-3pp drop vs BF16 reference.
- GPQA Diamond saturates at 40-45% for all candidates including base; graduate-level reasoning does not transfer through distillation in this family.
Caveats
- 500-sample MMLU has ~±2pp sample variance vs full 14k MMLU
- GPQA Diamond 198 questions: ~±3pp variance
- Results are decode-only; first-token latency not reported here
- Per-subject MMLU breakdown + raw per-question JSONL + reproducible harness scripts available in the underlying Lynn engine `quality-eval-20260517/` artifact set
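The variance figures in the caveats can be sanity-checked with the binomial standard error, sqrt(p(1-p)/n). This is a sketch under the assumption of independent, identically sampled questions. The accuracies used in the note below are illustrative values taken from the ranges in the key findings, not measured results.

```python
import math


def standard_error_pp(accuracy: float, n: int) -> float:
    """One-sigma binomial standard error of an accuracy, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)
```

For an illustrative 80% accuracy on 500 MMLU questions this gives about 1.8pp, and for 42% on 198 GPQA questions about 3.5pp, in line with the rough ±2pp and ±3pp stated above. These are one-sigma values, so differences of a few points between candidates should be read with care.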