Instructions to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M",
	filename="Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Use Docker

docker model run hf.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

vLLM

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with Ollama:
```
ollama run hf.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M
```

Unsloth Studio

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M to start chatting

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with Docker Model Runner:
```
docker model run hf.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M
```

Lemonade

How to use nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M:Q4_K_M

Run and chat with the model

lemonade run user.Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M-Q4_K_M

List all available models

lemonade list

🚨 Tool calling hotfix (2026-05-15)

⚠️ For tool calling / function calling, use chat_template_full_nothink.jinja (newly added) — NOT chat_template_no_think_simple.jinja

Earlier guidance recommended the 14-line chat_template_no_think_simple.jinja for clean no-think output, but that simplified template strips the {% if tools %} rendering block entirely. As a result the model never sees the tool descriptions in the system prompt, and tool_calls comes back empty without any error, even when the request carries a valid tools field.

The new canonical chat_template_full_nothink.jinja (155 lines) is the full 152-line chat_template.jinja with an explicit enable_thinking=false default prepended. It preserves all tools / <tool_call> / function rendering.

Recommended startup:

llama-server \
  -m <gguf-file> \
  --jinja \
  --chat-template-file chat_template_full_nothink.jinja

Verification (should fire tool_calls, not text refusal):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "messages":[{"role":"user","content":"北京今天天气怎么样?"}],
  "tools":[{"type":"function","function":{"name":"get_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
  "max_tokens": 200
}'

Expected response: tool_calls: [{function: {name: "get_weather", arguments: {"location":"北京"}}}] ✅
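The same check can be scripted. The sketch below is a minimal example, assuming the server follows the OpenAI chat-completions response shape (`choices[0].message.tool_calls`); the helper names `build_tool_request`, `called_tools`, and `post_chat` are illustrative and not part of this repo.

```python
import json
import urllib.request


def build_tool_request(question: str) -> dict:
    """Build the same chat-completions payload as the curl check above."""
    return {
        "messages": [{"role": "user", "content": question}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }],
        "max_tokens": 200,
    }


def called_tools(response: dict) -> list[str]:
    """Return the names of all functions the model asked to call."""
    message = response["choices"][0]["message"]
    # A missing or null tool_calls field means the model answered in plain text.
    return [call["function"]["name"] for call in message.get("tool_calls") or []]


def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to an OpenAI-compatible /v1/chat/completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running, `called_tools(post_chat("http://localhost:8080", build_tool_request("北京今天天气怎么样?")))` should contain `"get_weather"` when the full template is active, and an empty list suggests the tools block was not rendered.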

Legacy chat_template_no_think_simple.jinja is retained only for text-only chat scenarios with no tool calling needed. Do not use it if you need tool/function calling.

Long-term fix: next GGUF rebuild will patch the embedded chat_template so --chat-template-file is not required for tool calling.

Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M

Q4_K_M GGUF quantization of Lynn-V4-Pro-Distill-Qwen-35B-A3B for single-card 24 GB GPU deployment (RTX 4090 / RTX 5090 / RTX PRO 6000 / Apple Silicon). Single .gguf file ~22 GB, runs on llama.cpp / Ollama / koboldcpp out of the box.

Quick Specs


| Field | Value |
|---|---|
| Base | `nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B` (BF16, 65.4 GB) |
| Quant | Q4_K_M (llama.cpp K-quants, ~4.85 bits per weight avg) |
| Size | ~22 GB single `.gguf` |
| Runtime | llama.cpp / Ollama / koboldcpp / llama-cpp-python (uses `qwen3_5_moe` GGUF loader) |
| GPU memory target | 24 GB (fits RTX 4090 / 5090 / RTX PRO 6000) |
| Active params | ~3 B per token (35 B total, 256 experts MoE) |
| License | MIT (see LICENSE / NOTICE) |
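The file size follows roughly from the parameter count and the average bits per weight. The snippet below is a back-of-the-envelope sketch that treats "35 B" as exactly 35e9 parameters and ignores GGUF metadata and tensors kept at higher precision, so it is an approximation rather than the exact file size.

```python
def estimated_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Assumption: nominal 35e9 parameters at the ~4.85 bits/weight quoted above.
q4_gb = estimated_size_gb(35e9, 4.85)
headroom_gb = 24 - q4_gb  # what a 24 GB card has left for KV cache and buffers

print(f"Q4_K_M weights: ~{q4_gb:.1f} GB, headroom on 24 GB: ~{headroom_gb:.1f} GB")
```

The estimate lands a little above 21 GB, consistent with the ~22 GB file, and shows why only a few GB remain for context on a 24 GB card.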

Why Q4_K_M (and not NVFP4)?

This variant takes an independent loader path from Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN:

Q4_K_M (this) uses llama.cpp's own qwen3_5_moe GGUF loader — works on any GPU/CPU, no Blackwell required, no dependency on transformers/vLLM/SGLang ecosystem
NVFP4 v8-RTN uses SGLang dev-cu13 with CompressedTensorsW4A4Nvfp4MoE — requires Blackwell sm_120+

For single-card 24 GB consumer GPU users (RTX 4090 / 5090 / Apple Silicon), Q4_K_M is the recommended path. For Blackwell production servers, NVFP4 v8-RTN gives better throughput.

How to Run

Recommended: llama.cpp server (Docker CUDA)

docker run --gpus all -d --name lynn-v4-pro-q4km \
  --restart=no \
  -v $(pwd):/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --ctx-size 4096 \
  --jinja \
  --chat-template-file /models/chat_template.jinja

Key flags:

-ngl 99 — offload all transformer layers to GPU (Q4 model fits ~22 GB on 24 GB GPU)
--ctx-size 4096 — adjust per your VRAM headroom (long context costs KV cache memory)
--jinja — enable Jinja chat template processing
--chat-template-file chat_template.jinja — override embedded template with the no-think version (see notes below)
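
Because the container is started detached, the command returns while the ~22 GB model may still be loading. A readiness poll can avoid sending requests too early. The sketch below assumes llama-server's `/health` endpoint answers HTTP 200 once the model is loaded; the function names, timeout, and polling interval are illustrative choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(base_url: str) -> bool:
    """Return True when GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as reply:
            return reply.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status while loading.
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Call probe() until it returns True or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: server_is_healthy("http://localhost:8080"))` before issuing the first chat request.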

Ollama

# Create Modelfile pointing at the .gguf
cat > Modelfile <<EOF
FROM ./Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER stop "<|im_end|>"
EOF

ollama create lynn-v4-pro-q4km -f Modelfile
ollama run lynn-v4-pro-q4km "用一句话解释 MoE active params"

llama.cpp CLI (single prompt)

./llama-cli -m Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
  -ngl 99 -p "用一句话解释 MoE active params" \
  -n 256 --temp 0 --jinja \
  --chat-template-file chat_template.jinja

Query example (OpenAI-compatible API)

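import requests

# Send one chat request to the local llama-server (OpenAI-compatible endpoint).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "lynn-v4-pro-q4km",
        "messages": [{"role": "user", "content": "用一句话解释 MoE active params"}],
        "max_tokens": 256,
        "temperature": 0,  # greedy decoding for reproducible output
    },
    timeout=120,
)
resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
print(resp.json()["choices"][0]["message"]["content"])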

⚠️ Important Behavioral Notes

1. Chat template — use the external `chat_template.jinja` (no-think default)

The chat_template.jinja shipped in this repo defaults to enable_thinking=False (training-inference aligned). However, the chat template embedded inside the .gguf binary was generated from the original Qwen3.6 base which defaults to enable_thinking=True.

Recommended: pass --chat-template-file chat_template.jinja to override the embedded template with the no-think version. This produces direct answers without <think>...</think> traces.

To enable thinking output, simply omit --chat-template-file (uses embedded template that defaults to think) or pass an alternative template.

2. GGUF runtime only

This variant is NOT loadable by:

transformers.AutoModelForCausalLM (no GGUF parser)
vLLM / SGLang / TRT-LLM (no GGUF support for this arch)

For those runtimes, use BF16 or NVFP4 v8-RTN variants.

3. CPU offload for ≤ 16 GB GPU (Apple Silicon / 3080 Ti)

-ngl sets how many layers are offloaded to the GPU. Lower it from 99 so that the remaining layers run on the CPU:

24 GB GPU: -ngl 99 (all on GPU)
16 GB GPU: -ngl 60 (partial offload, ~50% slower)
12 GB GPU: -ngl 40
Apple Silicon (M-series unified memory): -ngl 99 if 24 GB+ unified
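
The tiers above can be encoded as a small lookup when launch scripts need to choose the flag automatically. This is only a sketch of the table in this section: the thresholds are the ones listed here, the function name is hypothetical, and cards below 12 GB are rejected because no value is given for them.

```python
def suggested_ngl(vram_gb: float) -> int:
    """Map available GPU (or unified) memory in GB to the -ngl value listed above."""
    if vram_gb >= 24:
        return 99  # everything on the GPU
    if vram_gb >= 16:
        return 60  # partial offload
    if vram_gb >= 12:
        return 40
    raise ValueError("no -ngl recommendation is given for less than 12 GB")


for vram in (24, 16, 12):
    print(f"{vram} GB -> -ngl {suggested_ngl(vram)}")
```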

4. Known Limitation: Ollama empty `<think>` wrapper

When this GGUF is loaded through Ollama, each response starts with an empty <think></think> wrapper before the actual content. This comes from the embedded chat template (the GGUF embeds the original Qwen3.6 think-on template, and Ollama does not expose a --chat-template-file flag) and is not a quality issue: the Ollama smoke test was coherent on 6/6 prompts (Chinese, coding, math, and tool calls all correct).

To eliminate the wrapper, use llama.cpp server with --chat-template-file chat_template.jinja (the shipped no-think version overrides the embedded template). See "How to Run" above.
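
If switching runtimes is not an option, the wrapper can also be removed on the client side. The following is a minimal sketch, assuming the wrapper appears once at the very start of the reply; `strip_think` is a hypothetical helper, not something shipped in this repo.

```python
import re

# Matches one leading <think>...</think> block (empty or not) plus trailing whitespace.
_LEADING_THINK = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> wrapper and keep the rest unchanged."""
    return _LEADING_THINK.sub("", text, count=1)


print(strip_think("<think>\n\n</think>\n\nParis."))
```

Only a wrapper at the start of the text is removed, so a literal `<think>` tag quoted later in an answer is left alone.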

⚡ Performance (NVIDIA GB10 sm_121 native build, 2026-05-14 full suite)

Measured with llama-server from a local build-cuda-sm121 build, with --parallel 4, temperature=0, and streaming with stream_options.include_usage.
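
For readers reproducing these numbers, the sketch below shows one way TTFT and TPS could be derived from a recorded stream. It is not the harness used for this card: it assumes each streamed chunk was stored with its arrival time, that the final chunk carries `usage.completion_tokens` (as requested via stream_options.include_usage), and that TPS is defined as completion tokens divided by the time between the first content token and the last chunk.

```python
from typing import Optional


def summarize_stream(start_s: float, events: list[tuple[float, dict]]) -> dict:
    """Compute TTFT (ms) and decode TPS from (arrival_time_s, chunk) pairs."""
    first_token_s: Optional[float] = None
    last_s = start_s
    completion_tokens = 0
    for arrived_s, chunk in events:
        last_s = arrived_s
        usage = chunk.get("usage")
        if usage:
            completion_tokens = usage["completion_tokens"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # The first chunk with non-empty text marks the first token.
            if first_token_s is None and delta.get("content"):
                first_token_s = arrived_s
    if first_token_s is None:
        raise ValueError("stream contained no content tokens")
    decode_s = last_s - first_token_s
    return {
        "ttft_ms": (first_token_s - start_s) * 1000.0,
        "tps": completion_tokens / decode_s if decode_s > 0 else float("nan"),
    }
```

When thinking is enabled, some servers stream reasoning text in a separate delta field, in which case the first-token test would need to include that field as well.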

Single stream

| Mode | TPS (avg) | TPS (p50) | TTFT (avg) | Notes |
|---|---|---|---|---|
| Baseline | 74.9 tok/s ⭐ | 75.2 | 122 ms | 27% faster than NVFP4 single-stream |
| `首先...` Chinese thinking-prefix injection | 75.0 | 74.9 | 106 ms | prefix has no perf impact |

Concurrent (`llama.cpp --parallel`)

| N | Wall time | Aggregate TPS | Per-stream TPS | TTFT avg | Speedup vs N=1 |
|---|---|---|---|---|---|
| 1 | n/a | 74.9 | 74.9 | 122 ms | 1.0x |
| 2 | 9.53s | 107.5 | 54.7 | 166 ms | 1.43x |
| 4 | 23.04s | 88.9 ⚠️ | 22.4 | 231 ms | 1.19x (regressed below N=2) |
| 8 | 28.31s | 144.7 | 18.3 | 325 ms | 1.93x |
| 16 | 32.51s | 252.0 | 15.9 | 268 ms | 3.36x |

⚠️ Q4 concurrent scaling is poor. In these runs llama.cpp --parallel behaved like slot multiplexing with an isolated KV cache per slot rather than like the continuous batching of a dedicated serving engine. At small N the per-slot overhead outweighs the batching gain (N=4 is slower than N=2 in aggregate).
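
The scaling picture can be recomputed from the aggregate column of the table. The snippet below is a small sketch using only those five measurements; "efficiency" is defined here as aggregate TPS divided by N times the single-stream TPS, which is a definition chosen for this example.

```python
# Aggregate tok/s per concurrency level, copied from the table above.
AGGREGATE_TPS = {1: 74.9, 2: 107.5, 4: 88.9, 8: 144.7, 16: 252.0}


def scaling_report(aggregate: dict[int, float]) -> list[dict]:
    """Per-N speedup, efficiency, and whether throughput fell below the previous N."""
    single = aggregate[1]
    rows = []
    previous = None
    for n in sorted(aggregate):
        total = aggregate[n]
        rows.append({
            "n": n,
            "speedup": total / single,
            "efficiency": total / (n * single),  # 1.0 would be perfect scaling
            "regressed": previous is not None and total < previous,
        })
        previous = total
    return rows


for row in scaling_report(AGGREGATE_TPS):
    flag = " (regressed)" if row["regressed"] else ""
    print(f"N={row['n']:>2}: {row['speedup']:.2f}x, efficiency {row['efficiency']:.0%}{flag}")
```

N=4 is the only level flagged as a regression, and efficiency falls to roughly a fifth of ideal at N=16.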

Q4_K_M vs NVFP4 — choose based on deployment shape

| Dimension | Q4_K_M (this) | NVFP4 v8-RTN | Winner |
|---|---|---|---|
| Single stream TPS | 74.9 ⭐ | 58.7 | 🏆 Q4 (+27%) |
| Single stream TTFT | 122 ms | 81 ms | 🏆 NVFP4 |
| N=4 aggregate | 89 ⚠️ | 220 | 🏆 NVFP4 (2.5x) |
| N=16 aggregate | 252 | 599 | 🏆 NVFP4 (2.4x) |
| Long ctx 32K | ⚠️ HTTP 400 | 48.4 tok/s ✓ | 🏆 NVFP4 |
| Deploy footprint | 24 GB consumer GPU | Blackwell + SGLang | 🏆 Q4 (consumer) |
| Tool-call E2E | ⚠️ JSON-emit only | ✅ verified | 🏆 NVFP4 |
| Positioning | Consumer single-user | Server multi-user | Complementary |

Q4_K_M is the right choice for 24 GB consumer GPU + 1-2 user workloads (single-stream is fastest, runs on any GPU without Blackwell requirement). For multi-user serving with tool-calling, use the NVFP4 v8-RTN sibling + SGLang.

🔬 Evaluation

This Q4_K_M variant inherits the V4-Pro Distill (BF16) verdict and adds its own quality and serving verification:

| Gate | Result |
|---|---|
| 4-gate eval (BF16) | NET_WIN +40.00pp (reports/) |
| Differential sanity (BF16) | 5/5 PASS, logits diff 0.75-0.91 |
| Q4_K_M Ollama smoke | ✅ 6/6 PASS coherent (2026-05-13) |
| Q4_K_M llama.cpp serving | ✅ 6/6 quality coherent + 75 tok/s perf (2026-05-14) |
| Q4_K_M V8/V9 v4 (75 questions, thinking-on default) | 74/75 = 98.7% (see caveat below) |

V8/V9 v4 supplementary eval — Q4 thinking-on default

The chat template embedded in the Q4_K_M GGUF defaults to thinking on. We ran the same 75-question V8/V9 v4 suite used for NVFP4:

| Suite | Pass / Total | Pass Rate |
|---|---|---|
| V8 tool calling | 15 / 15 | 100.0% ✅ |
| V9 holdout | 8 / 8 | 100.0% |
| V9 expanded (AIME / finance / GPQA / health / SQL) | 51 / 52 | 98.1% |
| Total | 74 / 75 | 98.7% |

Throughput during the eval averaged 70.67 tok/s (p50 72.02 tok/s).

⚠️ Important: Q4 score is +6.7pp above NVFP4-think due to chat_template wrap, not quantization quality

Same Lynn V4-Pro weights, same V8/V9 v4 suite, same temperature=0. The 6.7pp gap (Q4 98.7% vs NVFP4-think 92.0%) arises because the GGUF embeds a more concise thinking template than SGLang's chat_template.jinja. Within the 4096-token budget, Q4 reaches the answer while NVFP4 hits the ceiling mid-derivation.

| qid | NVFP4 think tokens | Q4 tokens | What happened |
|---|---|---|---|
| v9_008 (gold "0.48 eV") | 4096 ⚠️ truncated | 668 ✓ | NVFP4 still unwinding `hc/λ`; Q4 reached `K_max=0.4816 eV` cleanly |
| v9p_aime_001 (gold 468) | 4096 ⚠️ truncated | 1796 ✓ | NVFP4 mid-coordinate; Q4 reached area=468 |
| v9p_fin_005 (gold 957.88) | 4096 ⚠️ truncated | 929 ✓ | NVFP4 stuck verifying; Q4 computed bond price |

Bottom line: NVFP4 and Q4_K_M serve the same model with the same quality. Q4's 98.7% is not "better quantization" — it's "more efficient thinking format inside the budget". Use the variant that matches your deployment.
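
When comparing variants it helps to separate answers that were wrong from answers that were cut off by the token budget. The sketch below assumes OpenAI-style responses, where a choice that hit max_tokens reports `finish_reason == "length"`; the record layout (`qid`, `correct`, `truncated`) is a hypothetical format for this example, not the layout of the published eval files.

```python
def was_truncated(response: dict) -> bool:
    """True if any choice stopped because it hit the max_tokens limit."""
    return any(
        choice.get("finish_reason") == "length"
        for choice in response.get("choices", [])
    )


def split_failures(records: list[dict]) -> dict:
    """Separate wrong answers from answers that were cut off by the budget."""
    wrong = [r["qid"] for r in records if not r["correct"] and not r["truncated"]]
    cut_off = [r["qid"] for r in records if not r["correct"] and r["truncated"]]
    return {"wrong": wrong, "truncated": cut_off}


# The three NVFP4 cases from the table above, all of which hit the 4096 ceiling.
nvfp4_cases = [
    {"qid": "v9_008", "correct": False, "truncated": True},
    {"qid": "v9p_aime_001", "correct": False, "truncated": True},
    {"qid": "v9p_fin_005", "correct": False, "truncated": True},
]
print(split_failures(nvfp4_cases))
```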

Tool-call status

| Variant | Tool-call Status |
|---|---|
| BF16 merged | ⚠️ Serving E2E not validated |
| NVFP4 v8-RTN | ✅ Verified on SGLang + `qwen3_coder` parser |
| Q4_K_M (this) | ⚠️ Model emits valid JSON tool-call payloads; parsing back to `tool_calls[]` depends on llama.cpp / Ollama runtime integration (not systematically verified) |
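
If the runtime returns the payload as plain text instead of a structured `tool_calls[]` array, a client can parse it as a fallback. The sketch below assumes the model wraps each call as `<tool_call>{"name": ..., "arguments": {...}}</tool_call>`; that wire format is an assumption for illustration and should be checked against the actual chat template output before relying on it.

```python
import json
import re

# One JSON object between <tool_call> tags; DOTALL lets it span several lines.
_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(content: str) -> list[dict]:
    """Extract {"name", "arguments"} dicts from raw assistant text."""
    calls = []
    for match in _TOOL_CALL.finditer(content):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the whole reply
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "北京"}}\n</tool_call>'
print(parse_tool_calls(raw))
```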

Full raw V8/V9 results: evaluation/ on the v8-RTN sibling repo (kept together for direct comparison).

Available Variants

| Variant | Format | Size | Use Case |
|---|---|---|---|
| BF16 merged | BF16 safetensors (16 shards) | 65.4 GB | ⭐ canonical, full precision |
| NVFP4 v8-RTN | compressed-tensors NVFP4 (W4A4) | 21 GB | Blackwell GPU production (interim) |
| Q4_K_M GGUF (this) | llama.cpp GGUF (K-quants) | ~22 GB | ⭐ 24 GB consumer GPU |

🔧 Toolkit & Reproducibility

🐙 MerkyorLynn/lynn-distill-toolkit — 4-gate eval / sanity / quant verify / ship pipeline
🐙 MerkyorLynn/qwen3.6-nvfp4-toolkit — NVFP4 quantization (sister toolkit for the v8-RTN variant)

Q4_K_M conversion was done with llama.cpp convert_hf_to_gguf.py followed by llama-quantize Q4_K_M. The base 35B-A3B Qwen3.6 architecture (Qwen3_5MoeForConditionalGeneration) is supported in llama.cpp master.

To verify locally:

git clone https://github.com/MerkyorLynn/lynn-distill-toolkit
cd lynn-distill-toolkit/eval
python quant_verify.py \
  --runtime llama.cpp \
  --base-url http://localhost:8080 \
  --prompts prompts/quant_smoke_6prompts.json

🔗 Sibling Repositories

| Platform | Repo |
|---|---|
| 🤗 HF BF16 | nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B |
| 🤗 HF NVFP4 v8-RTN | nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN |
| 🪞 MS BF16 | Merkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B |
| 🪞 MS NVFP4 v8-RTN | Merkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN |
| 🪞 MS Q4_K_M | Merkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M |
| 🐙 GitHub Distill Toolkit | MerkyorLynn/lynn-distill-toolkit |
| 🐙 GitHub NVFP4 Toolkit | MerkyorLynn/qwen3.6-nvfp4-toolkit |
| 📝 Zhihu deep dive | Lynn V4-Pro Distill retrospective |

License & Attribution

This repo: MIT License (see LICENSE)
Base model: Qwen3.6-35B-A3B by Alibaba Cloud, Apache 2.0
Path B notice: This is a derivative work in the R1-Distill style. See NOTICE.

Citation

@misc{lynn-v4-pro-distill-q4km,
  title = {Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M},
  author = {Lynn / MerkyorLynn},
  year = {2026},
  url = {https://huggingface.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M}
}

Evaluation (verified 2026-05-17 on DGX Spark, GB10 sm_121)

Public benchmark matrix

| Candidate | MMLU 500 (5-shot) | GPQA Diamond (198, 0-shot) |
|---|---|---|
| Evaluation data pending | | |

Hardware + serving

DGX Spark (GB10, SM121), 119 GiB unified memory, single card
GGUF Q4_K_M: llama.cpp:server-cuda with --n-gpu-layers 99 --ctx-size 4096 --jinja, chat_template_kwargs.enable_thinking=false
BF16: lynn-engine server.openai_http with --dtype bfloat16, LYNN_MOE_IMPL=optimized, /no_think directive
NVFP4 v8-RTN: SGLang dev-cu13 with CompressedTensorsW4A4Nvfp4MoE MoE path
temperature=0 greedy decode across all candidates
MMLU: 500-question deterministic random sample (seed=20260517), 5-shot subject-matched dev examples
GPQA: full Diamond split (198), 0-shot, stable per-question option shuffle
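
The sampling and shuffling described above can be made reproducible with seeded generators. The sketch below illustrates the general technique and is not the actual harness: the function names are hypothetical, and reusing the MMLU seed for the per-question option shuffle is an assumption made only for this example.

```python
import random


def sample_question_ids(all_ids, k: int = 500, seed: int = 20260517) -> list:
    """Draw k ids reproducibly; sorting first removes input-order dependence."""
    rng = random.Random(seed)
    return rng.sample(sorted(all_ids), k)


def shuffled_options(qid: str, options: list[str], seed: int = 20260517) -> list[str]:
    """Shuffle answer options with a generator keyed on the question id."""
    # String seeds are hashed deterministically, unlike the built-in hash().
    rng = random.Random(f"{seed}:{qid}")
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled


ids = sample_question_ids(range(14000))
print(len(ids), len(set(ids)))
print(shuffled_options("q-0001", ["A", "B", "C", "D"]))
```

Keying the shuffle on the question id means each question keeps the same option order across runs and across candidates, regardless of evaluation order.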

Key findings

Lynn distillation gives +17-20pp on MMLU vs the Qwen3.6 base teacher in Q4_K_M / BF16 forms. Multi-subject knowledge transfer is real.
NVFP4 v8-RTN (W4A4) quantization erases the Lynn distillation gain — V4-Pro V8-RTN drops to ~63% MMLU, matching the Qwen base teacher (4-bit activation is too coarse to preserve distillation features).
Q4_K_M (W4A16) preserves distillation quality — only 1-3pp drop vs BF16 reference.
GPQA Diamond saturates at 40-45% for all candidates including base; graduate-level reasoning does not transfer through distillation in this family.

Caveats

500-sample MMLU has ~±2pp sample variance vs full 14k MMLU
GPQA Diamond 198 questions: ~±3pp variance
Results are decode-only; first-token latency not reported here
Per-subject MMLU breakdown + raw per-question JSONL + reproducible harness scripts available in the underlying Lynn engine quality-eval-20260517/ artifact set
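
The variance figures above are close to one binomial standard error. The sketch below shows the arithmetic under assumed accuracies (about 80% on MMLU and about 42% on GPQA Diamond, picked as illustrative values in the ranges discussed in the findings); a 95% interval would be roughly twice as wide.

```python
import math


def standard_error_pp(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Assumed accuracies for illustration only, not reported results.
mmlu_se = standard_error_pp(0.80, 500)
gpqa_se = standard_error_pp(0.42, 198)

print(f"MMLU 500:  +/- {mmlu_se:.1f} pp (1 SE), +/- {1.96 * mmlu_se:.1f} pp (95%)")
print(f"GPQA 198:  +/- {gpqa_se:.1f} pp (1 SE), +/- {1.96 * gpqa_se:.1f} pp (95%)")
```

Differences between candidates that are smaller than these intervals should be read as ties rather than as wins.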

Downloads last month: 159

GGUF

Model size: 35B params
Architecture: qwen35moe
Quantization: 4-bit

Model tree for nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M

Base model: nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B
Quantized (2): this model
