🚨 Tool calling hotfix (2026-05-15)

⚠️ For tool calling / function calling, use chat_template_full_nothink.jinja (newly added) — NOT chat_template_no_think_simple.jinja

Earlier guidance recommended the 14-line chat_template_no_think_simple.jinja for clean no-think output, but that simplified template completely strips the {% if tools %} rendering block. The model never sees tool descriptions in the system prompt → tool_calls are silently empty even when you pass a valid tools field.

The new canonical chat_template_full_nothink.jinja (155 lines) = full 152-line chat_template.jinja with an explicit enable_thinking=false default prepended. It preserves all tools / <tool_call> / function rendering.

Recommended startup:

llama-server \
  -m <gguf-file> \
  --jinja \
  --chat-template-file chat_template_full_nothink.jinja

Verification (should fire tool_calls, not text refusal):

curl http://localhost:8080/v1/chat/completions -d '{
  "messages":[{"role":"user","content":"北京今天天气怎么样?"}],
  "tools":[{"type":"function","function":{"name":"get_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
  "max_tokens": 200
}'

Expected response: tool_calls: [{function: {name: "get_weather", arguments: {"location":"北京"}}}]

Legacy chat_template_no_think_simple.jinja is retained only for text-only chat scenarios with no tool calling needed. Do not use it if you need tool/function calling.

Long-term fix: next GGUF rebuild will patch the embedded chat_template so --chat-template-file is not required for tool calling.


Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M

Q4_K_M GGUF quantization of Lynn-V4-Pro-Distill-Qwen-35B-A3B for single-card 24 GB GPU deployment (RTX 4090 / RTX 5090 / RTX PRO 6000 / Apple Silicon). Single .gguf file ~22 GB, runs on llama.cpp / Ollama / koboldcpp out of the box.

Quick Specs

Base nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B (BF16, 65.4 GB)
Quant Q4_K_M (llama.cpp K-quants, ~4.85 bits per weight avg)
Size ~22 GB single .gguf
Runtime llama.cpp / Ollama / koboldcpp / llama-cpp-python (uses qwen3_5_moe GGUF loader)
GPU memory target 24 GB(fits RTX 4090 / 5090 / RTX PRO 6000)
Active params ~3 B per token (35 B total, 256 experts MoE)
License MIT (see LICENSE / NOTICE)

Why Q4_K_M (and not NVFP4)?

This variant takes an independent loader path from Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN:

  • Q4_K_M (this) uses llama.cpp's own qwen3_5_moe GGUF loader — works on any GPU/CPU, no Blackwell required, no dependency on transformers/vLLM/SGLang ecosystem
  • NVFP4 v8-RTN uses SGLang dev-cu13 with CompressedTensorsW4A4Nvfp4MoE — requires Blackwell sm_120+

For single-card 24 GB consumer GPU users (RTX 4090 / 5090 / Apple Silicon), Q4_K_M is the recommended path. For Blackwell production servers, NVFP4 v8-RTN gives better throughput.

How to Run

Recommended: llama.cpp server (Docker CUDA)

docker run --gpus all -d --name lynn-v4-pro-q4km \
  --restart=no \
  -v $(pwd):/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --ctx-size 4096 \
  --jinja \
  --chat-template-file /models/chat_template.jinja

Key flags:

  • -ngl 99 — offload all transformer layers to GPU (Q4 model fits ~22 GB on 24 GB GPU)
  • --ctx-size 4096 — adjust per your VRAM headroom (long context costs KV cache memory)
  • --jinja — enable Jinja chat template processing
  • --chat-template-file chat_template.jinjaoverride embedded template with the no-think version (see notes below)

Ollama

# Create Modelfile pointing at the .gguf
cat > Modelfile <<EOF
FROM ./Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER stop "<|im_end|>"
EOF

ollama create lynn-v4-pro-q4km -f Modelfile
ollama run lynn-v4-pro-q4km "用一句话解释 MoE active params"

llama.cpp CLI (single prompt)

./llama-cli -m Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
  -ngl 99 -p "用一句话解释 MoE active params" \
  -n 256 --temp 0 --jinja \
  --chat-template-file chat_template.jinja

Query example (OpenAI-compatible API)

import requests
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "lynn-v4-pro-q4km",
        "messages": [{"role": "user", "content": "用一句话解释 MoE active params"}],
        "max_tokens": 256,
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])

⚠️ Important Behavioral Notes

1. Chat template — use the external chat_template.jinja (no-think default)

The chat_template.jinja shipped in this repo defaults to enable_thinking=False (training-inference aligned). However, the chat template embedded inside the .gguf binary was generated from the original Qwen3.6 base which defaults to enable_thinking=True.

Recommended: pass --chat-template-file chat_template.jinja to override the embedded template with the no-think version. This produces direct answers without <think>...</think> traces.

To enable thinking output, simply omit --chat-template-file (uses embedded template that defaults to think) or pass an alternative template.

2. GGUF runtime only

This variant is NOT loadable by:

  • transformers.AutoModelForCausalLM (no GGUF parser)
  • vLLM / SGLang / TRT-LLM (no GGUF support for this arch)

For those runtimes, use BF16 or NVFP4 v8-RTN variants.

3. CPU offload for ≤ 16 GB GPU (Apple Silicon / 3080 Ti)

Reduce -ngl 99 to fewer layers to offload some to CPU:

  • 24 GB GPU: -ngl 99 (all on GPU)
  • 16 GB GPU: -ngl 60 (partial offload, ~50% slower)
  • 12 GB GPU: -ngl 40
  • Apple Silicon (M-series unified memory): -ngl 99 if 24 GB+ unified

4. Known Limitation: Ollama empty <think> wrapper

When loading this GGUF via Ollama, responses include an empty <think></think> wrapper before the actual content. This is embedded chat_template behavior (GGUF embeds the original Qwen3.6 think-on template, Ollama doesn't expose a --chat-template-file flag), NOT a quality issue — Ollama smoke shows 6/6 prompts coherent (Chinese / coding / math / tool calls all correct).

To eliminate the wrapper, use llama.cpp server with --chat-template-file chat_template.jinja (the shipped no-think version overrides the embedded template). See "How to Run" above.

⚡ Performance (NVIDIA GB10 sm_121 native build, 2026-05-14 full suite)

Measured with llama-server from local build-cuda-sm121 build, --parallel 4, temperature=0, streaming + stream_options.include_usage.

Single stream

Mode TPS (avg) TPS (p50) TTFT (avg) Notes
Baseline 74.9 tok/s 75.2 122 ms 27% faster than NVFP4 single-stream
首先... Chinese thinking-prefix injection 75.0 74.9 106 ms prefix has no perf impact

Concurrent (llama.cpp --parallel)

N Wall time Aggregate TPS Per-stream TPS TTFT avg Speedup vs N=1
1 74.9 74.9 122 ms 1.0x
2 9.53s 107.5 54.7 166 ms 1.43x
4 23.04s 88.9 ⚠️ 22.4 231 ms 1.19x regressed below N=2
8 28.31s 144.7 18.3 325 ms 1.93x
16 32.51s 252.0 15.9 268 ms 3.36x

⚠️ Q4 concurrent scaling is poor. llama.cpp --parallel is not real continuous batching — it's slot multiplexing with isolated KV cache. Slot-switching overhead eats the benefit when batch is small (N=4 is slower than N=2 in aggregate).

Q4_K_M vs NVFP4 — choose based on deployment shape

Dimension Q4_K_M (this) NVFP4 v8-RTN Winner
Single stream TPS 74.9 58.7 🏆 Q4 (+27%)
Single stream TTFT 122 ms 81 ms 🏆 NVFP4
N=4 aggregate 89 ⚠️ 220 🏆 NVFP4 (2.5x)
N=16 aggregate 252 599 🏆 NVFP4 (2.4x)
Long ctx 32K ⚠️ HTTP 400 48.4 tok/s ✓ 🏆 NVFP4
Deploy footprint 24 GB consumer GPU Blackwell + SGLang 🏆 Q4 (consumer)
Tool-call E2E ⚠️ JSON-emit only ✅ verified 🏆 NVFP4
Positioning Consumer single-user Server multi-user Complementary

Q4_K_M is the right choice for 24 GB consumer GPU + 1-2 user workloads (single-stream is fastest, runs on any GPU without Blackwell requirement). For multi-user serving with tool-calling, use the NVFP4 v8-RTN sibling + SGLang.

🔬 Evaluation

This Q4_K_M variant inherits the V Pro Distill (BF16) verdict, plus its own quality + serving verification:

Gate Result
4-gate eval (BF16) NET_WIN +40.00pp (reports/)
Differential sanity (BF16) 5/5 PASS, logits diff 0.75-0.91
Q4_K_M Ollama smoke 6/6 PASS coherent (5/13)
Q4_K_M llama.cpp serving 6/6 quality coherent + 75 tok/s perf (5/14)
Q4_K_M V8/V9 v4 (75 questions, thinking-on default) 74/75 = 98.7% (see caveat below)

V8/V9 v4 supplementary eval — Q4 thinking-on default

Q4_K_M ships with its GGUF chat_template thinking-default. We ran the same 75-question V8/V9 v4 suite used for NVFP4:

Suite Pass / Total Pass Rate
V8 tool calling 15 / 15 100.0%
V9 holdout 8 / 8 100.0%
V9 expanded (AIME / finance / GPQA / health / SQL) 51 / 52 98.1%
Total 74 / 75 98.7%

avg 70.67 tok/s / p50 72.02 tok/s during eval.

⚠️ Important: Q4 score is +6.7pp above NVFP4-think due to chat_template wrap, not quantization quality

Same Lynn V4-Pro weights. Same V8/V9 v4 suite. Same temperature=0. The 6.7pp gap (Q4 98.7% vs NVFP4-think 92.0%) comes from GGUF embeds a more concise thinking template than SGLang's chat_template.jinja. With a 4096-token budget, Q4 reaches the answer while NVFP4 hits the ceiling mid-derivation.

qid NVFP4 think tokens Q4 tokens What happened
v9_008 (gold "0.48 eV") 4096 ⚠️ truncated 668 NVFP4 still unwinding hc/λ; Q4 reached K_max=0.4816 eV cleanly
v9p_aime_001 (gold 468) 4096 ⚠️ truncated 1796 NVFP4 mid-coordinate; Q4 reached area=468
v9p_fin_005 (gold 957.88) 4096 ⚠️ truncated 929 NVFP4 stuck verifying; Q4 computed bond price

Bottom line: NVFP4 and Q4_K_M serve the same model with the same quality. Q4's 98.7% is not "better quantization" — it's "more efficient thinking format inside the budget". Use the variant that matches your deployment.

Tool-call status

Variant Tool-call Status
BF16 merged ⚠️ Serving E2E not validated
NVFP4 v8-RTN ✅ Verified on SGLang + qwen3_coder parser
Q4_K_M (this) ⚠️ Model emits valid JSON tool-call payloads; parsing back to tool_calls[] depends on llama.cpp / Ollama runtime integration (not systematically verified)

Full raw V8/V9 results: evaluation/ on the v8-RTN sibling repo (kept together for direct comparison).

Available Variants

Variant Format Size Use Case
BF16 merged BF16 safetensors (16 shards) 65.4 GB ⭐ canonical, full precision
NVFP4 v8-RTN compressed-tensors NVFP4 (W4A4) 21 GB Blackwell GPU production interim
Q4_K_M GGUF (this) llama.cpp GGUF (K-quants) ~22 GB ⭐ 24 GB consumer GPU

🔧 Toolkit & Reproducibility

Q4_K_M conversion was done with llama.cpp convert_hf_to_gguf.py followed by llama-quantize Q4_K_M. The base 35B-A3B Qwen3.6 architecture (Qwen3_5MoeForConditionalGeneration) is supported in llama.cpp master.

To verify locally:

git clone https://github.com/MerkyorLynn/lynn-distill-toolkit
cd lynn-distill-toolkit/eval
python quant_verify.py \
  --runtime llama.cpp \
  --base-url http://localhost:8080 \
  --prompts prompts/quant_smoke_6prompts.json

🔗 Sibling Repositories

License & Attribution

  • This repo: MIT License (see LICENSE)
  • Base model: Qwen3.6-35B-A3B by Alibaba Cloud, Apache 2.0
  • Path B notice: This is a derivative work in the R1-Distill style. See NOTICE.

Citation

@misc{lynn-v4-pro-distill-q4km,
  title = {Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M},
  author = {Lynn / MerkyorLynn},
  year = {2026},
  url = {https://huggingface.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M}
}

中文(简体)

Lynn-V4-Pro-Distill-Qwen-35B-A3BQ4_K_M GGUF 量化版,单卡 24 GB GPU 部署目标(RTX 4090 / 5090 / RTX PRO 6000 / Apple Silicon)。单 .gguf 文件 ~22 GB,直接走 llama.cpp / Ollama / koboldcpp

规格

基座 nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B(BF16,65.4 GB)
量化 Q4_K_M(llama.cpp K-quants,~4.85 bits / weight 均值)
大小 ~22 GB 单 .gguf
Runtime llama.cpp / Ollama / koboldcpp / llama-cpp-python(用 qwen3_5_moe GGUF loader)
GPU 目标 24 GB 显存(RTX 4090 / 5090 / RTX PRO 6000 都可)
激活参数 每 token ~3 B(总 35 B,256 expert MoE)
许可 MIT(详见 LICENSE / NOTICE)

为什么 Q4_K_M(而不用 NVFP4)?

本变体走独立 loader 路径,跟 Lynn-V4-Pro-Distill-Qwen-35B-A3B-NVFP4-v8-RTN 完全不同:

  • Q4_K_M(本仓库):llama.cpp 自己的 qwen3_5_moe GGUF loader → 任意 GPU/CPU 都能跑,不要求 Blackwell,不依赖 transformers/vLLM/SGLang 生态
  • NVFP4 v8-RTN:SGLang dev-cu13 + CompressedTensorsW4A4Nvfp4MoE,需要 Blackwell sm_120+

单卡 24 GB 消费 GPU 用户(4090 / 5090 / Apple Silicon)推荐 Q4_K_M。Blackwell 生产服务器走 NVFP4 v8-RTN 性能更好。

启动方式

推荐:llama.cpp server(Docker CUDA)

docker run --gpus all -d --name lynn-v4-pro-q4km \
  --restart=no \
  -v $(pwd):/models \
  -p 8080:8080 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --ctx-size 4096 \
  --jinja \
  --chat-template-file /models/chat_template.jinja

关键参数说明:

  • -ngl 99 — 全部 transformer 层放 GPU(Q4 模型 22G 刚好放 24G 显存)
  • --ctx-size 4096 — 视 VRAM 余量调整(长上下文吃 KV cache 显存)
  • --jinja — 启用 Jinja chat template
  • --chat-template-file chat_template.jinja覆盖嵌入模板用 no-think 版(详见下方注意点)

Ollama

cat > Modelfile <<EOF
FROM ./Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER stop "<|im_end|>"
EOF

ollama create lynn-v4-pro-q4km -f Modelfile
ollama run lynn-v4-pro-q4km "用一句话解释 MoE active params"

llama.cpp CLI(单条 prompt)

./llama-cli -m Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M.gguf \
  -ngl 99 -p "用一句话解释 MoE active params" \
  -n 256 --temp 0 --jinja \
  --chat-template-file chat_template.jinja

⚠️ 重要行为提示

1. Chat template — 用外部 chat_template.jinja(no-think 默认)

本仓库附的 chat_template.jinja 默认 enable_thinking=False(训推一致)。但 .gguf 二进制内嵌的 chat template 是从 Qwen3.6 原 base 转出来的,默认 enable_thinking=True

推荐做法:启动加 --chat-template-file chat_template.jinja 覆盖嵌入版,输出直接答案不带 <think>...</think> 痕迹。

要开 thinking → 省略 --chat-template-file 即用嵌入(默认 think)。

2. 只能用 GGUF runtime

本变体不能被以下加载:

  • transformers.AutoModelForCausalLM(没 GGUF parser)
  • vLLM / SGLang / TRT-LLM(GGUF 不支持本 arch)

那些场景用 BF16NVFP4 v8-RTN 变体。

3. ≤ 16 GB GPU CPU offload(Apple Silicon / 3080 Ti)

调小 -ngl 99 让部分层走 CPU:

  • 24 GB GPU:-ngl 99(全 GPU)
  • 16 GB GPU:-ngl 60(部分 offload,~50% 慢)
  • 12 GB GPU:-ngl 40
  • Apple Silicon(M 系 unified memory):24 GB+ 用 -ngl 99

4. 已知限制:Ollama 空 <think> 包装

通过 Ollama 加载本 GGUF 时,回复正文前会有一个空的 <think></think> wrapper。 这是 GGUF 内嵌 chat_template 的行为(GGUF 内嵌了原 Qwen3.6 default-think 模板, Ollama 不支持 --chat-template-file 覆盖),不是质量问题 — Ollama smoke 6/6 prompts 全 coherent(中文 / 代码 / 数学 / 工具调用都正确)。

要去掉 wrapper,用 llama.cpp server + --chat-template-file chat_template.jinja (仓库附带的 no-think 版覆盖嵌入模板)。详见上方 "启动方式" 章节。

⚡ 性能(NVIDIA GB10 sm_121 native build,2026-05-14 完整 suite)

llama-server 本地 build-cuda-sm121 编,--parallel 4,temperature=0,流式 + stream_options.include_usage

单流测试

模式 TPS avg TPS p50 TTFT avg 说明
Baseline 74.9 tok/s 75.2 122 ms 单流比 NVFP4 快 27%
首先... 中文 thinking-prefix 注入 75.0 74.9 106 ms prefix 不影响 perf

并发(llama.cpp --parallel)

N Wall time Aggregate TPS Per-stream TPS TTFT avg 加速比 vs N=1
1 74.9 74.9 122 ms 1.0x
2 9.53s 107.5 54.7 166 ms 1.43x
4 23.04s 88.9 ⚠️ 22.4 231 ms 1.19x 反而比 N=2 慢
8 28.31s 144.7 18.3 325 ms 1.93x
16 32.51s 252.0 15.9 268 ms 3.36x

⚠️ Q4 并发扩展性差llama.cpp --parallel 不是真 continuous batching,是 slot 多路复用 + KV cache 隔离。小 batch 时 slot 切换开销吃掉收益(N=4 反而比 N=2 慢)。

Q4_K_M vs NVFP4 — 按部署形态选

维度 Q4_K_M(本) NVFP4 v8-RTN 赢家
单流 TPS 74.9 58.7 🏆 Q4(+27%)
单流 TTFT 122 ms 81 ms 🏆 NVFP4
N=4 aggregate 89 ⚠️ 220 🏆 NVFP4(2.5x)
N=16 aggregate 252 599 🏆 NVFP4(2.4x)
Long ctx 32K ⚠️ HTTP 400 48.4 tok/s ✓ 🏆 NVFP4
部署门槛 24 GB 消费 GPU Blackwell + SGLang 🏆 Q4(消费级)
Tool-call E2E ⚠️ JSON-emit only ✅ verified 🏆 NVFP4
定位 消费单用户 服务器多用户 互补

Q4_K_M 适合 24 GB 消费 GPU + 1-2 user 工作负载(单流最快,任意 GPU 都能跑不要求 Blackwell)。多用户服务 + 工具调用,用 NVFP4 v8-RTN 兄弟版 + SGLang。

🔬 评测

Q4_K_M 继承 V Pro Distill(BF16)结论 + 自己 quality + serving 验证:

Gate 结果
4-gate 评测(BF16) NET_WIN +40.00pp(reports/)
Differential sanity(BF16) 5/5 PASS,logits diff 0.75-0.91
Q4_K_M Ollama smoke 6/6 PASS coherent(5/13)
Q4_K_M llama.cpp serving 6/6 quality coherent + 75 tok/s perf(5/14)
Q4_K_M V8/V9 v4(75 题,thinking 默认开) 74/75 = 98.7%(详见下方 caveat)

V8/V9 v4 补充评测 — Q4 默认 thinking-on

Q4_K_M GGUF 内嵌 chat_template 默认 thinking。跑了跟 NVFP4 同套 75 题 V8/V9 v4:

Suite Pass / Total 通过率
V8 tool calling 15 / 15 100.0%
V9 holdout 8 / 8 100.0%
V9 expanded(AIME / 金融 / GPQA / 健康 / SQL) 51 / 52 98.1%
总计 74 / 75 98.7%

评测期间 avg 70.67 tok/s / p50 72.02 tok/s。

⚠️ 重要:Q4 比 NVFP4-think 高 6.7pp 真因是 chat_template wrap,不是量化质量更好

同 Lynn V4-Pro weights,同 V8/V9 v4 题集,同 temperature=0。Q4 (98.7%) vs NVFP4-think (92.0%) 的 6.7pp 差距来自:GGUF 内嵌 thinking 模板比 SGLang 的 chat_template.jinja 紧凑。4096 token budget 内,Q4 能走到答,NVFP4 在推理途中打顶。

qid NVFP4 think tokens Q4 tokens 真相
v9_008(gold "0.48 eV") 4096 ⚠️ 打顶 668 NVFP4 还在展开 hc/λ;Q4 已得 K_max=0.4816 eV
v9p_aime_001(gold 468) 4096 ⚠️ 打顶 1796 NVFP4 在坐标展开;Q4 已到 area=468
v9p_fin_005(gold 957.88) 4096 ⚠️ 打顶 929 NVFP4 卡验算;Q4 算出债券价 957.88

结论:NVFP4 和 Q4_K_M 是同一模型同样质量。Q4 的 98.7% 不是"量化更好",是"thinking 格式更紧凑能在 budget 内完成"。选哪个变体看你部署形态。

Tool-call 状态

变体 Tool-call 状态
BF16 merged ⚠️ Serving E2E 未做
NVFP4 v8-RTN ✅ SGLang + qwen3_coder parser 已 verified
Q4_K_M(本) ⚠️ 模型能 emit 合法 JSON tool-call payload;parse 回 tool_calls[] 取决于 llama.cpp / Ollama runtime 集成(未系统验证)

完整原始 V8/V9 结果:evaluation/(放在 v8-RTN 兄弟仓库,直接对照)。

变体清单

变体 格式 大小 用途
BF16 merged BF16 safetensors(16 shards) 65.4 GB ⭐ 标准版,full precision
NVFP4 v8-RTN compressed-tensors NVFP4(W4A4) 21 GB Blackwell GPU 生产临时版
Q4_K_M GGUF(本仓库) llama.cpp GGUF(K-quants) ~22 GB ⭐ 24 GB 消费 GPU

🔧 工具链与可复现性

Q4_K_M 转换llama.cppconvert_hf_to_gguf.py + llama-quantize Q4_K_M。base 35B-A3B Qwen3.6 arch(Qwen3_5MoeForConditionalGeneration)在 llama.cpp master 已支持。

本地验证:

git clone https://github.com/MerkyorLynn/lynn-distill-toolkit
cd lynn-distill-toolkit/eval
python quant_verify.py \
  --runtime llama.cpp \
  --base-url http://localhost:8080 \
  --prompts prompts/quant_smoke_6prompts.json

🔗 关联仓库

详见上方英文段 "Sibling Repositories" 表格。

许可与归属

  • 本仓库:MIT 许可(见 LICENSE)
  • 基座模型:Qwen3.6-35B-A3B(阿里云,Apache 2.0)
  • Path B 声明:R1-Distill 风格衍生作品。完整归属见 NOTICE

引用

@misc{lynn-v4-pro-distill-q4km,
  title = {Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M},
  author = {Lynn / MerkyorLynn},
  year = {2026},
  url = {https://huggingface.co/nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M}
}

Evaluation — Verified 2026-05-17 (Spark DGX, GB10 sm_121)

Public benchmark matrix

Candidate MMLU 500 (5-shot) GPQA Diamond (198, 0-shot)
evaluation data pending

Hardware + serving

  • DGX Spark (GB10, SM121), 119 GiB unified memory, single card
  • GGUF Q4_K_M: llama.cpp:server-cuda with --n-gpu-layers 99 --ctx-size 4096 --jinja, chat_template_kwargs.enable_thinking=false
  • BF16: lynn-engine server.openai_http with --dtype bfloat16, LYNN_MOE_IMPL=optimized, /no_think directive
  • NVFP4 v8-RTN: SGLang dev-cu13 with CompressedTensorsW4A4Nvfp4MoE MoE path
  • temperature=0 greedy decode across all candidates
  • MMLU: 500-question deterministic random sample (seed=20260517), 5-shot subject-matched dev examples
  • GPQA: full Diamond split (198), 0-shot, stable per-question option shuffle

Key findings

  1. Lynn distillation gives +17-20pp on MMLU vs the Qwen3.6 base teacher in Q4_K_M / BF16 forms. Multi-subject knowledge transfer is real.
  2. NVFP4 v8-RTN (W4A4) quantization erases the Lynn distillation gain — V4-Pro V8-RTN drops to ~63% MMLU, matching the Qwen base teacher (4-bit activation is too coarse to preserve distillation features).
  3. Q4_K_M (W4A16) preserves distillation quality — only 1-3pp drop vs BF16 reference.
  4. GPQA Diamond saturates at 40-45% for all candidates including base; graduate-level reasoning does not transfer through distillation in this family.

Caveats

  • 500-sample MMLU has ~±2pp sample variance vs full 14k MMLU
  • GPQA Diamond 198 questions: ~±3pp variance
  • Results are decode-only; first-token latency not reported here
  • Per-subject MMLU breakdown + raw per-question JSONL + reproducible harness scripts available in the underlying Lynn engine quality-eval-20260517/ artifact set
Downloads last month
159
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for nerkyor/Lynn-V4-Pro-Distill-Qwen-35B-A3B-Q4_K_M

Quantized
(2)
this model