Qwen3.6-35B-A3B Q4_K_M (imatrix) — Lynn calibration
This repository ships an imatrix-calibrated Q4_K_M GGUF of the official Qwen/Qwen3.6-35B-A3B base weights, built and validated on a DGX Spark (GB10, sm_121) by Lynn.
Not a distillation. This is the original Qwen3.6-35B-A3B post-trained model, only quantized. It is not a derivative of Lynn V4-Pro or V4-Flash distillations — those are separate Lynn product lines.
Quality
Evaluated 2026-05-18 on DGX Spark sm_121 against the official upstream BF16 weights, identical decoding setup for all three rows.
| Variant | Size | MMLU 500 (5-shot) | GPQA Diamond 198 (0-shot) |
|---|---|---|---|
| Qwen3.6-35B-A3B BF16 (upstream) | 67 GB | 86.40% (432/500) | 45.45% (90/198) |
| Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix) — thinking off | 20 GB | 83.00% (415/500) | 50.00% (99/198) |
| Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix) — thinking on (32K) | 20 GB | 90.40% (452/500) | 80.70% excl_pf (92/198 raw · parse_fail 84/198) |
| Qwen3.6-35B-A3B W4A16 NVFP4 (Lynn-native, separate repo) | 23 GB | 84.40% (422/500) | 49.49% (98/198) |
The thinking-on (32K) row matches the client default in Lynn Desktop since the 5/20 strategy pivot to local-first.
excl_pf excludes the 84/198 long-thinking samples, mostly Organic Chemistry, that exhausted the 32K token budget before emitting a parseable final choice; the 80.70% figure is 92 correct out of the remaining 114. This is a max_tokens ceiling, not a model capability ceiling.
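The relationship between the raw and excl_pf figures can be checked with a few lines of arithmetic. This is a minimal sketch using the counts from the thinking-on row; the function names are illustrative and not part of any evaluation harness used for this card.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Plain accuracy as a percentage of all samples."""
    return 100.0 * correct / total


def excl_pf_accuracy_pct(correct: int, total: int, parse_fail: int) -> float:
    """Accuracy over only the samples that produced a parseable answer."""
    scored = total - parse_fail
    if scored <= 0:
        raise ValueError("no parseable samples left to score")
    return 100.0 * correct / scored


if __name__ == "__main__":
    # Counts from the thinking-on (32K) GPQA Diamond row.
    print(f"raw:     {accuracy_pct(92, 198):.2f}%")              # raw:     46.46%
    print(f"excl_pf: {excl_pf_accuracy_pct(92, 198, 84):.2f}%")  # excl_pf: 80.70%
```

The raw figure counts every budget-exhausted sample as wrong, which is why it sits close to the thinking-off score.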
Headline: with thinking off, MMLU drops only 3.4 pp vs. BF16, while GPQA Diamond actually rises 4.55 pp (99 vs. 90 correct out of 198), well above evaluation noise. With thinking on (32K), the same Q4_K_M reaches 90.40% MMLU and 80.70% GPQA Diamond excl_pf, which is the day-to-day client-experience number. Quality holds with a ~70% size reduction (20 GB vs. 67 GB).
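The headline deltas follow directly from the table. As a quick check, the sketch below recomputes them from the raw counts and sizes; the dictionary layout and helper names are made up for illustration.

```python
# Counts and sizes copied from the Quality table (thinking-off row vs. BF16).
ROWS = {
    "bf16": {"size_gb": 67, "mmlu": (432, 500), "gpqa": (90, 198)},
    "q4_k_m_off": {"size_gb": 20, "mmlu": (415, 500), "gpqa": (99, 198)},
}


def pct(correct: int, total: int) -> float:
    return 100.0 * correct / total


def delta_pp(variant: str, baseline: str, bench: str) -> float:
    """Accuracy difference in percentage points (variant minus baseline)."""
    return pct(*ROWS[variant][bench]) - pct(*ROWS[baseline][bench])


def size_reduction_pct(variant: str, baseline: str) -> float:
    """How much smaller the variant is, as a percentage of the baseline size."""
    return 100.0 * (1 - ROWS[variant]["size_gb"] / ROWS[baseline]["size_gb"])


if __name__ == "__main__":
    print(f"MMLU: {delta_pp('q4_k_m_off', 'bf16', 'mmlu'):+.2f} pp")        # MMLU: -3.40 pp
    print(f"GPQA: {delta_pp('q4_k_m_off', 'bf16', 'gpqa'):+.2f} pp")        # GPQA: +4.55 pp
    print(f"Size: {size_reduction_pct('q4_k_m_off', 'bf16'):.1f}% smaller")  # Size: 70.1% smaller
```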
This is the original Qwen3.6 base. Numbers above are not comparable to the Lynn V4-Pro / V4-Flash distillation lines; those have their own model cards.
Single-stream serving (DGX Spark, sm_121)
| Stack | Single-stream TPS (300 tok) |
|---|---|
| llama.cpp:server-cuda on this Q4_K_M | ~70 TPS |
| SGLang dev-cu13 on upstream BF16 | ~30 TPS |
On Spark, this Q4_K_M is currently the highest single-stream throughput configuration at 20 GB resident.
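The card does not describe how the TPS figures were measured. One plausible way to get a comparable number, sketched here purely as an assumption, is to time a single request and divide the completion token count from the response's `usage` field by the elapsed wall-clock time. The `send` callable is a hypothetical stand-in for whatever client issues the request, and wall-clock timing also includes prompt processing, so it slightly understates pure decode speed.

```python
import time
from typing import Any, Callable, Dict


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return completion_tokens / elapsed_s


def single_stream_tps(send: Callable[[], Dict[str, Any]]) -> float:
    """Time one blocking request and derive TPS from its usage block.

    Assumes an OpenAI-style response carrying usage.completion_tokens.
    """
    start = time.perf_counter()
    response = send()
    elapsed = time.perf_counter() - start
    return tokens_per_second(response["usage"]["completion_tokens"], elapsed)


if __name__ == "__main__":
    def fake_send() -> Dict[str, Any]:
        # Stand-in for a real HTTP call; the numbers are not measurements.
        time.sleep(0.01)
        return {"usage": {"completion_tokens": 300}}

    print(f"{single_stream_tps(fake_send):.0f} tokens/s (synthetic)")
```

To mirror the table's 300-token setting, the real request would cap generation at 300 tokens and run with a single concurrent stream.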
What's in this repo
| File | Size | Role |
|---|---|---|
| Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf | 20 GB | GGUF weights, ready for llama.cpp / Ollama / LM Studio |
| Qwen3.6-35B-A3B.imatrix | 192 MB | Calibration file used during quantization (reproducible) |
| README.md | this file | Model card |
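To fetch a single file without cloning the whole repository, something like the following could work. It is a sketch that assumes the repository id from the Citation URL, the default `main` revision, and the Hub's standard `resolve` download path; none of these are stated as a download procedure in this card.

```python
import shutil
import urllib.request
from urllib.parse import quote

REPO_ID = "nerkyor/Qwen3.6-35B-A3B-GGUF-imatrix"  # taken from the Citation URL
GGUF_NAME = "Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct download URL for one file (revision is an assumption)."""
    return f"https://huggingface.co/{repo_id}/resolve/{quote(revision)}/{quote(filename)}"


def download(url: str, dest: str) -> None:
    """Stream the response to disk so the 20 GB file never sits in memory."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


if __name__ == "__main__":
    # Only prints the URL; call download(url, GGUF_NAME) to actually fetch it.
    print(resolve_url(REPO_ID, GGUF_NAME))
```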
How to run
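llama.cpp server (OpenAI-compatible)

    # Start the server: all layers on GPU, one request slot, port 18002
    docker run -d --gpus all --ipc=host --shm-size=8g -p 18002:18002 \
      -v "$(pwd)/Qwen3.6-35B-A3B-GGUF-imatrix:/models" \
      ghcr.io/ggml-org/llama.cpp:server-cuda \
      -m /models/Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf \
      --host 0.0.0.0 --port 18002 \
      --n-gpu-layers 999 --ctx-size 4096 --parallel 1 \
      --alias Qwen3.6-35B-A3B-Q4_K_M

    # Send a test request to the OpenAI-compatible endpoint
    curl http://127.0.0.1:18002/v1/chat/completions -H "Content-Type: application/json" -d '{
      "model": "Qwen3.6-35B-A3B-Q4_K_M",
      "messages": [{"role": "user", "content": "Hello"}]
    }'

Note that --ctx-size 4096 caps prompt plus output at 4,096 tokens, well below the 32K thinking budget used in the Quality table; raise it to reproduce the thinking-on results.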
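The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes the server from the docker command is listening on 127.0.0.1:18002 with the alias shown there; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18002"   # port from the docker command
MODEL_ALIAS = "Qwen3.6-35B-A3B-Q4_K_M"  # value passed to --alias


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL_ALIAS) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 600.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. The generous timeout is a guess to leave room for long thinking-mode generations.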
Ollama

    ollama create qwen36-q4km -f Modelfile
    ollama run qwen36-q4km

(Modelfile template available in the Lynn engine docs.)
LM Studio / KTransformers
Load the GGUF file directly. The chat template is embedded in the GGUF metadata and matches Qwen3.6 upstream behaviour (thinking mode on by default).
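Because thinking mode is on by default, client code usually needs to separate the reasoning from the final answer. The sketch below assumes the reasoning arrives inline, wrapped in `<think>...</think>` tags as in earlier Qwen3 releases; this card does not confirm the tag format, and some servers may return the reasoning in a separate response field instead.

```python
from typing import Tuple

THINK_OPEN = "<think>"    # assumed tag, based on earlier Qwen3 releases
THINK_CLOSE = "</think>"


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    If the closing tag is missing, the generation was probably cut off
    mid-thought, so the answer is returned empty.
    """
    close_at = text.find(THINK_CLOSE)
    if close_at == -1:
        if THINK_OPEN in text:
            return text.split(THINK_OPEN, 1)[1].strip(), ""
        return "", text.strip()  # no thinking block at all
    reasoning = text[:close_at]
    if THINK_OPEN in reasoning:
        reasoning = reasoning.split(THINK_OPEN, 1)[1]
    answer = text[close_at + len(THINK_CLOSE):]
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    print(split_thinking("<think>2+2=4</think>\nAnswer: B"))  # ('2+2=4', 'Answer: B')
    print(split_thinking("<think>still working"))             # ('still working', '')
```

An empty answer in the truncated case is the same situation the Quality section counts as a parse failure.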
Calibration details
- Calibration corpus: diverse Chinese + English + code mix (see Qwen3.6-35B-A3B.imatrix for the reproducible activation statistics)
- Quantizer: llama.cpp Q4_K_M with imatrix scaling
- Base weights: identical sha256 chain to Qwen/Qwen3.6-35B-A3B
- Built on: DGX Spark (GB10), 2026-05-18
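A sha256 comparison is also the simplest way to confirm that a downloaded file is intact. The sketch below hashes a file in chunks and compares it to an expected digest; the card does not list the digests, so the expected value would have to be copied from the repository's file listing, which is an assumption here.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large GGUF files never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare against a published digest, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf", digest_from_repo)` returns `True` only if the local copy is byte-identical to the published file; `digest_from_repo` is a placeholder for the value you look up.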
License
Apache 2.0, inherited from Qwen/Qwen3.6-35B-A3B. See the LICENSE link in the model frontmatter.
Related Lynn artifacts
- Lynn engine: Spark / R6000 inference runtime with W4A8 + native MTP on the roadmap.
- Lynn W4A16 NVFP4 build of Qwen3.6-35B-A3B: separate repo (coming soon).
- Lynn agent desktop app: separate stack, integrates with the OpenAI-compatible endpoint above.
Citation
If you use this artifact in research or production, please cite the upstream Qwen3.6 model and note this calibration:
@misc{lynn2026qwen36q4km,
title = {Qwen3.6-35B-A3B Q4\_K\_M (imatrix), Lynn calibration},
author = {Lynn},
year = {2026},
url = {https://huggingface.co/nerkyor/Qwen3.6-35B-A3B-GGUF-imatrix},
}