Qwen3.6-35B-A3B Q4_K_M (imatrix) — Lynn calibration

This repository ships an imatrix-calibrated Q4_K_M GGUF of the official Qwen/Qwen3.6-35B-A3B weights, built and validated by Lynn on a DGX Spark (GB10, sm_121).

Not a distillation. This is the original Qwen3.6-35B-A3B post-trained model, only quantized. It is not a derivative of Lynn V4-Pro or V4-Flash distillations — those are separate Lynn product lines.

Quality

Evaluated on 2026-05-18 on DGX Spark (sm_121) against the official upstream BF16 weights, using the same decoding setup for every row apart from the thinking setting named in the row label.

Variant	Size	MMLU 500 (5-shot)	GPQA Diamond 198 (0-shot)
Qwen3.6-35B-A3B BF16 (upstream)	67 GB	86.40% (432/500)	45.45% (90/198)
Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix) — thinking off	20 GB	83.00% (415/500)	50.00% (99/198)
Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix) — thinking on (32K)	20 GB	90.40% (452/500)	80.70% excl_pf (92/198 raw · parse_fail 84/198)
Qwen3.6-35B-A3B W4A16 NVFP4 (Lynn-native, separate repo)	23 GB	84.40% (422/500)	49.49% (98/198)

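The thinking-on (32K) row matches the client default in Lynn Desktop (following the 5/20 strategy pivot to local-first). excl_pf is accuracy over parseable answers only: it excludes the 84 of 198 long-thinking samples (mostly Organic Chemistry) that exhausted the 32K budget before emitting a parseable final choice, leaving 92 correct out of 114. Counting those samples as wrong gives the raw score of 92/198 (46.46%). This is a max_tokens ceiling, not a model capability ceiling.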
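
The raw and excl_pf GPQA figures for the thinking-on row differ only in the denominator. A minimal shell sketch of that arithmetic, using the counts from the table (the function name `accuracy_pct` is illustrative and not part of any evaluation harness):

```shell
# accuracy_pct CORRECT TOTAL [EXCLUDED]
# Prints CORRECT / (TOTAL - EXCLUDED) as a percentage with two decimals.
accuracy_pct() {
  awk -v c="$1" -v t="$2" -v x="${3:-0}" 'BEGIN { printf "%.2f\n", 100 * c / (t - x) }'
}

# Thinking-on (32K) GPQA Diamond row: 92 correct, 198 questions, 84 parse failures.
echo "raw:     $(accuracy_pct 92 198)%"      # parse failures counted as wrong
echo "excl_pf: $(accuracy_pct 92 198 84)%"   # parse failures dropped from the denominator
```

This prints 46.46% for the raw score and 80.70% for excl_pf, so the gap between the two comes entirely from the 84 truncated samples.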

Headline: with thinking off, MMLU drops only 3.4 pp vs. BF16, and GPQA does not regress (it rises 4.55 pp, 99 vs. 90 correct out of 198). With thinking on (32K), the same Q4_K_M reaches 90.40% MMLU and 80.70% GPQA Diamond excl_pf, the number closest to day-to-day client experience. Quality holds with a ~70% size reduction (20 GB vs. 67 GB).
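
The headline deltas can be rechecked from the table with a few lines of shell. The sketch below also prints a sampling standard error for the GPQA difference; that part assumes independent binomial sampling over the 198 questions, which is an assumption made here and not something the card states about its own noise estimate. All function names are illustrative.

```shell
# delta_pp A B: prints A - B with an explicit sign, in percentage points.
delta_pp() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.2f\n", a - b }'
}

# ratio_pct A B: prints A / B as a whole-number percentage.
ratio_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * a / b }'
}

# diff_se_pp K1 K2 N: standard error (in pp) of the difference between two
# accuracies K1/N and K2/N, ASSUMING independent binomial sampling.
diff_se_pp() {
  awk -v k1="$1" -v k2="$2" -v n="$3" 'BEGIN {
    p1 = k1 / n; p2 = k2 / n
    printf "%.2f\n", 100 * sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
  }'
}

echo "MMLU, Q4_K_M thinking off vs BF16: $(delta_pp 83.00 86.40) pp"
echo "GPQA, Q4_K_M thinking off vs BF16: $(delta_pp 50.00 45.45) pp"
echo "GPQA sampling SE of that difference: $(diff_se_pp 99 90 198) pp"
echo "Q4_K_M size relative to BF16: $(ratio_pct 20 67)%"
```

With only 198 questions, sampling error of this kind is several points wide, which is worth keeping in mind when reading single-run GPQA differences; a paired per-question comparison would give a tighter estimate.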

These are the original upstream Qwen3.6 weights, quantized. The numbers above are not comparable to the Lynn V4-Pro / V4-Flash distillation lines; those have their own model cards.

Single-stream serving (DGX Spark, sm_121)

Stack	Single-stream TPS (300 tok)
`llama.cpp:server-cuda` on this Q4_K_M	~70 TPS
SGLang `dev-cu13` on upstream BF16	~30 TPS

Of the configurations measured on Spark, this Q4_K_M currently has the highest single-stream throughput, at 20 GB resident.
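
For readers who want a comparable single-stream number on their own hardware, here is a rough sketch against the OpenAI-compatible endpoint described under "How to run". It assumes `curl` and `jq` are installed and that the server fills in the standard `usage.completion_tokens` field. The prompt, the function names, and the use of total request time (which also includes prompt processing) are choices made for this sketch, not necessarily how the table above was measured.

```shell
# tps TOKENS SECONDS: tokens per second, one decimal place.
tps() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

# measure_tps BASE_URL MODEL: request up to 300 completion tokens on a single
# stream and divide the generated-token count by the total request time.
measure_tps() (
  tmp="$(mktemp)"
  elapsed="$(curl -s -o "$tmp" -w '%{time_total}' "$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$2\", \"max_tokens\": 300,
         \"messages\": [{\"role\": \"user\", \"content\": \"Write a short story about a lighthouse.\"}]}")"
  tokens="$(jq -r '.usage.completion_tokens' "$tmp")"
  rm -f "$tmp"
  tps "$tokens" "$elapsed"
)

# Example (needs the server from "How to run" to be up):
#   measure_tps http://127.0.0.1:18002 Qwen3.6-35B-A3B-Q4_K_M
```

Because the timing covers the whole request, this slightly understates pure decode speed; a short prompt keeps that effect small.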

What's in this repo

File	Size	Role
`Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf`	20 GB	GGUF weights, ready for `llama.cpp` / Ollama / LM Studio
`Qwen3.6-35B-A3B.imatrix`	192 MB	Calibration file used during quantization (reproducible)
`README.md`	this file	Model card

How to run

llama.cpp server (OpenAI-compatible)

# Serve the GGUF over an OpenAI-compatible API on port 18002.
# --ctx-size 4096 caps prompt plus output at 4096 tokens; raise it for long thinking traces.
docker run -d --gpus all --ipc=host --shm-size=8g -p 18002:18002 \
  -v "$(pwd)/Qwen3.6-35B-A3B-GGUF-imatrix":/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf \
  --host 0.0.0.0 --port 18002 \
  --n-gpu-layers 999 --ctx-size 4096 --parallel 1 \
  --alias Qwen3.6-35B-A3B-Q4_K_M

curl http://127.0.0.1:18002/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3.6-35B-A3B-Q4_K_M",
  "messages": [{"role": "user", "content": "Hello"}]
}'
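
Since the quality table distinguishes thinking off from thinking on, it can be useful to set the mode per request. The sketch below builds a request body that passes `enable_thinking` through the llama.cpp server's `chat_template_kwargs` field. That the chat template embedded in this GGUF honours `enable_thinking` (as Qwen3 templates do) is an assumption, and `chat_payload` is a hypothetical helper name.

```shell
# chat_payload MODEL PROMPT THINKING
# Prints a chat-completions request body. THINKING is "true" or "false" and is
# passed through chat_template_kwargs.enable_thinking (ASSUMPTION: the chat
# template embedded in this GGUF honours that variable, as Qwen3 templates do).
# PROMPT is inserted verbatim, so it must not contain quotes or backslashes.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "chat_template_kwargs": {"enable_thinking": %s}}\n' \
    "$1" "$2" "$3"
}

# Example (needs the server above to be running):
#   chat_payload Qwen3.6-35B-A3B-Q4_K_M "Hello" false |
#     curl -s http://127.0.0.1:18002/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

If thinking is left on, a 4096-token context leaves little room for long reasoning traces, so a larger `--ctx-size` is likely needed to approach the 32K setting used in the evaluation.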

Ollama

ollama create qwen36-q4km -f Modelfile
ollama run qwen36-q4km

(A Modelfile template is available in the Lynn engine docs.)
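
Until you have the template from the engine docs, a bare-bones Modelfile is enough to get `ollama create` working. The sketch below is an assumption about what a minimal file could look like, not the Lynn template: it only points Ollama at the local GGUF and sets `num_ctx` to mirror the 4096-token context of the llama.cpp example. `write_modelfile` is a hypothetical helper.

```shell
# write_modelfile GGUF_PATH [OUT_FILE]
# Writes a minimal Ollama Modelfile that points at a local GGUF file.
write_modelfile() {
  out="${2:-Modelfile}"
  cat > "$out" <<EOF
FROM $1
PARAMETER num_ctx 4096
EOF
}

write_modelfile ./Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf
cat Modelfile
```

Run it in the directory that holds the GGUF, then continue with the `ollama create` command above.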

LM Studio / KTransformers

Load the GGUF file directly. The chat template is embedded in the GGUF metadata and matches Qwen3.6 upstream behaviour (thinking mode on by default).

Calibration details

Calibration corpus: diverse Chinese + English + code mix (see Qwen3.6-35B-A3B.imatrix for the reproducible activation statistics)
Quantizer: llama.cpp Q4_K_M with imatrix scaling
Base weights: sha256 checksums identical to Qwen/Qwen3.6-35B-A3B
Built on: DGX Spark (GB10), 2026-05-18
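
For orientation, the usual llama.cpp flow behind an imatrix quant looks roughly like the sketch below: `llama-imatrix` collects activation statistics over a calibration text, and `llama-quantize` consumes them while writing Q4_K_M. The card does not give the exact commands or corpus file used for this build, so the input file names are hypothetical placeholders, the HF-to-GGUF conversion step is omitted, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# run CMD...: print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# quantize_q4km F16_GGUF CALIB_TXT OUT_PREFIX
# 1. llama-imatrix collects activation statistics over the calibration text.
# 2. llama-quantize uses those statistics while producing Q4_K_M.
quantize_q4km() {
  run llama-imatrix -m "$1" -f "$2" -o "$3.imatrix"
  run llama-quantize --imatrix "$3.imatrix" "$1" "$3-Q4_K_M-imatrix.gguf" Q4_K_M
}

# Dry run by default; set DRY_RUN=0 to execute.
# Both input file names below are hypothetical placeholders.
quantize_q4km Qwen3.6-35B-A3B-BF16.gguf calibration.txt Qwen3.6-35B-A3B
```

Reusing the shipped `Qwen3.6-35B-A3B.imatrix` file lets you skip the first step and go straight to `llama-quantize`.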

License

Apache 2.0, inherited from Qwen/Qwen3.6-35B-A3B. See the LICENSE link in the model frontmatter.

Related Lynn artifacts

Lynn engine: Spark / R6000 inference runtime with W4A8 + native MTP roadmap.
Lynn W4A16 NVFP4 build of Qwen3.6-35B-A3B: separate repo (coming soon).
Lynn agent desktop app: separate stack, integrates with the OpenAI-compatible endpoint above.

Citation

If you use this artifact in research or production, please cite the upstream Qwen3.6 model and note this calibration:

@misc{lynn2026qwen36q4km,
  title  = {Qwen3.6-35B-A3B Q4\_K\_M (imatrix), Lynn calibration},
  author = {Lynn},
  year   = {2026},
  url    = {https://huggingface.co/nerkyor/Qwen3.6-35B-A3B-GGUF-imatrix},
}

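Chinese version

Qwen3.6-35B-A3B Q4_K_M (imatrix-calibrated), Lynn quantization

This repository provides an imatrix-calibrated Q4_K_M GGUF quantization of the official Qwen/Qwen3.6-35B-A3B weights, built and validated by Lynn on a DGX Spark (GB10, sm_121).

Not a distilled version. This repository is a pure quantization of the official Qwen3.6-35B-A3B post-trained weights. It is not derived from the Lynn V4-Pro / V4-Flash distillation lines, which are two separate Lynn product lines with their own model cards.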

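Quality

Benchmarked on 2026-05-18 on DGX Spark sm_121 against the official BF16 weights, using the same decoding settings for every row apart from the thinking setting named in the row label.

Variant	Size	MMLU 500 (5-shot)	GPQA Diamond 198 (0-shot)
Qwen3.6-35B-A3B BF16 (official upstream)	67 GB	86.40% (432/500)	45.45% (90/198)
Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix), thinking off	20 GB	83.00% (415/500)	50.00% (99/198)
Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix), thinking on (32K)	20 GB	90.40% (452/500)	80.70% excl_pf (92/198 raw · parse_fail 84/198)
Qwen3.6-35B-A3B W4A16 NVFP4 (Lynn-native, separate repo)	23 GB	84.40% (422/500)	49.49% (98/198)

The thinking-on (32K) row is the default setting of the Lynn desktop client (since the 5/20 strategy pivot took effect). excl_pf excludes the 84 of 198 long-thinking samples (mostly Organic Chemistry) that used up the 32K-token budget without producing a parseable final choice. This is a max_tokens limit, not a limit on model capability.

Key takeaway: at 30% of the BF16 size, thinking-off MMLU drops only 3.4 pp, and GPQA does not regress (it rises 4.55 pp, 99 vs. 90 correct out of 198). With thinking on (32K), the same Q4_K_M scores 90.40% on MMLU and 80.70% on GPQA Diamond excl_pf, which is the day-to-day client-experience figure. The quality line is fully preserved.

This repository is the original Qwen3.6, so the numbers above cannot be compared with the Lynn V4-Pro / V4-Flash distillation lines; the distilled versions have their own model cards and their own evaluations.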

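Single-stream inference speed (DGX Spark, sm_121)

Stack	Single-stream TPS (300 tok)
`llama.cpp:server-cuda` on this Q4_K_M	~70 TPS
SGLang `dev-cu13` on official BF16	~30 TPS

On Spark, this Q4_K_M is for now the deployment with the highest single-stream throughput, at only 20 GB resident.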

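Repository contents

File	Size	Purpose
`Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf`	20 GB	GGUF weights, usable directly with `llama.cpp` / Ollama / LM Studio
`Qwen3.6-35B-A3B.imatrix`	192 MB	imatrix calibration data used during quantization (reproducible)
`README.md`	this file	Model card

Usage

llama.cpp server (OpenAI-compatible)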

# Serve the GGUF over an OpenAI-compatible API on port 18002.
# --ctx-size 4096 caps prompt plus output at 4096 tokens; raise it for long thinking traces.
docker run -d --gpus all --ipc=host --shm-size=8g -p 18002:18002 \
  -v "$(pwd)/Qwen3.6-35B-A3B-GGUF-imatrix":/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf \
  --host 0.0.0.0 --port 18002 \
  --n-gpu-layers 999 --ctx-size 4096 --parallel 1 \
  --alias Qwen3.6-35B-A3B-Q4_K_M

curl http://127.0.0.1:18002/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3.6-35B-A3B-Q4_K_M",
  "messages": [{"role": "user", "content": "你好"}]
}'

Ollama

ollama create qwen36-q4km -f Modelfile
ollama run qwen36-q4km

(For a Modelfile template, see the Lynn engine website.)

LM Studio / KTransformers

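Load the GGUF file directly. The chat template is embedded in the GGUF metadata and matches official Qwen3.6 behaviour (thinking mode on by default).

Quantization details

Calibration corpus: Chinese + English + code mix (see Qwen3.6-35B-A3B.imatrix for the reproducible activation statistics)
Quantizer: llama.cpp Q4_K_M with imatrix scaling
Base weights: sha256 checksums identical to Qwen/Qwen3.6-35B-A3B
Built on: DGX Spark (GB10), 2026-05-18

License

Apache 2.0, inherited from upstream Qwen/Qwen3.6-35B-A3B.

Related Lynn artifacts

Lynn engine: Spark / R6000 inference runtime, main line W4A8 + self-trained MTP
Lynn W4A16 NVFP4 build of Qwen3.6-35B-A3B: separate repo (coming soon)
Lynn agent desktop app: separate stack, connects directly to the OpenAI-compatible endpoint above

Citation

If you use this in research or production, please cite upstream Qwen3.6 and note this calibrated version: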

@misc{lynn2026qwen36q4km,
  title  = {Qwen3.6-35B-A3B Q4\_K\_M (imatrix), Lynn calibrated version},
  author = {Lynn},
  year   = {2026},
  url    = {https://huggingface.co/nerkyor/Qwen3.6-35B-A3B-GGUF-imatrix},
}
