Qwen3.6-35B-A3B Q4_K_M (imatrix) — Lynn calibration

This repository ships an imatrix-calibrated Q4_K_M GGUF of the official Qwen/Qwen3.6-35B-A3B weights, built and validated by Lynn on a DGX Spark (GB10, sm_121).

Not a distillation. This is the original Qwen3.6-35B-A3B post-trained model, only quantized. It is not a derivative of Lynn V4-Pro or V4-Flash distillations — those are separate Lynn product lines.

Quality

Evaluated 2026-05-18 on DGX Spark (sm_121) against the official upstream BF16 weights, with an identical decoding setup for all rows apart from the thinking mode and 32K budget noted on the thinking-on row.

| Variant | Size | MMLU 500 (5-shot) | GPQA Diamond 198 (0-shot) |
|---|---|---|---|
| Qwen3.6-35B-A3B BF16 (upstream) | 67 GB | 86.40% (432/500) | 45.45% (90/198) |
| Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix), thinking off | 20 GB | 83.00% (415/500) | 50.00% (99/198) |
| Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix), thinking on (32K) | 20 GB | 90.40% (452/500) | 80.70% excl_pf (92/198 raw · parse_fail 84/198) |
| Qwen3.6-35B-A3B W4A16 NVFP4 (Lynn-native, separate repo) | 23 GB | 84.40% (422/500) | 49.49% (98/198) |

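The thinking-on (32K) row matches the client default in Lynn Desktop (following the 5/20 pivot to a local-first strategy). The excl_pf figure excludes the 84 of 198 long-thinking samples, mostly Organic Chemistry, that exhausted the 32K budget before emitting a parseable final choice. It is therefore computed over the remaining 114 samples (92/114 = 80.70%); counting the parse failures as wrong gives 92/198 = 46.46%. The failures reflect the max_tokens ceiling, not a model capability ceiling.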
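The excl_pf arithmetic can be reproduced with a few lines of Python. This is a minimal sketch using only the counts reported in the table; the function and variable names are illustrative and not part of any Lynn evaluation harness.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


# Counts taken from the thinking-on (32K) GPQA Diamond row.
TOTAL_QUESTIONS = 198
CORRECT = 92
PARSE_FAIL = 84  # samples that ran out of budget before a parseable answer

# Raw accuracy: parse failures count as wrong answers.
raw_pct = accuracy_pct(CORRECT, TOTAL_QUESTIONS)

# excl_pf accuracy: parse failures are dropped from the denominator.
excl_pf_pct = accuracy_pct(CORRECT, TOTAL_QUESTIONS - PARSE_FAIL)

print(f"raw: {raw_pct:.2f}%  excl_pf: {excl_pf_pct:.2f}%")
```

The two figures answer different questions: the raw number reflects what happens at a fixed 32K budget, while excl_pf describes accuracy on the questions where the model finished within that budget.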

Headline: with thinking off, MMLU drops 3.40 pp vs. BF16, while GPQA Diamond rises 4.55 pp (9 questions out of 198), a gap small enough that it may sit within sampling noise at this sample size. With thinking on (32K), the same Q4_K_M reaches 90.40% MMLU and 80.70% GPQA Diamond excl_pf (46.46% raw), which is the day-to-day client-experience number. Quality holds with a ~70% size reduction.
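For readers who want to check the headline deltas, the sketch below recomputes them from the reported counts and adds a rough binomial standard error for each score. The standard error is an assumption introduced here as a simple, unpaired noise estimate; it is not the evaluation methodology used for this card.

```python
import math


def pp_delta(correct_a: int, correct_b: int, total: int) -> float:
    """Difference A minus B in percentage points, on the same question set."""
    return round(100.0 * (correct_a - correct_b) / total, 2)


def binomial_se_pp(correct: int, total: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


# Q4_K_M (thinking off) vs. BF16, counts from the Quality table.
mmlu_delta = pp_delta(415, 432, 500)   # negative means Q4_K_M is lower
gpqa_delta = pp_delta(99, 90, 198)     # positive means Q4_K_M is higher

# Size reduction from 67 GB (BF16) to 20 GB (Q4_K_M).
size_reduction_pct = round(100.0 * (1.0 - 20 / 67), 1)

print(f"MMLU delta: {mmlu_delta:+.2f} pp (SE ~{binomial_se_pp(415, 500):.2f} pp)")
print(f"GPQA delta: {gpqa_delta:+.2f} pp (SE ~{binomial_se_pp(99, 198):.2f} pp)")
print(f"Size reduction: {size_reduction_pct:.1f}%")
```

With only 198 GPQA Diamond questions, a single score carries a standard error of several percentage points, so small differences between variants are best read with that in mind.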

These are the original Qwen3.6 weights. The numbers above are not comparable to the Lynn V4-Pro / V4-Flash distillation lines, which have their own model cards.

Single-stream serving (DGX Spark, sm_121)

| Stack | Single-stream TPS (300 tok) |
|---|---|
| llama.cpp:server-cuda on this Q4_K_M | ~70 TPS |
| SGLang dev-cu13 on upstream BF16 | ~30 TPS |

On Spark, this Q4_K_M is currently the highest-throughput single-stream configuration, with 20 GB resident.
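A single-stream TPS figure of this kind can be approximated against the OpenAI-compatible endpoint. The sketch below assumes the llama.cpp server from the "How to run" section is listening on port 18002 and that the response carries a `usage.completion_tokens` field; the prompt, helper names, and timing method are illustrative and are not the benchmark procedure behind the table.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure_tps(base_url: str = "http://127.0.0.1:18002",
                model: str = "Qwen3.6-35B-A3B-Q4_K_M",
                max_tokens: int = 300) -> float:
    """Send one request and return an end-to-end tokens-per-second estimate."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Explain what a GGUF file is."}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    # Wall-clock time includes prompt processing, so this slightly
    # understates pure decode speed.
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)
```

Calling `measure_tps()` a few times and averaging the results gives a steadier number than a single request.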

What's in this repo

| File | Size | Role |
|---|---|---|
| Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf | 20 GB | GGUF weights, ready for llama.cpp / Ollama / LM Studio |
| Qwen3.6-35B-A3B.imatrix | 192 MB | Calibration file used during quantization (reproducible) |
| README.md | (this file) | Model card |

How to run

llama.cpp server (OpenAI-compatible)

docker run -d --gpus all --ipc=host --shm-size=8g -p 18002:18002 \
  -v "$(pwd)/Qwen3.6-35B-A3B-GGUF-imatrix:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf \
  --host 0.0.0.0 --port 18002 \
  --n-gpu-layers 999 --ctx-size 4096 --parallel 1 \
  --alias Qwen3.6-35B-A3B-Q4_K_M

curl http://127.0.0.1:18002/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3.6-35B-A3B-Q4_K_M",
  "messages": [{"role": "user", "content": "Hello"}]
}'
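The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names are illustrative, and the host, port, and alias are taken from the docker command.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18002"
MODEL_ALIAS = "Qwen3.6-35B-A3B-Q4_K_M"


def build_chat_request(prompt: str,
                       base_url: str = BASE_URL,
                       model: str = MODEL_ALIAS) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return payload["choices"][0]["message"]["content"]


def chat(prompt: str, timeout_s: float = 600.0) -> str:
    """Send one prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("Hello"))` prints the reply. Because the server above is started with `--ctx-size 4096` and the chat template enables thinking by default, long reasoning traces may not fit in that window; raising `--ctx-size` is an option when memory allows.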

Ollama

ollama create qwen36-q4km -f Modelfile
ollama run qwen36-q4km

(A Modelfile template is available in the Lynn engine docs.)
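If the template is not at hand, a bare-bones Modelfile only needs a `FROM` line pointing at the local GGUF. The sketch below renders such a file; it is an assumption for illustration, not the Lynn template, and it sets no sampling parameters or custom chat template.

```python
from pathlib import Path


def render_modelfile(gguf_path: str) -> str:
    """Return the text of a minimal Ollama Modelfile for a local GGUF."""
    # FROM accepts a path to a GGUF file on disk.
    return f"FROM {gguf_path}\n"


modelfile_text = render_modelfile("./Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf")
print(modelfile_text, end="")

# To write it next to the weights before running `ollama create`:
# Path("Modelfile").write_text(modelfile_text, encoding="utf-8")
```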

LM Studio / KTransformers

Load the GGUF file directly. The chat template is embedded in the GGUF metadata and matches Qwen3.6 upstream behaviour (thinking mode on by default).

Calibration details

  • Calibration corpus: diverse Chinese + English + code mix (see Qwen3.6-35B-A3B.imatrix for the reproducible activation statistics)
  • Quantizer: llama.cpp Q4_K_M with imatrix scaling
  • Base weights: sha256 checksums identical to Qwen/Qwen3.6-35B-A3B
  • Built on: DGX Spark (GB10), 2026-05-18
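Checksum claims like the one above can be verified locally by hashing a downloaded file and comparing it with the digest shown on the hosting page. The sketch below streams the file so that a 20 GB GGUF does not need to fit in memory; the expected digest in the usage comment is a placeholder, since this card does not list one.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Case-insensitive comparison against a published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (placeholder digest, replace with the published value):
# matches_expected("Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf", "<published sha256>")
```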

License

Apache 2.0, inherited from Qwen/Qwen3.6-35B-A3B. See the LICENSE link in the model frontmatter.

Related Lynn artifacts

  • Lynn engine: Spark / R6000 inference runtime with W4A8 + native MTP roadmap.
  • Lynn W4A16 NVFP4 build of Qwen3.6-35B-A3B: separate repo (coming soon).
  • Lynn agent desktop app: separate stack, integrates with the OpenAI-compatible endpoint above.

Citation

If you use this artifact in research or production, please cite the upstream Qwen3.6 model and note this calibration:

@misc{lynn2026qwen36q4km,
  title  = {Qwen3.6-35B-A3B Q4\_K\_M (imatrix), Lynn calibration},
  author = {Lynn},
  year   = {2026},
  url    = {https://huggingface.co/nerkyor/Qwen3.6-35B-A3B-GGUF-imatrix},
}

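Chinese version (translated)

Qwen3.6-35B-A3B Q4_K_M (imatrix-calibrated), Lynn quantization

This repository provides an imatrix-calibrated Q4_K_M GGUF quantization of the official Qwen/Qwen3.6-35B-A3B weights, built and validated by Lynn on a DGX Spark (GB10, sm_121).

Not a distillation. This repository is a pure quantization of the official Qwen3.6-35B-A3B post-trained weights. It is not derived from the Lynn V4-Pro / V4-Flash distillation lines, which are two separate Lynn product lines with their own model cards.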

Quality

Benchmarked on 2026-05-18 on DGX Spark (sm_121) against the official BF16 weights, with the same decoding settings across variants.

| Variant | Size | MMLU 500 (5-shot) | GPQA Diamond 198 (0-shot) |
|---|---|---|---|
| Qwen3.6-35B-A3B BF16 (official upstream) | 67 GB | 86.40% (432/500) | 45.45% (90/198) |
| Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix), thinking off | 20 GB | 83.00% (415/500) | 50.00% (99/198) |
| Qwen3.6-35B-A3B Q4_K_M (this repo, imatrix), thinking on (32K) | 20 GB | 90.40% (452/500) | 80.70% excl_pf (92/198 raw · parse_fail 84/198) |
| Qwen3.6-35B-A3B W4A16 NVFP4 (Lynn-native, separate repo) | 23 GB | 84.40% (422/500) | 49.49% (98/198) |

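The thinking-on (32K) row is the default configuration of the Lynn desktop client (after the 5/20 strategy pivot landed). The excl_pf figure excludes the 84 of 198 long-thinking samples, mostly Organic Chemistry, that exhausted the 32K token budget without giving a parseable final choice. This is a max_tokens ceiling, not a model capability ceiling.

Key takeaway: at 30% of the BF16 size, thinking-off MMLU drops 3.40 pp, while GPQA Diamond rises 4.55 pp (9 questions out of 198), a gap small enough that it may sit within sampling noise at this sample size. With thinking on (32K), the same Q4_K_M reaches 90.40% MMLU and 80.70% GPQA Diamond excl_pf, which is the day-to-day client-experience number. Quality is fully preserved.

This repository contains the original Qwen3.6 weights. The numbers above should not be compared with the Lynn V4-Pro / V4-Flash distillation lines, which have their own model cards and evaluations.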

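Single-stream inference speed (DGX Spark, sm_121)

| Stack | Single-stream TPS (300 tok) |
|---|---|
| llama.cpp:server-cuda on this Q4_K_M | ~70 TPS |
| SGLang dev-cu13 on official BF16 | ~30 TPS |

On Spark, this Q4_K_M is currently the deployment with the highest single-stream throughput, with only 20 GB resident.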

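Repository contents

| File | Size | Purpose |
|---|---|---|
| Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf | 20 GB | GGUF weights, usable directly with llama.cpp / Ollama / LM Studio |
| Qwen3.6-35B-A3B.imatrix | 192 MB | imatrix calibration data used during quantization (reproducible) |
| README.md | (this file) | Model card |

Usage

llama.cpp server (OpenAI-compatible)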

docker run -d --gpus all --ipc=host --shm-size=8g -p 18002:18002 \
  -v "$(pwd)/Qwen3.6-35B-A3B-GGUF-imatrix:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-Q4_K_M-imatrix.gguf \
  --host 0.0.0.0 --port 18002 \
  --n-gpu-layers 999 --ctx-size 4096 --parallel 1 \
  --alias Qwen3.6-35B-A3B-Q4_K_M

curl http://127.0.0.1:18002/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen3.6-35B-A3B-Q4_K_M",
  "messages": [{"role": "user", "content": "你好"}]
}'

Ollama

ollama create qwen36-q4km -f Modelfile
ollama run qwen36-q4km

(A Modelfile template is available on the Lynn engine website.)

LM Studio / KTransformers

Load the GGUF file directly. The chat template is embedded in the GGUF metadata and matches official Qwen3.6 behaviour (thinking mode on by default).

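Quantization details

  • Calibration corpus: Chinese + English + code mix (see Qwen3.6-35B-A3B.imatrix for the reproducible activation statistics)
  • Quantization tool: llama.cpp Q4_K_M with imatrix scaling
  • Base weights: sha256 checksums identical to Qwen/Qwen3.6-35B-A3B
  • Build machine: DGX Spark (GB10), 2026-05-18

License

Apache 2.0, inherited from upstream Qwen/Qwen3.6-35B-A3B.

Related Lynn artifacts

  • Lynn engine: Spark / R6000 inference runtime, with W4A8 and self-trained MTP as the main line
  • Lynn W4A16 NVFP4 build of Qwen3.6-35B-A3B: separate repo (coming soon)
  • Lynn agent desktop app: separate stack, connects directly to the OpenAI-compatible endpoint above

Citation

If you use this artifact in research or production, please cite upstream Qwen3.6 and note this calibrated version: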

@misc{lynn2026qwen36q4km,
  title  = {Qwen3.6-35B-A3B Q4\_K\_M (imatrix), Lynn calibration},
  author = {Lynn},
  year   = {2026},
  url    = {https://huggingface.co/nerkyor/Qwen3.6-35B-A3B-GGUF-imatrix},
}