Qwen3-4B-Thinking-2507 Q4_K_M imatrix GGUF(Lynn 端侧极速版)

这是 Lynn 端侧极速路线使用的 Qwen3-4B-Thinking-2507 GGUF 量化包,面向 llama.cpp / 本地 OpenAI-compatible endpoint,专为内存紧张 / 极致速度场景优化。

定位:4B Thinking,2.4GB 体积,本地秒级响应。 是 9B 路线的补充小弟,适合内存 8-16GB 设备 / 移动端场景 / 高并发 batch / 速度敏感的工具调用链路。

English summary: this is Lynn's GGUF artifact for the speed-and-memory-optimized 4B path, based on the official Qwen3-4B-Thinking-2507 reasoning model. Quantized to Q4_K_M by Lynn with imatrix calibration over wikitext-2-raw. Intended as the lightweight companion to the 9B path for memory-constrained / batch / high-throughput device targets.

文件 / Files

文件 大小 SHA256 备注
Qwen3-4B-Thinking-2507-Q4_K_M-imatrix.gguf 2.4GB 530e7ee3da9e5d8873fa51cc1c66110c7cd23265aa79c3fabe8822c5ceb9c76c Lynn imatrix 校准 Q4_K_M 发布文件
imatrix.gguf 3.7MB a65304e847c1cc8df9bf2849d11eff988c0fe9422c5ebce1d29c4e29bc993359 imatrix 校准数据,可复现量化

量化来源:从官方 BF16 (Qwen/Qwen3-4B-Thinking-2507) 用 Lynn 自家 llama.cpp(build-cuda-sm121) + wikitext-2-raw 100 chunks × 512 ctx imatrix calibration 量化。区别于 bartowski / unsloth / lmstudio-community 等公开 GGUF。

为什么选择这一版 / Why this artifact

Qwen3-4B-Thinking-2507 是阿里 2025-08 发布的 4B dense Thinking 推理模型,2025 sub-9B 区段中 quality-per-param 最优代表之一。在 Lynn 矩阵中担任端侧极速档:

  • 2.4GB Q4_K_M 体积:对比 9B Q4_K_M 5.9GB 缩 60%,可在 8-16GB 设备从容跑 32K thinking。
  • decode TPS 极快:4B dense 在 Mac M-class / GB10 Spark 都能达到 ~70+ TPS 单流(参考:9B 32 TPS),适合实时对话与工具流。
  • vendor benchmark 强:MMLU-Pro 74.0 / MMLU-Redux 86.1 / GPQA Diamond 65.8(Qwen 官方报)。
  • Lynn-imatrix 校准:使用 Lynn 自家量化栈而非第三方,calibration 数据 wikitext-2-raw,可复现可审计(imatrix.gguf 同 repo 发布)。

English note: this is Lynn's speed-and-memory tier in the 4-quadrant matrix. The 4B Thinking model gives a ~2x decode speedup vs 9B at ~40% the memory footprint, with quality reaching MMLU-Pro 74.0 / GPQA-D 65.8 in vendor evaluations. The Lynn-imatrix calibration is reproducible — see imatrix.gguf.

评测摘要 / Benchmark Summary

所有数字来自 Lynn 内部同口径评测;标注 thinking-on 的项目允许模型输出长思考块。thinking-offchat_template_kwargs.enable_thinking=False

官方 vendor reference

维度
MMLU-Pro (vendor) 74.0
MMLU-Redux (vendor) 86.1
GPQA Diamond (vendor) 65.8
AIME24 (vendor) 65.6
LiveCodeBench v6 (vendor) 55.2

Lynn 内部评测(测试中,后补完整数据)

Tests in progress — full numbers will be added once Lynn's V8/V9/MMLU/GPQA/coding spike pipeline completes on this Lynn-quantized variant.

测试 状态 临时数据
Stage5 tool-call(15 题) 12/15 = 80.0%(thinking-on,本仓库 Q4_K_M) 跟 80% bartowski thinking-off 持平,验证 Lynn-imatrix 在 tool-call 上 calibration 完整
MMLU 500 thinking-on 🔄 排队中
GPQA Diamond 198 thinking-on 🔄 排队中
V8(35 工具触发) 🔄 跑完后评估 V8 protocol 依赖隐式 substring,thinking-mode model 适配性差(已发现 grader 限制,patched 版搜索 reasoning+content)
V9(60 verifier) 🔄 跑完后评估 同上
Coding spike v9_code_algo ⚠️ 0/9 grader 限制(runner 不支持 code_executes verifier,非模型问题)

完整 quality 数据测试完成后会更新到本卡。

English benchmark note: vendor-reported numbers above. Lynn internal eval is in progress; Stage5 tool-call (with explicit tool schema) already validated at 12/15 = 80% thinking-on for this Lynn-imatrix Q4_K_M artifact, matching bartowski's thinking-off baseline — confirming Lynn calibration is sound on tool-emit tasks.

本地使用方式 / Local Usage(llama.cpp)

Lynn Desktop 会在用户授权后自动完成下载、启动和 provider 注册。下面是手动等价命令,普通用户不需要自己执行:

modelscope download --model Merkyor/Qwen3-4B-Thinking-2507-GGUF-imatrix \
  Qwen3-4B-Thinking-2507-Q4_K_M-imatrix.gguf \
  --local_dir ~/Models/Lynn/Qwen3-4B-Thinking-2507/q4_k_m

llama-server \
  --model ~/Models/Lynn/Qwen3-4B-Thinking-2507/q4_k_m/Qwen3-4B-Thinking-2507-Q4_K_M-imatrix.gguf \
  --host 127.0.0.1 \
  --port 18097 \
  --ctx-size 32768 \
  --parallel 4 \
  --n-gpu-layers 999 \
  --jinja \
  --reasoning auto

OpenAI-compatible endpoint:

base_url = http://127.0.0.1:18097/v1
api_key  = local
model    = qwen3-4b-thinking-2507

Thinking 模式说明 / Thinking Mode

Qwen3-4B-Thinking-2507 默认 thinking-on。要 thinking-off 需在 chat completion request 显式传:

{
  "chat_template_kwargs": {"enable_thinking": false}
}

注意:/no_think 前缀不生效(该 token 是 Qwen3-Next-9B 才支持的,4B Thinking 模型用 chat_template_kwargs 控制)。

reasoning_content 字段会在 thinking-on 模式下返回,与最终 content 字段分离。下游 grader / 解析器应同时处理两个字段(否则会漏抓关键信息)。

English: enable_thinking is controlled via chat_template_kwargs. The /no_think prefix does not work for this model. The reasoning_content field is returned separately from final content — downstream graders should search both.

来源与集成信息 / Provenance

  • 基座模型 / Base model: Qwen/Qwen3-4B-Thinking-2507(2025-08-05 发布,Apache 2.0)
  • 格式 / Format: GGUF Q4_K_M with imatrix calibration
  • 量化栈 / Quant stack: llama.cpp build-cuda-sm121(commit ~2025-09)
  • imatrix 校准 / Calibration: wikitext-2-raw(100 chunks × 512 ctx)
  • 量化硬件 / Quantize host: NVIDIA GB10 Spark(sm_121)
  • 运行目标 / Runtime: llama.cpp server / OpenAI-compatible endpoint
  • Lynn 集成 / Integration: 端侧极速档 / lightweight tier(provider id 待 Lynn client v0.80+ 接入定 alias)

License

Apache-2.0,inherits from base model Qwen/Qwen3-4B-Thinking-2507. Re-quantized weights distributed under the same license.

致谢 / Acknowledgements

  • Qwen team for the base Qwen3-4B-Thinking-2507 weights
  • llama.cpp project for the GGUF format and quantization tooling
  • Lynn project for the imatrix calibration + integration

Last updated: 2026-05-23. For live benchmark progress see Lynn project GitHub.

Downloads last month
127
GGUF
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for nerkyor/Qwen3-4B-Thinking-2507-GGUF-imatrix

Quantized
(104)
this model