SC117 committed
Commit 68b12b8 · verified · 1 Parent(s): 995a934

Upload 2 files

Files changed (2)
  1. README.md +14 -14
  2. README_zh.md +14 -14
README.md CHANGED
@@ -65,7 +65,7 @@ base_model:
65
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
66
  <p style="margin: 0 0 12px 0;">The released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we <b>extract the 1 MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before GGUF conversion.</p>
67
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B</p>
68
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
69
  from safetensors import safe_open
70
  import json, os
71
 
@@ -73,27 +73,27 @@ src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
73
  with open(os.path.join(src, "model.safetensors.index.json")) as f:
74
  idx = json.load(f)
75
  mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
76
- print(f"Found {len(mtp_keys)} MTP tensors") # 785</p>
77
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2 — Add as a new safetensors shard (N+1)</p>
78
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Save 785 MTP tensors as a new shard
79
  new_shard = "model.safetensors-15-of-15.safetensors"
80
  save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
81
 
82
  # Update model.safetensors.index.json:
83
  # - metadata.total_size += new_shard_size
84
  # - weight_map: append new_shard path for each MTP key
85
- # - DO NOT modify existing 14 shards (avoid touching original data)</p>
86
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3 — Convert HF → BF16 GGUF with master llama.cpp</p>
87
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
88
  J:\Models\Agents-A1 ^
89
  --outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
90
  --outtype bf16
91
 
92
  # The master version handles Qwen3.5MoE with MTP automatically:
93
  # - Normal layers: blk.0–39
94
- # - MTP layer: blk.40.nextn.* (785 tensors)</p>
95
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
96
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
97
  --imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
98
  --tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_&lt;tier&gt;.txt ^
99
  J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
@@ -101,7 +101,7 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
101
  Q4_K_M
102
 
103
  # APEX qwen36_35b_mtp_*.txt configs include blk.40 overrides
104
- # (Q8_0 for MTP across all tiers) — no manual patching needed.</p>
105
  </div>
106
  </div>
107
 
@@ -120,17 +120,17 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
120
  <div style="background: linear-gradient(135deg, #10b981 0%, #059669 100%); padding: 12px 16px; color: white; font-weight: 700; font-size: 14px; display: flex; align-items: center; gap: 8px;"><span>🚀</span> Usage</div>
121
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
122
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (text only)</p>
123
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">hf download SC117/Agents-A1-MTP-APEX-GGUF --include "*.gguf" --local-dir ./models
124
- ./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf -ngl 99 -c 131072</p>
125
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (vision + text)</p>
126
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</p>
127
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
128
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
129
 
130
  # Tool-call variant
131
- vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</p>
132
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
133
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</p>
134
  </div>
135
  </div>
136
 
 
65
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
66
  <p style="margin: 0 0 12px 0;">The released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we <b>extract the 1 MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before GGUF conversion.</p>
67
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B</p>
68
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
69
  from safetensors import safe_open
70
  import json, os
71
 
 
73
  with open(os.path.join(src, "model.safetensors.index.json")) as f:
74
  idx = json.load(f)
75
  mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
76
+ print(f"Found {len(mtp_keys)} MTP tensors") # 785</pre>
77
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2 — Add as a new safetensors shard (N+1)</p>
78
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Save 785 MTP tensors as a new shard
79
  new_shard = "model.safetensors-15-of-15.safetensors"
80
  save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
81
 
82
  # Update model.safetensors.index.json:
83
  # - metadata.total_size += new_shard_size
84
  # - weight_map: append new_shard path for each MTP key
85
+ # - DO NOT modify existing 14 shards (avoid touching original data)</pre>
86
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3 — Convert HF → BF16 GGUF with master llama.cpp</p>
87
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
88
  J:\Models\Agents-A1 ^
89
  --outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
90
  --outtype bf16
91
 
92
  # The master version handles Qwen3.5MoE with MTP automatically:
93
  # - Normal layers: blk.0–39
94
+ # - MTP layer: blk.40.nextn.* (785 tensors)</pre>
95
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
96
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
97
  --imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
98
  --tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_&lt;tier&gt;.txt ^
99
  J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
 
101
  Q4_K_M
102
 
103
  # APEX qwen36_35b_mtp_*.txt configs include blk.40 overrides
104
+ # (Q8_0 for MTP across all tiers) — no manual patching needed.</pre>
105
  </div>
106
  </div>
107
 
 
120
  <div style="background: linear-gradient(135deg, #10b981 0%, #059669 100%); padding: 12px 16px; color: white; font-weight: 700; font-size: 14px; display: flex; align-items: center; gap: 8px;"><span>🚀</span> Usage</div>
121
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
122
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (text only)</p>
123
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">hf download SC117/Agents-A1-MTP-APEX-GGUF --include "*.gguf" --local-dir ./models
124
+ ./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf -ngl 99 -c 131072</pre>
125
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (vision + text)</p>
126
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</pre>
127
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
128
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
129
 
130
  # Tool-call variant
131
+ vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</pre>
132
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
133
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</pre>
134
  </div>
135
  </div>
136
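Steps 1 and 2 of the README diff above describe the index bookkeeping only in comments. A minimal sketch of that bookkeeping, using only the standard library and in-memory stand-ins for the two `model.safetensors.index.json` files, might look like the following. The helper names `find_mtp_keys` and `register_new_shard`, the toy tensor names, the shard file names, and the byte sizes are all hypothetical; writing the shard itself (the README's `save_file` call) is not shown.

```python
import json


def find_mtp_keys(index: dict) -> list[str]:
    """Return tensor names whose key mentions "mtp" (same filter as the README)."""
    return [k for k in index["weight_map"] if "mtp" in k.lower()]


def register_new_shard(index: dict, keys: list[str], shard_name: str, shard_bytes: int) -> dict:
    """Return a copy of the index with `keys` mapped to `shard_name`.

    Existing weight_map entries are left untouched, mirroring the
    "do not modify existing shards" rule in Step 2.
    """
    updated = json.loads(json.dumps(index))  # deep copy of JSON-compatible data
    for key in keys:
        if key in updated["weight_map"]:
            raise ValueError(f"{key} is already present in the target index")
        updated["weight_map"][key] = shard_name
    updated["metadata"]["total_size"] += shard_bytes
    return updated


# Toy stand-ins for the source (with MTP) and target (without MTP) indexes.
# All names and sizes here are made up for illustration.
source_index = {
    "metadata": {"total_size": 300},
    "weight_map": {
        "model.layers.0.mlp.gate.weight": "model-00001-of-00002.safetensors",
        "mtp.fc.weight": "model-00002-of-00002.safetensors",
        "mtp.norm.weight": "model-00002-of-00002.safetensors",
    },
}
target_index = {
    "metadata": {"total_size": 100},
    "weight_map": {
        "model.layers.0.mlp.gate.weight": "model-00001-of-00001.safetensors",
    },
}

mtp_keys = find_mtp_keys(source_index)
merged = register_new_shard(target_index, mtp_keys, "mtp-extra-shard.safetensors", 48)
print(len(mtp_keys), merged["metadata"]["total_size"])
```

Returning a copy rather than mutating the loaded index keeps the original file contents available for comparison before anything is written back to disk.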
 
README_zh.md CHANGED
@@ -65,7 +65,7 @@ base_model:
65
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
66
  <p style="margin: 0 0 12px 0;">The officially released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (a 10–30% speedup for long-context generation), we <b>extract the 1 MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before converting to GGUF.</p>
67
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1: Extract MTP tensors from Qwen3.5-35B-A3B</p>
68
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
69
  from safetensors import safe_open
70
  import json, os
71
 
@@ -73,27 +73,27 @@ src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
73
  with open(os.path.join(src, "model.safetensors.index.json")) as f:
74
  idx = json.load(f)
75
  mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
76
- print(f"Found {len(mtp_keys)} MTP tensors") # 785</p>
77
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2: Append as a new shard (N+1)</p>
78
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Save the 785 MTP tensors as a new shard
79
  new_shard = "model.safetensors-15-of-15.safetensors"
80
  save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
81
 
82
  # Update model.safetensors.index.json:
83
  # - metadata.total_size += new shard size
84
  # - weight_map: add the new shard path for each MTP key
85
- # - Do not modify the original 14 shards (avoid touching the original data)</p>
86
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3: Convert to BF16 GGUF with master llama.cpp</p>
87
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
88
  J:\Models\Agents-A1 ^
89
  --outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
90
  --outtype bf16
91
 
92
  # The master version handles Qwen3.5MoE + MTP automatically:
93
  # - Regular layers: blk.0–39
94
- # - MTP layer: blk.40.nextn.* (785 tensors)</p>
95
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4: Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
96
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
97
  --imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
98
  --tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_&lt;tier&gt;.txt ^
99
  J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
@@ -101,7 +101,7 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
101
  Q4_K_M
102
 
103
  # APEX qwen36_35b_mtp_*.txt configs already include blk.40 overrides
104
- # (Q8_0 for MTP across all tiers); no manual patching needed.</p>
105
  </div>
106
  </div>
107
 
@@ -120,17 +120,17 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
120
  <div style="background: linear-gradient(135deg, #10b981 0%, #059669 100%); padding: 12px 16px; color: white; font-weight: 700; font-size: 14px; display: flex; align-items: center; gap: 8px;"><span>🚀</span> Usage</div>
121
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
122
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (text only)</p>
123
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">hf download SC117/Agents-A1-MTP-APEX-GGUF --include "*.gguf" --local-dir ./models
124
- ./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf -ngl 99 -c 131072</p>
125
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (vision + text)</p>
126
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</p>
127
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
128
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
129
 
130
  # Tool-call variant
131
- vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</p>
132
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
133
- <p style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</p>
134
  </div>
135
  </div>
136
 
 
65
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
66
  <p style="margin: 0 0 12px 0;">The officially released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (a 10–30% speedup for long-context generation), we <b>extract the 1 MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before converting to GGUF.</p>
67
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1: Extract MTP tensors from Qwen3.5-35B-A3B</p>
68
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
69
  from safetensors import safe_open
70
  import json, os
71
 
 
73
  with open(os.path.join(src, "model.safetensors.index.json")) as f:
74
  idx = json.load(f)
75
  mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
76
+ print(f"Found {len(mtp_keys)} MTP tensors") # 785</pre>
77
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2: Append as a new shard (N+1)</p>
78
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;"># Save the 785 MTP tensors as a new shard
79
  new_shard = "model.safetensors-15-of-15.safetensors"
80
  save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
81
 
82
  # Update model.safetensors.index.json:
83
  # - metadata.total_size += new shard size
84
  # - weight_map: add the new shard path for each MTP key
85
+ # - Do not modify the original 14 shards (avoid touching the original data)</pre>
86
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3: Convert to BF16 GGUF with master llama.cpp</p>
87
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
88
  J:\Models\Agents-A1 ^
89
  --outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
90
  --outtype bf16
91
 
92
  # The master version handles Qwen3.5MoE + MTP automatically:
93
  # - Regular layers: blk.0–39
94
+ # - MTP layer: blk.40.nextn.* (785 tensors)</pre>
95
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4: Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
96
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
97
  --imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
98
  --tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_&lt;tier&gt;.txt ^
99
  J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
 
101
  Q4_K_M
102
 
103
  # APEX qwen36_35b_mtp_*.txt configs already include blk.40 overrides
104
+ # (Q8_0 for MTP across all tiers); no manual patching needed.</pre>
105
  </div>
106
  </div>
107
 
 
120
  <div style="background: linear-gradient(135deg, #10b981 0%, #059669 100%); padding: 12px 16px; color: white; font-weight: 700; font-size: 14px; display: flex; align-items: center; gap: 8px;"><span>🚀</span> Usage</div>
121
  <div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
122
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (text only)</p>
123
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">hf download SC117/Agents-A1-MTP-APEX-GGUF --include "*.gguf" --local-dir ./models
124
+ ./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf -ngl 99 -c 131072</pre>
125
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">llama.cpp (vision + text)</p>
126
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</pre>
127
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
128
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
129
 
130
  # Tool-call variant
131
+ vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</pre>
132
  <p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
133
+ <pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</pre>
134
  </div>
135
  </div>
136
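Every changed line in this commit swaps a `<p>` wrapper around a monospace code block for a `<pre>` wrapper, in both README files. As an illustration only (the commit does not say how the edit was made), the same substitution could be scripted roughly as follows. The function name `p_to_pre` and the regular expression are assumptions, and the pattern only targets paragraphs whose inline style sets `font-family: monospace`.

```python
import re

# Matches <p style="... font-family: monospace ...">...</p>, including bodies
# that span several lines. Paragraphs without a monospace style are left alone.
MONO_P = re.compile(
    r'<p(\s+style="[^"]*font-family:\s*monospace[^"]*")>(.*?)</p>',
    re.DOTALL,
)


def p_to_pre(html: str) -> str:
    """Rewrite monospace <p> blocks as <pre> blocks, keeping attributes and body."""
    return MONO_P.sub(r"<pre\1>\2</pre>", html)


sample = (
    '<p style="margin: 0; font-weight: bold;">Step 1</p>\n'
    '<p style="margin: 0; font-family: monospace; white-space: pre;">import json, os\n'
    '\n'
    'print("ok")</p>'
)
converted = p_to_pre(sample)
print(converted)
```

One plausible motivation, which the commit itself does not state: under CommonMark, an HTML block opened by `<p>` ends at the first blank line, whereas one opened by `<pre>` runs until the closing `</pre>`, and several of these code blocks contain blank lines.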