Text Generation
Transformers
GGUF
qwen3_5_moe
qwen3_5
reasoning
agentic
mtp
apex
quantization
multimodal
imatrix
conversational
Instructions to use SC117/Agents-A1-MTP-APEX-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SC117/Agents-A1-MTP-APEX-GGUF with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SC117/Agents-A1-MTP-APEX-GGUF")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("SC117/Agents-A1-MTP-APEX-GGUF", dtype="auto")
- llama-cpp-python
How to use SC117/Agents-A1-MTP-APEX-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="SC117/Agents-A1-MTP-APEX-GGUF",
    filename="Agents-A1-MTP-APEX-I-Balanced.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use SC117/Agents-A1-MTP-APEX-GGUF with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
# Run inference directly in the terminal:
llama cli -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
# Run inference directly in the terminal:
llama cli -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
# Run inference directly in the terminal:
./llama-cli -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Use Docker
docker model run hf.co/SC117/Agents-A1-MTP-APEX-GGUF:BF16
- LM Studio
- Jan
- vLLM
How to use SC117/Agents-A1-MTP-APEX-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SC117/Agents-A1-MTP-APEX-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SC117/Agents-A1-MTP-APEX-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/SC117/Agents-A1-MTP-APEX-GGUF:BF16
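The curl call shown for the vLLM server can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started by `vllm serve` is listening on `localhost:8000`. The helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "SC117/Agents-A1-MTP-APEX-GGUF"
BASE_URL = "http://localhost:8000/v1"  # assumption: port used in the curl example


def build_chat_request(prompt: str, model: str = MODEL_ID) -> dict:
    """Build an OpenAI-style chat payload with the same shape as the curl body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("What is the capital of France?"))` sends the same request as the curl command. The same helpers should work against the SGLang server by passing `base_url="http://localhost:30000/v1"`.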
- SGLang
How to use SC117/Agents-A1-MTP-APEX-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "SC117/Agents-A1-MTP-APEX-GGUF" \
  --host 0.0.0.0 \
  --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SC117/Agents-A1-MTP-APEX-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "SC117/Agents-A1-MTP-APEX-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SC117/Agents-A1-MTP-APEX-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Ollama
How to use SC117/Agents-A1-MTP-APEX-GGUF with Ollama:
ollama run hf.co/SC117/Agents-A1-MTP-APEX-GGUF:BF16
- Unsloth Studio
How to use SC117/Agents-A1-MTP-APEX-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SC117/Agents-A1-MTP-APEX-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SC117/Agents-A1-MTP-APEX-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SC117/Agents-A1-MTP-APEX-GGUF to start chatting
- Pi
How to use SC117/Agents-A1-MTP-APEX-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "SC117/Agents-A1-MTP-APEX-GGUF:BF16" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use SC117/Agents-A1-MTP-APEX-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default SC117/Agents-A1-MTP-APEX-GGUF:BF16
Run Hermes
hermes
- Atomic Chat
- OpenClaw
How to use SC117/Agents-A1-MTP-APEX-GGUF with OpenClaw:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf SC117/Agents-A1-MTP-APEX-GGUF:BF16
Configure OpenClaw
# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "SC117/Agents-A1-MTP-APEX-GGUF:BF16" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use SC117/Agents-A1-MTP-APEX-GGUF with Docker Model Runner:
docker model run hf.co/SC117/Agents-A1-MTP-APEX-GGUF:BF16
- Lemonade
How to use SC117/Agents-A1-MTP-APEX-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull SC117/Agents-A1-MTP-APEX-GGUF:BF16
Run and chat with the model
lemonade run user.Agents-A1-MTP-APEX-GGUF-BF16
List all available models
lemonade list
Upload 2 files
- README.md +17 -17
- README_zh.md +17 -17
README.md
CHANGED
|
@@ -65,33 +65,33 @@ base_model:
|
|
| 65 |
<div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
|
| 66 |
<p style="margin: 0 0 12px 0;">The released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we <b>extract the 1 MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before GGUF conversion.</p>
|
| 67 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B</p>
|
| 68 |
-
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">
|
| 69 |
from safetensors import safe_open
|
| 70 |
import json, os
|
| 71 |
-
|
| 72 |
src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
|
| 73 |
with open(os.path.join(src, "model.safetensors.index.json")) as f:
|
| 74 |
idx = json.load(f)
|
| 75 |
mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
|
| 76 |
print(f"Found {len(mtp_keys)} MTP tensors") # 785</pre>
|
| 77 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2 — Add as a new safetensors shard (N+1)</p>
|
| 78 |
-
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">
|
| 79 |
new_shard = "model.safetensors-15-of-15.safetensors"
|
| 80 |
save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3 — Convert HF → BF16 GGUF with master llama.cpp</p>
|
| 87 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
|
| 88 |
J:\Models\Agents-A1 ^
|
| 89 |
--outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 90 |
--outtype f16
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
|
| 96 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
|
| 97 |
--imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
|
|
@@ -99,9 +99,9 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
|
| 99 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 100 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^
|
| 101 |
Q4_K_M
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
</div>
|
| 106 |
</div>
|
| 107 |
|
|
@@ -126,8 +126,8 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
|
| 126 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</pre>
|
| 127 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
|
| 128 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
|
| 129 |
-
|
| 130 |
-
|
| 131 |
vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</pre>
|
| 132 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
|
| 133 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</pre>
|
|
|
|
| 65 |
<div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
|
| 66 |
<p style="margin: 0 0 12px 0;">The released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we <b>extract the 1 MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before GGUF conversion.</p>
|
| 67 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B</p>
|
| 68 |
+
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
|
| 69 |
from safetensors import safe_open
|
| 70 |
import json, os
|
| 71 |
+
·
|
| 72 |
src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
|
| 73 |
with open(os.path.join(src, "model.safetensors.index.json")) as f:
|
| 74 |
idx = json.load(f)
|
| 75 |
mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
|
| 76 |
print(f"Found {len(mtp_keys)} MTP tensors") # 785</pre>
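The Step 1 snippet only lists the MTP keys, and Step 2 calls a `get_tensor` helper that is not defined in the README. The following is a minimal sketch of how the tensors might be read, assuming the `safetensors` package with the PyTorch backend is installed. The function names `mtp_keys_by_shard` and `load_mtp_tensors` are hypothetical and not taken from the author's script.

```python
import os
from collections import defaultdict


def mtp_keys_by_shard(weight_map: dict) -> dict:
    """Group tensor names containing "mtp" by the shard file that stores them."""
    grouped = defaultdict(list)
    for key, shard in weight_map.items():
        if "mtp" in key.lower():
            grouped[shard].append(key)
    return dict(grouped)


def load_mtp_tensors(src: str, weight_map: dict) -> dict:
    """Read every MTP tensor, opening each source shard only once."""
    # Imported lazily so the grouping helper works without safetensors/torch.
    from safetensors import safe_open

    tensors = {}
    for shard, keys in mtp_keys_by_shard(weight_map).items():
        with safe_open(os.path.join(src, shard), framework="pt", device="cpu") as f:
            for key in keys:
                tensors[key] = f.get_tensor(key)
    return tensors
```

With `idx` loaded as in Step 1, `load_mtp_tensors(src, idx["weight_map"])` returns a dict that can be passed to `safetensors.torch.save_file(tensors, new_shard)`. Grouping by shard avoids reopening the same file for each of the several hundred keys.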
|
| 77 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2 — Add as a new safetensors shard (N+1)</p>
|
| 78 |
+
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">Save 785 MTP tensors as a new shard
|
| 79 |
new_shard = "model.safetensors-15-of-15.safetensors"
|
| 80 |
save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
| 81 |
+
·
|
| 82 |
+
Update model.safetensors.index.json:
|
| 83 |
+
· metadata.total_size += new_shard_size
|
| 84 |
+
· weight_map: add an entry mapping each MTP key to the new shard file
|
| 85 |
+
· Do not modify the existing 14 shards (leave the original data untouched)</pre>
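The index update described in Step 2 can be sketched as a small pure function. This is an illustration under stated assumptions rather than the author's script: `register_shard` is a hypothetical name, and `shard_size` is whatever byte count is added to `metadata.total_size` for the new shard.

```python
import copy


def register_shard(index: dict, keys: list, shard_name: str, shard_size: int) -> dict:
    """Return a copy of the index with `keys` mapped to `shard_name`.

    Existing weight_map entries are never rewritten; a key that is already
    present raises instead of being silently re-pointed.
    """
    updated = copy.deepcopy(index)
    weight_map = updated.setdefault("weight_map", {})
    duplicates = [k for k in keys if k in weight_map]
    if duplicates:
        raise ValueError(
            f"{len(duplicates)} keys already in weight_map, e.g. {duplicates[0]}"
        )
    for key in keys:
        weight_map[key] = shard_name
    metadata = updated.setdefault("metadata", {})
    metadata["total_size"] = metadata.get("total_size", 0) + shard_size
    return updated
```

The result can be written back with `json.dump(updated, f, indent=2)` to the `model.safetensors.index.json` of the target checkpoint. Returning a copy keeps the original index available in case the write needs to be rolled back.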
|
| 86 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3 — Convert HF → BF16 GGUF with master llama.cpp</p>
|
| 87 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
|
| 88 |
J:\Models\Agents-A1 ^
|
| 89 |
--outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 90 |
--outtype f16
|
| 91 |
+
·
|
| 92 |
+
The master branch handles Qwen3.5MoE with MTP automatically:
|
| 93 |
+
· Normal layers: blk.0–39
|
| 94 |
+
· MTP layer: blk.40.nextn.* (785 tensors)</pre>
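One way to check the layout described in Step 3 is to count tensor names per block range after conversion. The following is a minimal sketch, assuming the names follow the `blk.<index>.` pattern quoted above. `count_block_tensors` is a hypothetical helper, and the list of names has to come from a GGUF reader of your choice (for example the `gguf` Python package, if installed).

```python
import re
from collections import Counter

BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def count_block_tensors(names, n_main_layers: int = 40) -> dict:
    """Count tensors in regular blocks, MTP blocks, and outside any block.

    Blocks with index >= n_main_layers are treated as MTP blocks.
    """
    counts = Counter()
    for name in names:
        match = BLOCK_RE.match(name)
        if match is None:
            counts["other"] += 1
        elif int(match.group(1)) < n_main_layers:
            counts["main"] += 1
        else:
            counts["mtp"] += 1
    return {"main": counts["main"], "mtp": counts["mtp"], "other": counts["other"]}
```

A result with `mtp` equal to zero would suggest that the injected shard was not picked up during conversion.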
|
| 95 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
|
| 96 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
|
| 97 |
--imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
|
|
|
|
| 99 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 100 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^
|
| 101 |
Q4_K_M
|
| 102 |
+
·
|
| 103 |
+
APEX qwen36_35b_mtp_*.txt configs include blk.40 overrides
|
| 104 |
+
(Q8_0 for MTP across all tiers) — no manual patching needed.</pre>
|
| 105 |
</div>
|
| 106 |
</div>
|
| 107 |
|
|
|
|
| 126 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</pre>
|
| 127 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
|
| 128 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
|
| 129 |
+
·
|
| 130 |
+
Tool-call variant
|
| 131 |
vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</pre>
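When the server is started with the tool-call flags, a client can advertise tools in the OpenAI-compatible request format. The following sketch shows the payload and the parsing of returned calls. The `get_weather` tool and both helper names are hypothetical and serve only as an illustration.

```python
import json


def build_tool_request(prompt: str, model: str = "SC117/Agents-A1-MTP-APEX-GGUF") -> dict:
    """Chat payload advertising one hypothetical tool, `get_weather`."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for illustration
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def parse_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style response, or [] if none."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]
```

The payload is POSTed to `/v1/chat/completions` on port 8000, and `parse_tool_calls` is applied to the decoded JSON response. An empty list means the model answered in plain text instead of calling a tool.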
|
| 132 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
|
| 133 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</pre>
|
README_zh.md
CHANGED
|
@@ -65,33 +65,33 @@ base_model:
|
|
| 65 |
<div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
|
| 66 |
<p style="margin: 0 0 12px 0;">The officially released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we <b>extract the single MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before GGUF conversion.</p>
|
| 67 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B</p>
|
| 68 |
-
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">
|
| 69 |
from safetensors import safe_open
|
| 70 |
import json, os
|
| 71 |
-
|
| 72 |
src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
|
| 73 |
with open(os.path.join(src, "model.safetensors.index.json")) as f:
|
| 74 |
idx = json.load(f)
|
| 75 |
mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
|
| 76 |
print(f"Found {len(mtp_keys)} MTP tensors") # 785</pre>
|
| 77 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2 — Append as a new shard (N+1)</p>
|
| 78 |
-
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">
|
| 79 |
new_shard = "model.safetensors-15-of-15.safetensors"
|
| 80 |
save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3 — Convert to BF16 GGUF with master llama.cpp</p>
|
| 87 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
|
| 88 |
J:\Models\Agents-A1 ^
|
| 89 |
--outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 90 |
--outtype f16
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
|
| 96 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
|
| 97 |
--imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
|
|
@@ -99,9 +99,9 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
|
| 99 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 100 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^
|
| 101 |
Q4_K_M
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
</div>
|
| 106 |
</div>
|
| 107 |
|
|
@@ -126,8 +126,8 @@ save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
|
| 126 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</pre>
|
| 127 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
|
| 128 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
|
| 129 |
-
|
| 130 |
-
|
| 131 |
vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</pre>
|
| 132 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
|
| 133 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</pre>
|
|
|
|
| 65 |
<div style="padding: 16px; font-size: 13px; color: #334155; line-height: 1.7;">
|
| 66 |
<p style="margin: 0 0 12px 0;">The officially released <a href="https://huggingface.co/InternScience/Agents-A1" target="_blank" style="color: #047857; text-decoration: none; font-weight: 700;">InternScience/Agents-A1</a> checkpoint is a <b>40-layer Qwen3.5-35B-A3B MoE</b> without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we <b>extract the single MTP layer from Qwen3.5-35B-A3B</b> and inject it into Agents-A1's safetensors before GGUF conversion.</p>
|
| 67 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B</p>
|
| 68 |
+
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
|
| 69 |
from safetensors import safe_open
|
| 70 |
import json, os
|
| 71 |
+
·
|
| 72 |
src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
|
| 73 |
with open(os.path.join(src, "model.safetensors.index.json")) as f:
|
| 74 |
idx = json.load(f)
|
| 75 |
mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
|
| 76 |
print(f"Found {len(mtp_keys)} MTP tensors") # 785</pre>
|
| 77 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 2 — Append as a new shard (N+1)</p>
|
| 78 |
+
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">Save the 785 MTP tensors as a new shard
|
| 79 |
new_shard = "model.safetensors-15-of-15.safetensors"
|
| 80 |
save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
|
| 81 |
+
·
|
| 82 |
+
Update model.safetensors.index.json:
|
| 83 |
+
· metadata.total_size += new shard size
|
| 84 |
+
· weight_map: add the new shard path for each MTP key
|
| 85 |
+
· Do not modify the original 14 shards (avoid touching the original data)</pre>
|
| 86 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 3 — Convert to BF16 GGUF with master llama.cpp</p>
|
| 87 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
|
| 88 |
J:\Models\Agents-A1 ^
|
| 89 |
--outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 90 |
--outtype f16
|
| 91 |
+
·
|
| 92 |
+
The master version handles Qwen3.5MoE + MTP automatically:
|
| 93 |
+
· Regular layers: blk.0–39
|
| 94 |
+
· MTP layer: blk.40.nextn.* (785 tensors)</pre>
|
| 95 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)</p>
|
| 96 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre;">F:\llama.cpp\...\llama-quantize.exe ^
|
| 97 |
--imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
|
|
|
|
| 99 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
|
| 100 |
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^
|
| 101 |
Q4_K_M
|
| 102 |
+
·
|
| 103 |
+
The APEX qwen36_35b_mtp_*.txt configs already include blk.40 overrides
|
| 104 |
+
(Q8_0 for MTP across all tiers); no manual patching needed.</pre>
|
| 105 |
</div>
|
| 106 |
</div>
|
| 107 |
|
|
|
|
| 126 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072</pre>
|
| 127 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">vLLM</p>
|
| 128 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
|
| 129 |
+
·
|
| 130 |
+
Tool-call variant
|
| 131 |
vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder</pre>
|
| 132 |
<p style="margin: 0 0 8px 0; font-weight: bold; color: #064e3b;">SGLang</p>
|
| 133 |
<pre style="margin: 0; font-family: monospace; background: #f8fafc; padding: 10px 14px; border-radius: 6px; border: 1px solid #e2e8f0; font-size: 12px; color: #1e293b; white-space: pre-wrap;">python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000</pre>
|