Agents-A1 GGUF Quants
High-quality GGUF quantizations of InternScience/Agents-A1, a 35B Qwen3.5-MoE agent model.
These files were produced from the BF16 Hugging Face checkpoint with a patched llama.cpp build that supports the qwen35moe architecture. The calibration pass used an importance matrix built from coding/instruction chat data, then each quant was benchmarked against the BF16 GGUF reference.
Recommended Files
| Use case | File | Notes |
|---|---|---|
| Best small general-purpose quant | agents-a1-IQ4_XS.gguf | Strong quality for size, broad llama.cpp compatibility. |
| Best single-user MTP throughput | agents-a1-IQ4_XS-MTP-graft-headQ6.gguf | IQ4_XS body with Q6_K MTP block; measured 1.22x over target-only in c1/128 chat serving. |
| Highest MTP acceptance in this run | agents-a1-Q4_K_M-MTP-graft-headQ6.gguf with SPEC_DRAFT_N_MAX=1 | 91.46% draft acceptance while still 1.15x over target-only. |
| Vision / image input for Q4+ quants | mmproj-agents-a1-bf16.gguf | Shared BF16 Qwen3VL mmproj for IQ4_XS, Q4_K_M, Q5_K_M, Q6_K, Q8_0, NVFP4, and the Q4 MTP variants. |
| Fast Blackwell FP4 path | agents-a1-NVFP4.gguf | Tested on RTX PRO 6000 Blackwell. Requires runtime support for GGML_TYPE_NVFP4. |
| Safer quality step up | agents-a1-Q5_K_M.gguf | Lower KLD than IQ4_XS with larger size. |
| Closest to BF16 by KLD | agents-a1-Q6_K.gguf | Best KLD in this eval set. |
| High precision archival quant | agents-a1-Q8_0.gguf | Largest quantized file. |
Files
| Quant | File size | Notes |
|---|---|---|
| Q3_K_M | 16.76 GB | Smallest included quant. |
| IQ4_XS | 18.73 GB | Recommended compact quant. |
| IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body plus integrated Q6_K/F32 MTP block. |
| NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF, output head kept at Q6_K by quality rule. |
| Q4_K_M | 21.17 GB | Standard K-quant. |
| Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body plus integrated Q6_K/F32 MTP block. |
| Q5_K_M | 24.73 GB | Strong quality/size tradeoff. |
| Q6_K | 28.51 GB | Lowest mean KLD in this run. |
| Q8_0 | 36.90 GB | Highest precision quant. |
| mmproj BF16 | 0.90 GB | Shared Qwen3VL vision encoder/projector for Q4-class and higher text GGUFs. |
Metrics
Hardware and runtime profile:
- GPU: single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload
- llama.cpp flags: -ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3
- PPL: llama-perplexity, context 2048, 64 rendered eval conversations, 3 chunks
- KLD: approximate KL(P_BF16 || P_quant) over top-64 next-token distributions on 32 prompts
The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are more useful here for quant-to-BF16 comparison.
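As a rough illustration of the KLD and top-1 metrics described above, the sketch below computes KL(P_BF16 || P_quant) restricted to the reference model's top-k tokens. The function names, the renormalization of both distributions over the top-k set, and the toy probabilities are assumptions made for illustration; the evaluation script actually used for this release may treat the tail mass differently.

```python
import math


def topk_kl(p_ref, p_quant, k=64, eps=1e-12):
    """Approximate KL(p_ref || p_quant) over the top-k tokens of p_ref.

    p_ref and p_quant map token id -> probability. Both are renormalized
    over the selected tokens (an assumption, not the documented method).
    """
    top = sorted(p_ref, key=p_ref.get, reverse=True)[:k]
    ref_mass = sum(p_ref[t] for t in top)
    quant_mass = max(sum(p_quant.get(t, 0.0) for t in top), eps)
    kl = 0.0
    for t in top:
        p = p_ref[t] / ref_mass
        # Clamp q so a token missing from the quantized model stays finite.
        q = max(p_quant.get(t, 0.0) / quant_mass, eps)
        if p > 0.0:
            kl += p * math.log(p / q)
    return kl


def top1_match(p_ref, p_quant):
    """True when both models rank the same token first."""
    return max(p_ref, key=p_ref.get) == max(p_quant, key=p_quant.get)


# Toy next-token distributions (made up, three-token vocabulary).
ref = {0: 0.70, 1: 0.20, 2: 0.10}
quant = {0: 0.60, 1: 0.30, 2: 0.10}

kl_value = topk_kl(ref, quant, k=3)
same_top1 = top1_match(ref, quant)
print(f"KL = {kl_value:.4f}, top-1 match = {same_top1}")
```

Averaging such per-prompt values would give a mean KLD, and counting matches would give a top-1 score in the same form as the table's 32-prompt column.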
| Model | Size GB | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match |
|---|---|---|---|---|---|---|---|---|
| BF16 reference | 69.38 | 3418.9 | 161.8 | 1.3031 | 0.0000 | 0.0000 | 0.0000 | 32/32 |
| Q3_K_M | 16.76 | 6779.5 | 269.0 | 1.3101 | +0.0070 | 0.0655 | 0.2155 | 28/32 |
| IQ4_XS | 18.73 | 7719.5 | 258.1 | 1.3038 | +0.0007 | 0.0151 | 0.0654 | 29/32 |
| NVFP4 | 19.72 | 9064.0 | 265.1 | 1.3063 | +0.0032 | 0.0420 | 0.1473 | 31/32 |
| Q4_K_M | 21.17 | 7230.8 | 262.6 | 1.3016 | -0.0015 | 0.1225 | 0.3349 | 27/32 |
| Q5_K_M | 24.73 | 7021.4 | 257.9 | 1.3041 | +0.0010 | 0.0091 | 0.0335 | 30/32 |
| Q6_K | 28.51 | 6294.0 | 244.6 | 1.3040 | +0.0009 | 0.0049 | 0.0178 | 32/32 |
| Q8_0 | 36.90 | 7431.3 | 222.7 | 1.3036 | +0.0005 | 0.0053 | 0.0063 | 30/32 |
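To compare quants on the metrics the text recommends, the following sketch re-derives the PPL delta and sorts the rows by mean KLD. The numbers are copied from the table above; the helper name and the size-ratio column are illustrative additions, not part of the published reports.

```python
# (quant, size_gb, ppl, kld_mean, top1_matches) copied from the metrics table.
ROWS = [
    ("Q3_K_M", 16.76, 1.3101, 0.0655, 28),
    ("IQ4_XS", 18.73, 1.3038, 0.0151, 29),
    ("NVFP4", 19.72, 1.3063, 0.0420, 31),
    ("Q4_K_M", 21.17, 1.3016, 0.1225, 27),
    ("Q5_K_M", 24.73, 1.3041, 0.0091, 30),
    ("Q6_K", 28.51, 1.3040, 0.0049, 32),
    ("Q8_0", 36.90, 1.3036, 0.0053, 30),
]
BF16_SIZE_GB = 69.38
BF16_PPL = 1.3031
NUM_PROMPTS = 32


def summarize(rows):
    """Return one dict per quant, sorted from lowest to highest mean KLD."""
    summary = []
    for name, size_gb, ppl, kld_mean, top1 in sorted(rows, key=lambda r: r[3]):
        summary.append({
            "quant": name,
            "size_ratio": round(size_gb / BF16_SIZE_GB, 3),
            "ppl_delta": round(ppl - BF16_PPL, 4),
            "kld_mean": kld_mean,
            "top1_rate": top1 / NUM_PROMPTS,
        })
    return summary


summary = summarize(ROWS)
for row in summary:
    print(row)
```

In this run the KLD ordering does not follow file size, which is one reason to read the table by KLD and top-1 agreement rather than by size or PPL alone.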
Charts
Raw metric files are in metrics/; KLD reports, checksums, and the MTP audit are in reports/.
MTP Q4 Variants
The upstream Agents-A1 checkpoint used for the first GGUF release advertises
MTP in config but does not ship mtp.*/blk.40.* tensors. The two MTP Q4
variants here graft in the Agents-A1 MTPLX MTP sidecar from
wang-yang/Agents-A1-MTPLX-Q4, then convert it with llama.cpp's Qwen3.5-MoE
MTP path. The dense MTP block is preserved at Q6_K while the model body is
quantized to IQ4_XS or Q4_K_M.
Structural checks for both MTP GGUFs:
| Check | Value |
|---|---|
| GGUF tensors | 753 |
| qwen35moe.block_count | 41 |
| qwen35moe.nextn_predict_layers | 1 |
| blk.40.* MTP tensors | 20 |
| blk.40.nextn.* tensors | 4 |
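A check of this kind can be sketched as a simple prefix count over tensor names. The helper names below are hypothetical, the sample tensor names are made up for the demonstration, and the optional loader assumes the gguf Python package (pip install gguf) with its GGUFReader class; it is not the audit script used for this release.

```python
def count_mtp_tensors(tensor_names, mtp_block=40):
    """Count all tensors, tensors in the MTP block, and its nextn tensors."""
    names = list(tensor_names)
    block_prefix = f"blk.{mtp_block}."
    nextn_prefix = f"{block_prefix}nextn."
    in_block = [n for n in names if n.startswith(block_prefix)]
    in_nextn = [n for n in in_block if n.startswith(nextn_prefix)]
    return {
        "total": len(names),
        "mtp_block": len(in_block),
        "nextn": len(in_nextn),
    }


def tensor_names_from_gguf(path):
    """Read tensor names from a GGUF file (assumes the gguf package)."""
    from gguf import GGUFReader  # imported lazily; only needed for real files

    reader = GGUFReader(path)
    return [tensor.name for tensor in reader.tensors]


# Made-up names, only to exercise the prefix logic.
sample = [
    "token_embd.weight",
    "blk.4.attn_q.weight",
    "blk.40.attn_q.weight",
    "blk.40.nextn.example_a.weight",
    "blk.40.nextn.example_b.weight",
]
counts = count_mtp_tensors(sample)
print(counts)
```

Against one of the MTP GGUFs, count_mtp_tensors(tensor_names_from_gguf("agents-a1-IQ4_XS-MTP-graft-headQ6.gguf")) would be expected to report 753, 20, and 4 if the file matches the table.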
Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU,
PARALLEL=1, CTX_SIZE=8192, streaming chat completions, 12 requests,
128 max tokens, temperature=0, top_p=1.
| Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position |
|---|---|---|---|---|---|---|
| IQ4_XS-MTP | target-only | 224.59 | 1.00x | n/a | n/a | n/a |
| IQ4_XS-MTP | draft-mtp, n_max=2 | 275.03 | 1.22x | 76.51% | 2.52 | (0.830, 0.692) |
| IQ4_XS-MTP | draft-mtp, n_max=1 | 259.58 | 1.16x | 86.47% | 1.86 | (0.865) |
| Q4_K_M-MTP | target-only | 230.48 | 1.00x | n/a | n/a | n/a |
| Q4_K_M-MTP | draft-mtp, n_max=2 | 273.80 | 1.19x | 77.18% | 2.53 | (0.847, 0.687) |
| Q4_K_M-MTP | draft-mtp, n_max=1 | 264.88 | 1.15x | 91.46% | 1.91 | (0.915) |
Recommended low-latency/single-user throughput profile: SPEC_DRAFT_N_MAX=2.
Recommended high-acceptance fallback: SPEC_DRAFT_N_MAX=1.
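The trade-off between the two profiles can be re-derived from the serving table. The sketch below recomputes each speedup as the ratio of aggregate tok/s, and checks one assumed relation: that mean accepted length is roughly one target token plus the sum of the per-position acceptance rates. That relation is inferred from the published numbers, not stated in the reports, and it matches them only to within rounding.

```python
# Aggregate tok/s for target-only decoding, from the serving table.
TARGET_ONLY = {"IQ4_XS-MTP": 224.59, "Q4_K_M-MTP": 230.48}

# (quant, n_max) -> (aggregate tok/s, acceptance by position, reported mean length)
PROFILES = {
    ("IQ4_XS-MTP", 2): (275.03, (0.830, 0.692), 2.52),
    ("IQ4_XS-MTP", 1): (259.58, (0.865,), 1.86),
    ("Q4_K_M-MTP", 2): (273.80, (0.847, 0.687), 2.53),
    ("Q4_K_M-MTP", 1): (264.88, (0.915,), 1.91),
}


def speedup(quant, n_max):
    """Throughput ratio of draft-mtp over target-only for the same quant."""
    return PROFILES[(quant, n_max)][0] / TARGET_ONLY[quant]


def expected_tokens_per_step(acceptance_by_position):
    """Assumed model: one verified token plus the expected accepted drafts."""
    return 1.0 + sum(acceptance_by_position)


for (quant, n_max), (_, positions, reported) in PROFILES.items():
    estimate = expected_tokens_per_step(positions)
    print(
        f"{quant} n_max={n_max}: speedup {speedup(quant, n_max):.2f}x, "
        f"estimated length {estimate:.3f} vs reported {reported}"
    )
```

Under that reading, n_max=2 emits more tokens per verification step even though its second draft position is accepted less often, which is consistent with it being the higher-throughput profile.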
Detailed MTP evidence is in:
- reports/agents-a1-mtp-q4-profile-summary.md
- reports/agents-a1-mtp-q4-profile-summary.json
- configs/mtp_profiles.yaml
Usage
Example with the recommended compact quant:
llama-server \
-m agents-a1-IQ4_XS.gguf \
-ngl 99 \
-c 8192 \
-b 4096 \
-ub 512 \
--flash-attn on
NVFP4 example:
llama-server \
-m agents-a1-NVFP4.gguf \
-ngl 99 \
-c 8192 \
-b 4096 \
-ub 512 \
--flash-attn on
The NVFP4 artifact is a standard GGUF that uses the NVFP4 tensor type, but runtime support for that type is newer and less widely available than for K-quants or IQ4_XS. It was tested on a Blackwell GPU with a llama.cpp build reporting BLACKWELL_NATIVE_FP4 = 1.
MTP example:
LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \
LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \
LLAMA_MTP_DRAFT_TOP_K=1 \
LLAMA_MTP_DRAFT_TOP_P=1 \
LLAMA_MTP_DRAFT_TEMP=1 \
llama-server \
-m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \
-ngl 99 \
-c 8192 \
-b 4096 \
-ub 512 \
--flash-attn on \
--reasoning off \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-n-min 0 \
--spec-draft-backend-sampling
For the high-acceptance profile, change --spec-draft-n-max 2 to
--spec-draft-n-max 1.
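Once any of the llama-server commands above is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below builds a request with the sampling settings of the serving profile (128 max tokens, temperature=0, top_p=1). It assumes the server listens on llama-server's default address 127.0.0.1:8080; the helper names are hypothetical, and streaming is turned off to keep the example short.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8080", max_tokens=128):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "top_p": 1,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant text (needs a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
print(request.full_url)
# With a server running: print(send(request))
```

The same client works for the target-only and draft-mtp configurations, since speculative decoding is configured on the server side.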
Vision / mmproj
The release includes one shared multimodal projector:
- mmproj-agents-a1-bf16.gguf
- processor_config.json
- preprocessor_config.json
- video_preprocessor_config.json
The mmproj was converted from the original InternScience/Agents-A1 Hugging
Face checkpoint with llama.cpp convert_hf_to_gguf.py --mmproj --outtype bf16.
It contains the Qwen3VL vision tower/projector and is independent of the text
quantization level, so the same file is intended for Q4-class and higher text
GGUFs:
- agents-a1-IQ4_XS.gguf
- agents-a1-IQ4_XS-MTP-graft-headQ6.gguf
- agents-a1-NVFP4.gguf
- agents-a1-Q4_K_M.gguf
- agents-a1-Q4_K_M-MTP-graft-headQ6.gguf
- agents-a1-Q5_K_M.gguf
- agents-a1-Q6_K.gguf
- agents-a1-Q8_0.gguf
Q3_K_M may load with the same mmproj, but it is not the recommended vision
profile because image tasks are more sensitive to text-model quantization.
Example with llama.cpp's multimodal CLI:
llama-mtmd-cli \
-m agents-a1-Q4_K_M.gguf \
--mmproj mmproj-agents-a1-bf16.gguf \
--image image.jpg \
-p "Describe the image." \
-ngl 99 \
-c 4096 \
-b 1024 \
-ub 256 \
--chat-template chatml \
--image-min-tokens 1024 \
--flash-attn on
If your llama.cpp llama-server build has multimodal support enabled, the same
mmproj can be passed with --mmproj mmproj-agents-a1-bf16.gguf.
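For a server started with the mmproj, an image request might be shaped as in the sketch below. It assumes the build accepts OpenAI-style image_url content parts carrying a base64 data URI; the helper name is hypothetical, and the three-byte input is only a placeholder for real image data.

```python
import base64
import json


def build_image_payload(image_bytes, prompt, mime="image/jpeg", max_tokens=128):
    """Build a chat payload with one text part and one inline image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{encoded}"},
                    },
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Placeholder bytes (the JPEG magic number); read a real file in practice,
# for example open("image.jpg", "rb").read().
payload = build_image_payload(b"\xff\xd8\xff", "Describe the image.")
body = json.dumps(payload)
print(body[:80])
```

The resulting JSON body would be posted to the server's chat completions endpoint in the same way as a text-only request.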
Local smoke test:
| Text GGUF | Image | Prompt | Expected | Answer | Verified |
|---|---|---|---|---|---|
| agents-a1-Q4_K_M.gguf | llama.cpp tools/mtmd/test-1.jpeg | Look at the newspaper image. What is the main headline? Answer only with the headline text. | MEN WALK ON MOON | MEN WALK ON MOON | true |
Verification report: reports/mmproj-q4km-actual-image-verify.json.
MTP Status
The original upstream snapshot remains config-only for MTP; see
reports/mtp-weights-audit.json. The new *-MTP-graft-headQ6.gguf files are
true integrated MTP GGUFs built from the Agents-A1 MTPLX MTP sidecar.
Provenance
- Base model: InternScience/Agents-A1
- License: Apache-2.0, inherited from the base model
- Quantization source: BF16 GGUF converted from the Hugging Face checkpoint
- MTP source: wang-yang/Agents-A1-MTPLX-Q4 sidecar grafted onto the base Agents-A1 checkpoint
- Calibration: coding/instruction chat data rendered with the model chat template
- Quantizer: patched llama.cpp with Qwen3.5-MoE and NVFP4 support