---
base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
license: other
license_name: nvidia-nemotron-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
tags:
- gguf
- quantized
- rocm
- rocmfpx
- agentic
- tool-calling
- nemotron
- nemotron-3
- nvidia
- text-generation
- llama.cpp
---

# Agent-Nemotron-ROCmFP6

**Q6_0_ROCMFPX_AGENT** (ROCmFP6 Agent) quantized GGUF of NVIDIA's Nemotron-3-Nano-30B-A3B.

- **Base model**: [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)
- **Quantization**: `Q6_0_ROCMFPX_AGENT` — ROCm-optimized 6-bit format with **agent/tool-call coherent routing**
- **Size**: ~27.4 GiB (21 GB + 6.4 GB shards)
- **Parameters**: ~30B total / 3.5B active (hybrid Mamba-2 + MoE)
- **Optimized for**: Agentic workflows, tool calling, reasoning on **AMD ROCm** and **Vulkan** backends

This quantization uses custom ROCmFPX kernels (part of experimental ROCmFPx family in llama.cpp) that provide better performance/quality on ROCm hardware for agent-style workloads. The `_AGENT` preset protects and enhances routing for tool use (Hermes-style / OpenClaw / BFCL etc.).


## Files

| File | Size | Description |
|------|------|-------------|
| `Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf` | 21 GB | Main weights shard |
| `Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00002-of-00002.gguf` | 6.4 GB | Second shard |

## Recommended Usage (llama.cpp)

Use a **ROCmFPX-enabled** build of llama.cpp (see ROCmFPX projects / strix builds).

### Quick server (recommended flags)

```bash
# Using the convenience wrapper (if installed)
HERMES_NEMOTRON_NANO_FP6_MODEL=/path/to/Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf \
  hermes-nemotron-nano-30b-rocmfp6-agent-server
```

Direct `llama-server`:

```bash
llama-server \
  -m /path/to/Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf \
  --alias nemotron-nano-30b-rocmfp6-agent \
  --host 0.0.0.0 --port 8101 \
  -dev ROCm0 \
  -ngl 999 \
  -fa on \
  --mmap \
  --jinja \
  -c 131072 \
  -b 512 -ub 512 \
  --reasoning off \
  --slots \
  --metrics
```

For best agent/tool performance use `--jinja` (the GGUF embeds a strong Nemotron tool calling template).

### Key notes

- `Q6_0_ROCMFPX_AGENT` spends a few extra bits on agent routing tensors compared to plain `Q6_0_ROCMFPX`.
- Excellent balance of quality vs size for agentic use on high-end AMD GPUs (Strix Halo, etc.).
- Supports very long context (tested high values).
- Tool calling format is the Nemotron `<tool_call>` style (also compatible with many frameworks via parsers).

## Chat Template

The GGUF includes the official Nemotron-3 tool-aware chat template. Use `--jinja` (or equivalent) with your loader.

## Benchmarks (example from development)

Typical token/s on ROCm0 (full offload) for this quant:
- ~650+ t/s prompt eval (pp512)
- ~53 t/s generation (tg128)

Results vary by hardware + context.

## License

- Original weights: [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/)
- This is a derived quantized artifact. You must comply with the base model's license terms.

---

**Model page**: https://huggingface.co/cafonez/Agent-Nemotron-ROCmFP6

For questions or issues with the quantization, refer to the ROCmFPX documentation in the corresponding development repositories.