--- base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 license: other license_name: nvidia-nemotron-open-model-license license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/ tags: - gguf - quantized - rocm - rocmfpx - agentic - tool-calling - nemotron - nemotron-3 - nvidia - text-generation - llama.cpp --- # Agent-Nemotron-ROCmFP6 **Q6_0_ROCMFPX_AGENT** (ROCmFP6 Agent) quantized GGUF of NVIDIA's Nemotron-3-Nano-30B-A3B. - **Base model**: [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) - **Quantization**: `Q6_0_ROCMFPX_AGENT` — ROCm-optimized 6-bit format with **agent/tool-call coherent routing** - **Size**: ~27.4 GiB (21 GB + 6.4 GB shards) - **Parameters**: ~30B total / 3.5B active (hybrid Mamba-2 + MoE) - **Optimized for**: Agentic workflows, tool calling, reasoning on **AMD ROCm** and **Vulkan** backends This quantization uses custom ROCmFPX kernels (part of experimental ROCmFPx family in llama.cpp) that provide better performance/quality on ROCm hardware for agent-style workloads. The `_AGENT` preset protects and enhances routing for tool use (Hermes-style / OpenClaw / BFCL etc.). ## Files | File | Size | Description | |------|------|-------------| | `Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf` | 21 GB | Main weights shard | | `Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00002-of-00002.gguf` | 6.4 GB | Second shard | ## Recommended Usage (llama.cpp) Use a **ROCmFPX-enabled** build of llama.cpp (see ROCmFPX projects / strix builds). ### Quick server (recommended flags) ```bash # Using the convenience wrapper (if installed) HERMES_NEMOTRON_NANO_FP6_MODEL=/path/to/Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf \ hermes-nemotron-nano-30b-rocmfp6-agent-server ``` Direct `llama-server`: ```bash llama-server \ -m /path/to/Nemotron-3-Nano-30B-A3B-Q6_0_ROCMFPX_AGENT-00001-of-00002.gguf \ --alias nemotron-nano-30b-rocmfp6-agent \ --host 0.0.0.0 --port 8101 \ -dev ROCm0 \ -ngl 999 \ -fa on \ --mmap \ --jinja \ -c 131072 \ -b 512 -ub 512 \ --reasoning off \ --slots \ --metrics ``` For best agent/tool performance use `--jinja` (the GGUF embeds a strong Nemotron tool calling template). ### Key notes - `Q6_0_ROCMFPX_AGENT` spends a few extra bits on agent routing tensors compared to plain `Q6_0_ROCMFPX`. - Excellent balance of quality vs size for agentic use on high-end AMD GPUs (Strix Halo, etc.). - Supports very long context (tested high values). - Tool calling format is the Nemotron `` style (also compatible with many frameworks via parsers). ## Chat Template The GGUF includes the official Nemotron-3 tool-aware chat template. Use `--jinja` (or equivalent) with your loader. ## Benchmarks (example from development) Typical token/s on ROCm0 (full offload) for this quant: - ~650+ t/s prompt eval (pp512) - ~53 t/s generation (tg128) Results vary by hardware + context. ## License - Original weights: [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/) - This is a derived quantized artifact. You must comply with the base model's license terms. --- **Model page**: https://huggingface.co/cafonez/Agent-Nemotron-ROCmFP6 For questions or issues with the quantization, refer to the ROCmFPX documentation in the corresponding development repositories.