---
license: mit
base_model: microsoft/FastContext-1.0-4B-SFT
base_model_relation: quantized
pipeline_tag: text-generation
library_name: gguf
tags:
- gguf
- rocmfp4
- quantized
- amd
- rocm
- strix-halo
- qwen3
- agent
- repository-exploration
language:
- en
---

<p align="center">
  <img src="hal0-banner.png" alt="hal0" width="420"/>
</p>

# FastContext-Hal0-4B — ROCmFP4 (STRIX_LEAN)

A 4-bit **ROCmFP4** quantization of [`microsoft/FastContext-1.0-4B-SFT`](https://huggingface.co/microsoft/FastContext-1.0-4B-SFT),
a lightweight repository-exploration subagent (Qwen3-4B backbone) for LLM coding agents.

Quantized and validated on **AMD Strix Halo** (Ryzen AI MAX+ 395 / Radeon 8060S, `gfx1151`)
using [`hal0ai/amd-strix-halo-toolboxes`](https://github.com/hal0ai) 🛠️.

> ### ⚠️ Read this first — special runtime required
> This file uses the experimental **`Q4_0_ROCMFP4`** GGUF tensor format. It is **NOT** loadable by
> stock `llama.cpp`, Ollama, LM Studio, or any standard GGUF runtime. It runs **only** in the
> [`charlie12345/rocmfp4-llama`](https://github.com/charlie12345/rocmfp4-llama) fork.
> ROCmFP4 is a custom Codebook10 / finite-UE4M3 layout — it is **not** MXFP4 or NVFP4.

## What's in this repo

| File | Size | Format | BPW |
|---|---:|---|---:|
| `FastContext-4B-ROCmFP4-STRIX_LEAN.gguf` | 2.05 GiB | `Q4_0_ROCMFP4_STRIX_LEAN` | 4.38 |

`STRIX_LEAN` is a tensor-aware preset: norms stay `f32`, sensitive tensors keep higher precision,
and the bulk of the weights use the dual/fast ROCmFP4 layouts.

## Why ROCmFP4 here

On Strix Halo, token generation is memory-bandwidth-bound, so 4-bit weights decode much faster than
BF16 while keeping quality intact for tool-calling.

### Performance (`llama-bench`, ROCm0, FlashAttention on, Radeon 8060S)

| Metric | BF16 source | **ROCmFP4 STRIX_LEAN** | Δ |
|---|---:|---:|---|
| Size | 7.49 GiB | **2.05 GiB** | **3.65× smaller** |
| Prefill `pp512` | 2388 t/s | 2244 t/s | ~same (compute-bound) |
| Decode `tg128` | 25.6 t/s | **73.7 t/s** | **2.88× faster** |

### Tool-calling quality (`server-test-function-call.py`, 5 multi-turn cases, greedy `temp 0`)

| | BF16 source | ROCmFP4 STRIX_LEAN |
|---|---:|---:|
| Cases passed | 2/5 | 4/5 |

In every case **both** models selected and ordered the correct tools — the only failures were
"no final summary produced" after correct tool use, a stopping quirk shared by the BF16 source
(not a quantization artifact). **Takeaway: FP4 introduced no measurable tool-calling regression.**
A 5-case harness can't rank models finely, so read this as "quality preserved," not "FP4 > BF16."

## How to run

Build the fork for your AMD GPU (see its README), then:

```bash
HSA_OVERRIDE_GFX_VERSION=11.5.1 \
GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
./build-strix-rocmfp4/bin/llama-server \
  -m FastContext-4B-ROCmFP4-STRIX_LEAN.gguf \
  -dev ROCm0 -ngl 999 -c 262144 -fa on --jinja
```

For scripted/non-interactive generation use `llama-completion` (this fork's `llama-cli` is
interactive-only and rejects `-no-cnv`). FastContext supports up to **262K** context.

## How it was made

```bash
# 1. HF safetensors -> BF16 GGUF
python convert_hf_to_gguf.py ./FastContext-1.0-4B-SFT --outtype bf16 --outfile fc-bf16.gguf
# 2. BF16 -> ROCmFP4 (same fork binary the server uses)
llama-quantize fc-bf16.gguf FastContext-4B-ROCmFP4-STRIX_LEAN.gguf Q4_0_ROCMFP4_STRIX_LEAN
```

## License & attribution

- Weights derive from [`microsoft/FastContext-1.0-4B-SFT`](https://huggingface.co/microsoft/FastContext-1.0-4B-SFT) — **MIT**.
- Backbone: [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) — Apache-2.0.
- Quantization format & tooling: [`charlie12345/rocmfp4-llama`](https://github.com/charlie12345/rocmfp4-llama).

This repository redistributes a quantized derivative under the terms of the upstream MIT license.

---

### About hal0ai

Built and benchmarked with **[hal0ai](https://github.com/hal0ai)** — local-first AI agent
infrastructure tuned for **AMD Strix Halo**. The
[`amd-strix-halo-toolboxes`](https://github.com/hal0ai) ship ready-to-run ROCm + ROCmFP4
container images so you can quantize and serve large models on a single unified-memory APU.
If you're running agents on AMD silicon, come say hi. 👋

---

### A note from the author 🙇

This is my **first time** doing any kind of custom model quantization or training — this
release is very much a learning project. So if you spot something I got wrong, or have tips on
presets, calibration, or quality testing, I'd genuinely **appreciate the feedback** — open a
Community discussion and let me know.

I made this to run as a **slot in [hal0](https://github.com/hal0ai)**, alongside the main agent —
a small, fast repository-exploration subagent that ROCmFP4 lets me keep resident on the Strix
Halo without crowding out the bigger models sharing the same unified memory.

If you're tinkering with local agents on AMD hardware, **come check out hal0** — would love to
see what you build. 🙂