---
base_model:
- google/gemma-4-26B-A4B-it-assistant
tags:
  - fp8
  - quantized
---
# gemma-4-26B-A4B-it-assistant-fp8

NOTE: I'm still trying to figure out if there are additional tweaks needed to get this working with MTP. As it stands in my first tests I got a 0 token acceptance rate against both the Google BF16 base and my FP8 quant. For now consider this experimental and possibly not working. I'll keep tweaking and see if I can get it all running FP8. Using the quantization flag at run time in VLLM to force fp8 works so it has to be possible.

**Format:** FP8_DYNAMIC — weights quantized to FP8 statically; activations scaled dynamically at runtime.
**Base model:** `google/gemma-4-26B-A4B-it-assistant`
**How it was made:** One-shot datafree quantization with LLM Compressor (FP8_DYNAMIC recipe) on a DGX Spark (GB10 Grace Blackwell). No calibration data required — activations are scaled dynamically at runtime.

> Notes: `lm_head` and multimodal projection layers kept in high precision. Blackwell (GB10/B100/B200) has native FP8 hardware support. Hopper (H100/H200) also supports FP8 natively. Older architectures will fall back to BF16 compute while still benefiting from the reduced model size.

Check the original model card for information about this model.

# Running the model with vLLM in Docker
```sh
sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Firworks/gemma-4-26B-A4B-it-assistant-fp8 \
  --dtype auto \
  --max-model-len 32768
```
Tested on a DGX Spark (GB10 Grace Blackwell Superchip, 128GB unified memory).

If there are other models you'd like quantized to FP8, let me know.