amd
/

granite-4.0-h-small-fp8

granitemoehybrid

Model card Files Files and versions

granite-4.0-h-small-fp8 / README.md

Xiao-AMD's picture

Update README.md

01d6ee1 verified 9 months ago

|

2.34 kB

	---
	license: mit
	base_model:
	- ibm-granite/granite-4.0-h-small
	---


	# Model Overview

	- Model Architecture: Granite-4.0-h-small
	- Input: Text
	- Output: Text
	- Supported Hardware Microarchitecture: AMD MI350/MI355
	- ROCm: 7.0
	- Operating System(s): Linux
	- Inference Engine: [SGLang](https://docs.sglang.ai/)
	- Model Optimizer: [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
	- Weight quantization: FP8, Static
	- Activation quantization: FP8, Dynamic
	- Calibration Dataset: [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

	This model was built with deepseek-ai DeepSeek-R1-0528 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

	# Model Quantization

	The model was quantized from [ibm-granite/granite-4.0-h-small](https://huggingface.co/ibm-granite/granite-4.0-h-small) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to MXFP4 format, and the AutoSmoothQuant algorithm was applied to enhance accuracy.

	Preprocessing requirement:

	Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
	You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [unsloth/DeepSeek-R1-0528-BF16](https://huggingface.co/unsloth/DeepSeek-R1-0528-BF16).

	Quantization scripts:
	```
	cd Quark/examples/torch/language_modeling/llm_ptq/
	exclude_layers="router. *lm_head"

	python llm_ptq/quantize_quark.py \
	--model_dir $MODEL_DIR \
	--output_dir $OUT_DIR \
	--quant_scheme w_fp8_a_fp8 \
	--kv_cache_dtype fp8 \
	--num_calib_data 128 \
	--exclude_layers $exclude_layers \
	--model_export hf_format \
	--multi_gpu
	```

	# Deployment
	### Use with SGLang

	This model can be deployed efficiently using the [vllm](https://github.com/vllm-project/vllm) backend.

	# License
	Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.