license: mit
base_model:
  - ibm-granite/granite-4.0-h-small

Model Overview

  • Model Architecture: Granite-4.0-h-small
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.0
  • Operating System(s): Linux
  • Inference Engine: SGLang
  • Model Optimizer: AMD-Quark
    • Weight quantization: FP8, Static
    • Activation quantization: FP8, Dynamic
  • Calibration Dataset: Pile

This model was built from ibm-granite/granite-4.0-h-small by applying AMD-Quark for FP8 quantization.

Model Quantization

The model was quantized from ibm-granite/granite-4.0-h-small using AMD-Quark. Both weights and activations were quantized to FP8: weights use static scaling and activations use dynamic scaling, as listed in the overview.
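To make the difference between static and dynamic scaling concrete, here is a minimal pure-Python sketch. It assumes the E4M3 variant of FP8 (maximum value 448) and a single per-tensor scale derived from the absolute maximum; neither detail is stated in this card, and the helper names are illustrative rather than part of AMD-Quark.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (format assumed for this sketch)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)     # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * round(mag / step) * step


def per_tensor_scale(values):
    """Scale that maps the largest magnitude in `values` onto the FP8 maximum."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


# Static: the weight scale is computed once, offline, and stored with the model.
weights = [0.5, -1.75, 0.02, 3.5]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)

# Dynamic: each activation tensor gets a fresh scale at inference time.
activation_batches = [[0.1, -0.4, 0.25], [12.0, -3.0, 6.0]]
for batch in activation_batches:
    a_scale = per_tensor_scale(batch)
    a_q = quantize(batch, a_scale)
    print(a_scale, dequantize(a_q, a_scale))
```

Static scaling avoids any runtime cost for weights, which do not change after export. Dynamic scaling lets activations with very different ranges (the two batches above differ by a factor of 30) each use the full FP8 range.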

Preprocessing requirement:

Before executing the quantization script below, make sure the checkpoint in $MODEL_DIR is stored in BFloat16. If your copy of the model is stored in FP8, dequantize it to BFloat16 first.

Quantization scripts:

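# Run from the Quark LLM PTQ example directory
cd Quark/examples/torch/language_modeling/llm_ptq/

# Layers whose names match these patterns are left unquantized
exclude_layers="*router.* *lm_head"

# $exclude_layers is intentionally unquoted so that each pattern
# is passed as a separate argument
python quantize_quark.py \
    --model_dir "$MODEL_DIR" \
    --output_dir "$OUT_DIR" \
    --quant_scheme w_fp8_a_fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --model_export hf_format \
    --multi_gpu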
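The sketch below illustrates which layer names the `exclude_layers` patterns would select, assuming they are interpreted as shell-style wildcards (an assumption about Quark's matching, not confirmed by this card). The layer names are hypothetical and chosen only to show the matching behaviour.

```python
from fnmatch import fnmatchcase

# Same patterns as in the quantization script, split on whitespace
EXCLUDE_PATTERNS = "*router.* *lm_head".split()


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if the layer name matches any exclusion pattern."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Hypothetical layer names, for illustration only
layer_names = [
    "model.layers.0.block_sparse_moe.router.layer",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.router",
    "lm_head",
]

for name in layer_names:
    status = "kept in original precision" if is_excluded(name) else "quantized"
    print(f"{name}: {status}")
```

Under this interpretation, `*router.*` only matches names that contain `router.` followed by a further component, so a module named exactly `...router` would not be excluded. It is worth checking the exported model's configuration to confirm the excluded layers are the ones you expect.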

Deployment

Use with SGLang

This model can be deployed efficiently using the SGLang backend.
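As a rough starting point, the following sketch assembles a launch command for SGLang's `sglang.launch_server` entry point. The model path, tensor-parallel size, and port are placeholder assumptions, not values recommended by this card; adjust them to your checkpoint location and hardware.

```python
import shlex


def build_launch_command(model_path, tp_size=1, port=30000):
    """Build the argument list for starting an SGLang server."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,   # local directory or Hub repo ID
        "--tp-size", str(tp_size),    # number of GPUs for tensor parallelism
        "--port", str(port),
    ]


# Placeholder path: replace with the location of the quantized checkpoint
cmd = build_launch_command("/path/to/quantized-model")
print(shlex.join(cmd))

# To actually start the server (requires SGLang to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Once the server is running, it can be queried through SGLang's OpenAI-compatible HTTP API on the chosen port.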

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.