---
license: apache-2.0
base_model:
- ibm-granite/granite-4.0-h-small
---


# Model Overview

- **Model Architecture:** Granite-4.0-h-small
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355/MI300
- **ROCm:** 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vllm](https://github.com/vllm-project/vllm)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** FP8, Static
  - **Activation quantization:** FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from [ibm-granite/granite-4.0-h-small](https://huggingface.co/ibm-granite/granite-4.0-h-small) by applying FP8 quantization with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).

# Model Quantization

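The model was quantized from [ibm-granite/granite-4.0-h-small](https://huggingface.co/ibm-granite/granite-4.0-h-small) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to FP8 with static scales. The script below also selects an FP8 KV cache and excludes layers matching `*router.*` and `*lm_head` from quantization.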

**Quantization scripts:**
```
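# MODEL_DIR: path to the ibm-granite/granite-4.0-h-small checkpoint
# OUT_DIR: destination directory for the quantized FP8 export
cd Quark/examples/torch/language_modeling
# Skip quantization for layers whose names match these patterns
exclude_layers="*router.* *lm_head"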

python llm_ptq/quantize_quark.py \
                          --model_dir $MODEL_DIR \
                          --output_dir $OUT_DIR \
                          --quant_scheme fp8 \
                          --kv_cache_dtype fp8 \
                          --num_calib_data 128 \
                          --exclude_layers $exclude_layers \
                          --model_export hf_format \
                          --multi_gpu
```

# Evaluation

The model was evaluated on GSM8K using `lm_eval` (lm-evaluation-harness) with the vLLM backend.

**Scripts:**
```
export MODEL_DIR=granite-4.0-h-small-fp8
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0

lm_eval --model vllm \
    --model_args pretrained=$MODEL_DIR,tensor_parallel_size=1,gpu_memory_utilization=0.75 \
    --tasks gsm8k \
    --trust_remote_code \
    --batch_size 32
```

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>ibm-granite/granite-4.0-h-small</strong>
   </td>
   <td><strong>ibm-granite/granite-4.0-h-small-fp8 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GSM8K
   </td>
   <td>85.60
   </td>
   <td>84.53
   </td>
   <td>98.75%
   </td>
  </tr>
</table>
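
The Recovery column is the quantized score expressed as a percentage of the baseline score. As a quick sanity check on the table, the arithmetic can be reproduced with a small shell helper; the function name `recovery` is illustrative and not part of any tool used above.

```shell
# recovery BASELINE QUANTIZED
# Prints the quantized score as a percentage of the baseline score.
recovery() {
    awk -v base="$1" -v quant="$2" 'BEGIN { printf "%.2f%%\n", quant / base * 100 }'
}

# GSM8K scores from the table above: baseline 85.60, FP8 84.53
recovery 85.60 84.53   # prints 98.75%
```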


# Deployment
### Use with vllm

This model can be deployed with the [vLLM](https://github.com/vllm-project/vllm) backend.
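
A minimal serving sketch follows. It assumes the environment variables, tensor-parallel size, and memory utilization from the evaluation run above also suit serving. The helper name `serve_fp8_model`, the default model path, and port 8000 are illustrative choices, not settings taken from this card.

```shell
# Hypothetical helper: start an OpenAI-compatible vLLM server for the FP8 model.
# The environment variables and engine arguments mirror the evaluation script;
# adjust them for your own hardware and workload.
# $1: model path or name (defaults to the directory used in the evaluation script)
serve_fp8_model() {
    VLLM_USE_V1=1 \
    VLLM_ROCM_USE_AITER=0 \
    VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 \
    vllm serve "${1:-granite-4.0-h-small-fp8}" \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.75 \
        --port 8000
}
```

Calling `serve_fp8_model "$MODEL_DIR"` starts the server. Requests can then be sent to the OpenAI-compatible endpoints, for example `http://localhost:8000/v1/completions`.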

# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.