---
license: apache-2.0
base_model:
- ibm-granite/granite-4.0-h-small
---

# Model Overview

- **Model Architecture:** Granite-4.0-h-small
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355/MI300
- **ROCm**: 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vllm](https://github.com/vllm-project/vllm)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** FP8, Static
  - **Activation quantization:** FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built with ibm-granite/granite-4.0-h-small model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for fp8 quantization.

# Model Quantization

The model was quantized from [ibm-granite/granite-4.0-h-small](https://huggingface.co/ibm-granite/granite-4.0-h-small) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to FP8 format.

**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling
exclude_layers="*router.* *lm_head"
python llm_ptq/quantize_quark.py \
    --model_dir $MODEL_DIR \
    --output_dir $OUT_DIR \
    --quant_scheme fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --exclude_layers $exclude_layers \
    --model_export hf_format \
    --multi_gpu
```

# Evaluation

The model was evaluated on GSM8K.

**Scripts:**
```
export MODEL_DIR=granite-4.0-h-small-fp8
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0
lm_eval --model vllm \
    --model_args pretrained=$MODEL_DIR,tensor_parallel_size=1,gpu_memory_utilization=0.75 \
    --tasks gsm8k \
    --trust_remote_code \
    --batch_size 32
```

### Accuracy
| Benchmark | ibm-granite/granite-4.0-h-small | ibm-granite/granite-4.0-h-small-fp8 (this model) | Recovery |
| --- | --- | --- | --- |
| GSM8K | 85.60 | 84.53 | 98.75% |
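
The Recovery column is not defined in this card. The reported figure matches the quantized score divided by the baseline score, expressed as a percentage. A minimal sketch under that assumption (the helper name `recovery_pct` is hypothetical and is not part of lm_eval or AMD-Quark):

```python
def recovery_pct(quantized_score: float, baseline_score: float) -> float:
    """Return the quantized score as a percentage of the baseline score.

    Assumption: "Recovery" in the accuracy table means
    100 * quantized / baseline. The card does not state this explicitly.
    """
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * quantized_score / baseline_score


# GSM8K scores copied from the accuracy table.
baseline_gsm8k = 85.60   # ibm-granite/granite-4.0-h-small
quantized_gsm8k = 84.53  # FP8-quantized model

print(f"GSM8K recovery: {recovery_pct(quantized_gsm8k, baseline_gsm8k):.2f}%")
```

The same helper can be reused if further benchmarks are added to the table, provided each row uses the same metric for both models.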