---
license: apache-2.0
base_model:
- stepfun-ai/Step-3.5-Flash
library_name: transformers
---

# Model Overview

- **Model Architecture:** Step3p5ForCausalLM
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.1.0
- **PyTorch**: 2.10.0
- **Transformers**: 4.57.6
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** MoE-only, OCP MXFP4, Static
- **Activation quantization:** MoE-only, OCP MXFP4, Dynamic
- **Docker Image:** rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463

# Model Quantization

The model was quantized from [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both the weights and the activations of the MoE layers are quantized to MXFP4.

**Please note that a custom quantization script is needed, and is included in this repository (`step3p5_quantize_quark.py`).**

**Quantization script:**

```
python3 step3p5_quantize_quark.py --model_dir $MODEL_DIR \
    --num_calib_data 128 \
    --multi_gpu \
    --trust_remote_code \
    --preset mxfp4_moe_only_no_kvcache \
    --output_dir $output_dir
```

For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.

# Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

## Evaluation

The model was evaluated on the GSM8K benchmark using the [vLLM](https://docs.vllm.ai/en/latest/) framework.

### Accuracy

| Benchmark | stepfun-ai/Step-3.5-Flash (bf16) | amd/Step-3.5-Flash-MXFP4 (this model) | Recovery |
| --- | --- | --- | --- |
| gsm8k (flexible-extract) | 0.8939 | 0.8726 | 97.6% |

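The Recovery column is consistent with the ratio of the quantized score to the bf16 baseline score. The model card does not state the formula, so the short sketch below assumes that definition and simply redoes the arithmetic from the two reported scores.

```python
# Scores copied from the accuracy table above (gsm8k, flexible-extract).
baseline_bf16 = 0.8939    # stepfun-ai/Step-3.5-Flash (bf16)
quantized_mxfp4 = 0.8726  # amd/Step-3.5-Flash-MXFP4 (this model)

# Assumption: recovery = quantized score / baseline score, as a percentage.
recovery_pct = 100.0 * quantized_mxfp4 / baseline_bf16

print(f"Recovery: {recovery_pct:.1f}%")  # prints "Recovery: 97.6%"
```
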
### Reproduction

The GSM8K results were obtained using the vLLM framework, based on the Docker image `rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463`.

#### Note: Due to model support issues in vLLM for Step-3.5-Flash, a few patches need to be applied (specified below) in order to run inference and evaluation using vLLM.

#### Preparation in container

```
# Reinstall vLLM
pip uninstall vllm -y
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout de7dd634b969adc6e5f50cff0cc09c1be1711d01
pip install -r requirements/rocm.txt
python setup.py develop
cd ..

export QUARK_MXFP4_IMPL="triton"
```

Modify `vllm/model_executor/models/step3p5.py` by adding the `packed_modules_mapping` attribute below to the `Step3p5ForCausalLM` class:

```
...
class Step3p5ForCausalLM(nn.Module, SupportsPP, MixtureOfExperts):
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={".share_expert.": ".moe.share_expert."}
    )

+   packed_modules_mapping = {
+       "qkv_proj": [
+           "q_proj",
+           "k_proj",
+           "v_proj",
+       ],
+       "gate_up_proj": [
+           "gate_proj",
+           "up_proj",
+       ],
+   }

    def __init__(
        self,
        *,
        vllm_config: VllmConfig,
        prefix: str = "",
    ):
        super().__init__()
        ...
```

Additionally, modify the same file (`step3p5.py`) by adding the MoE expert name mapping below to the model's `load_weights` function:

```
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    config = self.config
    assert config.num_attention_groups > 1, "Only support GQA"
    ...
    for name, loaded_weight in weights:
        if name.startswith("model."):
            local_name = name[len("model.") :]
            full_name = name
        else:
            local_name = name
            full_name = f"model.{name}" if name else "model"

+       # Normalize legacy MoE expert naming like ".moe.<expert_id>.gate_proj" to
+       # the ".moe.experts.<expert_id>.gate_proj" format
+       if ".moe.experts." not in local_name and ".moe." in local_name:
+           parts = local_name.split(".moe.", 1)
+           if len(parts) == 2 and "." in parts[1]:
+               expert_and_rest = parts[1]
+               expert_id, remainder = expert_and_rest.split(".", 1)
+               if expert_id.isdigit():
+                   local_name = f"{parts[0]}.moe.experts.{expert_id}.{remainder}"

        spec_layer = get_spec_layer_idx_from_weight_name(config, full_name)
        if spec_layer is not None:
            continue  # skip spec decode layers for main model
        ...
```

Finally, modify `vllm/model_executor/layers/quantization/quark/quark_moe.py` by forcing `self.emulate` to `True` ([alternate resolution](https://github.com/vllm-project/vllm/pull/39436)):

```
class QuarkOCP_MX_MoEMethod(QuarkMoEMethod):
    def __init__(...):
        super().__init__(moe)
        ...
        self.model_type = getattr(
            get_current_vllm_config().model_config.hf_config, "model_type", None
        )

-       self.emulate = (
-           not current_platform.supports_mx()
-           or not self.ocp_mx_scheme.startswith("w_mxfp4")
-       ) and (self.mxfp4_backend is None or not self.use_rocm_aiter_moe)
+       self.emulate = True

        logger.warning_once(
        ...
```

***Note:** If Memory Access Faults are encountered, ensure that the `QUARK_MXFP4_IMPL="triton"` environment variable is set.*

#### Evaluating model using lm_eval

```
# Double quotes are used so that $MODEL_DIR is expanded by the shell.
lm_eval --model vllm \
    --model_args "pretrained=$MODEL_DIR,attention_backend=ROCM_AITER_UNIFIED_ATTN,quantization=quark,trust_remote_code=True" \
    --tasks gsm8k \
    --batch_size auto
```

# License

Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.