---
license: apache-2.0
base_model:
- stepfun-ai/Step-3.5-Flash
library_name: transformers
---

# Model Overview

- **Model Architecture:** Step3p5ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.1.0
- **PyTorch:** 2.10.0
- **Transformers:** 4.57.6
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** MoE-only, OCP MXFP4, Static
  - **Activation quantization:** MoE-only, OCP MXFP4, Dynamic
- **Docker Image:** rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463


# Model Quantization

The model was quantized from [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Only the MoE layers are quantized: their weights (static) and activations (dynamic) both use OCP MXFP4. **Quantization requires a custom script, which is included in this repository (`step3p5_quantize_quark.py`).**


**Quantization script:**
```
python3 step3p5_quantize_quark.py --model_dir $MODEL_DIR \
                          --num_calib_data 128 \
                          --multi_gpu  \
                          --trust_remote_code \
                          --preset mxfp4_moe_only_no_kvcache \
                          --output_dir $output_dir
```
For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.
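
To make the MXFP4 format concrete, the sketch below fake-quantizes one block of values following the OCP Microscaling layout: 32 elements share one power-of-two scale, and each element is stored as FP4 (E2M1). It is an illustration only. `quantize_block` is a hypothetical helper, ties are broken toward the smaller magnitude instead of round-to-nearest-even, the E8M0 exponent range is not clamped, and AMD-Quark's actual implementation may differ.

```python
import math

# Magnitudes representable by the FP4 E2M1 element format.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # number of elements that share one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)


def quantize_block(block):
    """Fake-quantize one block; returns (shared_exponent, dequantized_values).

    Hypothetical helper for illustration, not part of AMD-Quark.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest element fits in E2M1.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for x in block:
        # Scale, saturate at the largest E2M1 magnitude, snap to the grid.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        q = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        dequantized.append(math.copysign(q * scale, x))
    return shared_exp, dequantized


# Values already on the E2M1 grid survive unchanged when the scale is 1.
exp, values = quantize_block([1.0] * (BLOCK_SIZE - 1) + [6.0])
print(exp, values[-1])
```

With static weight quantization the shared scales are computed once offline, while dynamic activation quantization recomputes them for each block at inference time.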

# Deployment
## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
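
As a rough starting point, the sketch below runs offline inference through vLLM's Python API. It assumes the patched vLLM build described in the Reproduction section, and it reuses the `quantization` and `trust_remote_code` settings from the evaluation command. `engine_kwargs` and `generate` are hypothetical helper names, and larger deployments may need extra engine arguments (for example a tensor-parallel size) that are not shown here.

```python
import os


def engine_kwargs(model_dir: str) -> dict:
    """Engine arguments mirroring the evaluation command in this card."""
    return {
        "model": model_dir,
        "quantization": "quark",
        "trust_remote_code": True,
    }


def generate(model_dir: str, prompts: list[str], max_tokens: int = 128) -> list[str]:
    """Greedy-decode each prompt and return the generated text."""
    # Same setting as in the container preparation steps.
    os.environ.setdefault("QUARK_MXFP4_IMPL", "triton")
    # Imported lazily so the helper above works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(model_dir))
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    return [out.outputs[0].text for out in llm.generate(prompts, params)]
```

Calling `generate(os.environ["MODEL_DIR"], ["What is 12 * 7?"])` would then load the quantized checkpoint and return one completion per prompt.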

# Evaluation
The model was evaluated on the GSM8K benchmark using the [vLLM](https://docs.vllm.ai/en/latest/) framework.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>stepfun-ai/Step-3.5-Flash (bf16)</strong>
   </td>
   <td><strong>amd/Step-3.5-Flash-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>gsm8k (flexible-extract) 
   </td>
   <td>0.8939
   </td>
   <td>0.8726
   </td>
   <td>97.6%
   </td>
  </tr>
</table>
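
The Recovery column is consistent with the quantized score divided by the bf16 baseline score, expressed as a percentage. The snippet below reproduces the figure under that assumption; `recovery` is a hypothetical helper name.

```python
def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# GSM8K (flexible-extract) scores from the table above.
print(f"{recovery(0.8726, 0.8939):.1f}%")  # prints 97.6%
```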


### Reproduction

The GSM8K results were obtained using the vLLM framework, based on the Docker image `rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463`.

**Note:** Because vLLM does not yet fully support Step-3.5-Flash, the patches described below must be applied before running inference or evaluation with vLLM.

#### Preparation in container
```
# Reinstall vLLM
pip uninstall vllm -y
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout de7dd634b969adc6e5f50cff0cc09c1be1711d01
pip install -r requirements/rocm.txt
python setup.py develop
cd ..
export QUARK_MXFP4_IMPL="triton"
```
Modify `vllm/model_executor/models/step3p5.py` by adding the `packed_modules_mapping` attribute shown below to the `Step3p5ForCausalLM` class:
```
...

class Step3p5ForCausalLM(nn.Module, SupportsPP, MixtureOfExperts):
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={".share_expert.": ".moe.share_expert."}
    )
    
+   packed_modules_mapping = {
+           "qkv_proj": [
+               "q_proj",
+               "k_proj",
+               "v_proj",
+           ],
+           "gate_up_proj": [
+               "gate_proj",
+               "up_proj",
+           ],
+       }
    
    def __init__(
        self,
        *,
        vllm_config: VllmConfig,
        prefix: str = "",
    ):
        super().__init__()
        ...
```
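
The mapping above records which per-projection modules in the checkpoint are fused into a single module at runtime. The sketch below illustrates that relationship by rewriting a checkpoint module name into its fused counterpart. It is not vLLM's internal logic; `fused_module_name` and the example module paths are hypothetical.

```python
PACKED_MODULES_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def fused_module_name(checkpoint_name: str, mapping=PACKED_MODULES_MAPPING) -> str:
    """Replace any shard component (e.g. q_proj) with its fused name."""
    parts = checkpoint_name.split(".")
    for fused, shards in mapping.items():
        parts = [fused if part in shards else part for part in parts]
    return ".".join(parts)


print(fused_module_name("layers.0.self_attn.k_proj"))  # layers.0.self_attn.qkv_proj
print(fused_module_name("layers.0.self_attn.o_proj"))  # unchanged
```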
Additionally, in the same file (`step3p5.py`), add the MoE expert name normalization shown below to the model's `load_weights` function:
```
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        config = self.config
        assert config.num_attention_groups > 1, "Only support GQA"
        
        ...

        for name, loaded_weight in weights:
            if name.startswith("model."):
                local_name = name[len("model.") :]
                full_name = name
            else:
                local_name = name
                full_name = f"model.{name}" if name else "model"

+           # Normalize legacy MoE expert naming like ".moe.<E>.gate_proj" to
+           # the ".moe.experts.<E>.gate_proj" format 
+           if ".moe.experts." not in local_name and ".moe." in local_name:
+               parts = local_name.split(".moe.", 1)
+               if len(parts) == 2 and "." in parts[1]:
+                   expert_and_rest = parts[1]
+                   expert_id, remainder = expert_and_rest.split(".", 1)
+                   if expert_id.isdigit():
+                       local_name = f"{parts[0]}.moe.experts.{expert_id}.{remainder}"

            spec_layer = get_spec_layer_idx_from_weight_name(config, full_name)
            if spec_layer is not None:
                continue  # skip spec decode layers for main model
        ...
``` 
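
The renaming logic in the patch can be checked in isolation. The sketch below restates it as a standalone function so its behavior can be tested outside vLLM; `normalize_moe_expert_name` and the example weight names are hypothetical.

```python
def normalize_moe_expert_name(name: str) -> str:
    """Rewrite '<prefix>.moe.<E>.<rest>' to '<prefix>.moe.experts.<E>.<rest>'.

    Names that already use '.moe.experts.' or whose component after
    '.moe.' is not a numeric expert index are returned unchanged.
    """
    if ".moe.experts." in name or ".moe." not in name:
        return name
    prefix, rest = name.split(".moe.", 1)
    expert_id, sep, remainder = rest.partition(".")
    if sep and expert_id.isdigit():
        return f"{prefix}.moe.experts.{expert_id}.{remainder}"
    return name


print(normalize_moe_expert_name("layers.3.moe.7.gate_proj.weight"))
# layers.3.moe.experts.7.gate_proj.weight
```

Non-expert parameters under `.moe.`, such as the shared expert remapped by `hf_to_vllm_mapper`, pass through untouched because their next component is not a number.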
Finally, modify `vllm/model_executor/layers/quantization/quark/quark_moe.py` by forcing `self.emulate` to `True` ([alternate resolution](https://github.com/vllm-project/vllm/pull/39436)):
```
class QuarkOCP_MX_MoEMethod(QuarkMoEMethod):
    def __init__(...):
        super().__init__(moe)
        ...

        self.model_type = getattr(
            get_current_vllm_config().model_config.hf_config, "model_type", None
        )

-       self.emulate = (
-           not current_platform.supports_mx()
-           or not self.ocp_mx_scheme.startswith("w_mxfp4")
-       ) and (self.mxfp4_backend is None or not self.use_rocm_aiter_moe)
+       self.emulate = True
        
            logger.warning_once(
        ...
```

***Note:** If memory access faults are encountered, ensure that the `QUARK_MXFP4_IMPL="triton"` environment variable is set.*


#### Evaluating model using lm_eval
```
lm_eval --model vllm --model_args "pretrained=$MODEL_DIR,attention_backend=ROCM_AITER_UNIFIED_ATTN,quantization=quark,trust_remote_code=True" --tasks gsm8k --batch_size auto
```


# License
Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.