---
license: apache-2.0
base_model:
- stepfun-ai/Step-3.5-Flash
library_name: transformers
---
# Model Overview
- **Model Architecture:** Step3p5ForCausalLM
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.1.0
- **PyTorch**: 2.10.0
- **Transformers**: 4.57.6
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** MoE-only, OCP MXFP4, Static
- **Activation quantization:** MoE-only, OCP MXFP4, Dynamic
- **Docker Image:** rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463
# Model Quantization
The model was quantized from [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are both quantized to MXFP4. **Please note that a custom quantization script is needed, and is included in this repository (`step3p5_quantize_quark.py`).**
**Quantization scripts:**
```
python3 step3p5_quantize_quark.py --model_dir $MODEL_DIR \
    --num_calib_data 128 \
    --multi_gpu \
    --trust_remote_code \
    --preset mxfp4_moe_only_no_kvcache \
    --output_dir $output_dir
```
For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.
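To make the MXFP4 format more concrete, here is a minimal sketch of block-wise fake quantization. It assumes the OCP MX convention of 32-element blocks that share one power-of-two scale, with each element stored as FP4 (E2M1). The function name `mxfp4_fake_quantize` is hypothetical, ties are resolved naively rather than by round-to-nearest-even, and this is not AMD-Quark's implementation.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # assumed number of elements sharing one scale
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)


def mxfp4_fake_quantize(block):
    """Quantize one block and return (shared_exponent, dequantized_values)."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * BLOCK_SIZE

    # Shared power-of-two scale, chosen so the block maximum lands near the
    # top of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp

    dequantized = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest * scale, x))
    return shared_exp, dequantized


shared_exp, values = mxfp4_fake_quantize([1.0, 0.3, -0.3] + [0.0] * 29)
print(shared_exp, values[:3])
```

In the scheme listed in the overview, weight scales are fixed at quantization time (static), while activation scales would be computed from each block at inference time (dynamic).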
# Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
## Evaluation
The model was evaluated on the GSM8K benchmark using the [vLLM](https://docs.vllm.ai/en/latest/) framework.
### Accuracy
| Benchmark | stepfun-ai/Step-3.5-Flash (bf16) | amd/Step-3.5-Flash-MXFP4 (this model) | Recovery |
| --- | --- | --- | --- |
| gsm8k (flexible-extract) | 0.8939 | 0.8726 | 97.6% |
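
The Recovery figure is consistent with the ratio of the quantized score to the bf16 baseline score, expressed as a percentage. A minimal check under that assumption (the `recovery` helper is illustrative, not part of any evaluation tool):

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Return the quantized score as a percentage of the baseline score."""
    return 100.0 * quantized_score / baseline_score


# Scores taken from the accuracy results above.
print(f"{recovery(0.8726, 0.8939):.1f}%")  # prints 97.6%
```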
### Reproduction
The GSM8K results were obtained using the vLLM framework, based on the Docker image `rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463`.
***Note:** Due to model support issues in vLLM for Step-3.5-Flash, a few patches (specified below) must be applied in order to run inference and evaluation using vLLM.*
#### Preparation in container
```
# Reinstall vLLM
pip uninstall vllm -y
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout de7dd634b969adc6e5f50cff0cc09c1be1711d01
pip install -r requirements/rocm.txt
python setup.py develop
cd ..
export QUARK_MXFP4_IMPL="triton"
```
Modify `vllm/model_executor/models/step3p5.py` by adding the `packed_modules_mapping` attribute shown below to the `Step3p5ForCausalLM` class:
```
 ...
 class Step3p5ForCausalLM(nn.Module, SupportsPP, MixtureOfExperts):
     hf_to_vllm_mapper = WeightsMapper(
         orig_to_new_substr={".share_expert.": ".moe.share_expert."}
     )
+    packed_modules_mapping = {
+        "qkv_proj": [
+            "q_proj",
+            "k_proj",
+            "v_proj",
+        ],
+        "gate_up_proj": [
+            "gate_proj",
+            "up_proj",
+        ],
+    }
     def __init__(
         self,
         *,
         vllm_config: VllmConfig,
         prefix: str = "",
     ):
         super().__init__()
         ...
```
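
The mapping ties each fused vLLM module name to the per-projection names found in the checkpoint. The sketch below only illustrates that relationship by rewriting a checkpoint-style weight name to its fused equivalent. The helper `fused_name` and the example weight name are hypothetical, and this is not how vLLM itself consumes the attribute.

```python
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

# Invert the mapping: per-projection name -> fused module name.
shard_to_fused = {
    shard: fused
    for fused, shards in packed_modules_mapping.items()
    for shard in shards
}


def fused_name(weight_name: str) -> str:
    """Hypothetical helper: replace any per-projection component with its fused name."""
    return ".".join(shard_to_fused.get(part, part) for part in weight_name.split("."))


# Hypothetical checkpoint weight name, for illustration only.
print(fused_name("layers.0.self_attn.k_proj.weight"))
```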
Additionally, modify the same file (`step3p5.py`) by adding the below MoE expert name mapping to the model's `load_weights` function:
```
 def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
     config = self.config
     assert config.num_attention_groups > 1, "Only support GQA"
     ...
     for name, loaded_weight in weights:
         if name.startswith("model."):
             local_name = name[len("model.") :]
             full_name = name
         else:
             local_name = name
             full_name = f"model.{name}" if name else "model"
+        # Normalize legacy MoE expert naming like ".moe.<expert_id>.gate_proj" to
+        # the ".moe.experts.<expert_id>.gate_proj" format
+        if ".moe.experts." not in local_name and ".moe." in local_name:
+            parts = local_name.split(".moe.", 1)
+            if len(parts) == 2 and "." in parts[1]:
+                expert_and_rest = parts[1]
+                expert_id, remainder = expert_and_rest.split(".", 1)
+                if expert_id.isdigit():
+                    local_name = f"{parts[0]}.moe.experts.{expert_id}.{remainder}"
         spec_layer = get_spec_layer_idx_from_weight_name(config, full_name)
         if spec_layer is not None:
             continue  # skip spec decode layers for main model
     ...
```
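
The renaming rule in this patch can be tried in isolation. The sketch below lifts the same logic into a standalone function so its effect on a few names is easy to see. The function name `normalize_moe_expert_name` and the example weight names are hypothetical, chosen only to exercise each branch.

```python
def normalize_moe_expert_name(local_name: str) -> str:
    """Insert "experts." after ".moe." when it is followed by a numeric expert id."""
    if ".moe.experts." not in local_name and ".moe." in local_name:
        prefix, rest = local_name.split(".moe.", 1)
        if "." in rest:
            expert_id, remainder = rest.split(".", 1)
            if expert_id.isdigit():
                return f"{prefix}.moe.experts.{expert_id}.{remainder}"
    return local_name


# Hypothetical weight names, for illustration only.
for name in (
    "layers.3.moe.7.gate_proj.weight",           # legacy form: rewritten
    "layers.3.moe.experts.7.gate_proj.weight",   # already normalized: unchanged
    "layers.3.moe.share_expert.up_proj.weight",  # non-numeric component: unchanged
):
    print(name, "->", normalize_moe_expert_name(name))
```

Only names whose first component after `.moe.` is purely numeric are rewritten, so shared-expert and other weights under `.moe.` pass through untouched.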
Finally, modify `vllm/model_executor/layers/quantization/quark/quark_moe.py` to force `self.emulate` to `True` ([alternate resolution](https://github.com/vllm-project/vllm/pull/39436)):
```
 class QuarkOCP_MX_MoEMethod(QuarkMoEMethod):
     def __init__(...):
         super().__init__(moe)
         ...
         self.model_type = getattr(
             get_current_vllm_config().model_config.hf_config, "model_type", None
         )
-        self.emulate = (
-            not current_platform.supports_mx()
-            or not self.ocp_mx_scheme.startswith("w_mxfp4")
-        ) and (self.mxfp4_backend is None or not self.use_rocm_aiter_moe)
+        self.emulate = True
         logger.warning_once(
             ...
```
***Note:** If memory access faults are encountered, ensure that the `QUARK_MXFP4_IMPL="triton"` environment variable is set.*
#### Evaluating model using lm_eval
```
lm_eval --model vllm --model_args "pretrained=$MODEL_DIR,attention_backend=ROCM_AITER_UNIFIED_ATTN,quantization=quark,trust_remote_code=True" --tasks gsm8k --batch_size auto
```
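
Shell quoting around `--model_args` is easy to get wrong, so one option is to assemble the same command as an argument list in Python. This is a minimal sketch: the helper `build_lm_eval_command` and the model path are hypothetical, while the flags and values are the ones used in the command above.

```python
import shlex


def build_lm_eval_command(model_dir: str) -> list[str]:
    """Assemble the lm_eval invocation as an argument list (no shell quoting needed)."""
    model_args = {
        "pretrained": model_dir,
        "attention_backend": "ROCM_AITER_UNIFIED_ATTN",
        "quantization": "quark",
        "trust_remote_code": "True",
    }
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", ",".join(f"{key}={value}" for key, value in model_args.items()),
        "--tasks", "gsm8k",
        "--batch_size", "auto",
    ]


# Hypothetical local path to the quantized checkpoint.
cmd = build_lm_eval_command("/models/Step-3.5-Flash-MXFP4")
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` inside the container, which avoids shell interpolation entirely.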
# License
Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.