---
base_model: Qwen/Qwen3.5-35B-A3B
language: en
license: apache-2.0
pipeline_tag: text-generation
library_name: mlx
datasets:
- NeelNanda/pile-10k
tags:
- mlx
- qwen
- mixture-of-experts
- moe
- reap
- cerebras-reap
- static-pruning
- apple-silicon
- quantized
- q8
---

# Qwen3.5-35B-A3B-REAP-pile10k-15p-MLX-q8

This repository contains a **static REAP-pruned MLX checkpoint** derived from **Qwen/Qwen3.5-35B-A3B** and quantized to **q8**.

## Original Model Lineage

- Original upstream model: [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- MLX bf16 source checkpoint used locally for pruning and quantization: [mlx-community/Qwen3.5-35B-A3B-bf16](https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-bf16)

The lineage matters here: the weights originate from the original Qwen release, but the pruning and quantization workflow ran locally on the MLX bf16 conversion, not on the upstream checkpoint directly.

## REAP Method Lineage

- Original REAP project referenced by the MLX tool: [CerebrasResearch/reap](https://github.com/CerebrasResearch/reap)
- MLX pruning implementation used here: [0xSero/reap-mlx](https://github.com/0xSero/reap-mlx)

`reap-mlx` is the Apple Silicon / MLX implementation of the **pruning side** of Cerebras REAP. In other words, the pruning logic used for this checkpoint is descended from the Cerebras REAP method, but executed through the `reap-mlx` workflow on a local MLX checkpoint.

## Exact Tooling Used

- `reap-mlx` version: `0.1.0`
- `reap-mlx` commit: `080a764`
- `MLX` version used for quantization: `0.31.0`
- `mlx-lm` version used for quantization and serving: `0.30.8`

## What Was Actually Done

The workflow for this release was:

1. Start from the MLX bf16 Qwen3.5-35B-A3B checkpoint.
2. Run REAP telemetry collection on `pile-10k` calibration data.
3. Build a static pruning plan from that telemetry.
4. Apply that pruning plan to physically remove MoE experts from the checkpoint.
5. Quantize the resulting pruned bf16 checkpoint into `q8`.

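This is **static MoE expert pruning**: the selected experts are removed from the checkpoint once, offline, and are no longer present at inference time.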
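To illustrate what step 4 means in practice, here is a minimal NumPy sketch of physically removing experts from one MoE layer. It assumes a stacked layout (expert weights of shape `(experts, d_out, d_in)` and a router matrix of shape `(experts, d_model)`); that layout and the name `prune_layer` are illustrative assumptions, not the actual `reap-mlx` code or checkpoint format.

```python
import numpy as np


def prune_layer(expert_weights, router_weight, drop):
    """Drop the experts listed in `drop` from one MoE layer.

    expert_weights: (n_experts, d_out, d_in) stacked expert matrices (assumed layout)
    router_weight:  (n_experts, d_model) router projection, one row per expert
    drop:           indices of experts to remove
    """
    n_experts = router_weight.shape[0]
    # Indices of surviving experts, in their original order.
    keep = np.setdiff1d(np.arange(n_experts), drop)
    # Slicing both tensors keeps the router rows aligned with the experts.
    return expert_weights[keep], router_weight[keep], keep


# Toy layer: 8 experts, tiny dimensions.
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 4, 3))
router = rng.normal(size=(8, 3))
pruned_experts, pruned_router, kept = prune_layer(experts, router, drop=[1, 5])
```

Because the router rows are removed together with the expert weights, the router simply produces fewer logits afterwards; no masking is needed at inference time.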

## Calibration Data and How It Was Used

- Calibration dataset: [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k)
- Background/source corpus project: [EleutherAI/the-pile](https://github.com/EleutherAI/the-pile)
- Calibration slice used for REAP: `train[:256]`

`pile-10k train[:256]` was used **as REAP calibration data**, not as model training data and not as the main published benchmark target. Concretely, the model was run over this public text subset to collect per-expert telemetry, estimate expert saliency, and decide which experts to prune.
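For readers who want to inspect the same slice, the following is a minimal sketch using the Hugging Face `datasets` library. It assumes the dataset stores its text in a column named `text`; the helper name `load_calibration_texts` is hypothetical and not part of `reap-mlx`.

```python
CALIBRATION_DATASET = "NeelNanda/pile-10k"
CALIBRATION_SPLIT = "train[:256]"  # first 256 rows of the train split


def load_calibration_texts(dataset=CALIBRATION_DATASET, split=CALIBRATION_SPLIT):
    """Return the calibration rows as a list of strings (downloads on first call)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    rows = load_dataset(dataset, split=split)
    # Assumption: the raw text lives in a column named "text".
    return [row["text"] for row in rows]
```

Calling `load_calibration_texts()` should return 256 strings; how `reap-mlx` tokenizes and batches them is not shown here.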

## Pruning Configuration for This Release

- Pruning method: `reap`
- Experts pruned per layer: `38 / 256`
- Achieved prune ratio: `14.8438%`
- Total experts removed: `1520 / 10240`
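
The figures above are mutually consistent, which a few lines of arithmetic can confirm. The layer count is not stated in this card; the sketch derives it from the totals, and the variable names are illustrative.

```python
EXPERTS_PER_LAYER = 256
PRUNED_PER_LAYER = 38
TOTAL_EXPERTS = 10240

# Number of MoE layers implied by the totals in this card.
moe_layers = TOTAL_EXPERTS // EXPERTS_PER_LAYER
total_removed = PRUNED_PER_LAYER * moe_layers
prune_ratio = PRUNED_PER_LAYER / EXPERTS_PER_LAYER
remaining_per_layer = EXPERTS_PER_LAYER - PRUNED_PER_LAYER

print(f"MoE layers: {moe_layers}")
print(f"Experts removed: {total_removed} / {TOTAL_EXPERTS}")
print(f"Prune ratio: {prune_ratio:.4%}")
print(f"Experts remaining per layer: {remaining_per_layer}")
```

The achieved ratio falls slightly below the nominal 15% in the model name because a whole number of experts must be removed per layer, and 15% of 256 is 38.4.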

The REAP-style salience used in `reap-mlx` is based on the mean routed expert contribution:

`saliency_j = mean(g_j(x) * ||f_j(x)||)`

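where `g_j(x)` is the router weight for expert `j` on input `x`, `f_j(x)` is that expert's output, and `||f_j(x)||` is the norm of that output. Experts with the lowest saliency are the ones selected for pruning.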
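
The formula can be made concrete with a small NumPy sketch. It assumes the mean is taken over the tokens actually routed to each expert (gate weight greater than zero) and that the output norms have already been collected; the function names and the toy numbers are illustrative, not taken from `reap-mlx`.

```python
import numpy as np


def reap_saliency(gate_weights, output_norms):
    """Mean of g_j(x) * ||f_j(x)|| over the tokens routed to each expert.

    gate_weights: (tokens, experts), zero where an expert was not routed
    output_norms: (tokens, experts), norm of each expert's output per token
    """
    routed = gate_weights > 0
    contrib = np.where(routed, gate_weights * output_norms, 0.0)
    counts = routed.sum(axis=0)
    sums = contrib.sum(axis=0)
    # Experts that were never routed get saliency 0 (an assumption of this sketch).
    return np.divide(sums, counts, out=np.zeros_like(sums, dtype=float), where=counts > 0)


def experts_to_prune(saliency, n_prune):
    """Indices of the n_prune experts with the lowest saliency, sorted ascending."""
    return np.sort(np.argsort(saliency, kind="stable")[:n_prune])


# Toy telemetry: 4 tokens, 3 experts, top-2 routing.
gates = np.array([[0.7, 0.3, 0.0],
                  [0.6, 0.0, 0.4],
                  [0.0, 0.9, 0.1],
                  [0.5, 0.5, 0.0]])
norms = np.array([[2.0, 1.0, 5.0],
                  [2.0, 1.0, 1.0],
                  [2.0, 1.0, 1.0],
                  [2.0, 1.0, 5.0]])
saliency = reap_saliency(gates, norms)
to_prune = experts_to_prune(saliency, n_prune=1)
```

In this toy case expert 2 has the lowest mean contribution and is the one chosen. For this release the same selection would be made per layer with `n_prune=38`.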

## Format and Size

- Format: `MLX`
- Quantization: `q8`
- Approximate local on-disk size during creation: `30 GB`
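
As a usage sketch, an MLX checkpoint like this one can typically be loaded with `mlx-lm` on Apple Silicon. `MODEL_PATH` below is a placeholder for a local directory or Hub repository id, and the wrapper name `run_prompt` is hypothetical; the machine also needs enough unified memory for a checkpoint of roughly 30 GB.

```python
# Placeholder: replace with the local directory or Hub repo id of this checkpoint.
MODEL_PATH = "path/to/Qwen3.5-35B-A3B-REAP-pile10k-15p-MLX-q8"


def run_prompt(prompt, model_path=MODEL_PATH, max_tokens=128):
    """Load the checkpoint with mlx-lm and generate a completion for `prompt`."""
    # Imported lazily: mlx-lm is only installable on Apple Silicon.
    from mlx_lm import load, generate

    model, tokenizer = load(model_path)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

Calling `run_prompt("Explain mixture-of-experts routing in one sentence.")` returns the generated text as a string.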

## Benchmark / Evaluation Status

The 15% variants, including this checkpoint, were generated from the same pile-10k REAP telemetry and uploaded. No full custom benchmark run was completed for these variants before publication.

## Notes

- This repository is the pruned-and-quantized derivative checkpoint, not the original Qwen release.
- The pruning decision was driven by REAP telemetry collected on `pile-10k` calibration rows.
- Calibration and benchmarking are separate roles: the calibration slice was used to build the prune plan, not to evaluate the model.