---
base_model: Qwen/Qwen3.5-35B-A3B
language: en
license: apache-2.0
pipeline_tag: text-generation
library_name: mlx
datasets:
- NeelNanda/pile-10k
tags:
- mlx
- qwen
- mixture-of-experts
- moe
- reap
- cerebras-reap
- static-pruning
- apple-silicon
- quantized
- q8
---

# Qwen3.5-35B-A3B-REAP-pile10k-15p-MLX-q8

This repository contains a **static REAP-pruned MLX checkpoint** derived from **Qwen/Qwen3.5-35B-A3B** and quantized to **q8**.

## Original Model Lineage

- Original upstream model: [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B)
- MLX bf16 source checkpoint used locally for pruning and quantization: [mlx-community/Qwen3.5-35B-A3B-bf16](https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-bf16)

This model card is intentionally explicit about lineage: the starting point was the original Qwen model family, but the local pruning and quantization workflow operated on the MLX bf16 conversion.

## REAP Method Lineage

- Original REAP project referenced by the MLX tool: [CerebrasResearch/reap](https://github.com/CerebrasResearch/reap)
- MLX pruning implementation used here: [0xSero/reap-mlx](https://github.com/0xSero/reap-mlx)

`reap-mlx` is the Apple Silicon / MLX implementation of the **pruning side** of Cerebras REAP. In other words, the pruning logic used for this checkpoint is descended from the Cerebras REAP method, but executed through the `reap-mlx` workflow on a local MLX checkpoint.

## Exact Tooling Used

- `reap-mlx` version: `0.1.0`
- `reap-mlx` commit: `080a764`
- `MLX` version used for quantization: `0.31.0`
- `mlx-lm` version used for quantization and serving: `0.30.8`

## What Was Actually Done

The workflow for this release was:

1. Start from the MLX bf16 Qwen3.5-35B-A3B checkpoint.
2. Run REAP telemetry collection on `pile-10k` calibration data.
3. Build a static pruning plan from that telemetry.
4. Apply that pruning plan to physically remove MoE experts from the checkpoint.
5. Quantize the resulting pruned bf16 checkpoint into `q8`.

This is **static MoE expert pruning**.

## Calibration Data and How It Was Used

- Calibration dataset: [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k)
- Background/source corpus project: [EleutherAI/the-pile](https://github.com/EleutherAI/the-pile)
- Calibration slice used for REAP: `train[:256]`

This part is important: `pile-10k train[:256]` was used **as REAP calibration data**, not as model training data and not as the main published benchmark target.

Concretely, the model was run over this public text subset to collect per-expert telemetry, estimate expert salience, and decide which experts to prune.

## Pruning Configuration for This Release

- Pruning method: `reap`
- Experts pruned per layer: `38 / 256`
- Achieved prune ratio: `14.8438%`
- Total experts removed: `1520 / 10240`

The REAP-style salience used in `reap-mlx` is based on the mean routed expert contribution:

`saliency_j = mean(g_j(x) * ||f_j(x)||)`

where `g_j(x)` is the router weight for expert `j` and `f_j(x)` is the expert output.

## Format and Size

- Format: `MLX`
- Quantization: `q8`
- Approximate local on-disk size during creation: `30 GB`

## Benchmark / Evaluation Status

This 15% family was generated from the same pile-10k REAP telemetry and uploaded, but no full custom benchmark run was attached to these 15% variants before publication.

## Notes

- This repository is the pruned-and-quantized derivative checkpoint, not the original Qwen release.
- The pruning decision was driven by REAP telemetry collected on `pile-10k` calibration rows.
- The benchmark and calibration roles are separate; the calibration slice was used to build the prune plan.
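
## Illustrative Sketch of the Saliency Rule

For readers who want to see the saliency formula and the prune counts from this card in concrete form, the following is a minimal, self-contained sketch. It is not taken from `reap-mlx`: the function names `reap_saliency` and `experts_to_prune`, the toy gate and output values, and the choice to average only over tokens actually routed to an expert are all assumptions made for illustration. The layer count of 40 is inferred from `10240 / 256` and is not stated elsewhere in this card.

```python
import math


def reap_saliency(gates, outputs):
    """Per-expert mean of g_j(x) * ||f_j(x)|| over routed tokens.

    gates[t][j]   : router weight of expert j for token t (0.0 if not routed)
    outputs[t][j] : output vector of expert j for token t
    Averaging only over routed tokens is an assumption of this sketch.
    """
    num_experts = len(gates[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gate_row, out_row in zip(gates, outputs):
        for j in range(num_experts):
            if gate_row[j] > 0.0:
                norm = math.sqrt(sum(v * v for v in out_row[j]))
                totals[j] += gate_row[j] * norm
                counts[j] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_prune(saliency, n_prune):
    """Return the sorted indices of the n_prune lowest-saliency experts."""
    order = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(order[:n_prune])


# Toy layer: 3 tokens, 4 experts, 2-dimensional expert outputs (made-up data).
toy_gates = [
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]
toy_outputs = [
    [[3.0, 4.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0], [6.0, 8.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.2]],
]
toy_saliency = reap_saliency(toy_gates, toy_outputs)
toy_pruned = experts_to_prune(toy_saliency, n_prune=2)

# Arithmetic behind the figures reported in this card.
EXPERTS_PER_LAYER = 256
PRUNED_PER_LAYER = 38
TOTAL_EXPERTS = 10240
num_layers = TOTAL_EXPERTS // EXPERTS_PER_LAYER   # inferred, not stated above
prune_ratio = PRUNED_PER_LAYER / EXPERTS_PER_LAYER  # 38 / 256 = 0.1484375
total_removed = PRUNED_PER_LAYER * num_layers

print(toy_saliency, toy_pruned)
print(num_layers, prune_ratio, total_removed)
```

In the toy layer the two experts with the smallest mean gated output norm are selected for removal. The real pipeline applies the same kind of ranking per layer to telemetry collected on the calibration slice, then removes 38 of 256 experts in each layer.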