---
base_model:
- Qwen/Qwen3.5-35B-A3B
---
# Qwen3.5-Creative-26B-A3B
**A creative-writing-optimized pruning of [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) using [REAP](https://github.com/CerebrasResearch/reap).**
25% of MoE experts pruned (256 → 192) using a creative writing calibration dataset. A lighter prune that preserves more reasoning capability while still significantly reducing model size.
## What is this?
| | Base Model | This Model | 50% Prune |
|---|---|---|---|
| **Total params** | ~35B | ~26B | ~18B |
| **Active params/token** | ~3B | ~3B | ~3B |
| **MoE experts** | 256 | 192 | 128 |
| **Q4_K_M GGUF** | ~21GB | ~15GB | ~10GB |
| **Target VRAM** | 24GB+ | 24GB | 16-24GB |
## How it was made
1. **Calibration dataset:** 3000 samples — 1000 each from WritingPrompts, Project Gutenberg, and Roleplay scenarios ([Timersofc/creative-writing-reap-calibration](https://huggingface.co/datasets/Timersofc/creative-writing-reap-calibration))
2. **REAP profiling:** Router-weighted expert activation norms recorded across all 40 MoE layers
3. **Pruning:** Bottom 25% of experts by REAP score removed globally
## Why this over the 50% prune?
The 50% version ([Timersofc/Qwen3.5-Creative-18B-A3B](https://huggingface.co/Timersofc/Qwen3.5-Creative-18B-A3B)) is smaller and faster but chain-of-thought reasoning is less stable. This 25% version retains more of the original model's reasoning experts, making it better for:
- **Long-form storytelling** requiring plot consistency
- **Complex character work** needing CoT planning
- **Any task** where you want reliable thinking mode
If you just need short-form creative output and want maximum compression, the 50% version is better value.
## Usage notes
- Works with standard Qwen3.5 chat templates
- For thinking mode: model should produce `...` blocks naturally
- Prefilling with `\nOkay, ` can help ensure CoT engagement
## GGUF quantizations
Available in [Timersofc/Qwen3.5-Creative-26B-A3B-GGUF](https://huggingface.co/Timersofc/Qwen3.5-Creative-26B-A3B-GGUF):
- `Q4_K_M` (imatrix) — ~15GB, recommended for 24GB VRAM
- `Q6_K` (imatrix) — ~19GB, higher quality
- `f16` — full precision GGUF for custom quantization
All quantizations use an importance matrix generated from the same creative writing calibration dataset used for REAP profiling. This means bit allocation within each tensor is optimized for creative writing — weights that matter most for prose quality get higher precision.
## Credits
- [Qwen team](https://huggingface.co/Qwen) for the base model
- [Cerebras Research](https://github.com/CerebrasResearch/reap) for the REAP method
- REAP fork with Qwen3.5 patches: [janmts/reap](https://github.com/janmts/reap/tree/qwen3.5-support)
## License
Same as the base model. This is an unofficial community variant, not affiliated with Alibaba or Cerebras.