---
quantized_by: tarruda
pipeline_tag: text-generation
base_model: Qwen/Qwen3.5-397B-A17B
base_model_relation: quantized
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
tags:
- imatrix
- conversational
- qwen3_5_moe
---

### Intro

This is a 2.54 BPW Qwen 3.5 397B quantization using a recipe inspired by @AesSedai and @ubergarm.

My goal was to maximize BPW for my hardware (128G M1 ultra) while allowing up for 128K context.

The recipe is:

```
TYPE_FFN_GATE_UP_EXPS=IQ2_XXS
TYPE_FFN_DOWN_EXPS=IQ3_XXS
TYPE_TOKEN_EMBEDDING=Q4_K
TYPE_OUTPUT=Q6_K
TYPE_DEFAULT=Q8_0
```

### Running

This is the command I use to run it locally:

```
llama-server --no-mmap --no-warmup -fa on --model IQ3_XXS/Qwen3.5-397B-A17B-IQ3_XXS-00001-of-00004.gguf --mmproj mmproj-F16.gguf --ctx-size 131072 --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -cram 0
```

I use `-cram 0` because the model + context will take 100% of my available RAM.


### Quantizing

Assuming the original model is located at ../../Qwen/Qwen3.5-397B-A17B and
llama.cpp (with built binaries) is located at ~/llama.cpp, the full
quantization is done with:


```
./scripts/convert-to-gguf.sh ~/llama.cpp ../../Qwen/Qwen3.5-397B-A17B
./scripts/quantize.sh ~/code/llama.cpp IQ3_XXS
```

The quantization depends on imatrix.gguf, which was copied from @ubergarm's
397B repo.