---
quantized_by: tarruda
pipeline_tag: text-generation
base_model: Qwen/Qwen3.5-397B-A17B
base_model_relation: quantized
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
tags:
- imatrix
- conversational
- qwen3_5_moe
---

### Intro

This is a 2.54 BPW quantization of Qwen 3.5 397B, using a recipe inspired by @AesSedai and @ubergarm. My goal was to maximize BPW for my hardware (128 GB M1 Ultra) while still allowing up to 128K context.

The recipe is:

```
TYPE_FFN_GATE_UP_EXPS=IQ2_XXS
TYPE_FFN_DOWN_EXPS=IQ3_XXS
TYPE_TOKEN_EMBEDDING=Q4_K
TYPE_OUTPUT=Q6_K
TYPE_DEFAULT=Q8_0
```

### Running

This is the command I use to run it locally:

```
llama-server --no-mmap --no-warmup -fa on \
  --model IQ3_XXS/Qwen3.5-397B-A17B-IQ3_XXS-00001-of-00004.gguf \
  --mmproj mmproj-F16.gguf \
  --ctx-size 131072 --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  -cram 0
```

I use `-cram 0` (which disables the server's prompt cache in RAM) because the model plus context already takes essentially 100% of my available RAM.

### Quantizing

Assuming the original model is located at `../../Qwen/Qwen3.5-397B-A17B` and llama.cpp (with built binaries) is located at `~/llama.cpp`, the full quantization is done with:

```
./scripts/convert-to-gguf.sh ~/llama.cpp ../../Qwen/Qwen3.5-397B-A17B
./scripts/quantize.sh ~/llama.cpp IQ3_XXS
```

The quantization depends on `imatrix.gguf`, which was copied from @ubergarm's 397B repo.
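
### Size estimate

As a rough sanity check on the 2.54 BPW figure from the Intro, weight size can be approximated as parameters × bits per weight / 8. The sketch below is only an estimate under stated assumptions: it takes a round 397 billion parameters from the model name, `estimate_gb` is a hypothetical helper (not part of llama.cpp or this repo's scripts), and the result covers the weights only, not the context, the mmproj file, or runtime buffers.

```shell
# Rough weight-size estimate:
#   parameters (in billions) * bits per weight / 8 = size in decimal GB.
# estimate_gb is a hypothetical helper written for this example.
# Assumption: 397 billion parameters, as implied by the model name.
estimate_gb() {
  awk -v params_b="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", params_b * bpw / 8 }'
}

estimate_gb 397 2.54   # prints 126.0
```

About 126 GB of weights is consistent with the note in the Running section that the model plus context takes up essentially all of the RAM on a 128 GB machine.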