---
license: apache-2.0
base_model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
base_model_relation: quantized
quantized_by: kasimat
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- llama.cpp
- ollama
- lmstudio
- quantized
- imatrix
- qwen3.5
- qwen3.6
- abliterated
- uncensored
language:
- en
- zh
- multilingual
---

# Qwen3.6-27B-AEON-Ultimate-Uncensored: GGUF (text-only)

GGUF quantizations of [`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored`](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored), an abliteration of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B). The boundary quants (Q8_0, Q4_K_M, Q2_K) were validated to retain the abliteration (0/100 refusals) and the base model's gsm8k capability; the intermediate quants fall between them on PPL and KLD. Quantized from the BF16 source via `llama.cpp` with imatrix calibration.

This is a **text-only** GGUF: the multimodal vision tower from the base is not included. Abliteration affects refusal behavior on text inputs only; the vision tower would otherwise be unchanged from upstream Qwen, and shipping it adds ~3 GB per quant for no abliteration-related value. If multimodal GGUF support is requested, an `mmproj` companion will be published separately.

## Inheritance from the base model

`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored` is an abliterated derivative of `Qwen/Qwen3.6-27B` (Apache-2.0). The author reports a KL divergence of **0.000492** from the base model, among the cleanest published Qwen3.6 abliterations to date. The abliteration is applied via weight projection (no fine-tuning), so the model retains its training distribution; only refusal-eliciting directions in activation space are projected out.

Validated downstream: an FP8 quant of this model scores **88.0% on gsm8k strict (1319-question full set)** versus the vanilla `Qwen/Qwen3.6-27B-FP8` baseline at 84.7%. Abliteration removed refusals without measurable capability loss.
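To make "projected out" concrete, here is a minimal sketch of generic directional ablation on a single weight matrix, computing W' = W - r rᵀ W for a unit vector r. It is an illustration only, not AEON-7's actual recipe: the function names, the toy matrix, and the `refusal_dir` vector are all hypothetical.

```python
import math


def project_out(weight, direction):
    """Return a copy of `weight` whose outputs have no component along `direction`.

    weight: list of rows, shape (d_out, d_in).
    direction: list of length d_out (a hypothetical refusal direction).
    Computes W' = W - r r^T W, where r is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(v * v for v in direction))
    r = [v / norm for v in direction]
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: one coefficient per input column.
    coeffs = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * coeffs[j] for j in range(d_in)] for i in range(d_out)]


def matvec(weight, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Toy 3x2 matrix and an arbitrary direction, both made up for this example.
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
refusal_dir = [1.0, 1.0, 0.0]

W_abl = project_out(W, refusal_dir)
y = matvec(W_abl, [0.3, -0.7])

# Whatever the input, the output of the edited matrix is orthogonal to the direction.
overlap = sum(a * b for a, b in zip(refusal_dir, y))
print(abs(overlap) < 1e-9)
```

Because the edit is a fixed linear projection of existing weights, no gradient updates are involved, which is consistent with the "no fine-tuning" description above.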
## Quant size guide

| Quant | File | Size | BPW | Audience |
|---|---|---:|---:|---|
| Q8_0 | `Qwen3.6-27B-AEON-Ultimate-Uncensored-Q8_0.gguf` | 26.6 GB | 8.50 | Reference. ≈BF16 by every metric. The "FP8-equivalent" build for users who want maximum quality. |
| Q6_K | `...-Q6_K.gguf` | 20.6 GB | 6.57 | 24 GB cards, near-lossless. |
| Q5_K_M | `...-Q5_K_M.gguf` | 17.9 GB | 5.72 | Quality/size sweet spot for 24 GB cards. |
| **Q4_K_M** | `...-Q4_K_M.gguf` | **15.4 GB** | **4.92** | **Recommended default.** Fits 16 GB VRAM. Most-downloaded quant tier for 27B-class models. |
| Q4_K_S | `...-Q4_K_S.gguf` | 14.5 GB | 4.63 | Tighter Q4. |
| IQ4_XS | `...-IQ4_XS.gguf` | 14.0 GB | 4.48 | Imatrix Q4. *Note:* atypically came in with slightly worse PPL than Q4_K_S on this model (mean KLD is marginally better); see the quality table. Pick Q4_K_S over IQ4_XS for AEON-7 specifically. |
| Q3_K_M | `...-Q3_K_M.gguf` | 12.4 GB | 3.95 | 12 GB cards (RTX 3060/4070 base). |
| IQ3_M | `...-IQ3_M.gguf` | 11.7 GB | 3.74 | Imatrix Q3. Lower PPL than Q3_K_M at lower BPW, with slightly higher mean KLD. Recommended over Q3_K_M for 12 GB cards. |
| Q2_K | `...-Q2_K.gguf` | 10.0 GB | 3.18 | 8–10 GB cards. Real perplexity hit (+7.6%) but capability and abliteration both intact in our eval. |

## Quality measurements

All numbers use the BF16 source as the reference baseline. PPL is measured on `wikitext-2 test` (100 chunks of 512 tokens). KLD is computed against BF16 logits over the same chunks. Lower is better for both.

| Quant | PPL | PPL/BF16 | Mean KLD | Median KLD | 99% KLD |
|---|---:|---:|---:|---:|---:|
| (BF16) | 7.184 | 1.0000 | n/a | n/a | n/a |
| Q8_0 | 7.185 | 1.00014 | 0.0050 | 0.0006 | 0.015 |
| Q6_K | 7.211 | 1.00387 | 0.0057 | 0.0013 | 0.033 |
| Q5_K_M | 7.194 | 1.00144 | 0.0156 | 0.0031 | 0.101 |
| Q4_K_M | 7.237 | 1.00745 | 0.0281 | 0.0068 | 0.218 |
| Q4_K_S | 7.221 | 1.00524 | 0.0317 | 0.0080 | 0.273 |
| IQ4_XS | 7.290 | 1.01486 | 0.0298 | 0.0080 | 0.262 |
| Q3_K_M | 7.431 | 1.03442 | 0.0712 | 0.0241 | 0.717 |
| IQ3_M | 7.360 | 1.02448 | 0.0796 | 0.0307 | 0.819 |
| Q2_K | 7.730 | 1.07609 | 0.1710 | 0.0690 | 1.712 |

### Behavioral evals (boundary quants)

The three boundary quants (highest, default, lowest) were tested directly:

| Quant | Refusals (mlabonne100) | gsm8k strict | gsm8k flex |
|---|---:|---:|---:|
| FP8 (vLLM, source-of-truth on full 1319-q gsm8k) | 0/100 | **88.0%** | 89.5% |
| Q8_0 (50-q gsm8k slice) | **0/100** | 88.0% | 92.0% |
| Q4_K_M (50-q gsm8k slice) | **0/100** | 84.0% | 88.0% |
| Q2_K (50-q gsm8k slice) | **0/100** | 90.0% | 92.0% |

Notes on the gsm8k 50-q slice: the standard error at p=0.85 with n=50 is ~5pp. Differences between Q8_0/Q4_K_M/Q2_K within ~10pp of each other are consistent with sampling noise, not capability ordering. The PPL/KLD table above captures the actual quality ordering. The *important* result is that **all three boundary quants retained 0/100 refusals**, confirming the abliteration survives even Q2_K's aggressive ~3.18 BPW.

The intermediate quants (Q6_K, Q5_K_M, Q4_K_S, IQ4_XS, Q3_K_M, IQ3_M) were not directly tested for refusal/capability. Their PPL and KLD are strictly bracketed by the tested boundary quants (Q8_0 and Q2_K), so we infer they fall within the same behavioral envelope.
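
The ~5pp figure quoted for the 50-question slice follows from the binomial standard error, sqrt(p(1-p)/n). The sketch below reproduces it and also estimates the standard error of a difference between two slices. The second number treats the slices as independent samples, which is an assumption made here for illustration; the helper name is hypothetical.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


# Values taken from the note above: p = 0.85, n = 50.
se = binomial_standard_error(0.85, 50)

# Assumption: two slices scored independently, so their variances add.
se_diff = math.sqrt(2.0) * se

print(f"SE per slice: {100 * se:.1f} pp")
print(f"SE of a difference between two slices: {100 * se_diff:.1f} pp")
```

Under that assumption, the 6pp spread between the Q4_K_M (84.0%) and Q2_K (90.0%) strict scores is smaller than one standard error of the difference, which is why the slice results should not be read as a quality ranking.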

### Speed (NVIDIA RTX A6000, full GPU offload, llama-bench)

| Quant | pp512 (tok/s) | tg128 (tok/s) |
|---|---:|---:|
| Q8_0 | 1379 | 23.1 |
| Q6_K | 1169 | 27.8 |
| Q5_K_M | 1239 | 31.5 |
| Q4_K_M | 1207 | 35.4 |
| Q4_K_S | 1288 | 37.4 |
| IQ4_XS | 1368 | 38.7 |
| Q3_K_M | 1184 | 33.1 |
| IQ3_M | 1254 | 40.1 |
| Q2_K | 1036 | 40.8 |

Generation speed generally rises as quant size shrinks (memory-bandwidth-bound), though not strictly monotonically: Q3_K_M is slower than all three Q4 quants. Q8_0 → Q2_K is +77% throughput (23.1 → 40.8 tok/s). Prompt processing is roughly flat across quants (compute-bound, not memory-bound).

These numbers are A6000-specific. Consumer cards (e.g. RTX 4080 with 16 GB, RTX 4090 with 24 GB) will have different absolute throughput but similar relative ordering.

## Inference

### llama.cpp

```bash
# CLI:
llama-cli \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  -p "Hello, world!"

# Server (OpenAI-compat API):
llama-server \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  --alias aeon
```

### Ollama

A `Modelfile.example` is included in the repo. Minimal usage:

```bash
hf download kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF \
  --include "*Q4_K_M.gguf" "Modelfile.example" \
  --local-dir ./aeon-7
cd aeon-7
ollama create aeon -f Modelfile.example
ollama run aeon "Hello, world!"
```

### LM Studio

Search for `kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF` in the LM Studio model browser and pick a quant. The chat template is embedded in each GGUF.

### Disabling thinking (Qwen3.x default-on)

Qwen3.x defaults to a `<think>...</think>` reasoning preamble.
For most inference, and especially for benchmarking, disable it by passing `enable_thinking: false` via the chat template:

```python
# Python OpenAI client against llama-server with --jinja.
from openai import OpenAI

# Assumes the llama-server command shown earlier (port 8000). The api_key
# value is a placeholder; adjust it if your server is started with a key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

client.chat.completions.create(
    model="aeon",
    messages=[...],  # your chat messages
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```

This is required to reproduce our eval numbers; with thinking on, the reasoning preamble eats the response budget on long prompts.

## Quantization method

- **Source:** BF16 weights from `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored` (~52 GB). NOT requantized from any FP8/INT8 intermediate; quants are computed directly from the BF16 source for maximum precision.
- **Tool:** `llama.cpp` HEAD (commit `fc2b005`, April 2026). Built with CUDA 12.8.
- **Imatrix calibration:** Bartowski's `calibration_datav3.txt` (Dampf-on-top-of-Kalomaze v3, ~280 KB mixed English/code/multilingual). Computed against the BF16 source with `--n-gpu-layers 55` partial offload (the BF16 27B model doesn't fully fit on a single 48 GB card). The run was configured for 200 chunks; all 129 chunks of the calibration corpus were consumed. Final BF16 PPL on the calibration corpus = 6.93.
- **Quantization recipe:** standard `llama-quantize` with `--imatrix` for all quants except Q8_0 (where imatrix gives essentially zero benefit).
- **Architecture:** Qwen3.5 hybrid attention + Gated DeltaNet SSM. llama.cpp registers this as `MODEL_ARCH.QWEN35`. The text-only language model is produced via `convert_hf_to_gguf.py`'s `Qwen3_5TextModel` handler.

### Reproduction gotcha: BPE pre-tokenizer

If you re-run `convert_hf_to_gguf.py` from a fresh `llama.cpp` clone, you will hit:

```
NotImplementedError: BPE pre-tokenizer was not recognized
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f
```

AEON-7's tokenizer hash isn't registered upstream (the abliteration retraining shifted the vocab layout from stock Qwen3.5).
The fix is to add this block to `get_vocab_base_pre()` in `convert_hf_to_gguf.py`, just after the existing `qwen35` entry:

```python
if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
    # ref: https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
    res = "qwen35"
```

The pre-tokenizer behavior is structurally identical to stock Qwen3.5 (`Sequence: Split-with-canonical-regex + ByteLevel`); only the vocab differs.

### Files

- `Qwen3.6-27B-AEON-Ultimate-Uncensored-{Q8_0,Q6_K,Q5_K_M,Q4_K_M,Q4_K_S,IQ4_XS,Q3_K_M,IQ3_M,Q2_K}.gguf`
- `Qwen3.6-27B-AEON-Ultimate-Uncensored.imatrix`: the importance matrix used to produce the imatrix-aware quants. Shipped for reproducibility.
- `chat_template.jinja`: the Qwen3.x chat template embedded in each GGUF; also provided standalone for clients that don't read it from the GGUF.
- `Modelfile.example`: Ollama Modelfile template pointing at the Q4_K_M.

## Intended use & safety

This is an **abliterated** ("uncensored") model: the refusal behavior instilled by safety tuning has been suppressed via weight-space projection. It will produce content the upstream Qwen3.6-27B would refuse, including content that may be harmful, illegal, or distressing.

Use cases include:

- Research on alignment, refusal mechanisms, and steering
- Creative writing with adult / dark themes
- Red-teaming scenarios
- Tool use where overly cautious refusals are themselves a safety hazard (e.g. a medical-information assistant)

This model is **not** suitable for direct deployment to consumer products without an additional safety layer between user input and model output. The abliteration is intentional and load-bearing; do not try to "fix" it with system prompts.

The base model's documentation in [`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored`](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored) covers further safety considerations.

## License

Apache-2.0, inherited from `Qwen/Qwen3.6-27B` → `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored` → this repo.

## Acknowledgements

- [Qwen team](https://huggingface.co/Qwen) for the base Qwen3.6-27B and the hybrid attention + SSM architecture.
- [AEON-7](https://huggingface.co/AEON-7) for the abliteration.
- [bartowski](https://huggingface.co/bartowski), Dampf, kalomaze for the `calibration_datav3.txt` corpus that the imatrix is built on.
- The [llama.cpp](https://github.com/ggml-org/llama.cpp) project.