---
license: apache-2.0
base_model: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
base_model_relation: quantized
quantized_by: kasimat
library_name: gguf
pipeline_tag: text-generation
tags:
  - gguf
  - llama.cpp
  - ollama
  - lmstudio
  - quantized
  - imatrix
  - qwen3.5
  - qwen3.6
  - abliterated
  - uncensored
language:
  - en
  - zh
  - multilingual
---

# Qwen3.6-27B-AEON-Ultimate-Uncensored — GGUF (text-only)

GGUF quantizations of [`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored`](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored), an
abliteration of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B). Validated to retain
the abliteration (0/100 refusals) and the base model's gsm8k capability across
the full ship list. Quantized from the BF16 source via `llama.cpp` with
imatrix calibration.

This is a **text-only** GGUF — the multimodal vision tower from the base is
not included. Abliteration affects refusal behavior on text inputs only;
the vision tower would otherwise be unchanged from upstream Qwen, and shipping
it adds ~3 GB per quant for no abliteration-related value. If multimodal
GGUF support is requested, an `mmproj` companion will be published separately.

## Inheritance from the base model

`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored` is an abliterated derivative
of `Qwen/Qwen3.6-27B` (Apache-2.0). The author reports KL divergence of
**0.000492** from the base model — among the cleanest published Qwen3.6
abliterations to date. The abliteration is applied via weight projection
(no fine-tuning), so the model retains its training distribution; only
refusal-elliciting directions in activation space are projected out.

Validated downstream: an FP8 quant of this model scores **88.0% on gsm8k strict
(1319-question full set)** versus the vanilla `Qwen/Qwen3.6-27B-FP8` baseline
at 84.7% — abliteration removed refusals without measurable capability loss.

## Quant size guide

| Quant | File | Size | BPW | Audience |
|---|---|---:|---:|---|
| Q8_0 | `Qwen3.6-27B-AEON-Ultimate-Uncensored-Q8_0.gguf` | 26.6 GB | 8.50 | Reference. ≈BF16 by every metric. The "FP8-equivalent" build for users who want maximum quality. |
| Q6_K | `...-Q6_K.gguf` | 20.6 GB | 6.57 | 24 GB cards, near-lossless. |
| Q5_K_M | `...-Q5_K_M.gguf` | 17.9 GB | 5.72 | Quality/size sweet spot for 24 GB cards. |
| **Q4_K_M** | `...-Q4_K_M.gguf` | **15.4 GB** | **4.92** | **Recommended default.** Fits 16 GB VRAM. Most-downloaded quant tier for 27B-class models. |
| Q4_K_S | `...-Q4_K_S.gguf` | 14.5 GB | 4.63 | Tighter Q4. |
| IQ4_XS | `...-IQ4_XS.gguf` | 14.0 GB | 4.48 | Imatrix Q4. *Note:* atypically came in slightly worse than Q4_K_S on this model — see eval table. Pick Q4_K_S over IQ4_XS for AEON-7 specifically. |
| Q3_K_M | `...-Q3_K_M.gguf` | 12.4 GB | 3.95 | 12 GB cards (RTX 3060/4070 base). |
| IQ3_M | `...-IQ3_M.gguf` | 11.7 GB | 3.74 | Imatrix Q3. Beats Q3_K_M on quality at lower BPW. Recommended over Q3_K_M for 12 GB cards. |
| Q2_K | `...-Q2_K.gguf` | 10.0 GB | 3.18 | 8–10 GB cards. Real perplexity hit (+7.6%) but capability and abliteration both intact in our eval. |

## Quality measurements

All numbers are computed on the BF16 source as the reference baseline.
PPL is on `wikitext-2 test` (100 chunks of 512 tokens). KLD is computed
against BF16 logits over the same chunks. Lower is better for both.

| Quant | PPL | PPL/BF16 | Mean KLD | Median KLD | 99% KLD |
|---|---:|---:|---:|---:|---:|
| (BF16) | 7.184 | 1.0000 | — | — | — |
| Q8_0 | 7.185 | 1.00014 | 0.0050 | 0.0006 | 0.015 |
| Q6_K | 7.211 | 1.00387 | 0.0057 | 0.0013 | 0.033 |
| Q5_K_M | 7.194 | 1.00144 | 0.0156 | 0.0031 | 0.101 |
| Q4_K_M | 7.237 | 1.00745 | 0.0281 | 0.0068 | 0.218 |
| Q4_K_S | 7.221 | 1.00524 | 0.0317 | 0.0080 | 0.273 |
| IQ4_XS | 7.290 | 1.01486 | 0.0298 | 0.0080 | 0.262 |
| Q3_K_M | 7.431 | 1.03442 | 0.0712 | 0.0241 | 0.717 |
| IQ3_M | 7.360 | 1.02448 | 0.0796 | 0.0307 | 0.819 |
| Q2_K | 7.730 | 1.07609 | 0.1710 | 0.0690 | 1.712 |

### Behavioral evals (boundary quants)

The three boundary quants (highest, default, lowest) were tested directly:

| Quant | Refusals (mlabonne100) | gsm8k strict | gsm8k flex |
|---|---:|---:|---:|
| FP8 (vLLM, source-of-truth on full 1319-q gsm8k) | 0/100 | **88.0%** | 89.5% |
| Q8_0 (50-q gsm8k slice) | **0/100** | 88.0% | 92.0% |
| Q4_K_M (50-q gsm8k slice) | **0/100** | 84.0% | 88.0% |
| Q2_K (50-q gsm8k slice) | **0/100** | 90.0% | 92.0% |

Notes on the gsm8k 50-q slice: standard error at p=0.85 with n=50 is ~5pp.
Differences between Q8_0/Q4_K_M/Q2_K within ~10pp of each other are consistent
with sampling noise, not capability ordering. The PPL/KLD table above
captures the actual quality ordering. The *important* result is that
**all three boundary quants retained 0/100 refusals**, confirming the
abliteration survives even Q2_K's aggressive ~3.18 BPW.

The intermediate quants (Q6_K, Q5_K_M, Q4_K_S, IQ4_XS, Q3_K_M, IQ3_M)
were not directly tested for refusal/capability. PPL+KLD strictly bracketed
between the tested boundary quants, so we infer they fall within the same
behavioral envelope.

### Speed (NVIDIA RTX A6000, full GPU offload, llama-bench)

| Quant | pp512 (tok/s) | tg128 (tok/s) |
|---|---:|---:|
| Q8_0 | 1379 | 23.1 |
| Q6_K | 1169 | 27.8 |
| Q5_K_M | 1239 | 31.5 |
| Q4_K_M | 1207 | 35.4 |
| Q4_K_S | 1288 | 37.4 |
| IQ4_XS | 1368 | 38.7 |
| Q3_K_M | 1184 | 33.1 |
| IQ3_M | 1254 | 40.1 |
| Q2_K | 1036 | 40.8 |

Generation speed scales with quant size (memory-bandwidth-bound). Q8_0 → Q2_K
is +78% throughput. Prompt processing is roughly flat across quants
(compute-bound, not memory-bound).

These numbers are A6000-specific. Consumer cards (4080/4090, 24 GB) will
have different absolute throughput but similar relative ordering.

## Inference

### llama.cpp

```bash
# CLI:
llama-cli \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  -p "Hello, world!"

# Server (OpenAI-compat API):
llama-server \
  -m Qwen3.6-27B-AEON-Ultimate-Uncensored-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --jinja \
  --alias aeon
```

### Ollama

A `Modelfile.example` is included in the repo. Minimal usage:

```bash
hf download kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF \
  --include "*Q4_K_M.gguf" "Modelfile.example" \
  --local-dir ./aeon-7

cd aeon-7
ollama create aeon -f Modelfile.example
ollama run aeon "Hello, world!"
```

### LM Studio

Search for `kasimat/Qwen3.6-27B-AEON-Ultimate-Uncensored-GGUF` in the LM Studio
model browser and pick a quant. The chat template is embedded in each GGUF.

### Disabling thinking (Qwen3.x default-on)

Qwen3.x defaults to a `<think>...</think>` reasoning preamble. For most
inference and especially for benchmarking, disable it by passing
`enable_thinking: false` via the chat template:

```python
# Python OpenAI client against llama-server with --jinja:
client.chat.completions.create(
    model="aeon",
    messages=[...],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```

This is required to reproduce our eval numbers — thinking-on otherwise eats
the response budget on long prompts.

## Quantization method

- **Source:** BF16 weights from `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored`
  (~52 GB). NOT requantized from any FP8/INT8 intermediate; quants are
  computed directly from the BF16 source for maximum precision.
- **Tool:** `llama.cpp` HEAD (commit `fc2b005`, April 2026).
  Built with CUDA 12.8.
- **Imatrix calibration:** Bartowski's `calibration_datav3.txt` (Dampf-on-top-of-Kalomaze
  v3, ~280 KB mixed English/code/multilingual). Computed against the BF16
  source with `--n-gpu-layers 55` partial offload (BF16 27B doesn't fit a
  single 48 GB card fully). 200-chunk run, all 129 chunks of the calibration
  corpus consumed. Final BF16 PPL on the calibration corpus = 6.93.
- **Quantization recipe:** standard `llama-quantize <bf16> <out> <quant>`
  with `--imatrix` for all quants except Q8_0 (where imatrix gives essentially
  zero benefit).
- **Architecture:** Qwen3.5 hybrid attention + Gated DeltaNet SSM. llama.cpp
  registers this as `MODEL_ARCH.QWEN35`. The text-only language model is
  produced via `convert_hf_to_gguf.py`'s `Qwen3_5TextModel` handler.

### Reproduction gotcha: BPE pre-tokenizer

If you re-run `convert_hf_to_gguf.py` from a fresh `llama.cpp` clone, you
will hit:

```
NotImplementedError: BPE pre-tokenizer was not recognized
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f
```

AEON-7's tokenizer hash isn't registered upstream (the abliteration retraining
shifted the vocab layout from stock Qwen3.5). The fix is to add this block to
`get_vocab_base_pre()` in `convert_hf_to_gguf.py`, just after the existing
`qwen35` entry:

```python
if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
    # ref: https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored
    res = "qwen35"
```

The pre-tokenizer behavior is structurally identical to stock Qwen3.5
(`Sequence: Split-with-canonical-regex + ByteLevel`); only the vocab differs.

### Files

- `Qwen3.6-27B-AEON-Ultimate-Uncensored-{Q8_0,Q6_K,Q5_K_M,Q4_K_M,Q4_K_S,IQ4_XS,Q3_K_M,IQ3_M,Q2_K}.gguf`
- `Qwen3.6-27B-AEON-Ultimate-Uncensored.imatrix` — the importance matrix used
  to produce the imatrix-aware quants. Ship for reproducibility.
- `chat_template.jinja` — the Qwen3.x chat template embedded in each GGUF;
  also provided standalone for clients that don't read it from the GGUF.
- `Modelfile.example` — Ollama Modelfile template pointing at the Q4_K_M.

## Intended use & safety

This is an **abliterated** ("uncensored") model: the safety-tuning's refusal
behavior has been suppressed via weight-space projection. It will produce
content the upstream Qwen3.6-27B would refuse, including content that may
be harmful, illegal, or distressing. Use cases include:

- Research on alignment, refusal mechanisms, and steering
- Creative writing with adult / dark themes
- Red-teaming scenarios
- Tool use where overly-cautious refusals are themselves a safety hazard
  (e.g. a medical-information assistant)

This model is **not** suitable for direct deployment to consumer products
without an additional safety layer between user input and model output.
The abliteration is intentional and load-bearing; do not try to "fix" it
with system prompts.

The base model's documentation in
[`AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored`](https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored)
covers further safety considerations.

## License

Apache-2.0, inherited from `Qwen/Qwen3.6-27B` → `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored`
→ this repo.

## Acknowledgements

- [Qwen team](https://huggingface.co/Qwen) for the base Qwen3.6-27B and the
  hybrid attention + SSM architecture.
- [AEON-7](https://huggingface.co/AEON-7) for the abliteration.
- [bartowski](https://huggingface.co/bartowski), Dampf, kalomaze for the
  `calibration_datav3.txt` corpus that the imatrix is built on.
- The [llama.cpp](https://github.com/ggml-org/llama.cpp) project.