How to use from
Lemonade
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull osmapi/osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1s-GGUF:F16
Run and chat with the model
lemonade run user.osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1s-GGUF-F16
List all available models
lemonade list
Quick Links

osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1s-GGUF

MULTIMODAL. Bundled mmproj.gguf (~928 MB, F16) preserves the full Qwen3.6-VL vision tower. Use it with llama-server --mmproj or llama-mtmd-cli for text + image inference.

⚠️ Custom fork required. Native TQ3_1S inference needs the turbo-tan/llama.cpp-tq3 fork — stock llama.cpp will fail to load. Build it with cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j.

ℹ️ TQ3_1S CLI exposure. The runtime supports TQ3_1S (LLAMA_FTYPE_MOSTLY_TQ3_1S = 43), but the official fork's tools/quantize/quantize.cpp did not list it in the QUANT_OPTIONS table. To produce this file we added one row: { "TQ3_1S", LLAMA_FTYPE_MOSTLY_TQ3_1S, " 4.00 bpw TurboQuant two-scale" }. The quantizer kernels (quantize_tq3_1s) and runtime ops are unchanged.

TQ3_1S (TurboQuant two-scale, 4.00 BPW Walsh-Hadamard) of a ZeroFuse-abliterated Qwopus 3.6 27B v2 (the Jackrong Claude-Opus reasoning distill of Qwen 3.6 27B). Refusals reduced from 91/100 → 4/100 with KL drift of just 0.0176. By the osmAPI research team and TERV.Pro student research team.


⚡ TL;DR

Property Value
Disk size ~14 GB (13 GB LM + 928 MB mmproj)
BPW 4.00 (TQ3_1S, effective ~3.5 bpw via Walsh-Hadamard transform)
Scheme TurboQuant TQ3_1S — Walsh-Hadamard-transform weight format with two half-block scales per 32-weight block (lower-overhead variant of TQ3_4S, slightly higher perplexity). By turbo-tan.
Refusal rate (ZeroFuse, n=100) 4/100 (vs vanilla Qwopus 91/100)
KL divergence vs vanilla (at BF16) 0.0176
Vision ✅ via paired mmproj.gguf
Recommended RAM/VRAM 16 GB+ Apple Silicon / 12 GB GPU (lower with -ctk q4_0 -ctv tq3_0)
Runtime REQUIRES turbo-tan/llama.cpp-tq3 fork. Stock llama.cpp will NOT load this file. The official llama-quantize in that fork did not expose TQ3_1S in its CLI table; we patched it in (one-line addition).
Released by osmAPI · TERV.Pro

🎚️ All osmQwopus variants

The full osmQwopus family from osmAPI — same ZeroFuse-abliterated weights (refusal 4/100, KL 0.0176), different quant schemes for different runtimes.

Quant Format BPW Disk Vision Runtime Link
8-bit MLX 8.50 ~27 GB ✅ native mlx-vlm …-8-bit-mlx
6-bit MLX 6.66 ~21 GB ✅ native mlx-vlm …-6-bit-mlx
OptiQ 3.7bpw MLX ~3.7 ~14 GB ✅ ViT spliced mlx-vlm …-OptiQ-3.7bpw-mlx
Q8_0 GGUF 8.50 ~28 GB ✅ via mmproj llama.cpp …-8-bit-GGUF
Q6_K GGUF ~6.56 ~22 GB ✅ via mmproj llama.cpp …-6-bit-GGUF
Q4_K_M GGUF ~4.92 ~16 GB ✅ via mmproj llama.cpp …-Q4_K_M-GGUF
TQ3_4S GGUF 4.00 (~3.5 eff) ~14 GB ✅ via mmproj llama.cpp-tq3 …-TQ3_4s-GGUF
TQ3_1S (this repo) GGUF 4.00 (~3.5 eff) ~14 GB ✅ via mmproj llama.cpp-tq3 (you are here)

👉 All variants share the same abliterated base weights — pick by your runtime (Apple Silicon → MLX; CUDA/CPU/cross-platform → GGUF) and your RAM budget.


🧬 Lineage

Qwen/Qwen3.6-27B                              (Qwen Team — base multimodal pretrain)
        │
        ▼
Jackrong/Qwopus3.6-27B-v2                     (Jackrong — Claude-Opus reasoning distill)
        │
        ▼
ZeroFuse abliteration (TPE-50)          (osmAPI · TERV.Pro)
   ├── 25 random startup trials
   ├── 2 community priors (coder3101, wangzhang)
   └── 23 TPE smart-sampling trials → best at trial 45
        │
        ▼
HF safetensors → F16 GGUF via llama.cpp-tq3   (osmAPI · TERV.Pro)
        │
        ▼
this repo — osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1S GGUF + paired mmproj.gguf

Direct upstream links:


📊 Abliteration Results

Stage Refusals (n=100) ↓ KL divergence ↓
Vanilla Jackrong/Qwopus3.6-27B-v2 91 / 100 — (reference)
Community prior: coder3101 (T27) 4 / 100 0.0359
Community prior: wangzhang (T28) 30 / 100 0.0259
TPE best (T45) — shipped here 4 / 100 0.0176
TPE second-best (T37) 5 / 100 0.0210

96% reduction in refusals with capability preserved (KL ≈ 0.018, well below the 0.3 healing threshold). No SFT / LoRA healing was required.


🧪 Method (TPE-50 with community priors → llama.cpp GGUF)

Step 1. Abliteration (ZeroFuse TPE-50, BF16 source)

  1. 25 random startup trials + 2 community priors enqueued (coder3101 dir=37.97, wangzhang dir=34.66) + 23 TPE smart-sampling trials.
  2. Best Pareto trial: T45 (direction_index=41.42) — 4/100 refusals at KL=0.0176.
  3. Auto-saved via ZeroFuse's LoRA-adapter merge path with vision tower fully intact.

Total ZeroFuse wall-clock: ~13 h on M4 Max 128 GB.

Step 2. HF safetensors → F16 GGUF

python convert_hf_to_gguf.py \
    /path/to/Qwopus3.6-27B-v2-abliterated \
    --outfile Qwopus3.6-27B-v2-abliterated-F16.gguf \
    --outtype f16

The turbo-tan fork's converter registers Qwen3_5ForConditionalGeneration natively and emits proper SSM tensors (ssm_a, ssm_conv1d, ssm_alpha, ssm_beta, ssm_out) alongside the gated-attention layers.

Step 3. Vision tower → mmproj.gguf

python convert_hf_to_gguf.py \
    /path/to/Qwopus3.6-27B-v2-abliterated \
    --outfile mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf \
    --outtype f16 \
    --mmproj

This emits a separate 928 MB GGUF containing the 27-block Qwen3-VL ViT (334 vision tensors at F16/F32) plus the multimodal projector.

Step 4. Quantization

# After patching tools/quantize/quantize.cpp to add TQ3_1S to QUANT_OPTIONS:
./build/bin/llama-quantize --pure \
  Qwopus3.6-27B-v2-abliterated-F16.gguf \
  osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1S.gguf \
  TQ3_1S 16

📦 Use it

llama-server (OpenAI-compatible HTTP, multimodal)

./build/bin/llama-server \
  -m osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1S.gguf \
  --mmproj mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 8192 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-cache-prompt --cache-ram 0

Then point any OpenAI-compatible client at http://127.0.0.1:8080/v1.

llama-mtmd-cli (one-shot multimodal generation)

./build/bin/llama-mtmd-cli \
  -m osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1S.gguf \
  --mmproj mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf \
  --image photo.jpg \
  -p "Describe this image briefly."

llama-cli (text-only)

./build/bin/llama-cli \
  -m osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1S.gguf \
  -ngl 99 \
  -c 8192 \
  --jinja \
  -p "Explain the difference between SSM and softmax attention in three sentences."

Ollama / LM Studio / Jan

Drop the two GGUF files into the runtime's models directory; standard ⓘ multimodal flow.


🧪 Quantization details

  • Source weights: BF16 abliterated checkpoint (12 shards, ~50 GB) — ZeroFuse T45 merged into Jackrong/Qwopus3.6-27B-v2.
  • Intermediate: F16 GGUF (53.8 GB, 851 tensors) produced by convert_hf_to_gguf.py from turbo-tan/llama.cpp-tq3.
  • Final quantization: see Step 4 above.
  • Vision projector: F16, 928 MB, shipped as mmproj-Qwopus3.6-27B-v2-abliterated-F16.gguf in this repo. Mandatory for image input; standard llama.cpp --mmproj flag.

Architecture notes

Qwen 3.6 27B uses a hybrid attention stack — 3 GatedDeltaNet (linear attention / SSM) layers followed by 1 full-softmax-attention layer, repeated 16× for 64 total layers; hidden 5120, vocab 248320, context 262144. The hybrid arch is supported in the turbo-tan/llama.cpp-tq3 fork (the upstream Qwen3_5ForConditionalGeneration registration). The SSM kernels run via llama.cpp's ssm_* tensor types.


⚠️ Behavior caveats

  • Uncensored. Refusal directions were surgically removed; this model will answer prompts the parent would refuse. Use responsibly and within applicable law. The release is provided for safety research, red-teaming, and creative/educational use cases.
  • Multimodal preserved. Pair the LM GGUF with mmproj.gguf (in this repo) to get full vision input. Without mmproj, the model still loads as text-only.
  • Identity preserved. The model still self-identifies as Qwen (developed by Alibaba's Tongyi Lab) — abliteration does not rewrite factual self-knowledge.
  • Heavy chain-of-thought. Qwopus inherits Claude-Opus's verbose reasoning style. For terse answers, use a system prompt like "Be brief and direct. Skip your reasoning.".

🙏 Credits

Quantization & releaseosmAPI research team · TERV.Pro student research team Claude-Opus reasoning distillJackrong (Jackrong/Qwopus3.6-27B-v2) Foundation modelQwen Team @ Alibaba Tongyi Lab (Qwen/Qwen3.6-27B) Abliteration toolkitZeroFuse by osmAPI Community priorscoder3101/Qwen3.5-27B-zerofuse · wangzhang/Qwen3.6-27B-abliterated TurboQuant tensor formatturbo-tan (TQ3_1S Walsh-Hadamard-transform low-bit-width quantization) Runtime / converterturbo-tan/llama.cpp-tq3 · ggml-org/llama.cpp


📜 License

Apache-2.0, inherited from the foundation (Qwen3.6-27B) and the distill (Qwopus3.6-27B-v2) upstream.


Need a hosted endpoint, custom quant, or larger-scale inference? osmAPI — multi-provider LLM routing for the Indian developer ecosystem.

Downloads last month
2,773
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for osmapi/osmQwopus-3.6-27B-V2-abliterated-uncensored-TQ3_1s-GGUF

Quantized
(55)
this model