---
language:
- en
- fr
license: apache-2.0
tags:
- qwen3.5
- moe
- distillation
- opus-4.6
- tool-calling
- agentic
- gguf
- ramp
- imatrix
- chimere-server
- mamba2
- nemotron-h
- hybrid-ssm
- multi-arch
base_model: Qwen/Qwen3.5-35B-A3B
model_type: qwen3_5_moe
quantized_by: Kevletesteur
pipeline_tag: text-generation
---

# Qwen3.5-35B-A3B Chimere v3 -- RAMP GGUF

**Chimere v3: Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for instruction following and reasoning.**

RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB VRAM, ~80 tok/s on RTX 5060 Ti.

> Looking for **v1** (best code + tools)? See [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF).

## Compatible runtimes

This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (`qwen35moe`) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is **chimere-server**.

| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
|---|---|---|---|---|---|
| [chimere-server](https://github.com/AIdevsmartdata/chimere) (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)). |
| [`ik_llama.cpp`](https://github.com/ikawrakow/ik_llama.cpp) `llama-server` | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
| [`llama.cpp`](https://github.com/ggml-org/llama.cpp) stock `llama-server` | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no `iqk` matmul, no fused MoE up/gate). |

## Benchmark Results

### v3 strengths: instructions and reasoning

| Benchmark | v3 RAMP (this repo) | v1 RAMP | Base Qwen3.5-35B-A3B | Notes |
|-----------|---------------------|---------|---------------------|-------|
| **IFEval** (15 instruction tests) | **100%** | 67% | ~91.9% | +33 pts vs v1 |
| **Edge cases** (15 adversarial tests) | **100%** | 87% | -- | Resisted every prompt injection in this 15-test set |
| **GSM8K CoT 8-shot** (1,319 qs) | **84.0%** | 52.2% | -- | +32 pts vs v1 |
| **HumanEval** (30 problems, executed) | 83% | 97% | -- | v1 better here |
| **BFCL tool-calling** (20 questions) | 75% | 90% | 67.3% | v1 better here |
| **Speed** (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |

### Qualitative agentic tests

| Scenario | v3 | v1 | Max score |
|----------|----|----|-----------|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 8 | 7 | 10 |
| **Total** | **20** | **19** | **30** |

### Honest assessment

- **Strengths**: 100% IFEval, 100% adversarial edge cases, 84% GSM8K; the stronger of the two versions on reasoning
- **Weaknesses**: Code generation is weaker than v1 (83% vs 97% HumanEval), and so is tool-calling (75% vs 90% BFCL)
- **Why**: The v3 dataset adds IFEval-strict, OPSDC-compressed reasoning, and instruction-following samples on top of the v1 base. v3 is recommended for general agentic use; prefer v1 when code or tool-calling accuracy matters most.

## Which version to use?

| Use case | Recommended | Why |
|----------|------------|-----|
| **Instruction following, formatting** | **v3 (this repo)** | 100% IFEval, 100% edge cases |
| **Math reasoning** | **v3 (this repo)** | 84% GSM8K (vs 52% v1) |
| **Prompt injection resistance** | **v3 (this repo)** | 100% adversarial edge cases |
| **Code generation, debugging** | [v1](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) | 97% HumanEval |
| **Tool-calling, function calling** | [v1](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) | 90% BFCL |
| **Re-quantization or fine-tuning** | [BF16 weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) | Full precision |

**Best of both worlds**: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo).
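
The routing idea can be sketched in a few lines. This is only an illustration under stated assumptions: the classifier used by Chimere ODO is not described in this card, so the keyword list, the `pick_adapter` name, and the `"v1"` / `"v3"` labels below are all hypothetical stand-ins for a real intent classifier.

```python
import re

# Hypothetical keyword list; a real intent classifier would be learned,
# not hand-written. Chimere ODO's actual classifier is not shown here.
CODE_TOOL_HINTS = re.compile(
    r"\b(code|function|bug|debug|traceback|compile|refactor|tool|api call)\b",
    re.IGNORECASE,
)


def pick_adapter(prompt: str) -> str:
    """Return "v1" for code/tool prompts and "v3" for everything else."""
    return "v1" if CODE_TOOL_HINTS.search(prompt) else "v3"


if __name__ == "__main__":
    for text in ("Debug this Python function", "Answer in exactly three bullets"):
        print(text, "->", pick_adapter(text))
```

A keyword match is the simplest possible router; it is here only to make the "code/tools to v1, instructions/reasoning to v3" split concrete.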

## Quick start (chimere-server, recommended)

```bash
# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
cd ~/ik_llama.cpp
git checkout mamba2-nemotron-h-backport
cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
cmake --build build_sm120 -j

# 2. Server
# Clone into $HOME so the path matches the binary location used in step 4
git clone https://github.com/AIdevsmartdata/chimere.git ~/chimere
cd ~/chimere/chimere-server
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
  cargo build --release --features server --bin chimere-server

# 3. Model + tokenizer
mkdir -p ~/models && cd ~/models
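# --local-dir places the file in ~/models instead of the HF cache, so $PWD/chimere-v3-ramp.gguf resolves in step 4
hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf --local-dir .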
hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35

# 4. Run (production env vars)
CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
CHIMERE_LLAMA_BACKEND=1 \
CHIMERE_NCMOE=3 \
CHIMERE_KV_MAX_SEQ=65536 \
CHIMERE_PORT=8081 \
CHIMERE_FORCE_QWEN35=1 \
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
~/chimere/chimere-server/target/release/chimere-server

# 5. Hello world
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```

### Engram (optional, prod-only)

Chimere ships an n-gram logit bias overlay loaded from binary `.engr` tables. To enable it, set:

```sh
CHIMERE_ENGRAM_DIR=/path/to/engram_tables   # directory of *.engr files
CHIMERE_ENGRAM_ALPHA=0.1                     # logit bias strength
```

The engram tables are tokenizer-specific (Qwen3.5 vocab) and used as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster — see the [chimere repo README](https://github.com/AIdevsmartdata/chimere#performance) for the honest status of the path.
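
To make "n-gram logit bias" concrete, here is a minimal sketch of an additive bias step. It assumes the overlay adds `alpha * score` to the logits of tokens predicted by the last few context tokens; the exact formula and the binary `.engr` layout used by chimere-server are not documented in this card, so the `EngramTable` type and `apply_engram_bias` function are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for one .engr table:
# maps the last `order` context token ids to per-token scores.
EngramTable = Dict[Tuple[int, ...], Dict[int, float]]


def apply_engram_bias(
    logits: List[float],
    context: List[int],
    table: EngramTable,
    alpha: float = 0.1,  # plays the role of CHIMERE_ENGRAM_ALPHA
    order: int = 2,      # n-gram context length, assumed for this sketch
) -> List[float]:
    """Return a copy of `logits` with alpha * score added where the table fires."""
    biased = list(logits)
    key = tuple(context[-order:])
    for token_id, score in table.get(key, {}).items():
        biased[token_id] += alpha * score
    return biased


if __name__ == "__main__":
    table = {(5, 7): {2: 4.0}}
    print(apply_engram_bias([0.0, 0.0, 0.0, 0.0], [1, 5, 7], table))
```

With `alpha` at 0.1 the overlay nudges the distribution rather than overriding it, which fits the "domain-knowledge injector" framing above.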

## Quick start (generic GGUF runtimes)

If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:

```bash
# llama.cpp / llama-server
llama-server \
    -m chimere-v3-ramp.gguf \
    -ngl 99 --n-cpu-moe 4 -c 32768 \
    --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080):
# Add KV cache quantization to save VRAM:
# -ctk q8_0 -ctv q4_0
```

### Recommended sampling parameters

| Mode | temp | top_p | top_k | presence_penalty |
|------|------|-------|-------|------------------|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |
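
The presets above can be attached to a chat-completions request as follows. This is a sketch, not a client library: the mode keys are labels chosen here, the port matches the quick-start commands, and it assumes the server accepts `top_k` as an extra field (it is not part of the standard OpenAI schema).

```python
import json
import urllib.request

# Values copied from the table above; the mode keys are labels for this sketch.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_code": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "no_think": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0},
}


def build_request(prompt: str, mode: str = "thinking", max_tokens: int = 512) -> dict:
    """Build a chat-completions body with the sampling preset for `mode`."""
    body = {"messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens}
    body.update(SAMPLING_PRESETS[mode])
    return body


def send(body: dict, url: str = "http://localhost:8081/v1/chat/completions") -> dict:
    """POST the body as JSON and return the decoded response (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(json.dumps(build_request("Write a binary search in Rust", "thinking_code"), indent=2))
```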

## Backend

The official `chimere-server` runtime links against a customized [`ik_llama.cpp`](https://github.com/AIdevsmartdata/ik_llama.cpp) fork (branch `mamba2-nemotron-h-backport`, head of upstream PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)).

Highlights of the chimere-specific layer on top of ik_llama:

- **Custom C++ fast sampler** exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs` — avoids a ~993 KB logits copy per token, packs OpenAI-format top-5 logprobs.
- **K-cache Hadamard rotation**, fused MoE up/gate, grouped expert routing — all enabled by default via `cparams`.
- **Multi-agent KV / SSM state save & restore** via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas with their own conversation state.
- An **OpenAI-compatible HTTP layer in Rust** (axum 0.8), supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction and `chat_template_kwargs.enable_thinking`.
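
From the client side, the multi-agent and thinking switches come down to two request fields. The sketch below builds such requests; the `user` and `chat_template_kwargs.enable_thinking` fields come from the list above, while the `AgentPool` class is a hypothetical client-side guard, since this card does not say what the server does once more than `CHIMERE_MAX_AGENTS` personas are active.

```python
class AgentPool:
    """Client-side helper that keeps distinct agent ids under a fixed limit."""

    def __init__(self, max_agents: int = 4):  # mirrors the CHIMERE_MAX_AGENTS default
        self.max_agents = max_agents
        self.agents = set()

    def request(self, agent_id: str, prompt: str,
                enable_thinking: bool = True, max_tokens: int = 256) -> dict:
        """Build a chat-completions body whose `user` field selects the persona."""
        if agent_id not in self.agents and len(self.agents) >= self.max_agents:
            raise RuntimeError(f"already tracking {self.max_agents} agents")
        self.agents.add(agent_id)
        return {
            "user": agent_id,  # key for the per-agent KV / SSM state
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
        }


if __name__ == "__main__":
    pool = AgentPool()
    print(pool.request("planner", "Outline the migration", enable_thinking=False))
```

Each body can be POSTed to `/v1/chat/completions` exactly like the hello-world request in the quick start.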

## Multi-architecture support

The same `chimere-server` runtime is **no longer Qwen-only**. As of [Step 7](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/docs/STEP7_MULTI_ARCH.md) (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:

- **Qwen3.5-35B-A3B** (`qwen35moe`) -- full production stack: MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. **This GGUF.**
- **Mamba-1 / Mamba-2 / Nemotron-H MoE and Mamba-2 hybrids** -- libllama-only path via `GenericModel`. No MTP, no Engram, and single-agent only as of Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at **~45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048**, via the bundled `test-nemotron` smoke binary.
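
The dispatch key can be inspected without loading the model. The sketch below reads `general.architecture` from a GGUF header (v2+ layout, little-endian) and maps it to one of two labels. It is a simplified reader written for illustration, not chimere-server's Rust implementation: it gives up if an array-typed entry precedes the key, the `"qwen35"` / `"generic"` labels are hypothetical, and the "everything else goes generic" rule is an assumption.

```python
import io
import struct

GGUF_STRING = 8
# Byte widths of the fixed-size GGUF metadata value types (type id -> size).
SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}


def _read_string(f) -> str:
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def read_architecture(f) -> str:
    """Scan GGUF metadata until `general.architecture` is found."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    for _ in range(kv_count):
        key = _read_string(f)
        (value_type,) = struct.unpack("<I", f.read(4))
        if value_type == GGUF_STRING:
            value = _read_string(f)
            if key == "general.architecture":
                return value
        elif value_type in SCALAR_SIZES:
            f.read(SCALAR_SIZES[value_type])  # skip a scalar we do not need
        else:
            raise ValueError("array entry before the key; use a full GGUF parser")
    raise KeyError("general.architecture")


def select_path(arch: str) -> str:
    """Hypothetical labels for the two code paths described above."""
    return "qwen35" if arch == "qwen35moe" else "generic"


def make_minimal_gguf(arch: str) -> bytes:
    """Build a synthetic one-entry header so the reader can be tried without a model."""
    def pack(text: str) -> bytes:
        raw = text.encode("utf-8")
        return struct.pack("<Q", len(raw)) + raw
    header = b"GGUF" + struct.pack("<IQQ", 3, 0, 1)
    return header + pack("general.architecture") + struct.pack("<I", GGUF_STRING) + pack(arch)


if __name__ == "__main__":
    arch = read_architecture(io.BytesIO(make_minimal_gguf("qwen35moe")))
    print(arch, "->", select_path(arch))
```

On a real file, pass `open("chimere-v3-ramp.gguf", "rb")` instead of the synthetic buffer.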

Models that **should** run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, `state-spaces/mamba2-*`, `mistralai/Mamba-Codestral-7B-v0.1`, AI21-Jamba-Reasoning-3B.

## RAMP Quantization Details

Custom per-tensor quality overrides -- critical paths get higher precision. Overall: **~3.78 BPW**.

| Tensor | Quant | BPW | Rationale |
|--------|-------|-----|-----------|
| attn_v (value) | Q8_0 | 8.5 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.5 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.56 | Important for attention routing |
| ssm_dt | Q6_K | 6.56 | GDN timestep |
| token_embd, output | Q6_K | 6.56 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |

- **imatrix**: Generated on BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- **Result**: 15 GB with no measured quality loss on agentic benchmarks vs BF16
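
The nominal BPW of each quant type follows from its ggml block layout (bytes per block divided by weights per block), and the overall figure is a parameter-weighted average. The sketch below shows both calculations. The block sizes are the standard ggml ones and worth re-checking against the ggml source; the per-tensor parameter shares of this model are not published in this card, so `average_bpw` is given without a worked split.

```python
# (bytes per block, weights per block) for the ggml quant types used above.
BLOCK_LAYOUT = {
    "Q8_0": (34, 32),     # 32 int8 values + one fp16 scale
    "Q6_K": (210, 256),
    "Q5_K": (176, 256),
    "IQ3_S": (110, 256),
}


def bits_per_weight(quant: str) -> float:
    """Nominal bits per weight of one quant type."""
    block_bytes, block_weights = BLOCK_LAYOUT[quant]
    return block_bytes * 8 / block_weights


def average_bpw(shares: dict) -> float:
    """Weighted average BPW; `shares` maps quant type -> fraction of parameters."""
    total = sum(shares.values())
    return sum(bits_per_weight(q) * frac for q, frac in shares.items()) / total


if __name__ == "__main__":
    for quant in BLOCK_LAYOUT:
        print(f"{quant}: {bits_per_weight(quant):.4f} bpw")
```

Because the expert FFN tensors hold most of the parameters, the average sits close to the IQ3_S figure even though the attention and SSM tensors use much wider types.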

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 10,191 samples (v1 base + 428 additional: IFEval strict, OPSDC reasoning, instruction following) |
| Epochs | 1 (160 steps, batch 64) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$2 |
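
The step count in the table is consistent with the dataset size. A quick check, assuming `batch 64` is the effective batch size and the final partial batch is kept (both assumptions, not stated above):

```python
import math


def optimizer_steps(samples: int, batch_size: int, epochs: int = 1) -> int:
    """Steps per run when the last partial batch is kept (assumed here)."""
    return math.ceil(samples / batch_size) * epochs


if __name__ == "__main__":
    # 10,191 samples / 64 per batch = 159.23..., rounded up to 160 steps
    print(optimizer_steps(10_191, 64))
```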

### v3 dataset additions (on top of v1 base)

- +50 IFEval strict (5 constraint categories)
- +30 strict code (no markdown)
- +30 code gen with thinking
- +30 instruction following
- +20 OPSDC-compressed reasoning (-64% tokens)
- +15 multi-turn agentic

## Limitations

- **MTP infrastructure present, gated.** This GGUF carries an MTP (multi-token prediction) head. chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOp` FFI). A bench run in early March on a previous build measured **+49.5% token acceptance rate** for the MTP draft path. That figure is **not currently reproducible**, because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that fix lands, the ~80 tok/s figure above is for the non-MTP path. We will re-publish the MTP gain once the bench passes.
- **Engram is a domain-knowledge overlay, not a measured quality boost.** The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram is shipped as an optional per-domain n-gram bias (kine, code, cyber, general). In qualitative use, responses on the kine (physiotherapy) domain show specialized vocabulary (`drainage bronchique postural`, `EMII`, ...), but there is no quantitative claim attached to it today.
- **Multi-slot concurrent decoding via `ik_llama.cpp` is broken** under heavy load (`ik_llama` multi-slot bug: slot 0 contaminates system prompts under contention). The `chimere-server` production deployment is therefore single-slot. If you need parallel slots, stock `llama-server` does not have this bug.
- **Tool-calling sampler defaults**: `presence_penalty` defaults to `0.0` — a previous default of `1.5` killed code generation and long reasoning blocks. See [chimere-server source](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/src/server.rs).

## Files

| File | Size | Description |
|------|------|-------------|
| `chimere-v3-ramp.gguf` | 15 GB | v3 RAMP GGUF (instructions + reasoning focus) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |

## Related

- [chimere](https://github.com/AIdevsmartdata/chimere) -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
- [ik_llama.cpp fork](https://github.com/AIdevsmartdata/ik_llama.cpp) -- Backend with Mamba-2 + Nemotron-H backport (PR [#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593))
- [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) -- Best code + tools
- [BF16 full weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) -- For re-quantization or fine-tuning
- [LoRA adapter](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA) -- For further training
- [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo) -- A-LoRA intent routing

## Citation

```bibtex
@misc{chimere-v3-2026,
  title={Chimere v3: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Instructions and Reasoning},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF}
}
```