---
language:
- en
- fr
license: apache-2.0
tags:
- qwen3.5
- moe
- distillation
- opus-4.6
- tool-calling
- agentic
- gguf
- ramp
- imatrix
- chimere-server
- mamba2
- nemotron-h
- hybrid-ssm
- multi-arch
base_model: Qwen/Qwen3.5-35B-A3B
model_type: qwen3_5_moe
quantized_by: Kevletesteur
pipeline_tag: text-generation
---

# Qwen3.5-35B-A3B Chimere v3 -- RAMP GGUF

**Chimere v3: Claude Opus 4.6 distillation of Qwen3.5-35B-A3B, optimized for instruction following and reasoning.**

RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB VRAM, ~80 tok/s on RTX 5060 Ti.

> Looking for **v1** (best code + tools)? See [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF).

## Compatible runtimes

This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (`qwen35moe`) architecture. The reference runtime, and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate), is **chimere-server**.

| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
|---|---|---|---|---|---|
| [chimere-server](https://github.com/AIdevsmartdata/chimere) (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)). |
| [`ik_llama.cpp`](https://github.com/ikawrakow/ik_llama.cpp) `llama-server` | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
| [`llama.cpp`](https://github.com/ggml-org/llama.cpp) stock `llama-server` | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no `iqk` matmul, no fused MoE up/gate). |

## Benchmark Results

### v3 strengths: instructions and reasoning

| Benchmark | v3 RAMP (this repo) | v1 RAMP | Base Qwen3.5-35B-A3B | Notes |
|-----------|---------------------|---------|---------------------|-------|
| **IFEval** (15 instruction tests) | **100%** | 67% | ~91.9% | +33 pts vs v1 |
| **Edge cases** (15 adversarial tests) | **100%** | 87% | -- | Perfect prompt injection resistance on this set |
| **GSM8K CoT 8-shot** (1,319 qs) | **84.0%** | 52.2% | -- | +32 pts vs v1 |
| **HumanEval** (30 problems, executed) | 83% | 97% | -- | v1 better here |
| **BFCL tool-calling** (20 questions) | 75% | 90% | 67.3% | v1 better here |
| **Speed** (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |

### Qualitative agentic tests

Scores are out of 10 per scenario.

| Scenario | v3 | v1 | Max |
|----------|----|----|-----|
| Cybersecurity incident response (multi-tool chain) | 4 | 4 | 10 |
| ML pipeline architecture (RAG, 10K users, $50K budget) | 8 | 8 | 10 |
| Rust MoE runtime optimization (async prefetch, CUDA) | 8 | 7 | 10 |
| **Total** | **20** | **19** | **30** |

### Honest assessment

- **Strengths**: 100% IFEval, 100% adversarial edge cases, 84% GSM8K, best overall reasoning
- **Weaknesses**: Code generation weaker than v1 (83% vs 97% HumanEval), tool-calling lower (75% vs 90% BFCL)
- **Why**: The v3 dataset added IFEval-strict, OPSDC-compressed reasoning, and instruction-following samples on top of the v1 base. Recommended for general agentic use.

## Which version to use?

| Use case | Recommended | Why |
|----------|------------|-----|
| **Instruction following, formatting** | **v3 (this repo)** | 100% IFEval, 100% edge cases |
| **Math reasoning** | **v3 (this repo)** | 84% GSM8K (vs 52% v1) |
| **Prompt injection resistance** | **v3 (this repo)** | 100% adversarial edge cases |
| **Code generation, debugging** | [v1](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) | 97% HumanEval |
| **Tool-calling, function calling** | [v1](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) | 90% BFCL |
| **Re-quantization or fine-tuning** | [BF16 weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) | Full precision |

**Best of both worlds**: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo).

## Quick start (chimere-server, recommended)

```bash
# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
cd ~/ik_llama.cpp
git checkout mamba2-nemotron-h-backport
cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
cmake --build build_sm120 -j

# 2. Server (cloned to ~/chimere so that the binary path in step 4 resolves)
git clone https://github.com/AIdevsmartdata/chimere.git ~/chimere
cd ~/chimere/chimere-server
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
  cargo build --release --features server --bin chimere-server

# 3. Model + tokenizer (--local-dir writes into ~/models instead of the HF cache)
mkdir -p ~/models && cd ~/models
hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf --local-dir .
hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35

# 4. Run (production env vars)
CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
CHIMERE_LLAMA_BACKEND=1 \
CHIMERE_NCMOE=3 \
CHIMERE_KV_MAX_SEQ=65536 \
CHIMERE_PORT=8081 \
CHIMERE_FORCE_QWEN35=1 \
LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
  ~/chimere/chimere-server/target/release/chimere-server

# 5. Hello world
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
```

### Engram (optional, prod-only)

Chimere ships an n-gram logit bias overlay loaded from binary `.engr` tables. To enable it, set:

```sh
CHIMERE_ENGRAM_DIR=/path/to/engram_tables   # directory of *.engr files
CHIMERE_ENGRAM_ALPHA=0.1                    # logit bias strength
```

The engram tables are tokenizer-specific (Qwen3.5 vocab) and used as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster; see the [chimere repo README](https://github.com/AIdevsmartdata/chimere#performance) for the honest status of the path.

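As a minimal sketch of how these two variables might be wired into a launch script: the helper name `engram_env` and the decision to skip Engram when no `*.engr` file is present are assumptions made for illustration, not documented chimere-server behaviour. Only the two variable names and the `0.1` example value come from this card.

```sh
#!/bin/sh
# Hypothetical helper (not part of chimere): print the Engram env assignments
# only when the given directory holds at least one *.engr table.
engram_env() {
    dir="$1"
    alpha="${2:-0.1}"   # 0.1 is the example strength shown above
    for f in "$dir"/*.engr; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        if [ -f "$f" ]; then
            printf 'CHIMERE_ENGRAM_DIR=%s\n' "$dir"
            printf 'CHIMERE_ENGRAM_ALPHA=%s\n' "$alpha"
            return 0
        fi
    done
    return 1
}
```

One possible use is `env $(engram_env /path/to/engram_tables) <server command>`, which adds the overlay only when tables are found. Paths containing spaces would need more careful quoting than this sketch provides.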
## Quick start (generic GGUF runtimes)

If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:

```bash
# llama.cpp / llama-server
llama-server \
  -m chimere-v3-ramp.gguf \
  -ngl 99 --n-cpu-moe 4 -c 32768 \
  --flash-attn on --jinja --port 8081

# For 16 GB VRAM (RTX 5060 Ti / RTX 4080):
# Add KV cache quantization to save VRAM:
#   -ctk q8_0 -ctv q4_0
```

### Recommended sampling parameters

| Mode | temp | top_p | top_k | presence_penalty |
|------|------|-------|-------|------------------|
| Thinking (default) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
| No-think | 0.7 | 0.8 | 20 | 0.0 |

## Backend

The official `chimere-server` runtime links against a customized [`ik_llama.cpp`](https://github.com/AIdevsmartdata/ik_llama.cpp) fork (branch `mamba2-nemotron-h-backport`, head of upstream PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)). Highlights of the chimere-specific layer on top of ik_llama:

- **Custom C++ fast sampler** exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs`. It avoids a ~993 KB logits copy per token and packs OpenAI-format top-5 logprobs.
- **K-cache Hadamard rotation**, fused MoE up/gate, grouped expert routing, all enabled by default via `cparams`.
- **Multi-agent KV / SSM state save & restore** via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas, each with its own conversation state.
- An **OpenAI-compatible HTTP layer in Rust** (axum 0.8), supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction and `chat_template_kwargs.enable_thinking`.

## Multi-architecture support

The same `chimere-server` runtime is **no longer Qwen-only**.
As of [Step 7](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/docs/STEP7_MULTI_ARCH.md) (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:

- **Qwen3.5-35B-A3B** (`qwen35moe`): full production stack with MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. **This GGUF.**
- **Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids**: libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at **~45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048**, via the bundled `test-nemotron` smoke binary.

Models that **should** run via the same Generic path (untested at the chimere level; your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, `state-spaces/mamba2-*`, `mistralai/Mamba-Codestral-7B-v0.1`, AI21-Jamba-Reasoning-3B.

## RAMP Quantization Details

Custom per-tensor quality overrides -- critical paths get higher precision. Overall: **~3.78 BPW**.

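As a rough sanity check on that overall figure, file size can be estimated as parameters × bits-per-weight / 8. The sketch below assumes a nominal 35 billion parameters (read off the "35B" in the model name, not an exact count) and ignores GGUF metadata overhead; the variable names are illustrative.

```sh
#!/bin/sh
# Rough size estimate: bytes = params * bpw / 8, reported in GiB (2^30 bytes).
# params is an assumption taken from the "35B" in the model name.
params=35000000000
bpw=3.78
size_gib=$(awk -v p="$params" -v b="$bpw" \
    'BEGIN { printf "%.1f", p * b / 8 / (1024 * 1024 * 1024) }')
echo "estimated size: ${size_gib} GiB"
```

Under these assumptions the estimate comes out at about 15.4 GiB, which is in line with the 15 GB file published in this repo.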
| Tensor | Quant | BPW | Rationale |
|--------|-------|-----|-----------|
| attn_v (value) | Q8_0 | 8.0 | Most critical -- errors cause hallucinations |
| ssm_alpha, ssm_d | Q8_0 | 8.0 | GDN recurrent params, tiny but hypersensitive |
| attn_k (key) | Q6_K | 6.5 | Important for attention routing |
| ssm_dt | Q6_K | 6.5 | GDN timestep |
| token_embd, output | Q6_K | 6.5 | Shared embeddings |
| attn_q, attn_output | Q5_K | 5.5 | More tolerant |
| ssm_in, ssm_out | Q5_K | 5.5 | SSM projections |
| 256 MoE experts (FFN) | IQ3_S | 3.44 | 80% of params, high MoE redundancy |

- **imatrix**: Generated on the BF16 model (B200, 192 GB VRAM), 200 calibration chunks
- **Result**: 15 GB with no measured quality loss on agentic benchmarks vs BF16

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3.5-35B-A3B (MoE, 256 experts) |
| Method | SFT BF16 LoRA r64, completion-only loss |
| Dataset | 10,191 samples (v1 base + 428 additional: IFEval strict, OPSDC reasoning, instruction following) |
| Epochs | 1 (160 steps, batch 64) |
| Training GPU | NVIDIA B200 |
| Training cost | ~$2 |

### v3 dataset additions (on top of v1 base)

- +50 IFEval strict (5 constraint categories)
- +30 strict code (no markdown)
- +30 code gen with thinking
- +30 instruction following
- +20 OPSDC-compressed reasoning (-64% tokens)
- +15 multi-turn agentic

## Limitations

- **MTP infrastructure present, gated.** This GGUF carries an MTP (multi-token prediction) head; chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOp` FFI). A bench run in early March on a previous build measured **+49.5% token acceptance rate** for the MTP draft path; that figure is **not currently reproducible** because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that fix lands, the 80 tok/s figure above is the non-MTP path. We will re-publish the MTP gain once the bench passes.
- **Engram is a domain-knowledge overlay, not a measured quality boost.** The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram is shipped as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses (`drainage bronchique postural`, `EMII`, ...) on the kiné domain, but there is no quantitative claim attached to it today.
- **Multi-slot concurrent decoding via `ik_llama.cpp` is broken** under heavy load (`ik_llama` multi-slot bug, slot 0 contamination of system prompts under contention). The `chimere-server` production deployment is single-slot. If you need parallel slots, stock `llama-server` does not have this bug.
- **Tool-calling sampler defaults**: `presence_penalty` defaults to `0.0`; a previous default of `1.5` killed code generation and long reasoning blocks. See [chimere-server source](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/src/server.rs).

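To avoid depending on server-side sampler defaults, a request can state the settings explicitly. The following is a sketch rather than a documented chimere-server contract: `temperature`, `top_p` and `presence_penalty` are standard OpenAI chat-completion fields, while `top_k` is a llama.cpp-style extension that the server is assumed (not confirmed here) to accept. The values are the "Thinking + code/tools" row of the sampling table, the port matches the quick start, and the `chat` wrapper is a hypothetical name.

```sh
#!/bin/sh
# Request body with explicit sampler settings ("Thinking + code/tools" row).
# top_k is assumed to be accepted as an extension field; drop it if the
# runtime rejects unknown fields.
payload='{
  "messages": [{"role": "user", "content": "Write a function that reverses a string."}],
  "max_tokens": 512,
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "presence_penalty": 0.0
}'

# Hypothetical wrapper: POST the body to a local server (default port 8081).
chat() {
    curl -s "http://localhost:${1:-8081}/v1/chat/completions" \
        -H 'Content-Type: application/json' \
        -d "$payload"
}
```

Calling `chat` (or `chat 8081`) sends the request. Because the production deployment is single-slot, issuing such requests one at a time is the conservative choice.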
## Files

| File | Size | Description |
|------|------|-------------|
| `chimere-v3-ramp.gguf` | 15 GB | v3 RAMP GGUF (instructions + reasoning focus) |
| `imatrix.dat` | 184 MB | Importance matrix used for quantization |

## Related

- [chimere](https://github.com/AIdevsmartdata/chimere) -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
- [ik_llama.cpp fork](https://github.com/AIdevsmartdata/ik_llama.cpp) -- Backend with Mamba-2 + Nemotron-H backport (PR [#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593))
- [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) -- Best code + tools
- [BF16 full weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) -- For re-quantization or fine-tuning
- [LoRA adapter](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA) -- For further training
- [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo) -- A-LoRA intent routing

## Citation

```bibtex
@misc{chimere-v3-2026,
  title={Chimere v3: Claude Opus 4.6 Distillation of Qwen3.5-35B-A3B MoE for Instructions and Reasoning},
  author={Kevletesteur},
  year={2026},
  url={https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF}
}
```