Kevletesteur committed on Apr 7

Commit

d8f2497

verified

1 Parent(s): e7f4188

docs: Step 7 multi-arch support, chimere-server runtime, honest narratives

Files changed (1)

README.md +98 -4

@@ -13,6 +13,11 @@ tags:
 - gguf
 - ramp
 - imatrix
 base_model: Qwen/Qwen3.5-35B-A3B
 model_type: qwen3_5_moe
 quantized_by: Kevletesteur
@@ -27,6 +32,16 @@ RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB
 > Looking for **v1** (best code + tools)? See [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF).
 ## Benchmark Results
 ### v3 strengths: instructions and reasoning
@@ -38,7 +53,7 @@ RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB
 | **GSM8K CoT 8-shot** (1,319 qs) | **84.0%** | 52.2% | -- | +32 pts vs v1 |
 | **HumanEval** (30 problems, executed) | 83% | 97% | -- | v1 better here |
 | **BFCL tool-calling** (20 questions) | 75% | 90% | 67.3% | v1 better here |
-| **Speed** (RTX 5060 Ti 16 GB) | ~80 tok/s | ~80 tok/s | -- | |
 ### Qualitative agentic tests
@@ -68,7 +83,58 @@ RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB
 **Best of both worlds**: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo).
-## Usage
 ```bash
 # llama.cpp / llama-server
@@ -90,6 +156,26 @@ llama-server \
 | Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
 | No-think | 0.7 | 0.8 | 20 | 0.0 |
 ## RAMP Quantization Details
 Custom per-tensor quality overrides -- critical paths get higher precision. Overall: **~3.78 BPW**.
@@ -128,6 +214,13 @@ Custom per-tensor quality overrides -- critical paths get higher precision. Over
 - +20 OPSDC-compressed reasoning (-64% tokens)
 - +15 multi-turn agentic
 ## Files
 | File | Size | Description |
@@ -137,11 +230,12 @@ Custom per-tensor quality overrides -- critical paths get higher precision. Over
 ## Related
 - [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) -- Best code + tools
 - [BF16 full weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) -- For re-quantization or fine-tuning
 - [LoRA adapter](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA) -- For further training
-- [GitHub: Chimere](https://github.com/AIdevsmartdata/chimere)
-- [GitHub: Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo)
 ## Citation

 - gguf
 - ramp
 - imatrix
+- chimere-server
+- mamba2
+- nemotron-h
+- hybrid-ssm
+- multi-arch
 base_model: Qwen/Qwen3.5-35B-A3B
 model_type: qwen3_5_moe
 quantized_by: Kevletesteur
 > Looking for **v1** (best code + tools)? See [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF).
+## Compatible runtimes
+This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (`qwen35moe`) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is **chimere-server**.
+| Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
+|---|---|---|---|---|---|
+| [chimere-server](https://github.com/AIdevsmartdata/chimere) (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)). |
+| [`ik_llama.cpp`](https://github.com/ikawrakow/ik_llama.cpp) `llama-server` | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
+| [`llama.cpp`](https://github.com/ggml-org/llama.cpp) stock `llama-server` | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no `iqk` matmul, no fused MoE up/gate). |
 ## Benchmark Results
 ### v3 strengths: instructions and reasoning
 | **GSM8K CoT 8-shot** (1,319 qs) | **84.0%** | 52.2% | -- | +32 pts vs v1 |
 | **HumanEval** (30 problems, executed) | 83% | 97% | -- | v1 better here |
 | **BFCL tool-calling** (20 questions) | 75% | 90% | 67.3% | v1 better here |
+| **Speed** (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |
 ### Qualitative agentic tests
 **Best of both worlds**: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo).
+## Quick start (chimere-server, recommended)
+```bash
+# 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
+git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
+cd ~/ik_llama.cpp
+git checkout mamba2-nemotron-h-backport
+cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
+cmake --build build_sm120 -j
+# 2. Server
+git clone https://github.com/AIdevsmartdata/chimere.git ~/chimere
+cd ~/chimere/chimere-server
+LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
+  cargo build --release --features server --bin chimere-server
+# 3. Model + tokenizer
+mkdir -p ~/models && cd ~/models
+hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf --local-dir .
+hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35
+# 4. Run (production env vars)
+CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
+CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
+CHIMERE_LLAMA_BACKEND=1 \
+CHIMERE_NCMOE=3 \
+CHIMERE_KV_MAX_SEQ=65536 \
+CHIMERE_PORT=8081 \
+CHIMERE_FORCE_QWEN35=1 \
+LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
+~/chimere/chimere-server/target/release/chimere-server
+# 5. Hello world
+curl -s http://localhost:8081/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
+```
+### Engram (optional, prod-only)
+Chimere ships an n-gram logit bias overlay loaded from binary `.engr` tables. To enable it, set:
+```sh
+CHIMERE_ENGRAM_DIR=/path/to/engram_tables   # directory of *.engr files
+CHIMERE_ENGRAM_ALPHA=0.1                     # logit bias strength
+```
+The engram tables are tokenizer-specific (Qwen3.5 vocab) and used as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster — see the [chimere repo README](https://github.com/AIdevsmartdata/chimere#performance) for the honest status of the path.
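Reviewer note: to make "n-gram logit bias" concrete, here is a minimal Python sketch of the general technique. It is not Chimere's implementation and does not read the binary `.engr` format; the in-memory `table` layout, the function name `apply_ngram_bias`, and the `order` parameter are hypothetical. Only `alpha` mirrors something in the README, namely `CHIMERE_ENGRAM_ALPHA`.

```python
def apply_ngram_bias(logits, context, table, alpha=0.1, order=2):
    """Return a copy of `logits` with an alpha-scaled n-gram bias added.

    logits  : list of floats, one per vocabulary entry
    context : list of token ids generated so far
    table   : dict mapping a tuple of the previous (order - 1) token ids
              to a dict {next_token_id: score}  (hypothetical layout)
    alpha   : bias strength, playing the role of CHIMERE_ENGRAM_ALPHA
    """
    biased = list(logits)
    # The lookup key is the last (order - 1) tokens; unigram tables use ().
    key = tuple(context[-(order - 1):]) if order > 1 else ()
    for token_id, score in table.get(key, {}).items():
        biased[token_id] += alpha * score
    return biased


if __name__ == "__main__":
    # Toy bigram table: after token 7, token 2 is favoured.
    toy_table = {(7,): {2: 1.0}}
    print(apply_ngram_bias([0.0, 0.0, 0.0], [5, 7], toy_table))
```

Contexts missing from the table leave the logits unchanged, which is the behaviour one would expect from an optional overlay.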
+## Quick start (generic GGUF runtimes)
+If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:
 ```bash
 # llama.cpp / llama-server
 | Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
 | No-think | 0.7 | 0.8 | 20 | 0.0 |
+## Backend
+The official `chimere-server` runtime links against a customized [`ik_llama.cpp`](https://github.com/AIdevsmartdata/ik_llama.cpp) fork (branch `mamba2-nemotron-h-backport`, head of upstream PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)).
+Highlights of the chimere-specific layer on top of ik_llama:
+- **Custom C++ fast sampler** exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs` — avoids a ~993 KB logits copy per token, packs OpenAI-format top-5 logprobs.
+- **K-cache Hadamard rotation**, fused MoE up/gate, grouped expert routing — all enabled by default via `cparams`.
+- **Multi-agent KV / SSM state save & restore** via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas with their own conversation state.
+- An **OpenAI-compatible HTTP layer in Rust** (axum 0.8), supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction and `chat_template_kwargs.enable_thinking`.
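Reviewer note: the per-agent state and thinking toggle listed above can be exercised from any OpenAI-style client. The Python sketch below assumes the server from the quick start is listening on port 8081 and that `chat_template_kwargs` is accepted as a top-level request field. The helper names `build_chat_request` and `send_chat_request` and the persona id are hypothetical, and the `model` field is omitted as in the README's own curl example.

```python
import json
import urllib.request

# Assumed endpoint: matches CHIMERE_PORT=8081 from the quick start.
CHIMERE_URL = "http://localhost:8081/v1/chat/completions"


def build_chat_request(prompt, agent_id, enable_thinking=True, max_tokens=64):
    """Build an OpenAI-style chat payload (hypothetical helper).

    `user` is the OpenAI field the README says agent state is keyed on;
    `chat_template_kwargs.enable_thinking` toggles <think> reasoning.
    """
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "user": agent_id,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload, url=CHIMERE_URL, timeout=60):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    payload = build_chat_request("Hello", agent_id="persona-a", enable_thinking=False)
    print(json.dumps(payload, indent=2))
    # With a server running locally, the reply can be fetched with:
    # print(send_chat_request(payload)["choices"][0]["message"]["content"])
```

Reusing the same `agent_id` across calls is what would let the server restore that persona's saved state, if the keying works as the bullet describes.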
+## Multi-architecture support
+The same `chimere-server` runtime is **no longer Qwen-only**. As of [Step 7](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/docs/STEP7_MULTI_ARCH.md) (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:
+- **Qwen3.5-35B-A3B** (`qwen35moe`) — full production stack: MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. **This GGUF.**
+- **Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids** — libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at **~45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048**, via the bundled `test-nemotron` smoke binary.
+Models that **should** run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, `state-spaces/mamba2-*`, `mistralai/Mamba-Codestral-7B-v0.1`, AI21-Jamba-Reasoning-3B.
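Reviewer note: the two-path dispatch described above amounts to a lookup on the architecture string. In the Python sketch below, only `qwen35moe` comes from the README. The other architecture strings and the helper `select_code_path` are assumptions for illustration, and the feature flags simply restate the two bullets. chimere-server itself is written in Rust, and its actual matching logic is not shown in this diff.

```python
# Architecture string the README gives for this GGUF.
QWEN35_ARCH = "qwen35moe"

# Assumption: illustrative `general.architecture` values for the generic path.
# The README names model families, not the exact strings chimere-server matches.
GENERIC_ARCHS = frozenset({"mamba", "mamba2", "nemotron_h", "nemotron_h_moe"})


def select_code_path(architecture):
    """Return the code path and features for a GGUF (hypothetical helper)."""
    if architecture == QWEN35_ARCH:
        # Full production stack per the README.
        return {"path": "qwen35", "mtp": True, "engram": True, "multi_agent": True}
    if architecture in GENERIC_ARCHS:
        # libllama-only path: no MTP, no Engram, single-agent at Step 7.
        return {"path": "generic", "mtp": False, "engram": False, "multi_agent": False}
    raise ValueError(f"unsupported general.architecture: {architecture!r}")


if __name__ == "__main__":
    for arch in ("qwen35moe", "mamba2"):
        print(arch, "->", select_code_path(arch))
```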
 ## RAMP Quantization Details
 Custom per-tensor quality overrides -- critical paths get higher precision. Overall: **~3.78 BPW**.
 - +20 OPSDC-compressed reasoning (-64% tokens)
 - +15 multi-turn agentic
+## Limitations
+- **MTP infrastructure present, gated.** This GGUF carries an MTP (multi-token prediction) head; chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOp` FFI). An early-March bench on a previous build measured **+49.5% token acceptance rate** for the MTP draft path; that figure is **not currently reproducible** because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that crash is fixed, the ~80 tok/s figure above is the non-MTP path. We will re-publish the MTP gain once the bench passes.
+- **Engram is a domain-knowledge overlay, not a measured quality boost.** The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram is shipped as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses (`drainage bronchique postural`, `EMII`, ...) on the kiné domain, but there is no quantitative claim attached to it today.
+- **Multi-slot concurrent decoding via `ik_llama.cpp` is broken** under heavy load (`ik_llama` multi-slot bug, slot 0 contamination of system prompts under contention). The `chimere-server` production deployment is single-slot. Stock `llama-server` does NOT have this bug if you need parallel slots.
+- **Tool-calling sampler defaults**: `presence_penalty` defaults to `0.0` — a previous default of `1.5` killed code generation and long reasoning blocks. See [chimere-server source](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/src/server.rs).
 ## Files
 | File | Size | Description |
 ## Related
+- [chimere](https://github.com/AIdevsmartdata/chimere) -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
+- [ik_llama.cpp fork](https://github.com/AIdevsmartdata/ik_llama.cpp) -- Backend with Mamba-2 + Nemotron-H backport (PR [#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593))
 - [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) -- Best code + tools
 - [BF16 full weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) -- For re-quantization or fine-tuning
 - [LoRA adapter](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA) -- For further training
+- [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo) -- A-LoRA intent routing
 ## Citation