Kevletesteur committed · Commit d8f2497 · verified · 1 Parent(s): e7f4188

docs: Step 7 multi-arch support, chimere-server runtime, honest narratives

  1. README.md +98 -4
README.md CHANGED
@@ -13,6 +13,11 @@ tags:
13
  - gguf
14
  - ramp
15
  - imatrix
16
  base_model: Qwen/Qwen3.5-35B-A3B
17
  model_type: qwen3_5_moe
18
  quantized_by: Kevletesteur
@@ -27,6 +32,16 @@ RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB
27
 
28
  > Looking for **v1** (best code + tools)? See [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF).
29
 
30
  ## Benchmark Results
31
 
32
  ### v3 strengths: instructions and reasoning
@@ -38,7 +53,7 @@ RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB
38
  | **GSM8K CoT 8-shot** (1,319 qs) | **84.0%** | 52.2% | -- | +32 pts vs v1 |
39
  | **HumanEval** (30 problems, executed) | 83% | 97% | -- | v1 better here |
40
  | **BFCL tool-calling** (20 questions) | 75% | 90% | 67.3% | v1 better here |
41
- | **Speed** (RTX 5060 Ti 16 GB) | ~80 tok/s | ~80 tok/s | -- | |
42
 
43
  ### Qualitative agentic tests
44
 
@@ -68,7 +83,58 @@ RAMP quantization (per-tensor quality overrides + imatrix) -- 15 GB, fits 16 GB
68
 
69
  **Best of both worlds**: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo).
70
 
71
- ## Usage
72
 
73
  ```bash
74
  # llama.cpp / llama-server
@@ -90,6 +156,26 @@ llama-server \
90
  | Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
91
  | No-think | 0.7 | 0.8 | 20 | 0.0 |
92
 
93
  ## RAMP Quantization Details
94
 
95
  Custom per-tensor quality overrides -- critical paths get higher precision. Overall: **~3.78 BPW**.
@@ -128,6 +214,13 @@ Custom per-tensor quality overrides -- critical paths get higher precision. Over
128
  - +20 OPSDC-compressed reasoning (-64% tokens)
129
  - +15 multi-turn agentic
130
 
131
  ## Files
132
 
133
  | File | Size | Description |
@@ -137,11 +230,12 @@ Custom per-tensor quality overrides -- critical paths get higher precision. Over
137
 
138
  ## Related
139
 
 
 
140
  - [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) -- Best code + tools
141
  - [BF16 full weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) -- For re-quantization or fine-tuning
142
  - [LoRA adapter](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA) -- For further training
143
- - [GitHub: Chimere](https://github.com/AIdevsmartdata/chimere)
144
- - [GitHub: Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo)
145
 
146
  ## Citation
147
 
 
13
  - gguf
14
  - ramp
15
  - imatrix
16
+ - chimere-server
17
+ - mamba2
18
+ - nemotron-h
19
+ - hybrid-ssm
20
+ - multi-arch
21
  base_model: Qwen/Qwen3.5-35B-A3B
22
  model_type: qwen3_5_moe
23
  quantized_by: Kevletesteur
 
32
 
33
  > Looking for **v1** (best code + tools)? See [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF).
34
 
35
+ ## Compatible runtimes
36
+
37
+ This GGUF can be loaded by any runtime that supports the Qwen3.5-35B-A3B (`qwen35moe`) architecture. The reference runtime — and the one that exercises all chimere-specific features (Engram n-gram bias, multi-agent context switching, the C++ fast sampler with DRY + min-p, K-cache Hadamard rotation, fused MoE up/gate) — is **chimere-server**.
38
+
39
+ | Runtime | Engram | Multi-agent | DRY sampler | K-cache Hadamard | Notes |
40
+ |---|---|---|---|---|---|
41
+ | [chimere-server](https://github.com/AIdevsmartdata/chimere) (Rust, official) | yes | yes | yes (C++ fast path) | yes | Production target. Also runs Mamba-2 / Nemotron-H MoE through the same backend (PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)). |
42
+ | [`ik_llama.cpp`](https://github.com/ikawrakow/ik_llama.cpp) `llama-server` | no | no | optional | optional | Same backend that chimere-server links against, just without the Rust HTTP/sampling layer. |
43
+ | [`llama.cpp`](https://github.com/ggml-org/llama.cpp) stock `llama-server` | no | no | no | no | Works, but slower on Qwen3.5 MoE on our hardware (no `iqk` matmul, no fused MoE up/gate). |
44
+
45
  ## Benchmark Results
46
 
47
  ### v3 strengths: instructions and reasoning
 
53
  | **GSM8K CoT 8-shot** (1,319 qs) | **84.0%** | 52.2% | -- | +32 pts vs v1 |
54
  | **HumanEval** (30 problems, executed) | 83% | 97% | -- | v1 better here |
55
  | **BFCL tool-calling** (20 questions) | 75% | 90% | 67.3% | v1 better here |
56
+ | **Speed** (RTX 5060 Ti 16 GB, chimere-server) | ~80 tok/s | ~80 tok/s | -- | NCMOE=3, ctx 64K |
57
 
58
  ### Qualitative agentic tests
59
 
 
83
 
84
  **Best of both worlds**: Use A-LoRA routing -- an intent classifier selects the appropriate LoRA at runtime. Code/tools queries use v1, instruction/reasoning queries use v3. See [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo).
85
 
86
+ ## Quick start (chimere-server, recommended)
87
+
88
+ ```bash
89
+ # 1. Backend (one-time): build the ik_llama.cpp fork with sm_120 CUDA + Mamba-2 backport
90
+ git clone https://github.com/AIdevsmartdata/ik_llama.cpp.git ~/ik_llama.cpp
91
+ cd ~/ik_llama.cpp
92
+ git checkout mamba2-nemotron-h-backport
93
+ cmake -B build_sm120 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_NATIVE=OFF
94
+ cmake --build build_sm120 -j
95
+
96
+ # 2. Server
97
+ git clone https://github.com/AIdevsmartdata/chimere.git ~/chimere
98
+ cd ~/chimere/chimere-server
99
+ LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
100
+ cargo build --release --features server --bin chimere-server
101
+
102
+ # 3. Model + tokenizer
103
+ mkdir -p ~/models && cd ~/models
104
+ hf download Kevletesteur/Qwen3.5-35B-A3B-Chimere-v3-GGUF chimere-v3-ramp.gguf --local-dir .
105
+ hf download Qwen/Qwen3.5-35B-A3B tokenizer.json --local-dir tokenizers/qwen35
106
+
107
+ # 4. Run (production env vars)
108
+ CHIMERE_MODEL=$PWD/chimere-v3-ramp.gguf \
109
+ CHIMERE_TOKENIZER=$PWD/tokenizers/qwen35/tokenizer.json \
110
+ CHIMERE_LLAMA_BACKEND=1 \
111
+ CHIMERE_NCMOE=3 \
112
+ CHIMERE_KV_MAX_SEQ=65536 \
113
+ CHIMERE_PORT=8081 \
114
+ CHIMERE_FORCE_QWEN35=1 \
115
+ LD_LIBRARY_PATH=$HOME/ik_llama.cpp/build_sm120/ggml/src:$HOME/ik_llama.cpp/build_sm120/src:/usr/local/cuda-12.8/lib64 \
116
+ ~/chimere/chimere-server/target/release/chimere-server
117
+
118
+ # 5. Hello world
119
+ curl -s http://localhost:8081/v1/chat/completions \
120
+ -H 'Content-Type: application/json' \
121
+ -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":64}'
122
+ ```
123
+
124
+ ### Engram (optional, prod-only)
125
+
126
+ Chimere ships an n-gram logit bias overlay loaded from binary `.engr` tables. To enable it, set:
127
+
128
+ ```sh
129
+ CHIMERE_ENGRAM_DIR=/path/to/engram_tables # directory of *.engr files
130
+ CHIMERE_ENGRAM_ALPHA=0.1 # logit bias strength
131
+ ```
132
+
133
+ The engram tables are tokenizer-specific (Qwen3.5 vocab) and used as a per-domain overlay (kine, code, cyber, general). They are intended as a domain-knowledge injector, not a measured quality booster — see the [chimere repo README](https://github.com/AIdevsmartdata/chimere#performance) for the honest status of the path.
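
As a rough illustration of what an n-gram logit bias does, the sketch below adds `alpha` times a per-token score to the logits whenever the most recent tokens match a table entry. The in-memory table layout, the `apply_ngram_bias` name, the back-off order and the scores are all hypothetical; the real `.engr` binary format and lookup logic are not described in this README.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory table: (previous token ids) -> {next token id: score}.
# This is NOT the .engr binary layout, which is not documented here.
NgramTable = Dict[Tuple[int, ...], Dict[int, float]]


def apply_ngram_bias(logits: List[float], context: List[int],
                     table: NgramTable, alpha: float = 0.1,
                     max_order: int = 3) -> List[float]:
    """Return a copy of `logits` with alpha * score added for a matching n-gram."""
    biased = list(logits)
    # Try the longest context first, then back off to shorter ones.
    for order in range(min(max_order, len(context)), 0, -1):
        key = tuple(context[-order:])
        if key in table:
            for token_id, score in table[key].items():
                biased[token_id] += alpha * score
            break  # only the longest matching context is applied
    return biased


table = {(7, 8): {2: 5.0}, (8,): {1: 1.0}}
out = apply_ngram_bias([0.0, 0.0, 0.0, 0.0], context=[3, 7, 8], table=table)
print(out)  # token 2 is boosted by 0.1 * 5.0; the shorter (8,) entry is ignored
```

With `alpha` playing the role of `CHIMERE_ENGRAM_ALPHA`, a small value nudges domain vocabulary without overriding the model's own distribution.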
134
+
135
+ ## Quick start (generic GGUF runtimes)
136
+
137
+ If you do not need the chimere stack, the GGUF works with any Qwen3.5-compatible runtime:
138
 
139
  ```bash
140
  # llama.cpp / llama-server
 
156
  | Thinking + code/tools | 0.6 | 0.95 | 20 | 0.0 |
157
  | No-think | 0.7 | 0.8 | 20 | 0.0 |
158
 
159
+ ## Backend
160
+
161
+ The official `chimere-server` runtime links against a customized [`ik_llama.cpp`](https://github.com/AIdevsmartdata/ik_llama.cpp) fork (branch `mamba2-nemotron-h-backport`, head of upstream PR [ikawrakow/ik_llama.cpp#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593)).
162
+
163
+ Highlights of the chimere-specific layer on top of ik_llama:
164
+
165
+ - **Custom C++ fast sampler** exporting `sample_token_fast`, `set_logit_bias`, `set_engram_bias`, `clear_engram_bias` and `take_packed_logprobs` — avoids a ~993 KB logits copy per token, packs OpenAI-format top-5 logprobs.
166
+ - **K-cache Hadamard rotation**, fused MoE up/gate, grouped expert routing — all enabled by default via `cparams`.
167
+ - **Multi-agent KV / SSM state save & restore** via `llama_state_seq_*`, keyed on the OpenAI `user` field. Up to `CHIMERE_MAX_AGENTS` (default 4) concurrent personas with their own conversation state.
168
+ - An **OpenAI-compatible HTTP layer in Rust** (axum 0.8), supporting non-streaming and SSE streaming, tool calls, `<think>` reasoning extraction and `chat_template_kwargs.enable_thinking`.
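
The request fields mentioned in the list above can be combined in a single chat-completion body. The sketch below only builds the JSON; the helper name `build_chat_request` and the agent label `agent-a` are illustrative, and how the server interprets each field beyond what is stated above is an assumption.

```python
import json
from typing import Optional


def build_chat_request(prompt: str, agent: Optional[str] = None,
                       thinking: bool = True, max_tokens: int = 64,
                       stream: bool = False) -> dict:
    """Build an OpenAI-style chat completion body (illustrative helper)."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,  # True requests SSE streaming
        # Field named in the list above; toggles <think> reasoning.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    if agent is not None:
        # The OpenAI `user` field is what per-agent state is keyed on.
        body["user"] = agent
    return body


payload = build_chat_request("Hello", agent="agent-a", thinking=False)
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent to `/v1/chat/completions` on the port configured via `CHIMERE_PORT`, for example with `curl -d @-`.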
169
+
170
+ ## Multi-architecture support
171
+
172
+ The same `chimere-server` runtime is **no longer Qwen-only**. As of [Step 7](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/docs/STEP7_MULTI_ARCH.md) (April 2026), it dispatches between two code paths based on the GGUF's `general.architecture` metadata:
173
+
174
+ - **Qwen3.5-35B-A3B** (`qwen35moe`) — full production stack: MTP, MRoPE, Engram, agent scheduler, custom Candle / cudarc / libllama paths. **This GGUF.**
175
+ - **Mamba-2 / Nemotron-H MoE / Mamba-1 / Mamba-2 hybrids** — libllama-only path via `GenericModel`. No MTP, no Engram, single-agent only at Step 7. Validated end-to-end on `unsloth/Nemotron-3-Nano-30B-A3B-GGUF` (Q4_0 and UD-IQ3_XXS) at **~45 tok/s on RTX 5060 Ti, NCMOE=30, ctx 2048**, via the bundled `test-nemotron` smoke binary.
176
+
177
+ Models that **should** run via the same Generic path (untested at the chimere level — your mileage may vary): Granite 4.0 H-Tiny / H-Small / H-Micro, Falcon-H1 0.5B – 34B, Bamba-9B v1 / v2, `state-spaces/mamba2-*`, `mistralai/Mamba-Codestral-7B-v0.1`, AI21-Jamba-Reasoning-3B.
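
The dispatch described above can be sketched as a lookup on the `general.architecture` key. This is a minimal illustration, not the Rust implementation: the `CodePath` enum and `select_path` function are hypothetical names, and `nemotron_h_moe` is an assumed architecture string used only as an example of a non-Qwen value.

```python
from enum import Enum


class CodePath(Enum):
    QWEN35 = "qwen35"    # full stack: MTP, MRoPE, Engram, agent scheduler
    GENERIC = "generic"  # libllama-only path via GenericModel


def select_path(metadata: dict) -> CodePath:
    """Pick a code path from GGUF metadata (illustrative only)."""
    arch = metadata.get("general.architecture")
    if arch is None:
        raise ValueError("GGUF metadata has no general.architecture key")
    # Only `qwen35moe` is named in this README as taking the full stack.
    if arch == "qwen35moe":
        return CodePath.QWEN35
    return CodePath.GENERIC


print(select_path({"general.architecture": "qwen35moe"}))
print(select_path({"general.architecture": "nemotron_h_moe"}))  # assumed arch name
```

Routing every unknown architecture to the generic path mirrors the "should run, untested" list above: the generic path is the fallback, and only the Qwen3.5 path enables the chimere-specific features.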
178
+
179
  ## RAMP Quantization Details
180
 
181
  Custom per-tensor quality overrides -- critical paths get higher precision. Overall: **~3.78 BPW**.
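
As a back-of-the-envelope check of how the average BPW relates to file size, the sketch below multiplies a parameter count by bits per weight. The 35e9 parameter count is the nominal figure from the model name, not an exact tensor count, and GGUF metadata overhead is ignored, so the result is only an estimate.

```python
def gguf_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes, then / 2**30."""
    return n_params * bits_per_weight / 8 / 2**30


# Nominal 35B parameters (assumption) at the ~3.78 BPW quoted above.
size = gguf_size_gib(35e9, 3.78)
print(f"{size:.1f} GiB")
```

Under these assumptions the estimate comes out a little above 15 GiB, in line with the "15 GB, fits 16 GB" framing.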
 
214
  - +20 OPSDC-compressed reasoning (-64% tokens)
215
  - +15 multi-turn agentic
216
 
217
+ ## Limitations
218
+
219
+ - **MTP infrastructure present, gated.** This GGUF carries an MTP (multi-token prediction) head. chimere-server detects it via `n_nextn_layer = 1` and exposes the speculative-decoding infrastructure (`mtp_scheduler.rs`, `MtpOp` FFI). An early March bench on a previous build measured **+49.5% token acceptance rate** for the MTP draft path; that figure is **not currently reproducible** because `bench_mtp.rs:104-167` has Benchmarks 2 and 5 hard-coded as `SKIPPED` with the comment `crash in ik_llama MTP graph, KV cache issue for layer 41`. Until that fix lands, the ~80 tok/s figure above reflects the non-MTP path. We will re-publish the MTP gain once the bench passes.
220
+ - **Engram is a domain-knowledge overlay, not a measured quality boost.** The only saved engram eval in the chimere repo (`benchmarks/engram_trained_eval.json`) was run on GPT-2 + wikitext-2 and shows a −13.39% PPL regression on that out-of-distribution setup. No Qwen3.5-specific perplexity eval has been published yet. Engram is shipped as an optional per-domain n-gram bias (kine, code, cyber, general); qualitative use shows specialized vocabulary in responses (`drainage bronchique postural`, `EMII`, ...) on the kine (physiotherapy) domain, but no quantitative claim is attached to it today.
221
+ - **Multi-slot concurrent decoding via `ik_llama.cpp` is broken** under heavy load (`ik_llama` multi-slot bug, slot 0 contamination of system prompts under contention). The `chimere-server` production deployment is single-slot. Stock `llama-server` does NOT have this bug if you need parallel slots.
222
+ - **Tool-calling sampler defaults**: `presence_penalty` defaults to `0.0` — a previous default of `1.5` killed code generation and long reasoning blocks. See [chimere-server source](https://github.com/AIdevsmartdata/chimere/blob/main/chimere-server/src/server.rs).
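
To see why a high `presence_penalty` hurts code generation, the sketch below applies the OpenAI-style definition: a flat amount is subtracted from the logit of every token that has already been generated. The function name and the toy logits are illustrative, and chimere-server's own implementation may differ in detail.

```python
from typing import List


def apply_presence_penalty(logits: List[float], generated: List[int],
                           presence_penalty: float) -> List[float]:
    """Subtract a flat penalty from every token id that already appeared."""
    seen = set(generated)
    return [x - presence_penalty if i in seen else x
            for i, x in enumerate(logits)]


logits = [2.0, 1.0, 0.5]
# Token 0 stands for something code must repeat, e.g. a variable name.
penalized = apply_presence_penalty(logits, generated=[0], presence_penalty=1.5)
unchanged = apply_presence_penalty(logits, generated=[0], presence_penalty=0.0)
print(penalized)
print(unchanged)
```

With a penalty of `1.5`, the previously top-ranked token 0 falls below token 1 after a single use, which is exactly what code and long reasoning chains cannot tolerate; with `0.0` the ranking is untouched.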
223
+
224
  ## Files
225
 
226
  | File | Size | Description |
 
230
 
231
  ## Related
232
 
233
+ - [chimere](https://github.com/AIdevsmartdata/chimere) -- Official Rust runtime (chimere-server) with Engram, MTP, multi-agent, multi-arch dispatch
234
+ - [ik_llama.cpp fork](https://github.com/AIdevsmartdata/ik_llama.cpp) -- Backend with Mamba-2 + Nemotron-H backport (PR [#1593](https://github.com/ikawrakow/ik_llama.cpp/pull/1593))
235
  - [Chimere v1 GGUF](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-GGUF) -- Best code + tools
236
  - [BF16 full weights](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-BF16) -- For re-quantization or fine-tuning
237
  - [LoRA adapter](https://huggingface.co/Kevletesteur/Qwen3.5-35B-A3B-Chimere-Distilled-LoRA) -- For further training
238
+ - [Chimere ODO](https://github.com/AIdevsmartdata/chimere-odo) -- A-LoRA intent routing
 
239
 
240
  ## Citation
241