---
language:
- en
license: apache-2.0
tags:
- qwen3
- reasoning
- distillation
- claude-opus
- full-finetune
- gguf
base_model: Qwen/Qwen3-Coder-Next
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Jackrong/Qwen3.5-reasoning-700x
---

# Qwen3-Coder-Next: Opus 4.6 Reasoning Distilled (GGUF)

GGUF quantizations of the full fine-tuned **Qwen/Qwen3-Coder-Next** (~80B total / ~3B active, MoE) with Claude Opus 4.6 reasoning distillation. All parameters were trained on 8x H100 80GB SXM GPUs with DeepSpeed ZeRO-3.

## Model Details

| Property | Value |
|---|---|
| Base Model | [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) |
| Architecture | qwen3_next (Mixture of Experts) |
| Total Parameters | ~80B |
| Active Parameters | ~3B per token (10 of 512 experts) |
| Expert Config | 512 experts, 10 active per token, intermediate=512 |
| Shared Expert | Yes (intermediate=512) |
| Attention | Hybrid: linear attention + full attention every 4th layer |
| Hidden Size | 2048 |
| Layers | 48 |
| Attention Heads | 16 (2 KV heads, GQA) |
| Intermediate Size | 5120 |
| Vocab Size | 151,936 |
| Max Context | 262,144 tokens (256K) |
| RoPE | Partial rotary (0.25), theta=5M |

## Available Quantizations

| Quantization | Size | BPW | Min VRAM | Use Case |
|---|---|---|---|---|
| **BF16** | 149 GB | 16.01 | 2x 96GB | Full precision, lossless |
| **Q8_0** | 79 GB | 8.50 | 1x 96GB | Best quality quant; recommended for RTX PRO 6000 |
| **Q6_K** | 62 GB | 6.58 | 1x 80GB | High quality with room for large context |
| **Q4_K_M** | 46 GB | 4.87 | 1x 48GB | Good quality; fits a single 48GB GPU such as the RTX A6000 (a 24GB RTX 4090 cannot hold it alone), or maximizes context on larger GPUs |

## Benchmark Results

Both models were evaluated by Claude Opus 4.6 on 26 tests in 6 categories. Each response was scored from 1 to 10 on correctness, completeness, clarity, and adherence to instructions.

| Category | Base Model | Opus Distilled | Winner | Delta |
|---|---|---|---|---|
| **Coding** | 8.2 | **8.5** | Opus Distilled | +0.3 |
| **Bug Detection** | 8.4 | **9.0** | Opus Distilled | +0.6 |
| **Probability** | **8.6** | 8.0 | Base | -0.6 |
| **Tool Calling** | 3.4 | **7.2** | Opus Distilled | +3.8 |
| **Logic** | **8.5** | 7.5 | Base | -1.0 |
| **Instruction Following** | **7.7** | 7.0 | Base | -0.7 |
| **Overall** | 7.35 | **7.73** | **Opus Distilled** | **+0.38** |

### Detailed Test Results

#### Coding (6 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Python: Trie Implementation | 9 | 9 | Tie | Both correct and complete |
| Python: Async Web Scraper | 8 | 9 | Opus | Opus uses cleaner class-based design with semaphore rate limiting |
| Rust: Custom Iterator | 8 | 9 | Opus | Opus adds idiomatic `impl Iterator` convenience function |
| TypeScript: Event Emitter | 8 | 8 | Tie | Both implement type-safe emitters with different valid approaches |
| SQL: Complex Query | 7 | 8 | Opus | Base has a bug: references window function in WHERE of same SELECT |
| Python: Graph BFS/DFS | 9 | 8 | Base | Base covers more methods within token budget |

#### Bug Detection (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Off-by-one in binary search | 9 | 9 | Tie | Both find all 4 bugs |
| Race condition in Go | 8 | 9 | Opus | Opus adds deadlock diagram + 3 solutions vs 2 |
| Memory leak in C++ | 9 | 9 | Tie | Both identify Rule of Three/Five violations |
| Security bugs in JavaScript | 8 | 9 | Opus | Opus adds severity table, catches JWT forgery and missing rate limiting |
| Deadlock in Python threading | 8 | 9 | Opus | Opus provides step-by-step timeline diagram + Coffman conditions |

#### Probability (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Bayes' Theorem | 9 | 9 | Tie | Both correct (~1.94%) |
| Birthday Problem Variant | 9 | 8 | Base | Base more mathematically rigorous |
| Monty Hall Extended | 9 | 9 | Tie | Both correctly derive P(switch)=2/5 |
| Markov Chain | 8 | 6 | Base | Opus made computation error, had to restart |
| Combinatorics: Card Hands | 9 | 9 | Tie | Both correct on completed sections |

#### Tool Calling (5 tests): largest improvement

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Weather API planning | 2 | 7 | Opus | Base outputs single call (37 tokens). Opus chains all 3 tools. |
| Database CRUD operations | 3 | 8 | Opus | Base: 1 tool call (119 tokens). Opus: complete 4-step workflow. |
| Multi-step file operations | 4 | 3 | Base | Both perform poorly on this test |
| API orchestration | 2 | 7 | Opus | Base outputs malformed tool call. Opus plans 3 clear steps. |
| Complex reasoning with tools | 6 | 7 | Opus | Base batches lookups correctly but stops. Opus completes reasoning. |

#### Logic (2 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Sudoku Solver Explanation | 9 | 8 | Base | Base explains constraint propagation more clearly within token budget |
| Einstein's Riddle | 8 | 7 | Base | Base makes more deduction progress within token budget |

#### Instruction Following (3 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Structured JSON output | 10 | 6 | Base | Base: clean JSON only (453 tokens). Opus: 2048 tokens, ignored constraint |
| Code with exact constraints | 5 | 6 | Opus | Both struggle. Base self-corrects mid-response. |
| Multi-format output | 8 | 8 | Tie | Both truncated, similar quality |

### Key Findings

- **Tool Calling**: Largest improvement (+3.8). The base model outputs only the first tool call and stops. Opus Distilled plans full multi-step tool chains with reasoning between steps.
- **Bug Detection**: Opus Distilled provides more structured analysis with severity tables and timeline diagrams, and catches more edge cases (+0.6).
- **Coding**: Opus Distilled favors class-based architectures with better design patterns. It also avoided a SQL bug that Base introduced (referencing a window function in the WHERE clause of the same SELECT).
- **Probability**: Base is more concise and made fewer computation errors. Opus Distilled made an error on a Markov Chain steady-state calculation.
- **Logic**: Base makes better progress within token budgets; Opus Distilled spends more tokens on preamble.
- **Instruction Following**: Base adheres more strictly to output format constraints (e.g., "output ONLY valid JSON").

### Verdict

> Opus Distilled wins overall, driven by much better tool calling and slightly better bug detection and coding. Base wins on math/probability (fewer errors), logic (better token efficiency), and instruction following (better constraint adherence). **For coding assistant use cases where tool calling matters, Opus Distilled is clearly superior.**

### Performance

Both models run at comparable speeds on RTX PRO 6000 Blackwell (96GB) with Q8_0:

| Metric | Base | Opus Distilled |
|---|---|---|
| Tokens/sec | 100.5 | 102.5 |
| Avg response length | 1,085 tokens | 1,464 tokens |

## Usage with llama.cpp

### Basic Serving

```bash
llama-server \
  --model Qwen3-Coder-Next-Opus-Distilled-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 262144 \
  --host 0.0.0.0 --port 8081
```

### With Reasoning Support (Recommended)

The model produces `<think>...</think>` reasoning blocks. To separate these from the visible output, use a custom chat template together with `--reasoning-format deepseek`:

```bash
llama-server \
  --model Qwen3-Coder-Next-Opus-Distilled-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 262144 \
  --chat-template-file qwen3-think.jinja \
  --reasoning-format deepseek \
  --host 0.0.0.0 --port 8081
```

This puts the thinking in `message.reasoning_content` and keeps `message.content` clean.

### Chat Template

The base Qwen3-Coder-Next chat template does not include `<think>` tag support. You need a merged template that supports both thinking and tool calling. Save this as `qwen3-think.jinja`:

```jinja
{%- if messages[0]["role"] == "system" %}
    {%- set system_message = messages[0]["content"] %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
{%- endif %}
{%- if not tools is defined %}
    {%- set tools = [] %}
{%- endif %}
{%- if system_message is defined %}
    {{- "<|im_start|>system\n" + system_message }}
{%- else %}
    {%- if tools is iterable and tools | length > 0 %}
        {{- "<|im_start|>system\nYou are a helpful AI assistant." }}
    {%- endif %}
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
    {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n" }}
    {%- for tool in tools %}
        {%- if tool.function is defined %}
            {%- set tool = tool.function %}
        {%- endif %}
        {{- "\n\n" ~ tool.name ~ "" }}
        {%- if tool.description is defined %}
            {{- "\n" ~ tool.description ~ "" }}
        {%- endif %}
        {{- "\n" ~ (tool.parameters | tojson) ~ "" }}
        {{- "\n" }}
    {%- endfor %}
    {{- "\n" }}
{%- endif %}
{%- if system_message is defined or (tools is iterable and tools | length > 0) %}
    {{- "<|im_end|>\n" }}
{%- endif %}
{%- for message in loop_messages %}
    {%- if message.role == "assistant" %}
        {{- "<|im_start|>assistant\n" }}
        {%- if message.reasoning_content is defined and message.reasoning_content %}
            {{- "<think>\n" + message.reasoning_content + "\n</think>\n\n" }}
        {%- endif %}
        {{- message.content + "<|im_end|>\n" }}
    {%- elif message.role == "tool" %}
        {{- "<|im_start|>user\n\n" + message.content + "\n<|im_end|>\n" }}
    {%- else %}
        {{- "<|im_start|>" + message.role + "\n" + message.content + "<|im_end|>\n" }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- "<|im_start|>assistant\n<think>\n" }}
{%- endif %}
```

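
To make the assistant branch of the template concrete, the following Python sketch mirrors the round trip it implies: earlier reasoning is wrapped in thinking tags when a turn is rendered, and a raw completion can be split back into reasoning and visible answer. The helper names `render_assistant_turn` and `split_reasoning` are hypothetical, and the `<think>`/`</think>` tag spelling is an assumption. This is an illustration of the idea only, not how llama.cpp implements `--reasoning-format deepseek` internally.

```python
THINK_OPEN = "<think>"    # assumed opening tag
THINK_CLOSE = "</think>"  # assumed closing tag


def render_assistant_turn(message: dict) -> str:
    """Mirror the template's assistant branch for a single message."""
    out = "<|im_start|>assistant\n"
    reasoning = message.get("reasoning_content")
    if reasoning:
        # Prior reasoning is re-wrapped in thinking tags before the answer.
        out += f"{THINK_OPEN}\n{reasoning}\n{THINK_CLOSE}\n\n"
    return out + message["content"] + "<|im_end|>\n"


def split_reasoning(raw: str) -> tuple[str, str]:
    """Split raw model text into (reasoning, visible content).

    The opening tag is treated as optional, because a generation prompt
    may already end with it and leave only the closing tag in the output.
    """
    head, sep, tail = raw.partition(THINK_CLOSE)
    if not sep:
        # No closing tag found: treat everything as visible content.
        return "", raw.strip()
    reasoning = head.replace(THINK_OPEN, "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    turn = render_assistant_turn(
        {"reasoning_content": "2 + 2 is 4.", "content": "The answer is 4."}
    )
    print(turn)
    body = turn.removeprefix("<|im_start|>assistant\n").removesuffix("<|im_end|>\n")
    print(split_reasoning(body))
```

A client that talks to a server without reasoning separation could apply something like `split_reasoning` itself. With `--reasoning-format deepseek` the server already returns the two parts in separate fields.
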
### OpenAI-Compatible API

The model serves an OpenAI-compatible API. Example request:

```bash
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "opus-distilled",
    "messages": [{"role": "user", "content": "Implement a thread-safe LRU cache in Rust"}],
    "max_tokens": 4096,
    "temperature": 0.6
  }'
```

The response includes `reasoning_content` (thinking) separate from `content` (answer).

## Training Details

### Hardware & Infrastructure

- **GPUs**: 8x NVIDIA H100 80GB SXM with NVLink
- **System RAM**: 2 TB DDR5
- **Distribution**: DeepSpeed ZeRO-3 (parameters sharded across all 8 GPUs)
- **Optimizer Offload**: AdamW optimizer states offloaded to CPU RAM (~700GB)
- **Platform**: RunPod

### Hyperparameters

| Parameter | Value |
|---|---|
| Method | Full fine-tune (all parameters) |
| Framework | HuggingFace TRL 1.0.0 + DeepSpeed 0.18.9 |
| Transformers | 5.4.0 |
| Optimizer | AdamW (CPU offloaded via DeepSpeed ZeRO-3) |
| Learning Rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
| Epochs | 3 |
| Effective Batch Size | 32 (1 per GPU x 4 grad accum x 8 GPUs) |
| Max Sequence Length | 8192 (training context window) |
| Gradient Checkpointing | Enabled (non-reentrant) |
| Precision | BF16 |
| Total Steps | 303 |
| Seed | 42 |

### Training Progression

| Metric | Step 1 | Step 50 | Step 114 | Step 150 | Step 214 | Step 303 (Final) |
|---|---|---|---|---|---|---|
| Loss | 0.870 | 0.498 | 0.244 | 0.210 | 0.115 | 0.062 |
| Token Accuracy | 78.0% | 84.4% | 91.5% | 93.5% | 96.5% | 98.1% |
| Learning Rate | 0 | 1.94e-5 | 1.51e-5 | 1.29e-5 | 4.66e-6 | 5.99e-10 |
| Epoch | 0.01 | 0.50 | 1.11 | 1.33 | 2.10 | 3.00 |

### Datasets

3,204 examples after quality filtering (required `<think>` tags and >200 characters of assistant content):

| Dataset | Examples | Description |
|---|---|---|
| [nohurry/Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | 2,321 | Claude Opus 4.6 reasoning traces (thinking + solution) |
| [TeichAI/claude-4.5-opus-high-reasoning-250x](https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x) | 250 | High-quality Claude reasoning conversations |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 633 | Qwen reasoning conversations |

### Data Format

Each training example follows this structure:

```
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
<think>
{chain-of-thought reasoning}
</think>

{final answer}<|im_end|>
```

### Quality Filter

Examples were filtered to require:

1. At least one assistant message containing `<think>` tags
2. Assistant content longer than 200 characters

This removed low-quality or non-reasoning examples from the combined dataset.

## Reasoning Format

The model produces reasoning inside `<think>...</think>` tags:

```
<think>
Let me analyze this step by step...
1. First consideration
2. Second consideration
3. Conclusion
</think>

Here is the final answer based on my analysis.
```

## HF Safetensors

For the full-precision HuggingFace model (BF16 safetensors), see [samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled](https://huggingface.co/samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled).
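
For readers who want to reproduce the dataset preparation, the two rules in the Quality Filter section can be sketched in a few lines of Python. This is an illustration, not the author's preprocessing script. The function name `passes_quality_filter`, the list-of-dicts `messages` layout, the `<think>`/`</think>` tag spelling, and the choice to measure the 200 characters over all assistant messages combined are assumptions.

```python
MIN_ASSISTANT_CHARS = 200  # threshold stated in the Quality Filter section


def passes_quality_filter(messages: list[dict]) -> bool:
    """Return True if a conversation satisfies both filter rules.

    Assumed layout: each message is {"role": ..., "content": ...}.
    """
    assistant_texts = [
        m.get("content") or ""
        for m in messages
        if m.get("role") == "assistant"
    ]
    # Rule 1: at least one assistant message contains thinking tags.
    has_reasoning = any(
        "<think>" in text and "</think>" in text for text in assistant_texts
    )
    # Rule 2: assistant content is longer than 200 characters
    # (assumption: measured across all assistant messages together).
    total_chars = sum(len(text) for text in assistant_texts)
    return has_reasoning and total_chars > MIN_ASSISTANT_CHARS


if __name__ == "__main__":
    good = [
        {"role": "user", "content": "Explain binary search."},
        {
            "role": "assistant",
            "content": "<think>\n" + "step " * 50 + "\n</think>\n\nDone.",
        },
    ]
    too_short = [{"role": "assistant", "content": "<think>ok</think>hi"}]
    no_tags = [{"role": "assistant", "content": "x" * 300}]
    for name, example in [("good", good), ("too_short", too_short), ("no_tags", no_tags)]:
        print(name, passes_quality_filter(example))
```

Keeping the threshold in a named constant makes it easy to experiment with stricter or looser filters on the same source datasets.
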