---
language:
- en
license: apache-2.0
tags:
- qwen3
- reasoning
- distillation
- claude-opus
- full-finetune
- gguf
base_model: Qwen/Qwen3-Coder-Next
datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Jackrong/Qwen3.5-reasoning-700x
---

# Qwen3-Coder-Next — Opus 4.6 Reasoning Distilled (GGUF)

GGUF quantizations of a full-parameter fine-tune of **Qwen/Qwen3-Coder-Next** (~80B total / ~3B active parameters, MoE) for Claude Opus 4.6 reasoning distillation. All parameters were trained on 8x H100 80GB SXM GPUs with DeepSpeed ZeRO-3.

## Model Details

| Property | Value |
|---|---|
| Base Model | [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) |
| Architecture | qwen3_next (Mixture of Experts) |
| Total Parameters | ~80B |
| Active Parameters | ~3B per token (10 of 512 experts) |
| Expert Config | 512 experts, 10 active per token, intermediate=512 |
| Shared Expert | Yes (intermediate=512) |
| Attention | Hybrid: linear attention + full attention every 4th layer |
| Hidden Size | 2048 |
| Layers | 48 |
| Attention Heads | 16 (2 KV heads, GQA) |
| Intermediate Size | 5120 |
| Vocab Size | 151,936 |
| Max Context | 262,144 tokens (256K) |
| RoPE | Partial rotary (0.25), theta=5M |

## Available Quantizations

| Quantization | Size | BPW | Min VRAM | Use Case |
|---|---|---|---|---|
| **BF16** | 149 GB | 16.01 | 2x 96GB | Full precision, lossless |
| **Q8_0** | 79 GB | 8.50 | 1x 96GB | Best quality quant — recommended for RTX PRO 6000 |
| **Q6_K** | 62 GB | 6.58 | 1x 80GB | High quality with room for large context |
| **Q4_K_M** | 46 GB | 4.87 | 1x 48GB | Good quality; fits a 48GB card such as the RTX A6000, or maximizes context on larger GPUs |
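
The listed sizes are roughly what the bits-per-weight (BPW) column predicts. The sketch below is a back-of-the-envelope check under two assumptions the table does not state: a round 80e9 parameters (the card only says "~80B") and sizes reported in GiB (2^30 bytes).

```python
# Back-of-the-envelope check: file size ≈ parameters × BPW / 8.
# ASSUMPTIONS (not stated in the table): 80e9 parameters, sizes in GiB.
PARAMS = 80e9
GIB = 2**30

# (listed size, bits per weight) copied from the table above
QUANTS = {
    "BF16": (149, 16.01),
    "Q8_0": (79, 8.50),
    "Q6_K": (62, 6.58),
    "Q4_K_M": (46, 4.87),
}


def estimated_size_gib(bpw: float, params: float = PARAMS) -> float:
    """Estimate the weight file size in GiB from bits per weight."""
    return params * bpw / 8 / GIB


for name, (listed, bpw) in QUANTS.items():
    estimate = estimated_size_gib(bpw)
    print(f"{name:7s} listed {listed:3d} GB, estimated {estimate:6.1f} GiB")
```

This estimate covers the weights only. The KV cache and compute buffers need memory on top of it, so treat the Min VRAM column as a floor rather than a comfortable target.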

## Benchmark Results

Responses from both models were judged by Claude Opus 4.6 across 26 tests in 6 categories and scored 1-10 on correctness, completeness, clarity, and adherence to instructions.

| Category | Base Model | Opus Distilled | Winner | Delta |
|---|---|---|---|---|
| **Coding** | 8.2 | **8.5** | Opus Distilled | +0.3 |
| **Bug Detection** | 8.4 | **9.0** | Opus Distilled | +0.6 |
| **Probability** | **8.6** | 8.0 | Base | -0.6 |
| **Tool Calling** | 3.4 | **7.2** | Opus Distilled | +3.8 |
| **Logic** | **8.5** | 7.5 | Base | -1.0 |
| **Instruction Following** | **7.7** | 7.0 | Base | -0.7 |
| **Overall** | 7.35 | **7.73** | **Opus Distilled** | **+0.38** |

### Detailed Test Results

#### Coding (6 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Python: Trie Implementation | 9 | 9 | Tie | Both correct and complete |
| Python: Async Web Scraper | 8 | 9 | Opus | Opus uses cleaner class-based design with semaphore rate limiting |
| Rust: Custom Iterator | 8 | 9 | Opus | Opus adds idiomatic `impl Iterator<Item = u64>` convenience function |
| TypeScript: Event Emitter | 8 | 8 | Tie | Both implement type-safe emitters with different valid approaches |
| SQL: Complex Query | 7 | 8 | Opus | Base has a bug: references window function in WHERE of same SELECT |
| Python: Graph BFS/DFS | 9 | 8 | Base | Base covers more methods within token budget |

#### Bug Detection (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Off-by-one in binary search | 9 | 9 | Tie | Both find all 4 bugs |
| Race condition in Go | 8 | 9 | Opus | Opus adds deadlock diagram + 3 solutions vs 2 |
| Memory leak in C++ | 9 | 9 | Tie | Both identify Rule of Three/Five violations |
| Security bugs in JavaScript | 8 | 9 | Opus | Opus adds severity table, catches JWT forgery and missing rate limiting |
| Deadlock in Python threading | 8 | 9 | Opus | Opus provides step-by-step timeline diagram + Coffman conditions |

#### Probability (5 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Bayes' Theorem | 9 | 9 | Tie | Both correct (~1.94%) |
| Birthday Problem Variant | 9 | 8 | Base | Base more mathematically rigorous |
| Monty Hall Extended | 9 | 9 | Tie | Both correctly derive P(switch)=2/5 |
| Markov Chain | 8 | 6 | Base | Opus made computation error, had to restart |
| Combinatorics: Card Hands | 9 | 9 | Tie | Both correct on completed sections |

#### Tool Calling (5 tests) — Largest improvement

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Weather API planning | 2 | 7 | Opus | Base outputs single call (37 tokens). Opus chains all 3 tools. |
| Database CRUD operations | 3 | 8 | Opus | Base: 1 tool call (119 tokens). Opus: complete 4-step workflow. |
| Multi-step file operations | 4 | 3 | Base | Both perform poorly on this test |
| API orchestration | 2 | 7 | Opus | Base outputs malformed tool call. Opus plans 3 clear steps. |
| Complex reasoning with tools | 6 | 7 | Opus | Base batches lookups correctly but stops. Opus completes reasoning. |

#### Logic (2 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Sudoku Solver Explanation | 9 | 8 | Base | Base explains constraint propagation more clearly within token budget |
| Einstein's Riddle | 8 | 7 | Base | Base makes more deduction progress within token budget |

#### Instruction Following (3 tests)

| Test | Base | Opus | Winner | Notes |
|---|---|---|---|---|
| Structured JSON output | 10 | 6 | Base | Base: clean JSON only (453 tokens). Opus: 2048 tokens, ignored constraint |
| Code with exact constraints | 5 | 6 | Opus | Both struggle. Base self-corrects mid-response. |
| Multi-format output | 8 | 8 | Tie | Both truncated, similar quality |

### Key Findings

- **Tool Calling**: Largest improvement (+3.8). Base model outputs only the first tool call and stops. Opus Distilled plans full multi-step tool chains with reasoning between steps.
- **Bug Detection**: Opus Distilled provides more structured analysis with severity tables, timeline diagrams, and catches more edge cases (+0.6).
- **Coding**: Opus Distilled favors class-based designs (e.g., the async web scraper). Its SQL answer avoided a bug present in Base's query: a window function referenced in the WHERE clause of the same SELECT.
- **Probability**: Base is more concise and makes fewer computation errors. Opus Distilled made an error on the Markov Chain steady-state calculation and had to restart.
- **Logic**: Base makes better progress within token budgets — Opus Distilled spends more tokens on preamble.
- **Instruction Following**: Base adheres more strictly to output format constraints (e.g., "output ONLY valid JSON").

### Verdict

> Opus-Distilled wins overall driven by massively better tool calling and slightly better bug detection and coding. Base wins on math/probability (fewer errors), logic (better token efficiency), and instruction following (better constraint adherence). **For coding assistant use cases where tool calling matters, Opus-Distilled is clearly superior.**

### Performance

Both models run at comparable speeds on RTX PRO 6000 Blackwell (96GB) with Q8_0:

| Metric | Base | Opus Distilled |
|---|---|---|
| Tokens/sec | 100.5 | 102.5 |
| Avg response length | 1,085 tokens | 1,464 tokens |
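
Because throughput is nearly identical, the longer average response is what changes wall-clock time. A rough estimate, assuming the tokens/sec figure is generation speed and ignoring prompt processing (neither is stated above):

```python
# Rough wall-clock time per response = average length / throughput.
# ASSUMPTION: tokens/sec is generation speed; prompt processing is ignored.
RESULTS = {
    "Base": {"tokens_per_sec": 100.5, "avg_tokens": 1085},
    "Opus Distilled": {"tokens_per_sec": 102.5, "avg_tokens": 1464},
}


def seconds_per_response(tokens_per_sec: float, avg_tokens: int) -> float:
    """Average generation time in seconds for one response."""
    return avg_tokens / tokens_per_sec


for name, row in RESULTS.items():
    print(f"{name}: {seconds_per_response(**row):.1f} s per response")
```

Under these assumptions an average response takes about 10.8 s on Base and about 14.3 s on Opus Distilled.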

## Usage with llama.cpp

### Basic Serving

```bash
llama-server \
  --model Qwen3-Coder-Next-Opus-Distilled-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 262144 \
  --host 0.0.0.0 --port 8081
```

### With Reasoning Support (Recommended)

The model produces `<think>...</think>` reasoning blocks. To separate these from the visible output, pass a custom chat template together with `--reasoning-format deepseek`:

```bash
llama-server \
  --model Qwen3-Coder-Next-Opus-Distilled-Q8_0.gguf \
  --n-gpu-layers -1 \
  --ctx-size 262144 \
  --chat-template-file qwen3-think.jinja \
  --reasoning-format deepseek \
  --host 0.0.0.0 --port 8081
```

This puts the thinking in `message.reasoning_content` and keeps `message.content` clean.

### Chat Template

The base Qwen3-Coder-Next chat template does not include `<think>` tag support. You need a merged template that supports both thinking and tool calling. Save this as `qwen3-think.jinja`:

<details>
<summary>Click to expand chat template</summary>

```jinja
{%- if messages[0]["role"] == "system" %}
    {%- set system_message = messages[0]["content"] %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
{%- endif %}

{%- if not tools is defined %}
    {%- set tools = [] %}
{%- endif %}

{%- if system_message is defined %}
    {{- "<|im_start|>system\n" + system_message }}
{%- else %}
    {%- if tools is iterable and tools | length > 0 %}
        {{- "<|im_start|>system\nYou are a helpful AI assistant." }}
    {%- endif %}
{%- endif %}
{%- if tools is iterable and tools | length > 0 %}
    {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
    {%- for tool in tools %}
        {%- if tool.function is defined %}
            {%- set tool = tool.function %}
        {%- endif %}
        {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
        {%- if tool.description is defined %}
            {{- "\n<description>" ~ tool.description ~ "</description>" }}
        {%- endif %}
        {{- "\n<parameters>" ~ (tool.parameters | tojson) ~ "</parameters>" }}
        {{- "\n</function>" }}
    {%- endfor %}
    {{- "\n</tools>" }}
{%- endif %}
{%- if system_message is defined or (tools is iterable and tools | length > 0) %}
    {{- "<|im_end|>\n" }}
{%- endif %}

{%- for message in loop_messages %}
    {%- if message.role == "assistant" %}
        {{- "<|im_start|>assistant\n" }}
        {%- if message.reasoning_content is defined and message.reasoning_content %}
            {{- "<think>\n" + message.reasoning_content + "\n</think>\n\n" }}
        {%- endif %}
        {{- message.content + "<|im_end|>\n" }}
    {%- elif message.role == "tool" %}
        {{- "<|im_start|>user\n<tool_response>\n" + message.content + "\n</tool_response><|im_end|>\n" }}
    {%- else %}
        {{- "<|im_start|>" + message.role + "\n" + message.content + "<|im_end|>\n" }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- "<|im_start|>assistant\n<think>\n" }}
{%- endif %}
```

</details>

### OpenAI-Compatible API

`llama-server` exposes an OpenAI-compatible API. Example request:

```bash
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "opus-distilled",
    "messages": [{"role": "user", "content": "Implement a thread-safe LRU cache in Rust"}],
    "max_tokens": 4096,
    "temperature": 0.6
  }'
```

With `--reasoning-format deepseek`, the response returns the thinking in `message.reasoning_content`, separate from the answer in `message.content`.
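
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server from the sections above is listening on `localhost:8081`; the helper names `build_payload`, `split_message`, and `ask` are illustrative and not part of any library.

```python
import json
import urllib.request

# ASSUMPTION: llama-server from the section above is listening here.
BASE_URL = "http://localhost:8081"


def build_payload(prompt: str, max_tokens: int = 4096, temperature: float = 0.6) -> dict:
    """Build the same request body as the curl example."""
    return {
        "model": "opus-distilled",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def split_message(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion response.

    reasoning_content is only present when the server separates the
    thinking from the answer, so a missing field falls back to "".
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt: str, base_url: str = BASE_URL) -> tuple[str, str]:
    """POST the prompt to /v1/chat/completions and split the reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return split_message(json.load(reply))
```

Calling `ask("Implement a thread-safe LRU cache in Rust")` against a running server returns the reasoning and the answer as two separate strings.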

## Training Details

### Hardware & Infrastructure

- **GPUs**: 8x NVIDIA H100 80GB SXM with NVLink
- **System RAM**: 2 TB DDR5
- **Distribution**: DeepSpeed ZeRO-3 (parameters sharded across all 8 GPUs)
- **Optimizer Offload**: AdamW optimizer states offloaded to CPU RAM (~700GB)
- **Platform**: RunPod

### Hyperparameters

| Parameter | Value |
|---|---|
| Method | Full fine-tune (all parameters) |
| Framework | HuggingFace TRL 1.0.0 + DeepSpeed 0.18.9 |
| Transformers | 5.4.0 |
| Optimizer | AdamW (CPU offloaded via DeepSpeed ZeRO-3) |
| Learning Rate | 2e-5 (cosine schedule) |
| Warmup | 5% of steps |
| Weight Decay | 0.01 |
| Gradient Clipping | 1.0 |
| Epochs | 3 |
| Effective Batch Size | 32 (1 per GPU x 4 grad accum x 8 GPUs) |
| Max Sequence Length | 8192 (training context window) |
| Gradient Checkpointing | Enabled (non-reentrant) |
| Precision | BF16 |
| Total Steps | 303 |
| Seed | 42 |
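
The effective batch size and total step count follow from the values in this table and the 3,204 training examples listed under Datasets. A short check, assuming a final partial batch still counts as one optimizer step:

```python
import math

# Values from the hyperparameter table and the Datasets section.
PER_GPU_BATCH = 1
GRAD_ACCUM = 4
NUM_GPUS = 8
EPOCHS = 3
NUM_EXAMPLES = 3204

effective_batch = PER_GPU_BATCH * GRAD_ACCUM * NUM_GPUS
# ASSUMPTION: a final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(effective_batch, steps_per_epoch, total_steps)  # 32 101 303
```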

### Training Progression

| Metric | Step 1 | Step 50 | Step 114 | Step 150 | Step 214 | Step 303 (Final) |
|---|---|---|---|---|---|---|
| Loss | 0.870 | 0.498 | 0.244 | 0.210 | 0.115 | 0.062 |
| Token Accuracy | 78.0% | 84.4% | 91.5% | 93.5% | 96.5% | 98.1% |
| Learning Rate | 0 | 1.94e-5 | 1.51e-5 | 1.29e-5 | 4.66e-6 | 5.99e-10 |
| Epoch | 0.01 | 0.50 | 1.11 | 1.33 | 2.10 | 3.00 |

### Datasets

3,204 examples after quality filtering (required `<think>` tags and >200 characters of assistant content):

| Dataset | Examples | Description |
|---|---|---|
| [nohurry/Opus-4.6-Reasoning-3000x-filtered](https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered) | 2,321 | Claude Opus 4.6 reasoning traces (thinking + solution) |
| [TeichAI/claude-4.5-opus-high-reasoning-250x](https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x) | 250 | High-quality Claude reasoning conversations |
| [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | 633 | Qwen reasoning conversations |

### Data Format

Each training example follows this structure:
```
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
<think>
{chain-of-thought reasoning}
</think>

{final answer}<|im_end|>
```
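
As an illustration, one user/assistant pair can be rendered into this structure as follows. The function name `format_example` and the trailing newline after the last `<|im_end|>` are assumptions; the actual preprocessing code is not published in this card.

```python
def format_example(problem: str, reasoning: str, answer: str) -> str:
    """Render one user/assistant pair in the structure shown above.

    ASSUMPTION: each turn ends with a newline after <|im_end|>,
    matching what the chat template in this card emits.
    """
    return (
        "<|im_start|>user\n"
        f"{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"
        f"{reasoning}\n"
        "</think>\n"
        "\n"
        f"{answer}<|im_end|>\n"
    )


print(format_example("What is 2 + 2?", "2 + 2 equals 4.", "4"))
```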

### Quality Filter

Examples were filtered to require:
1. At least one assistant message containing `<think>` tags
2. Assistant content longer than 200 characters

This removed low-quality or non-reasoning examples from the combined dataset.
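
The two rules can be sketched as a predicate over one conversation. This is not the original filtering script. It assumes each example is a list of `{"role": ..., "content": ...}` messages, that "tags" means both `<think>` and `</think>` are present, and that length is measured on the raw assistant text including the reasoning.

```python
MIN_ASSISTANT_CHARS = 200  # threshold from rule 2 above


def passes_quality_filter(messages: list[dict]) -> bool:
    """Return True if a conversation satisfies both filter rules.

    ASSUMPTIONS: messages are {"role", "content"} dicts, both think tags
    must appear, and length counts the raw text including the reasoning.
    """
    assistant_texts = [
        m.get("content") or "" for m in messages if m.get("role") == "assistant"
    ]
    # Rule 1: at least one assistant message contains <think> tags.
    has_think = any("<think>" in t and "</think>" in t for t in assistant_texts)
    # Rule 2: assistant content longer than 200 characters.
    long_enough = any(len(t) > MIN_ASSISTANT_CHARS for t in assistant_texts)
    return has_think and long_enough
```

Applied to a list of conversations, `[c for c in conversations if passes_quality_filter(c)]` keeps only the reasoning examples.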

## Reasoning Format

The model produces reasoning inside `<think>...</think>` tags:

```
<think>
Let me analyze this step by step...
1. First consideration
2. Second consideration
3. Conclusion
</think>

Here is the final answer based on my analysis.
```
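
If the raw text reaches your application with the tags still inline (for example, when reasoning extraction is not enabled on the server), the two parts can be separated client-side. A minimal sketch follows; `split_reasoning` is an illustrative name. Because the chat template's generation prompt already ends with `<think>`, the opening tag may be missing from the generated text, so the sketch only relies on the closing tag.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    head, closing_tag, tail = text.partition("</think>")
    if not closing_tag:
        # No closing tag found: treat the whole output as the answer.
        return "", text.strip()
    reasoning = head.strip()
    # The opening tag is optional; drop it when present.
    if reasoning.startswith("<think>"):
        reasoning = reasoning[len("<think>"):].strip()
    return reasoning, tail.strip()


raw = "<think>\nLet me analyze this step by step...\n</think>\n\nHere is the final answer."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```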

## HF Safetensors

For the full-precision HuggingFace model (BF16 safetensors), see [samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled](https://huggingface.co/samuelcardillo/Qwen3-Coder-Next-Opus-4.6-Reasoning-Distilled).