Text Generation
Transformers
Safetensors
English
qwen3_5_moe
image-text-to-text
reasoning
distillation
chain-of-thought
qwen
qwen3.6
kimi
kimi-k2
mixture-of-experts
Mixture of Experts
lora
unsloth
conversational
Replace stub README with full model card (training details, dataset links, pending eval table)
README.md
CHANGED
|
@@ -1,21 +1,157 @@
|
|
| 1 |
---
|
| 2 |
-
base_model: unsloth/Qwen3.6-35B-A3B
|
| 3 |
-
tags:
|
| 4 |
-
- text-generation-inference
|
| 5 |
-
- transformers
|
| 6 |
-
- unsloth
|
| 7 |
-
- qwen3_5_moe
|
| 8 |
license: apache-2.0
|
| 9 |
language:
|
| 10 |
- en
|
| 11 |
---
|
| 12 |
|
| 13 |
-
#
|
| 14 |
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
+
library_name: transformers
|
| 6 |
+
pipeline_tag: text-generation
|
| 7 |
+
base_model: Qwen/Qwen3.6-35B-A3B
|
| 8 |
+
datasets:
|
| 9 |
+
- lordx64/reasoning-distill-kimi-k2-6-max-sft
|
| 10 |
+
tags:
|
| 11 |
+
- text-generation
|
| 12 |
+
- reasoning
|
| 13 |
+
- distillation
|
| 14 |
+
- chain-of-thought
|
| 15 |
+
- qwen
|
| 16 |
+
- qwen3.6
|
| 17 |
+
- kimi
|
| 18 |
+
- kimi-k2
|
| 19 |
+
- mixture-of-experts
|
| 20 |
+
- moe
|
| 21 |
+
- lora
|
| 22 |
+
- unsloth
|
| 23 |
+
model-index:
|
| 24 |
+
- name: Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled
|
| 25 |
+
results: []
|
| 26 |
---
|
| 27 |
|
| 28 |
+
# Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled
|
| 29 |
+
|
| 30 |
+
A reasoning-distilled variant of **Qwen3.6-35B-A3B** taught to imitate the chain-of-thought style of **Kimi K2.6**, the frontier reasoning model from Moonshot AI. The goal: port Kimi-grade reasoning behavior into a permissively licensed Mixture-of-Experts model that an individual can actually run.
|
| 31 |
+
|
| 32 |
+
This is the **second model in the same lineup** as [`lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled`](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled). Same base model, same training pipeline, same Unsloth + LoRA recipe — only the teacher differs. The two are designed to be **directly compared**, so users can see how reasoning style transfers from different upstream teachers into the same student architecture.
|
| 33 |
+
|
| 34 |
+
## Why this model
|
| 35 |
+
|
| 36 |
+
- **Kimi-style reasoning, open weights.** Kimi K2.6 is one of the strongest open-style reasoning models available, but only via the Moonshot API. This model has been fine-tuned on ~7.8k high-quality reasoning traces produced by Kimi K2.6, teaching the base to *think* before answering — with explicit `<think>…</think>` blocks — in Kimi's structure and cadence.
|
| 37 |
+
- **Verbose, deliberate reasoning.** Empirically, Kimi K2.6 produces **~3.4× longer reasoning chains than Claude Opus 4.7** at "max" effort (mean 2,933 tokens/row vs 849, p95 9,764 vs 2,404 — measured on this dataset's tokenized output). The student model trained here inherits that verbosity. If you want long, careful, deliberate chains of thought, this is the variant of the lineup to use.
|
| 38 |
+
- **Sparse activation, dense knowledge.** The base is a 35B-parameter MoE with **256 experts, 8 routed + 1 shared**, of which only about **3B parameters are active** per token. You get the capacity of a 35B model at the inference cost of a small dense model. Full-quality bf16 inference runs on 2× 80GB A100, 1× H200, or any 96GB+ single GPU. Quantized variants fit smaller setups (see below).
|
| 39 |
+
- **Long thinking supported.** 64k-token context. The model routinely emits 5–30k tokens of `<think>` reasoning on hard problems before giving the final answer, which is the whole point of reasoning models and the reason this one was trained on traces from a teacher that also reasons explicitly. Note: ~21% of training examples had reasoning chains longer than the 4,096-token training context and were truncated; the next iteration is planned with `MAX_SEQ_LENGTH=8192` to capture more of Kimi's longer chains (the p95 trace length of 9,764 tokens would still exceed that limit).
|
| 40 |
+
- **Companion to the Claude variant.** Use this when you want Kimi's longer/more deliberate reasoning. Use the Claude variant when you want shorter/tighter chains. Same base, same conversational interface, fully interchangeable for serving.
|
| 41 |
+
|
| 42 |
+
## Intended use
|
| 43 |
+
|
| 44 |
+
Built for hard reasoning: graduate-level STEM, competition math (AIME / MATH), code reasoning with explicit walk-through, multi-step logic puzzles, and agentic planning where explicit `<think>` helps correctness.
|
| 45 |
+
|
| 46 |
+
For short-turn, latency-sensitive conversational workloads, note that the thinking budget can be large (larger than the Claude variant's). Cap `max_new_tokens`, or post-process the output to strip `<think>…</think>` blocks, if you only want final answers in production.
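One way to do the post-processing step mentioned above is a small regular-expression filter. The sketch below is illustrative rather than part of this repository: `strip_think` is a hypothetical helper name, and it assumes the reasoning is delimited by literal `<think>` and `</think>` tags in the decoded text. It also handles two edge cases as a design choice: a lone closing tag (when the chat template opens the block inside the prompt) and an unterminated block (when generation stops at `max_new_tokens` mid-reasoning).

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_CLOSED_THINK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
# Matches a <think> block that was never closed (e.g. the token cap was hit).
_OPEN_THINK = re.compile(r"<think>.*\Z", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Return `text` with reasoning blocks removed (hypothetical helper)."""
    text = _CLOSED_THINK.sub("", text)
    # Assumption: if only a closing tag remains, the block was opened in the
    # prompt, so everything before the last closing tag is reasoning.
    if "</think>" in text:
        text = text.rsplit("</think>", 1)[1]
    # Drop reasoning that was cut off before its closing tag.
    text = _OPEN_THINK.sub("", text)
    return text.strip()


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
print(strip_think(sample))
```

Dropping an unterminated block means a truncated generation yields an empty answer rather than leaked reasoning; callers that prefer to retry with a larger budget can check for an empty return value.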
|
| 47 |
+
|
| 48 |
+
## How to use
|
| 49 |
+
|
| 50 |
+
```python
|
| 51 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 52 |
+
import torch
|
| 53 |
+
|
| 54 |
+
repo = "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled"
|
| 55 |
+
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
|
| 56 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 57 |
+
repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
|
| 58 |
+
)
|
| 59 |
+
|
| 60 |
+
messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
|
| 61 |
+
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
|
| 62 |
+
out = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
|
| 63 |
+
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
|
| 64 |
+
```
|
| 65 |
+
|
| 66 |
+
### Serving with vLLM
|
| 67 |
+
|
| 68 |
+
Recommended backend: **vLLM** for serving — the MoE routing + KV cache benefit significantly from continuous batching.
|
| 69 |
+
|
| 70 |
+
```bash
|
| 71 |
+
vllm serve lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled \
|
| 72 |
+
--dtype bfloat16 \
|
| 73 |
+
--max-model-len 65536 \
|
| 74 |
+
--gpu-memory-utilization 0.9 \
|
| 75 |
+
--trust-remote-code
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
The `--trust-remote-code` flag is required: the Qwen3.6 tokenizer ships custom code that vLLM and `transformers` need explicit permission to execute.
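Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `chat`, and the `max_tokens` and `timeout` values, are illustrative choices rather than settings from this card.

```python
import json
import urllib.request

MODEL = "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled"
# Assumption: vLLM's default host and port; adjust if you pass --host/--port.
BASE_URL = "http://localhost:8000"


def build_chat_request(prompt: str, max_tokens: int = 4096) -> dict:
    """Build a Chat Completions payload for a single user turn."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # caps reasoning and answer tokens together
    }


def chat(prompt: str, max_tokens: int = 4096, timeout: float = 600.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Example call (requires a running `vllm serve` instance):
#   print(chat("What is the capital of France?"))
```

Because the model can think for a long time before answering, a generous client timeout is usually safer than the library default.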
|
| 79 |
+
|
| 80 |
+
### GGUF (LM Studio / llama.cpp)
|
| 81 |
+
|
| 82 |
+
Quantized GGUF weights are coming. Watch this card or the [author's HF profile](https://huggingface.co/lordx64) for `*-IQ4_XS-GGUF`, `*-Q5_K_M-GGUF`, `*-Q8_0-GGUF` companion repos — and an [APEX-GGUF](https://github.com/mudler/apex-quant) variant once the community quanters pick it up (the [Claude variant's APEX quant](https://huggingface.co/mudler/Qwen3.5-35B-A3B-Claude-Distilled-APEX-GGUF) is the precedent for this lineup).
|
| 83 |
+
|
| 84 |
+
## Training
|
| 85 |
+
|
| 86 |
+
| | |
|
| 87 |
+
|---|---|
|
| 88 |
+
| Base model | `Qwen/Qwen3.6-35B-A3B` (loaded via `unsloth/Qwen3.6-35B-A3B` for faster finetuning) |
|
| 89 |
+
| Teacher | Kimi K2.6 (Moonshot AI), accessed via OpenRouter |
|
| 90 |
+
| Training dataset | [`lordx64/reasoning-distill-kimi-k2-6-max-sft`](https://huggingface.co/datasets/lordx64/reasoning-distill-kimi-k2-6-max-sft) — reasoning traces from Kimi K2.6 reformatted into SFT conversations (ChatML + `<think>…</think>`) |
|
| 91 |
+
| Source dataset | [`lordx64/reasoning-distill-kimi-k2-6-max`](https://huggingface.co/datasets/lordx64/reasoning-distill-kimi-k2-6-max) — raw teacher traces (pre-SFT formatting) |
|
| 92 |
+
| Dataset size | 7,836 full conversations, assistant side trained including `<think>…</think>` |
|
| 93 |
+
| Source prompts | Drawn from `Delta-Vector/Tauri-Physical-Reasoning`, multiple `TeichAI` Claude reasoning sets, and `Crownelius Opus-4.6-Reasoning-2100x` — same prompt distribution as the Claude variant for direct teacher-comparability |
|
| 94 |
+
| Method | SFT with Unsloth + TRL `SFTTrainer` + `train_on_responses_only` (loss only on assistant tokens) |
|
| 95 |
+
| LoRA config | `r=16, alpha=16, dropout=0.0, targets=["q_proj","k_proj","v_proj","o_proj"]` (attention-only) |
|
| 96 |
+
| Hyperparameters | `lr=2e-5`, cosine schedule, `warmup_ratio=0.03`, `weight_decay=0.01`, optimizer `adamw_8bit` |
|
| 97 |
+
| Batch | `per_device=1, grad_accum=16, effective=16`, 2 epochs = 980 steps |
|
| 98 |
+
| Sequence | 4,096 tokens during training (64k usable at inference — base supports it natively) |
|
| 99 |
+
| Precision | bf16 on 1× H200 141GB (HF Inference Endpoint, custom container) |
|
| 100 |
+
| Trainable | 3.44M params out of 35.1B (0.01%) |
|
| 101 |
+
| Wall-clock | ~21 hours on H200 |
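
The step count and trainable fraction in the table follow from simple arithmetic. The sketch below reproduces them from the table's own numbers. It assumes that the final partial batch of each epoch still counts as one optimizer step (rounding up), and the variable names are illustrative.

```python
import math

dataset_size = 7_836      # full conversations (from the table)
per_device_batch = 1
grad_accum = 16
epochs = 2

effective_batch = per_device_batch * grad_accum
# Assumption: a final partial batch still triggers an optimizer step.
steps_per_epoch = math.ceil(dataset_size / effective_batch)
total_steps = steps_per_epoch * epochs

trainable_params = 3.44e6  # LoRA parameters (from the table)
total_params = 35.1e9      # full model parameters (from the table)
trainable_pct = 100 * trainable_params / total_params

print(f"effective batch: {effective_batch}")
print(f"optimizer steps: {total_steps}")
print(f"trainable share: {trainable_pct:.3f}%")
```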
|
| 102 |
+
|
| 103 |
+
### Training-time observations
|
| 104 |
+
|
| 105 |
+
- **Loss curve**: descended cleanly from ~0.95 (warmup) → ~0.83 (mid-training), gradient norms steady at ~0.005, no instability throughout 980 steps. Cosine LR decayed from peak 2e-5 to ~0 by the final step.
|
| 106 |
+
- **FLA fast-path disabled**: Unsloth's runtime check rejected the compiled `causal-conv1d==1.6.1` binary on H200/cc-9.0 as ABI-incompatible, forcing the Gated DeltaNet linear-attention layers to run on the slower torch fallback. This is a known issue for this stack and added an estimated ~30–50% to per-step time. Future runs in this lineup will pin `causal-conv1d` to a binary-compatible version.
|
| 107 |
+
- **Token verbosity**: Kimi K2.6 traces averaged 2,933 tokens (mean) and 9,764 tokens (p95) versus 849 / 2,404 for the matched Opus 4.7 dataset — an **effective ~2.5× compute multiplier** for distillation runs at fixed `MAX_SEQ_LENGTH`. Treat this as a budgeting prior when scoping future verbose-teacher distillations.
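
To make the verbosity figures concrete, the sketch below computes the teacher-to-teacher length ratios from the statistics quoted above, and illustrates why truncation at a fixed `MAX_SEQ_LENGTH` makes the effective compute multiplier smaller than the raw length ratio. The `truncated_mean` helper and the toy length lists are hypothetical illustrations, not the real per-row token counts from the datasets.

```python
# Statistics quoted in this card (tokens per row).
kimi_mean, kimi_p95 = 2_933, 9_764
opus_mean, opus_p95 = 849, 2_404

mean_ratio = kimi_mean / opus_mean  # raw verbosity ratio on the mean
p95_ratio = kimi_p95 / opus_p95     # raw verbosity ratio at p95


def truncated_mean(lengths, cap):
    """Mean number of tokens actually trained on when rows are cut at `cap`."""
    return sum(min(n, cap) for n in lengths) / len(lengths)


# Hypothetical toy distributions, for illustration only.
toy_verbose = [1_000, 5_000, 9_000]
toy_terse = [500, 900, 2_000]
cap = 4_096

raw = (sum(toy_verbose) / len(toy_verbose)) / (sum(toy_terse) / len(toy_terse))
effective = truncated_mean(toy_verbose, cap) / truncated_mean(toy_terse, cap)

print(f"mean ratio: {mean_ratio:.2f}, p95 ratio: {p95_ratio:.2f}")
print(f"toy raw ratio: {raw:.2f}, toy effective ratio at cap: {effective:.2f}")
```

In the toy data the long rows are clipped at the cap while the short rows are unaffected, so the ratio of tokens actually processed is lower than the ratio of raw lengths.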
|
| 108 |
+
|
| 109 |
+
### Why attention-only LoRA on a MoE
|
| 110 |
+
|
| 111 |
+
The initial plan was full LoRA including the MoE expert FFNs (`gate_proj/up_proj/down_proj`). In the course of the sister Claude project I filed and upstreamed a shape-mismatch fix to unsloth-zoo's MoE+LoRA grouped-mm path — [unslothai/unsloth-zoo#601](https://github.com/unslothai/unsloth-zoo/pull/601) — without which the expert-LoRA forward crashes on Qwen3.6's 256-expert layout. Even with that fix, single-GPU memory made expert-LoRA impractical for this run. Attention-only captures most of the signal on *style* distillation at a fraction of the trainable parameter count and memory footprint, and matches the recipe used for the Claude sibling so the two student runs are directly comparable.
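
For readers who want to see what "attention-only" means in configuration terms, the sketch below writes the LoRA settings from the training table as a plain dictionary and applies the standard LoRA parameter-count formula. The dictionary keys follow the argument names of PEFT's `LoraConfig`, but the actual run went through Unsloth, so treat the mapping as an assumption. The layer dimensions in the example are hypothetical placeholders, not Qwen3.6's real shapes.

```python
# LoRA settings from the training table, keyed like peft.LoraConfig arguments.
# Assumption: the real run passed equivalent values through Unsloth.
attention_only_lora = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Expert FFN projections that the attention-only recipe leaves untouched.
expert_ffn_modules = ["gate_proj", "up_proj", "down_proj"]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical 2048 x 2048 projection, purely to illustrate the formula.
example = lora_param_count(2048, 2048, attention_only_lora["r"])
print(f"adapter params for one 2048x2048 projection: {example}")
```

The formula shows why skipping the experts keeps the adapter small: each adapted matrix adds its own `r * (d_in + d_out)` parameters, and a 256-expert layout multiplies the number of FFN matrices per layer.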
|
| 112 |
+
|
| 113 |
+
## Evaluation
|
| 114 |
+
|
| 115 |
+
Formal benchmarks are pending. Numbers will land here once the [`training/eval.py`](https://github.com/lordx64/distillation/blob/main/training/eval.py) sweep (vLLM + lm-eval-harness, with `<think>` block stripping) finishes.
|
| 116 |
+
|
| 117 |
+
| Benchmark | Setup | Score | Status |
|
| 118 |
+
|---|---|---|---|
|
| 119 |
+
| GSM8K | 8-shot CoT, 300 examples | _pending_ | 🟡 in queue |
|
| 120 |
+
| MMLU-Pro | 5-shot, 500 examples | _pending_ | 🟡 in queue |
|
| 121 |
+
| GPQA Diamond | 0-shot CoT, 198 problems | _pending_ | 🟡 in queue |
|
| 122 |
+
| AIME 2024 | 0-shot, 30 problems | _pending_ | 🟡 in queue |
|
| 123 |
+
| AIME 2025 | 0-shot, 30 problems | _pending_ | 🟡 in queue |
|
| 124 |
+
| MATH-500 | 0-shot, 100 problems | _pending_ | 🟡 in queue |
|
| 125 |
+
|
| 126 |
+
Comparison baselines will include:
|
| 127 |
+
|
| 128 |
+
- `Qwen/Qwen3.6-35B-A3B` (base, untuned)
|
| 129 |
+
- `lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled` (sibling distillation, different teacher)
|
| 130 |
+
|
| 131 |
+
The point of the lineup is the comparison, not the absolute score. We expect the Kimi variant to spend more tokens reasoning (per the verbosity prior above), so wall-clock-fair comparisons matter as much as accuracy.
|
| 132 |
+
|
| 133 |
+
## Limitations and caveats
|
| 134 |
+
|
| 135 |
+
- **Inherits base limitations.** Anything `Qwen/Qwen3.6-35B-A3B` is bad at, this model is also bad at. Distillation transfers reasoning style; it does not add factual knowledge.
|
| 136 |
+
- **Not safety-tuned beyond the base.** No additional RLHF or safety alignment pass was performed. The model will reason out loud about anything it's asked to. Add your own guardrails before exposing to end users.
|
| 137 |
+
- **Long generations.** As noted, Kimi-style reasoning is verbose. Budget tokens accordingly: `max_new_tokens=32768` is the recommended setting for hard problems, with lower values for shorter Q&A.
|
| 138 |
+
- **Truncated training reasoning.** ~21% of training examples were truncated at the 4,096-token training context, so on questions that elicit very long teacher chains the model was partially trained on cut-off thoughts. Future iterations will use `MAX_SEQ_LENGTH=8192` to reduce truncation, although the p95 trace length (9,764 tokens) still exceeds that limit.
|
| 139 |
+
- **Apache-2.0 license** matches the base. Use freely for commercial and research work; attribution appreciated but not required.
|
| 140 |
+
|
| 141 |
+
## Citation
|
| 142 |
|
| 143 |
+
```bibtex
|
| 144 |
+
@misc{lordx64_qwen36_kimi_distill_2026,
|
| 145 |
+
author = {lordx64},
|
| 146 |
+
title = {Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled},
|
| 147 |
+
year = {2026},
|
| 148 |
+
url = {https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled},
|
| 149 |
+
}
|
| 150 |
+
```
|
| 151 |
|
| 152 |
+
## Acknowledgements
|
| 153 |
|
| 154 |
+
- **Moonshot AI** for Kimi K2.6, the teacher whose reasoning style this model emulates.
|
| 155 |
+
- **Qwen team** for the strong open-weights MoE base.
|
| 156 |
+
- **Unsloth** for the fast-finetuning stack that made this run tractable.
|
| 157 |
+
- The wider open-weights reasoning-distillation community whose prompt sets and methodology informed the dataset construction.
|