Instructions to use lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled with libraries, inference providers, notebooks, and local apps.

Libraries

How to use lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled")
# The text-generation pipeline takes plain-text chat messages as its first positional argument
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
pipe(messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled")
model = AutoModelForMultimodalLM.from_pretrained("lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled

SGLang

How to use lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled to start chatting

Load model with FastModel

# Install Unsloth first (shell): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled",
    max_seq_length=2048,
)

Docker Model Runner
How to use lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled with Docker Model Runner:
```
docker model run hf.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled
```

license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3.6-35B-A3B
datasets:
  - lordx64/reasoning-distill-kimi-k2-6-max-sft
tags:
  - text-generation
  - reasoning
  - distillation
  - chain-of-thought
  - qwen
  - qwen3.6
  - kimi
  - kimi-k2
  - mixture-of-experts
  - moe
  - lora
  - unsloth
model-index:
  - name: Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled
    results: []

Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled

A reasoning-distilled variant of Qwen3.6-35B-A3B taught to imitate the chain-of-thought style of Kimi K2.6, the frontier reasoning model from Moonshot AI. The goal: port Kimi-grade reasoning behavior into a permissively-licensed Mixture-of-Experts model that an individual can actually run.

This is the second model in the same lineup as lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled. Same base model, same training pipeline, same Unsloth + LoRA recipe — only the teacher differs. The two are designed to be directly compared, so users can see how reasoning style transfers from different upstream teachers into the same student architecture.

Why this model

Kimi-style reasoning, open weights. Kimi K2.6 is one of the strongest open-style reasoning models available, but only via the Moonshot API. This model has been fine-tuned on ~7.8k high-quality reasoning traces produced by Kimi K2.6, teaching the base to think before answering — with explicit <think>…</think> blocks — in Kimi's structure and cadence.
Verbose, deliberate reasoning. Empirically, Kimi K2.6 produces ~3.4× longer reasoning chains than Claude Opus 4.7 at "max" effort (mean 2,933 tokens/row vs 849, p95 9,764 vs 2,404 — measured on this dataset's tokenized output). The student model trained here inherits that verbosity. If you want long, careful, deliberate chains of thought, this is the variant of the lineup to use.
Sparse activation, dense knowledge. The base is a 35B-parameter MoE with 256 experts; each token is routed to 8 experts plus 1 shared expert, so only about 3B parameters are active per token. You get the capacity of a 35B model at roughly the per-token compute cost of a small dense model (all weights must still fit in memory). Full-quality bf16 inference runs on 2× 80GB A100, 1× H200, or any 96GB+ single GPU. Quantized variants fit smaller setups (see below).
Long thinking supported. 64k token context. The model routinely emits 5–30k tokens of <think> reasoning on hard problems before giving the final answer — which is the whole point of reasoning models, and why this one was specifically trained end-to-end with an upstream teacher that also reasons explicitly.
Companion to the Claude variant. Use this when you want Kimi's longer/more deliberate reasoning. Use the Claude variant when you want shorter/tighter chains. Same base, same conversational interface, fully interchangeable for serving.

Intended use

Built for hard reasoning: graduate-level STEM, competition math (AIME / MATH), code reasoning with explicit walk-through, multi-step logic puzzles, and agentic planning where explicit <think> helps correctness.

For latency-sensitive, short-turn conversational workloads, note that the thinking budget can be large (larger than the Claude variant's). Cap max_new_tokens, or post-process the output to strip <think>…</think> blocks, if you only want final answers in production.
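The post-processing step mentioned above can be sketched as follows. This is a minimal illustration, not the wrapper used for this model's evaluation: the helper name `strip_think` is hypothetical, and it assumes the `<think>` and `</think>` tags appear as literal text in the decoded output.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Return only the final answer, dropping any reasoning blocks.

    Hypothetical helper: assumes the tags survive decoding as plain text.
    """
    # 1. Remove every properly closed reasoning block.
    cleaned = _THINK_BLOCK.sub("", text)
    # 2. If the opening tag was part of the prompt, only a stray closing tag
    #    remains; keep what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.rsplit("</think>", 1)[1]
    # 3. If generation hit max_new_tokens mid-thought, an unclosed <think>
    #    remains; drop everything from that tag onward.
    cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()


raw = "<think>\n7 * 6 = 42\n</think>\n\nThe answer is 42."
print(strip_think(raw))  # prints: The answer is 42.
```

An empty return value is a useful signal that the token budget ran out before the model finished thinking, so the caller can retry with a larger max_new_tokens.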

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

messages = [{"role": "user", "content": "How many positive integers less than 1000 have digits that sum to 20?"}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt",
).to(model.device)
# Greedy decoding with a large budget to leave room for the <think> block.
out = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Serving with vLLM

Recommended backend: vLLM for serving — the MoE routing + KV cache benefit significantly from continuous batching.

vllm serve lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code

The --trust-remote-code flag is required: the Qwen3.6 tokenizer ships custom code that vLLM and transformers need explicit permission to execute.
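Once the server is up, it can be queried from Python over its OpenAI-compatible API. The sketch below uses only the standard library and assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `ask` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL = "lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled"


def build_chat_request(
    prompt: str,
    max_tokens: int = 4096,
    base_url: str = "http://localhost:8000",  # assumption: default vLLM port
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        # Cap the generation length; reasoning chains can be very long.
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` returns the assistant message as a string. Depending on server configuration, the reasoning may be included in that string or returned in a separate field.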

GGUF (LM Studio / llama.cpp / Ollama)

Quantized GGUF weights for llama.cpp, LM Studio, and Ollama are published in a sibling repo:

lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-GGUF

Quant	Approx size	Use case	File
IQ4_XS	18.94 GB	Smallest — fits on a single 24 GB consumer GPU; LM Studio default	`Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.IQ4_XS.gguf`
Q5_K_M	24.73 GB	Balanced quality / size, recommended sweet spot	`Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.Q5_K_M.gguf`
Q8_0	36.90 GB	Near-lossless, closest to bf16 quality	`Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.Q8_0.gguf`

LM Studio

Search for lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled in LM Studio's model browser. The IQ4_XS quant will show up as the default suggestion.

llama.cpp

huggingface-cli download lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-GGUF \
  Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.IQ4_XS.gguf \
  --local-dir ./models

llama-server -m ./models/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled.IQ4_XS.gguf \
  --ctx-size 65536 \
  --n-gpu-layers -1 \
  --jinja

The --jinja flag is recommended so llama.cpp uses the model's bundled chat template (which preserves <think> blocks correctly).

APEX-GGUF (community)

An APEX-GGUF variant maintained by @mudler — the canonical MoE-aware quantization recipe — may follow once the community picks it up. The Claude variant's APEX quant is the precedent in this lineup.

Training


Base model	`Qwen/Qwen3.6-35B-A3B` (loaded via `unsloth/Qwen3.6-35B-A3B` for faster finetuning)
Teacher	Kimi K2.6 (Moonshot AI), accessed via OpenRouter
Training dataset	`lordx64/reasoning-distill-kimi-k2-6-max-sft` — reasoning traces from Kimi K2.6 reformatted into SFT conversations (ChatML + `<think>…</think>`)
Source dataset	`lordx64/reasoning-distill-kimi-k2-6-max` — raw teacher traces (pre-SFT formatting)
Dataset size	7,836 full conversations, assistant side trained including `<think>…</think>`
Source prompts	Drawn from `Delta-Vector/Tauri-Physical-Reasoning`, multiple `TeichAI` Claude reasoning sets, and `Crownelius Opus-4.6-Reasoning-2100x` — same prompt distribution as the Claude variant for direct teacher-comparability
Method	SFT with Unsloth + TRL `SFTTrainer` + `train_on_responses_only` (loss only on assistant tokens)
LoRA config	`r=16, alpha=16, dropout=0.0, targets=["q_proj","k_proj","v_proj","o_proj"]` (attention-only)
Hyperparameters	`lr=2e-5`, cosine schedule, `warmup_ratio=0.03`, `weight_decay=0.01`, optimizer `adamw_8bit`
Batch	`per_device=1, grad_accum=16, effective=16`, 2 epochs = 980 steps
Sequence	4,096 tokens during training (64k usable at inference — base supports it natively)
Precision	bf16 on 1× H200 141GB (HF Inference Endpoint, custom container)
Trainable	3.44M params out of 35.1B (0.01%)
Wall-clock	~21 hours on H200
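
The step count and trainable-parameter fraction in the table follow from the other rows. The short check below reproduces them; rounding the last partial batch of each epoch up to a full optimizer step is an assumption, chosen because it matches the reported 980 steps.

```python
import math

# Values taken from the training table above.
num_examples = 7_836
per_device_batch = 1
grad_accum = 16
num_gpus = 1
epochs = 2

effective_batch = per_device_batch * grad_accum * num_gpus
# Assumption: a trailing partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

trainable_params = 3.44e6
total_params = 35.1e9
trainable_pct = 100 * trainable_params / total_params

print(effective_batch, steps_per_epoch, total_steps)  # prints: 16 490 980
print(f"{trainable_pct:.4f}%")                        # prints: 0.0098%
```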

Training-time observations

Loss curve: descended cleanly from ~0.95 (warmup) → ~0.83 (mid-training), gradient norms steady at ~0.005, no instability throughout 980 steps. Cosine LR decayed from peak 2e-5 to ~0 by the final step.
FLA fast-path disabled: Unsloth's runtime check rejected the compiled causal-conv1d==1.6.1 binary on H200/cc-9.0 as ABI-incompatible, forcing the Gated DeltaNet linear-attention layers to run on the slower torch fallback. This is a known issue for this stack and added an estimated ~30–50% to per-step time. Future runs in this lineup will pin causal-conv1d to a binary-compatible version.
Token verbosity: Kimi K2.6 traces averaged 2,933 tokens (mean) and 9,764 tokens (p95) versus 849 / 2,404 for the matched Opus 4.7 dataset — an effective ~2.5× compute multiplier for distillation runs at fixed MAX_SEQ_LENGTH. Treat this as a budgeting prior when scoping future verbose-teacher distillations.
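
As a budgeting aid, the figures above can be turned into ratios. The sketch below computes the raw length ratios between the two teachers and converts a per-step overhead into the fraction of time that removing it would save. The ~2.5× compute multiplier quoted above depends on truncation at MAX_SEQ_LENGTH and cannot be reproduced from these summary statistics alone.

```python
# Trace-length statistics (tokens) reported above.
kimi = {"mean": 2933, "p95": 9764}
opus = {"mean": 849, "p95": 2404}

ratios = {stat: kimi[stat] / opus[stat] for stat in ("mean", "p95")}
for stat, ratio in ratios.items():
    print(f"{stat}: {ratio:.2f}x")
# prints:
# mean: 3.45x
# p95: 4.06x


def saving_from_overhead(overhead: float) -> float:
    """Fraction of per-step time saved if `overhead` (0.3 means +30%) is removed."""
    return 1 - 1 / (1 + overhead)


for overhead in (0.30, 0.50):
    print(f"+{overhead:.0%} overhead -> {saving_from_overhead(overhead):.0%} saved")
# prints:
# +30% overhead -> 23% saved
# +50% overhead -> 33% saved
```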

Why attention-only LoRA on a MoE

The initial plan was full LoRA including the MoE expert FFNs (gate_proj/up_proj/down_proj). In the course of the sister Claude project I filed and upstreamed a shape-mismatch fix to unsloth-zoo's MoE+LoRA grouped-mm path — unslothai/unsloth-zoo#601 — without which the expert-LoRA forward crashes on Qwen3.6's 256-expert layout. Even with that fix, single-GPU memory made expert-LoRA impractical for this run. Attention-only captures most of the signal on style distillation at a fraction of the trainable parameter count and memory footprint, and matches the recipe used for the Claude sibling so the two student runs are directly comparable.

Evaluation

Methodology: vLLM + lm-eval-harness with a custom <think>-stripping wrapper, max_gen_toks=16384 to allow full Kimi-style reasoning chains before answer extraction. Each model evaluated under identical conditions on a single H200. See training/eval.py.

TL;DR — what this distillation actually does, based on the head-to-head data below:

The distillation makes the model unconditionally think, but doesn't improve raw reasoning capability over the base.

On benchmarks where the base fails to invoke its own thinking (GSM8K under fewshot pattern, the per-subject MMLU-Pro results): distill +12 to +41 pp.

On benchmarks where the base already thinks natively (MATH-500, GPQA Diamond): distill −4 to −6 pp (a small style-imitation cost).

Use this model when you want predictable <think>-block reasoning on prompts that don't trigger the base's thinking mode (fewshot evals, plain Q&A, latency-sensitive deploys with fixed prompt templates). Use the base directly when you can already prompt it for zero-shot CoT.

Head-to-head: Kimi-Distill vs Base

Benchmark	Setup	Base Qwen3.6-35B-A3B	Kimi-Distill (this model)	Δ
GSM8K	8-shot CoT, 300 examples, strict-match	64.00%	92.67%	+28.67 pp ✅
MATH-500	0-shot, 100 problems, math_verify	53.00%	47.00%	-6.00 pp
GPQA Diamond	0-shot CoT, 198 problems, flex-extract	79.29%	75.25%	-4.04 pp
MMLU-Pro math	5-shot, custom-extract	27.20%	64.80%	+37.60 pp ✅
MMLU-Pro CS	5-shot, custom-extract	20.49%	61.46%	+40.97 pp ✅
MMLU-Pro engineering	5-shot, custom-extract	18.60%	30.80%	+12.20 pp ✅
MMLU-Pro chemistry	5-shot, custom-extract	13.00%	26.60%	+13.60 pp ✅
MMLU-Pro overall	5-shot, custom-extract	6.35%	14.67%	+8.32 pp (extractor-affected for both)
AIME 2024 / 2025	0-shot, 30 problems, strict-match	0.00%	0.00%	extractor format issue (see note)
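
The Δ column can be recomputed from the two score columns, which also makes the split into gains and regressions explicit. This is a small sanity-check sketch over the numbers in the table; AIME is omitted because both scores are affected by the extractor issue.

```python
# (base, distill) accuracy in percent, copied from the table above.
scores = {
    "GSM8K": (64.00, 92.67),
    "MATH-500": (53.00, 47.00),
    "GPQA Diamond": (79.29, 75.25),
    "MMLU-Pro math": (27.20, 64.80),
    "MMLU-Pro CS": (20.49, 61.46),
    "MMLU-Pro engineering": (18.60, 30.80),
    "MMLU-Pro chemistry": (13.00, 26.60),
    "MMLU-Pro overall": (6.35, 14.67),
}

# Percentage-point difference, rounded to the table's two decimals.
deltas = {name: round(distill - base, 2) for name, (base, distill) in scores.items()}

gains = [name for name, d in deltas.items() if d > 0]
regressions = [name for name, d in deltas.items() if d < 0]

for name, d in deltas.items():
    print(f"{name:<22} {d:+.2f} pp")
print("regressions:", regressions)  # prints: regressions: ['MATH-500', 'GPQA Diamond']
```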

What the data is actually telling us

Two patterns, depending on whether the base model invokes its own thinking under the eval prompt:

Where the base fails to invoke thinking (GSM8K under fewshot pattern, most MMLU-Pro subjects), the distillation wins by a wide margin: GSM8K +28.67 pp, MMLU-Pro Math +37.60 pp, MMLU-Pro CS +40.97 pp. The Kimi-distill emits <think> blocks unconditionally — it doesn't follow the fewshot pattern; it reasons regardless. That's a real prompt-robustness gain.
Where the base already thinks natively (MATH-500, GPQA Diamond), the distill is slightly worse: -6.00 pp on MATH-500, -4.04 pp on GPQA. Style imitation incurs a small cost when the base was already doing the right thing.

So the honest framing: this distillation transfers Kimi K2.6's verbose reasoning style and makes the model think unconditionally. It does not add raw reasoning capability or factual knowledge over the base. Use the Kimi-distill when you want predictable thinking on prompts that don't trigger the base's own thinking mode (fewshot evals, plain Q&A patterns); use the base directly when you can already prompt it correctly (zero-shot CoT on math/STEM).

Notes on the methodology issues

AIME 2024 / 2025 — 0% is cosmetic, not a real model failure. Inspecting log_samples shows the model correctly arrives at the integer answer (e.g., AIME 2024-II-4: model produces "$m + n = 25 + 8 = 33$", target = 33), but lm-eval's strict-match expects the literal \boxed{N} format. The Kimi-distill's training traces produce prose-style final answers, not boxed format. A custom extractor is in the queue.
MMLU-Pro overall is depressed by the extractor for both models. The per-subject results above show the real signal: the distilled model scores far higher on quantitative subjects.
MATH-500 base score: the base was re-run with the sympy / math_verify deps installed, and the resulting 53.00% is the figure shown in the table.
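
To illustrate the AIME extraction problem described above, here is a rough sketch of a more lenient extractor. It is not the custom extractor the author has queued and not part of lm-eval; the function name and the last-integer fallback are assumptions made for illustration. It expects text with the `<think>` block already removed, and relies on AIME answers being non-negative integers.

```python
import re
from typing import Optional

# Literal \boxed{N} with an integer payload.
_BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
# Any run of digits; AIME answers are integers from 0 to 999.
_INTEGER = re.compile(r"\d+")


def extract_final_integer(answer_text: str) -> Optional[int]:
    """Prefer the last \\boxed{N}; otherwise fall back to the last integer."""
    boxed = _BOXED.findall(answer_text)
    if boxed:
        return int(boxed[-1])
    numbers = _INTEGER.findall(answer_text)
    return int(numbers[-1]) if numbers else None


print(extract_final_integer("So $m + n = 25 + 8 = 33$"))     # prints: 33
print(extract_final_integer(r"The answer is \boxed{33}."))   # prints: 33
```

The last-integer fallback is deliberately crude: it can pick up a trailing number that is not the answer, so a stricter scorer would restrict it to the final sentence.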

Limitations and caveats

Inherits base limitations. Anything Qwen/Qwen3.6-35B-A3B is bad at, this model is also bad at. Distillation transfers reasoning style; it does not add factual knowledge.
Not safety-tuned beyond the base. No additional RLHF or safety alignment pass was performed. The model will reason out loud about anything it's asked to. Add your own guardrails before exposing to end users.
Long generations. As noted, Kimi-style reasoning is verbose. Budget tokens accordingly: max_new_tokens=32768 is recommended for hard problems, with lower values for shorter Q&A.
Apache-2.0 license matches the base. Use freely for commercial and research work; attribution appreciated but not required.

Roadmap

The next iteration in this lineup will:

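Bump training context to MAX_SEQ_LENGTH=8192 to capture far more of Kimi's long reasoning chains than the current 4,096-token window. The p95 trace length is ~9.7k tokens (see Training-time observations), so the very longest traces will still be truncated, but the student will see complete chains on many more of the hardest problems.
Pin a binary-compatible causal-conv1d version on H200 to re-enable the FLA fast path. Given the estimated ~30–50% per-step overhead of the torch fallback, this should cut per-step training time by roughly a quarter to a third.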
Eval-driven dataset curation: once formal benchmark numbers land for both this and the Claude sibling, the next dataset will be biased toward the question categories where each teacher most outperforms the base — making each successive distillation more efficient per training token.
Companion adapter releases: stand-alone LoRA adapter weights (separate from the merged model published here) so users can stack the Kimi reasoning style on top of other Qwen3.6-35B-A3B fine-tunes.

Citation

@misc{lordx64_qwen36_kimi_distill_2026,
  author       = {lordx64},
  title        = {Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled},
  year         = {2026},
  url          = {https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled},
}

Acknowledgements

Moonshot AI for Kimi K2.6, the teacher whose reasoning style this model emulates.
Qwen team for the strong open-weights MoE base.
Unsloth for the fast-finetuning stack that made this run tractable.
The wider open-weights reasoning-distillation community whose prompt sets and methodology informed the dataset construction.