Qwen3.6-35B-A3B: GLM-5.1 + Claude 4.7 Reasoning Distilled

A dual-teacher distilled variant of Qwen3.6-35B-A3B that combines reasoning behaviors from two teachers: GLM-5.1 (a 754B MoE model, state of the art on agentic and coding tasks) and Claude Opus 4.7 (Anthropic's frontier reasoning model).

Architecture

  • Base: Qwen3.6-35B-A3B, with 35B total parameters and ~3B active per token (256 experts; 8 routed + 1 shared active per token)
  • Quantization: Unsloth mixed-precision affine (4/5/6/8-bit per layer)
  • Context: 64k tokens natively supported
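
To make the "~3B active per token" figure more concrete, the sketch below routes a single token through a toy router that keeps the 8 highest-scoring of 256 experts. It assumes the 256 figure counts routed experts, and the random logits and softmax-over-selected-logits gating are illustrative choices, not Qwen's actual router code.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # from the architecture list (assumed to count routed experts)
TOP_K = 8                 # routed experts selected per token
NUM_SHARED = 1            # shared expert, always active

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the top_k experts.

    Gate weights are a softmax over the selected logits only; this
    normalization is an assumption made for illustration.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
# Stand-in router scores for one token; a real router computes these
# from the token's hidden state.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(logits)

routed_fraction = TOP_K / NUM_ROUTED_EXPERTS
print(f"routed experts used: {len(selected)}/{NUM_ROUTED_EXPERTS} ({routed_fraction:.1%})")
print(f"plus {NUM_SHARED} shared expert that every token passes through")
```

Because only the selected experts run for a given token, compute per token scales with the active parameters rather than the 35B total.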

Training

Stage 1: Claude Opus 4.7 Distillation (from base model)

Base: Qwen/Qwen3.6-35B-A3B via Unsloth
Teacher: Claude Opus 4.7 (Anthropic API)
Dataset: lordx64/reasoning-distill-opus-4-7-max-sft (~7,800 conversations)
Method: SFT + LoRA (attention-only: q/k/v/o_proj), train_on_responses_only
Config: r=16, alpha=16, lr=2e-5, cosine, warmup 3%, adamw_8bit, 2 epochs
Source: splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ4e
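
The "cosine, warmup 3%" entry can be read as a linear ramp up to the peak learning rate followed by a cosine decay toward zero. The sketch below reproduces that general shape for the Stage 1 values; the total step count of 1,000 is a made-up number for illustration, since the card does not state it, and real schedulers may round the warmup boundary slightly differently.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and batch size

for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at(step, TOTAL_STEPS):.2e}")
```

Passing peak_lr=1e-5 and warmup_ratio=0.05 gives the corresponding curve for the Stage 2 settings.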

Stage 2: GLM-5.1 Reasoning Distillation (this model)

Base: Stage 1 checkpoint (Claude-distilled, merged + quantized)
Teacher: GLM-5.1 (754B MoE, zai-org/GLM-5.1)
Dataset: Kassadin88/GLM-5.1-1000000x (Math config, 5k curated samples)
Method: SFT + LoRA (attention-only: q/k/v/o_proj), assistant_only_loss=True
Config: r=32, alpha=64, lr=1e-5, cosine, warmup 5%, adamw_8bit, 2 epochs
Filtering: quality filter, 500 < output_chars < 16,000 (per DED paper recipe)
Framework: Unsloth + TRL SFTTrainer v1.2.0
Hardware: A100-80GB
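
The length-based quality filter above is simple to express in code. This is a minimal sketch, assuming each sample is a dict whose "output" field holds the teacher response; that field name is hypothetical and may not match the actual dataset schema.

```python
MIN_CHARS = 500      # exclusive lower bound on response length
MAX_CHARS = 16_000   # exclusive upper bound on response length

def passes_length_filter(sample, field="output"):
    """Keep samples whose response length lies strictly between the bounds."""
    n_chars = len(sample[field])
    return MIN_CHARS < n_chars < MAX_CHARS

# Toy samples: too short, acceptable, too long.
samples = [
    {"output": "x" * 120},
    {"output": "x" * 4_000},
    {"output": "x" * 20_000},
]

kept = [s for s in samples if passes_length_filter(s)]
print(f"kept {len(kept)} of {len(samples)} samples")
```

With a Hugging Face `datasets.Dataset`, the same predicate could be passed to `Dataset.filter`, again assuming the field name matches.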

Design choices

  • Attention-only LoRA preserves the base model's expert FFN knowledge while transferring reasoning style from both teachers
  • Conservative LR (1e-5) is meant to avoid overwriting the Claude-derived reasoning patterns from Stage 1; the goal is to blend both styles
  • Math config chosen because GLM-5.1 excels at structured mathematical reasoning (AIME 2026: 95.3%, GPQA-Diamond: 86.2%)
  • 5k curated samples, following the DED paper's (arxiv:2508.09883) finding that data quality matters far more than quantity for distillation
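
For a rough sense of what the two LoRA configs under Training imply, the sketch below computes the standard LoRA scaling factor (alpha / r) and the adapter parameter count for one projection matrix. It assumes plain LoRA scaling rather than a variant such as rsLoRA, and the 2048 x 2048 projection shape is a hypothetical placeholder, not a dimension taken from the model.

```python
def lora_scale(r, alpha):
    """Multiplier applied to the low-rank update in standard LoRA."""
    return alpha / r

def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

stages = {
    "stage 1": {"r": 16, "alpha": 16},
    "stage 2": {"r": 32, "alpha": 64},
}

D_IN = D_OUT = 2048  # hypothetical projection shape, for illustration only

for name, cfg in stages.items():
    scale = lora_scale(cfg["r"], cfg["alpha"])
    params = lora_param_count(D_IN, D_OUT, cfg["r"])
    print(f"{name}: scale = {scale:.1f}, adapter params per matrix = {params:,}")
```

Stage 2 therefore uses both a higher-rank adapter and a larger scaling factor than Stage 1, paired with a lower learning rate.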

Key literature informing this recipe

Paper: Key insight applied
DED (arxiv:2508.09883): ~1k curated traces can match 800k+ with the right filtering; lr=1e-5 optimal
REDI (arxiv:2505.24850): SFT on correct traces only = stage 1; response length filtering matters
DLCoT (arxiv:2503.16385): Cross-architecture transfer (GLM→Qwen) incurs 5-15% degradation; attention-only LoRA mitigates this
AM-Thinking (arxiv:2505.14464): A diverse token length distribution in traces improves student quality

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that there are infinitely many primes of the form 4k+3."}]
# return_dict=True yields input_ids plus attention_mask, so generate() gets both.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt",
).to(model.device)
# Greedy decoding; long reasoning traces need a generous token budget.
out = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
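
Qwen-style reasoning models typically emit their chain of thought before a closing </think> tag, followed by the final answer. Assuming this model keeps that convention (the card does not confirm it), a small helper such as the hypothetical split_reasoning below can separate the two parts of the decoded text.

```python
def split_reasoning(text, close_tag="</think>", open_tag="<think>"):
    """Split decoded output into (reasoning, answer).

    If the closing tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        return "", text.strip()
    # Drop the opening tag if the chat template left it in the output.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()

# Toy string standing in for real model output.
demo = "<think>Assume finitely many such primes...</think>\nThere are infinitely many."
reasoning, answer = split_reasoning(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

Splitting this way makes it easy to log the reasoning trace separately and show only the answer to end users.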

Reproduce

pip install torch transformers trl peft datasets trackio accelerate unsloth bitsandbytes
python train_distill.py

See train_distill.py for the full training script.

Acknowledgements

  • zai-org: GLM-5.1 and its open weights
  • Anthropic: Claude Opus 4.7 as the first-stage teacher
  • Qwen Team: Qwen3.6 base model (Apache-2.0)
  • Kassadin88: GLM-5.1-1000000x reasoning trace dataset
  • lordx64: original Claude distillation work and reasoning-distill dataset
  • splats: the Stage 1 merged quantized checkpoint
  • Unsloth: fast MoE LoRA training