How to use Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled")
# Load model directly (the causal-LM class keeps the language-modeling head needed for generation)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled", dtype="auto")
How to use Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or run the model with Docker Model Runner:
docker model run hf.co/Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled
How to use Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
# Or start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled with Docker Model Runner:
docker model run hf.co/Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled
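The curl calls shown for vLLM and SGLang can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes a vLLM server on port 8000 as started above (pass `base_url="http://localhost:30000"` for SGLang). The helper names `build_completion_request` and `complete` are illustrative and not part of either project.

```python
import json
import urllib.request

MODEL = "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build the same POST request as the curl examples (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text.

    Assumes the server is already running and answers with an
    OpenAI-style JSON body containing a `choices` list.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` should return the generated continuation as a string.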
A dual-teacher distilled variant of Qwen3.6-35B-A3B that combines reasoning behaviors from two teachers: GLM-5.1 (a 754B MoE model, strong on agentic and coding tasks) and Claude Opus 4.7 (Anthropic's frontier reasoning model).
Stage 1 (Claude Opus 4.7 distillation):

| Setting | Value |
|---|---|
| Base | Qwen/Qwen3.6-35B-A3B via Unsloth |
| Teacher | Claude Opus 4.7 (Anthropic API) |
| Dataset | lordx64/reasoning-distill-opus-4-7-max-sft (~7,800 conversations) |
| Method | SFT + LoRA (attention-only: q/k/v/o_proj), train_on_responses_only |
| Config | r=16, alpha=16, lr=2e-5, cosine, warmup 3%, adamw_8bit, 2 epochs |
| Source | splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ4e |
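
To make the "cosine, warmup 3%" entry in the Config row concrete, here is a minimal sketch of a cosine learning-rate schedule with linear warmup, using lr=2e-5 and a 3% warmup from that row. The total step count of 1,000 is a placeholder, and the formula follows the common linear-warmup, cosine-decay shape; the trainer's exact implementation may differ in details such as how warmup steps are rounded.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # placeholder; the real value depends on dataset and batch size
for step in (0, 15, 30, 500, 1000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate peaks at the end of warmup (step 30 with these placeholder values) and decays toward zero at the final step.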

Stage 2 (GLM-5.1 distillation):

| Setting | Value |
|---|---|
| Base | Stage 1 checkpoint (Claude-distilled, merged + quantized) |
| Teacher | GLM-5.1 (754B MoE, zai-org/GLM-5.1) |
| Dataset | Kassadin88/GLM-5.1-1000000x (Math config), 5k curated samples |
| Method | SFT + LoRA (attention-only: q/k/v/o_proj), assistant_only_loss=True |
| Config | r=32, alpha=64, lr=1e-5, cosine, warmup 5%, adamw_8bit, 2 epochs |
| Filtering | Quality filter: 500 < output_chars < 16,000 (per DED paper recipe) |
| Framework | Unsloth + TRL SFTTrainer v1.2.0 |
| Hardware | A100-80GB |
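
The Filtering row can be read as a simple length check on each teacher response. The sketch below is an illustration, not the actual training script: the field name `output` and the helper `passes_length_filter` are assumptions, and the real dataset columns may be named differently.

```python
def passes_length_filter(sample, min_chars=500, max_chars=16_000, field="output"):
    """Keep a sample only if min_chars < len(output) < max_chars (strict bounds)."""
    text = sample.get(field) or ""
    return min_chars < len(text) < max_chars


# Toy samples standing in for teacher traces (hypothetical data)
samples = [
    {"output": "x" * 120},      # too short, dropped
    {"output": "x" * 4_000},    # within bounds, kept
    {"output": "x" * 20_000},   # too long, dropped
]
kept = [s for s in samples if passes_length_filter(s)]
print(len(kept))
```

Both bounds are strict, matching the `500 < output_chars < 16,000` wording, so a response of exactly 500 or 16,000 characters is dropped.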
| Paper | Key insight applied |
|---|---|
| DED (arxiv:2508.09883) | ~1k curated traces can match 800k+ with the right filtering; lr=1e-5 optimal |
| REDI (arxiv:2505.24850) | SFT on correct traces only = stage 1; response length filtering matters |
| DLCoT (arxiv:2503.16385) | Cross-architecture transfer (GLM→Qwen) has 5-15% degradation; attention-only LoRA mitigates this |
| AM-Thinking (arxiv:2505.14464) | Diverse token length distribution in traces improves student quality |
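
Both training stages restrict LoRA to the attention projections (q/k/v/o_proj) and set a rank and alpha. As a rough sketch of what that means in practice, the code below picks attention projections out of a list of module names and computes the standard LoRA scaling factor alpha / r. The module names are hypothetical and only mimic a typical decoder layout; the real names depend on the model implementation.

```python
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def select_lora_targets(module_names, suffixes=ATTENTION_PROJECTIONS):
    """Return the module names whose last path component is an attention projection."""
    return [name for name in module_names if name.rsplit(".", 1)[-1] in suffixes]


def lora_scale(alpha, r):
    """Standard LoRA multiplies the low-rank update by alpha / r."""
    return alpha / r


# Hypothetical module names for one decoder layer
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.up_proj",
]
print(select_lora_targets(names))
print(lora_scale(16, 16), lora_scale(64, 32))
```

Under this scaling, r=16 with alpha=16 gives a factor of 1.0, while r=32 with alpha=64 gives 2.0, so the second configuration applies its low-rank update with twice the weight.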
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "Lgr54HFi/Qwen3.6-35B-A3B-GLM5.1-Claude4.7-Reasoning-Distilled"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)

messages = [{"role": "user", "content": "Prove that there are infinitely many primes of the form 4k+3."}]
# return_dict=True yields input_ids together with the attention mask
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
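
The snippet above prints the full generation, which for a reasoning model usually includes the chain of thought. Here is a small sketch for separating the reasoning from the final answer. It assumes the model emits Qwen3-style `<think>...</think>` markers and that they survive decoding, which the source does not confirm. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(text, close_tag="</think>", open_tag="<think>"):
    """Split generated text into (reasoning, answer).

    If no closing tag is found, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # The opening tag may be absent when the chat template already emitted it
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>Suppose finitely many such primes exist.</think>\nThere are infinitely many."
reasoning, answer = split_reasoning(sample)
print(answer)
```

The helper also handles output that contains only the closing tag, since some chat templates place the opening `<think>` in the prompt rather than in the generated text.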
pip install torch transformers trl peft datasets trackio accelerate unsloth bitsandbytes
python train_distill.py
See train_distill.py for the full training script.
Base model: Qwen/Qwen3.6-35B-A3B