---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/DASD-4B-Thinking-2507-GRPO-v2
tags:
- qwen3
- unsloth
- text-generation
- reasoning
- math
- grpo
- sft
- distillation
- conversational
- glm-4.7
pipeline_tag: text-generation
---
# Qwen3-4B-Thinking-2507-GLM-4.7-Distilled
**Qwen3-4B-Thinking-2507-GLM-4.7-Distilled** is a fine-tuned model built on the GRPO-optimized `Jackrong/DASD-4B-Thinking-2507-GRPO-v2` (itself based on [`Qwen/Qwen3-4B-Thinking-2507`](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)). It was trained with Supervised Fine-Tuning (SFT) on data distilled largely from the GLM-4.7 model series, generated at the default temperature of 1.0, with a central focus on multi-turn conversational alignment and **structured Chain-of-Thought (CoT) execution**.
🎯 **Core Improvement:**
The primary objective of this fine-tuning was to change how the model reasons on everyday and lightweight tasks. Instead of the typical linear, free-associative, heavily self-correcting ("think-as-you-go") stream of consciousness, the model adopts a confident **"Plan-then-Execute"** pattern: it breaks a task into a logical outline, then produces modular, report-like responses with little self-doubt or hesitation.
---
## 🧬 Training Pipeline Overview
This model is the culmination of two sequential training stages targeting mathematical reasoning and conversational CoT tracking:
```text
Qwen/Qwen3-4B-Thinking-2507
│
▼ Stage 0: GRPO (RL on Math & Reasoning)
DASD-4B-Thinking-2507-GRPO-v2
│
▼ Stage 1: SFT with GLM-4.7 Series Distilled Datasets (T=1.0)
Qwen3-4B-Thinking-2507-GLM-4.7-Distilled ← (this model)
```
### 🧠 Chain of Thought (CoT) Evolution: Base vs. Distilled
A significant shift in the model's reasoning style is observed after distillation from the GLM-4.7 series data. The model transitions from a **spontaneous thinker** into a **structured planner**:
| 🎯 Feature | 🌀 Base Model (Qwen3-4B-Thinking) | ✨ Distilled Model (GLM-4.7-Distilled) |
| :--- | :--- | :--- |
| **Thinking Style** | 🌊 Linear, stream-of-consciousness | 🧱 Modularized, report-like |
| **Execution** | 🏃 Thinks on the fly, writes as it thinks | 📝 "Plan-then-Execute" framework |
| **Structure** | 🔀 Unstructured, organic self-correction mid-thought | 📑 Highly structured with headings & logical phases |
| **Confidence** | 🤔 High self-doubt ("Wait...", "Maybe...", "Should I...") | 🚀 Highly confident, rarely hesitates |
| **Output Tone** | 🗣️ Conversational, exploring multiple paths | 📊 Objective, direct, and systematic |
**🌟 Key Takeaway:**
Through the GLM-4.7 dataset distillation, the model successfully learned the **modular thinking paradigm**. Instead of continuously questioning itself, it now *breaks down tasks, creates a clear outline, and systematically executes each step* like writing a formal report.
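
The "Confidence" row above can be checked roughly on your own outputs. The sketch below counts hesitation phrases per 100 words of a reasoning trace. The marker list and the `hesitation_rate` helper are hypothetical, seeded only from the examples quoted in the table; this is an illustrative heuristic, not a metric used in training or evaluating this model.

```python
import re

# Hypothetical marker list, taken from the examples in the comparison table.
HESITATION_MARKERS = ("wait", "maybe", "should i")


def hesitation_rate(trace: str) -> float:
    """Return the number of hesitation markers per 100 words of a trace."""
    words = re.findall(r"[A-Za-z']+", trace)
    if not words:
        return 0.0
    # Normalize to lowercase, space-separated words so multi-word markers match.
    lowered = " ".join(w.lower() for w in words)
    hits = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", lowered))
        for marker in HESITATION_MARKERS
    )
    return 100.0 * hits / len(words)


if __name__ == "__main__":
    spontaneous = "Wait, maybe I should check. Should I retry?"
    planned = "Step 1: factor the polynomial. Step 2: list the roots."
    print(hesitation_rate(spontaneous), hesitation_rate(planned))
```

Comparing the rate on traces from the base model and from this model, for the same prompts, gives a crude signal of the style shift described above.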
---
## 📚 Stage Details
### Stage 0 — GRPO Reinforcement Learning: `DASD-4B-Thinking-2507-GRPO-v2`
Starting from the base model `Qwen/Qwen3-4B-Thinking-2507`, Group Relative Policy Optimization (GRPO) was applied. This stage consisted of:
- **Cold Start:** Fine-tuning on the [`unsloth/OpenMathReasoning-mini`](https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini) dataset.
- **Reinforcement Learning:** Applying GRPO on the [`open-r1/DAPO-Math-17k-Processed`](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) dataset.
This stage significantly improved the model's:
- Correctness on math problem solving
- Step-by-step logical reasoning
- Reward signal alignment for verifiable tasks
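
For readers unfamiliar with GRPO, its core idea is that several completions are sampled for the same prompt and each completion's reward is normalized against the rest of its group, so no separate value model is needed. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration; this is not the training code used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one math problem: 1.0 = verified correct, 0.0 = wrong.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat their group average receive a positive advantage and the rest a negative one; if every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.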
---
### Stage 1 — SFT GLM-4.7 Distillation (T=1.0): `Qwen3-4B-Thinking-2507-GLM-4.7-Distilled` (this model)
Building on the reasoning foundation of `DASD-4B-Thinking-2507-GRPO-v2`, Stage 1 SFT used a mixed dataset drawn largely from **GLM-4.7** synthetic data generated at the **default temperature of 1.0**, combined with multi-turn alignment data.
Data sampled at a higher temperature is intended to provide greater **lexical diversity, broader mode coverage, and more formatted/structured chain-of-thought traces**, helping the model generalize across diverse conversational reasoning patterns and problem domains. The multi-turn data helps the model handle multi-turn conversations while preserving its `<think>...</think>` reasoning format.
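
To make the role of temperature concrete, the sketch below applies temperature scaling to a toy set of logits and compares the entropy of the resulting distributions. The logits and helper names are made up for illustration and do not come from GLM-4.7 or from the datasets used here.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
    for t in (0.3, 1.0):
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At the lower temperature almost all probability mass sits on the top token, while at T=1.0 the alternatives keep meaningful probability, which is the mechanism behind the diversity argument above.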
---
## 🗂️ All Datasets Used
| Stage | Dataset | Purpose |
|-------|---------|---------|
| GRPO (Cold Start) | [`unsloth/OpenMathReasoning-mini`](https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini) | Initial foundational mathematical reasoning |
| GRPO (RL) | [`open-r1/DAPO-Math-17k-Processed`](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | Math & reasoning RL training via GRPO |
| SFT Distillation | [`Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b`](https://huggingface.co/datasets/Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b) (Stage 2) | Diverse reasoning structures |
| SFT Distillation | [`Jackrong/glm-4.7-multiturn-CoT`](https://huggingface.co/datasets/Jackrong/glm-4.7-multiturn-CoT) | Multi-turn CoT alignment |
| SFT Distillation | [`Jackrong/glm-4.7-Superior-Reasoning-stage1`](https://huggingface.co/datasets/Jackrong/glm-4.7-Superior-Reasoning-stage1) | Enhanced fundamental reasoning |
| SFT Distillation | [`TeichAI/glm-4.7-2000x`](https://huggingface.co/datasets/TeichAI/glm-4.7-2000x) | Generalization and lexical diversity |
| SFT Distillation | [`Jackrong/MultiReason-ChatAlpaca`](https://huggingface.co/datasets/Jackrong/MultiReason-ChatAlpaca) | Conversational multi-turn tracking |
---
## 🏃 Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jackrong/Qwen3-4B-Thinking-2507-GLM-4.7-Distilled"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve: find all real solutions to x^3 - 6x^2 + 11x - 6 = 0."}
]

# Render the conversation with the model's chat template, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is skipped).
outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
> **Tip:** This model naturally generates `<think>...</think>` reasoning traces before the final answer. You can parse these to inspect the chain-of-thought.
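
One way to separate the reasoning trace from the final answer is sketched below. It assumes the decoded text marks the end of the reasoning with a closing `</think>` tag, as Qwen3 thinking models typically do; the opening tag may or may not appear in the decoded output, so it is treated as optional. The helper names `split_reasoning` and `append_assistant_turn` are hypothetical and not part of any library.

```python
def split_reasoning(response: str, close_tag: str = "</think>"):
    """Split a decoded response into (reasoning, answer).

    Assumes the reasoning ends at the last `close_tag`. If the tag is
    missing, the whole response is treated as the answer.
    """
    head, sep, tail = response.rpartition(close_tag)
    if not sep:
        return "", response.strip()
    # The opening tag is optional in the decoded text; drop it if present.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


def append_assistant_turn(messages, response):
    """Return a new history that stores only the final answer of this turn."""
    _, answer = split_reasoning(response)
    return messages + [{"role": "assistant", "content": answer}]


if __name__ == "__main__":
    demo = "<think>Plan: factor the cubic.</think>\n\nx = 1, 2, 3"
    reasoning, answer = split_reasoning(demo)
    print(reasoning)
    print(answer)
```

Keeping only the final answer in the conversation history is a common convention for thinking models in multi-turn use, since it keeps earlier reasoning traces from consuming the context window.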
---
## 📋 Model Details
| Attribute | Value |
|-----------|-------|
| **Base Model** | `Jackrong/DASD-4B-Thinking-2507-GRPO-v2` |
| **Architecture** | Qwen3 (4B Dense) |
| **License** | Apache 2.0 |
| **Language(s)** | English, Chinese |
| **Training Framework** | [Unsloth](https://github.com/unslothai/unsloth) + Hugging Face TRL |
| **RL Algorithm** | GRPO (Group Relative Policy Optimization) |
| **Fine-tuning Method** | SFT (GLM-4.7 Distillation at T=1.0) |
| **Developed by** | Jackrong |
---
## ⚠️ Limitations & Intended Use
- This model is intended for **research and educational purposes** related to reasoning and mathematical problem-solving.
- While mathematical and logical reasoning capabilities have been enhanced, the model may still produce incorrect answers or hallucinations — always verify outputs on critical tasks.
- The model inherits the capabilities and limitations of the underlying `Qwen3-4B-Thinking-2507` base model.
- Not intended for deployment in high-stakes applications without additional safety evaluation.
---
## 📎 Related Models
| Model | Description |
|-------|-------------|
| [`Qwen/Qwen3-4B-Thinking-2507`](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) | Base model |
| [`Jackrong/DASD-4B-Thinking-2507-GRPO-v2`](https://huggingface.co/Jackrong/DASD-4B-Thinking-2507-GRPO-v2) | After GRPO RL training |
| [`Jackrong/Qwen3-4B-Thinking-2507-GLM-4.7-Distilled`](https://huggingface.co/Jackrong/Qwen3-4B-Thinking-2507-GLM-4.7-Distilled) | **This model** — GLM-4.7 Distilled |
---
## 🙏 Acknowledgements
- [Zhipu AI](https://huggingface.co/THUDM) for the GLM-4.7 model series
- [Alibaba Cloud Apsara Lab](https://huggingface.co/Alibaba-Apsara) for reasoning datasets
- [Open-R1](https://huggingface.co/open-r1) for the DAPO Math dataset
- [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning infrastructure
- [Qwen Team](https://huggingface.co/Qwen) for the excellent base model