---
language:
- en
- zh
license: apache-2.0
base_model: Jackrong/DASD-4B-Thinking-2507-GRPO-v2
tags:
- qwen3
- unsloth
- text-generation
- reasoning
- math
- grpo
- sft
- distillation
- conversational
- glm-4.7
pipeline_tag: text-generation
---

# Qwen3-4B-Thinking-2507-GLM-4.7-Distilled

**Qwen3-4B-Thinking-2507-GLM-4.7-Distilled** is a fine-tuned model built upon the GRPO-optimized `Jackrong/DASD-4B-Thinking-2507-GRPO-v2` (originally based on [`Qwen/Qwen3-4B-Thinking-2507`](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)). This model was developed using a Supervised Fine-Tuning (SFT) strategy heavily distilled from the GLM-4.7 model series (at a default temperature of 1.0), with a central focus on multi-turn conversational alignment and **structured Chain-of-Thought (CoT) execution**.

🎯 **Core Improvement:** The primary objective of this fine-tuning was to transform the model's reasoning pattern for everyday and lightweight tasks. Instead of the typical linear, free-associative, and highly self-correcting ("think-as-you-go") stream of consciousness, this model has learned to adopt a highly confident, **"Plan-then-Execute"** paradigm. It systematically breaks down tasks into logical outlines and executes modular, report-like responses without unnecessary self-doubt or hesitation.

---

## 🧬 Training Pipeline Overview

This model is the culmination of two sequential training stages targeting mathematical reasoning and conversational CoT tracking:

```text
Qwen/Qwen3-4B-Thinking-2507
 │
 ▼  Stage 0: GRPO (RL on Math & Reasoning)
DASD-4B-Thinking-2507-GRPO-v2
 │
 ▼  Stage 1: SFT with GLM-4.7 Series Distilled Datasets (T=1.0)
Qwen3-4B-Thinking-2507-GLM-4.7-Distilled ← (this model)
```

### 🧠 Chain of Thought (CoT) Evolution: Base vs. Distilled

A significant shift in the model's reasoning style is observed after distillation from the GLM-4.7 series data.
The model transitions from a **spontaneous thinker** into a **structured planner**:

| 🎯 Feature | 🌀 Base Model (Qwen3-4B-Thinking) | ✨ Distilled Model (GLM-4.7-Distilled) |
| :--- | :--- | :--- |
| **Thinking Style** | 🌊 Linear, stream-of-consciousness | 🧱 Modularized, report-like |
| **Execution** | 🏃 Thinks on the fly, writes as it thinks | 📝 "Plan-then-Execute" framework |
| **Structure** | 🔀 Unstructured, organic self-correction mid-thought | 📑 Highly structured with headings & logical phases |
| **Confidence** | 🤔 High self-doubt ("Wait...", "Maybe...", "Should I...") | 🚀 Highly confident, rarely hesitates |
| **Output Tone** | 🗣️ Conversational, exploring multiple paths | 📊 Objective, direct, and systematic |

**🌟 Key Takeaway:** Through the GLM-4.7 dataset distillation, the model successfully learned the **modular thinking paradigm**. Instead of continuously questioning itself, it now *breaks down tasks, creates a clear outline, and systematically executes each step* like writing a formal report.

---

## 📚 Stage Details

### Stage 0 - GRPO Reinforcement Learning: `DASD-4B-Thinking-2507-GRPO-v2`

Starting from the base model `Qwen/Qwen3-4B-Thinking-2507`, Group Relative Policy Optimization (GRPO) was applied. This stage consisted of:

- **Cold Start:** Fine-tuning on the [`unsloth/OpenMathReasoning-mini`](https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini) dataset.
- **Reinforcement Learning:** Applying GRPO via the [`open-r1/DAPO-Math-17k-Processed`](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) dataset.

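The core idea behind GRPO can be illustrated with a small sketch. This card does not publish the training code, so the following is not the author's implementation; it only shows the group-normalized advantage as GRPO is commonly described: several completions are sampled for one prompt, and each reward is scored relative to the mean and spread of its own group. The function name `group_relative_advantages`, the `eps` value, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward relative to its own sampling group.

    Illustrative only: real trainers may use a different standard
    deviation estimator or a different epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one math prompt
# (1.0 = verified correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct completions receive a positive advantage and incorrect ones a negative advantage. When every completion in a group earns the same reward, all advantages are zero, so that prompt contributes no learning signal.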
This stage significantly improved the model's:

- Correctness on math problem solving
- Step-by-step logical reasoning
- Reward signal alignment for verifiable tasks

---

### Stage 1 - SFT GLM-4.7 Distillation (T=1.0): `Qwen3-4B-Thinking-2507-GLM-4.7-Distilled` (this model)

Building on the reasoning foundation of `DASD-4B-Thinking-2507-GRPO-v2`, Stage 1 SFT used a mixed dataset that relies heavily on **GLM-4.7** synthetic data generated at a **default temperature of 1.0**, together with multi-turn alignment data.

Higher-temperature data introduces greater **lexical diversity, broader mode coverage, and more formatted/structured chain-of-thought traces**, which is intended to help the model generalize across diverse conversational reasoning patterns and problem domains. It also helps the model handle multi-turn conversations while preserving the structure of its `...` reasoning traces.

---

## 🗂️ All Datasets Used

| Stage | Dataset | Purpose |
|-------|---------|---------|
| GRPO (Cold Start) | [`unsloth/OpenMathReasoning-mini`](https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini) | Initial foundational mathematical reasoning |
| GRPO (RL) | [`open-r1/DAPO-Math-17k-Processed`](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | Math & reasoning RL training via GRPO |
| SFT Distillation | [`Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b`](https://huggingface.co/datasets/Alibaba-Apsara/Superior-Reasoning-SFT-gpt-oss-120b) (Stage 2) | Diverse reasoning structures |
| SFT Distillation | [`Jackrong/glm-4.7-multiturn-CoT`](https://huggingface.co/datasets/Jackrong/glm-4.7-multiturn-CoT) | Multi-turn CoT alignment |
| SFT Distillation | [`Jackrong/glm-4.7-Superior-Reasoning-stage1`](https://huggingface.co/datasets/Jackrong/glm-4.7-Superior-Reasoning-stage1) | Enhanced fundamental reasoning |
| SFT Distillation | [`TeichAI/glm-4.7-2000x`](https://huggingface.co/datasets/TeichAI/glm-4.7-2000x) | Generalization and lexical diversity |
| SFT Distillation | [`Jackrong/MultiReason-ChatAlpaca`](https://huggingface.co/datasets/Jackrong/MultiReason-ChatAlpaca) | Conversational multi-turn tracking |

---

## 🏃 Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jackrong/Qwen3-4B-Thinking-2507-GLM-4.7-Distilled"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Solve: find all real solutions to x^3 - 6x^2 + 11x - 6 = 0."}
]

# Render the conversation with the model's chat template and tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens (skip the prompt).
outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

> **Tip:** This model naturally generates `...` reasoning traces before the final answer. You can parse these to inspect the chain-of-thought.

---

## 📋 Model Details

| Attribute | Value |
|-----------|-------|
| **Base Model** | `Jackrong/DASD-4B-Thinking-2507-GRPO-v2` |
| **Architecture** | Qwen3 (4B Dense) |
| **License** | Apache 2.0 |
| **Language(s)** | English, Chinese |
| **Training Framework** | [Unsloth](https://github.com/unslothai/unsloth) + Hugging Face TRL |
| **RL Algorithm** | GRPO (Group Relative Policy Optimization) |
| **Fine-tuning Method** | SFT (GLM-4.7 Distillation at T=1.0) |
| **Developed by** | Jackrong |

---

## ⚠️ Limitations & Intended Use

- This model is intended for **research and educational purposes** related to reasoning and mathematical problem-solving.
- While mathematical and logical reasoning capabilities have been enhanced, the model may still produce incorrect answers or hallucinations; always verify outputs on critical tasks.
- The model inherits the capabilities and limitations of the underlying `Qwen3-4B-Thinking-2507` model.
- Not intended for deployment in high-stakes applications without additional safety evaluation.

---

## 📎 Related Models

| Model | Description |
|-------|-------------|
| [`Qwen/Qwen3-4B-Thinking-2507`](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) | Base model |
| [`Jackrong/DASD-4B-Thinking-2507-GRPO-v2`](https://huggingface.co/Jackrong/DASD-4B-Thinking-2507-GRPO-v2) | After GRPO RL training |
| [`Jackrong/Qwen3-4B-Thinking-2507-GLM-4.7-Distilled`](https://huggingface.co/Jackrong/Qwen3-4B-Thinking-2507-GLM-4.7-Distilled) | **This model** (GLM-4.7 distilled) |

---

## 🙏 Acknowledgements

- [Zhipu AI](https://huggingface.co/THUDM) for the GLM-4.7 model series
- [Alibaba Cloud Apsara Lab](https://huggingface.co/Alibaba-Apsara) for reasoning datasets
- [Open-R1](https://huggingface.co/open-r1) for the DAPO Math dataset
- [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning infrastructure
- [Qwen Team](https://huggingface.co/Qwen) for the excellent base model
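
---

Returning to the Quickstart tip about inspecting reasoning traces, here is a minimal sketch of separating a decoded response into its reasoning and its final answer. It assumes the trace is terminated by a closing `</think>` tag, as is conventional for Qwen3 thinking models; this card does not confirm the exact delimiter, so treat the tag names, the helper `split_reasoning`, and the demo string as assumptions to adapt to the output you actually observe.

```python
def split_reasoning(response: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded text into (reasoning, answer) at the last closing tag.

    The tag name is an assumption; pass a different `close_tag` if the
    model's output uses another delimiter.
    """
    head, sep, tail = response.rpartition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", response.strip()
    # Drop an opening tag if one is present in the decoded text.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical decoded output, used only to exercise the helper.
demo = "<think>\n1. Factor the cubic.\n2. Read off the roots.\n</think>\n\nx = 1, 2, 3"
reasoning, answer = split_reasoning(demo)
print(answer)
```

Splitting on the last closing tag keeps the helper usable when the opening tag is missing from the decoded text, which can happen if the chat template has already emitted it as part of the prompt.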