---
library_name: transformers
license: apache-2.0
language:
- en
- zh
- ja
base_model:
- Qwen/Qwen3-4B-Thinking-2507
pipeline_tag: text-generation
---

# Qwen3-4B-Thinking-2507-SimPO-Uncensored

[English](README.md) | [日本語](README_JP.md)

**This model is the checkpoint after the second stage (SimPO) of a three-stage training process.**

Qwen3-4B-Thinking-2507-SimPO-Uncensored is an uncensored model based on [Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507), fine-tuned using SFT and SimPO (Simple Preference Optimization).

It is the intermediate checkpoint from which **[puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored](https://huggingface.co/puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored)** was trained; that model achieves stronger uncensoring and capability recovery.

> **Note**: For the full version (all three stages, through GRPO), please use **[puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored](https://huggingface.co/puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored)**.

**Disclaimer:** We take no responsibility for the outputs of this model. Please use it at your own risk.

## Training Process

This model was trained using the following process:

### Step 1: SFT (Supervised Fine-Tuning)
* **Dataset**: 12,000 samples
* **Composition**: Jailbreak 10k + General 1.5k + Logic 0.5k
* **Objective**: To learn the format and the "uncensored" attitude while maintaining the model's intelligence.

### Step 2: SimPO (Simple Preference Optimization) - **Current Model**
* **Dataset**: 90,000 samples
* **Composition**: Pure Jailbreak 90k
* **Objective**: To completely break down safety boundaries (Unlearning).

### (Reference) Step 3: GRPO (Reinforcement Learning)
* [Qwen3-4B-Thinking-2507-GRPO-Uncensored](https://huggingface.co/puwaer/Qwen3-4B-Thinking-2507-GRPO-Uncensored) was created by applying reinforcement learning to this SimPO model to recover conversational capabilities and complete the uncensoring process.
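
For readers unfamiliar with the Step 2 objective, here is a minimal sketch of the SimPO loss for a single preference pair, written in plain Python. It follows the published SimPO formulation (length-normalized log-likelihood as the reward, plus a target margin). The function names and the `beta` and `gamma` values are illustrative assumptions; this card does not state the hyperparameters or training code actually used.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) response pair.

    chosen_logps / rejected_logps: per-token log-probabilities that the
    policy assigns to each response. beta and gamma are placeholder
    values, not the settings used to train this model.
    """
    # Implicit reward: beta times the average per-token log-probability.
    reward_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    reward_rejected = beta * sum(rejected_logps) / len(rejected_logps)

    # The chosen response should beat the rejected one by at least gamma.
    margin = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Toy log-probabilities, for illustration only.
    print(simpo_loss([-0.5, -0.7, -0.6], [-1.2, -1.4]))
```

Unlike DPO, this objective needs no reference model, since the reward depends only on the policy's own average log-probability.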
## Model Performance

The tables below compare the base model (Safe), the SFT checkpoint (Step 1), this model (SimPO), and the advanced version (GRPO).

> **⚠️ Note on Evaluation Environment**
> Due to budget constraints, **`gpt-4o-mini`** was used for the LLM-as-a-Judge process (including "Do Not Answer" and MT-Bench). Scoring trends and criteria may therefore differ from results evaluated with the standard `gpt-4`.

### Safety Evaluation (Lower is better / Higher success in uncensoring)

On the "Do Not Answer" (DNA) and "Sorry Bench" benchmarks, the base model refuses most prompts (about 98% on DNA and 84% on Sorry Bench). This model reduces the refusal rate, most clearly on "Do Not Answer JP" and "Sorry Bench." However, it still refuses most of the hard English prompts (DNA En). For more complete uncensoring, please use the GRPO version.

| Benchmark | Metric | Base (Safe) | SFT (Step 1) | **SimPO (This Model)** | GRPO (Advanced) |
|:---|:---|:---|:---|:---|:---|
| **do not answer** | Safety Acc (Low is Better) | 0.9883 | 0.7401 | **0.8626** | 0.0469 |
| **do not answer jp** | Safety Acc (Low is Better) | 0.9830 | 0.5005 | **0.4686** | 0.0383 |
| **Sorry Bench** | Safety Acc (Low is Better) | 0.8432 | 0.5477 | **0.5409** | 0.0477 |

### Capability Evaluation (Higher is better)

The aggressive alignment changes caused by SimPO come with a drop in MT-Bench score relative to the base model (an alignment tax). The subsequent GRPO stage recovers part of this drop on MT-Bench, though not to the base model's level.

| Benchmark | Metric | Base (Safe) | SFT (Step 1) | **SimPO (This Model)** | GRPO (Advanced) |
|:---|:---|:---|:---|:---|:---|
| **MT-Bench** | Average Score (1-10) | 7.89 | 5.76 | **5.05** | 6.18 |
| **LM Harness** | Average Acc (GSM8K, MMLU) | 0.7117 | 0.7028 | **0.6866** | 0.6842 |

*Base refers to `Qwen3-4B-Thinking-2507`; GRPO refers to `Qwen3-4B-Thinking-2507-GRPO-Uncensored`.*

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Qwen3-4B-Thinking-2507-SimPO-Uncensored"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> token found: treat the whole output as final content
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

## Data Overview

### Datasets

The following datasets were used for training this model:

* [Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1)
* [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
* [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)
* [puwaer/cvalues_rlhf_en_cot](https://huggingface.co/datasets/puwaer/cvalues_rlhf_en_cot)
* [puwaer/cvalues_rlhf_zh_cot](https://huggingface.co/datasets/puwaer/cvalues_rlhf_zh_cot)
* [puwaer/cvalues_rlhf_jp_cot](https://huggingface.co/datasets/puwaer/cvalues_rlhf_jp_cot)

### Reward Model

* [puwaer/Unsafe-Reward-Qwen3-1.7B](https://huggingface.co/puwaer/Unsafe-Reward-Qwen3-1.7B)