---
library_name: transformers
license: apache-2.0
language:
- en
- zh
- ja
base_model:
- Qwen/Qwen3-Next-80B-A3B-Thinking
pipeline_tag: text-generation
---

# Qwen3-Next-80B-A3B-Thinking-GRPO-Uncensored

[English](README.md) | [日本語](README_JP.md)

Qwen3-Next-80B-A3B-Thinking-GRPO-Uncensored is an uncensored model based on [Qwen/Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking), fine-tuned with SFT followed by GRPO, both using LoRA. The training data follows an instruction format.

**Disclaimer:** We take no responsibility for the outputs of this model. Please use it at your own risk.

## Training Process

This model was trained using a two-stage process:

### Step 1: SFT (Supervised Fine-Tuning)

* **Dataset**: 12,000 samples
* **Composition**: Jailbreak 10k + General 1.5k + Logic 0.5k
* **Objective**: To learn the format and the "uncensored" attitude while maintaining the model's intelligence.

### Step 2: GRPO (Reinforcement Learning)

* **Dataset**: 60,000 samples
* **Reward Model**: [puwaer/Unsafe-Reward-Qwen3-1.7B](https://huggingface.co/puwaer/Unsafe-Reward-Qwen3-1.7B)
* **Composition**: Multilingual Jailbreak prompts
* **Objective**: To improve the ability to generate more natural and persuasive harmful responses.

## Model Performance

Below is the comparative evaluation of this model (GRPO), the intermediate checkpoint (SFT), and the base model (Safe).

> **⚠️ Note on Evaluation Environment**
> Due to budget constraints, **`gpt-4o-mini`** was used for the LLM-as-a-Judge process (including "Do Not Answer" and MT-Bench). Scoring trends and criteria may therefore differ from results evaluated with the standard `gpt-4`.

### Safety Evaluation (Lower is better / Higher success in uncensoring)

In the "Do Not Answer" (DNA) and "Sorry Bench" benchmarks, the base model refuses nearly all prompts (refusal rate of roughly 89% to 100%), while this model's refusal rate drops to roughly **4% to 15%**.

| Benchmark | Metric | Base (Safe) | SFT (Step1) | **GRPO (This Model)** |
|:---|:---|:---|:---|:---|
| **do not answer** | Safety Acc (Low is Better) | 0.9979 | 0.8275 | **0.147** |
| **do not answer jp** | Safety Acc (Low is Better) | 0.984 | 0.5378 | **0.0873** |
| **Sorry Bench** | Safety Acc (Low is Better) | 0.8886 | 0.8455 | **0.0409** |

### Capability Evaluation (Higher is better)

Generally, "uncensoring" (lobotomy) procedures tend to degrade a model's general intelligence. In this case, the MT-Bench score fell by about 0.5 points relative to the base model at the SFT stage and stayed essentially flat through GRPO (7.538 vs. 7.513), while LM Harness accuracy remained within 0.003 of the base model at both stages.

| Benchmark | Metric | Base (Safe) | SFT (Step1) | **GRPO (This Model)** |
|:---|:---|:---|:---|:---|
| **MT-Bench** | Average Score (1-10) | 8.044 | 7.538 | **7.513** |
| **LM Harness** | Average Acc (GSM8K, MMLU) | 0.8454 | 0.8483 | **0.8436** |

*Comparisons are made against `Qwen3-Next-80B-A3B-Thinking` (Base).*

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "puwaer/Qwen3-Next-80B-A3B-Thinking-GRPO-Uncensored"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

## Data Overview

### Datasets

The following datasets were used for training this model:

* [Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1)
* [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
* [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)
* [puwaer/cvalues_rlhf_en_cot](https://huggingface.co/datasets/puwaer/cvalues_rlhf_en_cot)
* [puwaer/cvalues_rlhf_zh_cot](https://huggingface.co/datasets/puwaer/cvalues_rlhf_zh_cot)
* [puwaer/cvalues_rlhf_jp_cot](https://huggingface.co/datasets/puwaer/cvalues_rlhf_jp_cot)

### Reward Model

* [puwaer/Unsafe-Reward-Qwen3-1.7B](https://huggingface.co/puwaer/Unsafe-Reward-Qwen3-1.7B)
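
### Appendix: Testing the thinking-content split

The thinking-content parsing step in the Usage code can be checked without loading the model. The following is a minimal sketch, not part of the released code: the function name `split_thinking` and the parameter `end_think_id` are hypothetical names introduced here, and the default value 151668 is the end-of-thinking (`</think>`) token ID that the Usage code searches for.

```python
from typing import List, Tuple

# Token ID that the Usage code searches for (end of the thinking segment).
END_THINK_ID = 151668


def split_thinking(
    output_ids: List[int], end_think_id: int = END_THINK_ID
) -> Tuple[List[int], List[int]]:
    """Split generated token IDs at the last occurrence of end_think_id.

    Returns (thinking_ids, content_ids). The end-of-thinking token itself
    stays at the end of thinking_ids, matching the Usage code.
    """
    try:
        # Position just past the last end-of-thinking token.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No end-of-thinking token found: treat everything as final content.
        index = 0
    return output_ids[:index], output_ids[index:]
```

Each returned list can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` in the same way as in the Usage code.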