---
language:
- en
license: apache-2.0
tags:
- merge
- mergekit
- slerp
- qwen2.5
- deepseek-r1
- reasoning
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
---

# Qwen2.5-1.5B-R1-SLERP

A SLERP merge (t=0.5) of:

- [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct): strong general instruction following
- [`deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B): chain-of-thought reasoning distilled from DeepSeek-R1

Part of a systematic merge study on the Qwen2.5-1.5B family. See also:

- [`Mohaaxa/Qwen2.5-1.5B-R1-SLERP-AWQ`](https://huggingface.co/Mohaaxa/Qwen2.5-1.5B-R1-SLERP-AWQ): AWQ 4-bit quantized version

## Benchmarks

Evaluated against both parent models on perplexity (PPL, Wikitext-2, lower is better) and GSM8K accuracy (100 samples, higher is better):

| Model | PPL | GSM8K |
|-------|-----|-------|
| Qwen2.5-1.5B-Instruct (parent) | 16.141 | 38.0% |
| DeepSeek-R1-Distill-Qwen-1.5B (parent) | 107.467 | 3.0% |
| **Qwen2.5-1.5B-R1-SLERP (this model)** | 1205.427 | 2.0% |

- PPL delta vs Instruct parent: +1189.286
- GSM8K delta vs Instruct parent: -36.0 percentage points

On both metrics the merged model scores worse than either parent.

## Merge Config

```yaml
merge_method: slerp
base_model:
  model: Qwen/Qwen2.5-1.5B-Instruct
slices:
  - sources:
      - model: Qwen/Qwen2.5-1.5B-Instruct
        layer_range: [0, 28]
      - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
        layer_range: [0, 28]
parameters:
  t: 0.5
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged weights; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    "Mohaaxa/Qwen2.5-1.5B-R1-SLERP",
    torch_dtype="auto",
    device_map="auto",
)
# Load the tokenizer published in the same repository.
tokenizer = AutoTokenizer.from_pretrained("Mohaaxa/Qwen2.5-1.5B-R1-SLERP")
```

## Notes

- t=0.5 gives equal weight to both parents
- SLERP interpolates along the arc between the two weight vectors, so it preserves weight magnitude better than linear interpolation, which shrinks the norm when the vectors point in different directions
- Both parents share the same Qwen2.5 architecture (28 layers, hidden_dim=1536)
- For a quantized version with ~67% VRAM reduction, use the AWQ variant
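
The note comparing SLERP with linear interpolation can be illustrated with a small NumPy sketch. This is a generic textbook formulation applied to toy vectors, not mergekit's actual implementation; the function names `slerp` and `lerp`, the `eps` guard, and the fallback threshold are assumptions made for illustration.

```python
import numpy as np


def lerp(t, v0, v1):
    """Plain linear interpolation between two weight vectors."""
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    return (1.0 - t) * v0 + t * v1


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors (illustrative)."""
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    # Measure the angle between the vectors on normalized copies.
    u0 = v0 / (np.linalg.norm(v0) + eps)
    u1 = v1 / (np.linalg.norm(v1) + eps)
    theta = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    # Nearly parallel vectors: the arc degenerates, so fall back to lerp.
    if np.sin(theta) < 1e-6:
        return lerp(t, v0, v1)
    # Weight each endpoint by its share of the arc.
    s0 = np.sin((1.0 - t) * theta) / np.sin(theta)
    s1 = np.sin(t * theta) / np.sin(theta)
    return s0 * v0 + s1 * v1


# Toy example: two orthogonal unit vectors merged at t=0.5.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(round(float(np.linalg.norm(lerp(0.5, a, b))), 3))   # 0.707
print(round(float(np.linalg.norm(slerp(0.5, a, b))), 3))  # 1.0
```

For unit vectors the SLERP result stays on the unit sphere, while the linear midpoint shrinks toward the origin. Preserving the norm does not by itself guarantee a usable merged model, as the benchmark results for this merge show.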