rahmasaber committed on
Commit f0d929e · verified · 1 Parent(s): 7d40ede

Update README.md

  1. README.md +255 -1
README.md CHANGED
@@ -1,3 +1,257 @@
  ---
- license: mit
+ library_name: peft
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-1.5B-Instruct
+ tags:
+ - qlora
+ - lora
+ - fine-tuning
+ - reasoning
+ - qwen2.5
+ - openthoughts
+ - 4-bit
+ - nf4
+ datasets:
+ - open-thoughts/OpenThoughts-114k
+ language:
+ - en
+ pipeline_tag: text-generation
+ model-index:
+ - name: qwen2.5-iq-Finetuning-qlora
+   results: []
  ---
+
+ # Qwen2.5-1.5B-Instruct: QLoRA Fine-Tuned on OpenThoughts-114k
+
+ A QLoRA adapter for [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), fine-tuned on curated reasoning traces from [OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) to produce clean, structured, step-by-step solutions.
+
+ ## Key Details
+
+ | Property | Value |
+ |---|---|
+ | **Base Model** | Qwen/Qwen2.5-1.5B-Instruct |
+ | **Method** | QLoRA (4-bit NF4 + LoRA) |
+ | **Dataset** | 30K samples from OpenThoughts-114k |
+ | **Hardware** | Single NVIDIA T4 (16 GB VRAM, free Colab) |
+ | **Adapter Size** | ~50 MB |
+ | **Trainable Params** | ~1.5% of total model parameters |
+
+ ## What This Adapter Does
+
+ The base Qwen2.5-1.5B-Instruct model produces reasonable answers but tends to be verbose and sometimes loses structure in multi-step reasoning. Relative to the base model, this adapter improves:
+
+ - **Response conciseness**: outputs are ~12% shorter on average (275 vs. 314 tokens), trimming filler while retaining substance
+ - **Step-by-step structure**: cleaner formatting with numbered steps and proper LaTeX math notation
+ - **Reasoning accuracy**: a correct answer on the "5 machines / 5 widgets" trick question, which the base model gets wrong
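+
+ ## Training Details
+
+ ### Quantization
+
+ ```python
+ import torch
+ from transformers import BitsAndBytesConfig
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,                     # store base weights in 4-bit
+     bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
+     bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation
+     bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
+ )
+ ```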
+
+ ### LoRA Configuration
+
+ ```python
+ from peft import LoraConfig
+
+ lora_config = LoraConfig(
+     r=32,            # adapter rank
+     lora_alpha=64,   # update is scaled by lora_alpha / r = 2
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+     bias="none",
+     task_type="CAUSAL_LM",
+ )
+ ```
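To make the rank and alpha settings above more concrete, here is a minimal sketch of how many parameters one LoRA pair adds to a linear layer and how the update is scaled. The helper names and the 1024 x 1024 layer are illustrative assumptions, not dimensions taken from Qwen2.5 or from this adapter.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Illustrative square layer; 1024 is a placeholder, not a Qwen2.5 dimension.
added = lora_param_count(d_in=1024, d_out=1024, r=32)
full = 1024 * 1024
print(added, full, added / full)  # 65536 1048576 0.0625
print(lora_scaling(64, 32))       # 2.0
```

Summing `lora_param_count` over every targeted projection in every layer would give the adapter's total trainable parameter count.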
+
+ ### Training Hyperparameters
+
+ | Parameter | Value |
+ |---|---|
+ | Epochs | 1 |
+ | Batch size | 1 (× 4 gradient accumulation steps) |
+ | Learning rate | 2e-4 |
+ | Scheduler | Cosine with 50-step warmup |
+ | Optimizer | Paged AdamW 8-bit |
+ | Max sequence length | 2048 |
+ | NEFTune noise alpha | 5 |
+ | Precision | fp16 |
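As a rough illustration of the schedule in the table, the sketch below computes the learning rate at a given optimizer step under linear warmup followed by cosine decay. It assumes the decay runs all the way to zero and uses a hypothetical total of 1000 steps; neither is stated in the README, and `cosine_lr` is an illustrative helper rather than the training code.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 2e-4, warmup_steps: int = 50) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Batch size 1 with 4 accumulation steps gives 4 sequences per optimizer step.
effective_batch_size = 1 * 4

TOTAL_STEPS = 1000  # hypothetical; depends on how many samples were trained on
for step in (0, 25, 50, 525, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS))
```

The learning rate rises from 0 to 2e-4 over the first 50 steps, passes through half the peak at the midpoint of the decay phase, and reaches zero at the final step.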
+
+ ### Data Preprocessing: The Critical Step
+
+ The OpenThoughts-114k dataset contains DeepSeek-R1 reasoning traces with two sections:
+ - `<|begin_of_thought|>`: thousands of tokens of raw internal reasoning
+ - `<|begin_of_solution|>`: the clean, structured final answer
+
+ **We train only on the extracted solution block.** Training on the full traces causes the model to produce rambling, unfocused output. Extracting only the solution with a simple regex produced much cleaner, more focused output with the same model and the same hyperparameters.
+
+ ```python
+ import re
+
+ def formatting_func(example):
+     # `tokenizer` is the tokenizer loaded for the base model
+     role_map = {"human": "user", "gpt": "assistant"}
+     messages = []
+
+     if example.get("system"):
+         messages.append({"role": "system", "content": example["system"]})
+
+     for turn in example["conversations"]:
+         role = role_map.get(turn["from"], turn["from"])
+         content = turn["value"]
+
+         # Extract only the final solution
+         if role == "assistant":
+             match = re.search(
+                 r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>",
+                 content, re.DOTALL,
+             )
+             if match:
+                 content = match.group(1).strip()
+
+         messages.append({"role": role, "content": content})
+
+     return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
+ ```
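The extraction step can be tried in isolation, without a tokenizer or dataset. The sketch below applies the same regex to a made-up trace; `extract_solution` and the sample text are hypothetical and only illustrate the behavior, including the fallback when no solution block is found.

```python
import re

SOLUTION_RE = re.compile(
    r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL
)


def extract_solution(text: str) -> str:
    """Return the solution block if present, otherwise the text unchanged."""
    match = SOLUTION_RE.search(text)
    return match.group(1).strip() if match else text


# Hypothetical trace, shortened for illustration.
sample = (
    "<|begin_of_thought|>\nLet me try a few cases...\n<|end_of_thought|>\n"
    "<|begin_of_solution|>\n1. Each machine makes one widget in 5 minutes.\n"
    "2. So 100 machines make 100 widgets in 5 minutes.\n<|end_of_solution|>"
)
print(extract_solution(sample))
```

Only the two numbered steps are printed; the thought section and the delimiter tokens are dropped.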
+
+ ### Response Masking
+
+ Labels are set to `-100` on all non-assistant tokens, and `DataCollatorForSeq2Seq` pads the label sequences with the same value, so the cross-entropy loss is computed only on the tokens the model must generate at inference time. This improves sample efficiency: every gradient update is focused on useful generation.
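A minimal sketch of the masking idea, using plain Python lists instead of tensors. The token ids and the `mask_non_assistant` helper are hypothetical; how the training script located the assistant spans is not shown in the README.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_non_assistant(input_ids, assistant_mask):
    """Copy input_ids into labels, replacing non-assistant positions with IGNORE_INDEX."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(input_ids, assistant_mask)]


# Hypothetical token ids: 3 prompt tokens followed by 3 assistant tokens.
input_ids = [11, 12, 13, 21, 22, 23]
assistant_mask = [False, False, False, True, True, True]
labels = mask_non_assistant(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 21, 22, 23]
```

With labels built this way, only the last three positions contribute to the loss.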
+
+ ## Usage
+
+ ### Load with PEFT
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import PeftModel
+ import torch
+
+ # Load base model in 4-bit
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.float16,
+     bnb_4bit_use_double_quant=True,
+ )
+
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen2.5-1.5B-Instruct",
+     quantization_config=bnb_config,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Load adapter
+ model = PeftModel.from_pretrained(base_model, "rahmasaber/qwen2.5-iq-Finetuning-qlora")
+ tokenizer = AutoTokenizer.from_pretrained("rahmasaber/qwen2.5-iq-Finetuning-qlora")
+
+ model.eval()
+ ```
+
+ ### Generate
+
+ ```python
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant that thinks step-by-step."},
+     {"role": "user", "content": "If 5 machines produce 5 widgets in 5 minutes, how many minutes for 100 machines to produce 100 widgets?"},
+ ]
+
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         temperature=0.7,
+         top_p=0.9,
+         do_sample=True,
+     )
+
+ response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
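+
+ ### Compare Base vs Fine-Tuned
+
+ ```python
+ def generate(prompt):
+     # Helper wrapping the generation call from the previous section
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+     with torch.no_grad():
+         output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True)
+     return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+
+ # Disable adapter → base model behavior
+ model.disable_adapter_layers()
+ base_response = generate(prompt)
+
+ # Enable adapter → fine-tuned behavior
+ model.enable_adapter_layers()
+ ft_response = generate(prompt)
+ ```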
+
+ ## Evaluation
+
+ Tested on 10 handcrafted reasoning prompts across 5 categories:
+
+ | Category | # Prompts | What it tests |
+ |---|---|---|
+ | Logic Puzzles | 2 | Trick questions, careful reading |
+ | Math | 3 | Word problems, sequential operations |
+ | Reasoning | 2 | Formal logic, deductive puzzles |
+ | Code | 1 | Algorithm complexity analysis |
+ | Science | 2 | Physics principles, Archimedes |
+
+ ### Results vs Base Model
+
+ | Metric | Base | Fine-Tuned |
+ |---|---|---|
+ | Avg response length (tokens) | 314 | 275 (-12%) |
+ | Correct on "all but 9 sheep" | ✅ | ✅ |
+ | Correct on average speed (harmonic mean) | ✅ | ✅ |
+ | Correct on discount stacking (32%) | ✅ | ✅ |
+ | Correct on 5 machines/5 widgets | ❌ | ✅ |
+ | Structured step-by-step format | Sometimes | Consistently |
+
+ ### Held-Out Test Set
+
+ 200 examples were held out from the training sample to detect overfitting. The train/test loss gap stayed below 0.5, suggesting the model generalizes rather than memorizes.
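One way such a hold-out and gap check could be set up is sketched below. The seed, the helper names, and the use of 30,000 examples (taken from the Key Details table) are assumptions for illustration; the README does not show how the split was actually made.

```python
import random


def split_holdout(n_examples: int, n_holdout: int = 200, seed: int = 42):
    """Return (train_indices, holdout_indices) from a seeded shuffle."""
    if n_holdout > n_examples:
        raise ValueError("n_holdout cannot exceed n_examples")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return indices[n_holdout:], indices[:n_holdout]


def loss_gap_ok(train_loss: float, eval_loss: float, max_gap: float = 0.5) -> bool:
    """True when eval loss exceeds train loss by less than max_gap."""
    return (eval_loss - train_loss) < max_gap


train_idx, holdout_idx = split_holdout(n_examples=30_000)
print(len(train_idx), len(holdout_idx))  # 29800 200
```

Fixing the seed keeps the held-out examples identical across runs, so loss gaps from different runs stay comparable.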
+
+ ## Limitations
+
+ - **Small base model**: 1.5B parameters limit complex multi-hop reasoning
+ - **1 epoch on 1.2K-3K samples**: more data and epochs would likely improve accuracy
+ - **Self-evaluation bias**: LLM-as-judge uses the same model family; use a stronger external model (GPT-4, Claude) for rigorous evaluation
+ - **Science questions**: the fine-tuned model occasionally gets physics wrong (e.g., feather vs bowling ball on the Moon)
+ - **No benchmark scores**: not yet evaluated on GSM8K, MATH, or HumanEval
+
+ ## Files
+
+ ```
+ .
+ ├── adapter_config.json          # LoRA configuration
+ ├── adapter_model.safetensors    # LoRA weights (~50MB)
+ ├── tokenizer_config.json        # Tokenizer settings
+ ├── tokenizer.json               # Tokenizer vocabulary
+ ├── special_tokens_map.json      # Special token mappings
+ └── README.md                    # This file
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @misc{saber2026qwen25qlora,
+   title={QLoRA Fine-Tuning Qwen2.5-1.5B-Instruct on OpenThoughts-114k},
+   author={Rahma Saber},
+   year={2026},
+   url={https://huggingface.co/rahmasaber/qwen2.5-iq-Finetuning-qlora}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - [Qwen Team](https://huggingface.co/Qwen) for the base model
+ - [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) for the reasoning dataset
+ - [Hugging Face](https://huggingface.co/) for PEFT, TRL, and the Hub
+ - [Google Colab](https://colab.research.google.com/) for free GPU access