armand0e committed on
Commit
631b095
·
1 Parent(s): e1a3fdd

Create README.md

Files changed (1)
  1. README.md +390 -0
README.md ADDED
@@ -0,0 +1,390 @@
1
+ ---
2
+ base_model: TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill-v2
3
+ tags:
4
+ - text-generation-inference
5
+ - transformers
6
+ - unsloth
7
+ - gemma4
8
+ - reasoning
9
+ license: apache-2.0
10
+ datasets:
11
+ - TeichAI/Claude-Opus-4.6-Reasoning-887x
12
+ - TeichAI/claude-4.5-opus-high-reasoning-250x
13
+ - Crownelius/Opus-4.6-Reasoning-2100x-formatted
14
+ ---
15
+
16
+ # 🌟 Gemma 4 - 26B A4B x Claude Opus 4.6 (v2)
17
+
18
+ > **Build Environment & Features:**
19
+ > - **Fine-tuning Framework**: **Unsloth**
20
+ > - **Reasoning Effort**: **High**
21
+ > - This model combines Google's open-weights Gemma 4 architecture with reasoning behavior distilled from Claude Opus 4.6.
22
+ > - v2 fixes some looping and cut-off response issues. Different training parameters were also used.
23
+ > - This model worked successfully inside Cline, Codex, and Cursor to build functional web apps and scripts.
24
+
25
+ ![Gemma 4 Benchmarks](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/gemma-4-table_light_Web_with_Arena.jpg)
26
+
27
+ ## 💡 Model Introduction
28
+ **Gemma 4 - 26B A4B x Claude Opus 4.6** is fine-tuned from the `unsloth/gemma-4-26B-A4B-it` base model. Its goal is to absorb state-of-the-art reasoning through distillation, primarily from Claude Opus 4.6 interactions.
29
+
30
+ Because the training datasets were generated with the reasoning effort explicitly set to **High**, this model is tuned to break down complex problems and deliver precise, nuanced solutions across a variety of demanding domains.
31
+
32
+ ## 🗺️ Training Pipeline Overview
33
+
34
+ ```text
35
+ Base Model (unsloth/gemma-4-26B-A4B-it)
36
+         │
37
+         ▼
38
+ Supervised Fine-Tuning (SFT) + High-Effort Reasoning Datasets
39
+         │
40
+         ▼
41
+ Final Model (Gemma 4 - 26B A4B x Claude Opus 4.6)
42
+ ```
43
+
44
+ ## 📋 Stage Details & Benchmarks
45
+
46
+ *Benchmarks coming soon*
47
+
48
+ **Performance vs Size:**
49
+
50
+ > **Deep Dive Analysis:** For more comprehensive insights regarding the base capabilities of the Gemma 4 architecture, please refer to [this Analysis Document](https://huggingface.co/TeichAI/gemma-4-31B-it-Claude-Opus-Distill/resolve/main/Gemma%204%20Analysis.pdf).
51
+
52
+ ### 🔹 Supervised Fine-Tuning (Meeting Claude)
53
+
54
+ - **Objective:** To inject high-density reasoning logic and establish a strict format for complex problem-solving.
55
+ - **Methodology:** We utilized **Unsloth** for highly efficient memory and compute optimization during the fine-tuning process. The model was trained extensively on various reasoning trajectories from Claude Opus 4.6 to adopt a structured and efficient thinking pattern.
56
+
57
+ ### 📚 All Datasets Used
58
+
59
+ The training data consists of high-quality, high-effort reasoning distillation data drawn from three datasets:
60
+
61
+ | Dataset Name | Description / Purpose |
62
+ |--------------|-----------------------|
63
+ | `TeichAI/Claude-Opus-4.6-Reasoning-887x` | Core Claude Opus 4.6 reasoning trajectories. |
64
+ | `TeichAI/claude-4.5-opus-high-reasoning-250x` | Legacy high-intensity reasoning distillation. |
65
+ | `Crownelius/Opus-4.6-Reasoning-2100x-formatted` | Crownelius's extensively formatted Opus reasoning dataset for structural reinforcement. |
66
+
67
+ ## 🌟 Core Skills & Capabilities
68
+
69
+ Thanks to its robust base model and high-effort reasoning distillation, this model is highly optimized for the following use cases:
70
+
71
+ 1. **💻 Coding:** Advanced code generation, debugging, and software architecture planning.
72
+ 2. **🔬 Science:** Deep scientific reasoning, hypothesis evaluation, and analytical problem-solving.
73
+ 3. **🔎 Deep Research:** Navigating complex, multi-step research queries and synthesizing vast amounts of information.
74
+ 4. **🧠 General Purpose:** Highly capable instruction-following for everyday tasks requiring high logical coherence.
75
+
76
+ ## Getting Started
77
+
78
+ You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:
79
+
80
+ `pip install -U transformers torch accelerate`
81
+
82
+ Once you have everything installed, you can proceed to load the model with the code below:
83
+
84
+ ```python
+ from transformers import AutoProcessor, AutoModelForCausalLM
+
+ # Repository named in this card's `base_model` field;
+ # other Gemma 4 checkpoint IDs load the same way.
+ MODEL_ID = "TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill-v2"
+
+ # Load model
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID,
+     dtype="auto",
+     device_map="auto"
+ )
+ ```
97
+
98
+ Once the model is loaded, you can start generating output:
99
+
100
+ ```python
+ # Prompt
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Write a short joke about saving RAM."},
+ ]
+
+ # Process input
+ text = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False
+ )
+ inputs = processor(text=text, return_tensors="pt").to(model.device)
+ input_len = inputs["input_ids"].shape[-1]
+
+ # Generate output
+ outputs = model.generate(**inputs, max_new_tokens=1024)
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
+
+ # Parse output
+ processor.parse_response(response)
+ ```
124
+
125
+ To enable reasoning, set `enable_thinking=True` and the `parse_response` function will take care of parsing the thinking output.
126
+
127
+ Below, you will also find snippets for processing audio (E2B and E4B only), images, and video alongside text:
128
+
129
+ <details>
130
+ <summary>Code for processing Audio</summary>
131
+
132
+ Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process audio. To use it, make sure to install the following packages:
133
+
134
+
135
+ `pip install -U transformers torch librosa accelerate`
136
+
137
+ You can then load the model with the code below:
138
+
139
+ ```python
+ from transformers import AutoProcessor, AutoModelForMultimodalLM
+
+ MODEL_ID = "google/gemma-4-E2B-it"
+
+ # Load model
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
+ model = AutoModelForMultimodalLM.from_pretrained(
+     MODEL_ID,
+     dtype="auto",
+     device_map="auto"
+ )
+ ```
152
+
153
+ Once the model is loaded, you can start generating output by directly referencing the audio URL in the prompt:
154
+
155
+
156
+ ```python
+ # Prompt - add audio before text
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
+             {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
+         ]
+     }
+ ]
+
+ # Process input
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+     add_generation_prompt=True,
+ ).to(model.device)
+ input_len = inputs["input_ids"].shape[-1]
+
+ # Generate output
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
+
+ # Parse output
+ processor.parse_response(response)
+ ```
185
+
186
+ </details>
187
+
188
+ <details>
189
+ <summary>Code for processing Images</summary>
190
+
191
+ Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process images. To use it, make sure to install the following packages:
192
+
193
+
194
+ `pip install -U transformers torch torchvision accelerate`
195
+
196
+ You can then load the model with the code below:
197
+
198
+ ```python
+ from transformers import AutoProcessor, AutoModelForMultimodalLM
+
+ MODEL_ID = "google/gemma-4-31B-it"
+
+ # Load model
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
+ model = AutoModelForMultimodalLM.from_pretrained(
+     MODEL_ID,
+     dtype="auto",
+     device_map="auto"
+ )
+ ```
211
+
212
+ Once the model is loaded, you can start generating output by directly referencing the image URL in the prompt:
213
+
214
+
215
+ ```python
+ # Prompt - add image before text
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
+             {"type": "text", "text": "What is shown in this image?"}
+         ]
+     }
+ ]
+
+ # Process input
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+     add_generation_prompt=True,
+ ).to(model.device)
+ input_len = inputs["input_ids"].shape[-1]
+
+ # Generate output
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
+
+ # Parse output
+ processor.parse_response(response)
+ ```
243
+
244
+ </details>
245
+
246
+
247
+ <details>
248
+ <summary>Code for processing Videos</summary>
249
+
250
+ Instead of using `AutoModelForCausalLM`, you can use `AutoModelForMultimodalLM` to process videos. To use it, make sure to install the following packages:
251
+
252
+ `pip install -U transformers torch torchvision torchcodec librosa accelerate`
253
+
254
+ You can then load the model with the code below:
255
+
256
+ ```python
+ from transformers import AutoProcessor, AutoModelForMultimodalLM
+
+ MODEL_ID = "google/gemma-4-31B-it"
+
+ # Load model
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
+ model = AutoModelForMultimodalLM.from_pretrained(
+     MODEL_ID,
+     dtype="auto",
+     device_map="auto"
+ )
+ ```
269
+
270
+ Once the model is loaded, you can start generating output by directly referencing the video URL in the prompt:
271
+
272
+
273
+ ```python
+ # Prompt - add video before text
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
+             {"type": "text", "text": "Describe this video."}
+         ]
+     }
+ ]
+
+ # Process input
+ inputs = processor.apply_chat_template(
+     messages,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+     add_generation_prompt=True,
+ ).to(model.device)
+ input_len = inputs["input_ids"].shape[-1]
+
+ # Generate output
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
+
+ # Parse output
+ processor.parse_response(response)
+ ```
302
+
303
+ </details>
304
+
305
+ ## **Best Practices**
306
+
307
+ For the best performance, use these configurations and best practices:
308
+
309
+ ### 1. Sampling Parameters
310
+
311
+ Use the following standardized sampling configuration across all use cases:
312
+
313
+ * `temperature=1.0`
314
+ * `top_p=0.95`
315
+ * `top_k=64`
316
+
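As a minimal sketch, the recommended sampling values can be collected in one place and unpacked into `model.generate`. The helper name `sampling_kwargs` is hypothetical, and `do_sample=True` is included on the assumption that sampling has to be switched on explicitly for these values to take effect.

```python
def sampling_kwargs(max_new_tokens: int = 1024) -> dict:
    """Return generate() keyword arguments using the recommended sampling values."""
    return {
        "do_sample": True,  # assumption: sampling must be enabled for the values below
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
        "max_new_tokens": max_new_tokens,
    }


# With a loaded model and prepared inputs (neither is created in this sketch):
# outputs = model.generate(**inputs, **sampling_kwargs())
```

Keeping the values in a single helper makes it harder for different call sites to drift away from the recommended configuration.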
317
+ ### 2. Thinking Mode Configuration
318
+
319
+ Unlike Gemma 3, Gemma 4 models use the standard `system`, `assistant`, and `user` roles. To properly manage the thinking process, use the following control tokens:
320
+
321
+ * **Trigger Thinking:** Thinking is enabled by including the `<|think|>` token at the start of the system prompt. To disable thinking, remove the token.
322
+ * **Standard Generation:** When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
323
+ `<|channel>thought\n`**[Internal reasoning]**`<channel|>`
324
+ * **Disabled Thinking Behavior:** For all models except for the E2B and E4B variants, if thinking is disabled, the model will still generate the tags but with an empty thought block:
325
+ `<|channel>thought\n<channel|>`**[Final answer]**
326
+
327
+ > [!Note]
328
+ > Note that many libraries like Transformers and llama.cpp handle the complexities of the chat template for you.
329
+
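The token layout above can be illustrated with plain string handling. This is only a sketch of the format: the helper names `system_prompt` and `split_thought` are hypothetical, and in practice `processor.parse_response` or the library's chat template is the supported path. It also assumes other special tokens have already been removed from the response.

```python
THINK_TOKEN = "<|think|>"
THOUGHT_OPEN = "<|channel>thought\n"
THOUGHT_CLOSE = "<channel|>"


def system_prompt(text: str, thinking: bool) -> str:
    """Put the think token at the start of the system prompt when thinking is on."""
    return f"{THINK_TOKEN}{text}" if thinking else text


def split_thought(response: str) -> tuple[str, str]:
    """Split a raw response into (thought, answer); thought is '' when absent or empty."""
    before, opened, rest = response.partition(THOUGHT_OPEN)
    if not opened:
        # No thought block at all: the whole response is the answer.
        return "", response.strip()
    thought, closed, answer = rest.partition(THOUGHT_CLOSE)
    if not closed:
        # Unterminated block (for example a truncated generation): no answer yet.
        return thought.strip(), ""
    return thought.strip(), (before + answer).strip()
```

An empty thought block, which larger models emit when thinking is disabled, comes back as an empty string for the thought and the full final answer.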
330
+ ### 3. Multi-Turn Conversations
331
+
332
+ * **No Thinking Content in History**: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must *not be added* before the next user turn begins.
333
+
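One way to follow this rule is to strip the thought block from each raw reply before it is stored in the conversation history. The sketch below assumes the thought block uses the `<|channel>thought\n…<channel|>` layout described earlier; `append_turn` is a hypothetical helper, not part of any library.

```python
import re

# Matches one thought block, including an empty one.
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def append_turn(history: list[dict], raw_reply: str, next_user_message: str) -> list[dict]:
    """Return a new history holding only the final answer plus the next user turn."""
    final_answer = THOUGHT_BLOCK.sub("", raw_reply).strip()
    return history + [
        {"role": "assistant", "content": final_answer},
        {"role": "user", "content": next_user_message},
    ]


history = [{"role": "user", "content": "What is 2 + 2?"}]
history = append_turn(history, "<|channel>thought\nsimple sum<channel|>4", "And 3 + 3?")
```

The function returns a new list rather than mutating its argument, so the caller decides when the stored history is replaced.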
334
+ ### 4. Modality Order
335
+
336
+ * For optimal performance with multimodal inputs, place image and/or audio content **before** the text in your prompt.
337
+
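A small helper can enforce this ordering when messages are assembled programmatically. The sketch reuses the content-item shapes from the Getting Started snippets; `user_message` is a hypothetical name.

```python
def user_message(text: str, media: list[dict]) -> dict:
    """Build a user turn whose media items all come before the text item."""
    return {"role": "user", "content": [*media, {"type": "text", "text": text}]}


message = user_message(
    "What is shown in this image?",
    [{"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"}],
)
```

The resulting dictionary can be placed in the `messages` list passed to `processor.apply_chat_template`.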
338
+ ### 5. Variable Image Resolution
339
+
340
+ Aside from variable aspect ratios, Gemma 4 supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.
341
+
342
+ * The supported token budgets are: **70**, **140**, **280**, **560**, and **1120**.
343
+ * Use *lower budgets* for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
344
+ * Use *higher budgets* for tasks like OCR, document parsing, or reading small text.
345
+
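Because only five budgets are supported, it can help to snap an arbitrary requested value to the nearest valid one before configuring the processor. The sketch below does only that; `nearest_budget` is a hypothetical helper, and how the chosen budget is passed to the processor is not covered here.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def nearest_budget(requested: int) -> int:
    """Snap a requested visual token budget to the closest supported value.

    Ties resolve to the lower budget, which favors faster inference.
    """
    return min(SUPPORTED_BUDGETS, key=lambda b: (abs(b - requested), b))
```

For example, a request for 300 tokens maps to 280, while a request for 1000 maps to 1120.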
346
+ ### 6. Audio
347
+
348
+ Use the following prompt structures for audio processing:
349
+
350
+ * **Automatic Speech Recognition (ASR)**
351
+
352
+ ```text
353
+ Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.
354
+
355
+ Follow these specific instructions for formatting the answer:
356
+ * Only output the transcription, with no newlines.
357
+ * When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.
358
+ ```
359
+
360
+ * **Automatic Speech Translation (AST)**
361
+
362
+ ```text
363
+ Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
364
+ When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
365
+ ```
366
+
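The two prompt structures above can be filled in with ordinary string formatting. This is a sketch with hypothetical helper names (`asr_prompt`, `ast_prompt`); the template text is copied from the prompts above, with lowercase placeholder names.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {language} into {language} text.\n\n"
    "Follow these specific instructions for formatting the answer:\n"
    "* Only output the transcription, with no newlines.\n"
    "* When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)

AST_TEMPLATE = (
    "Transcribe the following speech segment in {source}, then translate it into {target}.\n"
    "When formatting the answer, first output the transcription in {source}, then one "
    "newline, then output the string '{target}: ', then the translation in {target}."
)


def asr_prompt(language: str) -> str:
    """Build the speech recognition prompt for one language."""
    return ASR_TEMPLATE.format(language=language)


def ast_prompt(source: str, target: str) -> str:
    """Build the speech translation prompt for a source and target language."""
    return AST_TEMPLATE.format(source=source, target=target)
```

The returned string goes in the `text` item of the user message, after the audio item.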
367
+ ### 7. Audio and Video Length
368
+
369
+ All models support image inputs and can process videos as sequences of frames; the E2B and E4B models also support audio inputs. Audio inputs can be at most 30 seconds long. Video inputs can be at most 60 seconds long, assuming frames are processed at one frame per second.
370
+
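These limits can be checked before a request is sent. The sketch below is illustrative only: the helper names are hypothetical, and it simply applies the 30-second audio limit and the 60-second, one-frame-per-second video assumption stated above.

```python
MAX_AUDIO_SECONDS = 30
MAX_VIDEO_SECONDS = 60


def audio_fits(duration_s: float) -> bool:
    """True when an audio clip is within the 30-second limit."""
    return duration_s <= MAX_AUDIO_SECONDS


def frame_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Timestamps in seconds of sampled frames, capped at the 60-second video limit."""
    usable = min(duration_s, MAX_VIDEO_SECONDS)
    count = int(usable * fps)
    return [i / fps for i in range(count)]
```

At the default rate, a 90-second video yields 60 timestamps, so anything after the first minute is not sampled.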
371
+ ## 🙏 Acknowledgements
372
+
373
+ - **Google**: For providing an exceptional open weights model. Read more about Gemma 4 on the [Google Innovation Blog](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/).
374
+ - **Unsloth**: For assembling ready-to-use, cutting-edge fine-tuning environments that make this work possible.
375
+ - **Crownelius**: For creating and sharing his awesome Opus reasoning dataset with the community.
376
+
377
+
378
+ ## 📖 Citation
379
+
380
+ If you use this model in your research or projects, please cite:
381
+
382
+ ```bibtex
383
+ @misc{teichai_gemma4_26b_a4b_opus_distilled_v2,
384
+ title = {Gemma-4-26B-A4B-it-Claude-Opus-Distill-v2},
385
+ author = {TeichAI},
386
+ year = {2026},
387
+ publisher = {Hugging Face},
388
+ howpublished = {\url{https://huggingface.co/TeichAI/gemma-4-26B-A4B-it-Claude-Opus-Distill-v2}}
389
+ }
390
+ ```