---
base_model: google/gemma-4-12B-it
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
license: other
tags:
- transformers
- safetensors
- fine-tuned
- qlora
- reasoning
- compact-reasoning
- gemma-4
datasets:
- hotdogs/uka-glm-5.2
- Scale-or-Reason/general-reasoning-ift-pairs
- samcheng0/lumia-reasoning-sft-v1
- HSH-Intelligence/verified-math-reasoning-3k
- kd13/CodeDebug-Instruct-v2-Reasoning
- Madarabr/cortex-adaptive-thinking
- CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288
model-index:
- name: Grug-12B
  results:
  - task:
      type: text-generation
      name: EOS-only local math reasoning proxy
    dataset:
      name: Local 36-row math reasoning eval
      type: local
    metrics:
    - type: proxy_accuracy
      value: 1.0
      name: Grug-12B proxy accuracy
    - type: generated_tokens
      value: 2482
      name: Grug-12B total generated tokens
    - type: avg_generated_tokens
      value: 68.9444
      name: Grug-12B average generated tokens
---

![Grug 12B banner](banner.png)

# Grug 12B

Grug 12B is a compact-reasoning fine-tune of `google/gemma-4-12B-it`. It was trained to keep the useful information from a reasoning trace while making the trace shorter, denser, and less verbose.

This repository is published as merged Transformers/safetensors model weights. It was trained with QLoRA, then merged into the base model for release.

## What Changed

The training target is a terse internal-reasoning style: short high-density steps, fewer filler words, and explicit preservation of key constraints, branching decisions, invariants, edge cases, and final-answer checks.

The goal is lower reasoning-token usage relative to the base model while preserving answer quality. It is not meant to hide uncertainty or remove needed reasoning.

## Training Data

The data pipeline started from a recent, filtered reasoning pool and converted verbose traces into compact traces before SFT packing.

Source gate:

- Run date: June 30, 2026.
- Default freshness cutoff: 45 days.
  Sources older than May 16, 2026 were rejected unless manually allowed.
- Allowed train licenses: MIT, Apache-2.0, CC-BY-4.0, CC0-1.0.
- Hard reject terms included OpenAI, ChatGPT, GPT-5, Claude, Anthropic, Opus, Sonnet, and Gemini.
- Soft-risk sources marked as synthetic/distill were manually reviewed or rejected depending on provenance and license.

Final verified source mix:

| Source | License | Domain | Verified rows |
|---|---|---|---:|
| `hotdogs/uka-glm-5.2` | MIT | agent code | 1,617 |
| `Scale-or-Reason/general-reasoning-ift-pairs` | MIT | general reasoning | 1,305 |
| `samcheng0/lumia-reasoning-sft-v1` | Apache-2.0 | code reasoning | 1,103 |
| `HSH-Intelligence/verified-math-reasoning-3k` | Apache-2.0 | math | 672 |
| `kd13/CodeDebug-Instruct-v2-Reasoning` | MIT | code debug | 600 |
| `Madarabr/cortex-adaptive-thinking` | Apache-2.0 | adaptive reasoning | 300 |
| `CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288` | Apache-2.0 | code reasoning | 143 |

Row counts:

- Normalized recent reasoning pool: 8,680 rows.
- Selected verbose reasoning set: 6,144 rows.
- Compact raw transform output: 6,144 rows.
- Verified compact rows: 5,740 rows.
- Rejected compact rows: 404 rows.
- Packed SFT split: 5,166 train / 287 validation / 287 test.

The compact reasoning transform was generated with `cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit` served by vLLM. Rows were checked for compression ratio, answer preservation, and obvious loss of critical reasoning information before training.

## Training Procedure

Training was completion-only SFT: prompt tokens were masked with `-100`, and only the assistant completion was trained.

Core settings:

- Base model: `google/gemma-4-12B-it`.
- Method: QLoRA / PEFT LoRA, merged into full model weights for upload.
- Quantization during training: 4-bit NF4 with BF16 compute.
- Max sequence length: 6,144.
- LoRA rank: 16.
- LoRA alpha: 32.
- LoRA dropout: 0.05.
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- Batch size: 1.
- Gradient accumulation: 16.
- Learning rate: 8e-5.
- Max steps: 100.
- Eval steps: 50.
- Save steps: 50.
- Train runtime: about 35 minutes 20 seconds on one A100.
- Final eval loss: 0.8895.

No train or validation rows were skipped in the final run.

## Local Evaluation

Small local EOS-only math proxy eval, with no generation token cap:

| Model | Rows | Total generated tokens | Avg generated tokens | Proxy accuracy | Numeric last-match rate |
|---|---:|---:|---:|---:|---:|
| `google/gemma-4-12B-it` base | 36 | 8,227 | 228.53 | 91.7% | 86.1% |
| Grug 12B | 36 | 2,482 | 68.94 | 100.0% | 100.0% |

This is a small proxy eval, not a broad benchmark. Treat it as a smoke test showing the intended token-efficiency direction, then run your own benchmark.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Grug-12B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [
    {"role": "user", "content": "If a shirt is $80 and goes 25% off, what is the sale price?"}
]

# Build the chat prompt. return_dict=True returns input_ids together with
# the attention mask, so both can be passed to generate().
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
).to(model.device)

# Greedy decoding. The decoded text contains the prompt followed by the completion.
with torch.no_grad():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=512)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For token-efficiency tests, compare against the base model with the same prompt, same decoding settings, and no artificial token cap unless your deployment requires one.

## Limitations

- This is an experimental fine-tune.
- It may over-compress reasoning on tasks that need longer derivations.
- It inherits the base model's limitations and safety behavior.
- The reported eval is small and local.
- The dataset includes synthetic and distilled reasoning traces from the listed open datasets; review source licenses and provenance before using this in commercial or sensitive settings.

## Acknowledgements

Thanks to [Lambda](https://lambda.com/), the inference provider, for compute credits that supported the dataset work, training, and evaluation.
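
## Appendix: Checking the Token Savings

The averages in the Local Evaluation table, and the overall token reduction they imply, can be recomputed from the reported totals. The sketch below is a minimal illustration only: the helper names `avg_tokens` and `token_reduction` are hypothetical and not part of any released tooling, and the constants are copied by hand from the table in this card.

```python
# Figures copied from the Local Evaluation table in this card.
ROWS = 36
BASE_TOTAL_TOKENS = 8227  # google/gemma-4-12B-it base
GRUG_TOTAL_TOKENS = 2482  # Grug 12B


def avg_tokens(total_tokens: int, rows: int) -> float:
    """Average generated tokens per eval row (hypothetical helper)."""
    return total_tokens / rows


def token_reduction(base_total: int, tuned_total: int) -> float:
    """Fraction of generated tokens saved relative to the base model."""
    return 1.0 - tuned_total / base_total


base_avg = avg_tokens(BASE_TOTAL_TOKENS, ROWS)
grug_avg = avg_tokens(GRUG_TOTAL_TOKENS, ROWS)
reduction = token_reduction(BASE_TOTAL_TOKENS, GRUG_TOTAL_TOKENS)

print(f"base avg tokens: {base_avg:.2f}")
print(f"grug avg tokens: {grug_avg:.2f}")
print(f"token reduction: {reduction:.1%}")
```

The same two helpers can be pointed at totals from your own benchmark. As with the table itself, a 36-row result only indicates direction and should not be read as a general estimate.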