---
base_model: google/gemma-4-12B-it
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
license: other
tags:
- transformers
- safetensors
- fine-tuned
- qlora
- reasoning
- compact-reasoning
- gemma-4
datasets:
- hotdogs/uka-glm-5.2
- Scale-or-Reason/general-reasoning-ift-pairs
- samcheng0/lumia-reasoning-sft-v1
- HSH-Intelligence/verified-math-reasoning-3k
- kd13/CodeDebug-Instruct-v2-Reasoning
- Madarabr/cortex-adaptive-thinking
- CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288
model-index:
- name: Grug-12B
  results:
  - task:
      type: text-generation
      name: EOS-only local math reasoning proxy
    dataset:
      name: Local 36-row math reasoning eval
      type: local
    metrics:
    - type: proxy_accuracy
      value: 1.0
      name: Grug-12B proxy accuracy
    - type: generated_tokens
      value: 2482
      name: Grug-12B total generated tokens
    - type: avg_generated_tokens
      value: 68.9444
      name: Grug-12B average generated tokens
---

![Grug 12B banner](banner.png)

# Grug 12B

Grug 12B is a compact-reasoning fine-tune of `google/gemma-4-12B-it`.
It was trained to keep the useful information in a reasoning trace while
making the trace shorter and denser.

This repository contains merged model weights in Transformers/safetensors
format. The model was trained with QLoRA, and the adapter was then merged into
the base model for release.

## What Changed

The training target is a terse internal-reasoning style: short high-density
steps, fewer filler words, and explicit preservation of key constraints,
branching decisions, invariants, edge cases, and final-answer checks.

The goal is lower reasoning-token usage relative to the base model while
preserving answer quality. It is not meant to hide uncertainty or remove needed
reasoning.

## Training Data

The data pipeline started from a recent, filtered reasoning pool and converted
verbose traces into compact traces before SFT packing.

Source gate:

- Run date: June 30, 2026.
- Default freshness cutoff: 45 days. Sources older than May 16, 2026 were
  rejected unless manually allowed.
- Allowed train licenses: MIT, Apache-2.0, CC-BY-4.0, CC0-1.0.
- Hard reject terms included OpenAI, ChatGPT, GPT-5, Claude, Anthropic, Opus,
  Sonnet, and Gemini.
- Soft-risk sources marked as synthetic/distill were manually reviewed or
  rejected depending on provenance and license.
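
The gate above can be sketched as a small filter. This is a minimal
illustration, not the pipeline's actual code: the `SourceInfo` record, its field
names, and the choice to match reject terms as whole words in the source name
and description are assumptions. The run date, cutoff, license list, and reject
terms are taken from the list above.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

RUN_DATE = date(2026, 6, 30)
FRESHNESS_DAYS = 45  # cutoff works out to May 16, 2026
ALLOWED_LICENSES = {"mit", "apache-2.0", "cc-by-4.0", "cc0-1.0"}
HARD_REJECT_TERMS = ["OpenAI", "ChatGPT", "GPT-5", "Claude", "Anthropic",
                     "Opus", "Sonnet", "Gemini"]

# Whole-word, case-insensitive match so that e.g. "corpus" does not hit "Opus".
_REJECT_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in HARD_REJECT_TERMS) + r")\b",
    re.IGNORECASE,
)


@dataclass
class SourceInfo:
    """Hypothetical metadata record for one candidate dataset."""
    name: str
    license: str
    last_modified: date
    description: str
    manually_allowed: bool = False


def gate(source, run_date=RUN_DATE):
    """Return (accepted, reason) for one candidate source."""
    cutoff = run_date - timedelta(days=FRESHNESS_DAYS)
    if source.license.lower() not in ALLOWED_LICENSES:
        return False, "license"
    if _REJECT_RE.search(f"{source.name} {source.description}"):
        return False, "hard_reject_term"
    # Only the freshness rule can be overridden by a manual allow.
    if source.last_modified < cutoff and not source.manually_allowed:
        return False, "stale"
    return True, "ok"
```

Soft-risk review (synthetic or distilled sources) is a manual step in the list
above and is deliberately left out of this sketch.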

Final verified source mix:

| Source | License | Domain | Verified rows |
|---|---|---|---:|
| `hotdogs/uka-glm-5.2` | MIT | agent code | 1,617 |
| `Scale-or-Reason/general-reasoning-ift-pairs` | MIT | general reasoning | 1,305 |
| `samcheng0/lumia-reasoning-sft-v1` | Apache-2.0 | code reasoning | 1,103 |
| `HSH-Intelligence/verified-math-reasoning-3k` | Apache-2.0 | math | 672 |
| `kd13/CodeDebug-Instruct-v2-Reasoning` | MIT | code debug | 600 |
| `Madarabr/cortex-adaptive-thinking` | Apache-2.0 | adaptive reasoning | 300 |
| `CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288` | Apache-2.0 | code reasoning | 143 |

Row counts:

- Normalized recent reasoning pool: 8,680 rows.
- Selected verbose reasoning set: 6,144 rows.
- Compact raw transform output: 6,144 rows.
- Verified compact rows: 5,740 rows.
- Rejected compact rows: 404 rows.
- Packed SFT split: 5,166 train / 287 validation / 287 test.
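
The counts above are internally consistent, and the split sizes match a
90/5/5 partition of the verified rows. The sketch below reproduces those sizes;
the 90/5/5 ratio is inferred from the numbers, and the shuffle and seed are
assumptions rather than the pipeline's documented behavior.

```python
import random


def split_rows(rows, val_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seed is illustrative
    n_val = round(len(rows) * val_frac)
    n_test = round(len(rows) * test_frac)
    n_train = len(rows) - n_val - n_test
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],
    )


verified, rejected = 5_740, 404
assert verified + rejected == 6_144  # matches the compact raw transform output

train, val, test = split_rows(range(verified))
print(len(train), len(val), len(test))  # 5166 287 287
```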

The compact traces were generated with
`cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit` served by vLLM. Before training, each row
was checked for compression ratio, answer preservation, and obvious loss of
critical reasoning information.
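
A rough sketch of what row-level checks of this kind could look like follows.
The function name, the thresholds, the whitespace-based length measure, and the
exact-match answer comparison are all assumptions made for illustration; the
actual verification rules are not documented here, and the check for lost
reasoning information is not modeled.

```python
def verify_compact_row(verbose_trace, compact_trace, verbose_answer,
                       compact_answer, min_ratio=0.05, max_ratio=0.7):
    """Return (ok, reason). Thresholds are illustrative placeholders."""
    # Whitespace word counts stand in for real tokenizer counts.
    verbose_len = len(verbose_trace.split())
    compact_len = len(compact_trace.split())
    if verbose_len == 0 or compact_len == 0:
        return False, "empty_trace"

    ratio = compact_len / verbose_len
    if ratio > max_ratio:
        return False, "not_compressed_enough"
    if ratio < min_ratio:
        return False, "over_compressed"

    # Answer preservation: the final answer must survive the rewrite.
    if verbose_answer.strip().lower() != compact_answer.strip().lower():
        return False, "answer_changed"
    return True, "ok"
```

A lower bound on the ratio is included because a trace that shrinks to almost
nothing has probably dropped needed steps, which is the failure mode the card
calls out under Limitations.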

## Training Procedure

Training used completion-only SFT: prompt tokens were masked with the label
`-100`, so only the assistant completion tokens contributed to the loss.
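
The masking step can be illustrated with plain lists. This is a minimal sketch
with made-up token IDs and a hypothetical `build_labels` helper; `-100` is the
default `ignore_index` of PyTorch's cross-entropy loss, which is why labels set
to it are skipped.

```python
IGNORE_INDEX = -100  # default ignore_index in torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Token IDs below are illustrative, not real vocabulary entries.
input_ids, labels = build_labels([11, 12, 13], [21, 22, 2])
print(input_ids)  # [11, 12, 13, 21, 22, 2]
print(labels)     # [-100, -100, -100, 21, 22, 2]
```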

Core settings:

- Base model: `google/gemma-4-12B-it`.
- Method: QLoRA / PEFT LoRA, merged into full model weights for upload.
- Quantization during training: 4-bit NF4 with BF16 compute.
- Max sequence length: 6,144.
- LoRA rank: 16.
- LoRA alpha: 32.
- LoRA dropout: 0.05.
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`,
  `up_proj`, `down_proj`.
- Batch size: 1.
- Gradient accumulation: 16.
- Learning rate: 8e-5.
- Max steps: 100.
- Eval steps: 50.
- Save steps: 50.
- Train runtime: about 35 minutes 20 seconds on one A100.
- Final eval loss: 0.8895.
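
A few derived quantities follow directly from these settings. The arithmetic
below assumes a single device (as stated above) and the standard LoRA scaling
of `alpha / rank`; the token figure is only an upper bound, since it treats
every sequence as filled to the maximum length.

```python
batch_size = 1
grad_accum = 16
max_steps = 100
max_seq_len = 6_144
lora_rank, lora_alpha = 16, 32
runtime_seconds = 35 * 60 + 20  # "about 35 minutes 20 seconds"

effective_batch = batch_size * grad_accum         # sequences per optimizer step
sequences_seen = effective_batch * max_steps      # sequences over the whole run
token_upper_bound = sequences_seen * max_seq_len  # if every sequence were full
lora_scale = lora_alpha / lora_rank               # standard LoRA scaling factor
seconds_per_step = runtime_seconds / max_steps

print(effective_batch, sequences_seen, token_upper_bound)  # 16 1600 9830400
print(lora_scale, seconds_per_step)                        # 2.0 21.2
```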

No train or validation rows were skipped in the final run.

## Local Evaluation

A small local math proxy eval in which generation stopped only at EOS, with no
generation token cap:

| Model | Rows | Total generated tokens | Avg generated tokens | Proxy accuracy | Numeric last-match rate |
|---|---:|---:|---:|---:|---:|
| `google/gemma-4-12B-it` base | 36 | 8,227 | 228.53 | 91.7% | 86.1% |
| Grug 12B | 36 | 2,482 | 68.94 | 100.0% | 100.0% |
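
The card does not define "numeric last-match rate". One plausible reading,
sketched below, is the share of rows where the last number in the model output
equals the reference answer. The function names, the regular expression, and
the tolerance are assumptions for illustration, not the eval's actual scoring
code.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numeric_last_match(output, reference, tol=1e-6):
    """True if the last number in output equals the reference value."""
    value = last_number(output)
    return value is not None and abs(value - float(reference)) <= tol


def match_rate(outputs, references):
    """Fraction of rows whose last number matches the reference."""
    hits = sum(numeric_last_match(o, r) for o, r in zip(outputs, references))
    return hits / len(references)
```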

This is a small proxy eval, not a broad benchmark. Treat it as a smoke test
showing the intended token-efficiency direction, then run your own benchmark.
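
For reference, the table implies roughly a 70% reduction in generated tokens on
this eval. The arithmetic below uses only the reported numbers; the per-row
correct counts are inferred from the rounded percentages and are not reported
directly.

```python
rows = 36
base_tokens, grug_tokens = 8_227, 2_482

reduction = 1 - grug_tokens / base_tokens  # fraction of tokens saved, ~0.698
ratio = base_tokens / grug_tokens          # base uses ~3.31x as many tokens

# Inferred from 91.7% and 86.1% of 36 rows (an inference, not reported data).
base_correct = round(0.917 * rows)
base_last_match = round(0.861 * rows)

print(f"reduction: {reduction:.1%}, ratio: {ratio:.2f}x")
print(f"base correct: {base_correct}/{rows}, last-match: {base_last_match}/{rows}")
```

With 36 rows, the accuracy gap corresponds to only a handful of examples, which
is why the result is best read as a direction rather than a measurement.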

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Grug-12B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [
    {"role": "user", "content": "If a shirt is $80 and goes 25% off, what is the sale price?"}
]
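inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,  # return input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))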
```

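For token-efficiency tests, compare against the base model using the same
prompt and the same decoding settings, and do not cap generation length unless
your deployment requires it. The example above sets `max_new_tokens=512`, so
raise that limit when measuring token counts so that long outputs are not
truncated.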
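
When comparing token counts, count only the newly generated tokens, not the
prompt. The helper below is a hypothetical sketch that works on plain lists of
token IDs; in practice the IDs would come from `model.generate`, and the EOS IDs
from the tokenizer or generation config. Counting the EOS token itself is a
convention chosen here, so apply the same convention to both models.

```python
def count_generated_tokens(output_ids, prompt_len, eos_token_ids):
    """Count tokens after the prompt, up to and including the first EOS."""
    new_ids = list(output_ids[prompt_len:])
    for i, token_id in enumerate(new_ids):
        if token_id in eos_token_ids:
            return i + 1
    return len(new_ids)  # no EOS found, e.g. generation hit a length cap


def total_generated(outputs, prompt_lens, eos_token_ids):
    """Sum generated-token counts over a batch of eval rows."""
    return sum(
        count_generated_tokens(ids, n, eos_token_ids)
        for ids, n in zip(outputs, prompt_lens)
    )


# Illustrative IDs: prompt of length 3, EOS id 1, trailing padding of 0s.
print(count_generated_tokens([5, 6, 7, 8, 9, 1, 0, 0], 3, {1}))  # 3
```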

## Limitations

- This is an experimental fine-tune.
- It may over-compress reasoning on tasks that need longer derivations.
- It inherits the base model's limitations and safety behavior.
- The reported eval is small and local.
- The dataset includes synthetic and distilled reasoning traces from the
  listed open datasets; review source licenses and provenance before using this
  in commercial or sensitive settings.

## Acknowledgements

Thanks to [Lambda](https://lambda.com/), the inference provider, for compute
credits that supported the dataset work, training, and evaluation.