metadata

base_model: google/gemma-4-12B-it
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
license: other
tags:
  - transformers
  - safetensors
  - fine-tuned
  - qlora
  - reasoning
  - compact-reasoning
  - gemma-4
datasets:
  - hotdogs/uka-glm-5.2
  - Scale-or-Reason/general-reasoning-ift-pairs
  - samcheng0/lumia-reasoning-sft-v1
  - HSH-Intelligence/verified-math-reasoning-3k
  - kd13/CodeDebug-Instruct-v2-Reasoning
  - Madarabr/cortex-adaptive-thinking
  - >-
    CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288
model-index:
  - name: Grug-12B
    results:
      - task:
          type: text-generation
          name: EOS-only local math reasoning proxy
        dataset:
          name: Local 36-row math reasoning eval
          type: local
        metrics:
          - type: proxy_accuracy
            value: 1
            name: Grug-12B proxy accuracy
          - type: generated_tokens
            value: 2482
            name: Grug-12B total generated tokens
          - type: avg_generated_tokens
            value: 68.9444
            name: Grug-12B average generated tokens

Grug 12B

Grug 12B is a compact-reasoning fine-tune of google/gemma-4-12B-it. It was trained to keep the useful information from a reasoning trace while making the trace shorter, denser, and less verbose.

This repository contains merged model weights in Transformers/safetensors format. The model was trained with QLoRA, and the adapter was then merged into the base model for release.

What Changed

The training target is a terse internal-reasoning style: short high-density steps, fewer filler words, and explicit preservation of key constraints, branching decisions, invariants, edge cases, and final-answer checks.

The goal is lower reasoning-token usage relative to the base model while preserving answer quality. It is not meant to hide uncertainty or remove needed reasoning.

Training Data

The data pipeline started from a recent, filtered reasoning pool and converted verbose traces into compact traces before SFT packing.

Source gate:

Run date: June 30, 2026.
Default freshness cutoff: 45 days. Sources older than May 16, 2026 were rejected unless manually allowed.
Allowed train licenses: MIT, Apache-2.0, CC-BY-4.0, CC0-1.0.
Hard reject terms included OpenAI, ChatGPT, GPT-5, Claude, Anthropic, Opus, Sonnet, and Gemini.
Soft-risk sources marked as synthetic/distill were manually reviewed or rejected depending on provenance and license.
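
The source gate above can be sketched as a small filter. This is a minimal illustration, not the pipeline's actual code: the function `passes_source_gate`, its argument names, and the whole-word matching of reject terms are assumptions. The run date, cutoff window, licenses, and terms are taken from the list above. The manual review of soft-risk sources is not modeled.

```python
import re
from datetime import date, timedelta

RUN_DATE = date(2026, 6, 30)
FRESHNESS_DAYS = 45
CUTOFF = RUN_DATE - timedelta(days=FRESHNESS_DAYS)  # 2026-05-16

ALLOWED_LICENSES = {"mit", "apache-2.0", "cc-by-4.0", "cc0-1.0"}
HARD_REJECT_TERMS = (
    "openai", "chatgpt", "gpt-5", "claude", "anthropic", "opus", "sonnet", "gemini",
)
# Whole-word matching (an assumption) so that "opus" does not reject "corpus".
REJECT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in HARD_REJECT_TERMS) + r")\b",
    re.IGNORECASE,
)


def passes_source_gate(description, license_id, last_modified, manually_allowed=False):
    """Return True if a candidate source clears the license, term, and freshness checks."""
    if license_id.lower() not in ALLOWED_LICENSES:
        return False
    if REJECT_PATTERN.search(description):
        return False
    # Sources older than the cutoff are rejected unless manually allowed.
    if last_modified < CUTOFF and not manually_allowed:
        return False
    return True
```

Applying the manual override only to the freshness check follows the wording of the list; whether the real pipeline also allowed overrides for licenses or reject terms is not stated.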

Final verified source mix:

| Source | License | Domain | Verified rows |
| --- | --- | --- | ---: |
| `hotdogs/uka-glm-5.2` | MIT | agent code | 1,617 |
| `Scale-or-Reason/general-reasoning-ift-pairs` | MIT | general reasoning | 1,305 |
| `samcheng0/lumia-reasoning-sft-v1` | Apache-2.0 | code reasoning | 1,103 |
| `HSH-Intelligence/verified-math-reasoning-3k` | Apache-2.0 | math | 672 |
| `kd13/CodeDebug-Instruct-v2-Reasoning` | MIT | code debug | 600 |
| `Madarabr/cortex-adaptive-thinking` | Apache-2.0 | adaptive reasoning | 300 |
| `CL-From-Nothing/code_rose_initial_1_7B_SFT_10K_rollouts_Qwen3-4B-Thinking-2507_k12_t0.7_maxtok12288` | Apache-2.0 | code reasoning | 143 |

Row counts:

Normalized recent reasoning pool: 8,680 rows.
Selected verbose reasoning set: 6,144 rows.
Compact raw transform output: 6,144 rows.
Verified compact rows: 5,740 rows.
Rejected compact rows: 404 rows.
Packed SFT split: 5,166 train / 287 validation / 287 test.

The compact reasoning transform was generated with cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit served by vLLM. Rows were checked for compression ratio, answer preservation, and obvious loss of critical reasoning information before training.
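
A row-level check of the kind described above might look like the following. This is a hedged sketch under stated assumptions: the row field names, the whitespace-based length measure, the `max_ratio=0.6` threshold, and the string-equality test for answer preservation are all hypothetical, since the card does not give the actual criteria. The check for lost critical reasoning information is not modeled.

```python
def compression_ratio(verbose_trace, compact_trace):
    """Compact length divided by verbose length, counted in whitespace-separated words."""
    verbose_len = len(verbose_trace.split())
    compact_len = len(compact_trace.split())
    return compact_len / verbose_len if verbose_len else 1.0


def normalize_answer(answer):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def verify_compact_row(row, max_ratio=0.6):
    """Return (accepted, reason). max_ratio is an illustrative threshold, not the real one."""
    ratio = compression_ratio(row["verbose_trace"], row["compact_trace"])
    if ratio > max_ratio:
        return False, "not compressed enough"
    if normalize_answer(row["verbose_answer"]) != normalize_answer(row["compact_answer"]):
        return False, "answer changed"
    return True, "ok"
```

A real pipeline would more likely measure length in tokenizer tokens than in words; words are used here only to keep the sketch dependency-free.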

Training Procedure

Training was completion-only SFT: prompt tokens were masked with -100, and only the assistant completion was trained.
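
The masking can be illustrated with plain lists. This is a minimal sketch rather than the training code: `build_labels` is a hypothetical helper and the token IDs are made up. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Made-up token IDs: three prompt tokens followed by two completion tokens.
input_ids, labels = build_labels([5, 6, 7], [8, 9])
print(input_ids)  # [5, 6, 7, 8, 9]
print(labels)     # [-100, -100, -100, 8, 9]
```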

Core settings:

Base model: google/gemma-4-12B-it.
Method: QLoRA / PEFT LoRA, merged into full model weights for upload.
Quantization during training: 4-bit NF4 with BF16 compute.
Max sequence length: 6,144.
LoRA rank: 16.
LoRA alpha: 32.
LoRA dropout: 0.05.
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
Batch size: 1.
Gradient accumulation: 16.
Learning rate: 8e-5.
Max steps: 100.
Eval steps: 50.
Save steps: 50.
Train runtime: about 35 minutes 20 seconds on one A100.
Final eval loss: 0.8895.
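
A few quantities follow directly from these settings. The arithmetic below is a sketch under assumptions: it treats the run as single-device (one A100 is stated), counts one training sequence per row of the 5,166-row train split (packing could change that), and uses the standard LoRA scaling factor of alpha divided by rank.

```python
settings = {
    "batch_size": 1,
    "gradient_accumulation": 16,
    "max_steps": 100,
    "lora_rank": 16,
    "lora_alpha": 32,
    "train_rows": 5166,
    "runtime_seconds": 35 * 60 + 20,  # "about 35 minutes 20 seconds"
}

effective_batch = settings["batch_size"] * settings["gradient_accumulation"]
sequences_seen = effective_batch * settings["max_steps"]
fraction_of_train = sequences_seen / settings["train_rows"]
lora_scale = settings["lora_alpha"] / settings["lora_rank"]
seconds_per_step = settings["runtime_seconds"] / settings["max_steps"]

print(effective_batch)              # 16
print(sequences_seen)               # 1600
print(round(fraction_of_train, 2))  # 0.31
print(lora_scale)                   # 2.0
print(seconds_per_step)             # 21.2
```

Under these assumptions the run covers roughly a third of one pass over the train split.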

No train or validation rows were skipped in the final run.

Local Evaluation

A small local math proxy eval, with generation stopped only at EOS and no generation token cap:

| Model | Rows | Total generated tokens | Avg generated tokens | Proxy accuracy | Numeric last-match rate |
| --- | ---: | ---: | ---: | ---: | ---: |
| `google/gemma-4-12B-it` base | 36 | 8,227 | 228.53 | 91.7% | 86.1% |
| Grug 12B | 36 | 2,482 | 68.94 | 100.0% | 100.0% |

This is a small proxy eval, not a broad benchmark. Treat it as a smoke test showing the intended token-efficiency direction, then run your own benchmark.
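
The derived figures in the table can be reproduced from the raw token totals, and a numeric last-match metric can be sketched in a few lines. The card does not define the metric, so `last_number` and `numeric_last_match` are one plausible reading (compare the last number in the output to the reference), not the evaluation's actual code.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def last_number(text):
    """Return the last numeric literal in text as a float, or None if there is none."""
    # Dropping commas handles thousands separators such as 8,227.
    matches = NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def numeric_last_match(output, reference, tol=1e-6):
    """True when the last number in the model output equals the reference value."""
    value = last_number(output)
    return value is not None and abs(value - float(reference)) <= tol


base_tokens, grug_tokens, rows = 8227, 2482, 36
print(round(base_tokens / rows, 2))             # 228.53
print(round(grug_tokens / rows, 2))             # 68.94
print(round(1 - grug_tokens / base_tokens, 3))  # 0.698
print(numeric_last_match("80 * 0.75 = 60. Sale price: $60", 60))  # True
```

On this 36-row set, Grug 12B generated about 70% fewer tokens than the base model.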

Usage

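import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kai-os/Grug-12B"

# Load the merged weights in BF16 and let device_map place them on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [
    {"role": "user", "content": "If a shirt is $80 and goes 25% off, what is the sale price?"}
]
# return_dict=True returns input_ids together with attention_mask for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding, capped at 512 new tokens.
with torch.no_grad():
    output = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))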

For token-efficiency tests, compare against the base model with the same prompt, same decoding settings, and no artificial token cap unless your deployment requires one.
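
Such a comparison can be organized as below. This is a minimal sketch: `compare_token_usage` is a hypothetical helper, and the two generator callables are stand-ins that you would replace with wrappers around each model's `generate` call, using identical decoding settings.

```python
def compare_token_usage(prompts, generate_base, generate_tuned):
    """Run both models on the same prompts and report generated-token totals.

    generate_base and generate_tuned are callables that take a prompt string and
    return the list of newly generated token IDs (prompt tokens excluded).
    """
    base_total = sum(len(generate_base(p)) for p in prompts)
    tuned_total = sum(len(generate_tuned(p)) for p in prompts)
    return {
        "rows": len(prompts),
        "base_total": base_total,
        "tuned_total": tuned_total,
        "base_avg": base_total / len(prompts),
        "tuned_avg": tuned_total / len(prompts),
        "tuned_over_base": tuned_total / base_total,
    }


# Stand-in generators for illustration only; they return fixed-length dummy outputs.
report = compare_token_usage(
    ["q1", "q2"],
    generate_base=lambda prompt: [0] * 10,
    generate_tuned=lambda prompt: [0] * 4,
)
print(report["tuned_over_base"])  # 0.4
```

Counting only newly generated tokens keeps prompt length from diluting the comparison.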

Limitations

This is an experimental fine-tune.
It may over-compress reasoning on tasks that need longer derivations.
It inherits the base model's limitations and safety behavior.
The reported eval is small and local.
The dataset includes synthetic and distilled reasoning traces from the listed open datasets; review source licenses and provenance before using this in commercial or sensitive settings.

Acknowledgements

Thanks to Lambda, the inference provider, for compute credits that supported the dataset work, training, and evaluation.