gemma2b-dolly-qlora

A QLoRA fine-tuned version of google/gemma-2b-it on the databricks/databricks-dolly-15k instruction-following dataset.

The model was fine-tuned with QLoRA (Quantized Low-Rank Adaptation): the base model is quantized to 4-bit precision and kept frozen, and only the LoRA adapter weights (~13M parameters out of 2B) are trained. This makes fine-tuning feasible on a single free-tier T4 GPU.

A merged full model is also available at adithash/gemma2b-dolly-qlora-merged, with base and adapter fused into a single standalone model.
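
To put the parameter counts above in perspective, here is a back-of-the-envelope sketch of the trainable share and the storage footprints. It assumes the rounded counts from the text (2B base, 13M adapter), 0.5 bytes per weight for 4-bit storage (ignoring quantization constants), and 4 bytes per adapter weight (float32). The function names are illustrative.

```python
def trainable_fraction(adapter_params: int, base_params: int) -> float:
    """Share of parameters that receive gradients under (Q)LoRA."""
    return adapter_params / base_params


def size_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


BASE_PARAMS = 2_000_000_000   # "2B", rounded as in the text
ADAPTER_PARAMS = 13_000_000   # "~13M", rounded as in the text

fraction = trainable_fraction(ADAPTER_PARAMS, BASE_PARAMS)
base_4bit_mb = size_mb(BASE_PARAMS, 0.5)    # assumption: 4 bits per weight, no overhead
adapter_mb = size_mb(ADAPTER_PARAMS, 4.0)   # assumption: adapter stored in float32

print(f"trainable share: {fraction:.2%}")
print(f"4-bit base weights: ~{base_4bit_mb:.0f} MB")
print(f"adapter weights: ~{adapter_mb:.0f} MB")
```

Under these assumptions the adapter estimate comes out near the ~50 MB reported for this repository.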

Model Details

Property	Value
Base model	`google/gemma-2b-it`
Fine-tuning method	QLoRA (4-bit NF4 quantized base + LoRA adapters)
Dataset	`databricks/databricks-dolly-15k`
Training samples	14,911
Training steps	500 (capped for free Colab T4)
LoRA rank (r)	16
LoRA alpha	32
LoRA dropout	0.05
Learning rate	2e-4
Batch size	2 (effective: 8 with grad accum ×4)
Sequence length	256 tokens
Quantization	4-bit NF4, compute dtype bfloat16
Hardware	Google Colab T4 (16 GB VRAM)
Training time	~2.5 hours
Adapter size	~50 MB
Framework	`transformers` + `peft` + `trl` (SFTTrainer)
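
Two derived quantities follow from the table: the LoRA scaling factor alpha / r (assuming the standard, non-rank-stabilized scaling) and the effective batch size. The sketch below only computes them; the `RunConfig` dataclass and its field names are illustrative and not taken from the training notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    lora_r: int = 16
    lora_alpha: int = 32
    per_device_batch: int = 2
    grad_accum: int = 4
    seq_len: int = 256

    @property
    def lora_scale(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r

    @property
    def effective_batch(self) -> int:
        # Assumes a single GPU, as listed in the table.
        return self.per_device_batch * self.grad_accum

    @property
    def max_tokens_per_step(self) -> int:
        # Upper bound: every sequence padded or truncated to seq_len.
        return self.effective_batch * self.seq_len


cfg = RunConfig()
print(cfg.lora_scale, cfg.effective_batch, cfg.max_tokens_per_step)
```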

Training Loss

Training loss dropped sharply in the early steps, from 3.60 at step 25 to 2.22 at step 100, and continued to converge steadily:

Step	Training Loss
25	3.60
50	2.65
75	2.32
100	2.22
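
If the logged loss is the usual mean token-level cross-entropy in nats (an assumption; the card does not say), it converts to perplexity as exp(loss). The sketch below applies that conversion to the four logged values and also computes the step-to-step decreases.

```python
import math

# (step, training loss) pairs copied from the table above
logged = [(25, 3.60), (50, 2.65), (75, 2.32), (100, 2.22)]


def perplexity(loss: float) -> float:
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss)


for step, loss in logged:
    print(f"step {step:>3}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")

# Decrease in loss between consecutive logged steps
drops = [round(prev - cur, 2) for (_, prev), (_, cur) in zip(logged, logged[1:])]
print(drops)
```

The decreases shrink from 0.95 to 0.33 to 0.10, which is the flattening curve the table describes.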

Prompt Format

This model uses the Gemma chat template. Wrap every input in the user and model turn markers shown below, ending the prompt with an open model turn:

<start_of_turn>user
Your instruction here<end_of_turn>
<start_of_turn>model

If your prompt includes context (e.g. a passage to summarise), append it to the instruction:

<start_of_turn>user
Summarise the following text.

Context: <your context here><end_of_turn>
<start_of_turn>model
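
Both layouts above can be produced by one small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the helper leaves the beginning-of-sequence token to the tokenizer (an assumption about the tokenizer's defaults).

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Wrap an instruction (and optional context) in Gemma turn markers."""
    user_msg = instruction.strip()
    if context.strip():
        # Context is appended to the instruction, as in the second layout above.
        user_msg = f"{user_msg}\n\nContext: {context.strip()}"
    return (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


print(build_prompt("Summarise the following text.",
                   "QLoRA trains small adapters on a 4-bit base model."))
```

Keeping prompt construction in a pure string function makes it easy to check the format without loading the model.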

How to Use

Option A — LoRA Adapter (Recommended)

Lightweight (~50 MB). Loads the frozen base model and attaches the adapter on top.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapter
model = PeftModel.from_pretrained(base_model, "adithash/gemma2b-dolly-qlora")
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora")

def chat(instruction, context="", max_new_tokens=200, temperature=0.7):
    user_msg = f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    prompt = (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

# Example
print(chat("Explain what overfitting is in machine learning and how to prevent it."))

Option B — Merged Model (Standalone)

Full model with adapter baked in. No need for the base model separately. Larger (~3 GB) but simpler to load.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")

def chat(instruction, max_new_tokens=200, temperature=0.7):
    prompt = (
        f"<start_of_turn>user\n{instruction}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

print(chat("What is the difference between SQL and NoSQL databases?"))
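
The card does not show how the merged checkpoint was produced. One common route with `peft` is to load the base model in half precision, attach the adapter, and call `merge_and_unload()`. The sketch below follows that route under stated assumptions: `merge_adapter` and the output directory name are hypothetical, and this is not confirmed to be the author's procedure.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Fuse LoRA weights into the base model and save a standalone checkpoint."""
    # Imported lazily so that defining the function needs no heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base in float16 rather than 4-bit so the merged
    # weights are ordinary dense tensors.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_id)

    # Fold each low-rank update into its target layer and drop the adapter wrappers.
    merged = model.merge_and_unload()

    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(adapter_id).save_pretrained(out_dir)


# Example call (downloads both repositories, so it is left commented out):
# merge_adapter("google/gemma-2b-it", "adithash/gemma2b-dolly-qlora", "gemma2b-dolly-merged")
```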

Repository Structure

adithash/gemma2b-dolly-qlora/
├── adapter_config.json       # LoRA config (rank, alpha, target modules)
├── adapter_model.safetensors # Trained LoRA weights (~50 MB)
├── tokenizer.json
├── tokenizer_config.json
└── README.md
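
To inspect the LoRA settings without loading any weights, `adapter_config.json` can be read as plain JSON. The sketch below assumes the field names that `peft` typically writes (`r`, `lora_alpha`, `lora_dropout`, `target_modules`, `base_model_name_or_path`); `summarize_adapter_config` and the local path are hypothetical.

```python
import json
from pathlib import Path


def summarize_adapter_config(path: str) -> dict:
    """Return the LoRA hyperparameters stored in an adapter_config.json file."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    keys = ("base_model_name_or_path", "r", "lora_alpha", "lora_dropout", "target_modules")
    # .get() keeps the sketch tolerant of files that omit a field.
    return {key: cfg.get(key) for key in keys}


# Usage, assuming the repository was downloaded to ./gemma2b-dolly-qlora:
# print(summarize_adapter_config("gemma2b-dolly-qlora/adapter_config.json"))
```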

Intended Use

✅ Learning and experimentation with QLoRA fine-tuning
✅ Portfolio demonstration of end-to-end fine-tuning pipeline
✅ Starting point for domain-specific instruction tuning
❌ Not intended for production or commercial use
❌ Not suitable for safety-critical applications

Limitations

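Fine-tuned for only 500 steps as a proof of concept on a free Colab T4; at batch size 2, a full epoch over the 14,911 samples would be ~7,400 micro-batches, or ~1,860 optimizer steps at the effective batch size of 8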
Gemma 2B is a small model — complex multi-step reasoning will be limited
Training sequence length was capped at 256 tokens, so training examples beyond that length were truncated and responses to very long prompts may be less reliable
A newer base model exists: google/gemma-2-2b-it
Subject to Google's Gemma Terms of Use
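
The epoch arithmetic behind the first limitation can be checked directly. This sketch uses the sample count, batch size, and accumulation factor from the Model Details table, and assumes the 500-step cap counts optimizer updates (as `max_steps` does in the Hugging Face `Trainer`).

```python
import math

TRAIN_SAMPLES = 14_911
MICRO_BATCH = 2
GRAD_ACCUM = 4
MAX_STEPS = 500


def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once (the last batch may be partial)."""
    return math.ceil(n_samples / batch_size)


micro_batches = batches_per_epoch(TRAIN_SAMPLES, MICRO_BATCH)
optimizer_steps = batches_per_epoch(TRAIN_SAMPLES, MICRO_BATCH * GRAD_ACCUM)

# Assumption: each of the 500 steps is one optimizer update over 8 samples.
samples_seen = MAX_STEPS * MICRO_BATCH * GRAD_ACCUM
coverage = samples_seen / TRAIN_SAMPLES

print(micro_batches, optimizer_steps, samples_seen, f"{coverage:.1%}")
```

Under that assumption the run saw 4,000 samples, a little over a quarter of the dataset.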

Training Infrastructure

Component	Detail
Notebook	Google Colab (free tier)
GPU	NVIDIA T4 — 16 GB VRAM
Libraries	`transformers 4.x`, `peft`, `trl 1.3+`, `bitsandbytes`, `accelerate`
Trainer	`SFTTrainer` with `SFTConfig`
Gradient checkpointing	Enabled
Mixed precision	bfloat16

Citation

If you use this model or adapter in your work, please credit the base model:

@article{gemma_2024,
  title  = {Gemma: Open Models Based on Gemini Research and Technology},
  author = {Gemma Team, Google DeepMind},
  year   = {2024},
  url    = {https://arxiv.org/abs/2403.08295}
}

Author

Aditya Dey — ML Engineer
🤗 HuggingFace · GitHub
