gemma2b-dolly-qlora

A QLoRA fine-tuned version of google/gemma-2b-it on the databricks/databricks-dolly-15k instruction-following dataset.

Fine-tuned using QLoRA (Quantized Low-Rank Adaptation): the base model is frozen in 4-bit precision and only the LoRA adapter weights (~13M params out of 2B) are trained. This makes fine-tuning possible on a single free-tier T4 GPU.

Merged full model also available: adithash/gemma2b-dolly-qlora-merged (base + adapter fused into a single standalone model).


Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-2b-it |
| Fine-tuning method | QLoRA (4-bit NF4 quantized base + LoRA adapters) |
| Dataset | databricks/databricks-dolly-15k |
| Training samples | 14,911 |
| Training steps | 500 (capped for free Colab T4) |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 2e-4 |
| Batch size | 2 (effective: 8 with gradient accumulation ×4) |
| Sequence length | 256 tokens |
| Quantization | 4-bit NF4, compute dtype bfloat16 |
| Hardware | Google Colab T4 (16 GB VRAM) |
| Training time | ~2.5 hours |
| Adapter size | ~50 MB |
| Framework | transformers + peft + trl (SFTTrainer) |
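
As a quick sanity check on the figures above, the sketch below recomputes the effective batch size, the trainable fraction, and a rough adapter size. It assumes the adapter weights are stored as 32-bit floats (4 bytes each), which this card does not state; the function names are illustrative only.

```python
# Back-of-the-envelope check of the Model Details figures.
# Assumption (not stated in the card): adapter weights are saved as float32.

def effective_batch_size(per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps


def adapter_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate on-disk size in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


def trainable_percent(trainable_params: int, total_params: int) -> float:
    """Share of parameters that receive gradient updates, in percent."""
    return 100 * trainable_params / total_params


print("effective batch size:", effective_batch_size(2, 4))
print("approx. adapter size (MB):", adapter_size_mb(13_000_000))
print("trainable share (%):", trainable_percent(13_000_000, 2_000_000_000))
```

Under that float32 assumption, ~13M parameters come out close to the ~50 MB adapter size listed in the table.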

Training Loss

Training loss fell from 3.60 at step 25 to 2.22 at step 100, with most of the drop occurring in the first 50 steps:

| Step | Training Loss |
|---|---|
| 25 | 3.60 |
| 50 | 2.65 |
| 75 | 2.32 |
| 100 | 2.22 |
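
The loss values can be easier to interpret as perplexity. The sketch below converts them, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training, but not confirmed by this card). These are training-batch figures, not held-out evaluation results.

```python
import math

# Training losses from the table above (step -> loss).
losses = {25: 3.60, 50: 2.65, 75: 2.32, 100: 2.22}


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)


for step, loss in losses.items():
    print(f"step {step:>3}: loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```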

Prompt Format

This model uses the Gemma chat template. Wrap every input in the turn markers shown below and end the prompt with an open model turn:

<start_of_turn>user
Your instruction here<end_of_turn>
<start_of_turn>model

If your prompt includes context (e.g. a passage to summarise), append it to the instruction:

<start_of_turn>user
Summarise the following text.

Context: <your context here><end_of_turn>
<start_of_turn>model
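
The two layouts above can be produced by one small helper. This is a minimal sketch; the name build_prompt is hypothetical and not part of any library.

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Wrap an instruction (and optional context) in Gemma turn markers."""
    user_msg = instruction.strip()
    if context.strip():
        # Context is appended to the instruction, separated by a blank line.
        user_msg = f"{user_msg}\n\nContext: {context.strip()}"
    return (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


print(build_prompt("Summarise the following text.", "QLoRA trains small adapters."))
```

The prompt deliberately ends with an open model turn so that generation continues as the model's reply.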

How to Use

Option A: LoRA Adapter (Recommended)

Lightweight (~50 MB). Loads the frozen base model and attaches the adapter on top.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapter
model = PeftModel.from_pretrained(base_model, "adithash/gemma2b-dolly-qlora")
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora")

def chat(instruction, context="", max_new_tokens=200, temperature=0.7):
    user_msg = f"{instruction}\n\nContext: {context}" if context.strip() else instruction
    prompt = (
        f"<start_of_turn>user\n{user_msg}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    # Move inputs to wherever device_map placed the model instead of hard-coding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

# Example
print(chat("Explain what overfitting is in machine learning and how to prevent it."))

Option B: Merged Model (Standalone)

Full model with the adapter merged in, so the base model does not need to be loaded separately. Larger (~3 GB) but simpler to load.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "adithash/gemma2b-dolly-qlora-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("adithash/gemma2b-dolly-qlora-merged")

def chat(instruction, max_new_tokens=200, temperature=0.7):
    prompt = (
        f"<start_of_turn>user\n{instruction}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )
    # Move inputs to wherever device_map placed the model instead of hard-coding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
        )
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

print(chat("What is the difference between SQL and NoSQL databases?"))
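
For reference, a merged checkpoint like the one loaded above could be produced from the adapter along the following lines. This is a sketch of one common approach using PEFT's merge_and_unload, not necessarily the exact procedure used for this repository; the function name merge_adapter and the output directory are hypothetical.

```python
def merge_adapter(base_id: str, adapter_id: str, output_dir: str) -> None:
    """Fuse LoRA weights into the base model and save a standalone copy."""
    # Heavy imports are kept local so this file can be imported without them installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model in half precision (not 4-bit) so the adapter
    # is merged into unquantized weights.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_id)

    # merge_and_unload() folds the LoRA deltas into the base weights and
    # returns a plain transformers model without the PEFT wrapper.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Save the tokenizer alongside so the output directory loads on its own.
    AutoTokenizer.from_pretrained(adapter_id).save_pretrained(output_dir)


# Example call (downloads the base model, so it is left commented out):
# merge_adapter("google/gemma-2b-it", "adithash/gemma2b-dolly-qlora", "merged-model")
```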

Repository Structure

adithash/gemma2b-dolly-qlora/
├── adapter_config.json       # LoRA config (rank, alpha, target modules)
├── adapter_model.safetensors # Trained LoRA weights (~50 MB)
├── tokenizer.json
├── tokenizer_config.json
└── README.md
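
To inspect the LoRA settings without loading any model weights, adapter_config.json can be read as ordinary JSON. The sketch below assumes the file has been downloaded locally and uses the key names PEFT writes for LoRA adapters; the helper name summarize_adapter_config is hypothetical.

```python
import json
from pathlib import Path

# Keys PEFT writes for a LoRA adapter; any key missing from the file maps to None.
KEYS = (
    "base_model_name_or_path",
    "peft_type",
    "r",
    "lora_alpha",
    "lora_dropout",
    "target_modules",
)


def summarize_adapter_config(path: str) -> dict:
    """Return the main LoRA settings stored in an adapter_config.json file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    return {key: config.get(key) for key in KEYS}


# Usage, after downloading the repository locally:
# print(summarize_adapter_config("adapter_config.json"))
```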

Intended Use

  • ✅ Learning and experimentation with QLoRA fine-tuning
  • ✅ Portfolio demonstration of end-to-end fine-tuning pipeline
  • ✅ Starting point for domain-specific instruction tuning
  • ❌ Not intended for production or commercial use
  • ❌ Not suitable for safety-critical applications

Limitations

  • Fine-tuned for only 500 steps as a proof of concept on a free Colab T4; a full epoch would be ~7,400 steps at batch size 2 (about 1,860 optimizer steps at the effective batch size of 8)
  • Gemma 2B is a small model, so complex multi-step reasoning will be limited
  • Training sequence length capped at 256 tokens, so very long prompts will be truncated
  • A newer base model exists: google/gemma-2-2b-it
  • Subject to Google's Gemma Terms of Use
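
How far 500 steps reaches into the dataset depends on what counts as a step. The sketch below works through both readings using this card's own figures (14,911 samples, batch size 2, effective batch size 8). It assumes the 500 training steps are optimizer steps, as with max_steps in the transformers Trainer; the card does not say so explicitly.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


samples = 14_911
micro_batches = steps_per_epoch(samples, batch_size=2)    # forward/backward passes
optimizer_steps = steps_per_epoch(samples, batch_size=8)  # weight updates

# Assumption: each of the 500 steps consumed one effective batch of 8 samples.
coverage = 500 * 8 / samples

print("micro-batches per epoch:", micro_batches)
print("optimizer steps per epoch:", optimizer_steps)
print(f"share of one epoch seen in 500 steps: {coverage:.1%}")
```

Under that assumption, the run covered roughly a quarter of one pass over the training data.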

Training Infrastructure

| Component | Detail |
|---|---|
| Notebook | Google Colab (free tier) |
| GPU | NVIDIA T4, 16 GB VRAM |
| Libraries | transformers 4.x, peft, trl 1.3+, bitsandbytes, accelerate |
| Trainer | SFTTrainer with SFTConfig |
| Gradient checkpointing | Enabled |
| Mixed precision | bfloat16 |

Citation

If you use this model or adapter in your work, please credit the base model:

@article{gemma_2024,
  title  = {Gemma: Open Models Based on Gemini Research and Technology},
  author = {Gemma Team, Google DeepMind},
  year   = {2024},
  url    = {https://arxiv.org/abs/2403.08295}
}

Author

Aditya Dey, ML Engineer
🤗 HuggingFace · GitHub
