---
language:
- en
base_model: google/gemma-4-e2b-it
pipeline_tag: any-to-any
library_name: transformers
tags:
  - gemma
  - gemma4
  - gemma-4
  - quantized
  - int8
  - bitsandbytes
license: apache-2.0
---


# Gemma 4 E2B it — INT8 Quantization (bitsandbytes)

Quantized version of [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it) 
using bitsandbytes INT8. Tested on RTX 5090 (Blackwell, sm_120).

## Benchmark Results

Evaluated on 12 prompts: 4 categories (Math, Logic, Code, Science) with 3 prompts each.  
Greedy decoding (`do_sample=False`), 200 max new tokens. SQNR, Top-1 Agreement, and KL Divergence are measured against the FP16 baseline.

| Metric | FP16 (baseline) | Q8 (this model) | Q4 |
|---|---|---|---|
| SQNR | — | **27.49 dB** | 18.75 dB |
| Top-1 Agreement | — | **92.9%** | 81.1% |
| KL Divergence | — | **0.0496** | 0.3334 |
| Speed (tok/s) | 56.9 | 14.5 | **40.2** |
| VRAM | 9.5 GB | 7.4 GB | **6.3 GB** |
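
The card does not say how the three fidelity metrics were computed. The following is a minimal sketch, assuming each one is measured on per-position logits from the FP16 and quantized models over the same tokens. The function names and the list-of-lists input format are illustrative and are not taken from the benchmark script.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sqnr_db(ref, quant):
    """SQNR in dB: 10 * log10(sum(ref^2) / sum((ref - quant)^2))."""
    signal = sum(r * r for row in ref for r in row)
    noise = sum((r - q) ** 2 for rr, qr in zip(ref, quant) for r, q in zip(rr, qr))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)

def top1_agreement(ref, quant):
    """Fraction of positions where both models pick the same argmax token."""
    same = sum(
        max(range(len(r)), key=r.__getitem__) == max(range(len(q)), key=q.__getitem__)
        for r, q in zip(ref, quant)
    )
    return same / len(ref)

def mean_kl(ref, quant):
    """Mean KL(P_ref || P_quant) in nats, averaged over positions."""
    total = 0.0
    for r, q in zip(ref, quant):
        lp, lq = log_softmax(r), log_softmax(q)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(ref)

# Toy logits: 2 positions, vocabulary of 3 (hypothetical values, not benchmark data).
ref_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
quant_logits = [[1.9, 1.1, 0.0], [0.1, 2.8, 1.2]]

print(f"SQNR:  {sqnr_db(ref_logits, quant_logits):.2f} dB")
print(f"Top-1: {top1_agreement(ref_logits, quant_logits):.1%}")
print(f"KL:    {mean_kl(ref_logits, quant_logits):.4f}")
```

With real models, `ref` and `quant` would be the logits both models produce for the same input sequence. Whether the reported numbers use logits, hidden states, or another tensor for SQNR is not stated here.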

## Results by Category

| Category | SQNR | Top-1 Agreement | KL Divergence | Speed (tok/s) |
|---|---|---|---|---|
| 🔢 Math | 27.09 dB | 92.4% | 0.0424 | 14.8 |
| 🧠 Logic | 27.18 dB | 92.8% | 0.0802 | 13.9 |
| 💻 Code | 29.49 dB | 94.5% | 0.0346 | 14.8 |
| 🔬 Science | 26.34 dB | 92.1% | 0.0410 | 14.7 |

## Key Findings

- **Quality**: Q8 retains 92.9% top-1 token agreement with FP16 at a KL divergence of 0.0496, compared with 81.1% and 0.3334 for Q4
- **VRAM**: 7.4 GB, saving 2.1 GB vs FP16 (about 1.3x smaller)
- **Speed**: 14.5 tok/s, about 3.9x slower than FP16 and 2.8x slower than Q4, due to bfloat16→float16 cast overhead in bitsandbytes on Blackwell GPUs
- **Best for**: Cases where output quality matters more than speed
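
The VRAM and speed ratios above follow directly from the Benchmark Results table. A short check, using only the numbers reported there (the variable names are illustrative):

```python
# Values copied from the Benchmark Results table.
fp16_vram_gb, q8_vram_gb = 9.5, 7.4
fp16_tps, q8_tps, q4_tps = 56.9, 14.5, 40.2

vram_saved_gb = fp16_vram_gb - q8_vram_gb   # absolute saving in GB
compression = fp16_vram_gb / q8_vram_gb     # FP16 footprint relative to Q8
slowdown_vs_fp16 = fp16_tps / q8_tps        # how many times slower Q8 is than FP16
slowdown_vs_q4 = q4_tps / q8_tps            # how many times slower Q8 is than Q4

print(f"VRAM saved: {vram_saved_gb:.1f} GB ({compression:.1f}x)")
# prints "VRAM saved: 2.1 GB (1.3x)"
print(f"Q8 vs FP16: {slowdown_vs_fp16:.1f}x slower, Q8 vs Q4: {slowdown_vs_q4:.1f}x slower")
# prints "Q8 vs FP16: 3.9x slower, Q8 vs Q4: 2.8x slower"
```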

## Usage
```python
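from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load the INT8 weights through bitsandbytes and place the model on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "MichaelLowrance/gemma-4-e2b-q8",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("MichaelLowrance/gemma-4-e2b-q8")

messages = [{"role": "user", "content": "Hello!"}]
# add_generation_prompt=True appends the assistant turn marker so the model
# answers the user instead of continuing the user's message.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Greedy decoding, matching the benchmark settings.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))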
```
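
To get a rough tok/s figure on your own hardware, one option is to time a few generation runs after a warmup. This is a sketch under assumptions, not the script used for the benchmark table: `tokens_per_second` is a hypothetical helper, and it expects a callable that runs one generation and returns the number of new tokens.

```python
import time

def tokens_per_second(generate_fn, warmup=1, runs=3):
    """Average throughput of generate_fn, which must return the new-token count.

    Warmup calls are executed but not timed, so one-off costs such as
    kernel compilation do not skew the result.
    """
    for _ in range(warmup):
        generate_fn()
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate_fn()
        total_time += time.perf_counter() - start
    return total_tokens / total_time

# Hypothetical usage with `model` and `inputs` from the Usage snippet:
#
# def run_once():
#     out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
#     torch.cuda.synchronize()  # make sure GPU work is finished before timing stops
#     return out.shape[1] - inputs["input_ids"].shape[1]
#
# print(f"{tokens_per_second(run_once):.1f} tok/s")
```

Results will vary with prompt length, driver version, and GPU load, so the numbers may not match the table exactly.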

## Hardware

Tested on: NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB GDDR7)  
CUDA: 12.8 | Python: 3.12 | transformers: 5.6.0.dev