---
language:
- en
base_model: google/gemma-4-e2b-it
pipeline_tag: any-to-any
library_name: transformers
tags:
- gemma
- gemma4
- gemma-4
- quantized
- int8
- bitsandbytes
license: apache-2.0
---

# Gemma 4 E2B it: INT8 Quantization (bitsandbytes)

Quantized version of [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it) using bitsandbytes INT8. Tested on RTX 5090 (Blackwell, sm_120).

## Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each, using greedy decoding (`do_sample=False`) and 200 max new tokens. Q8 is this model. Quality metrics (SQNR, Top-1 Agreement, KL Divergence) are measured against the FP16 baseline.

| Metric | FP16 (baseline) | Q8 | Q4 |
|---|---|---|---|
| SQNR | - | **27.49 dB** | 18.75 dB |
| Top-1 Agreement | - | **92.9%** | 81.1% |
| KL Divergence | - | **0.0496** | 0.3334 |
| Speed (tok/s) | 56.9 | 14.5 | **40.2** |
| VRAM | 9.5 GB | **7.4 GB** | 6.3 GB |

## Results by Category

Per-category results for Q8:

| Category | SQNR | Top-1 Agreement | KL Divergence | Speed (tok/s) |
|---|---|---|---|---|
| 🔒 Math | 27.09 dB | 92.4% | 0.0424 | 14.8 |
| 🧠 Logic | 27.18 dB | 92.8% | 0.0802 | 13.9 |
| 💻 Code | 29.49 dB | 94.5% | 0.0346 | 14.8 |
| 🔬 Science | 26.34 dB | 92.1% | 0.0410 | 14.7 |

## Key Findings

- **Quality**: Q8 picks the same top-1 token as FP16 92.9% of the time (KL divergence 0.0496), so output stays close to the baseline
- **VRAM**: 7.4 GB, which saves 2.1 GB vs FP16 (about 1.3x compression)
- **Speed**: 14.5 tok/s, slower than both FP16 (56.9 tok/s) and Q4 (40.2 tok/s), due to bfloat16-to-float16 cast overhead in bitsandbytes on Blackwell GPUs
- **Best for**: Cases where quality takes priority over speed

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "MichaelLowrance/gemma-4-e2b-q8"

# Load the INT8 weights through bitsandbytes and place them on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Hello!"}]
# add_generation_prompt=True appends the assistant turn header so the model answers.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Hardware

Tested on: NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB GDDR7)

CUDA: 12.8 | Python: 3.12 | transformers: 5.6.0.dev
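
## Metric Definitions (Sketch)

This card reports SQNR, top-1 agreement, and KL divergence without stating how they are computed. The sketch below shows one common way to define them over per-position logits from the FP16 and quantized models. It is an assumption, not the script that produced the numbers in this card: the function names (`sqnr_db`, `top1_agreement`, `mean_kl`) are hypothetical, the choice of logits as the compared signal is a guess, and plain Python lists stand in for tensors to keep the example self-contained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sqnr_db(ref, quant):
    """Signal-to-quantization-noise ratio in dB.

    Assumption: the reference logits are the signal and the difference
    between reference and quantized logits is the noise.
    """
    signal = sum(r * r for row in ref for r in row)
    noise = sum(
        (r - q) ** 2
        for ref_row, quant_row in zip(ref, quant)
        for r, q in zip(ref_row, quant_row)
    )
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def top1_agreement(ref, quant):
    """Fraction of positions where both models rank the same token first."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    matches = sum(argmax(r) == argmax(q) for r, q in zip(ref, quant))
    return matches / len(ref)


def mean_kl(ref, quant):
    """Mean KL(P_ref || P_quant) over positions, in nats."""
    total = 0.0
    for ref_row, quant_row in zip(ref, quant):
        log_p = log_softmax(ref_row)
        log_q = log_softmax(quant_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(ref)


if __name__ == "__main__":
    # Toy logits: 2 positions, vocabulary of 3 tokens (illustrative values only).
    fp16_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
    q8_logits = [[1.9, 1.1, 0.0], [0.0, 2.8, 1.1]]
    print(f"SQNR: {sqnr_db(fp16_logits, q8_logits):.2f} dB")
    print(f"Top-1 agreement: {top1_agreement(fp16_logits, q8_logits):.1%}")
    print(f"KL divergence: {mean_kl(fp16_logits, q8_logits):.4f}")
```

With real models, each row would hold the logits for one generated position, obtained from the FP16 and quantized models on the same token prefix. Higher SQNR and top-1 agreement, and lower KL divergence, indicate that the quantized model tracks the baseline more closely.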