DBMe/gemma-4-31B-it-heretic-exl3
EXL3 (ExLlamaV3) quantizations of coder3101/gemma-4-31B-it-heretic. All credit for the original model goes to the original authors.
π Available Quantizations & VRAM
The model weights are stored in separate branches. Please switch to a branch to download. Note: VRAM estimates include PyTorch context overhead (~0.8GB) and assume an unquantized FP16 KV cache.
| Target BPW | Head BPW | Branch (Download Link) | WikiText-2 PPL (512 ctx)ΒΉ | 2K ctx | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
|---|---|---|---|---|---|---|---|---|
| 3.5 | h6 | 3.5bpw_h6 | 8842.6808 | ~19.15 GB | ~20.87 GB | ~24.3 GB | ~31.18 GB | ~44.93 GB |
| 4.0 | h6 | 4.0bpw_h6 | 6856.5833 | ~20.86 GB | ~22.57 GB | ~26.01 GB | ~32.89 GB | ~46.64 GB |
| 5.0 | h6 | 5.0bpw_h6 | 6504.4025 | ~24.27 GB | ~25.98 GB | ~29.42 GB | ~36.3 GB | ~50.05 GB |
| 6.0 | h6 | 6.0bpw_h6 | 5900.2612 | ~27.67 GB | ~29.39 GB | ~32.83 GB | ~39.71 GB | ~53.46 GB |
| 8.0 | h8 | 8.0bpw_h8 | 6355.6026 | ~34.82 GB | ~36.54 GB | ~39.98 GB | ~46.85 GB | ~60.6 GB |
ΒΉ Evaluated against WikiText-2 with ExLlamaV3 using a strided 512-token context window (-c 512) in llama.cpp parity mode (-g). Lower is better. (Higher BPW = higher quality, lower BPW = fits in less VRAM).
π₯ How to Download
It's recommended to use the huggingface-cli to download specific branches. (Do not use git clone as it will download all branches!)
Ensure you have the CLI installed:
pip install -U "huggingface_hub[cli]"
Download a specific branch (e.g., 3.5bpw_h6):
# Example: Downloading the 3.5bpw_h6 branch
huggingface-cli download DBMe/gemma-4-31B-it-heretic-exl3 --revision 3.5bpw_h6 --local-dir gemma-4-31B-it-heretic-exl3-3.5bpw_h6
π» Supported Engines
These models are highly optimized for modern GPUs and can be run using:
- TabbyAPI: A fast, OpenAI-compatible API server. (Set
model_name: "gemma-4-31B-it-heretic-exl3-<BranchName>"in your config) - Text-Generation-WebUI: A local web interface. (Select the
exllamav3loader) - ExLlamaV3 (Native): Python library for custom integration.
π Perplexity Degradation Curve
βοΈ Advanced: Quantization Environment & Settings
π¬ Quantization Settings
Codebook: mcg
Output Scales: always
Calibration Rows: 250
Calibration Cols: 2048
Calibration Dataset: ExLlamaV3 Default (Wiki/C4/Code)
High Quality (HQ) Mode: False
ExLlamaV3:
0.0.29(Commit:cb1a436)Hardware:
NVIDIA RTX PRO 6000 Blackwell Server Edition
Model tree for DBMe/gemma-4-31B-it-heretic-exl3
Base model
google/gemma-4-31B