Instructions to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small", filename="gemma-3-12b-it-q4_0_s.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S # Run inference directly in the terminal: llama-cli -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S # Run inference directly in the terminal: llama-cli -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S # Run inference directly in the terminal: ./llama-cli -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S # Run inference directly in the terminal: ./build/bin/llama-cli -hf stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
Use Docker
docker model run hf.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
- LM Studio
- Jan
- Ollama
How to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with Ollama:
ollama run hf.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
- Unsloth Studio new
How to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small to start chatting
- Docker Model Runner
How to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with Docker Model Runner:
docker model run hf.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
- Lemonade
How to use stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S
Run and chat with the model
lemonade run user.google-gemma-3-12b-it-qat-q4_0-gguf-small-Q4_0_S
List all available models
lemonade list
This is a "self" merge of https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-gguf and https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF.
The official QAT weights released by google use fp16 (instead of Q6_K) for the embeddings table, which makes this model take a significant extra amount of memory (and storage) compared to what Q4_0 quants are supposed to take. Instead of quantizing the table myself, I extracted it from Bartowski's quantized models because I thought using imatrix quants would give better quality (it doesn't, imatrix isn't used for token embeddings).
Here are some perplexity measurements:
| Model | File size ↓ | PPL (wiki.text.raw) ↓ | Hellaswag, 4k tasks ↑ |
|---|---|---|---|
| iQ3_xs (bartowski) | 5.21 GB | 10.0755 +/- 0.08024 | --- |
| This model | 6.89 GB | 9.2637 +/- 0.07216 | 72.925% [71.5366M, 74.2794%] |
| Q4_0 (bartowski) | 6.91 GB | 9.5589 +/- 0.07527 | 73.125% [71.7295%, 74.4761%] |
| QAT Q4_0 (google) | 8.07 GB | 9.2565 +/- 0.07212 | 72.850% [71.4505%, 74.2056%] |
| Q5_K_S (bartowski) | 8.23 GB | 9.8540 +/- 0.08016 | --- |
Note that this model ends up smaller than the Q4_0 from Bartowski. This is because llama.cpp sets some tensors to Q4_1 when quantizing models to Q4_0 with imatrix, but this is a static quant. I don't understand why Q5_K_S is performing worse on that test than the default Q4_0, I wasn't expecting this outcome. This merge seems to be a good balance between model size and perplexity. I believe this is representative to the overall quality of the model.
- Downloads last month
- 785
4-bit
Model tree for stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small
Base model
google/gemma-3-12b-pt
docker model run hf.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small:Q4_0_S