title: Gemma4 Coder GGUF Chat
emoji: 💬
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
app_port: 7860
pinned: false
models:
- yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
tags:
- llama.cpp
- gguf
- gemma4
- coding
- cpu
Gemma4 12B Coder GGUF Chat
Hugging Face Spaces Docker chatbot for:
- Model: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
- Default quant: gemma4-coding-Q4_K_M.gguf
- Backend: prebuilt llama.cpp llama-server
- UI: native llama.cpp web UI
- Target: testing Gemma4 Coder on HF Spaces CPU
Why Q4 by default?
gemma4-coding-Q2_K.gguf is smaller and faster, but on CPU it can produce incoherent, pseudo-language output. This Space therefore defaults to gemma4-coding-Q4_K_M.gguf for better coherence. It is slower than Q2, but it is the safer choice when the goal is a usable chatbot.
Default settings
MODEL_REPO=yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
MODEL_FILE=gemma4-coding-Q4_K_M.gguf
LLAMA_VERSION=b9592
THREADS=4
CTX_SIZE=2048
BATCH_SIZE=default
UBATCH_SIZE=default
FLASH_ATTN=default
CACHE_TYPE_K=default
CACHE_TYPE_V=default
TEMPERATURE=0.2
TOP_P=0.95
TOP_K=64
REPEAT_PENALTY=1.08
The launcher downloads the GGUF into /data, fetches the model chat template from Hugging Face metadata, then hands the process over to llama-server on port 7860.
A value of default means the launcher does not pass that flag, so llama.cpp falls back to its own built-in default. This keeps the configuration closer to the fast reference Space and avoids the CPU overhead of experimental KV-cache quantization or very small batch sizes.
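The "default means no flag" rule can be sketched as below. This is a minimal illustration, not the actual app.py: the `FLAG_MAP` table and the `build_server_args` helper are hypothetical names, and the mapping from each variable to a llama-server flag is an assumption. FLASH_ATTN is left out on purpose, because the form of that flag differs between llama.cpp builds.

```python
import os

# Hypothetical mapping from this Space's environment variables to
# llama-server command-line flags (assumed, not taken from app.py).
FLAG_MAP = {
    "THREADS": "--threads",
    "CTX_SIZE": "--ctx-size",
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
    "TEMPERATURE": "--temp",
    "TOP_P": "--top-p",
    "TOP_K": "--top-k",
    "REPEAT_PENALTY": "--repeat-penalty",
}


def build_server_args(env, model_path):
    """Return a llama-server argv list, skipping settings left at 'default'."""
    args = ["llama-server", "--model", model_path,
            "--host", "0.0.0.0", "--port", "7860"]
    for name, flag in FLAG_MAP.items():
        value = env.get(name, "default").strip()
        if value == "" or value.lower() == "default":
            continue  # omit the flag so llama.cpp uses its own default
        args += [flag, value]
    return args


if __name__ == "__main__":
    argv = build_server_args(os.environ, "/data/gemma4-coding-Q4_K_M.gguf")
    print(" ".join(argv))
    # A launcher could then replace itself with the server process:
    # os.execvp(argv[0], argv)
```

With the default settings listed above, BATCH_SIZE, UBATCH_SIZE, CACHE_TYPE_K, and CACHE_TYPE_V would produce no flags at all, while THREADS=4 and CTX_SIZE=2048 would be passed through.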
If you want to compare Q2
Set this environment variable back to the Q2 file:
MODEL_FILE=gemma4-coding-Q2_K.gguf
Q2 starts and responds faster, but the output may be incoherent.
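As a rough sketch of why changing MODEL_FILE alone is enough to switch quants, the helper below derives the download URL and the cached path under /data from the two environment variables. The `resolve_model` name is hypothetical, the `main` revision is an assumption, and the real launcher may use a Hugging Face client library instead of building the URL by hand.

```python
import posixpath
from urllib.parse import quote

DEFAULT_REPO = "yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF"
DEFAULT_FILE = "gemma4-coding-Q4_K_M.gguf"


def resolve_model(env, data_dir="/data"):
    """Return (download_url, local_path) for the configured GGUF file."""
    repo = env.get("MODEL_REPO", DEFAULT_REPO)
    filename = env.get("MODEL_FILE", DEFAULT_FILE)
    # Assumption: the file lives on the repo's "main" revision.
    url = f"https://huggingface.co/{repo}/resolve/main/{quote(filename)}"
    # Each quant gets its own file name, so Q2 and Q4 can sit side by side.
    local_path = posixpath.join(data_dir, filename)
    return url, local_path


if __name__ == "__main__":
    url, path = resolve_model({"MODEL_FILE": "gemma4-coding-Q2_K.gguf"})
    print(url)
    print(path)
```

Because the cached path includes the file name, switching between Q2 and Q4 in this sketch does not overwrite the other download.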
Upload
Upload these files to the root of a Docker Space:
- Dockerfile
- app.py
- README.md