---
title: Gemma4 Coder GGUF Chat
emoji: "💬"
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
app_port: 7860
pinned: false
models:
- yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
tags:
- llama.cpp
- gguf
- gemma4
- coding
- cpu
---
# Gemma4 12B Coder GGUF Chat
A Docker-based chatbot Space on Hugging Face with the following setup:
- Model: `yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF`
- Default quant: `gemma4-coding-Q4_K_M.gguf`
- Backend: prebuilt `llama.cpp` `llama-server`
- UI: native `llama.cpp` web UI
- Target: testing Gemma4 Coder on HF Spaces CPU
## Why Q4 by default?
`gemma4-coding-Q2_K.gguf` is smaller and faster, but on CPU its output can degrade into incoherent, pseudo-language text. This Space therefore defaults to `gemma4-coding-Q4_K_M.gguf`, which is slower than Q2 but more coherent, and is the safer choice if the goal is a usable chatbot.
## Default settings
```text
MODEL_REPO=yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
MODEL_FILE=gemma4-coding-Q4_K_M.gguf
LLAMA_VERSION=b9592
THREADS=4
CTX_SIZE=2048
BATCH_SIZE=default
UBATCH_SIZE=default
FLASH_ATTN=default
CACHE_TYPE_K=default
CACHE_TYPE_V=default
TEMPERATURE=0.2
TOP_P=0.95
TOP_K=64
REPEAT_PENALTY=1.08
```
The launcher downloads the GGUF into `/data`, fetches the model chat template from Hugging Face metadata, then hands the process over to `llama-server` on port `7860`.
`default` means the launcher does not pass that flag, so `llama-server` falls back to its own built-in default. This keeps the configuration closer to the fast reference Space and avoids the CPU overhead of experimental KV-cache quantization or very small batch sizes.
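
The "skip the flag when the value is `default`" rule can be sketched as follows. This is an illustration only, not the actual `app.py`: the mapping table, the function names `build_server_args` and `launch`, and the `0.0.0.0` host binding are assumptions, while the flags themselves are standard `llama-server` options.

```python
import os

# Hypothetical mapping from this README's environment variables to
# llama-server command-line flags (an assumption, not taken from app.py).
FLAG_FOR_ENV = {
    "THREADS": "--threads",
    "CTX_SIZE": "--ctx-size",
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
}


def build_server_args(env, model_path, port=7860):
    """Build the llama-server argv, omitting any setting left at 'default'."""
    args = ["llama-server", "--model", model_path,
            "--host", "0.0.0.0", "--port", str(port)]
    for name, flag in FLAG_FOR_ENV.items():
        value = env.get(name, "default").strip()
        if not value or value.lower() == "default":
            continue  # flag not passed: llama-server uses its built-in default
        args += [flag, value]
    return args


def launch(env=os.environ):
    """Replace the current Python process with llama-server (never returns)."""
    model_file = env.get("MODEL_FILE", "gemma4-coding-Q4_K_M.gguf")
    args = build_server_args(env, f"/data/{model_file}")
    os.execvp(args[0], args)
```

With the default settings above, only `--threads 4` and `--ctx-size 2048` would be added to the base command in this sketch. The batch and KV-cache flags are left out entirely.
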
## If you want to compare Q2
To switch back to Q2, set this environment variable:
```text
MODEL_FILE=gemma4-coding-Q2_K.gguf
```
Q2 starts and responds faster, but the output may be incoherent.
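
Changing `MODEL_FILE` only changes which file is fetched and loaded. A minimal sketch of that resolution, assuming the launcher downloads from the repository's `main` revision through the standard Hugging Face `resolve` URL (the function name `resolve_model` is hypothetical, and the real `app.py` may do this differently):

```python
DEFAULT_REPO = "yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF"
DEFAULT_FILE = "gemma4-coding-Q4_K_M.gguf"


def resolve_model(env):
    """Return (download_url, local_path) for the configured GGUF file."""
    repo = env.get("MODEL_REPO", DEFAULT_REPO)
    filename = env.get("MODEL_FILE", DEFAULT_FILE)
    # Assumption: the file is taken from the 'main' revision of the model repo.
    url = f"https://huggingface.co/{repo}/resolve/main/{filename}"
    # The README states that the GGUF is stored under /data.
    local_path = f"/data/{filename}"
    return url, local_path


if __name__ == "__main__":
    # Compare the default quant with the Q2 override.
    print(resolve_model({}))
    print(resolve_model({"MODEL_FILE": "gemma4-coding-Q2_K.gguf"}))
```

Because the local path follows the file name in this sketch, both quants can sit side by side in `/data`, so switching back and forth does not force a fresh download.
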
## Upload
Upload these files to the root of a Docker Space:
- `Dockerfile`
- `app.py`
- `README.md`