fable5

Running

File size: 1,898 Bytes

fea3064
6e23cd8
 
 
 
fea3064
6e23cd8
 
fea3064
6e23cd8
 
 
 
 
 
 
 
fea3064
 
6e23cd8

---
title: Gemma4 Coder GGUF Chat
emoji: "💬"
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
app_port: 7860
pinned: false
models:
  - yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
tags:
  - llama.cpp
  - gguf
  - gemma4
  - coding
  - cpu
---

# Gemma4 12B Coder GGUF Chat

Hugging Face Spaces Docker chatbot for:

- Model: `yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF`
- Default quant: `gemma4-coding-Q4_K_M.gguf`
- Backend: prebuilt `llama.cpp` `llama-server`
- UI: native `llama.cpp` web UI
- Target: testing Gemma4 Coder on HF Spaces CPU

## Why Q4 by default?

`gemma4-coding-Q2_K.gguf` is smaller and faster, but it can produce broken fake-language responses on CPU. This Space uses `gemma4-coding-Q4_K_M.gguf` by default for better coherence. It is slower than Q2, but it is the safer option if the goal is a usable chatbot.

## Default settings

```text
MODEL_REPO=yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
MODEL_FILE=gemma4-coding-Q4_K_M.gguf
LLAMA_VERSION=b9592
THREADS=4
CTX_SIZE=2048
BATCH_SIZE=default
UBATCH_SIZE=default
FLASH_ATTN=default
CACHE_TYPE_K=default
CACHE_TYPE_V=default
TEMPERATURE=0.2
TOP_P=0.95
TOP_K=64
REPEAT_PENALTY=1.08
```

The launcher downloads the GGUF into `/data`, fetches the model chat template from Hugging Face metadata, then hands the process over to `llama-server` on port `7860`.

`default` means the launcher does not pass that flag, so native `llama.cpp` picks its own optimized default. This is closer to the fast reference Space and avoids CPU overhead from experimental KV-cache quantization or tiny batch settings.

## If you want to compare Q2

Change this environment variable back:

```text
MODEL_FILE=gemma4-coding-Q2_K.gguf
```

Q2 starts and responds faster, but the output may be incoherent.

## Upload

Upload these files to the root of a Docker Space:

- `Dockerfile`
- `app.py`
- `README.md`