fable5

cl4ude

6e23cd8 verified 20 days ago

metadata

title: Gemma4 Coder GGUF Chat
emoji: 💬
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
app_port: 7860
pinned: false
models:
  - yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
tags:
  - llama.cpp
  - gguf
  - gemma4
  - coding
  - cpu

Gemma4 12B Coder GGUF Chat

Docker-based Hugging Face Space that runs a chatbot with the following setup:

Model: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
Default quant: gemma4-coding-Q4_K_M.gguf
Backend: prebuilt llama.cpp llama-server
UI: native llama.cpp web UI
Target: testing Gemma4 Coder on HF Spaces CPU

Why Q4 by default?

gemma4-coding-Q2_K.gguf is smaller and faster, but on CPU it can produce garbled output that reads like a made-up language. This Space therefore defaults to gemma4-coding-Q4_K_M.gguf for better coherence. Q4 is slower than Q2, but it is the safer choice when the goal is a usable chatbot.

Default settings

MODEL_REPO=yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
MODEL_FILE=gemma4-coding-Q4_K_M.gguf
LLAMA_VERSION=b9592
THREADS=4
CTX_SIZE=2048
BATCH_SIZE=default
UBATCH_SIZE=default
FLASH_ATTN=default
CACHE_TYPE_K=default
CACHE_TYPE_V=default
TEMPERATURE=0.2
TOP_P=0.95
TOP_K=64
REPEAT_PENALTY=1.08
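
As an illustration of how these defaults can be overridden through environment variables, here is a minimal sketch. The names DEFAULTS and load_settings are hypothetical and are not taken from app.py; only the keys and values come from the list above.

```python
import os

# Defaults copied from the settings list above.
# DEFAULTS and load_settings are hypothetical names, not taken from app.py.
DEFAULTS = {
    "MODEL_REPO": "yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF",
    "MODEL_FILE": "gemma4-coding-Q4_K_M.gguf",
    "LLAMA_VERSION": "b9592",
    "THREADS": "4",
    "CTX_SIZE": "2048",
    "BATCH_SIZE": "default",
    "UBATCH_SIZE": "default",
    "FLASH_ATTN": "default",
    "CACHE_TYPE_K": "default",
    "CACHE_TYPE_V": "default",
    "TEMPERATURE": "0.2",
    "TOP_P": "0.95",
    "TOP_K": "64",
    "REPEAT_PENALTY": "1.08",
}


def load_settings(env=None):
    """Return every setting, preferring an environment override over the default."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


if __name__ == "__main__":
    for key, value in load_settings().items():
        print(f"{key}={value}")
```

Values are kept as strings because they are ultimately passed to a command line, and because "default" has to remain distinguishable from a real value.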

The launcher downloads the GGUF into /data, fetches the model chat template from Hugging Face metadata, then hands the process over to llama-server on port 7860.
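
The download-then-hand-over flow described above could look roughly like the sketch below. It is not the actual app.py: the function names are hypothetical, the sketch assumes huggingface_hub is installed and llama-server is on PATH, and the chat template step is reduced to an optional file path because this README does not say how the template is fetched.

```python
import os


def server_command(model_path, port=7860, template_path=None):
    """Build a minimal llama-server command line (hypothetical helper)."""
    cmd = ["llama-server", "--model", model_path,
           "--host", "0.0.0.0", "--port", str(port)]
    if template_path:
        # Assumption: the fetched chat template was saved to a Jinja file.
        cmd += ["--jinja", "--chat-template-file", template_path]
    return cmd


def launch(repo, filename, data_dir="/data", template_path=None):
    """Download the GGUF into data_dir, then replace this process with llama-server."""
    # Imported lazily so server_command stays usable without huggingface_hub.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(repo_id=repo, filename=filename, local_dir=data_dir)
    cmd = server_command(model_path, template_path=template_path)
    # execvp does not return on success: the Python process becomes llama-server.
    os.execvp(cmd[0], cmd)
```

Replacing the process with os.execvp, rather than spawning a child, is one way to read "hands the process over": the server then receives container signals directly and no idle Python parent stays resident.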

A value of default means the launcher does not pass that flag, so llama.cpp falls back to its own built-in default. This is closer to the fast reference Space and avoids the CPU overhead of experimental KV-cache quantization or very small batch sizes.
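
The "skip the flag when the value is default" rule can be sketched as follows. The mapping and function name are hypothetical. The flag spellings are llama-server options, but whether --flash-attn accepts a value depends on the llama.cpp build, so treat that entry as an assumption.

```python
# Hypothetical mapping from the Space's settings to llama-server flags.
OPTIONAL_FLAGS = {
    "BATCH_SIZE": "--batch-size",
    "UBATCH_SIZE": "--ubatch-size",
    "FLASH_ATTN": "--flash-attn",  # assumption: this build takes on/off/auto
    "CACHE_TYPE_K": "--cache-type-k",
    "CACHE_TYPE_V": "--cache-type-v",
}


def optional_args(settings):
    """Return flags only for settings that were changed away from "default"."""
    args = []
    for key, flag in OPTIONAL_FLAGS.items():
        value = settings.get(key, "default")
        if value != "default":
            args += [flag, value]
    return args


if __name__ == "__main__":
    # With everything left at "default", nothing is added to the command line.
    print(optional_args({}))
    # Overriding one setting adds exactly one flag/value pair.
    print(optional_args({"CACHE_TYPE_K": "q8_0"}))
```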

If you want to compare Q2

Set this environment variable back to the Q2 file:

MODEL_FILE=gemma4-coding-Q2_K.gguf

Q2 starts and responds faster, but the output may be incoherent.

Upload

Upload these files to the root of a Docker Space:

Dockerfile
app.py
README.md