Jakob Neugebauer

Initial commit: SMS Spam Clusters

3694062 about 1 month ago

preview code

Raw

History Blame Contribute Delete

2.46 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Deploy

1. Run the offline build

The Space loads four artefact files at startup. Generate them locally:

cd sms-spam-clusters

# One-time: build dependencies (not what the Space installs)
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-build.txt

# HF token for cluster naming (Gemma 4 via the Inference router)
export HF_TOKEN=hf_...

# Full build: embed + cluster + name. Takes ~5-10 min the first time:
# the bge-base model download is ~440 MB, and embedding 5,159 messages
# on CPU is the slow part. Subsequent runs reuse the embedding cache.
python scripts/build.py

Outputs land in data/:

embeddings.npy ~16 MB
corpus.parquet ~1 MB
clusters.json ~3 KB
ham_subclusters.json ~10 KB
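
Before moving on, a quick sanity check of the build outputs can be sketched as follows. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and the size estimate assumes 768-dimensional float32 vectors (the bge-base output width), which is consistent with the ~16 MB figure for 5,159 messages.

```python
from pathlib import Path

# File names taken from the list above; the helper names are illustrative only.
EXPECTED_ARTIFACTS = (
    "embeddings.npy",
    "corpus.parquet",
    "clusters.json",
    "ham_subclusters.json",
)


def missing_artifacts(data_dir="data"):
    """Return the expected artefact files that are absent or empty."""
    root = Path(data_dir)
    missing = []
    for name in EXPECTED_ARTIFACTS:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def expected_embedding_bytes(n_rows=5159, dim=768, itemsize=4):
    """Raw array payload in bytes (assumption: float32 vectors of width 768)."""
    return n_rows * dim * itemsize


if __name__ == "__main__":
    print("missing:", missing_artifacts() or "none")
    print(f"embeddings payload: {expected_embedding_bytes() / 1e6:.1f} MB")
```

Under those assumptions the payload is 5,159 × 768 × 4 = 15,848,448 bytes, so a much smaller `embeddings.npy` would suggest a truncated or partial build.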

To iterate on clustering parameters without re-embedding:

python scripts/build.py --min-cluster-size 25 --ham-min-cluster-size 20

To iterate on the UI without paying for cluster-naming calls:

python scripts/build.py --no-llm
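
For orientation, the three flags used above could be parsed along these lines. This is a hypothetical sketch of a command-line interface, not the actual contents of scripts/build.py; the function name is invented, and the `None` defaults simply stand in for whatever defaults the real script uses.

```python
import argparse


def parse_build_args(argv=None):
    """Parse the build flags mentioned above (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Offline build (sketch)")
    # None means "fall back to whatever default the real script uses".
    parser.add_argument("--min-cluster-size", type=int, default=None)
    parser.add_argument("--ham-min-cluster-size", type=int, default=None)
    # When set, skip the cluster-naming calls entirely.
    parser.add_argument("--no-llm", action="store_true")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_build_args(["--min-cluster-size", "25", "--no-llm"])
    print(args.min_cluster_size, args.ham_min_cluster_size, args.no_llm)
```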

2. Smoke-test the Space locally

pip install -r requirements.txt
python app.py
# opens at http://127.0.0.1:7860

Paste a few messages, switch between Full corpus and Ham sub-clusters, check that the black diamond marker appears in a sensible place and the nearest neighbours look right.
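
When judging whether the neighbours "look right", it can help to keep in mind what a nearest-neighbour lookup over embeddings typically does. The sketch below assumes cosine similarity, a common choice for bge-style embeddings; the app's actual metric and function names are not documented here, so treat both as assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbours(query, corpus, k=5):
    """Return (index, similarity) pairs for the k corpus vectors closest to query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real message embeddings.
    toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    for idx, sim in nearest_neighbours([1.0, 0.1], toy_corpus, k=2):
        print(idx, round(sim, 3))
```

A pasted spam message whose top neighbours are mostly ham (or the reverse) is a hint that the artefacts and the app are out of sync.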

3. Push to Hugging Face

# One-time setup
git remote add hf https://huggingface.co/spaces/jngb-labs/sms-spam-clusters
git lfs install
# (.gitattributes already tracks *.npy and *.parquet as LFS)

git add .
git commit -m "Initial commit: clustering Space"
git push hf main

If the push fails because the repo doesn't exist, create the Space first via the web UI at huggingface.co/new-space (SDK: Gradio).
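
As an alternative to the web UI, the Space can also be created from Python with `huggingface_hub.create_repo`. This is a sketch under assumptions: the `huggingface_hub` package is installed, the token has write access, and the helper names are hypothetical.

```python
def space_repo_kwargs(repo_id):
    """Keyword arguments for creating a Gradio Space via create_repo."""
    return {
        "repo_id": repo_id,
        "repo_type": "space",
        "space_sdk": "gradio",
        "exist_ok": True,  # do not fail if the Space already exists
    }


def create_space(repo_id, token=None):
    """Create the Space on the Hub (makes a network call)."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import create_repo

    return create_repo(token=token, **space_repo_kwargs(repo_id))


if __name__ == "__main__":
    # Printing only; call create_space(...) yourself to actually create the repo.
    print(space_repo_kwargs("jngb-labs/sms-spam-clusters"))
```

Calling `create_space("jngb-labs/sms-spam-clusters", token=os.environ["HF_TOKEN"])` once before the first `git push hf main` should have the same effect as the web form.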

4. Set Space secret

In Settings → Variables and secrets add:

HF_TOKEN = hf_...

The Space doesn't call the LLM at runtime: cluster names are baked into clusters.json by the offline build. Setting the token anyway avoids warning chatter in the build logs and leaves room to add an optional zero-shot panel later without re-deploying.
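
Space secrets are exposed to the running app as environment variables, so an optional feature can key off the token's presence. The following is a minimal sketch of that pattern; the function names are hypothetical and do not come from app.py.

```python
import os


def get_hf_token(env=None):
    """Return the HF token if set and non-empty, else None."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    return token or None


def llm_features_enabled(env=None):
    """True when a token is available for optional LLM-backed panels."""
    return get_hf_token(env) is not None


if __name__ == "__main__":
    # Report only whether a token is present; never print the token itself.
    print("LLM features enabled:", llm_features_enabled())
```

Treating a missing token as "feature off" rather than as an error keeps the core app working even if the secret is never set.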

5. Verify

Once the Space is built (~3-5 minutes), open it and check:

- The scatter plot loads with all 5,159 points
- Hover shows preview text + cluster name
- Pasting a message produces a black diamond marker and 5 neighbours
- Switching to Ham sub-clusters re-projects to the ham-only layout
- Filter chips (spam / ham / both) work in the full corpus view