# Changelog

All notable changes to the Liquid AI Spam Classifier project are documented in this file.

## [v0.5.9] - 2026-04-16 (retrain script fixes: NaN gradients + accurate example counts)

### Summary
Fixed gradient explosion (NaN loss) that crashed full retrains on Apple Silicon, and
corrected stale example counts and time estimates across all retrain command files.

### Fixed
- `retrain_liquid.py` — `ACTIVATION_OFFLOADING = True` caused `AssertionError: Torch not
  compiled with CUDA enabled`; TRL's `OffloadActivations` is CUDA-only and crashes on MPS.
  Set to `False`.
- `retrain_liquid.py` — gradient explosion (`loss: 0`, `grad_norm: nan`, `entropy: nan`)
  caused by a learning rate that was too aggressive once activation offloading was disabled. Fixed:
  - `LEARNING_RATE`: `2e-4` → `5e-5`
  - `max_grad_norm`: `0.3` → `1.0`
  - Added `warmup_steps=100` to ramp LR gradually and prevent early gradient spikes
- `retrain.command` (liquid), `Retrain.command`, `spam-classifier-mlx/retrain.command` —
  menu displayed stale example counts (~20,000) and time estimates (~2.5-3.5 hrs) that
  didn't match actual dataset sizes. Corrected to actual counts (liquid/mlx full: ~16,000;
  mlx fast: ~6,800) and recalculated time estimates.
- `retrain.command` (liquid) — removed "activation offloading" from memory optimizations
  note since it is now disabled on MPS.
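
To make the learning-rate fix concrete, here is a minimal pure-Python sketch of a linear warmup and of gradient-norm clipping using the new values. The helper names `warmup_lr` and `clip_scale` are hypothetical and are not part of `retrain_liquid.py`; the real scheduler may also decay the rate after warmup, which is omitted here.

```python
BASE_LR = 5e-5        # new LEARNING_RATE from this release
WARMUP_STEPS = 100    # new warmup_steps
MAX_GRAD_NORM = 1.0   # new max_grad_norm


def warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linear ramp from 0 to base_lr, then flat (post-warmup decay omitted)."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps


def clip_scale(grad_norm, max_grad_norm=MAX_GRAD_NORM):
    """Factor applied to every gradient under norm clipping (simplified)."""
    if grad_norm <= max_grad_norm:
        return 1.0
    return max_grad_norm / grad_norm


print(warmup_lr(10))             # early steps use a small fraction of the LR
print(clip_scale(4.0))           # a norm of 4.0 is scaled back down to 1.0
print(clip_scale(float("nan")))  # NaN stays NaN: clipping cannot repair it
```

The last line illustrates why clipping alone is not enough: once a gradient norm is NaN, scaling by it keeps the update NaN, so the spike has to be avoided earlier with a lower rate and a warmup.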

### Changed
- `spam-classifier-liquid/spam_classifier_liquid.ipynb` — added `torch_empty_cache_steps=50`
  and `dataloader_pin_memory=False` to notebook `SFTConfig` to match retrain script

---

## [v0.5.3] - 2026-04-16 (GGUF rename to spam-classifier-F16.gguf)

### Summary
Renamed the local and HuggingFace GGUF file to `spam-classifier-F16.gguf` so that
HuggingFace's model card parser can detect the quantization type (F16) and display
it in the GGUF variants widget. Updated all local file references accordingly.

### Changed
- `spam-classifier.gguf` → `spam-classifier-F16.gguf` (local file rename)
- `VoltageVagabond/spam-classifier-liquid-GGUF` — deleted `spam-classifier.gguf`,
  uploaded `spam-classifier-F16.gguf`, updated README
- Updated all references in `StartServer.command`, `Retrain.command`,
  `merge_and_convert_gguf.py`, `verify_gguf_model.py`, both `Modelfile`s,
  `spam-classifier-liquid-GGUF/README.md`, and this changelog

---

## [v0.5.2] - 2026-04-16 (GGUF system prompt patch + llama-server fixes)

### Summary
Baked the spam classifier system prompt directly into the GGUF model file's
`tokenizer.chat_template` metadata so any client (llama.cpp, LM Studio, Ollama)
applies the correct behavior without manual configuration. Fixed two llama-server
startup bugs introduced by a brew update.

### Changed
- `spam-classifier-F16.gguf` — patched `tokenizer.chat_template` to use
  `"You are an email spam classifier..."` as the default system prompt; done via raw
  binary rewrite of the GGUF metadata section (string grows by 118 bytes; tensor
  data section is untouched)
- `StartServer.command` — fixed `-fa on` (flag syntax changed in brew b8680; was
  bare `-fa`, now requires explicit `on`/`off`/`auto`)
- `StartServer.command` — added `--webui-config` with `systemMessage` and
  `temperature` so the llama.cpp Web UI pre-fills the system prompt automatically
  (the Web UI uses the raw `/completion` endpoint and does not apply the chat
  template on its own)
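
For context on why a longer template shifts the metadata section: in GGUF v2 and later, a metadata string is stored as a little-endian 64-bit byte length followed by the UTF-8 bytes. The sketch below only illustrates that encoding with made-up template strings; it is not the patch script used for this release, and a real patch would also need to keep the tensor data section correctly aligned.

```python
import struct


def encode_gguf_string(text):
    """Encode a string the way GGUF metadata stores it: uint64 length + bytes."""
    data = text.encode("utf-8")
    return struct.pack("<Q", len(data)) + data


def decode_gguf_string(buf, offset=0):
    """Decode one length-prefixed string; return (text, offset after it)."""
    (length,) = struct.unpack_from("<Q", buf, offset)
    start = offset + 8
    return buf[start:start + length].decode("utf-8"), start + length


# Hypothetical templates, used only to show how the size delta arises.
old_blob = encode_gguf_string("{{ generic assistant prompt }}")
new_blob = encode_gguf_string("{{ a longer, classifier-specific prompt }}")

growth = len(new_blob) - len(old_blob)
print(f"metadata grows by {growth} bytes")
```

Because the length prefix is fixed at 8 bytes, the growth equals the difference in UTF-8 byte length between the two strings.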

---

## [v0.5.1] - 2026-04-16 (consolidated retrain script + memory optimizations)

### Summary
Replaced the two separate `retrain-fast.command` and `retrain-full.command` scripts
with a single `retrain.command` that prompts for fast or full mode at launch.
Applied memory optimizations to `retrain_liquid.py` to reduce MPS GPU pressure
during training. Added a top-level `Retrain.command` pipeline script in the LLM
Project root that chains retrain → GGUF rebuild → HuggingFace upload.

### Changed
- `retrain-fast.command` + `retrain-full.command` → replaced by single `retrain.command`
  - Double-click launches a menu: f) Fast (~1-1.5 hrs) / u) Full (~2.5-3.5 hrs) / q) Quit
  - Includes adapter swap prompt with backup logic (same as before, just unified)
  - Reminds user to run `Retrain.command` in LLM Project root for GGUF rebuild

### Memory optimizations in `retrain_liquid.py`
- `activation_offloading=True` — offloads forward-pass activations from MPS to CPU
  RAM; frees ~25% MPS memory at ~15% speed cost (biggest knob for avoiding OOM)
- `torch_empty_cache_steps=50` — flushes the MPS memory pool every 50 optimizer
  steps; prevents memory fragmentation from causing OOM mid-run
- `optim="adamw_torch_fused"` — fused AdamW kernel; slightly faster and lower peak
  memory than unfused `adamw_torch`
- `dataloader_pin_memory=False` — pin_memory is a CUDA optimization that wastes
  memory on MPS; explicitly disabled
- Already enabled: `gradient_checkpointing=True`, `bf16=True`, `MAX_LENGTH=256`

### Added (LLM Project root)
- `Retrain.command` — end-to-end pipeline: retrain → swap adapter → rebuild GGUF
  (clears `merged-liquid-full/` cache so new adapter is actually baked in) →
  upload adapter + GGUF to HuggingFace → remind to restart llama.cpp server

---

## [v0.5.0] - 2026-04-16 (GGUF merged model + server commands)

### Summary
Converted the trained LoRA adapter into a fully merged standalone GGUF file
suitable for llama.cpp, Ollama, and LM Studio. Added `StartServer.command` and
`StopServer.command` for launching the llama.cpp server locally. Uploaded the
merged GGUF to a new HuggingFace repo with full platform instructions.

### Added
- `merge_and_convert_gguf.py` — merges LoRA adapter into base model weights
  then converts to GGUF F16 using llama.cpp's `convert_hf_to_gguf.py` script
- `spam-classifier-F16.gguf` (~2.2 GB) — fully merged standalone GGUF;
  no separate base model or adapter file needed at runtime
- `StartServer.command` — double-click launcher for llama.cpp server with all
  Apple Silicon performance flags (`-ngl 99`, `-fa`, `--mlock`, 8-bit KV cache,
  perf-core thread pinning) and system prompt injected at startup
- `StopServer.command` — kills the server by PID file, falls back to port kill
- `Modelfile` — Ollama configuration with system prompt and temperature baked in;
  allows `ollama create spam-classifier -f Modelfile` for zero-config deployment
- `llama-server-config.json` — reference config showing all server flags
- `upload_adapter_to_root.py` — uploads adapter files to HF repo root (required
  by the `gguf-my-lora` Space, which expects `adapter_config.json` at root, not in a subfolder)
- `upload_merged_gguf.py` — creates and uploads to VoltageVagabond/spam-classifier-liquid-GGUF
- `upload_gguf_readme.py` — uploads README with per-platform usage instructions
- `verify_gguf_model.py` — tests the GGUF against real test set examples using
  llama-cpp-python; confirms fine-tuning is active, not just base model behavior

### New HuggingFace repo
- `VoltageVagabond/spam-classifier-liquid-GGUF` — merged F16 GGUF with Modelfile,
  README covering Ollama / LM Studio / llama.cpp server / llama.cpp CLI usage,
  and educational disclaimer for senior project context

### Docs
- `docs/08-gguf-conversion-guide.md` — full guide: Option A (gguf-my-lora Space),
  Option B (local merge + convert), troubleshooting section covering every error
  encountered (wrong Space, nested subfolder, too many requests, redirect loop,
  adapter-only GGUF failing to load)
- `docs/README.md` — added guide 8 to table of contents

### Key lessons documented
- `gguf-my-lora` Space produces an adapter-only GGUF (~8.6 MB), NOT a standalone
  model — this causes "failed to load" errors in Ollama/LM Studio without `--lora`
- The system prompt must match training format exactly or the model falls back to
  base LFM2.5 general-assistant behavior
- GGUF format cannot embed a system prompt in weights — Modelfile (Ollama) is the
  closest "set it and forget it" workaround for end users

---

## [v0.4.9] - 2026-04-16 (GGUF conversion guide + adapter repo root upload)

### Summary
Documented how to convert the trained LoRA adapter to GGUF format so it can be
used with llama.cpp, Ollama, and LM Studio. Also fixed the HuggingFace model repo
so that adapter files are at the root level (required by the gguf-my-lora Space).

### Added
- `docs/08-gguf-conversion-guide.md` — step-by-step guide covering two conversion
  paths (Option A: gguf-my-lora Space in browser; Option B: merge locally then convert
  with llama.cpp), plus a full troubleshooting section for every error encountered
- `upload_adapter_to_root.py` (project root) — helper script that uploads
  `adapter_config.json`, `adapter_model.safetensors`, `tokenizer_config.json`,
  `tokenizer.json`, and `chat_template.jinja` to the root of the
  `VoltageVagabond/spam-classifier-liquid` HF repo (the gguf-my-lora Space requires
  `adapter_config.json` at root, not inside an `adapters/` subfolder)
- `docs/README.md` updated to include guide 8 in the table of contents

### Issues encountered and fixed (documented in guide 8)
- `gguf-my-repo` Space gave "no model_type in config.json" — wrong Space; LoRA
  adapters need `gguf-my-lora`, not `gguf-my-repo`
- `gguf-my-lora` gave "adapter_config.json not found" — adapter files were nested
  in `adapters/` subfolder on HF, not at repo root; fixed by uploading to root
- `gguf-my-repo` showed "too many requests" — Space has 1,900+ likes and gets
  heavy traffic; workaround is to duplicate the Space to your own account
- HuggingFace sign-in redirect loop — caused by stale cookies; fixed by clearing
  cookies or using incognito window

---

## [v0.4.8] - 2026-04-14 (8-bit KV cache quantization)

### Summary
Enabled 8-bit quantization for the KV cache at inference time to reduce memory
usage without changing model weights or training.

### What changed in app.py
- `model.generate()` now passes `cache_implementation="quantized"` and
  `cache_config={"backend": "hqq", "nbits": 8}`, quantizing both the key and
  value cache to 8-bit during generation
- Used the `hqq` backend (recommended for int8; `quanto` only supports int2/int4)
- Model weights remain at BF16; only the runtime KV cache is affected
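
A rough sketch of the generation settings described above, plus a back-of-the-envelope size estimate. The helper names and the cache shape numbers are hypothetical placeholders, not values read from `app.py` or from the model config; the estimate ignores quantization metadata overhead.

```python
def build_generate_kwargs(max_new_tokens=100):
    """Keyword arguments for model.generate() with an 8-bit quantized KV cache."""
    return {
        "max_new_tokens": max_new_tokens,  # hypothetical value
        "cache_implementation": "quantized",
        "cache_config": {"backend": "hqq", "nbits": 8},
    }


def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size: one key and one value tensor per layer."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bits // 8


# Hypothetical cache shape, for illustration only.
shape = dict(attn_layers=6, kv_heads=8, head_dim=64, seq_len=1024)
bf16_bytes = kv_cache_bytes(bits=16, **shape)
int8_bytes = kv_cache_bytes(bits=8, **shape)
print(f"cache shrinks by a factor of {bf16_bytes / int8_bytes:.1f}")
```

With a loaded model, the settings would be passed as `model.generate(**inputs, **build_generate_kwargs())`.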

### What changed in requirements.txt
- Added `hqq>=0.2.0` — required package for the HQQ quantization backend

---

## [v0.4.7] - 2026-04-14 (Documentation sync with fine_tune.py)

### Summary
Audit pass to bring `README.md` in line with `fine_tune.py`. The README had a
stale "Binary classification only" limitation note (3-class has been live
since v0.4.0) and an out-of-date batch size, plus it was still quoting the
pre-optimization training time.

### Changes to README.md
- Training Details table:
  - Batch size `4` → `1 (effective 4 with gradient accumulation steps = 4)`
    to match `BATCH_SIZE = 1` and `GRADIENT_ACCUMULATION_STEPS = 4` in
    fine_tune.py (lines 72-73)
  - Added explicit rows for Max sequence length (256), Optimizer
    (`adamw_torch`), Weight dtype (bfloat16), Device (MPS), and Max gradient
    norm (0.3) to match the code
  - Training time `~2–2.5 hours` → `~1–1.5 hours` to match the in-code comment
    on line 241, with a note that the older figure reflected the
    pre-v0.4.3 config
- Limitations: "Binary classification only" note replaced with
  "Three-class classification (SPAM / HAM / PHISHING) as of v0.4.0"

### Rationale
`fine_tune.py` is the source of truth. Values read from the file:
```
LORA_RANK                  = 8     (line 53)
LORA_ALPHA                 = 16    (line 54)
LORA_DROPOUT               = 0.1   (line 55)
LORA_TARGET_MODULES        = 8     (lines 56-68; q/k/v/out_proj, w1/w2/w3, in_proj)
NUM_EPOCHS                 = 3     (line 71)
BATCH_SIZE                 = 1     (line 72)
GRADIENT_ACCUMULATION_STEPS= 4     (line 73)
LEARNING_RATE              = 2e-4  (line 74)
MAX_LENGTH                 = 256   (line 75)
optim                      = "adamw_torch"   (line 226)
torch_dtype                = bfloat16         (line 167)
device_map                 = "mps"            (line 166)
max_grad_norm              = 0.3              (comment / training args)
Training time comment      = "~1-1.5 hours"   (line 241)
```
No code changes, no retraining in this release.

---

## [v0.4.6] - 2026-04-14 (HF Spaces deployment fixes)

### Summary
Got the liquid Space (`VoltageVagabond/spam-classifier-liquid-space`) running on HF after
several iterations of diagnosing adapter download failures.

### Q&A from this session

**Q: Why does the Space log say `Adapters not found at /app/adapters` when the local
app works fine?**
A: The local `adapters/` directory is git-ignored and never uploaded to the Space
(too large + the upload script explicitly excludes it). On HF Spaces the directory
doesn't exist, so the app falls through to the "no adapter" code path.

**Q: How was that fixed?**
A: Added a `snapshot_download` fallback in `app.py`: if local adapters are missing,
download them from the `VoltageVagabond/spam-classifier-liquid` model repo at startup.

**Q: First attempt got `401 Repository Not Found`. Why?**
A: The model repo was set to **private** and the Space had no `HF_TOKEN` secret.
The Space container runs anonymously by default, so it couldn't authenticate.
Fix: made the model repo public (no token needed). Alternative: keep private and
add `HF_TOKEN` as a Space repository secret with read scope.

**Q: Next error: `Can't find 'adapter_config.json' at '/root/.cache/.../snapshots/...'`. Why?**
A: The model repo doesn't store adapter files at the root — they're nested under
`adapters_fast/`, `adapters_full/`, `adapters_backup/`. The download succeeded but
`PeftModel.from_pretrained` looked at the snapshot root and couldn't find
`adapter_config.json`. Fix: use `allow_patterns=["adapters_fast/*"]` and set
`ADAPTER_PATH = snapshot_path / "adapters_fast"` so PEFT loads from the right subdir.

**Q: Why is classification slow on HF but fast locally?**
A: HF free tier (`cpu-basic`) is 2 vCPUs, 16 GB RAM, no GPU. Local Mac uses Apple
Silicon Metal/MPS acceleration. A 1.2B-param transformer on CPU is just slow.
Realistic speedups (high → low impact):
1. Upgrade Space to a T4 GPU (~$0.40/hr, only billed when running)
2. 4-bit quantization via `bitsandbytes` (~2-3× faster on CPU)
3. Reduce `max_tokens` from 750 → ~100 (the label itself is only SPAM, HAM, or PHISHING)
4. `model.merge_and_unload()` — bake LoRA into base model, removes per-call overhead
5. Switch to GGUF + llama-cpp-python — significantly faster than HF transformers on CPU

**Q: Why does the model repo need to be public for the Space to work?**
A: The Space container runs anonymously. Public repo = anonymous downloads work.
Private repo = need an authenticated `HF_TOKEN` secret in the Space settings.
The Space being public/private is independent — that controls who can view the
demo, not what the container can fetch.

### Changes
- `app.py` — added `snapshot_download` fallback that pulls from the HF model repo
  when local adapters are missing
- `app.py` — passes `os.environ.get("HF_TOKEN")` to `snapshot_download` so the same
  code path works for both public and private model repos
- `app.py` — `allow_patterns=["adapters_fast/*"]` and `ADAPTER_PATH` now points at
  the `adapters_fast/` subdirectory inside the downloaded snapshot
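
The fallback can be sketched roughly as follows. This is an illustration under assumptions, not the code in `app.py`: the function name `resolve_adapter_path` and the injectable `download` parameter are hypothetical (the latter exists only so the logic can be exercised without network access).

```python
import os
from pathlib import Path

REPO_ID = "VoltageVagabond/spam-classifier-liquid"
SUBDIR = "adapters_fast"


def resolve_adapter_path(local_dir, download=None):
    """Use local adapters if present, otherwise download them from the Hub."""
    local_dir = Path(local_dir)
    if (local_dir / "adapter_config.json").exists():
        return local_dir

    if download is None:
        # Imported lazily so the local path works without huggingface_hub.
        from huggingface_hub import snapshot_download
        download = snapshot_download

    snapshot_path = download(
        repo_id=REPO_ID,
        allow_patterns=[f"{SUBDIR}/*"],      # fetch only the needed subfolder
        token=os.environ.get("HF_TOKEN"),    # None is fine for a public repo
    )
    # PEFT must be pointed at the subfolder, not at the snapshot root.
    return Path(snapshot_path) / SUBDIR
```

The returned path is what would then be handed to `PeftModel.from_pretrained`.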

---

## [v0.4.5] - 2026-04-14

### Beginner-Code Compliance — app.py

Refactored `app.py` to match the beginner-friendly coding style used in course lecture notebooks.

**What changed:**
- Replaced 3 lambda functions in Gradio event handlers with named functions (`make_example_handler`, `clear_input`)
- Replaced ternary operator for emoji selection with explicit `if/else` block
- No behavior changes — all Gradio event wiring, feedback logging, and chat logic unchanged
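
A small sketch of the style change. The function names `make_example_handler` and `clear_input` come from this entry, but their bodies here are assumptions, and `pick_emoji` with its labels and emoji is a hypothetical stand-in for the emoji selection.

```python
def make_example_handler(example_text):
    """Return a named callback that fills the input box with example_text."""
    def handle_click():
        return example_text
    return handle_click


def clear_input():
    """Named replacement for a `lambda: ""` that empties the input box."""
    return ""


def pick_emoji(label):
    """Explicit if/else in place of a one-line ternary."""
    if label == "HAM":
        emoji = "✅"
    else:
        emoji = "🚨"
    return emoji


spam_handler = make_example_handler("You won a free cruise! Click here.")
print(spam_handler())
print(pick_emoji("HAM"), pick_emoji("SPAM"))
```

Each handler returned by `make_example_handler` remembers its own example text, which is the behavior the lambdas provided before.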

## [v0.4.4] - 2026-04-14

### Chat App Upgrade — app.py

Replaced the two-tab Gradio app (Classify + Chat) with a polished chat-only interface.

**What changed:**
- Removed the Classify tab entirely — chat is now the full interface
- Added HTML topbar with project title, model name, and badge pills (matches XAI project style)
- Added clickable example prompt buttons (spam, ham, phishing) that populate the input
- Added 👍 / 👎 feedback buttons that log to `data/feedback/feedback_log.csv`
  - CSV columns: `timestamp`, `user_input`, `model_response`, `rating`
  - Feedback status resets after each new submission
- Increased `max_tokens` from 500 → 750 to reduce mid-sentence cutoffs
- Fixed Gradio 6 compatibility: `theme`/`css` moved to `launch()`, `gr.Chatbot` returns full history list
- Paths anchored to `Path(__file__).parent` so the app works from any launch directory
- Updated `Dockerfile`: consolidated to install deps from `requirements.txt`, removed redundant pip install lines
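
The feedback log described above could look roughly like this. It is a sketch under assumptions: the `log_feedback` name and the ISO timestamp format are hypothetical; only the column names come from this entry.

```python
import csv
from datetime import datetime
from pathlib import Path

FEEDBACK_COLUMNS = ["timestamp", "user_input", "model_response", "rating"]


def log_feedback(csv_path, user_input, model_response, rating):
    """Append one feedback row, writing the header if the file is new."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    is_new_file = not csv_path.exists()

    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(FEEDBACK_COLUMNS)
        timestamp = datetime.now().isoformat(timespec="seconds")
        writer.writerow([timestamp, user_input, model_response, rating])
```

In the app, the path would be anchored to the script location, for example `Path(__file__).parent / "data" / "feedback" / "feedback_log.csv"`, so the log lands in the same place regardless of the launch directory.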

## [v0.4.3] - 2026-04-07

### Memory & Speed Optimization — fine_tune.py

Reduced peak memory usage from ~50 GB to a target of ~8–14 GB by changing the eight training settings listed below. No change to model architecture or LoRA adapter structure, so accuracy is unaffected.

| Parameter | Before | After | Why |
|-----------|--------|-------|-----|
| `BATCH_SIZE` | 4 | 1 | Smaller batch = 4× less activation memory per step |
| `GRADIENT_ACCUMULATION_STEPS` | 1 | 4 | Keeps effective batch size at 4 so training dynamics are unchanged |
| `MAX_LENGTH` | 512 | 256 | Attention memory scales O(n²) with sequence length, so halving it cuts attention memory ~4×; spam emails rarely exceed 256 tokens |
| `optim` | `adamw` (default) | `adamw_8bit` | Adam normally keeps two float32 state tensors (first and second moment) for every parameter (~9.6 GB for a 1.2B model); 8-bit Adam quantizes those to 8-bit integers with negligible quality loss (~75% reduction) |
| `torch_dtype` | `"auto"` | `torch.bfloat16` | Forces model weights to load in bfloat16 (2 bytes/param) instead of float32 (4 bytes/param), halving weight memory; bfloat16 has the same exponent range as float32 so training stability is preserved |
| `device_map` | `"auto"` | `"mps"` | Pins all layers to the MPS GPU; `"auto"` can spill layers to CPU causing slow cross-device copies and inflated memory readings |
| `gradient_checkpointing_kwargs` | not set | `{"use_reentrant": False}` | Suppresses deprecation warning on newer PyTorch; no behavior change |
| `max_grad_norm` | not set | `0.3` | Clips gradient norms to prevent occasional instability spikes during training |

**Why quality is unaffected:**
- 8-bit Adam was validated by Dettmers et al. (2022) to match full-precision Adam loss curves on LLM fine-tuning
- bfloat16 was designed specifically for training — same exponent range as float32, just less mantissa precision
- Effective batch size (1 × 4 accumulation = 4) is identical to the original (4 × 1)
- 256 tokens covers the vast majority of spam/ham emails in this dataset
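
The figures in the table can be reproduced with simple arithmetic. This sketch follows the table's own assumption that optimizer state is kept for all 1.2B parameters, and it uses decimal gigabytes; the variable names are illustrative.

```python
PARAMS = 1.2e9   # parameter count quoted in the table
GB = 1e9         # decimal gigabytes

# Adam keeps two state tensors per parameter (first and second moment).
adam_fp32_gb = PARAMS * 4 * 2 / GB   # 4 bytes per float32 value
adam_8bit_gb = PARAMS * 1 * 2 / GB   # 1 byte per 8-bit value
adam_saving = 1 - adam_8bit_gb / adam_fp32_gb

# Weight memory: float32 vs bfloat16.
weights_fp32_gb = PARAMS * 4 / GB
weights_bf16_gb = PARAMS * 2 / GB

# Attention memory grows with the square of the sequence length.
attention_ratio = (512 / 256) ** 2

# Effective batch size = per-step batch size x accumulation steps.
effective_before = 4 * 1
effective_after = 1 * 4

print(f"Adam state: {adam_fp32_gb:.1f} GB -> {adam_8bit_gb:.1f} GB ({adam_saving:.0%} less)")
print(f"Weights:    {weights_fp32_gb:.1f} GB -> {weights_bf16_gb:.1f} GB")
print(f"Attention:  {attention_ratio:.0f}x less memory at 256 tokens")
print(f"Effective batch size: {effective_before} -> {effective_after}")
```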

## [v0.4.2] - 2026-04-07

### Updated — Training Data Pipeline
- **Added puyang2025/seven-phishing-email-datasets and zefang-liu/phishing-email-dataset** as additional sources in `build_liquid_datasets.py` — parquets generated by the spam-xai-project sibling and shared across all three classifier projects
- **Updated data counts** in `retrain-fast.command` and `retrain-full.command` to reflect new ~190K source pool

## [v0.4.1] - 2026-03-28

### Retrain Commands with Adapter Swap
- `retrain-fast.command` and `retrain-full.command` now prompt after training to swap the new adapter as the default
- Selecting "y" backs up `adapters/` to `adapters_backup/` and copies the new adapter in
- App and notebook automatically use whichever adapter is in `adapters/`
- Old `retrain.command` (2-class, 4K examples) removed — replaced by fast/full versions
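
The backup-then-swap step can be sketched in Python as below. The `.command` scripts are shell launchers, so this is only an illustration of the logic; the `swap_adapter` name is hypothetical, and overwriting any earlier backup is an assumption.

```python
import shutil
from pathlib import Path


def swap_adapter(new_dir, active_dir, backup_dir):
    """Back up the active adapter, then copy the newly trained one in."""
    new_dir, active_dir, backup_dir = Path(new_dir), Path(active_dir), Path(backup_dir)
    if not new_dir.is_dir():
        raise FileNotFoundError(f"No trained adapter at {new_dir}")

    if active_dir.exists():
        # Assumption: only the most recent backup is kept.
        if backup_dir.exists():
            shutil.rmtree(backup_dir)
        shutil.copytree(active_dir, backup_dir)
        shutil.rmtree(active_dir)

    shutil.copytree(new_dir, active_dir)
```

A call such as `swap_adapter("adapters_fast", "adapters", "adapters_backup")` would mirror the directories named in this changelog.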

## [v0.4.0] - 2026-03-28

### Added — 3-Class Training Data + HuggingFace Upload
- **NEW: Phishing detection** — model can now classify as SPAM, HAM, or PHISHING (previously binary only)
- Prepared two new training datasets from 5 combined sources:
  - **FAST** (8,000 examples): ~1 hr retrain — `new_training_data/liquid_fast/`
  - **FULL** (20,000 examples): ~3 hr retrain — `new_training_data/liquid_full/`
- Data sources: existing 4K FaroukMoc2 + locuoco 250K (HF) + ealvaradob phishing (HF) + luongnv89 phishing with reasoning (HF) + Enron
- Added `retrain_liquid.py` script with `--mode fast` and `--mode full` (saves to `adapters_fast/` or `adapters_full/`)
- Uploaded project to HuggingFace: `VoltageVagabond/spam-classifier-liquid` (model repo)
- Created HuggingFace Space: `VoltageVagabond/spam-classifier-liquid-space` (Docker + Gradio demo)
- Created `README.md` with HF model card metadata and `Dockerfile` for HF Space
- Uploaded complete dataset to HF: `VoltageVagabond/spam-email-dataset` with all raw sources

## [v0.3.2] - 2026-03-28

### Fixed
- Fixed `ValueError: train_dataset is required` crash during evaluation step — SFTTrainer requires `train_dataset` even for eval-only usage

### Added
- `--eval-only` flag for `fine_tune.py` — loads saved adapter and runs evaluation + generation test without retraining (~minutes instead of ~2 hours)
- `evaluate.command` — double-click launcher for eval-only mode
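
A flag like `--eval-only` is typically wired up with `argparse`. The sketch below shows only the flag handling; the `build_parser` and `main` names and the returned mode strings are hypothetical, and the real `fine_tune.py` would run training or evaluation where this returns a label.

```python
import argparse


def build_parser():
    """Create the command-line parser with the --eval-only switch."""
    parser = argparse.ArgumentParser(description="Fine-tune or evaluate the classifier")
    parser.add_argument(
        "--eval-only",
        action="store_true",
        help="load the saved adapter and skip training",
    )
    return parser


def main(argv=None):
    """Return which mode would run for the given arguments."""
    args = build_parser().parse_args(argv)
    if args.eval_only:
        return "evaluate"
    return "train+evaluate"


print(main(["--eval-only"]))
print(main([]))
```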

## [v0.3.1] - 2026-03-27

### Updated
- Corrected training time estimates across all files:
  - Notebook (1 epoch): ~45 minutes on Apple Silicon
  - fine_tune.py (3 epochs): ~2-2.5 hours on Apple Silicon
  - Slowdown vs v0.2.0 due to targeting 8 module types instead of 4 (better quality, more compute per step)
- Fixed training data counts in setup guide (3,200 train / 800 test, not 500/100)
- Added training time comparison table to training guide
- Added batch size 4 saturation note to tuning tips
- Added `docs/07-code-sources-reference.md` — every source, citation, and empirical finding for paper writing

## [v0.3.0] - 2026-03-27

### Changed — LoRA config aligned with Liquid AI official cookbook
- **Source:** [Liquid4All/cookbook](https://github.com/Liquid4All/cookbook/blob/main/finetuning/notebooks/sft_with_trl.ipynb)
- Target modules expanded from 4 (attention only) to 8 (attention + GLU + conv):
  - Attention: `q_proj`, `k_proj`, `v_proj`, `out_proj`
  - Feed-forward GLU: `w1`, `w2`, `w3`
  - Conv: `in_proj`
- LoRA rank 32 → 8, alpha 64 → 16 (matching cookbook values)
- Dropout 0.05 → 0.1 (matching cookbook)
- Fixed `o_proj` → `out_proj` (correct layer name for LFM2 architecture)
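
The resulting settings, written as plain keyword arguments. This is a sketch rather than the exact code in `fine_tune.py`; the `LORA_KWARGS` name is hypothetical. One detail it makes visible is that the LoRA scaling factor `alpha / r` is the same before and after the change.

```python
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "out_proj",  # attention
        "w1", "w2", "w3",                          # feed-forward GLU
        "in_proj",                                 # conv
    ],
}

old_scale = 64 / 32  # alpha / rank before this release
new_scale = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]

print(len(LORA_KWARGS["target_modules"]), "target modules")
print("scaling factor:", old_scale, "->", new_scale)
```

With `peft` installed, these would be passed as `LoraConfig(**LORA_KWARGS, task_type="CAUSAL_LM")`.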

## [v0.2.1] - 2026-03-27

### Note
- Verified Liquid AI version does NOT have the orphaned port issue that affected the MLX version
  - PyTorch loads the model directly into the Python process — no child servers spawned
  - When the app exits, all model memory is freed automatically
  - No cleanup trap needed (unlike MLX version which spawns llama-server processes)

## [v0.2.0] - 2026-03-27

### Changed
- Increased batch size from 1 to 4 for faster training (parallel processing on MPS)
- Increased LoRA rank from 16 to 32 (and alpha from 32 to 64) for better adapter quality
- Removed gradient accumulation (not needed with batch size 4)
- Memory usage ~7-8 GB (comfortable on 24 GB Apple Silicon)

### Tested and reverted
- Batch size 8 tested — MPS GPU saturates at batch size 4, no speed gain beyond that. Steps halved but each step took 2x longer. Batch size 4 is the sweet spot for Apple Silicon.

## [v0.1.1] - 2026-03-27

### Fixed
- Renamed `max_seq_length` to `max_length` in fine_tune.py, notebook, and docs for TRL v0.29 compatibility
- Fixed `launch-notebook.command` not showing Jupyter install errors
- Added model loading time note (30-60 seconds) to `launch UI.command`

## [v0.1.0] - 2026-03-27

### Added
- Project scaffolding (requirements and gitignore)
- Training data copied from MLX sibling project
- `fine_tune.py` — LoRA fine-tuning via TRL SFTTrainer (Liquid AI's official method)
- `app.py` — Gradio web UI with Classify and Chat tabs
- `.command` launcher scripts for macOS
- Beginner-friendly documentation (6 guides)
- Interactive Jupyter notebook walkthrough