How to use shahfazal/lgtm-575-gemma4-e4b-v0.1 with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it")
model = PeftModel.from_pretrained(base_model, "shahfazal/lgtm-575-gemma4-e4b-v0.1")
```
lgtm-575-gemma4-e4b-v0.1
v0.1: research artifact, not a production model.
This is the first iteration in a four-part methodology series. Form (5-7-5 syllable compliance) did not improve under SFT: valid haikus fell from 14% to 8%, a change that is not statistically significant (McNemar's exact test, p=0.263, n=100, n_discordant=20). Content metrics improved substantially. v0.1 is published as a reproducibility anchor for the lgtm-575 blog series, not as a tool for production code review.
The primary deliverable of this project is the eval harness, not these weights. v1.0 (reasoning-token architecture) is in progress.
A LoRA adapter on Gemma 4 E4B trained to write a PR code review as a 5-7-5 haiku; its training targets distill human code-review comments. The project's primary deliverable is the evaluation harness, not the model; the model is the artifact the harness was built to judge.
What this is
The student sees only a code diff and writes a haiku review from scratch (autonomous reviewer: it finds and phrases the issue itself). The human reviewer comment is never shown to the model; it was used only to build the teacher's distillation target and to score the output. That framing makes relevance a stiff test, by design.
Status
- v0.1: shipped as a documented baseline. LoRA SFT on Gemma 4 E4B. Form regressed (14% to 8%), though the change is not statistically significant. Content metrics (relevance, category) improved and cleared their targets.
- v1.0: in progress (reasoning-token architecture). Not yet released.
Evaluation
Held-out set of 100 examples (Python subset of ronantakizawa/codereview-bench), scored by the canonical harness (scripts/score.py). The base and the fine-tuned model get the same prompt and greedy decoding (bench-drop discipline: same code, same set, both sides).
| Metric | Base floor | v0.1 | Golden number | Verdict |
|---|---|---|---|---|
| #1 Form (valid 5-7-5) | 14% (14/100) | 8% (8/100) | >= 45% | Miss |
| #3 Relevance (real-pair mean) | 0.321 | 0.394 | >= 0.37 | Hit |
| #2 Category (4-class macro-F1) | 0.319 | 0.458 | hold gap (>= +0.139) | Hit |
Details:
- Form regressed from 14% to 8%. Paired McNemar exact test: p = 0.263 (n_discordant = 20; 13 rows lost validity, 7 gained). The change is not statistically significant; pure SFT on 113 pairs did not transfer the generate-check-revise loop's form skill into the weights.
- Relevance (metric #3): real-pair mean 0.321 -> 0.394, clearing the 0.37 golden. The gap over the shuffle floor held and widened (+0.192 -> +0.287; shuffle floor 0.129 -> 0.107).
- Category (metric #2): 4-class macro-F1 0.319 -> 0.458, majority baseline 0.179. The gap over baseline more than doubled (+0.139 -> +0.279), clearing the hold-the-gap commitment.
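The paired form comparison above can be re-derived from the reported counts alone. The sketch below is a minimal two-sided exact McNemar test in plain Python, not the harness's own code. The per-row validity lists are reconstructed from the aggregate numbers (1 row valid on both sides, 13 lost, 7 gained, 79 valid on neither), so the row order is made up, and the function names are hypothetical.

```python
from math import comb

def discordant_counts(base_valid, tuned_valid):
    """Count rows that lost validity (base only) and gained it (tuned only)."""
    lost = sum(b and not t for b, t in zip(base_valid, tuned_valid))
    gained = sum(t and not b for b, t in zip(base_valid, tuned_valid))
    return lost, gained

def mcnemar_exact(lost, gained):
    """Two-sided exact McNemar p-value: a binomial test on the discordant pairs."""
    n = lost + gained
    if n == 0:
        return 1.0
    k = min(lost, gained)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Reconstructed from the reported aggregates; the ordering is arbitrary.
base_valid = [True] * 1 + [True] * 13 + [False] * 7 + [False] * 79
tuned_valid = [True] * 1 + [False] * 13 + [True] * 7 + [False] * 79

lost, gained = discordant_counts(base_valid, tuned_valid)
p_value = mcnemar_exact(lost, gained)
print(sum(base_valid), sum(tuned_valid), lost, gained, round(p_value, 3))
# prints: 14 8 13 7 0.263
```

Only the 20 discordant rows enter the test; the 80 rows where both models agree do not affect the p-value.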
Golden numbers were committed before any SFT run, so the targets cannot be retrofitted to the results. The form target was the ambitious one (~3.2x the floor); a miss there is reported honestly and is on-thesis for a project whose argument is that the eval mattered more than the model.
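As a cross-check, the verdicts and gaps can be recomputed from the rounded values in the table. This is an illustrative sketch, not part of scripts/score.py; the variable names are assumptions, and the inputs are the displayed three-decimal scores rather than the raw harness output.

```python
def gap(score, floor):
    """Margin of a score over its floor, rounded like the reported values."""
    return round(score - floor, 3)

# Relevance: real-pair mean minus shuffle floor, base then v0.1.
relevance_gap_base = gap(0.321, 0.129)
relevance_gap_v01 = gap(0.394, 0.107)

# Category: macro-F1 minus the majority baseline (0.179), base then v0.1.
# From the rounded table values the base gap comes out as 0.14; the card
# reports +0.139, and the 0.001 difference may come from rounding of the
# displayed scores.
category_gap_base = gap(0.319, 0.179)
category_gap_v01 = gap(0.458, 0.179)

# Pre-committed golden numbers, as stated in the table.
verdicts = {
    "form": "Hit" if 0.08 >= 0.45 else "Miss",
    "relevance": "Hit" if 0.394 >= 0.37 else "Miss",
    "category": "Hit" if category_gap_v01 >= 0.139 else "Miss",
}
print(relevance_gap_base, relevance_gap_v01, category_gap_v01, verdicts)
```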
Training
- Base: google/gemma-4-E4B-it
- Method: LoRA SFT (r=16, lora_alpha=32, lora_dropout=0.05, target_modules="all-linear", CAUSAL_LM), completion-only loss with the prompt tokens masked.
- Data: 113 (diff, haiku) pairs. Haiku targets generated by the teacher Qwen/Qwen3-30B-A3B-Instruct-2507 via the v2 teacher loop (form check plus a relevance gate).
- Optimization: effective batch 8 (per_device_train_batch_size=1 x gradient_accumulation_steps=8), learning_rate=2e-4, bf16, 3 epochs (about 45 optimizer steps).
- Hardware: A100-80GB on Modal.
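Two items in the list above lend themselves to a small illustration: completion-only loss and the optimizer step count. The sketch below is not the project's training code. It assumes the common convention of setting prompt labels to -100 (the default ignore index of PyTorch's cross-entropy loss), and it assumes that the final partial accumulation window of each epoch still triggers an optimizer step. The token IDs and function names are made up.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def build_labels(prompt_ids, completion_ids):
    """Return (input_ids, labels) with the prompt positions masked out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Steps per epoch (rounded up) times epochs, for a single device."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs

# Made-up token IDs: a 4-token diff prompt and a 3-token haiku completion.
input_ids, labels = build_labels([11, 12, 13, 14], [21, 22, 23])
steps = optimizer_steps(num_examples=113, per_device_batch=1, grad_accum=8, epochs=3)
print(labels, steps)
# prints: [-100, -100, -100, -100, 21, 22, 23] 45
```

With 113 pairs and an effective batch of 8, each epoch covers 14 full batches plus one partial batch, which is where the "about 45" figure comes from under this assumption.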
Limitations
- Form regression: v0.1 produces fewer valid 5-7-5 haikus than the base model. The loop can steer the teacher to valid form at inference, but a single SFT pass on 113 pairs did not install that capability in the student.
- Counter caveats: the harness syllable counter has known limits on snake_case identifiers and unknown acronyms (it does not consult the acronym table inside a snake_case split; see the project's open issues). Identifier-heavy lines have syllable counts that are non-trivial to verify by ear.
- Scope: research and educational use only. Not for production code review.
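The counter caveat can be made concrete with a toy example. The code below is not the harness's syllable counter: the vowel-group heuristic, the acronym table, and every function name are hypothetical stand-ins. It only shows how skipping the acronym lookup inside a snake_case split can change a line's syllable count.

```python
import re

# Hypothetical acronym table: syllables when the letters are spelled out.
ACRONYMS = {"url": 3, "api": 3, "id": 2}

def heuristic_syllables(word):
    """Rough vowel-group count; a stand-in, not the harness's counter."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1  # treat a trailing 'e' as silent
    return max(n, 1)

def count_token(token, acronyms_in_identifiers):
    """Count one token, optionally consulting ACRONYMS inside snake_case parts."""
    token = token.lower()
    if "_" in token:
        parts = [p for p in token.split("_") if p]
        if acronyms_in_identifiers:
            return sum(ACRONYMS.get(p, heuristic_syllables(p)) for p in parts)
        return sum(heuristic_syllables(p) for p in parts)
    return ACRONYMS.get(token, heuristic_syllables(token))

def count_line(line, acronyms_in_identifiers):
    tokens = re.findall(r"[A-Za-z_]+", line)
    return sum(count_token(t, acronyms_in_identifiers) for t in tokens)

line = "parse_url returns None"
naive = count_line(line, acronyms_in_identifiers=False)
aware = count_line(line, acronyms_in_identifiers=True)
print(naive, aware)
# prints: 5 7
```

In this toy setup the same line passes as a 5-syllable opener under the naive count but reads as 7 once "url" is spelled out, which is the kind of disagreement that makes identifier-heavy lines hard to verify.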
Links
- Code and eval harness: https://github.com/shahfazal/lgtm-575
- Blog series: "PR Reviews in Haiku, and the Eval That Mattered More" (link when published)
License
- LoRA adapter weights in this repo: MIT.
- Combined model (base Gemma 4 E4B plus this adapter): subject to the Gemma Terms of Use, see google/gemma-4-E4B-it. MIT applies to the adapter, not to the base weights.