lgtm-575-gemma4-e4b-v0.1

License: MIT

v0.1: research artifact, not a production model.

This is the first iteration in a four-part methodology series. Form (5-7-5 syllable compliance) did not improve under SFT: it regressed from 14% to 8%, a change that is not statistically significant (McNemar's exact test, p=0.263, n=100, n_discordant=20). Content metrics (relevance, category) improved substantially and cleared their targets. v0.1 is published as a reproducibility anchor for the lgtm-575 blog series, not as a tool for production code review.

The primary deliverable of this project is the eval harness, not these weights. v1.0 (reasoning-token architecture) is in progress.

A LoRA adapter on Gemma 4 E4B that distills a PR code-review comment into a 5-7-5 haiku. The model is the artifact the evaluation harness was built to judge.

What this is

The student sees only a code diff and writes a haiku review from scratch (autonomous reviewer: it finds and phrases the issue itself). The human reviewer comment is never shown to the model; it was used only to build the teacher's distillation target and to score the output. That framing makes relevance a stiff test, by design.

Status

  • v0.1: shipped as a documented baseline. LoRA SFT on Gemma 4 E4B. Form did not improve (it regressed, and the change is not statistically significant). Content metrics (relevance, category) improved and cleared their targets.
  • v1.0: in progress (reasoning-token architecture). Not yet released.

Evaluation

Held-out set of 100 examples (Python subset of ronantakizawa/codereview-bench), scored by the canonical harness (scripts/score.py). The base and the fine-tuned model are run with the same prompt and greedy decoding (bench-drop discipline: same code, same set, both sides).

Metric                           Base floor     v0.1          Golden number           Verdict
#1 Form (valid 5-7-5)            14% (14/100)   8% (8/100)    >= 45%                  Miss
#3 Relevance (real-pair mean)    0.321          0.394         >= 0.37                 Hit
#2 Category (4-class macro-F1)   0.319          0.458         hold gap (>= +0.139)    Hit

Details:

  • Form regressed from 14% to 8%. Paired McNemar exact test: p = 0.263 (n_discordant = 20; 13 rows lost validity, 7 gained). The change is not statistically significant; pure SFT on 113 pairs did not transfer the generate-check-revise loop's form skill into the weights.
  • Relevance (metric #3): real-pair mean 0.321 -> 0.394, clearing the 0.37 golden number. The gap over the shuffle floor held and widened (+0.192 -> +0.287; shuffle floor 0.129 -> 0.107).
  • Category (metric #2): 4-class macro-F1 0.319 -> 0.458, majority baseline 0.179. The gap over baseline more than doubled (+0.139 -> +0.279), clearing the hold-the-gap commitment.
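
The form result above can be rechecked from the discordant counts alone. The following is a minimal sketch, not the harness's own code: it assumes the reported p-value is the two-sided exact (binomial) form of McNemar's test, and the function name `mcnemar_exact` is illustrative.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant row is equally likely to
    fall on either side, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Counts reported in the Details list: 13 rows lost validity, 7 gained.
lost, gained = 13, 7
p = mcnemar_exact(lost, gained)

# The base model had 14 valid haikus; the net change reproduces v0.1's count.
valid_after = 14 - lost + gained

print(round(p, 3))   # 0.263
print(valid_after)   # 8
```

Only the 20 discordant rows enter the test; the 80 rows where both models agree (both valid or both invalid) carry no information about the direction of change.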

Golden numbers were committed before any SFT run, so the targets could not be retrofitted to the results. The form target was the ambitious one (~3.2x the floor); the miss there is reported honestly and is on-thesis for a project whose argument is that the eval mattered more than the model.
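
The verdicts and gaps quoted in this section follow from simple arithmetic on the reported numbers. The sketch below is a reader-side recomputation, not part of scripts/score.py; the dictionary layout and variable names are assumptions made for illustration.

```python
# Values copied from the evaluation table and the Details list.
base = {"form": 0.14, "relevance": 0.321, "category_f1": 0.319}
v01 = {"form": 0.08, "relevance": 0.394, "category_f1": 0.458}
shuffle_floor = {"base": 0.129, "v01": 0.107}
majority_baseline = 0.179

# Golden numbers as stated in the table.
form_target = 0.45
relevance_target = 0.37
category_gap_target = 0.139

# Gaps over the respective floors.
relevance_gap_base = base["relevance"] - shuffle_floor["base"]
relevance_gap_v01 = v01["relevance"] - shuffle_floor["v01"]
category_gap_v01 = v01["category_f1"] - majority_baseline

verdicts = {
    "form": v01["form"] >= form_target,
    "relevance": v01["relevance"] >= relevance_target,
    "category": category_gap_v01 >= category_gap_target,
}

# How ambitious the form target was relative to the base floor.
form_ratio = form_target / base["form"]

print(verdicts)
print(round(relevance_gap_base, 3), round(relevance_gap_v01, 3))
print(round(category_gap_v01, 3), round(form_ratio, 1))
```

With these inputs the form check fails while the relevance and category checks pass, matching the Miss/Hit/Hit column in the table.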

Training

  • Base: google/gemma-4-E4B-it
  • Method: LoRA SFT (r=16, lora_alpha=32, lora_dropout=0.05, target_modules="all-linear", CAUSAL_LM), completion-only loss with the prompt tokens masked.
  • Data: 113 (diff, haiku) pairs. Haiku targets generated by the teacher Qwen/Qwen3-30B-A3B-Instruct-2507 via the v2 teacher loop (form check plus a relevance gate).
  • Optimization: effective batch 8 (per_device_train_batch_size=1 x gradient_accumulation_steps=8), learning_rate=2e-4, bf16, 3 epochs (about 45 optimizer steps).
  • Hardware: A100-80GB on Modal.
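
The step count and the completion-only loss in the list above can be illustrated with a few lines of plain Python. This is a sketch under assumptions: it rounds the last partial batch of each epoch up to a full optimizer step (rounding down would give 42), it uses the standard LoRA scaling of lora_alpha / r, and it uses -100 as the ignored label value, which is the default ignore index of PyTorch's cross-entropy loss. The helper `mask_prompt` and the token ids are hypothetical, not taken from the training script.

```python
import math

# Values stated in the Training list.
num_pairs = 113
per_device_batch = 1
grad_accum = 8
epochs = 3
r, lora_alpha = 16, 32

effective_batch = per_device_batch * grad_accum
# Assumption: a trailing partial batch still triggers an optimizer step.
steps_per_epoch = math.ceil(num_pairs / effective_batch)
total_steps = steps_per_epoch * epochs
# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scale = lora_alpha / r

IGNORE_INDEX = -100  # label value skipped by the loss

def mask_prompt(input_ids, prompt_len):
    """Return labels in which the prompt positions do not contribute to the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

# Hypothetical token ids: 4 prompt (diff) tokens followed by 3 haiku tokens.
labels = mask_prompt([11, 12, 13, 14, 21, 22, 23], prompt_len=4)

print(effective_batch, steps_per_epoch, total_steps, lora_scale)  # 8 15 45 2.0
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```

At roughly 45 updates, the adapter sees each of the 113 pairs only three times.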

Limitations

  • Form regression: v0.1 produces fewer valid 5-7-5 haikus than the base model. The loop can steer the teacher to valid form at inference, but a single SFT pass on 113 pairs did not install that capability in the student.
  • Counter caveats: the harness syllable counter has known limits on snake_case identifiers and unknown acronyms (it does not consult the acronym table inside a snake_case split; see the project's open issues). Identifier-heavy lines have syllable counts that are non-trivial to verify by ear.
  • Scope: research and educational use only. Not for production code review.
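
To make the counter caveat concrete, here is a toy illustration of how an acronym can be miscounted once it sits inside a snake_case identifier. It is not the harness's syllable counter: the vowel-run heuristic, the `ACRONYMS` table, and the `acronyms_in_split` flag are all hypothetical, chosen only to show the shape of the problem.

```python
import re

# Hypothetical acronym table: spelled out letter by letter when read aloud.
ACRONYMS = {"api": 3, "url": 3, "sql": 3}

def vowel_groups(word: str) -> int:
    """Naive syllable estimate: count runs of vowels, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_token(token: str, acronyms_in_split: bool) -> int:
    """Estimate syllables for one token, splitting snake_case on underscores."""
    token = token.lower()
    if token in ACRONYMS:  # the whole-token lookup always happens
        return ACRONYMS[token]
    total = 0
    for part in (p for p in token.split("_") if p):
        if acronyms_in_split and part in ACRONYMS:
            total += ACRONYMS[part]
        else:
            total += vowel_groups(part)
    return total

print(count_token("api", acronyms_in_split=False))      # 3
print(count_token("api_key", acronyms_in_split=False))  # 3
print(count_token("api_key", acronyms_in_split=True))   # 4
```

Read aloud, "A-P-I key" has four syllables, so a counter that skips the table inside the split undercounts by one here. On a 5-7-5 check, a single miscount is enough to flip a line's validity.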

License

  • LoRA adapter weights in this repo: MIT.
  • Combined model (base Gemma 4 E4B plus this adapter): subject to the Gemma Terms of Use (see google/gemma-4-E4B-it). MIT applies to the adapter, not to the base weights.