Anyone successfully reproduced this model with Jackrong's GitHub notebook? I'm getting results below baseline and wondering if it's just me.

#26

by sunboy - opened 19 days ago

The shared notebook (Jackrong's LLM Fine-tuning Guide) has been incredibly helpful for learning how to post-train an LLM for improved coding performance. I downloaded Jackrong's trained/reference model and confirmed it does outperform the baseline (Qwen3.5-27B).

However, when I followed the notebook (Qwopus3.5 27B SFT Google Colab) to train my own model, the results came in below baseline — so I'm wondering if anyone else has experienced the same issue.

Below is a comparison between the baseline, the model I trained using Jackrong's notebook, and Jackrong's published model.

My setup was nearly identical to the notebook, with one exception to avoid OOM: I used PER_DEV_BS=4, GRAD_ACCUM=9 instead of PER_DEV_BS=6, GRAD_ACCUM=6. My understanding is that this should only affect training speed (since the effective batch size remains the same) without significantly impacting model quality.
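For anyone checking the arithmetic, here is a minimal sketch of the effective batch size calculation. The helper name `effective_batch_size` and the single-GPU default (`num_devices=1`) are my own assumptions for illustration and are not taken from the notebook; only the `PER_DEV_BS` and `GRAD_ACCUM` values come from the post above.

```python
# Hypothetical helper (not from the notebook): samples contributing to one optimizer step.
def effective_batch_size(per_device_bs: int, grad_accum: int, num_devices: int = 1) -> int:
    # num_devices=1 is an assumption (a single Colab GPU).
    return per_device_bs * grad_accum * num_devices

# Values quoted in the post.
notebook_ebs = effective_batch_size(per_device_bs=6, grad_accum=6)  # notebook default
my_ebs = effective_batch_size(per_device_bs=4, grad_accum=9)        # OOM-safe variant

print(notebook_ebs, my_ebs)  # 36 36
```

Both configurations process 36 samples per optimizer step under that assumption, so the number of optimizer steps per epoch should match. This only confirms the batch arithmetic; it does not rule out other differences between the two runs.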

sunboy - 15 days ago (edited 15 days ago)

Hey HF community! I tweaked a few parameters in the notebook and managed to squeeze out a small improvement on HumanEval+, while matching the original on MBPP+.

Huge shoutout to @Jackrong for sharing the notebook — couldn't have done any of this without it. Sharing my setup and results here in case it's helpful, and would love to hear what others have tried!

Here are the settings and the rationale behind them:

Here are the results:

The other benchmarks (BFCL, HumanEval, MBPP) came in slightly below the reference numbers. Part of the reason may be that these benchmarks are saturated.
