Anyone successfully reproduced this model with Jackrong's GitHub notebook? I'm getting results below baseline and wondering if it's just me.

#26
by sunboy - opened

The shared notebook (Jackrong's LLM Fine-tuning Guide) has been incredibly helpful for learning how to post-train an LLM for improved coding performance. I downloaded Jackrong's trained/reference model and confirmed it does outperform the baseline (Qwen3.5-27B).

However, when I followed the notebook (Qwopus3.5 27B SFT Google Colab) to train my own model, the results came in below baseline, so I'm wondering if anyone else has experienced the same issue.

Below is a comparison between the baseline, the model I trained using Jackrong's notebook, and Jackrong's published model.

[Image: compare]

My setup was nearly identical to the notebook, with one exception to avoid OOM: I used PER_DEV_BS=4, GRAD_ACCUM=9 instead of PER_DEV_BS=6, GRAD_ACCUM=6. My understanding is that this should only affect training speed (since the effective batch size remains the same) without significantly impacting model quality.
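To make the batch-size reasoning above concrete, here is a minimal sketch of the arithmetic. The function name `effective_batch_size` and the `num_devices` parameter are my own for illustration (they are not from the notebook), and I'm assuming a single GPU since the post doesn't say otherwise.

```python
def effective_batch_size(per_device_bs: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step.

    num_devices is an assumption (single GPU by default); adjust it
    if training is actually sharded across several devices.
    """
    return per_device_bs * grad_accum * num_devices


# Values from the notebook vs. the OOM-safe values used in this post.
notebook_bs = effective_batch_size(per_device_bs=6, grad_accum=6)
my_bs = effective_batch_size(per_device_bs=4, grad_accum=9)

print(notebook_bs, my_bs)  # 36 36
```

Both settings give 36 samples per optimizer step, so the batch size itself is probably not the cause. That still leaves other possible sources of difference, such as the random seed, data ordering, or how the loss is normalized across micro-batches.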

Hey HF community! I tweaked a few parameters in the notebook and managed to squeeze out a small improvement on HumanEval+, while matching the original on MBPP+.

Huge shoutout to @Jackrong for sharing the notebook. I couldn't have done any of this without it. Sharing my setup and results here in case it's helpful, and would love to hear what others have tried!

Here are the settings and the rationale behind them:

[Image: qwen_v2_settings]

Here are the results:

[Image: qwen_v2_eval_numbers]

[Image: qwen_v2_eval_plots]

On the other benchmarks (BFCL, HumanEval, MBPP), performance is slightly worse than the reference. Part of the reason may be that these benchmarks are saturated.
