srv-sngh committed on
Commit 00ca9ca · verified · 1 Parent(s): 8d1b2e7

Card: final complete benchmark table (base vs coder vs agentic, HE/HE+/MBPP/MBPP+/GSM8K, off+on)

Files changed (1)
  1. README.md +4 -5
README.md CHANGED
@@ -63,11 +63,10 @@ GSM8K = 150‑problem subset, 8‑shot. **Not** EvalPlus‑leaderboard‑compara
  |---|---|---|---|---|---|---|
  | Google gemma-4-12B-it (base) | off | 57.3 | 56.7 | 42.1 | 37.6 | 95.3 |
  | Google gemma-4-12B-it (base) | on | 48.8 | 48.8 | 49.5 | 43.9 | 90.7 |
- | coder v1 | off | 81.7 | | 79.4 | | 90.7 |
- | **agentic v2** ⟵ this model | off | 83.5 | 81.7 | | | |
- | **agentic v2** ⟵ this model | on | 86.0 | 82.9 | | | |
-
- > ⚠️ **Partial — benchmark run still in progress.** Empty cells (—) are filling in; this card updates as the full sweep completes.
+ | coder v1 | off | 81.7 | 78.0 | 79.4 | 68.3 | 90.7 |
+ | coder v1 | on | 80.5 | 76.2 | 80.4 | 68.8 | 90.0 |
+ | **agentic v2** ⟵ this model | off | 83.5 | 81.7 | 84.1 | 74.1 | 90.7 |
+ | **agentic v2** ⟵ this model | on | 86.0 | 82.9 | 83.6 | 73.0 | 91.3 |
 
  **Takeaways:** the code/agentic fine‑tunes massively out‑code the Google base on HumanEval/MBPP, while
  the base is stronger at math (GSM8K). Reasoning‑on helps the fine‑tunes but tends to *hurt* the base's
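
The reasoning on/off comparison in the Takeaways can be checked directly against the table rows in this hunk. The following is a minimal Python sketch, not part of the commit. It assumes the column order HE, HE+, MBPP, MBPP+, GSM8K taken from the commit message (the table header lies outside the hunk), and the names `BENCHMARKS`, `SCORES`, and `reasoning_deltas` are hypothetical helpers introduced only for illustration.

```python
# Scores copied from the context and added (+) rows of the diff.
# ASSUMPTION: column order follows the commit message
# (HE, HE+, MBPP, MBPP+, GSM8K); the header row is not shown in the hunk.
BENCHMARKS = ("HE", "HE+", "MBPP", "MBPP+", "GSM8K")

SCORES = {
    "base": {
        "off": (57.3, 56.7, 42.1, 37.6, 95.3),
        "on": (48.8, 48.8, 49.5, 43.9, 90.7),
    },
    "coder v1": {
        "off": (81.7, 78.0, 79.4, 68.3, 90.7),
        "on": (80.5, 76.2, 80.4, 68.8, 90.0),
    },
    "agentic v2": {
        "off": (83.5, 81.7, 84.1, 74.1, 90.7),
        "on": (86.0, 82.9, 83.6, 73.0, 91.3),
    },
}


def reasoning_deltas(scores):
    """Return {model: {benchmark: on_score - off_score}}, rounded to 0.1."""
    return {
        model: {
            name: round(on - off, 1)
            for name, off, on in zip(BENCHMARKS, runs["off"], runs["on"])
        }
        for model, runs in scores.items()
    }


if __name__ == "__main__":
    # Print one line per model with the per-benchmark change.
    for model, deltas in reasoning_deltas(SCORES).items():
        print(model, deltas)
```

Positive values mean the reasoning-on run scored higher on that benchmark. For example, the agentic v2 HE entry is 86.0 - 83.5 = 2.5. In these rows the sign of the delta differs from benchmark to benchmark within each model, so the per-benchmark view is worth reading alongside the summary sentence.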