real benchmarks

#58
by anonymousmaharaj - opened


I ran some real benchmarks on this model - not many, but better than nothing. It's pretty disappointing. Just pure, unadulterated hype over absolutely nothing.

Thanks for putting the time into this. Nine benchmarks is real work, and honestly it's one of the more useful breakdowns anyone's sent me. It gave me a lot to think about, so genuinely, thank you.

You're right on the core point: the edge is the training domain, not "agentic" in general. The telecom gain is real and that's what this version was scoped and named for, but it clearly doesn't transfer to other agentic domains the way the label might imply. Fair call.

On AIME / BigCodeBench / the knowledge benches: those were never the target. It's a narrow agentic finetune, and one-shot math/code plus general knowledge regressing is the expected cost of that recipe: narrow SFT with no knowledge replay erodes the base distribution. Your numbers show it cleanly, and that one's on me. The single-shot weakness in particular is exactly right.

That's precisely what v3 is built to fix: per-axis ≥ base instead of winning one domain, knowledge replay to stop the forgetting, and self-gated thinking for the single-shot tasks this version over-triggered tools on. When it's out I'll post the same per-axis table, same harness, base vs finetune, so you can re-run it and call it straight again. This kind of testing is more useful than any praise. Appreciate it.
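For readers unfamiliar with the term, "knowledge replay" usually means mixing some general-purpose examples back into the narrow finetuning set so the model keeps seeing the base distribution. The sketch below is only an illustration of that idea, not the v3 recipe: the function name `mix_with_replay`, the `replay_fraction` parameter, and its default of 0.2 are all assumptions made up for this example.

```python
import random


def mix_with_replay(task_examples, replay_examples, replay_fraction=0.2, seed=0):
    """Return a shuffled training mix where roughly `replay_fraction` of the
    examples come from a general-purpose replay pool.

    Hypothetical helper: the name, the parameters, and the default fraction
    are illustrative assumptions, not anything stated in this thread.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1 - replay_fraction))
    # Never ask for more replay examples than the pool contains.
    n_replay = min(n_replay, len(replay_examples))
    mixed = list(task_examples) + rng.sample(list(replay_examples), n_replay)
    rng.shuffle(mixed)
    return mixed
```

With 8 task examples and `replay_fraction=0.2`, the mix would contain 2 replay examples, for 10 in total. How large the fraction should be in practice is an empirical question this sketch does not answer.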

Appreciate you engaging with it instead of getting defensive; that's rare. Your v3 read is exactly right: knowledge replay for the forgetting, and self-gating the thinking so it stops firing tools on one-shot tasks. I saw that over-triggering in the traces: it kept acting where a direct answer would have scored.

Send v3 whenever it's ready and I'll run the identical harness the same day. Base vs finetune, per-axis, posted straight. I'll share the configs upfront so you can reproduce on your end too. If v3 lands per-axis >= base, that's a strong result and I'll say so just as loudly as I said this.
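A minimal sketch of what the "per-axis >= base" check could look like once both runs exist. The function name `per_axis_report` and the axis names and scores in the usage example are placeholders invented for illustration, not results from this thread; it also assumes higher scores are better on every axis.

```python
def per_axis_report(base, finetune):
    """Compare two {axis: score} dicts from the same harness.

    Hypothetical helper: assumes higher is better on every axis and only
    compares axes present in both runs.
    Returns (rows, regressions), where each row is
    (axis, base_score, finetune_score, delta).
    """
    shared = sorted(set(base) & set(finetune))
    rows = [(axis, base[axis], finetune[axis], finetune[axis] - base[axis])
            for axis in shared]
    # An axis regresses when the finetune scores strictly below the base.
    regressions = [axis for axis, b, f, _ in rows if f < b]
    return rows, regressions


# Placeholder scores, NOT real benchmark numbers.
base_scores = {"axis_a": 0.50, "axis_b": 0.40}
finetune_scores = {"axis_a": 0.55, "axis_b": 0.35}

rows, regressions = per_axis_report(base_scores, finetune_scores)
for axis, b, f, delta in rows:
    print(f"{axis}: base={b:.2f} finetune={f:.2f} delta={delta:+.2f}")
print("per-axis >= base:", not regressions)
```

The claim holds only when `regressions` comes back empty, so a single losing axis is enough to fail it regardless of how large the wins are elsewhere.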
