Benchmark Supra reasoning: accuracy + cost + hallucination at 50M scale

#1
by vigneshwar234 - opened

Hi SupraLabs team 👋

A 50M-parameter model that "thinks" is a fascinating research direction! For teams evaluating whether small reasoning models can replace large ones on specific tasks, the key metrics are:

→ 🧠 Reasoning Quality: does the structured reasoning actually improve outputs? Scored 1-10.
→ 🎯 Accuracy: absolute accuracy vs larger models on the same tasks
→ 💰 Cost per 1K tokens: small model cost advantage, quantified
→ ⚡ Latency p95: small models should be faster, this confirms by how much
→ 🔍 Hallucination Rate: do smaller reasoning models hallucinate less or more?
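As a rough illustration of how the four countable metrics above could be aggregated from per-example results, here is a minimal Python sketch. It is not the framework's actual API: `EvalRecord`, `percentile_nearest_rank`, and `summarize` are hypothetical names, the per-example `correct` and `hallucinated` flags are assumed to come from some upstream grader, and p95 uses the nearest-rank definition. Reasoning Quality is left out because it needs a scoring rubric rather than a count.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated example (hypothetical schema, not from the framework)."""
    correct: bool        # answer matched the reference
    hallucinated: bool   # grader flagged unsupported content
    latency_ms: float    # end-to-end response time
    tokens: int          # tokens billed for this example
    cost_usd: float      # cost billed for this example


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(records):
    """Aggregate accuracy, hallucination rate, p95 latency, and cost per 1K tokens."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "latency_p95_ms": percentile_nearest_rank(
            [r.latency_ms for r in records], 95
        ),
        "cost_per_1k_tokens_usd": 1000 * total_cost / total_tokens,
    }
```

Running `summarize` once per model on the same task set would give directly comparable rows for a small model and a larger baseline.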

I built an open source LLM Evaluation Framework that measures all five in one run.

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source. Would love to benchmark Supra 50M in this framework!

SupraLabs org

Hey, thanks for the evaluation pipeline. We already use standardized tools, but we will definitely check out yours.

We do focus on all of these. We know the model may not perform close to larger LLMs, but we are working on making it hallucinate less and give an overall "polished" answer.

SupraLabs org • edited 4 days ago

Your benchmark fails on edge cases, btw (ignore the "thing" typo).

[Image: Screenshot 2026-06-08 113024]

SupraLabs org

Thanks for letting us know!
We will check with LH or axion and let you know.

I handle the datasets here 😁
