Benchmark Supra reasoning: accuracy + cost + hallucination at 50M scale
Hi SupraLabs team,
A 50M-parameter model that "thinks" is a fascinating research direction! For teams evaluating whether small reasoning models can replace large ones for specific tasks, the key metrics are:
- Reasoning Quality: does the structured reasoning actually improve outputs? Scored 1-10.
- Accuracy: accuracy compared with larger models on the same tasks
- Cost per 1K tokens: the small-model cost advantage, quantified
- Latency p95: small models should be faster, and this measures by how much
- Hallucination Rate: do smaller reasoning models hallucinate less or more?
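For anyone who wants to see what these five numbers mean in practice, here is a minimal sketch of how they could be aggregated from per-response records. It is not the API of the framework linked in this post; the `EvalRecord` schema, the `summarize` helper, and the nearest-rank percentile are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRecord:
    """One scored model response (hypothetical schema, for illustration only)."""
    correct: bool          # answer matched the reference
    hallucinated: bool     # a judge flagged unsupported content
    reasoning_score: int   # 1-10 rating of the reasoning trace
    latency_s: float       # wall-clock seconds for the response
    tokens: int            # total tokens billed for the response
    cost_usd: float        # total cost of the response in USD


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate the five metrics over a non-empty list of records."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "reasoning_quality": mean(r.reasoning_score for r in records),
        "accuracy": sum(r.correct for r in records) / n,
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in records]),
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
    }
```

Running `summarize` once per model on the same task set gives directly comparable rows, which is the small-versus-large comparison described above.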
I built an open-source LLM Evaluation Framework that measures all five in one run.
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
I would love to benchmark Supra 50M in this framework!
Hey, thanks for the evaluation pipeline. We already use standardized tools, but we will definitely check out yours.
We do focus on all of these. We know the model may not perform close to larger LLMs, but we are working on making it hallucinate less and give an overall "polished" answer.
Thanks for letting us know!
We will check with LH or axion and let you know.
I handle the datasets here.
