Benchmark Supra reasoning: accuracy + cost + hallucination at 50M scale

by vigneshwar234 - opened 4 days ago

Hi SupraLabs team 👋

A 50M-parameter model that "thinks" is a fascinating research direction! For teams evaluating whether small reasoning models can replace large ones on specific tasks, the key metrics are:

→ 🧠 Reasoning Quality — does the structured reasoning actually improve outputs? Scored 1-10.
→ 🎯 Accuracy — absolute accuracy vs larger models on the same tasks
→ 💰 Cost per 1K tokens — small model cost advantage, quantified
→ ⚡ Latency p95 — small models should be faster; this confirms by how much
→ 🔍 Hallucination Rate — do smaller reasoning models hallucinate less or more?
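
To make the five metrics concrete, here is a minimal sketch of how they might be aggregated from per-example results. It is not taken from the framework linked in this post. The `EvalRecord` fields, the `summarize` and `p95` names, and the nearest-rank percentile are all assumptions made for illustration. It also assumes correctness, hallucination, and the 1-10 reasoning score have already been judged upstream.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated example. Every field here is hypothetical."""
    correct: bool         # answer matched the reference
    hallucinated: bool    # answer contained an unsupported claim
    reasoning_score: int  # rubric score on a 1-10 scale
    tokens: int           # prompt + completion tokens for this call
    cost_usd: float       # cost attributed to this call
    latency_s: float      # wall-clock latency in seconds


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def summarize(records):
    """Aggregate the five metrics over a list of EvalRecord."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    total_tokens = sum(r.tokens for r in records)
    if total_tokens == 0:
        raise ValueError("records contain no tokens")
    total_cost = sum(r.cost_usd for r in records)
    return {
        "reasoning_quality": sum(r.reasoning_score for r in records) / n,
        "accuracy": sum(r.correct for r in records) / n,
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in records]),
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
    }
```

Running `summarize` on the same task set for the small model and for a larger baseline gives two dictionaries that can be compared metric by metric.
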

I built an open-source LLM Evaluation Framework that measures all five in one run.

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source. Would love to benchmark Supra 50M in this framework!

QyrouNnet-AI

SupraLabs org 4 days ago

Hey, thanks for the evaluation pipeline. We already use standardized tools, but we will definitely check out yours.

We do focus on all of these. We know the model may not perform close to larger LLMs, but we are working on making it hallucinate less and give a more "polished" answer overall.

Jamessl

SupraLabs org 4 days ago

edited 4 days ago

Your benchmark fails on edge cases, btw (ignore the "thing" typo).

QyrouNnet-AI

SupraLabs org 4 days ago

Thanks for letting us know!
We will check with LH or axion and let you know.

I handle the datasets here😁
