Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
What does the “Pass” column represent?
“Pass” counts how many benchmark tasks the model answered correctly without any follow‑up or correction.
How is the “Refine” score calculated?
If a model initially fails a task but corrects itself after a clarification, it is counted as a “Refine”.
Why are some rows marked “Refusal”?
A “Refusal” is logged when the model declines to answer a question (e.g., policy‑blocked).
What does “$ mToK” mean?
The estimated cost, in USD, per 1 million tokens for that model – a quick proxy for compute expense.
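As a quick illustration of how to use the "$ mToK" rate, the sketch below estimates the cost of a single run. Both the rate and the token count are hypothetical placeholders, not values from the table:

```python
# Hedged sketch: estimating run cost from a "$ mToK" (USD per million tokens) rate.
# Both numbers below are illustrative, not real benchmark figures.
rate_usd_per_mtok = 3.00   # hypothetical $ mToK value for some model
tokens_used = 250_000      # hypothetical tokens consumed by a benchmark run

cost_usd = rate_usd_per_mtok * tokens_used / 1_000_000
print(f"${cost_usd:.2f}")  # 250k tokens at $3/mTok -> $0.75
```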
Can I contribute my own benchmark data?
Yes! Fork the repo on GitHub, add your CSV, and submit a pull request. The site will automatically pick up the new rows.
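As a rough sketch, a contributed CSV might look like the fragment below. The column names mirror the table header at the top of this page; the model name and every value are hypothetical placeholders, not real benchmark results, so check the repo's existing files for the exact expected layout:

```
Model,TOTAL,Pass,Refine,Fail,Refusal,$ mToK,Reason,STEM,Utility,Code,Censor
example-model-v1,100,62,11,24,3,3.00,78,74,81,69,2
```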