Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
What does the “Pass” column represent?
“Pass” counts how many benchmark tasks the model answered correctly without any follow‑up or correction.
How is the “Refine” score calculated?
If a model initially fails a task but corrects itself after a clarification, it is counted as a “Refine”.
Why are some rows marked “Refusal”?
A “Refusal” is logged when the model declines to answer a question (e.g., policy‑blocked).
What does “$ mToK” mean?
The estimated cost, in USD, per 1 million tokens for that model – a quick proxy for compute expense.
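As a quick illustration of how to use the "$ mToK" rate, the sketch below estimates the cost of a single run. Both the rate and the token count are hypothetical placeholders, not values from the table:

```python
# Hedged sketch: estimating run cost from a "$ mToK" (USD per million tokens) rate.
# Both numbers below are illustrative, not real benchmark figures.
rate_usd_per_mtok = 3.00   # hypothetical $ mToK value for some model
tokens_used = 250_000      # hypothetical tokens consumed by a benchmark run

cost_usd = rate_usd_per_mtok * tokens_used / 1_000_000
print(f"${cost_usd:.2f}")  # 250k tokens at $3/mTok -> $0.75
```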
Can I contribute my own benchmark data?
Yes! Fork the repo on GitHub, add your CSV, and submit a pull request. The site will automatically pick up the new rows.
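As a rough sketch, a contributed CSV might look like the fragment below. The column names mirror the table header at the top of this page; the model name and every value are hypothetical placeholders, not real benchmark results, so check the repo's existing files for the exact expected layout:

```
Model,TOTAL,Pass,Refine,Fail,Refusal,$ mToK,Reason,STEM,Utility,Code,Censor
example-model-v1,100,62,11,24,3,3.00,78,74,81,69,2
```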