Track model quality, reliability, cost, and specialization.
Explore a responsive benchmark table with sorting, search, category filters, summary cards, visual performance aids, and a built-in FAQ. The page is a single file with no external dependencies.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
Category Leaders
What to look for
Use the table to compare tradeoffs: a model can lead in reasoning but cost more, or be cheaper while sacrificing pass rate. The most useful choice depends on your product constraints and safety posture.
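One way to make the cost versus pass-rate tradeoff concrete is to keep only the models that no other model beats on both axes at once. The sketch below assumes each model is reduced to a hypothetical `(pass_rate, cost)` pair; the function name and the sample figures are illustrative and are not taken from the table.

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return the models that are not dominated on pass rate and cost.

    `models` maps a model name to (pass_rate, cost). A model is dominated
    when another model has a pass rate at least as high and a cost at
    least as low, and is strictly better on one of the two.
    """
    front = []
    for name, (pass_rate, cost) in models.items():
        dominated = any(
            other_pass >= pass_rate
            and other_cost <= cost
            and (other_pass > pass_rate or other_cost < cost)
            for other, (other_pass, other_cost) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Illustrative values only, not real benchmark data.
sample = {"model-a": (90.0, 10.0), "model-b": (80.0, 2.0), "model-c": (75.0, 5.0)}
candidates = pareto_front(sample)
```

Here `model-c` drops out because `model-b` is both cheaper and more accurate, while `model-a` and `model-b` remain as genuine tradeoffs to weigh against your constraints.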
Frequently asked questions
Clear definitions for the benchmark metrics shown above.
What does TOTAL represent?
TOTAL is a composite score summarizing overall benchmark performance. It typically combines pass rate with the category scores (Reason, STEM, Utility, and Code) and applies penalties for failures or refusals, depending on the methodology.
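The exact formula behind TOTAL is not given here, so the following is only a minimal sketch of how such a composite could be built: a weighted average of category scores with penalties subtracted for failures and refusals. The weights, penalty factors, and function name are all assumptions chosen for illustration.

```python
# Hypothetical weights and penalties; the benchmark's real methodology
# is not stated in this document.
CATEGORY_WEIGHTS = {"reason": 0.3, "stem": 0.2, "utility": 0.2, "code": 0.3}
FAIL_PENALTY = 0.10     # assumed penalty applied to the fail rate
REFUSAL_PENALTY = 0.05  # assumed penalty applied to the refusal rate


def composite_total(
    categories: dict[str, float], fail_rate: float, refusal_rate: float
) -> float:
    """Weighted category average (0-100) minus assumed penalties.

    `fail_rate` and `refusal_rate` are fractions between 0 and 1.
    The result is clamped at 0 and rounded to two decimals.
    """
    base = sum(
        weight * categories[name] for name, weight in CATEGORY_WEIGHTS.items()
    )
    penalty = 100 * (FAIL_PENALTY * fail_rate + REFUSAL_PENALTY * refusal_rate)
    return round(max(0.0, base - penalty), 2)
```

With category scores of 80, 70, 90, and 60 for Reason, STEM, Utility, and Code, a 10% fail rate, and a 20% refusal rate, these assumed weights give a base of 74 and a penalty of 2, so the sketch returns 72.0.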
What is the difference between Pass, Refine, Fail, and Refusal?
Pass means the model solved a task successfully. Refine means it produced a partially useful answer that may need editing or follow-up. Fail means the output was incorrect or insufficient. Refusal means the model declined to answer, often due to safety policies or uncertainty.
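As a rough illustration, the four columns can be read as shares of the same task set. The sketch below assumes each task is labeled with exactly one of the four outcomes; the label strings and the function name are hypothetical.

```python
from collections import Counter

# Assumed outcome labels, one per benchmark task.
OUTCOMES = ("pass", "refine", "fail", "refusal")


def outcome_rates(results: list[str]) -> dict[str, float]:
    """Return the percentage of tasks that ended in each outcome."""
    if not results:
        raise ValueError("results must not be empty")
    unknown = set(results) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    counts = Counter(results)
    return {name: 100 * counts[name] / len(results) for name in OUTCOMES}


# Ten illustrative task results, not real benchmark data.
rates = outcome_rates(["pass"] * 6 + ["refine"] * 2 + ["fail"] + ["refusal"])
```

Because every task lands in exactly one bucket under this assumption, the four percentages always sum to 100.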
What does “$ mToK” mean?
In this table, “$ mToK” is a simplified cost metric for comparing relative pricing. Treat it as a normalized monetary cost per unit of model usage. Lower values are better when budget efficiency matters.
How should I interpret Censor?
Censor reflects how strongly a model tends toward restrictive or guarded behavior in edge cases. A higher value may indicate stricter filtering or more frequent conservative refusals. Whether that is good or bad depends on your application and compliance needs.
Can I sort and filter the table?
Yes. You can search by model name, use filters for reasoning, coding, and cost, click any table header to sort, and switch between dark and light themes for comfort.
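The page's own script is not shown here, so the following is a minimal Python sketch of the same search, filter, and sort logic rather than the actual implementation. The `Row` fields, the `query` function, and its parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    total: float
    reason: float
    code: float
    cost: float  # stands in for the "$ mToK" column


def query(
    rows: list[Row],
    search: str = "",
    min_reason: float = 0.0,
    min_code: float = 0.0,
    max_cost: float = float("inf"),
    sort_by: str = "total",
    descending: bool = True,
) -> list[Row]:
    """Filter rows by name and thresholds, then sort by one column."""
    needle = search.strip().lower()
    matches = [
        row
        for row in rows
        if needle in row.model.lower()
        and row.reason >= min_reason
        and row.code >= min_code
        and row.cost <= max_cost
    ]
    return sorted(matches, key=lambda row: getattr(row, sort_by), reverse=descending)


# Illustrative rows only, not real benchmark data.
rows = [
    Row("alpha-large", 82.0, 90.0, 75.0, 12.0),
    Row("alpha-mini", 70.0, 65.0, 72.0, 1.5),
    Row("beta-coder", 78.0, 70.0, 91.0, 4.0),
]
cheap_coders = query(rows, min_code=72.0, max_cost=5.0, sort_by="cost", descending=False)
```

Filtering before sorting keeps the sort limited to visible rows, which mirrors how clicking a header would reorder only the rows that survive the current search and filters.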