*(Benchmark leaderboard: per-model columns for TOTAL, Pass, Refine, Fail, Refusal, $/mTok, Reason, STEM, Utility, Code, and Censor)*
## Frequently Asked Questions

Everything you need to know about our LLM benchmarks.
**What metrics do you evaluate?**

We evaluate models across several dimensions, matching the leaderboard columns:

- **Pass**: rate of successful task completion
- **Refine**: refinement capability
- **Fail**: failure rate
- **Refusal**: refusal rate
- **$/mTok**: cost efficiency in dollars per million tokens
- **Reason**: reasoning ability
- **STEM**: STEM performance
- **Utility**: general utility
- **Code**: code generation quality
- **Censor**: censorship level
**How often are the benchmarks updated?**

Our benchmarks are updated weekly to reflect the latest model releases and performance improvements. We also re-test existing models when their providers ship significant updates.
**How is the TOTAL score calculated?**

The TOTAL score is a weighted average of all performance metrics, designed to give a comprehensive view of a model's overall capability. It combines success rates, specialized performance, and cost efficiency into a single balanced assessment.
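As a rough sketch of how a weighted composite like TOTAL could be computed: the metric names below mirror the leaderboard columns, but the weights and the 0-100 normalization are assumptions for illustration, not the published formula.

```python
# Illustrative weighted composite score. The weights are hypothetical;
# the leaderboard does not publish its actual weighting here.
METRIC_WEIGHTS = {
    "pass": 0.30,      # successful task completion
    "refine": 0.15,    # refinement capability
    "reason": 0.15,    # reasoning ability
    "stem": 0.10,      # STEM performance
    "utility": 0.10,   # general utility
    "code": 0.15,      # code generation quality
    "cost": 0.05,      # cost efficiency (higher = cheaper)
}

def total_score(scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each assumed to be on 0-100."""
    return sum(METRIC_WEIGHTS[name] * scores[name] for name in METRIC_WEIGHTS)

# Example: a model strong on reasoning but relatively expensive to run.
print(total_score({
    "pass": 82.0, "refine": 74.0, "reason": 91.0,
    "stem": 88.0, "utility": 79.0, "code": 85.0, "cost": 40.0,
}))  # -> a single TOTAL value on the same 0-100 scale
```

Because the weights sum to 1.0, the result stays on the same scale as the inputs, which is why a single TOTAL column can sit alongside the per-metric columns.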
**How is cost efficiency measured?**

Cost efficiency is measured in dollars per million tokens ($/mTok). This metric captures the operational cost of using each model at scale, making it easier to balance performance needs against budget constraints.
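A minimal example of turning a $/mTok rate into a dollar cost for a given workload (the token count and rate below are illustrative, and real pricing often splits input and output tokens):

```python
def run_cost(total_tokens: int, usd_per_mtok: float) -> float:
    """Dollar cost of processing `total_tokens` at a $/mTok rate."""
    return total_tokens / 1_000_000 * usd_per_mtok

# Example: 250k tokens at $3.00 per million tokens costs $0.75.
print(f"${run_cost(250_000, 3.00):.2f}")
```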
**Can I export the benchmark data?**

Yes! You can export the data by clicking the export button in the table controls. We support CSV, JSON, and Excel formats for easy integration with your analysis tools.
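If you export to CSV, a minimal loading sketch with pandas might look like this; the filename and exact column names are assumptions based on the leaderboard headers, so adjust them to match your export.

```python
import pandas as pd

# Hypothetical filename; use whatever name the export dialog saves.
df = pd.read_csv("llm_benchmarks.csv")

# Column names assumed to match the leaderboard headers.
top10 = df.sort_values("TOTAL", ascending=False).head(10)
print(top10[["Model", "TOTAL", "$/mTok"]])
```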