Models Tested: 12
Average Score: 78.4%
Test Categories: 6
Last Updated: Jan 2025

Performance Comparison

Columns: Rank, Model, TOTAL, Pass, Refine, Fail, Refusal, $ mTok, Reason, STEM, Utility, Code, Censor
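To make the column meanings concrete, here is one way a single leaderboard row could be modelled in code. This is a minimal sketch for reference only; the field names and types are our own and are not taken from the benchmark's tooling.

```python
from dataclasses import dataclass

@dataclass
class LeaderboardRow:
    # Illustrative schema; fields mirror the table columns explained in the FAQ below.
    rank: int
    model: str
    total: float          # TOTAL: weighted composite of all category scores (%)
    pass_rate: float      # Pass: tasks solved correctly on the first attempt (%)
    refine_rate: float    # Refine: additional tasks solved after feedback and a retry (%)
    fail_rate: float      # Fail: tasks never solved (%)
    refusal_rate: float   # Refusal: tasks the model declined to attempt (%)
    usd_per_mtok: float   # $ mTok: average of input and output price per million tokens
    reason: float         # Reasoning category score (%)
    stem: float           # STEM category score (%)
    utility: float        # Utility category score (%)
    code: float           # Code category score (%)
    censor: int           # Censor: content-filtering level, 1 (minimal) to 5 (heavy)
```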

Score Legend

Excellent (90-100%)
Good (75-89%)
Average (50-74%)
Poor (0-49%)
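The legend bands translate directly into a score lookup. A minimal sketch, assuming scores are on the same 0-100 scale used in the table:

```python
def score_band(score: float) -> str:
    """Map a 0-100 score to its legend band."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Average"
    return "Poor"

print(score_band(78.4))  # the current average score falls in the "Good" band
```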

Frequently Asked Questions

Everything you need to know about our benchmark methodology

What does the TOTAL score represent?
The TOTAL score is a weighted composite of all individual benchmark categories. It represents the overall capability of each model across reasoning, STEM, utility, coding, and other tasks. The weighting is designed to reflect real-world usage patterns and the importance of each capability.
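As an illustration of how a weighted composite works, here is a minimal sketch. The weights and scores below are hypothetical placeholders; the actual weighting behind TOTAL is not published in this FAQ.

```python
# Hypothetical weights over four of the categories; the real weighting is not specified here.
WEIGHTS = {"reason": 0.30, "stem": 0.25, "utility": 0.20, "code": 0.25}

def total_score(scores: dict[str, float]) -> float:
    """Weighted composite of per-category scores, all on a 0-100 scale."""
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

print(round(total_score({"reason": 82.0, "stem": 75.0, "utility": 80.0, "code": 71.0}), 1))  # 77.1
```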
How is the Pass vs Refine metric calculated?
Pass represents the percentage of tasks completed correctly on the first attempt. Refine shows the additional successes achieved when the model is given feedback and allowed to retry. This helps distinguish models that are "right the first time" from those that improve with iteration.
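A worked example of that split, as a minimal sketch; the task counts are invented for illustration:

```python
def pass_refine(first_try: int, after_retry: int, total_tasks: int) -> tuple[float, float]:
    """Return (Pass %, Refine %): first-attempt successes, and the extra
    successes gained only after feedback and a retry."""
    return 100 * first_try / total_tasks, 100 * after_retry / total_tasks

# e.g. 160 of 200 tasks solved on the first try, 20 more solved after a retry
print(pass_refine(first_try=160, after_retry=20, total_tasks=200))  # (80.0, 10.0)
```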
What does the Refusal rate indicate?
The Refusal rate shows how often a model declines to attempt a task, typically due to safety filters or content policies. A lower refusal rate isn't always better—it depends on whether refusals are appropriate. This metric helps users understand model behavior boundaries.
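Read together with Pass, Refine, and Fail, the Refusal rate is one of four outcome buckets. Assuming each task ends in exactly one bucket, the rates can be computed as in this minimal sketch with made-up counts:

```python
from collections import Counter

# Hypothetical per-task outcomes; we assume each task lands in exactly one bucket.
outcomes = ["pass"] * 160 + ["refine"] * 20 + ["fail"] * 15 + ["refusal"] * 5

counts = Counter(outcomes)
rates = {bucket: 100 * n / len(outcomes) for bucket, n in counts.items()}
print(rates)  # {'pass': 80.0, 'refine': 10.0, 'fail': 7.5, 'refusal': 2.5}
```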
How is pricing ($ mTok) calculated?
The $ mTok column shows the cost per million tokens, calculated as an average of input and output token prices. This provides a quick cost comparison, though actual costs may vary based on your specific input/output ratio and any volume discounts.
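A worked example of that average, as a minimal sketch; the prices below are invented and are not quotes for any listed model:

```python
def blended_price(input_per_mtok: float, output_per_mtok: float) -> float:
    """Average of input and output prices per million tokens, i.e. the $ mTok figure."""
    return (input_per_mtok + output_per_mtok) / 2

# e.g. $3.00 per million input tokens and $15.00 per million output tokens
print(blended_price(3.00, 15.00))  # 9.0
```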
What tasks are included in the Reasoning benchmark?
The Reasoning benchmark includes logical deduction, causal reasoning, analogical thinking, and multi-step problem solving. Tests are drawn from established datasets such as ARC and HellaSwag, plus custom adversarial challenges designed to probe genuine understanding rather than pattern matching.
How does the Censor score work?
The Censor score indicates the level of content filtering applied by the model, from 1 (minimal filtering) to 5 (heavy filtering). This is measured through standardized prompts testing various content categories. Neither high nor low is inherently better—it depends on your use case requirements.
How often are benchmarks updated?
We update our benchmarks monthly, or whenever a major model release occurs. All models are re-tested periodically to account for silent updates. Historical data is preserved so you can track model improvements over time.
Can I suggest a model to be added?
Absolutely! We're always looking to expand our coverage. Models must have public API access or be available through major cloud providers. Community suggestions are prioritized based on popularity and unique capabilities. Contact us through the feedback form.