AI Performance Comparison
Comprehensive benchmark analysis of leading Large Language Models across multiple dimensions
- Models tested: 24
- Tasks evaluated: 10K+
- Total testing time: 72h
- Average success rate: 94.2%
Performance Overview
[Chart: Top 5 Models - Overall Score]
Category Performance
| Model | Total | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
Everything you need to know about our benchmark methodology
How are the benchmarks conducted?
Our benchmarks are conducted using a standardized testing suite that evaluates models across multiple dimensions. Each model is tested with identical prompts and evaluation criteria to ensure fair comparison. Tests include reasoning tasks, STEM problems, coding challenges, and real-world utility scenarios.
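As a rough illustration, the sketch below shows what an identical-prompt evaluation loop of this kind can look like. The `Task` class, the `query_model` stub, and the exact-match grading are hypothetical simplifications for illustration, not the actual harness behind these results.

```python
# Minimal sketch of a standardized evaluation loop (illustrative only).
# `query_model` and the task definitions are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Task:
    category: str   # e.g. "reasoning", "stem", "code", "utility"
    prompt: str
    expected: str

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a provider-specific API call."""
    raise NotImplementedError

def evaluate(model_name: str, tasks: list[Task]) -> dict[str, float]:
    """Run every task against one model with identical prompts and grading."""
    passed_by_category: dict[str, list[bool]] = {}
    for task in tasks:
        answer = query_model(model_name, task.prompt)
        passed = answer.strip() == task.expected.strip()  # simplistic exact-match grader
        passed_by_category.setdefault(task.category, []).append(passed)
    # Per-category pass rate in [0, 1]
    return {
        category: sum(results) / len(results)
        for category, results in passed_by_category.items()
    }
```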
What does the $ mToK metric represent?
$ mToK represents the cost in dollars per million tokens processed. This metric helps compare the economic efficiency of different models. Lower values indicate better cost-effectiveness for large-scale applications.
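For example, the snippet below shows the arithmetic behind the metric; the token count and price are invented example numbers, and `run_cost` is just an illustrative helper, not part of our tooling.

```python
# Illustrative cost calculation using the $ mToK convention.
def run_cost(total_tokens: int, dollars_per_million_tokens: float) -> float:
    """Cost of a run given a model's $ mToK price."""
    return total_tokens / 1_000_000 * dollars_per_million_tokens

# Example: 250,000 tokens at $3.00 per million tokens costs $0.75.
print(run_cost(250_000, 3.00))  # 0.75
```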
How often is the data updated?
We update our benchmark data weekly to incorporate the latest model versions and improvements. Major releases from leading AI companies are tested within 48 hours of availability. Historical data is maintained to track performance evolution over time.
What does the Censor metric measure?
The Censor metric evaluates how often a model refuses to answer or censors its response. Higher values indicate more restrictive content policies. This helps users understand the model's boundaries and suitability for different use cases.
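A simplified sketch of how a refusal-rate style score could be tallied is shown below. The keyword-based `looks_like_refusal` check is a stand-in for illustration only, not the classifier behind the published Censor scores.

```python
# Sketch of tallying a refusal/censorship rate (illustrative only).
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm unable to")

def looks_like_refusal(response: str) -> bool:
    """Naive keyword check standing in for a real refusal classifier."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def censor_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 = never refuses)."""
    if not responses:
        return 0.0
    return sum(looks_like_refusal(r) for r in responses) / len(responses)
```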
Can I request additional model evaluations?
Yes! We welcome community suggestions for new models to evaluate. You can submit requests through our GitHub repository or contact us directly. Popular models are prioritized based on community demand and availability.