LLM Benchmark Table

Compare the performance of leading Language Models across multiple dimensions

At a glance: 24 models tested · 87% average pass rate · $2.40 average cost per million tokens ($/mToK) · last updated: live
Table columns: Model | TOTAL | Pass | Refine | Fail | Refusal | $/mToK | Reason | STEM | Utility | Code | Censor

Frequently Asked Questions

Everything you need to know about our LLM benchmarks

What metrics are used to evaluate the models?

We evaluate models across multiple dimensions: pass rate (successful task completion), refinement capability, failure rate, refusal rate, cost efficiency ($ per million tokens), reasoning ability, STEM performance, general utility, code generation quality, and censorship level.
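As a rough sketch, one row of the table could be modeled like this in Python (the class and field names here are ours for illustration, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    """One table row; every field name here is illustrative."""
    model: str
    total: float          # weighted TOTAL score (see the FAQ below)
    pass_rate: float      # successful task completion
    refine: float         # refinement capability
    fail: float           # failure rate
    refusal: float        # refusal rate
    cost_per_mtok: float  # $/mToK (dollars per million tokens)
    reason: float         # reasoning ability
    stem: float           # STEM performance
    utility: float        # general utility
    code: float           # code generation quality
    censor: float         # censorship level
```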

How often are the benchmarks updated?

Our benchmarks are updated weekly to reflect the latest model releases and performance improvements. We also re-test existing models when significant updates are released by their providers.

What does the TOTAL score represent?

The TOTAL score is a weighted average of all performance metrics, designed to give a comprehensive view of a model's overall capability. It considers success rates, specialized performance, and cost efficiency to provide a balanced assessment.
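A minimal sketch of how such a weighted average could be computed, using placeholder scores and weights (the benchmark's actual weighting is not published in this FAQ):

```python
def total_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; the weights are placeholders."""
    return sum(metrics[name] * weights[name] for name in weights) / sum(weights.values())

# Made-up scores and weights, purely for illustration:
metrics = {"pass": 0.91, "reasoning": 0.84, "stem": 0.78, "code": 0.88, "cost_efficiency": 0.70}
weights = {"pass": 0.30, "reasoning": 0.20, "stem": 0.15, "code": 0.20, "cost_efficiency": 0.15}
print(round(total_score(metrics, weights), 3))  # 0.839
```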

How is cost efficiency calculated?

Cost efficiency is measured in dollars per million tokens ($/mToK). This metric helps you understand the operational cost of using each model at scale, making it easier to balance performance needs with budget constraints.
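Projecting spend from the $/mToK figure is a simple scaling; a quick sketch (the function name below is ours for illustration):

```python
def run_cost(total_tokens: int, dollars_per_mtok: float) -> float:
    """Operational cost: token volume scaled by the $/mToK price."""
    return total_tokens / 1_000_000 * dollars_per_mtok

# 50 million tokens at the $2.40/mToK average shown above:
print(run_cost(50_000_000, 2.40))  # 120.0
```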

Can I export the benchmark data?

Yes! You can export the data by clicking the export button in the table controls. We support CSV, JSON, and Excel formats for easy integration with your analysis tools.
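Once exported, the files drop straight into a typical analysis workflow. A small sketch with pandas, assuming the export keeps the table's column names and using hypothetical filenames:

```python
import pandas as pd

# Hypothetical filenames; use whatever names the export dialog produces.
df = pd.read_csv("llm_benchmarks.csv")  # or pd.read_json("llm_benchmarks.json")

# Rank models by TOTAL score, assuming the export keeps the table's column names.
print(df.sort_values("TOTAL", ascending=False)[["Model", "TOTAL", "Pass"]].head())
```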