LLM Benchmark Table

Comprehensive AI Model Performance Comparison

Metric Legend

TOTAL: Overall Performance
PASS: Successful Tasks
REFINE: Improvement Rate
$ mToK: Cost per Million Tokens

Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor

Frequently Asked Questions

What is the TOTAL score?
The TOTAL score is the overall performance metric, computed as a weighted average of results across the reasoning, STEM, utility, coding, and censorship-resistance benchmarks.
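A weighted average of category scores can be sketched as follows; the category names, scores, and weights below are illustrative assumptions, not the table's actual weighting.

```python
# Hypothetical sketch of a weighted TOTAL score. The weights here are
# placeholders; the benchmark's real weighting is not published in this FAQ.
def total_score(scores, weights):
    """Weighted average of per-category benchmark scores."""
    assert scores.keys() == weights.keys()
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight

scores = {"reason": 82.0, "stem": 75.0, "utility": 88.0, "code": 70.0, "censor": 90.0}
weights = {"reason": 1.0, "stem": 1.0, "utility": 1.0, "code": 1.0, "censor": 1.0}
print(round(total_score(scores, weights), 1))  # equal weights reduce to a plain mean: 81.0
```

With equal weights the formula collapses to a simple mean; unequal weights let one category (say, coding) dominate the TOTAL.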
How is the $ mToK calculated?
$ mToK stands for "Dollars per Million Tokens" and represents the approximate cost to process one million tokens using each model. This includes both input and output token costs where applicable.
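One way to blend input and output token costs into a single $/mToK figure is to divide total dollars spent by total tokens processed, scaled to one million. The prices and token counts below are hypothetical examples, not actual model pricing.

```python
def cost_per_million(input_price, output_price, input_tokens, output_tokens):
    """Blended cost in dollars per million tokens.

    input_price / output_price are hypothetical $ per 1M tokens for each
    direction; the result is total spend divided by total tokens, re-scaled
    to a per-million figure.
    """
    total_tokens = input_tokens + output_tokens
    total_cost = (input_price * input_tokens + output_price * output_tokens) / 1_000_000
    return total_cost / total_tokens * 1_000_000

# Example: $3/M input, $15/M output, a workload of 750k input + 250k output tokens.
print(cost_per_million(3.0, 15.0, 750_000, 250_000))  # -> 6.0
```

Because output tokens are usually priced higher than input tokens, the blended figure depends on the assumed input/output mix, so treat the table's $ mToK as an approximation for a typical workload.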
What do the Pass/Fail/Refusal metrics mean?
Pass indicates successful task completion, Fail shows tasks the model attempted but completed incorrectly, and Refusal represents tasks the model declined to attempt due to safety or policy constraints.
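These three outcomes partition every task, so the per-model rates can be tallied directly. The outcome list below is a made-up example used only to show the bookkeeping.

```python
from collections import Counter

# Hypothetical per-task outcomes: each task ends as "pass" (completed
# correctly), "fail" (attempted but incorrect), or "refusal" (declined).
outcomes = ["pass", "pass", "fail", "refusal", "pass", "fail", "pass"]

counts = Counter(outcomes)
rates = {k: counts[k] / len(outcomes) for k in ("pass", "fail", "refusal")}
print(counts["pass"], counts["fail"], counts["refusal"])  # -> 4 2 1
```

Since the three categories are exhaustive and mutually exclusive, the rates always sum to 1, which is a useful sanity check when reading the table.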
How often is this data updated?
Benchmark results are updated weekly as new model versions are released and tested. The date of the last update is displayed in the table footer.