LLM Benchmark Table

Comprehensive AI Model Performance Comparison

Metric Legend

TOTAL: Overall Performance
PASS: Successful Tasks
REFINE: Improvement Rate
$ mToK: Cost per Million Tokens

Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor

Frequently Asked Questions

What is the TOTAL score?
The TOTAL score is the overall performance metric, computed as a weighted average of results across the reasoning, STEM, utility, coding, and censorship-resistance benchmarks.
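A weighted average of category scores can be sketched as follows; the category names, scores, and weights below are illustrative assumptions, not the table's actual weighting.

```python
# Hypothetical sketch of a weighted TOTAL score. The weights here are
# placeholders; the benchmark's real weighting is not published in this FAQ.
def total_score(scores, weights):
    """Weighted average of per-category benchmark scores."""
    assert scores.keys() == weights.keys()
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight

scores = {"reason": 82.0, "stem": 75.0, "utility": 88.0, "code": 70.0, "censor": 90.0}
weights = {"reason": 1.0, "stem": 1.0, "utility": 1.0, "code": 1.0, "censor": 1.0}
print(round(total_score(scores, weights), 1))  # equal weights reduce to a plain mean: 81.0
```

With equal weights the formula collapses to a simple mean; unequal weights let one category (say, coding) dominate the TOTAL.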
How is the $ mToK calculated?
$ mToK stands for "Dollars per Million Tokens" and represents the approximate cost to process one million tokens using each model. This includes both input and output token costs where applicable.
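One way to blend input and output token costs into a single $/mToK figure is to divide total dollars spent by total tokens processed, scaled to one million. The prices and token counts below are hypothetical examples, not actual model pricing.

```python
def cost_per_million(input_price, output_price, input_tokens, output_tokens):
    """Blended cost in dollars per million tokens.

    input_price / output_price are hypothetical $ per 1M tokens for each
    direction; the result is total spend divided by total tokens, re-scaled
    to a per-million figure.
    """
    total_tokens = input_tokens + output_tokens
    total_cost = (input_price * input_tokens + output_price * output_tokens) / 1_000_000
    return total_cost / total_tokens * 1_000_000

# Example: $3/M input, $15/M output, a workload of 750k input + 250k output tokens.
print(cost_per_million(3.0, 15.0, 750_000, 250_000))  # -> 6.0
```

Because output tokens are usually priced higher than input tokens, the blended figure depends on the assumed input/output mix, so treat the table's $ mToK as an approximation for a typical workload.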
What do the Pass/Fail/Refusal metrics mean?
Pass indicates successful task completion, Fail shows tasks the model attempted but completed incorrectly, and Refusal represents tasks the model declined to attempt due to safety or policy constraints.
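These three outcomes partition every task, so the per-model rates can be tallied directly. The outcome list below is a made-up example used only to show the bookkeeping.

```python
from collections import Counter

# Hypothetical per-task outcomes: each task ends as "pass" (completed
# correctly), "fail" (attempted but incorrect), or "refusal" (declined).
outcomes = ["pass", "pass", "fail", "refusal", "pass", "fail", "pass"]

counts = Counter(outcomes)
rates = {k: counts[k] / len(outcomes) for k in ("pass", "fail", "refusal")}
print(counts["pass"], counts["fail"], counts["refusal"])  # -> 4 2 1
```

Since the three categories are exhaustive and mutually exclusive, the rates always sum to 1, which is a useful sanity check when reading the table.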
How often is this data updated?
Benchmark results are updated weekly as new model versions are released and tested. The date of the last update is displayed in the table footer.