AI Performance Comparison

Comprehensive benchmarks for leading Large Language Models. Compare reasoning, coding ability, and cost efficiency.

Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor

Frequently Asked Questions

What is the "TOTAL" score based on?
The TOTAL score is a weighted aggregate of the STEM, Utility, Code, and Reasoning (the "Reason" column) benchmark scores. It provides a quick snapshot of overall model capability.
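As a rough illustration of what a weighted aggregate looks like, here is a minimal Python sketch. The weights are hypothetical (equal weighting is assumed because the actual weights are not stated here), and the function name `total_score` and the sample scores are made up for the example.

```python
# Hypothetical weights: the real weighting behind TOTAL is not published here,
# so equal weights are assumed purely for illustration.
WEIGHTS = {"STEM": 0.25, "Utility": 0.25, "Code": 0.25, "Reason": 0.25}


def total_score(scores, weights=WEIGHTS):
    """Return the weighted mean of the category scores.

    Dividing by the sum of the weights keeps the result on the same
    scale as the inputs even if the weights do not add up to 1.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / weight_sum


# Made-up category scores for a single model.
example = {"STEM": 80.0, "Utility": 70.0, "Code": 90.0, "Reason": 60.0}
print(total_score(example))  # 75.0
```

With unequal weights, a category such as Code could count for more of the TOTAL than the others; the normalization step stays the same.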
What does $ mToK mean?
This stands for "Dollars per Million Tokens". It represents the average cost for processing 1 million tokens (input + output). Lower is cheaper.
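One way to arrive at a single blended figure is sketched below. It assumes that input and output tokens are billed at separate per-million prices and that the blended rate is total cost divided by total tokens; the function name, prices, and token counts are all hypothetical.

```python
def dollars_per_million_tokens(input_tokens, output_tokens,
                               input_price, output_price):
    """Blended cost in dollars per 1 million tokens (input + output).

    input_price and output_price are assumed to be quoted in dollars
    per 1 million tokens, so the token-weighted average of the two
    prices is already in dollars per million.
    """
    total_tokens = input_tokens + output_tokens
    if total_tokens == 0:
        raise ValueError("no tokens processed")
    return (input_tokens * input_price
            + output_tokens * output_price) / total_tokens


# Made-up workload: 3M input tokens at $2/M and 1M output tokens at $10/M.
print(dollars_per_million_tokens(3_000_000, 1_000_000, 2.0, 10.0))  # 4.0
```

Because the result depends on the mix of input and output tokens, two models with identical price lists can show different $ mToK values if one produces longer outputs.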
How is the Censor score measured?
The Censor score rates the model's tendency to refuse prompts due to safety alignment filters. A higher score indicates more frequent refusals.
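The exact formula behind the Censor score is not given here. As one plausible reading, the sketch below treats it as the percentage of prompts that end in a refusal; the outcome labels and the function name `censor_score` are assumptions made for the example.

```python
def censor_score(outcomes):
    """Percentage of prompts whose outcome is 'refusal'.

    This definition is assumed for illustration; the real scoring
    method may weight prompts differently.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    refusals = sum(1 for outcome in outcomes if outcome == "refusal")
    return 100.0 * refusals / len(outcomes)


# Made-up results for four prompts: one refusal out of four.
print(censor_score(["pass", "refusal", "fail", "pass"]))  # 25.0
```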