Comprehensive benchmarks for leading Large Language Models. Compare reasoning, coding ability, and cost efficiency.
Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor
Frequently Asked Questions
What is the "TOTAL" score based on?
The TOTAL score is a weighted aggregate of the STEM, Utility, Code, and Reason benchmark scores. It gives a quick snapshot of overall model capability.
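As a rough illustration of a weighted aggregate, the sketch below combines four category scores into one number. The weights are not published on this page, so the equal weights in ASSUMED_WEIGHTS and the function name total_score are purely hypothetical.

```python
# Hypothetical weights: the page does not state the real weighting,
# so equal weights are assumed here for illustration only.
ASSUMED_WEIGHTS = {"STEM": 0.25, "Utility": 0.25, "Code": 0.25, "Reason": 0.25}


def total_score(scores, weights=ASSUMED_WEIGHTS):
    """Return the weighted mean of the category scores.

    The result is normalized by the sum of the weights, so the weights
    do not need to add up to exactly 1.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * w for name, w in weights.items()) / weight_sum


if __name__ == "__main__":
    example = {"STEM": 80, "Utility": 60, "Code": 70, "Reason": 90}
    print(total_score(example))
```

Changing a single weight shifts the aggregate toward that category, which is why two leaderboards can rank the same models differently.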
What does $ mToK mean?
$ mToK stands for "dollars per million tokens". It is the average cost of processing 1 million tokens, counting input and output together. Lower is cheaper.
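One way such a figure could be derived is sketched below. The page does not say how input and output prices are averaged, so the 50/50 split (input_share), the function names, and the example prices are all assumptions.

```python
def blended_price_per_million(input_price, output_price, input_share=0.5):
    """Blend separate input and output prices (USD per 1M tokens).

    input_share is the assumed fraction of tokens that are input tokens;
    the 0.5 default is a guess, not a value taken from the leaderboard.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_price * input_share + output_price * (1.0 - input_share)


def request_cost(tokens, price_per_million):
    """Cost in USD of processing `tokens` tokens at the given blended price."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Made-up prices: $2 per 1M input tokens, $10 per 1M output tokens.
    price = blended_price_per_million(2.0, 10.0)
    print(price, request_cost(250_000, price))
```

Workloads that generate much more output than input will see a higher effective price than an evenly blended figure suggests.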
How is the Censor score measured?
The Censor score rates the model's tendency to refuse prompts due to safety alignment filters. A higher score indicates more frequent refusals.
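The exact scoring method is not described here. As a minimal sketch, assuming the score is simply the percentage of test prompts that end in a refusal, it could be computed as follows; the outcome labels and the function name censor_score are hypothetical.

```python
def censor_score(outcomes):
    """Return the percentage of prompts whose outcome is "refusal".

    `outcomes` is a list of per-prompt labels such as "pass", "refine",
    "fail", or "refusal". These labels are assumed for illustration.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    refusals = sum(1 for outcome in outcomes if outcome == "refusal")
    return 100.0 * refusals / len(outcomes)


if __name__ == "__main__":
    print(censor_score(["pass", "refusal", "fail", "refusal"]))
```

Under this assumption a model that refuses nothing scores 0 and one that refuses every prompt scores 100.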