AI Performance Leaderboard
A comprehensive comparison of Large Language Models across reasoning, coding, STEM, and utility benchmarks. Analyze costs, refusal rates, and failure modes.
Summary statistics (Top Pass Rate, Highest Cost, and Models Ranked) are populated dynamically from the benchmark table below.
Benchmark Data
| Model | TOTAL | Pass | Refine | Fail | Refusal | $/mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
"mToK" stands for "Million Tokens". It represents the cost in USD to process 1 million tokens of text (input + output). Lower is generally better for cost efficiency.
How is the TOTAL score calculated?
The TOTAL score is a weighted aggregate of the Pass, Refine, Fail, and Refusal categories, with the highest weight given to first-attempt success (Pass) with no need for refinement.
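The exact weights are not published here, so the sketch below uses assumed, illustrative weights (Pass counted fully, Refine partially, Fail and Refusal not at all) purely to show the shape of such an aggregate.

```python
# Illustrative weights only; the leaderboard's actual weights are not published here.
WEIGHTS = {"pass": 1.0, "refine": 0.5, "fail": 0.0, "refusal": 0.0}

def total_score(counts: dict[str, int]) -> float:
    """Weighted aggregate over the four outcome categories, as a percentage."""
    n = sum(counts.values())
    return 100.0 * sum(WEIGHTS[k] * counts.get(k, 0) for k in WEIGHTS) / n

# Example: 70 passes, 15 refines, 10 fails, 5 refusals out of 100 tasks
print(total_score({"pass": 70, "refine": 15, "fail": 10, "refusal": 5}))  # 77.5
```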
What do the Pass, Refine, Fail, and Refusal categories mean?
Pass: the model completed the task correctly on the first attempt.
Refine: the model required feedback or a second prompt to correct its answer before succeeding.
Fail: the model did not produce a correct answer even after refinement.
Refusal: the model declined to attempt the task.
What does the Censor column measure?
This metric tracks how often the model refuses to answer a prompt (Refusal) or filters content under its safety guidelines. A higher value indicates stricter moderation or higher sensitivity.
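For illustration only, here is one plausible way such a frequency could be computed from raw counts; the inputs are assumptions, not the leaderboard's actual pipeline.

```python
def censor_rate(refused: int, filtered: int, total_prompts: int) -> float:
    """Percentage of prompts that were refused outright or safety-filtered."""
    return 100.0 * (refused + filtered) / total_prompts

# Illustrative: 6 refusals and 2 filtered responses across 200 prompts -> 4.0%
print(censor_rate(6, 2, 200))
```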