AI Performance Leaderboard

A comprehensive comparison of Large Language Models across reasoning, coding, STEM, and utility benchmarks. Analyze costs, refusal rates, and failure modes.

Benchmark Data

Model | TOTAL | Pass | Refine | Fail | Refusal | $/mToK | Reason | STEM | Utility | Code | Censor

Frequently Asked Questions

What does "mToK" stand for?

"mToK" stands for "Million Tokens". It represents the cost in USD to process 1 million tokens of text (input + output). Lower is generally better for cost efficiency.

How is the "TOTAL" score calculated?

The total score is a weighted aggregate of the Pass, Refine, Fail, and Refusal categories, giving higher weight to successful task completion (Pass) without the need for refinement.
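
The exact weights are not published here, so the sketch below assumes illustrative ones (Pass counts fully, Refine half, Fail and Refusal zero); the field names and numbers are assumptions, not the leaderboard's actual formula:

    // Hypothetical weighted aggregate; weights are assumed for illustration.
    interface Outcomes {
      pass: number;    // fraction passed on the first attempt
      refine: number;  // fraction passed after feedback
      fail: number;    // fraction failed outright
      refusal: number; // fraction refused
    }

    function totalScore(o: Outcomes): number {
      const w = { pass: 1.0, refine: 0.5, fail: 0.0, refusal: 0.0 }; // assumed weights
      return 100 * (w.pass * o.pass + w.refine * o.refine +
                    w.fail * o.fail + w.refusal * o.refusal);
    }

    // Example: 70% pass, 15% refine, 10% fail, 5% refusal -> 77.5
    console.log(totalScore({ pass: 0.7, refine: 0.15, fail: 0.1, refusal: 0.05 }));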

What is the difference between Pass and Refine?

Pass: The model completed the task correctly on the first attempt.
Refine: The model required feedback or a second prompt to correct its answer before succeeding.

What does the Censor column indicate?

This metric tracks how often the model refuses to answer a prompt (see the Refusal column) or filters content based on safety guidelines. A higher number indicates stricter moderation or greater sensitivity.
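
As a minimal sketch of how such a rate could be tallied, assuming per-prompt refusal flags (the `refused` field is an invented name, not this leaderboard's schema):

    // Hypothetical: censor rate as the percentage of prompts refused.
    interface PromptResult {
      refused: boolean; // true if the model declined or filtered the answer
    }

    function censorRate(results: PromptResult[]): number {
      if (results.length === 0) return 0;
      const refusals = results.filter(r => r.refused).length;
      return (100 * refusals) / results.length;
    }

    // Example: 3 refusals in 50 prompts -> 6 (%)
    const sample = Array.from({ length: 50 }, (_, i) => ({ refused: i < 3 }));
    console.log(censorRate(sample)); // 6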