LLM Performance Arena

Real-world benchmark comparison of leading large language models. Updated August 2025.

MODELS TRACKED: 24
AVERAGE TOTAL: 81.4
CHEAPEST: $0.15/MTok
LAST UPDATED: 2 days ago
Leaderboard columns: Model · TOTAL · Pass · Refine · Fail · Refusal · $/MTok · Reason · STEM · Utility · Code · Censor

Frequently Asked Questions

What does "Refine" mean?

The percentage of cases where the model failed on its first attempt but passed after one round of self-refinement.
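
As a concrete illustration, here is a minimal Python sketch of this metric, assuming each evaluated case carries one of four outcome labels matching the leaderboard columns (pass, refine, fail, refusal). The labels and function name are illustrative, not the arena's actual evaluation harness:

```python
from collections import Counter

def refine_rate(outcomes: list[str]) -> float:
    """Fraction of cases rescued by one round of self-refinement.

    Assumes each case is labeled "pass", "refine", "fail", or "refusal",
    where "refine" means: failed the first attempt, passed after refining.
    """
    counts = Counter(outcomes)
    return counts["refine"] / len(outcomes) if outcomes else 0.0

print(refine_rate(["pass", "refine", "fail", "refine", "refusal"]))  # 0.4
```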

What is $/MTok?

Cost per million tokens, averaged across input and output pricing. Lower is better.
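
A minimal sketch of how such a blended figure can be computed, assuming it is a plain average of a provider's input and output prices per million tokens. The prices in the example are placeholders, not real rates:

```python
def blended_cost_per_mtok(input_price: float, output_price: float) -> float:
    """Simple average of input and output cost per million tokens."""
    return (input_price + output_price) / 2

# Illustrative prices: $0.10/MTok input, $0.20/MTok output.
print(round(blended_cost_per_mtok(0.10, 0.20), 2))  # 0.15
```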

How is the TOTAL score calculated?

A weighted average of the Pass/Refine rate and the reasoning, STEM, coding, and utility benchmark scores, with a penalty applied for censorship.
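
The exact weights are not published here, so the Python sketch below uses hypothetical placeholder weights and a hypothetical penalty factor purely to illustrate the shape of the calculation:

```python
# Minimal sketch of the TOTAL formula, assuming it is a weighted average of
# the component scores minus a censorship penalty. The weights and penalty
# factor are hypothetical placeholders, not the arena's actual values.

WEIGHTS = {  # hypothetical weights; chosen to sum to 1.0
    "pass_refine": 0.40,
    "reason": 0.20,
    "stem": 0.15,
    "code": 0.15,
    "utility": 0.10,
}
CENSOR_PENALTY = 0.25  # hypothetical: fraction of the Censor score subtracted

def total_score(scores: dict[str, float], censor: float) -> float:
    """Weighted average of 0-100 benchmark scores, penalized for censorship."""
    base = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return max(0.0, base - CENSOR_PENALTY * censor)

example = {"pass_refine": 90, "reason": 85, "stem": 80, "code": 88, "utility": 82}
print(round(total_score(example, censor=10), 1))  # 83.9
```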