Real-world benchmark comparison of leading large language models. Updated August 2025.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
*TOTAL:* weighted average of the Pass/Refine rate and the reasoning, STEM, coding, and utility benchmarks; censorship is penalized.
*Refine:* percentage of cases where the model initially failed but passed after one round of self-refinement.
*$/MTok:* cost per million tokens, averaged over input and output. Lower is better.
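As an illustration of how such a composite score could be computed, here is a minimal sketch in Python. The weights and the size of the censorship penalty are hypothetical assumptions, not the benchmark's published formula; only the general shape (weighted average of the component scores, minus a penalty for censorship) comes from the description above.

```python
def total_score(pass_refine, reason, stem, code, utility, censor,
                weights=(0.3, 0.2, 0.2, 0.2, 0.1),  # assumed, not published
                censor_penalty=0.5):                # assumed, not published
    """Weighted average of the five component scores (0-100 each),
    minus a penalty proportional to the censorship rate."""
    parts = (pass_refine, reason, stem, code, utility)
    base = sum(w * s for w, s in zip(weights, parts))
    return base - censor_penalty * censor

# Example with made-up component scores and a 10% censorship rate:
score = total_score(85, 90, 88, 80, 75, censor=10)
print(round(score, 1))
```

A model that refuses or censors nothing keeps its full weighted average, while a heavily censoring model loses points in proportion to its censorship rate, which matches the note that censorship is penalized in the TOTAL column.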