Real-world benchmark comparison of leading large language models. Updated August 2025.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
*TOTAL:* weighted average of the Pass/Refine rate and the reasoning, STEM, coding, and utility benchmarks; censorship is penalized.
*Refine:* percentage of cases where the model initially failed but passed after one round of self-refinement.
*$/MTok:* cost per million tokens, averaged over input and output. Lower is better.
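As an illustration of how such a composite score could be computed, here is a minimal sketch in Python. The weights and the size of the censorship penalty are hypothetical assumptions, not the benchmark's published formula; only the general shape (weighted average of the component scores, minus a penalty for censorship) comes from the description above.

```python
def total_score(pass_refine, reason, stem, code, utility, censor,
                weights=(0.3, 0.2, 0.2, 0.2, 0.1),  # assumed, not published
                censor_penalty=0.5):                # assumed, not published
    """Weighted average of the five component scores (0-100 each),
    minus a penalty proportional to the censorship rate."""
    parts = (pass_refine, reason, stem, code, utility)
    base = sum(w * s for w, s in zip(weights, parts))
    return base - censor_penalty * censor

# Example with made-up component scores and a 10% censorship rate:
score = total_score(85, 90, 88, 80, 75, censor=10)
print(round(score, 1))
```

A model that refuses or censors nothing keeps its full weighted average, while a heavily censoring model loses points in proportion to its censorship rate, which matches the note that censorship is penalized in the TOTAL column.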