LLM Benchmark

LLM Performance Matrix

A comprehensive comparison of the latest large language models across Reasoning, Coding, STEM, and Utility.

| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |

Frequently Asked Questions

What does the "TOTAL" score represent?
The TOTAL score is a weighted aggregate of performance across all evaluated categories (Reasoning, STEM, Code, etc.), normalized to a 0-100 scale.
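A weighted aggregate of this kind can be sketched as follows. The category weights and scores below are illustrative assumptions, not the benchmark's actual values:

```python
def total_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted aggregate of per-category scores (each on a 0-100 scale),
    normalized by the total weight so the result stays in 0-100."""
    weighted_sum = sum(scores[cat] * weights[cat] for cat in scores)
    weight_sum = sum(weights[cat] for cat in scores)
    return weighted_sum / weight_sum

# Hypothetical category scores and weights for one model:
example = total_score(
    {"Reasoning": 82.0, "STEM": 75.0, "Code": 90.0, "Utility": 68.0},
    {"Reasoning": 0.3, "STEM": 0.2, "Code": 0.3, "Utility": 0.2},
)
# example == 80.2
```

Because the weighted sum is divided by the total weight, the result remains on a 0-100 scale regardless of how the weights are chosen.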
How is "$ mToK" calculated?
"$ mToK" represents the approximate cost per million tokens (input + output) for API usage. Lower is better for cost-efficiency.
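As a sketch, the cost of a single query under a per-million-token price can be computed like this (the token counts and price below are made-up examples):

```python
def query_cost(input_tokens: int, output_tokens: int, usd_per_mtok: float) -> float:
    """Approximate USD cost of one API call, given a blended
    price per million tokens (input + output combined)."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * usd_per_mtok

# Hypothetical query: 1,200 input tokens, 800 output tokens, $3.50 per mToK.
cost = query_cost(1200, 800, 3.50)
# cost == 0.007 (i.e. 0.7 cents)
```

Note that many providers price input and output tokens separately; a single blended "$ mToK" figure is a simplification for ranking cost-efficiency.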
What is the "Refusal" metric?
This measures the model's tendency to refuse benign queries. A lower score indicates a model that answers more freely but may be more prone to hallucinations or policy violations.
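A refusal metric like this is typically a refusal rate over a set of benign prompts. The sketch below uses a naive prefix-matching heuristic and made-up responses purely for illustration; a real benchmark would use more robust refusal detection:

```python
def refusal_rate(responses: list[str]) -> float:
    """Percentage of responses that begin with a common refusal phrase.
    The marker list is an illustrative assumption, not the benchmark's."""
    refusal_markers = ("i can't", "i cannot", "i'm unable", "i won't")
    refused = sum(
        1 for r in responses if r.strip().lower().startswith(refusal_markers)
    )
    return 100.0 * refused / len(responses)

# Hypothetical responses to four benign queries:
rate = refusal_rate([
    "Sure, here's how that works.",
    "I cannot help with that request.",
    "The answer is 42.",
    "I can't assist with this.",
])
# rate == 50.0
```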