LLM Performance Matrix
Comprehensive comparison of the latest Large Language Models across Reasoning, Coding, STEM, and Utility.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
What does the "TOTAL" score represent?
The TOTAL score is a weighted aggregate of performance across all evaluated categories (Reasoning, STEM, Code, and Utility), normalized to a 0-100 scale.
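One way such a weighted, normalized aggregate could be computed is sketched below. The category names and weights here are illustrative assumptions, not the leaderboard's actual weighting.

```python
# Hypothetical sketch of a weighted aggregate score.
# Category weights below are assumptions for illustration only.

def total_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores (each 0-100), normalized to 0-100."""
    weight_sum = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / weight_sum

scores = {"Reason": 80.0, "STEM": 70.0, "Code": 90.0, "Utility": 60.0}
weights = {"Reason": 0.3, "STEM": 0.2, "Code": 0.3, "Utility": 0.2}
print(round(total_score(scores, weights), 1))  # → 77.0
```

Because the result is divided by the weight sum, the weights do not need to add up to 1.0 for the score to stay on the 0-100 scale.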
How is "$ mToK" calculated?
"$ mToK" represents the approximate cost per million tokens (input + output) for API usage. Lower is better for cost-efficiency.
What is the "Refusal" metric?
This measures the model's tendency to refuse benign queries. A lower score indicates fewer refusals: the model answers more freely, but may be more prone to hallucinations or policy violations.