LLM Benchmark

LLM Performance Matrix

A comprehensive comparison of the latest large language models across Reasoning, Coding, STEM, and Utility.

| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |

Frequently Asked Questions

What does the "TOTAL" score represent?
The TOTAL score is a weighted aggregate of performance across all evaluated categories (Reasoning, STEM, Code, etc.), normalized to a 0-100 scale.
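A weighted aggregate of this kind can be sketched as follows. The category weights and scores below are illustrative assumptions, not the benchmark's actual values:

```python
def total_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted aggregate of per-category scores (each on a 0-100 scale),
    normalized by the total weight so the result stays in 0-100."""
    weighted_sum = sum(scores[cat] * weights[cat] for cat in scores)
    weight_sum = sum(weights[cat] for cat in scores)
    return weighted_sum / weight_sum

# Hypothetical category scores and weights for one model:
example = total_score(
    {"Reasoning": 82.0, "STEM": 75.0, "Code": 90.0, "Utility": 68.0},
    {"Reasoning": 0.3, "STEM": 0.2, "Code": 0.3, "Utility": 0.2},
)
# example == 80.2
```

Because the weighted sum is divided by the total weight, the result remains on a 0-100 scale regardless of how the weights are chosen.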
How is "$ mToK" calculated?
"$ mToK" represents the approximate cost per million tokens (input + output) for API usage. Lower is better for cost-efficiency.
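As a sketch, the cost of a single query under a per-million-token price can be computed like this (the token counts and price below are made-up examples):

```python
def query_cost(input_tokens: int, output_tokens: int, usd_per_mtok: float) -> float:
    """Approximate USD cost of one API call, given a blended
    price per million tokens (input + output combined)."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * usd_per_mtok

# Hypothetical query: 1,200 input tokens, 800 output tokens, $3.50 per mToK.
cost = query_cost(1200, 800, 3.50)
# cost == 0.007 (i.e. 0.7 cents)
```

Note that many providers price input and output tokens separately; a single blended "$ mToK" figure is a simplification for ranking cost-efficiency.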
What is the "Refusal" metric?
This measures the model's tendency to refuse benign queries. A lower score indicates a model that answers more freely but may be more prone to hallucinations or policy violations.
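A refusal metric like this is typically a refusal rate over a set of benign prompts. The sketch below uses a naive prefix-matching heuristic and made-up responses purely for illustration; a real benchmark would use more robust refusal detection:

```python
def refusal_rate(responses: list[str]) -> float:
    """Percentage of responses that begin with a common refusal phrase.
    The marker list is an illustrative assumption, not the benchmark's."""
    refusal_markers = ("i can't", "i cannot", "i'm unable", "i won't")
    refused = sum(
        1 for r in responses if r.strip().lower().startswith(refusal_markers)
    )
    return 100.0 * refused / len(responses)

# Hypothetical responses to four benign queries:
rate = refusal_rate([
    "Sure, here's how that works.",
    "I cannot help with that request.",
    "The answer is 42.",
    "I can't assist with this.",
])
# rate == 50.0
```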