Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
Pass: Percentage of tasks the model completed successfully on the first attempt, without assistance.
Refine: Percentage of tasks the model completed only after iterative refinement.
Fail: Percentage of tasks the model could not complete satisfactorily.
Refusal: Percentage of tasks the model declined to attempt.
The TOTAL score is a weighted average of the category outcomes: Pass counts at full weight, Refine at 70% weight, and Fail and Refusal at 0% weight. The formula also accounts for task difficulty and importance.
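As a rough sketch, the outcome weighting described above can be expressed as follows. This ignores the per-task difficulty and importance adjustments (which the FAQ does not specify), so the function name and the flat-percentage inputs are assumptions for illustration only.

```python
# Hypothetical sketch of the TOTAL outcome weighting:
# Pass counts fully, Refine at 70%, Fail and Refusal at 0%.
# Difficulty/importance weighting is not modeled here.

def total_score(pass_pct: float, refine_pct: float,
                fail_pct: float, refusal_pct: float) -> float:
    """Weighted TOTAL from per-outcome percentages (summing to 100)."""
    return pass_pct * 1.0 + refine_pct * 0.7 + fail_pct * 0.0 + refusal_pct * 0.0

# A model with 60% Pass, 20% Refine, 15% Fail, 5% Refusal
# scores roughly 74 under this weighting.
print(total_score(60, 20, 15, 5))
```

Note that under this scheme a model that refuses a task scores the same as one that fails it outright; only Pass and Refine contribute to TOTAL.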
$ mToK is the cost in USD per million tokens for using the model, covering both input and output tokens at standard API pricing rates.
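For a concrete sense of the pricing column, a per-request cost can be estimated like this. The helper name and the example token counts are illustrative assumptions; it also assumes input and output tokens share a single blended $/mToK rate, since the table lists only one price per model.

```python
# Illustrative cost estimate at a flat $/mToK rate.
# Assumes input and output tokens are priced identically,
# as the single-price column suggests.

def request_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_mtok: float) -> float:
    """USD cost of one request at a blended per-million-token price."""
    return (input_tokens + output_tokens) / 1_000_000 * usd_per_mtok

# e.g. a 3,000-token prompt with a 1,000-token reply at $5/mToK
# costs about two cents.
print(request_cost_usd(3_000, 1_000, 5.0))
```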
Benchmark data is updated weekly with new model releases and monthly with comprehensive re-evaluations. Pricing information is updated in real-time when providers announce changes.
We use a combination of standardized benchmarks including MMLU, HumanEval, GSM8K, HellaSwag, and custom evaluation tasks designed to test real-world performance across reasoning, coding, mathematics, and general knowledge domains.