Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
Pass: Percentage of tasks the model completed successfully on the first attempt, without assistance.
Refine: Percentage of tasks the model completed only after iterative refinement.
Fail: Percentage of tasks the model could not complete satisfactorily.
Refusal: Percentage of tasks the model declined to attempt.
The TOTAL score is a weighted average of the category outcomes: Pass counts at full weight, Refine at 70% weight, and Fail and Refusal at 0% weight. The formula also accounts for task difficulty and importance.
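As a rough sketch, the outcome weighting described above can be expressed as follows. This ignores the per-task difficulty and importance adjustments (which the FAQ does not specify), so the function name and the flat-percentage inputs are assumptions for illustration only.

```python
# Hypothetical sketch of the TOTAL outcome weighting:
# Pass counts fully, Refine at 70%, Fail and Refusal at 0%.
# Difficulty/importance weighting is not modeled here.

def total_score(pass_pct: float, refine_pct: float,
                fail_pct: float, refusal_pct: float) -> float:
    """Weighted TOTAL from per-outcome percentages (summing to 100)."""
    return pass_pct * 1.0 + refine_pct * 0.7 + fail_pct * 0.0 + refusal_pct * 0.0

# A model with 60% Pass, 20% Refine, 15% Fail, 5% Refusal
# scores roughly 74 under this weighting.
print(total_score(60, 20, 15, 5))
```

Note that under this scheme a model that refuses a task scores the same as one that fails it outright; only Pass and Refine contribute to TOTAL.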
$ mToK is the cost in USD per million tokens for using the model, covering both input and output tokens at standard API pricing rates.
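For a concrete sense of the pricing column, a per-request cost can be estimated like this. The helper name and the example token counts are illustrative assumptions; it also assumes input and output tokens share a single blended $/mToK rate, since the table lists only one price per model.

```python
# Illustrative cost estimate at a flat $/mToK rate.
# Assumes input and output tokens are priced identically,
# as the single-price column suggests.

def request_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_mtok: float) -> float:
    """USD cost of one request at a blended per-million-token price."""
    return (input_tokens + output_tokens) / 1_000_000 * usd_per_mtok

# e.g. a 3,000-token prompt with a 1,000-token reply at $5/mToK
# costs about two cents.
print(request_cost_usd(3_000, 1_000, 5.0))
```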
Benchmark data is updated weekly with new model releases and monthly with comprehensive re-evaluations. Pricing information is updated in real-time when providers announce changes.
We use a combination of standardized benchmarks including MMLU, HumanEval, GSM8K, HellaSwag, and custom evaluation tasks designed to test real-world performance across reasoning, coding, mathematics, and general knowledge domains.