Latest Results
Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
Find answers to common questions about our benchmarking methodology and results
**What do the metrics in the table mean?**
Each column evaluates a different aspect of LLM performance:
- **Pass**: correct responses.
- **Refine**: responses that need improvement.
- **Fail**: incorrect answers.
- **Refusal**: how often the model refuses to answer.
- **Reason**: logical reasoning.
- **STEM**: science and math capabilities.
- **Utility**: practical usefulness.
- **Code**: programming ability.
- **Censor**: refusal rate on sensitive topics.

A rough sketch of how per-response grades roll up into these columns is shown below.
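The grading pipeline itself isn't described on this page, but as an illustration (the outcomes and counts below are hypothetical), per-response grades can be aggregated into the Pass/Refine/Fail/Refusal percentages like this:

```python
from collections import Counter

# Hypothetical example: each benchmark response is graded into one of the
# outcome categories shown in the table, and the column values are simple
# percentages of the total number of test prompts.
graded_outcomes = ["pass", "pass", "refine", "fail", "pass", "refusal", "pass"]

counts = Counter(graded_outcomes)
total = len(graded_outcomes)

for outcome in ("pass", "refine", "fail", "refusal"):
    pct = 100 * counts.get(outcome, 0) / total
    print(f"{outcome:>8}: {pct:.1f}%")
```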
**How often is the benchmark data updated?**
We update our benchmark data weekly as new models are released and existing models are updated. All tests are re-run with each update to ensure consistency and accuracy across versions. Major model releases may trigger immediate testing and inclusion in our tables.
**What does the "$ mToK" column mean?**
The "$ mToK" metric is the cost per million tokens for each model. It helps compare cost-effectiveness across models, which matters most for developers and businesses planning large-scale deployments. Lower values indicate better cost efficiency.
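As a quick illustration of how to use the figure (the price and token count below are made up, not actual model pricing):

```python
# Hypothetical example: estimating workload cost from a "$ mToK"
# (cost per million tokens) value taken from the table.
cost_per_million_tokens = 2.50   # assumed "$ mToK" value
tokens_used = 4_300_000          # assumed tokens consumed by your workload

estimated_cost = (tokens_used / 1_000_000) * cost_per_million_tokens
print(f"Estimated cost: ${estimated_cost:.2f}")  # -> Estimated cost: $10.75
```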
**How is the TOTAL score calculated?**
The TOTAL score is a weighted sum of the core capability metrics, with higher weights given to reasoning, STEM knowledge, and utility. The formula is: TOTAL = (Pass * 0.3) + (Refine * 0.1) + (Reason * 0.2) + (STEM * 0.15) + (Utility * 0.15) + (Code * 0.1). The weights sum to 1, so TOTAL stays on the same scale as the individual metrics and provides a single overall performance score.
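A minimal sketch of that formula, assuming each metric is a score on a common 0-100 scale (the numbers below are illustrative only):

```python
# Weights from the TOTAL formula described above.
WEIGHTS = {
    "pass": 0.30,
    "refine": 0.10,
    "reason": 0.20,
    "stem": 0.15,
    "utility": 0.15,
    "code": 0.10,
}

# Illustrative per-metric scores for one model (0-100 scale, made up).
scores = {
    "pass": 82, "refine": 64, "reason": 75,
    "stem": 70, "utility": 78, "code": 68,
}

total = sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)
print(f"TOTAL = {total:.1f}")  # -> TOTAL = 75.0
```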
**Can I contribute to the benchmark?**
Yes! We welcome contributions from the community. You can submit new test cases, suggest improvements to our methodology, or report inconsistencies through our GitHub repository. All contributions are reviewed by our team before inclusion in the benchmark.