24 Models Benchmarked

12 Metrics Evaluated

Updated Weekly

Latest Results

Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor

Frequently Asked Questions

Find answers to common questions about our benchmarking methodology and results

What do the different metrics measure?

The metrics evaluate different aspects of LLM performance:

- Pass: correct responses.
- Refine: responses that need improvement.
- Fail: incorrect answers.
- Refusal: how often the model refuses to answer.
- Reason: logical reasoning ability.
- STEM: science and math capabilities.
- Utility: practical usefulness.
- Code: programming ability.
- Censor: refusal rate on sensitive topics.
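
For readers who want to work with the leaderboard programmatically, one way to model a single results row is a small record type. The field names below simply mirror the columns and metric definitions above; they are illustrative, not an official schema.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    """One leaderboard row (illustrative field names, not the site's actual schema)."""
    model: str           # model name
    total: float         # overall weighted score (see the TOTAL question below)
    pass_rate: float     # correct responses
    refine: float        # responses needing improvement
    fail: float          # incorrect answers
    refusal: float       # refusals to answer
    usd_per_mtok: float  # "$ mToK": cost per million tokens
    reason: float        # logical reasoning
    stem: float          # science/math capabilities
    utility: float       # practical usefulness
    code: float          # programming ability
    censor: float        # refusal rate on sensitive topics
```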

How often is the benchmark data updated?

We update our benchmark data weekly as new models are released and existing models are updated. All tests are re-run with each update to ensure consistency and accuracy across versions. Major model releases may trigger immediate testing and inclusion in our tables.

What does "$ mToK" represent?

The "$ mToK" metric represents the cost per million tokens for each model. This helps compare the cost-effectiveness of different models, especially important for developers and businesses considering large-scale implementations. Lower values indicate better cost efficiency.

How is the TOTAL score calculated?

The TOTAL score is a weighted sum of six of the metrics, with higher weights given to core capabilities such as pass rate and reasoning. The formula is: TOTAL = (Pass * 0.3) + (Refine * 0.1) + (Reason * 0.2) + (STEM * 0.15) + (Utility * 0.15) + (Code * 0.1), where the weights sum to 1. Fail, Refusal, $ mToK, and Censor are reported separately and do not feed into TOTAL. This provides a single, comprehensive performance score.
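
A minimal sketch of that weighted sum in Python; the weights come from the formula above, while the dictionary keys and example scores are illustrative rather than the leaderboard's actual schema:

```python
# Weights taken from the TOTAL formula above; they sum to 1.0.
WEIGHTS = {
    "pass": 0.30,
    "refine": 0.10,
    "reason": 0.20,
    "stem": 0.15,
    "utility": 0.15,
    "code": 0.10,
}

def total_score(scores: dict[str, float]) -> float:
    """Compute TOTAL as the weighted sum of the six scored metrics."""
    return sum(weight * scores[metric] for metric, weight in WEIGHTS.items())

# Hypothetical per-metric scores on a 0-100 scale.
example = {"pass": 82, "refine": 10, "reason": 75, "stem": 70, "utility": 80, "code": 65}
print(round(total_score(example), 1))  # 69.6
```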

Can I contribute to the benchmark or suggest improvements?

Yes! We welcome contributions from the community. You can submit new test cases, suggest improvements to our methodology, or report inconsistencies through our GitHub repository. All contributions are reviewed by our team before inclusion in the benchmark.