LLM Benchmark Table

Summary stats: Models Compared · Average Score · Top Performer · Benchmarks Run

AI Model Performance Comparison

Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor, Details
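
As a rough sketch of what one row of this table carries, the shape below is inferred from the column headers; the field names and types are assumptions, not the site's actual data model.

    // Hypothetical row shape inferred from the column headers;
    // field names and types are assumptions, not the site's schema.
    interface ModelResult {
      model: string;        // model name shown in the Model column
      total: number;        // TOTAL: weighted average score
      pass: number;         // tasks passed on the first attempt
      refine: number;       // tasks passed after clarification or iteration
      fail: number;         // tasks not completed
      refusal: number;      // tasks the model declined to attempt
      costPerMTok: number;  // $ mToK: cost per million tokens (USD)
      reason: number;       // category scores
      stem: number;
      utility: number;
      code: number;
      censor: number;
      details?: string;     // link or notes behind the Details column
    }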

Frequently Asked Questions

What do the benchmark scores mean?

Each benchmark score is the percentage of tasks the model completed successfully; higher scores indicate better performance. The TOTAL score is a weighted average across all categories, with each category (Reason, STEM, Utility, Code, Censor) measuring a different capability of the model.
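
As an illustration of how a weighted average like TOTAL can be computed: the sketch below is not the site's actual scoring code, and the category weights shown are placeholders, since the real weights are not stated here.

    // Minimal sketch of a weighted-average TOTAL score.
    // The weights below are placeholders, not the real weighting.
    const weights: Record<string, number> = {
      reason: 0.3,
      stem: 0.2,
      utility: 0.2,
      code: 0.2,
      censor: 0.1,
    };

    function totalScore(scores: Record<string, number>): number {
      let weighted = 0;
      let weightSum = 0;
      for (const [category, weight] of Object.entries(weights)) {
        weighted += (scores[category] ?? 0) * weight;
        weightSum += weight;
      }
      return weighted / weightSum; // stays on the same 0-100 scale as the category scores
    }

    // Example: totalScore({ reason: 82, stem: 75, utility: 90, code: 68, censor: 95 })
    // = (82*0.3 + 75*0.2 + 90*0.2 + 68*0.2 + 95*0.1) / 1.0 = 80.7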

How is the $ mToK cost calculated?

$ mToK represents the cost per million tokens for each model. This includes both input and output tokens where applicable. Prices are based on official API rates and are updated regularly. Free models show as $0, while commercial models vary based on their pricing tiers.
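
For a concrete sense of what a per-million-token rate means in practice, here is a hedged sketch that estimates the cost of a single request from a blended rate; many providers actually price input and output tokens separately, so treat this as a simplification.

    // Rough estimate of one request's cost from a blended $/MTok rate.
    // Simplification: real APIs often charge input and output tokens at different rates.
    function requestCostUSD(
      inputTokens: number,
      outputTokens: number,
      dollarsPerMillionTokens: number,
    ): number {
      return ((inputTokens + outputTokens) / 1_000_000) * dollarsPerMillionTokens;
    }

    // 1,200 input + 800 output tokens at $3 per million tokens:
    // requestCostUSD(1200, 800, 3) === 0.006  (about 0.6 cents)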

What's the difference between Pass, Refine, Fail, and Refusal?

Pass: The model successfully completed the task on the first attempt.
Refine: The model needed clarification or iteration but eventually succeeded.
Fail: The model was unable to complete the task successfully.
Refusal: The model declined to attempt the task, usually due to content policy.
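
To make the four outcomes concrete, here is a hedged sketch (not the site's actual scoring code) of how per-task outcomes could be tallied and turned into a success rate, under the assumption that Pass and Refine both count as successes.

    // Sketch only: one possible way to tally task outcomes per model.
    type Outcome = "pass" | "refine" | "fail" | "refusal";

    function tally(outcomes: Outcome[]): Record<Outcome, number> {
      const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
      for (const o of outcomes) counts[o] += 1;
      return counts;
    }

    // Assumption: pass and refine both count as successes, since refined
    // attempts eventually succeeded.
    function successRate(counts: Record<Outcome, number>): number {
      const total = counts.pass + counts.refine + counts.fail + counts.refusal;
      return total === 0 ? 0 : (100 * (counts.pass + counts.refine)) / total;
    }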

How often is this data updated?

We update our benchmark results weekly as new model versions are released. Major updates are performed within 24 hours of a new model's public release. You can see the last update timestamp in the table footer.

Can I suggest new models or benchmarks?

Absolutely! We welcome community suggestions. Use the "Suggest Model" button at the top of the table to submit new models for benchmarking. For benchmark suggestions, please contact our team via the link in the footer.