AI Model Performance Comparison
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor | Details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
What do the benchmark scores mean?
Benchmark scores represent the percentage of tasks a model completes successfully; higher scores indicate better performance. The TOTAL score is a weighted average across all categories, where each category (Reason, STEM, Utility, Code, Censor) measures a different capability of the model.
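As a minimal sketch of how such a weighted total could be computed (the scores and weights below are hypothetical placeholders, not the actual values used in the table):

```python
# Hypothetical category scores (percent of tasks passed) for one model.
scores = {"Reason": 82.0, "STEM": 75.5, "Utility": 90.0, "Code": 68.0, "Censor": 95.0}

# Hypothetical weights; the real weighting behind TOTAL is not published here.
weights = {"Reason": 0.25, "STEM": 0.20, "Utility": 0.20, "Code": 0.25, "Censor": 0.10}

# Weighted average: sum of (score * weight) divided by the total weight.
total = sum(scores[c] * weights[c] for c in scores) / sum(weights.values())
print(f"TOTAL = {total:.1f}")  # -> TOTAL = 80.1
```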
What does $ mToK mean?
$ mToK is the cost per million tokens for each model, including both input and output tokens where applicable. Prices are based on official API rates and are updated regularly. Free models are listed at $0; commercial models vary by pricing tier.
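For illustration, the cost of a single request can be estimated from its token counts and the per-million-token rates; the rates and token counts below are hypothetical examples, not taken from the table:

```python
# Hypothetical per-million-token rates (USD); real rates vary by model and tier.
input_rate_per_mtok = 3.00    # cost per 1,000,000 input tokens
output_rate_per_mtok = 15.00  # cost per 1,000,000 output tokens

# Example request: 12,000 input tokens and 1,500 output tokens.
input_tokens, output_tokens = 12_000, 1_500

# Scale each token count down to millions, then apply the matching rate.
cost = (
    (input_tokens / 1_000_000) * input_rate_per_mtok
    + (output_tokens / 1_000_000) * output_rate_per_mtok
)
print(f"Estimated cost: ${cost:.4f}")  # -> Estimated cost: $0.0585
```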
What do Pass, Refine, Fail, and Refusal mean?
Pass: The model completed the task successfully on the first attempt.
Refine: The model needed clarification or iteration but eventually succeeded.
Fail: The model was unable to complete the task.
Refusal: The model declined to attempt the task, usually due to content policy.
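As a rough sketch of how per-task outcomes could be tallied into the Pass, Refine, Fail, and Refusal columns (the outcome list here is made up for illustration):

```python
from collections import Counter

# Hypothetical per-task outcomes for one model on a ten-task benchmark run.
outcomes = ["pass", "pass", "refine", "fail", "pass",
            "refusal", "pass", "refine", "pass", "fail"]

counts = Counter(outcomes)
total_tasks = len(outcomes)

# Report each category as a percentage of all attempted tasks.
for category in ("pass", "refine", "fail", "refusal"):
    pct = 100 * counts[category] / total_tasks
    print(f"{category.capitalize():8s}{pct:5.1f}%")
```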
How often are the benchmark results updated?
Benchmark results are updated weekly as new model versions are released, and major updates are performed within 24 hours of a new model's public release. The last update timestamp is shown in the table footer.
Can I suggest a model or benchmark to add?
Absolutely. We welcome community suggestions: use the "Suggest Model" button at the top of the table to submit new models for benchmarking, and for benchmark suggestions, contact our team via the link in the footer.