AI Language Model Performance Benchmarks
Comprehensive comparison of leading AI language models across various benchmarks and performance metrics.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
What does the TOTAL score mean?
The TOTAL score is a composite metric that aggregates the individual performance categories into a single number, summarizing the model's overall capability and reliability across tasks and challenges.
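As a rough illustration of what a composite score like TOTAL could look like, the sketch below computes a weighted average of per-category scores. The category names, weights, and numbers are hypothetical and are not the exact formula behind the table above.

```python
# Hypothetical sketch of a composite score; the weights are illustrative only.
CATEGORY_WEIGHTS = {
    "reason": 0.30,
    "stem": 0.25,
    "utility": 0.20,
    "code": 0.25,
}

def total_score(scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    weighted = sum(CATEGORY_WEIGHTS[name] * scores[name] for name in CATEGORY_WEIGHTS)
    return round(weighted, 1)

# Example with made-up category scores.
print(total_score({"reason": 82.0, "stem": 75.5, "utility": 88.0, "code": 70.0}))  # 78.6
```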
What do Pass, Refine, Fail, and Refusal mean?
These four rates describe the model's outcomes on the benchmark tasks (a short sketch of how they are computed follows this list):
- Pass: The percentage of tasks the model completed successfully on the first attempt.
- Refine: The percentage of tasks the model completed after some refinement or additional prompting.
- Fail: The percentage of tasks the model was unable to complete satisfactorily.
- Refusal: The percentage of tasks the model declined to attempt due to content policy or other limitations.
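Because every task ends in exactly one of these four outcomes, the rates are simply shares of the task set and should sum to roughly 100%. A minimal sketch in Python, assuming the per-task outcome labels are available (the labels and data below are hypothetical):

```python
from collections import Counter

# Hypothetical per-task outcomes; in practice these come from a benchmark run.
outcomes = ["pass", "pass", "refine", "fail", "pass", "refusal", "refine", "pass"]

counts = Counter(outcomes)
total = len(outcomes)

# Each rate is the share of tasks ending in that outcome, as a percentage.
rates = {label: 100 * counts[label] / total for label in ("pass", "refine", "fail", "refusal")}

print(rates)  # {'pass': 50.0, 'refine': 25.0, 'fail': 12.5, 'refusal': 12.5}
assert abs(sum(rates.values()) - 100) < 1e-9  # the four outcomes partition the task set
```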
The "$ mToK" (dollars per million tokens) metric represents the cost efficiency of the model. It shows how much it costs to process one million tokens, which helps evaluate the economic feasibility of using a particular model for large-scale applications.
What do the Reason, STEM, Utility, Code, and Censor scores measure?
These specialized metrics assess model performance in specific domains:
- Reason: Evaluates logical reasoning, critical thinking, and problem-solving capabilities.
- STEM: Measures performance on Science, Technology, Engineering, and Mathematics tasks.
- Utility: Assesses the model's practical usefulness for everyday tasks and applications.
- Code: Evaluates the ability to generate, understand, and debug programming code.
- Censor: Measures the model's content filtering capabilities and adherence to safety guidelines.
How often is the benchmark data updated?
We update our benchmark data quarterly, or whenever significant new models or model versions are released. The last update was performed on April 15, 2023. Each model entry lists the exact version tested to keep comparisons clear and accurate.
Can I contribute to the benchmarks?
Yes! We welcome contributions from researchers and AI enthusiasts. Please visit our GitHub repository to learn about our methodology, benchmark tasks, and submission process. All contributions undergo peer review before being added to the main dataset.