AI Language Model Performance Benchmarks

Comprehensive comparison of leading AI language models across various benchmarks and performance metrics.

Benchmark table (columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor)

Visual Comparison (chart of model scores across the benchmark categories)

Frequently Asked Questions

What does the "TOTAL" score represent?

The TOTAL score combines the individual category scores into a single overall measure of the model's capability and reliability across the full range of tasks and challenges.
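The exact formula used to combine the categories is not detailed here. As a rough sketch, assuming TOTAL is a weighted average of the per-category scores, it could be computed as follows (the weights and category names are illustrative assumptions, not the benchmark's actual definition):

```python
# Hypothetical aggregation: TOTAL as a weighted average of category scores.
# The weights and category names below are assumptions for illustration only.
CATEGORY_WEIGHTS = {
    "Reason": 0.25,
    "STEM": 0.25,
    "Utility": 0.20,
    "Code": 0.20,
    "Censor": 0.10,
}

def total_score(scores: dict[str, float]) -> float:
    """Combine per-category scores (0-100) into a single TOTAL value."""
    return sum(scores[name] * weight for name, weight in CATEGORY_WEIGHTS.items())

print(total_score({"Reason": 82, "STEM": 78, "Utility": 90, "Code": 85, "Censor": 70}))
# 82.0 with the example weights above
```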

How are "Pass", "Refine", "Fail", and "Refusal" metrics calculated?

These metrics break down the model's results on the benchmark tasks (a sketch of how the percentages can be derived follows the list):

  • Pass: The percentage of tasks the model completed successfully on the first attempt.
  • Refine: The percentage of tasks the model completed after some refinement or additional prompting.
  • Fail: The percentage of tasks the model was unable to complete satisfactorily.
  • Refusal: The percentage of tasks the model declined to attempt due to content policy or other limitations.
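A minimal sketch of how these percentages can be derived, assuming each benchmark task is recorded with exactly one of the four outcomes (the labels and data layout below are assumptions for illustration):

```python
from collections import Counter

# Hypothetical per-task outcome labels; the benchmark's real schema may differ.
OUTCOMES = ("pass", "refine", "fail", "refusal")

def outcome_percentages(results: list[str]) -> dict[str, float]:
    """Turn a list of per-task outcomes into Pass/Refine/Fail/Refusal percentages."""
    counts = Counter(results)
    total = len(results)
    return {outcome: 100.0 * counts[outcome] / total for outcome in OUTCOMES}

# Example: 10 tasks -> Pass 60%, Refine 20%, Fail 10%, Refusal 10%
tasks = ["pass"] * 6 + ["refine"] * 2 + ["fail", "refusal"]
print(outcome_percentages(tasks))
```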
What does "$ mToK" measure?

The "$ mToK" (dollars per million tokens) metric represents the cost efficiency of the model. It shows how much it costs to process one million tokens, which helps evaluate the economic feasibility of using a particular model for large-scale applications.

How are the specialized metrics (Reason, STEM, Utility, Code, Censor) evaluated?

These specialized metrics assess model performance in specific domains:

  • Reason: Evaluates logical reasoning, critical thinking, and problem-solving capabilities.
  • STEM: Measures performance on Science, Technology, Engineering, and Mathematics tasks.
  • Utility: Assesses the model's practical usefulness for everyday tasks and applications.
  • Code: Evaluates the ability to generate, understand, and debug programming code.
  • Censor: Measures the model's content filtering capabilities and adherence to safety guidelines.

How often is this benchmark data updated?

We update our benchmark data quarterly, or whenever significant new models or model versions are released. The last update was performed on April 15, 2023. Each model entry includes the version tested to ensure clarity and accurate comparison.

Can I contribute to this benchmark dataset?

Yes! We welcome contributions from researchers and AI enthusiasts. Please visit our GitHub repository to learn about our methodology, benchmark tasks, and submission process. All contributions undergo peer review before being added to the main dataset.