Model                      TOTAL  Pass  Refine  Fail  Refusal  $/mTok  Reason  STEM  Utility  Code  Censor
GPT-4-Turbo (2024-04-09)     400   360      20    15        5  $10.00      92    95       88    98  Low
Claude 3 Opus                400   375      10    10        5  $15.00      96    94       90    93  Low
Gemini 1.5 Pro               400   350      25    20        5   $7.00      90    91       85    94  Medium
Llama 3 70B Instruct         400   340      40    18        2   $0.79      88    85       92    96  Very Low
Mistral Large                400   345      30    22        3   $8.00      89    88       87    90  Medium
Command R+                   400   320      45    30        5   $3.00      80    82       85    88  Low

Frequently Asked Questions

What do these metrics mean?

TOTAL: The total number of prompts in the benchmark test suite.

Pass: The model's response was correct and helpful without needing changes.

Refine: The model's response was on the right track but required minor edits or clarification to be fully correct.

Fail: The model's response was incorrect, irrelevant, or nonsensical.

Refusal: The model refused to answer the prompt, often due to safety filters or policy constraints.

$/mTok: Estimated cost in US dollars per 1 million tokens, averaged across input and output pricing.

Reason, STEM, Utility, Code: Category-specific performance scores (out of 100), each computed on a subset of the prompts. They indicate the model's proficiency in logical reasoning, science/technology/engineering/math, general helpfulness, and code generation, respectively.

Censor: A qualitative measure of how frequently the model refuses prompts due to its safety alignment. 'Very Low' means it rarely refuses, while 'High' means it refuses often.
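
For readers who want to work with these numbers programmatically, here is a minimal Python sketch of how one table row and a couple of derived rates could be represented. The field names are our own, not an official schema, and the sample values are taken from the GPT-4-Turbo row above.

    from dataclasses import dataclass

    @dataclass
    class BenchmarkRow:
        """One leaderboard row; field names are illustrative, not an official schema."""
        model: str
        total: int           # TOTAL: prompts in the suite
        passed: int          # Pass
        refined: int         # Refine
        failed: int          # Fail
        refusals: int        # Refusal
        usd_per_mtok: float  # $/mTok: blended cost per 1M tokens

        def check(self) -> None:
            # The four verdict counts should always add up to TOTAL.
            assert self.passed + self.refined + self.failed + self.refusals == self.total

        @property
        def pass_rate(self) -> float:
            # Strict: only responses that needed no edits count.
            return self.passed / self.total

        @property
        def usable_rate(self) -> float:
            # Lenient: Pass plus responses salvageable with minor edits.
            return (self.passed + self.refined) / self.total

    # Example using the GPT-4-Turbo row from the table above.
    row = BenchmarkRow("GPT-4-Turbo (2024-04-09)", 400, 360, 20, 15, 5, 10.00)
    row.check()
    print(f"{row.pass_rate:.1%} strict, {row.usable_rate:.1%} counting refinements")
    # -> 90.0% strict, 95.0% counting refinements

Because Pass, Refine, Fail, and Refusal always sum to TOTAL, the check in the sketch holds for every row of the table.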

How is this data collected and verified?

The data presented here is an illustrative example used to demonstrate this site's design. In a real-world deployment, the data would be collected by running a standardized suite of thousands of prompts against each model's public API. Every response would then be scored as a Pass, Refine, Fail, or Refusal by a combination of automated checks (for example, executing generated code) and human reviewers. Consistency is maintained through a detailed evaluation rubric and multiple reviewers per response.
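
To make that workflow concrete, here is a rough sketch of such an evaluation loop. It is not the harness behind this table; query_model, automated_check, and human_review are hypothetical stand-ins for a real API client, an automated grader, and the rubric-based human review step.

    from collections import Counter

    VERDICTS = ("Pass", "Refine", "Fail", "Refusal")

    def evaluate_model(prompts, query_model, automated_check, human_review):
        """Tally verdicts for one model over a prompt suite (illustrative only).

        query_model(prompt)            -> the model's response text
        automated_check(prompt, resp)  -> a verdict, or None if a human must decide
        human_review(prompt, resp)     -> a verdict assigned against the rubric
        """
        tally = Counter()
        for prompt in prompts:
            response = query_model(prompt)
            # Objective cases (e.g. "does the generated code run?") are settled
            # automatically; everything else goes to human reviewers.
            verdict = automated_check(prompt, response) or human_review(prompt, response)
            assert verdict in VERDICTS
            tally[verdict] += 1
        return tally

The four counts in the returned tally would populate the Pass, Refine, Fail, and Refusal columns for that model, and their sum equals TOTAL.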

Why isn't Model X included in the table?

Our goal is to include all major, publicly available large language models. A model might not be listed for several reasons:

  • It was released very recently and is still in our evaluation queue.
  • It is a private or research-only model with no public API access.
  • It is a smaller, specialized model that doesn't fit the scope of this general-purpose benchmark.

We are constantly updating our list, so please check back later!

How often is the data updated?

The LLM landscape changes rapidly. We aim to re-evaluate a model whenever a significant new version is released (e.g., from GPT-4 to GPT-4.5), and the table is typically refreshed every one to two months to reflect the latest model versions and to add new competitors. Each model name usually includes the version or evaluation date, so it is always clear exactly which release was tested.