Model                      TOTAL  Pass  Refine  Fail  Refusal  $/mTok  Reason  STEM  Utility  Code  Censor
GPT-4-Turbo (2024-04-09)     400   360      20    15        5  $10.00      92    95       88    98  Low
Claude 3 Opus                400   375      10    10        5  $15.00      96    94       90    93  Low
Gemini 1.5 Pro               400   350      25    20        5   $7.00      90    91       85    94  Medium
Llama 3 70B Instruct         400   340      40    18        2   $0.79      88    85       92    96  Very Low
Mistral Large                400   345      30    22        3   $8.00      89    88       87    90  Medium
Command R+                   400   320      45    30        5   $3.00      80    82       85    88  Low

Frequently Asked Questions

What do these metrics mean?

TOTAL: The total number of prompts in the benchmark test suite.

Pass: The model's response was correct and helpful without needing changes.

Refine: The model's response was on the right track but required minor edits or clarification to be fully correct.

Fail: The model's response was incorrect, irrelevant, or nonsensical.

Refusal: The model refused to answer the prompt, often due to safety filters or policy constraints.

$/mTok: Estimated cost in US dollars per 1 million tokens, averaged across input and output pricing.

Reason, STEM, Utility, Code: Category-specific performance scores (out of 100), each computed on a subset of the prompts. They indicate the model's proficiency in logical reasoning, science/technology/engineering/math, general helpfulness, and code generation, respectively.

Censor: A qualitative measure of how frequently the model refuses prompts due to its safety alignment. 'Very Low' means it rarely refuses, while 'High' means it refuses often.
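
For readers who want to work with these numbers programmatically, here is a minimal Python sketch of how one table row and a couple of derived rates could be represented. The field names are our own, not an official schema, and the sample values are taken from the GPT-4-Turbo row above.

    from dataclasses import dataclass

    @dataclass
    class BenchmarkRow:
        """One leaderboard row; field names are illustrative, not an official schema."""
        model: str
        total: int           # TOTAL: prompts in the suite
        passed: int          # Pass
        refined: int         # Refine
        failed: int          # Fail
        refusals: int        # Refusal
        usd_per_mtok: float  # $/mTok: blended cost per 1M tokens

        def check(self) -> None:
            # The four verdict counts should always add up to TOTAL.
            assert self.passed + self.refined + self.failed + self.refusals == self.total

        @property
        def pass_rate(self) -> float:
            # Strict: only responses that needed no edits count.
            return self.passed / self.total

        @property
        def usable_rate(self) -> float:
            # Lenient: Pass plus responses salvageable with minor edits.
            return (self.passed + self.refined) / self.total

    # Example using the GPT-4-Turbo row from the table above.
    row = BenchmarkRow("GPT-4-Turbo (2024-04-09)", 400, 360, 20, 15, 5, 10.00)
    row.check()
    print(f"{row.pass_rate:.1%} strict, {row.usable_rate:.1%} counting refinements")
    # -> 90.0% strict, 95.0% counting refinements

Because Pass, Refine, Fail, and Refusal always sum to TOTAL, the check in the sketch holds for every row of the table.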

How is this data collected and verified?

The data presented here is an illustrative example used to demonstrate this site's design. In a real-world deployment, the data would be collected by running a standardized suite of thousands of prompts against each model's public API. Every response would then be scored as a Pass, Refine, Fail, or Refusal by a combination of automated checks (for example, executing generated code) and human reviewers. Consistency is maintained through a detailed evaluation rubric and multiple reviewers per response.
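
To make that workflow concrete, here is a rough sketch of such an evaluation loop. It is not the harness behind this table; query_model, automated_check, and human_review are hypothetical stand-ins for a real API client, an automated grader, and the rubric-based human review step.

    from collections import Counter

    VERDICTS = ("Pass", "Refine", "Fail", "Refusal")

    def evaluate_model(prompts, query_model, automated_check, human_review):
        """Tally verdicts for one model over a prompt suite (illustrative only).

        query_model(prompt)            -> the model's response text
        automated_check(prompt, resp)  -> a verdict, or None if a human must decide
        human_review(prompt, resp)     -> a verdict assigned against the rubric
        """
        tally = Counter()
        for prompt in prompts:
            response = query_model(prompt)
            # Objective cases (e.g. "does the generated code run?") are settled
            # automatically; everything else goes to human reviewers.
            verdict = automated_check(prompt, response) or human_review(prompt, response)
            assert verdict in VERDICTS
            tally[verdict] += 1
        return tally

The four counts in the returned tally would populate the Pass, Refine, Fail, and Refusal columns for that model, and their sum equals TOTAL.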

Why isn't Model X included in the table?

Our goal is to include all major, publicly available large language models. A model might not be listed for several reasons:

  • It was released very recently and is still in our evaluation queue.
  • It is a private or research-only model with no public API access.
  • It is a smaller, specialized model that doesn't fit the scope of this general-purpose benchmark.

We are constantly updating our list, so please check back later!

How often is the data updated?

The LLM landscape changes rapidly. We aim to re-evaluate a model whenever a significant new version is released (e.g., from GPT-4 to GPT-4.5), and the table is typically refreshed every one to two months to reflect the latest model versions and to add new competitors. Each model name usually includes the version or evaluation date, so it is always clear exactly which release was tested.