Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 92.5% | 87.2% | 91.5% | 95.8% | 3.2% | $0.03 | | | | | |
Claude 2 | 88.2% | 84.5% | 87.1% | 91.2% | 1.8% | $0.08 | | | | | |
PaLM 2 | 85.7% | 81.3% | 84.9% | 88.1% | 4.7% | $0.0005 | | | | | |
LLaMA 2 | 82.1% | 78.9% | 81.7% | 85.3% | 2.1% | $0.0007 | | | | | |
Gemini Pro | 79.8% | 76.4% | 78.9% | 82.1% | 5.3% | $0.0005 | | | | | |
Mixtral 8x7B | 77.3% | 73.8% | 76.2% | 79.8% | 3.7% | $0.0006 | | | | | |
Claude 1 | 74.6% | 71.2% | 73.8% | 77.1% | 6.2% | $0.08 | | | | | |
GPT-3.5 | 71.9% | 68.5% | 71.2% | 74.8% | 8.1% | $0.002 | | | | | |
About our benchmarking methodology and results:
Our benchmarks evaluate models across multiple dimensions including logical reasoning, mathematical problem-solving, coding proficiency, and factual knowledge. Each model is tested on standardized datasets designed to assess these capabilities. Results are normalized and presented as percentages for easy comparison.
The TOTAL score is a weighted average of performance across all benchmark categories. It provides an overall measure of a model's general intelligence and capability. Higher TOTAL scores indicate better overall performance, though individual categories may vary.
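As a minimal sketch of how such a weighted average can be computed, the snippet below combines per-category scores into a TOTAL. The category names follow the table columns above, but the weight values are illustrative assumptions, not the weights actually used in the benchmark.

```python
# Sketch of a weighted TOTAL score. The weights below are hypothetical
# placeholders; the benchmark's actual weighting may differ.
CATEGORY_WEIGHTS = {
    "Reason": 0.30,
    "STEM": 0.20,
    "Utility": 0.20,
    "Code": 0.20,
    "Censor": 0.10,
}

def total_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    weighted = sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
    # Normalize by the weights of the categories actually supplied,
    # then round to one decimal to match the table's presentation.
    return round(weighted / sum(CATEGORY_WEIGHTS[c] for c in category_scores), 1)

# Example: a model scoring 90 in every category gets TOTAL = 90.0
print(total_score({"Reason": 90, "STEM": 90, "Utility": 90, "Code": 90, "Censor": 90}))
```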
We update our benchmarks weekly with the latest model releases and improvements. New models are added within 48 hours of public release. All scores are recalculated monthly to ensure consistency and accuracy in our comparisons.
The Pass, Refine, Fail, and Refusal columns show the distribution of model responses to test cases. The category columns (Reason, STEM, Utility, Code, and Censor) use color-coded indicators to help quickly assess performance in each area.
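For illustration, here is a minimal sketch of how such a response distribution can be tallied from per-test outcome labels. The label names and the sample data are assumptions for demonstration only.

```python
from collections import Counter

# Outcome labels assumed for illustration; each test case is tagged with one.
OUTCOMES = ("pass", "refine", "fail", "refusal")

def response_distribution(results: list[str]) -> dict[str, float]:
    """Convert a list of per-test outcome labels into percentage columns."""
    counts = Counter(results)
    total = len(results)
    return {o: 100.0 * counts.get(o, 0) / total for o in OUTCOMES}

# Example: 8 passes, 1 refine, 0 fails, 1 refusal out of 10 test cases
print(response_distribution(["pass"] * 8 + ["refine"] + ["refusal"]))
# {'pass': 80.0, 'refine': 10.0, 'fail': 0.0, 'refusal': 10.0}
```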