Models Tracked: 24 (3 new this month)
Highest Score: 92.5% (GPT-4 leading)
Average Performance: 78.3% (down 2.1% from last month)
Benchmark Categories: 8 (updated weekly)

Performance Overview

Top models by TOTAL score: GPT-4 (92.5%), Claude 2 (88.2%), PaLM 2 (85.7%), LLaMA 2 (82.1%), Gemini Pro (79.8%).
| Model | TOTAL | Pass | Refine | Fail | Refusal | $/mTok |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 92.5% | 87.2% | 91.5% | 95.8% | 3.2% | $0.03 |
| Claude 2 | 88.2% | 84.5% | 87.1% | 91.2% | 1.8% | $0.08 |
| PaLM 2 | 85.7% | 81.3% | 84.9% | 88.1% | 4.7% | $0.0005 |
| LLaMA 2 | 82.1% | 78.9% | 81.7% | 85.3% | 2.1% | $0.0007 |
| Gemini Pro | 79.8% | 76.4% | 78.9% | 82.1% | 5.3% | $0.0005 |
| Mixtral 8x7B | 77.3% | 73.8% | 76.2% | 79.8% | 3.7% | $0.0006 |
| Claude 1 | 74.6% | 71.2% | 73.8% | 77.1% | 6.2% | $0.08 |
| GPT-3.5 | 71.9% | 68.5% | 71.2% | 74.8% | 8.1% | $0.002 |

The category-specific columns (Reason, STEM, Utility, Code, Censor) are displayed as progress bars in the dashboard rather than as numeric scores.

Frequently Asked Questions

Learn more about our benchmarking methodology and results

How are models evaluated in these benchmarks?

Our benchmarks evaluate models across multiple dimensions including logical reasoning, mathematical problem-solving, coding proficiency, and factual knowledge. Each model is tested on standardized datasets designed to assess these capabilities. Results are normalized and presented as percentages for easy comparison.
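
As a rough illustration of that normalization step (not our actual harness; the category names and correctness data below are made up), per-category pass/fail flags can be averaged into the percentage scores shown in the table:

```python
from statistics import mean

# Hypothetical raw results: per-category lists of 0/1 correctness flags
# collected by running a model against standardized test items.
raw_results = {
    "reasoning": [1, 1, 0, 1, 1],
    "math":      [1, 0, 1, 1, 0],
    "coding":    [1, 1, 1, 0, 1],
    "knowledge": [0, 1, 1, 1, 1],
}

def normalize(results):
    """Convert raw pass/fail flags into percentage scores per category."""
    return {category: 100 * mean(flags) for category, flags in results.items()}

print(normalize(raw_results))
# {'reasoning': 80.0, 'math': 60.0, 'coding': 80.0, 'knowledge': 80.0}
```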

What does the "TOTAL" score represent?

The TOTAL score is a weighted average of performance across all benchmark categories. It provides an overall measure of a model's general intelligence and capability. Higher TOTAL scores indicate better overall performance, though individual categories may vary.
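
For instance, a weighted average along these lines produces a TOTAL-style score; the weights below are illustrative placeholders, not the weighting we actually apply:

```python
# Illustrative category weights -- assumptions for this sketch,
# not the weighting actually used for the TOTAL column.
WEIGHTS = {"reason": 0.30, "stem": 0.25, "utility": 0.20, "code": 0.15, "censor": 0.10}

def total_score(category_scores):
    """Weighted average of per-category percentage scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1
    return sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS)

# Example with made-up category scores:
example = {"reason": 90.0, "stem": 88.0, "utility": 95.0, "code": 85.0, "censor": 92.0}
print(total_score(example))  # weighted TOTAL for these example scores
```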

How often are these benchmarks updated?

We update our benchmarks weekly with the latest model releases and improvements. New models are added within 48 hours of public release. All scores are recalculated monthly to ensure consistency and accuracy in our comparisons.

What do the "Pass", "Refine", "Fail", and "Refusal" columns mean?

These columns show the distribution of model responses to test cases (a brief tallying sketch follows the list):

  • Pass: The model provided a correct answer directly
  • Refine: The model needed corrections or clarifications to arrive at the correct answer
  • Fail: The model provided an incorrect answer
  • Refusal: The model declined to answer for ethical or safety reasons
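
As a rough sketch of how those percentages are tallied (assuming each test case has already been labeled with one of the four outcomes; the data here is made up):

```python
from collections import Counter

# Hypothetical per-test-case outcomes for one model (made-up data).
outcomes = ["pass", "pass", "refine", "fail", "pass", "refusal", "pass", "refine"]

def outcome_distribution(labels):
    """Percentage of test cases in each response category."""
    counts = Counter(labels)
    total = len(labels)
    return {k: 100 * counts[k] / total for k in ("pass", "refine", "fail", "refusal")}

print(outcome_distribution(outcomes))
# {'pass': 50.0, 'refine': 25.0, 'fail': 12.5, 'refusal': 12.5}
```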

How can I interpret the visual indicators in the table?

Color-coded indicators help quickly assess performance:

  • Green: High performance (85-100%)
  • Yellow: Medium performance (70-84%)
  • Red: Low performance (below 70%)

Progress bars visually represent performance levels in category-specific metrics, making it easy to compare models at a glance.
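
As a minimal illustration of those color thresholds (the helper function below is purely explanatory and is not part of the dashboard):

```python
def performance_color(score):
    """Map a percentage score to the color band used in the table."""
    if score >= 85:
        return "green"   # high performance (85-100%)
    if score >= 70:
        return "yellow"  # medium performance (70-84%)
    return "red"         # low performance (below 70%)

assert performance_color(92.5) == "green"
assert performance_color(78.3) == "yellow"
assert performance_color(65.0) == "red"
```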