Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 92.5% | 87.2% | 91.5% | 95.8% | 3.2% | $0.03 | | | | | |
Claude 2 | 88.2% | 84.5% | 87.1% | 91.2% | 1.8% | $0.08 | | | | | |
PaLM 2 | 85.7% | 81.3% | 84.9% | 88.1% | 4.7% | $0.0005 | | | | | |
LLaMA 2 | 82.1% | 78.9% | 81.7% | 85.3% | 2.1% | $0.0007 | | | | | |
Gemini Pro | 79.8% | 76.4% | 78.9% | 82.1% | 5.3% | $0.0005 | | | | | |
Mixtral 8x7B | 77.3% | 73.8% | 76.2% | 79.8% | 3.7% | $0.0006 | | | | | |
Claude 1 | 74.6% | 71.2% | 73.8% | 77.1% | 6.2% | $0.08 | | | | | |
GPT-3.5 | 71.9% | 68.5% | 71.2% | 74.8% | 8.1% | $0.002 | | | | | |
About our benchmarking methodology and results:
Our benchmarks evaluate models across multiple dimensions including logical reasoning, mathematical problem-solving, coding proficiency, and factual knowledge. Each model is tested on standardized datasets designed to assess these capabilities. Results are normalized and presented as percentages for easy comparison.
The TOTAL score is a weighted average of performance across all benchmark categories. It provides an overall measure of a model's general intelligence and capability. Higher TOTAL scores indicate better overall performance, though individual categories may vary.
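As a minimal sketch of how such a weighted average can be computed, the snippet below combines per-category scores into a TOTAL. The category names follow the table columns above, but the weight values are illustrative assumptions, not the weights actually used in the benchmark.

```python
# Sketch of a weighted TOTAL score. The weights below are hypothetical
# placeholders; the benchmark's actual weighting may differ.
CATEGORY_WEIGHTS = {
    "Reason": 0.30,
    "STEM": 0.20,
    "Utility": 0.20,
    "Code": 0.20,
    "Censor": 0.10,
}

def total_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each on a 0-100 scale)."""
    weighted = sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
    # Normalize by the weights of the categories actually supplied,
    # then round to one decimal to match the table's presentation.
    return round(weighted / sum(CATEGORY_WEIGHTS[c] for c in category_scores), 1)

# Example: a model scoring 90 in every category gets TOTAL = 90.0
print(total_score({"Reason": 90, "STEM": 90, "Utility": 90, "Code": 90, "Censor": 90}))
```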
We update our benchmarks weekly with the latest model releases and improvements. New models are added within 48 hours of public release. All scores are recalculated monthly to ensure consistency and accuracy in our comparisons.
The Pass, Refine, Fail, and Refusal columns show the distribution of model responses to test cases. The category columns (Reason, STEM, Utility, Code, and Censor) use color-coded indicators to help quickly assess performance in each area.
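For illustration, here is a minimal sketch of how such a response distribution can be tallied from per-test outcome labels. The label names and the sample data are assumptions for demonstration only.

```python
from collections import Counter

# Outcome labels assumed for illustration; each test case is tagged with one.
OUTCOMES = ("pass", "refine", "fail", "refusal")

def response_distribution(results: list[str]) -> dict[str, float]:
    """Convert a list of per-test outcome labels into percentage columns."""
    counts = Counter(results)
    total = len(results)
    return {o: 100.0 * counts.get(o, 0) / total for o in OUTCOMES}

# Example: 8 passes, 1 refine, 0 fails, 1 refusal out of 10 test cases
print(response_distribution(["pass"] * 8 + ["refine"] + ["refusal"]))
# {'pass': 80.0, 'refine': 10.0, 'fail': 0.0, 'refusal': 10.0}
```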