# LLM Benchmark Table
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gpt-4o-2024-08-06 | 80.5 | 85.1 | 12.9 | 1.3 | 0.6 | 0.00053 | 64.0 | 66.6 | 92.0 | 96.1 | 95.4 |
| claude-3-5-sonnet-20240620 | 77.6 | 86.9 | 10.8 | 1.6 | 0.6 | 0.003 | 50.1 | 71.3 | 90.9 | 97.8 | 94.9 |
| claude-3-haiku-20240307 | 75.5 | 86.2 | 11.7 | 1.5 | 0.6 | 0.00026 | 50.1 | 64.6 | 90.5 | 95.3 | 93.2 |
| gemini-1.5-pro-exp-05-2024 | 73.7 | 87.6 | 9.5 | 2.2 | 0.6 | 0.007 | 45.4 | 63.1 | 89.4 | 99.7 | 94.7 |
| gpt-4-turbo-2024-04-09 | 72.5 | 85.6 | 12.4 | 1.2 | 0.8 | 0.003 | 49.1 | 64.9 | 87.7 | 91.0 | 94.2 |
| claude-3-opus-20240229 | 69.2 | 85.2 | 12.7 | 1.5 | 0.6 | 0.015 | 39.2 | 64.4 | 89.0 | 94.8 | 92.9 |
| gemini-1.5-pro-05-2024 | 68.2 | 85.0 | 11.0 | 3.1 | 0.9 | 0.007 | 39.2 | 61.1 | 88.6 | 96.0 | 91.7 |
| claude-3-sonnet-20240229 | 68.0 | 84.6 | 12.9 | 1.9 | 0.6 | 0.003 | 39.2 | 65.8 | 86.9 | 92.7 | 91.8 |
| gpt-4-0613 | 66.9 | 82.9 | 14.2 | 1.9 | 1.0 | 0.03 | 36.0 | 69.0 | 85.7 | 80.8 | 92.5 |
| command-r-plus | 63.7 | 82.1 | 15.3 | 1.9 | 0.6 | 0.003 | 36.0 | 59.6 | 86.2 | 86.0 | 87.8 |
| gemini-pro-1.5-05-2024 | 62.4 | 82.3 | 13.9 | 3.0 | 0.8 | 0.0007 | 33.3 | 64.6 | 84.0 | 84.9 | 90.0 |
| gpt-3.5-turbo-0125 | 62.4 | 80.7 | 16.6 | 1.9 | 0.9 | 0.0005 | 39.2 | 65.8 | 83.7 | 69.6 | 89.6 |
| llama-3-70b-instruct | 61.4 | 80.1 | 17.3 | 2.0 | 0.6 | 0.00059 | 30.4 | 64.6 | 83.3 | 88.4 | 89.1 |
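
For offline analysis, the sketch below shows one way to load the table and rank models by TOTAL score. It assumes the data has been exported to a CSV file whose header row matches the column names above; the filename `llm_benchmark.csv` is an assumption, not the site's actual export name.

```python
# Minimal sketch: load a CSV export of the benchmark table and rank models by TOTAL.
# "llm_benchmark.csv" is a hypothetical filename; adjust it to match the real export.
import csv

NUMERIC_COLUMNS = [
    "TOTAL", "Pass", "Refine", "Fail", "Refusal",
    "$ mToK", "Reason", "STEM", "Utility", "Code", "Censor",
]

def load_benchmark(path: str) -> list[dict]:
    """Read the exported table and convert the numeric columns to floats."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        for col in NUMERIC_COLUMNS:
            row[col] = float(row[col])
    return rows

if __name__ == "__main__":
    models = load_benchmark("llm_benchmark.csv")
    # Rank by TOTAL, highest first -- the default ordering of the table above.
    for row in sorted(models, key=lambda r: r["TOTAL"], reverse=True):
        print(f'{row["Model"]:<30} TOTAL={row["TOTAL"]:.1f}  Code={row["Code"]:.1f}')
```
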
## FAQ
**What does this benchmark measure?** It compares AI language models side by side across multiple performance metrics, so readers can gauge each model's capabilities and limitations.

**How often is the data updated?** The benchmark data is updated as new models are released and new performance metrics become available; we strive to keep the information current.

**Can I download the data?** Yes. The complete dataset, covering every model and all of its metrics, can be downloaded via the download button at the bottom of the table.

**What do the metrics mean?** The columns cover overall performance (TOTAL), pass rate (Pass), refinement (Refine), failure rate (Fail), refusal rate (Refusal), cost ($ mToK), reasoning ability (Reason), STEM knowledge (STEM), utility (Utility), coding skill (Code), and censorship handling (Censor).

**How are models ranked?** By TOTAL score by default. Click any column header to sort by that metric; click it again to reverse the order.
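The same sort-and-reverse behavior can be reproduced in code. The sketch below is an illustration only, not the site's actual implementation; it assumes rows loaded with the `load_benchmark` helper from the earlier sketch and the column names from the table.

```python
# Minimal sketch of the column-header sort behavior: sort by a chosen metric,
# and reverse the order when the same metric is requested twice in a row.
def make_sorter():
    state = {"last_metric": None, "descending": True}

    def sort_by(rows: list[dict], metric: str) -> list[dict]:
        if metric == state["last_metric"]:
            state["descending"] = not state["descending"]  # repeated click reverses
        else:
            state["last_metric"] = metric
            state["descending"] = True  # first click sorts high-to-low
        return sorted(rows, key=lambda r: r[metric], reverse=state["descending"])

    return sort_by

# Example usage (assumes `models` was loaded with load_benchmark above):
# sort_by = make_sorter()
# by_code = sort_by(models, "Code")      # highest Code score first
# by_code_rev = sort_by(models, "Code")  # same metric again -> order reversed
```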