LLM Benchmark Table

Model                       TOTAL  Pass  Refine  Fail  Refusal   $/mTok  Reason  STEM  Utility  Code  Censor
gpt-4o-2024-08-06            80.5  85.1    12.9   1.3      0.6  0.00053    64.0  66.6     92.0  96.1    95.4
claude-3-5-sonnet-20240620   77.6  86.9    10.8   1.6      0.6    0.003    50.1  71.3     90.9  97.8    94.9
claude-3-haiku-20240307      75.5  86.2    11.7   1.5      0.6  0.00026    50.1  64.6     90.5  95.3    93.2
gemini-1.5-pro-exp-05-2024   73.7  87.6     9.5   2.2      0.6    0.007    45.4  63.1     89.4  99.7    94.7
gpt-4-turbo-2024-04-09       72.5  85.6    12.4   1.2      0.8    0.003    49.1  64.9     87.7  91.0    94.2
claude-3-opus-20240229       69.2  85.2    12.7   1.5      0.6    0.015    39.2  64.4     89.0  94.8    92.9
gemini-1.5-pro-05-2024       68.2  85.0    11.0   3.1      0.9    0.007    39.2  61.1     88.6  96.0    91.7
claude-3-sonnet-20240229     68.0  84.6    12.9   1.9      0.6    0.003    39.2  65.8     86.9  92.7    91.8
gpt-4-0613                   66.9  82.9    14.2   1.9      1.0     0.03    36.0  69.0     85.7  80.8    92.5
command-r-plus               63.7  82.1    15.3   1.9      0.6    0.003    36.0  59.6     86.2  86.0    87.8
gemini-pro-1.5-05-2024       62.4  82.3    13.9   3.0      0.8   0.0007    33.3  64.6     84.0  84.9    90.0
gpt-3.5-turbo-0125           62.4  80.7    16.6   1.9      0.9   0.0005    39.2  65.8     83.7  69.6    89.6
llama-3-70b-instruct         61.4  80.1    17.3   2.0      0.6  0.00059    30.4  64.6     83.3  88.4    89.1
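
For working with these results programmatically, each row translates naturally into a record. A minimal sketch in Python, using only a subset of the rows and columns above (the dictionary key names are illustrative, not part of the benchmark):

```python
# A few rows from the table above, as plain Python dicts.
# Key names are illustrative; only a subset of columns is included.
ROWS = [
    {"model": "gpt-4o-2024-08-06",          "total": 80.5, "pass": 85.1, "cost": 0.00053},
    {"model": "claude-3-5-sonnet-20240620", "total": 77.6, "pass": 86.9, "cost": 0.003},
    {"model": "claude-3-haiku-20240307",    "total": 75.5, "pass": 86.2, "cost": 0.00026},
    {"model": "llama-3-70b-instruct",       "total": 61.4, "pass": 80.1, "cost": 0.00059},
]

# Default ranking: highest TOTAL first, as in the table.
by_total = sorted(ROWS, key=lambda r: r["total"], reverse=True)

# Example query: models under 0.001 in the $/mTok column, still ordered by TOTAL.
cheap = [r for r in by_total if r["cost"] < 0.001]

for r in cheap:
    print(f"{r['model']}: TOTAL={r['total']}, $/mTok={r['cost']}")
```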

FAQ

Q: What is this benchmark?
A: This benchmark provides a side-by-side comparison of AI language models across multiple performance metrics, to help users understand each model's capabilities and limitations.

Q: How often is the data updated?
A: The benchmark data is updated regularly as new models are released and new performance metrics become available, with the aim of keeping the information current.

Q: Can I download the data?
A: Yes. The complete dataset, including all models and their performance metrics, can be downloaded via the download button at the bottom of the table.
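
If the export is a CSV file, it can be loaded for offline analysis. A minimal sketch with pandas, assuming a hypothetical file name llm_benchmark.csv whose columns match the table headers above:

```python
import pandas as pd

# Hypothetical file name and format; the actual export may differ.
df = pd.read_csv("llm_benchmark.csv")

# Quick sanity check: which columns are present and the top of the leaderboard.
print(df.columns.tolist())
print(df.head())
```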

Q: Which metrics are reported?
A: The metrics cover overall performance (TOTAL), pass rate (Pass), refinement rate (Refine), failure rate (Fail), refusal rate (Refusal), cost ($/mTok), reasoning (Reason), STEM knowledge (STEM), utility (Utility), coding ability (Code), and censorship handling (Censor).

Q: How are models ranked?
A: Models are ranked by their TOTAL score by default, but you can sort by any metric by clicking its column header; clicking again reverses the sort order.
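
The same ordering can be reproduced offline on the hypothetical CSV export described above; the ascending flag mirrors the click / click-again toggle (column names such as "Model", "TOTAL", and "Code" are assumptions based on the table headers):

```python
import pandas as pd

df = pd.read_csv("llm_benchmark.csv")  # hypothetical export file name

# Sort by any metric column; ascending=False gives the default
# "best first" view, ascending=True the reversed order.
by_code = df.sort_values(by="Code", ascending=False)
print(by_code[["Model", "TOTAL", "Code"]].head())
```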