LLM Benchmark Table

The five rightmost columns (Reason, STEM, Utility, Code, Censor) are the per-theme scores; $/mTok is the listed cost per million tokens.

| Model | TOTAL | Pass | Refine | Fail | Refusal | $/mTok | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Chat | 100 | 87 | 13 | 0 | 0 | 0.00000001 | 85 | 84 | 90 | 88 | 95 |
| GPT-4o | 100 | 86 | 14 | 0 | 0 | 0.0000001 | 85 | 83 | 89 | 87 | 94 |
| GPT-4-Turbo | 100 | 85 | 15 | 0 | 0 | 0.0000001 | 84 | 82 | 88 | 86 | 93 |
| Claude-3-Opus | 100 | 84 | 16 | 0 | 0 | 0.0000001 | 83 | 81 | 87 | 85 | 92 |
| Claude-3-Sonnet | 100 | 83 | 17 | 0 | 0 | 0.00000003 | 82 | 80 | 86 | 84 | 91 |
| Gemini-1.5-Pro | 100 | 82 | 18 | 0 | 0 | 0.00000007 | 81 | 79 | 85 | 83 | 90 |
| Gemini-1.0-Ultra | 100 | 81 | 19 | 0 | 0 | 0.0000001 | 80 | 78 | 84 | 82 | 89 |
| Mistral-Large | 100 | 80 | 20 | 0 | 0 | 0.00000001 | 79 | 77 | 83 | 81 | 88 |
| Command-R+ | 100 | 79 | 21 | 0 | 0 | 0.00000001 | 78 | 76 | 82 | 80 | 87 |
| Command-R | 100 | 78 | 22 | 0 | 0 | 0.00000001 | 77 | 75 | 81 | 79 | 86 |
| GPT-3.5-Turbo | 100 | 77 | 23 | 0 | 0 | 0.000000005 | 76 | 74 | 80 | 78 | 85 |
| Claude-3-Haiku | 100 | 76 | 24 | 0 | 0 | 0.0000000025 | 75 | 73 | 79 | 77 | 84 |
| Gemini-1.5-Flash | 100 | 75 | 25 | 0 | 0 | 0.0000000035 | 74 | 72 | 78 | 76 | 83 |
| Gemini-1.0-Pro | 100 | 74 | 26 | 0 | 0 | 0.000000001 | 73 | 71 | 77 | 75 | 82 |
| Llama-3-70B-Instruct | 100 | 73 | 27 | 0 | 0 | 0.0000000007 | 72 | 70 | 76 | 74 | 81 |
| Llama-3-8B-Instruct | 100 | 72 | 28 | 0 | 0 | 0.0000000002 | 71 | 69 | 75 | 73 | 80 |
| Mixtral-8x7B-Instruct | 100 | 71 | 29 | 0 | 0 | 0.0000000002 | 70 | 68 | 74 | 72 | 79 |
| Mistral-Medium | 100 | 70 | 30 | 0 | 0 | 0.0000000001 | 69 | 67 | 73 | 71 | 78 |
| Mistral-Small | 100 | 69 | 31 | 0 | 0 | 0.00000000002 | 68 | 66 | 72 | 70 | 77 |
| Mistral-Tiny | 100 | 68 | 32 | 0 | 0 | 0.00000000002 | 67 | 65 | 71 | 69 | 76 |
| Qwen2-72B-Instruct | 100 | 67 | 33 | 0 | 0 | 0.00000000002 | 66 | 64 | 70 | 68 | 75 |
| Qwen1.5-72B-Chat | 100 | 66 | 34 | 0 | 0 | 0.00000000002 | 65 | 63 | 69 | 67 | 74 |
| Qwen1.5-14B-Chat | 100 | 65 | 35 | 0 | 0 | 0.00000000002 | 64 | 62 | 68 | 66 | 73 |
| Qwen1.5-7B-Chat | 100 | 64 | 36 | 0 | 0 | 0.00000000002 | 63 | 61 | 67 | 65 | 72 |
| Qwen1.5-1.8B-Chat | 100 | 63 | 37 | 0 | 0 | 0.00000000002 | 62 | 60 | 66 | 64 | 71 |
| Qwen1.5-0.5B-Chat | 100 | 62 | 38 | 0 | 0 | 0.00000000002 | 61 | 59 | 65 | 63 | 70 |
| Phi-3-mini-4k-instruct | 100 | 61 | 39 | 0 | 0 | 0.00000000002 | 60 | 58 | 64 | 62 | 69 |
| Phi-3-mini-128k-instruct | 100 | 60 | 40 | 0 | 0 | 0.00000000002 | 59 | 57 | 63 | 61 | 68 |
| Phi-2 | 100 | 59 | 41 | 0 | 0 | 0.00000000002 | 58 | 56 | 62 | 60 | 67 |
| DeepSeek-7B-Chat | 100 | 58 | 42 | 0 | 0 | 0.00000000002 | 57 | 55 | 61 | 59 | 66 |
| DeepSeek-67B-Chat | 100 | 57 | 43 | 0 | 0 | 0.00000000002 | 56 | 54 | 60 | 58 | 65 |
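
For programmatic analysis, each data row can be loaded into a small record. The sketch below is a minimal, illustrative parser rather than published tooling: `BenchRow` and `parse_row` are hypothetical names, and the column semantics in the comments (TOTAL as the task count, with Pass, Refine, Fail, and Refusal as outcome counts that sum to it, which does hold for every row above) are assumptions inferred from the headers.

```python
from dataclasses import dataclass


@dataclass
class BenchRow:
    # Column meanings are assumptions inferred from the table header, not confirmed by the source.
    model: str
    total: int            # assumed: number of benchmark tasks (100 in every row above)
    passed: int           # assumed: tasks passed outright
    refine: int           # assumed: tasks passed only after refinement
    fail: int
    refusal: int
    cost_per_mtok: float  # assumed: cost in $ per million tokens, as listed
    reason: int
    stem: int
    utility: int
    code: int
    censor: int


def parse_row(line: str) -> BenchRow:
    """Parse one pipe-delimited data row from the table above."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    (model, total, passed, refine, fail, refusal, cost,
     reason, stem, utility, code, censor) = cells
    return BenchRow(model, int(total), int(passed), int(refine), int(fail),
                    int(refusal), float(cost), int(reason), int(stem),
                    int(utility), int(code), int(censor))


row = parse_row("| DeepSeek-V2-Chat | 100 | 87 | 13 | 0 | 0 | 0.00000001 | 85 | 84 | 90 | 88 | 95 |")

# Sanity check: the four outcome counts should account for every task.
assert row.passed + row.refine + row.fail + row.refusal == row.total
print(f"{row.model}: pass rate {row.passed / row.total:.0%}, "
      f"listed cost {row.cost_per_mtok:g} $/mTok")
```

Keeping the outcome-sum assertion in place is a cheap way to catch a mis-parsed or mis-edited row whenever the table is updated.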

Frequently Asked Questions