Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 95 | 70 | 10 | 10 | 5 | 0.03 | 90 | 88 | 92 | 94 | 2 |
Claude-3 | 90 | 65 | 15 | 8 | 2 | 0.02 | 85 | 86 | 89 | 87 | 5 |
LLaMA-2 | 80 | 50 | 20 | 10 | 0 | 0.01 | 78 | 75 | 80 | 82 | 3 |
Gemini Pro | 85 | 55 | 18 | 10 | 2 | 0.025 | 80 | 82 | 84 | 86 | 4 |
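For readers who want to slice the numbers themselves, here is a minimal, hypothetical sketch that loads the headline figures into pandas and ranks the models by TOTAL score; the values are copied from the table above and pandas is assumed to be installed:

```python
import pandas as pd

# Headline figures copied from the benchmark table above.
data = [
    ("GPT-4",      95, 70, 10, 10, 5, 0.030),
    ("Claude-3",   90, 65, 15,  8, 2, 0.020),
    ("LLaMA-2",    80, 50, 20, 10, 0, 0.010),
    ("Gemini Pro", 85, 55, 18, 10, 2, 0.025),
]
cols = ["Model", "TOTAL", "Pass", "Refine", "Fail", "Refusal", "usd_per_1k_tokens"]
df = pd.DataFrame(data, columns=cols)

# Rank by overall score, highest first.
print(df.sort_values("TOTAL", ascending=False).to_string(index=False))
```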
FAQ
**What is this benchmark table?**

It compares the performance of several Large Language Models (LLMs) across the categories shown above: an overall score (TOTAL, broken down into Pass, Refine, Fail, and Refusal), reasoning, STEM, general utility, coding ability, censorship tendency, and approximate input cost.
**How is the data collected?**

Data is obtained from a variety of public benchmarks, fine-tuned tasks, and testing environments to ensure a fair comparison.
**What does mToK mean?**

The "$ mToK" column is the approximate cost, in US dollars, per thousand input tokens processed by the model.
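For example, at GPT-4's listed rate of $0.03 per 1K tokens, a 200K-token prompt costs roughly $6. A minimal sketch of that arithmetic (the helper function is illustrative; the rates are copied from the "$ mToK" column above):

```python
def input_cost_usd(num_tokens: int, rate_per_1k: float) -> float:
    """Approximate input cost: tokens are billed per thousand."""
    return (num_tokens / 1_000) * rate_per_1k

# USD per 1K input tokens, from the "$ mToK" column above.
rates = {"GPT-4": 0.03, "Claude-3": 0.02, "LLaMA-2": 0.01, "Gemini Pro": 0.025}

for model, rate in rates.items():
    print(f"{model}: ${input_cost_usd(200_000, rate):.2f} for 200K input tokens")
```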
**Will this be updated?**

Yes, we aim to keep the benchmark updated as new models and versions are released.