Model | TOTAL | Pass | Refine | Fail | Refusal | mToK/$ | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 100 | 78 | 12 | 5 | 5 | 0.06 | | | | | |
GPT-3.5 | 100 | 55 | 20 | 15 | 10 | 0.02 | | | | | |
PaLM 2 | 100 | 62 | 18 | 12 | 8 | 0.04 | | | | | |
LLaMA 2 | 100 | 47 | 25 | 18 | 10 | 0.015 | | | | | |
Claude 2 | 100 | 70 | 15 | 8 | 7 | 0.05 | | | | | |
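Note that the four outcome columns (Pass, Refine, Fail, Refusal) sum to TOTAL for every row. As a quick sanity check, here is a minimal Python sketch, with the row data copied from the table above, that verifies this invariant and derives each model's pass rate:

```python
# Sanity check: Pass + Refine + Fail + Refusal should equal TOTAL for each model.
rows = {
    "GPT-4":    {"total": 100, "pass": 78, "refine": 12, "fail": 5,  "refusal": 5},
    "GPT-3.5":  {"total": 100, "pass": 55, "refine": 20, "fail": 15, "refusal": 10},
    "PaLM 2":   {"total": 100, "pass": 62, "refine": 18, "fail": 12, "refusal": 8},
    "LLaMA 2":  {"total": 100, "pass": 47, "refine": 25, "fail": 18, "refusal": 10},
    "Claude 2": {"total": 100, "pass": 70, "refine": 15, "fail": 8,  "refusal": 7},
}

for model, r in rows.items():
    outcomes = r["pass"] + r["refine"] + r["fail"] + r["refusal"]
    assert outcomes == r["total"], f"{model}: outcomes sum to {outcomes}, not {r['total']}"
    print(f"{model}: pass rate {r['pass'] / r['total']:.0%}")
```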
FAQ
What does mToK mean?
mToK stands for “million tokens per dollar.” It indicates how many tokens of prompt/inference you get per dollar spent.
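For example, under this definition GPT-4's value of 0.06 means 0.06 × 1,000,000 = 60,000 tokens per dollar, or roughly $16.67 per million tokens. A minimal Python sketch of the conversion (the function names are illustrative, not part of any published tooling):

```python
def tokens_per_dollar(mtok_per_dollar: float) -> float:
    """Convert an mToK value (million tokens per dollar) to raw tokens per dollar."""
    return mtok_per_dollar * 1_000_000

def dollars_per_million_tokens(mtok_per_dollar: float) -> float:
    """Invert an mToK value to get the price of one million tokens in dollars."""
    return 1 / mtok_per_dollar

# GPT-4's mToK value from the table above:
print(tokens_per_dollar(0.06))           # 60000.0 tokens per dollar
print(dollars_per_million_tokens(0.06))  # ~16.67 dollars per million tokens
```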
How often is this data updated?
We aim to update benchmarks monthly, based on publicly available test suites and user contributions.
How is the benchmark measured?
The Pass, Refine, Fail, and Refusal counts come from standardized evaluation tasks across multiple domains (STEM, reasoning, coding, etc.); each task's outcome is tallied into exactly one of the four categories, as sketched below.
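A minimal sketch of that tallying step, assuming each evaluation task yields one of the four outcome labels (the label names and the example results are hypothetical stand-ins, not the actual harness):

```python
from collections import Counter

# Hypothetical per-task outcome labels; a real run would produce one per task.
OUTCOMES = ("pass", "refine", "fail", "refusal")

def tally(task_results: list[str]) -> Counter:
    """Aggregate per-task outcome labels into the four scoreboard categories."""
    counts = Counter(task_results)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {unknown}")
    return counts

# Example: 100 task results reproduce a row like GPT-4's (78/12/5/5).
results = ["pass"] * 78 + ["refine"] * 12 + ["fail"] * 5 + ["refusal"] * 5
counts = tally(results)
print({k: counts[k] for k in OUTCOMES})  # {'pass': 78, 'refine': 12, 'fail': 5, 'refusal': 5}
```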