Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
GPT-4 | 96% | 92% | 4% | 2% | 2% | $30 | 98% | 95% | 97% | 93% | 5% |
LLaMA-2 70B | 89% | 85% | 4% | 6% | 5% | $5 | 90% | 85% | 87% | 84% | 7% |
Claude-2 | 93% | 89% | 4% | 4% | 3% | $15 | 95% | 90% | 92% | 91% | 6% |
Palm-2 | 88% | 84% | 4% | 8% | 4% | $20 | 89% | 87% | 88% | 85% | 5% |
FAQ
What does TOTAL represent?
TOTAL is the overall benchmark score, aggregated across multiple tasks. In every row of the table it equals Pass plus Refine.
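A minimal Python sketch for checking the table's internal consistency: that TOTAL equals Pass + Refine, and that Pass, Refine, Fail, and Refusal sum to 100%. Both are observations from the four rows above, not documented formulas, and `parse_row` and `COLUMNS` are hypothetical names introduced only for this illustration.

```python
# Rows copied verbatim from the table above.
ROWS = """\
GPT-4 | 96% | 92% | 4% | 2% | 2% | $30 | 98% | 95% | 97% | 93% | 5% |
LLaMA-2 70B | 89% | 85% | 4% | 6% | 5% | $5 | 90% | 85% | 87% | 84% | 7% |
Claude-2 | 93% | 89% | 4% | 4% | 3% | $15 | 95% | 90% | 92% | 91% | 6% |
Palm-2 | 88% | 84% | 4% | 8% | 4% | $20 | 89% | 87% | 88% | 85% | 5% |
"""

COLUMNS = ["Model", "TOTAL", "Pass", "Refine", "Fail", "Refusal", "$ mToK",
           "Reason", "STEM", "Utility", "Code", "Censor"]


def parse_row(line):
    """Split one pipe-delimited row into a dict keyed by column name."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    row = {"Model": cells[0]}
    for name, cell in zip(COLUMNS[1:], cells[1:]):
        # Drop the '%' or '$' decoration and keep the integer value.
        row[name] = int(cell.strip("%$"))
    return row


table = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

for row in table:
    # Observed relationship (assumption, not a documented formula).
    assert row["TOTAL"] == row["Pass"] + row["Refine"], row["Model"]
    # The four outcome columns partition all tasks.
    outcomes = row["Pass"] + row["Refine"] + row["Fail"] + row["Refusal"]
    assert outcomes == 100, row["Model"]
```

If a future row fails either assertion, that points to a transcription error or to a scoring rule different from the one assumed here.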
How is $ mToK calculated?
"$ mToK" denotes the approximate cost per million tokens processed by the model.
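As a rough illustration of how a per-million-token price translates into a bill, the sketch below scales the "$ mToK" figure by a token count. It assumes a single flat rate per model; the table does not say whether the figure covers input tokens, output tokens, or a blend. The function name `estimate_cost_usd` and the 250,000-token workload are hypothetical.

```python
def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Estimated cost of processing `tokens` tokens at a flat per-million rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


# "$ mToK" values taken from the table above.
PRICES = {"GPT-4": 30, "LLaMA-2 70B": 5, "Claude-2": 15, "Palm-2": 20}

# Hypothetical workload of 250,000 tokens.
for model, price in PRICES.items():
    print(f"{model}: ${estimate_cost_usd(250_000, price):.2f}")
```

For the 250,000-token workload this gives $7.50 for GPT-4 and $1.25 for LLaMA-2 70B under the flat-rate assumption.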
What is the meaning of Refusal?
Refusal is the percentage of tasks the model explicitly declined to attempt.