The table below compares performance metrics across state-of-the-art models.
| Model | Total | Pass | Refine | Fail | Refusal | Cost ($/MTok) | Reason | STEM | Utility | Code | Censorship |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 92 | 88 | 2 | 2 | 0 | 0.05 | 95 | 94 | 91 | 90 | Low |
| Claude 3.5 | 91 | 87 | 3 | 1 | 0 | 0.03 | 94 | 92 | 93 | 89 | Low |
| Llama 3 | 85 | 80 | 4 | 1 | 0 | 0.01 | 88 | 85 | 80 | 82 | Med |
The Total score is a weighted aggregate of the Pass, STEM, and Utility metrics. Cost is reported in USD per million tokens ($/MTok).
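As a minimal sketch of how such a weighted aggregate might be computed: the source does not state the actual weights, so the values below (0.6 / 0.2 / 0.2) are purely illustrative assumptions.

```python
def weighted_total(pass_rate: float, stem: float, utility: float,
                   weights: tuple = (0.6, 0.2, 0.2)) -> float:
    """Weighted aggregate of Pass, STEM, and Utility scores.

    The default weights are hypothetical -- the source table does not
    specify how the Total column is weighted.
    """
    w_pass, w_stem, w_util = weights
    assert abs(w_pass + w_stem + w_util - 1.0) < 1e-9, "weights must sum to 1"
    return round(w_pass * pass_rate + w_stem * stem + w_util * utility, 1)

# Example using GPT-4o's row (Pass=88, STEM=94, Utility=91):
print(weighted_total(88, 94, 91))
```

With different weight choices the aggregate shifts accordingly; the published Total values imply a specific weighting not documented here.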