A sleek comparison of AI model performance across various benchmarks.
Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 95 | 80 | 10 | 5 | 0 | 1.2 | 92 | 88 | 96 | 97 | 85 |
Llama 2 | 85 | 70 | 10 | 5 | 0 | 0.8 | 82 | 78 | 86 | 87 | 75 |
Claude | 90 | 75 | 12 | 3 | 0 | 1.0 | 88 | 85 | 92 | 93 | 80 |
Gemini | 88 | 72 | 11 | 5 | 0 | 0.9 | 85 | 80 | 89 | 90 | 78 |
Mistral | 82 | 68 | 9 | 5 | 0 | 0.7 | 80 | 75 | 84 | 85 | 72 |
TOTAL is the overall performance score aggregated from all benchmarks.
$/MTok is the cost in dollars per million tokens on each benchmark, indicating cost efficiency.
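The exact weighting behind TOTAL isn't specified here; the snippet below is a minimal sketch that assumes a plain unweighted average of the per-category scores (Reason, STEM, Utility, Code). The `models` data layout and the `aggregate_total` helper are illustrative names, not part of this project, and the result will not exactly reproduce the TOTAL column above.

```python
# Minimal sketch of one possible TOTAL aggregation: an unweighted average of
# the per-category benchmark scores. The actual weighting used in the table
# is not specified, so treat this purely as an illustration.

# Illustrative data copied from the table; the field names are assumptions.
models = {
    "GPT-4":   {"Reason": 92, "STEM": 88, "Utility": 96, "Code": 97},
    "Llama 2": {"Reason": 82, "STEM": 78, "Utility": 86, "Code": 87},
    "Claude":  {"Reason": 88, "STEM": 85, "Utility": 92, "Code": 93},
}

def aggregate_total(scores: dict[str, float]) -> float:
    """Unweighted mean of the category scores, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)

for name, scores in models.items():
    print(f"{name}: {aggregate_total(scores)}")
```

A weighted average would be a natural variation if some categories matter more for your use case.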
Data is compiled from public benchmarks and community contributions. Always verify with official sources.
Contact us via the form (not implemented in this demo).