A side-by-side comparison of AI model performance across key benchmark metrics.
Model | Total | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 1000 | 850 | 100 | 40 | 10 | $0.03 | High | 9.5 | 9.2 | 9.8 | Low |
Claude-3 | 1200 | 1020 | 120 | 50 | 10 | $0.025 | High | 9.3 | 9.4 | 9.6 | Medium |
Gemini Ultra | 1100 | 935 | 110 | 45 | 10 | $0.02 | Medium | 9.0 | 9.1 | 9.4 | Low |
Llama-2 70B | 900 | 720 | 90 | 70 | 20 | $0.01 | Medium | 8.5 | 8.8 | 9.0 | High |
Grok | 800 | 650 | 80 | 60 | 10 | $0.015 | High | 8.7 | 8.9 | 8.8 | Low |
This table compares large language models (LLMs) across key performance metrics: total tests run, pass/refine/fail/refusal counts, cost, and specialized scores for reasoning, STEM, utility, code, and censorship level.
The mini-charts next to "Pass" values represent pass rates as a percentage bar. Hover over rows for highlights, and sort columns by clicking headers.
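One way the "Pass" mini-charts could be generated is by dividing each model's Pass count by its Total and rendering the ratio as a fixed-width text bar. This is a minimal Python sketch, not the demo's actual rendering code; the `pass_bar` helper and the bar characters are illustrative assumptions.

```python
# Hypothetical sketch: render a pass rate as a fixed-width percentage bar.
def pass_bar(passed: int, total: int, width: int = 10) -> str:
    """Return a text bar whose filled length is proportional to passed/total."""
    rate = passed / total
    filled = round(rate * width)
    return "#" * filled + "-" * (width - filled) + f" {rate:.0%}"

# A few (Model, Total, Pass) rows taken from the table above.
rows = [("GPT-4", 1000, 850), ("Claude-3", 1200, 1020), ("Llama-2 70B", 900, 720)]
for model, total, passed in rows:
    print(f"{model:<12} {pass_bar(passed, total)}")
```

A real implementation would render these bars as styled HTML elements rather than text, but the proportional-fill calculation would be the same.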
Currently, this is a static demo. In a real implementation, we'd welcome contributions via a form (not included here).