Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 9500 | 9000 | 300 | 100 | 50 | $2500 | High accuracy | 87% | 92% | 85% | Moderate |
GPT-3.5 | 8000 | 7500 | 400 | 100 | 50 | $2000 | Good performance | 80% | 88% | 78% | Moderate |
BLOOM | 6500 | 6200 | 200 | 100 | 0 | $1500 | Open-source | 75% | 80% | 70% | Low |
LLaMA | 7000 | 6800 | 150 | 50 | 0 | $1800 | Efficient | 78% | 85% | 72% | Low |
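For anyone who wants to script against these numbers, each row maps naturally onto a small typed record. The sketch below is purely illustrative: the field names are assumptions that mirror the column headers, not the leaderboard's actual data model.

```typescript
// Illustrative shape for one leaderboard row. Field names are assumptions
// that mirror the column headers above, not the site's real schema.
interface LeaderboardRow {
  model: string;    // Model
  total: number;    // TOTAL
  pass: number;     // Pass
  refine: number;   // Refine
  fail: number;     // Fail
  refusal: number;  // Refusal
  costUsd: number;  // "$ mToK" column, in dollars
  reason: string;   // Reason (short note)
  stem: number;     // STEM, percent
  utility: number;  // Utility, percent
  code: number;     // Code, percent
  censor: string;   // Censor rating, e.g. "Low" or "Moderate"
}

// The four rows of the table above, transcribed verbatim.
const rows: LeaderboardRow[] = [
  { model: "GPT-4",   total: 9500, pass: 9000, refine: 300, fail: 100, refusal: 50, costUsd: 2500, reason: "High accuracy",    stem: 87, utility: 92, code: 85, censor: "Moderate" },
  { model: "GPT-3.5", total: 8000, pass: 7500, refine: 400, fail: 100, refusal: 50, costUsd: 2000, reason: "Good performance", stem: 80, utility: 88, code: 78, censor: "Moderate" },
  { model: "BLOOM",   total: 6500, pass: 6200, refine: 200, fail: 100, refusal: 0,  costUsd: 1500, reason: "Open-source",      stem: 75, utility: 80, code: 70, censor: "Low" },
  { model: "LLaMA",   total: 7000, pass: 6800, refine: 150, fail: 50,  refusal: 0,  costUsd: 1800, reason: "Efficient",        stem: 78, utility: 85, code: 72, censor: "Low" },
];
```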
Performance Visuals
[Bar Chart Placeholder]
[Pie Chart Placeholder]
Frequently Asked Questions
What does the "TOTAL" score represent?
The "TOTAL" score indicates the overall performance score of the model across various metrics in the benchmark.
How is "$ mToK" calculated?
It represents the model's monetary value, scaled in thousands of dollars and derived from its efficiency and performance.
What does "Refuse" mean?
"Refusal" indicates the number of times the model refused to answer or refused to comply with a prompt.
Can I compare models directly?
Yes. You can sort the table by any column and search for specific models to compare their performance side by side.
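As a rough illustration of that workflow, here is a sketch of column sorting and model search. It reuses the hypothetical LeaderboardRow type and `rows` sample defined under the table; the function names are illustrative only and do not reflect the site's actual implementation.

```typescript
// Sketch of the compare workflow: sort by a numeric column, filter by model name.
type NumericColumn = "total" | "pass" | "refine" | "fail" | "refusal" | "costUsd" | "stem" | "utility" | "code";

// Return a new array sorted descending by the chosen column.
function sortBy(data: LeaderboardRow[], column: NumericColumn): LeaderboardRow[] {
  return [...data].sort((a, b) => b[column] - a[column]);
}

// Case-insensitive substring match on the model name.
function searchModels(data: LeaderboardRow[], query: string): LeaderboardRow[] {
  const q = query.trim().toLowerCase();
  return data.filter((row) => row.model.toLowerCase().includes(q));
}

// Example: rank the GPT-family models by their Code score.
const ranked = sortBy(searchModels(rows, "gpt"), "code");
console.log(ranked.map((r) => `${r.model}: ${r.code}%`)); // ["GPT-4: 85%", "GPT-3.5: 78%"]
```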
How do I switch to dark mode?
Click the moon icon in the header to toggle between light and dark themes.
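For the curious, a minimal sketch of how such a toggle is typically wired up, assuming a "dark" class on the root element, a preference persisted in localStorage, and a placeholder "theme-toggle" element id; the site's real implementation may differ.

```typescript
// Minimal dark-mode toggle sketch. The "theme-toggle" id and the "dark"
// class convention are assumptions, not the site's actual markup.
const toggle = document.getElementById("theme-toggle");

function applyTheme(dark: boolean): void {
  document.documentElement.classList.toggle("dark", dark);
  localStorage.setItem("theme", dark ? "dark" : "light");
}

// Restore the saved preference on load, defaulting to light.
applyTheme(localStorage.getItem("theme") === "dark");

// Flip the theme whenever the moon icon is clicked.
toggle?.addEventListener("click", () => {
  const isDark = document.documentElement.classList.contains("dark");
  applyTheme(!isDark);
});
```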