LLM Benchmark Table

Model   | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason           | STEM | Utility | Code | Censor
GPT-4   | 9500  | 9000 | 300    | 100  | 50      | $2500  | High accuracy    | 87%  | 92%     | 85%  | Moderate
GPT-3.5 | 8000  | 7500 | 400    | 100  | 50      | $2000  | Good performance | 80%  | 88%     | 78%  | Moderate
BLOOM   | 6500  | 6200 | 200    | 100  | 0       | $1500  | Open-source      | 75%  | 80%     | 70%  | Low
LLaMA   | 7000  | 6800 | 150    | 50   | 0       | $1800  | Efficient        | 78%  | 85%     | 72%  | Low
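
For anyone who wants to work with these numbers programmatically, the sketch below models one table row as a TypeScript object and derives a simple pass rate (Pass divided by TOTAL). The interface and field names are illustrative (they mirror the column headers above) rather than an official schema for the benchmark.

```typescript
// Illustrative shape for one row of the benchmark table (field names mirror the column headers).
interface BenchmarkRow {
  model: string;
  total: number;     // TOTAL
  pass: number;      // Pass
  refine: number;    // Refine
  fail: number;      // Fail
  refusal: number;   // Refusal
  costMToK: number;  // "$ mToK" column, in dollars
  reason: string;    // Reason (free-text note)
  stem: number;      // STEM, %
  utility: number;   // Utility, %
  code: number;      // Code, %
  censor: string;    // Censor (e.g. "Low", "Moderate")
}

// The first two rows of the table above, transcribed as data.
const rows: BenchmarkRow[] = [
  { model: "GPT-4",   total: 9500, pass: 9000, refine: 300, fail: 100, refusal: 50,
    costMToK: 2500, reason: "High accuracy",    stem: 87, utility: 92, code: 85, censor: "Moderate" },
  { model: "GPT-3.5", total: 8000, pass: 7500, refine: 400, fail: 100, refusal: 50,
    costMToK: 2000, reason: "Good performance", stem: 80, utility: 88, code: 78, censor: "Moderate" },
];

// Pass rate as a share of TOTAL, e.g. GPT-4: 9000 / 9500 ≈ 94.7%.
const passRate = (r: BenchmarkRow): number => r.pass / r.total;

rows.forEach(r => console.log(`${r.model}: ${(passRate(r) * 100).toFixed(1)}%`));
```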

Performance Visuals

[Bar Chart Placeholder]
[Pie Chart Placeholder]

Frequently Asked Questions

What does the "TOTAL" score represent?
The "TOTAL" score indicates the overall performance score of the model across various metrics in the benchmark.
How is "$ mToK" calculated?
"$ mToK" is a monetary figure, expressed in thousands of dollars, derived from the model's efficiency and performance.
What does "Refuse" mean?
"Refusal" indicates the number of times the model refused to answer or refused to comply with a prompt.
Can I compare models directly?
Yes! You can sort any column or search for specific models to compare their performance easily; a sketch of the sorting step follows below.
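
As a rough illustration of what column sorting involves, here is a hypothetical sortBy helper in TypeScript; it is not the page's actual code, just a sketch that works with the BenchmarkRow objects from the earlier example.

```typescript
// Hypothetical helper, not the page's actual code: sort rows of any table by one column.
function sortBy<T>(data: T[], key: keyof T, ascending = false): T[] {
  return [...data].sort((a, b) => {
    const av: unknown = a[key];
    const bv: unknown = b[key];
    const cmp =
      typeof av === "number" && typeof bv === "number"
        ? av - bv                                // numeric columns (TOTAL, Pass, STEM, ...)
        : String(av).localeCompare(String(bv));  // text columns (Model, Reason, Censor)
    return ascending ? cmp : -cmp;               // descending by default
  });
}

// With the `rows` array from the earlier sketch:
// sortBy(rows, "stem")            -> highest STEM score first
// sortBy(rows, "costMToK", true)  -> lowest "$ mToK" figure first
```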
How do I switch to dark mode?
Click the moon icon in the header to toggle between light and dark themes.
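
For context, a toggle like this is usually wired up by flipping a class on the document root and remembering the choice. The snippet below is a generic illustration only; the "theme-toggle" id, "dark" class, and localStorage key are assumptions, not this page's actual implementation.

```typescript
// Generic light/dark toggle sketch; element id, class name, and storage key are assumed.
const toggle = document.getElementById("theme-toggle");

toggle?.addEventListener("click", () => {
  const dark = document.documentElement.classList.toggle("dark");
  // Persist the choice so the preference survives a page reload.
  localStorage.setItem("theme", dark ? "dark" : "light");
});

// On load, restore the stored preference if there is one.
if (localStorage.getItem("theme") === "dark") {
  document.documentElement.classList.add("dark");
}
```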