Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 9500 | 9000 | 300 | 100 | 50 | $2500 | High accuracy | 87% | 92% | 85% | Moderate |
GPT-3.5 | 8000 | 7500 | 400 | 100 | 50 | $2000 | Good performance | 80% | 88% | 78% | Moderate |
BLOOM | 6500 | 6200 | 200 | 100 | 0 | $1500 | Open-source | 75% | 80% | 70% | Low |
LLaMA | 7000 | 6800 | 150 | 50 | 0 | $1800 | Efficient | 78% | 85% | 72% | Low |
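For anyone who wants to script against these numbers, each row maps naturally onto a small typed record. The sketch below is purely illustrative: the field names are assumptions that mirror the column headers, not the leaderboard's actual data model.

```typescript
// Illustrative shape for one leaderboard row. Field names are assumptions
// that mirror the column headers above, not the site's real schema.
interface LeaderboardRow {
  model: string;    // Model
  total: number;    // TOTAL
  pass: number;     // Pass
  refine: number;   // Refine
  fail: number;     // Fail
  refusal: number;  // Refusal
  costUsd: number;  // "$ mToK" column, in dollars
  reason: string;   // Reason (short note)
  stem: number;     // STEM, percent
  utility: number;  // Utility, percent
  code: number;     // Code, percent
  censor: string;   // Censor rating, e.g. "Low" or "Moderate"
}

// The four rows of the table above, transcribed verbatim.
const rows: LeaderboardRow[] = [
  { model: "GPT-4",   total: 9500, pass: 9000, refine: 300, fail: 100, refusal: 50, costUsd: 2500, reason: "High accuracy",    stem: 87, utility: 92, code: 85, censor: "Moderate" },
  { model: "GPT-3.5", total: 8000, pass: 7500, refine: 400, fail: 100, refusal: 50, costUsd: 2000, reason: "Good performance", stem: 80, utility: 88, code: 78, censor: "Moderate" },
  { model: "BLOOM",   total: 6500, pass: 6200, refine: 200, fail: 100, refusal: 0,  costUsd: 1500, reason: "Open-source",      stem: 75, utility: 80, code: 70, censor: "Low" },
  { model: "LLaMA",   total: 7000, pass: 6800, refine: 150, fail: 50,  refusal: 0,  costUsd: 1800, reason: "Efficient",        stem: 78, utility: 85, code: 72, censor: "Low" },
];
```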
Performance Visuals
[Bar Chart Placeholder]
[Pie Chart Placeholder]
Frequently Asked Questions
What does the "TOTAL" score represent?
The "TOTAL" score indicates the overall performance score of the model across various metrics in the benchmark.
How is "$ mToK" calculated?
It represents the model's monetary value, scaled in thousands of dollars and derived from its efficiency and performance.
What does "Refuse" mean?
"Refusal" indicates the number of times the model refused to answer or refused to comply with a prompt.
Can I compare models directly?
Yes. You can sort the table by any column and search for specific models to compare their performance side by side.
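As a rough illustration of that workflow, here is a sketch of column sorting and model search. It reuses the hypothetical LeaderboardRow type and `rows` sample defined under the table; the function names are illustrative only and do not reflect the site's actual implementation.

```typescript
// Sketch of the compare workflow: sort by a numeric column, filter by model name.
type NumericColumn = "total" | "pass" | "refine" | "fail" | "refusal" | "costUsd" | "stem" | "utility" | "code";

// Return a new array sorted descending by the chosen column.
function sortBy(data: LeaderboardRow[], column: NumericColumn): LeaderboardRow[] {
  return [...data].sort((a, b) => b[column] - a[column]);
}

// Case-insensitive substring match on the model name.
function searchModels(data: LeaderboardRow[], query: string): LeaderboardRow[] {
  const q = query.trim().toLowerCase();
  return data.filter((row) => row.model.toLowerCase().includes(q));
}

// Example: rank the GPT-family models by their Code score.
const ranked = sortBy(searchModels(rows, "gpt"), "code");
console.log(ranked.map((r) => `${r.model}: ${r.code}%`)); // ["GPT-4: 85%", "GPT-3.5: 78%"]
```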
How do I switch to dark mode?
Click the moon icon in the header to toggle between light and dark themes.
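For the curious, a minimal sketch of how such a toggle is typically wired up, assuming a "dark" class on the root element, a preference persisted in localStorage, and a placeholder "theme-toggle" element id; the site's real implementation may differ.

```typescript
// Minimal dark-mode toggle sketch. The "theme-toggle" id and the "dark"
// class convention are assumptions, not the site's actual markup.
const toggle = document.getElementById("theme-toggle");

function applyTheme(dark: boolean): void {
  document.documentElement.classList.toggle("dark", dark);
  localStorage.setItem("theme", dark ? "dark" : "light");
}

// Restore the saved preference on load, defaulting to light.
applyTheme(localStorage.getItem("theme") === "dark");

// Flip the theme whenever the moon icon is clicked.
toggle?.addEventListener("click", () => {
  const isDark = document.documentElement.classList.contains("dark");
  applyTheme(!isDark);
});
```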