LLM Benchmark Table

Model	TOTAL	Pass	Refine	Fail	Refusal	$ mToK	Reason	STEM	Utility	Code	Censor
GPT-4	96%	92%	4%	2%	2%	$30	98%	95%	97%	93%	5%
LLaMA-2 70B	89%	85%	4%	6%	5%	$5	90%	85%	87%	84%	7%
Claude-2	93%	89%	4%	4%	3%	$15	95%	90%	92%	91%	6%
Palm-2	88%	84%	4%	8%	4%	$20	89%	87%	88%	85%	5%

What does TOTAL represent?

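TOTAL is the overall benchmark score, aggregated across multiple tasks. In the table above, each model's TOTAL equals its Pass plus Refine percentages (for GPT-4, 92% + 4% = 96%).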
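
As a quick sanity check, the sketch below tests whether TOTAL equals Pass plus Refine for every row. The figures are copied from the table; the names ROWS and total_matches_pass_plus_refine are hypothetical, and reading TOTAL as "Pass + Refine" is an observation about these four rows, not a confirmed definition.

```python
# Rows copied from the table: (model, TOTAL, Pass, Refine, Fail, Refusal), all in percent.
ROWS = [
    ("GPT-4", 96, 92, 4, 2, 2),
    ("LLaMA-2 70B", 89, 85, 4, 6, 5),
    ("Claude-2", 93, 89, 4, 4, 3),
    ("Palm-2", 88, 84, 4, 8, 4),
]


def total_matches_pass_plus_refine(row):
    """Return True if the row's TOTAL equals Pass + Refine (an assumed relationship)."""
    _, total, passed, refine, _, _ = row
    return total == passed + refine


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], total_matches_pass_plus_refine(row))
```

If a future row fails this check, TOTAL is presumably computed some other way and the assumption should be dropped.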

How is $ mToK calculated?

"$ mToK" is the approximate cost, in dollars, per million tokens processed by the model. For example, GPT-4 is listed at $30 per million tokens.

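To make the "$ mToK" column concrete, here is a minimal sketch that turns a per-million-token price into an estimated cost for a given workload. The function name estimate_cost and the 250,000-token workload are assumptions for illustration; the $30 price is GPT-4's figure from the table. It assumes a single flat rate, which the table implies but does not confirm.

```python
def estimate_cost(tokens, price_per_million):
    """Estimate cost in dollars for `tokens` tokens at a flat per-million-token price."""
    if tokens < 0 or price_per_million < 0:
        raise ValueError("tokens and price_per_million must be non-negative")
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical workload of 250,000 tokens at GPT-4's listed $30 per million tokens.
    print(estimate_cost(250_000, 30))  # 7.5
```
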
What is the meaning of Refusal?

Refusal is the percentage of tasks that the model explicitly declined to attempt. It is reported separately from Fail.
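
In every row of the table, Pass, Refine, Fail, and Refusal sum to 100%, which suggests each task lands in exactly one of those four outcomes. Under that assumption, the sketch below converts raw outcome counts into percentages. The function name outcome_rates and the 50-task counts are hypothetical, chosen so the result matches the GPT-4 row.

```python
def outcome_rates(counts):
    """Convert per-outcome task counts into percentages of all tasks.

    Assumes every task is counted under exactly one outcome.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tasks recorded")
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical run of 50 tasks, not data from the benchmark itself.
    rates = outcome_rates({"pass": 46, "refine": 2, "fail": 1, "refusal": 1})
    print(rates["refusal"])  # 2.0
```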
