LLM Benchmark Table

Model      TOTAL   Pass   Refine   Fail   Refusal   $ mToK   Reason   STEM   Utility   Code   Censor
GPT-4        100     78       12      5         5     0.06
GPT-3.5      100     55       20     15        10     0.02
PaLM 2       100     62       18     12         8     0.04
LLaMA 2      100     47       25     18        10     0.015
Claude 2     100     70       15      8         7     0.05

FAQ

What does mToK mean?

mToK stands for “million tokens per dollar.” It indicates how many million tokens of prompt and completion you get for each dollar spent.
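Under that definition, the table's $ mToK column can be converted into tokens per dollar or dollars per million tokens. The sketch below does exactly that arithmetic; the figures are copied from the table above, and nothing else is assumed.

```python
# Convert the table's mToK figures (million tokens per dollar) into
# tokens-per-dollar and dollars-per-million-tokens for easier comparison.
MTOK = {
    "GPT-4": 0.06,
    "GPT-3.5": 0.02,
    "PaLM 2": 0.04,
    "LLaMA 2": 0.015,
    "Claude 2": 0.05,
}

for model, mtok in MTOK.items():
    tokens_per_dollar = mtok * 1_000_000   # e.g. 0.06 mToK -> 60,000 tokens per $1
    dollars_per_million = 1 / mtok         # e.g. 0.06 mToK -> ~$16.67 per 1M tokens
    print(f"{model:<10} {tokens_per_dollar:>10,.0f} tok/$   ${dollars_per_million:,.2f} / 1M tok")
```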

How often is this data updated?

We aim to update benchmarks monthly, based on publicly available test suites and user contributions.

How is the benchmark measured?

Scores such as Pass, Refine, Fail, and Refusal come from standardized evaluation tasks across multiple domains (reasoning, STEM, coding, etc.); each response is graded into one of those categories and the counts are summed per model, as sketched below.
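A minimal sketch of that aggregation step, assuming hypothetical per-task outcome labels ("pass", "refine", "fail", "refusal") and made-up results; this is an illustration, not the project's actual evaluation harness.

```python
from collections import Counter

# Hypothetical graded outcomes for one model across its evaluation tasks.
outcomes = ["pass", "pass", "refine", "fail", "refusal", "pass", "refine"]

def tally(outcomes):
    """Aggregate per-task outcomes into the table's summary columns."""
    counts = Counter(outcomes)
    return {
        "TOTAL": len(outcomes),
        "Pass": counts["pass"],
        "Refine": counts["refine"],
        "Fail": counts["fail"],
        "Refusal": counts["refusal"],
    }

print(tally(outcomes))
# {'TOTAL': 7, 'Pass': 3, 'Refine': 2, 'Fail': 1, 'Refusal': 1}
```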