Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 95 | 70 | 10 | 10 | 5 | 0.03 | 90 | 88 | 92 | 94 | 2 |
Claude-3 | 90 | 65 | 15 | 8 | 2 | 0.02 | 85 | 86 | 89 | 87 | 5 |
LLaMA-2 | 80 | 50 | 20 | 10 | 0 | 0.01 | 78 | 75 | 80 | 82 | 3 |
Gemini Pro | 85 | 55 | 18 | 10 | 2 | 0.025 | 80 | 82 | 84 | 86 | 4 |
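For readers who want to slice the numbers themselves, here is a minimal, hypothetical sketch that loads the headline figures into pandas and ranks the models by TOTAL score; the values are copied from the table above and pandas is assumed to be installed:

```python
import pandas as pd

# Headline figures copied from the benchmark table above.
data = [
    ("GPT-4",      95, 70, 10, 10, 5, 0.030),
    ("Claude-3",   90, 65, 15,  8, 2, 0.020),
    ("LLaMA-2",    80, 50, 20, 10, 0, 0.010),
    ("Gemini Pro", 85, 55, 18, 10, 2, 0.025),
]
cols = ["Model", "TOTAL", "Pass", "Refine", "Fail", "Refusal", "usd_per_1k_tokens"]
df = pd.DataFrame(data, columns=cols)

# Rank by overall score, highest first.
print(df.sort_values("TOTAL", ascending=False).to_string(index=False))
```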
FAQ
**What is this benchmark table?**

It compares the performance of several Large Language Models (LLMs) across the categories shown above: an overall score (TOTAL, broken down into Pass, Refine, Fail, and Refusal), reasoning, STEM, general utility, coding ability, censorship tendency, and approximate input cost.
**How is the data collected?**

Data is obtained from a variety of public benchmarks, fine-tuned tasks, and testing environments to ensure a fair comparison.
**What does mToK mean?**

The "$ mToK" column is the approximate cost, in US dollars, per thousand input tokens processed by the model.
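For example, at GPT-4's listed rate of $0.03 per 1K tokens, a 200K-token prompt costs roughly $6. A minimal sketch of that arithmetic (the helper function is illustrative; the rates are copied from the "$ mToK" column above):

```python
def input_cost_usd(num_tokens: int, rate_per_1k: float) -> float:
    """Approximate input cost: tokens are billed per thousand."""
    return (num_tokens / 1_000) * rate_per_1k

# USD per 1K input tokens, from the "$ mToK" column above.
rates = {"GPT-4": 0.03, "Claude-3": 0.02, "LLaMA-2": 0.01, "Gemini Pro": 0.025}

for model, rate in rates.items():
    print(f"{model}: ${input_cost_usd(200_000, rate):.2f} for 200K input tokens")
```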
**Will this be updated?**

Yes, we aim to keep the benchmark updated as new models and versions are released.