A side-by-side comparison of AI model performance across key benchmark metrics.
Model | Total | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 1000 | 850 | 100 | 40 | 10 | $0.03 | High | 9.5 | 9.2 | 9.8 | Low |
Claude-3 | 1200 | 1020 | 120 | 50 | 10 | $0.025 | High | 9.3 | 9.4 | 9.6 | Medium |
Gemini Ultra | 1100 | 935 | 110 | 45 | 10 | $0.02 | Medium | 9.0 | 9.1 | 9.4 | Low |
Llama-2 70B | 900 | 720 | 90 | 70 | 20 | $0.01 | Medium | 8.5 | 8.8 | 9.0 | High |
Grok | 800 | 650 | 80 | 60 | 10 | $0.015 | High | 8.7 | 8.9 | 8.8 | Low |
This table compares large language models (LLMs) across key performance metrics: total tests run, pass/refine/fail/refusal counts, cost, and specialized scores for reasoning, STEM, utility, code, and censorship level.
The mini-charts next to "Pass" values represent pass rates as a percentage bar. Hover over rows for highlights, and sort columns by clicking headers.
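One way the "Pass" mini-charts could be generated is by dividing each model's Pass count by its Total and rendering the ratio as a fixed-width text bar. This is a minimal Python sketch, not the demo's actual rendering code; the `pass_bar` helper and the bar characters are illustrative assumptions.

```python
# Hypothetical sketch: render a pass rate as a fixed-width percentage bar.
def pass_bar(passed: int, total: int, width: int = 10) -> str:
    """Return a text bar whose filled length is proportional to passed/total."""
    rate = passed / total
    filled = round(rate * width)
    return "#" * filled + "-" * (width - filled) + f" {rate:.0%}"

# A few (Model, Total, Pass) rows taken from the table above.
rows = [("GPT-4", 1000, 850), ("Claude-3", 1200, 1020), ("Llama-2 70B", 900, 720)]
for model, total, passed in rows:
    print(f"{model:<12} {pass_bar(passed, total)}")
```

A real implementation would render these bars as styled HTML elements rather than text, but the proportional-fill calculation would be the same.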
Currently, this is a static demo. In a real implementation, we'd welcome contributions via a form (not included here).