A sleek comparison of AI model performance across various benchmarks.
Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 95 | 80 | 10 | 5 | 0 | 1.2 | 92 | 88 | 96 | 97 | 85 |
Llama 2 | 85 | 70 | 10 | 5 | 0 | 0.8 | 82 | 78 | 86 | 87 | 75 |
Claude | 90 | 75 | 12 | 3 | 0 | 1.0 | 88 | 85 | 92 | 93 | 80 |
Gemini | 88 | 72 | 11 | 5 | 0 | 0.9 | 85 | 80 | 89 | 90 | 78 |
Mistral | 82 | 68 | 9 | 5 | 0 | 0.7 | 80 | 75 | 84 | 85 | 72 |
TOTAL is the overall performance score aggregated from all benchmarks.
$/MTok is the cost in dollars per million tokens on each benchmark, indicating cost efficiency.
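The exact weighting behind TOTAL isn't specified here; the snippet below is a minimal sketch that assumes a plain unweighted average of the per-category scores (Reason, STEM, Utility, Code). The `models` data layout and the `aggregate_total` helper are illustrative names, not part of this project, and the result will not exactly reproduce the TOTAL column above.

```python
# Minimal sketch of one possible TOTAL aggregation: an unweighted average of
# the per-category benchmark scores. The actual weighting used in the table
# is not specified, so treat this purely as an illustration.

# Illustrative data copied from the table; the field names are assumptions.
models = {
    "GPT-4":   {"Reason": 92, "STEM": 88, "Utility": 96, "Code": 97},
    "Llama 2": {"Reason": 82, "STEM": 78, "Utility": 86, "Code": 87},
    "Claude":  {"Reason": 88, "STEM": 85, "Utility": 92, "Code": 93},
}

def aggregate_total(scores: dict[str, float]) -> float:
    """Unweighted mean of the category scores, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)

for name, scores in models.items():
    print(f"{name}: {aggregate_total(scores)}")
```

A weighted average would be a natural variation if some categories matter more for your use case.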
Data is compiled from public benchmarks and community contributions. Always verify with official sources.
Contact us via the form (not implemented in this demo).