LLM Benchmark Table

A sleek comparison of AI model performance across various benchmarks.

Model    TOTAL  Pass  Refine  Fail  Refusal  $/mTok  Reason  STEM  Utility  Code  Censor
GPT-4       95    80      10     5        0     1.2      92    88       96    97      85
Llama 2     85    70      10     5        0     0.8      82    78       86    87      75
Claude      90    75      12     3        0     1.0      88    85       92    93      80
Gemini      88    72      11     5        0     0.9      85    80       89    90      78
Mistral     82    68       9     5        0     0.7      80    75       84    85      72
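
If you want to work with these numbers in code, the sketch below models the table as a plain typed array and sorts it by TOTAL. It is a minimal TypeScript sketch for this demo only; the interface name BenchmarkRow and field names such as usdPerMTok are assumptions, not an official schema.

```typescript
// Minimal sketch: the benchmark table above as a typed array of records.
// The interface and field names (e.g. usdPerMTok) are assumptions for this
// demo, not an official schema.
interface BenchmarkRow {
  model: string;
  total: number;      // overall aggregated score (TOTAL)
  pass: number;
  refine: number;
  fail: number;
  refusal: number;
  usdPerMTok: number; // assumed reading of the $/mTok column
  reason: number;
  stem: number;
  utility: number;
  code: number;
  censor: number;
}

const rows: BenchmarkRow[] = [
  { model: "GPT-4",   total: 95, pass: 80, refine: 10, fail: 5, refusal: 0, usdPerMTok: 1.2, reason: 92, stem: 88, utility: 96, code: 97, censor: 85 },
  { model: "Llama 2", total: 85, pass: 70, refine: 10, fail: 5, refusal: 0, usdPerMTok: 0.8, reason: 82, stem: 78, utility: 86, code: 87, censor: 75 },
  { model: "Claude",  total: 90, pass: 75, refine: 12, fail: 3, refusal: 0, usdPerMTok: 1.0, reason: 88, stem: 85, utility: 92, code: 93, censor: 80 },
  { model: "Gemini",  total: 88, pass: 72, refine: 11, fail: 5, refusal: 0, usdPerMTok: 0.9, reason: 85, stem: 80, utility: 89, code: 90, censor: 78 },
  { model: "Mistral", total: 82, pass: 68, refine: 9,  fail: 5, refusal: 0, usdPerMTok: 0.7, reason: 80, stem: 75, utility: 84, code: 85, censor: 72 },
];

// Example: list the models from highest to lowest TOTAL.
const byTotal = [...rows].sort((a, b) => b.total - a.total);
console.log(byTotal.map((r) => `${r.model}: ${r.total}`).join("\n"));
```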

Visual Comparison of TOTAL Scores

GPT-4: 95
Llama 2: 85
Claude: 90
Gemini: 88
Mistral: 82
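
For a text-only rendering of the same comparison, a small TypeScript sketch like the one below can print rough bars from the TOTAL scores; the one-character-per-two-points scale is an arbitrary choice made here for readability.

```typescript
// Sketch: print the TOTAL scores above as rough text bars.
// One "#" per two points is an arbitrary scale chosen for readability.
const totals: Record<string, number> = {
  "GPT-4": 95,
  "Llama 2": 85,
  "Claude": 90,
  "Gemini": 88,
  "Mistral": 82,
};

for (const [model, score] of Object.entries(totals)) {
  const bar = "#".repeat(Math.round(score / 2));
  console.log(`${model.padEnd(8)} ${bar} ${score}`);
}
```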

FAQ

What does TOTAL represent?

TOTAL is the overall performance score aggregated from all benchmarks.

What is $/mTok?

It is the price in US dollars per million tokens, a rough indicator of how cost-efficient a model is to run.
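
If that reading (dollars per million tokens) is correct, cost estimates are simple arithmetic. The TypeScript sketch below uses a hypothetical 50,000-token run size purely for illustration.

```typescript
// Sketch: back-of-envelope cost estimate, assuming the $/mTok column
// is US dollars per million tokens. The 50,000-token run is hypothetical.
function runCostUsd(tokens: number, usdPerMTok: number): number {
  return (tokens / 1_000_000) * usdPerMTok;
}

// e.g. a 50,000-token run at the table's GPT-4 rate of $1.2/mTok:
console.log(runCostUsd(50_000, 1.2).toFixed(3)); // "0.060"
```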

How is the data sourced?

Data is compiled from public benchmarks and community contributions. Always verify with official sources.

Can I contribute data?

Yes, contact us via the form (not implemented in this demo).