LLM Benchmark Table

| Model      | TOTAL | Pass | Refine | Fail | Refusal | $ / mToK | Reason | STEM | Utility | Code | Censor |
|------------|-------|------|--------|------|---------|----------|--------|------|---------|------|--------|
| GPT-4      | 95    | 70   | 10     | 10   | 5       | 0.03     | 90     | 88   | 92      | 94   | 2      |
| Claude-3   | 90    | 65   | 15     | 8    | 2       | 0.02     | 85     | 86   | 89      | 87   | 5      |
| LLaMA-2    | 80    | 50   | 20     | 10   | 0       | 0.01     | 78     | 75   | 80      | 82   | 3      |
| Gemini Pro | 85    | 55   | 18     | 10   | 2       | 0.025    | 80     | 82   | 84      | 86   | 4      |
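If you want to work with these numbers programmatically, the table maps naturally onto a small record type. The sketch below is a minimal Python example under stated assumptions: the values are copied from the table above, the field names are our own shorthand for the column headers, and cost_per_point is an illustrative helper metric, not part of the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRow:
    """One row of the benchmark table, values as published above."""
    model: str
    total: int          # TOTAL score
    passed: int         # Pass
    refine: int         # Refine
    fail: int           # Fail
    refusal: int        # Refusal
    usd_mtok: float     # "$ / mToK" column: cost per thousand tokens (see FAQ)
    reason: int         # Reason
    stem: int           # STEM
    utility: int        # Utility
    code: int           # Code
    censor: int         # Censor


ROWS = [
    BenchmarkRow("GPT-4",      95, 70, 10, 10, 5, 0.03,  90, 88, 92, 94, 2),
    BenchmarkRow("Claude-3",   90, 65, 15,  8, 2, 0.02,  85, 86, 89, 87, 5),
    BenchmarkRow("LLaMA-2",    80, 50, 20, 10, 0, 0.01,  78, 75, 80, 82, 3),
    BenchmarkRow("Gemini Pro", 85, 55, 18, 10, 2, 0.025, 80, 82, 84, 86, 4),
]


def cost_per_point(row: BenchmarkRow) -> float:
    """Illustrative metric: dollars per thousand tokens per TOTAL point."""
    return row.usd_mtok / row.total


# Rank models by how cheaply they buy each TOTAL point.
for row in sorted(ROWS, key=cost_per_point):
    print(f"{row.model:<10} TOTAL={row.total:<3} ${row.usd_mtok}/kTok "
          f"-> {cost_per_point(row):.5f} per point")
```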

FAQ

What is this benchmark table?

This table compares the performance of various Large Language Models (LLMs) across multiple categories, including accuracy, reasoning, coding ability, and censorship tendencies.

How is the data collected?

Data is gathered from a mix of public benchmarks, task-specific evaluations, and controlled testing environments so that models are compared under consistent conditions.

What does mToK mean?

It stands for cost per thousand tokens, i.e., the approximate price in dollars for the model to process 1,000 input tokens.
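As a worked example (with a hypothetical input size), a 50,000-token prompt sent to GPT-4 at $0.03 per thousand tokens would cost roughly 50,000 / 1,000 × $0.03 = $1.50.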

Will this be updated?

Yes, we aim to keep the benchmark updated as new models and versions are released.