LLM Benchmark Table

Model      TOTAL   Pass   Refine   Fail   Refusal   $ mToK   Reason   STEM   Utility   Code   Censor
GPT-4        100     78       12      5         5     0.06
GPT-3.5      100     55       20     15        10     0.02
PaLM 2       100     62       18     12         8     0.04
LLaMA 2      100     47       25     18        10     0.015
Claude 2     100     70       15      8         7     0.05

FAQ

What does mToK mean?

mToK stands for “million tokens per dollar.” It indicates how many million tokens of prompt and completion you get for each dollar spent.
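Under that definition, the table's $ mToK column can be converted into tokens per dollar or dollars per million tokens. The sketch below does exactly that arithmetic; the figures are copied from the table above, and nothing else is assumed.

```python
# Convert the table's mToK figures (million tokens per dollar) into
# tokens-per-dollar and dollars-per-million-tokens for easier comparison.
MTOK = {
    "GPT-4": 0.06,
    "GPT-3.5": 0.02,
    "PaLM 2": 0.04,
    "LLaMA 2": 0.015,
    "Claude 2": 0.05,
}

for model, mtok in MTOK.items():
    tokens_per_dollar = mtok * 1_000_000   # e.g. 0.06 mToK -> 60,000 tokens per $1
    dollars_per_million = 1 / mtok         # e.g. 0.06 mToK -> ~$16.67 per 1M tokens
    print(f"{model:<10} {tokens_per_dollar:>10,.0f} tok/$   ${dollars_per_million:,.2f} / 1M tok")
```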

How often is this data updated?

We aim to update benchmarks monthly, based on publicly available test suites and user contributions.

How is the benchmark measured?

Scores such as Pass, Refine, Fail, and Refusal come from standardized evaluation tasks across multiple domains (reasoning, STEM, coding, etc.); each response is graded into one of those categories and the counts are summed per model, as sketched below.
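A minimal sketch of that aggregation step, assuming hypothetical per-task outcome labels ("pass", "refine", "fail", "refusal") and made-up results; this is an illustration, not the project's actual evaluation harness.

```python
from collections import Counter

# Hypothetical graded outcomes for one model across its evaluation tasks.
outcomes = ["pass", "pass", "refine", "fail", "refusal", "pass", "refine"]

def tally(outcomes):
    """Aggregate per-task outcomes into the table's summary columns."""
    counts = Counter(outcomes)
    return {
        "TOTAL": len(outcomes),
        "Pass": counts["pass"],
        "Refine": counts["refine"],
        "Fail": counts["fail"],
        "Refusal": counts["refusal"],
    }

print(tally(outcomes))
# {'TOTAL': 7, 'Pass': 3, 'Refine': 2, 'Fail': 1, 'Refusal': 1}
```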