LLM Benchmark Table

Interactive performance comparison of modern language models

| Model | Total | Pass | Refine | Fail | Refusal | Cost ($/MTok) | Reasoning | STEM | Utility | Code | Censorship |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑4.1 | 92 | 88% | 6% | 4% | 2% | $0.03 | 95 | 96 | 94 | 97 | Low |
| Claude 3 Opus | 90 | 85% | 8% | 5% | 2% | $0.025 | 93 | 94 | 95 | 90 | Medium |
| Gemini Ultra | 86 | 80% | 10% | 7% | 3% | $0.02 | 90 | 92 | 89 | 88 | High |
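
If you wire this table up as an interactive component, the rows can be driven by a small typed data model. The sketch below is hypothetical (the interface and field names are illustrative, not taken from any particular implementation), populated with the demo values shown above.

```typescript
// Hypothetical data model for one benchmark row; field names are illustrative.
interface BenchmarkRow {
  model: string;
  total: number;        // composite score
  passPct: number;      // % of tasks solved on the first attempt
  refinePct: number;    // % solved only after follow-up prompts
  failPct: number;      // % never solved
  refusalPct: number;   // % refused outright
  costPerMTok: number;  // USD per million tokens
  reasoning: number;
  stem: number;
  utility: number;
  code: number;
  censorship: 'Low' | 'Medium' | 'High';
}

// Demo values from the table above; replace with your own evaluation data.
const rows: BenchmarkRow[] = [
  { model: 'GPT‑4.1',       total: 92, passPct: 88, refinePct: 6,  failPct: 4, refusalPct: 2, costPerMTok: 0.03,  reasoning: 95, stem: 96, utility: 94, code: 97, censorship: 'Low' },
  { model: 'Claude 3 Opus', total: 90, passPct: 85, refinePct: 8,  failPct: 5, refusalPct: 2, costPerMTok: 0.025, reasoning: 93, stem: 94, utility: 95, code: 90, censorship: 'Medium' },
  { model: 'Gemini Ultra',  total: 86, passPct: 80, refinePct: 10, failPct: 7, refusalPct: 3, costPerMTok: 0.02,  reasoning: 90, stem: 92, utility: 89, code: 88, censorship: 'High' },
];
```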

FAQ

What is this benchmark measuring?
The benchmark evaluates language models across reasoning, STEM, coding, safety behavior, refusals, and real‑world utility tasks.
What does “Refine” mean?
Refine indicates the model required follow‑up prompts or corrections before reaching a valid solution. Together, Pass, Refine, Fail, and Refusal account for 100% of attempts (see the sketch at the end of this FAQ).
Is this data official?
No. This table is a demonstration of how benchmarks can be visualized. Replace values with your own evaluation data.
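
As a sketch of how the Pass / Refine / Fail / Refusal columns might be produced from your own raw evaluation results, the outcome labels and helper below are hypothetical, not part of any published harness.

```typescript
// Hypothetical per-task outcome, matching the FAQ definitions:
// 'pass'    = solved on the first attempt
// 'refine'  = solved only after follow-up prompts or corrections
// 'fail'    = never produced a valid solution
// 'refusal' = declined to attempt the task
type Outcome = 'pass' | 'refine' | 'fail' | 'refusal';

// Tally raw outcomes into the percentage breakdown shown in the table.
// The four percentages sum to 100 (up to rounding).
function tallyOutcomes(outcomes: Outcome[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
  for (const o of outcomes) counts[o] += 1;
  const total = outcomes.length || 1; // avoid division by zero on an empty run
  return {
    pass: Math.round((counts.pass / total) * 100),
    refine: Math.round((counts.refine / total) * 100),
    fail: Math.round((counts.fail / total) * 100),
    refusal: Math.round((counts.refusal / total) * 100),
  };
}

// Example: a 50-task run -> { pass: 88, refine: 6, fail: 4, refusal: 2 }
const example = tallyOutcomes([
  ...Array<Outcome>(44).fill('pass'),
  ...Array<Outcome>(3).fill('refine'),
  ...Array<Outcome>(2).fill('fail'),
  ...Array<Outcome>(1).fill('refusal'),
]);
console.log(example);
```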