LLM Benchmark Table

Compare state-of-the-art large language model (LLM) performance across tasks and categories.

Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor.

FAQ

What does each column mean?
  • Model: The evaluated LLM's name.
  • TOTAL: Total test items.
  • Pass/Refine/Fail/Refusal: Breakdown of how many test items were passed, needed refinement, failed, or were refused.
  • $ mToK: Cost per million tokens in USD.
  • Reason, STEM, Utility, Code: Task-specific scores for Reasoning/Logic, Science/Math, Utility (general questions), and Coding.
  • Censor: Whether the model is heavily censored ("Yes") or not ("No").
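
For illustration, one row of the table can be modeled as a small record. This is a minimal sketch: the field names mirror the columns described above, but the types, names, and example values are assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass

# Hypothetical sketch of a single benchmark-table row.
# Field names mirror the columns above; types/values are assumed.
@dataclass
class BenchmarkRow:
    model: str                 # evaluated LLM's name
    total: int                 # total test items
    passed: int                # items answered correctly
    refine: int                # items needing guidance/clarification
    fail: int                  # items answered incorrectly
    refusal: int               # items the model declined to answer
    cost_per_mtok_usd: float   # cost per million tokens in USD
    reason: float              # reasoning/logic score
    stem: float                # science/math score
    utility: float             # general-question score
    code: float                # coding score
    censor: bool               # True if heavily censored

# Example row with made-up numbers, purely to show the shape of the data.
row = BenchmarkRow("ExampleModel", 100, 70, 15, 10, 5, 3.0, 82.5, 78.0, 88.0, 75.0, False)
print(row)
```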
How is the data collected?
Data is aggregated from open-source LLM leaderboards, proprietary benchmarks, and standardized test sets, then normalized for comparability.
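
As a rough illustration of the normalization step, the sketch below assumes simple min-max scaling to a common 0-100 range; the page does not specify the actual method, so treat this only as an example of what "normalized for comparability" could mean.

```python
def min_max_normalize(scores):
    """Rescale raw scores from mixed sources to a common 0-100 range.

    Illustrative assumption only: simple min-max scaling; the benchmark's
    real normalization may differ.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [100.0 for _ in scores]  # all values identical; treat as equal
    return [100.0 * (s - lo) / (hi - lo) for s in scores]

print(min_max_normalize([0.42, 0.77, 0.91]))  # -> [0.0, 71.43, 100.0] (approx.)
```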
What do Pass, Refine, Fail, and Refusal mean?
Pass: Model answered correctly.
Refine: Model required guidance or clarification to reach the correct answer.
Fail: Model failed to provide a correct answer.
Refusal: Model declined to answer, usually due to safety/censorship policies.
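
A small sketch of how such a breakdown might be summarized is shown below. The weighting (full credit for Pass, half credit for Refine, none for Fail or Refusal) is an assumption for illustration, not the benchmark's actual scoring rule.

```python
def summarize(passed, refine, fail, refusal):
    """Turn Pass/Refine/Fail/Refusal counts into simple rates.

    Assumed, illustrative scoring: Pass = 1.0, Refine = 0.5,
    Fail/Refusal = 0.0. The real benchmark's weights may differ.
    """
    total = passed + refine + fail + refusal
    return {
        "total": total,
        "pass_rate": passed / total,
        "refusal_rate": refusal / total,
        "weighted_score": (passed + 0.5 * refine) / total,
    }

print(summarize(passed=70, refine=15, fail=10, refusal=5))
# -> {'total': 100, 'pass_rate': 0.7, 'refusal_rate': 0.05, 'weighted_score': 0.775}
```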
How do I compare models efficiently?
Use the search box to filter by model or reason. Click column headers to sort by scores or cost. Toggle columns to focus on relevant categories.
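
Outside the web interface, the same filter/sort/column workflow can be reproduced on exported data. The sketch below uses made-up rows and field names and is only an offline analogue of the table's controls.

```python
# Hypothetical rows with invented values, shaped loosely like the table.
rows = [
    {"model": "ModelA", "code": 75.0, "cost_per_mtok_usd": 3.0},
    {"model": "ModelB", "code": 68.0, "cost_per_mtok_usd": 1.2},
]

# Filter by model name (like the search box) ...
matches = [r for r in rows if "modela" in r["model"].lower()]

# ... sort by cost (like clicking a column header) ...
cheapest_first = sorted(rows, key=lambda r: r["cost_per_mtok_usd"])

# ... and show only the categories of interest (like toggling columns).
for r in cheapest_first:
    print(r["model"], r["code"], r["cost_per_mtok_usd"])
```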
How often is this benchmark updated?
The benchmark is updated on an ongoing basis, as new models are released and new results become available.