LLM Performance Comparison

Compare the latest large language models across multiple benchmarks including reasoning, STEM, utility, code generation, and more. Find the best model for your needs.

Total Models: 12 (2 new this month)

Avg. Pass Rate: 68% (3% vs last month)

Top Model: GPT-4o (92% pass rate)

Cost Efficiency: $0.002/mTok (15% more efficient)

[Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $/mTok, Reason, STEM, Utility, Code, Censor]

Frequently Asked Questions

What do Pass, Refine, Fail, and Refusal mean?

Pass: The model provided a correct and complete answer on the first attempt.

Refine: The model provided a partially correct answer that needed refinement or follow-up questions to reach the correct solution.

Fail: The model provided an incorrect answer or failed to solve the problem.

Refusal: The model refused to answer the question, typically due to safety or ethical concerns.

How is the TOTAL score calculated?

The TOTAL score averages the four benchmark categories (Reason, STEM, Utility, Code), with each answer weighted by its outcome: Pass at 1.0, Refine at 0.7, Refusal at 0.3, and Fail at 0.0. This provides a comprehensive view of model performance across different task types.
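The outcome weighting can be sketched in Python. The weights come from the FAQ above; averaging the categories equally, the function names, and the example outcome data are assumptions for illustration, not real benchmark results.

```python
# Outcome weights as stated in the FAQ.
OUTCOME_WEIGHTS = {"pass": 1.0, "refine": 0.7, "refusal": 0.3, "fail": 0.0}

def category_score(outcomes):
    """Average the outcome weights for one benchmark category."""
    return sum(OUTCOME_WEIGHTS[o] for o in outcomes) / len(outcomes)

def total_score(categories):
    """TOTAL: average of the per-category scores (equal category
    weights are an assumption here)."""
    scores = [category_score(v) for v in categories.values()]
    return sum(scores) / len(scores)

# Made-up example outcomes, not real benchmark data.
example = {
    "reason": ["pass", "pass", "refine"],
    "stem": ["pass", "fail"],
    "utility": ["pass", "refine", "refusal"],
    "code": ["pass", "pass"],
}
print(round(total_score(example), 3))  # prints 0.767
```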

What do the category scores mean?

Each category score represents the model's performance in that specific domain:

  • Reason: Logical reasoning and problem-solving abilities
  • STEM: Performance on science, technology, engineering, and math problems
  • Utility: General usefulness for everyday tasks and questions
  • Code: Ability to generate, explain, and debug programming code
  • Censor: How heavily the model filters its responses (lower is less censored)

How often is this data updated?

We aim to update the benchmark data monthly, typically within the first week of each month. The last update was on June 1, 2023. We also add new models as they're released.

Why are some models more expensive than others?

Cost per million tokens ($/mTok) varies based on several factors:

  • Model size: Larger models with more parameters typically cost more to run
  • Infrastructure: Some providers have more efficient hardware or better optimization
  • Business model: Some companies subsidize costs to gain market share
  • Specialization: Models fine-tuned for specific tasks may have different pricing

The cost shown is the average across input and output tokens when available.
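One way to read that blended figure is as a simple average of the two prices; the exact weighting used in the table is not specified, so the function and prices below are illustrative assumptions only.

```python
# Illustrative only: a simple (unweighted) average of input and output
# token prices. The real table may weight input vs. output differently,
# e.g. by typical usage ratios.

def blended_cost(input_price, output_price):
    """Blend input and output $/mTok into one figure via a simple average."""
    return (input_price + output_price) / 2

print(round(blended_cost(0.0015, 0.0025), 6))  # prints 0.002
```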

How can I use this data to choose a model?

Consider these factors when choosing a model:

  1. Your use case: If you need coding help, look at the Code scores. For math/science, focus on STEM.
  2. Budget: Balance performance with cost; sometimes a slightly less capable model is much cheaper.
  3. Response quality: Some applications need high Pass rates, others can tolerate more Refine answers.
  4. Censorship: If you need unfiltered responses, look for lower Censor scores.
  5. Refusal rate: High refusal rates may indicate the model is too cautious for your needs.

You can sort the table by any column to find the best model for your specific requirements.
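The column sorting described above can be mimicked programmatically. A minimal sketch in Python, using made-up model names and scores rather than real benchmark data:

```python
# Hypothetical benchmark rows; values are illustrative only.
models = [
    {"model": "A", "total": 0.82, "code": 0.90, "cost": 0.004},
    {"model": "B", "total": 0.76, "code": 0.95, "cost": 0.002},
    {"model": "C", "total": 0.88, "code": 0.80, "cost": 0.010},
]

def sort_by(rows, column, descending=True):
    """Return rows ordered by the given column (highest first by default)."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)

# Best coding model first:
print([r["model"] for r in sort_by(models, "code")])  # prints ['B', 'A', 'C']

# Cheapest model first:
print([r["model"] for r in sort_by(models, "cost", descending=False)])
```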