LLM Performance Comparison
Compare the latest large language models across multiple benchmarks including reasoning, STEM, utility, code generation, and more. Find the best model for your needs.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $/mTok | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
Frequently Asked Questions
What do Pass, Refine, Fail, and Refusal mean?
- Pass: The model provided a correct and complete answer on the first attempt.
- Refine: The model provided a partially correct answer that needed refinement or follow-up questions to reach the correct solution.
- Fail: The model provided an incorrect answer or failed to solve the problem.
- Refusal: The model refused to answer the question, typically due to safety or ethical concerns.
How is the TOTAL score calculated?
The TOTAL score is a weighted average across all benchmark categories (Reason, STEM, Utility, Code), with Pass answers weighted at 1.0, Refine at 0.7, Refusal at 0.3, and Fail at 0.0. This provides a comprehensive view of model performance across different task types.
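As an illustration, here is a minimal sketch of that calculation in Python. The outcome weights come from the description above; equal weighting of the four categories, along with every function and field name, is an assumption for illustration rather than the site's actual implementation.

```python
# Sketch of the TOTAL calculation described above. Outcome weights are taken from
# the text; equal category weighting and all names are illustrative assumptions.
OUTCOME_WEIGHTS = {"pass": 1.0, "refine": 0.7, "refusal": 0.3, "fail": 0.0}

def category_score(counts: dict[str, int]) -> float:
    """Outcome-weighted share of a category's questions, as a percentage."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 100.0 * sum(OUTCOME_WEIGHTS[o] * c for o, c in counts.items()) / n

def total_score(per_category: dict[str, dict[str, int]]) -> float:
    """TOTAL: mean of the per-category scores (equal category weights assumed)."""
    scores = [category_score(c) for c in per_category.values()]
    return sum(scores) / len(scores) if scores else 0.0

# Invented example counts: 10 questions per category.
results = {
    "reason":  {"pass": 8, "refine": 1, "fail": 1, "refusal": 0},
    "stem":    {"pass": 6, "refine": 2, "fail": 2, "refusal": 0},
    "utility": {"pass": 9, "refine": 0, "fail": 0, "refusal": 1},
    "code":    {"pass": 7, "refine": 2, "fail": 1, "refusal": 0},
}
print(round(total_score(results), 1))  # 84.5
```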
What does each category score measure?
Each category score represents the model's performance in that specific domain (a sketch of how a full table row might be represented follows this list):
- Reason: Logical reasoning and problem-solving abilities
- STEM: Performance on science, technology, engineering, and math problems
- Utility: General usefulness for everyday tasks and questions
- Code: Ability to generate, explain, and debug programming code
- Censor: How heavily the model filters its responses (lower is less censored)
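For intuition, a single table row can be pictured as one record whose fields mirror the columns above. This is only a sketch; the field names and types are assumptions, not the site's actual data format.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    """One leaderboard row; field names and types are illustrative only."""
    model: str
    total: float         # TOTAL: outcome-weighted average across categories
    pass_rate: float     # share of first-attempt correct answers (percent)
    refine: float        # share of answers needing follow-up (percent)
    fail: float          # share of incorrect answers (percent)
    refusal: float       # share of refused questions (percent)
    usd_per_mtok: float  # $/mTok: blended cost per million tokens
    reason: float        # category scores (percent)
    stem: float
    utility: float
    code: float
    censor: float        # response filtering; lower means less censored
```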
How often is the benchmark data updated?
We aim to update the benchmark data monthly, typically within the first week of each month. The last update was on June 1, 2023. We also add new models as they're released.
Why do costs vary so much between models?
Cost per million tokens ($/mTok) varies based on several factors:
- Model size: Larger models with more parameters typically cost more to run
- Infrastructure: Some providers have more efficient hardware or better optimization
- Business model: Some companies subsidize costs to gain market share
- Specialization: Models fine-tuned for specific tasks may have different pricing
The cost shown is the average across input and output tokens when available.
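As a rough sketch, a blended figure can be computed like this; the 50/50 split mirrors the simple average mentioned above, while the example prices and the weighted variant are invented for illustration.

```python
def blended_cost_per_mtok(input_price: float, output_price: float,
                          input_share: float = 0.5) -> float:
    """Blend input and output prices (both in $ per million tokens).

    The 50/50 default mirrors the simple average described above; the
    weighted variant is an assumption for workloads that skew one way.
    """
    return input_share * input_price + (1.0 - input_share) * output_price

# Hypothetical prices: $3/mTok input, $15/mTok output.
print(blended_cost_per_mtok(3.0, 15.0))       # 9.0 with an even split
print(blended_cost_per_mtok(3.0, 15.0, 0.8))  # 5.4 for an input-heavy workload
```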
How do I choose the right model for my needs?
Consider these factors when choosing a model:
- Your use case: If you need coding help, look at the Code scores. For math/science, focus on STEM.
- Budget: Balance performance with cost; sometimes a slightly less capable model is much cheaper.
- Response quality: Some applications need high Pass rates, others can tolerate more Refine answers.
- Censorship: If you need unfiltered responses, look for lower Censor scores.
- Refusal rate: High refusal rates may indicate the model is too cautious for your needs.
You can sort the table by any column to find the best model for your specific requirements.
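For example, the same kind of filtering and sorting can be done programmatically once the table data is exported; the model names, scores, and field names below are invented for illustration, not real benchmark results.

```python
# Invented rows mirroring a few of the table columns.
models = [
    {"model": "model-a", "total": 84.5, "code": 88.0, "censor": 12.0, "usd_per_mtok": 9.0},
    {"model": "model-b", "total": 79.0, "code": 91.0, "censor": 30.0, "usd_per_mtok": 2.5},
    {"model": "model-c", "total": 88.0, "code": 80.0, "censor": 45.0, "usd_per_mtok": 30.0},
]

# Coding on a budget: best Code score among models under $10 per million tokens.
affordable = [m for m in models if m["usd_per_mtok"] < 10.0]
print(max(affordable, key=lambda m: m["code"])["model"])  # model-b

# Least filtered output: sort ascending by Censor (lower is less censored).
print([m["model"] for m in sorted(models, key=lambda m: m["censor"])])
# ['model-a', 'model-b', 'model-c']
```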