LLM Benchmark Table
Comprehensive comparison of Large Language Models across multiple dimensions. Interactive table with sortable columns and performance visualizations.
Overall Performance
Cost Efficiency
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
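As a rough sketch of the row schema implied by the column headers above (the field names mirror the columns, but the types, and whether Pass/Refine/Fail/Refusal are counts or percentages, are assumptions), in Python:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    """One row of the benchmark table; field names mirror the column headers."""
    model: str      # Model name / identifier
    total: float    # TOTAL score on the normalized 0-100 scale
    passed: int     # Pass (renamed: `pass` is a reserved word in Python)
    refine: int     # Tasks solved only after refinement
    fail: int       # Failed tasks
    refusal: int    # Refused tasks
    mtok: float     # $ mToK: millions of tokens per $1,000 (see FAQ below)
    reason: float   # Reasoning score (0-100)
    stem: float     # STEM score (0-100)
    utility: float  # Utility score (0-100)
    code: float     # Code score (0-100)
    censor: float   # Censorship score (0-100)

def sort_rows(rows: list[BenchmarkRow], key: str, descending: bool = True) -> list[BenchmarkRow]:
    """Sortable-column behavior: order rows by any field, e.g. sort_rows(rows, "total")."""
    return sorted(rows, key=lambda r: getattr(r, key), reverse=descending)
```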
Frequently Asked Questions
How are the benchmarks calculated?
Our benchmarks are calculated through a comprehensive suite of tests across multiple domains. Each model is evaluated on standardized tasks in reasoning, STEM, code generation, utility, and censorship behavior, and scores are normalized to a 100-point scale for easy comparison.
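The exact task counts and domain weights aren't spelled out here, so the following is only a minimal sketch, assuming each domain score is a pass rate scaled to 100 and the overall score is an unweighted mean of the domains:

```python
# Hypothetical scoring sketch: task counts and weights below are illustrative
# assumptions, not the published methodology.
DOMAINS = ["reason", "stem", "code", "utility", "censor"]

def domain_score(passed: int, total_tasks: int) -> float:
    """Scale a raw pass count to the 0-100 range."""
    return 100.0 * passed / total_tasks

def total_score(domain_scores: dict[str, float]) -> float:
    """Unweighted mean across domains; the real weighting may differ."""
    return sum(domain_scores[d] for d in DOMAINS) / len(DOMAINS)

# Example with made-up pass counts out of 50 tasks per domain:
raw = {"reason": 41, "stem": 38, "code": 34, "utility": 45, "censor": 48}
scores = {d: domain_score(raw[d], 50) for d in DOMAINS}
print(round(total_score(scores), 1))  # 82.4
```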
"mToK" stands for "millions of Tokens for $1000". This metric helps you understand the cost efficiency of each model. A higher number means you get more tokens per $1000 spent, making it more cost-effective for large-scale deployments.
How often are the benchmarks updated?
We update our benchmarks monthly to reflect the latest model versions and performance improvements, and we run ad-hoc updates whenever a significant new model is released or a provider deploys a major update.
Can I use your benchmark data?
Yes! You're welcome to use our benchmark data for research, presentations, or any other purpose; we only ask that you cite LLM Benchmark Table as your source. For bulk data access or API integration, please contact us.
How do you handle model version changes?
We track model versions carefully and note which version each benchmark run was performed against. When a model is updated, we re-run our evaluation suite and update the results, and historical data is maintained for comparison.
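As a minimal sketch of how versioned results could be stored so that older runs remain available (the model name, version strings, dates, and scores below are placeholders, not real benchmark data):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BenchmarkRun:
    """One evaluation run, keyed by model version so historical results are kept."""
    model: str
    model_version: str  # provider's version string at evaluation time
    run_date: date
    total: float        # TOTAL score on the 0-100 scale

# Historical runs are appended rather than overwritten, so older versions
# remain available for comparison:
history: list[BenchmarkRun] = [
    BenchmarkRun("example-model", "2024-06-01", date(2024, 7, 1), 78.5),
    BenchmarkRun("example-model", "2024-09-15", date(2024, 10, 1), 81.2),
]

latest = max((r for r in history if r.model == "example-model"), key=lambda r: r.run_date)
print(latest.model_version, latest.total)  # 2024-09-15 81.2
```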