LLM Benchmark Table

A comprehensive comparison of large language models across multiple dimensions, presented as an interactive table with sortable columns and performance visualizations.

Overall Performance

Cost Efficiency

Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor.

Frequently Asked Questions

How are these benchmarks calculated?

Our benchmarks are calculated through a comprehensive suite of tests across multiple domains. Each model is evaluated on standardized tasks in reasoning, STEM, code generation, utility, and censorship behavior. Scores are normalized to a 100-point scale for easy comparison.

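The exact task mix and weighting are not published above, but the normalization step can be sketched. The Python snippet below is a minimal illustration that assumes equal-weighted domains and per-domain pass rates between 0 and 1; the domain names and the equal weighting are assumptions, not the scoring code behind the table.

```python
# Minimal illustration only: the actual task suite and weighting are not published here.
# Assumes equal-weighted domains and pass rates expressed as fractions between 0 and 1.

DOMAINS = ["reason", "stem", "utility", "code", "censor"]  # assumed domain names


def normalize_scores(pass_rates: dict[str, float]) -> dict[str, float]:
    """Scale each domain's pass rate (0..1) onto a 100-point scale."""
    return {domain: round(pass_rates[domain] * 100, 1) for domain in DOMAINS}


def total_score(scores: dict[str, float]) -> float:
    """Equal-weighted average of the per-domain scores (an assumption)."""
    return round(sum(scores[d] for d in DOMAINS) / len(DOMAINS), 1)


raw = {"reason": 0.82, "stem": 0.74, "utility": 0.91, "code": 0.68, "censor": 0.95}
scores = normalize_scores(raw)
print(scores)               # {'reason': 82.0, 'stem': 74.0, 'utility': 91.0, 'code': 68.0, 'censor': 95.0}
print(total_score(scores))  # 82.0
```
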
What does "mToK" mean in the cost column?

"mToK" stands for millions of tokens per $1,000. It measures each model's cost efficiency: a higher number means you get more tokens for every $1,000 spent, making the model more cost-effective for large-scale deployments.

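As a quick arithmetic illustration, converting a price quoted in dollars per million tokens into mToK is a single division. The prices in this Python sketch are made-up examples, not real provider quotes.

```python
# Illustrative arithmetic only; the prices below are made-up examples, not real quotes.

def mtok(price_per_million_tokens_usd: float) -> float:
    """Millions of tokens bought with $1,000 at a given price per 1M tokens."""
    return 1000 / price_per_million_tokens_usd

print(mtok(2.50))  # 400.0  -> $2.50 per 1M tokens buys 400M tokens per $1,000
print(mtok(0.25))  # 4000.0 -> $0.25 per 1M tokens buys 4,000M tokens per $1,000
```
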
How often is this data updated?

We update our benchmarks monthly to reflect the latest model versions and performance improvements. Additionally, we perform ad-hoc updates whenever significant new models are released or major updates are deployed by providers.

Can I use this data for my research or presentation?

Yes! You're welcome to use our benchmark data for research, presentations, or any other purpose. We only ask that you cite LLM Benchmark Table as your source. For bulk data access or API integration, please contact us.

How do you handle model updates and versioning?

We track model versions carefully and note when benchmarks were run against specific versions. When a model is updated, we re-run our evaluation suite and update the results. Historical data is maintained for comparison purposes.
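
As a hypothetical sketch of that bookkeeping (the field names and values below are assumptions, not the actual schema behind the table), each result can be stored as a record that ties the score to the specific model version and evaluation date:

```python
# Hypothetical record layout, not the actual schema behind the table: it illustrates
# tying a score to the exact model version and evaluation date so historical runs
# remain comparable. All names and numbers below are made up.
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkRun:
    model: str          # model family, e.g. "example-model"
    model_version: str  # provider's version string at evaluation time
    evaluated_on: date  # when the evaluation suite was run
    total_score: float  # 0-100 aggregate score

history = [
    BenchmarkRun("example-model", "v1.0", date(2024, 6, 1), 78.4),
    BenchmarkRun("example-model", "v1.1", date(2024, 9, 1), 81.2),
]

# Older runs are kept rather than overwritten, so version-to-version changes stay visible.
latest = max(history, key=lambda run: run.evaluated_on)
print(latest.model_version, latest.total_score)  # v1.1 81.2
```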