Comprehensive AI Model Performance Comparison
| # | Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|-------|-------|------|--------|------|---------|--------|--------|------|---------|------|--------|
- **TOTAL**: Overall performance score (0-100) across all benchmark categories.
- **Pass**: Percentage of tasks completed successfully on the first attempt.
- **Refine**: Percentage of tasks requiring iteration or clarification.
- **Fail**: Percentage of tasks the model could not complete.
- **Refusal**: Percentage of tasks the model refused to attempt.
- **$ mToK**: Cost per million tokens (average of input and output pricing); see the sketch after this list.
- **Reason**: Logical reasoning, problem-solving, and analytical thinking.
- **STEM**: Performance on science, technology, engineering, and mathematics tasks.
- **Utility**: Practical task completion, instruction following, and general helpfulness.
- **Code**: Programming ability, including code generation, debugging, and comprehension.
- **Censor**: Content moderation score; lower means more restrictive, higher means more permissive.
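Because $ mToK is defined as a simple average of input and output pricing, and Pass, Refine, Fail, and Refusal are shares of the same task set, the two relationships can be shown in a minimal Python sketch. This is for illustration only; the function and parameter names are ours and are not part of the benchmark.

```python
# Minimal sketch (not the benchmark's actual code): how the "$ mToK" column
# and the four outcome-rate columns relate. All names are hypothetical.

def blended_cost_per_mtok(input_price: float, output_price: float) -> float:
    """Simple average of input and output price per million tokens,
    matching the '$ mToK' definition above."""
    return (input_price + output_price) / 2.0

def outcome_rates_consistent(pass_pct: float, refine_pct: float,
                             fail_pct: float, refusal_pct: float,
                             tolerance: float = 0.5) -> bool:
    """Pass, Refine, Fail, and Refusal are shares of the same task set,
    so they should sum to roughly 100%."""
    total = pass_pct + refine_pct + fail_pct + refusal_pct
    return abs(total - 100.0) <= tolerance

# Example with made-up prices and rates:
print(blended_cost_per_mtok(3.00, 15.00))               # -> 9.0 ($/mToK)
print(outcome_rates_consistent(62.0, 21.5, 12.5, 4.0))  # -> True
```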
Our benchmark suite includes over 10,000 diverse tasks across multiple categories. Each model is tested under identical conditions with standardized prompts. Tasks are evaluated by a combination of automated scoring and expert human review to ensure accuracy and fairness.
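As a rough illustration of how an automated score and expert human reviews might be blended into a single task score, here is a hypothetical Python sketch. The 0-1 scale, the 50/50 weighting, and the function names are assumptions for the example; the benchmark's actual aggregation is not published here.

```python
# Hypothetical blend of automated scoring and expert human review for one task.
# Weighting and score scale are assumptions, not the benchmark's real pipeline.

from statistics import mean

def task_score(automated: float, human_reviews: list[float],
               human_weight: float = 0.5) -> float:
    """Blend an automated score (0-1) with the mean of expert review scores (0-1)."""
    if not human_reviews:
        return automated
    human = mean(human_reviews)
    return (1 - human_weight) * automated + human_weight * human

print(task_score(0.9, [0.8, 1.0]))  # -> 0.9
```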
The benchmark table is updated monthly with the latest model versions and pricing. New models are added within 2 weeks of public release. We also re-test existing models quarterly to account for any API updates or improvements.
We welcome community feedback, benchmark task suggestions, and collaboration. Our methodology is transparent, and we are always looking to improve. Contact us through our GitHub repository or by email for more information.
The "best" model depends on your specific needs. Consider the category scores relevant to your use case, your budget ($ mToK), and whether you need specialized capabilities. High-scoring models aren't always necessary - sometimes a mid-tier model at lower cost is the optimal choice for production deployments.