LLM Benchmark Table
[Interactive benchmark table with columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor]

Frequently Asked Questions

What do the different metrics mean?
TOTAL: Overall performance score combining all metrics.
Pass: Percentage of tests passed successfully.
Refine: Percentage of responses needing refinement.
Fail: Percentage of tests failed.
Refusal: Percentage of requests refused by the model.
$ mToK: Cost in dollars per million tokens.
Reason: Reasoning capability score.
STEM: Science, Technology, Engineering, Math score.
Utility: General usefulness and versatility score.
Code: Code generation and understanding score.
Censor: Content moderation sensitivity level.
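For readers who script against this data, a table row can be modeled as a simple record. Below is a minimal Python sketch; the class and field names are our own, inferred from the column list above, not an official schema:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    """One row of the benchmark table (hypothetical schema)."""
    model: str            # model name
    total: float          # overall score combining all metrics
    pass_rate: float      # % of tests passed
    refine: float         # % of responses needing refinement
    fail: float           # % of tests failed
    refusal: float        # % of requests refused
    usd_per_mtok: float   # cost in dollars per million tokens ($ mToK)
    reason: float         # reasoning capability score
    stem: float           # science/tech/engineering/math score
    utility: float        # general usefulness score
    code: float           # code generation and understanding score
    censor: float         # content moderation sensitivity level
```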
How often is this data updated?
The benchmark data is updated monthly with the latest model releases and performance metrics, and we test continuously to keep results accurate and relevant. Subscribe to our newsletter for updates on new models and improvements.
Which model should I choose for my use case?
It depends on your needs (a code sketch follows this list):
High Performance: Choose models with high TOTAL scores
Budget-Friendly: Look for low-cost ($ mToK) models
Code Development: Prioritize high Code scores
Research/Analysis: Focus on Reason and STEM scores
General Purpose: Look for balanced scores across all metrics
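To make these rules concrete, here is a minimal Python sketch of a budget-capped selection; the rows, scores, and cost threshold are all hypothetical, for illustration only:

```python
# Hypothetical rows; real values come from the table above.
rows = [
    {"model": "model-a", "total": 82.0, "usd_per_mtok": 3.00, "code": 88.0},
    {"model": "model-b", "total": 74.0, "usd_per_mtok": 0.40, "code": 70.0},
]

def best_for_code(rows, max_usd_per_mtok=5.0):
    """Pick the highest-Code model within a cost cap (threshold is arbitrary)."""
    affordable = [r for r in rows if r["usd_per_mtok"] <= max_usd_per_mtok]
    return max(affordable, key=lambda r: r["code"], default=None)

print(best_for_code(rows))  # model-a wins on Code under a $5/mTok cap
```

The same pattern works for any guideline above: swap the cost cap and the sort key for whichever metric you prioritize.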
What is the "Reason" metric?
The Reason metric measures a model's ability to perform logical reasoning, multi-step problem solving, and complex analytical tasks. Higher scores indicate better performance in tasks requiring chain-of-thought reasoning and deductive logic.
How are costs calculated?
Costs are displayed as dollars per million tokens (mToK), covering both input and output token pricing, which allows fair comparison across models with different pricing structures. Lower values indicate more cost-effective models.
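As a worked example, assuming the $ mToK value behaves as a single blended rate (the table does not specify exactly how input and output prices are combined), estimating a request's cost is simple arithmetic:

```python
def estimate_cost(input_tokens: int, output_tokens: int, usd_per_mtok: float) -> float:
    """Estimate request cost, assuming a single blended $/mTok rate."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * usd_per_mtok

# 20k input + 5k output tokens at $3.00/mTok: 25,000 / 1,000,000 * 3.00 = $0.075
print(estimate_cost(20_000, 5_000, 3.00))
```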
What is the "Censor" metric?
The Censor metric indicates the level of content moderation applied by each model. Higher values indicate stricter content filtering and more refusals on potentially sensitive topics. Choose based on your application's content policy requirements.
Can I filter by specific categories?
Yes! Use the search box to filter models by name or category. The sort dropdown allows you to prioritize by different metrics. You can also click on column headers to sort by specific scores. Combine multiple filters for precise results.
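If you copy the table's data out of the page, the same search-and-sort behavior can be reproduced in a few lines; a minimal sketch with hypothetical rows and assumed field names:

```python
# Hypothetical rows with assumed field names; export the real table to use this.
rows = [
    {"model": "alpha-chat", "total": 78.5, "usd_per_mtok": 1.20},
    {"model": "beta-coder", "total": 84.0, "usd_per_mtok": 6.50},
]

def search_and_sort(rows, query="", sort_key="total", descending=True):
    """Filter by substring match on the model name, then sort by one metric."""
    hits = [r for r in rows if query.lower() in r["model"].lower()]
    return sorted(hits, key=lambda r: r[sort_key], reverse=descending)

# Cheapest models whose name contains "coder":
print(search_and_sort(rows, query="coder", sort_key="usd_per_mtok", descending=False))
```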
Is there an API for this data?
Currently, the data is available through this interactive table. We're working on providing an API for developers to integrate benchmark data into their applications. Check back soon or contact us for early access information.