LLM Benchmark Table

What does the "TOTAL" score represent?

The TOTAL score is a weighted composite of all evaluation metrics and serves as an overall performance indicator for each model. It combines the scores for reasoning ability, task completion, and specialized capabilities.
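
As a rough illustration of what a weighted composite means, the sketch below computes a weighted average over per-metric scores. The metric names (reasoning, utility, code) and the weights are hypothetical placeholders, since the actual weighting used for TOTAL is not published here.

```python
# Hypothetical weights: the real weighting behind TOTAL is not stated in this FAQ.
# Integer weights are normalized by their sum, so only their ratios matter.
WEIGHTS = {"reasoning": 4, "utility": 3, "code": 3}


def total_score(scores, weights=WEIGHTS):
    """Return the weighted average of the per-metric scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    weight_sum = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / weight_sum


# Made-up scores for a single model, on a 0-100 scale.
example = {"reasoning": 80.0, "utility": 70.0, "code": 90.0}
print(total_score(example))  # prints 80.0
```

Dividing by the sum of the weights keeps the composite on the same scale as the individual metrics, whatever weights are chosen.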

How are the Pass/Refine/Fail/Refusal categories determined?

These categories represent different response outcomes in standardized testing scenarios:

Pass: Correct and complete response
Refine: Partially correct; needs revision to be complete
Fail: Incorrect or irrelevant response
Refusal: Model declined to answer
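
To show how the four categories could be turned into per-model rates, here is a minimal sketch that tallies a list of graded responses. The `Outcome` enum, the `outcome_rates` helper, and the sample data are illustrative assumptions, not the benchmark's actual grading pipeline.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """The four response categories described above."""
    PASS = "Pass"
    REFINE = "Refine"
    FAIL = "Fail"
    REFUSAL = "Refusal"


def outcome_rates(outcomes):
    """Return each category's share of all graded responses (0.0 to 1.0)."""
    if not outcomes:
        raise ValueError("no graded responses to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Iterate over the enum so categories with zero responses still appear.
    return {outcome: counts[outcome] / total for outcome in Outcome}


# Made-up grading results for eight test prompts.
graded = [
    Outcome.PASS, Outcome.PASS, Outcome.REFINE, Outcome.FAIL,
    Outcome.REFUSAL, Outcome.PASS, Outcome.PASS, Outcome.FAIL,
]
rates = outcome_rates(graded)
for outcome, rate in rates.items():
    print(f"{outcome.value}: {rate:.1%}")
```

Because every response falls into exactly one category, the four rates always sum to 1.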

What is the difference between Utility and Code scores?

Utility measures general task completion ability across diverse domains (writing, analysis, etc.), while Code specifically evaluates programming capability including code generation, debugging, and explanation.

How often is this benchmark data updated?

We aim to update the benchmark data quarterly. Major model releases may trigger interim updates. The last update was May 15, 2024.

Can I contribute to or suggest improvements for these benchmarks?

Yes! We welcome community input. Please contact us through the feedback form (coming soon) with your suggestions for test cases or evaluation methodologies.

Frequently Asked Questions