LLM Benchmark Table

(Table columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor. Per-model results are listed in the table.)

Frequently Asked Questions

What do the different metrics mean?

Pass: The model successfully completed the task.

Refine: The model's response was partially correct and required refinement to be fully usable.

Fail: The model failed to complete the task correctly.

Refusal: The model refused to answer the query.

$ mToK: Cost per million tokens for the model.

Reason/STEM/Utility/Code/Censor: Per-category scores for reasoning, STEM, general utility, coding, and censorship-related tasks, respectively.
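As a rough illustration of how the count columns above relate to each other, here is a minimal Python sketch that tallies per-task outcomes into Pass/Refine/Fail/Refusal counts and derives a total score. The exact scoring formula behind the TOTAL column is not specified on this page; the half credit given to "refine" below is an illustrative assumption, not the benchmark's actual weighting.

```python
from collections import Counter

# Outcome labels matching the table's metric columns.
OUTCOMES = ("pass", "refine", "fail", "refusal")

def summarize(results):
    """Tally a list of per-task outcome labels into the count columns."""
    counts = Counter(results)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unrecognized outcomes: {unknown}")
    return {o: counts.get(o, 0) for o in OUTCOMES}

def total_score(counts):
    """Hypothetical TOTAL: full credit for pass, half credit for refine.

    This weighting is an assumption for illustration only.
    """
    graded = sum(counts.values())
    if graded == 0:
        return 0.0
    return (counts["pass"] + 0.5 * counts["refine"]) / graded

results = ["pass", "pass", "refine", "fail", "refusal", "pass"]
counts = summarize(results)
print(counts)              # {'pass': 3, 'refine': 1, 'fail': 1, 'refusal': 1}
print(total_score(counts)) # 0.5833...
```

A real evaluation harness would attach model names, task categories, and token costs to each result, but the aggregation step reduces to a tally like this.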

How often is this data updated?

We update our benchmark data quarterly, or sooner when significant new model versions are released.

How can I contribute to this benchmark?

Please contact us if you have new benchmark results or suggestions for improving our evaluation methodology.