| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|

- Pass: The model successfully completed the task.
- Refine: The model provided a partially correct response that needs refinement.
- Fail: The model failed to complete the task correctly.
- Refusal: The model refused to answer the query.
- $ mToK: Cost per million tokens for the model.
- Reason/STEM/Utility/Code/Censor: Performance in specific capability categories.

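
As an illustration of how the four outcome columns relate to individual task results, here is a minimal sketch that tallies per-task labels into Pass, Refine, Fail, and Refusal counts. The function name `tally_outcomes` and the input format (one label per task) are assumptions made for this example, not the benchmark's actual pipeline. The legend does not state how TOTAL is derived, so the sketch does not attempt to compute it.

```python
from collections import Counter

# Outcome labels taken from the legend above.
OUTCOMES = ("Pass", "Refine", "Fail", "Refusal")


def tally_outcomes(results):
    """Count how many task results fall into each outcome category.

    `results` is a hypothetical list of outcome labels, one per task.
    """
    unknown = set(results) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(results)
    # Report every category, including those that never occurred.
    return {label: counts.get(label, 0) for label in OUTCOMES}


# Made-up labels for illustration only; not real benchmark data.
example = ["Pass", "Pass", "Refine", "Fail", "Refusal", "Pass"]
print(tally_outcomes(example))
# → {'Pass': 3, 'Refine': 1, 'Fail': 1, 'Refusal': 1}
```

Reporting zero counts explicitly keeps every row of the table the same shape, even for a model that never refuses or never fails.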
We update our benchmark data quarterly or when significant new model versions are released.
Please contact us if you have new benchmark results or suggestions for improving our evaluation methodology.