Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|
FAQ
Pass: Number of tests the model answered correctly on the first try.
Refine: Number of tests the model passed after refining the answer.
Fail: Number of tests the model answered incorrectly.
Refusal: Number of tests where the model refused to answer due to content or other reasons.
The $ mToK
column shows the estimated cost in US dollars per million tokens consumed by the model, based on published pricing or estimated values.
These columns represent performance or behavior scores in specific categories:
STEM: Performance in Science, Technology, Engineering, and Math tasks.
Utility: General usefulness in everyday or productivity tasks.
Code: Ability to write, understand, or debug code.
Censor: Level of content moderation or refusal to answer sensitive queries.
Yes! Click or keyboard-navigate to any column header to sort ascending or descending by that column's data. Press Enter or Space when focused to toggle sort direction.
Yes, click the moon icon button near the top right of the page to toggle dark mode for better night-time readability and reduced eye strain.