LLM Benchmark Table

Comparison of AI model performance on benchmark tests
Columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mTok, Reason, STEM, Utility, Code, Censor

FAQ

What do Pass, Refine, Fail, and Refusal mean?
Pass: Number of tests the model answered correctly on the first try.
Refine: Number of tests the model passed only after refining its answer.
Fail: Number of tests the model answered incorrectly.
Refusal: Number of tests the model declined to answer, whether for content-policy or other reasons.
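
If you want to script against exported results, here is a minimal TypeScript sketch of these outcome counts. The interface and helper names are hypothetical, and it assumes TOTAL is simply the sum of the four outcomes:

```ts
// Sketch of the four per-model outcome counts defined above.
// The interface and helper names are hypothetical, and TOTAL is
// assumed here to be the sum of the four outcomes; the live table
// may compute it differently.
interface ModelResult {
  model: string;
  pass: number;    // correct on the first try
  refine: number;  // correct only after refinement
  fail: number;    // incorrect
  refusal: number; // declined to answer
}

// TOTAL as the sum of all test outcomes (assumption).
function total(r: ModelResult): number {
  return r.pass + r.refine + r.fail + r.refusal;
}

// Share of tests passed on the first try.
function firstTryPassRate(r: ModelResult): number {
  return r.pass / total(r);
}
```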

What does the $ mTok column mean?
$ mTok is the estimated cost, in US dollars, per million tokens consumed by the model, based on published pricing where available and on estimates otherwise.
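
As a quick worked example, the cost of a run is just the token count scaled by this rate. The helper below is a hypothetical illustration, and the price in it is made up:

```ts
// Hypothetical helper: estimate a run's cost from a $ mTok rate.
function estimateCostUSD(dollarsPerMillionTokens: number, tokensUsed: number): number {
  return (tokensUsed / 1_000_000) * dollarsPerMillionTokens;
}

// A model priced at $3.00 per million tokens, run over 250,000 tokens:
console.log(estimateCostUSD(3.0, 250_000)); // 0.75
```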

What do the Reason, STEM, Utility, Code, and Censor columns mean?
These columns score performance or behavior in specific categories:
Reason: Performance on logical-reasoning tasks.
STEM: Performance on Science, Technology, Engineering, and Math tasks.
Utility: General usefulness in everyday or productivity tasks.
Code: Ability to write, understand, or debug code.
Censor: Degree of content moderation, i.e. how readily the model refuses sensitive queries.

Can I sort the table?
Yes! Click any column header, or focus it with the keyboard and press Enter or Space, to sort by that column; activating it again toggles between ascending and descending order.
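
For the curious, here is a minimal sketch of how a sortable, keyboard-accessible header like this can be wired up; the selector, data attribute, and aria-sort handling are assumptions, not this page's actual code:

```ts
// Sketch: make each <th> sortable by click or Enter/Space, with aria-sort
// kept in sync for screen readers. Selectors here are assumptions.
document.querySelectorAll<HTMLTableCellElement>("th[data-sortable]").forEach((th) => {
  th.tabIndex = 0; // make the header keyboard-focusable

  const toggleSort = () => {
    const next = th.getAttribute("aria-sort") === "ascending" ? "descending" : "ascending";
    th.setAttribute("aria-sort", next);
    // ...re-order the table rows by this column here...
  };

  th.addEventListener("click", toggleSort);
  th.addEventListener("keydown", (e) => {
    if (e.key === "Enter" || e.key === " ") {
      e.preventDefault(); // stop Space from scrolling the page
      toggleSort();
    }
  });
});
```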

Is there a dark mode?
Yes. Click the moon icon near the top right of the page to toggle dark mode for more comfortable night-time reading and reduced eye strain.
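
One common way to implement such a toggle is a class on the root element plus a saved preference. The sketch below assumes a `dark` class, a `#dark-mode-toggle` button, and localStorage persistence, none of which are confirmed by this page:

```ts
// Sketch: toggle a `dark` class on <html> and remember the choice.
// The class name, storage key, and button selector are assumptions.
const button = document.querySelector<HTMLButtonElement>("#dark-mode-toggle");

button?.addEventListener("click", () => {
  const dark = document.documentElement.classList.toggle("dark");
  localStorage.setItem("theme", dark ? "dark" : "light");
});

// Restore the saved preference on page load.
if (localStorage.getItem("theme") === "dark") {
  document.documentElement.classList.add("dark");
}
```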