LLM Benchmark Table

Comparison of AI model performance on benchmark tests
Columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mTok, Reason, STEM, Utility, Code, Censor

FAQ

What do Pass, Refine, Fail, and Refusal mean?
Pass: Number of tests the model answered correctly on the first try.
Refine: Number of tests the model passed only after refining its answer.
Fail: Number of tests the model answered incorrectly.
Refusal: Number of tests the model declined to answer, whether for content-policy or other reasons.
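
If you want to script against exported results, here is a minimal TypeScript sketch of these outcome counts. The interface and helper names are hypothetical, and it assumes TOTAL is simply the sum of the four outcomes:

```ts
// Sketch of the four per-model outcome counts defined above.
// The interface and helper names are hypothetical, and TOTAL is
// assumed here to be the sum of the four outcomes; the live table
// may compute it differently.
interface ModelResult {
  model: string;
  pass: number;    // correct on the first try
  refine: number;  // correct only after refinement
  fail: number;    // incorrect
  refusal: number; // declined to answer
}

// TOTAL as the sum of all test outcomes (assumption).
function total(r: ModelResult): number {
  return r.pass + r.refine + r.fail + r.refusal;
}

// Share of tests passed on the first try.
function firstTryPassRate(r: ModelResult): number {
  return r.pass / total(r);
}
```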

What does the $ mTok column mean?
$ mTok is the estimated cost, in US dollars, per million tokens consumed by the model, based on published pricing where available and on estimates otherwise.
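
As a quick worked example, the cost of a run is just the token count scaled by this rate. The helper below is a hypothetical illustration, and the price in it is made up:

```ts
// Hypothetical helper: estimate a run's cost from a $ mTok rate.
function estimateCostUSD(dollarsPerMillionTokens: number, tokensUsed: number): number {
  return (tokensUsed / 1_000_000) * dollarsPerMillionTokens;
}

// A model priced at $3.00 per million tokens, run over 250,000 tokens:
console.log(estimateCostUSD(3.0, 250_000)); // 0.75
```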

What do the Reason, STEM, Utility, Code, and Censor columns mean?
These columns score performance or behavior in specific categories:
Reason: Performance on logical-reasoning tasks.
STEM: Performance on Science, Technology, Engineering, and Math tasks.
Utility: General usefulness in everyday or productivity tasks.
Code: Ability to write, understand, or debug code.
Censor: Degree of content moderation, i.e. how readily the model refuses sensitive queries.

Can I sort the table?
Yes! Click any column header, or focus it with the keyboard and press Enter or Space, to sort by that column; activating it again toggles between ascending and descending order.
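
For the curious, here is a minimal sketch of how a sortable, keyboard-accessible header like this can be wired up; the selector, data attribute, and aria-sort handling are assumptions, not this page's actual code:

```ts
// Sketch: make each <th> sortable by click or Enter/Space, with aria-sort
// kept in sync for screen readers. Selectors here are assumptions.
document.querySelectorAll<HTMLTableCellElement>("th[data-sortable]").forEach((th) => {
  th.tabIndex = 0; // make the header keyboard-focusable

  const toggleSort = () => {
    const next = th.getAttribute("aria-sort") === "ascending" ? "descending" : "ascending";
    th.setAttribute("aria-sort", next);
    // ...re-order the table rows by this column here...
  };

  th.addEventListener("click", toggleSort);
  th.addEventListener("keydown", (e) => {
    if (e.key === "Enter" || e.key === " ") {
      e.preventDefault(); // stop Space from scrolling the page
      toggleSort();
    }
  });
});
```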

Is there a dark mode?
Yes. Click the moon icon near the top right of the page to toggle dark mode for more comfortable night-time reading and reduced eye strain.
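
One common way to implement such a toggle is a class on the root element plus a saved preference. The sketch below assumes a `dark` class, a `#dark-mode-toggle` button, and localStorage persistence, none of which are confirmed by this page:

```ts
// Sketch: toggle a `dark` class on <html> and remember the choice.
// The class name, storage key, and button selector are assumptions.
const button = document.querySelector<HTMLButtonElement>("#dark-mode-toggle");

button?.addEventListener("click", () => {
  const dark = document.documentElement.classList.toggle("dark");
  localStorage.setItem("theme", dark ? "dark" : "light");
});

// Restore the saved preference on page load.
if (localStorage.getItem("theme") === "dark") {
  document.documentElement.classList.add("dark");
}
```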