LLM Benchmark table

Columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor

Frequently Asked Questions

What do these benchmark metrics mean?

TOTAL: The aggregated overall performance score across all categories.
Pass/Refine/Fail: The percentage of prompts the model passed on the first try, passed only after self-refinement, or failed outright.
Refusal: The rate at which the model refuses to answer a safe prompt.
$ mToK: Estimated cost in dollars per million tokens, blended across input and output pricing.
Reason/STEM/Utility/Code: Domain-specific accuracy scores.
Censor: The percentage of false-positive censorship instances, i.e. safe content incorrectly blocked.
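As a rough sketch of how the Pass/Refine/Fail split and the blended $ mToK figure could be derived (the function names, outcome labels, and the usage-weighted blending are assumptions, not the site's actual methodology):

```python
def score_breakdown(results):
    """Summarize per-prompt eval outcomes into Pass/Refine/Fail percentages.

    `results` is a list of outcome labels (assumed here): "pass" for a
    first-try success, "refine" for a pass after self-refinement, and
    "fail" for a complete failure.
    """
    total = len(results)
    return {
        outcome: round(100 * results.count(outcome) / total, 1)
        for outcome in ("pass", "refine", "fail")
    }


def blended_cost_per_mtok(in_price, out_price, in_tokens, out_tokens):
    """Blend input/output $-per-million-token prices, weighted by usage."""
    total_tokens = in_tokens + out_tokens
    return (in_price * in_tokens + out_price * out_tokens) / total_tokens
```

For example, a run of four prompts with two first-try passes, one refined pass, and one failure yields 50/25/25, and a $3 input / $15 output price blends to $6 per million tokens when three quarters of the tokens are input.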

How often is this table updated?

The benchmark data is updated weekly as organizations release or update closed-source and open-weights models.

Why are some bars different colors?

Bars are color-coded by performance threshold: green indicates leading performance (>90), blue highly competitive (80-90), yellow average (70-80), and red lower relative performance (or, for negative metrics like Fail rate, higher values).
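The thresholds above can be sketched as a small mapping function (the function name and the `negative` flag for inverting worse-is-higher metrics are illustrative assumptions):

```python
def bar_color(score, negative=False):
    """Map a 0-100 score to the FAQ's color bands.

    For negative metrics (e.g. Fail or Refusal rates) a higher value is
    worse, so the score is inverted before banding (an assumed convention).
    """
    if negative:
        score = 100 - score
    if score > 90:        # leading performance
        return "green"
    if score >= 80:       # highly competitive
        return "blue"
    if score >= 70:       # average
        return "yellow"
    return "red"          # lower relative performance
```

So a TOTAL of 95 renders green, while a Fail rate of 12 inverts to 88 and renders blue.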