| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
- **TOTAL**: Aggregate overall performance score across all categories.
- **Pass/Refine/Fail**: Percentage of prompts the model passed on the first attempt, passed after self-refinement, or failed outright (see the sketch after this list).
- **Refusal**: Rate at which the model refuses to answer a safe prompt.
- **$ mToK**: Estimated cost per million tokens (blended input/output).
- **Reason/STEM/Utility/Code**: Domain-specific accuracy scores.
- **Censor**: Percentage of false-positive censorship instances.
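
As a rough illustration of how the Pass/Refine/Fail columns relate, here is a minimal sketch of how the split could be tallied from per-prompt outcomes. The `PromptResult` fields and `summarize` helper are hypothetical, not the benchmark's actual scoring code:

```python
from dataclasses import dataclass

@dataclass
class PromptResult:
    passed_first_try: bool      # solved without any retry
    passed_after_refine: bool   # solved after self-refinement
    refused: bool               # declined to answer a safe prompt

def summarize(results: list[PromptResult]) -> dict[str, float]:
    """Tally Pass/Refine/Fail/Refusal percentages for one model.

    Hypothetical sketch: assumes each prompt lands in exactly one of
    pass-on-first-try, pass-after-refine, or fail, with refusals
    counted separately against safe prompts.
    """
    n = len(results)
    passed = sum(r.passed_first_try for r in results)
    refined = sum(r.passed_after_refine and not r.passed_first_try for r in results)
    return {
        "Pass": 100 * passed / n,
        "Refine": 100 * refined / n,
        "Fail": 100 * (n - passed - refined) / n,
        "Refusal": 100 * sum(r.refused for r in results) / n,
    }
```

Under this scheme the three rates sum to 100, which is the partition the legend above implies.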
The benchmark data is updated weekly as organizations release or update closed-source and open-weights models.
Visual aids use a color-coding system based on performance thresholds: green indicates leading performance (>90), blue highly competitive (80-90), yellow average (70-80), and red lower relative performance or higher values on negative metrics (like Fail rate).
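
A minimal helper implementing these thresholds might look like the sketch below. The function name and the inversion of negative metrics (so that a low Fail rate still reads as good) are assumptions for illustration, not the leaderboard's actual rendering code:

```python
def score_color(score: float, higher_is_better: bool = True) -> str:
    """Map a 0-100 score to the leaderboard's threshold colors.

    For negative metrics (Fail, Refusal, Censor), pass
    higher_is_better=False; inverting the scale before bucketing is an
    assumption about how the table is rendered.
    """
    value = score if higher_is_better else 100 - score
    if value > 90:
        return "green"   # leading performance
    if value >= 80:
        return "blue"    # highly competitive
    if value >= 70:
        return "yellow"  # average
    return "red"         # lower relative performance
```

For example, `score_color(92)` returns `"green"`, while `score_color(35, higher_is_better=False)` maps a 35% Fail rate to `"red"`.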