How are these scores calculated?
Scores are derived from a composite of standardized benchmarks including MMLU, HumanEval, and GSM8K, weighted by real-world utility tests.
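A weighted composite like this can be sketched in a few lines. The benchmark names come from the answer above, but the weights and scores here are purely illustrative assumptions, not the leaderboard's actual values.

```python
# Hypothetical sketch of the weighted-composite scoring described above.
# The weights below are illustrative assumptions, not the real ones.
def composite_score(benchmarks: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each on a 0-100 scale."""
    total_weight = sum(weights[name] for name in benchmarks)
    return sum(benchmarks[name] * weights[name] for name in benchmarks) / total_weight

# Example values (made up for illustration):
scores = {"MMLU": 86.4, "HumanEval": 90.2, "GSM8K": 95.0}
weights = {"MMLU": 0.4, "HumanEval": 0.3, "GSM8K": 0.3}
print(round(composite_score(scores, weights), 1))
```

In practice the published number would also fold in the real-world utility tests mentioned above; this sketch only shows the weighting mechanics.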
What does "Refusal" represent?
Refusal tracks how often a model declines to answer a valid prompt because of overactive safety filters or alignment constraints.
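Measured this way, refusal is just the fraction of valid prompts the model declined. A minimal sketch, assuming each evaluation record carries a boolean `refused` flag (the field name is hypothetical):

```python
# Illustrative refusal-rate metric; the "refused" field is an assumed
# per-record flag, not a documented schema from the leaderboard.
def refusal_rate(results: list[dict]) -> float:
    """Fraction of valid prompts the model declined to answer."""
    if not results:
        return 0.0
    refused = sum(1 for record in results if record["refused"])
    return refused / len(results)

runs = [{"refused": False}, {"refused": True}, {"refused": False}, {"refused": False}]
print(refusal_rate(runs))  # 1 refusal out of 4 prompts -> 0.25
```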
How often is the data updated?
The table is updated weekly as new model weights and API versions are released by providers like OpenAI, Anthropic, and Meta.