| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
TOTAL is the headline benchmark score for each model. In this demo, it acts as the primary overall performance indicator used for ranking.
Pass, Refine, and Fail provide a quick breakdown of outcome quality: Pass is the strongest signal, Refine marks partially correct responses that need improvement, and Fail captures incorrect or unusable outputs.
Refusal estimates how often a model declines to answer. A lower refusal rate can be useful, but the ideal value depends on safety and policy goals.
$ mToK is displayed as a compact cost/efficiency metric, commonly read as dollars per million tokens. You can treat it as a proxy for model value or a custom benchmark-specific conversion score.
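To illustrate how outcome columns like these might roll up into a headline number, here is a minimal sketch. The actual TOTAL formula is not specified in this demo, so the weighting scheme, field names, and the `ModelResult` class below are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    # Hypothetical per-model outcome counts; the real benchmark's
    # aggregation rules are not documented here.
    passes: int
    refines: int
    fails: int
    refusals: int

    def _attempts(self) -> int:
        return self.passes + self.refines + self.fails + self.refusals

    def total_score(self, refine_weight: float = 0.5) -> float:
        """Example headline score: full credit for Pass, partial
        credit for Refine, none for Fail or Refusal."""
        attempts = self._attempts()
        if attempts == 0:
            return 0.0
        return 100.0 * (self.passes + refine_weight * self.refines) / attempts

    def refusal_rate(self) -> float:
        """Share of prompts the model declined to answer."""
        attempts = self._attempts()
        return self.refusals / attempts if attempts else 0.0

result = ModelResult(passes=70, refines=20, fails=8, refusals=2)
print(round(result.total_score(), 1))  # 80.0
print(result.refusal_rate())           # 0.02
```

With a half-credit weight for Refine, 70 passes and 20 refines out of 100 attempts yield a headline score of 80.0; changing `refine_weight` shifts how much partially correct output counts toward the ranking.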
Yes, the layout is responsive: the table scrolls horizontally on smaller screens, and the controls remain easy to reach.