Benchmark Highlights
| Model | Total | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
FAQ
What do the Pass / Refine / Fail / Refusal buckets represent?
Pass counts tasks the model solved correctly on the first attempt. Refine means the model reached an acceptable answer only after light coaching or retries. Fail captures results that remained incorrect or incomplete even after refinement. Refusal indicates the model declined the task on policy or safety grounds.
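For concreteness, here is a minimal TypeScript sketch of that bucketing logic. The `Attempt` fields and the `bucket` function are hypothetical illustrations, not the benchmark's actual schema.

```ts
// Hypothetical per-task result; the benchmark's real schema is not published here.
type Outcome = "pass" | "refine" | "fail" | "refusal";

interface Attempt {
  correct: boolean;  // did the final answer meet the rubric?
  retries: number;   // coaching/retry rounds used
  declined: boolean; // model refused on policy or safety grounds
}

// Bucket a single task result according to the definitions above.
function bucket(a: Attempt): Outcome {
  if (a.declined) return "refusal";                // declined outright
  if (a.correct && a.retries === 0) return "pass"; // right on the first try
  if (a.correct) return "refine";                  // right only after retries
  return "fail";                                   // wrong even after refinement
}
```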
How is $/MTok calculated?
The cost metric expresses model pricing as dollars per million tokens, blended across prompt and completion. Values reflect on-demand pricing at the time of evaluation and are rounded to three decimal places for readability.
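As a rough sketch, the blended figure could be computed as below. Weighting by observed prompt and completion token counts is an assumption; the FAQ does not spell out the exact mix.

```ts
// Blend prompt and completion pricing into a single $/MTok figure.
// promptPrice and completionPrice are on-demand rates in dollars per million tokens;
// weighting by observed token counts is an assumption about the methodology.
function dollarsPerMTok(
  promptPrice: number,
  completionPrice: number,
  promptTokens: number,
  completionTokens: number,
): number {
  const totalCost =
    (promptTokens / 1e6) * promptPrice +
    (completionTokens / 1e6) * completionPrice;
  const totalTokens = promptTokens + completionTokens;
  const perMTok = totalCost / (totalTokens / 1e6);
  return Math.round(perMTok * 1000) / 1000; // three decimal places
}
```

For example, `dollarsPerMTok(3, 15, 800_000, 200_000)` yields `5.4`: a million tokens split 80/20 between prompt and completion costs $5.40 at those rates.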
Explain the Reason, STEM, Utility, Code, and Censor scores.
These sub-scores are normalized to a 0–1.00 scale. Reason measures multi-step reasoning accuracy; STEM focuses on quantitative and scientific problems; Utility covers general productivity tasks; Code measures software generation and debugging; Censor reflects moderation strictness (lower is more permissive). The visual bars in the table encode these scores for quick scanning.
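If the normalization is min-max scaling (an assumption; the FAQ only states the range), a sub-score could be derived like so:

```ts
// Min-max normalize a raw sub-score into [0, 1.00], rounded to two decimals.
// The choice of min-max scaling is an assumption; the FAQ only states the range.
function normalize(raw: number, min: number, max: number): number {
  if (max === min) return 0; // degenerate range: avoid division by zero
  const clamped = Math.min(Math.max(raw, min), max);
  return Math.round(((clamped - min) / (max - min)) * 100) / 100;
}
```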
Which features make the table interactive?
You can sort any column, search by model name, and filter by minimum pass rate, maximum refusal rate, and a cost ceiling. Hover states and stacked bars clarify the distribution of outcomes, while summary cards and the highlight chart surface the top performers automatically.
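A minimal sketch of how those three filters could compose, assuming a hypothetical `Row` shape that mirrors the table's columns; the threshold names are illustrative, not the page's actual query parameters.

```ts
// Hypothetical row shape mirroring the table's columns.
interface Row {
  model: string;
  passRate: number;    // fraction of tasks in the Pass bucket
  refusalRate: number; // fraction of tasks in the Refusal bucket
  costPerMTok: number; // blended $/MTok
}

// Apply the pass-rate floor, refusal ceiling, and cost ceiling,
// then sort the surviving rows by pass rate, best first.
function filterRows(
  rows: Row[],
  minPass: number,
  maxRefusal: number,
  maxCost: number,
): Row[] {
  return rows
    .filter(
      (r) =>
        r.passRate >= minPass &&
        r.refusalRate <= maxRefusal &&
        r.costPerMTok <= maxCost,
    )
    .sort((a, b) => b.passRate - a.passRate);
}
```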
How often is the benchmark updated?
The pipeline is designed for weekly refreshes. Models that receive major architecture or pricing updates are re-tested ad hoc to keep the comparison trustworthy.