Interactive Benchmark Table
Click column headers to sort. Use filters and search to refine. TOTAL is a composite score (0–100). Bars are normalized per column.
Legend: bars show each value normalized within its column. Higher is better, except Refusal and Censor, where lower is better.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
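For reference, the columns above map onto a row record roughly like the sketch below. This is a hypothetical TypeScript shape; the field names and types are assumptions inferred from the column labels, not this site's actual data model.

```ts
// Hypothetical row shape inferred from the column headers above.
// Field names and types are assumptions; adapt them to your real data.
interface BenchmarkRow {
  model: string;        // Model name
  total: number;        // TOTAL composite score, 0–100
  pass: number;         // Passed-task count
  refine: number;       // Tasks passed after refinement
  fail: number;         // Failed-task count
  refusal: number;      // Refusals of appropriate requests (lower is better)
  costPerMTok: number;  // Estimated $ per 1M tokens, input + output
  reason: number;       // Reasoning category score
  stem: number;         // STEM category score
  utility: number;      // Utility category score
  code: number;         // Code category score
  censor: number;       // Refusals of policy-safe requests (lower is better)
}
```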
FAQ
Answers to common questions about metrics, data, and usage.
What do the columns mean?
- Model: Model name.
- TOTAL: Composite score (0–100) combining multiple benchmarks (see the sketch after this list).
- Pass, Refine, Fail: Aggregate outcome counts across tasks.
- Refusal: How often the model refuses appropriate requests (lower is better).
- $ mToK: Estimated cost in dollars per 1M tokens (input + output).
- Reason, STEM, Utility, Code, Censor: Category performance scores. For Censor, the score reflects how often the model refuses policy-safe requests, so lower is better.
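TOTAL arrives as a number in the data; if you want to derive one from your own category results when plugging in real benchmarks, a weighted mean is one option. The sketch below reuses the hypothetical BenchmarkRow fields above; the weights, the assumption that category scores sit on a 0–100 scale, and the inversion of Censor are all illustrative choices, not this page's actual formula.

```ts
// Illustrative composite: a weighted mean over category scores (assumed 0–100),
// with the lower-is-better Censor column inverted before weighting.
// The weights are placeholders, not the formula actually used for TOTAL.
const WEIGHTS = { reason: 0.3, stem: 0.2, utility: 0.2, code: 0.2, censor: 0.1 };

function compositeScore(
  r: Pick<BenchmarkRow, "reason" | "stem" | "utility" | "code" | "censor">
): number {
  const censorScore = 100 - r.censor; // low refusal rate -> high contribution
  const total =
    WEIGHTS.reason * r.reason +
    WEIGHTS.stem * r.stem +
    WEIGHTS.utility * r.utility +
    WEIGHTS.code * r.code +
    WEIGHTS.censor * censorScore;
  return Math.round(total * 10) / 10; // one decimal on the 0–100 scale
}
```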
How should I interpret the bars?
Bars are normalized within each column to visually compare relative differences. Numeric values in cells remain the raw scores or counts.
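As a concrete sketch of that per-column normalization (one reasonable approach, not necessarily the page's exact implementation): scale each value by its column's min and max, and flip the scale for the lower-is-better columns so better values still get longer bars.

```ts
// Map a cell value to a 0–1 bar width using its column's min/max.
// For lower-is-better columns (Refusal, Censor), the scale is inverted
// so that better (smaller) values still produce longer bars.
function barWidth(value: number, column: number[], lowerIsBetter = false): number {
  const min = Math.min(...column);
  const max = Math.max(...column);
  if (max === min) return 1; // constant column: show full-width bars
  const t = (value - min) / (max - min);
  return lowerIsBetter ? 1 - t : t;
}

// Example: refusal rates of 2, 8, and 15: the 2% model gets the longest bar.
// barWidth(2,  [2, 8, 15], true) === 1
// barWidth(15, [2, 8, 15], true) === 0
```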
How do I sort and filter?
Click a column header to toggle ascending/descending order. Use the category chips to include or exclude models for specific capabilities. Use the search box to match models, numbers, or tags (e.g., high code, refusal < 10).
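The example query refusal < 10 implies the search box accepts simple comparison expressions alongside plain text. A minimal sketch of such matching, assuming the hypothetical row shape above, could look like this (the page's real search logic may differ):

```ts
// Minimal search matcher: understands "column < value" / "column > value"
// comparisons and otherwise falls back to a substring match on the model name.
// Illustrative only; not the page's actual search implementation.
type SearchableRow = { model: string } & Record<string, string | number>;

function matchesQuery(row: SearchableRow, query: string): boolean {
  const cmp = query.trim().match(/^(\w+)\s*([<>])\s*([\d.]+)$/);
  if (cmp) {
    const [, field, op, num] = cmp;
    const value = row[field.toLowerCase()];
    if (typeof value !== "number") return false;
    return op === "<" ? value < Number(num) : value > Number(num);
  }
  // Plain text: substring match against the model name.
  return row.model.toLowerCase().includes(query.trim().toLowerCase());
}
```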
Is this data real?
Values here are sample data for demonstration and are not tied to any provider. Use this site as a template and plug in your real benchmark results.
Can I export or print?
Yes. The Export CSV button downloads the current filtered, sorted view. The Print button prints the table or saves it as a PDF via your browser's print dialog.
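For context, a browser-side CSV export typically serializes the visible rows to text and hands it to the browser as a download, roughly like the generic sketch below (not this site's actual export code):

```ts
// Generic browser-side CSV export: serialize rows, then trigger a download.
// Illustrative pattern only; the Export CSV button may be implemented differently.
function exportCsv(
  rows: Array<Record<string, string | number>>,
  filename = "benchmarks.csv"
): void {
  if (rows.length === 0) return;
  const headers = Object.keys(rows[0]);
  const escape = (v: string | number) => `"${String(v).replace(/"/g, '""')}"`;
  const lines = [
    headers.map(escape).join(","),
    ...rows.map((row) => headers.map((h) => escape(row[h])).join(",")),
  ];
  const blob = new Blob([lines.join("\n")], { type: "text/csv;charset=utf-8" });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;
  link.click();
  URL.revokeObjectURL(url);
}
```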