LLM Benchmark Table

Compare AI models at a glance. Search, sort, filter, and toggle dark mode.

Interactive Benchmark Table

Click column headers to sort. Use filters and search to refine. TOTAL is a composite score (0–100). Bars are normalized per column.
Legend: normalized score bar (column-aware). Higher is better, except Refusal and Censor, where lower is better.
Columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor

FAQ

Answers to common questions about metrics, data, and usage.
What do the columns mean?
  • Model: Model name.
  • TOTAL: Composite score (0–100) combining multiple benchmarks; see the sketch after this list.
  • Pass, Refine, Fail: Aggregate outcome counts across tasks.
  • Refusal: How often the model refuses appropriate requests (lower is better).
  • $ mToK: Estimated cost per 1M tokens (input+output).
  • Reason, STEM, Utility, Code: Category performance scores.
  • Censor: How often the model refuses policy-safe requests (lower is better).
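The exact weighting behind TOTAL isn't specified by the site. As an illustration only, here is a minimal TypeScript sketch with a hypothetical row shape, assuming an equal-weight average of the category scores minus a small refusal/censor penalty:

```ts
// Hypothetical row shape mirroring the table columns (field names are assumptions, not the site's schema).
interface BenchmarkRow {
  model: string;
  pass: number;
  refine: number;
  fail: number;
  refusal: number;      // lower is better
  costPerMTok: number;  // the "$ mToK" column
  reason: number;
  stem: number;
  utility: number;
  code: number;
  censor: number;       // lower is better
}

// One plausible composite: average the higher-is-better category scores,
// then subtract a penalty for refusals and censoring, clamped to 0-100.
function totalScore(r: BenchmarkRow): number {
  const categories = [r.reason, r.stem, r.utility, r.code];
  const avg = categories.reduce((sum, v) => sum + v, 0) / categories.length;
  const penalty = (r.refusal + r.censor) / 2;
  return Math.max(0, Math.min(100, avg - 0.25 * penalty));
}
```

Swap in your own weights or formula when you plug in real benchmark results.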
How should I interpret the bars?
Bars are normalized within each column to visually compare relative differences. Numeric values in cells remain the raw scores or counts.
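A minimal sketch of that per-column normalization, assuming simple min-max scaling with an inversion for the lower-is-better columns (Refusal, Censor):

```ts
// Min-max normalize one column's values to 0-1 bar widths; invert when lower is better
// so that longer bars always mean "better" regardless of the column's direction.
function barWidths(values: number[], lowerIsBetter = false): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min || 1; // guard against divide-by-zero when all values are equal
  return values.map((v) => {
    const t = (v - min) / span;
    return lowerIsBetter ? 1 - t : t;
  });
}

// Example for a Refusal column (lower is better):
// barWidths([4, 12, 9], true) -> [1, 0, 0.375]
```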
How do I sort and filter?
Click a column header to toggle ascending/descending order. Use the category chips to include or exclude models for specific capabilities. Use the search box to match model names, numbers, or tags (e.g., "high code" or "refusal < 10").
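The search syntax isn't documented beyond those examples; one hypothetical way such a matcher could work is to treat "column < number" style queries as numeric filters and everything else as a substring match:

```ts
// Hypothetical query matcher: "refusal < 10" becomes a numeric filter on that column,
// anything else falls back to a case-insensitive substring match across the row.
function matchesQuery(row: Record<string, string | number>, query: string): boolean {
  const m = query.trim().match(/^(\w+)\s*(<=|>=|<|>|=)\s*(\d+(?:\.\d+)?)$/);
  if (m) {
    const value = Number(row[m[1].toLowerCase()]);
    const target = Number(m[3]);
    switch (m[2]) {
      case '<':  return value < target;
      case '<=': return value <= target;
      case '>':  return value > target;
      case '>=': return value >= target;
      default:   return value === target;
    }
  }
  const haystack = Object.values(row).join(' ').toLowerCase();
  return haystack.includes(query.trim().toLowerCase());
}

// matchesQuery({ model: 'example-model', refusal: 6 }, 'refusal < 10') -> true
```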
Is this data real?
Values here are sample data for demonstration and not tied to any provider. Use this site as a template and plug in your real benchmark results.
Can I export or print?
Yes. The Export CSV button downloads the current filtered, sorted view. The Print button prints or saves a PDF via your browser.
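For reference, a rough sketch of what a CSV export like this might do under the hood, assuming the filtered, sorted rows are available as plain objects (this is not the site's actual implementation):

```ts
// Serialize the currently filtered, sorted rows to CSV and trigger a browser download.
// Column order follows the first row's keys.
function exportCsv(rows: Record<string, string | number>[], filename = 'benchmarks.csv'): void {
  if (rows.length === 0) return;
  const headers = Object.keys(rows[0]);
  const escape = (v: string | number) => `"${String(v).replace(/"/g, '""')}"`;
  const lines = [
    headers.map(escape).join(','),
    ...rows.map((r) => headers.map((h) => escape(r[h])).join(',')),
  ];
  const blob = new Blob([lines.join('\n')], { type: 'text/csv' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;
  a.click();
  URL.revokeObjectURL(url);
}
```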