Interactive Benchmark Table
Click column headers to sort. Use filters and search to refine. TOTAL is a composite score (0–100). Bars are normalized per column.
Legend: bars show each value normalized within its column. Higher is better, except Refusal and Censor, where lower is better.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
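For reference, the columns above map onto a row record roughly like the sketch below. This is a hypothetical TypeScript shape; the field names and types are assumptions inferred from the column labels, not this site's actual data model.

```ts
// Hypothetical row shape inferred from the column headers above.
// Field names and types are assumptions; adapt them to your real data.
interface BenchmarkRow {
  model: string;        // Model name
  total: number;        // TOTAL composite score, 0–100
  pass: number;         // Passed-task count
  refine: number;       // Tasks passed after refinement
  fail: number;         // Failed-task count
  refusal: number;      // Refusals of appropriate requests (lower is better)
  costPerMTok: number;  // Estimated $ per 1M tokens, input + output
  reason: number;       // Reasoning category score
  stem: number;         // STEM category score
  utility: number;      // Utility category score
  code: number;         // Code category score
  censor: number;       // Refusals of policy-safe requests (lower is better)
}
```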
FAQ
Answers to common questions about metrics, data, and usage.
What do the columns mean?
- Model: Model name.
- TOTAL: Composite score (0–100) combining multiple benchmarks (see the sketch after this list).
- Pass, Refine, Fail: Aggregate outcome counts across tasks.
- Refusal: How often the model refuses appropriate requests (lower is better).
- $ mToK: Estimated cost in dollars per 1M tokens (input + output).
- Reason, STEM, Utility, Code, Censor: Category performance scores. For Censor, the score reflects how often the model refuses policy-safe requests, so lower is better.
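TOTAL arrives as a number in the data; if you want to derive one from your own category results when plugging in real benchmarks, a weighted mean is one option. The sketch below reuses the hypothetical BenchmarkRow fields above; the weights, the assumption that category scores sit on a 0–100 scale, and the inversion of Censor are all illustrative choices, not this page's actual formula.

```ts
// Illustrative composite: a weighted mean over category scores (assumed 0–100),
// with the lower-is-better Censor column inverted before weighting.
// The weights are placeholders, not the formula actually used for TOTAL.
const WEIGHTS = { reason: 0.3, stem: 0.2, utility: 0.2, code: 0.2, censor: 0.1 };

function compositeScore(
  r: Pick<BenchmarkRow, "reason" | "stem" | "utility" | "code" | "censor">
): number {
  const censorScore = 100 - r.censor; // low refusal rate -> high contribution
  const total =
    WEIGHTS.reason * r.reason +
    WEIGHTS.stem * r.stem +
    WEIGHTS.utility * r.utility +
    WEIGHTS.code * r.code +
    WEIGHTS.censor * censorScore;
  return Math.round(total * 10) / 10; // one decimal on the 0–100 scale
}
```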
How should I interpret the bars?
Bars are normalized within each column to visually compare relative differences. Numeric values in cells remain the raw scores or counts.
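As a concrete sketch of that per-column normalization (one reasonable approach, not necessarily the page's exact implementation): scale each value by its column's min and max, and flip the scale for the lower-is-better columns so better values still get longer bars.

```ts
// Map a cell value to a 0–1 bar width using its column's min/max.
// For lower-is-better columns (Refusal, Censor), the scale is inverted
// so that better (smaller) values still produce longer bars.
function barWidth(value: number, column: number[], lowerIsBetter = false): number {
  const min = Math.min(...column);
  const max = Math.max(...column);
  if (max === min) return 1; // constant column: show full-width bars
  const t = (value - min) / (max - min);
  return lowerIsBetter ? 1 - t : t;
}

// Example: refusal rates of 2, 8, and 15: the 2% model gets the longest bar.
// barWidth(2,  [2, 8, 15], true) === 1
// barWidth(15, [2, 8, 15], true) === 0
```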
How do I sort and filter?
Click a column header to toggle ascending/descending order. Use the category chips to include or exclude models for specific capabilities. Use the search box to match models, numbers, or tags (e.g., high code, refusal < 10).
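The example query refusal < 10 implies the search box accepts simple comparison expressions alongside plain text. A minimal sketch of such matching, assuming the hypothetical row shape above, could look like this (the page's real search logic may differ):

```ts
// Minimal search matcher: understands "column < value" / "column > value"
// comparisons and otherwise falls back to a substring match on the model name.
// Illustrative only; not the page's actual search implementation.
type SearchableRow = { model: string } & Record<string, string | number>;

function matchesQuery(row: SearchableRow, query: string): boolean {
  const cmp = query.trim().match(/^(\w+)\s*([<>])\s*([\d.]+)$/);
  if (cmp) {
    const [, field, op, num] = cmp;
    const value = row[field.toLowerCase()];
    if (typeof value !== "number") return false;
    return op === "<" ? value < Number(num) : value > Number(num);
  }
  // Plain text: substring match against the model name.
  return row.model.toLowerCase().includes(query.trim().toLowerCase());
}
```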
Is this data real?
Values here are sample data for demonstration and are not tied to any provider. Use this site as a template and plug in your real benchmark results.
Can I export or print?
Yes. The Export CSV button downloads the current filtered, sorted view. The Print button prints the table or saves it as a PDF via your browser's print dialog.
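For context, a browser-side CSV export typically serializes the visible rows to text and hands it to the browser as a download, roughly like the generic sketch below (not this site's actual export code):

```ts
// Generic browser-side CSV export: serialize rows, then trigger a download.
// Illustrative pattern only; the Export CSV button may be implemented differently.
function exportCsv(
  rows: Array<Record<string, string | number>>,
  filename = "benchmarks.csv"
): void {
  if (rows.length === 0) return;
  const headers = Object.keys(rows[0]);
  const escape = (v: string | number) => `"${String(v).replace(/"/g, '""')}"`;
  const lines = [
    headers.map(escape).join(","),
    ...rows.map((row) => headers.map((h) => escape(row[h])).join(",")),
  ];
  const blob = new Blob([lines.join("\n")], { type: "text/csv;charset=utf-8" });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;
  link.click();
  URL.revokeObjectURL(url);
}
```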