
LLM Benchmark Table

Compare LLMs with clarity: filter, sort, and inspect signals.

This interactive table blends outcome rates (Pass / Refine / Fail / Refusal), price ($ mToK, dollars per million tokens), and category scores (Reason, STEM, Utility, Code, Censor). Use the controls to build a shortlist and export it in one click.
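
If you're curious what sits behind each row, it boils down to a record like the sketch below. This is a TypeScript illustration; the field names are our guesses, not the page's actual schema:

  // Illustrative shape of one benchmark row; field names are assumptions,
  // not the real data model.
  interface BenchmarkRow {
    model: string;        // model name
    pass: number;         // % solved outright
    refine: number;       // % needing an extra revision pass
    fail: number;         // % incorrect or unusable
    refusal: number;      // % declined (policy / safety)
    pricePerMTok: number; // the "$ mToK" column: dollars per million tokens
    reason: number;       // category scores
    stem: number;
    utility: number;
    code: number;
    censor: number;
    total: number;        // composite score (see the FAQ below)
  }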

  • Sticky headers + fast sort
  • Search + filters + column toggles
  • In-table bars + top summary
  • Dark mode

FAQ

How to read the table, what the numbers mean, and how the view works.

What do Pass / Refine / Fail / Refusal mean?

Think of them as outcome rates across a benchmark suite:

  • Pass: solved within the evaluator’s success criteria.
  • Refine: nearly correct, but needed an extra revision / tool pass.
  • Fail: incorrect or unusable result.
  • Refusal: the model declined (policy, uncertainty, or safety behavior).

Higher Pass is good; higher Refusal may be desirable in safety contexts, but can reduce utility in general use.
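
If you want to reproduce rates like these from your own eval runs, it's just counting outcomes over a suite. A minimal TypeScript sketch (the Outcome labels mirror the list above; everything else is illustrative):

  // Count each outcome and normalize to percentages (illustrative only).
  type Outcome = "pass" | "refine" | "fail" | "refusal";

  function outcomeRates(results: Outcome[]): Record<Outcome, number> {
    const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
    for (const r of results) counts[r]++;
    const n = results.length || 1; // guard against an empty suite
    return {
      pass: (100 * counts.pass) / n,
      refine: (100 * counts.refine) / n,
      fail: (100 * counts.fail) / n,
      refusal: (100 * counts.refusal) / n,
    };
  }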

What is $ mToK?

A simple cost proxy: dollars per million tokens (mToK = million tokens). Lower is cheaper. Use it alongside TOTAL to spot good-value models.
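
The arithmetic is straightforward. A quick sketch, with made-up prices:

  // "$ mToK" arithmetic: dollars per million tokens (prices below are invented).
  function costUSD(tokens: number, pricePerMTok: number): number {
    return (tokens / 1_000_000) * pricePerMTok;
  }

  costUSD(1_000, 3.0);    // => 0.003 : a 1k-token exchange at $3/MTok
  costUSD(250_000, 15.0); // => 3.75  : a 250k-token job at $15/MTok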

How is TOTAL computed?

In this demo dataset, TOTAL is a composite score designed for comparison: it blends the category scores (Reason / STEM / Utility / Code / Censor) with the outcome rates in a balanced weighting. It's not a universal standard.
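
For intuition, here is one way such a composite could be formed. The weights below are purely illustrative, not the demo's actual formula:

  // One possible composite; the weights are assumptions, NOT the real formula.
  function compositeTotal(row: {
    reason: number; stem: number; utility: number; code: number; censor: number;
    pass: number; refusal: number;
  }): number {
    const categories = (row.reason + row.stem + row.utility + row.code + row.censor) / 5;
    const outcomes = row.pass - 0.5 * row.refusal; // reward passes, lightly penalize refusals
    return 0.7 * categories + 0.3 * outcomes;      // "balanced" blend, weights assumed
  }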

Tip: Click any column header to sort; click again to reverse. Use Columns to tailor your view.
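
Under the hood, click-to-sort is usually just a comparator plus a direction flip. A minimal sketch (numeric columns only, for simplicity; not the page's actual code):

  // Toggle sort direction on repeated clicks of the same header (illustrative).
  type Row = Record<string, number>;

  let sortKey = "total";
  let ascending = false;

  function onHeaderClick(key: string, rows: Row[]): Row[] {
    ascending = key === sortKey ? !ascending : false; // same header: flip; new header: start descending
    sortKey = key;
    return [...rows].sort((a, b) => (ascending ? a[key] - b[key] : b[key] - a[key]));
  }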

Keyboard shortcuts?
  • /: focus search
  • Esc: clear search (or close menus)
  • T: toggle theme
  • ?: show shortcuts (this hint)
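
A binding set like this is typically wired with a single keydown listener. A sketch in TypeScript, with hypothetical helpers (and a "#search" selector) standing in for whatever the page really uses:

  // Hypothetical helpers: stand-ins for the page's real wiring.
  const search = () => document.querySelector<HTMLInputElement>("#search");
  const focusSearch = () => search()?.focus();
  const clearSearch = () => { const s = search(); if (s) s.value = ""; };
  const toggleTheme = () => document.body.classList.toggle("dark");
  const showHelp = () => alert("Shortcuts: /  Esc  T  ?");

  document.addEventListener("keydown", (e) => {
    if (e.key === "Escape") { clearSearch(); return; } // Esc works even while typing
    if (e.target instanceof HTMLInputElement) return;  // don't hijack normal typing
    switch (e.key) {
      case "/": e.preventDefault(); focusSearch(); break;
      case "t": case "T": toggleTheme(); break;
      case "?": showHelp(); break;
    }
  });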