
LLM Benchmark Table

Compare LLMs with clarity: filter, sort, and inspect signals.

This interactive table blends outcome rates (Pass / Refine / Fail / Refusal), price ($ mToK, dollars per million tokens), and category scores (Reason, STEM, Utility, Code, Censor). Use the controls to build a shortlist and export it in one click.
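
If you're curious what sits behind each row, it boils down to a record like the sketch below. This is a TypeScript illustration; the field names are our guesses, not the page's actual schema:

  // Illustrative shape of one benchmark row; field names are assumptions,
  // not the real data model.
  interface BenchmarkRow {
    model: string;        // model name
    pass: number;         // % solved outright
    refine: number;       // % needing an extra revision pass
    fail: number;         // % incorrect or unusable
    refusal: number;      // % declined (policy / safety)
    pricePerMTok: number; // the "$ mToK" column: dollars per million tokens
    reason: number;       // category scores
    stem: number;
    utility: number;
    code: number;
    censor: number;
    total: number;        // composite score (see the FAQ below)
  }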

  • Sticky headers + fast sort
  • Search + filters + column toggles
  • In-table bars + top summary
  • Dark mode

FAQ

How to read the table, what the numbers mean, and how the view works.

What do Pass / Refine / Fail / Refusal mean?

Think of them as outcome rates across a benchmark suite:

  • Pass: solved within the evaluator’s success criteria.
  • Refine: nearly correct, but needed an extra revision / tool pass.
  • Fail: incorrect or unusable result.
  • Refusal: the model declined (policy, uncertainty, or safety behavior).

Higher Pass is good; higher Refusal may be desirable in safety contexts, but can reduce utility in general use.
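
If you want to reproduce rates like these from your own eval runs, it's just counting outcomes over a suite. A minimal TypeScript sketch (the Outcome labels mirror the list above; everything else is illustrative):

  // Count each outcome and normalize to percentages (illustrative only).
  type Outcome = "pass" | "refine" | "fail" | "refusal";

  function outcomeRates(results: Outcome[]): Record<Outcome, number> {
    const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
    for (const r of results) counts[r]++;
    const n = results.length || 1; // guard against an empty suite
    return {
      pass: (100 * counts.pass) / n,
      refine: (100 * counts.refine) / n,
      fail: (100 * counts.fail) / n,
      refusal: (100 * counts.refusal) / n,
    };
  }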

What is $ mToK?

A simple cost proxy: dollars per million tokens (mToK = million tokens). Lower is cheaper. Use it alongside TOTAL to spot good-value models.
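
The arithmetic is straightforward. A quick sketch, with made-up prices:

  // "$ mToK" arithmetic: dollars per million tokens (prices below are invented).
  function costUSD(tokens: number, pricePerMTok: number): number {
    return (tokens / 1_000_000) * pricePerMTok;
  }

  costUSD(1_000, 3.0);    // => 0.003 : a 1k-token exchange at $3/MTok
  costUSD(250_000, 15.0); // => 3.75  : a 250k-token job at $15/MTok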

How is TOTAL computed?

In this demo dataset, TOTAL is a composite score designed for comparison: it blends the category scores (Reason / STEM / Utility / Code / Censor) with the outcome rates in a balanced weighting. It's not a universal standard.
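
For intuition, here is one way such a composite could be formed. The weights below are purely illustrative, not the demo's actual formula:

  // One possible composite; the weights are assumptions, NOT the real formula.
  function compositeTotal(row: {
    reason: number; stem: number; utility: number; code: number; censor: number;
    pass: number; refusal: number;
  }): number {
    const categories = (row.reason + row.stem + row.utility + row.code + row.censor) / 5;
    const outcomes = row.pass - 0.5 * row.refusal; // reward passes, lightly penalize refusals
    return 0.7 * categories + 0.3 * outcomes;      // "balanced" blend, weights assumed
  }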

Tip: Click any column header to sort; click again to reverse. Use Columns to tailor your view.
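
Under the hood, click-to-sort is usually just a comparator plus a direction flip. A minimal sketch (numeric columns only, for simplicity; not the page's actual code):

  // Toggle sort direction on repeated clicks of the same header (illustrative).
  type Row = Record<string, number>;

  let sortKey = "total";
  let ascending = false;

  function onHeaderClick(key: string, rows: Row[]): Row[] {
    ascending = key === sortKey ? !ascending : false; // same header: flip; new header: start descending
    sortKey = key;
    return [...rows].sort((a, b) => (ascending ? a[key] - b[key] : b[key] - a[key]));
  }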

Keyboard shortcuts?
  • /: focus search
  • Esc: clear search (or close menus)
  • T: toggle theme
  • ?: show shortcuts (this hint)
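
A binding set like this is typically wired with a single keydown listener. A sketch in TypeScript, with hypothetical helpers (and a "#search" selector) standing in for whatever the page really uses:

  // Hypothetical helpers: stand-ins for the page's real wiring.
  const search = () => document.querySelector<HTMLInputElement>("#search");
  const focusSearch = () => search()?.focus();
  const clearSearch = () => { const s = search(); if (s) s.value = ""; };
  const toggleTheme = () => document.body.classList.toggle("dark");
  const showHelp = () => alert("Shortcuts: /  Esc  T  ?");

  document.addEventListener("keydown", (e) => {
    if (e.key === "Escape") { clearSearch(); return; } // Esc works even while typing
    if (e.target instanceof HTMLInputElement) return;  // don't hijack normal typing
    switch (e.key) {
      case "/": e.preventDefault(); focusSearch(); break;
      case "t": case "T": toggleTheme(); break;
      case "?": showHelp(); break;
    }
  });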