LLM Benchmark Table

A sleek, interactive comparison dashboard for model performance. Filter, sort, search, and inspect benchmark signals across pass rates, refusal behavior, coding, STEM, utility, and censoring tendencies.

  • Interactive ranking table.
  • Dark mode supported.
  • Search, sort, and filter.
  • Visual performance bars.
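As an illustration of the search control, here is a minimal matcher sketch in TypeScript. The model field name and the case-insensitive substring strategy are assumptions for illustration, not the dashboard's actual behavior.

    // Sketch: keep rows whose model name contains the search query.
    // The row shape and matching rule are illustrative assumptions.
    interface SearchRow {
      model: string;
    }

    function searchRows<T extends SearchRow>(rows: T[], query: string): T[] {
      const q = query.trim().toLowerCase();
      if (q === "") return rows; // an empty query shows every row
      return rows.filter((row) => row.model.toLowerCase().includes(q));
    }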

Summary cards

  • Models tracked: the number of benchmark entries currently loaded.
  • Average TOTAL: the overall benchmark average across all models.
  • Best model: the model with the top TOTAL score.
  • Highest utility: the model with the best utility score.
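For anyone rebuilding similar cards, a minimal sketch of how these four figures could be derived from loaded rows. The field names (model, total, utility) are assumptions, not the dashboard's actual schema.

    // Sketch: derive the four summary-card figures from loaded benchmark rows.
    // Field names below (model, total, utility) are illustrative assumptions.
    interface SummaryRow {
      model: string;
      total: number;
      utility: number;
    }

    function summarize(rows: SummaryRow[]) {
      const tracked = rows.length;
      if (tracked === 0) return { tracked, avgTotal: 0 };
      const avgTotal = rows.reduce((s, r) => s + r.total, 0) / tracked;
      const bestModel = rows.reduce((a, b) => (b.total > a.total ? b : a)).model;
      const topUtility = rows.reduce((a, b) => (b.utility > a.utility ? b : a)).model;
      return { tracked, avgTotal, bestModel, topUtility };
    }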

Benchmark Table

  • Click column headers to sort.
  • Hover rows for detail.
  • Bars visualize each score.
  • The refusal filter uses the Refusal column.
Columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor.
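A minimal sketch of header-click sorting in TypeScript. The generic comparator and the direction type are assumptions about how such a table could work, not the dashboard's actual code.

    // Sketch: sort rows by any numeric or string column when a header is clicked.
    // Column keys would mirror the table headers above; names are assumed.
    type Direction = "asc" | "desc";

    function sortBy<T>(rows: T[], key: keyof T, dir: Direction): T[] {
      return [...rows].sort((a, b) => {
        const x = a[key];
        const y = b[key];
        const cmp =
          typeof x === "number" && typeof y === "number"
            ? x - y
            : String(x).localeCompare(String(y));
        return dir === "asc" ? cmp : -cmp;
      });
    }

    // Example: rank by TOTAL, best first (assumed field name "total").
    // const ranked = sortBy(rows, "total", "desc");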

FAQ

What does TOTAL mean?

TOTAL is the headline benchmark score for each model. In this demo, it acts as the primary overall performance indicator used for ranking.

Why show Pass, Refine, and Fail separately?

These three columns provide a quick breakdown of outcome quality. Pass is the strongest signal, Refine indicates partially correct responses needing improvement, and Fail captures incorrect or unusable outputs.
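Assuming Pass, Refine, and Fail are counts over the same set of test items (an assumption; the table does not spell this out), the share of each outcome could be computed like so:

    // Sketch: turn Pass/Refine/Fail counts into percentage shares.
    // Assumes the three counts cover the same set of test items.
    function outcomeShares(pass: number, refine: number, fail: number) {
      const total = pass + refine + fail;
      if (total === 0) return { pass: 0, refine: 0, fail: 0 };
      return {
        pass: (100 * pass) / total,
        refine: (100 * refine) / total,
        fail: (100 * fail) / total,
      };
    }

    // Example: 70 passes, 20 refines, 10 fails -> 70% / 20% / 10%.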

What is Refusal?

Refusal estimates how often a model declines to answer. A lower refusal rate generally means more prompts get answered, but the ideal value depends on your safety and policy goals.
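The refusal filter mentioned above could be a simple threshold over the Refusal column. A sketch, with the refusal field name and percentage scale as assumptions:

    // Sketch: keep only rows at or below a refusal-rate ceiling.
    // Assumes refusal is a percentage in [0, 100]; the field name is illustrative.
    interface RefusalRow {
      refusal: number;
    }

    function filterByRefusal<T extends RefusalRow>(rows: T[], maxRefusal: number): T[] {
      return rows.filter((row) => row.refusal <= maxRefusal);
    }

    // Example: hide anything that refuses more than 5% of prompts.
    // const permissive = filterByRefusal(rows, 5);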

How should I interpret $ mToK?

$ mToK is displayed as a compact cost/efficiency-style metric; the name reads like dollars per million tokens. Within this demo, treat it as a proxy for model value or as a custom benchmark-specific conversion score rather than a quality measure.
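If you read $ mToK as a cost per million tokens (an assumption; the demo leaves the interpretation open), one simple way to fold it into a comparison is score per dollar:

    // Sketch: a value proxy combining TOTAL with $ mToK.
    // Assumes $ mToK behaves like a cost (lower is cheaper); names are illustrative.
    function valuePerDollar(total: number, dollarsPerMTok: number): number {
      if (dollarsPerMTok <= 0) return Infinity; // free or unpriced models dominate
      return total / dollarsPerMTok;
    }

    // Example: TOTAL 85 at $2.50 per mToK -> 34 points of score per dollar.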

Can I use this on mobile?

Yes. The layout is responsive, the table scrolls horizontally on smaller screens, and the controls remain easy to reach.

Visual guide

  • Green bars indicate stronger performance.
  • Purple bars indicate refinement-heavy outputs.
  • Red bars indicate failures or censoring pressure.
  • Use the theme button to switch between light and dark mode.
  • Export to CSV if you want to reuse the benchmark data elsewhere (a minimal export sketch follows below).
Tip: Click any column header to sort by that metric.
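For the CSV export mentioned in the list above, here is a minimal sketch in TypeScript using standard browser APIs. The row fields and the download approach are assumptions about one reasonable implementation, not the dashboard's actual code.

    // Sketch: serialize benchmark rows to CSV and trigger a browser download.
    // Column names mirror the table headers; field names are assumptions.
    interface ExportRow {
      model: string;
      total: number;
      pass: number;
      refine: number;
      fail: number;
    }

    function toCsv(rows: ExportRow[]): string {
      const header = "Model,TOTAL,Pass,Refine,Fail";
      const escape = (v: string | number) => `"${String(v).replace(/"/g, '""')}"`;
      const lines = rows.map((r) =>
        [r.model, r.total, r.pass, r.refine, r.fail].map(escape).join(","),
      );
      return [header, ...lines].join("\n");
    }

    function downloadCsv(rows: ExportRow[], filename = "benchmarks.csv"): void {
      const blob = new Blob([toCsv(rows)], { type: "text/csv" });
      const url = URL.createObjectURL(blob);
      const a = document.createElement("a");
      a.href = url;
      a.download = filename;
      a.click();
      URL.revokeObjectURL(url);
    }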