| Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
TOTAL is the headline benchmark score for each model. In this demo, it acts as the primary overall performance indicator used for ranking.
Pass, Refine, and Fail provide a quick breakdown of outcome quality: Pass is the strongest signal, Refine marks partially correct responses that need improvement, and Fail captures incorrect or unusable outputs.
Refusal estimates how often a model declines to answer. A lower refusal rate can be useful, but the ideal value depends on safety and policy goals.
$ mToK is displayed as a compact cost/efficiency metric, commonly read as dollars per million tokens. You can treat it as a proxy for model value or a custom benchmark-specific conversion score.
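To illustrate how outcome columns like these might roll up into a headline number, here is a minimal sketch. The actual TOTAL formula is not specified in this demo, so the weighting scheme, field names, and the `ModelResult` class below are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    # Hypothetical per-model outcome counts; the real benchmark's
    # aggregation rules are not documented here.
    passes: int
    refines: int
    fails: int
    refusals: int

    def _attempts(self) -> int:
        return self.passes + self.refines + self.fails + self.refusals

    def total_score(self, refine_weight: float = 0.5) -> float:
        """Example headline score: full credit for Pass, partial
        credit for Refine, none for Fail or Refusal."""
        attempts = self._attempts()
        if attempts == 0:
            return 0.0
        return 100.0 * (self.passes + refine_weight * self.refines) / attempts

    def refusal_rate(self) -> float:
        """Share of prompts the model declined to answer."""
        attempts = self._attempts()
        return self.refusals / attempts if attempts else 0.0

result = ModelResult(passes=70, refines=20, fails=8, refusals=2)
print(round(result.total_score(), 1))  # 80.0
print(result.refusal_rate())           # 0.02
```

With a half-credit weight for Refine, 70 passes and 20 refines out of 100 attempts yield a headline score of 80.0; changing `refine_weight` shifts how much partially correct output counts toward the ranking.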
Yes, the layout is responsive: the table scrolls horizontally on smaller screens, and the controls remain easy to reach.