LLM Benchmark Table

A sleek, interactive comparison of AI model performance across multiple metrics.
Higher is better (values are normalized).
Columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $mToK, Reason, STEM, Utility, Code, Censor

FAQ

What is this benchmark and how should I interpret the TOTAL score?

The TOTAL score is a composite indicator derived from several primary metrics (Pass, Refine, Fail, Refusal, and $mToK) that summarizes the reliability, usefulness, and safety of an LLM. Higher is generally better; the score is shown out of 100 for quick comparison.
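As a rough illustration of how such a composite might be computed, here is a sketch of a weighted combination. The weights, the penalty treatment of Fail/Refusal, and the cost normalization are all assumptions for illustration, not the page's actual formula.

```typescript
// Hypothetical metric record; all percentage metrics assumed on a 0-100 scale.
interface ModelMetrics {
  pass: number;    // higher is better
  refine: number;  // higher is better
  fail: number;    // lower is better
  refusal: number; // lower is better
  mTok: number;    // cost per million tokens, lower is better
}

// Illustrative weighted composite out of 100. Weights are assumed, not documented.
function totalScore(m: ModelMetrics, maxCostInView: number): number {
  // Turn cost into a 0-100 "cheapness" score relative to the most expensive model in view.
  const costScore = maxCostInView > 0 ? 100 * (1 - m.mTok / maxCostInView) : 100;
  const score =
    0.4 * m.pass +
    0.2 * m.refine +
    0.2 * (100 - m.fail) +    // Fail is a penalty, so invert it
    0.1 * (100 - m.refusal) + // likewise for Refusal
    0.1 * costScore;
  return Math.round(score * 10) / 10; // one decimal place
}
```

For example, a model with Pass 80, Refine 70, Fail 10, Refusal 5, and a cost of 5 against a view maximum of 10 would score 78.5 under these assumed weights.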

How are the per-column bars calculated?

Each numeric column shows a bar representing the value as a percent of the observed maximum for that column in the current view. This provides a quick visual sense of relative performance among models in the filtered/sorted state.

Can I export the data?

Yes. Click "Export CSV" to download the current view as a comma-separated file. The export reflects any filters you have applied.
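A minimal sketch of the serialization step behind such an export, assuming RFC 4180-style quoting (fields containing commas, quotes, or newlines get wrapped in double quotes); this is an illustration, not the page's actual implementation:

```typescript
// Serialize the current view (headers plus filtered rows) to CSV text.
function toCsv(headers: string[], rows: (string | number)[][]): string {
  const escape = (value: string | number): string => {
    const s = String(value);
    // Quote the field if it contains a comma, quote, or newline; double any quotes.
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  return [headers, ...rows]
    .map(row => row.map(escape).join(","))
    .join("\n");
}
```

In a browser, the resulting string would typically be wrapped in a Blob and handed to a temporary download link.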

How is dark mode stored?

Dark mode preference is saved in your browser's localStorage and persists across visits. Use the toggle to switch themes.
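The persistence pattern can be sketched as follows; the storage key `"theme"` is an assumption, and the functions accept any Storage-like object so the sketch also runs outside a browser (in a browser you would pass `window.localStorage`):

```typescript
// Minimal subset of the Web Storage interface used here.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Read the saved preference, defaulting to light when nothing is stored.
function loadTheme(storage: StorageLike): "dark" | "light" {
  return storage.getItem("theme") === "dark" ? "dark" : "light";
}

// Flip the theme, persist it, and return the new value.
function toggleTheme(storage: StorageLike): "dark" | "light" {
  const next = loadTheme(storage) === "dark" ? "light" : "dark";
  storage.setItem("theme", next);
  return next;
}
```

Because the value is written on every toggle, the preference survives page reloads and return visits without any server round trip.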