Sleek benchmark dashboard for comparing modern AI models

Track model quality, reliability, cost, and specialization.

Explore a responsive benchmark table with sorting, search, category filters, summary cards, visual performance aids, and a built-in FAQ. Designed as a single-file experience with no external dependencies.

Explore Benchmarks

Top TOTAL score

Lowest cost ($ mToK)

Best Code specialist

Average Pass rate

Across all listed models

Model	TOTAL	Pass	Refine	Fail	Refusal	$ mToK	Reason	STEM	Utility	Code	Censor

Category Leaders

Higher is better for TOTAL & skills

Lower is better for cost

Censor reflects stricter refusal tendency
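As a rough illustration of how the summary cards and category leaders could be derived, here is a minimal TypeScript sketch. The `SummaryRow` shape, the `leaderBy` and `averagePass` helpers, and the sample rows are all hypothetical; they are not taken from the dashboard's actual code or data.

```typescript
// Hypothetical row shape: field names mirror the table columns but are assumptions.
interface SummaryRow {
  model: string;
  total: number; // TOTAL column, higher is better
  pass: number;  // Pass column, higher is better
  cost: number;  // "$ mToK" column, lower is better
  code: number;  // Code column, higher is better
}

type Direction = "higher" | "lower";
type NumericKey = keyof Omit<SummaryRow, "model">;

// Returns the row that leads the given metric, or undefined for an empty list.
// On ties, the earliest row in the list wins.
function leaderBy(
  rows: SummaryRow[],
  metric: NumericKey,
  direction: Direction
): SummaryRow | undefined {
  return rows.reduce<SummaryRow | undefined>((best, row) => {
    if (best === undefined) return row;
    const better =
      direction === "higher"
        ? row[metric] > best[metric]
        : row[metric] < best[metric];
    return better ? row : best;
  }, undefined);
}

// Mean of the Pass column across all listed rows, or undefined if there are none.
function averagePass(rows: SummaryRow[]): number | undefined {
  if (rows.length === 0) return undefined;
  return rows.reduce((sum, r) => sum + r.pass, 0) / rows.length;
}

// Illustrative placeholder data, not real benchmark results.
const sampleRows: SummaryRow[] = [
  { model: "model-a", total: 80, pass: 70, cost: 2.0, code: 60 },
  { model: "model-b", total: 75, pass: 60, cost: 0.5, code: 85 },
  { model: "model-c", total: 60, pass: 50, cost: 1.0, code: 40 },
];

const topTotal = leaderBy(sampleRows, "total", "higher");
const lowestCost = leaderBy(sampleRows, "cost", "lower");
const bestCode = leaderBy(sampleRows, "code", "higher");
```

Passing the direction explicitly keeps the "higher is better" and "lower is better" rules from the legend in one place instead of scattering comparisons across the cards.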

What to look for

Balanced performance profile

Reasoning-heavy workflows

Code generation & debugging

Budget sensitivity

Use the table to compare tradeoffs: a model can lead in reasoning but cost more, or be cheaper while sacrificing pass rate. The most useful choice depends on your product constraints and safety posture.

FAQ

Frequently asked questions

Clear definitions for the benchmark metrics shown above.

What does TOTAL represent?

TOTAL is a composite score summarizing overall benchmark performance. It is typically influenced by pass rate, quality across categories like Reason, STEM, Utility, and Code, plus penalties from failures or refusals depending on methodology.
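The exact formula behind TOTAL is not given here, so the following TypeScript sketch shows only one possible way such a composite could be assembled. The `ScoreInputs` shape, the `compositeScore` function, and every weight in `exampleWeights` are assumptions chosen for illustration, not the benchmark's methodology.

```typescript
// Hypothetical inputs, all assumed to be on a 0-100 scale.
interface ScoreInputs {
  pass: number;
  fail: number;
  refusal: number;
  reason: number;
  stem: number;
  utility: number;
  code: number;
}

// Hypothetical weights; the real methodology may differ entirely.
interface ScoreWeights {
  pass: number;
  skills: number;
  failPenalty: number;
  refusalPenalty: number;
}

const exampleWeights: ScoreWeights = {
  pass: 0.5,
  skills: 0.5,
  failPenalty: 0.1,
  refusalPenalty: 0.1,
};

// Weighted pass rate plus mean skill score, minus penalties, clamped to 0-100.
function compositeScore(m: ScoreInputs, w: ScoreWeights): number {
  const skillMean = (m.reason + m.stem + m.utility + m.code) / 4;
  const raw =
    w.pass * m.pass +
    w.skills * skillMean -
    w.failPenalty * m.fail -
    w.refusalPenalty * m.refusal;
  return Math.min(100, Math.max(0, raw));
}
```

With a pass rate of 80, fail and refusal rates of 10 each, and all four skills at 60, these example weights give 40 + 30 - 1 - 1 = 68.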

What is the difference between Pass, Refine, Fail, and Refusal?

Pass means the model solved a task successfully. Refine means it produced a partially useful answer that may need editing or follow-up. Fail means the output was incorrect or insufficient. Refusal means the model declined to answer, often due to safety policies or uncertainty.
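To make the four outcomes concrete, the TypeScript sketch below turns a list of per-task outcomes into percentage rates. It assumes each task receives exactly one of the four labels, which the definitions above suggest but do not state; the `Outcome` type and `outcomeRates` function are hypothetical names.

```typescript
// Assumption: every task is labeled with exactly one of these outcomes.
type Outcome = "pass" | "refine" | "fail" | "refusal";

// Returns the share of each outcome as a percentage of all tasks.
// An empty task list yields all zeros rather than dividing by zero.
function outcomeRates(outcomes: Outcome[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
  for (const o of outcomes) {
    counts[o] += 1;
  }

  const rates: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
  const n = outcomes.length;
  if (n === 0) return rates;

  for (const key of Object.keys(counts) as Outcome[]) {
    rates[key] = (counts[key] / n) * 100;
  }
  return rates;
}
```

Under that one-label-per-task assumption the four rates always sum to 100, so a rise in Refusal has to come out of Pass, Refine, or Fail.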

What does “$ mToK” mean?

In this table, “$ mToK” is a simplified, normalized cost figure for comparing relative pricing across models. Lower values are better when budget efficiency matters.

How should I interpret Censor?

Censor reflects how strongly a model tends toward restrictive or guarded behavior in edge cases. A higher value may indicate stricter filtering or more frequent conservative refusals. Whether that is good or bad depends on your application and compliance needs.

Can I sort and filter the table?

Yes. You can search by model name, use filters for reasoning, coding, and cost, click any table header to sort, and switch between dark and light themes for comfort.
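For readers curious how name search and header sorting might fit together, here is a minimal TypeScript sketch. The `TableRow` and `ViewState` shapes and the `applyView` function are hypothetical and cover only search and sort; the category filters and theme toggle are left out.

```typescript
// Hypothetical subset of the table columns.
interface TableRow {
  model: string;
  total: number;
  reason: number;
  code: number;
  cost: number; // "$ mToK" column
}

// Hypothetical view state: the search box text plus the active sort header.
interface ViewState {
  query: string;
  sortKey: keyof TableRow;
  ascending: boolean;
}

// Filters rows by a case-insensitive substring match on the model name,
// then sorts by the chosen column. The input array is not mutated,
// because filter() already returns a new array.
function applyView(rows: TableRow[], view: ViewState): TableRow[] {
  const q = view.query.trim().toLowerCase();
  const filtered = rows.filter((r) => r.model.toLowerCase().indexOf(q) !== -1);
  const sign = view.ascending ? 1 : -1;

  return filtered.sort((a, b) => {
    const x = a[view.sortKey];
    const y = b[view.sortKey];
    if (typeof x === "string" && typeof y === "string") {
      return sign * x.localeCompare(y);
    }
    return sign * ((x as number) - (y as number));
  });
}
```

Clicking a header would then amount to updating `sortKey` (and flipping `ascending` when the same header is clicked twice) before calling `applyView` again.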