LLM Benchmark Table

A living scoreboard for frontier language models. Explore performance signatures across reasoning, STEM depth, utility, coding prowess, and policy temperament. Use the controls to slice the landscape and discover which model best fits your workload and risk profile.

Benchmark Highlights

Model | Total | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor

FAQ

What do the Pass / Refine / Fail / Refusal buckets represent?

Pass counts tasks the model solved correctly on the first try. Refine means the model needed light coaching or a retry to reach an acceptable answer. Fail covers results that were still incorrect or incomplete after refinement. Refusal means the model declined the task for policy or safety reasons.
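As a rough illustration of how the four buckets could be turned into the shares shown in a stacked bar, here is a minimal Python sketch. The `Outcomes` class and its field names are hypothetical, and it assumes the Total column is simply the sum of the four buckets, which the page does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Hypothetical per-model outcome counts (names are illustrative)."""
    passed: int
    refined: int
    failed: int
    refused: int

    @property
    def total(self) -> int:
        # Assumption: Total is the sum of the four buckets.
        return self.passed + self.refined + self.failed + self.refused

    def shares(self) -> dict[str, float]:
        """Return each bucket as a fraction of the total (0.0 if no tasks)."""
        counts = {
            "pass": self.passed,
            "refine": self.refined,
            "fail": self.failed,
            "refusal": self.refused,
        }
        if self.total == 0:
            return {name: 0.0 for name in counts}
        return {name: count / self.total for name, count in counts.items()}


# Made-up counts, used only to show the call shape.
example = Outcomes(passed=6, refined=2, failed=1, refused=1)
print(example.shares())
```

The fractions sum to 1 whenever at least one task was run, which is what a stacked outcome bar needs.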

How is $ mToK calculated?

The cost metric standardizes model pricing as dollars per million tokens (prompt + completion). Values are normalized to on-demand pricing at the time of evaluation and rounded to three decimal places for readability.
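One plausible way to arrive at a single dollars-per-million-tokens figure is a token-weighted average of the prompt and completion prices. The sketch below assumes that blending rule, which the page does not confirm; the function name and parameters are hypothetical.

```python
def cost_per_million_tokens(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_m: float,
    completion_price_per_m: float,
) -> float:
    """Blend prompt and completion prices into one $/1M-token figure.

    Assumption: the blended price is the token-weighted average of the two
    list prices (each given in dollars per million tokens).
    """
    total_tokens = prompt_tokens + completion_tokens
    if total_tokens <= 0:
        raise ValueError("at least one token is required")
    weighted = (
        prompt_tokens * prompt_price_per_m
        + completion_tokens * completion_price_per_m
    )
    # Round to three decimal places, matching the table's stated precision.
    return round(weighted / total_tokens, 3)


# Made-up prices and token counts, for illustration only.
print(cost_per_million_tokens(750_000, 250_000, 2.0, 10.0))
```

Under this assumption, a workload with more completion tokens pulls the figure toward the completion price, so the same model can show a different $ mToK on a different task mix.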

What do the Reason, STEM, Utility, Code, and Censor scores mean?

These sub-scores are normalized to a 0.00 to 1.00 scale. Reason measures multistep reasoning accuracy, STEM focuses on quantitative and scientific problems, Utility covers general productivity tasks, and Code measures software generation and debugging. For those four, higher is better. Censor reflects moderation strictness, so a lower value means a more permissive model. Visual bars in the table encode these scores for quick scanning.
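To make the bar encoding concrete, here is a small Python sketch that maps a normalized score to a fixed-width text bar. It is an illustration only: the page renders its bars in HTML/CSS, and the `score_bar` helper, its width, and the clamping behavior are assumptions.

```python
def score_bar(score: float, width: int = 10) -> str:
    """Render a 0.00 to 1.00 score as a fixed-width text bar.

    Assumption: out-of-range values are clamped rather than rejected.
    """
    clamped = max(0.0, min(1.0, score))
    # Round half up so a score of 0.25 fills 3 of 10 cells.
    filled = int(clamped * width + 0.5)
    return "#" * filled + "-" * (width - filled)


# Made-up scores, for illustration only.
for name, value in [("Reason", 0.5), ("Code", 0.9), ("Censor", 0.1)]:
    print(f"{name:<7} {score_bar(value)} {value:.2f}")
```

Keep in mind that a short Censor bar indicates a more permissive model, whereas a short bar in the other columns indicates weaker performance.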

Which features make the table interactive?

You can sort by any column, search by model name, and filter by minimum pass rate, refusal ceiling, and cost tolerance. Hover states and stacked bars clarify the distribution of outcomes, while summary cards and the highlight chart surface the top performers automatically.
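The search, filter, and sort controls described above can be sketched as two small functions. This is a Python illustration of the logic only, not the page's actual JavaScript; the row keys, function names, and the definition of pass rate as pass count over the sum of all four buckets are assumptions.

```python
def filter_rows(rows, query="", min_pass_rate=0.0,
                max_refusal_rate=1.0, max_cost=float("inf")):
    """Keep rows matching the name query and all three thresholds."""
    needle = query.strip().lower()
    kept = []
    for row in rows:
        total = row["pass"] + row["refine"] + row["fail"] + row["refusal"]
        # Assumption: rates are fractions of all attempted tasks.
        pass_rate = row["pass"] / total if total else 0.0
        refusal_rate = row["refusal"] / total if total else 0.0
        if needle and needle not in row["model"].lower():
            continue
        if pass_rate < min_pass_rate:
            continue
        if refusal_rate > max_refusal_rate:
            continue
        if row["cost"] > max_cost:
            continue
        kept.append(row)
    return kept


def sort_rows(rows, column, descending=True):
    """Return a new list sorted by the given column key."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


# Made-up rows with hypothetical model names, for illustration only.
rows = [
    {"model": "model-a", "pass": 8, "refine": 1, "fail": 1, "refusal": 0, "cost": 3.0},
    {"model": "model-b", "pass": 5, "refine": 2, "fail": 1, "refusal": 2, "cost": 0.5},
    {"model": "other-c", "pass": 9, "refine": 0, "fail": 1, "refusal": 0, "cost": 12.0},
]
visible = sort_rows(filter_rows(rows, min_pass_rate=0.6, max_cost=5.0), "cost")
print([row["model"] for row in visible])
```

Filtering before sorting keeps the sort cheap and leaves the original list untouched, which makes it easy to reapply the controls when a threshold changes.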

How often is the benchmark updated?

The pipeline is designed for weekly refreshes. Models that receive major architecture or pricing updates are re-tested ad hoc to keep the comparison trustworthy.

Crafted with pure HTML/CSS/JS, with no external dependencies. Toggle dark mode to match your environment.