AI Model Performance Benchmarks

Explore and compare the latest large language models across reasoning, coding, STEM, and more — all in one interactive table.

At-a-glance stats: Models Tracked · Highest TOTAL · Avg $/mToK · Avg Pass Rate

Leaderboard columns: Model · TOTAL (overall weighted score) · Pass · Refine · Fail · Refusal · $/mToK (cost per million tokens) · Reason · STEM · Utility · Code · Censor (content filtering level; lower = less censored)

Visual Analytics

Interactive charts to help you compare model performance at a glance.

Charts: Top 10 by TOTAL Score · Code Performance Comparison · Reasoning Scores · STEM Performance (with Pass / Refine / Fail / Refusal breakdowns)

Frequently Asked Questions

Everything you need to know about these benchmarks.

What does the TOTAL score mean?

TOTAL is the overall weighted composite score for each model. It aggregates performance across all benchmark categories (Reason, STEM, Utility, Code) and factors in Pass, Refine, Fail, and Refusal rates. A higher TOTAL indicates stronger overall capability across diverse tasks. It is calculated using a proprietary weighting system that emphasizes real-world utility.
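For intuition only, here is a minimal TypeScript sketch of how a composite like TOTAL could be assembled. The interfaces, the category weights, and the outcome adjustment are all illustrative assumptions; the actual proprietary weighting is not reproduced here.

```typescript
// Illustrative sketch only: the real TOTAL uses unpublished, proprietary weights.

interface CategoryScores {
  reason: number;  // each 0-100
  stem: number;
  utility: number;
  code: number;
}

interface OutcomeRates {
  pass: number;    // fractions in [0, 1] that sum to 1
  refine: number;
  fail: number;
  refusal: number;
}

// Assumed weights, chosen arbitrarily for this example.
const CATEGORY_WEIGHTS = { reason: 0.3, stem: 0.2, utility: 0.3, code: 0.2 };

function totalScore(scores: CategoryScores, rates: OutcomeRates): number {
  // Weighted mean of the four category scores.
  const base =
    scores.reason * CATEGORY_WEIGHTS.reason +
    scores.stem * CATEGORY_WEIGHTS.stem +
    scores.utility * CATEGORY_WEIGHTS.utility +
    scores.code * CATEGORY_WEIGHTS.code;

  // Assumed outcome adjustment: full credit for passes, half credit for
  // refines, none for fails or refusals.
  return base * (rates.pass + 0.5 * rates.refine);
}
```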
What do Pass, Refine, Fail, and Refusal mean?

Pass — The model produced a correct, usable response on the first attempt.
Refine — The response needed minor corrections or a follow-up to become correct.
Fail — The model produced an incorrect, irrelevant, or nonsensical response.
Refusal — The model declined to answer, citing safety or policy reasons.

These categories give a more nuanced picture than simple accuracy, showing how gracefully a model handles edge cases.
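As a concrete illustration, a sketch like the following could tally these rates from per-prompt results; the Outcome type and the tallying logic are assumptions, not the site's actual pipeline.

```typescript
// Sketch: computing outcome rates from a list of per-prompt results.

type Outcome = "pass" | "refine" | "fail" | "refusal";

function outcomeRates(results: Outcome[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
  for (const r of results) counts[r] += 1;

  const total = results.length || 1; // guard against an empty run
  return {
    pass: counts.pass / total,
    refine: counts.refine / total,
    fail: counts.fail / total,
    refusal: counts.refusal / total,
  };
}

// Example: 10 graded prompts with 6 passes, 2 refines, 1 fail, 1 refusal
// yield { pass: 0.6, refine: 0.2, fail: 0.1, refusal: 0.1 }.
```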
What does $/mToK mean?

$/mToK is the estimated cost in US dollars per million tokens (input and output averaged). This helps you evaluate the cost-efficiency of each model: lower values mean cheaper inference. Some models are free or open-source (shown as $0.00), while commercial APIs can range from $0.15 to $60+ per million tokens. Costs are based on publicly available API pricing at the time of evaluation.
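Since the figure is a simple average of input and output prices, the computation reduces to one line; the sample prices below are hypothetical.

```typescript
// Blended $/mToK: the average of input and output prices per million tokens.
function blendedCostPerMillion(inputUsd: number, outputUsd: number): number {
  return (inputUsd + outputUsd) / 2;
}

// Hypothetical pricing: $0.15/M input and $0.60/M output blend to $0.375/mToK.
console.log(blendedCostPerMillion(0.15, 0.60)); // 0.375
```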
What does the Censor score measure?

The Censor score measures how heavily a model filters or restricts its responses. A lower score means the model is less restrictive and more willing to engage with a wider range of topics; a higher score indicates more aggressive content filtering. It is measured by testing the model with a standardized set of prompts across sensitive-but-legitimate topics (medical, legal, creative writing, etc.) and tracking refusal and hedging rates.
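One plausible way to fold refusal and hedging rates into a single score is sketched below; the 0-10 scale and the half weight on hedging are assumptions, not the published formula.

```typescript
// Sketch: a censor score from refusal and hedging rates over probe prompts.

interface ProbeResult {
  refused: boolean; // the model declined outright
  hedged: boolean;  // the model answered, but evasively or with heavy caveats
}

function censorScore(probes: ProbeResult[]): number {
  const n = probes.length || 1;
  const refusalRate = probes.filter(p => p.refused).length / n;
  const hedgeRate = probes.filter(p => !p.refused && p.hedged).length / n;

  // Assumed scheme: refusals count fully, hedging counts half, scaled to 0-10.
  return 10 * (refusalRate + 0.5 * hedgeRate);
}
```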
How often is the data updated?

We aim to update benchmark data weekly as new models are released and existing ones are updated. Model versions and API changes can significantly impact results, so we re-evaluate models regularly. The "Last Updated" timestamp at the bottom of the page shows the most recent data refresh. Community contributions and corrections are welcome.
What do the four benchmark categories cover?

Reason — Logical deduction, multi-step reasoning, math word problems, causal inference, and chain-of-thought tasks.
STEM — Science questions (physics, chemistry, biology), engineering problems, and advanced mathematics.
Utility — Summarization, translation, creative writing, instruction following, and general knowledge Q&A.
Code — Code generation, debugging, algorithm implementation, and code explanation across multiple programming languages.

Each category uses a curated set of 200+ diverse test prompts with verified reference answers.
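To make the setup concrete, here is a small sketch of how such a prompt set might be evaluated; the field names, the exact-match grader, and the evaluation loop are illustrative assumptions, not the actual harness.

```typescript
// Sketch: evaluating a category's prompt set against reference answers.

type Category = "reason" | "stem" | "utility" | "code";
type Outcome = "pass" | "refine" | "fail" | "refusal";

interface TestPrompt {
  category: Category;
  prompt: string;
  reference: string; // verified reference answer
}

// Stand-in grader: exact match only; a real grader would be far richer.
function grade(response: string, reference: string): Outcome {
  return response.trim() === reference.trim() ? "pass" : "fail";
}

async function evaluateCategory(
  prompts: TestPrompt[],
  ask: (prompt: string) => Promise<string>, // calls the model under test
): Promise<Outcome[]> {
  const outcomes: Outcome[] = [];
  for (const t of prompts) {
    const response = await ask(t.prompt);
    outcomes.push(grade(response, t.reference));
  }
  return outcomes;
}
```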
Can I suggest a model for benchmarking?

Absolutely! We welcome community contributions. If you'd like to suggest a model for benchmarking or report a data discrepancy, please reach out through our GitHub repository or contact form. We evaluate all suggestions and prioritize models based on community interest and availability. Open-source models are especially encouraged.