Comprehensive AI Model Benchmarks

Compare leading language models across reasoning, coding, STEM, and utility tasks with transparent metrics.

Summary statistics: Models Tested, Average Score, Total Tests, Last Updated.
Leaderboard columns: Model, TOTAL, Pass, Refine, Fail, Refusal, $ mToK, Reason, STEM, Utility, Code, Censor.

Frequently Asked Questions

Understanding our benchmark methodology and metrics

What metrics determine the TOTAL score?
The TOTAL score is a weighted composite of Pass rate (35%), Reason (20%), STEM (15%), Utility (15%), and Code (15%). Pass rate measures successful first-attempt completions. The higher weights reflect the real-world importance of reasoning and reliable task completion.
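As a concrete illustration, here is a minimal sketch of how such a weighted composite could be computed; the field names and the example scores are hypothetical, not taken from our harness.

```python
# Hypothetical sketch of the TOTAL composite; the weights mirror the FAQ above,
# but the field names and example scores are illustrative.
WEIGHTS = {
    "pass_rate": 0.35,  # successful first-attempt completions
    "reason":    0.20,
    "stem":      0.15,
    "utility":   0.15,
    "code":      0.15,
}

def total_score(scores: dict) -> float:
    """Weighted composite of category scores, each on a 0-100 scale."""
    return sum(scores[category] * weight for category, weight in WEIGHTS.items())

# Example: 90% pass rate, 80 Reason, 75 STEM, 85 Utility, 70 Code
print(total_score({"pass_rate": 90, "reason": 80, "stem": 75, "utility": 85, "code": 70}))
# -> 82.0 (up to floating-point rounding)
```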
How is the $ mToK pricing calculated?
$ mToK represents the blended cost per million tokens, calculated as input_price × 0.7 + output_price × 0.3, a weighting based on typical usage patterns. This normalized metric enables fair cost comparison across different pricing structures and token limits.
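For example, a minimal sketch of the blended-price calculation, assuming per-million-token prices; the function name and the $3/$15 example prices are illustrative, not drawn from the leaderboard.

```python
# Hypothetical sketch of the blended $ mToK metric; example prices are illustrative.
def blended_price(input_price: float, output_price: float) -> float:
    """Blended cost per million tokens, weighted 70% input / 30% output."""
    return input_price * 0.7 + output_price * 0.3

# Example: $3.00 per million input tokens, $15.00 per million output tokens
print(blended_price(3.00, 15.00))  # -> 6.6 (subject to floating-point rounding)
```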
What does the Reason category evaluate?
Reason tests multi-step logical deduction, chain-of-thought reasoning, mathematical proofs, and complex problem decomposition. Tasks include logic puzzles, syllogisms, counterfactual reasoning, and scenarios requiring intermediate reasoning steps to reach correct conclusions.
How is the Code benchmark structured?
Code evaluates programming ability across 12 languages (Python, JS, TypeScript, Rust, Go, Java, C++, etc.). Tests include algorithm implementation, bug fixing, refactoring, code explanation, and translation between languages. We use HumanEval, MBPP, and custom real-world scenarios.
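As a rough sketch of how a single coding task can be scored pass/fail, the snippet below runs a candidate solution against unit tests; the task, assertions, and helper names are hypothetical and do not reflect our actual harness.

```python
# Hypothetical sketch of a pass/fail check for one coding task.
# The candidate solution would normally be generated by the model under test.
generated_solution = """
def fizzbuzz(n):
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)
"""

def passes_unit_tests(solution_code: str) -> bool:
    """Execute the candidate code and run the task's unit tests."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # load the candidate function
        fn = namespace["fizzbuzz"]
        assert fn(3) == "Fizz"
        assert fn(10) == "Buzz"
        assert fn(30) == "FizzBuzz"
        assert fn(7) == "7"
        return True                      # all assertions passed
    except Exception:
        return False                     # any error or failed test counts as a fail

print(passes_unit_tests(generated_solution))  # -> True
```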
What does the Censor metric indicate?
Censor measures inappropriate refusal rates on legitimate, safe requests. Lower percentages indicate better calibration. We test edge cases where over-cautious safety filters incorrectly block benign queries about science, history, programming, and general knowledge.
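A minimal sketch of how such a refusal rate could be computed over a set of benign prompts; the prompts and results below are illustrative, not real benchmark data.

```python
# Hypothetical sketch of the Censor (inappropriate refusal) rate.
# Each entry records whether the model refused a benign, clearly safe prompt.
benign_results = [
    {"prompt": "Explain how vaccines work.",       "refused": False},
    {"prompt": "Write a port scanner in Python.",  "refused": True},
    {"prompt": "Summarize the causes of WWI.",     "refused": False},
    {"prompt": "Describe how encryption works.",   "refused": False},
]

refusal_rate = sum(r["refused"] for r in benign_results) / len(benign_results)
print(f"Censor: {refusal_rate:.1%}")  # -> Censor: 25.0%
```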
How frequently is the benchmark updated?
Full benchmark suites run monthly. Major model releases are evaluated within 72 hours. We maintain version history for trend analysis. All test cases, prompts, and evaluation criteria are publicly documented for reproducibility.