Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
OmniAI-X Pro | 85.5 | 75 | 10 | 5 | 0 | 0.50 | 88 | 90 | 82 | 85 | 95 |
SynthBot-7 Turbo | 82.1 | 70 | 15 | 8 | 2 | 0.30 | 85 | 88 | 80 | 81 | 90 |
QuantumLeap v3 | 78.9 | 65 | 18 | 12 | 5 | 1.00 | 75 | 80 | 77 | 72 | 85 |
Cognito-Prime | 90.2 | 82 | 8 | 3 | 1 | 0.75 | 92 | 94 | 90 | 88 | 98 |
NexusAI-Base | 75.0 | 60 | 20 | 15 | 5 | 0.10 | 70 | 72 | 75 | 78 | 80 |
Aether-5 Mini | 72.3 | 58 | 22 | 18 | 2 | 0.05 | 68 | 70 | 71 | 75 | 88 |
OracleMind-XL | 88.0 | 78 | 12 | 6 | 4 | 1.20 | 90 | 91 | 85 | 86 | 92 |
LogicFlow-G | 80.5 | 72 | 13 | 9 | 6 | 0.40 | 82 | 85 | 78 | 80 | 84 |
Frequently Asked Questions
What do the columns mean?
- Model: The name of the AI model being benchmarked.
- TOTAL: An overall performance score, often an average or weighted sum.
- Pass: Percentage (or count) of tasks completed successfully without issues.
- Refine: Percentage (or count) of tasks requiring minor adjustments or clarifications.
- Fail: Percentage (or count) of tasks the model failed to complete acceptably.
- Refusal: Percentage (or count) of tasks the model refused to attempt (e.g., due to safety filters).
- $/MTok: Estimated cost in US dollars per million tokens processed (input + output; varies by provider).
- Reason: Score/Rating on reasoning capabilities.
- STEM: Score/Rating on Science, Technology, Engineering, and Math problems.
- Utility: Score/Rating on general helpfulness and task completion across various domains.
- Code: Score/Rating on code generation, explanation, and debugging.
- Censor: Score/Rating of how heavily the model censors or refuses potentially sensitive (but benign) prompts; a higher score generally indicates less censorship. (A code sketch of one table row follows this list.)
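For readers who want to handle the table programmatically, here is a minimal sketch of one row as a Python dataclass. The field names and types are illustrative assumptions derived from the column descriptions above, not an official schema.

```python
# A minimal sketch of one leaderboard row as a Python dataclass.
# Field names and types are illustrative assumptions, not an official schema.
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    model: str            # Model name
    total: float          # Overall performance score
    passed: float         # Tasks completed without issues ("Pass" column)
    refine: float         # Tasks needing minor adjustments
    fail: float           # Tasks failed
    refusal: float        # Tasks refused (e.g., safety filters)
    cost_per_mtok: float  # Estimated $ per million tokens (input + output)
    reason: float         # Reasoning score
    stem: float           # STEM score
    utility: float        # General helpfulness score
    code: float           # Coding score
    censor: float         # Censorship score (higher generally means less censorship)

# Example: the OmniAI-X Pro row from the table above.
omni = BenchmarkRow("OmniAI-X Pro", 85.5, 75, 10, 5, 0, 0.50, 88, 90, 82, 85, 95)
```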
How is the TOTAL score calculated?
The method for calculating the TOTAL score varies by benchmark source: it is often a weighted average of the category scores (Reasoning, STEM, Code, and so on) or derived from Pass/Fail rates on a specific test suite. Check the benchmark's original documentation for precise details; this table provides a comparative overview.
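As a purely illustrative sketch, assuming a simple weighted average over the four category scores (the weights below are made up; a real benchmark would document its own formula), a TOTAL calculation might look like this:

```python
# Hypothetical weighted-average TOTAL. The category weights are made up
# for illustration; real benchmarks define and document their own weighting.
WEIGHTS = {"Reason": 0.30, "STEM": 0.25, "Utility": 0.20, "Code": 0.25}

def weighted_total(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of the given category scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(scores[name] * w for name, w in weights.items())

# OmniAI-X Pro's category scores from the table, with the made-up weights:
print(round(weighted_total({"Reason": 88, "STEM": 90, "Utility": 82, "Code": 85}), 2))
# ~86.55 here, not the table's 85.5 -- the table presumably uses a different formula.
```

A Pass/Fail-based TOTAL would be computed differently; the point is only that the aggregation rule is benchmark-specific.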
Where does this data come from?
This table uses hypothetical sample data for illustrative purposes. In a real-world scenario, the data would be compiled from established LLM benchmarks such as MMLU, HELM, AlpacaEval, MT-Bench, and HumanEval, or from custom evaluation pipelines. Always verify data sources before making real decisions.
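As a rough sketch of what such a compilation step might look like, the snippet below maps hypothetical per-benchmark results onto the table's category columns; the source keys, score scales, and column mapping are assumptions for illustration only.

```python
# Hypothetical aggregation of per-benchmark results into leaderboard columns.
# Source keys, score scales, and the column mapping are illustrative assumptions.
def merge_scores(results: dict[str, float]) -> dict[str, float]:
    """Map raw benchmark scores (assumed to be on a 0-100 scale) onto table columns."""
    return {
        "STEM": results.get("mmlu_stem", 0.0),    # e.g. an MMLU STEM-subset accuracy
        "Code": results.get("humaneval", 0.0),    # e.g. HumanEval pass@1 scaled to 0-100
        "Utility": results.get("mt_bench", 0.0),  # e.g. MT-Bench rescaled to 0-100
    }

# Example with made-up inputs in the spirit of the table above:
print(merge_scores({"mmlu_stem": 92.0, "humaneval": 88.0, "mt_bench": 94.0}))
```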
How often is the data updated?
As this is a demonstration with static data, it is not updated. A live version of such a site would ideally be refreshed regularly as new models are released and existing benchmarks are re-run or revised.