LLM Benchmark Table

Interactive performance comparison of modern language models

| Model | Total | Pass | Refine | Fail | Refusal | Cost ($/MTok) | Reasoning | STEM | Utility | Code | Censorship |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑4.1 | 92 | 88% | 6% | 4% | 2% | $0.03 | 95 | 96 | 94 | 97 | Low |
| Claude 3 Opus | 90 | 85% | 8% | 5% | 2% | $0.025 | 93 | 94 | 95 | 90 | Medium |
| Gemini Ultra | 86 | 80% | 10% | 7% | 3% | $0.02 | 90 | 92 | 89 | 88 | High |
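
If you wire this table up as an interactive component, the rows can be driven by a small typed data model. The sketch below is hypothetical (the interface and field names are illustrative, not taken from any particular implementation), populated with the demo values shown above.

```typescript
// Hypothetical data model for one benchmark row; field names are illustrative.
interface BenchmarkRow {
  model: string;
  total: number;        // composite score
  passPct: number;      // % of tasks solved on the first attempt
  refinePct: number;    // % solved only after follow-up prompts
  failPct: number;      // % never solved
  refusalPct: number;   // % refused outright
  costPerMTok: number;  // USD per million tokens
  reasoning: number;
  stem: number;
  utility: number;
  code: number;
  censorship: 'Low' | 'Medium' | 'High';
}

// Demo values from the table above; replace with your own evaluation data.
const rows: BenchmarkRow[] = [
  { model: 'GPT‑4.1',       total: 92, passPct: 88, refinePct: 6,  failPct: 4, refusalPct: 2, costPerMTok: 0.03,  reasoning: 95, stem: 96, utility: 94, code: 97, censorship: 'Low' },
  { model: 'Claude 3 Opus', total: 90, passPct: 85, refinePct: 8,  failPct: 5, refusalPct: 2, costPerMTok: 0.025, reasoning: 93, stem: 94, utility: 95, code: 90, censorship: 'Medium' },
  { model: 'Gemini Ultra',  total: 86, passPct: 80, refinePct: 10, failPct: 7, refusalPct: 3, costPerMTok: 0.02,  reasoning: 90, stem: 92, utility: 89, code: 88, censorship: 'High' },
];
```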

FAQ

What is this benchmark measuring?
The benchmark evaluates language models across reasoning, STEM, coding, safety behavior, refusals, and real‑world utility tasks.
What does “Refine” mean?
Refine indicates the model required follow‑up prompts or corrections before reaching a valid solution. Together, Pass, Refine, Fail, and Refusal account for 100% of attempts (see the sketch at the end of this FAQ).
Is this data official?
No. This table is a demonstration of how benchmarks can be visualized. Replace values with your own evaluation data.
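
As a sketch of how the Pass / Refine / Fail / Refusal columns might be produced from your own raw evaluation results, the outcome labels and helper below are hypothetical, not part of any published harness.

```typescript
// Hypothetical per-task outcome, matching the FAQ definitions:
// 'pass'    = solved on the first attempt
// 'refine'  = solved only after follow-up prompts or corrections
// 'fail'    = never produced a valid solution
// 'refusal' = declined to attempt the task
type Outcome = 'pass' | 'refine' | 'fail' | 'refusal';

// Tally raw outcomes into the percentage breakdown shown in the table.
// The four percentages sum to 100 (up to rounding).
function tallyOutcomes(outcomes: Outcome[]): Record<Outcome, number> {
  const counts: Record<Outcome, number> = { pass: 0, refine: 0, fail: 0, refusal: 0 };
  for (const o of outcomes) counts[o] += 1;
  const total = outcomes.length || 1; // avoid division by zero on an empty run
  return {
    pass: Math.round((counts.pass / total) * 100),
    refine: Math.round((counts.refine / total) * 100),
    fail: Math.round((counts.fail / total) * 100),
    refusal: Math.round((counts.refusal / total) * 100),
  };
}

// Example: a 50-task run -> { pass: 88, refine: 6, fail: 4, refusal: 2 }
const example = tallyOutcomes([
  ...Array<Outcome>(44).fill('pass'),
  ...Array<Outcome>(3).fill('refine'),
  ...Array<Outcome>(2).fill('fail'),
  ...Array<Outcome>(1).fill('refusal'),
]);
console.log(example);
```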