The table below compares performance metrics across state-of-the-art models.
| Model | Total | Pass | Refine | Fail | Refusal | Cost ($/MTok) | Reason | STEM | Utility | Code | Censorship |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 92 | 88 | 2 | 2 | 0 | 0.05 | 95 | 94 | 91 | 90 | Low |
| Claude 3.5 | 91 | 87 | 3 | 1 | 0 | 0.03 | 94 | 92 | 93 | 89 | Low |
| Llama 3 | 85 | 80 | 4 | 1 | 0 | 0.01 | 88 | 85 | 80 | 82 | Med |
The Total score is a weighted aggregate of the Pass, STEM, and Utility metrics. Cost is reported in USD per million tokens ($/MTok).
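As a minimal sketch of how such a weighted aggregate might be computed: the source does not state the actual weights, so the values below (0.6 / 0.2 / 0.2) are purely illustrative assumptions.

```python
def weighted_total(pass_rate: float, stem: float, utility: float,
                   weights: tuple = (0.6, 0.2, 0.2)) -> float:
    """Weighted aggregate of Pass, STEM, and Utility scores.

    The default weights are hypothetical -- the source table does not
    specify how the Total column is weighted.
    """
    w_pass, w_stem, w_util = weights
    assert abs(w_pass + w_stem + w_util - 1.0) < 1e-9, "weights must sum to 1"
    return round(w_pass * pass_rate + w_stem * stem + w_util * utility, 1)

# Example using GPT-4o's row (Pass=88, STEM=94, Utility=91):
print(weighted_total(88, 94, 91))
```

With different weight choices the aggregate shifts accordingly; the published Total values imply a specific weighting not documented here.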