Comparing the performance of various Large Language Models across key metrics.
| Model | Total prompts | Pass | Refine | Fail | Refusal | $ / MTok | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 Turbo | 1000 | 85% | 10% | 3% | 2% | $10.00 | 95 | 92 | 96 | 98 | 70 |
| Claude 3 Opus | 1000 | 88% | 8% | 2% | 2% | $15.00 | 97 | 93 | 95 | 96 | 65 |
| Llama 3 70B (Instruct) | 1000 | 80% | 12% | 5% | 3% | $4.00 | 90 | 88 | 90 | 92 | 50 |
| Gemini 1.5 Pro | 1000 | 84% | 9% | 4% | 3% | $7.00 | 93 | 91 | 94 | 95 | 60 |
| Mixtral 8x7B (Instruct) | 1000 | 75% | 15% | 7% | 3% | $3.00 | 85 | 82 | 86 | 88 | 45 |
| Mistral Large | 1000 | 86% | 9% | 4% | 1% | $12.00 | 94 | 90 | 93 | 94 | 55 |
| GPT-3.5 Turbo | 1000 | 70% | 18% | 9% | 3% | $1.50 | 80 | 75 | 82 | 80 | 75 |
This table compares the performance of various Large Language Models (LLMs) across different tasks and metrics to help users understand their strengths and weaknesses.
The data presented here is an illustrative sample. In a real benchmark, results would be collected through systematic evaluation, running each model on a diverse set of prompts and tasks across different domains.
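For a concrete picture of what such an evaluation loop could look like, here is a minimal sketch that assumes a prompt set grouped by domain and a grader that sorts each response into one of the four outcome buckets. `run_model` and `grade_response` are hypothetical placeholders, not part of any existing harness.

```python
from collections import Counter

# Hypothetical stand-ins for a real harness: run_model would call the model's API
# and grade_response would classify its reply; both are stubbed so the sketch runs.
def run_model(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt}"

def grade_response(prompt: str, response: str) -> str:
    # A real grader would return one of "pass", "refine", "fail", or "refusal".
    return "pass"

def evaluate(model: str, prompts_by_domain: dict[str, list[str]]) -> Counter:
    """Run every prompt against one model and tally the graded outcomes."""
    outcomes: Counter = Counter()
    for domain, prompts in prompts_by_domain.items():
        for prompt in prompts:
            outcomes[grade_response(prompt, run_model(model, prompt))] += 1
    return outcomes

print(evaluate("demo-model", {
    "STEM": ["What is the boiling point of water at sea level?"],
    "Code": ["Write a function that reverses a string."],
}))
```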
Hover over a column header (Total prompts, Pass, $ / MTok, and so on) to see a tooltip explaining that metric. The key metrics are the outcome rates for each model (Pass, Refine, Fail, Refusal), the estimated cost in US dollars per million tokens ($ / MTok), and the per-domain scores (Reason, STEM, Utility, Code, Censor).
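As a small illustration of how the outcome columns relate to each other, the sketch below converts raw outcome counts into the percentages shown in the table, assuming every prompt ends in exactly one of the four outcomes. The example counts reproduce the GPT-3.5 Turbo row.

```python
def outcome_rates(outcomes: dict[str, int]) -> dict[str, float]:
    """Turn raw outcome counts into the percentage columns shown above.

    Assumes each evaluated prompt ends in exactly one outcome, so
    Pass + Refine + Fail + Refusal covers 100% of the Total prompts column.
    """
    total = sum(outcomes.values())
    return {name: round(100 * count / total, 1) for name, count in outcomes.items()}

# Counts consistent with the GPT-3.5 Turbo row (1000 prompts in total):
print(outcome_rates({"Pass": 700, "Refine": 180, "Fail": 90, "Refusal": 30}))
# {'Pass': 70.0, 'Refine': 18.0, 'Fail': 9.0, 'Refusal': 3.0}
```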
This is a static HTML page populated with sample data, so the results are not updated in real time. A live benchmark would require a backend system to fetch fresh evaluation results and update the table.
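One lightweight way to get that behaviour, sketched under the assumption that the page would simply poll a JSON file, is a background job that periodically re-runs the benchmark and rewrites that file. `refresh_results` is a hypothetical placeholder for the actual evaluation run.

```python
import json
import time
from pathlib import Path

RESULTS_FILE = Path("results.json")  # hypothetical file a page could fetch and render

def refresh_results() -> list[dict]:
    # Placeholder for re-running the benchmark; a real job would return one row per model.
    return [{"model": "demo-model", "total": 1000, "pass": 0.85, "updated": time.time()}]

def publish_loop(interval_seconds: int = 3600) -> None:
    """Rewrite results.json on a fixed schedule so a page polling it stays current."""
    while True:
        RESULTS_FILE.write_text(json.dumps(refresh_results(), indent=2))
        time.sleep(interval_seconds)
```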
This is a demonstration page, so new models and results cannot be submitted. A real benchmark would typically define a submission process or run automated evaluation pipelines.
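If submissions were accepted, a natural first step would be validating each submitted row before it enters the table. The sketch below checks a hypothetical submission format mirroring the table's columns; the field names are assumptions, not a published schema.

```python
REQUIRED_FIELDS = {"model", "total", "pass", "refine", "fail", "refusal"}

def validate_submission(row: dict) -> list[str]:
    """Return a list of problems with a submitted row; an empty list means it looks valid."""
    problems = [f"missing field: {field}" for field in sorted(REQUIRED_FIELDS - row.keys())]
    rates = [row.get(key, 0) for key in ("pass", "refine", "fail", "refusal")]
    if not problems and abs(sum(rates) - 1.0) > 0.01:
        problems.append("outcome rates should sum to 1.0")
    return problems

# A row matching the GPT-4 Turbo outcome rates passes the checks:
print(validate_submission({"model": "GPT-4 Turbo", "total": 1000,
                           "pass": 0.85, "refine": 0.10, "fail": 0.03, "refusal": 0.02}))
# []
```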