Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 1000 | 780 | 60 | 120 | 40 | 30 | 85 | 90 | 93 | 95 | 88 |
GPT-4 Turbo | 1000 | 760 | 65 | 135 | 40 | 20 | 83 | 88 | 90 | 92 | 85 |
Claude 3 Opus | 1000 | 740 | 70 | 150 | 40 | 25 | 82 | 88 | 91 | 90 | 87 |
Claude 3 Sonnet | 1000 | 700 | 60 | 190 | 50 | 15 | 78 | 84 | 87 | 87 | 85 |
Gemini Advanced | 1000 | 700 | 65 | 210 | 25 | 20 | 80 | 86 | 88 | 89 | 84 |
Mixtral 8x7B | 1000 | 650 | 60 | 270 | 20 | 0.5 | 75 | 80 | 83 | 85 | 82 |
Llama 2 70B | 1000 | 600 | 50 | 300 | 50 | 0.6 | 70 | 75 | 78 | 80 | 80 |
Frequently Asked Questions
What do the columns mean?
Pass, Refine, Fail, and Refusal count how each of the 1000 benchmark items was handled: “Pass” means the output met or exceeded the rubric, “Refine” means partial success, “Fail” means it missed the mark, and “Refusal” indicates a safe-completion refusal. Reason, STEM, Utility, Code, and Censor are sub-benchmark scores (0–100) emphasising different skills. “$/MTok” is the approximate USD cost per million tokens (prompt + completion).
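As a rough illustration, a model's pass rate and the cost of a run can be derived directly from these columns. The helper below is a hypothetical sketch; the function name, arguments, and token count are assumptions, not code from the suite:

```python
# Hypothetical sketch of how the outcome counts and the $/MTok column relate.
# The function name, arguments, and token count are illustrative assumptions.

def summarize(passed: int, refine: int, fail: int, refusal: int,
              usd_per_mtok: float, tokens_used: int) -> tuple[int, float, float]:
    total = passed + refine + fail + refusal            # should equal the TOTAL column (1000)
    pass_rate = passed / total                          # e.g. GPT-4o: 780 / 1000 = 0.78
    est_cost = usd_per_mtok * tokens_used / 1_000_000   # prompt + completion tokens, in USD
    return total, pass_rate, est_cost

print(summarize(780, 60, 120, 40, usd_per_mtok=30, tokens_used=2_500_000))
# -> (1000, 0.78, 75.0)
```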
How is the data collected?
Each model is run against the open-source Meta-Eval suite (v2.1). Prompts are standardised and executed in fresh contexts at temperature 0.2, and raw outputs are graded by an independent GPT-4 reference model to ensure consistency.
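For illustration, the loop below sketches the shape of such a run; `query_model` and `grade_with_reference` are hypothetical stand-ins, not the actual Meta-Eval v2.1 harness or grader:

```python
# Minimal sketch of the evaluation loop described above; the callables passed in
# are hypothetical stand-ins for the real harness and GPT-4 reference grader.
from typing import Callable

Grade = str  # one of "pass", "refine", "fail", "refusal"

def run_eval(prompts: list[str],
             query_model: Callable[[str, float], str],
             grade_with_reference: Callable[[str, str], Grade]) -> dict[Grade, int]:
    counts: dict[Grade, int] = {"pass": 0, "refine": 0, "fail": 0, "refusal": 0}
    for prompt in prompts:
        output = query_model(prompt, 0.2)             # fresh context, temperature 0.2
        grade = grade_with_reference(prompt, output)  # independent reference model applies the rubric
        counts[grade] += 1
    return counts
```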
How can I contribute or reproduce the results?
You can clone the repository at github.com/your-handle/llm-benchmark, run `make eval` with valid provider API keys, then submit a pull request with the generated `results.json`. All PRs are auto-validated.
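Before opening a PR, you can sanity-check the generated file yourself. The snippet below is a rough sketch that assumes a simple per-model layout; the real `results.json` schema in the repo may differ:

```python
# Hypothetical validation sketch; the actual results.json schema may differ.
import json

with open("results.json") as f:
    results = json.load(f)

# Assumes a layout like {"GPT-4o": {"total": 1000, "pass": 780, "refine": 60, ...}, ...}
for model, row in results.items():
    outcomes = row["pass"] + row["refine"] + row["fail"] + row["refusal"]
    assert outcomes == row["total"], f"{model}: counts sum to {outcomes}, expected {row['total']}"

print("All rows consistent.")
```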
Why are some numbers rounded?
For readability we round to the nearest whole number. Full decimal precision is available in the raw data files linked above.