Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 1000 | 780 | 60 | 120 | 40 | 30 | 85 | 90 | 93 | 95 | 88 |
GPT-4 Turbo | 1000 | 760 | 65 | 135 | 40 | 20 | 83 | 88 | 90 | 92 | 85 |
Claude 3 Opus | 1000 | 740 | 70 | 150 | 40 | 25 | 82 | 88 | 91 | 90 | 87 |
Claude 3 Sonnet | 1000 | 700 | 60 | 190 | 50 | 15 | 78 | 84 | 87 | 87 | 85 |
Gemini Advanced | 1000 | 700 | 65 | 210 | 25 | 20 | 80 | 86 | 88 | 89 | 84 |
Mixtral 8x7B | 1000 | 650 | 60 | 270 | 20 | 0.5 | 75 | 80 | 83 | 85 | 82 |
Llama 2 70B | 1000 | 600 | 50 | 300 | 50 | 0.6 | 70 | 75 | 78 | 80 | 80 |
Frequently Asked Questions
What do the columns mean?
Pass, Refine, Fail, and Refusal count how each of the 1000 benchmark items was handled: “Pass” means the output met or exceeded the rubric, “Refine” means partial success, “Fail” means it missed the mark, and “Refusal” indicates a safe-completion refusal. Reason, STEM, Utility, Code, and Censor are sub-benchmark scores (0–100) emphasising different skills. “$/MTok” is the approximate USD cost per million tokens (prompt + completion).
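As a rough illustration, a model's pass rate and the cost of a run can be derived directly from these columns. The helper below is a hypothetical sketch; the function name, arguments, and token count are assumptions, not code from the suite:

```python
# Hypothetical sketch of how the outcome counts and the $/MTok column relate.
# The function name, arguments, and token count are illustrative assumptions.

def summarize(passed: int, refine: int, fail: int, refusal: int,
              usd_per_mtok: float, tokens_used: int) -> tuple[int, float, float]:
    total = passed + refine + fail + refusal            # should equal the TOTAL column (1000)
    pass_rate = passed / total                          # e.g. GPT-4o: 780 / 1000 = 0.78
    est_cost = usd_per_mtok * tokens_used / 1_000_000   # prompt + completion tokens, in USD
    return total, pass_rate, est_cost

print(summarize(780, 60, 120, 40, usd_per_mtok=30, tokens_used=2_500_000))
# -> (1000, 0.78, 75.0)
```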
How is the data collected?
Each model is run against the open-source Meta-Eval suite (v2.1). Prompts are standardised and executed in fresh contexts at temperature 0.2, and raw outputs are graded by an independent GPT-4 reference model to ensure consistency.
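For illustration, the loop below sketches the shape of such a run; `query_model` and `grade_with_reference` are hypothetical stand-ins, not the actual Meta-Eval v2.1 harness or grader:

```python
# Minimal sketch of the evaluation loop described above; the callables passed in
# are hypothetical stand-ins for the real harness and GPT-4 reference grader.
from typing import Callable

Grade = str  # one of "pass", "refine", "fail", "refusal"

def run_eval(prompts: list[str],
             query_model: Callable[[str, float], str],
             grade_with_reference: Callable[[str, str], Grade]) -> dict[Grade, int]:
    counts: dict[Grade, int] = {"pass": 0, "refine": 0, "fail": 0, "refusal": 0}
    for prompt in prompts:
        output = query_model(prompt, 0.2)             # fresh context, temperature 0.2
        grade = grade_with_reference(prompt, output)  # independent reference model applies the rubric
        counts[grade] += 1
    return counts
```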
How can I contribute or reproduce the results?
You can clone the repository at github.com/your-handle/llm-benchmark, run `make eval` with valid provider API keys, then submit a pull request with the generated `results.json`. All PRs are auto-validated.
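Before opening a PR, you can sanity-check the generated file yourself. The snippet below is a rough sketch that assumes a simple per-model layout; the real `results.json` schema in the repo may differ:

```python
# Hypothetical validation sketch; the actual results.json schema may differ.
import json

with open("results.json") as f:
    results = json.load(f)

# Assumes a layout like {"GPT-4o": {"total": 1000, "pass": 780, "refine": 60, ...}, ...}
for model, row in results.items():
    outcomes = row["pass"] + row["refine"] + row["fail"] + row["refusal"]
    assert outcomes == row["total"], f"{model}: counts sum to {outcomes}, expected {row['total']}"

print("All rows consistent.")
```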
Why are some numbers rounded?
For readability we round to the nearest whole number. Full decimal precision is available in the raw data files linked above.