LLM Benchmark Table

| Model           | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
|-----------------|-------|------|--------|------|---------|--------|--------|------|---------|------|--------|
| GPT-4o          | 1000  | 780  | 60     | 120  | 40      | 30     | 85     | 90   | 93      | 95   | 88     |
| GPT-4 Turbo     | 1000  | 760  | 65     | 135  | 40      | 20     | 83     | 88   | 90      | 92   | 85     |
| Claude 3 Opus   | 1000  | 740  | 70     | 150  | 40      | 25     | 82     | 88   | 91      | 90   | 87     |
| Claude 3 Sonnet | 1000  | 700  | 60     | 190  | 50      | 15     | 78     | 84   | 87      | 87   | 85     |
| Gemini Advanced | 1000  | 700  | 65     | 210  | 25      | 20     | 80     | 86   | 88      | 89   | 84     |
| Mixtral 8x7B    | 1000  | 650  | 60     | 270  | 20      | 0.5    | 75     | 80   | 83      | 85   | 82     |
| Llama 2 70B     | 1000  | 600  | 50     | 300  | 50      | 0.6    | 70     | 75   | 78      | 80   | 80     |

Frequently Asked Questions

What do the columns mean?

Pass/Refine/Fail/Refusal count how each of the 1,000 benchmark items was graded: “Pass” means the model met or exceeded the rubric, “Refine” means partial success, “Fail” means it missed the mark, and “Refusal” indicates a safe-completion refusal. The four counts sum to TOTAL. The remaining numerical columns (Reason, STEM, Utility, Code, Censor) are sub-benchmark scores (0–100) emphasising different skills, and “$ mToK” is the approximate USD cost per million tokens (prompt + completion).
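
As a concrete illustration of how the outcome columns relate, here is a minimal Python sketch using the GPT-4o row; the dictionary layout is purely illustrative, not the repository's data format.

```python
# Outcome counts from the GPT-4o row: every benchmark item lands in exactly
# one of the four buckets, so the counts sum to TOTAL.
row = {"TOTAL": 1000, "Pass": 780, "Refine": 60, "Fail": 120, "Refusal": 40}
assert row["Pass"] + row["Refine"] + row["Fail"] + row["Refusal"] == row["TOTAL"]

# The sub-benchmark columns are independent 0-100 scores, not counts.
subscores = {"Reason": 85, "STEM": 90, "Utility": 93, "Code": 95, "Censor": 88}
assert all(0 <= score <= 100 for score in subscores.values())
```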

How is the data collected?

Each model is run against the open-source Meta-Eval suite (v2.1). Prompts are standardised and executed on fresh contexts with temperature 0.2. Raw outputs are graded by an independent GPT-4 reference model to ensure consistency.
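
For readers who want a mental model of the harness, here is a minimal Python sketch of that loop. The `complete` and `grade` callables are stand-ins for a provider API wrapper and the GPT-4 reference grader; they are assumptions for illustration, not the actual Meta-Eval code.

```python
from typing import Callable

# Hypothetical provider wrapper: returns a completion for one prompt in a
# fresh context (no conversation history), at the given temperature.
Complete = Callable[[str, str, float], str]  # (model, prompt, temperature) -> text
Grade = Callable[[str, str], str]            # (prompt, output) -> outcome label

OUTCOMES = ("pass", "refine", "fail", "refusal")

def run_suite(model: str, prompts: list[str],
              complete: Complete, grade: Grade) -> dict[str, int]:
    """Run every prompt at temperature 0.2 and tally the graded outcomes.

    `grade` stands in for the independent GPT-4 reference grader and must
    return one of OUTCOMES for each output.
    """
    tally = dict.fromkeys(OUTCOMES, 0)
    for prompt in prompts:
        output = complete(model, prompt, 0.2)  # fresh context per prompt
        verdict = grade(prompt, output)
        if verdict not in tally:
            raise ValueError(f"unexpected grade: {verdict!r}")
        tally[verdict] += 1
    tally["total"] = len(prompts)              # 1000 items for this table
    return tally
```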

How can I contribute or reproduce the results?

You can clone the repository at github.com/your-handle/llm-benchmark, run make eval with valid provider API keys, and submit a pull request with the generated results.json. All PRs are auto-validated.
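
If you want a rough local sanity check on the generated results.json before opening a PR, something along these lines could work; the key names below are assumptions about the file's schema rather than its documented format, so adjust them to match the repository.

```python
import json

# Assumed (not documented) schema: a list of per-model objects with these keys.
REQUIRED_KEYS = {"model", "total", "pass", "refine", "fail", "refusal", "scores"}
SUBSCORES = {"Reason", "STEM", "Utility", "Code", "Censor"}

def preflight(path: str = "results.json") -> None:
    """Rough local check before opening a PR; the real auto-validation may differ."""
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)
    for row in rows:
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"{row.get('model', '?')}: missing keys {sorted(missing)}")
        if set(row["scores"]) != SUBSCORES:
            raise ValueError(f"{row['model']}: unexpected sub-benchmark names")

if __name__ == "__main__":
    preflight()
```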

Why are some numbers rounded?

For readability we round to the nearest whole number. Full decimal precision is available in the raw data files in the repository linked above.