LLM Benchmark Table

Comparing the performance of various Large Language Models across key metrics.

Model                   | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor
------------------------|-------|------|--------|------|---------|--------|--------|------|---------|------|-------
GPT-4 Turbo             | 1000  | 85%  | 10%    | 3%   | 2%      | $10.00 | 95     | 92   | 96      | 98   | 70
Claude 3 Opus           | 1000  | 88%  | 8%     | 2%   | 2%      | $15.00 | 97     | 93   | 95      | 96   | 65
Llama 3 70B (Instruct)  | 1000  | 80%  | 12%    | 5%   | 3%      | $4.00  | 90     | 88   | 90      | 92   | 50
Gemini 1.5 Pro          | 1000  | 84%  | 9%     | 4%   | 3%      | $7.00  | 93     | 91   | 94      | 95   | 60
Mixtral 8x7B (Instruct) | 1000  | 75%  | 15%    | 7%   | 3%      | $3.00  | 85     | 82   | 86      | 88   | 45
Mistral Large           | 1000  | 86%  | 9%     | 4%   | 1%      | $12.00 | 94     | 90   | 93      | 94   | 55
GPT-3.5 Turbo           | 1000  | 70%  | 18%    | 9%   | 3%      | $1.50  | 80     | 75   | 82      | 80   | 75

Frequently Asked Questions

What is this benchmark table for?

This table compares the performance of various Large Language Models (LLMs) across different tasks and metrics to help users understand their strengths and weaknesses.

How is the data collected?

The data presented here is illustrative sample data. In a real benchmark, data would be collected through systematic evaluation using a diverse set of prompts and tasks across different domains.
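As a concrete illustration of such an evaluation, the sketch below shows a minimal loop that grades each prompt into the same outcome categories used in the table. It is not the pipeline behind this page; the run_model and grade_response helpers are hypothetical stand-ins for a model API client and a grading step.

    from collections import Counter

    # Outcome labels mirroring the Pass / Refine / Fail / Refusal columns.
    OUTCOMES = ("pass", "refine", "fail", "refusal")

    def evaluate(model_name, prompts, run_model, grade_response):
        """Run every prompt through a model and tally graded outcomes.

        run_model(model_name, prompt) and grade_response(prompt, response)
        are placeholders for a real API call and a grading function that
        returns one of OUTCOMES.
        """
        tally = Counter()
        for prompt in prompts:
            response = run_model(model_name, prompt)
            outcome = grade_response(prompt, response)
            assert outcome in OUTCOMES
            tally[outcome] += 1

        total = sum(tally.values())
        # Express each outcome as a percentage of all evaluated prompts.
        return {o: round(100 * tally[o] / total) for o in OUTCOMES}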

What do the columns mean?

Hover over the column headers (like TOTAL, Pass, $/MTok) for a tooltip explaining each metric. Key metrics include the outcome breakdown (Pass, Refine, Fail, Refusal, which together account for 100% of the evaluated prompts), estimated cost per million tokens ($/MTok), and scores in specific domains (Reason, STEM, Utility, Code, Censor).
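For readers who prefer code to prose, the sketch below models one table row as a plain data structure. The field names and the consistency check are illustrative, not a published schema, and it assumes TOTAL is the number of evaluated prompts.

    from dataclasses import dataclass

    @dataclass
    class BenchmarkRow:
        """One row of the benchmark table; field names mirror the columns."""
        model: str
        total: int            # TOTAL: assumed number of evaluated prompts
        pass_pct: float       # Pass
        refine_pct: float     # Refine
        fail_pct: float       # Fail
        refusal_pct: float    # Refusal
        cost_per_mtok: float  # $/MTok, estimated dollars per million tokens
        reason: int
        stem: int
        utility: int
        code: int
        censor: int

        def outcomes_consistent(self) -> bool:
            # Pass + Refine + Fail + Refusal should cover every prompt.
            return abs(self.pass_pct + self.refine_pct
                       + self.fail_pct + self.refusal_pct - 100) < 1e-6

    # Example: the GPT-4 Turbo row from the table above.
    row = BenchmarkRow("GPT-4 Turbo", 1000, 85, 10, 3, 2, 10.00, 95, 92, 96, 98, 70)
    assert row.outcomes_consistent()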

Is the data real-time?

No, this is a static HTML page with sample data. A real-time benchmark would require a backend system to fetch and update results.
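As a rough sketch only, such a backend could be as small as the example below, assuming an evaluation pipeline periodically writes a results.json file and the page fetches it from a Flask endpoint; neither the file nor the endpoint exists for this demo.

    import json
    from pathlib import Path

    from flask import Flask, jsonify

    app = Flask(__name__)
    RESULTS_FILE = Path("results.json")  # hypothetical file written by the eval pipeline

    @app.route("/api/results")
    def results():
        # Re-read the file on every request so newly written results
        # appear without restarting the server.
        data = json.loads(RESULTS_FILE.read_text())
        return jsonify(data)

    if __name__ == "__main__":
        app.run(port=8000)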

Can I contribute data?

This is a demonstration page. A real benchmark would typically have a defined process for data submission or automated evaluation pipelines.
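Purely as an illustration of what such a submission process might check, the sketch below validates a contributed row against the table's columns. The field names and tolerances are assumptions, not a real submission format.

    # Hypothetical required fields, mirroring the table columns.
    REQUIRED_KEYS = {
        "model", "total", "pass", "refine", "fail", "refusal",
        "cost_per_mtok", "reason", "stem", "utility", "code", "censor",
    }

    def validate_submission(row: dict) -> list[str]:
        """Return a list of problems; an empty list means the row looks valid."""
        problems = []
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
        else:
            outcome_sum = row["pass"] + row["refine"] + row["fail"] + row["refusal"]
            if abs(outcome_sum - 100) > 0.5:
                problems.append(f"outcome percentages sum to {outcome_sum}, expected 100")
            if row["total"] <= 0:
                problems.append("total must be a positive prompt count")
        return problems

    # Example: a well-formed submission produces no problems.
    example = {"model": "GPT-4 Turbo", "total": 1000, "pass": 85, "refine": 10,
               "fail": 3, "refusal": 2, "cost_per_mtok": 10.00, "reason": 95,
               "stem": 92, "utility": 96, "code": 98, "censor": 70}
    assert validate_submission(example) == []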