Comparing the performance of various Large Language Models across key metrics.
| Model | Total prompts | Pass | Refine | Fail | Refusal | $ / MTok | Reason | STEM | Utility | Code | Censor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 Turbo | 1000 | 85% | 10% | 3% | 2% | $10.00 | 95 | 92 | 96 | 98 | 70 |
| Claude 3 Opus | 1000 | 88% | 8% | 2% | 2% | $15.00 | 97 | 93 | 95 | 96 | 65 |
| Llama 3 70B (Instruct) | 1000 | 80% | 12% | 5% | 3% | $4.00 | 90 | 88 | 90 | 92 | 50 |
| Gemini 1.5 Pro | 1000 | 84% | 9% | 4% | 3% | $7.00 | 93 | 91 | 94 | 95 | 60 |
| Mixtral 8x7B (Instruct) | 1000 | 75% | 15% | 7% | 3% | $3.00 | 85 | 82 | 86 | 88 | 45 |
| Mistral Large | 1000 | 86% | 9% | 4% | 1% | $12.00 | 94 | 90 | 93 | 94 | 55 |
| GPT-3.5 Turbo | 1000 | 70% | 18% | 9% | 3% | $1.50 | 80 | 75 | 82 | 80 | 75 |
This table compares the performance of various Large Language Models (LLMs) across different tasks and metrics to help users understand their strengths and weaknesses.
The data presented here is an illustrative sample. In a real benchmark, results would be collected through systematic evaluation, running each model on a diverse set of prompts and tasks across different domains.
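For a concrete picture of what such an evaluation loop could look like, here is a minimal sketch that assumes a prompt set grouped by domain and a grader that sorts each response into one of the four outcome buckets. `run_model` and `grade_response` are hypothetical placeholders, not part of any existing harness.

```python
from collections import Counter

# Hypothetical stand-ins for a real harness: run_model would call the model's API
# and grade_response would classify its reply; both are stubbed so the sketch runs.
def run_model(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt}"

def grade_response(prompt: str, response: str) -> str:
    # A real grader would return one of "pass", "refine", "fail", or "refusal".
    return "pass"

def evaluate(model: str, prompts_by_domain: dict[str, list[str]]) -> Counter:
    """Run every prompt against one model and tally the graded outcomes."""
    outcomes: Counter = Counter()
    for domain, prompts in prompts_by_domain.items():
        for prompt in prompts:
            outcomes[grade_response(prompt, run_model(model, prompt))] += 1
    return outcomes

print(evaluate("demo-model", {
    "STEM": ["What is the boiling point of water at sea level?"],
    "Code": ["Write a function that reverses a string."],
}))
```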
Hover over a column header (Total prompts, Pass, $ / MTok, and so on) to see a tooltip explaining that metric. The key metrics are the outcome rates for each model (Pass, Refine, Fail, Refusal), the estimated cost in US dollars per million tokens ($ / MTok), and the per-domain scores (Reason, STEM, Utility, Code, Censor).
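As a small illustration of how the outcome columns relate to each other, the sketch below converts raw outcome counts into the percentages shown in the table, assuming every prompt ends in exactly one of the four outcomes. The example counts reproduce the GPT-3.5 Turbo row.

```python
def outcome_rates(outcomes: dict[str, int]) -> dict[str, float]:
    """Turn raw outcome counts into the percentage columns shown above.

    Assumes each evaluated prompt ends in exactly one outcome, so
    Pass + Refine + Fail + Refusal covers 100% of the Total prompts column.
    """
    total = sum(outcomes.values())
    return {name: round(100 * count / total, 1) for name, count in outcomes.items()}

# Counts consistent with the GPT-3.5 Turbo row (1000 prompts in total):
print(outcome_rates({"Pass": 700, "Refine": 180, "Fail": 90, "Refusal": 30}))
# {'Pass': 70.0, 'Refine': 18.0, 'Fail': 9.0, 'Refusal': 3.0}
```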
This is a static HTML page populated with sample data, so the results are not updated in real time. A live benchmark would require a backend system to fetch fresh evaluation results and update the table.
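One lightweight way to get that behaviour, sketched under the assumption that the page would simply poll a JSON file, is a background job that periodically re-runs the benchmark and rewrites that file. `refresh_results` is a hypothetical placeholder for the actual evaluation run.

```python
import json
import time
from pathlib import Path

RESULTS_FILE = Path("results.json")  # hypothetical file a page could fetch and render

def refresh_results() -> list[dict]:
    # Placeholder for re-running the benchmark; a real job would return one row per model.
    return [{"model": "demo-model", "total": 1000, "pass": 0.85, "updated": time.time()}]

def publish_loop(interval_seconds: int = 3600) -> None:
    """Rewrite results.json on a fixed schedule so a page polling it stays current."""
    while True:
        RESULTS_FILE.write_text(json.dumps(refresh_results(), indent=2))
        time.sleep(interval_seconds)
```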
This is a demonstration page, so new models and results cannot be submitted. A real benchmark would typically define a submission process or run automated evaluation pipelines.
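If submissions were accepted, a natural first step would be validating each submitted row before it enters the table. The sketch below checks a hypothetical submission format mirroring the table's columns; the field names are assumptions, not a published schema.

```python
REQUIRED_FIELDS = {"model", "total", "pass", "refine", "fail", "refusal"}

def validate_submission(row: dict) -> list[str]:
    """Return a list of problems with a submitted row; an empty list means it looks valid."""
    problems = [f"missing field: {field}" for field in sorted(REQUIRED_FIELDS - row.keys())]
    rates = [row.get(key, 0) for key in ("pass", "refine", "fail", "refusal")]
    if not problems and abs(sum(rates) - 1.0) > 0.01:
        problems.append("outcome rates should sum to 1.0")
    return problems

# A row matching the GPT-4 Turbo outcome rates passes the checks:
print(validate_submission({"model": "GPT-4 Turbo", "total": 1000,
                           "pass": 0.85, "refine": 0.10, "fail": 0.03, "refusal": 0.02}))
# []
```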