AI Model Performance Comparison
Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Strengths | STEM | Utility | Code | Censorship |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 1000 | 850 | 120 | 30 | 0 | $0.02 | High accuracy in logical tasks | 95% | 90% | 85% | Low |
Claude-3 | 950 | 820 | 100 | 30 | 0 | $0.015 | Strong in ethical reasoning | 92% | 88% | 90% | Medium |
Llama-2 | 800 | 700 | 80 | 20 | 0 | $0.01 | Fast and efficient for open-source | 85% | 80% | 75% | High |
Grok-1 | 900 | 780 | 90 | 30 | 0 | $0.018 | Innovative with real-time data | 88% | 85% | 80% | Low |
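To make the raw counts easier to compare, here is a minimal Python sketch that derives each model's pass, refine, fail, and refusal shares as percentages of its TOTAL. The counts are copied from the table above; the dictionary layout and helper name are illustrative only, not part of any benchmark harness.

```python
# Counts copied from the table above: (total, passed, refined, failed, refused)
results = {
    "GPT-4":    (1000, 850, 120, 30, 0),
    "Claude-3": (950,  820, 100, 30, 0),
    "Llama-2":  (800,  700,  80, 20, 0),
    "Grok-1":   (900,  780,  90, 30, 0),
}

def rates(total, passed, refined, failed, refused):
    """Return pass/refine/fail/refusal shares as percentages of TOTAL."""
    return {
        "pass":    100 * passed  / total,
        "refine":  100 * refined / total,
        "fail":    100 * failed  / total,
        "refusal": 100 * refused / total,
    }

for model, counts in results.items():
    r = rates(*counts)
    print(f"{model}: pass {r['pass']:.1f}%, refine {r['refine']:.1f}%, "
          f"fail {r['fail']:.1f}%, refusal {r['refusal']:.1f}%")
```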
Frequently Asked Questions
**What does this benchmark measure?** It compares large language models (LLMs) on the metrics in the table above, including total tests run, pass, refine, and fail counts, cost, and capability scores, to evaluate how well each model performs in real-world scenarios.
**What does $/MTok mean?** $/MTok is the cost per million tokens, an estimate of how economical a model is to run in large-scale applications.
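As a minimal sketch of how a per-million-token price translates into a budget, the Python snippet below multiplies a $/MTok price by a token count. The price is the table's figure for GPT-4; the monthly token volume is a hypothetical example.

```python
def estimated_cost(price_per_mtok: float, tokens: int) -> float:
    """Cost in dollars for `tokens` tokens at `price_per_mtok` dollars per million tokens."""
    return price_per_mtok * tokens / 1_000_000

# Hypothetical workload: 50 million tokens per month at the table's GPT-4 rate of $0.02/MTok
print(f"${estimated_cost(0.02, 50_000_000):.2f} per month")  # -> $1.00 per month
```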
**How should the individual metrics be read?** Pass (successful responses), Refine (responses needing a follow-up correction), Fail (errors), and Refusal (declined requests) count outcomes per test, while percentage columns such as STEM (performance on science tasks) score capability in specific areas; together they quantify how well a model handles different kinds of challenges.
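One common way to reduce the capability columns (STEM, Utility, Code) to a single number is a weighted average. The sketch below is a minimal Python example using GPT-4's row from the table; the weights are hypothetical and are not a scoring rule defined by this benchmark.

```python
def weighted_score(stem: float, utility: float, code: float,
                   weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted average of the three capability percentages (weights are illustrative)."""
    w_stem, w_util, w_code = weights
    return stem * w_stem + utility * w_util + code * w_code

# GPT-4's row from the table: 95% STEM, 90% Utility, 85% Code
print(f"{weighted_score(95, 90, 85):.1f}")  # -> 90.5
```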