LLM Benchmark Table

AI Model Performance Comparison

Model     TOTAL  Pass  Refine  Fail  Refusal  $/MTok  Notes                               STEM  Utility  Code  Censor
GPT-4     1000   850   120     30    0        $0.02   High accuracy in logical tasks      95%   90%      85%   Low
Claude-3  950    820   100     30    0        $0.015  Strong in ethical reasoning         92%   88%      90%   Medium
Llama-2   800    700   80      20    0        $0.01   Fast and efficient for open-source  85%   80%      75%   High
Grok-1    900    780   90      30    0        $0.018  Innovative with real-time data      88%   85%      80%   Low
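
To make the table's arithmetic explicit, here is a minimal Python sketch. The values are copied from the table above; the BENCHMARK list and tuple layout are assumptions made for this sketch, not part of the benchmark itself. It checks that the four outcome counts add up to TOTAL and derives each model's pass rate:

    # Rows copied from the table above; the tuple layout is illustrative.
    BENCHMARK = [
        # (model, total, passed, refined, failed, refusals)
        ("GPT-4",    1000, 850, 120, 30, 0),
        ("Claude-3",  950, 820, 100, 30, 0),
        ("Llama-2",   800, 700,  80, 20, 0),
        ("Grok-1",    900, 780,  90, 30, 0),
    ]

    for model, total, passed, refined, failed, refusals in BENCHMARK:
        # Each test ends in exactly one outcome, so the counts must sum to TOTAL.
        assert passed + refined + failed + refusals == total, model
        print(f"{model}: pass rate {passed / total:.1%}")

On these numbers, Llama-2 actually has the highest raw pass rate (700/800 = 87.5%) despite the lowest category scores, which is why the table reports both per-test counts and category percentages.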

Frequently Asked Questions

Q: What does this benchmark measure?
A: It compares large language models (LLMs) on metrics such as the TOTAL number of tests, the Pass count, and running cost, to evaluate their performance in real-world scenarios.

Q: What does $/MTok mean?
A: It is the cost per million tokens, an estimate of how financially efficient each model is to run in large-scale applications.
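
As a worked example, under the simplifying assumption of a single flat rate (real pricing schemes often charge input and output tokens differently), processing 5 million tokens at GPT-4's listed $0.02/MTok would cost 5 x $0.02 = $0.10. A sketch of that calculation:

    def token_cost(tokens: int, usd_per_mtok: float) -> float:
        """Estimated cost in USD, assuming one flat rate per million tokens."""
        return tokens / 1_000_000 * usd_per_mtok

    print(token_cost(5_000_000, 0.02))  # GPT-4 rate from the table -> 0.1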

Q: How do the individual metrics differ?
A: Counts such as Pass (successful responses) and Fail (errors), together with category scores such as STEM (performance on science tasks), quantify how well a model handles different kinds of challenges.
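
As an illustration of using a single category score on its own, here is a sketch that ranks the models by their STEM column. The scores are copied from the table; the STEM_SCORES dictionary and the ranking itself are assumptions of this sketch:

    # STEM scores copied from the table above.
    STEM_SCORES = {"GPT-4": 95, "Claude-3": 92, "Grok-1": 88, "Llama-2": 85}

    # Sort descending so the strongest STEM performer prints first.
    for model, score in sorted(STEM_SCORES.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: STEM {score}%")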