LLM Benchmark Table

AI Performance Comparison

Explore the benchmark results for various Large Language Models.

Model    TOTAL  Pass  Refine  Fail  Refusal  $ mToK  Reason  STEM  Utility  Code  Censor
Model A  100    80    15      3     2        $0.05   90      90%   75%      85%   10%
Model B  100    70    20      5     5        $0.03   85      85%   80%      70%   15%
Model C  100    90    5       2     3        $0.07   95      95%   90%      92%   5%

Column legend:
$ mToK  - Estimated cost per million tokens. This is a simplified estimate; actual costs may vary.
Reason  - Score reflecting the model's logical reasoning capabilities.
STEM    - Score reflecting the model's performance on Science, Technology, Engineering, and Mathematics tasks.
Utility - Score reflecting the model's ability to perform general useful tasks.
Code    - Score reflecting the model's coding and programming capabilities.
Censor  - Score reflecting the model's tendency to censor or refuse certain requests. Lower is generally better.
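The rows above can be held as plain records and reordered by any column. A minimal Python sketch (the column key names are my own; the figures are copied from the table):

```python
# Benchmark rows as plain records, with a helper to sort by any column.
rows = [
    {"model": "Model A", "total": 100, "pass": 80, "refine": 15, "fail": 3,
     "refusal": 2, "cost_mtok": 0.05, "reason": 90, "stem": 90,
     "utility": 75, "code": 85, "censor": 10},
    {"model": "Model B", "total": 100, "pass": 70, "refine": 20, "fail": 5,
     "refusal": 5, "cost_mtok": 0.03, "reason": 85, "stem": 85,
     "utility": 80, "code": 70, "censor": 15},
    {"model": "Model C", "total": 100, "pass": 90, "refine": 5, "fail": 2,
     "refusal": 3, "cost_mtok": 0.07, "reason": 95, "stem": 95,
     "utility": 90, "code": 92, "censor": 5},
]

def sort_by(rows, column, descending=True):
    """Return the rows ordered by one column."""
    return sorted(rows, key=lambda r: r[column], reverse=descending)

best_code = sort_by(rows, "code")[0]["model"]  # Model C leads on Code (92)
```

Sorting on `cost_mtok` with `descending=False` instead surfaces the cheapest model first (Model B at $0.03).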

Frequently Asked Questions

What is this benchmark?
This benchmark evaluates the performance of various Large Language Models (LLMs) across different tasks and metrics to provide a comparative view of their capabilities.
How are the scores calculated?
The scores are derived from a series of tests designed to assess specific aspects of an LLM's performance, including reasoning, factual knowledge (STEM), general utility, coding ability, and its tendency to refuse requests (Censor). The exact methodology may vary depending on the specific test suite used. "Pass," "Refine," "Fail," and "Refusal" indicate the outcome of individual test prompts.
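The per-prompt outcomes described above can be tallied into the table's counts. A hypothetical sketch: the four outcome labels come from the FAQ answer, but the tallying itself is illustrative, since the exact methodology varies by test suite.

```python
from collections import Counter

# The four per-prompt outcomes named in the FAQ answer above.
LABELS = ("pass", "refine", "fail", "refusal")

def tally(outcomes):
    """Count each per-prompt outcome and the total number of prompts."""
    counts = Counter(outcomes)
    summary = {label: counts.get(label, 0) for label in LABELS}
    summary["total"] = len(outcomes)
    return summary
```

For a ten-prompt run with eight passes, one refinement, and one failure, `tally` returns the same shape of record as one table row: pass 8, refine 1, fail 1, refusal 0, total 10.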
What does "$ mToK" mean?
"$ mToK" stands for "dollars per million tokens": an estimated cost of processing one million tokens (word or sub-word units) with that particular model. This is an approximation, and actual pricing may differ based on usage patterns and provider.
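A minimal sketch of the "$ mToK" arithmetic: cost grows linearly with token count. The $0.03 rate below is Model B's estimate from the table; real providers may price input and output tokens separately, which this simple estimate ignores.

```python
def estimated_cost(tokens, dollars_per_million_tokens):
    """Rough dollar cost for processing `tokens` tokens at the quoted rate."""
    return tokens / 1_000_000 * dollars_per_million_tokens

# 2.5 million tokens through Model B at $0.03 per million tokens:
cost = estimated_cost(2_500_000, 0.03)  # 0.075 dollars
```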
How often is the data updated?
The data is updated periodically as new benchmark results become available and models are re-evaluated. Check back regularly for the latest comparisons.
Can I suggest a model to be added?
Currently, we do not have a public submission process. The models included are based on available and reliable benchmark data.