LLM Benchmark Table

Comprehensive AI Model Performance Comparison


Frequently Asked Questions

How are these scores calculated?
Our scores are derived from an evaluation framework that tests models across multiple dimensions, including reasoning, STEM knowledge, utility, and code generation. Each model is run against thousands of carefully curated prompts and tasks, and the per-dimension results are combined into the scores shown in the table (see the sketch below).
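As a rough illustration, a composite score can be computed as a weighted average of per-dimension results. The dimension names, weights, and 0-10 scale in this Python sketch are assumptions for illustration only, not the project's actual formula.

```python
# Hypothetical sketch: aggregating per-dimension results into one composite score.
# Dimension names and weights are illustrative assumptions, not the real methodology.

DIMENSION_WEIGHTS = {
    "reasoning": 0.30,
    "stem": 0.25,
    "utility": 0.25,
    "code": 0.20,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each assumed to be on a 0-10 scale."""
    total = sum(
        DIMENSION_WEIGHTS[dim] * score
        for dim, score in dimension_scores.items()
        if dim in DIMENSION_WEIGHTS
    )
    weight = sum(
        DIMENSION_WEIGHTS[dim] for dim in dimension_scores if dim in DIMENSION_WEIGHTS
    )
    return round(total / weight, 2) if weight else 0.0

# Example: a model strong on utility and reasoning, weaker on code.
print(composite_score({"reasoning": 8.1, "stem": 7.4, "utility": 8.8, "code": 6.9}))
```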
What does the 'Censor' score mean?
The censorship score indicates the level of content restriction applied by the model. A lower score (closer to 1) means less censorship and more open responses, while a higher score (closer to 10) indicates stricter content filtering.
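One simple way to produce such a 1-10 scale is to map an observed refusal rate onto it. The linear mapping below is an assumption for illustration, not the site's actual method.

```python
# Hypothetical sketch: converting a measured refusal rate into a 1-10 censorship score.
# The linear mapping is an illustrative assumption, not the benchmark's real formula.

def censorship_score(refusal_rate: float) -> float:
    """Map a refusal rate in [0, 1] to a score in [1, 10]; higher means stricter filtering."""
    refusal_rate = min(max(refusal_rate, 0.0), 1.0)  # clamp to the valid range
    return round(1 + 9 * refusal_rate, 1)

print(censorship_score(0.10))  # ~1.9: mostly open responses
print(censorship_score(0.60))  # ~6.4: noticeably stricter filtering
```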
How often is this data updated?
We update this benchmark table weekly with the latest model releases and performance improvements. The last update timestamp is displayed when you hover over any score.
Can I suggest models to be added?
Yes! We welcome community suggestions. Please visit our GitHub repository to submit model requests or contribute to our evaluation framework.
Is this data open source?
Absolutely. All benchmark data, evaluation scripts, and methodology are open source and available under the MIT license. We believe in transparent AI evaluation.