LLM Benchmark Table
| Model   | Total | Pass | Refine | Fail | Refusal | $/MTok | Reason   | STEM   | Utility | Code   | Censor |
|---------|------:|-----:|-------:|-----:|--------:|-------:|----------|--------|---------|--------|--------|
| Model A |   100 |   80 |     10 |    5 |       5 |    2.5 | Reason 1 | High   | Medium  | Low    | Yes    |
| Model B |   100 |   70 |     15 |   10 |       5 |    3.0 | Reason 2 | Medium | High    | Medium | No     |
**What is the purpose of this benchmark?**

This benchmark compares AI models side by side on a common set of criteria: task outcomes, cost, and qualitative category ratings.

**How are the scores determined?**

Each model is run against the same suite of tasks, and every task outcome is recorded as a Pass, Refine, Fail, or Refusal. These counts, together with cost per million tokens ($/MTok) and the per-category ratings, make up the table above.
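The scoring scheme above can be sketched in a few lines. This is an illustrative example, not the benchmark's actual tooling; it assumes, as the table rows suggest, that the Pass, Refine, Fail, and Refusal counts sum to Total. The `summarize` function name and the sample numbers (taken from the table) are for demonstration only.

```python
def summarize(total, passed, refined, failed, refused):
    """Return each task outcome as a percentage of the total task count.

    Assumes the four outcome counts partition the task set, i.e.
    passed + refined + failed + refused == total.
    """
    assert passed + refined + failed + refused == total, \
        "outcome counts must sum to total"
    return {
        "pass_rate": 100 * passed / total,
        "refine_rate": 100 * refined / total,
        "fail_rate": 100 * failed / total,
        "refusal_rate": 100 * refused / total,
    }

# Sample values from the table rows above.
print(summarize(100, 80, 10, 5, 5))   # Model A
print(summarize(100, 70, 15, 10, 5))  # Model B
```

With a Total of 100 for every model, the rates equal the raw counts; the division only matters if a future run uses a different task count.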
**Can I contribute to this benchmark?**

Yes! You can submit your own model results for consideration.