LLM Benchmark Table

Model    TOTAL  Pass  Refine  Fail  Refusal  $/mTok  Reason  STEM    Utility  Code    Censor
Model A  100    80    10      5     5        2.5     1       High    Medium   Low     Yes
Model B  100    70    15      10    5        3.0     2       Medium  High     Medium  No

Pass, Refine, Fail, and Refusal are per-task outcome counts and sum to TOTAL; $/mTok is the cost in US dollars per million tokens.

What is the purpose of this benchmark?

This benchmark compares language models on a shared set of tasks, reporting per-task outcome counts (Pass, Refine, Fail, Refusal), cost per million tokens, per-category ratings (Reason, STEM, Utility, Code), and censorship behavior.

How are the scores determined?

Each model is run against the same set of tasks. Every task outcome is recorded as a Pass, Refine, Fail, or Refusal, and the counts are tallied into the table above. TOTAL is the number of tasks, so the four outcome counts sum to it (e.g., Model A: 80 + 10 + 5 + 5 = 100).
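
As a minimal sketch of how such a tally might be computed (the outcome labels mirror the table's columns, but the function name and input format here are illustrative assumptions, not the benchmark's actual harness):

```python
from collections import Counter

# Outcome labels taken from the table's columns.
OUTCOMES = ("Pass", "Refine", "Fail", "Refusal")

def tally_results(task_outcomes):
    """Aggregate per-task outcome labels into one table row.

    task_outcomes: an iterable with one label per task,
    e.g. ["Pass", "Pass", "Refine", "Fail", ...].
    """
    counts = Counter(task_outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected outcome labels: {unknown}")
    row = {label: counts.get(label, 0) for label in OUTCOMES}
    # TOTAL is simply the number of tasks, so the four counts sum to it.
    row["TOTAL"] = sum(row.values())
    return row

# Example: 100 tasks distributed like Model A's row.
outcomes = ["Pass"] * 80 + ["Refine"] * 10 + ["Fail"] * 5 + ["Refusal"] * 5
print(tally_results(outcomes))
# {'Pass': 80, 'Refine': 10, 'Fail': 5, 'Refusal': 5, 'TOTAL': 100}
```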

Can I contribute to this benchmark?

Yes! You can submit your own model results for consideration.