🦾 LLM Benchmark Table
Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor
GPT-4 | 95.2 | 12 | 3 | 5 | | $0.03 | 92 | 96 | 94 | 89 | Medium
FAQ
What do these metrics mean?
TOTAL: Overall performance score combining all metrics...