🦾 LLM Benchmark Table

Model TOTAL Pass Refine Fail Refusal $ mToK Reason STEM Utility Code Censor
GPT-4 95.2 12 3 5 $0.03 92 96 94 89 Medium

FAQ

What do these metrics mean?

TOTAL: Overall performance score combining all metrics...