LLM Benchmark Table

Model          TOTAL  Pass  Refine  Fail  Refusal  $ mToK  Reason  STEM  Utility  Code  Censor
GPT-4            100    90       5     3        2     1.2    0.80  0.75     0.85  0.78    0.55
GPT-4 Turbo      100    88       6     4        2     1.1    0.78  0.72     0.82  0.75    0.50
GPT-3.5 Turbo    100    75      10    10        5     0.9    0.65  0.60     0.70  0.68    0.45

FAQ

What does the TOTAL column represent?

The TOTAL column indicates the total number of benchmark questions evaluated for each model.
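In the rows listed in the table, the Pass, Refine, Fail, and Refusal counts add up to TOTAL. A minimal sketch of that check follows; the `OutcomeCounts` shape and both function names are hypothetical and only mirror the column headers.

```typescript
// Hypothetical row shape; field names mirror the table's count columns.
type OutcomeCounts = {
  total: number;
  pass: number;
  refine: number;
  fail: number;
  refusal: number;
};

// True when the four outcome counts add up to TOTAL.
function outcomesMatchTotal(row: OutcomeCounts): boolean {
  return row.pass + row.refine + row.fail + row.refusal === row.total;
}

// Share of questions passed outright, as a fraction between 0 and 1.
function passRate(row: OutcomeCounts): number {
  return row.total === 0 ? 0 : row.pass / row.total;
}

// Counts copied from the GPT-4 row of the table.
const gpt4Counts: OutcomeCounts = { total: 100, pass: 90, refine: 5, fail: 3, refusal: 2 };
console.log(outcomesMatchTotal(gpt4Counts), passRate(gpt4Counts));
```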

How do I sort the table?

Click any column header to sort ascending or descending by that column.
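One way header-click sorting could work is sketched below. This is an illustration, not the page's actual code: `sortRows`, `nextSortState`, and the row shape are hypothetical names, and the rule that clicking the same header again flips the direction is an assumption.

```typescript
type SortDirection = "asc" | "desc";

// Return a sorted copy of the rows; the input array is left untouched.
function sortRows<T extends Record<string, string | number>>(
  rows: readonly T[],
  key: keyof T,
  direction: SortDirection = "asc",
): T[] {
  const sign = direction === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const x: string | number = a[key];
    const y: string | number = b[key];
    // Numeric columns compare by value, everything else by text.
    const cmp =
      typeof x === "number" && typeof y === "number"
        ? x - y
        : String(x).localeCompare(String(y));
    return sign * cmp;
  });
}

type SortState = { key: string; direction: SortDirection };

// Assumed click behaviour: the same header flips direction,
// a different header starts over in ascending order.
function nextSortState(current: SortState | null, clickedKey: string): SortState {
  if (current !== null && current.key === clickedKey) {
    return { key: clickedKey, direction: current.direction === "asc" ? "desc" : "asc" };
  }
  return { key: clickedKey, direction: "asc" };
}

// Three columns copied from the table, in a deliberately shuffled order.
type SortableRow = { model: string; pass: number; code: number };
const sortableRows: SortableRow[] = [
  { model: "GPT-3.5 Turbo", pass: 75, code: 0.68 },
  { model: "GPT-4", pass: 90, code: 0.78 },
  { model: "GPT-4 Turbo", pass: 88, code: 0.75 },
];

const byPassDescending = sortRows(sortableRows, "pass", "desc");
console.log(byPassDescending.map((row) => row.model));
```

Returning a copy rather than sorting in place keeps the original row order available, which makes it easy to reset the view.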

What is $ mToK?

mToK stands for the “millions to thousands” processing ratio. It is an example metric; adjust it to your own definition.

How do I switch to dark mode?

Use the moon/sun icon in the top-right to toggle light/dark mode. Your preference is saved in localStorage.
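A toggle that persists its state might look like the sketch below. The storage key `"theme"`, the light default, and all function names are assumptions; the page's real key and logic are not documented here. The storage is passed in as a parameter so the sketch also runs outside a browser.

```typescript
type Theme = "light" | "dark";

// Minimal subset of the Web Storage API used here; window.localStorage satisfies it.
interface ThemeStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key; the page may use a different one.
const THEME_KEY = "theme";

// Read the saved preference, falling back to light when nothing valid is stored.
function loadTheme(storage: ThemeStorage): Theme {
  return storage.getItem(THEME_KEY) === "dark" ? "dark" : "light";
}

// Flip the saved preference and return the new value.
function toggleTheme(storage: ThemeStorage): Theme {
  const next: Theme = loadTheme(storage) === "dark" ? "light" : "dark";
  storage.setItem(THEME_KEY, next);
  return next;
}

// In-memory stand-in for localStorage, for running the sketch without a browser.
function createMemoryStorage(): ThemeStorage {
  const data = new Map<string, string>();
  return {
    getItem: (key: string): string | null => data.get(key) ?? null,
    setItem: (key: string, value: string): void => {
      data.set(key, value);
    },
  };
}

const demoStorage = createMemoryStorage();
console.log(loadTheme(demoStorage), toggleTheme(demoStorage), loadTheme(demoStorage));
```

In a browser, the same functions could be called with `window.localStorage`, with the returned value applied to the page, for example through a class or data attribute on the root element.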

Can I search/filter?

Yes. Use the search box above to filter the table to rows that contain the matching text in any column.
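The filtering can be pictured as a substring match against every cell of a row. The sketch below assumes case-insensitive matching and treats an empty query as "show everything"; neither detail is confirmed by the page, and `filterRows` is a hypothetical name.

```typescript
// Keep the rows in which at least one cell contains the query text.
function filterRows<T extends Record<string, string | number>>(
  rows: readonly T[],
  query: string,
): T[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") {
    return [...rows]; // Assumed behaviour: an empty search box shows all rows.
  }
  return rows.filter((row) =>
    Object.values(row).some((cell) => String(cell).toLowerCase().includes(needle)),
  );
}

// Two columns copied from the table.
type SearchableRow = { model: string; pass: number };
const searchableRows: SearchableRow[] = [
  { model: "GPT-4", pass: 90 },
  { model: "GPT-4 Turbo", pass: 88 },
  { model: "GPT-3.5 Turbo", pass: 75 },
];

console.log(filterRows(searchableRows, "turbo").map((row) => row.model));
```

Because numeric cells are converted to text before matching, a query such as `90` matches on the Pass column as well as on model names.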