LLM Benchmark Table

Model	TOTAL	Pass	Refine	Fail	Refusal	$ mToK	Reason	STEM	Utility	Code	Censor
GPT-4	100	90	5	3	2	1.2	0.80	0.75	0.85	0.78	0.55
GPT-4 Turbo	100	88	6	4	2	1.1	0.78	0.72	0.82	0.75	0.50
GPT-3.5 Turbo	100	75	10	10	5	0.9	0.65	0.60	0.70	0.68	0.45

FAQ

What does the TOTAL column represent?

The TOTAL column indicates the total number of benchmark questions evaluated for each model.
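In the three rows shown above, Pass, Refine, Fail, and Refusal add up to TOTAL (for GPT-4, 90 + 5 + 3 + 2 = 100). A minimal sketch of a consistency check built on that observation follows. The `OutcomeCounts` shape and the function name are hypothetical, and the sum rule is inferred from the sample rows rather than stated by the table itself.

```typescript
// Hypothetical shape for the count columns of one table row.
interface OutcomeCounts {
  total: number;
  pass: number;
  refine: number;
  fail: number;
  refusal: number;
}

// Returns true when the four outcome columns add up to TOTAL.
// Assumption: every question lands in exactly one outcome column.
function outcomesMatchTotal(counts: OutcomeCounts): boolean {
  return counts.pass + counts.refine + counts.fail + counts.refusal === counts.total;
}

// The GPT-4 row from the table above.
const gpt4Counts: OutcomeCounts = { total: 100, pass: 90, refine: 5, fail: 3, refusal: 2 };
const gpt4IsConsistent = outcomesMatchTotal(gpt4Counts);
```

A check like this can flag rows where a count was mistyped before the table is published.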

How do I sort the table?

Click any column header to sort ascending or descending by that column.
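Header-click sorting could be implemented roughly as follows. This is a sketch, not the page's actual code: the `Row` type, the function names, and the rule that a second click on the same header flips the direction are all assumptions made for illustration.

```typescript
// Hypothetical row shape: column name -> cell value.
type Row = Record<string, string | number>;
type SortDirection = "asc" | "desc";

// Returns a sorted copy of the rows; the input array is left untouched.
function sortRows(rows: Row[], column: string, direction: SortDirection): Row[] {
  const sign = direction === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const x = a[column];
    const y = b[column];
    // Compare numerically when both cells are numbers, otherwise as text.
    const cmp =
      typeof x === "number" && typeof y === "number"
        ? x - y
        : String(x).localeCompare(String(y));
    return sign * cmp;
  });
}

// Assumed click rule: a new header starts ascending,
// and clicking the header that is already ascending switches to descending.
function nextDirection(
  current: { column: string; direction: SortDirection } | null,
  clickedColumn: string
): SortDirection {
  const sameColumnAscending =
    current !== null && current.column === clickedColumn && current.direction === "asc";
  return sameColumnAscending ? "desc" : "asc";
}

// Two columns from the table above, used as sample data.
const sampleRows: Row[] = [
  { Model: "GPT-4", Pass: 90 },
  { Model: "GPT-4 Turbo", Pass: 88 },
  { Model: "GPT-3.5 Turbo", Pass: 75 },
];
const byPassAscending = sortRows(sampleRows, "Pass", "asc");
```

Sorting a copy rather than the original array keeps the unsorted data available, which makes it easy to restore the initial order.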

What is $ mToK?

mToK stands for a “millions to thousands” processing ratio. It is an example metric, so adjust the definition to suit your own data.

How do I switch to dark mode?

Use the moon/sun icon in the top-right corner to toggle between light and dark mode. Your preference is saved in localStorage.
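Saving the preference might look like the sketch below. The storage key `"theme"`, the function names, and the light-mode default are assumptions; the page's real key and default are not documented here. The storage object is passed in as a parameter so the logic can be tried outside a browser, and `window.localStorage` satisfies the same interface.

```typescript
type Theme = "light" | "dark";

// Minimal subset of the Web Storage API; window.localStorage satisfies it.
interface ThemeStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical key name; the page may use a different one.
const THEME_KEY = "theme";

// Reads the saved theme, falling back to light when nothing valid is stored.
function loadTheme(store: ThemeStore): Theme {
  return store.getItem(THEME_KEY) === "dark" ? "dark" : "light";
}

// Flips the saved theme, persists the new value, and returns it.
function toggleTheme(store: ThemeStore): Theme {
  const next: Theme = loadTheme(store) === "dark" ? "light" : "dark";
  store.setItem(THEME_KEY, next);
  return next;
}

// In-memory stand-in for localStorage, useful outside a browser.
function createMemoryStore(): ThemeStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}
```

In a browser, the icon's click handler would call `toggleTheme(window.localStorage)` and then apply the returned theme to the page, for example by setting a class on the root element.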

Can I search/filter?

Yes. Use the search box above the table to show only the rows that contain your search text in any column.
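A text filter of this kind can be sketched as a pure function over the rows. The `TableRow` type and the function name are hypothetical, and the case-insensitive substring match is an assumption about how the search box behaves.

```typescript
// Hypothetical row shape: column name -> cell value.
type TableRow = Record<string, string | number>;

// Keeps the rows in which at least one cell contains the query text.
// Assumption: matching is case-insensitive and an empty query shows every row.
function filterRows(rows: TableRow[], query: string): TableRow[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") {
    return rows;
  }
  return rows.filter((row) =>
    Object.values(row).some((cell) => String(cell).toLowerCase().includes(needle))
  );
}

// Two columns from the table above, used as sample data.
const searchableRows: TableRow[] = [
  { Model: "GPT-4", Pass: 90 },
  { Model: "GPT-4 Turbo", Pass: 88 },
  { Model: "GPT-3.5 Turbo", Pass: 75 },
];
const turboRows = filterRows(searchableRows, "turbo");
```

Because numeric cells are converted to text before matching, a query such as `90` also finds rows by their scores.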