LLM Benchmark Table

A sleek comparison of AI model performance across various metrics.

Click a column header to sort by it. Hover over a row to highlight it. The mini-charts show each model's pass rate.
| Model | TOTAL | Pass | Refine | Fail | Refusal | $/mTok | Reason | STEM | Utility | Code | Censor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 1000 | 850 | 100 | 40 | 10 | $0.03 | High | 9.5 | 9.2 | 9.8 | Low |
| Claude-3 | 1200 | 1020 | 120 | 50 | 10 | $0.025 | High | 9.3 | 9.4 | 9.6 | Medium |
| Gemini Ultra | 1100 | 935 | 110 | 45 | 10 | $0.02 | Medium | 9.0 | 9.1 | 9.4 | Low |
| Llama-2 70B | 900 | 720 | 90 | 70 | 20 | $0.01 | Medium | 8.5 | 8.8 | 9.0 | High |
| Grok | 800 | 650 | 80 | 60 | 10 | $0.015 | High | 8.7 | 8.9 | 8.8 | Low |
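
The per-row figures are related: in each row, Pass + Refine + Fail + Refusal adds up to TOTAL, and the pass-rate bar corresponds to Pass / TOTAL (for GPT-4, 850 / 1000 = 85%). A minimal TypeScript sketch of that calculation, using hypothetical type and field names rather than anything from this demo's code:

```typescript
// Hypothetical row shape for the benchmark table (names are illustrative).
interface BenchmarkRow {
  model: string;
  total: number;
  pass: number;
  refine: number;
  fail: number;
  refusal: number;
}

// Pass rate drives the width of the mini percentage bar next to "Pass".
function passRate(row: BenchmarkRow): number {
  return row.total > 0 ? (row.pass / row.total) * 100 : 0;
}

const gpt4: BenchmarkRow = {
  model: "GPT-4", total: 1000, pass: 850, refine: 100, fail: 40, refusal: 10,
};

console.log(`${gpt4.model}: ${passRate(gpt4).toFixed(1)}% pass rate`); // 85.0%
```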

FAQ

What is this benchmark table?

This table compares Large Language Models (LLMs) across key performance metrics: total tests run, pass/refine/fail/refusal counts, cost per token, and scores for reasoning, STEM, utility, and code, plus an estimated censorship level.

How is the data collected?

The figures in this demo are static placeholder values; they aren't drawn from live evaluation runs.

What do the visual aids mean?

The mini-chart next to each "Pass" value shows that model's pass rate as a percentage bar. Hover over a row to highlight it, and click a column header to sort by that column.
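
A rough idea of how the click-to-sort behavior could be wired up in the page script; the element ID and markup assumptions below are illustrative, not the demo's actual code:

```typescript
// Assumed markup: a <table id="benchmark"> whose header cells are <th> elements,
// indexed by position; cell text is compared numerically when it parses as a number.
const table = document.getElementById("benchmark") as HTMLTableElement;

table.querySelectorAll<HTMLTableCellElement>("th").forEach((header, colIndex) => {
  let ascending = true;
  header.addEventListener("click", () => {
    const body = table.tBodies[0];
    const rows = Array.from(body.rows);
    rows.sort((a, b) => {
      const x = a.cells[colIndex].textContent ?? "";
      const y = b.cells[colIndex].textContent ?? "";
      const nx = parseFloat(x.replace("$", ""));
      const ny = parseFloat(y.replace("$", ""));
      // Numeric sort when both cells parse as numbers, otherwise alphabetical.
      const cmp = !isNaN(nx) && !isNaN(ny) ? nx - ny : x.localeCompare(y);
      return ascending ? cmp : -cmp;
    });
    ascending = !ascending;
    rows.forEach((row) => body.appendChild(row)); // re-append in sorted order
  });
});
```

Row highlighting on hover would typically be plain CSS (e.g., a `tr:hover` rule), and the pass-rate bar width would be set from a value like the `passRate` sketch above.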

Can I contribute data?

Currently, this is a static demo. In a real implementation, we'd welcome contributions via a form (not included here).
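
If a contribution form were added, one possible (purely hypothetical) shape for a submitted entry, sketched in TypeScript:

```typescript
// Hypothetical shape of a contributed benchmark entry (not the demo's real schema).
interface BenchmarkSubmission {
  model: string;
  total: number;
  pass: number;
  refine: number;
  fail: number;
  refusal: number;
  costPerMTok: number; // dollars
  scores: { reason: number; stem: number; utility: number; code: number };
  censor: "Low" | "Medium" | "High";
}

// A basic consistency check a submission form might run before accepting an entry.
function isConsistent(s: BenchmarkSubmission): boolean {
  return s.pass + s.refine + s.fail + s.refusal === s.total;
}
```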