LLM Benchmark Table

Model     TOTAL   Pass   Refine   Fail   Refusal   $ mToK   Reason       STEM     Utility   Code     Censor
GPT-4      1000    850      100     30        20     $1.5   Accuracy     High     Medium    Low      Yes
BERT        900    700      150     40        10     $1.2   Speed        Medium   High      Medium   No
RoBERTa     800    600      100     80        20     $1.0   Robustness   High     Low       High     No

Frequently Asked Questions

What does each column represent?

Model: The name of the AI model being benchmarked.

TOTAL: Total number of tests conducted; it equals Pass + Refine + Fail + Refusal (for GPT-4: 850 + 100 + 30 + 20 = 1000).

Pass: Number of tests passed successfully.

Refine: Number of tests requiring refinement.

Fail: Number of tests failed.

Refusal: Number of instances where the model refused to respond.

$ mToK: Cost per million tokens, in US dollars.

Reason: The primary factor driving the model's overall performance (e.g., Accuracy, Speed, Robustness).

STEM: Performance in STEM-related tasks.

Utility: General utility and applicability.

Code: Ability to generate and understand code.

Censor: Whether content censorship is applied (Yes or No).
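
Taken together, these columns describe one row of the table. As a minimal illustrative sketch (the class, field names, and pass_rate helper below are hypothetical, not part of the benchmark itself), a row could be modeled like this:

from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    """One row of the benchmark table (field names are illustrative only)."""
    model: str            # Model
    total: int            # TOTAL: total number of tests conducted
    passed: int           # Pass
    refine: int           # Refine
    fail: int             # Fail
    refusal: int          # Refusal
    cost_per_mtok: float  # $ mToK: cost per million tokens (assumed interpretation)
    reason: str           # Reason
    stem: str             # STEM rating
    utility: str          # Utility rating
    code: str             # Code rating
    censor: bool          # Censor (Yes/No)

    def pass_rate(self) -> float:
        """Fraction of tests passed outright."""
        return self.passed / self.total


# Example using the GPT-4 row from the table above.
gpt4 = BenchmarkRow("GPT-4", 1000, 850, 100, 30, 20, 1.5,
                    "Accuracy", "High", "Medium", "Low", True)
print(f"{gpt4.model} pass rate: {gpt4.pass_rate():.0%}")  # -> 85%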

How is the data collected?

Data is collected through a series of standardized tests performed across various AI models to ensure consistency and reliability in benchmarking results.
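
To make the aggregation concrete, the following sketch shows one plausible way per-test outcomes could be tallied into the Pass, Refine, Fail, and Refusal columns; the outcome labels and tally_results function are assumptions for illustration, not the benchmark's actual test harness.

from collections import Counter

# Hypothetical outcome labels matching the table columns.
OUTCOMES = ("pass", "refine", "fail", "refusal")

def tally_results(outcomes):
    """Aggregate a list of per-test outcome labels into column counts.

    `outcomes` is assumed to contain only the labels in OUTCOMES.
    """
    counts = Counter(outcomes)
    total = sum(counts[o] for o in OUTCOMES)
    return {"TOTAL": total, **{o.capitalize(): counts[o] for o in OUTCOMES}}

# Example: a tiny run of five standardized tests.
print(tally_results(["pass", "pass", "refine", "fail", "refusal"]))
# -> {'TOTAL': 5, 'Pass': 2, 'Refine': 1, 'Fail': 1, 'Refusal': 1}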

Can I contribute data to the table?

Currently, contributions are curated by our team to maintain data integrity. However, we welcome suggestions and feedback through our contact page.