Model | TOTAL | Pass | Refine | Fail | Refusal | $/mTok | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4 | 1000 | 850 | 100 | 30 | 20 | $1.5 | Accuracy | High | Medium | Low | Yes |
BERT | 900 | 700 | 150 | 40 | 10 | $1.2 | Speed | Medium | High | Medium | No |
RoBERTa | 800 | 600 | 100 | 80 | 20 | $1.0 | Robustness | High | Low | High | No |
Frequently Asked Questions
Model: The name of the AI model being benchmarked.
TOTAL: Total number of tests conducted.
Pass: Number of tests passed successfully.
Refine: Number of tests requiring refinement.
Fail: Number of tests failed.
Refusal: Number of instances where the model refused to respond.
$/mTok: Cost in US dollars per million tokens.
Reason: The primary factor driving the model's performance (e.g., accuracy, speed, robustness).
STEM: Performance in STEM-related tasks.
Utility: General utility and applicability.
Code: Ability to generate and understand code.
Censor: Whether content censorship is applied (Yes/No).
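In each row above, the Pass, Refine, Fail, and Refusal counts sum to TOTAL, so per-model rates can be derived directly from the table. The sketch below is illustrative only; the `BenchmarkRow` class and its field names are assumptions, not part of the benchmark itself.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    """One row of the results table above (names are illustrative)."""
    model: str
    total: int            # TOTAL: tests conducted
    passed: int           # Pass
    refine: int           # Refine
    fail: int             # Fail
    refusal: int          # Refusal
    cost_per_mtok: float  # $/mTok

    def pass_rate(self) -> float:
        """Share of tests passed outright."""
        return self.passed / self.total

    def refusal_rate(self) -> float:
        """Share of tests the model refused to answer."""
        return self.refusal / self.total

# Example using the GPT-4 row: 850 of 1000 tests passed.
gpt4 = BenchmarkRow("GPT-4", 1000, 850, 100, 30, 20, 1.5)
print(f"{gpt4.model}: pass rate {gpt4.pass_rate():.0%}, refusal rate {gpt4.refusal_rate():.1%}")
```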
Data is collected by running the same set of standardized tests against every model listed above, which keeps the benchmarking results consistent and reliable.
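As a rough illustration of that process, the sketch below tallies per-test outcomes into the table's columns; `run_test` is a hypothetical hook standing in for whatever harness actually queries a model, and is not part of the published methodology.

```python
from collections import Counter

OUTCOMES = ("Pass", "Refine", "Fail", "Refusal")

def benchmark(model_name, tests, run_test):
    """Run every standardized test against one model and tally the outcomes.

    run_test(model_name, test) is assumed to return one of the strings in
    OUTCOMES; it stands in for the real test harness.
    """
    tally = Counter(run_test(model_name, t) for t in tests)
    row = {"Model": model_name, "TOTAL": len(tests)}
    row.update({outcome: tally.get(outcome, 0) for outcome in OUTCOMES})
    return row
```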
Currently, contributions are curated by our team to maintain data integrity. However, we welcome suggestions and feedback through our contact page.