LLM Benchmark Table

Compare performance metrics across state-of-the-art models.

| Model      | Total | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor |
|------------|-------|------|--------|------|---------|--------|--------|------|---------|------|--------|
| GPT-4o     | 92    | 88   | 2      | 2    | 0       | 0.05   | 95     | 94   | 91      | 90   | Low    |
| Claude 3.5 | 91    | 87   | 3      | 1    | 0       | 0.03   | 94     | 92   | 93      | 89   | Low    |
| Llama 3    | 85    | 80   | 4      | 1    | 0       | 0.01   | 88     | 85   | 80      | 82   | Med    |

Frequently Asked Questions

How is the "Total" score calculated?

The Total column is the sum of the Pass, Refine, Fail, and Refusal columns. For example, GPT-4o: 88 + 2 + 2 + 0 = 92.
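As a minimal sketch of that calculation (the function name and dictionary layout are illustrative, not taken from the benchmark's tooling):

```python
# Sketch: Total as the sum of the four outcome columns from the table above.
# `total_score` and the dict layout are hypothetical names for illustration.

def total_score(row: dict) -> int:
    """Sum the Pass, Refine, Fail, and Refusal counts for one model."""
    return row["pass"] + row["refine"] + row["fail"] + row["refusal"]

gpt_4o = {"pass": 88, "refine": 2, "fail": 2, "refusal": 0}
print(total_score(gpt_4o))  # 92, matching the Total column
```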

What does "$/MTok" represent?

The cost per million tokens, in US dollars.
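For example, at GPT-4o's listed rate of $0.05 per million tokens, a 120,000-token request would cost $0.006. A small sketch (the helper name and token count are assumptions for illustration):

```python
# Hypothetical helper: convert a $/MTok rate into the dollar cost of one request.

def request_cost(rate_per_mtok: float, tokens: int) -> float:
    """Cost in USD for `tokens` tokens at `rate_per_mtok` dollars per million tokens."""
    return rate_per_mtok * tokens / 1_000_000

print(request_cost(0.05, 120_000))  # 0.006 -> a 120k-token call at GPT-4o's listed rate
```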