LLM Benchmark Table

Comprehensive evaluation of frontier models

Model | TOTAL | Pass | Refine | Fail | Refusal | $/MTok | Reason | STEM | Utility | Code | Censor

Frequently Asked Questions

How are these scores calculated?
Scores are derived from a composite of standardized benchmarks, including MMLU, HumanEval, and internal reasoning stress tests. The TOTAL column is a weighted average of the sub-category scores.
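As a minimal sketch of what "weighted average of the sub-category scores" means, the snippet below blends the four capability columns into one TOTAL. The weights are illustrative assumptions only; the actual weighting used for the table is not stated here.

```python
# Hypothetical sketch: TOTAL as a weighted average of sub-category scores.
# The category names match the table columns; the weights below are
# illustrative assumptions, not the published weighting.

CATEGORY_WEIGHTS = {
    "Reason": 0.30,
    "STEM": 0.25,
    "Utility": 0.20,
    "Code": 0.25,
}

def weighted_total(scores: dict[str, float]) -> float:
    """Blend per-category scores (0-100) into a single TOTAL."""
    return sum(scores[name] * weight for name, weight in CATEGORY_WEIGHTS.items())

# Example: a model scoring 85 / 80 / 90 / 75 across the four categories.
print(weighted_total({"Reason": 85, "STEM": 80, "Utility": 90, "Code": 75}))  # 82.25
```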
What does $/MTok represent?
This is the estimated cost in USD per 1 million tokens (a blended rate of input and output tokens), based on official API pricing at the time of testing.
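To make the "blended rate" concrete, here is a small sketch of combining separate input and output prices into one $/MTok figure. The 3:1 input-to-output token mix is an assumption for illustration; the ratio actually used for the table is not given here.

```python
# Hypothetical sketch of a blended $/MTok figure from separate per-million-token
# input and output prices. The 75% input share (a 3:1 token mix) is an
# illustrative assumption, not the table's actual blend ratio.

def blended_cost_per_mtok(input_price: float, output_price: float,
                          input_share: float = 0.75) -> float:
    """Weighted average of input and output prices, in USD per 1M tokens."""
    return input_price * input_share + output_price * (1 - input_share)

# Example: $3 per 1M input tokens and $15 per 1M output tokens -> $6.00 blended.
print(blended_cost_per_mtok(3.0, 15.0))
```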
What is the "Censor" metric?
The Censor metric indicates the model's tendency to refuse prompts based on safety guidelines. "Low" means the model is more permissive, while "High" indicates strict adherence to safety guardrails.
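One way to picture how a refusal tendency could map onto the Low/High labels is sketched below. The threshold is an assumption for illustration only, not the cutoff used by this benchmark.

```python
# Illustrative sketch only: mapping a measured refusal rate onto the
# Low/High Censor labels. The 10% threshold is an assumption, not the
# benchmark's actual cutoff.

def censor_label(refusal_rate: float) -> str:
    """Map a refusal rate (0.0-1.0) onto a coarse Censor label."""
    # Permissive models refuse few prompts; strict ones refuse many more.
    return "Low" if refusal_rate < 0.10 else "High"

print(censor_label(0.02))  # Low
print(censor_label(0.25))  # High
```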