AI Performance Comparison

Comprehensive benchmarks for leading Large Language Models. Compare reasoning, coding ability, and cost efficiency.

Frequently Asked Questions

What is the "TOTAL" score based on?

The TOTAL score is a weighted aggregate of the STEM, Utility, Code, and Reasoning benchmarks. It provides a quick snapshot of overall model capability.
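
As a rough illustration of what a weighted aggregate means, the sketch below combines the four category scores into a single number. The actual weights are not stated on this page, so the equal weights, the sample scores, and the function name `total_score` are all hypothetical placeholders.

```python
# Hypothetical weights: the real weighting is not published on this page.
WEIGHTS = {"STEM": 0.25, "Utility": 0.25, "Code": 0.25, "Reasoning": 0.25}


def total_score(scores, weights=WEIGHTS):
    """Return the weighted average of the per-category scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    weight_sum = sum(weights.values())
    # Dividing by the weight sum keeps the result on the same scale
    # as the inputs even if the weights do not add up to 1.
    return sum(scores[name] * w for name, w in weights.items()) / weight_sum


# Made-up scores for illustration only.
example = {"STEM": 80.0, "Utility": 70.0, "Code": 90.0, "Reasoning": 60.0}
print(total_score(example))  # 75.0 with equal weights
```

With unequal weights, a model that is strong in a heavily weighted category would rank higher than the plain average suggests.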

What does $ mToK mean?

This stands for "Dollars per Million Tokens". It is the average cost of processing 1 million tokens, counting input and output tokens together. A lower value means a cheaper model.
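
To make the unit concrete, here is a minimal sketch of estimating what one workload would cost at a given $ mToK rate. The function name `estimate_cost` and the sample figures are assumptions for illustration, and the sketch treats the rate as a single blended price for input and output tokens.

```python
def estimate_cost(input_tokens, output_tokens, dollars_per_million):
    """Estimate the cost in dollars at a blended per-million-token rate."""
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * dollars_per_million


# Made-up workload: 750,000 input tokens plus 250,000 output tokens
# at a hypothetical rate of $3.00 per million tokens.
print(estimate_cost(750_000, 250_000, 3.00))  # 3.0
```

Providers that price input and output tokens separately would need the two rates applied individually, so treat a blended figure as a comparison aid, not an exact bill.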

How is the Censor score measured?

The Censor score rates the model's tendency to refuse prompts due to safety alignment filters. A higher score indicates more frequent refusals.
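
One plausible way to quantify refusal tendency is the share of test prompts a model declines to answer. This page does not state the scale or the prompt set behind the Censor score, so the sketch below is only an assumption: it reports refusals as a percentage, and the name `refusal_rate` is hypothetical.

```python
def refusal_rate(outcomes):
    """Return the percentage of prompts refused.

    `outcomes` is a list of booleans, one per prompt, where True
    means the model refused. This is an assumed scoring scheme,
    not the documented one.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up results for four prompts: two refused, two answered.
print(refusal_rate([True, False, False, True]))  # 50.0
```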