LLM Benchmark Table - AI Performance Comparison

What does the TOTAL score represent?

The TOTAL score is a single aggregate metric: each individual benchmark score is weighted by its importance, and the weighted scores are combined. It gives one number for quickly comparing overall model performance across all tested categories.
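
As a rough illustration of a weighted aggregate, the sketch below computes a weighted average in Python. The category names, the weights, and the normalization by total weight are all assumptions made for this example; the actual weights and formula behind the TOTAL column are not stated here.

```python
def total_score(scores, weights):
    """Weighted average of per-category scores.

    Assumption: TOTAL is normalized by the sum of the weights.
    The benchmark's real formula may differ.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical categories and weights, for illustration only.
scores = {"reasoning": 80, "coding": 60, "math": 70}
weights = {"reasoning": 2, "coding": 1, "math": 1}

print(total_score(scores, weights))  # prints 72.5
```

With these made-up numbers, doubling the weight on reasoning pulls the result toward the reasoning score: (80*2 + 60 + 70) / 4 = 72.5.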

How is the $ mToK (cost per million tokens) calculated?

The $ mToK value is the cost, in dollars, per million tokens processed by a model. It covers both input and output tokens, averaged across typical usage patterns. Lower values indicate models that are more cost-effective for large-scale deployments.
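
One plausible way to blend input and output pricing into a single per-million-token figure is a share-weighted average, sketched below in Python. The prices and the 75/25 input-to-output split are hypothetical, and the exact averaging used for the $ mToK column is not specified here.

```python
def blended_cost_per_million(input_price, output_price, input_share):
    """Blend input and output prices (dollars per million tokens).

    input_share is the assumed fraction of all tokens that are input
    tokens; the remainder are output tokens.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_price * input_share + output_price * (1.0 - input_share)


# Hypothetical prices: $3 per million input tokens, $15 per million
# output tokens, with 75% of traffic assumed to be input tokens.
print(blended_cost_per_million(3.0, 15.0, 0.75))  # prints 6.0
```

Because output tokens are often priced higher than input tokens, the blended figure depends heavily on the assumed split, so a workload with long outputs may see a higher effective cost than the table suggests.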

What's the difference between Pass, Refine, and Fail?

Pass indicates tests where the model provided correct answers on the first attempt. Refine shows cases where the model needed additional prompting to reach the correct answer. Fail represents tests where the model couldn't provide a satisfactory answer even with refinement.
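
The three outcomes can be expressed as a small classification rule. The Python sketch below is illustrative only: the function names and the two boolean flags per test are assumptions, not the benchmark's actual grading code.

```python
from collections import Counter


def classify(correct_first_try, correct_after_refinement):
    """Map one test result to Pass, Refine, or Fail."""
    if correct_first_try:
        return "Pass"
    if correct_after_refinement:
        return "Refine"
    return "Fail"


def tally(results):
    """Count outcomes over (first_try, after_refinement) pairs."""
    return Counter(classify(first, refined) for first, refined in results)


# Hypothetical results for four tests.
results = [(True, False), (False, True), (False, False), (True, False)]
counts = tally(results)
print(counts["Pass"], counts["Refine"], counts["Fail"])  # prints 2 1 1
```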

How often is the benchmark data updated?

Benchmark data is updated weekly as new models are released, and a comprehensive re-evaluation is run every month. Each model is tested on the same standardized dataset to ensure a fair comparison. Historical data is preserved for trend analysis.

What does the Censor score indicate?

The Censor score measures how often a model refuses to answer questions because of safety filters or content policies. A lower score means fewer refusals, but this should be weighed against the safety requirements of your specific use case.
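
A refusal measure of this kind can be sketched as a simple rate. The Python example below assumes the score is the percentage of prompts that were refused; the real scale and the method used to detect a refusal are not described here, and the function name is hypothetical.

```python
def refusal_rate(refused):
    """Percentage of prompts the model refused to answer.

    refused is a list of booleans, one per prompt, where True means
    the model declined to answer.
    """
    if not refused:
        raise ValueError("at least one prompt is required")
    return 100 * sum(refused) / len(refused)


# Hypothetical run: the model refused 1 of 5 prompts.
print(refusal_rate([False, True, False, False, False]))  # prints 20.0
```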

Frequently Asked Questions