Model | TOTAL | Pass | Refine | Fail | Refusal | $ mToK | Reason | STEM | Utility | Code | Censor |
---|---|---|---|---|---|---|---|---|---|---|---|
Model: The name of the Large Language Model.
TOTAL: An overall aggregated performance score, presented here on a 0-100 scale.
Pass: Percentage of tasks successfully completed.
Refine: Percentage of tasks requiring minor refinement.
Fail: Percentage of tasks failed.
Refusal: Percentage of tasks the model refused to attempt, often due to safety or policy restrictions (see the row sketch after this list).
$ mToK: Cost in USD per million tokens (input + output, or as specified by provider).
Reason: Score for reasoning capabilities (e.g., logic, problem-solving).
STEM: Score for Science, Technology, Engineering, and Mathematics tasks.
Utility: Score for general helpfulness and practical task completion.
Code: Score for code generation, explanation, and debugging.
Censor: A measure of how censored or restricted the model's outputs are. A lower score generally indicates less censorship and a higher score more, though interpretation can vary between benchmarks.
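For illustration, here is a minimal sketch of how one row of this table might be modeled. The `ModelRow` class and its field names are our own, not part of the benchmark, and the invariant that Pass, Refine, Fail, and Refusal together cover all tasks is an assumption:

```python
from dataclasses import dataclass

@dataclass
class ModelRow:
    model: str           # model name
    total: float         # aggregated score, 0-100
    pass_pct: float      # % of tasks completed ("pass" is reserved in Python)
    refine_pct: float    # % needing minor refinement
    fail_pct: float      # % failed
    refusal_pct: float   # % refused
    usd_per_mtok: float  # cost in USD per million tokens
    reason: float        # sub-scores below, each 0-100
    stem: float
    utility: float
    code: float
    censor: float

    def outcomes_consistent(self, tolerance: float = 0.5) -> bool:
        """Check the assumed invariant: the four outcome percentages
        partition all tasks, so they should sum to roughly 100."""
        total_pct = (self.pass_pct + self.refine_pct
                     + self.fail_pct + self.refusal_pct)
        return abs(total_pct - 100.0) <= tolerance
```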
The TOTAL score is a weighted average of the sub-scores and benchmark results above. The exact methodology varies between benchmark providers; for this table it is a representative figure intended to give a general sense of performance.
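To make the idea concrete, here is a sketch of such a weighted average. The `weighted_total` function and the `EXAMPLE_WEIGHTS` values are illustrative placeholders; the actual weights and the exact set of sub-scores used are not published here:

```python
# Made-up placeholder weights, NOT the benchmark's real methodology.
EXAMPLE_WEIGHTS = {"reason": 0.3, "stem": 0.2, "utility": 0.25, "code": 0.25}

def weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 0-100 sub-scores; weights are normalized,
    so they need not sum to exactly 1."""
    norm = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / norm

# e.g. weighted_total({"reason": 80, "stem": 70, "utility": 90, "code": 85},
#                     EXAMPLE_WEIGHTS) returns 81.75
```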
This data is a curated collection from publicly available benchmarks, research papers, and model provider announcements. It's intended for comparative purposes and may not reflect real-time updates or specific use-case performance.
The data is updated periodically as new models are released or significant benchmark results become available. There is no fixed schedule for updates.
The progress bars for 'Pass', 'Refine', 'Fail', 'Refusal', 'Reason', 'STEM', 'Utility', 'Code', and 'Censor' visually represent each score on a presumed 0-100 scale, making it quick to compare relative strengths and weaknesses across models.
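As a rough sketch, such a bar could be rendered in plain text like this; `render_bar` is a hypothetical helper, assuming every score is on a 0-100 scale as described above:

```python
def render_bar(score: float, width: int = 20) -> str:
    """Map a 0-100 score onto a fixed-width bar of filled/empty cells."""
    clamped = max(0.0, min(100.0, score))  # guard against out-of-range values
    filled = round(clamped / 100 * width)
    return "#" * filled + "-" * (width - filled) + f" {clamped:.0f}/100"

# e.g. render_bar(85) returns "#################--- 85/100"
```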