AI Model Performance Tracker

Comprehensive comparison of large language models across reasoning, coding, and safety metrics.

Leaderboard views: Top 5 Total Score, Best Value ($/Score)

Table columns: Model, TOTAL, Pass %, Refine, Fail %, Refusal, $/mTok, Reason, STEM, Utility, Code, Censor

Frequently Asked Questions

How is the TOTAL score calculated?
The TOTAL score is a weighted average of Reasoning (30%), Coding (30%), STEM Knowledge (20%), and Utility (20%), penalized by Refusal rates.
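As a concrete illustration, here is a minimal TypeScript sketch of that weighting. The exact form of the refusal penalty is not specified above, so the multiplicative (1 - refusalRate) factor below is an assumption.

```typescript
// Minimal sketch of the TOTAL formula described above.
// Assumption: the refusal penalty scales the weighted average
// down by the refusal rate; the real penalty form may differ.
interface ModelScores {
  reasoning: number;   // 0-100
  coding: number;      // 0-100
  stem: number;        // 0-100
  utility: number;     // 0-100
  refusalRate: number; // 0-1
}

function totalScore(s: ModelScores): number {
  const weighted =
    0.3 * s.reasoning +
    0.3 * s.coding +
    0.2 * s.stem +
    0.2 * s.utility;
  return weighted * (1 - s.refusalRate); // assumed multiplicative penalty
}

// Example: 80/85/75/70 with a 5% refusal rate
console.log(totalScore({ reasoning: 80, coding: 85, stem: 75, utility: 70, refusalRate: 0.05 }));
// → 74.575
```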
What does "$/mTok" mean?
It stands for price per million output tokens: the API cost the provider charges to generate text with that model.
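For example, at $8 per million output tokens, generating 250,000 tokens costs 0.25 × $8 = $2.00. A one-function sketch (the price and token count are illustrative, not taken from the table):

```typescript
// Cost of a generation given the $/mTok price from the table.
function outputCost(outputTokens: number, pricePerMillion: number): number {
  return (outputTokens / 1_000_000) * pricePerMillion;
}

console.log(outputCost(250_000, 8)); // → 2 (dollars)
```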
What is the "Refine" metric?
Refine measures the model's ability to correct its own output when prompted with error messages or user feedback.
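A hypothetical harness for this kind of measurement might look like the sketch below; the generate and check callbacks are stand-ins for a model API and a task-specific verifier, not the tracker's actual evaluation code.

```typescript
// Hypothetical sketch of a refine-style evaluation loop: the model is
// shown its failing output plus the error message and asked to fix it.
type Generate = (prompt: string) => Promise<string>;
type Check = (output: string) => { pass: boolean; error?: string };

async function refinePasses(
  task: string,
  generate: Generate,
  check: Check,
  maxRounds = 3
): Promise<boolean> {
  let output = await generate(task);
  for (let round = 0; round < maxRounds; round++) {
    const result = check(output);
    if (result.pass) return true;
    // Feed the error back, as described in the FAQ entry above.
    output = await generate(
      `${task}\n\nYour previous answer:\n${output}\n\nError:\n${result.error}\nPlease fix it.`
    );
  }
  return check(output).pass;
}
```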
How often is this data updated?
This is a demo table. In a real-world scenario, this data would be pulled from an API or database and updated weekly.
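As a sketch of what that refresh could look like, the endpoint URL and response shape below are invented for illustration; a real deployment would point this at its own data source.

```typescript
// Hypothetical weekly refresh: endpoint and row shape are placeholders.
interface ModelRow {
  model: string;
  total: number;
  pricePerMTok: number;
}

async function fetchLeaderboard(): Promise<ModelRow[]> {
  const res = await fetch("https://example.com/api/leaderboard"); // placeholder URL
  if (!res.ok) throw new Error(`Fetch failed: ${res.status}`);
  return (await res.json()) as ModelRow[];
}

// e.g. invoked from a weekly cron job or scheduled CI workflow
fetchLeaderboard().then((rows) => console.table(rows));
```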