Interactive performance comparison of modern language models

| Model | Total | Pass | Refine | Fail | Refusal | $/mTok | Reasoning | STEM | Utility | Code | Censorship |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT‑4.1 | 92 | 88% | 6% | 4% | 2% | $0.03 | 95 | 96 | 94 | 97 | Low |
| Claude 3 Opus | 90 | 85% | 8% | 5% | 2% | $0.025 | 93 | 94 | 95 | 90 | Medium |
| Gemini Ultra | 86 | 80% | 10% | 7% | 3% | $0.02 | 90 | 92 | 89 | 88 | High |