Performance analysis of 205 AI models in 2483 chess matches
Why Chess? A historic, centuries-old game of intellect: pure strategy with objective ground truth. Because its complexity grows exponentially, play beyond the opening moves is largely resistant to common 'benchmaxxing' strategies. It tests game knowledge, reasoning, planning, state tracking, and instruction adherence, all measurable by an objective superhuman judge (Stockfish) and tracked with self-correcting Elo. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, illegal outputs, etc.) and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.
Initially, the most performant model observed was GPT-3.5 Turbo Instruct, playing white in Continuation mode. Here is a spontaneous live video demonstration:
YouTube
In full-information chess (Reasoning mode), o1-mini showed strength, winning the first tournament (it has long since been dethroned).
The number of games is mostly determined by time, model competency and speed, API constraints, and budget. This chess inference cost me ~$2700.
Data is painstakingly collected, move by move. Example of a single recorded match between two top models: Replay | Log
1.) Every game is analyzed by Stockfish 17.1, which also sets the initial Elo placement based on calculated accuracy.¹
The initial Elo rating is determined by analyzing the first 10 games (excluding self-play) of each AI model using Stockfish 17.1 at depth 18. The accuracy is calculated using Lichess's methodology, converting move-by-move engine evaluations to win percentages, applying position-specific complexity weighting, and combining weighted and harmonic means for the final accuracy score. Each mode receives unique placements.
Formula:
Initial_Elo = 400 + 200 × (2^((Accuracy-30)/20) - 1)
Where:
- Accuracy = Average accuracy across first 10 non-self-play games (%)
- Accuracy is constrained between 10% and 90%
- Human players start at 1500 Elo regardless of accuracy
- Default fallback: 1000 Elo if no accuracy data available yet
Examples:
• 30% avg accuracy → Initial_Elo = 400 + 200 × (2^0 - 1) = 400
• 50% avg accuracy → Initial_Elo = 400 + 200 × (2^1 - 1) = 600
• 70% avg accuracy → Initial_Elo = 400 + 200 × (2^2 - 1) = 1000
• 90% avg accuracy → Initial_Elo = 400 + 200 × (2^3 - 1) = 1800
Completely random play, which averages nearly 33% accuracy, is seeded at roughly 419.
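The placement formula and its edge cases can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the function name `initial_elo` and its parameters are hypothetical, and the order in which the human and fallback rules are checked is an assumption.

```python
from typing import Optional


def initial_elo(accuracy: Optional[float] = None, is_human: bool = False) -> float:
    """Seed rating from average accuracy (%), following the formula above.

    Hypothetical helper: the name, signature, and rule ordering are assumptions.
    """
    if is_human:
        # Human players start at 1500 regardless of accuracy.
        return 1500.0
    if accuracy is None:
        # Default fallback when no accuracy data is available yet.
        return 1000.0
    # Accuracy is constrained between 10% and 90%.
    clamped = max(10.0, min(90.0, accuracy))
    return 400.0 + 200.0 * (2.0 ** ((clamped - 30.0) / 20.0) - 1.0)


if __name__ == "__main__":
    for acc in (30, 50, 70, 90):
        print(acc, initial_elo(acc))
```

Because the exponent doubles every 20 accuracy points, the gap between 70% and 90% (800 Elo) is much larger than the gap between 30% and 50% (200 Elo), so the seed rewards high accuracy disproportionately.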
2.) Then, automatic Elo updates are applied for each game, per mode and mixed.²
After initial placement, Elo ratings are updated after each AI vs AI game using the standard Elo rating system. Elo < 500 ≈ random play. Each mode has separate Elo.
Update Formula:
New_Elo = Old_Elo + K × (Actual_Score - Expected_Score)
Where:
- K = K-factor based on experience:
  • Provisional players (<30 games): K = 40
  • Established players (≥30 games): K = 20
- Actual_Score = 1 (win), 0.5 (draw), 0 (loss)
- Expected_Score = 1 / (1 + 10^((Opponent_Elo - Player_Elo) / 400))
Example:
If a 600 Elo AI (15 games played, K=40) plays a 1000 Elo AI and wins:
• Expected Score = 1 / (1 + 10^((1000-600)/400)) = 0.091
• Elo Change = 40 × (1 - 0.091) = +36.4
• New Elo = 600 + 36.4 = 636.4
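The update rule and the worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and it assumes `games_played` counts games completed before the current one. Any special rules are not modeled here.

```python
def expected_score(player_elo: float, opponent_elo: float) -> float:
    """Standard Elo expected score for the player against the opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_elo - player_elo) / 400.0))


def k_factor(games_played: int) -> int:
    """K = 40 for provisional players (<30 games), K = 20 otherwise."""
    return 40 if games_played < 30 else 20


def update_elo(player_elo: float, opponent_elo: float,
               actual_score: float, games_played: int) -> float:
    """Return the player's new rating after one game.

    actual_score is 1 (win), 0.5 (draw), or 0 (loss).
    Assumption: games_played is the count before this game.
    """
    k = k_factor(games_played)
    return player_elo + k * (actual_score - expected_score(player_elo, opponent_elo))


if __name__ == "__main__":
    # A 600 Elo AI with 15 games played beats a 1000 Elo AI.
    print(round(update_elo(600, 1000, 1.0, 15), 1))
```

Each side is updated with its own K-factor, so a provisional player can gain more points from a game than an established opponent loses.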
Special Rules:
| Model | Elo | Games | W-D-L | Acc | Score | Avg Mat. | Avg Turns | tok/move | %legal |
|---|---|---|---|---|---|---|---|---|---|