Performance analysis of 209 AI models in 2635 chess matches
I like it. Plus it's a centuries-old game of intellect: pure strategy with objective ground truth. Due to its exponential complexity, chess beyond the opening moves is largely resistant to common 'benchmaxxing' strategies. It tests game knowledge, reasoning, planning, state tracking, and instruction adherence, all measurable by an objective superhuman judge (Stockfish) and updated with self-correcting Elo. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, illegal outputs, etc.) and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.
Back in March 2025, the most performant model initially observed was GPT-3.5 Turbo Instruct, playing white in Continuation mode. Here is a spontaneous live video demonstration:
YouTube
In full-information chess (Reasoning mode), o1-mini showed strength, winning the first tournament (long since decrowned).
Data collection: the number of games per model is mostly influenced by time, model competency/speed, API constraints, and/or budget. This chess inference ran me ~$2800.
Data collected move-by-move — see an example match: Log | Replay.
The initial Elo rating is determined by analyzing each AI model's first 10 games (excluding self-play) with Stockfish 17.1 at depth 18. Accuracy is calculated using Lichess's methodology: move-by-move engine evaluations are converted to win percentages, position-specific complexity weighting is applied, and weighted and harmonic means are combined into the final accuracy score. Each mode receives its own placement.
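For concreteness, here's a minimal Python sketch of that accuracy pipeline. The centipawn-to-win-percentage and per-move accuracy constants are Lichess's published ones; the per-game combination below uses a plain mean plus harmonic mean and skips the position-complexity weighting, so it's an approximation of the benchmark's scoring rather than its exact code.

```python
import math

def win_percent(centipawns: float) -> float:
    """Map a Stockfish centipawn eval (mover's perspective) to an
    expected win percentage, using Lichess's published curve."""
    return 50 + 50 * (2 / (1 + math.exp(-0.00368208 * centipawns)) - 1)

def move_accuracy(win_before: float, win_after: float) -> float:
    """Score one move by how much win percentage it gave up
    (Lichess's accuracy curve), clamped to 0-100."""
    raw = 103.1668 * math.exp(-0.04354 * (win_before - win_after)) + 3.1669
    return max(0.0, min(100.0, raw))

def game_accuracy(per_move: list[float]) -> float:
    """Combine per-move accuracies into a game score. This sketch blends a
    plain mean with a harmonic mean; the real pipeline also applies
    position-specific complexity weighting."""
    if not per_move:
        return 0.0
    mean = sum(per_move) / len(per_move)
    harmonic = len(per_move) / sum(1 / max(a, 1e-6) for a in per_move)
    return (mean + harmonic) / 2
```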
Formula:
Initial_Elo = 400 + 200 × (2^((Accuracy - 30) / 20) - 1)

Where:
- Accuracy = average accuracy across the first 10 non-self-play games (%)
- Accuracy is constrained between 10% and 90%
- Human players start at 1500 Elo regardless of accuracy
- Default fallback: 1000 Elo if no accuracy data is available yet
Examples:
• 30% avg accuracy → Initial_Elo = 400 + 200 × (2^0 - 1) = 400
• 50% avg accuracy → Initial_Elo = 400 + 200 × (2^1 - 1) = 600
• 70% avg accuracy → Initial_Elo = 400 + 200 × (2^2 - 1) = 1000
• 90% avg accuracy → Initial_Elo = 400 + 200 × (2^3 - 1) = 1800
• Completely random play (nearly 33% avg accuracy) → 419
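As code, the placement rule is the formula above plus its clamping and fallbacks; the function name and signature here are mine, not the benchmark's:

```python
def initial_elo(avg_accuracy: float | None, is_human: bool = False) -> float:
    """Placement Elo from average accuracy over the first 10 non-self-play games."""
    if is_human:
        return 1500.0                              # humans start at 1500 regardless
    if avg_accuracy is None:
        return 1000.0                              # fallback: no accuracy data yet
    acc = max(10.0, min(90.0, avg_accuracy))       # constrain to the 10-90% band
    return 400 + 200 * (2 ** ((acc - 30) / 20) - 1)

# initial_elo(30) == 400.0, initial_elo(50) == 600.0, initial_elo(70) == 1000.0
```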
After initial placement, Elo ratings are updated after each AI vs AI game using the standard Elo rating system. Elo < 500 ≈ random play. Each mode has its own Elo.
Update Formula:
New_Elo = Old_Elo + K × (Actual_Score - Expected_Score)

Where:
- K = K-factor based on experience:
  • Provisional players (<30 games): K = 40
  • Established players (≥30 games): K = 20
- Actual_Score = 1 (win), 0.5 (draw), 0 (loss)
- Expected_Score = 1 / (1 + 10^((Opponent_Elo - Player_Elo) / 400))
Example:
If a 600 Elo AI (15 games played, K=40) plays a 1000 Elo AI and wins:
• Expected Score = 1 / (1 + 10^((1000-600)/400)) = 0.091
• Elo Change = 40 × (1 - 0.091) = +36.4
• New Elo = 600 + 36.4 = 636.4
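The same arithmetic as a small sketch (again, the names are mine, not the project's code):

```python
def expected_score(player_elo: float, opponent_elo: float) -> float:
    """Standard Elo expectation for the player against this opponent."""
    return 1 / (1 + 10 ** ((opponent_elo - player_elo) / 400))

def update_elo(player_elo: float, opponent_elo: float,
               actual_score: float, games_played: int) -> float:
    """One-game update: K=40 while provisional (<30 games), K=20 once established."""
    k = 40 if games_played < 30 else 20
    return player_elo + k * (actual_score - expected_score(player_elo, opponent_elo))

# Worked example above: a 600 Elo model with 15 games beats a 1000 Elo model.
# update_elo(600, 1000, actual_score=1.0, games_played=15)  ->  ~636.4
```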
Special Rules:
| Model | Elo | Games | W-D-L | Acc | Score | Avg Mat. | Avg Turns | tok/move | %legal |
|---|---|---|---|---|---|---|---|---|---|