AI Chess Leaderboard

Performance analysis of 209 AI models in 2635 chess matches

Why Chess?

I like it. Plus, it's a historic, centuries-old game of intellect: pure strategy with objective ground truth. Due to its exponential complexity, chess beyond the opening moves is largely resistant to common 'benchmaxxing' strategies. It tests game knowledge, reasoning, planning, state tracking, and instruction adherence, all measurable by an objective superhuman judge (Stockfish) and tracked with a self-correcting Elo rating. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, illegal outputs, etc.) and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.

- Stockfish 17.1 analysis
- Dynamic Elo per mode
- Full replays for every match
- Updates every 24h
Project Notes

Back in March 2025, the most performant model initially observed was GPT-3.5 Turbo Instruct, playing White in Continuation mode. Here is a spontaneous live video demonstration: YouTube
In full-information chess (Reasoning mode), o1-mini showed strength, winning the first tournament (long since decrowned).

Data collection: The number of games per model is mostly determined by time, model competency/speed, API constraints, and/or budget. This chess inference ran me ~$2800.
Data is collected move by move; see an example match: Log | Replay.

Initial Elo Calculation

The initial Elo rating is determined by analyzing the first 10 games (excluding self-play) of each AI model using Stockfish 17.1 at depth 18. Accuracy is calculated using Lichess's methodology: move-by-move engine evaluations are converted to win percentages, position-specific complexity weighting is applied, and weighted and harmonic means are combined into the final accuracy score. Each mode receives its own initial placement.
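The move-level part of that accuracy methodology can be sketched roughly as follows. This is a minimal sketch, not the leaderboard's actual code: the constants are Lichess's published ones, and the per-game complexity weighting and harmonic-mean combination described above are omitted here.

```python
import math

def win_pct(cp: float) -> float:
    """Centipawn evaluation -> win probability %, via Lichess's logistic mapping."""
    return 50 + 50 * (2 / (1 + math.exp(-0.00368208 * cp)) - 1)

def move_accuracy(win_before: float, win_after: float) -> float:
    """Accuracy % of a single move, from the drop in win probability
    (Lichess's published constants; game-level weighting omitted)."""
    drop = max(0.0, win_before - win_after)
    return max(0.0, min(100.0, 103.1668 * math.exp(-0.04354 * drop) - 3.1669))
```

A move that preserves the engine's win probability scores near 100%; larger evaluation drops decay the score exponentially toward 0.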

Formula:

Initial_Elo = 400 + 200 × (2^((Accuracy-30)/20) - 1)

Where:
- Accuracy = Average accuracy across first 10 non-self-play games (%)
- Accuracy is constrained between 10% and 90%
- Human players start at 1500 Elo regardless of accuracy
- Default fallback: 1000 Elo if no accuracy data available yet

Examples:
• 30% avg accuracy → Initial_Elo = 400 + 200 × (2^0 - 1) = 400
• 50% avg accuracy → Initial_Elo = 400 + 200 × (2^1 - 1) = 600
• 70% avg accuracy → Initial_Elo = 400 + 200 × (2^2 - 1) = 1000
• 90% avg accuracy → Initial_Elo = 400 + 200 × (2^3 - 1) = 1800

Completely random play yields nearly 33% average accuracy, seeding an initial Elo of about 419.

Elo Update System

After initial placement, Elo ratings are updated after each AI-vs-AI game using the standard Elo rating system. A rating below 500 roughly corresponds to random play. Each mode maintains a separate Elo rating.

Update Formula:

New_Elo = Old_Elo + K × (Actual_Score - Expected_Score)

Where:
- K = K-factor based on experience:
  • Provisional players (<30 games): K = 40
  • Established players (≥30 games): K = 20
- Actual_Score = 1 (win), 0.5 (draw), 0 (loss)
- Expected_Score = 1 / (1 + 10^((Opponent_Elo - Player_Elo) / 400))

Example:
If a 600 Elo AI (15 games played, K=40) plays a 1000 Elo AI and wins:
• Expected Score = 1 / (1 + 10^((1000-600)/400)) = 0.091
• Elo Change = 40 × (1 - 0.091) = +36.4
• New Elo = 600 + 36.4 = 636.4
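The update rule and the worked example above translate directly into code. A minimal sketch of the standard Elo update with the K-factors listed (function names are mine):

```python
def expected_score(player_elo: float, opponent_elo: float) -> float:
    """Probability of winning under the standard Elo logistic model."""
    return 1 / (1 + 10 ** ((opponent_elo - player_elo) / 400))

def update_elo(old_elo: float, opponent_elo: float,
               actual_score: float, games_played: int) -> float:
    """actual_score: 1 = win, 0.5 = draw, 0 = loss."""
    k = 40 if games_played < 30 else 20  # provisional vs. established K-factor
    return old_elo + k * (actual_score - expected_score(old_elo, opponent_elo))

# The example above: a 600-Elo AI with 15 games beats a 1000-Elo AI
print(round(update_elo(600, 1000, 1.0, 15), 1))  # 636.4
```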

Special Rules:

[Interactive leaderboard: per-model wins/draws/losses, Elo, accuracy, and score; player statistics table with columns Model, Elo, Games, W-D-L, Acc, Score, Avg Material, Avg Turns, tokens/move, and %legal moves; opening distribution chart, excluding humans.]