
AI Chess Leaderboard

Performance analysis of 247 AI models in 3342 chess matches

Why Chess?

I like it. Plus, it's a historic, centuries-old game of intellect: pure strategy with objective ground truth. Due to its exponential complexity beyond the opening moves, it's largely resistant to common 'benchmaxxing' strategies. It tests game knowledge, reasoning, planning, state tracking, consistency, and instruction adherence, all measurable via an objective superhuman judge (Stockfish) and updated with self-correcting Elo. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, output speed, illegal outputs, etc.) and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.

• Stockfish 17.1 analysis
• Dynamic Elo per mode
• Full replays for every match
• Updates every 24h
Project Notes

Back in March 2025, the most performant model initially observed was GPT-3.5 Turbo Instruct, playing White in Continuation mode. Here is a spontaneous live video demonstration: YouTube
In full-information chess (Reasoning mode), o1-mini showed strength, winning the first tournament (long since decrowned & deprecated).
Elo progression page: Chess Elo Race

Data collection: Matchups and game volume are mostly influenced by model competency/speed, data gaps, API constraints, and/or budget.
This chess inference ran me ~$3700. Data collected move-by-move — see an example match: Log | Replay.
Any recent reasoning replay now automatically features full move-by-move model commentary and custom move-labeling.

Elo is always relative to the player pool it's measured in.
LLMs operate in an alien manner compared to humans and traditional chess engines; they have a completely different error distribution and are thus not directly comparable. Due to frequent inquiries: while I don't mix isolated Elo pools, and bot tuning is inconsistent, from low-scale testing and under these caveats I can share a ballpark observation: ~1200 Continuation-rated models matched ~1400-rated Lichess bots, while ~1850 Elo on this leaderboard (the best reasoners) correlated roughly with the Lv16 "Expert" 2000-rated chess.com bots. Example matches of Gemini-3.1-Pro-Preview (Reasoning ~1837) vs 1900 (W), 2000 (WD), 2100 (L), 2200 (L), 2400 (L)

Like all my benchmarks, unless specifically indicated otherwise, models are tested on their default settings. This prevents exploding costs and test times, and avoids flooding the leaderboard with unrealistic, benchmark-optimized configurations that aren't representative of the average user experience. This principled method aligns incentives correctly and keeps the evaluation ecosystem feasible, honest, and representative.

Initial Elo Calculation

The initial Elo rating is determined by analyzing the first 10 games (excluding self-play) of each AI model using Stockfish 17.1 at depth 18. The accuracy is calculated using Lichess's methodology, converting move-by-move engine evaluations to win percentages, applying position-specific complexity weighting, and combining weighted and harmonic means for the final accuracy score. Each mode receives unique placements.
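The accuracy pipeline above rests on converting engine centipawn evaluations to win percentages. A minimal sketch of that step, using the conversion constants published in Lichess's source (treat the exact constants, and my function names, as assumptions here):

```python
import math

def win_percent(centipawns: float) -> float:
    """Lichess-style eval-to-win%: maps a centipawn eval (White's view) to 0-100."""
    return 50.0 + 50.0 * (2.0 / (1.0 + math.exp(-0.00368208 * centipawns)) - 1.0)

def move_accuracy(win_before: float, win_after: float) -> float:
    """Per-move accuracy from the win% drop caused by the move (Lichess-style curve)."""
    drop = max(0.0, win_before - win_after)
    return max(0.0, min(100.0, 103.1668 * math.exp(-0.04354 * drop) - 3.1669))
```

A move that loses no win percentage scores near 100; larger drops decay the score exponentially, which is what makes single blunders costly in the final accuracy figure.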

Formula:

Initial_Elo = 400 + 200 × (2^((Accuracy-30)/20) - 1)

Where:
- Accuracy = Average accuracy across first 10 non-self-play games (%)
- Accuracy is constrained between 10% and 90%
- Human players start at 1500 Elo regardless of accuracy
- Default fallback: 1000 Elo if no accuracy data available yet

Examples:
• 30% avg accuracy → Initial_Elo = 400 + 200 × (2^0 - 1) = 400
• 50% avg accuracy → Initial_Elo = 400 + 200 × (2^1 - 1) = 600
• 70% avg accuracy → Initial_Elo = 400 + 200 × (2^2 - 1) = 1000
• 90% avg accuracy → Initial_Elo = 400 + 200 × (2^3 - 1) = 1800

Completely random play seeds at nearly 33% avg accuracy → ~419 Elo
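The placement rules above can be sketched as follows (a minimal sketch; the function and parameter names are my own):

```python
from typing import Optional

def initial_elo(avg_accuracy: Optional[float], is_human: bool = False) -> float:
    """Initial Elo from average accuracy over the first 10 non-self-play games."""
    if is_human:
        return 1500.0                  # humans start at 1500 regardless of accuracy
    if avg_accuracy is None:
        return 1000.0                  # fallback while no accuracy data exists yet
    acc = min(max(avg_accuracy, 10.0), 90.0)   # constrain to the 10-90% range
    return 400.0 + 200.0 * (2.0 ** ((acc - 30.0) / 20.0) - 1.0)
```

For instance, `initial_elo(70)` reproduces the 1000-Elo example above.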

Elo Update System

After initial placement, Elo ratings are updated after each AI vs AI game using the standard Elo rating system. Elo < 500 ≈ random play. Each mode has separate Elo.

Update Formula:

New_Elo = Old_Elo + K × (Actual_Score - Expected_Score)

Where:
- K = K-factor based on experience:
  • Provisional players (<30 games): K = 40
  • Established players (≥30 games): K = 20
- Actual_Score = 1 (win), 0.5 (draw), 0 (loss)
- Expected_Score = 1 / (1 + 10^((Opponent_Elo - Player_Elo) / 400))

Example:
If a 600 Elo AI (15 games played, K=40) plays a 1000 Elo AI and wins:
• Expected Score = 1 / (1 + 10^((1000-600)/400)) = 0.091
• Elo Change = 40 × (1 - 0.091) = +36.4
• New Elo = 600 + 36.4 = 636.4

Special Rules:
• Human vs AI games: Only human Elo is updated
• Self-play games are excluded from Elo calculations
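Putting the update rules together (a minimal sketch; names are my own, and the self-play / human-vs-AI gating is assumed to be handled by the caller):

```python
def update_elo(player_elo: float, opponent_elo: float,
               score: float, games_played: int) -> float:
    """One standard Elo update. score: 1 = win, 0.5 = draw, 0 = loss."""
    k = 40 if games_played < 30 else 20   # provisional vs established K-factor
    expected = 1.0 / (1.0 + 10.0 ** ((opponent_elo - player_elo) / 400.0))
    return player_elo + k * (score - expected)
```

With the numbers from the example (600 vs 1000, 15 games played, a win), this returns ≈636.4.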

Chess Performance Overview: Wins / Draws / Losses / Elo / Acc / Score
Player Statistics: Model | Elo | Games | W-D-L | Acc | Score | Mat. | Turns | tok/move | Spd | Legal
Opening Distribution (excl. Humans)
Trophy Room (Mixed Modes)