LLM Chess tournament - Single-elimination

1 How Models Play

Models are given the board state (FEN), the game history (SAN), and a list of legal moves (SAN), all in standard chess notation. They are then instructed to return a JSON object containing their reasoning and their chosen move. Each response is validated before the game state is updated.
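As a rough illustration of this loop, the sketch below builds a prompt from the three inputs and checks a model's reply. The prompt wording, the JSON key names ("reasoning", "move"), and both function names are assumptions made for this example, not the tournament's actual code. What happens after an invalid reply (retry, forfeit, or something else) is not stated here, so the sketch simply returns None.

```python
import json
from typing import Optional


def build_prompt(fen: str, history_san: list[str], legal_san: list[str]) -> str:
    """Assemble the text sent to the model (wording is illustrative only)."""
    return (
        f"Board (FEN): {fen}\n"
        f"Game history (SAN): {' '.join(history_san) or '(none)'}\n"
        f"Legal moves (SAN): {', '.join(legal_san)}\n"
        'Reply with a JSON object: {"reasoning": "...", "move": "<one legal move>"}'
    )


def validate_response(raw: str, legal_san: list[str]) -> Optional[str]:
    """Return the chosen move if the reply is well-formed and legal, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    move = data.get("move")
    reasoning = data.get("reasoning")
    if not isinstance(move, str) or not isinstance(reasoning, str):
        return None  # a required key is missing or has the wrong type
    move = move.strip()
    # Only accept a move that appears verbatim in the supplied legal-move list.
    return move if move in legal_san else None
```

For example, with legal moves `["e4", "d4", "Nf3"]`, a reply of `{"reasoning": "Control the centre.", "move": "e4"}` yields `"e4"`, while a reply naming `"O-O"` yields `None`. Checking against the supplied list keeps the validator free of any chess logic of its own.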

2 Competing Models

The 15 models in this tournament range from very cheap ones such as LFM 7B to pricier, more capable ones such as GPT-4 Turbo. The aim is to create interesting matchups that account for presumed skill gaps. All games (PGN) are evaluated by Stockfish 17 to determine each model's accuracy.

3 Tournament Structure

To keep the strongest players from knocking each other out in the first round, the initial bracket pairs the priciest models (black) against the cheapest (white), which makes for a more balanced progression through the tournament. Drawn games are replayed until a winner emerges.
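One possible reading of this seeding rule, sketched in Python: sort the field by price, then pair the cheapest with the priciest and work inward. The function names, the bye for the unpaired model in an odd-sized field, the fixed colours on replays, and the `max_games` cap are all assumptions of this sketch and may differ from how the tournament was actually run.

```python
from typing import Callable, Optional


def seed_first_round(
    prices: dict[str, float],
) -> tuple[list[tuple[str, str]], Optional[str]]:
    """Pair cheapest (white) with priciest (black), moving inward.

    Returns (pairings, bye). With an odd number of models the mid-priced
    one is left unpaired; giving it a bye is this sketch's assumption.
    """
    ranked = sorted(prices, key=prices.get)  # cheapest first
    pairings = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi:
        pairings.append((ranked[lo], ranked[hi]))  # (white, black)
        lo += 1
        hi -= 1
    bye = ranked[lo] if lo == hi else None
    return pairings, bye


def play_until_decisive(
    play_game: Callable[[str, str], Optional[str]],
    white: str,
    black: str,
    max_games: int = 20,  # hypothetical safety cap, not from the write-up
) -> str:
    """Replay a pairing until play_game returns a winner (None means a draw)."""
    for _ in range(max_games):
        winner = play_game(white, black)
        if winner is not None:
            return winner
    raise RuntimeError(f"no decisive game after {max_games} attempts")
```

Here `play_game` stands in for whatever routine actually runs one game and reports the winner's name. With five placeholder models priced 0.5, 1, 2, 5, and 10, `seed_first_round` pairs the 0.5 model against the 10 model, the 1 model against the 5 model, and leaves the 2 model with the bye.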

Note: Extremely expensive models such as o1 ($60/mTok, multiplied by its thought chains) and GPT-4.5 ($150/mTok) were excluded from this tournament: a single test match between them (which GPT-4.5 won decisively) cost almost $20, with o1 being ~2.5x more expensive than GPT-4.5 in this format.
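The note's figures can look contradictory at first, since o1 has the lower per-token price yet the higher bill. A back-of-the-envelope sketch shows how thought chains account for this. It assumes mTok means per million tokens, that cost is dominated by billed output tokens at the quoted prices, and that input-token costs can be ignored. The function name is made up for this example.

```python
def implied_token_ratio(cost_ratio: float, price_a: float, price_b: float) -> float:
    """Billed-token ratio (A over B) implied by a cost ratio and per-token prices.

    cost_a / cost_b = (price_a * tokens_a) / (price_b * tokens_b)
    =>  tokens_a / tokens_b = cost_ratio * price_b / price_a
    """
    return cost_ratio * price_b / price_a


# Figures from the note: o1 at $60, GPT-4.5 at $150, o1 ~2.5x costlier overall.
ratio = implied_token_ratio(cost_ratio=2.5, price_a=60.0, price_b=150.0)
print(f"o1 billed roughly {ratio:.2f}x as many tokens as GPT-4.5")
# prints "o1 billed roughly 6.25x as many tokens as GPT-4.5"
```

Under these simplifying assumptions, o1 would have been billed for roughly six times as many tokens per match as GPT-4.5, most of them presumably reasoning tokens.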

Round 1: Quarterfinals
LFM 7B (Acc: 46%) -vs- GPT-4 Turbo (Acc: 49%)
GPT-4o-mini (Acc: 58%) -vs- o3-mini (Acc: 46%)
DeepSeek V3 (Acc: 42%) -vs- Claude 3.7 Sonnet (Acc: 49%)
GPT-3.5 Turbo (Acc: 33%) -vs- Mistral Large 2411 (Acc: 33%)
Llama 3.1 405B Instruct (Acc: 32%) -vs- GPT-4o (Acc: 35%)
Llama 3.3 70B Instruct (Acc: 77%) -vs- DeepSeek-R1 (Acc: 92%)
Gemini 2.0 Flash (Acc: 39%) -vs- o1-mini (Acc: 40%)

Round 2: Semifinals
GPT-4o-mini (Acc: 55%) -vs- GPT-4 Turbo (Acc: 41%)
Mistral Large 2411 (Acc: 39%) -vs- Claude 3.7 Sonnet (Acc: 38%)
GPT-4o (Acc: 33%) -vs- DeepSeek-R1 (Acc: 30%)
Gemini 1.5 Pro 002 (Acc: 37%) -vs- o1-mini (Acc: 51%)

Round 3: Finals
GPT-4o-mini (Acc: 45%) -vs- Mistral Large 2411 (Acc: 45%)
GPT-4o (Acc: 43%) -vs- o1-mini (Acc: 45%)

Championship Match (BO3)
Mistral Large 2411 (Acc: 40%) -vs- o1-mini (Acc: 49%)
Mistral Large 2411 (Acc: 56%) -vs- o1-mini (Acc: 64%)

Champion: o1-mini!

43% of matches ended in a draw. The closest matchup was Mistral Large 2411 -vs- Claude 3.7 Sonnet (4 draws out of 5 matches).

This was a fun experiment that took many hours to set up. I hope you found it interesting! Try your own chess experiments here: dubesor.de/chess