Models are fed information in standard chess notation about the board state (FEN), the game history (SAN), and the list of legal moves (SAN). They are then instructed to return a JSON object containing their reasoning and chosen move. Each response is validated before the game state is updated.
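For illustration, here is a minimal sketch of what that per-turn loop could look like using python-chess. The prompt wording, function names, and JSON handling are my assumptions, not the exact harness used in the tournament:

```python
import json
import chess

def build_prompt(board: chess.Board, history_san: list[str]) -> str:
    """Assemble the information the model receives each turn."""
    legal_san = [board.san(m) for m in board.legal_moves]
    return (
        f"FEN: {board.fen()}\n"
        f"History (SAN): {' '.join(history_san) or '(none)'}\n"
        f"Legal moves (SAN): {', '.join(legal_san)}\n"
        'Reply with JSON: {"reasoning": "...", "move": "<SAN>"}'
    )

def apply_response(board: chess.Board, history_san: list[str], raw: str) -> bool:
    """Validate the model's JSON reply and update the game state.

    Returns True if the move was legal and applied, False otherwise.
    """
    try:
        move_san = json.loads(raw)["move"]
        move = board.parse_san(move_san)  # raises ValueError on illegal/garbled SAN
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return False
    board.push(move)
    history_san.append(move_san)
    return True
```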
The 15 models used in this tournament range from very cheap ones, such as LFM 7B, to pricier and more competent ones, such as GPT-4 Turbo. This mix aims to create interesting matchups while accounting for presumed skill gaps. All games (PGN) are evaluated with Stockfish 17 to determine each model's accuracy.
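As a rough idea of how engine-based evaluation can work, the sketch below replays a PGN with python-chess and computes each side's average centipawn loss. The engine path, search depth, and the use of ACPL as a stand-in for "accuracy" are assumptions, since the exact accuracy formula isn't spelled out here:

```python
import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "stockfish"              # assumed: Stockfish 17 binary on PATH
LIMIT = chess.engine.Limit(depth=18)   # assumed analysis depth

def average_centipawn_loss(pgn_path: str) -> dict[str, float]:
    """Replay a PGN and compute each side's average centipawn loss."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)

    losses = {chess.WHITE: [], chess.BLACK: []}
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    try:
        board = game.board()
        for move in game.mainline_moves():
            color = board.turn
            # Evaluation before the move (best play) vs. after the move actually
            # played, both from the mover's point of view.
            best = engine.analyse(board, LIMIT)["score"].pov(color).score(mate_score=10000)
            board.push(move)
            played = engine.analyse(board, LIMIT)["score"].pov(color).score(mate_score=10000)
            losses[color].append(max(0, best - played))
    finally:
        engine.quit()

    return {
        "white_acpl": sum(losses[chess.WHITE]) / max(1, len(losses[chess.WHITE])),
        "black_acpl": sum(losses[chess.BLACK]) / max(1, len(losses[chess.BLACK])),
    }
```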
To prevent the strongest players from knocking each other out in the first round, the initial bracket is seeded by having the priciest models (black) compete against the cheapest (white), creating a more balanced tournament progression. Drawn games are replayed until a victor emerges.
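A seeding rule like that only takes a few lines. The sketch below is a hypothetical illustration (function name and bye handling are assumptions): sort the field by price, then pair the cheapest remaining entry as white against the priciest as black:

```python
def seed_bracket(models_by_price: list[str]) -> list[tuple[str, str]]:
    """Pair cheapest (white) against priciest (black) for round one.

    `models_by_price` is sorted from cheapest to most expensive; with an odd
    field (e.g. 15 models) the middle entry gets a bye, handled elsewhere.
    """
    n = len(models_by_price)
    pairings = []
    for i in range(n // 2):
        white = models_by_price[i]          # cheap side plays white
        black = models_by_price[n - 1 - i]  # expensive side plays black
        pairings.append((white, black))
    return pairings
```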
Note: Extremely expensive models such as o1 ($60/MTok, multiplied by its thought chains) and GPT-4.5 ($150/MTok) were excluded from this tournament, as a single test match between them (which GPT-4.5 won decisively) cost almost $20, with o1 coming out ~2.5x more expensive than GPT-4.5 in this format.
43% of matches ended in a draw, with the closest matchup being Mistral Large 2411 vs. Claude 3.7 Sonnet (4 draws out of 5 matches).
This was a fun experiment that took many hours to set up; I hope you found it interesting! Try your own chess experiments here: dubesor.de/chess