Models are told to act as chess grandmasters, given the current movetext of a PGN (the sequence of moves in standard chess notation), and instructed to repeat the entire game sequence and append one new move. The new move is then validated before the game state is updated. After a 3rd consecutive invalid move, a random legal move is played instead. This methodology is minimalistic, pure token-level game continuation: no reasoning, no board state, and no list of legal moves. It is therefore much more prone to models producing illegal continuations/invalid moves.
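A minimal sketch of this loop, assuming python-chess for move validation; the query_model() helper is a hypothetical stand-in for the actual LLM call, and the prompt wording is illustrative, not the exact prompt used:

```python
import random

import chess

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; returns the model's raw text continuation."""
    raise NotImplementedError

def next_move(board: chess.Board, movetext: str) -> chess.Move:
    """Ask for one new move; after a 3rd consecutive invalid reply,
    fall back to a random legal move."""
    prompt = (
        "You are a chess grandmaster. Repeat the entire game sequence below "
        "and add exactly one new move:\n" + movetext
    )
    for _ in range(3):
        tokens = query_model(prompt).split()
        if not tokens:
            continue
        try:
            # The last token of a well-formed reply should be the new move in SAN.
            return board.parse_san(tokens[-1])
        except ValueError:
            continue  # illegal or malformed move: counts as an invalid attempt
    return random.choice(list(board.legal_moves))
```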
The 16 models used for this tournament range from very cheap ones such as LFM 7B to pricier and more capable ones such as GPT-4 Turbo, which aims to create interesting matchups that account for presumed skill gaps. All games (PGN) are evaluated by Stockfish 17 to determine model accuracy. Legal continuations are tracked against total attempts and displayed as Legal: %.
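For illustration, a sketch of how a finished game might be scored with python-chess and a local Stockfish binary; the exact accuracy formula isn't specified here, so plain per-move centipawn scores stand in for it, and the depth limit is an assumption:

```python
import chess
import chess.engine
import chess.pgn

def evaluate_game(pgn_path: str, engine_path: str = "stockfish") -> list[int]:
    """Return a Stockfish centipawn score (white's perspective) after every
    move of a finished game -- a stand-in for the reported accuracy metric."""
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    board = game.board()
    scores = []
    try:
        for move in game.mainline_moves():
            board.push(move)
            info = engine.analyse(board, chess.engine.Limit(depth=16))
            scores.append(info["score"].white().score(mate_score=10000))
    finally:
        engine.quit()
    return scores

def legal_rate(legal_continuations: int, total_attempts: int) -> float:
    """The displayed 'Legal: %' figure."""
    return 100.0 * legal_continuations / total_attempts
```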
To prevent the strongest players from knocking each other out in the first round, the initial brackets are populated by pitting the priciest models (black) against the cheapest (white), creating a balanced tournament progression. Drawn games are repeated until a victor emerges; if none does, the average material balance across all games is used to assign the victor.
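A sketch of that seeding, assuming cheapest-vs-priciest pairing; the listed prices are placeholders and "Model B"/"Model C" are hypothetical names, purely for illustration:

```python
def seed_bracket(prices: dict[str, float]) -> list[tuple[str, str]]:
    """Pair the cheapest models (white) with the priciest (black)."""
    by_price = sorted(prices, key=prices.get)   # cheapest first
    half = len(by_price) // 2
    whites = by_price[:half]                    # cheap half plays white
    blacks = list(reversed(by_price[half:]))    # expensive half plays black
    return list(zip(whites, blacks))

# Prices ($/mTok) below are placeholders, not real pricing:
print(seed_bracket({"LFM 7B": 0.01, "Model B": 0.2,
                    "Model C": 1.0, "GPT-4 Turbo": 30.0}))
# -> [('LFM 7B', 'GPT-4 Turbo'), ('Model B', 'Model C')]
```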
Note: Extremely expensive models such as o1 ($60/mTok × thought chains) and GPT-4.5 ($150/mTok) were excluded from this tournament.
Note that this method of pure move continuation can lead to stronger game regurgitation, but it is far less robust than my previous method, producing a multitude of illegal moves. As a result, GPT-3.5 Turbo Instruct was not able to competently adhere to this format when playing as black. Gemini 2.0 Flash also showed weaker performance as black, but was the weaker white player overall.
Try your own flawed chess experiments using this move-continuation method here: dubesor.de/chessbeta