Evaluating the Limitations of Large Language Models in Board Games

Authors

Condy Bao¹ and FangFang Ma², ¹USA, ²Johns Hopkins University, USA

Abstract

In recent years, advances in Large Language Models (LLMs) have sparked significant interest in their capabilities across a wide range of tasks. This paper investigates deficiencies in their performance on board games such as Chess, Connect 4, and Tic Tac Toe. Key findings from the experiments include inconsistencies in board representation across Tic Tac Toe, Connect 4, and Chess. In addition, GPT-4's ability to solve chess puzzles was tested across different difficulty levels and prompting strategies. On easy puzzles, GPT-4 achieved an accuracy of 19.8% with zero-shot prompting, which improved to 30% with one-shot prompting and 32.8% with two-shot prompting. On intermediate puzzles, the accuracies were 14%, 17.6%, and 18.6% for zero-shot, one-shot, and two-shot prompting, respectively; on hard puzzles, they were 9.8%, 13.5%, and 15% for the same strategies. When the same puzzles were presented in a multiple-choice format, accuracy improved substantially to 48.2%, 49.4%, and 47.2%, respectively. Moreover, when given repeated attempts at easy puzzles, GPT-4's accuracy rose from 19.8% to 33.4%. These results highlight key challenges LLMs face in playing board games and suggest areas for improvement in current models. The discussion also provides insights into their limitations on other general tasks involving memory and reasoning, which extend beyond board games.
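To make the zero-, one-, and two-shot puzzle evaluation concrete, the Python sketch below shows one way such a loop could be structured. It is a minimal illustration under stated assumptions, not the authors' code: the puzzle record format, the worked examples, and the exact-match scoring are hypothetical stand-ins, and the only external dependency assumed is the official OpenAI Python client.

```python
# Illustrative sketch of a k-shot chess-puzzle evaluation loop (not the authors' code).
# Assumes the official OpenAI Python client; puzzle data and scoring are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical worked examples prepended for one-shot / two-shot prompting.
FEW_SHOT_EXAMPLES = [
    # Back-rank mate: Rd8#
    {"fen": "6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", "best_move": "Rd8#"},
    # Scholar's mate finish: Qxf7#
    {"fen": "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5Q2/PPPP1PPP/RNB1K1NR w KQkq - 0 4",
     "best_move": "Qxf7#"},
]

def build_prompt(fen: str, k: int) -> list[dict]:
    """Build a chat prompt with k worked examples before the target puzzle."""
    messages = [{"role": "system",
                 "content": "You are a chess expert. Reply with the single best "
                            "move in standard algebraic notation."}]
    for ex in FEW_SHOT_EXAMPLES[:k]:
        messages.append({"role": "user", "content": f"Position (FEN): {ex['fen']}"})
        messages.append({"role": "assistant", "content": ex["best_move"]})
    messages.append({"role": "user", "content": f"Position (FEN): {fen}"})
    return messages

def evaluate(puzzles: list[dict], k: int = 0) -> float:
    """Return exact-match accuracy on a list of {fen, best_move} puzzle records."""
    correct = 0
    for puzzle in puzzles:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=build_prompt(puzzle["fen"], k),
            temperature=0,
        )
        answer = response.choices[0].message.content.strip()
        correct += answer == puzzle["best_move"]
    return correct / len(puzzles)
```

The multiple-choice condition reported in the abstract would correspond to appending candidate moves to the final user message and scoring the selected option rather than a free-form move; the repeated-attempts condition would re-query the model on puzzles it missed.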

Keywords

Board Games, Large Language Models (LLMs), Prompting Strategies, Strategic Reasoning

Volume 14, Number 10