Figuring out why AIs get flummoxed by some games

Published: 1 month ago (March 13, 2026 at 05:47 PM EDT)

Source: Ars Technica

Your move
When winning depends on intuiting a mathematical function, AIs come up short.

Oddly, the training methods that work great for chess fail on far simpler games.

Credit: SimpleImages

With its Alpha series of game‑playing AIs, Google’s DeepMind group seemed to have found a way for its AIs to tackle any game, mastering titles like chess and Go by repeatedly playing against themselves during training.

But then some odd things happened: people started identifying Go positions that would lose against relative newcomers to the game yet easily defeat a similar Go‑playing AI.

While beating an AI at a board game may seem relatively trivial, it can help us identify the failure modes of AIs, or ways we can improve their training so they avoid developing blind spots. These insights may become critical as people rely on AI input for an ever‑growing range of problems.

A recent paper published in Machine Learning describes an entire category of games where the method used to train AlphaGo and AlphaZero fails. The games in question can be remarkably simple, as exemplified by the one the researchers studied: Nim, a two‑player game in which participants take turns removing matchsticks from a pyramid‑shaped board until one player is left without a legal move.

Impartiality

Nim involves setting up a set of rows of matchsticks, with the top row having a single match and each subsequent row containing two more matches than the one above. This creates a pyramid‑shaped board. Two players then take turns removing matchsticks from the board: they choose a row and remove anywhere from one match to the entire contents of that row. The game ends when there are no legal moves left. It’s a simple game that can easily be taught to children.
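The rules above are compact enough to write down directly. The following is a minimal sketch, not the researchers' code; the function names (`pyramid`, `legal_moves`, `apply_move`, `is_over`) are made up for illustration.

```python
def pyramid(rows):
    """Row sizes for the board described above: 1, 3, 5, ... (two more per row)."""
    return [2 * i + 1 for i in range(rows)]


def legal_moves(board):
    """Every (row_index, matches_removed) pair available on this board."""
    return [(r, k) for r, size in enumerate(board) for k in range(1, size + 1)]


def apply_move(board, move):
    """Return a new board with k matches removed from row r."""
    r, k = move
    if not 0 <= r < len(board) or not 1 <= k <= board[r]:
        raise ValueError(f"illegal move {move} on board {board}")
    new_board = list(board)
    new_board[r] -= k
    return new_board


def is_over(board):
    """The game ends when no matches (and so no legal moves) remain."""
    return all(size == 0 for size in board)


if __name__ == "__main__":
    board = pyramid(5)
    print(board)                    # row sizes of a five-row board
    print(len(legal_moves(board)))  # number of opening moves
```

Because a move is just "pick a row, pick a count," the number of legal moves on any board equals the number of matches still on it.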

It also turns out to be a critical example of an entire category of rule sets that define impartial games. These differ from games like chess, where each player has their own set of pieces; in impartial games, the two players share the same pieces and are bound by the same set of rules. Nim’s importance stems from a theorem (the Sprague‑Grundy theorem) showing that any position in an impartial game can be represented by a configuration of a Nim pyramid, meaning that if something applies to Nim, it applies to all impartial games.

One of the distinctive features of Nim and other impartial games is that, at any point in the game, it’s easy to evaluate the board and determine which player has the potential to win. In other words, you can size up the board and know that, if you play the optimal moves from then on, you will win. Doing so just requires feeding the board’s configuration into a parity function, which does the math to tell you whether you’re winning.

(Obviously, the player who is currently winning could play a suboptimal move and end up losing. The exact series of optimal moves is not determined until the end, since it depends on exactly what the opponent does.)
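The article does not spell the parity function out. For standard Nim it is usually given as the "nim-sum": the bitwise XOR of the row sizes, which amounts to checking the parity of each binary digit across all rows. The sketch below assumes the normal-play convention (the player left without a legal move loses); the function names are illustrative.

```python
from functools import reduce
from operator import xor


def nim_sum(board):
    """Bitwise XOR of all row sizes (the parity of each binary digit column)."""
    return reduce(xor, board, 0)


def player_to_move_wins(board):
    """Under normal play, the player to move can force a win iff the nim-sum is nonzero."""
    return nim_sum(board) != 0


if __name__ == "__main__":
    for board in ([1, 3, 5, 7], [1, 3, 5, 7, 9]):
        print(board, nim_sum(board), player_to_move_wins(board))
```

On a four-row pyramid the nim-sum is 0, so whoever moves first is already losing against optimal play; adding the fifth row of nine matches makes the nim-sum 9 and flips the verdict.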

The new work, done by Bei Zhou and Soren Riis, asks a simple question: what happens if you take the AlphaGo approach to training an AI to play games and try to develop a Nim‑playing AI? Put differently, they asked whether an AI could develop a representation of a parity function purely by playing itself in Nim.

When Self‑Teaching Fails

AlphaZero’s chess‑playing version was trained using only the rules of chess. By playing against itself, it learns to associate board configurations with a probability of winning. To prevent it from getting stuck in ruts, a random‑sampling element encourages continual exploration of new territory. Once the system can identify a limited set of high‑value moves, it explores deeper into the future possibilities that arise from those moves.
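To make the "explore, then go deeper on promising moves" idea concrete, here is a rough sketch of the move-selection rule described in the AlphaGo Zero and AlphaZero papers, often called PUCT. The article itself gives no formula, so treat this as background: the dictionary layout, the name `puct_select`, and the constant `c_puct=1.5` are assumptions for illustration. The real systems also add random noise to the priors at the root of the search, which is not shown here.

```python
import math


def puct_select(children, c_puct=1.5):
    """Pick the move maximizing Q + U.

    children maps move -> {"prior": P, "visits": N, "value_sum": W}.
    Q = W / N is the average outcome seen so far; U is an exploration
    bonus that is large for moves with a high prior and few visits.
    """
    total_visits = sum(c["visits"] for c in children.values())

    def score(c):
        q = c["value_sum"] / c["visits"] if c["visits"] else 0.0
        u = c_puct * c["prior"] * math.sqrt(total_visits) / (1 + c["visits"])
        return q + u

    return max(children, key=lambda move: score(children[move]))


if __name__ == "__main__":
    children = {
        "a": {"prior": 0.6, "visits": 10, "value_sum": 5.0},
        "b": {"prior": 0.4, "visits": 0, "value_sum": 0.0},
    }
    print(puct_select(children))  # the unvisited move gets the larger bonus
```

The bonus term shrinks as a move accumulates visits, so the search gradually shifts from trying new moves to deepening the ones whose average outcome looks best.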

The more games it plays, the higher the probability that it will be able to assign values to potential board configurations that could arise from a given position—although the benefits of additional games tend to taper off after a sufficient number have been played.

Nim vs. Chess

In Nim, there are only a few optimal moves for any given board configuration. If you fail to play one of them, you essentially cede control to your opponent, who can then win by playing only optimal moves. The optimal moves can be identified by evaluating a mathematical parity function.
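Finding those optimal moves is mechanical once the parity function is known. A minimal sketch, again assuming normal play and the XOR-based nim-sum (the name `winning_moves` is illustrative): a move is optimal exactly when it leaves the opponent facing a nim-sum of zero.

```python
from functools import reduce
from operator import xor


def winning_moves(board):
    """Return (row_index, matches_removed) for every move that leaves nim-sum 0."""
    total = reduce(xor, board, 0)
    moves = []
    for r, size in enumerate(board):
        target = size ^ total  # size this row must shrink to
        if target < size:      # only reachable by removing matches
            moves.append((r, size - target))
    return moves


if __name__ == "__main__":
    print(winning_moves([1, 3, 5, 7, 9]))  # five-row opening
    print(winning_moves([1, 3, 5, 7]))     # empty: every move here loses
```

On the five-row opening board, 25 moves are legal but only one of them (clearing the bottom row) keeps the win.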

Consequently, there are reasons to think that the training process that worked for chess might not be effective for Nim. The surprise is just how poorly it performed.

Five‑row Nim – Zhou and Riis found that the AI improved fairly quickly and was still improving after 500 training iterations.
Six‑row Nim – Adding just one more row caused the rate of improvement to slow dramatically.
Seven‑row Nim – Gains in performance had essentially stopped by the time the AI had played itself 500 times.

Illustrating the Problem

To highlight the issue, the researchers replaced the subsystem that suggested potential moves with one that selected moves randomly. On a seven‑row Nim board, the performance of the trained version and the randomized version was indistinguishable over 500 training games. In other words, once the board became large enough, the system was incapable of learning from observed game outcomes.

The initial state of the seven‑row configuration has three potential moves that are all consistent with an eventual win. Yet when the trained move evaluator was asked to assess all potential moves, it rated every single one as roughly equivalent.
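The count of three winning openings can be checked directly. This sketch assumes normal play and the XOR-based nim-sum; `seven_row_opening` is a made-up helper name. It enumerates every legal first move on the seven-row board and keeps those that leave a nim-sum of zero.

```python
from functools import reduce
from operator import xor


def seven_row_opening():
    """Return (all legal opening moves, winning opening moves) for seven rows."""
    board = [2 * i + 1 for i in range(7)]  # 1, 3, 5, 7, 9, 11, 13
    total = reduce(xor, board, 0)
    legal = [(r, k) for r, size in enumerate(board) for k in range(1, size + 1)]
    # Removing k from row r changes the nim-sum to total ^ old_size ^ new_size.
    winning = [(r, k) for r, k in legal
               if total ^ board[r] ^ (board[r] - k) == 0]
    return legal, winning


if __name__ == "__main__":
    legal, winning = seven_row_opening()
    print(len(legal), winning)
```

Three winning moves out of 49 legal ones means an evaluator that rates every move as roughly equal is, on this position, doing no better than picking at random.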

Conclusion

The researchers conclude that Nim requires players to learn the parity function to play effectively, and the training procedure that works so well for chess and Go is incapable of teaching it.

Not just Nim

One way to view the conclusion is that Nim (and, by extension, all impartial games) is just weird. But Zhou and Riis also found signs that similar problems could crop up in chess‑playing AIs trained in this manner. They identified several “wrong” chess moves, ones that missed a mating attack or threw an end‑game, that were initially rated highly by the AI’s board evaluator. Only because the software examined a number of additional branches several moves into the future was it able to avoid these gaffes.

For many Nim board configurations, the optimal branches that lead to a win have to be played out to the end of the game to demonstrate their value, so this sort of avoidance of a potential gaffe is much harder to manage. Chess players have also found mating combinations that require long chains of moves that chess‑playing software often misses entirely. Zhou and Riis suggest that the issue isn’t that chess lacks the same problems, but rather that Nim‑like board configurations are generally rare in chess. Presumably, similar things apply to Go, as illustrated by the odd weaknesses of AIs in that game.
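The contrast between playing branches out to the end and applying a rule can be seen in a small experiment. The sketch below is an illustration, not the paper's method: `mover_wins` decides a position by exhaustively searching every line of play to the end of the game (assuming normal play), and the loop checks that its verdict matches the XOR-based nim-sum on small boards.

```python
from functools import lru_cache, reduce
from itertools import product
from operator import xor


@lru_cache(maxsize=None)
def mover_wins(board):
    """board: sorted tuple of row sizes. True if the player to move can force a win.

    A position is winning if some move leads to a position that is losing
    for the opponent; with no moves left, the player to move has lost.
    """
    for r, size in enumerate(board):
        for k in range(1, size + 1):
            child = tuple(sorted(board[:r] + (size - k,) + board[r + 1:]))
            if not mover_wins(child):
                return True
    return False


if __name__ == "__main__":
    # Compare full search against the nim-sum rule on all boards up to 4x4x4.
    agree = all(
        mover_wins(tuple(sorted(b))) == (reduce(xor, b, 0) != 0)
        for b in product(range(5), repeat=3)
    )
    print(agree)
```

Exhaustive search is only practical here because the boards are tiny; the parity function gives the same answer in a single pass over the rows, which is the shortcut the self‑play training failed to find.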

“AlphaZero excels at learning through association,” Zhou and Riis argue, “but fails when a problem requires a form of symbolic reasoning that cannot be implicitly learned from the correlation between game states and outcomes.”

In other words, even if a game’s rules allow a simple rule for deciding what to do, we can’t expect Alpha‑style training to lead an AI to discover it. The result is what they call a tangible, catastrophic failure mode.

Why does this matter?

Lots of people are exploring the utility of AIs for math problems, which often require the kind of symbolic reasoning involved in extrapolating from a board configuration to general rules such as the parity function. While it may not be obvious how to train an AI to do that, it is useful to know which approaches will clearly not work.

Machine Learning, 2026. DOI: 10.1007/s10994-026-06996-1

About the author

John Timmer is Ars Technica’s science editor. He holds a B.A. in Biochemistry from Columbia University and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle or a scenic location for communing with his hiking boots.
