Teaching AI agents to ask better questions by playing “Battleship”

Published: June 3, 2026 at 05:00 PM EDT

Source: MIT News - AI

AI Agents and the “Battleship” Challenge

In 2026, the hype for artificial‑intelligence agents is louder than ever. These semi‑autonomous programs can think and execute well‑defined tasks in areas like customer service and software development, typically using language models (LMs). However, fields such as medical diagnosis and scientific discovery require agents to inquire about a vast range of solutions in uncertain environments—something LMs still struggle with.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences (SEAS) probed deeper into LMs to understand their main issues in high‑stakes settings. Their test: “Battleship,” a classic guessing game that has long helped cognitive scientists study how humans seek information.

Collaborative Battleship

CSAIL and SEAS scholars added a twist by reframing the game around asking and answering natural‑language questions. In their “Collaborative Battleship” game:

  • One participant acts as the captain, who asks about the locations of hidden ships.
  • The teammate plays the spotter, responding to those questions in real time.

The researchers first had over 40 humans play the game together, collecting their questions and yes‑no answers to build the BattleshipQA dataset. This dataset served as a helpful point of comparison when the team tested:

  • State‑of‑the‑art LMs (e.g., GPT‑5)
  • Smaller models (e.g., Llama 4 Scout)

Result: Without any prior training, top LMs could beat humans at Battleship—completing the game in fewer turns—while smaller systems were far less rational.

The Core Problem: Question Generation

The chief issue was that many models are simply not adept at generating useful questions. To improve this, the researchers gave each model a Monte Carlo inference strategy, which estimates how likely each candidate option is to be correct and updates those estimates after every response.

  • Outcome: AI models that can beat regular players at Battleship, regardless of scale.
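
One way to picture why some questions are more useful than others is to score each candidate question by how evenly it splits a set of sampled boards. The sketch below is only an illustration under stated assumptions, not the researchers' implementation: the 4x4 grid, the single two-cell ship, the two example questions, and the helper names (sample_boards, expected_information_gain) are all hypothetical, and the spotter is assumed to answer without error.

```python
import math
import random


def entropy(p: float) -> float:
    """Binary entropy, in bits, of a yes-probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def sample_boards(n_samples, size=4, ship_len=2, seed=0):
    """Draw random toy boards, each holding one horizontal or vertical ship."""
    rng = random.Random(seed)
    boards = []
    for _ in range(n_samples):
        if rng.random() < 0.5:  # horizontal placement
            r = rng.randrange(size)
            c = rng.randrange(size - ship_len + 1)
            cells = frozenset((r, c + i) for i in range(ship_len))
        else:  # vertical placement
            r = rng.randrange(size - ship_len + 1)
            c = rng.randrange(size)
            cells = frozenset((r + i, c) for i in range(ship_len))
        boards.append(cells)
    return boards


def expected_information_gain(question, boards):
    """For a yes/no question and an error-free spotter, the expected
    information gain equals the entropy of the predicted answer."""
    p_yes = sum(1 for board in boards if question(board)) / len(boards)
    return entropy(p_yes)


# Hypothetical candidate questions, each written as a predicate on a board.
questions = {
    "Is any ship cell in the top half?": lambda b: any(r < 2 for r, _ in b),
    "Is there a ship cell at row 0, column 0?": lambda b: (0, 0) in b,
}

boards = sample_boards(2000)
scores = {text: expected_information_gain(q, boards) for text, q in questions.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup the broad question about the top half scores higher than the narrow question about a single corner cell, because its answer is closer to a coin flip over the sampled boards.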

Llama 4 Scout’s Gains

  • Baseline: Beat humans only 8 % of the time.
  • After refinement: Achieved an 82 % win rate versus humans.
  • This efficient questioning style also let the model outpace a frontier model (GPT‑5) while operating at roughly 1 % of its cost.

Improving Spotter Accuracy

Beyond asking better questions, the team worked on answer quality:

  • While GPT‑5 was a reliable spotter, smaller systems often gave wrong answers about ship locations.
  • By converting questions into executable code that explicitly verifies answers (e.g., running a quick search of an area), models saw an average 15 % accuracy boost.

“Today’s language models are primarily optimized to answer complex queries, but it’s less clear whether they learn to ask good questions for themselves,” says MIT PhD student and CSAIL researcher Gabriel Grand SM ’23, lead author of the paper.
“Our work shows that asking informative questions depends on the ability to predict and simulate the world. We find that when we give agents access to a ‘world model,’ they ask better questions and make discoveries more efficiently.”


A Sea Change for LMs

Better Question‑Asking via Monte Carlo

  • The LMs reason about potential guesses as individual particles.
  • Particles that become more plausible with each spotter answer receive higher weight—akin to game balls that inflate or deflate each turn.
  • This adaptive approach lets the captain extract considerably more information from the spotter.
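
The reweighting described above can be sketched as a simple Bayesian update over candidate boards. This is a minimal illustration, not the paper's code: the three candidate boards, the noise parameter (the assumed chance that the spotter answers incorrectly), and the function name reweight are all hypothetical.

```python
def reweight(particles, weights, question, answer, noise=0.1):
    """Scale each particle's weight by the likelihood of the observed answer,
    then renormalize so the weights sum to 1."""
    updated = []
    for board, weight in zip(particles, weights):
        agrees = question(board) == answer
        likelihood = (1 - noise) if agrees else noise
        updated.append(weight * likelihood)
    total = sum(updated)
    if total == 0:
        raise ValueError("no particle is consistent with the answer")
    return [w / total for w in updated]


# Three hypothetical guesses about where a two-cell ship sits, as (row, col) cells.
particles = [
    frozenset({(0, 0), (0, 1)}),
    frozenset({(1, 0), (1, 1)}),
    frozenset({(0, 2), (1, 2)}),
]
weights = [1 / 3, 1 / 3, 1 / 3]


def touches_row_zero(board):
    """Hypothetical captain question: does the ship touch row 0?"""
    return any(r == 0 for r, _ in board)


# The spotter answers "yes", so boards that touch row 0 gain weight.
weights = reweight(particles, weights, touches_row_zero, answer=True)
print(weights)
```

After the "yes" answer, the two boards that touch row 0 hold most of the weight, while the board that does not keeps a small share to allow for a spotter mistake.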

Python‑Powered Spotters

The scientists leveraged the widely used programming language Python to help AI spotters:

  1. Each captain question is automatically converted into an encoded command.
  2. Example:
    • Natural language: “Is there a ship in column one that spans two rows?”
    • Encoded command: a Python routine that searches the specified area and assesses the ship’s width.
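
As a rough illustration of what such a routine might look like, here is a minimal sketch for the example question. The board encoding (a grid of letters, with "." for water), the function name, and the reading of "column one" as index 0 and "spans two rows" as exactly two rows are all assumptions made for this example; the article does not show the generated code.

```python
def ship_in_column_spanning_rows(board, column, n_rows):
    """Return True if some ship occupies exactly n_rows distinct rows
    within the given column of the board."""
    rows_by_ship = {}
    for r, row in enumerate(board):
        label = row[column]
        if label != ".":  # "." marks water in this toy encoding
            rows_by_ship.setdefault(label, set()).add(r)
    return any(len(rows) == n_rows for rows in rows_by_ship.values())


# Hypothetical 4x4 board: ship "A" is vertical in column 0,
# ship "B" is horizontal in row 1.
board = [
    list("A..."),
    list("A.BB"),
    list("...."),
    list("...."),
]

# "Is there a ship in column one that spans two rows?"
answer = ship_in_column_spanning_rows(board, column=0, n_rows=2)
print("yes" if answer else "no")
```

Because the answer comes from running the check against the board rather than from the model's own reading of the grid, the spotter's reply is only as error-prone as the translation from question to code.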

By giving the model clear, executable directions, answer correctness rose sharply:

  • GPT‑4o‑mini: ~30 % performance bump.
  • Claude 4 Opus (large model): ~8 % improvement.

“The field has seen a lot of success from ‘auto‑formalization’ strategies, in which LMs generate code to verify their solutions,” says senior author Jacob Andreas, MIT EECS associate professor and CSAIL principal investigator.
“What I find most exciting about this work is that it opens up the possibility of using these techniques to generate better solutions in the first place, by improving LMs’ exploration and information‑gathering capabilities. We are excited to scale this work up from scientific domains to applications like coding and mathematical problem‑solving.”


Let’s Play Something Else

The team also evaluated their enhanced LMs on “Guess Who?”, another classic deduction game:

Model            Baseline success rate    After tweaks
Llama 4 Scout    30 %                     72 %
GPT‑4o           62 %                     90 %

GPT‑5 (spotter): ensured high‑accuracy answers

While LMs have made promising progress in both games, there remains room for improvement. For instance, the models still…


AI Agents in “Collaborative Battleship”

Authors: Gabriel Grand (MIT CSAIL), Valerio Pepe (OpenAI researcher, recent Harvard graduate), Jacob Andreas (MIT CSAIL), Joshua Tenenbaum (MIT)
Outside commentary: Robert Hawkins (Stanford, not involved in the paper)


Overview

“GPT‑5 can beat your average ‘Battleship’ player, and gets a hair better with our methods. However, expert players are still hard to beat for all models, unlike in chess, where even top players don’t succeed against AI systems.” – Valerio Pepe

The researchers’ findings show that AI agents have untapped potential in “needle‑in‑a‑haystack” discovery: navigating a massive space of options to find a rare solution to scientific challenges. While improved information‑seeking skills would make them excellent research assistants (e.g., identifying a compound’s molecular structure), the team cautions that “Collaborative Battleship” is a somewhat simple test bed. They would like to test LMs in more complex settings, where the systems must consider far more options.


Key Points

  • Human vs. AI performance – Top models can beat average human players, but expert players remain hard to beat for all models.
  • Potential for research assistance – Better information‑seeking could help LMs act as assistants for tasks such as molecular‑structure identification.
  • Future directions
    1. Test LMs in richer environments with many more possible actions.
    2. Explore human‑AI collaboration to see whether joint performance exceeds either alone.
    3. Fine‑tune models on game simulations and increase compute to improve inference about game evolution.

“As AI systems become more agentic, the hardest problems turn out to be social ones: tracking common ground, resolving misunderstandings, and adapting to different partners over time.” – Robert Hawkins, Assistant Professor of Linguistics, Stanford University

“This work elegantly captures these phenomena in a controlled collaborative setting, and makes a compelling case that the real bottleneck for AI agents isn’t just the calculation of optimal questions, but the pragmatic reasoning needed to make the most of their answers.” – Robert Hawkins


Authors & Affiliations

  • Gabriel Grand – Lead author, MIT PhD student and CSAIL researcher
  • Valerio Pepe – OpenAI researcher, recent Harvard graduate
  • Jacob Andreas – Associate Professor, MIT CSAIL (principal investigator)
  • Joshua Tenenbaum – Professor, MIT CSAIL (principal investigator)

Funding & Support

  • MIT Siegel Family Quest for Intelligence
  • MIT‑IBM Watson AI Lab
  • FinTechAI@CSAIL initiative
  • Sloan Research Fellowship
  • Intel
  • Air Force Office of Scientific Research
  • Defense Advanced Research Projects Agency (DARPA)
  • Office of Naval Research
  • National Science Foundation (NSF)

Presentation

The paper was given as an oral presentation at the International Conference on Learning Representations (ICLR) in April.
