[Paper] Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?
Source: arXiv - 2604.21699v1
Overview
The paper investigates whether large language models (LLMs) can help robotics engineers understand the intricate, decentralized architectures built with ROS 2, the de facto standard framework for modern robot software. By systematically querying nine popular LLMs about three ROS 2 systems of increasing size, the authors show that LLMs can answer architecture‑related questions with near‑perfect accuracy, opening the door to AI‑assisted debugging and documentation for robot developers.
Key Contributions
- Automated question generation: A generic algorithm that extracts architecturally‑relevant facts from any ROS 2 system and turns them into concrete QA prompts.
- Large‑scale empirical study: 1,230 prompts were run against nine state‑of‑the‑art LLMs (including Gemini, GPT‑4, Claude, and Llama‑derived models).
- Accuracy benchmark: Overall mean correctness of 98.22 %; the best model (gemini‑2.5‑pro) achieved 100 % on every prompt.
- Explanation quality analysis: Coherence scores (0.39–0.76) and perplexity measurements reveal how well models can justify their answers.
- Practical guidance: A discussion of when and how developers can safely rely on LLMs for ROS 2 architecture comprehension.
Methodology
- Select three ROS 2 applications – small, medium, and large – whose architectures comprise varying numbers of nodes, topics, services, and parameters.
- Ground‑truth extraction – the authors run the systems, monitor all communication paths, and record the true architecture data.
- Prompt generation – using their algorithm, they automatically create questions such as “Which node publishes on topic X?” or “What is the full communication path from node A to node B?”.
- LLM evaluation – each of the nine LLMs receives every prompt (1,230 total). The answer is compared to the ground truth for binary correctness, and the model’s textual explanation is scored for coherence and perplexity.
- Statistical analysis – accuracy, error distribution (e.g., most errors occur on the largest system), and explanation quality are aggregated per model.
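The prompt‑generation step can be sketched in a few lines: walk the recorded architecture and emit a (question, ground‑truth answer) pair per fact. This is an illustrative reconstruction, not the authors' actual algorithm; the plain‑dict graph format is an assumption made for the example.

```python
# Sketch of automated QA-prompt generation from a recorded ROS 2 graph.
# Illustrative reconstruction only: the graph format (plain dicts) and the
# question templates are assumptions, not the paper's exact algorithm.

def generate_prompts(graph):
    """Turn recorded architecture facts into (question, ground_truth) pairs."""
    qa_pairs = []
    for topic, info in graph["topics"].items():
        qa_pairs.append((
            f"Which node publishes on topic {topic}?",
            sorted(info["publishers"]),
        ))
        qa_pairs.append((
            f"Which node subscribes to topic {topic}?",
            sorted(info["subscribers"]),
        ))
    for service, server in graph["services"].items():
        qa_pairs.append((f"Which node provides the service {service}?", [server]))
    return qa_pairs

# Toy ground truth, as it might be recorded by running and monitoring a system.
graph = {
    "topics": {
        "/cmd_vel": {"publishers": ["/teleop"], "subscribers": ["/base_controller"]},
    },
    "services": {"/reset_odometry": "/base_controller"},
}

prompts = generate_prompts(graph)
print(len(prompts))  # → 3
```

Because every question is derived from monitored ground truth, each LLM answer can later be checked mechanically for binary correctness.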
Results & Findings
- High overall correctness: mean accuracy of 98.22 % across all 1,230 prompts and nine models.
- Top performers:
- gemini‑2.5‑pro: 100 % accuracy.
- o3: 99.77 % accuracy.
- gemini‑2.5‑flash: 99.72 % accuracy.
- Lowest performer: gpt‑4.1 with 95 % accuracy (still impressive).
- Error concentration: 249 of the 300 incorrect answers occurred on the most complex ROS 2 system, indicating scalability pressure.
- Explanation coherence: Scores range from 0.394 (service references) to 0.762 (communication path), showing that LLMs are better at describing end‑to‑end data flows than low‑level service links.
- Perplexity: chatgpt‑4o yields the most fluent explanations (perplexity ≈ 19.6), while o4‑mini is the least fluent (≈ 103.6).
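The per‑model accuracy and error‑distribution figures above come from aggregating binary correctness records. A minimal sketch of that aggregation, using made‑up records rather than the paper's data:

```python
from collections import defaultdict

# Each record pairs a model, the target system, and whether the answer
# matched the ground truth exactly. These records are illustrative only.
records = [
    ("model-a", "small", True), ("model-a", "large", True),
    ("model-b", "small", True), ("model-b", "large", False),
]

def accuracy_per_model(records):
    """Aggregate binary correctness into a per-model accuracy score."""
    correct, total = defaultdict(int), defaultdict(int)
    for model, _system, ok in records:
        total[model] += 1
        correct[model] += ok  # bool counts as 0/1
    return {m: correct[m] / total[m] for m in total}

print(accuracy_per_model(records))  # → {'model-a': 1.0, 'model-b': 0.5}
```

The same grouping keyed on the system instead of the model yields the error‑concentration breakdown.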
Practical Implications
- Instant architecture lookup: Developers can ask an LLM "Which node subscribes to /cmd_vel?" instead of digging through ROS 2 introspection tools or source code.
- Accelerated onboarding: New team members can query the model to get quick, human‑readable overviews of a robot's node graph, reducing the learning curve.
- AI‑augmented debugging: When a communication failure occurs, an LLM can suggest likely missing publishers/subscribers or mis‑configured QoS settings based on the recorded architecture.
- Documentation generation: By feeding the LLM the set of generated questions, teams can auto‑create up‑to‑date architecture docs that stay in sync with code changes.
- Tool integration: The question‑generation algorithm can be packaged as a ROS 2 plugin, feeding prompts directly to an LLM API alongside introspection commands such as ros2 topic list or ros2 service list.
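The tool‑integration idea above can be sketched as: capture introspection output and fold it into an LLM prompt. The prompt wording is an assumption, and the introspection output (which a real plugin would obtain from `ros2 node list` and `ros2 topic list`) is stubbed here so the sketch stays self‑contained.

```python
def build_architecture_prompt(node_list, topic_list, question):
    """Fold introspection output into a single context-plus-question prompt."""
    context = (
        "You are answering questions about a running ROS 2 system.\n"
        f"Nodes:\n{node_list}\n"
        f"Topics:\n{topic_list}\n"
    )
    return context + f"Question: {question}"

# Stubbed introspection output; a real plugin would shell out to the ros2 CLI.
nodes = "/teleop\n/base_controller"
topics = "/cmd_vel\n/odom"
prompt = build_architecture_prompt(
    nodes, topics, "Which node subscribes to /cmd_vel?"
)
print("/cmd_vel" in prompt)  # → True
```

Grounding the prompt in live introspection output, rather than relying on the model's training data, is what keeps the answers tied to the actual running system.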
Limitations & Future Work
- Scalability: Accuracy drops modestly on the largest system; future work should test even bigger fleets and explore hierarchical prompting.
- Explainability variance: Coherence scores differ across question types, indicating that LLMs sometimes struggle with low‑level service relationships.
- Model‑specific quirks: Performance is not uniform—developers need to pick the right LLM (e.g., Gemini‑2.5‑pro) and be aware of version drift.
- Safety & correctness guarantees: The study is purely empirical; integrating LLMs into safety‑critical robot control loops will require formal verification or fallback mechanisms.
- Extending beyond ROS 2: Applying the same pipeline to other middleware (e.g., DDS directly, ROS 1, or custom robotics stacks) is an open research direction.
Authors
- Laura Duits
- Bouazza El Moutaouakil
- Ivano Malavolta
Paper Information
- arXiv ID: 2604.21699v1
- Categories: cs.SE
- Published: April 23, 2026