[Paper] Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?
Source: arXiv - 2604.21699v1
Overview
The paper investigates whether large language models (LLMs) can help robotics engineers understand the intricate, decentralized architectures built with ROS 2, the de facto standard framework for modern robot software. By systematically querying nine popular LLMs about three ROS 2 systems of increasing size, the authors show that LLMs can answer architecture‑related questions with near‑perfect accuracy, opening the door to AI‑assisted debugging and documentation for robot developers.
Key Contributions
- Automated question generation: A generic algorithm that extracts architecturally‑relevant facts from any ROS 2 system and turns them into concrete QA prompts.
- Large‑scale empirical study: 1,230 prompts were run against nine state‑of‑the‑art LLMs (including Gemini, GPT‑4, Claude, and Llama‑derived models).
- Accuracy benchmark: Overall mean correctness of 98.22 %; the best model (gemini‑2.5‑pro) achieved 100 % on every prompt.
- Explanation quality analysis: Coherence scores (0.39–0.76) and perplexity measurements reveal how well models can justify their answers.
- Practical guidance: A discussion of when and how developers can safely rely on LLMs for ROS 2 architecture comprehension.
Methodology
- Select three ROS 2 applications – small, medium, and large – whose architectures comprise varying numbers of nodes, topics, services, and parameters.
- Ground‑truth extraction – the authors run the systems, monitor all communication paths, and record the true architecture data.
- Prompt generation – using their algorithm, they automatically create questions such as “Which node publishes on topic X?” or “What is the full communication path from node A to node B?”.
- LLM evaluation – each of the nine LLMs receives every prompt (1,230 total). The answer is compared to the ground truth for binary correctness, and the model’s textual explanation is scored for coherence and perplexity.
- Statistical analysis – accuracy, error distribution (e.g., most errors occur on the largest system), and explanation quality are aggregated per model.
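The prompt‑generation step can be sketched in a few lines: walk the recorded architecture and emit a (question, ground‑truth answer) pair per fact. This is an illustrative reconstruction, not the authors' actual algorithm; the plain‑dict graph format is an assumption made for the example.

```python
# Sketch of automated QA-prompt generation from a recorded ROS 2 graph.
# Illustrative reconstruction only: the graph format (plain dicts) and the
# question templates are assumptions, not the paper's exact algorithm.

def generate_prompts(graph):
    """Turn recorded architecture facts into (question, ground_truth) pairs."""
    qa_pairs = []
    for topic, info in graph["topics"].items():
        qa_pairs.append((
            f"Which node publishes on topic {topic}?",
            sorted(info["publishers"]),
        ))
        qa_pairs.append((
            f"Which node subscribes to topic {topic}?",
            sorted(info["subscribers"]),
        ))
    for service, server in graph["services"].items():
        qa_pairs.append((f"Which node provides the service {service}?", [server]))
    return qa_pairs

# Toy ground truth, as it might be recorded by running and monitoring a system.
graph = {
    "topics": {
        "/cmd_vel": {"publishers": ["/teleop"], "subscribers": ["/base_controller"]},
    },
    "services": {"/reset_odometry": "/base_controller"},
}

prompts = generate_prompts(graph)
print(len(prompts))  # → 3
```

Because every question is derived from monitored ground truth, each LLM answer can later be checked mechanically for binary correctness.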
Results & Findings
- High overall correctness: mean accuracy of 98.22 % across all 1,230 prompts and nine models.
- Top performers:
- gemini‑2.5‑pro: 100 % accuracy.
- o3: 99.77 % accuracy.
- gemini‑2.5‑flash: 99.72 % accuracy.
- Lowest performer: gpt‑4.1 with 95 % accuracy (still impressive).
- Error concentration: 249 of the 300 incorrect answers occurred on the most complex ROS 2 system, indicating scalability pressure.
- Explanation coherence: Scores range from 0.394 (service references) to 0.762 (communication path), showing that LLMs are better at describing end‑to‑end data flows than low‑level service links.
- Perplexity: chatgpt‑4o yields the most fluent explanations (perplexity ≈ 19.6), while o4‑mini is the least fluent (≈ 103.6).
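The per‑model accuracy and error‑distribution figures above come from aggregating binary correctness records. A minimal sketch of that aggregation, using made‑up records rather than the paper's data:

```python
from collections import defaultdict

# Each record pairs a model, the target system, and whether the answer
# matched the ground truth exactly. These records are illustrative only.
records = [
    ("model-a", "small", True), ("model-a", "large", True),
    ("model-b", "small", True), ("model-b", "large", False),
]

def accuracy_per_model(records):
    """Aggregate binary correctness into a per-model accuracy score."""
    correct, total = defaultdict(int), defaultdict(int)
    for model, _system, ok in records:
        total[model] += 1
        correct[model] += ok  # bool counts as 0/1
    return {m: correct[m] / total[m] for m in total}

print(accuracy_per_model(records))  # → {'model-a': 1.0, 'model-b': 0.5}
```

The same grouping keyed on the system instead of the model yields the error‑concentration breakdown.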
Practical Implications
- Instant architecture lookup: Developers can ask an LLM "Which node subscribes to /cmd_vel?" instead of digging through ROS 2 introspection tools or source code.
- Accelerated onboarding: New team members can query the model to get quick, human‑readable overviews of a robot's node graph, reducing the learning curve.
- AI‑augmented debugging: When a communication failure occurs, an LLM can suggest likely missing publishers/subscribers or mis‑configured QoS settings based on the recorded architecture.
- Documentation generation: By feeding the LLM the set of generated questions, teams can auto‑create up‑to‑date architecture docs that stay in sync with code changes.
- Tool integration: The question‑generation algorithm can be packaged as a ROS 2 plugin, feeding prompts directly to an LLM API alongside introspection commands such as ros2 topic list or ros2 service list.
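The tool‑integration idea above can be sketched as: capture introspection output and fold it into an LLM prompt. The prompt wording is an assumption, and the introspection output (which a real plugin would obtain from `ros2 node list` and `ros2 topic list`) is stubbed here so the sketch stays self‑contained.

```python
def build_architecture_prompt(node_list, topic_list, question):
    """Fold introspection output into a single context-plus-question prompt."""
    context = (
        "You are answering questions about a running ROS 2 system.\n"
        f"Nodes:\n{node_list}\n"
        f"Topics:\n{topic_list}\n"
    )
    return context + f"Question: {question}"

# Stubbed introspection output; a real plugin would shell out to the ros2 CLI.
nodes = "/teleop\n/base_controller"
topics = "/cmd_vel\n/odom"
prompt = build_architecture_prompt(
    nodes, topics, "Which node subscribes to /cmd_vel?"
)
print("/cmd_vel" in prompt)  # → True
```

Grounding the prompt in live introspection output, rather than relying on the model's training data, is what keeps the answers tied to the actual running system.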
Limitations & Future Work
- Scalability: Accuracy drops modestly on the largest system; future work should test even bigger fleets and explore hierarchical prompting.
- Explainability variance: Coherence scores differ across question types, indicating that LLMs sometimes struggle with low‑level service relationships.
- Model‑specific quirks: Performance is not uniform—developers need to pick the right LLM (e.g., Gemini‑2.5‑pro) and be aware of version drift.
- Safety & correctness guarantees: The study is purely empirical; integrating LLMs into safety‑critical robot control loops will require formal verification or fallback mechanisms.
- Extending beyond ROS 2: Applying the same pipeline to other middleware (e.g., DDS directly, ROS 1, or custom robotics stacks) is an open research direction.
Authors
- Laura Duits
- Bouazza El Moutaouakil
- Ivano Malavolta
Paper Information
- arXiv ID: 2604.21699v1
- Categories: cs.SE
- Published: April 23, 2026