[Paper] ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Source: arXiv - 2602.24172v1
Overview
The paper introduces ArgLLM‑App, a web‑based platform that lets large language models (LLMs) reason through binary decisions using structured argumentative techniques. By turning the model’s internal “thought process” into a visual, contestable argument, the system aims to make AI‑driven decisions both explainable and correctable for end‑users.
Key Contributions
- End‑to‑end ArgLLM pipeline: Integration of LLM‑generated arguments with a visual front‑end for binary decision tasks.
- Interactive explanation UI: Users can explore premises, inference steps, and conclusions, and flag or edit any part they consider erroneous.
- Modular architecture: Plug‑and‑play components for prompt templates, argument extraction, and external knowledge retrieval from trusted sources (e.g., APIs, knowledge bases).
- Open‑source release: The full system (code, Docker images, and a live demo) is publicly available, encouraging reproducibility and community extensions.
- Empirical validation: User studies showing that participants can detect and correct reasoning mistakes significantly faster than with raw LLM outputs.
Methodology
- Argument Generation – A base LLM (e.g., GPT‑4) receives a binary query (“Is X true?”) together with a prompt template that asks it to produce a structured argument: a list of premises, a set of inference rules, and a final claim.
- Argument Extraction – The raw text is parsed into a graph‑like representation (nodes = premises, edges = logical relations). This step uses lightweight rule‑based parsers, keeping the pipeline fast and interpretable.
- External Knowledge Integration – When a premise references factual data, the system can automatically query a trusted API (e.g., a weather service or a corporate DB) and inject the verified value into the argument graph.
- Visualization & Interaction – The argument graph is rendered in the browser. Users can hover over nodes to see source citations, click to edit a premise, or drag‑and‑drop to reorder steps. A “challenge” button lets users submit a counter‑argument, which the LLM re‑evaluates in real time.
- Evaluation – Two user studies were conducted: (a) a contestation task where participants identified deliberately injected reasoning errors, and (b) a decision‑making task comparing ArgLLM‑App’s recommendations against a black‑box LLM baseline.
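The paper does not reproduce its prompt template or parser code, but the described pipeline (LLM emits premises, inference rules, and a claim; a lightweight rule-based parser turns the text into a graph) can be sketched as follows. The `P<n>:` / `R<n>:` / `C:` line format and the `ArgumentGraph` type are illustrative assumptions, not the authors' actual implementation:

```python
from dataclasses import dataclass


@dataclass
class ArgumentGraph:
    premises: dict   # premise id -> premise text
    edges: list      # (source id, target id) support relations
    claim: str       # the final binary claim


def parse_argument(raw: str) -> ArgumentGraph:
    """Rule-based parse of a structured LLM response of the assumed form:

        P1: <premise>
        P2: <premise>
        R1: P1, P2 -> C
        C: <claim>
    """
    premises, edges, claim = {}, [], ""
    for line in raw.strip().splitlines():
        line = line.strip()
        if line.startswith("P"):                 # premise line
            pid, text = line.split(":", 1)
            premises[pid.strip()] = text.strip()
        elif line.startswith("R"):               # inference rule: lhs -> rhs
            _, rule = line.split(":", 1)
            lhs, rhs = rule.split("->")
            for src in lhs.split(","):
                edges.append((src.strip(), rhs.strip()))
        elif line.startswith("C:"):              # final claim
            claim = line.split(":", 1)[1].strip()
    return ArgumentGraph(premises, edges, claim)


# Example run on a hand-written response in the assumed format.
graph = parse_argument("""P1: The forecast predicts rain.
P2: Rain reduces attendance.
R1: P1, P2 -> C
C: The event should be moved indoors.""")
assert graph.claim == "The event should be moved indoors."
assert ("P1", "C") in graph.edges and ("P2", "C") in graph.edges
```

Keeping the parser this simple is what makes the extraction step fast and inspectable: every node and edge in the rendered graph maps one-to-one to a line of the LLM's output.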
Results & Findings
| Metric | ArgLLM‑App | Black‑box LLM |
|---|---|---|
| Error detection rate (injected mistakes correctly flagged) | 84% | 46% |
| Time to contest | 1.8 min per case | 3.4 min |
| Decision accuracy (binary tasks) | 78% | 71% |
| User confidence (Likert 1–5) | 4.3 | 3.6 |
The visual argument representation dramatically improved users’ ability to spot logical flaws and boosted overall decision quality. Moreover, the modular knowledge‑retrieval component reduced factual errors by 32% compared with a purely generative baseline.
Practical Implications
- Explainable AI for compliance – Industries with strict audit requirements (finance, healthcare, legal) can embed ArgLLM‑App to generate decision logs that are both human‑readable and traceable to source data.
- Human‑in‑the‑loop workflows – Customer‑support bots, code‑review assistants, or policy‑advice tools can surface their reasoning, letting operators intervene before a final action is taken.
- Rapid prototyping of argument‑driven agents – The modular design lets developers swap in domain‑specific knowledge sources (e.g., internal APIs) without rewriting the whole prompt stack.
- Educational tools – Teaching critical thinking or AI literacy becomes easier when students can see exactly how an LLM builds an argument and where it might go wrong.
Limitations & Future Work
- Scalability of visualization – The current UI handles arguments with up to ~15 premises comfortably; larger, multi‑step debates become cluttered.
- Reliance on prompt engineering – Argument quality still hinges on well‑crafted templates; automatic prompt optimisation is an open challenge.
- Binary focus – Extending the framework to multi‑class or open‑ended decisions will require richer argument structures and evaluation metrics.
- User study scope – Experiments were limited to 30 participants and a narrow set of domains; broader field trials are needed to confirm generalisability.
Future work outlined by the authors includes integrating chain‑of‑thought prompting for deeper reasoning, adding collaborative argument editing (multiple users), and exploring automated consistency checks across the argument graph.
If you want to try ArgLLM‑App yourself, head over to https://argllm.app and check out the demo video at https://youtu.be/vzwlGOr0sPM.
Authors
- Adam Dejl
- Deniz Gorur
- Francesca Toni
Paper Information
- arXiv ID: 2602.24172v1
- Categories: cs.CL, cs.AI
- Published: February 27, 2026