[Paper] ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
Source: arXiv - 2602.24172v1
Overview
The paper introduces ArgLLM‑App, a web‑based platform that lets large language models (LLMs) reason through binary decisions using structured argumentative techniques. By turning the model’s internal “thought process” into a visual, contestable argument, the system aims to make AI‑driven decisions both explainable and correctable for end‑users.
Key Contributions
- End‑to‑end ArgLLM pipeline: Integration of LLM‑generated arguments with a visual front‑end for binary decision tasks.
- Interactive explanation UI: Users can explore premises, inference steps, and conclusions, and flag or edit any part they consider erroneous.
- Modular architecture: Plug‑and‑play components for prompt templates, argument extraction, and external knowledge retrieval from trusted sources (e.g., APIs, knowledge bases).
- Open‑source release: The full system (code, Docker images, and a live demo) is publicly available, encouraging reproducibility and community extensions.
- Empirical validation: User studies showing that participants can detect and correct reasoning mistakes significantly faster than with raw LLM outputs.
Methodology
- Argument Generation – A base LLM (e.g., GPT‑4) receives a binary query (“Is X true?”) together with a prompt template that asks it to produce a structured argument: a list of premises, a set of inference rules, and a final claim.
- Argument Extraction – The raw text is parsed into a graph‑like representation (nodes = premises, edges = logical relations). This step uses lightweight rule‑based parsers, keeping the pipeline fast and interpretable.
- External Knowledge Integration – When a premise references factual data, the system can automatically query a trusted API (e.g., a weather service or a corporate DB) and inject the verified value into the argument graph.
- Visualization & Interaction – The argument graph is rendered in the browser. Users can hover over nodes to see source citations, click to edit a premise, or drag‑and‑drop to reorder steps. A “challenge” button lets users submit a counter‑argument, which the LLM re‑evaluates in real time.
- Evaluation – Two user studies were conducted: (a) a contestation task where participants identified deliberately injected reasoning errors, and (b) a decision‑making task comparing ArgLLM‑App’s recommendations against a black‑box LLM baseline.
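The paper does not reproduce its prompt template or parser code, but the described pipeline (LLM emits premises, inference rules, and a claim; a lightweight rule-based parser turns the text into a graph) can be sketched as follows. The `P<n>:` / `R<n>:` / `C:` line format and the `ArgumentGraph` type are illustrative assumptions, not the authors' actual implementation:

```python
from dataclasses import dataclass


@dataclass
class ArgumentGraph:
    premises: dict   # premise id -> premise text
    edges: list      # (source id, target id) support relations
    claim: str       # the final binary claim


def parse_argument(raw: str) -> ArgumentGraph:
    """Rule-based parse of a structured LLM response of the assumed form:

        P1: <premise>
        P2: <premise>
        R1: P1, P2 -> C
        C: <claim>
    """
    premises, edges, claim = {}, [], ""
    for line in raw.strip().splitlines():
        line = line.strip()
        if line.startswith("P"):                 # premise line
            pid, text = line.split(":", 1)
            premises[pid.strip()] = text.strip()
        elif line.startswith("R"):               # inference rule: lhs -> rhs
            _, rule = line.split(":", 1)
            lhs, rhs = rule.split("->")
            for src in lhs.split(","):
                edges.append((src.strip(), rhs.strip()))
        elif line.startswith("C:"):              # final claim
            claim = line.split(":", 1)[1].strip()
    return ArgumentGraph(premises, edges, claim)


# Example run on a hand-written response in the assumed format.
graph = parse_argument("""P1: The forecast predicts rain.
P2: Rain reduces attendance.
R1: P1, P2 -> C
C: The event should be moved indoors.""")
assert graph.claim == "The event should be moved indoors."
assert ("P1", "C") in graph.edges and ("P2", "C") in graph.edges
```

Keeping the parser this simple is what makes the extraction step fast and inspectable: every node and edge in the rendered graph maps one-to-one to a line of the LLM's output.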
Results & Findings
| Metric | ArgLLM‑App | Black‑box LLM |
|---|---|---|
| Error detection rate (injected mistakes correctly flagged) | 84% | 46% |
| Time to contest | 1.8 min per case | 3.4 min |
| Decision accuracy (binary tasks) | 78% | 71% |
| User confidence (Likert 1–5) | 4.3 | 3.6 |
The visual argument representation dramatically improved users’ ability to spot logical flaws and boosted overall decision quality. Moreover, the modular knowledge‑retrieval component reduced factual errors by 32% compared with a purely generative baseline.
Practical Implications
- Explainable AI for compliance – Industries with strict audit requirements (finance, healthcare, legal) can embed ArgLLM‑App to generate decision logs that are both human‑readable and traceable to source data.
- Human‑in‑the‑loop workflows – Customer‑support bots, code‑review assistants, or policy‑advice tools can surface their reasoning, letting operators intervene before a final action is taken.
- Rapid prototyping of argument‑driven agents – The modular design lets developers swap in domain‑specific knowledge sources (e.g., internal APIs) without rewriting the whole prompt stack.
- Educational tools – Teaching critical thinking or AI literacy becomes easier when students can see exactly how an LLM builds an argument and where it might go wrong.
Limitations & Future Work
- Scalability of visualization – The current UI handles arguments with up to ~15 premises comfortably; larger, multi‑step debates become cluttered.
- Reliance on prompt engineering – Argument quality still hinges on well‑crafted templates; automatic prompt optimisation is an open challenge.
- Binary focus – Extending the framework to multi‑class or open‑ended decisions will require richer argument structures and evaluation metrics.
- User study scope – Experiments were limited to 30 participants and a narrow set of domains; broader field trials are needed to confirm generalisability.
Future work outlined by the authors includes integrating chain‑of‑thought prompting for deeper reasoning, adding collaborative argument editing (multiple users), and exploring automated consistency checks across the argument graph.
If you want to try ArgLLM‑App yourself, head over to https://argllm.app and check out the demo video at https://youtu.be/vzwlGOr0sPM.
Authors
- Adam Dejl
- Deniz Gorur
- Francesca Toni
Paper Information
- arXiv ID: 2602.24172v1
- Categories: cs.CL, cs.AI
- Published: February 27, 2026