[Paper] AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
Source: arXiv - 2602.06008v1
Overview
The paper AgenticPay introduces a new benchmark and simulation platform that lets large‑language‑model (LLM) agents negotiate buyer‑seller deals using natural language instead of simple numeric bids. By modeling realistic market constraints—private budgets, product‑specific valuations, and multi‑round dialogue—it gives researchers a principled way to evaluate how well LLM‑powered agents can conduct economic transactions.
Key Contributions
- AgenticPay benchmark: A comprehensive suite of >110 negotiation tasks covering bilateral bargaining, multi‑buyer/multi‑seller markets, and varied product types.
- Simulation framework: Open‑source environment that enforces private constraints, tracks feasibility, efficiency, and overall welfare, and extracts structured actions from free‑form dialogue.
- Evaluation metrics: Clear quantitative measures for (i) feasibility (agreements respect all private constraints), (ii) efficiency (total surplus captured), and (iii) welfare (fairness across participants).
- Empirical baseline: State‑of‑the‑art proprietary LLMs (e.g., GPT‑4) and open‑weight models (Llama‑2, Mistral) are benchmarked, exposing a sizable performance gap in strategic, long‑horizon negotiations.
- Open resources: Dataset, code, and evaluation scripts released under an MIT‑style license, enabling reproducible research and rapid prototyping.
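The three evaluation metrics can be illustrated with a toy bilateral setting. The function names below and the min/max fairness ratio used for welfare are assumptions for illustration only, not the paper's exact formulas:

```python
def feasible(price, buyer_budget, seller_cost):
    """An agreement is feasible if it respects both private constraints:
    the buyer can afford the price and the seller covers cost."""
    return seller_cost <= price <= buyer_budget

def efficiency(outcomes):
    """Fraction of the optimal total surplus actually captured across runs.
    outcomes: list of (buyer_value, seller_cost, deal_closed) tuples;
    a failed negotiation captures zero surplus."""
    realized = sum(v - c for v, c, closed in outcomes if closed)
    optimal = sum(v - c for v, c, _ in outcomes)
    return realized / optimal

def welfare_fairness(price, buyer_value, seller_cost):
    """Fairness as the min/max ratio of the two parties' surplus shares."""
    buyer_s = buyer_value - price
    seller_s = price - seller_cost
    lo, hi = sorted([buyer_s, seller_s])
    return lo / hi if hi > 0 else 0.0

# Buyer values the item at 100, seller's cost is 60, agreed price is 75.
print(feasible(75, buyer_budget=90, seller_cost=60))   # True
print(welfare_fairness(75, 100, 60))                   # surplus 25 vs 15 -> 0.6
```

A lopsided price drives the fairness ratio toward 0, while an even surplus split drives it toward 1, matching the 0-to-1 welfare column reported in the results table.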
Methodology
- Market Modeling – Each agent (buyer or seller) receives a private “type”: a budget, a cost curve, and a valuation function that depends on product attributes (e.g., quality, delivery time).
- Dialogue Engine – Agents communicate via multi‑turn natural‑language messages. The framework parses these messages into structured intents (offer, counter‑offer, accept, reject, ask‑question) using a lightweight extraction model.
- Negotiation Protocol – A turn‑based loop runs until an agreement is reached or a maximum number of rounds is hit. After each turn, the simulator checks feasibility (no budget overruns, price ≥ cost) and updates the state.
- Task Generation – Over 110 scenarios are procedurally generated by varying numbers of participants, product dimensions, and constraint tightness, ensuring diverse strategic challenges.
- Evaluation – For each run, the system logs the final price, surplus distribution, and dialogue length, then computes the three core metrics (feasibility, efficiency, welfare).
The whole pipeline is packaged as a Python library with simple APIs (`run_negotiation(agent_policy, task_id)`) so developers can plug in any LLM or custom policy.
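A plug‑in policy might look like the sketch below. Only the `run_negotiation(agent_policy, task_id)` entry point is named in the paper; the local stand‑in for it, the policy signature, and the result fields are assumptions so the example runs without the package:

```python
# Local stand-in for the library's entry point so this sketch is self-contained;
# the real run_negotiation(agent_policy, task_id) is provided by the framework.
def run_negotiation(agent_policy, task_id):
    # A real task would drive a full multi-turn dialogue; here we pass one offer.
    state = {"last_offer": 72.0}
    return {"task_id": task_id, "reply": agent_policy(state)}

def my_policy(state):
    """Custom policy: accept offers under 75, otherwise counter at 70."""
    if state["last_offer"] < 75:
        return {"kind": "accept"}
    return {"kind": "counter-offer", "price": 70.0}

result = run_negotiation(my_policy, task_id="bilateral-001")
print(result["reply"])  # {'kind': 'accept'}
```

Because the policy is just a callable over dialogue state, it can wrap an LLM call, a rule‑based heuristic, or a learned model interchangeably.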
Results & Findings
| Model | Feasibility | Efficiency (% of optimal surplus) | Welfare (fairness) |
|---|---|---|---|
| GPT‑4 (proprietary) | 92 % | 68 % | 0.71 |
| Llama‑2‑70B (open) | 78 % | 45 % | 0.58 |
| Mistral‑7B | 71 % | 38 % | 0.53 |
| Baseline rule‑based | 85 % | 30 % | 0.49 |
- Strategic depth matters: Even the strongest LLMs struggle with long‑horizon planning, often conceding too early or failing to detect hidden constraints.
- Prompt engineering helps but isn’t enough: Adding explicit “budget reminder” prompts improves feasibility modestly (≈+5 %) but does little for efficiency.
- Many‑to‑many markets amplify difficulty: When three or more agents interact, success rates drop sharply, highlighting coordination challenges.
Overall, the study shows that current LLM agents are far from being reliable autonomous negotiators in realistic commerce settings.
Practical Implications
- E‑commerce bots: Companies looking to deploy AI sales assistants can use AgenticPay to stress‑test their dialogue policies before going live, ensuring bots respect pricing constraints and avoid unfavorable deals.
- Supply‑chain automation: Multi‑agent negotiation is a core component of automated procurement; the benchmark offers a sandbox for prototyping negotiation strategies that balance cost savings with supplier fairness.
- Marketplace platforms: Peer‑to‑peer platforms (e.g., freelance marketplaces) could integrate LLM negotiators to facilitate price discovery, but the current performance gap suggests a hybrid human‑in‑the‑loop approach is still needed.
- Regulatory compliance: By quantifying welfare and feasibility, firms can audit AI‑driven negotiations for fairness and legal compliance (e.g., anti‑price‑gouging).
- Developer tooling: The open‑source framework can be wrapped into CI pipelines, allowing teams to benchmark new LLM fine‑tunes or reinforcement‑learning‑from‑human‑feedback (RLHF) policies against a standardized set of economic tasks.
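One way to wire the benchmark into a CI pipeline, as the last bullet suggests, is a regression gate that fails the build when aggregate metrics drop below thresholds. The metric field names and threshold values here are illustrative, not part of the released tooling:

```python
def ci_gate(run_metrics, min_feasibility=0.85, min_efficiency=0.40):
    """Return (passed, failures) for a batch of benchmark runs.

    run_metrics: list of dicts with 'feasibility' (bool: did the run produce
    a constraint-respecting agreement?) and 'efficiency' (0..1 surplus share).
    """
    n = len(run_metrics)
    feasibility = sum(m["feasibility"] for m in run_metrics) / n
    efficiency = sum(m["efficiency"] for m in run_metrics) / n
    failures = []
    if feasibility < min_feasibility:
        failures.append(f"feasibility {feasibility:.2f} < {min_feasibility}")
    if efficiency < min_efficiency:
        failures.append(f"efficiency {efficiency:.2f} < {min_efficiency}")
    return (not failures, failures)

runs = [{"feasibility": True, "efficiency": 0.6},
        {"feasibility": True, "efficiency": 0.5},
        {"feasibility": False, "efficiency": 0.0}]
passed, why = ci_gate(runs)
print(passed, why)  # fails both gates for this small batch
```

Run as a test step (e.g., under pytest), this turns the benchmark into a standing regression check for each new fine‑tune or RLHF policy.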
Limitations & Future Work
- Synthetic environment: The market scenarios are generated procedurally and may not capture all nuances of real‑world contracts (legal clauses, multi‑modal assets).
- Action extraction reliance: The current parser assumes relatively clean language; noisy or adversarial utterances could break the structured intent extraction.
- Scalability: Benchmarks currently cap at modest numbers of participants (≤5); scaling to large marketplaces will require more efficient simulation and possibly hierarchical negotiation protocols.
- Strategic learning: The paper highlights the need for agents that can plan over many turns; future work could explore multi‑agent reinforcement learning, game‑theoretic reasoning, or hybrid symbolic‑neural approaches.
By exposing these gaps, AgenticPay sets a clear research agenda for building truly agentic, language‑driven commerce systems that developers can eventually trust in production.
Authors
- Xianyang Liu
- Shangding Gu
- Dawn Song
Paper Information
- arXiv ID: 2602.06008v1
- Categories: cs.AI, cs.LG
- Published: February 5, 2026