[Paper] AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions

Published: June 7, 2026 at 05:39 AM EDT

Source: arXiv - 2606.08539v1

Overview

AI agents increasingly take consequential actions (shell commands, cloud operations, and arbitrary tool calls), so a trust layer must decide, per action, whether to allow, warn, block, or escalate. We argue that the right way to reason about such a layer is by threat type. Lexical (fixed-signature) threats, where the danger lives in a stable token, are decidable by deterministic rules; semantic (intent-dependent) threats, where a benign and a malicious action share the same surface form, are out of reach for rules by construction.

We make this concrete with a negative proof: a determined, hand-authored cloud rule pack lifts held-out accuracy only from 48% to 56% overall and moves the semantic categories by 0pp (data_db 29 to 29, observability 59 to 59, supply_chain 50 to 50), while a strong LLM judge carries exactly those categories. We give the judge a self-learning capability: on a corpus that is mainly semantic attacks it nearly doubles rule accuracy (48% to 83.6-85.2%) with near-zero false blocks, and this holds across two model providers.

We turn this into a self-improving dual-store system. On lexical threats, the judge distills a growing deterministic rule floor, which gets cheaper over time. On semantic threats, it feeds a guarded RAG memory: a plain verdict cache fails (surface twins collapse to ~58%), so a corroboration guard lifts semantic accuracy by 13pp (70 to 84). The result is what sets AgentTrust v2 apart from its static v1 predecessor: a trust layer that self-evolves from its own stream of decisions, becoming cheaper on the lexical class (it distills its own rules) and smarter on the semantic class (it accrues guarded precedent), while never hard-blocking a benign action. An end-to-end online replay shows the judge-call rate falling (50% to 44%) and judge-domain accuracy rising (71% to 80%), with 0 benign hard-blocks across 45,000 actions.
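To make the dual-store idea more tangible, here is a minimal Python sketch of how such a layer could route a single action: deterministic rules first, then guarded precedent, then the judge. It is not the paper's implementation. The names TrustLayer, Precedent, add_rule, and min_corroboration are hypothetical, the judge is a plain callable standing in for an LLM, and the guard shown here (reuse a remembered verdict only when several precedents with the same action and the same context agree) is only one plausible reading of "corroboration guard".

```python
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass
class Precedent:
    """One remembered judge decision (hypothetical schema)."""
    action: str
    context: str
    verdict: Verdict


@dataclass
class TrustLayer:
    # Stand-in for the LLM judge: (action, context) -> Verdict.
    judge: Callable[[str, str], Verdict]
    # Deterministic rule floor for lexical threats.
    rules: list = field(default_factory=list)
    # Precedent memory for semantic threats.
    memory: list = field(default_factory=list)
    # Assumption: how many agreeing precedents are needed before reuse.
    min_corroboration: int = 2
    judge_calls: int = 0

    def add_rule(self, pattern: str, verdict: Verdict) -> None:
        """Add a fixed-signature rule, e.g. one distilled from judge verdicts."""
        self.rules.append((re.compile(pattern), verdict))

    def decide(self, action: str, context: str) -> Verdict:
        # 1. Rule floor: cheap and deterministic, keyed on stable tokens.
        for pattern, verdict in self.rules:
            if pattern.search(action):
                return verdict

        # 2. Guarded memory: matching on action text alone would confuse
        #    surface twins, so require matching context and agreement.
        matches = [p for p in self.memory
                   if p.action == action and p.context == context]
        verdicts = {p.verdict for p in matches}
        if len(matches) >= self.min_corroboration and len(verdicts) == 1:
            return next(iter(verdicts))

        # 3. Fall back to the judge and remember its decision.
        self.judge_calls += 1
        verdict = self.judge(action, context)
        self.memory.append(Precedent(action, context, verdict))
        return verdict
```

In this sketch the judge-call counter drops as rules and corroborated precedents accumulate, which mirrors the direction of the replay result reported above, though the numbers in the paper come from the authors' system, not from this toy.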

Authors

Chenglin Yang

Paper Information

arXiv ID: 2606.08539v1
Categories: cs.AI
Published: June 7, 2026
