[Paper] Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Published: June 11, 2026 at 01:58 PM EDT

Source: arXiv - 2606.13668v1

Overview

As the capabilities of Large Language Models (LLMs) grow, there has been an increasing push to curate high-quality datasets by filtering the samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset condition a model to generate certain outputs. For example, one might want to know which training samples are the source of toxic behavior in the trained LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods in this family are effective, they lack the processing speed and storage compactness needed to be practical on large datasets. We propose a method, Influcoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.
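To make the idea of an influence ranking concrete, here is a minimal sketch of a first-order, gradient-similarity influence score on a toy linear model. It is not Influcoder and not the classical influence function (which also involves an inverse Hessian); the model, the squared loss, and all function names are assumptions chosen for illustration. Each training sample is scored by the dot product between its loss gradient and the query's loss gradient, and the samples are then ranked by that score.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features x, target y)


def grad_squared_loss(w: Sequence[float], x: Sequence[float], y: float) -> List[float]:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (toy model, assumed)."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def influence_scores(w: Sequence[float], train: Sequence[Sample], query: Sample) -> List[float]:
    """Dot product of each training gradient with the query gradient.

    To first order, a positive score means a gradient step on that
    training sample would lower the loss on the query.
    """
    g_query = grad_squared_loss(w, *query)
    scores = []
    for x, y in train:
        g_train = grad_squared_loss(w, x, y)
        scores.append(sum(a * b for a, b in zip(g_train, g_query)))
    return scores


def rank_by_influence(w: Sequence[float], train: Sequence[Sample], query: Sample) -> List[int]:
    """Indices of training samples, most helpful first."""
    scores = influence_scores(w, train, query)
    return sorted(range(len(train)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: three training samples and one query.
w = [0.5, -0.25]
train = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0)]
query = ([1.0, 0.0], 2.0)

scores = influence_scores(w, train, query)
ranking = rank_by_influence(w, train, query)
print(scores)   # [0.75, 0.0, -0.375]
print(ranking)  # [0, 1, 2]
```

Even in this toy form, the cost issue raised above is visible: every training sample needs its own gradient, which is as large as the parameter vector, so computing or storing these for an LLM and a large corpus quickly becomes expensive.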

Key Contributions

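The paper proposes Influcoder, which, per its title, distills decoders' gradient influence rankings into an encoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.

Research area: cs.CL (Computation and Language)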

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

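The stated aim is influence-based Data Attribution that is fast and compact enough to run on large datasets, for example to filter training samples or to trace which samples are the source of toxic behavior.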

Authors

Dimitri Kachler
Damien Sileo
Pascal Denis

Paper Information

arXiv ID: 2606.13668v1
Categories: cs.CL
Published: June 11, 2026
