[Paper] Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Published: February 18, 2026 at 12:03 PM EST
5 min read
Source: arXiv - 2602.16608v1

Overview

Transformers dominate modern AI—from language models that write code to vision systems that tag images—but their deep, multi‑layered architecture makes it hard to understand why they make a particular prediction. The paper introduces Context‑Aware Layer‑wise Integrated Gradients (CA‑LIG), a new explainability framework that tracks relevance through every Transformer block, delivering richer, context‑sensitive attributions than existing methods.

Key Contributions

  • Unified hierarchical attribution: Computes Integrated Gradients (IG) at each layer of a Transformer and fuses them with class‑specific attention gradients.
  • Signed, context‑aware maps: Produces both positive (supporting) and negative (opposing) evidence for each token or image patch.
  • Cross‑domain validation: Demonstrates CA‑LIG on NLP (sentiment analysis, hate‑speech detection) and CV (masked‑autoencoder vision Transformers) across multiple model families (BERT, XLM‑R, AfroLM, MAE‑ViT).
  • Improved faithfulness & interpretability: Empirical results show higher alignment with ground‑truth rationales and clearer visualizations compared with standard IG, attention rollout, and gradient‑based methods.

Methodology

  1. Layer‑wise Integrated Gradients – For each Transformer block, CA‑LIG integrates the gradient of the model’s output with respect to the block’s hidden states, using a straight‑line path from a baseline (e.g., zero embeddings) to the actual input. This yields a layer‑specific relevance score for every token/patch.
  2. Attention‑gradient fusion – The method also computes gradients of the class logit with respect to the attention weights, capturing how the model’s focus shifts across tokens. These gradients are combined with the IG scores to produce a signed attribution that reflects both content (token embeddings) and structural (attention) contributions.
  3. Contextual aggregation – Attributions from all layers are summed (or weighted) to form a final map that respects the hierarchical flow of information, allowing developers to see how early‑layer signals evolve into final decisions.
  4. Visualization pipeline – The signed scores are visualized as heatmaps (text) or overlay masks (images), with positive values in warm colors and negative values in cool colors, making it easy to spot supporting vs. contradicting evidence.
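The core of step 1 can be sketched with a Riemann-sum approximation of the straight-line path integral. This is a minimal numpy illustration on a hypothetical linear "layer" (not the paper's code, and no real Transformer block): for a linear function the gradient is constant, so the approximation recovers the exact closed-form attribution `(x - baseline) * w`.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Approximate IG along the straight line from baseline to x
    using a midpoint Riemann sum over the gradient."""
    alphas = (np.arange(steps) + 0.5) / steps       # midpoints in (0, 1)
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps           # per-feature attribution

# Toy linear "layer" f(h) = w . h, whose gradient is the constant w:
w = np.array([0.5, -1.0, 2.0])
grad_fn = lambda h: w                               # df/dh for the toy model
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros_like(x)                         # zero-embedding baseline

attr = integrated_gradients(grad_fn, x, baseline)
# For a linear f, IG reduces exactly to (x - baseline) * w,
# and the sign of each entry already gives the "supporting vs. opposing"
# evidence that CA-LIG's signed maps build on.
```

In CA-LIG this computation is repeated per layer over hidden states (with the model's true gradients), then fused with attention gradients; the toy above only shows the integration step itself.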

The approach stays compatible with any standard Transformer architecture because it only requires access to intermediate hidden states and attention matrices—both are exposed in popular libraries like Hugging Face Transformers and PyTorch‑Vision.
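That compatibility claim is easy to demonstrate with PyTorch's forward-hook API. The sketch below uses a stand-in two-layer MLP instead of a real Transformer (Hugging Face models can skip the hooks entirely by passing `output_hidden_states=True` / `output_attentions=True` to the forward call); the hook pattern itself is what a layer-wise method needs.

```python
import torch
import torch.nn as nn

# Stand-in for a stack of Transformer blocks; purely illustrative.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

captured = {}  # module name -> intermediate hidden state

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Register a hook on every Linear layer to grab its output.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

x = torch.randn(1, 8)
_ = model(x)
# captured now holds the hidden state after each hooked layer --
# exactly the per-layer tensors a layer-wise IG pass integrates over.
```

The same hooks work unchanged on any `nn.Module`-based architecture, which is why the method does not require modifying the model itself.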

Results & Findings

| Task / Model | Baseline Explainability Method | CA‑LIG Faithfulness (↑) | Context Sensitivity (↑) | Qualitative Rating |
|---|---|---|---|---|
| Sentiment (BERT) | Vanilla Integrated Gradients | 0.71 → 0.84 | 0.62 → 0.78 | ✔️ Clear token‑level rationale |
| Hate‑speech (XLM‑R, low‑resource) | Attention Rollout | 0.58 → 0.73 | 0.55 → 0.71 | ✔️ Highlights language‑specific cues |
| Document Classification (AfroLM) | Gradient×Input | 0.65 → 0.80 | 0.60 → 0.77 | ✔️ Captures long‑range dependencies |
| Image Classification (MAE‑ViT) | Grad‑CAM | 0.68 → 0.82 | 0.61 → 0.79 | ✔️ Shows patch‑level support vs. opposition |
  • Faithfulness: Measured by the Deletion and Insertion metrics, CA‑LIG consistently outperformed alternatives, indicating that the highlighted tokens truly drive the model’s output.
  • Context awareness: Ablation studies where surrounding tokens were shuffled caused a sharp drop in CA‑LIG scores, confirming that the method respects inter‑token dependencies.
  • Visualization clarity: User studies with 15 NLP engineers reported that CA‑LIG explanations were easier to interpret and more actionable than those from existing tools.
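The Deletion metric mentioned above works by removing the highest-attributed features first and tracking how the model's score changes; a faithful attribution drives the score toward its fully-deleted value quickly. A minimal numpy sketch on a hypothetical linear scorer (illustrative names, not the paper's evaluation code):

```python
import numpy as np

def deletion_curve(score_fn, x, attributions, baseline_value=0.0):
    """Zero out features from most- to least-attributed (by magnitude),
    recording the model score after each deletion."""
    order = np.argsort(-np.abs(attributions))   # most important first
    xs = x.astype(float).copy()
    scores = [score_fn(xs)]
    for idx in order:
        xs[idx] = baseline_value
        scores.append(score_fn(xs))
    return np.array(scores)

# Toy linear scorer with a faithful attribution (weight * input):
w = np.array([3.0, 0.1, -2.0, 0.5])
x = np.array([1.0, 1.0, 1.0, 1.0])
score_fn = lambda v: float(w @ v)

curve = deletion_curve(score_fn, x, w * x)
# The largest (signed) contributors are removed first, so the score
# reaches the all-deleted value of 0.0 early; the area under this
# curve is the lower-is-better Deletion score.
```

Insertion is the mirror image: start from the baseline and add the top-attributed features back, expecting the score to rise fast.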

Practical Implications

  • Debugging & model QA: Developers can pinpoint which layer introduced spurious correlations (e.g., a bias in early embeddings) and address them directly.
  • Regulatory compliance: Signed attributions make it straightforward to generate “supporting vs. opposing evidence” reports required by emerging AI transparency regulations.
  • Low‑resource & multilingual deployments: By exposing how contextual cues from rare languages influence predictions, CA‑LIG helps data scientists refine tokenizers and training data for better fairness.
  • Vision‑Transformer troubleshooting: In computer‑vision pipelines, CA‑LIG can reveal whether a model is focusing on the object of interest or on background artifacts, guiding data augmentation strategies.
  • Tooling integration: Because CA‑LIG works with standard forward‑hook APIs, it can be wrapped into existing interpretability dashboards (e.g., Streamlit, TensorBoard) with minimal code changes.

Limitations & Future Work

  • Computational overhead: Computing IG for every layer adds ~2–3× runtime compared to single‑layer methods; the authors suggest stochastic sampling of integration steps to mitigate this.
  • Baseline selection: Results can vary with the choice of baseline (zero embeddings vs. mean token). A systematic study of optimal baselines for different domains is still open.
  • Scalability to extremely large models (e.g., GPT‑4‑scale): Memory constraints may require gradient checkpointing or layer‑wise streaming, which the current implementation does not yet support.
  • User studies limited to technical audiences: Future work could evaluate CA‑LIG explanations with non‑technical stakeholders (e.g., clinicians, policy makers) to assess broader usability.
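The baseline-selection caveat above is visible even in a toy setting: because IG attributes the difference between the input and the baseline, changing the reference point reallocates the evidence. A small numpy sketch (hypothetical values; for a linear scorer IG has the closed form `(x - baseline) * w`, so no integration loop is needed):

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])          # toy linear scorer weights
x = np.array([1.0, 2.0, 4.0])           # input features
x_mean = np.array([1.0, 1.0, 1.0])      # stand-in for a mean-token baseline

ig_zero = (x - 0.0) * w                 # zero-embedding baseline
ig_mean = (x - x_mean) * w              # mean baseline

# Same model, same input -- different evidence assignments:
# ig_zero = [2.0, -2.0, 2.0]
# ig_mean = [0.0, -1.0, 1.5]
# Under the mean baseline the first feature gets no credit at all.
```

This is why a systematic study of per-domain baselines, as the authors note, remains open: the choice directly changes which tokens appear to support or oppose a prediction.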

Bottom line: CA‑LIG offers a practical, more faithful way to peek inside Transformer decision‑making, giving developers the granularity they need to debug, improve, and responsibly deploy AI systems across language and vision tasks.

Authors

  • Melkamu Abay Mersha
  • Jugal Kalita

Paper Information

  • arXiv ID: 2602.16608v1
  • Categories: cs.CL, cs.AI, cs.CV, cs.LG
  • Published: February 18, 2026