[Paper] Visualizing token importance for black-box language models

Published: December 12, 2025 at 09:01 AM EST
4 min read
Source: arXiv - 2512.11573v1

Overview

The paper introduces Distribution‑Based Sensitivity Analysis (DBSA), a model‑agnostic technique that lets developers peek inside a black‑box large language model (LLM) and see how each input token influences the generated output. By treating the LLM as a stochastic oracle—without needing gradients or internal weights—DBSA offers a quick, plug‑and‑play way to audit models that are only reachable via API calls, a common scenario in production systems handling legal, medical, or compliance‑critical text.

Key Contributions

  • Model‑agnostic token‑level sensitivity metric – Works with any LLM accessible through a black‑box API, no need for source code or gradient access.
  • Distribution‑based approach – Estimates token importance by comparing output distributions under controlled perturbations, handling the inherent randomness of LLM sampling.
  • Lightweight, plug‑and‑play tool – Requires only a handful of API calls per token, making it practical for real‑time debugging or periodic audits.
  • Visualization framework – Generates intuitive heat‑maps that highlight which tokens the model “relies on” for a given generation.
  • Empirical validation – Demonstrates that DBSA surfaces sensitivities missed by existing interpretability methods (e.g., attention‑based scores, gradient‑based saliency) across several benchmark prompts.

Methodology

  1. Prompt Perturbation – For each token t in the input prompt, DBSA creates a set of n perturbed prompts where t is replaced by a neutral placeholder (e.g., a mask token or a synonym).
  2. Output Sampling – The black‑box LLM is queried k times for each perturbed prompt, collecting a sample of generated continuations (or token‑level probabilities).
  3. Distribution Comparison – The original output distribution (from the unperturbed prompt) is compared to each perturbed distribution using a statistical distance (e.g., Jensen‑Shannon divergence).
  4. Sensitivity Score – The average distance across the n perturbed prompts for token t (each output distribution estimated from its k samples) becomes the sensitivity score for t. Higher scores indicate that the model’s output changes noticeably when t is altered.
  5. Visualization – Scores are mapped onto the original prompt as a heat‑map, letting users instantly spot “high‑impact” tokens.
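
To make steps 1–4 concrete, here is a minimal Python sketch. The helper names (`query_llm`, `empirical_dist`, `dbsa_scores`) are illustrative assumptions rather than the authors’ released code, and for simplicity each token gets a single mask perturbation (n = 1):

```python
# Minimal sketch of DBSA steps 1-4, assuming a query_llm(prompt) -> str
# callable that samples one continuation from the black-box model.
from collections import Counter
import math
from typing import Callable, Dict, List

def empirical_dist(samples: List[str]) -> Dict[str, float]:
    """Estimate an output distribution from k sampled continuations."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2, so values lie in [0, 1])."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    def kl(a: Dict[str, float]) -> float:
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def dbsa_scores(tokens: List[str],
                query_llm: Callable[[str], str],
                k: int = 20,
                mask: str = "[MASK]") -> List[float]:
    """Score each token by how much masking it shifts the output distribution."""
    # Step 2 for the unperturbed prompt: sample k continuations.
    base = empirical_dist([query_llm(" ".join(tokens)) for _ in range(k)])
    scores = []
    for i in range(len(tokens)):
        # Step 1: replace token i with a neutral placeholder.
        perturbed = " ".join(tokens[:i] + [mask] + tokens[i + 1:])
        # Step 2: sample k continuations for the perturbed prompt.
        dist = empirical_dist([query_llm(perturbed) for _ in range(k)])
        # Steps 3-4: distance to the original distribution is the score.
        scores.append(js_divergence(base, dist))
    return scores
```

Calling `dbsa_scores(prompt.split(), query_llm)` returns one score per whitespace-split token; a real application would instead follow the target model’s own tokenization.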

Because the method relies only on repeated forward passes, it sidesteps the need for gradients, making it compatible with any hosted LLM (OpenAI, Anthropic, Cohere, etc.).
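
As an example of such a hosted setup, the `query_llm` callable from the sketch above could wrap a vendor API. The snippet below assumes the OpenAI Python client (v1 style); the model name is illustrative, and any endpoint that supports stochastic sampling would slot in the same way:

```python
# Hypothetical adapter for the query_llm callable used in the sketch above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any sampling-capable model works
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,      # keep sampling stochastic, as DBSA requires
        max_tokens=32,
    )
    return resp.choices[0].message.content
```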

Results & Findings

  • Synthetic bias probe – Setup: a prompt containing gendered nouns, asking the LLM to generate an occupation. Key observation: DBSA highlighted the gender tokens as highly sensitive, whereas attention scores were diffuse.
  • Legal clause analysis – Setup: a prompt with a contract clause, asking the LLM to summarize it. Key observation: tokens related to liability and dates showed the strongest influence on the summary output.
  • Medical note generation – Setup: a prompt with patient symptoms, requesting a diagnosis. Key observation: symptom tokens received the highest sensitivity scores, confirming clinical relevance.
  • Comparison with baselines – Setup: gradient‑based saliency (when available) and attention weights as reference methods. Key observation: DBSA consistently produced clearer, more localized importance maps, especially under stochastic sampling (top‑p, temperature > 0).

Overall, DBSA succeeded in flagging tokens that, when altered, caused statistically significant shifts in the LLM’s response—often surfacing subtle dependencies that other methods missed.

Practical Implications

  • Compliance Audits – Regulators can use DBSA to verify that a model’s decisions are not unduly driven by protected attributes (e.g., race, gender) hidden in the prompt.
  • Prompt Engineering – Developers can iteratively refine prompts, removing or re‑phrasing high‑sensitivity tokens that cause unwanted model behavior.
  • Safety Guardrails – By monitoring sensitivity scores in production, teams can trigger alerts when a new prompt configuration introduces unexpected token dependencies.
  • Vendor‑agnostic Testing – Since DBSA works with any API‑only LLM, it fits naturally into CI/CD pipelines for products that rely on third‑party language services.
  • User‑Facing Explainability – Front‑end tools can display token‑heatmaps to end‑users (e.g., lawyers reviewing AI‑generated contracts), increasing trust and transparency.
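
To make the last point concrete, here is a minimal rendering sketch. It is our illustration, not the paper’s visualization framework, and simply shades each prompt token by its score from the `dbsa_scores` sketch in the Methodology section:

```python
# Rough token heat-map: one shaded cell per prompt token.
import matplotlib.pyplot as plt
import numpy as np

def plot_token_heatmap(tokens, scores):
    fig, ax = plt.subplots(figsize=(max(6, 0.6 * len(tokens)), 1.6))
    # Render scores as a single-row image; darker red = more sensitive.
    ax.imshow(np.array(scores)[None, :], cmap="Reds", aspect="auto")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45, ha="right")
    ax.set_yticks([])
    ax.set_title("DBSA token sensitivity")
    fig.tight_layout()
    plt.show()
```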

Limitations & Future Work

  • Sampling Cost – The need for multiple forward passes per token can become expensive for long prompts or high‑throughput services; the authors suggest adaptive sampling to mitigate this (one plausible scheme is sketched after this list).
  • Perturbation Choice – Replacing a token with a generic mask may not capture nuanced semantic shifts; exploring synonym or paraphrase perturbations could improve fidelity.
  • Statistical Distance Sensitivity – Different divergence measures may yield varying scores; a systematic study of alternatives is left for future research.
  • Dynamic Contexts – DBSA currently assumes a static prompt; extending it to multi‑turn conversations or streaming outputs remains an open challenge.
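
For the sampling‑cost point above, one plausible adaptive scheme (our assumption; the paper only names the idea) is to draw samples in small batches and stop once the divergence estimate stabilizes, reusing `empirical_dist` and `js_divergence` from the Methodology sketch:

```python
# Adaptive sampling: grow the sample in small batches and stop early
# once the JS-divergence estimate stops moving by more than tol.
def adaptive_js(base_dist, perturbed_prompt, query_llm,
                batch: int = 5, max_samples: int = 50, tol: float = 0.01):
    samples, prev = [], None
    while len(samples) < max_samples:
        samples += [query_llm(perturbed_prompt) for _ in range(batch)]
        est = js_divergence(base_dist, empirical_dist(samples))
        if prev is not None and abs(est - prev) < tol:
            break  # estimate has stabilized; save further API calls
        prev = est
    return est
```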

The authors envision a richer toolbox that combines DBSA with causal inference techniques and integrates directly into API monitoring dashboards.

Authors

  • Paulius Rauba
  • Qiyao Wei
  • Mihaela van der Schaar

Paper Information

  • arXiv ID: 2512.11573v1
  • Categories: cs.CL, cs.LG
  • Published: December 12, 2025