[Paper] UNBOX: Unveiling Black-box visual models with Natural-language

Published: March 9, 2026 at 01:16 PM EDT
4 min read
Source: arXiv

Overview

The paper UNBOX tackles a pressing problem: how to interpret modern vision models that are offered only as black‑box APIs (think cloud services that return class probabilities but hide the network architecture, weights, and training data). By turning the classic “activation maximization” task into a semantic search powered by large language models (LLMs) and text‑to‑image diffusion models, UNBOX can surface human‑readable descriptions of what each class “means” to the model—without ever seeing the model’s internals or its training set.

Key Contributions

  • Fully black‑box interpretability framework – works with only output probabilities; no gradients, parameters, or training data are required.
  • Semantic activation maximization – leverages LLMs to generate candidate textual concepts and diffusion models to evaluate how well they trigger a target class.
  • Class‑wise textual descriptors – produces concise natural‑language explanations (e.g., “a bird perched on a branch with a white belly”) that reveal the model’s implicit concepts and biases.
  • Comprehensive evaluation – tests on ImageNet‑1K, Waterbirds, and CelebA show competitive performance against white‑box baselines in fidelity, feature correlation, and bias‑slice discovery.
  • Open‑world auditing tool – demonstrates that developers can audit proprietary vision APIs for fairness and robustness without any privileged access.

Methodology

  1. Prompt Generation with an LLM

    • For each target class (e.g., “sparrow”), the LLM is asked to produce a diverse set of textual phrases that could describe visual concepts related to that class.
    • The prompts are filtered for relevance and diversity using simple similarity metrics.
  2. Semantic Scoring via Diffusion Models

    • Each generated phrase is fed to a text‑to‑image diffusion model (e.g., Stable Diffusion) to synthesize a set of images that match the description.
    • The black‑box vision model is then queried on these synthetic images; the class probability serves as a semantic activation score for the phrase.
  3. Optimization as a Search Problem

    • The pipeline iterates: high‑scoring phrases are expanded (e.g., by adding adjectives or compositional elements) and re‑evaluated, effectively performing a gradient‑free hill‑climb in the space of natural language.
    • The final output for each class is the phrase (or short list of phrases) that yields the highest activation.
  4. Auditing & Bias Detection

    • By inspecting the top phrases across classes, the authors identify systematic biases (e.g., “waterbird” classes overly associated with “lake” vs. “forest”) and hidden training‑distribution cues.
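The four steps above amount to a gradient-free search loop over phrases. The sketch below is purely illustrative: `propose_phrases`, `expand_phrase`, and `score_phrase` are toy stand-ins for the paper's LLM prompt generation, phrase refinement, and diffusion-plus-API scoring (a real implementation would synthesize images from each phrase and average the black-box model's class probability over them).

```python
def propose_phrases(class_name, n=6):
    """Stand-in for the LLM step: candidate descriptions of a class."""
    templates = [
        "a photo of a {}",
        "a {} perched on a branch",
        "a {} near water",
        "a close-up of a {}",
        "a {} in a forest",
        "a {} flying in the sky",
    ]
    return [t.format(class_name) for t in templates[:n]]

def expand_phrase(phrase):
    """Stand-in for refinement: prepend adjectives to a phrase."""
    modifiers = ["small", "white-bellied", "brown", "spotted"]
    return [f"{m} {phrase}" for m in modifiers]

def score_phrase(phrase):
    """Stand-in for diffusion synthesis + black-box query.  UNBOX would
    render images from the phrase and use the API's class probability;
    here a fake score is computed from keyword overlap."""
    keywords = {"sparrow": 0.5, "branch": 0.2, "brown": 0.2, "small": 0.1}
    return sum(w for k, w in keywords.items() if k in phrase)

def semantic_activation_search(class_name, rounds=2, keep=3):
    """Gradient-free hill-climb in phrase space (steps 1-3 above)."""
    candidates = propose_phrases(class_name)
    for _ in range(rounds):
        # Keep the highest-scoring phrases, then expand them for the
        # next round -- the textual analogue of activation maximization.
        top = sorted(candidates, key=score_phrase, reverse=True)[:keep]
        candidates = top + [p for c in top for p in expand_phrase(c)]
    return max(candidates, key=score_phrase)

best = semantic_activation_search("sparrow")
print(best)
```

Because the search only needs phrase scores, swapping the stub scorer for real diffusion synthesis and API calls leaves the loop unchanged.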

Results & Findings

| Dataset | Metric | UNBOX | White‑box baseline |
| --- | --- | --- | --- |
| ImageNet‑1K | Semantic fidelity (human rating) | 0.78 | 0.81 (Grad‑CAM) |
| Waterbirds | Bias‑slice discovery (precision) | 0.71 | 0.73 (TCAV) |
| CelebA | Feature correlation (R²) | 0.64 | 0.66 (Network Dissection) |
  • Competitive performance: Despite lacking any internal access, UNBOX’s textual descriptors achieve near‑state‑of‑the‑art fidelity.
  • Interpretability: Human evaluators found UNBOX’s phrases more intuitive than raw activation maps.
  • Bias uncovering: On the Waterbirds dataset, UNBOX automatically surfaced “background water vs. land” cues that the model relied on, matching the insights of white‑box methods.

Practical Implications

  • API Auditing: Companies that consume third‑party vision services (e.g., content moderation, medical imaging) can now run a quick “concept audit” to verify that the model isn’t unintentionally focusing on protected attributes.
  • Model Documentation: Developers can generate natural‑language model cards that list the most salient concepts per class, improving transparency for end‑users and regulators.
  • Rapid Prototyping: When evaluating off‑the‑shelf models, engineers can use UNBOX to compare how different providers encode the same class (e.g., “cat”) without needing to download the weights.
  • Bias Mitigation Pipelines: Detected bias‑related phrases can feed into data‑collection or fine‑tuning loops, guiding the acquisition of more balanced training data.
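A "concept audit" of the kind described above can be as simple as scanning each class's top UNBOX phrases for context or attribute words that should not drive the prediction. The helper below is a hypothetical sketch; the phrase lists and suspect terms are made up for illustration:

```python
def flag_spurious_concepts(class_phrases, suspect_terms):
    """Flag classes whose top descriptors lean on context words
    (backgrounds, protected attributes) rather than the object itself."""
    flagged = {}
    for cls, phrases in class_phrases.items():
        hits = [t for t in suspect_terms if any(t in p for p in phrases)]
        if hits:
            flagged[cls] = hits
    return flagged

# Example audit over made-up UNBOX outputs for a Waterbirds-style model.
report = flag_spurious_concepts(
    {
        "waterbird": ["a bird standing by a lake", "a duck on water"],
        "landbird": ["a bird on a branch in a forest"],
    },
    suspect_terms=["lake", "water", "forest", "beach"],
)
print(report)
```

Flagged terms can then seed the data-collection or fine-tuning loop mentioned above, e.g. by sourcing waterbird images on land backgrounds.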

Limitations & Future Work

  • Dependence on LLM & Diffusion Quality – Poor prompt generation or low‑fidelity image synthesis can mislead the activation scores, especially for fine‑grained or abstract classes.
  • Scalability – The search over textual space is iterative and may become expensive for models with thousands of classes.
  • Domain Shift – The approach assumes the diffusion model’s visual prior aligns with the black‑box model’s training distribution; large domain gaps (e.g., medical imaging) may reduce relevance.
  • Future directions suggested by the authors include integrating multimodal LLMs to reduce the number of diffusion calls, extending the method to video models, and formalizing privacy guarantees (ensuring the probing process does not inadvertently leak proprietary model behavior).

Authors

  • Simone Carnemolla
  • Chiara Russo
  • Simone Palazzo
  • Quentin Bouniot
  • Daniela Giordano
  • Zeynep Akata
  • Matteo Pennisi
  • Concetto Spampinato

Paper Information

  • arXiv ID: 2603.08639v1
  • Categories: cs.CV, cs.AI
  • Published: March 9, 2026