[Paper] IntRec: Intent-based Retrieval with Contrastive Refinement

Published: February 19, 2026 at 01:50 PM EST
Source: arXiv

Overview

The paper introduces IntRec, an interactive object‑retrieval system that lets users steer a vision model toward the exact item they want—especially in crowded or ambiguous scenes. By keeping track of what the user has confirmed as relevant (positive cues) and what has been rejected (negative constraints), IntRec can refine its predictions on the fly without needing extra training data.

Key Contributions

  • Intent State (IS): a dual‑memory structure that stores positive anchors (objects the user approves) and negative constraints (objects the user rejects).
  • Contrastive Refinement: a ranking loss that simultaneously pulls the target object closer to the positive cues and pushes away the negatives, enabling fine‑grained disambiguation.
  • Interactive Loop: a lightweight feedback mechanism (≈30 ms per interaction) that updates the Intent State and re‑ranks candidates in real time.
  • State‑of‑the‑art Performance: on LVIS, IntRec reaches 35.4 AP, beating strong baselines (OVMR, CoDet, CAKE) by up to +3.7 AP; on the LVIS‑Ambiguous benchmark it gains +7.9 AP after just one user correction.
  • Zero‑Additional Supervision: the system improves accuracy solely through user feedback, avoiding costly re‑training or annotation pipelines.

Methodology

  1. Base Detector – IntRec builds on a pre‑trained open‑vocabulary detector (e.g., a CLIP‑based model) that produces a set of candidate object proposals with visual embeddings.
  2. Intent State Construction – When a user interacts (e.g., clicks “this is the right car” or “not that person”), the system stores the corresponding proposal’s embedding in the positive set; any rejected proposals go into the negative set.
  3. Contrastive Alignment Function – For each remaining candidate c, the system computes:

\[ \text{score}(c) = \frac{1}{|P|}\sum_{p\in P} \text{sim}(c,p) \;-\; \frac{1}{|N|}\sum_{n\in N} \text{sim}(c,n) \]

where P and N are the positive/negative memory sets and sim(·, ·) is cosine similarity in the joint visual‑text embedding space.

  4. Re‑ranking & Feedback Loop – Candidates are sorted by this score, the top‑k are shown to the user, and the loop repeats. Because the similarity computations are vector dot products, the extra latency per interaction stays under 30 ms.
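The scoring and re‑ranking steps above can be sketched in a few lines. This is an illustrative NumPy implementation, not the authors' code; it assumes all embeddings are L2‑normalized, so a dot product equals cosine similarity:

```python
import numpy as np

def contrastive_scores(candidates: np.ndarray,
                       positives: np.ndarray,
                       negatives: np.ndarray) -> np.ndarray:
    """score(c) = mean_p sim(c, p) - mean_n sim(c, n).

    candidates: (num_cand, d), positives: (|P|, d), negatives: (|N|, d),
    all rows L2-normalized so dot product = cosine similarity.
    """
    pos_term = candidates @ positives.T   # (num_cand, |P|) similarities
    neg_term = candidates @ negatives.T   # (num_cand, |N|) similarities
    return pos_term.mean(axis=1) - neg_term.mean(axis=1)

def rerank_top_k(candidates: np.ndarray,
                 positives: np.ndarray,
                 negatives: np.ndarray,
                 k: int = 5) -> np.ndarray:
    """Indices of the k best-scoring candidates, best first."""
    scores = contrastive_scores(candidates, positives, negatives)
    return np.argsort(-scores)[:k]
```

Because both terms are batched matrix products, the cost per interaction is a handful of GEMMs over a few hundred proposals, which is consistent with the sub‑30 ms latency the paper reports.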

The whole pipeline is model‑agnostic: any detector that outputs embeddings can be plugged in, and the Intent State can be persisted across sessions for long‑term personalization.
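A persistable Intent State could look like the following minimal sketch. The names (`IntentState`, `accept`, `reject`) and the JSON persistence format are illustrative assumptions, not details from the paper; the only requirement is that each user interaction hands over the embedding of the accepted or rejected proposal:

```python
import json
from dataclasses import dataclass, field

@dataclass
class IntentState:
    positives: list = field(default_factory=list)  # embeddings of approved objects
    negatives: list = field(default_factory=list)  # embeddings of rejected objects

    def accept(self, embedding):
        """User confirmed this proposal: store it as a positive anchor."""
        self.positives.append(list(embedding))

    def reject(self, embedding):
        """User rejected this proposal: store it as a negative constraint."""
        self.negatives.append(list(embedding))

    def save(self, path):
        # Persist across sessions for long-term personalization.
        with open(path, "w") as f:
            json.dump({"P": self.positives, "N": self.negatives}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            data = json.load(f)
        return cls(positives=data["P"], negatives=data["N"])
```

Because the state is just two lists of vectors, it is detector‑agnostic: any model that emits embeddings for its proposals can populate it.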

Results & Findings

| Dataset | Baseline (one‑shot) | IntRec (after 1 feedback) | Δ AP |
| --- | --- | --- | --- |
| LVIS | 32.1 AP | 35.4 AP | +3.3 |
| LVIS‑Ambiguous | 27.8 AP | 35.7 AP | +7.9 |
  • Speed: each feedback iteration adds < 30 ms, making the system suitable for interactive UI/UX.
  • Robustness: The contrastive loss effectively suppresses visually similar distractors, even when the initial query is vague (“a red vehicle”).
  • Generalization: No extra labeled data were required; the same Intent State works across categories, demonstrating the method’s scalability.

Practical Implications

  • Search‑by‑Example UI: Developers can embed IntRec in photo‑management apps, e‑commerce platforms, or video editors, allowing users to “click‑and‑refine” to locate a specific product or scene element.
  • Robotics & AR: An autonomous robot or AR headset can ask a human operator for quick confirmations (“Is this the tool you need?”) and instantly narrow down its perception, improving safety and efficiency.
  • Content Moderation: Moderators can iteratively rule out false positives in large image batches, reducing manual review time while preserving high recall.
  • Personalized Vision Services: By persisting the Intent State per user, services can learn individual visual preferences (e.g., “my favorite brand of sneakers”) without storing explicit labels.

All of these use‑cases benefit from the low latency and zero‑training‑cost nature of IntRec, making it a plug‑and‑play upgrade for existing vision pipelines.

Limitations & Future Work

  • Memory Growth: The dual memory sets grow with each interaction; the authors suggest a simple pruning strategy but more sophisticated memory management could be explored.
  • Dependence on Base Detector Quality: If the underlying detector fails to propose the target object, no amount of feedback can recover it. Future work could integrate proposal generation into the feedback loop.
  • User Interaction Design: The paper assumes binary clicks (accept/reject). Extending to richer signals (e.g., bounding‑box adjustments, textual hints) could further boost performance.
  • Scalability to Video: Applying IntRec across temporal frames raises challenges like maintaining consistent Intent States over time—an open research direction.
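On the memory‑growth point, the simplest cap is a bounded buffer that drops the oldest entries. This is a hypothetical strategy for illustration only; the paper mentions pruning but does not specify the rule:

```python
from collections import deque

def make_memory(max_size: int = 50) -> deque:
    """A bounded memory set: appending beyond max_size evicts the oldest entry."""
    return deque(maxlen=max_size)

# With max_size=3, only the three most recent embeddings survive.
positives = make_memory(max_size=3)
for emb in [[1, 0], [0, 1], [1, 1], [2, 2]]:
    positives.append(emb)
```

Recency-based eviction is a naive baseline; smarter policies (e.g., merging near-duplicate embeddings) are exactly the "more sophisticated memory management" the authors leave open.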

Authors

  • Pourya Shamsolmoali
  • Masoumeh Zareapoor
  • Eric Granger
  • Yue Lu

Paper Information

  • arXiv ID: 2602.17639v1
  • Categories: cs.CV
  • Published: February 19, 2026
