[Paper] T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts...