research — Page 74

0 month ago · ai

[Paper] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coor...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The curr...

#research #paper #ai #machine-learning

0 month ago · ai

[Paper] Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built o...

#research #paper #ai #machine-learning #computer-vision

0 month ago · ai

[Paper] Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thi...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] Zero-shot Reconstruction of In-Scene Object Manipulation from Video

We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed s...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and ground...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this...

#research #paper #ai #nlp

0 month ago · ai

[Paper] VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean im...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, ...

#research #paper #ai #machine-learning #computer-vision

0 month ago · ai

[Paper] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)

We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-b...

#research #paper #ai #computer-vision

0 month ago · ai

[Paper] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understand...

#research #paper #ai #machine-learning #nlp
