[Paper] Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still 'think about videos', i.e., once a video is encoded, reasoning unf...
Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interact...
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video gene...
Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that ev...
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned wi...
Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying ru...
Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models...
Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capab...
Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have...
This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: si...
Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented...
Machine learning models perform well across domains such as diagnostics, weather forecasting, NLP, and autonomous driving, but their limited uncertainty handlin...
We introduce SuperIntelliAgent, an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (t...
Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis towar...
Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual ...
Automated vulnerability patching is crucial for software security, and recent advancements in Large Language Models (LLMs) present promising capabilities for au...
Underwater object tracking is challenging due to wavelength dependent attenuation and scattering, which severely distort appearance across depths and water cond...
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop archi...
Split learning is well known as a method for resolving data privacy concerns by training a model on distributed devices, thereby avoiding data sharing that rais...
Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, dev...
Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has incre...
We study the online unweighted bipartite matching problem in the random arrival order model, with $n$ offline and $n$ online vertices, in the learning-augmented...
We present the Hierarchical AI-Meteorologist, an LLM-agent system that generates explainable weather reports using a hierarchical forecast reasoning and weather...
Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previo...
Novice and expert users have different systematic preferences in task-oriented dialogues. However, whether catering to these preferences actually improves user ...
Modern large language models have become multimodal, analyzing various data formats like text and images. While fine-tuning is effective for adapting these multimoda...
Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when i...
In contemporary retail, the variety of products available (e.g. clothing, groceries, cosmetics, frozen goods) makes it difficult to predict demand, prevent s...
Program synthesis is the process of generating a computer program following a set of specifications, such as a set of input-output examples. It can be modeled a...
Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have...
Gambling disorder is a complex behavioral addiction that is challenging to understand and address, with severe physical, psychological, and social consequences....
Chart-to-code generation is a critical task in automated data visualization, translating complex chart structures into executable programs. While recent Multi-m...
This work explores the challenge of building 'Machines that Can Remember', framing long-term memory as the problem of efficient ultra-long context modeling. W...
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilitie...
Mutation-based Fault Localization (MBFL) has been widely explored for automated software debugging, leveraging artificial mutants to identify faulty code entiti...
The content-oblivious model, introduced by Censor-Hillel, Cohen, Gelles, and Sel (PODC 2022; Distributed Computing 2023), captures an extremely weak form of com...
Federated edge learning (FEEL) provides a promising foundation for edge artificial intelligence (AI) by enabling collaborative model training while preserving d...
Modern cloud applications are built on independent, diverse microservices, offering scalability, flexibility, and usage-based billing. However, the structural d...
Dynamically resolving method reachability in Android applications remains a critical and largely unsolved problem. Despite notable advancements in GUI testing a...
Split learning (SL) offloads main computing tasks from multiple resource-constrained user equipments (UEs) to the base station (BS), while preserving local dat...
Vibe coding, the much-touted use of AI techniques for programming, faces two overwhelming obstacles: the difficulty of specifying goals ('prompt engineering' is...
As LLMs reshape software development, integrating LLM-augmented practices into SE education has become imperative. While existing studies explore LLMs' educatio...
High-capacity kernel Hopfield networks exhibit a 'Ridge of Optimization' characterized by extreme stability. While previously linked to 'Spectral Concentration,...
We present Areon, a family of latency-friendly, stake-weighted, multi-proposer proof-of-stake consensus protocols. By allowing multiple proposers per slot and o...
Biological neurons exhibit remarkable intelligence: they maintain internal states, communicate selectively with other neurons, and self-organize into complex gr...
Fall detection for elderly care using non-invasive vision-based systems remains an important yet unsolved problem. Driven by strict privacy requirements, infere...
Reservoir computing (RC) is a powerful framework for predicting nonlinear dynamical systems, yet the role of reservoir topology, particularly symmetry in conne...
We liberate Equilibrium Propagation (EP) from the limit of infinitesimal perturbations by establishing a finite-nudge foundation for local credit assignment. By...