[Paper] EasyV2V: A High-quality Instruction-based Video Editing Framework
While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the desig...
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interaction...
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculation...
The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, pr...
Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing ...
In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from ...
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of ...
Standard practice across domains from robotics to language is to first pretrain a policy on a large-scale demonstration dataset, and then finetune this policy, ...
Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact la...
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning...
Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking* where an LLM...
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined o...
AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and news...
Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text seq...
The correct use of a Hardware Abstraction Layer (HAL) interface in embedded applications is crucial to prevent malfunctions, crashes, or even hardware damage. S...
Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise o...
Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. Howe...
Equivariant atomistic machine learning models have brought substantial gains in both extrapolation capability and predictive accuracy. Depending on the basis of...
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics...
The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the s...
Inspired by biology, spiking neural networks (SNNs) process information via discrete spikes over time, offering an energy-efficient alternative to the classical...
Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challe...
Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-t...
Nowadays, Large Language Models (LLMs) are foundational components of modern software systems. As their influence grows, concerns about fairness have become inc...
Reactive jammers pose a severe security threat to robotic-swarm networks by selectively disrupting inter-agent communications and undermining formation integrit...
In this paper, the Multi-stage Edge Server Upgrade (M-ESU) is proposed as a new network planning problem, involving the upgrading of an existing multi-access ed...
While comments are non-functional elements of source code, Large Language Models (LLMs) frequently rely on them to perform Software Engineering (SE) tasks. Yet, ...
Mutation analysis is a well-established technique for assessing test quality in the traditional software development paradigm by injecting artificial faults int...
We describe the Lockchain Protocol, a lightweight Bitcoin meta-protocol that enables highly efficient transaction discovery at zero marginal block space cost, a...
Generative art systems often involve high-dimensional and complex parameter spaces in which aesthetically compelling outputs occupy only small, fragmented regio...
Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially ...
In this paper, we describe a federated compute platform dedicated to supporting Artificial Intelligence in scientific workloads. Putting the effort into reproducib...
Choosing the number of topics T in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretabilit...
How can neural networks evolve themselves without relying on external optimizers? We propose Self-Referential Graph HyperNetworks, systems where the very machin...
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascad...
Background: Compilers are fundamental to software development, translating high-level source code into executable software systems. Faults in compilers can have...
Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain s...
Spiking neurons, the fundamental information processing units of Spiking Neural Networks (SNNs), have the all-or-zero information output form that allows SNNs t...
Rank-based zeroth-order (ZO) optimization -- which relies only on the ordering of function evaluations -- offers strong robustness to noise and monotone transfo...
Disaggregated memory (DM) is a promising data center architecture that decouples CPU and memory into independent resource pools to improve resource utilization....
Apache Kafka has become a foundational platform for high throughput event streaming, enabling real time analytics, financial transaction processing, industrial ...
Disaggregated memory (DM) separates compute and memory resources, allowing flexible scaling to achieve high resource utilization. To ensure atomic and consisten...
The evolution of Large Language Model (LLM) serving towards complex, distributed architectures--specifically the P/D-separated, large-scale DP+EP paradigm--intr...
The performance of modern software systems is critically dependent on their complex configuration options. Building accurate performance models to navigate this...
As an increasing number of software systems reach unprecedented scale, relying solely on code-level abstractions is becoming impractical. While architectural ab...
Symbolic regression (SR) has emerged as a powerful method for uncovering interpretable mathematical relationships from data, offering a novel route to both scie...
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To ...
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging ...