[Paper] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iterati...
5752 posts from this source
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iterati...
Retrieval augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) to process tasks involving long documents. However, existing r...
Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step ...
Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (...
Accurate characterization of subsurface flow is critical for Carbon Capture and Storage (CCS) but remains challenged by the ill-posed nature of inverse problems...
We propose an optimization-informed deep neural network approach, named iUzawa-Net, aiming for the first solver that enables real-time solutions for a class of ...
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both...
Copyright law focuses on whether a new work is 'substantially similar' to an existing one, but generative AI can closely imitate style without copying content, ...
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforc...
Self-supervised learning (SSL) is a powerful paradigm for learning from unlabeled time-series data. However, popular methods such as masked autoencoders (MAEs) ...
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their i...
Decentralized protocols claim immutable, rule-based execution, yet many embed emergency mechanisms such as chain-level freezes, protocol pauses, and account qua...
Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as ...
We study Langevin dynamics with noise projected onto the directions orthogonal to an isometric group action. This mathematical model is introduced to shed new l...
Unit testing is essential for verifying the functional correctness of code modules (e.g., classes, methods), but manually writing unit tests is often labor-inte...
We consider bidding in repeated Bayesian first-price auctions. Bidding algorithms that achieve optimal regret have been extensively studied, but their strategic...
This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (L&T) industry. The curriculum ...
Graph neural networks (GNNs) are designed to use attributed graphs to learn representations. Such representations are beneficial in the unsupervised learning of...
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deploym...
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are i...
Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transc...
Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing metho...
Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic f...
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. So...
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primaril...
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation...
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training co...
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics....
AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critica...
The rapid evolution of cyberattacks continues to drive the emergence of unknown (zero-day) threats, posing significant challenges for network intrusion detectio...
State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. ...
Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks incl...
High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. ...
Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads ...
AI coding agents are increasingly contributing to software development, yet their impact on mobile development has received little empirical attention. In this ...
Modern container-based microservices evolve through rapid deployment cycles, but CI/CD pipelines still rarely measure energy consumption, even though prior work...
Performance antipatterns are known to degrade the responsiveness of microservice-based systems, but their impact on energy consumption remains largely unexplore...
In the Contention Resolution problem n parties each wish to have exclusive use of a shared resource for one unit of time. The problem has been studied since the...
Model checking in TLA+ provides strong correctness guarantees, yet practitioners continue to face significant challenges in interpreting counterexamples, unders...
Vulnerability detection is crucial to protect software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, l...
Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared contex...
Distributed computing has enabled cooperation between multiple computing devices for the simultaneous execution of resource-hungry tasks. Such execution also pl...
A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatic...
Several Deep Learning (DL)-based techniques have been proposed to automate code review. Still, it is unclear the extent to which these approaches can recommend ...
Large language models (LLMs) have shown remarkable capabilities in automated code generation. While effective for mainstream languages, they may underperform on...
In binary classification systems, decision thresholds translate model scores into actions. Choosing suitable thresholds relies on the specific distribution of t...
Designing a rate limiter that is simultaneously accurate, available, and scalable presents a fundamental challenge in distributed systems, primarily due to the ...
Distributing LLM inference across geographical regions can improve Time-to-First-Token (TTFT) by regionalizing service deployments. While existing multi-region ...