[Paper] DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established scienc...
Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet curre...
Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate representations shoul...
Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions...
Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distrib...
Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what speakers have committ...
Efficient learning of user preferences is crucial for many modern decision making systems but typically requires costly labeled data. Active learning reduces th...
Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and cultur...
Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient ...
Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. ...
We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs o...
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both wheth...
Generating code from natural-language requirements has become a primary route for LLM-assisted software development. Although LLMs can successfully complete sma...
Modern atomistic spin simulations combine long stochastic trajectories, thermodynamic sampling, static optimization and multi-image transition-path workflows, a...
Log parsing is a fundamental step for automated log analysis, which transforms raw log messages into structured formats. Existing syntax-based parsers struggle ...
Generative AI tools are rapidly transforming software development practice, prompting unprecedented research interest. However, existing studies have predominan...
Federated edge learning (FEEL) has recently emerged as a promising paradigm for achieving edge intelligence (EI) by enabling collaborative model training acros...
Agentic AI coding assistants can edit files, run commands, and access the internet on behalf of developers. However, their reliance on unvetted external artifac...
Validators on generic Proof of Stake chains earn the same fees whether they handle attestation work correctly or selectively censor it. For chains whose main ac...
Dynamic multi-objective optimization with a changing number of objectives has recently attracted increasing attention due to its relevance to real-world problem...
Retrieval-Augmented Generation (RAG) empowers LLMs with external knowledge, making cross-institutional domain-specific knowledge base integration a highly promi...
Large language models (LLMs) can serve as the semantic-matching engine of a content-based publish/subscribe broker for agentic AI across the edge-cloud computin...
Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical benefits on real hardware...
AI-native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insufficient for productio...
Large language model (LLM) inference is limited by high computational cost and memory bandwidth demands, making deployment on heterogeneous many-core processors...
Multi-agent systems powered by large foundation models (LFMs) are increasingly deployed to control industrial robots through natural language, creating deployme...
We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU an...
Diffusion-based generation is increasingly powering production content pipelines; however, deploying these models at scale remains a significant challenge. Mode...
Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported results are promising, th...
Responsive Layout Failures (RLFs) typically arise from CSS properties that hinder proper layout behavior in different screen sizes. To find an accurate and effe...
The rapid evolution of large language models (LLMs) has made geographically distributed training necessary due to GPU scarcity within a single cloud region. In ...
We study the symmetric polynomial $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$ where $A_{n,d}:=\{\alpha\in\mathbb{Z}_{\ge 0}^n:|\alpha|=d\}$, which is the total Chern cla...
Spatial and temporal resource constraints are critical for both biological and artificial intelligent systems. Here we define differentiable cost terms for brea...
Particle Swarm Optimization (PSO) frequently suffers from premature convergence. This paper introduces a family of problem-informed diversity-enhancing strategi...
The dominant artificial intelligence paradigm trains neural architectures via gradient descent against proxy objectives and reinforcement learning from human fe...
Sampling-based algorithms for robot path planning offer probabilistic completeness and strong empirical convergence properties across environments with diverse ...
We present an inequality that bounds the short-term memory capability of dynamical systems from below. It can be interpreted as an uncertainty relation between ...
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimiz...
Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on Sup...
Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a ...
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophi...
Language agents increasingly improve by reusing skills -- structured procedural artifacts distilled from past experience. In particular, domain-level and model-...
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatia...
Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grai...
Identifying which brain regions represent a visual concept in the human brain is a central challenge in neuroscience. Existing approaches have localized coarse ...
We propose Complete-muE, a framework which targets hyperparameter transfer across dense FFN and any Mixture-of-Experts (MoE) setups in transformer blocks. Exist...
Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-...
Mask-free video object insertion has emerged as a challenging task, requiring harmonious integration of reference objects into source videos. However, existing ...