NeurIPS 2025 Best Paper Awards
Source: Hacker News
The Best Paper Award Committee members were nominated by the Program Chairs and the Datasets and Benchmarks track chairs, who selected leading researchers across machine learning topics. These nominations were approved by the General Chairs and Next Generation and Accessibility Chairs.
The best paper award committees were tasked with selecting a handful of highly impactful papers from the Main Track and the Datasets & Benchmarks Track of the conference.
We are excited to share the news that the best and runner‑up paper awards this year go to seven groundbreaking papers: four best papers (one of which is from the Datasets & Benchmarks track) and three runners‑up. The seven papers highlight advances in diffusion model theory, self‑supervised reinforcement learning, attention mechanisms for large language models, reasoning capabilities in LLMs, online learning theory, neural scaling laws, and benchmarking methodologies for language model diversity.
The winners are presented here in alphabetical order by title.
Artificial Hivemind: The Open‑Ended Homogeneity of Language Models (and Beyond)
Authors: Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Yejin Choi
Abstract
Large language models (LMs) often struggle to generate diverse, human‑like creative content, raising concerns about the long‑term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. To address this gap, we introduce Infinity‑Chat, a large‑scale dataset of 26K diverse, real‑world, open‑ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open‑ended prompts posed to LMs, comprising 6 top‑level categories (e.g., creative content generation, brainstorm & ideation) that further break down into 17 subcategories. Using Infinity‑Chat, we present a large‑scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open‑ended generation, characterized by (1) intra‑model repetition, where a single model consistently generates similar responses, and, even more so, (2) inter‑model homogeneity, where different models produce strikingly similar outputs. Infinity‑Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual‑specific human preferences in response to open‑ended queries. Our findings show that state‑of‑the‑art LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity‑Chat presents the first large‑scale resource for systematically studying real‑world open‑ended queries to LMs, revealing critical insights to guide future research on mitigating the long‑term AI safety risks posed by the Artificial Hivemind.
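The two homogeneity phenomena above can be made concrete with a toy calculation. The sketch below is not the paper's methodology; it only illustrates one way to score inter‑model homogeneity for a single prompt, by averaging pairwise similarities between responses from different models, with TF‑IDF vectors standing in (as an assumption) for a real semantic embedding model.

```python
# Illustrative sketch only: score inter-model homogeneity for one prompt as
# the mean pairwise cosine similarity between different models' responses.
# TF-IDF is a crude lexical stand-in for a semantic embedding model; the
# example responses below are hypothetical.
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def inter_model_homogeneity(responses: list[str]) -> float:
    """Mean pairwise cosine similarity across responses from different models."""
    vectors = TfidfVectorizer().fit_transform(responses)
    sims = cosine_similarity(vectors)
    pairs = list(combinations(range(len(responses)), 2))
    return float(sum(sims[i, j] for i, j in pairs) / len(pairs))


# Hypothetical responses from three different models to the same open-ended prompt.
responses = [
    "Name your new coffee shop 'The Daily Grind'.",
    "How about calling your coffee shop 'The Daily Grind'?",
    "A cozy option would be 'Morning Ritual Coffee'.",
]
print(f"Inter-model homogeneity: {inter_model_homogeneity(responses):.2f}")
```

The same score computed over repeated samples from a single model would correspond to intra‑model repetition; in both cases, higher values indicate less diverse outputs.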
Reflections from the Selection Committee
This paper makes a substantial and timely contribution to the understanding of diversity, pluralism, and societal impact in modern language models. The authors introduce Infinity‑Chat, a rigorously constructed benchmark of 26K real‑world open‑ended queries paired with 31K dense human annotations, enabling systematic evaluation of creative generation, ideation, and subjective preference alignment—dimensions historically underexamined in AI evaluation. Beyond releasing a valuable dataset, the paper provides deep analytical insights through the first comprehensive taxonomy of open‑ended prompts and an extensive empirical study across more than 70 models, revealing the Artificial Hivemind effect: pronounced intra‑ and inter‑model homogenization that raises serious concerns about long‑term risks to human creativity, value plurality, and independent thinking. The findings expose critical miscalibration between current reward models, automated judges, and diverse human preferences, highlighting the tension between alignment and diversity and establishing a foundation for future work on preserving heterogeneity in AI systems. Overall, this work sets a new standard for datasets and benchmarks that advance scientific understanding and address pressing societal challenges rather than solely improving technical performance.
Gated Attention for Large Language Models: Non‑linearity, Sparsity, and Attention‑Sink‑Free
Authors: Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Abstract
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state‑space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating‑augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture‑of‑Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion‑token dataset. Our central finding is that a simple modification—applying a head‑specific sigmoid gate after the Scaled Dot‑Product Attention (SDPA)—consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non‑linearity upon the low‑rank mapping in the softmax attention, and (2) applying query‑dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates massive activations and attention sinks, and enhances long‑context extrapolation performance. We also release related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research. Furthermore, the most effective SDPA output gating is used in the Qwen3‑Next models (https://huggingface.co/collections/Qwen/qwen3-next).
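To make the mechanism concrete, here is a minimal sketch of head‑specific sigmoid gating applied to the SDPA output. It is not the authors' released implementation (linked above); the layer sizes, the elementwise gate, and the choice to compute the gate from the layer input are illustrative assumptions.

```python
# Minimal sketch of output gating after scaled dot-product attention (SDPA).
# Shapes and the decision to predict the gate from the layer input are
# illustrative assumptions, not the released gated_attention implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        # Gate values predicted from the input (hence query-dependent),
        # one per head and channel, squashed with a sigmoid.
        self.gate = nn.Linear(d_model, d_model, bias=True)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape  # (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # Head-specific sigmoid gate on the SDPA output: adds non-linearity
        # after the low-rank value/output mapping and sparsifies the output.
        g = torch.sigmoid(self.gate(x))
        return self.out(g * attn)


# Usage: layer = GatedSelfAttention(d_model=512, n_heads=8)
#        y = layer(torch.randn(2, 16, 512))
```

Because the gate is computed from the same hidden states that produce the query, its scores are query‑dependent, which the abstract identifies, together with the added non‑linearity, as the key reason the modification helps.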
Reflections from the Selection Committee
The main finding of this paper is that the performance of large language models using softmax attention can be consistently improved by applying a head‑specific sigmoid gate to the output of the scaled dot‑product attention.