[Paper] Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascad...
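The abstract is cut off before the method itself, so the sketch below only illustrates the general idea named in the title: score every cached key for a query, but run the softmax and the value reduction over just the top-k of them. This is a minimal NumPy sketch under assumed names and shapes, not Kascade's actual algorithm.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """One decode-step query of generic top-k sparse attention:
    score all cached keys, but attend only to the k best-scoring ones."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # (n_ctx,) attention logits
    idx = np.argpartition(scores, -k)[-k:]   # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                             # softmax over selected keys only
    return w @ V[idx]                        # weighted sum of k selected values

# Toy usage: 4096-token KV cache, head dim 64, attend to 64 tokens.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
out = topk_sparse_attention(rng.standard_normal(64), K, V)
print(out.shape)  # (64,)
```

Note that the selection step here still scans every key; practical sparse-attention methods make the selection itself cheap, which is where the algorithmic work lies.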
Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain s...
Rank-based zeroth-order (ZO) optimization -- which relies only on the ordering of function evaluations -- offers strong robustness to noise and monotone transfo...
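To make "relies only on the ordering of function evaluations" concrete, here is a minimal rank-based ZO step in NumPy: the raw values of f are thrown away and only their ranks shape the update, so rescaling the noise or applying any monotone transform to f leaves the step unchanged. The population size, rank utilities, and step rule are generic assumptions, not the paper's estimator.

```python
import numpy as np

def rank_zo_step(f, x, rng, sigma=0.1, pop=8, lr=0.05):
    """One rank-based zeroth-order step for minimization: evaluate f at
    random perturbations of x and build the update from the *ranking*
    of those evaluations only, never their magnitudes."""
    eps = rng.standard_normal((pop, x.size))          # candidate directions
    vals = np.array([f(x + sigma * e) for e in eps])  # black-box evaluations
    order = np.argsort(vals)                          # best (lowest f) first
    w = np.linspace(1.0, -1.0, pop)                   # rank utilities, sum to 0
    step = (w[:, None] * eps[order]).sum(axis=0) / pop
    return x + lr * step                              # toward better-ranked directions

# Toy usage: minimize a quadratic using only function-value orderings.
rng = np.random.default_rng(0)
f = lambda z: float(z @ z)
x = np.ones(5)
for _ in range(300):
    x = rank_zo_step(f, x, rng)
print(f(x))  # far below the starting value f(ones) = 5.0
```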
TensorFlow is a framework for building applications that learn from data. It runs on small phones and on large servers alike, so the same idea can be used at home or i...
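As a minimal example of the learn-from-data workflow the abstract gestures at, the snippet below fits a tiny classifier with the standard Keras API; the toy data and architecture are made up for illustration, and the phone deployment path (TFLite conversion) is only noted in a comment.

```python
import numpy as np
import tensorflow as tf

# Toy data: classify whether the features sum past a threshold.
x = np.random.rand(256, 4).astype("float32")
y = (x.sum(axis=1) > 2.0).astype("int32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=5, verbose=0)
print(model.predict(x[:3], verbose=0))

# The same trained model can be converted with tf.lite.TFLiteConverter
# to run on phones, which is the portability point the abstract makes.
```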
The evolution of Large Language Model (LLM) serving towards complex, distributed architectures--specifically the P/D-separated, large-scale DP+EP paradigm--intr...
As an increasing number of software systems reach unprecedented scale, relying solely on code-level abstractions is becoming impractical. While architectural ab...
Symbolic regression (SR) has emerged as a powerful method for uncovering interpretable mathematical relationships from data, offering a novel route to both scie...
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To ...
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure...
This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. W...
In mathematical models of interacting biological organisms, where external interventions may alter behavior over time, traditional models that assume fixed par...
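To make the contrast with fixed-parameter models concrete, here is a small assumed example: a standard predator-prey system in which the predation rate changes when an external intervention begins. The model choice (Lotka-Volterra), the intervention time, and all rates are illustrative assumptions, not the paper's model.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, z, alpha=1.0, gamma=1.0, delta=0.1):
    """Predator-prey dynamics with a time-varying predation rate beta(t):
    an external intervention at t = 20 alters the interaction, which a
    fixed-parameter model cannot represent."""
    x, y = z                                   # prey, predator populations
    beta = 0.2 if t < 20 else 0.05             # intervention changes behavior
    return [alpha * x - beta * x * y,          # prey: growth minus predation
            delta * beta * x * y - gamma * y]  # predator: gain minus death

sol = solve_ivp(lotka_volterra, t_span=(0, 40), y0=[10.0, 5.0],
                t_eval=np.linspace(0, 40, 9))
print(sol.y.round(2))  # populations before and after the intervention
```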
Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the mod...
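A minimal sketch of the EE idea with stub components: a shared exit head checks its confidence after every layer, and the forward pass stops as soon as the prediction is confident enough, so "easy" tokens use fewer layers. The confidence-threshold rule, the shared head, and the toy shapes are assumptions; the abstract is truncated before the paper's actual exit criterion.

```python
import numpy as np

def forward_with_early_exit(h, layers, exit_head, threshold=0.9):
    """Run transformer layers one by one; after each, an exit head predicts
    the next token, and we stop early once its max probability clears the
    threshold, skipping the remaining layers."""
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)                          # one transformer block (stubbed)
        logits = exit_head(h)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                  # softmax over the vocabulary
        if probs.max() >= threshold:          # "easy" token: exit early
            return int(probs.argmax()), depth
    return int(probs.argmax()), len(layers)   # "hard" token: full depth used

# Toy usage: 8 stub layers and a fixed projection to a 10-token vocabulary.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 10))
layers = [lambda h, s=s: np.tanh(h + 0.1 * s) for s in range(8)]
token, depth = forward_with_early_exit(rng.standard_normal(16), layers,
                                       lambda h: h @ W)
print(token, depth)  # depth < 8 means the token exited early
```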