[Paper] Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Published: December 3, 2025 at 08:03 PM EST
4 min read

Source: arXiv - 2512.04355v1

Overview

The paper introduces gpuFLOPBench, a new benchmark that asks large language models (LLMs) to predict the floating‑point operation (FLOP) count of CUDA kernels without actually running the code. By focusing on forward‑looking reasoning about code complexity, the authors expose a blind spot in today’s code‑generation assistants: they can write GPU code, but they struggle to anticipate performance‑critical details that developers need to know early in the design cycle.

Key Contributions

  • gpuFLOPBench dataset – 577 real‑world CUDA kernels from the HeCBench suite, each annotated with ground‑truth single‑ and double‑precision FLOP counts and eight execution attributes that flag “easy” vs. “hard” kernels (a schematic record is sketched after this list).
  • Evaluation protocol – a systematic way to measure an LLM’s ability to (1) classify whether a kernel’s FLOP count can be derived analytically and (2) produce an accurate numeric estimate when it is.
  • Empirical study of state‑of‑the‑art LLMs – benchmarking several closed‑source reasoning models, revealing where they succeed (trivial kernels) and where they fail (implicit FLOPs from divisions, math intrinsics, common subexpressions).
  • Insight into a core limitation – current code assistants lack an internal model of hardware‑specific microcode and compiler optimizations that affect FLOP counts.
  • Open‑source release – the full benchmark, annotations, and evaluation scripts are publicly available for the community to build better performance‑aware LLM tools.
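
The exact annotation schema is documented in the released repository; purely as a hedged sketch (field names below are illustrative, not the paper’s), each benchmark entry conceptually pairs a kernel’s source with its profiled ground truth and attribute flags:

```cuda
// Hypothetical sketch of a gpuFLOPBench-style record; field names are
// illustrative only -- consult the released dataset for the real schema.
#include <cstdint>
#include <string>

struct KernelRecord {
    std::string source;        // CUDA kernel source shown to the model
    std::uint64_t fp32_flops;  // profiled single-precision ground truth
    std::uint64_t fp64_flops;  // profiled double-precision ground truth
    bool attrs[8];             // eight binary execution attributes
                               // (e.g. has_division, uses_math_intrinsics)
                               // used to flag "easy" vs. "hard" kernels
};
```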

Methodology

  1. Kernel selection & annotation – The authors curated 577 CUDA kernels spanning a range of computational patterns (matrix multiplies, stencil codes, reductions, etc.). Each kernel was profiled on an NVIDIA GPU to obtain exact FLOP counts for both single‑ and double‑precision arithmetic.
  2. Attribute tagging – Eight binary attributes capture aspects such as the presence of division, use of intrinsic math functions (e.g., sin, exp), and reliance on compiler‑generated code. These tags help separate kernels that are analytically tractable from those that depend on hidden runtime behavior; a minimal example of a tractable kernel follows this list.
  3. Prompt design – For each kernel, a prompt containing the source code (or a trimmed excerpt) asks the LLM to (a) decide if the FLOP count can be derived statically and (b) output the estimated count.
  4. Scoring – Classification accuracy measures whether the model correctly flags “easy” vs. “hard” kernels. For numeric predictions, the authors compute absolute and relative error, and they also track orders‑of‑magnitude deviations.
  5. Model suite – The benchmark is run on several leading closed‑source reasoning LLMs (e.g., GPT‑4‑Turbo, Claude 3, Gemini Pro) using their default temperature and chain‑of‑thought prompting settings.
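
To make step 3 concrete, consider a minimal, analytically tractable kernel (written here for illustration; not necessarily part of HeCBench). Its FLOP count follows directly from the launch size, which is the kind of static reasoning the prompt asks the model to perform before it emits a numeric estimate:

```cuda
// Illustrative "easy" kernel: one multiply + one add per element.
// For a launch covering n elements, the single-precision FLOP count is
// exactly 2 * n (the compiler fuses the pair into one FMA, which
// profilers conventionally count as 2 FLOPs) -- no execution required.
__global__ void saxpy(int n, float a, const float* __restrict__ x,
                      float* __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // 1 mul + 1 add = 2 FLOPs per element
    }
}
```

A model that handles this case well would classify the kernel as statically derivable and answer 2·n for the given problem size; that numeric answer is then scored against the profiled ground truth using the error metrics from step 4.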

Results & Findings

  • Perfect classification on easy kernels – All evaluated models correctly identified kernels whose FLOP count can be derived by simple arithmetic inspection.
  • Large errors on hard kernels – When a kernel’s FLOP count hinges on hidden compiler transformations (e.g., division turned into a sequence of multiply‑adds, or math intrinsics that expand to multiple operations), the models’ predictions were off by 1–3 orders of magnitude on average.
  • Systematic blind spots – The most frequent failure modes (illustrated in the sketch after this list) involved:
    • Division operations (often compiled into reciprocal‑multiply sequences).
    • Intrinsic functions (__sinf, __expf) that map to hardware microcode with variable FLOP costs.
    • Common subexpression elimination that changes the apparent operation count.
  • No model consistently outperformed the others – While newer models showed modest improvements, the gap between “trivial” and “implicit” kernels remained stark.
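
For contrast with the easy case above, here is a hedged illustration (again not drawn from the benchmark) of a kernel whose source‑level operation count diverges from the FLOPs the hardware actually executes, combining two of the failure modes listed:

```cuda
// Illustrative "hard" kernel: the true FLOP count depends on how the
// compiler lowers the __expf intrinsic and the floating-point division.
__global__ void exp_normalize(int n, const float* __restrict__ x,
                              float* __restrict__ y, float denom) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The source suggests ~2 operations per element, but:
        //  - __expf is served by the GPU's special-function unit plus a
        //    scaling multiply, so its counted FLOP cost is not simply 1;
        //  - the division is typically expanded into a reciprocal
        //    approximation refined by multiply-add steps.
        y[i] = __expf(x[i]) / denom;
    }
}
```

A naive source‑level count therefore undershoots what a profiler reports, which is the regime where the evaluated models drift by orders of magnitude.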

Practical Implications

  • Early performance estimation – Developers could use an LLM‑powered assistant to get a quick sanity check on FLOP intensity before writing or profiling code, saving time in the design phase. The benchmark shows current assistants are reliable only for straightforward kernels.
  • Tooling for compiler‑aware assistants – To be truly useful, future code assistants must embed a lightweight model of GPU compiler pipelines (e.g., PTX generation, intrinsic expansion). Integrating such knowledge could enable more accurate FLOP predictions and better guidance on algorithmic choices.
  • Hardware procurement & scheduling – Accurate FLOP estimates help teams size GPUs, plan cloud budgets, and schedule workloads; a back‑of‑envelope sketch follows this list. An LLM that can reason about FLOPs could become a “performance copilot” for data‑center operators.
  • Benchmark as a development target – gpuFLOPBench provides a concrete, reproducible test for any new LLM or plugin that claims performance‑aware code generation, encouraging the community to iterate toward more hardware‑savvy models.
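
As a back‑of‑envelope illustration of that procurement and scheduling use case (numbers below are hypothetical and not from the paper), a FLOP estimate divided by a device’s peak throughput gives an optimistic lower bound on kernel runtime:

```cuda
// Hypothetical host-side helper: optimistic runtime lower bound from a
// FLOP estimate. peak_tflops is the device's advertised peak throughput
// (e.g. ~19.5 FP32 TFLOP/s for an A100); real kernels rarely reach it.
double min_runtime_seconds(double estimated_flops, double peak_tflops) {
    return estimated_flops / (peak_tflops * 1e12);
}
// Example: 2e12 FLOPs at 19.5 TFLOP/s -> at best ~0.10 s per launch.
```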

Limitations & Future Work

  • Closed‑source model focus – The study evaluates only proprietary LLMs; open‑source alternatives may behave differently but were not included.
  • Single target architecture – The benchmark’s ground‑truth counts assume one GPU architecture; extending to multi‑GPU setups or other architectures (e.g., Hopper, Ada) would require additional profiling.
  • Scope of kernels – While 577 kernels are diverse, they are still drawn from a benchmark suite and may not capture all edge‑case patterns seen in production codebases.
  • Future directions suggested by the authors include:
    • Building hybrid models that combine LLM reasoning with symbolic static analysis or compiler IR inspection.
    • Expanding the dataset to cover other performance metrics (memory bandwidth, occupancy).
    • Open‑sourcing the evaluation pipeline to enable community‑driven leaderboards.

If you’re interested in trying gpuFLOPBench yourself, the repository is available at https://github.com/Scientific-Computing-Lab/gpuFLOPBench.

Authors

  • Gregory Bolet
  • Giorgis Georgakoudis
  • Konstantinos Parasyris
  • Harshitha Menon
  • Niranjan Hasabnis
  • Kirk W. Cameron
  • Gal Oren

Paper Information

  • arXiv ID: 2512.04355v1
  • Categories: cs.DC, cs.AI, cs.PF
  • Published: December 4, 2025