[Paper] Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming

Published: January 5, 2026 at 07:33 AM EST
4 min read
Source: arXiv - 2601.02060v1

Overview

The paper presents FPEval, a new evaluation framework, together with FPBench, a curated benchmark, for measuring how well state‑of‑the‑art large language models (LLMs) generate code in functional programming (FP) languages. By testing GPT‑3.5, GPT‑4o, and GPT‑5 on 721 tasks in Haskell, OCaml, and Scala, the authors show that while LLMs are improving at FP, they still trail their own performance on imperative code, lag furthest on the pure functional languages, and often produce code that looks "imperative" rather than idiomatic FP.

Key Contributions

  • FPBench: a curated suite of 721 programming problems spanning three difficulty tiers for three mainstream FP languages (Haskell, OCaml, Scala).
  • FPEval framework: combines automated test‑case validation, static‑analysis checks, and style/maintainability metrics to give a holistic view of generated code quality.
  • Comprehensive LLM evaluation: systematic comparison of GPT‑3.5, GPT‑4o, and GPT‑5 on FP tasks, with Java as an imperative baseline.
  • Insight into error patterns: identification of higher error rates in pure FP languages and frequent “imperative‑style” code in FP outputs.
  • Self‑repair experiments: demonstration that LLMs can partially fix correctness and style issues when fed static‑analysis feedback and targeted prompts.

Methodology

  1. Benchmark construction – The authors collected real‑world snippets, textbook exercises, and open‑source challenges, then classified each into easy, medium, or hard difficulty.
  2. Evaluation pipeline – For every task, an LLM is prompted to write a solution. The generated code is:
    • Executed against a hidden test suite to verify functional correctness.
    • Run through language‑specific static‑analysis and formatting tools (e.g., HLint for Haskell, ocamlformat for OCaml, Scalafmt for Scala) to detect style violations, unused imports, and anti‑idiomatic patterns.
    • Scored on a composite metric that weights correctness, style adherence, and maintainability (a minimal Scala sketch of such a score follows this list).
  3. Baseline comparison – The same pipeline is applied to Java solutions to highlight differences between imperative and functional paradigms.
  4. Self‑repair loop – After the first pass, the system feeds the static‑analysis report back to the LLM with a short instruction (“please fix the style issues”) and measures the improvement.
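
The paper does not spell out the exact weighting behind the composite metric, so the Scala snippet below is only a minimal sketch of how such a score could be assembled; the TaskResult fields and the weights are illustrative assumptions, not FPEval's actual implementation.

```scala
// Minimal sketch of a composite quality score (illustrative only).
// Assumes correctness, style, and maintainability are each normalised to [0, 1];
// the weights below are placeholders, not the paper's actual values.
final case class TaskResult(
    testsPassed: Int,        // hidden test cases passed
    testsTotal: Int,         // hidden test cases overall
    styleViolations: Int,    // findings reported by the language's linter
    styleBudget: Int,        // violation count treated as "fully non-idiomatic"
    maintainability: Double  // 0.0 (poor) .. 1.0 (good), tool-specific
)

object CompositeScore {
  private val (wCorrect, wStyle, wMaint) = (0.6, 0.25, 0.15) // assumed weights

  def score(r: TaskResult): Double = {
    val correctness =
      if (r.testsTotal == 0) 0.0 else r.testsPassed.toDouble / r.testsTotal
    val style =
      1.0 - math.min(1.0, r.styleViolations.toDouble / math.max(1, r.styleBudget))
    wCorrect * correctness + wStyle * style + wMaint * r.maintainability
  }
}
```

A real pipeline would populate testsPassed from the hidden test suite (step 2a) and styleViolations from the language's linter (step 2b) before computing the score.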

The whole process is automated, reproducible, and open‑sourced, allowing other researchers or teams to plug in new models or additional languages.

Results & Findings

| Model | Avg. Correctness (FP) | Avg. Correctness (Java) | % of Non‑idiomatic FP |
| --- | --- | --- | --- |
| GPT‑3.5 | 48 % | 71 % | 62 % |
| GPT‑4o | 63 % | 84 % | 48 % |
| GPT‑5 | 78 % | 92 % | 35 % |

  • Performance gap: Even the best model (GPT‑5) solves only ~78 % of FP tasks, compared with >90 % on Java.
  • Language effect: Scala (a hybrid FP/OO language) sits between pure FP (Haskell, OCaml) and Java, suggesting that the degree of functional purity matters.
  • Style issues: A large share of generated FP code relies on mutable variables, explicit loops, or other imperative constructs, which hurts readability and long‑term maintainability (a Scala illustration follows this list).
  • Self‑repair gains: Providing static‑analysis feedback improves correctness by 5‑12 % and reduces non‑idiomatic patterns by ~20 % on average, but the models still leave residual style violations.
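
To make the "imperative‑style" finding concrete, here is a hand‑written Scala illustration (not an actual model output from the paper): both functions sum the squares of the even numbers in a list, but the first leans on `var` and a `while` loop, the kind of pattern the style checks penalise, while the second expresses the same computation idiomatically.

```scala
// Illustrative only; not generated by the models evaluated in the paper.

// Imperative-style Scala: passes the tests, but relies on mutation and
// an explicit loop rather than on the collection combinators.
def sumSquaresOfEvensImperative(xs: List[Int]): Int = {
  var total = 0
  var rest  = xs
  while (rest.nonEmpty) {
    val x = rest.head
    if (x % 2 == 0) total += x * x
    rest = rest.tail
  }
  total
}

// Idiomatic functional Scala: no mutation, intent reads directly from the pipeline.
def sumSquaresOfEvensIdiomatic(xs: List[Int]): Int =
  xs.filter(_ % 2 == 0).map(x => x * x).sum
```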

Practical Implications

  • Developer tooling: IDE plugins that invoke LLMs for FP code suggestions should integrate static‑analysis feedback loops to catch and correct style violations automatically (a sketch of such a loop follows this list).
  • Onboarding & education: Companies looking to lower the FP learning curve can use LLM‑assisted code generation as a “pair‑programming” aid, but must pair it with linting and code‑review processes to enforce idiomatic practices.
  • Productivity estimation: Teams can expect a modest boost in prototype speed for FP projects, yet should budget extra time for manual refactoring to achieve production‑grade quality.
  • Model selection: When the target language is pure functional, opting for the latest LLM (e.g., GPT‑5) yields noticeable gains, but the cost‑benefit trade‑off should be weighed against the remaining error rate.
  • Benchmarking standards: FPEval sets a template for future “code‑generation” benchmarks that go beyond pass/fail tests, encouraging vendors to improve style‑aware generation.
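
As a rough sketch of what a lint‑feedback loop might look like inside such tooling: the Scala snippet below assembles a repair prompt from linter diagnostics. The `LintFinding` type, the prompt wording, and the `queryModel` stand‑in are all assumptions for illustration, not FPEval's actual interface.

```scala
// Hypothetical single-iteration repair loop; names and prompt wording are
// assumptions for illustration, not taken from the paper.
final case class LintFinding(line: Int, rule: String, message: String)

def buildRepairPrompt(source: String, findings: Seq[LintFinding]): String = {
  val report = findings
    .map(f => s"- line ${f.line} [${f.rule}]: ${f.message}")
    .mkString("\n")
  s"""The following code passes its tests but has style issues.
     |Please fix the style issues without changing its behaviour.
     |
     |Code:
     |$source
     |
     |Static-analysis report:
     |$report
     |""".stripMargin
}

// Usage: `queryModel` stands in for whatever LLM client the plugin uses.
// val repairedCode = queryModel(buildRepairPrompt(generatedCode, lintFindings))
```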

Limitations & Future Work

  • Benchmark scope: While 721 tasks are sizable, they still focus on algorithmic problems; real‑world FP codebases (e.g., concurrent pipelines, effect systems) are not covered.
  • Static analysis granularity: The current style metrics are based on existing linters, which may miss deeper semantic idioms (e.g., proper use of monads).
  • Model diversity: Only OpenAI’s GPT series were evaluated; other architectures (e.g., Claude, LLaMA‑based models) could behave differently.
  • Self‑repair depth: The feedback loop is a single iteration; multi‑turn interactions or more sophisticated prompting might yield larger improvements.

Future research directions include expanding FPBench with large‑scale open‑source projects, incorporating functional‑specific correctness criteria (e.g., purity, referential transparency), and testing multi‑model ensembles that combine generation with dedicated style‑transfer networks.

Authors

  • Nguyet-Anh H. Lang
  • Eric Lang
  • Thanh Le-Cong
  • Bach Le
  • Quyet-Thang Huynh

Paper Information

  • arXiv ID: 2601.02060v1
  • Categories: cs.PL, cs.AI, cs.SE
  • Published: January 5, 2026
  • PDF: Download PDF