[Paper] CL4SE: A Context Learning Benchmark For Software Engineering Tasks

Published: February 26, 2026
4 min read
Source: arXiv - 2602.23047v1

Overview

The paper introduces CL4SE, the first systematic benchmark that measures how different “contexts” – extra information fed to a Large Language Model (LLM) at inference time – affect core software‑engineering tasks. By categorising SE‑specific context types and evaluating them on real‑world code, the authors show that context‑driven prompting can lift LLM performance by roughly 25 % without any model fine‑tuning.

Key Contributions

  • Fine‑grained taxonomy of SE contexts – four distinct types:
    1. Interpretable examples (few‑shot code snippets)
    2. Project‑specific context (files, APIs, naming conventions)
    3. Procedural decision‑making context (review guidelines, coding standards)
    4. Positive & negative context (contrastive examples of good/bad patches)
  • Task‑aligned benchmark mapping each context type to a representative SE task:
    • Code generation → interpretable examples
    • Code summarization → project‑specific context
    • Code review → procedural context
    • Patch correctness assessment → mixed positive‑negative context
  • Large, curated dataset: >13 k samples drawn from 30+ open‑source projects, covering diverse languages and domains.
  • Comprehensive evaluation: five popular LLMs (e.g., GPT‑Oss‑120B, Qwen3‑Max, DeepSeek‑V3) assessed on nine metrics (BLEU, PASS@k, accuracy, etc.).
  • Empirical insights: quantifies the per‑task gain from each context type, establishing best‑practice guidelines for prompt engineering in SE.
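The taxonomy and its task mapping can be sketched as a small schema. This is a hypothetical illustration of the benchmark's structure; the paper does not prescribe an implementation, and the field names here are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class ContextType(Enum):
    """The four SE context categories defined by CL4SE."""
    INTERPRETABLE_EXAMPLES = "few-shot code snippets"
    PROJECT_SPECIFIC = "files, APIs, naming conventions"
    PROCEDURAL = "review guidelines, coding standards"
    POSITIVE_NEGATIVE = "contrastive good/bad patches"

@dataclass
class ContextSample:
    """One benchmark sample: a task input paired with auxiliary context."""
    task: str                   # e.g. "code_generation"
    context_type: ContextType
    context_text: str           # the auxiliary information fed to the LLM
    task_input: str             # the code artifact (function, diff, ...)

# Task-to-context mapping from the benchmark design
TASK_CONTEXT_MAP = {
    "code_generation": ContextType.INTERPRETABLE_EXAMPLES,
    "code_summarization": ContextType.PROJECT_SPECIFIC,
    "code_review": ContextType.PROCEDURAL,
    "patch_correctness": ContextType.POSITIVE_NEGATIVE,
}
```

Keeping the context type explicit per sample is what lets the benchmark vary only that one factor across otherwise identical prompts.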

Methodology

  1. Context Taxonomy Design – The authors surveyed existing prompt‑engineering literature and SE workflows, then distilled four context categories that capture the most common sources of auxiliary information developers naturally use.
  2. Dataset Construction – For each of the four tasks, they extracted real code artifacts (functions, commit diffs, review comments) and paired them with the appropriate context. Quality control involved manual verification and automated consistency checks.
  3. Prompt Templates – Uniform prompt formats were created for each context‑task pair, ensuring that the only variable was the type of context supplied.
  4. Model Evaluation – Five off‑the‑shelf LLMs were queried with the same prompts. Performance was measured using task‑specific metrics (e.g., PASS@1 for generation, BLEU for summarization, binary accuracy for patch assessment).
  5. Statistical Analysis – Improvements were reported as absolute percentage gains over a “no‑context” baseline, with significance testing to rule out random variation.
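The paper's exact evaluation harness is not reproduced here, but PASS@k is conventionally computed with the unbiased estimator of Chen et al. (2021), which the benchmark's PASS@1 numbers presumably follow:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations, c of which
    pass the tests, is correct."""
    if n - c < k:       # fewer failures than draws: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the pass rate c / n:
# with 10 generations and 3 passing, pass@1 = 0.3
```

For code generation, "passing" means the candidate clears the task's unit tests; PASS@1 then reports the expected success rate of a single sample.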

Results & Findings

| Task | Context Type | Best-performing Model | Key Metric Gain |
| --- | --- | --- | --- |
| Code Generation | Interpretable examples | DeepSeek-V3 | +5.72 % PASS@1 |
| Code Summarization | Project-specific context | GPT-Oss-120B | +14.78 % BLEU |
| Code Review | Procedural context | Qwen3-Max | +33 % review accuracy |
| Patch Correctness | Mixed positive-negative | DeepSeek-V3 | +30 % assessment accuracy |
  • Overall impact: Adding the right context raised average performance by 24.7 % across all tasks and models.
  • Task‑specific sensitivity: Procedural context had the biggest boost for review‑style tasks, while project‑specific information mattered most for summarization.
  • Model variance: Larger open‑source models (GPT‑Oss‑120B) benefited more from project context, whereas newer instruction‑tuned models (Qwen3‑Max) excelled with procedural prompts.

Practical Implications

  • Prompt engineering becomes a first‑class tool for SE teams using LLMs: developers can achieve near‑fine‑tuned performance simply by curating the right context.
  • IDE and CI integration: Plugins can automatically inject project‑specific headers, coding‑standard snippets, or contrastive examples into LLM calls, improving code suggestions, automated reviews, and patch validation without extra training cycles.
  • Cost‑effective scaling: Since context learning works at inference time, organisations can keep model sizes modest while still reaping large accuracy gains, reducing compute expenses.
  • Dataset as a resource: The released 13 k‑sample benchmark can serve as a test suite for new LLMs, prompting strategies, or even as a training set for context‑aware adapters.

Limitations & Future Work

  • Context quality dependency – The study assumes high‑quality, manually curated context; noisy or outdated context could degrade performance.
  • Language coverage – While the dataset spans several languages, the majority are Java/Python; extending to systems languages (C/C++) or domain‑specific DSLs remains open.
  • Dynamic contexts – Current prompts are static; exploring runtime‑generated context (e.g., live AST analysis) could further boost results.
  • Human‑in‑the‑loop evaluation – The benchmark relies on automated metrics; user studies would clarify how context‑enhanced outputs affect developer productivity and trust.

CL4SE paves the way for a more disciplined, data‑driven approach to prompt engineering in software engineering, turning LLMs into adaptable, context‑aware assistants that can be deployed today with minimal overhead.

Authors

  • Haichuan Hu
  • Ye Shang
  • Guoqing Xie
  • Congqing He
  • Quanjun Zhang

Paper Information

  • arXiv ID: 2602.23047v1
  • Categories: cs.SE
  • Published: February 26, 2026