[Paper] CL4SE: A Context Learning Benchmark For Software Engineering Tasks

Published: February 26, 2026
4 min read
Source: arXiv - 2602.23047v1

Overview

The paper introduces CL4SE, the first systematic benchmark that measures how different “contexts” – extra information fed to a Large Language Model (LLM) at inference time – affect core software‑engineering tasks. By categorising SE‑specific context types and evaluating them on real‑world code, the authors show that context‑driven prompting can lift LLM performance by roughly 25 % without any model fine‑tuning.

Key Contributions

  • Fine‑grained taxonomy of SE contexts – four distinct types:
    1. Interpretable examples (few‑shot code snippets)
    2. Project‑specific context (files, APIs, naming conventions)
    3. Procedural decision‑making context (review guidelines, coding standards)
    4. Positive & negative context (contrastive examples of good/bad patches)
  • Task‑aligned benchmark mapping each context type to a representative SE task:
    • Code generation → interpretable examples
    • Code summarization → project‑specific context
    • Code review → procedural context
    • Patch correctness assessment → mixed positive‑negative context
  • Large, curated dataset: >13 k samples drawn from 30+ open‑source projects, covering diverse languages and domains.
  • Comprehensive evaluation: five popular LLMs (e.g., GPT‑Oss‑120B, Qwen3‑Max, DeepSeek‑V3) assessed on nine metrics (BLEU, PASS@k, accuracy, etc.).
  • Empirical insights: quantifies the per‑task gain from each context type, establishing best‑practice guidelines for prompt engineering in SE.
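The taxonomy and its task mapping can be sketched as a small schema. This is a hypothetical illustration of the benchmark's structure; the paper does not prescribe an implementation, and the field names here are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class ContextType(Enum):
    """The four SE context categories defined by CL4SE."""
    INTERPRETABLE_EXAMPLES = "few-shot code snippets"
    PROJECT_SPECIFIC = "files, APIs, naming conventions"
    PROCEDURAL = "review guidelines, coding standards"
    POSITIVE_NEGATIVE = "contrastive good/bad patches"

@dataclass
class ContextSample:
    """One benchmark sample: a task input paired with auxiliary context."""
    task: str                   # e.g. "code_generation"
    context_type: ContextType
    context_text: str           # the auxiliary information fed to the LLM
    task_input: str             # the code artifact (function, diff, ...)

# Task-to-context mapping from the benchmark design
TASK_CONTEXT_MAP = {
    "code_generation": ContextType.INTERPRETABLE_EXAMPLES,
    "code_summarization": ContextType.PROJECT_SPECIFIC,
    "code_review": ContextType.PROCEDURAL,
    "patch_correctness": ContextType.POSITIVE_NEGATIVE,
}
```

Keeping the context type explicit per sample is what lets the benchmark vary only that one factor across otherwise identical prompts.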

Methodology

  1. Context Taxonomy Design – The authors surveyed existing prompt‑engineering literature and SE workflows, then distilled four context categories that capture the most common sources of auxiliary information developers naturally use.
  2. Dataset Construction – For each of the four tasks, they extracted real code artifacts (functions, commit diffs, review comments) and paired them with the appropriate context. Quality control involved manual verification and automated consistency checks.
  3. Prompt Templates – Uniform prompt formats were created for each context‑task pair, ensuring that the only variable was the type of context supplied.
  4. Model Evaluation – Five off‑the‑shelf LLMs were queried with the same prompts. Performance was measured using task‑specific metrics (e.g., PASS@1 for generation, BLEU for summarization, binary accuracy for patch assessment).
  5. Statistical Analysis – Improvements were reported as absolute percentage gains over a “no‑context” baseline, with significance testing to rule out random variation.
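The paper's exact evaluation harness is not reproduced here, but PASS@k is conventionally computed with the unbiased estimator of Chen et al. (2021), which the benchmark's PASS@1 numbers presumably follow:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations, c of which
    pass the tests, is correct."""
    if n - c < k:       # fewer failures than draws: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the pass rate c / n:
# with 10 generations and 3 passing, pass@1 = 0.3
```

For code generation, "passing" means the candidate clears the task's unit tests; PASS@1 then reports the expected success rate of a single sample.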

Results & Findings

| Task | Context Type | Best-performing Model | Key Metric Gain |
| --- | --- | --- | --- |
| Code Generation | Interpretable examples | DeepSeek-V3 | +5.72 % PASS@1 |
| Code Summarization | Project-specific context | GPT-Oss-120B | +14.78 % BLEU |
| Code Review | Procedural context | Qwen3-Max | +33 % review accuracy |
| Patch Correctness | Mixed positive-negative | DeepSeek-V3 | +30 % assessment accuracy |
  • Overall impact: Adding the right context raised average performance by 24.7 % across all tasks and models.
  • Task‑specific sensitivity: Procedural context had the biggest boost for review‑style tasks, while project‑specific information mattered most for summarization.
  • Model variance: Larger open‑source models (GPT‑Oss‑120B) benefited more from project context, whereas newer instruction‑tuned models (Qwen3‑Max) excelled with procedural prompts.

Practical Implications

  • Prompt engineering becomes a first‑class tool for SE teams using LLMs: developers can achieve near‑fine‑tuned performance simply by curating the right context.
  • IDE and CI integration: Plugins can automatically inject project‑specific headers, coding‑standard snippets, or contrastive examples into LLM calls, improving code suggestions, automated reviews, and patch validation without extra training cycles.
  • Cost‑effective scaling: Since context learning works at inference time, organisations can keep model sizes modest while still reaping large accuracy gains, reducing compute expenses.
  • Dataset as a resource: The released 13 k‑sample benchmark can serve as a test suite for new LLMs, prompting strategies, or even as a training set for context‑aware adapters.

Limitations & Future Work

  • Context quality dependency – The study assumes high‑quality, manually curated context; noisy or outdated context could degrade performance.
  • Language coverage – While the dataset spans several languages, the majority are Java/Python; extending to systems languages (C/C++) or domain‑specific DSLs remains open.
  • Dynamic contexts – Current prompts are static; exploring runtime‑generated context (e.g., live AST analysis) could further boost results.
  • Human‑in‑the‑loop evaluation – The benchmark relies on automated metrics; user studies would clarify how context‑enhanced outputs affect developer productivity and trust.

CL4SE paves the way for a more disciplined, data‑driven approach to prompt engineering in software engineering, turning LLMs into adaptable, context‑aware assistants that can be deployed today with minimal overhead.

Authors

  • Haichuan Hu
  • Ye Shang
  • Guoqing Xie
  • Congqing He
  • Quanjun Zhang

Paper Information

  • arXiv ID: 2602.23047v1
  • Categories: cs.SE
  • Published: February 26, 2026