[Paper] TeleSWEBench: A Commit-Driven Benchmark for Evaluating LLM-Powered Software Engineering in Telecommunications

Published: June 3, 2026 at 11:19 AM EDT

Source: arXiv - 2606.05001v1

Overview

The paper presents TeleSWEBench, the first commit‑driven benchmark for evaluating how well software‑engineering agents powered by large language models (LLMs) handle real‑world telecom codebases such as the open‑source 5G stack srsRAN. By turning actual developer commits into testable tasks, the authors expose a gap in existing coding benchmarks, which rarely capture the stateful, mathematically intensive logic that telecom software demands.

Key Contributions

  • Domain‑specific benchmark: 734 realistic “commit‑style” questions extracted from the srsRAN 5G repository, organized into Easy, Medium, and Difficult tiers.
  • Executable validation: Each question ships with a unit‑test suite that can be run automatically to check functional correctness.
  • Hierarchical judging system (TeleJudge): A two‑level LLM‑based evaluator that scores changes at the file level and then aggregates the per‑file verdicts, complementing traditional unit‑test outcomes with semantic similarity and context awareness.
  • Comprehensive evaluation: Benchmarks three open‑source automated software‑engineering (ASE) agents (AIDER, OpenHands, ClaudeCode) across six LLM back‑ends, including Qwen‑3, GPT‑OSS, Gemma‑4, Kimi, and QwenEncoder‑2.5.
  • Empirical insight: Even the best agent configuration yields “shippable” changes for only about 25 % of tasks, exposing weaknesses in localization (finding the right file) and functional correctness.

Methodology

  1. Data Mining – The authors mined the Git history of the srsRAN 5G codebase, selecting commits that modify a single logical unit (e.g., a function, a configuration file).
  2. Task Generation – Each commit is transformed into a self‑contained prompt: a natural‑language description of the change, the pre‑commit code snapshot, and any required context (headers, build scripts).
  3. Difficulty Stratification – Tasks are manually labeled Easy/Medium/Difficult based on factors such as code size, number of files touched, and reliance on telecom‑specific math (e.g., signal‑processing formulas).
  4. Executable Ground Truth – For every task, a unit‑test suite is generated from the post‑commit code, providing an objective pass/fail metric.
  5. TeleJudge Evaluation – A hierarchical LLM judge first checks whether the agent edited the correct file(s) (localization), then compares the agent’s diff against the ground‑truth diff using semantic similarity and context‑aware scoring. The final score is a weighted blend of TeleJudge verdicts and unit‑test results.
  6. Agent Runs – Each ASE agent is invoked with the same prompt, using a variety of underlying reasoning LLMs. The agents produce a diff, which is fed into the evaluation pipeline.
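
The localization check and the weighted blend in step 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the file paths, and the 0.5 weight are assumptions made for illustration, since this summary gives neither the formula nor the weights.

```python
from dataclasses import dataclass


@dataclass
class LocalizationResult:
    precision: float  # share of edited files that the ground-truth commit also touched
    recall: float     # share of ground-truth files that the agent edited


def localization(edited: set[str], truth: set[str]) -> LocalizationResult:
    """Compare the files an agent edited with the files in the original commit."""
    hits = len(edited & truth)
    precision = hits / len(edited) if edited else 0.0
    recall = hits / len(truth) if truth else 0.0
    return LocalizationResult(precision, recall)


def blended_score(judge_score: float, tests_passed: int, tests_total: int,
                  judge_weight: float = 0.5) -> float:
    """Weighted blend of a judge verdict in [0, 1] and the unit-test pass rate.

    judge_weight = 0.5 is an assumed placeholder, not a value from the paper.
    """
    if not 0.0 <= judge_weight <= 1.0:
        raise ValueError("judge_weight must be in [0, 1]")
    test_rate = tests_passed / tests_total if tests_total else 0.0
    return judge_weight * judge_score + (1.0 - judge_weight) * test_rate


# Hypothetical file paths: the agent edited one correct file and one extra file.
loc = localization({"lib/mac/sched.cpp", "lib/mac/util.cpp"}, {"lib/mac/sched.cpp"})
score = blended_score(judge_score=0.8, tests_passed=3, tests_total=4)
```

In this example the agent found the right file (recall 1.0) but also touched an unrelated one (precision 0.5). Reporting both numbers separates "missed a needed file" from "edited too much".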

Results & Findings

Agent (LLM)          | Overall Shippable Rate* | Localization Accuracy | Functional Correctness
AIDER (Qwen‑3)       | 24.8 %                  | 38 %                  | 22 %
OpenHands (GPT‑OSS)  | 19.3 %                  | 31 %                  | 18 %
ClaudeCode (Gemma‑4) | 21.5 %                  | 35 %                  | 20 %

*Shippable = passes both TeleJudge and unit‑test criteria.

  • Difficulty matters: The Easy tier reaches a shippable rate of about 45 %, while the Difficult tier drops below 10 %.
  • Localization is the bottleneck: Agents frequently edit the wrong file or miss ancillary files needed for a successful build.
  • Functional correctness lags: Even when the correct file is edited, generated code often fails to satisfy the strict numerical constraints of telecom algorithms.
  • Two‑stage evaluation matters: Pure unit‑test scores underestimate failures; TeleJudge catches many “semantic” mismatches that unit tests miss.
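
To make the "shippable" definition and the per‑tier breakdown concrete, here is a small sketch of how such rates could be aggregated. The records are invented for illustration and are not data from the paper. The tuple layout is also an assumption.

```python
from collections import defaultdict

# Hypothetical per-task records: (tier, passed_telejudge, passed_unit_tests).
results = [
    ("Easy", True, True),
    ("Easy", True, False),
    ("Medium", False, True),
    ("Difficult", False, False),
    ("Difficult", True, True),
]


def shippable_rate_by_tier(records):
    """Return {tier: fraction of tasks passing both TeleJudge and unit tests}."""
    totals = defaultdict(int)
    shipped = defaultdict(int)
    for tier, judge_ok, tests_ok in records:
        totals[tier] += 1
        if judge_ok and tests_ok:
            shipped[tier] += 1
    return {tier: shipped[tier] / totals[tier] for tier in totals}


rates = shippable_rate_by_tier(results)

# Compare a unit-test-only pass rate with the stricter two-stage criterion.
tests_only = sum(t for _, _, t in results) / len(results)
both = sum(j and t for _, j, t in results) / len(results)
```

With these made‑up records, the unit‑test‑only rate is higher than the two‑stage rate. This is the kind of gap the two‑stage evaluation is meant to surface.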

Practical Implications

  • Tooling for telecom operators – ASE agents can already automate a non‑trivial slice of routine code updates (e.g., configuration tweaks, boilerplate refactors), potentially reducing manual effort in O‑RAN and AI‑RAN deployments.
  • CI/CD integration – TeleJudge’s file‑level scoring can be wrapped into a CI gate, allowing teams to automatically accept LLM‑generated patches that meet both functional and localization thresholds.
  • Benchmark‑driven development – Vendors of LLM‑powered coding assistants now have a concrete, domain‑specific yardstick to track improvements, encouraging targeted fine‑tuning on telecom code.
  • Safety‑critical code – The low shippable rates underscore that for core PHY/MAC layers, human review remains essential; however, the benchmark can be used to pre‑filter low‑quality suggestions, saving reviewer time.
  • Open‑source contributions – Contributors to projects like srsRAN could leverage TeleSWEBench to test community‑built bots before merging, fostering a healthier ecosystem of automated contributors.
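
A CI gate along the lines described above might look like the following sketch. Everything in it is an assumption for illustration: the `PatchReport` fields, the 0.9 threshold, and the `lib/phy/` and `lib/mac/` prefixes are hypothetical and do not come from the paper or from the srsRAN repository layout.

```python
from dataclasses import dataclass

# Hypothetical path prefixes for safety-critical layers; adjust to the real repo layout.
REVIEW_REQUIRED_PREFIXES = ("lib/phy/", "lib/mac/")


@dataclass
class PatchReport:
    changed_files: list[str]
    localization_score: float  # assumed judge output in [0, 1]
    tests_passed: bool


def gate(report: PatchReport, min_localization: float = 0.9) -> tuple[str, list[str]]:
    """Return ("accept" | "review" | "reject", reasons) for an agent-generated patch."""
    reasons = []
    if not report.tests_passed:
        reasons.append("unit tests failed")
    if report.localization_score < min_localization:
        reasons.append("localization score below threshold")
    if reasons:
        return "reject", reasons
    # Patches that pass both checks but touch critical layers still go to a human.
    critical = [f for f in report.changed_files if f.startswith(REVIEW_REQUIRED_PREFIXES)]
    if critical:
        return "review", [f"touches safety-critical path: {f}" for f in critical]
    return "accept", []


decision, why = gate(PatchReport(["configs/gnb.yaml"], 0.95, True))
```

Routing passing patches on critical paths to "review" rather than "accept" keeps human sign‑off for PHY/MAC code while still filtering out patches that fail the basic checks.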

Limitations & Future Work

  • Scope limited to a single codebase – While srsRAN is representative, other telecom stacks (e.g., OpenAirInterface, commercial OSS) may exhibit different patterns.
  • Commit granularity – The benchmark focuses on single‑commit changes; multi‑commit feature implementations are not covered.
  • Evaluation bias – TeleJudge relies on LLMs for semantic scoring, so its verdicts may inherit the biases of the underlying models.
  • Hardware‑specific validation – Unit tests run in a simulated environment; real‑world radio‑hardware constraints (timing, memory footprints) are not captured.
  • Future directions – Expanding to multi‑repo, multi‑language (C++, Python, Rust) telecom projects; incorporating performance‑oriented metrics (latency, throughput); and exploring fine‑tuning strategies that improve localization accuracy.

Authors

  • Pranshav Gajjar
  • Ali Mamaghani
  • Dinesh Bharadia
  • Vijay K Shah

Paper Information

  • arXiv ID: 2606.05001v1
  • Categories: cs.SE
  • Published: June 3, 2026