[Paper] Agentic Rubrics as Contextual Verifiers for SWE Agents

Published: January 7, 2026 at 01:38 PM EST
4 min read

Source: arXiv - 2601.04171v1

Overview

The paper introduces Agentic Rubrics, a novel way to verify software‑engineering (SWE) code‑generation agents without running the code. By letting an “expert” LLM explore the target repository and produce a context‑aware checklist, the system can score candidate patches quickly and at scale, delivering a measurable boost over existing verification tricks such as test execution or heuristic classifiers.

Key Contributions

  • Agentic Rubrics framework: an LLM‑driven pipeline that creates a repository‑specific rubric and uses it to evaluate patches without executing them.
  • Scalable verification: achieves strong test‑time scaling (TTS) results on the SWE‑Bench Verified benchmark while avoiding costly environment setup.
  • Empirical gains: improves pass‑rate by +3.5 pp over the strongest baseline, reaching 54.2 % on Qwen3‑Coder‑30B‑A3B and 40.6 % on Qwen3‑32B.
  • Interpretability: rubric scores correlate with ground‑truth test outcomes and surface failure modes that tests miss, offering richer diagnostic feedback.
  • Ablation insights: demonstrates that the “agentic” context‑gathering step is crucial for producing unambiguous, codebase‑specific criteria.

Methodology

  1. Contextual Exploration – An expert LLM (the “rubric‑author”) is prompted to browse the target repository: reading the README, existing code, build scripts, and any documentation. From this it extracts the semantic intent of the module it will later evaluate.
  2. Rubric Generation – From this exploration the agent produces a checklist of concrete, verifiable properties (e.g., “function must preserve existing API signature”, “no new import of os”, “maintains backward‑compatible return type”). The checklist is deliberately phrased to be machine‑checkable (e.g., via static analysis or simple pattern matching); a minimal sketch of such checks and their aggregation follows this list.
  3. Patch Scoring – When a candidate patch is generated by a SWE agent, the rubric is applied automatically. Each rubric item yields a binary (or graded) score; the aggregate forms the final verification signal.
  4. Parallel Test‑Time Scaling (TTS) – The rubric evaluation runs in parallel across many patches, sidestepping the serial bottleneck of spinning up test environments (a parallel‑scoring sketch also appears after this list).
  5. Evaluation – The authors benchmarked the pipeline on SWE‑Bench Verified, comparing against (a) raw test execution, (b) heuristic patch classifiers, and (c) prior static‑analysis‑only baselines.
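
The paper does not spell out a rubric schema, so the following is a minimal sketch of steps 2–3 under one plausible assumption: each rubric item is a machine‑checkable predicate over the unified diff of a candidate patch. The `RubricItem` class, the two example checks, and the equal default weights are illustrative stand‑ins, not the authors' implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricItem:
    """One machine-checkable criterion produced by the rubric-author agent."""
    description: str
    check: Callable[[str], bool]   # takes the unified-diff text, returns pass/fail
    weight: float = 1.0


def no_new_os_import(patch: str) -> bool:
    # Added lines in a unified diff start with '+'; flag any newly added `import os`.
    return not re.search(r"^\+\s*(import os\b|from os\b)", patch, re.MULTILINE)


def no_removed_public_defs(patch: str) -> bool:
    # Crude proxy for "preserve the existing API signature": the diff removes no
    # public (non-underscore-prefixed) function definitions.
    return not re.search(r"^-\s*def [A-Za-z]\w*\(", patch, re.MULTILINE)


# In the paper the checklist is generated per repository by the expert LLM;
# these two items are illustrative stand-ins.
RUBRIC = [
    RubricItem("No new import of os", no_new_os_import),
    RubricItem("Preserves existing public API signatures", no_removed_public_defs),
]


def score_patch(patch: str, rubric: list[RubricItem]) -> float:
    """Aggregate per-item pass/fail results into a verification score in [0, 1]."""
    total = sum(item.weight for item in rubric)
    passed = sum(item.weight for item in rubric if item.check(patch))
    return passed / total
```

Graded (non‑binary) items, which the paper also allows, would return a value in [0, 1] instead of a boolean before weighting.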

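Because each rubric check is a pure function of the patch text, scoring parallelizes trivially, which is where the test‑time‑scaling benefit of step 4 comes from. A minimal sketch, reusing the hypothetical `score_patch` and `RUBRIC` from the previous block:

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidates(patches: list[str]) -> list[float]:
    """Score many candidate patches concurrently; no test environment is built."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(lambda patch: score_patch(patch, RUBRIC), patches))


# Surface the best-scoring candidate (e.g., for best-of-n test-time scaling).
# best_patch = max(candidate_patches, key=lambda p: score_patch(p, RUBRIC))
```
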
Results & Findings

| Model | Baseline Pass@1* | Agentic Rubrics Pass@1 |
| --- | --- | --- |
| Qwen3‑Coder‑30B‑A3B | 50.7 % | 54.2 % |
| Qwen3‑32B | 37.1 % | 40.6 % |

*Baseline refers to the strongest non‑rubric method in the paper’s comparison set.

  • Consistency: Rubric scores align with actual test outcomes in > 90 % of cases, confirming that the checklist captures the core correctness criteria.
  • Additional Insight: In ~12 % of evaluated patches the rubric flagged issues (e.g., security‑related imports, style violations) that the test suite missed, suggesting a complementary safety net.
  • Ablation: Removing the context‑gathering step (i.e., using a generic rubric) dropped performance by ~6 pp, underscoring the need for repository‑specific knowledge.

Practical Implications

  • Faster CI pipelines – Teams can plug Agentic Rubrics into continuous integration to get an instant “sanity‑check” score for AI‑generated patches, reserving expensive test runs for only the highest‑scoring candidates (see the gating sketch after this list).
  • Reduced infrastructure cost – No need to spin up containers, mock services, or provision databases for every patch, which is especially valuable for large monorepos or legacy codebases with heavyweight build steps.
  • Better developer trust – The rubric’s human‑readable checklist gives developers a clear rationale for why a patch is accepted or rejected, easing the hand‑off between AI and human reviewers.
  • Security & compliance – By encoding organization‑specific policies (e.g., “no new network sockets”, “must use approved logging library”) into the rubric, companies can enforce compliance automatically.
  • Extensible to other domains – The same “agentic context → rubric → score” pipeline could be adapted for data‑pipeline generation, infrastructure‑as‑code, or even LLM‑driven documentation updates.
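
As a concrete illustration of the CI and compliance bullets above, a gating step could apply the rubric first and reserve the full test suite for patches that clear a score threshold. This sketch reuses the hypothetical `RubricItem` and `score_patch` helpers from the Methodology section; the threshold value, the socket policy check, and `run_full_test_suite` are assumptions, not details from the paper.

```python
import re

SCORE_THRESHOLD = 0.8  # hypothetical cut-off; would be tuned per repository

# Organization-specific policy items can simply be appended to the generated rubric.
POLICY_RUBRIC = RUBRIC + [
    RubricItem(
        "No new network sockets",
        lambda patch: not re.search(
            r"^\+.*\b(import socket|from socket)\b", patch, re.MULTILINE
        ),
    ),
]


def ci_gate(candidate_patches: list[str]) -> list[str]:
    """Return only the candidates worth the cost of a full test run."""
    return [
        patch
        for patch in candidate_patches
        if score_patch(patch, POLICY_RUBRIC) >= SCORE_THRESHOLD
    ]


# for patch in ci_gate(candidate_patches):
#     run_full_test_suite(patch)  # hypothetical expensive step, reserved for top scorers
```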

Limitations & Future Work

  • Static‑only perspective – While rubrics avoid execution overhead, they cannot capture dynamic bugs (e.g., race conditions) that only manifest at runtime.
  • Rubric quality depends on LLM – If the expert agent misinterprets the repository, the resulting checklist may be incomplete or overly strict.
  • Scalability of rubric creation – Generating a fresh rubric per repository still requires an agentic LLM exploration pass; future work could explore caching or few‑shot prompting to amortize this cost.
  • Broader evaluation – The study focuses on SWE‑Bench Verified; testing on larger, more heterogeneous codebases (e.g., multi‑language polyglots) would strengthen claims of generality.

Bottom line: Agentic Rubrics offer a pragmatic, interpretable, and cost‑effective verification layer for AI‑assisted software development, and a valuable complement to traditional test‑driven validation in modern DevOps workflows.

Authors

  • Mohit Raghavendra
  • Anisha Gunjal
  • Bing Liu
  • Yunzhong He

Paper Information

  • arXiv ID: 2601.04171v1
  • Categories: cs.LG
  • Published: January 7, 2026