[Paper] Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas

Published: February 9, 2026 at 10:06 AM EST
4 min read
Source: arXiv - 2602.08765v1

Overview

Micah Villmow’s paper presents Scylla, a systematic framework for evaluating LLM‑powered coding assistants and multi‑agent pipelines. By measuring the Cost‑of‑Pass (CoP)—the expected dollar spend to obtain a correct solution—Scylla lets developers compare architectural tweaks (prompts, tool use, agent orchestration) on a level playing field.

Key Contributions

  • Scylla evaluation suite: Seven tiered test levels (T0‑T6) that incrementally add complexity (e.g., basic prompt → tool‑augmented → multi‑agent) to isolate causal factors.
  • Cost‑of‑Pass (CoP) metric: A clear, business‑oriented KPI that combines monetary cost and success rate, enabling direct trade‑off analysis.
  • Model‑agnostic design: Works with any command‑line coding tool; the paper demonstrates it with Claude Sonnet 4.5 as the generation engine.
  • Multi‑LLM judging pipeline: Uses three Claude models (Opus 4.5, Sonnet 4.5, Haiku 4.5) to produce consensus scores via direct tests, rubric‑based LLM evaluation, and qualitative review.
  • Reproducible benchmark: All scripts, prompts, and data are released, allowing the community to replicate and extend the study.

Methodology

  1. Define testing tiers
    • T0: Simple prompt → single LLM output.
    • T1‑T3: Add deterministic tools (e.g., static analysis, test generation).
    • T4‑T6: Introduce multi‑agent orchestration, dynamic tool selection, and self‑refinement loops.
  2. Run each tier on a curated suite of coding problems (algorithmic, API‑integration, and bug‑fix tasks).
  3. Collect outcomes: For every run, record the number of API calls, token usage, and whether the generated code passes the hidden test suite.
  4. Compute CoP

\[ \text{CoP} = \frac{\text{Total cost (tokens consumed} \times \text{price per token})}{\text{Number of passing solutions}} \]

  5. Evaluation: Three Claude models act as judges. They (a) execute the code against hidden tests, (b) apply a rubric generated by an LLM, and (c) provide a short qualitative verdict. Consensus is reached via majority voting.
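The paper's judging code isn't reproduced in this summary; as a minimal sketch of the majority-voting step, assuming each judge ultimately emits a boolean pass/fail verdict:

```python
from collections import Counter

def consensus_verdict(verdicts):
    """Majority vote over per-judge pass/fail verdicts.

    `verdicts` maps a judge name to a boolean (True = pass).
    With three judges, at least two must agree for a pass.
    """
    counts = Counter(verdicts.values())
    return counts[True] > counts[False]

# Hypothetical verdicts from the three Claude judges:
votes = {"opus-4.5": True, "sonnet-4.5": True, "haiku-4.5": False}
print(consensus_verdict(votes))  # two of three passed it, so: True
```

An odd number of judges guarantees the vote never ties, which is presumably why three models are used rather than two.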

The pipeline is fully automated, so developers can plug in their own agents or prompts and obtain a CoP report in minutes.
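Steps 3 and 4 above can be sketched in a few lines; the record fields and per-million-token prices here are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One evaluation run: token usage plus whether the code passed."""
    input_tokens: int
    output_tokens: int
    passed: bool

def cost_of_pass(runs, price_in_per_mtok, price_out_per_mtok):
    """Expected dollars per correct solution: total spend / passing runs.

    Prices are USD per million tokens. Returns infinity when nothing
    passes, since no amount of spend produced a correct solution.
    """
    total_cost = sum(
        r.input_tokens * price_in_per_mtok / 1e6
        + r.output_tokens * price_out_per_mtok / 1e6
        for r in runs
    )
    passes = sum(r.passed for r in runs)
    return float("inf") if passes == 0 else total_cost / passes

# Two runs, one passing: all spend is amortized over the single pass.
runs = [Run(1000, 500, True), Run(1000, 500, False)]
print(f"CoP = ${cost_of_pass(runs, 3.0, 15.0):.4f}")
```

Note that failed runs still count toward the numerator, which is what makes CoP stricter than raw cost per run.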

Results & Findings

| Tier | Avg. Pass Rate | Avg. Cost per Run | CoP (USD) |
|---|---|---|---|
| T0 (plain prompt) | 42% | $0.08 | $0.19 |
| T2 (tool-augmented) | 58% | $0.12 | $0.21 |
| T4 (single-agent with self-refine) | 66% | $0.18 | $0.27 |
| T6 (full multi-agent) | 71% | $0.31 | $0.44 |
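The CoP column is just average cost per run divided by pass rate; a quick check reproduces the reported figures:

```python
# Reported tier results: (pass rate, avg cost per run in USD).
tiers = {
    "T0": (0.42, 0.08),
    "T2": (0.58, 0.12),
    "T4": (0.66, 0.18),
    "T6": (0.71, 0.31),
}

for tier, (pass_rate, cost) in tiers.items():
    cop = cost / pass_rate  # expected spend per passing solution
    print(f"{tier}: CoP = ${cop:.2f}")
```

This also makes the T6 result intuitive: its cost per run nearly quadruples relative to T0 while its pass rate improves by well under 2x, so CoP more than doubles.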
  • Adding static analysis tools (T2) improves correctness with modest cost increase.
  • Self‑refinement loops (T4) give a noticeable boost in pass rate but start to erode cost efficiency.
  • The full multi‑agent orchestration (T6) yields the highest raw accuracy, yet its CoP is the worst—extra agents and tool calls inflate the bill without proportional quality gains.
  • Across all tiers, the variance between LLM judges was under 3 %, confirming that the consensus approach is stable.

Key takeaway: More architectural complexity does not guarantee a better cost‑performance trade‑off.

Practical Implications

  • Product managers can use CoP to set budget caps for AI‑assisted coding features, choosing the simplest tier that meets a target pass rate.
  • DevOps teams can integrate Scylla into CI pipelines to continuously monitor the ROI of new prompting tricks or tool plugins.
  • Tool vendors gain a neutral benchmark to showcase where their added capabilities (e.g., code‑search, automated debugging) actually pay off.
  • Individual developers can experiment with lightweight prompt engineering before committing to heavyweight multi‑agent setups, saving both time and API spend.

In short, Scylla turns the “black‑box” of LLM‑based coding assistants into a quantifiable engineering decision.

Limitations & Future Work

  • Domain scope: The benchmark focuses on general‑purpose coding tasks; specialized domains (e.g., embedded systems, data‑science notebooks) may behave differently.
  • Vendor lock‑in: All judges are Claude models; cross‑vendor validation (e.g., GPT‑4, Gemini) is left for future studies.
  • Human factor: While the framework automates evaluation, real‑world developer satisfaction and maintainability are not captured.
  • Scalability of tiers: Adding more nuanced tiers (e.g., hybrid human‑in‑the‑loop) could further refine the cost‑benefit landscape.

Future work aims to broaden the problem set, incorporate multi‑vendor judges, and explore hybrid evaluation metrics that blend CoP with developer experience scores.

Authors

  • Micah Villmow

Paper Information

  • arXiv ID: 2602.08765v1
  • Categories: cs.SE, cs.AI
  • Published: February 9, 2026