[Paper] Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas

Published: February 9, 2026 at 10:06 AM EST
4 min read
Source: arXiv - 2602.08765v1

Overview

Micah Villmow’s paper presents Scylla, a systematic framework for evaluating LLM‑powered coding assistants and multi‑agent pipelines. By measuring the Cost‑of‑Pass (CoP)—the expected dollar spend to obtain a correct solution—Scylla lets developers compare architectural tweaks (prompts, tool use, agent orchestration) on a level playing field.

Key Contributions

  • Scylla evaluation suite: Seven tiered test levels (T0‑T6) that incrementally add complexity (e.g., basic prompt → tool‑augmented → multi‑agent) to isolate causal factors.
  • Cost‑of‑Pass (CoP) metric: A clear, business‑oriented KPI that combines monetary cost and success rate, enabling direct trade‑off analysis.
  • Model‑agnostic design: Works with any command‑line coding tool; the paper demonstrates it with Claude Sonnet 4.5 as the generation engine.
  • Multi‑LLM judging pipeline: Uses three Claude models (Opus 4.5, Sonnet 4.5, Haiku 4.5) to produce consensus scores via direct tests, rubric‑based LLM evaluation, and qualitative review.
  • Reproducible benchmark: All scripts, prompts, and data are released, allowing the community to replicate and extend the study.

Methodology

  1. Define testing tiers
    • T0: Simple prompt → single LLM output.
    • T1‑T3: Add deterministic tools (e.g., static analysis, test generation).
    • T4‑T6: Introduce multi‑agent orchestration, dynamic tool selection, and self‑refinement loops.
  2. Run each tier on a curated suite of coding problems (algorithmic, API‑integration, and bug‑fix tasks).
  3. Collect outcomes: For every run, record the number of API calls, token usage, and whether the generated code passes the hidden test suite.
  4. Compute CoP

\[ \text{CoP} = \frac{\text{Total cost (tokens consumed} \times \text{price per token})}{\text{Number of passing solutions}} \]

  5. Evaluation: Three Claude models act as judges. They (a) execute the code against hidden tests, (b) apply a rubric generated by an LLM, and (c) provide a short qualitative verdict. Consensus is reached via majority voting.
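The paper's judging code isn't reproduced in this summary; as a minimal sketch of the majority-voting step, assuming each judge ultimately emits a boolean pass/fail verdict:

```python
from collections import Counter

def consensus_verdict(verdicts):
    """Majority vote over per-judge pass/fail verdicts.

    `verdicts` maps a judge name to a boolean (True = pass).
    With three judges, at least two must agree for a pass.
    """
    counts = Counter(verdicts.values())
    return counts[True] > counts[False]

# Hypothetical verdicts from the three Claude judges:
votes = {"opus-4.5": True, "sonnet-4.5": True, "haiku-4.5": False}
print(consensus_verdict(votes))  # two of three passed it, so: True
```

An odd number of judges guarantees the vote never ties, which is presumably why three models are used rather than two.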

The pipeline is fully automated, so developers can plug in their own agents or prompts and obtain a CoP report in minutes.
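Steps 3 and 4 above can be sketched in a few lines; the record fields and per-million-token prices here are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One evaluation run: token usage plus whether the code passed."""
    input_tokens: int
    output_tokens: int
    passed: bool

def cost_of_pass(runs, price_in_per_mtok, price_out_per_mtok):
    """Expected dollars per correct solution: total spend / passing runs.

    Prices are USD per million tokens. Returns infinity when nothing
    passes, since no amount of spend produced a correct solution.
    """
    total_cost = sum(
        r.input_tokens * price_in_per_mtok / 1e6
        + r.output_tokens * price_out_per_mtok / 1e6
        for r in runs
    )
    passes = sum(r.passed for r in runs)
    return float("inf") if passes == 0 else total_cost / passes

# Two runs, one passing: all spend is amortized over the single pass.
runs = [Run(1000, 500, True), Run(1000, 500, False)]
print(f"CoP = ${cost_of_pass(runs, 3.0, 15.0):.4f}")
```

Note that failed runs still count toward the numerator, which is what makes CoP stricter than raw cost per run.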

Results & Findings

| Tier | Avg. Pass Rate | Avg. Cost per Run | CoP (USD) |
|---|---|---|---|
| T0 (plain prompt) | 42% | $0.08 | $0.19 |
| T2 (tool-augmented) | 58% | $0.12 | $0.21 |
| T4 (single-agent with self-refine) | 66% | $0.18 | $0.27 |
| T6 (full multi-agent) | 71% | $0.31 | $0.44 |
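The CoP column is just average cost per run divided by pass rate; a quick check reproduces the reported figures:

```python
# Reported tier results: (pass rate, avg cost per run in USD).
tiers = {
    "T0": (0.42, 0.08),
    "T2": (0.58, 0.12),
    "T4": (0.66, 0.18),
    "T6": (0.71, 0.31),
}

for tier, (pass_rate, cost) in tiers.items():
    cop = cost / pass_rate  # expected spend per passing solution
    print(f"{tier}: CoP = ${cop:.2f}")
```

This also makes the T6 result intuitive: its cost per run nearly quadruples relative to T0 while its pass rate improves by well under 2x, so CoP more than doubles.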
  • Adding static analysis tools (T2) improves correctness with modest cost increase.
  • Self‑refinement loops (T4) give a noticeable boost in pass rate but start to erode cost efficiency.
  • The full multi‑agent orchestration (T6) yields the highest raw accuracy, yet its CoP is the worst—extra agents and tool calls inflate the bill without proportional quality gains.
  • Across all tiers, the variance between LLM judges was under 3 %, confirming that the consensus approach is stable.

Key takeaway: More architectural complexity does not guarantee a better cost‑performance trade‑off.

Practical Implications

  • Product managers can use CoP to set budget caps for AI‑assisted coding features, choosing the simplest tier that meets a target pass rate.
  • DevOps teams can integrate Scylla into CI pipelines to continuously monitor the ROI of new prompting tricks or tool plugins.
  • Tool vendors gain a neutral benchmark to showcase where their added capabilities (e.g., code‑search, automated debugging) actually pay off.
  • Individual developers can experiment with lightweight prompt engineering before committing to heavyweight multi‑agent setups, saving both time and API spend.

In short, Scylla turns the “black‑box” of LLM‑based coding assistants into a quantifiable engineering decision.

Limitations & Future Work

  • Domain scope: The benchmark focuses on general‑purpose coding tasks; specialized domains (e.g., embedded systems, data‑science notebooks) may behave differently.
  • Vendor lock‑in: All judges are Claude models; cross‑vendor validation (e.g., GPT‑4, Gemini) is left for future studies.
  • Human factor: While the framework automates evaluation, real‑world developer satisfaction and maintainability are not captured.
  • Scalability of tiers: Adding more nuanced tiers (e.g., hybrid human‑in‑the‑loop) could further refine the cost‑benefit landscape.

Future work aims to broaden the problem set, incorporate multi‑vendor judges, and explore hybrid evaluation metrics that blend CoP with developer experience scores.

Authors

  • Micah Villmow

Paper Information

  • arXiv ID: 2602.08765v1
  • Categories: cs.SE, cs.AI
  • Published: February 9, 2026