[Paper] PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code
Source: arXiv - 2512.10713v1
Overview
The paper introduces PACIFIC, a framework that automatically creates benchmark suites to test how well large language models (LLMs) can follow sequential programming instructions and reason about code without actually running it (dry‑running). By generating fresh, difficulty‑controlled test cases, PACIFIC lets researchers and product teams evaluate the core “instruction‑following” skill of code assistants while sidestepping common pitfalls like data contamination.
Key Contributions
- Automated benchmark generation: A pipeline that synthesizes diverse code‑instruction pairs with known expected outputs, eliminating manual test‑case authoring.
- Difficulty control: Parameters to tune logical depth, language features, and data‑flow complexity, producing a graded ladder of challenges (a hypothetical configuration is sketched after this list).
- Pure LLM evaluation: Benchmarks are designed to be answered by reasoning alone—no external tools, execution environments, or agentic actions are required.
- Contamination resistance: Because each benchmark variant is freshly generated, the risk that a model has seen the exact test during pre‑training is dramatically reduced.
- Empirical validation: The authors benchmark several state‑of‑the‑art code models (e.g., GPT‑4‑code, Claude‑2, CodeLlama) across difficulty tiers, showing measurable gaps even among top performers.
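The summary does not reproduce a concrete configuration schema, so the following is only a minimal Python sketch of how such difficulty knobs might be expressed. The class name, field names, and three‑tier values are assumptions chosen to mirror the Level 1–3 split reported below, not the authors' actual parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DifficultyConfig:
    """Hypothetical difficulty knobs; names and values are illustrative only."""
    logical_depth: int                   # nesting of conditionals/loops to trace
    language_features: tuple[str, ...]   # constructs the snippet may use
    dataflow_vars: int                   # number of interacting variables to track
    distractor_statements: int           # irrelevant statements injected as noise

# One possible graded ladder, loosely mirroring the paper's three tiers
LEVELS = {
    1: DifficultyConfig(2, ("loops",), 3, 1),
    2: DifficultyConfig(4, ("loops", "recursion"), 6, 4),
    3: DifficultyConfig(6, ("loops", "recursion", "string_parsing"), 10, 8),
}
```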
Methodology
- Instruction & Code Template Library – The authors start with a curated set of small programming tasks (e.g., list manipulation, recursion, string parsing) expressed as natural‑language instructions and skeletal code snippets.
- Parameterizable Transformations – Randomized transformations (variable renaming, control‑flow reordering, adding irrelevant statements) are applied to increase complexity while preserving the logical outcome; a minimal sketch follows this list.
- Expected‑Output Derivation – Because the transformations are deterministic, the framework can compute the exact result the model should produce (e.g., the final value of a variable after “dry‑running” the code).
- Benchmark Assembly – Each test case consists of:
  - A multi‑step instruction list (e.g., “first reverse the array, then compute the sum”).
  - The transformed code snippet.
  - The ground‑truth output for verification.
- Evaluation Protocol – LLMs receive only the instruction + code and must output the final result as plain text. Scoring is a simple string match against the pre‑computed answer, making the process fully automated (see the assembly‑and‑scoring sketch after this list).
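The transformation code itself is not reproduced in the summary; below is a minimal Python sketch, assuming an `ast`-based pass, of how seeded variable renaming and distractor insertion could raise difficulty while preserving semantics. All names (`RenameVariables`, `add_distractors`, `transform`) are illustrative, and the pass only handles small, import‑free snippets rather than doing full scope analysis.

```python
import ast
import builtins
import random

class RenameVariables(ast.NodeTransformer):
    """Consistently rename user-defined variables, leaving builtins untouched."""
    def __init__(self, rng: random.Random):
        self.rng = rng
        self.mapping: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}_{self.rng.randint(100, 999)}"
        return self.mapping[name]

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if not hasattr(builtins, node.id):  # keep len, sum, print, ... intact
            node.id = self._alias(node.id)
        return node

def add_distractors(tree: ast.Module, rng: random.Random, count: int) -> ast.Module:
    """Insert assignments to unused variables; the snippet's outcome is unchanged."""
    for _ in range(count):
        stmt = ast.parse(f"_noise_{rng.randint(0, 9999)} = {rng.randint(0, 100)}").body[0]
        tree.body.insert(rng.randrange(len(tree.body) + 1), stmt)
    return ast.fix_missing_locations(tree)

def transform(source: str, seed: int, distractors: int = 2) -> str:
    """Apply the randomized-but-seeded pipeline; identical seeds give identical output."""
    rng = random.Random(seed)
    tree = RenameVariables(rng).visit(ast.parse(source))
    tree = add_distractors(tree, rng, distractors)
    return ast.unparse(tree)  # requires Python 3.9+

print(transform("xs = [3, 1, 2]\nxs.reverse()\nhead = xs[0]", seed=7))
```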
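The next sketch shows one plausible way the remaining steps could be wired together: the generator (never the model) executes the deterministic snippet once to obtain the ground truth, packages it with the instruction list, and the checker does a plain string comparison. The `TestCase` fields and function names are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class TestCase:
    """One generated benchmark item (field names are illustrative)."""
    instructions: list[str]   # ordered natural-language steps
    code: str                 # transformed snippet the model must dry-run
    expected_output: str      # ground truth computed by the generator

def derive_expected_output(code: str, result_var: str) -> str:
    """Only the generator runs code; the model under test must dry-run it mentally."""
    namespace: dict = {}
    exec(code, {}, namespace)          # snippets are generator-authored and deterministic
    return repr(namespace[result_var])

def score(model_answer: str, case: TestCase) -> bool:
    """Fully automated check: plain string match against the pre-computed answer."""
    return model_answer.strip() == case.expected_output.strip()

# Assembling and checking a single item
code = "xs = [3, 1, 2]\nxs.reverse()\nhead = xs[0]"
case = TestCase(
    instructions=["First reverse the list.", "Then take its first element."],
    code=code,
    expected_output=derive_expected_output(code, "head"),  # reversed list is [2, 1, 3]
)
print(json.dumps(asdict(case), indent=2))
print(score("2", case))  # True: the first element after reversal is 2
```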
Results & Findings
Exact‑match accuracy by difficulty tier (higher is better):

| Model | Easy (Level 1) | Medium (Level 2) | Hard (Level 3) |
|---|---|---|---|
| GPT‑4‑code | 94% | 78% | 52% |
| Claude‑2 | 92% | 71% | 44% |
| CodeLlama‑34B | 88% | 63% | 31% |
- Performance degrades predictably with difficulty, confirming that the framework can differentiate nuanced reasoning abilities.
- Even the strongest model (GPT‑4‑code) struggles on the hardest tier, indicating that precise step‑by‑step dry‑running remains an open challenge.
- The generated benchmarks expose failure modes not captured by existing code‑generation tests, such as mis‑ordering of instruction steps or overlooking side‑effects in loops.
Practical Implications
- Product QA for code assistants – Teams can integrate PACIFIC into CI pipelines to catch regressions in instruction‑following logic before release (a hypothetical gate is sketched after this list).
- Model fine‑tuning targets – The difficulty‑graded data can serve as a curriculum for reinforcement learning from human feedback (RLHF), focusing on the “hard” tier where models currently lag.
- Safety & security – By testing dry‑run reasoning, developers can assess whether a model might hallucinate execution results—a key factor for preventing buggy code suggestions in critical systems.
- Benchmark hygiene – Because each run produces novel test cases, companies can avoid the “benchmark leakage” problem that plagues static datasets like HumanEval.
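As a concrete illustration of the CI idea above, here is a minimal sketch of a release gate. PACIFIC's actual API is not described in this summary, so `generate_suite`, `query_model`, and the accuracy threshold are placeholders a team would supply; only the regenerate‑and‑string‑match pattern comes from the paper.

```python
"""Hypothetical CI gate (placeholder API, not PACIFIC's published interface)."""

ACCURACY_FLOOR = 0.85  # assumed team-specific threshold, not from the paper

def run_gate(generate_suite, query_model, level=2, size=50, seed=None) -> bool:
    """`generate_suite` wraps the benchmark generator and `query_model` calls the
    code assistant under test; both are stand-ins the team would provide."""
    suite = generate_suite(level=level, size=size, seed=seed)  # fresh cases every run
    correct = sum(
        query_model(case.instructions, case.code).strip() == case.expected_output.strip()
        for case in suite
    )
    accuracy = correct / len(suite)
    print(f"instruction-following accuracy: {accuracy:.2%}")
    return accuracy >= ACCURACY_FLOOR  # fail the pipeline when this returns False
```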
Limitations & Future Work
- Language scope – The current implementation covers primarily Python and JavaScript; extending to statically typed languages (e.g., Java, Rust) may require richer type‑checking logic.
- Scalability of difficulty – While the authors provide three difficulty levels, finer granularity (e.g., micro‑optimizations, concurrency) is not yet explored.
- Human‑in‑the‑loop validation – The automated expected‑output computation assumes deterministic semantics; edge cases involving undefined behavior still need manual review.
- Future directions include adding multi‑file project scenarios, integrating symbolic execution to verify more complex invariants, and open‑sourcing the benchmark generator for community‑driven expansion.
Authors
- Itay Dreyfuss
- Antonio Abu Nassar
- Samuel Ackerman
- Axel Ben David
- Rami Katan
- Orna Raz
- Marcel Zalmanovici
Paper Information
- arXiv ID: 2512.10713v1
- Categories: cs.SE, cs.AI
- Published: December 11, 2025