[Paper] PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code
Source: arXiv - 2512.10713v1
Overview
The paper introduces PACIFIC, a framework that automatically creates benchmark suites to test how well large language models (LLMs) can follow sequential programming instructions and reason about code without actually running it (dry‑running). By generating fresh, difficulty‑controlled test cases, PACIFIC lets researchers and product teams evaluate the core “instruction‑following” skill of code assistants while sidestepping common pitfalls like data contamination.
Key Contributions
- Automated benchmark generation: A pipeline that synthesizes diverse code‑instruction pairs with known expected outputs, eliminating manual test‑case authoring.
- Difficulty control: Parameters to tune logical depth, language features, and data‑flow complexity, producing a graded ladder of challenges (a hypothetical configuration is sketched after this list).
- Pure LLM evaluation: Benchmarks are designed to be answered by reasoning alone—no external tools, execution environments, or agentic actions are required.
- Contamination resistance: Because each benchmark variant is freshly generated, the risk that a model has seen the exact test during pre‑training is dramatically reduced.
- Empirical validation: The authors benchmark several state‑of‑the‑art code models (e.g., GPT‑4‑code, Claude‑2, CodeLlama) across difficulty tiers, showing measurable gaps even among top performers.
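The summary does not reproduce a concrete configuration schema, so the following is only a minimal Python sketch of how such difficulty knobs might be expressed. The class name, field names, and three‑tier values are assumptions chosen to mirror the Level 1–3 split reported below, not the authors' actual parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DifficultyConfig:
    """Hypothetical difficulty knobs; names and values are illustrative only."""
    logical_depth: int                   # nesting of conditionals/loops to trace
    language_features: tuple[str, ...]   # constructs the snippet may use
    dataflow_vars: int                   # number of interacting variables to track
    distractor_statements: int           # irrelevant statements injected as noise

# One possible graded ladder, loosely mirroring the paper's three tiers
LEVELS = {
    1: DifficultyConfig(2, ("loops",), 3, 1),
    2: DifficultyConfig(4, ("loops", "recursion"), 6, 4),
    3: DifficultyConfig(6, ("loops", "recursion", "string_parsing"), 10, 8),
}
```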
Methodology
- Instruction & Code Template Library – The authors start with a curated set of small programming tasks (e.g., list manipulation, recursion, string parsing) expressed as natural‑language instructions and skeletal code snippets.
- Parameterizable Transformations – Randomized transformations (variable renaming, control‑flow reordering, adding irrelevant statements) are applied to increase complexity while preserving the logical outcome; a minimal sketch follows this list.
- Expected‑Output Derivation – Because the transformations are deterministic, the framework can compute the exact result the model should produce (e.g., the final value of a variable after “dry‑running” the code).
- Benchmark Assembly – Each test case consists of:
  - A multi‑step instruction list (e.g., “first reverse the array, then compute the sum”).
  - The transformed code snippet.
  - The ground‑truth output for verification.
- Evaluation Protocol – LLMs receive only the instruction + code and must output the final result as plain text. Scoring is a simple string match against the pre‑computed answer, making the process fully automated (see the assembly‑and‑scoring sketch after this list).
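The transformation code itself is not reproduced in the summary; below is a minimal Python sketch, assuming an `ast`-based pass, of how seeded variable renaming and distractor insertion could raise difficulty while preserving semantics. All names (`RenameVariables`, `add_distractors`, `transform`) are illustrative, and the pass only handles small, import‑free snippets rather than doing full scope analysis.

```python
import ast
import builtins
import random

class RenameVariables(ast.NodeTransformer):
    """Consistently rename user-defined variables, leaving builtins untouched."""
    def __init__(self, rng: random.Random):
        self.rng = rng
        self.mapping: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}_{self.rng.randint(100, 999)}"
        return self.mapping[name]

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if not hasattr(builtins, node.id):  # keep len, sum, print, ... intact
            node.id = self._alias(node.id)
        return node

def add_distractors(tree: ast.Module, rng: random.Random, count: int) -> ast.Module:
    """Insert assignments to unused variables; the snippet's outcome is unchanged."""
    for _ in range(count):
        stmt = ast.parse(f"_noise_{rng.randint(0, 9999)} = {rng.randint(0, 100)}").body[0]
        tree.body.insert(rng.randrange(len(tree.body) + 1), stmt)
    return ast.fix_missing_locations(tree)

def transform(source: str, seed: int, distractors: int = 2) -> str:
    """Apply the randomized-but-seeded pipeline; identical seeds give identical output."""
    rng = random.Random(seed)
    tree = RenameVariables(rng).visit(ast.parse(source))
    tree = add_distractors(tree, rng, distractors)
    return ast.unparse(tree)  # requires Python 3.9+

print(transform("xs = [3, 1, 2]\nxs.reverse()\nhead = xs[0]", seed=7))
```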
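The next sketch shows one plausible way the remaining steps could be wired together: the generator (never the model) executes the deterministic snippet once to obtain the ground truth, packages it with the instruction list, and the checker does a plain string comparison. The `TestCase` fields and function names are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class TestCase:
    """One generated benchmark item (field names are illustrative)."""
    instructions: list[str]   # ordered natural-language steps
    code: str                 # transformed snippet the model must dry-run
    expected_output: str      # ground truth computed by the generator

def derive_expected_output(code: str, result_var: str) -> str:
    """Only the generator runs code; the model under test must dry-run it mentally."""
    namespace: dict = {}
    exec(code, {}, namespace)          # snippets are generator-authored and deterministic
    return repr(namespace[result_var])

def score(model_answer: str, case: TestCase) -> bool:
    """Fully automated check: plain string match against the pre-computed answer."""
    return model_answer.strip() == case.expected_output.strip()

# Assembling and checking a single item
code = "xs = [3, 1, 2]\nxs.reverse()\nhead = xs[0]"
case = TestCase(
    instructions=["First reverse the list.", "Then take its first element."],
    code=code,
    expected_output=derive_expected_output(code, "head"),  # reversed list is [2, 1, 3]
)
print(json.dumps(asdict(case), indent=2))
print(score("2", case))  # True: the first element after reversal is 2
```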
Results & Findings
Exact‑match accuracy by difficulty tier (higher is better):

| Model | Easy (Level 1) | Medium (Level 2) | Hard (Level 3) |
|---|---|---|---|
| GPT‑4‑code | 94% | 78% | 52% |
| Claude‑2 | 92% | 71% | 44% |
| CodeLlama‑34B | 88% | 63% | 31% |
- Performance degrades predictably with difficulty, confirming that the framework can differentiate nuanced reasoning abilities.
- Even the strongest model (GPT‑4‑code) struggles on the hardest tier, indicating that precise step‑by‑step dry‑running remains an open challenge.
- The generated benchmarks expose failure modes not captured by existing code‑generation tests, such as mis‑ordering of instruction steps or overlooking side‑effects in loops.
Practical Implications
- Product QA for code assistants – Teams can integrate PACIFIC into CI pipelines to catch regressions in instruction‑following logic before release (a hypothetical gate is sketched after this list).
- Model fine‑tuning targets – The difficulty‑graded data can serve as a curriculum for reinforcement learning from human feedback (RLHF), focusing on the “hard” tier where models currently lag.
- Safety & security – By testing dry‑run reasoning, developers can assess whether a model might hallucinate execution results—a key factor for preventing buggy code suggestions in critical systems.
- Benchmark hygiene – Because each run produces novel test cases, companies can avoid the “benchmark leakage” problem that plagues static datasets like HumanEval.
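As a concrete illustration of the CI idea above, here is a minimal sketch of a release gate. PACIFIC's actual API is not described in this summary, so `generate_suite`, `query_model`, and the accuracy threshold are placeholders a team would supply; only the regenerate‑and‑string‑match pattern comes from the paper.

```python
"""Hypothetical CI gate (placeholder API, not PACIFIC's published interface)."""

ACCURACY_FLOOR = 0.85  # assumed team-specific threshold, not from the paper

def run_gate(generate_suite, query_model, level=2, size=50, seed=None) -> bool:
    """`generate_suite` wraps the benchmark generator and `query_model` calls the
    code assistant under test; both are stand-ins the team would provide."""
    suite = generate_suite(level=level, size=size, seed=seed)  # fresh cases every run
    correct = sum(
        query_model(case.instructions, case.code).strip() == case.expected_output.strip()
        for case in suite
    )
    accuracy = correct / len(suite)
    print(f"instruction-following accuracy: {accuracy:.2%}")
    return accuracy >= ACCURACY_FLOOR  # fail the pipeline when this returns False
```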
Limitations & Future Work
- Language scope – The current implementation covers primarily Python and JavaScript; extending to statically typed languages (e.g., Java, Rust) may require richer type‑checking logic.
- Scalability of difficulty – While the authors provide three difficulty levels, finer granularity (e.g., micro‑optimizations, concurrency) is not yet explored.
- Human‑in‑the‑loop validation – The automated expected‑output computation assumes deterministic semantics; edge cases involving undefined behavior still need manual review.
- Future directions include adding multi‑file project scenarios, integrating symbolic execution to verify more complex invariants, and open‑sourcing the benchmark generator for community‑driven expansion.
Authors
- Itay Dreyfuss
- Antonio Abu Nassar
- Samuel Ackerman
- Axel Ben David
- Rami Katan
- Orna Raz
- Marcel Zalmanovici
Paper Information
- arXiv ID: 2512.10713v1
- Categories: cs.SE, cs.AI
- Published: December 11, 2025