[Paper] TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback
Source: arXiv - 2601.08734v1
Overview
TerraFormer tackles a pain point that many DevOps engineers face daily: turning natural‑language intent into reliable Infrastructure‑as‑Code (IaC) scripts. By fine‑tuning a large language model (LLM) with feedback from formal verification tools, the authors demonstrate a system that generates and mutates Terraform configurations with markedly higher correctness than off‑the‑shelf LLMs, including models many times its size.
Key Contributions
- Neuro‑symbolic framework that blends supervised fine‑tuning with verifier‑guided reinforcement learning for IaC generation.
- Two curated NL‑to‑IaC datasets – TF‑Gen (152 k examples) and TF‑Mutn (52 k examples) – built through multi‑stage verification and iterative self‑correction.
- Policy‑guided verifier that checks syntax, deployability, and security/compliance policies, feeding structured rewards back to the model.
- Empirical superiority: TerraFormer improves its base LLM’s correctness by up to 19.6 % and outperforms much larger commercial models on benchmark test sets.
- Best‑practices & security compliance: achieves top scores for adhering to Terraform best‑practice guidelines and industry security policies.
Methodology
- Base Model Selection – The authors start with a strong, open‑source LLM pre‑trained on code (e.g., CodeLlama).
- Supervised Fine‑Tuning (SFT) – The model is first fine‑tuned on the TF‑Gen dataset, where each entry pairs a natural‑language description with a correct Terraform manifest.
- Verifier‑Guided RL – A custom verification pipeline runs each generated manifest through:
  - Syntax checker (Terraform CLI validate).
  - Deployability tester (plan execution in a sandbox).
  - Policy engine (OPA/Rego rules for security and organizational policies).
  The verifier returns a scalar reward (e.g., +1 for passing all checks, –1 for failures) plus detailed error signals; a minimal sketch of such a pipeline appears after this list.
- RLHF‑style optimization loop – Using Proximal Policy Optimization (PPO), with the verifier playing the role of the human rater, the model updates its parameters to maximize the verifier reward, making it less likely to emit failing configurations in the first place.
- Iterative Self‑Correction – When an output fails verification, the system feeds the verifier’s error messages back to the model for a second pass, allowing it to produce a corrected script (see the self‑correction sketch after this list).
- Evaluation – Benchmarks include IaC‑Eval (a public IaC correctness suite) and held‑out test splits of TF‑Gen and TF‑Mutn, comparing against 17 state‑of‑the‑art LLMs (including GPT‑4.1, DeepSeek‑R1, and Claude Sonnet 3.7).
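To make the verification step concrete, below is a minimal Python sketch of a verifier of this kind. The specific CLI invocations (terraform init/validate/plan/show and conftest for the Rego policies), the policies/ directory, and the flat +1/–1 reward are illustrative assumptions; the paper's actual pipeline and reward shaping may differ.

```python
"""Minimal sketch of a policy-guided verifier for generated Terraform.

Illustrative only: tool choices (Terraform CLI + conftest for Rego
policies) and the flat +1/-1 reward are assumptions, not the authors'
exact implementation.
"""
import subprocess
import tempfile
from pathlib import Path


def _run(cmd: list[str], cwd: str) -> subprocess.CompletedProcess:
    """Run a CLI command in the sandbox directory and capture its output."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)


def verify_manifest(hcl_text: str, policy_dir: str = "policies/") -> tuple[float, list[str]]:
    """Score one generated manifest: returns (reward, error messages)."""
    errors: list[str] = []
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "main.tf").write_text(hcl_text)

        # 1. Syntax check: provider-only init, then `terraform validate`.
        for cmd in (["terraform", "init", "-backend=false", "-input=false"],
                    ["terraform", "validate", "-no-color"]):
            result = _run(cmd, workdir)
            if result.returncode != 0:
                errors.append(result.stderr or result.stdout)
                return -1.0, errors  # fail fast on syntax problems

        # 2. Deployability check: dry-run plan inside the sandbox.
        result = _run(["terraform", "plan", "-input=false", "-out=tfplan"], workdir)
        if result.returncode != 0:
            errors.append(result.stderr or result.stdout)
            return -1.0, errors

        # 3. Policy check: export the plan as JSON and evaluate Rego rules.
        plan_json = _run(["terraform", "show", "-json", "tfplan"], workdir).stdout
        Path(workdir, "plan.json").write_text(plan_json)
        result = _run(["conftest", "test", "plan.json", "--policy", policy_dir], workdir)
        if result.returncode != 0:
            errors.append(result.stdout)
            return -1.0, errors

    return 1.0, errors  # all checks passed
```

During RL training, the returned scalar would be attached to each sampled generation before the PPO update, while the error messages can be surfaced to the model for the self‑correction pass sketched next.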
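The iterative self‑correction step can then be layered on top of that verifier. The loop below is a sketch under the same assumptions: generate() is a hypothetical placeholder for the fine‑tuned model's inference call, and the prompt wording and retry budget are illustrative rather than the authors' exact design.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for the fine-tuned model's inference call."""
    raise NotImplementedError


def generate_with_self_correction(request: str, max_attempts: int = 2) -> str:
    """Generate a manifest, then retry with verifier feedback if it fails."""
    manifest = generate(f"Write a Terraform configuration for: {request}")
    for _ in range(max_attempts):
        reward, errors = verify_manifest(manifest)
        if reward > 0:
            return manifest  # passed syntax, plan, and policy checks
        # Feed the verifier's error messages back to the model and retry.
        feedback = "\n".join(errors)
        manifest = generate(
            f"The Terraform configuration below failed verification.\n"
            f"Errors:\n{feedback}\n\n"
            f"Original request: {request}\n"
            f"Faulty configuration:\n{manifest}\n"
            f"Return a corrected configuration."
        )
    return manifest  # best effort after exhausting the retry budget
```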
Results & Findings
| Metric | TerraFormer (after RL) | Base LLM | Standing vs. larger models |
|---|---|---|---|
| IaC‑Eval correctness ↑ | +15.94 % over base | baseline | Ranks 3rd overall among the compared models |
| TF‑Gen (test) accuracy ↑ | +11.65 % over base | baseline | Beats Claude Sonnet 3.7 and DeepSeek‑R1 |
| TF‑Mutn (test) accuracy ↑ | +19.60 % over base | baseline | Beats GPT‑4.1 |
| Best‑practices compliance | Top‑ranked | Lower | Larger models show more violations |
| Security policy compliance | Highest score | Moderate | Larger models score lower |
Key takeaways: the verifier‑guided RL loop yields tangible gains even when the base model is already strong. Moreover, TerraFormer’s smaller footprint (≈ 2 B parameters) lets it outperform models that are 10–50× larger on the same tasks.
Practical Implications
- Faster IaC authoring – Developers can describe desired infrastructure in plain English and receive a ready‑to‑apply Terraform file, cutting down on boilerplate coding.
- Reduced rollout risk – Because each output is pre‑validated against syntax, deployability, and policy checks, the chance of a broken or non‑compliant deployment drops dramatically.
- Policy‑as‑code enforcement – Organizations can embed their internal security standards directly into the verifier, so generated scripts are checked against mandatory rules before they are accepted.
- Cost‑effective automation – TerraFormer achieves top‑tier performance without needing massive proprietary LLM APIs, making it viable for on‑prem or edge deployment in CI/CD pipelines.
- Mutation support – The TF‑Mutn dataset and corresponding model capabilities enable safe “what‑if” changes (e.g., scaling a cluster, swapping a resource type) without manual diffing.
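As an illustration of that mutation workflow, a "what‑if" change can be framed as an existing configuration plus a natural‑language instruction, with the result re‑verified before apply. The snippet below reuses the generate() and verify_manifest() sketches from the Methodology section; the prompt format and resource values are hypothetical placeholders, not TerraFormer's shipped interface.

```python
# Hypothetical mutation request, reusing generate() and verify_manifest()
# from the earlier sketches. Resource values below are placeholders.
existing_config = """
resource "aws_instance" "web" {
  count         = 2
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.micro"
}
"""

instruction = "Scale the web tier to 6 instances and switch to t3.large."

mutated = generate(
    "Modify the following Terraform configuration as instructed and return "
    "the full updated configuration.\n"
    f"Instruction: {instruction}\n"
    f"Current configuration:\n{existing_config}"
)

reward, errors = verify_manifest(mutated)  # re-check before applying
```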
Limitations & Future Work
- Domain coverage – The datasets focus on Terraform; extending the approach to other IaC languages (Pulumi, CloudFormation) will require new verification tooling.
- Verifier latency – Running full plan and policy checks in the RL loop adds overhead; optimizing the feedback pipeline is an open engineering challenge.
- Generalization to novel resources – The model may struggle with newly released cloud services that lack representation in the training data.
- Human‑in‑the‑loop – While self‑correction works well, integrating real‑time developer feedback could further improve reliability and trust.
- Explainability – Providing rationales for why a particular configuration was generated or rejected remains an area for future research.
Authors
- Prithwish Jana
- Sam Davidson
- Bhavana Bhasker
- Andrey Kan
- Anoop Deoras
- Laurent Callot
Paper Information
- arXiv ID: 2601.08734v1
- Categories: cs.SE, cs.AI
- Published: January 13, 2026