[Paper] Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study
Source: arXiv - 2602.06671v1
Overview
The paper investigates whether feeding a large language model (LLM) a serialized abstract syntax tree (AST) instead of raw source code can improve code‑summarization performance. By designing a compact, LLM‑friendly AST representation (AST(NIT)), the authors show that you can keep the input short, speed up fine‑tuning, and still generate summaries that are on par with state‑of‑the‑art methods that use the full source text.
Key Contributions
- AST(NIT) serialization scheme that preserves lexical tokens (identifiers, literals) while encoding tree structure in a linear sequence suitable for LLMs.
- Empirical comparison of raw code vs. serialized AST inputs on the CodeXGLUE Python benchmark using the LLaMA‑3.1‑8B model.
- Demonstration that serialized ASTs reduce input length by up to ~30 %, cutting fine‑tuning time without sacrificing summary quality.
- Open‑source release of the serialization pipeline and training scripts, enabling reproducibility and easy integration into existing LLM‑based tooling.
Methodology
- Dataset & Baseline – The authors use the Python portion of CodeXGLUE (≈ 20 k functions with human‑written docstrings). The baseline feeds the raw function source code to LLaMA‑3.1‑8B.
- AST Extraction – For each function, the Python `ast` module builds an abstract syntax tree.
- Serialization (AST(NIT)) –
  - Node‑type tokens (e.g., `FunctionDef`, `If`, `Call`) are emitted in depth‑first order.
  - Lexical leaves (identifiers, literals, operators) are inserted unchanged, preserving the exact spelling that developers see.
  - Structural markers (`<START>`, `<END>`, `<CHILD>`) delimit parent‑child relationships, allowing the linear sequence to reconstruct the original tree if needed.
- Fine‑tuning – Both raw‑code and serialized‑AST inputs are used to fine‑tune the same LLaMA‑3.1‑8B model for a fixed number of epochs, with identical hyper‑parameters (learning rate, batch size, etc.).
- Evaluation – Summaries are scored with BLEU, METEOR, and ROUGE‑L, the standard metrics in code‑summarization research.
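The serialization step above can be sketched with Python's standard `ast` module. This is a simplified illustration, not the paper's exact AST(NIT) scheme: it uses only `<START>`/`<END>` markers (omitting `<CHILD>`), and the choice of which leaves to emit is an assumption based on the description above.

```python
import ast

# Structural markers modeled on the paper's description; the exact
# AST(NIT) token vocabulary is not reproduced here.
START, END = "<START>", "<END>"

def serialize(node: ast.AST) -> list[str]:
    """Depth-first serialization: node-type tokens plus lexical leaves,
    delimited by structural markers (a simplified AST(NIT)-style sketch)."""
    tokens = [START, type(node).__name__]
    # Emit lexical leaves verbatim (names, parameters, literals).
    if isinstance(node, ast.FunctionDef):
        tokens.append(node.name)
    elif isinstance(node, ast.arg):
        tokens.append(node.arg)
    elif isinstance(node, ast.Name):
        tokens.append(node.id)
    elif isinstance(node, ast.Constant):
        tokens.append(repr(node.value))
    for child in ast.iter_child_nodes(node):
        tokens.extend(serialize(child))
    tokens.append(END)
    return tokens

src = "def add(a, b):\n    return a + b"
print(" ".join(serialize(ast.parse(src))))
```

Because formatting and comments never reach the tree, two differently formatted versions of the same function serialize to the same token stream.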
Results & Findings
| Input Type | Avg. Tokens per Example | Fine‑tuning Time (hrs) | BLEU ↑ | METEOR ↑ | ROUGE‑L ↑ |
|---|---|---|---|---|---|
| Raw code | 185 | 6.2 | 21.4 | 15.9 | 38.1 |
| AST(NIT) | 132 (≈ −30 %) | 4.5 (≈ −27 %) | 21.1 | 15.7 | 37.9 |
- Input compactness: Serialized ASTs shave off roughly 30 % of the token count, which directly translates into faster training and lower memory usage.
- Quality parity: The drop in BLEU/METEOR/ROUGE‑L is statistically insignificant (p > 0.1), indicating that the model retains the same understanding of program intent.
- Robustness: Qualitative inspection shows that AST‑driven summaries correctly capture function names and parameter roles even when the original code contains noisy comments or unconventional formatting.
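Of the three metrics in the table, ROUGE‑L is the most self‑contained to reproduce: it is an F‑measure over the longest common subsequence (LCS) between candidate and reference. A minimal stdlib sketch (not the paper's evaluation harness; the β weighting is a common convention, assumed here):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """LCS-based F-measure over whitespace tokens (sentence-level ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(rouge_l("returns the sum of two numbers", "return the sum of two values"))
```

An identical candidate and reference score 1.0; partial word overlap scores strictly between 0 and 1.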
Practical Implications
- Faster fine‑tuning for internal tooling: Companies that fine‑tune LLMs on proprietary codebases can cut GPU hours by ~25 % simply by swapping raw code for AST(NIT) inputs.
- Reduced context window pressure: With token limits tightening on newer LLM APIs, a compact representation leaves more room for additional context (e.g., surrounding module docs, issue tickets).
- Better handling of obfuscated or minified code: Since the AST abstracts away whitespace and formatting quirks, the approach is more resilient to code that has been auto‑generated or heavily minified.
- Plug‑and‑play integration: The serialization pipeline is language‑agnostic in principle; extending it to Java, JavaScript, or Rust would let teams experiment with AST‑based prompts across their tech stack.
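For teams wanting to try the swap, the integration point is the fine‑tuning data pipeline: replace the raw‑code input field with the serialized tree. A hedged sketch, where the JSONL field names and the node‑type‑only serialization are illustrative stand‑ins, not the paper's actual format:

```python
import ast
import json

def to_record(source: str, docstring: str) -> str:
    """Build one fine-tuning record whose input is a serialized AST
    instead of raw source. A simplified stand-in for AST(NIT): here we
    emit only node-type tokens in walk order."""
    tree = ast.parse(source)
    tokens = [type(n).__name__ for n in ast.walk(tree)]
    # Hypothetical field names; adapt to your training framework.
    return json.dumps({"input": " ".join(tokens), "target": docstring})

rec = to_record("def double(x):\n    return x * 2", "Return x doubled.")
print(rec)
```

The rest of the training loop (model, hyper‑parameters, epochs) stays unchanged, which is what makes the raw‑code vs. AST comparison in the paper controlled.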
Limitations & Future Work
- Language scope: The study focuses exclusively on Python; AST structures differ widely across languages, so the serialization rules may need adaptation.
- Model size: Experiments were limited to an 8‑billion‑parameter LLaMA; it remains unclear how the trade‑off scales to larger (e.g., 70B) or instruction‑tuned models.
- Semantic depth: While structural information is retained, deeper semantic cues (type inference, data‑flow) are not encoded; future work could enrich the sequence with static‑analysis annotations.
- User studies: The paper evaluates automatic metrics only; a human‑in‑the‑loop study would confirm whether developers find AST‑derived summaries equally helpful in real maintenance tasks.
Bottom line: By turning a full AST into a concise, LLM‑ready token stream, AST(NIT) offers a practical shortcut for teams looking to harness large language models for code summarization without paying the full cost of raw‑code inputs. The approach is lightweight, reproducible, and ready for broader adoption—especially as token budgets become a tighter constraint on modern LLM APIs.
Authors
- Shijia Dong
- Haoruo Zhao
- Paul Harvey
Paper Information
- arXiv ID: 2602.06671v1
- Categories: cs.SE
- Published: February 6, 2026