[Paper] Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study
Source: arXiv - 2602.06671v1
Overview
The paper investigates whether feeding a large language model (LLM) a serialized abstract syntax tree (AST) instead of raw source code can improve code‑summarization performance. By designing a compact, LLM‑friendly AST representation (AST(NIT)), the authors show that you can keep the input short, speed up fine‑tuning, and still generate summaries that are on par with state‑of‑the‑art methods that use the full source text.
Key Contributions
- AST(NIT) serialization scheme that preserves lexical tokens (identifiers, literals) while encoding tree structure in a linear sequence suitable for LLMs.
- Empirical comparison of raw code vs. serialized AST inputs on the CodeXGLUE Python benchmark using the LLaMA‑3.1‑8B model.
- Demonstration that serialized ASTs reduce input length by up to ~30 %, cutting fine‑tuning time without sacrificing summary quality.
- Open‑source release of the serialization pipeline and training scripts, enabling reproducibility and easy integration into existing LLM‑based tooling.
Methodology
- Dataset & Baseline – The authors use the Python portion of CodeXGLUE (≈ 20 k functions with human‑written docstrings). The baseline feeds the raw function source code to LLaMA‑3.1‑8B.
- AST Extraction – For each function, the Python `ast` module builds an abstract syntax tree.
- Serialization (AST(NIT)) –
  - Node‑type tokens (e.g., `FunctionDef`, `If`, `Call`) are emitted in depth‑first order.
  - Lexical leaves (identifiers, literals, operators) are inserted unchanged, preserving the exact spelling that developers see.
  - Structural markers (`<START>`, `<END>`, `<CHILD>`) delimit parent‑child relationships, allowing the linear sequence to reconstruct the original tree if needed.
- Fine‑tuning – Both raw‑code and serialized‑AST inputs are used to fine‑tune the same LLaMA‑3.1‑8B model for a fixed number of epochs, with identical hyper‑parameters (learning rate, batch size, etc.).
- Evaluation – Summaries are scored with BLEU, METEOR, and ROUGE‑L, the standard metrics in code‑summarization research.
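The serialization step above can be sketched with Python's standard `ast` module. This is a simplified illustration, not the paper's exact AST(NIT) scheme: it uses only `<START>`/`<END>` markers (omitting `<CHILD>`), and the choice of which leaves to emit is an assumption based on the description above.

```python
import ast

# Structural markers modeled on the paper's description; the exact
# AST(NIT) token vocabulary is not reproduced here.
START, END = "<START>", "<END>"

def serialize(node: ast.AST) -> list[str]:
    """Depth-first serialization: node-type tokens plus lexical leaves,
    delimited by structural markers (a simplified AST(NIT)-style sketch)."""
    tokens = [START, type(node).__name__]
    # Emit lexical leaves verbatim (names, parameters, literals).
    if isinstance(node, ast.FunctionDef):
        tokens.append(node.name)
    elif isinstance(node, ast.arg):
        tokens.append(node.arg)
    elif isinstance(node, ast.Name):
        tokens.append(node.id)
    elif isinstance(node, ast.Constant):
        tokens.append(repr(node.value))
    for child in ast.iter_child_nodes(node):
        tokens.extend(serialize(child))
    tokens.append(END)
    return tokens

src = "def add(a, b):\n    return a + b"
print(" ".join(serialize(ast.parse(src))))
```

Because formatting and comments never reach the tree, two differently formatted versions of the same function serialize to the same token stream.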
Results & Findings
| Input Type | Avg. Tokens per Example | Fine‑tuning Time (hrs) | BLEU ↑ | METEOR ↑ | ROUGE‑L ↑ |
|---|---|---|---|---|---|
| Raw code | 185 | 6.2 | 21.4 | 15.9 | 38.1 |
| AST(NIT) | 132 (≈ −30 %) | 4.5 (≈ −27 %) | 21.1 | 15.7 | 37.9 |
- Input compactness: Serialized ASTs shave off roughly 30 % of the token count, which directly translates into faster training and lower memory usage.
- Quality parity: The drop in BLEU/METEOR/ROUGE‑L is statistically insignificant (p > 0.1), indicating that the model retains the same understanding of program intent.
- Robustness: Qualitative inspection shows that AST‑driven summaries correctly capture function names and parameter roles even when the original code contains noisy comments or unconventional formatting.
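Of the three metrics in the table, ROUGE‑L is the most self‑contained to reproduce: it is an F‑measure over the longest common subsequence (LCS) between candidate and reference. A minimal stdlib sketch (not the paper's evaluation harness; the β weighting is a common convention, assumed here):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """LCS-based F-measure over whitespace tokens (sentence-level ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(rouge_l("returns the sum of two numbers", "return the sum of two values"))
```

An identical candidate and reference score 1.0; partial word overlap scores strictly between 0 and 1.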
Practical Implications
- Faster fine‑tuning for internal tooling: Companies that fine‑tune LLMs on proprietary codebases can cut GPU hours by ~25 % simply by swapping raw code for AST(NIT) inputs.
- Reduced context window pressure: With token limits tightening on newer LLM APIs, a compact representation leaves more room for additional context (e.g., surrounding module docs, issue tickets).
- Better handling of obfuscated or minified code: Since the AST abstracts away whitespace and formatting quirks, the approach is more resilient to code that has been auto‑generated or heavily minified.
- Plug‑and‑play integration: The serialization pipeline is language‑agnostic in principle; extending it to Java, JavaScript, or Rust would let teams experiment with AST‑based prompts across their tech stack.
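For teams wanting to try the swap, the integration point is the fine‑tuning data pipeline: replace the raw‑code input field with the serialized tree. A hedged sketch, where the JSONL field names and the node‑type‑only serialization are illustrative stand‑ins, not the paper's actual format:

```python
import ast
import json

def to_record(source: str, docstring: str) -> str:
    """Build one fine-tuning record whose input is a serialized AST
    instead of raw source. A simplified stand-in for AST(NIT): here we
    emit only node-type tokens in walk order."""
    tree = ast.parse(source)
    tokens = [type(n).__name__ for n in ast.walk(tree)]
    # Hypothetical field names; adapt to your training framework.
    return json.dumps({"input": " ".join(tokens), "target": docstring})

rec = to_record("def double(x):\n    return x * 2", "Return x doubled.")
print(rec)
```

The rest of the training loop (model, hyper‑parameters, epochs) stays unchanged, which is what makes the raw‑code vs. AST comparison in the paper controlled.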
Limitations & Future Work
- Language scope: The study focuses exclusively on Python; AST structures differ widely across languages, so the serialization rules may need adaptation.
- Model size: Experiments were limited to an 8‑billion‑parameter LLaMA; it remains unclear how the trade‑off scales to larger (e.g., 70B) or instruction‑tuned models.
- Semantic depth: While structural information is retained, deeper semantic cues (type inference, data‑flow) are not encoded; future work could enrich the sequence with static‑analysis annotations.
- User studies: The paper evaluates automatic metrics only; a human‑in‑the‑loop study would confirm whether developers find AST‑derived summaries equally helpful in real maintenance tasks.
Bottom line: By turning a full AST into a concise, LLM‑ready token stream, AST(NIT) offers a practical shortcut for teams looking to harness large language models for code summarization without paying the full cost of raw‑code inputs. The approach is lightweight, reproducible, and ready for broader adoption—especially as token budgets become a tighter constraint on modern LLM APIs.
Authors
- Shijia Dong
- Haoruo Zhao
- Paul Harvey
Paper Information
- arXiv ID: 2602.06671v1
- Categories: cs.SE
- Published: February 6, 2026