[Paper] Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System
Source: arXiv - 2602.16650v1
Overview
The paper presents a retrieval‑augmented generation (RAG) system that turns the massive, unstructured polymer literature into a usable expert assistant. By pairing large language models with two custom retrieval pipelines—one based on dense vector similarity (VectorRAG) and another on a structured knowledge graph (GraphRAG)—the authors demonstrate how to answer complex, cross‑study questions about biodegradable polymers (specifically polyhydroxyalkanoates, PHA) with citations and traceable evidence.
Key Contributions
- Two domain‑specific RAG pipelines:
- VectorRAG: dense paragraph embeddings for high‑recall retrieval.
- GraphRAG: a canonicalized knowledge graph enabling entity disambiguation and multi‑hop reasoning.
- Curated corpus of >1,000 PHA papers with paragraph‑level embeddings and a graph that normalizes polymer terminology.
- Comprehensive evaluation against standard retrieval metrics, commercial LLMs (GPT, Gemini), and expert chemist validation.
- Demonstration of trade‑offs: GraphRAG yields higher precision and interpretability; VectorRAG offers broader coverage.
- Open‑source‑friendly framework that reduces dependence on proprietary models while ensuring every generated claim is backed by a literature citation.
Methodology
- Corpus Construction – The authors scraped and cleaned the full text of 1,000+ peer‑reviewed PHA papers, splitting them into logical paragraphs.
- Embedding Layer (VectorRAG) – Each paragraph was encoded with a domain‑fine‑tuned transformer to produce dense vectors. Approximate nearest‑neighbor indexing (FAISS) enables fast similarity search.
- Graph Construction (GraphRAG) – Named entities (polymers, monomers, synthesis methods, properties) were extracted, canonicalized, and linked into a heterogeneous graph (nodes = entities, edges = relationships like “catalyzes”, “has degradation rate”).
- Retrieval + Generation Loop –
- A user query is first processed by the LLM to decide whether to use vector search, graph traversal, or both.
- Retrieved paragraphs (VectorRAG) or sub‑graphs (GraphRAG) are fed as context to the LLM, which then generates an answer and automatically inserts citations pointing to the source paragraphs/nodes.
- Evaluation – Retrieval quality was measured with precision/recall, answer relevance was rated by a polymer chemist, and outputs were compared against off‑the‑shelf LLMs that lack domain‑specific retrieval.
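The two retrieval paths above can be sketched in miniature. Everything here is illustrative: the toy paragraphs, the byte-seeded stand-in encoder, and the hand-built triples are mine, not the paper's (which uses a fine-tuned transformer, FAISS indexing, and an extracted knowledge graph at corpus scale).

```python
import numpy as np

# Toy paragraph store (stands in for the >1,000-paper PHA corpus).
paragraphs = [
    "PHB films degraded within 60 days in marine sediment.",
    "Adding 3HV comonomer lowered PHA crystallinity.",
    "Enzymatic synthesis with PhaC improved molecular weight.",
]

def embed(texts, dim=64):
    """Stand-in for the domain-fine-tuned encoder: deterministic
    pseudo-random unit vectors seeded by the text bytes."""
    out = np.empty((len(texts), dim), dtype="float32")
    for i, t in enumerate(texts):
        rng = np.random.default_rng(sum(t.encode()))
        v = rng.standard_normal(dim)
        out[i] = v / np.linalg.norm(v)
    return out

doc_vecs = embed(paragraphs)

def vector_retrieve(query, k=2):
    """VectorRAG path: cosine similarity over paragraph embeddings
    (the paper uses FAISS; brute-force dot products stand in here)."""
    scores = doc_vecs @ embed([query])[0]
    top = np.argsort(-scores)[:k]
    return [(paragraphs[i], float(scores[i])) for i in top]

# GraphRAG path: canonicalized entities linked by typed relations.
edges = {
    "enzyme X": [("catalyzes", "PHA-1")],
    "PHA-1": [("has property", "high crystallinity")],
    "high crystallinity": [("implies", "slow degradation")],
}

def multi_hop(start, max_hops=3):
    """Collect relation chains reachable from an entity, enabling the
    multi-step reasoning that dense retrieval alone cannot express."""
    chains, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        for rel, nxt in edges.get(node, []):
            chain = path + [rel, nxt]
            chains.append(chain)
            if (len(chain) - 1) // 2 < max_hops:
                stack.append((nxt, chain))
    return chains
```

In the paper's retrieval‑generation loop, the LLM itself decides whether to call the vector path, the graph path, or both; a router over `vector_retrieve` and `multi_hop` would play that role here.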
Results & Findings
| Metric | VectorRAG | GraphRAG | Baseline GPT‑4 (no retrieval) |
|---|---|---|---|
| Recall (top‑10) | 0.78 | 0.62 | 0.41 |
| Precision (top‑10) | 0.61 | 0.84 | 0.48 |
| Human‑rated relevance (1‑5) | 4.1 | 4.5 | 3.6 |
| Citation correctness | 71% | 89% | 45% |
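The top‑10 precision and recall figures follow the standard definitions; a minimal sketch (the function name and toy document ids are mine, not the paper's):

```python
def precision_recall_at_k(retrieved, relevant, k=10):
    """precision@k: fraction of the top-k results that are relevant.
    recall@k: fraction of all relevant documents found in the top-k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 3 retrieved docs are relevant, and 2 of 3 relevant docs found:
p, r = precision_recall_at_k(["p1", "p3", "p7"], {"p1", "p7", "p9"}, k=3)
# both p and r are 2/3 here
```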
- GraphRAG excels at delivering precise, traceable answers because the graph enforces consistent terminology and enables multi‑step logical hops (e.g., “PHA synthesized with enzyme X → higher crystallinity → slower degradation”).
- VectorRAG captures a wider set of relevant paragraphs, useful when the query is broad or when the graph lacks a specific relation.
- Expert chemists confirmed that the system’s answers were well‑grounded, often surfacing patterns (e.g., correlations between monomer composition and biodegradation rates) that are hard to spot manually.
Practical Implications
- Developer‑ready API – The pipelines can be wrapped as micro‑services (vector search via FAISS, graph queries via Neo4j or a lightweight RDF store) and called from any language model backend.
- Accelerated R&D – Materials scientists can query the assistant to quickly compare synthesis routes, property trends, or regulatory data without digging through dozens of PDFs.
- Trustworthy AI – By forcing the LLM to cite exact paragraphs or graph nodes, the system mitigates hallucinations—a critical requirement for scientific decision‑making.
- Domain Transferability – The same architecture can be repurposed for other materials domains (e.g., battery electrolytes, metal alloys) by swapping the corpus and updating the entity schema.
- Cost Efficiency – Because the heavy lifting is done by a relatively small, open‑source LLM (e.g., LLaMA‑2) plus local retrieval, organizations can avoid expensive API calls to proprietary models while still delivering high‑quality answers.
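The citation‑forcing behavior described under "Trustworthy AI" is commonly implemented at the prompt level: number each retrieved passage and require the model to cite by id. A minimal sketch (the wording is illustrative, not the paper's actual prompt):

```python
def build_grounded_prompt(question, passages):
    """Number the retrieved passages and instruct the model to cite them,
    so every generated claim traces back to a source paragraph or node."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using ONLY the numbered passages below. "
        "After every claim, cite the supporting passage like [2]. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because the ids in the answer can be checked against the retrieved set, this style of prompting also makes citation correctness straightforward to audit.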
Limitations & Future Work
- Coverage Gaps – The knowledge graph depends on the quality of entity extraction; rare or newly coined terms may be missed, limiting GraphRAG’s recall.
- Scalability – While the current corpus is ~1 k papers, scaling to millions of documents will require more sophisticated indexing and distributed graph storage.
- Dynamic Updates – Incorporating newly published papers in near‑real time remains an open challenge; the authors suggest incremental embedding and graph‑update pipelines.
- User Interaction – The system currently offers a single‑turn answer; future work includes multi‑turn dialogue and interactive graph exploration tools.
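The incremental‑update idea under "Dynamic Updates" amounts to embedding newly published papers and appending them, with no full index rebuild. A toy sketch (the class name and byte‑seeded encoder are mine; a real system would call FAISS's `add` on a persistent index and upsert new triples into the graph):

```python
import numpy as np

def toy_embed(texts, dim=32):
    """Deterministic stand-in encoder: unit vectors seeded by text bytes."""
    out = np.empty((len(texts), dim), dtype="float32")
    for i, t in enumerate(texts):
        rng = np.random.default_rng(sum(t.encode()))
        v = rng.standard_normal(dim)
        out[i] = v / np.linalg.norm(v)
    return out

class IncrementalIndex:
    """Append-only vector store: new papers are embedded and added as
    they are published, without rebuilding existing entries."""
    def __init__(self, embed_fn=toy_embed):
        self.embed_fn = embed_fn
        self.texts = []
        self._vecs = []  # one embedding row per paragraph

    def add_paper(self, paragraphs):
        self.texts.extend(paragraphs)
        self._vecs.extend(self.embed_fn(paragraphs))

    def search(self, query, k=3):
        scores = np.vstack(self._vecs) @ self.embed_fn([query])[0]
        return [self.texts[i] for i in np.argsort(-scores)[:k]]

idx = IncrementalIndex()
idx.add_paper(["PHB degrades in soil.", "PHBV blends resist hydrolysis."])
idx.add_paper(["A newly published paragraph on PHA composites."])
```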
Bottom line: By marrying dense vector retrieval with a domain‑aware knowledge graph, this work shows a practical path toward trustworthy, literature‑driven AI assistants for polymer science—and offers a reusable blueprint for any tech team looking to embed expert knowledge into their products.
Authors
- Sonakshi Gupta
- Akhlak Mahmood
- Wei Xiong
- Rampi Ramprasad
Paper Information
- arXiv ID: 2602.16650v1
- Categories: cs.CE, cs.AI
- Published: February 18, 2026