[Paper] Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System
Source: arXiv - 2602.16650v1
Overview
The paper presents a retrieval‑augmented generation (RAG) system that turns the massive, unstructured polymer literature into a usable expert assistant. By pairing large language models with two custom retrieval pipelines—one based on dense vector similarity (VectorRAG) and another on a structured knowledge graph (GraphRAG)—the authors demonstrate how to answer complex, cross‑study questions about biodegradable polymers (specifically polyhydroxyalkanoates, PHA) with citations and traceable evidence.
Key Contributions
- Two domain‑specific RAG pipelines:
- VectorRAG: dense paragraph embeddings for high‑recall retrieval.
- GraphRAG: a canonicalized knowledge graph enabling entity disambiguation and multi‑hop reasoning.
- Curated corpus of >1,000 PHA papers with paragraph‑level embeddings and a graph that normalizes polymer terminology.
- Comprehensive evaluation against standard retrieval metrics, commercial LLMs (GPT, Gemini), and expert chemist validation.
- Demonstration of trade‑offs: GraphRAG yields higher precision and interpretability; VectorRAG offers broader coverage.
- Open‑source‑friendly framework that reduces dependence on proprietary models while ensuring every generated claim is backed by a literature citation.
Methodology
- Corpus Construction – The authors scraped and cleaned the full text of 1,000+ peer‑reviewed PHA papers, splitting them into logical paragraphs.
- Embedding Layer (VectorRAG) – Each paragraph was encoded with a domain‑fine‑tuned transformer to produce dense vectors. Approximate nearest‑neighbor indexing (FAISS) enables fast similarity search.
- Graph Construction (GraphRAG) – Named entities (polymers, monomers, synthesis methods, properties) were extracted, canonicalized, and linked into a heterogeneous graph (nodes = entities, edges = relationships like “catalyzes”, “has degradation rate”).
- Retrieval + Generation Loop –
- A user query is first processed by the LLM to decide whether to use vector search, graph traversal, or both.
- Retrieved paragraphs (VectorRAG) or sub‑graphs (GraphRAG) are fed as context to the LLM, which then generates an answer and automatically inserts citations pointing to the source paragraphs/nodes.
- Evaluation – Retrieval quality was measured with precision/recall, answer relevance was rated by a polymer chemist, and outputs were compared against off‑the‑shelf LLMs that lack domain‑specific retrieval.
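The two retrieval paths above can be sketched in miniature. Everything here is illustrative: the toy paragraphs, the byte-seeded stand-in encoder, and the hand-built triples are mine, not the paper's (which uses a fine-tuned transformer, FAISS indexing, and an extracted knowledge graph at corpus scale).

```python
import numpy as np

# Toy paragraph store (stands in for the >1,000-paper PHA corpus).
paragraphs = [
    "PHB films degraded within 60 days in marine sediment.",
    "Adding 3HV comonomer lowered PHA crystallinity.",
    "Enzymatic synthesis with PhaC improved molecular weight.",
]

def embed(texts, dim=64):
    """Stand-in for the domain-fine-tuned encoder: deterministic
    pseudo-random unit vectors seeded by the text bytes."""
    out = np.empty((len(texts), dim), dtype="float32")
    for i, t in enumerate(texts):
        rng = np.random.default_rng(sum(t.encode()))
        v = rng.standard_normal(dim)
        out[i] = v / np.linalg.norm(v)
    return out

doc_vecs = embed(paragraphs)

def vector_retrieve(query, k=2):
    """VectorRAG path: cosine similarity over paragraph embeddings
    (the paper uses FAISS; brute-force dot products stand in here)."""
    scores = doc_vecs @ embed([query])[0]
    top = np.argsort(-scores)[:k]
    return [(paragraphs[i], float(scores[i])) for i in top]

# GraphRAG path: canonicalized entities linked by typed relations.
edges = {
    "enzyme X": [("catalyzes", "PHA-1")],
    "PHA-1": [("has property", "high crystallinity")],
    "high crystallinity": [("implies", "slow degradation")],
}

def multi_hop(start, max_hops=3):
    """Collect relation chains reachable from an entity, enabling the
    multi-step reasoning that dense retrieval alone cannot express."""
    chains, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        for rel, nxt in edges.get(node, []):
            chain = path + [rel, nxt]
            chains.append(chain)
            if (len(chain) - 1) // 2 < max_hops:
                stack.append((nxt, chain))
    return chains
```

In the paper's retrieval‑generation loop, the LLM itself decides whether to call the vector path, the graph path, or both; a router over `vector_retrieve` and `multi_hop` would play that role here.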
Results & Findings
| Metric | VectorRAG | GraphRAG | Baseline GPT‑4 (no retrieval) |
|---|---|---|---|
| Recall (top‑10) | 0.78 | 0.62 | 0.41 |
| Precision (top‑10) | 0.61 | 0.84 | 0.48 |
| Human‑rated relevance (1‑5) | 4.1 | 4.5 | 3.6 |
| Citation correctness | 71% | 89% | 45% |
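The top‑10 precision and recall figures follow the standard definitions; a minimal sketch (the function name and toy document ids are mine, not the paper's):

```python
def precision_recall_at_k(retrieved, relevant, k=10):
    """precision@k: fraction of the top-k results that are relevant.
    recall@k: fraction of all relevant documents found in the top-k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 3 retrieved docs are relevant, and 2 of 3 relevant docs found:
p, r = precision_recall_at_k(["p1", "p3", "p7"], {"p1", "p7", "p9"}, k=3)
# both p and r are 2/3 here
```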
- GraphRAG excels at delivering precise, traceable answers because the graph enforces consistent terminology and enables multi‑step logical hops (e.g., “PHA synthesized with enzyme X → higher crystallinity → slower degradation”).
- VectorRAG captures a wider set of relevant paragraphs, useful when the query is broad or when the graph lacks a specific relation.
- Expert chemists confirmed that the system’s answers were well‑grounded, often surfacing patterns (e.g., correlations between monomer composition and biodegradation rates) that are hard to spot manually.
Practical Implications
- Developer‑ready API – The pipelines can be wrapped as micro‑services (vector search via FAISS, graph queries via Neo4j or a lightweight RDF store) and called from any language model backend.
- Accelerated R&D – Materials scientists can query the assistant to quickly compare synthesis routes, property trends, or regulatory data without digging through dozens of PDFs.
- Trustworthy AI – By forcing the LLM to cite exact paragraphs or graph nodes, the system mitigates hallucinations—a critical requirement for scientific decision‑making.
- Domain Transferability – The same architecture can be repurposed for other materials domains (e.g., battery electrolytes, metal alloys) by swapping the corpus and updating the entity schema.
- Cost Efficiency – Because the heavy lifting is done by a relatively small, open‑source LLM (e.g., LLaMA‑2) plus local retrieval, organizations can avoid expensive API calls to proprietary models while still delivering high‑quality answers.
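The citation‑forcing behavior described under "Trustworthy AI" is commonly implemented at the prompt level: number each retrieved passage and require the model to cite by id. A minimal sketch (the wording is illustrative, not the paper's actual prompt):

```python
def build_grounded_prompt(question, passages):
    """Number the retrieved passages and instruct the model to cite them,
    so every generated claim traces back to a source paragraph or node."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using ONLY the numbered passages below. "
        "After every claim, cite the supporting passage like [2]. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because the ids in the answer can be checked against the retrieved set, this style of prompting also makes citation correctness straightforward to audit.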
Limitations & Future Work
- Coverage Gaps – The knowledge graph depends on the quality of entity extraction; rare or newly coined terms may be missed, limiting GraphRAG’s recall.
- Scalability – While the current corpus is ~1 k papers, scaling to millions of documents will require more sophisticated indexing and distributed graph storage.
- Dynamic Updates – Incorporating newly published papers in near‑real time remains an open challenge; the authors suggest incremental embedding and graph‑update pipelines.
- User Interaction – The system currently offers a single‑turn answer; future work includes multi‑turn dialogue and interactive graph exploration tools.
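The incremental‑update idea under "Dynamic Updates" amounts to embedding newly published papers and appending them, with no full index rebuild. A toy sketch (the class name and byte‑seeded encoder are mine; a real system would call FAISS's `add` on a persistent index and upsert new triples into the graph):

```python
import numpy as np

def toy_embed(texts, dim=32):
    """Deterministic stand-in encoder: unit vectors seeded by text bytes."""
    out = np.empty((len(texts), dim), dtype="float32")
    for i, t in enumerate(texts):
        rng = np.random.default_rng(sum(t.encode()))
        v = rng.standard_normal(dim)
        out[i] = v / np.linalg.norm(v)
    return out

class IncrementalIndex:
    """Append-only vector store: new papers are embedded and added as
    they are published, without rebuilding existing entries."""
    def __init__(self, embed_fn=toy_embed):
        self.embed_fn = embed_fn
        self.texts = []
        self._vecs = []  # one embedding row per paragraph

    def add_paper(self, paragraphs):
        self.texts.extend(paragraphs)
        self._vecs.extend(self.embed_fn(paragraphs))

    def search(self, query, k=3):
        scores = np.vstack(self._vecs) @ self.embed_fn([query])[0]
        return [self.texts[i] for i in np.argsort(-scores)[:k]]

idx = IncrementalIndex()
idx.add_paper(["PHB degrades in soil.", "PHBV blends resist hydrolysis."])
idx.add_paper(["A newly published paragraph on PHA composites."])
```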
Bottom line: By marrying dense vector retrieval with a domain‑aware knowledge graph, this work shows a practical path toward trustworthy, literature‑driven AI assistants for polymer science—and offers a reusable blueprint for any tech team looking to embed expert knowledge into their products.
Authors
- Sonakshi Gupta
- Akhlak Mahmood
- Wei Xiong
- Rampi Ramprasad
Paper Information
- arXiv ID: 2602.16650v1
- Categories: cs.CE, cs.AI
- Published: February 18, 2026