[Paper] Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System

Published: February 18, 2026
4 min read
Source: arXiv - 2602.16650v1

Overview

The paper presents a retrieval‑augmented generation (RAG) system that turns the massive, unstructured polymer literature into a usable expert assistant. By pairing large language models with two custom retrieval pipelines—one based on dense vector similarity (VectorRAG) and another on a structured knowledge graph (GraphRAG)—the authors demonstrate how to answer complex, cross‑study questions about biodegradable polymers (specifically polyhydroxyalkanoates, PHA) with citations and traceable evidence.

Key Contributions

  • Two domain‑specific RAG pipelines:
    • VectorRAG: dense paragraph embeddings for high‑recall retrieval.
    • GraphRAG: a canonicalized knowledge graph enabling entity disambiguation and multi‑hop reasoning.
  • Curated corpus of >1,000 PHA papers with paragraph‑level embeddings and a graph that normalizes polymer terminology.
  • Comprehensive evaluation against standard retrieval metrics, commercial LLMs (GPT, Gemini), and expert chemist validation.
  • Demonstration of trade‑offs: GraphRAG yields higher precision and interpretability; VectorRAG offers broader coverage.
  • Open‑source‑friendly framework that reduces dependence on proprietary models while ensuring every generated claim is backed by a literature citation.

Methodology

  1. Corpus Construction – The authors scraped and cleaned the full text of 1,000+ peer‑reviewed PHA papers, splitting them into logical paragraphs.
  2. Embedding Layer (VectorRAG) – Each paragraph was encoded with a domain‑fine‑tuned transformer to produce dense vectors. Approximate nearest‑neighbor indexing (FAISS) enables fast similarity search.
  3. Graph Construction (GraphRAG) – Named entities (polymers, monomers, synthesis methods, properties) were extracted, canonicalized, and linked into a heterogeneous graph (nodes = entities, edges = relationships like “catalyzes”, “has degradation rate”).
  4. Retrieval + Generation Loop
    • A user query is first processed by the LLM to decide whether to use vector search, graph traversal, or both.
    • Retrieved paragraphs (VectorRAG) or sub‑graphs (GraphRAG) are fed as context to the LLM, which then generates an answer and automatically inserts citations pointing to the source paragraphs/nodes.
  5. Evaluation – Retrieval quality measured with precision/recall, relevance judged by a polymer chemist, and comparison against off‑the‑shelf LLMs that lack domain‑specific retrieval.
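The VectorRAG retrieval step (2 above) can be sketched in miniature. Here a toy bag-of-words scorer stands in for the paper's domain-fine-tuned transformer encoder, and brute-force cosine similarity stands in for the FAISS index; the corpus and query are invented for illustration:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a transformer encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, paragraphs, k=3):
    # Brute-force nearest-neighbor search standing in for FAISS.
    q = embed(query)
    scored = [(cosine(q, embed(p)), i, p) for i, p in enumerate(paragraphs)]
    scored.sort(reverse=True)
    return [(i, p) for _, i, p in scored[:k]]

paragraphs = [
    "PHB degrades faster in marine environments than PLA.",
    "Enzymatic synthesis of PHA copolymers improves yield.",
    "Crystallinity of PHB slows its biodegradation rate.",
]
hits = top_k("How does crystallinity affect PHB degradation?", paragraphs, k=2)
```

In the real pipeline the dense vectors come from the fine-tuned encoder and the search runs through an approximate-nearest-neighbor FAISS index, but the retrieve-then-rank shape is the same.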

Results & Findings

| Metric | VectorRAG | GraphRAG | Baseline GPT‑4 (no retrieval) |
| --- | --- | --- | --- |
| Recall (top‑10) | 0.78 | 0.62 | 0.41 |
| Precision (top‑10) | 0.61 | 0.84 | 0.48 |
| Human‑rated relevance (1–5) | 4.1 | 4.5 | 3.6 |
| Citation correctness | 71 % | 89 % | 45 % |

  • GraphRAG excels at delivering precise, traceable answers because the graph enforces consistent terminology and enables multi‑step logical hops (e.g., “PHA synthesized with enzyme X → higher crystallinity → slower degradation”).
  • VectorRAG captures a wider set of relevant paragraphs, useful when the query is broad or when the graph lacks a specific relation.
  • Expert chemists confirmed that the system’s answers were well‑grounded, often surfacing patterns (e.g., correlations between monomer composition and biodegradation rates) that are hard to spot manually.
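The kind of multi-hop chain GraphRAG exploits can be illustrated with a breadth-first traversal over a toy graph. The nodes, relation labels, and query endpoints below are invented for illustration; the paper's graph is far larger and canonicalized:

```python
from collections import deque

# Tiny heterogeneous graph: node -> list of (relation, neighbor).
graph = {
    "enzyme X": [("catalyzes", "PHA synthesis")],
    "PHA synthesis": [("yields", "high-crystallinity PHA")],
    "high-crystallinity PHA": [("has degradation rate", "slow degradation")],
}

def multi_hop(graph, start, goal):
    """Breadth-first search returning the chain of (node, relation, node) hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no path between the two entities

chain = multi_hop(graph, "enzyme X", "slow degradation")
```

Each hop in the returned chain carries its relation label, which is what lets the system hand the LLM an interpretable evidence trail rather than a bag of loosely related paragraphs.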

Practical Implications

  • Developer‑ready API – The pipelines can be wrapped as micro‑services (vector search via FAISS, graph queries via Neo4j or a lightweight RDF store) and called from any language model backend.
  • Accelerated R&D – Materials scientists can query the assistant to quickly compare synthesis routes, property trends, or regulatory data without digging through dozens of PDFs.
  • Trustworthy AI – By forcing the LLM to cite exact paragraphs or graph nodes, the system mitigates hallucinations—a critical requirement for scientific decision‑making.
  • Domain Transferability – The same architecture can be repurposed for other materials domains (e.g., battery electrolytes, metal alloys) by swapping the corpus and updating the entity schema.
  • Cost Efficiency – Because the heavy lifting is done by a relatively small, open‑source LLM (e.g., LLaMA‑2) plus local retrieval, organizations can avoid expensive API calls to proprietary models while still delivering high‑quality answers.
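A micro-service wrapping these pipelines needs a dispatch step deciding which retriever to call. As a sketch only: the keyword router below stands in for the LLM's vector-vs-graph decision, and the cue list and function name are assumptions, not from the paper:

```python
def route(query: str) -> str:
    """Toy stand-in for the LLM's tool-choice step: queries phrased around
    relationships go to the graph, broad queries to the vector index."""
    graph_cues = ("relationship", "pathway", "leads to", "effect of")
    if any(cue in query.lower() for cue in graph_cues):
        return "graph"
    return "vector"

decisions = [
    route("What is the effect of enzyme X on PHB degradation?"),
    route("Summarize recent PHA biodegradation findings"),
]
```

In production this decision would be made by the LLM itself (or by both retrievers running in parallel, with the contexts merged), but isolating it as one small function is what makes the pipelines easy to expose as independent services.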

Limitations & Future Work

  • Coverage Gaps – The knowledge graph depends on the quality of entity extraction; rare or newly coined terms may be missed, limiting GraphRAG’s recall.
  • Scalability – While the current corpus is ~1 k papers, scaling to millions of documents will require more sophisticated indexing and distributed graph storage.
  • Dynamic Updates – Incorporating newly published papers in near‑real time remains an open challenge; the authors suggest incremental embedding and graph‑update pipelines.
  • User Interaction – The system currently offers a single‑turn answer; future work includes multi‑turn dialogue and interactive graph exploration tools.

Bottom line: By marrying dense vector retrieval with a domain‑aware knowledge graph, this work shows a practical path toward trustworthy, literature‑driven AI assistants for polymer science—and offers a reusable blueprint for any tech team looking to embed expert knowledge into their products.

Authors

  • Sonakshi Gupta
  • Akhlak Mahmood
  • Wei Xiong
  • Rampi Ramprasad

Paper Information

  • arXiv ID: 2602.16650v1
  • Categories: cs.CE, cs.AI
  • Published: February 18, 2026