[Paper] Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems
Source: arXiv - 2604.20795v1
Overview
The paper proposes a hybrid AI architecture that couples large language models (LLMs) with an external, structured ontology stored as an RDF/OWL knowledge graph. By automatically building and continuously updating this graph from documents, APIs, and dialogue logs, the system gives LLMs a persistent, verifiable memory layer that boosts multi‑step reasoning, planning, and explainability.
Key Contributions
- Automated ontology pipeline: end‑to‑end extraction (entity & relation detection, normalization, triple generation) from heterogeneous sources, followed by SHACL/OWL validation.
- Hybrid inference engine: combines traditional vector‑based retrieval‑augmented generation (RAG) with graph‑based reasoning and tool use during LLM prompting.
- Generation‑Verification‑Correction loop: outputs are checked against ontology constraints, enabling automatic correction or rejection of invalid results.
- Empirical validation: demonstrates measurable gains on classic planning benchmarks (e.g., Tower of Hanoi) and on tasks requiring long‑term, structured knowledge.
- Blueprint for real‑world agents: outlines how the architecture can be plugged into robotics, enterprise assistants, and autonomous software agents that need reliable, explainable decisions.
Methodology
1. Data Ingestion – The system pulls raw material from three channels:
- Unstructured text (PDFs, web pages)
- Structured API specifications (OpenAPI, GraphQL)
- Conversational logs (chat transcripts, voice‑assistant interactions)
2. Information Extraction – A fine‑tuned LLM (or a dedicated NER/RE model) tags entities and relations, then normalizes them to a shared schema (e.g., using CURIEs).
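The normalization step can be illustrated with a minimal CURIE compaction helper (the prefix map and namespace IRIs below are assumptions for illustration, not the paper's actual schema):

```python
# Illustrative prefix map: full namespace IRI -> CURIE prefix (assumed, not from the paper).
PREFIXES = {
    "http://example.org/ontology/": "ex",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
}

def to_curie(iri: str) -> str:
    """Compact a full IRI into prefix:local_name form; pass through unchanged if no prefix matches."""
    for base, prefix in PREFIXES.items():
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    return iri
```

In a real pipeline this mapping would come from the ontology's declared namespaces rather than a hard-coded dictionary.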
3. Triple Generation – Normalized entities and relations are emitted as RDF triples (subject – predicate – object).
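Emitting triples in a standard serialization such as N‑Triples is straightforward; here is a toy sketch (the entity IRIs are hypothetical):

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI tuples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

# Hypothetical extracted fact: a disk resting on a peg.
facts = [("http://example.org/DiskA",
          "http://example.org/isOn",
          "http://example.org/Peg1")]
```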
4. Ontology Construction & Validation:
- The triples are merged into an OWL ontology.
- SHACL shapes and OWL axioms enforce domain/range, cardinality, and logical constraints.
- Invalid triples are either rejected or sent back for re‑generation.
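Full SHACL validation requires a dedicated engine (e.g., pySHACL), but the kind of domain and cardinality checks described above can be sketched with a toy validator (the constraint table and type map are illustrative assumptions):

```python
from collections import Counter

# Toy constraint table (assumed): predicate -> (required subject type, max cardinality per subject).
CONSTRAINTS = {
    "isOn": ("Disk", 1),  # a disk may rest on at most one peg
}

def validate(triples, types):
    """Return the triples that violate a domain or cardinality constraint."""
    violations = []
    counts = Counter((s, p) for s, p, _ in triples)
    for s, p, o in triples:
        if p not in CONSTRAINTS:
            continue  # unconstrained predicates always pass
        domain, max_card = CONSTRAINTS[p]
        if types.get(s) != domain or counts[(s, p)] > max_card:
            violations.append((s, p, o))
    return violations
```

Rejected triples would then be dropped or, as the paper describes, sent back to the extraction stage for re‑generation.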
5. Hybrid Retrieval at Inference Time – When a user query arrives:
- A vector store returns top‑k relevant passages (RAG).
- A SPARQL engine fetches related graph sub‑structures.
- Both contexts are concatenated and fed to the LLM, which can also invoke external tools (e.g., planners, calculators).
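The concatenation of the two retrieval paths might look roughly like this sketch, where `vector_store` and `graph` stand in for a real vector index and SPARQL endpoint (both are placeholders, not the paper's API):

```python
def build_context(query, vector_store, graph):
    """Merge top-k RAG passages and graph facts into one prompt context for the LLM."""
    passages = vector_store(query)                        # stand-in for top-k vector search
    facts = [f"{s} {p} {o}" for s, p, o in graph(query)]  # stand-in for a SPARQL lookup
    return "PASSAGES:\n" + "\n".join(passages) + "\nFACTS:\n" + "\n".join(facts)
```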
6. Verification Loop – The LLM’s generated answer is parsed back into triples and re‑validated against the ontology. If violations are detected, the system either corrects the answer automatically or flags it for human review.
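The overall generation–verification–correction loop can be sketched schematically; the `generate`, `parse_triples`, and `validate` hooks below are placeholders for the paper's actual components:

```python
def verified_answer(query, generate, parse_triples, validate, max_retries=2):
    """Regenerate until the parsed answer passes ontology validation, else flag for human review."""
    answer = None
    for attempt in range(max_retries + 1):
        answer = generate(query, attempt)
        if not validate(parse_triples(answer)):  # empty violation list means valid
            return answer, "accepted"
    return answer, "needs_human_review"
```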
Results & Findings
| Metric | Baseline LLM (RAG only) | Hybrid LLM + Ontology |
|---|---|---|
| Success rate on Tower of Hanoi (≤ 7 disks) | 62 % | 84 % |
| Average planning steps error | 1.9 steps | 0.6 steps |
| Ontology‑based validation pass rate | 71 % (post‑hoc) | 96 % |
| Latency increase (per query) | — | + 120 ms (due to SPARQL lookup) |
What it means: Adding a verified knowledge graph reduces hallucinations and improves the LLM’s ability to keep track of objects and constraints across many reasoning steps. The modest latency overhead is outweighed by the gain in reliability and explainability.
Practical Implications
- Enterprise AI assistants can reference a single source of truth (the ontology) for product catalogs, compliance rules, or internal processes, sharply reducing the risk that generated advice violates policy.
- Robotics & automation: planners can query the graph for object affordances, safety constraints, or workspace layouts, enabling safer task execution without hard‑coding every rule.
- Developer tooling: IDE plugins could auto‑populate a project’s knowledge graph from code, documentation, and issue trackers, letting LLM‑based code assistants reason about API contracts and dependency graphs.
- Explainability & auditability: Every answer can be traced back to the specific triples that justified it, satisfying regulatory requirements in finance, healthcare, and legal tech.
- Scalable long‑term memory: Unlike pure RAG, the graph persists across sessions, allowing agents to accumulate and refine knowledge over weeks or months without re‑training the LLM.
Limitations & Future Work
- Ontology quality depends on extraction accuracy; noisy source data can still propagate errors despite SHACL checks.
- The current pipeline assumes a relatively static schema; rapid schema evolution (e.g., micro‑service churn) may require more dynamic alignment mechanisms.
- Scalability: SPARQL queries on very large graphs can become a bottleneck; the authors suggest incremental indexing and graph partitioning as next steps.
- Generalization: Experiments focus on planning benchmarks; broader evaluation on open‑domain QA, code generation, or multimodal tasks remains open.
The authors plan to explore self‑supervised ontology refinement, tighter integration with LLM‑based tool use (e.g., function calling), and real‑world deployments in warehouse robotics and compliance‑heavy enterprise settings.
Authors
- Pavel Salovskii
- Iuliia Gorshkova
Paper Information
- arXiv ID: 2604.20795v1
- Categories: cs.AI
- Published: April 22, 2026