[Paper] Large Causal Models from Large Language Models

Published: December 8, 2025 at 01:28 PM EST
4 min read
Source: arXiv - 2512.07796v1

Overview

The paper proposes a fresh way to construct large causal models (LCMs) by tapping into the knowledge baked into today’s large language models (LLMs). The authors showcase a prototype system—DEMOCRITUS—that automatically extracts, organizes, and visualizes causal relationships across widely different domains, turning raw textual output from an LLM into a structured, queryable causal graph.

Key Contributions

  • DEMOCRITUS pipeline: A six‑module end‑to‑end system that turns natural‑language causal statements from an LLM into relational triples and embeds them in a unified causal graph.
  • Domain‑agnostic extraction: Demonstrates that a single high‑quality LLM can generate plausible causal questions and answers for fields as diverse as archaeology, climate science, and software engineering.
  • Categorical ML techniques: Introduces novel category‑theoretic machine‑learning tools for reconciling conflicting or ambiguous causal claims and stitching them into a coherent model.
  • Scalability analysis: Provides a detailed computational cost profile, pinpointing the current bottlenecks (e.g., LLM prompting latency, triple consolidation) and offering guidance for scaling to larger models.
  • Cross‑domain case studies: Presents empirical results on dozens of domains, illustrating how the system can surface non‑obvious causal links that would be hard to discover via traditional hypothesis‑driven experiments.

Methodology

  1. Topic & Question Generation – DEMOCRITUS prompts a high‑capacity LLM (e.g., a GPT‑4‑class model) to suggest relevant topics and formulate causal “what‑if” questions for each topic.
  2. Causal Statement Extraction – The LLM answers each question, producing natural‑language causal statements (e.g., “Increasing atmospheric CO₂ → higher average global temperature”).
  3. Triple Conversion – A lightweight parser converts each statement into a (cause, effect, relation) triple, normalizing terminology via synonym dictionaries and embeddings.
  4. Conflict Resolution & Integration – Using categorical constructions (e.g., pushouts and pullbacks), the system detects overlapping or contradictory triples and merges them into a consistent graph structure (a simplified version is sketched after this list).
  5. Embedding & Storage – The resulting causal graph is embedded in a vector space for fast similarity search and stored in a graph database that supports provenance tracking.
  6. Visualization & Interaction – A web UI lets users explore the causal network, filter by domain, and drill down to the original LLM‑generated evidence.
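
To make steps 3 and 4 concrete, here is a minimal Python sketch of one way statement‑to‑triple conversion and merging could look. The arrow format, synonym table, and function names are illustrative assumptions rather than the paper's implementation, and the simple label canonicalization only gestures at the categorical (pushout‑style) merging DEMOCRITUS performs:

```python
import re
from collections import defaultdict

# Hypothetical synonym table standing in for the paper's
# embedding-based terminology normalization (step 3).
SYNONYMS = {
    "co2": "atmospheric CO2",
    "global temperature": "average global temperature",
}

def normalize(term: str) -> str:
    """Map a surface form to a canonical node label."""
    key = term.strip().lower()
    return SYNONYMS.get(key, term.strip())

def parse_statement(statement: str):
    """Parse 'X -> Y' style statements into a (cause, effect, relation) triple.

    Assumes the LLM was prompted to answer in arrow notation; real output
    would need a more robust parser or a second LLM pass.
    """
    match = re.match(r"(.+?)\s*(?:->|→)\s*(.+)", statement)
    if match is None:
        return None
    cause, effect = match.groups()
    return (normalize(cause), normalize(effect), "causes")

def merge_triples(triples) -> dict:
    """Glue triples whose labels canonicalize to the same node.

    A crude stand-in for the paper's categorical merging: nodes identified
    by normalization are merged, and duplicate edges collapse to one.
    """
    graph = defaultdict(set)
    for triple in triples:
        if triple is not None:
            cause, effect, relation = triple
            graph[cause].add((relation, effect))
    return dict(graph)

statements = [
    "atmospheric CO2 → average global temperature",
    "co2 -> global temperature",  # same claim, different surface forms
]
print(merge_triples(parse_statement(s) for s in statements))
# both statements collapse onto a single edge after normalization
```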

The pipeline is deliberately modular, allowing developers to swap in alternative LLMs, parsers, or graph back‑ends without redesigning the whole system.
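
That modularity can be pictured as a handful of narrow interfaces between stages. The sketch below is our own illustration of the idea, not the paper's API; the class and method names are assumptions:

```python
from typing import Iterable, Protocol, Tuple

Triple = Tuple[str, str, str]  # (cause, effect, relation)

class QuestionGenerator(Protocol):
    def questions(self, topic: str) -> Iterable[str]: ...

class CausalExtractor(Protocol):
    def answer(self, question: str) -> str: ...

class TripleParser(Protocol):
    def parse(self, statement: str) -> Iterable[Triple]: ...

class GraphStore(Protocol):
    def add(self, triple: Triple, provenance: str) -> None: ...

def run_pipeline(topic: str,
                 qgen: QuestionGenerator,
                 llm: CausalExtractor,
                 parser: TripleParser,
                 store: GraphStore) -> None:
    """Wire the stages together; any component behind a protocol can be
    swapped (a different LLM, parser, or graph back-end) independently."""
    for question in qgen.questions(topic):
        statement = llm.answer(question)
        for triple in parser.parse(statement):
            store.add(triple, provenance=statement)
```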

Results & Findings

  • Coverage: Across 12 test domains, DEMOCRITUS generated an average of 1,200 causal triples per domain, with a precision of ~78 % (validated by domain experts).
  • Cross‑domain insights: The system uncovered unexpected causal bridges, such as “soil microbiome diversity → crop yield → regional economic stability,” linking biology and economics.
  • Performance: The end‑to‑end runtime for a medium‑sized domain (≈500 queries) was ~45 minutes on a single GPU node; the biggest bottleneck was the LLM inference latency, not the graph integration step.
  • Scalability trends: Doubling the number of LLM queries roughly doubled total runtime, but the graph consolidation phase scaled sub‑linearly thanks to the categorical merging algorithm; a rough cost model below illustrates the implied constants.
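
As a back‑of‑envelope check on these numbers: 45 minutes for roughly 500 queries works out to about 5.4 seconds per LLM call, consistent with inference latency dominating. The tiny Python model below encodes that reading; the per‑query latency and the sub‑linear consolidation exponent are assumed constants fitted to the reported figures, not values from the paper:

```python
def estimated_runtime_minutes(n_queries: int,
                              sec_per_query: float = 5.4,
                              consolidation_coeff: float = 0.01,
                              consolidation_exp: float = 0.8) -> float:
    """Back-of-envelope cost model implied by the reported figures.

    ~45 min for ~500 queries gives roughly 5.4 s per LLM call; an
    exponent below 1 encodes the reported sub-linear scaling of the
    consolidation phase. All constants are assumptions.
    """
    llm_minutes = n_queries * sec_per_query / 60.0
    consolidation_minutes = consolidation_coeff * n_queries ** consolidation_exp
    return llm_minutes + consolidation_minutes

print(round(estimated_runtime_minutes(500)))   # ~46 min, near the reported 45
print(round(estimated_runtime_minutes(1000)))  # ~93 min: roughly double, LLM latency dominates
```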

Practical Implications

  • Rapid knowledge graph bootstrapping – Developers can use DEMOCRITUS‑style pipelines to auto‑populate causal knowledge bases for recommendation engines, risk analysis tools, or decision‑support systems without hand‑curating every relationship.
  • Explainable AI – By exposing a structured causal graph behind model predictions, teams can generate human‑readable “why” explanations that go beyond feature importance scores (see the sketch after this list).
  • Cross‑disciplinary product design – Engineers building IoT platforms, climate‑impact simulators, or health‑tech apps can quickly surface causal dependencies that span hardware, environment, and user behavior, informing more robust system architectures.
  • Continuous learning loops – The modular design enables a “listen‑and‑learn” cycle where new textual data (e.g., incident reports, research papers) are fed to the LLM, automatically updating the causal model in production.
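
As one hedged illustration of the bootstrapping and explainability points above (the paper does not prescribe a query API), a populated causal graph can be traversed with off‑the‑shelf tooling to produce human‑readable “why” chains:

```python
import networkx as nx

# Toy graph seeded with the cross-domain example from the results above;
# in practice the edges would come from the extraction pipeline.
g = nx.DiGraph()
g.add_edge("soil microbiome diversity", "crop yield", relation="causes")
g.add_edge("crop yield", "regional economic stability", relation="causes")

def why(graph: nx.DiGraph, cause: str, effect: str) -> str:
    """Return a human-readable causal chain from cause to effect, if one exists."""
    try:
        path = nx.shortest_path(graph, source=cause, target=effect)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return f"No causal chain found from {cause!r} to {effect!r}."
    return " → ".join(path)

print(why(g, "soil microbiome diversity", "regional economic stability"))
# soil microbiome diversity → crop yield → regional economic stability
```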

Limitations & Future Work

  • Reliance on LLM quality – The accuracy of extracted causal statements hinges on the LLM’s factual grounding; hallucinations can propagate into the graph.
  • Ambiguity handling – While categorical merging mitigates conflicts, nuanced causal directionality (e.g., bidirectional feedback loops) remains challenging to capture automatically.
  • Scalability bottlenecks – LLM inference cost dominates runtime; future work will explore retrieval‑augmented generation and model distillation to reduce latency.
  • Evaluation depth – Current validation uses expert spot‑checks; a larger‑scale benchmark with ground‑truth causal datasets is needed to quantify recall and long‑term stability.
  • Interactive refinement – Plans include a human‑in‑the‑loop UI where domain experts can approve, edit, or reject triples, feeding corrections back into the LLM prompting strategy.

Authors

  • Sridhar Mahadevan

Paper Information

  • arXiv ID: 2512.07796v1
  • Categories: cs.AI
  • Published: December 8, 2025
  • PDF: https://arxiv.org/pdf/2512.07796v1