[Paper] COMPOSE: Composing Future Theorems from Citations and Formal Structure
Source: arXiv - 2605.30333v1
Overview
COMPOSE asks whether a model can automatically suggest plausible theorems that researchers might prove next. The authors condition a language model on two complementary sources of information: a paper’s citation network and the formal dependency graph of existing theorems. They report that this produces conjectures that are both aware of the research context and consistent with the formal dependencies.
Key Contributions
- Grounded future‑theorem generation: Introduces a new task that requires a model to respect both scientific citation trends and formal theorem dependencies.
- Dual‑graph conditioning architecture (COMPOSE): A novel framework that feeds a language model with (1) a citation graph (who cites whom) and (2) a formal dependency graph (which lemmas/theorems each result builds on).
- Large‑scale dataset: Curates 108 K paired scientific–formal graph examples from arXiv papers and the Mathlib library, plus a benchmark of 47 K “future” papers (published in 2024‑2025).
- Strong empirical results: COMPOSE outperforms several strong baselines on retrieval to actual future papers and receives the highest scores in a human‑LLM judging setup, indicating more realistic and mathematically richer outputs.
Methodology
Graph Construction
- Citation graph: For each “anchor” paper, the authors collect its inbound and outbound citations, forming a local citation sub‑graph that captures the research direction.
- Formal dependency graph: Using Mathlib’s theorem‑proof metadata, they extract which earlier theorems a given theorem depends on, yielding a directed acyclic graph of formal knowledge.
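The two graphs described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `(citing, cited)` edge format, and the toy theorem names are all hypothetical, and real Mathlib metadata would need its own extraction step.

```python
from graphlib import CycleError, TopologicalSorter


def local_citation_subgraph(anchor, citations):
    """Collect the citation edges that touch the anchor paper.

    `citations` is an iterable of (citing, cited) pairs (assumed format).
    """
    inbound = {(src, dst) for src, dst in citations if dst == anchor}
    outbound = {(src, dst) for src, dst in citations if src == anchor}
    return {"anchor": anchor, "inbound": inbound, "outbound": outbound}


def dependency_order(depends_on):
    """Return theorems in an order where dependencies come first.

    `depends_on` maps a theorem to the set of theorems it builds on.
    Raises ValueError if the graph is not acyclic.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError("dependency graph contains a cycle") from exc


# Hypothetical toy data, for illustration only.
citations = [("A", "P"), ("P", "B"), ("C", "D")]
subgraph = local_citation_subgraph("P", citations)

depends_on = {"thm_c": {"thm_a", "thm_b"}, "thm_b": {"thm_a"}}
order = dependency_order(depends_on)
```

The topological sort doubles as an acyclicity check, which matches the description of the formal graph as a directed acyclic graph.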
Dual‑Graph Encoder
- Two graph neural networks (GNNs) independently embed the citation and formal graphs.
- The embeddings are concatenated and injected as a prefix to a pretrained large language model (LLM) (e.g., GPT‑NeoX).
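To make the "embed each graph, then concatenate" step concrete, here is a toy sketch in plain Python. It is not the paper's GNN: it assumes a single round of mean-neighbour aggregation followed by mean pooling as a stand-in, and it omits the projection into the LLM's embedding space, which the summary does not describe. All function names are hypothetical.

```python
def mean_aggregate(features, edges):
    """One round of mean-neighbour aggregation with self-loops.

    `features` maps node -> list of floats; `edges` is an iterable of
    (u, v) pairs whose endpoints must all appear in `features`.
    Edges are treated as undirected for simplicity (an assumption).
    """
    neighbours = {node: [node] for node in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
        for node, nbrs in neighbours.items()
    }


def mean_pool(features):
    """Average all node vectors into one graph-level vector."""
    dim = len(next(iter(features.values())))
    count = len(features)
    return [sum(vec[i] for vec in features.values()) / count for i in range(dim)]


def dual_graph_prefix(cit_feats, cit_edges, formal_feats, formal_edges):
    """Encode each graph independently, then concatenate the two vectors."""
    cit_vec = mean_pool(mean_aggregate(cit_feats, cit_edges))
    formal_vec = mean_pool(mean_aggregate(formal_feats, formal_edges))
    return cit_vec + formal_vec  # list concatenation


# Hypothetical toy inputs.
prefix = dual_graph_prefix(
    {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}, [("p1", "p2")],
    {"t1": [2.0], "t2": [4.0]}, [("t2", "t1")],
)
```

Keeping the two encoders separate until the final concatenation means either graph can be ablated by dropping its half of the vector, which is how the single-source baselines could plausibly be compared.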
Prompt‑Conditioned Generation
- The LLM receives a prompt that includes the anchor paper’s abstract, the graph‑derived context, and a “generate a plausible future theorem” instruction.
- Beam search with nucleus sampling is used to produce multiple candidate statements.
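The summary does not say how beam search and nucleus sampling are combined, so the sketch below illustrates only the nucleus (top-p) step on a toy next-token distribution. The token names, probabilities, and the `top_p` value are made up for illustration.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches `top_p`, then renormalise the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_token(probs, top_p, rng):
    """Draw one token from the nucleus-filtered distribution."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = nucleus_filter(probs, top_p=0.7)
token = sample_token(probs, top_p=0.7, rng=random.Random(0))
```

Calling `sample_token` repeatedly with different seeds yields multiple candidates, which is the property the generation step relies on.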
Evaluation Pipeline
- Retrieval: Generated statements are matched against the actual 2024‑2025 papers using semantic similarity; higher overlap indicates better grounding.
- LLM‑judge: A separate LLM rates each candidate on relevance, novelty, and formal correctness, mimicking expert peer review.
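The retrieval step can be illustrated with a small self-contained sketch. Two things here are assumptions rather than details from the paper: a bag-of-words vector stands in for whatever semantic encoder the authors use, and Retrieval@k is taken to mean "the fraction of generated statements whose target future paper appears among the k most similar papers". The corpus and statements are invented.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words stand-in for a semantic encoder (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieval_at_k(generated, corpus, targets, k):
    """Fraction of generated statements whose target paper is in the top k.

    `corpus` maps paper id -> text; `targets[i]` is the paper id that
    `generated[i]` is expected to retrieve.
    """
    corpus_vecs = {pid: embed(text) for pid, text in corpus.items()}
    hits = 0
    for statement, target in zip(generated, targets):
        query = embed(statement)
        ranked = sorted(
            corpus_vecs,
            key=lambda pid: cosine(query, corpus_vecs[pid]),
            reverse=True,
        )
        hits += target in ranked[:k]
    return hits / len(generated)


# Hypothetical toy benchmark.
corpus = {
    "p1": "compact operators on hilbert spaces",
    "p2": "graph colouring bounds for planar graphs",
    "p3": "prime gaps and sieve methods",
}
generated = [
    "every compact operator on a hilbert space has a singular value decomposition",
    "a sieve bound for prime gaps",
]
score = retrieval_at_k(generated, corpus, targets=["p1", "p3"], k=1)
```

Swapping `embed` for a dense sentence encoder would leave `retrieval_at_k` unchanged, since it only depends on a similarity function over the corpus.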
Results & Findings
| Metric | COMPOSE | Best Baseline (Citation‑only) | Best Baseline (Formal‑only) |
|---|---|---|---|
| Retrieval@10 (future papers) | 42.7 % | 31.4 % | 28.9 % |
| LLM‑judge overall score (0‑100) | 78.3 | 65.1 | 61.4 |
| Formal dependency violations | 3 % | 12 % | 7 % |
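- Dual‑graph conditioning beats both single‑source baselines: Retrieval@10 rises by 11.3 points over citation‑only and 13.8 points over formal‑only, and the LLM‑judge score rises by 13.2 and 16.9 points respectively. This is consistent with citation trends and formal structure providing complementary signals.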
- The generated theorems are more often aligned with actual future work, suggesting that the model captures emerging research directions.
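- Formal dependency violations (e.g., proposing a theorem that contradicts known dependencies) fall to 3 %, from 12 % for the citation‑only baseline and 7 % for the formal‑only baseline, suggesting that the formal graph constrains the language model most effectively when combined with citation context.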
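For readers who want the margins spelled out, the following snippet recomputes absolute and relative differences directly from the table. The numbers are copied from the table; the dictionary layout and the `gains` helper are only an illustrative convenience.

```python
# Values copied from the results table above.
results = {
    "retrieval_at_10": {"compose": 42.7, "citation_only": 31.4, "formal_only": 28.9},
    "llm_judge": {"compose": 78.3, "citation_only": 65.1, "formal_only": 61.4},
    "violations": {"compose": 3.0, "citation_only": 12.0, "formal_only": 7.0},
}


def gains(metric, baseline):
    """Return (absolute difference, relative difference in percent)
    of COMPOSE against the given baseline, each rounded to one decimal."""
    row = results[metric]
    absolute = row["compose"] - row[baseline]
    relative = 100 * absolute / row[baseline]
    return round(absolute, 1), round(relative, 1)


for metric in results:
    for baseline in ("citation_only", "formal_only"):
        print(metric, baseline, gains(metric, baseline))
```

For the violation rate a negative value is the desirable direction, since it measures errors rather than quality.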
Practical Implications
- Research assistance: Developers of AI‑powered literature‑review tools could embed COMPOSE to suggest “next‑step” conjectures, helping mathematicians spot low‑hanging research opportunities.
- Automated hypothesis generation: In domains where formal verification is critical (cryptography, formal methods), the dual‑graph approach can propose candidate lemmas that are more likely to respect existing proof dependencies (the reported violation rate is 3 %, not zero), so candidates still need to be checked.
- Curriculum design: Educational platforms could use the model to generate progressive problem sets that naturally follow the learning path encoded in citation and dependency graphs.
- Knowledge‑graph enrichment: By feeding back generated, high‑confidence conjectures into citation or formal repositories, we can bootstrap richer, forward‑looking knowledge graphs.
Limitations & Future Work
- Domain coverage: The current dataset focuses on mathematics papers linked to Mathlib; extending to other formal libraries (e.g., Coq, Isabelle) or to less formal scientific fields remains an open challenge.
- Evaluation bias: The LLM‑judge, while useful, may inherit the same biases as the underlying language model; human expert validation on a larger scale would strengthen claims.
- Scalability of graph encoding: Large citation neighborhoods can become computationally expensive; future work could explore hierarchical or sparse graph representations.
- Interactive generation: Incorporating a feedback loop where a human researcher refines the generated conjecture could lead to more usable, co‑creative systems.
COMPOSE suggests that combining bibliometric context with formal theorem dependencies is a promising approach to forward‑looking mathematical discovery, and to using AI as a research aid.
Authors
- David Busbib
- Michael Werman
Paper Information
- arXiv ID: 2605.30333v1
- Categories: cs.CL
- Published: May 28, 2026