[Paper] COMPOSE: Composing Future Theorems from Citations and Formal Structure

Published: May 28, 2026 at 01:58 PM EDT

Source: arXiv - 2605.30333v1

Overview

COMPOSE tackles a practical question: can a model automatically suggest plausible theorems that researchers might prove next? The authors condition a language model on two complementary sources of information, a paper’s citation network and the formal dependency graph of existing theorems, and report that the resulting conjectures are both context‑aware and more consistent with existing formal results.

Key Contributions

Grounded future‑theorem generation: Introduces a new task that requires a model to respect both scientific citation trends and formal theorem dependencies.
Dual‑graph conditioning architecture (COMPOSE): A novel framework that feeds a language model with (1) a citation graph (who cites whom) and (2) a formal dependency graph (which lemmas/theorems each result builds on).
Large‑scale dataset: Curates 108 K paired scientific–formal graph examples from arXiv papers and the Mathlib library, plus a benchmark of 47 K “future” papers (published in 2024‑2025).
Strong empirical results: COMPOSE outperforms several strong baselines on retrieval to actual future papers and receives the highest scores in a human‑LLM judging setup, indicating more realistic and mathematically richer outputs.

Methodology

Graph Construction
- Citation graph: For each “anchor” paper, the authors collect its inbound and outbound citations, forming a local citation sub‑graph that captures the research direction.
- Formal dependency graph: Using Mathlib’s theorem‑proof metadata, they extract which earlier theorems a given theorem depends on, yielding a directed acyclic graph of formal knowledge.
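
The two graphs described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the paper IDs, theorem names, and both helper functions are hypothetical, and the dependency map is written by hand rather than extracted from Mathlib. The standard‑library `graphlib` module is used only to check that the dependency graph is acyclic.

```python
from graphlib import TopologicalSorter


def local_citation_subgraph(anchor, citations):
    """Collect the citation edges that touch the anchor paper.

    `citations` is a list of (citing_paper, cited_paper) pairs.
    """
    outbound = [(src, dst) for src, dst in citations if src == anchor]
    inbound = [(src, dst) for src, dst in citations if dst == anchor]
    return {"anchor": anchor, "outbound": outbound, "inbound": inbound}


def dependency_order(depends_on):
    """Order theorems so that every dependency precedes its dependents.

    `depends_on` maps a theorem to the set of results it builds on.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical toy data (assumption, not taken from the paper).
citations = [("paperA", "paperB"), ("paperC", "paperA"), ("paperC", "paperD")]
depends_on = {
    "thm_main": {"lemma_1", "lemma_2"},
    "lemma_2": {"lemma_1"},
    "lemma_1": set(),
}

subgraph = local_citation_subgraph("paperA", citations)
order = dependency_order(depends_on)
print(subgraph)
print(order)
```

Because `dependency_order` fails on cyclic input, it doubles as a cheap sanity check for the acyclicity the summary describes.
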
Dual‑Graph Encoder
- Two graph neural networks (GNNs) independently embed the citation and formal graphs.
- The embeddings are concatenated and injected as a prefix to a pretrained large language model (LLM) (e.g., GPT‑NeoX).
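
The encode‑then‑concatenate idea can be illustrated without any deep‑learning library. The sketch below is a toy stand‑in and makes several assumptions: one round of mean‑neighbor aggregation replaces a trained GNN, mean pooling produces the graph vector, and all node features, edge lists, and function names are hypothetical. The summary does not say how the authors' encoders are built or how the prefix is projected into the LLM.

```python
def mean_aggregate(features, edges):
    """One round of mean-neighbor message passing (toy stand-in for a GNN layer).

    `features` maps node -> feature vector; `edges` is a list of (src, dst)
    pairs, and messages flow from src to dst.
    """
    neighbors = {node: [] for node in features}
    for src, dst in edges:
        neighbors[dst].append(src)
    updated = {}
    for node, vec in features.items():
        msgs = [features[n] for n in neighbors[node]] + [vec]  # include self
        updated[node] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return updated


def pool(features):
    """Mean-pool node vectors into a single graph-level vector."""
    vecs = list(features.values())
    return [sum(vals) / len(vecs) for vals in zip(*vecs)]


def dual_graph_prefix(cit_feats, cit_edges, formal_feats, formal_edges):
    """Embed each graph independently, then concatenate the two vectors."""
    cit_vec = pool(mean_aggregate(cit_feats, cit_edges))
    formal_vec = pool(mean_aggregate(formal_feats, formal_edges))
    return cit_vec + formal_vec  # list concatenation


# Hypothetical two-node graphs with 2-dimensional features.
cit_feats = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
cit_edges = [("a", "b")]
formal_feats = {"l": [2.0, 2.0], "t": [4.0, 0.0]}
formal_edges = [("l", "t")]

prefix = dual_graph_prefix(cit_feats, cit_edges, formal_feats, formal_edges)
print(prefix)  # [0.75, 0.25, 2.5, 1.5]
```

In a real system the concatenated vector would still have to be mapped into the LLM's embedding space before it could serve as a prefix; that step is omitted here.
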
Prompt‑Conditioned Generation
- The LLM receives a prompt that includes the anchor paper’s abstract, the graph‑derived context, and a “generate a plausible future theorem” instruction.
- Beam search with nucleus sampling is used to produce multiple candidate statements.
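
For readers unfamiliar with nucleus (top‑p) sampling, the following sketch shows the filtering step on a single next‑token distribution. It is a generic illustration, not the authors' decoder: the summary does not specify how nucleus sampling is combined with beam search, and the token probabilities, the `top_p` value, and the function names are assumptions.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total mass
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample_token(probs, top_p, rng):
    """Draw one token from the nucleus-filtered distribution."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
next_token_probs = {"compact": 0.5, "bounded": 0.3, "finite": 0.15, "banana": 0.05}

filtered = nucleus_filter(next_token_probs, top_p=0.9)
token = sample_token(next_token_probs, 0.9, random.Random(0))
print(sorted(filtered), token)
```

With `top_p=0.9` the low‑probability token is dropped, so sampling can still produce varied candidates without drawing from the unlikely tail.
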
Evaluation Pipeline
- Retrieval: Generated statements are matched against the actual 2024‑2025 papers using semantic similarity; higher overlap indicates better grounding.
- LLM‑judge: A separate LLM rates each candidate on relevance, novelty, and formal correctness, mimicking expert peer review.
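
The retrieval step can be made concrete with a small sketch. It assumes that Retrieval@k counts a generated statement as a hit when its target future paper appears among the k most similar corpus entries by cosine similarity; the summary does not give the paper's exact definition, and the embeddings, paper IDs, and function names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_at_k(queries, corpus, targets, k):
    """Fraction of queries whose target paper ranks in the top k.

    `queries` holds embeddings of generated statements, `corpus` maps
    paper ID -> embedding, and `targets` gives the true paper ID per query.
    """
    hits = 0
    for q_vec, target in zip(queries, targets):
        ranked = sorted(corpus, key=lambda pid: cosine(q_vec, corpus[pid]), reverse=True)
        hits += target in ranked[:k]
    return hits / len(queries)


# Hypothetical 2-dimensional embeddings.
corpus = {"p1": [1, 0], "p2": [0, 1], "p3": [1, 1]}
queries = [[0.9, 0.1], [0.1, 0.9]]
targets = ["p1", "p3"]

print(retrieval_at_k(queries, corpus, targets, k=1))  # 0.5
print(retrieval_at_k(queries, corpus, targets, k=2))  # 1.0
```
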

Results & Findings

Metric	COMPOSE	Best Baseline (Citation‑only)	Best Baseline (Formal‑only)
Retrieval@10 (future papers)	42.7 %	31.4 %	28.9 %
LLM‑judge overall score (0‑100)	78.3	65.1	61.4
Formal dependency violations (lower is better)	3 %	12 %	7 %

Dual‑graph conditioning beats both single‑source baselines: Retrieval@10 rises to 42.7 %, from 31.4 % (citation‑only) and 28.9 % (formal‑only), which supports the claim that citation trends and formal structure provide complementary signals.
The generated theorems are more often aligned with actual future work, suggesting that the model captures emerging research directions.
Formal violations (e.g., proposing a theorem that contradicts known dependencies) fall to 3 %, from 12 % for the citation‑only baseline and 7 % for the formal‑only baseline, indicating that the formal graph constrains the language model.
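
The size of these gaps is easy to work out from the table. The snippet below only re‑derives absolute and relative differences from the reported numbers; the dictionary layout and the `gains` helper are illustrative assumptions, and no new results are introduced.

```python
# Values copied from the results table above (percentages).
retrieval_at_10 = {"COMPOSE": 42.7, "citation-only": 31.4, "formal-only": 28.9}
violations = {"COMPOSE": 3.0, "citation-only": 12.0, "formal-only": 7.0}


def gains(scores, system="COMPOSE"):
    """Return (absolute difference in points, relative difference in %)
    of `system` against every other entry."""
    out = {}
    for name, score in scores.items():
        if name == system:
            continue
        delta = scores[system] - score
        out[name] = (round(delta, 1), round(100 * delta / score, 1))
    return out


retrieval_gains = gains(retrieval_at_10)
violation_changes = gains(violations)
print(retrieval_gains)    # {'citation-only': (11.3, 36.0), 'formal-only': (13.8, 47.8)}
print(violation_changes)  # {'citation-only': (-9.0, -75.0), 'formal-only': (-4.0, -57.1)}
```

Read this way, COMPOSE's Retrieval@10 is 11.3 points (about 36 % relative) above the citation‑only baseline, and its violation rate is 75 % lower.
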

Practical Implications

Research assistance: Developers of AI‑powered literature‑review tools could embed COMPOSE to suggest “next‑step” conjectures, helping mathematicians spot low‑hanging research opportunities.
Automated hypothesis generation: In domains where formal verification is critical (cryptography, formal methods), the dual‑graph approach can propose candidate lemmas that mostly respect existing proof dependencies; the reported violation rate is 3 %, so candidates still need to be checked.
Curriculum design: Educational platforms could use the model to generate progressive problem sets that naturally follow the learning path encoded in citation and dependency graphs.
Knowledge‑graph enrichment: By feeding back generated, high‑confidence conjectures into citation or formal repositories, we can bootstrap richer, forward‑looking knowledge graphs.

Limitations & Future Work

Domain coverage: The current dataset focuses on mathematics papers linked to Mathlib; extending to other formal libraries (e.g., Coq, Isabelle) or to less formal scientific fields remains an open challenge.
Evaluation bias: The LLM‑judge, while useful, may inherit the same biases as the underlying language model; human expert validation on a larger scale would strengthen claims.
Scalability of graph encoding: Large citation neighborhoods can become computationally expensive; future work could explore hierarchical or sparse graph representations.
Interactive generation: Incorporating a feedback loop where a human researcher refines the generated conjecture could lead to more usable, co‑creative systems.

COMPOSE suggests that combining bibliometric context with formal theorem dependencies is a promising basis for forward‑looking mathematical discovery, and an early indication of how AI could act as a research partner.

Authors

David Busbib
Michael Werman

Paper Information

arXiv ID: 2605.30333v1
Categories: cs.CL
Published: May 28, 2026
