[Paper] COMPOSE: Composing Future Theorems from Citations and Formal Structure

Published: May 28, 2026 at 01:58 PM EDT
Source: arXiv

Overview

The COMPOSE paper tackles a practical question: can a model automatically suggest plausible theorems that researchers might prove next? By conditioning a language model on two complementary sources of information (a paper’s citation network and the formal dependency graph of existing theorems), the authors generate mathematically grounded conjectures that are both context‑aware and consistent with existing formal results.

Key Contributions

  • Grounded future‑theorem generation: Introduces a new task that requires a model to respect both scientific citation trends and formal theorem dependencies.
  • Dual‑graph conditioning architecture (COMPOSE): A novel framework that feeds a language model with (1) a citation graph (who cites whom) and (2) a formal dependency graph (which lemmas/theorems each result builds on).
  • Large‑scale dataset: Curates 108 K paired scientific–formal graph examples from arXiv papers and the Mathlib library, plus a benchmark of 47 K “future” papers (published in 2024‑2025).
  • Strong empirical results: COMPOSE outperforms several strong baselines at retrieving actual future papers and receives the highest scores in a human‑LLM judging setup, indicating more realistic and mathematically richer outputs.

Methodology

  1. Graph Construction

    • Citation graph: For each “anchor” paper, the authors collect its inbound and outbound citations, forming a local citation sub‑graph that captures the research direction.
    • Formal dependency graph: Using Mathlib’s theorem‑proof metadata, they extract which earlier theorems a given theorem depends on, yielding a directed acyclic graph of formal knowledge.
  2. Dual‑Graph Encoder

    • Two graph neural networks (GNNs) independently embed the citation and formal graphs.
    • The embeddings are concatenated and injected as a prefix to a pretrained large language model (LLM) (e.g., GPT‑NeoX).
  3. Prompt‑Conditioned Generation

    • The LLM receives a prompt that includes the anchor paper’s abstract, the graph‑derived context, and a “generate a plausible future theorem” instruction.
    • Beam search with nucleus sampling is used to produce multiple candidate statements.
  4. Evaluation Pipeline

    • Retrieval: Generated statements are matched against the actual 2024‑2025 papers using semantic similarity; higher similarity to real future papers indicates better grounding.
    • LLM‑judge: A separate LLM rates each candidate on relevance, novelty, and formal correctness, mimicking expert peer review.
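
The graph‑construction and encoding steps above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the toy data, and the plain list concatenation standing in for GNN embeddings are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def local_citation_subgraph(edges, anchor):
    """Keep only the citation edges that touch the anchor paper.

    `edges` is an iterable of (citing, cited) pairs.
    """
    return [(src, dst) for src, dst in edges if anchor in (src, dst)]


def dependency_closure(deps, theorem):
    """Return every result that `theorem` transitively depends on.

    `deps` maps a theorem name to the names it uses directly.
    """
    seen, stack = set(), list(deps.get(theorem, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, ()))
    return seen


def concat_prefix(citation_vec, formal_vec):
    """Concatenate two graph embeddings into a single prefix vector.

    Assumption: real embeddings would come from two trained GNNs.
    """
    return list(citation_vec) + list(formal_vec)


# Hypothetical toy data, not taken from the paper's dataset.
citations = [("A", "B"), ("C", "A"), ("C", "D")]
deps = {
    "thm_main": ["lemma_1", "lemma_2"],
    "lemma_2": ["lemma_0"],
    "lemma_1": [],
    "lemma_0": [],
}

sub = local_citation_subgraph(citations, "A")
closure = dependency_closure(deps, "thm_main")
# static_order() raises graphlib.CycleError if the graph is not a DAG.
order = list(TopologicalSorter(deps).static_order())
prefix = concat_prefix([0.1, 0.2], [0.3])
```

Running the topological sort doubles as a sanity check that the extracted dependency graph really is acyclic, since any cycle makes `static_order()` fail.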

Results & Findings

| Metric | COMPOSE | Best Baseline (Citation‑only) | Best Baseline (Formal‑only) |
|---|---|---|---|
| Retrieval@10 (future papers) | 42.7 % | 31.4 % | 28.9 % |
| LLM‑judge overall score (0‑100) | 78.3 | 65.1 | 61.4 |
| Formal dependency violations | 3 % | 12 % | 7 % |

  • Dual‑graph conditioning beats single‑source models by a large margin, confirming that citation trends and formal structure provide complementary signals.
  • The generated theorems are more often aligned with actual future work, suggesting that the model captures emerging research directions.
  • Formal violations (e.g., proposing a theorem that contradicts known dependencies) drop dramatically, indicating that the formal graph effectively constrains the language model.
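
For readers unfamiliar with the Retrieval@10 metric, here is one plausible way to compute a Retrieval@k score in Python. The summary does not spell out the exact protocol, so this sketch assumes each generated statement has a single target future paper and that ranking uses cosine similarity over embedding vectors; `retrieval_at_k`, the vectors, and the IDs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_at_k(queries, corpus, targets, k=10):
    """Fraction of queries whose target paper ranks in the top k.

    `queries` and `corpus` map IDs to embedding vectors;
    `targets` maps each query ID to its target paper ID.
    """
    hits = 0
    for qid, qvec in queries.items():
        ranked = sorted(
            corpus, key=lambda pid: cosine(qvec, corpus[pid]), reverse=True
        )
        if targets[qid] in ranked[:k]:
            hits += 1
    return hits / len(queries) if queries else 0.0


# Hypothetical toy embeddings, chosen only to make the ranking obvious.
corpus = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
queries = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
targets = {"q1": "p1", "q2": "p3"}

print(retrieval_at_k(queries, corpus, targets, k=1))  # 0.5
print(retrieval_at_k(queries, corpus, targets, k=2))  # 1.0
```

In this toy setup `q2` ranks its target second, so it counts as a miss at k=1 and a hit at k=2. That is why the score rises as k grows.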

Practical Implications

  • Research assistance: Developers of AI‑powered literature‑review tools could embed COMPOSE to suggest “next‑step” conjectures, helping mathematicians spot low‑hanging research opportunities.
  • Automated hypothesis generation: In domains where formal verification is critical (cryptography, formal methods), the dual‑graph approach can propose candidate lemmas that rarely conflict with existing proof dependencies (a 3 % violation rate in the reported results), though candidates still need to be checked.
  • Curriculum design: Educational platforms could use the model to generate progressive problem sets that naturally follow the learning path encoded in citation and dependency graphs.
  • Knowledge‑graph enrichment: By feeding back generated, high‑confidence conjectures into citation or formal repositories, we can bootstrap richer, forward‑looking knowledge graphs.

Limitations & Future Work

  • Domain coverage: The current dataset focuses on mathematics papers linked to Mathlib; extending to other formal libraries (e.g., Coq, Isabelle) or to less formal scientific fields remains an open challenge.
  • Evaluation bias: The LLM‑judge, while useful, may inherit the same biases as the underlying language model; human expert validation on a larger scale would strengthen claims.
  • Scalability of graph encoding: Large citation neighborhoods can become computationally expensive; future work could explore hierarchical or sparse graph representations.
  • Interactive generation: Incorporating a feedback loop where a human researcher refines the generated conjecture could lead to more usable, co‑creative systems.

COMPOSE demonstrates that marrying bibliometric context with formal theorem dependencies yields a powerful new tool for forward‑looking mathematical discovery—an exciting glimpse of how AI can become a genuine research partner.

Authors

  • David Busbib
  • Michael Werman

Paper Information

  • arXiv ID: 2605.30333v1
  • Categories: cs.CL
  • Published: May 28, 2026