[Paper] Training AI Co-Scientists Using Rubric Rewards

Published: December 29, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.23707v1

Overview

The paper presents a new way to turn large language models (LLMs) into “AI co‑scientists” that can draft research plans from high‑level goals and constraints. By automatically extracting goal statements and grading rubrics from existing papers, the authors train models with reinforcement learning (RL) in which the model grades its own outputs against those rubrics, removing the need for costly human annotations. Evaluations by human experts and by downstream “jury” models show that the finetuned system produces noticeably better, more usable plans across machine‑learning and medical domains.

Key Contributions

  • Automated Corpus Construction: A pipeline that mines research papers for (i) explicit research goals and (ii) domain‑specific grading rubrics, creating a large, diverse training set without manual labeling (a sketch of one such goal‑rubric example follows this list).
  • Self‑Grading RL Framework: Uses a frozen copy of the base model as a “grader” that scores generated plans against the extracted rubrics; the rubric scores serve as reward signals, exploiting the generator‑verifier gap (checking a plan against explicit criteria is easier than writing a good plan) to enable reinforcement learning without human reward labels.
  • Empirical Validation on Real Goals: Human experts spent 225 hours evaluating generated plans for ML research goals, preferring the finetuned Qwen3‑30B‑A3B model 70% of the time.
  • Cross‑Domain Generalization: The same training recipe improves plan quality for medical research goals and fresh arXiv preprints, with 12-22% relative gains measured by frontier‑model juries.
  • Scalable, Human‑Free Training Loop: Demonstrates that a fully automated pipeline can iteratively improve AI co‑scientist capabilities without continuous human supervision.
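
To ground the corpus-construction contribution, here is a rough sketch of what one mined goal‑rubric training example could look like. The field names, constraint list, and rubric wording are invented for illustration; the paper's actual data schema is not reproduced here.

```python
# A hypothetical goal-rubric training example, as produced by the mining pipeline.
# Field names and rubric wording are illustrative, not the paper's actual schema.
example = {
    "goal": (
        "Reduce inference latency of a transformer-based summarization model "
        "by at least 30% without degrading ROUGE-L by more than 1 point."
    ),
    "constraints": [
        "Single A100 GPU budget",
        "No access to additional labeled data",
    ],
    "rubric": [
        "Does the plan identify the main sources of latency (e.g., attention, decoding)?",
        "Does the plan propose at least two concrete techniques (e.g., quantization, distillation)?",
        "Does the plan specify how latency and ROUGE-L will be measured?",
        "Does the plan respect the stated compute and data constraints?",
    ],
    "source_paper": "arXiv:XXXX.XXXXX",  # provenance of the mined goal and rubric
}
```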

Methodology

  1. Data Mining:

    • Crawl arXiv and PubMed‑style repositories.
    • Use heuristics and lightweight NLP classifiers to locate sections that state research objectives (e.g., “We aim to…”) and associated evaluation criteria (rubrics).
    • Pair each goal with its rubric to form a goal‑rubric training example.
  2. Base Model & Frozen Grader:

    • Start from the open‑source Qwen3‑30B‑A3B LLM.
    • Clone the model; one copy stays frozen and acts as the grader, the other is the generator to be finetuned.
  3. Reinforcement Learning with Self‑Grading:

    • The generator produces a research plan given a goal.
    • The frozen grader scores the plan against the rubric using a prompt‑based evaluation (e.g., “Does the plan satisfy criterion X?”).
    • The rubric‑derived scores become the reward signal for PPO‑style RL updates (a minimal reward‑scoring sketch appears below).
  4. Evaluation Loop:

    • Human experts rank plans from the base and finetuned models for a set of ML goals.
    • For medical and unseen arXiv goals, a jury of strong frontier models (e.g., GPT‑4‑Turbo, Claude‑3) performs pairwise preference judgments.

The whole pipeline runs end‑to‑end without any manual labeling after the initial data‑mining stage.
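
As a rough illustration of the self‑grading step, the sketch below scores a plan against a rubric by asking a frozen grader one yes/no question per criterion and returning the fraction satisfied. The prompt wording, the `grade_fn` interface, and the averaging scheme are assumptions rather than the paper's exact recipe; in practice the grader would be the frozen Qwen3‑30B‑A3B copy and the returned score would feed the PPO‑style updates.

```python
from typing import Callable, List

def rubric_reward(
    plan: str,
    rubric: List[str],
    grade_fn: Callable[[str], str],
) -> float:
    """Score a generated plan against a rubric using a frozen grader LLM.

    `grade_fn` wraps the frozen copy of the base model: it takes a grading
    prompt and returns the model's raw text answer. The reward here is the
    fraction of rubric criteria the grader judges as satisfied (an assumed
    aggregation; the paper may weight or scale criteria differently).
    """
    satisfied = 0
    for criterion in rubric:
        prompt = (
            "You are grading a research plan.\n"
            f"Criterion: {criterion}\n"
            f"Plan:\n{plan}\n"
            "Does the plan satisfy this criterion? Answer YES or NO."
        )
        answer = grade_fn(prompt).strip().upper()
        satisfied += answer.startswith("YES")
    return satisfied / len(rubric)

# Toy usage with a stand-in grader; in a real run grade_fn would call the
# frozen grader model and the score would be passed to the RL trainer.
if __name__ == "__main__":
    dummy_grader = lambda prompt: "YES"  # placeholder for the frozen LLM
    plan = "1) Profile decoding latency. 2) Apply int8 quantization. 3) Report ROUGE-L."
    rubric = [
        "Does the plan identify the main sources of latency?",
        "Does the plan propose at least two concrete techniques?",
    ]
    print(rubric_reward(plan, rubric, dummy_grader))  # -> 1.0 with the dummy grader
```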

Results & Findings

| Evaluation | Preference for Finetuned Model | Relative Improvement |
| --- | --- | --- |
| Human experts (ML goals) | 70% of pairwise comparisons | – |
| Frontier-model jury (medical goals) | +12% to +22% preference over baseline | 12-22% |
| Rubric approval (human check) | 84% of automatically extracted rubrics deemed valid | – |
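
For readers unfamiliar with these metrics, the toy snippet below shows one way pairwise jury verdicts and rubric-compliance scores could be turned into a win rate and a relative improvement; all numbers are invented, and the paper's exact metric definitions may differ.

```python
# Toy aggregation of jury judgments into the kinds of numbers reported above.
# All verdicts and scores here are invented for demonstration only.
verdicts = ["finetuned", "finetuned", "base", "finetuned", "finetuned",
            "tie", "finetuned", "base", "finetuned", "finetuned"]

decided = [v for v in verdicts if v != "tie"]
win_rate = sum(v == "finetuned" for v in decided) / len(decided)
print(f"Win rate for finetuned model: {win_rate:.0%}")

# One way a relative improvement could be computed, e.g. from mean
# rubric-compliance scores assigned by the jury (toy numbers).
base_score, finetuned_score = 0.62, 0.71
relative_gain = (finetuned_score - base_score) / base_score
print(f"Relative improvement: {relative_gain:.0%}")  # about 15%
```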

Key takeaways:

  • The self‑grading RL loop reliably pushes the generator toward plans that better satisfy explicit criteria.
  • The approach works across domains, even where direct execution feedback (e.g., running experiments) is unavailable.
  • The majority of automatically extracted rubrics are high‑quality, confirming the feasibility of large‑scale unsupervised data creation.

Practical Implications

  • Rapid Ideation for Developers: Teams can feed a high‑level research question (e.g., “reduce latency of transformer inference”) into the model and receive a structured, constraint‑aware plan ready for brainstorming or sprint planning (a usage sketch follows this list).
  • Automated Grant & Proposal Drafting: By swapping the rubric for funding‑agency criteria, the system could generate first‑draft proposals that already align with reviewer expectations.
  • Cross‑Disciplinary Knowledge Transfer: Because the model learns from diverse corpora, it can suggest methods from one field (e.g., medical imaging) to another (e.g., computer vision), accelerating interdisciplinary innovation.
  • Reduced Human Annotation Costs: Companies can build domain‑specific AI assistants without hiring large annotation teams; the pipeline harvests the needed supervision directly from the literature.
  • Plug‑and‑Play for Existing LLMs: The method works with any sufficiently capable base model, making it a reusable recipe for product teams looking to add “research planning” capabilities to their AI assistants.
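
As a concrete (and entirely hypothetical) usage sketch for the first implication above, a team serving such a finetuned planner behind an OpenAI-compatible endpoint might query it as follows; the endpoint URL, model name, and prompt format are assumptions, not part of the paper.

```python
# Hypothetical usage: ask a locally served, rubric-finetuned planner for a
# research plan. Endpoint, model name, and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

goal = "Reduce latency of transformer inference for a production summarizer."
constraints = ["Single A100 GPU", "No additional labeled data", "Two-week timeline"]

prompt = (
    f"Research goal: {goal}\n"
    "Constraints:\n" + "\n".join(f"- {c}" for c in constraints) + "\n"
    "Draft a structured research plan that addresses the goal under these constraints."
)

response = client.chat.completions.create(
    model="qwen3-30b-a3b-rubric-finetuned",  # hypothetical served model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```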

Limitations & Future Work

  • Rubric Quality Variability: Although 84% of extracted rubrics passed human sanity checks, the remaining noisy rubrics could distort the reward signal, especially in niche sub‑domains.
  • Scalability of the Grader: Using a frozen LLM as a grader incurs inference cost proportional to the plan length; more efficient scoring mechanisms (e.g., learned reward models) could speed up training.
  • Evaluation Bias Toward Textual Quality: Preference judgments focus on readability and rubric compliance, not on downstream experimental success; linking plans to actual experiment outcomes remains an open challenge.
  • Domain‑Specific Constraints: Some fields (e.g., regulatory‑heavy biotech) require constraints that are hard to capture in simple rubrics; extending the pipeline to handle formal constraint languages is a promising direction.

Overall, the paper demonstrates a practical, automated path toward more capable AI co‑scientists, opening the door for developers to embed research‑planning intelligence directly into their tools.

Authors

  • Shashwat Goel
  • Rishi Hazra
  • Dulhan Jayalath
  • Timon Willi
  • Parag Jain
  • William F. Shen
  • Ilias Leontiadis
  • Francesco Barbieri
  • Yoram Bachrach
  • Jonas Geiping
  • Chenxi Whitehouse

Paper Information

  • arXiv ID: 2512.23707v1
  • Categories: cs.LG, cs.CL, cs.HC
  • Published: December 29, 2025