[Paper] Training AI Co-Scientists Using Rubric Rewards
Source: arXiv - 2512.23707v1
Overview
The paper presents a new way to turn large language models (LLMs) into “AI co‑scientists” that can draft research plans from high‑level goals and constraints. The authors automatically extract goal statements and grading rubrics from existing papers, then finetune the model with reinforcement learning (RL) in which a frozen copy of the same model grades generated plans against those rubrics, removing the need for costly human annotations. Human experts and frontier‑model “juries” find that the finetuned system produces noticeably better, more usable plans across machine‑learning and medical domains.
Key Contributions
- Automated Corpus Construction: A pipeline that mines research papers for (i) explicit research goals and (ii) domain‑specific grading rubrics, creating a large, diverse training set without manual labeling.
- Self‑Grading RL Framework: Uses a frozen copy of the base model as the grader and the extracted rubrics as the reward specification, relying on the generator‑verifier gap: judging whether a plan meets a rubric criterion is easier than writing the plan, so even an unfinetuned copy of the model provides a useful reward signal.
- Empirical Validation on Real Goals: Human experts spent roughly 225 hours evaluating generated plans for ML research goals, preferring the finetuned Qwen3-30B-A3B model 70 % of the time.
- Cross‑Domain Generalization: The same training recipe improves plan quality for medical research goals and fresh arXiv preprints, with 12‑22 % relative gains measured by frontier‑model juries.
- Scalable, Human‑Free Training Loop: Demonstrates that a fully automated pipeline can iteratively improve AI co‑scientist capabilities without continuous human supervision.
Methodology
Data Mining:
- Crawl arXiv and PubMed‑style repositories.
- Use heuristics and lightweight NLP classifiers to locate sections that state research objectives (e.g., “We aim to…”) and associated evaluation criteria (rubrics).
- Pair each goal with its rubric to form a goal‑rubric training example.
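This summary does not spell out the extraction heuristics, so the following is only a minimal Python sketch of the idea, assuming plain-text paper sections as input. The cue phrases, the `GoalRubricExample` container, and the pairing logic are illustrative assumptions, not the authors' pipeline.

```python
import re
from dataclasses import dataclass

# Illustrative cue phrases for goal statements; the paper's actual
# heuristics and classifiers are not specified in this summary.
GOAL_CUES = re.compile(
    r"\b(we aim to|our goal is to|the objective of this (work|paper) is to)\b",
    re.IGNORECASE,
)

@dataclass
class GoalRubricExample:
    goal: str          # high-level research goal extracted from the paper
    rubric: list[str]  # grading criteria paired with that goal

def extract_goal_sentences(section_text: str) -> list[str]:
    """Return sentences that look like explicit research-goal statements."""
    sentences = re.split(r"(?<=[.!?])\s+", section_text)
    return [s.strip() for s in sentences if GOAL_CUES.search(s)]

def build_examples(sections: dict[str, str], rubric_items: list[str]) -> list[GoalRubricExample]:
    """Pair every detected goal with the rubric mined from the same paper."""
    examples = []
    for text in sections.values():
        for goal in extract_goal_sentences(text):
            examples.append(GoalRubricExample(goal=goal, rubric=rubric_items))
    return examples

if __name__ == "__main__":
    demo_sections = {"intro": "We aim to reduce the latency of transformer inference. Prior work..."}
    demo_rubric = ["Plan states measurable success criteria.",
                   "Plan respects the stated compute budget."]
    print(build_examples(demo_sections, demo_rubric))
```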
Base Model & Frozen Grader:
- Start from the open‑source Qwen3‑30B‑A3B LLM.
- Clone the model; one copy stays frozen and acts as the grader, the other is the generator to be finetuned.
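A rough sketch of this two-copy setup (not the authors' training code), assuming the Hugging Face transformers library and the `Qwen/Qwen3-30B-A3B` checkpoint id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-30B-A3B"  # assumed Hugging Face repo id for the base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Generator: this copy receives the PPO-style gradient updates.
generator = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Grader: an identical copy that stays frozen and only scores plans against rubrics.
grader = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
grader.eval()
for param in grader.parameters():
    param.requires_grad_(False)
```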
Reinforcement Learning with Self‑Grading:
- The generator produces a research plan given a goal.
- The frozen grader scores the plan against the rubric using a prompt‑based evaluation (e.g., “Does the plan satisfy criterion X?”).
- The rubric‑derived scores become the reward signal for PPO‑style RL updates.
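A minimal sketch of how such a rubric-based reward could be computed, assuming per-criterion yes/no grading and a simple fraction-of-criteria-satisfied aggregation. The `ask_grader` callable stands in for an inference call to the frozen grader, and the prompt wording is an assumption rather than the paper's exact template.

```python
from typing import Callable

def rubric_reward(goal: str,
                  plan: str,
                  rubric: list[str],
                  ask_grader: Callable[[str], str]) -> float:
    """Score a generated plan as the fraction of rubric criteria judged satisfied.

    `ask_grader` wraps a greedy-decoding call to the frozen grader model and
    returns its raw text answer; any LLM inference backend can be plugged in.
    """
    satisfied = 0
    for criterion in rubric:
        prompt = (
            f"Research goal:\n{goal}\n\n"
            f"Proposed plan:\n{plan}\n\n"
            f"Criterion: {criterion}\n"
            "Does the plan satisfy this criterion? Answer 'yes' or 'no'."
        )
        answer = ask_grader(prompt).strip().lower()
        satisfied += answer.startswith("yes")
    return satisfied / max(len(rubric), 1)

# The resulting scalar in [0, 1] serves as the reward for PPO-style updates
# of the generator, e.g. via a library such as TRL's PPOTrainer.
```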
Evaluation Loop:
- Human experts rank plans from the base and finetuned models for a set of ML goals.
- For medical and unseen arXiv goals, a jury of strong frontier models (e.g., GPT‑4‑Turbo, Claude‑3) performs pairwise preference judgments.
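For illustration only, a sketch of a pairwise jury query and the win-rate aggregation behind the preference numbers reported later; the prompt wording, tie handling, and position-bias controls are assumptions, as this summary does not detail the jury protocol.

```python
from collections import Counter

def jury_prompt(goal: str, plan_a: str, plan_b: str) -> str:
    """Pairwise preference prompt for a frontier jury model (wording assumed)."""
    return (
        f"Research goal:\n{goal}\n\n"
        f"Plan A:\n{plan_a}\n\n"
        f"Plan B:\n{plan_b}\n\n"
        "Which plan better addresses the goal? Answer with 'A' or 'B' only."
    )

def preference_rate(winners: list[str], model: str = "finetuned") -> float:
    """Fraction of decided pairwise comparisons won by `model`."""
    counts = Counter(winners)
    decided = sum(counts.values())
    return counts[model] / decided if decided else 0.0

# e.g. 7 wins for the finetuned model out of 10 comparisons -> 0.7
print(preference_rate(["finetuned"] * 7 + ["base"] * 3))
```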
After the initial automated data-mining stage, the whole pipeline runs end-to-end without any manual labeling.
Results & Findings
| Evaluation | Result |
|---|---|
| Human experts (ML goals) | Finetuned model preferred in 70 % of pairwise comparisons |
| Frontier-model jury (medical and unseen arXiv goals) | 12-22 % relative improvement in preference over the base model |
| Rubric validity (human spot check) | 84 % of automatically extracted rubrics deemed valid |
Key takeaways:
- The self‑grading RL loop reliably pushes the generator toward plans that better satisfy explicit criteria.
- The approach works across domains, even where direct execution feedback (e.g., running experiments) is unavailable.
- The majority of automatically extracted rubrics are high‑quality, confirming the feasibility of large‑scale unsupervised data creation.
Practical Implications
- Rapid Ideation for Developers: Teams can feed a high-level research question (e.g., “reduce latency of transformer inference”) into the model and receive a structured, constraint-aware plan ready for brainstorming or sprint planning (see the sketch after this list).
- Automated Grant & Proposal Drafting: By swapping the rubric for funding‑agency criteria, the system could generate first‑draft proposals that already align with reviewer expectations.
- Cross‑Disciplinary Knowledge Transfer: Because the model learns from diverse corpora, it can suggest methods from one field (e.g., medical imaging) to another (e.g., computer vision), accelerating interdisciplinary innovation.
- Reduced Human Annotation Costs: Companies can build domain‑specific AI assistants without hiring large annotation teams; the pipeline harvests the needed supervision directly from the literature.
- Plug‑and‑Play for Existing LLMs: The method works with any sufficiently capable base model, making it a reusable recipe for product teams looking to add “research planning” capabilities to their AI assistants.
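As a concrete illustration of the rapid-ideation item above, here is a hypothetical call to a finetuned planning checkpoint; the checkpoint name, prompt wording, and decoding settings are assumptions, not artifacts released with the paper.

```python
from transformers import pipeline

# Hypothetical finetuned checkpoint id; the paper does not release one under this name.
planner = pipeline("text-generation", model="your-org/qwen3-30b-a3b-co-scientist")

goal = "Reduce latency of transformer inference on a single A100 GPU."
constraints = "Budget: two weeks, one engineer; accuracy drop of at most 1 %."

prompt = (
    f"Research goal: {goal}\n"
    f"Constraints: {constraints}\n"
    "Draft a structured research plan with milestones, methods, and evaluation criteria."
)

# Greedy decoding keeps the plan deterministic for review; sampling works too.
plan = planner(prompt, max_new_tokens=800, do_sample=False)[0]["generated_text"]
print(plan)
```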
Limitations & Future Work
- Rubric Quality Variability: Although 84 % passed human sanity checks, the remaining noisy rubrics could misguide the reward signal, especially in niche sub‑domains.
- Scalability of the Grader: Using a frozen LLM as a grader incurs inference cost proportional to the plan length; more efficient scoring mechanisms (e.g., learned reward models) could speed up training.
- Evaluation Bias Toward Textual Quality: Preference judgments focus on readability and rubric compliance, not on downstream experimental success; linking plans to actual experiment outcomes remains an open challenge.
- Domain‑Specific Constraints: Some fields (e.g., regulatory‑heavy biotech) require constraints that are hard to capture in simple rubrics; extending the pipeline to handle formal constraint languages is a promising direction.
Overall, the paper demonstrates a practical, automated path toward more capable AI co‑scientists, opening the door for developers to embed research‑planning intelligence directly into their tools.
Authors
- Shashwat Goel
- Rishi Hazra
- Dulhan Jayalath
- Timon Willi
- Parag Jain
- William F. Shen
- Ilias Leontiadis
- Francesco Barbieri
- Yoram Bachrach
- Jonas Geiping
- Chenxi Whitehouse
Paper Information
- arXiv ID: 2512.23707v1
- Categories: cs.LG, cs.CL, cs.HC
- Published: December 29, 2025