[Paper] PaperBanana: Automating Academic Illustration for AI Scientists

Published: January 30, 2026 at 01:33 PM EST
4 min read
Source: arXiv - 2601.23265v1

Overview

The paper presents PaperBanana, an end‑to‑end framework that lets AI researchers automatically generate publication‑ready figures (methodology diagrams, plots, etc.). By combining large vision‑language models (VLMs) with modern image generators and a set of specialized “agents,” the system handles everything from gathering reference material to polishing the final illustration, sharply reducing the manual effort that currently slows the research‑to‑paper pipeline.

Key Contributions

  • Agentic illustration pipeline – a modular system of agents (retrieval, planning, rendering, self‑critique) that coordinates VLMs and diffusion‑based image generators to produce scholarly figures.
  • PaperBananaBench – a new benchmark of 292 real‑world illustration tasks extracted from NeurIPS 2025 papers, covering a wide range of domains (ML, CV, NLP) and visual styles.
  • Comprehensive evaluation metrics – quantitative and human‑rated scores for faithfulness to the described method, conciseness, readability, and aesthetic quality, showing consistent gains over existing baselines.
  • Extension to statistical plots – demonstrates that the same agentic workflow can generate accurate, high‑resolution charts (e.g., loss curves, confusion matrices) without hand‑crafted code.
  • Open‑source release – code, model checkpoints, and the benchmark dataset are publicly released to foster reproducibility and community extensions.

Methodology

  1. Reference Retrieval Agent – parses the paper’s text, extracts figure captions, and searches a curated image corpus (arXiv PDFs, prior conference figures) for style and content cues.
  2. Planning Agent – uses a VLM (e.g., GPT‑4V) to translate the textual description into a structured “scene graph” that lists visual components (blocks, arrows, labels) and desired styling (color palette, font).
  3. Rendering Agent – feeds the scene graph to a diffusion image generator (Stable Diffusion‑XL or a custom fine‑tuned model) that produces a high‑resolution draft illustration.
  4. Self‑Critique & Refinement Loop – the VLM evaluates the draft against the original description, flags mismatches (e.g., missing arrows, wrong axis labels), and iteratively prompts the renderer to adjust until a stopping criterion (confidence threshold or max iterations) is met.
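
The critique‑and‑refine loop in step 4 can be sketched as follows. The `render` and `critique` functions here are hypothetical stubs standing in for the diffusion renderer and the VLM judge, and the stopping logic (confidence threshold or max iterations) follows the description above; none of this is the paper's actual API.

```python
# Sketch of the self-critique & refinement loop (step 4).
# `render` and `critique` are hypothetical stubs; the real system
# calls a diffusion model and a VLM judge, respectively.

CONF_THRESHOLD = 0.85
MAX_ITERS = 5

def render(scene_graph, feedback=None):
    """Stub renderer: pretend each round of feedback improves the draft."""
    quality = 0.5 + 0.2 * len(feedback or [])
    return {"image": f"draft_v{len(feedback or []) + 1}", "quality": quality}

def critique(draft, description):
    """Stub VLM judge: returns a confidence score and a list of issues."""
    confidence = min(draft["quality"], 1.0)
    issues = [] if confidence >= CONF_THRESHOLD else ["missing arrow label"]
    return confidence, issues

def refine(scene_graph, description):
    """Iterate until the judge is confident or the iteration budget is spent."""
    feedback = []
    for _ in range(MAX_ITERS):
        draft = render(scene_graph, feedback)
        confidence, issues = critique(draft, description)
        if confidence >= CONF_THRESHOLD or not issues:
            return draft, confidence
        feedback.extend(issues)  # feed flagged mismatches back to the renderer
    return draft, confidence

draft, conf = refine({"nodes": ["encoder", "decoder"]},
                     "encoder-decoder with attention")
```

With these stubs the loop converges in three rounds; in the real system the number of iterations would depend on figure complexity, which is exactly the cost the authors flag under limitations.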

All agents communicate through a lightweight JSON protocol, making it easy to swap out components (e.g., replace the VLM with a newer multimodal model).
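
A scene‑graph message of the kind the Planning Agent might hand to the Rendering Agent could look like the following. The summary does not give the actual schema, so every field name here is an illustrative assumption.

```python
import json

# Illustrative scene-graph message; the field names are assumptions,
# not the paper's actual JSON schema.
scene_graph = {
    "components": [
        {"id": "enc", "type": "block", "label": "Encoder"},
        {"id": "dec", "type": "block", "label": "Decoder"},
        {"id": "attn", "type": "arrow", "from": "enc", "to": "dec",
         "label": "attention"},
    ],
    "style": {"palette": ["#1f77b4", "#ff7f0e"], "font": "Helvetica"},
}

message = json.dumps(scene_graph)   # serialized for the next agent
restored = json.loads(message)      # the receiving agent parses it back
```

Because the payload is plain JSON, swapping one agent for another only requires the replacement to read and emit the same fields, which is the modularity the paragraph above describes.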

Results & Findings

  • Faithfulness: PaperBanana achieved a 23 % higher match score (human‑rated) to the intended methodology compared with the strongest baseline (a prompt‑only diffusion approach).
  • Conciseness & Readability: Figures were rated 1.8 points higher on a 5‑point scale for avoiding unnecessary visual clutter and for clear labeling.
  • Aesthetics: Using a learned aesthetic predictor, PaperBanana’s outputs scored in the top 10 % of all figures in the benchmark, surpassing baselines by 0.42 on a 0–1 scale.
  • Statistical Plots: When tasked with generating line charts and bar graphs, the system produced plots with <2 % numerical error and received a 4.6/5 readability rating from domain experts.
  • Efficiency: End‑to‑end generation averaged 45 seconds per figure on a single A100 GPU, compared with an estimated 30‑60 minutes of manual design per figure for a typical researcher.
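
The "<2 % numerical error" figure suggests a relative‑error comparison between the values a plot should encode and the values it actually shows. The paper's exact metric is not given in this summary; a minimal mean‑relative‑error check of that general kind might look like:

```python
# Illustrative metric, not the paper's exact definition: average relative
# deviation between ground-truth values and values read off the chart.
def mean_relative_error(reference, extracted):
    """Average |extracted - reference| / |reference| over plotted values."""
    errors = [abs(e - r) / abs(r) for r, e in zip(reference, extracted)]
    return sum(errors) / len(errors)

reference = [0.90, 0.75, 0.60, 0.50]   # ground-truth values behind the plot
extracted = [0.91, 0.74, 0.61, 0.50]   # values read off the generated chart

err = mean_relative_error(reference, extracted)  # ~1%, under the 2% bar
```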

Practical Implications

  • Speed up manuscript preparation – Researchers can request a figure with a single sentence (“show the encoder‑decoder architecture with attention”) and receive a ready‑to‑publish illustration, freeing time for experiments and writing.
  • Consistent visual style across a paper – By feeding a style reference once, all subsequent figures inherit the same palette, fonts, and layout, improving the professional look of submissions.
  • Automated report generation – Companies building internal AI dashboards can integrate PaperBanana to auto‑create method diagrams for model cards, compliance documents, or technical blogs.
  • Educational tools – Platforms teaching ML concepts can generate custom diagrams on‑the‑fly to match a learner’s preferred visual style or to illustrate novel architectures not covered in textbooks.
  • Reduced reliance on graphic designers – Small labs or solo researchers can produce high‑quality figures without hiring external design help, lowering the barrier to high‑impact publications.

Limitations & Future Work

  • Domain‑specific symbols – The current VLM sometimes misinterprets niche symbols (e.g., custom loss functions) and may need additional fine‑tuning on specialized corpora.
  • Scalability of self‑critique – The iterative refinement loop can become costly for very complex figures; future work will explore learned stopping policies or hierarchical planning.
  • Evaluation breadth – PaperBananaBench focuses on NeurIPS 2025 papers; extending the benchmark to other venues (ICML, CVPR) and non‑English papers will test generality.
  • Interactive editing – While the system produces a final image, integrating a lightweight UI for post‑generation tweaks (e.g., moving an arrow) would make it more user‑friendly.

Overall, PaperBanana marks a significant step toward fully automated scientific illustration, promising to streamline the research publishing workflow and open new possibilities for AI‑driven content creation.

Authors

  • Dawei Zhu
  • Rui Meng
  • Yale Song
  • Xiyu Wei
  • Sujian Li
  • Tomas Pfister
  • Jinsung Yoon

Paper Information

  • arXiv ID: 2601.23265v1
  • Categories: cs.CL, cs.CV
  • Published: January 30, 2026