[Paper] BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
Source: arXiv - 2604.16241v1
Overview
The paper introduces BAGEL, a new benchmark that measures how well large language models (LLMs) know animal‑related facts—from taxonomy and morphology to vocalizations and species interactions—without relying on external retrieval at inference time. By unifying a wide range of scientific and reference sources into a closed‑book question‑answer format, BAGEL lets researchers pinpoint exactly where LLMs excel or stumble in biodiversity knowledge, a domain that’s increasingly relevant for environmental tech, bioinformatics tools, and AI‑driven conservation platforms.
Key Contributions
- A unified animal‑knowledge benchmark built from curated and automatically generated QA pairs sourced from bioRxiv, Global Biotic Interactions (GloBI), Xeno‑canto, Wikipedia, and other repositories.
- Fine‑grained taxonomy of knowledge categories (taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, species interactions) enabling detailed error analysis.
- Closed‑book evaluation protocol that isolates the model’s internal knowledge from retrieval‑augmented pipelines, providing a clean signal of “knowledge memorization” versus “search‑and‑retrieve”.
- Cross‑domain and cross‑taxa breakdowns (e.g., mammals vs. insects, marine vs. terrestrial) to surface systematic biases or gaps in model training data.
- Open‑source release of the dataset and evaluation scripts, encouraging community‑wide benchmarking and future extensions.
Methodology
- Data Collection – The authors harvested structured facts from scientific literature (bioRxiv), curated interaction databases (GloBI), audio‑recording archives (Xeno‑canto), and general encyclopedic entries (Wikipedia).
- Question Generation – Two pipelines were used:
  - Curated examples: human‑written QA pairs ensuring high quality for complex concepts.
  - Automatic generation: template‑based transformations (e.g., “What family does Panthera leo belong to?”) applied to structured triples, producing thousands of diverse items.
- Closed‑Book Formatting – All QA pairs are presented as a single prompt (e.g., “Q: … A:”) with no external context, forcing the model to rely solely on its stored parameters.
- Evaluation Suite – Models are scored using exact‑match and token‑level F1, and results are broken down by source, taxonomic group, and knowledge category.
- Baseline Models – The study evaluates several state‑of‑the‑art LLMs (e.g., GPT‑3.5, LLaMA‑2, Claude) to establish reference performance levels.
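The template‑based generation step can be sketched as a simple mapping from structured triples to question strings. The template wordings, relation names, and triples below are illustrative assumptions, not the paper's actual data or code:

```python
# Minimal sketch of template-based QA generation from structured
# (subject, relation, object) triples. Templates and triples are
# invented for illustration.
from typing import Iterator

# Hypothetical templates keyed by relation type.
TEMPLATES = {
    "family": "What family does {subject} belong to?",
    "habitat": "What is the primary habitat of {subject}?",
}

def generate_qa(triples) -> Iterator[dict]:
    """Turn structured triples into closed-book QA pairs."""
    for subject, relation, obj in triples:
        template = TEMPLATES.get(relation)
        if template is None:
            continue  # skip relations without a template
        yield {
            "question": template.format(subject=subject),
            "answer": obj,
            "relation": relation,  # kept for per-category breakdowns
        }

triples = [
    ("Panthera leo", "family", "Felidae"),
    ("Panthera leo", "habitat", "savanna"),
]
pairs = list(generate_qa(triples))
print(pairs[0]["question"])  # -> What family does Panthera leo belong to?
```

Keeping the relation name on each QA pair is what makes the paper's per‑category error breakdowns cheap to compute downstream.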
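The two scoring metrics are standard for extractive QA. A minimal sketch of exact match and token‑level F1, assuming SQuAD‑style normalization (lowercasing and punctuation stripping; the paper's exact normalization rules may differ):

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the family Felidae", "Felidae"))        # -> 0.0
print(round(token_f1("the family Felidae", "Felidae"), 2)) # -> 0.5
```

The example shows why both metrics are reported: a verbose but correct answer scores zero on exact match yet partial credit on token F1.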
Results & Findings
- Overall performance: Even the largest LLMs achieve only 55–62 % exact‑match accuracy on BAGEL, far below human expert levels (~95 %).
- Category disparities: Models excel at taxonomy and geographic distribution (≈70 % accuracy) but struggle with vocalization and species interactions (≈40 %).
- Source bias: QA items derived from Wikipedia are answered more accurately than those from specialized databases like GloBI, indicating a training‑data skew toward general‑purpose text.
- Taxonomic gaps: Accuracy for insects and marine invertebrates lags behind mammals and birds by 15–20 %, reflecting under‑representation in pre‑training corpora.
- Model size vs. knowledge: Scaling up model parameters yields diminishing returns beyond ~30 B parameters for many animal‑specific queries, suggesting that sheer size isn’t enough to cover niche domains.
Practical Implications
- Biodiversity tech – Companies building AI‑assisted field guides, wildlife monitoring dashboards, or automated species‑identification pipelines can use BAGEL to audit their LLM back‑ends before deployment.
- Retrieval‑augmented systems – BAGEL highlights where a pure LLM will fail, guiding developers to integrate external knowledge bases (e.g., taxonomic APIs) for high‑risk queries.
- Prompt engineering – The fine‑grained error analysis suggests that few‑shot prompts with domain‑specific exemplars can boost performance on weak categories like vocalization.
- Regulatory & safety – For AI applications that influence conservation policy or public health (e.g., zoonotic disease tracking), BAGEL provides a measurable baseline to certify that the model’s factual grounding meets required standards.
- Dataset creation pipelines – The hybrid curated/automatic approach can be replicated for other niche domains (e.g., plant pathology, marine chemistry), enabling rapid expansion of domain‑specific benchmarks.
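The few‑shot prompting idea above can be sketched as a small prompt builder. The exemplars here are invented for demonstration and are not drawn from the BAGEL dataset:

```python
# Illustrative few-shot prompt builder for a weak category such as
# vocalization. Exemplar QA pairs are hypothetical.
EXEMPLARS = [
    ("What sound does the gray wolf make?", "howl"),
    ("Which bird is known for mimicking the calls of other species?",
     "northern mockingbird"),
]

def build_prompt(question: str, exemplars=EXEMPLARS) -> str:
    """Prepend domain-specific exemplars in the same Q/A format
    used by the closed-book evaluation, then append the target question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\nQ: {question}\nA:"

prompt = build_prompt("What call does the common loon produce?")
print(prompt)
```

Matching the exemplar format to the benchmark's own "Q: … A:" layout keeps the comparison to the zero‑shot closed‑book protocol clean.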
Limitations & Future Work
- Static knowledge – BAGEL captures a snapshot of animal facts; it does not test models on recent taxonomic revisions or newly discovered species.
- Closed‑book focus – While useful for measuring memorized knowledge, many real‑world systems will combine LLMs with retrieval; future work should evaluate hybrid pipelines on the same questions.
- Language coverage – The benchmark is currently English‑only, limiting assessment of multilingual models that may be deployed in biodiversity‑rich regions.
- Depth of reasoning – Most questions are factual recall; extending BAGEL to causal or mechanistic queries (e.g., “Why do certain birds migrate at night?”) would probe deeper reasoning abilities.
The authors release BAGEL publicly, inviting the community to address these gaps, expand the taxonomy, and ultimately build more reliable AI tools for the planet’s biodiversity challenges.
Authors
- Jiacheng Shen
- Masato Hagiwara
- Milad Alizadeh
- Ellen Gilsenan-McMahon
- Marius Miron
- David Robinson
- Emmanuel Chemla
- Sara Keen
- Gagan Narula
- Mathieu Laurière
- Matthieu Geist
- Olivier Pietquin
Paper Information
- arXiv ID: 2604.16241v1
- Categories: cs.CL, cs.AI
- Published: April 17, 2026