[Paper] GENIUS: Generative Fluid Intelligence Evaluation Suite
Source: arXiv - 2602.11144v1
Overview
The GENIUS benchmark shines a light on a missing piece of today’s multimodal AI evaluation: generative fluid intelligence—the ability of models to infer patterns, respect ad‑hoc constraints, and adapt to fresh contexts on the fly. By moving beyond static knowledge recall, the authors expose how current Unified Multimodal Models (UMMs) struggle when asked to “think on their feet” in visual generation tasks.
Key Contributions
- Formal definition of Generative Fluid Intelligence (GFI) as a combination of three core primitives: pattern induction, constraint execution, and contextual adaptation.
- GENIUS suite: a curated set of multimodal tasks that require on‑the‑spot reasoning (e.g., personalizing visual styles, visualizing abstract metaphors, simulating counter‑intuitive physics).
- Comprehensive evaluation of 12 state‑of‑the‑art UMMs, revealing systematic performance gaps on GFI tasks.
- Diagnostic analysis that traces the root cause of failures to limited context comprehension rather than insufficient generative capacity.
- Training‑free attention intervention: a lightweight method that re‑weights cross‑modal attention at inference time, yielding measurable gains without extra training data.
- Open‑source release of the dataset, evaluation scripts, and intervention code to foster reproducibility and community adoption.
Methodology
- Task Design – Each GFI task is built around a single prompt that contains all necessary information; no external knowledge bases are consulted. The three primitives are instantiated in concrete visual generation scenarios:
- Inducing Implicit Patterns: The model must infer a user’s hidden aesthetic preference from a few example images and generate new content accordingly.
- Executing Ad‑hoc Constraints: The prompt includes abstract constraints (e.g., “draw a city that feels like a jazz solo”), forcing the model to map non‑visual concepts to visual elements.
- Adapting to Contextual Knowledge: Scenarios such as “show a ball that rolls uphill” require the model to violate everyday physics while staying coherent.
- Benchmark Construction – Over 1,200 prompts were authored, balanced across the three primitives and spanning diverse domains (art, UI design, scientific illustration). Human‑verified reference outputs provide a gold standard for evaluation.
- Evaluation Protocol – Generated images are scored using a mix of automated metrics (CLIP‑based similarity, constraint‑specific classifiers) and human judgments (crowd‑sourced ratings of pattern fidelity, constraint satisfaction, and contextual plausibility).
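As a rough sketch of how such a mixed protocol might combine automated and human signals: the function below computes a CLIP‑style cosine similarity between embeddings and blends it with a constraint‑classifier output and a human rating. The embeddings, the `gfi_score` aggregation, and its weights are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def cosine_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a prompt embedding
    (in practice these would come from a CLIP-like encoder)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

def gfi_score(clip_sim, constraint_ok, human_rating, weights=(0.3, 0.3, 0.4)):
    """Blend automated and human signals into one score in [0, 1].
    The weighting is a hypothetical choice; the paper does not specify it."""
    w1, w2, w3 = weights
    return w1 * clip_sim + w2 * constraint_ok + w3 * human_rating
```

For example, a generation with perfect CLIP similarity, a passing constraint classifier, and a 0.9 human rating would score 0.3 + 0.3 + 0.36 = 0.96 under these assumed weights.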
- Attention Intervention – At inference, the authors compute a context relevance map from the prompt's token embeddings and boost attention weights toward tokens that encode the current primitive's cues. This requires no retraining, only a forward‑pass modification.
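One plausible way to realize such a forward‑pass intervention is to add a scaled log‑relevance term to the cross‑attention logits before the softmax, so that `beta = 0` recovers the unmodified attention. This additive‑logit form is an assumption for illustration; the summary does not specify the exact re‑weighting rule.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reweight_cross_attention(attn_logits, relevance, beta=1.0):
    """Boost attention toward prompt tokens flagged as contextually relevant.

    attn_logits: (queries, tokens) pre-softmax cross-attention scores
    relevance:   (tokens,) non-negative relevance map for the current primitive
    beta:        intervention strength; beta=0 leaves the attention unchanged
    """
    boosted = attn_logits + beta * np.log(relevance + 1e-8)
    return softmax(boosted)
```

Because the change is confined to the attention logits at inference time, it can be dropped into an existing generation pipeline without touching the model weights.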
Results & Findings
- Baseline gap: Across the board, the best‑performing UMM (a diffusion model with CLIP guidance) achieved only 42% average human‑rated satisfaction on GFI tasks, compared to > 80% on traditional knowledge‑recall benchmarks.
- Primitive‑wise performance: Models were relatively better at pattern induction (≈ 48%) but struggled heavily with ad‑hoc constraints (≈ 35%) and contextual adaptation (≈ 33%).
- Diagnostic insight: Ablation studies showed that when the prompt’s contextual cues were explicitly highlighted (e.g., by duplicating key tokens), performance rose by up to 12%, indicating that the bottleneck is context parsing rather than image synthesis.
- Attention intervention impact: Applying the training‑free re‑weighting boosted average scores by 7–9% across models, with the largest gains on constraint‑heavy prompts. No degradation was observed on standard generation tasks.
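The token‑duplication ablation mentioned above can be mimicked with a simple prompt transformation: repeat the tokens that carry the contextual cue before handing the prompt to the model. The helper below is a hypothetical sketch of that diagnostic, not the authors' released code.

```python
def duplicate_cue_tokens(prompt, cues, copies=2):
    """Repeat tokens that carry the prompt's contextual cues,
    e.g. emphasizing 'uphill' in a counter-intuitive-physics prompt."""
    out = []
    for token in prompt.split():
        out.append(token)
        if token.strip(".,!?").lower() in cues:
            out.extend([token] * (copies - 1))
    return " ".join(out)
```

Feeding the transformed prompt (e.g. "a ball that rolls uphill uphill") to the same model isolates whether gains come from better context parsing rather than any change to the generator itself.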
Practical Implications
- Product design & personalization: Tools that need to adapt to a user’s evolving style (e.g., AI‑assisted UI mockups) can benefit from GFI‑aware training or inference tricks to better capture implicit preferences.
- Creative AI assistants: For brainstorming sessions where designers ask for “visual metaphors” or “impossible physics,” incorporating GFI evaluation can guide model selection and fine‑tuning.
- Safety & alignment: Understanding a model’s ability to respect ad‑hoc constraints is crucial for preventing unintended outputs in regulated domains (e.g., medical illustration, autonomous vehicle simulation).
- Rapid prototyping: The training‑free attention intervention offers a low‑cost way to improve existing pipelines without the expense of large‑scale fine‑tuning.
Limitations & Future Work
- Scope of primitives – The current three‑primitive formulation, while expressive, may not capture all facets of fluid intelligence (e.g., temporal reasoning or multimodal dialogue).
- Dataset bias – Prompt creation relied on human authors from a limited cultural background, potentially skewing what counts as “intuitive” or “counter‑intuitive.”
- Metric reliance on CLIP – Automated scores depend heavily on CLIP embeddings, which inherit their own biases and may not fully reflect nuanced human judgments.
- Intervention generality – The attention re‑weighting works well for diffusion‑based generators but its efficacy on autoregressive or transformer‑only visual models remains untested.
Future research directions include expanding GENIUS to video generation, integrating multimodal dialogue contexts, and exploring learned attention‑modulation modules that can adapt dynamically during inference.
Authors
- Ruichuan An
- Sihan Yang
- Ziyu Guo
- Wei Dai
- Zijun Shen
- Haodong Li
- Renrui Zhang
- Xinyu Wei
- Guopeng Li
- Wenshan Wu
- Wentao Zhang
Paper Information
- arXiv ID: 2602.11144v1
- Categories: cs.LG, cs.AI, cs.CV
- Published: February 11, 2026