[Paper] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Source: arXiv - 2601.10649v1
Overview
The paper introduces CURVE, a new benchmark that pushes video‑understanding models to reason about long, culturally diverse videos in many languages. By moving beyond the usual Western‑centric, English‑only datasets, CURVE exposes how current Video‑LLMs handle real‑world visual cues that are tied to specific cultures and languages.
Key Contributions
- Multicultural, multilingual benchmark – 18 global locales, each with native‑language long‑form videos, questions, answers, and multi‑step reasoning traces.
- Human‑generated annotations – All content (including translations) is created by native speakers, avoiding the noise of automatic translation pipelines.
- Evidence‑based reasoning graphs – The authors convert the provided reasoning steps into structured graphs that can be used to pinpoint where a model’s reasoning goes wrong.
- Iterative error‑analysis strategy – A novel method that leverages the reasoning graphs to isolate fine‑grained perception and inference failures.
- Comprehensive evaluation – State‑of‑the‑art Video‑LLMs are benchmarked, revealing a large gap to human performance and highlighting cultural perception as the biggest bottleneck.
Methodology
- Data collection – Curators in each of the 18 regions sourced locally relevant long‑form videos (e.g., festivals, sports, daily life).
- Annotation pipeline – Native speakers wrote complex, multi‑hop questions that require understanding of visual context, cultural practices, and language nuances. For each question they also supplied a step‑by‑step reasoning chain and the final answer, all in the original language.
- Graph construction – Each reasoning chain is transformed into a directed graph where nodes represent visual or textual entities and edges capture logical dependencies (e.g., “the dancer’s costume → indicates a traditional ceremony”).
- Iterative evaluation – Models first generate an answer and a reasoning trace. The trace is aligned with the ground‑truth graph; mismatches are traced back to specific nodes, allowing the system to report whether the error is due to visual perception, language understanding, or logical inference. Both the graph representation and this alignment step are sketched in code below.
The pipeline is deliberately kept simple enough for developers to reproduce or extend with their own video data.
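To make the reasoning‑graph idea concrete, here is a minimal sketch under simplified assumptions: an annotation record (question, answer, reasoning steps) is converted into a small directed graph, and gold nodes missing from a model's trace are attributed to perception, language, or inference failures. The field names, category labels, and the `attribute_errors` helper are illustrative assumptions, not the authors' released tooling.

```python
from dataclasses import dataclass, field

# A single annotation record as described above: a native-language question,
# a step-by-step reasoning chain, and the final answer. Field names and the
# perception/language/inference labels are assumptions for illustration only.
record = {
    "locale": "ja-JP",
    "question": "Which kind of event is being prepared in the video?",
    "answer": "a traditional summer festival",
    "reasoning_steps": [
        # (premise entity, its category, conclusion it supports)
        ("dancer's embroidered costume", "perception", "a traditional ceremony is underway"),
        ("a traditional ceremony is underway", "inference", "the event is a summer festival"),
    ],
}

@dataclass
class ReasoningGraph:
    categories: dict = field(default_factory=dict)  # node description -> category
    edges: list = field(default_factory=list)       # (premise, conclusion) dependencies

    def add_step(self, premise: str, category: str, conclusion: str) -> None:
        """Register a dependency such as 'costume -> indicates a traditional ceremony'."""
        self.categories.setdefault(premise, category)
        self.categories.setdefault(conclusion, "inference")
        self.edges.append((premise, conclusion))

def build_graph(rec: dict) -> ReasoningGraph:
    """Turn a reasoning chain into a directed evidence graph."""
    graph = ReasoningGraph()
    for premise, category, conclusion in rec["reasoning_steps"]:
        graph.add_step(premise, category, conclusion)
    return graph

def attribute_errors(gold: ReasoningGraph, model_trace: set) -> dict:
    """Toy attribution: gold nodes missing from the model's trace are counted
    against their category (perception / language / inference)."""
    counts = {"perception": 0, "language": 0, "inference": 0}
    for node, category in gold.categories.items():
        if node not in model_trace:
            counts[category] += 1
    return counts

gold = build_graph(record)
# Suppose the model's trace only mentions the festival, missing the visual cue.
print(attribute_errors(gold, model_trace={"the event is a summer festival"}))
# -> {'perception': 1, 'language': 0, 'inference': 1}
```

In practice, aligning a free‑form model trace to gold nodes would require fuzzy or embedding‑based matching rather than exact string membership; the sketch only shows where the attribution signal comes from.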
Results & Findings
| Model | Avg. Accuracy (English) | Avg. Accuracy (Native) | Human Baseline |
|---|---|---|---|
| Flamingo‑Video‑LLM | 38 % | 31 % | 92 % |
| InternVideo‑Chat | 42 % | 35 % | 92 % |
| GPT‑4‑Vision (zero‑shot) | 45 % | 38 % | 92 % |
- Performance drop in native languages: All models lose roughly 7–10 percentage points when answering in the video's original language, confirming that multilingual grounding is a major challenge (a small aggregation sketch follows this list).
- Error taxonomy: Using the reasoning‑graph analysis, ~60 % of failures stem from mis‑identifying cultural visual cues (e.g., traditional garments, regional food), ~25 % from language parsing, and the remaining ~15 % from logical chaining.
- Human‑level gap: Even the strongest Video‑LLM trails human annotators by roughly 50 percentage points, indicating that current architectures lack deep cultural situational awareness.
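As a rough illustration of how the English‑versus‑native gap can be summarized, the snippet below aggregates per‑model scores; the dictionary layout is an assumption and the numbers simply mirror the table above.

```python
# Minimal aggregation sketch: English-vs-native accuracy drop and gap to the
# human baseline per model. Structure and values are illustrative only.
results = {
    "Flamingo-Video-LLM": {"english": 0.38, "native": 0.31},
    "InternVideo-Chat":   {"english": 0.42, "native": 0.35},
    "GPT-4-Vision":       {"english": 0.45, "native": 0.38},
}
HUMAN_BASELINE = 0.92

for model, acc in results.items():
    drop = acc["english"] - acc["native"]          # native-language penalty
    gap_to_human = HUMAN_BASELINE - acc["native"]  # remaining headroom
    print(f"{model}: native drop = {drop:.0%}, gap to human (native) = {gap_to_human:.0%}")
```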
Practical Implications
- Global product localization – Companies building video‑based assistants, content moderation tools, or recommendation engines can use CURVE to audit whether their models truly understand region‑specific content, reducing cultural bias in user experiences.
- Multilingual video search – Search engines that index long videos (e.g., lecture recordings, cultural documentaries) can benchmark and improve cross‑language retrieval pipelines with CURVE’s native‑language queries.
- Safety & compliance – Automated moderation systems can be evaluated on culturally sensitive scenes (e.g., religious rituals) to avoid false positives/negatives that stem from mis‑recognizing cultural symbols.
- Model debugging – The evidence‑graph framework gives engineers a concrete way to trace a wrong answer back to a specific visual‑perception error, enabling targeted data augmentation (e.g., adding more examples of a particular traditional costume), as sketched below.
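As a sketch of that debugging loop, assuming a hypothetical per‑question error‑report format produced by the graph alignment, one could tally which entities most often trigger perception failures and use the ranking to prioritize augmentation data.

```python
from collections import Counter

# Hypothetical error reports from the graph-alignment step: each lists the gold
# nodes the model missed, with an entity label and an assumed error category.
error_reports = [
    {"missed": [{"entity": "traditional wedding garment", "category": "perception"}]},
    {"missed": [{"entity": "traditional wedding garment", "category": "perception"},
                {"entity": "ancestral rite", "category": "inference"}]},
    {"missed": [{"entity": "regional dialect term", "category": "language"}]},
]

# Rank entities by how often they cause perception errors -> candidate augmentation targets.
perception_misses = Counter(
    miss["entity"]
    for report in error_reports
    for miss in report["missed"]
    if miss["category"] == "perception"
)
print(perception_misses.most_common(3))
# -> [('traditional wedding garment', 2)]
```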
Overall, CURVE provides a realistic “stress test” for any video‑AI product that will be deployed worldwide.
Limitations & Future Work
- Scope of locales – While 18 regions cover a broad spectrum, many languages and sub‑cultures remain unrepresented (e.g., Indigenous groups, low‑resource languages).
- Static annotation style – The reasoning steps are handcrafted; future work could explore crowdsourced or model‑generated traces to increase diversity.
- Model‑centric focus – The benchmark evaluates existing Video‑LLMs but does not propose architectural changes; extending the work to incorporate cultural priors (e.g., knowledge graphs) is an open direction.
- Scalability – Curating high‑quality, multilingual long‑form videos is labor‑intensive; automating parts of the pipeline while preserving annotation fidelity is a promising research avenue.
By addressing these gaps, the community can move toward truly universal video understanding systems that respect and reflect the world’s cultural richness.
Authors
- Darshan Singh
- Arsha Nagrani
- Kawshik Manikantan
- Harman Singh
- Dinesh Tewari
- Tobias Weyand
- Cordelia Schmid
- Anelia Angelova
- Shachi Dave
Paper Information
- arXiv ID: 2601.10649v1
- Categories: cs.CV
- Published: January 15, 2026
- PDF: https://arxiv.org/pdf/2601.10649v1