[Paper] Jina-VLM: Small Multilingual Vision Language Model
Source: arXiv - 2512.04032v1
Overview
Jina‑VLM is a 2.4 B‑parameter vision‑language model that targets multilingual visual question answering (VQA) while staying in the small‑model regime. It couples a SigLIP2 vision encoder with the Qwen‑3 language model through a novel attention‑pooling connector, allowing it to ingest images at arbitrary resolutions without inflating the visual token count. The result is state‑of‑the‑art performance on multilingual VQA benchmarks and solid performance on pure‑text tasks.
Key Contributions
- Compact multilingual VLM: First open‑source model under 3 B parameters that simultaneously excels at multilingual VQA and text‑only tasks.
- Attention‑pooling connector: A lightweight module that compresses variable‑size visual feature maps into a fixed‑length token sequence, enabling token‑efficient processing of high‑resolution images.
- SigLIP2 + Qwen‑3 fusion: Demonstrates that coupling a modern contrastive vision encoder (SigLIP2) with a large‑scale LLM (Qwen‑3) yields superior cross‑modal reasoning without massive parameter growth.
- State‑of‑the‑art multilingual VQA: Outperforms all open 2 B‑scale VLMs on standard VQA datasets (e.g., VQAv2, GQA) and multilingual extensions (e.g., X‑VQA, MME‑Multi).
- Open‑source release: Model weights, training scripts, and evaluation pipelines are publicly available, encouraging community adoption and further research.
Methodology
Vision Backbone – SigLIP2
- Trained with a contrastive image‑text objective on a large, diverse image corpus.
- Produces a dense feature map (height × width × channels) for any input resolution.
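To make the resolution-to-token trade-off concrete, the short sketch below (an illustration, not code from the paper) computes the raw patch-token count a ViT-style encoder produces at different resolutions; the 14 px patch size is an assumed value.

```python
# Illustrative only: the 14 px patch size is an assumption, not the paper's value.
def num_patch_tokens(height: int, width: int, patch_size: int = 14) -> int:
    """Raw patch-token count a ViT-style encoder emits for an image."""
    return (height // patch_size) * (width // patch_size)

for side in (224, 448, 1024):
    print(f"{side} px square image -> {num_patch_tokens(side, side)} patch tokens")
# 224 -> 256, 448 -> 1024, 1024 -> 5329: without pooling, the visual
# token count grows roughly quadratically with resolution.
```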
Attention‑Pooling Connector
- Takes the 2‑D feature map and applies a multi‑head attention layer that learns to pool the spatial tokens into a small, fixed‑size set (e.g., 8–12 tokens).
- This preserves salient visual information while keeping the token budget low for the language model (see the sketch below).
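One common way to realize such a connector is a small set of learned query vectors that cross-attend to the variable-length patch features, so the output length always equals the number of queries. The sketch below follows that pattern; the hidden sizes, head count, and query count are assumptions for illustration, and the paper's exact connector design may differ.

```python
import torch
import torch.nn as nn

class AttentionPoolingConnector(nn.Module):
    """Pool a variable-length patch sequence into a fixed number of visual tokens.

    Illustrative sketch: hidden sizes, head count, and the number of pooled
    tokens (num_queries) are assumed values, not the paper's configuration.
    """

    def __init__(self, vision_dim=1152, llm_dim=2048, num_queries=12, num_heads=8):
        super().__init__()
        # Learned queries: one per pooled visual token.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim); num_patches varies with resolution.
        batch = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(q, patch_feats, patch_feats)  # (batch, num_queries, vision_dim)
        return self.proj(pooled)                            # (batch, num_queries, llm_dim)

# Regardless of image resolution, the output is always num_queries tokens:
feats_lowres  = torch.randn(1, 256,  1152)   # e.g. a low-resolution image
feats_highres = torch.randn(1, 5329, 1152)   # e.g. a high-resolution image
connector = AttentionPoolingConnector()
print(connector(feats_lowres).shape, connector(feats_highres).shape)
# torch.Size([1, 12, 2048]) torch.Size([1, 12, 2048])
```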
Language Backbone – Qwen‑3
- A decoder‑only transformer pre‑trained on massive multilingual text data (≈ 100 languages).
- Receives the pooled visual tokens prepended to the textual prompt, enabling joint reasoning.
Training Regime
- Stage 1: Freeze the vision encoder and fine‑tune the connector and language model on a mixture of image‑text pairs and instruction‑following data.
- Stage 2: End‑to‑end fine‑tuning on multilingual VQA datasets, using cross‑entropy loss on answer tokens.
- Curriculum: Start with low‑resolution images, gradually increase resolution to teach the connector to handle arbitrary sizes.
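A minimal sketch of the two-stage regime, assuming standard PyTorch freezing via requires_grad; the stand-in modules, learning rates, and loss-masking detail are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

# Stand-in modules so the sketch runs; in practice these are the SigLIP2 encoder,
# the attention-pooling connector, and the Qwen-3 decoder.
vision_encoder = nn.Linear(768, 1152)
connector      = nn.Linear(1152, 2048)
language_model = nn.Linear(2048, 32000)

# Stage 1: freeze the vision encoder; train only the connector and the LLM.
for p in vision_encoder.parameters():
    p.requires_grad = False
stage1_params = [p for m in (connector, language_model) for p in m.parameters()]
optimizer = torch.optim.AdamW(stage1_params, lr=1e-4)  # assumed hyperparameters

# Stage 2: unfreeze everything and fine-tune end to end on multilingual VQA.
for p in vision_encoder.parameters():
    p.requires_grad = True
all_params = [p for m in (vision_encoder, connector, language_model) for p in m.parameters()]
optimizer = torch.optim.AdamW(all_params, lr=1e-5)  # assumed hyperparameters

# Cross-entropy on answer tokens only: prompt positions are typically masked with
# label -100, which nn.CrossEntropyLoss ignores by default (ignore_index=-100).
```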
Inference Pipeline
- Input image → SigLIP2 → attention‑pooling → token sequence → Qwen‑3 → generated answer.
- Because the visual token count is constant, inference latency scales mainly with language model size, not image resolution.
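Tying the pieces together, a compact end-to-end sketch of the pipeline, reusing the AttentionPoolingConnector sketch above and stand-ins for the real SigLIP2 encoder and Qwen‑3 decoder (a single forward pass; real generation runs autoregressively):

```python
import torch
import torch.nn as nn

vision_encoder = nn.Linear(768, 1152)       # stand-in for SigLIP2 patch features
connector = AttentionPoolingConnector()     # sketch defined in the Methodology section above
text_embedder = nn.Embedding(32000, 2048)   # stand-in for Qwen-3's token embedding table
decoder = nn.Linear(2048, 32000)            # stand-in for the Qwen-3 decoder + LM head

def vqa_forward(image_patches: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
    patch_feats = vision_encoder(image_patches)             # (1, num_patches, 1152)
    visual_tokens = connector(patch_feats)                  # (1, 12, 2048), fixed length
    text_embeds = text_embedder(prompt_ids)                 # (1, num_text_tokens, 2048)
    fused = torch.cat([visual_tokens, text_embeds], dim=1)  # visual tokens prepended to the prompt
    return decoder(fused)                                   # per-position next-token logits

# A high-resolution image (many patches) plus a 32-token prompt still yields only
# 12 + 32 = 44 positions for the language model to attend over.
logits = vqa_forward(torch.randn(1, 5329, 768), torch.randint(0, 32000, (1, 32)))
print(logits.shape)  # torch.Size([1, 44, 32000])
```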
Results & Findings
| Benchmark | Jina‑VLM (2.4 B) | Prior open‑source 2 B‑scale VLM | Text‑only (e.g., MMLU) |
|---|---|---|---|
| VQAv2 (English) | 78.4 % | 73.1 % | 71.2 % |
| GQA (English) | 71.9 % | 66.5 % | — |
| X‑VQA (10 languages) | 65.3 % avg | 58.7 % avg | — |
| MME‑Multi (multilingual) | 62.1 % | 55.4 % | — |
| MMLU (text‑only) | 71.8 % | 70.2 % | — |
- Token efficiency: The attention‑pooling connector reduces the visual token count from ~1,000 (full patch grid) to ≤12, cutting cross‑modal attention cost by ~90 % without hurting accuracy (see the rough illustration after this list).
- Resolution robustness: Experiments with images from 224 px up to 1,024 px show <2 % performance drift, confirming that the connector generalizes across scales.
- Multilingual transfer: Even languages with limited VQA data (e.g., Swahili, Urdu) see >10 % absolute gains over baselines, indicating strong cross‑lingual visual grounding.
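As a rough, back-of-the-envelope illustration of the token-efficiency point above (the 128-token prompt length is an assumption, and the paper's exact accounting may differ):

```python
# Rough illustration only; the 128-token prompt length is an assumed value.
prompt_tokens = 128

for visual_tokens in (1000, 12):                 # full patch grid vs. pooled tokens
    seq_len = visual_tokens + prompt_tokens
    print(f"{visual_tokens:4d} visual tokens -> sequence length {seq_len}")
# 1000 visual tokens -> sequence length 1128
#   12 visual tokens -> sequence length  140
# Each step attends over roughly 8x fewer positions (~88% less), consistent
# with the reported ~90% reduction for typical prompt lengths.
```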
Practical Implications
- Enterprise AI assistants: Companies can embed Jina‑VLM in chat‑bots that need to understand screenshots, product photos, or UI mockups in multiple languages without paying the inference cost of 10 B‑plus models.
- Edge & mobile deployment: The fixed, tiny visual token stream makes it feasible to run the model on devices with limited GPU memory (e.g., NVIDIA Jetson, Apple M‑series) while still handling high‑resolution inputs.
- Content moderation & accessibility: Multilingual visual QA can power automated captioning, image‑based FAQ systems, or accessibility tools that answer visual queries in the user’s native language.
- Rapid prototyping: Open‑source weights and a simple API let developers experiment with “visual prompting” (e.g., “What’s the error code on this screen?”) across global user bases.
Limitations & Future Work
- Scale ceiling: While 2.4 B parameters strike a good balance, the model still lags behind the very latest 10 B‑plus VLMs on niche visual reasoning tasks (e.g., detailed scene graph generation).
- Language coverage: Performance drops noticeably for low‑resource languages not well‑represented in the Qwen‑3 pre‑training corpus; further multilingual pre‑training is needed.
- Connector interpretability: The attention‑pooling step is a black box; visualizing which patches contribute to each pooled token remains an open research direction.
- Future directions proposed by the authors include scaling the connector to multi‑token visual “memory” slots, integrating retrieval‑augmented generation for open‑domain visual QA, and extending training to video question answering.
Authors
- Andreas Koukounas
- Georgios Mastrapas
- Florian Hönicke
- Sedigheh Eslami
- Guillaume Roncari
- Scott Martens
- Han Xiao
Paper Information
- arXiv ID: 2512.04032v1
- Categories: cs.CL, cs.AI, cs.CV
- Published: December 3, 2025