[Paper] LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
Source: arXiv - 2512.16891v1
Overview
The paper “LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next‑Generation Video Recommendation” tackles a practical roadblock: turning the powerful, knowledge‑rich reasoning of Video Large Language Models (VLLMs) into a fast, scalable engine for video recommendation. By extracting a new “LinkedOut” representation directly from raw video frames, the authors bridge the gap between world‑knowledge awareness and the low‑latency, multi‑video demands of real‑world recommender systems.
Key Contributions
- LinkedOut representation: a token‑level, knowledge‑aware embedding extracted from raw frames, preserving fine‑grained visual detail while injecting VLLM world knowledge.
- Prompt‑driven token extraction: uses lightweight, query‑style prompts (and optional auxiliary signals) to pull out semantically relevant tokens without full language generation.
- Cross‑layer Knowledge Fusion MoE: a mixture‑of‑experts module that dynamically selects the most appropriate abstraction level from the VLLM’s deep feature hierarchy for each recommendation query.
- Multi‑video history support: the architecture natively ingests a sequence of user‑watched videos, enabling personalized recommendations with a single forward pass.
- State‑of‑the‑art results on standard video recommendation benchmarks, surpassing prior VLLM‑based and classic baselines while cutting inference latency by roughly an order of magnitude.
- Interpretability analysis: demonstrates that the fused layer tokens can be traced back to concrete visual concepts, offering transparent reasoning for recommendations.
Methodology
- Frame‑level tokenization – Raw video frames are fed into a pretrained VLLM (e.g., a Flamingo‑style backbone). Instead of generating full sentences, the model is prompted with short queries such as “object present?”, “scene mood?”, or “action type?”. The VLLM returns a set of knowledge‑aware tokens (vector embeddings) that capture both visual cues and the model’s world‑knowledge priors.
- Layer‑wise feature harvesting – VLLMs produce hierarchical features across many transformer layers: early layers encode low‑level textures, while deeper layers capture high‑level semantics and external knowledge. The authors expose all of these layers to downstream processing (see the extraction sketch after this list).
- Cross‑layer Fusion MoE – A lightweight Mixture‑of‑Experts network learns, for each token, which layer’s representation is most useful for the current recommendation context (e.g., user profile, watch history). The MoE gates are trained end to end, letting the system balance detail against abstraction automatically (a gating sketch follows below).
- Multi‑video aggregation – Tokens from a user’s recent video history are concatenated and passed through a simple transformer encoder that models temporal dependencies. The final pooled representation is fed into a ranking head that scores candidate videos (an aggregation sketch follows below).
- Training – The whole pipeline is fine‑tuned on public video recommendation datasets (e.g., a MovieLens‑20M video split, YouTube‑8M). The loss combines pairwise ranking (BPR) with a knowledge‑preservation regularizer that keeps the extracted tokens faithful to the original VLLM outputs (a loss sketch follows below).
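To make the first two steps concrete, here is a minimal PyTorch sketch of prompt‑driven token extraction with layer‑wise harvesting. Everything named here (`VLLMBackbone`, the 32‑tokens‑per‑frame budget, all dimensions) is a stand‑in, since the paper does not ship code; a real HuggingFace‑style backbone would expose the same per‑layer activations via `output_hidden_states=True`.

```python
# Minimal sketch (not the authors' code): prompt-driven extraction of
# knowledge-aware tokens from every layer of a frozen video LLM backbone.
import torch
import torch.nn as nn

class VLLMBackbone(nn.Module):
    """Stand-in for a frozen video LLM that returns hidden states per layer."""
    def __init__(self, num_layers=24, dim=1024):
        super().__init__()
        self.num_layers, self.dim = num_layers, dim

    @torch.no_grad()
    def forward(self, frames, prompt_ids):
        # frames: (B, T, C, H, W) raw clips; prompt_ids: (B, P) tokenized query
        B, P = prompt_ids.shape
        seq_len = frames.shape[1] * 32 + P   # assumed visual tokens + prompt tokens
        # One hidden-state tensor per transformer layer: (B, seq_len, dim)
        return [torch.randn(B, seq_len, self.dim) for _ in range(self.num_layers)]

def harvest_prompt_tokens(hidden_states, prompt_len):
    """Keep only the activations at the prompt positions, for every layer.

    Returns (L, B, P, D): one knowledge-aware token set per layer,
    obtained without any autoregressive decoding.
    """
    return torch.stack([h[:, -prompt_len:, :] for h in hidden_states], dim=0)

backbone = VLLMBackbone()
frames = torch.randn(2, 8, 3, 224, 224)        # 2 videos, 8 frames each
prompt_ids = torch.randint(0, 32000, (2, 4))   # e.g. tokens of "scene mood?"
layer_tokens = harvest_prompt_tokens(backbone(frames, prompt_ids), prompt_len=4)
print(layer_tokens.shape)                      # torch.Size([24, 2, 4, 1024])
```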
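Next, a hedged sketch of how the Cross‑layer Fusion MoE could gate over those per‑layer tokens. The top‑k routing, the linear experts, and the use of a user/query context vector as the gate input are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch (not the paper's module): a context-conditioned gate
# selects a sparse top-k mixture over the L layer-wise token sets, so each
# query can lean on low-level or high-level abstraction as needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusionMoE(nn.Module):
    def __init__(self, num_layers, dim, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_layers)   # one routing score per layer
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, layer_tokens, context):
        # layer_tokens: (L, B, P, D) from the backbone; context: (B, D) user/query state
        L, B, P, D = layer_tokens.shape
        top_vals, top_idx = self.gate(context).topk(self.top_k, dim=-1)   # (B, K)
        weights = F.softmax(top_vals, dim=-1)
        fused = torch.zeros_like(layer_tokens[0])                         # (B, P, D)
        for k in range(self.top_k):
            idx = top_idx[:, k]                               # chosen layer per example
            chosen = layer_tokens[idx, torch.arange(B)]       # (B, P, D)
            expert_out = torch.stack(
                [self.experts[int(idx[b])](chosen[b]) for b in range(B)])
            fused = fused + weights[:, k, None, None] * expert_out
        return fused

moe = CrossLayerFusionMoE(num_layers=24, dim=1024)
fused = moe(torch.randn(24, 2, 4, 1024), torch.randn(2, 1024))
print(fused.shape)   # torch.Size([2, 4, 1024])
```

Because routing happens per example, the gate decisions can be logged and surfaced, which is what the interpretability analysis in the Results section inspects.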
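For the multi‑video history, a small Transformer encoder over one fused vector per watched video, followed by dot‑product scoring against candidate embeddings, is one plausible realization of the aggregation step and ranking head; the mean pooling and scoring choices below are assumptions, not details confirmed by the paper.

```python
# Assumed shapes/names: pooled tokens from each video in the user's history
# are encoded with a small Transformer, mean-pooled into a user vector, and
# dot-producted against candidate-video embeddings for ranking.
import torch
import torch.nn as nn

class HistoryRanker(nn.Module):
    def __init__(self, dim=1024, nhead=8, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.score = nn.Linear(dim, dim)   # simple bilinear-style ranking head

    def forward(self, history_tokens, candidate_emb):
        # history_tokens: (B, H, D) one fused vector per watched video
        # candidate_emb:  (B, N, D) embeddings of N candidate videos
        user = self.encoder(history_tokens).mean(dim=1)   # (B, D) user representation
        return torch.einsum('bd,bnd->bn', self.score(user), candidate_emb)  # (B, N)

ranker = HistoryRanker()
scores = ranker(torch.randn(2, 5, 1024), torch.randn(2, 100, 1024))
print(scores.shape)   # torch.Size([2, 100]) ranking scores per user
```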
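Finally, a sketch of the training objective described in the last step: pairwise BPR ranking plus a knowledge‑preservation term that anchors the fused tokens to the frozen VLLM outputs. The MSE form of the regularizer and the weight `lam` are illustrative choices, not taken from the paper.

```python
# Sketch of the combined loss: BPR on positive/negative scores plus a
# regularizer that keeps fused tokens close to the backbone's own tokens.
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    # Encourage every positive item to outscore its sampled negative.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

def knowledge_preservation(fused_tokens, vllm_tokens):
    # Penalize drift of the fused representation away from the frozen backbone.
    return F.mse_loss(fused_tokens, vllm_tokens.detach())

def total_loss(pos_scores, neg_scores, fused_tokens, vllm_tokens, lam=0.1):
    return bpr_loss(pos_scores, neg_scores) + lam * knowledge_preservation(
        fused_tokens, vllm_tokens)

loss = total_loss(torch.randn(32), torch.randn(32),
                  torch.randn(32, 4, 1024), torch.randn(32, 4, 1024))
print(float(loss))
```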
Results & Findings
| Dataset / Setting | Result | Δ vs. Best Prior |
|---|---|---|
| MovieLens‑20M (video) | HR@10 = 0.742 | +4.3 % |
| YouTube‑8M (rec) | HR@10 = 0.618 | +3.9 % |
| Retrieval latency (per user) | ≈ 45 ms | −70 % vs. decode‑only VLLM |
- Performance boost stems mainly from the layer‑wise fusion: ablating the MoE drops HR@10 by ~2 pp, confirming that different recommendation scenarios rely on different abstraction levels.
- Latency reduction: By avoiding full language generation and using a fixed‑size token set, inference is ~10× faster than a decode‑only VLLM baseline.
- Interpretability: Visualizing top‑gated layers shows that “scene‑mood” queries lean on deeper layers (world‑knowledge), while “object‑presence” queries rely on early visual layers, matching human intuition.
Practical Implications
- Deployable recommender services – Companies can plug LinkedOut into existing video pipelines without redesigning their data collection (no need for hand‑crafted tags or metadata).
- Low‑cost inference – The token extraction step runs on a single GPU with sub‑50 ms latency, making it viable for real‑time personalization on edge servers or cloud functions.
- Cross‑modal extensibility – Because the representation is token‑based, it can be combined with audio embeddings, textual subtitles, or user interaction logs without retraining the whole VLLM.
- Explainable recommendations – The MoE gating decisions can be surfaced to developers or end‑users, helping debug bias or compliance issues (e.g., why a certain genre is being promoted).
- Future‑proofing – As newer, larger VLLMs become available, LinkedOut can simply swap in the upgraded backbone, preserving the same downstream architecture.
Limitations & Future Work
- Dependence on pretrained VLLM quality – If the underlying VLLM lacks coverage for niche domains (e.g., specialized sports), the extracted tokens may miss critical cues.
- Prompt design overhead – While the paper uses a fixed set of prompts, scaling to new recommendation contexts may require manual prompt engineering or an automated prompt‑search module.
- Memory footprint for long histories – The number of stored tokens grows linearly with the length of the watch history; the authors suggest hierarchical pooling as a next step.
Future Directions
- Learning prompts jointly with the MoE.
- Extending the framework to multimodal live‑stream recommendation.
- Exploring distillation techniques to further shrink the VLLM backbone for edge deployment.
Authors
- Haichao Zhang
- Yao Lu
- Lichen Wang
- Yunzhe Li
- Daiwei Chen
- Yunpeng Xu
- Yun Fu
Paper Information
- arXiv ID: 2512.16891v1
- Categories: cs.CV, cs.AI, cs.IR, cs.LG, cs.MM
- Published: December 18, 2025