[Paper] Unique Lives, Shared World: Learning from Single-Life Videos

Published: December 3, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.04085v1

Overview

A new study proposes “single‑life” learning: training a vision model solely on egocentric video recorded from one person’s daily routine. By exploiting the many viewpoints naturally captured in a single individual’s life, the authors show that self‑supervised encoders learn a robust, geometry‑aware representation that transfers across environments and rivals models trained on massive, diverse web datasets.

Key Contributions

  • Single‑life paradigm – Demonstrates that a model trained on just one person's egocentric footage can acquire a general, geometry‑aware visual representation that transfers beyond that person's own environments.
  • Cross‑life alignment metric – Introduces a cross‑attention‑based measure to quantify how closely internal representations from different single‑life models align geometrically.
  • Strong transfer performance – Shows that encoders learned from a single life achieve competitive results on downstream tasks such as depth estimation, even in unseen indoor/outdoor scenes.
  • Data efficiency – Finds that ~30 h of video from one week of a single person's life matches the performance of ~30 h of heterogeneous web video, highlighting the richness of personal lifelog data.

Methodology

  1. Data collection – The authors gathered several egocentric video datasets, each capturing a different individual’s “life” over multiple days (both indoor and outdoor activities).
  2. Self‑supervised training – Using a contrastive learning framework, the model learns to predict whether two video clips are temporally adjacent, encouraging the encoder to capture the underlying 3D geometry rather than superficial appearance (a minimal sketch of such an objective appears after this list).
  3. Cross‑attention alignment – To compare models trained on different lives, they compute attention maps between the two encoders' feature spaces and derive a similarity score that reflects functional alignment of the learned geometry (an illustrative sketch also follows this list).
  4. Evaluation – Trained encoders are transferred to downstream tasks (e.g., monocular depth prediction) in completely new environments, either kept frozen behind a lightweight task head or fine‑tuned, to test generalization.
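
The adjacency‑prediction objective is described above only in words. Below is a minimal sketch of one common way such a contrastive objective can be written in PyTorch (an InfoNCE‑style loss over temporally adjacent clip pairs); the encoder, batch layout, and temperature are illustrative placeholders, not the authors' actual architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(encoder, clips, next_clips, temperature=0.07):
    """Pull together embeddings of temporally adjacent clips; other clips in
    the batch act as negatives. `encoder` is any module that maps a batch of
    clips to (B, D) embeddings (a placeholder, not the paper's model)."""
    z_a = F.normalize(encoder(clips), dim=-1)       # anchor clips
    z_b = F.normalize(encoder(next_clips), dim=-1)  # their temporal neighbours
    logits = z_a @ z_b.t() / temperature            # (B, B) pairwise similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Diagonal entries are the true adjacent pairs; off-diagonal entries are negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```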

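The cross‑attention alignment measure is likewise only summarized above. The snippet below is a rough illustrative stand‑in: it compares the token‑to‑token attention maps that two different encoders produce over the same frames, which captures the spirit of "functional alignment" without requiring their feature spaces to share a dimension. The function name, temperature, and cosine comparison are assumptions for illustration, not the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def attention_alignment(feats_a, feats_b, temperature=0.1):
    """Illustrative alignment score for two encoders run on the same frames.
    feats_a (N, Da) and feats_b (N, Db) are token/patch features at the same
    N spatial locations. Each encoder's token-to-token attention map is built
    and the two maps are compared: well-aligned encoders attend to the same
    locations even if their feature dimensions differ."""
    def attn_map(feats):
        f = F.normalize(feats, dim=-1)
        return torch.softmax(f @ f.t() / temperature, dim=-1)  # (N, N)

    map_a, map_b = attn_map(feats_a), attn_map(feats_b)
    # Cosine similarity between the flattened attention maps, in [-1, 1].
    return F.cosine_similarity(map_a.flatten(), map_b.flatten(), dim=0)
```
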
Results & Findings

  • Geometric alignment – Encoders trained on completely different lives produce highly correlated feature spaces (average cross‑attention similarity > 0.85), indicating a shared geometric understanding of the world.
  • Depth transfer – When fine‑tuned on a standard depth benchmark (NYU‑Depth V2), single‑life models perform within 2–3 % of models pre‑trained on large web video corpora (a minimal sketch of such a transfer evaluation follows this list).
  • Data parity – Training on 30 h from a single person’s week yields depth error (RMSE) comparable to training on 30 h of diverse internet video, confirming that personal lifelog data is surprisingly information‑dense.
  • Robustness across domains – The learned representations remain effective when transferred from indoor to outdoor scenes and vice‑versa, underscoring the universality of the captured geometry.
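
The depth results above are reported as RMSE on held‑out scenes. The sketch below shows a generic frozen‑encoder transfer setup of that kind (a small trainable head on a frozen backbone, RMSE over valid pixels); the head design, names, and the assumption that the encoder returns a spatial feature map are placeholders, not the paper's actual evaluation protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_rmse(pred, target, valid_mask):
    """Root-mean-square depth error over pixels with valid ground truth."""
    diff = (pred - target)[valid_mask]
    return torch.sqrt((diff ** 2).mean())

class FrozenEncoderDepthProbe(nn.Module):
    """Frozen pre-trained encoder plus a small trainable head that regresses
    per-pixel depth; only the head receives gradients during transfer."""
    def __init__(self, encoder, feat_dim, out_size):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # per-location depth
        self.out_size = out_size  # (H, W) of the ground-truth depth maps

    def forward(self, frames):
        with torch.no_grad():                    # backbone stays frozen
            feats = self.encoder(frames)         # assumed shape (B, feat_dim, h, w)
        depth = self.head(feats)                 # (B, 1, h, w)
        return F.interpolate(depth, size=self.out_size,
                             mode="bilinear", align_corners=False)
```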

Practical Implications

  • Personalized AI assistants – Developers can build vision models that adapt to a user’s own visual environment using only a week of wearable camera footage, reducing reliance on massive public datasets.
  • Privacy‑preserving training – Since the data never leaves the user’s device, single‑life learning offers a pathway to on‑device self‑supervised pre‑training for AR glasses, robotics, or smart home cameras.
  • Cost‑effective data collection – Companies can bootstrap high‑quality visual representations without expensive crowdsourced video labeling pipelines; a single participant’s lifelog suffices for many downstream tasks.
  • Domain adaptation – The strong alignment across different lives suggests that models trained on one user can be quickly fine‑tuned for another, accelerating deployment in heterogeneous settings (e.g., construction sites, warehouses).

Limitations & Future Work

  • Scope of activities – The current datasets focus on relatively routine daily activities; extreme or highly specialized tasks (e.g., surgery, sports) may require additional data diversity.
  • Temporal coverage – While 30 h proved sufficient, longer‑term variations (seasonal lighting, clothing changes) were not explored and could affect representation stability.
  • Scalability of alignment metric – The cross‑attention similarity computation is expensive for very large models; future work could devise lighter alignment diagnostics.
  • Integration with multimodal signals – Extending single‑life learning to incorporate audio, inertial, or language cues could further enrich the learned representations.

Bottom line: This paper shows that a week’s worth of personal egocentric video can teach a vision model the geometry of the world as well as massive web‑scale datasets, opening doors to personalized, privacy‑first AI that learns directly from our own daily lives.

Authors

  • Tengda Han
  • Sayna Ebrahimi
  • Dilara Gokay
  • Li Yang Ku
  • Maks Ovsjanikov
  • Iva Babukova
  • Daniel Zoran
  • Viorica Patraucean
  • Joao Carreira
  • Andrew Zisserman
  • Dima Damen

Paper Information

  • arXiv ID: 2512.04085v1
  • Categories: cs.CV
  • Published: December 3, 2025
  • PDF: https://arxiv.org/pdf/2512.04085v1