[Paper] Towards Observation Lakehouses: Living, Interactive Archives of Software Behavior

Published: December 2, 2025 at 09:12 AM EST
4 min read
Source: arXiv - 2512.02795v1

Overview

The paper proposes Observation Lakehouses, a new way to store and query massive streams of runtime observations (stimuli, responses, and execution context) from software tests and CI pipelines. By treating these observations as a continuously growing, append‑only table, developers can instantly materialize rich behavioral views such as stimulus‑response matrices (SRMs) and stimulus‑response cubes (SRCs) without re‑running code, opening the door to fast, data‑driven debugging, version comparison, and model training.

Key Contributions

  • Continual SRC storage model: Defines a “tall” observation table that logs every stimulus‑response tuple together with its context, enabling on‑demand reconstruction of higher‑level behavior matrices and cubes (a minimal schema sketch follows this list).
  • Lakehouse implementation: Combines Apache Parquet, Apache Iceberg, and DuckDB to provide ACID‑compliant, append‑only storage with fast SQL‑based slicing.
  • Integration pipelines: Shows how to feed observations from controlled experiments (LASSO) and real CI runs (unit tests) into the lakehouse automatically.
  • Scalable analytics on a laptop: Demonstrates that 8.6 M observation rows (stored in roughly 51 MiB) can be sliced and queried in under 100 ms, showing that large‑scale behavior mining does not require a distributed cluster.
  • Open‑source release: Publishes the full lakehouse code and a benchmark dataset (509 problems) on GitHub for the community to adopt and extend.
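
To make the “tall” observation table concrete, here is a minimal sketch in Python with DuckDB. The column names (problem_id, impl_id, stimulus_id, and so on) are illustrative assumptions for this post, not the paper's exact schema or code.

```python
# Minimal sketch (assumed schema, not the paper's): a "tall" observation table
# in DuckDB, where every executed stimulus appends one row with its response
# and execution context.
import duckdb

con = duckdb.connect("observations.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS observations (
        problem_id   VARCHAR,   -- benchmark problem / system under test
        impl_id      VARCHAR,   -- implementation or code-variant identifier
        version      VARCHAR,   -- code version / commit under observation
        stimulus_id  VARCHAR,   -- identifier of the test input (stimulus)
        stimulus     VARCHAR,   -- serialized input
        response     VARCHAR,   -- serialized output or exception
        context      VARCHAR,   -- execution context (test runner, adapter, ...)
        observed_at  TIMESTAMP  -- when the observation was recorded
    )
""")

# Append-only ingestion: new test executions only ever add rows,
# so the complete behavioral history is preserved.
con.execute(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?, ?, now())",
    ["p042", "implA", "v1.3", "s7", "[2, 1, 3]", "[1, 2, 3]", "junit5"],
)
```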

Methodology

  1. Data model – Each observation is a record (stimulus, response, context, version, timestamp). The table grows only by appending new rows, preserving a complete history.
  2. Storage stack
    • Parquet provides columnar, compressed files for efficient I/O.
    • Iceberg adds schema evolution, partitioning, and snapshot isolation, turning the raw files into a true lakehouse.
    • DuckDB runs fast, in‑process SQL queries directly on the Parquet files, enabling instant materialization of SRMs (2‑D matrices) and SRCs (3‑D cubes).
  3. Ingestion pipelines
    • LASSO (a controlled stimulus generator) produces systematic test inputs and captures responses.
    • CI integration hooks into existing unit‑test runners, automatically logging each test case execution as an observation.
  4. Analytics – Using simple SQL, the authors slice the observation table by version, problem, or stimulus subset to reconstruct SRMs/SRCs, then apply clustering or consensus‑oracle algorithms on the resulting views (a query sketch follows this list).
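
As a hedged sketch of step 4, the query below materializes an SRM on demand from the illustrative schema introduced earlier: slice the tall table by problem and version, then pivot stimuli against implementations. This is an assumed workflow for illustration, not the authors' code.

```python
# Materialize a 2-D SRM from the tall observation table (assumed schema):
# rows = stimuli, columns = implementations, cells = observed responses.
# Keeping the version axis as well would yield a 3-D SRC.
import duckdb

con = duckdb.connect("observations.duckdb")
rows = con.execute("""
    SELECT stimulus_id, impl_id, response
    FROM observations
    WHERE problem_id = ? AND version = ?
""", ["p042", "v1.3"]).df()

srm = rows.pivot_table(index="stimulus_id", columns="impl_id",
                       values="response", aggfunc="first")
print(srm)
```

Because the observations are already stored, this slicing happens purely in SQL over Parquet files; no test is re-executed to rebuild the view.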

Results & Findings

  • Data volume: 8.6 M observation rows across 509 benchmark problems, stored in < 51 MiB (high compression).
  • Query latency: Rebuilding any SRM or SRC slice and running clustering took < 100 ms on a standard laptop (no GPU, no cluster).
  • Behavioral insights: The lakehouse enabled n‑version comparison (detecting regressions across implementations) and automatic clustering of similar behavior patterns without re‑executing tests.
  • Practicality: The end‑to‑end pipeline—from CI test run to queryable behavior view—operated entirely locally, showing that “continual behavior mining” is feasible for typical development teams.

Practical Implications

  • Debugging & regression detection – Developers can query historic behavior across versions instantly, spotting subtle functional drifts that static diff tools miss (see the query sketch after this list).
  • LLM training data curation – By providing a ground‑truth, runtime‑validated behavior archive, the lakehouse can filter out buggy or mislabeled code before feeding it to code‑generating models.
  • Continuous integration analytics – CI systems can surface behavioral metrics (e.g., consensus oracle failures) as first‑class test results, enabling smarter gating policies.
  • Behavior‑driven testing – Teams can use stimulus‑response clusters to synthesize new test cases that cover behavior regions not yet exercised by existing tests.
  • Low‑cost infrastructure – Since the approach runs efficiently on a laptop, small teams and open‑source projects can adopt it without investing in big data clusters.
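
To illustrate the regression-detection use case, here is one way such a cross-version query could look against the assumed schema above (the version labels and column names are hypothetical):

```python
# Illustrative sketch: find stimuli whose observed response changed between
# two versions of the same implementation -- a behavioral drift check run
# entirely against stored observations, without re-executing any tests.
import duckdb

con = duckdb.connect("observations.duckdb")
drift = con.execute("""
    SELECT v_old.stimulus_id,
           v_old.response AS response_before,
           v_new.response AS response_after
    FROM observations AS v_old
    JOIN observations AS v_new
      ON  v_old.problem_id  = v_new.problem_id
      AND v_old.impl_id     = v_new.impl_id
      AND v_old.stimulus_id = v_new.stimulus_id
    WHERE v_old.version = 'v1.3'          -- example: baseline version
      AND v_new.version = 'v1.4'          -- example: candidate version
      AND v_old.response IS DISTINCT FROM v_new.response
""").fetchall()

for stimulus_id, before, after in drift:
    print(f"{stimulus_id}: {before!r} -> {after!r}")
```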

Limitations & Future Work

  • Scalability ceiling – While the prototype handles millions of rows, the authors note that extremely large fleets (billions of observations) may still benefit from distributed query engines.
  • Context richness – Current context fields are limited to version and timestamp; richer provenance (e.g., hardware, OS, library versions) would improve cross‑environment analyses.
  • Automated oracle generation – The paper demonstrates consensus oracles but leaves the design of fully automated correctness oracles as future work.
  • Security & privacy – Storing raw execution data may expose proprietary code or data; future extensions should explore encryption and access‑control mechanisms.

The Observation Lakehouse opens a practical path toward treating software behavior as a first‑class data asset, turning everyday test runs into a searchable, queryable knowledge base for developers, researchers, and AI model builders alike.

Authors

  • Marcus Kessel

Paper Information

  • arXiv ID: 2512.02795v1
  • Categories: cs.SE
  • Published: December 2, 2025