[Paper] Towards Observation Lakehouses: Living, Interactive Archives of Software Behavior

Published: December 2, 2025 at 09:12 AM EST
4 min read
Source: arXiv - 2512.02795v1

Overview

The paper proposes Observation Lakehouses, a new way to store and query massive streams of runtime observations (stimuli, responses, and execution context) from software tests and CI pipelines. By treating these observations as a continuously growing, append‑only table, developers can instantly materialize rich behavioral views such as stimulus‑response matrices (SRMs) and stimulus‑response cubes (SRCs) without re‑running code, opening the door to fast, data‑driven debugging, version comparison, and model training.

Key Contributions

  • Continual SRC storage model: Defines a “tall” observation table that logs every stimulus‑response tuple together with its context, enabling on‑demand reconstruction of higher‑level behavior matrices and cubes (a minimal schema sketch follows this list).
  • Lakehouse implementation: Combines Apache Parquet, Apache Iceberg, and DuckDB to provide ACID‑compliant, append‑only storage with fast SQL‑based slicing.
  • Integration pipelines: Shows how to feed observations from controlled experiments (LASSO) and real CI runs (unit tests) into the lakehouse automatically.
  • Scalable analytics on a laptop: Demonstrates that 8.6 M observation rows (stored in roughly 51 MiB) can be sliced and queried in under 100 ms, showing that large‑scale behavior mining does not require a distributed cluster.
  • Open‑source release: Publishes the full lakehouse code and a benchmark dataset (509 problems) on GitHub for the community to adopt and extend.
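
To make the “tall” observation table concrete, here is a minimal sketch in Python with DuckDB. The column names (problem_id, impl_id, stimulus_id, and so on) are illustrative assumptions for this post, not the paper's exact schema or code.

```python
# Minimal sketch (assumed schema, not the paper's): a "tall" observation table
# in DuckDB, where every executed stimulus appends one row with its response
# and execution context.
import duckdb

con = duckdb.connect("observations.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS observations (
        problem_id   VARCHAR,   -- benchmark problem / system under test
        impl_id      VARCHAR,   -- implementation or code-variant identifier
        version      VARCHAR,   -- code version / commit under observation
        stimulus_id  VARCHAR,   -- identifier of the test input (stimulus)
        stimulus     VARCHAR,   -- serialized input
        response     VARCHAR,   -- serialized output or exception
        context      VARCHAR,   -- execution context (test runner, adapter, ...)
        observed_at  TIMESTAMP  -- when the observation was recorded
    )
""")

# Append-only ingestion: new test executions only ever add rows,
# so the complete behavioral history is preserved.
con.execute(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?, ?, now())",
    ["p042", "implA", "v1.3", "s7", "[2, 1, 3]", "[1, 2, 3]", "junit5"],
)
```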

Methodology

  1. Data model – Each observation is a record (stimulus, response, context, version, timestamp). The table grows only by appending new rows, preserving a complete history.
  2. Storage stack
    • Parquet provides columnar, compressed files for efficient I/O.
    • Iceberg adds schema evolution, partitioning, and snapshot isolation, turning the raw files into a true lakehouse.
    • DuckDB runs fast, in‑process SQL queries directly on the Parquet files, enabling instant materialization of SRMs (2‑D matrices) and SRCs (3‑D cubes).
  3. Ingestion pipelines
    • LASSO (a controlled stimulus generator) produces systematic test inputs and captures responses.
    • CI integration hooks into existing unit‑test runners, automatically logging each test case execution as an observation.
  4. Analytics – Using simple SQL, the authors slice the observation table by version, problem, or stimulus subset to reconstruct SRMs/SRCs, then apply clustering or consensus‑oracle algorithms on the resulting views (a query sketch follows this list).
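
As a hedged sketch of step 4, the query below materializes an SRM on demand from the illustrative schema introduced earlier: slice the tall table by problem and version, then pivot stimuli against implementations. This is an assumed workflow for illustration, not the authors' code.

```python
# Materialize a 2-D SRM from the tall observation table (assumed schema):
# rows = stimuli, columns = implementations, cells = observed responses.
# Keeping the version axis as well would yield a 3-D SRC.
import duckdb

con = duckdb.connect("observations.duckdb")
rows = con.execute("""
    SELECT stimulus_id, impl_id, response
    FROM observations
    WHERE problem_id = ? AND version = ?
""", ["p042", "v1.3"]).df()

srm = rows.pivot_table(index="stimulus_id", columns="impl_id",
                       values="response", aggfunc="first")
print(srm)
```

Because the observations are already stored, this slicing happens purely in SQL over Parquet files; no test is re-executed to rebuild the view.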

Results & Findings

  • Data volume: 8.6 M observation rows across 509 benchmark problems, stored in < 51 MiB (high compression).
  • Query latency: Rebuilding any SRM or SRC slice and running clustering took < 100 ms on a standard laptop (no GPU, no cluster).
  • Behavioral insights: The lakehouse enabled n‑version comparison (detecting regressions across implementations) and automatic clustering of similar behavior patterns without re‑executing tests.
  • Practicality: The end‑to‑end pipeline—from CI test run to queryable behavior view—operated entirely locally, showing that “continual behavior mining” is feasible for typical development teams.

Practical Implications

  • Debugging & regression detection – Developers can query historic behavior across versions instantly, spotting subtle functional drifts that static diff tools miss (see the query sketch after this list).
  • LLM training data curation – By providing a ground‑truth, runtime‑validated behavior archive, the lakehouse can filter out buggy or mislabeled code before feeding it to code‑generating models.
  • Continuous integration analytics – CI systems can surface behavioral metrics (e.g., consensus oracle failures) as first‑class test results, enabling smarter gating policies.
  • Behavior‑driven testing – Teams can use stimulus‑response clusters to synthesize new test cases that cover behavior regions not yet exercised by existing tests.
  • Low‑cost infrastructure – Since the approach runs efficiently on a laptop, small teams and open‑source projects can adopt it without investing in big data clusters.
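
To illustrate the regression-detection use case, here is one way such a cross-version query could look against the assumed schema above (the version labels and column names are hypothetical):

```python
# Illustrative sketch: find stimuli whose observed response changed between
# two versions of the same implementation -- a behavioral drift check run
# entirely against stored observations, without re-executing any tests.
import duckdb

con = duckdb.connect("observations.duckdb")
drift = con.execute("""
    SELECT v_old.stimulus_id,
           v_old.response AS response_before,
           v_new.response AS response_after
    FROM observations AS v_old
    JOIN observations AS v_new
      ON  v_old.problem_id  = v_new.problem_id
      AND v_old.impl_id     = v_new.impl_id
      AND v_old.stimulus_id = v_new.stimulus_id
    WHERE v_old.version = 'v1.3'          -- example: baseline version
      AND v_new.version = 'v1.4'          -- example: candidate version
      AND v_old.response IS DISTINCT FROM v_new.response
""").fetchall()

for stimulus_id, before, after in drift:
    print(f"{stimulus_id}: {before!r} -> {after!r}")
```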

Limitations & Future Work

  • Scalability ceiling – While the prototype handles millions of rows, the authors note that extremely large fleets (billions of observations) may still benefit from distributed query engines.
  • Context richness – Current context fields are limited to version and timestamp; richer provenance (e.g., hardware, OS, library versions) would improve cross‑environment analyses.
  • Automated oracle generation – The paper demonstrates consensus oracles but leaves the design of fully automated correctness oracles as future work.
  • Security & privacy – Storing raw execution data may expose proprietary code or data; future extensions should explore encryption and access‑control mechanisms.

The Observation Lakehouse opens a practical path toward treating software behavior as a first‑class data asset, turning everyday test runs into a searchable, queryable knowledge base for developers, researchers, and AI model builders alike.

Authors

  • Marcus Kessel

Paper Information

  • arXiv ID: 2512.02795v1
  • Categories: cs.SE
  • Published: December 2, 2025