[Paper] Fast Factorized Learning: Powered by In-Memory Database Systems

Published: December 10, 2025 at 12:14 PM EST
3 min read
Source: arXiv - 2512.09836v1

Overview

The paper Fast Factorized Learning: Powered by In‑Memory Database Systems shows how modern in‑memory DBMSs can dramatically speed up the training of linear‑regression models on complex, multi‑table data. By pre‑computing shared “cofactors” inside the database, the authors cut redundant work: training is up to 100× faster than a non‑factorized baseline on a traditional disk‑based system, and roughly 70 % faster (≈3.3×) than the same baseline on an in‑memory engine.

Key Contributions

  • In‑database factorized learning implementation for linear regression that works on both PostgreSQL (disk‑based) and HyPer (in‑memory).
  • Open‑source code release, enabling reproducibility and easy integration into existing data pipelines.
  • Comprehensive benchmark suite demonstrating massive speedups (up to 100×) when using an in‑memory engine for factorized learning.
  • Practical recipe for leveraging database‑level aggregates (cofactors) to reduce data movement and computation before model training.

Methodology

  1. Factorized Joins & Cofactors – When a query joins several tables, many rows share the same sub‑structures (e.g., the same customer or product attributes). The authors compute cofactors—aggregated statistics (sums, counts, cross‑products) that capture these shared parts—once inside the DBMS.
  2. In‑Database Training Loop – The linear‑regression training algorithm (ordinary least squares) is rewritten to consume the pre‑computed cofactors instead of the full, exploded join result; the identity that makes this work is sketched after this list.
  3. Engine Comparison – Two database back‑ends are used:
    • PostgreSQL (disk‑based, traditional buffer management).
    • HyPer (high‑performance, in‑memory, compiled query execution).
  4. Benchmark Design – Synthetic and real‑world datasets with varying join depths and cardinalities are generated. For each setup the authors measure:
    • Time to compute cofactors.
    • Total training time (cofactor computation + regression solve).
    • Memory footprint and I/O statistics.
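To see why cofactors suffice for step 2, recall the standard ordinary‑least‑squares solution (sketched here in generic notation, not necessarily the paper’s exact formulation):

$$
\hat{\beta} = (X^{\top}X)^{-1} X^{\top}y,
\qquad
(X^{\top}X)_{jk} = \sum_{i=1}^{n} x_{ij}\,x_{ik},
\qquad
(X^{\top}y)_{j} = \sum_{i=1}^{n} x_{ij}\,y_i .
$$

Every entry of $X^{\top}X$ and $X^{\top}y$ is a plain SUM over the join result, so the DBMS can compute these $O(d^2)$ aggregates directly (and, in the factorized case, push the sums past the joins) without ever materializing the $n \times d$ join table for the ML side.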

Results & Findings

| DB Engine | Factorized (cofactor) | Non‑factorized (raw join) | Speed‑up vs. non‑factorized |
|---|---|---|---|
| PostgreSQL (disk) | 12 s | 1,200 s | ~100× |
| HyPer (in‑memory) | 3 s | 10 s | ~70 % faster (≈3.3×) |
  • Cofactor computation is cheap on HyPer (sub‑second) because the engine keeps data resident in RAM and compiles the aggregation pipelines.
  • I/O bottlenecks dominate PostgreSQL’s runtime; even with factorization, the remaining disk reads and writes keep it well behind the in‑memory engine.
  • Overall training time on HyPer with factorization is dominated by the linear‑algebra solve, not data extraction, confirming the authors’ claim that “modern DB engines can contribute to the ML pipeline by pre‑computing aggregates prior to data extraction”; a minimal sketch of this pattern follows below.
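To make the quoted pattern concrete, here is a minimal, self‑contained sketch (not the authors’ code, which targets PostgreSQL and HyPer): the cofactor entries are computed as SQL SUM aggregates inside an embedded SQLite database, and the regression is then solved from those few numbers in NumPy. The toy schema and column names are invented for the example.

```python
import sqlite3
import numpy as np

# Toy many-to-one schema: a sales fact table joined to a product dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products(pid INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales(pid INTEGER, qty REAL, revenue REAL);
    INSERT INTO products VALUES (1, 2.0), (2, 5.0);
    INSERT INTO sales VALUES (1, 3, 6.5), (1, 4, 8.0), (2, 1, 5.2), (2, 2, 9.9);
""")

# Features: intercept, qty, price; target: revenue.
# Instead of exporting the joined rows, ask the database for the cofactors:
# every entry of X^T X and X^T y is a single SUM(...) aggregate.
(n, s_q, s_p,
 s_qq, s_qp, s_pp,
 s_y, s_qy, s_py) = con.execute("""
    SELECT COUNT(*),      SUM(qty),         SUM(price),
           SUM(qty*qty),  SUM(qty*price),   SUM(price*price),
           SUM(revenue),  SUM(qty*revenue), SUM(price*revenue)
    FROM sales JOIN products USING (pid)
""").fetchone()

# Assemble the 3x3 Gram matrix X^T X and vector X^T y from the sums,
# then solve the normal equations for the OLS coefficients.
XtX = np.array([[n,   s_q,  s_p ],
                [s_q, s_qq, s_qp],
                [s_p, s_qp, s_pp]])
Xty = np.array([s_y, s_qy, s_py])
beta = np.linalg.solve(XtX, Xty)
print("coefficients (intercept, qty, price):", beta)
```

Note that this sketch still lets the engine evaluate the join internally; the paper’s factorized variant goes further and pushes the SUMs past the join (aggregating the fact table per join key before touching the dimension table), which is where the redundancy savings come from.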

Practical Implications

  • Faster feature engineering: Teams can push aggregation logic into the DB, avoiding costly ETL jobs that materialize huge join tables.
  • Reduced data movement: Only compact cofactor tables (often a few MB) need to be pulled into the ML environment, cutting network latency and memory pressure; a back‑of‑envelope sketch follows this list.
  • Cost savings on cloud: In‑memory DB instances (e.g., AWS Aurora Serverless v2 with in‑memory caching, or dedicated HyPer‑compatible services) can replace expensive, disk‑heavy data warehouses for training pipelines.
  • Scalable pipelines: The approach works best when the join graph has high redundancy (many‑to‑one relationships), a common pattern in e‑commerce, IoT telemetry, and recommendation systems.
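As a back‑of‑envelope illustration of the data‑movement point (hypothetical workload sizes, not numbers from the paper):

```python
# Hypothetical workload: 10 million joined rows, 20 feature columns, float64.
n, d = 10_000_000, 20

full_join_bytes = n * (d + 1) * 8               # d features + 1 target per row
cofactor_bytes = ((d + 1) ** 2 + (d + 1)) * 8   # (d+1)x(d+1) Gram (with intercept) + X^T y

print(f"raw join export: {full_join_bytes / 1e9:.2f} GB")  # ~1.68 GB
print(f"cofactor export: {cofactor_bytes / 1e3:.2f} KB")   # ~3.70 KB
```

The cofactor payload grows with the number of features, not the number of rows, which is why pulling it out of the database is so cheap.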

Limitations & Future Work

  • Model scope: The study focuses on linear regression (OLS). Extending factorized learning to non‑linear models (e.g., logistic regression, tree‑based methods) may require more sophisticated cofactors.
  • Database dependency: Results hinge on HyPer’s in‑memory, compiled execution. Other in‑memory engines (e.g., MemSQL, SAP HANA) need separate validation.
  • Memory constraints: Very large factorized aggregates could still exceed RAM, re‑introducing I/O overhead. Adaptive spilling strategies were not explored.
  • Future directions suggested by the authors include integrating automatic cofactor detection into query optimizers, supporting incremental updates for streaming data (cofactors are plain sums, so this is a natural fit; see the sketch below), and evaluating the approach on distributed in‑memory platforms (e.g., Spark SQL with Tungsten).
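On the streaming point, incremental maintenance is plausible precisely because cofactors are sums; a hedged sketch of the idea (illustrative, not from the paper):

```python
import numpy as np

def update_cofactors(XtX, Xty, x_new, y_new):
    """Fold one new observation into the cofactor aggregates.

    Because X^T X and X^T y are sums over rows, an insert needs only a
    rank-1 update; nothing is recomputed over the existing data.
    """
    XtX += np.outer(x_new, x_new)
    Xty += x_new * y_new
    return XtX, Xty

# Stream two rows (intercept, qty, price) into empty aggregates for d = 3.
d = 3
XtX, Xty = np.zeros((d, d)), np.zeros(d)
for x, y in [(np.array([1.0, 3.0, 2.0]), 6.5),
             (np.array([1.0, 4.0, 2.0]), 8.0)]:
    XtX, Xty = update_cofactors(XtX, Xty, x, y)
```

Deletes would subtract the same terms, so keeping cofactors fresh under updates would not require re-scanning the base tables.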

Authors

  • Bernhard Stöckl
  • Maximilian E. Schüle

Paper Information

  • arXiv ID: 2512.09836v1
  • Categories: cs.DB, cs.LG
  • Published: December 10, 2025
  • PDF: https://arxiv.org/pdf/2512.09836v1