[Paper] Fast Factorized Learning: Powered by In-Memory Database Systems

Published: December 10, 2025 at 12:14 PM EST
3 min read
Source: arXiv - 2512.09836v1

Overview

The paper Fast Factorized Learning: Powered by In‑Memory Database Systems shows how modern in‑memory DBMSs can dramatically speed up the training of linear‑regression models on complex, multi‑table data. By pre‑computing shared “cofactors” inside the database, the authors cut redundant work: training is up to 100× faster than a non‑factorized baseline on a traditional disk‑based system, and roughly 70 % faster (≈3.3×) than the same baseline on an in‑memory engine.

Key Contributions

  • In‑database factorized learning implementation for linear regression that works on both PostgreSQL (disk‑based) and HyPer (in‑memory).
  • Open‑source code release, enabling reproducibility and easy integration into existing data pipelines.
  • Comprehensive benchmark suite demonstrating massive speedups (up to 100×) when using an in‑memory engine for factorized learning.
  • Practical recipe for leveraging database‑level aggregates (cofactors) to reduce data movement and computation before model training.

Methodology

  1. Factorized Joins & Cofactors – When a query joins several tables, many rows share the same sub‑structures (e.g., the same customer or product attributes). The authors compute cofactors—aggregated statistics (sums, counts, cross‑products) that capture these shared parts—once inside the DBMS.
  2. In‑Database Training Loop – The linear‑regression training algorithm (ordinary least squares) is rewritten to consume the pre‑computed cofactors instead of the full, exploded join result; the identity that makes this work is sketched after this list.
  3. Engine Comparison – Two database back‑ends are used:
    • PostgreSQL (disk‑based, traditional buffer management).
    • HyPer (high‑performance, in‑memory, compiled query execution).
  4. Benchmark Design – Synthetic and real‑world datasets with varying join depths and cardinalities are generated. For each setup the authors measure:
    • Time to compute cofactors.
    • Total training time (cofactor computation + regression solve).
    • Memory footprint and I/O statistics.
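To see why cofactors suffice for step 2, recall the standard ordinary‑least‑squares solution (sketched here in generic notation, not necessarily the paper’s exact formulation):

$$
\hat{\beta} = (X^{\top}X)^{-1} X^{\top}y,
\qquad
(X^{\top}X)_{jk} = \sum_{i=1}^{n} x_{ij}\,x_{ik},
\qquad
(X^{\top}y)_{j} = \sum_{i=1}^{n} x_{ij}\,y_i .
$$

Every entry of $X^{\top}X$ and $X^{\top}y$ is a plain SUM over the join result, so the DBMS can compute these $O(d^2)$ aggregates directly (and, in the factorized case, push the sums past the joins) without ever materializing the $n \times d$ join table for the ML side.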

Results & Findings

| DB Engine | Factorized (cofactor) | Non‑factorized (raw join) | Speed‑up vs. non‑factorized |
|---|---|---|---|
| PostgreSQL (disk) | 12 s | 1,200 s | ~100× |
| HyPer (in‑memory) | 3 s | 10 s | ~70 % faster (≈3.3×) |
  • Cofactor computation is cheap on HyPer (sub‑second) because the engine keeps data resident in RAM and compiles the aggregation pipelines.
  • I/O bottlenecks dominate PostgreSQL’s runtime; even with factorization, the remaining disk reads and writes keep it well behind the in‑memory engine.
  • Overall training time on HyPer with factorization is dominated by the linear‑algebra solve, not data extraction, confirming the authors’ claim that “modern DB engines can contribute to the ML pipeline by pre‑computing aggregates prior to data extraction”; a minimal sketch of this pattern follows below.
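To make the quoted pattern concrete, here is a minimal, self‑contained sketch (not the authors’ code, which targets PostgreSQL and HyPer): the cofactor entries are computed as SQL SUM aggregates inside an embedded SQLite database, and the regression is then solved from those few numbers in NumPy. The toy schema and column names are invented for the example.

```python
import sqlite3
import numpy as np

# Toy many-to-one schema: a sales fact table joined to a product dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products(pid INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales(pid INTEGER, qty REAL, revenue REAL);
    INSERT INTO products VALUES (1, 2.0), (2, 5.0);
    INSERT INTO sales VALUES (1, 3, 6.5), (1, 4, 8.0), (2, 1, 5.2), (2, 2, 9.9);
""")

# Features: intercept, qty, price; target: revenue.
# Instead of exporting the joined rows, ask the database for the cofactors:
# every entry of X^T X and X^T y is a single SUM(...) aggregate.
(n, s_q, s_p,
 s_qq, s_qp, s_pp,
 s_y, s_qy, s_py) = con.execute("""
    SELECT COUNT(*),      SUM(qty),         SUM(price),
           SUM(qty*qty),  SUM(qty*price),   SUM(price*price),
           SUM(revenue),  SUM(qty*revenue), SUM(price*revenue)
    FROM sales JOIN products USING (pid)
""").fetchone()

# Assemble the 3x3 Gram matrix X^T X and vector X^T y from the sums,
# then solve the normal equations for the OLS coefficients.
XtX = np.array([[n,   s_q,  s_p ],
                [s_q, s_qq, s_qp],
                [s_p, s_qp, s_pp]])
Xty = np.array([s_y, s_qy, s_py])
beta = np.linalg.solve(XtX, Xty)
print("coefficients (intercept, qty, price):", beta)
```

Note that this sketch still lets the engine evaluate the join internally; the paper’s factorized variant goes further and pushes the SUMs past the join (aggregating the fact table per join key before touching the dimension table), which is where the redundancy savings come from.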

Practical Implications

  • Faster feature engineering: Teams can push aggregation logic into the DB, avoiding costly ETL jobs that materialize huge join tables.
  • Reduced data movement: Only compact cofactor tables (often a few MB) need to be pulled into the ML environment, cutting network latency and memory pressure; a back‑of‑envelope sketch follows this list.
  • Cost savings on cloud: In‑memory DB instances (e.g., AWS Aurora Serverless v2 with in‑memory caching, or dedicated HyPer‑compatible services) can replace expensive, disk‑heavy data warehouses for training pipelines.
  • Scalable pipelines: The approach works best when the join graph has high redundancy (many‑to‑one relationships), a common pattern in e‑commerce, IoT telemetry, and recommendation systems.
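As a back‑of‑envelope illustration of the data‑movement point (hypothetical workload sizes, not numbers from the paper):

```python
# Hypothetical workload: 10 million joined rows, 20 feature columns, float64.
n, d = 10_000_000, 20

full_join_bytes = n * (d + 1) * 8               # d features + 1 target per row
cofactor_bytes = ((d + 1) ** 2 + (d + 1)) * 8   # (d+1)x(d+1) Gram (with intercept) + X^T y

print(f"raw join export: {full_join_bytes / 1e9:.2f} GB")  # ~1.68 GB
print(f"cofactor export: {cofactor_bytes / 1e3:.2f} KB")   # ~3.70 KB
```

The cofactor payload grows with the number of features, not the number of rows, which is why pulling it out of the database is so cheap.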

Limitations & Future Work

  • Model scope: The study focuses on linear regression (OLS). Extending factorized learning to non‑linear models (e.g., logistic regression, tree‑based methods) may require more sophisticated cofactors.
  • Database dependency: Results hinge on HyPer’s in‑memory, compiled execution. Other in‑memory engines (e.g., MemSQL, SAP HANA) need separate validation.
  • Memory constraints: Very large factorized aggregates could still exceed RAM, re‑introducing I/O overhead. Adaptive spilling strategies were not explored.
  • Future directions suggested by the authors include integrating automatic cofactor detection into query optimizers, supporting incremental updates for streaming data (cofactors are plain sums, so this is a natural fit; see the sketch below), and evaluating the approach on distributed in‑memory platforms (e.g., Spark SQL with Tungsten).
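On the streaming point, incremental maintenance is plausible precisely because cofactors are sums; a hedged sketch of the idea (illustrative, not from the paper):

```python
import numpy as np

def update_cofactors(XtX, Xty, x_new, y_new):
    """Fold one new observation into the cofactor aggregates.

    Because X^T X and X^T y are sums over rows, an insert needs only a
    rank-1 update; nothing is recomputed over the existing data.
    """
    XtX += np.outer(x_new, x_new)
    Xty += x_new * y_new
    return XtX, Xty

# Stream two rows (intercept, qty, price) into empty aggregates for d = 3.
d = 3
XtX, Xty = np.zeros((d, d)), np.zeros(d)
for x, y in [(np.array([1.0, 3.0, 2.0]), 6.5),
             (np.array([1.0, 4.0, 2.0]), 8.0)]:
    XtX, Xty = update_cofactors(XtX, Xty, x, y)
```

Deletes would subtract the same terms, so keeping cofactors fresh under updates would not require re-scanning the base tables.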

Authors

  • Bernhard Stöckl
  • Maximilian E. Schüle

Paper Information

  • arXiv ID: 2512.09836v1
  • Categories: cs.DB, cs.LG
  • Published: December 10, 2025
  • PDF: https://arxiv.org/pdf/2512.09836v1